Database Quality / Data Cleaning / Data Mining


Database - a collection of interrelated tables / datasets / files.

Database Integrity - generally refers to the technical / relational condition of a database. (Codd).

Keys (Primary / secondary) - the way in which records are interlinked / referenced.

Records - a collection of data items identified by a key or keys.

Data cleaning - the process of ensuring that a database is valid in technical and / or literal terms.

Data Mining - the process of extracting meaningful information from a database / collection of related databases or data mart / warehouse.

Data Mart / Warehouse - an aggregated layer of datasets abstracted from multiple sources, e.g. operational systems, external data, management reporting systems.

Matching - the process of determining automatically the probability of one record from one database being equivalent to a similar record from another database where unique keys are missing or damaged.
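
To make the idea of matching concrete, here is a minimal sketch that scores how alike two records are when no shared key is available. The field names, the weights and the sample records are all hypothetical, and the weighted string similarity is only a crude stand-in for a properly calibrated match probability.

```python
from difflib import SequenceMatcher


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(value.lower().split())


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values; missing values score 0."""
    a, b = normalise(a), normalise(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical weights: how much each field contributes to the overall score.
WEIGHTS = {"name": 0.5, "postcode": 0.3, "phone": 0.2}


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity in [0, 1]; higher means more likely the same entity."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


# Hypothetical records from two databases that share no unique key.
crm = {"name": "Acme Trading Ltd", "postcode": "AB1 2CD", "phone": "01234 567890"}
billing = {"name": "ACME Trading Limited", "postcode": "AB1 2CD", "phone": "01234 567890"}
other = {"name": "Zenith Plumbing", "postcode": "XY9 8ZW", "phone": "09876 543210"}

print(round(match_score(crm, billing), 2))  # high score: probably the same company
print(round(match_score(crm, other), 2))    # low score: probably different companies
```

In practice a threshold would be chosen from a manually checked sample, with borderline scores referred for human review.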

RDBs - Relational Databases (most modern databases are relational).

How clean is your data / database?

Broadly speaking there are two types of data cleanliness problem:

  • Technical validity
  • Semantic validity

    Technical validity

    You can reduce the risk / impact of technical issues, but you cannot escape the Quantum Reality that if it can happen, it will happen. Consider the following:

  • Database systems do crash
  • Databases do sometimes physically delete records
  • All database systems contain bugs
  • Servers will crash several times a year
  • Disk drives will self-corrupt
  • Users will enter erroneous data
  • Administrators and users will make errors
  • ...

    Early warning reporting systems should be in place to identify errors, fix them automatically where possible and, in any case, report on them over time. Once errors have been reported, suitable resources need to be available to correct them before users experience any effects.
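
    A minimal sketch of such an early warning report is shown here, assuming a hypothetical two-table schema (company and contact) held in an in-memory SQLite database. The table names, columns and the two checks are illustrative only; a real system would run its own checks on a schedule and keep the results so that trends can be reported over time.

```python
import sqlite3

# Hypothetical schema and sample data, with two deliberate faults.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE contact (id INTEGER PRIMARY KEY, company_id INTEGER, email TEXT);
    INSERT INTO company VALUES (1, 'Acme Trading Ltd');
    INSERT INTO contact VALUES (10, 1, 'a.smith@example.com');
    INSERT INTO contact VALUES (11, 2, 'b.jones@example.com');
    INSERT INTO contact VALUES (12, 1, NULL);
""")

# Each check is a query that returns the ids of offending records.
CHECKS = {
    # Contacts that point at a company record that does not exist.
    "orphaned_contacts": """
        SELECT contact.id FROM contact
        LEFT JOIN company ON company.id = contact.company_id
        WHERE company.id IS NULL
    """,
    # Contacts with a required field left empty.
    "contacts_missing_email": """
        SELECT id FROM contact WHERE email IS NULL OR TRIM(email) = ''
    """,
}


def run_checks(connection):
    """Return {check name: sorted list of offending record ids}."""
    return {
        name: sorted(row[0] for row in connection.execute(sql))
        for name, sql in CHECKS.items()
    }


report = run_checks(conn)
for name, ids in report.items():
    print(f"{name}: {len(ids)} problem record(s) {ids}")
```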

    Semantic validity

    In any one year the following WILL happen:

  • 8% of businesses will move location
  • 19% of MDs will have new positions
  • 17% of FDs will have new positions
  • 11% of HR directors will have new positions
  • 15.3% of IT directors will have new positions
  • ...

    One third of all database contacts will have become obsolete in any one year simply due to role change over time. Accept it... you can't change it... (Ref: Conduit & Identex).
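
    To see how quickly this compounds, the sketch below applies the one-third figure year after year. It assumes, purely for illustration, that the rate is constant and that no records are corrected in the meantime; on that assumption fewer than a third of contacts would still be valid after three years.

```python
# "One third ... in any one year" is taken from the text; treating it as a
# constant, compounding rate is an assumption made for this illustration.
ANNUAL_DECAY = 1 / 3


def fraction_still_valid(years: int, annual_decay: float = ANNUAL_DECAY) -> float:
    """Fraction of contacts still valid after the given number of years."""
    return (1 - annual_decay) ** years


for years in range(1, 6):
    print(f"after {years} year(s): {fraction_still_valid(years):.0%} still valid")
```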

    Add to this phone number changes / address changes / name changes / deaths / new job titles / company closures / mergers etc. and it soon becomes clear that time itself is a database's worst enemy. Many try to get around this problem by relying on mail returns. This method is, however, greatly flawed: the vast majority of unwanted mail goes straight into the waste paper bin. Do you have the time to send back incorrectly addressed mail? No?

    The way forwards

    If we accept that our databases are less than perfect what can be done about it without excessive overheads?

  • Develop a project plan and costed ongoing strategy
  • Model what you have versus what you need
  • Identify the shortfall
  • Ensure that your systems are technically sound
  • Put in place effective means to identify / correct technical errors
  • Using the 80:20 rule identify your "inner" database
  • Validate and enrich the "inner" database (with an emphasis on metadata)
  • Keep on correcting / validating / enriching (bulk or targeted)
  • Report on record age / track change over time
  • Win the support of your user base to assist the cleaning process
  • Educate the user base and pre-validate where possible
  • When reasonably clean, implement Data Mining to reduce operating costs
  • ...
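
    As one illustration of the "report on record age" step above, the following sketch groups records by how long ago they were last verified. The record layout, the sample dates and the age bands are hypothetical, and the reporting date is fixed so that the result is repeatable.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (record id, date the record was last verified).
records = [
    (1, date(2024, 5, 1)),
    (2, date(2023, 9, 15)),
    (3, date(2022, 1, 10)),
    (4, date(2020, 6, 30)),
    (5, date(2024, 1, 20)),
]


def age_band(last_verified: date, today: date) -> str:
    """Place a record in an age band based on days since last verification."""
    age_days = (today - last_verified).days
    if age_days <= 365:
        return "under 1 year"
    if age_days <= 3 * 365:
        return "1 to 3 years"
    return "over 3 years"


def age_report(rows, today: date) -> Counter:
    """Count records in each age band."""
    return Counter(age_band(last_verified, today) for _, last_verified in rows)


# Fixed reporting date; a live report would use date.today() instead.
report = age_report(records, today=date(2024, 7, 15))
for band in ("under 1 year", "1 to 3 years", "over 3 years"):
    print(f"{band}: {report[band]} record(s)")
```

    Running the same report at regular intervals and comparing the counts is one simple way to track change over time.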

    It is worth noting that data cleaning is for life - not just prior to marketing campaigns! Database quality is also a cornerstone of Customer Relationship Management.

    Database Quality experts need to be able to perceive reality from at least 4 perspectives:

  • Business view (predicting the needs of the business)
  • Technical view (low level understanding of RDBs & data extraction)
  • User view (working with users to ensure that systems are usable)
  • Developer's view (helping developers to provide workable solutions)

    A sustainable level of database quality in excess of 80% is possible, where quality means the right person at the right company, measured on random samples of 2000 records.
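
    A sample-based figure like this carries some uncertainty, which the sketch below estimates using the standard normal approximation for a proportion. The audit result used (1640 correct records out of 2000) is hypothetical and is there only to show the arithmetic.

```python
import math


def quality_estimate(valid_count: int, sample_size: int, z: float = 1.96):
    """Return (proportion valid, margin of error) at roughly 95% confidence.

    Uses the normal approximation, which is reasonable for a random sample
    of this size when the proportion is not close to 0 or 1.
    """
    p = valid_count / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin


# Hypothetical audit: 1640 of 2000 randomly sampled records found correct.
p, margin = quality_estimate(1640, 2000)
print(f"estimated quality: {p:.1%} +/- {margin:.1%}")
```

    With a sample of 2000 the margin of error is under two percentage points, so a measured 82% would comfortably support a claim of quality above 80%.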

    Glossary: data quality, database management, data mining, data analysis

    Last Updated 15 July 2024 © 1998-2023