Data cleansing

Data cleansing (also called data cleaning or data editing) comprises various methods for removing and correcting data errors in databases or other information systems. Such errors can consist, for example, of incorrect (originally wrong or outdated), redundant, inconsistent or incorrectly formatted data.

Essential steps in data cleansing are duplicate detection (recognizing and merging identical data records) and data fusion (merging and completing incomplete data).

Data cleansing contributes to improving the quality of information. However, information quality also depends on many other properties of data sources (credibility, relevance, availability, cost, ...) that cannot be improved by data cleansing.

Data cleansing process

The process of cleansing the data is divided into five successive steps (Apel, 2009, p. 157); a minimal sketch of such a pipeline follows the list:

  1. Create a backup copy of the file or table
  2. Define the data quality requirements for the data
  3. Analyze the data
  4. Standardize the data
  5. Cleanse the data
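As an illustration, these five steps might be chained as in the following Python sketch; the use of pandas, the column names and the concrete checks are assumptions made for this example, not part of the source:

    import shutil
    import pandas as pd

    def cleanse_file(path: str) -> pd.DataFrame:
        # 1. Create a backup copy of the file
        shutil.copy(path, path + ".bak")
        # 2. Define the data quality requirements (here: mandatory columns)
        required_columns = ["name", "birth_date"]
        data = pd.read_csv(path)
        # 3. Analyze the data against the requirements
        missing = [c for c in required_columns if c not in data.columns]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        # 4. Standardize the data (uniform date format)
        data["birth_date"] = pd.to_datetime(data["birth_date"], dayfirst=True)
        # 5. Cleanse the data (here: remove duplicate records)
        return data.drop_duplicates()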

Data quality requirements

High-quality, reliable data must meet certain requirements, for example (a simple validation sketch follows the list):

  • Valid data: same data type, defined maximum values, etc.
  • Complete data
  • Uniform data: same unit (e.g. currency, weight, length)
  • Data integrity: data must be protected against intentional and/or unintentional manipulation.
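A minimal Python sketch of such a check; pandas and the column names shown are assumptions for this example:

    import pandas as pd

    def check_requirements(df: pd.DataFrame) -> dict:
        """Check a hypothetical customer table against simple quality requirements."""
        return {
            # valid data: numeric type and a plausible maximum value
            "age_valid": pd.api.types.is_numeric_dtype(df["age"]) and df["age"].le(120).all(),
            # complete data: no missing values in mandatory columns
            "complete": df[["surname", "birth_date"]].notna().all().all(),
            # uniform data: a single currency unit for all amounts
            "uniform_currency": df["currency"].nunique() == 1,
        }

    df = pd.DataFrame({
        "surname": ["Meier", "Schmidt"],
        "birth_date": ["01.09.2009", None],
        "age": [34, 41],
        "currency": ["EUR", "EUR"],
    })
    # "complete" comes back false here because one birth_date is missing.
    print(check_requirements(df))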

Analysis of the data

After the requirements have been clarified, the data must be checked, for example with the help of checklists, to determine whether it has the required quality.
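Such an analysis can take the form of a simple per-column profile that is compared against the checklist. The following Python sketch (column names and expected date format are assumptions) counts missing values and format violations:

    import re
    import pandas as pd

    def profile(df: pd.DataFrame) -> pd.DataFrame:
        """Per-column profile used to judge whether the data meets the requirements."""
        date_pattern = re.compile(r"^\d{2}\.\d{2}\.\d{4}$")  # expected format, e.g. 01.09.2009
        rows = []
        for col in df.columns:
            rows.append({
                "column": col,
                "missing": int(df[col].isna().sum()),
                "distinct": int(df[col].nunique(dropna=True)),
                "bad_date_format": int((~df[col].astype(str).str.match(date_pattern)).sum())
                                   if "date" in col else None,
            })
        return pd.DataFrame(rows)

    df = pd.DataFrame({"name": ["Meier", None], "birth_date": ["01.09.2009", "1.9.09"]})
    print(profile(df))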

Standardize data before cleansing

For a successful cleansing, the data must first be standardized. To this end, the data is first structured and then normalized.

Structuring brings the data into a uniform format: for example, a date is converted into a uniform date format (01.09.2009), or composite data is broken down into its components, e.g. a customer's name into the components salutation, title, first name and surname. Such structuring is usually not trivial and is carried out with the help of complex parsers.
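As a rough illustration, such structuring could look like the following Python sketch; it is a deliberately simplified parser for a handful of date formats and name patterns, not the complex parsing mentioned above:

    import re
    from datetime import datetime

    def structure_date(value: str) -> str:
        """Bring differently written dates into the uniform format DD.MM.YYYY."""
        for fmt in ("%d.%m.%Y", "%Y-%m-%d", "%d/%m/%Y"):
            try:
                return datetime.strptime(value.strip(), fmt).strftime("%d.%m.%Y")
            except ValueError:
                continue
        raise ValueError(f"unrecognized date format: {value!r}")

    def structure_name(value: str) -> dict:
        """Split a composite customer name into salutation, title, first name and surname."""
        m = re.match(r"^(Mr|Mrs|Ms)?\.?\s*(Dr|Prof)?\.?\s*(\S+)\s+(\S+)$", value.strip())
        if not m:
            raise ValueError(f"cannot parse name: {value!r}")
        salutation, title, first_name, surname = m.groups()
        return {"salutation": salutation, "title": title,
                "first_name": first_name, "surname": surname}

    print(structure_date("2009-09-01"))         # 01.09.2009
    print(structure_name("Mr. Dr. Max Meier"))  # split into the four name components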

During normalization, the existing values are mapped onto a standardized value list. Such normalization can be applied, for example, to salutations, academic titles or company-name suffixes. For instance, the legal-form suffixes "e. Kfr." and "e. Kfm." can be replaced by the normalized value "e. K.", which greatly simplifies the subsequent cleansing.
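A minimal sketch of such a normalization using lookup tables of standardized values; the entries shown are only illustrative:

    # Map raw values onto a standardized value list; unknown values are kept
    # unchanged so that they can still be inspected later.
    LEGAL_FORM_NORMALIZATION = {
        "e. Kfm.": "e. K.",
        "e. Kfr.": "e. K.",
        "e.K.": "e. K.",
    }
    SALUTATION_NORMALIZATION = {"Mr.": "Mr", "Mister": "Mr", "Mrs.": "Mrs", "Madam": "Mrs"}

    def normalize(value: str, mapping: dict) -> str:
        return mapping.get(value.strip(), value)

    print(normalize("e. Kfr.", LEGAL_FORM_NORMALIZATION))  # e. K.
    print(normalize("Mister", SALUTATION_NORMALIZATION))   # Mr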

Cleansing the data

Six methods are available for cleansing the data, which can be used individually or in combination (two of them are sketched after the list):

  • Deriving from other data: the correct values are derived from other data (e.g. the salutation from the gender).
  • Replacing with other data: the incorrect data is replaced by data from other sources (e.g. from other systems).
  • Using default values: default values are used instead of the incorrect data.
  • Removing faulty data: the faulty data is filtered out and not processed further.
  • Removing duplicates: duplicates are identified by duplicate detection, the non-redundant data from the duplicates is consolidated, and a single data record is formed from it.
  • Separating merged records: in contrast to the removal of duplicates, data that was incorrectly merged is separated again.
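Two of these methods, deriving values from other data and removing duplicates, might look roughly like this in Python (pandas and the column names are assumptions for the example):

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "gender": ["f", "f", "m"],
        "salutation": [None, None, None],
        "email": ["a@example.org", None, "b@example.org"],
    })

    # Deriving from other data: fill the salutation from the gender column.
    df["salutation"] = df["gender"].map({"f": "Mrs", "m": "Mr"})

    # Removing duplicates: records with the same customer_id are consolidated;
    # first() keeps the first non-missing value per column (a very simple data fusion).
    df = df.groupby("customer_id", as_index=False).first()
    print(df)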

Storing the incorrect data

Before cleansing the data, the original, faulty data should be saved as a copy and under no circumstances simply be deleted after the cleansing. Otherwise the adjustments would not be traceable and the process would not be audit-proof.

An alternative is to store the corrected value in an additional column. Since this requires additional storage space, the approach is only recommended when few columns of a data record need to be corrected. Another possibility is to store the corrected data in an additional row, which increases the storage requirement even more and is therefore only suitable when a small number of data records need to be corrected. The last option, for a large number of columns and rows to be corrected, is to create a separate table.
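The variant with an additional column can be sketched as follows (pandas assumed; the column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"birth_date": ["2009-09-01", "2010-3-15"]})

    # The original (possibly faulty) values stay untouched in their column; the
    # corrected values go into an additional column so every adjustment remains traceable.
    df["birth_date_cleansed"] = pd.to_datetime(df["birth_date"]).dt.strftime("%d.%m.%Y")
    print(df)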

Literature

  • Detlef Apel, Wolfgang Behme, Rüdiger Eberlein, Christian Merighi: Controlling Data Quality Successfully. Hanser Fachbuch, 2009, ISBN 978-3-446-42056-4.