Data mining

from Wikipedia, the free encyclopedia

Data mining [ˈdeɪtə ˈmaɪnɪŋ] (from English data and to mine, "to dig, to extract") is the systematic application of statistical methods to large data sets (especially "big data" or mass data) with the aim of recognizing new cross-connections and trends. Because of their size, such data sets are processed using computer-aided methods. In practice, the term data mining has been extended to the entire process of "Knowledge Discovery in Databases" (KDD), which also includes steps such as preprocessing and evaluation, while data mining in the narrower sense denotes only the actual analysis step of that process.

The term data mining (literally, roughly, "extracting data") is somewhat misleading, because it is about gaining knowledge from existing data and not about generating data itself. The concise term has nevertheless prevailed. The mere acquisition, storage and processing of large amounts of data is sometimes also labeled with the buzzword data mining. In the scientific context, it primarily describes the extraction of knowledge that is “valid (in the statistical sense), previously unknown and potentially useful” in order “to determine certain regularities, laws and hidden relationships”. Fayyad defines it as a step in the KDD process that consists of applying data analysis and discovery algorithms which, within acceptable efficiency limits, produce a particular enumeration of patterns (or models) over the data.

Differentiation from other fields

Many of the methods used in data mining actually come from statistics, in particular multivariate statistics, and are often merely adapted in their complexity for use in data mining, frequently by approximating them at the expense of accuracy. The loss of accuracy is often accompanied by a loss of statistical validity, so that the methods can sometimes even be "wrong" from a purely statistical point of view. For applications in data mining, however, the experimentally verified benefit and an acceptable running time are often more decisive than statistically proven correctness.

Machine learning is also closely related, but in data mining the focus is on finding new patterns, while in machine learning the computer is primarily meant to recognize known patterns automatically in new data. However, a simple separation is not always possible here: if, for example, association rules are extracted from the data, this is a process that corresponds to typical data mining tasks; however, the extracted rules also meet the objectives of machine learning. Conversely, unsupervised learning, a sub-area of machine learning, is very closely related to data mining. Machine learning methods are often used in data mining and vice versa.

Research in the field of database systems, especially index structures, plays a major role in data mining when it comes to reducing complexity. Typical tasks such as the search for nearest neighbors can be significantly accelerated with the help of a suitable database index, which improves the runtime of a data mining algorithm.
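As a minimal sketch of why an index helps, the following compares a linear scan with a lookup in a sorted list. The sorted list stands in for a one-dimensional index; real database indexes for multidimensional data are considerably more involved. The function names are chosen for illustration only.

```python
import bisect

def nearest_linear(values, q):
    """Nearest neighbour by scanning every value: O(n) per query."""
    return min(values, key=lambda v: abs(v - q))

def nearest_indexed(sorted_values, q):
    """Nearest neighbour using binary search on a sorted list: O(log n) per query."""
    i = bisect.bisect_left(sorted_values, q)
    # Only the values directly left and right of the insertion point can be nearest.
    candidates = sorted_values[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda v: abs(v - q))

index = sorted([25, 1, 16, 9, 4])  # building the "index" once costs O(n log n)
```

Both functions return the same answer; the indexed variant simply inspects far fewer values per query once the sorted structure exists.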

Information retrieval (IR) is another field that benefits from findings in data mining. Put simply, it is about the computer-aided search for complex content, but also about presenting the results to the user. Data mining methods such as cluster analysis are used here to improve the search results and their presentation, for example by grouping similar search results. Text mining and web mining are two specializations in data mining that are closely related to information retrieval.

The collection of data, i.e. the recording of information in a systematic way, is an important prerequisite for obtaining valid results with the help of data mining. If the data was collected in a statistically unsound way, it may contain a systematic error, which is then found in the data mining step. The result is then possibly not a property of the observed objects, but an artifact of the way in which the data was recorded.

German name

An established German translation for the English term data mining does not yet exist.

There have been various attempts to find a German term for the imprecise English expression that is factually correct in all respects. The Duden restricts itself to the Germanized Anglicism "Data-Mining" (English "data mining"). Proposals for a German equivalent include "Datenmustererkennung" ("data pattern recognition", which is often misinterpreted as the recognition of existing patterns) and "Datenbergbau" (literally "data mining" in the sense of the mining industry, which does not fully do justice to the original meaning). The Duden dictionary of foreign words gives "Datenförderung" (roughly "data extraction") as a literal translation, but marks it as an unsuitable one. Even a targeted call for suggestions by the journal Künstliche Intelligenz ("Artificial Intelligence") did not yield any convincing proposals. None of these terms achieved any noteworthy currency, often because certain aspects of the topic, such as knowledge discovery, are lost and false associations arise, such as pattern recognition in the sense of image recognition.

Occasionally the German term "Wissensentdeckung in Datenbanken" (for the English Knowledge Discovery in Databases) is used, which covers the entire process, including the data mining step. This designation also emphasizes both the scientific requirements and the fact that the process takes place in the database (rather than, for example, a person forming an opinion by interpreting the data).

Data mining process

Data mining is the actual analysis step of the Knowledge Discovery in Databases process. The steps of the iterative process are broadly outlined:

  • Focus: data collection and selection, but also the determination of existing knowledge
  • Preprocessing: data cleansing, in which sources are integrated and inconsistencies are eliminated, for example by removing or completing incomplete records
  • Transformation into the appropriate format for the analysis step, for example by selecting attributes or discretising values
  • Data mining, the actual analysis step
  • Evaluation of the patterns found by experts and a check of whether the goals were achieved

In further iterations, knowledge that has already been found can be used ("integrated into the process") in order to obtain additional or more precise results in a new run.

Data mining tasks

Typical tasks in data mining are:

  • Outlier detection: identification of unusual records: outliers, errors, changes
  • Cluster analysis: grouping of objects based on similarities
  • Classification: elements not previously assigned to classes are assigned to the existing classes
  • Association analysis: identification of connections and dependencies in the data in the form of rules such as “A and B are usually followed by C”
  • Regression analysis: identification of relationships between (several) dependent and independent variables
  • Summary: reduction of the data set to a more compact description without significant loss of information

These tasks can be roughly broken down into observation problems (outlier detection, cluster analysis) and prognosis problems (classification, regression analysis).

Outlier detection

This task searches for data objects that are inconsistent with the rest of the data, for example because they have unusual attribute values or deviate from a general trend. For example, the Local Outlier Factor method searches for objects whose density differs significantly from that of their neighbors; this is referred to as “density-based outlier detection”.

Identified outliers are then often verified manually and excluded from the data set, as they can worsen the results of other methods. In some use cases such as fraud detection, however, the outliers are precisely the objects of interest.
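The idea of comparing an object's density with that of its neighbors can be sketched as follows. This is a deliberately simplified density ratio in the spirit of the Local Outlier Factor, not the LOF method itself (it omits the reachability distance, among other things); the function names and the sample points are assumptions made for this example.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def mean_knn_dist(points, i, k):
    """Average distance from points[i] to its k nearest neighbours (large = sparse)."""
    return sum(math.dist(points[i], points[j]) for j in knn(points, i, k)) / k

def outlier_scores(points, k=3):
    """Sparseness of each point relative to its neighbours; values well above 1 suggest an outlier."""
    sparse = [mean_knn_dist(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbours = knn(points, i, k)
        neighbour_sparseness = sum(sparse[j] for j in neighbours) / k
        scores.append(sparse[i] / neighbour_sparseness)
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = outlier_scores(points, k=3)
```

In this toy data set the point (8, 8) receives by far the highest score, while the points in the dense group score close to 1. The sketch assumes there are no duplicate points, which would lead to a division by zero.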

Cluster analysis

Cluster analysis is about identifying groups of objects that are in some way more similar to each other than to objects in other groups. Often these are accumulations in the data space, which is where the term cluster comes from. In density-based cluster analysis such as DBSCAN or OPTICS, the clusters can take any shape. Other methods such as the EM algorithm or the k-means algorithm prefer spherical clusters.

Objects that have not been assigned to a cluster can be interpreted as outliers in the sense of the previously mentioned outlier detection.
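As an illustration of one of the methods mentioned, the following is a minimal sketch of the k-means algorithm (Lloyd's iteration). The initial centers are passed in explicitly to keep the example deterministic; practical implementations usually choose them randomly or with a seeding heuristic. The sample points are invented.

```python
import math

def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate between assignment and centre update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        labels = [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each centre moves to the mean of its assigned points.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, centers=[(0, 0), (10, 10)])
```

Because every point is assigned to its nearest center, the resulting clusters are bounded by straight lines between centers, which is one way to see why k-means favors compact, roughly spherical groups.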

Classification

As in cluster analysis, classification is about assigning objects to groups (referred to here as classes). In contrast to cluster analysis, the classes are usually predefined (for example: bicycles, cars), and methods from machine learning are used to assign previously unassigned objects to these classes.
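A minimal sketch of such an assignment, using a k-nearest-neighbor classifier as one possible machine learning method. The classes follow the example above; the features (number of wheels, weight in kg) and all values are assumptions made for this illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train is a list of (features, label); returns the majority label of the k nearest items."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (number of wheels, weight in kg) -> class
train = [
    ((2, 15), "bicycle"), ((2, 12), "bicycle"), ((2, 18), "bicycle"),
    ((4, 1200), "car"), ((4, 1500), "car"), ((4, 1100), "car"),
]

unknown_light = knn_classify(train, (2, 14))
unknown_heavy = knn_classify(train, (4, 1300))
```

In practice the features would usually be scaled first, since the weight dominates the distance here.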

Association analysis

In association analysis, frequent connections are sought in the records and usually formulated as inference rules. A popular (albeit apparently fictional) example, mentioned among other places in the television series Numb3rs, is the following: a shopping basket analysis found that the product categories "diapers" and "beer" are bought together more often than average, usually presented in the form of a rule “customer buys diapers → customer buys beer”. The interpretation of this result was that when men are sent by their wives to buy diapers, they like to pick up a beer for themselves as well. By placing the beer shelf on the way from the diaper shelf to the checkout, beer sales could allegedly be increased further.
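How strongly such a rule holds is commonly quantified with the measures support, confidence and lift. The following sketch computes them for the rule "diapers → beer" on a handful of invented shopping baskets; the data is purely illustrative and not taken from any real analysis.

```python
def support(transactions, items):
    """Fraction of transactions that contain all of the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's overall frequency; > 1 means 'more often than expected'."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

baskets = [  # hypothetical shopping baskets
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "chips"},
    {"milk"},
    {"bread", "chips"},
]

rule_support = support(baskets, {"diapers", "beer"})
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})
rule_lift = lift(baskets, {"diapers"}, {"beer"})
```

A lift above 1 corresponds to the statement that the two categories are bought together more often than would be expected if they were independent.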

Regression analysis

In regression analysis, the statistical relationship between different attributes is modeled. This allows, among other things, the prognosis of missing attribute values, but also the analysis of the deviation analogous to the outlier detection. If findings from the cluster analysis are used and separate models are calculated for each cluster, better forecasts can typically be made. If a strong connection is found, this knowledge can also be used well for the summary.
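The simplest case, a straight line fitted by ordinary least squares to a single independent variable, can be sketched as follows. The data values are invented; the residuals at the end illustrate how the deviation from the model could be inspected for outliers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]  # hypothetical measurements

slope, intercept = fit_line(xs, ys)

# Prognosis of a missing value, and residuals for deviation analysis.
predicted = slope * 6 + intercept
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```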

Summary

Since data mining is often applied to large and complex amounts of data, an important task is also reducing this data to a manageable amount for the user. In particular, the outlier detection identifies individual objects that can be important; the cluster analysis identifies groups of objects for which it is often sufficient to examine them only on the basis of a random sample, which significantly reduces the number of data objects to be examined. Regression analysis makes it possible to remove redundant information and thus reduces the complexity of the data. Classification, association analysis, and regression analysis (sometimes also cluster analysis) also provide more abstract models of the data.

With the help of these approaches, both the analysis of the data and, for example, their visualization (using random samples and reduced complexity) are simplified.

Specializations

While most data mining methods attempt to deal with data that is as general as possible, there are also specializations for more specific data types.

Text mining

Text mining is about the analysis of large textual databases. It can be used, for example, to detect plagiarism or to classify a text corpus.

Web mining

Web mining is about the analysis of distributed data as represented by Internet pages. For the detection of clusters and outliers, not only the pages themselves but also the relationships (hyperlinks) between the pages are considered. The constantly changing content and the non-guaranteed availability of data result in additional challenges. This subject area is also closely related to information retrieval.

Time series analysis

Temporal aspects and relationships play a major role in time series analysis. Existing data mining methods can be used here by means of special distance functions such as the dynamic time warping distance, but specialized methods are also being developed. An important challenge is to identify series with a similar course even when one is slightly offset in time relative to the other but otherwise has similar characteristics.
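The dynamic time warping distance mentioned above can be sketched with a short dynamic program. This is the textbook formulation with absolute difference as the local cost and no window constraint; the example series are invented.

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i values of a with the first j values of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]  # same shape as a, shifted by one step
c = [0, 0, 3, 4, 3, 0, 0]  # same timing as a, different amplitude

pointwise_ab = sum(abs(x - y) for x, y in zip(a, b))
```

For the shifted pair, dtw(a, b) is 0 although the pointwise comparison yields 4, whereas the series c with a genuinely different amplitude keeps a positive distance to a.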

Data mining problems

Data defects

Many of the problems with data mining stem from insufficient preprocessing of the data or from systematic errors and distortions in their collection. These problems are often of a statistical nature and have to be solved at the time of recording: representative results cannot be obtained from non-representative data. Similar aspects must be observed here as when creating a representative sample.

Parameterization

The algorithms used in data mining often have several parameters that must be selected appropriately. They deliver valid results for any parameter setting, and it is the responsibility of the user to choose the parameters so that the results are also useful. If, for example, small values are chosen for the parameters ε and minPts of the DBSCAN cluster analysis algorithm, the algorithm finds a finely resolved structure, but it also tends to break clusters into small pieces. If larger parameters are chosen, only the main clusters are found, which may already be known and therefore not helpful. More advanced methods often have fewer parameters, or their parameters are easier to choose. For example, OPTICS is a further development of DBSCAN that largely eliminates the parameter ε.
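The effect of the radius parameter can be observed with a compact sketch of DBSCAN. The implementation below follows the usual description of the algorithm (a point counts as a core point if at least min_pts points, itself included, lie within distance eps); the names eps and min_pts and the sample points are assumptions chosen for this example.

```python
import math

def dbscan(points, eps, min_pts):
    """Returns one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)

    def region(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point that was previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region(j)
            if len(reachable) >= min_pts:  # only core points expand the cluster
                queue.extend(reachable)
    return labels

def n_clusters(labels):
    return len(set(labels) - {-1})

points = [(0, 0), (0.5, 0), (1, 0),
          (2.2, 0), (2.7, 0), (3.2, 0),
          (10, 0), (10.5, 0), (11, 0)]

too_small = dbscan(points, eps=0.3, min_pts=3)
fine = dbscan(points, eps=0.6, min_pts=3)
coarse = dbscan(points, eps=1.5, min_pts=3)
```

With the smallest radius every point ends up as noise, the middle value separates three small groups, and the largest value merges the two neighboring groups so that only two main clusters remain.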

Evaluation

The evaluation of data mining results confronts the user with the problem that, on the one hand, new knowledge is to be gained, while on the other hand it is then difficult to evaluate the methods automatically. In the case of forecast problems such as classification, regression analysis and association analysis, the forecast on new data can be used for evaluation. This is more difficult for description problems such as outlier detection and cluster analysis. Clusters are usually assessed internally or externally, i.e. based on their mathematical compactness or on their agreement with known classes. The results of outlier detection methods are compared with known outliers. In both cases, however, the question arises whether this evaluation really fits the task of producing “new findings” or ultimately only assesses the “reproduction of old findings”.
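An external evaluation against known classes can be sketched with the purity measure, used here as one simple example of such a measure; the label lists are invented.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Fraction of objects that belong to the majority class of their cluster."""
    clusters = {}
    for cluster, known_class in zip(cluster_labels, class_labels):
        clusters.setdefault(cluster, []).append(known_class)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(cluster_labels)

clusters = [0, 0, 0, 1, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b", "a"]
score = purity(clusters, classes)

# Caveat: putting every object into its own cluster also yields a perfect score.
trivial = purity([0, 1, 2, 3], ["x", "x", "y", "y"])
```

The trivial case shows one reason for caution: a high score against known classes does not by itself indicate that anything new or useful was found.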

Interpretation

As a statistical process, the algorithms analyze the data without any background knowledge of their meaning. Therefore, the methods can usually only provide simple models such as groups or mean values. Often the results as such are no longer comprehensible. These machine-obtained results must then be interpreted by the user before they can really be called knowledge.

Application areas

In addition to its applications in related areas of computer science, data mining is also increasingly used in industry:

  • Process analysis and optimization:
    • With the help of data mining, technical processes can be analyzed and the relationships between the individual process variables can be determined. This helps to control and optimize processes. The first successful approaches have already been achieved in the chemical industry and plastics processing.
  • Analysis of product data: data from the product life cycle can also be analyzed using data mining. This data is generated in particular during maintenance and service. They can be used to optimize and further develop the product and can help generate innovations.

Legal, moral and psychological aspects

Data mining as a scientific discipline is initially value-neutral. The methods allow the analysis of data from almost any source, for example measured values from components or historical bone finds. However, when the analyzed data relate to individuals, important legal and moral problems arise. Typically, these already arise during the acquisition and storage of the data, not only during the analysis, and regardless of the specific analysis method used (statistics, database queries, data mining, ...).

Legal aspects

Insufficiently anonymized data can sometimes be re-assigned to specific persons (de-anonymized) through data analysis. Typically, however, de-anonymization does not rely on data mining but on simpler, specialized analysis methods. Such an application, and above all the inadequate anonymization beforehand, may then be illegal under data protection law. For example, researchers were able to uniquely identify user profiles in a social network using only a few queries. If movement data is merely pseudonymized, a user can often be identified with a simple database query (technically not data mining at all) as soon as their place of residence and workplace are known: most people can be uniquely identified by the 2–3 places where they spend most of their time.

Data protection law generally speaks of the "collection, processing or use" of personal data, since this problem occurs not only with data mining but also with other methods of analysis (e.g. statistics). Reliable protection against improper analysis is only possible if the relevant data is not recorded and stored in the first place.

Moral aspects

The application of data mining methods to personal data also raises moral questions, for example whether a computer program should divide people into “classes”. In addition, many of the methods are suitable for surveillance and for extended dragnet investigations. For example, the SCHUFA score is a classification of people into the classes "creditworthy" and "not creditworthy" obtained through statistics, perhaps also data mining, and is criticized accordingly.

Psychological aspects

Data mining processes themselves work in a value-neutral manner and only calculate probabilities without knowing the significance of this probability. However, if people are confronted with the result of these calculations, it can cause surprised, offended or alienated reactions. It is therefore important to consider whether and how one confronts someone with such results.

Google grants its users insight into the target groups it has identified for them (unless they have opted out) and is often wrong. An American department store chain, by contrast, can infer from shopping behavior whether a customer is pregnant. This information can be used to send targeted shopping vouchers. Even the expected date of birth can be predicted.

Literature

The following literature provides an overview of the field of data mining from the perspective of computer science.
Task and application-specific literature can be found in the respective articles.

  • Martin Ester, Jörg Sander: Knowledge Discovery in Databases. Techniques and Applications. Springer, Berlin 2000, ISBN 3-540-67328-8.
  • Ian H. Witten, Eibe Frank, Mark A. Hall: Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition. Morgan Kaufmann, Burlington, MA 2011, ISBN 978-0-12-374856-0 (waikato.ac.nz, in English; software for the book: WEKA).
  • Sholom M. Weiss, Nitin Indurkhya: Predictive Data Mining. A Practical Guide. Morgan Kaufmann, Burlington, MA 1997, ISBN 1-55860-403-0 (in English).
  • Jiawei Han, Micheline Kamber, Jian Pei: Data Mining: Concepts and Techniques. Morgan Kaufmann, Burlington, MA 2011, ISBN 978-0-12-381479-1 (in English).
  • Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth: From Data Mining to Knowledge Discovery in Databases. In: AI Magazine. Vol. 17, no. 3, 1996, pp. 37–54 (kdnuggets.com [PDF], in English).

References

  1. Entry "Data-Mining". In: duden.de. Retrieved December 18, 2016.
  2. a b c Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth: From Data Mining to Knowledge Discovery in Databases. In: AI Magazine. Vol. 17, no. 3, 1996, pp. 37–54 (as PDF at: kdnuggets.com).
  3. Jiawei Han, Micheline Kamber: Data mining: concepts and techniques. 1st edition. Morgan Kaufmann, 2001, ISBN 978-1-55860-489-6, p. 5 (Thus, data mining should have been more appropriately named "knowledge mining from data," which is unfortunately somewhat long).
  4. a b c Martin Ester, Jörg Sander: Knowledge Discovery in Databases. Techniques and Applications. Springer, Berlin 2000, ISBN 3-540-67328-8.
  5. a b Duden online: Data-Mining: meaning, spelling, grammar, origin. Bibliographisches Institut, accessed August 9, 2011.
  6. a b [From the journal "Künstliche Intelligenz" …] "a competition was held to find an adequate German term. And as sorry as I am, no adequate German term was found."
    Hans-Peter Kriegel: Database techniques to support knowledge acquisition. In: Heinz Mandl, Gabi Reinmann-Rothmeier (eds.): Knowledge management: Information growth - knowledge loss? The strategic importance of knowledge management. Oldenbourg, Munich / Vienna 2000, ISBN 3-486-25386-7, pp. 47–71.
  7. N. Bissantz, J. Hagedorn: Data Mining (Datenmustererkennung). In: Wirtschaftsinformatik. 35 (1993) 5, pp. 481–487.
  8. Duden, Das Fremdwörterbuch: "Engl., literally 'data extraction'" (German: "Datenförderung").
  9. This story is probably a modern legend. Depending on the variant, the beer is placed next to the diapers, on the way to the checkout, or at the other end of the supermarket (so that the customer has to pass as many other products as possible).
    KDNuggets post mentioning a possible source of the myth
  10. I. Färber, S. Günnemann, H.-P. Kriegel, P. Kröger, E. Müller, E. Schubert, T. Seidl, A. Zimek: On Using Class-Labels in Evaluation of Clusterings. In: MultiClust: 1st International Workshop on Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with KDD 2010, Washington, DC. 2010 (as PDF at: dbs.informatik.uni-muenchen.de).
  11. C. Kugler, T. Hochrein, M. Bastian, T. Froese: Hidden Treasures in Data Graves. In: QZ Quality and Reliability, 2014, 3, pp. 38–41.
  12. Knowing what is in demand: data mining can accelerate innovation. In: IPH. Retrieved March 12, 2018.
  13. Security gap: IT researchers unmask Internet surfers. In: Spiegel Online. Retrieved December 7, 2011.
  14. Google Ad Preferences
  15. How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did. In: Forbes.com. Retrieved February 16, 2012.