Decision tree

Decision trees ( English : decision tree ) are ordered, directed trees , the depiction of the decision rules are used. The graphical representation as a tree diagram illustrates hierarchically successive decisions . They are important in numerous areas in which automatic classification takes place or formal rules are derived or presented from empirical knowledge .

Properties and use

Decision trees are a method for the automatic classification of data objects and thus for solving decision problems. They are also used for the clear presentation of formal rules. A decision tree always consists of a root node and any number of inner nodes as well as at least two leaves. Each node represents a logical rule and each leaf an answer to the decision problem.

The complexity and semantics of the rules are not restricted from the start. With binary decision trees, each rule expression can only take one of two values. All decision trees can be converted into binary decision trees.

Decision trees are used in numerous areas, e.g. B.

in stochastics as a probability tree to illustrate conditional probabilities (e.g. for absolute frequencies )
in machine learning as a method for the automatic classification of objects
in data mining
in decision theory
in medical decision-making (medicine) and in emergency medicine
in Business Rule Management Systems (BRMS) for the definition of rules

functionality

Classify with decision trees

To read the classification of an individual data object, you go down from the root node along the tree. For each node, an attribute is queried and a decision is made about the selection of the following node. This procedure continues until you reach a leaf. The sheet corresponds to the classification. A tree basically contains rules for answering exactly one question.

Example: A decision tree with low complexity

Binary decision tree for predicting whether an apple tree will bear fruit

The binary decision tree opposite provides an answer to the question of whether an apple tree will bear fruit. The tree needs a vector as input with information on the attributes of an apple tree. For example, an apple tree can have the attributes old , natural variety and rich soil . Starting with the root node, the decision rules of the tree are now applied to the input vector. In the example tree, an attribute of the input vector is queried at each node, for example the age of the apple tree at the root node. The answer decides on the next node and thus on the next decision rule to be applied, in this case the question about the variety, and then the question about the nature of the soil. If you get to a sheet of paper after a series of evaluated rules, you get the answer to the original question. It is not always necessary to go through all levels of the decision tree. For the apple tree described above, the answer is yes , meaning that the tree will bear fruit. This decision-making process is called formal classification .

Induction of decision trees

Decision trees can either be written down manually by experts or automatically induced from collected empirical values using machine learning techniques . There are several competing algorithms for this.

The induction of the decision trees usually takes place recursively using the top-down principle. To this end, it is of crucial importance that a suitable data set with reliable empirical values for the decision problem (the so-called training data set ) is available. This means that the classification of the target attribute must be known for each object in the training data set. For each induction step, the attribute is searched with which the training data can best be classified in this step with regard to the target attribute . Entropy measures , Gini index or others are used as a measure for determining the best classification . The identified attribute is now used to split the data. The procedure is applied recursively to the subsets created in this way, until each subset only contains objects with one classification. In the end, a decision tree was created that describes the empirical knowledge of the training data set in formal rules.

The trees can now be used to automatically classify other data sets or to interpret and evaluate the resulting set of rules.

Algorithms in comparison

The algorithms for the automatic induction of decision trees are all based on the same recursive top-down principle. They only differ in their criteria for selecting the attributes and values of the rules at the nodes of the tree, in their criteria for terminating the induction process and in possible post-processing steps that subsequently optimize a branch of a tree (or entire trees) that has already been calculated using various criteria .

In practice there are different families of algorithms:

The oldest family are the CHAIDs ( Chi-square Automatic Interaction Detectors ) from 1964. They can generate trees based on discrete attributes.
The family of CART's ( Classification And Regression Trees ) extends the CHAIDS above all the ability to be able to process real-valued attributes by a division criterion (English split-criterion ) for splitting real-value is used attributes into discrete categories. In addition, some optimization strategies such as B. introduced pruning .
The most recent algorithms are the ID3 algorithm and its successors C4.5 and C5.0 . Formally, they belong to the CART family , but are often seen as a separate group, since they introduce numerous new optimization strategies compared to CART .

Error rate and effectiveness

The error rate of a decision tree is the number of incorrectly classified data objects in relation to all data objects in a data set. This number is regularly determined either on the training data used or, better, on a set of data objects classified as correctly as possible, which is disjoint from the training data.

Depending on the area of application, it can be particularly important to keep the number of objects classified as false positive or false negative in particular low. In emergency medicine, for example, it is far less harmful to treat a healthy patient than not to treat a sick patient. The effectiveness of decision trees is therefore always a context-dependent variable.

Representation in disjunctive normal form

Every possible classification of a decision tree can be given in disjunctive normal form . To do this, one considers all leaves of this classification and their paths to the root node. For each of these paths, all the conditions of the decision nodes along the path are set in conjunction with one another and the resulting terms are set in disjunction . For the example shown above, the following statement results for the classification "apple tree bears fruit" (yes):

{\ displaystyle (({\ text {Age}} = {\ text {old}}) \ wedge ({\ text {Variety}} = {\ text {refined}})) \ lor (({\ text {Age }} = {\ text {alt}}) \ wedge ({\ text {variety}} = {\ text {natural}}) \ wedge ({\ text {soil}} = {\ text {rich}})) \ to {\ text {TreeBearsFruits}}}

Advantages and disadvantages

Interpretability and comprehensibility

A big advantage of decision trees is that they are easy to explain and understand. This allows the user to evaluate the result and recognize key attributes. This is particularly useful when the basic properties of the data are not known in advance. The induction of decision trees is therefore also an important technique in data mining .

Classification quality in real-valued data spaces

An often mentioned disadvantage of decision trees is the relatively poor classification quality in real-valued data spaces when the trees are used for automatic classification. Because of their discrete set of rules , the trees do a little worse for most classification problems from the real world compared to other classification techniques such as artificial neural networks or support vector machines . This means that although the trees can generate easily comprehensible rules for humans, these understandable rules often do not have the best possible quality for classification problems in the real world.

Size of decision trees and overfitting

Another disadvantage is the possible size of the decision trees if no simple rules can be induced from the training data. This can have multiple negative effects: On the one hand, a human observer quickly loses the overall overview of the context of the many rules, on the other hand, such large trees usually lead to over-adaptation to the training data set, so that new data sets are only incorrectly classified automatically. For this reason, so-called pruning methods were developed which shorten the decision trees to a reasonable size. For example, you can limit the maximum depth of the trees or specify a minimum number of objects per node.

Further processing of the results

The decision trees are often only used as an intermediate step towards a more efficient representation of the set of rules. In order to get to the rules, different decision trees are generated using different methods. Frequently occurring rules are extracted. The optimizations are overlaid to get a robust, general and correct rule set. The fact that the rules are not related to one another and that contradicting rules can be generated has a disadvantageous effect on this method.

Further developments and extensions

Decision forests

One method to increase the quality of classification when using decision trees is to use sets of decision trees instead of individual trees. Such amounts of decision trees are used as decision forests (English: decision forests ), respectively.

The use of decision forests is one of the so-called ensemble techniques in machine learning . The idea of decision forests is that a single decision tree does not have to provide an optimal classification, but the majority decision of a set of suitable trees can. Common methods of creating suitable forests are boosting , bagging and arcing .

The disadvantage of decision forests is that it is no longer as easy for humans to interpret the rules contained in all trees in their entirety in a simple manner, as it would be possible with individual trees.

Combination with neural networks

Decision trees can be combined with neural networks . It is thus possible to replace inefficient branches of the tree with neural networks in order to achieve a higher quality of classification that cannot be achieved with the trees alone. The advantages of both classification methods can also be used by mapping partial structures in the other method: The trees do not need as much training data for induction as the neural networks. On the other hand, they can be quite inaccurate, especially when they are small. The neural networks, on the other hand, classify more precisely, but require more training data. Therefore, attempts are made to use the properties of the decision trees to generate parts of neural networks by so-called TBNN (Tree-Based Neural Network) translating the rules of the decision trees into the neural networks.

Practical applications

Example: Decision tree for uploading images to Wikipedia

Application programs

There are several application programs that have implemented decision trees, for example they are offered in the statistics software packages GNU R , Scikit-learn , SPSS and SAS . The latter two packages, by the way, like most other data mining software packages, use the CHAID algorithm.

Example: customer classification

A bank wants to sell a new service with a direct mailing campaign . In order to increase the chances of success, the campaign should only address households that correspond to a combination of demographic variables that an appropriate decision tree has declared to be optimal. This process is called data segmentation or segmentation modeling .

Decision trees in business administration

In the area of decision theory within business administration, decision trees are drawn from left to right (and vice versa) in order to calculate payoffs from expenses and income and to optimize the decision for maximum payoff.

In these decision trees, squares represent decisions, circles represent possibilities, and triangles terminate the branches. Decision trees of this type can be combined with other methods, in the example above NPV (Net Present Value), linear distribution of assumptions (Investment # 2) and PERT 3-point estimation (Investment # 1) are used.

literature

Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone: Classification and regression trees , Wadsworth International Group, Belmont CA, 1984, ISBN 0-412-04841-8
Tom M. Mitchell: Machine Learning , McGraw-Hill, Boston USA, 1997, ISBN 0-07-115467-1
Jiawei Han, Micheline Kamber: Data Mining , Morgan Kaufmann publishers, San Francisco CA, 2006 (2nd edition), ISBN 978-1558609013
Peter Dörsam: Decision Theory Illustrated , Pd-Verlag, Heidenau, 2007 (5th edition), ISBN 978-3867073059
Günter Bamberg, Adolf G. Coenenberg , Michael Krapp: Business decision-making , Verlag Franz Vahlen, Munich, 2008 (14th edition), ISBN 978-3-8006-3506-1

Web links

Commons : Decision diagrams - collection of images, videos and audio files

Individual evidence

↑ YES Sonquist, JN Morgan: The Detection of Interaction Effects , Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor.
↑ JR Quinlan: Induction of decision trees , Machine learning, 1 (1): 81-106, 1986
↑ Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu and Philip S. Yu, et al .: Top 10 algorithms in data mining , Knowledge and Information Systems Volume 14, Number 1 (2008), 1-37
↑ R.King, C.Feng, A.Sutherland: Comparison of classification algorithms on large real-world problems , Applied Artificial Intelligence, 9 (3): 259-287, May / June 1995
↑ O.Gascuel, B. Bouchon-Meunier, G. Caraux, P.Gallinari, A. Guénoche, Y.Guermeur: Twelve numerical, symbolic and hybrid supervised classification methods , International Journal of Pattern Recognition and Artificial Intelligence, 12 (5) : 517-571, 1998
↑ A.Flöter: Analyzing biological expression data based on decision tree induction , dissertation, University of Potsdam, 2006
↑ M. Murata, Q. Ma, H. Isahara: Comparison of three machine-learning methods for thai part-of-speech tagging , ACM Transaction on Asian Language Information Processing, 1 (2): 145-158, June 2002
↑ Y.Freund: Boosting a weak learning algorithm by majority , Information and Computation, 121 (2): 256-285, 1995
↑ W.Tong, Q.Xie, H.Hong, H.Fang, L.Shi, R.Perkins, EFPetricoin: Using decision forests to classify prostate cancer samples on the basis of seldi-tof ms data: Assessing chance correlation and prediction confidence , Environmental Health Perspectives , 112 (16): 1622-1627, 2004
↑ E.Bauer, R.Kohavi: An empirical comparison of voting classification algorithms: Bagging, Boosting, and variants , Machine Learning, 36 (1-2): 105-139, 1998
↑ RESchapire: A short introduction to boosting , Japanese Society for Artificial Intelligence, 14 (5): 771-780, 1999
↑ L.Breiman: Bagging predictors , Machine Learning, 24 (2): 123-140, 1996
↑ R. Nock: Inducing interpretable voting classifiers without trading accuracy for simplicity: Theoretical results, approximation algorithms, and experiments , Journal of Artificial Intelligence Research, 17: 137-170, August 2002
^ AK Seewald, J. Petrak, G. Widmer: Hybrid decision tree learners with alternative leaf classifiers , Proceedings of the 14th International FLAIRS conference, AAAI Press, Menlo Park CA, 2000
↑ Decision Trees - scikit-learn documentation. Retrieved August 24, 2018 .

[1] YES Sonquist, JN Morgan: The Detection of Interaction Effects , Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor.

[2] JR Quinlan: Induction of decision trees , Machine learning, 1 (1): 81-106, 1986

[3] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu and Philip S. Yu, et al .: Top 10 algorithms in data mining , Knowledge and Information Systems Volume 14, Number 1 (2008), 1-37

[4] R.King, C.Feng, A.Sutherland: Comparison of classification algorithms on large real-world problems , Applied Artificial Intelligence, 9 (3): 259-287, May / June 1995

[5] O.Gascuel, B. Bouchon-Meunier, G. Caraux, P.Gallinari, A. Guénoche, Y.Guermeur: Twelve numerical, symbolic and hybrid supervised classification methods , International Journal of Pattern Recognition and Artificial Intelligence, 12 (5) : 517-571, 1998

[6] A.Flöter: Analyzing biological expression data based on decision tree induction , dissertation, University of Potsdam, 2006

[7] M. Murata, Q. Ma, H. Isahara: Comparison of three machine-learning methods for thai part-of-speech tagging , ACM Transaction on Asian Language Information Processing, 1 (2): 145-158, June 2002

[8] Y.Freund: Boosting a weak learning algorithm by majority , Information and Computation, 121 (2): 256-285, 1995

[9] W.Tong, Q.Xie, H.Hong, H.Fang, L.Shi, R.Perkins, EFPetricoin: Using decision forests to classify prostate cancer samples on the basis of seldi-tof ms data: Assessing chance correlation and prediction confidence , Environmental Health Perspectives , 112 (16): 1622-1627, 2004

[10] E.Bauer, R.Kohavi: An empirical comparison of voting classification algorithms: Bagging, Boosting, and variants , Machine Learning, 36 (1-2): 105-139, 1998

[11] RESchapire: A short introduction to boosting , Japanese Society for Artificial Intelligence, 14 (5): 771-780, 1999

[12] L.Breiman: Bagging predictors , Machine Learning, 24 (2): 123-140, 1996

[13] R. Nock: Inducing interpretable voting classifiers without trading accuracy for simplicity: Theoretical results, approximation algorithms, and experiments , Journal of Artificial Intelligence Research, 17: 137-170, August 2002

[14] AK Seewald, J. Petrak, G. Widmer: Hybrid decision tree learners with alternative leaf classifiers , Proceedings of the 14th International FLAIRS conference, AAAI Press, Menlo Park CA, 2000

[15] Decision Trees - scikit-learn documentation. Retrieved August 24, 2018 .