Simplification of decision trees

Decision trees originate from the fields of machine learning, artificial intelligence and knowledge engineering. They are used to model and represent classification tasks. Trees are usually built by induction. Inducing an optimal tree is an NP-complete problem, which is why trees cannot, in general, be provably induced optimally. This leads to the problem that trees often do not strike the ideal balance between complexity (number of nodes) and accuracy (predictive accuracy).

There are several methods and approaches for simplifying trees and thereby improving their classification properties. Any data set (without noise) can be modeled as a tree structure so that every object is assigned correctly. A good tree is characterized by assigning a large number of unseen objects to the correct class labels without introducing unnecessary complexity.

Breslow and Aha list several motivations for simplifying decision trees:

  • Some trees fail to represent the structure of the data set briefly and concisely; other modeling concepts would work better.
  • Errors in the underlying data (noise) cause trees to overfit: the tree models both the target concept and the noise, resulting in a larger number of nodes.
  • Subtree replication: identical subtrees occur several times within the same tree.
  • Large trees are more prone to errors than small ones, especially with unseen objects.

Approaches to simplifying trees

Five categories of decision tree simplification

  • Changing the tree size (pruning, incremental tree sizing)
  • Modifying the test space (data-driven, hypothesis-driven)
  • Modifying the test search (selection criteria, lookahead search)
  • Restricting the database (case or attribute selection)
  • Using alternative data structures (rules or graphs)

Changing the tree size

Pruning describes the simplification of the tree structure by replacing subtrees and inner nodes with leaves (see the Wikipedia article on pruning).

Modifications of the test space

With this approach, simple tests (queries in nodes, e.g. humidity = true) are replaced by non-trivial tests. Queries in nodes are combined with mathematical operators (numerical or logical) in such a way that, by clever restructuring, entire subtrees can be queried in a single node (e.g. humidity = true AND rain = false). With these operations, a single path through a subtree can be assigned to a specific class in one query: if the AND-linked query is true, the class "positive" is assigned, otherwise the class "negative". This type of query is called a multivariate test, since several attributes can be tested at the same time. In contrast, the univariate tests of trivial trees can query at most one variable (attribute) per node.
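
A minimal sketch of the idea, with illustrative attribute names (humidity, rain) rather than the notation of any particular system: a single node with a conjunctive multivariate test replaces a chain of two univariate nodes.

    # Univariate version: two nodes are needed for the same decision
    #   if humidity == True:      (node 1)
    #       if rain == False:     (node 2)
    #           return "positive"
    #   return "negative"

    def multivariate_node(example):
        """One node with an AND-linked (conjunctive) multivariate test."""
        if example["humidity"] is True and example["rain"] is False:
            return "positive"
        return "negative"

    print(multivariate_node({"humidity": True, "rain": False}))  # positive
    print(multivariate_node({"humidity": True, "rain": True}))   # negative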

Linking queries gives rise to a new measure for the complexity of the trees: for multivariate tests, complexity is the sum of the numbers of attributes, internal nodes and leaves.

Synergy effects of the multivariate tests can be:

  • significant reduction in complexity
  • higher accuracy in predictions
  • Two linked tests can yield more information than two sequential queries (information gain).

Data-driven test construction

This category comprises logical and numerical construction operators.

Numerical

Numerical operators (+, −, ×, ÷, etc.) are used to construct the multivariate tests. These classifiers are linear multivariate tests and are called perceptrons. Like other tests, a perceptron can only have binary outputs. The nodes are merged onto a so-called hyperplane (hyperplane merging), which allows smaller trees to be realized with the same accuracy. A well-known representative of perceptron trees is PT2 (an extension of the perceptron tree) with a complexity of O(n²).
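
A minimal sketch of a perceptron test node, i.e. a linear multivariate test over numeric attributes; the weights and the threshold are hand-picked assumptions here, whereas in a perceptron tree they would be learned during induction.

    def perceptron_test(x, weights, threshold):
        """Linear multivariate test: one node queries a weighted sum of attributes."""
        score = sum(w * xi for w, xi in zip(weights, x))
        return "positive" if score >= threshold else "negative"

    # Two numeric attributes; weights and threshold chosen by hand for illustration
    print(perceptron_test([0.7, 0.2], weights=[1.0, -0.5], threshold=0.5))  # positive
    print(perceptron_test([0.1, 0.9], weights=[1.0, -0.5], threshold=0.5))  # negative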

Logical

Here, logical operators such as conjunction, disjunction or negation are used to link attribute queries logically. The complexity is O(m·2^f) for f features and m combinations.
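
An illustrative enumeration (not a specific published construction operator) of why the number of candidate logical tests grows exponentially with the number of features: each binary attribute can be required to be true, required to be false (negated), or left out of a conjunction.

    from itertools import product

    features = ["humidity", "rain", "wind"]      # f = 3 binary attributes

    # Each attribute is fixed to True, fixed to False, or omitted:
    # 3**f - 1 non-empty conjunctive candidate tests, i.e. exponential in f.
    candidates = []
    for assignment in product([True, False, None], repeat=len(features)):
        literals = [(f, v) for f, v in zip(features, assignment) if v is not None]
        if literals:
            candidates.append(literals)

    print(len(candidates))   # 26 candidate tests for f = 3
    print(candidates[0])     # [('humidity', True), ('rain', True), ('wind', True)]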

Hypothesis-driven test construction

These methods exploit the advantages of feature interaction (attributes A and B have no information gain for class C on their own, but A ∘ B, for some operator ∘, has a significantly higher information gain than A and B separately) already during the induction of the tree (constructive induction). Constructed tests (A ∘ B in a node) are evaluated during induction in order to establish good classification properties from the start. This method is used successfully to prevent the replication of subtrees.
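
A small sketch of feature interaction on an XOR-like target: neither attribute has any information gain on its own, but the constructed test A XOR B separates the classes completely. The entropy and information-gain helpers are written out directly; their names are illustrative.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, test):
        n = len(labels)
        split = {}
        for x, y in zip(rows, labels):
            split.setdefault(test(x), []).append(y)
        return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in split.values())

    # Class C = A XOR B
    rows   = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [0, 1, 1, 0]

    print(info_gain(rows, labels, lambda x: x[0]))          # 0.0 (A alone)
    print(info_gain(rows, labels, lambda x: x[1]))          # 0.0 (B alone)
    print(info_gain(rows, labels, lambda x: x[0] ^ x[1]))   # 1.0 (constructed test A XOR B)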

Combined attributes are ranked according to their joint information gain in order to achieve the most efficient classification properties possible. A representative of this methodology is CIRTE, an ID3-based classifier that produces more accurate and smaller trees than ID3. However, this does not reduce the computational complexity (O(m·2^(Ci−1)) for i possible attributes).

Modification of the test search

With this approach, different selection criteria are used, continuous distributions are discretized, and extended search methods are employed.

Other selection criteria

This approach changes the search for tests. The common (primitive) selection criteria (selection measures) are, for example, the entropy of an attribute E(A), the information value IV(A), and the information gain gain(A) of a split. The gain ratio of A is somewhat more suitable, as it takes into account not only the information gained but also the information lost through a split: GainRatio(A) = gain(A) / IV(A).
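
A small sketch of the gain ratio computation for a single categorical attribute; the sample data and function names are illustrative.

    from math import log2
    from collections import Counter

    def entropy(values):
        n = len(values)
        return -sum(c / n * log2(c / n) for c in Counter(values).values())

    def gain_ratio(attribute, labels):
        n = len(labels)
        groups = {}
        for a, y in zip(attribute, labels):
            groups.setdefault(a, []).append(y)
        gain = entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in groups.values())
        iv = entropy(attribute)   # IV(A): entropy of the split itself
        return gain / iv if iv > 0 else 0.0

    outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
    play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
    print(round(gain_ratio(outlook, play), 3))   # 0.421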

Other common measures are, for example, the minimum description length (MDL measure), the weak theory learning measure (WTLM) and the KS distance.

Dealing with continuous attributes

Furthermore, continuous attributes are discretized into discrete attributes. Algorithms such as ID3 cannot deal with continuous attribute values, so a continuous distribution must be divided into n intervals (clusters) before induction. Ideally, the interval boundaries are chosen so that the information gain is maximal. For example, suppose a table stores the income of people. Instead of treating each salary as a separate attribute value (which provides no information gain for the class assignment), the salaries are discretized into clusters (low, medium and high). This creates information for a possible split.
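
A sketch of discretizing a continuous attribute by searching for the binary split threshold with maximal information gain; the salary figures and class labels are invented for illustration.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Try each midpoint between sorted values; keep the one with maximal gain."""
        base, n = entropy(labels), len(labels)
        best_gain, best_t = 0.0, None
        pairs = sorted(zip(values, labels))
        for i in range(1, n):
            t = (pairs[i - 1][0] + pairs[i][0]) / 2
            left  = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            gain = base - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
            if gain > best_gain:
                best_gain, best_t = gain, t
        return best_t, best_gain

    salary = [12000, 18000, 25000, 40000, 60000, 90000]   # continuous attribute
    cls    = ["no",  "no",  "no",  "yes", "yes", "yes"]   # invented class labels
    print(best_threshold(salary, cls))                    # (32500.0, 1.0)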

Extended search methods

Lookahead search already analyzes possible subtrees during induction down to a certain depth j and counteracts the horizon effect. At depth d in the tree, the selection measure (e.g. GainRatio) is already computed for all subtrees down to depth d + j. In this way it can be evaluated whether it is worthwhile to expand the tree to depth d + j. The lookahead search is carried out after each induction step. With a high degree of feature interaction, the search yields about 30% more accuracy, but otherwise hardly any advantage. However, its complexity is extremely high, at O(f^d · n^d) for f features, n cases and tree depth d.
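
A sketch of one possible way to score a split with a lookahead of several levels: the function returns the best information gain reachable within a given number of further splits. This is an illustrative interpretation, not the exact procedure of any particular lookahead algorithm. On XOR-like data a one-level search sees no useful split, while a two-level lookahead does.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def lookahead_gain(rows, labels, features, depth):
        """Best information gain reachable within `depth` further splits."""
        if depth == 0 or entropy(labels) == 0 or not features:
            return 0.0
        n, best = len(labels), 0.0
        for f in features:
            parts = {}
            for x, y in zip(rows, labels):
                parts.setdefault(x[f], []).append((x, y))
            remainder = 0.0
            for part in parts.values():
                xs, ys = zip(*part)
                # entropy left in this branch after the best deeper splits
                rest = entropy(ys) - lookahead_gain(xs, ys, [g for g in features if g != f], depth - 1)
                remainder += len(ys) / n * rest
            best = max(best, entropy(labels) - remainder)
        return best

    rows   = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
    labels = [0, 1, 1, 0]
    print(lookahead_gain(rows, labels, ["a", "b"], depth=1))   # 0.0 -> no split looks useful
    print(lookahead_gain(rows, labels, ["a", "b"], depth=2))   # 1.0 -> lookahead reveals its value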

Database restrictions

These simplification approaches are based on two basic ideas: on the one hand, the database can be reduced by eliminating cases (rows); on the other hand, by removing features (columns). The corresponding records are removed before the tree is induced if they are of no use for classifying objects.

Case selection

The best-known form of case selection is the windowing technique, as used in Quinlan's ID3. A tree T' is induced from a randomly selected subset C' of the data (the window). The unused part is used to test the tree T'. Some of the misclassified objects from C \ C' are added to the window, and a tree is induced again. If all objects are classified correctly, the procedure stops and T' becomes T.
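
A sketch of the windowing loop. The induce function stands in for a real tree inducer such as ID3 and is an assumption for illustration: it simply memorizes the window and falls back to the window's majority class for unseen attribute combinations.

    import random
    from collections import Counter

    def induce(window):
        """Placeholder inducer: memorize exact cases, majority class as default."""
        table = {tuple(x): y for x, y in window}
        default = Counter(y for _, y in window).most_common(1)[0][0]
        return lambda x: table.get(tuple(x), default)

    def windowing(data, initial_size=2, seed=0):
        random.seed(seed)
        window = random.sample(data, initial_size)
        while True:
            tree = induce(window)                            # induce T' from the window C'
            wrong = [(x, y) for x, y in data if tree(x) != y]
            if not wrong:                                    # every object classified correctly
                return tree
            window += wrong[:max(1, len(wrong) // 2)]        # add some misclassified cases

    data = [((0, 0), "no"), ((0, 1), "no"), ((1, 0), "yes"), ((1, 1), "yes")]
    final_tree = windowing(data)
    print([final_tree(x) for x, _ in data])   # ['no', 'no', 'yes', 'yes']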

Another algorithm, based on C4.5, is ROBUST-C4.5. A tree T' is induced from all objects, then pruned and tested. All misclassified objects are removed from the initial set and a new tree is induced. These iterations are carried out until every c in C has been correctly assigned. The result is, on average, a 29% smaller tree than with ID3, with slightly higher accuracy.
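
A schematic sketch of this style of iteration, with a placeholder induce_prune_and_test function standing in for a real C4.5 inducer with pruning (here, a trivial majority-class model is used purely for illustration).

    from collections import Counter

    def robust_iteration(cases, induce_prune_and_test):
        """Re-induce on a shrinking case set until the tree fits all remaining cases."""
        current = list(cases)
        while True:
            tree = induce_prune_and_test(current)
            misclassified = [c for c in current if tree(c[0]) != c[1]]
            if not misclassified:              # all remaining cases correctly assigned
                return tree, current
            current = [c for c in current if c not in misclassified]

    def majority_inducer(cases):               # stand-in for C4.5 induction + pruning
        default = Counter(y for _, y in cases).most_common(1)[0][0]
        return lambda x: default

    cases = [((0,), "yes"), ((1,), "yes"), ((2,), "no")]
    tree, kept = robust_iteration(cases, majority_inducer)
    print(kept)   # the misclassified "no" case was removed; the model fits the rest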

Feature selection

Here, too, there are two opposing approaches. The first variant starts with a single attribute, induces a tree and then tests it. The process is repeated (adding one more attribute each time) until the accuracy measured in the test drops again; at that point a semi-optimal tree has been found. The other option is to start with all attributes and remove one attribute in each iteration. As before, the final tree is found when the test results get worse. These procedures yield, on average, up to 50% smaller trees with the same accuracy as ID3.
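
A sketch of the forward variant (greedily adding attributes); the evaluate function is a placeholder for inducing a tree on the chosen attributes and measuring its test accuracy, and the attribute names and accuracy values are invented for illustration.

    def forward_selection(all_features, evaluate):
        """Add one feature at a time until the test accuracy stops improving."""
        chosen, best_acc = [], 0.0
        remaining = list(all_features)
        while remaining:
            # try each candidate feature and keep the best extension
            scored = [(evaluate(chosen + [f]), f) for f in remaining]
            acc, f = max(scored)
            if acc <= best_acc:                # accuracy fell or stagnated: stop
                break
            chosen.append(f)
            remaining.remove(f)
            best_acc = acc
        return chosen, best_acc

    # Invented accuracy model: two informative attributes, one useless one
    useful = {"humidity": 0.2, "rain": 0.15, "day_of_week": 0.0}
    evaluate = lambda feats: 0.5 + sum(useful[f] for f in feats)
    print(forward_selection(useful.keys(), evaluate))   # (['humidity', 'rain'], 0.85)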

Alternative data structures

A common form of simplification is to transfer trees into other modeling concepts, which often allows smaller, faster and more accurate models to be found. One possible structure is the decision graph (see the Wikipedia article on graph theory).

Advantages of graphs over trees

Advantages of graphs over trees are:

  • Each leaf occurs only once (duplicate leaves are merged).
  • No subtree replication
  • Easier to understand

Literature

  • L. A. Breslow and D. W. Aha: Simplifying Decision Trees: A Survey. The Knowledge Engineering Review, Vol. 12(1), 1997, pp. 1-47.
  • J. R. Quinlan: Induction of Decision Trees. Machine Learning 1, Kluwer Academic Publishers, 1986, pp. 81-106.