Pruning

Pruning is the English term for the trimming (cutting back) of trees and shrubs. In computer science, in the field of machine learning, the term refers to the simplification, shortening, and optimization of decision trees.

The idea of pruning originally stems from the attempt to prevent so-called overfitting in trees created by induced learning. Overfitting describes the undesired induction of noise into a tree. Noise refers to incorrect attribute values or class labels that distort data sets and thereby unnecessarily enlarge decision trees. By pruning the tree, the unnecessary subtrees are cut back again.

Pruning in machine learning

Pruning methods can be divided into two types: pre-pruning and post-pruning.

Pre-pruning methods prevent a complete induction of the training set by embedding a stop() criterion in the induction algorithm (e.g. a maximum tree depth, or information gain(Attr) > minGain). Pre-pruning methods are considered more efficient because they never induce an entire tree: the trees remain small from the start. Pre-pruning methods share a common problem, however, the horizon effect. This is the undesired premature termination of the induction by the stop() criterion.
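
A minimal sketch, assuming a toy ID3-style learner in Python, illustrates how such a stop() criterion cuts induction short; the depth and gain thresholds, the row format (attribute values followed by a class label), and all helper names are assumptions of this example, not a fixed API.

```python
import math
from collections import Counter

MAX_DEPTH = 3     # assumed pre-pruning limit on tree depth
MIN_GAIN = 0.01   # assumed threshold: information gain(Attr) > minGain

def entropy(rows):
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def information_gain(rows, attr):
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row)
    return entropy(rows) - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def majority(rows):
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def build_tree(rows, attrs, depth=0):
    # stop() criterion, part 1: pure node, no attributes left, or depth limit.
    if len({row[-1] for row in rows}) == 1 or not attrs or depth >= MAX_DEPTH:
        return majority(rows)                      # leaf instead of a subtree
    gains = {a: information_gain(rows, a) for a in attrs}
    best = max(gains, key=gains.get)
    # stop() criterion, part 2: the best split is not informative enough.
    if gains[best] <= MIN_GAIN:
        return majority(rows)
    node = {"attr": best, "majority": majority(rows), "children": {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        node["children"][value] = build_tree(subset, attrs - {best}, depth + 1)
    return node

# Usage: tree = build_tree(data, set(range(len(data[0]) - 1)))
```

Because the thresholds fire during induction, the subtrees below them are never built in the first place, which is exactly what distinguishes pre-pruning from post-pruning.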

Post-pruning (or just pruning) is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity. Pruning can not only significantly reduce the size of a tree but also improve the classification accuracy on unseen objects: while the accuracy of the assignments on the training set may deteriorate, the classification quality of the tree as a whole increases.

Post-pruning procedures are further distinguished by the direction in which they work through the tree (top-down or bottom-up).

Bottom-up pruning

These procedures start at the last nodes of the tree (its lowest level). Proceeding recursively upwards, they determine the relevance of each individual node. If a node is not relevant for the classification, it is dropped or replaced by a leaf. The advantage is that no relevant subtrees can be lost with this method. Such methods include Reduced Error Pruning (REP), Minimum Cost-Complexity Pruning (MCCP), and Minimum Error Pruning (MEP).
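
The following Python sketch shows the reduced-error idea as a simplified single bottom-up pass over trees of the shape built in the pre-pruning example; the held-out validation set, the majority labels stored at inner nodes, and the single-pass strategy are simplifying assumptions, not the full REP procedure.

```python
def classify(node, row):
    while isinstance(node, dict):                            # inner node
        node = node["children"].get(row[node["attr"]], node["majority"])
    return node                                              # leaf = class label

def errors(node, rows):
    return sum(classify(node, row) != row[-1] for row in rows)

def reduced_error_prune(node, validation):
    """Bottom-up: prune the children first, then test the node itself."""
    if not isinstance(node, dict):
        return node                                          # already a leaf
    for value, child in node["children"].items():
        subset = [row for row in validation if row[node["attr"]] == value]
        node["children"][value] = reduced_error_prune(child, subset)
    # Relevance test: replace the subtree by its majority-class leaf if that
    # does not increase the error on the validation set.
    if errors(node["majority"], validation) <= errors(node, validation):
        return node["majority"]
    return node
```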

Top-down pruning

In contrast to the bottom-up method, this method starts at the root of the tree. Working downwards along the structure, it carries out a relevance check that decides whether a node is relevant for the classification of all n items. Because the tree is cut at an inner node, an entire subtree (regardless of its relevance) may be dropped. One representative of this class is Pessimistic Error Pruning (PEP), which yields quite good results on unseen items.
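
For contrast, a sketch of the top-down traversal order, reusing the validation-based relevance test from the bottom-up example rather than PEP's pessimistic error estimate (PEP itself works on the training set with a statistical correction); only the direction of the check is the point here.

```python
def top_down_prune(node, validation):
    if not isinstance(node, dict):
        return node                                          # already a leaf
    # The node itself is tested BEFORE its children: if it is judged
    # irrelevant, the entire subtree below it is dropped at once,
    # regardless of the relevance of its parts.
    if errors(node["majority"], validation) <= errors(node, validation):
        return node["majority"]
    for value, child in node["children"].items():
        subset = [row for row in validation if row[node["attr"]] == value]
        node["children"][value] = top_down_prune(child, subset)
    return node
```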

Search procedure

In search procedures, various pruning methods are used to cut off subtrees of the search tree when the algorithm knows from the data already gathered (or, in the case of speculative pruning, merely assumes) that these subtrees do not contain the object being searched for; chess programs, for example, make heavy use of this.

Important pruning techniques for minimax and alpha-beta searches, which can be used to solve two-person zero-sum games with perfect information (such as chess), include alpha-beta pruning itself as well as forward-pruning heuristics such as null-move pruning.
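
A standard sketch of alpha-beta pruning in Python; the game-state interface (children(), is_terminal(), value()) is a hypothetical stand-in for an actual game implementation.

```python
def alphabeta(state, depth, alpha, beta, maximizing):
    if depth == 0 or state.is_terminal():
        return state.value()                 # static evaluation of the position
    if maximizing:
        best = float("-inf")
        for child in state.children():
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, best)
            if beta <= alpha:
                break                        # cutoff: prune the remaining children
        return best
    best = float("inf")
    for child in state.children():
        best = min(best, alphabeta(child, depth - 1, alpha, beta, True))
        beta = min(beta, best)
        if beta <= alpha:
            break                            # cutoff: prune the remaining children
    return best

# Usage: alphabeta(root, depth, float("-inf"), float("inf"), True)
```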

Pruning is also used in branch-and-bound algorithms in mathematical optimization. A subtree of the search tree is not explored if the bound on the best possible solution within that subtree is worse than an already known solution.
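
A minimal branch-and-bound sketch in Python, using the 0/1 knapsack problem as a common illustration; the fractional upper bound and the item data are assumptions of this example.

```python
def knapsack(items, capacity):
    # items: list of (value, weight), sorted by value density for the bound
    items = sorted(items, key=lambda it: it[0] / it[1], reverse=True)
    best = 0

    def bound(i, value, room):
        # Optimistic bound: fill the remaining room fractionally.
        for v, w in items[i:]:
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w
        return value

    def search(i, value, room):
        nonlocal best
        best = max(best, value)
        if i == len(items):
            return
        if bound(i, value, room) <= best:
            return                               # prune: subtree cannot improve on best
        v, w = items[i]
        if w <= room:
            search(i + 1, value + v, room - w)   # take item i
        search(i + 1, value, room)               # skip item i

    search(0, 0, capacity)
    return best

print(knapsack([(60, 10), (100, 20), (120, 30)], 50))  # -> 220
```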

Other areas

In forum software, pruning is a setting that causes old topics to be deleted automatically in order to save storage space, reduce CPU load, and thereby increase the speed of the forum.
