C4.5


C4.5 is a machine learning algorithm used to generate a decision tree from training data, with which records can then be classified. It was developed by Ross Quinlan as an extension of the ID3 algorithm.

Alongside the well-known CART and CHAID algorithms, C4.5 has become increasingly important. It is already used by various software packages.

In principle, the algorithm behaves similarly to the CART algorithm. The main difference is that C4.5 is not restricted to binary splits: a node may branch into any number of subtrees. The tree therefore becomes wider and is usually less deep than the corresponding CART tree. On the other hand, after the first split the subsequent splits tend to be less significant.

Another difference concerns the so-called pruning of the tree. CART creates several candidate subtrees and tests them with new, previously unclassified data. C4.5, on the other hand, prunes the tree without taking the given database into account.

Algorithm

C4.5 generates a decision tree from training data, with which future instances that were not part of the training data can be classified. As in the ID3 algorithm, the entropy is calculated to determine the order of the decision nodes, i.e. their distance from the root node, within the decision tree to be generated.
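Using the standard definitions: for a set S of examples in which the classes occur with relative frequencies p_1, …, p_k, the entropy is H(S) = -(p_1 log2 p_1 + … + p_k log2 p_k), and the information gain of a feature A is Gain(S, A) = H(S) - Σ_v (|S_v| / |S|) · H(S_v), where S_v is the subset of the examples in S that take the value v for A.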

Procedure

The training data form a set S = {s_1, s_2, …} consisting of the known training examples. Each training example s_i is a (p + 1)-dimensional vector made up of the p feature values of the instance and the target classification to be learned: s_i = (x_i1, x_i2, …, x_ip, y_i). When creating the top node of the decision tree, C4.5 looks for the first decision feature. To determine it, C4.5 compares, for each feature, how effectively the training data set would be split on that feature and chooses the best one. The criterion is the highest information gain that can be achieved by the feature (Kullback–Leibler divergence). The training data are then divided into subsets according to their values of the selected feature, and for each subset a branch is created below the root node. The algorithm continues recursively: the same procedure is applied again to each branch, restricted to the subset of the training data assigned to that branch. If no information can be gained on a branch by further subdividing its training data, a leaf with the remaining target classification is created on that branch. The maximum possible depth of the tree is thus p, since each feature is used at most once along any path from the root to a leaf.
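A minimal sketch of this recursive procedure for purely discrete features is shown below. It is an illustration under simplifying assumptions, not Quinlan's implementation: the names build_tree and information_gain are invented here, and refinements such as the gain ratio, continuous attributes, missing values and pruning are omitted.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    def information_gain(examples, feature, target):
        """Entropy of the target minus the weighted entropy after splitting on the feature."""
        base = entropy([ex[target] for ex in examples])
        remainder = 0.0
        for value in {ex[feature] for ex in examples}:
            subset = [ex[target] for ex in examples if ex[feature] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return base - remainder

    def build_tree(examples, features, target):
        """Recursively build a decision tree: inner nodes are dicts, leaves are class labels."""
        labels = [ex[target] for ex in examples]
        # Create a leaf when the subset is pure or no features are left to split on.
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        # Pick the feature with the highest information gain as the next decision node.
        best = max(features, key=lambda f: information_gain(examples, f, target))
        node = {best: {}}
        remaining = [f for f in features if f != best]
        for value in {ex[best] for ex in examples}:
            subset = [ex for ex in examples if ex[best] == value]
            node[best][value] = build_tree(subset, remaining, target)
        return node

    # Example call, with each training record represented as a dict of feature values:
    # tree = build_tree(records, features=["weather", "wind"], target="sailing")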

Example

Sarah goes sailing on some days and not on others. Whether she goes sailing on a given day depends primarily on the following features: weather condition (sunny / overcast / rainy), temperature, humidity, and wind force (light / strong). On 14 randomly chosen days, these data were recorded together with the information whether Sarah went sailing:

[Figure: decision tree generated by C4.5 for the example "Sarah goes sailing"]
Weather condition   Temperature in °C   Humidity in %   Wind force   Sarah goes sailing
Sunny               29                  85              Light        False
Sunny               27                  90              Strong       False
Overcast            28                  78              Light        True
Rainy               21                  96              Light        True
Rainy               20                  80              Light        True
Rainy               18                  70              Strong       False
Overcast            17                  65              Strong       True
Sunny               22                  95              Light        False
Sunny               21                  70              Light        True
Rainy               24                  80              Light        True
Sunny               24                  70              Strong       True
Overcast            22                  90              Strong       True
Overcast            27                  75              Light        True
Rainy               21                  80              Strong       False

Machine learning is to be used to uncover the relationship between the four day-related features and whether Sarah goes sailing. Once this relationship has been determined, it can be used to predict, for any other day with known weather data, whether Sarah will go sailing. From the training data given in the table, C4.5 generates the decision tree shown in the figure. The numbers in brackets indicate the number of training records that correspond to the respective path.
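To illustrate how the root node is chosen (the following figures are computed from the table above, not quoted from the article): the 14 records contain 9 positive and 5 negative examples, so the entropy of the whole set is -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits. Splitting on the weather condition yields 5 sunny days (2 True, 3 False), 4 overcast days (4 True) and 5 rainy days (3 True, 2 False), so its information gain is about 0.940 - (5/14)·0.971 - (4/14)·0 - (5/14)·0.971 ≈ 0.247 bits. No other feature achieves a comparable gain on this data (wind force, for example, gives only about 0.048 bits), so the weather condition becomes the root node.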

Improvements over ID3

As an extension of the ID3 algorithm, C4.5 offers some improvements:

  • Applicability to both discrete and continuous attributes - if the records contain, for example, a real-valued variable as one of the features, its values are grouped into discrete intervals by choosing threshold values (a sketch of this follows the list).
  • Applicability to training data with missing attribute values - With C4.5, unavailable feature values can be marked as unknown (?). Unknown values are simply ignored when the information gain is calculated.
  • Possible cost weighting of the attributes - Features that are laborious to determine can be assigned a higher cost weighting. Features with high cost weightings tend to be placed further down in the decision tree, so that they have to be determined for fewer classifications. For example, when C4.5 is used to diagnose diseases from a database of symptoms and medical examination values, it could be adapted so that cost-intensive examinations are avoided where possible and are only used in cases of doubt between several possible diagnoses.
  • Pruning of the decision tree after it has been created - To reduce the number of possible result classes of the classification, C4.5 cuts off long branches. This also prevents overfitting to the training data.
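As an illustration of the first point above, the following sketch chooses a split threshold for the continuous humidity feature from the example table by trying candidate thresholds and keeping the one with the highest information gain. It is only a rough sketch under simplifying assumptions: the candidate thresholds (midpoints between adjacent observed values) and all names are chosen for the illustration and do not reproduce Quinlan's exact procedure.

    from collections import Counter
    from math import log2

    # Humidity values and the sailing outcome of the 14 records in the example table
    humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
    sailing = [False, False, True, True, True, False, True,
               False, True, True, True, True, True, False]

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    def gain(threshold):
        """Information gain of splitting into 'humidity <= threshold' and 'humidity > threshold'."""
        low = [y for x, y in zip(humidity, sailing) if x <= threshold]
        high = [y for x, y in zip(humidity, sailing) if x > threshold]
        weighted = (len(low) * entropy(low) + len(high) * entropy(high)) / len(sailing)
        return entropy(sailing) - weighted

    # Candidate thresholds: midpoints between adjacent distinct observed values
    values = sorted(set(humidity))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = max(candidates, key=gain)
    print(best, round(gain(best), 3))  # 82.5 and a gain of roughly 0.10 bits for this data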

