C4.5 is an algorithm of machine learning is used to from training data a decision tree to generate, with the records classified can be. It was developed by Ross Quinlan as an extension of the ID3 algorithm.
Basically, this algorithm behaves similarly to the CART algorithm. The main difference is that with C4.5 no binary division has to be made, but any number of branches can be incorporated. The tree is getting wider. It is usually less deep than the corresponding CART tree. Instead, after the first classification, the subsequent splits become less significant.
Another difference can be seen in so-called pruning , when pruning the tree. CART creates some subtrees and tests them with new, previously unclassified data. C4.5, on the other hand, cuts the tree without considering the given database.
C4.5 generates a decision tree from training data with which future instances that were not included in the training data can be classified. Similar to the ID3 algorithm, the calculation of the entropy is used to determine the sequence of the decision nodes in their distance from the root node within the decision tree to be generated.
The training data is a lot , consisting of the known training examples. Each training example is that amount p + 1-dimensional vector of the to-learn characteristics (instance) and the to-learn target classification : . When creating the top node in the decision tree, C4.5 looks for the first decision feature. To determine this, C4.5 compares for each feature the efficiency with which the training data set would be divided among this feature and decides on the best one. The criterion is the highest gain in information that can be achieved by the feature ( Kullback-Leibler divergence ). The training data are then divided into subsets according to their values of the selected feature, for each of which a branch is created below the root node. The algorithm is continued recursively in that the previous sequence is applied again for each of these branches while restricting the subset of the training data assigned to this branch. If it is not possible to gain information on a branch by further subdividing the training data, a sheet with the remaining target classification is created on this branch. The highest possible maximum degree of the tree is thus .
Sarah goes sailing some days, not other days. Whether she goes sailing on a certain day depends primarily on the following characteristics: weather conditions (sunny / overcast / rainy); Temperature; Humidity; Wind strength (light / strong). On 14 random days, this data was recorded together with the information whether Sarah is going sailing:
|weather condition||Temperature in ° C||Humidity in%||Wind force||Sarah goes sailing|
Machine learning is to be used to uncover the connection between the four day-related characteristics and the statement whether Sarah goes sailing. Once you have determined this relationship, you can determine whether Sarah is going sailing for any other day if you know the weather data. C4.5 generates the decision tree shown from the training data given in the table. The numbers in brackets indicate the number of training records that correspond to this path.
Improvements over ID3
As an extension of the ID3 algorithm, C4.5 offers some improvements:
- Applicability to both discrete and continuous attributes - if the data records contain, for example, a real variable as one of the characteristics, the characteristic values are classified in discrete intervals.
- Applicability to training data with missing attribute values - Using C4.5, unavailable characteristic values can be marked as unknown (?). Unknown values are simply ignored when calculating the information gain.
- Possible cost weighting of the attributes - Characteristics that are laborious to determine can be assigned a higher cost weighting. Features with high cost weightings tend to be arranged further down in the decision tree as branches, so that this feature has to be determined for fewer classifications. For example, the use of C4.5 for diagnosing diseases could be adapted using the database of symptoms and medical examination values in such a way that cost-intensive examinations are more likely to be avoided and, if possible, are only used in cases of doubt between several possible diagnoses.
- Pruning of the decision tree after it has been created - In order to reduce the number of possible result classes of the classification, C4.5 cuts off long branches. This also prevents overfitting to the training data.
- Tom M. Mitchell: Machine Learning, 1997
- Quinlan, JR C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.