C4.5


C4.5 is a machine learning algorithm used to generate a decision tree from training data, with which records can then be classified. It was developed by Ross Quinlan as an extension of the ID3 algorithm.

Alongside the well-known CART and CHAID algorithms, C4.5 has become increasingly important. It is already used by various software packages.

In principle, the algorithm behaves similarly to the CART algorithm. The main difference is that C4.5 is not restricted to binary splits: a node may branch into any number of subtrees. The tree therefore becomes wider and is usually less deep than the corresponding CART tree. On the other hand, after the first split the subsequent splits tend to be less significant.

Another difference concerns the so-called pruning of the tree. CART creates several candidate subtrees and tests them with new, previously unclassified data. C4.5, on the other hand, prunes the tree without taking the given database into account.

Algorithm

C4.5 generates a decision tree from training data, with which future instances that were not part of the training data can be classified. As in the ID3 algorithm, the entropy is calculated to determine the order of the decision nodes, i.e. their distance from the root node, within the decision tree to be generated.
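Using the standard definitions: for a set S of examples in which the classes occur with relative frequencies p_1, …, p_k, the entropy is H(S) = -(p_1 log2 p_1 + … + p_k log2 p_k), and the information gain of a feature A is Gain(S, A) = H(S) - Σ_v (|S_v| / |S|) · H(S_v), where S_v is the subset of the examples in S that take the value v for A.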

Procedure

The training data form a set S = {s_1, s_2, …} consisting of the known training examples. Each training example s_i is a (p + 1)-dimensional vector made up of the p feature values of the instance and the target classification to be learned: s_i = (x_i1, x_i2, …, x_ip, y_i). When creating the top node of the decision tree, C4.5 looks for the first decision feature. To determine it, C4.5 compares, for each feature, how effectively the training data set would be split on that feature and chooses the best one. The criterion is the highest information gain that can be achieved by the feature (Kullback–Leibler divergence). The training data are then divided into subsets according to their values of the selected feature, and for each subset a branch is created below the root node. The algorithm continues recursively: the same procedure is applied again to each branch, restricted to the subset of the training data assigned to that branch. If no information can be gained on a branch by further subdividing its training data, a leaf with the remaining target classification is created on that branch. The maximum possible depth of the tree is thus p, since each feature is used at most once along any path from the root to a leaf.
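A minimal sketch of this recursive procedure for purely discrete features is shown below. It is an illustration under simplifying assumptions, not Quinlan's implementation: the names build_tree and information_gain are invented here, and refinements such as the gain ratio, continuous attributes, missing values and pruning are omitted.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    def information_gain(examples, feature, target):
        """Entropy of the target minus the weighted entropy after splitting on the feature."""
        base = entropy([ex[target] for ex in examples])
        remainder = 0.0
        for value in {ex[feature] for ex in examples}:
            subset = [ex[target] for ex in examples if ex[feature] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return base - remainder

    def build_tree(examples, features, target):
        """Recursively build a decision tree: inner nodes are dicts, leaves are class labels."""
        labels = [ex[target] for ex in examples]
        # Create a leaf when the subset is pure or no features are left to split on.
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        # Pick the feature with the highest information gain as the next decision node.
        best = max(features, key=lambda f: information_gain(examples, f, target))
        node = {best: {}}
        remaining = [f for f in features if f != best]
        for value in {ex[best] for ex in examples}:
            subset = [ex for ex in examples if ex[best] == value]
            node[best][value] = build_tree(subset, remaining, target)
        return node

    # Example call, with each training record represented as a dict of feature values:
    # tree = build_tree(records, features=["weather", "wind"], target="sailing")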

Example

Sarah goes sailing on some days and not on others. Whether she goes sailing on a given day depends primarily on the following features: weather condition (sunny / overcast / rainy), temperature, humidity, and wind force (light / strong). On 14 randomly chosen days, these data were recorded together with the information whether Sarah went sailing:

[Figure: decision tree generated by C4.5 for the example "Sarah goes sailing"]
Weather condition   Temperature in °C   Humidity in %   Wind force   Sarah goes sailing
Sunny               29                  85              Light        False
Sunny               27                  90              Strong       False
Overcast            28                  78              Light        True
Rainy               21                  96              Light        True
Rainy               20                  80              Light        True
Rainy               18                  70              Strong       False
Overcast            17                  65              Strong       True
Sunny               22                  95              Light        False
Sunny               21                  70              Light        True
Rainy               24                  80              Light        True
Sunny               24                  70              Strong       True
Overcast            22                  90              Strong       True
Overcast            27                  75              Light        True
Rainy               21                  80              Strong       False

Machine learning is to be used to uncover the relationship between the four day-related features and whether Sarah goes sailing. Once this relationship has been determined, it can be used to predict, for any other day with known weather data, whether Sarah will go sailing. From the training data given in the table, C4.5 generates the decision tree shown in the figure. The numbers in brackets indicate the number of training records that correspond to the respective path.
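To illustrate how the root node is chosen (the following figures are computed from the table above, not quoted from the article): the 14 records contain 9 positive and 5 negative examples, so the entropy of the whole set is -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits. Splitting on the weather condition yields 5 sunny days (2 True, 3 False), 4 overcast days (4 True) and 5 rainy days (3 True, 2 False), so its information gain is about 0.940 - (5/14)·0.971 - (4/14)·0 - (5/14)·0.971 ≈ 0.247 bits. No other feature achieves a comparable gain on this data (wind force, for example, gives only about 0.048 bits), so the weather condition becomes the root node.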

Improvements over ID3

As an extension of the ID3 algorithm, C4.5 offers some improvements:

  • Applicability to both discrete and continuous attributes - if the records contain, for example, a real-valued variable as one of the features, its values are grouped into discrete intervals by choosing threshold values (a sketch of this follows the list).
  • Applicability to training data with missing attribute values - With C4.5, unavailable feature values can be marked as unknown (?). Unknown values are simply ignored when the information gain is calculated.
  • Possible cost weighting of the attributes - Features that are laborious to determine can be assigned a higher cost weighting. Features with high cost weightings tend to be placed further down in the decision tree, so that they have to be determined for fewer classifications. For example, when C4.5 is used to diagnose diseases from a database of symptoms and medical examination values, it could be adapted so that cost-intensive examinations are avoided where possible and are only used in cases of doubt between several possible diagnoses.
  • Pruning of the decision tree after it has been created - To reduce the number of possible result classes of the classification, C4.5 cuts off long branches. This also prevents overfitting to the training data.
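As an illustration of the first point above, the following sketch chooses a split threshold for the continuous humidity feature from the example table by trying candidate thresholds and keeping the one with the highest information gain. It is only a rough sketch under simplifying assumptions: the candidate thresholds (midpoints between adjacent observed values) and all names are chosen for the illustration and do not reproduce Quinlan's exact procedure.

    from collections import Counter
    from math import log2

    # Humidity values and the sailing outcome of the 14 records in the example table
    humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
    sailing = [False, False, True, True, True, False, True,
               False, True, True, True, True, True, False]

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        total = len(labels)
        return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

    def gain(threshold):
        """Information gain of splitting into 'humidity <= threshold' and 'humidity > threshold'."""
        low = [y for x, y in zip(humidity, sailing) if x <= threshold]
        high = [y for x, y in zip(humidity, sailing) if x > threshold]
        weighted = (len(low) * entropy(low) + len(high) * entropy(high)) / len(sailing)
        return entropy(sailing) - weighted

    # Candidate thresholds: midpoints between adjacent distinct observed values
    values = sorted(set(humidity))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = max(candidates, key=gain)
    print(best, round(gain(best), 3))  # 82.5 and a gain of roughly 0.10 bits for this data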

