# CART (algorithm)

CART ( C lassification a nd R egression T rees ) is an algorithm that for decision-making is used. It is used in decision trees .

The CART algorithm was first used in 1984 by Leo Breiman et al. published.

## General

An important feature of the CART algorithm is that only binary trees can be generated, which means that there are always exactly two branches at each branch. So the central element of this algorithm is finding an optimal binary separation.

With the CART algorithm, the selection of attributes is controlled by maximizing the information content. CARTs are characterized by the fact that they optimally separate the data in terms of classification . This is achieved with a threshold value that is searched for for each attribute. The information content of an attribute is considered to be high if a classification can be made with a high hit rate by evaluating the attribute characteristics resulting from the division via the threshold values. For the decision trees, which are calculated by the CART algorithm, the following applies: The higher the information content of an attribute in relation to the target variable, the higher up in the tree this attribute can be found.

The decision thresholds result from the optimization of the column entropy . The total entropies of the attributes result from a weighted mean of the column entropies.

## Mathematical description

Let be the amount of training data with input variables and output values . A decision tree can be formally represented by means of a function that assigns a prediction of the output value to each input. The CART algorithm for generating such a tree independently finds the branches ( nodes ) and associated rules for separating the data (enS split rule ), with which this assignment is as optimal as possible. ${\ displaystyle {\ mathcal {D}} = \ {(x ^ {(i)}, y_ {i}) \} _ {i = 1} ^ {m}}$${\ displaystyle x ^ {(i)} = (x_ {1} ^ {(i)}, \ dots, x_ {n} ^ {(i)}) \ in {\ mathcal {X}}}$${\ displaystyle y \ in {\ mathcal {Y}}}$${\ displaystyle T}$${\ displaystyle f_ {T}: {\ mathcal {X}} \ rightarrow {\ mathcal {Y}}}$

### Regression

First of all , d. H. the output is a quantitative characteristic and the tree to be constructed is intended to solve a regression problem. In order to find an optimal tree, a training criterion ( error function ) must first be defined. The mean square deviation is typically used for this : ${\ displaystyle {\ mathcal {Y}} \ subseteq \ mathbb {R}}$

${\ displaystyle E (T, {\ mathcal {D}}) = {\ frac {1} {m}} \ sum _ {i = 1} ^ {m} (f_ {T} (x ^ {(i) }) - y_ {i}) ^ {2}}$,

which is then minimized. The error function can, however, generally be freely selected.

A set is assigned to each leaf , so that the assigned disjoint sets form a partition of for all leaves . You are now looking for an estimated value that is as close as possible to the true values for all . The appraiser ${\ displaystyle l}$${\ displaystyle R_ {l}}$${\ displaystyle L}$${\ displaystyle D}$${\ displaystyle x ^ {(i)} \ in R_ {l}}$${\ displaystyle \ {y_ {i} \; | \; x ^ {(i)} \ in R_ {l} \}}$

${\ displaystyle f_ {T} (x ^ {(i)} \ in R_ {l}) = {\ textrm {mean}} (y_ {i} \; | \; x ^ {(i)} \ in R_ {l}) =: {\ hat {c}} _ {l}}$

provides a solution for this. Since the computation of all possible trees cannot be implemented in an efficient way, a greedy algorithm is best suited for this approach . Specifically: You start with a tree that consists of only one node and then successively find locally optimal branches. At each node one determines the feature that can best subdivide all entries of the parent node into two regions, whereby an optimal division rule must always be found. For ordinal features, this is done by means of a barrier that generates the two regions and for all in the original partition of the parent node in such a way that the error function is minimized. If the characteristics are not ordinal, the branching rules are based on the assignment to the various characteristics. Formally, this can be written as ${\ displaystyle j}$${\ displaystyle s}$${\ displaystyle R_ {1} (j, s) = \ {x ^ {(i)} \; | \; x_ {j} ^ {(i)} \ leq s \}}$${\ displaystyle R_ {2} (j, s) = \ {x ^ {(i)} \; | \; x_ {j} ^ {(i)}> s \}}$${\ displaystyle x ^ {(i)}}$

${\ displaystyle \ min _ {j, s} {\ big (} \ min _ {c_ {1}} \ sum _ {x ^ {(i)} \ in R_ {1} (j, s)} (y_ {i} -c_ {1}) ^ {2} + \ min _ {c_ {2}} \ sum _ {x ^ {(i)} \ in R_ {2} (j, s)} (y_ {i } -c_ {2}) ^ {2} {\ big)}}$,

whereby the two sums are minimized. ${\ displaystyle {\ hat {c}} _ {k} = {\ textrm {mean}} (y_ {i} \; | \; x ^ {(i)} \ in R_ {k} (j, s) ) _ {k = 1,2}}$

Starting from the individual node, two new nodes are added in each step, which in turn are branched further until a termination condition (e.g. the maximum path length from the root to the leaves) is met.

#### Pruning

Since the tree so in most cases too complex, so prone to overfitting ( English overfitting ), may (should), he will be trimmed ( English pruning ). Overfitting can be prevented by introducing a regulation term (see English regularization ) in the error function that does not depend on the data and penalizes the complexity of the decision tree. This prevents the tree from learning specific properties of the training data, which in general (i.e. for other data from the same population) do not contribute to the true predictions.

The second option, which is described below, is to first construct a relatively large tree , which is then pruned afterwards. Let be a real subgraph that can be created by removing inner nodes (i.e. the partitions of the children of this node are merged). Let be the number of leaves of such a subgraph, where each leaf is assigned the partition with elements. Be like above and ${\ displaystyle T_ {0}}$${\ displaystyle T \ subset T_ {0}}$${\ displaystyle | T |}$${\ displaystyle l}$${\ displaystyle R_ {l}}$${\ displaystyle | R_ {l} |}$${\ displaystyle {\ hat {c}} _ {l}}$

${\ displaystyle Q_ {l} (T, {\ mathcal {D_ {t}}}) = {\ frac {1} {| R_ {l} |}} \ sum _ {x ^ {(i)} \ in R_ {l}} (y_ {i} - {\ hat {c}} _ {l}) ^ {2}}$.

The idea is to find a tree that will do the job

${\ displaystyle C _ {\ alpha} (T, {\ mathcal {D_ {t}}}) = \ sum _ {l = 1} ^ {| T |} | R_ {l} | Q_ {l} (T, {\ mathcal {D_ {t}}}) + \ alpha | T |}$

minimized for a given . For this purpose, a different amount of data ( English test set ) is used to prevent overfitting (see cross-validation procedure ). ${\ displaystyle \ alpha}$${\ displaystyle {\ mathcal {D}}}$${\ displaystyle {\ mathcal {D_ {t}}}}$

The freely selectable parameter describes the weighting between the quality of the prediction of the decision tree and its complexity. For a given , a descending sequence of subtrees is created by removing, at each step, the inner node that produces the smallest per-node increase of , until there is only a single node left. There is then a clearly determinable smallest subtree that minimizes. ${\ displaystyle \ alpha}$${\ displaystyle \ alpha}$${\ displaystyle \ sum \ nolimits _ {l} | R_ {l} | Q_ {l} (T, {\ mathcal {D_ {t}}})}$${\ displaystyle C _ {\ alpha} (T, {\ mathcal {D_ {t}}})}$

### Classification

Now let the output values ​​be categorical, i.e. H. is a finite set and o. B. d. A. . The only changes to the algorithm compared to regression concern the error functions for the construction of the tree and the pruning. ${\ displaystyle {\ mathcal {Y}}}$ ${\ displaystyle y_ {i} \ in \ {1,2, \ dots, K \}}$

We define

${\ displaystyle {\ hat {p}} _ {lk}: = {\ frac {1} {| R_ {l} |}} \ sum _ {x_ {i} \ in R_ {l}} I (y_ { i} = k)}$,

where is equal to the number of elements in and the indicator function . ${\ displaystyle | R_ {l} |}$${\ displaystyle R_ {l}}$${\ displaystyle I (\ cdot)}$

This means that the entries can now be classified according to majority decision in each region : ${\ displaystyle R_ {l}}$

${\ displaystyle f_ {T} (x ^ {(i)} \ in R_ {l}) = {\ textrm {argmax}} _ {k} \, {\ hat {p}} _ {lk}}$.

Possible error functions indicate the so-called impurity of the partitions and can be defined as:

${\ displaystyle {\ frac {1} {| R_ {l} |}} \ sum _ {x_ {i} \ in R_ {l}} I (y_ {i} \ neq k) = 1 - {\ hat { p}} _ {lk} \ quad}$ (Misclassification error)
${\ displaystyle \ sum _ {k = 1} ^ {K} {\ hat {p}} _ {lk} (1 - {\ hat {p}} _ {lk}) \ quad}$ (Gini Index)
${\ displaystyle - \ sum _ {k = 1} ^ {K} {\ hat {p}} _ {lk} \ log {\ hat {p}} _ {lk} \ quad}$( Cross entropy )

Each of these functions can be used in the construction of a decision tree instead of the mean square deviation, so that the essential part of the algorithm remains unchanged.