Random Forest
A random forest is a classification method consisting of several uncorrelated decision trees. Each tree is grown with some form of randomization during the learning process. To classify an input, every tree in the forest casts a vote, and the class with the most votes is the final classification. Random forests can also be used for regression.
The term random forest was coined by Leo Breiman in 1999. He investigated various ways of randomizing decision trees, for example using bagging or boosting. His work was preceded by Tin Kam Ho's research from 1995.
Properties
A random forest has several advantages over other classification methods such as support vector machines (SVMs).
- The classifier trains very quickly: this follows from the short training (construction) time of a single decision tree and from the fact that the training time of a random forest grows linearly with the number of trees.
- A test example is evaluated on each tree individually, so evaluation can be parallelized and is therefore fast.
- It is very efficient for large amounts of data (many classes, many training examples, many features).
- Important classes can be recognized.
- The relationship between classes can be recognized.
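The speed and parallelism claims above can be tried out with scikit-learn, which is listed under Software below. This is a minimal sketch; the synthetic dataset and parameter values are illustrative, not from the original text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators controls the number of trees; training time grows roughly
# linearly with it. n_jobs=-1 parallelizes training and evaluation over
# all CPU cores, since each tree is fit and evaluated independently.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```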
How it works
There are many variants of and approaches to training and classifying with a random forest, differing among other things in which decision trees are used and whether a maximum tree depth is specified. According to Breiman, the following algorithm should be used to grow each decision tree in the forest:
- Bootstrap samples are drawn from the training data (sampling with replacement).
- At each node of the tree, a random subset of the features (dimensions) of the training data is selected as candidate criteria for the split. The subsequent choice of one feature from this subset can be made, for example, by minimizing the entropy.
- The tree is grown to full depth and is not pruned.
To classify an input, it is evaluated by every tree. The class chosen most often is the output of the random forest. (If several classes are tied for the most votes, the tie must be broken by some other rule.)
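Breiman's scheme can be sketched from scratch on top of scikit-learn's `DecisionTreeClassifier`, whose `max_features` option performs the random feature selection at each node. The dataset, tree count, and variable names are illustrative assumptions, not part of the original description.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

trees = []
for _ in range(25):
    # Step 1: draw a bootstrap sample (with replacement) of the training data.
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-3: consider only a random feature subset at each node
    # (max_features="sqrt") and grow the tree to full depth (no pruning).
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Classification: every tree votes; the most frequent class is the output.
votes = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())   # accuracy of the ensemble on the training set
```

Note that `bincount(...).argmax()` silently breaks ties in favor of the smallest class label, one possible choice of tie-breaking rule.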
Bosch et al. additionally store in each leaf the posterior probabilities of the classes encountered in that leaf during training. These probabilities are then taken into account during classification, which reduced the error rate in their application.
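Probability-based voting in this spirit is what scikit-learn's `predict_proba` exposes: each tree contributes the class distribution of the leaf the example falls into, and the forest averages these distributions. A short sketch, with an illustrative dataset (this is scikit-learn's behavior, not necessarily the exact scheme of Bosch et al.):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average of the per-leaf class distributions over all trees;
# each row is a probability distribution over the classes.
proba = clf.predict_proba(X[:3])
print(proba)                  # rows sum to 1
print(proba.argmax(axis=1))   # class with the highest averaged probability
```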
Software
- Leo Breiman and Adele Cutler: An Implementation of a Random Forest in FORTRAN 77
- Random Forests from Salford Systems
- randomForest and ranger in R
- Oblique random forests with multivariate decision trees in R
- Random Jungle is a fast implementation for high-dimensional data (C++, parallel computing, low memory consumption, Linux + Windows)
- Waikato Environment for Knowledge Analysis: the free software package Weka from the University of Waikato provides a random forest implementation in addition to many other classifiers and regression methods.
- Vigra: Vigra offers a multitude of image processing algorithms as well as an efficient C++ implementation of the random forest classifier and various regression methods (including LARS).
- MLlib: the Spark "Machine Learning Library" (MLlib) offers, among other things, a random forest implementation for Spark clusters, available under the Apache license
- SPSS Random Forest : Example of Random Forest in IBM SPSS
- Random Forests in Scikit-learn (Python)
External links
- Random Forests , homepage of Leo Breiman and Adele Cutler
References
- Breiman L.: Random forests. In: Machine Learning, 2001, 45(1), pp. 5–32, doi:10.1023/A:1010933404324
- Tin Kam Ho: Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, August 14–18, 1995, pp. 278–282, doi:10.1109/ICDAR.1995.598994
- Andy Liaw, Matthew Wiener (2002): Classification and Regression by randomForest. R News 2/3: 18–22.
- Anna Bosch, Andrew Zisserman, Xavier Muñoz: Image classification using random forests and ferns. ICCV 2007, IEEE 11th International Conference on Computer Vision, pp. 1–8, doi:10.1109/ICCV.2007.4409066