Hyperparameter optimization
In the field of machine learning, hyperparameter optimization refers to the search for optimal hyperparameters. A hyperparameter is a parameter that controls the training algorithm and whose value, unlike that of other parameters, must be set before the actual training of the model.
Approaches
Grid search
Grid search is the traditional way of searching for optimal hyperparameters: an exhaustive search is carried out over a manually specified subset of the hyperparameter space of the learning algorithm. Grid search must be guided by a performance metric, which is typically computed by cross-validation on the training data or on validation data not seen during training.
For real-valued or unbounded spaces of individual hyperparameters, a discretization and a restriction to a bounded part of the space must be specified before the grid search.
One example is the hyperparameter optimization of an SVM classifier with an RBF kernel. It has two hyperparameters that must be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter γ. Both parameters are continuous, so a finite set of reasonable values must be chosen for each hyperparameter before the grid search can be performed; a common choice is a logarithmically spaced grid, for example C ∈ {2⁻⁵, 2⁻³, …, 2¹⁵} and γ ∈ {2⁻¹⁵, 2⁻¹³, …, 2³}.
During the grid search, a support vector machine is then trained for each pair (C, γ) in the Cartesian product of these two sets and evaluated on the validation data. Finally, the grid search returns the combination that performs best according to the chosen metric.
Grid search suffers from the curse of dimensionality: the number of combinations grows exponentially with the number of hyperparameters. The individual trials, however, are independent of one another and can therefore be carried out in parallel.
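The procedure can be sketched in a few lines of Python. The candidate grids and the scoring function below are hypothetical stand-ins; in a real application the score would come from training an SVM with each (C, γ) pair and evaluating it on validation data.

```python
from itertools import product
from math import log2

# Hypothetical candidate grids; logarithmically spaced values are a
# common choice for C and gamma.
C_grid = [2.0 ** k for k in (-5, -1, 3, 7, 11, 15)]
gamma_grid = [2.0 ** k for k in (-15, -11, -7, -3, 1)]

def validation_score(C, gamma):
    # Stand-in for "train an SVM with (C, gamma) and score it on
    # validation data"; this toy surrogate peaks at C = 2**3, gamma = 2**-7.
    return -((log2(C) - 3) ** 2 + (log2(gamma) + 7) ** 2)

# Exhaustive evaluation of every pair in the Cartesian product; the
# trials are independent and could just as well run in parallel.
best_C, best_gamma = max(product(C_grid, gamma_grid),
                         key=lambda cg: validation_score(*cg))
```

Note that the number of evaluations is the product of the grid sizes, which is exactly where the curse of dimensionality bites as more hyperparameters are added.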
Random search
With random search, instead of exhaustively trying all combinations, values are sampled at random from the given hyperparameter space. Unlike grid search, no discretization of the space is required. Random search can outperform grid search in both speed and quality, especially when only a small number of hyperparameters actually influence the performance of the learning algorithm. The reason is that grid search tries only a few distinct values for each parameter (each of them many times over), whereas randomly sampled values cover each dimension of the search space much more densely.
Random search can also be carried out in parallel and, because no region of the space is systematically neglected, it provides a meaningful result over the whole search space at any time. Workers can therefore be added or removed at any point.
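A minimal sketch, reusing the same hypothetical stand-in scoring function as above; the log-uniform sampling ranges and the number of trials are illustrative assumptions.

```python
import math
import random

random.seed(0)

def validation_score(C, gamma):
    # Hypothetical surrogate for an actual model evaluation.
    return -((math.log2(C) - 3) ** 2 + (math.log2(gamma) + 7) ** 2)

def sample_config():
    # Sample log-uniformly over the same ranges a grid would discretize;
    # every draw hits a fresh value in each dimension.
    return 2.0 ** random.uniform(-5, 15), 2.0 ** random.uniform(-15, 3)

# 60 independent trials; more can be appended (or workers stopped) at
# any time without invalidating the result so far.
trials = [sample_config() for _ in range(60)]
best_C, best_gamma = max(trials, key=lambda cg: validation_score(*cg))
```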
Bayesian optimization
Bayesian optimization is a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, it builds a probabilistic model of the function mapping hyperparameter values to the metric evaluated on validation data. It iteratively tries hyperparameter configurations that look promising under the current model and then updates the model with the new observations. Bayesian optimization aims to collect as much information as possible about this function, in particular about the location of its optimum. It balances probing regions where little is known about the expected performance (exploration) against drawing on knowledge of regions where the optimum is expected to lie (exploitation). In practice, Bayesian optimization has been shown to achieve better results in fewer evaluations than grid search or random search, because it exploits the knowledge gathered from earlier trials.
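A self-contained sketch of the idea: a tiny Gaussian-process surrogate over a single hyperparameter in [0, 1], with an upper-confidence-bound acquisition rule (one common choice; expected improvement is another). The objective function, kernel length scale, and UCB coefficient are all hypothetical stand-ins.

```python
import math

def rbf(a, b, ls=0.3):
    # Squared-exponential kernel; ls is an assumed length scale.
    return math.exp(-((a - b) ** 2) / (2 * ls * ls))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small systems only).
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(xs, ys, x, noise=1e-6):
    # Posterior mean and variance of the GP surrogate at point x.
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k = [rbf(a, x) for a in xs]
    mean = sum(ki * ai for ki, ai in zip(k, solve(K, ys)))
    var = rbf(x, x) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 1e-12)

def objective(x):
    # Stand-in validation metric over one hyperparameter; peaks at 0.7.
    return -(x - 0.7) ** 2

xs = [0.1, 0.5, 0.9]                     # initial configurations
ys = [objective(x) for x in xs]
candidates = [i / 200 for i in range(201)]
for _ in range(10):
    # UCB acquisition: high mean (exploitation) + high uncertainty (exploration).
    def ucb(x):
        m, v = gp_posterior(xs, ys, x)
        return m + 2.0 * math.sqrt(v)
    x_next = max(candidates, key=ucb)    # most promising point under the model
    xs.append(x_next)                    # evaluate it ...
    ys.append(objective(x_next))         # ... and update the model's data
best_x = xs[ys.index(max(ys))]
```

Each iteration spends one (expensive) objective evaluation where the surrogate model predicts the best trade-off between expected performance and remaining uncertainty, which is why fewer evaluations are typically needed than with grid or random search.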
Gradient-based optimization
For some learning algorithms it is possible to compute the gradient of the validation metric with respect to the hyperparameters and to optimize them using steepest descent. The first application of such techniques was to neural networks; they were later used for other models such as support vector machines and logistic regression.
Another approach to obtaining gradients with respect to hyperparameters is to automatically differentiate through the steps of an iterative optimization algorithm.
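The idea can be illustrated with one-dimensional ridge regression, where the closed-form solution makes the hypergradient available analytically via the chain rule. The data, step size, and iteration count below are made-up assumptions for illustration.

```python
# Tune the regularization strength lam of 1-D ridge regression by
# steepest descent on the validation loss. All values are illustrative.
xtr, ytr = [1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.2, 3.9]   # training split
xva, yva = [1.5, 2.5, 3.5], [1.4, 2.6, 3.4]             # validation split

sxx = sum(x * x for x in xtr)
sxy = sum(x * y for x, y in zip(xtr, ytr))

def w_of(lam):
    # Closed-form ridge solution: w = (x.y) / (x.x + lam)
    return sxy / (sxx + lam)

def val_loss(lam):
    w = w_of(lam)
    return sum((w * x - y) ** 2 for x, y in zip(xva, yva))

def val_grad(lam):
    # Hypergradient via the chain rule: dL/dlam = (dL/dw) * (dw/dlam)
    w = w_of(lam)
    dL_dw = sum(2 * (w * x - y) * x for x, y in zip(xva, yva))
    dw_dlam = -sxy / (sxx + lam) ** 2
    return dL_dw * dw_dlam

lam = 5.0
for _ in range(200):                    # steepest descent on the hyperparameter
    lam = max(lam - 5.0 * val_grad(lam), 0.0)
```

For models without a closed-form solution, the same dL/dlam can be obtained by automatically differentiating through the training iterations themselves, as in the reversible-learning approach cited below.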
Evolutionary optimization
Evolutionary optimization uses evolutionary algorithms to search for the global optimum of a noisy black-box function. Evolutionary hyperparameter optimization follows a process inspired by biological evolution:
- Create an initial population of random solutions (randomly chosen hyperparameter values)
- Evaluate the hyperparameter configurations and determine their fitness (for example, the average performance in ten-fold cross-validation)
- Rank the hyperparameter configurations by their fitness
- Replace the worst-performing configurations with new ones generated by recombination and mutation
- Repeat steps 2 to 4 until satisfactory performance is achieved or performance no longer improves
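The steps above can be sketched as follows; the fitness surrogate, mutation scale, and population sizes are hypothetical assumptions (a real fitness evaluation would run cross-validation).

```python
import math
import random

random.seed(42)

def fitness(cfg):
    # Stand-in for "average performance in ten-fold cross-validation";
    # this hypothetical surrogate peaks at lr = 1e-2, reg = 1e-4.
    lr, reg = cfg
    return -((math.log10(lr) + 2) ** 2 + (math.log10(reg) + 4) ** 2)

def random_config():
    # Step 1: randomly chosen hyperparameter values (log-uniform).
    return (10 ** random.uniform(-5, 0), 10 ** random.uniform(-8, 0))

def mutate(cfg):
    # Perturb each value multiplicatively in log space.
    return tuple(v * 10 ** random.gauss(0, 0.3) for v in cfg)

def recombine(a, b):
    # Take each hyperparameter from one of the two parents.
    return tuple(random.choice(pair) for pair in zip(a, b))

population = [random_config() for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)      # steps 2-3: evaluate, rank
    survivors = population[:10]
    offspring = [mutate(recombine(random.choice(survivors),
                                  random.choice(survivors)))
                 for _ in range(10)]
    population = survivors + offspring              # step 4: replace the worst
best = max(population, key=fitness)
```

Because the top half of the population always survives, the best configuration found never degrades from one generation to the next.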
Evolutionary optimization has been used to optimize hyperparameters for statistical learning algorithms, in automated machine learning, and in the search for architectures of deep neural networks.
References
- Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin (2010): A practical guide to support vector classification. Technical Report, National Taiwan University.
- Davide Chicco: Ten quick tips for machine learning in computational biology. In: BioData Mining. 10, No. 35, December 2017, p. 35. doi:10.1186/s13040-017-0155-3. PMID 29234465. PMC 5721660 (free full text).
- James Bergstra, Yoshua Bengio: Random Search for Hyper-Parameter Optimization. In: Journal of Machine Learning Research. 13, 2012, pp. 281-305.
- Frank Hutter, Holger Hoos, Kevin Leyton-Brown: Sequential model-based optimization for general algorithm configuration. In: Learning and Intelligent Optimization. Lecture Notes in Computer Science, vol. 6683. Springer, Berlin, Heidelberg 2011, ISBN 978-3-642-25565-6, pp. 507-523, doi:10.1007/978-3-642-25566-3_40 (ubc.ca [PDF]).
- James Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl: Algorithms for hyper-parameter optimization. In: Advances in Neural Information Processing Systems. 2011 (nips.cc [PDF]).
- Jasper Snoek, Hugo Larochelle, Ryan Adams: Practical Bayesian Optimization of Machine Learning Algorithms. In: Advances in Neural Information Processing Systems. 2012. arXiv:1206.2944. bibcode:2012arXiv1206.2944S.
- Chris Thornton, Frank Hutter, Holger Hoos: Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: Knowledge Discovery and Data Mining. 2013. arXiv:1208.3719. bibcode:2012arXiv1208.3719T.
- Jan Larsen, Lars Kai Hansen, Claus Svarer, M. Ohlsson: Design and regularization of neural networks: the optimal use of a validation set. In: Proceedings of the 1996 IEEE Signal Processing Society Workshop. 1996.
- Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, Sayan Mukherjee: Choosing multiple parameters for support vector machines. In: Machine Learning. 46, 2002.
- Chuong B. Do, Chuan-Sheng Foo, Andrew Y. Ng: Efficient multiple hyperparameter learning for log-linear models. In: Advances in Neural Information Processing Systems 20. 2007.
- Justin Domke: Generic Methods for Optimization-Based Modeling. In: AISTATS. 22, 2012.
- Dougal Maclaurin, David Duvenaud, Ryan P. Adams: Gradient-based Hyperparameter Optimization through Reversible Learning. 2015, arXiv:1502.03492 [stat.ML].
- Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, Babak Hodjat: Evolving Deep Neural Networks. 2017, arXiv:1703.00548 [cs.NE].
- Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, Koray Kavukcuoglu: Population Based Training of Neural Networks. 2017, arXiv:1711.09846 [cs.LG].