Hyperparameter optimization

In machine learning, hyperparameter optimization refers to the search for optimal hyperparameters. A hyperparameter is a parameter that controls the training algorithm and whose value, unlike that of other parameters, must be set before the model is actually trained.

Approaches

Figure: A discrete set of values is defined for each of the model's two hyperparameters (here 10 elements each). In grid search, models are trained one after another for every combination of these values and their performance is measured, giving a total of 100 trials in this example.

Grid search

Grid search is the traditional method of searching for optimal hyperparameters. An exhaustive search is carried out over a manually specified subset of the learning algorithm's hyperparameter space. A grid search must be guided by a performance metric, which is typically computed by cross-validation on the training data or by evaluation on validation data not seen during training.

For real-valued or unbounded spaces of individual hyperparameters, a discretization and a restriction to part of the space must be specified before the grid search.

One example is the hyperparameter optimization of an SVM classifier with an RBF kernel. It has two hyperparameters that must be tuned for good performance on unseen data: a regularization constant C and a kernel hyperparameter γ. Both parameters are continuous, so a finite set of reasonable values must be chosen for each hyperparameter in order to perform the grid search, for example C ∈ {10, 100, 1000} and γ ∈ {0.1, 0.2, 0.5, 1.0}.

During the grid search, a support vector machine is then trained for each pair (C, γ) in the Cartesian product of these two sets and evaluated on validation data. Finally, the grid search returns the combination that performs best according to the selected metric.
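
A minimal sketch of such a grid search, using scikit-learn and a synthetic dataset (both of which are illustrative assumptions, not part of the article), could look like this:

```python
# Grid search over the candidate sets for C and gamma, scored by 5-fold
# cross-validation; every pair in the Cartesian product is tried once.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "C": [10, 100, 1000],           # regularization constant
    "gamma": [0.1, 0.2, 0.5, 1.0],  # RBF kernel hyperparameter
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```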

Grid search suffers from the curse of dimensionality. The individual trials can, however, be carried out in parallel, as they do not depend on one another.

Figure: In random search, models are trained with hyperparameter values drawn at random from the search space. The model's performance as a function of the hyperparameters (blue = good, red = poor performance) does not influence the selection. In this example, 100 trials were likewise made. In contrast to grid search, significantly more distinct values are tried for each individual hyperparameter (green lines).

Random search

In a random search, instead of exhaustively trying out all combinations, values are selected at random within the given hyperparameter space. In contrast to grid search, no discretization of the space is required. Random search can outperform grid search in both speed and resulting performance, especially when only a small number of hyperparameters influence the quality of the learning algorithm. The reason is that grid search tries only a few distinct values for each parameter (but each of them many times in combination with the others), whereas randomly selected values are much better distributed across the search space.

Random search can also be carried out in parallel and at any point provides a complete result over the whole search space, without neglecting any regions. Processes can therefore be added or stopped at any time.
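
A minimal sketch of a random search over the same two SVM hyperparameters (the sampling ranges, the synthetic dataset and the budget of 100 trials are illustrative assumptions); because the trials are independent, the loop body could be distributed over any number of parallel workers:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

best_score, best_params = -np.inf, None
for _ in range(100):
    # Each hyperparameter is drawn from a continuous (here log-uniform) range;
    # no discretization of the search space is needed.
    C = 10 ** rng.uniform(0, 4)        # C in [1, 10^4]
    gamma = 10 ** rng.uniform(-2, 1)   # gamma in [0.01, 10]
    score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, {"C": C, "gamma": gamma}

print(best_params, best_score)
```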

Figure: With Tree-structured Parzen Estimators (TPE) and Bayesian optimization, a model of the relationship between the hyperparameters and the measured performance is built in order to examine regions with good performance more closely while continuing to explore other regions of the search space. As the example shows, most of the 100 trials lie near the actual performance optimum. In contrast to grid and random search, the optimum is located considerably more precisely and more quickly.

Bayesian optimization

Bayesian optimization is a global optimization method for noisy black-box functions which, for hyperparameter optimization, builds a probabilistic model of the function mapping hyperparameter values to the metric evaluated on the validation data. It iteratively tries out hyperparameter configurations that appear promising according to the current model and then updates the model based on the new observations. Bayesian optimization aims to collect as many observations about the function as possible, especially about the location of its optimum. It simultaneously considers exploring regions about which little is known regarding the expected performance (exploration) and exploiting knowledge about regions in which the optimum is expected (exploitation). In practice, Bayesian optimization has been shown to achieve better results in fewer trials than grid or random search, thanks to the knowledge it accumulates.
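
A minimal sketch of such an iteration loop for a single hyperparameter, using a Gaussian process from scikit-learn as the probabilistic model and expected improvement as the acquisition function (the 1-D objective function and all settings are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def validation_error(log_gamma):
    # Stand-in for "train the model with this hyperparameter and measure the
    # error on validation data" (a made-up 1-D function).
    return (log_gamma - 0.3) ** 2 + 0.1 * np.sin(5 * log_gamma)

candidates = np.linspace(-2, 2, 400).reshape(-1, 1)
X_obs = np.array([[-2.0], [0.0], [2.0]])                  # initial trials
y_obs = np.array([validation_error(x[0]) for x in X_obs])

for _ in range(20):
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y_obs.min()
    # Expected improvement balances exploitation (low predicted error) and
    # exploration (high predictive uncertainty).
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, validation_error(x_next[0]))

print("best log_gamma:", X_obs[np.argmin(y_obs), 0], "error:", y_obs.min())
```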

Gradient-based optimization

For some learning algorithms it is possible to compute the gradient with respect to the hyperparameters and to optimize them by gradient descent. The first application of such techniques was to neural networks. They were later extended to other models such as support vector machines and logistic regression.
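
A minimal sketch of the idea for a case where this gradient is available in closed form: in ridge regression, the trained weights depend on the regularization strength λ through an explicit formula, so the validation loss can be differentiated with respect to λ and λ can be tuned by gradient descent (the synthetic data and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr, X_val = rng.normal(size=(80, 5)), rng.normal(size=(40, 5))
w_true = rng.normal(size=5)
y_tr = X_tr @ w_true + 0.3 * rng.normal(size=80)
y_val = X_val @ w_true + 0.3 * rng.normal(size=40)

lam = 1.0
for _ in range(50):
    A = X_tr.T @ X_tr + lam * np.eye(5)
    w = np.linalg.solve(A, X_tr.T @ y_tr)    # inner solution w*(lam)
    dw = -np.linalg.solve(A, w)              # d w*/d lam via implicit differentiation
    resid = X_val @ w - y_val
    grad = 2 * resid @ (X_val @ dw)          # d(validation loss)/d lam
    lam = max(lam - 0.05 * grad, 1e-6)       # steepest descent on the hyperparameter

print("tuned lambda:", lam)
```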

Another approach to obtaining gradients with respect to the hyperparameters is to differentiate the steps of an iterative optimization algorithm automatically.
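
A sketch of this second approach under the same illustrative ridge-regression setup: the gradient-descent updates of the weights are unrolled, and the derivative of the weights with respect to λ is propagated forward through every step (forward-mode differentiation of the training loop, written out by hand):

```python
import numpy as np

rng = np.random.default_rng(1)
X_tr, X_val = rng.normal(size=(80, 5)), rng.normal(size=(40, 5))
w_true = rng.normal(size=5)
y_tr = X_tr @ w_true + 0.3 * rng.normal(size=80)
y_val = X_val @ w_true + 0.3 * rng.normal(size=40)

def hypergradient(lam, lr=0.01, steps=200):
    w = np.zeros(5)
    dw = np.zeros(5)                          # dw/dlam, carried through the loop
    for _ in range(steps):
        g = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr) + 2 * lam * w
        dg = 2 * X_tr.T @ (X_tr @ dw) / len(y_tr) + 2 * w + 2 * lam * dw
        w, dw = w - lr * g, dw - lr * dg      # differentiate the update rule itself
    resid = X_val @ w - y_val
    return 2 * resid @ (X_val @ dw) / len(y_val)

lam = 1.0
for _ in range(30):
    lam = max(lam - 0.1 * hypergradient(lam), 1e-6)
print("tuned lambda:", lam)
```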

Evolutionary optimization

Evolutionary optimization uses evolutionary algorithms to search for the global optimum of a noisy black-box function. Evolutionary hyperparameter optimization follows an evolution-inspired process, sketched in code after the list below:

  1. Create an initial population of random solutions (randomly chosen hyperparameter values)
  2. Evaluate the hyperparameter configurations and determine their fitness (for example the average performance in tenfold cross-validation)
  3. Rank the hyperparameter configurations according to their fitness
  4. Replace the worst-performing configurations with new ones generated by recombination and mutation
  5. Repeat steps 2 to 4 until satisfactory performance is achieved or the performance does not improve any further
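
A minimal sketch of this loop for the two SVM hyperparameters from above (scikit-learn, the synthetic dataset, truncation selection, uniform crossover and Gaussian mutation on log-scaled values are all illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def fitness(ind):
    # Step 2: average performance in tenfold cross-validation.
    C, gamma = 10 ** ind[0], 10 ** ind[1]      # individuals live in log10 space
    return cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=10).mean()

# Step 1: initial population of random configurations (log10 C, log10 gamma).
pop = rng.uniform([-1, -3], [3, 1], size=(10, 2))

best_ind, best_fit = None, -np.inf
for generation in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    order = np.argsort(scores)[::-1]            # Step 3: rank by fitness
    pop, scores = pop[order], scores[order]
    if scores[0] > best_fit:
        best_ind, best_fit = pop[0], scores[0]
    survivors = pop[:5]
    # Step 4: replace the worst half by recombination and mutation.
    children = []
    for _ in range(5):
        a, b = survivors[rng.choice(5, 2, replace=False)]
        child = np.where(rng.random(2) < 0.5, a, b)      # uniform crossover
        children.append(child + rng.normal(0, 0.2, 2))   # Gaussian mutation
    pop = np.vstack([survivors, children])
    # Step 5: in practice, stop when best_fit no longer improves.

print({"C": 10 ** best_ind[0], "gamma": 10 ** best_ind[1]}, best_fit)
```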

Evolutionary optimization has been used to optimize hyperparameters for statistical learning algorithms, for automated machine learning, and in the search for architectures of deep neural networks.
