# Supervised learning

Supervised learning is a branch of machine learning. Here, learning means the ability of an artificial intelligence to reproduce regularities. The correct results are known from natural laws or expert knowledge and are used to train the system.

A learning algorithm tries to find a hypothesis that makes predictions that are as accurate as possible. A hypothesis is to be understood as a mapping that assigns the presumed output value to each input value.

The method relies on outputs to be learned that are determined in advance, so that the correct results are known. The results of the learning process can therefore be compared with the known, correct results, i.e. “supervised”.

If the outputs follow a continuous distribution, so that they can assume any quantitative value within a given range, one usually speaks of a regression problem.

An example of such a regression problem is predicting house price trends based on certain variables, or determining a person's age from other information about the person. So it's mostly about predictions.

If, on the other hand, the outputs are discrete or the values are qualitative, one speaks of a classification problem. An example is determining whether an email is spam or not spam.

This article describes the procedure for implementing supervised learning and presents some methods for solving regression and classification problems.

## Definitions

The following definitions are used to express the mathematical relationships below:

- $x^{(i)}$ = input variables (also called "explanatory variables")
- $y^{(i)}$ = output / target variables (also called "explained variables")
- $(x^{(i)}, y^{(i)})$ = training pair / training example
- $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$ = data set that is used for learning (also called learning data set)
- $h(x)$ = the hypothesis function that is to be learned by the algorithm in order to approximate $y$ as precisely as possible

## Procedure

In order to solve a specific problem with supervised learning, the following steps must be carried out:

1. Determine the type of training examples. This means that it must first be determined what type of data the training data set should contain. In handwriting analysis, for example, this can be a single handwritten character, a whole handwritten word or a whole line of handwriting.
2. Collect data according to the previous selection. Both the explanatory variables and the explained variables must be collected. This collection can be performed by human experts, by measurements, or by other methods.
3. Determine how the explanatory variables are represented, since the accuracy of the learned function depends heavily on this. Typically, they are transformed into a vector that contains a number of features describing the object. The number of features should not be too large; however, the vector should contain enough information to predict the output accurately.
4. Determine the structure of the learned function and the associated learning algorithm. In the case of a regression problem, for example, it should be decided at this point whether a parametric or a non-parametric function is better suited to carry out the approximation.
5. Execute the learning algorithm on the collected training data set. Some supervised learning algorithms require the user to define certain control parameters. These parameters can be adjusted either by optimizing them on a subset of the data set (called the validation data set) or by cross-validation.
6. Finally, determine the accuracy of the learned function. After the control parameters have been set and the model parameters have been learned, the performance of the resulting function should be measured on a test data set that is separate from the training data set.
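Steps 5 and 6 rely on keeping the training, validation and test data apart. The following is a minimal sketch of such a split using NumPy; the function name `split_dataset`, the 60/20/20 proportions and the toy data are assumptions made for illustration, not part of the procedure itself.

```python
import numpy as np

def split_dataset(X, y, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into training, validation and test parts."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    order = rng.permutation(m)            # random order of the example indices
    n_test = int(m * test_fraction)
    n_val = int(m * val_fraction)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Toy data: 10 examples with 2 features each (hypothetical values)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

train, val, test = split_dataset(X, y)
print(len(train[1]), len(val[1]), len(test[1]))  # 6 2 2
```

The validation part would be used to choose control parameters, while the test part is touched only once, at the end, to measure accuracy.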

A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no learning algorithm that works best for all supervised learning problems (see no-free-lunch theorems). In the following, the most common learning algorithms for both regression and classification problems are presented and further algorithms are linked.

## Regression problems

In regression problems, the goal of supervised learning is usually to predict a quantity of interest from certain explanatory variables, such as the size or color of a house. The quantity can differ fundamentally from case to case, for example the price of houses in a certain area or the price of a share on the next day. The aim is accordingly to learn the relationship between the explanatory and the explained variables from a training data set and to use this relationship to predict future events that are not yet known. An example of a learning algorithm that can make such predictions is linear regression.

### Linear regression

Linear regression is the most common form of regression. The model used is linear in the parameters, with the dependent variable being a function of the independent variables. In regression, the outputs of the unknown function are noisy:

$$y = f(x) + \varepsilon,$$

where $f(x) \in \mathbb{R}$ represents the unknown function and $\varepsilon$ stands for random noise. The explanation for the noise is that there are additional hidden variables that are unobservable. The following regression function is used to approximate the unknown function:

$$h_{\theta}(x) = \theta_{0} + \theta_{1} x_{1} + \ldots + \theta_{n} x_{n}$$

or in vector notation (with the convention $x_{0} = 1$):

$$h_{\theta}(x) = \sum_{i=0}^{n} \theta_{i} x_{i} = \boldsymbol{\theta}^{\top} \mathbf{x}$$

The $\theta_{i}$ are the parameters of the function and $x$ is the vector that contains the explanatory variables. The parameters weight the individual explanatory variables accordingly and are therefore often referred to as regression weights.

In order to approximate the output $y$ as exactly as possible from the explanatory variables, a so-called “cost function” must be set up. This function describes the mean squared deviation, which arises because the hypothesis function only approximates the variable to be explained, $y$, and does not represent it exactly. The cost function is given by the following equation:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2}$$

It can be used to compute the error made in approximating $y$. The goal is now to minimize the cost function.

In order to minimize the function, the parameters must be chosen so that they weight the respective x values correctly and the prediction comes as close as possible to the desired y value.

The minimum can be calculated in two different ways.

One method is the so-called gradient method (gradient descent).

This method consists of the following steps:

1. Arbitrary starting values are selected for the parameters.
2. At this point, the derivative of the cost function is computed and the direction of steepest slope is determined.
3. A step is taken along this slope in the negative direction. The size of the step is determined by a learning rate.
4. This process is repeated until the minimum of the cost function has been reached.
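The four steps above can be sketched as batch gradient descent on the mean squared deviation. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the learning rate of 0.1, the fixed number of 5000 steps (instead of a convergence test) and the noise-free toy data are all chosen for the example.

```python
import numpy as np

def cost(theta, X, y):
    """Mean squared deviation between the predictions X @ theta and the targets y."""
    residuals = X @ theta - y
    return np.mean(residuals ** 2)

def gradient_descent(X, y, alpha=0.1, n_steps=5000):
    """Batch gradient descent; X must already contain a leading column of ones."""
    m, n = X.shape
    theta = np.zeros(n)                               # step 1: arbitrary starting values
    for _ in range(n_steps):                          # step 4: repeat
        gradient = (2.0 / m) * X.T @ (X @ theta - y)  # step 2: derivative of the cost
        theta = theta - alpha * gradient              # step 3: move against the gradient
    return theta

# Toy data generated from y = 1 + 2x without noise (hypothetical)
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = gradient_descent(X, y)
print(theta, cost(theta, X, y))  # theta is close to [1, 2], the cost close to 0
```

If the learning rate is chosen too large, the iteration diverges; if it is too small, many more steps are needed.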

This is shown in the following equation for a single training example (where $\alpha$ is the learning rate):

$$\theta_{j} := \theta_{j} + \alpha \left(y^{(i)} - h_{\theta}(x^{(i)})\right) x_{j}^{(i)}$$

This update is repeated until $y^{(i)} - h_{\theta}(x^{(i)}) = 0$ or until this difference has been minimized and the parameter has thus found its optimal value.

Another method that can be used is the so-called normal equations (see Multiple Linear Regression). With these, the cost function can be minimized explicitly, without resorting to an iterative algorithm, by implementing the following formula:

$$\boldsymbol{\theta} = (\mathbf{X}^{\top} \mathbf{X})^{-1} \mathbf{X}^{\top} \mathbf{y}$$

This formula gives us the optimal values of the parameters.
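As a small illustration of the normal equations, the sketch below solves the linear system $(\mathbf{X}^{\top}\mathbf{X})\,\boldsymbol{\theta} = \mathbf{X}^{\top}\mathbf{y}$ with `numpy.linalg.solve`. Solving the system instead of forming the inverse explicitly is a substitution made here for numerical reasons; it gives the same result whenever $\mathbf{X}^{\top}\mathbf{X}$ is invertible. The function name and the toy data are assumptions.

```python
import numpy as np

def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y; X needs a leading column of ones."""
    # Solving the linear system avoids computing the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data lying exactly on the line y = 1 + 2x (hypothetical)
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = normal_equation(X, y)
print(theta)  # approximately [1. 2.]
```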

| Gradient method | Normal equations |
| --- | --- |
| The learning rate $\alpha$ must be determined | No $\alpha$ is needed |
| Requires many steps and repetitions | There is no repetition |
| Works well even with a lot of data | From about 10,000 explanatory variables onward, the calculation becomes slow and the required computing power very large, since the inverse of $\mathbf{X}^{\top}\mathbf{X}$ has to be formed |

## Classification problems

In contrast to regression problems, classification problems can be recognized by the fact that the output $y$ can only assume a few discrete values. Most of the time, these values are qualitative, for example when determining, on the basis of several explanatory variables, whether an email is spam or not. In this example the explanatory variables would be $x^{(i)}$ and the output $y$ would be 1 if the e-mail is spam and 0 if it is not.

A distinction is also made between binary classification problems and classification problems with multiple classes. An example of the latter would be to classify which of three brands a purchased product is from (the classes in this case are brand A, B or C).

### Logistic regression

The most common method for dealing with classification problems in supervised machine learning is logistic regression. Although, as the name suggests, it is also a regression method, it is very well suited to teaching a computer program how to solve classification problems.

As already explained in the spam e-mail example, the output takes a value of either 1 or 0. If linear regression were used to solve this classification problem, it would probably yield many values above 1 or below 0.

Logistic regression, on the other hand, uses the sigmoid function given by the following equation:

$$g(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)}$$

This can be applied to the hypothesis function as follows:

$$h_{\theta}(x) = g(\theta^{\top} x) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^{\top} \mathbf{x})}$$

Since $g(z)$ always yields values between 0 and 1, the values of $h(x)$ also lie between 0 and 1.

Figure: The values of the sigmoid function are always between 0 and 1 and are interpreted in the context of logistic regression as the probability of belonging to a certain class.

An observation is assigned to a class as follows:

$$g(z) \geq 0.5 \Rightarrow y = 1$$

$$g(z) < 0.5 \Rightarrow y = 0$$

In order to assign the inputs to the target classes as accurately as possible, the parameters must be optimized, as with linear regression.
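The sigmoid function and the 0.5 decision rule can be sketched as follows. The parameter vector and the inputs are hypothetical values chosen so that the three examples fall below, exactly on and above the threshold; the function names are assumptions as well.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """Hypothesis h(x) = g(theta^T x) for every row of X."""
    return sigmoid(X @ theta)

def predict_class(theta, X, threshold=0.5):
    """Assign class 1 if h(x) >= threshold, otherwise class 0."""
    return (predict_proba(theta, X) >= threshold).astype(int)

theta = np.array([-1.0, 2.0])      # hypothetical parameters
X = np.array([[1.0, 0.0],          # first column is the constant 1
              [1.0, 0.5],
              [1.0, 2.0]])

print(predict_proba(theta, X))     # roughly [0.27, 0.5, 0.95]
print(predict_class(theta, X))     # [0 1 1]
```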

We assume the following relationship:

$$P(y = 1 \mid x; \theta) = h_{\theta}(x)$$

$$P(y = 0 \mid x; \theta) = 1 - h_{\theta}(x)$$

These equations mean that the probability that a certain input belongs to class 1 is given by the result of the hypothesis function $h(x)$.

It follows that the general conditional probability of a certain output $y$ given a certain input $x$ is given by the following function:

$$p(y \mid x; \theta) = (h_{\theta}(x))^{y} (1 - h_{\theta}(x))^{1-y}$$

If this probability is multiplied over all observations in the data set, one obtains the formula for the so-called “likelihood” of a certain parameter:

$$\begin{aligned} L(\theta) &= p(\mathbf{y} \mid X; \theta) \\ &= \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \\ &= \prod_{i=1}^{m} (h_{\theta}(x^{(i)}))^{y^{(i)}} (1 - h_{\theta}(x^{(i)}))^{1-y^{(i)}} \end{aligned}$$

Whereas in linear regression the mean squared deviation was minimized in order to obtain the optimal values of the parameters, in logistic regression the likelihood function is maximized. This procedure is known as the maximum likelihood method.
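To make the likelihood concrete, the following sketch evaluates $L(\theta)$ for two candidate parameter vectors on a tiny data set. The data, the candidate parameters and the function names are assumptions chosen for illustration. Parameters that separate the two classes better receive a higher likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(theta, X, y):
    """Product over all examples of h(x)^y * (1 - h(x))^(1 - y)."""
    h = sigmoid(X @ theta)
    return np.prod(h ** y * (1.0 - h) ** (1 - y))

# Four hypothetical observations; the first column is the constant 1
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

l_zero = likelihood(np.zeros(2), X, y)          # h(x) = 0.5 for every example
l_fit = likelihood(np.array([0.0, 1.0]), X, y)  # parameters that fit the labels better

print(l_zero)  # 0.0625, i.e. 0.5 ** 4
print(l_fit)   # larger than l_zero
```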

In order to facilitate the maximization, the log-likelihood function is often formed:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_{\theta}(x^{(i)})\right) \right]$$

The gradient of this function must now be calculated; it is used in the so-called gradient ascent. This works similarly to the gradient method used in linear regression, except that it steps in the direction of the gradient (an addition) rather than against it (a subtraction), since the log-likelihood function is to be maximized and not minimized. The following equation thus gives the update rule for a parameter, for a single training example:

$$\theta_{j} := \theta_{j} + \alpha \left(y^{(i)} - h_{\theta}(x^{(i)})\right) x_{j}^{(i)}$$

### Perceptron algorithm

In the 1960s, the so-called perceptron algorithm was developed. It was built according to the ideas of the time about how the brain worked.

The main difference between the perceptron algorithm and logistic regression is that the function $h(x)$ takes either the value 0 or the value 1, but not any value between 0 and 1 as in logistic regression. This is ensured by the fact that the function $g(z)$ is not a sigmoid function with values between 0 and 1, as in logistic regression, but is defined by

$$g(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

and therefore equals either exactly 0 or exactly 1.
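The difference between the hard threshold of the perceptron and the sigmoid of logistic regression can be seen by evaluating both on a few values of $z$. This is only an illustrative sketch; the function names and the sample values are assumptions.

```python
import numpy as np

def step(z):
    """Perceptron threshold function: 1 if z >= 0, otherwise 0."""
    return np.where(z >= 0, 1, 0)

def sigmoid(z):
    """Sigmoid used by logistic regression, shown for comparison."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-100.0, -0.0001, 0.0, 0.0001, 100.0])

print(step(z))     # [0 0 1 1 1]
print(sigmoid(z))  # graded values: near 0, about 0.5 three times, near 1
```

The threshold function returns the same output for $z = -100$ and $z = -0.0001$, whereas the sigmoid distinguishes clearly between the two.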

The following still applies:

$$h_{\theta}(x) = g(\boldsymbol{\theta}^{\top} \mathbf{x})$$

The update rule is likewise given by:

$$\theta_{j} := \theta_{j} + \alpha \left(y^{(i)} - h_{\theta}(x^{(i)})\right) x_{j}^{(i)}$$

This equation looks very similar to the update rules of the previous algorithms. It must be noted, however, that the perceptron does not have a particularly smooth learning process: because $g(z)$, and therefore $h(x)$, can only take the value 1 or 0, the error made when an input is misclassified can be significantly overestimated or underestimated. For example, class 0 is predicted both for $z = -0.0001$ and for $z = -100$. If the observations actually belong to class 1, the parameters are adjusted by the same amount in both cases.

## Factors to Consider

### Bias-Variance Dilemma

In supervised learning there is often a trade-off between bias and variance (bias-variance dilemma). The variance refers to the amount by which $h(x)$ would change if it were estimated using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets lead to different $h(x)$. Ideally, however, the estimate of $y$ should not vary too much between training sets. If a method has a high variance, small changes in the training data can lead to a much poorer fit on the test data set. In general, more flexible statistical methods have a higher variance because they fit the training data set very well but make many mistakes when they have to predict previously unseen data.

The bias, on the other hand, refers to the error that can arise from approximating a real-world problem, which can be very complicated, by a simpler model. For example, linear regression assumes a linear relationship between $Y$ and $X_{1}, X_{2}, \ldots, X_{p}$. In reality, however, problems seldom have a simple linear relationship, so performing a linear regression will undoubtedly introduce some bias between $h(x)$ and $y$.

### Amount of data and complexity of the "true function"

The second question is the amount of training data available in relation to the complexity of the “true function” (classifier or regression function). If the true function is simple, then an “inflexible” learning algorithm with high bias and low variance can learn it from a small amount of data. However, if the true function is very complex (e.g. because it involves complex interactions between many different input features and behaves differently in different parts of the input space), then it can only be learned from a very large amount of training data, using a “flexible” learning algorithm with low bias and high variance.
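The contrast between an inflexible and a flexible learner can be tried out with polynomial fits of different degree. The sketch below is purely illustrative: the sine curve standing in for the "true function", the noise level, the sample sizes and the degrees 1 and 6 are all assumptions, and the printed numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_function(x):
    # Assumed "true function" for this illustration only
    return np.sin(2 * np.pi * x)

def make_data(m):
    """Draw m noisy observations of the assumed true function on [0, 1]."""
    x = rng.uniform(0.0, 1.0, m)
    return x, true_function(x) + rng.normal(0.0, 0.2, m)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_train, y_train = make_data(15)     # small training set
x_test, y_test = make_data(200)      # separate test set

errors = {}
for degree in (1, 6):                # inflexible vs. flexible model
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, errors[degree])
```

The more flexible polynomial always fits the training data at least as well as the straight line; whether it also does better on the test data depends on how much training data is available relative to the complexity of the true function.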

### Outliers in the output values

Another possible problem is so-called “outliers” in the target values $y$. If the target values are often wrong (due to human or sensor errors), the learning algorithm should not try to find a function that exactly matches the training examples. Trying to fit the data too closely will result in overfitting. Even if there are no measurement errors, errors can occur if the function to be learned is too complex for the selected learning algorithm. In such a situation, part of the target function cannot be modeled, which means that the training data cannot be reproduced correctly. In either case, it is better to work with more bias and lower variance.

In practice there are several approaches to prevent problems with the output values, such as stopping the algorithm early to avoid overfitting, or detecting and removing outliers before training the supervised learning algorithm. There are several algorithms that identify outliers and allow their removal.
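One very simple way to detect and remove outliers in the target values is the interquartile-range rule, sketched below. This is only one of many possible approaches and is not singled out by the text; the function name, the factor 1.5 and the toy data are assumptions. The rule looks at the target values alone, so it is only sensible when the targets are expected to lie in a narrow range.

```python
import numpy as np

def remove_target_outliers(X, y, k=1.5):
    """Drop examples whose target lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    keep = (y >= q1 - k * iqr) & (y <= q3 + k * iqr)   # boolean mask of inliers
    return X[keep], y[keep]

# Hypothetical data: the target 50.0 is a planted outlier
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 50.0, 0.8, 1.3])

X_clean, y_clean = remove_target_outliers(X, y)
print(y_clean)  # the value 50.0 has been removed
```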