Reinforcement learning

Reinforcement learning (German: bestärkendes Lernen or verstärkendes Lernen) is one of a variety of machine-learning techniques in which an agent automatically learns a strategy to maximize the rewards it receives. The agent is not told which action is best in which situation; instead, it receives a reward at certain times, which can also be negative. From these rewards the agent approximates a utility function that describes the value of a certain state or action.

The term is borrowed from psychology and has been in use since the beginnings of cybernetics; Marvin Minsky, for example, already used it in his dissertation of 1954. The models of reinforcement learning attempt to replicate learning behavior found in nature.

Model

Reinforcement learning methods consider the interaction of a learning agent with its environment, which is formulated as a Markov decision problem. The environment has a set of states $S$. Depending on the current state $s_t \in S$, the agent chooses an action $a_t \in A(s_t)$ from the set of available actions, whereby it arrives at a successor state $s_{t+1} \in S$ and receives a reward $r_{t+1} \in \mathbb{R}$. The agent's goal is to maximize the expected future return

$R_t = \sum_{k=0}^{T} \gamma^k \cdot r_{t+k+1}$ with $0 \leq \gamma \leq 1$.

The return is thus something like the expected total reward. Here $\gamma$ is the discount factor, which weights future rewards. For episodic problems, i.e. problems in which the world reaches a terminal state after a finite number of steps (such as a game of chess), the discount factor $\gamma = 1$ is suitable; in this case every reward $r_{t+k+1}$ is weighted equally. For continuing problems ($T = \infty$) one must choose $\gamma < 1$ so that the infinite series converges. For $\gamma = 0$ only the current reward $r_{t+1}$ counts; all future rewards are ignored. As $\gamma$ approaches 1, the agent becomes more farsighted.

To this end, the agent follows a strategy (English: policy), which it continuously improves. Usually the strategy is viewed as a function $\pi \colon S \rightarrow A$ that assigns an action to each state. However, non-deterministic strategies (or mixed strategies) are also possible, so that an action is selected with a certain probability. In general, a strategy is therefore defined as a conditional probability distribution: $\pi(s, a) = p(a \mid s)$.

Learning process
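Before turning to concrete algorithms, the effect of the discount factor on the return $R_t$ can be illustrated with a minimal sketch; the reward sequence here is invented purely for this example:

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite episode."""
    # Iterate backwards over the episode: R = r + gamma * R_next.
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Illustrative reward sequence: reward 1 arrives after three steps.
rewards = [0.0, 0.0, 1.0]

print(discounted_return(rewards, 1.0))  # gamma = 1: every reward counts equally -> 1.0
print(discounted_return(rewards, 0.5))  # future rewards are discounted -> 0.25
print(discounted_return(rewards, 0.0))  # only the immediate reward counts -> 0.0
```

As the comments show, the same reward sequence is worth less to a short-sighted agent (small $\gamma$) the further in the future the reward lies.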

There are various algorithms for learning the agent's strategy. Monte Carlo methods and temporal difference learning are particularly successful. Both are families of algorithms in which the agent maintains a utility function that evaluates a certain state or a certain action taken in a state.
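As a concrete illustration of temporal difference learning, the following sketch runs tabular Q-learning, one common TD algorithm; the chain world and all parameter values are invented for this example:

```python
import random

random.seed(0)

# A tiny deterministic chain world, invented for illustration:
# states 0..3; action 0 moves left, action 1 moves right; reaching
# state 3 (the terminal state) yields reward 1, every other step 0.
N_STATES, TERMINAL = 4, 3

def step(s, a):
    s2 = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

# Tabular Q-learning update:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != TERMINAL:
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        s2, r = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The greedy policy derived from the learned table moves right everywhere.
greedy = [0 if Q[s][0] > Q[s][1] else 1 for s in range(TERMINAL)]
print(greedy)  # [1, 1, 1]
```

Here the table's fields converge toward the discounted distance to the goal (e.g. $Q(2, \text{right}) \to 1$, $Q(1, \text{right}) \to 0.9$), exactly the utility function described above.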

In the case of small state or action spaces, this function can be stored as a table whose fields are updated based on the rewards received. In the case of large state spaces, however, the function must be approximated. Fourier series or neural networks, for example, are suitable for this.
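When the state space is continuous, a table no longer suffices. The following sketch uses semi-gradient TD(0) with a linear function approximator over Fourier cosine features, following the idea mentioned above; the random-walk environment and all parameter values are invented for illustration:

```python
import math
import random

ORDER = 3  # number of cosine basis functions beyond the constant term

def features(s):
    # Fourier cosine basis on [0, 1]: phi_i(s) = cos(i * pi * s).
    return [math.cos(i * math.pi * s) for i in range(ORDER + 1)]

def v(w, s):
    # Approximate state value as a linear combination of the features.
    return sum(wi * fi for wi, fi in zip(w, features(s)))

w = [0.0] * (ORDER + 1)
alpha, gamma = 0.05, 0.9

random.seed(1)
for episode in range(300):
    s = 0.0
    while s < 1.0:
        s2 = min(s + random.uniform(0.0, 0.2), 1.0)  # drift to the right
        r = 1.0 if s2 >= 1.0 else 0.0                # reward only at the goal
        target = r + (0.0 if s2 >= 1.0 else gamma * v(w, s2))
        td_error = target - v(w, s)
        for i, phi_i in enumerate(features(s)):
            w[i] += alpha * td_error * phi_i         # semi-gradient update
        s = s2

# States closer to the goal are now valued higher: v(w, 0.9) > v(w, 0.1).
```

Only the weight vector `w` is stored, one entry per basis function, instead of one table entry per state; the approximator thereby generalizes across the infinitely many states in $[0, 1]$.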

If more than one agent is to learn, the convergence of the learning processes cannot (so far) be guaranteed even with cooperative agents, except in trivial cases. Nevertheless, behavior that is useful in practice can often be learned with the aid of heuristics, since the worst case rarely occurs.

Literature

• Richard Sutton, Andrew Barto: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
• Dimitri P. Bertsekas, John Tsitsiklis: Neuro-Dynamic Programming. Athena Scientific, Cambridge, MA, 1996.
• Csaba Szepesvári: Algorithms for Reinforcement Learning. Morgan and Claypool, 2010 (ualberta.ca PDF).
• Marc Peter Deisenroth, Gerhard Neumann, Jan Peters: A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, 21, pp. 388–403, 2013 (ausy.tu-darmstadt.de PDF).
• Jens Kober, Drew Bagnell, Jan Peters: Reinforcement Learning in Robotics: A Survey. International Journal of Robotics Research, 32, 11, pp. 1238–1274, 2013 (ausy.tu-darmstadt.de PDF).
• Warren B. Powell: Approximate Dynamic Programming. John Wiley and Sons, 2011.
• Stuart Russell, Peter Norvig: Artificial Intelligence: A Modern Approach. Pearson Studium, August 2004, ISBN 3-8273-7089-2 (German translation of the 2nd edition), Chapter 21.