Temporal Difference Learning

Temporal difference learning (also TD learning) is a method of reinforcement learning. In reinforcement learning, an agent receives a reward after a series of actions and adjusts its strategy to maximize that reward. An agent using a TD learning algorithm does not make this adjustment only when it finally receives the reward, but after every action, based on an estimate of the expected reward.

Model

As with other reinforcement learning algorithms, the agent's interaction with its environment is modeled as a Markov decision process.

The agent has a function V that evaluates a given state (the state-value function). To improve its policy π, it brings V closer to the ideal value function with every iteration. Earlier methods had to wait for the reward at the end of an episode to know how good the policy was before adjusting the function. TD methods, by contrast, update their evaluation function after every action, using the observed reward r (which may also be zero) and the expected reward estimated by the current evaluation function. This procedure is called bootstrapping, and the estimate becomes more accurate with each iteration.

In the simplest algorithm, the agent chooses an action in each iteration according to its policy, observes the reward, and adjusts the evaluation function with the following formula:

V(s) ← V(s) + α · [r + γ · V(s′) − V(s)] ,

where s is the current state, s′ the subsequent state, α the learning rate and γ the discount factor.
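
A minimal tabular sketch of this update rule in Python; the environment object with reset() and step(action) returning (next_state, reward, done), as well as the fixed policy function, are illustrative assumptions rather than part of the article:

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
    """Tabular TD(0): estimate the state-value function V under a fixed policy."""
    V = defaultdict(float)  # value estimates, 0 by default for unseen (e.g. terminal) states
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # Move V(s) toward the bootstrapped target r + gamma * V(s')
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```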

The learning rate determines how strongly new observations change the current evaluation function. It has been proven mathematically that the algorithm converges to an optimum if the learning rate is initially high and gradually decreases, more precisely if the two conditions

∑ₜ αₜ = ∞   and   ∑ₜ αₜ² < ∞

(where αₜ is the learning rate used in step t) are fulfilled. In practice, a small constant learning rate is often chosen; although it does not satisfy the second condition, it is better suited to a changing environment.
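
As a concrete check (a standard textbook example, not taken from the article itself), the decaying schedule αₜ = 1/t satisfies both conditions, whereas any constant rate violates the second one:

```latex
% 1/t meets both conditions: the harmonic series diverges, the series of squares converges.
\sum_{t=1}^{\infty} \frac{1}{t} = \infty
\qquad\text{and}\qquad
\sum_{t=1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6} < \infty
% A constant rate \alpha > 0 instead gives \sum_t \alpha^{2} = \infty.
```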

The discount factor weights the future rewards. A small value makes the agent more myopic and a large value more farsighted.

The agent's policy can depend on the evaluation function. A prominent example is the ε-greedy policy: the agent either chooses the action that looks most promising according to the evaluation function or a random action. The parameter ε, with values between 0 and 1, gives the probability with which the agent picks a random action instead of the best one.
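
A short Python sketch of such a policy; the dictionary Q mapping (state, action) pairs to value estimates is an assumed representation:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore randomly, otherwise exploit the highest-valued action."""
    if random.random() < epsilon:
        return random.choice(actions)                          # random action (exploration)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # most promising action (exploitation)
```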

Algorithms

Q-learning

Q-learning is a variant of TD learning in which the agent evaluates the benefit of an action rather than of a state. The learned function Q(s, a) is therefore called the action-value function. Chris Watkins first introduced the algorithm in 1989; he and Peter Dayan provided a proof of convergence in 1992.

The agent executes an action a for the current state s according to its policy and receives the resulting reward r. For the subsequent state s′, it takes the value of the most promising action a′ under its current evaluation function as the estimate of the future reward. Based on this, it adjusts its evaluation function with the following formula:

Q(s, a) ← Q(s, a) + α · [r + γ · maxₐ′ Q(s′, a′) − Q(s, a)]

Since the assumption that the future reward comes from the best possible action can deviate from what the behavior policy (e.g. ε-greedy) actually does, Q-learning is called an off-policy algorithm.
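
A tabular Python sketch of the procedure, under the same assumed environment interface as above (reset() and step() returning (next_state, reward, done)):

```python
from collections import defaultdict
import random

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy (off-policy)."""
    Q = defaultdict(float)  # action-value estimates Q[(s, a)], 0 by default
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behave epsilon-greedily ...
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # ... but bootstrap from the best action in s', regardless of what the policy will do
            best_next = max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```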

SARSA

SARSA stands for state-action-reward-state-action and is likewise an algorithm for learning an action-value function. In contrast to Q-learning, however, the agent sticks to its own policy when selecting the follow-up action used in the update (on-policy). G. A. Rummery and M. Niranjan first described the algorithm in 1994. The name SARSA, proposed by Rich Sutton and mentioned there only in a footnote, is the one that prevailed.

An agent implementing SARSA executes an action a for the current state s according to its policy and receives a reward r. In the subsequent state s′ it again selects an action a′ according to its policy and uses that action's value as the estimate of the future reward in order to adjust the evaluation function with the following formula:

Q(s, a) ← Q(s, a) + α · [r + γ · Q(s′, a′) − Q(s, a)]
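
A corresponding on-policy sketch in Python, again under the assumed environment interface used above; note that the value of the action actually chosen in s′ enters the update:

```python
from collections import defaultdict
import random

def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA: the update uses the action the policy actually selects next (on-policy)."""
    def choose(Q, s):
        # epsilon-greedy action selection, shared by behavior and update
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = choose(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = choose(Q, s_next)  # follow-up action from the same policy
            # Bootstrap from Q(s', a'), i.e. the action the agent will really take
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```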

TD(λ)

TD(λ) is an extension of the algorithm with eligibility traces. In the original algorithm, an action only changes the evaluation of the state visited last. TD(λ), by contrast, adjusts the values of several visited states. To this end, each state carries a numerical attribute indicating how strongly its evaluation may be changed. When a state is visited, this attribute is set to 1, and in every subsequent iteration it decays exponentially with the decay rate λ.

This extension builds a bridge between TD learning and the earlier Monte Carlo methods. If λ is chosen as 1, the eligibility attribute does not decay, and at the end of the episode all visited states are adjusted on the basis of the observed reward, which corresponds to the procedure of Monte Carlo algorithms. TD(0), in contrast, is the classic TD algorithm described above. In practice, a middle ground between these two extremes, in which several states are adjusted, is usually most suitable.
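
A Python sketch of the state-value variant with replacing traces; the environment and policy are assumed as above, and the traces are decayed with γλ per step, as is common in the literature:

```python
from collections import defaultdict

def td_lambda(env, policy, episodes=1000, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) with replacing eligibility traces."""
    V = defaultdict(float)
    for _ in range(episodes):
        e = defaultdict(float)  # eligibility traces, reset at the start of each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * V[s_next] - V[s]  # TD error of this step
            e[s] = 1.0                            # state just visited: trace set to 1
            for state in list(e):
                V[state] += alpha * delta * e[state]  # every visited state shares in the update
                e[state] *= gamma * lam               # exponential decay of the trace
            s = s_next
    return V
```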

The extension can also be applied to SARSA and Q-learning; the resulting algorithms are called SARSA(λ) and Q(λ).

Applications

One of the best-known applications of TD learning is TD-Gammon, a program developed by Gerald Tesauro in the early 1990s that plays backgammon at the level of human experts. He implemented the TD(λ) algorithm with a multilayer perceptron to approximate the evaluation function; the weights of the neural network were updated by backpropagation.

References

  1. Peter Dayan, Terrence J. Sejnowski: TD(λ) Converges with Probability 1. In: Machine Learning. No. 14, 1994, pp. 295–301 (PDF, accessed April 22, 2016).
  2. Chris Watkins: Learning from Delayed Rewards. Ph.D. thesis, 1989 (PDF, accessed April 26, 2016).
  3. G. A. Rummery, M. Niranjan: On-line Q-Learning Using Connectionist Systems. 1994, p. 6 (PDF, accessed April 26, 2016).
  4. Gerald Tesauro: Practical Issues in Temporal Difference Learning. In: Machine Learning. No. 8, 1992, pp. 257–277 (PDF, accessed April 25, 2016).
  5. Gerald Tesauro: Temporal difference learning and TD-Gammon. In: Communications of the ACM. Vol. 38, No. 3, 1995, pp. 58–68, doi:10.1145/203330.203343.