Automatic differentiation

Automatic differentiation (also called algorithmic differentiation) is a technique of computer science and applied mathematics. For a function of several variables that is given as a procedure in a programming language or as a computational graph, it generates an extended procedure that evaluates both the function and one or any number of directional derivatives, up to the full Jacobian matrix. If the original program contains loops, the number of loop iterations must not depend on the independent variables.

Such derivatives are required, for example, for solving nonlinear systems of equations with Newton's method and for methods of nonlinear optimization.

The most important tools here are the chain rule and the fact that the derivatives of the elementary functions available on the computer, such as sin, cos, exp, and log, are known and can be evaluated just as accurately. The effort for computing the derivatives is therefore proportional (with a small factor) to the effort for evaluating the original function.
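For example, differentiating the composition exp(sin(x^2)) requires nothing beyond the chain rule and the elementary derivatives of exp, sin, and the square:

  d/dx exp(sin(x^2)) = exp(sin(x^2)) · cos(x^2) · 2x.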

Calculation of derivatives

Task: Given a function

  f : ℝ^n → ℝ^m, x ↦ f(x).

We are looking for code (a procedure) for directional derivatives or for the full Jacobian matrix

  J_f(x) = f′(x) ∈ ℝ^{m×n}.

Different approaches to this are:

  1. Try to find a closed, analytical form for f and determine f′ "on paper" through differentiation. Then implement the code for f′ by hand.
    Problem: too difficult, time-consuming, error-prone
    Advantages: very efficient, high accuracy
  2. Enter the calculation rule for f into a computer algebra system and use the facilities available there for symbolic differentiation. Then export the code for f′ into the target environment.
    Problem: time-consuming, does not scale, too complicated for larger programs/functions
  3. Find a numerical approximation of the derivative. For small h,

      ∂f/∂x_i (x) ≈ (f(x + h·e_i) − f(x)) / h.

    Problem: choice of the optimal step size h, imprecise, possible instability (see the sketch after this list)
    Advantage: easy to compute
  4. Represent the calculation rule as a computational graph, i.e. as an arithmetic network, and expand it, using the chain rule, into a computational graph for both the function value and the derivative.
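The step-size problem of approach 3 can be seen in a few lines of Python; the test function sin and the step sizes below are illustrative choices, not part of the method description:

  import math

  def diff_forward(f, x, h):
      # One-sided difference quotient (f(x + h) - f(x)) / h
      return (f(x + h) - f(x)) / h

  # Illustrative test: the exact derivative of sin at x0 is cos(x0).
  x0 = 1.0
  exact = math.cos(x0)
  for h in (1e-1, 1e-5, 1e-13):
      error = abs(diff_forward(math.sin, x0, h) - exact)
      # Large h: truncation error dominates; tiny h: cancellation
      # in f(x + h) - f(x) dominates. Neither extreme is safe.
      print(f"h = {h:.0e}   error = {error:.2e}")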

The idea of automatic differentiation (AD)

Every program that evaluates a function f can be described as a sequence of intermediate steps in which intermediate results are combined in an elementary way. One can think of this as a (potentially very long) sequence of intermediate values v_k and elementary functions φ_k that each depend on only one or two variables. The function is evaluated by setting (v_{1−n}, …, v_0) = x at the beginning and then determining, one after the other,

  v_k = φ_k(v_i, v_j)  with  i, j < k,  k = 1, …, l.

This can be arranged so that the function values of f end up in the most recently evaluated intermediate results, i.e. at the end y = (v_{l−m+1}, …, v_l) is assigned.
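As a concrete illustration (the function f(x_1, x_2) = x_1·x_2 + sin(x_1) is a made-up example), such a decomposition into elementary intermediate values might look as follows in Python:

  import math

  def f(x1, x2):
      # Each intermediate value v_k depends on at most two earlier values.
      v1 = x1 * x2         # elementary product
      v2 = math.sin(x1)    # elementary function with a known derivative
      v3 = v1 + v2         # elementary sum
      y = v3               # the output is the last intermediate result
      return y

  print(f(2.0, 3.0))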

AD describes a set of techniques whose aim is to generate a new program that evaluates the Jacobian matrix of f. The input variables x are called independent variables, the output variable(s) y are called dependent variables. In AD there are at least two different modes:

  1. Forward mode
  2. Reverse mode

Forward mode

In the forward mode the matrix product

  J_f(x) · S

is calculated, i.e. the product of the Jacobian matrix with an arbitrary matrix S ∈ ℝ^{n×p} (the seed matrix), without first determining the components of the Jacobian matrix itself.

Example 1

For the seed matrix S = I_n (the identity matrix), AD calculates J_f(x) · I_n = J_f(x), i.e. the full Jacobian matrix. For a seed matrix consisting of a single column vector s, AD calculates the directional derivative J_f(x) · s.

In the forward mode, directional derivatives are transported along the control flow of the computation of f. For each scalar variable v, a vector Dv is generated in the AD-generated code, whose i-th component contains the directional derivative along the i-th independent variable.
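A minimal operator-overloading sketch of this idea in Python (the class name Dual and the helper sin are illustrative choices, not a fixed API): every value carries its vector Dv, which each elementary operation updates via the chain rule.

  import math

  class Dual:
      """A value v together with its derivative vector Dv."""
      def __init__(self, v, dv):
          self.v, self.dv = v, list(dv)

      def __mul__(self, other):
          # Product rule: D(uv) = v*Du + u*Dv
          return Dual(self.v * other.v,
                      [other.v * a + self.v * b
                       for a, b in zip(self.dv, other.dv)])

      def __add__(self, other):
          # Sum rule: D(u + v) = Du + Dv
          return Dual(self.v + other.v,
                      [a + b for a, b in zip(self.dv, other.dv)])

  def sin(u):
      # Chain rule with the known derivative of sin
      return Dual(math.sin(u.v), [math.cos(u.v) * a for a in u.dv])

  # Seed with the identity: Dx1 = (1, 0), Dx2 = (0, 1).
  x1, x2 = Dual(2.0, [1.0, 0.0]), Dual(3.0, [0.0, 1.0])
  y = x1 * x2 + sin(x1)   # f(x1, x2) = x1*x2 + sin(x1)
  print(y.v, y.dv)        # value and gradient (x2 + cos(x1), x1)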

Example 2

Suppose a function f is computed as a short sequence of elementary assignments. Automatic differentiation in the forward mode turns this into a program in which every assignment is accompanied by the corresponding assignment for its derivative vector, each obtained by applying the chain rule to that single elementary operation; a concrete sketch follows below.
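A hedged sketch of such a source transformation, reusing the made-up example f(x_1, x_2) = x_1·x_2 + sin(x_1) from above:

  import math

  def f(x1, x2):
      # Original program
      a = x1 * x2
      b = math.sin(x1)
      return a + b

  def f_forward(x1, x2, Dx1, Dx2):
      # AD-generated program: each assignment is accompanied by the
      # assignment for its derivative vector (here p = 2 directions).
      Da = [x2 * d1 + x1 * d2 for d1, d2 in zip(Dx1, Dx2)]
      a = x1 * x2
      Db = [math.cos(x1) * d1 for d1 in Dx1]
      b = math.sin(x1)
      Dy = [da + db for da, db in zip(Da, Db)]
      y = a + b
      return y, Dy

  # Seeding with the identity yields the full gradient.
  print(f_forward(2.0, 3.0, [1.0, 0.0], [0.0, 1.0]))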

Reverse mode

The reverse mode consists of two phases.

  1. The original program is executed, and certain intermediate data are saved.
  2. The original program is traversed in reverse order. Derivatives are transported backwards, using the data saved in phase 1.

In phase 2, an adjoint vector v̄ is introduced for each scalar variable v; its i-th component contains the derivative of the i-th (seeded) output component with respect to v. The seed matrix S is placed in the adjoint ȳ of the output. In the reverse mode, the result is the product

  S · J_f(x)

with an arbitrary seed matrix S ∈ ℝ^{p×m}.

Example 1

For the seed matrix S = I_m, AD calculates I_m · J_f(x) = J_f(x), i.e. the full Jacobian matrix. For a seed matrix consisting of a single row vector s^T, AD calculates s^T · J_f(x), the gradient of the weighted output s^T f.

Example 2

For each line s = φ(u, v) of the calculation rule, the question in phase 2 is how the adjoint s̄ contributes to the derivatives of u and v. The adjoints of u and v are updated using s̄ in the following way:

  ū ← ū + (∂φ/∂u) · s̄,   v̄ ← v̄ + (∂φ/∂v) · s̄.

We look for the x_1- and x_2-derivatives of y; these are denoted x̄_1 and x̄_2, respectively. The value ȳ is initialized with 1, all other adjoint values are initialized with 0.
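A minimal tape-based sketch of the two phases in Python, again for the made-up example f(x_1, x_2) = x_1·x_2 + sin(x_1); the tape layout is an illustrative choice:

  import math

  # Phase 1: run the original program and record, for each assignment,
  # which inputs it used and the local partial derivatives.
  x1, x2 = 2.0, 3.0
  a = x1 * x2
  b = math.sin(x1)
  y = a + b
  tape = [
      ("a", [("x1", x2), ("x2", x1)]),      # da/dx1 = x2, da/dx2 = x1
      ("b", [("x1", math.cos(x1))]),        # db/dx1 = cos(x1)
      ("y", [("a", 1.0), ("b", 1.0)]),      # dy/da = dy/db = 1
  ]

  # Phase 2: traverse the tape in reverse, accumulating adjoints.
  adj = {"x1": 0.0, "x2": 0.0, "a": 0.0, "b": 0.0, "y": 1.0}  # seed: ȳ = 1
  for var, deps in reversed(tape):
      for dep, partial in deps:
          adj[dep] += partial * adj[var]    # ū += (∂s/∂u) · s̄

  print(adj["x1"], adj["x2"])   # gradient: (x2 + cos(x1), x1)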

Efficiency considerations

The efficiency of AD algorithms depends on the mode and on the parameter p, the number of columns (forward mode) or rows (reverse mode) of the seed matrix S. The choice of mode and of p depends on what the Jacobian matrix is needed for. Denote by

  T(f)       the time to evaluate f,
  M(f)       the memory requirement of this calculation,
  T(f, JS)   the time to evaluate f and J·S,
  M(f, JS)   the memory requirement of this calculation,
  T(f, SJ)   the time to evaluate f and S·J,
  M(f, SJ)   the memory requirement of this calculation.

The following holds for the two modes presented, with a small constant c:

  1. Forward mode: T(f, JS) ≤ c·(1 + p)·T(f) and M(f, JS) ≤ c·(1 + p)·M(f); time and memory grow proportionally to p.
  2. Reverse mode: T(f, SJ) ≤ c·(1 + p)·T(f), but M(f, SJ) additionally grows with the number of operations of f, because the intermediate results of phase 1 must be stored for phase 2.
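One standard consequence is worth spelling out. For a scalar-valued function f : ℝ^n → ℝ, the full gradient requires p = n in the forward mode but only p = 1 in the reverse mode:

  T(f, ∇f) ≤ c·(1 + n)·T(f)   (forward mode, p = n),
  T(f, ∇f) ≤ c′·T(f)          (reverse mode, p = 1, c′ independent of n).

This is why the reverse mode is preferred for gradients of functions with many inputs and a single output, as in nonlinear optimization.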

The calculation as a chain of calculations

A composition

  f = f_3 ∘ f_2 ∘ f_1,   f(x) = f_3(f_2(f_1(x))),

is interpreted as a sequence of programs. In the example "Optimizing a wing", the calculation comprises the following steps:

  • Superposition of the wing with so-called "mode functions"
  • Calculation of a grid that is placed around the wing
  • Solving the Navier-Stokes equations on the grid and computing integrals of the solution

Overall, this yields the function

  f = f_3 ∘ f_2 ∘ f_1  with  f_1 : ℝ^8 → ℝ^{N_1},  f_2 : ℝ^{N_1} → ℝ^{N_2},  f_3 : ℝ^{N_2} → ℝ^m,

where the intermediate dimensions N_1 and N_2 (grid and flow field) are very large, while the input dimension is only 8.
With a naive approach one would compute the three Jacobian matrices J_{f_1}, J_{f_2}, J_{f_3} separately and then perform two matrix multiplications. The disadvantage in the forward mode, however, is that computing J_{f_2} alone would require propagating p = N_1 directions,

  T(f_2, J_{f_2}) ≈ c·(1 + N_1)·T(f_2),

and in the reverse mode, analogously,

  T(f_2, J_{f_2}) ≈ c·(1 + N_2)·T(f_2)

would hold; both are prohibitive because N_1 and N_2 are very large. A better approach is to use the result of one calculation as the seed matrix of the following one:

  1. Choose the identity I_8 as the seed matrix of the first calculation.
  2. Use the result of the first calculation as the seed matrix of the second calculation.
  3. Use the result of the second calculation as the seed matrix of the third calculation.

so that

  J_f(x) = J_{f_3} · (J_{f_2} · (J_{f_1} · I_8)).

Since each matrix propagated in this way has only 8 columns (p = 8), the time and memory requirements increase by at most a small multiple of 8 compared with the regular evaluation of f.
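A small NumPy sketch of this chaining (the dimensions and the random stand-in Jacobians are illustrative; in the real computation these products would be produced by AD rather than by explicit matrices):

  import numpy as np

  rng = np.random.default_rng(0)
  n0, n1, n2, m = 8, 1000, 500, 3   # illustrative sizes: N_1, N_2 large, m small

  # Stand-ins for the Jacobians of f1, f2, f3 at the current point.
  J1 = rng.standard_normal((n1, n0))
  J2 = rng.standard_normal((n2, n1))
  J3 = rng.standard_normal((m, n2))

  # Chained forward mode: seed with the 8x8 identity and propagate.
  S = np.eye(n0)          # seed matrix of the first calculation
  S = J1 @ S              # result: N_1 x 8, seed of the second calculation
  S = J2 @ S              # result: N_2 x 8, seed of the third calculation
  J = J3 @ S              # final m x 8 Jacobian of the composition

  # Every propagated matrix had only p = 8 columns; the huge N_2 x N_1
  # Jacobian of f2 was never formed.
  print(J.shape)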
