# Gauss-Newton method

The Gauss-Newton method (after Carl Friedrich Gauss and Isaac Newton) is a numerical method for solving nonlinear minimization problems of least-squares type. It is related to Newton's method for nonlinear optimization, but has the advantage that the second derivatives required by Newton's method need not be computed. Computing second derivatives is often a limiting factor, especially for large problems with tens of thousands of parameters.

## The optimization problem

The Gauss-Newton method solves problems in which the minimum of a sum of squares of continuously differentiable functions $f_i \colon \mathbb{R}^n \to \mathbb{R}$ is sought:

$$\min_{x \in \mathbb{R}^n} \left\{ \frac{1}{2} \sum_{i=1}^m \left( f_i(x) \right)^2 \right\}$$

with $m \geq n$. Using the Euclidean norm $\|\cdot\|$, this can also be written as

$$\min_{x \in \mathbb{R}^n} \left\{ \frac{1}{2} \|f(x)\|^2 \right\}$$

with $f = (f_1, \dotsc, f_m) \colon \mathbb{R}^n \to \mathbb{R}^m$. Problems of this form occur frequently in practice; in particular, solving the nonlinear system $f(x) = 0$ is equivalent to minimizing $\tfrac{1}{2}\|f(x)\|^2$, provided $f$ has a zero. If $f$ is a linear map, the problem reduces to the standard case of least squares with a linear model function.

## The method

The basic idea of the Gauss-Newton method is to linearize the objective function $f$ and to optimize this linearization in the least-squares sense. The linearization, i.e. the first-order Taylor expansion, of $f$ at the point $x^0 \in \mathbb{R}^n$ reads

$$\tilde{f}(x) = f(x^0) + \nabla f(x^0)^T (x - x^0).$$

The matrix $\nabla f(x^0)^T$ is the Jacobian of $f$ at $x^0$ and is often denoted by $J$. One obtains the linear least-squares problem

$$\min_{x \in \mathbb{R}^n} \left\{ \frac{1}{2} \|\tilde{f}(x)\|^2 = \frac{1}{2} \|J(x - x^0) + f(x^0)\|^2 \right\},$$

with gradient

$$\nabla \frac{1}{2} \left\Vert \tilde{f}(x) \right\Vert^2 = J^T \left( J(x - x^0) + f(x^0) \right).$$

Setting the gradient to zero provides the so-called normal equations

$$J^T J (x - x^0) = -J^T f(x^0)$$

with the explicit solution

$$x = x^0 - \left( J^T J \right)^{-1} J^T f(x^0).$$

From this, the Gauss-Newton iteration step follows directly:

$$x^{k+1} = x^k - \alpha^k \left( (J|_{x^k})^T J|_{x^k} \right)^{-1} (J|_{x^k})^T f(x^k),$$

where $J|_{x^k}$ denotes the Jacobian evaluated at the point $x^k$ and $\alpha^k \geq 0$ is a step size.
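This iteration step can be sketched in a few lines of NumPy. The names `gauss_newton_step`, `f`, `jac`, and `alpha` are illustrative choices, not part of the method's standard notation, and the normal equations are solved directly only for clarity; this is a minimal sketch rather than a production implementation:

```python
import numpy as np

def gauss_newton_step(f, jac, x, alpha=1.0):
    """One damped Gauss-Newton step.

    f(x) returns the residual vector, jac(x) the Jacobian J at x.
    Solves the normal equations J^T J d = J^T f(x) and returns x - alpha * d.
    """
    J = jac(x)
    d = np.linalg.solve(J.T @ J, J.T @ f(x))
    return x - alpha * d
```

As a quick sanity check: for a linear residual $f(x) = Ax - b$, a single full step ($\alpha = 1$) already lands on the least-squares solution.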

To solve the linear system of equations in each Gauss-Newton step, several options are available depending on the size and structure of the problem:

• Small problems ($n < 1000$, $m < 10000$) are best solved with a QR decomposition of $J$.
• For large problems, the Cholesky decomposition of $J^T J$ is well suited, since $J^T J$ is symmetric by construction; specially adapted Cholesky variants exist for sparse $J^T J$.
• The CG method can be used as a general option, although preconditioning is usually necessary.
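As an illustration of the first two options, the linearized subproblem $\min_d \tfrac{1}{2}\|Jd + f(x^k)\|^2$ can be solved either with a QR-based least-squares routine or via a Cholesky factorization of the normal equations. The Jacobian `J` and residual `r` below are random placeholder data; this is a sketch assuming NumPy, not a recommendation of specific sizes:

```python
import numpy as np

# Hypothetical Jacobian J and residual vector r at the current iterate
rng = np.random.default_rng(0)
J = rng.standard_normal((100, 5))
r = rng.standard_normal(100)

# Option 1: QR-based least-squares solve of min ||J d + r|| (robust for small problems)
d_qr = np.linalg.lstsq(J, -r, rcond=None)[0]

# Option 2: Cholesky factorization of the normal equations J^T J d = -J^T r
# (J^T J is symmetric positive definite when J has full column rank)
L = np.linalg.cholesky(J.T @ J)
y = np.linalg.solve(L, -J.T @ r)
d_chol = np.linalg.solve(L.T, y)

# Both routes yield the same Gauss-Newton direction (up to rounding)
assert np.allclose(d_qr, d_chol)
```

The QR route avoids forming $J^T J$ explicitly, which roughly squares the condition number of $J$; this is why it is preferred for smaller or ill-conditioned problems.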

## Convergence

The update vector in the Gauss-Newton step has the form $d = -D J^T f(x)$ with $D = \left(J^T J\right)^{-1}$. If $J$ has full rank, then $J^T J$, and therefore also $D$, is positive definite. On the other hand, $J^T f(x)$ is the gradient of the problem $\min_{x \in \mathbb{R}^n} \frac{1}{2}\|f(x)\|^2$, so $d$ is a descent direction, i.e. $\left(J^T f(x)\right)^T d < 0$. From this (with a suitable choice of the step size $\alpha^k$) the convergence of the Gauss-Newton method to a stationary point follows. This representation also shows that the Gauss-Newton method is essentially a scaled gradient method with the positive definite scaling matrix $D$.

In general, no statement can be made about the speed of convergence. If the starting point $x^0$ is far from the optimum, or if the matrix $J^T J$ is ill-conditioned, the Gauss-Newton method may converge arbitrarily slowly. If, on the other hand, the starting point $x^0$ is sufficiently close to the optimum, one can show that the method converges quadratically.

### Extension

To improve the behavior in the case of an ill-conditioned or singular $J^T J$, the Gauss-Newton step can be modified as follows:

$$x^{k+1} = x^k - \alpha^k \left( (J|_{x^k})^T J|_{x^k} + \Delta^k \right)^{-1} (J|_{x^k})^T f(x^k),$$

where the diagonal matrix $\Delta^k$ is chosen so that $(J|_{x^k})^T J|_{x^k} + \Delta^k$ is positive definite. With the choice $\Delta^k = \beta^k I$, $\beta^k \geq 0$, i.e. a scalar multiple of the identity matrix, one obtains the Levenberg-Marquardt algorithm.
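The damped step can be sketched as follows. The helper name `lm_step` is illustrative, and the example deliberately uses a rank-deficient Jacobian, for which the undamped Gauss-Newton system would be singular (a sketch assuming NumPy):

```python
import numpy as np

def lm_step(J, r, beta):
    """Levenberg-Marquardt direction: solve (J^T J + beta*I) d = -J^T r."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + beta * np.eye(n), -J.T @ r)

# Rank-deficient Jacobian: J^T J is singular, the plain GN step is undefined
J = np.array([[1.0, 1.0],
              [2.0, 2.0]])
r = np.array([1.0, -1.0])

d = lm_step(J, r, beta=0.5)  # damping makes the system solvable
```

For large $\beta$ the step approaches a small multiple of the negative gradient $-J^T r$, while $\beta \to 0$ recovers the pure Gauss-Newton direction, which matches the interpretation of the method as a scaled gradient method.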

## Example

The Rosenbrock function

$$g \colon \mathbb{R}^2 \to \mathbb{R}, \quad x \mapsto \left(a - x_1\right)^2 + b \left(x_2 - x_1^2\right)^2$$

is often used as a test for optimization methods, because it is challenging due to its narrow, shallow valley, in which iterative methods can take only small steps. The constants are usually chosen as $a = 1$, $b = 100$; in this case the global optimum is $x^* = (1, 1)$ with function value $g(x^*) = 0$.

To apply the Gauss-Newton method, the Rosenbrock function must first be brought into the form "sum of squares of functions". Since the Rosenbrock function already consists of a sum of two terms, one chooses the ansatz

$$\frac{1}{2} \left( f_1(x) \right)^2 = \left( a - x_1 \right)^2 \quad\Longleftrightarrow\quad f_1(x) = \sqrt{2}\,(a - x_1)$$

and

$$\frac{1}{2} \left( f_2(x) \right)^2 = b \left( x_2 - x_1^2 \right)^2 \quad\Longleftrightarrow\quad f_2(x) = \sqrt{2b}\,(x_2 - x_1^2).$$

The Gauss-Newton problem for the Rosenbrock function is thus

$$\min_{x \in \mathbb{R}^2} \frac{1}{2} \|f(x)\|^2, \quad\text{where}\quad f \colon \mathbb{R}^2 \to \mathbb{R}^2, \ x \mapsto \begin{pmatrix} f_1(x) \\ f_2(x) \end{pmatrix} = \begin{pmatrix} \sqrt{2}\,(a - x_1) \\ \sqrt{2b}\,(x_2 - x_1^2) \end{pmatrix}.$$
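This decomposition can be checked numerically: $\tfrac{1}{2}\|f(x)\|^2$ should reproduce the Rosenbrock value at any point. The test points below are arbitrary illustrative choices (a small sketch assuming NumPy):

```python
import numpy as np

a, b = 1.0, 100.0

def rosenbrock(x):
    # Original Rosenbrock function g
    return (a - x[0]) ** 2 + b * (x[1] - x[0] ** 2) ** 2

def f(x):
    # Residual vector of the sum-of-squares decomposition
    return np.array([np.sqrt(2.0) * (a - x[0]),
                     np.sqrt(2.0 * b) * (x[1] - x[0] ** 2)])

# (1/2)||f(x)||^2 reproduces g at a few arbitrary test points
for x in [np.array([0.0, -0.1]), np.array([0.5, 0.25]), np.array([-1.2, 1.0])]:
    assert abs(0.5 * np.dot(f(x), f(x)) - rosenbrock(x)) < 1e-9
```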

The Jacobian is

$$J = \begin{bmatrix} \tfrac{\partial f_1}{\partial x_1} & \tfrac{\partial f_1}{\partial x_2} \\ \tfrac{\partial f_2}{\partial x_1} & \tfrac{\partial f_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} -\sqrt{2} & 0 \\ -2x_1 \sqrt{2b} & \sqrt{2b} \end{bmatrix},$$

and thus

$$D = J^T J = \begin{bmatrix} 8bx_1^2 + 2 & -4bx_1 \\ -4bx_1 & 2b \end{bmatrix}.$$

Since $J$ has full rank, $D$ is positive definite and the inverse $D^{-1}$ exists. The following simple line search is used to determine the step size $\alpha^k$:

1. Start with $\alpha^k = 1$.
2. Calculate the new point $\tilde{x} = x^k + \alpha^k d$ with $d = -\left( (J|_{x^k})^T J|_{x^k} \right)^{-1} (J|_{x^k})^T f(x^k)$.
3. If $\|f(\tilde{x})\| < \|f(x^k)\|$, set $x^{k+1} = \tilde{x}$ and go to the next iteration.
4. Otherwise halve $\alpha^k$ and go to step 2.

The line search forces the new function value to be smaller than the previous one; it is guaranteed to terminate (possibly with a very small $\alpha^k$), since $d$ is a descent direction.
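Putting the pieces together, the example can be reproduced with a short NumPy sketch of the Gauss-Newton iteration with this halving line search. The gradient tolerance, step-size floor, and iteration cap are illustrative safeguards, not part of the scheme described above:

```python
import numpy as np

a, b = 1.0, 100.0

def f(x):
    # Residual vector; (1/2)||f(x)||^2 equals the Rosenbrock function g
    return np.array([np.sqrt(2.0) * (a - x[0]),
                     np.sqrt(2.0 * b) * (x[1] - x[0] ** 2)])

def jac(x):
    # Analytic Jacobian of the residual vector
    return np.array([[-np.sqrt(2.0), 0.0],
                     [-2.0 * x[0] * np.sqrt(2.0 * b), np.sqrt(2.0 * b)]])

def g(x):
    # Rosenbrock function value
    return 0.5 * float(np.dot(f(x), f(x)))

x = np.array([0.0, -0.1])          # starting point from the text
for k in range(20):
    J = jac(x)
    grad = J.T @ f(x)
    if np.linalg.norm(grad) < 1e-10:   # stationary point reached
        break
    d = -np.linalg.solve(J.T @ J, grad)  # Gauss-Newton direction
    alpha = 1.0
    # halve the step size until the function value decreases
    while g(x + alpha * d) >= g(x) and alpha > 1e-12:
        alpha /= 2.0
    x = x + alpha * d
```

Run as written, the iterates follow the table below and reach the global optimum $(1, 1)$ after a handful of steps.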

The starting point $x^0 = (0, -0.1)$ is chosen. The Gauss-Newton method converges to the global optimum in a few iterations:

Optimization of the Rosenbrock function with the Gauss-Newton method:

| $k$ | $x^k$ | $g(x^k)$ |
| --- | --- | --- |
| 0 | (0, -0.1) | 2 |
| 1 | (0.1250, -0.0875) | 1.8291 |
| 2 | (0.2344, -0.0473) | 1.6306 |
| 3 | (0.4258, 0.0680) | 1.6131 |
| 4 | (0.5693, 0.2186) | 1.3000 |
| 5 | (0.7847, 0.5166) | 1.0300 |
| 6 | (1.0, 0.9536) | 0.2150 |
| 7 | (1.0, 1.0) | 1.1212e-27 |

For comparison, the gradient method (with the same line search) gives the following result; it does not reach the optimum even after 500 iterations:

Optimization of the Rosenbrock function with the gradient method:

| $k$ | $x^k$ | $g(x^k)$ |
| --- | --- | --- |
| 0 | (0, -0.1) | 2 |
| 1 | (0.0156, 0.0562) | 1.2827 |
| 2 | (0.0337, -0.0313) | 1.0386 |
| 3 | (0.0454, 0.0194) | 0.9411 |
| 4 | (0.0628, -0.0077) | 0.8918 |
| 5 | (0.0875, 0.0286) | 0.8765 |
| ⋮ | ⋮ | ⋮ |
| 500 | (0.8513, 0.7233) | 0.0223 |

## Literature

• Dimitri P. Bertsekas: *Nonlinear Programming.* Second edition, Athena Scientific, 1995, ISBN 9781886529144.
• Yurii Nesterov: *Introductory Lectures on Convex Optimization: A Basic Course.* Springer Science & Business Media, 2003, ISBN 978-1-4419-8853-9.
• Jorge Nocedal, Stephen Wright: *Numerical Optimization.* Springer Science & Business Media, 2000, ISBN 9780387987934.
• Amir Beck: *Introduction to Nonlinear Optimization.* SIAM, 2014, ISBN 978-1611973648.
