Differential dynamic programming

Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne[1] and subsequently analysed in Jacobson and Mayne's eponymous book.[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.[3][4]

Finite-horizon discrete-time problems

The dynamics

$$\mathbf{x}_{i+1} = \mathbf{f}(\mathbf{x}_i, \mathbf{u}_i) \tag{1}$$

describe the evolution of the state $\mathbf{x}$ given the control $\mathbf{u}$ from time $i$ to time $i+1$. The total cost $J_0$ is the sum of running costs $\ell$ and final cost $\ell_f$, incurred when starting from state $\mathbf{x}$ and applying the control sequence $\mathbf{U} \equiv \{\mathbf{u}_0, \mathbf{u}_1, \dots, \mathbf{u}_{N-1}\}$ until the horizon is reached:

$$J_0(\mathbf{x}, \mathbf{U}) = \sum_{i=0}^{N-1} \ell(\mathbf{x}_i, \mathbf{u}_i) + \ell_f(\mathbf{x}_N),$$

where $\mathbf{x}_0 \equiv \mathbf{x}$, and the $\mathbf{x}_i$ for $i > 0$ are given by Eq. (1). The solution of the optimal control problem is the minimizing control sequence $\mathbf{U}^*(\mathbf{x}) \equiv \operatorname*{arg\,min}_{\mathbf{U}} J_0(\mathbf{x}, \mathbf{U})$. Trajectory optimization means finding $\mathbf{U}^*(\mathbf{x})$ for a particular $\mathbf{x}_0$, rather than for all possible initial states.
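
As an illustration, the following sketch rolls a control sequence out through the dynamics and accumulates the total cost $J_0$; the linear dynamics and quadratic costs are hypothetical choices made only for this example.

```python
import numpy as np

# Hypothetical problem data: a linear system with quadratic costs,
# used only to illustrate the rollout and the total cost J_0.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Qc, Rc = np.eye(2), 0.1 * np.eye(1)

def f(x, u):
    """Dynamics: x_{i+1} = f(x_i, u_i)."""
    return A @ x + B @ u

def running_cost(x, u):
    """Running cost l(x, u)."""
    return 0.5 * (x @ Qc @ x + u @ Rc @ u)

def final_cost(x):
    """Final cost l_f(x)."""
    return 0.5 * (x @ Qc @ x)

def total_cost(x0, U):
    """J_0(x, U): apply the control sequence U from x_0 and sum the costs."""
    x, J = x0, 0.0
    for u in U:
        J += running_cost(x, u)
        x = f(x, u)
    return J + final_cost(x)

x0 = np.array([1.0, 0.0])
U = [np.zeros(1) for _ in range(50)]   # nominal control sequence, N = 50
print(total_cost(x0, U))
```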

Dynamic programming

Let $\mathbf{U}_i$ be the partial control sequence $\mathbf{U}_i \equiv \{\mathbf{u}_i, \mathbf{u}_{i+1}, \dots, \mathbf{u}_{N-1}\}$ and define the cost-to-go $J_i$ as the partial sum of costs from $i$ to $N$:

$$J_i(\mathbf{x}, \mathbf{U}_i) = \sum_{j=i}^{N-1} \ell(\mathbf{x}_j, \mathbf{u}_j) + \ell_f(\mathbf{x}_N).$$

The optimal cost-to-go or value function at time i is the cost-to-go given the minimizing control sequence:

$$V(\mathbf{x}, i) \equiv \min_{\mathbf{U}_i} J_i(\mathbf{x}, \mathbf{U}_i).$$

Setting $V(\mathbf{x}, N) \equiv \ell_f(\mathbf{x}_N)$, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

$$V(\mathbf{x}, i) = \min_{\mathbf{u}} \big[\, \ell(\mathbf{x}, \mathbf{u}) + V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1) \,\big] \tag{2}$$

This is the Bellman equation.
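
For intuition, here is a minimal tabular sketch of this backward recursion on a crudely discretized one-dimensional problem; the grids, dynamics and costs are assumptions made only for illustration.

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 41)   # discretized scalar state
us = np.linspace(-1.0, 1.0, 21)   # discretized scalar control
N = 20                            # horizon

f  = lambda x, u: x + 0.1 * u                # assumed scalar dynamics
l  = lambda x, u: 0.5 * (x**2 + 0.1 * u**2)  # running cost
lf = lambda x: 0.5 * x**2                    # final cost

V = lf(xs)                         # V(x, N) = l_f(x)
for i in reversed(range(N)):       # proceed backwards in time
    # Q[j, k] = l(x_j, u_k) + V(f(x_j, u_k), i+1), with V(., i+1) interpolated on the grid
    Q = l(xs[:, None], us[None, :]) + np.interp(f(xs[:, None], us[None, :]), xs, V)
    V = Q.min(axis=1)              # Bellman equation: minimize over the single control u
```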

Differential dynamic programming

DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If

$$\ell(\mathbf{x}, \mathbf{u}) + V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1)$$

is the argument of the $\min[\cdot]$ operator in Eq. (2), let $Q$ be the variation of this quantity around the $i$-th $(\mathbf{x}, \mathbf{u})$ pair:

$$Q(\delta\mathbf{x}, \delta\mathbf{u}) \equiv \ell(\mathbf{x} + \delta\mathbf{x}, \mathbf{u} + \delta\mathbf{u}) + V(\mathbf{f}(\mathbf{x} + \delta\mathbf{x}, \mathbf{u} + \delta\mathbf{u}), i+1) - \ell(\mathbf{x}, \mathbf{u}) - V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1)$$

and expand to second order

$$Q(\delta\mathbf{x}, \delta\mathbf{u}) \approx \frac{1}{2}
\begin{bmatrix} 1 \\ \delta\mathbf{x} \\ \delta\mathbf{u} \end{bmatrix}^{\!\mathsf{T}}
\begin{bmatrix} 0 & Q_\mathbf{x}^\mathsf{T} & Q_\mathbf{u}^\mathsf{T} \\
Q_\mathbf{x} & Q_{\mathbf{x}\mathbf{x}} & Q_{\mathbf{x}\mathbf{u}} \\
Q_\mathbf{u} & Q_{\mathbf{u}\mathbf{x}} & Q_{\mathbf{u}\mathbf{u}} \end{bmatrix}
\begin{bmatrix} 1 \\ \delta\mathbf{x} \\ \delta\mathbf{u} \end{bmatrix} \tag{3}$$

The $Q$ notation used here is a variant of the notation of Morimoto, where subscripts denote differentiation in denominator layout.[5] Dropping the index $i$ for readability and using primes to denote the next time-step, $V' \equiv V(i+1)$, the expansion coefficients are

$$\begin{aligned}
Q_\mathbf{x} &= \ell_\mathbf{x} + \mathbf{f}_\mathbf{x}^\mathsf{T} V'_\mathbf{x} \\
Q_\mathbf{u} &= \ell_\mathbf{u} + \mathbf{f}_\mathbf{u}^\mathsf{T} V'_\mathbf{x} \\
Q_{\mathbf{x}\mathbf{x}} &= \ell_{\mathbf{x}\mathbf{x}} + \mathbf{f}_\mathbf{x}^\mathsf{T} V'_{\mathbf{x}\mathbf{x}}\,\mathbf{f}_\mathbf{x} + V'_\mathbf{x} \cdot \mathbf{f}_{\mathbf{x}\mathbf{x}} \\
Q_{\mathbf{u}\mathbf{u}} &= \ell_{\mathbf{u}\mathbf{u}} + \mathbf{f}_\mathbf{u}^\mathsf{T} V'_{\mathbf{x}\mathbf{x}}\,\mathbf{f}_\mathbf{u} + V'_\mathbf{x} \cdot \mathbf{f}_{\mathbf{u}\mathbf{u}} \\
Q_{\mathbf{u}\mathbf{x}} &= \ell_{\mathbf{u}\mathbf{x}} + \mathbf{f}_\mathbf{u}^\mathsf{T} V'_{\mathbf{x}\mathbf{x}}\,\mathbf{f}_\mathbf{x} + V'_\mathbf{x} \cdot \mathbf{f}_{\mathbf{u}\mathbf{x}}.
\end{aligned}$$

The last terms in the last three equations denote contraction of a vector with a tensor. Minimizing the quadratic approximation Eq. (3) with respect to $\delta\mathbf{u}$, we have

$$\delta\mathbf{u}^*(\delta\mathbf{x}) = \operatorname*{arg\,min}_{\delta\mathbf{u}} Q(\delta\mathbf{x}, \delta\mathbf{u}) = -Q_{\mathbf{u}\mathbf{u}}^{-1}\big(Q_\mathbf{u} + Q_{\mathbf{u}\mathbf{x}}\,\delta\mathbf{x}\big), \tag{4}$$

giving an open-loop term $\mathbf{k} = -Q_{\mathbf{u}\mathbf{u}}^{-1} Q_\mathbf{u}$ and a feedback gain term $\mathbf{K} = -Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}\mathbf{x}}$. Plugging the result back into Eq. (3), we now have a quadratic model of the value at time $i$:

$$\begin{aligned}
\Delta V(i) &= -\tfrac{1}{2} Q_\mathbf{u}^\mathsf{T} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_\mathbf{u} \\
V_\mathbf{x}(i) &= Q_\mathbf{x} - Q_{\mathbf{x}\mathbf{u}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_\mathbf{u} \\
V_{\mathbf{x}\mathbf{x}}(i) &= Q_{\mathbf{x}\mathbf{x}} - Q_{\mathbf{x}\mathbf{u}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}\mathbf{x}}.
\end{aligned}$$
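
A minimal sketch of one step of this backward pass, assuming the derivatives of $\ell$ and $\mathbf{f}$ at the nominal $(\mathbf{x}, \mathbf{u})$ pair are supplied by the caller; the second-derivative (tensor) terms of the dynamics are omitted here, which corresponds to the widely used iLQR simplification rather than full DDP.

```python
import numpy as np

def backward_step(l_x, l_u, l_xx, l_uu, l_ux, f_x, f_u, V_x, V_xx):
    """One step of the backward pass at time i, given the quadratic model of V at i+1.
    The dynamics' second derivatives (f_xx, f_uu, f_ux) are dropped (iLQR variant)."""
    Q_x  = l_x  + f_x.T @ V_x
    Q_u  = l_u  + f_u.T @ V_x
    Q_xx = l_xx + f_x.T @ V_xx @ f_x
    Q_uu = l_uu + f_u.T @ V_xx @ f_u
    Q_ux = l_ux + f_u.T @ V_xx @ f_x

    Q_uu_inv = np.linalg.inv(Q_uu)
    k = -Q_uu_inv @ Q_u            # open-loop term
    K = -Q_uu_inv @ Q_ux           # feedback gain term

    dV     = -0.5 * Q_u @ Q_uu_inv @ Q_u       # expected cost reduction
    V_x_i  = Q_x  - Q_ux.T @ Q_uu_inv @ Q_u    # gradient of V at time i
    V_xx_i = Q_xx - Q_ux.T @ Q_uu_inv @ Q_ux   # Hessian of V at time i
    return k, K, dV, V_x_i, V_xx_i
```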

Recursively computing the local quadratic models of $V(i)$ and the control modifications $\{\mathbf{k}(i), \mathbf{K}(i)\}$, from $i = N-1$ down to $i = 1$, constitutes the backward pass. As above, the value function is initialized with $V(\mathbf{x}, N) \equiv \ell_f(\mathbf{x}_N)$. Once the backward pass is completed, a forward pass computes a new trajectory:

$$\begin{aligned}
\hat{\mathbf{x}}(1) &= \mathbf{x}(1) \\
\hat{\mathbf{u}}(i) &= \mathbf{u}(i) + \mathbf{k}(i) + \mathbf{K}(i)\big(\hat{\mathbf{x}}(i) - \mathbf{x}(i)\big) \\
\hat{\mathbf{x}}(i+1) &= \mathbf{f}\big(\hat{\mathbf{x}}(i), \hat{\mathbf{u}}(i)\big)
\end{aligned}$$

The backward passes and forward passes are iterated until convergence.
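
A sketch of the forward pass under the same assumptions, where `f` is the dynamics function and the nominal trajectory and gains are stored as lists indexed by time:

```python
def forward_pass(f, x_nom, u_nom, k, K):
    """Roll out the updated controls, feeding back on the deviation from the
    nominal states: u_hat(i) = u(i) + k(i) + K(i) (x_hat(i) - x(i))."""
    x_new, u_new = [x_nom[0]], []            # the initial state is unchanged
    for i in range(len(u_nom)):
        u = u_nom[i] + k[i] + K[i] @ (x_new[i] - x_nom[i])
        u_new.append(u)
        x_new.append(f(x_new[i], u))         # x_hat(i+1) = f(x_hat(i), u_hat(i))
    return x_new, u_new
```

Alternating a backward pass over $i = N-1, \dots, 1$ (for example with a `backward_step` like the sketch above) and this forward pass until the cost stops decreasing reproduces the iteration described here.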

Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.[6][7] Regularization in the DDP context means ensuring that the $Q_{\mathbf{u}\mathbf{u}}$ matrix in Eq. (4) is positive definite. Line-search in DDP amounts to scaling the open-loop control modification $\mathbf{k}$ by some $0 < \alpha < 1$.
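
A hedged sketch of these two devices, assuming the `forward_pass` helper above and a `trajectory_cost` function that evaluates $J_0$ along a state/control trajectory; the specific regularization schedule and step sizes are illustrative assumptions.

```python
import numpy as np

def regularize(Q_uu, mu):
    """Shift the eigenvalues of Q_uu by mu >= 0 so that it is positive definite
    (a Levenberg-Marquardt-style choice; the schedule for mu is an assumption)."""
    return Q_uu + mu * np.eye(Q_uu.shape[0])

def line_search(f, trajectory_cost, x_nom, u_nom, k, K,
                alphas=(1.0, 0.5, 0.25, 0.125)):
    """Backtracking line search: scale the open-loop modification k by alpha and
    accept the first trajectory whose total cost improves on the nominal one."""
    J_old = trajectory_cost(x_nom, u_nom)
    for alpha in alphas:
        x_new, u_new = forward_pass(f, x_nom, u_nom, [alpha * ki for ki in k], K)
        if trajectory_cost(x_new, u_new) < J_old:
            return x_new, u_new
    return x_nom, u_nom    # no improving step found: keep the nominal trajectory
```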

Monte Carlo version

Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming.[8][9][10] It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution. This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution. The statistics can be recomputed from sampled trajectories without differentiation.
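
Concretely, with the quadratic model of Eq. (3) taken as the energy, the Boltzmann distribution over the control perturbation (shown here for $\delta\mathbf{x} = \mathbf{0}$, as a sketch of the correspondence) is Gaussian:

$$p(\delta\mathbf{u}) \;\propto\; \exp\!\big(-Q(\mathbf{0}, \delta\mathbf{u})\big) \;\propto\; \exp\!\Big(-\tfrac{1}{2}\,\delta\mathbf{u}^\mathsf{T} Q_{\mathbf{u}\mathbf{u}}\,\delta\mathbf{u} - Q_\mathbf{u}^\mathsf{T}\delta\mathbf{u}\Big),$$

i.e. a normal distribution with mean $-Q_{\mathbf{u}\mathbf{u}}^{-1} Q_\mathbf{u} = \mathbf{k}$ and covariance $Q_{\mathbf{u}\mathbf{u}}^{-1}$ (provided $Q_{\mathbf{u}\mathbf{u}}$ is positive definite), whose moments can be estimated from sampled controls and their costs instead of being computed from derivatives.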

Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming.[11] This creates a link between differential dynamic programming and path integral control,[12] which is a framework of stochastic optimal control.

Constrained problems

Interior Point Differential Dynamic Programming (IPDDP) is an interior-point method generalization of DDP that can address optimal control problems with nonlinear state and input constraints.[13]
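
The constrained finite-horizon problem that IPDDP addresses can be sketched as (the inequality-constraint notation $\mathbf{g}$ is illustrative, not the paper's exact formulation):

$$\min_{\mathbf{U}} \;\sum_{i=0}^{N-1} \ell(\mathbf{x}_i, \mathbf{u}_i) + \ell_f(\mathbf{x}_N) \quad \text{subject to} \quad \mathbf{x}_{i+1} = \mathbf{f}(\mathbf{x}_i, \mathbf{u}_i), \quad \mathbf{g}(\mathbf{x}_i, \mathbf{u}_i) \le \mathbf{0}.$$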


References
