Differential dynamic programming


Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne[1] and subsequently analysed in Jacobson and Mayne's eponymous book.[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.[3][4]

Finite-horizon discrete-time problems

The dynamics

𝐱i+1 = 𝐟(𝐱i, 𝐮i)   (1)

describe the evolution of the state 𝐱 given the control 𝐮 from time i to time i+1. The total cost J0 is the sum of running costs ℓ and final cost ℓf, incurred when starting from state 𝐱 and applying the control sequence 𝐔 ≡ {𝐮0, 𝐮1, …, 𝐮N−1} until the horizon is reached:

J0(๐ฑ,๐”)=โˆ‘i=0Nโˆ’1โ„“(๐ฑi,๐ฎi)+โ„“f(๐ฑN),

where ๐ฑ0โ‰ก๐ฑ, and the ๐ฑi for i>0 are given by Template:EquationNote. The solution of the optimal control problem is the minimizing control sequence ๐”โˆ—(๐ฑ)โ‰กargmin๐”J0(๐ฑ,๐”). Trajectory optimization means finding ๐”โˆ—(๐ฑ) for a particular ๐ฑ0, rather than for all possible initial states.

Dynamic programming

Let ๐”i be the partial control sequence ๐”iโ‰ก{๐ฎi,๐ฎi+1,๐ฎNโˆ’1} and define the cost-to-go Ji as the partial sum of costs from i to N:

Ji(๐ฑ,๐”i)=โˆ‘j=iNโˆ’1โ„“(๐ฑj,๐ฎj)+โ„“f(๐ฑN).

The optimal cost-to-go or value function at time i is the cost-to-go given the minimizing control sequence:

V(๐ฑ,i)โ‰กmin๐”iJi(๐ฑ,๐”i).

Setting V(๐ฑ,N)โ‰กโ„“f(๐ฑN), the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

V(𝐱, i) = min𝐮 [ℓ(𝐱, 𝐮) + V(𝐟(𝐱, 𝐮), i+1)]   (2)

This is the Bellman equation.
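On a small discretized problem the Bellman recursion can be carried out exactly by tabulating V over a state grid and minimizing over a finite control set. The grid, dynamics, and costs below are illustrative assumptions:

```python
import numpy as np

# Hypothetical scalar problem: states on a grid in [-1, 1], a few controls.
N = 5
states = np.linspace(-1.0, 1.0, 21)
controls = np.linspace(-0.5, 0.5, 11)

f = lambda x, u: np.clip(x + u, -1.0, 1.0)   # toy dynamics
ell = lambda x, u: x**2 + 0.1 * u**2          # running cost
ell_f = lambda x: x**2                        # final cost

V = ell_f(states)                             # V(x, N) = ell_f(x)
for i in reversed(range(N)):                  # proceed backwards in time
    V = np.array([
        # Bellman equation: minimize over a single control, looking up
        # the successor value at the nearest grid point.
        min(ell(x, u) + V[np.abs(states - f(x, u)).argmin()]
            for u in controls)
        for x in states
    ])
```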

Differential dynamic programming

DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If

โ„“(๐ฑ,๐ฎ)+V(๐Ÿ(๐ฑ,๐ฎ),i+1)

is the argument of the min[⋅] operator in the Bellman equation above, let Q be the variation of this quantity around the i-th (𝐱, 𝐮) pair:

Q(ฮด๐ฑ,ฮด๐ฎ)โ‰กโ„“(๐ฑ+ฮด๐ฑ,๐ฎ+ฮด๐ฎ)+V(๐Ÿ(๐ฑ+ฮด๐ฑ,๐ฎ+ฮด๐ฎ),i+1)โˆ’โ„“(๐ฑ,๐ฎ)โˆ’V(๐Ÿ(๐ฑ,๐ฎ),i+1)

and expand to second order

Q(δ𝐱, δ𝐮) ≈ Q𝐱𝖳 δ𝐱 + Q𝐮𝖳 δ𝐮 + ½ δ𝐱𝖳 Q𝐱𝐱 δ𝐱 + ½ δ𝐮𝖳 Q𝐮𝐮 δ𝐮 + δ𝐮𝖳 Q𝐮𝐱 δ𝐱   (3)

The Q notation used here is a variant of the notation of Morimoto, where subscripts denote differentiation in denominator layout.[5] Dropping the index i for readability and using primes to denote the next time-step, V′ ≡ V(i+1), the expansion coefficients are

Q๐ฑ=โ„“๐ฑ+๐Ÿ๐ฑ๐–ณV'๐ฑQ๐ฎ=โ„“๐ฎ+๐Ÿ๐ฎ๐–ณV'๐ฑQ๐ฑ๐ฑ=โ„“๐ฑ๐ฑ+๐Ÿ๐ฑ๐–ณV'๐ฑ๐ฑ๐Ÿ๐ฑ+V๐ฑโ‹…๐Ÿ๐ฑ๐ฑQ๐ฎ๐ฎ=โ„“๐ฎ๐ฎ+๐Ÿ๐ฎ๐–ณV'๐ฑ๐ฑ๐Ÿ๐ฎ+V'๐ฑโ‹…๐Ÿ๐ฎ๐ฎQ๐ฎ๐ฑ=โ„“๐ฎ๐ฑ+๐Ÿ๐ฎ๐–ณV'๐ฑ๐ฑ๐Ÿ๐ฑ+V'๐ฑโ‹…๐Ÿ๐ฎ๐ฑ.

The last terms in the last three equations denote contraction of a vector with a tensor. Minimizing the quadratic approximation above with respect to δ𝐮, we have

δ𝐮*(δ𝐱) = argminδ𝐮 Q(δ𝐱, δ𝐮) = −Q𝐮𝐮⁻¹ (Q𝐮 + Q𝐮𝐱 δ𝐱),   (4)

giving an open-loop term 𝐤 = −Q𝐮𝐮⁻¹ Q𝐮 and a feedback gain term 𝐊 = −Q𝐮𝐮⁻¹ Q𝐮𝐱. Plugging this result back into the expansion of Q, we now have a quadratic model of the value at time i:

ΔV(i) = −½ Q𝐮𝖳 Q𝐮𝐮⁻¹ Q𝐮
V𝐱(i) = Q𝐱 − Q𝐱𝐮 Q𝐮𝐮⁻¹ Q𝐮
V𝐱𝐱(i) = Q𝐱𝐱 − Q𝐱𝐮 Q𝐮𝐮⁻¹ Q𝐮𝐱.
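One step of the backward pass can be written directly from these formulas. In the sketch below the derivatives of ℓ and 𝐟 at the nominal pair are placeholder values, and the second-order dynamics terms (the tensor contractions) are dropped, which is the common iLQR simplification:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                                  # state / control dimensions

# Placeholder derivatives at the nominal (x, u) pair (assumed values).
l_x, l_u = rng.standard_normal(n), rng.standard_normal(m)
l_xx, l_uu, l_ux = np.eye(n), np.eye(m), np.zeros((m, n))
f_x = rng.standard_normal((n, n))
f_u = rng.standard_normal((n, m))
Vp_x, Vp_xx = rng.standard_normal(n), np.eye(n)   # V'_x, V'_xx

# Expansion coefficients (second-order dynamics terms omitted).
Q_x  = l_x + f_x.T @ Vp_x
Q_u  = l_u + f_u.T @ Vp_x
Q_xx = l_xx + f_x.T @ Vp_xx @ f_x
Q_uu = l_uu + f_u.T @ Vp_xx @ f_u
Q_ux = l_ux + f_u.T @ Vp_xx @ f_x

# Gains and the updated quadratic value model.
k = -np.linalg.solve(Q_uu, Q_u)              # open-loop term
K = -np.linalg.solve(Q_uu, Q_ux)             # feedback gains
dV   = -0.5 * Q_u @ np.linalg.solve(Q_uu, Q_u)
V_x  = Q_x + Q_ux.T @ k                      # = Q_x - Q_xu Q_uu^{-1} Q_u
V_xx = Q_xx + Q_ux.T @ K                     # = Q_xx - Q_xu Q_uu^{-1} Q_ux
```

Solving the linear systems with `np.linalg.solve` avoids forming Q𝐮𝐮⁻¹ explicitly.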

Recursively computing the local quadratic models of V(i) and the control modifications {𝐤(i), 𝐊(i)}, from i = N−1 down to i = 1, constitutes the backward pass. As above, the value function is initialized with V(𝐱, N) ≡ ℓf(𝐱N). Once the backward pass is completed, a forward pass computes a new trajectory:

𝐱̂(1) = 𝐱(1)
𝐮̂(i) = 𝐮(i) + 𝐤(i) + 𝐊(i) (𝐱̂(i) − 𝐱(i))
𝐱̂(i+1) = 𝐟(𝐱̂(i), 𝐮̂(i))

The backward passes and forward passes are iterated until convergence.
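Putting the two passes together: on a linear-quadratic problem the recursion above reduces to the LQR backward sweep, and DDP converges in a single iteration. A compact sketch with assumed problem data (a discretized double integrator and quadratic costs; not from the article):

```python
import numpy as np

# Assumed linear-quadratic problem data.
n, m, N = 2, 1, 20
A = np.array([[1.0, 0.1], [0.0, 1.0]])      # f(x, u) = A x + B u
B = np.array([[0.0], [0.1]])
Qc, Rc = np.eye(n), 0.1 * np.eye(m)         # running cost weights
Qf = 10.0 * np.eye(n)                       # final cost weight

x0 = np.array([1.0, 0.0])
us = [np.zeros(m) for _ in range(N)]
xs = [x0]
for u in us:                                 # nominal rollout
    xs.append(A @ xs[-1] + B @ u)

for it in range(2):                          # iterate the two passes
    # Backward pass: value model initialized from the final cost.
    V_x, V_xx = Qf @ xs[N], Qf
    ks, Ks = [], []
    for i in reversed(range(N)):
        Q_x  = Qc @ xs[i] + A.T @ V_x
        Q_u  = Rc @ us[i] + B.T @ V_x
        Q_xx = Qc + A.T @ V_xx @ A
        Q_uu = Rc + B.T @ V_xx @ B
        Q_ux = B.T @ V_xx @ A
        k = -np.linalg.solve(Q_uu, Q_u)
        K = -np.linalg.solve(Q_uu, Q_ux)
        V_x, V_xx = Q_x + Q_ux.T @ k, Q_xx + Q_ux.T @ K
        ks.append(k); Ks.append(K)
    ks.reverse(); Ks.reverse()
    # Forward pass: apply open-loop and feedback corrections.
    new_xs, new_us = [x0], []
    for i in range(N):
        u = us[i] + ks[i] + Ks[i] @ (new_xs[i] - xs[i])
        new_us.append(u)
        new_xs.append(A @ new_xs[i] + B @ u)
    xs, us = new_xs, new_us

# Total cost of the optimized trajectory (zero control would cost 30.0).
J = sum(x @ Qc @ x + u @ Rc @ u for x, u in zip(xs[:-1], us)) \
    + xs[-1] @ Qf @ xs[-1]
```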

Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.[6][7] Regularization in the DDP context means ensuring that the Q𝐮𝐮 matrix in the expansion of Q is positive definite. Line-search in DDP amounts to scaling the open-loop control modification 𝐤 by some 0 < α < 1.
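Both safeguards are straightforward to sketch. Below, an assumed Tikhonov-style regularization bumps Q𝐮𝐮 by a multiple of the identity until a Cholesky factorization succeeds, and the line-search scale α enters only the open-loop term of the forward-pass control update (generic forms, not specific to any particular implementation):

```python
import numpy as np

def regularize(Q_uu, mu=1e-6):
    """Add mu * I, increasing mu until Q_uu + mu * I is positive
    definite (detected by a successful Cholesky factorization)."""
    I = np.eye(Q_uu.shape[0])
    while True:
        try:
            np.linalg.cholesky(Q_uu + mu * I)
            return Q_uu + mu * I
        except np.linalg.LinAlgError:
            mu *= 10.0

Q_uu = np.array([[1.0, 0.0], [0.0, -0.5]])   # indefinite example
Q_uu_reg = regularize(Q_uu)

def forward_control(u_nom, k, K, x, x_nom, alpha):
    """Forward-pass control update with line-search: only the
    open-loop term k is scaled by alpha; the feedback term is not."""
    return u_nom + alpha * k + K @ (x - x_nom)
```

A typical line search tries a decreasing schedule of α values and accepts the first forward pass whose cost improves on the nominal trajectory.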

Monte Carlo version

Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming.[8][9][10] It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution. This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution. The statistics can be recomputed from sampled trajectories without differentiation.

Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming.[11] This creates a link between differential dynamic programming and path integral control,[12] which is a framework of stochastic optimal control.

Constrained problems

Interior Point Differential dynamic programming (IPDDP) is an interior-point method generalization of DDP that can address the optimal control problem with nonlinear state and input constraints.[13]


References

Template:Reflist