Actor-critic algorithm

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL algorithms, such as policy gradient methods, with value-based RL algorithms, such as value iteration, Q-learning, SARSA, and TD learning.[1]

An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[2] Some AC algorithms are on-policy, others are off-policy. Some apply only to discrete action spaces, some only to continuous action spaces, and some to both.

Overview

Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline that reduces the variance of the gradient estimate.

Actor

The actor uses a policy function $\pi(a \mid s)$, while the critic estimates either the value function $V(s)$, the action-value function $Q(s,a)$, the advantage function $A(s,a)$, or any combination thereof.

The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$ over actions.

If the action space is discrete, then $\sum_{a} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_{a} \pi_\theta(a \mid s) \, da = 1$.
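
For concreteness, the following is a minimal NumPy sketch of a discrete-action actor: a hypothetical linear-softmax parameterization (the function name, feature dimension, and action count are illustrative assumptions, not part of the article) whose output distribution sums to 1 as required.

    import numpy as np

    def softmax_policy(theta, s):
        """Linear-softmax actor: returns pi_theta(.|s) over discrete actions.

        theta: (num_actions, state_dim) parameter matrix.
        s:     (state_dim,) state feature vector.
        """
        logits = theta @ s                      # one logit per action
        logits -= logits.max()                  # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()              # sums to 1 by construction

    # Example: 4 actions, 3-dimensional state features (illustrative values).
    rng = np.random.default_rng(0)
    theta = rng.normal(size=(4, 3))
    s = rng.normal(size=3)
    pi = softmax_policy(theta, s)
    print(pi, pi.sum())                         # a probability vector, and 1.0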

The goal of policy optimization is to improve the actor; that is, to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \gamma^t r_t \right]$
where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time horizon (which can be infinite).
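
The expected episodic reward can be approximated by averaging discounted returns over sampled episodes. The sketch below illustrates that computation; the function name and the reward sequences are illustrative assumptions, not taken from the article.

    import numpy as np

    def discounted_return(rewards, gamma):
        """Sum_t gamma^t * r_t for a single episode."""
        return sum(gamma**t * r for t, r in enumerate(rewards))

    # J(theta) is approximated by averaging over episodes sampled from pi_theta.
    episodes = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5, 1.0]]   # made-up reward sequences
    gamma = 0.99
    J_estimate = np.mean([discounted_return(r, gamma) for r in episodes])
    print(J_estimate)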

The goal of the policy gradient method is to optimize $J(\theta)$ by gradient ascent on the policy gradient $\nabla_\theta J(\theta)$.

As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{0 \le j \le T} \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \, \Psi_j \,\Big|\, S_0 = s_0 \right]$
where $\Psi_j$ is a linear sum of the following:

  • $\sum_{0 \le i \le T} \gamma^i R_i$.
  • $\gamma^j \sum_{j \le i \le T} \gamma^{i-j} R_i$: the REINFORCE algorithm.
  • $\gamma^j \left( \sum_{j \le i \le T} \gamma^{i-j} R_i - b(S_j) \right)$: the REINFORCE with baseline algorithm (sketched in code after this list). Here $b$ is an arbitrary function.
  • $\gamma^j \left( R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j) \right)$: TD(1) learning.
  • $\gamma^j Q^{\pi_\theta}(S_j, A_j)$.
  • $\gamma^j A^{\pi_\theta}(S_j, A_j)$: Advantage Actor-Critic (A2C).[3]
  • $\gamma^j \left( R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}(S_{j+2}) - V^{\pi_\theta}(S_j) \right)$: TD(2) learning.
  • $\gamma^j \left( \sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j) \right)$: TD(n) learning.
  • $\gamma^j \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \left( \sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j) \right)$: TD(λ) learning, also known as GAE (generalized advantage estimate).[4] This is obtained by an exponentially decaying sum of the TD(n) learning terms.
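
As a concrete illustration of one estimator from the list above, the following sketch forms the REINFORCE-with-baseline gradient estimate for a hypothetical linear-softmax actor. The helper names, the zero baseline, and the rollout data are assumptions made for the example, not part of the article.

    import numpy as np

    def softmax_policy(theta, s):
        logits = theta @ s
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    def grad_log_pi(theta, s, a):
        """grad_theta ln pi_theta(a|s) for a linear-softmax actor."""
        pi = softmax_policy(theta, s)
        one_hot = np.zeros(len(pi))
        one_hot[a] = 1.0
        return np.outer(one_hot - pi, s)         # same shape as theta

    def reinforce_with_baseline_grad(theta, states, actions, rewards, baseline, gamma):
        """Single-trajectory estimate of grad_theta J using
        Psi_j = gamma^j * (sum_{i>=j} gamma^{i-j} R_i - b(S_j))."""
        T = len(rewards)
        grad = np.zeros_like(theta)
        for j in range(T):
            G_j = sum(gamma**(i - j) * rewards[i] for i in range(j, T))  # reward-to-go
            psi = gamma**j * (G_j - baseline(states[j]))
            grad += grad_log_pi(theta, states[j], actions[j]) * psi
        return grad

    # Illustrative rollout: 3 actions, 2-dimensional states, zero baseline.
    rng = np.random.default_rng(1)
    theta = rng.normal(size=(3, 2))
    states = [rng.normal(size=2) for _ in range(4)]
    actions = [0, 2, 1, 0]
    rewards = [1.0, 0.0, 0.5, 1.0]
    g = reinforce_with_baseline_grad(theta, states, actions, rewards, lambda s: 0.0, gamma=0.99)
    theta += 0.01 * g                             # one gradient-ascent step on J(theta)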

Critic

In the unbiased estimators given above, certain functions such as $V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}$ appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms.

For example, if the critic is estimating the state-value function $V^{\pi_\theta}(s)$, then it can be learned by any value function approximation method. Let the critic be a function approximator $V_\phi(s)$ with parameters $\phi$.

The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error
$\delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i).$
The critic parameters are updated by gradient descent on the squared TD error:
$\phi \leftarrow \phi - \alpha \nabla_\phi \tfrac{1}{2}(\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)$
where $\alpha$ is the learning rate. Note that the gradient is taken with respect to the $\phi$ in $V_\phi(S_i)$ only, since the $\phi$ in $\gamma V_\phi(S_{i+1})$ constitutes a moving target, and the gradient is not taken with respect to it. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
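
To make the "stopping the gradient" point concrete, here is a minimal sketch of the semi-gradient TD(1) update for a hypothetical linear critic $V_\phi(s) = \phi^\top s$ (the linear parameterization, function name, and data are assumptions for illustration). The bootstrap target is treated as a constant, so the update uses only $\nabla_\phi V_\phi(S_i) = S_i$.

    import numpy as np

    def td1_critic_update(phi, s, r, s_next, alpha, gamma, done=False):
        """Semi-gradient TD(1) update for a linear critic V_phi(s) = phi @ s.

        The bootstrap target r + gamma * V_phi(s_next) is treated as a constant
        ("stop gradient"): the update uses grad_phi V_phi(s) = s only.
        """
        v = phi @ s
        v_next = 0.0 if done else phi @ s_next   # no gradient flows through this term
        delta = r + gamma * v_next - v           # TD(1) error
        return phi + alpha * delta * s           # phi <- phi + alpha * delta * grad V_phi(s)

    # Illustrative transition with 3-dimensional state features.
    rng = np.random.default_rng(2)
    phi = np.zeros(3)
    s, s_next = rng.normal(size=3), rng.normal(size=3)
    phi = td1_critic_update(phi, s, r=1.0, s_next=s_next, alpha=0.1, gamma=0.99)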

Similarly, if the critic is estimating the action-value function $Q^{\pi_\theta}$, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by $\phi$ and denoted $Q_\phi(s,a)$. The temporal difference error is then calculated as $\delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i, A_i)$, and the critic is updated by
$\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i).$

An advantage critic can be trained by learning both a Q-function $Q_\phi(s,a)$ and a state-value function $V_\phi(s)$, then setting $A_\phi(s,a) = Q_\phi(s,a) - V_\phi(s)$. However, it is more common to train just a state-value function $V_\phi(s)$ and estimate the advantage by[3]
$A_\phi(S_i, A_i) \approx \sum_{j=0}^{n-1} \gamma^j R_{i+j} + \gamma^n V_\phi(S_{i+n}) - V_\phi(S_i).$
Here, $n$ is a positive integer. The higher $n$ is, the lower the bias in the advantage estimate, but at the price of higher variance.
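
The $n$-step advantage estimate above translates directly into code. The sketch below assumes a callable value estimate and an episode long enough to look $n$ steps ahead; all names and data are illustrative, not from the article.

    import numpy as np

    def n_step_advantage(V, states, rewards, i, n, gamma):
        """A_hat(S_i, A_i) = sum_{j=0}^{n-1} gamma^j R_{i+j}
                             + gamma^n V(S_{i+n}) - V(S_i)."""
        G = sum(gamma**j * rewards[i + j] for j in range(n))
        return G + gamma**n * V(states[i + n]) - V(states[i])

    # Illustrative use with a linear critic V_phi(s) = phi @ s.
    rng = np.random.default_rng(3)
    phi = rng.normal(size=2)
    V = lambda s: phi @ s
    states = [rng.normal(size=2) for _ in range(6)]
    rewards = [1.0, 0.0, 0.5, 1.0, 0.0]
    adv = n_step_advantage(V, states, rewards, i=0, n=3, gamma=0.99)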

The Generalized Advantage Estimation (GAE) introduces a hyperparameter $\lambda$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$: high variance, no bias) and 1-step TD learning ($\lambda = 0$: low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of $n$-step returns, with $\lambda$ being the decay strength.[4]
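
In practice, GAE is typically computed by a backward recursion over one-step TD errors, which is equivalent to the exponentially weighted sum of $n$-step advantage estimates on a truncated trajectory. The sketch below follows that recursion; the function name, array layout (one extra bootstrap value), and example numbers are assumptions for illustration.

    import numpy as np

    def gae(rewards, values, gamma, lam):
        """Generalized advantage estimates for one trajectory.

        rewards: r_0 .. r_{T-1}
        values:  V(s_0) .. V(s_T)  (one extra bootstrap value)
        Uses the recursion A_t = delta_t + gamma * lam * A_{t+1},
        with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        """
        T = len(rewards)
        adv = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running
            adv[t] = running
        return adv

    # lam = 0 gives 1-step TD errors; lam = 1 gives the Monte Carlo return
    # minus V (when the final bootstrap value is 0).
    rewards = np.array([1.0, 0.0, 0.5, 1.0])
    values = np.array([0.2, 0.1, 0.3, 0.0, 0.0])
    print(gae(rewards, values, gamma=0.99, lam=0.95))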

Variants

  • Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
  • Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[5]
  • Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[6]

See also

References
