Actor-critic algorithm

The actor-critic algorithm (AC) is a family of reinforcement learning (RL) algorithms that combine policy-based RL algorithms, such as policy gradient methods, with value-based RL algorithms, such as value iteration, Q-learning, SARSA, and TD learning.[1]

An AC algorithm consists of two main components: an "actor" that determines which actions to take according to a policy function, and a "critic" that evaluates those actions according to a value function.[2] Some AC algorithms are on-policy, others are off-policy. Some apply only to discrete action spaces, some only to continuous action spaces, and some to both.

Overview

Actor-critic methods can be understood as an improvement over pure policy gradient methods such as REINFORCE, obtained by introducing a baseline that reduces the variance of the gradient estimate.

Actor

The actor uses a policy function $\pi(a \mid s)$, while the critic estimates either the value function $V(s)$, the action-value function $Q(s,a)$, the advantage function $A(s,a)$, or any combination thereof.

The actor is a parameterized function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$ over actions.

If the action space is discrete, then $\sum_{a} \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_{a} \pi_\theta(a \mid s) \, da = 1$.
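
For concreteness, the following is a minimal NumPy sketch of a discrete-action actor: a hypothetical linear-softmax parameterization (the function name, feature dimension, and action count are illustrative assumptions, not part of the article) whose output distribution sums to 1 as required.

    import numpy as np

    def softmax_policy(theta, s):
        """Linear-softmax actor: returns pi_theta(.|s) over discrete actions.

        theta: (num_actions, state_dim) parameter matrix.
        s:     (state_dim,) state feature vector.
        """
        logits = theta @ s                      # one logit per action
        logits -= logits.max()                  # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()              # sums to 1 by construction

    # Example: 4 actions, 3-dimensional state features (illustrative values).
    rng = np.random.default_rng(0)
    theta = rng.normal(size=(4, 3))
    s = rng.normal(size=3)
    pi = softmax_policy(theta, s)
    print(pi, pi.sum())                         # a probability vector, and 1.0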

The goal of policy optimization is to improve the actor; that is, to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t=0}^{T} \gamma^t r_t \right]$
where $\gamma$ is the discount factor, $r_t$ is the reward at step $t$, and $T$ is the time horizon (which can be infinite).
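
The expected episodic reward can be approximated by averaging discounted returns over sampled episodes. The sketch below illustrates that computation; the function name and the reward sequences are illustrative assumptions, not taken from the article.

    import numpy as np

    def discounted_return(rewards, gamma):
        """Sum_t gamma^t * r_t for a single episode."""
        return sum(gamma**t * r for t, r in enumerate(rewards))

    # J(theta) is approximated by averaging over episodes sampled from pi_theta.
    episodes = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5, 1.0]]   # made-up reward sequences
    gamma = 0.99
    J_estimate = np.mean([discounted_return(r, gamma) for r in episodes])
    print(J_estimate)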

The goal of the policy gradient method is to optimize $J(\theta)$ by gradient ascent on the policy gradient $\nabla_\theta J(\theta)$.

As detailed on the policy gradient method page, there are many unbiased estimators of the policy gradient:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{0 \le j \le T} \nabla_\theta \ln \pi_\theta(A_j \mid S_j) \, \Psi_j \,\Big|\, S_0 = s_0 \right]$
where $\Psi_j$ is a linear sum of the following:

  • $\sum_{0 \le i \le T} \gamma^i R_i$.
  • $\gamma^j \sum_{j \le i \le T} \gamma^{i-j} R_i$: the REINFORCE algorithm.
  • $\gamma^j \left( \sum_{j \le i \le T} \gamma^{i-j} R_i - b(S_j) \right)$: the REINFORCE with baseline algorithm (sketched in code after this list). Here $b$ is an arbitrary function.
  • $\gamma^j \left( R_j + \gamma V^{\pi_\theta}(S_{j+1}) - V^{\pi_\theta}(S_j) \right)$: TD(1) learning.
  • $\gamma^j Q^{\pi_\theta}(S_j, A_j)$.
  • $\gamma^j A^{\pi_\theta}(S_j, A_j)$: Advantage Actor-Critic (A2C).[3]
  • $\gamma^j \left( R_j + \gamma R_{j+1} + \gamma^2 V^{\pi_\theta}(S_{j+2}) - V^{\pi_\theta}(S_j) \right)$: TD(2) learning.
  • $\gamma^j \left( \sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j) \right)$: TD(n) learning.
  • $\gamma^j \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \left( \sum_{k=0}^{n-1} \gamma^k R_{j+k} + \gamma^n V^{\pi_\theta}(S_{j+n}) - V^{\pi_\theta}(S_j) \right)$: TD(λ) learning, also known as GAE (generalized advantage estimate).[4] This is obtained by an exponentially decaying sum of the TD(n) learning terms.
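
As a concrete illustration of one estimator from the list above, the following sketch forms the REINFORCE-with-baseline gradient estimate for a hypothetical linear-softmax actor. The helper names, the zero baseline, and the rollout data are assumptions made for the example, not part of the article.

    import numpy as np

    def softmax_policy(theta, s):
        logits = theta @ s
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    def grad_log_pi(theta, s, a):
        """grad_theta ln pi_theta(a|s) for a linear-softmax actor."""
        pi = softmax_policy(theta, s)
        one_hot = np.zeros(len(pi))
        one_hot[a] = 1.0
        return np.outer(one_hot - pi, s)         # same shape as theta

    def reinforce_with_baseline_grad(theta, states, actions, rewards, baseline, gamma):
        """Single-trajectory estimate of grad_theta J using
        Psi_j = gamma^j * (sum_{i>=j} gamma^{i-j} R_i - b(S_j))."""
        T = len(rewards)
        grad = np.zeros_like(theta)
        for j in range(T):
            G_j = sum(gamma**(i - j) * rewards[i] for i in range(j, T))  # reward-to-go
            psi = gamma**j * (G_j - baseline(states[j]))
            grad += grad_log_pi(theta, states[j], actions[j]) * psi
        return grad

    # Illustrative rollout: 3 actions, 2-dimensional states, zero baseline.
    rng = np.random.default_rng(1)
    theta = rng.normal(size=(3, 2))
    states = [rng.normal(size=2) for _ in range(4)]
    actions = [0, 2, 1, 0]
    rewards = [1.0, 0.0, 0.5, 1.0]
    g = reinforce_with_baseline_grad(theta, states, actions, rewards, lambda s: 0.0, gamma=0.99)
    theta += 0.01 * g                             # one gradient-ascent step on J(theta)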

Critic

In the unbiased estimators given above, certain functions such as $V^{\pi_\theta}, Q^{\pi_\theta}, A^{\pi_\theta}$ appear. These are approximated by the critic. Since these functions all depend on the actor, the critic must learn alongside the actor. The critic is learned by value-based RL algorithms.

For example, if the critic is estimating the state-value function $V^{\pi_\theta}(s)$, then it can be learned by any value function approximation method. Let the critic be a function approximator $V_\phi(s)$ with parameters $\phi$.

The simplest example is TD(1) learning, which trains the critic to minimize the TD(1) error
$\delta_i = R_i + \gamma V_\phi(S_{i+1}) - V_\phi(S_i).$
The critic parameters are updated by gradient descent on the squared TD error:
$\phi \leftarrow \phi - \alpha \nabla_\phi \tfrac{1}{2}(\delta_i)^2 = \phi + \alpha \delta_i \nabla_\phi V_\phi(S_i)$
where $\alpha$ is the learning rate. Note that the gradient is taken with respect to the $\phi$ in $V_\phi(S_i)$ only, since the $\phi$ in $\gamma V_\phi(S_{i+1})$ constitutes a moving target, and the gradient is not taken with respect to it. This is a common source of error in implementations that use automatic differentiation, and requires "stopping the gradient" at that point.
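
To make the "stopping the gradient" point concrete, here is a minimal sketch of the semi-gradient TD(1) update for a hypothetical linear critic $V_\phi(s) = \phi^\top s$ (the linear parameterization, function name, and data are assumptions for illustration). The bootstrap target is treated as a constant, so the update uses only $\nabla_\phi V_\phi(S_i) = S_i$.

    import numpy as np

    def td1_critic_update(phi, s, r, s_next, alpha, gamma, done=False):
        """Semi-gradient TD(1) update for a linear critic V_phi(s) = phi @ s.

        The bootstrap target r + gamma * V_phi(s_next) is treated as a constant
        ("stop gradient"): the update uses grad_phi V_phi(s) = s only.
        """
        v = phi @ s
        v_next = 0.0 if done else phi @ s_next   # no gradient flows through this term
        delta = r + gamma * v_next - v           # TD(1) error
        return phi + alpha * delta * s           # phi <- phi + alpha * delta * grad V_phi(s)

    # Illustrative transition with 3-dimensional state features.
    rng = np.random.default_rng(2)
    phi = np.zeros(3)
    s, s_next = rng.normal(size=3), rng.normal(size=3)
    phi = td1_critic_update(phi, s, r=1.0, s_next=s_next, alpha=0.1, gamma=0.99)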

Similarly, if the critic is estimating the action-value function $Q^{\pi_\theta}$, then it can be learned by Q-learning or SARSA. In SARSA, the critic maintains an estimate of the Q-function, parameterized by $\phi$ and denoted $Q_\phi(s,a)$. The temporal difference error is then calculated as $\delta_i = R_i + \gamma Q_\phi(S_{i+1}, A_{i+1}) - Q_\phi(S_i, A_i)$, and the critic is updated by
$\phi \leftarrow \phi + \alpha \delta_i \nabla_\phi Q_\phi(S_i, A_i).$

An advantage critic can be trained by learning both a Q-function $Q_\phi(s,a)$ and a state-value function $V_\phi(s)$, then setting $A_\phi(s,a) = Q_\phi(s,a) - V_\phi(s)$. However, it is more common to train just a state-value function $V_\phi(s)$ and estimate the advantage by[3]
$A_\phi(S_i, A_i) \approx \sum_{j=0}^{n-1} \gamma^j R_{i+j} + \gamma^n V_\phi(S_{i+n}) - V_\phi(S_i).$
Here, $n$ is a positive integer. The higher $n$ is, the lower the bias in the advantage estimate, but at the price of higher variance.
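
The $n$-step advantage estimate above translates directly into code. The sketch below assumes a callable value estimate and an episode long enough to look $n$ steps ahead; all names and data are illustrative, not from the article.

    import numpy as np

    def n_step_advantage(V, states, rewards, i, n, gamma):
        """A_hat(S_i, A_i) = sum_{j=0}^{n-1} gamma^j R_{i+j}
                             + gamma^n V(S_{i+n}) - V(S_i)."""
        G = sum(gamma**j * rewards[i + j] for j in range(n))
        return G + gamma**n * V(states[i + n]) - V(states[i])

    # Illustrative use with a linear critic V_phi(s) = phi @ s.
    rng = np.random.default_rng(3)
    phi = rng.normal(size=2)
    V = lambda s: phi @ s
    states = [rng.normal(size=2) for _ in range(6)]
    rewards = [1.0, 0.0, 0.5, 1.0, 0.0]
    adv = n_step_advantage(V, states, rewards, i=0, n=3, gamma=0.99)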

The Generalized Advantage Estimation (GAE) introduces a hyperparameter $\lambda$ that smoothly interpolates between Monte Carlo returns ($\lambda = 1$: high variance, no bias) and 1-step TD learning ($\lambda = 0$: low variance, high bias). This hyperparameter can be adjusted to pick the optimal bias-variance trade-off in advantage estimation. It uses an exponentially decaying average of $n$-step returns, with $\lambda$ being the decay strength.[4]
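
In practice, GAE is typically computed by a backward recursion over one-step TD errors, which is equivalent to the exponentially weighted sum of $n$-step advantage estimates on a truncated trajectory. The sketch below follows that recursion; the function name, array layout (one extra bootstrap value), and example numbers are assumptions for illustration.

    import numpy as np

    def gae(rewards, values, gamma, lam):
        """Generalized advantage estimates for one trajectory.

        rewards: r_0 .. r_{T-1}
        values:  V(s_0) .. V(s_T)  (one extra bootstrap value)
        Uses the recursion A_t = delta_t + gamma * lam * A_{t+1},
        with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
        """
        T = len(rewards)
        adv = np.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running
            adv[t] = running
        return adv

    # lam = 0 gives 1-step TD errors; lam = 1 gives the Monte Carlo return
    # minus V (when the final bootstrap value is 0).
    rewards = np.array([1.0, 0.0, 0.5, 1.0])
    values = np.array([0.2, 0.1, 0.3, 0.0, 0.0])
    print(gae(rewards, values, gamma=0.99, lam=0.95))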

Variants

  • Asynchronous Advantage Actor-Critic (A3C): Parallel and asynchronous version of A2C.[3]
  • Soft Actor-Critic (SAC): Incorporates entropy maximization for improved exploration.[5]
  • Deep Deterministic Policy Gradient (DDPG): Specialized for continuous action spaces.[6]

See also

References
