Gated recurrent unit

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.[1] The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features,[2] but it lacks a context vector and an output gate, resulting in fewer parameters than the LSTM.[3] The GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of the LSTM.[4][5] GRUs showed that gating is indeed helpful in general, although Bengio's team came to no concrete conclusion on which of the two gating units was better.[6][7]

Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.[8]

The operator $\odot$ denotes the Hadamard product in the following.

Fully gated unit

Gated Recurrent Unit, fully gated version

Initially, for $t = 0$, the output vector is $h_0 = 0$.

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\hat{h}_t &= \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{aligned}
$$

Variables ($d$ denotes the number of input features and $e$ the number of output features):

  • $x_t \in \mathbb{R}^d$: input vector
  • $h_t \in \mathbb{R}^e$: output vector
  • $\hat{h}_t \in \mathbb{R}^e$: candidate activation vector
  • $z_t \in (0,1)^e$: update gate vector
  • $r_t \in (0,1)^e$: reset gate vector
  • $W \in \mathbb{R}^{e \times d}$, $U \in \mathbb{R}^{e \times e}$ and $b \in \mathbb{R}^e$: parameter matrices and vector which need to be learned during training
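
For concreteness, the following is a minimal NumPy sketch of one forward step of the fully gated unit, followed by a short usage example. The function name gru_step, the parameter dictionary and the random example values are illustrative assumptions; $\phi$ is taken to be tanh and $\sigma$ the logistic sigmoid, as in the original formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One forward step of the fully gated GRU described above.

    x_t has shape (d,), h_prev has shape (e,). p is a dict holding the
    parameters W_* (e, d), U_* (e, e) and b_* (e,), named after the equations.
    """
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])            # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])            # reset gate
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])  # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_hat                               # new hidden state

# Example with random parameters: d = 3 input features, e = 2 output features.
rng = np.random.default_rng(0)
d, e = 3, 2
p = {name: rng.normal(size=(e, d)) for name in ("W_z", "W_r", "W_h")}
p.update({name: rng.normal(size=(e, e)) for name in ("U_z", "U_r", "U_h")})
p.update({name: np.zeros(e) for name in ("b_z", "b_r", "b_h")})
h = np.zeros(e)                    # h_0 = 0
for x in rng.normal(size=(5, d)):  # run five time steps
    h = gru_step(x, h, p)
print(h)
```

With the convention above, an update gate close to 0 keeps the previous state almost unchanged, while a gate close to 1 replaces it with the candidate activation.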

Activation functions

In the original formulation, $\sigma$ is the logistic sigmoid and $\phi$ is the hyperbolic tangent. Alternative activation functions are possible, provided that $\sigma(x) \in [0,1]$.

Alternate forms can be created by changing $z_t$ and $r_t$:[9]

  • Type 1, each gate depends only on the previous hidden state and the bias.
    $z_t = \sigma(U_z h_{t-1} + b_z)$, $r_t = \sigma(U_r h_{t-1} + b_r)$
  • Type 2, each gate depends only on the previous hidden state.
    $z_t = \sigma(U_z h_{t-1})$, $r_t = \sigma(U_r h_{t-1})$
  • Type 3, each gate is computed using only the bias.
    $z_t = \sigma(b_z)$, $r_t = \sigma(b_r)$
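
The difference between the three forms can be sketched with a single hypothetical helper that computes either gate, depending on which parameters are passed in (the function name gate and the variant argument are illustrative, not part of the cited formulation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate(h_prev, U, b, variant):
    """Compute a gate vector (z_t or r_t) under the three simplified forms."""
    if variant == 1:
        return sigmoid(U @ h_prev + b)  # previous hidden state and bias
    if variant == 2:
        return sigmoid(U @ h_prev)      # previous hidden state only
    if variant == 3:
        return sigmoid(b)               # bias only: the gate is constant over time
    raise ValueError("variant must be 1, 2 or 3")
```

In Type 3 the gate no longer depends on the input or the hidden state, so it acts as a learned constant mixing coefficient.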

Minimal gated unit

The minimal gated unit (MGU) is similar to the fully gated unit, except that the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:[10]

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
\hat{h}_t &= \phi(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{aligned}
$$

Variables

  • $x_t$: input vector
  • $h_t$: output vector
  • $\hat{h}_t$: candidate activation vector
  • $f_t$: forget vector
  • $W$, $U$ and $b$: parameter matrices and vector
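
A minimal sketch of one MGU step, under the same assumptions as the earlier GRU sketch (NumPy vectors, tanh for $\phi$, illustrative function and parameter names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgu_step(x_t, h_prev, W_f, U_f, b_f, W_h, U_h, b_h):
    """One forward step of the minimal gated unit: a single forget gate f_t
    takes over the roles of both the update and reset gates of the full GRU."""
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)            # forget gate
    h_hat = np.tanh(W_h @ x_t + U_h @ (f_t * h_prev) + b_h)  # candidate activation
    return (1.0 - f_t) * h_prev + f_t * h_hat                # new hidden state
```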

Light gated recurrent unit

The light gated recurrent unit (LiGRU)[4] removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):

$$
\begin{aligned}
z_t &= \sigma(\operatorname{BN}(W_z x_t) + U_z h_{t-1}) \\
\tilde{h}_t &= \operatorname{ReLU}(\operatorname{BN}(W_h x_t) + U_h h_{t-1}) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
$$
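
A rough NumPy sketch of one LiGRU step is given below. Because batch normalization needs statistics over a minibatch, the inputs are assumed to carry a batch dimension, and the simplified batch_norm helper omits the learnable scale and shift of full batch normalization; all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch dimension (training-mode statistics)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def ligru_step(x_t, h_prev, W_z, U_z, W_h, U_h):
    """One LiGRU step on a minibatch: x_t has shape (batch, d), h_prev (batch, e),
    W_* have shape (e, d) and U_* have shape (e, e). There is no reset gate and
    the candidate state uses ReLU instead of tanh."""
    z_t = sigmoid(batch_norm(x_t @ W_z.T) + h_prev @ U_z.T)              # update gate
    h_tilde = np.maximum(0.0, batch_norm(x_t @ W_h.T) + h_prev @ U_h.T)  # ReLU candidate
    return z_t * h_prev + (1.0 - z_t) * h_tilde                          # new hidden state
```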

LiGRU has been studied from a Bayesian perspective.[11] This analysis yielded a variant called light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.

References

Template:Reflist
