Entry 24 of 24
ML Fundamentals Series

LSTM Gives a Network Three Gates to Choose What to Forget; GRU Does It With Two

Vanilla RNNs struggle to retain information for more than a handful of timesteps because of vanishing gradients: the training signal shrinks each time it is propagated back through a step, so early inputs end up with almost no influence on learning. That defeats the entire point of using a recurrent network on long sequences. LSTM (Long Short-Term Memory) is the fix: an improved RNN cell explicitly designed to capture long-term dependencies by giving the network a separate memory pathway that gradients can flow through without shrinking at every step.
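
To put a number on the vanishing-gradient problem, here is a minimal sketch for a one-unit RNN, h_t = tanh(w * h_{t-1} + x_t). The function name, the weight 0.9, and the constant input 0.5 are illustrative assumptions, not values from any real model. The gradient of the final state with respect to the initial state is a product of one local derivative per timestep, and each factor here is smaller than 1.

```python
import math

def scalar_rnn_gradient(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# Gradient reaching the first state after 1, 10, and 50 timesteps.
grads = {T: scalar_rnn_gradient(0.9, [0.5] * T) for T in (1, 10, 50)}
for T, g in grads.items():
    print(f"T={T:2d}  gradient={g:.3e}")
```

With these toy values the gradient shrinks multiplicatively with sequence length, which is the behavior the gated cells are designed to avoid.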

The mechanism is three gates, each a small neural layer with a sigmoid output between 0 and 1, acting as a dial rather than a switch. The forget gate looks at the previous hidden state and current input and decides what to remove from the memory cell, where a value near 0 means erase and a value near 1 means keep: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. The input gate decides what new information gets added: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, paired with a candidate value $\hat{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ that proposes what that new information actually is. Together these update the memory cell elementwise: $C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t$. The output gate decides what part of the (now updated) memory gets exposed as the hidden state: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, then $h_t = o_t \odot \tanh(C_t)$. Three gates, three separate decisions: what to erase, what to add, what to reveal.
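
The three gates can be sketched as a single timestep in plain Python. This is an illustrative sketch, not a reference implementation: the helper names (`sigmoid`, `affine`, `init_params`), the parameter dictionary layout, and the random initialization are all assumptions. The cell update uses the standard LSTM form, C_t = f_t ⊙ C_{t-1} + i_t ⊙ Ĉ_t.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(W, v, b):
    """Compute W @ v + b, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM timestep; p maps names such as "W_f" and "b_f" to weights."""
    z = h_prev + x_t                                   # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(affine(p["W_f"], z, p["b_f"]))       # forget gate: what to erase
    i_t = sigmoid(affine(p["W_i"], z, p["b_i"]))       # input gate: what to add
    C_hat = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # candidate values
    o_t = sigmoid(affine(p["W_o"], z, p["b_o"]))       # output gate: what to reveal
    # Additive, elementwise memory update
    C_t = [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, C_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (hypothetical initialization)."""
    rng = random.Random(seed)
    p = {}
    for g in "fico":
        p["W_" + g] = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
                       for _ in range(hidden_size)]
        p["b_" + g] = [0.0] * hidden_size
    return p

params = init_params(input_size=3, hidden_size=4)
h, C = [0.0] * 4, [0.0] * 4
for x in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h, C = lstm_step(x, h, C, params)
```

Pushing the forget-gate bias strongly positive and the input-gate bias strongly negative makes the cell carry `C_prev` through almost unchanged, which is the "keep" end of the dial.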

GRU (Gated Recurrent Unit) asks whether all three are necessary and answers no. It's a simplified LSTM with only two gates instead of three, and no separate cell state at all, just a single hidden state that does the job of both. The update gate $z_t$ decides how much of the previous hidden state to carry forward into the next timestep, with values near 0 keeping the old state and values near 1 replacing it: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$. The reset gate $r_t$ controls how much of the past to ignore when computing a new candidate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$. That candidate is $h'_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$, where the reset gate scales the previous hidden state elementwise before it is combined with the input, and the final hidden state blends old and new directly:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot h'_t$$

| Aspect | LSTM | GRU |
| --- | --- | --- |
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| State | Separate cell state + hidden state | Single hidden state |
| Parameters | More | Fewer |
| Typical use | Longer, more complex sequences | Faster training, comparable results on many tasks |
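
The GRU step described above can be sketched the same way, as one timestep in plain Python. As before, the helper names, parameter layout, and initialization are illustrative assumptions. The sketch assumes the standard form in which the reset gate multiplies the previous hidden state elementwise before the candidate is computed, and it follows the blend convention used here, where z_t weights the candidate.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(W, v, b):
    """Compute W @ v + b, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_step(x_t, h_prev, p):
    """One GRU timestep; p maps names such as "W_z" and "b_z" to weights."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    z_t = sigmoid(affine(p["W_z"], concat, p["b_z"]))      # update gate
    r_t = sigmoid(affine(p["W_r"], concat, p["b_r"]))      # reset gate
    # Reset gate scales the old state before it feeds the candidate
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t     # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in affine(p["W_h"], gated, p["b_h"])]
    # Blend: z_t near 0 keeps h_prev, z_t near 1 takes the candidate
    return [(1.0 - z) * h + z * c for z, h, c in zip(z_t, h_prev, h_cand)]

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (hypothetical initialization)."""
    rng = random.Random(seed)
    p = {}
    for g in "zrh":
        p["W_" + g] = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
                       for _ in range(hidden_size)]
        p["b_" + g] = [0.0] * hidden_size
    return p

params = init_params(input_size=3, hidden_size=4)
h = [0.0] * 4
for x in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h = gru_step(x, h, params)
```

Note that there is no `C` to carry between steps: the single list `h` is both the memory and the output, which is the structural difference the table summarizes. Some references swap the roles of z_t and 1 - z_t in the blend, so it is worth checking the convention before comparing against another implementation.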

Neither gate count is arbitrary: LSTM's third gate buys it a dedicated, slow-changing memory lane that's genuinely separate from what gets output at each step. GRU folds those two roles together and loses some of that separation, but in exchange trains faster with fewer parameters, and in practice often gets within a hair of LSTM's accuracy anyway.
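
The "fewer parameters" claim can be checked by counting directly from the equations above: every weight matrix acts on the concatenation [h_{t-1}, x_t] and comes with one bias vector. LSTM has four such blocks (forget, input, candidate, output) and GRU has three (update, reset, candidate). The function names and the layer sizes below are illustrative assumptions.

```python
def gate_block_params(input_size, hidden_size):
    """Parameters in one W . [h_{t-1}, x_t] + b block."""
    return hidden_size * (hidden_size + input_size) + hidden_size

def lstm_params(input_size, hidden_size):
    return 4 * gate_block_params(input_size, hidden_size)  # W_f, W_i, W_c, W_o

def gru_params(input_size, hidden_size):
    return 3 * gate_block_params(input_size, hidden_size)  # W_z, W_r, W_h

print(lstm_params(128, 256))  # 394240
print(gru_params(128, 256))   # 295680
```

Under this counting a GRU layer has exactly three quarters of the parameters of an LSTM layer of the same size. Library implementations may keep extra bias vectors, so their reported counts can differ slightly from these.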