LSTM Gives a Network Three Gates to Choose What to Forget; GRU Does It With Two

lstm gru rnn gated-recurrent-unit deep-learning

Vanilla RNNs struggle to retain information across more than a handful of timesteps: gradients shrink as they are propagated back through the same recurrent transformation again and again, so early inputs stop influencing what the network learns. That defeats the point of using a recurrent network on long sequences. LSTM (Long Short-Term Memory) is the fix: an improved RNN cell explicitly designed to capture long-term dependencies by giving the network a separate memory pathway, updated additively, that gradients can flow through without being forced to shrink at every step.

The mechanism is three gates, each a small neural layer with a sigmoid output between 0 and 1, acting as a dial rather than a switch. The forget gate looks at the previous hidden state and current input and decides what to remove from the memory cell, where a value near 0 means erase and a value near 1 means keep: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$. The input gate decides what new information gets added: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, paired with a candidate value $\hat{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ that proposes what that new information actually is. Together these update the memory cell: $C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t$. The output gate decides what part of the (now updated) memory gets exposed as the hidden state: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, then $h_t = o_t \odot \tanh(C_t)$. Three gates, three separate decisions: what to erase, what to add, what to reveal.
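The gate equations above can be sketched as a single NumPy timestep. This is a minimal illustration rather than a reference implementation: the function names, the `params` dictionary layout, and the small random initialization are assumptions made for the example. The cell-state line uses the standard LSTM rule, in which the old memory is scaled by the forget gate and the candidate is scaled by the input gate.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases for f, i, c, o."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "c", "o"):
        params[f"W_{name}"] = rng.normal(
            0.0, 0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep: returns the new hidden state and cell state."""
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]

    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])    # what to erase
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])    # what to add
    c_hat = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate memory

    # Standard cell update: keep part of the old memory, add part of the candidate.
    c_t = f_t * c_prev + i_t * c_hat

    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])    # what to reveal
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


# Usage: run a short random sequence through the cell.
input_size, hidden_size = 3, 4
params = init_lstm_params(input_size, hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
sequence = np.random.default_rng(1).normal(size=(5, input_size))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
```

Because `h_t` is a sigmoid output multiplied by a `tanh`, every entry of the hidden state stays strictly between -1 and 1, while the cell state `c_t` is not bounded in the same way.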

GRU (Gated Recurrent Unit) asks whether all three are necessary and answers no. It's a simplified LSTM with only two gates instead of three, and no separate cell state at all, just a single hidden state that does the job of both. The update gate $z_t$ decides how much of the hidden state is replaced by new content and how much of the previous state carries forward into the next timestep: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$. The reset gate $r_t$ controls how much of the past to ignore when computing a new candidate: $r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$. That candidate applies the reset gate elementwise to the previous hidden state before combining it with the input, $h'_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$, and the final hidden state blends old and new directly:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot h'_t$$
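The GRU update can be sketched the same way in NumPy. Again this is an illustrative sketch under stated assumptions: the function names and the `params` layout are made up for the example, and the reset gate is assumed to multiply $h_{t-1}$ elementwise before concatenation, as in the standard GRU formulation. The blend follows the convention used here, where $z_t$ near 1 favors the new candidate; some papers and libraries swap the roles of $z_t$ and $1 - z_t$.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, squashing values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def init_gru_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases for z, r, h."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("z", "r", "h"):
        params[f"W_{name}"] = rng.normal(
            0.0, 0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params


def gru_step(x_t, h_prev, params):
    """One GRU timestep: returns the new hidden state."""
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]

    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])  # update gate
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])  # reset gate

    # The reset gate scales the previous state before the candidate is computed.
    gated = np.concatenate([r_t * h_prev, x_t])
    h_cand = np.tanh(params["W_h"] @ gated + params["b_h"])

    # Blend: z_t near 0 keeps the old state, z_t near 1 takes the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_cand


# Usage: run a short random sequence through the cell.
input_size, hidden_size = 3, 4
params = init_gru_params(input_size, hidden_size)
h = np.zeros(hidden_size)
sequence = np.random.default_rng(1).normal(size=(5, input_size))
for x_t in sequence:
    h = gru_step(x_t, h, params)
```

There is only one state vector to carry between steps, which is the practical meaning of "no separate cell state".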

Aspect	LSTM	GRU
Gates	3 (forget, input, output)	2 (update, reset)
State	Separate cell state + hidden state	Single hidden state
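Parameters	More (four weight blocks: three gates plus the candidate)	Fewer (three weight blocks: two gates plus the candidate, about 25% fewer at equal layer sizes)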
Typical use	Longer, more complex sequences	Faster training, comparable results on many tasks
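The parameter row can be made concrete with a quick count. The sketch below assumes the formulation used in the equations here: one weight matrix over the concatenated $[h_{t-1}, x_t]$ and one bias vector per gate or candidate. The function names and the example sizes are illustrative. Some frameworks keep separate input and recurrent bias vectors, so their reported counts come out slightly higher.

```python
def block_param_count(input_size, hidden_size):
    """Parameters in one weight block: W over [h, x] plus a bias vector."""
    return hidden_size * (hidden_size + input_size) + hidden_size


def lstm_param_count(input_size, hidden_size):
    """LSTM: forget, input, and output gates plus the candidate (4 blocks)."""
    return 4 * block_param_count(input_size, hidden_size)


def gru_param_count(input_size, hidden_size):
    """GRU: update and reset gates plus the candidate (3 blocks)."""
    return 3 * block_param_count(input_size, hidden_size)


# Example sizes chosen only for illustration.
lstm_total = lstm_param_count(input_size=128, hidden_size=256)  # 394240
gru_total = gru_param_count(input_size=128, hidden_size=256)    # 295680
ratio = gru_total / lstm_total                                  # 0.75
```

Under these assumptions the ratio is always 3/4 regardless of layer size, since both cells use identically shaped blocks and differ only in how many they have.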

Neither gate count is arbitrary: LSTM's output gate, together with its separate cell state, buys it a dedicated memory lane that's genuinely distinct from what gets exposed at each step. GRU folds those two roles into one hidden state and loses that separation, but in exchange it has fewer parameters and is cheaper to train, and in practice it often comes close to LSTM's accuracy anyway.