Recurrent Neural Network (RNN)
Created on Aug 10, 2025, Last Updated on Oct 22, 2025, By a Developer
Concepts
The major difference between a Recurrent Neural Network and a normal feed-forward network is that the input is chunked into small pieces that go through the model sequentially, and earlier pieces influence later ones. Taking a sentence as an example: the sentence is chunked into words/tokens/characters, we save the activation value derived from the previous token and pass it, together with the next token, into the model, and repeat this for every token.
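As a minimal sketch of this recurrence (all names and sizes here are illustrative, not taken from any particular library), the model is just a loop that carries the previous activation into the next step:

```python
# A minimal, illustrative RNN recurrence: the activation from the previous
# token is combined with the current token at every step.
import numpy as np

hidden_size, input_size = 16, 8
Wa = np.random.randn(hidden_size, hidden_size + input_size) * 0.01  # weights for [a_prev, x_t]
ba = np.zeros(hidden_size)

def rnn_step(a_prev, x_t):
    """One time step: mix the previous activation with the current token vector."""
    return np.tanh(Wa @ np.concatenate([a_prev, x_t]) + ba)

# Process a "sentence" of 5 token vectors sequentially, carrying the activation forward.
tokens = [np.random.randn(input_size) for _ in range(5)]
a = np.zeros(hidden_size)
activations = []
for x_t in tokens:
    a = rnn_step(a, x_t)
    activations.append(a)
```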
Back propagation for an RNN depends heavily on the task the model is built for. We might use only the last activation value in the loss function if that is the only output we eventually care about, or use all activation values to construct the loss if every step matters.
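Continuing the illustrative sketch above, the two options might look like this; `Wy`, `y_last` and `y_seq` are placeholder names for a simple read-out layer and labels:

```python
# Illustrative only: reuses `activations` and `hidden_size` from the sketch above.
Wy = np.random.randn(1, hidden_size) * 0.01

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# Option 1: only the final activation feeds the loss (e.g. one label for the whole sequence).
y_last = 0.7
loss_last = mse(Wy @ activations[-1], y_last)

# Option 2: every activation feeds the loss (e.g. one label per token).
y_seq = [0.1, 0.4, 0.3, 0.9, 0.2]
loss_all = sum(mse(Wy @ a_t, y_t) for a_t, y_t in zip(activations, y_seq))
```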
Gated Recurrent Unit (GRU)
The GRU was invented to address the problem that the information from a given token has little influence on tokens several steps away. It provides a gating mechanism that lets features of the current token be carried forward to much later tokens.
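A minimal GRU cell sketch, assuming the standard update/relevance gating and the same kind of illustrative names and sizes as before:

```python
# Illustrative GRU cell: the update gate decides how much of the old memory to keep,
# which is what lets information from an early token survive many steps.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 16, 8
Wu, Wr, Wc = (np.random.randn(hidden_size, hidden_size + input_size) * 0.01 for _ in range(3))
bu = br = bc = np.zeros(hidden_size)

def gru_step(c_prev, x_t):
    concat = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)              # update gate
    gamma_r = sigmoid(Wr @ concat + br)              # relevance (reset) gate
    c_cand = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_cand + (1 - gamma_u) * c_prev  # when gamma_u is near 0, old info survives
    return c_t                                       # in a GRU the activation is the memory cell
```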
Long Short Term Memory (LSTM)
The LSTM shares the same motivation as the GRU, but it outputs a separate memory cell alongside the activation, and uses the memory from the previous step to compute the memory and activation for the next step.
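A minimal LSTM cell sketch under the same illustrative setup, showing the separate memory cell and the three gates:

```python
# Illustrative LSTM cell: unlike the GRU, the memory cell c_t and the activation a_t
# are separate, and three gates (forget, update, output) control the memory flow.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 16, 8
Wf, Wu, Wc, Wo = (np.random.randn(hidden_size, hidden_size + input_size) * 0.01 for _ in range(4))
bf = bu = bc = bo = np.zeros(hidden_size)

def lstm_step(a_prev, c_prev, x_t):
    concat = np.concatenate([a_prev, x_t])
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate: how much old memory to keep
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate: how much new memory to write
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate: how much memory to expose
    c_cand = np.tanh(Wc @ concat + bc)
    c_t = gamma_f * c_prev + gamma_u * c_cand    # new memory cell
    a_t = gamma_o * np.tanh(c_t)                 # new activation
    return a_t, c_t
```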
Bi-Directional RNN
Language is not a uni-directional sequence; words that appear later can affect the meaning of words that appear earlier. So passing a token's information only into the "future" is not enough; ideally it should also be passed to the "past". A Bi-Directional RNN does exactly that: it has two sets of parameters, one running towards the future and one towards the past.
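A sketch of the bi-directional idea with illustrative names: run one pass towards the future and one towards the past with separate parameters, then concatenate the two activations for each token:

```python
# Illustrative bi-directional RNN: two independent passes over the same tokens.
import numpy as np

hidden_size, input_size = 16, 8
W_fwd, W_bwd = (np.random.randn(hidden_size, hidden_size + input_size) * 0.01 for _ in range(2))

def step(W, a_prev, x_t):
    return np.tanh(W @ np.concatenate([a_prev, x_t]))

tokens = [np.random.randn(input_size) for _ in range(5)]

a, forward = np.zeros(hidden_size), []
for x_t in tokens:                       # towards the "future"
    a = step(W_fwd, a, x_t)
    forward.append(a)

a, backward = np.zeros(hidden_size), []
for x_t in reversed(tokens):             # towards the "past"
    a = step(W_bwd, a, x_t)
    backward.append(a)
backward.reverse()                       # re-align with the original token order

# Each token's representation now carries both past and future context.
bi_activations = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```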
Whether a bi-directional connection can be used depends heavily on the task:
- For tasks like classification, the entire sequence is available at once, so the model can be a bi-directional RNN and leverage both past and future context for accuracy.
- For sequence generation tasks, the model must often be uni-directional, so that generation can start before the full input sequence is available. Video transcription and real-time translation are good examples: the future tokens are either inaccessible or too numerous to process in one shot.
Word Embeddings
Word embeddings are different from "embeddings" in the context of the Transformer. A word embedding is a vector representation of a word; in other words, a dictionary where each word maps to a vector. Good word embeddings capture the semantic meaning of words, so we can perform the classic analogy task of "finding the word that is to 'woman' what 'king' is to 'man'".
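A small sketch of that analogy task, assuming a toy embeddings dictionary (real embeddings would come from training, e.g. with Word2Vec):

```python
# Illustrative analogy test: "king" - "man" + "woman" should land closest to "queen"
# when the embeddings actually capture semantics (random vectors here, so the result
# is arbitrary; it only demonstrates the mechanics).
import numpy as np

embeddings = {w: np.random.randn(50) for w in ["king", "queen", "man", "woman", "apple"]}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
best = max((w for w in embeddings if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(embeddings[w], target))
print(best)  # with trained embeddings this would be "queen"
```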
How Are Word Embeddings Learnt?
- Split the corpus into words, and perform stemming and lemmatization.
- Build a word dictionary based on word counts over the entire corpus (Word2Vec uses a prediction-based approach instead).
- The model is trained to answer: given a context, what is the probability that word B appears near word A within a certain distance? Some possible contexts:
  - 4 words on the left & right
  - the last 4 words
  - 1 nearby word
- The output is a vector with the same size as the word dictionary, passed through a Softmax layer to derive the word probabilities. Backward propagation is then performed on both the embeddings and the model weights (see the sketch below).
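A rough sketch of the prediction step described above, with a toy vocabulary and illustrative sizes; only the forward pass and the loss are shown, and backpropagation would update both `E` and `W`:

```python
# Illustrative embedding-learning forward pass: predict a target word from its context.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
vocab_size, embed_dim = len(vocab), 10
E = np.random.randn(vocab_size, embed_dim) * 0.01   # one embedding vector per word
W = np.random.randn(vocab_size, embed_dim) * 0.01   # projection back to vocabulary size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Context: the 2 words to the left of the target, averaged into one vector.
context_ids, target_id = [vocab.index("the"), vocab.index("cat")], vocab.index("sat")
h = E[context_ids].mean(axis=0)
probs = softmax(W @ h)                               # probability of each vocabulary word
loss = -np.log(probs[target_id])                     # cross-entropy against the true word
```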
Beam Search
During sequence generation, it is not always reliable to pick the word with the highest probability at every step. Some better generations may not score well on their first few tokens. Beam Search is a technique for searching for a better overall generation. It requires a hyperparameter, the beam width B, which controls how many candidate sequences are kept at each step (sketched below).
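A minimal beam search sketch; `next_token_probs` is a placeholder for whatever model scores the next token given a prefix, and `B` is the beam width:

```python
# Illustrative beam search: keep the B best partial sequences by cumulative log-probability
# instead of greedily committing to the single most likely token at each step.
import math

def beam_search(next_token_probs, vocab, B=3, max_len=10, eos="<eos>"):
    beams = [([], 0.0)]                           # (tokens so far, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:      # finished sequences are kept as-is
                candidates.append((tokens, score))
                continue
            probs = next_token_probs(tokens)      # dict: token -> probability of being next
            for tok in vocab:
                candidates.append((tokens + [tok], score + math.log(probs[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0][0]
```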
Attention Layer
The attention layer is well known because of the Transformer, where it plays a key role. But it was invented before the Transformer existed, in the machine translation domain, which is a sequence-to-sequence language task.
The performance of an RNN model without any attention mechanism degrades drastically as the input gets longer, simply because the model does not have enough "memory" to keep everything inside the activation value and memory cell.
A word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after it. The attention layer attempts to derive the relationships among words so that tokens can pay attention to each other.
The attention layer lives on top of a normal bi-directional RNN, taking the sequence of activations it produces as input. For each output step, it scores every input activation against the current state, turns the scores into weights with a Softmax, and uses the weighted sum as a context vector, so the output step can focus on the most relevant input tokens.
The drawback is obviously the computation cost for both training and inference, which is quadratic in the length of the sequence.
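A sketch of this attention computation, assuming an additive-style scoring function with illustrative weights; computing it for every output step against every input step is where the quadratic cost comes from:

```python
# Illustrative attention step: score each encoder activation against the current
# output-side state, softmax the scores into weights, and take the weighted sum
# as the context vector for this output step.
import numpy as np

enc_size, dec_size, attn_size, seq_len = 32, 32, 16, 6
Wa = np.random.randn(attn_size, enc_size + dec_size) * 0.01
va = np.random.randn(attn_size) * 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

encoder_activations = [np.random.randn(enc_size) for _ in range(seq_len)]  # from the bi-RNN
decoder_state = np.random.randn(dec_size)                                  # current output step

scores = np.array([va @ np.tanh(Wa @ np.concatenate([a, decoder_state]))
                   for a in encoder_activations])
weights = softmax(scores)                      # how much attention each input token gets
context = sum(w * a for w, a in zip(weights, encoder_activations))
```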
Case Study
Word2Vec
Word2Vec is a model used to learn word embeddings. The intuition is that words with related meanings often occur near each other in text. It uses skip-grams to build the training set: randomly pick two words within a certain window, and use one of them as the context word to predict the probability that the other one appears. Although training still involves a neural network, the network is eventually dropped; the word embeddings are the only thing we need (see the sketch below).
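A sketch of skip-gram pair sampling with an illustrative window size; these (context, target) pairs feed the small prediction network whose embedding matrix is the part we keep:

```python
# Illustrative skip-gram sampling: pick a context position, then a random nearby
# position inside the window, and emit the (context, target) word pair.
import random

def skip_gram_pairs(tokens, window=4, n_pairs=10):
    pairs = []
    for _ in range(n_pairs):
        i = random.randrange(len(tokens))                       # context word position
        lo, hi = max(0, i - window), min(len(tokens) - 1, i + window)
        j = random.choice([k for k in range(lo, hi + 1) if k != i])
        pairs.append((tokens[i], tokens[j]))                    # (context, target) pair
    return pairs

print(skip_gram_pairs("the quick brown fox jumps over the lazy dog".split()))
```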