Generative Pre-trained Transformer (GPT)
GPT-2
GPT-2 is a decoder-only, auto-regressive model that performs well on text-generation tasks.
Note: auto-regressive means the model feeds each generated token back in as part of the input when predicting the next token.
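As a rough illustration, the loop below greedily generates a few tokens with the pretrained GPT-2 checkpoint from the Hugging Face transformers library. The prompt and the number of generated tokens are arbitrary choices, and real decoders often sample from the logits rather than take the argmax.

```python
# Auto-regressive (greedy) generation sketch: the predicted token is appended
# to the input and fed back in on the next step.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                                # generate 10 new tokens
        logits = model(input_ids).logits               # (batch, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # feed back in

print(tokenizer.decode(input_ids[0]))
```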
- GPT-2 uses Byte-Pair Encoding (BPE) to split words into subword tokens.
"playing" => ["play", "ing"]
- Word tokens are mapped to token IDs and then to word embeddings.
["play", "ing"] => [1234, 6543] => [[...], [...]]
- The word embeddings here are close in spirit to the word embeddings of traditional language modeling, except that they are trained jointly with the rest of the model rather than trained separately.
- Positional encodings are added to indicate the position of each token in the sequence (see the embedding sketch after this list).
- The embeddings are passed through multiple decoder blocks to derive the final hidden state.
- Each decoder block contains the following layers (a sketch of one block follows this list):
- Masked multi-head self-attention layer (future tokens are masked so each position only attends to earlier positions)
- Skip Connection and Sum Layer
- Layer Normalization
- Feed-Forward Layer
- Skip Connection and Sum Layer
- Layer Normalization
- The last hidden state is passed to a language-modeling head that converts it into logits over the vocabulary; the next token's ID is then chosen from these logits (e.g. by argmax or sampling). See the loss sketch after this list.
- The loss function used is Cross-Entropy Loss.
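A minimal sketch of the tokenization and embedding steps above. The tokenizer is GPT-2's actual BPE tokenizer from the Hugging Face transformers library; the embedding tables below are toy stand-ins for the ones GPT-2 learns jointly with the rest of the model (GPT-2 small uses a 50,257-token vocabulary, 1,024 positions, and 768-dimensional embeddings).

```python
# Tokenize -> token IDs -> token embeddings + positional embeddings.
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("playing")              # subword strings from BPE
ids = tokenizer.convert_tokens_to_ids(tokens)       # integer token IDs
id_tensor = torch.tensor([ids])                     # shape (batch=1, seq_len)

vocab_size, max_positions, d_model = 50257, 1024, 768
token_embedding = nn.Embedding(vocab_size, d_model)        # learned with the model
position_embedding = nn.Embedding(max_positions, d_model)  # learned with the model

positions = torch.arange(id_tensor.size(1)).unsqueeze(0)   # [[0, 1, ...]]
hidden = token_embedding(id_tensor) + position_embedding(positions)
print(hidden.shape)   # (1, seq_len, 768): input to the first decoder block
```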
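A simplified PyTorch sketch of one decoder block, following the layer ordering in the list above (the real GPT-2 implementation differs in some details, such as applying layer normalization before each sub-layer). The sizes match GPT-2 small: hidden size 768, 12 attention heads, 3,072-dimensional feed-forward layer.

```python
# One decoder block: masked self-attention -> add & norm -> feed-forward -> add & norm.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: position i may only attend to positions <= i.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool),
                                 diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.ln1(x + attn_out)      # skip connection + layer norm
        x = self.ln2(x + self.ff(x))    # skip connection + layer norm
        return x

block = DecoderBlock()
hidden = block(torch.randn(1, 5, 768))  # (batch, seq_len, hidden)
```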
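A sketch of the language-modeling head and the cross-entropy loss, assuming the decoder stack's hidden states are already computed; the random tensors stand in for real activations and token IDs. The targets are just the input IDs shifted by one position, and in GPT-2 the head's weight matrix is tied to the token-embedding table.

```python
# Language-modeling head + cross-entropy loss sketch with dummy tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50257, 768                   # GPT-2 small sizes
lm_head = nn.Linear(d_model, vocab_size, bias=False)

hidden = torch.randn(1, 6, d_model)                # final hidden states (dummy)
input_ids = torch.randint(0, vocab_size, (1, 6))   # token IDs of the sequence (dummy)

logits = lm_head(hidden)                           # (batch, seq_len, vocab_size)

# Position t predicts token t+1, so shift logits and labels by one.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)

next_token_id = logits[:, -1, :].argmax(dim=-1)    # ID of the predicted next token
```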