Generative Pre-trained Transformer (GPT)
Created on Sep 09, 2025, Last Updated on Oct 22, 2025, By a Developer
GPT-2
GPT-2 is a decoder-only Transformer model with strong capability on text generation tasks.
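For a quick sense of what this looks like in practice, the short sketch below generates a continuation from a prompt using the publicly released GPT-2 weights. It assumes the Hugging Face transformers library and the "gpt2" checkpoint, which are not part of the write-up above; it is just one convenient way to run the model.

```python
# Minimal text generation sketch (assumes the Hugging Face `transformers`
# library and the public "gpt2" checkpoint are available).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
# Greedy decoding of up to 20 new tokens.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```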
- GPT-2 uses Byte-Pair Encoding (BPE) to split text into subword tokens (see the tokenization sketch after this list).
"playing" => ["play", "ing"]
- Word tokens are mapped to token IDs, which are then mapped to word embeddings.
["play", "ing"] => [1234, 6543] => [[...], [...]]
- The word embeddings here are conceptually similar to the word embeddings in traditional language modeling, except that they are trained as part of the model rather than trained separately.
- Positional encodings are added to indicate the position of each token in the sequence.
- The embeddings are passed through multiple decoder blocks to derive the final hidden states (a minimal decoder block sketch follows after this list).
- Each decoder block contains:
- Masked multi-head self-attention layer (future tokens are masked)
- Skip Connection and Sum Layer
- Layer Normalization
- Feed-Forward Layer
- Skip Connection and Sum Layer
- Layer Normalization
- The final hidden state is passed to a language modeling head, which converts it to logits over the vocabulary; the next token is chosen from these logits (e.g., by taking the argmax or by sampling).
- The loss function is cross-entropy loss between the predicted next-token distribution and the actual next token (see the loss sketch after this list).
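The tokenization and embedding steps above can be seen directly with the Hugging Face tokenizer and model (assumed here purely for illustration; `wpe` is the internal name of the learned position-embedding table in that implementation):

```python
# Tokenization sketch: text -> BPE tokens -> token IDs -> embeddings.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

tokens = tokenizer.tokenize("playing")         # subword split, e.g. ["play", "ing"]
ids = tokenizer.convert_tokens_to_ids(tokens)  # IDs depend on the learned vocabulary
print(tokens, ids)

# Token embeddings are learned jointly with the rest of the model.
token_emb = model.get_input_embeddings()(torch.tensor(ids))  # (num_tokens, 768)

# GPT-2 uses learned positional embeddings; they are simply added.
positions = torch.arange(len(ids))
hidden = token_emb + model.wpe(positions)      # input to the first decoder block
print(hidden.shape)
```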
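Below is a minimal PyTorch sketch of one decoder block with the components listed above: masked multi-head self-attention, skip connections with summation, layer normalization, and a feed-forward layer. The dimensions match the smallest GPT-2 (768-dimensional hidden states, 12 heads), and the sublayer ordering follows the list; note that the released GPT-2 code applies layer normalization before each sublayer (pre-norm), but the building blocks are the same.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style decoder block: masked self-attention and a feed-forward
    layer, each followed by a skip (residual) connection and layer norm."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: True marks future positions that may not be attended to.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)    # skip connection + sum, then layer norm
        x = self.ln2(x + self.ff(x))  # skip connection + sum, then layer norm
        return x

# Toy usage: a batch of 2 sequences, each with 5 token embeddings.
block = DecoderBlock()
print(block(torch.randn(2, 5, 768)).shape)  # torch.Size([2, 5, 768])
```

Stacking several of these blocks and projecting the output back onto the vocabulary gives the full language model.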
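Finally, the training objective: the logits at position t are scored against the actual token at position t+1 with cross-entropy loss. The shifting below is written out by hand for clarity; GPT2LMHeadModel performs the same shift internally when it is given labels.

```python
# Loss sketch: next-token prediction with cross-entropy.
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
logits = model(**inputs).logits              # (batch, seq_len, vocab_size)

# Compare logits at position t with the actual token at position t+1.
shift_logits = logits[:, :-1, :]
shift_labels = inputs["input_ids"][:, 1:]
loss = F.cross_entropy(
    shift_logits.reshape(-1, shift_logits.size(-1)),
    shift_labels.reshape(-1),
)
print(loss.item())
```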