
Generative Pre-trained Transformer (GPT)


Created on Sep 09, 2025, Last Updated on Oct 22, 2025, By a Developer

GPT-2


GPT-2 is a decoder-only Transformer model that performs well on text generation tasks.

  • GPT-2 uses Byte-Pair Encoding (BPE) to tokenize text into subword tokens (see the tokenization sketch after this list).
    • "playing" => ["play", "ing"]
  • Subword tokens are mapped to token IDs and then to token embeddings.
    • ["play", "ing"] => [1234, 6543] => [[...], [...]]
    • The "embeddings" here are close to the word embeddings in traditional language modeling, except that they are trained as part of the model rather than trained separately.
  • Positional encodings are added to indicate the position of each token (also shown in the tokenization sketch after this list).
  • The embeddings are passed through multiple decoder blocks to derive the final hidden states (a minimal decoder block sketch follows this list).
    • Each decoder block contains:
      • Masked multi-head self-attention layer, with future tokens masked
      • Skip Connection and Sum Layer
      • Layer Normalization
      • Feed-Forward Layer
      • Skip Connection and Sum Layer
      • Layer Normalization
  • The final hidden state at each position is passed to a language modeling head, which converts it to logits over the vocabulary; the ID of the next token in the sequence is predicted from these logits.
  • The loss function used is Cross-Entropy Loss between the predicted logits and the actual next tokens (see the sketch after this list).
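
For illustration, here is a minimal sketch of the tokenization, ID lookup, and embedding steps above, assuming the Hugging Face transformers library and its public "gpt2" checkpoint; the exact subword split and token IDs may differ from the placeholder values in the list.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

tokens = tokenizer.tokenize("playing")               # subword tokens, e.g. ["play", "ing"]
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # token IDs (actual values differ from the placeholders above)

ids = torch.tensor(token_ids)
token_emb = model.wte(ids)                           # token embeddings, shape (num_tokens, 768)
pos_emb = model.wpe(torch.arange(len(token_ids)))    # learned positional embeddings
input_emb = token_emb + pos_emb                      # input to the decoder blocks
print(tokens, token_ids, input_emb.shape)
```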
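
Below is a minimal PyTorch sketch of a single decoder block following the sub-layer order listed above. The dimensions match the smallest GPT-2 configuration, but the names and structure are illustrative rather than the released implementation, which applies layer normalization before each sub-layer instead of after it.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block in the sub-layer order listed above:
    masked self-attention -> skip connection + sum -> layer norm
    -> feed-forward -> skip connection + sum -> layer norm."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: position i may only attend to positions <= i (future tokens masked).
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)      # skip connection + sum, then layer norm
        x = self.ln2(x + self.ff(x))    # skip connection + sum, then layer norm
        return x

# Example: a batch of 1 sequence with 2 token embeddings of size 768.
block = DecoderBlock()
hidden = block(torch.randn(1, 2, 768))
print(hidden.shape)  # torch.Size([1, 2, 768])
```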
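
Finally, a minimal sketch of the language modeling head and the cross-entropy loss; the vocabulary size matches GPT-2's BPE vocabulary, while the random hidden states and target IDs are placeholders standing in for real data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 50257, 768                       # GPT-2 small: BPE vocab size and hidden size
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # language modeling head

hidden = torch.randn(1, 2, d_model)                    # final hidden states: (batch, seq_len, d_model)
logits = lm_head(hidden)                               # (batch, seq_len, vocab_size)
next_token_id = logits[:, -1, :].argmax(dim=-1)        # greedy choice of the next token ID

# Training: each position is trained to predict the next token; here the
# targets are random placeholders standing in for the shifted input IDs.
targets = torch.randint(0, vocab_size, (1, 2))
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(logits.shape, next_token_id, loss.item())
```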
