Transformer
Created on Aug 02, 2025, Last Updated on Sep 07, 2025, By a Developer
The Transformer was introduced as a machine learning architecture in 2017 (in the paper "Attention Is All You Need"), with a primary focus on the translation task. The architecture turned out to perform remarkably well even outside of translation.
Architecture
The initial Transformer model architecture consists of two parts: an encoder and a decoder.
- The encoder takes raw input (typically text) and turns it into a vector representation.
- The decoder takes the encoder's output, along with its own input, to generate output tokens (see the sketch below).
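As a rough illustration, here is a minimal sketch using PyTorch's built-in `nn.Transformer` module (not the original 2017 implementation); the shapes and hyperparameters are arbitrary illustrative choices. The decoder consumes the encoder's output, usually called the "memory", together with its own input:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8, batch_first=True)

src = torch.rand(1, 10, 512)  # encoder input: one sequence of 10 already-embedded tokens
tgt = torch.rand(1, 7, 512)   # decoder input: the tokens generated so far

memory = model.encoder(src)       # encoder turns the input into vector representations
out = model.decoder(tgt, memory)  # decoder attends to the memory plus its own input
print(out.shape)                  # torch.Size([1, 7, 512])
```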
However, the terms "encoder" and "decoder" are now also used to classify Transformer models: some models have only an encoder, some have only a decoder, and others have both.
Encoder
BERT-like models are encoder-only models, also called auto-encoding models. They are not the fancy generative models, but they do a lot of heavy lifting in production, performing tasks like classification and semantic search.
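For example, here is a minimal classification sketch using the Hugging Face `transformers` library (assuming it is installed; the default checkpoint this pipeline downloads is a DistilBERT-family encoder):

```python
from transformers import pipeline

# The "sentiment-analysis" pipeline defaults to an encoder-only (BERT-family) model.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers are surprisingly versatile."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```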
Decoder
GPT-like models are decoder-only models, also called auto-regressive models. This can feel unintuitive at first, since the encoder is about understanding while the decoder is about generating. That still holds true; the key point is that GPT performs a text generation task rather than a question answering task. It is actually extending the prompt.
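A quick sketch with the Hugging Face `transformers` library and the public GPT-2 checkpoint makes this visible: the generated text begins with the prompt and simply continues it.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("A transformer is a", max_new_tokens=20)
print(result[0]["generated_text"])
# The output starts with "A transformer is a" and continues from there.
```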
And during text generation, the model does not use the context the way it does during translation: each output token is not tied to any specific input token(s); it depends on the entire preceding context instead.
Encoder-decoder
T5-like models have both an encoder and a decoder, and are also called sequence-to-sequence models. They are good at generation tasks conditioned on an input, such as translation and summarization.
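Here is a minimal sketch using the Hugging Face `transformers` library and the public `t5-small` checkpoint, with translation as the example of generation conditioned on an input:

```python
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("The Transformer architecture was introduced in 2017.")[0]["translation_text"])
```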
Training
A Transformer, as a type of language model, is usually trained in a self-supervised manner, meaning the desired output is derived on the fly from the input itself. A model trained this way is not yet good at any specific language task. That is why this step is called pre-training, and also why GPT has its name: Generative Pre-trained Transformer.
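As a concrete sketch of "derived on the fly", here is how the causal language modeling objective (the one GPT-style models use) builds labels from the input alone; the token ids below are made up for illustration:

```python
import torch

token_ids = torch.tensor([[5, 42, 17, 8, 99, 3]])  # hypothetical token ids of one sentence

inputs = token_ids[:, :-1]  # the model reads everything except the last token
labels = token_ids[:, 1:]   # the target at each position is simply the next token

# Cross-entropy between the model's per-position predictions and `labels`
# is the whole pre-training signal; no human annotation is involved.
```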
Given the above, before putting the model to work on a specific task, it usually goes through transfer learning, also called fine-tuning: it is given inputs with human-annotated labels and trained on them in a supervised fashion.
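A minimal fine-tuning sketch, assuming the public `bert-base-uncased` checkpoint and a tiny hand-labeled batch (the texts and labels below are made up):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])              # human-annotated labels

loss = model(**batch, labels=labels).loss  # supervised loss on top of the pre-trained encoder
loss.backward()                            # one step of transfer learning
# In practice, an optimizer step and many more labeled batches would follow.
```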