BART
BART has an architecture very similar to the "original" transformer described in the 2017 paper¹. Another way to think about it is as a BERT-like encoder plus a GPT-like auto-regressive decoder, with the decoder having cross-attention to the output of the encoder.
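As a rough illustration of that wiring, here is a minimal sketch assuming the Hugging Face `transformers` library and the `facebook/bart-base` checkpoint (both assumptions, not stated above): the BERT-like encoder reads the whole input bidirectionally, and the GPT-like decoder consumes its output through cross-attention.

```python
# Minimal sketch, assuming Hugging Face `transformers` and the `facebook/bart-base`
# checkpoint, of BART's two halves: a bidirectional encoder and an auto-regressive
# decoder that cross-attends to the encoder output.
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

inputs = tokenizer("BART is a denoising sequence-to-sequence model.", return_tensors="pt")

with torch.no_grad():
    # BERT-like encoder: attends over the whole input at once.
    encoder_out = model.encoder(**inputs).last_hidden_state

    # GPT-like decoder: causal self-attention over the tokens generated so far,
    # plus cross-attention to `encoder_out`.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    decoder_out = model.decoder(
        input_ids=decoder_input_ids,
        encoder_hidden_states=encoder_out,
    ).last_hidden_state

print(encoder_out.shape, decoder_out.shape)  # (1, src_len, 768) and (1, 1, 768)
```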
- BART is trained on masked language modeling, but unlike a normal encoder, which sticks to one corruption strategy, it can use any of several. Usually, a span of multiple tokens is replaced with a single [mask], so the model needs to predict a sequence of tokens. This corruption strategy is called text infilling (a small inference-time example follows the list below).
- BART has an encoder architecture the same as BERT's, but with some differences:
    - No more segment embeddings.
    - At the end of the encoder, BERT passes the output to a feed-forward layer for prediction, while BART passes it over to the decoder.
- The encoder output is then passed to each decoder block. Similar to GPT, the previously generated sequence is passed to the decoder as input as well.
- Each decoder block includes (a PyTorch sketch of one block follows this list):
    - Masked multi-head self-attention layer, with future tokens masked (only previous tokens are attended to)
    - Skip Connection and Sum Layer
    - Layer Normalization
    - Cross-attention layer, which accepts input from both sides: queries from the decoder, keys and values from the encoder output
    - Skip Connection and Sum Layer
    - Layer Normalization
    - Feed-Forward Layer
    - Skip Connection and Sum Layer
    - Layer Normalization
- The output is passed to a language model head to predict the next token ID in the sequence.
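To make the text-infilling idea concrete, here is a minimal inference-time sketch, again assuming Hugging Face `transformers` and the `facebook/bart-base` checkpoint: a single mask token stands in for an unknown span, and the decoder generates the full, uncorrupted sequence.

```python
# Minimal sketch of filling a masked span with a pretrained BART model, assuming
# Hugging Face `transformers` and the `facebook/bart-base` checkpoint.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# One mask token may correspond to several original tokens (text infilling).
text = f"The movie was {tokenizer.mask_token} that I watched it twice."
inputs = tokenizer(text, return_tensors="pt")

generated = model.generate(inputs["input_ids"], num_beams=4, max_length=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```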
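And here is a minimal PyTorch sketch of one decoder block wired as in the list above (an illustration of the structure, not BART's actual implementation): masked self-attention, then cross-attention to the encoder output, then a feed-forward layer, each followed by a skip connection, sum, and layer normalization.

```python
# Minimal PyTorch sketch of a BART-style decoder block; layer sizes match bart-base
# (d_model=768, 12 heads) but this is an illustration, not the real implementation.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, encoder_out):
        # Causal mask: True marks future positions that may not be attended to.
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

        # Masked multi-head self-attention + skip connection/sum + layer norm.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + attn_out)

        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        cross_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + cross_out)

        # Feed-forward + skip connection/sum + layer norm.
        return self.norm3(x + self.ffn(x))

# Toy shapes: 5 decoder positions attending to an 8-token encoded source.
x = torch.randn(1, 5, 768)
enc = torch.randn(1, 8, 768)
out = DecoderBlock()(x, enc)
# After the last block, `out` would go to a language-model head (a linear layer
# over the vocabulary) to predict the next token ID.
print(out.shape)  # torch.Size([1, 5, 768])
```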