BART

Created on Sep 09, 2025, Last Updated on Sep 09, 2025, By a Developer

BART has an architecture very similar to the “original” transformer described in the 2017 paper1. Another way to think about it is as a BERT-like bidirectional encoder plus a GPT-like auto-regressive decoder, with the decoder cross-attending to the output of the encoder.
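To make the split concrete, here is a minimal sketch that inspects the two halves. It assumes the Hugging Face transformers library and the public facebook/bart-base checkpoint, neither of which is prescribed by this page:

```python
# Minimal sketch: inspect BART's encoder-decoder split.
# Assumes the Hugging Face `transformers` library and the facebook/bart-base checkpoint.
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# A bidirectional encoder and an auto-regressive decoder sit inside one seq2seq model.
print(type(model.model.encoder).__name__)   # BartEncoder
print(type(model.model.decoder).__name__)   # BartDecoder
print(model.config.encoder_layers, model.config.decoder_layers)  # 6 6 for bart-base
```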

  • BART is trained with masked language modeling, but unlike a typical encoder that sticks to a single corruption strategy, it can use any of several. Usually, a span of multiple tokens is replaced with a single [MASK], so the model needs to predict a whole sequence of tokens; this is called the text-infilling corruption strategy (see the infilling example after this list).
  • BART's encoder has the same architecture as BERT's, with a few differences:
    • There are no segment embeddings.
    • At the end of the encoder, BERT passes the output to a feed-forward layer for prediction, while BART passes it on to the decoder.
  • The encoder output is then passed to each decoder block, and, similar to GPT, the previously generated tokens are passed to the decoder as input as well.
    • Each decoder block includes (sketched in code after this list):
      • A masked multi-head self-attention layer with future tokens masked (only the previous tokens can be attended to)
      • Skip connection and sum layer
      • Layer normalization
      • A cross-attention layer that takes input from both the decoder and the encoder output
      • Skip connection and sum layer
      • Layer normalization
      • A feed-forward layer
      • Skip connection and sum layer
      • Layer normalization
  • The decoder output is passed to a language model head to predict the next token ID in the sequence.
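As a small illustration of text infilling, the sketch below replaces a multi-token span with a single <mask> token and lets a pretrained BART regenerate it. It assumes the Hugging Face transformers library and the facebook/bart-base checkpoint; the exact completion will vary by checkpoint and decoding settings:

```python
# Text-infilling sketch: one <mask> token stands in for an arbitrary-length span,
# and the decoder generates the full sequence.
# Assumes the Hugging Face `transformers` library and facebook/bart-base.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

text = "BART is a <mask> model for sequence-to-sequence tasks."
inputs = tokenizer(text, return_tensors="pt")

# The model outputs a whole sequence, not a single token, so the masked span
# can expand to several tokens.
output_ids = model.generate(**inputs, max_length=30, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```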
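And here is a minimal PyTorch sketch of one decoder block as listed above: masked self-attention, cross-attention, and a feed-forward layer, each followed by a skip connection and layer normalization. The dimensions and module choices are illustrative assumptions, not BART's reference implementation:

```python
# Minimal sketch of a single BART-style decoder block (post-layer-norm, GELU),
# with hypothetical bart-base-like dimensions.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_out):
        # 1. Masked self-attention: each position may only attend to earlier positions.
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)              # skip connection + layer norm

        # 2. Cross-attention: queries from the decoder, keys/values from the encoder output.
        cross_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + cross_out)             # skip connection + layer norm

        # 3. Position-wise feed-forward layer.
        x = self.norm3(x + self.ffn(x))           # skip connection + layer norm
        return x
```

In the full model, a stack of these blocks feeds the language model head mentioned above, which is a linear projection over the vocabulary (typically tied to the shared input embeddings) that produces the next-token logits.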

Footnotes

  1. Attention is All You Need
