BART
BART has an architecture very similar to the "original" transformer described in the 2017 paper¹. Another way to think about it is as a BERT-like encoder plus a GPT-like auto-regressive decoder, with the decoder having cross-attention to the output of the encoder.
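As a rough illustration of that wiring, here is a minimal sketch assuming the Hugging Face `transformers` library and the `facebook/bart-base` checkpoint (both assumptions, not stated above): the BERT-like encoder reads the whole input bidirectionally, and the GPT-like decoder consumes its output through cross-attention.

```python
# Minimal sketch, assuming Hugging Face `transformers` and the `facebook/bart-base`
# checkpoint, of BART's two halves: a bidirectional encoder and an auto-regressive
# decoder that cross-attends to the encoder output.
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base")

inputs = tokenizer("BART is a denoising sequence-to-sequence model.", return_tensors="pt")

with torch.no_grad():
    # BERT-like encoder: attends over the whole input at once.
    encoder_out = model.encoder(**inputs).last_hidden_state

    # GPT-like decoder: causal self-attention over the tokens generated so far,
    # plus cross-attention to `encoder_out`.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    decoder_out = model.decoder(
        input_ids=decoder_input_ids,
        encoder_hidden_states=encoder_out,
    ).last_hidden_state

print(encoder_out.shape, decoder_out.shape)  # (1, src_len, 768) and (1, 1, 768)
```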
- BART is trained on masked language modeling, but unlike a normal encoder, which sticks to one corruption strategy, it can use any of several. Usually, a span of multiple tokens is replaced with a single [mask], so the model needs to predict a sequence of tokens. This corruption strategy is called text infilling (a small inference-time example follows the list below).
- BART has an encoder architecture the same as BERT's, but with some differences:
    - No more segment embeddings.
    - At the end of the encoder, BERT passes the output to a feed-forward layer for prediction, while BART passes it over to the decoder.
- The encoder output is then passed to each decoder block. Similar to GPT, the previously generated sequence is passed to the decoder as input as well.
- Each decoder block includes (a PyTorch sketch of one block follows this list):
    - Masked multi-head self-attention layer, with future tokens masked (only previous tokens are attended to)
    - Skip Connection and Sum Layer
    - Layer Normalization
    - Cross-attention layer, which accepts input from both sides: queries from the decoder, keys and values from the encoder output
    - Skip Connection and Sum Layer
    - Layer Normalization
    - Feed-Forward Layer
    - Skip Connection and Sum Layer
    - Layer Normalization
- The output is passed to a language model head to predict the next token ID in the sequence.
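To make the text-infilling idea concrete, here is a minimal inference-time sketch, again assuming Hugging Face `transformers` and the `facebook/bart-base` checkpoint: a single mask token stands in for an unknown span, and the decoder generates the full, uncorrupted sequence.

```python
# Minimal sketch of filling a masked span with a pretrained BART model, assuming
# Hugging Face `transformers` and the `facebook/bart-base` checkpoint.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# One mask token may correspond to several original tokens (text infilling).
text = f"The movie was {tokenizer.mask_token} that I watched it twice."
inputs = tokenizer(text, return_tensors="pt")

generated = model.generate(inputs["input_ids"], num_beams=4, max_length=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```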
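And here is a minimal PyTorch sketch of one decoder block wired as in the list above (an illustration of the structure, not BART's actual implementation): masked self-attention, then cross-attention to the encoder output, then a feed-forward layer, each followed by a skip connection, sum, and layer normalization.

```python
# Minimal PyTorch sketch of a BART-style decoder block; layer sizes match bart-base
# (d_model=768, 12 heads) but this is an illustration, not the real implementation.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, encoder_out):
        # Causal mask: True marks future positions that may not be attended to.
        seq_len = x.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

        # Masked multi-head self-attention + skip connection/sum + layer norm.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + attn_out)

        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        cross_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + cross_out)

        # Feed-forward + skip connection/sum + layer norm.
        return self.norm3(x + self.ffn(x))

# Toy shapes: 5 decoder positions attending to an 8-token encoded source.
x = torch.randn(1, 5, 768)
enc = torch.randn(1, 8, 768)
out = DecoderBlock()(x, enc)
# After the last block, `out` would go to a language-model head (a linear layer
# over the vocabulary) to predict the next token ID.
print(out.shape)  # torch.Size([1, 5, 768])
```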