BERT
BERT is an encoder-only model and was the first to implement deep bidirectionality, learning richer representations of text by attending to the words on both sides of each token.
BERT is trained not only for masked language modeling but also for next sentence prediction (NSP), i.e. predicting whether one sentence follows another. However, NSP was dropped in later models of the BERT family (e.g., RoBERTa) because it was found to add little benefit.
- BERT uses WordPiece to split input text into subword tokens, for example:
"playing" => ["play", "##ing"]
- To differentiate the sentences in a pair, a [SEP] token is inserted between them.
- A [CLS] token is added to the beginning of the token sequence; its final hidden state is used as the aggregate representation for classification tasks.
- Word tokens are then mapped to word embeddings, which are trained as part of the model.
["play", "##ing"] => [1234, 9876] => [[...], [...]]
- Segment embeddings are added to denote whether a token belongs to the first or the second sentence of a pair, in the context of NSP.
- Position embeddings are added to encode the position of each token in the sequence.
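Putting the pieces together, the encoder input is the element-wise sum of the word, segment, and position embeddings (the released BERT implementation also applies layer normalization and dropout to this sum). A rough PyTorch sketch using BERT-base sizes and the illustrative token ids from above:

```python
# Rough sketch with BERT-base sizes; the specific token ids are illustrative.
import torch
import torch.nn as nn

vocab_size, hidden_size, max_positions, num_segments = 30522, 768, 512, 2

word_embeddings     = nn.Embedding(vocab_size, hidden_size)     # trained as part of the model
segment_embeddings  = nn.Embedding(num_segments, hidden_size)   # 0 = first sentence, 1 = second
position_embeddings = nn.Embedding(max_positions, hidden_size)  # learned absolute positions

# [CLS] play ##ing [SEP] -- 101/102 are BERT's [CLS]/[SEP] ids,
# 1234/9876 are the illustrative ids from the example above.
token_ids    = torch.tensor([[101, 1234, 9876, 102]])
segment_ids  = torch.zeros_like(token_ids)                      # every token in the first sentence
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)     # positions 0, 1, 2, 3

# The three embeddings are summed element-wise to form the encoder input.
input_embeddings = (word_embeddings(token_ids)
                    + segment_embeddings(segment_ids)
                    + position_embeddings(position_ids))
print(input_embeddings.shape)                                   # (1, 4, 768)
```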
- The input embeddings are passed through a stack of encoder layers to produce the final hidden states.
- The final hidden states have the same structure as the input: one embedding vector per token.
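The per-token structure of the final hidden states can be inspected directly; another sketch assuming the Hugging Face `transformers` library and `bert-base-uncased`:

```python
# Sketch only: again assumes Hugging Face `transformers` and `bert-base-uncased`.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The kids kept playing outside", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One final hidden state per input token: (batch_size, sequence_length, 768).
print(outputs.last_hidden_state.shape)
```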
- Each encoder block contains (see the sketch after this list):
  - Multi-head self-attention layer
  - Skip connection and sum
  - Layer normalization
  - Feed-forward layer
  - Skip connection and sum
  - Layer normalization
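A compact PyTorch sketch of one such encoder block, using BERT-base sizes (768 hidden units, 12 attention heads, 3072 intermediate units); dropout and attention masking are omitted for brevity:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: self-attention and feed-forward sublayers, each
    followed by a skip connection + sum and layer normalization."""

    def __init__(self, hidden_size=768, num_heads=12, intermediate_size=3072):
        super().__init__()
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Linear(intermediate_size, hidden_size),
        )
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Multi-head self-attention -> skip connection and sum -> layer norm.
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward -> skip connection and sum -> layer norm.
        x = self.norm2(x + self.feed_forward(x))
        return x
```

BERT-base stacks 12 of these blocks; BERT-large stacks 24.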
- In masked language modeling (MLM), the final hidden states at the masked positions are passed to a feed-forward network with a softmax over the vocabulary to predict the masked token(s).
- In next sentence prediction (NSP), the final hidden state of the [CLS] token is passed to a feed-forward network with a softmax layer to perform the binary classification.
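A rough sketch of both prediction heads on top of the final hidden states (BERT-base sizes; the heads are simplified, e.g. the released model also passes the [CLS] state through a pooling layer before the NSP classifier):

```python
import torch
import torch.nn as nn

hidden_size, vocab_size = 768, 30522   # BERT-base sizes

# MLM head: transform the hidden state, then project to vocabulary logits;
# softmax over the vocabulary scores every candidate for the masked position.
mlm_head = nn.Sequential(
    nn.Linear(hidden_size, hidden_size),
    nn.GELU(),
    nn.LayerNorm(hidden_size),
    nn.Linear(hidden_size, vocab_size),
)

# NSP head: binary classification (IsNext / NotNext) from the [CLS] state.
nsp_head = nn.Linear(hidden_size, 2)

batch_size, seq_len = 1, 4
final_hidden_states = torch.randn(batch_size, seq_len, hidden_size)  # stand-in for encoder output

masked_position = 2                                              # pretend this token was [MASK]
mlm_probs = mlm_head(final_hidden_states[:, masked_position]).softmax(dim=-1)  # (1, vocab_size)

cls_hidden = final_hidden_states[:, 0]                           # [CLS] is always the first token
nsp_probs = nsp_head(cls_hidden).softmax(dim=-1)                 # (1, 2)
```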