- Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x1, x2, …, xn), each composed of a variable-length sequence of symbols (s1, s2, …, sn).
- Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). A general system should condition not only on the input but also on the task to be performed, modeling p(output|input, task).
- Language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. Language modeling is also able to, in principle, learn the tasks without the need for explicit supervision of which symbols are the outputs to be predicted.
- For example, a translation training example can be written as the sequence (translate to french, english text, french text).
- A reading comprehension training example can be written as (answer the question, document, question, answer).
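The task-as-sequence idea above can be sketched in a few lines. This is a hypothetical illustration (the function name and field values are made up, not from the paper): the task specification, inputs, and outputs are simply concatenated into one symbol sequence, so a single language model can be trained on all of them.

```python
def format_example(task, *fields):
    """Serialize a task description and its fields into one flat text sequence,
    so the language model only ever sees a single stream of symbols."""
    return " ".join((task,) + fields)

# Translation: (translate to french, english text, french text)
translation = format_example("translate to french", "english text", "french text")

# Reading comprehension: (answer the question, document, question, answer)
qa = format_example("answer the question", "document", "question", "answer")
```

At inference time, the model would be given only the task description and the input fields, and the output is read off as the model's continuation of the sequence.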
- The training set (WebText) contains slightly over 8 million documents, for a total of 40 GB of text.
- Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character- and word-level language modeling: it effectively interpolates between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences.
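A minimal sketch of the BPE merge procedure from Sennrich et al. (2015): count adjacent symbol pairs across a word-frequency vocabulary and repeatedly merge the most frequent pair into a new symbol. The toy vocabulary and `</w>` end-of-word marker follow the paper's example; the number of merges here is arbitrary.

```python
import collections
import re

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy word-frequency vocabulary; symbols are space-separated, </w> marks word end.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(3):  # perform three merge steps
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
```

After a few merges, frequent subsequences like "est" become single symbols, while rare words remain split into smaller units, which is exactly the interpolation described above. (GPT-2 applies BPE over raw bytes rather than unicode characters, which this sketch does not show.)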
- The model architecture largely follows the details of the OpenAI GPT model, with a few modifications:
- Layer normalization was moved to the input of each sub-block and an additional layer normalization was added after the final self-attention block.
- The weights of residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers.
- The vocabulary is expanded to 50,257 tokens, the context size is increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
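The residual-scaling modification above can be sketched as follows. This is an assumption-laden illustration, not the released GPT-2 code: the base standard deviation of 0.02 and the layer count of 12 are assumed values used for demonstration.

```python
import math
import random

BASE_STD = 0.02  # assumed base init std, not specified in the notes above

def init_residual_weight(rows, cols, n_residual_layers, seed=0):
    """Gaussian weight init with the std scaled down by 1/sqrt(N),
    where N is the number of residual layers, per the GPT-2 modification."""
    rng = random.Random(seed)
    std = BASE_STD / math.sqrt(n_residual_layers)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

# Example: a small 4x4 residual projection in a 12-layer model.
w = init_residual_weight(4, 4, n_residual_layers=12)
```

The motivation is that residual contributions accumulate across the N layers of the residual path, so shrinking each layer's initial weights by 1/√N keeps the scale of activations roughly constant with depth.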