[Paper Notes][2019] GPT-2: Language Models are Unsupervised Multitask Learners

Core

  • Language modeling is usually framed as unsupervised distribution estimation from a set of examples (x1, x2, …, xn), each composed of variable-length sequences of symbols (s1, s2, …, sn); the factorization is written out after this list.

  • Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). A general system should condition not only on the input but also on the task to be performed, i.e., it should model p(output|input, task).
  • Language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. Language modeling is also able to, in principle, learn the tasks without the need for explicit supervision of which symbols are the outputs to be predicted.
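Written out, the chain-rule factorization behind language modeling and the task-conditioned objective it generalizes to are:

```latex
% Factorization of a symbol sequence x = (s_1, \ldots, s_n)
p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})

% Single-task estimation vs. the general, task-conditioned form
p(\text{output} \mid \text{input})
\;\longrightarrow\;
p(\text{output} \mid \text{input}, \text{task})
```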

Multi-task

p(output|input, task)

  • For example, a translation training example can be written as the sequence (translate to french, english text, french text).
  • A reading comprehension training example can be written as (answer the question, document, question, answer); a serialization sketch follows this list.
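A minimal sketch of how such examples could be flattened into a single symbol sequence for a language model; the separator strings and function names are illustrative assumptions, not the paper's exact format:

```python
def serialize_translation(english_text: str, french_text: str) -> str:
    # Task, input, and output all become one flat text sequence;
    # the wording of the task prefix is an illustrative assumption.
    return f"translate to french. {english_text} = {french_text}"

def serialize_reading_comprehension(document: str, question: str, answer: str) -> str:
    return f"answer the question. {document} {question} {answer}"

example = serialize_translation(
    "The cat sat on the mat.",
    "Le chat était assis sur le tapis.",
)
```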

Training Dataset

The WebText dataset: slightly over 8 million documents for a total of 40 GB of text.

Input Representation

Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences.
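As a rough sketch of the BPE idea (character-level here for readability; GPT-2 actually operates on bytes, and this toy learner is an assumption for illustration, not the paper's implementation):

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Toy BPE: start from single characters and repeatedly merge the
    most frequent adjacent symbol pair into a new symbol."""
    words = [list(w) for w in corpus]  # each word as a list of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))  # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges
```

Frequent sequences end up as single merged symbols (word-like units), while rare sequences fall back to shorter, character-like pieces.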

Model

  • Largely follows the details of the OpenAI GPT model, with a few modifications:
  • Layer normalization was moved to the input of each sub-block and an additional layer normalization was added after the final self-attention block.
  • The weights of residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers (see the sketch after this list).
  • The vocabulary is expanded to 50,257 tokens. The context size is also increased from 512 to 1024 tokens, and a larger batch size of 512 is used.
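A minimal PyTorch sketch of the first two modifications (pre-sub-block layer normalization and 1/√N residual weight scaling at init). The module layout, the use of nn.MultiheadAttention, and counting two residual layers per block are assumptions of this sketch; the extra layer norm after the final block is omitted:

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Sketch of a GPT-2 style Transformer block: layer norm at the *input*
    of each sub-block, with residual-path projections scaled at init."""

    def __init__(self, d_model: int, n_head: int, n_layer: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scale residual-path output projections by 1/sqrt(N) at init,
        # taking N as the total number of residual layers (assumed 2 per block).
        scale = 1.0 / math.sqrt(2 * n_layer)
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[-1].weight.mul_(scale)

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                        # layer norm before the sub-block
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                              # residual connection
        x = x + self.mlp(self.ln2(x))          # pre-norm MLP sub-block
        return x
```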

paper
