[Paper Notes][NIPS-2019] XLNet: Generalized Autoregressive Pretraining for Language Understanding

Permutation Language Modeling

I. Introduction

  • Autoregressive language model (Autoregressive LM)

Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts.

  • Autoencoding language model (Autoencoder LM)

Aims to reconstruct the original data from corrupted input (e.g., the [MASK] corruption used in BERT).

  • Contributions: a generalized AR pretraining objective (permutation language modeling), two-stream self-attention for target-aware representations, and the integration of Transformer-XL ideas into pretraining

II. Proposed Method

1. Background

2. Objective: Permutation Language Modeling
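Written out, the permutation language modeling objective maximizes the expected log-likelihood under all permutations $\mathbf{z}$ of the factorization order (with $\mathcal{Z}_T$ the set of permutations of length $T$), so each position eventually sees context from both sides while keeping the AR factorization:

$$\max_{\theta}\;\; \mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\left[\sum_{t=1}^{T}\log p_{\theta}\!\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]$$

The parameters $\theta$ are shared across all factorization orders; only the factorization order is permuted (via attention masks), not the actual order of tokens in the sequence.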

3. Architecture: Two-Stream Self-Attention for Target-Aware Representations

  • The content representation $h_\theta(\mathbf{x}_{\mathbf{z}_{\le t}})$, or abbreviated as $h_{z_t}$, which serves a similar role to the standard hidden states in Transformer. This representation encodes both the context and $x_{z_t}$ itself.
  • The query representation $g_\theta(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)$, or abbreviated as $g_{z_t}$, which only has access to the contextual information $\mathbf{x}_{\mathbf{z}_{<t}}$ and the position $z_t$, but not the content $x_{z_t}$, as discussed above. (A minimal sketch of both streams follows this list.)
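To make the two streams concrete, here is a minimal single-head NumPy sketch (my own illustration, not the released XLNet code): keys and values always come from the content stream, the content query at step $t$ may attend to $\mathbf{z}_{\le t}$, and the query-stream query may only attend to $\mathbf{z}_{<t}$.

```python
# Minimal single-head sketch of two-stream self-attention over one permutation.
# Names and shapes are illustrative assumptions, not the paper's implementation.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_attention(h, g, perm, Wq, Wk, Wv):
    """h: (T, d) content stream; g: (T, d) query stream; perm: factorization order."""
    T, d = h.shape
    rank = np.empty(T, dtype=int)
    rank[perm] = np.arange(T)                  # rank[i] = step at which position i is predicted

    # Content stream may see itself and everything earlier in the factorization order;
    # the query stream sees strictly earlier positions only (never its own content).
    content_mask = rank[None, :] <= rank[:, None]
    query_mask = rank[None, :] < rank[:, None]

    k, v = h @ Wk, h @ Wv                      # keys/values always come from the content stream

    def attend(q, mask):
        scores = np.where(mask, (q @ k.T) / np.sqrt(d), -1e30)
        return softmax(scores) @ v

    h_new = attend(h @ Wq, content_mask)       # updates h_{z_t}: encodes context and x_{z_t}
    g_new = attend(g @ Wq, query_mask)         # updates g_{z_t}: context x_{z<t} + position only
    return h_new, g_new

# Toy usage (the position predicted first has an empty context here; the real model
# additionally attends to cached memory / a learned initial state).
rng = np.random.default_rng(0)
T, d = 5, 8
h0 = rng.normal(size=(T, d))                   # initialized from token embeddings in the paper
g0 = np.tile(rng.normal(size=(1, d)), (T, 1))  # initialized from a shared learnable vector w
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
h1, g1 = two_stream_attention(h0, g0, rng.permutation(T), Wq, Wk, Wv)
print(h1.shape, g1.shape)
```

Note that the projection weights are shared between the two streams; only the attention masks and the query inputs differ.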

4. Incorporating Ideas from Transformer-XL

  • relative positional encodings
  • recurrence mechanism (segment-level memory; a minimal sketch follows below)
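The recurrence part can be sketched as follows: hidden states from the previous segment are cached and prepended to the keys/values of the current segment. This is my own simplified NumPy illustration (the relative positional bias is omitted for brevity), not the released code.

```python
# Sketch of segment-level recurrence (Transformer-XL style): cached states from the
# previous segment extend the keys/values of the current one. Illustrative names only.
import numpy as np

def attend_with_memory(h, mem, Wq, Wk, Wv):
    """h: (T, d) current-segment states; mem: (M, d) cached states from the last segment."""
    d = h.shape[-1]
    ctx = np.concatenate([mem, h], axis=0)     # keys/values span memory + current segment
    q, k, v = h @ Wq, ctx @ Wk, ctx @ Wv
    scores = (q @ k.T) / np.sqrt(d)            # relative positional bias omitted for brevity
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
T, M, d = 4, 6, 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
mem = rng.normal(size=(M, d))                  # previous segment's layer output (no gradient)
h = rng.normal(size=(T, d))
out = attend_with_memory(h, mem, Wq, Wk, Wv)
new_mem = h                                    # cached for the next segment
print(out.shape, new_mem.shape)
```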

5. Modeling Multiple Segments

Relative Segment Encodings

Given a pair of positions $i$ and $j$ in the sequence, if $i$ and $j$ are from the same segment, we use a segment encoding $s_{ij} = s_+$, and otherwise $s_{ij} = s_-$, where $s_+$ and $s_-$ are learnable model parameters for each attention head. When $i$ attends to $j$, the segment encoding $s_{ij}$ is used to compute an attention weight $a_{ij} = (q_i + b)^\top s_{ij}$, where $q_i$ is the query vector as in a standard attention operation and $b$ is a learnable head-specific bias vector. Finally, the value $a_{ij}$ is added to the normal attention weight.
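A small NumPy sketch of this bias term (variable names are my own; a real implementation would compute it per head and add the result to the content-based attention scores):

```python
# Relative segment-encoding bias: a_ij = (q_i + b)^T s_ij, with s_ij = s_plus when
# positions i and j are in the same segment and s_minus otherwise. Illustrative only.
import numpy as np

def segment_bias(q, seg_ids, s_plus, s_minus, b):
    """q: (T, d) query vectors; seg_ids: (T,) segment id of each position."""
    same = seg_ids[:, None] == seg_ids[None, :]        # (T, T): do i and j share a segment?
    s = np.where(same[..., None], s_plus, s_minus)     # (T, T, d): pick s_+ or s_-
    return np.einsum('td,tjd->tj', q + b, s)           # a_ij = (q_i + b) . s_ij

rng = np.random.default_rng(0)
T, d = 6, 8
q = rng.normal(size=(T, d))
seg_ids = np.array([0, 0, 0, 1, 1, 1])                 # e.g. segment A then segment B
s_plus, s_minus, b = (rng.normal(size=d) for _ in range(3))
a = segment_bias(q, seg_ids, s_plus, s_minus, b)       # added to the normal attention scores
print(a.shape)  # (6, 6)
```

Because only the relative relation (same segment or not) is encoded, this generalizes to more than two segments at fine-tuning time.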

References: paper · XLNet:运行机制及和Bert的异同比较 (a Chinese blog post on how XLNet works and how it differs from BERT) · slides
