Permutation Language Modeling
I. Introduction
- Autoregressive language model (AR LM)
Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts.
- Autoencoding language model (AE LM)
Aims to reconstruct the original data from a corrupted input (e.g., tokens replaced with [MASK] in BERT). Both objectives are sketched right after this list.
- contributions
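
For reference, the two pretraining objectives contrasted in the XLNet paper (notation follows the paper; this is a summary sketch, not a derivation):

$$
\text{AR:}\quad \max_\theta\ \log p_\theta(\mathbf{x}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid \mathbf{x}_{<t})
$$

$$
\text{AE (denoising):}\quad \max_\theta\ \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_\theta(x_t \mid \hat{\mathbf{x}})
$$

where $\hat{\mathbf{x}}$ is the corrupted input, $\bar{\mathbf{x}}$ the set of masked tokens, and $m_t = 1$ iff $x_t$ is masked. The AE form drops the joint dependence among the masked tokens, the independence assumption the paper criticizes.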
II. Proposed Method
1. Background
2. Objective: Permutation Language Modeling
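The permutation LM objective as given in the XLNet paper (a sketch; $\mathcal{Z}_T$ is the set of all length-$T$ permutations, and $z_t$, $\mathbf{z}_{<t}$ denote the $t$-th element and the first $t-1$ elements of a permutation $\mathbf{z}$):

$$
\max_\theta\ \mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\left[\sum_{t=1}^{T}\log p_\theta\!\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]
$$

Only the factorization order is permuted; the sequence order of the input (and hence the positional encodings) stays intact.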
3. Architecture: Two-Stream Self-Attention for Target-Aware Representations
- The content representation $h_\theta(\mathbf{x}_{\mathbf{z}_{\le t}})$, abbreviated as $h_{z_t}$, which serves a similar role to the standard hidden states in Transformer. This representation encodes both the context and $x_{z_t}$ itself.
- The query representation $g_\theta(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)$, abbreviated as $g_{z_t}$, which only has access to the contextual information $\mathbf{x}_{\mathbf{z}_{<t}}$ and the position $z_t$, but not the content $x_{z_t}$, as discussed above.
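
A minimal sketch of how the two streams differ in what they may attend to, assuming the permutation is materialized as boolean attention masks (the function name and shapes are illustrative, not taken from the paper's code):

```python
import numpy as np

def two_stream_masks(perm):
    """Build the content- and query-stream attention masks for one permutation.

    perm[k] is the original position factorized at step k, e.g. [2, 0, 3, 1].
    Returns two boolean (T, T) masks indexed by original positions:
    mask[i, j] == True means token i may attend to token j.
    """
    perm = np.asarray(perm)
    T = len(perm)
    # step[i] = where original position i appears in the factorization order
    step = np.empty(T, dtype=int)
    step[perm] = np.arange(T)

    # Content stream h_{z_t}: sees x_{z<=t}, i.e. itself and every token
    # that comes no later in the permutation order.
    content_mask = step[None, :] <= step[:, None]

    # Query stream g_{z_t}: sees only x_{z<t} plus the target position z_t,
    # never the target content itself.
    query_mask = step[None, :] < step[:, None]
    return content_mask, query_mask

c_mask, q_mask = two_stream_masks([2, 0, 3, 1])
print(c_mask.astype(int))  # row i: contents visible to position i's h-stream
print(q_mask.astype(int))  # row i: contents visible to position i's g-stream
```

Both streams share the same Transformer parameters; the query stream is only needed to form prediction targets during pretraining and is dropped at finetuning time.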
4. Incorporating Ideas from Transformer-XL
- relative positional encodings
- recurrence mechanism
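
For context, a reminder sketch of the two Transformer-XL components being borrowed, with formulas as in the Transformer-XL paper (XLNet adapts them to the two-stream, permutation setting). The relative attention score between query position $i$ and key position $j$:

$$
A^{\mathrm{rel}}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j} + u^{\top} W_{k,E}\, E_{x_j} + v^{\top} W_{k,R}\, R_{i-j}
$$

Segment-level recurrence: the previous segment's layer-$(n-1)$ hidden states are cached and concatenated (with stopped gradients, $\mathrm{SG}$) to the current segment's states, and keys and values are computed from the concatenation:

$$
\tilde{\mathbf{h}}^{\,n-1}_{\tau+1} = \left[\mathrm{SG}\!\left(\mathbf{h}^{\,n-1}_{\tau}\right) \circ \mathbf{h}^{\,n-1}_{\tau+1}\right]
$$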
5. Modeling Multiple Segments
Relative Segment Encodings
Given a pair of positions $i$ and $j$ in the sequence, if $i$ and $j$ are from the same segment, we use a segment encoding $s_{ij} = s_+$, and otherwise $s_{ij} = s_-$, where $s_+$ and $s_-$ are learnable model parameters for each attention head. When $i$ attends to $j$, the segment encoding $s_{ij}$ is used to compute an attention weight $a_{ij} = (q_i + b)^{\top} s_{ij}$, where $q_i$ is the query vector as in a standard attention operation and $b$ is a learnable head-specific bias vector. Finally, the value $a_{ij}$ is added to the normal attention weight.
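
A small numpy sketch of this computation for a single head (names, shapes, and the toy usage are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def segment_attention_bias(seg_ids, q, s_plus, s_minus, b):
    """Compute a_ij = (q_i + b)^T s_ij for one attention head.

    seg_ids: (T,) segment id of each position.
    q:       (T, d) query vectors for this head.
    s_plus, s_minus, b: (d,) learnable vectors (same-segment encoding,
    different-segment encoding, head-specific bias).
    Returns a (T, T) bias matrix that is added to the normal attention scores.
    """
    same = seg_ids[:, None] == seg_ids[None, :]          # (T, T): same segment?
    s_ij = np.where(same[..., None], s_plus, s_minus)    # (T, T, d)
    return np.einsum('id,ijd->ij', q + b, s_ij)          # (q_i + b)^T s_ij

# Toy usage: 5 tokens, first three in segment 0, last two in segment 1.
rng = np.random.default_rng(0)
d = 4
bias = segment_attention_bias(
    np.array([0, 0, 0, 1, 1]),
    rng.standard_normal((5, d)),
    rng.standard_normal(d),
    rng.standard_normal(d),
    rng.standard_normal(d),
)
print(bias.shape)  # (5, 5)
```

Because only the relative question "same segment or not" enters the score, the model can be finetuned on tasks with more than two input segments, which absolute segment embeddings (as in BERT) do not support.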
References: the XLNet paper; the blog post "XLNet:运行机制及和Bert的异同比较" (XLNet: how it works and how it differs from BERT); slides.