【NeurIPS-2019】XLNet: Generalized Autoregressive Pretraining for Language Understanding

Permutation Language Modeling

一. Introduction

  • Autoregressive language model (AR LM)

Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts
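For reference (following the paper's background section), the forward AR factorization being contrasted here is

$$\max_{\theta}\;\log p_{\theta}(\mathbf{x}) \;=\; \sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid \mathbf{x}_{<t}\right),$$

so each token is conditioned only on the tokens before it (or after it, for a backward model).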

  • Autoencoding language model (AE LM)

aims to reconstruct the original data from corrupted input. (like [MASK] in BERT)
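Concretely, with $\hat{\mathbf{x}}$ the corrupted sequence and $m_t = 1$ marking masked positions, BERT's objective is roughly

$$\max_{\theta}\;\sum_{t=1}^{T} m_t \,\log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right),$$

where the masked tokens are reconstructed independently of each other; this independence assumption and the [MASK] pretrain-finetune discrepancy are the two drawbacks XLNet highlights.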

  • contributions

二. Proposed Method

1. Background

2. Objective: Permutation Language Modeling
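The core objective: sample a factorization order $\mathbf{z}$ from the set $\mathcal{Z}_T$ of all permutations of $[1, \dots, T]$ and train shared parameters $\theta$ across all orders,

$$\max_{\theta}\;\; \mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\!\left[\sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right].$$

Only the factorization order is permuted, not the actual token order of the sequence, so positional encodings still correspond to the original positions; in expectation every position can condition on every other position, which recovers bidirectional context while keeping the AR product form.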

3. Architecture: Two-Stream Self-Attention for Target-Aware Representations

  • The content representation $h_{\theta}(\mathbf{x}_{\mathbf{z}_{\le t}})$, or abbreviated as $h_{z_t}$, which serves a similar role to the standard hidden states in Transformer. This representation encodes both the context and $x_{z_t}$ itself.
  • The query representation $g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)$, or abbreviated as $g_{z_t}$, which only has access to the contextual information $\mathbf{x}_{\mathbf{z}_{<t}}$ and the position $z_t$, but not the content $x_{z_t}$, as discussed above.
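Schematically, the two streams are updated layer by layer ($m = 1, \dots, M$) as

$$g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\!\left(\mathrm{Q} = g_{z_t}^{(m-1)},\; \mathrm{KV} = \mathbf{h}_{\mathbf{z}_{<t}}^{(m-1)};\, \theta\right) \quad \text{(query stream: uses } z_t \text{ but not } x_{z_t}\text{)},$$

$$h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\!\left(\mathrm{Q} = h_{z_t}^{(m-1)},\; \mathrm{KV} = \mathbf{h}_{\mathbf{z}_{\le t}}^{(m-1)};\, \theta\right) \quad \text{(content stream: uses both } z_t \text{ and } x_{z_t}\text{)},$$

with $h_i^{(0)} = e(x_i)$ (the word embedding) and $g_i^{(0)} = w$ (a trainable vector); the last-layer query representation $g_{z_t}^{(M)}$ is the one used to predict $x_{z_t}$.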

4. Incorporating Ideas from Transformer-XL

  • relative positional encodings
  • recurrence mechanism
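A minimal sketch of the recurrence idea under simplifying assumptions (a single standard attention layer; the class name and shapes are mine, and relative positional encodings plus the two streams above are omitted):

```python
import torch
import torch.nn as nn
from typing import Optional


class SegmentRecurrentLayer(nn.Module):
    """Illustrative attention layer with Transformer-XL style segment-level
    recurrence: the previous segment's hidden states are cached and reused
    as extra (gradient-free) context for the current segment."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h: torch.Tensor, mem: Optional[torch.Tensor] = None):
        # h:   (batch, cur_len, d_model) hidden states of the current segment
        # mem: (batch, mem_len, d_model) cached states of the previous segment
        if mem is not None:
            # Stop-gradient on the cache, then let the current segment attend
            # over [previous segment; current segment].
            context = torch.cat([mem.detach(), h], dim=1)
        else:
            context = h
        out, _ = self.attn(query=h, key=context, value=context, need_weights=False)
        new_mem = h.detach()  # becomes the memory for the next segment
        return out, new_mem


# Usage: process two consecutive segments, carrying the cache forward.
layer = SegmentRecurrentLayer()
seg1, seg2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
out1, mem = layer(seg1)
out2, mem = layer(seg2, mem)
```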

5. Modeling Multiple Segments

Relative Segment Encodings

Given a pair of positions $i$ and $j$ in the sequence, if $i$ and $j$ are from the same segment we use a segment encoding $s_{ij} = s_{+}$, and otherwise $s_{ij} = s_{-}$, where $s_{+}$ and $s_{-}$ are learnable model parameters for each attention head. When $i$ attends to $j$, the segment encoding $s_{ij}$ is used to compute an attention weight $a_{ij} = (\mathbf{q}_i + \mathbf{b})^{\top} s_{ij}$, where $\mathbf{q}_i$ is the query vector as in a standard attention operation and $\mathbf{b}$ is a learnable head-specific bias vector. Finally, the value $a_{ij}$ is added to the normal attention weight.
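A small sketch of how this segment bias can be computed for one attention head (the function name, tensor layout, and use of einsum are my own choices; the actual XLNet implementation organizes this differently):

```python
import torch


def segment_attention_bias(q, seg_q, seg_k, s_plus, s_minus, b):
    """Compute a_ij = (q_i + b)^T s_ij for a single attention head.

    q        : (len_q, d) query vectors
    seg_q    : (len_q,)   integer segment ids of the query positions
    seg_k    : (len_k,)   integer segment ids of the key positions
    s_plus   : (d,) learnable encoding used when i and j share a segment
    s_minus  : (d,) learnable encoding used when they do not
    b        : (d,) learnable head-specific bias

    Returns a (len_q, len_k) matrix that is added to the ordinary
    content-based attention logits.
    """
    same = (seg_q[:, None] == seg_k[None, :]).unsqueeze(-1)  # (len_q, len_k, 1)
    s_ij = torch.where(same, s_plus, s_minus)                # (len_q, len_k, d)
    return torch.einsum("qd,qkd->qk", q + b, s_ij)
```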

paper · XLNet:运行机制及和Bert的异同比较 (a Chinese blog post on how XLNet works and how it differs from BERT) · slides

【CoRR-2019】A Mutual Information Maximization Perspective of Language Representation Learning

Brief summary: the paper argues that most language representation models (SKIP-GRAM, BERT, XLNet, etc.) are in effect maximizing the mutual information between a local view of the input (the masked word, or an n-gram contrasted against negative samples) and a global view (the sentence containing the masked word), and it proposes an improvement in which the local information (n-grams) is used as the negative samples.

一. Introduction

provide an alternative view and show that these methods also maximize a lower bound on the mutual information between different parts of a word sequence

二. Mutual information maximization

1. mutual information

2. InfoNCE
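For reference, the InfoNCE bound (van den Oord et al.) that the paper builds on can be written, up to notation, as

$$I(A; B) \;\ge\; \log N \;+\; \mathbb{E}\!\left[\log \frac{\exp f_{\theta}(a, b)}{\sum_{\tilde{b} \in \tilde{\mathcal{B}}} \exp f_{\theta}(a, \tilde{b})}\right],$$

where $\tilde{\mathcal{B}}$ contains the positive sample $b$ together with $N-1$ negatives and $f_{\theta}$ is a learned scoring function. In the paper's reading of BERT, for instance, $a$ is the masked sentence (global view), $b$ is the identity of the masked word (local view), and the negatives are the rest of the vocabulary.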

三. Mutual information maximization-based models

1. SKIP-GRAM

2. BERT

3. XLNet

4. INFOWORD

4.1 the mutual information between global representation and local representation

4.2 objective function of InfoWord

四. Experiments

1. Results

2. Discussion

  • Span-based models

$J_{\text{DIM}}$ is related to span-based models such as SpanBERT and MASS

  • Mutual information maximization

InfoNCE is widely accepted as a good representation learning objective

  • Regularization

Our analysis and the connection we draw to representation learning methods used in other domains provide an insight into possible ways to incorporate prior knowledge into language representation learning models (note: adding regularization terms to the objective function is a way to encode prior knowledge; regularizers correspond to priors)

paper

【EMNLP-2019】Patient Knowledge Distillation for BERT Model Compression

一. Contributions

Two different strategies:

  • (i) PKD-Last: the student learns from the last k layers of the teacher, under the assumption that the top layers of the original network contain the most informative knowledge to teach the student;
  • (ii) PKD-Skip: the student learns from every k layers of the teacher, suggesting that the lower layers of the teacher network also contain important information and should be passed along for incremental distillation.
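A toy helper (the function name and interface are my own) showing which teacher layers would supervise the student's intermediate layers under the two strategies; the two resulting index sets below match the paper's BERT12-to-BERT6 setup:

```python
def teacher_layer_map(n_teacher: int = 12, n_student: int = 6, strategy: str = "skip"):
    """Return the teacher layer indices (1-based) used for patient distillation.

    The student's last layer is handled by the usual distillation loss on the
    logits, so only n_student - 1 intermediate layers need a teacher match.

      teacher_layer_map(12, 6, "last") -> [7, 8, 9, 10, 11]
      teacher_layer_map(12, 6, "skip") -> [2, 4, 6, 8, 10]
    """
    k = n_student - 1
    if strategy == "last":   # PKD-Last: the k layers just below the teacher's top
        return list(range(n_teacher - k, n_teacher))
    if strategy == "skip":   # PKD-Skip: every (n_teacher // n_student)-th layer
        step = n_teacher // n_student
        return list(range(step, n_teacher, step))[:k]
    raise ValueError(f"unknown strategy: {strategy}")
```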

二. Related Work

Language Model Pre-training

  • (i) feature-based approach;
    • context-independent word representation (e.g., word2vec, GloVe, FastText)
    • sentence-level representation
    • contextualized word representation (e.g., CoVe, ELMo)
  • (ii) fine-tuning approach (e.g., GPT, BERT)

Model Compression & Knowledge Distillation

  • high degree of parameter redundancy: network pruning, weight quantization
  • compress a network with a large set of parameters into a compact and fast-to-execute model: knowledge distillation

三. Patient Knowledge Distillation

Distillation Objective

Vanilla KD as defined above tends to overfit, so the patient version (PKD) below is used instead.
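Schematically (notation loosely follows the paper), the patient term is a normalized MSE between the [CLS] hidden states of the student's intermediate layers and the matched teacher layers, added on top of the task loss and the distillation loss:

$$L_{PT} \;=\; \sum_{i=1}^{N}\sum_{j=1}^{M}\left\lVert \frac{\mathbf{h}_{i,j}^{s}}{\lVert \mathbf{h}_{i,j}^{s}\rVert_{2}} - \frac{\mathbf{h}_{i, I_{pt}(j)}^{t}}{\lVert \mathbf{h}_{i, I_{pt}(j)}^{t}\rVert_{2}} \right\rVert_{2}^{2}, \qquad L_{PKD} \;=\; (1-\alpha)\, L_{CE}^{s} \;+\; \alpha\, L_{DS} \;+\; \beta\, L_{PT},$$

where $I_{pt}(j)$ is the teacher layer matched to student layer $j$ (PKD-Last or PKD-Skip above), $L_{CE}^{s}$ is the student's cross-entropy on the ground-truth labels, and $L_{DS}$ is the distillation cross-entropy against the teacher's soft predictions.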

Patient Teacher for Model Compression

四. Experiments

paper

Commonly used shell scripts

1. Start Jupyter (can be configured to launch automatically at boot)

2. Automatically log in to a server (requires expect)

3. Check jps across the cluster (passwordless SSH assumed)