[Paper Notes][NeurIPS-2019]XLNet: Generalized Autoregressive Pretraining for Language Understanding

Permutation Language Modeling

I. Introduction

  • Autoregressive language model (AR LM)

Since an AR language model is only trained to encode a uni-directional context (either forward or backward), it is not effective at modeling deep bidirectional contexts.

  • Autoencoding language model (AE LM)

Aims to reconstruct the original data from a corrupted input (e.g., tokens replaced with [MASK] in BERT). Both objectives are sketched after this list.

  • contributions
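
For reference, a recap of the two pretraining families above (my own summary of the paper's background section, so the notation may differ slightly):

AR:        max_θ  log p_θ(x) = Σ_{t=1}^{T} log p_θ(x_t | x_{<t})
AE (BERT): max_θ  Σ_{t=1}^{T} m_t · log p_θ(x_t | x̂)

where x̂ is the corrupted input and m_t = 1 iff x_t is masked.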

II. Proposed Method

1. Background

2. Objective: Permutation Language Modeling
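
The permutation LM objective (written from memory, so the notation may differ slightly from the paper): with Z_T the set of all permutations of [1, ..., T],

max_θ  E_{z ~ Z_T} [ Σ_{t=1}^{T} log p_θ(x_{z_t} | x_{z_{<t}}) ]

i.e., the model stays autoregressive, but the factorization order is sampled, so in expectation every token gets to see bidirectional context.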

3. Architecture: Two-Stream Self-Attention for Target-Aware Representations

  • The content representation h_θ(x_{z<=t}), abbreviated h_{z_t}, serves a similar role to the standard hidden states in the Transformer: it encodes both the context and x_{z_t} itself.
  • The query representation g_θ(x_{z<t}, z_t), abbreviated g_{z_t}, only has access to the contextual information x_{z<t} and the position z_t, but not the content x_{z_t}, as discussed above. A minimal sketch of the two streams follows this list.
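
A minimal single-head sketch of the two-stream update, under my own simplifying assumptions (one head, no relative positions, masks precomputed from the sampled permutation, and every mask row assumed to allow at least one position). It only shows where each stream's queries, keys and values come from; it is not XLNet's actual implementation:

import torch
import torch.nn.functional as F

def masked_attention(q, h, mask):
    # q: (T, d) queries; h: (T, d) content states used as keys and values
    # mask: (T, T) boolean, mask[t, j] = True if position t may attend to j
    scores = (q @ h.t()) / h.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ h

def two_stream_layer(h, g, content_mask, query_mask):
    # Content stream: queries, keys and values all come from h;
    # position t may attend to z_<=t, i.e. it sees x_{z_t} itself.
    h_new = masked_attention(h, h, content_mask)
    # Query stream: queries come from g, keys/values still come from h;
    # position t may only attend to z_<t (query_mask excludes the diagonal),
    # so it knows the target position but never the target content.
    g_new = masked_attention(g, h, query_mask)
    return h_new, g_new

In the full model the two streams share parameters, and the last-layer query representation g is what feeds the softmax that predicts the target token.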

4. Incorporating Ideas from Transformer-XL

  • relative positional encodings
  • segment-level recurrence mechanism (the previous segment's hidden states are cached and reused as extra memory; a rough sketch follows this list)
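
A rough sketch of the recurrence idea, based on my understanding of Transformer-XL rather than XLNet's code (relative positional encodings and the permutation masks are omitted): the hidden states of the previous segment are cached, detached from the graph, and prepended to the keys/values of the current segment.

import torch

def attend_with_memory(h, mem, w_q, w_k, w_v):
    # h: (T, d) current-segment states; mem: (M, d) cached states of the previous segment
    kv_input = torch.cat([mem.detach(), h], dim=0)     # (M + T, d); no gradient flows into the cache
    q = h @ w_q                                        # queries only for the current segment
    k = kv_input @ w_k
    v = kv_input @ w_v
    attn = torch.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)
    out = attn @ v                                     # (T, d)
    new_mem = h.detach()                               # cache for the next segment
    return out, new_mem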

5. Modeling Multiple Segments

Relative Segment Encodings

Given a pair of positions i and j in the sequence, if i and j are from the same segment, we use a segment encoding s_ij = s_+, and otherwise s_ij = s_-, where s_+ and s_- are learnable model parameters for each attention head. When i attends to j, the segment encoding s_ij is used to compute an attention weight a_ij = (q_i + b)^T s_ij, where q_i is the query vector as in a standard attention operation and b is a learnable head-specific bias vector. Finally, the value a_ij is added to the normal attention weight.
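
A small sketch of how this bias could be computed for one attention head (my own illustration; the parameter names s_same / s_diff are made up, and in the real model this happens inside the relative-attention computation):

import torch

def relative_segment_bias(q, seg_ids, s_same, s_diff, b):
    # q: (T, d) query vectors; seg_ids: (T,) segment id of each position
    # s_same, s_diff: (d,) learnable encodings s_+ and s_-; b: (d,) head-specific bias
    same = seg_ids.unsqueeze(0) == seg_ids.unsqueeze(1)     # (T, T): True where i and j share a segment
    s = torch.where(same.unsqueeze(-1), s_same, s_diff)     # (T, T, d): s_ij
    a = torch.einsum("id,ijd->ij", q + b, s)                # a_ij = (q_i + b)^T s_ij
    return a                                                # added to the normal attention score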

paper · XLNet:运行机制及和Bert的异同比较 ("XLNet: how it works and how it differs from BERT", a Chinese blog post) · slides

[Paper Notes][CoRR-2019]A Mutual Information Maximization Perspective of Language Representation Learning

Brief summary: the paper argues that most language representation methods (Skip-gram, BERT, XLNet, etc.) can be viewed as maximizing the mutual information between a local view of a word sequence (a masked word, or an n-gram together with negative samples) and a global view (the sentence containing the masked word), and it proposes an improvement in which the negative samples are drawn from the local views.

I. Introduction

The paper provides an alternative view and shows that these methods also maximize a lower bound on the mutual information between different parts of a word sequence.

II. Mutual information maximization

1. Mutual information

2. InfoNCE
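
For reference, the standard InfoNCE lower bound (written from memory; the paper's notation may differ): given a scoring function f_θ and a candidate set B~ containing the positive sample b plus negative samples,

I(A; B)  >=  E_{p(A,B)} [ f_θ(a, b) - log Σ_{b' ∈ B~} exp f_θ(a, b') ] + log |B~|

so maximizing this contrastive (cross-entropy-style) objective maximizes a lower bound on the mutual information between the two views.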

III. Mutual information maximization view of existing MODELS

1. SKIP-GRAM

2. BERT

3. XLNET

4. INFOWORD

4.1 The mutual information between the global representation and the local representation

4.2 The objective function of InfoWord

IV. Experiments

1. Results

2. Discussion

  • Span-based models

J_DIM is related to span-based models such as SpanBERT and MASS.

  • Mutual information maximization

InfoNCE is widely accepted as a good representation learning objective

  • Regularization

Our analysis and the connection we draw to representation learning methods used in other domains provide an insight into possible ways to incorporate prior knowledge into language representation learning models (e.g., adding a regularization term to the objective function to encode prior knowledge; recall the relationship between regularizers and priors).
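
As a reminder of the regularization-prior connection mentioned above: MAP estimation with a Gaussian prior p(θ) = N(0, σ² I) gives

argmax_θ  log p(D | θ) + log p(θ)  =  argmax_θ  log p(D | θ) - (1 / 2σ²) ||θ||²_2 + const

i.e., an L2 regularizer is equivalent to placing a Gaussian prior on the parameters.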

paper

[Paper Notes][EMNLP-2019]Patient Knowledge Distillation for BERT Model Compression

I. Contributions

Two different strategies:

  • (i) PKD-Last: the student learns from the last k layers of the teacher, under the assumption that the top layers of the original network contain the most informative knowledge to teach the student;
  • (ii) PKD-Skip: the student learns from every k-th layer of the teacher, suggesting that the lower layers of the teacher network also contain important information and should be passed along for incremental distillation (a toy illustration of the two layer mappings follows this list).
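
A toy illustration of the two layer mappings, assuming a 12-layer teacher and a 6-layer student; the exact index sets in the paper depend on how the top layer (already covered by the output-level KD loss) is handled, so treat this only as a sketch of the idea:

def teacher_layers(n_teacher=12, n_student=6, strategy="skip"):
    if strategy == "last":
        # PKD-Last: match the student against the last n_student teacher layers
        return list(range(n_teacher - n_student + 1, n_teacher + 1))
    if strategy == "skip":
        # PKD-Skip: match the student against every (n_teacher // n_student)-th teacher layer
        step = n_teacher // n_student
        return list(range(step, n_teacher + 1, step))
    raise ValueError(f"unknown strategy: {strategy}")

print(teacher_layers(strategy="last"))  # [7, 8, 9, 10, 11, 12]
print(teacher_layers(strategy="skip"))  # [2, 4, 6, 8, 10, 12]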

II. Related Work

Language Model Pre-training

  • (i) feature-based approach:
    • context-independent word representations (e.g., word2vec, GloVe, FastText)
    • sentence-level representations
    • contextualized word representations (e.g., CoVe, ELMo)
  • (ii) fine-tuning approach (e.g., GPT, BERT)

Model Compression & Knowledge Distillation

  • high degree of parameter redundancy: network pruning, weight quantization
  • compress a network with a large set of parameters into a compact and fast-to-execute model: knowledge distillation

III. Patient Knowledge Distillation

Distillation Objective

The plain KD objective above tends to overfit, so the patient knowledge distillation (PKD) objective below is used instead; both terms are sketched below.
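
For concreteness, a standard soft-label distillation term looks roughly like this (my own sketch; the paper's exact temperature and its weighting against the hard-label cross-entropy may differ):

import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, T=1.0):
    # Cross-entropy between the teacher's and the student's temperature-scaled output distributions
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()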

Patient Teacher for Model Compression
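
The patient term additionally matches intermediate representations. A rough sketch, assuming student_cls / teacher_cls are lists of [CLS] hidden states from the matched layer pairs (normalized MSE is how I read the paper's PT loss; see the paper for the exact formulation and loss weights):

import torch.nn.functional as F

def patient_loss(student_cls, teacher_cls):
    # Mean-squared distance between the L2-normalized [CLS] states of matched layers
    loss = 0.0
    for hs, ht in zip(student_cls, teacher_cls):
        loss = loss + F.mse_loss(F.normalize(hs, dim=-1), F.normalize(ht, dim=-1))
    return loss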

IV. Experiments

paper

Commonly used shell scripts

1. Start Jupyter (can be configured to start automatically on boot)

#!/bin/bash
origin_status=`ps -ef | grep -w jupyter | wc -l` # grep -w matches whole words only; otherwise the count differs between running this from a script and from an interactive shell.
if [ $origin_status -eq 1 ]
then
        current_time=`date "+%Y%m%d%H%M%S"`
        nohup jupyter notebook >jupyter.log_$current_time 2>&1 &
        now_status=`ps -ef | grep -w  jupyter | wc -l`
        if [ $now_status -eq 2 ]
        then
                echo "succeed in opening jupyter notebook at $current_time!"
        else
                echo "fail to open jupyter notebook."
        fi
else
        current_run=`expr $origin_status - 1`
        echo "$current_run jupyter notebook is open."
fi
To have it start automatically on boot:
1. Edit the file
sudo vim /etc/rc.local
and add the line: su ubuntu -c /home/ubuntu/open_jupyter.sh (so the script runs as the ubuntu user)

2. For Ubuntu 18.04, see https://zhuanlan.zhihu.com/p/63507762

3. Notes
3.1 Use absolute paths for every command in the script, because the boot-time script is executed as root.
For example, in this script:
nohup /home/ubuntu/anaconda3/bin/jupyter notebook >/home/ubuntu/jupyter.log_$current_time 2>&1 &

3.2 Edit ~/.jupyter/jupyter_notebook_config.py:
c.NotebookApp.allow_root = True
c.NotebookApp.notebook_dir = '/home/ubuntu'

2. Log in to a remote server automatically (requires expect: brew install expect)

#!/usr/bin/expect
# set timeout 10  (optional: how long to wait for the expected prompt)
spawn ssh hostname@host
expect "*password*"    ;# use double quotes: single quotes are not special in Tcl
send "host pwd\r"      ;# replace "host pwd" with the account password
interact
A second, more involved example that takes arguments and launches a remote finetuning job:
#!/usr/bin/expect
# In expect/Tcl, command-line arguments are read with [lindex $argv n], not $1 as in bash
set is_first [lindex $argv 0]
set which_model [lindex $argv 1]
set which_ckpt [lindex $argv 2]
# Note: in Tcl the opening brace of the body must stay on the same line as the if
if { $is_first == 1 } {
        spawn ssh -i ~/Downloads/gongel_new.pem ubuntu@ec2-54-224-75-110.compute-1.amazonaws.com
        #172.31.45.39
        expect "Last login"
        send "rm -rf private_bert;sudo mount -t nfs -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-151b9f95.efs.us-east-1.amazonaws.com:\/   ~\/efs-mount-point;git clone -b adv_mask_bert https:\/\/github.com\/gongel\/private_bert.git\r"
        expect "Username"
        send "@qq.com\r"
        expect "Password"
        send "\r"
        expect "Resolving deltas: 100%"
        send "cd private_bert;nohup bash helper_finetune_script $which_model $which_ckpt \&\r"
        expect "*nohup.out*"
        send "\rexit\r"
        interact
} else {
        spawn ssh -i ~/Downloads/gongel_new.pem ubuntu@ec2-54-224-75-110.compute-1.amazonaws.com
        expect "Last login"
        send "cd  private_bert\r"
        interact
}

3. Timing a script

# Method 1
#!/bin/bash

startTime=`date +%Y%m%d-%H:%M:%S`
startTime_s=`date +%s`

# ... the commands you want to time go here ...

endTime=`date +%Y%m%d-%H:%M:%S`
endTime_s=`date +%s`

sumTime=$(( endTime_s - startTime_s ))

echo "$startTime ---> $endTime" "Total:$sumTime seconds"
# Method 2
time bash xxx.sh
# Prints three timing values:
# real  total wall-clock time of the command, including user and sys plus IO waits, time-slice switches, etc.
# user  CPU time spent in user mode (outside the kernel), excluding IO waits
# sys   CPU time spent inside the kernel, excluding IO waits and time-slice switching overhead

4. Check jps on all cluster nodes (passwordless SSH assumed)

hosts=('master' 'slave1' 'slave2')
for host in ${hosts[@]}
do
        echo "###### ${host} jps #####"
        ssh ${host} jps  # no need to run 'exit' command
done