[Paper notes][ICLR-2020] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Overview

Generator

The generator learns to predict the original identities of the masked-out tokens.

Discriminator

The discriminator is trained to distinguish tokens in the data from tokens that have been replaced by generator samples.

Loss

  • x^masked: the input with the tokens at the selected positions replaced by a [MASK] token.
  • x^corrupt: the input with the masked-out positions filled in by samples from the generator; the discriminator is trained to predict which tokens in x^corrupt match the original input x.
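
As a reminder, my reconstruction of the combined objective from the paper (m is the set of masked-out positions, n the input length, and λ the weight on the discriminator loss, set to 50 in ELECTRA):

    \mathcal{L}_{\mathrm{MLM}}(x,\theta_G) = \mathbb{E}\Big[\textstyle\sum_{i \in m} -\log p_G\big(x_i \mid x^{\mathrm{masked}}\big)\Big]

    \mathcal{L}_{\mathrm{Disc}}(x,\theta_D) = \mathbb{E}\Big[\textstyle\sum_{t=1}^{n} -\mathbb{1}\big(x_t^{\mathrm{corrupt}} = x_t\big)\log D\big(x^{\mathrm{corrupt}}, t\big) - \mathbb{1}\big(x_t^{\mathrm{corrupt}} \neq x_t\big)\log\big(1 - D\big(x^{\mathrm{corrupt}}, t\big)\big)\Big]

    \min_{\theta_G,\theta_D} \sum_{x \in \mathcal{X}} \mathcal{L}_{\mathrm{MLM}}(x,\theta_G) + \lambda\,\mathcal{L}_{\mathrm{Disc}}(x,\theta_D)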

paper

New Word / Buzzword Discovery

1. PMI (Pointwise Mutual Information)

PMI measures how often two words co-occur in the text relative to how often they would co-occur by chance; the higher the score, the more strongly the two words are associated and the more likely they form a new word.

  • PMI > 0: the two words are related; the larger the value, the stronger the association.
  • PMI = 0: the two words are statistically independent, neither related nor mutually exclusive.
  • PMI < 0: the two words are unrelated or mutually exclusive.

For example, suppose that in a corpus "深度学习" (deep learning) appears 10 times, "深度" (deep) appears 15 times, and "学习" (learning) appears 20 times. With N denoting the total number of words in the corpus, the pointwise mutual information of "深度学习" over "深度" and "学习" is PMI = log( p(深度学习) / (p(深度) · p(学习)) ) = log( (10/N) / ((15/N) · (20/N)) ) = log( 10N / 300 ).
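
A minimal sketch of this computation in Python (the function name and the corpus size N below are illustrative, not from the original note):

import math

def pmi(pair_count, left_count, right_count, total_words):
    # PMI = log2( p(pair) / (p(left) * p(right)) )
    p_pair = pair_count / total_words
    p_left = left_count / total_words
    p_right = right_count / total_words
    return math.log2(p_pair / (p_left * p_right))

# the example above, with a hypothetical corpus of N = 1,000,000 words
print(pmi(10, 15, 20, 1_000_000))  # log2(10 * N / (15 * 20)) ≈ 15.0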

2. Entropy

A real word should be usable in many different contexts, so we look at how varied its left and right neighbors are: the richer the set of neighboring words, the more likely the candidate is a genuine word. Concretely, we compute the entropy H = -Σ_x p(x) log p(x) of the left-neighbor and right-neighbor distributions, where x ranges over the words appearing to the left (or right) of the candidate. The larger the left and right entropies, the richer the candidate's surrounding context, the more freely it can be used, and the more likely it is an independent word.

In Renren (人人网) user status updates, "被子" (quilt) appears 956 times in total and "辈子" (lifetime) appears 2330 times. The left neighbors of "被子" are very diverse: the most frequent is "晒被子", which appears 162 times; next is "的被子" with 85 occurrences; then "条被子", "在被子", and "床被子" with 69, 64, and 52 occurrences respectively; and there is a long tail of more than 100 other usages such as "叠被子", "盖被子", "加被子", "新被子", "掀被子", "收被子", "薄被子", "踢被子", and "抢被子". The entropy of its left neighbors is 3.67453. The left neighbors of "辈子", by contrast, are much poorer: of its 2330 occurrences, 1276 are "一辈子", 596 are "这辈子", 235 are "下辈子", 149 are "上辈子", 32 are "半辈子", 10 are "八辈子", 7 are "几辈子", 6 are "哪辈子", plus 13 rarer usages such as "n辈子" and "两辈子". The entropy of its left neighbors is only 1.25963. Whether "辈子" should count as a word is therefore clearly debatable.
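
A minimal sketch of the neighbor-entropy computation (natural log, matching the figures above; the counts for "辈子" are the partial ones quoted in the example, so the long tail is omitted):

import math
from collections import Counter

def neighbor_entropy(neighbor_counts):
    # H = -sum_x p(x) * log p(x) over the left (or right) neighboring words
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total) for c in neighbor_counts.values())

# left neighbors of "辈子" from the example above (long tail omitted)
left_of_beizi = Counter({"一": 1276, "这": 596, "下": 235, "上": 149,
                         "半": 32, "八": 10, "几": 7, "哪": 6})
print(neighbor_entropy(left_of_beizi))  # ≈ 1.20; the full distribution gives 1.25963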

3. n-gram based

4. References

nohup jobs exiting on their own

Both times I pretrained BERT I launched the job with nohup, and both times it exited on its own after about ten hours. Judging from the timing, I suspected the cause was not exiting the server session properly; I looked into it today and that is indeed a possibility, so I am now running a comparison experiment on the server.

The correct way to use nohup is:

1. Launch the job (nohup command &), then press Enter to dismiss the nohup prompt.
2. Run exit to log out of the current account normally.
3. Reconnect to the terminal afterwards; the program keeps running in the background.

Also, consider using tmux instead of nohup.

Pretraining record

  1. Distributed training with Horovod
  2. BERT base: 32 × 16 GB GPUs (Tesla V100-SXM2); BERT large: 64 × 16 GB GPUs (Tesla V100-SXM2)
  3. Stage-one:
    --do_train=True \
    --do_eval=True \
    --bert_config_file=bert_base/bert_config.json \
    --train_batch_size=1024 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --num_train_steps=900000 \
    --num_warmup_steps=10000 \
    --learning_rate=1e-4 \
    --save_checkpoints_steps=50000

    Stage-two:

    --do_train=True \
    --do_eval=True \
    --bert_config_file=bert_base/bert_config.json \
    --init_checkpoint=stage-one/model.ckpt-450000 \
    --train_batch_size=256 \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --num_train_steps=100000 \
    --num_warmup_steps=5000 \
    --learning_rate=1e-4 \
    --save_checkpoints_steps=5000

  4. global step/sec = 1.3-1.9

ALBERT

  1. Gradient accumulation is equivalent to enlarging the batch size by the same factor; to get the intended effect, the number of training steps should be scaled up accordingly.
  2. Code, adapted from Pascal's answer on Zhihu: https://www.zhihu.com/question/303070254/answer/573037166

import numpy as np
import torch

# Accumulate gradients over `accumulation_steps` mini-batches before each
# optimizer update, emulating a batch size `accumulation_steps` times larger.
for i, (images, target) in enumerate(train_loader):
    # 1. forward pass
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)

    # 2.1 scale the loss so the accumulated gradient averages over the
    #     larger effective batch
    loss = loss / accumulation_steps
    # 2.2 back propagation; gradients accumulate in .grad across iterations
    loss.backward()

    # 3. update parameters once every `accumulation_steps` mini-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()        # apply the accumulated gradients
        optimizer.zero_grad()   # reset gradients for the next window

[Paper notes][ICLR-2020] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1. Factorized embedding parameterization

  • By using this decomposition (first projecting the one-hot token vectors into a lower-dimensional embedding space of size E, then projecting into the hidden space of size H, with H >> E), the embedding parameters are reduced from O(V × H) to O(V × E + E × H).
  • For example, with a vocabulary of size 30000, E = 128 and H = 768, the reduction is 30000 × 768 - (30000 × 128 + 128 × 768) = 19,101,696 ≈ 19M parameters (BERT base has about 108M parameters in total); see the sketch below.
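
A minimal PyTorch sketch of the factorized embedding, assuming V = 30000, E = 128, H = 768 (variable names are illustrative):

import torch.nn as nn

V, E, H = 30000, 128, 768  # vocab size, embedding size, hidden size

# Factorized embedding: a V x E lookup table followed by an E x H projection,
# instead of a single V x H embedding matrix.
token_embedding = nn.Embedding(V, E)                # 30000 * 128 parameters
embedding_projection = nn.Linear(E, H, bias=False)  # 128 * 768 parameters

def embed(input_ids):
    return embedding_projection(token_embedding(input_ids))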

2. Cross-layer parameter sharing (12/24 layers)

  • only sharing feed-forward network (FFN) parameters across layers
  • only sharing attention parameters
  • sharing all parameters across layers (the default choice for ALBERT; see the sketch below)
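
A minimal sketch of the all-parameter sharing variant: a single encoder layer is reused at every depth (the layer hyperparameters below are illustrative, not ALBERT's exact configuration):

import torch.nn as nn

# One transformer layer whose parameters are shared across all depths.
shared_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, batch_first=True)

def encode(hidden_states, num_layers=12):
    # Applying the same layer num_layers times realizes cross-layer sharing.
    for _ in range(num_layers):
        hidden_states = shared_layer(hidden_states)
    return hidden_states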

3. Inter-sentence coherence loss (sentence-order prediction, SOP)

  • NSP: positive examples are created by taking consecutive segments from the same document; negative examples are created by pairing segments from different documents; positive and negative examples are sampled with equal probability.
  • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn compared to coherence prediction.
  • SOP: positive examples are constructed the same way as in BERT (two consecutive segments from the same document); negative examples use the same two consecutive segments but with their order swapped (see the sketch below).
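
A minimal sketch of building SOP training pairs from two consecutive segments of one document (function and variable names are illustrative):

import random

def make_sop_example(segment_a, segment_b):
    # segment_a immediately precedes segment_b in the source document
    if random.random() < 0.5:
        return (segment_a, segment_b), 1  # positive: original order
    return (segment_b, segment_a), 0      # negative: order swapped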

4. Questions

  • Why is NSP ineffective?
    • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn than coherence prediction, and it also overlaps more with what is learned by the MLM loss.
    • SOP avoids topic prediction and instead focuses on modeling inter-sentence coherence.
  • Masking strategy?
    • n-gram masking, where the length n of each masked n-gram is drawn with probability p(n) = (1/n) / Σ_{k=1}^{N} (1/k) and the maximum length is N = 3 (see the sketch below)
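
A minimal sketch of sampling n-gram lengths with p(n) proportional to 1/n, assuming a maximum length of 3:

import random

def sample_ngram_length(max_n=3):
    # p(n) = (1/n) / sum_{k=1..max_n} (1/k)
    weights = [1.0 / n for n in range(1, max_n + 1)]
    return random.choices(range(1, max_n + 1), weights=weights, k=1)[0]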

5. Refs