【ICLR-2020】ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators



The generator then learns to predict the original identities of the masked-out tokens.


The discriminator is trained to distinguish tokens in the data from tokens that have been replaced by generator samples.


  • x^masked: the tokens at the selected positions are replaced with a [MASK] token.
  • x^corrupt: the masked-out tokens are replaced with generator samples; the discriminator is then trained to predict which tokens in x^corrupt match the original input x.
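The construction of x^masked and x^corrupt can be sketched as follows. This is only an illustration: the "generator" is a stand-in that samples uniformly from a toy vocabulary (a real generator is a small masked language model), and the function name, the 15% masking rate and the label convention (1 = replaced, 0 = original) are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def make_electra_example(tokens, vocab, mask_prob=0.15, rng=None):
    """Build x^masked, x^corrupt and per-token discriminator labels.

    Hypothetical helper: the generator is replaced by uniform sampling
    from `vocab`, which is NOT how a trained generator behaves.
    """
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))

    x_masked = list(tokens)
    x_corrupt = list(tokens)
    for i in positions:
        x_masked[i] = MASK                 # input seen by the generator
        x_corrupt[i] = rng.choice(vocab)   # placeholder for a generator sample

    # Discriminator target for every position: 1 = replaced, 0 = original.
    # A sample that happens to equal the original token counts as original.
    labels = [int(c != t) for c, t in zip(x_corrupt, tokens)]
    return x_masked, x_corrupt, labels, positions

tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "meal", "ate"]
x_masked, x_corrupt, labels, positions = make_electra_example(tokens, vocab)
```

Note that the discriminator receives a label for every token of x^corrupt, not only for the masked positions.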



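  • Rule-based
    • Rule-based methods build a rule base, a domain lexicon, or a pattern library from the word-formation and surface features of new words, then discover new words by rule matching. The main drawbacks are that they are tied to a specific domain and that the rule base has to be built first.
  • Statistics-based
    • Statistics-based methods generally use a statistical strategy to extract candidate strings and then use linguistic knowledge to filter out junk strings that are not new words. Alternatively, they compute an association score and look for the character combinations with the highest association. Statistical methods are limited to finding relatively short new words.
  • Hybrid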
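As an illustration of the statistics-based idea (score how strongly adjacent characters are associated and keep the strongest combinations), here is a minimal sketch that uses pointwise mutual information (PMI) as the association score. PMI is one common choice and is an assumption here; the notes do not name a specific measure. The function name, `min_count` and `top_k` are also made up for the example.

```python
import math
from collections import Counter

def top_bigrams_by_pmi(corpus, min_count=2, top_k=5):
    """Rank adjacent character pairs by PMI = log(p(xy) / (p(x) * p(y)))."""
    chars = Counter()
    pairs = Counter()
    for sentence in corpus:
        chars.update(sentence)
        pairs.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    scored = []
    for pair, count in pairs.items():
        if count < min_count:          # drop rare pairs, their PMI is unreliable
            continue
        p_xy = count / n_pairs
        p_x = chars[pair[0]] / n_chars
        p_y = chars[pair[1]] / n_chars
        scored.append((math.log(p_xy / (p_x * p_y)), pair))

    scored.sort(reverse=True)
    return [(pair, score) for score, pair in scored[:top_k]]

candidates = top_bigrams_by_pmi(["abcd", "abdc", "cabd"])
```

The surviving candidates would still need the linguistic filtering step described above to remove strings that are not real words.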



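Both times I pretrained BERT, the job had been launched as a nohup task, yet it exited on its own after about ten hours. Judging from the timestamps, I suspected the cause was that I had not run `exit` to leave the server properly. I looked into it today and this is indeed possible; I am now running a comparison experiment on the server.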



Pretraining record

  1. Horovod distributed training
  2. BERT base: 32 × 16 GB GPUs (Tesla V100-SXM2); BERT large: 64 × 16 GB GPUs (Tesla V100-SXM2)
  3. Stage-one:


    4. global step/sec=1.3-1.9


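  1. Gradient accumulation is equivalent to enlarging the batch size by the same factor; for better results, the number of training steps should be increased by that factor as well.
  2. Code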
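A minimal, framework-free sketch of why gradient accumulation behaves like a larger batch. It is not the Horovod or BERT training code; the 1-D linear model, the function names and the toy data are assumptions made for illustration. Averaging the gradients of equally sized micro-batches gives the same value as one gradient over the combined batch.

```python
def mse_grad(w, batch):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Accumulate gradients over equally sized micro-batches."""
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    accum_steps = len(micro_batches)
    grad = 0.0
    for micro_batch in micro_batches:
        # Scale each contribution so the running sum ends up as a mean.
        grad += mse_grad(w, micro_batch) / accum_steps
    return grad

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
full_batch = mse_grad(0.5, data)                               # one batch of 4
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)  # 2 micro-batches of 2
```

Here one weight update consumes two forward/backward passes, so a step counter that counts passes rather than updates has to be scaled by the accumulation factor to keep the same number of updates.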

【ICLR-2020】ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1. Factorized embedding parameterization

  • By using this decomposition (first projecting the one-hot vectors into a lower-dimensional embedding space of size E, then projecting into the hidden space of size H, with H >> E), the embedding parameters are reduced from O(V × H) to O(V × E + E × H).
  • For example, with a vocabulary of size V = 30000, H = 768 and E = 128, the number of parameters saved is 30000*768 - (30000*128 + 128*768) = 19101696 ≈ 19M, compared with about 108M parameters in BERT base.
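The arithmetic in the example can be checked with a short script. The helper name is made up for illustration; the sizes V = 30000, H = 768 and E = 128 are the ones used in the example.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters, with or without factorization."""
    if embedding_size is None:
        # Single V x H embedding matrix.
        return vocab_size * hidden_size
    # V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size

V, H, E = 30000, 768, 128
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
saved = full - factorized
print(full, factorized, saved)
```

The saving grows with the gap between H and E, which is why the factorization matters most for models with a large hidden size.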

2. Cross-layer parameter sharing (12/24 layers)

  • only sharing feed-forward network (FFN) parameters across layers
  • only sharing attention parameters
  • sharing all parameters across layers (the default decision for ALBERT)
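To get a feel for the three sharing options, here is a rough parameter count for the encoder stack. It is a sketch under stated assumptions, not ALBERT's implementation: each layer is taken to be a standard Transformer encoder layer (four H × H attention projections with biases, a feed-forward block with inner size 4H, and two layer norms), and layer-norm parameters are treated as unshared unless everything is shared. The function names are hypothetical.

```python
def encoder_layer_params(hidden):
    """Parameter groups of one standard Transformer encoder layer (assumed layout)."""
    ffn = 4 * hidden
    return {
        "attention": 4 * (hidden * hidden + hidden),                  # Q, K, V, output
        "ffn": (hidden * ffn + ffn) + (ffn * hidden + hidden),        # two dense layers
        "layer_norm": 2 * (2 * hidden),                               # scale and bias, twice
    }

def encoder_params(hidden, num_layers, shared=()):
    """Total encoder parameters when the groups named in `shared` are stored once."""
    groups = encoder_layer_params(hidden)
    return sum(count * (1 if name in shared else num_layers)
               for name, count in groups.items())

H, LAYERS = 768, 12
no_sharing = encoder_params(H, LAYERS)
share_ffn = encoder_params(H, LAYERS, shared=("ffn",))
share_attention = encoder_params(H, LAYERS, shared=("attention",))
share_all = encoder_params(H, LAYERS, shared=("attention", "ffn", "layer_norm"))
```

With all groups shared, the stack stores the parameters of a single layer no matter how many layers are applied.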

3. Inter-sentence coherence loss (sentence order prediction, SOP)

  • NSP: positive examples are created by taking consecutive segments from the same document; negative examples are created by pairing segments from different documents. Positive and negative examples are sampled with equal probability.
  • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn than coherence prediction.
  • SOP: positive examples are built the same way as in BERT (two consecutive segments from the same document); negative examples use the same two consecutive segments with their order swapped.
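The difference between the two objectives is easiest to see in how training pairs are built. The sketch below is illustrative only: documents are plain lists of segment strings, the function names are made up, and the 50/50 split for SOP is an assumption carried over from the NSP description.

```python
import random

def make_nsp_example(doc, other_doc, rng):
    """NSP: the negative example takes its second segment from another document."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return doc[i], doc[i + 1], 1              # consecutive segments, same document
    return doc[i], rng.choice(other_doc), 0       # segments from different documents

def make_sop_example(doc, rng):
    """SOP: the negative example is the same consecutive pair in swapped order."""
    i = rng.randrange(len(doc) - 1)
    first, second = doc[i], doc[i + 1]
    if rng.random() < 0.5:
        return first, second, 1                   # original order
    return second, first, 0                       # same segments, order swapped

rng = random.Random(0)
doc = ["s0", "s1", "s2", "s3"]
other_doc = ["t0", "t1"]
nsp_pair = make_nsp_example(doc, other_doc, rng)
sop_pair = make_sop_example(doc, rng)
```

Because both SOP segments always come from the same document, the topic is identical in positive and negative pairs, so the label can only be predicted from the order of the segments.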

4. Question

  • Why is NSP ineffective?
    • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn than coherence prediction, and it also overlaps more with what is learned through the MLM loss.
    • SOP avoids topic prediction and instead focuses on modeling inter-sentence coherence

5. Refs