Both BERT pretraining runs were launched as nohup tasks but exited automatically after about ten hours. Based on the timing, I suspected this happened because I had not properly exited the server session; I looked into it today and confirmed this is indeed possible, so I am now running a comparison experiment on the server.



Pretraining BERT record

  1. Horovod分布式训练
  2. BERT base: 32*16G GPUs (Tesla V100-SXM2); BERT large: 64*16G GPUs (Tesla V100-SXM2)
  3. Stage one:
    • global step/sec = 1.3-1.9

【ICLR-2020】ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1. Factorized embedding parameterization

  • By using this decomposition (first projecting into a lower-dimensional embedding space of size E, then projecting up to the hidden space of size H, with H >> E), the embedding parameters are reduced from O(V × H) to O(V × E + E × H).
  • For example, with a vocabulary of size 30000, E=128, and H=768: reduced parameters = 30000*768 - (30000*128 + 128*768) = 19101696 ≈ 19M, while BERT-base has about 108M parameters in total.
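As a quick sanity check, the parameter arithmetic above can be reproduced in a short sketch (variable names are illustrative):

```python
# Parameter count of a standard embedding table vs ALBERT's factorized
# embedding (a V x E lookup followed by an E x H projection).
V, E, H = 30000, 128, 768  # vocab size, embedding size, hidden size

standard_params = V * H              # O(V x H)
factorized_params = V * E + E * H    # O(V x E + E x H)
saved = standard_params - factorized_params

print(standard_params)    # 23040000
print(factorized_params)  # 3938304
print(saved)              # 19101696, i.e. ~19M parameters saved
```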

2. Cross-layer parameter sharing (12/24 layers)

  • only sharing feed-forward network (FFN) parameters across layers
  • only sharing attention parameters
  • sharing all parameters across layers (the default choice for ALBERT)
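A minimal sketch of why full sharing helps (plain Python; the per-layer count below ignores biases and layer norms, so it is a rough approximation, not ALBERT's exact parameter count):

```python
def layer_param_count(hidden=768, ffn=3072):
    """Rough per-layer parameter count: attention (4 H x H projections
    for Q, K, V, and output) plus FFN (H x 4H and 4H x H)."""
    attention = 4 * hidden * hidden
    feed_forward = 2 * hidden * ffn
    return attention + feed_forward

layers = 12
unshared = layers * layer_param_count()  # BERT-style: 12 independent layers
shared = layer_param_count()             # ALBERT-style: one layer reused 12 times

print(unshared)  # 84934656
print(shared)    # 7077888 -> a 12x reduction in encoder parameters
```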

3. Inter-sentence coherence loss (sentence-order prediction, SOP)

  • NSP: positive examples are created by taking consecutive segments from the same document; negative examples are created by pairing segments from different documents; positive and negative examples are sampled with equal probability.
  • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn compared to coherence prediction.
  • SOP: positive examples use the same technique as BERT's NSP; negative examples take the same two consecutive segments but with their order swapped.
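The contrast between the two objectives comes down to how training pairs are constructed; a minimal sketch with hypothetical helpers (not the authors' code):

```python
import random

def make_nsp_example(doc, other_doc, rng=random):
    """NSP: positive = two consecutive segments from the same document;
    negative = a segment paired with one from a different document."""
    a, b = doc[0], doc[1]
    if rng.random() < 0.5:
        return (a, b), 1                  # consecutive -> IsNext
    return (a, rng.choice(other_doc)), 0  # cross-document -> NotNext

def make_sop_example(doc, rng=random):
    """SOP: positive = two consecutive segments in order;
    negative = the same two segments with their order swapped."""
    a, b = doc[0], doc[1]
    if rng.random() < 0.5:
        return (a, b), 1                  # correct order
    return (b, a), 0                      # swapped order

doc = ["sentence one.", "sentence two."]
other = ["an unrelated sentence."]
print(make_sop_example(doc))
```

Note that the SOP negative keeps both segments on-topic (same document), so the model cannot fall back on topic cues and must learn coherence.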

4. Question

  • Why is NSP ineffective?
    • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn than coherence prediction, and it also overlaps more with what is learned from the MLM loss.
    • SOP avoids topic prediction and instead focuses on modeling inter-sentence coherence.

【CCL-2019】How to Fine-Tune BERT for Text Classification?

1. Contributions

  • We propose a general solution to fine-tune the pre-trained BERT model, which includes three steps:
    • (1) further pre-train BERT on within-task training data or in-domain data;
    • (2) optionally fine-tune BERT with multi-task learning if several related tasks are available;
    • (3) fine-tune BERT for the target task.
  • We also investigate fine-tuning methods for BERT on the target task, including preprocessing of long texts, layer selection, layer-wise learning rates, catastrophic forgetting, and low-shot learning.
  • We achieve new state-of-the-art results on seven widely-studied English text classification datasets and one Chinese news classification dataset.

2. Methodology

  • Fine-Tuning Strategies
    • Dealing with long texts: head+tail (empirically, keeping the first 128 and the last 382 tokens) performs best
    • Features from different layers
    • Catastrophic forgetting: a lower learning rate helps to overcome it
    • Layer-wise decreasing learning rate
  • Further Pre-training
    • Within-task and in-domain further pre-training (the seven English datasets are partitioned into three domains: topic, sentiment, and question) can significantly boost performance
    • A preceding multi-task fine-tuning also helps the single-task fine-tuning, but its benefit is smaller than that of further pre-training
    • Cross-domain further pre-training does not bring an obvious benefit in general, which is reasonable since BERT is already trained on a general-domain corpus
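Two of the fine-tuning strategies above (head+tail truncation and the layer-wise decreasing learning rate) translate directly into code; a minimal sketch with hypothetical helper names (510 = 512 minus the [CLS] and [SEP] positions; 0.95 is used here as an illustrative decay factor):

```python
def head_tail_truncate(tokens, head=128, tail=382, max_len=510):
    """Keep the first `head` and last `tail` tokens of a long input
    (510 = 512 minus room for [CLS] and [SEP])."""
    if len(tokens) <= max_len:
        return tokens
    return tokens[:head] + tokens[-tail:]

def layerwise_lrs(base_lr=2e-5, num_layers=12, decay=0.95):
    """Layer-wise decreasing learning rate: the top layer uses base_lr,
    and each layer below it multiplies the rate by `decay` once more."""
    return {layer: base_lr * decay ** (num_layers - 1 - layer)
            for layer in range(num_layers)}

print(len(head_tail_truncate(list(range(1000)))))  # 510
```

Lower learning rates for lower layers keep the general-purpose features near the input relatively stable, which is also the intuition behind using a small overall rate to avoid catastrophic forgetting.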