【ICLR-2020】ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

1. Factorized embedding parameterization

  • By decomposing the embedding (first projecting tokens into a lower-dimensional embedding space of size E, and then projecting into the hidden space of size H, with H >> E), the embedding parameters are reduced from O(V × H) to O(V × E + E × H).
  • For example, with a vocabulary of size 30000, H = 768, and E = 128: parameters saved = 30000×768 − (30000×128 + 128×768) = 19,101,696 ≈ 19M (BERT-base has ~108M parameters in total).
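The arithmetic above can be checked directly (variable names are illustrative):

```python
# Embedding-table parameter counts, with and without ALBERT's factorization.
V = 30_000   # vocabulary size
H = 768      # hidden size (BERT-base)
E = 128      # factorized embedding size (ALBERT)

bert_embed = V * H               # O(V*H): lookup straight into hidden space
albert_embed = V * E + E * H     # O(V*E + E*H): two-step projection

saved = bert_embed - albert_embed
print(bert_embed, albert_embed, saved)  # 23040000 3938304 19101696
```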

2. Cross-layer parameter sharing(12/24 layers)

  • only sharing feed-forward network (FFN) parameters across layers
  • only sharing attention parameters
  • sharing all parameters across layers(the default decision for ALBERT)
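The effect of sharing all parameters across layers can be illustrated with a minimal sketch (class names and the per-layer parameter count are assumptions, not ALBERT's actual implementation):

```python
class TransformerLayer:
    """Stand-in for one Transformer block (attention + FFN weights)."""
    def __init__(self):
        self.n_params = 7_000_000  # roughly one BERT-base layer, for illustration

    def __call__(self, x):
        return x  # real attention/FFN computation omitted

def count_params(layers):
    # Shared layers hold the same weight objects, so deduplicate by identity.
    return sum(l.n_params for l in {id(l): l for l in layers}.values())

shared = [TransformerLayer()] * 12                   # ALBERT: one layer reused 12x
unshared = [TransformerLayer() for _ in range(12)]   # BERT: 12 distinct layers

print(count_params(shared))    # 7000000
print(count_params(unshared))  # 84000000
```

With full sharing, the encoder's parameter count no longer grows with depth, which is where most of ALBERT's size reduction comes from.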

3. Inter-sentence coherence loss(sentence order prediction SOP)

  • NSP: positive examples are consecutive segments taken from the same document; negative examples pair segments from different documents; positive and negative examples are sampled with equal probability.
  • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn compared to coherence prediction.
  • SOP: positive examples use the same technique as BERT's NSP (two consecutive segments from one document); negative examples take the same two consecutive segments but with their order swapped.
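The two example-construction schemes can be sketched side by side (a simplified illustration treating documents as lists of segments; function names are assumptions):

```python
import random

def make_nsp_example(doc, other_doc, rng):
    """NSP: negative pairs a segment with one from a different document."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return (doc[i], doc[i + 1]), 1          # consecutive -> positive
    return (doc[i], rng.choice(other_doc)), 0   # cross-document -> negative

def make_sop_example(doc, rng):
    """SOP: negative is the same consecutive pair with order swapped."""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return (doc[i], doc[i + 1]), 1          # in order -> positive
    return (doc[i + 1], doc[i]), 0              # swapped -> negative
```

Note that an NSP negative usually changes topic (segments come from different documents), while an SOP negative keeps the topic fixed and only breaks the ordering, forcing the model to learn coherence rather than topic.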

4. Question

  • Why is NSP ineffective?
    • NSP conflates topic prediction and coherence prediction in a single task. However, topic prediction is easier to learn than coherence prediction, and it also overlaps more with what is learned from the MLM loss.
    • SOP avoids topic prediction and instead focuses on modeling inter-sentence coherence.


【CCL-2019】How to Fine-Tune BERT for Text Classification?

1. Contributions

  • We propose a general solution to fine-tune the pre-trained BERT model, which includes three steps:
    • (1) further pre-train BERT on within-task training data or in-domain data;
    • (2) optionally fine-tune BERT with multi-task learning if several related tasks are available;
    • (3) fine-tune BERT for the target task.
  • We also investigate fine-tuning methods for BERT on the target task, including preprocessing of long text, layer selection, layer-wise learning rates, catastrophic forgetting, and low-shot learning.
  • We achieve new state-of-the-art results on seven widely studied English text classification datasets and one Chinese news classification dataset.

2. Methodology

  • Fine-Tuning Strategies
    • Dealing with long texts: head+tail (empirically, keeping the first 128 and the last 382 tokens) works best
    • Features from different layers
    • Catastrophic forgetting: use a lower learning rate to overcome it
    • Layer-wise decreasing learning rate (lower layers get smaller learning rates)
  • Further Pre-training
    • Within-task and in-domain further pre-training (the seven English datasets are partitioned into three domains: topic, sentiment, and question) can significantly boost performance
    • A preceding multi-task fine-tuning step also helps single-task fine-tuning, but its benefit is smaller than that of further pre-training
    • Cross-domain further pre-training brings no obvious benefit in general, which is reasonable since BERT is already pre-trained on a general-domain corpus
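The head+tail truncation strategy above can be sketched as follows (a minimal illustration; the function name and defaults are assumptions, with 128 + 382 = 510 leaving room for [CLS] and [SEP] in a 512-token input):

```python
def head_tail_truncate(tokens, max_len=510, head=128):
    """Keep the first `head` and the last `max_len - head` tokens,
    mirroring the paper's best-performing head+tail strategy."""
    if len(tokens) <= max_len:
        return tokens
    tail = max_len - head  # 382 with the defaults above
    return tokens[:head] + tokens[-tail:]
```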


【DeepLo-2019】Domain Adaptation with BERT-based Domain Classification and Data Selection


1. In the first step, we train a domain classifier with the same model architecture on the data from different domains with domain labels.


(Figure: BERT domain adaptation)

2. In the second step, we select a subset of source domain data based on the domain probability from the domain classifier, and train the original model on the selected source data.

The trained domain classifier is then used to predict the target domain probability for each data point from the source domain. Source data points with the highest target domain probability are selected for fine-tuning BERT for domain adaptation.
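The selection step can be sketched as below (a simplified illustration; names are assumptions, and in the paper the probabilities come from the fine-tuned BERT domain classifier):

```python
def select_source_data(source_examples, target_probs, k):
    """Rank source-domain examples by the classifier's predicted probability
    of belonging to the target domain, and keep the top k for fine-tuning.
    `target_probs[i]` is P(target domain | source_examples[i])."""
    ranked = sorted(zip(source_examples, target_probs),
                    key=lambda pair: pair[1], reverse=True)
    return [ex for ex, _ in ranked[:k]]
```

For example, `select_source_data(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], k=2)` keeps the two examples most similar to the target domain, `["b", "d"]`.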


  • extends to multi-source domain adaptation
  • applies to few-shot learning scenarios, where the selected source-domain data augments the limited target-domain training data