【CCL-2019】How to Fine-Tune BERT for Text Classification?

1. Contributions

  • We propose a general solution for fine-tuning the pre-trained BERT model, which includes three steps:
    • (1) further pre-train BERT on within-task training data or in-domain data;
    • (2) optionally fine-tune BERT with multi-task learning if several related tasks are available;
    • (3) fine-tune BERT for the target task.
  • We also investigate fine-tuning methods for BERT on the target task, including preprocessing of long text, layer selection, layer-wise learning rates, catastrophic forgetting, and low-shot learning.
  • We achieve new state-of-the-art results on seven widely-studied English text classification datasets and one Chinese news classification dataset.

2. Methodology

  • Fine-Tuning Strategies
    • Dealing with long texts: head+tail (empirically, keeping the first 128 and the last 382 tokens) works best
    • Features from different layers: the last layer gives the best performance for classification
    • Catastrophic forgetting: a lower learning rate (e.g., 2e-5) is needed to overcome it
    • Layer-wise decreasing learning rate: assign each lower layer a learning rate scaled down by a decay factor (0.95 works well)
  • Further Pre-training
    • Within-task and in-domain further pre-training (the seven English datasets are partitioned into three domains: topic, sentiment, and question) can significantly boost performance
    • A preceding multi-task fine-tuning also helps single-task fine-tuning, but its benefit is smaller than that of further pre-training
    • Cross-domain further pre-training brings no obvious benefit in general, which is reasonable since BERT is already trained on a general-domain corpus
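The head+tail strategy for long texts above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the helper name and the fake token list are assumptions, and it presumes the tokenizer has already split the document into tokens (128 + 382 = 510 tokens leaves room for [CLS] and [SEP] in a 512-token BERT input).

```python
# Sketch of head+tail truncation: keep the first 128 and the last 382 tokens
# of an over-long document. `head_tail_truncate` is a hypothetical helper name.

def head_tail_truncate(tokens, head=128, tail=382):
    """Return the document unchanged if it fits, else head + tail tokens."""
    if len(tokens) <= head + tail:
        return tokens
    return tokens[:head] + tokens[-tail:]

# Toy example: a 1000-"token" document stands in for real tokenizer output.
long_doc = [f"t{i}" for i in range(1000)]
truncated = head_tail_truncate(long_doc)
# len(truncated) == 510: t0..t127 followed by t618..t999
```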
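The layer-wise decreasing learning rate can likewise be sketched as a simple geometric schedule. The base rate 2e-5 and decay factor 0.95 are the paper's best-performing setting; `layerwise_lrs` is an illustrative helper, and wiring the rates into optimizer parameter groups is left framework-agnostic.

```python
# Sketch of layer-wise learning rate decay: lr(layer) = top_lr * decay**(L-1-layer),
# so the top layer trains fastest and lower layers change more conservatively,
# which also mitigates catastrophic forgetting of pre-trained knowledge.

def layerwise_lrs(num_layers, top_lr=2e-5, decay=0.95):
    """Return one learning rate per layer, largest for the top layer."""
    return [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

lrs = layerwise_lrs(12)  # BERT-base has 12 transformer layers
# lrs[-1] == 2e-5 (top layer); each lower layer is scaled by a further 0.95
```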


【DeepLo-2019】Domain Adaptation with BERT-based Domain Classification and Data Selection


1. In the first step, we train a domain classifier with the same model architecture on data from different domains, using domain labels.


(Figure: BERT domain adaptation)

2. In the second step, we select a subset of source domain data based on the domain probability from the domain classifier, and train the original model on the selected source data.

The trained domain classifier is then used to predict the target domain probability for each data point from the source domain. Source data points with the highest target domain probability are selected for fine-tuning BERT for domain adaptation.
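The selection step reduces to a plain ranking once the domain classifier has produced P(target domain | x) for each source example. The sketch below assumes those probabilities are already computed; `select_source_data` and the toy scores are illustrative, not the paper's code.

```python
# Sketch of data selection for domain adaptation: keep the k source-domain
# examples that the domain classifier scores as most target-domain-like.

def select_source_data(source_examples, target_probs, k):
    """Return the k source examples with the highest target-domain probability."""
    ranked = sorted(zip(source_examples, target_probs),
                    key=lambda pair: pair[1], reverse=True)
    return [example for example, _ in ranked[:k]]

# Toy usage: four source examples with hypothetical P(target domain) scores.
source = ["s1", "s2", "s3", "s4"]
probs = [0.2, 0.9, 0.5, 0.7]
selected = select_source_data(source, probs, k=2)
# selected == ["s2", "s4"] -- these feed the fine-tuning stage
```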


  • The approach extends to multi-source domain adaptation
  • It also applies to few-shot learning scenarios, where the selected source domain data can augment the limited target domain training data



130. Surrounded Regions

200. Number of Islands

695. Max Area of Island

547. Friend Circles

79. Word Search

417. Pacific Atlantic Water Flow

133. Clone Graph

473. Matchsticks to Square

494. Target Sum