【ACL-2018】Universal Language Model Fine-tuning for Text Classification


  • We propose Universal Language Model Fine-tuning (ULMFiT), a method that can be used to achieve CV-like transfer learning for any NLP task.
  • We propose discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing, novel techniques to retain previous knowledge and avoid catastrophic forgetting during fine-tuning. (The most valuable contribution of the paper: these three training tricks.)
  • We significantly outperform the state-of-the-art on six representative text classification datasets, with an error reduction of 18-24% on the majority of datasets.
  • We show that our method enables extremely sample-efficient transfer learning and perform an extensive ablation analysis.
  • We make the pretrained models and our code available to enable wider adoption.

II. Related work

  • Transfer learning in CV: features in deep CV networks transition from general to task-specific from the first to the last layer, so most work in CV focuses on transferring the first layers of the model. Early work achieved state-of-the-art results by using features of an ImageNet model as input to a simple classifier. This approach has since been superseded by fine-tuning either the last or several of the last layers of a pretrained model while leaving the remaining layers frozen.
  • Hypercolumns: in NLP, embeddings pretrained on other tasks to capture additional context are used as features at different levels, concatenated either with the word embeddings or with the inputs at intermediate layers. In CV, hypercolumns have been almost entirely superseded by end-to-end fine-tuning.
  • Multi-task learning (MTL): a language modeling objective is added to the model and trained jointly with the main task model. However, MTL requires the tasks to be trained from scratch every time, which makes it inefficient, and it often requires careful weighting of the task-specific objective functions.
  • Fine-tuning: fine-tuning has been used successfully to transfer between similar tasks, e.g. in QA, for distantly supervised sentiment analysis, or between MT domains. However, earlier language model fine-tuning overfits with 10k labeled examples and requires millions of in-domain documents for good performance.


1. General-domain LM pretraining

Pretrain the language model on Wikitext-103, which consists of 28,595 preprocessed Wikipedia articles and 103 million words. This step only needs to be done once.

2. Target task LM fine-tuning

2.1 Discriminative fine-tuning

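Different layers capture different types of information, so they should be fine-tuned to different extents. Discriminative fine-tuning allows us to tune each layer with a different learning rate.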
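A minimal sketch of per-layer learning rates, in plain Python with weights stored as nested lists. The paper updates each layer l with its own rate, θ^l_t = θ^l_{t-1} − η^l · ∇J(θ), and reports that choosing the top layer's rate and setting η^{l-1} = η^l / 2.6 works well empirically. The function names `discriminative_lrs` and `sgd_step` are hypothetical helpers for illustration, not part of any library.

```python
def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Return one learning rate per layer, ordered from lowest to top layer.

    The top layer gets `top_lr`; each layer below is divided by `decay`
    (2.6 is the value the paper reports as working well).
    """
    return [top_lr / decay ** (n_layers - 1 - layer) for layer in range(n_layers)]


def sgd_step(layers, grads, lrs):
    """Plain SGD where layer l is updated with its own learning rate lrs[l]."""
    return [
        [w - lr * g for w, g in zip(weights, layer_grads)]
        for weights, layer_grads, lr in zip(layers, grads, lrs)
    ]


# Example: three layers, top-layer learning rate 0.01.
lrs = discriminative_lrs(0.01, 3)
```

In a real framework the same idea is usually expressed by giving each layer group its own learning rate in the optimizer configuration.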

2.2 Slanted triangular learning rates

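The learning rate first increases linearly for a short period and then decays linearly over a long period, according to the following update schedule:

$$
\begin{aligned}
\mathit{cut} &= \lfloor T \cdot \mathit{cut\_frac} \rfloor \\
p &= \begin{cases} t / \mathit{cut}, & \text{if } t < \mathit{cut} \\ 1 - \dfrac{t - \mathit{cut}}{\mathit{cut} \cdot (1 / \mathit{cut\_frac} - 1)}, & \text{otherwise} \end{cases} \\
\eta_t &= \eta_{max} \cdot \frac{1 + p \cdot (\mathit{ratio} - 1)}{\mathit{ratio}}
\end{aligned}
$$

Here T is the number of training iterations, cut_frac is the fraction of iterations during which the learning rate increases, cut is the iteration at which it switches from increasing to decreasing, p is the fraction of the increase or decrease completed so far, ratio specifies how much smaller the lowest learning rate is than the maximum η_max, and η_t is the learning rate at iteration t. The paper generally uses cut_frac = 0.1, ratio = 32 and η_max = 0.01.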
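The slanted triangular schedule from the paper can be sketched as a small Python function. The function name `stlr` and the guard for very short runs are assumptions for illustration; the default values (cut_frac = 0.1, ratio = 32, lr_max = 0.01) are the ones the paper reports using.

```python
import math


def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Learning rate at iteration t (0 <= t <= T) of a slanted triangular schedule.

    T        : total number of training iterations
    cut_frac : fraction of iterations spent increasing the learning rate
    ratio    : how much smaller the lowest learning rate is than lr_max
    """
    cut = math.floor(T * cut_frac)
    if cut < 1:
        # Assumption: the schedule is undefined when the warm-up has no iterations.
        raise ValueError("T * cut_frac must be at least 1")
    if t < cut:
        p = t / cut  # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


# The rate starts at lr_max / ratio, peaks at lr_max when t == cut,
# and returns to lr_max / ratio at t == T.
schedule = [stlr(t, 1000) for t in range(1001)]
```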


3. Target task classifier fine-tuning

3.1 Concat pooling

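Information may get lost if we only consider the last hidden state of the model, so the last hidden state is concatenated with max-pooled and mean-pooled representations of the hidden states over all time steps.

Hidden states: $H = \{h_1, \ldots, h_T\}$

The final representation is $h_c = [h_T, \mathrm{maxpool}(H), \mathrm{meanpool}(H)]$, where $[\,]$ denotes concatenation.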
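A minimal sketch of concat pooling in plain Python, assuming the hidden states are given as a list of T equal-length lists of floats. The function name `concat_pool` is hypothetical; in practice the same operation would be done on tensors.

```python
def concat_pool(H):
    """Concatenate the last hidden state with max- and mean-pooling over time.

    H is a list of T hidden states, each a list of d floats.
    Returns a list of 3 * d floats: [h_T, maxpool(H), meanpool(H)].
    """
    if not H:
        raise ValueError("H must contain at least one hidden state")
    last = list(H[-1])
    # zip(*H) groups the values of each hidden dimension across all time steps.
    max_pool = [max(dim) for dim in zip(*H)]
    mean_pool = [sum(dim) / len(H) for dim in zip(*H)]
    return last + max_pool + mean_pool


# Three time steps, hidden size 2.
h_c = concat_pool([[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]])
```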

3.2 Gradual unfreezing

Fine-tuning all layers at once risks catastrophic forgetting. Instead, we first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until all layers are unfrozen and fine-tuned until convergence in the last iteration.
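The unfreezing order can be sketched as a schedule that lists which layers are trainable in each epoch. This is only an illustration of the ordering, assuming layers are indexed from 0 (lowest) to n_layers - 1 (last); the function name `gradual_unfreezing_schedule` is hypothetical.

```python
def gradual_unfreezing_schedule(n_layers):
    """Return, for each epoch, the indices of the layers that are trainable.

    Epoch 0 trains only the last layer; each later epoch also unfreezes
    the next lower layer, until every layer is trainable.
    """
    schedule = []
    for epoch in range(n_layers):
        lowest_trainable = n_layers - 1 - epoch
        schedule.append(list(range(lowest_trainable, n_layers)))
    return schedule


print(gradual_unfreezing_schedule(3))  # [[2], [1, 2], [0, 1, 2]]
```

Once the final entry is reached, all layers stay unfrozen and training continues until convergence.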

3.3 BPTT for Text Classification (BPT3C)

We divide the document into fixed-length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch. We keep track of the hidden states for mean- and max-pooling, and gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable-length backpropagation sequences.
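The state carry-over can be illustrated with a forward-only sketch in plain Python. It covers only two parts of the description: each batch starts from the final state of the previous one, and every hidden state is kept for later pooling. Gradient truncation is not modelled. The function `bpt3c_forward` and the toy `step` function are assumptions for illustration, not the authors' implementation.

```python
def bpt3c_forward(tokens, step, init_state, chunk_len):
    """Run a recurrent `step` over a long document in fixed-length chunks.

    step(state, token) must return the new state.
    Returns the final state and the list of all hidden states.
    """
    if chunk_len < 1:
        raise ValueError("chunk_len must be positive")
    state = init_state
    hidden_states = []
    for start in range(0, len(tokens), chunk_len):
        chunk = tokens[start:start + chunk_len]
        # `state` still holds the final state of the previous chunk here.
        for token in chunk:
            state = step(state, token)
            hidden_states.append(state)  # kept for mean- and max-pooling
    return state, hidden_states


def toy_step(state, token):
    """Toy one-dimensional recurrence used only for this example."""
    return [0.5 * state[0] + token]


final_state, states = bpt3c_forward([1.0, 2.0, 3.0, 4.0, 5.0], toy_step, [0.0], 2)
```

Because the state is carried across chunk boundaries, the hidden states are the same as if the whole document had been processed in a single pass.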

3.4 Bidirectional language model


V. Discussion and future directions

  1. Settings where language model fine-tuning will be particularly useful:
    • NLP for non-English languages, where training data for supervised pretraining tasks is scarce
    • new NLP tasks where no state-of-the-art architecture exists
    • tasks with limited amounts of labeled data (and some amounts of unlabeled data)
  2. Future directions:
    • improve language model pretraining and fine-tuning and make them more scalable: language modeling can also be augmented with additional tasks in a multi-task learning fashion or enriched with additional supervision, e.g. syntax-sensitive dependencies, to create a model that is more general or better suited for certain downstream tasks, ideally in a weakly-supervised manner to retain its universal properties

    • apply the method to novel tasks and models: While an extension to sequence labeling is straightforward, other tasks with more complex interactions such as entailment or question answering may require novel ways to pretrain and fine-tune