Papers read

【ACL-2018】Universal Language Model Fine-tuning for Text Classification

【ACL-2019】Deep Unknown Intent Detection with Margin Loss

【ACL-2019】A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling

【AAAI-2019】Unsupervised Transfer Learning for Spoken Language Understanding in Intelligent Agents

【EMNLP-2018】Zero-shot User Intent Detection via Capsule Neural Networks

【ACL-2019】Joint Slot Filling and Intent Detection via Capsule Neural Networks

【NAACL-HLT-2018】Slot-Gated Modeling for Joint Slot Filling and Intent Prediction

【CoRR-2017】Multi-Domain Adversarial Learning for Slot Filling in Spoken Language Understanding

【INTERSPEECH-2016】Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM

【SIGDIAL-2013】Deep Neural Network Approach for the Dialog State Tracking Challenge

【EMNLP-2015】A Model of Zero-Shot Learning of Spoken Language Understanding

【NAACL-2018】A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling

【INTERSPEECH-2016】Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling

【IJCAI-2016】A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding

【NIPS-2017】Attention Is All You Need



II. Model Architecture


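Encoder: 6 layers in total. Each layer has two sub-layers: multi-head self-attention and a position-wise fully connected feed-forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization.

Decoder: 6 layers in total. Each layer has three sub-layers: masked multi-head self-attention, encoder-decoder attention (also called context attention), and a position-wise fully connected feed-forward network.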

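  • Scaled Dot-Product Attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
  • Multi-Head Attention: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
  • Applications of Attention in our Model (three kinds): encoder-decoder attention, encoder self-attention, and masked decoder self-attention
  • Position-wise Feed-Forward Networks: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
  • Positional Encoding: PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))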
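A minimal pure-Python sketch of the Scaled Dot-Product Attention item in the list above, assuming the standard definition softmax(Q K^T / sqrt(d_k)) V from the Transformer paper. The function and variable names are illustrative, matrices are plain lists of lists, and masking and multiple heads are left out.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists).

    Returns (output, weights): output is n_q x d_v, weights is n_q x n_k.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    weights = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    d_v = len(V[0])
    # Each output row is the attention-weighted average of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(d_v)]
        for row in weights
    ]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, and each query puts more weight on the key it matches. Multi-head attention would run this function h times on linearly projected Q, K, V and concatenate the results.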


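Relative encoding: for any position offset k between two tokens, PE(pos+k) can be expressed as a combination of PE(pos) and PE(k), because sin(a+b) = sin(a)cos(b) + cos(a)sin(b) and cos(a+b) = cos(a)cos(b) - sin(a)sin(b).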
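The relative-position property can be checked numerically. The sketch below assumes the sinusoidal encoding from the Transformer paper (sine on even dimensions, cosine on odd ones) and an even d_model. The helper names `positional_encoding` and `shift` are illustrative, not from the paper.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position: sin on even dims, cos on odd dims."""
    pe = []
    for i in range(0, d_model, 2):
        # i plays the role of 2i in PE(pos, 2i) = sin(pos / 10000^(2i / d_model)).
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe


def shift(pe_pos, pe_k):
    """Build PE(pos + k) from PE(pos) and PE(k) with the angle-sum identities."""
    out = []
    for j in range(0, len(pe_pos), 2):
        s_p, c_p = pe_pos[j], pe_pos[j + 1]
        s_k, c_k = pe_k[j], pe_k[j + 1]
        out.append(s_p * c_k + c_p * s_k)  # sin(a + b)
        out.append(c_p * c_k - s_p * s_k)  # cos(a + b)
    return out


combined = shift(positional_encoding(5, 8), positional_encoding(3, 8))
direct = positional_encoding(8, 8)
max_error = max(abs(a - b) for a, b in zip(combined, direct))
```

`max_error` stays at floating-point noise level, which is what lets the model read relative offsets from absolute encodings.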




【ACL-2018】Universal Language Model Fine-tuning for Text Classification


  • We propose Universal Language Model Fine-tuning (ULMFiT), a method that can be used to achieve CV-like transfer learning for any task for NLP.
  • We propose discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing, novel techniques to retain previous knowledge and avoid catastrophic forgetting during fine-tuning. (These three training tricks are the most valuable contribution of the paper.)
  • We significantly outperform the state-of-the-art on six representative text classification datasets, with an error reduction of 18-24% on the majority of datasets.
  • We show that our method enables extremely sample-efficient transfer learning and perform an extensive ablation analysis.
  • We make the pretrained models and our code available to enable wider adoption.

II. Related work

  • Transfer learning in CV: most work in CV focuses on transferring the first layers of the model, since features move from general to task-specific from the first layer to the last. Early work achieved state-of-the-art results using features of an ImageNet model as input to a simple classifier, but this has been superseded by fine-tuning either the last or several of the last layers of a pretrained model and leaving the remaining layers frozen
  • Hypercolumns: in NLP, embeddings at different levels are used as features, concatenated either with the word embeddings or with the inputs at intermediate layers. In CV, hypercolumns have been nearly entirely superseded by end-to-end fine-tuning
  • Multi-task learning (MTL): add a language modeling objective to the model, trained jointly with the main task model. But MTL requires the tasks to be trained from scratch every time, which makes it inefficient, and it often requires careful weighting of the task-specific objective functions
  • Fine-tuning: fine-tuning has been used successfully to transfer between similar tasks, e.g. in QA, for distantly supervised sentiment analysis, or MT domains. But earlier language model fine-tuning overfits with 10k labeled examples and requires millions of in-domain documents for good performance


1. General-domain LM pretraining

Pretrain the language model on Wikitext-103, which consists of 28,595 preprocessed Wikipedia articles and 103 million words. This step only needs to be done once.

2. Target task LM fine-tuning

2.1 Discriminative fine-tuning

Discriminative fine-tuning tunes each layer with its own learning rate instead of using one learning rate for the whole model.
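A small sketch of per-layer learning rates, assuming plain SGD and the rule reported in the ULMFiT paper: pick a rate for the last layer, then divide by 2.6 for each layer below it. The function names and the list-of-lists parameter layout are illustrative only.

```python
def discriminative_learning_rates(last_layer_lr, num_layers, decay=2.6):
    """Return one learning rate per layer, index 0 = lowest layer.

    Each layer gets the rate of the layer above it divided by `decay`.
    """
    return [
        last_layer_lr / (decay ** (num_layers - 1 - layer))
        for layer in range(num_layers)
    ]


def sgd_step(layer_params, layer_grads, layer_lrs):
    """One SGD update where every layer uses its own learning rate."""
    return [
        [p - lr * g for p, g in zip(params, grads)]
        for params, grads, lr in zip(layer_params, layer_grads, layer_lrs)
    ]


lrs = discriminative_learning_rates(last_layer_lr=0.01, num_layers=3)
updated = sgd_step([[1.0], [1.0], [1.0]], [[1.0], [1.0], [1.0]], lrs)
```

With identical gradients, the top layer moves the most and the lowest layer the least, so general features learned in pretraining change slowly.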

2.2 Slanted triangular learning rates

The learning rate first increases linearly and then decays linearly: a short, steep warm-up followed by a long, gradual decay.
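The schedule can be written as a short function. This sketch follows the formula given in the ULMFiT paper, with the paper's reported defaults (cut_frac = 0.1, ratio = 32, maximum rate 0.01) as default arguments. The function name is illustrative, and it assumes T * cut_frac is at least 1 so the warm-up is not empty.

```python
import math


def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Learning rate at iteration t out of T total training iterations.

    cut_frac: fraction of iterations spent increasing the rate.
    ratio: how much smaller the lowest rate is than lr_max.
    """
    cut = math.floor(T * cut_frac)  # iteration where increase turns into decay
    if t < cut:
        p = t / cut  # warm-up: p rises from 0 to 1
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay: p falls from 1 to 0
    return lr_max * (1 + p * (ratio - 1)) / ratio


T = 1000
rates = [slanted_triangular_lr(t, T) for t in range(T + 1)]
peak_iteration = max(range(T + 1), key=lambda t: rates[t])
```

The rate starts at lr_max / ratio, peaks at lr_max after 10% of training, and returns to lr_max / ratio at the end.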


3. Target task classifier fine-tuning

3.1 Concat pooling

Information may get lost if we only consider the last hidden state of the model.

Hidden states: H = {h_1, ..., h_T}

The final representation is h_c = [h_T, maxpool(H), meanpool(H)], where [] denotes concatenation.
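A minimal sketch of concat pooling on plain Python lists. It assumes H is a list of T hidden states of equal dimension d, so the result has dimension 3d. The function name `concat_pool` is illustrative.

```python
def concat_pool(H):
    """H: list of T hidden states, each a list of d floats. Returns a 3*d vector."""
    d = len(H[0])
    last = list(H[-1])  # h_T, the last hidden state
    max_pool = [max(h[j] for h in H) for j in range(d)]  # max over time, per dimension
    mean_pool = [sum(h[j] for h in H) / len(H) for j in range(d)]  # mean over time
    return last + max_pool + mean_pool


H = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
h_c = concat_pool(H)
```

Here `h_c` has six entries: the last state (2.0, 4.0), the per-dimension maxima (3.0, 4.0), and the per-dimension means (2.0 and 2/3).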

3.2 Gradual unfreezing

Fine-tuning all layers at once risks catastrophic forgetting. Instead, first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. Then unfreeze the next lower frozen layer and repeat, until all layers are unfrozen and are fine-tuned until convergence at the last iteration.
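The unfreezing order can be sketched as a schedule that lists the trainable layers for each epoch. This assumes one newly unfrozen layer per epoch and layer index 0 for the lowest layer. The generator name is illustrative.

```python
def gradual_unfreezing_schedule(num_layers, num_epochs):
    """Yield, for each epoch, the indices of trainable layers (0 = lowest layer)."""
    for epoch in range(num_epochs):
        # One more layer is unfrozen each epoch, starting from the top.
        num_unfrozen = min(epoch + 1, num_layers)
        yield list(range(num_layers - num_unfrozen, num_layers))


schedule = list(gradual_unfreezing_schedule(num_layers=4, num_epochs=6))
```

For four layers, only layer 3 trains in the first epoch, layers 2 and 3 in the second, and all four from the fourth epoch onward.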

3.3 BPTT for Text Classification (BPT3C)

We divide the document into fixed length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length back propagation sequences.
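The forward-pass bookkeeping of this procedure can be sketched as follows. Only the chunking and state carrying are shown; gradient truncation is not modeled. `toy_rnn_step` is a hypothetical stand-in for a real recurrent cell, and all names are illustrative.

```python
def chunks(tokens, b):
    """Split a token list into consecutive chunks of at most b tokens."""
    return [tokens[i:i + b] for i in range(0, len(tokens), b)]


def toy_rnn_step(state, token):
    """Hypothetical stand-in for an RNN cell: a decayed running sum of token ids."""
    return 0.5 * state + token


def run_document(tokens, b):
    """Process a document chunk by chunk, carrying the state across chunks.

    Returns every hidden state so they can later be mean- and max-pooled.
    """
    state = 0.0
    hidden_states = []
    for chunk in chunks(tokens, b):
        # Each chunk starts from the final state of the previous chunk.
        for token in chunk:
            state = toy_rnn_step(state, token)
            hidden_states.append(state)
    return hidden_states


tokens = [1, 2, 3, 4, 5]
states_small_chunks = run_document(tokens, b=2)
states_one_chunk = run_document(tokens, b=5)
```

Because the state is carried over, the hidden states do not depend on the chunk size. Chunking only limits how far gradients would be back-propagated in a real model.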

3.4 Bidirectional language model


V. Discussion and future directions

  1. Fine-tuning will be particularly useful for:
    • NLP for non-English languages, where training data for supervised pretraining tasks is scarce
    • new NLP tasks where no state-of-the-art architecture exists
    • tasks with limited amounts of labeled data (and some amounts of unlabeled data)
  2. Future directions:
    • improve language model pretraining and fine-tuning and make them more scalable: Language modeling can also be augmented with additional tasks in a multi-task learning fashion or enriched with additional supervision, e.g. syntax-sensitive dependencies to create a model that is more general or better suited for certain downstream tasks, ideally in a weakly-supervised manner to retain its universal properties

    • apply the method to novel tasks and models: While an extension to sequence labeling is straightforward, other tasks with more complex interactions such as entailment or question answering may require novel ways to pretrain and fine-tune