[Paper Notes][EMNLP-2019] Patient Knowledge Distillation for BERT Model Compression

1. Contributions

Two different strategies:

  • (i) PKD-Last: the student learns from the last k layers of the teacher, under the assumption that the top layers of the original network contain the most informative knowledge to teach the student;
  • (ii) PKD-Skip: the student learns from every k layers of the teacher, suggesting that the lower layers of the teacher network also contain important information and should be passed along for incremental distillation.
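
A minimal sketch of how the two strategies translate into layer mappings, assuming a 12-layer teacher compressed into a 6-layer student (the function name `pkd_layer_map` and the 1-indexed convention are illustrative, not from the paper):

```python
def pkd_layer_map(strategy, n_teacher=12, n_student=6):
    """Return the teacher layers each student layer distills from.

    Layers are 1-indexed. The teacher's final layer is excluded,
    since its output is already covered by the soft-label loss.
    """
    if strategy == "last":
        # PKD-Last: mimic the last teacher layers (excluding the top one).
        return list(range(n_teacher - n_student + 1, n_teacher))
    if strategy == "skip":
        # PKD-Skip: mimic every k-th teacher layer.
        k = n_teacher // n_student
        return list(range(k, n_teacher, k))
    raise ValueError(f"unknown strategy: {strategy}")


print(pkd_layer_map("last"))  # [7, 8, 9, 10, 11]
print(pkd_layer_map("skip"))  # [2, 4, 6, 8, 10]
```

In both cases only the student's lower layers (1 through 5 here) receive the patient loss; the student's top layer is trained through the task and distillation losses.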

2. Related Work

Language Model Pre-training

  • (i) feature-based approach;
    • context-independent word representation (e.g., word2vec, GloVe, FastText)
    • sentence-level representation
    • contextualized word representation (e.g., CoVe, ELMo)
  • (ii) fine-tuning approach (e.g., GPT, BERT)

Model Compression & Knowledge Distillation

  • exploit the high degree of parameter redundancy in deep networks: network pruning, weight quantization
  • transfer knowledge from a network with a large set of parameters into a compact and fast-to-execute model: knowledge distillation

3. Patient Knowledge Distillation

Distillation Objective
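
For reference, this is the vanilla KD objective the note below refers to, reconstructed in the paper's notation (α weights the soft-label term against the student's task cross-entropy; treat the exact indexing as a sketch):

```latex
% Soft-label distillation loss: the student matches the teacher's
% predicted class distribution on the training examples
\[
L_{DS} = - \sum_{i} \sum_{c}
  P^{t}(y_i = c \mid x_i;\, \hat{\theta}^{t}) \,
  \log P^{s}(y_i = c \mid x_i;\, \theta^{s})
\]

% Vanilla KD objective: task cross-entropy plus the soft-label loss
\[
L_{KD} = (1 - \alpha)\, L_{CE}^{s} + \alpha\, L_{DS}
\]
```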

With the vanilla KD objective above, the student learns only from the teacher's final predictions and tends to overfit, so the patient knowledge distillation (PKD) objective described below is used instead.

Patient Teacher for Model Compression
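
The "patient" part adds a loss on intermediate representations: the student's [CLS] hidden state at each of its lower layers should stay close (after L2 normalization) to the teacher's [CLS] hidden state at the mapped layer, and this term is added to the KD objective with a weight β. Below is a minimal PyTorch sketch under that reading; the helper names (`pt_loss`, `pkd_loss`), the tensor layout, and the default α/β values are my own illustrations, not the paper's code:

```python
import torch.nn.functional as F


def pt_loss(student_cls, teacher_cls):
    """Normalized MSE between student and teacher [CLS] hidden states.

    Both tensors: (batch, num_mapped_layers, hidden_size), where the
    teacher states come from the layers chosen by PKD-Last or PKD-Skip.
    """
    s = F.normalize(student_cls, p=2, dim=-1)
    t = F.normalize(teacher_cls, p=2, dim=-1)
    return ((s - t) ** 2).sum(dim=-1).mean()


def pkd_loss(ce_loss, ds_loss, student_cls, teacher_cls,
             alpha=0.5, beta=100.0):
    """L_PKD = (1 - alpha) * L_CE + alpha * L_DS + beta * L_PT.

    alpha/beta are hyperparameters to tune on dev data; the values
    here are placeholders.
    """
    return ((1 - alpha) * ce_loss
            + alpha * ds_loss
            + beta * pt_loss(student_cls, teacher_cls.detach()))
```

The teacher's hidden states are detached because the teacher stays frozen during distillation; only the student's parameters are updated.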

4. Experiments
