I. Introduction
1. Pre-trained language models (PLMs)
2. Model compression
quantization, weight pruning, knowledge distillation (KD)
Distilled BiLSTM_SOFT, BERT-PKD, DistilBERT
II. Method
1. TRANSFORMER DISTILLATION
Problem Formulation
Transformer-layer Distillation
Embedding-layer Distillation
Prediction-Layer Distillation
All-Layers Distillation (Conclusion)
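The layer-wise objectives above (attention matrices and hidden states per Transformer layer, embeddings, and soft predictions) can be sketched numerically. This is a minimal illustration with toy flattened vectors, not the paper's implementation; in TinyBERT the hidden-state and embedding losses are computed after a learned projection (W_h, W_e) mapping the student dimension to the teacher's.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length flattened tensors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of student soft predictions against teacher soft targets."""
    def softmax(z, t):
        m = max(z)
        exps = [math.exp((v - m) / t) for v in z]
        s = sum(exps)
        return [e / s for e in exps]
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = [math.log(p) for p in softmax(student_logits, temperature)]
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))

# Toy per-layer losses (real tensors are [batch, heads, len, len], etc.)
attn_loss = mse([0.2, 0.8], [0.3, 0.7])    # attention-based distillation
hidn_loss = mse([0.1, -0.2], [0.0, -0.1])  # hidden states (after projection W_h)
embd_loss = mse([0.5, 0.5], [0.4, 0.6])    # embedding layer (after projection W_e)
pred_loss = soft_cross_entropy([2.0, 0.5], [1.8, 0.7])  # prediction layer

total_loss = attn_loss + hidn_loss + embd_loss + pred_loss
```

The all-layers objective is the sum over these terms, with the prediction-layer term weighted separately in the paper.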
2. TINYBERT LEARNING
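TinyBERT learning proceeds in two stages: general distillation on a large unlabeled corpus with the pre-trained teacher (intermediate layers only), then task-specific distillation on augmented task data with a fine-tuned teacher (adding the prediction-layer loss). A hedged skeleton of that pipeline; `distill` and the teacher/corpus names are hypothetical stand-ins, not a real training API.

```python
def distill(student, teacher, corpus, use_prediction_layer):
    """Stand-in for one distillation pass: records which stage ran."""
    student["log"].append((teacher, len(corpus), use_prediction_layer))
    return student

student = {"log": []}

# Stage 1: general distillation, pre-trained (not fine-tuned) teacher,
# large-scale corpus, intermediate-layer losses only.
student = distill(student, "pretrained_bert", ["doc"] * 3,
                  use_prediction_layer=False)

# Stage 2: task-specific distillation, fine-tuned teacher, (augmented)
# task data, now including the prediction-layer loss.
student = distill(student, "finetuned_bert", ["example"] * 2,
                  use_prediction_layer=True)
```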
III. EXPERIMENTS
IV. APPENDIX
1. DATA AUGMENTATION DETAILS
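The paper's augmentation replaces words with candidates proposed by BERT's masked LM (single-piece words) or GloVe nearest neighbours (multi-piece words), controlled by a replacement threshold and a number of augmented samples per sentence. A minimal sketch of that word-replacement scheme, assuming a toy hard-coded candidate table in place of the BERT/GloVe candidate generators:

```python
import random

# Toy candidate table standing in for BERT-MLM / GloVe nearest-neighbour
# candidates; the words here are illustrative only.
CANDIDATES = {
    "good": ["great", "nice", "fine"],
    "movie": ["film", "picture"],
}

def augment(sentence, p_replace=0.4, n_aug=3, seed=0):
    """Generate n_aug variants by randomly swapping words for candidates."""
    rng = random.Random(seed)
    words = sentence.split()
    out = []
    for _ in range(n_aug):
        new = [rng.choice(CANDIDATES[w])
               if w in CANDIDATES and rng.random() < p_replace else w
               for w in words]
        out.append(" ".join(new))
    return out

augmented = augment("a good movie")
```

Each call yields `n_aug` augmented sentences; words without candidates are left unchanged.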