[Paper Notes][CoRR-2019] A Mutual Information Maximization Perspective of Language Representation Learning

Brief summary: the paper argues that most language representation learning objectives (SKIP-GRAM, BERT, XLNet, etc.) can be viewed as maximizing the mutual information between a local part of a word sequence (a masked word, or an n-gram contrasted against negative samples) and a global part (the sentence containing it). Building on this view, it proposes an improvement, INFOWORD, whose DIM-style term takes n-grams as the local view and draws negative samples from other sentences.

I. Introduction

The paper provides an alternative view and shows that these methods also maximize a lower bound on the mutual information between different parts of a word sequence.

II. Mutual information maximization

1. Mutual information
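For reference (notation here is mine: $A$ and $B$ are the two random variables whose realizations are the two views of a word sequence), mutual information is the KL divergence between the joint distribution and the product of the marginals:

$$ I(A, B) = D_{\mathrm{KL}}\big(p(A, B) \,\|\, p(A)\,p(B)\big) = \mathbb{E}_{p(a, b)}\left[\log \frac{p(a, b)}{p(a)\,p(b)}\right] $$

This quantity is intractable for the high-dimensional views used here, which is why the paper works with tractable lower bounds instead.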

2. InfoNCE
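A sketch of the InfoNCE lower bound in the notation I use throughout these notes: $f_\theta$ is a learned scoring function on pairs of views, and $\tilde{\mathcal{B}}$ is a candidate set containing the true $b$ plus negative samples.

$$ I(A, B) \;\geq\; \log |\tilde{\mathcal{B}}| + \mathbb{E}_{p(a, b)}\left[ f_\theta(a, b) - \log \sum_{\tilde{b} \in \tilde{\mathcal{B}}} \exp f_\theta(a, \tilde{b}) \right] $$

Maximizing the right-hand side is exactly a cross-entropy loss that classifies the true $b$ among the candidates in $\tilde{\mathcal{B}}$, which is what ties the language-model objectives below to mutual information.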

III. Mutual information maximization in existing models

1. SKIP-GRAM
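As I read the paper's construction, skip-gram with negative sampling instantiates the bound above with two word-level views: $a = w_i$ is a target word, $b = w_{i+k}$ is a word in its context window, the score is a dot product of two embedding tables, and the negatives are words sampled from the vocabulary.

$$ f_\theta(a, b) = \phi(a)^\top \psi(b), \qquad \tilde{\mathcal{B}} \subset \mathcal{V} $$

Here $\phi$ and $\psi$ are my shorthand for the input and output embedding tables, so neither view involves a contextual encoder.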

2. BERT
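BERT's masked language model fits the same template, in my paraphrase: the global view $a = \hat{x}$ is the sentence with position $i$ masked, the local view $b = x_i$ is the masked token, the score is a dot product between the Transformer representation at the masked position and the output word embedding, and the candidate set is the whole vocabulary, so the MLM softmax is an InfoNCE classifier.

$$ f_\theta(\hat{x}, x_i) = e(x_i)^\top \mathrm{enc}(\hat{x})_i, \qquad \tilde{\mathcal{B}} = \mathcal{V} $$

($e$ and $\mathrm{enc}$ are my shorthand for the output embedding table and the Transformer encoder.)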

3. XLNET
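My reading of the XLNet case: masking is replaced by a factorization order. Sample a permutation $z$ of the positions, take the global view to be the tokens already visited under the permutation and the local view to be the next token,

$$ a = x_{z_{<t}}, \qquad b = x_{z_t}, \qquad \tilde{\mathcal{B}} = \mathcal{V}, $$

with the score computed from the two-stream Transformer representation against the word embedding, so each step of permutation language modeling is again a vocabulary-sized InfoNCE classification.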

4. INFOWORD

4.1 The mutual information between the global representation and the local representation
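My summary of this DIM-style term (notation consistent with the above): the global view is the masked sentence $\hat{x}_{i:j}$, i.e. the sentence with the n-gram spanning positions $i$ to $j$ masked out; the local view is the n-gram $x_{i:j}$ itself; and the negatives are n-grams drawn from other sentences (in practice, from the same minibatch). Both views are encoded by the shared Transformer and scored with a dot product, giving

$$ J_{\mathrm{DIM}} = \mathbb{E}\left[ f_\theta(\hat{x}_{i:j}, x_{i:j}) - \log \sum_{\tilde{b} \in \tilde{\mathcal{B}}} \exp f_\theta(\hat{x}_{i:j}, \tilde{b}) \right], $$

a lower bound (up to the $\log|\tilde{\mathcal{B}}|$ constant) on the mutual information between an n-gram and its sentence context.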

4.2 Objective function of InfoWord
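The full objective simply combines the standard masked-language-model loss with the DIM term above, weighted by two hyperparameters (I keep the paper's $\lambda$ notation):

$$ J_{\mathrm{INFOWORD}}(\theta) = \lambda_{\mathrm{MLM}}\, J_{\mathrm{MLM}}(\theta) + \lambda_{\mathrm{DIM}}\, J_{\mathrm{DIM}}(\theta) $$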

IV. Experiments

1. Results

2. Discussion

  • Span-based models

JDIM is related to span-based models such as SpanBERT and MASS, which also learn to predict masked spans rather than single tokens.

  • Mutual information maximization

InfoNCE is widely accepted as a good representation learning objective (a minimal code sketch of this loss is given after this list).

  • Regularization

Our analysis and the connection we draw to representation learning methods used in other domains provide an insight into possible ways to incorporate prior knowledge into language representation learning models. (Note: prior knowledge could be injected by adding regularization terms to the objective function, reflecting the usual correspondence between regularizers and priors.)
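To make the InfoNCE point above concrete, here is a minimal PyTorch sketch of the loss (my own illustration, not code from the paper), assuming the global and local views have already been encoded into fixed-size vectors and using the rest of the batch as negatives:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(global_repr, local_repr):
    """InfoNCE with in-batch negatives: classify each global view's true
    local view against the other local views in the batch.

    global_repr, local_repr: float tensors of shape [batch, dim].
    """
    # Pairwise scores f(a, b) as dot products; entry [i, j] scores
    # global view i against local view j.
    scores = global_repr @ local_repr.t()
    # For row i the positive pair sits in column i.
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)

# Example usage with random vectors standing in for encoder outputs.
a = torch.randn(8, 128)
b = torch.randn(8, 128)
print(info_nce_loss(a, b))
```

In-batch negatives are just one convenient way to populate the candidate set $\tilde{\mathcal{B}}$; JDIM does the analogous thing at the n-gram level.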

paper
