【EMNLP-2018】Zero-shot User Intent Detection via Capsule Neural Networks

一、Contributions

  • Applies capsule neural networks to text modeling, extracting and aggregating semantics from utterances in a hierarchical manner
  • Proposes a capsule-based model for zero-shot intent detection
  • Demonstrates and interprets the model's performance on two real-world datasets

二、Model

1. SemanticCaps

The concatenated hidden states of a bidirectional LSTM, with multi-head self-attention on top (the advantage of multiple heads is that each head can attend to a specific semantic feature of the utterance). Still, this does not really feel like multi-head attention: in genuine multi-head attention each time step's hidden state is projected into several heads, whereas this reads more like ordinary self-attention with several attention rows.
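A minimal numpy sketch of that self-attentive formulation (in the Lin et al., 2017 style the paper builds on); the names W1, W2 and the toy shapes are illustrative, not the paper's configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_caps(H, W1, W2):
    """Self-attentive semantic features (Lin et al., 2017 style).

    H  : (T, 2u)  BiLSTM hidden states, forward/backward concatenated
    W1 : (d_a, 2u), W2 : (r, d_a)  learnable projections
    Returns M : (r, 2u), one "semantic capsule" per attention row of A.
    """
    A = softmax(W2 @ np.tanh(W1 @ H.T), axis=-1)   # (r, T) attention over words
    M = A @ H                                      # (r, 2u) semantic capsules
    return M, A

# toy shapes: T=6 words, 2u=8, d_a=16, r=3 attention heads/rows
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))
M, A = semantic_caps(H, rng.normal(size=(16, 8)), rng.normal(size=(3, 16)))
print(M.shape, A.shape)   # (3, 8) (3, 6)
```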

2. DetectionCaps

A standard capsule network follows; the max-margin loss is augmented with a regularization term on the self-attention weight matrix A.
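The regularizer on A is presumably the usual penalization term from the self-attentive sentence-embedding formulation (my assumption; the coefficient and exact form may differ in the paper), added to the max-margin loss:

$$ P = \left\| A A^{\top} - I \right\|_F^2 $$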

3. Zero-shot DetectionCaps

3.1 Knowledge Transfer Strategies

  • Existing and emerging intents share similarities with each other
  • “The intent labels also contain knowledge of how two intents are similar with each other”

3.2 Build Vote Vectors

3.3 Zero-shot Dynamic Routing

Replace the prediction vectors (built from the vote vectors and the label similarities).
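Roughly, the idea is something like the following schematic (my own notation, not the paper's exact formulation): the prediction vector for an emerging intent is assembled from the existing intents' vote vectors, weighted by the similarity of the intent labels,

$$ \tilde{p}_{l} = \sum_{k} q_{kl}\, v_{k}^{\mathrm{vote}}, \qquad q_{kl} = \operatorname{sim}\!\left(e_{k}, e_{l}\right), $$

where $e_k$ and $e_l$ are embeddings of an existing and an emerging intent label, respectively.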

三、Experiment

四、Refs

paper/reading1/

Python Assignment / Shallow Copy / Deep Copy

一、Mutable and Immutable Objects

  • Mutable objects: the value in the memory that the object points to can be changed. When such a variable (more precisely, the reference) is "changed", the value it points to is modified in place; no copy is made and no new address is allocated. Examples: list, dict, set.
  • Immutable objects: the value in the memory that the object points to cannot be changed. When such a variable is "changed", the original value cannot be modified, so a copy with the change applied is created at a new address and the variable is rebound to that new address. Examples: numeric types (int and float), str, tuple. (See the example below.)
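A quick check of both behaviors with id():

```python
# Mutable: the list is modified in place, so its id() does not change
lst = [1, 2, 3]
print(id(lst))
lst.append(4)        # no copy, no new address
print(id(lst))       # same id as before

# Immutable: "changing" the value builds a new object and rebinds the name
n = 1000
print(id(n))
n += 1               # a new int object at a new address
print(id(n))         # different id
```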

二、Assignment, Shallow Copy, and Deep Copy

1. Assignment

1.1 Mutable Objects

With a = b = obj, both names point to the same memory address, and their sub-elements share addresses as well; modifying the object through one name is therefore also visible through the other.
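For example, with a mutable list:

```python
a = [1, 2, 3]
b = a                # plain assignment: b is just another name for the same list
print(a is b)        # True: same object, same id()
b.append(4)          # modify it through one name ...
print(a)             # [1, 2, 3, 4]: ... and the change is visible through the other
```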

1.2 Immutable Objects

1.2.1 Within the Cache Range

For constants inside the cache range (CPython caches small integers and interns short, identifier-like strings), assigning the same value to any number of variables makes them all point to the same memory address.

1.2.2 Outside the Cache Range

When a constant falls outside the cache range (for example a long string or a large integer), two variables assigned that same value can end up pointing to different memory addresses.
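In CPython specifically (this is an implementation detail, not a language guarantee), small integers roughly in the range -5 to 256 and short identifier-like strings are cached/interned, while larger values generally are not:

```python
a, b = 256, 256
print(a is b)        # True in CPython: 256 falls inside the small-integer cache

x = int("257")       # built at runtime to sidestep constant folding
y = int("257")
print(x is y)        # False: two distinct objects outside the cache range

s = "".join(["hello", " ", "world"])
t = "".join(["hello", " ", "world"])
print(s is t)        # False: strings built at runtime are not interned automatically
```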

2. Shallow Copy

  • Using the copy() function: from copy import copy
  • Using slicing
  • Using factory functions (e.g. list / dict / set)

2.1 Mutable Objects

With a = copy(b), a and b have different memory addresses, but their sub-elements share the same addresses. If a sub-element of b is immutable, changing it through either object does not affect the other (such a change is really just an assignment that rebinds the slot). If a sub-element is mutable, modifying it in place through either object is visible through the other as well.
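For example:

```python
from copy import copy

b = [1, "x", [10, 20]]      # the last element is a nested (mutable) list
a = copy(b)                 # equivalent to b[:] or list(b)

print(a is b)               # False: the outer lists are distinct objects
print(a[2] is b[2])         # True: sub-elements are shared, not copied

a[0] = 99                   # rebinding a slot of a (just an assignment) ...
print(b[0])                 # 1: ... does not touch b

a[2].append(30)             # mutating the shared inner list in place ...
print(b[2])                 # [10, 20, 30]: ... is visible through b as well
```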

2.2 Immutable Objects

Equivalent to plain assignment.

3. Deep Copy

  • from copy import deepcopy

3.1 Mutable Objects

With a = deepcopy(b), a and b have different memory addresses. Mutable sub-elements are copied recursively as well, so they end up at different addresses and in-place modifications do not affect the other object. Immutable sub-elements are not actually duplicated (deepcopy returns the same object for them), but since they cannot be modified in place, "changing" one only rebinds it, so the two objects still do not affect each other (this reduces to the assignment case for immutable objects).
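For example:

```python
from copy import deepcopy

b = [1, "x", [10, 20]]
a = deepcopy(b)

print(a is b)               # False: different outer objects
print(a[2] is b[2])         # False: the mutable inner list was copied recursively
print(a[0] is b[0])         # True: immutable sub-elements are not duplicated

a[2].append(30)             # modifying a's inner list ...
print(b[2])                 # [10, 20]: ... leaves b untouched
```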

3.2 Immutable Objects

Equivalent to plain assignment.

三、Refs

【NIPS-2017】Dynamic Routing Between Capsules

一、Background & Contributions

  • CNNs do not handle viewpoint changes well: an image seen from a different angle should still be the same image, but CNNs struggle with such changes of angle.
  • A capsule encapsulates, in vector form, all the important information about the state of the feature it detects: the vector's length represents the probability that the feature is present, and no matter how the input is rotated or transformed, that probability does not change; only the vector's orientation does.
  • A lower-level capsule sends its output to the higher-level capsules that "agree" with that output. This is the essence of the dynamic routing algorithm.

二、Model & Algorithm

The more similar the prediction vector $\hat{u}_{j|i}$ is to the output vector $v_j$, the larger their dot product, which increases $b_{ij}$ and hence $c_{ij}$ in the next iteration; capsule $i$ then routes to capsule $j$ with a larger weight, i.e. a higher probability.
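A minimal numpy sketch of routing-by-agreement (the shapes, iteration count, and function names are illustrative):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash non-linearity: keeps the direction, maps the length into [0, 1)."""
    norm2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def dynamic_routing(u_hat, iters=3):
    """u_hat: (num_in, num_out, dim) prediction vectors u_hat_{j|i}."""
    num_in, num_out, dim = u_hat.shape
    b = np.zeros((num_in, num_out))                           # routing logits b_ij
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling coeffs c_ij
        s = (c[..., None] * u_hat).sum(axis=0)                # weighted sums s_j
        v = squash(s)                                         # output capsules v_j
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # agreement update
    return v

v = dynamic_routing(np.random.default_rng(0).normal(size=(6, 3, 8)))
print(v.shape)   # (3, 8): 3 output capsules of dimension 8
```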

三、Loss Function: Margin Loss (SVM-style)

The length of the vector $v_k$ represents the class probability, and a margin loss is used.
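The margin loss from the paper, where $T_k = 1$ iff class $k$ is present (the paper uses $m^{+}=0.9$, $m^{-}=0.1$, $\lambda=0.5$):

$$ L_k = T_k \, \max\!\left(0,\, m^{+} - \lVert v_k \rVert\right)^2 + \lambda \left(1 - T_k\right) \max\!\left(0,\, \lVert v_k \rVert - m^{-}\right)^2 $$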

四、Refs

paper/reading1/

series1/series2/series3 (a well-written series by an overseas blogger; the links may require a VPN to access)

【ACL-2019】Joint Slot Filling and Intent Detection via Capsule Neural Networks

This one is a bit harder; it assumes prior knowledge of Dynamic Routing Between Capsules (Hinton et al., NIPS 2017).

一、Contributions

  • Captures the hierarchical relationships among words, slots, and intents with a hierarchical capsule neural network;
  • Proposes dynamic routing with a Re-routing step (an improvement over the original algorithm).

二、Models

1. WordCaps

Word embeddings are initialized randomly with a Xavier initializer; the BiLSTM hidden states (forward and backward concatenated) are then taken as the final word representations.
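A minimal PyTorch sketch of this step (toy sizes, not the paper's configuration):

```python
import torch
import torch.nn as nn

vocab_size, d_emb, d_hid, T = 1000, 64, 32, 6    # toy sizes
emb = nn.Embedding(vocab_size, d_emb)
nn.init.xavier_uniform_(emb.weight)              # random Xavier initialization
lstm = nn.LSTM(d_emb, d_hid, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, T))    # one toy utterance of T word ids
H, _ = lstm(emb(tokens))                         # (1, T, 2*d_hid): fwd/bwd concatenated
print(H.shape)                                   # torch.Size([1, 6, 64])
```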

2. SlotCaps

2.1 Slot Filling by Dynamic Routing-by-agreement

  • $p_{k|t}$: the prediction vector from the $t$-th word capsule to the $k$-th slot capsule
  • $b_{kt}$: the logit (initialized as zero) representing the log prior probability that the $t$-th word in WordCaps agrees to be routed to the $k$-th slot capsule in SlotCaps (Line 2)
  • $c_{kt}$: the agreement value, obtained from $b_{kt}$ by dynamic routing-by-agreement, that determines how likely the $t$-th word is to be routed to the $k$-th slot capsule
  • $s_k$: the weighted sum of the prediction vectors, squashed to get $v_k$
  • $v_k$: the slot representation

To summarize: once $v_k$ is updated in an iteration, a large dot product between $p_{k|t}$ and $v_k$ (both are vectors, so their dot product is a scalar that can be negative, zero, or positive, and acts as a similarity measure) makes $b_{kt}$ larger; $c_{kt}$ then grows in the next iteration, so the $t$-th word (low-level capsule) routes to the $k$-th slot (high-level capsule) with a larger weight, i.e. a higher probability, meaning the algorithm is more likely to route the $t$-th word to the $k$-th slot type. In short, this unsupervised iterative procedure ensures that after the iterations each word ends up with an appropriate slot.
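Written out, one routing iteration performs the standard routing-by-agreement updates in the note's notation:

$$ c_{kt} = \frac{\exp(b_{kt})}{\sum_{k'} \exp(b_{k't})}, \qquad s_k = \sum_{t} c_{kt}\, p_{k|t}, \qquad v_k = \operatorname{squash}(s_k), \qquad b_{kt} \leftarrow b_{kt} + p_{k|t} \cdot v_k $$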

2.2 Cross Entropy Loss for Slot Filling

3. IntentCaps

3.1 Intent Detection by Dynamic Routing-by-agreement

Same as slot filling, except that here the $v_k$ vectors serve as input, and the output is the intent representation $u_l$.

3.2 Max-margin Loss for Intent Detection

Same as for slot filling, but here we obtain $\bar{z}$.

4. Re-Routing

Algorithm 1 is modified. The rough motivation: the original dynamic routing only shows how low-level features build up high-level features (word → slot → intent), yet high-level features can in turn help learn low-level features. For example, the intent AddToPlaylist ("add music to the playlist") can reinforce the assignment of "Sungmin" to the artist slot.

三、Experiment

四、Related works

Intent Detection With recent developments in deep neural networks, user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016; Liu and Lane, 2016; Zhang et al., 2017; Chen et al., 2016; Xia et al., 2018) are proposed to classify user intents given their diversely expressed utterances in the natural language. As a text classification task, the decent performance on utterance-level intent detection usually relies on hidden representations that are learned in the intermediate layers via multiple non-linear transformations.

Recently, various capsule based text classification models are proposed that aggregate word-level features for utterance-level classification via dynamic routing-by-agreement (Gong et al., 2018; Zhao et al., 2018; Xia et al., 2018). Among them, Xia et al. (2018) adopts self-attention to extract intermediate semantic features and uses a capsule-based neural network for intent detection. However, existing works do not study word-level supervisions for the slot filling task. In this work, we explicitly model the hierarchical relationship between words and slots on the word-level, as well as intents on the utterance-level via dynamic routing-by-agreement.

Slot Filling Slot filling annotates the utterance with finer granularity: it associates certain parts of the utterance, usually named entities, with predefined slot tags. Currently, the slot filling is usually treated as a sequential labeling task. A recurrent neural network such as Gated Recurrent Unit (GRU) or Long Short-term Memory Network (LSTM) is used to learn context-aware word representations, and Conditional Random Fields (CRF) are used to annotate each word based on its slot type. Recently, Shen et al. (2017); Tan et al. (2017) introduce the self-attention mechanism for CRF-free sequential labeling.

Joint Modeling via Sequence Labeling To overcome the error propagation in the word-level slot filling task and the utterance-level intent detection task in a pipeline, joint models are proposed to solve two tasks simultaneously in a unified framework. Xu and Sarikaya (2013) propose a Convolution Neural Network (CNN) based sequential labeling model for slot filling. The hidden states corresponding to each word are summed up in a classification module to predict the utterance intent. A Conditional Random Field module ensures the best slot tag sequence of the utterance from all possible tag sequences. Hakkani-Tür et al. (2016) adopt a Recurrent Neural Network (RNN) for slot filling and the last hidden state of the RNN is used to predict the utterance intent. Liu and Lane (2016) further introduce an RNN based encoder-decoder model for joint slot filling and intent detection. An attention weighted sum of all encoded hidden states is used to predict the utterance intent. Some specific mechanisms are designed for RNNs to explicitly encode the slot from the utterance. For example, Goo et al. (2018) utilize a slot-gated mechanism as a special gate function in Long Short-term Memory Network (LSTM) to improve slot filling by the learned intent context vector. However, as the sequence becomes longer, it is risky to simply rely on the gate function to sequentially summarize and compress all slots and context information in a single vector (Cheng et al., 2016).

五、Refs

paper/reading1

【NAACL-HLT-2018】Slot-Gated Modeling for Joint Slot Filling and Intent Prediction

一、Contributions

However, the prior work did not “explicitly” model the relationships between the intent and slots; instead, it applied a joint loss function to “implicitly” consider both cues. Because the slots often highly depend on the intent, this work focuses on how to model the explicit relationships between slots and intent vectors by introducing a slot-gated mechanism.

Earlier joint models only captured the relationship between the intent and slot tasks implicitly (through a joint loss), whereas this paper models the explicit relationship between the two, since slots depend heavily on the intent.

二、Models

1. Attention-Based RNN Model

1.1 Slot Filling without gate

Essentially, self-attention is applied over each hidden state (forward/backward concatenation); however, on a simple dataset such as ATIS, adding attention to slot filling brings little improvement.

1.2 Intent Prediction

The attention here is different from the one above and simpler: instead of letting every pair of hidden states interact, it produces only a single attention-weighted sum.

2. Slot-Gated Mechanism

Essentially, the intent-detection attention context and the slot-filling attention context interact (if slot filling does not use attention, the hidden states are used instead, see Figure 2b) to produce a gate g; g then weights the slot-filling attention context, so slot filling is the part that ultimately benefits.
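Schematically, the slot gate (as I recall the formulation in Goo et al., 2018; treat the exact symbols as approximate), where $c_i^{S}$ is the slot attention context for word $i$, $c^{I}$ the intent attention context, and $\mathbf{v}$, $\mathbf{W}$ trainable parameters:

$$ g = \sum \mathbf{v} \cdot \tanh\!\left(c_i^{S} + \mathbf{W} \cdot c^{I}\right), \qquad y_i^{S} = \operatorname{softmax}\!\left(\mathbf{W}_{hy}^{S}\left(h_i + c_i^{S} \cdot g\right)\right) $$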

3. Joint Optimization

Maximum likelihood, optimized with gradient ascent.
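If I recall the paper correctly, the joint objective maximizes the conditional likelihood of the slot sequence and the intent given the utterance (schematic form):

$$ p\!\left(y^{S}, y^{I} \mid \mathbf{x}\right) = p\!\left(y^{I} \mid \mathbf{x}\right) \prod_{t=1}^{T} p\!\left(y_t^{S} \mid \mathbf{x}\right) $$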

三、Refs

paper/reading1