- Proposes a re-routing variant of Dynamic Routing (an improvement on the original version).
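Word embeddings are randomly initialized with the Xavier initializer; the BiLSTM hidden states (forward and backward states concatenated) are then taken as the final representation of each word.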
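The initialization and concatenation steps can be sketched in plain Python. This is only an illustration under stated assumptions: `xavier_uniform` and `concat_states` are hypothetical helper names, the vocabulary size and dimensions are made up, treating the table's rows as fan-in and its columns as fan-out is a convention chosen here, and the forward/backward hidden states are hard-coded stand-ins rather than the output of a real BiLSTM.

```python
import math
import random


def xavier_uniform(rows, cols, rng):
    """Sample a rows x cols matrix from U(-a, a), a = sqrt(6 / (fan_in + fan_out)).

    Assumption: rows are treated as fan_in and cols as fan_out.
    """
    bound = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def concat_states(forward, backward):
    """Concatenate the forward and backward hidden state of every word."""
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
# Hypothetical vocabulary of 5 words with 4-dimensional embeddings.
embedding = xavier_uniform(rows=5, cols=4, rng=rng)

# Stand-in hidden states for a 2-word utterance; a real model would obtain
# these by running a BiLSTM over the embedded words.
forward = [[0.1, 0.2], [0.3, 0.4]]
backward = [[0.5, 0.6], [0.7, 0.8]]
h = concat_states(forward, backward)  # each word now has a 4-dimensional state
```

Each entry of `h` is the per-word representation that the later capsule layers would consume in this sketch.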
2.1 Slot Filling by Dynamic Routing-by-agreement
bkt is the logit (initialized as zero) representing the log prior probability that the t-th word in WordCaps agrees to be routed to the k-th slot capsule in SlotCaps (Line 2)
The dynamic routing-by-agreement learns an agreement value ckt that determines how likely the t-th word agrees to be routed to the k-th slot capsule, obtained from bkt.
sk: the sum of the prediction vectors pk|t weighted by the agreement values ckt; squashing sk gives vk
vk: the slot representation
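Summary: once vk has been updated in an iteration, a large product of pk|t and vk makes bkt larger (pk|t and vk are both vectors, so their dot product is a scalar that may be negative, zero, or positive, and it acts as a similarity measure). In the next iteration ckt then grows, so the weight, and hence the probability, of routing the t-th word (low-level capsule) to the k-th slot (high-level capsule) increases; in other words, the algorithm becomes more likely to route the t-th word to the k-th slot type. In short, this unsupervised iterative procedure ensures that, after the iterations, each word is assigned to a suitable slot.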
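The update loop summarized above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, the nested-list layout `p[k][t]` for the prediction vectors, the default of three iterations, and the toy inputs are all assumptions made here. The softmax over slots and the squash non-linearity follow the usual capsule-network formulation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def squash(s):
    """Shrink s to a length in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dynamic_routing(p, num_iters=3):
    """p[k][t] is the prediction vector of word t for slot k.

    Returns (c, v): the agreement values used in the last iteration and
    the slot representations v_k.
    """
    K, T, dim = len(p), len(p[0]), len(p[0][0])
    b = [[0.0] * T for _ in range(K)]  # logits b_kt, initialized to zero
    for _ in range(num_iters):
        # c_kt: softmax of b_kt over the slots k, separately for each word t
        c = [[0.0] * T for _ in range(K)]
        for t in range(T):
            col = softmax([b[k][t] for k in range(K)])
            for k in range(K):
                c[k][t] = col[k]
        # s_k = sum_t c_kt * p_k|t, then v_k = squash(s_k)
        v = []
        for k in range(K):
            s_k = [sum(c[k][t] * p[k][t][d] for t in range(T)) for d in range(dim)]
            v.append(squash(s_k))
        # agreement update: b_kt grows when p_k|t and v_k point the same way
        for k in range(K):
            for t in range(T):
                b[k][t] += dot(p[k][t], v[k])
    return c, v


# Toy input: word 0 votes strongly for slot 0, word 1 for slot 1.
p = [
    [[2.0, 0.0], [0.1, 0.0]],  # predictions for slot 0 from words 0 and 1
    [[0.0, 0.1], [0.0, 2.0]],  # predictions for slot 1 from words 0 and 1
]
c, v = dynamic_routing(p)
```

With this toy input, the agreement values drift away from the uniform starting point toward the slot each word votes for most strongly, which is the behaviour the summary describes.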
2.2 Cross Entropy Loss for Slot Filling
3.1 Intent Detection by Dynamic Routing-by-agreement
3.2 Max-margin Loss for Intent Detection
Intent Detection. With recent developments in deep neural networks, user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016; Liu and Lane, 2016; Zhang et al., 2017; Chen et al., 2016; Xia et al., 2018) are proposed to classify user intents given their diversely expressed utterances in the natural language. As a text classification task, the decent performance on utterance-level intent detection usually relies on hidden representations that are learned in the intermediate layers via multiple non-linear transformations.
Recently, various capsule based text classification models are proposed that aggregate word-level features for utterance-level classification via dynamic routing-by-agreement (Gong et al., 2018; Zhao et al., 2018; Xia et al., 2018). Among them, Xia et al. (2018) adopt self-attention to extract intermediate semantic features and use a capsule-based neural network for intent detection. However, existing works do not study word-level supervisions for the slot filling task. In this work, we explicitly model the hierarchical relationship between words and slots on the word-level, as well as intents on the utterance-level via dynamic routing-by-agreement.
Slot Filling. Slot filling annotates the utterance with finer granularity: it associates certain parts of the utterance, usually named entities, with pre-defined slot tags. Currently, the slot filling is usually treated as a sequential labeling task. A recurrent neural network such as Gated Recurrent Unit (GRU) or Long Short-term Memory Network (LSTM) is used to learn context-aware word representations, and Conditional Random Fields (CRF) are used to annotate each word based on its slot type. Recently, Shen et al. (2017) and Tan et al. (2017) introduce the self-attention mechanism for CRF-free sequential labeling.
Joint Modeling via Sequence Labeling. To overcome the error propagation in the word-level slot filling task and the utterance-level intent detection task in a pipeline, joint models are proposed to solve two tasks simultaneously in a unified framework. Xu and Sarikaya (2013) propose a Convolution Neural Network (CNN) based sequential labeling model for slot filling. The hidden states corresponding to each word are summed up in a classification module to predict the utterance intent. A Conditional Random Field module ensures the best slot tag sequence of the utterance from all possible tag sequences. Hakkani-Tür et al. (2016) adopt a Recurrent Neural Network (RNN) for slot filling and the last hidden state of the RNN is used to predict the utterance intent. Liu and Lane (2016) further introduce an RNN based encoder-decoder model for joint slot filling and intent detection. An attention weighted sum of all encoded hidden states is used to predict the utterance intent. Some specific mechanisms are designed for RNNs to explicitly encode the slot from the utterance. For example, Goo et al. (2018) utilize a slot-gated mechanism as a special gate function in Long Short-term Memory Network (LSTM) to improve slot filling by the learned intent context vector. However, as the sequence becomes longer, it is risky to simply rely on the gate function to sequentially summarize and compress all slots and context information in a single vector (Cheng et al., 2016).