Merge branch 'caorunzhe' into 'mengxia'

Caorunzhe See merge request !898

Commit ef2c73d1 authored Jan 14, 2021 by 孟霞
--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -325,11 +325,11 @@
 \begin{itemize}
 \vspace{0.5em}
-\item 首先，将$\mathbi{Q}$、$\mathbi{K}$、$\mathbi{V}$分别通过线性（Linear）变换的方式映射为$h$个子集。即$\mathbi{Q}_i = \mathbi{Q}\mathbi{W}_i^{\,Q} $、$\mathbi{K}_i = \mathbi{K}\mathbi{W}_i^{\,K} $、$\mathbi{V}_i = \mathbi{V}\mathbi{W}_i^{\,V} $，其中$i$表示第$i$个头， $\mathbi{W}_i^{\,Q}  \in \mathbb{R}^{d_{model} \times d_k}$,  $\mathbi{W}_i^{\,K}  \in \mathbb{R}^{d_{model} \times d_k}$,  $\mathbi{W}_i^{\,V}  \in \mathbb{R}^{d_{model} \times d_v}$是参数矩阵; $d_k=d_v=d_{model} / h$，对于不同的头采用不同的变换矩阵，这里$d_{model}$表示每个隐层向量的维度；
+\item 首先，将$\mathbi{Q}$、$\mathbi{K}$、$\mathbi{V}$分别通过线性（Linear）变换的方式映射为$h$个子集。即$\mathbi{Q}_i = \mathbi{Q}\mathbi{W}_i^{\,Q} $、$\mathbi{K}_i = \mathbi{K}\mathbi{W}_i^{\,K} $、$\mathbi{V}_i = \mathbi{V}\mathbi{W}_i^{\,V} $，其中$i$表示第$i$个头， $\mathbi{W}_i^{\,Q}  \in \mathbb{R}^{d_{\textrm{model}} \times d_k}$,  $\mathbi{W}_i^{\,K}  \in \mathbb{R}^{d_{\textrm{model}} \times d_k}$,  $\mathbi{W}_i^{\,V}  \in \mathbb{R}^{d_{\textrm{model}} \times d_v}$是参数矩阵; $d_k=d_v=d_{\textrm{model}} / h$，对于不同的头采用不同的变换矩阵，这里$d_{\textrm{model}}$表示每个隐层向量的维度；
 \vspace{0.5em}
 \item 其次，对每个头分别执行点乘注意力操作，并得到每个头的注意力操作的输出$\mathbi{head}_i$；
 \vspace{0.5em}
-\item 最后，将$h$个头的注意力输出在最后一维$d_v$进行拼接（Concat）重新得到维度为$hd_v$的输出，并通过对其右乘一个权重矩阵$\mathbi{W}^{\,o}$进行线性变换，从而对多头计算得到的信息进行融合，且将多头注意力输出的维度映射为模型的隐层大小（即$d_{model}$），这里参数矩阵$\mathbi{W}^{\,o} \in \mathbb{R}^{h d_v \times d_{model}}$。
+\item 最后，将$h$个头的注意力输出在最后一维$d_v$进行拼接（Concat）重新得到维度为$hd_v$的输出，并通过对其右乘一个权重矩阵$\mathbi{W}^{\,o}$进行线性变换，从而对多头计算得到的信息进行融合，且将多头注意力输出的维度映射为模型的隐层大小（即$d_{\textrm{model}}$），这里参数矩阵$\mathbi{W}^{\,o} \in \mathbb{R}^{h d_v \times d_{\textrm{model}}}$。
 \vspace{0.5em}
 \end{itemize}
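Review note on the hunk above: the three bullet points (per-head projection, per-head dot-product attention, concatenation followed by $\mathbi{W}^{\,o}$) can be sketched as follows. This is a minimal pure-Python sketch, not the book's implementation. The helper names (`matmul`, `softmax`, `rand_matrix`, `multi_head_attention`) are hypothetical, the random matrices only stand in for learned parameters, and the scaling by $\sqrt{d_k}$ is assumed from the standard scaled dot-product formulation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Small random matrix; a stand-in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(Q, K, V, h, rng):
    d_model = len(Q[0])
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_v = d_model // h              # d_k = d_v = d_model / h
    heads = []
    for _ in range(h):
        # Step 1: a separate linear projection for every head.
        W_q = rand_matrix(d_model, d_k, rng)
        W_k = rand_matrix(d_model, d_k, rng)
        W_v = rand_matrix(d_model, d_v, rng)
        Q_i, K_i, V_i = matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v)
        # Step 2: scaled dot-product attention inside the head.
        K_t = [list(col) for col in zip(*K_i)]
        scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q_i, K_t)]
        weights = [softmax(row) for row in scores]
        heads.append(matmul(weights, V_i))
    # Step 3: concatenate along the last dimension (h * d_v), then apply W^o.
    concat = [sum((head[t] for head in heads), []) for t in range(len(Q))]
    W_o = rand_matrix(h * d_v, d_model, rng)
    return matmul(concat, W_o)


rng = random.Random(0)
X = rand_matrix(3, 8, rng)                # 3 positions, d_model = 8
out = multi_head_attention(X, X, X, h=2, rng=rng)
print(len(out), len(out[0]))              # 3 8
```

The output keeps the shape of the input (sequence length by $d_{\textrm{model}}$), which is what the final projection by $\mathbi{W}^{\,o} \in \mathbb{R}^{h d_v \times d_{\textrm{model}}}$ is meant to guarantee.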

--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
@@ -139,7 +139,7 @@ A_{ij}^{\rm rel} &=& \underbrace{\mathbi{E}_{x_i}\mathbi{W}_Q\mathbi{W}_{K}^{T}\
 \label{eq:15-14}
 \end{eqnarray}
-\noindent 其中，$A_{ij}^{\rm rel}$为使用相对位置编码后位置$i$与$j$关系的表示结果。公式中各项的含义为：(a)表示基于内容的表示（{\color{red} 啥意思没看懂，啥是基于内容？谁的内容？}），(b)表示基于内容的位置偏置，(c)表示全局内容的偏置，(d) 表示全局位置的偏置。公式\eqref{eq:15-13}中的(a)、(b)两项与前面介绍的相对位置编码一致\upcite{Shaw2018SelfAttentionWR}，并针对相对位置编码引入了额外的线性变换矩阵。同时，这种方法兼顾了全局内容偏置和全局位置偏置，可以更好地利用正余弦函数的归纳偏置特性。
+\noindent 其中，$A_{ij}^{\rm rel}$为使用相对位置编码后位置$i$与$j$关系的表示结果。公式中各项的含义为：(a)表示位置$i$与位置$j$之间词嵌入的相关性，可以看作是基于内容的表示，(b)表示基于内容的位置偏置，(c)表示全局内容的偏置，(d) 表示全局位置的偏置。公式\eqref{eq:15-13}中的(a)、(b)两项与前面介绍的相对位置编码一致\upcite{Shaw2018SelfAttentionWR}，并针对相对位置编码引入了额外的线性变换矩阵。同时，这种方法兼顾了全局内容偏置和全局位置偏置，可以更好地利用正余弦函数的归纳偏置特性。
 \vspace{0.5em}
 \item {\small\bfnew{结构化位置编码}}\index{基于结构化位置编码}（Structural Position Representations）\index{Structural Position Representations}\upcite{DBLP:conf/emnlp/WangTWS19a}。 例如，可以通过对输入句子进行依存句法分析得到句法树，根据叶子结点在句法树中的深度来表示其绝对位置，并在此基础上利用相对位置编码的思想计算节点之间的相对位置信息。
@@ -155,11 +155,11 @@ A_{ij}^{\rm rel} &=& \underbrace{\mathbi{E}_{x_i}\mathbi{W}_Q\mathbi{W}_{K}^{T}\
 \subsubsection{2. 注意力分布约束}
-\parinterval 局部注意力机制一直是机器翻译中受关注的研究方向\upcite{DBLP:journals/corr/LuongPM15}。通过对注意力权重的可视化，可以观测到不同位置的词受关注的程度相对平滑。这样的建模方式利于全局建模，但一定程度上分散了注意力，导致模型忽略了邻近单词之间的关系。为了提高模型对局部信息的感知，有以下几种方法：{\red 图2没有引用}
+\parinterval 局部注意力机制一直是机器翻译中受关注的研究方向\upcite{DBLP:journals/corr/LuongPM15}。通过对注意力权重的可视化，可以观测到不同位置的词受关注的程度相对平滑。这样的建模方式利于全局建模，但一定程度上分散了注意力，导致模型忽略了邻近单词之间的关系。为了提高模型对局部信息的感知，有以下几种方法：
 \begin{itemize}
 \vspace{0.5em}
-\item {\small\bfnew{引入高斯约束}}\upcite{Yang2018ModelingLF}。这类方法的核心思想是引入可学习的高斯分布$\mathbi{G}$作为局部约束，与注意力权重进行融合，具体的形式如下：
+\item {\small\bfnew{引入高斯约束}}\upcite{Yang2018ModelingLF}。如图\ref{fig:15-2}所示，这类方法的核心思想是引入可学习的高斯分布$\mathbi{G}$作为局部约束，与注意力权重进行融合，具体的形式如下：
 \begin{eqnarray}
 \mathbi{e}_{ij} &=& \frac{(\mathbi{x}_i \mathbi{W}_Q){(\mathbi{x}_j \mathbi{W}_K)}^{T}}{\sqrt{d_k}} + \mathbi{G}
 \label{eq:15-15}
@@ -184,7 +184,7 @@ v_i &=& \mathbi{I}_d^T\textrm{Tanh}(\mathbi{W}_d\mathbi{Q}_i)
 \label{eq:15-19}
 \end{eqnarray}
-\noindent 其中，$\mathbi{W}_p$、$\mathbi{W}_d$、$\mathbi{I}_p$、$\mathbi{I}_d$均为模型中可学习的参数矩阵。{\red 不同颜色的字有什么用}
+\noindent 其中，$\mathbi{W}_p$、$\mathbi{W}_d$、$\mathbi{I}_p$、$\mathbi{I}_d$均为模型中可学习的参数矩阵。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -337,13 +337,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \begin{itemize}
 \vspace{0.5em}
-\item Reformer模型在计算Key和Value时使用相同的线性映射，共享Key和Value的值\upcite{Kitaev2020ReformerTE}，降低了自注意力机制的复杂度。进一步，Reformer引入了一种{\small\bfnew{局部哈希敏感注意力机制}}\index{局部哈希敏感注意力机制}（LSH Attention）\index{LSH Attention}，其提高效率的方式和固定模式中的局部建模一致，减少注意力机制的计算范围。对于每一个Query，通过局部哈希敏感机制找出和其较为相关的Key，并进行注意力的计算。其基本思路就是距离相近的向量以较大的概率被哈希分配到一个桶内，距离较远的向量被分配到一个桶内的概率则较低。哈希的散列函数为：{\red （下面公式不对）{\color{blue} 看代码确认一下！}}
+\item Reformer模型在计算Query和Key时使用相同的线性映射，共享Query和Key的值\upcite{Kitaev2020ReformerTE}，降低了自注意力机制的复杂度。进一步，Reformer引入了一种{\small\bfnew{局部哈希敏感注意力机制}}\index{局部哈希敏感注意力机制}（LSH Attention）\index{LSH Attention}，其提高效率的方式和固定模式中的局部建模一致，减少注意力机制的计算范围。对于每一个Query，通过局部哈希敏感机制找出和其较为相关的Key，并进行注意力的计算。其基本思路就是距离相近的向量以较大的概率被哈希分配到一个桶内，距离较远的向量被分配到一个桶内的概率则较低。此外，Reformer中还采用了一种{\small\bfnew{可逆残差网络结构}}\index{可逆残差网络结构}（The Reversible Residual Network）\index{The Reversible Residual Network}和分块计算前馈神经网络层的机制，即将前馈层的隐层维度拆分为多个块后独立地进行计算，最后进行拼接操作，得到前馈层的输出。这种方式大幅度减少了内存（显存）占用，但由于在反向过程中需要重复计算某些节点，牺牲了一定的计算时间。
-\begin{eqnarray}
-\mathbi{h}(\mathbi{x}) &=& \arg\max([\mathbi{x}\mathbi{R};-\mathbi{x}\mathbi{R}])
-\label{eq:15-23}
-\end{eqnarray}
-\noindent 其中，$\mathbi{R}$为随机矩阵（{\color{red} 这块儿有些没看懂，为啥要用随机矩阵？上面的公式物理意义是啥？}），$[;]$代表拼接操作。当$\mathbi{h}(\textrm{Query}_i) = \mathbi{h}(\textrm{Key}_j )$，$i$和$j$ 为序列中不同位置单词的下标，也就是说当两个词的Query 和Key落在同一个散列桶时，对其进行注意力的计算。此外，Reformer中还采用了一种{\small\bfnew{可逆残差网络结构}}\index{可逆残差网络结构}（The Reversible Residual Network）\index{The Reversible Residual Network}和分块计算前馈神经网络层的机制，即将前馈层的隐层维度拆分多个块后独立的进行计算，最后进行拼接操作，得到前馈层的输出。这种方式大幅度减少了内存（显存）占用，但由于在反向过程中需要重复计算某些节点，牺牲了一定的计算时间。
 \vspace{0.5em}
 \item Routing Transformer通过聚类算法对序列中的不同单元进行分组，分别在组内进行自注意力机制的计算\upcite{DBLP:journals/corr/abs-2003-05997}。首先是将Query和Key映射到聚类矩阵$\mathbi{S}$：
@@ -415,7 +409,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \end{eqnarray}
 \end{itemize}
-\parinterval 从上述公式中可以发现，在前向传播过程中，Pre-Norm结构可以通过残差路径将底层神经网络的输出直接暴露给上层神经网络。此外，在反向传播过程中，使用Pre-Norm结构也可以使得顶层网络的梯度更容易地反馈到底层网络。这里以一个含有$L$个子层的结构为例，令$Loss$表示整个神经网络输出上的损失，$\mathbi{x}_L$为顶层的输出。对于Post-Norm结构，根据链式法则，损失$Loss$相对于$\mathbi{x}_l$ 的梯度可以表示为：{\red （L层，顶层输出是L+1吧？）}
+\parinterval 从上述公式中可以发现，在前向传播过程中，Pre-Norm结构可以通过残差路径将底层神经网络的输出直接暴露给上层神经网络。此外，在反向传播过程中，使用Pre-Norm结构也可以使得顶层网络的梯度更容易地反馈到底层网络。这里以一个含有$L$个子层的结构为例，令$Loss$表示整个神经网络输出上的损失，$\mathbi{x}_L$为顶层的输出。对于Post-Norm结构，根据链式法则，损失$Loss$相对于$\mathbi{x}_l$ 的梯度可以表示为：
 \begin{eqnarray}
 \frac{\partial Loss}{\partial \mathbi{x}_l} &=& \frac{\partial Loss}{\partial \mathbi{x}_L} \times \prod_{k=l}^{L-1}\frac{\partial \textrm{LN}(\mathbi{y}_k)}{\partial \mathbi{y}_k} \times \prod_{k=l}^{L-1}\big(1+\frac{\partial F(\mathbi{x}_k;{\bm \theta_k})}{\partial \mathbi{x}_k} \big)
 \label{eq:15-28}
@@ -429,7 +423,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \label{eq:15-29}
 \end{eqnarray}
-\parinterval 对比公式\eqref{eq:15-28}和公式\eqref{eq:15-29}可以看出，Pre-Norm结构直接把顶层的梯度$\frac{\partial Loss}{\partial \mathbi{x}_L}$传递给下层，并且如果将公式\eqref{eq:15-29}右侧进行展开，可以发现$\frac{\partial Loss}{\partial \mathbi{x}_l}$中直接含有$\frac{\partial Loss}{\partial \mathbi{x}_L}$部分。这个性质弱化了梯度计算对模型深度$L$的依赖；而如公式\eqref{eq:15-28}右侧所示，Post-Norm结构会导致一个与$L$相关的多项导数的积，伴随着$L$的增大更容易发生梯度消失和梯度爆炸问题。因此，Pre-Norm结构更适于堆叠多层神经网络的情况。比如，使用Pre-Norm 结构可以很轻松地训练一个30层（60个子层）编码器的Transformer网络，并带来可观的BLEU提升。这个结果相当于标准Transformer编码器深度的6倍，相对的，用Pre-Norm{\red （post-norm？）}结构训练深层网络的时候，训练结果很不稳定，当编码器深度超过12层后很难完成有效训练\upcite{WangLearning}，尤其是在低精度设备环境下损失函数出现发散情况。这里把使用Pre-Norm的深层Transformer模型称为Transformer-Deep。
+\parinterval 对比公式\eqref{eq:15-28}和公式\eqref{eq:15-29}可以看出，Pre-Norm结构直接把顶层的梯度$\frac{\partial Loss}{\partial \mathbi{x}_L}$传递给下层，并且如果将公式\eqref{eq:15-29}右侧进行展开，可以发现$\frac{\partial Loss}{\partial \mathbi{x}_l}$中直接含有$\frac{\partial Loss}{\partial \mathbi{x}_L}$部分。这个性质弱化了梯度计算对模型深度$L$的依赖；而如公式\eqref{eq:15-28}右侧所示，Post-Norm结构会导致一个与$L$相关的多项导数的积，伴随着$L$的增大更容易发生梯度消失和梯度爆炸问题。因此，Pre-Norm结构更适于堆叠多层神经网络的情况。比如，使用Pre-Norm 结构可以很轻松地训练一个30层（60个子层）编码器的Transformer网络，并带来可观的BLEU提升。这个结果相当于标准Transformer编码器深度的6倍，相对的，用Post-Norm结构训练深层网络的时候，训练结果很不稳定，当编码器深度超过12层后很难完成有效训练\upcite{WangLearning}，尤其是在低精度设备环境下损失函数出现发散情况。这里把使用Pre-Norm的深层Transformer模型称为Transformer-Deep。
 \parinterval 另一个有趣的发现是，使用深层网络后，网络可以更有效地利用较大的学习率和较大的批量训练，大幅度缩短了模型达到收敛状态的时间。相比于Transformer-Big等宽网络，Transformer-Deep并不需要太大的隐藏层维度就可以取得更优的翻译品质\upcite{WangLearning}。也就是说，Transformer-Deep是一个更“窄”更“深”的神经网络。这种结构的参数量比Transformer-Big少，系统运行效率更高。
@@ -443,7 +437,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \parinterval 尽管使用Pre-Norm结构可以很容易地训练深层Transformer模型，但从信息传递的角度看，Transformer模型中第$l$层的输入仅仅依赖于前一层的输出。虽然残差连接可以跨层传递信息，但是对于很深的模型，整个模型的输入和输出之间仍需要经过很多次残差连接。
-\parinterval 为了使上层的神经网络可以更加方便地访问下层神经网络的信息，最简单的方法是引入更多的跨层连接。一种方法是直接将所有层的输出都连接到最上层，达到聚合多层信息的目的\upcite{Bapna2018TrainingDN,Wang2018MultilayerRF,Dou2018ExploitingDR}。另一种更加有效的方式是在网络前向计算的过程中建立当前层表示与之前层表示之间的关系，例如{\small\bfnew{动态线性聚合网络}}\upcite{WangLearning}\index{动态线性聚合网络}（Dynamic Linear Combination of Layers，DLCL）\index{Dynamic Linear Combination of Layers}和动态层聚合方法\upcite{Dou2019DynamicLA}。这些方法的共性在于，在每一层的输入中不仅考虑前一层的输出，同时将前面所有层的中间结果（包括词嵌入表示）进行聚合，本质上利用稠密的层间连接提高了网络中信息传递的效率（前向计算和反向梯度计算）。而前者{\red （连接到最上层的方法？）}利用线性的层融合手段来保证计算的时效性，主要应用于深层神经网络的训练，理论上等价于常微分方程中的高阶求解方法\upcite{WangLearning}。此外，为了进一步增强上层神经网络对底层表示的利用，研究人员从多尺度的角度对深层的编码器进行分块，并使用GRU来捕获不同块之间的联系，得到更高层次的表示。该方法可以看作是对动态线性聚合网络的延伸。接下来分别对上述几种改进方法展开讨论。
+\parinterval 为了使上层的神经网络可以更加方便地访问下层神经网络的信息，最简单的方法是引入更多的跨层连接。一种方法是直接将所有层的输出都连接到最上层，达到聚合多层信息的目的\upcite{Bapna2018TrainingDN,Wang2018MultilayerRF,Dou2018ExploitingDR}。另一种更加有效的方式是在网络前向计算的过程中建立当前层表示与之前层表示之间的关系，例如{\small\bfnew{动态线性聚合网络}}\upcite{WangLearning}\index{动态线性聚合网络}（Dynamic Linear Combination of Layers，DLCL）\index{Dynamic Linear Combination of Layers}和动态层聚合方法\upcite{Dou2019DynamicLA}。这些方法的共性在于，在每一层的输入中不仅考虑前一层的输出，同时将前面所有层的中间结果（包括词嵌入表示）进行聚合，本质上利用稠密的层间连接提高了网络中信息传递的效率（前向计算和反向梯度计算）。而DLCL利用线性的层融合手段来保证计算的时效性，主要应用于深层神经网络的训练，理论上等价于常微分方程中的高阶求解方法\upcite{WangLearning}。此外，为了进一步增强上层神经网络对底层表示的利用，研究人员从多尺度的角度对深层的编码器进行分块，并使用GRU来捕获不同块之间的联系，得到更高层次的表示。该方法可以看作是对动态线性聚合网络的延伸。接下来分别对上述几种改进方法展开讨论。
 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -451,8 +445,6 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \subsubsection{1. 使用更多的跨层连接}
-{\color{blue} 肖：这章还有一个问题，前面$\mathbi{h}_i$是第$i$层的表示，后面$\mathbi{h}_i$是第$i$个位置的表示！其它章也有类似问题。不过我不建议做大的调整。}
 \parinterval 图\ref{fig:15-10}描述了引入更多跨层连接的结构。在模型的前向计算过程中，假设编码端总层数为$L$，当完成编码端$L$层的逐层计算后，通过线性平均、加权平均等机制对模型的中间层表示进行融合，得到蕴含所有层信息的表示$\mathbi{g}$，作为编码-解码注意力机制的输入，与总共有$M$层的解码器共同处理解码信息。
 %----------------------------------------------
@@ -484,11 +476,11 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \vspace{0.5em}
 \item 前馈神经网络。将之前中间层的表示进行级联，之后利用前馈神经网络得到融合的表示，如下：
 \begin{eqnarray}
-\mathbi{g} &=& \textrm{FNN}([\mathbi{h}^1,\ldots,\mathbi{h}^L])
+\mathbi{g} &=& \textrm{FNN}([\mathbi{h}^1,\cdots,\mathbi{h}^L])
 \label{eq:15-32}
 \end{eqnarray}
-\noindent 其中，$[\cdot]$表示级联操作{\red （上式符号是不是要换一下）}。这种方式比权重平均具有更强的拟合能力。
+\noindent 其中，$[\cdot]$表示级联操作。这种方式比权重平均具有更强的拟合能力。
 \vspace{0.5em}
 \item 基于多跳的自注意力机制。如图\ref{fig:15-11}所示，其做法与前馈神经网络类似，首先将不同层的表示拼接成2维的句子级矩阵表示\upcite{DBLP:journals/corr/LinFSYXZB17}。之后利用类似于前馈神经网络的思想将维度为$\mathbb{R}^{d_{\textrm{model}} \times L}$的矩阵映射到维度为$\mathbb{R}^{d_{\textrm{model}} \times n_{\rm hop}}$的矩阵，如下：
 \begin{eqnarray}
@@ -604,7 +596,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \label{eq:15-40}
 \end{eqnarray}
-\noindent 其中，$u(-\gamma,\gamma)$表示$-\gamma$与$\gamma$间的均匀分布，$n_i$和$n_o$分别为线性变换$\mathbi{W}$中输入和输出的维度，也就是上一层神经元的数量和下一层神经元的数量。通过这种方式可以维持在前向与反向计算过程中输入与输出方差的一致性\upcite{DBLP:conf/iccv/HeZRS15}。{\red（这块的文献对吗？我感觉这块确实有个问题，前面说了结论，然后后面说怎么推的）{\color{blue} 肖：没看出来怎么保证方差一致性}}
+\noindent 其中，$u(-\gamma,\gamma)$表示$-\gamma$与$\gamma$间的均匀分布，$n_i$和$n_o$分别为线性变换$\mathbi{W}$中输入和输出的维度，也就是上一层神经元的数量和下一层神经元的数量。通过这种方式可以维持在前向与反向计算过程中输入与输出方差的一致性\upcite{DBLP:conf/iccv/HeZRS15}。
 \parinterval 令模型中某层神经元的输出表示为$\mathbi{Z}=\sum_{j=1}^{n_i}{w_j x_j}$。可以看出，$\mathbi{Z}$的核心是计算两个变量$w_j$和$x_j$乘积。两个变量乘积的方差的展开式为：
 \begin{eqnarray}
@@ -694,7 +686,7 @@ C(\mathbi{x}_j \mathbi{W}_K,\omega) &=& (\mathbi{x}_{j-\omega},\ldots,\mathbi{x}
 \subsubsection{4. ADMIN初始化策略}
-\parinterval 也有研究发现Post-Norm结构在训练过程中过度依赖残差支路，在训练初期很容易发生参数梯度方差过大的现象\upcite{DBLP:conf/emnlp/LiuLGCH20}。经过分析发现，虽然底层神经网络发生梯度消失是导致训练不稳定的重要因素，但并不是唯一因素。例如，标准Transformer模型中梯度消失的原因在于使用Post-Norm 结构的解码器。尽管通过调整模型结构解决梯度消失问题，模型训练不稳定的问题仍然没有很好地解决。进一步对Pre-Norm 结构进行分析发现，Pre-Norm输入与输出之间方差的变换率为$O(\log L)$（{\color{red} 说这句话是啥意思？}）。为了解决Post-Norm 结构在训练初期过于依赖残差支路的问题。可以使用两阶段的初始化方法来间接控制其输入与输出之间的方差，并将方差约束在$O(\log L)$内（{\color{red} 为什么方差要约束在一个复杂度之内？}）。这里，可以重新定义子层之间的残差连接如下：
+\parinterval 也有研究发现Post-Norm结构在训练过程中过度依赖残差支路，在训练初期很容易发生参数梯度方差过大的现象\upcite{DBLP:conf/emnlp/LiuLGCH20}。经过分析发现，虽然底层神经网络发生梯度消失是导致训练不稳定的重要因素，但并不是唯一因素。例如，标准Transformer模型中梯度消失的原因在于使用Post-Norm 结构的解码器。尽管通过调整模型结构解决梯度消失问题，模型训练不稳定的问题仍然没有很好地解决。研究人员观测到Post-Norm 结构在训练过程中过于依赖残差支路，而Pre-Norm结构在训练过程中逐渐呈现出对残差支路的依赖性，这更易于网络的训练。进一步从参数更新的角度出发，Pre-Norm由于参数的改变导致网络输出变化的方差经推导后可以表示为$O(\log L)$，而Post-Norm对应的方差为$O(L)$。因此，可以尝试减小Post-Norm中由于参数更新导致的输出的方差值，从而达到稳定训练的目的。针对该问题，可以采用两阶段的初始化方法。这里，可以重新定义子层之间的残差连接如下：
 \begin{eqnarray}
 \mathbi{x}_{l+1} &=& \mathbi{x}_l \cdot {\bm  \omega_{l+1}} + F_{l+1}(\mathbi{x}_l)
 \label{eq:15-44}
@@ -1264,7 +1256,7 @@ f(x) &=& x \cdot \delta(\beta x) \\
 \begin{itemize}
 \vspace{0.5em}
-\item 多头注意力机制是近些年神经机器翻译中常用的结构。多头机制可以让模型从更多维度提取特征，也反应了一种多分支建模的思想。研究人员针对Transformer编码器的多头机制进行了分析，发现部分头在神经网络的学习过程中扮演至关重要的角色，并且蕴含语言学解释\upcite{DBLP:journals/corr/abs-1905-09418}。 而另一部分头本身则不具备很好的解释，对模型的帮助也不大，因此可以被剪枝掉。而且并不是头数越多，模型的性能就越强。{\red 一个有趣的发现是，如果在训练过程中使用多头机制，而在推断过程中去除大部分头，模型性能没有明显变化，而且能够提高在CPU上的执行效率（逻辑不太容易理解，有点绕）}\upcite{Michel2019AreSH}。
+\item 多头注意力机制是近些年神经机器翻译中常用的结构。多头机制可以让模型从更多维度提取特征，也反映了一种多分支建模的思想。研究人员针对Transformer编码器的多头机制进行了分析，发现部分头在神经网络的学习过程中扮演至关重要的角色，并且蕴含语言学解释\upcite{DBLP:journals/corr/abs-1905-09418}。 而另一部分头本身则不具备很好的解释，对模型的帮助也不大，因此可以被剪枝掉。而且并不是头数越多，模型的性能就越强。此外也有研究人员发现，如果在训练过程中使用多头机制，并在推断过程中去除大部分头，模型性能不仅没有明显变化，还能够提高在CPU上的执行效率\upcite{Michel2019AreSH}。
 \vspace{0.5em}
 \item 此外，也可以利用正则化手段，在训练过程中增大不同头之间的差异\upcite{DBLP:conf/emnlp/LiTYLZ18}。也可以引入多尺度的思想,对输入的特征进行分级表示，并引入短语的信息\upcite{DBLP:conf/emnlp/HaoWSZT19}。还可以通过对注意力权重进行调整，实现对序列中的实词与虚词进行区分\upcite{DBLP:conf/emnlp/Lin0RLS18}。 除了上述基于编码端-解码端的建模范式，还可以定义隐变量模型来捕获句子中潜在的语义信息\upcite{Su2018VariationalRN,DBLP:conf/acl/SetiawanSNP20}，或直接对源语言和目标语言序列进行联合表示\upcite{Li2020NeuralMT}。
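Review note on the initialization hunk of Chapter15/chapter15.tex: the text says that sampling $\mathbi{W}$ from $u(-\gamma,\gamma)$ keeps input and output variances consistent, but the definition of $\gamma$ (eq. 15-40) is outside the visible context. The sketch below assumes the usual Xavier/Glorot choice $\gamma=\sqrt{6/(n_i+n_o)}$; if the chapter defines $\gamma$ differently, the numbers change accordingly. It only checks that the variance of $u(-\gamma,\gamma)$, which is $\gamma^2/3$, then equals $2/(n_i+n_o)$. All function names are hypothetical.

```python
import math
import random


def xavier_gamma(n_i, n_o):
    """Assumed bound of the uniform distribution: sqrt(6 / (n_i + n_o))."""
    return math.sqrt(6.0 / (n_i + n_o))


def uniform_variance(gamma):
    """Variance of U(-gamma, gamma): (2 * gamma)^2 / 12 = gamma^2 / 3."""
    return gamma ** 2 / 3.0


def sample_variance(n_i, n_o, rng):
    """Empirical variance of an n_i x n_o weight matrix drawn from U(-gamma, gamma)."""
    gamma = xavier_gamma(n_i, n_o)
    w = [rng.uniform(-gamma, gamma) for _ in range(n_i * n_o)]
    mean = sum(w) / len(w)
    return sum((x - mean) ** 2 for x in w) / len(w)


n_i, n_o = 64, 256
analytic = uniform_variance(xavier_gamma(n_i, n_o))
empirical = sample_variance(n_i, n_o, random.Random(0))
print(analytic, 2.0 / (n_i + n_o))   # both are 2 / (n_i + n_o) up to rounding
print(empirical)                     # close to the analytic value
```

With $\textrm{Var}(w)=2/(n_i+n_o)$, the factor $n_i\,\textrm{Var}(w)$ in the forward pass and $n_o\,\textrm{Var}(w)$ in the backward pass are both close to 1 when $n_i \approx n_o$, which is the sense in which the variance is preserved.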

--- a/Chapter17/Figures/figure-cascading-speech-translation.tex
+++ b/Chapter17/Figures/figure-cascading-speech-translation.tex
@@ -10,7 +10,7 @@
 \node(process_2)[process,fill=blue!20,right of = process_1,xshift=7.0cm,text width=4cm,align=center]{\baselineskip=4pt\LARGE{[[0.2,...,0.3], \qquad ..., \qquad  0.3,...,0.5]]}\par};
 \node(text_2)[below of = process_2,yshift=-2cm,scale=1.5]{语音特征};
 \node(process_3)[process,fill=orange!20,minimum width=6cm,minimum height=5cm,right of = process_2,xshift=8.2cm,text width=4cm,align=center]{};
-\node(text_3)[below of = process_3,yshift=-3cm,scale=1.5]{源语文本及其词格};
+\node(text_3)[below of = process_3,yshift=-3cm,scale=1.5]{源语言文本及其词格};
 \node(cir_s)[cir,very thick, below of = process_3,xshift=-2.2cm,yshift=1.1cm]{\LARGE S};
 \node(cir_a)[cir,right of = cir_s,xshift=1cm,yshift=0.8cm]{\LARGE a};
 \node(cir_c)[cir,right of = cir_a,xshift=1.2cm,yshift=0cm]{\LARGE c};

--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -25,7 +25,7 @@
 \parinterval 基于上下文的翻译是机器翻译的一个重要分支。传统方法中，机器翻译通常被定义为对一个句子进行翻译的问题。但是，现实中每句话往往不是独立出现的。比如，人们会使用语音进行表达，或者通过图片来传递信息，这些语音和图片内容都可以伴随着文字一起出现在翻译场景中。此外，句子往往存在于段落或者篇章之中，如果要理解这个句子，也需要整个段落或者篇章的信息。而这些上下文信息都是机器翻译可以利用的。
-\parinterval 本章在句子级翻译的基础上将问题扩展为更大上下文中的翻译，具体包括：语音翻译、图像翻译、篇章翻译三个主题。这些问题均为机器翻译应用中的真实需求。同时，使用多模态等信息也是当下自然语言处理的热点方向之一。
+\parinterval 本章在句子级翻译的基础上将问题扩展为更大上下文中的翻译，具体包括语音翻译、图像翻译、篇章翻译三个主题。这些问题均为机器翻译应用中的真实需求。同时，使用多模态等信息也是当下自然语言处理的热点方向之一。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -33,9 +33,9 @@
 \section{机器翻译需要更多的上下文}
-\parinterval 长期以来，机器翻译都是指句子级翻译。主要原因在于，句子级的翻译建模可以大大简化问题，使得机器翻译方法更容易被实践和验证。但是人类使用语言的过程并不是孤立地在一个个句子上进行的。这个问题可以类比于人类学习语言的过程：小孩成长过程中会接受视觉、听觉、触觉等多种信号，这些信号的共同作用使得他们产生对客观世界的“认识”，同时促使他们使用“语言”进行表达。从这个角度说，语言能力并不是由单一因素形成的，它往往伴随着其他信息的相互作用，比如，当我们翻译一句话的时候，会用到看到的画面、听到的语调、甚至前面说过的句子中的信息。
+\parinterval 长期以来，机器翻译都是指句子级翻译。主要原因在于，句子级的翻译建模可以大大简化问题，使得机器翻译方法更容易被实践和验证。但是人类使用语言的过程并不是孤立地在一个个句子上进行的。这个问题可以类比于人类学习语言的过程：小孩成长过程中会接受视觉、听觉、触觉等多种信号，这些信号的共同作用使得他们产生对客观世界的“认识”，同时促使他们使用“语言”进行表达。从这个角度说，语言能力并不是由单一因素形成的，它往往伴随着其他信息的相互作用，比如，当人们翻译一句话的时候，会用到看到的画面、听到的语调、甚至前面说过的句子中的信息。
-\parinterval 广义上，当前句子以外的信息都可以被看作一种上下文。比如，图\ref{fig:17-1}中，需要把英语句子“A girl jumps off a bank .”翻译为汉语。但是，其中的“bank”有多个含义，因此仅仅使用英语句子本身的信息可能会将其翻译为“银行”，而非正确的译文“河床”。但是，图\ref{fig:17-1}中也提供了这个英语句子所对应的图片，显然图片中直接展示了河床，这时“bank”是没有歧义的。通常也会把这种使用图片和文字一起进行机器翻译的任务称作{\small\bfnew{多模态机器翻译}}\index{多模态机器翻译}（Multi-Modal Machine Translation）\index{Multi-Modal Machine Translation}。
+\parinterval 广义上，当前句子以外的信息都可以被看作一种上下文。比如，图\ref{fig:17-1}中，需要把英语句子“A girl jumps off a bank .”翻译为汉语。但是，其中的“bank”有多个含义，因此仅仅使用英语句子本身的信息可能会将其翻译为“银行”，而非正确的译文“河床”。但是，图\ref{fig:17-1}中也提供了这个英语句子所对应的图片，显然图片中直接展示了河床，这时“bank”是没有歧义的。通常也会把这种使用图片和文字一起进行机器翻译的任务称作{\small\bfnew{多模态机器翻译}}\index{多模态机器翻译}（Multi-Modal Machine Translation）\index{Multi-modal Machine Translation}。
 %----------------------------------------------
 \begin{figure}[htp]
@@ -62,7 +62,7 @@
 \subsection{音频处理}
-\parinterval 为了保证对相关内容描述的完整性，这里对语音处理的基本知识作简要介绍。不同于文本，音频本质上是经过若干信号处理之后的{\small\bfnew{波形}}（Waveform）\index{Waveform}。具体来说，声音是一种空气的震动，因此可以被转换为模拟信号。模拟信号是一段连续的信号，经过采样变为离散的数字信号。采样是每隔固定的时间记录一下声音的振幅，采样率表示每秒的采样点数，单位是赫兹（Hz）。采样率越高，结果的损失则越小。通常来说，采样的标准是能够通过离散化的数字信号重现原始语音。我们日常生活中使用的手机和电脑设备的采样率一般为16kHz，表示每秒16000个采样点；而音频CD的采样率可以达到44.1kHz。 经过进一步的量化，将采样点的值转换为整型数值保存，从而减少占用的存储空间，通常采用的是16位量化。将采样率和量化位数相乘，就可以得到{\small\bfnew{比特率}}\index{比特率}（Bits Per Second，BPS）\index{Bits Per Second}，表示音频每秒占用的位数。例如，16kHz采样率和16位量化的音频，比特率为256kb/s。音频处理的整体流程如图\ref{fig:17-2}所示\upcite{洪青阳2020语音识别原理与应用,陈果果2020语音识别实战}。
+\parinterval 为了保证对相关内容描述的完整性，这里对语音处理的基本知识作简要介绍。不同于文本，音频本质上是经过若干信号处理之后的{\small\bfnew{波形}}（Waveform）\index{Waveform}。具体来说，声音是一种空气的震动，因此可以被转换为模拟信号。模拟信号是一段连续的信号，经过采样变为离散的数字信号。采样是每隔固定的时间记录一下声音的振幅，采样率表示每秒的采样点数，单位是赫兹（Hz）。采样率越高，结果的损失则越小。通常来说，采样的标准是能够通过离散化的数字信号重现原始语音。日常生活中使用的手机和电脑设备的采样率一般为16kHz，表示每秒16000个采样点；而音频CD的采样率可以达到44.1kHz。 经过进一步的量化，将采样点的值转换为整型数值保存，从而减少占用的存储空间，通常采用的是16位量化。将采样率和量化位数相乘，就可以得到{\small\bfnew{比特率}}\index{比特率}（Bits Per Second，BPS）\index{Bits Per Second}，表示音频每秒占用的位数。例如，16kHz采样率和16位量化的音频，比特率为256kb/s。音频处理的整体流程如图\ref{fig:17-2}所示\upcite{洪青阳2020语音识别原理与应用,陈果果2020语音识别实战}。
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
@@ -85,7 +85,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------
-\parinterval 经过了上述的预处理操作，可以得到音频对应的帧序列，之后通过不同的操作来提取不同类型的声学特征。常用的声学特征包括{\small\bfnew{Mel频率倒谱系数}}\index{Mel频率倒谱系数}（Mel-Frequency Cepstral Coefficient，MFCC）\index{Mel-Frequency Cepstral Coefficient}、{\small\bfnew{感知线性预测系数}}\index{感知线性预测系数}（Perceptual Lienar Predictive，PLP）\index{Perceptual Lienar Predictive}、{\small\bfnew{滤波器组}}\index{滤波器组}（Filter-bank，Fbank）\index{Filter-bank}等。MFCC、PLP和Fbank特征都需要对预处理后的音频做{\small\bfnew{短时傅里叶变换}}\index{短时傅里叶变换}（Short-time Fourier Tranform，STFT）\index{Short-time Fourier Tranform}，得到具有规律的线性分辨率。之后再经过特定的操作，得到各种声学特征。不同声学特征的特点是不同的，MFCC去相关性较好，PLP抗噪性强，FBank可以保留更多的语音原始特征。在语音翻译中，比较常用的声学特征为FBank或MFCC\upcite{洪青阳2020语音识别原理与应用}。
+\parinterval 经过了上述的预处理操作，可以得到音频对应的帧序列，之后通过不同的操作来提取不同类型的声学特征。常用的声学特征包括{\small\bfnew{Mel频率倒谱系数}}\index{Mel频率倒谱系数}（Mel-frequency Cepstral Coefficient，MFCC）\index{Mel-Frequency Cepstral Coefficient}、{\small\bfnew{感知线性预测系数}}\index{感知线性预测系数}（Perceptual Linear Predictive，PLP）\index{Perceptual Linear Predictive}、{\small\bfnew{滤波器组}}\index{滤波器组}（Filter-bank，Fbank）\index{Filter-bank}等。MFCC、PLP和Fbank特征都需要对预处理后的音频做{\small\bfnew{短时傅里叶变换}}\index{短时傅里叶变换}（Short-time Fourier Transform，STFT）\index{Short-time Fourier Transform}，得到具有规律的线性分辨率。之后再经过特定的操作，得到各种声学特征。不同声学特征的特点是不同的，MFCC去相关性较好，PLP抗噪性强，FBank可以保留更多的语音原始特征。在语音翻译中，比较常用的声学特征为FBank或MFCC\upcite{洪青阳2020语音识别原理与应用}。
 \parinterval 实际上，提取到的声学特征可以类比于计算机视觉中的像素特征，或者自然语言处理中的词嵌入表示。不同之处在于，声学特征更加复杂多变，可能存在着较多的噪声和冗余信息。此外，相比对应的文字序列，音频提取到的特征序列长度要大十倍以上。比如，人类正常交流中每秒钟一般可以说2-3个字，而每秒钟的语音可以提取得到100帧的特征序列。巨大的长度比差异也为声学特征建模带来了挑战。
@@ -147,7 +147,7 @@
 \parinterval 可以看出，词格可以保存多条搜索路径，路径中保存了输入序列的时间信息以及解码过程。翻译模型基于词格进行翻译，可以降低语音识别模型带来的误差\upcite{DBLP:conf/acl/ZhangGCF19,DBLP:conf/acl/SperberNPW19}。但在端到端语音识别模型中，一般基于束搜索方法进行解码，且解码序列的长度与输入序列并不匹配，相比传统声学模型解码丢失了语音的时间信息，因此这种基于词格的方法主要集中在传统语音识别系统上。
-\parinterval 为了降低错误传播问题带来的影响，一种思路是通过一个后处理模型修正识别结果中的错误，再送给文本翻译模型进行翻译。也可以进一步对文本做{\small\bfnew{顺滑}}\index{顺滑}（Disfluency Detection\index{Disfluency Detection}），使得送给翻译系统的文本更加干净、流畅，比如除去一些导致停顿的语气词。这一做法在工业界得到了广泛应用，但由于每个模型只能串行地计算，也会带来额外的计算代价以及运算时间。另外一种思路是训练更加健壮的文本翻译模型，使其可以处理输入中存在的噪声或误差\upcite{DBLP:conf/acl/LiuTMCZ18}。 
+\parinterval 为了降低错误传播问题带来的影响，一种思路是通过一个后处理模型修正识别结果中的错误，再送给文本翻译模型进行翻译。也可以进一步对文本做{\small\bfnew{顺滑}}\index{顺滑}（Disfluency Detection\index{Disfluency Detection}）处理，使得送给翻译系统的文本更加干净、流畅，比如除去一些导致停顿的语气词。这一做法在工业界得到了广泛应用，但由于每个模型只能串行地计算，也会带来额外的计算代价以及运算时间。另外一种思路是训练更加健壮的文本翻译模型，使其可以处理输入中存在的噪声或误差\upcite{DBLP:conf/acl/LiuTMCZ18}。 
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -249,7 +249,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------
-\parinterval 另外一种多任务学习的思想是通过两个解码器，分别预测语音对应的源语言句子和目标语言句子，具体有图\ref{fig:17-10}展示的三种方式\upcite{DBLP:conf/naacl/AnastasopoulosC18,DBLP:conf/asru/BaharBN19}。图\ref{fig:17-10}(a)中采用单编码器-双解码器的方式，两个解码器根据编码器的表示，分别预测源语言句子和目标语言句子，从而使编码器训练地更加充分。这种做法的好处在于源语言文的本生任务成可以辅助翻译过程，相当于为源语言语音提供了额外的“模态”信息。图\ref{fig:17-10}(b)则通过使用两个级联的解码器，先利用第一个解码器生成源语言句子，然后再利用第一个解码器的表示，通过第二个解码器生成目标语言句子。这种方法通过增加一个中间输出，降低了模型的训练难度，但同时也会带来额外的解码耗时，因为两个解码器需要串行地进行生成。图\ref{fig:17-10}(c) 中模型更进一步，第二个编码器联合编码器和第一个解码器的表示进行生成，更充分地利用了已有信息。
+\parinterval 另外一种多任务学习的思想是通过两个解码器，分别预测语音对应的源语言句子和目标语言句子，具体有图\ref{fig:17-10}展示的三种方式\upcite{DBLP:conf/naacl/AnastasopoulosC18,DBLP:conf/asru/BaharBN19}。图\ref{fig:17-10}(a)中采用单编码器-双解码器的方式，两个解码器根据编码器的表示，分别预测源语言句子和目标语言句子，从而使编码器训练得更加充分。这种做法的好处在于源语言的文本生成任务可以辅助翻译过程，相当于为源语言语音提供了额外的“模态”信息。图\ref{fig:17-10}(b)则通过使用两个级联的解码器，先利用第一个解码器生成源语言句子，然后再利用第一个解码器的表示，通过第二个解码器生成目标语言句子。这种方法通过增加一个中间输出，降低了模型的训练难度，但同时也会带来额外的解码耗时，因为两个解码器需要串行地进行生成。图\ref{fig:17-10}(c) 中模型更进一步，第二个解码器联合编码器和第一个解码器的表示进行生成，更充分地利用了已有信息。
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -347,7 +347,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------
-\parinterval 那么，多模态机器翻译是如何计算上下文向量的呢？这里仿照第十章的内容给出描述。假设编码器输出的状态序列为$\{\mathbi{h}_1,...\mathbi{h}_m\}$，需要注意的是，这里的状态序列不是源语言句子的状态序列，而是通过基于卷积等操作提取到的图像的状态序列。假设图像的特征维度是$16 \times 16 \times 512$，其中前两个维度分别表示图像的高和宽，这里会将图像映射为$256 \times 512$ 的状态序列，其中$512$为每个状态的维度。对于目标语位置$j$，上下文向量$\mathbi{C}_{j}$被定义为对序列的编码器输出进行加权求和，如下：
+\parinterval 那么，多模态机器翻译是如何计算上下文向量的呢？这里仿照第十章的内容给出描述。假设编码器输出的状态序列为$\{\mathbi{h}_1,\ldots,\mathbi{h}_m\}$，需要注意的是，这里的状态序列不是源语言句子的状态序列，而是通过基于卷积等操作提取到的图像的状态序列。假设图像的特征维度是$16 \times 16 \times 512$，其中前两个维度分别表示图像的高和宽，这里会将图像映射为$256 \times 512$ 的状态序列，其中$512$为每个状态的维度。对于目标语言位置$j$，上下文向量$\mathbi{C}_{j}$被定义为对序列的编码器输出进行加权求和，如下：
 \begin{eqnarray}
 \mathbi{C}_{j}&=& \sum_{i}{{\alpha}_{i,j}{\mathbi{h}}_{i}}
 \end{eqnarray}
@@ -412,7 +412,7 @@
 \parinterval 要想使编码器-解码器框架在图像描述生成中充分发挥作用，编码器也要更好的表示图像信息。对于编码器的改进，通常体现在向编码器中添加图像的语义信息\upcite{DBLP:conf/cvpr/YouJWFL16,DBLP:conf/cvpr/ChenZXNSLC17,DBLP:journals/pami/FuJCSZ17}和位置信息\upcite{DBLP:conf/cvpr/ChenZXNSLC17,DBLP:conf/ijcai/LiuSWWY17}。
-\parinterval 图像的语义信息一般是指图像中存在的实体、属性、场景等等。如图\ref{fig:17-17}所示，从图像中利用属性或实体检测器提取出“girl”、“river”、“bank”等属性词和实体词，将他们作为图像的语义信息编码的一部分，再利用注意力机制计算目标语言单词与这些属性词或实体词之间的注意力权重\upcite{DBLP:conf/cvpr/YouJWFL16}。当然，除了图像中的实体和属性作为语义信息外，也可以将图片的场景信息加入到编码器当中\upcite{DBLP:journals/pami/FuJCSZ17}。有关如何做属性、实体和场景的检测，涉及到目标检测任务的工作，例如Faster-RCNN\upcite{DBLP:journals/pami/RenHG017}、YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}等等,这里不过多赘述。
+\parinterval 图像的语义信息一般是指图像中存在的实体、属性、场景等等。如图\ref{fig:17-17}所示，从图像中利用属性或实体检测器提取出“girl”、“river”、“bank”等属性词和实体词，将它们作为图像的语义信息编码的一部分，再利用注意力机制计算目标语言单词与这些属性词或实体词之间的注意力权重\upcite{DBLP:conf/cvpr/YouJWFL16}。当然，除了图像中的实体和属性作为语义信息外，也可以将图片的场景信息加入到编码器当中\upcite{DBLP:journals/pami/FuJCSZ17}。有关如何做属性、实体和场景的检测，涉及到目标检测任务的工作，例如Faster-RCNN\upcite{DBLP:journals/pami/RenHG017}、YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}等等，这里不再赘述。
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
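Review note on the audio-processing hunk of Chapter17/chapter17.tex: the arithmetic behind "16kHz, 16-bit, 256kb/s" and the later "100 frames per second" can be checked with a few lines. This is only a sanity-check sketch. The function names are hypothetical, and the 10 ms frame shift is an assumption (a common default) that the visible text does not state.

```python
def bit_rate(sample_rate_hz, bits_per_sample):
    """Bits consumed per second of single-channel audio."""
    return sample_rate_hz * bits_per_sample


def frames_per_second(frame_shift_ms):
    """Number of feature frames per second for a given frame shift (assumed value)."""
    return 1000 // frame_shift_ms


rate = bit_rate(16_000, 16)
print(rate // 1000, "kb/s")               # 256 kb/s
print(rate // 8, "bytes per second")      # 32000 bytes per second
print(frames_per_second(10), "frames/s")  # 100 frames/s
```

At 2 to 3 characters per second of speech versus 100 frames per second, the feature sequence is roughly 30 to 50 times longer than the text, consistent with the "more than ten times" statement in the chapter.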

--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
@@ -469,13 +469,14 @@ g(\seq{s},\seq{t}) & \equiv & \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \ti
 %----------------------------------------------
 \parinterval 已经有工作证明机器翻译问题是NP难的\upcite{knight1999decoding}。对于如此巨大的搜索空间，需要一种十分高效的搜索算法才能实现机器翻译的解码。在{\chaptertwo}已经介绍一些常用的搜索方法。这里使用一种贪婪的搜索方法实现机器翻译的解码。它把解码分成若干步骤，每步只翻译一个单词，并保留当前“ 最好”的结果，直至所有源语言单词都被翻译完毕。
-\vspace{0.3em}
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\input{./Chapter5/Figures/figure-greedy-mt-decoding-pseudo-code}
+\subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-1}}
-    \caption{贪婪的机器翻译解码算法的伪代码}
+\subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-3}}
-    \label{fig:5-10}
+    \caption{贪婪的机器翻译解码过程实例}
+    \label{fig:5-11}
 \end{figure}
 %----------------------------------------------
@@ -484,14 +485,13 @@ g(\seq{s},\seq{t}) & \equiv & \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \ti
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
-\subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-1}}
+\input{./Chapter5/Figures/figure-greedy-mt-decoding-pseudo-code}
-\subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-3}}
+    \caption{贪婪的机器翻译解码算法的伪代码}
-    \caption{贪婪的机器翻译解码过程实例}
+    \label{fig:5-10}
-    \label{fig:5-11}
 \end{figure}
 %----------------------------------------------
-该算法的核心在于，系统一直维护一个当前最好的结果，之后每一步考虑扩展这个结果的所有可能，并计算模型得分，然后再保留扩展后的最好结果。注意，在每一步中，只有排名第一的结果才会被保留，其他结果都会被丢弃。这也体现了贪婪的思想。显然这个方法不能保证搜索到全局最优的结果，但是由于每次扩展只考虑一个最好的结果，因此该方法速度很快。图\ref{fig:5-11}给出了算法执行过程的简单示例。当然，机器翻译的解码方法有很多，这里仅仅使用简单的贪婪搜索方法来解决机器翻译的解码问题，在后续章节会对更加优秀的解码方法进行介绍。
+\parinterval 该算法的核心在于，系统一直维护一个当前最好的结果，之后每一步考虑扩展这个结果的所有可能，并计算模型得分，然后再保留扩展后的最好结果。注意，在每一步中，只有排名第一的结果才会被保留，其他结果都会被丢弃。这也体现了贪婪的思想。显然这个方法不能保证搜索到全局最优的结果，但是由于每次扩展只考虑一个最好的结果，因此该方法速度很快。图\ref{fig:5-11}给出了算法执行过程的简单示例。当然，机器翻译的解码方法有很多，这里仅仅使用简单的贪婪搜索方法来解决机器翻译的解码问题，在后续章节会对更加优秀的解码方法进行介绍。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
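Review note on the greedy decoding paragraph in the hunk above: the idea of "keep only the single best extension at every step" can be sketched as below. This is not the pseudo-code from figure-greedy-mt-decoding-pseudo-code (which is not visible in this diff). The scoring function (word translation probability times a bigram language model probability), the toy tables, and all names are hypothetical choices for illustration.

```python
def greedy_decode(source, trans_table, lm, lm_floor=0.01):
    """Translate one source word per step and keep only the best hypothesis."""
    output, score = [], 1.0
    remaining = set(range(len(source)))       # untranslated source positions
    while remaining:
        prev = output[-1] if output else "<s>"
        best = None
        # Consider every way of extending the current hypothesis by one word.
        for j in sorted(remaining):
            for target, p_trans in trans_table[source[j]].items():
                cand = score * p_trans * lm.get((prev, target), lm_floor)
                if best is None or cand > best[0]:
                    best = (cand, j, target)
        # Greedy step: only the top-ranked extension survives.
        score, j, target = best
        remaining.remove(j)
        output.append(target)
    return output, score


# Toy model (hypothetical numbers).
trans_table = {
    "我": {"I": 0.6, "me": 0.4},
    "喜欢": {"like": 0.7, "love": 0.3},
    "音乐": {"music": 1.0},
}
lm = {
    ("<s>", "I"): 0.5,
    ("I", "like"): 0.4,
    ("I", "love"): 0.3,
    ("like", "music"): 0.3,
    ("love", "music"): 0.3,
}

words, score = greedy_decode(["我", "喜欢", "音乐"], trans_table, lm)
print(words)   # ['I', 'like', 'music']
```

Because every step discards all but one hypothesis, the cost grows only with the number of candidate extensions per step, at the price of no guarantee of a globally optimal translation, as the paragraph notes.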
@@ -875,7 +875,7 @@ g(\seq{s},\seq{t}) & \equiv & \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \ti
 \begin{eqnarray}
 &                    & \textrm{max}\Big(\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}\Big) \nonumber \\
 & \textrm{s.t.} & \textrm{任意单词} t_{y}:\;\sum_{s_x}{f(s_x|t_y)} = 1 \nonumber
-\label{eq:5-31}
+\label{eq:5-29-30}
 \end{eqnarray}
 \noindent 其中，$\textrm{max}(\cdot)$表示最大化，$\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}$是目标函数，$f({s_j|t_i})$是模型的参数，$\sum_{s_x}{f(s_x|t_y)}=1$是优化的约束条件，以保证翻译概率满足归一化的要求。需要注意的是$\{f(s_x |t_y)\}$对应了很多参数，每个源语言单词和每个目标语单词的组合都对应一个参数$f(s_x |t_y)$。
@@ -916,42 +916,42 @@ L(f,\lambda)&=&\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f(s_j|t_
 \noindent 这里$s_u$和$t_v$分别表示源语言和目标语言词表中的某一个单词。为了求$\frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}$，这里引入一个辅助函数。令$g(z)=\alpha z^{\beta}$ 为变量$z$ 的函数，显然，
 $\frac{\partial g(z)}{\partial z} = \alpha \beta z^{\beta-1} = \frac{\beta}{z}\alpha z^{\beta} = \frac{\beta}{z} g(z)$。这里可以把$\prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i)$看做$g(z)=\alpha z^{\beta}$的实例。首先，令$z=\sum_{i=0}^{l}f(s_u|t_i)$，注意$s_u$为给定的源语单词。然后，把$\beta$定义为$\sum_{i=0}^{l}f(s_u|t_i)$在$\prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i)$ 中出现的次数，即源语句子中与$s_u$相同的单词的个数。
+\vspace{-1em}
 \begin{eqnarray}
 \beta &=& \sum_{j=1}^{m} \delta(s_j,s_u)
 \label{eq:5-32}
 \end{eqnarray}
 \noindent 其中，当$x=y$时，$\delta(x,y)=1$，否则为0。
 \parinterval 根据$\frac{\partial g(z)}{\partial z} = \frac{\beta}{z} g(z)$，可以得到
+\vspace{-0.5em}
 \begin{eqnarray}
-\frac{\partial g(z)}{\partial z}& =& \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]} \nonumber \\
+\frac{\partial g(z)}{\partial z}& =& \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]}\nonumber \\
-& = &\frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
+ &=& \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
 \label{eq:5-33}
 \end{eqnarray}
 \parinterval 根据$\frac{\partial g(z)}{\partial z}$和$\frac{\partial z}{\partial f}$计算的结果，可以得到
+\vspace{-0.5em}
 \begin{eqnarray}
 {\frac{\partial \big[ \prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}}& =& {{\frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]}} \cdot{\frac{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]}{\partial f(s_u|t_v)}}} \nonumber \\
 & = &{\frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}
 \label{eq:5-34}
 \end{eqnarray}
 \parinterval 将$\frac{\partial \big[ \prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}$进一步代入$\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}$，得到$L(f,\lambda)$的导数
+\vspace{-0.5em}
 \begin{eqnarray}
-& &{\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}}\nonumber \\
+{\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}} &=&{\frac{\varepsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_{a_j}) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v}}\nonumber \\
-&=&{\frac{\varepsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_{a_j}) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v}}\nonumber \\
 &=&{\frac{\varepsilon}{(l+1)^{m}} \frac{\sum_{j=1}^{m} \delta(s_j,s_u) \cdot \sum_{i=0}^{l} \delta(t_i,t_v)}{\sum_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) - \lambda_{t_v}}
 \label{eq:5-35}
 \end{eqnarray}
 \parinterval 令$\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}=0$，有
+\vspace{-1em}
 \begin{eqnarray}
 f(s_u|t_v) &=& \frac{\lambda_{t_v}^{-1} \varepsilon}{(l+1)^{m}} \cdot \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot f(s_u|t_v)
 \label{eq:5-36}
 \end{eqnarray}
 \parinterval 将上式稍作调整得到下式：
+\vspace{-1em}
 \begin{eqnarray}
 f(s_u|t_v) &=& \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v) \frac{f(s_u|t_v) }{\sum\limits_{i=0}^{l}f(s_u|t_i)}
 \label{eq:5-37}
@@ -980,11 +980,13 @@ f(s_u|t_v) &=& \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=
 %----------------------------------------------
 \parinterval 期望频次是事件在其分布下出现次数的期望。令$c_{\mathbb{E}}(X)$为事件$X$的期望频次，其计算公式为：
+\vspace{-0.5em}
 \begin{eqnarray}
 c_{\mathbb{E}}(X)&=&\sum_i c(x_i) \cdot \funp{P}(x_i)
 \label{eq:5-38}
 \end{eqnarray}
+\vspace{-0.5em}
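 \noindent where $c(x_i)$ is the number of occurrences when $X$ takes the value $x_i$, and $\funp{P}(x_i)$ is the probability that $X=x_i$. Figure \ref{fig:5-26} shows the computation of the expected frequency of $X$ in detail, where $x_1$, $x_2$, and $x_3$ denote the cases in which $X$ occurs 2 times, once, and 5 times, respectively.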
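 \parinterval As a small worked illustration of Equation \ref{eq:5-38}, assume the three cases have probabilities $\funp{P}(x_1)=0.1$, $\funp{P}(x_2)=0.3$, and $\funp{P}(x_3)=0.6$. These values are hypothetical and are not taken from Figure \ref{fig:5-26}; only the counts 2, 1, and 5 come from the text. Under that assumption the expected frequency is:

```latex
\begin{eqnarray}
c_{\mathbb{E}}(X) &=& 2 \times 0.1 + 1 \times 0.3 + 5 \times 0.6 \nonumber \\
&=& 3.5 \nonumber
\end{eqnarray}
```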
 %----------------------------------------------
@@ -997,38 +999,50 @@ c_{\mathbb{E}}(X)&=&\sum_i c(x_i) \cdot \funp{P}(x_i)
 \end{figure}
 %----------------------------------------------
+\vspace{-0.5em}
 \parinterval In $\funp{P}(\seq{s}|\seq{t})$, the expected frequency with which $t_v$ is translated (linked) to $s_u$ is:
+\vspace{-0.5em}
 \begin{eqnarray}
 c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) & \equiv & \sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v) \cdot \frac {f(s_u|t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)}
 \label{eq:5-39}
 \end{eqnarray}
+\vspace{-0.5em}
 \parinterval Hence Equation \ref{eq:5-37} can be rewritten as:
+\vspace{-0.5em}
 \begin{eqnarray}
 f(s_u|t_v)&=&\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
 \label{eq:5-40}
 \end{eqnarray}
+\vspace{-0.5em}
 \parinterval Letting $\lambda_{t_v}^{'}=\frac{\lambda_{t_v}}{\funp{P}(\seq{s}| \seq{t})}$ gives:
+\vspace{-0.5em}
 \begin{eqnarray}
 f(s_u|t_v) &= &\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) \nonumber \\
 &=&{(\lambda_{t_v}^{'})}^{-1} \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
 \label{eq:5-41}
 \end{eqnarray}
+\vspace{-0.5em}
 \parinterval Moreover, the IBM models impose the following constraint on $f(\cdot|\cdot)$:
+\vspace{-0.5em}
 \begin{eqnarray}
 \forall t_y : \sum\limits_{s_x} f(s_x|t_y) &=& 1
 \label{eq:5-42}
 \end{eqnarray}
+\vspace{-0.5em}
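 \parinterval Summing Equation \ref{eq:5-41} over all source words and applying this probability normalization constraint on $f(\cdot|\cdot)$, $\lambda_{t_v}^{'}$ must be: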
+\vspace{-0.5em}
 \begin{eqnarray}
 \lambda_{t_v}^{'}&=&\sum\limits_{s'_u} c_{\mathbb{E}}(s'_u|t_v;\seq{s},\seq{t})
 \label{eq:5-43}
 \end{eqnarray}
+\vspace{-0.5em}
 \parinterval Therefore, $f(s_u|t_v)$ can be further rewritten as:
+\vspace{-0.5em}
 \begin{eqnarray}
 f(s_u|t_v)&=&\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s'_u} c_{\mathbb{E}}(s'_u|t_v;\seq{s},\seq{t}) }
 \label{eq:5-44}
@@ -1042,7 +1056,9 @@ f(s_u|t_v)&=&\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s'_u
 \end{figure}
 %----------------------------------------------
+\vspace{-0.5em}
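 \parinterval Further, suppose there are $K$ mutually translated sentence pairs (called a parallel corpus):
+\vspace{-0.5em}
 $\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$. The expected frequency associated with $f(s_u|t_v)$ is then: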
 \begin{eqnarray}
 c_{\mathbb{E}}(s_u|t_v)&=&\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})
@@ -1062,10 +1078,6 @@ c_{\mathbb{E}}(s_u|t_v)&=&\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[k]},
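 \parinterval This completes the chapter's introduction to training IBM Model 1, which can be implemented with the algorithm shown in Figure \ref{fig:5-27}. The final form of the algorithm is not complicated: traverse every sentence pair, compute the expected frequencies for $f(\cdot|\cdot)$, and then re-estimate $f(\cdot|\cdot)$. This process is iterated until $f(\cdot|\cdot)$ converges to a stable state.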
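 \parinterval The training loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm of Figure \ref{fig:5-27} itself: the function name \texttt{train\_ibm1}, the \texttt{<null>} token standing in for the empty word $t_0$, the uniform initialization, and the fixed iteration count (used in place of a convergence test) are all hypothetical choices.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """Estimate f(s|t) from (source_words, target_words) pairs with EM.

    Returns a dict-like object mapping (s, t) to f(s|t).
    """
    NULL = "<null>"  # hypothetical token for the empty word t_0
    src_vocab = {s for src, _ in pairs for s in src}

    # Assumed initialization: every f(s|t) starts at the same value.
    uniform = 1.0 / len(src_vocab)
    f = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected frequency c_E(s|t) over all pairs
        total = defaultdict(float)  # sum over s' of c_E(s'|t)

        # E-step: accumulate expected frequencies from every sentence pair.
        for src, tgt in pairs:
            tgt = [NULL] + list(tgt)
            for s in src:
                norm = sum(f[(s, t)] for t in tgt)  # sum_i f(s|t_i)
                for t in tgt:
                    c = f[(s, t)] / norm
                    count[(s, t)] += c
                    total[t] += c

        # M-step: normalize the expected frequencies for each target word.
        f = defaultdict(
            float, {(s, t): c / total[t] for (s, t), c in count.items()}
        )
    return f
```

 \parinterval Looping over every source position and every target position of a sentence pair accumulates the same quantity as the two $\delta$ sums in Equation \ref{eq:5-39}, and the final division corresponds to Equation \ref{eq:5-44} applied to counts summed over the whole corpus.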
-\vspace{-1.5em}
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

--- a/Chapter6/chapter6.tex
+++ b/Chapter6/chapter6.tex
@@ -229,7 +229,6 @@
 %----------------------------------------------
 \begin{itemize}
-\vspace{0.5em}
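 \item Part 1: models the fertility of the target-language word for each $i\in[1,l]$ ({\color{red!70} red}), i.e., the probability of generating $\varphi_i$. It depends on $\seq{t}$ and on the fertilities $\varphi_1^{i-1}$ of the target-language words in the interval $[1,i-1]$.\footnote{By convention, when $i=1$, $\varphi_1^0$ is empty.}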
 \vspace{0.5em}
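 \item Part 2: models the fertility for $i=0$ ({\color{blue!70} blue}), i.e., the probability of generating the fertility of the empty token $t_0$. It depends on $\seq{t}$ and on the fertilities $\varphi_1^l$ of the target-language words in the interval $[1,l]$.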
@@ -248,7 +247,7 @@
 \subsection{IBM Model 3}
 \parinterval IBM Model 3 simplifies the basic model shown in Figure \ref{fig:6-7} through several assumptions. Specifically, for each $i\in[1,l]$, it assumes that $\funp{P}(\varphi_i |\varphi_1^{i-1},\seq{t})$ depends only on $\varphi_i$ and $t_i$, and that $\funp{P}(\pi_{ik}|\pi_{i1}^{k-1},\pi_1^{i-1},\tau_0^l,\varphi_0^l,\seq{t})$ depends only on $\pi_{ik}$, $i$, $m$, and $l$. For all $i\in[0,l]$, it assumes that $\funp{P}(\tau_{ik}|\tau_{i1}^{k-1},\tau_1^{i-1},\varphi_0^l,\seq{t})$ depends only on $\tau_{ik}$ and $t_i$. Formally, these assumptions are:
+\vspace{-0.5em}
 \begin{eqnarray}
 \funp{P}(\varphi_i|\varphi_1^{i-1},\seq{t})                                                              & = &{\funp{P}(\varphi_i|t_i)} \label{eq:6-10} \\
 \funp{P}(\tau_{ik} = s_j |\tau_{i1}^{k-1},\tau_{1}^{i-1},\varphi_0^l,\seq{t})             & = & t(s_j|t_i) \label{eq:6-11} \\
@@ -265,7 +264,6 @@
 \end{eqnarray}
 Otherwise,
 \begin{eqnarray}
 \funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t}) & = & 0
 \label{eq:6-14}
@@ -308,7 +306,6 @@ m-\varphi_0\\
 p_0+p_1                            & = & 1 \label{eq:6-21}
 \end{eqnarray}
 }
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

--- a/bibliography.bib
+++ b/bibliography.bib
@@ -5090,7 +5090,7 @@ author    = {Yoshua Bengio and
               Tobias Weyand and
               Marco Andreetto and
               Hartwig Adam},
-  journal={CoRR},
+  publisher ={CoRR},
  year={2017},
 }
 @inproceedings{sifre2014rigid,
@@ -11622,28 +11622,28 @@ author    = {Zhuang Liu and
  publisher = {International Conference on Machine Learning},
  year      = {2018}
 }
-@article{zhao2020dual,
+@inproceedings{zhao2020dual,
  title={Dual Learning: Theoretical Study and an Algorithmic Extension},
  author={Zhao, Zhibing and Xia, Yingce and Qin, Tao and Xia, Lirong and Liu, Tie-Yan},
-  journal={arXiv preprint arXiv:2005.08238},
+  publisher ={arXiv preprint arXiv:2005.08238},
  year={2020}
 }
 %%%%% chapter 16------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%% chapter 17------------------------------------------------------
-@article{DBLP:journals/ac/Bar-Hillel60,
+@inproceedings{DBLP:journals/ac/Bar-Hillel60,
  author    = {Yehoshua Bar-Hillel},
  title     = {The Present Status of Automatic Translation of Languages},
-  journal   = {Advances in computers},
+  publisher = {Advances in computers},
  volume    = {1},
  pages     = {91--163},
  year      = {1960}
 }
-@article{DBLP:journals/corr/abs-1901-09115,
+@inproceedings{DBLP:journals/corr/abs-1901-09115,
  author    = {Andrei Popescu-Belis},
  title     = {Context in Neural Machine Translation: {A} Review of Models and Evaluations},
-  journal   = {CoRR},
+  publisher = {CoRR},
  volume    = {abs/1901.09115},
  year      = {2019}
 }
@@ -11698,7 +11698,7 @@ author    = {Zhuang Liu and
 @inproceedings{tiedemann2010context,
  title={Context adaptation in statistical machine translation using models with exponentially decaying cache},
  author={Tiedemann, J{\"o}rg},
-  publisher={Domain Adaptation for Natural Language Processing},
+  publisher={Annual Meeting of the Association for Computational Linguistics},
  pages={8--15},
  year={2010}
 }
@@ -11747,7 +11747,7 @@ author    = {Zhuang Liu and
  title     = {Using Sense-labeled Discourse Connectives for Statistical Machine
               Translation},
  pages     = {129--138},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Annual Conference of the European Association for Machine Translation},
  year      = {2012}
 }
 @inproceedings{DBLP:conf/emnlp/LaubliS018,
@@ -11760,12 +11760,12 @@ author    = {Zhuang Liu and
  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2018}
 }
-@article{DBLP:journals/corr/abs-1912-08494,
+@inproceedings{DBLP:journals/corr/abs-1912-08494,
  author    = {Sameen Maruf and
               Fahimeh Saleh and
               Gholamreza Haffari},
  title     = {A Survey on Document-level Machine Translation: Methods and Evaluation},
-  journal   = {CoRR},
+  publisher = {CoRR},
  volume    = {abs/1912.08494},
  year      = {2019}
 }
@@ -11777,21 +11777,20 @@ author    = {Zhuang Liu and
  publisher = {Association for Computational Linguistics},
  year      = {2017}
 }
-@article{DBLP:journals/corr/abs-1910-07481,
+@inproceedings{DBLP:journals/corr/abs-1910-07481,
  author    = {Valentin Mac{\'{e}} and
               Christophe Servan},
  title     = {Using Whole Document Context in Neural Machine Translation},
-  journal   = {CoRR},
+  publisher = {The International Workshop on Spoken Language Translation},
-  volume    = {abs/1910.07481},
  year      = {2019}
 }
-@article{DBLP:journals/corr/JeanLFC17,
+@inproceedings{DBLP:journals/corr/JeanLFC17,
  author    = {S{\'{e}}bastien Jean and
               Stanislas Lauly and
               Orhan Firat and
               Kyunghyun Cho},
  title     = {Does Neural Machine Translation Benefit from Larger Context?},
-  journal   = {CoRR},
+  publisher = {CoRR},
  volume    = {abs/1704.05135},
  year      = {2017}
 }
@@ -11823,12 +11822,12 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-2010-12827,
+@inproceedings{DBLP:journals/corr/abs-2010-12827,
  author    = {Amane Sugiyama and
               Naoki Yoshinaga},
  title     = {Context-aware Decoder for Neural Machine Translation using a Target-side
               Document-Level Language Model},
-  journal   = {CoRR},
+  publisher = {CoRR},
  volume    = {abs/2010.12827},
  year      = {2020}
 }
@@ -11970,14 +11969,13 @@ author    = {Zhuang Liu and
  publisher = {International Joint Conference on Artificial Intelligence},
  year      = {2020}
 }
-@article{DBLP:journals/tacl/TuLSZ18,
+@inproceedings{DBLP:journals/tacl/TuLSZ18,
  author    = {Zhaopeng Tu and
               Yang Liu and
               Shuming Shi and
               Tong Zhang},
  title     = {Learning to Remember Translation History with a Continuous Cache},
  publisher = {Transactions of the Association for Computational Linguistics},
-  volume    = {6},
  pages     = {407--420},
  year      = {2018}
 }
@@ -12071,7 +12069,7 @@ author    = {Zhuang Liu and
  publisher = {{AAAI} Press},
  year      = {2019}
 }
-@article{DBLP:journals/tacl/YuSSLKBD20,
+@inproceedings{DBLP:journals/tacl/YuSSLKBD20,
  author    = {Lei Yu and
               Laurent Sartran and
               Wojciech Stokowiec and
@@ -12080,16 +12078,16 @@ author    = {Zhuang Liu and
               Phil Blunsom and
               Chris Dyer},
  title     = {Better Document-Level Machine Translation with Bayes' Rule},
-  journal   = {Transactions of the Association for Computational Linguistics},
+  publisher = {Transactions of the Association for Computational Linguistics},
  volume    = {8},
  pages     = {346--360},
  year      = {2020}
 }
-@article{DBLP:journals/corr/abs-1903-04715,
+@inproceedings{DBLP:journals/corr/abs-1903-04715,
  author    = {S{\'{e}}bastien Jean and
               Kyunghyun Cho},
  title     = {Context-Aware Learning for Neural Machine Translation},
-  journal   = {CoRR},
+  publisher = {CoRR},
  volume    = {abs/1903.04715},
  year      = {2019}
 }
@@ -12111,7 +12109,7 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the European Association for Machine Translation},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-1911-03110,
+@inproceedings{DBLP:journals/corr/abs-1911-03110,
  author    = {Liangyou Li and
               Xin Jiang and
               Qun Liu},
@@ -12149,33 +12147,28 @@ author    = {Zhuang Liu and
  publisher = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
  year      = {2012}
 }
-@article{DBLP:journals/ftsig/GalesY07,
+@inproceedings{DBLP:journals/ftsig/GalesY07,
  author    = {Mark J. F. Gales and
               Steve J. Young},
  title     = {The Application of Hidden Markov Models in Speech Recognition},
-  journal   = {Found Trends Signal Process},
+  publisher = {Found Trends Signal Process},
-  volume    = {1},
-  number    = {3},
  pages     = {195--304},
  year      = {2007}
 }
-@article{DBLP:journals/taslp/MohamedDH12,
+@inproceedings{DBLP:journals/taslp/MohamedDH12,
  author    = {Abdel-rahman Mohamed and
               George E. Dahl and
               Geoffrey E. Hinton},
  title     = {Acoustic Modeling Using Deep Belief Networks},
-  journal   = {IEEE Transactions on Speech and Audio Processing},
+  publisher = {IEEE Transactions on Audio, Speech, and Language Processing},
-  volume    = {20},
-  number    = {1},
  pages     = {14--22},
  year      = {2012}
 }
-@article{DBLP:journals/spm/X12a,
+@inproceedings{DBLP:journals/spm/X12a,
+  author    = {G Hinton and L Deng and D Yu and GE Dahl and B Kingsbury},
  title     = {Deep Neural Networks for Acoustic Modeling in Speech Recognition:
               The Shared Views of Four Research Groups},
-  journal   = {IEEE Signal Processing Magazine},
+  publisher = {IEEE Signal Processing Magazine},
-  volume    = {29},
-  number    = {6},
  pages     = {82--97},
  year      = {2012}
 }
@@ -12232,14 +12225,14 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2016}
 }
-@article{DBLP:journals/corr/BerardPSB16,
+@inproceedings{DBLP:journals/corr/BerardPSB16,
  author    = {Alexandre Berard and
               Olivier Pietquin and
               Christophe Servan and
               Laurent Besacier},
  title     = {Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text
               Translation},
-  journal   = {CoRR},
+  publisher = {CoRR},
  volume    = {abs/1612.01744},
  year      = {2016}
 }
@@ -12277,16 +12270,14 @@ author    = {Zhuang Liu and
  publisher = {International Conference on Machine Learning},
  year      = {2006}
 }
-@article{DBLP:journals/jstsp/WatanabeHKHH17,
+@inproceedings{DBLP:journals/jstsp/WatanabeHKHH17,
  author    = {Shinji Watanabe and
               Takaaki Hori and
               Suyoun Kim and
               John R. Hershey and
               Tomoki Hayashi},
  title     = {Hybrid CTC/Attention Architecture for End-to-End Speech Recognition},
-  journal   = {IEEE Journal of Selected Topics in Signal Processing},
+  publisher = {IEEE Journal of Selected Topics in Signal Processing},
-  volume    = {11},
-  number    = {8},
  pages     = {1240--1253},
  year      = {2017}
 }
@@ -12300,15 +12291,13 @@ author    = {Zhuang Liu and
  publisher = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
  year      = {2017}
 }
-@article{DBLP:journals/pami/ShiBY17,
+@inproceedings{DBLP:journals/pami/ShiBY17,
  author    = {Baoguang Shi and
               Xiang Bai and
               Cong Yao},
  title     = {An End-to-End Trainable Neural Network for Image-Based Sequence Recognition
               and Its Application to Scene Text Recognition},
-  journal   = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence}, 
+  publisher = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence}, 
-  volume    = {39},
-  number    = {11},
  pages     = {2298--2304},
  year      = {2017}
 }
@@ -12396,16 +12385,16 @@ author    = {Zhuang Liu and
  title     = {Effectively pretraining a speech translation decoder with Machine
               Translation data},
  pages     = {8014--8020},
-  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2020}
 }
-@article{DBLP:journals/corr/abs-1802-06003,
+@inproceedings{DBLP:journals/corr/abs-1802-06003,
  author    = {Takatomo Kano and
               Sakriani Sakti and
               Satoshi Nakamura},
  title     = {Structured-based Curriculum Learning for End-to-end English-Japanese
               Speech Translation},
-  journal   = {CoRR},
+  publisher = {CoRR},
  volume    = {abs/1802.06003},
  year      = {2018}
 }
@@ -12424,7 +12413,6 @@ author    = {Zhuang Liu and
  author    = {Lawrence R. Rabiner and
               Biing-Hwang Juang},
  title     = {Fundamentals of speech recognition},
-  series    = {Prentice Hall signal processing series},
  publisher = {Prentice Hall},
  year      = {1993}
 }
@@ -12561,12 +12549,12 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }
-@article{Elliott2015MultilingualID,
+@inproceedings{Elliott2015MultilingualID,
  title={Multilingual Image Description with Neural Sequence Models},
  author={Desmond Elliott and 
          Stella Frank and 
 		  Eva Hasler},
-  journal={arXiv: Computation and Language},
+  publisher ={arXiv: Computation and Language},
  year={2015}
 }
 @inproceedings{DBLP:conf/wmt/MadhyasthaWS17,
@@ -12579,13 +12567,12 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{DBLP:journals/corr/CaglayanBB16,
+@inproceedings{DBLP:journals/corr/CaglayanBB16,
  author    = {Ozan Caglayan and
               Lo{\"{\i}}c Barrault and
               Fethi Bougares},
  title     = {Multimodal Attention for Neural Machine Translation},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1609.03976},
  year      = {2016}
 }
 @inproceedings{DBLP:conf/acl/CalixtoLC17,
@@ -12597,13 +12584,12 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{DBLP:journals/corr/DelbrouckD17,
+@inproceedings{DBLP:journals/corr/DelbrouckD17,
  author    = {Jean-Benoit Delbrouck and
               St{\'{e}}phane Dupont},
  title     = {Multimodal Compact Bilinear Pooling for Multimodal Neural Machine
               Translation},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1703.08084},
  year      = {2017}
 }
 @inproceedings{DBLP:conf/acl/LibovickyH17,
@@ -12614,22 +12600,20 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{DBLP:journals/corr/abs-1712-03449,
+@inproceedings{DBLP:journals/corr/abs-1712-03449,
  author    = {Jean-Benoit Delbrouck and
               St{\'{e}}phane Dupont},
  title     = {Modulating and attending the source image during encoding improves
               Multimodal Translation},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1712.03449},
  year      = {2017}
 }
-@article{DBLP:journals/corr/abs-1807-11605,
+@inproceedings{DBLP:journals/corr/abs-1807-11605,
  author    = {Hasan Sait Arslan and
               Mark Fishel and
               Gholamreza Anbarjafari},
  title     = {Doubly Attentive Transformer Machine Translation},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1807.11605},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/wmt/HelclLV18,
@@ -12721,7 +12705,6 @@ author    = {Zhuang Liu and
               Yoshua Bengio},
  title     = {Show, Attend and Tell: Neural Image Caption Generation with Visual
               Attention},
-  volume    = {37},
  pages     = {2048--2057},
  publisher = {International Conference on Machine Learning},
  year      = {2015}
@@ -12751,7 +12734,7 @@ author    = {Zhuang Liu and
  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
  year      = {2017}
 }
-@article{DBLP:journals/pami/FuJCSZ17,
+@inproceedings{DBLP:journals/pami/FuJCSZ17,
  author    = {Kun Fu and
               Junqi Jin and
               Runpeng Cui and
@@ -12759,9 +12742,7 @@ author    = {Zhuang Liu and
               Changshui Zhang},
  title     = {Aligning Where to See and What to Tell: Image Captioning with Region-Based
               Attention and Scene-Specific Contexts},
-  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
+  publisher = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
-  volume    = {39},
-  number    = {12},
  pages     = {2321--2334},
  year      = {2017}
 }
@@ -12772,8 +12753,6 @@ author    = {Zhuang Liu and
               Tao Mei},
  title     = {Exploring Visual Relationship for Image Captioning},
  series    = {Lecture Notes in Computer Science},
-  volume    = {11218},
-  pages     = {711--727},
  publisher = {European Conference on Computer Vision},
  year      = {2018}
 }
@@ -12788,21 +12767,19 @@ author    = {Zhuang Liu and
  publisher = {International Joint Conference on Artificial Intelligence},
  year      = {2017}
 }
-@article{DBLP:journals/corr/abs-1804-02767,
+@inproceedings{DBLP:journals/corr/abs-1804-02767,
  author    = {Joseph Redmon and
               Ali Farhadi},
  title     = {YOLOv3: An Incremental Improvement},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1804.02767},
  year      = {2018}
 }
-@article{DBLP:journals/corr/abs-2004-10934,
+@inproceedings{DBLP:journals/corr/abs-2004-10934,
  author    = {Alexey Bochkovskiy and
               Chien-Yao Wang and
               Hong-Yuan Mark Liao},
  title     = {YOLOv4: Optimal Speed and Accuracy of Object Detection},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/2004.10934},
  year      = {2020}
 }
 @inproceedings{DBLP:conf/cvpr/LuXPS17,
@@ -12840,15 +12817,13 @@ author    = {Zhuang Liu and
  publisher = {ACM Multimedia},
  year      = {2017}
 }
-@article{DBLP:journals/mta/FangWCT18,
+@inproceedings{DBLP:journals/mta/FangWCT18,
  author    = {Fang Fang and
               Hanli Wang and
               Yihao Chen and
               Pengjie Tang},
  title     = {Looking deeper and transferring attention for image captioning},
-  journal   = {Multimedia Tools Applications},
+  publisher = {Multimedia Tools Applications},
-  volume    = {77},
-  number    = {23},
  pages     = {31159--31175},
  year      = {2018}
 }
@@ -12861,12 +12836,11 @@ author    = {Zhuang Liu and
  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
  year      = {2018}
 }
-@article{DBLP:journals/corr/abs-1805-09019,
+@inproceedings{DBLP:journals/corr/abs-1805-09019,
  author    = {Qingzhong Wang and
               Antoni B. Chan},
  title     = {{CNN+CNN:} Convolutional Decoders for Image Captioning},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1805.09019},
  year      = {2018}
 }
 @inproceedings{DBLP:conf/eccv/DaiYL18,
@@ -12874,8 +12848,6 @@ author    = {Zhuang Liu and
               Deming Ye and
               Dahua Lin},
  title     = {Rethinking the Form of Latent States in Image Captioning},
-  volume    = {11209},
-  pages     = {294--310},
  publisher = {European Conference on Computer Vision},
  year      = {2018}
 }
@@ -12900,28 +12872,24 @@ author    = {Zhuang Liu and
               Alexander Kirillov and
               Sergey Zagoruyko},
  title     = {End-to-End Object Detection with Transformers},
-  volume    = {12346},
-  pages     = {213--229},
  publisher = {European Conference on Computer Vision},
  year      = {2020}
 }
-@article{DBLP:journals/tcsv/YuLYH20,
+@inproceedings{DBLP:journals/tcsv/YuLYH20,
  author    = {Jun Yu and
               Jing Li and
               Zhou Yu and
               Qingming Huang},
  title     = {Multimodal Transformer With Multi-View Visual Representation for Image
               Captioning},
-  journal   = {IEEE Transactions on Circuits and Systems for Video Technology},
+  publisher = {IEEE Transactions on Circuits and Systems for Video Technology},
-  volume    = {30},
-  number    = {12},
  pages     = {4467--4480},
  year      = {2020}
 }
-@article{Huasong2020SelfAdaptiveNM,
+@inproceedings{Huasong2020SelfAdaptiveNM,
  title={Self-Adaptive Neural Module Transformer for Visual Question Answering},
  author={Zhong Huasong and Jingyuan Chen and Chen Shen and Hanwang Zhang and Jianqiang Huang and Xian-Sheng Hua},
-  journal={IEEE Transactions on Multimedia},
+  publisher ={IEEE Transactions on Multimedia},
  year={2020},
  pages={1-1}
 }
@@ -12944,7 +12912,6 @@ author    = {Zhuang Liu and
               Xiaokang Yang},
  title     = {Semantic Equivalent Adversarial Data Augmentation for Visual Question
               Answering},
-  volume    = {12364},
  pages     = {437--453},
  publisher = {European Conference on Computer Vision},
  year      = {2020}
@@ -12963,7 +12930,6 @@ author    = {Zhuang Liu and
               Yejin Choi and
               Jianfeng Gao},
  title     = {Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
-  volume    = {12375},
  pages     = {121--137},
  publisher = {European Conference on Computer Vision},
  year      = {2020}
@@ -13005,23 +12971,21 @@ author    = {Zhuang Liu and
  pages     = {465--476},
  year      = {2017}
 }
-@article{DBLP:journals/corr/abs-1908-06616,
+@inproceedings{DBLP:journals/corr/abs-1908-06616,
  author    = {Hajar Emami and
               Majid Moradi Aliabadi and
               Ming Dong and
               Ratna Babu Chinnam},
  title     = {{SPA-GAN:} Spatial Attention {GAN} for Image-to-Image Translation},
-  journal   = {CoRR},
+  publisher = {IEEE Transactions on Multimedia},
-  volume    = {abs/1908.06616},
  year      = {2019}
 }
-@article{DBLP:journals/access/XiongWG19,
+@inproceedings{DBLP:journals/access/XiongWG19,
  author    = {Feng Xiong and
               Qianqian Wang and
               Quanxue Gao},
  title     = {Consistent Embedded {GAN} for Image-to-Image Translation},
-  journal   = {International Conference on Access Networks},
+  publisher = {IEEE Access},
-  volume    = {7},
  pages     = {126651--126661},
  year      = {2019}
 }
@@ -13063,12 +13027,11 @@ author    = {Zhuang Liu and
               Bernt Schiele and
               Honglak Lee},
  title     = {Generative Adversarial Text to Image Synthesis},
-  volume    = {48},
  pages     = {1060--1069},
  publisher = {International Conference on Machine Learning},
  year      = {2016}
 }
-@article{DBLP:journals/corr/DashGALA17,
+@inproceedings{DBLP:journals/corr/DashGALA17,
  author    = {Ayushman Dash and
               John Cristian Borges Gamboa and
               Sheraz Ahmed and
@@ -13076,8 +13039,7 @@ author    = {Zhuang Liu and
               Muhammad Zeshan Afzal},
  title     = {{TAC-GAN} - Text Conditioned Auxiliary Classifier Generative Adversarial
               Network},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1703.06412},
  year      = {2017}
 }
 @inproceedings{DBLP:conf/nips/ReedAMTSL16,
@@ -13142,12 +13104,11 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2018}
 }
-@article{DBLP:journals/corr/ChoE16,
+@inproceedings{DBLP:journals/corr/ChoE16,
  author    = {Kyunghyun Cho and
               Masha Esipova},
  title     = {Can neural machine translation do simultaneous translation?},
-  journal   = {CoRR},
+  publisher = {CoRR},
-  volume    = {abs/1606.02012},
  year      = {2016}
 }
 @inproceedings{DBLP:conf/eacl/NeubigCGL17,
@@ -13293,7 +13254,6 @@ author    = {Zhuang Liu and
  title     = {A Reinforcement Learning Approach to Interactive-Predictive Neural
               Machine Translation},
  publisher   = {CoRR},
-  volume    = {abs/1805.01553},
  year      = {2018}
 }
 @inproceedings{DBLP:journals/mt/DomingoPC17,
@@ -13302,8 +13262,6 @@ author    = {Zhuang Liu and
               Francisco Casacuberta},
  title     = {Segment-based interactive-predictive machine translation},
  publisher   = {Machine Translation},
-  volume    = {31},
-  number    = {4},
  pages     = {163--185},
  year      = {2017}
 }
@@ -13321,8 +13279,6 @@ author    = {Zhuang Liu and
               Juan Miguel Vilar},
  title     = {Statistical Approaches to Computer-Assisted Translation},
  publisher   = {Computational Linguistics},
-  volume    = {35},
-  number    = {1},
  pages     = {3--28},
  year      = {2009}
 }
@@ -13351,7 +13307,6 @@ author    = {Zhuang Liu and
  title     = {TurboTransformers: An Efficient {GPU} Serving System For Transformer
               Models},
  publisher   = {CoRR},
-  volume    = {abs/2010.05680},
  year      = {2020}
 }
 @inproceedings{DBLP:conf/iclr/HuangCLWMW18,