Merge branch 'caorunzhe' into 'mengxia'

Caorunzhe See merge request !650

Commit 1355fbc2 authored Dec 21, 2020 by 孟霞
--- a/Chapter1/chapter1.tex
+++ b/Chapter1/chapter1.tex
@@ -186,7 +186,7 @@
 \includegraphics[scale=0.3]{./Chapter1/Figures/figure-wmt-participation.jpg}
 \includegraphics[scale=0.3]{./Chapter1/Figures/figure-wmt-bestresults.jpg}
 \setlength{\belowcaptionskip}{-1.5em}
-    \caption{WMT\ 19国际机器翻译大赛（左：WMT\ 19参赛队伍；右：WMT\ 19各项目的最好分数结果）}
+    \caption{WMT\ 19国际机器翻译大赛（左：WMT\ 19参赛队伍；右：WMT\ 19各项目的最好分数）}
    \label{fig:1-5}
 \end{figure}
 %-------------------------------------------
@@ -200,7 +200,7 @@
 \sectionnewpage
 \section{机器翻译现状及挑战}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\parinterval 机器翻译技术发展到今天已经过无数次迭代，技术范式也经过若干次更替，近些年机器翻译的应用也如雨后春笋相继浮现。今天的机器翻译的质量究竟如何呢？乐观地说，在很多特定的条件下，机器翻译的译文结果是非常不错的，甚至可以接近人工翻译的结果。然而，在开放式翻译任务中，机器翻译的结果还并不完美。更严格来说，机器翻译的质量远没有达到人们所期望的程度。对于有些人提到的“机器翻译代替人工翻译”也并不是事实。比如，在高精度同声传译任务中，机器翻译仍需要更多打磨；再比如，针对于小说的翻译，机器翻译还无法做到与人工翻译媲美；甚至有人尝试用机器翻译系统翻译中国古代诗词，这里更多的是娱乐的味道。但是毫无疑问的是，机器翻译可以帮助人类，甚至有朝一日可以代替一些低端的人工翻译工作。
+\parinterval 机器翻译技术发展到今天已经过无数次迭代，技术范式也经过若干次更替，近些年机器翻译的应用也如雨后春笋相继浮现。今天的机器翻译的质量究竟如何呢？乐观地说，在很多特定的条件下，机器翻译的译文结果是非常不错的，甚至可以接近人工翻译的结果。然而，在开放式翻译任务中，机器翻译的结果还并不完美。更严格来说，机器翻译的质量远没有达到人们所期望的程度。对于有些人提到的“机器翻译将代替人工翻译”也并不是事实。比如，在高精度同声传译任务中，机器翻译仍需要更多打磨；再比如，针对于小说的翻译，机器翻译还无法做到与人工翻译媲美；甚至有人尝试用机器翻译系统翻译中国古代诗词，这里更多的是娱乐的味道。但是毫无疑问的是，机器翻译可以帮助人类，甚至有朝一日可以代替一些低端的人工翻译工作。

 \parinterval 图\ref{fig:1-6}展示了机器翻译和人工翻译质量的一个对比结果。在汉语到英语的新闻翻译任务中，如果对译文进行人工评价（五分制），那么机器翻译的译文得分为3.9分，人工译文得分为4.7分（人的翻译也不是完美的）。可见，在这个任务中机器翻译表现不错，但是与人还有一定差距。如果换一种方式评价，把人的译文作为参考答案，用机器翻译的译文与其进行比对（百分制），会发现机器翻译的得分只有47分。当然，这个结果并不是说机器翻译的译文质量很差，它更多的是表明机器翻译系统可以生成一些与人工翻译不同的译文，机器翻译也具有一定的创造性。这也类似于，很多围棋选手都想向AlphaGo学习，因为智能围棋系统也可以走出一些人类从未走过的妙招。

@@ -549,7 +549,7 @@
 \vspace{0.5em}
 \item EMNLP，全称Conference on Empirical Methods in Natural Language Processing，自然语言处理另一个顶级会议之一，由ACL当中对语言数据和经验方法有特殊兴趣的团体主办，始于1996年。会议比较偏重于方法和经验性结果。
 \vspace{0.5em}
-\item MT Summit，全称Machine Translation Summit，是机器翻译领域的重要峰会。该会议的特色是与产业结合，在探讨机器翻译技术问题的同时，更多的关注机器翻译的应用落地工作，因此备受产业界关注。该会议每两年举办一次，通常由欧洲机器翻译协会（The European Association for Machine Translation，EAMT）、美国机器翻译协会（The Association for Machine Translation in the Americas，AMTA）、亚洲-太平洋地区机器翻译协会（Asia-Pacific Association for Machine Translation，AAMT）。
+\item MT Summit，全称Machine Translation Summit，是机器翻译领域的重要峰会。该会议的特色是与产业结合，在探讨机器翻译技术问题的同时，更多的关注机器翻译的应用落地工作，因此备受产业界关注。该会议每两年举办一次，通常由欧洲机器翻译协会（The European Association for Machine Translation，EAMT）、美国机器翻译协会（The Association for Machine Translation in the Americas，AMTA）、亚洲-太平洋地区机器翻译协会（Asia-Pacific Association for Machine Translation，AAMT）举办。
 \vspace{0.5em}
 \item NAACL，全称Annual Conference of the North American Chapter of the Association for Computational Linguistics，为ACL北美分会，在自然语言处理领域也属于顶级会议，每年会选择一个北美城市召开会议。
 \vspace{0.5em}

--- a/Chapter10/Figures/figure-presentation-space.tex
+++ b/Chapter10/Figures/figure-presentation-space.tex
@@ -15,7 +15,7 @@
 \draw [->] ([xshift=1pt]unit2.east) .. controls +(east:0.5) and +(south:0.2) .. ([xshift=0.2em,yshift=-1pt]unitbox.south);

 \node [anchor=south] (spacelabel1) at (space1.north) {\scriptsize{离散表示空间}};
-\node [anchor=north] (captain1) at ([yshift=-0.5em]space1.south) {\scriptsize{(a) \textbf{统计机器翻译}}};
+\node [anchor=north] (captain1) at ([yshift=-0.5em]space1.south) {\small{(a) \textbf{统计机器翻译}}};

 \end{scope}

@@ -32,7 +32,7 @@

 \node [anchor=south] (spacelabel1) at (space1.north) {\scriptsize{离散表示空间}};
 \node [anchor=south] (spacelabel2) at (space2.north) {\scriptsize{连续表示空间}};
-\node [anchor=north] (captain1) at ([yshift=-0.5em,xshift=1em]space1.south east) {\scriptsize{(b) \textbf{神经机器翻译}}};
+\node [anchor=north] (captain1) at ([yshift=-0.5em,xshift=1em]space1.south east) {\small{(b) \textbf{神经机器翻译}}};

 \end{scope}


--- a/Chapter10/chapter10.tex
+++ b/Chapter10/chapter10.tex
@@ -441,13 +441,13 @@ NMT                     & 21.7          & 18.7           & -13.7      \\

 \parinterval 从数学模型上看，神经机器翻译模型与统计机器翻译的目标是一样的：在给定源语言句子$\seq{x}$的情况下，找出翻译概率最大的目标语言译文$\hat{\seq{y}}$，其计算如公式\eqref{eq:10-1}所示:
 \begin{eqnarray}
-\hat{\seq{{y}}} = \argmax_{\seq{{y}}} \funp{P} (\seq{{y}} | \seq{{x}})
+\hat{\seq{{y}}} &=& \argmax_{\seq{{y}}} \funp{P} (\seq{{y}} | \seq{{x}})
 \label{eq:10-1}
 \end{eqnarray}

 \noindent 这里，用$\seq{{x}}=\{ x_1,x_2,..., x_m \}$表示输入的源语言单词序列，$\seq{{y}}=\{ y_1,y_2,..., y_n \}$ 表示生成的目标语言单词序列。由于神经机器翻译在生成译文时采用的是自左向右逐词生成的方式，并在翻译每个单词时考虑已经生成的翻译结果，因此对$ \funp{P} (\seq{{y}} | \seq{{x}})$的求解可以转换为公式\eqref{eq:10-2}所示过程：
 \begin{eqnarray}
-\funp{P} (\seq{{y}} | \seq{{x}}) = \prod_{j=1}^{n} \funp{P} ( y_j | \seq{{y}}_{<j }, \seq{{x}}  )
+\funp{P} (\seq{{y}} | \seq{{x}}) &=& \prod_{j=1}^{n} \funp{P} ( y_j | \seq{{y}}_{<j }, \seq{{x}}  )
 \label{eq:10-2}
 \end{eqnarray}

@@ -465,12 +465,12 @@ NMT                     & 21.7          & 18.7           & -13.7      \\
 \vspace{0.5em}
 \item	如何得到每个目标语言单词的概率，即译文单词的{\small\sffamily\bfseries{生成}}\index{生成}（Generation）\index{Generation}。与神经语言模型一样，可以用一个Softmax输出层来获取当前时刻所有单词的分布，即利用Softmax 函数计算目标语言词表中每个单词的概率。令目标语言序列$j$时刻的循环神经网络的输出向量（或状态）为$\mathbi{s}_j$。根据循环神经网络的性质，$ y_j$ 的生成只依赖前一个状态$\mathbi{s}_{j-1}$和当前时刻的输入（即词嵌入$\textrm{e}_y (y_{j-1})$）。同时考虑源语言信息$\mathbi{C}$，$\funp{P}(y_j  | \seq{{y}}_{<j},\seq{{x}})$可以被重新定义为公式\eqref{eq:10-3}：
 \begin{eqnarray}
-\funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}}) = \funp{P} ( {y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}} )
+\funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}}) &=& \funp{P} ( {y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}} )
 \label{eq:10-3}
 \end{eqnarray}
 $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softmax的输入是循环神经网络$j$时刻的输出。在具体实现时，$\mathbi{C}$可以被简单地作为第一个时刻循环单元的输入，即，当$j=1$ 时，解码器的循环神经网络会读入编码器最后一个隐层状态$ \mathbi{h}_m$（也就是$\mathbi{C}$），而其他时刻的隐层状态不直接与$\mathbi{C}$相关。最终，$\funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}})$ 被表示为公式\eqref{eq:10-4}：
 \begin{eqnarray}
-\funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}}) =
+\funp{P} (y_j | \seq{{y}}_{<j},\seq{{x}}) &=&
 \left \{ \begin{array}{ll}
 \funp{P} (y_j |\mathbi{C} ,y_{j-1}) &j=1 \\
 \funp{P} (y_j|\mathbi{s}_{j-1},y_{j-1})  \quad &j>1
@@ -492,7 +492,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \parinterval 输入层（词嵌入）和输出层（Softmax）的内容已在{\chapternine}进行了介绍，因此这里的核心内容是设计循环神经网络结构，即设计循环单元的结构。至今，研究人员已经提出了很多优秀的循环单元结构。其中循环神经网络（RNN）
 是最原始的循环单元结构。在RNN中，对于序列$\seq{{x}}=\{ \mathbi{x}_1, \mathbi{x}_2,...,\mathbi{x}_m \}$，每个时刻$t$都对应一个循环单元，它的输出是一个向量$\mathbi{h}_t$，可以被描述为公式\eqref{eq:10-5}：
 \begin{eqnarray}
-\mathbi{h}_t=f(\mathbi{x}_t \mathbi{U}+\mathbi{h}_{t-1} \mathbi{W}+\mathbi{b})
+\mathbi{h}_t &=& f(\mathbi{x}_t \mathbi{U}+\mathbi{h}_{t-1} \mathbi{W}+\mathbi{b})
 \label{eq:10-5}
 \end{eqnarray}

@@ -514,10 +514,10 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\subfigure[遗忘门]{\input{./Chapter10/Figures/figure-lstm01}}
-\subfigure[输入门]{\input{./Chapter10/Figures/figure-lstm02}}
-\subfigure[记忆更新]{\input{./Chapter10/Figures/figure-lstm03}}
-\subfigure[输出门]{\input{./Chapter10/Figures/figure-lstm04}}
+\subfigure[{\small 遗忘门}]{\input{./Chapter10/Figures/figure-lstm01}}
+\subfigure[{\small 输入门}]{\input{./Chapter10/Figures/figure-lstm02}}
+\subfigure[{\small 记忆更新}]{\input{./Chapter10/Figures/figure-lstm03}}
+\subfigure[{\small 输出门}]{\input{./Chapter10/Figures/figure-lstm04}}
 \caption{LSTM中的门控结构}
 \label{fig:10-11}
 \end{figure}
@@ -529,7 +529,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \vspace{0.5em}
 \item {\small\sffamily\bfseries{遗忘}}\index{遗忘}。顾名思义，遗忘的目的是忘记一些历史，在LSTM中通过遗忘门实现，其结构如图\ref{fig:10-11}(a)所示。$\mathbi{x}_{t}$表示时刻$t$的输入向量，$\mathbi{h}_{t-1}$是时刻$t-1$的循环单元的输出，$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$都作为$t$时刻循环单元的输入。$\sigma$将对$\mathbi{x}_{t}$和$\mathbi{h}_{t-1}$进行筛选，以决定遗忘的信息，其计算如公式\eqref{eq:10-6}所示：
 \begin{eqnarray}
-\mathbi{f}_t=\sigma(\mathbi{W}_f [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_f )
+\mathbi{f}_t &=& \sigma(\mathbi{W}_f [\mathbi{h}_{t-1},\mathbi{x}_{t}] + \mathbi{b}_f )
 \label{eq:10-6}
 \end{eqnarray}

@@ -543,7 +543,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 之后，用$\mathbi{i}_t$点乘$\hat{\mathbi{c}}_t$，得到当前需要记忆的信息，记为$\mathbi{i}_t \cdot  \hat{\mathbi{c}}_t$。接下来需要更新旧的信息$\mathbi{c}_{t-1}$，得到新的记忆信息$\mathbi{c}_t$，更新的操作如图\ref{fig:10-11}(c)红色线部分所示，“$\bigoplus$”表示相加。具体规则是通过遗忘门选择忘记一部分上文信息$\mathbi{f}_t$，通过输入门计算新增的信息$\mathbi{i}_t \cdot  \hat{\mathbi{c}}_t$，然后根据“$\bigotimes$”门与“$\bigoplus$”门进行相应的乘法和加法计算，如公式\eqref{eq:10-9}：
 \begin{eqnarray}
-\mathbi{c}_t = \mathbi{f}_t \cdot \mathbi{c}_{t-1} + \mathbi{i}_t  \cdot \hat{\mathbi{c}_t}
+\mathbi{c}_t &=& \mathbi{f}_t \cdot \mathbi{c}_{t-1} + \mathbi{i}_t  \cdot \hat{\mathbi{c}_t}
 \label{eq:10-9}
 \end{eqnarray}
 \vspace{-1.0em}
@@ -578,9 +578,9 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 %----------------------------------------------
 \begin{figure}[htp]
 \centering
-\subfigure[重置门]{\input{./Chapter10/Figures/figure-gru01}}
-\subfigure[更新门]{\input{./Chapter10/Figures/figure-gru02}}
-\subfigure[隐藏状态更新]{\input{./Chapter10/Figures/figure-gru03}}
+\subfigure[{\small 重置门}]{\input{./Chapter10/Figures/figure-gru01}}
+\subfigure[{\small 更新门}]{\input{./Chapter10/Figures/figure-gru02}}
+\subfigure[{\small 隐藏状态更新}]{\input{./Chapter10/Figures/figure-gru03}}
 \caption{GRU中的门控结构}
 \label{fig:10-13}
 \end{figure}
@@ -594,13 +594,13 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \parinterval 当完成了重置门和更新门计算后，就需要更新当前隐藏状态，如图\ref{fig:10-13}(c)所示。在计算得到了重置门的权重$\mathbi{r}_t$后，使用其对前一时刻的状态$\mathbi{h}_{t-1}$进行重置($\mathbi{r}_t \cdot \mathbi{h}_{t-1}$)，将重置后的结果与$\mathbi{x}_t$拼接，通过Tanh激活函数将数据变换到[-1,1]范围内，具体计算如公式\eqref{eq:10-14}：
 \begin{eqnarray}
-\hat{\mathbi{h}}_t = \textrm{Tanh} (\mathbi{W}_h [\mathbi{r}_t \cdot \mathbi{h}_{t-1},\mathbi{x}_{t}])
+\hat{\mathbi{h}}_t &=& \textrm{Tanh} (\mathbi{W}_h [\mathbi{r}_t \cdot \mathbi{h}_{t-1},\mathbi{x}_{t}])
 \label{eq:10-14}
 \end{eqnarray}

 \parinterval $\hat{\mathbi{h}}_t$在包含了输入信息$\mathbi{x}_t$的同时，引入了$\mathbi{h}_{t-1}$的信息，可以理解为，记忆了当前时刻的状态。下一步是计算更新后的隐藏状态也就是更新记忆，如公式\eqref{eq:10-15}所示：
 \begin{eqnarray}
-\mathbi{h}_t = (1-\mathbi{u}_t) \cdot \mathbi{h}_{t-1} +\mathbi{u}_t \cdot \hat{\mathbi{h}}_t
+\mathbi{h}_t &=& (1-\mathbi{u}_t) \cdot \mathbi{h}_{t-1} +\mathbi{u}_t \cdot \hat{\mathbi{h}}_t
 \label{eq:10-15}
 \end{eqnarray}

@@ -721,7 +721,7 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm

 \parinterval 根据这种思想，上下文向量$\mathbi{C}_j$被定义为对不同时间步编码器输出的状态序列$\{ \mathbi{h}_1, \mathbi{h}_2,...,\mathbi{h}_m \}$进行加权求和，如公式\eqref{eq:10-16}所示：
 \begin{eqnarray}
-\mathbi{C}_j=\sum_{i} \alpha_{i,j} \mathbi{h}_i
+\mathbi{C}_j&=&\sum_{i} \alpha_{i,j} \mathbi{h}_i
 \label{eq:10-16}
 \end{eqnarray}

@@ -742,13 +742,13 @@ $\funp{P}({y_j | \mathbi{s}_{j-1} ,y_{j-1},\mathbi{C}})$由Softmax实现，Softm
 \vspace{0.5em}
 \item	使用目标语言上一时刻循环单元的输出$\mathbi{s}_{j-1}$与源语言第$i$个位置的表示$\mathbi{h}_i$之间的相关性，其用来表示目标语言位置$j$对源语言位置$i$的关注程度，记为$\beta_{i,j}$，由函数$a(\cdot)$实现，其具体计算如公式\eqref{eq:10-17}所示：
 \begin{eqnarray}
-\beta_{i,j} = a(\mathbi{s}_{j-1},\mathbi{h}_i)
+\beta_{i,j} &=& a(\mathbi{s}_{j-1},\mathbi{h}_i)
 \label{eq:10-17}
 \end{eqnarray}

 $a(\cdot)$可以被看作是目标语言表示和源语言表示的一种“统一化”，即把源语言和目标语言表示映射在同一个语义空间，进而语义相近的内容有更大的相似性。该函数有多种计算方式，比如，向量乘、向量夹角和单层神经网络等，具体数学表达如公式\eqref{eq:10-18}：
 \begin{eqnarray}
-a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
+a (\mathbi{s},\mathbi{h}) &=&  \left\{ \begin{array}{ll}
    \mathbi{s} \mathbi{h}^{\textrm{T}} & \textrm{向量乘} \\
    \textrm{cos}(\mathbi{s}, \mathbi{h}) & \textrm{向量夹角} \\
    \mathbi{s} \mathbi{W} \mathbi{h}^{\textrm{T}} & \textrm{线性模型} \\
@@ -763,7 +763,7 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}
 \item	进一步，利用Softmax函数，将相关性系数$\beta_{i,j}$进行指数归一化处理，得到注意力权重$\alpha_{i,j}$，具体计算如公式\eqref{eq:10-19}：
 \vspace{0.5em}
 \begin{eqnarray}
-\alpha_{i,j}=\frac{\textrm{exp}(\beta_{i,j})} {\sum_{i'} \textrm{exp}(\beta_{i',j})}
+\alpha_{i,j} &=& \frac{\textrm{exp}(\beta_{i,j})} {\sum_{i'} \textrm{exp}(\beta_{i',j})}
 \label{eq:10-19}
 \end{eqnarray}
 \vspace{0.5em}
@@ -795,7 +795,7 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}

 \parinterval 在\ref{sec:10.3.1}节中，公式\eqref{eq:10-4}描述了目标语言单词生成概率$ \funp{P} (y_j | \mathbi{y}_{<j},\mathbi{x})$。在引入注意力机制后，不同时刻的上下文向量$\mathbi{C}_j$替换了传统模型中固定的句子表示$\mathbi{C}$。描述如公式\eqref{eq:10-20}：
 \begin{eqnarray}
-\funp{P} (y_j | \mathbi{y}_{<j},\mathbi{x}) \equiv \funp{P} (y_j | \mathbi{s}_{j-1},y_{j-1},\mathbi{C}_j )
+\funp{P} (y_j | \mathbi{y}_{<j},\mathbi{x}) &=& \funp{P} (y_j | \mathbi{s}_{j-1},y_{j-1},\mathbi{C}_j )
 \label{eq:10-20}
 \end{eqnarray}

@@ -839,7 +839,7 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}

 \parinterval 也可以用这个系统描述翻译中的注意力问题，其中，$\mathrm{query}$即目标语言位置$j$的某种表示，$\mathrm{key}$和$\mathrm{value}$即源语言每个位置$i$上的${\mathbi{h}_i}$（这里$\mathrm{key}$和$\mathrm{value}$是相同的）。但是，这样的系统在机器翻译问题上并不好用，因为目标语言的表示和源语言的表示都在多维实数空间上，所以无法要求两个实数向量像字符串一样进行严格匹配，或者说这种严格匹配的模型可能会导致$\mathrm{query}$几乎不会命中任何的$\mathrm{key}$。既然无法严格精确匹配，注意力机制就采用了一个“模糊”匹配的方法。这里定义每个$\mathrm{key}_i$和$\mathrm{query}$ 都有一个0～1之间的匹配度，这个匹配度描述了$\mathrm{key}_i$和$\mathrm{query}$之间的相关程度，记为$\alpha_i$。而查询的结果（记为$\overline{\mathrm{value}}$）也不再是某一个单元的$\mathrm{value}$，而是所有单元$\mathrm{value}$用$\alpha_i$的加权和，具体计算如公式\eqref{eq:10-21}：
 \begin{eqnarray}
-\overline{\mathrm{value}} = \sum_i \alpha_i \cdot {\mathrm{value}}_i
+\overline{\mathrm{value}} &=& \sum_i \alpha_i \cdot {\mathrm{value}}_i
 \label{eq:10-21}
 \end{eqnarray}

@@ -858,15 +858,15 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}

 \parinterval 最后，从统计学的角度，如果把$\alpha_i$作为每个$\mathrm{value}_i$出现的概率的某种估计，即：$ \funp{P} (\mathrm{value}_i$) $= \alpha_i$，于是可以把公式\eqref{eq:10-21}重写为公式\eqref{eq:10-22}：
 \begin{eqnarray}
-\overline{\mathrm{value}} = \sum_i \funp{P} ( {\mathrm{value}}_i) \cdot {\mathrm{value}}_i
+\overline{\mathrm{value}} &=& \sum_i \funp{P} ( {\mathrm{value}}_i) \cdot {\mathrm{value}}_i
 \label{eq:10-22}
 \end{eqnarray}

 \noindent 显然， $\overline{\mathrm{value}}$就是$\mathrm{value}_i$在分布$ \funp{P}( \mathrm{value}_i$)下的期望，即公式\eqref{eq:10-23}：
-\begin{equation}
-\mathbb{E}_{\sim \\ \funp{P} ( {\mathrm{\mathrm{value}}}_i )} ({\mathrm{value}}_i) = \sum_i \funp{P} ({\mathrm{value}}_i) \cdot {\mathrm{value}}_i
+\begin{eqnarray}
+\mathbb{E}_{\sim \funp{P} ( {\mathrm{\mathrm{value}}}_i )} ({\mathrm{value}}_i) &=& \sum_i \funp{P} ({\mathrm{value}}_i) \cdot {\mathrm{value}}_i
 \label{eq:10-23}
-\end{equation}
+\end{eqnarray}

 从这个观点看，注意力机制实际上是得到了变量$\mathrm{value}$的期望。当然，严格意义上说，$\alpha_i$并不是从概率角度定义的，在实际应用中也并不必须追求严格的统计学意义。

@@ -925,7 +925,7 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}

 \parinterval 在基于梯度的方法中，模型参数可以通过损失函数$L$对参数的梯度进行不断更新。对于第$\textrm{step}$步参数更新，首先进行神经网络的前向计算，之后进行反向计算，并得到所有参数的梯度信息，再使用公式\eqref{eq:10-24}的规则进行参数更新：
 \begin{eqnarray}
-\mathbi{w}_{\textrm{step}+1} = \mathbi{w}_{\textrm{step}} - \alpha \cdot \frac{ \partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }
+\mathbi{w}_{\textrm{step}+1} &=& \mathbi{w}_{\textrm{step}} - \alpha \cdot \frac{ \partial L(\mathbi{w}_{\textrm{step}})} {\partial \mathbi{w}_{\textrm{step}} }
 \label{eq:10-24}
 \end{eqnarray}

@@ -941,13 +941,13 @@ a (\mathbi{s},\mathbi{h}) =  \left\{ \begin{array}{ll}

 \parinterval 神经机器翻译在目标端的每个位置都会输出一个概率分布，表示这个位置上不同单词出现的可能性。设计损失函数时，需要知道当前位置输出的分布相比于标准答案的“差异”。对于这个问题，常用的是交叉熵损失函数。令$\mathbi{y}$表示机器翻译模型输出的分布，$\hat{\mathbi{y}}$ 表示标准答案，则交叉熵损失可以被定义为公式\eqref{eq:10-25}：
 \begin{eqnarray}
-L_{\textrm{ce}}(\mathbi{y},\hat{\mathbi{y}}) = - \sum_{k=1}^{|V|} \mathbi{y}[k] \textrm{log} (\hat{\mathbi{y}}[k])
+L_{\textrm{ce}}(\mathbi{y},\hat{\mathbi{y}}) &=& - \sum_{k=1}^{|V|} \mathbi{y}[k] \textrm{log} (\hat{\mathbi{y}}[k])
 \label{eq:10-25}
 \end{eqnarray}

 \noindent 其中$\mathbi{y}[k]$ 和$\hat{\mathbi{y}}[k]$分别表示向量$\mathbi{y}$和$\hat{\mathbi{y}}$的第$k$维，$|V|$表示输出向量的维度（等于词表大小）。假设有$n$个训练样本，模型输出的概率分布为$\mathbi{Y} = \{ \mathbi{y}_1,\mathbi{y}_2,..., \mathbi{y}_n \}$，标准答案的分布$\widehat{\mathbi{Y}}=\{ \hat{\mathbi{y}}_1, \hat{\mathbi{y}}_2,...,\hat{\mathbi{y}}_n \}$。这个训练样本集合上的损失函数可以被定义为公式\eqref{eq:10-26}：
 \begin{eqnarray}
-L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\hat{\mathbi{y}}_j)
+L(\mathbi{Y},\widehat{\mathbi{Y}}) &=& \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\hat{\mathbi{y}}_j)
 \label{eq:10-26}
 \end{eqnarray}

@@ -1002,7 +1002,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \parinterval 需要注意的是，训练循环神经网络时，反向传播使得网络层之间的梯度相乘。在网络层数过深时，如果连乘因子小于1可能造成梯度指数级的减少，甚至趋近于0，导致网络无法优化，也就是梯度消失问题。当连乘因子大于1时，可能会导致梯度的乘积变得异常大，造成梯度爆炸的问题。在这种情况下需要使用“梯度裁剪”来防止梯度超过阈值。梯度裁剪在{\chapternine}已经介绍过，这里简单回顾一下。梯度裁剪的具体公式如公式\eqref{eq:10-28}所示：
 \vspace{-0.5em}
 \begin{eqnarray}
-\mathbi{w}' = \mathbi{w} \cdot \frac{\gamma} {\textrm{max}(\gamma,\| \mathbi{w} \|_2)}
+\mathbi{w}' &=& \mathbi{w} \cdot \frac{\gamma} {\textrm{max}(\gamma,\| \mathbi{w} \|_2)}
 \label{eq:10-28}
 \end{eqnarray}
 %\vspace{0.5em}
@@ -1034,7 +1034,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \parinterval 图\ref{fig:10-26}展示了一种常用的学习率调整策略。它分为两个阶段：预热阶段和衰减阶段。模型训练初期梯度通常很大，如果直接使用较大的学习率很容易让模型陷入局部最优。学习率的预热阶段便是通过在训练初期使学习率从小到大逐渐增加来减缓在初始阶段模型“跑偏”的现象。一般来说，初始学习率太高会使得模型进入一种损失函数曲面非常不平滑的区域，进而使得模型进入一种混乱状态，后续的优化过程很难取得很好的效果。一个常用的学习率预热方法是{\small\bfnew{逐渐预热}}\index{逐渐预热}（Gradual Warmup）\index{Gradual Warmup}。假设预热的更新次数为$N$，初始学习率为$\alpha_0$，则预热阶段第$\textrm{step}$次更新的学习率计算如公式\eqref{eq:10-29}所示：
 %\vspace{0.5em}
 \begin{eqnarray}
-\alpha_t = \frac{\textrm{step}}{N} \alpha_0 \quad,\quad 1 \leq t \leq T'
+\alpha_t &=& \frac{\textrm{step}}{N} \alpha_0 \quad,\quad 1 \leq t \leq T'
 \label{eq:10-29}
 \end{eqnarray}
 %-------
@@ -1110,9 +1110,9 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\
 \begin{figure}[htp]
 \centering
 \begin{tabular}{l l}
-\subfigure[]{\input{./Chapter10/Figures/figure-process01}} &\subfigure[]{\input{./Chapter10/Figures/figure-process02}} \\
-\subfigure[]{\input{./Chapter10/Figures/figure-process03}}  &\subfigure[]{\input{./Chapter10/Figures/figure-process04}} \\
-\subfigure[]{\input{./Chapter10/Figures/figure-process05}}  &\subfigure[]{\input{./Chapter10/Figures/figure-process06}}
+\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process01}} &\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process02}} \\
+\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process03}}  &\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process04}} \\
+\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process05}}  &\subfigure[{\small }]{\input{./Chapter10/Figures/figure-process06}}
 \end{tabular}
 \caption{一个三层循环神经网络的模型并行过程}
 \label{fig:10-28}
@@ -1133,13 +1133,13 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\

 \parinterval 在具体实现时，由于当前目标语言单词的生成需要依赖前面单词的生成，因此无法同时生成所有的目标语言单词。理论上，可以枚举所有的$\seq{{y}}$，之后利用$\funp{P}(\seq{{y}} | \seq{{x}})$ 的定义对每个$\seq{{y}}$进行评价，然后找出最好的$\seq{{y}}$。这也被称作{\small\bfnew{全搜索}}\index{全搜索}（Full Search）\index{Full Search}。但是，枚举所有的译文单词序列显然是不现实的。因此，在具体实现时，并不会访问所有可能的译文单词序列，而是用某种策略进行有效的搜索。常用的做法是自左向右逐词生成。比如，对于每一个目标语言位置$j$，可以执行公式\eqref{eq:10-31}的过程：
 \begin{eqnarray}
-\hat{y}_j = \argmax_{y_j} \funp{P}(y_j | \hat{\seq{{y}}}_{<j} , \seq{{x}})
+\hat{y}_j &=& \argmax_{y_j} \funp{P}(y_j | \hat{\seq{{y}}}_{<j} , \seq{{x}})
 \label{eq:10-31}
 \end{eqnarray}

 \noindent 其中，$\hat{y}_j$表示位置$j$概率最高的单词，$\hat{\seq{{y}}}_{<j} = \{ \hat{y}_1,...,\hat{y}_{j-1} \}$表示已经生成的最优译文单词序列。也就是，把最优的译文看作是所有位置上最优单词的组合。显然，这是一种贪婪搜索，因为无法保证$\{ \hat{y}_1,...,\hat{y}_{n} \}$是全局最优解。一种缓解这个问题的方法是，在每步中引入更多的候选。这里定义$\hat{y}_{jk} $ 表示在目标语言第$j$个位置排名在第$k$位的单词。在每一个位置$j$，可以生成$k$个最可能的单词，而不是1个，这个过程可以被描述为公式\eqref{eq:10-32}：
 \begin{eqnarray}
-\{ \hat{y}_{j1},...,\hat{y}_{jk} \} = \argmax_{ \{ \hat{y}_{j1},...,\hat{y}_{jk} \} }
+\{ \hat{y}_{j1},...,\hat{y}_{jk} \} &=& \argmax_{ \{ \hat{y}_{j1},...,\hat{y}_{jk} \} }
 \funp{P}(y_j | \{ \hat{\seq{{y}}}_{<{j\ast}} \},\seq{{x}})
 \label{eq:10-32}
 \end{eqnarray}
@@ -1218,13 +1218,13 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\

 \parinterval 为了解决上面提到的问题，可以使用其他特征与$\textrm{log } \funp{P} (\seq{{y}} | \seq{{x}})$一起组成新的模型得分$\textrm{score} ( \seq{{y}} , \seq{{x}})$。针对模型倾向于生成短句子的问题，常用的做法是引入惩罚机制。比如，可以定义一个惩罚因子，具体如公式\eqref{eq:10-33}：
 \begin{eqnarray}
-\textrm{lp}(\seq{{y}}) = \frac {(5+ |\seq{{y}}|)^{\alpha}} {(5+1)^{\alpha}}
+\textrm{lp}(\seq{{y}}) &=& \frac {(5+ |\seq{{y}}|)^{\alpha}} {(5+1)^{\alpha}}
 \label{eq:10-33}
 \end{eqnarray}

 \noindent 其中，$|\seq{{y}}|$代表已经得到的译文长度，$\alpha$是一个固定的常数，用于控制惩罚的强度。同时在计算句子得分时，额外引入表示覆盖度的因子，如公式\eqref{eq:10-34}所示：
 \begin{eqnarray}
-\textrm{cp}(\seq{{y}} , \seq{{x}}) = \beta \cdot \sum_{i=1}^{|\seq{{x}}|} \textrm{log} \big(\textrm{min}(\sum_j^{|\seq{{y}}|} \alpha_{ij},1 ) \big)
+\textrm{cp}(\seq{{y}} , \seq{{x}}) &=& \beta \cdot \sum_{i=1}^{|\seq{{x}}|} \textrm{log} \big(\textrm{min}(\sum_j^{|\seq{{y}}|} \alpha_{ij},1 ) \big)
 \label{eq:10-34}
 \end{eqnarray}

@@ -1232,7 +1232,7 @@ L(\mathbi{Y},\widehat{\mathbi{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbi{y}_j,\

 \parinterval 最终，模型得分定义如公式\eqref{eq:10-35}所示：
 \begin{eqnarray}
-\textrm{score} ( \seq{{y}} , \seq{{x}}) = \frac{\textrm{log} \funp{P}(\seq{{y}} | \seq{{x}})} {\textrm{lp}(\seq{{y}})} + \textrm{cp}(\seq{{y}} , \seq{{x}})
+\textrm{score} ( \seq{{y}} , \seq{{x}}) &=& \frac{\textrm{log} \funp{P}(\seq{{y}} | \seq{{x}})} {\textrm{lp}(\seq{{y}})} + \textrm{cp}(\seq{{y}} , \seq{{x}})
 \label{eq:10-35}
 \end{eqnarray}


--- a/Chapter11/Figures/figure-deep-vs-light.tex
+++ b/Chapter11/Figures/figure-deep-vs-light.tex
@@ -27,7 +27,7 @@
 	\node[vuale] at (6.5em, 9.9em) {$\mathbi{z}_1$};
 	
 	\node (t2) at (2.5em, -1em) {\large{$\cdots$}};
-	\node [anchor=north,font=\tiny] at ([yshift=-0.2em]t2.south) {深度卷积};
+	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t2.south) {深度卷积};
 	\end{scope}

 	\begin{scope}[xshift=4cm]
@@ -52,7 +52,7 @@
 	\node[vuale] at (6.5em, 9.9em) {$\mathbi{z}_1$};
 	
 	\node (t2) at (2.5em, -1em) {\large{$\cdots$}};
-	\node [anchor=north,font=\tiny] at ([yshift=-0.2em]t2.south) {轻量卷积};
+	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t2.south) {轻量卷积};
 	\end{scope}
 	
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter11/Figures/figure-standard.tex
+++ b/Chapter11/Figures/figure-standard.tex
@@ -40,7 +40,7 @@
 	\node[vuale] at ([xshift=0.9em]r3_1.east) {$\mathbi{z}_1$};
 	
 	\node (t1) at (2.5em, -1em) {\large{$\cdots$}};
-	\node [anchor=north,font=\tiny] at ([yshift=-0.2em]t1.south) {(a) 标准卷积};
+	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t1.south) {(a) 标准卷积};
 	\end{scope}
 	
 	\begin{scope}[xshift=4cm]
@@ -74,7 +74,7 @@
 	\node[vuale] at ([xshift=0.9em]r3_1.east) {$\mathbi{z}_1$};
 	
 	\node (t2) at (2.5em, -1em) {\large{$\cdots$}};
-	\node [anchor=north,font=\tiny] at ([yshift=-0.2em]t2.south) {(b) 深度卷积};
+	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t2.south) {(b) 深度卷积};
 	\end{scope}
 	
 	\begin{scope}[xshift=8cm]
@@ -110,7 +110,7 @@
 	\node[vuale] at ([xshift=0.9em]r3_1.east) {$\mathbi{z}_1$};
 	
 	\node (t3) at (2.5em, -1em) {\large{$\cdots$}};
-	\node [anchor=north,font=\tiny] at ([yshift=-0.2em]t3.south) {(c) 逐点卷积};
+	\node [anchor=north,font=\tiny,scale=1.1] at ([yshift=-0.2em]t3.south) {(c) 逐点卷积};
 	\end{scope}
 	
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter11/chapter11.tex
+++ b/Chapter11/chapter11.tex
@@ -45,9 +45,9 @@
 \begin{figure}[htp]
 \centering
 %\input{./Chapter11/Figures/figure-f }
-\subfigure[全连接层]{\input{./Chapter11/Figures/figure-full-connection-vs-cnn-a}}
+\subfigure[{\small 全连接层}]{\input{./Chapter11/Figures/figure-full-connection-vs-cnn-a}}
 \hspace{2cm}
-\subfigure[卷积层]{\input{./Chapter11/Figures/figure-full-connection-vs-cnn-b}}
+\subfigure[{\small 卷积层}]{\input{./Chapter11/Figures/figure-full-connection-vs-cnn-b}}
 \caption{全连接层（a）与卷积层（b）的结构对比}
 \label{fig:11-1}
 \end{figure}
@@ -87,7 +87,7 @@

 \parinterval 在卷积计算中，不同深度下卷积核不同但是执行操作相同，这里以二维卷积核为例展示具体卷积计算。若设输入矩阵为$\mathbi{x}$，输出矩阵为$\mathbi{y}$，卷积滑动步幅为$\textrm{stride}$，卷积核为$\mathbi{w}$，且$\mathbi{w} \in \mathbb{R}^{Q \times U} $，那么卷积计算过程如公式\eqref{eq:11-1}所示：
 \begin{eqnarray}
-\mathbi{y}_{i,j} = \sum \sum ( \mathbi{x}_{[j\times \textrm{stride}:j\times \textrm{stride}+U-1,i\times \textrm{stride}:i\times \textrm{stride}+Q-1]} \odot \mathbi{w} )
+\mathbi{y}_{i,j} &=& \sum \sum ( \mathbi{x}_{[j\times \textrm{stride}:j\times \textrm{stride}+U-1,i\times \textrm{stride}:i\times \textrm{stride}+Q-1]} \odot \mathbi{w} )
 \label{eq:11-1}
 \end{eqnarray}

@@ -171,8 +171,8 @@
 \begin{figure}[htp]
 \centering
 %\input{./Chapter11/Figures/figure-f }
-\subfigure[最大池化]{\input{./Chapter11/Figures/figure-max-pooling}}
-\subfigure[平均池化]{\input{./Chapter11/Figures/figure-average-pooling}}
+\subfigure[{\small 最大池化}]{\input{./Chapter11/Figures/figure-max-pooling}}
+\subfigure[{\small 平均池化}]{\input{./Chapter11/Figures/figure-average-pooling}}
 \caption{池化操作}
 \label{fig:11-8}
 \end{figure}
@@ -194,8 +194,8 @@
 \begin{figure}[htp]
 \centering
 %\input{./Chapter11/Figures/figure-f }
-\subfigure[循环神经网络的串行结构]{\input{./Chapter11/Figures/figure-structural-comparison-a}}
-\subfigure[卷积神经网络的层级结构]{\input{./Chapter11/Figures/figure-structural-comparison-b}}
+\subfigure[{\small 循环神经网络的串行结构}]{\input{./Chapter11/Figures/figure-structural-comparison-a}}
+\subfigure[{\small 卷积神经网络的层级结构}]{\input{./Chapter11/Figures/figure-structural-comparison-b}}
 \caption{串行及层级结构对比（$\mathbi{e}_i$表示词嵌入，$\mathbi{0}$表示$\mathbi{0}$向量，方框里的2、3、4表示层次编号）}
 \label{fig:11-9}
 \end{figure}
@@ -340,7 +340,7 @@

 \parinterval 残差连接是一种训练深层网络的技术，其内容在{\chapternine}已经进行了介绍，即在多层神经网络之间通过增加直接连接的方式，从而将底层信息直接传递给上层。通过增加这样的直接连接，可以让不同层之间的信息传递更加高效，有利于深层神经网络的训练，其计算如公式\eqref{eq:11-6}所示：
 \begin{eqnarray}
-\mathbi{h}^{l+1} = F (\mathbi{h}^l) + \mathbi{h}^l
+\mathbi{h}^{l+1} &=& F (\mathbi{h}^l) + \mathbi{h}^l
 \label{eq:11-6}
 \end{eqnarray}

@@ -349,7 +349,7 @@
 \parinterval 在ConvS2S中残差连接主要应用于门控卷积神经网络和多跳自注意力机制中，比如在编码器的多层门控卷积神经网络中，在每一层的输入和输出之间增加残差连接，具体的数学描述如公式\eqref{eq:11-7}所示：
 \begin{eqnarray}
 %\mathbi{h}_i^l = \funp{v} (\mathbi{W}^l [\mathbi{h}_{i-\frac{k}{2}}^{l-1},...,\mathbi{h}_{i+\frac{k}{2}}^{l-1}] + b_{\mathbi{W}}^l ) + \mathbi{h}_i^{l-1}
-\mathbi{h}^{l+1} = \mathbi{A}^{l} \otimes \sigma ( \mathbi{B}^{l} ) + \mathbi{h}^{l}
+\mathbi{h}^{l+1} &=& \mathbi{A}^{l} \otimes \sigma ( \mathbi{B}^{l} ) + \mathbi{h}^{l}
 \label{eq:11-7}
 \end{eqnarray}

@@ -383,7 +383,7 @@

 \parinterval 在ConvS2S模型中，解码器同样采用堆叠的多层门控卷积网络来对目标语言进行序列建模。区别于编码器，解码器在每一层卷积网络之后引入了注意力机制，用来参考源语言信息。ConvS2S选用了点乘注意力，并且通过类似残差连接的方式将注意力操作的输入与输出同时作用于下一层计算，称为多跳注意力。其具体计算方式如公式\eqref{eq:11-10}所示：
 \begin{eqnarray}
-\alpha_{ij}^l = \frac{ \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_i) }{\sum_{i^{'}=1}^m \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_{i^{'}})}
+\alpha_{ij}^l &=& \frac{ \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_i) }{\sum_{i^{'}=1}^m \textrm{exp} (\mathbi{d}_{j}^l\mathbi{h}_{i^{'}})}
 \label{eq:11-10}
 \end{eqnarray}

@@ -395,13 +395,13 @@

 \noindent 其中，$\mathbi{z}_j^l$表示第$l$层卷积网络输出中第$j$个位置的表示，$\mathbi{W}_{d}^{l}$和$\mathbi{b}_{d}^{l}$是模型可学习的参数，$\textrm{Conv}(\cdot)$表示卷积操作。在获得第$l$层的注意力权重之后，就可以得到对应的一个上下文表示$\mathbi{C}_j^l$，具体计算如公式\eqref{eq:11-13}所示：
 \begin{eqnarray}
-\mathbi{C}_j^l = \sum_i \alpha_{ij}^l (\mathbi{h}_i + \mathbi{e}_i)
+\mathbi{C}_j^l &=& \sum_i \alpha_{ij}^l (\mathbi{h}_i + \mathbi{e}_i)
 \label{eq:11-13}
 \end{eqnarray}

 \noindent 模型使用了更全面的源语言信息，同时考虑了源语言端编码表示$\mathbi{h}_i$以及词嵌入表示$\mathbi{e}_i$。在获得第$l$层的上下文向量$\mathbi{C}_j^l$后，模型将其与$\mathbi{z}_j^l$相加后送入下一层网络，这个过程可以被描述为公式\eqref{eq:11-14}：
 \begin{eqnarray}
-\mathbi{s}_j^{l+1} = \mathbi{C}_j^l + \mathbi{z}_j^l
+\mathbi{s}_j^{l+1} &=& \mathbi{C}_j^l + \mathbi{z}_j^l
 \label{eq:11-14}
 \end{eqnarray}

@@ -478,7 +478,7 @@
 \parinterval 给定输入序列表示$\seq{x} = \{ \mathbi{x}_1,\mathbi{x}_2,...,\mathbi{x}_m \}$，其中$m$为序列长度，$\mathbi{x}_i \in \mathbb{R}^{O} $ ，$O$ 即输入序列的通道数。为了获得与输入序列长度相同的卷积输出结果，首先需要进行填充。为了方便描述，这里在输入序列尾部填充 $K-1$ 个元素（$K$为卷积核窗口的长度），其对应的卷积结果为$\seq{z} = \{ \mathbi{z}_1,\mathbi{z}_2,...,\mathbi{z}_m \}$。
 在标准卷积中，若使用N表示卷积核的个数，也就是标准卷积输出序列的通道数，那么对于第$i$个位置的第$n$个通道$ \mathbi{z}_{i,n}^\textrm{\,std}$，其标准卷积具体计算如公式\eqref{eq:11-18}所示：
 \begin{eqnarray}
-\mathbi{z}_{i,n}^\textrm{\,std} = \sum_{o=1}^{O} \sum_{k=0}^{K-1} \mathbi{W}_{k,o,n}^\textrm{\,std} \mathbi{x}_{i+k,o}
+\mathbi{z}_{i,n}^\textrm{\,std} &=& \sum_{o=1}^{O} \sum_{k=0}^{K-1} \mathbi{W}_{k,o,n}^\textrm{\,std} \mathbi{x}_{i+k,o}
 \label{eq:11-18}
 \end{eqnarray}

@@ -488,7 +488,7 @@

 \parinterval 相应的，深度卷积只考虑不同词之间的依赖性，而不考虑不同通道之间的关系，相当于使用$O$个卷积核逐个通道对不同的词进行卷积操作。因此深度卷积不改变输出的表示维度，输出序列表示的通道数与输入序列一致，其计算如公式\eqref{eq:11-19}所示：
 \begin{eqnarray}
-\mathbi{z}_{i,o}^\textrm{\,dw} = \sum_{k=0}^{K-1} \mathbi{W}_{k,o}^\textrm{\,dw} \mathbi{x}_{i+k,o}
+\mathbi{z}_{i,o}^\textrm{\,dw} &=& \sum_{k=0}^{K-1} \mathbi{W}_{k,o}^\textrm{\,dw} \mathbi{x}_{i+k,o}
 \label{eq:11-19}
 \end{eqnarray}

@@ -564,7 +564,7 @@

 \parinterval 在轻量卷积中，模型使用的卷积参数是静态的，与序列位置无关， 维度大小为$K\times a$；而在动态卷积中，为了增强模型的表示能力，卷积参数来自于当前位置输入的变换，具体如公式\eqref{eq:11-22}：
 \begin{eqnarray}
-\funp{f} (\mathbi{x}_{i}) = \sum_{c=1}^d \mathbi{W}_{:,:,c} \odot \mathbi{x}_{i,c}
+\funp{f} (\mathbi{x}_{i}) &=& \sum_{c=1}^d \mathbi{W}_{:,:,c} \odot \mathbi{x}_{i,c}
 \label{eq:11-22}
 \end{eqnarray}


--- a/Chapter12/Figures/figure-different-regularization-methods.tex
+++ b/Chapter12/Figures/figure-different-regularization-methods.tex
@@ -14,8 +14,8 @@
 \node [anchor=west] (plus1) at ([xshift=0.9em]l1.east) {\scriptsize{$\mathbf{\oplus}$}};
 \node [anchor=west] (plus2) at ([xshift=0.9em]l4.east) {\scriptsize{$\mathbf{\oplus}$}};

-\node [anchor=north] (label1) at ([xshift=3em,yshift=-0.5em]l1.south) {\scriptsize{(a)后标准化}};
-\node [anchor=north] (label2) at ([xshift=3em,yshift=-0.5em]l3.south) {\scriptsize{(b)前标准化}};
+\node [anchor=north] (label1) at ([xshift=3em,yshift=-0.5em]l1.south) {\small{(a)后标准化}};
+\node [anchor=north] (label2) at ([xshift=3em,yshift=-0.5em]l3.south) {\small{(b)前标准化}};

 \draw [->,thick] ([xshift=-1.5em]l1.west) -- ([xshift=-0.1em]l1.west);
 \draw [->,thick] ([xshift=0.1em]l1.east) -- ([xshift=0.2em]plus1.west);

--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -281,7 +281,7 @@

 \parinterval 在得到$\mathbi{Q}$，$\mathbi{K}$和$\mathbi{V}$后，便可以进行注意力机制的运算，这个过程可以被形式化为公式\eqref{eq:12-9}：
 \begin{eqnarray}
-\textrm{Attention}(\mathbi{Q},\mathbi{K},\mathbi{V}) = \textrm{Softmax}
+\textrm{Attention}(\mathbi{Q},\mathbi{K},\mathbi{V}) &=& \textrm{Softmax}
 ( \frac{\mathbi{Q}\mathbi{K}^{\textrm{T}}} {\sqrt{d_k}} + \mathbi{Mask} ) \mathbi{V}
 \label{eq:12-9}
 \end{eqnarray}
@@ -417,13 +417,13 @@
 \parinterval 在Transformer的训练过程中，由于引入了残差操作，将前面所有层的输出加到一起，如公式\eqref{eq:12-12}所示：
 \begin{eqnarray}
 %x_{l+1} = x_l + F (x_l)
-\mathbi{x}^{l+1} = F (\mathbi{x}^l) + \mathbi{x}^l
+\mathbi{x}^{l+1} &=& F (\mathbi{x}^l) + \mathbi{x}^l
 \label{eq:12-12}
 \end{eqnarray}

 \noindent 其中$\mathbi{x}^l$表示第$l$层网络的输入向量，$F (\mathbi{x}^l)$是子层运算，这样会导致不同层（或子层）的结果之间的差异性很大，造成训练过程不稳定、训练时间较长。为了避免这种情况，在每层中加入了层标准化操作\upcite{Ba2016LayerN}。图\ref{fig:12-14} 中的红色方框展示了Transformer中残差和层标准化的位置。层标准化的计算如公式\eqref{eq:12-13}所示：
 \begin{eqnarray}
-\textrm{LN}(\mathbi{x}) = g \cdot \frac{\mathbi{x}- \mu} {\sigma} + b
+\textrm{LN}(\mathbi{x}) &=& g \cdot \frac{\mathbi{x}- \mu} {\sigma} + b
 \label{eq:12-13}
 \end{eqnarray}

@@ -459,7 +459,7 @@

 \parinterval Transformer使用了全连接网络。全连接网络的作用主要体现在将经过注意力操作之后的表示映射到新的空间中，新的空间会有利于接下来的非线性变换等操作。实验证明，去掉全连接网络会对模型的性能造成很大影响。Transformer的全连接前馈神经网络包含两次线性变换和一次非线性变换（ReLU激活函数:ReLU$(\mathbi{x})=\textrm{max}⁡(0,\mathbi{x})$），每层的前馈神经网络参数不共享，具体计算如公式\eqref{eq:12-14}：
 \begin{eqnarray}
-\textrm{FFN}(\mathbi{x}) = \textrm{max} (0,\mathbi{x}\mathbi{W}_1 + \mathbi{b}_1)\mathbi{W}_2 + \mathbi{b}_2
+\textrm{FFN}(\mathbi{x}) &=& \textrm{max} (0,\mathbi{x}\mathbi{W}_1 + \mathbi{b}_1)\mathbi{W}_2 + \mathbi{b}_2
 \label{eq:12-14}
 \end{eqnarray}

@@ -489,7 +489,7 @@
 \item	Transformer使用Adam优化器优化参数，并设置$\beta_1=0.9$，$\beta_2=0.98$，$\epsilon=10^{-9}$。
 \item Transformer在学习率中同样应用了学习率{\small\bfnew{预热}}\index{预热}（Warmup）\index{Warmup}策略，其计算如公式\eqref{eq:12-15}所示：
 \begin{eqnarray}
-lrate = d_{\textrm{model}}^{-0.5} \cdot \textrm{min} (\textrm{step}^{-0.5} , \textrm{step} \cdot \textrm{warmup\_steps}^{-1.5})
+lrate &=& d_{\textrm{model}}^{-0.5} \cdot \textrm{min} (\textrm{step}^{-0.5} , \textrm{step} \cdot \textrm{warmup\_steps}^{-1.5})
 \label{eq:12-15}
 \end{eqnarray}


--- a/Chapter13/Figures/figure-example-of-neural-machine-translation.png
+++ b/Chapter13/Figures/figure-example-of-neural-machine-translation.png
--- a/Chapter13/Figures/figure-true-case-of-adversarial-examples.jpg
+++ b/Chapter13/Figures/figure-true-case-of-adversarial-examples.jpg
--- a/Chapter13/chapter13.tex
+++ b/Chapter13/chapter13.tex
@@ -46,7 +46,7 @@
 \sectionnewpage
 \section{开放词表}

-\parinterval 人类表达语言的方式是十分多样的，这也体现在单词的构成上，甚至我们都无法想象数据中存在的不同单词的数量。比如，如果使用简单的分词策略，WMT、CCMT等评测数据的英文词表大小都会在100万以上。当然，这里面也包括很多的数字和字母的混合，还有一些组合词。不过，如果不加限制，机器翻译所面对的词表确实很``大''。这也会导致系统速度变慢，模型变大。更严重的问题是，测试数据中的一些单词根本就没有在训练数据中出现过，这时会出现OOV翻译问题，即系统无法对未见单词进行翻译。在神经机器翻译中，通常会考虑使用更小的翻译单元来缓解以上问题。
+\parinterval 人类表达语言的方式是十分多样的，这也体现在单词的构成上，甚至我们都无法想象数据中存在的不同单词的数量。即便使用分词策略，在WMT、CCMT等评测数据上，英文词表大小都会在100万以上。当然，这里面也包括很多的数字和字母的混合，还有一些组合词。不过，如果不加限制，机器翻译所面对的词表将会很“大”。这也会导致系统速度变慢，模型变大。更严重的问题是，测试数据中的一些单词根本就没有在训练数据中出现过，这时会出现OOV翻译问题，即系统无法对未见单词进行翻译。在神经机器翻译中，通常会考虑使用更小的翻译单元来缓解以上问题。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -54,18 +54,18 @@

 \subsection{大词表和OOV问题}

-\parinterval 首先来具体看一看神经机器翻译的大词表问题。神经机器翻译模型训练和解码都依赖于源语言和目标语言的词表。在建模中，词表中的每一个单词都会被转换为分布式（向量）表示，即词嵌入。这些向量会作为模型的输入（见第六章）。如果每个单词都对应一个向量，那么单词的各种变形（时态、语态等）都会导致词表和相应的向量数量的增加。图\ref{fig:7-7}展示了一些英语单词的时态语态变化。
+\parinterval 首先来具体看一看神经机器翻译的大词表问题。神经机器翻译模型训练和解码都依赖于源语言和目标语言的词表。在建模中，词表中的每一个单词都会被转换为分布式（向量）表示，即词嵌入。这些向量会作为模型的输入（见{\chapterten}）。如果每个单词都对应一个向量，那么单词的各种变形（时态、语态等）都会导致词表和相应的向量数量的增加。图\ref{fig:13-1}展示了一些英语单词的时态语态变化。

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter13/Figures/figure-word-change}
 \caption{单词时态、语态、单复数的变化}
-\label{fig:7-7}
+\label{fig:13-1}
 \end{figure}
 %----------------------------------------------

-\parinterval 如果要覆盖更多的翻译现象，词表会不断膨胀，并带来两个问题：
+\parinterval 如果要覆盖更多的翻译现象，词表会不断膨胀，并带来下述两个问题：

 \begin{itemize}
 \vspace{0.5em}
@@ -75,7 +75,7 @@
 \vspace{0.5em}
 \end{itemize}

-\parinterval 理想情况下，机器翻译应该是一个{\small\bfnew{开放词表}}\index{开放词表}（Open-Vocabulary）\index{Open-Vocabulary}的翻译任务。也就是，不论测试数据中包含什么样的词，机器翻译系统都应该能够正常翻译。但是，现实的情况是，即使不断扩充词表，也不可能覆盖所有可能的单词。这时就会出现OOV问题（集外词问题）。这个问题在使用受限词表时会更加严重，因为低频词和未见过的词都会被看作OOV单词。这时会将这些单词用<UNK>代替。通常，数据中<UNK>的数量会直接影响翻译性能，过多的<UNK>会造成欠翻译、结构混乱等问题。因此神经机器翻译需要额外的机制解决大词表和OOV问题。
+\parinterval 理想情况下，机器翻译应该是一个{\small\bfnew{开放词表}}\index{开放词表}（Open-Vocabulary）\index{Open-Vocabulary}的翻译任务。也就是，无论测试数据中包含什么样的词，机器翻译系统都应该能够正常翻译。但是，现实的情况是即使不断扩充词表，也不可能覆盖所有可能的单词，即OOV问题，或被称作集外词问题。这个问题在使用受限词表时会更加严重，因为低频词和未见过的词都会被看作OOV单词。这时会将这些单词用<UNK>代替。通常，数据中<UNK>的数量会直接影响翻译性能，过多的<UNK>会造成欠翻译、结构混乱等问题。因此神经机器翻译需要额外的机制解决大词表和OOV问题。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -83,20 +83,20 @@

 \subsection{子词}

-\parinterval 一种解决开放词表翻译问题的方法是改造输出层结构\upcite{garcia-martinez2016factored,DBLP:conf/acl/JeanCMB15}，比如，替换原始的Softmax层，用更加高效的神经网络结构进行超大规模词表上的预测。不过这类方法往往需要对系统进行修改，由于模型结构和训练方法的调整使得系统开发与调试的工作量增加。而且这类方法仍然无法解决OOV问题。因此在实用系统中并不常用。
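+\parinterval One way to address open-vocabulary translation is to modify the output layer\upcite{garcia-martinez2016factored,DBLP:conf/acl/JeanCMB15}, for example by replacing the original Softmax layer with a more efficient neural network structure for prediction over very large vocabularies. However, such methods usually require modifying the system, and the changes to the model structure and training procedure increase the development and debugging workload. They also still cannot solve the OOV problem, so they are not commonly used in practical systems.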

-\parinterval 另一种思路是不改变机器翻译系统，而是从数据处理的角度来缓解OOV问题。既然使用单词会带来数据稀疏问题，那么自然会想到使用更小的单元。比如，把字符作为最小的翻译单元 \footnote{中文中字符可以被看作是汉字。} \ \dash \ 也就是基于字符的翻译模型\upcite{DBLP:journals/tacl/LeeCH17}。以英文为例，只需要构造一个包含26个英文字母、数字和一些特殊符号的字符表，便可以表示所有的单词。
+\parinterval 另一种思路是不改变机器翻译系统，而是从数据处理的角度来缓解OOV问题。既然使用单词会带来数据稀疏问题，那么自然会想到使用更小的单元。比如，把字符作为最小的翻译单元 \footnote{中文里的字符可以被看作是汉字。} \ \dash \ 也就是基于字符的翻译模型\upcite{DBLP:journals/tacl/LeeCH17}。以英文为例，只需要构造一个包含26个英文字母、数字和一些特殊符号的字符表，便可以表示所有的单词。

-\parinterval 但是字符级翻译也面临着新的问题\ \dash\ 使用字符增加了系统捕捉不同语言单元之间搭配的难度。假设平均一个单词由5个字符组成，所处理的序列长度便增大5倍。这使得具有独立意义的不同语言单元需要跨越更远的距离才能产生联系。此外，基于字符的方法也破坏了单词中天然存在的构词规律，或者说破坏了单词内字符的局部依赖。比如，英文单词``telephone''中的``tele''和``phone''都是有具体意义的词缀，但是如果把它们打散为字符就失去了这些含义。
+\parinterval 但是字符级翻译也面临着新的问题\ \dash\ 使用字符增加了系统捕捉不同语言单元之间搭配的难度。假设平均一个单词由5个字符组成，系统所处理的序列长度便增大5倍。这使得具有独立意义的不同语言单元需要跨越更远的距离才能产生联系。此外，基于字符的方法也破坏了单词中天然存在的构词规律，或者说破坏了单词内字符的局部依赖。比如，英文单词“telephone”中的“tele”和“phone”都是有具体意义的词缀，但是如果把它们打散为字符就失去了这些含义。

-\parinterval 那么有没有一种方式能够兼顾基于单词和基于字符方法的优点呢？常用的手段包括两种，一种是采用字词融合的方式构建词表，将未知单词转换为字符的序列并通过特殊的标记将其与普通的单词区分开来\upcite{luong2016acl_hybrid}。而另一种方式是将单词切分为{\small\bfnew{子词}}\index{子词}（Sub-word）\index{Sub-word}，它是介于单词和字符中间的一种语言单元表示形式。比如，将英文单词``doing''切分为``do''+``ing''。对于形态学丰富的语言来说，子词体现了一种具有独立意义的构词基本单元。比如，如图\ref{fig:7-8}，子词``do''，和``new''在可以用于组成其他不同形态的单词。
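+\parinterval Is there a way to combine the advantages of word-based and character-based methods? Two approaches are common. One builds the vocabulary by mixing words and characters: unknown words are converted into character sequences and marked with special symbols to distinguish them from ordinary words\upcite{luong2016acl_hybrid}. The other segments words into {\small\bfnew{sub-words}}\index{Sub-word}, a linguistic unit that lies between words and characters. For example, the English word ``doing'' is segmented into ``do'' + ``ing''. For morphologically rich languages, sub-words act as basic word-formation units that carry independent meaning. As shown in Figure \ref{fig:13-2}, the sub-words ``do'' and ``new'' can be used to form words of different morphological forms.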

 %----------------------------------------------
 \begin{figure}[htp]
 \centering
 \input{./Chapter13/Figures/figure-word-root}
 \caption{不同单词共享相同的子词（前缀）}
-\label{fig:7-8}
+\label{fig:13-2}
 \end{figure}
 %----------------------------------------------

@@ -106,13 +106,13 @@
 \vspace{0.5em}
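 \item Tokenize the raw data;
 \vspace{0.5em}
-\item Build the sub-word vocabulary;
+\item Build the symbol merge table;
 \vspace{0.5em}
-\item Re-segment the words in the data with the sub-word vocabulary.
+\item Use the merge table to merge each character-segmented word into a sequence of sub-words.
 \vspace{0.5em}
 \end{itemize}

-\parinterval The core issue here is how to build the sub-word vocabulary. Some typical methods are introduced below.
+\parinterval The core issue here is how to build the symbol merge table. Some typical methods are introduced below.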

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -120,7 +120,7 @@

 \subsection{双字节编码（BPE）}

-\parinterval {\small\bfnew{字节对编码}}\index{字节对编码}或{\small\bfnew{双字节编码}}\index{双字节编码}（Byte Pair Encoding，BPE）\index{Byte Pair Encoding，BPE}是一种常用的子词词表构建方法\upcite{DBLP:conf/acl/SennrichHB16a}。BPE方法最早用于数据压缩，该方法将数据中常见的连续字符串替换为一个不存在的字符，之后通过构建一个替换关系的对应表，对压缩后的数据进行还原。机器翻译借用了这种思想，把子词切分看作是学习对自然语言句子进行压缩编码表示的问题\upcite{Gage1994ANA}。其目的是，保证编码后的结果（即子词切分）占用的字节尽可能少。这样，子词单元会尽可能被不同单词复用，同时又不会因为使用过小的单元造成子词切分序列过长。使用BPE算法构建子词词表可以分为如下几个步骤：
+\parinterval {\small\bfnew{字节对编码}}\index{字节对编码}或{\small\bfnew{双字节编码}}\index{双字节编码}（Byte Pair Encoding\index{Byte Pair Encoding}，BPE）是一种常用的子词词表构建方法\upcite{DBLP:conf/acl/SennrichHB16a}。BPE方法最早用于数据压缩，该方法将数据中常见的连续字符串替换为一个不存在的字符，之后通过构建一个替换关系的对应表，对压缩后的数据进行还原。机器翻译借用了这种思想，把子词切分看作是学习对自然语言句子进行压缩编码表示的问题\upcite{Gage1994ANA}。其目的是，保证编码后的结果（即子词切分）占用的字节尽可能少。这样，子词单元会尽可能被不同单词复用，同时又不会因为使用过小的单元造成子词切分序列过长。使用BPE算法构建符号合并表可以分为如下几个步骤：

 \begin{itemize}
 \vspace{0.5em}
@@ -128,7 +128,7 @@
 \vspace{0.5em}
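 \item Further split each tokenized word into a character sequence, and append the end symbol <e> to each word to mark the word boundary. Then count how often the word occurs in the data. For example, if the word low occurs 5 times in the data, it is recorded as `l o w <e>': 5.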
 \vspace{0.5em}
-\item 对得到的字符集合进行统计，统计每个单词中2-gram符号出现的频次 \footnote{发生合并前，一个字符便是一个符号}。之后，选择最高频的2-gram符号，将其合并为新的符号，即新的子词。例如``A''和``B''连续出现的频次最高，则以``AB''替换所有单词内连续出现的``A''和``B''并将其加入子词词表。这样，``AB''会被作为一个整体，在之后的过程中可以与其他符号进一步合并。需要注意的是替换和合并不会跨越单词的边界，即只对单个单词进行替换和合并。
+\item 对得到的字符集合进行统计，统计每个单词中2-gram符号出现的频次 \footnote{发生合并前，一个字符便是一个符号}。之后，选择最高频的2-gram符号，将其合并为新的符号，即新的子词。例如“A”和“B”连续出现的频次最高，则以“AB”替换所有单词内连续出现的“A”和“B”并将其加入子词词表。这样，“AB”会被作为一个整体，在之后的过程中可以与其他符号进一步合并。需要注意的是替换和合并不会跨越单词的边界，即只对单个单词进行替换和合并。
 \vspace{0.5em}
 \item 不断重复上一步骤，直到子词词表大小达到预定的大小或者下一个最高频的2-gram字符的频次为1。子词词表大小是BPE的唯一的参数，它用来控制上述子词合并的规模。
 \vspace{0.5em}
@@ -143,19 +143,9 @@
 %----------------------------------------------
 \end{itemize}
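
The merge-learning steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `learn_bpe`, the `{word: count}` input format, and the use of a fixed number of merges (`num_merges`) as the stopping budget are assumptions made here. The loop also stops early when the most frequent pair occurs only once, as described in the text.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of symbol merges from a {word: count} dictionary."""
    # Split every word into characters and append the end-of-word symbol <e>.
    vocab = {tuple(word) + ("<e>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs inside each word (never across word boundaries).
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count <= 1:  # the most frequent pair occurs only once: stop
            break
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges
```

The returned list is ordered by merge priority, so it can be used directly as the symbol merge table when segmenting new text.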

-\parinterval 图\ref{fig:7-9}给出了BPE算法执行的实例。在执行合并操作时，需要考虑不同的情况。假设词表中存在子词``ab''和``cd''，此时要加入子词``abcd''。可能会出现如下的情况：
+\parinterval Figure \ref{fig:7-9} gives an example of running the BPE algorithm, where the preset size of the merge table is 10.

-\begin{itemize}
-\vspace{0.5em}
-\item 若``ab''、``cd''、``abcd''完全独立，彼此的出现互不影响，将``abcd''加入词表，词表数目$+1$；
-\vspace{0.5em}
-\item 若``ab''和``cd''必同时出现则词表中加入``abcd''，去除``ab''和``cd''，词表数目$-1$。这个操作是为了较少词表中的冗余；
-\vspace{0.5em}
-\item 若出现``ab''，其后必出现``cd''，但是``cd''却可以作为独立的子词出现，则将``abcd''加入词表，去除``ab''，反之亦然，词表数目不变。
-\vspace{0.5em}
-\end{itemize}
-
-\parinterval 在得到了子词词表后，便需要对单词进行切分。BPE要求从较长的子词开始替换。首先，对子词词表按照字符长度从大到小进行排序。然后，对于每个单词，遍历子词词表，判断每个子词是不是当前词的子串，若是则进行替换切分。将单词中所有的子串替换为子词后，如果仍有子串未被替换，则将其用<UNK>代替，如图\ref{fig:7-10} 。
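+\parinterval Once the symbol merge table is obtained, the words represented as characters need to be merged to obtain text in sub-word form. First, each word is split into a sequence of character symbols and the end symbol is appended. Then the merge table is traversed in order; whenever the current 2-gram symbol pair occurs in the sequence, it is merged, until the traversal ends. Figure \ref{fig:7-10} gives an example of segmenting a word into sub-words with the merge table. The red units are the new symbols produced by each merge. The process continues until no merge applies or the traversal ends, and each unit in the final result is a sub-word. {\red{There is a problem with the figure.}}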
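
The segmentation procedure just described can be sketched as below. The function name `apply_bpe` and the hand-written merge table in the usage note are hypothetical and serve only as an illustration.

```python
def apply_bpe(word, merges):
    """Segment one word by applying the merge table in order."""
    # Start from characters plus the end-of-word symbol <e>.
    symbols = list(word) + ["<e>"]
    for left, right in merges:  # follow the order of the merge table
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # merge the matching 2-gram
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

For example, with the hypothetical table `[("e", "s"), ("es", "t"), ("est", "<e>"), ("l", "o"), ("lo", "w")]`, the word `lowest` is segmented into `low` and `est<e>`.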

 %----------------------------------------------
 \begin{figure}[htp]
@@ -168,9 +158,9 @@

 \parinterval 由于模型的输出也是子词序列，因此需要对最终得到的翻译结果进行子词还原，即将子词形式表达的单元重新组合为原本的单词。这一步操作也十分简单，只需要不断的将每个子词向后合并，直至遇到表示单词边界的结束符<e>，便得到了一个完整的单词。
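
Restoring words from sub-words can be sketched in a few lines. The function name `restore_words` is hypothetical, and the sketch assumes the end symbol <e> is attached to the last sub-word of each word, as in the segmentation described above.

```python
def restore_words(subwords, end="<e>"):
    """Join sub-words into words, closing a word at each end symbol."""
    words, current = [], ""
    for piece in subwords:
        if piece.endswith(end):
            # The end symbol closes the current word; strip it from the output.
            words.append(current + piece[: -len(end)])
            current = ""
        else:
            current += piece
    if current:  # trailing pieces without an end symbol are kept as they are
        words.append(current)
    return words
```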

-\parinterval 使用BPE方法的策略有很多。不仅可以单独对源语言和目标语言进行子词的切分，也可以联合源语言和目标语言，共同进行子词切分，被称作Joint-BPE\upcite{DBLP:conf/acl/SennrichHB16a}。单语BPE比较简单直接，而Joint-BPE则可以增加两种语言子词切分的一致性。对于相似语系中的语言，如英语和德语，常使用Joint-BPE的方法联合构建词表。而对于中英这些差异比较大的语种，则需要独立的进行子词切分。
+\parinterval 使用BPE方法的策略有很多。不仅可以单独对源语言和目标语言进行子词的切分，也可以联合源语言和目标语言，共同进行子词切分，被称作Joint-BPE\upcite{DBLP:conf/acl/SennrichHB16a}。单语BPE比较简单直接，而Joint-BPE则可以增加两种语言子词切分的一致性。对于相似语系中的语言，如英语和德语，常使用Joint-BPE的方法联合构建词表。而对于中英这些差异比较大的语种，则需要独立的进行子词切分。使用子词表示句子的方法可以有效的平衡词汇量，增大对未见单词的覆盖度。像英译德、汉译英任务，使用16k或者32k的子词词表大小便能取得很好的效果。

-\parinterval BPE还有很多变种方法。在进行子词切分时，BPE从最长的子词开始进行切分。这个启发性规则可以保证切分结果的唯一性，实际上，在对一个单词用同一个子词词表切分时，可能存在多种切分方式，如hello，我们可以分割为``hell''和``o''，也可以分割为``h''和``ello''。这种切分的多样性可以来提高神经机器翻译系统的健壮性\upcite{DBLP:conf/acl/Kudo18}。而在T5等预训练模型中\upcite{DBLP:journals/jmlr/RaffelSRLNMZLL20}则使用了基于字符级别的BPE。此外，尽管BPE被命名为字节对编码，实际上一般处理的是Unicode编码，而不是字节。在预训练模型GPT2中，也探索了字节级别的BPE，在机器翻译、问答等任务中取得了很好的效果\upcite{radford2019language}。
+\parinterval BPE还有很多变种方法。在进行子词切分时，BPE从最长的子词开始进行切分。这个启发性规则可以保证切分结果的唯一性，实际上，在对一个单词用同一个子词词表切分时，可能存在多种切分方式，如hello，我们可以分割为“hell”和“o”，也可以分割为“h”和“ello”。这种切分的多样性可以用来提高神经机器翻译系统的健壮性\upcite{DBLP:conf/acl/Kudo18}。{\red 而在T5等预训练模型中\upcite{DBLP:journals/jmlr/RaffelSRLNMZLL20}，则使用了基于字符级别的BPE。此外，尽管BPE被命名为字节对编码，实际上一般处理的是Unicode编码，而不是字节。在预训练模型GPT2中，也探索了字节级别的BPE，在机器翻译、问答等任务中取得了很好的效果\upcite{radford2019language}}。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -178,11 +168,21 @@

 \subsection{其他方法}

-\parinterval 与基于统计的BPE算法不同，基于Word Piece和1-gram Language Model（ULM）的方法则是利用语言模型进行子词词表的构造\upcite{DBLP:conf/acl/Kudo18}。本质上，基于语言模型的方法和基于BPE的方法的思路是一样的，即通过合并字符和子词不断生成新的子词。它们的区别仅在于合并子词的方式不同。基于BPE的方法选择出现频次最高的连续字符2-gram合并为新的子词，而基于语言模型的方法则是根据语言模型输出的概率选择要合并哪些子词。
+\parinterval 与基于统计的BPE算法不同，基于Word Piece的子词切分方法则是利用语言模型进行子词词表的构造\upcite{DBLP:conf/icassp/SchusterN12}。本质上，基于语言模型的方法和基于BPE的方法的思路是一样的，即通过合并字符和子词不断生成新的子词。它们的区别仅在于合并子词的方式不同。{\red 基于BPE的方法选择出现频次最高的连续字符2-gram合并为新的子词}，而基于语言模型的方法则是根据语言模型输出的概率选择要合并哪些子词。具体来说，基于Word Piece的方法首先将句子切割为字符表示的形式\upcite{DBLP:conf/icassp/SchusterN12}，并利用该数据训练一个1-gram语言模型，记为$\textrm{log}\funp{P}(\cdot)$。假设两个相邻的子词单元$a$和$b$被合并为新的子词$c$，则整个句子的语言模型得分的变化为$\triangle=\textrm{log}\funp{P}(c)-\textrm{log}\funp{P}(a)-\textrm{log}\funp{P}(b)$。这样，可以不断的选择使$\triangle$最大的两个子词单元进行合并，直到达到预设的词表大小或者句子概率的增量低于某个阈值。

-\parinterval 具体来说，基于Word Piece的方法首先将句子切割为字符表示的形式\upcite{DBLP:conf/icassp/SchusterN12}，并利用该数据训练一个1-gram语言模型，记为$\textrm{logP}(\cdot)$。假设两个相邻的子词单元$a$和$b$被合并为新的子词$c$，则整个句子的语言模型得分的变化为$\triangle=\textrm{logP}(c)-\textrm{logP}(a)-\textrm{logP}(b)$。这样，可以不断的选择使$\triangle$最大的两个子词单元进行合并，直到达到预设的词表大小或者句子概率的增量低于某个阈值。而ULM方法以最大化整个句子的概率为目标构建词表\upcite{DBLP:conf/acl/Kudo18}，具体实现上也不同于基于Word Piece的方法，这里不做详细介绍。
+\parinterval 目前比较主流的子词切分方法都是作用于分词后的序列，对一些没有明显词边界且资源稀缺的语种并不友好。相比之下，SentencePiece可以作用于未经过分词处理的输入序列\upcite{kudo2018sentencepiece}，同时囊括了双字节编码和语言模型的子词切分方法，更加灵活易用。

-\parinterval 使用子词表示句子的方法可以有效的平衡词汇量，增大对未见单词的覆盖度。像英译德、汉译英任务，使用16k或者32k的子词词表大小便能取得很好的效果。
+\parinterval 通过上述子词切分方法，可以缓解OOV的问题，允许模型利用到一些词法上的信息。然而主流的BPE子词切分方法中，每个单词都对应一种唯一的子词切分方式，因此输入的数据经过子词切分后的序列表示也是唯一的。在给定词表的情况下，每句话仍然存在多种切分方式。而经过现有BPE处理后的序列，模型只能接收到单一的表示，可能会阻止模型更好地学习词的组成，不能充分利用单词中的形态学特征。此外，针对切分错误的输入数据表现不够鲁棒，常常会导致整句话的翻译效果极差。为此，研究人员提出一些正则化方法\upcite{DBLP:conf/acl/Kudo18,provilkov2020bpe}。
+
+\begin{itemize}
+\vspace{0.5em}
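+\item Sub-word regularization\upcite{DBLP:conf/acl/Kudo18}. The idea is to perturb the deterministic sub-word boundaries during training by sampling multiple candidate segmentations from a 1-gram language model{\red (ULM)}, where the vocabulary is built with the objective of maximizing the probability of the whole sentence. The implementation differs slightly from the Word Piece method above and is not described in detail here.
+\vspace{0.5em}
+\item BPE-Dropout\upcite{provilkov2020bpe}. During training, some applicable merge operations are randomly dropped with probability $p${\red (can this p be changed to P?)} (between 0 and 1) during the merge process, which yields different sub-word segmentations and makes the model more robust. At inference time, $p$ is set to 0, which is equivalent to standard BPE. Overall, these methods perturb the input sequence at the sub-word granularity and thereby achieve robust training. Robust training is also discussed in detail in later sections.
+\vspace{0.5em}
+\item DPE\upcite{he2020dynamic}. It introduces a mixed character and sub-word segmentation and treats the sub-word segmentation of a sentence as a latent variable. This structure can use {\small\bfnew{dynamic programming}}\index{Dynamic Programming} to marginalize exactly over the latent sub-word segments. The decoder input is the character-level target sequence. At inference time, the output of each time step is mapped onto the preset sub-word vocabulary to obtain the most likely sub-word. If the current sub-word has length $m$, it serves as the input for the next $m$ time steps, and the next sub-word is obtained after these $m$ time steps.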
+\vspace{0.5em}
+\end{itemize}
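
The BPE-Dropout idea can be illustrated with a simplified sketch. It is not the exact procedure of the cited work: here the merge table is traversed once in order and each applicable merge is skipped with probability `p`. The function name `bpe_dropout` and the `rng` parameter are assumptions made for this illustration.

```python
import random

def bpe_dropout(word, merges, p=0.1, rng=random):
    """Segment a word with BPE while randomly dropping applicable merges."""
    symbols = list(word) + ["<e>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            is_match = (i + 1 < len(symbols)
                        and symbols[i] == left and symbols[i + 1] == right)
            # Keep an applicable merge with probability 1 - p, drop it otherwise.
            if is_match and rng.random() >= p:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

With `p=0` every applicable merge is kept, so the result equals standard BPE segmentation; with `p=1` no merge is applied and the word stays split into characters.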

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -193,7 +193,7 @@

 \parinterval {\small\bfnew{正则化}}\index{正则化}（Regularization）\index{Regularization}是机器学习中的经典技术，通常用于缓解{\small\bfnew{过拟合问题}}\index{过拟合问题}（The Overfitting Problem）\index{Overfitting Problem}。正则化的概念源自线性代数和代数几何。在实践中，它更多的是指对{\small\bfnew{反问题}}\index{反问题}（The Inverse Problem）\index{Inverse Problem}的一种求解方式。假设输入$x$和输出$y$之间存在一种映射$f$
 \begin{eqnarray}
-y = f(x)
+y &=& f(x)
 \label{eq:13-1}
 \end{eqnarray}

@@ -224,7 +224,7 @@ y = f(x)

 \parinterval 正则化的一种实现是在训练目标中引入一个正则项。在神经机器翻译中，引入正则项的训练目标为：
 \begin{eqnarray}
-\widehat{\mathbf{w}}=\argmax_{\mathbf{w}}L(\mathbf{w}) + \lambda R(\mathbf{w})
+\widehat{\mathbf{w}} &=& \argmax_{\mathbf{w}}L(\mathbf{w}) + \lambda R(\mathbf{w})
 \label{eq:13-2}
 \end{eqnarray}

@@ -254,7 +254,7 @@ R(\mathbf{w}) & = & (\big| |\mathbf{w}| {\big|}_2)^2 \\

 \parinterval 从几何的角度看，L1和L2正则项都是有物理意义的。二者都可以被看作是空间上的一个区域，比如，在二维平面上，L1范数表示一个以0点为中心的矩形，L2范数表示一个以0点为中心的圆。因此，优化问题可以被看作是在两个区域（$L(\mathbf{w})$和$R(\mathbf{w})$）叠加在一起所形成的区域上进行优化。由于L1和L2正则项都是在0点（坐标原点）附近形成的区域，因此优化的过程可以确保参数不会偏离0点太多。也就是说，L1和L2正则项引入了一个先验：模型的解不应该离0点太远。而L1和L2正则项实际上是在度量这个距离。

-\parinterval 那为什么要用L1和L2正则项惩罚离0点远的解呢？这还要从模型复杂度谈起。实际上，对于神经机器翻译这样的模型来说，模型的容量是足够的。所谓容量可以被简单的理解为独立参数的个数 \footnote{关于模型容量，在\ref{section-13.2}节会有进一步讨论。}。也就是说，理论上存在一种模型可以完美的描述问题。但是，从目标函数拟合的角度来看，如果一个模型可以拟合很复杂的目标函数，那模型所表示的函数形态也会很复杂。这往往体现在模型中参数的值``偏大''。比如，用一个多项式函数拟合一些空间中的点，如果希望拟合得很好，各个项的系数往往是非零的。而且为了对每个点进行拟合，通常需要多项式中的某些项具有较大的系数，以获得函数在局部有较大的斜率。显然，这样的模型是很复杂的。而模型的复杂度可以用函数中的参数（比如多项式中各项的系数）的``值''进行度量，体现出来就是模型参数的范数。
+\parinterval 那为什么要用L1和L2正则项惩罚离0点远的解呢？这还要从模型复杂度谈起。实际上，对于神经机器翻译这样的模型来说，模型的容量是足够的。所谓容量可以被简单的理解为独立参数的个数 \footnote{关于模型容量，在\ref{section-13.2}节会有进一步讨论。}。也就是说，理论上存在一种模型可以完美的描述问题。但是，从目标函数拟合的角度来看，如果一个模型可以拟合很复杂的目标函数，那模型所表示的函数形态也会很复杂。这往往体现在模型中参数的值“偏大”。比如，用一个多项式函数拟合一些空间中的点，如果希望拟合得很好，各个项的系数往往是非零的。而且为了对每个点进行拟合，通常需要多项式中的某些项具有较大的系数，以获得函数在局部有较大的斜率。显然，这样的模型是很复杂的。而模型的复杂度可以用函数中的参数（比如多项式中各项的系数）的“值”进行度量，体现出来就是模型参数的范数。

 \parinterval 因此，L1和L2正则项的目的是防止模型为了匹配少数（噪声）样本而导致模型的参数过大。反过来说，L1和L2正则项会鼓励那些参数值在0点附近的情况。从实践的角度看，这种方法可以很好的对统计模型的训练进行校正，得到泛化能力更强的模型。

@@ -266,15 +266,15 @@ R(\mathbf{w}) & = & (\big| |\mathbf{w}| {\big|}_2)^2 \\

 \parinterval 神经机器翻译在每个目标语位置$j$会输出一个分布$y_j$，这个分布描述了每个目标语言单词出现的可能性。在训练时，每个目标语言位置上的答案是一个单词，也就对应了One-hot分布$\tilde{y}_j$，它仅仅在正确答案那一维为1，其它维均为0。模型训练可以被看作是一个调整模型参数让$y_j$逼近$\tilde{y}_j$的过程。但是，$\tilde{y}_j$的每一个维度是一个非0即1的目标，这样也就无法考虑类别之间的相关性。具体来说，除非模型在答案那一维输出1，否则都会得到惩罚。即使模型把一部分概率分配给与答案相近的单词（比如同义词），这个相近的单词仍被视为完全错误的预测。

-\parinterval {\small\bfnew{标签平滑}}\index{标签平滑}（Label Smoothing）\index{Label Smoothing}的思想很简单\upcite{Szegedy_2016_CVPR}：答案所对应的单词不应该``独享''所有的概率，其它单词应该有机会作为答案。这个观点与第二章中语言模型的平滑非常类似。在复杂模型的参数估计中，往往需要给未见或者低频事件分配一些概率，以保证模型具有更好的泛化能力。具体实现时，标签平滑使用了一个额外的分布$q$，它是在词汇表$V$ 上的一个均匀分布，即$q(k)=\frac{1}{|V|}$，其中$q(k)$表示分布的第$k$维。然后，答案分布被重新定义为$\tilde{y}_j$和$q$的线性插值：
+\parinterval {\small\bfnew{标签平滑}}\index{标签平滑}（Label Smoothing）\index{Label Smoothing}的思想很简单\upcite{Szegedy_2016_CVPR}：答案所对应的单词不应该“独享”所有的概率，其它单词应该有机会作为答案。这个观点与第二章中语言模型的平滑非常类似。在复杂模型的参数估计中，往往需要给未见或者低频事件分配一些概率，以保证模型具有更好的泛化能力。具体实现时，标签平滑使用了一个额外的分布$q$，它是在词汇表$V$ 上的一个均匀分布，即$q(k)=\frac{1}{|V|}$，其中$q(k)$表示分布的第$k$维。然后，答案分布被重新定义为$\tilde{y}_j$和$q$的线性插值：
 \begin{eqnarray}
-y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
+y_{j}^{ls} &=& (1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \label{eq:13-5}
 \end{eqnarray}

 \noindent 这里$\alpha$表示一个系数，用于控制分布$q$的重要性。$y_{j}^{ls}$会被作为最终的答案分布用于模型的训练。
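
Equation \eqref{eq:13-5} can be checked with a small sketch. The function name `smooth_labels` and the default value `alpha=0.1` are assumptions chosen for illustration.

```python
def smooth_labels(gold_index, vocab_size, alpha=0.1):
    """Interpolate a One-hot answer distribution with a uniform distribution q."""
    q = 1.0 / vocab_size  # uniform distribution over the vocabulary
    return [(1 - alpha) * (1.0 if k == gold_index else 0.0) + alpha * q
            for k in range(vocab_size)]
```

With a vocabulary of 5 words and `alpha=0.1`, the correct word receives $0.9 + 0.1 \times 0.2 = 0.92$ and every other word receives $0.02$, so the smoothed distribution still sums to 1.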

-\parinterval 标签平滑实际上定义了一种``软''标签，使得所有标签都可以分到一些概率。一方面可以缓解数据中噪声的影响，另一方面目标分布会更合理（显然，真实的分布不应该是One-hot分布）。图\ref{fig:13-12}展示了标签平滑前后的损失函数计算结果的对比。
+\parinterval 标签平滑实际上定义了一种“软”标签，使得所有标签都可以分到一些概率。一方面可以缓解数据中噪声的影响，另一方面目标分布会更合理（显然，真实的分布不应该是One-hot分布）。图\ref{fig:13-12}展示了标签平滑前后的损失函数计算结果的对比。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -293,9 +293,9 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q

 \subsection{Dropout}

-\parinterval 神经机器翻译模型是一种典型的多层神经网络模型。每一层网络都包含若干神经元，负责接收前一层所有神经元的输出，并进行诸如乘法、加法等变换，并有选择的使用非线性的激活函数，最终得到当前层每个神经元的输出。从模型最终预测的角度看，每个神经元都在参与最终的预测。理想的情况下，我们希望每个神经元都能相互独立的做出``贡献''。这样的模型会更加健壮，因为即使一部分神经元不能正常工作，其它神经元仍然可以独立做出合理的预测。但是，随着每一层神经元数量的增加以及网络结构的复杂化，研究者发现神经元之间会出现{\small\bfnew{相互适应}}\index{相互适应}（Co-Adaptation）\index{Co-Adaptation}的现象。所谓相互适应是指，一个神经元对输出的贡献与同一层其它神经元的行为是相关的，也就是说这个神经元已经适应到它周围的``环境''中。
+\parinterval 神经机器翻译模型是一种典型的多层神经网络模型。每一层网络都包含若干神经元，负责接收前一层所有神经元的输出，并进行诸如乘法、加法等变换，并有选择的使用非线性的激活函数，最终得到当前层每个神经元的输出。从模型最终预测的角度看，每个神经元都在参与最终的预测。理想的情况下，我们希望每个神经元都能相互独立的做出“贡献”。这样的模型会更加健壮，因为即使一部分神经元不能正常工作，其它神经元仍然可以独立做出合理的预测。但是，随着每一层神经元数量的增加以及网络结构的复杂化，研究者发现神经元之间会出现{\small\bfnew{相互适应}}\index{相互适应}（Co-Adaptation）\index{Co-Adaptation}的现象。所谓相互适应是指，一个神经元对输出的贡献与同一层其它神经元的行为是相关的，也就是说这个神经元已经适应到它周围的“环境”中。

-\parinterval 相互适应的好处在于神经网络可以处理更加复杂的问题，因为联合使用两个神经元要比单独使用每个神经元的表示能力强。这也类似于传统机器学习任务中往往会设计一些高阶特征，比如自然语言序列标注中对bi-gram和tri-gram的使用。不过另一方面，相互适应会导致模型变得更加``脆弱''。因为相互适应的神经元可以更好的描述训练数据中的现象，但是在测试数据上，由于很多现象是未见的，细微的扰动会导致神经元无法适应。具体体现出来就是过拟合问题。
+\parinterval 相互适应的好处在于神经网络可以处理更加复杂的问题，因为联合使用两个神经元要比单独使用每个神经元的表示能力强。这也类似于传统机器学习任务中往往会设计一些高阶特征，比如自然语言序列标注中对bi-gram和tri-gram的使用。不过另一方面，相互适应会导致模型变得更加“脆弱”。因为相互适应的神经元可以更好的描述训练数据中的现象，但是在测试数据上，由于很多现象是未见的，细微的扰动会导致神经元无法适应。具体体现出来就是过拟合问题。

 \parinterval Dropout是解决这个问题的一种常用方法\upcite{DBLP:journals/corr/abs-1207-0580}。方法很简单，在训练时随机让一部分神经元停止工作，这样每次参数更新中每个神经元周围的环境都在变化，它就不会过分适应到环境中。图\ref{fig:13-13}中给出了某一次参数训练中使用Dropout之前和之后的状态对比。

@@ -308,7 +308,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \end{figure}
 %----------------------------------------------

-\parinterval 具体实现时，可以设置一个参数$p\in (0,1)$。在每次参数更新所使用的前向和反向计算中，每个神经元都以概率$p$停止工作。相当于每层神经网络会有以$p$为比例的神经元被``屏蔽''掉。每一次参数更新中会随机屏蔽不同的神经元。图\ref{fig:13-14}给出了Dropout方法和传统方法计算方式的对比。
+\parinterval 具体实现时，可以设置一个参数$p\in (0,1)$。在每次参数更新所使用的前向和反向计算中，每个神经元都以概率$p$停止工作。相当于每层神经网络会有以$p$为比例的神经元被“屏蔽”掉。每一次参数更新中会随机屏蔽不同的神经元。图\ref{fig:13-14}给出了Dropout方法和传统方法计算方式的对比。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -329,7 +329,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q

 \subsection{Layer Dropout}

-\parinterval 随时网络层数的增多，相互适应也会出现在不同层之间。特别是在引入残差网络之后，不同层的输出可以进行线性组合，因此不同层之间的相互影响会更加直接。对于这个问题，也可以使用Dropout的思想对不同层进行屏蔽。比如，可以使用一个开关来控制一个层能否发挥作用，这个开关以概率$p$被随机关闭，即该层有为$p$的可能性不工作。图\ref{fig:13-15}展示了Transformer多层网络引入Layer Dropout 前后的情况。可以看到，使用Layer Dropout后，开关M会被随机打开或者关闭，以达到屏蔽某一层计算的目的。由于使用了残差网络，关闭每一层相当于``跳过''这一层网络，因此Layer Dropout并不会影响神经网络中数据流的传递。
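+\parinterval As the number of layers grows, co-adaptation also appears between layers. Especially after residual networks are introduced, the outputs of different layers can be linearly combined, so the interaction between layers becomes more direct. The idea of Dropout can also be used here to mask entire layers. For example, a switch can control whether a layer takes effect. The switch is randomly closed with probability $p$, that is, the layer does not work with probability $p$. Figure \ref{fig:13-15} shows a multi-layer Transformer before and after introducing Layer Dropout. With Layer Dropout, the switch M is randomly opened or closed to mask the computation of a layer. Because residual connections are used, closing a layer amounts to ``skipping'' it, so Layer Dropout does not interrupt the data flow in the network.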

 %----------------------------------------------
 \begin{figure}[htp]
@@ -340,7 +340,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \end{figure}
 %----------------------------------------------

-\parinterval Layer Dropout可以被理解为在一个深网络（即原始网络）中随机采样出一个由若干层网络构成的``浅''网络。不同``浅''网络所对应的同一层的模型参数是共享的。这也达到了对指数级子网络高效训练的目的。需要注意的是，在推断阶段，每层的输出需要乘以$1-p$，确保训练时每层输出的期望和解码是一致的。Layer Dropout可以非常有效的缓解深层网路中的过拟合问题。在\ref{subsection-13.2}节还会看到Layer Dropout可以成功地帮助我们训练Deep Transformer模型。
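+\parinterval Layer Dropout can be understood as randomly sampling, from a deep network (the original network), a ``shallow'' network made up of some of its layers. The parameters of the same layer are shared across different ``shallow'' networks, which makes it possible to train an exponential number of sub-networks efficiently. Note that at inference time the output of each layer needs to be multiplied by $1-p$, so that the expected output of each layer during training is consistent with decoding. Layer Dropout is very effective at alleviating overfitting in deep networks. Section \ref{subsection-13.2} will also show that Layer Dropout helps train Deep Transformer models successfully.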
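
The behaviour of Layer Dropout in training and inference can be sketched as follows. The sketch assumes each layer is a plain Python callable that maps a vector (a list of floats) to a vector of the same size, and that a residual connection wraps every layer. The name `layer_dropout_forward` is hypothetical.

```python
import random

def layer_dropout_forward(x, layers, p=0.2, training=True, rng=random):
    """Run a residual stack in which each layer is skipped with probability p."""
    for layer in layers:
        if training:
            if rng.random() < p:
                continue  # switch closed: the residual path passes x through unchanged
            x = [xi + fi for xi, fi in zip(x, layer(x))]
        else:
            # Inference: scale each layer output by 1 - p to match the training expectation.
            x = [xi + (1 - p) * fi for xi, fi in zip(x, layer(x))]
    return x
```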

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -349,11 +349,11 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \sectionnewpage
 \section{增大模型容量}\label{section-13.2}

-\parinterval 神经机器翻译是一种典型的多层神经网络。一方面，可以通过设计合适的网络连接方式和激活函数来捕捉复杂的翻译现象；另一方面，越来越多的可用数据让模型能够得到更有效的训练。在训练数据较为充分的情况下，设计更加``复杂''的模型成为了提升系统性能的有效手段。比如，Transformer模型有两个常用配置Transformer-Base和Transformer-Big。其中，Transformer-Big比Transformer-Base使用了更多的神经元，相应的翻译品质更优\upcite{NIPS2017_7181}。
+\parinterval 神经机器翻译是一种典型的多层神经网络。一方面，可以通过设计合适的网络连接方式和激活函数来捕捉复杂的翻译现象；另一方面，越来越多的可用数据让模型能够得到更有效的训练。在训练数据较为充分的情况下，设计更加“复杂”的模型成为了提升系统性能的有效手段。比如，Transformer模型有两个常用配置Transformer-Base和Transformer-Big。其中，Transformer-Big比Transformer-Base使用了更多的神经元，相应的翻译品质更优\upcite{NIPS2017_7181}。

 \parinterval 那么是否还有类似的方法可以改善系统性能呢？答案显然是肯定的。这里，把这类方法统称为基于大容量模型的方法。在传统机器学习的观点中，神经网络的性能不仅依赖于架构设计，同样与容量密切相关。那么什么是模型的{\small\bfnew{容量}}\index{容量}（Capacity）\index{Capacity}？简单理解，容量是指神经网络的参数量，即神经元之间连接权重的个数。另一种定义是把容量看作神经网络所能表示的假设空间大小\upcite{DBLP:journals/nature/LeCunBH15}，也就是神经网络能表示的不同函数所构成的空间。

-\parinterval 而学习一个神经网络就是要找到一个``最优''的函数，它可以准确地拟合数据。当假设空间变大时，训练系统有机会找到更好的函数，但是同时也需要依赖更多的训练样本才能完成最优函数的搜索。相反，当假设空间变小时，训练系统会更容易完成函数搜索，但是很多优质的函数可能都没有被包含在假设空间里。这也体现了一种简单的辩证思想：如果训练（搜索）的代价高，会有更大的机会找到更好的解；另一方面，如果想少花力气进行训练（搜索），那就设计一个小一些的假设空间，在小一些规模的样本集上进行训练，当然搜索到的解可能不是最好的。
+\parinterval 而学习一个神经网络就是要找到一个“最优”的函数，它可以准确地拟合数据。当假设空间变大时，训练系统有机会找到更好的函数，但是同时也需要依赖更多的训练样本才能完成最优函数的搜索。相反，当假设空间变小时，训练系统会更容易完成函数搜索，但是很多优质的函数可能都没有被包含在假设空间里。这也体现了一种简单的辩证思想：如果训练（搜索）的代价高，会有更大的机会找到更好的解；另一方面，如果想少花力气进行训练（搜索），那就设计一个小一些的假设空间，在小一些规模的样本集上进行训练，当然搜索到的解可能不是最好的。

 \parinterval 在很多机器翻译任务中，训练数据是相对充分的。这时增加模型容量是提升性能的一种很好的选择。常见的方法有三种：

@@ -401,7 +401,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q

 \subsection{深网络}

-\parinterval 虽然，理论上宽网络有能力拟合任意的函数，但是获得这种能力的代价是非常高的。在实践中，往往需要增加相当的宽度，以极大的训练代价才能换来少量的性能提升。当神经网络达到一定宽度后这种现象更为严重。``无限''增加宽度显然是不现实的。
+\parinterval 虽然，理论上宽网络有能力拟合任意的函数，但是获得这种能力的代价是非常高的。在实践中，往往需要增加相当的宽度，以极大的训练代价才能换来少量的性能提升。当神经网络达到一定宽度后这种现象更为严重。“无限”增加宽度显然是不现实的。

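 \parinterval Another approach is therefore to use deeper networks to increase model capacity. A deep network is a neural network with more layers. Whereas the number of parameters of a wide network grows quadratically with width, that of a deep network grows linearly with depth. This gives deep networks an advantage: with the same number of parameters, they can describe a problem through more non-linear transformations, which gives them the ability to model complex problems. In image recognition, for example, many advanced systems are based on very deep neural networks, and on some tasks the best results even require networks with more than 1000 layers.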

@@ -432,7 +432,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \end{figure}
 %----------------------------------------------

-\parinterval 不过，深网络容易发生梯度消失和梯度爆炸问题。因此在使用深网络时，训练策略的选择是至关重要的。实际上，标准的Transformer模型已经是不太``浅''的神经网络，因此里面使用了残差连接来缓解梯度消失等问题。此外，为了避免过拟合，深层网络的训练也要与Dropout等正则化策略相配合，并且需要设计恰当的参数初始化方法和学习率调整策略。关于构建深层神经机器翻译的方法，本章\ref{subsection-7.5.1}节还会有进一步讨论。
+\parinterval 不过，深网络容易发生梯度消失和梯度爆炸问题。因此在使用深网络时，训练策略的选择是至关重要的。实际上，标准的Transformer模型已经是不太“浅”的神经网络，因此里面使用了残差连接来缓解梯度消失等问题。此外，为了避免过拟合，深层网络的训练也要与Dropout等正则化策略相配合，并且需要设计恰当的参数初始化方法和学习率调整策略。关于构建深层神经机器翻译的方法，本章\ref{subsection-7.5.1}节还会有进一步讨论。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -442,7 +442,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q

 \parinterval 如前所述，神经机器翻译的原始输入是单词序列，包括源语言端和目标语言端。模型中的输入层将这种离散的单词表示转换成实数向量的表示，也就是常说的{\small\bfnew{词嵌入}}\index{词嵌入}（Embedding）\index{Embedding}。从实现的角度来看，输入层其实就是从一个词嵌入矩阵中提取对应的词向量表示，这个矩阵两个维度大小分别对应着词表大小和词嵌入的维度。词嵌入的维度也代表着模型对单词刻画的能力。因此适当增加词嵌入的维度也是一种增加模型容量的手段。通常，词嵌入和隐藏层的维度是一致的，这种设计也是为了便于系统实现。

-\parinterval 当然，并不是说词嵌入的维度一定越大就越好。本质上，词嵌入是要在一个多维空间上有效的区分含有不同语义的单词。如果词表较大，更大的词嵌入维度会更有意义，因为需要更多的``特征''描述更多的语义。当词表较小时，增大词嵌入维度可能不会带来增益，相反会增加系统计算的负担。另一种策略是，动态选择词嵌入维度，比如，对于高频词使用较大的词嵌入维度，而对于低频词则使用较小的词嵌入维度\upcite{DBLP:conf/iclr/BaevskiA19}。这种方法可以用同样的参数量处理更大的词表。
+\parinterval 当然，并不是说词嵌入的维度一定越大就越好。本质上，词嵌入是要在一个多维空间上有效的区分含有不同语义的单词。如果词表较大，更大的词嵌入维度会更有意义，因为需要更多的“特征”描述更多的语义。当词表较小时，增大词嵌入维度可能不会带来增益，相反会增加系统计算的负担。另一种策略是，动态选择词嵌入维度，比如，对于高频词使用较大的词嵌入维度，而对于低频词则使用较小的词嵌入维度\upcite{DBLP:conf/iclr/BaevskiA19}。这种方法可以用同样的参数量处理更大的词表。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -458,7 +458,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q

 \subsection{大批量训练}

-\parinterval 在第六章已经介绍了神经机器翻译模型需要使用梯度下降方法进行训练。其中，一项非常重要的技术就是{\small\bfnew{小批量训练}}\index{小批量训练}（Mini-batch Training）\index{Mini-batch Training}，即每次使用多个样本来获取梯度并对模型参数进行更新。这里将每次参数更新使用的多个样本集合称为批次，将样本的数量称作批次的大小。在机器翻译中，通常用批次中的源语言/目标语言单词数或者句子数来表示批次大小。理论上，过小的批次会带来训练的不稳定，而且参数更新次数会大大增加。因此，很多研究者尝试增加批次大小来提高训练的稳定性。在Transformer模型中，使用更大的批次已经被验证是有效的。这种方法也被称作大批量训练。不过，这里所谓`` 大''批量是一个相对的概念。下面就一起看一看如何使用合适的批次大小来训练神经机器翻译模型。
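+\parinterval Chapter 6 already introduced that neural machine translation models are trained with gradient descent. One very important technique there is {\small\bfnew{mini-batch training}}\index{Mini-batch Training}, in which several samples are used each time to compute the gradient and update the model parameters. The set of samples used in one parameter update is called a batch, and the number of samples is called the batch size. In machine translation, the batch size is usually expressed as the number of source-language or target-language words, or the number of sentences, in the batch. In theory, batches that are too small make training unstable and greatly increase the number of parameter updates. Many researchers have therefore tried larger batch sizes to stabilize training. For Transformer models, using larger batches has been shown to be effective, and this is also called large-batch training. Note that ``large'' here is a relative notion. The following looks at how to choose a suitable batch size for training neural machine translation models.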

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -524,7 +524,7 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \item 按词数构建批次：对比按照句长生成批次，按词数生成批次可以防止某些批次中句子整体长度特别长或者特别短的情况，保证不同批次之间整体的词数处于大致相同的范围，这样所得到的梯度也是可比较的。通常的做法是根据源语言词数、目标语言词数，或者源语言词数与目标语言词数的最大值等指标生成批次。

 \vspace{0.5em}
-\item 按课程学习的方式：考虑样本的``难度''也是生成批次的一种策略。比如，可以使用{\small\bfnew{课程学习}}\index{课程学习}（Curriculum Learning）\index{Curriculum Learning} 的思想\upcite{DBLP:conf/icml/BengioLCW09}，让系统先学习``简单''的样本，之后逐渐增加样本的难度，达到循序渐进的学习。具体来说，可以利用句子长度、词频等指标计算每个批次的``难度''，记为$d$。 之后，选择满足$d \leq c$的样本构建一个批次。这里，$c$表示难度的阈值，它可以随着训练的执行不断增大。
+\item 按课程学习的方式：考虑样本的“难度”也是生成批次的一种策略。比如，可以使用{\small\bfnew{课程学习}}\index{课程学习}（Curriculum Learning）\index{Curriculum Learning} 的思想\upcite{DBLP:conf/icml/BengioLCW09}，让系统先学习“简单”的样本，之后逐渐增加样本的难度，达到循序渐进的学习。具体来说，可以利用句子长度、词频等指标计算每个批次的“难度”，记为$d$。 之后，选择满足$d \leq c$的样本构建一个批次。这里，$c$表示难度的阈值，它可以随着训练的执行不断增大。
 \vspace{0.5em}
 \end{itemize}
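
Building batches by word count can be sketched as below. The sketch assumes each sentence is a list of tokens, sorts sentences by length so that each batch holds sentences of similar length, and uses a hypothetical `max_tokens` budget per batch. A single sentence longer than the budget still forms its own batch.

```python
def batches_by_tokens(sentences, max_tokens):
    """Group sentences (token lists) so each batch stays within a token budget."""
    batches, current, current_tokens = [], [], 0
    for sent in sorted(sentences, key=len):
        n = len(sent)
        # Close the current batch when adding this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += n
    if current:
        batches.append(current)
    return batches
```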

@@ -535,24 +535,133 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \sectionnewpage
 \section{对抗样本训练}

+\parinterval 同其它基于神经网络的方法一样，提高{\small\bfnew{鲁棒性}}\index{鲁棒性}（Robustness）\index{Robustness}也是神经机器翻译研发中需要关注的。比如，大容量模型可以很好的拟合训练数据，但是当测试样本与训练样本差异较大时，会导致很糟糕的翻译结果\upcite{JMLR:v15:srivastava14a,DBLP:conf/amta/MullerRS20}。另一方面，实践中也发现，有些情况下即使输入中有微小的扰动，神经网络模型的输出也会产生巨大变化。或者说，神经网络模型在输入样本上容易受到{\small\bfnew{攻击}}\index{攻击}（Attack）\index{Attack}\upcite{DBLP:conf/sp/Carlini017,DBLP:conf/cvpr/Moosavi-Dezfooli16,DBLP:conf/acl/ChengJM19}。图\ref{fig:13-19}展示了一个神经机器翻译系统的结果，可以看到，输入中把单词“他”换成“她”会造成完全不同的译文。这时神经机器翻译系统就存在鲁棒性问题。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter13/Figures/figure-example-of-neural-machine-translation.png}
+\caption{神经机器翻译实例}
+\label{fig:13-19}
+\end{figure}
+%----------------------------------------------
+
+\parinterval 决定神经网络模型鲁棒性的因素主要包括训练数据、网络结构、正则化方法等。仅仅从网络结构设计和训练算法优化的角度来改善鲁棒性一般是较为困难的，因为如果输入数据是“干净”的，模型就会学习在这样的数据上进行预测。无论模型的能力是强还是弱，当推断时输入数据出现扰动的时候，模型可能无法适应，因为它从未见过这种新的数据。因此，一种简单直接的方法是从训练样本出发，让模型在学习的过程中能对样本中的扰动进行处理，进而在推断时具有更强的鲁棒性。具体来说，可以在训练过程中构造有噪声的样本，即基于{\small\bfnew{对抗样本}}\index{对抗样本}（Adversarial Examples）\index{Adversarial Examples}进行{\small\bfnew{对抗训练}}\index{对抗训练}（Adversarial Training）\index{Adversarial Training}。
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{对抗样本及对抗攻击}
+
+\parinterval 在图像识别领域，研究人员就发现，对于输入图像的细小扰动，如像素变化等，会使模型以高置信度给出错误的预测\upcite{DBLP:conf/cvpr/NguyenYC15,DBLP:journals/corr/SzegedyZSBEGF13,DBLP:journals/corr/GoodfellowSS14}，但是这种扰动并不会影响人类的判断。也就是说，样本中的微小变化“欺骗”了图像识别系统，但是“欺骗”不了人类。这种现象背后的原因有很多，例如，一种可能的原因是：系统并没有理解图像，而是在拟合数据，因此拟合能力越强，反而对数据中的微小变化更加敏感。从统计学习的角度看，既然新的数据中可能会有扰动，那更好的学习方式就是在训练中显性地把这种扰动建模出来，让模型对输入的细微变化更加鲁棒。
+
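+\parinterval A sample that adds barely perceptible perturbations to an original sample and thereby causes the model to make a wrong prediction is called an adversarial example. For an input $\mathbi{x}$ and output $\mathbi{y}$, an adversarial example can be formally described as: {\red (the form of s.t. needs to be double-checked)}
+\begin{eqnarray}
+\funp{C}(\mathbi{x}) &=& \mathbi{y} 
+\label{eq:13-6}\\
+\funp{C}(\mathbi{x}') &\neq& \mathbi{y} 
+\label{eq:13-7}\\
+\textrm{s.t.} \quad \funp{R}(\mathbi{x},\mathbi{x}') &<& \varepsilon
+\label{eq:13-8}
+\end{eqnarray}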
+
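+\noindent where $(\mathbi{x},\mathbi{y})$ is the original sample, $(\mathbi{x}',\mathbi{y})$ is the adversarial example whose input contains a perturbation, and the function $\funp{C}(\cdot)$ is the model. In Equation \eqref{eq:13-8}, $\funp{R}(\mathbi{x},\mathbi{x}')$ is the distance between the perturbed input $\mathbi{x}'$ and the original input $\mathbi{x}$, and $\varepsilon$ bounds the size of the perturbation. When a model tends to give poor results on noisy data, it usually has weak resistance to interference, so adversarial examples can be used to test the robustness of existing models\upcite{DBLP:conf/emnlp/JiaL17}. In addition, mixing adversarial examples into the training data, in a manner similar to data augmentation, helps the model learn more general features and produce stable outputs. This is also called adversarial training\upcite{DBLP:journals/corr/GoodfellowSS14,DBLP:conf/emnlp/BekoulisDDD18,DBLP:conf/naacl/YasunagaKR18}.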
+
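+\parinterval The first question in improving robustness through adversarial training is how to generate adversarial examples. The process of generating adversarial examples from the current model $\funp{C}$ and a sample $(\mathbi{x},\mathbi{y})$ is called an {\small\bfnew{adversarial attack}}\index{Adversarial Attack}. Adversarial attacks fall into two types: black-box attacks and white-box attacks. In a white-box attack, the attack algorithm has access to complete information about the model, including the model structure, network parameters, loss function, activation functions, and input and output data. A black-box attack does not need detailed knowledge of the neural network. It achieves its goal only by accessing the inputs and outputs of the model, so it usually relies on heuristics to generate adversarial examples. Since a neural network is itself a black-box model and researchers have limited ability to intervene in its internal parameters, black-box attacks are more practical in many real applications.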
+
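+\parinterval In neural machine translation, small perturbations in the input often make the model fragile\upcite{DBLP:conf/iclr/BelinkovB18}. However, images and text differ in certain ways, and adversarial attack methods for images are hard to apply directly to natural language processing. The reason is that image data, represented for example by pixel values, is continuous\upcite{DBLP:conf/naacl/MichelLNP19}{\red (how should ``continuous'' be understood?)}, whereas the individual words in a text are discrete. Simply replacing these discrete words may therefore produce sentences with grammatical or semantic errors. Such perturbations are too large and easy for the model to detect, so they no longer reflect the original problem. Even if the continuous representations such as word embeddings are perturbed, the result may not match any word in the embedding space\upcite{Gong2018AdversarialTW}. To address these problems, the following focuses on effective methods for generating and using adversarial examples in neural machine translation.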
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{基于黑盒攻击的方法}
+
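+\parinterval A good adversarial example should have certain properties, such as making minimal changes to the text while preserving the meaning of the original as much as possible. This can be achieved by adding noise to the text, either natural noise or artificial noise\upcite{DBLP:conf/iclr/BelinkovB18}. Natural noise usually refers to errors that occur naturally in corpora, such as typing errors and spelling errors. These errors are collected manually to build a word replacement table, which is then used to add noise to the text. Artificial noise can be produced in several ways. For example, different types of noise, such as spelling errors, emoticons, and grammatical errors, can be introduced into clean data with a certain probability, using fixed rules or a noise generator\upcite{DBLP:conf/naacl/VaibhavSSN19,DBLP:conf/naacl/AnastasopoulosL19,DBLP:conf/acl/SinghGR18}. Alternatively, carefully designed or meaningless word sequences can be inserted into the text to distract the model, with different algorithms deciding where and what to insert. This approach is common in reading comprehension tasks\upcite{DBLP:conf/emnlp/JiaL17}.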
+
+\parinterval 除了单纯的在文本中引入各种扰动外，还可以通过文本编辑的方式构建对抗样本，在不改变语义的情况下尽可能的修改文本\upcite{DBLP:journals/corr/SamantaM17,DBLP:conf/ijcai/0002LSBLS18}，从而生成对抗样本。文本的编辑方式主要包括交换，插入，替换和删除操作。图\ref{fig:13-20}给出了一些通过上述方式生成的对抗样本。
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.5]{./Chapter13/Figures/figure-true-case-of-adversarial-examples.jpg}
+\caption{对抗样本实例{\red（需要让读者真实的感受一下）}}
+\label{fig:13-20}
+\end{figure}
+%----------------------------------------------
+
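+\parinterval Formally, algorithms such as FGSM\upcite{DBLP:journals/corr/GoodfellowSS14} can be used to measure how much each word in the text contributes to the meaning, while a candidate pool is built for each word, including its near-synonyms, misspellings, and homophones. Words with a low contribution, such as modal particles and adverbs, can be perturbed by insertion and deletion. The other words in the sequence can be replaced with words selected from their candidate pools. Swap operations can be applied at the word level, by swapping words in the sequence, or at the character level, by swapping characters within a word\upcite{DBLP:conf/coling/EbrahimiLD18}. Different edit operations are applied repeatedly until the model is misled into a wrong prediction.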
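
As a small illustration of a character-level swap, the sketch below swaps two adjacent inner characters of a word. Keeping the first and last characters fixed and leaving words shorter than four characters unchanged are assumptions made here so that the perturbed word stays recognizable. The function name `swap_adjacent_chars` is hypothetical.

```python
import random

def swap_adjacent_chars(word, rng=random):
    """Swap two adjacent inner characters to create a character-level perturbation."""
    if len(word) < 4:
        return word  # too short to perturb without touching the first or last character
    i = rng.randrange(1, len(word) - 2)  # position of the first swapped character
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)
```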
+
+\parinterval 基于语义的方法除了通过不同的算法进行修改输入生成对抗样本外，也可以通过神经网络模型增加扰动，例如在机器翻译中常用的回译技术也是生成对抗样本的一种有效方式，通过反向模型将目标语言翻译成源语言，并再次应用于神经机器翻译系统的训练。除了翻译模型，语言模型也可以用于生成对抗样本。前面也已经介绍过语言模型可以用于检测句子流畅度，根据上文预测当前位置可能出现的单词，因此可以通过语言模型根据上文预测出当前位置最可能出现的多个单词，与序列中的原本的单词进行替换。在机器翻译任务中，可以通过与神经机器翻译系统联合训练，共享词向量矩阵的方式得到语言模型。{\red （引用）}
+
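+\parinterval In addition, {\small\bfnew{generative adversarial networks}}\index{Generative Adversarial Networks} (GANs) can also be used to generate adversarial examples\upcite{DBLP:conf/iclr/ZhaoDS18}. Similar to back-translation, GAN-based methods map the original input to a latent distribution $\funp{P}$ and search within it for text that follows the same distribution to form adversarial examples. Some work has also refined this approach\upcite{DBLP:conf/iclr/ZhaoDS18} by searching in a dense vector space $z$. That is, an adversarial representation $\mathbi{z}'$ is found in the underlying dense vector space that defines $\funp{P}$, and a generative model then maps it back to $\mathbi{x}'$, so that the resulting adversarial example is semantically close to the original input. {\red (Since GANs are not the mainstream approach, consider moving this part to the further reading section.)}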
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{基于白盒攻击的方法}
+
+\parinterval 除了在离散的词汇级别的增加扰动，还可以在模型内部增加扰动。这里简单介绍一下利用白盒攻击方法增加模型鲁棒性的方法：
+
+\begin{itemize}
+\vspace{0.5em}
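+\item Similar to replacing the current word with a near-synonym chosen by the cosine similarity of word vectors, one can add to the embedding of each word a noise vector sampled from a normal distribution and use the result as the final model input. In addition, extra terms can be added to the training objective, for example forcing the encoder to produce representations for the perturbed input that are similar to those for the clean input, and the decoder to output the correct translation\upcite{DBLP:conf/acl/LiuTMCZ18}.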
+\vspace{0.5em}
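+\item Besides standard noise, perturbations can also be designed for specific weaknesses of the model. For example, to address the large output errors caused by homophone errors in the input, the pronunciation of a word can be converted into a sequence of $n$ pronunciation units, such as phonemes or syllables, and an embedding matrix is trained to map each unit to a vector. Averaging the embeddings of the units in the sequence gives the pronunciation representation of the word. Finally, the weighted sum of the word embedding and the pronunciation representation is used as the model input\upcite{DBLP:conf/acl/LiuMHXH19}. This makes the model more robust to homophones and yields more accurate translations. Beyond perturbing the word embedding layer, researchers have also shown that adding extra noise to the encoder output of an end-to-end model has an effect similar to adversarial training with adversarial examples generated by perturbing the layer input, and makes model training more robust\upcite{DBLP:conf/acl/LiLWJXZLL20}.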
+\vspace{0.5em}
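+\item There is also a class of gradient-based methods that generate adversarial examples for adversarial training. For example, an adversarial example can be generated by maximizing the similarity between two vectors: the difference between the embedding of the replacement word and that of the original word, and the gradient vector at the original word\upcite{DBLP:conf/acl/ChengJM19}. The computation is as follows: {\red Is the function below sin or sim? The text seems to say sine. Is the triangle below a delta?}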
+\begin{eqnarray}
+{\mathbi{x}'}_i &=& \arg\max_{\mathbi{x}\in \nu_{\mathbi{x}}}\textrm{sim}(\funp{e}(\mathbi{x})-\funp{e}(\mathbi{x}_i),\mathbi{g}_{\mathbi{x}_i}) 
+\label{eq:13-9} \\
+\mathbi{g}_{\mathbi{x}_i} &=&  \nabla_{\funp{e}(\mathbi{x}_i)} - \log \funp{P}(\mathbi{y}|\mathbi{x};\theta)
+\label{eq:13-10}
+\end{eqnarray}
+
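+\noindent where $\mathbi{x}_i$ is the $i$-th word of the input, $\mathbi{g}_{\mathbi{x}_i}$ is the corresponding gradient vector, namely the gradient of the translation loss $-\log \funp{P}(\mathbi{y}|\mathbi{x};\theta)$ with respect to the word embedding $\funp{e}(\mathbi{x}_i)$, $\funp{e}(\cdot)$ returns the word embedding, and $\textrm{sim}(\cdot,\cdot)$ is the cosine similarity between two vectors. $\nu_{\mathbi{x}}$ is the source-language vocabulary. Because enumerating every word in the vocabulary is expensive, a language model is used to select the $n$ most likely words as candidates, which narrows the search; the sampled source words are then replaced in this way. To protect the model from prediction errors of the decoder, the target-side input of the model is adjusted as well. The method is similar to that on the source side, except that the loss in Equation \eqref{eq:13-10} is replaced by $- \log \funp{P}(\mathbi{y}|\mathbi{x}')$, and the candidate selection with the language model and the sampling are adjusted accordingly. During adversarial training, three additional losses are added to the original training loss, and the final training objective is: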
+\begin{eqnarray}
+Loss(\theta_{\textrm{mt}},\theta_{\textrm{lm}}^{\mathbi{x}},\theta_{\textrm{lm}}^{\mathbi{y}}) &=& Loss_{\textrm{clean}}(\theta_{\textrm{mt}}) + Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{x}}) + \nonumber \\
+& & Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) + Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{y}})
+\label{eq:13-11}
+\end{eqnarray}
+
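+\noindent where $Loss_{\textrm{clean}}(\theta_{\textrm{mt}})$ is the loss on the clean data, $Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{x}})$ and $Loss_{\textrm{lm}}(\theta_{\textrm{lm}}^{\mathbi{y}})$ are the losses of the source-language and target-language language models used to generate adversarial examples, and $Loss_{\textrm{robust}}(\theta_{\textrm{mt}})$ is the loss computed with the adversarial examples, obtained by modifying the source and target sides, as input and the original target sentence as the reference. Its form is: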
+\begin{eqnarray}
+Loss_{\textrm{robust}}(\theta_{\textrm{mt}}) &=&  \frac{1}{|\textrm{S}|}\sum_{(\mathbi{x},\mathbi{y})\in \textrm{S}}-\log \funp{P}(\mathbi{y}|\mathbi{x}',\mathbi{z}';\theta_{\textrm{mt}})
+\label{eq:13-12}
+\end{eqnarray}
+
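+\noindent Here $\textrm{S}$ denotes the whole training corpus, and $\mathbi{x}'$ and $\mathbi{z}'$ are the adversarial source-side and target-side inputs. Adversarial training is achieved in this way.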
+\vspace{0.5em}
+\end{itemize}
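The gradient-based replacement rule above, which picks the candidate whose embedding shift is most similar to the gradient direction, can be illustrated with a small Python sketch. The function names, the toy two-dimensional embeddings, and the hand-written gradient are assumptions made only for illustration; in a real system the gradient would come from back-propagating the translation loss, and the candidate list from a language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_adversarial_word(word, gradient, embeddings, candidates):
    """Return the candidate x maximizing sim(e(x) - e(x_i), g_{x_i}).

    `word` is the original word x_i, `gradient` is g_{x_i}, `embeddings`
    maps words to vectors, and `candidates` is the reduced candidate set.
    """
    e_orig = embeddings[word]
    best, best_score = word, -math.inf
    for cand in candidates:
        if cand == word:
            continue  # replacing a word with itself is not a perturbation
        diff = [a - b for a, b in zip(embeddings[cand], e_orig)]
        score = cosine(diff, gradient)
        if score > best_score:
            best, best_score = cand, score
    return best


# Toy data, purely illustrative.
toy_embeddings = {"good": [1.0, 0.0], "great": [1.0, 1.0], "bad": [-1.0, 0.0]}
choice_up = pick_adversarial_word("good", [0.0, 1.0], toy_embeddings, ["great", "bad"])
choice_left = pick_adversarial_word("good", [-1.0, 0.0], toy_embeddings, ["great", "bad"])
```

With these toy values, `choice_up` is `"great"` because moving from `good` to `great` shifts the embedding exactly along the gradient, while `choice_left` is `"bad"` for the opposite reason.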
+
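+\parinterval Both black-box and white-box methods essentially make model training more robust by adding noise. Similar ideas appear in many machine learning methods. For example, using Gaussian noise in maximum entropy models is a common way to make the model more robust\upcite{chen1999gaussian}{\red find this paper again}. In the deep learning era, adversarial training frames the problem as deliberately constructing examples on which the system is likely to fail, so as to improve its resistance to interference more effectively.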
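As a concrete instance of adding noise inside the model, the embedding-level perturbation from the first white-box item can be sketched as follows. This is a minimal sketch, assuming embeddings are plain Python lists; the function name `add_gaussian_noise` and the default `sigma` are hypothetical choices and are not taken from the cited work.

```python
import random


def add_gaussian_noise(embeddings, sigma=0.01, seed=None):
    """Add independent N(0, sigma^2) noise to every embedding dimension.

    `embeddings` is a list of word vectors (one list of floats per word).
    A new list is returned; the input is left unchanged.
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in vec] for vec in embeddings]


clean = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.7]]
noisy = add_gaussian_noise(clean, sigma=0.01, seed=42)
```

During training, `noisy` would be fed to the encoder in place of `clean`, optionally together with an extra objective that keeps the two encoder representations close.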
+
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

 \sectionnewpage
-\section{最小风险训练}
+\section{Advanced Model Training Techniques}

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{增强学习方法}
+\subsection{Problems with Maximum Likelihood Estimation}
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Non-Teacher-Forcing Methods}

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{曝光偏置问题}
+\subsection{Reinforcement Learning Methods}

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -561,11 +670,11 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q
 \sectionnewpage
 \section{知识精炼}\label{subsection-7.5.3}

-\parinterval 理想的机器翻译系统应该是品质好、速度块、存储占用少。不过现实的机器翻译系统往往需要用运行速度和存储空间来换取翻译品质，比如，\ref{subsection-7.3.2}节提到的增大模型容量的方法就是通过增加模型参数量来达到更好的函数拟合效果，但是这也导致系统变得更加笨拙。在很多场景下，这样的模型甚至无法使用。比如，Transformer-Big等``大''模型通常在专用GPU服务器上运行，在手机等受限环境下仍很难应用。
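+\parinterval An ideal machine translation system should have high quality, high speed, and a small storage footprint. In practice, however, translation quality usually has to be bought with running speed and storage space. For example, the approach of increasing model capacity mentioned in Section \ref{subsection-7.3.2} achieves better function fitting by increasing the number of model parameters, but it also makes the system more cumbersome. In many scenarios such a model cannot even be used. For example, “large” models such as Transformer-Big usually run on dedicated GPU servers and remain hard to deploy in constrained environments such as mobile phones.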

-\parinterval 另一方面，直接训练``小''模型的效果往往并不理想，其翻译品质与``大''模型相比仍有比较明显的差距。比如，在Transformer中，使用一个48层的编码器要比传统的6层编码器在BLEU上高出1-2个点，而且两者翻译结果的人工评价的区别也十分明显。
+\parinterval 另一方面，直接训练“小”模型的效果往往并不理想，其翻译品质与“大”模型相比仍有比较明显的差距。比如，在Transformer中，使用一个48层的编码器要比传统的6层编码器在BLEU上高出1-2个点，而且两者翻译结果的人工评价的区别也十分明显。

-\parinterval 面对小模型难以训练的问题，一种有趣的想法是把``大''模型的知识传递给``小''模型，让``小''模型可以更好的进行学习。这类似于，教小孩子学习数学，是请一个权威数学家（数据中的标准答案），还是请一个小学数学教师（``大''模型）。这就是知识精炼的基本思想。
+\parinterval 面对小模型难以训练的问题，一种有趣的想法是把“大”模型的知识传递给“小”模型，让“小”模型可以更好的进行学习。这类似于，教小孩子学习数学，是请一个权威数学家（数据中的标准答案），还是请一个小学数学教师（“大”模型）。这就是知识精炼的基本思想。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -573,23 +682,23 @@ y_{j}^{ls}=(1-\alpha) \cdot \tilde{y}_j + \alpha \cdot q

 \subsection{什么是知识精炼}

-\parinterval 通常，知识精炼可以被看作是一种知识迁移的手段\upcite{Hinton2015Distilling}。如果把``大''模型的知识迁移到``小''模型，这种方法的直接结果就是{\small\bfnew{模型压缩}}\index{模型压缩}（Model Compression）\index{Model Compression}。当然，理论上也可以把``小''模型的知识迁移到``大''模型，比如，将迁移后得到的``大''模型作为初始状态，之后继续训练该模型，以期望取得加速收敛的效果。不过，在实践中更多是使用``大''模型到``小''模型的迁移，这也是本节讨论的重点。
+\parinterval 通常，知识精炼可以被看作是一种知识迁移的手段\upcite{Hinton2015Distilling}。如果把“大”模型的知识迁移到“小”模型，这种方法的直接结果就是{\small\bfnew{模型压缩}}\index{模型压缩}（Model Compression）\index{Model Compression}。当然，理论上也可以把“小”模型的知识迁移到“大”模型，比如，将迁移后得到的“大”模型作为初始状态，之后继续训练该模型，以期望取得加速收敛的效果。不过，在实践中更多是使用“大”模型到“小”模型的迁移，这也是本节讨论的重点。

 \parinterval 知识精炼基于两个假设：

 \begin{itemize}
 \vspace{0.5em}
-\item ``知识''在模型间是可迁移的。也就是说，一个模型中蕴含的规律可以被另一个模型使用。最典型的例子就是预训练模型（见\ref{subsection-7.2.6}）。使用单语数据学习到的表示模型，在双语的翻译任务中仍然可以发挥很好的作用。也就是，把单语语言模型学习到的知识迁移到双语翻译中对句子表示的任务中；
+\item “知识”在模型间是可迁移的。也就是说，一个模型中蕴含的规律可以被另一个模型使用。最典型的例子就是预训练模型（见\ref{subsection-7.2.6}）。使用单语数据学习到的表示模型，在双语的翻译任务中仍然可以发挥很好的作用。也就是，把单语语言模型学习到的知识迁移到双语翻译中对句子表示的任务中；
 \vspace{0.5em}
-\item 模型所蕴含的``知识''比原始数据中的``知识''更容易被学习到。比如，机器翻译中大量使用的回译（伪数据）方法，就把模型的输出作为数据让系统进行学习。
+\item 模型所蕴含的“知识”比原始数据中的“知识”更容易被学习到。比如，机器翻译中大量使用的回译（伪数据）方法，就把模型的输出作为数据让系统进行学习。
 \vspace{0.5em}
 \end{itemize}

-\parinterval 这里所说的第二个假设对应了机器学习中的一大类问题\ \dash \ {\small\bfnew{学习难度}}\index{学习难度}（Learning Difficulty）\index{Learning Difficulty}。所谓难度是指：在给定一个模型的情况下，需要花费多少代价对目标任务进行学习。如果目标任务很简单，同时模型与任务很匹配，那学习难度就会降低。如果目标任务很复杂，同时模型与其匹配程度很低，那学习难度就会很大。在自然语言处理任务中，这个问题的一种表现是：在很好的数据中学习的模型的翻译质量可能仍然很差。即使训练数据是完美的，但是模型仍然无法做到完美的学习。这可能是因为建模的不合理，导致模型无法描述目标任务中复杂的规律。也就是纵然数据很好，但是模型学不到其中的``知识''。在机器翻译中这个问题体现的尤为明显。比如，在机器翻译系统$n$-best结果中挑选最好的译文（成为Oracle）作为训练样本让系统重新学习，系统仍然达不到Oracle的水平。
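+\parinterval The second assumption corresponds to a broad class of problems in machine learning\ \dash \ {\small\bfnew{learning difficulty}}\index{学习难度}\index{Learning Difficulty}. Difficulty here means the cost of learning the target task with a given model. If the target task is simple and the model matches the task well, learning is easier. If the target task is complex and the model matches it poorly, learning is hard. In natural language processing, one symptom is that a model trained on very good data may still translate poorly. Even if the training data is perfect, the model still cannot learn it perfectly. This may be because inappropriate modeling prevents the model from describing the complex regularities of the target task. In other words, even though the data is good, the model cannot learn the “knowledge” in it. This problem is especially evident in machine translation. For example, if the best translation in the $n$-best output of a machine translation system (called the Oracle) is used as a training sample for the system to learn from again, the system still cannot reach the level of the Oracle.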

-\parinterval 知识精炼本身也体现了一种``自学习''的思想。即利用模型（自己）的预测来教模型（自己）。这样既保证了知识可以向更轻量的模型迁移，同时也避免了模型从原始数据中学习难度大的问题。虽然``大''模型的预测中也会有错误，但是这种预测是更符合建模的假设的，因此``小''模型反倒更容易从不完美的信息中学习\footnote[15]{很多时候，``大''模型和``小''模型都是基于同一种架构，因此二者对问题的假设和模型结构都是相似的。}到更多的知识。类似于，刚开始学习围棋的人从职业九段身上可能什么也学不到，但是向一个业余初段的选手学习可能更容易入门。另外，也有研究表明：在机器翻译中，相比于``小''模型，``大''模型更容易进行优化，也更容易找到更好的模型收敛状态。因此在需要一个性能优越，存储较小的模型时，也会考虑将大模型压缩得到更轻量模型的手段\upcite{DBLP:journals/corr/abs-2002-11794}。
+\parinterval 知识精炼本身也体现了一种“自学习”的思想。即利用模型（自己）的预测来教模型（自己）。这样既保证了知识可以向更轻量的模型迁移，同时也避免了模型从原始数据中学习难度大的问题。虽然“大”模型的预测中也会有错误，但是这种预测是更符合建模的假设的，因此“小”模型反倒更容易从不完美的信息中学习\footnote[15]{很多时候，“大”模型和“小”模型都是基于同一种架构，因此二者对问题的假设和模型结构都是相似的。}到更多的知识。类似于，刚开始学习围棋的人从职业九段身上可能什么也学不到，但是向一个业余初段的选手学习可能更容易入门。另外，也有研究表明：在机器翻译中，相比于“小”模型，“大”模型更容易进行优化，也更容易找到更好的模型收敛状态。因此在需要一个性能优越，存储较小的模型时，也会考虑将大模型压缩得到更轻量模型的手段\upcite{DBLP:journals/corr/abs-2002-11794}。

-\parinterval 通常把``大''模型看作的传授知识的``教师''，被称作{\small\bfnew{教师模型}}\index{教师模型}（Teacher Model）\index{Teacher Model}；把``小''模型看作是接收知识的``学生''，被称作{\small\bfnew{学生模型}}\index{学生模型}（Student Model）\index{Student Model}。比如，可以把Transformer-Big看作是教师模型，把Transformer-Base看作是学生模型。
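+\parinterval The “large” model is usually regarded as the “teacher” that imparts knowledge and is called the {\small\bfnew{teacher model}}\index{教师模型}\index{Teacher Model}; the “small” model is regarded as the “student” that receives knowledge and is called the {\small\bfnew{student model}}\index{学生模型}\index{Student Model}. For example, Transformer-Big can serve as the teacher model and Transformer-Base as the student model.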

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -651,7 +760,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})

 \begin{itemize}
 \vspace{0.5em}
-\item 固定教师模型，通过减少模型容量的方式设计学生模型。比如，可以使用容量较大的模型作为教师模型（如：Transformer-Big或Transformer-Deep），然后通过将神经网络变``窄''、变``浅''的方式得到学生模型。我们可以用Transformer-Big做教师模型，然后把Transformer-Big的解码器变为一层网络，作为学生模型。
+\item 固定教师模型，通过减少模型容量的方式设计学生模型。比如，可以使用容量较大的模型作为教师模型（如：Transformer-Big或Transformer-Deep），然后通过将神经网络变“窄”、变“浅”的方式得到学生模型。我们可以用Transformer-Big做教师模型，然后把Transformer-Big的解码器变为一层网络，作为学生模型。
 \vspace{0.5em}
 \item 固定学生模型，通过模型集成的方式设计教师模型。可以组合多个模型生成更高质量的译文（见\ref{subsection-7.4.3}节）。比如，融合多个Transformer-Big模型（不同参数初始化方式），之后学习一个Transformer-Base模型。
 \vspace{0.5em}
@@ -683,3 +792,7 @@ L_{\textrm{seq}} = - \textrm{logP}_{\textrm{s}}(\hat{\mathbf{y}} | \mathbf{x})

 \sectionnewpage
 \section{小结及深入阅读}
+
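+\parinterval Besides improving model robustness, adversarial examples have many other applications, the most important of which is model evaluation. With a dataset built from adversarial examples, one can test how robust a model is to different types of noise\upcite{DBLP:conf/emnlp/MichelN18}. Because adversarial examples are clearly effective for probing and improving robustness, researchers have proposed many effective methods for different tasks. When generating adversarial examples, however, several issues need to be considered: whether the perturbation is subtle enough to fool the model while remaining hard for humans to notice, whether the adversarial examples generalize well across model architectures and datasets, and whether the generation method is efficient enough.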
+
+
--- a/Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution.tex
+++ b/Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution.tex
+%%%------------------------------------------------------------------------------------------------------------
+%%% Attention distribution based on a Gaussian distribution
+\begin{center}
+\begin{tikzpicture}
+
+\tikzstyle{cirnode}=[circle,inner sep=4pt,draw]
+\tikzstyle{colnode}=[fill=ugreen!30,inner sep=0.1pt]
+
+\tikzstyle{show}=[fill=red,circle,inner sep=0.5pt]
+
+\begin{scope}
+
+\node [anchor=north west,cirnode] (c1) at (0, 0) {};
+
+\draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);
+
+\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
+\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
+\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
+\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
+\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
+\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
+
+\node [anchor=south,colnode,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
+\node [anchor=south,colnode,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
+\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
+\node [anchor=south,colnode,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+
+{\scriptsize
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
+\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){\color{ugreen!30}held};
+\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){talk};
+\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
+\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+}
+
+\node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(a)原始分布}};
+
+\draw[-,very thick] ([xshift=1.2em,yshift=2.4em]b6.north)--([xshift=1.7em,yshift=1.9em]b6.north);
+\draw[-,very thick] ([xshift=1.2em,yshift=1.9em]b6.north)--([xshift=1.7em,yshift=2.4em]b6.north);
+
+\end{scope}
+
+\begin{scope}[xshift=4.4cm,yshift=0em]
+
+\node [anchor=north west,circle,inner sep=4pt] (c1) at (0, 0) {};
+
+\draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);
+
+\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
+\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
+\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
+\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
+\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
+\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
+
+\node [anchor=south,inner sep=0.1pt,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+
+{\scriptsize
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
+\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){held};
+\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){\color{blue!60}talk};
+\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
+\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+}
+
+%\node [anchor=west,show] (s1) at (-0.5em,-5.7em){};
+%\node [anchor=west,show] (s2) at (-0.2em,-5.6em){};
+%\node [anchor=west,show] (s3) at (1.1em,-5.5em){};
+%\node [anchor=west,show] (s11) at (1.9em,-5em){};
+%\node [anchor=west,show] (s4) at (3em,-4em){};
+%\node [anchor=west,show] (s5) at (3.7em,-3em){};
+%\node [anchor=west,show] (s12) at (4.2em,-2.2em){};
+%\node [anchor=west,show] (s6) at (5.3em,-1.4em){};
+%\node [anchor=west,show] (s7) at (6.3em,-2em){};
+%\node [anchor=west,show] (s13) at (7.4em,-3em){};
+%\node [anchor=west,show] (s8) at (8.4em,-3.9em){};
+%\node [anchor=west,show] (s9) at (8.9em,-4.5em){};
+%\node [anchor=west,show] (s10) at (9.3em,-5em){};
+
+%\draw[-,blue!60,thick] (-0.5em,-5.7em)..controls (-0.2em,-5.6em) and (1.1em,-5.5em)..(1.9em,-5em)..controls (3em,-4em) and (3.7em,-3em)..(4.2em,-2.2em)..controls (5.3em,-1.4em) and (6.3em,-2em)..(7.4em,-3em)..controls (8.4em,-3.9em) and (8.9em,-4.5em)..(9.3em,-5em);
+%\draw[-,blue!60,thick] (-1em,-6em)..controls (0em,-5.7em) and (1em,-5em)..(1.6em,-4.3em)..controls (5.3em,1em) and (7.4em,-2em)..(9.3em,-5em);
+\draw[-,blue!60,thick] ([xshift=-1em,yshift=-4.7em]c1.south)..controls (3.8em,-6em) and (3.9em,3.6em)..([xshift=9.3em,yshift=-4.3em]c1.south);
+
+\node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(b)高斯分布}};
+
+\draw[-,very thick] ([xshift=1.2em,yshift=2.2em]b6.north)--([xshift=1.7em,yshift=2.2em]b6.north);
+\draw[-,very thick] ([xshift=1.2em,yshift=1.9em]b6.north)--([xshift=1.7em,yshift=1.9em]b6.north);
+
+\node [anchor=south] (t1) at ([xshift=0em,yshift=6.7em]n4.north){$D$};
+\draw[->] ([xshift=0em,yshift=0em]t1.west)--([xshift=-1em,yshift=0em]t1.west);
+\draw[->] ([xshift=0em,yshift=0em]t1.east)--([xshift=1em,yshift=0em]t1.east);
+\draw[-] ([xshift=1em,yshift=-0.5em]t1.east)--([xshift=1em,yshift=0.5em]t1.east);
+\draw[-] ([xshift=-1em,yshift=-0.5em]t1.west)--([xshift=-1em,yshift=0.5em]t1.west);
+
+\end{scope}
+
+\begin{scope}[xshift=8.8cm,yshift=0em]
+
+\node [anchor=north west,cirnode] (c1) at (0, 0) {};
+
+\draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
+\draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);
+
+\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
+\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
+\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
+\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
+\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
+\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
+
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
+\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
+\node [anchor=south,colnode,minimum height=3.7em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
+\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
+\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+
+{\scriptsize
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
+\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){\color{ugreen!30}held};
+\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){\color{ugreen!30}a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){\color{ugreen!30}talk};
+\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
+\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+}
+
+\node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(c)修改后的分布}};
+
+\end{scope}
+
+
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-convolutional-attention-network.tex
+++ b/Chapter15/Figures/figure-convolutional-attention-network.tex
+\begin{tikzpicture}
+
+\tikzstyle{elementnode} = [anchor=center,draw,minimum size=0.6em,inner sep=0.1pt,gray!80]
+
+\begin{scope}[scale=1.0]
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode] (a\i\j) at (0.6em*\i,0.6em*\j) {};
+
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode,fill=gray!50] (b\i\j) at (0.6em*\i+5.5em,0.6em*\j) {};
+
+
+\node [anchor=south west,minimum height=0.5em,minimum width=4.8em,inner sep=0.1pt,very thick,blue!60,draw] (n1) at ([xshift=0em,yshift=0em]a01.south west) {};
+\node [anchor=north west,minimum height=0.5em,minimum width=4.8em,inner sep=0.1pt,very thick,red!60,draw] (n2) at ([xshift=0em,yshift=0em]a02.north west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,blue!60,draw] (n3) at ([xshift=0em,yshift=0em]b21.west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,red!60,draw] (n4) at ([xshift=0em,yshift=0em]b42.west) {};
+
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.south east) -- ([xshift=0em,yshift=0em]n3.south west);
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.north east) -- ([xshift=0em,yshift=0em]n3.north west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.south east) -- ([xshift=0em,yshift=0em]n4.south west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.north east) -- ([xshift=0em,yshift=0em]n4.north west);
+
+\node [anchor=north] (l1) at ([xshift=0.5em,yshift=-1em]a70.south){\footnotesize {(a)标准自注意力模型}};
+\node [anchor=south,rotate=90] (l2) at ([xshift=0em,yshift=0em]a02.west){\scriptsize {注意力头}};
+\node [anchor=south] (l2) at ([xshift=0em,yshift=0em]a44.north){\scriptsize {句子长度}};
+
+\end{scope}
+
+\begin{scope}[scale=1.0,xshift=4.6cm]
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode] (a\i\j) at (0.6em*\i,0.6em*\j) {};
+
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode,fill=gray!50] (b\i\j) at (0.6em*\i+5.5em,0.6em*\j) {};
+
+
+\node [anchor=south west,minimum height=0.5em,minimum width=3em,inner sep=0.1pt,very thick,blue!60,draw] (n1) at ([xshift=0em,yshift=0em]a01.south west) {};
+\node [anchor=north west,minimum height=0.5em,minimum width=3em,inner sep=0.1pt,very thick,red!60,draw] (n2) at ([xshift=0em,yshift=0em]a22.north west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,blue!60,draw] (n3) at ([xshift=0em,yshift=0em]b21.west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,red!60,draw] (n4) at ([xshift=0em,yshift=0em]b42.west) {};
+
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.south east) -- ([xshift=0em,yshift=0em]n3.south west);
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.north east) -- ([xshift=0em,yshift=0em]n3.north west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.south east) -- ([xshift=0em,yshift=0em]n4.south west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.north east) -- ([xshift=0em,yshift=0em]n4.north west);
+
+\node [anchor=north] (l1) at ([xshift=0.5em,yshift=-1em]a70.south){\footnotesize {(b)一维卷积注意力模型}};
+\node [anchor=south,rotate=90] (l2) at ([xshift=0em,yshift=0em]a02.west){\scriptsize {注意力头}};
+\node [anchor=south] (l2) at ([xshift=0em,yshift=0em]a44.north){\scriptsize {句子长度}};
+
+\end{scope}
+
+\begin{scope}[scale=1.0,xshift=9.2cm]
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode] (a\i\j) at (0.6em*\i,0.6em*\j) {};
+
+\foreach \i / \j in
+    {0/4, 1/4, 2/4, 3/4, 4/4, 5/4, 6/4, 7/4,
+    0/3, 1/3, 2/3, 3/3, 4/3, 5/3, 6/3, 7/3,
+    0/2, 1/2, 2/2, 3/2, 4/2, 5/2, 6/2, 7/2,
+    0/1, 1/1, 2/1, 3/1, 4/1, 5/1, 6/1, 7/1,
+    0/0, 1/0, 2/0, 3/0, 4/0, 5/0, 6/0, 7/0}
+    \node[elementnode,fill=gray!50] (b\i\j) at (0.6em*\i+5.5em,0.6em*\j) {};
+
+
+\node [anchor=south west,minimum height=1.8em,minimum width=3em,inner sep=0.1pt,very thick,blue!60,draw] (n1) at ([xshift=0em,yshift=0em]a00.south west) {};
+\node [anchor=north west,minimum height=1.8em,minimum width=3em,inner sep=0.1pt,very thick,red!60,draw] (n2) at ([xshift=0em,yshift=0em]a23.north west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,blue!60,draw] (n3) at ([xshift=0em,yshift=0em]b21.west) {};
+\node [anchor=west,minimum height=0.6em,minimum width=0.6em,inner sep=0.1pt,very thick,red!60,draw] (n4) at ([xshift=0em,yshift=0em]b42.west) {};
+
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.south east) -- ([xshift=0em,yshift=0em]n3.south west);
+\draw [-,very thick,dotted,blue!60] ([xshift=0em,yshift=0em]n1.north east) -- ([xshift=0em,yshift=0em]n3.north west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.south east) -- ([xshift=0em,yshift=0em]n4.south west);
+\draw [-,very thick,dotted,red!60] ([xshift=0em,yshift=0em]n2.north east) -- ([xshift=0em,yshift=0em]n4.north west);
+
+\node [anchor=north] (l1) at ([xshift=0.5em,yshift=-1em]a70.south){\footnotesize {(c)二维卷积注意力模型}};
+\node [anchor=south,rotate=90] (l2) at ([xshift=0em,yshift=0em]a02.west){\scriptsize {注意力头}};
+\node [anchor=south] (l2) at ([xshift=0em,yshift=0em]a44.north){\scriptsize {句子长度}};
+
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.tex
+++ b/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.tex
+\begin{tikzpicture}
+	\tikzstyle{unit}=[draw,rounded corners=2pt,drop shadow,font=\tiny]
+
+%left
+\begin{scope}
+\foreach \x/\d in {1/2em, 2/8em, 3/18em, 4/24em}
+	\node[unit,fill=yellow!20] at (0,\d) (ln_\x) {层正则};
+
+\foreach \x/\d in {1/4em, 2/20em}
+	\node[unit,fill=green!20] at (0,\d) (sa_\x) {8头自注意力：512};
+
+\foreach \x/\d in {1/6em, 2/16em, 3/22em, 4/32em}
+	\node[draw,circle,minimum size=1em,inner sep=1pt] at (0,\d) (add_\x) {\scriptsize\bfnew{+}};
+
+\foreach \x/\d in {2/14em, 4/30em}
+	\node[unit,fill=red!20] at (0,\d) (conv_\x) {卷积$1 \times 1$：512};
+
+\foreach \x/\d in {1/10em,3/26em}
+	\node[unit,fill=red!20] at (0,\d) (conv_\x) {卷积$1 \times 1$：2048};
+
+\foreach \x/\d in {1/12em, 2/28em}
+	\node[unit,fill=blue!20] at (0,\d) (relu_\x) {RELU};
+
+\draw[->,thick] ([yshift=-1.4em]ln_1.-90) -- ([yshift=-0.1em]ln_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_1.90) -- ([yshift=-0.1em]sa_1.-90);
+\draw[->,thick] ([yshift=0.1em]sa_1.90) -- ([yshift=-0.1em]add_1.-90);
+\draw[->,thick] ([yshift=0.1em]add_1.90) -- ([yshift=-0.1em]ln_2.-90);
+\draw[->,thick] ([yshift=0.1em]ln_2.90) -- ([yshift=-0.1em]conv_1.-90);
+\draw[->,thick] ([yshift=0.1em]conv_1.90) -- ([yshift=-0.1em]relu_1.-90);
+\draw[->,thick] ([yshift=0.1em]relu_1.90) -- ([yshift=-0.1em]conv_2.-90);
+\draw[->,thick] ([yshift=0.1em]conv_2.90) -- ([yshift=-0.1em]add_2.-90);
+\draw[->,thick] ([yshift=0.1em]add_2.90) -- ([yshift=-0.1em]ln_3.-90);
+\draw[->,thick] ([yshift=0.1em]ln_3.90) -- ([yshift=-0.1em]sa_2.-90);
+\draw[->,thick] ([yshift=0.1em]sa_2.90) -- ([yshift=-0.1em]add_3.-90);
+\draw[->,thick] ([yshift=0.1em]add_3.90) -- ([yshift=-0.1em]ln_4.-90);
+\draw[->,thick] ([yshift=0.1em]ln_4.90) -- ([yshift=-0.1em]conv_3.-90);
+\draw[->,thick] ([yshift=0.1em]conv_3.90) -- ([yshift=-0.1em]relu_2.-90);
+\draw[->,thick] ([yshift=0.1em]relu_2.90) -- ([yshift=-0.1em]conv_4.-90);
+\draw[->,thick] ([yshift=0.1em]conv_4.90) -- ([yshift=-0.1em]add_4.-90);
+\draw[->,thick] ([yshift=0.1em]add_4.90) -- ([yshift=1em]add_4.90);
+
+\draw[->,thick] ([yshift=-0.8em]ln_1.-90) .. controls ([xshift=5em,yshift=-0.8em]ln_1.-90) and ([xshift=5em]add_1.0) .. (add_1.0);
+\draw[->,thick] (add_1.0) .. controls ([xshift=5em]add_1.0) and ([xshift=5em]add_2.0) .. (add_2.0);
+\draw[->,thick] (add_2.0) .. controls ([xshift=5em]add_2.0) and ([xshift=5em]add_3.0) .. (add_3.0);
+\draw[->,thick] (add_3.0) .. controls ([xshift=5em]add_3.0) and ([xshift=5em]add_4.0) .. (add_4.0);
+
+\node[font=\scriptsize] at (0em, -1em){(a) Transformer编码器中若干块的结构};
+\end{scope}
+
+%right
+\begin{scope}[xshift=14em]
+\foreach \x/\d in {1/2em, 2/8em, 3/16em, 4/22em, 5/28em}
+	\node[unit,fill=yellow!20] at (0,\d) (ln_\x) {层正则};
+
+\node[unit,fill=green!20] at (0,24em) (sa_1) {8头自注意力：512};
+
+\foreach \x/\d in {1/6em, 2/14em, 3/20em, 4/26em, 5/36em}
+	\node[draw,circle,minimum size=1em,inner sep=1pt] at (0,\d) (add_\x) {\scriptsize\bfnew{+}};
+
+\node[unit,fill=red!20] at (0,30em) (conv_4) {卷积$1 \times 1$：2048};
+\node[unit,fill=red!20] at (0,34em) (conv_5) {卷积$1 \times 1$：512};
+
+\node[unit,fill=blue!20] at (0,32em) (relu_3) {RELU};
+\node[unit,fill=red!20] at (0,4em) (glu_1) {门控线性单元：512};
+\node[unit,fill=red!20] at (-3em,10em) (conv_1) {卷积$1 \times 1$：2048};
+\node[unit,fill=cyan!20] at (3em,10em) (conv_2) {卷积$3 \times 1$：256};
+\node[unit,fill=blue!20] at (-3em,12em) (relu_1) {RELU};
+\node[unit,fill=blue!20] at (3em,12em) (relu_2) {RELU};
+\node[unit,fill=cyan!20] at (0em,18em) (conv_3) {Sep卷积$9 \times 1$：256};
+
+
+\draw[->,thick] ([yshift=-1.4em]ln_1.-90) -- ([yshift=-0.1em]ln_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_1.90) -- ([yshift=-0.1em]glu_1.-90);
+\draw[->,thick] ([yshift=0.1em]glu_1.90) -- ([yshift=-0.1em]add_1.-90);
+\draw[->,thick] ([yshift=0.1em]add_1.90) -- ([yshift=-0.1em]ln_2.-90);
+\draw[->,thick] ([yshift=0.1em]ln_2.135) -- ([yshift=-0.1em]conv_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_2.45) -- ([yshift=-0.1em]conv_2.-90);
+\draw[->,thick] ([yshift=0.1em]conv_1.90) -- ([yshift=-0.1em]relu_1.-90);
+\draw[->,thick] ([yshift=0.1em]conv_2.90) -- ([yshift=-0.1em]relu_2.-90);
+\draw[->,thick] ([yshift=0.1em]relu_1.90) -- ([yshift=-0.1em]add_2.-135);
+\draw[->,thick] ([yshift=0.1em]relu_2.90) -- ([yshift=-0.1em]add_2.-45);
+\draw[->,thick] ([yshift=0.1em]add_2.90) -- ([yshift=-0.1em]ln_3.-90);
+\draw[->,thick] ([yshift=0.1em]ln_3.90) -- ([yshift=-0.1em]conv_3.-90);
+\draw[->,thick] ([yshift=0.1em]conv_3.90) -- ([yshift=-0.1em]add_3.-90);
+\draw[->,thick] ([yshift=0.1em]add_3.90) -- ([yshift=-0.1em]ln_4.-90);
+\draw[->,thick] ([yshift=0.1em]ln_4.90) -- ([yshift=-0.1em]sa_1.-90);
+\draw[->,thick] ([yshift=0.1em]sa_1.90) -- ([yshift=-0.1em]add_4.-90);
+\draw[->,thick] ([yshift=0.1em]add_4.90) -- ([yshift=-0.1em]ln_5.-90);
+\draw[->,thick] ([yshift=0.1em]ln_5.90) -- ([yshift=-0.1em]conv_4.-90);
+\draw[->,thick] ([yshift=0.1em]conv_4.90) -- ([yshift=-0.1em]relu_3.-90);
+\draw[->,thick] ([yshift=0.1em]relu_3.90) -- ([yshift=-0.1em]conv_5.-90);
+\draw[->,thick] ([yshift=0.1em]conv_5.90) -- ([yshift=-0.1em]add_5.-90);
+\draw[->,thick] ([yshift=0.1em]add_5.90) -- ([yshift=1em]add_5.90);
+
+\draw[->,thick] ([yshift=-0.8em]ln_1.-90) .. controls ([xshift=5em,yshift=-0.8em]ln_1.-90) and ([xshift=5em]add_1.0) .. (add_1.0);
+\draw[->,thick] (add_1.0) .. controls ([xshift=8em]add_1.0) and ([xshift=8em]add_3.0) .. (add_3.0);
+\draw[->,thick] (add_3.0) .. controls ([xshift=5em]add_3.0) and ([xshift=5em]add_4.0) .. (add_4.0);
+\draw[->,thick] (add_4.0) .. controls ([xshift=5em]add_4.0) and ([xshift=5em]add_5.0) .. (add_5.0);
+
+\node[font=\scriptsize,align=center] at (0em, -1.5em){(b) 使用结构搜索方法优化后的 \\ Transformer编码器中若干块的结构};
+
+\node[minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=blue!20] (act) at (5.5em, 38em){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]act.east){Activation function};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=yellow!20] (nor) at ([yshift=-0.6em]act.south){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]nor.east){Normalization};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=cyan!20] (wc) at ([yshift=-0.6em]nor.south){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]wc.east){Wide convolution};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=green!20] (at) at ([yshift=-0.6em]wc.south){};
+\node[anchor=west,font=\footnotesize] (tag) at ([xshift=0.1em]at.east){Attention mechanism};
+\node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=red!20] (nsl) at ([yshift=-0.6em]at.south){};
+\node[anchor=west,font=\footnotesize] at ([xshift=0.1em]nsl.east){Non-spatial layer};
+
+\begin{pgfonlayer}{background}
+\node[draw,drop shadow,fill=white][fit=(act)(nsl)(tag)]{};
+\end{pgfonlayer}
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-encoder-tree-structure-modeling.tex
+++ b/Chapter15/Figures/figure-encoder-tree-structure-modeling.tex
+
+\begin{tikzpicture}
+\begin{scope}
+
+\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,fill=ugreen!30]
+\tikzstyle{tnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,fill=red!30]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.4em,minimum width=4.4em]
+
+\node [anchor=south west,hnode] (n1) at (0,0) {$\mathbi{h}_1$};
+\node [anchor=west,hnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_2$};
+\node [anchor=west,hnode] (n3) at ([xshift=1em,yshift=0em]n2.east) {$\mathbi{h}_3$};
+\node [anchor=west,hnode] (n4) at ([xshift=1em,yshift=0em]n3.east) {$\cdots$};
+\node [anchor=west,hnode] (n5) at ([xshift=1em,yshift=0em]n4.east) {$\mathbi{h}_n$};
+
+\node [anchor=south,tnode] (t1) at ([xshift=2.8em,yshift=1em]n1.north) {$\mathbi{h}_{n+1}$};
+\node [anchor=south,tnode] (t2) at ([xshift=2.8em,yshift=1em]t1.north) {$\mathbi{h}_{n+2}$};
+\node [anchor=south,tnode] (t3) at ([xshift=2.8em,yshift=1em]t2.north) {$\cdots$};
+\node [anchor=south,tnode] (t4) at ([xshift=2.8em,yshift=1em]t3.north) {$\mathbi{h}_{2n-1}$};
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.east) -- ([xshift=0em,yshift=0em]n2.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]n3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.east) -- ([xshift=0em,yshift=0em]n4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.east) -- ([xshift=0em,yshift=0em]n5.west);
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n1.north) -- ([xshift=0em,yshift=0em]t1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.north) -- ([xshift=0em,yshift=0em]t1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]t1.north) -- ([xshift=0em,yshift=0em]t2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.north) -- ([xshift=0em,yshift=0em]t2.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]t2.north) -- ([xshift=0em,yshift=0em]t3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.north) -- ([xshift=0em,yshift=0em]t3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]t3.north) -- ([xshift=0em,yshift=0em]t4.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.north) -- ([xshift=0em,yshift=0em]t4.south);
+
+
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.tex
+++ b/Chapter15/Figures/figure-evolution-and-change-of-ml-methods.tex
+\begin{tikzpicture}
+
+\node[rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box1) at (0em,0em){};
+\node[anchor=west,rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box2) at ([xshift=2.8em]box1.east){};
+\node[anchor=west,rounded corners=4pt, minimum width=10.4em, minimum height=7em,fill=yellow!15!gray!15] (box3) at ([xshift=2.8em]box2.east){};
+
+\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box1.90) -- ([xshift=0.8em]box1.-90);
+\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box2.90) -- ([xshift=0.8em]box2.-90);
+\draw[densely dotted,line width=1.2pt] ([xshift=0.8em]box3.90) -- ([xshift=0.8em]box3.-90);
+
+\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=yellow!10] (n1) at ([xshift=0.5em]box1.west){ML algorithms:\\decision trees, \\ SVMs, $\cdots$};
+\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=yellow!10] (n2) at ([xshift=0.5em]box2.west){Neural networks:\\RNN, CNN, \\ Transformer, $\cdots$};
+\node[anchor=west,draw,rounded corners=2pt, minimum width=5em, minimum height=3em,font=\scriptsize,align=center,inner sep=1pt,fill=ugreen!10] (n3) at ([xshift=0.5em]box3.west){Neural networks:\\RNN, CNN, \\ Transformer, $\cdots$};
+
+\foreach \x/\c in {1/yellow,2/ugreen,3/ugreen}{
+\node[anchor=north,font=\scriptsize,inner ysep=0.1em] (output_\x)at ([xshift=-2.2em,yshift=-0.5em]box\x.north){Output};
+\node[anchor=north,inner ysep=0.1em] at ([xshift=3em,yshift=-0.5em]box\x.north){\scriptsize\bfnew{Steps}};
+\node[anchor=south,font=\scriptsize,inner ysep=0.1em,fill=\c!10,rounded corners=2pt] at ([xshift=-2.2em,yshift=0.5em]box\x.south)(input_\x){Input};
+\draw[->,thick] (input_\x.90) -- (n\x.-90);
+\draw[->,thick] (n\x.90) -- (output_\x.-90);
+}
+
+\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box1.east){1. Feature extraction;\\2. Model design; \\3. Experimental validation.};
+\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box2.east){1. Model design; \\2. Experimental validation.\\ };
+\node[anchor=east,font=\scriptsize,align=center,inner xsep=0pt] at ([xshift=-0.2em]box3.east){1. Experimental validation. \\ \\};
+
+\node [draw,thick,anchor=west,single arrow,minimum height=1.6em,single arrow head extend=0.4em] at ([xshift=0.6em]box1.east) {};
+\node [draw,thick,anchor=west,single arrow,minimum height=1.6em,single arrow head extend=0.4em] at ([xshift=0.6em]box2.east) {};
+
+\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box1.south){Traditional machine learning};
+\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box2.south){Deep learning};
+\node[font=\footnotesize, anchor=north] at ([yshift=-0.1em]box3.south){Deep learning \& architecture search};
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-multi-scale-local-modeling.tex
+++ b/Chapter15/Figures/figure-multi-scale-local-modeling.tex
+\begin{tikzpicture}
+\begin{scope}
+
+\tikzstyle{cirnode}=[circle,minimum size=3.7em,draw]
+\tikzstyle{recnode}=[rectangle,rounded corners=2pt,inner sep=0mm,minimum height=1.5em,minimum width=4em,draw]
+
+\node [anchor=west,cirnode] (n1) at (0, 0) {$\mathbi{h}_{i-2}^l$};
+\node [anchor=west,cirnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_{i-1}^l$};
+\node [anchor=west,cirnode] (n3) at ([xshift=1em,yshift=0em]n2.east) {$\mathbi{h}_{i}^l$};
+\node [anchor=west,cirnode] (n4) at ([xshift=1em,yshift=0em]n3.east) {$\mathbi{h}_{i+1}^l$};
+\node [anchor=west,cirnode] (n5) at ([xshift=1em,yshift=0em]n4.east) {$\mathbi{h}_{i+2}^l$};
+
+\node [anchor=center,blue!30,minimum height=4.2em,minimum width=4.5em,very thick,draw] (c1) at ([xshift=0em,yshift=0em]n3.center) {};
+\node [anchor=center,ugreen!30,minimum height=4.9em,minimum width=14.5em,very thick,draw] (c2) at ([xshift=0em,yshift=0em]n3.center) {};
+\node [anchor=center,red!30,minimum height=5.6em,minimum width=24.5em,very thick,draw] (c3) at ([xshift=0em,yshift=0em]n3.center) {};
+
+\node [anchor=south,recnode] (r1) at ([xshift=0em,yshift=2.5em]n2.north) {$\textrm{head}_1$};
+\node [anchor=south,recnode] (r2) at ([xshift=0em,yshift=2.5em]n3.north) {$\textrm{head}_2$};
+\node [anchor=south,recnode] (r3) at ([xshift=0em,yshift=2.5em]n4.north) {$\textrm{head}_3$};
+
+\node [anchor=south,cirnode] (n6) at ([xshift=0em,yshift=1em]r2.north) {$\mathbi{h}_{i}^{l+1}$};
+
+\draw [->,very thick,blue!30] ([xshift=0em,yshift=0em]c1.north) -- ([xshift=0em,yshift=0em]r2.south);
+\draw [->,very thick,ugreen!30] ([xshift=4.73em,yshift=0em]c2.north) -- ([xshift=0em,yshift=0em]r3.south);
+\draw [->,very thick,red!30] ([xshift=-4.73em,yshift=0em]c3.north) -- ([xshift=0em,yshift=0em]r1.south);
+
+\draw [->] ([xshift=0em,yshift=0em]r1.north) -- ([xshift=0em,yshift=0em]n6.south west);
+\draw [->] ([xshift=0em,yshift=0em]r2.north) -- ([xshift=0em,yshift=0em]n6.south);
+\draw [->] ([xshift=0em,yshift=0em]r3.north) -- ([xshift=0em,yshift=0em]n6.south east);
+
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-transparent-attention-mechanism.tex
+++ b/Chapter15/Figures/figure-transparent-attention-mechanism.tex
+
+\begin{tikzpicture}
+\begin{scope}
+
+\tikzstyle{encnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,thick]
+\tikzstyle{decnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,thick]
+\tikzstyle{cnode}=[rectangle,draw=teal!80, inner sep=0mm,minimum height=1.4em,minimum width=4.4em,fill=teal!17,rounded corners=2pt,thick]
+
+\node [anchor=north,encnode] (n1) at (0, 0) {Encoder};
+
+\node [anchor=north,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n2) at ([xshift=0em,yshift=-0.2em]n1.south) {$\mathbi{X}$};
+
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n3) at ([xshift=1.5em,yshift=0em]n2.east) {$\mathbi{h}_0$};
+
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n4) at ([xshift=1.5em,yshift=0em]n3.east) {$\mathbi{h}_1$};
+
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n5) at ([xshift=1.5em,yshift=0em]n4.east) {$\mathbi{h}_2$};
+
+\node [anchor=west,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n6) at ([xshift=1em,yshift=0em]n5.east) {$\ldots$};
+
+\node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n7) at ([xshift=1em,yshift=0em]n6.east) {$\mathbi{h}_{N-1}$};
+
+
+\node [anchor=north,cnode] (cn1) at ([xshift=0em,yshift=-2.8em]n4.south) {\footnotesize{Weighted aggregation $\mathbi{g}$}};
+
+\node [anchor=north,cnode,opacity=0.5] (cn2) at ([xshift=0em,yshift=-2.8em]n3.south) {\footnotesize{Weighted aggregation $\mathbi{g}$}};
+
+\node [anchor=north,cnode,opacity=0.5] (cn3) at ([xshift=0em,yshift=-2.8em]n5.south) {\footnotesize{Weighted aggregation $\mathbi{g}$}};
+
+\node [anchor=north,cnode,opacity=0.5] (cn4) at ([xshift=0em,yshift=-2.8em]n7.south) {\footnotesize{Weighted aggregation $\mathbi{g}$}};
+
+
+\node [anchor=west,decnode] (n9) at ([xshift=0em,yshift=-7.2em]n1.west) {Decoder};
+
+\node [anchor=north,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n10) at ([xshift=0em,yshift=-0.2em]n9.south) {$\mathbi{y}_{<j}$};
+
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n11) at ([xshift=1.5em,yshift=0em]n10.east) {$\mathbi{s}_j^0$};
+
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n12) at ([xshift=1.5em,yshift=0em]n11.east) {$\mathbi{s}_j^1$};
+
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n13) at ([xshift=1.5em,yshift=0em]n12.east) {$\mathbi{s}_j^2$};
+
+\node [anchor=west,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n14) at ([xshift=1em,yshift=0em]n13.east) {$\ldots$};
+
+\node [anchor=west,decnode,draw=ublue,fill=blue!10] (n15) at ([xshift=1em,yshift=0em]n14.east) {$\mathbi{s}_j^{M-1}$};
+
+\node [anchor=west,rectangle,minimum height=1.5em,minimum width=2.5em,rounded corners=5pt] (n16) at ([xshift=1.5em,yshift=0em]n15.east) {$\mathbi{y}_{j}$};
+
+
+\node [anchor=south,minimum height=1.5em,minimum width=2.5em] (n17) at ([xshift=0em,yshift=6em]n16.north) {};
+
+
+\node [anchor=west,minimum height=0.5em,minimum width=4em] (n20) at ([xshift=0em,yshift=-2.5em]n2.east) {};
+\node [anchor=west,minimum height=0.5em,minimum width=4em] (n21) at ([xshift=0em,yshift=-3.9em]n2.east) {};
+\node [anchor=north,minimum height=0.5em,minimum width=4em] (n22) at ([xshift=0em,yshift=-0.7em]n11.south) {};
+
+\begin{pgfonlayer}{background}
+{
+\node[rectangle,inner sep=2pt,fill=blue!7] [fit = (n1) (n7) (n17) (n20)] (bg1) {};
+\node[rectangle,inner sep=2pt,fill=red!7] [fit = (n9) (n16) (n13) (n21) (n22)] (bg2) {};
+}
+\end{pgfonlayer}
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]n3.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.east) -- ([xshift=0em,yshift=0em]n4.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.east) -- ([xshift=0em,yshift=0em]n5.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.east) -- ([xshift=0em,yshift=0em]n6.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n6.east) -- ([xshift=0em,yshift=0em]n7.west);
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n10.east) -- ([xshift=0em,yshift=0em]n11.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n11.east) -- ([xshift=0em,yshift=0em]n12.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n12.east) -- ([xshift=0em,yshift=0em]n13.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n13.east) -- ([xshift=0em,yshift=0em]n14.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n14.east) -- ([xshift=0em,yshift=0em]n15.west);
+\draw [->,thick] ([xshift=0em,yshift=0em]n15.east) -- ([xshift=0em,yshift=0em]n16.west);
+
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]cn2.south) -- ([xshift=0em,yshift=0em]n11.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]cn1.south) -- ([xshift=0em,yshift=0em]n12.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]cn3.south) -- ([xshift=0em,yshift=0em]n13.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]cn4.south) -- ([xshift=0em,yshift=0em]n15.north);
+
+
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn2.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn2.north);
+
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn3.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn3.north);
+
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn4.north);
+\draw [->,thick,gray!70,opacity=0.5] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn4.north);
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.8em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.6em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.7em) and +(north:0.7em)..([xshift=0em,yshift=0em]cn1.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.4em) and +(north:0.6em)..([xshift=0em,yshift=0em]cn1.north);
+
+
+
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
--- a/Chapter16/Figures/figure-contrast-diagram-of-beam-search-topk-and-sampling.tex
+++ b/Chapter16/Figures/figure-contrast-diagram-of-beam-search-topk-and-sampling.tex
-% !Mode:: "TeX:UTF-8"
-% !TEX encoding = UTF-8 Unicode
-
-\begin{tikzpicture}
-
-\begin{scope}
-%%%%%%%%%%% Left: source language
-{\small
-\node [anchor=north west] (dictionarylabel) at (0,0) {Source language};
-\node [anchor=north west] (entry1) at ([xshift=1.0em,yshift=-0.8em]dictionarylabel.south west) {今};
-\node [anchor=north west] (entry2) at ([yshift=0.5em]entry1.south west) {天};
-\node [anchor=north west] (entry3) at ([yshift=0.1em]entry2.south west) {是};
-\node [anchor=north west] (entry4) at ([yshift=0.1em]entry3.south west) {个};
-\node [anchor=north west] (entry5) at ([yshift=0.1em]entry4.south west) {好};
-\node [anchor=north west] (entry6) at ([yshift=0.1em]entry5.south west) {天};
-\node [anchor=north west] (entry7) at ([yshift=0.5em]entry6.south west) {气};
-\node [anchor=north west] (entry8) at ([xshift=0.2em,yshift=0.1em]entry7.south west) {！};
-\node [anchor=center] (pos00) at ([yshift=0.63em]entry1){};
-\node [anchor=center] (pos9) at ([yshift=-1.92em]entry8){};
-}
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=yellow!20,inner sep=0.1em] [fit =(pos00) (entry1) (entry2) (entry3) (entry4) (entry5) (entry6)(entry7)(entry8)(pos9)]  {};
-}
-\end{pgfonlayer}
-
-\end{scope}
-%%%%%%%%%%% Left: source language
-%%%%%%%%%%%%%% Left: model
-\begin{scope}[xshift=4.0em,yshift=0.5em]
-
-\tikzstyle{neuronnode} = [minimum size=0.1em,circle,draw,black,thick,fill=white]
-
-\node [anchor=west] (dictionarylabel2) at ([xshift=1.2em]dictionarylabel.east) {\small{Model}};
-
-\node [anchor=center] (pos1) at ([yshift=-1.65em]dictionarylabel2) {};
-
-\node [anchor=center,neuronnode] (neuron00) at ([xshift=-1.5em,yshift=-4.7em]dictionarylabel2) {};
-\node [anchor=center,neuronnode] (neuron01) at ([yshift=-1.7em]neuron00) {};
-\node [anchor=center,neuronnode] (neuron02) at ([yshift=-1.7em]neuron01) {};
-\node [anchor=center,neuronnode] (neuron03) at ([yshift=-1.7em]neuron02) {};
-
-\node [anchor=center,neuronnode] (neuron10) at ([xshift=1.5em,yshift=-1.0em]neuron00) {};
-\node [anchor=center,neuronnode] (neuron11) at ([yshift=-1.7em]neuron10) {};
-\node [anchor=center,neuronnode] (neuron12) at ([yshift=-1.7em]neuron11) {};
-
-\node [anchor=center,neuronnode] (neuron20) at ([xshift=3em]neuron00) {};
-\node [anchor=center,neuronnode] (neuron21) at ([yshift=-1.7em]neuron20) {};
-\node [anchor=center,neuronnode] (neuron22) at ([yshift=-1.7em]neuron21) {};
-\node [anchor=center,neuronnode] (neuron23) at ([yshift=-1.7em]neuron22) {};
-
-\node [anchor=center] (pos2) at ([yshift=-3.72em]neuron12) {};
-
-\draw[-](neuron00.east) -- (neuron10.west);
-\draw[-](neuron00.east) -- (neuron11.west);
-\draw[-](neuron00.east) -- (neuron12.west);
-\draw[-](neuron01.east) -- (neuron10.west);
-\draw[-](neuron01.east) -- (neuron11.west);
-\draw[-](neuron01.east) -- (neuron12.west);
-\draw[-](neuron02.east) -- (neuron10.west);
-\draw[-](neuron02.east) -- (neuron11.west);
-\draw[-](neuron02.east) -- (neuron12.west);
-\draw[-](neuron03.east) -- (neuron10.west);
-\draw[-](neuron03.east) -- (neuron11.west);
-\draw[-](neuron03.east) -- (neuron12.west);
-
-\draw[-](neuron10.east) -- (neuron20.west);
-\draw[-](neuron11.east) -- (neuron20.west);
-\draw[-](neuron12.east) -- (neuron20.west);
-\draw[-](neuron10.east) -- (neuron21.west);
-\draw[-](neuron11.east) -- (neuron21.west);
-\draw[-](neuron12.east) -- (neuron21.west);
-\draw[-](neuron10.east) -- (neuron22.west);
-\draw[-](neuron11.east) -- (neuron22.west);
-\draw[-](neuron12.east) -- (neuron22.west);
-\draw[-](neuron10.east) -- (neuron23.west);
-\draw[-](neuron11.east) -- (neuron23.west);
-\draw[-](neuron12.east) -- (neuron23.west);
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=gray!20,inner sep=0.1em] [fit = (neuron00) (neuron03) (neuron20) (neuron23)(pos1)(pos2)]  {};
-}
-\end{pgfonlayer}
-
-
-\end{scope}
-%%%%%%%%%%%%%% Left: model
-%%%%%%%%%%%%%%%%%%% Predicted distribution
-\begin{scope}[xshift=11.5em]
-
-\node [anchor=west] (dictionarylabel3) at ([xshift=9.0em]dictionarylabel.east) {\small{Predicted distribution}};
-
-\node [anchor=center] (pos31) at ([xshift=-5.0em,yshift=-1.77em]dictionarylabel3) {};
-\node [anchor=center] (pos32) at ([yshift=-3.5em]pos31) {};
-\node [anchor=center] (pos33) at ([xshift=9.5em]pos31) {};
-\node [anchor=center] (pos34) at ([yshift=-3.5em]pos33) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos31)(pos32)(pos33)(pos34)]  (remark1label3-1) {};
-}
-\end{pgfonlayer}
-
-\node [anchor=center] (pos3-21) at ([xshift=-5.0em,yshift=-8.7em]dictionarylabel3) {};
-\node [anchor=center] (pos3-22) at ([yshift=-4em]pos3-21) {};
-\node [anchor=center] (pos3-23) at ([xshift=9.5em]pos3-21) {};
-\node [anchor=center] (pos3-24) at ([yshift=-4em]pos3-23) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos3-21)(pos3-22)(pos3-23)(pos3-24)]  (remark1label3-2) {};
-}
-\end{pgfonlayer}
-
-\node [anchor=center,minimum height=2.5em,minimum width=1.0em,fill=orange!30] (cy3-11) at ([xshift=-4.6em,yshift=-4.0em]dictionarylabel3) {};
-\node [anchor=center,minimum height=2.0em,minimum width=1.0em,fill=blue!30] (cy3-12) at ([xshift=1.5em,yshift=-0.25em]cy3-11) {};
-\node [anchor=center,minimum height=3.5em,minimum width=1.0em,fill=black!30] (cy3-13) at ([xshift=1.5em,yshift=0.75em]cy3-12) {};
-\node [anchor=center,minimum height=1.0em,minimum width=1.0em,fill=green!30] (cy3-14) at ([xshift=1.5em,yshift=-1.25em]cy3-13) {};
-\node [anchor=center,minimum height=1.5em,minimum width=1.0em,fill=yellow!30] (cy3-15) at ([xshift=1.5em,yshift=0.25em]cy3-14) {};
-\node [anchor=center,minimum height=0.0em,minimum width=1.0em,fill=gray!30] (cy3-16) at ([xshift=1.5em,yshift=-0.4em]cy3-15) {};
-\node [anchor=center] (cy3-17) at ([xshift=1.5em,yshift=1em]cy3-16) {\tiny{$\cdots$}};
-
-%%%%%%%%%%%%%%%%%%%%%%% Legend below
-
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=orange!30] (cu3-11) at ([yshift=-5.2em]cy3-11) {};
-\node [anchor=west] (cu21) at ([xshift=0.0em]cu3-11.east) {\scriptsize{wonderful}};
-
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=green!30] (cu3-12) at ([xshift=5.5em]cu3-11) {};
-\node [anchor=west] (cu22) at ([xshift=0.0em]cu3-12.east) {\scriptsize{brilliant}};
-
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=blue!30] (cu3-13) at ([yshift=-1.5em]cu3-11) {};
-\node [anchor=west] (cu23) at ([xshift=0.0em]cu3-13.east) {\scriptsize{great}};
-
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=black!30] (cu3-14) at ([yshift=-1.5em]cu3-13) {};
-\node [anchor=west] (cu24) at ([xshift=0.0em]cu3-14.east) {\scriptsize{good}};
-
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=yellow!30] (cu3-15) at ([yshift=-1.5em]cu3-12) {};
-\node [anchor=west] (cu25) at ([xshift=0.0em]cu3-15.east) {\scriptsize{sunny}};
-
-\node [anchor=center,minimum height=0.7em,minimum width=1.0em,fill=gray!30] (cu3-16) at ([yshift=-1.5em]cu3-15) {};
-\node [anchor=west] (cu26) at ([xshift=0.0em]cu3-16.east) {\scriptsize{what}};
-
-\end{scope}
-%%%%%%%%%%%%%%%%%%% Predicted distribution
-
-
-%%%%%%%%%%%%%%%%%%%%%%% Decoding strategy
-
-\begin{scope}[xshift=22.5em]
-\node [anchor=west] (dictionarylabel4) at ([xshift=20.5em]dictionarylabel.east) {\small{Decoding strategy}};
-%%%%%%%%%%%%%%%%%%% Background box 1
-\node [anchor=center] (pos4-11) at ([xshift=-4.5em,yshift=-1.77em]dictionarylabel4) {};
-\node [anchor=center] (pos4-12) at ([yshift=-2.5em]pos4-11) {};
-\node [anchor=center] (pos4-13) at ([xshift=9.5em]pos4-11) {};
-\node [anchor=center] (pos4-14) at ([yshift=-2.5em]pos4-13) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos4-11)(pos4-12)(pos4-13)(pos4-14)]  (remark1label41) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%%%%%%%% Background box 2
-\node [anchor=center] (pos4-212) at ([xshift=-4.5em,yshift=-10.2em]dictionarylabel4) {};
-\node [anchor=center] (pos4-222) at ([yshift=-2.5em]pos4-212) {};
-\node [anchor=center] (pos4-232) at ([xshift=9.5em]pos4-212) {};
-\node [anchor=center] (pos4-242) at ([yshift=-2.5em]pos4-232) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos4-212)(pos4-222)(pos4-232)(pos4-242)]  (remark1label42) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%%%%%%%% Background box 3
-\node [anchor=center] (pos4-313) at ([xshift=-4.5em,yshift=-6em]dictionarylabel4) {};
-\node [anchor=center] (pos4-323) at ([yshift=-2.5em]pos4-313) {};
-\node [anchor=center] (pos4-333) at ([xshift=9.5em]pos4-313) {};
-\node [anchor=center] (pos4-343) at ([yshift=-2.5em]pos4-333) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=black,inner sep=0.2em,fill=white,drop shadow] [fit =(pos4-313)(pos4-323)(pos4-333)(pos4-343)]  (remark1label43) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%% Beam search: dashed box 1
-\node [anchor=center] (pos12red11) at ([xshift=0.2em,yshift=-0.65em]pos4-11) {};
-\node [anchor=center] (pos22red11) at ([yshift=-0.85em]pos12red11) {};
-\node [anchor=center] (pos32red11) at ([xshift=1.5em]pos12red11) {};
-\node [anchor=center] (pos42red11) at ([yshift=-0.85em]pos32red11) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12red11)(pos22red11)(pos32red11)(pos42red11)]  (remark1labe41-1) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%%%%%% Beam search: panel contents
-{\scriptsize
-\node [anchor=center] (cy00) at ([xshift=6.7em,yshift=0.2em]pos4-11) {\tiny{Beam search}};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy11) at ([xshift=-0.0em,yshift=-1.80em]pos4-11) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy12) at ([xshift=1.3em,yshift=-0.15em]cy11) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy13) at ([xshift=1.3em,yshift=0.5em]cy12) {};
-\node [anchor=center,minimum height=0.8em,minimum width=0.8em,fill=green!30] (cy14) at ([xshift=1.3em,yshift=-0.85em]cy13) {};
-\node [anchor=center,minimum height=1.2em,minimum width=0.8em,fill=yellow!30] (cy15) at ([xshift=1.3em,yshift=0.2em]cy14) {};
-\node [anchor=center,minimum height=0.0em,minimum width=0.8em,fill=gray!30] (cy16) at ([xshift=1.3em,yshift=-0.25em]cy15) {};
-\node [anchor=center] (cy17) at ([xshift=1.5em,yshift=-0.3em]cy16) {$\cdots$};
-\node [anchor=center] (cy18) at ([xshift=1.1em,yshift=1.2em]cy17) {$\Rightarrow$};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy19) at ([xshift=1.1em,yshift=-0.35em]cy18) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy110) at ([xshift=1.3em,yshift=-0.15em]cy19) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy111) at ([xshift=1.3em,yshift=0.5em]cy110) {};
-\node [anchor=center,color=red] (cy112) at ([xshift=1.35em,yshift=-1.23em]cy110) {\tiny{$\Uparrow$}};
-\node [anchor=center,color=red] (cy113) at ([yshift=-0.55em]cy112) {\tiny{good}};
-}
-%%%%%%%%%%%% Beam search: dashed box 2
-\node [anchor=center] (pos12red12) at ([xshift=7.65em,yshift=-0.65em]pos4-11) {};
-\node [anchor=center] (pos22red12) at ([yshift=-0.85em]pos12red12) {};
-\node [anchor=center] (pos32red12) at ([xshift=1.5em]pos12red12) {};
-\node [anchor=center] (pos42red12) at ([yshift=-0.85em]pos32red12) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12red12)(pos22red12)(pos32red12)(pos42red12)]  (remark1label) {};
-}
-\end{pgfonlayer}
-%%%%%%%%%%%% Top-k: dashed box 1
-\node [anchor=center] (pos12-2) at ([xshift=0.20em,yshift=-0.65em]pos4-212) {};
-\node [anchor=center] (pos22-2) at ([yshift=-0.85em]pos12-2) {};
-\node [anchor=center] (pos32-2) at ([xshift=1.5em]pos12-2) {};
-\node [anchor=center] (pos42-2) at ([yshift=-0.85em]pos32-2) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12-2)(pos22-2)(pos32-2)(pos42-2)]  (remark1label-2) {};
-}
-\end{pgfonlayer}
-
-{\scriptsize
-\node [anchor=center] (cy00-2) at ([xshift=6.7em,yshift=0.2em]pos4-212) {\tiny{$n$-best}};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy11-2) at ([xshift=0.0em,yshift=-1.8em]pos4-212) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy12-2) at ([xshift=1.3em,yshift=-0.15em]cy11-2) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy13-2) at ([xshift=1.3em,yshift=0.5em]cy12-2) {};
-\node [anchor=center,minimum height=0.8em,minimum width=0.8em,fill=green!30] (cy14-2) at ([xshift=1.3em,yshift=-0.85em]cy13-2) {};
-\node [anchor=center,minimum height=1.2em,minimum width=0.8em,fill=yellow!30] (cy15-2) at ([xshift=1.3em,yshift=0.2em]cy14-2) {};
-\node [anchor=center,minimum height=0.0em,minimum width=0.8em,fill=gray!30] (cy16-2) at ([xshift=1.3em,yshift=-0.25em]cy15-2) {};
-\node [anchor=center] (cy17-2) at ([xshift=1.5em,yshift=-0.3em]cy16-2) {$\cdots$};
-\node [anchor=center] (cy18-2) at ([xshift=1.1em,yshift=1.2em]cy17-2) {$\Rightarrow$};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy19-2) at ([xshift=1.1em,yshift=-0.35em]cy18-2) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy110-2) at ([xshift=1.3em,yshift=-0.15em]cy19-2) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy111-2) at ([xshift=1.3em,yshift=0.5em]cy110-2) {};
-\node [anchor=center,color=red] (cy112-2) at ([xshift=0.0em,yshift=-1.2em]cy110-2) {\tiny{$\Uparrow$}};
-\node [anchor=center,color=red] (cy113-2) at ([yshift=-0.65em]cy112-2) {\tiny{wonderful}};
-}
-%%%%%%%%%%%% Top-k: dashed box 2
-\node [anchor=center] (pos12red22) at ([xshift=7.65em,yshift=-0.65em]pos4-212) {};
-\node [anchor=center] (pos22red22) at ([yshift=-0.85em]pos12red22) {};
-\node [anchor=center] (pos32red22) at ([xshift=1.5em]pos12red22) {};
-\node [anchor=center] (pos42red22) at ([yshift=-0.85em]pos32red22) {};
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,draw=red,inner sep=0.2em,fill=white,dashed] [fit =(pos12red22)(pos22red22)(pos32red22)(pos42red22)]  (remark1label2) {};
-}
-\end{pgfonlayer}
-
-%%%%%%%%%%%%%%%%% Sampling: panel contents
-
-{\scriptsize
-\node [anchor=center] (cy00-2) at ([xshift=6.7em,yshift=0.2em]pos4-313) {\tiny{Sampling}};
-\node [anchor=center,minimum height=1.8em,minimum width=0.8em,fill=orange!30] (cy11-3) at ([xshift=2.8em,yshift=-1.8em]pos4-313) {};
-\node [anchor=center,minimum height=1.5em,minimum width=0.8em,fill=blue!30] (cy12-3) at ([xshift=1.3em,yshift=-0.15em]cy11-3) {};
-\node [anchor=center,minimum height=2.5em,minimum width=0.8em,fill=black!30] (cy13-3) at ([xshift=1.3em,yshift=0.5em]cy12-3) {};
-\node [anchor=center,minimum height=0.8em,minimum width=0.8em,fill=green!30] (cy14-3) at ([xshift=1.3em,yshift=-0.85em]cy13-3) {};
-\node [anchor=center,minimum height=1.2em,minimum width=0.8em,fill=yellow!30] (cy15-3) at ([xshift=1.3em,yshift=0.2em]cy14-3) {};
-\node [anchor=center,minimum height=0.0em,minimum width=0.8em,fill=gray!30] (cy16-3) at ([xshift=1.3em,yshift=-0.25em]cy15-3) {};
-\node [anchor=center] (cy17-3) at ([xshift=1.5em,yshift=-0.3em]cy16-3) {$\cdots$};
-\node [anchor=center,color=red] (cy112-3) at ([xshift=0.0em,yshift=-1.1em]cy15-3) {\tiny$\Uparrow$};
-\node [anchor=center,color=red] (cy113-3) at ([yshift=-0.65em]cy112-3) {\tiny{sunny}};
-}
-
-\end{scope}
-
-{\small
-\node [anchor=west] (dictionarylabel) at ([xshift=32.0em]dictionarylabel.east) {Target language};
-\node [anchor=north west] (entry5-1) at ([xshift=0.70em,yshift=-1.37em]dictionarylabel.south west) {Today};
-\node [anchor=north] (entry5-2) at ([yshift=0.5em]entry5-1.south) {is};
-\node [anchor=north] (entry5-3) at ([yshift=0.1em]entry5-2.south) {a};
-\node [anchor=center] (pos5-0) at ([yshift=1.2em]entry5-1){};
-\node [anchor=center] (pos5-4) at ([yshift=-1.2em]entry5-3){};
-
-\node [anchor=north,color=red,minimum height=1.2em,minimum width=3.41em,fill=blue!20,inner sep=0.1em] (entry5-5) at ([yshift=-0.4em]pos5-4.south) {？};
-
-\node [anchor=north,minimum height=1.2em,minimum width=3.15em] (entry5-6) at ([yshift=-0.4em]entry5-5.south) {};
-\node [anchor=north,minimum height=1.2em,minimum width=3.15em] (entry5-7) at ([yshift=-3.23em]entry5-6.south) {};
-}
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=green!20,inner sep=0.1em] [fit =(entry5-1) (entry5-2) (entry5-3)(pos5-0)(pos5-4)]  {};
-}
-\end{pgfonlayer}
-
-\begin{pgfonlayer}{background}
-{
-\node[rectangle,fill=blue!20,inner sep=0.1em] [fit =(entry5-6) (entry5-7)]  {};
-}
-\end{pgfonlayer}
-
-\draw [->,thick]([xshift=0.1em,yshift=0.3em]entry5.east) -- ([xshift=1.27em,yshift=0.3em]entry5.east);
-
-\draw [->,thick]([xshift=5.4859em,yshift=0.3em]entry5.east) --  ([xshift=12em,yshift=0.3em]entry5.east) -- ([yshift=-0.57em]cy3-14.south);
-
-\draw[-,thick] ([xshift=0.0em]cy3-17.east) ..controls+(east:1.2em) and + (west:1.2em)..([xshift=-0.19em]pos4-11.west);\draw[-,thick] ([xshift=0.0em]cy3-17.east) ..controls+(east:1.2em) and + (west:1.2em)..([yshift=-2.3em,xshift=-0.19em]pos4-212.west);
-
-\draw [->,thick]([xshift=-1.291em]entry5-5.west) -- ([xshift=0.02em]entry5-5.west);
-\draw [-,thick]([xshift=0.415em]pos4-13.east)--([xshift=0.9em]pos4-13.east)--([xshift=0.9em]pos4-242.east)--([xshift=0.415em]pos4-242.east);
-
-
-\end{tikzpicture}
-
-%---------------------------------------------------------------------
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
--- a/Chapter16/Figures/figure-cycle-consistency.tex
+++ b/Chapter16/Figures/figure-cycle-consistency.tex
-\begin{tikzpicture}
-
-\node [rectangle,inner sep=2pt,font=\scriptsize] (center) at (0,0) {};
-
-\node [rectangle,inner sep=2pt,font=\scriptsize] (top) at ([yshift=3em,xshift=0em]center.north) {
-\begin{tabular}{c}
-翻译模型 \\
-$\textrm{P}(\ \mathbi{y}|\ \mathbi{x})$
-\end{tabular}
-};
-
-\node [rectangle,inner sep=2pt,font=\scriptsize] (left) at ([yshift=0em,xshift=-4em]center.west) {
-\begin{tabular}{c}
-今天天气真好。
-\end{tabular}
-};
-
-\node [rectangle,inner sep=2pt,font=\scriptsize] (right) at ([yshift=0em,xshift=4em]center.east) {
-\begin{tabular}{c}
-The weather is \\so good today.
-\end{tabular}
-};
-
-\node [rectangle,inner sep=2pt,font=\scriptsize] (down) at ([yshift=-3em,xshift=0em]center.south) {
-\begin{tabular}{c}
-翻译模型 \\
-$\textrm{P}(\ \mathbi{x}|\ \mathbi{y})$
-\end{tabular}
-};
-
-\draw [->,line width=0.8pt] (left.north)  .. controls +(north:0.5) and +(west:0.5) .. (top.west);
-\draw [->,line width=0.8pt] (top.east)  .. controls +(east:0.5) and +(north:0.5) .. (right.north);
-\draw [->,line width=0.8pt] (down.west)  .. controls +(west:0.5) and +(south:0.5) .. (left.south);
-\draw [->,line width=0.8pt] (right.south)  .. controls +(south:0.5) and +(east:0.5) .. (down.east) ;
-
-\end{tikzpicture}
\ No newline at end of file
--- a/Chapter16/Figures/figure-elmo-model-structure.tex
+++ b/Chapter16/Figures/figure-elmo-model-structure.tex
-\begin{tikzpicture}
-
-\tikzstyle{embedding} = [line width=0.6pt,draw=black,minimum width=2.5em,minimum height=1.6em,fill=green!20]
-\tikzstyle{model} = [line width=0.6pt,draw=black,minimum width=3.0em,minimum height=1.6em,fill=blue!20,rounded corners=2pt]
-
-\node [anchor=center,model] (node1-1) at (0,0) {\footnotesize{LSTM}};
-\node [anchor=west,model] (node1-2) at ([xshift=1.8em]node1-1.east) {\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8] (node1-3) at ([xshift=1.0em]node1-2.east) {...};
-\node [anchor=west,model] (node1-4) at ([xshift=1.0em]node1-3.east) {\footnotesize{LSTM}};
-\node [anchor=west,model] (node1-5) at ([xshift=2.0em]node1-4.east) {\footnotesize{LSTM}};
-\node [anchor=west,model] (node1-6) at ([xshift=1.8em]node1-5.east) {\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8] (node1-7) at ([xshift=1.0em]node1-6.east) {...};
-\node [anchor=west,model] (node1-8) at ([xshift=1.0em]node1-7.east) {\footnotesize{LSTM}};
-
-\node [anchor=south,model](node2-1) at ([yshift=1.8em]node1-1.north){\footnotesize{LSTM}};
-\node [anchor=south,model](node2-2) at ([yshift=1.8em]node1-2.north){\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8](node2-3) at ([xshift=1.0em]node2-2.east){...};
-\node [anchor=south,model](node2-4) at ([yshift=1.8em]node1-4.north){\footnotesize{LSTM}};
-\node [anchor=south,model](node2-5) at ([yshift=1.8em]node1-5.north){\footnotesize{LSTM}};
-\node [anchor=south,model](node2-6) at ([yshift=1.8em]node1-6.north){\footnotesize{LSTM}};
-\node [anchor=west,scale=1.8](node2-7) at ([xshift=1.0em]node2-6.east){...};
-\node [anchor=south,model](node2-8) at ([yshift=1.8em]node1-8.north){\footnotesize{LSTM}};
-
-\draw [->,thick](node1-1.east)--(node1-2.west);
-\draw [->,thick](node1-2.east)--([xshift=0.5em]node1-3.west);
-\draw [->,thick]([xshift=-0.5em]node1-3.east)--(node1-4.west);
-\draw [<-,thick](node1-5.east)--(node1-6.west);
-\draw [<-,thick](node1-6.east)--([xshift=0.5em]node1-7.west);
-\draw [<-,thick]([xshift=-0.5em]node1-7.east)--(node1-8.west);
-
-\draw [->,thick](node1-1.north)--(node2-1.south);
-\draw [->,thick](node1-2.north)--(node2-2.south);
-\draw [->,thick](node1-4.north)--(node2-4.south);
-\draw [->,thick](node1-5.north)--(node2-5.south);
-\draw [->,thick](node1-6.north)--(node2-6.south);
-\draw [->,thick](node1-8.north)--(node2-8.south);
-
-\draw [->,thick](node2-1.east)--(node2-2.west);
-\draw [->,thick](node2-2.east)--([xshift=0.5em]node2-3.west);
-\draw [->,thick]([xshift=-0.5em]node2-3.east)--(node2-4.west);
-\draw [<-,thick](node2-5.east)--(node2-6.west);
-\draw [<-,thick](node2-6.east)--([xshift=0.5em]node2-7.west);
-\draw [<-,thick]([xshift=-0.5em]node2-7.east)--(node2-8.west);
-
-\begin{pgfonlayer}{background}
-{
-\node[fill=white,inner sep=0.5em,draw=black,line width=0.6pt,minimum width=6.0em,rounded corners=2pt,dashed] [fit =(node1-1)(node1-2)(node1-3)(node1-4)(node2-1)] (remark1) {};
-}
-\end{pgfonlayer}
-
-\begin{pgfonlayer}{background}
-{
-\node[fill=white,inner sep=0.5em,draw=black,line width=0.6pt,minimum width=6.0em,rounded corners=2pt,dashed] [fit =(node1-5)(node1-6)(node1-7)(node1-8)(node2-8)] (remark1) {};
-}
-\end{pgfonlayer}
-
-\node [anchor=north,embedding] (node0-2) at ([yshift=-2em]node1-4.south){\footnotesize{$\mathbi{e}_2$}};
-\node [anchor=east,embedding] (node0-1) at ([xshift=-1.4em]node0-2.west){\footnotesize{$\mathbi{e}_1$}};
-\node [anchor=north,scale=1.8] (node0-3) at ([yshift=-2em]node1-5.south){...};
-\node [anchor=north,embedding] (node0-4) at ([yshift=-2em]node1-6.south){\footnotesize{$\mathbi{e}_n$}};
-
-\draw [->,thick](node0-1.north)--(node1-1.south);
-\draw [->,thick](node0-1.north)--(node1-5.south);
-\draw [->,thick](node0-2.north)--(node1-2.south);
-\draw [->,thick](node0-2.north)--(node1-6.south);
-\draw [->,thick](node0-4.north)--(node1-4.south);
-\draw [->,thick](node0-4.north)--(node1-8.south);
-
-\node [anchor=south,embedding,fill=yellow!20](node3-2) at ([yshift=2em]node2-4.north){\footnotesize{$\seq{P}_2$}};
-\node [anchor=east,embedding,fill=yellow!20] (node3-1) at ([xshift=-1.4em]node3-2.west){\footnotesize{$\seq{P}_1$}};
-\node [anchor=south,scale=1.8] (node3-3) at ([yshift=2em]node2-5.north){...};
-\node [anchor=south,embedding,fill=yellow!20](node3-4) at ([yshift=2em]node2-6.north){\footnotesize{$\seq{P}_n$}};
-
-\draw [<-,thick](node3-1.south)--(node2-1.north);
-\draw [<-,thick](node3-1.south)--(node2-5.north);
-\draw [<-,thick](node3-2.south)--(node2-2.north);
-\draw [<-,thick](node3-2.south)--(node2-6.north);
-\draw [<-,thick](node3-4.south)--(node2-4.north);
-\draw [<-,thick](node3-4.south)--(node2-8.north);
-
-\end{tikzpicture}
-
-
-
-
-
-
-
-
-
-
-
-
--- a/Chapter16/Figures/figure-example-of-iterative-back-translation.tex
+++ b/Chapter16/Figures/figure-example-of-iterative-back-translation.tex
@@ -14,6 +14,12 @@
 \draw [->,thick]([yshift=-0.75em]node1-1.west)--(remark1.north east);
 \draw [->,thick,dashed](remark1.south east)--([yshift=-0.75em]node2-1.west);

+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-3) at ([xshift=1.0em,yshift=3em]remark1.east){1};
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-4) at ([xshift=1.0em,yshift=-1.8em]remark1.east){2};
+
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-5) at ([xshift=1.0em,yshift=-1.0em]node1-1.east){3};
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node3-6) at ([xshift=1.0em,yshift=0.6em]node2-1.east){3};
+
 \node [anchor=west] (node4-1) at ([xshift=4.0em,yshift=-3.5em]node1-1.east) {\small{反向}};
 \node [anchor=north] (node4-2) at ([yshift=0.5em]node4-1.south) {\small{翻译模型}};
 \begin{pgfonlayer}{background}
@@ -21,6 +27,8 @@
 \node[fill=blue!20,inner sep=0.3em,draw=black,line width=0.6pt,minimum width=3.0em,drop shadow,rounded corners=2pt] [fit =(node4-1)(node4-2)]  (remark2) {};
 }
 \end{pgfonlayer}
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node4-3) at ([xshift=1.0em,yshift=-1.8em]remark2.east){4};
+
 \draw [->,thick]([yshift=-0.75em]node1-1.east)--(remark2.north west);
 \draw [->,thick]([yshift=-0.75em]node2-1.east)--(remark2.south west);

@@ -39,6 +47,9 @@
 }
 \end{pgfonlayer}

+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node6-3) at ([xshift=1.0em,yshift=-1.0em]node5-1.east){5};
+\node [anchor=west,font=\tiny,circle,draw=black,inner sep=0.1em](node6-4) at ([xshift=1.0em,yshift=0.6em]node6-1.east){5};
+
 \draw [->,thick]([yshift=-0.75em]node5-1.east)--(remark3.north west);
 \draw [->,thick]([yshift=-0.75em]node6-1.east)--(remark3.south west);


--- a/Chapter16/Figures/figure-multi-language-single-model-system-diagram.tex
+++ b/Chapter16/Figures/figure-multi-language-single-model-system-diagram.tex
@@ -23,7 +23,7 @@
 \node[anchor=north,font=\footnotesize] (pair2) at ([yshift=-4.5em,xshift=1em]train.south) {双语句对2：};
 \node[anchor=west,lan](train3) at ([yshift=.7em,xshift=0.4em]pair2.east) {法语：{\color{red}<german>} \ Bonjour};
 \node[anchor=west,lan](train4) at ([yshift=-.7em,xshift=0.4em]pair2.east) {德语：Hallo};
-\node[anchor=north,font=\footnotesize] (decode) at ([yshift=-8em]train.south) {\small\bfnew{解码阶段：}};
+\node[anchor=north,font=\footnotesize] (decode) at ([yshift=-8em]train.south) {\small\bfnew{推断阶段：}};
 \node[anchor=north,font=\footnotesize] (input) at ([xshift=2.13em,yshift=-0.6em]decode.south) {输入：};
 \node[anchor=west,lan](decode2) at ([xshift=0.4em]input.east) {英语：{\color{red}<german>} \ hello};
 \node[anchor=north,font=\footnotesize] (output) at ([xshift=2.13em,yshift=-2.6em]decode.south) {输出：};

--- a/Chapter16/Figures/figure-optimization-of-the-model-initialization-method.tex
+++ b/Chapter16/Figures/figure-optimization-of-the-model-initialization-method.tex
@@ -14,7 +14,7 @@
 \draw [->,thick] ([yshift=1pt]data.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]model.north) node[above,midway] {\small{参数优化}};
 \draw [->,thick] ([yshift=1pt]model.south) .. controls +(-90:2em) and +(-90:2em) .. ([yshift=1pt]data.south) node[below,midway] {\small{数据优化}};

-\node[word] at ([xshift=-0.5em,yshift=-4em]data.south){\small{(a) 思路1}};
+\node[word] at ([xshift=-0.5em,yshift=-4em]data.south){\small{(a) 基于数据的初始化方法}};

 \end{scope}
 \end{tikzpicture}
@@ -33,7 +33,7 @@
 \draw [->,thick] ([yshift=1pt]data.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]model.north) node[above,midway] {\small{参数优化}};
 \draw [->,thick] ([yshift=1pt]model.south) .. controls +(-90:2em) and +(-90:2em) .. ([yshift=1pt]data.south) node[below,midway] {\small{数据优化}};

-\node[word] at ([xshift=-0.5em,yshift=-4em]model.south){\small{(b) 思路2}};
+\node[word] at ([xshift=-0.5em,yshift=-4em]model.south){\small{(b) 基于模型的初始化方法}};

 \end{scope}
 \end{tikzpicture}

--- a/Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary.tex
+++ b/Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary.tex
@@ -41,8 +41,8 @@
 \node[](circle2) at ([xshift=3.0em]circle1.east) {\input{Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary-b}};
 \node[](circle3) at ([xshift=5.5em]circle2.east) {\input{Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary-c}};
 \node[](circle4) at ([xshift=5.5em]circle3.east) {\input{Chapter16/Figures/figure-shared-space-inductive-bilingual-dictionary-d}};
-\draw[->,very thick] ([xshift=-0.5em]circle2.east)--([xshift=0.5em]circle3.west)node [pos=0.5,above] (pos1) {\scriptsize{Y空间}};
-\node [anchor=south](pos1-2) at ([yshift=-0.5em]pos1.north){\scriptsize{X映射到}};
+\draw[->,very thick] ([xshift=-0.5em]circle2.east)--([xshift=0.5em]circle3.west)node [pos=0.5,above] (pos1) {\scriptsize{$\mathbi{Y}$空间}};
+\node [anchor=south](pos1-2) at ([yshift=-0.5em]pos1.north){\scriptsize{$\mathbi{X}$映射到}};
 \draw[->,very thick] ([xshift=-0.5em]circle3.east)--([xshift=0.5em]circle4.west)node [pos=0.5,above] (pos2) {\scriptsize{推断}};
 \node [anchor=south](pos2-2) at ([yshift=-0.5em]pos2.north){\scriptsize{词典}};


--- a/Chapter16/Figures/lm-fusion.tex
+++ b/Chapter16/Figures/lm-fusion.tex
-\begin{tikzpicture}
-
-\tikzstyle{cir} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=2em,minimum width=2em,circle,fill=white]
-\tikzstyle{add} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=1em,minimum width=1em,fill=white]
-\tikzstyle{minicir} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=1em,minimum width=1em,fill=white,circle]
-\tikzstyle{rec} = [draw,inner sep=2pt,line width=1pt,align=center,minimum height=1.5em,minimum width=2.5em,fill=white]
-\tikzstyle{dia} = [draw,inner sep=2pt,line width=1pt,align=center,fill=white,diamond,minimum height=2em,minimum width=2em]
-
-
-\node [cir,anchor=north,dashed] (a0) at (0,0) {\tiny{$y_{t-1}$}};
-\node [cir,anchor=west] (a1) at ([xshift=4.0em]a0.east) {\tiny{$y_t$}};
-\node [add,anchor=north] (a11) at ([yshift=-1em]a1.south) {\tiny{$+$}};
-\node [minicir,anchor=north] (a12) at ([yshift=-1em]a11.south) {\tiny{$\times$}};
-\node [minicir,anchor=west] (a11-1) at ([xshift=0.8em]a12.east) {\tiny{$\beta$}};
-\node [rec,anchor=north] (a13) at ([yshift=-1.0em]a12.south) {\tiny{${\funp{P}}_{t}^{LM}$}};
-\node [rec,anchor=north] (a14) at ([yshift=-2.0em]a13.south) {\tiny{${\funp{P}}_{t}^{TM}$}};
-\node [dia,anchor=north] (a15) at ([yshift=-1em]a14.south) {\tiny{$\funp{C}_{t}$}};
-\node [anchor=west] (a13-2) at ([xshift=-4em]a13.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a14-2) at ([xshift=-4em]a14.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a15-2) at ([xshift=-4.25em]a15.west) {\tiny{$\cdots$}};
-
-
-\node [anchor=east] (a13-3) at ([yshift=0.8em]a13-2.west) {\small{模型语言}};
-\node [anchor=north] (a13-4) at ([xshift=0em]a13-3.south) {\small{隐藏层}};
-
-\node [anchor=east] (a14-3) at ([yshift=0.8em]a14-2.west) {\small{神经机器翻译}};
-\node [anchor=north] (a14-4) at ([xshift=0.5em]a14-3.south) {\small{模型隐藏层}};
-
-\node [anchor=east] (a15-3) at ([xshift=0em]a15-2.west) {\small{上下文向量}};
-
-\draw[->,thick](a11.north) -- (a1.south);
-\draw[->,thick](a12.north) -- (a11.south);
-\draw[->,thick](a13.north) -- (a12.south);
-\draw[->,thick](a11-1.west) -- (a12.east);
-\draw[->,dashed](a0.south) -- (a13.north west);
-\draw[->,dashed](a0.south) -- (a14.north west);
-\draw[->,thick](a15.north) -- (a14.south);
-\draw[->,dashed]([xshift=-2.0em]a13.west) -- (a13.west);
-\draw[->,dashed]([xshift=-2.0em]a14.west) -- (a14.west);
-
-\draw [->,thick] (a14.east) ..controls + (east:1em) and +(east:4.1em).. (a11.east);
-\draw[->,dashed](a1.south east) -- ([xshift=6.0em,yshift=-4.0em]a1.south);
-\draw[->,dashed](a1.south east) -- ([xshift=6.0em,yshift=-7.5em]a1.south);
-\draw[-]([xshift=5.9em,yshift=1.05em]a1.east) -- ([xshift=5.9em,yshift=-14.7em]a1.east);
-
-
-%%%%%%%%%%%%%%%%%%%%%%
-
-\node [cir,anchor=west] (a2) at ([xshift=10.0em]a1.east) {\tiny{$y_{t}$}};
-\node [add,anchor=north] (a21) at ([yshift=-1em]a2.south) {\tiny{$+$}};
-\node [minicir,anchor=north] (a22) at ([yshift=-1em]a21.south) {\tiny{$\times$}};
-\node [minicir,anchor=west] (a21-1) at ([xshift=0.8em]a22.east) {\tiny{$g_{t}$}};
-\node [cir,anchor=north] (a23) at ([yshift=-0.6125em]a22.south) {\tiny{${\funp{P}}_{t}^{LM}$}};
-\node [cir,anchor=north] (a24) at ([yshift=-1.217em]a23.south) {\tiny{${\funp{P}}_{t}^{TM}$}};
-\node [dia,anchor=north] (a25) at ([yshift=-0.6044em]a24.south) {\tiny{$\funp{C}_{t}$}};
-\node [anchor=west] (a23-2) at ([xshift=-3.5em]a23.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a24-2) at ([xshift=-3.5em]a24.west) {\tiny{$\cdots$}};
-\node [anchor=west] (a25-2) at ([xshift=-3.65em]a25.west) {\tiny{$\cdots$}};
-
-\draw[->,thick](a21.north) -- (a2.south);
-\draw[->,thick](a22.north) -- (a21.south);
-\draw[->,thick](a23.north) -- (a22.south);
-\draw[->,thick](a21-1.west) -- (a22.east);
-\draw[->,thick](a25.north) -- (a24.south);
-\draw [->,thick] (a24.east) ..controls + (east:1em) and +(east:4.2em).. (a21.east);
-\draw [->,thick] (a25.west) ..controls + (west:1em) and +(west:2em).. (a21.west);
-\draw[->,dashed]([xshift=-1.5em]a23.west) -- (a23.west);
-\draw[->,dashed]([xshift=-1.5em]a24.west) -- (a24.west);
-
-\node [cir,anchor=west] (a3) at ([xshift=4.0em]a2.east) {\tiny{$y_{t+1}$}};
-\node [add,anchor=north] (a31) at ([yshift=-1em]a3.south) {\tiny{$+$}};
-\node [minicir,anchor=north] (a32) at ([yshift=-1em]a31.south) {\tiny{$\times$}};
-\node [minicir,anchor=west] (a31-1) at ([xshift=0.8em]a32.east) {\tiny{$g_{t}$}};
-\node [cir,anchor=north] (a33) at ([yshift=-0.6125em]a32.south) {\tiny{${\funp{P}}_{t}^{LM}$}};
-\node [cir,anchor=north] (a34) at ([yshift=-1.217em]a33.south) {\tiny{${\funp{P}}_{t}^{TM}$}};
-\node [dia,anchor=north] (a35) at ([yshift=-0.6044em]a34.south) {\tiny{$\funp{C}_{t}$}};
-
-
-\draw[->,thick](a31.north) -- (a3.south);
-\draw[->,thick](a32.north) -- (a31.south);
-\draw[->,thick](a33.north) -- (a32.south);
-\draw[->,thick](a31-1.west) -- (a32.east);
-\draw[->,thick](a35.north) -- (a34.south);
-\draw[->,dashed](a23.east) -- (a33.west);
-\draw[->,dashed](a24.east) -- (a34.west);
-\draw [->,thick] (a34.east) ..controls + (east:1em) and +(east:4.2em).. (a31.east);
-\draw [->,thick] (a35.west) ..controls + (west:1em) and +(west:2em).. (a31.west);
-\draw[->,dashed](a33.east) -- ([xshift=2em]a33.east);
-\draw[->,dashed](a34.east) -- ([xshift=2em]a34.east);
-
-\draw[->,dashed](a3.south east) -- ([xshift=6.0em,yshift=-4.0em]a3.south);
-\draw[->,dashed](a3.south east) -- ([xshift=6.0em,yshift=-7.5em]a3.south);
-
-\node[anchor=north](pos1) at ([xshift=-1.5em,yshift=-0.5em]a15.south) {(a) 浅融合};
-\node[anchor=north](pos2) at ([xshift=-2.0em,yshift=-0.5em]a35.south) {(b) 深融合};
-
-\end{tikzpicture}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
--- a/Chapter16/chapter16.tex
+++ b/Chapter16/chapter16.tex
--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -230,27 +230,188 @@

 \parinterval 此外，研究人员们还探索了很多其他方法来提高语音翻译模型的性能。利用在海量的无标注语音数据上预训练的{\small\bfnew{自监督}}\index{自监督}（Self-supervised）\index{Self-supervised}模型作为一个特征提取器，将从语音中提取的特征作为语音翻译模型的输入，可以有效提高模型的性能\upcite{DBLP:conf/interspeech/WuWPG20}。相比语音翻译模型，文本翻译模型任务更加简单，因此一种思想是利用文本翻译模型来指导语音翻译模型，比如通过知识蒸馏\upcite{DBLP:conf/interspeech/LiuXZHWWZ19}、正则化\upcite{DBLP:conf/emnlp/AlinejadS20}等方法。为了简化语音翻译模型的学习，可以通过课程学习的策略，使模型从语音识别任务，逐渐过渡到语音翻译任务，这种由易到难的训练策略可以使模型训练更加充分\upcite{DBLP:journals/corr/abs-1802-06003,DBLP:conf/acl/WangWLZY20}。

+%----------------------------------------------------------------------------------------
+%    NEW SECTION
+%----------------------------------------------------------------------------------------

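+\section{Image Translation}
+
+\parinterval Visual information often accounts for at least as much of what humans perceive as linguistic information does, and sometimes more. It usually takes the form of images, and multimodal machine translation that incorporates images has been studied widely in recent years. Put simply, multimodal machine translation (Figure a) generates target-language text from the source language together with information from other modalities, such as images. Image-augmented translation of this kind is still ``translation'' in the narrow sense: in essence it maps a source language to a target language, that is, text to text. Conversion from image to text (Figure b), for example {\small\bfnew{image captioning}}\index{Image Captioning}, which generates a description related to the content of a given image, can be regarded as ``translation'' in a broad sense. This broad sense covers not only image to text but also image to image (Figure c) and even text to image (Figure d). Here, all of these image-related translation tasks are referred to collectively as image translation.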

+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{Image translation tasks}
+\label{tab:17-2-1-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------

+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------

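+\subsection{Image-Augmented Text Translation}
+
+\parinterval Introducing image information into text translation is the most typical multimodal machine translation task. Multimodal machine translation still converts source-language text into target-language text, but information from other modalities is incorporated during the conversion to reduce ambiguity. Take the earlier example ``A medium sized child jumps off of a dusty bank'': given a related image, the translation model can use the visual information to resolve the ambiguous word and translate ``bank'' as ``河岸'' (river bank) rather than ``银行'' (financial institution). In other words, source-language and target-language descriptions of the same image or visual scene share the same underlying meaning and differ only in how each language expresses it. The image therefore imposes implicit alignment ``constraints'' between the source and target languages. Building these constraints into a machine translation system helps the model understand the context of ambiguous words and thereby improves translation quality.
+
+\parinterval In 2016, the WMT evaluation campaign introduced multimodal machine translation, which combines images and text, for the first time as a shared task on machine translation and crosslingual image description\upcite{DBLP:conf/wmt/SpeciaFSE16}, and the task has since been studied widely\upcite{DBLP:conf/wmt/CaglayanABGBBMH17,DBLP:conf/wmt/LibovickyHTBP16}. How to incorporate visual information and better understand multimodal context is a central research question in multimodal machine translation. The main research directions include feature-fusion methods\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:journals/corr/abs-1712-03449,DBLP:conf/wmt/HelclLV18} and multi-task learning methods\upcite{DBLP:conf/ijcnlp/ElliottK17,DBLP:conf/acl/YinMSZYZL20}. The following introduces work on multimodal machine translation along these two directions.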

+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------

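+\subsubsection{1. Feature-Fusion Methods}
+
+\parinterval Early work typically treated image information as part of the input sentence\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:conf/wmt/HuangLSOD16}, or used it to initialize the states of the encoder or decoder\upcite{DBLP:conf/emnlp/CalixtoL17,Elliott2015MultilingualID,DBLP:conf/wmt/MadhyasthaWS17}. As shown in Figure 2, image features are usually extracted with a convolutional neural network; see {\chaptereleven} for details on convolutional neural networks. The convolutional network produces a global visual feature, which after a dimension transformation is fed into the model either as part of the source-language input or as an initial state. However, introducing image information in this way has two drawbacks: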

+\begin{itemize}
+    \vspace{0.5em}
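+    \item Not all image information is useful. Images often contain content unrelated to the source or target language, so using a global feature introduces noise.
+    \vspace{0.5em}
+    \item When the image serves as part of the source-language input or as an initial state, it takes part in generating target-language words only indirectly, and some of the image information is lost as it propagates through the recurrent neural network.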
+    \vspace{0.5em}
+\end{itemize}

+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{Methods for modeling global visual features}
+\label{tab:17-2-2-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------

+\parinterval The noise problem is what motivates the use of attention. An earlier chapter gave the following example:

+\vspace{0.8em}
+\centerline{中午\ 没\ 吃饭\ ，\ 又\ 刚\ 打\ 了\ 一\ 下午\ 篮球\ ，\ 我\ 现在\ 很\ 饿\ ，\ 我\ 想\underline{\quad \quad \quad} 。}
+\vspace{0.8em}

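+\parinterval Readers fill the blank with ``吃饭'' (have a meal) or ``吃东西'' (eat something) because, while reading the sentence, they attend to key pieces of information such as ``没/吃饭'' (did not eat) and ``很/饿'' (very hungry). This is the problem that attention addresses in natural language processing: when a target-language word is generated, the source-language segments most relevant to it should be emphasized in the representation of the source sentence, rather than all source words being treated equally. Attention is used in multimodal machine translation in the same way. When a target word is generated, the model should focus on the image regions relevant to that word and pay less attention to the rest, which reduces noise. Attention also lets image information take part directly in generating the target language, which addresses the loss of image information as it passes through the encoder.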

+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{Attention for the target word ``bank'': before and after}
+\label{tab:17-2-3-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------

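+\parinterval How, then, does multimodal machine translation compute the context vector? The explanation here follows the one in {\chapterten} (see Figure 10.19):
+
+\parinterval Let the encoder output the state sequence $\{\mathbi{h}_1,\mathbi{h}_2,\ldots,\mathbi{h}_m\}$, where $m$ is the length of the sequence. Note that this is not the state sequence of the source language, but the state sequence of the image extracted by a convolutional neural network. Suppose the image feature has dimensions $16 \times 16 \times 512$, where the first two dimensions are the spatial height and width. It is reshaped into a state sequence of size $256 \times 512$, where 512 is the dimension of each state. For target-language position $j$, the context vector $\mathbi{C}_{j}$ is defined as a weighted sum over the encoder output sequence: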
+\begin{eqnarray}
+\mathbi{C}_{j}&=& \sum_{i}{{\alpha}_{i,j}{\mathbi{h}}_{i}}
+\label{eq:17-2-1}
+\end{eqnarray}

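+\noindent where ${\alpha}_{i,j}$ is the attention weight, which measures the relevance between position $j$ of the target language and position $i$ of the image state sequence. It is computed in the same way as the attention function described in {\chapterten}.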
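+
+\parinterval As a worked illustration, the weights over the $m = 16 \times 16 = 256$ image states can be written out explicitly. This is a sketch that assumes the standard softmax-normalized attention; the symbols $\beta_{i,j}$, $a(\cdot)$ and $\mathbi{s}_{j-1}$ are introduced here for illustration and may differ from the notation of {\chapterten}:
+\begin{eqnarray}
+{\alpha}_{i,j} &=& \frac{\exp({\beta}_{i,j})}{\sum_{i'=1}^{m}\exp({\beta}_{i',j})} \\
+{\beta}_{i,j} &=& a(\mathbi{s}_{j-1},\mathbi{h}_{i})
+\end{eqnarray}
+
+\noindent where $\mathbi{s}_{j-1}$ denotes the decoder state before position $j$ is generated and $a(\cdot)$ is a scoring function such as a dot product. Under this assumption $\sum_{i=1}^{m}{\alpha}_{i,j}=1$, so $\mathbi{C}_{j}$ is a convex combination of the 256 region vectors and has dimension 512.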

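+\parinterval Here, the encoder output $\mathbi{h}_{i}$ at each step is regarded as the representation of position $i$ in the source image sequence. Figure 3 visualizes how attention is distributed over image regions when the model generates the target word ``man'': the model concentrates on the image regions related to the target word. The input of multimodal machine translation also includes the source-language text, of course, and the source text usually contributes more to translation than the image does\upcite{DBLP:conf/acl/YaoW20}. From this point of view, image information supplements the text rather than replacing it. Attention in multimodal machine translation has also been studied in other forms. Besides feeding attended text features and visual features to the decoder as part of its input, some work applies attention between the source language and the image on the encoder side\upcite{DBLP:journals/corr/abs-1712-03449,DBLP:conf/acl/YaoW20} to obtain better source-language representations.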

+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
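+\subsubsection{2. Multi-Task Learning Methods}
+
+\parinterval Multi-task learning methods usually combine the translation task with other vision tasks and train them jointly. Multi-task learning was already discussed in {\chapterfifteen} and {\chaptersixteen}. A common framework shares part of the model parameters across several related tasks to learn what the tasks have in common, and uses task-specific modules to learn what is particular to each task. In multimodal machine translation, the main strategy is to treat translation as the primary task and to add auxiliary tasks related to other modalities, which help the model understand the source language better.
+
+\parinterval As shown in Figure 4, multimodal machine translation can be decomposed into two tasks: machine translation and image generation\upcite{DBLP:conf/ijcnlp/ElliottK17}. Machine translation is the primary task and image generation is the auxiliary task. Image generation here means generating the corresponding image from an image description; the task itself is discussed later. A single encoder models the source-language data, and two decoders (a translation decoder and an image decoder) learn the translation task and the image generation task. The task-specific top layers learn features particular to each task, while the shared bottom layers learn richer text representations. In addition, research on visual question answering\upcite{DBLP:conf/nips/LuYBP16} suggests that multimodal models should not use many layers of attention, because stacking attention layers can cause severe overfitting. Seen from another angle, improving generalization through multi-task learning is also an effective way to prevent overfitting. Similar ideas are widely used in multimodal natural language processing, for example in image captioning and visual question answering\upcite{DBLP:conf/iccv/AntolALMBZP15}.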
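+
+\parinterval The joint training described above can be sketched as a weighted sum of the two task losses. This is only an illustration: the symbols $L_{\textrm{MT}}$, $L_{\textrm{IMG}}$, $\theta$, $\phi_{t}$, $\phi_{v}$ and the interpolation weight $w$ are assumptions introduced here, not notation taken from the cited work:
+\begin{eqnarray}
+L(\theta,\phi_{t},\phi_{v}) &=& w \cdot L_{\textrm{MT}}(\theta,\phi_{t}) + (1-w) \cdot L_{\textrm{IMG}}(\theta,\phi_{v})
+\end{eqnarray}
+
+\noindent where $\theta$ denotes the parameters of the shared encoder, $\phi_{t}$ and $\phi_{v}$ denote the parameters of the translation decoder and the image decoder, and $w \in [0,1]$ balances the primary task against the auxiliary task. Setting $w=1$ recovers a text-only translation model.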
+
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{Multi-task learning applied to multimodal machine translation}
+\label{tab:17-2-4-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
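+\subsection{Image-to-Text Translation}
+
+\parinterval Converting an image into text can also be seen as translation in the broad sense: put simply, the source language is replaced by an image. Image captioning is the most typical task of this kind. Although it is not the focus of this book, the relevant techniques are briefly introduced here to keep the discussion of multimodal translation complete. Image captioning generates a textual description for a given image, and is sometimes also called picture-to-speech description or image subtitle generation. The task must solve two problems: how to understand the image, and how to generate a description based on that understanding. It involves both natural language processing and computer vision, which makes it a challenging task. Image captioning also has broad applications in areas such as image retrieval, navigation aids for the visually impaired, and human-computer interaction, and is therefore of considerable research value.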
+
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{Traditional methods for image captioning}
+\label{tab:17-2-5-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+
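+\parinterval Traditional image captioning follows two paradigms: retrieval-based methods and template-based methods. Retrieval-based methods (Figure 5, left) select a sentence from a given set of candidate descriptions as the description of the image. Their weakness is that the selected sentence may match the image poorly. Template-based methods (Figure 5, right) detect visual features in the image and fill the detected content into templates designed in advance. Their drawback is that the generated descriptions are rigid, as if they had all been ``cast from the same mold''. In recent years, convolutional neural networks have proved highly effective in computer vision and recurrent neural networks in natural language processing. Inspired by the encoder-decoder framework in machine translation, a framework that uses a convolutional neural network as the encoder for the image and a recurrent neural network as the decoder for the description has gradually become the basic paradigm of image captioning. This section starts from this basic encoder-decoder framework\upcite{DBLP:conf/cvpr/VinyalsTBE15,DBLP:conf/icml/XuBKCCSZB15} and then introduces improvements to the encoder and to the decoder.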
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
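+\subsubsection{1. Basic Framework}
+
+\parinterval Inspired by neural machine translation, the encoder-decoder framework has also been applied to image captioning. The encoder converts the input image into a new ``representation'' that contains all the information in the image, and the decoder then converts this representation into the output description. Figure XX (top) shows the encoder-decoder framework applied to image captioning\upcite{DBLP:conf/cvpr/VinyalsTBE15}. First, a convolutional neural network extracts image features into a vector representation of suitable length. Then a long short-term memory (LSTM) network decodes the description, in a process similar to decoding in machine translation. This way of modeling has a shortcoming: generating a given word does not necessarily require all of the image information, and feeding the global image information into the model may introduce noise and make the representation inaccurate. To address this limitation, the model in Figure XX (bottom)\upcite{DBLP:conf/icml/XuBKCCSZB15} introduces attention, so that when generating different words the model no longer attends only to the global image feature but to the image features it ``should'' attend to.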
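+
+\parinterval The decoding process above can be summarized in the same form as neural machine translation. The following is a sketch, and the notation is an assumption made here for illustration: $I$ denotes the input image, $\seq{y}=\{y_1,\ldots,y_n\}$ the generated description, and $\funp{P}$ the model probability:
+\begin{eqnarray}
+\funp{P}(\seq{y}|I) &=& \prod_{t=1}^{n} \funp{P}(y_t|y_{<t},I)
+\end{eqnarray}
+
+\noindent In the basic framework, every step conditions on the same global image vector. With attention, the image enters step $t$ through a context vector computed over image regions for that step.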
+
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{The encoder-decoder framework for image captioning}
+\label{tab:17-2-6-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+
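+\parinterval Image captioning largely follows the encoder-decoder framework. The following introduces improvements on the encoder side and on the decoder side. Broadly, these improvements address two questions: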
+
+\begin{itemize}
+    \vspace{0.5em}
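+    \item On the encoder side, how can image information be encoded more richly and comprehensively?
+    \vspace{0.5em}
+    \item On the decoder side, how can the encoder's feature representations be used more effectively?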
+    \vspace{0.5em}
+\end{itemize}
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
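+\subsubsection{2. Improving the Encoder}
+
+\parinterval For the encoder-decoder framework to be fully effective in image captioning, the encoder must represent image information well, and most improvements to the encoder work in this direction. They typically add semantic information\upcite{DBLP:conf/cvpr/YouJWFL16,DBLP:conf/cvpr/ChenZXNSLC17,DBLP:journals/pami/FuJCSZ17} and position information\upcite{DBLP:conf/cvpr/ChenZXNSLC17,DBLP:conf/ijcai/LiuSWWY17} about the image to the encoder.
+
+\parinterval The semantic information of an image generally refers to the entities, attributes, scenes and so on that it contains. As shown in Figure XX, an attribute or entity detector extracts attribute words and entity words such as ``child'', ``river'' and ``bank'' from the image as its semantic information. The global image feature initializes the recurrent neural network, and attention weights between the target word and the attribute or entity words are then computed. The context vector derived from these weights passes the encoded semantic information to the decoder\upcite{DBLP:conf/cvpr/YouJWFL16}. When decoding the word ``bank'', the model attends more to ``bank'' in the semantic information of the image. Besides entities and attributes, scene information can also be added to the encoder\upcite{DBLP:journals/pami/FuJCSZ17}. Detecting attributes, entities and scenes relies on object detection work such as Faster R-CNN\upcite{DBLP:journals/pami/RenHG017} and YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}, which is not covered in detail here.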
+
+%----------------------------------------------------------------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{Encoder with ``explicit'' semantic information}
+\label{tab:17-2-7-c}
+\end{table}
+%----------------------------------------------------------------------------------------------------
+
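+\parinterval The methods above mostly map entities, attributes, scenes and the like in the image to words and add this information to the encoder explicitly. Another approach applies the semantic features of the image to the encoder implicitly\upcite{DBLP:conf/cvpr/ChenZXNSLC17}. For example, image data can be decomposed into three channels (red, green and blue): each pixel is split into its red, green and blue components, giving three channels. In many images, different channels carry different features, and these can be applied on the encoder side. A further approach strengthens the encoder with position information, which refers to the locations of the objects in the image. Object detection yields the objects in the image together with their features, which determines where the objects are. This information can also be added to the encoder to strengthen its representational power\upcite{DBLP:conf/eccv/YaoPLM18}.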
+
+%----------------------------------------------------------------------------------------
+%    NEW SUBSUB-SECTION
+%----------------------------------------------------------------------------------------
+
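+\subsubsection{3. Improving the Decoder}
+
+\parinterval Because the decoder outputs a sequence of words, improvements to it need to take the properties of language into account. For example, function words such as ``the'', ``on'' and ``at'' (articles and prepositions) have little to do with the image, so introducing image information when decoding them can be harmful\upcite{DBLP:conf/cvpr/LuXPS17}. Structures such as gates can therefore control how much the visual signal influences word generation. In addition, each generated word may correspond to a different region of the image, so more effective attention mechanisms can be designed to capture how much the decoder attends to different local regions\upcite{DBLP:conf/cvpr/00010BT0GZ18}.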
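+
+\parinterval As an illustration of such a gate, the context used to predict the word at step $t$ can interpolate between the visual context and a purely linguistic state. This is a sketch only; the symbols $g_t$, $\mathbi{c}_t$, $\mathbi{s}_t$ and $\hat{\mathbi{c}}_t$ are assumptions introduced here:
+\begin{eqnarray}
+\hat{\mathbi{c}}_t &=& g_t \cdot \mathbi{s}_t + (1-g_t) \cdot \mathbi{c}_t
+\end{eqnarray}
+
+\noindent where $\mathbi{c}_t$ is the attention-weighted visual context, $\mathbi{s}_t$ is a vector computed from the decoder state alone, and $g_t \in [0,1]$ is the gate. For words such as ``the'', a value of $g_t$ close to 1 lets the model rely mainly on the language signal.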
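+
+\parinterval Besides making the generated text and the image features interact better on the decoder side, there are other directions for improving the decoder. For example, the recurrent neural network in the decoder can be replaced with other structures, such as convolutional neural networks or the Transformer\upcite{DBLP:conf/cvpr/AnejaDS18}. Deeper neural networks can also be used to learn words, such as verbs or nouns, that are hard to express visually\upcite{DBLP:journals/mta/FangWCT18}; the idea is related to deep neural machine translation models ({\chapterfifteen}).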
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
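+\subsection{Image-to-Image and Text-to-Image Translation}
+
+\parinterval When the target to be generated is an image, the problem becomes image generation. Although this field is not part of machine translation, its basic methods resemble those of machine translation and the two can borrow from each other. Image generation is therefore also described briefly here.
+
+\parinterval In computer vision, tasks such as image style transfer, semantic segmentation and image super-resolution can all be viewed as {\small\bfnew{image-to-image translation}}\index{Image-to-Image Translation} problems. As in machine translation, the common goal is to learn a mapping from one object to another, except that the objects are images rather than text. Examples include generating a realistic photo of an object from its outline, or generating a night-time photo from a daytime one. Image-to-image translation has many applications, such as image completion and style transfer.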
+
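+\parinterval Generative adversarial networks are widely used in image-to-image translation\upcite{DBLP:conf/nips/GoodfellowPMXWOCB14,DBLP:conf/nips/ZhuZPDEWS17,DBLP:journals/corr/abs-1908-06616}, and are well suited to image generation tasks in general. Put simply, a generative adversarial network has two parts: a generator and a discriminator. The generator produces a result from the input, and the discriminator judges whether a result is real or generated. The adversarial idea is to strengthen both the generator and the discriminator; once the generator's results can ``fool'' the discriminator, that is, the discriminator can no longer tell real results from generated ones, the model is considered to have learned the mapping. In image-to-image translation, the generator produces a predicted image from the input image and the discriminator judges whether it is a target image. After many iterations, when generated images are judged to be target images, the model has learned to ``translate''. The work above is supervised, that is, it relies on datasets of aligned image pairs. Annotating such data is very time-consuming and labor-intensive, so much work has also explored unsupervised methods\upcite{DBLP:conf/iccv/ZhuPIE17,DBLP:conf/iccv/YiZTG17,DBLP:conf/nips/LiuBK17}, which are not discussed further here.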
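+
+\parinterval The adversarial training described above is commonly written as a minimax game. The following is a sketch of the standard unconditional form; the symbols $G$, $D$, $\mathbi{x}$, $\mathbi{z}$, $p_{\textrm{data}}$ and $p_{\mathbi{z}}$ are introduced here for illustration:
+\begin{eqnarray}
+\min_{G}\max_{D} V(D,G) &=& \mathbb{E}_{\mathbi{x}\sim p_{\textrm{data}}}[\log D(\mathbi{x})] + \mathbb{E}_{\mathbi{z}\sim p_{\mathbi{z}}}[\log (1-D(G(\mathbi{z})))]
+\end{eqnarray}
+
+\noindent where $D(\cdot)$ is the probability the discriminator assigns to its input being real and $G(\mathbi{z})$ is the generator output for input $\mathbi{z}$. In image-to-image translation, the generator input would be the source image rather than random noise. At the equilibrium of this game the discriminator outputs 0.5 for every input, which corresponds to it being unable to tell real results from generated ones.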
+
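+\parinterval {\small\bfnew{Text-to-image translation}}\index{Text-to-Image Translation} generates an image from a passage of natural language text that describes details such as the color and shape of an object. The task can be seen as the inverse of image captioning. Most current methods are based on generative adversarial networks\upcite{DBLP:conf/icml/ReedAYLSL16,DBLP:journals/corr/DashGALA17,DBLP:conf/nips/ReedAMTSL16}. The basic procedure is as follows: natural language processing techniques first extract features from the text; these text features then serve as constraints for image generation; the generator of the adversarial network produces an image under these constraints, and the discriminator then judges the quality of the generated image.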

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -478,11 +639,11 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}

 \section{小结及扩展阅读}

-\parinterval 本章仅对音频处理和语音识别进行了简单的介绍，具体内容可以参考一些经典书籍，比如关于信号处理的基础知识\upcite{[Discrete-Time Signal Processing (3rd version)][ Discrete-Time Speech Signal Processing: Principles and Practice]}，以及语音识别的传统方法\upcite{[Fundamentals of Speech Recognition][ Spoken Language Processing: A Guide to Theory, Algorithm, and System Development]}和基于深度学习的最新方法\upcite{[ Automatic Speech Recognition: A Deep Learning Approach, 俞栋、邓力]}。此外，语音翻译的一个重要应用是机器同声传译。
+\parinterval 本章仅对音频处理和语音识别进行了简单的介绍，具体内容可以参考一些经典书籍，比如关于信号处理的基础知识\upcite{Oppenheim2001DiscretetimeSP,Quatieri2001DiscreteTimeSS}，以及语音识别的传统方法\upcite{DBLP:books/daglib/0071550,Huang2001SpokenLP}和基于深度学习的最新方法\upcite{benesty2008automatic}。此外，语音翻译的一个重要应用是机器同声传译。

 \parinterval 在篇章级翻译方面，一些研究工作对这类模型的上下文建模能力进行了探索\upcite{DBLP:conf/discomt/KimTN19,DBLP:conf/acl/LiLWJXZLL20}，发现模型性能在小数据集上的BLEU提升并不完全来自于上下文信息的利用。同时，受限于数据规模，篇章级翻译模型相对难以训练。一些研究人员通过调整训练策略来帮助模型更容易捕获上下文信息\upcite{DBLP:journals/corr/abs-1903-04715,DBLP:conf/acl/SaundersSB20,DBLP:conf/mtsummit/StojanovskiF19}。除了训练策略的调整，也可以使用数据增强\upcite{DBLP:conf/discomt/SugiyamaY19}和预训练\upcite{DBLP:journals/corr/abs-1911-03110,DBLP:journals/tacl/LiuGGLEGLZ20}的手段来缓解数据稀缺的问题。此外，区别于传统的篇章级翻译，一些对话翻译也需要使用长距离上下文信息\upcite{DBLP:conf/wmt/MarufMH18}。

-
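+\parinterval Recently, multimodal tasks such as multimodal machine translation, image captioning, and visual question answering (Visual Question Answering)\upcite{DBLP:conf/iccv/AntolALMBZP15} have attracted wide attention in the artificial intelligence community. How to fully fuse information from multiple modalities is a central question for these tasks. Since the Transformer\upcite{vaswani2017attention} was proposed in natural language processing, it has been applied to computer vision\upcite{DBLP:conf/eccv/CarionMSUKZ20} and to multimodal tasks\upcite{DBLP:conf/acl/YaoW20,DBLP:journals/tcsv/YuLYH20,Huasong2020SelfAdaptiveNM}, with significant improvements. Data scarcity is another limitation of multimodal tasks, and it can be alleviated through data augmentation\upcite{DBLP:conf/emnlp/GokhaleBBY20,DBLP:conf/eccv/Tang0ZWY20}. A question still has to be answered, however: when the model is not sufficiently trained, how much do modalities such as images actually contribute to translation? A similar issue exists in document-level machine translation, where the context model contributes very little when the training data is small\upcite{DBLP:conf/acl/LiLWJXZLL20}. It is therefore necessary to investigate how contextual information such as images can be used more effectively. In addition, inspired by pre-trained models, joint image-text pre-training\upcite{DBLP:conf/eccv/Li0LZHZWH0WCG20,DBLP:conf/aaai/ZhouPZHCG20,DBLP:conf/iclr/SuZCLLWD20} has also been explored in the multimodal field; these approaches use the Transformer framework and self-attention to capture latent alignments between images and text, which improves model performance and alleviates data scarcity.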




--- a/Chapter18/Figures/figure-translation-interfered.tex
+++ b/Chapter18/Figures/figure-translation-interfered.tex
@@ -2,48 +2,30 @@
 %%% outline
 %-------------------------------------------------------------------------
 \begin{tikzpicture}[scale=0.8]
-\tikzstyle{every node}=[scale=0.8]
-\tikzstyle{node}=[rounded corners=4pt, draw,minimum width=3em, minimum height=2em, drop shadow={shadow xshift=0.14em, shadow yshift=-0.14em}]
-
+\tikzstyle{diction}=[align=center,rounded corners=2pt, draw,drop shadow,fill=green!20,font=\scriptsize]
+\tikzstyle{word}=[align=center,anchor=west]
 \begin{scope}

-%\draw[fill=yellow!20]  (-5em, 0) -- (-6em, 1em) -- (5em, 1em) -- (6em, 0em) -- (5em, -1em) -- (-6em, -1em) -- (-5em, 0em);
-%\draw[fill=yellow!20]  (-5em, 10em) -- (-6em, 11.2em) -- (5em, 11.2em) -- (6em, 10em) -- (5em,8.8em) -- (-6em, 8.8em) -- (-5em, 10em);
-\node[] (n1) at (0,0){小牛翻译的总部在哪里？};
-
-\node[node,fill=blue!20] (c1) at (0, 5em){\scriptsize\bfnew{机器翻译}};
-\node[align=left] (n2) at (0,10em){Where is the headquarters \\ of {\color{red} Mavericks Translation}?};
-
-\node [draw,single arrow,inner ysep=0.3em, minimum height=2.4em, rotate=90,fill=cyan!40,very thin] (arrow1) at (0, 2.4em) {};
-\node [draw,single arrow,inner ysep=0.3em, minimum height=2em, rotate=90,fill=cyan!40,very thin] (arrow1) at (0, 7.2em) {};
-
-\node[font=\Large,text=red] at (0, -2em){\ding{56}};
-\end{scope}
-\begin{scope}[xshift=14em]
-%\draw[fill=yellow!20]  (-5em, 0) -- (-6em, 1em) -- (5em, 1em) -- (6em, 0em) -- (5em, -1em) -- (-6em, -1em) -- (-5em, 0em);
-%\draw[fill=yellow!20]  (-5em, 10em) -- (-6em, 11.2em) -- (5em, 11.2em) -- (6em, 10em) -- (5em,8.8em) -- (-6em, 8.8em) -- (-5em, 10em);
-\node[] (n3) at (0,0){小牛翻译的总部在哪里？};
-
-\node[node,fill=blue!20] (c2) at (-3em, 5em){\scriptsize\bfnew{机器翻译}};
-
-\node[node,fill=red!20] (c3) at (3em, 5em){\scriptsize\bfnew{术语词典}};
-
-\node[font=\scriptsize,draw,inner sep=3pt,fill=red!20,minimum height=1em] (w1) at (9em, 6.5em){小牛翻译};
+\node[word] (origin) at (0,0) {源文};
+\node[word] (n1) at ([xshift=1em]origin.east){{\color{red} 小牛翻译}的总部在哪里？}; 

-\node[font=\scriptsize,draw,inner sep=3pt,fill=red!20,minimum height=1em] (w2) at (9em, 3.5em){NiuTrans};
-\node[font=\Large] (add) at (0em, 5em){+};
-\node[align=left] (n4) at (0,10em){Where is the headquarters \\ of {\color{red} NiuTrans}?};
+\node[word] (right) at ([yshift=-6em]origin.south west){译文};
+\node[word] (n3) at ([xshift=1em]right.east){Where is the headquarters \\ of {\color{red} NiuTrans}?}; 

-\node [draw,single arrow,inner ysep=0.3em, minimum height=2.4em, rotate=90,fill=cyan!40,very thin] (arrow1) at (0, 2.4em) {};
-\node [draw,single arrow,inner ysep=0.3em, minimum height=2em, rotate=90,fill=cyan!40,very thin] (arrow1) at (0, 7.2em) {};
+%\node[diction] (dic) at ([xshift=2em,yshift=-1.8em]n1.south east) {
+%术语词典 \\
+%小牛翻译 = NiuTrans \\
+%......
+%};
+%\draw[->,red] ([yshift=-0.2em]dic.west)  .. controls +(west:2em) and +(south:2em) .. ( [xshift=-4em]n1.south) node[above,midway,font=\scriptsize]{};
+\node[font=\scriptsize] at ([yshift=-2.3em,xshift=-6em]n1.south) {“小牛翻译”=“NiuTrans”};
+\draw[->,very thick] ([yshift=-0.2em,xshift=-0.4em]n1.south) -- ([yshift=0.2em]n3.north);
+%\draw[->,very thick] ([yshift=-0.2em,xshift=-0.2em]n2.south) -- ([yshift=0.2em]n3.north);

-\draw[dash pattern=on 1pt off 0.5pt,black,line width=1.2pt,->, out=180, in=45] ([xshift=-0.2em]w1.180) to ([xshift=0.2em]c3.20);
-\draw[dash pattern=on 1pt off 0.5pt,black,line width=1.2pt,->,out=180,in=-45] ([xshift=-0.2em]w2.180) to ([xshift=0.2em]c3.-20);
-
-\node[font=\Large,text=ugreen] at (0, -2em){\ding{52}};
 \end{scope}
 \end{tikzpicture}




+
--- a/Chapter18/chapter18.tex
+++ b/Chapter18/chapter18.tex
@@ -151,7 +151,7 @@
 \centering
 \input{./Chapter18/Figures/figure-translation-interfered}
 %\setlength{\abovecaptionskip}{-0.2cm}
-\caption{翻译结果可干预性（{\color{red} 这个图需要修改！有些乱，等回沈阳找我讨论！}）}
+\caption{翻译结果可干预性}
 \label{fig:18-3}
 \end{figure}
 %----------------------------------------------
@@ -249,10 +249,10 @@
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

-\section{机器翻译的应用场景}
+%\section{机器翻译的应用场景}

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

-\section{拓展思考}
+%\section{拓展思考}
--- a/Chapter19/Figures/fig-cover.jpg
+++ b/Chapter19/Figures/fig-cover.jpg
--- a/Chapter19/Figures/figure-niutrans.jpg
+++ b/Chapter19/Figures/figure-niutrans.jpg
--- a/Chapter19/chapter19.tex
+++ b/Chapter19/chapter19.tex
+% !Mode:: "TeX:UTF-8"
+% !TEX encoding = UTF-8 Unicode
+
+%----------------------------------------------------------------------------------------
+% 机器翻译：统计建模与深度学习方法
+% Machine Translation: Statistical Modeling and Deep Learning Methods
+%
+% Copyright 2020
+% 肖桐(xiaotong@mail.neu.edu.cn) 朱靖波 (zhujingbo@mail.neu.edu.cn)
+%----------------------------------------------------------------------------------------
+
+%----------------------------------------------------------------------------------------
+%    CONFIGURATIONS
+%----------------------------------------------------------------------------------------
+
+\renewcommand\figurename{图}% rename "Figure" to 图
+\renewcommand\tablename{表}% rename "Table" to 表
+\chapterimage{fig-NEU-8.jpg} % Chapter heading image
+
+%----------------------------------------------------------------------------------------
+%	CHAPTER 19
+%----------------------------------------------------------------------------------------
+
+\chapter{Miscellaneous Thoughts on Machine Translation}
+
+\parinterval Notes by Zhu Jingbo, December 10--16, 2020\\
+
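+\parinterval Ever since the computer was invented, machine translation, that is, using computer software to translate automatically between languages, has been one of the first major applications people thought of. Many say that we now live in the age of artificial intelligence, where whoever masters language wins the world, so machine translation is also one of the ultimate dreams of cognitive intelligence. This chapter shares some of our thoughts on machine translation technology and applications. Some of these ideas may not be correct, and it may take a decade before they can be verified.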
+
+\vspace{0.5em}
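+\parinterval Put simply, applications of machine translation can meet at least three user needs. The first is assisted reading of foreign-language material and barrier-free communication between people with different native languages. The second is computer-aided translation, which helps human translators cut costs and work more efficiently. The third is processing multilingual text (and possibly image and speech data) in big-data analysis applications; translating massive volumes of data is beyond human translators, and machine translation is the only effective solution for translation at this scale. These three needs show that machine translation and human translation are not fundamentally in conflict: they run on parallel tracks, and can coexist and support each other. There are at least two scenarios that machine translation cannot handle on its own. The first is where high-quality translation is required, such as the translation of poetry and fiction for publication. The second is cases such as speeches by national leaders, where even elementary translation errors are unacceptable because they could have serious consequences or even cause international disputes. Strictly speaking, scenarios that demand very high accuracy cannot simply rely on machine translation; skilled human translators must be involved.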
+\vspace{0.5em}
+
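+\parinterval Machine translation has gone through three main stages: rule-based methods, statistical machine translation, and neural machine translation. Rule-based methods are familiar to most readers: experts hand-write transfer rules that convert a source-language sentence into a target-language sentence. The main bottleneck is that writing rules by hand is very costly; it is hard to cover everything, and as the rule set grows the rules tend to conflict, producing a seesaw effect in which fixing one case breaks another. To address this cost, the two later stages rely on machine learning: a fairly large set of bilingual sentence pairs is prepared as training data, and the translation system is learned from it. In principle this requires little or no human intervention, and systems can be built cheaply and quickly. The main bottleneck is that a large-scale collection of bilingual sentence pairs has to be gathered in advance, which is difficult for many language pairs, especially those involving low-resource languages.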
+\vspace{0.5em}
+
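+\parinterval How can a good machine translation system be built? Suppose we need to deliver a system with good translation quality. At least three things have to be considered: a sufficiently large set of bilingual sentence pairs for training, strong machine translation technology, and an error-driven refinement process. The first two are easy to understand. The third is equally critical: translation errors are collected and their causes analyzed, for example whether they stem from the data or from the technology, a fix is found, and the system is iterated so that quality keeps improving. From the standpoint of applications and commercialization, proposing a new machine translation technique is only a necessary condition for building a good system, not a sufficient one; all three elements are indispensable.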
+\vspace{0.5em}
+
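+\parinterval There are reportedly at least five to six thousand languages in the world, and at least two to three thousand of them are available in digital form. The mainstream languages we are really familiar with are few, and specialists in many languages are scarce in China as well. Machine translation has become an effective way to ease this problem, since training human translators for low-resource languages is extremely expensive. The mainstream technology today is neural machine translation, which is based on deep learning, and its quality depends on the scale of the bilingual training data. Only mainstream languages such as English and Chinese allow large bilingual collections to be gathered. Most commercial English-Chinese systems today are trained on several hundred million sentence pairs, but this is out of reach for more than 99\% of language pairs. For most language pairs the available bilingual data is very small: over a million sentence pairs counts as a lot, many have only tens of thousands, and some have none at all, with at most a small bilingual dictionary.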
+\vspace{0.5em}
+
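+\parinterval Machine translation for low-resource languages has therefore become a hot research topic in academia, and a breakthrough here would greatly advance the practical deployment of machine translation. Machine translation meets a real and pressing need; in many big-data translation scenarios it is the only effective solution, since the work is beyond what human translation can do. Before 2017 the machine translation market remained small, mainly because translation quality was not good enough. Even with state-of-the-art neural machine translation, without a sufficiently large set of bilingual sentence pairs as training data, one cannot cook a meal without rice. From the standpoint of both research and practical feasibility, solving low-resource machine translation is very valuable. The problem can usually be approached along two dimensions: first, how to obtain more bilingual sentence pairs, including lower-quality pseudo-bilingual data; second, how to learn efficiently from fewer samples, or how to make full use of abundant monolingual or comparable data to improve learning.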
+\vspace{0.5em}
+
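+\parinterval Many researchers have proposed using knowledge graphs to improve machine translation, hoping to address low-resource translation in this way. Other work introduces linguistic analysis, and combining different translation technologies is another option, for example fusing rule-based methods, statistical machine translation, and neural machine translation so that they complement one another. Pre-training can also be used to improve quality, particularly for low-resource languages. The list does not end here. All of these ideas have research value, but building a practical system requires more attention to feasibility in deployment, for example the cost of building a large-scale knowledge graph and the accuracy of linguistic analysis. Experimental results show that when large-scale bilingual data is available for training, pre-training offers limited help to machine translation; and when the bilingual training data is small, neural machine translation may perform worse than statistical machine translation, which suggests the two are complementary to some degree. Research can aim at breakthroughs on a single point, but building a practical system requires complementary technologies to be combined in order to solve real problems and improve quality.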
+\vspace{0.5em}
+
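+\parinterval Rule-based methods, statistical machine translation, and neural machine translation are usually called the first, second, and third generations of machine translation technology, so it is natural to ask how the fourth generation will develop. Some say it will be knowledge-based machine translation; others say unsupervised machine translation or some new paradigm. There is certainly no answer yet. Before discussing it, can we first answer another question: should a new generation of technology deliver better translation quality than the current one? If so, consider that the second generation decisively outperformed the first, and the third outperformed the second beyond dispute. Now take commercial English-Chinese and Chinese-English systems as an example: after training on several hundred million sentence pairs, their quality under human evaluation can reach 80--90\%, and in three to five years it may be better still, with accuracy on news translation perhaps exceeding 90\%. The simple question is then what accuracy a so-called fourth generation would aim for on news translation. A figure of 92\% or 93\% would hardly demonstrate a decisive advantage for a new generation. This leads to a conjecture: might there be no fourth generation of machine translation technology at all?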
+\vspace{0.5em}
+
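+\parinterval Readers may say that this angle of argument is wrong, and I accept that. From a historical perspective, a new generation of technology is bound to exist; in other words, fourth-generation machine translation will certainly appear, and we just do not know when. Let us look at the question from another angle. The benefits of neural machine translation have not been exhausted and there is still plenty of room for development. For the foreseeable future it will probably remain the mainstream technology, though with many variants. When bilingual training data is plentiful, I do doubt that anything will decisively beat neural machine translation. One direction does make sense, however. As noted above, more than 99\% of language pairs are low-resource, and current neural machine translation performs poorly on them, sometimes so poorly as to be nearly unusable. From this point of view, unsupervised machine translation and learning from fewer samples deserve attention, and new machine translation technology may emerge from them. We would like to call this new generation scenario-oriented fourth-generation machine translation: in essence, more capable technology proposed for different application conditions and scenarios. It would not be a single technique but a collection of techniques, and that is entirely possible.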
+\vspace{0.5em}
+
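+\parinterval In recent years neural machine translation has greatly improved translation quality and driven the rapid commercialization of machine translation. As with other applications of deep learning, the lack of interpretability has become a point of criticism. Some researchers disagree, arguing that neural machine translation is quite interpretable because every step of the computation is clear. This comes down to how interpretability is defined, and interpretable deep learning is itself a hot research topic. Consider a simple example. When a judge finds a defendant guilty in court, the ruling does not simply say guilty or not guilty; it also states which laws and regulations it is based on. Those grounds are the explanation of the verdict. If deep learning were used and produced only a verdict of guilty or not guilty, with no explanation and no supporting details, the defendant would surely not accept it.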
+\vspace{0.5em}
+
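+\parinterval This example raises a question: is the interpretability we need about the computation by which a conclusion is derived, or about whether the conclusion is justified in a way that convinces people? The two understandings may differ: the former concerns the reasoning process (how), the latter the understandability of the conclusion (why). So which is the goal of interpretability research for neural machine translation? There is some related work in academia, such as visualizing the soft alignments produced by the attention mechanism in neural machine translation models. One thing is certain: the purpose of studying interpretability is ``error correction'', and it can also support mechanisms for human intervention. Only by using interpretability research to understand why translation errors occur, and then finding ways to correct them, do we achieve what this research is for.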
+\vspace{0.5em}
+
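+\parinterval Besides translation quality, applications of machine translation can be discussed along three dimensions: languages, domains, and application modes. Machine translation should serve users worldwide, supporting at least one official language of every country and automatic translation between any two languages; the more languages the better. The biggest obstacle is the scarcity of bilingual data, discussed above. As for domains, a general-domain system is not adequate for vertical-domain applications. The most typical problem is the translation of domain terminology: a computer cannot create something from nothing, and although it occasionally gets lucky, it cannot adequately handle out-of-vocabulary (OOV) domain terms. There are at least two direct and feasible solutions. One is to introduce a bilingual domain terminology dictionary; the other is to collect and process a certain amount of in-domain bilingual sentence pairs to tune the translation model. These two engineering methods are simple but effective, and they work best in combination. The problem is that collecting in-domain bilingual sentence pairs is often too costly to be feasible, which turns the issue into one of low-resource domains and domain adaptation. Machine learning techniques such as few-shot learning, transfer learning, and joint learning can also be brought in to help.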
+
+\vspace{0.5em}
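+\parinterval The application-mode dimension reflects the rich variety of machine translation applications and services, and it can be broken down into specific scenarios. We do not list them all here; some are discussed later. The focus here is on the software and hardware environment. The typical application is an online translation service on a public cloud: access is very simple, and users only need an internet connection and a browser to use it freely at no cost. In some industries, however, users have very high requirements for data security and confidentiality, and may also need customization. Public-cloud services cannot meet these needs, so locally deployed private-cloud machine translation and offline machine translation have emerged as new application modes. The drawback of a local private cloud is that users must buy their own GPU servers and build a machine room, so the hardware investment is considerable. A new mode may appear in the future: an online private cloud or dedicated cloud, somewhat like hosted services. The last type is the hybrid cloud, which is simply a mixture of public, private, and dedicated clouds.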
+\vspace{0.5em}
+
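+\parinterval Offline machine translation can serve smaller intelligent translation devices, such as the familiar handheld translators, translation pens, and translation earphones, providing high-quality translation without a network connection. This amounts to installing the machine translation system on the device itself, and the mode has great potential. Many problems have to be solved, however. First come the three issues of model size, translation speed, and translation quality. Then comes adaptation to and compatibility with different operating systems (Linux, Android Q, and iOS) and different CPU architectures, such as x86, MIPS, and ARM. Demand for solutions built on domestically developed platforms is also rising, so machine translation needs to be compatible with domestic operating systems and chips. In the future, offline translation systems could also be installed on office equipment such as fax machines, printers, and copiers to support multilingual intelligent office work. Artificial intelligence chips are developing very quickly. Machine translation is similar to speech processing in this respect, and speech chips are already widely used on the market. What holds back machine translation chips is probably not technology; the biggest problem is likely the lack of application scenarios and of upstream and downstream support. Once the time is ripe, the development and use of machine translation chips may take off.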
+\vspace{0.5em}
+
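+\parinterval Machine translation can be combined with document parsing, speech recognition, OCR, and video subtitle extraction. We call this multimodal machine translation, and it greatly enriches the application modes. Document parsing enables automatic translation of more document formats, such as Word, PDF, and WPS documents and email, and it can be embedded as a plug-in in office platforms as an intelligent office assistant. Speech recognition and machine translation are a perfect match. Speech is the most natural way for people to communicate, so speech translation has many uses, such as handheld translators and speech translation apps, as well as the much-anticipated AI simultaneous interpretation for conferences. With it, attendees at international conferences can follow talks given by researchers with different native languages; it could also become standard equipment in meeting rooms, helping participants with different native languages communicate freely. The biggest problems at present are twofold. First, in many real scenarios speech recognition results are poor, and the errors propagate, so the translation is unsatisfactory. Second, even when speech recognition for a low-resource language works well, translation quality for that language is not good enough. OCR supports scanning pens and translation pens and photo translation when travelling abroad, and in the future it could be combined with wearable devices such as smart glasses. Video subtitle translation lets us enjoy foreign films and television programs without Chinese subtitles; being able to turn on the television in any country and see Chinese subtitles would be a very cool application.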
+\vspace{0.5em}
+
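+\parinterval Most current multimodal machine translation frameworks, however, use a serial pipeline that simply chains two or more technologies together. Speech translation, for example, has two steps, speech recognition and machine translation, and a speech synthesis step can be added. Other multimodal settings are much the same. The biggest problem with this simple serial pipeline is error propagation: if one stage is not accurate enough, the final result will not be good, for example 90\%$\times$90\%=81\%. Later stages cannot necessarily correct errors introduced by earlier ones, and the user experience suffers. Many people say the experience of English-Chinese AI simultaneous interpretation at conferences is poor and naturally assume that machine translation is to blame, when in fact the problem currently lies mainly in speech recognition. Academia has begun to study end-to-end multimodal machine translation, which replaces the serial pipeline with a single-step approach. In theory this can ease error propagation, but results so far are not yet satisfactory, and new breakthroughs are awaited.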
+\vspace{0.5em}
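+
+\parinterval The figure 90\%$\times$90\%=81\% above generalizes to longer pipelines. As a rough sketch, assume (a simplification introduced here, not a claim made in the text) that a pipeline has $K$ stages, that stage $k$ is correct with probability $a_k$, that errors in different stages are independent, and that no later stage repairs an earlier error. The symbols $K$ and $a_k$ are used only for illustration. The end-to-end accuracy is then approximately:
+\begin{eqnarray}
+a_{\textrm{pipeline}} & \approx & \prod_{k=1}^{K} a_k \nonumber \\
+0.9 \times 0.9 & = & 0.81 \qquad (K=2) \nonumber \\
+0.9 \times 0.9 \times 0.9 & = & 0.729 \qquad (K=3) \nonumber
+\end{eqnarray}
+
+\noindent Under these assumptions, adding a speech synthesis stage of the same accuracy to recognition and translation would lower the overall figure from 81\% to about 73\%, which illustrates why each extra stage in a serial pipeline is costly.
+\vspace{0.5em}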
+
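+\parinterval Even with very large bilingual training sets and continually improving technology, machine translation output cannot be perfect, and some errors are inevitable. The common way to use machine translation to assist human translation is post-editing, that is, manually correcting the errors in the machine output. Two practical questions arise naturally. The first is whether the machine output is worth editing. A simple measure is edit distance, the number of insertions, deletions, and substitutions a human needs to complete the post-editing. The fewer the operations, the more the machine translation helps the human translator. Edit distance is essentially a way of evaluating translation quality, and it suggests recommending to human translators those outputs with high post-editing value. The second question is whether, after the machine makes an error and a human corrects it, an effective error feedback mechanism can help the system improve. Many researchers are studying this, but no widely satisfying results have been obtained yet. There are other issues as well, such as the user experience of human-computer interaction. This need has naturally given rise to research on interactive machine translation, which aims to maximize the benefit of human-machine collaboration and is also a topic worth studying.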
+\vspace{0.5em}
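+
+\parinterval The edit distance mentioned above can be made concrete with the standard Levenshtein recurrence. This is a sketch under assumptions: the machine output $a_1 \ldots a_p$ and the post-edited translation $b_1 \ldots b_q$ are treated as word sequences, every insertion, deletion, and substitution costs 1, and the symbols $d$, $a$, $b$, $p$, and $q$ are introduced here only for illustration. With $d(i,0)=i$ and $d(0,j)=j$, the distance $d(p,q)$ is obtained from:
+\begin{eqnarray}
+d(i,j) & = & \min \left\{ \begin{array}{l} d(i-1,j)+1 \\ d(i,j-1)+1 \\ d(i-1,j-1)+\delta(a_i \neq b_j) \end{array} \right. \nonumber
+\end{eqnarray}
+
+\noindent where $\delta(a_i \neq b_j)$ is 1 if the two words differ and 0 otherwise. For example, turning the output ``a b c d'' into the post-edited ``a x c'' takes one substitution (b to x) and one deletion (d), so the edit distance is 2. In practice the count is often normalized by the length of the post-edited translation so that sentences of different lengths can be compared.
+\vspace{0.5em}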
+
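+\parinterval Traditional evaluation of machine translation output is divided into automatic and human evaluation. The most widely used automatic metric is BLEU, which is applied extensively in system development and tuning and in machine translation evaluation campaigns. Human evaluation needs little explanation. In theory the two are positively correlated, but experiments show the correlation is not perfect. In other words, a higher BLEU score should in theory indicate a better system, but if the BLEU scores of two systems differ only slightly (for example, by $<$0.5), human evaluation may find no difference, and the system with the higher score may even be worse. Researchers have since done a great deal of work on automatic evaluation methods, and the evaluation of evaluation methods has itself become a research direction. More effective automatic evaluation of machine translation output is very valuable, because training and tuning in machine learning depend entirely on the evaluation metric. The metric is, so to speak, the conductor's baton that directly shapes the development of the field, so evaluation methods deserve more attention in the future.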
+\vspace{0.5em}
+
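+\parinterval Besides the automatic/human distinction, evaluation methods can also be classified as reference-based or reference-free. Reference-based automatic evaluation is relatively simple. Taking BLEU as an example, a test set of several hundred or even more than a thousand sentences is built by hand in advance, usually with multiple (for example, four) different correct translations for each source sentence. The similarity between the machine output for each source sentence and the given reference translations is then computed; the greater the similarity, the closer the output is to a correct translation. If providing multiple correct translations is difficult, a test set with a single reference per sentence is sometimes built quickly instead. Multiple references make automatic evaluation more reliable, because a source sentence in principle has many different correct translations.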
+\vspace{0.5em}
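+
+\parinterval The role of multiple references can be seen in the usual definition of BLEU. What follows is the standard formulation rather than anything specific to this chapter, and the symbols $p_n$, $w_n$, $N$, $c$, $r$, and $\textrm{count}$ are introduced here for illustration:
+\begin{eqnarray}
+p_n & = & \frac{\sum_{g} \min \big( \textrm{count}_{\textrm{hyp}}(g), \max_{k} \textrm{count}_{\textrm{ref}_k}(g) \big)}{\sum_{g} \textrm{count}_{\textrm{hyp}}(g)} \nonumber \\
+\textrm{BLEU} & = & \textrm{BP} \cdot \exp \Big( \sum_{n=1}^{N} w_n \log p_n \Big) \nonumber
+\end{eqnarray}
+
+\noindent Here $g$ ranges over the $n$-grams of the machine output, $k$ over the reference translations, $w_n$ is usually $1/N$ with $N=4$, and the brevity penalty $\textrm{BP}$ is 1 when the output length $c$ exceeds the effective reference length $r$ and $e^{1-r/c}$ otherwise. Because the clipping takes the maximum over references, an $n$-gram is credited if it matches any one of them, so adding references can only keep or raise $p_n$.
+\vspace{0.5em}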
+
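+\parinterval In research experiments the test set can be prepared in advance. In many real applications, however, such as post-editing, we want the system to provide a quality score for the output of every input sentence. A higher score indicates a more correct translation with higher post-editing value, and the system can automatically recommend high-quality outputs to human translators for post-editing. In this setting it is impossible to build a test set with multiple references in advance, so what is needed is quality estimation without reference translations. This technology is very interesting and has many uses: besides recommending high-quality output as described above, it could be used to check data quality and even to improve the machine translation system itself. Academia has done a fair amount of related work, but it is still far from practical use. How to exploit decoding knowledge and external linguistic knowledge to improve reference-free quality estimation is a direction worth investigating in depth.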
+\vspace{0.5em}
+
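+\parinterval Returning to the second question raised above: a long-standing complaint about machine translation is that users do not know how to intervene effectively and correct errors so that the system keeps improving, and we do not want the system to repeat the same mistakes after being corrected. Rule-based and statistical machine translation make human intervention relatively easy, with a rich set of means available, whereas neural machine translation lacks interpretability, which makes effective intervention difficult. Some researchers have studied introducing external knowledge bases (user-provided bilingual terminology) to intervene in and correct the translation of OOV words. Others have proposed incremental training to iteratively improve the model, with some progress. Still others combine different technologies, for example adding rule-based pre-processing and post-processing, or using statistical machine translation to improve the choice among candidate translations. These methods are costly, sometimes very costly, and the gains are not guaranteed; they can even lower translation quality, somewhat like a seesaw. Overall, results in this direction are still limited, yet it matters greatly to users. If implicit feedback learning could continuously improve translation quality without users noticing, that would be very cool, and it may become a research focus in the future.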
+\vspace{0.5em}
+
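+\parinterval For tasks that human translation cannot complete, such as large-scale data translation, machine translation is certainly the only effective choice. To better assist human translators, interactive machine translation is very valuable, but it has to solve a practical problem of user experience, which comes down to the mode of human-computer interaction. In the traditional mode, the machine respects the result of human intervention: once the human has fixed some fragments of the translation, the system guarantees that those fragments appear in the final output. As a simple example, when translating from left to right, the human specifies the first target word and the system outputs a ``best'' translation that begins with that word. This mode has two problems. First, treating the human input as a hard constraint on decoding may hurt the generation of the translation. Second, this form of interaction changes the habits of human translators, so the user experience may be poor. Exploring richer modes of interaction that improve the user experience while exploiting the strengths of machine translation is a topic worth studying in depth. In essence, interactive machine translation embodies the idea of human intervention and error correction; the difference is that the correction may apply only to the current sentence rather than to the system as a whole. If the latter could be achieved, by combining human-computer interaction with error-driven feedback learning, it would have very high practical value.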
+\vspace{0.5em}
+
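+\parinterval Traditionally, machine translation systems are trained separately for each language pair, such as Chinese-English and Chinese-Japanese. The countries along the Belt and Road alone have nearly a hundred official languages, and the member states of the United Nations together have at least two to three hundred. Translating between the official languages of any two countries would involve at least tens of thousands of language pairs. Adding more than a thousand non-official, low-resource languages makes the number of combinations explode to several million language pairs, and the cost of training a separate system for each is hard to imagine. As noted above, perhaps more than 99\% of language pairs are low-resource, and not enough bilingual sentence pairs can be collected to train an effective model. Academia has done a great deal of work to ease this problem. We believe that low-resource translation and multilingual translation, though different problems, can be considered together; the basic idea is that languages from the same or similar families share translation knowledge. Put simply, can a single powerful general translation model be trained that supports not just one language pair but translation among many languages at once? The benefits are obvious: it would greatly reduce both the cost of training and the hardware and maintenance costs of deployment. Once a breakthrough is made in shared multilingual translation models, translation for low-resource languages will be greatly eased, which has both theoretical and practical value.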
+\vspace{0.5em}
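+
+\parinterval The counts above follow from simple arithmetic. Assuming, for illustration, that there are $n$ languages and that each translation direction needs its own system (the symbol $n$ and the specific values below are chosen here, not taken from a survey), the number of directed language pairs is:
+\begin{eqnarray}
+n(n-1) & = & 200 \times 199 = 39\,800 \qquad (n=200) \nonumber \\
+n(n-1) & = & 2000 \times 1999 = 3\,998\,000 \qquad (n=2000) \nonumber
+\end{eqnarray}
+
+\noindent These values are consistent with the orders of magnitude given in the text, tens of thousands of pairs for official languages and several million once low-resource languages are included. A single shared multilingual model replaces this quadratic number of systems with one, which is where the savings in training and deployment would come from.
+\vspace{0.5em}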
+
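+\parinterval Neural machine translation is now widely used in industry. Take the Transformer as an example. In many areas, such as image and speech applications, it has been shown that deeper networks with more layers improve representation learning and system performance. We have reached similar conclusions in machine translation, and how to use more layers to improve translation modeling is a direction worth exploring. Deeper networks also bring difficulties, such as training cost and training effectiveness. That training cost is proportional to network depth is easy to understand. The key issue is training effectiveness, for example whether training converges quickly to the expected result within limited time. Once a standard Transformer is extended beyond about ten layers, training tends to run into problems. Our team proposed the SDT training method for this purpose; it effectively eases the problem and makes it possible to train Transformer models with more than 40 layers, with the aim of improving translation quality.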
+\vspace{0.5em}
+
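+\parinterval The neural network structure used to build a machine translation system is usually fixed by hand in advance, including the number of layers (depth) and the width of each layer. Whether such a predefined structure is optimal for the task at hand is an open question that academia cannot yet answer well. Common sense suggests that relying too heavily on expert experience to design network structures is not the best approach. This has led to research on neural architecture search, that is, automatically optimizing the deep network structure based on the training data in order to obtain the best learning result, which is a very interesting direction. Current neural machine translation relies mainly on the two-part encoder-decoder framework, which separates encoding from decoding, much as analysis and generation were traditionally separated, although the two stages depend on each other. The advantage is a simple architecture, but representation learning may be insufficient and errors may propagate. To address this, our team made an interesting attempt and proposed a new neural machine translation framework\ \dash \ Reformer, an attention model based on joint distributions. It does not rely on the traditional encoder-decoder framework and instead completes translation within a single unified framework. This work is still at an early stage and needs further study.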
+\vspace{0.5em}
+
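+\parinterval Finally, a brief assessment of trends in the machine translation market. Machine translation meets a strong and real need: removing the barriers to multilingual communication for users worldwide. The industry's real take-off should be credited to neural machine translation. Rule-based methods and statistical machine translation were also used in industry, but because translation quality fell short of user expectations, willingness to pay was low and there was little ability to monetize, so before 2017 machine translation resembled a ``chicken rib'' industry, of little value yet hard to abandon. Strictly speaking, from the second half of 2016, industrial use of neural machine translation quickly activated user demand. Acceptance of machine translation rose sharply, and an increasingly rich set of application modes and needs was uncovered. Beyond traditional computer-aided translation (CAT), the combination of speech and OCR with machine translation has brought familiar products such as speech translation apps, handheld translators, translation pens, and AI simultaneous interpretation for conferences, as well as machine translation solutions for vertical industries (patents, medicine, tourism, and so on), into wide use. Overall, machine translation in industry, academia, and research is in a period of rapid growth, with the market growing by at least 100\% per year. As multimodal machine translation and big-data translation are applied, application scenarios will become richer, and with the development of 5G and even 6G, applications such as video translation and telephone translation will expand further. In addition, with progress in artificial intelligence chips, machine translation chips will naturally come into use, embedded in smart terminals such as mobile phones, printers, copiers, fax machines, and televisions, toward the goal that all content can be translated and translation can run in any scenario. Machine translation services will enter people's daily lives everywhere and make life better!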
+\vspace{0.5em}
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\includegraphics[scale=0.4]{./Chapter19/Figures/figure-niutrans.jpg}
+%\setlength{\abovecaptionskip}{-0.2cm}
+%\caption{使用TranSmart系统进行交互式翻译的实例}
+\label{fig:19-1}
+\end{figure}
+%----------------------------------------------
--- a/Chapter2/chapter2.tex
+++ b/Chapter2/chapter2.tex
@@ -273,7 +273,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-15}
 \end{eqnarray}

-\parinterval 相对熵的意义在于：在一个事件空间里，概率分布$\funp{P}(x)$对应的每个事件的可能性。若用概率分布$\funp{Q}(x)$编码$\funp{P}(x)$，平均每个事件的信息量增加了多少。它衡量的是同一个事件空间里两个概率分布的差异。KL距离有两条重要的性质：
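+\parinterval Here, $\funp{P}(x)$ is the probability that the distribution $\funp{P}$ assigns to each event $x$. The meaning of relative entropy is as follows: within one event space, it measures how much the amount of information increases when the distribution $\funp{Q}(x)$ is used to encode $\funp{P}(x)$, compared with using $\funp{P}(x)$ itself to encode $\funp{P}(x)$. It thus measures the difference between two probability distributions over the same event space. The KL distance has two important properties: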

 \begin{itemize}
 \vspace{0.5em}
@@ -474,10 +474,12 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-23}
 \end{eqnarray}

-\parinterval 这样，整个序列$w_1 w_2 \ldots w_m$的生成概率可以被重新定义为：
+\parinterval 如表\ref{tab:2-2}所示，整个序列$w_1 w_2 \ldots w_m$的生成概率可以被重新定义为：

 %------------------------------------------------------
+\begin{table}[htp]{
 \begin{center}
+\caption{基于$n$-gram的序列生成概率}
 {\footnotesize
 \begin{tabular}{l|l|l |l|l}
 链式法则 & 1-gram & 2-gram & $ \ldots $ & $n$-gram\\
@@ -491,7 +493,10 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \rule{0pt}{10pt} $\funp{P}(w_m|w_1  \ldots  w_{m-1})$ & $\funp{P}(w_m)$ & $\funp{P}(w_m|w_{m-1})$ & $ \ldots $ & $\funp{P}(w_m|w_{m-n+1}  \ldots  w_{m-1})$
 \end{tabular}
 }
+\label{tab:2-2}
 \end{center}
+}
+\end{table}
 %------------------------------------------------------

 \parinterval 可以看到，1-gram语言模型只是$n$-gram语言模型的一种特殊形式。基于独立性假设，1-gram假定当前单词出现与否与任何历史都无关，这种方法大大化简了求解句子概率的复杂度。比如，上一节中公式\eqref{eq:seq-independ}就是一个1-gram语言模型。但是，句子中的单词并非完全相互独立的，这种独立性假设并不能完美地描述客观世界的问题。如果需要更精确地获取句子的概率，就需要使用更长的“历史”信息，比如，2-gram、3-gram、甚至更高阶的语言模型。
@@ -565,7 +570,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x

 \subsubsection{1. 加法平滑方法}

-\parinterval {\small\bfnew{加法平滑}}\index{加法平滑}（Additive Smoothing）\index{Additive Smoothing}是一种简单的平滑技术。本小节首先介绍这一方法，希望通过它了解平滑算法的思想。通常情况下，系统研发者会利用采集到的语料库来模拟真实的全部语料库。当然，没有一个语料库能覆盖所有的语言现象。假设有一个语料库$C$，其中从未出现“确实\ 现在”这样的2-gram，现在要计算一个句子$S$ =“确实/现在/物价/很/高”的概率。当计算“确实/现在”的概率时，$\funp{P}(S) = 0$，导致整个句子的概率为0。
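+\parinterval {\small\bfnew{Additive smoothing}}\index{加法平滑}\index{Additive Smoothing} is a simple smoothing technique. In general, system developers use a collected corpus as a stand-in for the full, real body of language data. Of course, no corpus can cover every linguistic phenomenon. Suppose a corpus $C$ never contains the 2-gram “确实\ 现在”, and we want to compute the probability of the sentence $S$ =“确实/现在/物价/很/高”. The estimated probability of the 2-gram “确实/现在” is then 0, which makes the probability of the whole sentence 0 as well, that is, $\funp{P}(S) = 0$.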

 \parinterval 加法平滑方法假设每个$n$-gram出现的次数比实际统计次数多$\theta$次，$0 < \theta\le 1$。这样，计算概率的时候分子部分不会为0。重新计算$\funp{P}(\textrm{现在}|\textrm{确实})$，可以得到：
 \begin{eqnarray}
@@ -632,7 +637,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\

 \noindent 其中$n_1/N$就是分配给所有出现为0次事件的概率。古德-图灵方法最终通过出现1次的$n$-gram估计了出现为0次的事件概率，达到了平滑的效果。

-\parinterval 下面通过一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab:2-2}所示。
+\parinterval 下面通过一个例子来说明这个方法是如何对事件出现的可能性进行平滑的。仍然考虑在加法平滑法中统计单词的例子，根据古德-图灵方法进行修正如表\ref{tab:2-3}所示。

 %------------------------------------------------------
 \begin{table}[htp]{
@@ -647,7 +652,7 @@ N & = & \sum_{r=0}^{\infty}{r^{*}n_r} \nonumber \\
 \rule{0pt}{10pt} 3 & 1 & 4 & 0.333 \\
 \rule{0pt}{10pt} 4 & 1 & - & - \\
 \end{tabular}
-\label{tab:2-2}
+\label{tab:2-3}
 }
 \end{center}
 }\end{table}
@@ -684,7 +689,7 @@ I cannot see without my reading \underline{\ \ \ \ \ \ \ \ }

 \parinterval 观察语料中的2-gram发现，“Francisco”的前一个词仅可能是“San”，不会出现“reading”。这个分析证实了，考虑前一个词的影响是有帮助的，比如仅在前一个词是“San”时，才给“Francisco”赋予一个较高的概率值。基于这种想法，改进原有的1-gram模型，创造一个新的1-gram模型$\funp{P}_{\textrm{continuation}}$，简写为$\funp{P}_{\textrm{cont}}$。这个模型可以通过考虑前一个词的影响评估当前词作为第二个词出现的可能性。

-\parinterval 为了评估$\funp{P}_{\textrm{cont}}$，统计使用当前词作为第二个词所出现2-gram的种类，2-gram种类越多，这个词作为第二个词出现的可能性越高，呈正比：
+\parinterval 为了评估$\funp{P}_{\textrm{cont}}$，统计使用当前词作为第二个词所出现2-gram的种类，2-gram种类越多，这个词作为第二个词出现的可能性越高：
 \begin{eqnarray}
 \funp{P}_{\textrm{cont}}(w_i) \varpropto |\{w_{i-1}: c(w_{i-1} w_i )>0\}|
 \label{eq:2-34}
@@ -749,7 +754,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \label{eq:5-65}
 \end{eqnarray}

-\parinterval  本质上，PPL反映了语言模型对序列可能性预测能力的一种评估。如果$ w_1\dots w_m $\\是真实的自然语言，``完美''的模型会得到$ \funp{P}(w_1\dots w_m)=1 $，它对应了最低的困惑度PPL=1，这说明模型可以完美地对词序列出现的可能性进行预测。当然，真实的语言模型是无法达到PPL=1的，比如，在著名的Penn Treebank（PTB）数据上最好的语言模型的PPL值也只能到达35左右。可见自然语言处理任务的困难程度。
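+\parinterval  In essence, PPL reflects how well a language model predicts the likelihood of a sequence. If $ w_1\dots w_m $ is real natural language, a ``perfect'' model would give $ \funp{P}(w_1\dots w_m)=1 $, which corresponds to the lowest possible perplexity, PPL=1, meaning the model predicts the occurrence of the word sequence perfectly. Of course, real language models cannot reach PPL=1. For example, on the well-known Penn Treebank (PTB) data, even the best language models only reach a PPL of about 35. This shows how difficult natural language processing tasks are.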
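+
+\parinterval Two boundary cases make the scale of PPL concrete. Assuming the usual definition $\textrm{PPL}=\funp{P}(w_1 \ldots w_m)^{-\frac{1}{m}}$ (the defining equation is not reproduced in this excerpt, so this form is an assumption), and writing $|V|$ for the vocabulary size:
+\begin{eqnarray}
+\funp{P}(w_1 \ldots w_m)=1 & \Rightarrow & \textrm{PPL}=1^{-\frac{1}{m}}=1 \nonumber \\
+\funp{P}(w_1 \ldots w_m)=\Big(\frac{1}{|V|}\Big)^{m} & \Rightarrow & \textrm{PPL}=|V| \nonumber
+\end{eqnarray}
+
+\noindent The second line corresponds to a model that spreads probability uniformly over the vocabulary at every position. A PPL of about 35 can therefore be read as the model being, on average, as uncertain as a uniform choice among roughly 35 words at each position.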

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -814,7 +819,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\

 \noindent 这里$\arg$即argument（参数），$\argmax_x f(x)$表示返回使$f(x)$达到最大的$x$。$\argmax_{w \in \chi}$\\$\funp{P}(w)$表示找到使语言模型得分$\funp{P}(w)$达到最大的单词序列$w$。$\chi$ 是搜索问题的解空间，它是所有可能的单词序列$w$的集合。$\hat{w}$可以被看做该搜索问题中的“最优解”，即概率最大的单词序列。

-\parinterval 在序列生成任务中，最简单的策略就是对词表中的词汇进行任意组合，通过这种枚举的方式得到全部可能的序列。但是，很多时候并生成序列的长度是无法预先知道的。比如，机器翻译中目标语序列的长度是任意的。那么怎样判断一个序列何时完成了生成过程呢？这里借用现代人类书写中文和英文的过程：句子的生成首先从一片空白开始，然后从左到右逐词生成，除了第一个单词，所有单词的生成都依赖于前面已经生成的单词。为了方便计算机实现，通常定义单词序列从一个特殊的符号<sos>后开始生成。同样地，一个单词序列的结束也用一个特殊的符号<eos>来表示。
+\parinterval 在序列生成任务中，最简单的策略就是对词表中的词汇进行任意组合，通过这种枚举的方式得到全部可能的序列。但是，很多时候待生成序列的长度是无法预先知道的。比如，机器翻译中目标语序列的长度是任意的。那么怎样判断一个序列何时完成了生成过程呢？这里借用现代人类书写中文和英文的过程：句子的生成首先从一片空白开始，然后从左到右逐词生成，除了第一个单词，所有单词的生成都依赖于前面已经生成的单词。为了方便计算机实现，通常定义单词序列从一个特殊的符号<sos>后开始生成。同样地，一个单词序列的结束也用一个特殊的符号<eos>来表示。

 \parinterval 对于一个序列$<$sos$>$\ I\ agree\ $<$eos$>$，图\ref{fig:2-12}展示语言模型视角下该序列的生成过程。该过程通过在序列的末尾不断附加词表中的单词来逐渐扩展序列，直到这段序列结束。这种生成单词序列的过程被称作{\small\bfnew{自左向右生成}}\index{自左向右生成}（Left-to-Right Generation）\index{Left-to-Right Generation}。注意，这种序列生成策略与$n$-gram的思想天然契合，因为$n$-gram语言模型中，每个词的生成概率依赖前面（左侧）若干词，因此$n$-gram语言模型也是一种自左向右的计算模型。

@@ -857,7 +862,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\

 \parinterval 当任务对单词序列长度没有限制时，上述两种方法枚举出的单词序列也是无穷无尽的。因此这两种枚举策略并不具备完备性而且会导致枚举过程无法停止。由于日常生活中通常不会见到特别长的句子，因此可以通过限制单词序列的最大长度来避免这个问题。一旦单词序列的最大长度被确定，以上两种枚举策略就可以在一定时间内枚举出所有可能的单词序列，因而一定可以找到最优的单词序列，即具备最优性。

-\parinterval 此时上述生成策略虽然可以满足完备性和最优性，但其仍然算不上是优秀的生成策略，因为这两种算法在时间复杂度和空间复杂度上的表现很差，如表\ref{tab:2-3}所示。其中$|V|$为词表大小，$m$ 为序列长度。值得注意的是，在之前的遍历过程中，除了在序列开头一定会挑选<sos>之外，其他位置每次可挑选的单词并不只有词表中的单词，还有结束符号<eos>，因此实际上生成过程中每个位置的单词候选数量为$|V|+1$。
+\parinterval 此时上述生成策略虽然可以满足完备性和最优性，但其仍然算不上是优秀的生成策略，因为这两种算法在时间复杂度和空间复杂度上的表现很差，如表\ref{tab:2-4}所示。其中$|V|$为词表大小，$m$ 为序列长度。值得注意的是，在之前的遍历过程中，除了在序列开头一定会挑选<sos>之外，其他位置每次可挑选的单词并不只有词表中的单词，还有结束符号<eos>，因此实际上生成过程中每个位置的单词候选数量为$|V|+1$。

 \vspace{0.5em}
 %------------------------------------------------------
@@ -870,13 +875,13 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \rule{0pt}{10pt} 深度优先 & $O({(|V|+1)}^{m-1})$ & $O(m)$ \\
 \rule{0pt}{10pt} 宽度优先 & $O({(|V|+1)}^{m-1})$ & $O({(|V|+1)}^{m})$ \\
 \end{tabular}
-\label{tab:2-3}
+\label{tab:2-4}
 }
 \end{center}
 }\end{table}
 %------------------------------------------------------

-\parinterval 那么是否有比枚举策略更高效的方法呢？答案是肯定的。一种直观的方法是将搜索的过程表示成树型结构，称为解空间树。它包含了搜索过程中可生成的全部序列。该树的根节点恒为<sos>，代表序列均从<sos> 开始。该树结构中非叶子节点的兄弟节点有$|V|+1$个，由词表和结束符号<eos>构成。从图\ref{fig:2-13}可以看到，对于一个最大长度为4的序列的搜索过程，生成某个单词序列的过程实际上就是访问解空间树中从根节点<sos> 开始一直到叶子节点<eos>结束的某条路径，而这条的路径上节点按顺序组成了一段独特的单词序列。此时对所有可能单词序列的枚举就变成了对解空间树的遍历。并且枚举的过程与语言模型打分的过程也是一致的，每枚举一个词$i$也就是在上图选择$w_i$一列的一个节点，语言模型就可以为当前的树节点$w_i$给出一个分值，即$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})$。对于$n$-gram语言模型，这个分值可以表示为$\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})=\funp{P}(w_i | w_{i-n+1} \ldots w_{i-1})$
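+\parinterval Is there a more efficient approach than enumeration? The answer is yes. An intuitive method is to represent the search process as a tree, called the solution space tree, which contains every sequence that can be generated during the search. The root of the tree is always <sos>, meaning that all sequences start from <sos>. Below each non-leaf node there are $|V|+1$ sibling nodes, made up of the vocabulary words and the end symbol <eos>. As Figure\ \ref{fig:2-13} shows for a search over sequences of maximum length 4, generating a word sequence amounts to following a path in the solution space tree from the root <sos> to a leaf <eos>, and the nodes along this path, taken in order, form a unique word sequence. Enumerating all possible word sequences thus becomes a traversal of the solution space tree. The enumeration is also consistent with language model scoring: each time the $i$-th word is enumerated, that is, a node in the $w_i$ column of Figure\ \ref{fig:2-13} is selected, the language model can assign the current tree node $w_i$ a score, namely $\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})$. For an $n$-gram language model, this score can be written as $\funp{P}(w_i | w_1 w_2 \ldots w_{i-1})=\funp{P}(w_i | w_{i-n+1} \ldots w_{i-1})$.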

 %----------------------------------------------
 \begin{figure}[htp]
@@ -912,7 +917,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \end{figure}
 %-------------------------------------------

-\parinterval 这样，语言模型的打分与解空间树的遍历就融合在一起了。于是，序列生成的问题可以被重新描述为：寻找所有单词序列组成的解空间树中权重总和最大的一条路径。在这个定义下，前面提到的两种枚举词序列的方法就是经典的{\small\bfnew{深度优先搜索}}\index{深度优先搜索}（Depth-first Search）\index{Depth-first Search}和{\small\bfnew{宽度优先搜索}}\index{宽度优先搜索}（Breadth-first Search）\index{Breadth-first Search}的雏形\upcite{even2011graph,tarjan1972depth}。在后面的内容中，从遍历解空间树的角度出发，可以对原始这些搜索策略的效率进行优化。
+\parinterval 这样，语言模型的打分与解空间树的遍历就融合在一起了。于是，序列生成的问题可以被重新描述为：寻找所有单词序列组成的解空间树中权重总和最大的一条路径。在这个定义下，前面提到的两种枚举词序列的方法就是经典的{\small\bfnew{深度优先搜索}}\index{深度优先搜索}（Depth-first Search）\index{Depth-first Search}和{\small\bfnew{宽度优先搜索}}\index{宽度优先搜索}（Breadth-first Search）\index{Breadth-first Search}的雏形\upcite{even2011graph,tarjan1972depth}。在后面的内容中，从遍历解空间树的角度出发，可以对这些原始的搜索策略的效率进行优化。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -1038,7 +1043,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \begin{adjustwidth}{1em}{}
 \begin{itemize}
 \vspace{0.5em}
-\item 在$n$-gram语言模型中，由于语料中往往存在大量的低频词以及未登录词，模型会产生不合理的概率预测结果。因此本章介绍了三种平滑方法，以解决上述问题。实际上，平滑方法是语言建模中的重要研究方向。除了上述三种方法之外，还有Jelinek–Mercer平滑\upcite{jelinek1980interpolated}、Katz 平滑\upcite{katz1987estimation}以及Witten–Bell平滑等等\upcite{bell1990text,witten1991the}。相关工作也对这些平滑方法进行了详细对比\upcite{chen1999empirical,goodman2001a}。
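+\item In $n$-gram language models, corpora often contain many low-frequency and out-of-vocabulary words, which leads to unreasonable probability estimates. This chapter therefore introduced three smoothing methods to address the problem. Smoothing is in fact an important research direction in language modeling. Besides the three methods introduced above, there are others such as Jelinek–Mercer smoothing\upcite{jelinek1980interpolated}, Katz smoothing\upcite{katz1987estimation}, and Witten–Bell smoothing\upcite{bell1990text,witten1991the}. Prior work has also compared these smoothing methods in detail\upcite{chen1999empirical,goodman2001a}.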
 \vspace{0.5em}
 \item 除了平滑方法，也有很多工作对$n$-gram语言模型进行改进。比如，对于形态学丰富的语言，可以考虑对单词的形态学变化进行建模。这类语言模型在一些机器翻译系统中也体现出了很好的潜力\upcite{kirchhoff2005improved,sarikaya2007joint,koehn2007factored}。此外，如何使用超大规模数据进行语言模型训练也是备受关注的研究方向。比如，有研究者探索了对超大语言模型进行压缩和存储的方法\upcite{federico2007efficient,federico2006how,heafield2011kenlm}。另一个有趣的方向是，利用随机存储算法对大规模语言模型进行有效存储\upcite{talbot2007smoothed,talbot2007randomised}，比如，在语言模型中使用Bloom\ Filter等随机存储的数据结构。
 \vspace{0.5em}

--- a/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
+++ b/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
@@ -9,8 +9,8 @@
 \begin{tikzpicture}

 \node [anchor=west,inner sep=2pt,minimum height=2em] (eq1) at (0,0) {$f(s_u|t_v)$};
-\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-2pt]eq1.east) {$=$};
-\node [anchor=west,inner sep=2pt,minimum height=2em] (eq3) at ([xshift=-2pt]eq2.east) {$\lambda_{t_v}^{-1}$};
+\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-1pt]eq1.east) {$=$};
+\node [anchor=west,inner sep=2pt,minimum height=2em] (eq3) at ([xshift=-1pt]eq2.east) {$\lambda_{t_v}^{-1}$};
 \node [anchor=west,inner sep=2pt,minimum height=3.0em] (eq4) at ([xshift=-3pt]eq3.east) {\footnotesize{$\frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$}};
 \node [anchor=west,inner sep=2pt,minimum height=3.0em] (eq5) at ([xshift=1pt]eq4.east) {\footnotesize{$\sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v)$}};
 \node [anchor=west,inner sep=2pt,minimum height=3.0em] (eq6) at ([xshift=1pt]eq5.east) {$\frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$};
@@ -27,7 +27,7 @@
 }

 {
-\node [anchor=south west,inner sep=2pt] (label1) at (eq4.north west) {{\scriptsize{翻译概率$\textrm{P}(\mathbf{s}|\mathbf{t})$}}};
+\node [anchor=south west,inner sep=2pt] (label1) at (eq4.north west) {{\scriptsize{翻译概率$\funp{P}(\seq{s}|\seq{t})$}}};
 }
 {
 \node [anchor=south west,inner sep=2pt] (label2) at (eq5.north west) {{\scriptsize{配对的总次数}}};

--- a/Chapter5/Figures/figure-different-alignment-comparison.tex
+++ b/Chapter5/Figures/figure-different-alignment-comparison.tex
@@ -13,7 +13,7 @@
    \draw [-] (s2.south) -- (t2.north);
    \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (mark) at ([xshift=0.8em,yshift=-0.7em]s2.south east) {{\color{ugreen} \tiny{\textbf{Yes}}}};
    }
-    \node [anchor=center] (labela) at ([xshift=2em,yshift=-1.0em]t1.south) {\scriptsize{(a)}};
+    \node [anchor=center] (labela) at ([xshift=2em,yshift=-1.0em]t1.south) {\small{(a)}};
    \end{scope}
    \begin{scope}[xshift=2.3in]
    {\small
@@ -25,7 +25,7 @@
    \draw [-] (s1.south) -- (t2.north);
    \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1.5pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (mark) at ([xshift=0.8em,yshift=-0.7em]s2.south east) {{\color{red} \tiny{\textbf{No}}}};
    }
-        \node [anchor=center] (labelb) at ([xshift=2em,yshift=-1.0em]t1.south) {\scriptsize{(b)}};
+        \node [anchor=center] (labelb) at ([xshift=2em,yshift=-1.0em]t1.south) {\small{(b)}};
    \end{scope}
    \begin{scope}[xshift=4.1in]
    {\small
@@ -37,7 +37,7 @@
    \draw [-] (s2.south) -- ([yshift=-0.2em]t1.north);
    \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (mark) at ([xshift=0.8em,yshift=-0.7em]s2.south east) {{\color{ugreen} \tiny{\textbf{Yes}}}};
    }
-    \node [anchor=center] (labelc) at ([xshift=2em,yshift=-1.0em]t1.south) {\scriptsize{(c)}};
+    \node [anchor=center] (labelc) at ([xshift=2em,yshift=-1.0em]t1.south) {\small{(c)}};
    \end{scope}
    \end{tikzpicture}
   

--- a/Chapter5/Figures/figure-different-translation-candidate-space.tex
+++ b/Chapter5/Figures/figure-different-translation-candidate-space.tex
@@ -23,13 +23,13 @@
 \node [draw,dashed,ublue,fill=blue!10,thick,anchor=center,circle,minimum size=18pt] (t6) at ([xshift=3em]t2.east) {};
 \node [draw,dashed,ublue,fill=blue!10,thick,anchor=center,circle,minimum size=18pt] (t7) at ([xshift=3em]t4.east) {};

-\draw [->,thick,] (s.north east) .. controls +(north east:1em) and +(north west:1em).. (t1.north west) node[pos=0.5,below] {\tiny{P ($\seq{t}_1|\seq{s}$)=0.1}};
-\draw [->,thick,] (s.60) .. controls +(50:4em) and +(west:1em).. (t2.west) node[pos=0.5,below] {\tiny{P($\seq{t}_2|\seq{s}$)=0.2}};
-\draw [->,thick,] (s.north) .. controls +(70:4em) and +(west:1em).. (t3.west) node[pos=0.5,above,xshift=-1em] {\tiny{P($\seq{t}_3|\seq{s}$)=0.3}};
-\draw [->,thick,] (s.south east) .. controls +(300:3em) and +(south west:1em).. (t4.south west) node[pos=0.5,below] {\tiny{P($\seq{t}_4|\seq{s}$)=0.1}};
+\draw [->,thick] (s.north east) .. controls +(north east:1em) and +(north west:1em).. (t1.north west) node[pos=0.5,below] {\tiny{$\funp{P}(\seq{t}_1|\seq{s})=0.1$}};
+\draw [->,thick] (s.60) .. controls +(50:4em) and +(west:1em).. (t2.west) node[pos=0.5,below] {\tiny{$\funp{P}(\seq{t}_2|\seq{s})=0.2$}};
+\draw [->,thick] (s.north) .. controls +(70:4em) and +(west:1em).. (t3.west) node[pos=0.5,above,xshift=-1em] {\tiny{$\funp{P}(\seq{t}_3|\seq{s})=0.3$}};
+\draw [->,thick] (s.south east) .. controls +(300:3em) and +(south west:1em).. (t4.south west) node[pos=0.5,below] {\tiny{$\funp{P}(\seq{t}_4|\seq{s})=0.1$}};

-\node [anchor=center] (foot1) at ([xshift=3.8em,yshift=-3em]s1.south) {\footnotesize{Human translation candidate space}};
-\node [anchor=center] (foot2) at ([xshift=7em,yshift=-3em]s.south) {\footnotesize{Machine translation candidate space}};
+\node [anchor=center] (foot1) at ([xshift=3.8em,yshift=-3.5em]s1.south) {\small{(a) Human translation candidate space}};
+\node [anchor=center] (foot2) at ([xshift=7em,yshift=-3.5em]s.south) {\small{(b) Machine translation candidate space}};


 \end{tikzpicture}

--- a/Chapter5/Figures/figure-example-translation-alignment.tex
+++ b/Chapter5/Figures/figure-example-translation-alignment.tex
@@ -4,9 +4,9 @@



-\begin{tabular}{| l | l |}
+\begin{tabular}{| c | c |}
 \hline
-& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} } \\ \hline
+\rule{0pt}{15pt} Different translations of the source sentence “我对你感到满意”& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} } \\ \hline

 \begin{tikzpicture}


--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
@@ -63,29 +63,29 @@
 \node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }
 }
 {\scriptsize
@@ -108,7 +108,7 @@
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-1.3em,yshift=-8.5em]glabel.west) {\tiny{$h$ holds the temporary translation results}};
-\node [anchor=north west] (foot1) at ([xshift=0.0em,yshift=-18.0em]translabel.south west) {\scriptsize{(a)\; 4:$h = \phi$}};
+\node [anchor=north west] (foot1) at ([xshift=4.0em,yshift=-18.0em]translabel.south west) {\small{(a)\; 4:$h = \phi$}};
 }
 \end{scope}

@@ -173,34 +173,34 @@


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }

 }
@@ -233,7 +233,7 @@
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-1.3em,yshift=-8.5em]glabel.west) {\tiny{$h$ holds the temporary translation results}};
-\node [anchor=north west] (foot2) at ([xshift=0.0em,yshift=-18.0em]translabel.south west) {\scriptsize{(b)\; 6: \textbf{if} $used[j]=$ \textbf{false} \textbf{then}}};
+\node [anchor=north west] (foot2) at ([xshift=-4.0em,yshift=-18.0em]translabel.south west) {\small{(b)\; 6: \textbf{if} $used[j]=$ \textbf{false} \textbf{then}}};
 }
 {% large Join marker
 \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=2.5pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (join) at ([xshift=4em,yshift=-1em]hlabel.north east) {\tiny{\textsc{Join}}};

--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
@@ -63,34 +63,34 @@


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }

 }
@@ -126,7 +126,7 @@

 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-0.7em,yshift=-7.5em]glabel.west) {\tiny{$h$ holds the temporary translation results}};
 }
-\node [anchor=north west] (foot1) at ([xshift=0.0em,yshift=-12.3em]translabel.south west) {\scriptsize{(c)\; 7:  $h = h \cup \textrm{\textsc{Join}}(best,\pi[j])$}};
+\node [anchor=north west] (foot1) at ([xshift=-2.0em,yshift=-12.3em]translabel.south west) {\small{(c)\; 7:  $h = h \cup \textrm{\textsc{Join}}(best,\pi[j])$}};
 {% large Join marker
 \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=2.5pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (join) at ([xshift=4em,yshift=-1em]hlabel.north east) {\tiny{\textsc{Join}}};
 }
@@ -228,34 +228,34 @@


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt11) at (t11.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt12) at (t12.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt13) at (t13.east) {{\color{white} 0.1}};
 }

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt21) at (t21.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt22) at (t22.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt23) at (t23.east) {{\color{white} 0.2}};
 }


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} .7}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} .3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt31) at (t31.east) {{\color{white} 0.7}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt32) at (t32.east) {{\color{white} 0.3}};
 }

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} .4}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} .1}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt41) at (t41.east) {{\color{white} 0.4}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt42) at (t42.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt43) at (t43.east) {{\color{white} 0.1}};
 }


 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} .3}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} .2}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} .2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt51) at (t51.east) {{\color{white} 0.3}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt52) at (t52.east) {{\color{white} 0.2}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=1.5em,fill=black] (pt53) at (t53.east) {{\color{white} 0.2}};
 }

 }
@@ -283,7 +283,7 @@
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-0.7em,yshift=-7.5em]glabel.west) {\tiny{$h$ holds the temporary translation results}};
-\node [anchor=north west] (foot2) at ([xshift=0.0em,yshift=-23.0em]translabel.south west) {\scriptsize{(d)\; 8: $best = \textrm{\textsc{PruneForTop1}}(h)$}};
+\node [anchor=north west] (foot2) at ([xshift=-5.0em,yshift=-23.0em]translabel.south west) {\small{(d)\; 8: $best = \textrm{\textsc{PruneForTop1}}(h)$}};
 }



--- a/Chapter5/Figures/figure-ibm-model-iteration-process-diagram.tex
+++ b/Chapter5/Figures/figure-ibm-model-iteration-process-diagram.tex
@@ -5,8 +5,8 @@
 \begin{tikzpicture}

 \node [anchor=west,inner sep=2pt,fill=red!20,minimum height=3em] (eq1) at (0,0) {$f(s_u|t_v)$};
-\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-2pt]eq1.east) {$=$};
-\node [anchor=west,inner sep=2pt] (eq3) at ([xshift=-2pt]eq2.east) {$\lambda_{t_v}^{-1}$};
+\node [anchor=west,inner sep=2pt] (eq2) at ([xshift=-1pt]eq1.east) {$=$};
+\node [anchor=west,inner sep=2pt] (eq3) at ([xshift=-1pt]eq2.east) {$\lambda_{t_v}^{-1}$};
 \node [anchor=west,inner sep=2pt] (eq4) at ([xshift=-2pt]eq3.east) {$\frac{\varepsilon}{(l+1)^{m}}$};
 \node [anchor=west,inner sep=2pt,fill=red!20,minimum height=3em] (eq5) at ([xshift=-2pt]eq4.east) {\footnotesize{$\prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$}};
 \node [anchor=west,inner sep=2pt] (eq6) at ([xshift=-2pt]eq5.east) {\footnotesize{$\sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v)$}};

--- a/Chapter5/Figures/figure-process-of-machine-translation.tex
+++ b/Chapter5/Figures/figure-process-of-machine-translation.tex
@@ -17,31 +17,31 @@
 \draw [->,very thick,ublue] (s5.south) -- ([yshift=-0.7em]s5.south);

 {\small
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.2em]t11.south) {me};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.2em]t12.south) {I'm};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t12) at ([yshift=-0.8em]t11.south) {me};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t13) at ([yshift=-0.8em]t12.south) {I'm};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl11) at (t11.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl12) at (t12.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl13) at (t13.north west) {\tiny{{\color{white} \textbf{1}}}};

-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.2em]t21.south) {with};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.2em]t22.south) {for};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t22) at ([yshift=-0.8em]t21.south) {with};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t23) at ([yshift=-0.8em]t22.south) {for};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl21) at (t21.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl22) at (t22.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl23) at (t23.north west) {\tiny{{\color{white} \textbf{2}}}};

-\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
+\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl31) at (t31.north west) {\tiny{{\color{white} \textbf{3}}}};

-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.2em]t41.south) {feel};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t42) at ([yshift=-0.8em]t41.south) {feel};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl41) at (t41.north west) {\tiny{{\color{white} \textbf{4}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl42) at (t42.north west) {\tiny{{\color{white} \textbf{4}}}};

-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.2em]t51.south) {satisfied};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t52) at ([yshift=-0.8em]t51.south) {satisfied};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t53) at ([yshift=-0.8em]t52.south) {satisfies};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl51) at (t51.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl52) at (t52.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl53) at (t53.north west) {\tiny{{\color{white} \textbf{5}}}};
@@ -51,22 +51,22 @@
 {\tiny

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt11) at (t11.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt12) at (t12.east) {{\color{white} \textbf{P=.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt13) at (t13.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt11) at (t11.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt12) at (t12.south) {{\color{white} \textbf{$\seq{P}$=0.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt13) at (t13.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt21) at (t21.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt22) at (t22.east) {{\color{white} \textbf{P=.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt23) at (t23.east) {{\color{white} \textbf{P=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt21) at (t21.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt22) at (t22.south) {{\color{white} \textbf{$\seq{P}$=0.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt23) at (t23.south) {{\color{white} \textbf{$\seq{P}$=0.3}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt31) at (t31.east) {{\color{white} \textbf{P=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt31) at (t31.south) {{\color{white} \textbf{$\seq{P}$=1}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt41) at (t41.east) {{\color{white} \textbf{P=.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt42) at (t42.east) {{\color{white} \textbf{P=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt41) at (t41.south) {{\color{white} \textbf{$\seq{P}$=0.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt42) at (t42.south) {{\color{white} \textbf{$\seq{P}$=0.5}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt51) at (t51.east) {{\color{white} \textbf{P=.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt52) at (t52.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt53) at (t53.east) {{\color{white} \textbf{P=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt51) at (t51.south) {{\color{white} \textbf{$\seq{P}$=0.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt52) at (t52.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt53) at (t53.south) {{\color{white} \textbf{$\seq{P}$=0.1}}};
 }

 }
@@ -76,23 +76,23 @@
 \begin{scope}
 {\small

-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft11) at ([yshift=-1.2in]t11.west) {I'm};
-\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=5em] (ft12) at ([xshift=5.0em]ft11.center) {satisfied};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft11) at ([yshift=-1.5in]t11.west) {I'm};
+\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (ft12) at ([xshift=5.0em]ft11.center) {satisfied};
 \node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (ft13) at ([xshift=5.0em]ft12.center) {with};
 \node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (ft14) at ([xshift=4.0em]ft13.center) {you};

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft21) at ([yshift=-2em]ft11.west) {I'm};
-\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=5em] (ft22) at ([xshift=5.0em]ft21.center) {satisfy};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft21) at ([yshift=-3em]ft11.west) {I'm};
+\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (ft22) at ([xshift=5.0em]ft21.center) {satisfy};
 \node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (ft23) at ([xshift=5.0em]ft22.center) {to};
 \node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (ft24) at ([xshift=4.0em]ft23.center) {you};
 }

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (ft31) at ([yshift=-2em]ft21.west) {I'm};
-\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=5em] (ft32) at ([xshift=5.0em]ft31.center) {satisfy};
-\node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (ft33) at ([xshift=5.0em]ft32.center) {you};
-\node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (ft34) at ([xshift=4.0em]ft33.center) {to};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (ft31) at ([yshift=-3em]ft21.west) {I'm};
+\node [anchor=center,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (ft32) at ([xshift=5.0em]ft31.center) {satisfy};
+\node [anchor=center,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=2.5em] (ft33) at ([xshift=5.0em]ft32.center) {you};
+\node [anchor=center,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (ft34) at ([xshift=4.0em]ft33.center) {to};
 }

 \node [anchor=north west,inner sep=1pt,fill=black] (ftl11) at (ft11.north west) {\tiny{{\color{white} \textbf{1}}}};
@@ -117,20 +117,20 @@
 {\tiny

 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft11) at (ft11.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft12) at (ft12.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft13) at (ft13.east) {{\color{white} \textbf{P=.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft14) at (ft14.east) {{\color{white} \textbf{P=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft11) at (ft11.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pft12) at (ft12.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft13) at (ft13.south) {{\color{white} \textbf{$\seq{P}$=0.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft14) at (ft14.south) {{\color{white} \textbf{$\seq{P}$=1}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft21) at (ft21.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft22) at (ft22.east) {{\color{white} \textbf{P=.1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft23) at (ft23.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft24) at (ft24.east) {{\color{white} \textbf{P=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft21) at (ft21.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pft22) at (ft22.south) {{\color{white} \textbf{$\seq{P}$=0.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft23) at (ft23.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft24) at (ft24.south) {{\color{white} \textbf{$\seq{P}$=1}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft31) at (ft31.east) {{\color{white} \textbf{P=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft32) at (ft32.east) {{\color{white} \textbf{P=.1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft33) at (ft33.east) {{\color{white} \textbf{P=1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.5em,fill=black] (pft34) at (ft34.east) {{\color{white} \textbf{P=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft31) at (ft31.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pft32) at (ft32.south) {{\color{white} \textbf{$\seq{P}$=0.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft33) at (ft33.south) {{\color{white} \textbf{$\seq{P}$=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pft34) at (ft34.south) {{\color{white} \textbf{$\seq{P}$=0.4}}};
 }

 }
@@ -146,34 +146,34 @@
 \end{pgfonlayer}

 {
-\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft41) at ([yshift=-2em]ft31.west) {...};
+\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft41) at ([yshift=-3em]ft31.west) {...};
 }

 {
-\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft42) at ([yshift=-2em]ft32.west) {\scriptsize{{所有翻译单元都是概率化的}}};
-\node [anchor=west,inner sep=1pt,fill=black] (ft43) at (ft42.east) {{\color{white} \tiny{{P=概率}}}};
+\node [anchor=west,inner sep=2pt,minimum height=1.5em,minimum width=2.5em] (ft42) at ([yshift=-3em]ft32.west) {\scriptsize{{所有翻译单元都是概率化的}}};
+\node [anchor=west,inner sep=1pt,fill=black] (ft43) at (ft42.east) {{\color{white} \tiny{{$\seq{P}$=概率}}}};
 }
 }
 \end{scope}

 \begin{scope}
 {\footnotesize
-\node [anchor=east] (label4) at ([yshift=0.4em]ft11.west) {翻译就是一条};
+\node [anchor=east] (label4) at ([yshift=0.0em]ft11.west) {翻译就是一条};
 \node [anchor=north west] (label4part2) at ([yshift=0.7em]label4.south west) {译文选择路径};
 }

 {\footnotesize
-\node [anchor=east] (label5) at ([yshift=0.4em]ft21.west) {不同的译文对};
+\node [anchor=east] (label5) at ([yshift=0.0em]ft21.west) {不同的译文对};
 \node [anchor=north west] (label5part2) at ([yshift=0.7em]label5.south west) {应不同的路径};
 }

 {\footnotesize
-\node [anchor=east] (label6) at ([yshift=0.4em]ft31.west) {单词翻译的词};
+\node [anchor=east] (label6) at ([yshift=0.0em]ft31.west) {单词翻译的词};
 \node [anchor=north west] (label6part2) at ([yshift=0.7em]label6.south west) {序也可能不同};
 }

 {\footnotesize
-\node [anchor=east] (label7) at ([yshift=0.4em]ft41.west) {可能的翻译路};
+\node [anchor=east] (label7) at ([yshift=0.0em]ft41.west) {可能的翻译路};
 \node [anchor=north west] (label7part2) at ([yshift=0.7em]label7.south west) {径非常多};
 }

@@ -181,14 +181,14 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \begin{scope}
 {
-\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=8em,xshift=2.0em]t53.south east) -- ([xshift=2.0em]t53.south east) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label2) {\footnotesize{{从双语数}}};
+\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=9em,xshift=2.0em]t53.south east) -- ([yshift=-0.5em,xshift=2.0em]t53.south east) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label2) {\footnotesize{{从双语数}}};
 \node [anchor=north west] (label2part2) at ([yshift=0.3em]label2.south west) {\footnotesize{{据中自动}}};
 \node [anchor=north west] (label2part3) at ([yshift=0.3em]label2part2.south west) {\footnotesize{{学习词典}}};
 \node [anchor=north west] (label2part4) at ([yshift=0.3em]label2part3.south west) {\footnotesize{{（训练）}}};
 }

 {
-\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=-1.0em,xshift=6.2em]t53.south west) -- ([yshift=-10.5em,xshift=6.2em]t53.south west) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label3) {\footnotesize{{利用概率}}};
+\draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=-2.0em,xshift=6.2em]t53.south west) -- ([yshift=-14.5em,xshift=6.2em]t53.south west) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label3) {\footnotesize{{利用概率}}};
 \node [anchor=north west] (label3part2) at ([yshift=0.3em]label3.south west) {\footnotesize{{化的词典}}};
 \node [anchor=north west] (label3part3) at ([yshift=0.3em]label3part2.south west) {\footnotesize{{进行翻译}}};
 \node [anchor=north west] (label3part4) at ([yshift=0.3em]label3part3.south west) {\footnotesize{{（解码）}}};
@@ -202,11 +202,11 @@
 \node [anchor=west] (score1) at ([xshift=1.5em]ft14.east) {\footnotesize{P=0.042}};
 \node [anchor=west] (score2) at ([xshift=1.5em]ft24.east) {\footnotesize{P=0.006}};
 \node [anchor=west] (score3) at ([xshift=1.5em]ft34.east) {\footnotesize{P=0.003}};
-\node [anchor=south] (scorelabel) at ([xshift=-2.0em]score1.north) {\scriptsize{{\color{black}{率给每个译文赋予一个模型得分}}}};
+\node [anchor=south] (scorelabel) at ([xshift=-3.0em]score1.north) {\scriptsize{{\color{black}{率给每个译文赋予一个模型得分}}}};
 \node [anchor=south] (scorelabel2) at ([yshift=-0.5em]scorelabel.north) {\scriptsize{{\color{black}{系统综合单词概率和语言模型概}}}};
 }
 {
-\node [anchor=north] (scorelabel2) at (score3.south) {\scriptsize{{选择得分}}};
+\node [anchor=north] (scorelabel2) at ([yshift=-1.5em]score3.south) {\scriptsize{{选择得分}}};
 \node [anchor=north west] (scorelabel2part2) at ([xshift=-0.5em,yshift=0.5em]scorelabel2.south west) {\scriptsize{{最高的译文}}};
 \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=1pt,circular drop shadow={shadow xshift=0.05em,shadow yshift=-0.05em}] (head1) at ([xshift=0.3em]score1.east) {\scriptsize{{\color{ugreen} {ok}}}};
 }
@@ -216,10 +216,10 @@
 \begin{scope}

 {
-\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.3em]t13.west) -- ([xshift=0.8em,yshift=-0.3em]t13.east) -- ([xshift=-0.2em,yshift=-0.3em]t21.west) -- ([xshift=0.8em,yshift=-0.3em]t21.east) -- ([xshift=-0.2em,yshift=-0.3em]t31.west) -- ([xshift=0.8em,yshift=-0.3em]t31.east) -- ([xshift=-0.2em,yshift=-0.3em]t41.west) -- ([xshift=0.8em,yshift=-0.3em]t41.east) -- ([xshift=-0.2em,yshift=-0.3em]t51.west) -- ([xshift=1.2em,yshift=-0.3em]t51.east);
+\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.42em]t13.west) -- ([xshift=0.8em,yshift=-0.42em]t13.east) -- ([xshift=-0.2em,yshift=-0.42em]t21.west) -- ([xshift=0.8em,yshift=-0.42em]t21.east) -- ([xshift=-0.2em,yshift=-0.42em]t31.west) -- ([xshift=0.8em,yshift=-0.42em]t31.east) -- ([xshift=-0.2em,yshift=-0.42em]t41.west) -- ([xshift=0.8em,yshift=-0.42em]t41.east) -- ([xshift=-0.2em,yshift=-0.42em]t51.west) -- ([xshift=1.2em,yshift=-0.42em]t51.east);
 }

-\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]t13.west) -- ([xshift=0.8em,yshift=-0.5em]t13.east) -- ([xshift=-0.2em,yshift=-0.5em]t22.west) -- ([xshift=0.8em,yshift=-0.5em]t22.east) -- ([xshift=-0.2em,yshift=-0.5em]t31.west) -- ([xshift=0.8em,yshift=-0.5em]t31.east) -- ([xshift=-0.2em,yshift=-0.5em]t41.west) -- ([xshift=0.8em,yshift=-0.5em]t41.east) -- ([xshift=-0.2em,yshift=-0.5em]t52.west) -- ([xshift=1.2em,yshift=-0.5em]t52.east);
+\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.62em]t13.west) -- ([xshift=0.8em,yshift=-0.62em]t13.east) -- ([xshift=-0.2em,yshift=-0.62em]t22.west) -- ([xshift=0.8em,yshift=-0.62em]t22.east) -- ([xshift=-0.2em,yshift=-0.62em]t31.west) -- ([xshift=0.8em,yshift=-0.62em]t31.east) -- ([xshift=-0.2em,yshift=-0.62em]t41.west) -- ([xshift=0.8em,yshift=-0.62em]t41.east) -- ([xshift=-0.2em,yshift=-0.62em]t52.west) -- ([xshift=1.2em,yshift=-0.62em]t52.east);

 \end{scope}


--- a/Chapter5/Figures/figure-scores-of-different-translation_model&language_model.tex
+++ b/Chapter5/Figures/figure-scores-of-different-translation_model&language_model.tex
 %%% outline
 %-------------------------------------------------------------------------
-\begin{tabular}{| l | l |}
+\begin{tabular}{| c | l |}
 \hline
-& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} \color{red}{{\footnotesize{$\times\funp{P}_{\textrm{lm}}(\mathbf{t})$}}}} \\ \hline
+\rule{0pt}{15pt} 源语言句子“我对你感到满意”的不同翻译结果& {\footnotesize{$\prod\limits_{(j,i) \in \hat{A}} \funp{P}(s_j,t_i)$} \color{red}{{\footnotesize{$\times\funp{P}_{\textrm{lm}}(\mathbf{t})$}}}} \\ \hline

 \begin{tikzpicture}


--- a/Chapter5/Figures/figure-zh-en-translation-sentence-pairs&word-alignment-connection.tex
+++ b/Chapter5/Figures/figure-zh-en-translation-sentence-pairs&word-alignment-connection.tex
@@ -11,7 +11,7 @@
 \node [anchor=west] (s3) at ([xshift=0.5em]s2.east) {你\footnotesize{$_3$}};
 \node [anchor=west] (s4) at ([xshift=0.5em]s3.east) {感到\footnotesize{$_4$}};
 \node [anchor=west] (s5) at ([xshift=0.5em]s4.east) {满意\footnotesize{$_5$}};
-\node [anchor=east] (s) at (s1.west) {$\mathbf{s}=$};
+\node [anchor=east] (s) at (s1.west) {$\seq{s}=$};
 \end{scope}

 \begin{scope}[yshift=-3.0em]
@@ -20,7 +20,7 @@
 \node [anchor=west] (t3) at ([xshift=0.3em,yshift=0.1em]t2.east) {satisfied\footnotesize{$_3$}};
 \node [anchor=west] (t4) at ([xshift=0.3em]t3.east) {with\footnotesize{$_4$}};
 \node [anchor=west] (t5) at ([xshift=0.3em,yshift=-0.2em]t4.east) {you\footnotesize{$_5$}};
-\node [anchor=east] (t) at ([xshift=-0.3em]t1.west) {$\mathbf{t}=$};
+\node [anchor=east] (t) at ([xshift=-0.3em]t1.west) {$\seq{t}=$};
 \end{scope}



--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
@@ -136,7 +136,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo

 \parinterval 对于第一个问题，可以给计算机一个翻译词典，这样计算机可以发挥计算方面的优势，尽可能多地把翻译结果拼装出来。比如，可以把每个翻译结果看作是对单词翻译的拼装，这可以被形象地比作贯穿多个单词的一条路径，计算机所做的就是尽可能多地生成这样的路径。图\ref{fig:5-4}中蓝色和红色的折线就分别表示了两条不同的译文选择路径，区别在于“满意”和“对”的翻译候选是不一样的，蓝色折线选择的是“satisfy”和“to”，而红色折线是“satisfied”和“with”。换句话说，不同的译文对应不同的路径（即使词序不同也会对应不同的路径）。

-\parinterval 对于第二个问题，尽管机器能够找到很多译文选择路径，但它并不知道哪些路径是好的。说地再直白一些，简单地枚举路径实际上就是一个体力活，没有太多的智能。因此计算机还需要再聪明一些，运用它的能够“掌握”的知识判断翻译结果的好与坏。这一步是最具挑战的，当然也有很多思路。在统计机器翻译中，这个问题被定义为：设计一种统计模型，它可以给每个译文一个可能性，而这个可能性越高表明译文越接近人工翻译。
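+\parinterval As for the second problem, although the machine can find many translation-selection paths, it does not know which of them are good. Put more bluntly, simply enumerating paths is mechanical work that requires little intelligence. The computer therefore needs to be smarter and use knowledge it can ``grasp'' to judge whether a translation is good or bad. This step is the most challenging one, and many approaches to it exist. In statistical machine translation, the problem is defined as designing a statistical model that assigns each translation a likelihood, where a higher likelihood indicates that the translation is closer to a human translation.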

 \parinterval 如图\ref{fig:5-4}所示，每个单词翻译候选的右侧黑色框里的数字就是单词的翻译概率，使用这些单词的翻译概率，可以得到整句译文的概率（用符号$\funp{P}$表示）。这样，就用概率化的模型描述了每个翻译候选的可能性。基于这些翻译候选的可能性，机器翻译系统可以对所有的翻译路径进行打分，比如，图\ref{fig:5-4}中第一条路径的分数为0.042，第二条是0.006，以此类推。最后，系统可以选择分数最高的路径作为源语言句子的最终译文。

@@ -262,7 +262,7 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati
 \begin{eqnarray}
 \funp{P}(\text{机器},\text{translation}; \seq{s},\seq{t})  & = & \frac{2}{121} \\
 \funp{P}(\text{机器},\text{look}; \seq{s},\seq{t})  & =  & \frac{0}{121}
-\label{eq:5-3}
+\label{eq:5-4}
 \end{eqnarray}

 \noindent 注意，由于“look”没有出现在数据中，因此$\funp{P}(\text{机器},\text{look}; \seq{s},\seq{t})=0$。这时，可以使用{\chaptertwo}介绍的平滑算法赋予它一个非零的值，以保证在后续的步骤中整个翻译模型不会出现零概率的情况。
@@ -275,11 +275,11 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati

 \parinterval 如果有更多的句子，上面的方法同样适用。假设，有$K$个互译句对$\{(\seq{s}^{[1]},\seq{t}^{[1]})$,...,\\$(\seq{s}^{[K]},\seq{t}^{[K]})\}$。仍然可以使用基于相对频次的方法估计翻译概率$\funp{P}(x,y)$，具体方法如下:
 \begin{eqnarray}
-\funp{P}(x,y)  =  \frac{{\sum_{k=1}^{K} c(x,y;\seq{s}^{[k]},\seq{t}^{[k]})}}{\sum_{k=1}^{K}{{\sum_{x',y'} c(x',y';\seq{s}^{[k]},\seq{t}^{[k]})}}}
-\label{eq:5-4}
+\funp{P}(x,y)  &=&  \frac{{\sum_{k=1}^{K} c(x,y;\seq{s}^{[k]},\seq{t}^{[k]})}}{\sum_{k=1}^{K}{{\sum_{x',y'} c(x',y';\seq{s}^{[k]},\seq{t}^{[k]})}}}
+\label{eq:5-5}
 \end{eqnarray}

-\parinterval 与公式\eqref{eq:5-1}相比，公式\eqref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{k=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。
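+\parinterval Compared with Equation \eqref{eq:5-1}, both the numerator and the denominator of Equation \eqref{eq:5-5} gain a summation $\sum_{k=1}^{K} \cdot$, which iterates over all sentence pairs in the corpus. In other words, when counting word co-occurrences, the counts from every sentence pair must be accumulated. From a statistical learning perspective, estimating parameters on larger data improves the reliability of the result. The same holds for word translation probabilities: on small data many translation phenomena are not salient, but once the amount of data grows large enough, the regularities of translation emerge clearly.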

 \parinterval 举个例子，实例\ref{eg:5-2}展示了一个由两个句对构成的平行语料库。

@@ -303,10 +303,10 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
                                                                            & = & \frac{4 + 1}{|\seq{s}^{[1]}| \times |\seq{t}^{[1]}| + |\seq{s}^{[2]}| \times |\seq{t}^{[2]}|} \nonumber \\
                                                                            & = & \frac{4 + 1}{11 \times 11 + 5 \times 7} \nonumber \\
                                                                            & = & \frac{5}{156}
-\label{eq:5-5}
+\label{eq:5-6}
 \end{eqnarray}
 }
-\parinterval 公式\eqref{eq:5-5}所展示的计算过程很简单，分子是两个句对中“翻译”和“translation”共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。
+\parinterval 公式\eqref{eq:5-6}所展示的计算过程很简单，分子是两个句对中“翻译”和“translation”共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。
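The relative-frequency estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function names and the two-pair toy corpus are assumptions made for the example, and the corpus is deliberately not the one from the book's worked example.

```python
from collections import Counter

def cooccurrence_counts(src_words, tgt_words):
    """c(x, y; s, t): number of position pairs where x occurs in s and y occurs in t."""
    counts = Counter()
    for x in src_words:
        for y in tgt_words:
            counts[(x, y)] += 1
    return counts

def estimate_word_translation_prob(corpus):
    """Relative-frequency estimate of P(x, y) accumulated over all sentence pairs."""
    totals = Counter()
    n_pairs = 0  # denominator: sum over sentence pairs of |s| * |t|
    for src_words, tgt_words in corpus:
        totals.update(cooccurrence_counts(src_words, tgt_words))
        n_pairs += len(src_words) * len(tgt_words)
    probs = {pair: c / n_pairs for pair, c in totals.items()}
    return probs, n_pairs

# Hypothetical toy corpus (already tokenized), for illustration only.
toy_corpus = [
    (["机器", "翻译", "就", "是", "翻译"], ["machine", "translation", "is", "translation"]),
    (["人工", "翻译"], ["human", "translation"]),
]
probs, n_pairs = estimate_word_translation_prob(toy_corpus)
print(probs[("翻译", "translation")], n_pairs)
```

In the toy corpus, the pair (翻译, translation) co-occurs $2 \times 2$ times in the first sentence pair and once in the second, and the denominator is $5 \times 4 + 2 \times 2 = 24$, mirroring the structure of the calculation above.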

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -323,14 +323,14 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?

 \subsubsection{1. 基础模型}

-\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\eqref{eq:5-4}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
+\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\eqref{eq:5-5}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
 )$，可以利用$g(\seq{s},\seq{t})$近似表示$\funp{P}(\seq{t}|\seq{s})$，如下：
 \begin{eqnarray}
-\funp{P}(\seq{t}|\seq{s})  \equiv  \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')}
-\label{eq:5-6}
+\funp{P}(\seq{t}|\seq{s}) & \equiv & \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')}
+\label{eq:5-7}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-6}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。
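+\parinterval Equation \eqref{eq:5-7} normalizes the function $g(\cdot)$, so the right-hand side has some properties of a probability, for example $0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')} \le 1$. Concretely, for a source sentence $\seq{s}$, all of its translations are enumerated and their $g(\cdot)$ values are summed to form the denominator, while the numerator is the value of $g(\cdot)$ for a particular translation $\seq{t}$.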

 \parinterval 上述过程初步建立了句子级翻译模型，并没有直接求$\funp{P}(\seq{t}|\seq{s})$，而是把问题转化为对$g(\cdot)$的设计和计算上。但是，面临着两个新的问题：

@@ -338,13 +338,13 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \vspace{0.5em}
 \item How should the function $g(\seq{s},\seq{t})$ be defined? That is, given the word translation probabilities, how is $g(\seq{s},\seq{t})$ computed;
 \vspace{0.5em}
-\item 公式\eqref{eq:5-6}中分母$\sum_{seq{t'}}g(\seq{s},{\seq{t}'})$需要累加所有翻译结果的$g(\seq{s},{\seq{t}'})$，但枚举所有${\seq{t}'}$是不现实的。
+\item The denominator $\sum_{\seq{t}'}g(\seq{s},\seq{t}')$ in Equation \eqref{eq:5-7} sums $g(\seq{s},\seq{t}')$ over all translations, but enumerating every $\seq{t}'$ is infeasible.
 \vspace{0.5em}
 \end{itemize}

 \parinterval  当然，这里最核心的问题还是函数$g(\seq{s},\seq{t})$的定义。而第二个问题其实不需要解决，因为机器翻译只关注于可能性最大的翻译结果，即$g(\seq{s},\seq{t})$的计算结果最大时对应的译文。这个问题会在后面进行讨论。

-\parinterval 回到设计$g(\seq{s},\seq{t})$的问题上。这里，采用“大题小作”的方法，这个技巧在{\chaptertwo}已经进行了充分的介绍。具体来说，直接建模句子之间的对应比较困难，但可以利用单词之间的对应来描述句子之间的对应关系。这就用到了上一小节所介绍的单词翻译概率。
+\parinterval 回到设计$g(\seq{s},\seq{t})$的问题上。这里，采用“大题小作”的方法，这个技巧在{\chaptertwo}已经进行了充分的介绍。具体来说，直接建模句子之间的对应比较困难，但可以利用单词之间的对应来描述句子之间的对应关系。这就用到了\ref{chapter5.2.3}小节所介绍的单词翻译概率。

 \parinterval 首先引入一个非常重要的概念\ \dash \ {\small\sffamily\bfseries{词对齐}}\index{词对齐}（Word Alignment）\index{Word Alignment}，它是统计机器翻译中最核心的概念之一。词对齐描述了平行句对中单词之间的对应关系，它体现了一种观点：本质上句子之间的对应是由单词之间的对应表示的。当然，这个观点在神经机器翻译或者其他模型中可能会有不同的理解，但是翻译句子的过程中考虑词级的对应关系是符合人类对语言的认知的。

@@ -362,15 +362,15 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?

 \parinterval 对于句对$(\seq{s},\seq{t})$，假设可以得到最优词对齐$\widehat{A}$，于是可以使用单词翻译概率计算$g(\seq{s},\seq{t})$，如下
 \begin{eqnarray}
-g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
-\label{eq:5-7}
+g(\seq{s},\seq{t}) &= &\prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
+\label{eq:5-8}
 \end{eqnarray}

 \noindent 其中$g(\seq{s},\seq{t})$被定义为句子$\seq{s}$中的单词和句子$\seq{t}$中的单词的翻译概率的乘积，并且这两个单词之间必须有词对齐连接。$\funp{P}(s_j,t_i)$表示具有词对齐连接的源语言单词$s_j$和目标语言单词$t_i$的单词翻译概率。以图\ref{fig:5-7}中的句对为例，其中“我”与“I”、“对”与“with”、“你” 与“you”等相互对应，可以把它们的翻译概率相乘得到$g(\seq{s},\seq{t})$的计算结果，如下：
 \begin{eqnarray}
 {g(\seq{s},\seq{t})}&= &  \funp{P}(\textrm{我,I}) \times \funp{P}(\textrm{对,with}) \times \funp{P}(\textrm{你,you}) \times \nonumber \\
          &    & \funp{P}(\textrm{感到, am}) \times \funp{P}(\textrm{满意,satisfied})
-\label{eq:5-8}
+\label{eq:5-9}
 \end{eqnarray}

 \parinterval  显然，如果每个词对齐连接所对应的翻译概率变大，那么整个句子翻译的得分也会提高。也就是说，词对齐越准确，翻译模型的打分越高，$\seq{s}$和$\seq{t}$之间存在翻译关系的可能性越大。
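As a small sketch of the product over alignment links, the snippet below computes $g(\seq{s},\seq{t})$ for the sentence pair in the example. The function name `g_score` and every probability value are hypothetical and chosen only for illustration; they are not taken from the book's figures.

```python
import math

def g_score(src, tgt, alignment, word_prob):
    """Product of word translation probabilities over the alignment links.

    `alignment` holds 1-based (j, i) pairs: source position j links to target position i.
    """
    return math.prod(word_prob[(src[j - 1], tgt[i - 1])] for j, i in alignment)

src = ["我", "对", "你", "感到", "满意"]
tgt = ["I", "am", "satisfied", "with", "you"]
# Links: 我-I, 对-with, 你-you, 感到-am, 满意-satisfied
alignment = [(1, 1), (2, 4), (3, 5), (4, 2), (5, 3)]

# Hypothetical word translation probabilities (assumed values).
word_prob = {
    ("我", "I"): 0.5,
    ("对", "with"): 0.4,
    ("你", "you"): 0.5,
    ("感到", "am"): 0.2,
    ("满意", "satisfied"): 0.5,
}

score = g_score(src, tgt, alignment, word_prob)
print(score)
```

Raising any single link probability raises the product, which is the behaviour the paragraph above describes.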
@@ -381,7 +381,7 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)

 \subsubsection{2. 生成流畅的译文}

-\parinterval 公式\eqref{eq:5-7}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子“我 对 你 感到 满意”有两个翻译结果，第一个翻译结果是“I am satisfied with you”，第二个是“I with you am satisfied”。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了“satisfied”作为源语单词“满意”的译文，但是在第一个翻译结果中“satisfied”处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\eqref{eq:5-7}计算得到的函数$g(\cdot)$的值却是一样的。
+\parinterval 公式\eqref{eq:5-8}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子“我 对 你 感到 满意”有两个翻译结果，第一个翻译结果是“I am satisfied with you”，第二个是“I with you am satisfied”。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了“satisfied”作为源语单词“满意”的译文，但是在第一个翻译结果中“satisfied”处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\eqref{eq:5-8}计算得到的函数$g(\cdot)$的值却是一样的。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -398,18 +398,18 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
 \begin{eqnarray}
 \funp{P}_{\textrm{lm}}(\seq{t}) & = & \funp{P}_{\textrm{lm}}(t_1...t_l) \nonumber \\
                                           & =  & \funp{P}(t_1)\times \funp{P}(t_2|t_1)\times \funp{P}(t_3|t_2)\times ... \times \funp{P}(t_l|t_{l-1})
-\label{eq:5-9}
+\label{eq:5-10}
 \end{eqnarray}

 \noindent  其中，$\seq{t}=t_1...t_l$表示由$l$个单词组成的句子，$\funp{P}_{\textrm{lm}}(\seq{t})$表示语言模型给句子$\seq{t}$的打分。具体而言，$\funp{P}_{\textrm{lm}}(\seq{t})$被定义为$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$的连乘\footnote{为了确保数学表达的准确性，本书中定义$\funp{P}(t_1|t_0) \equiv \funp{P}(t_1)$}，其中$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$表示前面一个单词为$t_{i-1}$时，当前单词为$t_i$的概率。语言模型的训练方法可以参看{\chaptertwo}相关内容。

-\parinterval 回到建模问题上来。既然语言模型可以帮助系统度量每个译文的流畅度，那么可以使用它对翻译进行打分。一种简单的方法是把语言模型$\funp{P}_{\textrm{lm}}{(\seq{t})}$ 和公式\eqref{eq:5-7}中的$g(\seq{s},\seq{t})$相乘，这样就得到了一个新的$g(\seq{s},\seq{t})$，它同时考虑了翻译准确性（$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$）和流畅度（$\funp{P}_{\textrm{lm}}(\seq{t})$）:
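+\parinterval Returning to the modeling problem: since a language model can measure the fluency of each translation, it can be used to score translations. A simple approach is to multiply the language model $\funp{P}_{\textrm{lm}}(\seq{t})$ by the $g(\seq{s},\seq{t})$ of Equation \eqref{eq:5-8}. This yields a new $g(\seq{s},\seq{t})$ that accounts for both translation adequacy ($\prod_{(j,i) \in \widehat{A}}{\funp{P}(s_j,t_i)}$) and fluency ($\funp{P}_{\textrm{lm}}(\seq{t})$):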
 \begin{eqnarray}
-g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times  \funp{P}_{\textrm{lm}}(\seq{t})
-\label{eq:5-10}
+g(\seq{s},\seq{t}) & \equiv & \prod_{(j,i) \in \widehat{A}}{\funp{P}(s_j,t_i)} \times  \funp{P}_{\textrm{lm}}(\seq{t})
+\label{eq:5-11}
 \end{eqnarray}

-\parinterval 如图\ref{fig:5-9}所示，语言模型$\funp{P}_{\textrm{lm}}(\seq{t})$分别给$\seq{t}^{'}$和$\seq{t}^{”}$赋予0.0107和0.0009的概率，这表明句子$\seq{t}^{'}$更符合英文的表达，这与期望是相吻合的。它们再分别乘以$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j},t_i)$的值，就得到公式\eqref{eq:5-10}定义的函数$g(\cdot)$的值。显然句子$\seq{t}^{'}$的分数更高。至此，完成了对函数$g(\seq{s},\seq{t})$的一个简单定义，把它带入公式\eqref{eq:5-6}就得到了同时考虑准确性和流畅性的句子级统计翻译模型。
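+\parinterval As shown in Figure \ref{fig:5-9}, the language model $\funp{P}_{\textrm{lm}}(\seq{t})$ assigns probabilities of 0.0107 and 0.0009 to $\seq{t}^{'}$ and $\seq{t}^{''}$, respectively, which indicates that $\seq{t}^{'}$ is the more natural English sentence, as expected. Multiplying each by its value of $\prod_{(j,i) \in \widehat{A}}{\funp{P}(s_j,t_i)}$ gives the value of the function $g(\cdot)$ defined in Equation \eqref{eq:5-11}. Clearly $\seq{t}^{'}$ receives the higher score. This completes a simple definition of $g(\seq{s},\seq{t})$; substituting it into Equation \eqref{eq:5-7} yields a sentence-level statistical translation model that accounts for both adequacy and fluency.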
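The effect of the language model term can be sketched as follows. The bigram table, the unigram probability, the floor value for unseen bigrams (a crude stand-in for smoothing), and the shared translation score are all hypothetical values assumed for this example; only the two candidate word orders come from the text.

```python
def bigram_lm(words, unigram_prob, bigram_prob, floor=1e-4):
    """P_lm(t) = P(t_1) * P(t_2 | t_1) * ... * P(t_l | t_{l-1}).

    Unseen bigrams receive `floor`, an assumed placeholder for real smoothing.
    """
    p = unigram_prob[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram_prob.get((prev, cur), floor)
    return p

# Hypothetical model parameters; keys of bigram_prob are (previous word, current word).
unigram_prob = {"I": 0.2}
bigram_prob = {
    ("I", "am"): 0.5,
    ("am", "satisfied"): 0.1,
    ("satisfied", "with"): 0.4,
    ("with", "you"): 0.3,
    ("I", "with"): 0.01,
    ("you", "am"): 0.01,
}

t1 = ["I", "am", "satisfied", "with", "you"]
t2 = ["I", "with", "you", "am", "satisfied"]

# Both candidates use the same word translations, so the product over
# alignment links is identical (assumed here to be 0.01).
translation_score = 0.01

lm1 = bigram_lm(t1, unigram_prob, bigram_prob)
lm2 = bigram_lm(t2, unigram_prob, bigram_prob)
g1 = translation_score * lm1
g2 = translation_score * lm2
print(g1, g2)
```

Because the translation term is shared, the ranking of the two candidates is decided entirely by the language model term.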

 %----------------------------------------------
 \begin{figure}[htp]
@@ -430,23 +430,23 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \parinterval 解码是指在得到翻译模型后，对于新输入的句子生成最佳译文的过程。具体来说，当给定任意的源语言句子$\seq{s}$，解码系统要找到翻译概率最大的目标语译文$\hat{\seq{t}}$。这个过程可以被形式化描述为：
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
-\label{eq:5-11}
+\widehat{\seq{t}}&=&\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
+\label{eq:5-12}
 \end{eqnarray}

-\noindent  其中$\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$表示找到使$\funp{P}(\seq{t}|\seq{s})$达到最大时的译文$\seq{t}$。结合上一小节中关于$\funp{P}(\seq{t}|\seq{s})$的定义，把公式\eqref{eq:5-6}带入公式\eqref{eq:5-11}得到：
+\noindent  where $\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$ denotes the translation $\seq{t}$ that maximizes $\funp{P}(\seq{t}|\seq{s})$. Using the definition of $\funp{P}(\seq{t}|\seq{s})$ from Section \ref{sec:sentence-level-translation}, substituting Equation \eqref{eq:5-7} into Equation \eqref{eq:5-12} gives:
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}}\frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}
-\label{eq:5-12}
+\widehat{\seq{t}}&=&\argmax_{\seq{t}}\frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}^{'}}g(\seq{s},\seq{t}^{'})}
+\label{eq:5-13}
 \end{eqnarray}

-\parinterval 在公式\eqref{eq:5-12}中，可以发现${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$是一个关于$\seq{s}$的函数，当给定源语句$\seq{s}$时，它是一个常数，而且$g(\cdot) \ge 0$，因此${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$不影响对$\widehat{\seq{t}}$的求解，也不需要计算。基于此，公式\eqref{eq:5-12}可以被化简为：
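+\parinterval In Equation \eqref{eq:5-13}, ${\sum_{\seq{t}^{'}}g(\seq{s},\seq{t}^{'})}$ is a function of $\seq{s}$ alone: once the source sentence $\seq{s}$ is given it is a constant, and because $g(\cdot) \ge 0$ this constant is non-negative (and positive whenever the fraction is defined). It therefore does not affect the solution for $\widehat{\seq{t}}$ and need not be computed. Equation \eqref{eq:5-13} can thus be simplified to: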
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}}g(\seq{s},\seq{t})
-\label{eq:5-13}
+\widehat{\seq{t}}&=&\argmax_{\seq{t}}g(\seq{s},\seq{t})
+\label{eq:5-14}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-13}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。
+\parinterval 公式\eqref{eq:5-14}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。

 \parinterval $n^{m}\cdot m!$是什么样的概念呢？如表\ref{tab:5-2}所示，当$m$和$n$分别为2和10时，译文只有200个，不算多。但是当$m$和$n$分别为20和10时，即源语言句子的长度20，每个词有10个候选译文，系统会面对$2.4329 \times 10^{38}$个不同的译文，这几乎是不可计算的。
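The two figures quoted above can be reproduced directly from the formula $n^{m}\cdot m!$. The helper name `search_space_size` is an assumption made for this sketch.

```python
import math

def search_space_size(m, n):
    """Lower bound n^m * m! on the number of translations: m source words,
    n candidates per word, and any ordering of the chosen target words."""
    return n ** m * math.factorial(m)

small = search_space_size(2, 10)
large = search_space_size(20, 10)
print(small)
print(f"{large:.4e}")
```

Python integers have arbitrary precision, so the large value is computed exactly before being formatted in scientific notation.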

@@ -479,18 +479,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\parinterval 图\ref{fig:5-10}给出了贪婪解码算法的伪代码。其中$\pi$保存所有源语单词的候选译文，$\pi[j]$表示第$j$个源语单词的翻译候选的集合，$best$保存当前最好的翻译结果，$h$保存当前步生成的所有译文候选。算法的主体有两层循环，在内层循环中如果第$j$个源语单词没有被翻译过，则用$best$和它的候选译文$\pi[j]$生成新的翻译，再存于$h$中，即操作$h=h\cup{\textrm{Join}(best,\pi[j])}$。外层循环再从$h$中选择得分最高的结果存于$best$中，即操作$best=\textrm{PruneForTop1}(h)$；同时标识相应的源语单词已翻译，即$used[best.j]=true$。
-
-%----------------------------------------------
-%\begin{figure}[htp]
-%    \centering
-%\subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-1}}
-%\subfigure{\input{./Chapter5/Figures/greedy-mt-decoding-process-3}}
-%\setlength{\belowcaptionskip}{14.0em}
-    %\caption{贪婪的机器翻译解码过程实例}
-    %\label{fig:5-11}
-%\end{figure}
-%----------------------------------------------
+\parinterval 图\ref{fig:5-10}给出了贪婪解码算法的伪代码。其中$\pi$保存所有源语单词的候选译文，$\pi[j]$表示第$j$个源语单词的翻译候选的集合，$best$保存当前最好的翻译结果，$h$保存当前步生成的所有译文候选。算法的主体有两层循环，在内层循环中如果第$j$个源语单词没有被翻译过，则用$best$和它的候选译文$\pi[j]$生成新的翻译，再存于$h$中，即操作$h=h\cup{\textrm{Join}(best,\pi[j])}$。外层循环再从$h$中选择得分最高的结果存于$best$中，即操作$best=\textrm{PruneForTop1}(h)$；同时标记相应的源语言单词状态为已翻译，即$used[best.j]=true$。
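The pseudocode described above might be sketched in Python as follows. This is an illustrative reading of the description, not the book's code: the `Hyp` record, the `score_fn` callback, and the toy candidate and bigram tables are assumptions, while `pi`, `best`, `h`, `used`, `join`, and the top-1 pruning step mirror the names in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyp:
    words: tuple   # target words generated so far
    score: float   # model score of the partial translation
    j: int         # index of the source word translated last (-1 for the empty hypothesis)

def join(best, candidates, j, score_fn):
    """Join(best, pi[j]): extend `best` with every candidate translation of source word j."""
    return [Hyp(best.words + (w,), score_fn(best, w, p), j) for w, p in candidates]

def greedy_decode(pi, score_fn):
    m = len(pi)
    used = [False] * m
    best = Hyp((), 1.0, -1)
    for _ in range(m):                        # outer loop: one target word per step
        h = []
        for j in range(m):                    # inner loop: every untranslated source word
            if not used[j]:
                h.extend(join(best, pi[j], j, score_fn))
        best = max(h, key=lambda x: x.score)  # PruneForTop1(h)
        used[best.j] = True                   # mark that source word as translated
    return best

# Hypothetical candidates (word, word translation probability) and bigram table.
pi = [
    [("I", 0.6), ("me", 0.4)],   # candidates for "我"
    [("like", 0.8)],             # candidates for "喜欢"
    [("you", 0.9)],              # candidates for "你"
]
bigram = {("<s>", "I"): 0.5, ("I", "like"): 0.4, ("like", "you"): 0.5}

def toy_score(prev, word, word_prob):
    """Assumed scoring: previous score x word probability x bigram probability."""
    last = prev.words[-1] if prev.words else "<s>"
    return prev.score * word_prob * bigram.get((last, word), 0.01)

result = greedy_decode(pi, toy_score)
print(result.words, result.score)
```

Each outer iteration keeps only one hypothesis, so the search is fast but can miss the globally best translation.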

 %----------------------------------------------
 \begin{figure}[htp]
@@ -542,22 +531,22 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 举个例子，对于汉译英的翻译任务，英语句子$\seq{t}$可以被看作是汉语句子$\seq{s}$加入噪声通过信道后得到的结果。换句话说，汉语句子经过噪声-信道传输时发生了变化，在信道的输出端呈现为英语句子。于是需要根据观察到的汉语特征，通过概率$\funp{P}(\seq{t}|\seq{s})$猜测最为可能的英语句子。这个找到最可能的目标语句（信源）的过程也被称为
 {\small\sffamily\bfseries{解码}}（Decoding）。直到今天，解码这个概念也被广泛地使用在机器翻译及相关任务中。这个过程也可以表述为：给定输入$\seq{s}$，找到最可能的输出$\seq{t}$，使得$\funp{P}(\seq{t}|\seq{s})$达到最大：
 \begin{eqnarray}
-\widehat{\seq{t}}=\argmax_{\seq{t}}\funp{P}(\seq{t}|\seq{s})
-\label{eq:5-14}
+\widehat{\seq{t}}&=&\argmax_{\seq{t}}\funp{P}(\seq{t}|\seq{s})
+\label{eq:5-15}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-14}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
+\parinterval 公式\eqref{eq:5-15}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
 \begin{eqnarray}
 \funp{P}(\seq{t}|\seq{s}) & = &\frac{\funp{P}(\seq{s},\seq{t})}{\funp{P}(\seq{s})} \nonumber \\
                       & = & \frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})}
-\label{eq:5-15}
+\label{eq:5-16}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-15}把$\seq{s}$到$\seq{t}$的翻译概率转化为$\frac{\funp{P}(\seq{s}|\seq{t})\textrm{P(t)}}{\funp{P}(\seq{s})}$，它包括三个部分：
+\parinterval Equation \eqref{eq:5-16} rewrites the translation probability from $\seq{s}$ to $\seq{t}$ as $\frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})}$, which consists of three parts:

 \begin{itemize}
 \vspace{0.5em}
-\item 第一部分是由译文$\seq{t}$到源语言句子$\seq{s}$的翻译概率$\funp{P}(\seq{s}|\seq{t})$，也被称为翻译模型。它表示给定目标语句$\seq{t}$生成源语句$\seq{s}$的概率。需要注意是翻译的方向已经从$\funp{P}(\seq{t}|\seq{s})$转向了$\funp{P}(\seq{s}|\seq{t})$，但无须刻意地区分，可以简单地理解为翻译模型刻画了$\seq{s}$和$\seq{t}$的翻译对应程度；
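+\item The first part is the translation probability $\funp{P}(\seq{s}|\seq{t})$ from the translation $\seq{t}$ to the source sentence $\seq{s}$, also called the translation model. It is the probability of generating the source sentence $\seq{s}$ given the target sentence $\seq{t}$. Note that the translation direction has changed from $\funp{P}(\seq{t}|\seq{s})$ to $\funp{P}(\seq{s}|\seq{t})$, but there is no need to dwell on the distinction: the translation model can simply be understood as describing how well $\seq{s}$ and $\seq{t}$ correspond as translations;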
 \vspace{0.5em}
 \item The second part is $\funp{P}(\seq{t})$, also called the language model. It represents how likely the target-language sentence $\seq{t}$ is to occur;
 \vspace{0.5em}
@@ -570,14 +559,14 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \widehat{\seq{t}} & = & \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s}) \nonumber \\
          & = & \argmax_{\seq{t}} \frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})} \nonumber \\
          & = & \argmax_{\seq{t}} \funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})
-\label{eq:5-16}
+\label{eq:5-17}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-16}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。一个很自然的问题是：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。
+\parinterval 公式\eqref{eq:5-17}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。仔细观察公式\eqref{eq:5-17}的推导过程，我们很容易发现一个问题：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。

-\parinterval 可以回忆一下\ref{sec:sentence-level-translation}节中讨论的问题，如果只使用翻译模型可能会造成一个局面：译文的单词都和源语言单词对应的很好，但是由于语序的问题，读起来却不像人说的话。从这个角度说，引入语言模型是十分必要的。这个问题在Brown等人的论文中也有讨论\upcite{DBLP:journals/coling/BrownPPM94}，他们提到单纯使用$\funp{P}(\seq{s}|\seq{t})$会把概率分配给一些翻译对应比较好但是不合法的目标语句子，而且这部分概率可能会很大，影响模型的决策。这也正体现了IBM模型的创新之处，作者用数学技巧把$\funp{P}(\seq{t})$引入进来，保证了系统的输出是通顺的译文。语言模型也被广泛使用在语音识别等领域以保证结果的流畅性，甚至应用的历史比机器翻译要长得多，这里的方法也有借鉴相关工作的味道。
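+\parinterval Recall the problem discussed in Section \ref{sec:sentence-level-translation}: using the translation model alone can produce translations whose words all correspond well to the source words but which, because of word order, do not read like natural language. From this perspective, introducing a language model is essential. Brown et al. discuss the same issue\upcite{DBLP:journals/coling/BrownPPM94}: using $\funp{P}(\seq{s}|\seq{t})$ alone assigns probability to target sentences that correspond well as translations but are disfluent or even illogical, and this probability mass can be large enough to affect the model's decisions. This is where the innovation of the IBM models lies: the authors use a mathematical device to bring in $\funp{P}(\seq{t})$, which keeps the system's output fluent. Language models are also widely used in areas such as speech recognition to ensure fluent results, with a history of use considerably longer than in machine translation, so the approach here also borrows from that related work.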

-实际上，在机器翻译中引入语言模型是一个很深刻的概念。在IBM模型之后相当长的时间里，语言模型一直是机器翻译各个部件中最重要的部分。对译文连贯性的建模也是所有系统中需要包含的内容（即使隐形体现）。
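+In fact, introducing a language model into machine translation is a very important idea. For a long time after the IBM models, the language model remained among the most important components of a machine translation system. Modeling the coherence of the translation is also something every system needs to include, even if only implicitly.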
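The decision rule $\argmax_{\seq{t}} \funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})$ can be sketched over a small candidate list. The language model values 0.0107 and 0.0009 are the ones quoted earlier for the two candidate translations; the translation model values and the value of $\funp{P}(\seq{s})$ are hypothetical, and the function name is an assumption made for this example.

```python
def noisy_channel_decode(candidates, p_source=1.0):
    """Return the candidate t maximizing P(s|t) * P(t) / P(s).

    `candidates` maps a translation to {"tm": P(s|t), "lm": P(t)}.
    `p_source` plays the role of P(s); it is constant across candidates.
    """
    return max(
        candidates,
        key=lambda t: candidates[t]["tm"] * candidates[t]["lm"] / p_source,
    )

candidates = {
    # "tm" values are assumed; "lm" values are the ones quoted in the text.
    "I am satisfied with you": {"tm": 0.01, "lm": 0.0107},
    "I with you am satisfied": {"tm": 0.01, "lm": 0.0009},
}

best_without_ps = noisy_channel_decode(candidates)
best_with_ps = noisy_channel_decode(candidates, p_source=0.37)  # arbitrary P(s)
print(best_without_ps, best_with_ps)
```

Dividing every score by the same positive $\funp{P}(\seq{s})$ leaves the ranking unchanged, which is why that term can be dropped from the $\argmax$.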

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -585,7 +574,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \section{统计机器翻译的三个基本问题}

-\parinterval 公式\eqref{eq:5-16}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：
+\parinterval 公式\eqref{eq:5-17}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：

 \begin{itemize}
 \vspace{0.5em}
@@ -597,13 +586,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \vspace{0.5em}
 \end{itemize}

-\parinterval 为了理解以上的问题，可以先回忆一下\ref{sec:sentence-level-translation}小节中的公式\eqref{eq:5-10}，即$g(\seq{s},\seq{t})$函数的定义，它用于评估一个译文的好与坏。如图\ref{fig:5-14}所示，$g(\seq{s},\seq{t})$函数与公式\eqref{eq:5-16}的建模方式非常一致，即$g(\seq{s},\seq{t})$函数中红色部分描述译文$\seq{t}$的可能性大小，对应翻译模型$\funp{P}(\seq{s}|\seq{t})$；蓝色部分描述译文的平滑或流畅程度，对应语言模型$\funp{P}(\seq{t})$。尽管这种对应并不十分严格的，但也可以看出在处理机器翻译问题上，很多想法的本质是一样的。
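+\parinterval To understand these problems, first recall Equation \eqref{eq:5-11} in Section \ref{sec:sentence-level-translation}, the definition of the function $g(\seq{s},\seq{t})$, which evaluates how good a translation is. As Figure \ref{fig:5-14} shows, $g(\seq{s},\seq{t})$ closely matches the modeling in Equation \eqref{eq:5-17}: the red part of $g(\seq{s},\seq{t})$ describes how likely the translation $\seq{t}$ is and corresponds to the translation model $\funp{P}(\seq{s}|\seq{t})$, while the blue part describes the fluency of the translation and corresponds to the language model $\funp{P}(\seq{t})$. Although the correspondence is not strict, it shows that many ideas for handling machine translation are essentially the same.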

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-correspondence-between-ibm-model&formula-1.13}
-    \caption{IBM模型与公式\eqref{eq:5-10}的对应关系}
+    \caption{IBM模型与公式\eqref{eq:5-11}的对应关系}
    \label{fig:5-14}
 \end{figure}
 %----------------------------------------------
@@ -656,13 +645,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \parinterval 直接准确估计$\funp{P}(\seq{s}|\seq{t})$很难，训练数据只能覆盖整个样本空间非常小的一部分，绝大多数句子在训练数据中一次也没出现过。为了解决这个问题，IBM模型假设：句子之间的对应可以由单词之间的对应进行表示。于是，翻译句子的概率可以被转化为词对齐生成的概率：

 \begin{eqnarray}
-\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})
-\label{eq:5-17}
+\funp{P}(\seq{s}|\seq{t})&=& \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})
+\label{eq:5-18}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-17}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。
+\parinterval 公式\eqref{eq:5-18}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。

-\parinterval 举个例子说明公式\eqref{eq:5-17}的实际意义。如图\ref{fig:5-17}所示，可以把从“谢谢\ 你”到“thank you”的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。
+\parinterval 举个例子说明公式\eqref{eq:5-18}的实际意义。如图\ref{fig:5-17}所示，可以把从“谢谢\ 你”到“thank you”的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。
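
The count of $3\times3=9$ alignments can be checked by direct enumeration. The following is a minimal Python sketch, assuming alignments are represented as tuples $(a_1,...,a_m)$ with $a_j=0$ standing for the empty token $t_0$; the function name `enumerate_alignments` is illustrative and does not come from the text.

```python
from itertools import product

def enumerate_alignments(m, l):
    """Return every alignment (a_1, ..., a_m) with each a_j in {0, ..., l}.

    Position 0 stands for the empty token t_0, so each of the m source
    positions has l + 1 choices and there are (l + 1) ** m alignments.
    """
    return list(product(range(l + 1), repeat=m))

source = ["谢谢", "你"]      # m = 2 source words
target = ["thank", "you"]    # l = 2 target words; t_0 is added implicitly

alignments = enumerate_alignments(len(source), len(target))
for a in alignments:
    # Print each alignment in the "j-a_j" link notation used in the text.
    links = [f"{j}-{a_j}" for j, a_j in enumerate(a, start=1)]
    print("{" + ", ".join(links) + "}")
print(len(alignments), "alignments in total")
```

Because the number of alignments is $(l+1)^m$, this explicit enumeration is only practical for very short sentences.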

 %----------------------------------------------
 \begin{figure}[htp]
@@ -675,11 +664,11 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \parinterval 接下来的问题是如何定义$\funp{P}(\seq{s},\seq{a}|\seq{t})$\ \dash \ 即定义词对齐的生成概率。但是，隐含变量$\seq{a}$仍然很复杂，因此直接定义$\funp{P}(\seq{s},\seq{a}|\seq{t})$也很困难，在IBM模型中，为了化简问题，$\funp{P}(\seq{s},\seq{a}|\seq{t})$被进一步分解。使用链式法则，可以得到：
 \begin{eqnarray}
-\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|\seq{a}{}_1^{j-1},\seq{s}{}_1^{j-1},m,\seq{t})\funp{P}(s_j|\seq{a}{}_1^{j},\seq{s}{}_1^{j-1},m,\seq{t})}
-\label{eq:5-18}
+\funp{P}(\seq{s},\seq{a}|\seq{t})&=&\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|\seq{a}{}_1^{j-1},\seq{s}{}_1^{j-1},m,\seq{t})\funp{P}(s_j|\seq{a}{}_1^{j},\seq{s}{}_1^{j-1},m,\seq{t})}
+\label{eq:5-19}
 \end{eqnarray}

-\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\eqref{eq:5-18}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：
+\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\eqref{eq:5-19}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：

 \begin{itemize}
 \vspace{0.5em}
@@ -694,7 +683,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{itemize}
 \parinterval 换句话说，当求$\funp{P}(\seq{s},\seq{a}|\seq{t})$时，首先根据译文$\seq{t}$确定源语言句子$\seq{s}$的长度$m$；当知道源语言句子有多少个单词后，循环$m$次，依次生成第1个到第$m$个源语言单词；当生成第$j$个源语言单词时，要先确定它是由哪个目标语译文单词生成的，即确定生成的源语言单词对应的译文单词的位置；当知道了目标语译文单词的位置，就能确定第$j$个位置的源语言单词。

-\parinterval 需要注意的是公式\eqref{eq:5-18}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。
+\parinterval 需要注意的是公式\eqref{eq:5-19}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -702,7 +691,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \subsection{基于词对齐的翻译实例}

-\parinterval 用前面图\ref{fig:5-16}中例子来对公式\eqref{eq:5-18}进行说明。例子中，源语言句子“在\ \ 桌子\ \ 上”目标语译文“on the table”之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\eqref{eq:5-18}的计算过程如下：
+\parinterval 用前面图\ref{fig:5-16}中的例子来对公式\eqref{eq:5-19}进行说明。例子中，源语言句子“在\ \ 桌子\ \ 上”与目标语译文“on the table”之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。公式\eqref{eq:5-19}的计算过程如下：

 \begin{itemize}
 \vspace{0.5em}
@@ -724,7 +713,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 &&{\funp{P}(s_2=\textrm{桌子} \mid \textrm{\{1-0, 2-3\}},\textrm{在},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
 &&{\funp{P}(a_3=1 \mid \textrm{\{1-0, 2-3\}},\textrm{在\ \ 桌子},3,\textrm{$t_0$ on the table}) {\times}} \nonumber \\
 &&{\funp{P}(s_3=\textrm{上} \mid \textrm{\{1-0, 2-3, 3-1\}},\textrm{在\ \ 桌子},3,\textrm{$t_0$ on the table})  }
-\label{eq:5-19}
+\label{eq:5-20}
 \end{eqnarray}

 %----------------------------------------------------------------------------------------
@@ -732,14 +721,14 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 %----------------------------------------------------------------------------------------

 \sectionnewpage
-\section{IBM模型1}
-\parinterval 公式\eqref{eq:5-17}和公式\eqref{eq:5-18}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：
+\section{IBM模型1}\label{IBM-model1}
+\parinterval 公式\eqref{eq:5-18}和公式\eqref{eq:5-19}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：

 \begin{itemize}
 \vspace{0.3em}
-\item 首先，公式\eqref{eq:5-17}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度是呈指数增长，如何遍历所有的对齐$\seq{a}$？
+\item 首先，公式\eqref{eq:5-18}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度呈指数增长，如何遍历所有的对齐$\seq{a}$？
 \vspace{0.3em}
-\item 其次，公式\eqref{eq:5-18}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
+\item 其次，公式\eqref{eq:5-19}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
 \vspace{0.3em}
 \end{itemize}

@@ -749,37 +738,37 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------
 \vspace{-0.5em}
-\subsection{IBM模型1}
-\parinterval IBM模型1对公式\eqref{eq:5-18}中的三项进行了简化。具体方法如下：
+\subsection{IBM模型1的建模}
+\parinterval IBM模型1对公式\eqref{eq:5-19}中的三项进行了简化。具体方法如下：

 \begin{itemize}
 \item 假设$\funp{P}(m|\seq{t})$为常数$\varepsilon$，即源语言句子长度的生成概率服从均匀分布，如下：
 \begin{eqnarray}
-\funp{P}(m|\seq{t})\; \equiv \; \varepsilon
-\label{eq:5-20}
+\funp{P}(m|\seq{t})& \equiv & \varepsilon
+\label{eq:5-21}
 \end{eqnarray}
-\item 对齐概率$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$仅依赖于译文长度$l$，即每个词对齐连接的生成概率也服从均匀分布。换句话说，对于任何源语言位置$j$对齐到目标语言任何位置都是等概率的。比如译文为“on the table”，再加上$t_0$共4个位置，相应的，任意源语单词对齐到这4个位置的概率是一样的。具体描述如下：
+\item 对齐概率$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$仅依赖于译文长度$l$，即每个词对齐连接的生成概率也服从均匀分布。换句话说，对于任意源语言位置$j$对齐到目标语言任意位置都是等概率的。比如译文为“on the table”，再加上$t_0$共4个位置，相应的，任意源语单词对齐到这4个位置的概率是一样的。具体描述如下：
 \begin{eqnarray}
-\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t}) \equiv \frac{1}{l+1}
-\label{eq:5-21}
+\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})& \equiv &  \frac{1}{l+1}
+\label{eq:5-22}
 \end{eqnarray}

 \item 源语单词$s_j$的生成概率$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$仅依赖与其对齐的译文单词$t_{a_j}$，即词汇翻译概率$f(s_j|t_{a_j})$。此时词汇翻译概率满足$\sum_{s_j}{f(s_j|t_{a_j})}=1$。比如在图\ref{fig:5-18}表示的例子中，源语单词“上”出现的概率只和与它对齐的单词“on”有关系，与其他单词没有关系。
 \begin{eqnarray}
-\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t}) \equiv f(s_j|t_{a_j})
-\label{eq:5-22}
+\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})& \equiv &  f(s_j|t_{a_j})
+\label{eq:5-23}
 \end{eqnarray}

-用一个简单的例子对公式\eqref{eq:5-22}进行说明。比如，在图\ref{fig:5-18}中，“桌子”对齐到“table”，可被描述为$f(s_2 |t_{a_2})=f(\textrm{“桌子”}|\textrm{“table”})$，表示给定“table”翻译为“桌子”的概率。通常，$f(s_2 |t_{a_2})$被认为是一种概率词典，它反应了两种语言词汇一级的对应关系。
+用一个简单的例子对公式\eqref{eq:5-23}进行说明。比如，在图\ref{fig:5-18}中，“桌子”对齐到“table”，可被描述为$f(s_2 |t_{a_2})=f(\textrm{“桌子”}|\textrm{“table”})$，表示给定“table”翻译为“桌子”的概率。通常，$f(s_2 |t_{a_2})$被认为是一种概率词典，它反映了两种语言词汇一级的对应关系。
 \end{itemize}

-\parinterval 将上述三个假设和公式\eqref{eq:5-18}代入公式\eqref{eq:5-17}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
+\parinterval 将上述三个假设和公式\eqref{eq:5-19}代入公式\eqref{eq:5-18}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
 \begin{eqnarray}
 \funp{P}(\seq{s}|\seq{t}) & = &  \sum_{\seq{a}}{\funp{P}(\seq{s},\seq{a}|\seq{t})} \nonumber \\
                        & = &  \sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j |a_1^j,s_1^{j-1},m,\seq{t})} \nonumber \\
                        & = &  \sum_{\seq{a}}{\varepsilon}\prod_{j=1}^{m}{\frac{1}{l+1}f(s_j|t_{a_j})} \nonumber \\
                        & = & \sum_{\seq{a}}{\frac{\varepsilon}{(l+1)^m}}\prod_{j=1}^{m}f(s_j|t_{a_j})
-\label{eq:5-23}
+\label{eq:5-24}
 \end{eqnarray}

 %----------------------------------------------
@@ -791,19 +780,19 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\parinterval 在公式\eqref{eq:5-23}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
+\parinterval 在公式\eqref{eq:5-24}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
 \begin{eqnarray}
-\funp{P}(\seq{s}|\seq{t})={\sum_{a_1=0}^{l}\cdots}{\sum_{a_m=0}^{l}\frac{\varepsilon}{(l+1)^m}}{\prod_{j=1}^{m}f(s_j|t_{a_j})}
-\label{eq:5-24}
+\funp{P}(\seq{s}|\seq{t})&=&{\sum_{a_1=0}^{l}\cdots}{\sum_{a_m=0}^{l}\frac{\varepsilon}{(l+1)^m}}{\prod_{j=1}^{m}f(s_j|t_{a_j})}
+\label{eq:5-25}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-24}分为两个主要部分。第一部分：遍历所有的对齐$\seq{a}$。其中$\seq{a}$由$\{a_1,...,a_m\}$\\ 组成，每个$a_j\in \{a_1,...,a_m\}$从译文的开始位置$(0)$循环到截止位置$(l)$。如图\ref{fig:5-19}表示的例子，描述的是源语单词$s_3$从译文的开始$t_0$遍历到结尾$t_3$，即$a_3$的取值范围。第二部分: 对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},a| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$。
+\parinterval 公式\eqref{eq:5-25}分为两个主要部分。第一部分：遍历所有的对齐$\seq{a}$。其中$\seq{a}$由$\{a_1,...,a_m\}$\\ 组成，每个$a_j\in \{a_1,...,a_m\}$从译文的开始位置$(0)$循环到截止位置$(l)$。如图\ref{fig:5-19}表示的例子，描述的是源语单词$s_3$从译文的开始$t_0$遍历到结尾$t_3$，即$a_3$的取值范围。第二部分：对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},\seq{a}| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$。

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-formula-3.25-part-1-example}
-    \caption{公式{\eqref{eq:5-24}}第一部分实例}
+    \caption{公式{\eqref{eq:5-25}}第一部分实例}
    \label{fig:5-19}
 \end{figure}
 %----------------------------------------------
@@ -816,36 +805,36 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \subsection{解码及计算优化}\label{decoding&computational-optimization}

-\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\eqref{eq:5-24}计算每个译文候选的IBM模型翻译概率。但是，公式\eqref{eq:5-24}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\eqref{eq:5-24}重写为：
+\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\eqref{eq:5-25}计算每个译文候选的IBM模型翻译概率。但是，公式\eqref{eq:5-25}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\eqref{eq:5-25}重写为：
 \begin{eqnarray}
-\funp{P}(\seq{s}| \seq{t}) = \frac{\varepsilon}{(l+1)^{m}} \underbrace{\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l}}_{(l+1)^m\textrm{次循环}} \underbrace{\prod\limits_{j=1}^{m} f(s_j|t_{a_j})}_{m\textrm{次循环}}
-\label{eq:5-27}
+\funp{P}(\seq{s}| \seq{t}) &=& \frac{\varepsilon}{(l+1)^{m}} \underbrace{\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l}}_{(l+1)^m\textrm{次循环}} \underbrace{\prod\limits_{j=1}^{m} f(s_j|t_{a_j})}_{m\textrm{次循环}}
+\label{eq:5-26}
 \end{eqnarray}

 \noindent 可以看到，遍历所有的词对齐需要$(l+1)^m$次循环，遍历所有源语言位置累计$f(s_j|t_{a_j})$需要$m$次循环，因此这个模型的计算复杂度为$O((l+1)^m m)$。当$m$较大时，计算这样的模型几乎是不可能的。不过，经过仔细观察，可以发现公式右端的部分有另外一种计算方法，如下：
 \begin{eqnarray}
-\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) = \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
-\label{eq:5-28}
+\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) &=& \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
+\label{eq:5-27}
 \end{eqnarray}

-\noindent  公式\eqref{eq:5-28}的技巧在于把若干个乘积的加法（等式左手端）转化为若干加法结果的乘积（等式右手端），这样省去了多次循环，把$O((l+1)^m m)$的计算复杂度降为$O((l+1)m)$。此外，公式\eqref{eq:5-28}相比公式\eqref{eq:5-27}的另一个优点在于，公式\eqref{eq:5-28}中乘法的数量更少，因为现代计算机中乘法运算的代价要高于加法，因此公式\eqref{eq:5-28}的计算机实现效率更高。图\ref{fig:5-21} 对这个过程进行了进一步解释。
+\noindent  公式\eqref{eq:5-27}的技巧在于把若干个乘积的加法（等式左手端）转化为若干加法结果的乘积（等式右手端），这样省去了多次循环，把$O((l+1)^m m)$的计算复杂度降为$O((l+1)m)$。此外，公式\eqref{eq:5-27}相比公式\eqref{eq:5-26}的另一个优点在于，公式\eqref{eq:5-27}中乘法的数量更少，因为现代计算机中乘法运算的代价要高于加法，因此公式\eqref{eq:5-27}的计算机实现效率更高。图\ref{fig:5-21} 对这个过程进行了进一步解释。

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-example-of-formula3.29}
-   \caption{$\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) = \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$的实例}
+   \caption{$\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l} \prod\limits_{j=1}^{m} f(s_j|t_{a_j}) \; = \; \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)$的实例}
   \label{fig:5-21}
 \end{figure}
 %----------------------------------------------

-\parinterval 接着，利用公式\eqref{eq:5-28}的方式，可以把公式\eqref{eq:5-24}重写表示为：
+\parinterval 接着，利用公式\eqref{eq:5-27}的方式，可以把公式\eqref{eq:5-25}重新表示为：
 \begin{eqnarray}
-\textrm{IBM模型1：\ \ \ \ } \funp{P}(\seq{s}| \seq{t}) & = & \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \label{eq:5-64}
-\label{eq:5-29}
+\textrm{IBM模型1：\ \ \ \ } \funp{P}(\seq{s}| \seq{t}) & = & \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
+\label{eq:5-28}
 \end{eqnarray}

-公式\eqref{eq:5-64}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。
+公式\eqref{eq:5-28}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。
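
The sum-to-product identity and the final IBM Model 1 expression can be checked numerically on a small case. The following is a minimal Python sketch, assuming the translation table $f(\cdot|\cdot)$ is stored as a dictionary keyed by (source word, target word); the function names and the numbers in `toy` are hypothetical and chosen only for illustration (they are not trained or normalized values).

```python
from itertools import product
from math import isclose, prod

def sum_over_alignments(src, tgt, f):
    """Sum over all (l+1)^m alignments of prod_j f(s_j | t_{a_j})."""
    total = 0.0
    for a in product(range(len(tgt)), repeat=len(src)):
        total += prod(f[(s, tgt[i])] for s, i in zip(src, a))
    return total

def product_of_sums(src, tgt, f):
    """Product over source positions of sum_i f(s_j | t_i)."""
    return prod(sum(f[(s, t)] for t in tgt) for s in src)

def ibm1_probability(src, tgt, f, epsilon=1.0):
    """IBM Model 1: epsilon / (l+1)^m * prod_j sum_i f(s_j | t_i)."""
    m, l = len(src), len(tgt) - 1   # tgt[0] is the empty token t_0
    return epsilon / (l + 1) ** m * product_of_sums(src, tgt, f)

src = ["在", "桌子", "上"]
tgt = ["<t0>", "on", "the", "table"]
# Hypothetical lexical probabilities, one row per source word,
# columns ordered as in tgt.
toy = {
    "在":   [0.1, 0.5, 0.1, 0.1],
    "桌子": [0.1, 0.1, 0.1, 0.7],
    "上":   [0.2, 0.3, 0.1, 0.1],
}
f = {(s, t): p for s, row in toy.items() for t, p in zip(tgt, row)}

slow = sum_over_alignments(src, tgt, f)   # (l+1)^m terms
fast = product_of_sums(src, tgt, f)       # (l+1) * m terms
print(isclose(slow, fast))
print(ibm1_probability(src, tgt, f))
```

The brute-force version visits $(l+1)^m$ alignments, while the factorized version only needs $(l+1)\cdot m$ table lookups, which is why the latter is the form used in practice.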

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -874,18 +863,18 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \parinterval 在IBM模型中，优化的目标函数被定义为$\funp{P}(\seq{s}| \seq{t})$。也就是，对于给定的句对$(\seq{s},\seq{t})$，最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。 这里用符号$\funp{P}_{\theta}(\seq{s}|\seq{t})$表示模型由参数$\theta$决定，模型训练可以被描述为对目标函数$\funp{P}_{\theta}(\seq{s}|\seq{t})$的优化过程：
 \begin{eqnarray}
-\widehat{\theta}=\argmax_{\theta}\funp{P}_{\theta}(\seq{s}|\seq{t})
-\label{eq:5-30}
+\widehat{\theta}&=&\argmax_{\theta}\funp{P}_{\theta}(\seq{s}|\seq{t})
+\label{eq:5-29}
 \end{eqnarray}

 \noindent 其中，$\argmax_{\theta}$表示求最优参数的过程（或优化过程）。

-\parinterval 公式\eqref{eq:5-30}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。
+\parinterval 公式\eqref{eq:5-29}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。

-\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\eqref{eq:5-64} ，可以把这个目标表述为：
+\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\eqref{eq:5-28} ，可以把这个目标表述为：
 \begin{eqnarray}
 &                    & \textrm{max}\Big(\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}\Big) \nonumber \\
-& \textrm{s.t.} & \textrm{任意单词} t_{y}:\;\sum_{s_x}{f(s_x|t_y)}=1 \nonumber
+& \textrm{s.t.} & \textrm{任意单词} t_{y}:\;\sum_{s_x}{f(s_x|t_y)} = 1 \nonumber
 \end{eqnarray}
 \noindent 其中，$\textrm{max}(\cdot)$表示最大化，$\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}$是目标函数，$f({s_j|t_i})$是模型的参数，$\sum_{s_x}{f(s_x|t_y)}=1$是优化的约束条件，以保证翻译概率满足归一化的要求。需要注意的是$\{f(s_x |t_y)\}$对应了很多参数，每个源语言单词和每个目标语单词的组合都对应一个参数$f(s_x |t_y)$。
@@ -898,11 +887,11 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \parinterval 可以看到，IBM模型的参数训练问题本质上是带约束的目标函数优化问题。由于目标函数是可微分函数，解决这类问题的一种常用手法是把带约束的优化问题转化为不带约束的优化问题。这里用到了{\small\sffamily\bfseries{拉格朗日乘数法}}\index{拉格朗日乘数法}（Lagrange Multiplier Method）\index{The Lagrange Multiplier Method}，它的基本思想是把含有$n$个变量和$m$个约束条件的优化问题转化为含有$n+m$个变量的无约束优化问题。

-\parinterval 这里的目标是$\max(\funp{P}_{\theta}(\seq{s}|\seq{t}))$，约束条件是对于任意的目标语单词$t_y$有\\$\sum_{s_x}{\funp{P}(s_x|t_y)}=1$。根据拉格朗日乘数法，可以把上述优化问题重新定义最大化如下拉格朗日函数：
+\parinterval 这里的目标是$\max(\funp{P}_{\theta}(\seq{s}|\seq{t}))$，约束条件是对于任意的目标语单词$t_y$有\\$\sum_{s_x}{f(s_x|t_y)}=1$。根据拉格朗日乘数法，可以把上述优化问题重新定义为最大化如下拉格朗日函数的问题：
 \vspace{-0.5em}
 \begin{eqnarray}
-L(f,\lambda)=\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f(s_j|t_i)}-\sum_{t_y}{\lambda_{t_y}(\sum_{s_x}{f(s_x|t_y)}-1)}
-\label{eq:5-32}
+L(f,\lambda)&=&\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f(s_j|t_i)}-\sum_{t_y}{\lambda_{t_y}(\sum_{s_x}{f(s_x|t_y)}-1)}
+\label{eq:5-30}
 \end{eqnarray}

 \vspace{-0.3em}
@@ -922,29 +911,30 @@ L(f,\lambda)=\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f(s_j|t_i)
 \frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}& = & \frac{\partial \big[ \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \nonumber \\
                                                                     &     & \frac{\partial \big[ \sum_{t_y} \lambda_{t_y} (\sum_{s_x} f(s_x|t_y) -1) \big]}{\partial f(s_u|t_v)} \nonumber \\
                                                                     & =  & \frac{\varepsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v}
-\label{eq:5-33}
+\label{eq:5-31}
 \end{eqnarray}

 \noindent 这里$s_u$和$t_v$分别表示源语言和目标语言词表中的某一个单词。为了求$\frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}$，这里引入一个辅助函数。令$g(z)=\alpha z^{\beta}$ 为变量$z$ 的函数，显然，
 $\frac{\partial g(z)}{\partial z} = \alpha \beta z^{\beta-1} = \frac{\beta}{z}\alpha z^{\beta} = \frac{\beta}{z} g(z)$。这里可以把$\prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i)$看做$g(z)=\alpha z^{\beta}$的实例。首先，令$z=\sum_{i=0}^{l}f(s_u|t_i)$，注意$s_u$为给定的源语单词。然后，把$\beta$定义为$\sum_{i=0}^{l}f(s_u|t_i)$在$\prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i)$ 中出现的次数，即源语句子中与$s_u$相同的单词的个数。
 \begin{eqnarray}
-\beta=\sum_{j=1}^{m} \delta(s_j,s_u)
-\label{eq:5-34}
+\beta &=& \sum_{j=1}^{m} \delta(s_j,s_u)
+\label{eq:5-32}
 \end{eqnarray}

 \noindent 其中，当$x=y$时，$\delta(x,y)=1$，否则为0。

 \parinterval 根据$\frac{\partial g(z)}{\partial z} = \frac{\beta}{z} g(z)$，可以得到
 \begin{eqnarray}
-\frac{\partial g(z)}{\partial z} = \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]} = \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
-\label{eq:5-35}
+\frac{\partial g(z)}{\partial z}& =& \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]} \nonumber \\
+& = &\frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i)
+\label{eq:5-33}
 \end{eqnarray}

 \parinterval 根据$\frac{\partial g(z)}{\partial z}$和$\frac{\partial z}{\partial f}$计算的结果，可以得到
 \begin{eqnarray}
 {\frac{\partial \big[ \prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}}& =& {{\frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]}} \cdot{\frac{\partial \big[ \sum\limits_{i=0}^{l}f(s_u|t_i) \big]}{\partial f(s_u|t_v)}}} \nonumber \\
 & = &{\frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}
-\label{eq:5-36}
+\label{eq:5-34}
 \end{eqnarray}
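
The closed-form derivative of $\prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i)$ with respect to $f(s_u|t_v)$ can be sanity-checked against a finite-difference approximation. The following is a minimal Python sketch, assuming $f$ is a dictionary of free (unconstrained) positive values; the words, numbers and function names are hypothetical. The source word "a" deliberately occurs twice so that $\beta=2$.

```python
from math import isclose, prod

def objective(src, tgt, f):
    """prod_j sum_i f(s_j | t_i), the part of the likelihood that depends on f."""
    return prod(sum(f[(s, t)] for t in tgt) for s in src)

def analytic_derivative(src, tgt, f, s_u, t_v):
    """Closed form: beta * (#occurrences of t_v) / z * objective."""
    beta = sum(s == s_u for s in src)          # sum_j delta(s_j, s_u)
    hits = sum(t == t_v for t in tgt)          # sum_i delta(t_i, t_v)
    z = sum(f[(s_u, t)] for t in tgt)          # sum_i f(s_u | t_i)
    return beta * hits / z * objective(src, tgt, f)

def numeric_derivative(src, tgt, f, s_u, t_v, h=1e-6):
    """Central finite difference on the single parameter f(s_u | t_v)."""
    up, down = dict(f), dict(f)
    up[(s_u, t_v)] += h
    down[(s_u, t_v)] -= h
    return (objective(src, tgt, up) - objective(src, tgt, down)) / (2 * h)

src = ["a", "b", "a"]
tgt = ["<t0>", "x", "y"]
f = {
    ("a", "<t0>"): 0.2, ("a", "x"): 0.5, ("a", "y"): 0.3,
    ("b", "<t0>"): 0.1, ("b", "x"): 0.2, ("b", "y"): 0.6,
}

print(analytic_derivative(src, tgt, f, "a", "x"))
print(numeric_derivative(src, tgt, f, "a", "x"))
```

The two values should agree up to the finite-difference error, which is a quick way to catch sign or index mistakes in a hand derivation.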

 \parinterval 将$\frac{\partial \big[ \prod_{j=1}^{m} \sum_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)}$进一步代入$\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}$，得到$L(f,\lambda)$的导数
@@ -952,22 +942,22 @@ $\frac{\partial g(z)}{\partial z} = \alpha \beta z^{\beta-1} = \frac{\beta}{z}\a
 & &{\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}}\nonumber \\
 &=&{\frac{\varepsilon}{(l+1)^{m}} \cdot \frac{\partial \big[ \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \big]}{\partial f(s_u|t_v)} - \lambda_{t_v}}\nonumber \\
 &=&{\frac{\varepsilon}{(l+1)^{m}} \frac{\sum_{j=1}^{m} \delta(s_j,s_u) \cdot \sum_{i=0}^{l} \delta(t_i,t_v)}{\sum_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) - \lambda_{t_v}}
-\label{eq:5-37}
+\label{eq:5-35}
 \end{eqnarray}

 \parinterval 令$\frac{\partial L(f,\lambda)}{\partial f(s_u|t_v)}=0$，有
 \begin{eqnarray}
-f(s_u|t_v) = \frac{\lambda_{t_v}^{-1} \varepsilon}{(l+1)^{m}} \cdot \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot f(s_u|t_v)
-\label{eq:5-38}
+f(s_u|t_v) &=& \frac{\lambda_{t_v}^{-1} \varepsilon}{(l+1)^{m}} \cdot \frac{\sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \cdot f(s_u|t_v)
+\label{eq:5-36}
 \end{eqnarray}

 \parinterval 将上式稍作调整得到下式：
 \begin{eqnarray}
-f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v) \frac{f(s_u|t_v) }{\sum\limits_{i=0}^{l}f(s_u|t_i)}
-\label{eq:5-39}
+f(s_u|t_v) &=& \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \sum\limits_{j=1}^{m} \delta(s_j,s_u) \sum\limits_{i=0}^{l} \delta(t_i,t_v) \frac{f(s_u|t_v) }{\sum\limits_{i=0}^{l}f(s_u|t_i)}
+\label{eq:5-37}
 \end{eqnarray}

-\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用上式迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\eqref{eq:5-39}，具体如图\ref{fig:5-24}所示。
+\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用公式\eqref{eq:5-37}迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\eqref{eq:5-37}，具体如图\ref{fig:5-24}所示。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -978,22 +968,22 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 \end{figure}
 %----------------------------------------------

-\parinterval 为了化简$f(s_u|t_v)$的计算，在此对公式\eqref{eq:5-39}进行了重新组织，见图\ref{fig:5-25}。其中，红色部分表示翻译概率P$(\seq{s}|\seq{t})$；蓝色部分表示$(s_u,t_v)$ 在句对$(\seq{s},\seq{t})$中配对的总次数，即“$t_v$翻译为$s_u$”在所有对齐中出现的次数；绿色部分表示$f(s_u|t_v)$对于所有的$t_i$的相对值，即“$t_v$翻译为$s_u$”在所有对齐中出现的相对概率；蓝色与绿色部分相乘表示“$t_v$翻译为$s_u$”这个事件出现次数的期望的估计，称之为{\small\sffamily\bfseries{期望频次}}\index{期望频次}（Expected Count）\index{Expected Count}。
+\parinterval 为了化简$f(s_u|t_v)$的计算，在此对公式\eqref{eq:5-37}进行了重新组织，见图\ref{fig:5-25}。其中，红色部分表示翻译概率$\funp{P}(\seq{s}|\seq{t})$；蓝色部分表示$(s_u,t_v)$ 在句对$(\seq{s},\seq{t})$中配对的总次数，即“$t_v$翻译为$s_u$”在所有对齐中出现的次数；绿色部分表示$f(s_u|t_v)$对于所有的$t_i$的相对值，即“$t_v$翻译为$s_u$”在所有对齐中出现的相对概率；蓝色与绿色部分相乘表示“$t_v$翻译为$s_u$”这个事件出现次数的期望的估计，称之为{\small\sffamily\bfseries{期望频次}}\index{期望频次}（Expected Count）\index{Expected Count}。
 \vspace{-0.3em}
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40}
-   \caption{公式\eqref{eq:5-39}的解释}
+   \caption{公式\eqref{eq:5-37}的解释}
   \label{fig:5-25}
 \end{figure}
 %----------------------------------------------

 \parinterval 期望频次是事件在其分布下出现次数的期望。令$c_{\mathbb{E}}(X)$为事件$X$的期望频次，其计算公式为：
-
-\begin{equation}
-c_{\mathbb{E}}(X)=\sum_i c(x_i) \cdot \funp{P}(x_i)
-\end{equation}
+\begin{eqnarray}
+c_{\mathbb{E}}(X)&=&\sum_i c(x_i) \cdot \funp{P}(x_i)
+\label{eq:5-38}
+\end{eqnarray}

 \noindent 其中$c(x_i)$表示$X$取$x_i$时出现的次数，$\funp{P}(x_i)$表示$X=x_i$出现的概率。图\ref{fig:5-26}展示了事件$X$的期望频次的详细计算过程。其中$x_1$、$x_2$和$x_3$分别表示事件$X$出现2次、1次和5次的情况。
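
The expected-count formula is a probability-weighted sum of counts. The following is a minimal Python sketch: the counts 2, 1 and 5 follow the three cases $x_1$, $x_2$ and $x_3$ mentioned above, while the probabilities are assumed values chosen only for illustration and are not taken from the figure.

```python
def expected_count(counts, probs):
    """c_E(X) = sum_i c(x_i) * P(x_i)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(c * p for c, p in zip(counts, probs))

counts = [2, 1, 5]           # c(x_1), c(x_2), c(x_3)
probs = [0.1, 0.3, 0.6]      # assumed P(x_1), P(x_2), P(x_3)

print(expected_count(counts, probs))
```

With these assumed probabilities the result is $2\times0.1+1\times0.3+5\times0.6$; a different distribution over the same counts gives a different expected count.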

@@ -1009,39 +999,39 @@ c_{\mathbb{E}}(X)=\sum_i c(x_i) \cdot \funp{P}(x_i)

 \parinterval 因为在$\funp{P}(\seq{s}|\seq{t})$中，$t_v$翻译（连接）到$s_u$的期望频次为：
 \begin{eqnarray}
-c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) \equiv \sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v) \cdot \frac {f(s_u|t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)}
-\label{eq:5-40}
+c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) & \equiv & \sum\limits_{j=1}^{m} \delta(s_j,s_u) \cdot \sum\limits_{i=0}^{l} \delta(t_i,t_v) \cdot \frac {f(s_u|t_v)}{\sum\limits_{i=0}^{l}f(s_u|t_i)}
+\label{eq:5-39}
 \end{eqnarray}

-\parinterval 所以公式\ref {eq:5-39}可重写为：
+\parinterval 所以公式\eqref{eq:5-37}可重写为：
 \begin{eqnarray}
-f(s_u|t_v)=\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
-\label{eq:5-41}
+f(s_u|t_v)&=&\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
+\label{eq:5-40}
 \end{eqnarray}

 \parinterval 在此如果令$\lambda_{t_v}^{'}=\frac{\lambda_{t_v}}{\funp{P}(\seq{s}| \seq{t})}$，可得：
 \begin{eqnarray}
 f(s_u|t_v) &= &\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) \nonumber \\
 &=&{(\lambda_{t_v}^{'})}^{-1} \cdot c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
-\label{eq:5-42}
+\label{eq:5-41}
 \end{eqnarray}

 \parinterval 又因为IBM模型对$f(\cdot|\cdot)$的约束如下：
 \begin{eqnarray}
-\forall t_y : \sum\limits_{s_x} f(s_x|t_y) =1
-\label{eq:5-43}
+\forall t_y : \sum\limits_{s_x} f(s_x|t_y) &=& 1
+\label{eq:5-42}
 \end{eqnarray}

-\parinterval 为了满足$f(\cdot|\cdot)$的概率归一化约束，易知$\lambda_{t_v}^{'}$为：
+\parinterval 为了满足$f(\cdot|\cdot)$的概率归一化约束，易得$\lambda_{t_v}^{'}$为：
 \begin{eqnarray}
-\lambda_{t_v}^{'}=\sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
-\label{eq:5-44}
+\lambda_{t_v}^{'}&=&\sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
+\label{eq:5-43}
 \end{eqnarray}

 \parinterval 因此，$f(s_u|t_v)$的计算式可再一步变换成下式：
 \begin{eqnarray}
-f(s_u|t_v)=\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) }
-\label{eq:5-45}
+f(s_u|t_v)&=&\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) }
+\label{eq:5-44}
 \end{eqnarray}


@@ -1049,8 +1039,8 @@ f(s_u|t_v)=\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c
 \parinterval 进一步，假设有$K$个互译的句对（称作平行语料）：
 $\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$，$f(s_u|t_v)$的期望频次为：
 \begin{eqnarray}
-c_{\mathbb{E}}(s_u|t_v)=\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^{[k]})
-\label{eq:5-46}
+c_{\mathbb{E}}(s_u|t_v)&=&\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})
+\label{eq:5-45}
 \end{eqnarray}

 \parinterval 于是有$f(s_u|t_v)$的计算公式和迭代过程如图\ref{fig:5-27}所示。完整的EM算法如图\ref{fig:5-28}所示。其中E-Step对应4-5行，目的是计算$c_{\mathbb{E}}(\cdot)$；M-Step对应6-9行，目的是计算$f(\cdot|\cdot)$。
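
The E-step and M-step described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of the algorithm figure: the corpus is a hypothetical two-sentence toy example, `<t0>` is an assumed name for the empty token, $f$ is initialized uniformly, and the function name `train_ibm1` is illustrative.

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10, null_token="<t0>"):
    """Estimate f(s_u | t_v) with EM on a list of (source, target) word lists."""
    # Prepend the empty token t_0 to every target sentence.
    pairs = [(src, [null_token] + tgt) for src, tgt in corpus]
    src_vocab = {s for src, _ in pairs for s in src}
    uniform = 1.0 / len(src_vocab)
    f = defaultdict(lambda: uniform)          # uniform initialization

    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c_E(s_u | t_v)
        total = defaultdict(float)            # sum over s_u of c_E(s_u | t_v)
        # E-step: accumulate expected counts over all sentence pairs.
        for src, tgt in pairs:
            for s in src:
                z = sum(f[(s, t)] for t in tgt)     # sum_i f(s | t_i)
                for t in tgt:
                    c = f[(s, t)] / z
                    count[(s, t)] += c
                    total[t] += c
        # M-step: renormalize so that sum over s_u of f(s_u | t_v) equals 1.
        f = {pair: c / total[pair[1]] for pair, c in count.items()}
    return f

corpus = [
    (["桌子"], ["table"]),
    (["在", "桌子", "上"], ["on", "the", "table"]),
]
f = train_ibm1(corpus)
for (s, t), p in sorted(f.items(), key=lambda kv: (kv[0][1], -kv[1])):
    print(f"f({s}|{t}) = {p:.3f}")
```

Looping over every source position and every target position adds $f(s_j|t_i)/\sum_{i}f(s_j|t_i)$ once per matching pair of positions, which reproduces the two $\delta$ sums in the expected-count definition without computing them explicitly.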

--- a/Chapter6/Figures/figure-alignment-matrix-for-zh-to-en-translation.tex
+++ b/Chapter6/Figures/figure-alignment-matrix-for-zh-to-en-translation.tex
@@ -25,7 +25,7 @@
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*2,yshift=0.4em]o.east){satisfied};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*3,yshift=0.4em]o.east){with};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*4,yshift=0.4em]o.east){you};
-	\node[anchor=east,inner sep=0pt,font=\footnotesize] at([xshift=\bc*4.5,yshift=-1.0cm-\bc*4]o.west){(a)对齐实例1};
+	\node[anchor=east,inner sep=0pt,font=\small] at([xshift=\bc*4.5,yshift=-1.0cm-\bc*4]o.west){(a)对齐实例1};
 \end{scope}
 \begin{scope}[xshift=15.0em]
    \filldraw [fill=white,drop shadow] (0,0) rectangle (\bc*8,\bc*6);
@@ -56,7 +56,7 @@
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*5,yshift=0.4em]o.east){work};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*6,yshift=0.4em]o.east){every};
    \node[anchor=west,inner sep=0pt,font=\footnotesize,rotate=45] at([xshift=0.1cm+\bc*7,yshift=0.4em]o.east){day};
-    \node[anchor=east,inner sep=0pt,font=\footnotesize] at([xshift=\bc*6.0,yshift=-1.0cm-\bc*5]o.west){(b)对齐实例2};
+    \node[anchor=east,inner sep=0pt,font=\small] at([xshift=\bc*6.0,yshift=-1.0cm-\bc*5]o.west){(b)对齐实例2};
 \end{scope}
 \end{tikzpicture}
 %---------------------------------------------------------------------
\ No newline at end of file
--- a/Chapter6/Figures/figure-examples-of-sequential-translation-and-reorder-translation.tex
+++ b/Chapter6/Figures/figure-examples-of-sequential-translation-and-reorder-translation.tex
@@ -24,7 +24,7 @@
 		\draw[line width=1.2pt,dashed] ([yshift=-0.3em]n14.south) -- ([yshift=0.2em]n24.north);
 		\draw[line width=1.2pt,dashed] ([yshift=-0.3em]n15.south) -- ([yshift=0.2em]n25.north);
 		\draw[line width=1.2pt,dashed] ([yshift=-0.3em]n16.south) -- ([yshift=0.2em]n26.north);
-        \node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){(a)顺序翻译对齐结果};
+        \node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){\small{(a)顺序翻译对齐结果}};
 \end{scope}
 \begin{scope}[yshift=-11.5em]
 	\tikzstyle{cand} = [draw,inner sep=4pt,line width=1pt,align=center,drop shadow,minimum height =1.6em,minimum width=4.2em,fill=green!30]
@@ -49,7 +49,7 @@
 		\draw[line width=1.2pt,dashed,out=-40,in=140] ([yshift=-0.3em]n14.south) to ([yshift=0.2em]n26.north);
 		\draw[line width=1.2pt,dashed,out=-140,in=40] ([yshift=-0.3em]n15.south) to ([yshift=0.2em]n23.north);
 		\draw[line width=1.2pt,dashed,out=-140,in=40] ([yshift=-0.3em]n16.south) to ([yshift=0.2em]n24.north);
-		\node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){(b)调序翻译对齐结果};
+		\node[anchor=west] at([xshift=5.5em,yshift=-3em]n21.east){\small{(b)调序翻译对齐结果}};
 \end{scope}
 \end{tikzpicture}
 %---------------------------------------------------------------------
\ No newline at end of file
--- a/Chapter6/chapter6.tex
+++ b/Chapter6/chapter6.tex
@@ -23,7 +23,7 @@

 \chapter{基于扭曲度和繁衍率的模型}

-{\chapterfive}展示了一种基于单词的翻译模型。这种模型的形式非常简单，而且其隐含的词对齐信息具有较好的可解释性。不过，语言翻译的复杂性远远超出人们的想象。有两方面挑战\ \dash\ 如何对“ 调序”问题进行建模以及如何对“一对多翻译”问题进行建模。调序是翻译问题中所特有的现象，比如，汉语到日语的翻译中，需要对谓词进行调序。另一方面，一个单词在另一种语言中可能会被翻译为多个连续的词，比如，汉语“ 联合国”翻译到英语会对应三个单词“The United Nations”。这种现象也被称作一对多翻译，它与句子长度预测有着密切的联系。
+{\chapterfive}展示了一种基于单词的翻译模型。这种模型的形式非常简单，而且其隐含的词对齐信息具有较好的可解释性。不过，语言翻译的复杂性远远超出人们的想象。语言翻译主要有两方面挑战\ \dash\ 如何对“调序”问题进行建模以及如何对“一对多翻译”问题进行建模。一方面，调序是翻译问题中所特有的现象，比如，汉语到日语的翻译中，需要对谓词进行调序。另一方面，一个单词在另一种语言中可能会被翻译为多个连续的词，比如，汉语“联合国”翻译到英语会对应三个单词“The United Nations”。这种现象也被称作一对多翻译，它与句子长度预测有着密切的联系。

 无论是调序还是一对多翻译，简单的翻译模型（如IBM模型1）都无法对其进行很好的处理。因此，需要考虑对这两个问题单独进行建模。本章将会对机器翻译中两个常用的概念进行介绍\ \dash\ 扭曲度（Distortion）和繁衍率（Fertility）。它们可以被看作是对调序和一对多翻译现象的一种统计描述。基于此，本章会进一步介绍基于扭曲度和繁衍率的翻译模型，建立相对完整的基于单词的统计建模体系。相关的技术和概念在后续章节也会被进一步应用。

@@ -34,7 +34,7 @@
 \sectionnewpage
 \section{基于扭曲度的模型}

-下面将介绍扭曲度在机器翻译中的定义及使用方法。这也带来了两个新的翻译模型\ \dash\ IBM模型2\upcite{DBLP:journals/coling/BrownPPM94}和HMM翻译模型\upcite{vogel1996hmm}。
+下面将介绍扭曲度在机器翻译中的定义及使用方法。这也带来了两个新的翻译模型\ \dash\ IBM模型2\upcite{DBLP:journals/coling/BrownPPM94}和HMM\upcite{vogel1996hmm}。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -74,11 +74,11 @@
 \parinterval 对于建模来说，IBM模型1很好地化简了翻译问题，但是由于使用了很强的假设，导致模型和实际情况有较大差异。其中一个比较严重的问题是假设词对齐的生成概率服从均匀分布。IBM模型2抛弃了这个假设\upcite{DBLP:journals/coling/BrownPPM94}。它认为词对齐是有倾向性的，它与源语言单词的位置和目标语言单词的位置有关。具体来说，对齐位置$a_j$的生成概率与位置$j$、源语言句子长度$m$和目标语言句子长度$l$有关，形式化表述为：

 \begin{eqnarray}
-\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t}) \equiv a(a_j|j,m,l)
+\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t}) & \equiv & a(a_j|j,m,l)
 \label{eq:6-1}
 \end{eqnarray}

-\parinterval 这里还用{\chapterthree}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，“桌子”对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，“桌子”对齐到“table”被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果它的源语言和目标语言都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此“桌子”对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，“桌子”对齐到“table”应该得到更高的概率。
+\parinterval 这里还用{\chapterfive}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，“桌子”对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，“桌子”对齐到“table”被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果源语言句子和目标语言句子都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此“桌子”对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，“桌子”对齐到“table”应该得到更高的概率。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -97,7 +97,7 @@
 \label{eq:s-word-gen-prob}
 \end{eqnarray}

-Substituting Equations \eqref{eq:6-1}, \eqref{eq:s-len-gen-prob}, and \eqref{eq:s-word-gen-prob} and back into $\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})}$\\${\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$ and $\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$ gives the mathematical description of IBM Model 2:
+Substituting Equations \eqref{eq:6-1}, \eqref{eq:s-len-gen-prob}, and \eqref{eq:s-word-gen-prob} back into $\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})}$\\${\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$ and $\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$ gives the mathematical description of IBM Model 2:
 \begin{eqnarray}
 \funp{P}(\seq{s}| \seq{t}) & = &  \sum_{\seq{a}}{\funp{P}(\seq{s},\seq{a}| \seq{t})} \nonumber \\
                       & = & \sum_{a_1=0}^{l}{\cdots}\sum _{a_m=0}^{l}{\varepsilon}\prod_{j=1}^{m}{a(a_j|j,m,l)f(s_j|t_{a_j})}
@@ -106,7 +106,7 @@

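 \parinterval As with Model 1, Expression \eqref{eq:6-4} for Model 2 can be understood in two parts. The first part enumerates all alignments $\seq{a}$. The second part accumulates $\funp{P}(\seq{s},\seq{a}| \seq{t})$ for each $\seq{a}$, that is, the product over all source positions of the alignment probability $a(a_j|j,m,l)$ and the lexical translation probability $f(s_j|t_{a_j})$.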
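
 \parinterval The sum over alignments can be checked by hand on a very small case. As a sketch, assume $m=2$ and $l=1$ (sizes chosen only for illustration) and omit the constant $\varepsilon$. The $(l+1)^m=4$ alignment terms then factor into a product of $m=2$ sums, each over the $l+1=2$ target positions:
 \begin{eqnarray}
 \sum_{a_1=0}^{1}\sum_{a_2=0}^{1}\prod_{j=1}^{2}{a(a_j|j,2,1)f(s_j|t_{a_j})} & = & \big(a(0|1,2,1)f(s_1|t_0)+a(1|1,2,1)f(s_1|t_1)\big) \nonumber \\
                        &   & \times \big(a(0|2,2,1)f(s_2|t_0)+a(1|2,2,1)f(s_2|t_1)\big) \nonumber \\
                        & = & \prod_{j=1}^{2}\sum_{i=0}^{1}{a(i|j,2,1)f(s_j|t_i)} \nonumber
 \end{eqnarray}
 \noindent This is the same exchange of sum and product that is used for Model 1. It reduces the number of terms to evaluate from $(l+1)^m$ to $m(l+1)$.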

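-\parinterval Likewise, decoding and training optimization for Model 2 are very similar to those for Model 1 and are not repeated here; the detailed derivation can be found in the part of {\chapterfive} on decoding and computational optimization. The final expression of IBM Model 2 is given directly:
+\parinterval Likewise, decoding and training optimization for Model 2 are very similar to those for Model 1 and are not repeated here; the detailed derivation can be found in the part of Section \ref{IBM-model1} of {\chapterfive} on decoding and computational optimization. The final expression of IBM Model 2 is given directly: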
 \begin{eqnarray}
 \funp{P}(\seq{s}| \seq{t}) & = & \varepsilon \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} a(i|j,m,l) f(s_j|t_i)
 \label{eq:6-5}
@@ -132,7 +132,7 @@

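 \parinterval To address this problem, the HMM-based word alignment model drops the absolute-position assumption of IBM Models 1-2 and applies a first-order hidden Markov model to word alignment\upcite{vogel1996hmm}. The HMM alignment model holds that words are not unrelated to one another: the alignment probability should depend on the difference between alignment positions rather than on the positions of the words themselves. Specifically, the alignment position $a_j$ of position $j$ depends on the alignment position $a_{j-1}$ of the previous position $j-1$ and on the target sentence length $l$, which is formalized as: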
 \begin{eqnarray}
-\funp{P}(a_{j}|a_{1}^{j-1},s_{1}^{j-1},m,\seq{t})\equiv\funp{P}(a_{j}|a_{j-1},l)
+\funp{P}(a_{j}|a_{1}^{j-1},s_{1}^{j-1},m,\seq{t})& \equiv & \funp{P}(a_{j}|a_{j-1},l)
 \label{eq:6-6}
 \end{eqnarray}

@@ -140,13 +140,13 @@

 \parinterval Substituting $\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t}) \equiv f(s_j|t_{a_j})$ and Equation \eqref{eq:6-6} back into $\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})$\\$\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$ and $\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$ gives the mathematical description of the HMM word alignment model:
 \begin{eqnarray}
-\funp{P}(\seq{s}| \seq{t})=\sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_{j}|a_{j-1},l)f(s_{j}|t_{a_j})}
+\funp{P}(\seq{s}| \seq{t})& =& \sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_{j}|a_{j-1},l)f(s_{j}|t_{a_j})}
 \label{eq:6-7}
 \end{eqnarray}

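 \parinterval In addition, so that the HMM alignment probability $\funp{P}(a_{j}|a_{j-1},l)$ satisfies the normalization condition, it is further assumed to depend only on $a_{j}-a_{j-1}$, that is: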
 \begin{eqnarray}
-\funp{P}(a_{j}|a_{j-1},l)=\frac{\mu(a_{j}-a_{j-1})}{\sum_{i=1}^{l}{\mu(i-a_{j-1})}}
+\funp{P}(a_{j}|a_{j-1},l)& = & \frac{\mu(a_{j}-a_{j-1})}{\sum_{i=1}^{l}{\mu(i-a_{j-1})}}
 \label{eq:6-8}
 \end{eqnarray}
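
 \parinterval For instance, under Equation \eqref{eq:6-8} with an assumed target length $l=3$ and previous alignment position $a_{j-1}=2$ (values chosen only for illustration), the probability of moving to $a_j=3$ is
 \begin{eqnarray}
 \funp{P}(a_{j}=3|a_{j-1}=2,l=3) & = & \frac{\mu(3-2)}{\mu(1-2)+\mu(2-2)+\mu(3-2)} \;\; = \;\; \frac{\mu(1)}{\mu(-1)+\mu(0)+\mu(1)} \nonumber
 \end{eqnarray}
 \noindent so only the jump widths $-1$, $0$ and $1$ enter the normalization, and the three probabilities $\funp{P}(a_j=i|a_{j-1}=2,l=3)$ for $i=1,2,3$ sum to 1 by construction.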

@@ -179,7 +179,7 @@

 \begin{itemize}
 \vspace{0.3em}
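-\item First, decide the fertility $\varphi_{i}$ of each English word $t_i$. For example, the fertility of “Scientists” is 2, written ${\varphi}_{1}=2$, which means that it generates 2 Chinese words;
+\item First, determine the fertility $\varphi_{i}$ of each English word $t_i$. For example, the fertility of “Scientists” is 2, written ${\varphi}_{1}=2$, which means that it generates 2 Chinese words;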
 \vspace{0.3em}
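 \item Second, determine the list of Chinese words generated by each word of the English sentence. For example, “Scientists” generates the two Chinese words “科学家” and “们”, written ${\tau}_1=\{{\tau}_{11}=\textrm{“科学家”},{\tau}_{12}=\textrm{“们”}\}$. The special empty token NULL is used to represent alignment to the empty word;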
 \vspace{0.3em}
@@ -201,10 +201,10 @@
 \parinterval As can be seen, a pair of $\tau$ and $\pi$ (written $<\tau,\pi>$) determines one alignment $\seq{a}$ and one source sentence $\seq{s}$.

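 \noindent Conversely, one alignment $\seq{a}$ and one source sentence $\seq{s}$ can correspond to several $<\tau,\pi>$ pairs. As Figure \ref{fig:6-6} shows, different $<\tau,\pi>$ pairs correspond to the same source sentence and word alignment; they differ only in the order in which the source words “科学家” and “们” are generated by the target word “Scientists”. The set of $<\tau,\pi>$ pairs that map to the same source sentence $\seq{s}$ and alignment $\seq{a}$ is written $<\seq{s},\seq{a}>$. Computing $\funp{P}(\seq{s},\seq{a}| \seq{t})$ therefore requires summing the probabilities of all of these pairs: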
-\begin{equation}
-\funp{P}(\seq{s},\seq{a}| \seq{t})=\sum_{{<\tau,\pi>}\in{<\seq{s},\seq{a}>}}{\funp{P}(\tau,\pi|\seq{t}) }
+\begin{eqnarray}
+\funp{P}(\seq{s},\seq{a}| \seq{t}) & = & \sum_{{<\tau,\pi>}\in{<\seq{s},\seq{a}>}}{\funp{P}(\tau,\pi|\seq{t}) }
 \label{eq:6-9}
-\end{equation}
+\end{eqnarray}
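
 \parinterval As an illustration of the sum in Equation \eqref{eq:6-9}, suppose (positions assumed here for illustration) that “科学家” and “们” occupy source positions 1 and 2. The two orders in which “Scientists” can generate them give two different $<\tau,\pi>$ pairs that lead to the same $<\seq{s},\seq{a}>$:
 \begin{eqnarray}
 <\tau,\pi>_{1} & : & \tau_{11}=\textrm{“科学家”},\ \pi_{11}=1,\ \tau_{12}=\textrm{“们”},\ \pi_{12}=2 \nonumber \\
 <\tau,\pi>_{2} & : & \tau_{11}=\textrm{“们”},\ \pi_{11}=2,\ \tau_{12}=\textrm{“科学家”},\ \pi_{12}=1 \nonumber
 \end{eqnarray}
 \noindent Both pairs contribute to $\funp{P}(\seq{s},\seq{a}| \seq{t})$. More generally, a target word with fertility $\varphi_i$ can generate its source words in up to $\varphi_i!$ different orders.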

 %----------------------------------------------
 \begin{figure}[htp]
@@ -233,15 +233,15 @@

 \begin{itemize}
 \vspace{0.5em}
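-\item Part 1: fertility modeling of the target word for each $i\in[1,l]$ ({\color{red!70} red}), i.e., the probability of generating $\varphi_i$. It depends on $\seq{t}$ and on the fertilities $\varphi_1^{i-1}$ of the target words in the range $[1,i-1]$.\footnote{By convention, $\varphi_1^0$ is empty when $i=1$.}
+\item Part 1: modeling the fertility of the target word for each $i\in[1,l]$ ({\color{red!70} red}), i.e., the probability of generating $\varphi_i$. It depends on $\seq{t}$ and on the fertilities $\varphi_1^{i-1}$ of the target words in the range $[1,i-1]$.\footnote{By convention, $\varphi_1^0$ is empty when $i=1$.}
 \vspace{0.5em}
-\item Part 2: fertility modeling for $i=0$ ({\color{blue!70} blue}), i.e., the probability of generating the fertility of the empty token $t_0$. It depends on $\seq{t}$ and on the fertilities $\varphi_1^l$ of the target words in the range $[1,l]$.
+\item Part 2: modeling the fertility for $i=0$ ({\color{blue!70} blue}), i.e., the probability of generating the fertility of the empty token $t_0$. It depends on $\seq{t}$ and on the fertilities $\varphi_1^l$ of the target words in the range $[1,l]$.
 \vspace{0.5em}
-\item Part 3: lexical translation modeling ({\color{green!70} green}), i.e., the probability that the target word $t_i$ generates its $k$-th source word $\tau_{ik}$. It depends on $\seq{t}$, on the fertilities $\varphi_0^l$ of all target words, on the source words $\tau_1^{i-1}$ generated by the target words in the range $[1,i-1]$, and on the first $k-1$ source words $\tau_{i1}^{k-1}$ generated by the target word $t_i$.
+\item Part 3: modeling lexical translation ({\color{green!70} green}), i.e., the probability that the target word $t_i$ generates its $k$-th source word $\tau_{ik}$. It depends on $\seq{t}$, on the fertilities $\varphi_0^l$ of all target words, on the source words $\tau_1^{i-1}$ generated by the target words in the range $[1,i-1]$, and on the first $k-1$ source words $\tau_{i1}^{k-1}$ generated by the target word $t_i$.
 \vspace{0.5em}
 \item Part 4: modeling the distortion of the source words generated by the target word for each $i\in[1,l]$ ({\color{yellow!70!black} yellow}), i.e., the probability of the position $\pi_{ik}$ in the source sentence of the $k$-th source word generated by the $i$-th target word. Here $\pi_1^{i-1}$ denotes the distortions of the source words generated by the target words in the range $[1,i-1]$, and $\pi_{i1}^{k-1}$ denotes the distortions of the first $k-1$ source words generated by the $i$-th target word.
 \vspace{0.5em}
-\item Part 5: distortion modeling for $i=0$ ({\color{gray!70} gray}), i.e., the probability of the source positions generated by the empty token $t_0$.
+\item Part 5: modeling the distortion for $i=0$ ({\color{gray!70} gray}), i.e., the probability of the source positions generated by the empty token $t_0$.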
 \end{itemize}

 %----------------------------------------------------------------------------------------
@@ -262,29 +262,29 @@

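 \parinterval The case $i=0$ has to be considered separately. In fact, $t_0$ is only a virtual word that corresponds to the words in $\seq{s}$ that are aligned to nothing. The assumption here is that these null-aligned words are generated (placed) only after all the other, non-null-aligned words have been generated (placed); that is, once the non-null-aligned words are in place, the null-aligned source words are put into the positions that are still vacant. Moreover, a null-aligned source word is equally likely to be placed in any vacant position, so the placement follows a uniform distribution. Thus, when $k$ null-aligned source words have already been placed, $\varphi_0-k$ vacant positions should remain. If the $j$-th source position is vacant, then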

-\begin{equation}
-\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t})=\frac{1}{\varphi_0-k}
+\begin{eqnarray}
+\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t}) & = & \frac{1}{\varphi_0-k}
 \label{eq:6-13}
-\end{equation}
+\end{eqnarray}

 Otherwise

-\begin{equation}
-\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t})=0
+\begin{eqnarray}
+\funp{P}(\pi_{0k}=j|\pi_{01}^{k-1},\pi_1^l,\tau_0^l,\varphi_0^l,\seq{t}) & = & 0
 \label{eq:6-14}
-\end{equation}
+\end{eqnarray}

 Thus, for the $\tau_0$ corresponding to $t_0$,
 {
 \begin{eqnarray}
-\prod_{k=1}^{\varphi_0}{\funp{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\seq{t})         }=\frac{1}{\varphi_{0}!}
+\prod_{k=1}^{\varphi_0}{\funp{P}(\pi_{0k}|\pi_{01}^{k-1},\pi_{1}^{l},\tau_{0}^{l},\varphi_{0}^{l},\seq{t})         } & = & \frac{1}{\varphi_{0}!}
 \label{eq:6-15}
 \end{eqnarray}
 }
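
 \parinterval As a quick check of Equation \eqref{eq:6-15}, take $\varphi_0=2$ (a value assumed for illustration). The first null-aligned word can go into either of the 2 vacant positions, and the second into the 1 position that remains, so the product of the two uniform placement probabilities is
 \begin{eqnarray}
 \frac{1}{2} \times \frac{1}{1} & = & \frac{1}{2!} \nonumber
 \end{eqnarray}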
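 \parinterval How are the vacant positions for $t_0$ mentioned above generated, that is, how is it decided which positions will hold null-aligned source words? IBM Model 3 assumes that after all non-null-aligned source words have been generated ($\varphi_1+\varphi_2+\cdots+{\varphi}_l$ of them in total), each of these words is followed, with probability $p_1$, by a randomly created “slot” for a null-aligned word. ${\varphi}_0$ then follows a binomial distribution, which gives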
 {
 \begin{eqnarray}
-\funp{P}(\varphi_0|\seq{t})=\big(\begin{array}{c}
+\funp{P}(\varphi_0|\seq{t}) & = & \big(\begin{array}{c}
 \varphi_1+\varphi_2+\cdots \varphi_l\\
 \varphi_0\\
 \end{array}\big)p_0^{\varphi_1+\varphi_2+\cdots \varphi_l-\varphi_0}p_1^{\varphi_0}
@@ -318,7 +318,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

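 \subsection{IBM Model 4}

-\parinterval IBM Model 3 still has problems. For example, it does not handle well the case in which one target word generates several source words, a problem that also exists in Models 1 and 2. If one target word corresponds to several source words, these source words often form a phrase or a collocation. Models 1-3, however, treat them as independent units even though they actually belong together, so the source words may end up “scattered” in Models 1-3. To solve this problem, Model 4 further revises Model 3.
+\parinterval IBM Model 3 still has problems. For example, it does not handle well the case in which one target word generates several source words, a problem that also exists in Models 1 and 2. If one target word corresponds to several source words, then these source words tend to form a phrase. Models 1-3, however, treat them as independent units even though they actually belong together, so the source words may end up “scattered” in Models 1-3. To solve this problem, Model 4 further revises Model 3.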

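 \parinterval To make the discussion clearer, a new term is introduced here\ \dash \ the {\small\bfnew{concept unit}}\index{Concept unit}, or simply {\small\bfnew{concept}}\index{Concept}. Word alignment can be viewed as a correspondence between concepts, where a concept is a group of words with an independent syntactic or semantic function. Following the notation of Brown et al.\upcite{DBLP:journals/coling/BrownPPM94}, a concept is abbreviated as cept. Every sentence can be represented as a sequence of cepts. Note that the number of cepts in the source sentence is not necessarily equal to the number in the target sentence. Because some cepts can be empty, null-aligned words can be regarded as belonging to an empty cept. For example, in Figure \ref{fig:6-8}, “了” corresponds to an empty cept.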

@@ -336,23 +336,23 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
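 \parinterval In addition, $\odot_{i}$ denotes the average of the positions of the source words that correspond to the target word at position $[i]$; if the average is not an integer, it is rounded up. In this example, the 4th cept in the target sentence (“.”) corresponds to the 5th word of the source sentence, which is written ${\odot}_{4}=5$.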
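
 \parinterval The rounding rule matters only when a target word generates several source words. As a hypothetical case (not taken from the figure), suppose a target word generates the source words at positions 2 and 5. Its center is then
 \begin{eqnarray}
 \odot & = & \left\lceil \frac{2+5}{2} \right\rceil \;\; = \;\; \lceil 3.5 \rceil \;\; = \;\; 4 \nonumber
 \end{eqnarray}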

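 \parinterval Using these newly introduced notions, Model 4 modifies the distortion of Model 3, mainly by splitting it into two kinds of parameters. For the first word ($\tau_{[i]1}$) in the list of source words ($\tau_{[i]}$) corresponding to $[i]$, the distortion is computed as: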
-\begin{equation}
-\funp{P}(\pi_{[i]1}=j|{\pi}_1^{[i]-1},{\tau}_0^l,{\varphi}_0^l,\seq{t})=d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))
+\begin{eqnarray}
+\funp{P}(\pi_{[i]1}=j|{\pi}_1^{[i]-1},{\tau}_0^l,{\varphi}_0^l,\seq{t}) & = & d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))
 \label{eq:6-22}
-\end{equation}
+\end{eqnarray}

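 \noindent where the variable $\pi_{ik}$ denotes the position of the $k$-th source word generated by the $i$-th target word. For the other words in the list ($\tau_{[i]}$), namely $\tau_{[i]k}$ with $1 < k \le \varphi_{[i]}$, the distortion is computed as: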

-\begin{equation}
-\funp{P}(\pi_{[i]k}=j|{\pi}_{[i]1}^{k-1},\pi_1^{[i]-1},\tau_0^l,\varphi_0^l,\seq{t})=d_{>1}(j-\pi_{[i]k-1}|B(s_j))
+\begin{eqnarray}
+\funp{P}(\pi_{[i]k}=j|{\pi}_{[i]1}^{k-1},\pi_1^{[i]-1},\tau_0^l,\varphi_0^l,\seq{t}) & = & d_{>1}(j-\pi_{[i]k-1}|B(s_j))
 \label{eq:6-23}
-\end{equation}
+\end{eqnarray}

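 \parinterval The functions $A(\cdot)$ and $B(\cdot)$ map target-language and source-language words, respectively, to word classes, which reduces the size of the parameter space. Word classes can usually be obtained with external tools such as Brown clustering. A simpler alternative is to map each word directly to its part of speech, so that mature part-of-speech taggers can be used as they are.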

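-\parinterval The revised distortion model above shows that, for the first source word generated by $t_{[i]}$, the absolute distance between the center $\odot_{[i]}$ and this source word is taken into account. In effect, all source words generated by $t_{[i]}$ are treated as one unit that is placed at a suitable position. This placement is based on the word class of the first source word and the corresponding source center position, and the word class of the previous non-null-aligned target word $t_{[i-1]}$. For the other source words generated by $t_{[i]}$, only the position relative to the source word that has just been placed and the word class of the source word itself need to be considered.
+\parinterval The revised distortion model above shows that, for the first source word generated by $t_{[i]}$, the absolute distance between the center $\odot_{[i]}$ and this source word is taken into account. In effect, all source words generated by $t_{[i]}$ are treated as one unit that is placed at a suitable position. This placement is based on the word class of the first source word and the corresponding source-side center position, as well as the word class of the previous non-empty target word $t_{[i-1]}$. For the other source words generated by $t_{[i]}$, only the position relative to the source word that has just been placed and the word class of the source word itself need to be considered.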

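-\parinterval In effect, the process above then first lets the first source word generated by $t_{[i]}$ stand for the whole list of words generated by $t_{[i]}$ and places this first word at a suitable position. The other words in the list are then placed at suitable positions relative to the source word generated just before them. To some extent this ensures that source words generated by the same target word influence one another, which achieves the intended improvement.
+\parinterval In effect, the process above first lets the first source word generated by $t_{[i]}$ stand for the whole list of words generated by $t_{[i]}$ and places this first word at a suitable position. The other words in the list are then placed at suitable positions relative to the source word generated just before them. To some extent this ensures that source words generated by the same target word influence one another, which achieves the intended improvement.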

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -360,7 +360,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

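 \subsection{IBM Model 5}

-\parinterval Models 3 and 4 are not “exact” models: they assign part of the probability mass to sentences that do not exist at all. This problem is called the {\small\bfnew{deficiency}}\index{Deficiency} of IBM Models 3 and 4. More specifically, Models 3 and 4 lack the constraint that a position already occupied by a source word cannot receive another word, in other words that every position in the sentence holds exactly one word, no more and no less. Without this constraint, the probabilities of all legal word alignments in Models 3 and 4 do not sum to 1; the missing mass goes to illegal word alignments. For example, in Figure \ref{fig:6-9} the legal word alignments between “吃/早饭” and “have breakfast” are drawn as straight lines, but in Models 3 and 4 their probabilities sum to $0.9<1$. The lost probability is assigned to alignments such as 5 and 6 (in red). Although the IBM models do not support one-to-many alignments, Models 3 and 4 still assign probability to these “illegal” word alignments, which produces the so-called deficiency.
+\parinterval Models 3 and 4 are not “exact” models: they assign part of the probability mass to sentences that do not exist at all. This problem is called the {\small\bfnew{deficiency}}\index{Deficiency} of IBM Models 3 and 4. More specifically, Models 3 and 4 lack the constraint that a position already occupied by a source word cannot receive another word, in other words that every position in the sentence holds exactly one word, no more and no less. Without this constraint, the probabilities of all legal word alignments in Models 3 and 4 do not sum to 1; the missing mass goes to illegal word alignments. For example, in Figure \ref{fig:6-9} the legal word alignments between “吃/早饭” and “have breakfast” are drawn as straight lines, but in Models 3 and 4 their probabilities sum to $0.9<1$. The lost probability is assigned to alignments such as a5 and a6 (in red). Although the IBM models do not support one-to-many alignments, Models 3 and 4 still assign probability to these “illegal” word alignments, which produces the so-called deficiency.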

 %----------------------------------------------
 \begin{figure}[htp]
@@ -385,7 +385,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \label{eq:6-25}
 \end{eqnarray}

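-\noindent Here the factor $1-\delta(v_j, v_{j-1})$ checks whether the $j$-th position is vacant. If the $j$-th position is not vacant, then $v_j = v_{j-1}$, so that $\funp{P}(\pi_{[i]1}=j|\pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \seq{t}) = 0$. The model itself thus avoids the problem in Models 3 and 4 of generating strings that do not exist. Note also that, when the first word is placed, the placement is influenced by $v_j$, $B(s_i)$, and $v_{j-1}$. It must further be considered whether, after the first source word has been placed at position $j$, enough positions remain to its right for the remaining $k-1$ source words. The parameter $v_m-(\varphi_{[i]}-1)$ is there precisely to take this factor into account, where $v_m$ is the number of vacant positions left in the whole source sentence and $\varphi_{[i]}-1$ is the minimum number of vacancies that must be kept to the right of source position $j$. When a word other than the first is placed, what matters is mainly its position relative to the previously placed one, which is reflected in the parameter $v_j-v_{\varphi_{[i]}k-1}$. The remaining parts of Equation \eqref{eq:6-25} can be explained in the same way and are not discussed further.
+\noindent Here the factor $1-\delta(v_j, v_{j-1})$ checks whether the $j$-th position is vacant. If the $j$-th position is not vacant, then $v_j = v_{j-1}$, so that $\funp{P}(\pi_{[i]1}=j|\pi_1^{[i]-1}, \tau_0^l, \varphi_0^l, \seq{t}) = 0$. The model itself thus avoids the problem in Models 3 and 4 of generating strings that do not exist. Note also that, when the first word is placed, the placement is influenced by $v_j$, $B(s_i)$, and $v_{j-1}$. It must further be considered whether, after the first source word has been placed at position $j$, enough positions remain to its right for the remaining $k-1$ source words. The parameter $v_m-(\varphi_{[i]}-1)$ is there precisely to address this issue, where $v_m$ is the number of vacant positions left in the whole source sentence and $\varphi_{[i]}-1$ is the minimum number of vacancies that must be kept to the right of source position $j$. When a word other than the first is placed, what matters is mainly its position relative to the previously placed one, which is reflected in the parameter $v_j-v_{\varphi_{[i]}k-1}$. The remaining parts of Equation \eqref{eq:6-25} can be explained in the same way and are not discussed further.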

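 \parinterval In fact, Model 5 follows essentially the same idea as Model 4: first determine the absolute position of $\tau_{[i]1}$, and then determine the relative positions of the remaining words in $\tau_{[i]}$. Model 5 removes the possibility of generating nonexistent sentences, at the cost of a considerable increase in complexity.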
 %----------------------------------------------------------------------------------------
@@ -395,9 +395,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \sectionnewpage
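 \section{Decoding and Training}

-\parinterval As with IBM Model 1, decoding for IBM Models 2-5 and the hidden Markov model can use the method described in {\chapterfive} directly. The basic idea is the same as the left-to-right search described in {\chaptertwo}: the translation is generated from left to right, and each step extends it with the translation of one source word, that is, the translation of a source word is appended to the right of the translation generated so far. Each extension can choose a different source word or a different translation candidate of the same source word, which yields several different extended translations. During this process the translation model and language model scores are computed together, scoring each resulting translation candidate. In the end, one or more translations are kept. The process is repeated until all source words have been translated.
+\parinterval As with IBM Model 1, decoding for IBM Models 2-5 and the hidden Markov model can use the method described in {\chapterfive} directly. The basic idea is the same as the left-to-right search described in {\chaptertwo}: the translation is generated from left to right, and each step extends it with the translation of one source word, that is, the translation of a source word is appended to the right of the translation generated so far. Each extension can choose a different source word or a different translation candidate of the same source word, which yields several different extended translations. During this process the translation model and language model scores are computed together to score each of the resulting translation candidates. In the end, one or more translations are kept. The process is repeated until all source words have been translated.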

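-\parinterval Similarly, IBM Models 2-5 and the hidden Markov model can all be trained with the expectation-maximization (EM) method; the mathematical derivation is given in Appendix \ref{appendix-B}. These models are commonly used to obtain word alignments between bilingual sentences, for example with the GIZA++ toolkit. Several models are then usually chained, with the parameters trained by a simpler model sent to the next, more complex model as initial values: for example, train IBM Model 1 first, send its parameters to IBM Model 2 and train again, then send the parameters to the hidden Markov model, and so on. Note that EM does not find the global optimum for every model. In particular, training IBM Models 3-5 relies on pruning and approximation, so the objective actually being optimized is more complicated. The training objective of IBM Model 1, however, is a {\small\bfnew{convex function}}\index{Convex function}, so in theory EM can find its global optimum. A more practical benefit is that the final result of training IBM Model 1 does not depend on how the parameters are initialized, which is why IBM Model 1 is usually the starting model when the IBM model family is used.
+\parinterval Similarly, IBM Models 2-5 and the hidden Markov model can all be trained with the expectation-maximization (EM) method; the mathematical derivation is given in Appendix \ref{appendix-B}. These models are commonly used to obtain word alignments between bilingual sentences, for example with the GIZA++ toolkit. Several models are then usually chained, with the parameters trained by a simpler model passed to the next, more complex model as initial values: for example, train IBM Model 1 first, send its parameters to IBM Model 2 and train again, then send the parameters to the hidden Markov model, and so on. Note that EM does not find the global optimum for every model. In particular, training IBM Models 3-5 relies on pruning and approximation, so the objective actually being optimized is more complicated. The training objective of IBM Model 1, however, is a {\small\bfnew{convex function}}\index{Convex function}, so in theory EM can find its global optimum. A more practical benefit is that the final result of training IBM Model 1 does not depend on how the parameters are initialized, which is why IBM Model 1 is usually the starting model when the IBM model family is used.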

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -428,13 +428,13 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

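 \parinterval The deficiency of the IBM models means that the translation model assigns part of the probability mass to source-language strings that do not exist at all. Let $\funp{P}(\textrm{well}|\seq{t})$ denote the sum of $\funp{P}(\seq{s}| \seq{t})$ over all well-formed (that is, grammatically correct) $\seq{s}$: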
 \begin{eqnarray}
-\funp{P}(\textrm{well}|\seq{t})=\sum_{\seq{s}\textrm{\;is\;well\;formed}}{\funp{P}(\seq{s}| \seq{t})}
+\funp{P}(\textrm{well}|\seq{t}) & = & \sum_{\seq{s}\textrm{\;is\;well\;formed}}{\funp{P}(\seq{s}| \seq{t})}
 \label{eq:6-26}
 \end{eqnarray}

 \parinterval Similarly, let $\funp{P}(\textrm{ill}|\seq{t})$ denote the sum of $\funp{P}(\seq{s}| \seq{t})$ over all ill-formed (that is, grammatically incorrect) $\seq{s}$. If $\funp{P}(\textrm{well}|\seq{t})+ \funp{P}(\textrm{ill}|\seq{t})<1$, the remainder is defined as $\funp{P}(\textrm{failure}|\seq{t})$, formally:
 \begin{eqnarray}
-\funp{P}({\textrm{failure}|\seq{t}})  = 1 - \funp{P}({\textrm{well}|\seq{t}}) - \funp{P}({\textrm{ill}|\seq{t}})
+\funp{P}({\textrm{failure}|\seq{t}}) & = & 1 - \funp{P}({\textrm{well}|\seq{t}}) - \funp{P}({\textrm{ill}|\seq{t}})
 \label{eq:6-27}
 \end{eqnarray}
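
 \parinterval As a numerical illustration of Equation \eqref{eq:6-27} (the values are assumed and do not come from a trained model), if $\funp{P}(\textrm{well}|\seq{t})=0.6$ and $\funp{P}(\textrm{ill}|\seq{t})=0.3$, then
 \begin{eqnarray}
 \funp{P}({\textrm{failure}|\seq{t}}) & = & 1 - 0.6 - 0.3 \;\; = \;\; 0.1 \nonumber
 \end{eqnarray}
 \noindent that is, one tenth of the probability mass is spent on outcomes that are not source-language strings at all.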

@@ -452,7 +452,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

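 \parinterval In the IBM models, $\funp{P}(\seq{t})\funp{P}(\seq{s}| \seq{t})$ decreases as the target sentence gets longer, because the model is a product of many probabilistic factors, and the more factors are multiplied, the smaller the result. In other words, the IBM models tend to prefer shorter target sentences. This bias toward short sentences is clearly not what machine translation wants.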
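
 \parinterval A rough numerical sketch makes the length effect concrete. Assume, purely for illustration, that every factor in the product equals 0.5. A candidate scored with 3 factors and a longer candidate scored with 5 factors then receive
 \begin{eqnarray}
 0.5^{3} \;\; = \;\; 0.125 & > & 0.5^{5} \;\; = \;\; 0.03125 \nonumber
 \end{eqnarray}
 \noindent so the longer candidate is penalized by a factor of 4 simply because it contributes more terms, regardless of translation quality.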

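-\parinterval This problem exists in many machine translation systems and is in fact a form of {\small\bfnew{system bias}}\index{System bias}. To remove the bias, a short-sentence penalty term can be added to the model to offset its preference for short sentences. For example, one can define a penalty term whose value increases as the length decreases. Simply introducing such a penalty factor, however, means that the model is no longer a strict noisy-channel model; it corresponds instead to a translation model in a discriminative framework, which is covered in {\chapterseven}.
+\parinterval This problem exists in many machine translation systems and is in fact a form of {\small\bfnew{system bias}}\index{System bias}. To remove the bias, a short-sentence penalty factor can be added to the model to offset its preference for short sentences. For example, one can define a penalty factor whose value increases as the length decreases. Simply introducing such a penalty factor, however, means that the model is no longer a strict noisy-channel model; it corresponds instead to a translation model in a discriminative framework, which is covered in {\chapterseven}.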

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -460,7 +460,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

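 \subsection{Other Issues}

-\parinterval What is the significance of Model 5? Model 5 was proposed to remove the deficiency of Models 3 and 4. In essence, the deficiency is that $\funp{P}(\seq{s},\seq{a}| \seq{t})$ does not sum to 1 over all reasonable alignments. What matters more here, however, is which alignment $\seq{a}$ maximizes $\funp{P}(\seq{s},\seq{a}| \seq{t})$; even if $\funp{P}(\seq{s},\seq{a}|\seq{t})$ does not meet the definition of a probability distribution, the search for the ideal alignment $\seq{a}$ is not affected. From an engineering point of view, an unnormalized $\funp{P}(\seq{s},\seq{a}| \seq{t})$ is not a very serious problem. Unfortunately, although a great deal of work has already carried out systematic experiments and analyses of the deficiency in IBM Models 3 and 4, there is still no consensus on how serious the problem really is. Model 5 can of course solve it, but a very complex model that solves a problem without serious consequences is of limited value, at least from a practical point of view.
+\parinterval What is the significance of Model 5? Model 5 was proposed to remove the deficiency of Models 3 and 4. In essence, the deficiency is that $\funp{P}(\seq{s},\seq{a}| \seq{t})$ does not sum to 1 over all reasonable alignments. What matters more here, however, is which alignment $\seq{a}$ maximizes $\funp{P}(\seq{s},\seq{a}| \seq{t})$; even if $\funp{P}(\seq{s},\seq{a}|\seq{t})$ does not meet the definition of a probability distribution, the search for the ideal alignment $\seq{a}$ is not affected. From an engineering point of view, an unnormalized $\funp{P}(\seq{s},\seq{a}| \seq{t})$ is not a very serious problem. Unfortunately, although a great deal of work has carried out systematic experiments and analyses of the deficiency in IBM Models 3 and 4, there is still no consensus on how serious the problem really is. Model 5 can of course solve it, but a very complex model that solves a problem without serious consequences is of limited value, at least from a practical point of view.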

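 \parinterval What is the significance of the concept (cept.)? The analysis above shows that the word alignment model of the IBM models uses the notion of a cept. However, a cept in the IBM models can correspond to at most one target word (the models do not use source-side cepts), so a cept can simply be replaced by a word. The IBM models can therefore be formulated just as well without introducing cepts. The notion does help explain the word alignment process from a syntactic and semantic point of view, but how much it actually contributes in the IBM models remains an open question.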


--- a/Chapter7/Figures/figure-basic-process-of-translation.tex
+++ b/Chapter7/Figures/figure-basic-process-of-translation.tex
@@ -11,7 +11,7 @@

 \node[anchor=east] (t0) at (-0.5em, -1.5) {$\seq{t}$：};

-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(a)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(a)\ }};
 \end{scope}


@@ -29,7 +29,7 @@
 \path[<->, thick] (s2.south) edge (t1.north);
 }

-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(b)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(b)\ }};
 \end{scope}


@@ -50,7 +50,7 @@
 \node[anchor=west,fill=red!20] (t2) at ([xshift=1em]t1.east) {\footnotesize{an apple}};
 \path[<->, thick] (s3.south) edge (t2.north);
 }
-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(c)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(c)\ }};
 \end{scope}


@@ -76,6 +76,6 @@
 \node[anchor=west,fill=red!20] (t3) at ([xshift=1em]t2.east) {\footnotesize{on the table}};
 \path[<->, thick] (s1.south) edge (t3.north);
 }
-\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\footnotesize{(d)\ }};
+\node[anchor=north] (l) at ([xshift=7em,yshift=-0.5em]t0.south) {\small{(d)\ }};
 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
+++ b/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
@@ -5,7 +5,7 @@
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};

 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
@@ -13,8 +13,8 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};

 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h2.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=-0.1em]h3.west);
@@ -22,7 +22,7 @@
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h1.west);
 }
 }
@@ -37,10 +37,10 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h6.north west) {\scriptsize{{\color{white} \textbf{1}}}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h8.north west) {\scriptsize{{\color{white} \textbf{2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.4}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};

 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h5.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h6.west);
@@ -48,7 +48,7 @@

 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=-0.1em]h7.west);
 }
@@ -56,7 +56,7 @@

 \node[anchor=north] (l1) at ([xshift=5.5em,yshift=-1em]h0.south) {\scriptsize{original hypotheses}};
 \node[anchor=north] (l2) at ([xshift=5.5em,yshift=-1em]h4.south) {\scriptsize{original hypotheses}};
-\node[anchor=north] (part1) at ([xshift=16em,yshift=-2em]h0.south){\scriptsize{(a) hypothesis recombination when the translations are the same}};
+\node[anchor=north] (part1) at ([xshift=16em,yshift=-3em]h0.south){\small{(a) hypothesis recombination when the translations are the same}};

 \end{scope}

@@ -68,7 +68,7 @@
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};

 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
@@ -76,8 +76,8 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};

 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h2.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=-0.1em]h3.west);
@@ -97,10 +97,10 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h6.north west) {\scriptsize{{\color{white} \textbf{1}}}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h8.north west) {\scriptsize{{\color{white} \textbf{2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.3}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.4}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};

 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h5.west);
 \draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h6.west);
@@ -115,11 +115,11 @@
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
 \node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt1) at (h1.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 }
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt7) at (h7.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 }
 }
@@ -131,7 +131,7 @@

 \node[anchor=north] (l1) at ([xshift=7.5em,yshift=-1em]h0.south) {\scriptsize{recombined hypotheses}};
 \node[anchor=north] (l2) at ([xshift=7.5em,yshift=-1em]h4.south) {\scriptsize{recombined hypotheses}};
-\node[anchor=north] (part2) at ([xshift=0em,yshift=-14em]h0.south){\scriptsize{(b) hypothesis recombination when the translations differ}};
+\node[anchor=north] (part2) at ([xshift=0em,yshift=-14em]h0.south){\small{(b) hypothesis recombination when the translations differ}};
 \end{scope}



--- a/Chapter7/Figures/figure-example-of-stack-decode.tex
+++ b/Chapter7/Figures/figure-example-of-stack-decode.tex
@@ -6,7 +6,7 @@
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\scriptsize{null}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\tiny{{\color{white} \textbf{$\funp{P}$=1}}}};
 }
 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h13) at ([xshift=2.1em,yshift=6em]h0.east) {\scriptsize{there is}};
@@ -17,8 +17,8 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h13.north west) {\scriptsize{{\color{white} \textbf{3}}}};


-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h13.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h13.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};

 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.1em]h1.east) {\scriptsize{have}};
 \node [anchor=west,inner sep=2pt,minimum height=2em,minimum width=3em] (h22) at ([xshift=2.1em]h12.east) {\small{\textbf{...}}};
@@ -32,10 +32,10 @@
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl33) at (h33.north west) {\scriptsize{{\color{white} \textbf{4-5}}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt23) at (h23.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt33) at (h33.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt23) at (h23.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt33) at (h33.east) {\tiny{{\color{white} \textbf{$\funp{P}$=0.5}}}};
 }
 \node [anchor=north] (l0) at ([xshift=0.2em,yshift=-0.7em]h0.south) {\small{\textbf{未译词}}};
 \node [anchor=north] (l1) at ([xshift=0.3em,yshift=-0.7em]h1.south) {\small{\textbf{已译}1\textbf{词}}};

--- a/Chapter7/Figures/figure-judge-type-of-reorder-method.tex
+++ b/Chapter7/Figures/figure-judge-type-of-reorder-method.tex
@@ -105,8 +105,8 @@
 \node[anchor=north] (m1) at ([xshift=0.6em,yshift=0.1em]b05.east) {M};
 }

-\node[anchor=north] (l1) at ([xshift=1.8em,yshift=-0.5em]a10.south) {\scriptsize{基于词}};
-\node[anchor=north] (l2) at ([xshift=2.2em,yshift=-0.5em]b10.south) {\scriptsize{基于短语}};
+\node[anchor=north] (l1) at ([xshift=1.8em,yshift=-0.5em]a10.south) {\small{基于词}};
+\node[anchor=north] (l2) at ([xshift=2.2em,yshift=-0.5em]b10.south) {\small{基于短语}};

 \end{scope}


--- a/Chapter7/Figures/figure-search-space-representation-of-feature-weight.tex
+++ b/Chapter7/Figures/figure-search-space-representation-of-feature-weight.tex
@@ -27,7 +27,7 @@
 \node[anchor=north] (label3) at ([xshift=0em,yshift=-2.5em]label2.north) {取值};	
 }

-\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\footnotesize{(a)搜索空间}};
+\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\small{(a)搜索空间}};
 \end{scope}

 \begin{scope}[scale=0.55,xshift=3.2in] 
@@ -68,7 +68,7 @@
 \node[anchor=north] (e4) at ([xshift=0,yshift=-0.2em]e3.south) {$w_M = 1.00$};
 }

-\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\footnotesize{(b)一条搜索路径}};
+\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\small{(b)一条搜索路径}};
 \end{scope}

 \begin{scope}[scale=0.55,xshift=6.8in] 
@@ -119,6 +119,6 @@
 \node[anchor=north] (label2) at ([xshift=0em,yshift=-2.5em]label1.north) {种组合};
 }

-\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\footnotesize{(c)多条搜索路径}};
+\node[anchor=north] (l1) at ([xshift=0em,yshift=-2.5em]x3.south) {\small{(c)多条搜索路径}};
 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter7/Figures/figure-translation-hypothesis-extension.tex
+++ b/Chapter7/Figures/figure-translation-hypothesis-extension.tex
@@ -5,24 +5,24 @@
 \begin{scope}
 {
 \node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h0) at (0,0) {\small{null}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\tiny{{\color{white} \textbf{0}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt0) at (h0.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
 }

 {
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h1) at ([xshift=3em]h0.east) {\small{on}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h2) at ([xshift=3em,yshift=3em]h0.east) {\small{table}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h3) at ([xshift=3em,yshift=-3em]h0.east) {\small{there is}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{2}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt2) at (h2.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h3.east) {\footnotesize{{\color{white} \textbf{P=.5}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl1) at (h1.north west) {\tiny{{\color{white} \textbf{2}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl2) at (h2.north west) {\tiny{{\color{white} \textbf{1}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl3) at (h3.north west) {\tiny{{\color{white} \textbf{3}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt1) at (h1.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.2}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt2) at (h2.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.3}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt3) at (h3.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.5}}}};

-\draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h1.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h2.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h3.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h0.east) -- ([xshift=-0.1em]h1.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h0.east) -- ([xshift=-0.1em]h2.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h0.east) -- ([xshift=-0.1em]h3.west);
 }

 {
@@ -32,40 +32,40 @@
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=4em] (h7) at ([xshift=3em,yshift=1.2em]h5.east) {\small{on the table}};
 \node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=4.6em] (h8) at ([xshift=3em,yshift=-2em]h5.east) {\small{\ \;apple}};

-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl4) at (h4.north west) {\scriptsize{{\color{white} \textbf{4}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl5) at (h5.north west) {\scriptsize{{\color{white} \textbf{4-5}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl6) at (h6.north west) {\scriptsize{{\color{white} \textbf{1}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl7) at (h7.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
-\node [anchor=north west,inner sep=1.5pt,fill=black] (hl8) at (h8.north west) {\scriptsize{{\color{white} \textbf{5}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl4) at (h4.north west) {\tiny{{\color{white} \textbf{4}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl5) at (h5.north west) {\tiny{{\color{white} \textbf{4-5}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl6) at (h6.north west) {\tiny{{\color{white} \textbf{1}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl7) at (h7.north west) {\tiny{{\color{white} \textbf{1-2}}}};
+\node [anchor=north west,inner sep=1.5pt,fill=black] (hl8) at (h8.north west) {\tiny{{\color{white} \textbf{5}}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt4) at (h4.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.1}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt5) at (h5.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt6) at (h6.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.3}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.4}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt8) at (h8.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt4) at (h4.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.1}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt5) at (h5.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.4}}}};
+\node [anchor=north,inner sep=1pt,minimum width=3.5em,fill=black] (pt6) at (h6.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.3}}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.6em,fill=black] (pt7) at (h7.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.4}}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.6em,fill=black] (pt8) at (h8.south) {\footnotesize{{\color{white} \textbf{$\funp{P}$=0.2}}}};

-\draw [->,very thick,ublue] ([xshift=0.1em]pt1.south) -- ([xshift=1em,yshift=0.7em]pt1.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h6.east) -- ([xshift=1em,yshift=0.7em]h6.east);
+\draw [->,very thick,ublue] ([xshift=0.1em]h6.east) -- ([xshift=1em,yshift=-0.7em]h6.east);

-\draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=1em,yshift=-0.7em]pt2.south);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=1em,yshift=0.7em]pt2.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h2.east) -- ([xshift=1em,yshift=0.7em]h2.east);
+\draw [->,very thick,ublue] ([xshift=0.1em]h2.east) -- ([xshift=1em,yshift=-0.7em]h2.east);

-\draw [->,very thick,ublue] ([xshift=0.1em]pt6.south) -- ([xshift=1em,yshift=-0.7em]pt6.south);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt6.south) -- ([xshift=1em,yshift=0.7em]pt6.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h1.east) -- ([xshift=1em,yshift=0.7em]h1.east);

-\draw [->,very thick,ublue] ([xshift=0.1em]pt3.south) -- ([xshift=-0.1em]h4.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt3.south) -- ([xshift=-0.1em]h5.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt3.south) -- ([xshift=-0.1em]h6.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h3.east) -- ([xshift=-0.1em]h4.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h3.east) -- ([xshift=-0.1em]h5.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h3.east) -- ([xshift=-0.1em]h6.west);

-\draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=-0.1em]h7.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=1em,yshift=-0.7em]pt5.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h5.east) -- ([xshift=-0.1em]h7.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h5.east) -- ([xshift=1em,yshift=-0.7em]h5.east);

-\draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=-0.1em]h8.west);
-\draw [->,very thick,ublue] ([xshift=0.1em]pt4.south) -- ([xshift=1em,yshift=-0.7em]pt4.south);
+\draw [->,very thick,ublue] ([xshift=0.1em]h4.east) -- ([xshift=-0.1em]h8.west);
+\draw [->,very thick,ublue] ([xshift=0.1em]h4.east) -- ([xshift=1em,yshift=-0.7em]h4.east);
 }

 {
-\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]h0.west) -- ([xshift=0.7em,yshift=-0.5em]h0.east) -- ([xshift=-0.2em,yshift=-0.5em]h3.west) -- ([xshift=0.8em,yshift=-0.5em]h3.east) -- ([xshift=-0.2em,yshift=-0.5em]h5.west) -- ([xshift=0.8em,yshift=-0.5em]h5.east) -- ([xshift=-0.2em,yshift=-0.5em]h7.west) -- ([xshift=1.5em,yshift=-0.5em]h7.east);
-\node [anchor=north west] (wtranslabel) at ([yshift=-3em]h0.south west) {\small{翻译路径：}};
+\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.6em]h0.west) -- ([xshift=0.0em,yshift=-0.6em]h0.east) -- ([xshift=-0.2em,yshift=-0.6em]h3.west) -- ([xshift=0.0em,yshift=-0.6em]h3.east) -- ([xshift=-0.2em,yshift=-0.6em]h5.west) -- ([xshift=0.0em,yshift=-0.6em]h5.east) -- ([xshift=-0.2em,yshift=-0.6em]h7.west) -- ([xshift=1.5em,yshift=-0.6em]h7.east);
+\node [anchor=north west] (wtranslabel) at ([yshift=-4.3em]h0.south west) {\small{翻译路径：}};
 \draw [->,ultra thick,red,line width=1.5pt,opacity=0.7] (wtranslabel.east) -- ([xshift=1.5em]wtranslabel.east);
 }
 \end{scope}

--- a/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
+++ b/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
@@ -19,59 +19,59 @@
 \draw [->,very thick,ublue] (s5.south) -- ([yshift=-0.7em]s5.south);

 {\small
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.2em]t11.south) {me};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.2em]t12.south) {I'm};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t12) at ([yshift=-0.8em]t11.south) {me};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=2.5em] (t13) at ([yshift=-0.8em]t12.south) {I'm};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl11) at (t11.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl12) at (t12.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl13) at (t13.north west) {\tiny{{\color{white} \textbf{1}}}};

 {
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=6.55em] (t14) at ([yshift=-0.2em]t13.south west) {I'm};
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=6.55em] (t15) at ([yshift=-0.2em]t14.south west) {I};
+\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=6.55em] (t14) at ([yshift=-0.8em]t13.south west) {I'm \quad \quad};
+\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=6.55em] (t15) at ([yshift=-0.8em]t14.south west) {I \quad \quad};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl14) at (t14.north west) {\tiny{{\color{white} \textbf{1-2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl15) at (t15.north west) {\tiny{{\color{white} \textbf{1-2}}}};
 }

-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.2em]t21.south) {with};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.2em]t22.south) {for};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t22) at ([yshift=-0.8em]t21.south) {with};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=2.5em] (t23) at ([yshift=-0.8em]t22.south) {for};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl21) at (t21.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl22) at (t22.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl23) at (t23.north west) {\tiny{{\color{white} \textbf{2}}}};

 {
-\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=6.55em] (t24) at ([yshift=-0.2em,xshift=-2.6em]t15.south east) {for you};
-\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=6.55em] (t25) at ([yshift=-0.2em]t24.south west) {with you};
+\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=6.55em] (t24) at ([yshift=-0.8em,xshift=-2.6em]t15.south east) {for you};
+\node [anchor=north west,inner sep=2pt,fill=green!20,minimum height=1.6em,minimum width=6.55em] (t25) at ([yshift=-0.8em]t24.south west) {with you};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl24) at (t24.north west) {\tiny{{\color{white} \textbf{2-3}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl25) at (t25.north west) {\tiny{{\color{white} \textbf{2-3}}}};
 }

-\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
+\node [anchor=north,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=2.5em] (t31) at ([yshift=-1em]s3.south) {you};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl31) at (t31.north west) {\tiny{{\color{white} \textbf{3}}}};

 {
-\node [anchor=west,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=13.35em] (t32) at ([xshift=1.4em]t14.east) {you are satisfied};
-\node [anchor=north west,inner sep=2pt,fill=blue!20,minimum height=1.5em,minimum width=7.45em] (t33) at ([yshift=-0.2em]t32.south west) {$\phi$};
+\node [anchor=west,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=13.35em] (t32) at ([xshift=1.4em]t14.east) {\quad \; you are satisfied};
+\node [anchor=north west,inner sep=2pt,fill=blue!20,minimum height=1.6em,minimum width=7.45em] (t33) at ([yshift=-0.8em]t32.south west) {$\phi$ \quad \; \ };
 \node [anchor=north west,inner sep=1pt,fill=black] (tl32) at (t32.north west) {\tiny{{\color{white} \textbf{3-5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl33) at (t33.north west) {\tiny{{\color{white} \textbf{3-4}}}};
 }

-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.2em]t41.south) {show};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.6em,minimum width=3em] (t42) at ([yshift=-0.8em]t41.south) {show};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl41) at (t41.north west) {\tiny{{\color{white} \textbf{4}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl42) at (t42.north west) {\tiny{{\color{white} \textbf{4}}}};

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=9.00em] (t43) at ([xshift=1.75em]t24.east) {satisfied};
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=9.00em] (t44) at ([yshift=-0.2em]t43.south west) {satisfactory};
+\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=9.00em] (t43) at ([xshift=1.75em]t24.east) {satisfied};
+\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=1.6em,minimum width=9.00em] (t44) at ([yshift=-0.8em]t43.south west) {satisfactory};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl43) at (t43.north west) {\tiny{{\color{white} \textbf{4-5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl44) at (t44.north west) {\tiny{{\color{white} \textbf{4-5}}}};
 }

-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.2em]t51.south) {satisfied};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t52) at ([yshift=-0.8em]t51.south) {satisfied};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.6em,minimum width=4.5em] (t53) at ([yshift=-0.8em]t52.south) {satisfies};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl51) at (t51.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl52) at (t52.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl53) at (t53.north west) {\tiny{{\color{white} \textbf{5}}}};
@@ -80,38 +80,38 @@

 {\tiny

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt11) at (t11.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt12) at (t12.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt13) at (t13.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt11) at (t11.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt12) at (t12.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt13) at (t13.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt14) at (t14.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt15) at (t15.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt14) at (t14.south) {{\color{white} \textbf{$\funp{P}$=0.1}\quad \quad}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt15) at (t15.south) {{\color{white} \textbf{$\funp{P}$=0.2}\quad \quad}};
 }

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt21) at (t21.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt22) at (t22.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt23) at (t23.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt21) at (t21.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt22) at (t22.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt23) at (t23.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt24) at (t24.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt25) at (t25.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt24) at (t24.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=10.9em,fill=black] (pt25) at (t25.south) {{\color{white} \textbf{$\funp{P}$=0.1}}};
 }

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt31) at (t31.east) {{\color{white} \textbf{$\funp{P}$=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt31) at (t31.south) {{\color{white} \textbf{$\funp{P}$=1}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt33) at (t32.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt33) at (t33.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=22.2em,fill=black] (pt32) at (t32.south) {{\color{white} \textbf{\quad $\funp{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=12.4em,fill=black] (pt33) at (t33.south) {{\color{white} \textbf{$\funp{P}$=0.3}\quad \quad \quad \;}};
 }

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt41) at (t41.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt42) at (t42.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt41) at (t41.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt42) at (t42.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
 {
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt43) at (t43.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt44) at (t44.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=15em,fill=black] (pt43) at (t43.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=15em,fill=black] (pt44) at (t44.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
 }

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt51) at (t51.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt52) at (t52.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt53) at (t53.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt51) at (t51.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt52) at (t52.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt53) at (t53.south) {{\color{white} \textbf{$\funp{P}$=0.1}}};

 }

@@ -129,13 +129,13 @@
 \draw[decorate,thick,decoration={brace,amplitude=5pt,mirror}] ([yshift=0em,xshift=-0.5em]t11.north west) -- ([xshift=-0.5em]t13.south west) node [pos=0.5,left,xshift=1.0em,yshift=0.0em,text width=5em,align=left] (label2) {\footnotesize{\textbf{单词翻译}}};

 {
-\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]t13.west) -- ([xshift=0.8em,yshift=-0.5em]t13.east) -- ([xshift=-0.2em,yshift=-0.5em]t22.west) -- ([xshift=0.8em,yshift=-0.5em]t22.east) -- ([xshift=-0.2em,yshift=-0.5em]t31.west) -- ([xshift=0.8em,yshift=-0.5em]t31.east) -- ([xshift=-0.2em,yshift=-0.5em]t41.west) -- ([xshift=0.8em,yshift=-0.5em]t41.east) -- ([xshift=-0.2em,yshift=-0.5em]t52.west) -- ([xshift=1.2em,yshift=-0.5em]t52.east);
+\draw [->,ultra thick,red,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.62em]t13.west) -- ([xshift=0.8em,yshift=-0.62em]t13.east) -- ([xshift=-0.2em,yshift=-0.62em]t22.west) -- ([xshift=0.8em,yshift=-0.62em]t22.east) -- ([xshift=-0.2em,yshift=-0.62em]t31.west) -- ([xshift=0.8em,yshift=-0.62em]t31.east) -- ([xshift=-0.2em,yshift=-0.62em]t41.west) -- ([xshift=0.8em,yshift=-0.62em]t41.east) -- ([xshift=-0.2em,yshift=-0.62em]t52.west) -- ([xshift=1.2em,yshift=-0.62em]t52.east);
 }

 {
 \draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.5em]t15.west) -- ([xshift=0.8em,yshift=-0.5em]t15.east) -- ([xshift=-0.2em,yshift=-0.5em]t32.west) -- ([xshift=1.2em,yshift=-0.5em]t32.east);

-\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.4em]t13.west) -- ([xshift=0.8em,yshift=-0.4em]t13.east) -- ([xshift=-0.2em,yshift=-0.4em]t25.west) -- ([xshift=0.8em,yshift=-0.4em]t25.east) -- ([xshift=-0.2em,yshift=-0.4em]t41.west) -- ([xshift=0.8em,yshift=-0.4em]t41.east) -- ([xshift=-0.2em,yshift=-0.4em]t52.west) -- ([xshift=1.2em,yshift=-0.4em]t52.east);
+\draw [->,ultra thick,ublue,line width=2pt,opacity=0.7] ([xshift=-0.5em,yshift=-0.42em]t13.west) -- ([xshift=0.8em,yshift=-0.42em]t13.east) -- ([xshift=-0.2em,yshift=-0.5em]t25.west) -- ([xshift=0.8em,yshift=-0.5em]t25.east) -- ([xshift=-0.2em,yshift=-0.42em]t41.west) -- ([xshift=0.8em,yshift=-0.42em]t41.east) -- ([xshift=-0.2em,yshift=-0.42em]t52.west) -- ([xshift=1.2em,yshift=-0.42em]t52.east);
 }

 {
@@ -143,12 +143,12 @@
 }

 {
-\node [anchor=north west] (wtranslabel) at ([yshift=-4em]t15.south west) {\scriptsize{翻译路径（仅包含单词）}};
+\node [anchor=north west] (wtranslabel) at ([yshift=-5em]t15.south west) {\scriptsize{翻译路径（仅包含单词）}};
 \draw [->,ultra thick,red,line width=1.5pt,opacity=0.7] ([xshift=0.2em]wtranslabel.east) -- ([xshift=1.2em]wtranslabel.east);
 }

 {
-\node [anchor=north west] (ptranslabel) at ([yshift=-5.5em]t15.south west) {\scriptsize{翻译路径（含有短语）}};
+\node [anchor=north west] (ptranslabel) at ([yshift=-6.5em]t15.south west) {\scriptsize{翻译路径（含有短语）}};
 \draw [->,ultra thick,ublue,line width=1.5pt,opacity=0.7] ([xshift=0.95em]ptranslabel.east) -- ([xshift=1.95em]ptranslabel.east);
 }


--- a/Chapter7/Figures/figure-word-translation-regard-as-path.tex
+++ b/Chapter7/Figures/figure-word-translation-regard-as-path.tex
@@ -20,15 +20,15 @@

 {\small
 \node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t11) at ([yshift=-1em]s1.south) {I};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.2em]t11.south) {me};
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.2em]t12.south) {I'm};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t12) at ([yshift=-0.8em]t11.south) {me};
+\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=1.5em,minimum width=2.5em] (t13) at ([yshift=-0.8em]t12.south) {I'm};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl11) at (t11.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl12) at (t12.north west) {\tiny{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl13) at (t13.north west) {\tiny{{\color{white} \textbf{1}}}};

 \node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t21) at ([yshift=-1em]s2.south) {to};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.2em]t21.south) {with};
-\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.2em]t22.south) {for};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t22) at ([yshift=-0.8em]t21.south) {with};
+\node [anchor=north,inner sep=2pt,fill=green!20,minimum height=1.5em,minimum width=2.5em] (t23) at ([yshift=-0.8em]t22.south) {for};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl21) at (t21.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl22) at (t22.north west) {\tiny{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl23) at (t23.north west) {\tiny{{\color{white} \textbf{2}}}};
@@ -37,13 +37,13 @@
 \node [anchor=north west,inner sep=1pt,fill=black] (tl31) at (t31.north west) {\tiny{{\color{white} \textbf{3}}}};

 \node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t41) at ([yshift=-1em]s4.south) {$\phi$};
-\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.2em]t41.south) {show};
+\node [anchor=north,inner sep=2pt,fill=orange!20,minimum height=1.5em,minimum width=3em] (t42) at ([yshift=-0.8em]t41.south) {show};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl41) at (t41.north west) {\tiny{{\color{white} \textbf{4}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl42) at (t42.north west) {\tiny{{\color{white} \textbf{4}}}};

 \node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t51) at ([yshift=-1em]s5.south) {satisfy};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.2em]t51.south) {satisfied};
-\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.2em]t52.south) {satisfies};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t52) at ([yshift=-0.8em]t51.south) {satisfied};
+\node [anchor=north,inner sep=2pt,fill=purple!20,minimum height=1.5em,minimum width=4.5em] (t53) at ([yshift=-0.8em]t52.south) {satisfies};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl51) at (t51.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl52) at (t52.north west) {\tiny{{\color{white} \textbf{5}}}};
 \node [anchor=north west,inner sep=1pt,fill=black] (tl53) at (t53.north west) {\tiny{{\color{white} \textbf{5}}}};
@@ -52,22 +52,22 @@

 {\tiny

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt11) at (t11.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt12) at (t12.east) {{\color{white} \textbf{$\funp{P}$=.2}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt13) at (t13.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt11) at (t11.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt12) at (t12.south) {{\color{white} \textbf{$\funp{P}$=0.2}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt13) at (t13.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt21) at (t21.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt22) at (t22.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt23) at (t23.east) {{\color{white} \textbf{$\funp{P}$=.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt21) at (t21.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt22) at (t22.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt23) at (t23.south) {{\color{white} \textbf{$\funp{P}$=0.3}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt31) at (t31.east) {{\color{white} \textbf{$\funp{P}$=1}}};
+\node [anchor=north,inner sep=1pt,minimum width=4.2em,fill=black] (pt31) at (t31.south) {{\color{white} \textbf{$\funp{P}$=1}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt41) at (t41.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt42) at (t42.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt41) at (t41.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=5em,fill=black] (pt42) at (t42.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};

-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt51) at (t51.east) {{\color{white} \textbf{$\funp{P}$=.5}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt52) at (t52.east) {{\color{white} \textbf{$\funp{P}$=.4}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2.55em,fill=black] (pt53) at (t53.east) {{\color{white} \textbf{$\funp{P}$=.1}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt51) at (t51.south) {{\color{white} \textbf{$\funp{P}$=0.5}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt52) at (t52.south) {{\color{white} \textbf{$\funp{P}$=0.4}}};
+\node [anchor=north,inner sep=1pt,minimum width=7.5em,fill=black] (pt53) at (t53.south) {{\color{white} \textbf{$\funp{P}$=0.1}}};

 }


--- a/Chapter7/chapter7.tex
+++ b/Chapter7/chapter7.tex
@@ -206,7 +206,7 @@ p_4 &=& \text{问题}\nonumber
 \parinterval 图\ref{fig:7-10}给出了一个由三个双语短语$\{(\bar{s}_{\bar{a}_1},\bar{t}_1),(\bar{s}_{\bar{a}_2},\bar{t}_2),(\bar{s}_{\bar{a}_3},\bar{t}_3)\}$ 构成的汉英互译句对，其中短语对齐信息为$\bar{a}_1 = 1$，$\bar{a}_2 = 2$，$\bar{a}_3 = 3$。这里，可以把这三个短语对的组合看作是翻译推导，形式化表示为如下公式：

 \begin{eqnarray}
-d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \circ {(\bar{s}_{\bar{a}_3},\bar{t}_3)}
+d & = & {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \circ {(\bar{s}_{\bar{a}_3},\bar{t}_3)}
 \label{eq:7-1}
 \end{eqnarray}

@@ -245,7 +245,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 对于统计机器翻译，其目的是找到输入句子的可能性最大的译文：
 \begin{eqnarray}
-\hat{\seq{t}} = \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
+\hat{\seq{t}} & = & \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})
 \label{eq:7-2}
 \end{eqnarray}

@@ -261,7 +261,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 基于短语的翻译模型假设$\seq{s}$到$\seq{t}$的翻译可以用翻译推导进行描述，这些翻译推导都是由双语短语组成。于是，两个句子之间的映射就可以被看作是一个个短语的映射。显然短语翻译的建模要比整个句子翻译的建模简单得多。从模型上看，可以把翻译推导$d$当作是从$\seq{s}$到$\seq{t}$翻译的一种隐含结构。这种结构定义了对问题的一种描述，即翻译由一系列短语组成。根据这个假设，可以把句子的翻译概率定义为：
 \begin{eqnarray}
-\funp{P}(\seq{t}|\seq{s}) = \sum_{d} \funp{P}(d,\seq{t}|\seq{s})
+\funp{P}(\seq{t}|\seq{s}) & = & \sum_{d} \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-3}
 \end{eqnarray}

@@ -282,25 +282,25 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 另一种常用的方法是直接用$\funp{P}(d,\seq{t}|\seq{s})$的最大值代表整个翻译推导的概率和。这种方法假设翻译概率是非常尖锐的，“最好”的推导会占有概率的主要部分。它被形式化为：
 \begin{eqnarray}
-\funp{P}(\seq{t}|\seq{s}) \approx \max \funp{P}(d,\seq{t}|\seq{s})
+\funp{P}(\seq{t}|\seq{s}) & \approx & \max \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-6}
 \end{eqnarray}

 \parinterval 于是，翻译的目标可以被重新定义：
 \begin{eqnarray}
-\hat{\seq{t}} = \arg\max_{\seq{t}} (\max \funp{P}(d,\seq{t}|\seq{s}))
+\hat{\seq{t}} & = & \arg\max_{\seq{t}} (\max \funp{P}(d,\seq{t}|\seq{s}))
 \label{eq:7-7}
 \end{eqnarray}

 \parinterval 值得注意的是，翻译推导中蕴含着译文的信息，因此每个翻译推导都与一个译文对应。因此可以把公式\eqref{eq:7-7}所描述的问题重新定义为：
 \begin{eqnarray}
-\hat{d} = \arg\max_{d} \funp{P}(d,\seq{t}|\seq{s})
+\hat{d} & = & \arg\max_{d} \funp{P}(d,\seq{t}|\seq{s})
 \label{eq:7-8}
 \end{eqnarray}

 \parinterval 也就是，给定一个输入句子$\seq{s}$，找到从它出发的最优翻译推导$\hat{d}$，把这个翻译推导所对应的目标语词串看作最优的译文。假设函数$t(\cdot)$可以返回一个推导的目标语词串，则最优译文也可以被看作是：
 \begin{eqnarray}
-\hat{\seq{t}} = t(\hat{d})
+\hat{\seq{t}} & = & t(\hat{d})
 \label{eq:7-9}
 \end{eqnarray}
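
Equations 7-3 and 7-6 to 7-9 can be contrasted with a small sketch. It assumes that every derivation has already been scored and is given as a `(target_string, probability)` pair; the function name `decode` and the toy numbers are hypothetical and only serve to show when the max-derivation approximation agrees with the full sum.

```python
from collections import defaultdict


def decode(derivations):
    """derivations: list of (target_string, prob) pairs, one per derivation d."""
    # Eq. 7-3: P(t|s) is the sum of P(d,t|s) over all derivations that yield t.
    summed = defaultdict(float)
    for target, prob in derivations:
        summed[target] += prob
    best_by_sum = max(summed, key=summed.get)

    # Eq. 7-8 and 7-9: take the single best derivation and read off its target string.
    best_by_max, _ = max(derivations, key=lambda d: d[1])
    return best_by_sum, best_by_max


# Peaked distribution: one derivation dominates, so both criteria agree.
peaked = [("a b", 0.6), ("b a", 0.3), ("b a", 0.05)]
# Flat distribution: the mass of "a b" is split over two derivations.
flat = [("a b", 0.3), ("a b", 0.25), ("b a", 0.4)]
```

On `peaked` both criteria select the same translation, while on `flat` they differ, which is exactly the case the "sharp distribution" assumption rules out.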

@@ -474,7 +474,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 抽取双语短语之后，需要对每个双语短语的质量进行评价。这样，在使用这些双语短语时，可以更有效地估计整个句子翻译的好坏。在统计机器翻译中，一般用双语短语出现的可能性大小来度量双语短语的好坏。这里，使用相对频次估计对短语的翻译条件概率进行计算，公式如下：
 \begin{eqnarray}
-\funp{P}(\bar{t}|\bar{s}) = \frac{c(\bar{s},\bar{t})}{c(\bar{s})}
+\funp{P}(\bar{t}|\bar{s}) & = & \frac{c(\bar{s},\bar{t})}{c(\bar{s})}
 \label{eq:7-13}
 \end{eqnarray}
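
The relative-frequency estimate in Equation 7-13 can be sketched in a few lines. The sketch assumes the extracted phrase pairs are available as a flat list of `(source, target)` tuples and that $c(\bar{s})$ is counted over that same list; the function name `phrase_translation_probs` and the toy data are illustrative only.

```python
from collections import Counter


def phrase_translation_probs(phrase_pairs):
    """Relative-frequency estimate P(t|s) = c(s,t) / c(s)."""
    pair_counts = Counter(phrase_pairs)                  # c(s, t)
    source_counts = Counter(s for s, _ in phrase_pairs)  # c(s)
    return {(s, t): c / source_counts[s] for (s, t), c in pair_counts.items()}


# Toy extraction result (hypothetical data).
pairs = [("满意", "satisfied"), ("满意", "satisfied"), ("满意", "satisfy"), ("对", "to")]
probs = phrase_translation_probs(pairs)
```

Note that the pair seen only once with its source phrase, `("对", "to")`, receives probability 1, which illustrates why low-frequency phrases make this estimate unreliable.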

@@ -482,7 +482,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 当遇到低频短语时，短语翻译概率的估计可能会不准确。例如，短语$\bar{s}$和$\bar{t}$在语料中只出现了一次，且在一个句子中共现，那么$\bar{s}$到$\bar{t}$的翻译概率为$\funp{P}(\bar{t}|\bar{s})=1$，这显然是不合理的，因为$\bar{s}$和$\bar{t}$的出现完全可能是偶然事件。既然直接度量双语短语的好坏会面临数据稀疏问题，一个自然的想法就是把短语拆解成单词，利用双语短语中单词翻译的好坏间接度量双语短语的好坏。为了达到这个目的，可以使用{\small\bfnew{词汇化翻译概率}}\index{词汇化翻译概率}（Lexical Translation Probability）\index{Lexical Translation Probability}。前面借助词对齐信息完成了双语短语的抽取，因此，词对齐信息本身就包含了短语内部单词之间的对应关系。因此同样可以借助词对齐来计算词汇翻译概率，公式如下：
 \begin{eqnarray}
-\funp{P}_{\textrm{lex}}(\bar{t}|\bar{s}) = \prod_{j=1}^{|\bar{s}|} \frac{1}{|\{j|a(j,i) = 1\}|} \sum_{\forall(j,i):a(j,i) = 1} \sigma (t_i|s_j)
+\funp{P}_{\textrm{lex}}(\bar{t}|\bar{s}) & = & \prod_{j=1}^{|\bar{s}|} \frac{1}{|\{j|a(j,i) = 1\}|} \sum_{\forall(j,i):a(j,i) = 1} \sigma (t_i|s_j)
 \label{eq:7-14}
 \end{eqnarray}

@@ -541,7 +541,7 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 基于距离的调序方法的核心思想就是度量当前翻译结果与顺序翻译之间的差距。对于译文中的第$i$个短语，令$\rm{start}_i$表示它所对应的源语言短语中第一个词所在的位置，$\rm{end}_i$表示它所对应的源语言短语中最后一个词所在的位置。于是，这个短语（相对于前一个短语）的调序距离为：
 \begin{eqnarray}
-dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1
+dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \label{eq:7-15}
 \end{eqnarray}
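
A minimal sketch of Equation 7-15, assuming source positions are 1-based and inclusive, that phrases are listed in target order, and that a virtual phrase ending at position 0 precedes the first phrase. The function name `reordering_distances` is hypothetical.

```python
def reordering_distances(spans):
    """spans: (start, end) source positions of each phrase, in target order."""
    distances = []
    prev_end = 0  # assumption: virtual phrase ending at position 0
    for start, end in spans:
        # Eq. 7-15: dr = start_i - end_{i-1} - 1
        distances.append(start - prev_end - 1)
        prev_end = end
    return distances


monotone = reordering_distances([(1, 2), (3, 3), (4, 6)])
swapped = reordering_distances([(1, 1), (4, 5), (2, 3)])
```

A monotone translation yields a distance of 0 for every phrase; jumping forward gives a positive value and jumping back gives a negative one, so a model would typically penalize the absolute value.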

@@ -579,7 +579,7 @@ dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \parinterval 对于每种调序类型，都可以定义一个调序概率，如下：
 \begin{eqnarray}
-\funp{P}(\seq{o}|\seq{s},\seq{t},\seq{a}) = \prod_{i=1}^{K} \funp{P}(o_i| \bar{s}_{a_i}, \bar{t}_i, a_{i-1}, a_i)
+\funp{P}(\seq{o}|\seq{s},\seq{t},\seq{a}) & = & \prod_{i=1}^{K} \funp{P}(o_i| \bar{s}_{a_i}, \bar{t}_i, a_{i-1}, a_i)
 \label{eq:7-16}
 \end{eqnarray}

@@ -654,13 +654,13 @@ dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \parinterval 这里介绍一种更加高效的特征权重调优方法$\ \dash \ ${\small\bfnew{最小错误率训练}}\index{最小错误率训练}（Minimum Error Rate Training\index{Minimum Error Rate Training}，MERT）。最小错误率训练是统计机器翻译发展中代表性工作，也是机器翻译领域原创的重要技术方法之一\upcite{DBLP:conf/acl/Och03}。最小错误率训练假设：翻译结果相对于标准答案的错误是可度量的，进而可以通过降低错误数量的方式来找到最优的特征权重。假设有样本集合$S = \{(s_1,\seq{r}_1),...,(s_N,\seq{r}_N)\}$，$s_i$为样本中第$i$个源语言句子，$\seq{r}_i$为相应的参考译文。注意，$\seq{r}_i$ 可以包含多个参考译文。$S$通常被称为{\small\bfnew{调优集合}}\index{调优集合}（Tuning Set）\index{Tuning Set}。对于$S$中的每个源语句子$s_i$，机器翻译模型会解码出$n$-best推导$\hat{\seq{d}}_{i} = \{\hat{d}_{ij}\}$，其中$\hat{d}_{ij}$表示对于源语言句子$s_i$得到的第$j$个最好的推导。$\{\hat{d}_{ij}\}$可以被定义如下：

 \begin{eqnarray}
-\{\hat{d}_{ij}\} = \arg\max_{\{d_{ij}\}} \sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})
+\{\hat{d}_{ij}\} & = & \arg\max_{\{d_{ij}\}} \sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})
 \label{eq:7-17}
 \end{eqnarray}

 \parinterval 对于每个样本都可以得到$n$-best推导集合，整个数据集上的推导集合被记为$\hat{\seq{D}} = \{\hat{\seq{d}}_{1},...,\hat{\seq{d}}_{N}\}$。进一步，令所有样本的参考译文集合为$\seq{R} = \{\seq{r}_1,...,\seq{r}_N\}$。最小错误率训练的目标就是降低$\hat{\seq{D}}$相对于$\seq{R}$的错误。也就是，通过调整不同特征的权重$\lambda = \{ \lambda_i \}$，让错误率最小，形式化描述为：
 \begin{eqnarray}
-\hat{\lambda} = \arg\min_{\lambda} \textrm{Error}(\hat{\seq{D}},\seq{R})
+\hat{\lambda} & = & \arg\min_{\lambda} \textrm{Error}(\hat{\seq{D}},\seq{R})
 \label{eq:7-18}
 \end{eqnarray}
 %公式--------------------------------------------------------------------
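
The objective in Equations 7-17 and 7-18 can be illustrated with a brute-force stand-in. This is not Och's line-search procedure: it simply tries every weight vector in a small grid, and it assumes a precomputed sentence-level error for each candidate (real metrics such as BLEU are corpus-level). All names and numbers below are hypothetical.

```python
def rerank(nbest, weights):
    """Per sentence, pick the candidate with the highest weighted feature sum (Eq. 7-17)."""
    def score(cand):
        return sum(w * h for w, h in zip(weights, cand["features"]))
    return [max(cands, key=score) for cands in nbest]


def total_error(nbest, weights):
    """Error of the 1-best output under the given weights."""
    return sum(cand["error"] for cand in rerank(nbest, weights))


def grid_search(nbest, grid):
    """Eq. 7-18, restricted to a finite grid of weight vectors."""
    return min(grid, key=lambda w: total_error(nbest, w))


# Two tuning sentences, two candidates each, two features per candidate.
nbest = [
    [{"features": (2.0, 0.0), "error": 1}, {"features": (1.0, 1.0), "error": 0}],
    [{"features": (1.0, 0.5), "error": 0}, {"features": (3.0, 0.0), "error": 1}],
]
grid = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
best_weights = grid_search(nbest, grid)
```

The point of the sketch is that the weights only matter through the ranking they induce on each $n$-best list, which is what makes the error surface piecewise constant.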

--- a/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
+++ b/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
@@ -16,6 +16,6 @@
 	
 \node [anchor=north west] (cap1) at (-1.5em,-1in) {{(a) 树状表示}};
 \node [anchor=west] (cap2) at ([xshift=0.5in]cap1.east) {{(b) 序列表示（缩进）}};
-\node [anchor=west] (cap3) at ([xshift=0.5in]cap2.east) {{(c) 序列表示}};
+\node [anchor=west] (cap3) at ([xshift=0.3in]cap2.east) {{(c) 序列表示}};
 }
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter8/Figures/figure-example-of-cky-algorithm-execution.tex
+++ b/Chapter8/Figures/figure-example-of-cky-algorithm-execution.tex
@@ -35,7 +35,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption1) at ([xshift=0.0em,yshift=0.0em]l5.south) {(a)};
+\node [anchor=north] (caption1) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(a)}};

 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};
@@ -88,7 +88,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption2) at ([xshift=0.0em,yshift=0.0em]l5.south) {(b)};
+\node [anchor=north] (caption2) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(b)}};

 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};
@@ -170,7 +170,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption3) at ([xshift=0.0em,yshift=0.0em]l5.south) {(c)};
+\node [anchor=north] (caption3) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(c)}};

 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};
@@ -267,7 +267,7 @@
 \node [anchor=north] (l3) at ([yshift=-1em]cell53.south) {\tiny{$l$=3}};
 \node [anchor=north] (l4) at ([yshift=-1em]cell54.south) {\tiny{$l$=4}};
 \node [anchor=north] (l5) at ([yshift=-1em]cell55.south) {\tiny{$l$=5}};
-\node [anchor=north] (caption4) at ([xshift=0.0em,yshift=0.0em]l5.south) {(d)};
+\node [anchor=north] (caption4) at ([xshift=0.0em,yshift=0.0em]l5.south) {\small{(d)}};

 \node [anchor=center] (y1) at ([xshift=-2.1em,yshift=2em]cell11.center) {\tiny{\blue 0}};
 \node [anchor=center] (y2) at ([xshift=-2.1em,yshift=2em]cell21.center) {\tiny{\blue 1}};

--- a/Chapter8/Figures/figure-execution-of-cube-pruning.tex
+++ b/Chapter8/Figures/figure-execution-of-cube-pruning.tex
@@ -40,7 +40,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);

-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(a)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(a)}};
 \end{scope}

 %图2
@@ -87,7 +87,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);

-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(b)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(b)}};
 \end{scope}

 %图3
@@ -137,7 +137,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);

-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(c)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(c)}};
 \end{scope}


@@ -194,7 +194,7 @@
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=-1.0em,yshift=-0.7em]alig4.south west);
 \draw [->,thick] ([xshift=-1.0em,yshift=1.0em]alig1.north west)--([xshift=0.8em,yshift=1.0em]alig13.north east);

-\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\scriptsize{(d)}};
+\node[anchor=north] (l) at ([xshift=0em,yshift=-1.5em]alig4.south) {\small{(d)}};
 \end{scope}



--- a/Chapter8/Figures/figure-structure-of-chart.tex
+++ b/Chapter8/Figures/figure-structure-of-chart.tex
@@ -4,7 +4,7 @@
 \begin{tikzpicture}
 \begin{scope}

-\node [anchor=south west,draw,fill=ugreen!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell11) at (0,0) {\scriptsize{cell[1,2]}};
+\node [anchor=south west,draw,fill=green!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell11) at (0,0) {\scriptsize{cell[1,2]}};
 \node [anchor=south west,draw,fill=red!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell12) at (cell11.south east) {\scriptsize{cell[0,2]}};
 \node [anchor=south west,draw,fill=orange!30,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell21) at (cell11.north west) {\scriptsize{cell[0,1]}};
 \node [anchor=south west,draw,fill=gray!20,minimum width=2.8em,minimum height=2.8em,inner sep=1pt] (cell22) at (cell21.south east) {\scriptsize{N/A}};
@@ -12,7 +12,7 @@
 \draw [->,thick] ([xshift=-1em,yshift=1em]cell21.north west)--([xshift=1em,yshift=1em]cell22.north east);

 \node [anchor=north west,fill=orange!30,draw,drop shadow,align=left,minimum width=4em] (cell11label) at ([xshift=4em,yshift=1em]cell22.north east) {\footnotesize{VV[0,1]}};
-\node [anchor=north west,fill=ugreen!20,draw,drop shadow,align=left,minimum width=4em] (cell12label) at ([yshift=-1em]cell11label.south west) {\footnotesize{NN[1,2]}\\\footnotesize{NP[1,2]}};
+\node [anchor=north west,fill=green!20,draw,drop shadow,align=left,minimum width=4em] (cell12label) at ([yshift=-1em]cell11label.south west) {\footnotesize{NN[1,2]}\\\footnotesize{NP[1,2]}};
 \node [anchor=north west,fill=red!20,draw,drop shadow,align=left,minimum width=4em] (cell21label) at ([yshift=-1em]cell12label.south west) {\footnotesize{VP[0,2]}\\\footnotesize{NP[0,2]}};

 \draw [->,very thick,dotted] ([yshift=0.3em]cell11label.west) .. controls +(west:2em)  and +(north:1.5em) .. ([xshift=1em,yshift=-0.5em]cell21.north);

--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -266,7 +266,7 @@ r_4:\quad \funp{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \no

 \noindent 其中，每使用一次规则就会同步替换源语言和目标语言符号串中的一个非终结符，替换结果用红色表示。通常，可以把上面这个过程称作翻译推导，记为：
 \begin{eqnarray}
-d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
+d & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \label{eq:8-1}
 \end{eqnarray}

@@ -402,19 +402,19 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m

 \parinterval 这些特征可以被具体描述为：
 \begin{eqnarray}
-h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)
+h_i (d,\seq{t},\seq{s}) & = & \sum_{r \in d}h_i (r)
 \label{eq:8-4}
 \end{eqnarray}

 \parinterval 公式\eqref{eq:8-4}中，$r$表示推导$d$中的一条规则，$h_i (r)$表示规则$r$上的第$i$个特征。可以看出，推导$d$的特征值就是所有包含在$d$中规则的特征值的和。进一步，可以定义
 \begin{eqnarray}
-\textrm{rscore}(d,\seq{t},\seq{s})=\sum_{i=1}^7 \lambda_i \cdot h_i (d,\seq{t},\seq{s})
+\textrm{rscore}(d,\seq{t},\seq{s}) & = & \sum_{i=1}^7 \lambda_i \cdot h_i (d,\seq{t},\seq{s})
 \label{eq:8-5}
 \end{eqnarray}

 \parinterval 最终，模型得分被定义为：
 \begin{eqnarray}
-\textrm{score}(d,\seq{t},\seq{s})=\textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}⁡(\funp{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
+\textrm{score}(d,\seq{t},\seq{s}) & = & \textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}(\funp{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
 \label{eq:8-6}
 \end{eqnarray}
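
Equations 8-4 to 8-6 compose as follows. The sketch assumes each rule is represented by a list of its feature values and that the language-model probability and target length are supplied by the caller; the model in the text uses seven rule features, while the toy call uses two for brevity. Function and parameter names are hypothetical.

```python
import math


def derivation_features(rules):
    """Eq. 8-4: each derivation feature is the sum of that feature over its rules."""
    return [sum(values) for values in zip(*rules)]


def model_score(rules, rule_weights, lm_prob, target_len, lm_weight, len_weight):
    h = derivation_features(rules)
    # Eq. 8-5: weighted sum of the rule-level features.
    rscore = sum(w * v for w, v in zip(rule_weights, h))
    # Eq. 8-6: add the language-model and length terms.
    return rscore + lm_weight * math.log(lm_prob) + len_weight * target_len


# Two rules with two features each (toy values).
toy_rules = [[-1.0, 1.0], [-0.5, 1.0]]
toy_score = model_score(toy_rules, [1.0, 0.5], math.exp(-2.0), 4, 0.5, 0.1)
```

Because the rule features decompose over rules, they can be accumulated incrementally during decoding, whereas the language-model term depends on the whole target string.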

@@ -438,14 +438,14 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)

 \parinterval 层次短语模型解码的目标是找到模型得分最高的推导，即：
 \begin{eqnarray}
-\hat{d} = \argmax_{d}\ \textrm{score}(d,\seq{t},\seq{s})
+\hat{d} & = & \argmax_{d}\ \textrm{score}(d,\seq{t},\seq{s})
 \label{eq:8-7}
 \end{eqnarray}

 \noindent 这里，$\hat{d}$的目标语部分即最佳译文$\hat{\seq{t}}$。令函数$t(\cdot)$返回翻译推导的目标语词串，于是有：

 \begin{eqnarray}
-\hat{\seq{t}}=t(\hat{d})
+\hat{\seq{t}} & = & t(\hat{d})
 \label{eq:8-8}
 \end{eqnarray}

@@ -1408,15 +1408,15 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \parinterval 从句法分析的角度看，超图最大程度地复用了局部的分析结果，使得分析可以“结构化”。比如，有两个推导：
 \begin{eqnarray}
-d_1 = {r_1} \circ {r_2} \circ {r_3} \circ {r_4} \label{eqa4.30}\\
-d_2 = {r_1} \circ {r_2} \circ {r_3} \circ {r_5}
+d_1 & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_4} \label{eqa4.30}\\
+d_2 & = & {r_1} \circ {r_2} \circ {r_3} \circ {r_5}
 \label{eq:8-10}
 \end{eqnarray}

 \noindent 其中，$r_1 - r_5$分别表示不同的规则。${r_1} \circ {r_2} \circ {r_3}$是两个推导的公共部分。在超图表示中，${r_1} \circ {r_2} \circ {r_3}$可以对应一个子图，显然这个子图也是一个推导，记为${d'}= {r_1} \circ {r_2} \circ {r_3}$。这样，$d_1$和$d_2$不需要重复记录${r_1} \circ {r_2} \circ {r_3}$，重新写作：
 \begin{eqnarray}
-d_1 = {d'} \circ {r_4} \label{eqa4.32}\\
-d_1 = {d'} \circ {r_5}
+d_1 & = & {d'} \circ {r_4} \label{eqa4.32}\\
+d_2 & = & {d'} \circ {r_5}
 \label{eq:8-12}
 \end{eqnarray}

@@ -1458,7 +1458,7 @@ d_1 = {d'} \circ {r_5}

 \parinterval 解码的目标是找到得分score($d$)最高的推导$d$。这个过程通常被描述为：
 \begin{eqnarray}
-\hat{d} = \argmax_d\ \textrm{score} (d,\seq{s},\seq{t})
+\hat{d} & = & \argmax_d\ \textrm{score} (d,\seq{s},\seq{t})
 \label{eq:8-13}
 \end{eqnarray}


--- a/bibliography.bib
+++ b/bibliography.bib
@@ -2713,9 +2713,8 @@ year = {2012}
               Franz Josef Och and
               Hermann Ney},
  title     = {Phrase-Based Statistical Machine Translation},
-  volume    = {2479},
  pages     = {18--32},
-  publisher = {Springer},
+  publisher = {Annual Conference on Artificial Intelligence},
  year      = {2002}
 }
 @inproceedings{DBLP:conf/naacl/ZensN04,
@@ -5791,7 +5790,7 @@ author    = {Yoshua Bengio and
 @article{Wang2018MultilayerRF,
  title={Multi-layer Representation Fusion for Neural Machine Translation},
  author={Qiang Wang and Fuxue Li and Tong Xiao and Yanyang Li and Yinqiao Li and Jingbo Zhu},
-  journal={ArXiv},
+  journal={International Conference on Computational Linguistics},
  year={2018},
  volume={abs/2002.06714}
 }
@@ -5937,6 +5936,308 @@ author    = {Yoshua Bengio and
  year      = {2012}
 }

+@article{JMLR:v15:srivastava14a,
+  author  = {Nitish Srivastava and Geoffrey Hinton and Alex Krizhevsky and Ilya Sutskever and Ruslan Salakhutdinov},
+  title   = {Dropout: A Simple Way to Prevent Neural Networks from Overfitting},
+  journal = {Journal of Machine Learning Research},
+  year    = {2014},
+  volume  = {15},
+  pages   = {1929--1958}
+}
+
+@inproceedings{DBLP:conf/amta/MullerRS20,
+  author    = {Mathias M{\"{u}}ller and
+               Annette Rios and
+               Rico Sennrich},
+  title     = {Domain Robustness in Neural Machine Translation},
+  pages     = {151--164},
+  publisher = {Association for Machine Translation in the Americas},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/sp/Carlini017,
+  author    = {Nicholas Carlini and
+               David A. Wagner},
+  title     = {Towards Evaluating the Robustness of Neural Networks},
+  pages     = {39--57},
+  publisher = {IEEE Symposium on Security and Privacy},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/cvpr/Moosavi-Dezfooli16,
+  author    = {Seyed-Mohsen Moosavi-Dezfooli and
+               Alhussein Fawzi and
+               Pascal Frossard},
+  title     = {DeepFool: {A} Simple and Accurate Method to Fool Deep Neural Networks},
+  pages     = {2574--2582},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/acl/ChengJM19,
+  author    = {Yong Cheng and
+               Lu Jiang and
+               Wolfgang Macherey},
+  title     = {Robust Neural Machine Translation with Doubly Adversarial Inputs},
+  pages     = {4324--4333},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/cvpr/NguyenYC15,
+  author    = {Anh Mai Nguyen and
+               Jason Yosinski and
+               Jeff Clune},
+  title     = {Deep neural networks are easily fooled: High confidence predictions
+               for unrecognizable images},
+  pages     = {427--436},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2015}
+}
+
+@inproceedings{DBLP:journals/corr/SzegedyZSBEGF13,
+  author    = {Christian Szegedy and
+               Wojciech Zaremba and
+               Ilya Sutskever and
+               Joan Bruna and
+               Dumitru Erhan and
+               Ian J. Goodfellow and
+               Rob Fergus},
+  title     = {Intriguing properties of neural networks},
+  publisher = {International Conference on Learning Representations},
+  year      = {2014}
+}
+
+@inproceedings{DBLP:journals/corr/GoodfellowSS14,
+  author    = {Ian J. Goodfellow and
+               Jonathon Shlens and
+               Christian Szegedy},
+  title     = {Explaining and Harnessing Adversarial Examples},
+  publisher = {International Conference on Learning Representations},
+  year      = {2015}
+}
+
+@inproceedings{DBLP:conf/emnlp/JiaL17,
+  author    = {Robin Jia and
+               Percy Liang},
+  title     = {Adversarial Examples for Evaluating Reading Comprehension Systems},
+  pages     = {2021--2031},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/emnlp/BekoulisDDD18,
+  author    = {Giannis Bekoulis and
+               Johannes Deleu and
+               Thomas Demeester and
+               Chris Develder},
+  title     = {Adversarial training for multi-context joint entity and relation extraction},
+  pages     = {2830--2836},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/naacl/YasunagaKR18,
+  author    = {Michihiro Yasunaga and
+               Jungo Kasai and
+               Dragomir R. Radev},
+  title     = {Robust Multilingual Part-of-Speech Tagging via Adversarial Training},
+  pages     = {976--986},
+  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/iclr/BelinkovB18,
+  author    = {Yonatan Belinkov and
+               Yonatan Bisk},
+  title     = {Synthetic and Natural Noise Both Break Neural Machine Translation},
+  publisher = {International Conference on Learning Representations},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/naacl/MichelLNP19,
+  author    = {Paul Michel and
+               Xian Li and
+               Graham Neubig and
+               Juan Miguel Pino},
+  title     = {On Evaluation of Adversarial Perturbations for Sequence-to-Sequence
+               Models},
+  pages     = {3103--3114},
+  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@article{Gong2018AdversarialTW,
+  title={Adversarial Texts with Gradient Methods},
+  author={Zhitao Gong and Wenlu Wang and B. Li and D. Song and W. Ku},
+  journal={ArXiv},
+  year={2018},
+  volume={abs/1801.07175}
+}
+
+@inproceedings{DBLP:conf/naacl/VaibhavSSN19,
+  author    = {Vaibhav and
+               Sumeet Singh and
+               Craig Stewart and
+               Graham Neubig},
+  title     = {Improving Robustness of Machine Translation with Synthetic Noise},
+  pages     = {1916--1920},
+  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/naacl/AnastasopoulosL19,
+  author    = {Antonios Anastasopoulos and
+               Alison Lui and
+               Toan Q. Nguyen and
+               David Chiang},
+  title     = {Neural Machine Translation of Text from Non-Native Speakers},
+  pages     = {3070--3080},
+  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/acl/SinghGR18,
+  author    = {Marco T{\'{u}}lio Ribeiro and
+               Sameer Singh and
+               Carlos Guestrin},
+  title     = {Semantically Equivalent Adversarial Rules for Debugging {NLP} models},
+  pages     = {856--865},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@article{DBLP:journals/corr/SamantaM17,
+  author    = {Suranjana Samanta and
+               Sameep Mehta},
+  title     = {Towards Crafting Text Adversarial Samples},
+  journal   = {CoRR},
+  volume    = {abs/1707.02812},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/ijcai/0002LSBLS18,
+  author    = {Bin Liang and
+               Hongcheng Li and
+               Miaoqiang Su and
+               Pan Bian and
+               Xirong Li and
+               Wenchang Shi},
+  title     = {Deep Text Classification Can be Fooled},
+  pages     = {4208--4215},
+  publisher = {International Joint Conference on Artificial Intelligence},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/coling/EbrahimiLD18,
+  author    = {Javid Ebrahimi and
+               Daniel Lowd and
+               Dejing Dou},
+  title     = {On Adversarial Examples for Character-Level Neural Machine Translation},
+  pages     = {653--663},
+  publisher = {International Conference on Computational Linguistics},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/iclr/ZhaoDS18,
+  author    = {Zhengli Zhao and
+               Dheeru Dua and
+               Sameer Singh},
+  title     = {Generating Natural Adversarial Examples},
+  publisher = {International Conference on Learning Representations},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/acl/LiuTMCZ18,
+  author    = {Yong Cheng and
+               Zhaopeng Tu and
+               Fandong Meng and
+               Junjie Zhai and
+               Yang Liu},
+  title     = {Towards Robust Neural Machine Translation},
+  pages     = {1756--1766},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/acl/LiuMHXH19,
+  author    = {Hairong Liu and
+               Mingbo Ma and
+               Liang Huang and
+               Hao Xiong and
+               Zhongjun He},
+  title     = {Robust Neural Machine Translation with Joint Textual and Phonetic
+               Embedding},
+  pages     = {3044--3049},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/acl/LiLWJXZLL20,
+  author    = {Bei Li and
+               Hui Liu and
+               Ziyang Wang and
+               Yufan Jiang and
+               Tong Xiao and
+               Jingbo Zhu and
+               Tongran Liu and
+               Changliang Li},
+  title     = {Does Multi-Encoder Help? {A} Case Study on Context-Aware Neural Machine
+               Translation},
+  pages     = {3512--3518},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2020}
+}
+
+@techreport{chen1999gaussian,
+  title={A Gaussian prior for smoothing maximum entropy models},
+  author={Chen, Stanley F and Rosenfeld, Ronald},
+  year={1999},
+  institution={CARNEGIE-MELLON UNIV PITTSBURGH PA SCHOOL OF COMPUTER SCIENCE}
+}
+
+@inproceedings{DBLP:conf/emnlp/MichelN18,
+  author    = {Paul Michel and
+               Graham Neubig},
+  title     = {{MTNT:} {A} Testbed for Machine Translation of Noisy Text},
+  pages     = {543--553},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/icassp/SchusterN12,
+  author    = {Mike Schuster and
+               Kaisuke Nakajima},
+  title     = {Japanese and Korean voice search},
+  pages     = {5149--5152},
+  publisher = {IEEE International Conference on Acoustics, Speech and Signal Processing},
+  year      = {2012}
+}
+
+@inproceedings{kudo2018sentencepiece,
+	title={SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing},
+	author={Taku {Kudo} and John {Richardson}},
+	publisher={Conference on Empirical Methods in Natural Language Processing},
+	pages={66--71},
+	year={2018}
+}
+
+@inproceedings{provilkov2020bpe,
+	title={BPE-Dropout: Simple and Effective Subword Regularization},
+	author={Ivan {Provilkov} and Dmitrii {Emelianenko} and Elena {Voita}},
+	publisher={Annual Meeting of the Association for Computational Linguistics},
+	pages={1882--1892},
+	year={2020}
+}
+
+@inproceedings{he2020dynamic,
+	title={Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation},
+	author={Xuanli {He} and Gholamreza {Haffari} and Mohammad {Norouzi}},
+	publisher={Annual Meeting of the Association for Computational Linguistics},
+	pages={3042--3051},
+	year={2020}
+}
+
 %%%%% chapter 13------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@@ -6703,7 +7004,8 @@ author    = {Yoshua Bengio and
 @article{Lan2020ALBERTAL,
  title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
  author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
-  publisher={International Conference on Learning Representations}
+  journal={International Conference on Learning Representations},
+  year={2020}
 }

 @inproceedings{Han2015LearningBW,
@@ -7780,7 +8082,7 @@ author    = {Zhuang Liu and
               Rupesh Kumar Srivastava and
               J{\"{u}}rgen Schmidhuber},
  title     = {Highway and Residual Networks learn Unrolled Iterative Estimation},
-  publisher = {International Conference on Learning Representations},
+  journal = {International Conference on Learning Representations},
  year      = {2017}
 }

@@ -7823,7 +8125,7 @@ author    = {Zhuang Liu and
               Liwei Wang and
               Tie-Yan Liu},
  title     = {On Layer Normalization in the Transformer Architecture},
-  journal   = {CoRR},
+  journal   = {International Conference on Machine Learning},
  volume    = {abs/2002.04745},
  year      = {2020}
 }
@@ -7897,7 +8199,7 @@ author    = {Zhuang Liu and
 @article{Wang2018MultilayerRF,
  title={Multi-layer Representation Fusion for Neural Machine Translation},
  author={Qiang Wang and Fuxue Li and Tong Xiao and Yanyang Li and Yinqiao Li and Jingbo Zhu},
-  journal={ArXiv},
+  journal={International Conference on Computational Linguistics},
  year={2018},
  volume={abs/2002.06714}
 }
@@ -8026,7 +8328,7 @@ author    = {Zhuang Liu and

 @inproceedings{Real2019AgingEF,
  title={Aging Evolution for Image Classifier Architecture Search},
-  author={E. Real and A. Aggarwal and Y. Huang and Quoc V. Le},
+  author={Esteban Real and Alok Aggarwal and Yanping Huang and Quoc V. Le},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2019}
 }
@@ -8070,7 +8372,7 @@ author    = {Zhuang Liu and
 }

 @inproceedings{DBLP:conf/ijcnn/Dodd90,
-  author    = {N. Dodd},
+  author    = {Dodd Nigel},
  title     = {Optimisation of network structure using genetic techniques},
  publisher = {International Joint Conference on Neural Networks, San
               Diego, CA, USA, June 17-21, 1990},
@@ -9241,7 +9543,8 @@ author    = {Zhuang Liu and
 @article{Lan2020ALBERTAL,
  title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
  author={Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
-  publisher={International Conference on Learning Representations}
+  journal={International Conference on Learning Representations},
+  year={2020}
 }

 @inproceedings{DBLP:conf/naacl/HaoWYWZT19,
@@ -9521,21 +9824,21 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs200111327,
+@inproceedings{DBLP:journals/corr/abs200111327,
  author    = {Idris Abdulmumin and
               Bashir Shehu Galadanci and
               Abubakar Isa},
  title     = {Iterative Batch Back-Translation for Neural Machine Translation: {A}
               Conceptual Model},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  year      = {2020}
 }
-@article{DBLP:journals/corr/abs200403672,
+@inproceedings{DBLP:journals/corr/abs200403672,
  author    = {Zi-Yi Dou and
               Antonios Anastasopoulos and
               Graham Neubig},
  title     = {Dynamic Data Selection and Weighting for Iterative Back-Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  year      = {2020}
 }
 @inproceedings{DBLP:conf/emnlp/WuZHGQLL19,
@@ -9551,14 +9854,15 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-1901-09069,
+@inproceedings{DBLP:journals/corr/abs-1901-09069,
  author    = {Felipe Almeida and
               Geraldo Xex{\'{e}}o},
  title     = {Word Embeddings: {A} Survey},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-2002-06823,
+
+@inproceedings{DBLP:journals/corr/abs-2002-06823,
  author    = {Jinhua Zhu and
               Yingce Xia and
               Lijun Wu and
@@ -9568,7 +9872,7 @@ author    = {Zhuang Liu and
               Houqiang Li and
               Tie-Yan Liu},
  title     = {Incorporating {BERT} into Neural Machine Translation},
-  journal   = {CoRR},
+  publisher   = {International Conference on Learning Representations},
  year      = {2020}
 }
 @inproceedings{song2019mass,
@@ -9580,13 +9884,13 @@ author    = {Zhuang Liu and
  title     = {{MASS:} Masked Sequence to Sequence Pre-training for Language Generation},
  volume    = {97},
  pages     = {5926--5936},
-  publisher = {{PMLR}},
+  publisher = {International Conference on Machine Learning},
  year      = {2019}
 }
-@article{DBLP:journals/corr/Ruder17a,
+@inproceedings{DBLP:journals/corr/Ruder17a,
  author    = {Sebastian Ruder},
  title     = {An Overview of Multi-Task Learning in Deep Neural Networks},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1706.05098},
  year      = {2017}
 }
@@ -9600,20 +9904,10 @@ author    = {Zhuang Liu and
  title     = {Dual Supervised Learning},
  volume    = {70},
  pages     = {3789--3798},
-  publisher = {{PMLR}},
-  year      = {2017}
-}
-@inproceedings{DBLP:conf/iccv/ZhuPIE17,
-  author    = {Jun-Yan Zhu and
-               Taesung Park and
-               Phillip Isola and
-               Alexei A. Efros},
-  title     = {Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial
-               Networks},
-  pages     = {2242--2251},
-  publisher = {{IEEE} Computer Society},
+  publisher = {International Conference on Machine Learning},
  year      = {2017}
 }
+
 @inproceedings{DBLP:conf/nips/HeXQWYLM16,
  author    = {Di He and
               Yingce Xia and
@@ -9654,12 +9948,12 @@ author    = {Zhuang Liu and
  title     = {Analyzing Uncertainty in Neural Machine Translation},
  volume    = {80},
  pages     = {3953--3962},
-  publisher = {{PMLR}},
+  publisher = {International Conference on Machine Learning},
  year      = {2018}
 }
 @inproceedings{finding2006adafre,
  author    = {S. F. Adafre and Maarten de Rijke},
-  title     = {Finding Similar Sentences across Multiple Languages in Wikipedia },
+  title     = {Finding Similar Sentences across Multiple Languages in Wikipedia},
  publisher = {Annual Conference of the European Association for Machine Translation},
  year      = {2006}
 }
@@ -9669,12 +9963,12 @@ author    = {Zhuang Liu and
  publisher = {AAAI Conference on Artificial Intelligence},
  year      = {2008}
 }
-@article{DBLP:journals/coling/MunteanuM05,
+@inproceedings{DBLP:journals/coling/MunteanuM05,
  author    = {Dragos Stefan Munteanu and
               Daniel Marcu},
  title     = {Improving Machine Translation Performance by Exploiting Non-Parallel
               Corpora},
-  journal   = {Computational Linguistics},
+  publisher   = {Computational Linguistics},
  volume    = {31},
  number    = {4},
  pages     = {477--504},
@@ -9728,9 +10022,9 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{2015OnGulcehre,
+@inproceedings{2015OnGulcehre,
  title = {On Using Monolingual Corpora in Neural Machine Translation},
-  author = { Gulcehre Caglar  and  
+  author = {Gulcehre Caglar  and  
           Firat Orhan  and  
           Xu Kelvin  and  
           Cho Kyunghyun  and  
@@ -9738,8 +10032,8 @@ author    = {Zhuang Liu and
           Lin Huei Chi  and  
           Bougares Fethi  and  
           Schwenk Holger  and  
-           Bengio  Yoshua },
-  journal = {Computer Science},
+           Bengio  Yoshua},
+  publisher = {Computer Science},
  year = {2015},
 }
 @phdthesis{黄书剑0统计机器翻译中的词对齐研究,
@@ -9748,12 +10042,12 @@ author    = {Zhuang Liu and
  publisher={南京大学},
  year={2012}
 }
-@article{DBLP:journals/corr/MikolovLS13,
+@inproceedings{DBLP:journals/corr/MikolovLS13,
  author    = {Tomas Mikolov and
               Quoc V. Le and
               Ilya Sutskever},
  title     = {Exploiting Similarities among Languages for Machine Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1309.4168},
  year      = {2013}
 }
@@ -9773,10 +10067,10 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{1966ASchnemann,
+@inproceedings{1966ASchnemann,
  title={A generalized solution of the orthogonal procrustes problem},
-  author={Schnemann, Peter H. },
-  journal={Psychometrika},
+  author={Sch{\"o}nemann, Peter H.},
+  publisher={Psychometrika},
  volume={31},
  number={1},
  pages={1-10},
@@ -9854,12 +10148,12 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/talip/MarieF20,
+@inproceedings{DBLP:journals/talip/MarieF20,
  author    = {Benjamin Marie and
               Atsushi Fujita},
  title     = {Iterative Training of Unsupervised Neural and Statistical Machine
               Translation Systems},
-  journal   = {{ACM} Trans. Asian Low Resour. Lang. Inf. Process.},
+  publisher   = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  volume    = {19},
  number    = {5},
  pages     = {68:1--68:21},
@@ -9893,7 +10187,7 @@ author    = {Zhuang Liu and
  pages     = {7057--7067},
  year      = {2019}
 }
-@article{DBLP:journals/ipm/FarhanTAJATT20,
+@inproceedings{DBLP:journals/ipm/FarhanTAJATT20,
  author    = {Wael Farhan and
               Bashar Talafha and
               Analle Abuammar and
@@ -9902,13 +10196,13 @@ author    = {Zhuang Liu and
               Ahmad Bisher Tarakji and
               Anas Toma},
  title     = {Unsupervised dialectal neural machine translation},
-  journal   = {Information Processing \& Management},
+  publisher   = {Information Processing \& Management},
  volume    = {57},
  number    = {3},
  pages     = {102181},
  year      = {2020}
 }
-@article{A2020Li,
+@inproceedings{A2020Li,
  title={A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction},
  author={Yanyang Li and Yingfeng Luo and Ye Lin and Quan Du and Huizhen Wang and Shujian Huang and Tong Xiao and Jingbo Zhu},
  publisher={International Conference on Computational Linguistics},
@@ -9953,7 +10247,8 @@ author    = {Zhuang Liu and
  publisher = {AAAI Conference on Artificial Intelligence},
  year      = {2020}
 }
-@article{DBLP:journals/corr/abs-2001-08210,
+
+@inproceedings{DBLP:journals/corr/abs-2001-08210,
  author    = {Yinhan Liu and
               Jiatao Gu and
               Naman Goyal and
@@ -9963,10 +10258,13 @@ author    = {Zhuang Liu and
               Mike Lewis and
               Luke Zettlemoyer},
  title     = {Multilingual Denoising Pre-training for Neural Machine Translation},
-  journal   = {CoRR},
-  volume    = {abs/2001.08210},
+  publisher   = {Transactions of the Association for Computational Linguistics},
+  volume    = {8},
+  pages     = {726--742},
  year      = {2020}
 }
+
+
 @inproceedings{DBLP:conf/aaai/JiZDZCL20,
  author    = {Baijun Ji and
               Zhirui Zhang and
@@ -9995,25 +10293,25 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
 }
-@article{DBLP:journals/corr/abs-2009-08088,
+@inproceedings{DBLP:journals/corr/abs-2009-08088,
  author    = {Zhen Yang and
               Bojie Hu and
               Ambyera Han and
               Shen Huang and
               Qi Ju},
-  title     = {Code-switching pre-training for neural machine translation},
-  journal   = {CoRR},
-  volume    = {abs/2009.08088},
+  title     = {{CSP:} Code-Switching Pre-training for Neural Machine Translation},
+  pages     = {2624--2636},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2020}
 }
-@article{DBLP:journals/corr/abs-2010-09403,
+@inproceedings{DBLP:journals/corr/abs-2010-09403,
  author    = {Dusan Varis and
               Ondrej Bojar},
  title     = {Unsupervised Pretraining for Neural Machine Translation Using Elastic
               Weight Consolidation},
-  journal   = {CoRR},
-  volume    = {abs/2010.09403},
-  year      = {2020}
+  pages     = {130--135},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2019}
 }
 @inproceedings{DBLP:conf/emnlp/LampleOCDR18,
  author    = {Guillaume Lample and
@@ -10026,11 +10324,11 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2018}
 }
-@article{DBLP:journals/jbd/ShortenK19,
+@inproceedings{DBLP:journals/jbd/ShortenK19,
  author    = {Connor Shorten and
               Taghi M. Khoshgoftaar},
  title     = {A survey on Image Data Augmentation for Deep Learning},
-  journal   = {J. Big Data},
+  publisher   = {Journal of Big Data},
  volume    = {6},
  pages     = {60},
  year      = {2019}
@@ -10053,13 +10351,13 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-1811-01124,
+@inproceedings{DBLP:journals/corr/abs-1811-01124,
  author    = {Jean Alaux and
               Edouard Grave and
               Marco Cuturi and
               Armand Joulin},
  title     = {Unsupervised Hyperalignment for Multilingual Word Embeddings},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1811.01124},
  year      = {2018}
 }
@@ -10158,9 +10456,10 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
 }
-@article{hartmann2018empirical,
+@inproceedings{hartmann2018empirical,
  title={Empirical observations on the instability of aligning word vector spaces with GANs},
  author={Hartmann, Mareike and Kementchedjhieva, Yova and S{\o}gaard, Anders},
+  publisher = {OpenReview.net},
  year={2018}
 }
 @inproceedings{DBLP:conf/emnlp/Kementchedjhieva19,
@@ -10223,9 +10522,10 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{2019ADabre,
+@inproceedings{2019ADabre,
  title={A Survey of Multilingual Neural Machine Translation},
  author={Dabre, Raj  and  Chu, Chenhui  and  Kunchukuttan, Anoop },
+  publisher={ACM Computing Surveys},
  year={2019},
 }
 @inproceedings{DBLP:conf/naacl/ZophK16,
@@ -10258,17 +10558,17 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{DBLP:journals/mt/WuW07,
+@inproceedings{DBLP:journals/mt/WuW07,
  author    = {Hua Wu and
               Haifeng Wang},
  title     = {Pivot language approach for phrase-based statistical machine translation},
-  journal   = {Mach. Transl.},
+  publisher   = {Machine Translation},
  volume    = {21},
  number    = {3},
  pages     = {165--181},
  year      = {2007}
 }
-@article{Farsi2010somayeh,
+@inproceedings{Farsi2010somayeh,
  author    = {Somayeh Bakhshaei and Shahram Khadivi and Noushin Riahi },
  title     = {Farsi-german statistical machine translation through bridge language},
  publisher   = {International Telecommunications Symposium},
@@ -10295,7 +10595,7 @@ author    = {Zhuang Liu and
  title     = {Improving Pivot-Based Statistical Machine Translation by Pivoting
               the Co-occurrence Count of Phrase Pairs},
  pages     = {1665--1675},
-  publisher = {{ACL}},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2014}
 }
 @inproceedings{DBLP:conf/acl/MiuraNSTN15,
@@ -10325,14 +10625,14 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2009}
 }
-@article{DBLP:journals/corr/ChengLYSX16,
+@inproceedings{DBLP:journals/corr/ChengLYSX16,
  author    = {Yong Cheng and
               Yang Liu and
               Qian Yang and
               Maosong Sun and
               Wei Xu},
  title     = {Neural Machine Translation with Pivot Languages},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1611.04928},
  year      = {2016}
 }
@@ -10348,7 +10648,7 @@ author    = {Zhuang Liu and
 @inproceedings{de2006catalan,
  title={Catalan-English statistical machine translation without parallel corpus: bridging through Spanish},
  author={De Gispert, Adri{\`a} and Marino, Jose B},
-  booktitle={Proc. of 5th International Conference on Language Resources and Evaluation (LREC)},
+  publisher={International Conference on Language Resources and Evaluation},
  pages={65--68},
  year={2006}
 }
@@ -10370,21 +10670,28 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2011}
 }
-@article{DBLP:journals/corr/HintonVD15,
+@inproceedings{DBLP:journals/corr/HintonVD15,
  author    = {Geoffrey E. Hinton and
               Oriol Vinyals and
               Jeffrey Dean},
  title     = {Distilling the Knowledge in a Neural Network},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1503.02531},
  year      = {2015}
 }
-@article{gu2018meta,
-  title={Meta-learning for low-resource neural machine translation},
-  author={Gu, Jiatao and Wang, Yong and Chen, Yun and Cho, Kyunghyun and Li, Victor OK},
-  journal={arXiv preprint arXiv:1808.08437},
-  year={2018}
+
+@inproceedings{gu2018meta,
+  author    = {Jiatao Gu and
+               Yong Wang and
+               Yun Chen and
+               Victor O. K. Li and
+               Kyunghyun Cho},
+  title     = {Meta-Learning for Low-Resource Neural Machine Translation},
+  pages     = {3622--3631},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
 }
+
 @inproceedings{DBLP:conf/naacl/GuHDL18,
  author    = {Jiatao Gu and
               Hany Hassan and
@@ -10426,11 +10733,11 @@ author    = {Zhuang Liu and
  publisher = {European Language Resources Association},
  year      = {2018}
 }
-@article{DBLP:journals/tkde/PanY10,
+@inproceedings{DBLP:journals/tkde/PanY10,
  author    = {Sinno Jialin Pan and
               Qiang Yang},
  title     = {A Survey on Transfer Learning},
-  journal   = {IEEE Transactions on knowledge and data engineering},
+  publisher   = {IEEE Transactions on knowledge and data engineering},
  volume    = {22},
  number    = {10},
  pages     = {1345--1359},
@@ -10438,14 +10745,14 @@ author    = {Zhuang Liu and
 }
 @book{2009Handbook,
  title={Handbook Of Research On Machine Learning Applications and Trends: Algorithms, Methods and Techniques - 2 Volumes},
-  author={ Olivas, Emilio Soria  and  Guerrero, Jose David Martin  and  Sober, Marcelino Martinez  and  Benedito, Jose Rafael Magdalena  and  Lopez, Antonio Jose Serrano },
+  author={Olivas, Emilio Soria and Guerrero, Jose David Martin and Sober, Marcelino Martinez and Benedito, Jose Rafael Magdalena and Lopez, Antonio Jose Serrano},
  publisher={Information Science Reference - Imprint of: IGI Publishing},
  year={2009},
 }
 @incollection{DBLP:books/crc/aggarwal14/Pan14,
  author    = {Sinno Jialin Pan},
  title     = {Transfer Learning},
-  booktitle = {Data Classification: Algorithms and Applications},
+  publisher = {Data Classification: Algorithms and Applications},
  pages     = {537--570},
  publisher = {{CRC} Press},
  year      = {2014}
@@ -10461,16 +10768,22 @@ author    = {Zhuang Liu and
  publisher = {OpenReview.net},
  year      = {2019}
 }
-@article{platanios2018contextual,
-  title={Contextual parameter generation for universal neural machine translation},
-  author={Platanios, Emmanouil Antonios and Sachan, Mrinmaya and Neubig, Graham and Mitchell, Tom},
-  journal={arXiv preprint arXiv:1808.08493},
-  year={2018}
+
+
+@inproceedings{platanios2018contextual,
+  author    = {Emmanouil Antonios Platanios and
+               Mrinmaya Sachan and
+               Graham Neubig and
+               Tom M. Mitchell},
+  title     = {Contextual Parameter Generation for Universal Neural Machine Translation},
+  pages     = {425--435},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
 }
 @inproceedings{ji2020cross,
  title={Cross-Lingual Pre-Training Based Transfer for Zero-Shot Neural Machine Translation},
  author={Ji, Baijun and Zhang, Zhirui and Duan, Xiangyu and Zhang, Min and Chen, Boxing and Luo, Weihua},
-  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
+  publisher={AAAI Conference on Artificial Intelligence},
  volume={34},
  number={01},
  pages={115--122},
@@ -10506,16 +10819,16 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2009}
 }
-@article{dabre2019brief,
+@inproceedings{dabre2019brief,
  title={A Brief Survey of Multilingual Neural Machine Translation},
  author={Dabre, Raj and Chu, Chenhui and Kunchukuttan, Anoop},
-  journal={arXiv preprint arXiv:1905.05395},
+  publisher={arXiv preprint arXiv:1905.05395},
  year={2019}
 }
-@article{dabre2020survey,
+@inproceedings{dabre2020survey,
  title={A survey of multilingual neural machine translation},
  author={Dabre, Raj and Chu, Chenhui and Kunchukuttan, Anoop},
-  journal={ACM Computing Surveys (CSUR)},
+  publisher={ACM Computing Surveys},
  volume={53},
  number={5},
  pages={1--38},
@@ -10551,13 +10864,13 @@ author    = {Zhuang Liu and
  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2018}
 }
-@article{DBLP:journals/tacl/LeeCH17,
+@inproceedings{DBLP:journals/tacl/LeeCH17,
  author    = {Jason Lee and
               Kyunghyun Cho and
               Thomas Hofmann},
  title     = {Fully Character-Level Neural Machine Translation without Explicit
               Segmentation},
-  journal   = {Transactions of the Association for Computational Linguistics},
+  publisher   = {Transactions of the Association for Computational Linguistics},
  volume    = {5},
  pages     = {365--378},
  year      = {2017}
@@ -10572,13 +10885,13 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2016}
 }
-@article{DBLP:journals/corr/HaNW16,
+@inproceedings{DBLP:journals/corr/HaNW16,
  author    = {Thanh-Le Ha and
               Jan Niehues and
               Alexander H. Waibel},
  title     = {Toward Multilingual Neural Machine Translation with Universal Encoder
               and Decoder},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1611.04798},
  year      = {2016}
 }
@@ -10645,7 +10958,7 @@ author    = {Zhuang Liu and
  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-1903-07091,
+@inproceedings{DBLP:journals/corr/abs-1903-07091,
  author    = {Naveen Arivazhagan and
               Ankur Bapna and
               Orhan Firat and
@@ -10653,7 +10966,7 @@ author    = {Zhuang Liu and
               Melvin Johnson and
               Wolfgang Macherey},
  title     = {The Missing Ingredient in Zero-Shot Neural Machine Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1903.07091},
  year      = {2019}
 }
@@ -10665,19 +10978,27 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{firat2016zero,
-  title={Zero-resource translation with multi-lingual neural machine translation},
-  author={Firat, Orhan and Sankaran, Baskaran and Al-Onaizan, Yaser and Vural, Fatos T Yarman and Cho, Kyunghyun},
-  journal={arXiv preprint arXiv:1606.04164},
-  year={2016}
+
+
+@inproceedings{firat2016zero,
+  author    = {Orhan Firat and
+               Baskaran Sankaran and
+               Yaser Al-Onaizan and
+               Fatos T. Yarman-Vural and
+               Kyunghyun Cho},
+  title     = {Zero-Resource Translation with Multi-Lingual Neural Machine Translation},
+  pages     = {268--277},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2016}
 }
-@article{DBLP:journals/corr/abs-1805-10338,
+
+@inproceedings{DBLP:journals/corr/abs-1805-10338,
  author    = {Lierni Sestorain and
               Massimiliano Ciaramita and
               Christian Buck and
               Thomas Hofmann},
  title     = {Zero-Shot Dual Machine Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1805.10338},
  year      = {2018}
 }
@@ -10757,7 +11078,7 @@ author    = {Zhuang Liu and
               Yoshua Bengio and
               Pierre-Antoine Manzagol},
  title     = {Extracting and composing robust features with denoising autoencoders},
-  series    = {{ACM} International Conference Proceeding Series},
+  series    = {International Conference on Learning Representations},
  volume    = {307},
  pages     = {1096--1103},
  publisher = {International Conference on Machine Learning}
@@ -10771,20 +11092,20 @@ author    = {Zhuang Liu and
  publisher = {International Conference on Learning Representations},
  year      = {2018}
 }
-@article{DBLP:journals/coling/BhagatH13,
+@inproceedings{DBLP:journals/coling/BhagatH13,
  author    = {Rahul Bhagat and
               Eduard H. Hovy},
  title     = {What Is a Paraphrase?},
-  journal   = {Computational Linguistics},
+  publisher   = {Computational Linguistics},
  volume    = {39},
  number    = {3},
  pages     = {463--472},
  year      = {2013}
 }
-@article{2010Generating,
+@inproceedings{2010Generating,
  title={Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods},
  author={ Madnani, Nitin  and  Dorr, Bonnie J. },
-  journal={Computational Linguistics},
+  publisher={Computational Linguistics},
  volume={36},
  number={3},
  pages={341-387},
@@ -10817,10 +11138,10 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the European Association for Machine Translation},
  year      = {2017}
 }
-@article{2005Improving,
+@inproceedings{2005Improving,
  title={Improving Machine Translation Performance by Exploiting Non-Parallel Corpora},
  author={ Munteanu, Ds  and  Marcu, D },
-  journal={Computational Linguistics},
+  publisher={Computational Linguistics},
  volume={31},
  number={4},
  pages={477-504},
@@ -10836,12 +11157,12 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2010}
 }
-@article{DBLP:journals/jair/RuderVS19,
+@inproceedings{DBLP:journals/jair/RuderVS19,
  author    = {Sebastian Ruder and
               Ivan Vulic and
               Anders S{\o}gaard},
  title     = {A Survey of Cross-lingual Word Embedding Models},
-  journal   = {J. Artif. Intell. Res.},
+  publisher   = {Journal of Artificial Intelligence Research},
  volume    = {65},
  pages     = {569--631},
  year      = {2019}
@@ -10856,14 +11177,14 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2016}
 }
-@article{DBLP:journals/tacl/TuLLLL17,
+@inproceedings{DBLP:journals/tacl/TuLLLL17,
  author    = {Zhaopeng Tu and
               Yang Liu and
               Zhengdong Lu and
               Xiaohua Liu and
               Hang Li},
  title     = {Context Gates for Neural Machine Translation},
-  journal   = {Annual Meeting of the Association for Computational Linguistics},
+  publisher   = {Transactions of the Association for Computational Linguistics},
  volume    = {5},
  pages     = {87--99},
  year      = {2017}
@@ -10883,12 +11204,21 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{ng2019facebook,
-  title={Facebook FAIR's WMT19 News Translation Task Submission},
-  author={Ng, Nathan and Yee, Kyra and Baevski, Alexei and Ott, Myle and Auli, Michael and Edunov, Sergey},
-  journal={arXiv preprint arXiv:1907.06616},
-  year={2019}
+
+
+@inproceedings{ng2019facebook,
+  author    = {Nathan Ng and
+               Kyra Yee and
+               Alexei Baevski and
+               Myle Ott and
+               Michael Auli and
+               Sergey Edunov},
+  title     = {Facebook FAIR's {WMT19} News Translation Task Submission},
+  pages     = {314--319},
+  publisher = {Association for Computational Linguistics},
+  year      = {2019}
 }
+
 @inproceedings{DBLP:conf/wmt/WangLLJZLLXZ18,
  author    = {Qiang Wang and
               Bei Li and
@@ -10935,7 +11265,9 @@ author    = {Zhuang Liu and
  publisher = {Conference and Workshop on Neural Information Processing Systems},
  year      = {2015}
 }
-@article{DBLP:journals/corr/abs-1802-05365,
+
+
+@inproceedings{DBLP:journals/corr/abs-1802-05365,
  author    = {Matthew E. Peters and
               Mark Neumann and
               Mohit Iyyer and
@@ -10943,11 +11275,12 @@ author    = {Zhuang Liu and
               Christopher Clark and
               Kenton Lee and
               Luke Zettlemoyer},
-  title     = {Deep contextualized word representations},
-  journal   = {CoRR},
-  volume    = {abs/1802.05365},
+  title     = {Deep Contextualized Word Representations},
+  pages     = {2227--2237},
+  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2018}
 }
+
 @inproceedings{DBLP:conf/icml/CollobertW08,
  author    = {Ronan Collobert and
               Jason Weston},
@@ -11004,13 +11337,13 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-1908-06259,
+@inproceedings{DBLP:journals/corr/abs-1908-06259,
  author    = {Tianyu He and
               Xu Tan and
               Tao Qin},
  title     = {Hard but Robust, Easy but Sensitive: How Encoder and Decoder Perform
               in Neural Machine Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1908.06259},
  year      = {2019}
 }
@@ -11035,12 +11368,18 @@ author    = {Zhuang Liu and
  publisher = {Springer},
  year      = {1998}
 }
-@article{liu2019multi,
-  title={Multi-task deep neural networks for natural language understanding},
-  author={Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng},
-  journal={arXiv preprint arXiv:1901.11504},
-  year={2019}
+
+@inproceedings{liu2019multi,
+  author    = {Xiaodong Liu and
+               Pengcheng He and
+               Weizhu Chen and
+               Jianfeng Gao},
+  title     = {Multi-Task Deep Neural Networks for Natural Language Understanding},
+  pages     = {4487--4496},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2019}
 }
+
 @inproceedings{DBLP:journals/corr/LuongLSVK15,
  author    = {Minh-Thang Luong and
               Quoc V. Le and
@@ -11059,7 +11398,7 @@ author    = {Zhuang Liu and
  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2016}
 }
-@article{DBLP:journals/tacl/JohnsonSLKWCTVW17,
+@inproceedings{DBLP:journals/tacl/JohnsonSLKWCTVW17,
  author    = {Melvin Johnson and
               Mike Schuster and
               Quoc V. Le and
@@ -11074,19 +11413,19 @@ author    = {Zhuang Liu and
               Jeffrey Dean},
  title     = {Google's Multilingual Neural Machine Translation System: Enabling
               Zero-Shot Translation},
-  journal   = {Transactions of the Association for Computational Linguistics},
+  publisher   = {Transactions of the Association for Computational Linguistics},
  volume    = {5},
  pages     = {339--351},
  year      = {2017}
 }
-@article{DBLP:journals/csl/GulcehreFXCB17,
+@inproceedings{DBLP:journals/csl/GulcehreFXCB17,
  author    = {{\c{C}}aglar G{\"{u}}l{\c{c}}ehre and
               Orhan Firat and
               Kelvin Xu and
               Kyunghyun Cho and
               Yoshua Bengio},
  title     = {On integrating a language model into neural machine translation},
-  journal   = {Computational Linguistics},
+  publisher   = {Computer Speech {\&} Language},
  volume    = {45},
  pages     = {137--148},
  year      = {2017}
@@ -11154,10 +11493,10 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2013}
 }
-@article{imamura2016multi,
+@inproceedings{imamura2016multi,
  title={Multi-domain adaptation for statistical machine translation based on feature augmentation},
  author={Imamura, Kenji and Sumita, Eiichiro},
-  journal={Association for Machine Translation in the Americas},
+  publisher={Association for Machine Translation in the Americas},
  pages={79},
  year={2016}
 }
@@ -11180,10 +11519,10 @@ author    = {Zhuang Liu and
  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2010}
 }
-@article{shah2012general,
+@inproceedings{shah2012general,
  title={A general framework to weight heterogeneous parallel data for model adaptation in statistical machine translation},
  author={Shah, Kashif and Barrault, Lo{\"{\i}}c and Schwenk, Holger},
-  journal={MT Summit, Octobre},
+  publisher={Machine Translation Summit},
  year={2012}
 }
 @inproceedings{DBLP:conf/iwslt/MansourN12,
@@ -11239,17 +11578,17 @@ author    = {Zhuang Liu and
  publisher = {International Conference on Computational Linguistics},
  year      = {2014}
 }
-@article{joty2015using,
+@inproceedings{joty2015using,
  title={Using joint models for domain adaptation in statistical machine translation},
  author={Durrani, Nadir and Sajjad, Hassan and Joty, Shafiq and Abdelali, Ahmed and Vogel, Stephan},
-  journal={Proceedings of MT Summit XV},
+  publisher={Proceedings of MT Summit XV},
  pages={117},
  year={2015}
 }
 @inproceedings{chen2016bilingual,
  title={Bilingual methods for adaptive training data selection for machine translation},
  author={Chen, Boxing and Kuhn, Roland and Foster, George and Cherry, Colin and Huang, Fei},
-  booktitle={Association for Machine Translation in the Americas},
+  publisher={Association for Machine Translation in the Americas},
  pages={93--103},
  year={2016}
 }
@@ -11320,7 +11659,7 @@ author    = {Zhuang Liu and
  publisher={International Workshop on Spoken Language Translation},
  year={2011}
 }
-@article{moore2010intelligent,
+@inproceedings{moore2010intelligent,
  title = {Intelligent selection of language model training data},
  author = {Moore, Robert C and Lewis, Will},
  publisher = {Annual Meeting of the Association for Computational Linguistics},
@@ -11367,16 +11706,16 @@ author    = {Zhuang Liu and
  publisher = {International Conference on Computational Linguistics},
  year      = {2016}
 }
-@article{chu2015integrated,
+@inproceedings{chu2015integrated,
  title={Integrated parallel data extraction from comparable corpora for statistical machine translation},
  author={Chu, Chenhui},
  year={2015},
  publisher={Kyoto University}
 }
-@article{DBLP:journals/tit/Scudder65a,
+@inproceedings{DBLP:journals/tit/Scudder65a,
  author    = {H. J. Scudder III},
  title     = {Probability of error of some adaptive pattern-recognition machines},
-  journal   = {{IEEE} Transactions on Information Theory},
+  publisher   = {{IEEE} Transactions on Information Theory},
  volume    = {11},
  number    = {3},
  pages     = {363--371},
@@ -11390,14 +11729,14 @@ author    = {Zhuang Liu and
  publisher = {International Conference on Computational Linguistics},
  year      = {2018}
 }
-@article{DBLP:journals/corr/abs-1708-08712,
+@inproceedings{DBLP:journals/corr/abs-1708-08712,
  author    = {Hassan Sajjad and
               Nadir Durrani and
               Fahim Dalvi and
               Yonatan Belinkov and
               Stephan Vogel},
  title     = {Neural Machine Translation Training in a Multi-Domain Scenario},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1708.08712},
  year      = {2017}
 }
@@ -11464,7 +11803,7 @@ author    = {Zhuang Liu and
 @inproceedings{britz2017effective,
  title={Effective domain mixing for neural machine translation},
  author={Britz, Denny and Le, Quoc and Pryzant, Reid},
-  booktitle={Proceedings of the Second Conference on Machine Translation},
+  publisher={Proceedings of the Second Conference on Machine Translation},
  pages={118--126},
  year={2017}
 }
@@ -11499,21 +11838,21 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{DBLP:journals/corr/abs-1906-03129,
+@inproceedings{DBLP:journals/corr/abs-1906-03129,
  author    = {Shen Yan and
               Leonard Dahlmann and
               Pavel Petrushkov and
               Sanjika Hewavitharana and
               Shahram Khadivi},
  title     = {Word-based Domain Adaptation for Neural Machine Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1906.03129},
  year      = {2019}
 }
-@article{dakwale2017finetuning,
+@inproceedings{dakwale2017finetuning,
  title={Finetuning for neural machine translation with limited degradation across in- and out-of-domain data},
  author={Dakwale, Praveen and Monz, Christof},
-  journal={Proceedings of the XVI Machine Translation Summit},
+  publisher={Proceedings of the XVI Machine Translation Summit},
  volume={117},
  year={2017}
 }
@@ -11530,12 +11869,19 @@ author    = {Zhuang Liu and
  publisher = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2019}
 }
-@article{barone2017regularization,
-  title={Regularization techniques for fine-tuning in neural machine translation},
-  author={Barone, Antonio Valerio Miceli and Haddow, Barry and Germann, Ulrich and Sennrich, Rico},
-  journal={arXiv preprint arXiv:1707.09920},
-  year={2017}
+
+
+@inproceedings{barone2017regularization,
+  author    = {Antonio Valerio Miceli Barone and
+               Barry Haddow and
+               Ulrich Germann and
+               Rico Sennrich},
+  title     = {Regularization techniques for fine-tuning in neural machine translation},
+  pages     = {1489--1494},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2017}
 }
+
 @inproceedings{DBLP:conf/acl/SaundersB20,
  author    = {Danielle Saunders and
               Bill Byrne},
@@ -11548,7 +11894,7 @@ author    = {Zhuang Liu and
 @inproceedings{khayrallah2017neural,
  title={Neural lattice search for domain adaptation in machine translation},
  author={Khayrallah, Huda and Kumar, Gaurav and Duh, Kevin and Post, Matt and Koehn, Philipp},
-  booktitle={Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
+  publisher={Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
  pages={20--25},
  year={2017}
 }
@@ -11562,11 +11908,11 @@ author    = {Zhuang Liu and
  publisher = {Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/FreitagA16,
+@inproceedings{DBLP:journals/corr/FreitagA16,
  author    = {Markus Freitag and
               Yaser Al-Onaizan},
  title     = {Fast Domain Adaptation for Neural Machine Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1612.06897},
  year      = {2016}
 }
@@ -11589,10 +11935,10 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2017}
 }
-@article{DBLP:journals/ibmrd/Luhn58,
+@inproceedings{DBLP:journals/ibmrd/Luhn58,
  author    = {Hans Peter Luhn},
  title     = {The Automatic Creation of Literature Abstracts},
-  journal   = {{IBM} J. Res. Dev.},
+  publisher   = {{IBM} Journal of Research and Development},
  volume    = {2},
  number    = {2},
  pages     = {159--165},
@@ -11655,7 +12001,7 @@ author    = {Zhuang Liu and
  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-2010-11125,
+@inproceedings{DBLP:journals/corr/abs-2010-11125,
  author    = {Angela Fan and
               Shruti Bhosale and
               Holger Schwenk and
@@ -11674,7 +12020,7 @@ author    = {Zhuang Liu and
               Michael Auli and
               Armand Joulin},
  title     = {Beyond English-Centric Multilingual Machine Translation},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/2010.11125},
  year      = {2020}
 }
@@ -11746,13 +12092,13 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2019}
 }
-@article{DBLP:journals/ejasmp/RadzikowskiNWY19,
+@inproceedings{DBLP:journals/ejasmp/RadzikowskiNWY19,
  author    = {Kacper Radzikowski and
               Robert Nowak and
               Le Wang and
               Osamu Yoshie},
  title     = {Dual supervised learning for non-native speech recognition},
-  journal   = {{EURASIP} J. Audio Speech Music. Process.},
+  publisher   = {EURASIP Journal on Audio, Speech, and Music Processing},
  volume    = {2019},
  pages     = {3},
  year      = {2019}
@@ -11774,13 +12120,13 @@ author    = {Zhuang Liu and
  publisher = {{IEEE} Computer Society},
  year      = {2017}
 }
-@article{DBLP:journals/access/DuRZH20,
+@inproceedings{DBLP:journals/access/DuRZH20,
  author    = {Liang Du and
               Xin Ren and
               Peng Zhou and
               Zhiguo Hu},
  title     = {Unsupervised Dual Learning for Feature and Instance Selection},
-  journal   = {{IEEE} Access},
+  publisher   = {{IEEE} Access},
  volume    = {8},
  pages     = {170248--170260},
  year      = {2020}
@@ -11794,6 +12140,7 @@ author    = {Zhuang Liu and
  publisher = {Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
 }
+
 @inproceedings{DBLP:conf/nips/YangDYCSL19,
  author    = {Zhilin Yang and
               Zihang Dai and
@@ -11802,13 +12149,14 @@ author    = {Zhuang Liu and
               Ruslan Salakhutdinov and
               Quoc V. Le},
  title     = {XLNet: Generalized Autoregressive Pretraining for Language Understanding},
+  publisher = {Annual Conference on Neural Information Processing Systems},
  pages     = {5754--5764},
  year      = {2019}
 }
-@article{lewis2019bart,
+@inproceedings{lewis2019bart,
  title={{BART}: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension},
  author={Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Ves and Zettlemoyer, Luke},
-  journal={arXiv preprint arXiv:1910.13461},
+  publisher={arXiv preprint arXiv:1910.13461},
  year={2019}
 }
 @inproceedings{DBLP:conf/iclr/LanCGGSS20,
@@ -11860,7 +12208,7 @@ author    = {Zhuang Liu and
  publisher = {International Conference on Computer Vision},
  year      = {2019}
 }
-@article{DBLP:journals/corr/abs-2010-12831,
+@inproceedings{DBLP:journals/corr/abs-2010-12831,
  author    = {Liunian Harold Li and
               Haoxuan You and
               Zhecan Wang and
@@ -11869,7 +12217,7 @@ author    = {Zhuang Liu and
               Kai-Wei Chang},
  title     = {Weakly-supervised VisualBERT: Pre-training without Parallel Images
               and Captions},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/2010.12831},
  year      = {2020}
 }
@@ -11919,18 +12267,18 @@ author    = {Zhuang Liu and
 @inproceedings{shen2020q,
  title={Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT},
  author={Shen, Sheng and Dong, Zhen and Ye, Jiayu and Ma, Linjian and Yao, Zhewei and Gholami, Amir and Mahoney, Michael W and Keutzer, Kurt},
-  booktitle={AAAI Conference on Artificial Intelligence},
+  publisher={AAAI Conference on Artificial Intelligence},
  pages={8815--8821},
  year={2020}
 }
-@article{DBLP:journals/corr/abs-1910-01108,
+@inproceedings{DBLP:journals/corr/abs-1910-01108,
  author    = {Victor Sanh and
               Lysandre Debut and
               Julien Chaumond and
               Thomas Wolf},
  title     = {DistilBERT, a distilled version of {BERT:} smaller, faster, cheaper
               and lighter},
-  journal   = {CoRR},
+  publisher   = {CoRR},
  volume    = {abs/1910.01108},
  year      = {2019}
 }
@@ -12890,8 +13238,730 @@ author    = {Zhuang Liu and
  publisher={电子工业出版社},
  year={2020}
 }
+%%%%%%%%%%%%%%%%%Wang Yichao's section, added by Meng Xia%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+@inproceedings{DBLP:conf/mm/LinMSYYGZL20,
+  author    = {Huan Lin and
+               Fandong Meng and
+               Jinsong Su and
+               Yongjing Yin and
+               Zhengyuan Yang and
+               Yubin Ge and
+               Jie Zhou and
+               Jiebo Luo},
+  title     = {Dynamic Context-guided Capsule Network for Multimodal Machine Translation},
+  pages     = {1320--1329},
+  publisher = {ACM Multimedia},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/wmt/SpeciaFSE16,
+  author    = {Lucia Specia and
+               Stella Frank and
+               Khalil Sima'an and
+               Desmond Elliott},
+  title     = {A Shared Task on Multimodal Machine Translation and Crosslingual Image
+               Description},
+  pages     = {543--553},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/wmt/ElliottFBBS17,
+  author    = {Desmond Elliott and
+               Stella Frank and
+               Lo{\"{\i}}c Barrault and
+               Fethi Bougares and
+               Lucia Specia},
+  title     = {Findings of the Second Shared Task on Multimodal Machine Translation
+               and Multilingual Image Description},
+  pages     = {215--233},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/wmt/BarraultBSLEF18,
+  author    = {Lo{\"{\i}}c Barrault and
+               Fethi Bougares and
+               Lucia Specia and
+               Chiraag Lala and
+               Desmond Elliott and
+               Stella Frank},
+  title     = {Findings of the Third Shared Task on Multimodal Machine Translation},
+  pages     = {304--323},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/wmt/CaglayanABGBBMH17,
+  author    = {Ozan Caglayan and
+               Walid Aransa and
+               Adrien Bardet and
+               Mercedes Garc{\'{\i}}a-Mart{\'{\i}}nez and
+               Fethi Bougares and
+               Lo{\"{\i}}c Barrault and
+               Marc Masana and
+               Luis Herranz and
+               Joost van de Weijer},
+  title     = {{LIUM-CVC} Submissions for {WMT17} Multimodal Translation Task},
+  pages     = {432--439},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/wmt/LibovickyHTBP16,
+  author    = {Jindrich Libovick{\'{y}} and
+               Jindrich Helcl and
+               Marek Tlust{\'{y}} and
+               Ondrej Bojar and
+               Pavel Pecina},
+  title     = {{CUNI} System for {WMT16} Automatic Post-Editing and Multimodal Translation
+               Tasks},
+  pages     = {646--654},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/emnlp/CalixtoL17,
+  author    = {Iacer Calixto and
+               Qun Liu},
+  title     = {Incorporating Global Visual Features into Attention-based Neural Machine
+               Translation},
+  pages     = {992--1003},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/wmt/HuangLSOD16,
+  author    = {Po-Yao Huang and
+               Frederick Liu and
+               Sz-Rung Shiang and
+               Jean Oh and
+               Chris Dyer},
+  title     = {Attention-based Multimodal Neural Machine Translation},
+  pages     = {639--645},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2016}
+}
+
+@article{Elliott2015MultilingualID,
+  title={Multilingual Image Description with Neural Sequence Models},
+  author={Desmond Elliott and
+          Stella Frank and
+          Eva Hasler},
+  journal={arXiv: Computation and Language},
+  year={2015}
+}
+
+@inproceedings{DBLP:conf/wmt/MadhyasthaWS17,
+  author    = {Pranava Swaroop Madhyastha and
+               Josiah Wang and
+               Lucia Specia},
+  title     = {Sheffield MultiMT: Using Object Posterior Predictions for Multimodal
+               Machine Translation},
+  pages     = {470--476},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/CaglayanBB16,
+  author    = {Ozan Caglayan and
+               Lo{\"{\i}}c Barrault and
+               Fethi Bougares},
+  title     = {Multimodal Attention for Neural Machine Translation},
+  journal   = {CoRR},
+  volume    = {abs/1609.03976},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/acl/CalixtoLC17,
+  author    = {Iacer Calixto and
+               Qun Liu and
+               Nick Campbell},
+  title     = {Doubly-Attentive Decoder for Multi-modal Neural Machine Translation},
+  pages     = {1913--1924},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/DelbrouckD17,
+  author    = {Jean-Benoit Delbrouck and
+               St{\'{e}}phane Dupont},
+  title     = {Multimodal Compact Bilinear Pooling for Multimodal Neural Machine
+               Translation},
+  journal   = {CoRR},
+  volume    = {abs/1703.08084},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/acl/LibovickyH17,
+  author    = {Jindrich Libovick{\'{y}} and
+               Jindrich Helcl},
+  title     = {Attention Strategies for Multi-Source Sequence-to-Sequence Learning},
+  pages     = {196--202},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/abs-1712-03449,
+  author    = {Jean-Benoit Delbrouck and
+               St{\'{e}}phane Dupont},
+  title     = {Modulating and attending the source image during encoding improves
+               Multimodal Translation},
+  journal   = {CoRR},
+  volume    = {abs/1712.03449},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/abs-1807-11605,
+  author    = {Hasan Sait Arslan and
+               Mark Fishel and
+               Gholamreza Anbarjafari},
+  title     = {Doubly Attentive Transformer Machine Translation},
+  journal   = {CoRR},
+  volume    = {abs/1807.11605},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/wmt/HelclLV18,
+  author    = {Jindrich Helcl and
+               Jindrich Libovick{\'{y}} and
+               Dusan Varis},
+  title     = {{CUNI} System for the {WMT18} Multimodal Translation Task},
+  pages     = {616--623},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/ijcnlp/ElliottK17,
+  author    = {Desmond Elliott and
+               {\'{A}}kos K{\'{a}}d{\'{a}}r},
+  title     = {Imagination Improves Multimodal Translation},
+  pages     = {130--141},
+  publisher = {International Joint Conference on Natural Language Processing},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/emnlp/ZhouCLY18,
+  author    = {Mingyang Zhou and
+               Runxiang Cheng and
+               Yong Jae Lee and
+               Zhou Yu},
+  title     = {A Visual Attention Grounding Neural Model for Multimodal Machine Translation},
+  pages     = {3643--3653},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/acl/CalixtoRA19,
+  author    = {Iacer Calixto and
+               Miguel Rios and
+               Wilker Aziz},
+  title     = {Latent Variable Model for Multi-modal Translation},
+  pages     = {6392--6405},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/acl/YinMSZYZL20,
+  author    = {Yongjing Yin and
+               Fandong Meng and
+               Jinsong Su and
+               Chulun Zhou and
+               Zhengyuan Yang and
+               Jie Zhou and
+               Jiebo Luo},
+  title     = {A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
+               Translation},
+  pages     = {3025--3035},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/acl/YaoW20,
+  author    = {Shaowei Yao and
+               Xiaojun Wan},
+  title     = {Multimodal Transformer for Multimodal Machine Translation},
+  pages     = {4346--4350},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/nips/LuYBP16,
+  author    = {Jiasen Lu and
+               Jianwei Yang and
+               Dhruv Batra and
+               Devi Parikh},
+  title     = {Hierarchical Question-Image Co-Attention for Visual Question Answering},
+  publisher = {Conference on Neural Information Processing Systems},
+  pages     = {289--297},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/cvpr/VinyalsTBE15,
+  author    = {Oriol Vinyals and
+               Alexander Toshev and
+               Samy Bengio and
+               Dumitru Erhan},
+  title     = {Show and tell: {A} neural image caption generator},
+  pages     = {3156--3164},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2015}
+}
+
+@inproceedings{DBLP:conf/icml/XuBKCCSZB15,
+  author    = {Kelvin Xu and
+               Jimmy Ba and
+               Ryan Kiros and
+               Kyunghyun Cho and
+               Aaron C. Courville and
+               Ruslan Salakhutdinov and
+               Richard S. Zemel and
+               Yoshua Bengio},
+  title     = {Show, Attend and Tell: Neural Image Caption Generation with Visual
+               Attention},
+  volume    = {37},
+  pages     = {2048--2057},
+  publisher = {International Conference on Machine Learning},
+  year      = {2015}
+}
+
+@inproceedings{DBLP:conf/cvpr/YouJWFL16,
+  author    = {Quanzeng You and
+               Hailin Jin and
+               Zhaowen Wang and
+               Chen Fang and
+               Jiebo Luo},
+  title     = {Image Captioning with Semantic Attention},
+  pages     = {4651--4659},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/cvpr/ChenZXNSLC17,
+  author    = {Long Chen and
+               Hanwang Zhang and
+               Jun Xiao and
+               Liqiang Nie and
+               Jian Shao and
+               Wei Liu and
+               Tat-Seng Chua},
+  title     = {{SCA-CNN:} Spatial and Channel-Wise Attention in Convolutional Networks
+               for Image Captioning},
+  pages     = {6298--6306},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2017}
+}
+
+@article{DBLP:journals/pami/FuJCSZ17,
+  author    = {Kun Fu and
+               Junqi Jin and
+               Runpeng Cui and
+               Fei Sha and
+               Changshui Zhang},
+  title     = {Aligning Where to See and What to Tell: Image Captioning with Region-Based
+               Attention and Scene-Specific Contexts},
+  journal   = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
+  volume    = {39},
+  number    = {12},
+  pages     = {2321--2334},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/eccv/YaoPLM18,
+  author    = {Ting Yao and
+               Yingwei Pan and
+               Yehao Li and
+               Tao Mei},
+  title     = {Exploring Visual Relationship for Image Captioning},
+  series    = {Lecture Notes in Computer Science},
+  volume    = {11218},
+  pages     = {711--727},
+  publisher = {European Conference on Computer Vision},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/ijcai/LiuSWWY17,
+  author    = {Chang Liu and
+               Fuchun Sun and
+               Changhu Wang and
+               Feng Wang and
+               Alan L. Yuille},
+  title     = {{MAT:} {A} Multimodal Attentive Translator for Image Captioning},
+  pages     = {4033--4039},
+  publisher = {International Joint Conference on Artificial Intelligence},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/abs-1804-02767,
+  author    = {Joseph Redmon and
+               Ali Farhadi},
+  title     = {YOLOv3: An Incremental Improvement},
+  journal   = {CoRR},
+  volume    = {abs/1804.02767},
+  year      = {2018}
+}
+
+@article{DBLP:journals/corr/abs-2004-10934,
+  author    = {Alexey Bochkovskiy and
+               Chien-Yao Wang and
+               Hong-Yuan Mark Liao},
+  title     = {YOLOv4: Optimal Speed and Accuracy of Object Detection},
+  journal   = {CoRR},
+  volume    = {abs/2004.10934},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/cvpr/LuXPS17,
+  author    = {Jiasen Lu and
+               Caiming Xiong and
+               Devi Parikh and
+               Richard Socher},
+  title     = {Knowing When to Look: Adaptive Attention via a Visual Sentinel for
+               Image Captioning},
+  pages     = {3242--3250},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/cvpr/00010BT0GZ18,
+  author    = {Peter Anderson and
+               Xiaodong He and
+               Chris Buehler and
+               Damien Teney and
+               Mark Johnson and
+               Stephen Gould and
+               Lei Zhang},
+  title     = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question
+               Answering},
+  pages     = {6077--6086},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/mm/ZhouXKC17,
+  author    = {Luowei Zhou and
+               Chenliang Xu and
+               Parker A. Koch and
+               Jason J. Corso},
+  title     = {Watch What You Just Said: Image Captioning with Text-Conditional Attention},
+  pages     = {305--313},
+  publisher = {ACM Multimedia},
+  year      = {2017}
+}
+
+@article{DBLP:journals/mta/FangWCT18,
+  author    = {Fang Fang and
+               Hanli Wang and
+               Yihao Chen and
+               Pengjie Tang},
+  title     = {Looking deeper and transferring attention for image captioning},
+  journal   = {Multimedia Tools and Applications},
+  volume    = {77},
+  number    = {23},
+  pages     = {31159--31175},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/cvpr/AnejaDS18,
+  author    = {Jyoti Aneja and
+               Aditya Deshpande and
+               Alexander G. Schwing},
+  title     = {Convolutional Image Captioning},
+  pages     = {5561--5570},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2018}
+}
+
+@article{DBLP:journals/corr/abs-1805-09019,
+  author    = {Qingzhong Wang and
+               Antoni B. Chan},
+  title     = {{CNN+CNN:} Convolutional Decoders for Image Captioning},
+  journal   = {CoRR},
+  volume    = {abs/1805.09019},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/eccv/DaiYL18,
+  author    = {Bo Dai and
+               Deming Ye and
+               Dahua Lin},
+  title     = {Rethinking the Form of Latent States in Image Captioning},
+  volume    = {11209},
+  pages     = {294--310},
+  publisher = {European Conference on Computer Vision},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/iccv/AntolALMBZP15,
+  author    = {Stanislaw Antol and
+               Aishwarya Agrawal and
+               Jiasen Lu and
+               Margaret Mitchell and
+               Dhruv Batra and
+               C. Lawrence Zitnick and
+               Devi Parikh},
+  title     = {{VQA:} Visual Question Answering},
+  pages     = {2425--2433},
+  publisher = {International Conference on Computer Vision},
+  year      = {2015}
+}
+
+@inproceedings{DBLP:conf/eccv/CarionMSUKZ20,
+  author    = {Nicolas Carion and
+               Francisco Massa and
+               Gabriel Synnaeve and
+               Nicolas Usunier and
+               Alexander Kirillov and
+               Sergey Zagoruyko},
+  title     = {End-to-End Object Detection with Transformers},
+  volume    = {12346},
+  pages     = {213--229},
+  publisher = {European Conference on Computer Vision},
+  year      = {2020}
+}
+
+@article{DBLP:journals/tcsv/YuLYH20,
+  author    = {Jun Yu and
+               Jing Li and
+               Zhou Yu and
+               Qingming Huang},
+  title     = {Multimodal Transformer With Multi-View Visual Representation for Image
+               Captioning},
+  journal   = {IEEE Transactions on Circuits and Systems for Video Technology},
+  volume    = {30},
+  number    = {12},
+  pages     = {4467--4480},
+  year      = {2020}
+}
+
+@article{Huasong2020SelfAdaptiveNM,
+  title={Self-Adaptive Neural Module Transformer for Visual Question Answering},
+  author={Huasong Zhong and Jingyuan Chen and Chen Shen and Hanwang Zhang and Jianqiang Huang and Xian-Sheng Hua},
+  journal={IEEE Transactions on Multimedia},
+  year={2020},
+  pages={1--1}
+}
+
+@inproceedings{DBLP:conf/emnlp/GokhaleBBY20,
+  author    = {Tejas Gokhale and
+               Pratyay Banerjee and
+               Chitta Baral and
+               Yezhou Yang},
+  title     = {{MUTANT:} {A} Training Paradigm for Out-of-Distribution Generalization
+               in Visual Question Answering},
+  pages     = {878--892},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/eccv/Tang0ZWY20,
+  author    = {Ruixue Tang and
+               Chao Ma and
+               Wei Emma Zhang and
+               Qi Wu and
+               Xiaokang Yang},
+  title     = {Semantic Equivalent Adversarial Data Augmentation for Visual Question
+               Answering},
+  volume    = {12364},
+  pages     = {437--453},
+  publisher = {European Conference on Computer Vision},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/eccv/Li0LZHZWH0WCG20,
+  author    = {Xiujun Li and
+               Xi Yin and
+               Chunyuan Li and
+               Pengchuan Zhang and
+               Xiaowei Hu and
+               Lei Zhang and
+               Lijuan Wang and
+               Houdong Hu and
+               Li Dong and
+               Furu Wei and
+               Yejin Choi and
+               Jianfeng Gao},
+  title     = {Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
+  volume    = {12375},
+  pages     = {121--137},
+  publisher = {European Conference on Computer Vision},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/aaai/ZhouPZHCG20,
+  author    = {Luowei Zhou and
+               Hamid Palangi and
+               Lei Zhang and
+               Houdong Hu and
+               Jason J. Corso and
+               Jianfeng Gao},
+  title     = {Unified Vision-Language Pre-Training for Image Captioning and {VQA}},
+  pages     = {13041--13049},
+  publisher = {AAAI Conference on Artificial Intelligence},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/iclr/SuZCLLWD20,
+  author    = {Weijie Su and
+               Xizhou Zhu and
+               Yue Cao and
+               Bin Li and
+               Lewei Lu and
+               Furu Wei and
+               Jifeng Dai},
+  title     = {{VL-BERT:} Pre-training of Generic Visual-Linguistic Representations},
+  publisher = {International Conference on Learning Representations},
+  year      = {2020}
+}
+
+@inproceedings{DBLP:conf/nips/GoodfellowPMXWOCB14,
+  author    = {Ian J. Goodfellow and
+               Jean Pouget-Abadie and
+               Mehdi Mirza and
+               Bing Xu and
+               David Warde-Farley and
+               Sherjil Ozair and
+               Aaron C. Courville and
+               Yoshua Bengio},
+  title     = {Generative Adversarial Nets},
+  publisher = {Conference on Neural Information Processing Systems},
+  pages     = {2672--2680},
+  year      = {2014}
+}
+
+@inproceedings{DBLP:conf/nips/ZhuZPDEWS17,
+  author    = {Jun-Yan Zhu and
+               Richard Zhang and
+               Deepak Pathak and
+               Trevor Darrell and
+               Alexei A. Efros and
+               Oliver Wang and
+               Eli Shechtman},
+  title     = {Toward Multimodal Image-to-Image Translation},
+  publisher = {Conference on Neural Information Processing Systems},
+  pages     = {465--476},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/abs-1908-06616,
+  author    = {Hajar Emami and
+               Majid Moradi Aliabadi and
+               Ming Dong and
+               Ratna Babu Chinnam},
+  title     = {{SPA-GAN:} Spatial Attention {GAN} for Image-to-Image Translation},
+  journal   = {CoRR},
+  volume    = {abs/1908.06616},
+  year      = {2019}
+}
+
+@article{DBLP:journals/access/XiongWG19,
+  author    = {Feng Xiong and
+               Qianqian Wang and
+               Quanxue Gao},
+  title     = {Consistent Embedded {GAN} for Image-to-Image Translation},
+  journal   = {IEEE Access},
+  volume    = {7},
+  pages     = {126651--126661},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/iccv/ZhuPIE17,
+  author    = {Jun-Yan Zhu and
+               Taesung Park and
+               Phillip Isola and
+               Alexei A. Efros},
+  title     = {Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial
+               Networks},
+  pages     = {2242--2251},
+  publisher = {International Conference on Computer Vision},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/iccv/YiZTG17,
+  author    = {Zili Yi and
+               Hao (Richard) Zhang and
+               Ping Tan and
+               Minglun Gong},
+  title     = {{DualGAN:} Unsupervised Dual Learning for Image-to-Image Translation},
+  pages     = {2868--2876},
+  publisher = {International Conference on Computer Vision},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/nips/LiuBK17,
+  author    = {Ming-Yu Liu and
+               Thomas Breuel and
+               Jan Kautz},
+  title     = {Unsupervised Image-to-Image Translation Networks},
+  publisher = {Conference on Neural Information Processing Systems},
+  pages     = {700--708},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/cvpr/IsolaZZE17,
+  author    = {Phillip Isola and
+               Jun-Yan Zhu and
+               Tinghui Zhou and
+               Alexei A. Efros},
+  title     = {Image-to-Image Translation with Conditional Adversarial Networks},
+  pages     = {5967--5976},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/icml/ReedAYLSL16,
+  author    = {Scott E. Reed and
+               Zeynep Akata and
+               Xinchen Yan and
+               Lajanugen Logeswaran and
+               Bernt Schiele and
+               Honglak Lee},
+  title     = {Generative Adversarial Text to Image Synthesis},
+  volume    = {48},
+  pages     = {1060--1069},
+  publisher = {International Conference on Machine Learning},
+  year      = {2016}
+}
+
+@article{DBLP:journals/corr/DashGALA17,
+  author    = {Ayushman Dash and
+               John Cristian Borges Gamboa and
+               Sheraz Ahmed and
+               Marcus Liwicki and
+               Muhammad Zeshan Afzal},
+  title     = {{TAC-GAN} - Text Conditioned Auxiliary Classifier Generative Adversarial
+               Network},
+  journal   = {CoRR},
+  volume    = {abs/1703.06412},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/nips/ReedAMTSL16,
+  author    = {Scott E. Reed and
+               Zeynep Akata and
+               Santosh Mohan and
+               Samuel Tenka and
+               Bernt Schiele and
+               Honglak Lee},
+  title     = {Learning What and Where to Draw},
+  publisher = {Conference on Neural Information Processing Systems},
+  pages     = {217--225},
+  year      = {2016}
+}
+
+@inproceedings{DBLP:conf/cvpr/ZhangXY18,
+  author    = {Zizhao Zhang and
+               Yuanpu Xie and
+               Lin Yang},
+  title     = {Photographic Text-to-Image Synthesis With a Hierarchically-Nested
+               Adversarial Network},
+  pages     = {6199--6208},
+  publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
+  year      = {2018}
+}
+
 %%%%% chapter 17------------------------------------------------------
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%cha

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%% chapter 18------------------------------------------------------

--- a/mt-book-xelatex.tex
+++ b/mt-book-xelatex.tex
@@ -147,9 +147,10 @@
 %\include{Chapter13/chapter13}
 %\include{Chapter14/chapter14}
 %\include{Chapter15/chapter15}
-\include{Chapter16/chapter16}
+%\include{Chapter16/chapter16}
 %\include{Chapter17/chapter17}
-%\include{Chapter18/chapter18}
+\include{Chapter18/chapter18}
+\include{Chapter19/chapter19}
 %\include{ChapterAppend/chapterappend}