Merge branch 'caorunzhe' into 'mengxia'

Caorunzhe. See merge request !292

Commit 572eacf5 authored Sep 29, 2020 by 孟霞
--- a/Chapter1/Figures/figure-results-zh-to-en-news-field-translation.tex
+++ b/Chapter1/Figures/figure-results-zh-to-en-news-field-translation.tex
@@ -16,6 +16,7 @@
 \draw [->,thick] ([xshift=-0.5cm]mt.south west) -- ([xshift=-0.5cm,yshift=3.2cm]mt.south west);
 \node [anchor=north west] (x1) at ([xshift=0.0cm]human.south east) {\footnotesize{评价对象}};
 \node [anchor=north east] (y1) at ([xshift=-0.5cm,yshift=3.2cm]mt.south west) {\footnotesize{打分}};
+\node [anchor=north] (y2) at ([yshift=-0cm]y1.south) {\footnotesize{（分）}};

 \node [anchor=south west, fill=blue!50, minimum width=1.1cm, minimum height=1.5cm] (mt1) at ([xshift=13.0em,yshift=-3.0em]mt.east) {{\color{white} {\small\sffamily\bfseries{机器}}}};
 \node [anchor=south west, fill=red!50, minimum width=1.1cm, minimum height=2.7cm] (human1) at ([xshift=0.5cm]mt1.south east) {{\color{white} {\small\sffamily\bfseries{人}}}};
@@ -25,6 +26,7 @@
 \draw [->,thick] ([xshift=-0.5cm]mt1.south west) -- ([xshift=-0.5cm,yshift=3.2cm]mt1.south west);
 \node [anchor=north west] (x1) at ([xshift=0.0cm]human1.south east) {\footnotesize{评价对象}};
 \node [anchor=north east] (y1) at ([xshift=-0.5cm,yshift=3.2cm]mt1.south west) {\footnotesize{打分}};
+\node [anchor=north] (y2) at ([yshift=-0cm]y1.south) {\footnotesize{（分）}};

 \node[anchor=south](footname1) at ([xshift=2.1em,yshift=-2.0em]mt.south){\footnotesize{人工评价（五分制）}};
 \node[anchor=south](footname2) at ([xshift=2.1em,yshift=-2.0em]mt1.south){\footnotesize{自动评价（百分制）}};

--- a/Chapter12/chapter12.tex
+++ b/Chapter12/chapter12.tex
@@ -581,7 +581,7 @@ Transformer Deep（48层） & 30.2            & 43.1            & 194$\times 10^

 \begin{itemize}
 \vspace{0.5em}
-\item 近两年，有研究已经发现注意力机制可以捕捉一些语言现象\upcite{DBLP:journals/corr/abs-1905-09418}，比如，在Transformer 的多头注意力中，不同头往往会捕捉到不同的信息，比如，有些头对低频词更加敏感，有些头更适合词意消歧，甚至有些头可以捕捉句法信息。此外，由于注意力机制增加了模型的复杂性，而且随着网络层数的增多，神经机器翻译中也存在大量的冗余，因此研发轻量的注意力模型也是具有实践意义的方向\upcite{Xiao2019SharingAW,zhang-etal-2018-accelerating}（{\color{red} Weight Distillation: Transferring the Knowledge in Neural Network Parameters}）。
+\item 近两年，有研究已经发现注意力机制可以捕捉一些语言现象\upcite{DBLP:journals/corr/abs-1905-09418}，比如，在Transformer 的多头注意力中，不同头往往会捕捉到不同的信息，比如，有些头对低频词更加敏感，有些头更适合词意消歧，甚至有些头可以捕捉句法信息。此外，由于注意力机制增加了模型的复杂性，而且随着网络层数的增多，神经机器翻译中也存在大量的冗余，因此研发轻量的注意力模型也是具有实践意义的方向\upcite{Xiao2019SharingAW,zhang-etal-2018-accelerating,Lin2020WeightDT}。
 \vspace{0.5em}
 \item 神经机器翻译依赖成本较高的GPU设备，因此对模型的裁剪和加速也是很多系统研发人员所感兴趣的方向。比如，从工程上，可以考虑减少运算强度，比如使用低精度浮点数\upcite{Ott2018ScalingNM} 或者整数\upcite{DBLP:journals/corr/abs-1906-00532,Lin2020TowardsF8}进行计算，或者引入缓存机制来加速模型的推断\upcite{Vaswani2018Tensor2TensorFN}；也可以通过对模型参数矩阵的剪枝来减小整个模型的体积\upcite{DBLP:journals/corr/SeeLM16}；另一种方法是知识精炼\upcite{Hinton2015Distilling,kim-rush-2016-sequence}。 利用大模型训练小模型，这样往往可以得到比单独训练小模型更好的效果\upcite{DBLP:journals/corr/ChenLCL17}。
 \vspace{0.5em}

--- a/Chapter15/Figures/figure-dynamic-linear-aggregation-network-structure.tex
+++ b/Chapter15/Figures/figure-dynamic-linear-aggregation-network-structure.tex
+
+\begin{tikzpicture}
+\begin{scope}
+
+\node [anchor=north,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=2em,rounded corners=5pt,thick] (n1) at (0, 0) {编码端};
+
+\node [anchor=west,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=0em,rounded corners=5pt,thick] (n2) at ([xshift=3.5em,yshift=-0.5em]n1.east) {$z_0$};
+
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n3) at ([xshift=3.5em,yshift=0em]n2.east) {$z_1$};
+
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n4) at ([xshift=3.5em,yshift=0em]n3.east) {$z_2$};
+
+\node [anchor=west,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=1em,rounded corners=5pt,thick] (n6) at ([xshift=1.5em,yshift=0em]n4.east) {$\ldots$};
+
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n5) at ([xshift=3.5em,yshift=0em]n6.east) {$z_{l}$};
+
+\node [anchor=west,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=3em,fill=orange!20,rounded corners=5pt,thick] (n7) at ([xshift=1.5em,yshift=0em]n5.east) {$z_{l+1}$};
+
+\node [anchor=north,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=15em,fill=teal!17,rounded corners=5pt,thick] (n8) at ([xshift=0em,yshift=-3em]n4.south) {层正则化};
+
+\node [anchor=north,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=15em,fill=purple!17,rounded corners=5pt,thick] (n9) at ([xshift=0em,yshift=-1em]n8.south) {$L_0\ \quad L_1\ \quad L_2\quad \ldots \quad\ L_l$};
+
+\node [anchor=north,rectangle,draw, inner sep=0mm,minimum height=1.2em,minimum width=15em,fill=teal!17,rounded corners=5pt,thick] (n10) at ([xshift=0em,yshift=-2em]n9.south) {权重累加};
+
+\node [anchor=west,rectangle, inner sep=0mm,minimum height=1.2em, rounded corners=5pt,thick] (n11) at ([xshift=0em,yshift=-4.5em]n1.west) {聚合网络};
+
+\node [anchor=east,rectangle, inner sep=0mm,minimum height=1.2em,minimum width=9em,rounded corners=5pt,thick] (n12) at ([xshift=0em,yshift=-4.5em]n7.east) {};
+\node [anchor=south,rectangle, inner sep=0mm,minimum height=1em,minimum width=1em,rounded corners=5pt,thick] (n13) at ([xshift=0em,yshift=1em]n8.north) {};
+
+\begin{pgfonlayer}{background}
+{
+\node[rectangle,inner sep=2pt,fill=blue!7] [fit = (n1) (n7) (n13)] (bg1) {};
+\node[rectangle,inner sep=2pt,fill=red!7] [fit = (n10) (n8) (n11) (n12)] (bg2) {};
+}
+\end{pgfonlayer}
+
+
+\draw[->,thick] ([xshift=0.5em,yshift=-0em]n2.south)..controls +(south:2em) and +(north:2em)..([xshift=-0em,yshift=-0em]n8.north) ;
+\draw[->,thick] ([xshift=-0em,yshift=-0em]n3.south)..controls +(south:2em) and +(north:2em)..([xshift=-0em,yshift=-0em]n8.north) ;
+\draw[->,thick] ([xshift=-0em,yshift=-0em]n5.south)..controls +(south:2em) and +(north:2em)..([xshift=-0em,yshift=-0em]n8.north) ;
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.south) -- ([xshift=0em,yshift=0em]n8.north);
+
+\draw [->,thick] ([xshift=0em,yshift=0em]n8.south) -- ([xshift=0em,yshift=0em]n9.north);
+
+\draw[->,thick] ([xshift=-4.5em,yshift=-0em]n9.south)..controls +(south:0.8em) and +(north:0.8em)..([xshift=-0em,yshift=-0em]n10.north) ;
+\draw[->,thick] ([xshift=-2em,yshift=-0em]n9.south)..controls +(south:0.8em) and +(north:0.8em)..([xshift=-0em,yshift=-0em]n10.north) ;
+\draw[->,thick] ([xshift=0em,yshift=-0em]n9.south)..controls +(south:0.8em) and +(north:0.8em)..([xshift=-0em,yshift=-0em]n10.north) ;
+\draw[->,thick] ([xshift=4.5em,yshift=-0em]n9.south)..controls +(south:0.8em) and +(north:0.8em)..([xshift=-0em,yshift=-0em]n10.north) ;
+
+\draw[->,thick] ([xshift=0em,yshift=-0em]n10.east)..controls +(east:5em) and +(south:1.5em)..([xshift=-0em,yshift=-0em]n7.south) ;
+
+
+\end{scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter15/Figures/figure-expanded-residual-network.tex
+++ b/Chapter15/Figures/figure-expanded-residual-network.tex
+%%
+\begin{center}
+\begin{tikzpicture}
+\begin{scope}[scale=0.6]
+
+\node [anchor=east,fill=red!50,draw,rounded corners=3pt] (s11) at (-0.5em, 0) {\footnotesize{sublayer1}};
+\node [anchor=west,draw,circle,line width=1pt] (c11) at ([xshift=2em]s11.east) {};
+
+\node [anchor=north,fill=red!10,draw,dotted,rounded corners=3pt] (s21) at ([yshift=-3em]s11.south) {\footnotesize{sublayer1}};
+\node [anchor=west, draw,circle,dotted,line width=1pt] (c21) at ([xshift=2em]s21.east) {};
+\node [anchor=west,fill=red!10,draw,dotted,rounded corners=3pt] (s22) at ([xshift=2em]c21.east) {\footnotesize{sublayer2}};
+\node [anchor=west, draw,circle,dotted,line width=1pt] (c22) at ([xshift=2em]s22.east) {};
+
+\node [anchor=north,fill=red!50,draw,rounded corners=3pt] (s31) at ([yshift=-3em]s21.south) {\footnotesize{sublayer1}};
+\node [anchor=west,draw,circle,line width=1pt] (c31) at ([xshift=2em]s31.east) {};
+
+
+\node [anchor=north,fill=red!10,draw,dotted,rounded corners=3pt] (s41) at ([yshift=-3em]s31.south) {\footnotesize{sublayer1}};
+\node [anchor=east, draw,circle,line width=1pt] (c44) at ([xshift=-2em]s41.west) {};
+\node [anchor=west, draw,circle,dotted,line width=1pt] (c41) at ([xshift=2em]s41.east) {};
+\node [anchor=west,fill=red!10,draw,dotted,rounded corners=3pt] (s42) at ([xshift=2em]c41.east) {\footnotesize{sublayer2}};
+\node [anchor=west, draw,circle,dotted,line width=1pt] (c42) at ([xshift=2em]s42.east) {};
+\node [anchor=west,fill=red!50,draw,rounded corners=3pt] (s43) at ([xshift=2em]c42.east) {\footnotesize{sublayer3}};
+\node [anchor=west, draw,circle,line width=1pt] (c43) at ([xshift=2em]s43.east) {};
+
+\draw[-,rounded corners,line width=1pt] (c44.east) -- ([xshift=0.8em]c44.east) -- ([xshift=-1.2em,yshift=2.3em]s11.west) -- ([xshift=2.8em,yshift=2.3em]s11.east) -- (c11.north);
+\draw[-,rounded corners,line width=1pt] (c44.east) -- ([xshift=0.8em]c44.east) -- ([xshift=-1.2em]s11.west) -- (s11.west);
+\draw[-,rounded corners,line width=1pt] (s11.east) -- (c11.west);
+\draw[-,rounded corners,line width=1pt] (c11.east) -- ([xshift=11.3em]c11.east) -- (c22.north);
+
+\draw[-,rounded corners,line width=1pt,dotted] (c44.east) -- ([xshift=0.8em]c44.east) -- ([xshift=-1.2em,yshift=2.3em]s21.west) -- ([xshift=2.7em,yshift=2.3em]s21.east) -- (c21.north);
+\draw[-,rounded corners,line width=1pt,dotted] (c44.east) -- ([xshift=0.8em]c44.east) -- ([xshift=-1.2em]s21.west) -- (s21.west);
+\draw[-,rounded corners,line width=1pt,dotted] (s21.east) -- (c21.west);
+\draw[-,rounded corners,line width=1pt,dotted] (c21.east) -- (s22.west);
+\draw[-,rounded corners,line width=1pt,dotted] (s22.east) -- (c22.west);
+\draw[-,rounded corners,line width=1pt] (c22.east) -- ([xshift=11.3em]c22.east) -- (c43.north);
+
+\draw[-,rounded corners,line width=1pt] (c44.east) -- ([xshift=0.8em]c44.east) -- ([xshift=-1.2em,yshift=2.3em]s31.west) -- ([xshift=2.7em,yshift=2.3em]s31.east) -- (c31.north);
+\draw[-,rounded corners,line width=1pt] (c44.east) -- ([xshift=0.8em]c44.east) -- ([xshift=-1.2em]s31.west) -- (s31.west);
+\draw[-,rounded corners,line width=1pt] (s31.east) -- (c31.west);
+\draw[-,rounded corners,line width=1pt] (c31.east) -- ([xshift=11.3em]c31.east) -- (c42.north);
+
+\draw[-,rounded corners,line width=1pt,dotted] (c44.east) -- ([xshift=0.8em]c44.east) -- ([xshift=-1.2em,yshift=2.3em]s41.west) -- ([xshift=2.7em,yshift=2.3em]s41.east) -- (c41.north);
+\draw[-,rounded corners,line width=1pt,dotted] (c44.east) -- (s41.west);
+\draw[-,rounded corners,line width=1pt,dotted] (s41.east) -- (c41.west);
+\draw[-,rounded corners,line width=1pt,dotted] (c41.east) -- (s42.west);
+\draw[-,rounded corners,line width=1pt,dotted] (s42.east) -- (c42.west);
+\draw[-,rounded corners,line width=1pt] (c42.east) -- (s43.west);
+\draw[-,rounded corners,line width=1pt] (s43.east) -- (c43.west);
+\draw[->,rounded corners,line width=1pt] (c43.east) -- ([xshift=2em]c43.east);
+\end{scope} 
+\end{tikzpicture}
+\end{center}
+
+
--- a/Chapter15/Figures/figure-learning-rate.tex
+++ b/Chapter15/Figures/figure-learning-rate.tex
+\begin{center}
+  \begin{tikzpicture}[scale=1.0]
+    \footnotesize{
+      \begin{axis}[
+      width=.50\textwidth,
+      height=.40\textwidth,
+      legend style={at={(0.60,0.08)}, anchor=south west},
+      xlabel={\scriptsize{更新次数（10k）}},
+      ylabel={\scriptsize{学习率 （$10^{-3}$）}},
+      ylabel style={yshift=-1em},xlabel style={yshift=0.0em},
+      yticklabel style={/pgf/number format/precision=2,/pgf/number format/fixed zerofill},
+      ymin=0,ymax=2.2, ytick={0.5, 1, 1.5, 2},
+      xmin=0,xmax=5,xtick={1,2,3,4},
+      legend style={xshift=-8pt,yshift=-4pt, legend plot pos=right,font=\scriptsize,cells={anchor=west}}
+      ]
+
+      \addplot[red,line width=1.25pt] coordinates {(0,0) (1.6,2) (1.8,1.888) (2,1.787) (2.5,1.606) (3,1.462) (3.5,1.3549) (4,1.266) (4.5,1.193) (5,1.131)};
+      \addlegendentry{\scriptsize Base48}
+      %\addplot[red,line width=1.25pt] coordinates {(0,0) (8000,0.002) (10000,0.00179) (12000,0.00163) (12950,0.001572)};
+      \addplot[blue,line width=1.25pt] coordinates {(0,0) (0.8,2) (0.9906,1.7983)};
+      %\addplot[red,line width=1.25pt] coordinates {(0,0) (8000,0.002) (9906,0.0017983)};
+      \addplot[blue,dashed,line width=1.25pt] coordinates {(0.9906,1.7983) (0.9906,2)};
+      \addplot[blue,line width=1.25pt] coordinates {(0.9906,2) (1.1906,1.79) (1.3906,1.63) (1.4856,1.572)};
+      \addplot[blue,dashed,line width=1.25pt] coordinates {(1.4856,1.572) (1.4856,2)};
+      \addplot[blue,line width=1.25pt] coordinates {(1.4856,2) (1.6856,1.79) (1.8856,1.63) (1.9806,1.572)};
+      \addplot[blue,dashed,line width=1.25pt] coordinates {(1.9806,1.572) (1.9806,2)};
+      \addplot[blue,line width=1.25pt] coordinates {(1.9806,2) (2.1806,1.79) (2.3806,1.63) (2.4756,1.572)};
+      \addplot[blue,dashed,line width=1.25pt] coordinates {(2.4756,1.572) (2.4756,2)};
+      \addplot[blue,line width=1.25pt] coordinates {(2.4756,2) (2.6756,1.79) (2.8756,1.63) (2.9706,1.572)};
+      \addplot[blue,dashed,line width=1.25pt] coordinates {(2.9706,1.572) (2.9706,2)};
+      \addplot[blue,line width=1.25pt] coordinates {(2.9706,2) (3.1706,1.79) (3.3706,1.63) (3.4656,1.572) (3.6706,1.4602) (3.7136,1.44)};
+      \addplot[blue,dashed,line width=1.25pt] coordinates {(3.7136,1.44) (3.7136,2)};
+      \addplot[blue,line width=1.25pt] coordinates {(3.7136,2) (3.9136,1.79) (4.1136,1.63) (4.2086,1.572) (4.4136,1.4602) (4.4566,1.44) (4.7000,1.3574) (5.0000,1.2531)};
+      \addlegendentry{\scriptsize SDT48}
+
+      \end{axis}
+     }
+  \end{tikzpicture}
+\end{center}
\ No newline at end of file
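
The two curves in figure-learning-rate.tex above can be reproduced numerically. The following is a minimal sketch that assumes what the plotted coordinates suggest: a linear warmup to a peak of 2e-3 followed by inverse-square-root decay, with warmup over 16000 updates for Base48 and 8000 for SDT48, and with SDT48 jumping back to the peak at each restart. The function names and the restart rule are assumptions read off the coordinates, not taken from the book's training code.

```python
import math

def inverse_sqrt_lr(step, peak_lr=2e-3, warmup_steps=16000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step <= 0:
        return 0.0
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

def restarted_lr(step, restarts, peak_lr=2e-3, warmup_steps=8000):
    """Same schedule, but the decay restarts from peak_lr at each restart step.

    Assumption: after a restart at step r, the rate behaves as if training
    were at step warmup_steps + (step - r) of the plain schedule.
    """
    past = [r for r in restarts if r <= step]
    if not past:
        return inverse_sqrt_lr(step, peak_lr, warmup_steps)
    elapsed = step - max(past)
    return peak_lr * math.sqrt(warmup_steps / (warmup_steps + elapsed))

if __name__ == "__main__":
    # Restart positions read from the dashed segments of the blue curve.
    sdt_restarts = [9906, 14856, 19806, 24756, 29706, 37136]
    for s in (8000, 9906, 11906, 16000, 50000):
        print(s, inverse_sqrt_lr(s), restarted_lr(s, sdt_restarts))
```

With these assumptions the values agree with the plotted points to about three significant digits, for example roughly 1.131e-3 for Base48 at 50000 updates and roughly 1.79e-3 for SDT48 at 2000 updates after a restart.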
--- a/Chapter15/Figures/figure-post-norm-vs-pre-norm.tex
+++ b/Chapter15/Figures/figure-post-norm-vs-pre-norm.tex
+%%%------------------------------------------------------------------------------------------------------------
+%%% Post-Norm vs. Pre-Norm residual connections
+\begin{center}
+\begin{tikzpicture}
+
+\begin{scope}[minimum height = 20pt]
+
+\node [anchor=east] (x1) at (-0.5em, 0) {$x_l$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (F1) at ([xshift=2em]x1.east){\small{$\mathcal{F}$}};
+\node [anchor=west,circle,draw,minimum size=1em] (n1) at ([xshift=2em]F1.east) {};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (ln1) at ([xshift=2em]n1.east){\small{\textrm{LN}}};
+\node [anchor=west] (x2) at ([xshift=2em]ln1.east) {$x_{l+1}$};
+
+\node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$x_l$};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (F2) at ([xshift=2em]x3.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln2) at ([xshift=2em]F2.east){\small{$\mathcal{F}$}};
+\node [anchor=west,circle,draw,minimum size=1em] (n2) at ([xshift=2em]ln2.east){};
+\node [anchor=west] (x4) at ([xshift=2em]n2.east) {$x_{l+1}$};
+
+\draw[->, line width=1pt] ([xshift=-0.1em]x1.east)--(F1.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]F1.east)--(n1.west);
+\draw[->, line width=1pt] (n1.east)--node[above]{$y_l$}(ln1.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln1.east)--(x2.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]x3.east)--(F2.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]F2.east)--(ln2.west);
+\draw[->, line width=1pt] ([xshift=0.1em]ln2.east)--node[above]{$y_l$}(n2.west);
+\draw[->, line width=1pt] (n2.east)--(x4.west);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x1.north) -- ([yshift=1em]x1.north) -- ([yshift=1.4em]n1.north) -- (n1.north);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x3.north) -- ([yshift=1em]x3.north) -- ([yshift=1.4em]n2.north) -- (n2.north);
+\draw[-] (n1.west)--(n1.east);
+\draw[-] (n1.north)--(n1.south);
+\draw[-] (n2.west)--(n2.east);
+\draw[-] (n2.north)--(n2.south);
+
+\node [anchor=south] (k1) at ([yshift=-0.1em]x1.north){};
+\node [anchor=south] (k2) at ([yshift=-0.1em]x3.north){};
+\begin{pgfonlayer}{background}
+\node [rectangle,inner sep=0.3em,fill=orange!10] [fit = (x1) (F1) (n1) (ln1) (x2) (k1)] (box0) {};
+\node [rectangle,inner sep=0.3em,fill=blue!10] [fit = (x3) (F2) (n2) (ln2) (x4) (k2)] (box1) {};
+\end{pgfonlayer}
+
+\node [anchor=north] (c1) at (box0.south){\footnotesize {(a)后作方式的残差连接}};
+\node [anchor=north] (c2) at (box1.south){\footnotesize {(b)前作方式的残差连接}};
+\end{scope}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
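
The two panels of figure-post-norm-vs-pre-norm.tex above can be written as equations. This is a sketch read directly off the arrows in the figure, using the same symbols ($x_l$, $y_l$, $\mathcal{F}$, LN); it assumes `amsmath` is loaded for `gather` and `\text`.

```latex
\begin{gather}
\text{(a) Post-Norm:}\quad y_l = x_l + \mathcal{F}(x_l), \qquad x_{l+1} = \mathrm{LN}(y_l) \\
\text{(b) Pre-Norm:}\quad y_l = \mathcal{F}(\mathrm{LN}(x_l)), \qquad x_{l+1} = x_l + y_l
\end{gather}
```

In (b) the residual path from $x_l$ to $x_{l+1}$ bypasses layer normalization entirely, which is the structural difference the figure highlights.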
--- a/Chapter15/Figures/figure-progressive-training.tex
+++ b/Chapter15/Figures/figure-progressive-training.tex
+%%%------------------------------------------------------------------------------------------------------------
+
+\begin{center}
+\begin{tikzpicture}
+\begin{scope}
+
+\node [anchor=east,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s11) at (-0.5em, 0) {\footnotesize{$\times h$}};
+\node [rectangle,anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s12) at ([xshift=1.2em]s11.east) {};
+
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s21) at ([yshift=-1.2em]s11.south) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s22) at ([xshift=1.2em]s21.east) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s23) at ([xshift=1.2em]s22.east) {};
+
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s31) at ([yshift=-1.2em]s21.south) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s32) at ([xshift=1.2em]s31.east) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s33) at ([xshift=1.2em]s32.east) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s34) at ([xshift=1.2em]s33.east) {};
+
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s41) at ([yshift=-1.2em]s31.south) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s42) at ([xshift=1.2em]s41.east) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s43) at ([xshift=1.2em]s42.east) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=orange!20,draw=red,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em,dashed] (s44) at ([xshift=1.2em]s43.east) {\footnotesize{$\times h$}};
+\node [anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.6em,minimum width=1.6em] (s45) at ([xshift=1.2em]s44.east) {};
+
+\node [anchor=east] (p1) at ([xshift=-2em]s11.west) {\footnotesize{step 1}};
+\node [anchor=east] (p2) at ([xshift=-2em]s21.west) {\footnotesize{step 2}};
+\node [anchor=east] (p3) at ([xshift=-2em]s31.west) {\footnotesize{step 3}};
+\node [anchor=east] (p4) at ([xshift=-2em]s41.west) {\footnotesize{step 4}};
+
+\node [anchor=south,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (b1) at ([xshift=-0.2em,yshift=1.6em]p1.north) {};
+\node [anchor=west] (b2) at (b1.east) {\footnotesize{：编码器}};
+\node [anchor=west,fill=blue!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (b3) at ([xshift=1em]b2.east) {};
+\node [anchor=west] (b4) at (b3.east) {\footnotesize{：解码器}};
+\node [anchor=west] (b5) at ([xshift=2em]b4.east) {\footnotesize{：拷贝}};
+\draw[-latex,thick,red,dashed] ([xshift=0.5em]b4.east) -- (b5.west);
+
+\draw [-latex, line width=0.8pt] ([xshift=-1.2em]s11.west) -- (s11.west);
+\draw [-latex, line width=0.8pt] (s11.east) -- (s12.west);
+\draw [-latex, line width=0.8pt] (s12.east) -- ([xshift=1.2em]s12.east);
+
+\draw [-latex, line width=0.8pt] ([xshift=-1.2em]s21.west) -- (s21.west);
+\draw [-latex, line width=0.8pt] (s21.east) -- (s22.west);
+\draw [-latex, line width=0.8pt] (s22.east) -- (s23.west);
+\draw [-latex, line width=0.8pt] (s23.east) -- ([xshift=1.2em]s23.east);
+
+\draw [-latex, line width=0.8pt] ([xshift=-1.2em]s31.west) -- (s31.west);
+\draw [-latex, line width=0.8pt] (s31.east) -- (s32.west);
+\draw [-latex, line width=0.8pt] (s32.east) -- (s33.west);
+\draw [-latex, line width=0.8pt] (s33.east) -- (s34.west);
+\draw [-latex, line width=0.8pt] (s34.east) -- ([xshift=1.2em]s34.east);
+
+\draw [-latex, line width=0.8pt] ([xshift=-1.2em]s41.west) -- (s41.west);
+\draw [-latex, line width=0.8pt] (s41.east) -- (s42.west);
+\draw [-latex, line width=0.8pt] (s42.east) -- (s43.west);
+\draw [-latex, line width=0.8pt] (s43.east) -- (s44.west);
+\draw [-latex, line width=0.8pt] (s44.east) -- (s45.west);
+\draw [-latex, line width=0.8pt] (s45.east) -- ([xshift=1.2em]s45.east);
+
+\draw[-latex,thick,red,dashed] (s11.south)..controls +(south:1em) and +(north:1.2em)..(s22.north);
+\draw[-latex,thick,red,dashed] (s22.south)..controls +(south:1em) and +(north:1.2em)..(s33.north);
+\draw[-latex,thick,red,dashed] (s33.south)..controls +(south:1em) and +(north:1.2em)..(s44.north);
+\end{scope} 
+\end{tikzpicture}
+\end{center}
+
+
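
The growth procedure drawn in figure-progressive-training.tex above can be sketched in a few lines. This is a minimal illustration that assumes each step appends a copy of the current top block of h encoder layers, as the dashed red "copy" arrows suggest. The helper `grow_encoder` and the dictionary-based toy layers are hypothetical and stand in for real model parameters.

```python
import copy

def grow_encoder(encoder_layers, h):
    """Return a deeper encoder: the existing layers plus a copy of the top h layers."""
    if h <= 0 or h > len(encoder_layers):
        raise ValueError("h must be between 1 and the current depth")
    # Deep copy so the new block starts from the same values but trains independently.
    new_block = copy.deepcopy(encoder_layers[-h:])
    return encoder_layers + new_block

h = 2
# Step 1: a shallow encoder with h layers (toy parameters for illustration).
encoder = [{"name": f"layer{i}", "w": [0.1 * i]} for i in range(h)]
depths = [len(encoder)]
for step in range(2, 5):
    # In practice the model would be trained for a while before each growth step.
    encoder = grow_encoder(encoder, h)
    depths.append(len(encoder))

print(depths)
```

The decoder is left unchanged here, matching the single blue box in every row of the figure.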
--- a/Chapter15/Figures/figure-sparse-connections-between-different-groups.tex
+++ b/Chapter15/Figures/figure-sparse-connections-between-different-groups.tex
+%%%------------------------------------------------------------------------------------------------------------
+\begin{center}
+\begin{tikzpicture}
+\begin{scope}
+
+\node [anchor=east,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s11) at (-0.5em, 0) {};
+\node [rectangle,anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s12) at ([xshift=2em]s11.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s13) at ([xshift=2em]s12.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s14) at ([xshift=2em]s13.east) {};
+
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s21) at ([yshift=-2.5em]s11.south) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s22) at ([xshift=2em]s21.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s23) at ([xshift=2em]s22.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s24) at ([xshift=2em]s23.east) {};
+
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s31) at ([yshift=-2.5em]s21.south) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s32) at ([xshift=2em]s31.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s33) at ([xshift=2em]s32.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s34) at ([xshift=2em]s33.east) {};
+
+\node [anchor=north,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s41) at ([yshift=-2.5em]s31.south) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s42) at ([xshift=2em]s41.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s43) at ([xshift=2em]s42.east) {};
+\node [anchor=west,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (s44) at ([xshift=2em]s43.east) {};
+
+\node [anchor=east] (p1) at ([xshift=-3.5em]s11.west) {$p=\infty$};
+\node [anchor=east] (p2) at ([xshift=-4em]s21.west) {$p=1$};
+\node [anchor=east] (p3) at ([xshift=-4em]s31.west) {$p=2$};
+\node [anchor=east] (p4) at ([xshift=-4em]s41.west) {$p=4$};
+\node [anchor=north] (p5) at ([yshift=-1em]p3.south) {$\cdots$};
+\node [anchor=south,fill=orange!20,draw,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em] (b1) at ([xshift=-0.6em,yshift=1.2em]p1.north) {};
+\node [anchor=west] (b2) at (b1.east) {\footnotesize{:Layer}};
+\node [anchor=west,draw=red,rounded corners=3pt,minimum height=1.4em,minimum width=1.4em,dashed,line width=0.8pt] (b3) at ([xshift=1em]b2.east) {};
+\node [anchor=west] (b4) at (b3.east) {\footnotesize{:Block}};
+
+\draw [-latex, line width=0.8pt] ([xshift=-2em]s11.west) -- (s11.west);
+\draw [-latex, line width=0.8pt] (s11.east) -- (s12.west);
+\draw [-latex, line width=0.8pt] (s12.east) -- (s13.west);
+\draw [-latex, line width=0.8pt] (s13.east) -- (s14.west);
+\draw [-latex, line width=0.8pt] (s14.east) -- ([xshift=2em]s14.east);
+
+\draw [-latex, line width=0.8pt] ([xshift=-2em]s21.west) -- (s21.west);
+\draw [-latex, line width=0.8pt] (s21.east) -- (s22.west);
+\draw [-latex, line width=0.8pt] (s22.east) -- (s23.west);
+\draw [-latex, line width=0.8pt] (s23.east) -- (s24.west);
+\draw [-latex, line width=0.8pt] (s24.east) -- ([xshift=2em]s24.east);
+
+\draw [-latex, line width=0.8pt] ([xshift=-2em]s31.west) -- (s31.west);
+\draw [-latex, line width=0.8pt] (s31.east) -- (s32.west);
+\draw [-latex, line width=0.8pt] (s32.east) -- (s33.west);
+\draw [-latex, line width=0.8pt] (s33.east) -- (s34.west);
+\draw [-latex, line width=0.8pt] (s34.east) -- ([xshift=2em]s34.east);
+
+\draw [-latex, line width=0.8pt] ([xshift=-2em]s41.west) -- (s41.west);
+\draw [-latex, line width=0.8pt] (s41.east) -- (s42.west);
+\draw [-latex, line width=0.8pt] (s42.east) -- (s43.west);
+\draw [-latex, line width=0.8pt] (s43.east) -- (s44.west);
+\draw [-latex, line width=0.8pt] (s44.east) -- ([xshift=2em]s44.east);
+
+
+\node [draw=red,rounded corners=3pt,minimum height=1.7em,minimum width=1.7em,dashed,line width=0.8pt] (x21) at (s21) {};
+\node [draw=red,rounded corners=3pt,minimum height=1.7em,minimum width=1.7em,dashed,line width=0.8pt] (x22) at (s22) {};
+\node [draw=red,rounded corners=3pt,minimum height=1.7em,minimum width=1.7em,dashed,line width=0.8pt] (x23) at (s23) {};
+\node [draw=red,rounded corners=3pt,minimum height=1.7em,minimum width=1.7em,dashed,line width=0.8pt] (x24) at (s24) {};
+
+\node [draw=red,rounded corners=3pt,minimum height=1.7em,minimum width=5.2em,dashed,line width=0.8pt] (x31) at ([xshift=1.75em]s31) {};
+\node [draw=red,rounded corners=3pt,minimum height=1.7em,minimum width=5.2em,dashed,line width=0.8pt] (x32) at ([xshift=1.75em]s33) {};
+
+\node [draw=red,rounded corners=3pt,minimum height=1.7em,minimum width=12.2em,dashed,line width=0.8pt] (x41) at ([xshift=1.75em]s42) {};
+
+{
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s21.west).. controls +(58:0.6) and +(122:0.6) .. ([xshift=1em]s21.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s22.west).. controls +(58:0.6) and +(122:0.6) .. ([xshift=1em]s22.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s23.west).. controls +(58:0.6) and +(122:0.6) .. ([xshift=1em]s23.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s24.west).. controls +(58:0.6) and +(122:0.6) .. ([xshift=1em]s24.east);
+}
+
+{
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s21.west).. controls +(65:0.8) and +(115:0.8) .. ([xshift=1em]s22.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s22.west).. controls +(65:0.8) and +(115:0.8) .. ([xshift=1em]s23.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s23.west).. controls +(65:0.8) and +(115:0.8) .. ([xshift=1em]s24.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s31.west).. controls +(65:0.8) and +(115:0.8) .. ([xshift=1em]s32.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s33.west).. controls +(65:0.8) and +(115:0.8) .. ([xshift=1em]s34.east);
+}
+
+{
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s21.west).. controls +(70:1.0) and +(110:1.0) .. ([xshift=1em]s23.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s22.west).. controls +(70:1.0) and +(110:1.0) .. ([xshift=1em]s24.east);
+}
+
+{
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s21.west).. controls +(75:1.2) and +(105:1.2) .. ([xshift=1em]s24.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s31.west).. controls +(75:1.2) and +(105:1.2) .. ([xshift=1em]s34.east);
+\draw [-latex, line width=0.8pt] ([xshift=-1em]s41.west).. controls +(75:1.2) and +(105:1.2) .. ([xshift=1em]s44.east);
+}
+\end{scope} 
+\end{tikzpicture}
+\end{center}
+
+
--- a/Chapter15/Figures/figure-sublayer-skip.tex
+++ b/Chapter15/Figures/figure-sublayer-skip.tex
+%%%------------------------------------------------------------------------------------------------------------
+%%% Standard Pre-Norm vs. Pre-Norm with stochastic sublayer skipping
+\begin{center}
+\begin{tikzpicture}
+
+\begin{scope}[minimum height = 20pt]
+
+\node [anchor=east] (x1) at (-0.5em, 0) {$x_l$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln1) at ([xshift=1em]x1.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f1) at ([xshift=0.6em]ln1.east){\small{$\mathcal{F}$}};
+\node [anchor=west,circle,draw,minimum size=1em] (n1) at ([xshift=3em]f1.east){};
+\node [anchor=west] (x2) at ([xshift=1em]n1.east) {$x_{l+1}$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln12) at ([xshift=1em]x2.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f12) at ([xshift=0.6em]ln12.east){\small{$\mathcal{F}$}};
+\node [anchor=west,circle,draw,minimum size=1em] (n12) at ([xshift=3em]f12.east){};
+\node [anchor=west] (x22) at ([xshift=1em]n12.east) {$x_{l+2}$};
+
+\node [anchor=north] (x3) at ([yshift=-5em]x1.south) {$x_l$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln2) at ([xshift=1em]x3.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f2) at ([xshift=0.6em]ln2.east){\small{$\mathcal{F}$}};
+\node [anchor=west,minimum size=1em] (p1) at ([xshift=1em]f2.east){};
+\node [anchor=north] (m1) at ([yshift=0.6em]p1.south){\tiny{\red{$M=1$}}};
+\node [anchor=west,circle,draw,minimum size=1em] (n2) at ([xshift=3em]f2.east){};
+\node [anchor=west] (x4) at ([xshift=1em]n2.east) {$x_{l+1}$};
+\node [anchor=west,draw,fill=red!20,inner xsep=5pt,rounded corners=2pt] (ln22) at ([xshift=1em]x4.east){\small{\textrm{LN}}};
+\node [anchor=west,draw,fill=green!20,inner xsep=5pt,rounded corners=2pt] (f22) at ([xshift=0.6em]ln22.east){\small{$\mathcal{F}$}};
+\node [anchor=west,minimum size=1em] (p2) at ([xshift=1em]f22.east){};
+\node [anchor=north] (m2) at ([yshift=0.6em]p2.south){\tiny{\red{$M=0$}}};
+\node [anchor=west,circle,draw,minimum size=1em] (n22) at ([xshift=3em]f22.east){};
+\node [anchor=west] (x42) at ([xshift=1em]n22.east) {$x_{l+2}$};
+
+\draw[->, line width=1pt] ([xshift=-0.1em]x1.east)--(ln1.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln1.east)--(f1.west);
+\draw[->, line width=1pt] ([xshift=0.1em]f1.east)--(n1.west);
+\draw[->, line width=1pt] (n1.east)--(x2.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]x3.east)--(ln2.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln2.east)--(f2.west);
+\draw[-, line width=1pt] ([xshift=0.1em]f2.east)--(p1.west);
+\draw[*-,red,line width=0.6pt] (p1.west) -- (p1.east);
+\draw[->, line width=1pt] (p1.east)--(n2.west);
+\draw[->, line width=1pt] (n2.east)--(x4.west);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x1.north) -- ([yshift=1em]x1.north) -- ([yshift=1.4em]n1.north) -- (n1.north);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x3.north) -- ([yshift=1em]x3.north) -- ([yshift=1.4em]n2.north) -- (n2.north);
+\draw[-] (n1.west)--(n1.east);
+\draw[-] (n1.north)--(n1.south);
+\draw[-] (n2.west)--(n2.east);
+\draw[-] (n2.north)--(n2.south);
+
+\draw[->, line width=1pt] ([xshift=-0.1em]x2.east)--(ln12.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln12.east)--(f12.west);
+\draw[->, line width=1pt] ([xshift=0.1em]f12.east)--(n12.west);
+\draw[->, line width=1pt] (n12.east)--(x22.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]x4.east)--(ln22.west);
+\draw[->, line width=1pt] ([xshift=-0.1em]ln22.east)--(f22.west);
+\draw[-, line width=1pt] ([xshift=0.1em]f22.east)--(p2.west);
+\draw[*-,red,line width=0.6pt] ([yshift=-0.1em]p2.west) -- (p2.north east);
+\draw[->, line width=1pt] (p2.east)--(n22.west);
+\draw[->, line width=1pt] (n22.east)--(x42.west);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x2.north) -- ([yshift=1em]x2.north) -- ([yshift=1.4em]n12.north) -- (n12.north);
+\draw[->,rounded corners,line width=1pt] ([yshift=-0.2em]x4.north) -- ([yshift=1em]x4.north) -- ([yshift=1.4em]n22.north) -- (n22.north);
+\draw[-] (n12.west)--(n12.east);
+\draw[-] (n12.north)--(n12.south);
+\draw[-] (n22.west)--(n22.east);
+\draw[-] (n22.north)--(n22.south);
+
+\node [anchor=south] (k1) at ([yshift=-0.1em]x1.north){};
+\node [anchor=south] (k2) at ([yshift=-0.1em]x3.north){};
+\begin{pgfonlayer}{background}
+\node [rectangle,inner sep=0.3em,fill=orange!10] [fit = (x1) (f1) (n1) (ln1) (x2) (k1) (f12) (n12) (ln12) (x22)] (box0) {};
+\node [rectangle,inner sep=0.3em,fill=blue!10] [fit = (x3) (f2) (n2) (ln2) (x4) (k2) (f22) (n22) (ln22) (x42)] (box1) {};
+\end{pgfonlayer}
+
+\node [anchor=north] (c1) at (box0.south){\footnotesize {(a) Standard Pre-Norm}};
+\node [anchor=north] (c2) at (box1.south){\footnotesize {(b) Pre-Norm with random sublayer skipping}};
+\end{scope}
+\end{tikzpicture}
+\end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-wmt16.tex
+++ b/Chapter15/Figures/figure-wmt16.tex
+\definecolor{ublue}{rgb}{0.152,0.250,0.545}
+\begin{tikzpicture}
+\begin{axis}
+[  
+  width=5cm, height=3.5cm, 
+  xtick={15,17,19,21,23,25},
+  ytick={6.0,6.5,7.0},
+  xlabel={\scriptsize Epoch},
+  ylabel={},
+  ylabel style={},
+  x tick label style={},
+  y tick label style={},
+  tick align=inside,
+  legend style={anchor=north,xshift=1.7cm,yshift=1cm,legend  columns =-1},
+  ymin=5.7,
+  ymax=7.3,
+  xmin=14.6,
+  xmax=25.4,
+  extra y ticks={6.0,6.5,7.0},
+  extra y tick labels={3.7,3.8,3.9},
+  extra y tick style={ticklabel pos=right}]
+  \addplot [sharp plot,very thick,red!60,mark=diamond*] coordinates{(15,6.75) (16,6.73) (17,6.70) (18,6.67) (19,6.64) (20,6.61) (21,6.59) (22,6.58) (23,6.57) (24,6.58) (25,6.59)};
+  \addplot [sharp plot,very thick,purple!60,mark=triangle*] coordinates{(15,6.70) (16,6.4) (17,6.20) (18,6.30) (19,6.20) (20,6.10) (21,6.15) (22,6.10) (23,6.15) (24,6.16) (25,6.17)};
+  \legend{\scriptsize {Training set},\scriptsize{Validation set}}
+\end{axis}
+\begin{axis}
+[ xshift=6.6cm,
+  width=5cm, height=3.5cm, 
+  xtick={15,17,19,21,23,25},
+  ytick={5.0,5.5,6.0},
+  xlabel={\scriptsize Epoch},
+  ylabel={},
+  ylabel style={},
+  x tick label style={},
+  y tick label style={},
+  tick align=inside,
+  ymin=4.7,
+  ymax=6.3,
+  xmin=14.6,
+  xmax=25.4,
+  extra y ticks={5.0,5.5,6.0},
+  extra y tick labels={3.5,3.6,3.7},
+  extra y tick style={ticklabel pos=right}]
+  \addplot [sharp plot,very thick,red!60,mark=diamond*] coordinates{(15,5.7) (16,5.65) (17,5.6) (18,5.55) (19,5.5) (20,5.45) (21,5.4) (22,5.38) (23,5.36) (24,5.34) (25,5.27)};
+  \addplot [sharp plot,very thick,purple!60,mark=triangle*] coordinates{(15,5.0) (16,4.9) (17,4.9) (18,5.05) (19,4.9) (20,5.0) (21,5.0) (22,5.1) (23,5.0) (24,5.15) (25,5.5)};
+\end{axis}
+\node [anchor=north,rotate=90] (n1) at (-1.3cm,1cm) {\scriptsize Training PPL};
+\node [anchor=north,rotate=90] (n2) at (5.4cm,1cm) {\scriptsize Training PPL};
+\node [anchor=north,rotate=90] (n3) at (4.2cm,1cm) {\scriptsize Validation PPL};
+\node [anchor=north,rotate=90] (n4) at (10.7cm,1cm) {\scriptsize Validation PPL};
+\end{tikzpicture}
+
+%---------------------------------------------------------------------
\ No newline at end of file
--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
@@ -27,4 +27,267 @@
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

-\section{}
+\section{Deep Networks}
+
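+\parinterval {\chapterthirteen} has already pointed out that increasing the depth of a neural network helps represent sentences more fully and also increases model capacity. However, simply stacking many Transformer layers does not improve performance; instead, it leads to more severe vanishing/exploding gradient problems. This is because, as the network becomes deeper, gradients cannot be propagated effectively from the output layer back to the bottom layers, so the parameters in the lower part of the network are not trained sufficiently\upcite{Wang2019LearningDT,DBLP:conf/cvpr/YuYR18}. Researchers have begun to address these problems and have obtained good results, for example by designing network connections that better support information flow in deep models and by using appropriate parameter initialization methods\upcite{Bapna2018TrainingDN,Wang2019LearningDT,DBLP:conf/emnlp/ZhangTS19}.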
+
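+\parinterval Nevertheless, how to design a sufficiently ``deep'' machine translation model remains a topic of active interest in the field. In addition, as networks continue to grow deeper, new problems arise, such as how to accelerate the training of deep networks and how to deal with overfitting in deep networks. These problems are discussed below.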
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Post-Norm vs Pre-Norm}
+
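+\parinterval To understand why deep Transformer models are difficult to train directly, first briefly review the Transformer architecture. Taking the Transformer encoder as an example, the model uses residual connections and layer normalization around the multi-head self-attention network and the feed-forward network to improve the efficiency of information flow. Transformer models broadly fall into the two structures shown in Figure \ref{fig:15-1}: the post-normalization residual unit (Post-Norm) and the pre-normalization residual unit (Pre-Norm).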
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-post-norm-vs-pre-norm}
+\caption{Post-Norm Transformer vs Pre-Norm Transformer}
+\label{fig:15-1}
+\end{figure}
+%-------------------------------------------
+
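+\parinterval Let $x_l$ and $x_{l+1}$ denote the input and output of the $l$-th sublayer\footnote[1]{Following the definition used in the Transformer, each layer contains several sublayers. For example, each layer of the Transformer encoder contains a self-attention sublayer and a feed-forward sublayer. Layer normalization and a residual connection are applied to every sublayer.}, and let $y_l$ denote the intermediate output. $\textrm{LN}(\cdot)$ denotes layer normalization\upcite{Ba2016LayerN}, which helps reduce the variance of the sublayer output distribution and thus makes training more stable. $\mathcal{F}(\cdot)$ denotes the function computed by the sublayer, such as a feed-forward network or a self-attention network. Post-Norm and Pre-Norm are briefly described below.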
+\begin{itemize}
+\vspace{0.5em}
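+\item Post-Norm: the original Transformer follows the Post-Norm structure\upcite{vaswani2017attention}. That is, layer normalization is applied to the residual sum of each sublayer's input and output, as shown in Figure \ref{fig:15-1}(a). It can be written as: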
+\begin{eqnarray}
+x_{l+1}=\textrm{LN}(x_l+\mathcal{F}(x_l;\theta_l))
+\label{eq:15-1}
+\end{eqnarray}
+where $\theta_l$ denotes the parameters of sublayer $l$.
+\vspace{0.5em}
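+\item Pre-Norm: moving layer normalization so that it is applied to the input of each sublayer gives the Pre-Norm structure, as shown in Figure \ref{fig:15-1}(b). The idea is the same as that of He et al.\upcite{DBLP:conf/eccv/HeZRS16} and is widely used in recent open-source Transformer systems\upcite{Vaswani2018Tensor2TensorFN,Ottfairseq,KleinOpenNMT}. The formula is: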
+\begin{eqnarray}
+x_{l+1}=x_l+\mathcal{F}(\textrm{LN}(x_l);\theta_l)
+\label{eq:15-2}
+\end{eqnarray}
+\end{itemize}
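+
+\parinterval To make the two residual units concrete, the following minimal sketch implements Equation \eqref{eq:15-1} and Equation \eqref{eq:15-2} in plain Python. It is only an illustration under stated assumptions: \texttt{layer\_norm} omits the learned gain and bias, and \texttt{toy\_sublayer} is a hypothetical stand-in for $\mathcal{F}$ rather than a real attention or feed-forward network.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm(x, f):
    """Post-Norm unit: x_{l+1} = LN(x_l + F(x_l))."""
    return layer_norm([a + b for a, b in zip(x, f(x))])


def pre_norm(x, f):
    """Pre-Norm unit: x_{l+1} = x_l + F(LN(x_l))."""
    return [a + b for a, b in zip(x, f(layer_norm(x)))]


def toy_sublayer(x):
    # Hypothetical stand-in for F: a fixed element-wise scaling.
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
y_post = post_norm(x, toy_sublayer)  # normalized after the residual sum
y_pre = pre_norm(x, toy_sublayer)    # input x is added back without normalization
```

+\parinterval With a sublayer that outputs zeros, \texttt{pre\_norm} returns its input unchanged, which is the identity path provided by the residual connection, whereas \texttt{post\_norm} still re-normalizes the sum.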
+
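+\parinterval These formulas show that the Pre-Norm structure exposes the output of lower layers directly to upper layers through the residual path. From the perspective of back-propagation, Pre-Norm also makes it easier for the gradient at the top to reach the bottom layers. Consider a structure with $L$ sublayers. Let $Loss$ denote the loss on the output of the whole network and $x_L$ the output of the top sublayer. For the Post-Norm structure, by the chain rule, the gradient of $Loss$ with respect to $x_l$ can be written as: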
+\begin{eqnarray}
+\frac{\partial Loss}{\partial x_l}=\frac{\partial Loss}{\partial x_L} \times \prod_{k=l}^{L-1}\frac{\partial \textrm{LN}(y_k)}{\partial y_k} \times \prod_{k=l}^{L-1}(1+\frac{\partial \mathcal{F}(x_k;\theta_k)}{\partial x_k})
+\label{eq:15-3}
+\end{eqnarray}
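+where $y_k=x_k+\mathcal{F}(x_k;\theta_k)$ is the intermediate output before layer normalization, $\prod_{k=l}^{L-1}\frac{\partial \textrm{LN}(y_k)}{\partial y_k}$ is the derivative of the composite function contributed by the layer normalization operations during back-propagation, and $\prod_{k=l}^{L-1}(1+\frac{\partial \mathcal{F}(x_k;\theta_k)}{\partial x_k})$ is the derivative contributed by the residual connections of the sublayers.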
+
+\parinterval Similarly, the gradient for the Pre-Norm structure is:
+\begin{eqnarray}
+\frac{\partial Loss}{\partial x_l}=\frac{\partial Loss}{\partial x_L} \times (1+\sum_{k=l}^{L-1}\frac{\partial \mathcal{F}(\textrm{LN}(x_k);\theta_k)}{\partial x_l})
+\label{eq:15-4}
+\end{eqnarray}
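+
+\parinterval As a sketch of where Equation \eqref{eq:15-4} comes from, assume that Equation \eqref{eq:15-2} is applied repeatedly from sublayer $l$ up to the top. Unrolling the recursion gives
+\begin{eqnarray}
+x_L=x_l+\sum_{k=l}^{L-1}\mathcal{F}(\textrm{LN}(x_k);\theta_k) \nonumber
+\end{eqnarray}
+Differentiating both sides with respect to $x_l$ and applying the chain rule then yields
+\begin{eqnarray}
+\frac{\partial Loss}{\partial x_l}=\frac{\partial Loss}{\partial x_L} \times \frac{\partial x_L}{\partial x_l}=\frac{\partial Loss}{\partial x_L} \times (1+\sum_{k=l}^{L-1}\frac{\partial \mathcal{F}(\textrm{LN}(x_k);\theta_k)}{\partial x_l}) \nonumber
+\end{eqnarray}
+The constant term $1$ comes from the identity path $x_l$ in the unrolled sum; it is this term that carries $\frac{\partial Loss}{\partial x_L}$ to sublayer $l$ without passing through any other sublayer.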
+
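+\parinterval Comparing Equation \eqref{eq:15-3} with Equation \eqref{eq:15-4} shows that the Pre-Norm structure passes the top-level gradient $\frac{\partial Loss}{\partial x_L}$ directly to lower layers; that is, $\frac{\partial Loss}{\partial x_l}$ contains $\frac{\partial Loss}{\partial x_L}$ as a direct term. This property weakens the dependence of the gradient computation on the model depth $L$. In contrast, as the right-hand side of Equation \eqref{eq:15-3} shows, the Post-Norm structure yields a product of derivatives whose number of factors grows with $L$, so vanishing and exploding gradients become more likely as $L$ increases. Pre-Norm is therefore better suited to stacking many layers. For example, with Pre-Norm it is easy to train a Transformer encoder with 30 layers (60 sublayers) and obtain a considerable BLEU improvement; this is 5 times the depth of the standard 6-layer Transformer encoder\upcite{Wang2019LearningDT}. By contrast, training such a deep network with the Post-Norm structure is very unstable and sometimes fails altogether. Here, a deep Transformer that uses Pre-Norm is called Transformer-Deep.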
+
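+\parinterval Another interesting finding is that with a deep network, the time the model needs to converge is greatly reduced. Compared with wide networks such as Transformer-Big, Transformer-Deep does not need a large hidden size to reach comparable or even better translation quality. In other words, Transformer-Deep is a ``narrower'' and ``deeper'' network. It has fewer parameters than Transformer-Big and runs more efficiently. Table \ref{tab:15-1} compares the number of parameters and the training/inference time of different models.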
+
+%----------------------------------------------
+\begin{table}[htp]
+\centering
+\caption{Training/inference time of different Transformer structures (WMT14 English-German task)}
+\begin{tabular}{l | r r r}
+\rule{0pt}{15pt}     System & Parameters & Training time & Inference time  \\
+\hline
+\rule{0pt}{15pt}     Base & 63M & 4.6h & 19.4s  \\
+\rule{0pt}{15pt}     Big & 210M & 36.1h & 29.3s  \\
+\rule{0pt}{15pt}     DLCL-30 & 137M & 9.8h & 24.6s  \\
+\end{tabular}
+\label{tab:15-1}
+\end{table}
+%----------------------------------------------
+
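+\parinterval A further interesting finding is that once the encoder uses a deep network, the decoder can use a shallower network and still maintain comparable translation quality. The reason is that part of the decoder's computation is devoted to processing and abstracting source-language information; once the encoder is deeper, this processing in the decoder matters less, so the decoder depth can be reduced. A direct benefit is that the translation system can be accelerated by using a shallower decoder. This architecture is very promising for latency-sensitive scenarios.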
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Layer Aggregation}
+
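+\parinterval Although Pre-Norm makes it easy to train deep Transformer models, from the perspective of information flow the input of the $n$-th layer of a Transformer depends only on the output of the previous layer. Residual connections can carry information across layers, but in a very deep network information still has to pass through many residual connections to travel between the input and the output of the model. To let upper layers access information from lower layers more easily, one approach is to introduce more cross-layer connections. The simplest method connects the outputs of all layers directly to the top layer, thereby aggregating information from multiple layers\upcite{Bapna2018TrainingDN,Wang2018MultilayerRF}. A more effective method is {\small\bfnew{Dynamic Linear Combination of Layers}}\index{Dynamic Linear Combination of Layers, DLCL} (DLCL). The input of each layer then takes into account not only the output of the previous layer but a linear combination of the intermediate results of all preceding layers (including the word embeddings), which is theoretically equivalent to a higher-order method for solving ordinary differential equations\upcite{Wang2019LearningDT}. Taking the Pre-Norm structure as an example, the procedure is as follows: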
+\begin{itemize}
+\vspace{0.5em}
+\item Apply layer normalization to the output $x_{l+1}$ of each layer to obtain the representation of that layer:
+\begin{eqnarray}
+z_{l}=\textrm{LN}(x_{l+1})
+\label{eq:15-5}
+\end{eqnarray}
+Note that $z_0$ denotes the output of the embedding layer, and $z_l$ ($l>0$) denotes the final output of each layer of the Transformer network.
+\vspace{0.5em}
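+\item Define a weight matrix $\vectorn{W}$ of size $(L+1)\times(L+1)$, in which each row gives the contribution of the preceding layers to the computation of the current layer, where $L$ is the number of layers of the encoder (or decoder). Let $\vectorn{W}_{l,i}$ denote the weight in row $l$ and column $i$ of $\vectorn{W}$. The output of layer aggregation is then a weighted linear sum of the $z_i$: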
+\begin{eqnarray}
+g_l=\sum_{i=0}^{l}z_i\times \vectorn{W}_{l,i}
+\label{eq:15-6}
+\end{eqnarray}
+$g_l$ is fed into layer $l+1$ as part of its input. The network structure is shown in Figure \ref{fig:15-2}.
+\end{itemize}
+
+%---------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-dynamic-linear-aggregation-network-structure}
+\caption{Network structure of dynamic linear combination of layers}
+\label{fig:15-2}
+\end{figure}
+%-------------------------------------------
+
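+\parinterval As can be seen, the weight matrix $\vectorn{W}$ is lower triangular. At the start, each row is initialized by averaging: every nonzero position in a row of the initial matrix $\vectorn{W}_0$ is set to $1/M$, where $M \in \{1,2,3,\cdots,L+1\}$ is the number of nonzero entries in that row. During training, the network continually updates the weights at the different positions of each row of $\vectorn{W}$ through back-propagation.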
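+
+\parinterval The aggregation in Equation \eqref{eq:15-6} and the averaged initialization of $\vectorn{W}$ can be sketched as follows. This is a minimal pure-Python illustration rather than the implementation used in any system: the function names \texttt{init\_weights} and \texttt{aggregate} and the toy vectors are hypothetical, and the $z_i$ are assumed to be already layer-normalized.

```python
def init_weights(num_layers):
    """Build the (L+1) x (L+1) lower-triangular matrix W_0.

    Row l has l+1 nonzero entries, each initialized to 1/(l+1).
    """
    size = num_layers + 1
    return [[1.0 / (l + 1) if i <= l else 0.0 for i in range(size)]
            for l in range(size)]


def aggregate(z, weights, l):
    """Compute g_l = sum_{i=0}^{l} W[l][i] * z_i for vector-valued z_i."""
    dim = len(z[0])
    return [sum(weights[l][i] * z[i][d] for i in range(l + 1))
            for d in range(dim)]


W = init_weights(3)                       # L = 3 layers plus the embedding
z = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]  # toy z_0 (embedding), z_1, z_2
g2 = aggregate(z, W, 2)                   # plain average of z_0..z_2 at initialization
```

+\parinterval At initialization $g_l$ is simply the mean of $z_0,\ldots,z_l$; training then moves the weights in each row away from this uniform starting point.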
+
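+\parinterval One advantage of dynamic linear combination of layers is that the system learns the contribution of each layer to the current layer automatically. Experiments also show that layers closer to the current layer receive larger weights, which agrees with intuition.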
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Accelerating the Training of Deep Models}
+
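+\parinterval Although such narrow and deep networks converge faster than wide networks, the training cost becomes a problem that cannot be ignored as the amount of training data grows and models become even deeper. For example, training a 48-layer Transformer on a bilingual parallel corpus of tens of millions or even more than a hundred million sentence pairs takes several weeks to converge\footnote[2]{The training time was estimated on a single server with 8 Titan V GPUs.}. It is therefore crucial to train deep networks efficiently without loss of model accuracy. In practice, adjacent layers in a deep network are found to be similar to some extent. This suggests an idea: repeatedly reuse the parameters of a shallower network to initialize a deeper one and train the deep network progressively, which avoids training the whole network from scratch and thereby accelerates training.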
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Progressive Training}
+
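+\parinterval Progressive training starts from a shallow network and gradually increases the depth during training. A relatively simple method divides the network into a shallow part and a deep part and trains them separately, with the final goal of improving translation performance\upcite{DBLP:conf/acl/WuWXTGQLL19}.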
+
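+\parinterval Another method builds the deep network dynamically and reuses the training results of the shallow network as much as possible. Suppose the model initially contains $h$ layers and is trained until convergence. These $h$ layers (including their parameters) are then copied and stacked to form a model with $2h$ layers, and training continues. Repeating this process $n$ times gives a model with $n\times h$ layers. Figure \ref{fig:15-3} illustrates progressive training on the encoder side.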
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-progressive-training}
+\caption{Progressive training of a deep network}
+\label{fig:15-3}
+\end{figure}
+%-------------------------------------------
+
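+\parinterval The advantage of progressive training is that the deep model is not trained from scratch. Each stacking step uses the ``shallow'' model to provide a good initialization for the ``deep'' model, which makes the deep model easier to train.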
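+
+\parinterval The stacking procedure can be sketched in a few lines of Python. The sketch makes one reading of the text explicit and should be taken as an assumption: in every round after the first, the top $h$ layers are copied (with their parameters) and placed on top, so that $n$ training rounds end with $n\times h$ layers. The dictionary-based ``layers'' and the function \texttt{toy\_train} are hypothetical placeholders for real layers and a real training loop.

```python
import copy


def stack_top_block(layers, h):
    """Return a deeper model: the existing layers plus a deep copy of the top h layers."""
    return layers + copy.deepcopy(layers[-h:])


def progressive_training(h, rounds, train):
    """Train an h-layer model, then repeatedly stack h copied layers and keep training."""
    layers = [{"w": 0.0} for _ in range(h)]  # toy layers with a single parameter
    for r in range(rounds):
        if r > 0:
            layers = stack_top_block(layers, h)  # new layers start from trained values
        train(layers)
    return layers


def toy_train(layers):
    # Hypothetical stand-in for training until convergence: one update per layer.
    for layer in layers:
        layer["w"] += 1.0


model = progressive_training(h=6, rounds=4, train=toy_train)  # ends with 4 * 6 layers
```

+\parinterval Because the copies are deep copies, the stacked layers start from the trained parameters but are updated independently afterwards.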
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Grouped Dense Connections}
+
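+\parinterval Many researchers have found that dense connections between the layers of a deep network clearly improve the efficiency of information flow\upcite{Wang2019LearningDT,DBLP:conf/cvpr/HuangLMW17,Dou2018ExploitingDR,DBLP:conf/acl/WuWXTGQLL19}. Repeatedly reusing information from earlier layers also helps obtain better representations, but it comes with a high computational cost. Because dynamic linear combination of layers (DLCL) has to recompute, at every aggregation, the contribution of each preceding layer's representation to the input of the current layer, this cost becomes non-negligible as the overall encoder depth increases. For example, training a 48-layer Transformer with dynamic layer aggregation is nearly 1.9 times slower than training one without it. Caching the intermediate results also increases GPU memory usage: even with FP16 computation, no more than 2048 words can be processed at a time on each GPU with 12 GB of memory, which sharply increases the training cost.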
+
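+\parinterval One way to alleviate this problem is to use sparser connections between layers. The core idea is similar to dynamic linear combination of layers; the difference is that the training cost can be reduced by adjusting how dense the connections between layers are. For example, every $p$ layers can be put into one group, and dynamic linear aggregation is then performed only between groups. Adjusting $p$ controls the density of connections in the network and offers a trade-off between training cost and translation performance. The standard Transformer\upcite{vaswani2017attention} and DLCL\upcite{Wang2019LearningDT} can both be viewed as special cases of this method. As shown in Figure \ref{fig:15-4}, when $p=1$ each block forms its own group, which is equivalent to the DLCL model; when $p=\infty$, the method is equivalent to the standard Transformer. Note that when this method is combined with progressive training, $p=h$ can be used in grouped dense connections.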
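+
+\parinterval The effect of the group size $p$ on connectivity can be illustrated with a small sketch. The indexing convention is an assumption made only for this example: index 0 is the embedding output, layers are numbered from 1, and aggregation happens at the first layer of every group after the first, over the embedding and the outputs of all completed groups. The function name \texttt{aggregation\_sources} is hypothetical.

```python
def aggregation_sources(num_layers, p):
    """Map each layer (1-based) to the earlier outputs that feed its input.

    Index 0 is the embedding output. Inside a group a layer only sees the
    previous layer; the first layer of a new group aggregates the embedding
    and the outputs of all completed groups.
    """
    sources = {}
    for l in range(1, num_layers + 1):
        if l > 1 and (l - 1) % p == 0:
            sources[l] = [0] + list(range(p, l, p))
        else:
            sources[l] = [l - 1]
    return sources


dense = aggregation_sources(6, 1)    # p = 1: every layer aggregates all earlier outputs
grouped = aggregation_sources(6, 2)  # p = 2: aggregation only at group boundaries
plain = aggregation_sources(6, 6)    # p >= depth: ordinary layer-by-layer stacking
```

+\parinterval Counting the entries in each mapping shows how quickly the number of connections, and with it the aggregation cost, shrinks as $p$ grows.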
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-sparse-connections-between-different-groups}
+\caption{Sparse connections between different groups}
+\label{fig:15-4}
+\end{figure}
+%-------------------------------------------
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Learning Rate Reset Strategy}
+
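+\parinterval Although progressive training and grouped dense connections accelerate the training of deep networks, a conventional learning rate decay schedule leaves the learning rate small by the time the deeper models are stacked. As a result, the model cannot converge quickly, and the final performance also suffers.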
+
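+\parinterval The red curve in Figure \ref{fig:15-5} shows the learning rate of the standard Transformer (WMT English-German task). When training reaches 40k steps, the learning rate is clearly far below its peak, yet this is exactly when training of the final deep model begins; such a small learning rate does not allow the deep network to be trained sufficiently in the later stage.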
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-learning-rate}
+\caption{Learning rate curves: learning rate reset vs. training from scratch}
+\label{fig:15-5}
+\end{figure}
+%-------------------------------------------
+
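+\parinterval One solution is to modify the decay schedule of the learning rate. The blue curve in Figure \ref{fig:15-5} is the modified learning rate curve. At the beginning of training the learning rate rises quickly (linearly) to its peak. Afterwards, each time the network grows from $p$ layers to $2p$ layers, the learning rate is reset to the peak value and then decays according to the number of training steps. The steps are as follows: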
+\begin{itemize}
+\vspace{0.5em}
+\item At the beginning of training, the model first goes through a learning rate warm-up phase:
+\begin{eqnarray}
+lr=d_{model}^{-0.5}\cdot step\_num \cdot warmup\_steps^{-1.5}
+\label{eq:15-7}
+\end{eqnarray}
+Here, $step\_num$ is the number of parameter updates, $warmup\_steps$ is the number of warm-up updates, $d_{model}$ is the hidden size of the Transformer model, and $lr$ is the learning rate.
+\vspace{0.5em}
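+\item In the subsequent iterative training, each time a new iteration (a new round of stacking) begins, the learning rate is reset to its peak and then decays accordingly: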
+\begin{eqnarray}
+lr=d_{model}^{-0.5}\cdot step\_num^{-0.5}
+\label{eq:15-8}
+\end{eqnarray}
+Here $step\_num$ is the number of updates performed since the learning rate was reset.
+\end{itemize}
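+
+\parinterval The two phases can be combined into one schedule, sketched below. Two points are assumptions of this sketch rather than statements from the text: the warm-up uses the coefficient $warmup\_steps^{-1.5}$ of the standard Transformer schedule, so that warm-up ends exactly at the peak $d_{model}^{-0.5}\cdot warmup\_steps^{-0.5}$, and after a reset the decay counter restarts from $warmup\_steps$ so that the learning rate returns to that same peak. The function name \texttt{learning\_rate} and the argument \texttt{reset\_steps} are hypothetical.

```python
def learning_rate(step, d_model, warmup_steps, reset_steps=()):
    """Learning rate at a global update count `step` (step >= 1).

    reset_steps: global steps at which the model is stacked and the
    learning rate is reset to its peak (assumed to lie after warm-up).
    """
    scale = d_model ** -0.5
    if step <= warmup_steps:
        # Linear warm-up; reaches the peak scale * warmup_steps ** -0.5.
        return scale * step * warmup_steps ** -1.5
    # The most recent reset point; warm-up end counts as the first one.
    last_reset = max([warmup_steps] +
                     [s for s in reset_steps if warmup_steps <= s <= step])
    # Inverse square-root decay, restarted from the peak at every reset.
    return scale * (warmup_steps + step - last_reset) ** -0.5


# Hypothetical settings: hidden size 512, 4000 warm-up steps, stacking at 20k and 40k.
schedule = [learning_rate(s, 512, 4000, reset_steps=(20000, 40000))
            for s in (1000, 4000, 19999, 20000, 30000, 40000)]
```

+\parinterval Without any reset steps the function reduces to the usual warm-up followed by inverse square-root decay, which corresponds to the ``training from scratch'' curve.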
+
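+\parinterval Combining progressive training, grouped dense connections, and the learning rate reset strategy reduces training time by nearly 40\% without loss of translation quality (for a 40-layer encoder). The speed-up grows further as the model becomes deeper and the dataset larger.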
+
+%----------------------------------------------------------------------------------------
+%    NEW SUB-SECTION
+%----------------------------------------------------------------------------------------
+
+\subsection{Robust Training of Deep Models}
+
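+\parinterval As networks become deeper, another serious problem arises: overfitting. As the number of parameters grows, the difference between the input and output distributions of a deep network becomes larger, and the {\small\bfnew{co-adaptation}}\index{Co-adaptation} between sublayers also becomes more pronounced, so that each sublayer depends too heavily on the others. This helps during training, because sublayers can work together to fit the training data better. However, it also reduces the generalization ability of the model; that is, deep networks overfit more easily.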
+
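+\parinterval Dropout is commonly used to alleviate overfitting (see {\chapterthirteen}). Unfortunately, although current Transformer models use several kinds of Dropout (such as Residual Dropout, Attention Dropout, and ReLU Dropout), overfitting still occurs in deep networks. Figure \ref{fig:15-6} shows that the deep network has markedly lower perplexity than the shallow network on both the training set and the validation set, but after training for some time its validation perplexity starts to rise, indicating that the model has overfitted the training data.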
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-wmt16}
+\caption{Perplexity on the validation and training sets of WMT16 English-German for a shallow network (left) and a deep network (right)}
+\label{fig:15-6}
+\end{figure}
+%-------------------------------------------
+
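+\parinterval The Layer Dropout method mentioned in {\chapterthirteen} can effectively alleviate this problem. Taking the encoder as an example, Layer Dropout randomly drops self-attention sublayers or feed-forward sublayers during training in order to reduce co-adaptation between sublayers. The Pre-Norm structure is used as the base architecture here, and it can be written as: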
+\begin{eqnarray}
+x_{l+1}=\mathcal{F}(\textrm{LN}(x_l))+x_l
+\label{eq:15-9}
+\end{eqnarray}
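+where $\textrm{LN}( \cdot )$ denotes layer normalization, $\mathcal{F}( \cdot )$ denotes the self-attention mechanism or the feed-forward network, and $x_l$ denotes the input of the $l$-th sublayer. A mask $M$ (with value 0 or 1) is then used to control whether each sublayer is computed normally or dropped. The computation of the sublayer can thus be rewritten as: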
+\begin{eqnarray}
+x_{l+1}=M \cdot \mathcal{F}(\textrm{LN}(x_l))+x_l
+\label{eq:15-10}
+\end{eqnarray}
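+$M=0$ means that the sublayer is dropped, and $M=1$ means that the sublayer is computed normally. Figure \ref{fig:15-7} shows the difference between this method and the standard Pre-Norm structure.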
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-sublayer-skip}
+\caption{Standard Pre-Norm structure and Pre-Norm structure with random sublayer skipping}
+\label{fig:15-7}
+\end{figure}
+%-------------------------------------------
+
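+\parinterval In addition, researchers have found that in residual networks the representations that the lower sub-networks obtain by abstracting the input have a large influence on the final output, while the upper layers fit the training objective by continually refining the representations produced by the lower layers\upcite{DBLP:journals/corr/GreffSS16}. This conclusion also holds for Transformer models. For example, during training the gradient norms of the residual branches and of the lower layers are usually larger, which indirectly indicates that the lower layers need larger updates during optimization. With this in mind, the drop probability of each sublayer can be designed to increase linearly from bottom to top, so that lower layers are more likely to be kept than upper ones. Let $L$ denote the number of blocks in the encoder and $l$ the index of the current sublayer. $M$ is then obtained as follows: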
+\begin{eqnarray}
+M = \left\{\begin{array}{ll}
+0&P \leqslant p_l\\
+1&P > p_l
+\end{array}\right.
+\label{eq:15-11}
+\end{eqnarray}
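+where $P$ is a random variable uniformly distributed on $[0,1]$ (so that $M$ follows a Bernoulli distribution), and $p_l$ is the probability that sublayer $l$ is dropped, computed as: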
+\begin{eqnarray}
+p_l=\frac{l}{2L}\cdot \varphi
+\label{eq:15-12}
+\end{eqnarray}
+Here $1 \leqslant l \leqslant 2L$, and $\varphi$ is a preset hyperparameter.
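+
+\parinterval Equations \eqref{eq:15-10} to \eqref{eq:15-12} can be put together in a short sketch. It assumes that $P$ is drawn uniformly from $[0,1)$ and compared with $p_l$; the function names and the settings $L=6$, $\varphi=0.4$ are hypothetical, and the argument \texttt{f} stands for $\mathcal{F}(\textrm{LN}(\cdot))$ as a whole.

```python
import random


def drop_probability(l, num_blocks, phi):
    """p_l = l / (2L) * phi for sublayer l in 1..2L (grows linearly with l)."""
    return l / (2 * num_blocks) * phi


def sample_mask(l, num_blocks, phi, rng):
    """Return M = 0 (drop the sublayer) if P <= p_l, otherwise M = 1."""
    return 0 if rng.random() <= drop_probability(l, num_blocks, phi) else 1


def skip_sublayer(x, f, mask):
    """x_{l+1} = M * F(LN(x_l)) + x_l, where f plays the role of F(LN(.))."""
    return [a + mask * b for a, b in zip(x, f(x))]


rng = random.Random(0)       # fixed seed so that the sketch is reproducible
num_blocks, phi = 6, 0.4     # hypothetical settings: 6 blocks, i.e. 12 sublayers
masks = [sample_mask(l, num_blocks, phi, rng)
         for l in range(1, 2 * num_blocks + 1)]
```

+\parinterval With these settings the bottom sublayer is dropped with probability $0.4/12$ and the top sublayer with probability $0.4$, so lower sublayers are kept far more often than upper ones.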
+
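+\parinterval In Layer Dropout, the top-level output of a residual network with $2L$ sublayers can be viewed as the aggregation of $2^{2L}$ sub-networks. Randomly dropping $n$ sublayers masks the outputs of $2^{2L}-2^{2L-n}$ of these sub-networks and reduces the total number of sub-networks to $2^{2L-n}$. In the unrolled residual network shown in Figure \ref{fig:15-8}, with 3 sublayers there are 8 paths from the input to the output; after sublayer2 is removed, the number of paths drops to 4.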
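+
+\parinterval The path counts in this example can be checked by brute-force enumeration. In the sketch below, each sublayer of the unrolled residual network is either traversed or bypassed through its skip connection, and a dropped sublayer can only be bypassed. The function name \texttt{count\_paths} and the 0-based indices are conventions of this sketch only.

```python
from itertools import product


def count_paths(num_sublayers, dropped=()):
    """Count input-to-output paths in the unrolled residual network.

    Each path is a tuple of choices, 1 = go through the sublayer,
    0 = take the skip connection. Dropped sublayers (0-based indices)
    must be bypassed.
    """
    count = 0
    for choice in product((0, 1), repeat=num_sublayers):
        if all(choice[i] == 0 for i in dropped):
            count += 1
    return count


all_paths = count_paths(3)                   # 3 sublayers, nothing dropped
after_drop = count_paths(3, dropped=(1,))    # the second sublayer is dropped
```

+\parinterval Each additional dropped sublayer halves the number of remaining paths, which matches the count $2^{2L-n}$.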
+
+%----------------------------------------------
+\begin{figure}[htp]
+\centering
+\input{./Chapter15/Figures/figure-expanded-residual-network}
+\caption{Unrolled view of the residual network in Layer Dropout}
+\label{fig:15-8}
+\end{figure}
+%-------------------------------------------
+
--- a/Chapter2/Figures/figure-full-probability-word-segmentation-3.tex
+++ b/Chapter2/Figures/figure-full-probability-word-segmentation-3.tex
@@ -17,7 +17,7 @@



-\node [anchor=north west] (label11) at ([xshift=18.0em,yshift=1.63em]label1.south west) {更多数据-总词数:100K $\sim$ 1M};
+\node [anchor=north west] (label11) at ([xshift=18.0em,yshift=1.63em]label1.south west) {更多数据-总词数:1百万个词};
 \node [anchor=north west] (p12) at (label11.south west) {$\funp{P}(\textrm{很})=0.000010$};
 \node [anchor=north west] (p22) at (p12.south west) {$\funp{P}(\textrm{。})=0.001812$};
 \node [anchor=north west] (p32) at (p22.south west) {$\funp{P}(\textrm{确实})=0.000001$};

--- a/Chapter2/Figures/figure-word-frequency-distribution.tex
+++ b/Chapter2/Figures/figure-word-frequency-distribution.tex
@@ -128,6 +128,7 @@ axis x line*=bottom,
 (103,69555)
 (104,68668)};
 \end{axis}
+\node[anchor=west] (n44) at (1.7em,10.8em){（次）};
 \end{tikzpicture}

 %---------------------------------------------------------------------

--- a/Chapter2/chapter2.tex
+++ b/Chapter2/chapter2.tex
@@ -43,7 +43,7 @@
 \subsection{随机变量和概率}
 \parinterval 在自然界中，很多{\small\bfnew{事件}}\index{事件}（Event）\index{Event}是否会发生是不确定的。例如，明天会下雨、掷一枚硬币是正面朝上、扔一个骰子的点数是1等。这些事件可能会发生也可能不会发生。通过大量的重复试验，能发现具有某种规律性的事件叫做{\small\sffamily\bfseries{随机事件}}\index{随机事件}。

-\parinterval {\small\sffamily\bfseries{随机变量}}\index{随机变量}（Random Variable）\index{Random Variable}是对随机事件发生可能状态的描述，是随机事件的数量表征。设$\Omega = \{ \omega \}$为一个随机试验的样本空间，$X=X(\omega)$就是定义在样本空间$\Omega$上的单值实数函数，即$X=X(\omega)$为随机变量，记为$X$。随机变量是一种能随机选取数值的变量，常用大写的英语字母或希腊字母表示，其取值通常用小写字母来表示。例如，用$A$ 表示一个随机变量，用$a$表示变量$A$的一个取值。根据随机变量可以选取的值的某些性质，可以将其划分为离散变量和连续变量。
+\parinterval {\small\sffamily\bfseries{随机变量}}\index{随机变量}（Random Variable）\index{Random Variable}是对随机事件发生可能状态的描述，是随机事件的数量表征。设$\varOmega = \{ \omega \}$为一个随机试验的样本空间，$X=X(\omega)$就是定义在样本空间$\varOmega$上的单值实数函数，即$X=X(\omega)$为随机变量，记为$X$。随机变量是一种能随机选取数值的变量，常用大写的英语字母或希腊字母表示，其取值通常用小写字母来表示。例如，用$A$ 表示一个随机变量，用$a$表示变量$A$的一个取值。根据随机变量可以选取的值的某些性质，可以将其划分为离散变量和连续变量。

 \parinterval 离散变量是在其取值区间内可以被一一列举、总数有限并且可计算的数值变量。例如，用随机变量$X$代表某次投骰子出现的点数，点数只可能取1$\sim$6这6个整数，$X$就是一个离散变量。

@@ -546,7 +546,7 @@ F(x)=\int_{-\infty}^x f(x)\textrm{d}x
 \label{eq:2-26}
 \end{eqnarray}

-\parinterval 显然，这个结果是不合理的。因为即使语料中没有 “确实”和“现在”两个词连续出现，这种搭配也是客观存在的。这时简单地用极大似然估计得到概率却是0，导致整个句子出现的概率为0。 更常见的问题是那些根本没有出现在词表中的词，称为{\small\sffamily\bfseries{未登录词}}\index{未登录词}（Out-of-vocabulary Word，OOV Word）\index{Out-of-vocabulary Word，OOV Word}，比如一些生僻词，可能模型训练阶段从来没有看到过，这时模型仍然会给出0 概率。图\ref{fig:2-11}展示了一个真实语料库中词语出现频次的分布，可以看到绝大多数词都是低频词。
+\parinterval 显然，这个结果是不合理的。因为即使语料中没有 “确实”和“现在”两个词连续出现，这种搭配也是客观存在的。这时简单地用极大似然估计得到概率却是0，导致整个句子出现的概率为0。 更常见的问题是那些根本没有出现在词表中的词，称为{\small\sffamily\bfseries{未登录词}}\index{未登录词}（Out-Of-Vocabulary Word，OOV Word）\index{Out-Of-Vocabulary Word，OOV Word}，比如一些生僻词，可能模型训练阶段从来没有看到过，这时模型仍然会给出0概率。图\ref{fig:2-11}展示了一个真实语料库中词语出现频次的分布，可以看到绝大多数词都是低频词。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -735,7 +735,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \end{array}\right.
 \label{eq:2-41}
 \end{eqnarray}
-\noindent 其中catcount$(\cdot)$表示的是单词$w_i$作为n-gram中第n个词时$w_{i-n+1} \ldots w_i$的种类数目。
+\noindent 其中catcount$(\cdot)$表示的是单词$w_i$作为$n$-gram中第$n$个词时$w_{i-n+1} \ldots w_i$的种类数目。

 \parinterval Kneser-Ney平滑是很多语言模型工具的基础\upcite{heafield2011kenlm,stolcke2002srilm}。还有很多以此为基础衍生出来的算法，感兴趣的读者可以通过参考文献自行了解\upcite{parsing2009speech,ney1994structuring,chen1999empirical}。

@@ -1046,7 +1046,7 @@ c(\cdot) & \textrm{当计算最高阶模型时}  \\
 \vspace{0.5em}
 \item 本章更多地关注了语言模型的基本问题和求解思路，但是基于$n$-gram的方法并不是语言建模的唯一方法。从现在自然语言处理的前沿看，端到端的深度学习方法在很多任务中都取得了领先的性能。语言模型同样可以使用这些方法\upcite{jing2019a}，而且在近些年取得了巨大成功。例如，最早提出的前馈神经语言模型\upcite{bengio2003a}和后来的基于循环单元的语言模型\upcite{mikolov2010recurrent}、基于长短期记忆单元的语言模型\upcite{sundermeyer2012lstm}以及现在非常流行的Transformer\upcite{vaswani2017attention}。 关于神经语言模型的内容，会在{\chapternine}进行进一步介绍。
 \vspace{0.5em}
-\item 最后，本章结合语言模型的序列生成任务对搜索技术进行了介绍。类似地，机器翻译任务也需要从大量的翻译候选中快速寻找最优译文。因此在机器翻译任务中也使用了搜索方法，这个过程通常被称作{\small\bfnew{解码}}\index{解码}（Decoding）\index{Decoding}。例如，有研究者在基于词的翻译模型中尝试使用启发式搜索\upcite{DBLP:conf/acl/OchUN01,DBLP:conf/acl/WangW97,tillmann1997a}以及贪婪搜索方法\upcite{germann2001fast}\upcite{germann2003greedy}，也有研究者探索基于短语的栈解码方法\upcite{Koehn2007Moses,DBLP:conf/amta/Koehn04}。此外，解码方法还包括有限状态机解码\upcite{bangalore2001a}\upcite{DBLP:journals/mt/BangaloreR02}以及基于语言学约束的解码\upcite{venugopal2007an,zollmann2007the,liu2006tree,galley2006scalable,chiang2005a}。相关内容将在{\chaptereight}和{\chapterfourteen}进行介绍。
+\item 最后，本章结合语言模型的序列生成任务对搜索技术进行了介绍。类似地，机器翻译任务也需要从大量的翻译候选中快速寻找最优译文。因此在机器翻译任务中也使用了搜索方法，这个过程通常被称作解码。例如，有研究者在基于词的翻译模型中尝试使用启发式搜索\upcite{DBLP:conf/acl/OchUN01,DBLP:conf/acl/WangW97,tillmann1997a}以及贪婪搜索方法\upcite{germann2001fast}\upcite{germann2003greedy}，也有研究者探索基于短语的栈解码方法\upcite{Koehn2007Moses,DBLP:conf/amta/Koehn04}。此外，解码方法还包括有限状态机解码\upcite{bangalore2001a}\upcite{DBLP:journals/mt/BangaloreR02}以及基于语言学约束的解码\upcite{venugopal2007an,zollmann2007the,liu2006tree,galley2006scalable,chiang2005a}。相关内容将在{\chaptereight}和{\chapterfourteen}进行介绍。
 \vspace{0.5em}
 \end{itemize}
 \end{adjustwidth}
--- a/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
+++ b/Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40.tex
@@ -46,7 +46,7 @@
 {
 \draw[decorate,thick,decoration={brace,amplitude=5pt,mirror}] ([yshift=-0.2em]eq5.south west) -- ([yshift=-0.2em]eq6.south east) node [pos=0.4,below,xshift=-0.0em,yshift=-0.3em] (expcount1) {\footnotesize{{``$t_v$翻译为$s_u$''这个事件}}};
 \node [anchor=north west] (expcount2) at ([yshift=0.5em]expcount1.south west) {\footnotesize{{出现次数的期望的估计}}};
-\node [anchor=north west] (expcount3) at ([yshift=0.5em]expcount2.south west) {\footnotesize{{称之为期望频次}}（Expected Count）};
+\node [anchor=north west] (expcount3) at ([yshift=0.5em]expcount2.south west) {\footnotesize{{称之为期望频次}}};
 }

 \end{tikzpicture}

--- a/Chapter5/Figures/figure-calculation-of-the-expected-frequency-2.tex
+++ b/Chapter5/Figures/figure-calculation-of-the-expected-frequency-2.tex
 \qquad
 \begin{tabular}{cccc}
-\multicolumn{1}{c|}{$x_i$} & c($x_i$) & P($x_i$) & $c(x_i)\cdot$P($x_i$) \\ \hline
+\multicolumn{1}{c|}{$x_i$} & c($x_i$) & $\funp{P}$($x_i$) & $c(x_i)\cdot\funp{P}$($x_i$) \\ \hline
 \multicolumn{1}{c|}{$x_1$} & 2        & 0.1      & 0.2                   \\
 \multicolumn{1}{c|}{$x_2$} & 1        & 0.3      & 0.3                   \\
 \multicolumn{1}{c|}{$x_3$} & 5        & 0.2      & 1.0                   \\ \hline

--- a/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
+++ b/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
@@ -9,15 +9,15 @@
 \node [anchor=north west] (line2) at ([yshift=-0.3em]line1.south west) {输入: 平行语料${(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})}$};
 \node [anchor=north west] (line3) at ([yshift=-0.1em]line2.south west) {输出: 参数$f(\cdot|\cdot)$的最优值};
 \node [anchor=north west] (line4) at ([yshift=-0.1em]line3.south west) {1: \textbf{Function} \textsc{EM}($\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$) };
-\node [anchor=north west] (line5) at ([yshift=-0.1em]line4.south west) {2: \ \ Initialize $f(\cdot|\cdot)$ \hspace{5em} $\rhd$ 比如给$f(\cdot|\cdot)$一个均匀分布};
-\node [anchor=north west] (line6) at ([yshift=-0.1em]line5.south west) {3: \ \ Loop until $f(\cdot|\cdot)$ converges};
-\node [anchor=north west] (line7) at ([yshift=-0.1em]line6.south west) {4: \ \ \ \ \textbf{foreach} $k = 1$ to $K$ \textbf{do}};
-\node [anchor=north west] (line8) at ([yshift=-0.1em]line7.south west) {5: \ \ \ \ \ \ \ \footnotesize{$c_{\mathbb{E}}(\seq{s}_u|\seq{t}_v;\seq{s}^{[k]},\seq{t}^{[k]}) = \sum\limits_{j=1}^{|\seq{s}^{[k]}|} \delta(s_j,s_u) \sum\limits_{i=0}^{|\seq{t}^{[k]}|} \delta(t_i,t_v) \cdot \frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$}\normalsize{}};
-\node [anchor=north west] (line9) at ([yshift=-0.1em]line8.south west) {6: \ \ \ \ \textbf{foreach} $t_v$ appears at least one of $\{\seq{t}^{[1]},...,\seq{t}^{[K]}\}$ \textbf{do}};
-\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \ \ \ \ \ \ \ $\lambda_{t_v}^{'} = \sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
-\node [anchor=north west] (line11) at ([yshift=-0.1em]line10.south west) {8: \ \ \ \ \ \ \ \textbf{foreach} $s_u$ appears at least one of $\{\seq{s}^{[1]},...,\seq{s}^{[K]}\}$ \textbf{do}};
-\node [anchor=north west] (line12) at ([yshift=-0.1em]line11.south west) {9: \ \ \ \ \ \ \ \ \ $f(s_u|t_v) = \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \cdot (\lambda_{t_v}^{'})^{-1}$};
-\node [anchor=north west] (line13) at ([yshift=-0.1em]line12.south west) {10: \ \textbf{return} $f(\cdot|\cdot)$};
+\node [anchor=north west] (line5) at ([yshift=-0.1em]line4.south west) {2: \quad Initialize $f(\cdot|\cdot)$ \hspace{3em} $\rhd$ 比如给$f(\cdot|\cdot)$一个均匀分布};
+\node [anchor=north west] (line6) at ([yshift=-0.1em]line5.south west) {3: \quad Loop until $f(\cdot|\cdot)$ converges};
+\node [anchor=north west] (line7) at ([yshift=-0.1em]line6.south west) {4: \quad \quad  \textbf{foreach} $k = 1$ to $K$ \textbf{do}};
+\node [anchor=north west] (line8) at ([yshift=-0.1em]line7.south west) {5: \quad \quad \quad \footnotesize{$c_{\mathbb{E}}(\seq{s}_u|\seq{t}_v;\seq{s}^{[k]},\seq{t}^{[k]}) = \sum\limits_{j=1}^{|\seq{s}^{[k]}|} \delta(s_j,s_u) \sum\limits_{i=0}^{|\seq{t}^{[k]}|} \delta(t_i,t_v) \cdot \frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$}\normalsize{}};
+\node [anchor=north west] (line9) at ([yshift=-0.1em]line8.south west) {6: \quad \quad \textbf{foreach} $t_v$ appears at least one of $\{\seq{t}^{[1]},...,\seq{t}^{[K]}\}$ \textbf{do}};
+\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \quad \quad \quad $\lambda_{t_v}^{'} = \sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
+\node [anchor=north west] (line11) at ([yshift=-0.1em]line10.south west) {8: \quad \quad \quad \textbf{foreach} $s_u$ appears at least one of $\{\seq{s}^{[1]},...,\seq{s}^{[K]}\}$ \textbf{do}};
+\node [anchor=north west] (line12) at ([yshift=-0.1em]line11.south west) {9: \quad \quad \quad \quad $f(s_u|t_v) = \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \cdot (\lambda_{t_v}^{'})^{-1}$};
+\node [anchor=north west] (line13) at ([yshift=-0.1em]line12.south west) {10: \textbf{return} $f(\cdot|\cdot)$};

 \begin{pgfonlayer}{background}
 {

--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-1.tex
@@ -19,7 +19,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};
 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
 \draw [->,very thick,ublue] ([yshift=0.2em]s2.south) -- ([yshift=-0.8em]s2.south) node [pos=0.5,right] (pi2) {\tiny{$\pi$(2)}};
@@ -102,13 +102,13 @@
 \node [anchor=west] (hlabel) at ([yshift=-2.5em]jlabel.west) {\scriptsize{$i = 1, j = 1$}};
 }
 {\tiny
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
-\draw [-] (glabel.north east) -- ([yshift=-2.0in]glabel.north east);
+\draw [-] (glabel.north east) -- ([yshift=-1.5in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-1.3em,yshift=-8.5em]glabel.west) {\tiny{$h$存放临时翻译结果}};
-\node [anchor=north west] (foot1) at ([xshift=0.0em,yshift=-23.0em]translabel.south west) {\scriptsize{(a)\; 4:$h = \phi$}};
+\node [anchor=north west] (foot1) at ([xshift=0.0em,yshift=-18.0em]translabel.south west) {\scriptsize{(a)\; 4:$h = \phi$}};
 }
 \end{scope}

@@ -124,7 +124,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};

 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
@@ -227,13 +227,13 @@
 }

 {\tiny%下面的表格
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
-\draw [-] (glabel.north east) -- ([yshift=-2.0in]glabel.north east);
+\draw [-] (glabel.north east) -- ([yshift=-1.5in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

 \node [anchor=center,rotate=90] (hlabel2) at ([xshift=-1.3em,yshift=-8.5em]glabel.west) {\tiny{$h$存放临时翻译结果}};
-\node [anchor=north west] (foot2) at ([xshift=0.0em,yshift=-23.0em]translabel.south west) {\scriptsize{(b)\; 6: \textbf{if} $used[j]=$ \textbf{false} \textbf{then}}};
+\node [anchor=north west] (foot2) at ([xshift=0.0em,yshift=-18.0em]translabel.south west) {\scriptsize{(b)\; 6: \textbf{if} $used[j]=$ \textbf{false} \textbf{then}}};
 }
 {%大大的join
 \node [anchor=center,draw=ublue,circle,thick,fill=white,inner sep=2.5pt,circular drop shadow={shadow xshift=0.1em,shadow yshift=-0.1em}] (join) at ([xshift=4em,yshift=-1em]hlabel.north east) {\tiny{\textsc{Join}}};
@@ -253,17 +253,13 @@
 \draw [->,thick] (hypotrans1.south) ..controls +(south:0.5) and +(north:0.5).. (join.north);
 }
 {
-\draw [->,thick] (list1.south) ..controls +(319:3) and +(north west:2.2).. (join.north west);
+\draw [->,thick] (list1.west) ..controls +(249:3.1) and +(north west:2.2).. (join.north west);
 \draw [->,thick] (join.south) ..controls +(south:1) and +(east:1).. ([xshift=4em]g2.east);
 }

 \end{scope}


-
-
-
-
 \end{tikzpicture}



--- a/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
+++ b/Chapter5/Figures/figure-greedy-mt-decoding-process-3.tex
@@ -14,7 +14,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};

 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
@@ -119,7 +119,7 @@
 }

 {\tiny%下面的表格
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
 \draw [-] (glabel.north east) -- ([yshift=-2.0in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);
@@ -160,7 +160,7 @@
 }

 {
-\draw [->,thick] (list2.south) ..controls +(south:1.5) and +(north:1.1).. (join.120);
+\draw [->,thick] (list2.south) ..controls +(south:1.7) and +(north:1.0).. (join.120);
 \draw [->,thick] (join.south) ..controls +(south:3) and +(east:1).. ([xshift=4em]g5.east);
 }

@@ -179,7 +179,7 @@
 \node [anchor=west] (s4) at ([xshift=2.5em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2.5em]s4.east) {{满意}};

-\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子(已经分词)}}};
+\node [anchor=south west,inner sep=1pt] (sentlabel) at ([yshift=0.3em]s1.north west) {\scriptsize{{输入: 待翻译句子（已经分词）}}};

 {
 \draw [->,very thick,ublue] ([yshift=0.2em]s1.south) -- ([yshift=-0.8em]s1.south) node [pos=0.5,right] (pi1) {\tiny{$\pi$(1)}};
@@ -277,7 +277,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 {\tiny%下面的表格
-\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\mathbf{s},\mathbf{t})$};
+\node [anchor=north west] (glabel) at (hlabel.south west) {$g(\seq{s},\seq{t})$};
 \node [anchor=west] (translabel) at (glabel.east) {翻译结果};
 \draw [-] (glabel.north east) -- ([yshift=-2.0in]glabel.north east);
 \draw [-] (glabel.south west) -- ([xshift=3.5in]glabel.south west);

--- a/Chapter5/Figures/figure-human-translation.tex
+++ b/Chapter5/Figures/figure-human-translation.tex
@@ -10,7 +10,7 @@
 \node [anchor=west] (s3) at ([xshift=2em]s2.east) {{你}};
 \node [anchor=west] (s4) at ([xshift=2em]s3.east) {{感到}};
 \node [anchor=west] (s5) at ([xshift=2em]s4.east) {{满意}};
-\node [anchor=south west] (sentlabel) at ([yshift=-0.5em]s1.north west) {\scriptsize{\sffamily\bfseries{待翻译句子(已经分词):}}};
+\node [anchor=south west] (sentlabel) at ([yshift=-0.5em]s1.north west) {\scriptsize{\sffamily\bfseries{待翻译句子（已经分词）:}}};

 {
 \draw [->,very thick,ublue] (s1.south) -- ([yshift=-0.7em]s1.south);

--- a/Chapter5/Figures/figure-process-of-machine-translation.tex
+++ b/Chapter5/Figures/figure-process-of-machine-translation.tex
@@ -184,14 +184,14 @@
 \draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=8em,xshift=2.0em]t53.south east) -- ([xshift=2.0em]t53.south east) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label2) {\footnotesize{{从双语数}}};
 \node [anchor=north west] (label2part2) at ([yshift=0.3em]label2.south west) {\footnotesize{{据中自动}}};
 \node [anchor=north west] (label2part3) at ([yshift=0.3em]label2part2.south west) {\footnotesize{{学习词典}}};
-\node [anchor=north west] (label2part4) at ([yshift=0.3em]label2part3.south west) {\footnotesize{{(训练)}}};
+\node [anchor=north west] (label2part4) at ([yshift=0.3em]label2part3.south west) {\footnotesize{{（训练）}}};
 }

 {
 \draw[decorate,thick,decoration={brace,amplitude=5pt}] ([yshift=-1.0em,xshift=6.2em]t53.south west) -- ([yshift=-10.5em,xshift=6.2em]t53.south west) node [pos=0.5,right,xshift=0.5em,yshift=2.0em] (label3) {\footnotesize{{利用概率}}};
 \node [anchor=north west] (label3part2) at ([yshift=0.3em]label3.south west) {\footnotesize{{化的词典}}};
 \node [anchor=north west] (label3part3) at ([yshift=0.3em]label3part2.south west) {\footnotesize{{进行翻译}}};
-\node [anchor=north west] (label3part4) at ([yshift=0.3em]label3part3.south west) {\footnotesize{{(解码)}}};
+\node [anchor=north west] (label3part4) at ([yshift=0.3em]label3part3.south west) {\footnotesize{{（解码）}}};
 }
 \end{scope}


--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
@@ -24,7 +24,7 @@

 \chapter{基于词的机器翻译建模}

-\parinterval 使用统计方法对翻译问题进行建模是机器翻译发展中的重要里程碑。这种思想也影响了当今的统计机器翻译和神经机器翻译范式。虽然技术不断发展，传统的统计模型已经不再``新鲜''，但它对于今天机器翻译的研究仍然有着重要的启示作用。在了解前沿、展望未来的同时，更要冷静地思考前人给我们带来了什么。基于此，这里将介绍统计机器翻译的开山之作\ \dash \ IBM 模型，它提出了使用统计模型进行翻译的思想，并在建模中引入了单词对齐这一重要概念。
+\parinterval 使用统计方法对翻译问题进行建模是机器翻译发展中的重要里程碑。这种思想也影响了当今的统计机器翻译和神经机器翻译范式。虽然技术不断发展，传统的统计模型已经不再“新鲜”，但它对于今天机器翻译的研究仍然有着重要的启示作用。在了解前沿、展望未来的同时，更要冷静地思考前人给我们带来了什么。基于此，这里将介绍统计机器翻译的开山之作\ \dash \ IBM 模型，它提出了使用统计模型进行翻译的思想，并在建模中引入了单词对齐这一重要概念。

 IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:journals/coling/BrownPPM94}。客观地说，这项工作的视野和对问题的理解，已经超过当时很多人所能看到的东西，其衍生出来的一系列方法和新的问题还被后人花费将近10年的时间来进行研究与讨论。时至今日，IBM模型中的一些思想仍然影响着很多研究工作。本章将重点介绍一种简单的基于单词的统计翻译模型（IBM模型1），以及在这种建模方式下的模型训练方法。这些内容可以作为后续章节中统计机器翻译和神经机器翻译建模方法的基础。

@@ -39,7 +39,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo

 \parinterval 那么，基于单词的统计机器翻译模型又是如何描述翻译问题的呢？Peter F. Brown等人提出了一个观点\upcite{DBLP:journals/coling/BrownPPM94}：在翻译一个句子时，可以把其中的每个单词翻译成对应的目标语言单词，然后调整这些目标语言单词的顺序，最后得到整个句子的翻译结果，而这个过程可以用统计模型来描述。尽管在人看来使用两个语言单词之间的对应进行翻译是很自然的事，但是对于计算机来说可是向前迈出了一大步。

-\parinterval 先来看一个例子。图 \ref{fig:5-1}展示了一个汉语翻译到英语的例子。首先，可以把源语言句子中的单词``我''、``对''、``你''、``感到''和``满意''分别翻译为``I''、``with''、``you''、``am''\ 和``satisfied''，然后调整单词的顺序，比如，``am''放在译文的第2个位置，``you''应该放在最后的位置等等，最后得到译文``I am satisfied with you''。
+\parinterval 先来看一个例子。图 \ref{fig:5-1}展示了一个汉语翻译到英语的例子。首先，可以把源语言句子中的单词“我”、“对”、“你”、“感到”和“满意”分别翻译为“I”、“with”、“you”、“am”\ 和“satisfied”，然后调整单词的顺序，比如，“am”放在译文的第2个位置，“you”应该放在最后的位置等等，最后得到译文“I am satisfied with you”。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -71,7 +71,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo
 \end{figure}
 %----------------------------------------------

-\parinterval 图\ref{fig:5-2}给出了上述过程的一个示例。对于今天的自然语言处理研究，``分析、转换和生成''依然是一个非常深刻的观点。包括机器翻译在内的很多自然语言处理问题都可以用这个过程来解释。比如，对于现在比较前沿的神经机器翻译方法，从大的框架来说，依然在做分析（编码器）、转换（编码-解码注意力）和生成（解码器），只不过这些过程隐含在神经网络的设计中。当然，这里并不会对``分析、转换和生成''的架构展开过多的讨论，随着后面技术内容讨论的深入，这个观念会进一步体现。
+\parinterval 图\ref{fig:5-2}给出了上述过程的一个示例。对于今天的自然语言处理研究，“分析、转换和生成”依然是一个非常深刻的观点。包括机器翻译在内的很多自然语言处理问题都可以用这个过程来解释。比如，对于现在比较前沿的神经机器翻译方法，从大的框架来说，依然在做分析（编码器）、转换（编码-解码注意力）和生成（解码器），只不过这些过程隐含在神经网络的设计中。当然，这里并不会对“分析、转换和生成”的架构展开过多的讨论，随着后面技术内容讨论的深入，这个观念会进一步体现。
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------
@@ -106,17 +106,17 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo
 %----------------------------------------------
 \vspace{-0.2em}

-\parinterval 图\ref{fig:5-3}展示了人在翻译``我/对/你/感到/满意''时可能会思考的内容\footnote{这里用斜杠表示单词之间的分隔。}。具体来说，有如下两方面内容：
+\parinterval 图\ref{fig:5-3}展示了人在翻译“我/对/你/感到/满意”时可能会思考的内容\footnote{这里用斜杠表示单词之间的分隔。}。具体来说，有如下两方面内容：

 \begin{itemize}
 \vspace{0.5em}
-\item {\small\bfnew{翻译知识的学习}}：对于输入的源语言句子，首先需要知道每个单词可能的翻译有什么，这些翻译被称为{\small\sffamily\bfseries{翻译候选}}\index{翻译候选}（Translation Candidate）\index{Translation Candidate}。比如，汉语单词``对''可能的译文有``to''、``with''和``for''等。对于人来说，可以通过阅读、背诵、做题或者老师教等途径获得翻译知识，这些知识就包含了源语言与目标语言单词之间的对应关系。通常，也把这个过程称之为学习过程。
+\item {\small\bfnew{翻译知识的学习}}：对于输入的源语言句子，首先需要知道每个单词可能的翻译有什么，这些翻译被称为{\small\sffamily\bfseries{翻译候选}}\index{翻译候选}（Translation Candidate）\index{Translation Candidate}。比如，汉语单词“对”可能的译文有“to”、“with”和“for”等。对于人来说，可以通过阅读、背诵、做题或者老师教等途径获得翻译知识，这些知识就包含了源语言与目标语言单词之间的对应关系。通常，也把这个过程称之为学习过程。
 \vspace{0.5em}
-\item {\small\bfnew{运用知识生成译文}}：当翻译一个从未见过的句子时，可以运用学习到的翻译知识，得到新的句子中每个单词的译文，并处理常见的单词搭配、主谓一致等问题，比如，英语中``satisfied'' 后面常常使用介词``with''构成搭配。基于这些知识可以快速生成译文。
+\item {\small\bfnew{运用知识生成译文}}：当翻译一个从未见过的句子时，可以运用学习到的翻译知识，得到新的句子中每个单词的译文，并处理常见的单词搭配、主谓一致等问题，比如，英语中“satisfied” 后面常常使用介词“with”构成搭配。基于这些知识可以快速生成译文。
 \vspace{0.5em}
 \end{itemize}

-当然，每个人进行翻译时所使用的方法和技巧都不相同，所谓人工翻译也没有固定的流程。但是，可以确定的是，人在进行翻译时也需要``学习''和``运用''翻译知识。对翻译知识``学习''和``运用''的好与坏，直接决定了人工翻译结果的质量。
+当然，每个人进行翻译时所使用的方法和技巧都不相同，所谓人工翻译也没有固定的流程。但是，可以确定的是，人在进行翻译时也需要“学习”和“运用”翻译知识。对翻译知识“学习”和“运用”的好与坏，直接决定了人工翻译结果的质量。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -124,7 +124,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo

 \subsubsection{2. 机器翻译流程}

-\parinterval 人进行翻译的过程比较容易理解，那计算机是如何完成翻译的呢？虽然人工智能这个概念显得很神奇，但是计算机远没有人那么智能，有时甚至还很``笨''。一方面，它没有能力像人一样，在教室里和老师一起学习语言知识；另一方面，即使能列举出每个单词的候选译文，但是还是不知道这些译文是怎么拼装成句的，甚至不知道哪些译文是对的。为了更加直观地理解机器在翻译时要解决的挑战，可以将问题归纳如下：
+\parinterval 人进行翻译的过程比较容易理解，那计算机是如何完成翻译的呢？虽然人工智能这个概念显得很神奇，但是计算机远没有人那么智能，有时甚至还很“笨”。一方面，它没有能力像人一样，在教室里和老师一起学习语言知识；另一方面，即使能列举出每个单词的候选译文，但是还是不知道这些译文是怎么拼装成句的，甚至不知道哪些译文是对的。为了更加直观地理解机器在翻译时要解决的挑战，可以将问题归纳如下：

 \begin{itemize}
 \vspace{0.5em}
@@ -134,11 +134,11 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo
 \vspace{0.5em}
 \end{itemize}

-\parinterval 对于第一个问题，可以给计算机一个翻译词典，这样计算机可以发挥计算方面的优势，尽可能多地把翻译结果拼装出来。比如，可以把每个翻译结果看作是对单词翻译的拼装，这可以被形象地比作贯穿多个单词的一条路径，计算机所做的就是尽可能多地生成这样的路径。图\ref{fig:5-4}中蓝色和红色的折线就分别表示了两条不同的译文选择路径，区别在于``满意''和``对''的翻译候选是不一样的，蓝色折线选择的是``satisfy''和``to''，而红色折线是``satisfied''和``with''。换句话说，不同的译文对应不同的路径（即使词序不同也会对应不同的路径）。
+\parinterval 对于第一个问题，可以给计算机一个翻译词典，这样计算机可以发挥计算方面的优势，尽可能多地把翻译结果拼装出来。比如，可以把每个翻译结果看作是对单词翻译的拼装，这可以被形象地比作贯穿多个单词的一条路径，计算机所做的就是尽可能多地生成这样的路径。图\ref{fig:5-4}中蓝色和红色的折线就分别表示了两条不同的译文选择路径，区别在于“满意”和“对”的翻译候选是不一样的，蓝色折线选择的是“satisfy”和“to”，而红色折线是“satisfied”和“with”。换句话说，不同的译文对应不同的路径（即使词序不同也会对应不同的路径）。

-\parinterval 对于第二个问题，尽管机器能够找到很多译文选择路径，但它并不知道哪些路径是好的。说地再直白一些，简单地枚举路径实际上就是一个体力活，没有太多的智能。因此计算机还需要再聪明一些，运用它的能够``掌握''的知识判断翻译结果的好与坏。这一步是最具挑战的，当然也有很多思路。在统计机器翻译中，这个问题被定义为：设计一种统计模型，它可以给每个译文一个可能性，而这个可能性越高表明译文越接近人工翻译。
+\parinterval 对于第二个问题，尽管机器能够找到很多译文选择路径，但它并不知道哪些路径是好的。说得再直白一些，简单地枚举路径实际上就是一个体力活，没有太多的智能。因此计算机还需要再聪明一些，运用它能够“掌握”的知识判断翻译结果的好与坏。这一步是最具挑战的，当然也有很多思路。在统计机器翻译中，这个问题被定义为：设计一种统计模型，它可以给每个译文一个可能性，而这个可能性越高表明译文越接近人工翻译。

-\parinterval 如图\ref{fig:5-4}所示，每个单词翻译候选的右侧黑色框里的数字就是单词的翻译概率，使用这些单词的翻译概率，可以得到整句译文的概率（用符号P表示）。这样，就用概率化的模型描述了每个翻译候选的可能性。基于这些翻译候选的可能性，机器翻译系统可以对所有的翻译路径进行打分，比如，图\ref{fig:5-4}中第一条路径的分数为0.042，第二条是0.006，以此类推。最后，系统可以选择分数最高的路径作为源语言句子的最终译文。
+\parinterval 如图\ref{fig:5-4}所示，每个单词翻译候选的右侧黑色框里的数字就是单词的翻译概率，使用这些单词的翻译概率，可以得到整句译文的概率（用符号$\funp{P}$表示）。这样，就用概率化的模型描述了每个翻译候选的可能性。基于这些翻译候选的可能性，机器翻译系统可以对所有的翻译路径进行打分，比如，图\ref{fig:5-4}中第一条路径的分数为0.042，第二条是0.006，以此类推。最后，系统可以选择分数最高的路径作为源语言句子的最终译文。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -154,7 +154,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo
 %----------------------------------------------------------------------------------------

 \subsubsection{3. 人工翻译 vs. 机器翻译}
-\parinterval 人在翻译时的决策是非常确定并且快速的，但计算机处理这个问题时却充满了概率化的思想。当然它们也有类似的地方。首先，计算机使用统计模型的目的是把翻译知识变得可计算，并把这些``知识''储存在模型参数中，这个模型和人类大脑的作用是类似的\footnote{这里并不是要把统计模型等同于生物学或者认知科学上的人脑，这里是指它们处理翻译问题时发挥的作用类似。}；其次，计算机对统计模型进行训练相当于人类对知识的学习，二者都可以被看作是理解、加工知识的过程；再有，计算机使用学习到的模型对新句子进行翻译的过程相当于人运用知识的过程。在统计机器翻译中，模型学习的过程被称为{\small\sffamily\bfseries{训练}}\index{训练}（Training）\index{Training}，目的是从双语平行数据中自动学习翻译``知识''；而使用模型处理新句子的过程是一个典型的预测过程，也被称为{\small\sffamily\bfseries{解码}}\index{解码}（Decoding）\index{Decoding}或{\small\sffamily\bfseries{推断}}\index{推断}（Inference）\index{Inference}。图\ref{fig:5-4}的右侧标注在翻译过程中训练和解码的作用。最终，统计机器翻译的核心由三部分构成\ \dash \ 建模、训练和解码。本章后续内容会围绕这三个问题展开讨论。
+\parinterval 人在翻译时的决策是非常确定并且快速的，但计算机处理这个问题时却充满了概率化的思想。当然它们也有类似的地方。首先，计算机使用统计模型的目的是把翻译知识变得可计算，并把这些“知识”储存在模型参数中，这个模型和人类大脑的作用是类似的\footnote{这里并不是要把统计模型等同于生物学或者认知科学上的人脑，这里是指它们处理翻译问题时发挥的作用类似。}；其次，计算机对统计模型进行训练相当于人类对知识的学习，二者都可以被看作是理解、加工知识的过程；再有，计算机使用学习到的模型对新句子进行翻译的过程相当于人运用知识的过程。在统计机器翻译中，模型学习的过程被称为{\small\sffamily\bfseries{训练}}\index{训练}（Training）\index{Training}，目的是从双语平行数据中自动学习翻译“知识”；而使用模型处理新句子的过程是一个典型的预测过程，也被称为{\small\sffamily\bfseries{解码}}\index{解码}（Decoding）\index{Decoding}或{\small\sffamily\bfseries{推断}}\index{推断}（Inference）\index{Inference}。图\ref{fig:5-4}的右侧标注了训练和解码在翻译过程中的作用。最终，统计机器翻译的核心由三部分构成\ \dash \ 建模、训练和解码。本章后续内容会围绕这三个问题展开讨论。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -196,11 +196,11 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo

 \subsubsection{1. 什么是单词翻译概率？}

-\parinterval 单词翻译概率描述的是一个源语言单词与目标语言译文构成正确翻译的可能性，这个概率越高表明单词翻译越可靠。使用单词翻译概率，可以帮助机器翻译系统解决翻译时的``择词''问题，即选择什么样的目标语译文是合适的。当人在翻译某个单词时，可以利用积累的知识，快速得到它的高质量候选译文。
+\parinterval 单词翻译概率描述的是一个源语言单词与目标语言译文构成正确翻译的可能性，这个概率越高表明单词翻译越可靠。使用单词翻译概率，可以帮助机器翻译系统解决翻译时的“择词”问题，即选择什么样的目标语译文是合适的。当人在翻译某个单词时，可以利用积累的知识，快速得到它的高质量候选译文。

-\parinterval 以汉译英为例，当翻译``我''这个单词时，可能直接会想到用``I''、``me''或``I'm''作为它的译文，而几乎不会选择``you''、``satisfied''等含义相差太远的译文。这是为什么呢？如果从统计学的角度来看，无论是何种语料，包括教材、新闻、小说等，绝大部分情况下``我''都翻译成了``I''、``me''等，几乎不会看到我被翻译成``you''或``satisfied''的情况。可以说``我''翻译成``I''、``me''等属于高频事件，而翻译成``you''、``satisfied''等属于低频或小概率事件。因此人在翻译时也是选择在统计意义上概率更大的译文，这也间接反映出统计模型可以在一定程度上描述人的翻译习惯和模式。
+\parinterval 以汉译英为例，当翻译“我”这个单词时，可能直接会想到用“I”、“me”或“I'm”作为它的译文，而几乎不会选择“you”、“satisfied”等含义相差太远的译文。这是为什么呢？如果从统计学的角度来看，无论是何种语料，包括教材、新闻、小说等，绝大部分情况下“我”都翻译成了“I”、“me”等，几乎不会看到“我”被翻译成“you”或“satisfied”的情况。可以说“我”翻译成“I”、“me”等属于高频事件，而翻译成“you”、“satisfied”等属于低频或小概率事件。因此人在翻译时也是选择在统计意义上概率更大的译文，这也间接反映出统计模型可以在一定程度上描述人的翻译习惯和模式。

-\parinterval 表\ref{tab:5-1}展示了汉语到英语的单词翻译实例及相应的翻译概率。可以看到，``我''翻译成``I''的概率最高，为0.5。这是符合人类对翻译的认知的。此外，这种概率化的模型避免了非0即1的判断，所有的译文都是可能的，只是概率不同。这也使得统计模型可以覆盖更多的翻译现象，甚至捕捉到一些人所忽略的情况。\\ \\ \\
+\parinterval 表\ref{tab:5-1}展示了汉语到英语的单词翻译实例及相应的翻译概率。可以看到，“我”翻译成“I”的概率最高，为0.5。这是符合人类对翻译的认知的。此外，这种概率化的模型避免了非0即1的判断，所有的译文都是可能的，只是概率不同。这也使得统计模型可以覆盖更多的翻译现象，甚至捕捉到一些人所忽略的情况。\\ \\ \\

 %----------------------------------------------
 \begin{table}[htp]
@@ -227,7 +227,7 @@ IBM模型由Peter F. Brown等人于上世纪九十年代初提出\upcite{DBLP:jo

 \subsubsection{2. 如何从双语平行数据中进行学习？}

-\parinterval 假设有一定数量的双语对照的平行数据，是否可以从中自动获得两种语言单词之间的翻译概率呢？回忆一下{\chaptertwo}中的掷骰子游戏，其中使用了相对频次估计方法来自动获得骰子不同面出现概率的估计值。其中，重复投掷骰子很多次，然后统计``1''到``6''各面出现的次数，再除以投掷的总次数，最后得到它们出现的概率的极大似然估计。这里，可以使用类似的方式计算单词翻译概率。但是，现在有的是句子一级对齐的数据，并不知道两种语言之间单词的对应关系。也就是，要从句子级对齐的平行数据中学习单词之间对齐的概率。这里，需要使用稍微``复杂''一些的模型来描述这个问题。
+\parinterval 假设有一定数量的双语对照的平行数据，是否可以从中自动获得两种语言单词之间的翻译概率呢？回忆一下{\chaptertwo}中的掷骰子游戏，其中使用了相对频次估计方法来自动获得骰子不同面出现概率的估计值。其中，重复投掷骰子很多次，然后统计“1”到“6”各面出现的次数，再除以投掷的总次数，最后得到它们出现的概率的极大似然估计。这里，可以使用类似的方式计算单词翻译概率。但是，现在有的是句子一级对齐的数据，并不知道两种语言之间单词的对应关系。也就是，要从句子级对齐的平行数据中学习单词之间对齐的概率。这里，需要使用稍微“复杂”一些的模型来描述这个问题。

 令$X$和$Y$分别表示源语言和目标语言的词汇表。对于任意源语言单词$x \in X$，所有的目标语单词$y \in Y$都可能是它的译文。给定一个互译的句对$(\seq{s},\seq{t})$，可以把$\funp{P}(x \leftrightarrow y; \seq{s}, \seq{t})$定义为：在观测到$(\seq{s},\seq{t})$的前提下$x$和$y$互译的概率。其中$x$是属于句子$\seq{s}$中的词，而$y$是属于句子$\seq{t}$ 中的词。$\funp{P}(x \leftrightarrow y; \seq{s},\seq{t})$的计算公式描述如下：
 \vspace{-0.5em}
@@ -250,7 +250,7 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati
 \label{eg:5-1}
 \end{example}

-\parinterval 假设，$x=\textrm{``翻译''}$，$y=\textrm{``translation''}$，现在要计算$x$和$y$共现的总次数。``翻译''和``translation''分别在$\seq{s}$和$\seq{t}$中出现了2次，因此$c(\textrm{``翻译''},\textrm{``translation''};\seq{s},\seq{t})$ 等于4。而对于$\sum_{x',y'} c(x',y';\seq{s},$ $\seq{t})$，因为$x'$和$y'$分别表示的是$\seq{s}$和$\seq{t}$中的任意词，所以$\sum_{x',y'} c(x',y';\seq{s},\seq{t})$表示所有单词对的数量\ \dash \ 即$\seq{s}$的词数乘以$\seq{t}$的词数。最后，``翻译''和``translation''的单词翻译概率为：
+\parinterval 假设，$x=\textrm{“翻译”}$，$y=\textrm{“translation”}$，现在要计算$x$和$y$共现的总次数。“翻译”和“translation”分别在$\seq{s}$和$\seq{t}$中出现了2次，因此$c(\textrm{“翻译”},\textrm{“translation”};\seq{s},\seq{t})$ 等于4。而对于$\sum_{x',y'} c(x',y';\seq{s},$ $\seq{t})$，因为$x'$和$y'$分别表示的是$\seq{s}$和$\seq{t}$中的任意词，所以$\sum_{x',y'} c(x',y';\seq{s},\seq{t})$表示所有单词对的数量\ \dash \ 即$\seq{s}$的词数乘以$\seq{t}$的词数。最后，“翻译”和“translation”的单词翻译概率为：
 \begin{eqnarray}
 \funp{P}(\text{翻译},\text{translation}; \seq{s},\seq{t})  & = & \frac{c(\textrm{翻译},\textrm{translation};\seq{s},\seq{t})}{\sum_{x',y'} c(x',y';\seq{s},\seq{t})} \nonumber \\
                                                                                                         & =  & \frac{4}{|\seq{s}|\times |\seq{t}|} \nonumber \\
@@ -258,14 +258,14 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati
 \label{eq:5-2}
 \end{eqnarray}

-\noindent 这里运算$|\cdot|$表示句子长度。类似的，可以得到``机器''和``translation''、``机器''和``look''的单词翻译概率：
+\noindent 这里运算$|\cdot|$表示句子长度。类似的，可以得到“机器”和“translation”、“机器”和“look”的单词翻译概率：
 \begin{eqnarray}
 \funp{P}(\text{机器},\text{translation}; \seq{s},\seq{t})  & = & \frac{2}{121} \\
 \funp{P}(\text{机器},\text{look}; \seq{s},\seq{t})  & =  & \frac{0}{121}
 \label{eq:5-3}
 \end{eqnarray}

-\noindent 注意，由于``look''没有出现在数据中，因此$\funp{P}(\text{机器},\text{look}; \seq{s},\seq{t})=0$。这时，可以使用{\chaptertwo}介绍的平滑算法赋予它一个非零的值，以保证在后续的步骤中整个翻译模型不会出现零概率的情况。
+\noindent 注意，由于“look”没有出现在数据中，因此$\funp{P}(\text{机器},\text{look}; \seq{s},\seq{t})=0$。这时，可以使用{\chaptertwo}介绍的平滑算法赋予它一个非零的值，以保证在后续的步骤中整个翻译模型不会出现零概率的情况。
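The relative-frequency estimate in Eq.~\eqref{eq:5-1} can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sentences are already tokenized into word lists, the function name `word_translation_prob` is hypothetical, and the toy sentence pair below is invented for the sketch (it is not the sentence pair of Example~\ref{eg:5-1}). No smoothing is applied, so unseen pairs get probability zero.

```python
from collections import Counter

def word_translation_prob(x, y, s, t):
    """Relative-frequency estimate of P(x <-> y; s, t) for one sentence pair."""
    src_counts = Counter(s)
    tgt_counts = Counter(t)
    # c(x, y; s, t): each occurrence of x in s co-occurs with each occurrence of y in t
    cooccur = src_counts[x] * tgt_counts[y]
    # Sum of c(x', y'; s, t) over all word pairs equals |s| * |t|
    total = len(s) * len(t)
    return cooccur / total

# Hypothetical toy sentence pair (already tokenized), |s| = 9 and |t| = 8
s = "机器 翻译 就 是 用 计算机 来 进行 翻译".split()
t = "machine translation is translation done by a computer".split()

p_translation = word_translation_prob("翻译", "translation", s, t)  # 4 / 72
p_machine = word_translation_prob("机器", "translation", s, t)      # 2 / 72
p_unseen = word_translation_prob("机器", "look", s, t)              # 0 / 72
```

As in the text, the unseen pair receives zero probability, which is the case a smoothing method would have to repair.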

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -279,7 +279,7 @@ $\seq{t}$ = machine\; \underline{translation}\; is\; a\; process\; of\; generati
 \label{eq:5-4}
 \end{eqnarray}

-\parinterval 与公式\ref{eq:5-1}相比，公式\ref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{k=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显的体现出来。
+\parinterval 与公式\eqref{eq:5-1}相比，公式\eqref{eq:5-4}的分子、分母都多了一项累加符号$\sum_{k=1}^{K} \cdot$，它表示遍历语料库中所有的句对。换句话说，当计算词的共现次数时，需要对每个句对上的计数结果进行累加。从统计学习的角度，使用更大规模的数据进行参数估计可以提高结果的可靠性。计算单词的翻译概率也是一样，在小规模的数据上看，很多翻译现象的特征并不突出，但是当使用的数据量增加到一定程度，翻译的规律会很明显地体现出来。

 \parinterval 举个例子，实例\ref{eg:5-2}展示了一个由两个句对构成的平行语料库。

@@ -296,7 +296,7 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \label{eg:5-2}
 \end{example}

-\noindent 其中，$\seq{s}^{[1]}$和$\seq{s}^{[2]}$分别表示第一个句对和第二个句对的源语言句子，$\seq{t}^{[1]}$和$\seq{t}^{[2]}$表示对应的目标语言句子。于是，``翻译''和``translation'' 的翻译概率为
+\noindent 其中，$\seq{s}^{[1]}$和$\seq{s}^{[2]}$分别表示第一个句对和第二个句对的源语言句子，$\seq{t}^{[1]}$和$\seq{t}^{[2]}$表示对应的目标语言句子。于是，“翻译”和“translation” 的翻译概率为
 {\small
 \begin{eqnarray}
 {\funp{P}(\textrm{翻译},\textrm{translation})} & = & {\frac{c(\textrm{翻译},\textrm{translation};\seq{s}^{[1]},\seq{t}^{[1]})+c(\textrm{翻译},\textrm{translation};\seq{s}^{[2]},\seq{t}^{[2]})}{\sum_{x',y'} c(x',y';\seq{s}^{[1]},\seq{t}^{[1]}) + \sum_{x',y'} c(x',y';\seq{s}^{[2]},\seq{t}^{[2]})}} \nonumber \\
@@ -306,7 +306,7 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \label{eq:5-5}
 \end{eqnarray}
 }
-\parinterval 公式\ref{eq:5-5}所展示的计算过程很简单，分子是两个句对中``翻译''和``translation''共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。
+\parinterval 公式\eqref{eq:5-5}所展示的计算过程很简单，分子是两个句对中“翻译”和“translation”共现次数的累计，分母是两个句对的源语言单词和目标语言单词的组合数的累加。显然，这个方法也很容易推广到处理更多句子的情况。
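The corpus-level estimate of Eq.~\eqref{eq:5-4} only adds an outer loop over sentence pairs. The sketch below is a minimal illustration, assuming a corpus given as a list of tokenized sentence pairs; the function name `corpus_translation_prob` and both toy sentence pairs are hypothetical and do not reproduce Example~\ref{eg:5-2}.

```python
from collections import Counter

def corpus_translation_prob(x, y, corpus):
    """Accumulate co-occurrence counts over all K sentence pairs, then divide."""
    numerator = 0
    denominator = 0
    for s, t in corpus:
        # c(x, y; s^[k], t^[k]) for the k-th sentence pair
        numerator += Counter(s)[x] * Counter(t)[y]
        # Number of all word pairs in the k-th sentence pair
        denominator += len(s) * len(t)
    return numerator / denominator

# Hypothetical two-pair corpus (already tokenized)
corpus = [
    ("机器 翻译 就 是 用 计算机 来 进行 翻译".split(),
     "machine translation is translation done by a computer".split()),
    ("那么 什么 是 人工 翻译 ？".split(),
     "so what is human translation ?".split()),
]

p = corpus_translation_prob("翻译", "translation", corpus)  # (4 + 1) / (72 + 36)
```

Because counts and normalizers are summed separately before the division, adding more sentence pairs only means more iterations of the same loop.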

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -323,14 +323,14 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?

 \subsubsection{1. 基础模型}

-\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\ref{eq:5-4}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
+\parinterval 计算句子级翻译概率并不简单。因为自然语言非常灵活，任何数据无法覆盖足够多的句子，因此，无法像公式\eqref{eq:5-4}一样直接用简单计数的方式对句子的翻译概率进行估计。这里，采用一个退而求其次的方法：找到一个函数$g(\seq{s},\seq{t})\ge 0$来模拟翻译概率对译文可能性进行估计。可以定义一个新的函数$g(\seq{s},\seq{t})$，令其满足：给定$\seq{s}$，翻译结果$\seq{t}$出现的可能性越大，$g(\seq{s},\seq{t})$的值越大；$\seq{t}$出现的可能性越小，$g(\seq{s},\seq{t})$的值越小。换句话说，$g(\seq{s},\seq{t})$和翻译概率$\funp{P}(\seq{t}|\seq{s})$呈正相关。如果存在这样的函数$g(\seq{s},\seq{t}
 )$，可以利用$g(\seq{s},\seq{t})$近似表示$\funp{P}(\seq{t}|\seq{s})$，如下：
 \begin{eqnarray}
 \funp{P}(\seq{t}|\seq{s})  \equiv  \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')}
 \label{eq:5-6}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-6}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。
+\parinterval 公式\eqref{eq:5-6}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}'}g(\seq{s},\seq{t}')} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。
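The effect of the normalization in Eq.~\eqref{eq:5-6} can be checked on a small, finite candidate set. This is a sketch under a strong simplifying assumption: the real denominator sums over all possible translations, whereas here only three hypothetical candidates with invented scores are considered. The function name `normalize` is likewise hypothetical.

```python
def normalize(scores):
    """Divide each non-negative score g(s, t') by the sum over all candidates."""
    z = sum(scores.values())
    return {t: g / z for t, g in scores.items()}

# Hypothetical g(s, t) values for three candidate translations of one source sentence
g = {
    "I am satisfied with you": 0.006,
    "I with you am satisfied": 0.002,
    "I satisfy to you": 0.001,
}

p = normalize(g)
best_by_g = max(g, key=g.get)  # candidate with the largest raw score
best_by_p = max(p, key=p.get)  # candidate with the largest normalized score
```

The normalized values lie in $[0,1]$ and sum to one, and the best candidate is the same before and after normalization, since dividing by a positive constant does not change the ranking.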

 \parinterval 上述过程初步建立了句子级翻译模型，并没有直接求$\funp{P}(\seq{t}|\seq{s})$，而是把问题转化为对$g(\cdot)$的设计和计算上。但是，面临着两个新的问题：

@@ -338,17 +338,17 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \vspace{0.5em}
 \item 如何定义函数$g(\seq{s},\seq{t})$？即，在知道单词翻译概率的前提下，如何计算$g(\seq{s},\seq{t})$；
 \vspace{0.5em}
-\item 公式\ref{eq:5-6}中分母$\sum_{seq{t'}}g(\seq{s},{\seq{t}'})$需要累加所有翻译结果的$g(\seq{s},{\seq{t}'})$，但枚举所有${\seq{t}'}$是不现实的。
+\item 公式\eqref{eq:5-6}中分母$\sum_{\seq{t}'}g(\seq{s},{\seq{t}'})$需要累加所有翻译结果的$g(\seq{s},{\seq{t}'})$，但枚举所有${\seq{t}'}$是不现实的。
 \vspace{0.5em}
 \end{itemize}

 \parinterval  当然，这里最核心的问题还是函数$g(\seq{s},\seq{t})$的定义。而第二个问题其实不需要解决，因为机器翻译只关注于可能性最大的翻译结果，即$g(\seq{s},\seq{t})$的计算结果最大时对应的译文。这个问题会在后面进行讨论。

-\parinterval 回到设计$g(\seq{s},\seq{t})$的问题上。这里，采用``大题小作''的方法，这个技巧在{\chaptertwo}已经进行了充分的介绍。具体来说，直接建模句子之间的对应比较困难，但可以利用单词之间的对应来描述句子之间的对应关系。这就用到了上一小节所介绍的单词翻译概率。
+\parinterval 回到设计$g(\seq{s},\seq{t})$的问题上。这里，采用“大题小作”的方法，这个技巧在{\chaptertwo}已经进行了充分的介绍。具体来说，直接建模句子之间的对应比较困难，但可以利用单词之间的对应来描述句子之间的对应关系。这就用到了上一小节所介绍的单词翻译概率。

 \parinterval 首先引入一个非常重要的概念\ \dash \ {\small\sffamily\bfseries{词对齐}}\index{词对齐}（Word Alignment）\index{Word Alignment}，它是统计机器翻译中最核心的概念之一。词对齐描述了平行句对中单词之间的对应关系，它体现了一种观点：本质上句子之间的对应是由单词之间的对应表示的。当然，这个观点在神经机器翻译或者其他模型中可能会有不同的理解，但是翻译句子的过程中考虑词级的对应关系是符合人类对语言的认知的。

-\parinterval 图\ref{fig:5-7} 展示了一个句对$\seq{s}$和$\seq{t}$，单词的右下标数字表示了该词在句中的位置，而虚线表示的是句子$\seq{s}$和$\seq{t}$中的词对齐关系。比如，``满意''的右下标数字5表示在句子$\seq{s}$中处于第5个位置，``satisfied''的右下标数字3表示在句子$\seq{t}$中处于第3个位置，``满意''和``satisfied''之间的虚线表示两个单词之间是对齐的。为方便描述，用二元组$(j,i)$ 来描述词对齐，它表示源语言句子的第$j$个单词对应目标语言句子的第$i$个单词，即单词$s_j$和$t_i$对应。通常，也会把$(j,i)$称作一条{\small\sffamily\bfseries{词对齐连接}}\index{词对齐连接}（Word Alignment Link\index{Word Alignment Link}）。图\ref{fig:5-7} 中共有5 条虚线，表示有5组单词之间的词对齐连接。可以把这些词对齐连接构成的集合作为词对齐的一种表示，记为$A$，即$A={\{(1,1),(2,4),(3,5),(4,2)(5,3)}\}$。
+\parinterval 图\ref{fig:5-7} 展示了一个句对$\seq{s}$和$\seq{t}$，单词的右下标数字表示了该词在句中的位置，而虚线表示的是句子$\seq{s}$和$\seq{t}$中的词对齐关系。比如，“满意”的右下标数字5表示在句子$\seq{s}$中处于第5个位置，“satisfied”的右下标数字3表示在句子$\seq{t}$中处于第3个位置，“满意”和“satisfied”之间的虚线表示两个单词之间是对齐的。为方便描述，用二元组$(j,i)$ 来描述词对齐，它表示源语言句子的第$j$个单词对应目标语言句子的第$i$个单词，即单词$s_j$和$t_i$对应。通常，也会把$(j,i)$称作一条{\small\sffamily\bfseries{词对齐连接}}\index{词对齐连接}（Word Alignment Link\index{Word Alignment Link}）。图\ref{fig:5-7} 中共有5 条虚线，表示有5组单词之间的词对齐连接。可以把这些词对齐连接构成的集合作为词对齐的一种表示，记为$A$，即$A={\{(1,1),(2,4),(3,5),(4,2),(5,3)}\}$。
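The set representation of a word alignment maps directly onto a set of integer pairs. The sketch below uses the sentence pair and the alignment $A$ described above; the helper name `aligned_pairs` is hypothetical, and positions are kept 1-based to match the notation $(j,i)$.

```python
s = ["我", "对", "你", "感到", "满意"]          # source sentence, positions 1..5
t = ["I", "am", "satisfied", "with", "you"]    # target sentence, positions 1..5

# Each link (j, i) says: source word s_j is aligned to target word t_i
A = {(1, 1), (2, 4), (3, 5), (4, 2), (5, 3)}

def aligned_pairs(s, t, A):
    """List the aligned word pairs, ordered by source position."""
    # Subtract 1 because Python lists are 0-based while (j, i) is 1-based
    return [(s[j - 1], t[i - 1]) for j, i in sorted(A)]

pairs = aligned_pairs(s, t, A)
```

Here `pairs` lists one word pair per alignment link, for example the pair of the fifth source word and the third target word.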

 %----------------------------------------------
 \begin{figure}[htp]
@@ -366,7 +366,7 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)
 \label{eq:5-7}
 \end{eqnarray}

-\noindent 其中$g(\seq{s},\seq{t})$被定义为句子$\seq{s}$中的单词和句子$\seq{t}$中的单词的翻译概率的乘积，并且这两个单词之间必须有词对齐连接。$\funp{P}(s_j,t_i)$表示具有词对齐连接的源语言单词$s_j$和目标语言单词$t_i$的单词翻译概率。以图\ref{fig:5-7}中的句对为例，其中``我''与``I''、``对''与``with''、``你'' 与``you''等相互对应，可以把它们的翻译概率相乘得到$g(\seq{s},\seq{t})$的计算结果，如下：
+\noindent 其中$g(\seq{s},\seq{t})$被定义为句子$\seq{s}$中的单词和句子$\seq{t}$中的单词的翻译概率的乘积，并且这两个单词之间必须有词对齐连接。$\funp{P}(s_j,t_i)$表示具有词对齐连接的源语言单词$s_j$和目标语言单词$t_i$的单词翻译概率。以图\ref{fig:5-7}中的句对为例，其中“我”与“I”、“对”与“with”、“你” 与“you”等相互对应，可以把它们的翻译概率相乘得到$g(\seq{s},\seq{t})$的计算结果，如下：
 \begin{eqnarray}
 {g(\seq{s},\seq{t})}&= &  \funp{P}(\textrm{我,I}) \times \funp{P}(\textrm{对,with}) \times \funp{P}(\textrm{你,you}) \times \nonumber \\
          &    & \funp{P}(\textrm{感到, am}) \times \funp{P}(\textrm{满意,satisfied})
@@ -381,7 +381,7 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)

 \subsubsection{2. 生成流畅的译文}

-\parinterval 公式\ref{eq:5-7}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子``我 对 你 感到 满意''有两个翻译结果，第一个翻译结果是``I am satisfied with you''，第二个是``I with you am satisfied''。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了``satisfied''作为源语单词``满意''的译文，但是在第一个翻译结果中``satisfied''处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\ref{eq:5-7}计算得到的函数$g(\cdot)$的值却是一样的。
+\parinterval 公式\eqref{eq:5-7}定义的$g(\seq{s},\seq{t})$存在的问题是没有考虑词序信息。这里用一个简单的例子说明这个问题。如图\ref{fig:5-8}所示，源语言句子“我 对 你 感到 满意”有两个翻译结果，第一个翻译结果是“I am satisfied with you”，第二个是“I with you am satisfied”。虽然这两个译文包含的目标语单词是一样的，但词序存在很大差异。比如，它们都选择了“satisfied”作为源语单词“满意”的译文，但是在第一个翻译结果中“satisfied”处于第3个位置，而第二个结果中处于最后的位置。显然第一个翻译结果更符合英语的表达习惯，翻译的质量更高。遗憾的是，对于有明显差异的两个译文，公式\eqref{eq:5-7}计算得到的函数$g(\cdot)$的值却是一样的。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -403,13 +403,13 @@ g(\seq{s},\seq{t}) = \prod_{(j,i)\in \widehat{A}}\funp{P}(s_j,t_i)

 \noindent  其中，$\seq{t}=t_1...t_l$表示由$l$个单词组成的句子，$\funp{P}_{\textrm{lm}}(\seq{t})$表示语言模型给句子$\seq{t}$的打分。具体而言，$\funp{P}_{\textrm{lm}}(\seq{t})$被定义为$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$的连乘\footnote{为了确保数学表达的准确性，本书中定义$\funp{P}(t_1|t_0) \equiv \funp{P}(t_1)$}，其中$\funp{P}(t_i|t_{i-1})(i=1,2,...,l)$表示前面一个单词为$t_{i-1}$时，当前单词为$t_i$的概率。语言模型的训练方法可以参看{\chaptertwo}相关内容。

-\parinterval 回到建模问题上来。既然语言模型可以帮助系统度量每个译文的流畅度，那么可以使用它对翻译进行打分。一种简单的方法是把语言模型$\funp{P}_{\textrm{lm}}{(\seq{t})}$ 和公式\ref{eq:5-7}中的$g(\seq{s},\seq{t})$相乘，这样就得到了一个新的$g(\seq{s},\seq{t})$，它同时考虑了翻译准确性（$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$）和流畅度（$\funp{P}_{\textrm{lm}}(\seq{t})$）:
+\parinterval 回到建模问题上来。既然语言模型可以帮助系统度量每个译文的流畅度，那么可以使用它对翻译进行打分。一种简单的方法是把语言模型$\funp{P}_{\textrm{lm}}{(\seq{t})}$ 和公式\eqref{eq:5-7}中的$g(\seq{s},\seq{t})$相乘，这样就得到了一个新的$g(\seq{s},\seq{t})$，它同时考虑了翻译准确性（$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$）和流畅度（$\funp{P}_{\textrm{lm}}(\seq{t})$）:
 \begin{eqnarray}
 g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times  \funp{P}_{\textrm{lm}}(\seq{t})
 \label{eq:5-10}
 \end{eqnarray}

-\parinterval 如图\ref{fig:5-9}所示，语言模型$\funp{P}_{\textrm{lm}}(\seq{t})$分别给$\seq{t}^{'}$和$\seq{t}^{''}$赋予0.0107和0.0009的概率，这表明句子$\seq{t}^{'}$更符合英文的表达，这与期望是相吻合的。它们再分别乘以$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j},t_i)$的值，就得到公式\ref{eq:5-10}定义的函数$g(\cdot)$的值。显然句子$\seq{t}^{'}$的分数更高。至此，完成了对函数$g(\seq{s},\seq{t})$的一个简单定义，把它带入公式\ref{eq:5-6}就得到了同时考虑准确性和流畅性的句子级统计翻译模型。
+\parinterval 如图\ref{fig:5-9}所示，语言模型$\funp{P}_{\textrm{lm}}(\seq{t})$分别给$\seq{t}^{'}$和$\seq{t}^{''}$赋予0.0107和0.0009的概率，这表明句子$\seq{t}^{'}$更符合英文的表达，这与期望是相吻合的。它们再分别乘以$\prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)}$的值，就得到公式\eqref{eq:5-10}定义的函数$g(\cdot)$的值。显然句子$\seq{t}^{'}$的分数更高。至此，完成了对函数$g(\seq{s},\seq{t})$的一个简单定义，把它代入公式\eqref{eq:5-6}就得到了同时考虑准确性和流畅性的句子级统计翻译模型。
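The interplay of the two factors in Eq.~\eqref{eq:5-10} can be sketched as follows. All probability values are hypothetical and do not reproduce the numbers in Fig.~\ref{fig:5-9}; the function names, the start marker `"<s>"` (standing in for the empty history $t_0$) and the fallback probability for unseen bigrams are assumptions made for the sketch.

```python
import math

def translation_score(s, t, A, trans_prob):
    """Product of word translation probabilities over the alignment links."""
    return math.prod(trans_prob[(s[j - 1], t[i - 1])] for j, i in A)

def bigram_lm(t, bigram_prob, unseen=0.01):
    """Bigram language-model score of target sentence t."""
    score, prev = 1.0, "<s>"
    for w in t:
        score *= bigram_prob.get((prev, w), unseen)  # fallback for unseen bigrams
        prev = w
    return score

def g(s, t, A, trans_prob, bigram_prob):
    return translation_score(s, t, A, trans_prob) * bigram_lm(t, bigram_prob)

s = ["我", "对", "你", "感到", "满意"]
# Hypothetical model parameters
trans_prob = {("我", "I"): 0.5, ("对", "with"): 0.3, ("你", "you"): 0.6,
              ("感到", "am"): 0.4, ("满意", "satisfied"): 0.5}
bigram_prob = {("<s>", "I"): 0.4, ("I", "am"): 0.5, ("am", "satisfied"): 0.3,
               ("satisfied", "with"): 0.6, ("with", "you"): 0.4}

t1 = ["I", "am", "satisfied", "with", "you"]
A1 = {(1, 1), (2, 4), (3, 5), (4, 2), (5, 3)}
t2 = ["I", "with", "you", "am", "satisfied"]
A2 = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)}

ts1 = translation_score(s, t1, A1, trans_prob)
ts2 = translation_score(s, t2, A2, trans_prob)
g1 = g(s, t1, A1, trans_prob, bigram_prob)
g2 = g(s, t2, A2, trans_prob, bigram_prob)
```

Both word orders receive the same translation score because they use the same word pairs; only the language-model factor separates them, so `g1` is larger than `g2`.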

 %----------------------------------------------
 \begin{figure}[htp]
@@ -434,19 +434,19 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-11}
 \end{eqnarray}

-\noindent  其中$\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$表示找到使$\funp{P}(\seq{t}|\seq{s})$达到最大时的译文$\seq{t}$。结合上一小节中关于$\funp{P}(\seq{t}|\seq{s})$的定义，把公式\ref{eq:5-6}带入公式\ref{eq:5-11}得到：
+\noindent  其中$\argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$表示找到使$\funp{P}(\seq{t}|\seq{s})$达到最大时的译文$\seq{t}$。结合上一小节中关于$\funp{P}(\seq{t}|\seq{s})$的定义，把公式\eqref{eq:5-6}代入公式\eqref{eq:5-11}得到：
 \begin{eqnarray}
 \widehat{\seq{t}}=\argmax_{\seq{t}}\frac{g(\seq{s},\seq{t})}{\sum_{\seq{t}^{'}}g(\seq{s},\seq{t}^{'})}
 \label{eq:5-12}
 \end{eqnarray}

-\parinterval 在公式\ref{eq:5-12}中，可以发现${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$是一个关于$\seq{s}$的函数，当给定源语句$\seq{s}$时，它是一个常数，而且$g(\cdot) \ge 0$，因此${\sum_{\seq{t}^{'}g(\seq{s},\seq{t}^{'})}}$不影响对$\widehat{\seq{t}}$的求解，也不需要计算。基于此，公式\ref{eq:5-12}可以被化简为：
+\parinterval 在公式\eqref{eq:5-12}中，可以发现${\sum_{\seq{t}^{'}}g(\seq{s},\seq{t}^{'})}$是一个关于$\seq{s}$的函数，当给定源语句$\seq{s}$时，它是一个常数，而且$g(\cdot) \ge 0$，因此${\sum_{\seq{t}^{'}}g(\seq{s},\seq{t}^{'})}$不影响对$\widehat{\seq{t}}$的求解，也不需要计算。基于此，公式\eqref{eq:5-12}可以被化简为：
 \begin{eqnarray}
 \widehat{\seq{t}}=\argmax_{\seq{t}}g(\seq{s},\seq{t})
 \label{eq:5-13}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-13}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。
+\parinterval 公式\eqref{eq:5-13}定义了解码的目标，剩下的问题是实现$\argmax$，以快速准确地找到最佳译文$\widehat{\seq{t}}$。但是，简单遍历所有可能的译文并计算$g(\seq{s},\seq{t})$ 的值是不可行的，因为所有潜在译文构成的搜索空间是十分巨大的。为了理解机器翻译的搜索空间的规模，假设源语言句子$\seq{s}$有$m$个词，每个词有$n$个可能的翻译候选。如果从左到右一步步翻译每个源语言单词，那么简单的顺序翻译会有$n^m$种组合。如果进一步考虑目标语单词的任意调序，每一种对翻译候选进行选择的结果又会对应$m!$种不同的排序。因此，源语句子$\seq{s}$至少有$n^m \cdot m!$ 个不同的译文。

 \parinterval $n^{m}\cdot m!$是什么样的概念呢？如表\ref{tab:5-2}所示，当$m$和$n$分别为2和10时，译文只有200个，不算多。但是当$m$和$n$分别为20和10时，即源语言句子的长度为20，每个词有10个候选译文，系统会面对$2.4329 \times 10^{38}$个不同的译文，这几乎是不可计算的。
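The growth of $n^{m}\cdot m!$ is easy to reproduce. The short sketch below assumes nothing beyond the formula itself; the function name `search_space_size` is hypothetical.

```python
from math import factorial

def search_space_size(m, n):
    """Lower bound on the number of translations: n^m lexical choices times m! orderings."""
    return n ** m * factorial(m)

small = search_space_size(2, 10)    # 10^2 * 2! = 200
large = search_space_size(20, 10)   # 10^20 * 20!
large_sci = f"{large:.4e}"          # scientific notation with four decimals
```

Python integers have arbitrary precision, so the exact count can be computed even though enumerating that many candidates is out of the question.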

@@ -468,7 +468,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 }\end{table}
 %----------------------------------------------

-\parinterval 已经有工作证明机器翻译问题是NP难的\upcite{knight1999decoding}。对于如此巨大的搜索空间，需要一种十分高效的搜索算法才能实现机器翻译的解码。在{\chaptertwo}已经介绍一些常用的搜索方法。这里使用一种贪婪的搜索方法实现机器翻译的解码。它把解码分成若干步骤，每步只翻译一个单词，并保留当前`` 最好''的结果，直至所有源语言单词都被翻译完毕。
+\parinterval 已经有工作证明机器翻译问题是NP难的\upcite{knight1999decoding}。对于如此巨大的搜索空间，需要一种十分高效的搜索算法才能实现机器翻译的解码。在{\chaptertwo}已经介绍了一些常用的搜索方法。这里使用一种贪婪的搜索方法实现机器翻译的解码。它把解码分成若干步骤，每步只翻译一个单词，并保留当前“最好”的结果，直至所有源语言单词都被翻译完毕。
 \vspace{0.3em}
 %----------------------------------------------
 \begin{figure}[htp]
@@ -497,7 +497,6 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
    \centering
 \subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-1}}
 \subfigure{\input{./Chapter5/Figures/figure-greedy-mt-decoding-process-3}}
-%\setlength{\belowcaptionskip}{2.0em}
    \caption{贪婪的机器翻译解码过程实例}
    \label{fig:5-11}
 \end{figure}
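The greedy procedure illustrated above (translate one source word per step, keep only the current best partial translation, and mark that source word as used) can be sketched in Python. This is a minimal sketch of the idea, not the book's exact pseudocode: the function names, the start marker `"<s>"`, the fallback probability for unseen bigrams and all model parameters are hypothetical.

```python
def lm_score(words, bigram_prob, unseen=0.01):
    """Bigram language-model score; "<s>" stands in for the empty history."""
    score, prev = 1.0, "<s>"
    for w in words:
        score *= bigram_prob.get((prev, w), unseen)
        prev = w
    return score

def greedy_decode(source, candidates, trans_prob, bigram_prob):
    used = [False] * len(source)   # used[j]: source word j is already translated
    best, best_tp = [], 1.0        # best partial translation and its translation-probability product
    for _ in source:               # each step appends exactly one target word
        step_best = None
        for j, s_word in enumerate(source):
            if used[j]:
                continue
            for t_word in candidates[s_word]:
                hyp = best + [t_word]                      # extend the current best
                tp = best_tp * trans_prob[(s_word, t_word)]
                score = tp * lm_score(hyp, bigram_prob)    # g = translation score * LM score
                if step_best is None or score > step_best[0]:
                    step_best = (score, j, hyp, tp)
        _, j, best, best_tp = step_best                    # keep only the best extension
        used[j] = True
    return best

source = ["我", "对", "你", "感到", "满意"]
# Hypothetical translation candidates and probabilities
candidates = {"我": ["I"], "对": ["with", "to"], "你": ["you"],
              "感到": ["am"], "满意": ["satisfied", "satisfy"]}
trans_prob = {("我", "I"): 0.5, ("对", "with"): 0.3, ("对", "to"): 0.4,
              ("你", "you"): 0.6, ("感到", "am"): 0.4,
              ("满意", "satisfied"): 0.5, ("满意", "satisfy"): 0.3}
bigram_prob = {("<s>", "I"): 0.4, ("I", "am"): 0.5, ("am", "satisfied"): 0.3,
               ("satisfied", "with"): 0.6, ("with", "you"): 0.4}

result = greedy_decode(source, candidates, trans_prob, bigram_prob)
```

With these parameters the decoder returns the word sequence of "I am satisfied with you". Since only one hypothesis survives each step, the search evaluates a number of extensions that is at most quadratic in the sentence length times the number of candidates per word, at the price of possibly missing the globally best translation.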
@@ -547,14 +546,14 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-14}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-14}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
+\parinterval 公式\eqref{eq:5-14}的核心内容之一是定义$\funp{P}(\seq{t}|\seq{s})$。在IBM模型中，可以使用贝叶斯准则对$\funp{P}(\seq{t}|\seq{s})$进行如下变换：
 \begin{eqnarray}
 \funp{P}(\seq{t}|\seq{s}) & = &\frac{\funp{P}(\seq{s},\seq{t})}{\funp{P}(\seq{s})} \nonumber \\
                       & = & \frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})}
 \label{eq:5-15}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-15}把$\seq{s}$到$\seq{t}$的翻译概率转化为$\frac{\funp{P}(\seq{s}|\seq{t})\textrm{P(t)}}{\funp{P}(\seq{s})}$，它包括三个部分：
+\parinterval 公式\eqref{eq:5-15}把$\seq{s}$到$\seq{t}$的翻译概率转化为$\frac{\funp{P}(\seq{s}|\seq{t})\funp{P}(\seq{t})}{\funp{P}(\seq{s})}$，它包括三个部分：

 \begin{itemize}
 \vspace{0.5em}
@@ -574,7 +573,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-16}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-16}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。一个很自然的问题是：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。
+\parinterval 公式\eqref{eq:5-16}展示了IBM模型最基础的建模方式，它把模型分解为两项：（反向）翻译模型$\funp{P}(\seq{s}|\seq{t})$和语言模型$\funp{P}(\seq{t})$。一个很自然的问题是：直接用$\funp{P}(\seq{t}|\seq{s})$定义翻译问题不就可以了吗，为什么要用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型？从理论上来说，正向翻译模型$\funp{P}(\seq{t}|\seq{s})$和反向翻译模型$\funp{P}(\seq{s}|\seq{t})$的数学建模可以是一样的，因为我们只需要在建模的过程中把两个语言调换即可。使用$\funp{P}(\seq{s}|\seq{t})$和$\funp{P}(\seq{t})$的联合模型的意义在于引入了语言模型，它可以很好地对译文的流畅度进行评价，确保结果是通顺的目标语言句子。

 \parinterval 可以回忆一下\ref{sec:sentence-level-translation}节中讨论的问题，如果只使用翻译模型可能会造成一个局面：译文的单词都和源语言单词对应得很好，但是由于语序的问题，读起来却不像人说的话。从这个角度说，引入语言模型是十分必要的。这个问题在Brown等人的论文中也有讨论\upcite{DBLP:journals/coling/BrownPPM94}，他们提到单纯使用$\funp{P}(\seq{s}|\seq{t})$会把概率分配给一些翻译对应比较好但是不合法的目标语句子，而且这部分概率可能会很大，影响模型的决策。这也正体现了IBM模型的创新之处，作者用数学技巧把$\funp{P}(\seq{t})$引入进来，保证了系统的输出是通顺的译文。语言模型也被广泛使用在语音识别等领域以保证结果的流畅性，甚至应用的历史比机器翻译要长得多，这里的方法也有借鉴相关工作的味道。

@@ -586,7 +585,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \section{统计机器翻译的三个基本问题}

-\parinterval 公式\ref{eq:5-16}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：
+\parinterval 公式\eqref{eq:5-16}给出了统计机器翻译的数学描述。为了实现这个过程，面临着三个基本问题：

 \begin{itemize}
 \vspace{0.5em}
@@ -598,13 +597,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \vspace{0.5em}
 \end{itemize}

-\parinterval 为了理解以上的问题，可以先回忆一下\ref{sec:sentence-level-translation}小节中的公式\ref{eq:5-10}，即$g(\seq{s},\seq{t})$函数的定义，它用于评估一个译文的好与坏。如图\ref{fig:5-14}所示，$g(\seq{s},\seq{t})$函数与公式\ref{eq:5-16}的建模方式非常一致，即$g(\seq{s},\seq{t})$函数中红色部分描述译文$\seq{t}$的可能性大小，对应翻译模型$\funp{P}(\seq{s}|\seq{t})$；蓝色部分描述译文的平滑或流畅程度，对应语言模型$\funp{P}(\seq{t})$。尽管这种对应并不十分严格的，但也可以看出在处理机器翻译问题上，很多想法的本质是一样的。
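+\parinterval To understand these problems, recall Equation \eqref{eq:5-10} in Section \ref{sec:sentence-level-translation}, i.e., the definition of the function $g(\seq{s},\seq{t})$, which scores how good a translation is. As shown in Figure \ref{fig:5-14}, $g(\seq{s},\seq{t})$ closely mirrors the modeling in Equation \eqref{eq:5-16}: the red part of $g(\seq{s},\seq{t})$ describes how likely the translation $\seq{t}$ is and corresponds to the translation model $\funp{P}(\seq{s}|\seq{t})$, while the blue part describes how smooth or fluent the translation is and corresponds to the language model $\funp{P}(\seq{t})$. The correspondence is not strict, but it shows that many ideas for handling machine translation are essentially the same.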

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-correspondence-between-ibm-model&formula-1.13}
-    \caption{IBM模型与公式\ref{eq:5-10}的对应关系}
+    \caption{IBM模型与公式\eqref{eq:5-10}的对应关系}
    \label{fig:5-14}
 \end{figure}
 %----------------------------------------------
@@ -621,7 +620,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \begin{itemize}
 \vspace{0.5em}
-\item 一个源语言单词只能对应一个目标语言单词。如图\ref{fig:5-15}所示，(a)和(c)都满足该条件，尽管(c)中的``谢谢''和``你''都对应``thanks''，但并不违背这个约束。而(b)不满足约束，因为`` 谢谢''同时对应到了两个目标语单词上。这个约束条件也导致这里的词对齐变成一种{\small\sffamily\bfseries{非对称的词对齐}}\index{非对称的词对齐}（Asymmetric Word Alignment）\index{Asymmetric Word Alignment}，因为它只对源语言做了约束，但是目标语言没有。使用这样的约束的目的是为了减少建模的复杂度。在IBM模型之后的方法中也提出了双向词对齐，用于建模一个源语言单词对应到多个目标语单词的情况\upcite{och2003systematic}。
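+\item Each source-language word can be aligned to only one target-language word. In Figure \ref{fig:5-15}, (a) and (c) both satisfy this condition; although “谢谢” and “你” in (c) both correspond to “thanks”, the constraint is not violated. In contrast, (b) does not satisfy the constraint, because “谢谢” is aligned to two target-language words at the same time. This constraint makes the word alignment here an {\small\sffamily\bfseries{asymmetric word alignment}}\index{非对称的词对齐}\index{Asymmetric Word Alignment}, because it constrains only the source language and not the target language. The constraint is adopted to reduce modeling complexity. Methods developed after the IBM models also introduced bidirectional word alignment to model the case in which one source-language word corresponds to several target-language words\upcite{och2003systematic}.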

 %----------------------------------------------
 \begin{figure}[htp]
@@ -633,20 +632,20 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\item 源语言单词可以翻译为空，这时它对应到一个虚拟或伪造的目标语单词$t_0$。在图\ref{fig:5-16}所示的例子中，``在''没有对应到``on the table''中的任意一个词，而是把它对应到$t_0$上。这样，所有的源语言单词都能找到一个目标语单词对应。这种设计也很好地引入了{\small\sffamily\bfseries{空对齐}}\index{空对齐}（Empty Alignment\index{Empty Alignment}）的思想，即源语言单词不对应任何真实存在的单词的情况。而这种空对齐的情况在翻译中是频繁出现的，比如虚词的翻译。
+\item 源语言单词可以翻译为空，这时它对应到一个虚拟或伪造的目标语单词$t_0$。在图\ref{fig:5-16}所示的例子中，“在”没有对应到“on the table”中的任意一个词，而是把它对应到$t_0$上。这样，所有的源语言单词都能找到一个目标语单词对应。这种设计也很好地引入了{\small\sffamily\bfseries{空对齐}}\index{空对齐}（Empty Alignment\index{Empty Alignment}）的思想，即源语言单词不对应任何真实存在的单词的情况。而这种空对齐的情况在翻译中是频繁出现的，比如虚词的翻译。

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-word-alignment-instance}
 \setlength{\belowcaptionskip}{-0.5em}
-    \caption{词对齐实例（``在''对应到$t_0$）}
+    \caption{词对齐实例（“在”对应到$t_0$）}
    \label{fig:5-16}
 \end{figure}
 %----------------------------------------------
 \end{itemize}

-\parinterval 通常，把词对齐记为$\seq{a}$，它由$a_1$到$a_m$共$m$个词对齐连接组成，即$\seq{a}=a_1...a_m$。$a_j$表示第$j$个源语单词$s_j$对应的目标语单词的位置。在图\ref{fig:5-16}的例子中，词对齐关系可以记为$a_1=0, a_2=3, a_3=1$，即第1个源语单词``在''对应到目标语译文的第0个位置，第2个源语单词``桌子''对应到目标语译文的第3个位置，第3个源语单词``上''对应到目标语译文的第1个位置。
+\parinterval 通常，把词对齐记为$\seq{a}$，它由$a_1$到$a_m$共$m$个词对齐连接组成，即$\seq{a}=a_1...a_m$。$a_j$表示第$j$个源语单词$s_j$对应的目标语单词的位置。在图\ref{fig:5-16}的例子中，词对齐关系可以记为$a_1=0, a_2=3, a_3=1$，即第1个源语单词“在”对应到目标语译文的第0个位置，第2个源语单词“桌子”对应到目标语译文的第3个位置，第3个源语单词“上”对应到目标语译文的第1个位置。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -661,9 +660,9 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-17}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-17}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。
+\parinterval 公式\eqref{eq:5-17}使用了简单的全概率公式把$\funp{P}(\seq{s}|\seq{t})$进行展开。通过访问$\seq{s}$和$\seq{t}$之间所有可能的词对齐$\seq{a}$，并把对应的对齐概率进行求和，得到了$\seq{t}$到$\seq{s}$的翻译概率。这里，可以把词对齐看作翻译的隐含变量，这样从$\seq{t}$到$\seq{s}$的生成就变为从$\seq{t}$同时生成$\seq{s}$和隐含变量$\seq{a}$的问题。引入隐含变量是生成式模型常用的手段，通过使用隐含变量，可以把较为困难的端到端学习问题转化为分步学习问题。

-\parinterval 举个例子说明公式\ref{eq:5-17}的实际意义。如图\ref{fig:5-17}所示，可以把从``谢谢\ 你''到``thank you''的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。
+\parinterval 举个例子说明公式\eqref{eq:5-17}的实际意义。如图\ref{fig:5-17}所示，可以把从“谢谢\ 你”到“thank you”的翻译分解为9种可能的词对齐。因为源语言句子$\seq{s}$有2个词，目标语言句子$\seq{t}$加上空标记$t_0$共3个词，因此每个源语言单词有3个可能对齐的位置，整个句子共有$3\times3=9$种可能的词对齐。
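
The count of $3\times3=9$ alignments can be checked by enumeration. The following is a minimal sketch; the function name `enumerate_alignments` is introduced here for illustration. Each alignment is a tuple $(a_1,...,a_m)$ in which position 0 stands for the empty word $t_0$.

```python
from itertools import product


def enumerate_alignments(m, l):
    """Return every alignment (a_1, ..., a_m) with each a_j in 0..l.

    m is the source length, l is the target length, and position 0
    stands for the empty word t_0.
    """
    return list(product(range(l + 1), repeat=m))


source = ["谢谢", "你"]
target = ["thank", "you"]
alignments = enumerate_alignments(len(source), len(target))
print(len(alignments))
```

The same function applied to a three-word source and a three-word target yields $4^3=64$ alignments, which shows how quickly the number of alignments grows with sentence length.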

 %----------------------------------------------
 \begin{figure}[htp]
@@ -680,7 +679,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-18}
 \end{eqnarray}

-\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\ref{eq:5-18}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：
+\noindent  其中$s_j$和$a_j$分别表示第$j$个源语言单词及第$j$个源语言单词对齐到的目标位置，\seq{s}${{}_1^{j-1}}$表示前$j-1$个源语言单词（即\seq{s}${}_1^{j-1}=s_1...s_{j-1}$），\seq{a}${}_1^{j-1}$表示前$j-1$个源语言的词对齐（即\seq{a}${}_1^{j-1}=a_1...a_{j-1}$），$m$表示源语句子的长度。公式\eqref{eq:5-18}将$\funp{P}(\seq{s},\seq{a}|\seq{t})$分解为四个部分，具体含义如下：

 \begin{itemize}
 \vspace{0.5em}
@@ -695,7 +694,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{itemize}
 \parinterval 换句话说，当求$\funp{P}(\seq{s},\seq{a}|\seq{t})$时，首先根据译文$\seq{t}$确定源语言句子$\seq{s}$的长度$m$；当知道源语言句子有多少个单词后，循环$m$次，依次生成第1个到第$m$个源语言单词；当生成第$j$个源语言单词时，要先确定它是由哪个目标语译文单词生成的，即确定生成的源语言单词对应的译文单词的位置；当知道了目标语译文单词的位置，就能确定第$j$个位置的源语言单词。

-\parinterval 需要注意的是公式\ref{eq:5-18}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。
+\parinterval 需要注意的是公式\eqref{eq:5-18}定义的模型并没有做任何化简和假设，也就是说公式的左右两端是严格相等的。在后面的内容中会看到，这种将一个整体进行拆分的方法可以有助于分步骤化简并处理问题。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -703,13 +702,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \subsection{基于词对齐的翻译实例}

-\parinterval 用前面图\ref{fig:5-16}中例子来对公式\ref{eq:5-18}进行说明。例子中，源语言句子``在\ \ 桌子\ \ 上''目标语译文``on the table''之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\ref{eq:5-18}的计算过程如下：
+\parinterval 用前面图\ref{fig:5-16}中例子来对公式\eqref{eq:5-18}进行说明。例子中，源语言句子“在\ \ 桌子\ \ 上”目标语译文“on the table”之间的词对齐为$\seq{a}=\{\textrm{1-0, 2-3, 3-1}\}$。 公式\eqref{eq:5-18}的计算过程如下：

 \begin{itemize}
 \vspace{0.5em}
-\item 首先根据译文确定源文$\seq{s}$的单词数量（$m=3$），即$\funp{P}(m=3|\textrm{``}t_0\;\textrm{on\;the\;table''})$；
+\item 首先根据译文确定源文$\seq{s}$的单词数量（$m=3$），即$\funp{P}(m=3|\textrm{“}t_0\;\textrm{on\;the\;table”})$；
 \vspace{0.5em}
-\item 再确定源语言单词$s_1$由谁生成的且生成的是什么。可以看到$s_1$由第0个目标语单词生成，也就是$t_0$，表示为$\funp{P}(a_1\;= 0\;\; |\phi,\phi,3,\textrm{``}t_0\;\textrm{on\;the\;table''})$，其中$\phi$表示空。当知道了$s_1$是由$t_0$生成的，就可以通过$t_0$生成源语言第一个单词``在''，即$\funp{P}(s_1\;= \textrm{`` 在''}\;|\{1-0\},\phi,3,\textrm{``$t_0$\;on\;the\;table''}) $；
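+\item Next, determine which word generates the source word $s_1$ and what is generated. Here $s_1$ is generated by the 0-th target word, namely $t_0$, which is written as $\funp{P}(a_1\;= 0\;\; |\phi,\phi,3,\textrm{“}t_0\;\textrm{on\;the\;table”})$, where $\phi$ denotes the empty sequence. Once it is known that $s_1$ is generated by $t_0$, the first source word “在” can be generated from $t_0$, i.e., $\funp{P}(s_1\;= \textrm{“在”}\;|\{\textrm{1-0}\},\phi,3,\textrm{“$t_0$\;on\;the\;table”}) $;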
 \vspace{0.5em}
 \item 类似于生成$s_1$，依次确定源语言单词$s_2$和$s_3$由谁生成且生成的是什么；
 \vspace{0.5em}
@@ -734,13 +733,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \sectionnewpage
 \section{IBM模型1}
-\parinterval 公式\ref{eq:5-17}和公式\ref{eq:5-18}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：
+\parinterval 公式\eqref{eq:5-17}和公式\eqref{eq:5-18}把翻译问题定义为对译文和词对齐同时进行生成的问题。其中有两个问题：

 \begin{itemize}
 \vspace{0.3em}
-\item 首先，公式\ref{eq:5-17}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度是呈指数增长，如何遍历所有的对齐$\seq{a}$？
+\item 首先，公式\eqref{eq:5-17}的右端（$ \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$）要求对所有的词对齐概率进行求和，但是词对齐的数量随着句子长度是呈指数增长，如何遍历所有的对齐$\seq{a}$？
 \vspace{0.3em}
-\item 其次，公式\ref{eq:5-18}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
+\item 其次，公式\eqref{eq:5-18}虽然对词对齐的问题进行了描述，但是模型中的很多参数仍然很复杂，如何计算$\funp{P}(m|\seq{t})$、$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$ 和$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$？
 \vspace{0.3em}
 \end{itemize}

@@ -751,7 +750,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 %----------------------------------------------------------------------------------------
 \vspace{-0.5em}
 \subsection{IBM模型1}
-\parinterval IBM模型1对公式\ref{eq:5-18}中的三项进行了简化。具体方法如下：
+\parinterval IBM模型1对公式\eqref{eq:5-18}中的三项进行了简化。具体方法如下：

 \begin{itemize}
 \item 假设$\funp{P}(m|\seq{t})$为常数$\varepsilon$，即源语言句子长度的生成概率服从均匀分布，如下：
@@ -759,22 +758,22 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \funp{P}(m|\seq{t})\; \equiv \; \varepsilon
 \label{eq:5-20}
 \end{eqnarray}
-\item 对齐概率$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$仅依赖于译文长度$l$，即每个词对齐连接的生成概率也服从均匀分布。换句话说，对于任何源语言位置$j$对齐到目标语言任何位置都是等概率的。比如译文为``on the table''，再加上$t_0$共4个位置，相应的，任意源语单词对齐到这4个位置的概率是一样的。具体描述如下：
+\item 对齐概率$\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})$仅依赖于译文长度$l$，即每个词对齐连接的生成概率也服从均匀分布。换句话说，对于任何源语言位置$j$对齐到目标语言任何位置都是等概率的。比如译文为“on the table”，再加上$t_0$共4个位置，相应的，任意源语单词对齐到这4个位置的概率是一样的。具体描述如下：
 \begin{eqnarray}
 \funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t}) \equiv \frac{1}{l+1}
 \label{eq:5-21}
 \end{eqnarray}

-\item 源语单词$s_j$的生成概率$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$仅依赖与其对齐的译文单词$t_{a_j}$，即词汇翻译概率$f(s_j|t_{a_j})$。此时词汇翻译概率满足$\sum_{s_j}{f(s_j|t_{a_j})}=1$。比如在图\ref{fig:5-18}表示的例子中，源语单词``上''出现的概率只和与它对齐的单词``on''有关系，与其他单词没有关系。
+\item 源语单词$s_j$的生成概率$\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})$仅依赖与其对齐的译文单词$t_{a_j}$，即词汇翻译概率$f(s_j|t_{a_j})$。此时词汇翻译概率满足$\sum_{s_j}{f(s_j|t_{a_j})}=1$。比如在图\ref{fig:5-18}表示的例子中，源语单词“上”出现的概率只和与它对齐的单词“on”有关系，与其他单词没有关系。
 \begin{eqnarray}
 \funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t}) \equiv f(s_j|t_{a_j})
 \label{eq:5-22}
 \end{eqnarray}

-用一个简单的例子对公式\ref{eq:5-22}进行说明。比如，在图\ref{fig:5-18}中，``桌子''对齐到``table''，可被描述为$f(s_2 |t_{a_2})=f(\textrm{``桌子''}|\textrm{``table''})$，表示给定``table''翻译为``桌子''的概率。通常，$f(s_2 |t_{a_2})$被认为是一种概率词典，它反应了两种语言词汇一级的对应关系。
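+A simple example illustrates Equation \eqref{eq:5-22}. In Figure \ref{fig:5-18}, “桌子” is aligned to “table”, which can be written as $f(s_2 |t_{a_2})=f(\textrm{“桌子”}|\textrm{“table”})$, the probability that “table” is translated as “桌子”. $f(s_2 |t_{a_2})$ is usually regarded as a probabilistic dictionary that reflects word-level correspondences between the two languages.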
 \end{itemize}

-\parinterval 将上述三个假设和公式\ref{eq:5-18}代入公式\ref{eq:5-17}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
+\parinterval 将上述三个假设和公式\eqref{eq:5-18}代入公式\eqref{eq:5-17}中，得到$\funp{P}(\seq{s}|\seq{t})$的表达式：
 \begin{eqnarray}
 \funp{P}(\seq{s}|\seq{t}) & = &  \sum_{\seq{a}}{\funp{P}(\seq{s},\seq{a}|\seq{t})} \nonumber \\
                        & = &  \sum_{\seq{a}}{\funp{P}(m|\seq{t})}\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j |a_1^j,s_1^{j-1},m,\seq{t})} \nonumber \\
@@ -792,19 +791,19 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\parinterval 在公式\ref{eq:5-23}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
+\parinterval 在公式\eqref{eq:5-23}中，需要遍历所有的词对齐，即$ \sum_{\seq{a}}{\cdot}$。但这种表示不够直观，因此可以把这个过程重新表示为如下形式：
 \begin{eqnarray}
 \funp{P}(\seq{s}|\seq{t})={\sum_{a_1=0}^{l}\cdots}{\sum_{a_m=0}^{l}\frac{\varepsilon}{(l+1)^m}}{\prod_{j=1}^{m}f(s_j|t_{a_j})}
 \label{eq:5-24}
 \end{eqnarray}

-\parinterval 公式\ref{eq:5-24}分为两个主要部分。第一部分：遍历所有的对齐$\seq{a}$。其中$\seq{a}$由$\{a_1,...,a_m\}$\\ 组成，每个$a_j\in \{a_1,...,a_m\}$从译文的开始位置$(0)$循环到截止位置$(l)$。如图\ref{fig:5-19}表示的例子，描述的是源语单词$s_3$从译文的开始$t_0$遍历到结尾$t_3$，即$a_3$的取值范围。第二部分: 对于每个$\seq{a}$累加对齐概率$\funp{P}(\seq{s},a| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$。
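+\parinterval Equation \eqref{eq:5-24} has two main parts. The first part enumerates all alignments $\seq{a}$. Here $\seq{a}$ consists of $\{a_1,...,a_m\}$, and each $a_j\in \{a_1,...,a_m\}$ ranges from the first position of the translation $(0)$ to the last position $(l)$. The example in Figure \ref{fig:5-19} shows the source word $s_3$ ranging from the beginning $t_0$ to the end $t_3$ of the translation, i.e., the range of values of $a_3$. The second part accumulates, for every $\seq{a}$, the alignment probability $\funp{P}(\seq{s},\seq{a}| \seq{t})=\frac{\varepsilon}{(l+1)^m}{\prod_{j=1}^{m}f(s_j|t_{a_j})}$.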
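
As a worked instance of the alignment probability, take the earlier example “在\ \ 桌子\ \ 上” and “on the table” with $a_1=0, a_2=3, a_3=1$, so that $m=3$ and $l=3$. Under the three IBM Model 1 assumptions, a single term of the sum would read as follows (a sketch that only substitutes the example into the formula; no numerical values are assumed):

```latex
\begin{eqnarray}
\funp{P}(\seq{s},\seq{a}|\seq{t}) & = & \frac{\varepsilon}{(3+1)^{3}} \cdot f(\textrm{“在”}|t_0) \cdot f(\textrm{“桌子”}|\textrm{“table”}) \cdot f(\textrm{“上”}|\textrm{“on”}) \nonumber
\end{eqnarray}
```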

 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-formula-3.25-part-1-example}
-    \caption{公式{\ref{eq:5-24}}第一部分实例}
+    \caption{公式{\eqref{eq:5-24}}第一部分实例}
    \label{fig:5-19}
 \end{figure}
 %----------------------------------------------
@@ -817,7 +816,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \subsection{解码及计算优化}\label{decoding&computational-optimization}

-\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\ref{eq:5-24}计算每个译文候选的IBM模型翻译概率。但是，公式\ref{eq:5-24}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\ref{eq:5-24}重写为：
+\parinterval 如果模型参数给定，可以使用IBM模型1对新的句子进行翻译。比如，可以使用\ref{sec:simple-decoding}节描述的解码方法搜索最优译文。在搜索过程中，只需要通过公式\eqref{eq:5-24}计算每个译文候选的IBM模型翻译概率。但是，公式\eqref{eq:5-24}的高计算复杂度导致这些模型很难直接使用。以IBM模型1为例，这里把公式\eqref{eq:5-24}重写为：
 \begin{eqnarray}
 \funp{P}(\seq{s}| \seq{t}) = \frac{\varepsilon}{(l+1)^{m}} \underbrace{\sum\limits_{a_1=0}^{l} ... \sum\limits_{a_m=0}^{l}}_{(l+1)^m\textrm{次循环}} \underbrace{\prod\limits_{j=1}^{m} f(s_j|t_{a_j})}_{m\textrm{次循环}}
 \label{eq:5-27}
@@ -829,7 +828,7 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \label{eq:5-28}
 \end{eqnarray}

-\noindent  公式\ref{eq:5-28}的技巧在于把若干个乘积的加法（等式左手端）转化为若干加法结果的乘积（等式右手端），这样省去了多次循环，把$O((l+1)^m m)$的计算复杂度降为$O((l+1)m)$。此外，公式\ref{eq:5-28}相比公式\ref{eq:5-27}的另一个优点在于，公式\ref{eq:5-28}中乘法的数量更少，因为现代计算机中乘法运算的代价要高于加法，因此公式\ref{eq:5-28}的计算机实现效率更高。图\ref{fig:5-21} 对这个过程进行了进一步解释。
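+\noindent  The trick in Equation \eqref{eq:5-28} is to turn a sum of products (the left-hand side) into a product of sums (the right-hand side). This removes the nested loops and reduces the computational complexity from $O((l+1)^m m)$ to $O((l+1)m)$. In addition, the right-hand side of Equation \eqref{eq:5-28} needs far fewer multiplications than Equation \eqref{eq:5-27}; since a multiplication is generally no cheaper than an addition, a computer implementation based on Equation \eqref{eq:5-28} is more efficient. Figure \ref{fig:5-21} explains this process further.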
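
The identity can be checked numerically. The sketch below assumes a small hypothetical table `f` of lexical translation probabilities (the values are made up, and `brute_force_sum` and `factored_product` are names introduced here); it compares the brute-force sum over all $(l+1)^m$ alignments with the product of sums.

```python
from itertools import product
from math import isclose, prod

# Hypothetical values of f(s_j | t_i); "NULL" plays the role of t_0.
f = {
    ("谢谢", "NULL"): 0.1, ("谢谢", "thank"): 0.8, ("谢谢", "you"): 0.2,
    ("你", "NULL"): 0.2, ("你", "thank"): 0.1, ("你", "you"): 0.7,
}
s = ["谢谢", "你"]
t = ["NULL", "thank", "you"]


def brute_force_sum(f, s, t):
    """Sum over all (l+1)^m alignments of prod_j f(s_j | t_{a_j})."""
    positions = range(len(t))  # t already contains t_0 at index 0
    return sum(
        prod(f[(s[j], t[a[j]])] for j in range(len(s)))
        for a in product(positions, repeat=len(s))
    )


def factored_product(f, s, t):
    """prod_j sum_i f(s_j | t_i), using only (l+1) * m table lookups."""
    return prod(sum(f[(sj, ti)] for ti in t) for sj in s)


print(isclose(brute_force_sum(f, s, t), factored_product(f, s, t)))
```

Multiplying either value by $\frac{\varepsilon}{(l+1)^{m}}$ gives the IBM Model 1 translation probability; the factored form is the one that remains practical for long sentences.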

 %----------------------------------------------
 \begin{figure}[htp]
@@ -840,13 +839,13 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 
 \end{figure}
 %----------------------------------------------

-\parinterval 接着，利用公式\ref{eq:5-28}的方式，可以把公式\ref{eq:5-24}重写表示为：
+\parinterval 接着，利用公式\eqref{eq:5-28}的方式，可以把公式\eqref{eq:5-24}重写表示为：
 \begin{eqnarray}
 \textrm{IBM模型1：\ \ \ \ } \funp{P}(\seq{s}| \seq{t}) & = & \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}^{m} \sum\limits_{i=0}^{l} f(s_j|t_i) \label{eq:5-64}
 \label{eq:5-29}
 \end{eqnarray}

-公式\ref{eq:5-64}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。
+公式\eqref{eq:5-64}是IBM模型1的最终表达式，在解码和训练中可以被直接使用。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -881,9 +880,9 @@ g(\seq{s},\seq{t}) \equiv \prod_{j,i \in \widehat{A}}{\funp{P}(s_j,t_i)} \times 

 \noindent 其中，$\argmax_{\theta}$表示求最优参数的过程（或优化过程）。

-\parinterval 公式\ref{eq:5-30}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。
+\parinterval 公式\eqref{eq:5-30}实际上也是一种基于极大似然的模型训练方法。这里，可以把$\funp{P}_{\theta}(\seq{s}|\seq{t})$看作是模型对数据描述的一个似然函数，记作$L(\seq{s},\seq{t};\theta)$。也就是，优化目标是对似然函数的优化：$\{\widehat{\theta}\}=\{\argmax_{\theta \in \Theta}L(\seq{s},\seq{t};\theta)\}$，其中\{$\widehat{\theta}$\} 表示可能有多个结果，$\Theta$表示参数空间。

-\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\ref{eq:5-64} ，可以把这个目标表述为：
+\parinterval 回到IBM模型的优化问题上。以IBM模型1为例，优化的目标是最大化翻译概率$\funp{P}(\seq{s}| \seq{t})$。使用公式\eqref{eq:5-64} ，可以把这个目标表述为：
 \begin{eqnarray}
 &                    & \textrm{max}\Big(\frac{\varepsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}{f({s_j|t_i})}\Big) \nonumber \\
 & \textrm{s.t.} & \textrm{任意单词} t_{y}:\;\sum_{s_x}{f(s_x|t_y)}=1 \nonumber
@@ -968,7 +967,7 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 \label{eq:5-39}
 \end{eqnarray}

-\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用上式迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\ref{eq:5-39}，具体如图\ref{fig:5-24}所示。
+\parinterval  可以看出，这不是一个计算$f(s_u|t_v)$的解析式，因为等式右端仍含有$f(s_u|t_v)$。不过它蕴含着一种非常经典的方法\ $\dash$\ {\small\sffamily\bfseries{期望最大化}}\index{期望最大化}（Expectation Maximization）\index{Expectation Maximization}方法，简称EM方法（或算法）。使用EM方法可以利用上式迭代地计算$f(s_u|t_v)$，使其最终收敛到最优值。EM方法的思想是：用当前的参数，求似然函数的期望，之后最大化这个期望同时得到新的一组参数的值。对于IBM模型来说，其迭代过程就是反复使用公式\eqref{eq:5-39}，具体如图\ref{fig:5-24}所示。
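
The iteration can be sketched in a few lines. The code below is a minimal illustration of the usual EM recipe for IBM Model 1, not the book's implementation: the toy corpus, the `NULL` token standing for $t_0$, and the function name `train_ibm_model1` are all assumptions made here. The constant factor $\frac{\varepsilon}{(l+1)^{m}}$ cancels in the normalization and is therefore omitted.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=10):
    """Estimate f(s|t) with EM; corpus is a list of (source, target) word lists."""
    pairs = [(s, ["NULL"] + t) for s, t in corpus]  # prepend t_0
    src_vocab = {w for s, _ in pairs for w in s}
    # Start from a uniform table f(s|t).
    f = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (s_u, t_v)
        total = defaultdict(float)  # expected count of t_v
        # E-step: collect expected counts with the current parameters.
        for s, t in pairs:
            for sj in s:
                z = sum(f[(sj, ti)] for ti in t)  # normalizer over target positions
                for ti in t:
                    c = f[(sj, ti)] / z
                    count[(sj, ti)] += c
                    total[ti] += c
        # M-step: renormalize so that sum_s f(s|t) = 1 for every t.
        f = defaultdict(
            float, {(sw, tw): c / total[tw] for (sw, tw), c in count.items()}
        )
    return f


corpus = [(["谢谢", "你"], ["thank", "you"]), (["你"], ["you"])]
f = train_ibm_model1(corpus)
print(f[("谢谢", "thank")] > f[("你", "thank")])
```

On this toy corpus the second sentence pair pushes probability mass toward “你” for “you”, which in turn lets “谢谢” win for “thank” in the first pair.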

 %----------------------------------------------
 \begin{figure}[htp]
@@ -979,13 +978,13 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 \end{figure}
 %----------------------------------------------

-\parinterval 为了化简$f(s_u|t_v)$的计算，在此对公式\ref{eq:5-39}进行了重新组织，见图\ref{fig:5-25}。其中，红色部分表示翻译概率P$(\seq{s}|\seq{t})$；蓝色部分表示$(s_u,t_v)$ 在句对$(\seq{s},\seq{t})$中配对的总次数，即``$t_v$翻译为$s_u$''在所有对齐中出现的次数；绿色部分表示$f(s_u|t_v)$对于所有的$t_i$的相对值，即``$t_v$翻译为$s_u$''在所有对齐中出现的相对概率；蓝色与绿色部分相乘表示``$t_v$翻译为$s_u$''这个事件出现次数的期望的估计，称之为{\small\sffamily\bfseries{期望频次}}\index{期望频次}(Expected Count)\index{Expected Count}。
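+\parinterval To simplify the computation of $f(s_u|t_v)$, Equation \eqref{eq:5-39} is reorganized as shown in Figure \ref{fig:5-25}. The red part is the translation probability $\funp{P}(\seq{s}|\seq{t})$; the blue part is the total number of times $(s_u,t_v)$ is paired in the sentence pair $(\seq{s},\seq{t})$, i.e., the number of occurrences of “$t_v$ translated as $s_u$” over all alignments; the green part is the value of $f(s_u|t_v)$ relative to all $t_i$, i.e., the relative probability of “$t_v$ translated as $s_u$” over all alignments; the product of the blue and green parts is an estimate of the expected number of occurrences of the event “$t_v$ translated as $s_u$”, called the {\small\sffamily\bfseries{expected count}}\index{期望频次}\index{Expected Count}.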
 \vspace{-0.3em}
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
 \input{./Chapter5/Figures/figure-a-more-detailed-explanation-of-formula-3.40}
-   \caption{公式\ref{eq:5-39}的解释}
+   \caption{公式\eqref{eq:5-39}的解释}
   \label{fig:5-25}
 \end{figure}
 %----------------------------------------------
@@ -996,7 +995,7 @@ f(s_u|t_v) = \lambda_{t_v}^{-1} \frac{\varepsilon}{(l+1)^{m}} \prod\limits_{j=1}
 c_{\mathbb{E}}(X)=\sum_i c(x_i) \cdot \funp{P}(x_i)
 \end{equation}

-\noindent 其中$c(x_i)$表示$X$取$x_i$时出现的次数，P$(x_i)$表示$X=x_i$出现的概率。图\ref{fig:5-26}展示了事件$X$的期望频次的详细计算过程。其中$x_1$、$x_2$和$x_3$分别表示事件$X$出现2次、1次和5次的情况。
+\noindent 其中$c(x_i)$表示$X$取$x_i$时出现的次数，$\funp{P}(x_i)$表示$X=x_i$出现的概率。图\ref{fig:5-26}展示了事件$X$的期望频次的详细计算过程。其中$x_1$、$x_2$和$x_3$分别表示事件$X$出现2次、1次和5次的情况。
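
As a small worked example, keep the counts 2, 1 and 5 for $x_1$, $x_2$ and $x_3$, and assume, purely for illustration, the probabilities $\funp{P}(x_1)=0.1$, $\funp{P}(x_2)=0.3$ and $\funp{P}(x_3)=0.6$ (these values are hypothetical and need not match the figure). The expected count is then:

```latex
\begin{equation}
c_{\mathbb{E}}(X) = 2 \times 0.1 + 1 \times 0.3 + 5 \times 0.6 = 3.5 \nonumber
\end{equation}
```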

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1087,7 +1086,7 @@ c_{\mathbb{E}}(s_u|t_v)=\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^
 \sectionnewpage
 \section{小结及拓展阅读}

-\parinterval 本章对IBM系列模型中的IBM模型1进行了详细的介绍和讨论，从一个简单的基于单词的翻译模型开始，本章从建模、解码、训练多个维度对统计机器翻译进行了描述，期间涉及了词对齐、优化等多个重要概念。IBM模型共分为5个模型，对翻译问题的建模依次由浅入深，同时模型复杂度也依次增加，我们将在{\chaptersix}对IBM模型2-5进行详细的介绍和讨论。IBM模型作为入门统计机器翻译的``必经之路''，其思想对今天的机器翻译仍然产生着影响。虽然单独使用IBM模型进行机器翻译现在已经不多见，甚至很多从事神经机器翻译等前沿研究的人对IBM模型已经逐渐淡忘，但是不能否认IBM模型标志着一个时代的开始。从某种意义上讲，当使用公式$\hat{\seq{t}} = \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$描述机器翻译问题的时候，或多或少都在使用与IBM模型相似的思想。
+\parinterval 本章对IBM系列模型中的IBM模型1进行了详细的介绍和讨论，从一个简单的基于单词的翻译模型开始，本章从建模、解码、训练多个维度对统计机器翻译进行了描述，期间涉及了词对齐、优化等多个重要概念。IBM模型共分为5个模型，对翻译问题的建模依次由浅入深，同时模型复杂度也依次增加，我们将在{\chaptersix}对IBM模型2-5进行详细的介绍和讨论。IBM模型作为入门统计机器翻译的“必经之路”，其思想对今天的机器翻译仍然产生着影响。虽然单独使用IBM模型进行机器翻译现在已经不多见，甚至很多从事神经机器翻译等前沿研究的人对IBM模型已经逐渐淡忘，但是不能否认IBM模型标志着一个时代的开始。从某种意义上讲，当使用公式$\hat{\seq{t}} = \argmax_{\seq{t}} \funp{P}(\seq{t}|\seq{s})$描述机器翻译问题的时候，或多或少都在使用与IBM模型相似的思想。

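 \parinterval Of course, this book cannot cover every aspect of the IBM models, and much is left for interested readers to explore. The topic most worth attention is statistical word alignment. Because word alignment is a by-product of IBM model training, the IBM models have become an important method for automatic word alignment. For example, GIZA++, a toolkit for training IBM models, is used more often for automatic word alignment than merely for estimating IBM model parameters\upcite{och2003systematic}.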

@@ -1098,7 +1097,7 @@ c_{\mathbb{E}}(s_u|t_v)=\sum\limits_{k=1}^{K}  c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^
 \item 随着词对齐概念的不断深入，也有很多词对齐方面的工作并不依赖IBM模型。比如，可以直接使用判别式模型利用分类器解决词对齐问题\upcite{ittycheriah2005maximum}；使用带参数控制的动态规划方法来提高词对齐准确率\upcite{DBLP:conf/naacl/GaleC91}；甚至可以把对齐的思想用于短语和句法结构的双语对应\upcite{xiao2013unsupervised}；无监督的对称词对齐方法，正向和反向模型联合训练，结合数据的相似性\upcite{DBLP:conf/naacl/LiangTK06}；除了GIZA++，研究人员也开发了很多优秀的自动对齐工具，比如，FastAlign\upcite{DBLP:conf/naacl/DyerCS13}、Berkeley Word Aligner\upcite{taskar2005a}等，这些工具现在也有很广泛的应用。

 \vspace{0.5em}
-\item 一种较为通用的词对齐评价标准是{\bfnew{对齐错误率}}(Alignment Error Rate, AER)\upcite{DBLP:journals/coling/FraserM07}。在此基础之上也可以对词对齐评价方法进行改进，以提高对齐质量与机器翻译评价得分BLEU的相关性\upcite{DBLP:conf/acl/DeNeroK07,paul2007all,黄书剑2009一种错误敏感的词对齐评价方法}。也有工作通过统计机器翻译系统性能的提升来评价对齐质量\upcite{DBLP:journals/coling/FraserM07}。不过，在相当长的时间内，词对齐质量对机器翻译系统的影响究竟如何并没有统一的结论。有些时候，词对齐的错误率下降了，但是机器翻译系统的译文品质没有带来性能提升。但是，这个问题比较复杂，需要进一步的论证。不过，可以肯定的是，词对齐可以帮助人们分析机器翻译的行为。甚至在最新的神经机器翻译中，如何在神经网络模型中寻求两种语言单词之间的对应关系也是对模型进行解释的有效手段之一\upcite{DBLP:journals/corr/FengLLZ16}。
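+\item A widely used evaluation metric for word alignment is the {\bfnew{alignment error rate}}（Alignment Error Rate, AER）\upcite{DBLP:journals/coling/FraserM07}. Building on it, the evaluation of word alignment can be refined to improve the correlation between alignment quality and the machine translation metric BLEU\upcite{DBLP:conf/acl/DeNeroK07,paul2007all,黄书剑2009一种错误敏感的词对齐评价方法}. Some work also assesses alignment quality through the improvement it brings to a statistical machine translation system\upcite{DBLP:journals/coling/FraserM07}. For a long time, however, there was no consensus on how word alignment quality affects machine translation systems. Sometimes the alignment error rate decreases but the quality of the translations does not improve. The issue is complex and needs further study. What is certain is that word alignment helps people analyze the behavior of machine translation. Even in recent neural machine translation, finding correspondences between the words of two languages inside a neural network is one effective way to interpret the model\upcite{DBLP:journals/corr/FengLLZ16}.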

 \vspace{0.5em}
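 \item Decoding for word-based translation models also attracted early researchers. A classic approach is greedy decoding\upcite{germann2003greedy}. Researchers have also compared different decoding methods\upcite{germann2001fast} and suggested ways to speed up decoding. Later work improved these methods further\upcite{DBLP:conf/coling/UdupaFM04,DBLP:conf/naacl/RiedelC09}. In fact, decoding for word-based models is an NP-complete problem\upcite{knight1999decoding}, which is why decoding in machine translation is so hard. The time complexity of decoding algorithms for translation models has also been discussed extensively\upcite{DBLP:conf/eacl/UdupaM06,DBLP:conf/emnlp/LeuschMN08,DBLP:journals/mt/FlemingKN15}.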

--- a/Chapter6/chapter6.tex
+++ b/Chapter6/chapter6.tex
@@ -21,9 +21,9 @@
 %	CHAPTER 6
 %----------------------------------------------------------------------------------------

-\chapter{基于扭曲度和繁衍率的模型}
+\chapter{基于扭曲度和繁衍率的翻译模型}

-{\chapterfive}展示了一种基于单词的翻译模型。这种模型的形式非常简单，而且其隐含的词对齐信息具有较好的可解释性。不过，语言翻译的复杂性远远超出人们的想象。有两方面挑战\ \dash\ 如何对`` 调序''问题进行建模以及如何对``一对多翻译''问题进行建模。调序是翻译问题中所特有的现象，比如，汉语到日语的翻译中，需要对谓词进行调序。另一方面，一个单词在另一种语言中可能会被翻译为多个连续的词，比如，汉语`` 联合国''翻译到英语会对应三个单词``The United Nations''。这种现象也被称作一对多翻译，它与句子长度预测有着密切的联系。
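+{\chapterfive} presented a word-based translation model. Its form is very simple, and the word alignment it implies is fairly interpretable. However, translation is far more complex than one might expect. Two challenges stand out\ \dash\ how to model “reordering” and how to model “one-to-many translation”. Reordering is a phenomenon specific to translation; for example, predicates must be reordered when translating from Chinese into Japanese. In addition, a word may be translated into several consecutive words in another language; for example, the Chinese word “联合国” corresponds to three English words, “The United Nations”. This phenomenon is called one-to-many translation and is closely related to sentence length prediction.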

 无论是调序还是一对多翻译，简单的翻译模型（如IBM模型1）都无法对其进行很好的处理。因此，需要考虑对这两个问题单独进行建模。本章将会对机器翻译中两个常用的概念进行介绍\ \dash\ 扭曲度（Distortion）和繁衍率（Fertility）。它们可以被看作是对调序和一对多翻译现象的一种统计描述。基于此，本章会进一步介绍基于扭曲度和繁衍率的翻译模型，建立相对完整的基于单词的统计建模体系。相关的技术和概念在后续章节也会被进一步应用。

@@ -53,7 +53,7 @@

 \parinterval 在对调序问题进行建模的方法中，最基本的是使用调序距离方法。这里，可以假设完全进行顺序翻译时，调序的“代价”是最低的。当调序出现时，可以用调序相对于顺序翻译产生的位置偏移来度量调序的程度，也被称为调序距离。图\ref{fig:6-2}展示了翻译时两种语言中词的对齐矩阵。比如，在图\ref{fig:6-2}(a)中，系统需要跳过“对”和“你”来翻译“感到”和“满意”，之后再回过头翻译“对”和“你”，这样就完成了对单词的调序。这时可以简单地把需要跳过的单词数看作一种距离。

-\parinterval 可以看到，调序距离实际上是在度量目标语言词序相对于源语言词序的一种扭曲程度。因此，也常常把这种调序距离称作{\small\sffamily\bfseries{扭曲度}}（Distortion）。调序距离越大对应的扭曲度也越大。比如，可以明显看出图\ref{fig:6-2}（b）中调序的扭曲度要比图\ref{fig:6-2}（a）中调序的扭曲度大，因此\ref{fig:6-2}（b）实例的调序代价也更大。
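+\parinterval The reordering distance thus measures how much the target-language word order is distorted relative to the source-language word order. For this reason it is often called {\small\sffamily\bfseries{distortion}}. A larger reordering distance means a larger distortion. For example, the reordering in Figure \ref{fig:6-2}(b) is clearly more distorted than that in Figure \ref{fig:6-2}(a), so the example in Figure \ref{fig:6-2}(b) also has a higher reordering cost.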

 \parinterval 在机器翻译中使用扭曲度进行翻译建模是一种十分自然的想法。接下来，会介绍两个基于扭曲度的翻译模型，分别是IBM模型2和隐马尔可夫模型。不同于IBM模型1，它们利用了单词的位置信息定义了扭曲度，并将扭曲度融入翻译模型中，使得对翻译问题的建模更加合理。

@@ -78,7 +78,7 @@
 \label{eq:6-1}
 \end{eqnarray}

-\parinterval 这里还用{\chapterthree}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，``桌子''对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，``桌子''对齐到``table''被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果它的源语言和目标语言都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此``桌子''对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，``桌子''对齐到``table''应该得到更高的概率。
+\parinterval 这里还用{\chapterthree}中的例子（图\ref{fig:6-3}）来进行说明。在IBM模型1中，“桌子”对齐到目标语言四个位置的概率是一样的。但在IBM模型2中，“桌子”对齐到“table”被形式化为$a(a_j |j,m,l)=a(3|2,3,3)$，意思是对于源语言位置2（$j=2$）的词，如果它的源语言和目标语言都是3个词（$l=3,m=3$），对齐到目标语言位置3（$a_j=3$）的概率是多少？因为$a(a_j|j,m,l)$也是模型需要学习的参数，因此“桌子”对齐到不同目标语言单词的概率也是不一样的。理想的情况下，通过$a(a_j|j,m,l)$，“桌子”对齐到“table”应该得到更高的概率。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -136,7 +136,7 @@
 \label{eq:6-6}
 \end{eqnarray}

-\parinterval 这里用图\ref{fig:6-4}的例子对公式进行说明。在IBM模型1-2中，单词的对齐都是与单词所在的绝对位置有关。但在HMM词对齐模型中，``你''对齐到``you''被形式化为$\funp{P}(a_{j}|a_{j-1},l)= P(5|4,5)$，意思是对于源语言位置$3(j=3)$上的单词，如果它的译文是第5个目标语言单词，上一个对齐位置是$4(a_{2}=4)$，对齐到目标语言位置$5(a_{j}=5)$的概率是多少？理想的情况下，通过$\funp{P}(a_{j}|a_{j-1},l)$，``你''对齐到``you''应该得到更高的概率，并且由于源语言单词``对''和``你''距离很近，因此其对应的对齐位置``with''和``you''的距离也应该很近。
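+\parinterval The example in Figure \ref{fig:6-4} illustrates the equation. In IBM Models 1-2, word alignment depends on the absolute positions of words. In the HMM word alignment model, aligning “你” to “you” is formalized as $\funp{P}(a_{j}|a_{j-1},l)= \funp{P}(5|4,5)$: for the word at source position $3$ ($j=3$), given that the previous alignment position is $4$ ($a_{2}=4$), what is the probability that it is aligned to target position $5$ ($a_{j}=5$)? Ideally, $\funp{P}(a_{j}|a_{j-1},l)$ gives a higher probability to aligning “你” with “you”; moreover, since the source words “对” and “你” are close together, their aligned positions “with” and “you” should also be close.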

 \parinterval Substituting $\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t}) \equiv f(s_j|t_{a_j})$ and Equation \eqref{eq:6-6} into $\funp{P}(\seq{s},\seq{a}|\seq{t})=\funp{P}(m|\seq{t})\prod_{j=1}^{m}{\funp{P}(a_j|a_1^{j-1},s_1^{j-1},m,\seq{t})\funp{P}(s_j|a_1^{j},s_1^{j-1},m,\seq{t})}$ and $\funp{P}(\seq{s}|\seq{t})= \sum_{\seq{a}}\funp{P}(\seq{s},\seq{a}|\seq{t})$ gives the mathematical description of the HMM word alignment model:
 \begin{eqnarray}
@@ -179,11 +179,11 @@

 \begin{itemize}
 \vspace{0.3em}
-\item 首先，对于每个英语单词$t_i$决定它的产出率$\varphi_{i}$。比如``Scientists''的产出率是2，可表示为${\varphi}_{1}=2$。这表明它会生成2个汉语单词；
+\item First, decide the fertility $\varphi_{i}$ of each English word $t_i$. For example, the fertility of “Scientists” is 2, written ${\varphi}_{1}=2$, which means that it generates 2 Chinese words;
 \vspace{0.3em}
-\item 其次，确定英语句子中每个单词生成的汉语单词列表。比如``Scientists''生成``科学家''和``们''两个汉语单词，可表示为${\tau}_1=\{{\tau}_{11}=\textrm{``科学家''},{\tau}_{12}=\textrm{``们''}\}$。 这里用特殊的空标记NULL表示翻译对空的情况；
+\item 其次，确定英语句子中每个单词生成的汉语单词列表。比如“Scientists”生成“科学家”和“们”两个汉语单词，可表示为${\tau}_1=\{{\tau}_{11}=\textrm{“科学家”},{\tau}_{12}=\textrm{“们”}\}$。 这里用特殊的空标记NULL表示翻译对空的情况；
 \vspace{0.3em}
-\item 最后，把生成的所有汉语单词放在合适的位置。比如``科学家''和``们''分别放在$\seq{s}$的位置1和位置2。可以用符号$\pi$记录生成的单词在源语言句子$\seq{s}$中的位置。比如``Scientists'' 生成的汉语单词在$\seq{s}$ 中的位置表示为${\pi}_{1}=\{{\pi}_{11}=1,{\pi}_{12}=2\}$。
+\item 最后，把生成的所有汉语单词放在合适的位置。比如“科学家”和“们”分别放在$\seq{s}$的位置1和位置2。可以用符号$\pi$记录生成的单词在源语言句子$\seq{s}$中的位置。比如“Scientists” 生成的汉语单词在$\seq{s}$ 中的位置表示为${\pi}_{1}=\{{\pi}_{11}=1,{\pi}_{12}=2\}$。
 \vspace{0.3em}
 \end{itemize}

@@ -200,7 +200,7 @@

 \parinterval 可以看出，一组$\tau$和$\pi$（记为$<\tau,\pi>$）可以决定一个对齐$\seq{a}$和一个源语句子$\seq{s}$。
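
The mapping from $<\tau,\pi>$ to a source sentence and an alignment can be sketched as follows. This is a minimal illustration restricted to the single target word “Scientists” at position 1; the dictionary layout and the function name `realise` are assumptions made here. The second call uses a different $<\tau,\pi>$ that produces the same result, which is the situation discussed in the next paragraph.

```python
def realise(tau, pi):
    """Build the source sentence s and the alignment a from (tau, pi).

    tau[i][k] is the k-th source word generated by target position i;
    pi[i][k] is the 1-based source position where that word is placed.
    """
    placed = {}
    for i, words in tau.items():
        for word, pos in zip(words, pi[i]):
            placed[pos] = (word, i)
    m = len(placed)
    s = [placed[j][0] for j in range(1, m + 1)]
    a = [placed[j][1] for j in range(1, m + 1)]
    return s, a


# "Scientists" (target position 1) generates "科学家" and "们".
tau_1, pi_1 = {1: ["科学家", "们"]}, {1: [1, 2]}
# The same words generated in the opposite order but placed accordingly.
tau_2, pi_2 = {1: ["们", "科学家"]}, {1: [2, 1]}

print(realise(tau_1, pi_1))
print(realise(tau_2, pi_2))
```

In this representation the fertility $\varphi_i$ is simply the length of `tau[i]`.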

-\noindent 相反的，一个对齐$\seq{a}$和一个源语句子$\seq{s}$可以对应多组$<\tau,\pi>$。如图\ref{fig:6-6}所示，不同的$<\tau,\pi>$对应同一个源语言句子和词对齐。它们的区别在于目标语单词``Scientists''生成的源语言单词``科学家''和`` 们''的顺序不同。这里把不同的$<\tau,\pi>$对应到的相同的源语句子$\seq{s}$和对齐$\seq{a}$记为$<\seq{s},\seq{a}>$。因此计算$\funp{P}(\seq{s},\seq{a}| \seq{t})$时需要把每个可能结果的概率加起来，如下：
+\noindent 相反的，一个对齐$\seq{a}$和一个源语句子$\seq{s}$可以对应多组$<\tau,\pi>$。如图\ref{fig:6-6}所示，不同的$<\tau,\pi>$对应同一个源语言句子和词对齐。它们的区别在于目标语单词“Scientists”生成的源语言单词“科学家”和“ 们”的顺序不同。这里把不同的$<\tau,\pi>$对应到的相同的源语句子$\seq{s}$和对齐$\seq{a}$记为$<\seq{s},\seq{a}>$。因此计算$\funp{P}(\seq{s},\seq{a}| \seq{t})$时需要把每个可能结果的概率加起来，如下：
 \begin{equation}
 \funp{P}(\seq{s},\seq{a}| \seq{t})=\sum_{{<\tau,\pi>}\in{<\seq{s},\seq{a}>}}{\funp{P}(\tau,\pi|\seq{t}) }
 \label{eq:6-9}
@@ -281,7 +281,7 @@
 \label{eq:6-15}
 \end{eqnarray}
 }
-\parinterval 而上面提到的$t_0$所对应的这些空位置是如何生成的呢？即如何确定哪些位置是要放置空对齐的源语言单词。在IBM模型3中，假设在所有的非空对齐源语言单词都被生成出来后（共$\varphi_1+\varphi_2+\cdots {\varphi}_l$个非空对源语单词），这些单词后面都以$p_1$概率随机地产生一个``槽''用来放置空对齐单词。这样，${\varphi}_0$就服从了一个二项分布。于是得到
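+\parinterval How are the empty positions associated with $t_0$ generated, i.e., how is it decided which positions hold source words aligned to the empty word? IBM Model 3 assumes that after all source words with non-empty alignments have been generated (there are $\varphi_1+\varphi_2+\cdots+{\varphi}_l$ of them), each of these words is followed, with probability $p_1$, by a randomly generated “slot” for placing an empty-aligned word. ${\varphi}_0$ then follows a binomial distribution, which gives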
 {
 \begin{eqnarray}
 \funp{P}(\varphi_0|\seq{t})=\big(\begin{array}{c}
@@ -318,9 +318,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \subsection{IBM 模型4}

-\parinterval IBM模型3仍然存在问题，比如，它不能很好地处理一个目标语言单词生成多个源语言单词的情况。这个问题在模型1和模型2中也存在。如果一个目标语言单词对应多个源语言单词，往往这些源语言单词构成短语或搭配。但是模型1-3把这些源语言单词看成独立的单元，而实际上它们是一个整体。这就造成了在模型1-3中这些源语言单词可能会``分散''开。为了解决这个问题，模型4对模型3进行了进一步修正。
+\parinterval IBM模型3仍然存在问题，比如，它不能很好地处理一个目标语言单词生成多个源语言单词的情况。这个问题在模型1和模型2中也存在。如果一个目标语言单词对应多个源语言单词，往往这些源语言单词构成短语或搭配。但是模型1-3把这些源语言单词看成独立的单元，而实际上它们是一个整体。这就造成了在模型1-3中这些源语言单词可能会“分散”开。为了解决这个问题，模型4对模型3进行了进一步修正。

-\parinterval 为了更清楚地阐述，这里引入新的术语\ \dash \ {\small\bfnew{概念单元}}\index{概念单元}或{\small\bfnew{概念}}\index{概念}（Concept）\index{Concept}。词对齐可以被看作概念之间的对应。这里的概念是指具有独立语法或语义功能的一组单词。依照Brown等人的表示方法\upcite{DBLP:journals/coling/BrownPPM94}，可以把概念记为cept.。每个句子都可以被表示成一系列的cept.。这里要注意的是，源语言句子中的cept.数量不一定等于目标句子中的cept.数量。因为有些cept. 可以为空，因此可以把那些空对的单词看作空cept.。比如，在图\ref{fig:6-8}的实例中，``了''就对应一个空cept.。
+\parinterval 为了更清楚地阐述，这里引入新的术语\ \dash \ {\small\bfnew{概念单元}}\index{概念单元}或{\small\bfnew{概念}}\index{概念}（Concept）\index{Concept}。词对齐可以被看作概念之间的对应。这里的概念是指具有独立语法或语义功能的一组单词。依照Brown等人的表示方法\upcite{DBLP:journals/coling/BrownPPM94}，可以把概念记为cept.。每个句子都可以被表示成一系列的cept.。这里要注意的是，源语言句子中的cept.数量不一定等于目标句子中的cept.数量。因为有些cept. 可以为空，因此可以把那些空对的单词看作空cept.。比如，在图\ref{fig:6-8}的实例中，“了”就对应一个空cept.。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -331,9 +331,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}
 \end{figure}
 %----------------------------------------------

-\parinterval 在IBM模型的词对齐框架下，目标语的cept.只能是那些非空对齐的目标语单词，而且每个cept.只能由一个目标语言单词组成（通常把这类由一个单词组成的cept.称为独立单词cept.）。这里用$[i]$表示第$i$ 个独立单词cept.在目标语言句子中的位置。换句话说，$[i]$表示第$i$个非空对的目标语单词的位置。比如在本例中``mind''在$\seq{t}$中的位置表示为$[3]$。
+\parinterval 在IBM模型的词对齐框架下，目标语的cept.只能是那些非空对齐的目标语单词，而且每个cept.只能由一个目标语言单词组成（通常把这类由一个单词组成的cept.称为独立单词cept.）。这里用$[i]$表示第$i$ 个独立单词cept.在目标语言句子中的位置。换句话说，$[i]$表示第$i$个非空对的目标语单词的位置。比如在本例中“mind”在$\seq{t}$中的位置表示为$[3]$。

-\parinterval 另外，可以用$\odot_{i}$表示位置为$[i]$的目标语言单词对应的那些源语言单词位置的平均值，如果这个平均值不是整数则对它向上取整。比如在本例中，目标语句中第4个cept. （``.''）对应在源语言句子中的第5个单词。可表示为${\odot}_{4}=5$。
+\parinterval 另外，可以用$\odot_{i}$表示位置为$[i]$的目标语言单词对应的那些源语言单词位置的平均值，如果这个平均值不是整数则对它向上取整。比如在本例中，目标语句中第4个cept. （“.”）对应在源语言句子中的第5个单词。可表示为${\odot}_{4}=5$。
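
The definition of $\odot_{i}$ amounts to a rounded-up average. A minimal sketch follows; the function name `cept_center` and the example position lists other than the single position 5 are introduced here for illustration.

```python
def cept_center(source_positions):
    """Ceiling of the average source position aligned to one target cept."""
    total, n = sum(source_positions), len(source_positions)
    return -(-total // n)  # integer ceiling of total / n


print(cept_center([5]))     # the fourth cept "." aligned to source word 5
print(cept_center([1, 2]))  # a cept aligned to source words 1 and 2
```

Integer arithmetic is used for the ceiling so that the result does not depend on floating-point rounding.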

 \parinterval 利用这些新引进的概念，模型4对模型3的扭曲度进行了修改。主要是把扭曲度分解为两类参数。对于$[i]$对应的源语言单词列表($\tau_{[i]}$)中的第一个单词($\tau_{[i]1}$），它的扭曲度用如下公式计算：
 \begin{equation}
@@ -360,7 +360,7 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \subsection{ IBM 模型5}

-\parinterval 模型3和模型4并不是``准确''的模型。这两个模型会把一部分概率分配给一些根本就不存在的句子。这个问题被称作IBM模型3和模型4的{\small\bfnew{缺陷}}\index{缺陷}（Deficiency）\index{Deficiency}。说得具体一些，模型3和模型4 中并没有这样的约束：如果已经放置了某个源语言单词的位置不能再放置其他单词，也就是说句子的任何位置只能放置一个词，不能多也不能少。由于缺乏这个约束，模型3和模型4中在所有合法的词对齐上概率和不等于1。 这部分缺失的概率被分配到其他不合法的词对齐上。举例来说，如图\ref{fig:6-9}所示，``吃/早饭''和``have breakfast''之间的合法词对齐用直线表示 。但是在模型3和模型4中， 它们的概率和为$0.9<1$。 损失掉的概率被分配到像5和6这样的对齐上了（红色）。虽然IBM模型并不支持一对多的对齐，但是模型3和模型4把概率分配给这些`` 不合法''的词对齐上，因此也就产生所谓的缺陷。
+\parinterval 模型3和模型4并不是“准确”的模型。这两个模型会把一部分概率分配给一些根本就不存在的句子。这个问题被称作IBM模型3和模型4的{\small\bfnew{缺陷}}\index{缺陷}（Deficiency）\index{Deficiency}。说得具体一些，模型3和模型4 中并没有这样的约束：如果已经放置了某个源语言单词的位置不能再放置其他单词，也就是说句子的任何位置只能放置一个词，不能多也不能少。由于缺乏这个约束，模型3和模型4中在所有合法的词对齐上概率和不等于1。 这部分缺失的概率被分配到其他不合法的词对齐上。举例来说，如图\ref{fig:6-9}所示，“吃/早饭”和“have breakfast”之间的合法词对齐用直线表示 。但是在模型3和模型4中， 它们的概率和为$0.9<1$。 损失掉的概率被分配到像5和6这样的对齐上了（红色）。虽然IBM模型并不支持一对多的对齐，但是模型3和模型4把概率分配给这些“ 不合法”的词对齐上，因此也就产生所谓的缺陷。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -414,17 +414,17 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \subsection{词对齐及对称化}

-\parinterval IBM五个模型都是基于一个词对齐的假设\ \dash \ 一个源语言单词最多只能对齐到一个目标语言单词。这个约束大大降低了建模的难度。在法英翻译中一对多的对齐情况并不多见，这个假设带来的问题也不是那么严重。但是，在像汉英翻译这样的任务中，一个汉语单词对应多个英语单词的翻译很常见，这时IBM模型的词对齐假设就表现出了明显的问题。比如在翻译`` 我/会/试一试/。''\ $\to$ \ ``I will have a try .''时，IBM模型根本不能把单词``试一试''对齐到三个单词``have a try''，因而可能无法得到正确的翻译结果。
+\parinterval IBM五个模型都是基于一个词对齐的假设\ \dash \ 一个源语言单词最多只能对齐到一个目标语言单词。这个约束大大降低了建模的难度。在法英翻译中一对多的对齐情况并不多见，这个假设带来的问题也不是那么严重。但是，在像汉英翻译这样的任务中，一个汉语单词对应多个英语单词的翻译很常见，这时IBM模型的词对齐假设就表现出了明显的问题。比如在翻译“ 我/会/试一试/。”\ $\to$ \ “I will have a try .”时，IBM模型根本不能把单词“试一试”对齐到三个单词“have a try”，因而可能无法得到正确的翻译结果。

-\parinterval 本质上，IBM模型词对齐的``不完整''问题是IBM模型本身的缺陷。解决这个问题有很多思路。一种思路是，反向训练后，合并源语言单词，然后再正向训练。这里用汉英翻译为例来解释这个方法。首先反向训练，就是把英语当作待翻译语言，而把汉语当作目标语言进行训练（参数估计）。这样可以得到一个词对齐结果（参数估计的中间结果）。在这个词对齐结果里面，一个汉语单词可对应多个英语单词。之后，扫描每个英语句子，如果有多个英语单词对应同一个汉语单词，就把这些英语单词合并成一个英语单词。处理完之后，再把汉语当作源语言而把英语当作目标语言进行训练。这样就可以把一个汉语单词对应到合并的英语单词上。虽然从模型上看，还是一个汉语单词对应一个英语``单词''，但实质上已经把这个汉语单词对应到多个英语单词上了。训练完之后，再利用这些参数进行翻译（解码）时，就能把一个中文单词翻译成多个英文单词了。但是反向训练后再训练也存在一些问题。首先，合并英语单词会使数据变得更稀疏，训练不充分。其次，由于IBM模型的词对齐结果并不是高精度的，利用它的词对齐结果来合并一些英文单词可能造成严重的错误，比如：把本来很独立的几个单词合在了一起。因此，还要考虑实际需要和问题的严重程度来决定是否使用该方法。
+\parinterval 本质上，IBM模型词对齐的“不完整”问题是IBM模型本身的缺陷。解决这个问题有很多思路。一种思路是，反向训练后，合并源语言单词，然后再正向训练。这里用汉英翻译为例来解释这个方法。首先反向训练，就是把英语当作待翻译语言，而把汉语当作目标语言进行训练（参数估计）。这样可以得到一个词对齐结果（参数估计的中间结果）。在这个词对齐结果里面，一个汉语单词可对应多个英语单词。之后，扫描每个英语句子，如果有多个英语单词对应同一个汉语单词，就把这些英语单词合并成一个英语单词。处理完之后，再把汉语当作源语言而把英语当作目标语言进行训练。这样就可以把一个汉语单词对应到合并的英语单词上。虽然从模型上看，还是一个汉语单词对应一个英语“单词”，但实质上已经把这个汉语单词对应到多个英语单词上了。训练完之后，再利用这些参数进行翻译（解码）时，就能把一个中文单词翻译成多个英文单词了。但是反向训练后再训练也存在一些问题。首先，合并英语单词会使数据变得更稀疏，训练不充分。其次，由于IBM模型的词对齐结果并不是高精度的，利用它的词对齐结果来合并一些英文单词可能造成严重的错误，比如：把本来很独立的几个单词合在了一起。因此，还要考虑实际需要和问题的严重程度来决定是否使用该方法。

-\parinterval 另一种思路是双向对齐之后进行词对齐{\small\sffamily\bfseries{对称化}}\index{对称化}（Symmetrization）\index{Symmetrization}。这个方法可以在IBM词对齐的基础上获得对称的词对齐结果。思路很简单，用正向（汉语为源语言，英语为目标语言）和反向（汉语为目标语言，英语为源语言）同时训练。这样可以得到两个词对齐结果。然后利用一些启发性方法用这两个词对齐生成对称的结果（比如，取`` 并集''、``交集''等），这样就可以得到包含一对多和多对多的词对齐结果\upcite{och2003systematic}。比如，在基于短语的统计机器翻译中已经很成功地使用了这种词对齐信息进行短语的获取。直到今天，对称化仍然是很多自然语言处理系统中的一个关键步骤。
+\parinterval 另一种思路是双向对齐之后进行词对齐{\small\sffamily\bfseries{对称化}}\index{对称化}（Symmetrization）\index{Symmetrization}。这个方法可以在IBM词对齐的基础上获得对称的词对齐结果。思路很简单，用正向（汉语为源语言，英语为目标语言）和反向（汉语为目标语言，英语为源语言）同时训练。这样可以得到两个词对齐结果。然后利用一些启发性方法用这两个词对齐生成对称的结果（比如，取“ 并集”、“交集”等），这样就可以得到包含一对多和多对多的词对齐结果\upcite{och2003systematic}。比如，在基于短语的统计机器翻译中已经很成功地使用了这种词对齐信息进行短语的获取。直到今天，对称化仍然是很多自然语言处理系统中的一个关键步骤。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{``缺陷''问题}
+\subsection{“缺陷”问题}

 \parinterval IBM模型的缺陷是指翻译模型会把一部分概率分配给一些根本不存在的源语言字符串。如果用$\funp{P}(\textrm{well}|\seq{t})$表示$\funp{P}(\seq{s}| \seq{t})$在所有的正确的（可以理解为语法上正确的）$\seq{s}$上的和，即
 \begin{eqnarray}
@@ -440,9 +440,9 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \parinterval 本质上，模型3和模型4就是对应$\funp{P}({\textrm{failure}|\seq{t}})>0$的情况。这部分概率是模型损失掉的。有时候也把这类缺陷称为{\small\bfnew{物理缺陷}}\index{物理缺陷}（Physical Deficiency\index{Physical Deficiency}）或{\small\bfnew{技术缺陷}}\index{技术缺陷}（Technical Deficiency\index{Technical Deficiency}）。还有一种缺陷被称作{\small\bfnew{精神缺陷}}（Spiritual Deficiency\index{Spiritual Deficiency}）或{\small\bfnew{逻辑缺陷}}\index{逻辑缺陷}（Logical Deficiency\index{Logical Deficiency}），它是指$\funp{P}({\textrm{well}|\seq{t}}) + \funp{P}({\textrm{ill}|\seq{t}}) = 1$ 且$\funp{P}({\textrm{ill}|\seq{t}}) > 0$的情况。模型1 和模型2 就有逻辑缺陷。可以注意到，技术缺陷只存在于模型3 和模型4 中，模型1和模型2并没有技术缺陷问题。根本原因在于模型1和模型2的词对齐是从源语言出发对应到目标语言，$\seq{t}$到$\seq{s}$ 的翻译过程实际上是从单词$s_1$开始到单词$s_m$ 结束，依次把每个源语言单词$s_j$对应到唯一一个目标语言位置。显然，这个过程能够保证每个源语言单词仅对应一个目标语言单词。但是，模型3 和模型4中对齐是从目标语言出发对应到源语言，$\seq{t}$到$\seq{s}$的翻译过程从$t_1$开始$t_l$ 结束，依次把目标语言单词$t_i$生成的单词对应到某个源语言位置上。但是这个过程不能保证$t_i$中生成的单词所对应的位置没有被其他单词占用，因此也就产生了缺陷。

-\parinterval 这里还要强调的是，技术缺陷是模型3和模型4是模型本身的缺陷造成的，如果有一个``更好''的模型就可以完全避免这个问题。而逻辑缺陷几乎是不能从模型上根本解决的，因为对于任意一种语言都不能枚举所有的句子（$\funp{P}({\textrm{ill}|\seq{t}})$实际上是得不到的）。
+\parinterval 这里还要强调的是，技术缺陷是模型3和模型4是模型本身的缺陷造成的，如果有一个“更好”的模型就可以完全避免这个问题。而逻辑缺陷几乎是不能从模型上根本解决的，因为对于任意一种语言都不能枚举所有的句子（$\funp{P}({\textrm{ill}|\seq{t}})$实际上是得不到的）。

-\parinterval IBM的模型5已经解决了技术缺陷问题。但逻辑缺陷的解决很困难，因为即使对于人来说也很难判断一个句子是不是``良好''的句子。当然可以考虑用语言模型来缓解这个问题，不过由于在翻译的时候源语言句子都是定义``良好''的句子，$\funp{P}({\textrm{ill}|\seq{t}})$对$\funp{P}(\seq{s}| \seq{t})$的影响并不大。但用输入的源语言句子$\seq{s}$的``良好性''并不能解决技术缺陷，因为技术缺陷是模型的问题或者模型参数估计方法的问题。无论输入什么样的$\seq{s}$，模型3和模型4的技术缺陷问题都存在。
+\parinterval IBM的模型5已经解决了技术缺陷问题。但逻辑缺陷的解决很困难，因为即使对于人来说也很难判断一个句子是不是“良好”的句子。当然可以考虑用语言模型来缓解这个问题，不过由于在翻译的时候源语言句子都是定义“良好”的句子，$\funp{P}({\textrm{ill}|\seq{t}})$对$\funp{P}(\seq{s}| \seq{t})$的影响并不大。但用输入的源语言句子$\seq{s}$的“良好性”并不能解决技术缺陷，因为技术缺陷是模型的问题或者模型参数估计方法的问题。无论输入什么样的$\seq{s}$，模型3和模型4的技术缺陷问题都存在。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION

--- a/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
+++ b/Chapter7/Figures/figure-example-of-hypothesis-recombination.tex
@@ -3,12 +3,12 @@
 \begin{tikzpicture}
 \begin{scope}
 {
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
+\node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};

-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
@@ -20,17 +20,17 @@
 \draw [->,very thick,ublue] ([xshift=0.1em]pt2.south) -- ([xshift=-0.1em]h3.west);

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt0.south) -- ([xshift=-0.1em]h1.west);
 }
 }
 {
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};
+\node [anchor=north west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl4) at (h4.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h5.north west) {\scriptsize{{\color{white} \textbf{1}}}};
@@ -47,7 +47,7 @@
 \draw [->,very thick,ublue] ([xshift=0.1em]pt6.south) -- ([xshift=-0.1em]h8.west);

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 \draw [->,very thick,ublue] ([xshift=0.1em]pt5.south) -- ([xshift=-0.1em]h7.west);
@@ -66,12 +66,12 @@

 \begin{scope}[xshift = 16em, yshift = 0em]
 {
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
+\node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};

-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.2em,yshift=3.5em]h0.east) {\small{an}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.2em]h2.east) {\small{apple}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{2}}}};
@@ -87,10 +87,10 @@
 }
 }
 {
-\node [anchor=north west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};
+\node [anchor=north west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h4) at ([yshift=-9em]h0.south west) {\small{null}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h5) at ([xshift=2.2em]h4.east) {\small{he}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h6) at ([xshift=2.2em,yshift=3.5em]h4.east) {\small{it}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h8) at ([xshift=2.2em]h6.east) {\small{is not}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl4) at (h4.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl5) at (h5.north west) {\scriptsize{{\color{white} \textbf{1}}}};
@@ -113,14 +113,14 @@

 {
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em,opacity=0.3] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
-\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.3] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.3] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h1) at ([xshift=7em]h0.east) {\small{an apple}};
+\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1-2}}}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt1) at (h1.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.5}}}};
 }
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em,opacity=0.3] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
-\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.3] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
-\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.3] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em,opacity=0.6] (h7) at ([xshift=2.2em]h5.east) {\small{is not}};
+\node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black,opacity=0.6] (pt7) at (h7.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=.2}}}};
+\node [anchor=north west,inner sep=1.0pt,fill=black,opacity=0.6] (hl5) at (h7.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 }
 }

@@ -129,7 +129,6 @@
 \node [anchor=west] (l2) at ([xshift=1em, yshift=0.5em]h7.east) {\footnotesize{舍弃概率}};
 \node [anchor=west] (l21) at ([xshift=0em, yshift=-1em]l2.west) {\footnotesize{较低假设}};

-%\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em,opacity=0.7] (h1) at ([xshift=-1em,yshift=2em]h2.north) {重组假设};
 \node[anchor=north] (l1) at ([xshift=7.5em,yshift=-1em]h0.south) {\scriptsize{重组假设}};
 \node[anchor=north] (l2) at ([xshift=7.5em,yshift=-1em]h4.south) {\scriptsize{重组假设}};
 \node[anchor=north] (part2) at ([xshift=0em,yshift=-14em]h0.south){\scriptsize{（b）译文不同时的假设重组}};

--- a/Chapter7/Figures/figure-example-of-phrase-table.tex
+++ b/Chapter7/Figures/figure-example-of-phrase-table.tex
@@ -6,7 +6,7 @@
 \node [anchor=west] (s2) at ([yshift=-1.2em]s1.west) {\small{，悲伤 $\vert\vert\vert$ , sadness $\vert\vert\vert$ -1.946 -3.659 0 -3.709 1 0 $\vert\vert\vert$ 1 $\vert\vert\vert$ 0-0 1-1}};
 \node [anchor=west] (s3) at ([yshift=-1.2em]s2.west) {\small{，北京 等 $\vert\vert\vert$ , beijing , and other $\vert\vert\vert$ 0 -7.98 0 -3.84 1 0 $\vert\vert\vert$ 2 $\vert\vert\vert$ 0-0 1-1 2-2 2-3 2-4}};
 \node [anchor=west] (s4) at ([yshift=-1.2em]s3.west) {\small{，北京 及 $\vert\vert\vert$ , beijing , and $\vert\vert\vert$ -0.69 -1.45 -0.92 -4.80 1 0  $\vert\vert\vert$ 2 $\vert\vert\vert$ 0-0 1-1 2-2}};
-\node [anchor=west] (s5) at ([yshift=-1.2em]s4.west) {\small{一个 中国 $\vert\vert\vert$ one china $\vert\vert\vert$ 0 -1.725 0 -1.636 1 0 $\vert\vert\vert$ 2 $\vert\vert\vert$ 1-1 2-2}};
+\node [anchor=west] (s5) at ([yshift=-1.2em]s4.west) {\small{一个 世界 $\vert\vert\vert$ one world $\vert\vert\vert$ 0 -1.725 0 -1.636 1 0 $\vert\vert\vert$ 2 $\vert\vert\vert$ 1-1 2-2}};
 \node [anchor=west] (s7) at ([yshift=-1.1em]s5.west) {\small{...}};
 \node [anchor=west] (s6) at ([yshift=1.0em]s1.west) {\small{...}};
 \begin{pgfonlayer}{background}

--- a/Chapter7/Figures/figure-example-of-stack-decode.tex
+++ b/Chapter7/Figures/figure-example-of-stack-decode.tex
@@ -4,14 +4,14 @@
 \begin{tikzpicture}
 \begin{scope}
 {
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h0) at (0,0) {\scriptsize{null}};
+\node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h0) at (0,0) {\scriptsize{null}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=1}}}};
 }
 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h13) at ([xshift=2.1em,yshift=6em]h0.east) {\scriptsize{there is}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h13) at ([xshift=2.1em,yshift=6em]h0.east) {\scriptsize{there is}};
 \node [anchor=west,inner sep=2pt,minimum height=2em,minimum width=3em] (h12) at ([xshift=2.1em,yshift=3.5em]h0.east) {\small{\textbf{...}}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h1) at ([xshift=2.1em]h0.east) {\scriptsize{tabel}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h1) at ([xshift=2.1em]h0.east) {\scriptsize{tabel}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl3) at (h13.north west) {\scriptsize{{\color{white} \textbf{3}}}};
@@ -20,12 +20,12 @@
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt1) at (h1.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.2}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt3) at (h13.east) {\scriptsize{{\color{white} \textbf{$\funp{P}$=.5}}}};

-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.1em]h1.east) {\scriptsize{have}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h2) at ([xshift=2.1em]h1.east) {\scriptsize{have}};
 \node [anchor=west,inner sep=2pt,minimum height=2em,minimum width=3em] (h22) at ([xshift=2.1em]h12.east) {\small{\textbf{...}}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h23) at ([xshift=2.1em]h13.east) {\scriptsize{an}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.1em]h2.east) {\scriptsize{there is}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h23) at ([xshift=2.1em]h13.east) {\scriptsize{an}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h3) at ([xshift=2.1em]h2.east) {\scriptsize{there is}};
 \node [anchor=west,inner sep=2pt,minimum height=2em,minimum width=3em] (h32) at ([xshift=2.1em]h22.east) {\small{\textbf{...}}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3em] (h33) at ([xshift=2.1em]h23.east) {\scriptsize{an apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3em] (h33) at ([xshift=2.1em]h23.east) {\scriptsize{an apple}};

 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{3}}}};
 \node [anchor=north west,inner sep=1.0pt,fill=black] (hl23) at (h23.north west) {\scriptsize{{\color{white} \textbf{4}}}};
@@ -55,14 +55,14 @@
 \draw [->,thick,red] (h1.north).. controls +(60:0.5) and +(120:0.5) .. (h2.north);
 \draw [->,thick,red] (h2.north).. controls +(60:0.5) and +(120:0.5) .. (h3.north);
 }
-\node [anchor=south east] (wtranslabel) at ([xshift=-1.5em,yshift=-2.2em]h0.south west) {\small{\textbf{：假设堆栈}}};
+\node [anchor=south east] (wtranslabel) at ([xshift=-2.4em,yshift=-2.26em]h0.south west) {\small{\textbf{假设堆栈}}};
 \node [anchor=east,inner sep=2pt,fill=blue!10,minimum height=1em,minimum width=2em] (stacklabel) at ([xshift=-0.1em]wtranslabel.west) {};
 {
-\node [anchor=east] (line1) at ([xshift=-1.0em,yshift=0.4em]h0.west) {\small{0号栈包含空假设}};
+\node [anchor=east] (line1) at ([xshift=-1.0em,yshift=0.45em]h0.west) {\small{0号栈包含空假设}};
 }
 {
-\node [anchor=east] (line2) at ([xshift=-2.3em,yshift=0.5em]h13.west) {\small{通过假设扩展产生新的假设}};
-\node [anchor=north west] (line3) at ([yshift=0.1em]line2.south west) {\small{并不断的被存入假设堆栈中}};
+\node [anchor=east] (line2) at ([xshift=-2.3em,yshift=0.44em]h13.west) {\small{通过假设扩展产生新的假设}};
+\node [anchor=north west] (line3) at ([yshift=0.1em]line2.south west) {\small{并不断地被存入假设堆栈中}};
 }
 \begin{pgfonlayer}{background}
 {

--- a/Chapter7/Figures/figure-get-word-alignment.tex
+++ b/Chapter7/Figures/figure-get-word-alignment.tex
@@ -88,8 +88,8 @@
 \node[align=center,elementnode,minimum size=0.3cm,inner sep=0.1pt,fill=red!50] (lc4) at (c22) {};
 \node[align=center,elementnode,minimum size=0.3cm,inner sep=0.1pt,fill=blue!50] (lc5) at (c30) {};

-\node[anchor=north] (l1) at ([xshift=0.5em,yshift=-0.5em]a10.south) {\footnotesize{S - T}};
-\node[anchor=north] (l2) at ([xshift=0.5em,yshift=-0.5em]b10.south) {\footnotesize{T - S}};
+\node[anchor=north] (l1) at ([xshift=0.5em,yshift=-0.5em]a10.south) {\footnotesize{$\seq{s}$ - $\seq{t}$}};
+\node[anchor=north] (l2) at ([xshift=0.5em,yshift=-0.5em]b10.south) {\footnotesize{$\seq{t}$ - $\seq{s}$}};
 \node[anchor=north] (l3) at ([xshift=0.5em,yshift=-0.5em]c10.south) {\footnotesize{交集/并集}};

 \end{scope}

--- a/Chapter7/Figures/figure-reorder-base-distance.tex
+++ b/Chapter7/Figures/figure-reorder-base-distance.tex
@@ -34,8 +34,8 @@
 \node[anchor=north] (d1) at ([xshift=-0.1em,yshift=-0.1em]distance.south) {+4};
 \node[anchor=north] (d2) at ([yshift=-1.8em]d1.south) {-5};

-\node[anchor=north west,fill=blue!20] (m1) at ([xshift=-1em,yshift=-0.0em]t1.south west) {\small{$start_1\ \ -\ \ end_{0}\ \ -\ \ 1$\quad =\quad 5\ -\ 0\ -\ 1}};
-\node[anchor=north west,fill=blue!20] (m2) at ([xshift=-1em,yshift=-0.0em]t2.south west) {\small{$start_2\ \ -\ \  end_{1}\ \ -\ \ 1$\quad =\quad 1\ -\ 5\ -\ 1}};
+\node[anchor=north west,fill=blue!20] (m1) at ([xshift=-1em,yshift=-0.0em]t1.south west) {\small{$\rm{start}_1\ \ -\ \ \rm{end}_{0}\ \ -\ \ 1$\quad =\quad 5\ -\ 0\ -\ 1}};
+\node[anchor=north west,fill=blue!20] (m2) at ([xshift=-1em,yshift=-0.0em]t2.south west) {\small{$\rm{start}_2\ \ -\ \  \rm{end}_{1}\ \ -\ \ 1$\quad =\quad 1\ -\ 5\ -\ 1}};

 \draw[-] ([xshift=0.08in]target.south west)--([xshift=2.4in]target.south west);


--- a/Chapter7/Figures/figure-translation-hypothesis-extension.tex
+++ b/Chapter7/Figures/figure-translation-hypothesis-extension.tex
@@ -4,15 +4,15 @@
 \begin{tikzpicture}
 \begin{scope}
 {
-\node [anchor=north,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3.5em] (h0) at (0,0) {\small{null}};
+\node [anchor=north,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h0) at (0,0) {\small{null}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl0) at (h0.north west) {\scriptsize{{\color{white} \textbf{0}}}};
 \node [anchor=north,rotate=90,inner sep=1pt,minimum width=2em,fill=black] (pt0) at (h0.east) {\footnotesize{{\color{white} \textbf{$\funp{P}$=1}}}};
 }

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3.5em] (h1) at ([xshift=3em]h0.east) {\small{on}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3.5em] (h2) at ([xshift=3em,yshift=3em]h0.east) {\small{table}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3.5em] (h3) at ([xshift=3em,yshift=-3em]h0.east) {\small{there is}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h1) at ([xshift=3em]h0.east) {\small{on}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h2) at ([xshift=3em,yshift=3em]h0.east) {\small{table}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h3) at ([xshift=3em,yshift=-3em]h0.east) {\small{there is}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl1) at (h1.north west) {\scriptsize{{\color{white} \textbf{2}}}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl2) at (h2.north west) {\scriptsize{{\color{white} \textbf{1}}}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl3) at (h3.north west) {\scriptsize{{\color{white} \textbf{3}}}};
@@ -26,11 +26,11 @@
 }

 {
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3.5em] (h4) at ([xshift=3em,yshift=-1.8em]h3.east) {\small{one}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3.5em] (h5) at ([xshift=3em,yshift=1.2em]h3.east) {\small{an apple}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=3.5em] (h6) at ([xshift=3em,yshift=1.2em]h1.east) {\small{table}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=4em] (h7) at ([xshift=3em,yshift=1.2em]h5.east) {\small{on the table}};
-\node [anchor=west,inner sep=2pt,fill=red!20,minimum height=2em,minimum width=4.6em] (h8) at ([xshift=3em,yshift=-2em]h5.east) {\small{\ \;apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h4) at ([xshift=3em,yshift=-1.8em]h3.east) {\small{one}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h5) at ([xshift=3em,yshift=1.2em]h3.east) {\small{an apple}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=3.5em] (h6) at ([xshift=3em,yshift=1.2em]h1.east) {\small{table}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=4em] (h7) at ([xshift=3em,yshift=1.2em]h5.east) {\small{on the table}};
+\node [anchor=west,inner sep=2pt,fill=red!40,minimum height=2em,minimum width=4.6em] (h8) at ([xshift=3em,yshift=-2em]h5.east) {\small{\ \;apple}};

 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl4) at (h4.north west) {\scriptsize{{\color{white} \textbf{4}}}};
 \node [anchor=north west,inner sep=1.5pt,fill=black] (hl5) at (h5.north west) {\scriptsize{{\color{white} \textbf{4-5}}}};

--- a/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
+++ b/Chapter7/Figures/figure-word-and-phrase-translation-regard-as-path.tex
@@ -143,7 +143,7 @@
 }

 {
-\node [anchor=north west] (wtranslabel) at ([yshift=-4em]t15.south west) {\scriptsize{翻译路径（仅含有单词）}};
+\node [anchor=north west] (wtranslabel) at ([yshift=-4em]t15.south west) {\scriptsize{翻译路径（仅包含单词）}};
 \draw [->,ultra thick,red,line width=1.5pt,opacity=0.7] ([xshift=0.2em]wtranslabel.east) -- ([xshift=1.2em]wtranslabel.east);
 }


--- a/Chapter7/chapter7.tex
+++ b/Chapter7/chapter7.tex
@@ -21,7 +21,7 @@
 %	CHAPTER 7
 %----------------------------------------------------------------------------------------

-\chapter{基于短语的模型}
+\chapter{基于短语的翻译模型}

 \parinterval 机器翻译的一个基本问题是要定义翻译的基本单元是什么。比如，可以像{\chapterfive}介绍的那样，以单词为单位进行翻译，即把句子的翻译看作是单词之间对应关系的一种组合。基于单词的模型是符合人类对翻译问题的认知的，因为单词本身就是人类加工语言的一种基本单元。另一方面，在进行翻译时也可以使用一些更“复杂”的知识。比如，很多词语间的搭配需要根据语境的变化进行调整，而且对于句子结构的翻译往往需要更上层的知识，如句法知识。因此，在对单词翻译进行建模的基础上，需要探索其他类型的翻译知识，使得搭配和结构翻译等问题可以更好地被建模。

@@ -39,7 +39,7 @@
 %    NEW SUB-SECTION
 %----------------------------------------------------------------------------------------

-\subsection{基于词的翻译所带来的问题}
+\subsection{词的翻译带来的问题}

 \parinterval 首先，回顾一下基于单词的统计翻译模型是如何完成翻译的。图\ref{fig:7-1}展示了一个实例。其中，左侧是一个单词的“翻译表”，它记录了源语言（汉语）单词和目标语言（英语）单词之间的对应关系，以及这种对应的可能性大小（用$\funp{P}$表示）。在翻译时，会使用这些单词一级的对应，生成译文。图\ref{fig:7-1}右侧就展示了一个基于词的模型生成的翻译结果，其中$\seq{s}$和$\seq{t}$分别表示源语言和目标语言句子，单词之间的连线表示两个句子中单词一级的对应。

@@ -539,13 +539,13 @@ d = {(\bar{s}_{\bar{a}_1},\bar{t}_1)} \circ {(\bar{s}_{\bar{a}_2},\bar{t}_2)} \c

 \parinterval 基于距离的调序的一个基本假设是：语言的翻译基本上都是顺序的，也就是，译文单词出现的顺序和源语言单词的顺序基本上是一致的。反过来说，如果译文和源语言单词（或短语）的顺序差别很大，就认为出现了调序。

-\parinterval 基于距离的调序方法的核心思想就是度量当前翻译结果与顺序翻译之间的差距。对于译文中的第$i$个短语，令$start_i$表示它所对应的源语言短语中第一个词所在的位置，$end_i$表示它所对应的源语言短语中最后一个词所在的位置。于是，这个短语（相对于前一个短语）的调序距离为：
+\parinterval 基于距离的调序方法的核心思想就是度量当前翻译结果与顺序翻译之间的差距。对于译文中的第$i$个短语，令$\rm{start}_i$表示它所对应的源语言短语中第一个词所在的位置，$\rm{end}_i$表示它所对应的源语言短语中最后一个词所在的位置。于是，这个短语（相对于前一个短语）的调序距离为：
 \begin{eqnarray}
-dr = start_i-end_{i-1}-1
+dr = {\rm{start}}_i-{\rm{end}}_{i-1}-1
 \label{eq:7-15}
 \end{eqnarray}

-\parinterval 在图\ref{fig:7-20}的例子中，“the apple”所对应的调序距离为4，“on the table”所对应的调序距离为$-5$。显然，如果两个源语短语按顺序翻译，则$start_i = end_{i-1} + 1$，这时调序距离为0。
+\parinterval 在图\ref{fig:7-20}的例子中，“the apple”所对应的调序距离为4，“on the table”所对应的调序距离为$-5$。显然，如果两个源语短语按顺序翻译，则$\rm{start}_i = \rm{end}_{i-1} + 1$，这时调序距离为0。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -898,7 +898,7 @@ dr = start_i-end_{i-1}-1
 %----------------------------------------------------------------------------------------

 \sectionnewpage
-\section{小节及拓展阅读}\label{section-7.8}
+\section{小结及拓展阅读}\label{section-7.8}

 \parinterval 统计机器翻译模型是近三十年内自然语言处理的重要里程碑之一。其统计建模的思想长期影响着自然语言处理的研究。无论是前面介绍的基于单词的模型，还是本章介绍的基于短语的模型，甚至后面即将介绍的基于句法的模型，大家都在尝试回答：究竟应该用什么样的知识对机器翻译进行统计建模？不过，这个问题至今还没有确定的答案。但是，显而易见，统计机器翻译为机器翻译的研究提供了一种范式，即让计算机用概率化的 “知识” 描述翻译问题。这些 “ 知识” 体现在统计模型的结构和参数中，并且可以从大量的双语和单语数据中自动学习。这种建模思想在今天的机器翻译研究中仍然随处可见。


--- a/Chapter8/Figures/figure-an-example-of-phrase-system.tex
+++ b/Chapter8/Figures/figure-an-example-of-phrase-system.tex
@@ -4,19 +4,19 @@
 \begin{tikzpicture}
 \begin{scope}
 \node [anchor=east] (shead) at (0,0) {源语:};
-\node [anchor=west] (swords) at (shead.east) {澳洲\ \ 是\ \ 与\ \ 北韩\ \ 有\ \ 邦交\ \ 的\ \ 少数\ \ 国家\ \ 之一};
+\node [anchor=west] (swords) at (shead.east) {巴基斯坦\ \ 是\ \ 与\ \ 中国\ \ 有\ \ 邦交\ \ 的\ \ 多数\ \ 国家\ \ 之一};
 \node [anchor=north east] (thead) at ([yshift=-0.8em]shead.south east) {短语系统:};
-\node [anchor=west] (twords) at (thead.east) {Australia is diplomatic relations with North Korea};
-\node [anchor=north west] (twords2) at ([yshift=-0.2em]twords.south west) {is one of the few countries};
+\node [anchor=west] (twords) at (thead.east) {Pakistan  is diplomatic relations with China};
+\node [anchor=north west] (twords2) at ([yshift=-0.2em]twords.south west) {is one of the many countries};
 \node [anchor=north east] (rhead) at ([yshift=-2.2em]thead.south east) {参考译文:};
-\node [anchor=west] (rwords) at (rhead.east) {Australia is one of the few countries that have};
-\node [anchor=north west] (rwords2) at ([yshift=-0.2em]rwords.south west) {diplomatic relations with North Korea};
+\node [anchor=west] (rwords) at (rhead.east) {Pakistan  is one of the many countries that have};
+\node [anchor=north west] (rwords2) at ([yshift=-0.2em]rwords.south west) {diplomatic relations with China};

 \begin{pgfonlayer}{background}
 {
-\draw[fill=red!20,draw=white] ([xshift=-5.4em]twords.north) rectangle ([xshift=10.8em]twords.south);
-\draw[fill=blue!20,draw=white] ([xshift=-4.6em]twords2.north) rectangle ([xshift=6.1em]twords2.south);
-\node [anchor=south east,inner sep=1pt,fill=black] (l1) at ([xshift=10.8em]twords.south) {\tiny{{\color{white} 1}}};
+\draw[fill=red!20,draw=white] ([xshift=-5.1em]twords.north) rectangle ([xshift=9.1em]twords.south);
+\draw[fill=blue!20,draw=white] ([xshift=-4.9em]twords2.north) rectangle ([xshift=6.1em]twords2.south);
+\node [anchor=south east,inner sep=1pt,fill=black] (l1) at ([xshift=9.1em]twords.south) {\tiny{{\color{white} 1}}};
 \node [anchor=south east,inner sep=1pt,fill=black] (l2) at ([xshift=6.1em]twords2.south) {\tiny{{\color{white} 2}}};
 }
 \end{pgfonlayer}

--- a/Chapter8/Figures/figure-combination-of-translation-with-different-rules.tex
+++ b/Chapter8/Figures/figure-combination-of-translation-with-different-rules.tex
@@ -7,34 +7,33 @@
 \node[anchor=north] (q1) at (0,0) {\scriptsize\sffamily\bfseries{输入字符串：}};
 \node[anchor=west] (q2) at ([xshift=0em,yshift=-2em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};

-%\node[anchor=north,fill=blue!20,minimum height=1em,minimum width=1em] (f1) at ([xshift=-4.1em,yshift=-0.8em]q2.south) {};

-\node[anchor=north,fill=blue!20,minimum height=4em,minimum width=1em] (f1) at ([xshift=2.2em,yshift=-0.7em]q2.south) {};
+\node[anchor=north,fill=blue!20,minimum height=4em,minimum width=1em] (f1) at ([xshift=2.7em,yshift=-0.7em]q2.south) {};

 \node[anchor=east] (n1) at ([xshift=1em,yshift=-2em]q2.west) {\scriptsize\sffamily\bfseries{匹配规则：}};

-\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};
+\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$\seq{X} \to  \langle\ \seq{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \seq{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};

-\node[anchor=west] (n3) at ([xshift=0em,yshift=-1.5em]n2.west) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle$}};
+\node[anchor=west] (n3) at ([xshift=0em,yshift=-1.5em]n2.west) {\scriptsize{$\seq{X} \to  \langle\ \seq{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \seq{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle$}};

-\node[anchor=west] (n4) at ([xshift=0em,yshift=-1.5em]n3.west) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};
+\node[anchor=west] (n4) at ([xshift=0em,yshift=-1.5em]n3.west) {\scriptsize{$\seq{X} \to  \langle\ \seq{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \seq{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};

 \draw[decorate,decoration={mirror,brace}]([xshift=0.5em,yshift=-1em]q2.west) --([xshift=7em,yshift=-1em]q2.west) node [xshift=0em,yshift=-1em,align=center](label1) {};	

 {\scriptsize
 \node[anchor=west] (h1) at ([xshift=1em,yshift=-15em]q2.west) {{Span[0,3]下的翻译假设：}};
-\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{X：imports and exports}};
-\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h2.west) {{S：the import and export}};
+\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{$\seq{X}$：imports and exports}};
+\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h2.west) {{$\seq{S}$：the import and export}};
 }

 {\scriptsize
-\node[anchor=west] (h21) at ([xshift=9em,yshift=5.0em]h1.east) {{替换$\textrm{X}_1$后生成的翻译假设：}};
-\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{X：imports and exports have drastically fallen}};
-\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{X：the import and export have drastically fallen}};
-\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{X：imports and exports have drastically fallen}};
-\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{X：the import and export have drastically fallen}};
-\node[anchor=west] (h26) at ([xshift=0em,yshift=-1.3em]h25.west) {{X：imports and exports has drastically fallen}};
-\node[anchor=west] (h27) at ([xshift=0em,yshift=-1.3em]h26.west) {{X：the import and export has drastically fallen}};
+\node[anchor=west] (h21) at ([xshift=9em,yshift=5.0em]h1.east) {{替换$\seq{X}_1$后生成的翻译假设：}};
+\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{$\seq{X}$：imports and exports have drastically fallen}};
+\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{$\seq{X}$：the import and export have drastically fallen}};
+\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{$\seq{X}$：imports and exports have drastically fallen}};
+\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{$\seq{X}$：the import and export have drastically fallen}};
+\node[anchor=west] (h26) at ([xshift=0em,yshift=-1.3em]h25.west) {{$\seq{X}$：imports and exports has drastically fallen}};
+\node[anchor=west] (h27) at ([xshift=0em,yshift=-1.3em]h26.west) {{$\seq{X}$：the import and export has drastically fallen}};
 }

 \node [rectangle,inner sep=0.1em,rounded corners=1pt,draw] [fit = (h1) (h2) (h6)] (gl1) {};

--- a/Chapter8/Figures/figure-derivation-of-hierarchical-phrase-and-tree-structure-model.tex
+++ b/Chapter8/Figures/figure-derivation-of-hierarchical-phrase-and-tree-structure-model.tex
@@ -4,22 +4,22 @@
 {\scriptsize
 \begin{scope}[sibling distance=0pt, level distance = 27pt]
 {\scriptsize
-\Tree[.\node(n1){\textbf{S}};
-        [.\node(n2){\textbf{S}};
-	        [.\node(n3){\textbf{S}};
-		        [.\node(n4){\textbf{S}};
-		            [.\node(n5){\textbf{X}}; \node(cw1){但}; ]
+\Tree[.\node(n1){\seq{S}};
+        [.\node(n2){\seq{S}};
+	        [.\node(n3){\seq{S}};
+		        [.\node(n4){\seq{S}};
+		            [.\node(n5){\seq{X}}; \node(cw1){但}; ]
 		        ]
-		        [.\node(n6){\textbf{X}}; \node(cw2){美国}; ]
+		        [.\node(n6){\seq{X}}; \node(cw2){美国}; ]
 		    ]
-	        [.\node(n7){\textbf{X}};
+	        [.\node(n7){\seq{X}};
 	            [. \node(cw3){并没有}; ]
 	            [. \node(cw4){执行}; ]
 	        ]
        ]
-        [.\node(n8){\textbf{X}};
+        [.\node(n8){\seq{X}};
            [. \node(cw5){世贸}; ]
-            [.\node(n9){\textbf{X}};
+            [.\node(n9){\seq{X}};
                [. \node(cw6){组织}; ]
                [. \node(cw7){的}; ]
            ]
@@ -44,31 +44,31 @@
 \draw[-] (rules.south west)--([xshift=1.8in]rules.south west);

 \node[anchor=north west] (r1) at ([yshift=-0.2em]rules.south west) {$r_1$};
-\node[anchor=west] (rc1) at ([xshift=0.0em]r1.east) {$\textrm{S} \; \to \; \langle\ \textrm{X}_1, \; \; \textrm{X}_1\ \rangle$};
+\node[anchor=west] (rc1) at ([xshift=0.0em]r1.east) {$\seq{S} \; \to \; \langle\ \seq{X}_1, \; \; \seq{X}_1\ \rangle$};

 \node[anchor=north west] (r2) at ([yshift=-0.4em]r1.south west) {$r_2$};
-\node[anchor=west] (rc2) at ([xshift=0em]r2.east) {$\textrm{S} \; \to \; \langle\  \textrm{S}_1 \; \textrm{X}_2, \; \; \textrm{S}_1 \; \textrm{X}_2\ \rangle$};
+\node[anchor=west] (rc2) at ([xshift=0em]r2.east) {$\seq{S} \; \to \; \langle\  \seq{S}_1 \; \seq{X}_2, \; \; \seq{S}_1 \; \seq{X}_2\ \rangle$};

 \node[anchor=north west] (r3) at ([yshift=-0.4em]r2.south west) {$r_3$};
-\node[anchor=west] (rc3) at ([xshift=0em]r3.east) {$\textrm{X} \; \to \; \langle\  \text{但}, \; \; \text{but}\ \rangle$};
+\node[anchor=west] (rc3) at ([xshift=0em]r3.east) {$\seq{X} \; \to \; \langle\  \text{但}, \; \; \text{but}\ \rangle$};

 \node[anchor=north west] (r4) at ([yshift=-0.4em]r3.south west) {$r_4$};
-\node[anchor=west] (rc4) at ([xshift=0em]r4.east) {$\textrm{X} \; \to \; \langle\  \text{美国}, \; \; \text{the U.S.}\ \rangle$};
+\node[anchor=west] (rc4) at ([xshift=0em]r4.east) {$\seq{X} \; \to \; \langle\  \text{美国}, \; \; \text{the U.S.}\ \rangle$};

 \node[anchor=north west] (r5) at ([yshift=-0.4em]r4.south west) {$r_5$};
-\node[anchor=west] (rc5) at ([xshift=0em]r5.east) {$\textrm{X} \; \to \; \langle\  \text{并没有} \; \text{执行}, \; \; \text{}$};
+\node[anchor=west] (rc5) at ([xshift=0em]r5.east) {$\seq{X} \; \to \; \langle\  \text{并没有} \; \text{执行}, \; \; \text{}$};

 \node[anchor=north west] (r52) at ([yshift=-0.4em]r5.south west) {{\color{white} $r_5$}};
 \node[anchor=west] (rc52) at ([xshift=2.9em]r52.east) {$\text{has not implemented}\ \rangle$};

 \node[anchor=north west] (r6) at ([yshift=-0.4em]r52.south west) {$r_6$};
-\node[anchor=west] (rc6) at ([xshift=0em]r6.east) {$\textrm{X} \; \to \; \langle\ \text{世贸} \; \textrm{X}_1 \; \text{裁决}, $};
+\node[anchor=west] (rc6) at ([xshift=0em]r6.east) {$\seq{X} \; \to \; \langle\ \text{世贸} \; \seq{X}_1 \; \text{裁决}, $};

 \node[anchor=north west] (r61) at ([yshift=-0.4em]r6.south west) {{\color{white} $r_6$}};
-\node[anchor=west] (rc61) at ([xshift=2.9em]r61.east) {$\text{the decision} \; \textrm{X}_1 \; \text{the WTO}\ \rangle$};
+\node[anchor=west] (rc61) at ([xshift=2.9em]r61.east) {$\text{the decision} \; \seq{X}_1 \; \text{the WTO}\ \rangle$};

 \node[anchor=north west] (r7) at ([yshift=-0.4em]r61.south west) {$r_7$};
-\node[anchor=west] (rc7) at ([xshift=0em]r7.east) {$\textrm{X} \; \to \; \langle\ \text{组织 的}, \; \; \text{of}\ \rangle$};
+\node[anchor=west] (rc7) at ([xshift=0em]r7.east) {$\seq{X} \; \to \; \langle\ \text{组织 的}, \; \; \text{of}\ \rangle$};
 \end{scope}

 \node[anchor=south] (l1) at ([xshift=-9em,yshift=1em]rules.north) {\normalsize{${d = r_3}{\circ r_1}{ \circ r_4}{ \circ r_2}{ \circ r_5}{ \circ r_2}{ \circ r_7}{ \circ r_6}{ \circ r_2}$}};

--- a/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
+++ b/Chapter8/Figures/figure-different-representations-of-syntax-tree.tex
@@ -15,7 +15,7 @@
 \end{scope}
 	
 \node [anchor=north west] (cap1) at (-1.5em,-1in) {{(a) 树状表示}};
-\node [anchor=west] (cap2) at ([xshift=0.5in]cap1.east) {{(b) 序列表示(缩进)}};
+\node [anchor=west] (cap2) at ([xshift=0.5in]cap1.east) {{(b) 序列表示（缩进）}};
 \node [anchor=west] (cap3) at ([xshift=0.5in]cap2.east) {{(c) 序列表示}};
 }
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter8/Figures/figure-example-of-translation-use-syntactic-structure.tex
+++ b/Chapter8/Figures/figure-example-of-translation-use-syntactic-structure.tex
@@ -8,16 +8,18 @@

 {\scriptsize

-\node[anchor=west] (ref) at (0,0) {{\sffamily\bfseries{人工翻译:}} {\red{After}} North Korea demanded concessions from U.S. again before the start of a new round of six-nation talks ...};

-\node[anchor=north west] (hifst) at ([yshift=-0.3em]ref.south west) {{\sffamily\bfseries{机器翻译:}} \blue{In}\black{} the new round of six-nation talks on North Korea again demanded that U.S. in the former promise ...};
+
+\node[anchor=west] (ref) at (0,0) {{\sffamily\bfseries{人工翻译:}} {\red{After}} the school team won the Championship of the China University Basketball Association for the first time ...};
+
+\node[anchor=north west] (hifst) at ([yshift=-0.3em]ref.south west) {{\sffamily\bfseries{机器翻译:}} \blue{In}\black{} the school team won the Chinese College Basketball League Championship for the first time ...};

 {
-\node[anchor=north west] (synhifst) at ([yshift=-0.3em]hifst.south west) {\sffamily\bfseries{更好?:}};
+\node[anchor=north west] (synhifst) at ([yshift=-0.2em]hifst.south west) {\sffamily\bfseries{更好?:}};

-\node[anchor=west, fill=red!20!white, inner sep=0.3em] (synhifstpart1) at ([xshift=-0.5em]synhifst.east) {After};
+\node[anchor=west, fill=red!20, inner sep=0.3em] (synhifstpart1) at ([xshift=-0.3em]synhifst.east) {After};

-\node[anchor=west, fill=blue!20!white, inner sep=0.25em] (synhifstpart2) at ([xshift=0.1em,yshift=-0.05em]synhifstpart1.east) {North Korea again demanded that U.S. promised concessions before the new round of six-nation talks};
+\node[anchor=west, fill=blue!20, inner sep=0.25em] (synhifstpart2) at ([xshift=0.1em,yshift=-0.05em]synhifstpart1.east) {the school team won the Championship of the China University Basketball Association for the first time};

 \node[anchor=west] (synhifstpart3) at ([xshift=-0.2em]synhifstpart2.east) {...};
 }
@@ -25,9 +27,9 @@
 \node [anchor=west] (inputlabel) at ([yshift=-0.4in]synhifst.west) {\sffamily\bfseries{输入:}};

 \node [anchor=west,minimum height=12pt] (inputseg1) at (inputlabel.east) {在$_1$ };
-\node [anchor=west,minimum height=12pt] (inputseg2) at ([xshift=0.2em]inputseg1.east) {北韩$_2$ 再度$_3$ 要求$_4$ 美国$_5$ 于$_6$ 新$_7$ 回合$_8$ 六$_9$ 国$_{10}$ 会谈$_{11}$ 前$_{12}$ 承诺$_{13}$ 让步$_{14}$};
-\node [anchor=west,minimum height=12pt] (inputseg3) at ([xshift=0.2em]inputseg2.east) {后$_{15}$};
-\node [anchor=west,minimum height=12pt] (inputseg4) at ([xshift=0.2em]inputseg3.east) {,$_{16}$};
+\node [anchor=west,minimum height=12pt] (inputseg2) at ([xshift=0.2em]inputseg1.east) {学校$_2$ 球队$_3$ 首次$_4$ 夺得$_5$ 中国$_6$ 大学生$_7$ 篮球$_8$ 联赛$_9$ 冠军$_{10}$};
+\node [anchor=west,minimum height=12pt] (inputseg3) at ([xshift=0.2em]inputseg2.east) {后$_{11}$};
+\node [anchor=west,minimum height=12pt] (inputseg4) at ([xshift=0.2em]inputseg3.east) {,$_{12}$};
 \node [anchor=west,minimum height=12pt] (inputseg5) at ([xshift=0.2em]inputseg4.east) {...};

 {
@@ -45,17 +47,17 @@
 }

 {
-\node [anchor=north east,align=left] (nolimitlabel) at (synlabel1.south west) {\tiny{短语结构树很容易捕捉}\\\tiny{这种介词短语结构}};
+\node [anchor=north east,align=left] (nolimitlabel) at (synlabel1.south west) {\scriptsize{短语结构树很容易捕捉}\\\scriptsize{这种介词短语结构}};
 }

 {
 \node [anchor=west,minimum height=12pt,fill=red!20] (inputseg1) at (inputlabel.east) {在$_1$ };
-\node [anchor=west,minimum height=12pt,fill=blue!20] (inputseg2) at ([xshift=0.2em]inputseg1.east) {北韩$_2$ 再度$_3$ 要求$_4$ 美国$_5$ 于$_6$ 新$_7$ 回合$_8$ 六$_9$ 国$_{10}$ 会谈$_{11}$ 前$_{12}$ 承诺$_{13}$ 让步$_{14}$};
+\node [anchor=west,minimum height=12pt,fill=blue!20] (inputseg2) at ([xshift=0.2em]inputseg1.east) {学校$_2$ 球队$_3$ 首次$_4$ 夺得$_5$ 中国$_6$ 大学生$_7$ 篮球$_8$ 联赛$_9$ 冠军$_{10}$};
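-\node [anchor=west,minimum height=12pt,fill=red!20] (inputseg3) at ([xshift=0.2em]inputseg2.east) {后$_{15}$};
+\node [anchor=west,minimum height=12pt,fill=red!20] (inputseg3) at ([xshift=0.2em]inputseg2.east) {后$_{11}$};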

 \path [draw,->,dashed] (inputseg1.north) .. controls +(north:0.2) and +(south:0.3) ..  ([xshift=1em]synhifstpart1.south);
 \path [draw,->,dashed] (inputseg3.north) .. controls +(north:0.2) and +(south:0.6) ..  ([xshift=1em]synhifstpart1.south);
-\path [draw,->,dashed] ([xshift=-0.5in]inputseg2.north) --  ([xshift=-0.6in]synhifstpart2.south);
+\path [draw,->,dashed] ([xshift=-0.8in]inputseg2.north) --  ([xshift=1.9in]synhifstpart2.south);
 }

 }

--- a/Chapter8/Figures/figure-examples-of-translation-with-complex-ordering.tex
+++ b/Chapter8/Figures/figure-examples-of-translation-with-complex-ordering.tex
@@ -9,9 +9,9 @@

 \node[anchor=west] (ref) at (0,0) {{\sffamily\bfseries{参考答案：}} The Chinese star performance troupe presented a wonderful Peking opera as well as singing and dancing };

-\node[anchor=north west] (ref2) at (ref.south west) {{\color{white} \sffamily\bfseries{Reference:}} performance to Hong Kong audience .};
+\node[anchor=north west] (ref2) at (ref.south west) {{\color{white} \sffamily\bfseries{Reference:}} performance to the national audience .};

-\node[anchor=north west] (hifst) at (ref2.south west) {{\sffamily\bfseries{层次短语系统：}} Star troupe of China, highlights of Peking opera and dance show to the audience of Hong Kong .};
+\node[anchor=north west] (hifst) at (ref2.south west) {{\sffamily\bfseries{层次短语系统：}} Star troupe of China, highlights of Peking opera and dance show to the audience of the national .};

 \node[anchor=north west] (synhifst) at (hifst.south west) {{\sffamily\bfseries{句法系统：}} Chinese star troupe};

@@ -21,7 +21,7 @@

 \node[anchor=west, fill=red!20!white, inner sep=0.40em] (synhifstpart4) at ([xshift=0.2em]synhifstpart3.east) {to};

-\node[anchor=west, fill=purple!20!white, inner sep=0.25em] (synhifstpart5) at ([xshift=0.2em]synhifstpart4.east) {Hong Kong audience};
+\node[anchor=west, fill=purple!20!white, inner sep=0.25em] (synhifstpart5) at ([xshift=0.2em]synhifstpart4.east) {the national audience};

 \node[anchor=west] (synhifstpart6) at (synhifstpart5.east) {.};

@@ -39,7 +39,7 @@
            ]
            [.\node(tn8){PP};
                [.\node(tn9){P}; \node[fill=red!20!white](seg5){给$_{12}$}; ]
-                [.\node(tn10){NP}; \edge[roof]; \node[fill=purple!20!white](seg6){香港$_{13}$ 观众$_{14}$}; ]
+                [.\node(tn10){NP}; \edge[roof]; \node[fill=purple!20!white](seg6){全国$_{13}$ 观众$_{14}$}; ]
            ]
        ]
        [.\node(tn11){.}; ]
@@ -47,11 +47,11 @@

 \end{scope}

-\path [draw,thick,->,dashed] (seg2.north) .. controls +(north:1.0) and +(south:1.5) ..  (synhifstpart4.south);
-\path [draw,thick,->,dashed] (seg3.north) --  (synhifstpart3.south);
-\path [draw,thick,->,dashed] (seg4.north) --  (synhifstpart2.south);
-\path [draw,thick,->,dashed] (seg5.north) .. controls +(north:0.5) ..  (synhifstpart4.south);
-\path [draw,thick,->,dashed] (seg6.north) --  (synhifstpart5.south);
+\path [draw,->,dashed] (seg2.north) .. controls +(north:1.0) and +(south:1.5) ..  (synhifstpart4.south);
+\path [draw,->,dashed] (seg3.north) --  (synhifstpart3.south);
+\path [draw,->,dashed] (seg4.north) --  (synhifstpart2.south);
+\path [draw,->,dashed] (seg5.north) .. controls +(north:0.5) ..  (synhifstpart4.south);
+\path [draw,->,dashed] (seg6.north) --  (synhifstpart5.south);

 }
 \end{scope}

--- a/Chapter8/Figures/figure-execution-of-cube-pruning.tex
+++ b/Chapter8/Figures/figure-execution-of-cube-pruning.tex
@@ -3,10 +3,10 @@
 \tikzstyle{selectnode} = [rectangle,fill=green!20,minimum height=1.5em,minimum width=1.5em,inner sep=1.2pt]
 \tikzstyle{srcnode} = [rotate=45,anchor=south west]
 \begin{scope}[scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};
@@ -45,10 +45,10 @@

 %图2
 \begin{scope}[xshift=18.0em,scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};
@@ -92,10 +92,10 @@

 %图3
 \begin{scope}[yshift=-13.0em,scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};
@@ -143,10 +143,10 @@

 %图4
 \begin{scope}[xshift=18.0em,yshift=-13.0em,scale=0.85]
-\node [anchor=west] (s1) at (0,0) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from}\ \textrm{X}_1>$}};
-\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{since}\ \textrm{X}_1>$}};
-\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{from the}\ \textrm{X}_1>$}};
-\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\textrm{X} \to <\textrm{从}\ \textrm{X}_1,\ \textrm{through}\ \textrm{X}_1>$}};
+\node [anchor=west] (s1) at (0,0) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from}\ \seq{X}_1>$}};
+\node [anchor=east] (s2) at ([yshift=-2em]s1.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{since}\ \seq{X}_1>$}};
+\node [anchor=east] (s3) at ([yshift=-2em]s2.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{from the}\ \seq{X}_1>$}};
+\node [anchor=east] (s4) at ([yshift=-2em]s3.east) {\scriptsize{$\seq{X} \to <\textrm{从}\ \seq{X}_1,\ \textrm{through}\ \seq{X}_1>$}};

 \node [anchor=center,alignmentnode] (alig1) at ([xshift=12.0em,yshift=0em]s1.west) {};
 \node [anchor=center,alignmentnode] (alig11) at ([xshift=2.2em]alig1.center) {};

--- a/Chapter8/Figures/figure-hierarchical-phrase-rule-match-generate.tex
+++ b/Chapter8/Figures/figure-hierarchical-phrase-rule-match-generate.tex
@@ -7,29 +7,29 @@
 \node[anchor=north] (q1) at (0,0) {\scriptsize\sffamily\bfseries{输入字符串：}};
 \node[anchor=west] (q2) at ([xshift=0em,yshift=-2em]q1.west) {\footnotesize{进口$\quad$和$\quad$出口$\quad$大幅度$\quad$下降$\quad$了}};

-\node[anchor=north,fill=blue!20,minimum height=1em,minimum width=1em] (f1) at ([xshift=-4.1em,yshift=-0.8em]q2.south) {};
+\node[anchor=north,fill=blue!20,minimum height=1em,minimum width=1em] (f1) at ([xshift=-4.2em,yshift=-0.8em]q2.south) {};

 \node[anchor=east] (n1) at ([xshift=1em,yshift=-2em]q2.west) {\scriptsize\sffamily\bfseries{匹配规则：}};

-\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$\textrm{X} \to  \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};
+\node[anchor=west] (n2) at ([xshift=0em,yshift=0em]n1.east) {\scriptsize{$\seq{X} \to  \langle\ \seq{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \seq{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle$}};

 \draw[decorate,decoration={mirror,brace}]([xshift=0.5em,yshift=-1em]q2.west) --([xshift=7em,yshift=-1em]q2.west) node [xshift=0em,yshift=-1em,align=center](label1) {};	

 {\scriptsize
 \node[anchor=west] (h1) at ([xshift=1em,yshift=-7em]q2.west) {{Span[0,3]下的翻译假设：}};
-\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{X：the imports and exports}};
-\node[anchor=west] (h3) at ([xshift=0em,yshift=-1.3em]h2.west) {{X：imports and exports}};
-\node[anchor=west] (h4) at ([xshift=0em,yshift=-1.3em]h3.west) {{X：exports and imports}};
-\node[anchor=west] (h5) at ([xshift=0em,yshift=-1.3em]h4.west) {{X：the imports and the exports}};
-\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h5.west) {{S：the import and export}};
+\node[anchor=west] (h2) at ([xshift=0em,yshift=-1.3em]h1.west) {{$\seq{X}$：the imports and exports}};
+\node[anchor=west] (h3) at ([xshift=0em,yshift=-1.3em]h2.west) {{$\seq{X}$：imports and exports}};
+\node[anchor=west] (h4) at ([xshift=0em,yshift=-1.3em]h3.west) {{$\seq{X}$：exports and imports}};
+\node[anchor=west] (h5) at ([xshift=0em,yshift=-1.3em]h4.west) {{$\seq{X}$：the imports and the exports}};
+\node[anchor=west] (h6) at ([xshift=0em,yshift=-1.3em]h5.west) {{$\seq{S}$：the import and export}};
 }

 {\scriptsize
-\node[anchor=west] (h21) at ([xshift=9em,yshift=0em]h1.east) {{替换$\textrm{X}_1$后生成的翻译假设：}};
-\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{X：the imports and exports have drastically fallen}};
-\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{X：imports and exports have drastically fallen}};
-\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{X：exports and imports have drastically fallen}};
-\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{X：the imports and the exports have drastically fallen}};
+\node[anchor=west] (h21) at ([xshift=9em,yshift=0em]h1.east) {{替换$\seq{X}_1$后生成的翻译假设：}};
+\node[anchor=west] (h22) at ([xshift=0em,yshift=-1.3em]h21.west) {{$\seq{X}$：the imports and exports have drastically fallen}};
+\node[anchor=west] (h23) at ([xshift=0em,yshift=-1.3em]h22.west) {{$\seq{X}$：imports and exports have drastically fallen}};
+\node[anchor=west] (h24) at ([xshift=0em,yshift=-1.3em]h23.west) {{$\seq{X}$：exports and imports have drastically fallen}};
+\node[anchor=west] (h25) at ([xshift=0em,yshift=-1.3em]h24.west) {{$\seq{X}$：the imports and the exports have drastically fallen}};
 }

 \node [rectangle,inner sep=0.1em,rounded corners=1pt,draw] [fit = (h1) (h5) (h6)] (gl1) {};

--- a/Chapter8/Figures/figure-processing-of-hierarchical-phrase-system.tex
+++ b/Chapter8/Figures/figure-processing-of-hierarchical-phrase-system.tex
@@ -9,7 +9,7 @@
 \tikzstyle{decodingnode} = [minimum width=7em,minimum height=1.7em,fill=green!20,rounded corners=0.3em];

 \node [datanode,anchor=north west,minimum height=1.7em,minimum width=8em] (bitext) at (0,0) {{ \small{训练用双语数据}}};
-\node [modelnode, anchor=north west,minimum height=1.7em,minimum width=8em] (gi) at ([xshift=2em,yshift=-0.2em]bitext.south east) {{ \small{文法(规则)抽取}}};
+\node [modelnode, anchor=north west,minimum height=1.7em,minimum width=8em] (gi) at ([xshift=2em,yshift=-0.2em]bitext.south east) {{ \small{文法（规则）抽取}}};
 \node [datanode,anchor=north east,minimum height=1.7em,minimum width=8em] (birules) at ([xshift=-2em,yshift=-0.2em]gi.south west) {{ \small{同步翻译文法}}};
 \node [modelnode, anchor=north west,minimum height=1.7em,minimum width=8em] (training) at ([xshift=2em,yshift=-0.2em]birules.south east) {{ \small{特征值学习}}};
 \node [datanode,anchor=north east,minimum height=1.7em,minimum width=8em] (model) at ([xshift=-2em,yshift=-0.2em]training.south west) {{ \small{翻译模型}}};

--- a/Chapter8/Figures/figure-rule-matching-base-tree.tex
+++ b/Chapter8/Figures/figure-rule-matching-base-tree.tex
@@ -8,7 +8,7 @@

 \begin{scope}[sibling distance=2pt,level distance=20pt,grow'=up]
 \Tree[.\node(treeroot){IP};
-     [.NP [.NR 阿都拉$_1$ ]]
+     [.NP [.NR 市长$_1$ ]]
        [.\node(tn1){VP};
            [.\node(tn2){PP};
                [.\node(tn3){P}; \node(cw1){对$_2$}; ]

--- a/Chapter8/Figures/figure-translation-rule-describe-two-sentence-generation.tex
+++ b/Chapter8/Figures/figure-translation-rule-describe-two-sentence-generation.tex
@@ -7,21 +7,21 @@
 {
 % rule 1 (source)
 \node [anchor=west] (rule1s1) at (0,0) {与};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1s2) at ([xshift=0.5em]rule1s1.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1s2) at ([xshift=0.5em]rule1s1.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule1s3) at ([xshift=0.5em]rule1s2.east) {有};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1s4) at ([xshift=0.5em]rule1s3.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1s4) at ([xshift=0.5em]rule1s3.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % rule 1 (target)
 \node [anchor=west] (rule1t1) at ([xshift=0.8in]rule1s4.east) {have};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1t2) at ([xshift=0.5em]rule1t1.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1t2) at ([xshift=0.5em]rule1t1.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};
 \node [anchor=west] (rule1t3) at ([xshift=0.5em]rule1t2.east) {with};
-\node [anchor=west,inner sep=2pt,fill=black] (rule1t4) at ([xshift=0.5em]rule1t3.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule1t4) at ([xshift=0.5em]rule1t3.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 }

 {
 % phrase 1 (source and target)
-\node [anchor=north] (phrase1s1) at ([yshift=-1em]rule1s2.south) {\footnotesize{北韩}};
-\node [anchor=north] (phrase1t1) at ([yshift=-1em]rule1t4.south) {\footnotesize{North Korea}};
+\node [anchor=north] (phrase1s1) at ([yshift=-1em]rule1s2.south) {\footnotesize{中国}};
+\node [anchor=north] (phrase1t1) at ([yshift=-1em]rule1t4.south) {\footnotesize{China}};
 }

 {
@@ -48,18 +48,18 @@

 {
 % rule 2 (source)
-\node [anchor=west,inner sep=2pt,fill=black] (rule2s1) at ([yshift=3.5em,xshift=-0.5em]rule1s1.north west) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2s1) at ([yshift=3.5em,xshift=-0.5em]rule1s1.north west) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule2s2) at ([xshift=0.5em]rule2s1.east) {的};
-\node [anchor=west,inner sep=2pt,fill=black] (rule2s3) at ([xshift=0.5em]rule2s2.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2s3) at ([xshift=0.5em]rule2s2.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % rule 2 (target)
-\node [anchor=west,inner sep=2pt,fill=black] (rule2t1) at ([xshift=1.8in]rule2s3.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2t1) at ([xshift=1.8in]rule2s3.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};
 \node [anchor=west] (rule2t2) at ([xshift=0.5em]rule2t1.east) {that};
-\node [anchor=west,inner sep=2pt,fill=black] (rule2t3) at ([xshift=0.5em]rule2t2.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule2t3) at ([xshift=0.5em]rule2t2.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};

 % phrase 3 (source and target)
-\node [anchor=north] (phrase3s1) at ([yshift=-0.8em]rule2s3.south) {\footnotesize{少数 国家}};
-\node [anchor=north] (phrase3t1) at ([yshift=-0.8em]rule2t1.south) {\footnotesize{the few countries}};
+\node [anchor=north] (phrase3s1) at ([yshift=-0.8em]rule2s3.south) {\footnotesize{多数 国家}};
+\node [anchor=north] (phrase3t1) at ([yshift=-0.8em]rule2t1.south) {\footnotesize{the many countries}};

 % edges (phrase 3 to rule 2 and rule1 to rule2)
 \draw [->] (phrase3s1.north) -- ([yshift=-0.1em]rule2s3.south);
@@ -78,12 +78,12 @@

 {
 % rule 3 (source)
-\node [anchor=west,inner sep=2pt,fill=black] (rule3s1) at ([yshift=2.5em,xshift=4em]rule2s1.north west) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule3s1) at ([yshift=2.5em,xshift=4em]rule2s1.north west) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule3s2) at ([xshift=0.5em]rule3s1.east) {之一};

 % rule 3 (target)
 \node [anchor=west] (rule3t1) at ([xshift=1.0in]rule3s2.east) {one of};
-\node [anchor=west,inner sep=2pt,fill=black] (rule3t2) at ([xshift=0.5em]rule3t1.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule3t2) at ([xshift=0.5em]rule3t1.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};

 % edges: rule 2 to rule 3
 \draw [->] ([xshift=-1em]rule2s.north) ..controls +(north:1.2em) and +(south:1.2em).. ([yshift=-0.1em]rule3s1.south);
@@ -100,18 +100,18 @@

 {
 % rule 4 (source)
-\node [anchor=west,inner sep=2pt,fill=black] (rule4s1) at ([yshift=3.5em,xshift=-3.5em]rule3s1.north west) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4s1) at ([yshift=3.5em,xshift=-3.5em]rule3s1.north west) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule4s2) at ([xshift=0.5em]rule4s1.east) {是};
-\node [anchor=west,inner sep=2pt,fill=black] (rule4s3) at ([xshift=0.5em]rule4s2.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4s3) at ([xshift=0.5em]rule4s2.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % rule 2 (target)
-\node [anchor=west,inner sep=2pt,fill=black] (rule4t1) at ([xshift=2.0in]rule4s2.east) {\scriptsize{{\color{white} $\textrm{X}_1$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4t1) at ([xshift=2.0in]rule4s2.east) {\scriptsize{{\color{white} $\funp{X}_1$}}};
 \node [anchor=west] (rule4t2) at ([xshift=0.5em]rule4t1.east) {is};
-\node [anchor=west,inner sep=2pt,fill=black] (rule4t3) at ([xshift=0.5em]rule4t2.east) {\scriptsize{{\color{white} $\textrm{X}_2$}}};
+\node [anchor=west,inner sep=2pt,fill=black] (rule4t3) at ([xshift=0.5em]rule4t2.east) {\scriptsize{{\color{white} $\funp{X}_2$}}};

 % phrase 4 (source and target)
-\node [anchor=north] (phrase4s1) at ([yshift=-0.8em]rule4s1.south) {\footnotesize{澳洲}};
-\node [anchor=north] (phrase4t1) at ([yshift=-0.8em]rule4t1.south) {\footnotesize{Australia}};
+\node [anchor=north] (phrase4s1) at ([yshift=-0.8em]rule4s1.south) {\footnotesize{巴基斯坦}};
+\node [anchor=north] (phrase4t1) at ([yshift=-0.8em]rule4t1.south) {\footnotesize{Pakistan}};

 % edges (phrase 4 to rule 4 and rule3 to rule4)
 \draw [->] (phrase4s1.north) -- ([yshift=-0.1em]rule4s1.south);

--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -68,7 +68,7 @@
 \end{figure}
 %-------------------------------------------

-\parinterval 句法树结构可以赋予机器翻译对语言进一步抽象的能力，这样，可以不需要使用连续词串，而是通过句法结构来对大范围的译文生成和调序进行建模。图\ref{fig:8-3}是一个在翻译中融入源语言（汉语）句法信息的实例。这个例子中，介词短语“在 $...$ 后”包含15个单词，因此，使用短语很难涵盖这样的片段。这时，系统会把“在 $...$ 后”错误地翻译为“In $...$”。通过句法树，可以知道“在 $...$ 后”对应着一个完整的子树结构PP（介词短语）。因此也很容易知道介词短语中“在 $...$ 后”是一个模板（红色），而“在”和“后”之间的部分构成从句部分（蓝色）。最终得到正确的译文“After $...$”。
+\parinterval 句法树结构可以赋予机器翻译对语言进一步抽象的能力，这样，可以不需要使用连续词串，而是通过句法结构来对大范围的译文生成和调序进行建模。图\ref{fig:8-3}是一个在翻译中融入源语言（汉语）句法信息的实例。这个例子中，介词短语“在 $...$ 后”包含12个单词，因此，使用短语很难涵盖这样的片段。这时，系统会把“在 $...$ 后”错误地翻译为“In $...$”。通过句法树，可以知道“在 $...$ 后”对应着一个完整的子树结构PP（介词短语）。因此也很容易知道介词短语中“在 $...$ 后”是一个模板（红色），而“在”和“后”之间的部分构成从句部分（蓝色）。最终得到正确的译文“After $...$”。

 \parinterval 使用句法信息在机器翻译中并不新鲜。在基于规则和模板的翻译模型中，就大量使用了句法等结构信息。只是由于早期句法分析技术不成熟，系统的整体效果并不突出。在数据驱动的方法中，句法可以很好地融合在统计建模中。通过概率化的句法设计，可以对翻译过程进行很好的描述。

@@ -92,7 +92,7 @@
 \caption{不同短语在训练数据中出现的频次}
 \label{tab:8-1}
 \begin{tabular}{p{12em} | r}
-短语(中文) & 训练数据中出现的频次 \\
+短语（中文） & 训练数据中出现的频次 \\
 \hline

 包含 & 3341\\
@@ -110,7 +110,7 @@

 \parinterval 显然，利用过长的短语来处理长距离的依赖并不是一种十分有效的方法。过于低频的长短语无法提供可靠的信息，而且使用长短语会导致模型体积急剧增加。

-\parinterval 再来看一个翻译实例\upcite{Chiang2012Hope}，图\ref{fig:8-4}是一个基于短语的机器翻译系统的翻译结果。这个例子中的调序有一些复杂，比如，“少数/国家/之一”和“与/北韩/有/邦交”的英文翻译都需要进行调序，分别是“one of the few countries”和“have diplomatic relations with North Korea”。基于短语的系统可以很好地处理这些调序问题，因为它们仅仅使用了局部的信息。但是，系统却无法在这两个短语（1和2）之间进行正确的调序。
+\parinterval 再来看一个翻译实例\upcite{Chiang2012Hope}，图\ref{fig:8-4}是一个基于短语的机器翻译系统的翻译结果。这个例子中的调序有一些复杂，比如，“多数/国家/之一”和“与/中国/有/邦交”的英文翻译都需要进行调序，分别是“one of the many countries”和“have diplomatic relations with China”。基于短语的系统可以很好地处理这些调序问题，因为它们仅仅使用了局部的信息。但是，系统却无法在这两个短语（1和2）之间进行正确的调序。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -133,30 +133,30 @@

 \parinterval 这里[什么\ \ 东西]和[什么\ \ 事]表示模板中的变量，可以被其他词序列替换。通常，可以把这个模板形式化描述为：
 \begin{eqnarray}
-\langle \ \text{与}\ \textrm{X}_1\ \text{有}\ \textrm{X}_2,\quad \textrm{have}\ \textrm{X}_2\ \textrm{with}\ \textrm{X}_1\ \rangle \nonumber
+\langle \ \text{与}\ \funp{X}_1\ \text{有}\ \funp{X}_2,\quad \textrm{have}\ \funp{X}_2\ \textrm{with}\ \funp{X}_1\ \rangle \nonumber
 \end{eqnarray}

-\noindent 其中，逗号分隔了源语言和目标语言部分，$\textrm{X}_1$和$\textrm{X}_2$表示模板中需要替换的内容，或者说变量。源语言中的变量和目标语言中的变量是一一对应的，比如，源语言中的$\textrm{X}_1$ 和目标语言中的$\textrm{X}_1$代表这两个变量可以“同时”被替换。假设给定短语对：
+\noindent 其中，逗号分隔了源语言和目标语言部分，$\funp{X}_1$和$\funp{X}_2$表示模板中需要替换的内容，或者说变量。源语言中的变量和目标语言中的变量是一一对应的，比如，源语言中的$\funp{X}_1$ 和目标语言中的$\funp{X}_1$代表这两个变量可以“同时”被替换。假设给定短语对：
 \begin{eqnarray}
-\langle \ \text{北韩},\quad \textrm{North Korea} \ \rangle \qquad\ \quad\quad\ \  \nonumber \\
+\langle \ \text{巴基斯坦},\quad \textrm{Pakistan} \ \rangle \qquad\ \quad\quad\ \  \nonumber \\
 \langle \ \text{邦交},\quad \textrm{diplomatic relations} \ \rangle\quad\ \ \ \nonumber
 \end{eqnarray}

-\parinterval 可以使用第一个短语替换模板中的变量$\textrm{X}_1$，得到：
+\parinterval 可以使用第一个短语替换模板中的变量$\funp{X}_1$，得到：
 \begin{eqnarray}
-\langle \ \text{与}\ \text{[北韩]}\ \text{有}\ \textrm{X}_2,\quad \textrm{have}\ \textrm{X}_2\ \textrm{with}\ \textrm{[North Korea]} \ \rangle \nonumber
+\langle \ \text{与}\ \text{[巴基斯坦]}\ \text{有}\ \funp{X}_2,\quad \textrm{have}\ \funp{X}_2\ \textrm{with}\ \textrm{[Pakistan]} \ \rangle \nonumber
 \end{eqnarray}

-\noindent 其中，$[\cdot]$表示被替换的部分。可以看到，在源语言和目标语言中，$\textrm{X}_1$被同时替换为相应的短语。进一步，可以用第二个短语替换$\textrm{X}_2$，得到：
+\noindent 其中，$[\cdot]$表示被替换的部分。可以看到，在源语言和目标语言中，$\funp{X}_1$被同时替换为相应的短语。进一步，可以用第二个短语替换$\funp{X}_2$，得到：
 \begin{eqnarray}
-\quad\langle \ \text{与}\ \text{北韩}\ \text{有}\ \text{[邦交]},\quad \textrm{have}\ \textrm{[diplomatic relations]}\ \textrm{with}\ \textrm{North Korea} \ \rangle \nonumber
+\quad\langle \ \text{与}\ \text{巴基斯坦}\ \text{有}\ \text{[邦交]},\quad \textrm{have}\ \textrm{[diplomatic relations]}\ \textrm{with}\ \textrm{Pakistan} \ \rangle \nonumber
 \end{eqnarray}

 \parinterval 至此，就得到了一个完整词串的译文。类似的，还可以写出其他的翻译模板，如下：
 \begin{eqnarray}
-\langle \ \textrm{X}_1\ \text{是}\ \textrm{X}_2,\quad \textrm{X}_1\ \textrm{is}\ \textrm{X}_2 \ \rangle \qquad\qquad\ \nonumber \\
-\langle \ \textrm{X}_1\ \text{之一},\quad \textrm{one}\ \textrm{of}\ \textrm{X}_1 \ \rangle \qquad\qquad\ \nonumber \\
-\langle \ \textrm{X}_1\ \text{的}\ \textrm{X}_2,\quad \textrm{X}_2\ \textrm{that}\ \textrm{have}\ \textrm{X}_1\ \rangle\quad\ \nonumber
+\langle \ \funp{X}_1\ \text{是}\ \funp{X}_2,\quad \funp{X}_1\ \textrm{is}\ \funp{X}_2 \ \rangle \qquad\qquad\ \nonumber \\
+\langle \ \funp{X}_1\ \text{之一},\quad \textrm{one}\ \textrm{of}\ \funp{X}_1 \ \rangle \qquad\qquad\ \nonumber \\
+\langle \ \funp{X}_1\ \text{的}\ \funp{X}_2,\quad \funp{X}_2\ \textrm{that}\ \textrm{have}\ \funp{X}_1\ \rangle\quad\ \nonumber
 \end{eqnarray}

 \parinterval 使用上面这种变量替换的方式，就可以得到一个完整句子的翻译。
@@ -224,9 +224,9 @@

 \parinterval 这里的S、NP、VP等符号可以被看作是具有句法功能的标记，因此这个文法和传统句法分析中的CFG很像，只是CFG是单语文法，而SCFG是双语同步文法。非终结符的下标表示对应关系，比如，源语言的NP$_1$和目标语言的NP$_1$是对应的。因此，在上面这种表示形式中，两种语言间非终结符的对应关系$\sim$是隐含在变量下标中的。当然，复杂的句法功能标记并不是必须的。比如，也可以使用更简单的文法形式：
 \begin{eqnarray}
-\textrm{X}\ &\to\ &\langle \ \textrm{X}_1\ \text{希望}\ \textrm{X}_2,\quad \textrm{X}_1\ \textrm{wish}\ \textrm{to}\ \textrm{X}_2\ \rangle \nonumber \\
-\textrm{X}\ &\to\ &\langle \ \text{对}\ \textrm{X}_1\ \text{感到}\ \textrm{X}_2,\quad \textrm{be}\ \textrm{X}_2\ \textrm{wish}\ \textrm{X}_1\ \rangle \nonumber \\
-\textrm{X}\ &\to\ &\langle \ \text{强大},\quad \textrm{strong}\ \rangle \nonumber
+\funp{X}\ &\to\ &\langle \ \funp{X}_1\ \text{希望}\ \funp{X}_2,\quad \funp{X}_1\ \textrm{wish}\ \textrm{to}\ \funp{X}_2\ \rangle \nonumber \\
+\funp{X}\ &\to\ &\langle \ \text{对}\ \funp{X}_1\ \text{感到}\ \funp{X}_2,\quad \textrm{be}\ \funp{X}_2\ \textrm{with}\ \funp{X}_1\ \rangle \nonumber \\
+\funp{X}\ &\to\ &\langle \ \text{强大},\quad \textrm{strong}\ \rangle \nonumber
 \end{eqnarray}

 \parinterval 这个文法只有一种非终结符X，因此所有的变量都可以使用任意的产生式进行推导。这就给翻译提供了更大的自由度，也就是说，规则可以被任意使用，进行自由组合。这也符合基于短语的模型中对短语进行灵活拼接的思想。基于此，层次短语系统中也使用这种并不依赖语言学句法标记的文法。在本章的内容中，如果没有特殊说明，把这种没有语言学句法标记的文法称作{\small\bfnew{基于层次短语的文法}}\index{基于层次短语的文法}（Hierarchical Phrase-based Grammar）\index{Hierarchical Phrase-based Grammar}，或简称层次短语文法。
@@ -239,10 +239,10 @@

 \parinterval 下面是一个完整的层次短语文法：
 \begin{eqnarray}
-r_1:\quad \textrm{X}\ &\to\ &\langle \ \text{进口}\ \textrm{X}_1,\quad \textrm{The}\ \textrm{imports}\ \textrm{X}_1\ \rangle \nonumber \\
-r_2:\quad \textrm{X}\ &\to\ &\langle \ \textrm{X}_1\ \text{下降}\ \textrm{X}_2,\quad \textrm{X}_2\ \textrm{X}_1\ \textrm{fallen}\ \rangle \nonumber \\
-r_3:\quad \textrm{X}\ &\to\ &\langle \ \text{大幅度},\quad \textrm{drastically}\ \rangle \nonumber \\
-r_4:\quad \textrm{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \nonumber
+r_1:\quad \funp{X}\ &\to\ &\langle \ \text{进口}\ \funp{X}_1,\quad \textrm{The}\ \textrm{imports}\ \funp{X}_1\ \rangle \nonumber \\
+r_2:\quad \funp{X}\ &\to\ &\langle \ \funp{X}_1\ \text{下降}\ \funp{X}_2,\quad \funp{X}_2\ \funp{X}_1\ \textrm{fallen}\ \rangle \nonumber \\
+r_3:\quad \funp{X}\ &\to\ &\langle \ \text{大幅度},\quad \textrm{drastically}\ \rangle \nonumber \\
+r_4:\quad \funp{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \nonumber
 \end{eqnarray}

 \noindent 其中，规则$r_1$和$r_2$是含有变量的规则，这些变量可以被其他规则的右部替换；规则$r_2$是调序规则；规则$r_3$和$r_4$是纯词汇化规则，表示单词或者短语的翻译。
@@ -255,11 +255,11 @@ r_4:\quad \textrm{X}\ &\to\ &\langle \ \text{了},\quad \textrm{have}\ \rangle \

 \parinterval 可以进行如下的推导（假设起始符号是X）：
 \begin{eqnarray}
-& & \langle\ \textrm{X}_1, \textrm{X}_1\ \rangle \nonumber \\
-& \xrightarrow[]{r_1} & \langle\ {\red{\textrm{进口}\ \textrm{X}_2}},\ {\red{\textrm{The imports}\ \textrm{X}_2}}\ \rangle \nonumber \\
-& \xrightarrow[]{r_2} & \langle\ \textrm{进口}\ {\red{\textrm{X}_3\ \textrm{下降}\ \textrm{X}_4}},\ \textrm{The imports}\ {\red{\textrm{X}_4\ \textrm{X}_3\ \textrm{fallen}}}\ \rangle \nonumber \\
-& \xrightarrow[]{r_3} & \langle\ \textrm{进口}\ {\red{\textrm{大幅度}}}\ \textrm{下降}\ \textrm{X}_4, \nonumber \\
-& & \ \textrm{The imports}\ \textrm{X}_4\ {\red{\textrm{drastically}}}\ \textrm{fallen}\ \rangle \nonumber \\
+& & \langle\ \funp{X}_1, \funp{X}_1\ \rangle \nonumber \\
+& \xrightarrow[]{r_1} & \langle\ {\red{\textrm{进口}\ \funp{X}_2}},\ {\red{\textrm{The imports}\ \funp{X}_2}}\ \rangle \nonumber \\
+& \xrightarrow[]{r_2} & \langle\ \textrm{进口}\ {\red{\funp{X}_3\ \textrm{下降}\ \funp{X}_4}},\ \textrm{The imports}\ {\red{\funp{X}_4\ \funp{X}_3\ \textrm{fallen}}}\ \rangle \nonumber \\
+& \xrightarrow[]{r_3} & \langle\ \textrm{进口}\ {\red{\textrm{大幅度}}}\ \textrm{下降}\ \funp{X}_4, \nonumber \\
+& & \ \textrm{The imports}\ \funp{X}_4\ {\red{\textrm{drastically}}}\ \textrm{fallen}\ \rangle \nonumber \\
 & \xrightarrow[]{r_4} & \langle\ \textrm{进口}\ \textrm{大幅度}\ \textrm{下降}\ {\red{\textrm{了}}}, \nonumber \\
 & & \ \textrm{The imports}\ {\red{\textrm{have}}}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber
 \end{eqnarray}
@@ -270,7 +270,7 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \label{eq:8-1}
 \end{eqnarray}

-\parinterval 在层次短语模型中，每个翻译推导都唯一地对应一个目标语译文。因此，可以用推导的概率$\textrm{P}(d)$描述翻译的好坏。同基于短语的模型是一样的（见{\chapterseven}数学建模小节），层次短语翻译的目标是：求概率最高的翻译推导$\hat{d}=\arg\max\textrm{P}(d)$。值得注意的是，基于推导的方法在句法分析中也十分常用。层次短语翻译实质上也是通过生成翻译规则的推导来对问题的表示空间进行建模。在\ref{section-8.3} 节还将看到，这种方法可以被扩展到语言学上基于句法的翻译模型中。而且这些模型都可以用一种被称作超图的结构来进行建模。从某种意义上讲，基于规则推导的方法将句法分析和机器翻译进行了形式上的统一。因此机器翻译也借用了很多句法分析的思想。
+\parinterval 在层次短语模型中，每个翻译推导都唯一地对应一个目标语译文。因此，可以用推导的概率$\funp{P}(d)$描述翻译的好坏。同基于短语的模型是一样的（见{\chapterseven}数学建模小节），层次短语翻译的目标是：求概率最高的翻译推导$\hat{d}=\arg\max\funp{P}(d)$。值得注意的是，基于推导的方法在句法分析中也十分常用。层次短语翻译实质上也是通过生成翻译规则的推导来对问题的表示空间进行建模。在\ref{section-8.3} 节还将看到，这种方法可以被扩展到语言学上基于句法的翻译模型中。而且这些模型都可以用一种被称作超图的结构来进行建模。从某种意义上讲，基于规则推导的方法将句法分析和机器翻译进行了形式上的统一。因此机器翻译也借用了很多句法分析的思想。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -280,16 +280,16 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}

 \parinterval 由于翻译现象非常复杂，在实际系统中往往需要把两个局部翻译线性拼接到一起。在层次短语模型中，这个问题通过引入{\small\bfnew{胶水规则}}\index{胶水规则}（Glue Rule）\index{Glue Rule}来处理，形式如下：
 \begin{eqnarray}
-\textrm{S} & \to & \langle\ \textrm{S}_1\ \textrm{X}_2,\ \textrm{S}_1\ \textrm{X}_2\ \rangle \nonumber \\
-\textrm{S} & \to & \langle\ \textrm{X}_1,\ \textrm{X}_1\ \rangle \nonumber
+\funp{S} & \to & \langle\ \funp{S}_1\ \funp{X}_2,\ \funp{S}_1\ \funp{X}_2\ \rangle \nonumber \\
+\funp{S} & \to & \langle\ \funp{X}_1,\ \funp{X}_1\ \rangle \nonumber
 \end{eqnarray}

 \parinterval 胶水规则引入了一个新的非终结符S，S只能和X进行顺序拼接，或者S由X生成。如果把S看作文法的起始符，使用胶水规则后，相当于把句子划分为若干个部分，每个部分都被归纳为X。之后，顺序地把这些X拼接到一起，得到最终的译文。比如，最极端的情况，整个句子会生成一个X，之后再归纳为S，这时并不需要进行胶水规则的顺序拼接；另一种极端的情况，每个单词都是独立地被翻译，被归纳为X，之后先把最左边的X归纳为S，再依次把剩下的X依次拼到一起。这样的推导形式如下：
 \begin{eqnarray}
-\textrm{S} & \to & \langle\ \textrm{S}_1\ \textrm{X}_2,\ \textrm{S}_1\ \textrm{X}_2\ \rangle \nonumber \\
-                & \to & \langle\ \textrm{S}_3\ \textrm{X}_4\ \textrm{X}_2,\ \textrm{S}_3\ \textrm{X}_4\ \textrm{X}_2\ \rangle \nonumber \\
+\funp{S} & \to & \langle\ \funp{S}_1\ \funp{X}_2,\ \funp{S}_1\ \funp{X}_2\ \rangle \nonumber \\
+                & \to & \langle\ \funp{S}_3\ \funp{X}_4\ \funp{X}_2,\ \funp{S}_3\ \funp{X}_4\ \funp{X}_2\ \rangle \nonumber \\
                & \to & ... \nonumber \\
-                & \to & \langle\ \textrm{X}_n\ ...\ \textrm{X}_4\ \textrm{X}_2,\ \textrm{X}_n\ ...\ \textrm{X}_4\ \textrm{X}_2\ \rangle \nonumber
+                & \to & \langle\ \funp{X}_n\ ...\ \funp{X}_4\ \funp{X}_2,\ \funp{X}_n\ ...\ \funp{X}_4\ \funp{X}_2\ \rangle \nonumber
 \end{eqnarray}

 \parinterval 实际上，胶水规则在很大程度上模拟了基于短语的系统中对字符串顺序翻译的操作，而且在实践中发现，这个步骤是十分必要的。特别是对法英翻译这样的任务，由于语言的结构基本上是顺序翻译的，因此引入顺序拼接的操作符合翻译的整体规律。同时，这种拼接给翻译增加了灵活性，系统会更加健壮。
@@ -334,17 +334,17 @@ d = {r_1} \circ {r_2} \circ {r_3} \circ {r_4}
 \begin{definition} 与词对齐相兼容的层次短语规则

 {\small
-对于句对$(\seq{s},\seq{t})$和它们之间的词对齐$\seq{a}$，令$\Phi$表示在句对$(\seq{s},\seq{t})$上与$\seq{a}$相兼容的双语短语集合。则：
+对于句对$(\seq{s},\seq{t})$和它们之间的词对齐$\seq{a}$，令$\varPhi$表示在句对$(\seq{s},\seq{t})$上与$\seq{a}$相兼容的双语短语集合。则：
 \begin{enumerate}
-\item 	如果$(x,y)\in \Phi$，则$\textrm{X} \to \langle x,y,\phi \rangle$是与词对齐相兼容的层次短语规则。
-\item 	对于$(x,y)\in \Phi$，存在$m$个双语短语$(x_i,y_j)\in \Phi$，同时存在(1,$...$,$m$)上面的一个排序$\sim = \{\pi_1 , ... ,\pi_m\}$，且：
+\item 	如果$(x,y)\in \varPhi$，则$\funp{X} \to \langle x,y,\phi \rangle$是与词对齐相兼容的层次短语规则。
+\item 	对于$(x,y)\in \varPhi$，存在$m$个双语短语$(x_i,y_j)\in \varPhi$，同时存在(1,$...$,$m$)上面的一个排序$\sim = \{\pi_1 , ... ,\pi_m\}$，且：
 \vspace{-1.5em}
 \begin{eqnarray}
 x&=&\alpha_0 x_1 \alpha_1 x_2 ... \alpha_{m-1} x_m \alpha_m \label{eq:8-2}\\
 y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m
 \label{eq:8-3}
 \end{eqnarray}
-其中，${\alpha_0, ... ,\alpha_m}$和${\beta_0, ... ,\beta_m}$表示源语言和目标语言的若干个词串（包含空串）。则$\textrm{X} \to \langle x,y,\sim \rangle$是与词对齐相兼容的层次短语规则。这条规则包含$m$个变量，变量的对齐信息是$\sim$。
+其中，${\alpha_0, ... ,\alpha_m}$和${\beta_0, ... ,\beta_m}$表示源语言和目标语言的若干个词串（包含空串）。则$\funp{X} \to \langle x,y,\sim \rangle$是与词对齐相兼容的层次短语规则。这条规则包含$m$个变量，变量的对齐信息是$\sim$。
 \end{enumerate}
 }
 \end{definition}
@@ -388,15 +388,15 @@ y&=&\beta_0 y_{\pi_1} \beta_1 y_{\pi_2} ... \beta_{m-1} y_{\pi_m} \beta_m

 \begin{itemize}
 \vspace{0.5em}
-\item 	(h1-2)短语翻译概率（取对数），即$\textrm{log}(\textrm{P}(\alpha \mid \beta))$和$\textrm{log}(\textrm{P}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
+\item 	($h_{1-2}$)短语翻译概率（取对数），即$\textrm{log}(\funp{P}(\alpha \mid \beta))$和$\textrm{log}(\funp{P}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
 \vspace{0.5em}
-\item 	(h3-4)词汇化翻译概率（取对数），即$\textrm{log}(\textrm{P}_{\textrm{lex}}(\alpha \mid \beta))$和$\textrm{log}(\textrm{P}_{\textrm{lex}}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
+\item 	($h_{3-4}$)词汇化翻译概率（取对数），即$\textrm{log}(\funp{P}_{\textrm{lex}}(\alpha \mid \beta))$和$\textrm{log}(\funp{P}_{\textrm{lex}}(\beta \mid \alpha))$，特征的计算与基于短语的模型完全一样；
 \vspace{0.5em}
-\item (h5)翻译规则数量，让模型自动学习对规则数量的偏好，同时避免使用过少规则造成分数偏高的现象；
+\item ($h_{5}$)翻译规则数量，让模型自动学习对规则数量的偏好，同时避免使用过少规则造成分数偏高的现象；
 \vspace{0.5em}
-\item (h6)胶水规则数量，让模型自动学习使用胶水规则的偏好；
+\item ($h_{6}$)胶水规则数量，让模型自动学习使用胶水规则的偏好；
 \vspace{0.5em}
-\item (h7)短语规则数量，让模型自动学习使用纯短语规则的偏好。
+\item ($h_{7}$)短语规则数量，让模型自动学习使用纯短语规则的偏好。
 \vspace{0.5em}
 \end{itemize}

@@ -414,7 +414,7 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)

 \parinterval 最终，模型得分被定义为：
 \begin{eqnarray}
-\textrm{score}(d,\seq{t},\seq{s})=\textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}⁡(\textrm{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
+\textrm{score}(d,\seq{t},\seq{s})=\textrm{rscore}(d,\seq{t},\seq{s})+ \lambda_8 \textrm{log}(\funp{P}_{\textrm{lm}}(\seq{t}))+\lambda_9 \mid \seq{t} \mid
 \label{eq:8-6}
 \end{eqnarray}

@@ -422,9 +422,9 @@ h_i (d,\seq{t},\seq{s})=\sum_{r \in d}h_i (r)

 \begin{itemize}
 \vspace{0.5em}
-\item $\textrm{log}⁡(\textrm{P}_{\textrm{lm}}(\textrm{t}))$表示语言模型得分；
+\item $\textrm{log}(\funp{P}_{\textrm{lm}}(\seq{t}))$表示语言模型得分；
 \vspace{0.5em}
-\item $\mid \textrm{t} \mid$表示译文的长度。
+\item $\mid \seq{t} \mid$表示译文的长度。
 \vspace{0.5em}
 \end{itemize}

@@ -520,7 +520,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q
 \vspace{0.5em}
 \item 把层次短语文法转化为乔姆斯基范式，这样可以直接使用原始的CKY方法进行分析；
 \vspace{0.5em}
-\item 对CKY方法进行改造。解码的核心任务要知道每个跨度是否能匹配规则的源语言部分。实际上，层次短语模型的文法是一种特殊的文法。这种文法规则的源语言部分最多包含两个变量，而且变量不能连续。这样的规则会对应一种特定类型的模版，比如，对于包含两个变量的规则，它的源语言部分形如$\alpha_0 \textrm{X}_1 \alpha_1 \textrm{X}_2 \alpha_2$。其中，$\alpha_0$、$\alpha_1$和$\alpha_2$表示终结符串，$\textrm{X}_1$和$\textrm{X}_2$是变量。显然，如果$\alpha_0$、$\alpha_1$和$\alpha_2$确定下来那么$\textrm{X}_1$和$\textrm{X}_2$的位置也就确定了下来。因此，对于每一个词串，都可以很容易的生成这种模版，进而完成匹配。而$\textrm{X}_1$和$\textrm{X}_2$和原始CKY中匹配二叉规则本质上是一样的。由于这种方法并不需要对CKY方法进行过多调整，因此层次短语系统中广泛使用这种改造的CKY方法进行解码。
+\item 对CKY方法进行改造。解码的核心任务要知道每个跨度是否能匹配规则的源语言部分。实际上，层次短语模型的文法是一种特殊的文法。这种文法规则的源语言部分最多包含两个变量，而且变量不能连续。这样的规则会对应一种特定类型的模版，比如，对于包含两个变量的规则，它的源语言部分形如$\alpha_0 \funp{X}_1 \alpha_1 \funp{X}_2 \alpha_2$。其中，$\alpha_0$、$\alpha_1$和$\alpha_2$表示终结符串，$\funp{X}_1$和$\funp{X}_2$是变量。显然，如果$\alpha_0$、$\alpha_1$和$\alpha_2$确定下来那么$\funp{X}_1$和$\funp{X}_2$的位置也就确定了下来。因此，对于每一个词串，都可以很容易的生成这种模版，进而完成匹配。而$\funp{X}_1$和$\funp{X}_2$和原始CKY中匹配二叉规则本质上是一样的。由于这种方法并不需要对CKY方法进行过多调整，因此层次短语系统中广泛使用这种改造的CKY方法进行解码。
 \vspace{0.5em}
 \end{itemize}

@@ -542,7 +542,7 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \subsection{立方剪枝}

-\parinterval 相比于基于短语的模型，基于层次短语的模型引入了“变量”的概念。这样，可以根据变量周围的上下文信息对变量进行调序。变量的内容由其所对应的跨度上的翻译假设进行填充。图\ref{fig:8-11}展示了一个层次短语规则匹配词串的实例。可以看到，规则匹配词串之后，变量X的位置对应了一个跨度。这个跨度上所有标记为X的局部推导都可以作为变量的内容。
+\parinterval 相比于基于短语的模型，基于层次短语的模型引入了“变量”的概念。这样，可以根据变量周围的上下文信息对变量进行调序。变量的内容由其所对应的跨度上的翻译假设进行填充。图\ref{fig:8-11}展示了一个层次短语规则匹配词串的实例。可以看到，规则匹配词串之后，变量$\funp{X}$的位置对应了一个跨度。这个跨度上所有标记为X的局部推导都可以作为变量的内容。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -555,12 +555,12 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \parinterval 真实的情况会更加复杂。对于一个规则的源语言端，可能会有多个不同的目标语言端与之对应。比如，如下规则的源语言端完全相同，但是译文不同：
 \begin{eqnarray}
-\textrm{X} & \to & \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber \\
-\textrm{X} & \to & \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle \nonumber \\
-\textrm{X} & \to & \langle\ \textrm{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \textrm{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber
+\funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{have}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber \\
+\funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{have}\ \textrm{fallen}\ \textrm{drastically}\ \rangle \nonumber \\
+\funp{X} & \to & \langle\ \funp{X}_1\ \text{大幅度}\ \text{下降}\ \text{了},\ \funp{X}_1\ \textrm{has}\ \textrm{drastically}\ \textrm{fallen}\ \rangle \nonumber
 \end{eqnarray}

-\parinterval 这也就是说，当匹配规则的源语言部分“$\textrm{X}_1$\ \ 大幅度\ \ 下降\ \ 了”时会有三个译文可以选择。而变量$\textrm{X}_1$部分又有很多不同的局部翻译结果。不同的规则译文和不同的变量译文都可以组合出一个局部翻译结果。图\ref{fig:8-12}展示了这种情况的实例。
+\parinterval 这也就是说，当匹配规则的源语言部分“$\funp{X}_1$\ \ 大幅度\ \ 下降\ \ 了”时会有三个译文可以选择。而变量$\funp{X}_1$部分又有很多不同的局部翻译结果。不同的规则译文和不同的变量译文都可以组合出一个局部翻译结果。图\ref{fig:8-12}展示了这种情况的实例。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -878,8 +878,8 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \noindent 可以得到一个翻译推导：
 {\footnotesize
 \begin{eqnarray}
-&& \langle\ \textrm{IP}^{[1]},\ \textrm{S}^{[1]}\ \rangle \nonumber \\
-& \xrightarrow[r_9]{\textrm{IP}^{[1]} \Leftrightarrow \textrm{S}^{[1]}} & \langle\ \textrm{IP(}\textrm{NN}^{[2]}\ \textrm{VP}^{[3]}),\ \textrm{S(}\textrm{NP}^{[2]}\ \textrm{VP}^{[3]})\ \rangle \nonumber \\
+&& \langle\ \textrm{IP}^{[1]},\ \funp{S}^{[1]}\ \rangle \nonumber \\
+& \xrightarrow[r_9]{\textrm{IP}^{[1]} \Leftrightarrow \funp{S}^{[1]}} & \langle\ \textrm{IP(}\textrm{NN}^{[2]}\ \textrm{VP}^{[3]}),\ \textrm{S(}\textrm{NP}^{[2]}\ \textrm{VP}^{[3]})\ \rangle \nonumber \\
 & \xrightarrow[r_7]{\textrm{NN}^{[2]} \Leftrightarrow \textrm{NP}^{[2]}} & \langle\ \textrm{IP(NN(进口)}\ \textrm{VP}^{[3]}),\ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP}^{[3]})\ \rangle \nonumber \\
 & \xrightarrow[r_8]{\textrm{VP}^{[3]} \Leftrightarrow \textrm{VP}^{[3]}} & \langle\ \textrm{IP(NN(进口)}\ \textrm{VP(AD}^{[4]}\ \textrm{VP(VV}^{[5]}\ \textrm{AS}^{[6]}))), \nonumber \\
 &                 & \ \ \textrm{S(NP(DT(the) NNS(imports))}\ \textrm{VP(VBP}^{[6]}\ \textrm{ADVP(RB}^{[4]}\ \textrm{VBN}^{[5]})))\ \rangle \hspace{4.5em} \nonumber \\
@@ -894,7 +894,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \end{eqnarray}
 }

-\noindent 其中，箭头$\rightarrow$表示推导之意。显然，可以把翻译看作是基于树结构的推导过程（记为$d$）。因此，与层次短语模型一样，基于语言学句法的机器翻译也是要找到最佳的推导$\hat{d} = \argmax_{d} \textrm{P} (d)$。
+\noindent 其中，箭头$\rightarrow$表示推导之意。显然，可以把翻译看作是基于树结构的推导过程（记为$d$）。因此，与层次短语模型一样，基于语言学句法的机器翻译也是要找到最佳的推导$\hat{d} = \argmax_{d} \funp{P} (d)$。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -1317,7 +1317,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \begin{itemize}
 \vspace{0.5em}
-\item (h1-2)短语翻译概率（取对数），即规则源语言和目标语言树覆盖的序列翻译概率。令函数$\tau(\cdot)$返回一个树片段的叶子节点序列。对于规则：
+\item ($h_{1-2}$)短语翻译概率（取对数），即规则源语言和目标语言树覆盖的序列翻译概率。令函数$\tau(\cdot)$返回一个树片段的叶子节点序列。对于规则：
 \begin{displaymath}
 \textrm{VP(}\textrm{PP}_1\ \ \textrm{VP(VV(表示)}\ \ \textrm{NN}_2\textrm{))} \rightarrow \textrm{VP(VBZ(was)}\ \ \textrm{VP(}\textrm{VBN}_2\ \ \textrm{PP}_1\textrm{))}
 \end{displaymath}
@@ -1328,7 +1328,7 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex
 \end{eqnarray}
 \noindent 于是，可以定义短语翻译概率特征为$\log(\textrm{P(}\tau( \alpha_r )|\tau( \beta_r )))$和$\log(\textrm{P(}\tau( \beta_r )|\tau( \alpha_r )))$。它们的计算方法与基于短语的系统是完全一样的\footnote[9]{对于树到串规则，$\tau( \beta_r )$就是规则目标语言端的符号串。}；
 \vspace{0.5em}
-\item (h3-4) 词汇化翻译概率（取对数），即$\log(\textrm{P}_{\textrm{lex}}(\tau( \alpha_r )|\tau( \beta_r )))$和$\log(\textrm{P}_{\textrm{lex}}(\tau( \beta_r )|\tau( \alpha_r )))$。这两个特征的计算方法与基于短语的系统也是一样的。
+\item ($h_{3-4}$) 词汇化翻译概率（取对数），即$\log(\funp{P}_{\textrm{lex}}(\tau( \alpha_r )|\tau( \beta_r )))$和$\log(\funp{P}_{\textrm{lex}}(\tau( \beta_r )|\tau( \alpha_r )))$。这两个特征的计算方法与基于短语的系统也是一样的。
 \vspace{0.5em}
 \end{itemize}

@@ -1337,11 +1337,11 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \begin{itemize}
 \vspace{0.5em}
-\item (h5)基于根节点句法标签的规则生成概率（取对数），即$\log(\textrm{P(}r|\textrm{root(}r\textrm{))})$。这里，$\textrm{root(}r)$是规则所对应的双语根节点$(\alpha_h,\beta_h)$；
+\item ($h_{5}$)基于根节点句法标签的规则生成概率（取对数），即$\log(\textrm{P(}r|\textrm{root(}r\textrm{))})$。这里，$\textrm{root(}r)$是规则所对应的双语根节点$(\alpha_h,\beta_h)$；
 \vspace{0.5em}
-\item (h6)基于源语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\alpha_r)))$，给定源语言端生成整个规则的概率；
+\item ($h_{6}$)基于源语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\alpha_r))$，给定源语言端生成整个规则的概率；
 \vspace{0.5em}
-\item (h7)基于目标语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\beta_r)))$，给定目标语言端生成整个规则的概率。
+\item ($h_{7}$)基于目标语言端的规则生成概率（取对数），即$\log(\textrm{P(}r|\beta_r))$，给定目标语言端生成整个规则的概率。
 \end{itemize}

 \vspace{0.5em}
@@ -1349,17 +1349,17 @@ r_9: \quad \textrm{IP(}\textrm{NN}_1\ \textrm{VP}_2) \rightarrow \textrm{S(}\tex

 \begin{itemize}
 \vspace{0.5em}
-\item (h8)语言模型得分（取对数），即$\log(\textrm{P}_{\textrm{lm}}(\seq{t}))$，用于度量译文的流畅度；
+\item ($h_{8}$)语言模型得分（取对数），即$\log(\funp{P}_{\textrm{lm}}(\seq{t}))$，用于度量译文的流畅度；
 \vspace{0.5em}
-\item (h9)译文长度，即$|\seq{t}|$，用于避免模型过于倾向生成短译文（因为短译文语言模型分数高）；
+\item ($h_{9}$)译文长度，即$|\seq{t}|$，用于避免模型过于倾向生成短译文（因为短译文语言模型分数高）；
 \vspace{0.5em}
-\item (h10)翻译规则数量，学习对使用规则数量的偏好。比如，如果这个特征的权重较高，则表明系统更喜欢使用数量多的规则；
+\item ($h_{10}$)翻译规则数量，学习对使用规则数量的偏好。比如，如果这个特征的权重较高，则表明系统更喜欢使用数量多的规则；
 \vspace{0.5em}
-\item (h11)组合规则的数量，学习对组合规则的偏好；
+\item ($h_{11}$)组合规则的数量，学习对组合规则的偏好；
 \vspace{0.5em}
-\item (h12)词汇化规则的数量，学习对含有终结符规则的偏好；
+\item ($h_{12}$)词汇化规则的数量，学习对含有终结符规则的偏好；
 \vspace{0.5em}
-\item (h13)低频规则的数量，学习对训练数据中出现频次低于3的规则的偏好。低频规则大多不可靠，设计这个特征的目的也是为了区分不同质量的规则。
+\item ($h_{13}$)低频规则的数量，学习对训练数据中出现频次低于3的规则的偏好。低频规则大多不可靠，设计这个特征的目的也是为了区分不同质量的规则。
 \end{itemize}

 \vspace{0.5em}

--- a/bibliography.bib
+++ b/bibliography.bib
@@ -5068,6 +5068,13 @@ pages ={157-166},
    publisher = "Association for Computational Linguistics",
    pages = "1789--1798",
 }
+@article{Lin2020WeightDT,
+  title={Weight Distillation: Transferring the Knowledge in Neural Network Parameters},
+  author={Ye Lin and Yanyang Li and Ziyang Wang and Bei Li and Quan Du and Tong Xiao and Jingbo Zhu},
+  journal={ArXiv},
+  year={2020},
+  volume={abs/2009.09152}
+}
 %%%%% chapter 12------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@@ -5086,6 +5093,102 @@ pages ={157-166},
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%% chapter 15------------------------------------------------------

+@inproceedings{DBLP:conf/cvpr/YuYR18,
+  author    = {Xin Yu and
+               Zhiding Yu and
+               Srikumar Ramalingam},
+  title     = {Learning Strict Identity Mappings in Deep Residual Networks},
+  pages     = {4432--4440},
+  publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
+  year      = {2018}
+}
+
+@inproceedings{DBLP:conf/emnlp/ZhangTS19,
+  author    = {Biao Zhang and
+               Ivan Titov and
+               Rico Sennrich},
+  title     = {Improving Deep Transformer with Depth-Scaled Initialization and Merged
+               Attention},
+  pages     = {898--909},
+  publisher = {Conference on Empirical Methods in Natural Language Processing},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/eccv/HeZRS16,
+  author    = {Kaiming He and
+               Xiangyu Zhang and
+               Shaoqing Ren and
+               Jian Sun},
+  title     = {Identity Mappings in Deep Residual Networks},
+  volume    = {9908},
+  pages     = {630--645},
+  publisher = {European Conference on Computer Vision},
+  year      = {2016}
+}
+
+@inproceedings{Ottfairseq,
+  author    = {Myle Ott and
+               Sergey Edunov and
+               Alexei Baevski and
+               Angela Fan and
+               Sam Gross and
+               Nathan Ng and
+               David Grangier and
+               Michael Auli},
+  title     = {fairseq: {A} Fast, Extensible Toolkit for Sequence Modeling},
+  pages     = {48--53},
+  publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{KleinOpenNMT,
+  author    = {Guillaume Klein and
+               Yoon Kim and
+               Yuntian Deng and
+               Jean Senellart and
+               Alexander M. Rush},
+  title     = {OpenNMT: Open-Source Toolkit for Neural Machine Translation},
+  pages     = {67--72},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2017}
+}
+
+@inproceedings{DBLP:conf/acl/WuWXTGQLL19,
+  author    = {Lijun Wu and
+               Yiren Wang and
+               Yingce Xia and
+               Fei Tian and
+               Fei Gao and
+               Tao Qin and
+               Jianhuang Lai and
+               Tie{-}Yan Liu},
+  title     = {Depth Growing for Neural Machine Translation},
+  pages     = {5558--5563},
+  publisher = {Annual Meeting of the Association for Computational Linguistics},
+  year      = {2019}
+}
+
+@inproceedings{DBLP:conf/cvpr/HuangLMW17,
+  author    = {Gao Huang and
+               Zhuang Liu and
+               Laurens van der Maaten and
+               Kilian Q. Weinberger},
+  title     = {Densely Connected Convolutional Networks},
+  pages     = {2261--2269},
+  publisher = {{IEEE} Conference on Computer Vision and Pattern Recognition},
+  year      = {2017}
+}
+
+@article{DBLP:journals/corr/GreffSS16,
+  author    = {Klaus Greff and
+               Rupesh Kumar Srivastava and
+               J{\"{u}}rgen Schmidhuber},
+  title     = {Highway and Residual Networks learn Unrolled Iterative Estimation},
+  publisher = {International Conference on Learning Representations},
+  year      = {2017}
+}
+
+
 %%%%% chapter 15------------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


--- a/structure.tex
+++ b/structure.tex
@@ -686,5 +686,5 @@ addtohook={%
 \newcommand\chaptereighteen{第十八章}%*

 \newcommand\funp{}%函数P等使用，空是斜体，textrm是加粗
-\newcommand\vectorn{\textbf}%向量N等使用
+\newcommand\vectorn{\textbf}%向量N等使用\vectorn{\emph{s}}
 \newcommand\seq{}%序列N等使用