Merge branch 'caorunzhe' into 'zengxin'

Caorunzhe: see merge request !844

Commit 30677034 authored Jan 10, 2021 by zengxin
--- a/Chapter13/Figures/figure-bpe.tex
+++ b/Chapter13/Figures/figure-bpe.tex
 \begin{tikzpicture}
-\tikzstyle{tnode} = [rectangle,inner sep=0em,minimum width=8em,minimum height=6.6em,rounded corners=5pt,fill=ugreen!20]
-\tikzstyle{pnode} = [rectangle,inner sep=0em,minimum width=8em,minimum height=6.6em,rounded corners=5pt,fill=yellow!20]
+\tikzstyle{tnode} = [rectangle,inner sep=0em,minimum width=8em,minimum height=6.6em,rounded corners=5pt,fill=green!20]
+\tikzstyle{pnode} = [rectangle,inner sep=0em,minimum width=8em,minimum height=6.6em,rounded corners=5pt,fill=yellow!30]
 \tikzstyle{mnode} = [rectangle,inner sep=0em,minimum width=8em,minimum height=6.6em,rounded corners=5pt,fill=red!20]
 \tikzstyle{wnode} = [inner sep=0em,minimum height=1.5em]

@@ -19,7 +19,7 @@


 \begin{pgfonlayer}{background}
-\node [rectangle,inner sep=0.7em,draw,ugreen!40,dashed,very thick,rounded corners=7pt] [fit = (n1) (n4)] (box1) {};
+\node [rectangle,inner sep=0.7em,draw,ugreen!60,dashed,very thick,rounded corners=7pt] [fit = (n1) (n4)] (box1) {};
 \end{pgfonlayer}

 \node [anchor=west,align=left,font=\footnotesize] (nt1) at ([xshift=0.1em,yshift=0em]n2.east) {统计词表和\\[0.5ex]词频};
@@ -75,7 +75,7 @@
 \node [anchor=east,ublue,align=left,font=\footnotesize] (l3) at ([xshift=-0.5em,yshift=0em]cd.west) {直至达到设定的符号合\\并表大小或无法合并};

 \begin{pgfonlayer}{background}
-\node [rectangle,inner sep=0.7em,draw,yellow!40,dashed,very thick,rounded corners=7pt] [fit = (n5) (n8) (l3) (cd)] (box2) {};
+\node [rectangle,inner sep=0.7em,draw,orange!40,dashed,very thick,rounded corners=7pt] [fit = (n5) (n8) (l3) (cd)] (box2) {};
 \end{pgfonlayer}

 %第五排

--- a/Chapter13/Figures/figure-curriculum-learning-framework.tex
+++ b/Chapter13/Figures/figure-curriculum-learning-framework.tex
@@ -4,10 +4,10 @@
 \tikzstyle{node}=[inner sep=0mm,minimum height=3em,minimum width=6em,rounded corners=5pt]


-\node[anchor=west,node,fill=ugreen!30] (n1) at (0,0) {训练集};
-\node[anchor=west,node,fill=yellow!30] (n2) at ([xshift=2em,yshift=0em]n1.east) {难度评估器};
-\node[anchor=west,node,fill=red!30] (n3) at ([xshift=4em,yshift=0em]n2.east) {训练调度器};
-\node[anchor=west,node,fill=blue!30] (n4) at ([xshift=4em,yshift=0em]n3.east) {模型训练器};
+\node[anchor=west,node,fill=ugreen!15] (n1) at (0,0) {训练集};
+\node[anchor=west,node,fill=yellow!15] (n2) at ([xshift=2em,yshift=0em]n1.east) {难度评估器};
+\node[anchor=west,node,fill=red!15] (n3) at ([xshift=4em,yshift=0em]n2.east) {训练调度器};
+\node[anchor=west,node,fill=blue!15] (n4) at ([xshift=4em,yshift=0em]n3.east) {模型训练器};

 \draw [->,very thick] ([xshift=0em,yshift=0em]n1.east) -- ([xshift=0em,yshift=0em]n2.west);
 \draw [->,very thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]n3.west);
@@ -23,8 +23,8 @@
 \draw [->,dotted,very thick] ([xshift=0em,yshift=0em]n4.north) --  ([xshift=0em,yshift=1em]n4.north) --  ([xshift=0em,yshift=1em]n3.north) -- (n3.north);

    \begin{pgfonlayer}{background}
-      \node[rectangle,inner sep=5pt,rounded corners=5pt,fill=gray!30] [fit = (n3) (n4) (n6) (n8) ] (g2) {};
-      \node[rectangle,inner sep=5pt,rounded corners=5pt,fill=orange!30] [fit = (n2) (n3) (n9) ] (g1) {};
+      \node[rectangle,inner sep=5pt,rounded corners=5pt,fill=gray!15] [fit = (n3) (n4) (n6) (n8) ] (g2) {};
+      \node[rectangle,inner sep=5pt,rounded corners=5pt,fill=orange!15] [fit = (n2) (n3) (n9) ] (g1) {};


    \end{pgfonlayer}

--- a/Chapter13/Figures/figure-exposure-bias.tex
+++ b/Chapter13/Figures/figure-exposure-bias.tex
@@ -8,8 +8,8 @@

 \begin{scope}[]

-\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=red!20]
-\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=blue!20]
+\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=red!15]
+\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=blue!15]
 \tikzstyle{ynode} = [inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt]


@@ -63,44 +63,44 @@
 \node [anchor=west] (y5) at ([xshift=0.5em,yshift=0em]y4.east) {${y}_{n}$};


-\node [anchor=center,prob,minimum size=0.5*1em] (label11) at ([xshift=-0.1em,yshift=1em]y1.north) {.5};
-\node [anchor=center,prob,minimum size=0.2em] (label12) at ([xshift=0em,yshift=1.3em]label11.center) {};
-\node [anchor=center,prob,minimum size=0.2em] (label13) at ([xshift=0em,yshift=1.3em]label12.center) {};
-\node [anchor=center,prob,minimum size=0.7*1em] (label14) at ([xshift=0em,yshift=1.3em]label13.center) {.7};
-\node [anchor=center,prob,minimum size=0.2em] (label15) at ([xshift=0em,yshift=1.3em]label14.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label11) at ([xshift=-0.1em,yshift=1em]y1.north) {};
+\node [anchor=center,prob,minimum size=0.3em] (label12) at ([xshift=0em,yshift=1.3em]label11.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label13) at ([xshift=0em,yshift=1.3em]label12.center) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label14) at ([xshift=0em,yshift=1.3em]label13.center) {1};
+\node [anchor=center,prob,minimum size=0.3em] (label15) at ([xshift=0em,yshift=1.3em]label14.center) {};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up1) at ([xshift=0em,yshift=0em]label11.center) {};
 \node [anchor=center,minimum size=1em,white] (down1) at ([xshift=0em,yshift=0em]label15.center) {};
 \node [inner sep=0.3em,draw] [fit = (up1) (down1)] (prob1) {};
 \end{pgfonlayer}

-\node [anchor=center,prob,minimum size=0.2em] (label21) at ([xshift=-0.1em,yshift=1em]y2.north) {};
-\node [anchor=center,prob,minimum size=0.9*1em] (label22) at ([xshift=0em,yshift=1.3em]label21.center) {.9};
-\node [anchor=center,prob,minimum size=0.2em] (label23) at ([xshift=0em,yshift=1.3em]label22.center) {};
-\node [anchor=center,prob,minimum size=0.2em] (label24) at ([xshift=0em,yshift=1.3em]label23.center) {};
-\node [anchor=center,prob,minimum size=0.8*1em] (label25) at ([xshift=0em,yshift=1.3em]label24.center) {.8};
+\node [anchor=center,prob,minimum size=0.3em] (label21) at ([xshift=-0.1em,yshift=1em]y2.north) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label22) at ([xshift=0em,yshift=1.3em]label21.center) {1};
+\node [anchor=center,prob,minimum size=0.3em] (label23) at ([xshift=0em,yshift=1.3em]label22.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label24) at ([xshift=0em,yshift=1.3em]label23.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label25) at ([xshift=0em,yshift=1.3em]label24.center) {};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up2) at ([xshift=0em,yshift=0em]label21.center) {};
 \node [anchor=center,minimum size=1em,white] (down2) at ([xshift=0em,yshift=0em]label25.center) {};
 \node [inner sep=0.3em,draw] [fit = (up2) (down2)] (prob2) {};
 \end{pgfonlayer}

-\node [anchor=center,prob,minimum size=0.2em] (label31) at ([xshift=-0.1em,yshift=1em]y3.north) {};
-\node [anchor=center,prob,minimum size=0.2em] (label32) at ([xshift=0em,yshift=1.3em]label31.center) {};
-\node [anchor=center,prob,minimum size=0.2em] (label33) at ([xshift=0em,yshift=1.3em]label32.center) {};
-\node [anchor=center,prob,minimum size=0.6*1em] (label34) at ([xshift=0em,yshift=1.3em]label33.center) {.6};
-\node [anchor=center,prob,minimum size=0.2em] (label35) at ([xshift=0em,yshift=1.3em]label34.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label31) at ([xshift=-0.1em,yshift=1em]y3.north) {};
+\node [anchor=center,prob,minimum size=0.3em] (label32) at ([xshift=0em,yshift=1.3em]label31.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label33) at ([xshift=0em,yshift=1.3em]label32.center) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label34) at ([xshift=0em,yshift=1.3em]label33.center) {1};
+\node [anchor=center,prob,minimum size=0.3em] (label35) at ([xshift=0em,yshift=1.3em]label34.center) {};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up3) at ([xshift=0em,yshift=0em]label31.center) {};
 \node [anchor=center,minimum size=1em,white] (down3) at ([xshift=0em,yshift=0em]label35.center) {};
 \node [inner sep=0.3em,draw] [fit = (up3) (down3)] (prob3) {};
 \end{pgfonlayer}

-\node [anchor=center,prob,minimum size=0.5*1em] (label41) at ([xshift=-0.1em,yshift=1em]y5.north) {.5};
-\node [anchor=center,prob,minimum size=0.2em] (label42) at ([xshift=0em,yshift=1.3em]label41.center) {};
-\node [anchor=center,prob,minimum size=0.8*1em] (label43) at ([xshift=0em,yshift=1.3em]label42.center) {.8};
-\node [anchor=center,prob,minimum size=0.2em] (label44) at ([xshift=0em,yshift=1.3em]label43.center) {};
-\node [anchor=center,prob,minimum size=0.2em] (label45) at ([xshift=0em,yshift=1.3em]label44.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label41) at ([xshift=-0.1em,yshift=1em]y5.north) {};
+\node [anchor=center,prob,minimum size=0.3em] (label42) at ([xshift=0em,yshift=1.3em]label41.center) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label43) at ([xshift=0em,yshift=1.3em]label42.center) {1};
+\node [anchor=center,prob,minimum size=0.3em] (label44) at ([xshift=0em,yshift=1.3em]label43.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label45) at ([xshift=0em,yshift=1.3em]label44.center) {};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up4) at ([xshift=0em,yshift=0em]label41.center) {};
 \node [anchor=center,minimum size=1em,white] (down4) at ([xshift=0em,yshift=0em]label45.center) {};
@@ -114,7 +114,7 @@
 \node [rectangle,inner sep=0.5em,rounded corners=5pt,very thick,dotted,draw=ugreen] [fit = (n10) (y1) (y5)] (b2) {};

 \draw [->,dotted,very thick,ugreen] ([yshift=-0em]b1.east) .. controls +(east:1.7) and +(west:1) .. ([xshift=-0.1em]b2.west);
-\node [anchor=east] (inputlabel1) at ([yshift=-0.2em]b1.west) {{\color{red} \footnotesize{人工标注数据}}};
+\node [anchor=east] (inputlabel1) at ([yshift=-0.3em]b1.west) {{\color{red} \footnotesize{人工标注数据}}};
        
 \end{pgfonlayer}

@@ -122,8 +122,8 @@

 \begin{scope}[yshift=-2in]

-\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=red!20]
-\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=blue!20]
+\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=red!15]
+\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt,fill=blue!15]
 \tikzstyle{ynode} = [inner sep=2pt,minimum width=3em,minimum height=1.5em,rounded corners=1pt]


@@ -181,44 +181,44 @@
 \node [anchor=west] (y5) at ([xshift=0.5em,yshift=0em]y4.east) {$\tilde{y}_{n}$};


-\node [anchor=center,prob,minimum size=0.2em] (label11) at ([xshift=-0.1em,yshift=1em]y1.north) {};
-\node [anchor=center,prob,minimum size=0.9*1em] (label12) at ([xshift=0em,yshift=1.3em]label11.center) {.9};
-\node [anchor=center,prob,minimum size=0.2em] (label13) at ([xshift=0em,yshift=1.3em]label12.center) {};
-\node [anchor=center,prob,minimum size=0.2em] (label14) at ([xshift=0em,yshift=1.3em]label13.center) {};
-\node [anchor=center,prob,minimum size=0.2em] (label15) at ([xshift=0em,yshift=1.3em]label14.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label11) at ([xshift=-0.1em,yshift=1em]y1.north) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label12) at ([xshift=0em,yshift=1.3em]label11.center) {1};
+\node [anchor=center,prob,minimum size=0.3em] (label13) at ([xshift=0em,yshift=1.3em]label12.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label14) at ([xshift=0em,yshift=1.3em]label13.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label15) at ([xshift=0em,yshift=1.3em]label14.center) {};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up1) at ([xshift=0em,yshift=0em]label11.center) {};
 \node [anchor=center,minimum size=1em,white] (down1) at ([xshift=0em,yshift=0em]label15.center) {};
 \node [inner sep=0.3em,draw] [fit = (up1) (down1)] (prob1) {};
 \end{pgfonlayer}

-\node [anchor=center,prob,minimum size=0.5*1em] (label21) at ([xshift=-0.1em,yshift=1em]y2.north) {.5};
-\node [anchor=center,prob,minimum size=0.2em] (label22) at ([xshift=0em,yshift=1.3em]label21.center) {};
-\node [anchor=center,prob,minimum size=0.7*1em] (label23) at ([xshift=0em,yshift=1.3em]label22.center) {.7};
-\node [anchor=center,prob,minimum size=0.2em] (label24) at ([xshift=0em,yshift=1.3em]label23.center) {};
-\node [anchor=center,prob,minimum size=0.9*1em] (label25) at ([xshift=0em,yshift=1.3em]label24.center) {.9};
+\node [anchor=center,prob,minimum size=0.3em] (label21) at ([xshift=-0.1em,yshift=1em]y2.north) {};
+\node [anchor=center,prob,minimum size=0.3em] (label22) at ([xshift=0em,yshift=1.3em]label21.center) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label23) at ([xshift=0em,yshift=1.3em]label22.center) {1};
+\node [anchor=center,prob,minimum size=0.3em] (label24) at ([xshift=0em,yshift=1.3em]label23.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label25) at ([xshift=0em,yshift=1.3em]label24.center) {};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up2) at ([xshift=0em,yshift=0em]label21.center) {};
 \node [anchor=center,minimum size=1em,white] (down2) at ([xshift=0em,yshift=0em]label25.center) {};
 \node [inner sep=0.3em,draw] [fit = (up2) (down2)] (prob2) {};
 \end{pgfonlayer}

-\node [anchor=center,prob,minimum size=0.2em] (label31) at ([xshift=-0.1em,yshift=1em]y3.north) {};
-\node [anchor=center,prob,minimum size=0.9*1em] (label32) at ([xshift=0em,yshift=1.3em]label31.center) {.9};
-\node [anchor=center,prob,minimum size=0.2em] (label33) at ([xshift=0em,yshift=1.3em]label32.center) {};
-\node [anchor=center,prob,minimum size=0.5*1em] (label34) at ([xshift=0em,yshift=1.3em]label33.center) {.5};
-\node [anchor=center,prob,minimum size=0.8*1em] (label35) at ([xshift=0em,yshift=1.3em]label34.center) {.8};
+\node [anchor=center,prob,minimum size=0.3em] (label31) at ([xshift=-0.1em,yshift=1em]y3.north) {};
+\node [anchor=center,prob,minimum size=0.3em] (label32) at ([xshift=0em,yshift=1.3em]label31.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label33) at ([xshift=0em,yshift=1.3em]label32.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label34) at ([xshift=0em,yshift=1.3em]label33.center) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label35) at ([xshift=0em,yshift=1.3em]label34.center) {1};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up3) at ([xshift=0em,yshift=0em]label31.center) {};
 \node [anchor=center,minimum size=1em,white] (down3) at ([xshift=0em,yshift=0em]label35.center) {};
 \node [inner sep=0.3em,draw] [fit = (up3) (down3)] (prob3) {};
 \end{pgfonlayer}

-\node [anchor=center,prob,minimum size=0.2em] (label41) at ([xshift=-0.1em,yshift=1em]y5.north) {};
-\node [anchor=center,prob,minimum size=0.2em] (label42) at ([xshift=0em,yshift=1.3em]label41.center) {};
-\node [anchor=center,prob,minimum size=0.2em] (label43) at ([xshift=0em,yshift=1.3em]label42.center) {};
-\node [anchor=center,prob,minimum size=0.5*1em] (label44) at ([xshift=0em,yshift=1.3em]label43.center) {.5};
-\node [anchor=center,prob,minimum size=0.2em] (label45) at ([xshift=0em,yshift=1.3em]label44.center) {};
+\node [anchor=center,prob,minimum size=1*1.2em] (label41) at ([xshift=-0.1em,yshift=1em]y5.north) {1};
+\node [anchor=center,prob,minimum size=0.3em] (label42) at ([xshift=0em,yshift=1.3em]label41.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label43) at ([xshift=0em,yshift=1.3em]label42.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label44) at ([xshift=0em,yshift=1.3em]label43.center) {};
+\node [anchor=center,prob,minimum size=0.3em] (label45) at ([xshift=0em,yshift=1.3em]label44.center) {};
 \begin{pgfonlayer}{background}
 \node [anchor=center,minimum size=1em,white] (up4) at ([xshift=0em,yshift=0em]label41.center) {};
 \node [anchor=center,minimum size=1em,white] (down4) at ([xshift=0em,yshift=0em]label45.center) {};
@@ -233,7 +233,7 @@
 \node [rectangle,inner sep=0.5em,rounded corners=5pt,very thick,dotted,draw=ublue] [fit = (n10) (y1) (y5)] (b2) {};

 \draw [->,dotted,very thick,ublue] ([xshift=-0em,yshift=-0em]b1.east) .. controls +(east:1.7) and +(west:1) .. ([xshift=-0.1em]b2.west);
-\node [anchor=east] (inputlabel1) at ([yshift=-0.2em]b1.west) {{\color{red} \footnotesize{系统预测结果}}};
+\node [anchor=east] (inputlabel1) at ([yshift=-0.3em]b1.west) {{\color{red} \footnotesize{系统预测结果}}};
        
 \end{pgfonlayer}


--- a/Chapter13/Figures/figure-framework-of-Adversarial-Neural-machine-translation.tex
+++ b/Chapter13/Figures/figure-framework-of-Adversarial-Neural-machine-translation.tex

-
 %------------------------------------------------------------

 \begin{tikzpicture}

-\tikzstyle{rnnnode} = [draw,inner sep=4pt,minimum width=2em,minimum height=2em,rounded corners=1pt,fill=yellow!20]
+\tikzstyle{rnnnode} = [draw,inner sep=4pt,minimum width=2em,minimum height=2em,rounded corners=1pt,fill=green!20]
 \tikzstyle{snode} = [draw,inner sep=4pt,minimum width=2em,minimum height=2em,rounded corners=1pt,fill=red!20]
 \tikzstyle{wode} = [inner sep=0pt,minimum width=2em,minimum height=2em,rounded corners=0pt]


--- a/Chapter13/Figures/figure-of-scheduling-sampling-method.tex
+++ b/Chapter13/Figures/figure-of-scheduling-sampling-method.tex
@@ -4,8 +4,8 @@

 \begin{tikzpicture}

-\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=red!20]
-\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=blue!20]
+\tikzstyle{rnnnode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=red!15]
+\tikzstyle{snode} = [draw,inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt,fill=blue!15]
 \tikzstyle{ynode} = [inner sep=2pt,minimum width=4em,minimum height=2em,rounded corners=1pt]



--- a/Chapter13/Figures/figure-reinforcement-learning-method-based-on-actor-critic.tex
+++ b/Chapter13/Figures/figure-reinforcement-learning-method-based-on-actor-critic.tex

 \begin{tikzpicture}
 	
-	\node[anchor=west,inner sep=0mm,minimum height=4em,minimum width=5.5em,rounded corners=15pt,align=left,draw,fill=red!20] (n1) at (0,0) {Decoder\\Encoder};
+	\node[anchor=west,inner sep=0mm,minimum height=4em,minimum width=5.5em,rounded corners=15pt,align=left,draw,fill=red!15] (n1) at (0,0) {Decoder\\Encoder};

-	\node[anchor=west,inner sep=0mm,minimum height=4em,minimum width=5.5em,rounded corners=15pt,align=left,draw,fill=green!20] (n2) at ([xshift=10em,yshift=0em]n1.east) {Decoder\\Encoder};
+	\node[anchor=west,inner sep=0mm,minimum height=4em,minimum width=5.5em,rounded corners=15pt,align=left,draw,fill=green!15] (n2) at ([xshift=10em,yshift=0em]n1.east) {Decoder\\Encoder};

 	\node[anchor=south,inner sep=0mm,font=\small] (a1) at ([xshift=0em,yshift=1em]n1.north) {演员$p$};


--- a/Chapter14/Figures/figure-different-integration-model.tex
+++ b/Chapter14/Figures/figure-different-integration-model.tex
@@ -10,12 +10,12 @@
            \tikzstyle{output} = [rectangle,thick,rounded corners=3pt,minimum width=1.2cm,align=center,font=\scriptsize];

            \begin{scope}
-                \node [system,fill=orange!20,draw] (model3) at (0,0) {模型 $3$};
-                \node [system,fill=ugreen!20,draw,anchor=south] (model2) at ([yshift=0.5cm]model3.north) {模型 $2$};
+                \node [system,fill=yellow!30,draw] (model3) at (0,0) {模型 $3$};
+                \node [system,fill=green!20,draw,anchor=south] (model2) at ([yshift=0.5cm]model3.north) {模型 $2$};
                \node [system,fill=red!20,draw,anchor=south] (model1) at ([yshift=0.5cm]model2.north) {模型 $1$};

-                \node [output,fill=orange!20,draw,anchor=west] (output3) at ([xshift=0.8cm]model3.east) {输出 $3$};
-                \node [output,fill=ugreen!20,draw,anchor=west] (output2) at ([xshift=0.8cm]model2.east) {输出 $2$};
+                \node [output,fill=yellow!30,draw,anchor=west] (output3) at ([xshift=0.8cm]model3.east) {输出 $3$};
+                \node [output,fill=green!20,draw,anchor=west] (output2) at ([xshift=0.8cm]model2.east) {输出 $2$};
                \node [output,fill=red!20,draw,anchor=west] (output1) at ([xshift=0.8cm]model1.east) {输出 $1$};

                \begin{pgfonlayer}{background}
@@ -40,15 +40,15 @@
            \tikzstyle{output} = [rectangle,thick,rounded corners=3pt,minimum width=1.2cm,align=center,font=\scriptsize];

            \begin{scope}
-                \node [system,fill=orange!20,draw] (model3) at (0,0) {模型 $3$};
-                \node [system,fill=ugreen!20,draw,anchor=south] (model2) at ([yshift=0.5cm]model3.north) {模型 $2$};
+                \node [system,fill=yellow!30,draw] (model3) at (0,0) {模型 $3$};
+                \node [system,fill=green!20,draw,anchor=south] (model2) at ([yshift=0.5cm]model3.north) {模型 $2$};
                \node [system,fill=red!20,draw,anchor=south] (model1) at ([yshift=0.5cm]model2.north) {模型 $1$};

                \begin{pgfonlayer}{background}
                    \node [draw,thick,dashed,inner sep=3pt,fit=(model3) (model2) (model1)] (ensemble) {};
                \end{pgfonlayer}

-                \node [system,fill=ugreen!20,draw,right=1cm of ensemble] (model) {模型};
+                \node [system,fill=green!20,draw,right=1cm of ensemble] (model) {模型};

                \node [output,fill=cocoabrown!20,draw,minimum width=1.2cm,anchor=west] (final) at ([xshift=0.8cm]model.east) {最终\\输出};

@@ -68,12 +68,12 @@
            \tikzstyle{dot} = [circle,fill=blue!40!white,minimum size=5pt,inner sep=0pt];

            \begin{scope}
-                \node [system,fill=orange!20,draw] (model3) at (0,0) {模型 $3$};
-                \node [system,fill=ugreen!20,draw,anchor=south] (model2) at ([yshift=0.5cm]model3.north) {模型 $2$};
+                \node [system,fill=yellow!30,draw] (model3) at (0,0) {模型 $3$};
+                \node [system,fill=green!20,draw,anchor=south] (model2) at ([yshift=0.5cm]model3.north) {模型 $2$};
                \node [system,fill=red!20,draw,anchor=south] (model1) at ([yshift=0.5cm]model2.north) {模型 $1$};

-                \node [output,fill=orange!20,draw,anchor=west] (output3) at ([xshift=0.8cm]model3.east) {输出 $3$};
-                \node [output,fill=ugreen!20,draw,anchor=west] (output2) at ([xshift=0.8cm]model2.east) {输出 $2$};
+                \node [output,fill=yellow!30,draw,anchor=west] (output3) at ([xshift=0.8cm]model3.east) {输出 $3$};
+                \node [output,fill=green!20,draw,anchor=west] (output2) at ([xshift=0.8cm]model2.east) {输出 $2$};
                \node [output,fill=red!20,draw,anchor=west] (output1) at ([xshift=0.8cm]model1.east) {输出 $1$};

                \draw [->,very thick] (model1) to (output1);

--- a/Chapter14/Figures/figure-hypothesis-generation.tex
+++ b/Chapter14/Figures/figure-hypothesis-generation.tex
@@ -5,12 +5,12 @@
 \tikzstyle{output} = [rectangle,thick,rounded corners=3pt,minimum width=1.2cm,align=center,font=\scriptsize];

 \begin{scope}[local bounding box=MULTIPLE]
-    \node [system,fill=orange!20,draw] (engine3) at (0,0) {系统 $n$};
-    \node [system,fill=ugreen!20,draw,anchor=south] (engine2) at ([yshift=0.6cm]engine3.north) {系统 $2$};
+    \node [system,fill=yellow!30,draw] (engine3) at (0,0) {系统 $n$};
+    \node [system,fill=green!20,draw,anchor=south] (engine2) at ([yshift=0.6cm]engine3.north) {系统 $2$};
    \node [system,fill=red!20,draw,anchor=south] (engine1) at ([yshift=0.3cm]engine2.north) {系统 $1$};

-    \node [output,fill=orange!20,draw,anchor=west] (output3) at ([xshift=0.5cm]engine3.east) {输出 $n$};
-    \node [output,fill=ugreen!20,draw,anchor=west] (output2) at ([xshift=0.5cm]engine2.east) {输出 $2$};
+    \node [output,fill=yellow!30,draw,anchor=west] (output3) at ([xshift=0.5cm]engine3.east) {输出 $n$};
+    \node [output,fill=green!20,draw,anchor=west] (output2) at ([xshift=0.5cm]engine2.east) {输出 $2$};
    \node [output,fill=red!20,draw,anchor=west] (output1) at ([xshift=0.5cm]engine1.east) {输出 $1$};

    \draw [very thick,decorate,decoration={brace}] ([xshift=3pt]output1.north east) to node [midway,name=final] {} ([xshift=3pt]output3.south east);
@@ -25,11 +25,11 @@
 \end{scope}

 \begin{scope}[local bounding box=SINGLE]
-    \node [output,fill=ugreen!20,draw,anchor=west] (output3) at ([xshift=4cm]output3.east) {输出 $n$};
-    \node [output,fill=ugreen!20,draw,anchor=west] (output2) at ([xshift=4cm]output2.east) {输出 $2$};
-    \node [output,fill=ugreen!20,draw,anchor=west] (output1) at ([xshift=4cm]output1.east) {输出 $1$};
+    \node [output,fill=green!20,draw,anchor=west] (output3) at ([xshift=4cm]output3.east) {输出 $n$};
+    \node [output,fill=green!20,draw,anchor=west] (output2) at ([xshift=4cm]output2.east) {输出 $2$};
+    \node [output,fill=green!20,draw,anchor=west] (output1) at ([xshift=4cm]output1.east) {输出 $1$};

-    \node [system,fill=ugreen!20,draw,anchor=east,align=center,inner sep=1.9pt] (engine) at ([xshift=-0.5cm]output2.west) {单系统};
+    \node [system,fill=green!20,draw,anchor=east,align=center,inner sep=1.9pt] (engine) at ([xshift=-0.5cm]output2.west) {单系统};

    \draw [very thick,decorate,decoration={brace}] ([xshift=3pt]output1.north east) to node [midway,name=final] {} ([xshift=3pt]output3.south east);


--- a/Chapter14/Figures/figure-main-module.tex
+++ b/Chapter14/Figures/figure-main-module.tex

 \begin{tikzpicture}
 %左
-\node [anchor=west,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=blue!15,align=center,text=black] (part1) at (0,0) {\scriptsize{预测模块}};
+\node [anchor=west,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=red!15,align=center,text=black] (part1) at (0,0) {\small{预测模块}};
 \node [anchor=south] (text) at ([xshift=0.5em,yshift=-3.5em]part1.south) {\scriptsize{源语言句子（编码器输出）}};
-\node [anchor=east,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=blue!15,align=center,text=black] (part2) at ([xshift=10em]part1.east) {\scriptsize{搜索模块}};
+\node [anchor=east,draw=black!70,rounded corners,drop shadow,very thick,minimum width=6em,minimum height=3.5em,fill=green!15,align=center,text=black] (part2) at ([xshift=10em]part1.east) {\small{搜索模块}};

 \node [anchor=south] (text1) at ([xshift=0.1em,yshift=2.2em]part1.north) {\scriptsize{译文中已经生成的单词}};
 \node [anchor=south] (text2) at ([xshift=0.5em,yshift=2.2em]part2.north) {\scriptsize{预测当前位置的单词概率分布}};

--- a/Chapter14/Figures/figure-multi-modality.tex
+++ b/Chapter14/Figures/figure-multi-modality.tex
@@ -8,10 +8,10 @@
 	\tikzstyle{po} = [font=\scriptsize,rounded corners=1pt, fill=gray!20, minimum width=1.8em,minimum height=1.5em,draw]
 	\tikzstyle{tgt} = [minimum height=1.6em,minimum width=5.2em,fill=black!10!yellow!30,font=\footnotesize,drop shadow={shadow xshift=0.15em,shadow yshift=-0.15em,}]
 	\tikzstyle{p} = [fill=ugreen!15,minimum width=0.4em,inner sep=0pt]
-\node[ rounded corners=3pt, fill=red!20, drop shadow, minimum width=12em,minimum height=4em,draw]  (encoder) at (0,0) {编码器};
-\node[anchor=north,rounded corners=3pt, fill=yellow!20, drop shadow, minimum width=12em,minimum height=2em,draw] (lenpre) at([yshift=3em]encoder.north){长度预测器};
+\node[ rounded corners=3pt, thick,fill=red!20, drop shadow, minimum width=12em,minimum height=4em,draw]  (encoder) at (0,0) {编码器};
+\node[anchor=north,rounded corners=3pt, thick,fill=yellow!20, drop shadow, minimum width=12em,minimum height=2em,draw] (lenpre) at([yshift=3em]encoder.north){长度预测器};
 \node[anchor=north] (lable) at([xshift=3.5em,yshift=2.5em]lenpre.north){译文长度：3};
-\node[anchor=west, rounded corners=3pt, fill=blue!20, drop shadow, minimum width=13em,minimum height=4em,draw] (decoder) at ([xshift=1cm]encoder.east) {解码器};
+\node[anchor=west, rounded corners=3pt, thick,fill=blue!20, drop shadow, minimum width=13em,minimum height=4em,draw] (decoder) at ([xshift=1cm]encoder.east) {解码器};

 \node[anchor=north,emb] (en1) at ([yshift=-1.3em,xshift=-4.5em]encoder.south) {${\mathbi e}$(干)};
 \node[anchor=north,emb] (en2) at ([yshift=-1.3em,xshift=-1.5em]encoder.south) {${\mathbi e}$(得)};

--- a/Chapter14/Figures/figure-non-autoregressive.tex
+++ b/Chapter14/Figures/figure-non-autoregressive.tex
@@ -7,10 +7,10 @@
 	\tikzstyle{emb} = [font=\scriptsize,rounded corners=1pt, fill=orange!20, minimum width=1.8em,minimum height=1.5em,draw]
 	\tikzstyle{po} = [font=\scriptsize,rounded corners=1pt, fill=gray!20, minimum width=1.8em,minimum height=1.5em,draw]
 \begin{scope} 
-\node[rounded corners=3pt, fill=red!20, drop shadow, minimum width=10em,minimum height=4em,draw]  (encoder) at (0,0) {编码器};
-\node[anchor=north,rounded corners=3pt, fill=yellow!20, drop shadow, minimum width=10em,minimum height=2em,draw] (lenpre) at([yshift=3em]encoder.north){长度预测器};
+\node[rounded corners=3pt, thick,fill=red!20, drop shadow, minimum width=10em,minimum height=4em,draw]  (encoder) at (0,0) {编码器};
+\node[anchor=north,rounded corners=3pt, thick,fill=yellow!20, drop shadow, minimum width=10em,minimum height=2em,draw] (lenpre) at([yshift=3em]encoder.north){长度预测器};
 \node[anchor=north] (lable) at([xshift=3.5em,yshift=2.5em]lenpre.north){译文长度：4};
-\node[anchor=west, rounded corners=3pt, fill=blue!20, drop shadow, minimum width=16em,minimum height=4em,draw] (decoder) at ([xshift=1.8cm]encoder.east) {解码器};
+\node[anchor=west, rounded corners=3pt, thick,fill=blue!20, drop shadow, minimum width=16em,minimum height=4em,draw] (decoder) at ([xshift=1.8cm]encoder.east) {解码器};

 \node[anchor=north,emb] (en2) at ([yshift=-1.3em]encoder.south) {${\mathbi e}(x_2)$};
 \node[anchor=north,emb] (en1) at ([yshift=-1.3em,xshift=-3em]encoder.south) {${\mathbi e}(x_1)$};
@@ -61,7 +61,7 @@
 \end{scope} 

 \begin{scope}[yshift=2.8in]
-\node[rounded corners=3pt, fill=red!20, drop shadow, minimum width=10em,minimum height=4em,draw]  (encoder) at (0,0) {编码器};
+\node[rounded corners=3pt, thick,fill=red!20, drop shadow, minimum width=10em,minimum height=4em,draw]  (encoder) at (0,0) {编码器};
 \node[anchor=west,minimum width=16em,minimum height=4em] (decoder) at ([xshift=1.8cm]encoder.east) {};

 \node[anchor=north,emb] (en2) at ([yshift=-1.3em]encoder.south) {${\mathbi e}(x_2)$};
@@ -122,7 +122,7 @@
 \draw [->,very thick,dotted] ([xshift=-0.3em]out2.east) .. controls +(east:0.5) and +(west:0.5) ..([xshift=0em]de3.west);
 \draw [->,very thick,dotted] ([xshift=-0.3em]out3.east) .. controls +(east:0.5) and +(west:0.5) ..([xshift=0em]de4.west);
 \draw [->,very thick,dotted] ([xshift=-0.3em]out4.east) .. controls +(east:0.5) and +(west:0.5) ..([xshift=0em]de5.west);
-\node[anchor=west, rounded corners=3pt, fill=blue!20, drop shadow, minimum width=16em,minimum height=4em,draw] (decoder2) at ([xshift=1.8cm]encoder.east) {解码器};
+\node[anchor=west, rounded corners=3pt, thick,fill=blue!20, drop shadow, minimum width=16em,minimum height=4em,draw] (decoder2) at ([xshift=1.8cm]encoder.east) {解码器};

 \draw[->,line width=1pt] (encoder.east) -- (decoder.west);
 \end{scope}

--- a/Chapter14/chapter14.tex
+++ b/Chapter14/chapter14.tex
@@ -154,7 +154,7 @@

 \begin{itemize}
 \vspace{0.5em}
-\item 长度惩罚因子。用译文长度来归一化翻译概率是最常用的方法：对于源语言句子$\seq{x}$和译文句子$\seq{y}$，模型得分$\textrm{score}(\seq{x},\seq{y})$的值会随着译文$\seq{y}$ 的长度增大而减小。为了避免此现象，可以引入一个长度惩罚函数$\textrm{lp}(\seq{y})$，并定义模型得分如公式\eqref{eq:14-12}所示：
+\item {\small\sffamily\bfseries{长度惩罚因子}}。用译文长度来归一化翻译概率是最常用的方法：对于源语言句子$\seq{x}$和译文句子$\seq{y}$，模型得分$\textrm{score}(\seq{x},\seq{y})$的值会随着译文$\seq{y}$ 的长度增大而减小。为了避免此现象，可以引入一个长度惩罚函数$\textrm{lp}(\seq{y})$，并定义模型得分如公式\eqref{eq:14-12}所示：

 \begin{eqnarray}
 \textrm{score}(\seq{x},\seq{y}) &=& \frac{\log \funp{P}(\seq{y}\vert\seq{x})}{\textrm{lp}(\seq{y})}
@@ -179,7 +179,7 @@
 \end{table}
 %----------------------------------------------------------------------------------------------------
 \vspace{0.5em}
-\item 译文长度范围约束。为了让译文的长度落在合理的范围内，神经机器翻译的推断也会设置一个译文长度约束\upcite{Vaswani2018Tensor2TensorFN,KleinOpenNMT}。令$[a,b]$表示一个长度范围，可以定义:
+\item {\small\sffamily\bfseries{译文长度范围约束}}。为了让译文的长度落在合理的范围内，神经机器翻译的推断也会设置一个译文长度约束\upcite{Vaswani2018Tensor2TensorFN,KleinOpenNMT}。令$[a,b]$表示一个长度范围，可以定义:

 \begin{eqnarray}
 a &=& \omega_{\textrm{low}}\cdot |\seq{x}| \label{eq:14-3}\\
@@ -188,7 +188,7 @@ b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \label{eq:14-4}
 \vspace{0.5em}
 \noindent 其中，$\omega_{\textrm{low}}$和$\omega_{\textrm{high}}$分别表示译文长度的下限和上限，比如，很多系统中设置为$\omega_{\textrm{low}}=1/2$，$\omega_{\textrm{high}}=2$，表示译文至少有源语言句子一半长，最多有源语言句子两倍长。$\omega_{\textrm{low}}$和$\omega_{\textrm{high}}$的设置对推断效率影响很大，$\omega_{\textrm{high}}$可以被看作是一个推断的终止条件，最理想的情况是$\omega_{\textrm{high}} \cdot |\seq{x}|$恰巧就等于最佳译文的长度，这时没有浪费任何计算资源。反过来的一种情况，$\omega_{\textrm{high}} \cdot |\seq{x}|$远大于最佳译文的长度，这时很多计算都是无用的。为了找到长度预测的准确率和召回率之间的平衡，一般需要大量的实验最终确定$\omega_{\textrm{low}}$和$\omega_{\textrm{high}}$。当然，利用统计模型预测$\omega_{\textrm{low}}$ 和$\omega_{\textrm{high}}$也是非常值得探索的方向，比如基于繁衍率的模型\upcite{Gu2017NonAutoregressiveNM,Feng2016ImprovingAM}。
 \vspace{0.5em}
-\item 覆盖度模型。译文长度过长或过短的问题，本质上对应着 {\small\sffamily\bfseries{过翻译}}\index{过翻译}（Over Translation）\index{Over Translation}和{\small\sffamily\bfseries{欠翻译}}\index{欠翻译}（Under Translation）\index{Under Translation}的问题\upcite{Yang2018OtemUtemOA}。这两种问题出现的原因主要在于：神经机器翻译没有对过翻译和欠翻译建模，即机器翻译覆盖度问题\upcite{TuModeling}。针对此问题，最常用的方法是在推断的过程中引入一个度量覆盖度的模型。比如，使用GNMT 覆盖度模型定义模型得分\upcite{Wu2016GooglesNM}，如下：
+\item {\small\sffamily\bfseries{覆盖度模型}}。译文长度过长或过短的问题，本质上对应着 {\small\sffamily\bfseries{过翻译}}\index{过翻译}（Over Translation）\index{Over Translation}和{\small\sffamily\bfseries{欠翻译}}\index{欠翻译}（Under Translation）\index{Under Translation}的问题\upcite{Yang2018OtemUtemOA}。这两种问题出现的原因主要在于：神经机器翻译没有对过翻译和欠翻译建模，即机器翻译覆盖度问题\upcite{TuModeling}。针对此问题，最常用的方法是在推断的过程中引入一个度量覆盖度的模型。比如，使用GNMT 覆盖度模型定义模型得分\upcite{Wu2016GooglesNM}，如下：
 \begin{eqnarray}
 \textrm{score}(\seq{x},\seq{y}) &=& \frac{\log \funp{P}(\seq{y} | \seq{x})}{\textrm{lp}(\seq{y})} + \textrm{cp}(\seq{x},\seq{y}) \label {eq:14-5}\\
 \textrm{cp}(\seq{x},\seq{y}) &=& \beta \cdot \sum_{i=1}^{|\seq{x}|} \log(\textrm{min} (\sum_{j}^{|\seq{y}|} a_{ij} , 1))
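
Reviewer note (not part of the commit): a worked instance of the two formulas touched in the Chapter14 hunks above, the length-range bounds $a$, $b$ and the GNMT-style coverage term $\textrm{cp}$. All numbers are illustrative assumptions, not values from the book: a source length $|\seq{x}|=20$ for the bounds, a 3-word source with accumulated attention sums $1.2$, $0.5$, $1.0$ for the coverage term, $\beta=0.2$, and natural logarithms. The snippet reuses the `\seq` macro that chapter14.tex already relies on.

```latex
% Worked example with assumed values (|x| = 20, beta = 0.2, natural log).
% \seq is the book's own macro; nothing here is taken from the commit itself.
\begin{eqnarray}
a &=& \omega_{\textrm{low}}\cdot |\seq{x}| \;=\; \frac{1}{2}\cdot 20 \;=\; 10 \nonumber \\
b &=& \omega_{\textrm{high}}\cdot |\seq{x}| \;=\; 2\cdot 20 \;=\; 40 \nonumber \\
\textrm{cp}(\seq{x},\seq{y}) &=& \beta \cdot \sum_{i=1}^{3} \log\Big(\textrm{min}\big(\sum_{j} a_{ij},\, 1\big)\Big) \nonumber \\
&=& 0.2\cdot\big(\log 1 + \log 0.5 + \log 1\big) \;\approx\; 0.2\cdot(-0.693) \;\approx\; -0.139 \nonumber
\end{eqnarray}
```

Under these assumptions, decoding may stop once a hypothesis reaches 40 tokens, and hypotheses shorter than 10 tokens are not accepted as complete. Only the under-attended second source word (sum $0.5$) contributes to the coverage term; the first word's sum of $1.2$ is clipped to $1$ by the $\textrm{min}$, so this term lowers the score of under-translated hypotheses but does not by itself penalize over-translation.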

--- a/Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution.tex
+++ b/Chapter15/Figures/figure-attention-distribution-based-on-gaussian-distribution.tex
 %%%------------------------------------------------------------------------------------------------------------
-%%% 调序模型1：基于距离的调序
+
 \begin{center}
 \begin{tikzpicture}

@@ -10,34 +10,27 @@

 \begin{scope}

-\node [anchor=north west,cirnode] (c1) at (0, 0) {};
+\node [anchor=north west,circle,inner sep=4pt] (c1) at (0, 0) {};

 \draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
 \draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
 \draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
 \draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);

-\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
-\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
-\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
-\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
-\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
-\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
-
-\node [anchor=south,colnode,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
-\node [anchor=south,colnode,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
-\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
-\node [anchor=south,colnode,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
-\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
-\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+\node [anchor=south,colnode,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=-5em]c1.south) {};
+\node [anchor=south,colnode,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=1.67em,yshift=0em]b1.south) {};
+\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=1.67em,yshift=0em]b2.south) {};
+\node [anchor=south,colnode,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=1.67em,yshift=0em]b3.south) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=1.67em,yshift=0em]b4.south) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=1.67em,yshift=0em]b5.south) {};

 {\scriptsize
-\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
-\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){\color{ugreen!30}held};
-\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){a};
-\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){talk};
-\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
-\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]b1.south){\color{orange}It};
+\node [anchor=west] (n2) at ([xshift=1em,yshift=0em]n1.east){\color{ugreen}is};
+\node [anchor=west] (n3) at ([xshift=1em,yshift=-0.1em]n2.east){a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0.1em]n3.east){nice};
+\node [anchor=west] (n5) at ([xshift=0em,yshift=-0.1em]n4.east){day};
+\node [anchor=west] (n6) at ([xshift=-0.1em,yshift=0em]n5.east){today};
 }

 \node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(a)原始分布}};
@@ -56,53 +49,31 @@
 \draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
 \draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);

-\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
-\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
-\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
-\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
-\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
-\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
-
-\node [anchor=south,inner sep=0.1pt,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
-\node [anchor=south,inner sep=0.1pt,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
-\node [anchor=south,inner sep=0.1pt,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
-\node [anchor=south,inner sep=0.1pt,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
-\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
-\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=1.6em,minimum width=1em] (b1) at ([xshift=0em,yshift=-5em]c1.south) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=4.1em,minimum width=1em] (b2) at ([xshift=1.67em,yshift=0em]b1.south) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.8em,minimum width=1em] (b3) at ([xshift=1.67em,yshift=0em]b2.south) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.4em,minimum width=1em] (b4) at ([xshift=1.67em,yshift=0em]b3.south) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b5) at ([xshift=1.67em,yshift=0em]b4.south) {};
+\node [anchor=south,inner sep=0.1pt,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=1.67em,yshift=0em]b5.south) {};

 {\scriptsize
-\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
-\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){held};
-\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){a};
-\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){\color{blue!60}talk};
-\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
-\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]b1.south){\color{orange}It};
+\node [anchor=west] (n2) at ([xshift=0.5em,yshift=0em]n1.east){is};
+\node [anchor=west] (n3) at ([xshift=0.5em,yshift=-0.1em]n2.east){a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0.1em]n3.east){\color{blue}nice};
+\node [anchor=west] (n5) at ([xshift=0em,yshift=-0.1em]n4.east){day};
+\node [anchor=west] (n6) at ([xshift=0em,yshift=0em]n5.east){today};
 }

-%\node [anchor=west,show] (s1) at (-0.5em,-5.7em){};
-%\node [anchor=west,show] (s2) at (-0.2em,-5.6em){};
-%\node [anchor=west,show] (s3) at (1.1em,-5.5em){};
-%\node [anchor=west,show] (s11) at (1.9em,-5em){};
-%\node [anchor=west,show] (s4) at (3em,-4em){};
-%\node [anchor=west,show] (s5) at (3.7em,-3em){};
-%\node [anchor=west,show] (s12) at (4.2em,-2.2em){};
-%\node [anchor=west,show] (s6) at (5.3em,-1.4em){};
-%\node [anchor=west,show] (s7) at (6.3em,-2em){};
-%\node [anchor=west,show] (s13) at (7.4em,-3em){};
-%\node [anchor=west,show] (s8) at (8.4em,-3.9em){};
-%\node [anchor=west,show] (s9) at (8.9em,-4.5em){};
-%\node [anchor=west,show] (s10) at (9.3em,-5em){};
-
-%\draw[-,blue!60,thick] (-0.5em,-5.7em)..controls (-0.2em,-5.6em) and (1.1em,-5.5em)..(1.9em,-5em)..controls (3em,-4em) and (3.7em,-3em)..(4.2em,-2.2em)..controls (5.3em,-1.4em) and (6.3em,-2em)..(7.4em,-3em)..controls (8.4em,-3.9em) and (8.9em,-4.5em)..(9.3em,-5em);
-%\draw[-,blue!60,thick] (-1em,-6em)..controls (0em,-5.7em) and (1em,-5em)..(1.6em,-4.3em)..controls (5.3em,1em) and (7.4em,-2em)..(9.3em,-5em);
-\draw[-,blue!60,thick] ([xshift=-1em,yshift=-4.7em]c1.south)..controls (3.8em,-6em) and (3.9em,3.6em)..([xshift=9.3em,yshift=-4.3em]c1.south);
+\draw [-,blue!60,thick] ([xshift=-1em,yshift=-4.7em]c1.south) cos(2em,-4em) sin (4.4em,-1.7em) cos(7.3em,-4em) sin([xshift=9.3em,yshift=-4.7em]c1.south);
+%\draw[-,blue!60,thick] ([xshift=-1em,yshift=-4.7em]c1.south)..controls (3.8em,-6em) and (3.9em,3.6em)..([xshift=9.3em,yshift=-4.3em]c1.south);

 \node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(b)高斯分布}};

 \draw[-,very thick] ([xshift=1.2em,yshift=2.2em]b6.north)--([xshift=1.7em,yshift=2.2em]b6.north);
 \draw[-,very thick] ([xshift=1.2em,yshift=1.9em]b6.north)--([xshift=1.7em,yshift=1.9em]b6.north);

-\node [anchor=south] (t1) at ([xshift=0em,yshift=6.7em]n4.north){$D_i$};
+\node [anchor=south] (t1) at ([xshift=0em,yshift=5.7em]n4.north){$D_i$};
 \draw[->] ([xshift=0em,yshift=0em]t1.west)--([xshift=-1em,yshift=0em]t1.west);
 \draw[->] ([xshift=0em,yshift=0em]t1.east)--([xshift=1em,yshift=0em]t1.east);
 \draw[-] ([xshift=1em,yshift=-0.5em]t1.east)--([xshift=1em,yshift=0.5em]t1.east);
@@ -112,34 +83,27 @@

 \begin{scope}[xshift=8.8cm,yshift=0em]

-\node [anchor=north west,cirnode] (c1) at (0, 0) {};
+\node [anchor=north west,circle,inner sep=4pt] (c1) at (0, 0) {};

 \draw[-,dotted] ([xshift=-1em,yshift=-0.5em]c1.south)--([xshift=9.3em,yshift=-0.5em]c1.south);
 \draw[-,dotted] ([xshift=-1em,yshift=-2em]c1.south)--([xshift=9.3em,yshift=-2em]c1.south);
 \draw[-,dotted] ([xshift=-1em,yshift=-3.5em]c1.south)--([xshift=9.3em,yshift=-3.5em]c1.south);
 \draw[-,dotted] ([xshift=-1em,yshift=-5em]c1.south)--([xshift=9.3em,yshift=-5em]c1.south);

-\node [anchor=north,cirnode] (c2) at ([xshift=0em,yshift=-5.5em]c1.south) {};
-\node [anchor=west,cirnode] (c3) at ([xshift=0.6em,yshift=0em]c2.east) {};
-\node [anchor=west,cirnode] (c4) at ([xshift=0.6em,yshift=0em]c3.east) {};
-\node [anchor=west,cirnode] (c5) at ([xshift=0.6em,yshift=0em]c4.east) {};
-\node [anchor=west,cirnode] (c6) at ([xshift=0.6em,yshift=0em]c5.east) {};
-\node [anchor=west,cirnode] (c7) at ([xshift=0.6em,yshift=0em]c6.east) {};
-
-\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b1) at ([xshift=0em,yshift=0.5em]c2.north) {};
-\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b2) at ([xshift=0em,yshift=0.5em]c3.north) {};
-\node [anchor=south,colnode,minimum height=3.7em,minimum width=1em] (b3) at ([xshift=0em,yshift=0.5em]c4.north) {};
-\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b4) at ([xshift=0em,yshift=0.5em]c5.north) {};
-\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b5) at ([xshift=0em,yshift=0.5em]c6.north) {};
-\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=0em,yshift=0.5em]c7.north) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b1) at ([xshift=0em,yshift=-5em]c1.south) {};
+\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b2) at ([xshift=1.67em,yshift=0em]b1.south) {};
+\node [anchor=south,colnode,minimum height=3.7em,minimum width=1em] (b3) at ([xshift=1.67em,yshift=0em]b2.south) {};
+\node [anchor=south,colnode,minimum height=4.2em,minimum width=1em] (b4) at ([xshift=1.67em,yshift=0em]b3.south) {};
+\node [anchor=south,colnode,minimum height=0.8em,minimum width=1em] (b5) at ([xshift=1.67em,yshift=0em]b4.south) {};
+\node [anchor=south,colnode,minimum height=0.15em,minimum width=1em] (b6) at ([xshift=1.67em,yshift=0em]b5.south) {};

 {\scriptsize
-\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]c2.south){\color{orange}Bush};
-\node [anchor=west] (n2) at ([xshift=-0.2em,yshift=0em]n1.east){\color{ugreen!30}held};
-\node [anchor=west] (n3) at ([xshift=0.35em,yshift=0em]n2.east){\color{ugreen!30}a};
-\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0em]n3.east){\color{ugreen!30}talk};
-\node [anchor=west] (n5) at ([xshift=-0.3em,yshift=0em]n4.east){with};
-\node [anchor=west] (n6) at ([xshift=-0.3em,yshift=0em]n5.east){Sharon};
+\node [anchor=center] (n1) at ([xshift=0em,yshift=-1em]b1.south){\color{orange}It};
+\node [anchor=west] (n2) at ([xshift=1em,yshift=0em]n1.east){\color{ugreen}is};
+\node [anchor=west] (n3) at ([xshift=1em,yshift=-0.1em]n2.east){\color{ugreen}a};
+\node [anchor=west] (n4) at ([xshift=0.5em,yshift=0.1em]n3.east){\color{ugreen}nice};
+\node [anchor=west] (n5) at ([xshift=0em,yshift=-0.1em]n4.east){day};
+\node [anchor=west] (n6) at ([xshift=-0.1em,yshift=0em]n5.east){today};
 }

 \node [anchor=north] (l1) at ([xshift=1em,yshift=-1em]n3.south){\small {(c)修改后的分布}};

--- a/Chapter15/Figures/figure-encoder-of-bidirectional-tree-structure.tex
+++ b/Chapter15/Figures/figure-encoder-of-bidirectional-tree-structure.tex
@@ -2,10 +2,10 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=3em,rounded corners=5pt,fill=ugreen!20]
+\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=3em,rounded corners=5pt,fill=green!20]
 \tikzstyle{tnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=3em,rounded corners=5pt,fill=red!20]
 \tikzstyle{fnoder}=[rectangle,inner sep=0mm,minimum height=2.4em,minimum width=6.8em,draw,dashed,very thick,rounded corners=5pt,red!40]
-\tikzstyle{fnodeg}=[rectangle,inner sep=0mm,minimum height=2.4em,minimum width=6.8em,draw,dashed,very thick,rounded corners=5pt,ugreen!40]
+\tikzstyle{fnodeg}=[rectangle,inner sep=0mm,minimum height=2.4em,minimum width=6.8em,draw,dashed,very thick,rounded corners=5pt,green!40]

 \node [anchor=south west,fnodeg] (f1) at (0,0) {};
 \node [anchor=west,hnode] (n1) at ([xshift=0.2em,yshift=0em]f1.west) {$\mathbi{h}_1^{\textrm{up}}$};
@@ -24,24 +24,24 @@
 \node [anchor=east,hnode] (n8) at ([xshift=-0.2em,yshift=0em]f4.east) {$\cdots$};

 \node [anchor=west,fnodeg] (f5) at ([xshift=0.6em,yshift=0em]f4.east) {};
-\node [anchor=west,hnode] (n9) at ([xshift=0.2em,yshift=0em]f5.west) {$\mathbi{h}_n^{\textrm{up}}$};
-\node [anchor=east,hnode] (n10) at ([xshift=-0.2em,yshift=0em]f5.east) {$\mathbi{h}_n^{\textrm{down}}$};
+\node [anchor=west,hnode] (n9) at ([xshift=0.2em,yshift=0em]f5.west) {$\mathbi{h}_m^{\textrm{up}}$};
+\node [anchor=east,hnode] (n10) at ([xshift=-0.2em,yshift=0em]f5.east) {$\mathbi{h}_m^{\textrm{down}}$};

 \node [anchor=south,fnoder] (f6) at ([xshift=3.7em,yshift=1em]f1.north) {};
-\node [anchor=west,tnode] (n11) at ([xshift=0.2em,yshift=0em]f6.west) {$\mathbi{h}_{n+1}^{\textrm{up}}$};
-\node [anchor=east,tnode] (n12) at ([xshift=-0.2em,yshift=0em]f6.east) {$\mathbi{h}_{n+1}^{\textrm{down}}$};
+\node [anchor=west,tnode] (n11) at ([xshift=0.2em,yshift=0em]f6.west) {$\mathbi{h}_{m+1}^{\textrm{up}}$};
+\node [anchor=east,tnode] (n12) at ([xshift=-0.2em,yshift=0em]f6.east) {$\mathbi{h}_{m+1}^{\textrm{down}}$};

 \node [anchor=south,fnoder] (f7) at ([xshift=3.7em,yshift=1em]f6.north) {};
-\node [anchor=west,tnode] (n13) at ([xshift=0.2em,yshift=0em]f7.west) {$\mathbi{h}_{n+2}^{\textrm{up}}$};
-\node [anchor=east,tnode] (n14) at ([xshift=-0.2em,yshift=0em]f7.east) {$\mathbi{h}_{n+2}^{\textrm{down}}$};
+\node [anchor=west,tnode] (n13) at ([xshift=0.2em,yshift=0em]f7.west) {$\mathbi{h}_{m+2}^{\textrm{up}}$};
+\node [anchor=east,tnode] (n14) at ([xshift=-0.2em,yshift=0em]f7.east) {$\mathbi{h}_{m+2}^{\textrm{down}}$};

 \node [anchor=south,fnoder] (f8) at ([xshift=3.7em,yshift=1em]f7.north) {};
 \node [anchor=west,tnode] (n15) at ([xshift=0.2em,yshift=0em]f8.west) {$\cdots$};
 \node [anchor=east,tnode] (n16) at ([xshift=-0.2em,yshift=0em]f8.east) {$\cdots$};

 \node [anchor=south,fnoder] (f9) at ([xshift=3.7em,yshift=1em]f8.north) {};
-\node [anchor=west,tnode] (n17) at ([xshift=0.2em,yshift=0em]f9.west) {$\mathbi{h}_{2n-1}^{\textrm{up}}$};
-\node [anchor=east,tnode] (n18) at ([xshift=-0.2em,yshift=0em]f9.east) {$\mathbi{h}_{2n-1}^{\textrm{down}}$};
+\node [anchor=west,tnode] (n17) at ([xshift=0.2em,yshift=0em]f9.west) {$\mathbi{h}_{2m-1}^{\textrm{up}}$};
+\node [anchor=east,tnode] (n18) at ([xshift=-0.2em,yshift=0em]f9.east) {$\mathbi{h}_{2m-1}^{\textrm{down}}$};


 \draw [->,thick] ([xshift=0em,yshift=0em]n11.east) -- ([xshift=0em,yshift=0em]n12.west);

--- a/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.tex
+++ b/Chapter15/Figures/figure-encoder-structure-of-transformer-model-optimized-by-nas.tex
@@ -39,42 +39,49 @@
 \end{scope}

 %right
-\begin{scope}[xshift=14em]
-\foreach \x/\d in {1/2em, 2/8em, 3/14em}
-	\node[unit,fill=yellow!20] at (0,\d) (ln_\x) {层正则化};
+\begin{scope}[xshift=13em]
+

-\foreach \x/\d in {1/6em, 2/12em, 3/22em}
-	\node[draw,circle,minimum size=1em,inner sep=1pt] at (0,\d) (add_\x) {\scriptsize\bfnew{+}};

-\node[unit,fill=red!20] at (0,16em) (conv_4) {卷积$1 \times 1$：2048};
-\node[unit,fill=red!20] at (0,20em) (conv_5) {卷积$1 \times 1$：512};
+\foreach \x/\d in {1/2em, 2/8em, 3/16em}
+	\node[unit,fill=yellow!20] at (0,\d) (ln_\x) {层正则化};

-\node[unit,fill=blue!20] at (0,18em) (relu_3) {RELU};
-\node[unit,fill=cyan!20] at (0,4em) (conv_3) {Sep卷积$9 \times 1$：256};
-\node[unit,fill=green!20] at (0,10em) (sa_1) {8头自注意力：512};
+\foreach \x/\d in {1/6em, 2/14em, 3/20em}
+	\node[draw,circle,minimum size=1em,inner sep=1pt] at (0,\d) (add_\x) {\scriptsize\bfnew{+}};

+\node[unit,fill=red!20] at (0,4em) (glu_1) {门控线性单元：512};
+\node[unit,fill=red!20] at (-3em,10em) (conv_1) {卷积$1 \times 1$：2048};
+\node[unit,fill=cyan!20] at (3em,10em) (conv_2) {卷积$3 \times 1$：256};
+\node[unit,fill=blue!20] at (-3em,12em) (relu_1) {RELU};
+\node[unit,fill=blue!20] at (3em,12em) (relu_2) {RELU};
+\node[unit,fill=cyan!20] at (0em,18em) (conv_3) {Sep卷积$9 \times 1$：256};


 \draw[->,thick] ([yshift=-1.4em]ln_1.-90) -- ([yshift=-0.1em]ln_1.-90);
-\draw[->,thick] ([yshift=0.1em]ln_1.90) -- ([yshift=-0.1em]conv_3.-90);
-\draw[->,thick] ([yshift=0.1em]conv_3.90) -- ([yshift=-0.1em]add_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_1.90) -- ([yshift=-0.1em]glu_1.-90);
+\draw[->,thick] ([yshift=0.1em]glu_1.90) -- ([yshift=-0.1em]add_1.-90);
 \draw[->,thick] ([yshift=0.1em]add_1.90) -- ([yshift=-0.1em]ln_2.-90);
-\draw[->,thick] ([,yshift=0.1em]ln_2.90) -- ([yshift=-0.1em]sa_1.-90);
-\draw[->,thick] ([yshift=0.1em]sa_1.90) -- ([yshift=-0.1em]add_2.-90);
+\draw[->,thick] ([,yshift=0.1em]ln_2.135) -- ([yshift=-0.1em]conv_1.-90);
+\draw[->,thick] ([yshift=0.1em]ln_2.45) -- ([yshift=-0.1em]conv_2.-90);
+\draw[->,thick] ([yshift=0.1em]conv_1.90) -- ([yshift=-0.1em]relu_1.-90);
+\draw[->,thick] ([yshift=0.1em]conv_2.90) -- ([yshift=-0.1em]relu_2.-90);
+\draw[->,thick] ([yshift=0.1em]relu_1.90) -- ([yshift=-0.1em]add_2.-135);
+\draw[->,thick] ([yshift=0.1em]relu_2.90) -- ([yshift=-0.1em]add_2.-45);
 \draw[->,thick] ([yshift=0.1em]add_2.90) -- ([yshift=-0.1em]ln_3.-90);
-\draw[->,thick] ([yshift=0.1em]ln_3.90) -- ([yshift=-0.1em]conv_4.-90);
-\draw[->,thick] ([yshift=0.1em]conv_4.90) -- ([yshift=-0.1em]relu_3.-90);
-\draw[->,thick] ([yshift=0.1em]relu_3.90) -- ([yshift=-0.1em]conv_5.-90);
-\draw[->,thick] ([yshift=0.1em]conv_5.90) -- ([yshift=-0.1em]add_3.-90);
+\draw[->,thick] ([yshift=0.1em]ln_3.90) -- ([yshift=-0.1em]conv_3.-90);
+\draw[->,thick] ([yshift=0.1em]conv_3.90) -- ([yshift=-0.1em]add_3.-90);
 \draw[->,thick] ([yshift=0.1em]add_3.90) -- ([yshift=1em]add_3.90);

+
+
 \draw[->,thick] ([yshift=-0.8em]ln_1.-90) .. controls ([xshift=5em,yshift=-0.8em]ln_1.-90) and ([xshift=5em]add_1.0) .. (add_1.0);
-\draw[->,thick] (add_1.0) .. controls ([xshift=5em]add_1.0) and ([xshift=5em]add_2.0) .. (add_2.0);
-\draw[->,thick] (add_2.0) .. controls ([xshift=5em]add_2.0) and ([xshift=5em]add_3.0) .. (add_3.0);
+\draw[->,thick] (add_1.0) .. controls ([xshift=8em]add_1.0) and ([xshift=8em]add_3.0) .. (add_3.0);
+
+

 \node[font=\scriptsize,align=center] at (0em, -1.5em){(b) 使用结构搜索方法优化后的 \\ Transformer编码器中若干块的结构};

-\node[minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=blue!20] (act) at (5.5em, 20em){};
+\node[minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=blue!20] (act) at (8em, 20em){};
 \node[anchor=west,font=\footnotesize] at ([xshift=0.1em]act.east){激活函数};
 \node[anchor=north,minimum size=0.8em,inner sep=0pt,rounded corners=1pt,draw,fill=yellow!20] (nor) at ([yshift=-0.6em]act.south){};
 \node[anchor=west,font=\footnotesize] at ([xshift=0.1em]nor.east){层正则化};

--- a/Chapter15/Figures/figure-encoder-tree-structure-modeling.tex
+++ b/Chapter15/Figures/figure-encoder-tree-structure-modeling.tex
@@ -2,7 +2,7 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,fill=ugreen!30]
+\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,fill=green!30]
 \tikzstyle{tnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=4.5em,rounded corners=5pt,fill=red!30]
 \tikzstyle{wnode}=[inner sep=0mm,minimum height=1.4em,minimum width=4.4em]

@@ -10,12 +10,12 @@
 \node [anchor=west,hnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_2$};
 \node [anchor=west,hnode] (n3) at ([xshift=1em,yshift=0em]n2.east) {$\mathbi{h}_3$};
 \node [anchor=west,hnode] (n4) at ([xshift=1em,yshift=0em]n3.east) {$\cdots$};
-\node [anchor=west,hnode] (n5) at ([xshift=1em,yshift=0em]n4.east) {$\mathbi{h}_n$};
+\node [anchor=west,hnode] (n5) at ([xshift=1em,yshift=0em]n4.east) {$\mathbi{h}_m$};

-\node [anchor=south,tnode] (t1) at ([xshift=2.8em,yshift=1em]n1.north) {$\mathbi{h}_{n+1}$};
-\node [anchor=south,tnode] (t2) at ([xshift=2.8em,yshift=1em]t1.north) {$\mathbi{h}_{n+2}$};
+\node [anchor=south,tnode] (t1) at ([xshift=2.8em,yshift=1em]n1.north) {$\mathbi{h}_{m+1}$};
+\node [anchor=south,tnode] (t2) at ([xshift=2.8em,yshift=1em]t1.north) {$\mathbi{h}_{m+2}$};
 \node [anchor=south,tnode] (t3) at ([xshift=2.8em,yshift=1em]t2.north) {$\cdots$};
-\node [anchor=south,tnode] (t4) at ([xshift=2.8em,yshift=1em]t3.north) {$\mathbi{h}_{2n-1}$};
+\node [anchor=south,tnode] (t4) at ([xshift=2.8em,yshift=1em]t3.north) {$\mathbi{h}_{2m-1}$};

 \draw [->,thick] ([xshift=0em,yshift=0em]n1.east) -- ([xshift=0em,yshift=0em]n2.west);
 \draw [->,thick] ([xshift=0em,yshift=0em]n2.east) -- ([xshift=0em,yshift=0em]n3.west);

--- a/Chapter15/Figures/figure-layer-fusion-method-2d.tex
+++ b/Chapter15/Figures/figure-layer-fusion-method-2d.tex
@@ -7,9 +7,9 @@
 \tikzstyle{vlnode}=[rectangle,inner sep=0mm,minimum height=1em,minimum width=5em,rounded corners=2pt,draw]


-\node [anchor=west,lnode] (n1) at (0, 0) {$\mathbi{g}^3$};
-\node [anchor=north west,lnode] (n2) at ([xshift=0em,yshift=-0.5em]n1.south west) {$\mathbi{g}^2$};
-\node [anchor=north west,lnode] (n3) at ([xshift=0em,yshift=-0.5em]n2.south west) {$\mathbi{g}^1$};
+\node [anchor=west,lnode] (n1) at (0, 0) {$\mathbi{h}_3$};
+\node [anchor=north west,lnode] (n2) at ([xshift=0em,yshift=-0.5em]n1.south west) {$\mathbi{h}_2$};
+\node [anchor=north west,lnode] (n3) at ([xshift=0em,yshift=-0.5em]n2.south west) {$\mathbi{h}_1$};

 \node [anchor=south] (d1) at ([xshift=0em,yshift=0.2em]n1.north) {1D};


--- a/Chapter15/Figures/figure-layer-fusion-method.tex
+++ b/Chapter15/Figures/figure-layer-fusion-method.tex
@@ -19,7 +19,7 @@

 \node [anchor=west,encnode,draw=red!60!black!80,fill=red!20] (n7) at ([xshift=1em,yshift=0em]n6.east) {$\mathbi{h}_{L-1}$};

-\node [anchor=north,rectangle,draw=teal!80, inner sep=0mm,minimum height=2em,minimum width=8em,fill=teal!17,rounded corners=5pt,thick] (n8) at ([xshift=3em,yshift=-1.2em]n4.south) {权重聚合$\mathbi{g}$};
+\node [anchor=north,rectangle,draw=teal!80, inner sep=0mm,minimum height=2em,minimum width=8em,fill=teal!17,rounded corners=5pt,thick] (n8) at ([xshift=3em,yshift=-1.5em]n4.south) {权重聚合$\mathbi{g}$};



@@ -61,11 +61,11 @@
 \draw [->,thick] ([xshift=0em,yshift=0em]n5.east) -- ([xshift=0em,yshift=0em]n6.west);
 \draw [->,thick] ([xshift=0em,yshift=0em]n6.east) -- ([xshift=0em,yshift=0em]n7.west);

-\draw [->,thick] ([xshift=0em,yshift=0em]n2.south) -- ([xshift=0em,yshift=0em]n8.north);
-\draw [->,thick] ([xshift=0em,yshift=0em]n3.south) -- ([xshift=0em,yshift=0em]n8.north);
-\draw [->,thick] ([xshift=0em,yshift=0em]n4.south) -- ([xshift=0em,yshift=0em]n8.north);
-\draw [->,thick] ([xshift=0em,yshift=0em]n5.south) -- ([xshift=0em,yshift=0em]n8.north);
-\draw [->,thick] ([xshift=0em,yshift=0em]n7.south) -- ([xshift=0em,yshift=0em]n8.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n2.south)..controls +(south:1.5em) and +(north:1.5em)..([xshift=0em,yshift=0.1em]n8.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n3.south)..controls +(south:0.9em) and +(north:1.6em)..([xshift=0em,yshift=0.1em]n8.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n4.south)..controls +(south:0.8em) and +(north:1.4em)..([xshift=0em,yshift=0.1em]n8.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n5.south)..controls +(south:0.8em) and +(north:1.4em)..([xshift=0em,yshift=0.1em]n8.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n7.south)..controls +(south:1.5em) and +(north:1.5em)..([xshift=0em,yshift=0.1em]n8.north);

 \draw [->,thick] ([xshift=0em,yshift=0em]n10.east) -- ([xshift=0em,yshift=0em]n11.west);
 \draw [->,thick] ([xshift=0em,yshift=0em]n11.east) -- ([xshift=0em,yshift=0em]n12.west);
@@ -74,10 +74,10 @@
 \draw [->,thick] ([xshift=0em,yshift=0em]n14.east) -- ([xshift=0em,yshift=0em]n15.west);
 \draw [->,thick] ([xshift=0em,yshift=0em]n15.east) -- ([xshift=0em,yshift=0em]n16.west);

-\draw [->,thick] ([xshift=0em,yshift=0em]n8.south) -- ([xshift=0em,yshift=0em]n11.north);
-\draw [->,thick] ([xshift=0em,yshift=0em]n8.south) -- ([xshift=0em,yshift=0em]n12.north);
-\draw [->,thick] ([xshift=0em,yshift=0em]n8.south) -- ([xshift=0em,yshift=0em]n13.north);
-\draw [->,thick] ([xshift=0em,yshift=0em]n8.south) -- ([xshift=0em,yshift=0em]n15.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n8.south)..controls +(south:1.2em) and +(north:1.5em)..([xshift=0em,yshift=0em]n11.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n8.south)..controls +(south:1.3em) and +(north:1.2em)..([xshift=0em,yshift=0em]n12.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n8.south)..controls +(south:1.3em) and +(north:1.2em)..([xshift=0em,yshift=0em]n13.north);
+\draw [->,thick] ([xshift=0em,yshift=0em]n8.south)..controls +(south:1.5em) and +(north:1.5em)..([xshift=0em,yshift=0em]n15.north);




--- a/Chapter15/Figures/figure-light-weight-transformer-module.tex
+++ b/Chapter15/Figures/figure-light-weight-transformer-module.tex
@@ -3,9 +3,9 @@
 \begin{center}
 \begin{tikzpicture}

-\tikzstyle{manode}=[rectangle,inner sep=0mm,minimum height=4em,minimum width=4em,rounded corners=5pt,thick,draw,fill=blue!20]
-\tikzstyle{ffnnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=6em,rounded corners=5pt,thick,fill=red!20,draw]
-\tikzstyle{ebnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=10em,rounded corners=5pt,thick,fill=green!20,draw]
+\tikzstyle{manode}=[rectangle,inner sep=0mm,minimum height=4em,minimum width=4em,rounded corners=5pt,thick,draw,fill=blue!15]
+\tikzstyle{ffnnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=6em,rounded corners=5pt,thick,fill=red!15,draw]
+\tikzstyle{ebnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=10em,rounded corners=5pt,thick,fill=ugreen!15,draw]

 \begin{scope}[]

@@ -33,6 +33,11 @@
 \draw[->,thick,rectangle,rounded corners=5pt] ([xshift=0em,yshift=0.5em]f1.north)--([xshift=-6em,yshift=0.5em]f1.north)--([xshift=-5.45em,yshift=0em]add1.west)--([xshift=0em,yshift=0em]add1.west);


+\node [anchor=north,inner sep=0mm,minimum height=1.5em] (ip) at ([xshift=0em,yshift=-1em]f1.south){input};
+\node [anchor=south,inner sep=0mm,minimum height=1.5em] (op) at ([xshift=0em,yshift=1em]f2.north){output};
+\draw[->,thick] ([xshift=0em,yshift=0em]ip.north)--([xshift=0em,yshift=0em]f1.south);
+\draw[->,thick] ([xshift=0em,yshift=0em]f2.north)--([xshift=0em,yshift=0em]op.south);
+
 \end{scope}
 \end{tikzpicture}
 \end{center}
\ No newline at end of file
--- a/Chapter15/Figures/figure-main-flow-of-neural-network-structure-search.tex
+++ b/Chapter15/Figures/figure-main-flow-of-neural-network-structure-search.tex
@@ -5,9 +5,9 @@
 \begin{scope}[scale=0.36]
 \tikzstyle{every node}=[scale=0.36]

-\node[draw=ublue,very thick,drop shadow,fill=white,minimum width=40em,minimum height=25em] (rec3) at (2.25,0){};
-\node[draw=ublue,very thick,drop shadow,fill=white,minimum width=22em,minimum height=25em] (rec2) at (-12.4,0){};
-\node[draw=ublue,very thick,drop shadow,fill=white,minimum width=24em,minimum height=25em] (rec1) at (-24,0){};
+\node[draw=ublue,very thick,rounded corners=3pt,drop shadow,fill=white,minimum width=40em,minimum height=25em] (rec3) at (2.25,0){};
+\node[draw=ublue,very thick,rounded corners=3pt,drop shadow,fill=white,minimum width=22em,minimum height=25em] (rec2) at (-12.4,0){};
+\node[draw=ublue,very thick,rounded corners=3pt,drop shadow,fill=white,minimum width=24em,minimum height=25em] (rec1) at (-24,0){};

 %left
 \node[text=ublue] (label1) at (-26.4,4){\Huge\bfnew{结构空间}};

--- a/Chapter15/Figures/figure-multi-scale-local-modeling.tex
+++ b/Chapter15/Figures/figure-multi-scale-local-modeling.tex
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{cirnode}=[circle,minimum size=3.7em,draw]
-\tikzstyle{recnode}=[rectangle,rounded corners=2pt,inner sep=0mm,minimum height=1.5em,minimum width=4em,draw]
+\tikzstyle{cirnode}=[circle,minimum size=3em,font=\footnotesize,draw]
+\tikzstyle{recnode}=[rectangle,rounded corners=2pt,inner sep=0mm,minimum height=1.8em,minimum width=6em]

 \node [anchor=west,cirnode] (n1) at (0, 0) {$\mathbi{h}_{i-2}^l$};
-\node [anchor=west,cirnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_{i-1}^l$};
-\node [anchor=west,cirnode] (n3) at ([xshift=1em,yshift=0em]n2.east) {$\mathbi{h}_{i}^l$};
-\node [anchor=west,cirnode] (n4) at ([xshift=1em,yshift=0em]n3.east) {$\mathbi{h}_{i+1}^l$};
-\node [anchor=west,cirnode] (n5) at ([xshift=1em,yshift=0em]n4.east) {$\mathbi{h}_{i+2}^l$};
+\node [anchor=west,cirnode] (n2) at ([xshift=1.2em,yshift=0em]n1.east) {$\mathbi{h}_{i-1}^l$};
+\node [anchor=west,cirnode] (n3) at ([xshift=1.2em,yshift=0em]n2.east) {$\mathbi{h}_{i}^l$};
+\node [anchor=west,cirnode] (n4) at ([xshift=1.2em,yshift=0em]n3.east) {$\mathbi{h}_{i+1}^l$};
+\node [anchor=west,cirnode] (n5) at ([xshift=1.2em,yshift=0em]n4.east) {$\mathbi{h}_{i+2}^l$};

-\node [anchor=center,blue!30,minimum height=4.2em,minimum width=4.5em,very thick,draw] (c1) at ([xshift=0em,yshift=0em]n3.center) {};
-\node [anchor=center,ugreen!30,minimum height=4.9em,minimum width=14.5em,very thick,draw] (c2) at ([xshift=0em,yshift=0em]n3.center) {};
-\node [anchor=center,red!30,minimum height=5.6em,minimum width=24.5em,very thick,draw] (c3) at ([xshift=0em,yshift=0em]n3.center) {};
+\begin{pgfonlayer}{background}
+\node [anchor=center,red!30,minimum height=4.5em,minimum width=21em,very thick,draw] (c3) at ([xshift=0em,yshift=0em]n3.center) {};
+\node [anchor=center,ugreen!30,minimum height=4em,minimum width=12.5em,very thick,draw] (c2) at ([xshift=0em,yshift=0em]n3.center) {};
+\node [anchor=center,orange!30,minimum height=3.5em,minimum width=3.6em,very thick,draw] (c1) at ([xshift=0em,yshift=0em]n3.center) {};
+\end{pgfonlayer}

-\node [anchor=south,recnode] (r1) at ([xshift=0em,yshift=2.5em]n2.north) {$\textrm{head}_1$};
-\node [anchor=south,recnode] (r2) at ([xshift=0em,yshift=2.5em]n3.north) {$\textrm{head}_2$};
-\node [anchor=south,recnode] (r3) at ([xshift=0em,yshift=2.5em]n4.north) {$\textrm{head}_3$};
+\node [anchor=south,recnode,fill=red!20] (r1) at ([xshift=-3.5em,yshift=2.5em]n2.north) {$\textrm{head}_1$};
+\node [anchor=south,recnode,fill=orange!20] (r2) at ([xshift=0em,yshift=2.5em]n3.north) {$\textrm{head}_2$};
+\node [anchor=south,recnode,fill=ugreen!20] (r3) at ([xshift=3.5em,yshift=2.5em]n4.north) {$\textrm{head}_3$};

 \node [anchor=south,cirnode] (n6) at ([xshift=0em,yshift=1em]r2.north) {$\mathbi{h}_{i}^{l+1}$};

-\draw [->,very thick,blue!30] ([xshift=0em,yshift=0em]c1.north) -- ([xshift=0em,yshift=0em]r2.south);
-\draw [->,very thick,ugreen!30] ([xshift=4.73em,yshift=0em]c2.north) -- ([xshift=0em,yshift=0em]r3.south);
-\draw [->,very thick,red!30] ([xshift=-4.73em,yshift=0em]c3.north) -- ([xshift=0em,yshift=0em]r1.south);
+\draw [->,very thick,orange!30] ([xshift=0em,yshift=0em]c1.north) -- ([xshift=0em,yshift=0em]r2.south);
+\draw [->,very thick,ugreen!30] ([xshift=3em,yshift=0em]c2.north)..controls +(north:1.5em) and +(south:1.5em)..([xshift=0em,yshift=0em]r3.south);
+\draw [->,very thick,red!30] ([xshift=-3em,yshift=0em]c3.north)..controls +(north:1.5em) and +(south:1.5em)..([xshift=0em,yshift=0em]r1.south);

 \draw [->] ([xshift=0em,yshift=0em]r1.north) -- ([xshift=0em,yshift=0em]n6.south west);
 \draw [->] ([xshift=0em,yshift=0em]r2.north) -- ([xshift=0em,yshift=0em]n6.south);

--- a/Chapter15/Figures/figure-multi-task-structure.tex
+++ b/Chapter15/Figures/figure-multi-task-structure.tex
@@ -2,7 +2,7 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{enode}=[rectangle,inner sep=0mm,minimum height=5em,minimum width=5em,rounded corners=7pt,fill=ugreen!30]
+\tikzstyle{enode}=[rectangle,inner sep=0mm,minimum height=5em,minimum width=5em,rounded corners=7pt,fill=green!30]
 \tikzstyle{dnode}=[rectangle,inner sep=0mm,minimum height=2em,minimum width=6.5em,rounded corners=5pt,fill=red!30]
 \tikzstyle{wnode}=[inner sep=0mm,minimum height=2em,minimum width=4em]


--- a/Chapter15/Figures/figure-structure-search-based-on-gradient-method.tex
+++ b/Chapter15/Figures/figure-structure-search-based-on-gradient-method.tex
@@ -6,7 +6,7 @@

 \node[node,fill=red!20] (n1) at (0,0){\scriptsize\bfnew{Supernet}: \\ [1ex] model architecture parameters \\[0.4ex] network parameters};
 \node[anchor=west,node,fill=yellow!20] (n2) at ([xshift=4em]n1.east){\scriptsize\bfnew{Optimized supernet}: \\ [1ex]model {\color{red}architecture parameters} (optimized) \\ [0.4ex]network parameters (optimized)};
-\node[anchor=west,node,fill=blue!20] (n3) at ([xshift=6em]n2.east){\scriptsize\bfnew{Discovered model architecture}};
+\node[anchor=west,node,fill=green!20] (n3) at ([xshift=6em]n2.east){\scriptsize\bfnew{Discovered model architecture}};

 \draw[-latex,thick] (n1.0) -- node[above,align=center,font=\scriptsize]{Optimized\\supernet}(n2.180);
 \draw[-latex,thick] (n2.0) -- node[above,align=center,font=\scriptsize]{Discretize architecture\\by architecture parameters}(n3.180);

--- a/Chapter15/Figures/figure-structure-search-based-on-reinforcement-learning.tex
+++ b/Chapter15/Figures/figure-structure-search-based-on-reinforcement-learning.tex
@@ -5,7 +5,7 @@
 \tikzstyle{node}=[minimum height=2em,minimum width=5em,draw,rounded corners=2pt,thick,drop shadow]

 \node[node,fill=red!20] (n1) at (0,0){\small\bfnew{Environment}};
-\node[anchor=south,node,fill=blue!20] (n2) at ([yshift=5em]n1.north){\small\bfnew{Agent}};
+\node[anchor=south,node,fill=green!20] (n2) at ([yshift=5em]n1.north){\small\bfnew{Agent}};
 \node[anchor=north,font=\footnotesize] at ([yshift=-0.2em]n1.south){(the task the architecture is applied to)};
 \node[anchor=south,font=\footnotesize] at ([yshift=0.2em]n2.north){(architecture generator)};


--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-1.tex
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-1.tex
@@ -3,13 +3,13 @@
 \begin{center}
 \begin{tikzpicture}

-\tikzstyle{wrnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=blue!30]
-\tikzstyle{srnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=yellow!30]
+\tikzstyle{wrnode}=[rectangle,inner sep=0mm,minimum height=1.6em,minimum width=3em,rounded corners=5pt,fill=blue!30]
+\tikzstyle{srnode}=[rectangle,inner sep=0mm,minimum height=1.6em,minimum width=3em,rounded corners=5pt,fill=orange!30]
 \tikzstyle{dotnode}=[inner sep=0mm,minimum height=0.5em,minimum width=1.5em]
-\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.8em]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.6em]
 {\small
-\begin{scope}[]
-
+\begin{scope}[scale=1]
+\tikzstyle{every node}=[scale=1]
 \node [anchor=west,wrnode] (wr1) at (0,0) {$\mathbi{h}_{w_1}$};
 \node [anchor=west,wrnode] (wr2) at ([xshift=1em,yshift=0em]wr1.east) {$\mathbi{h}_{w_2}$};
 \node [anchor=west,wrnode] (wr3) at ([xshift=1em,yshift=0em]wr2.east) {$\mathbi{h}_{w_3}$};
@@ -22,27 +22,27 @@
 \node [anchor=west,dotnode] (dot3) at ([xshift=0.8em,yshift=0em]sr3.east) {$\cdots$};
 \node [anchor=west,srnode] (sr4) at ([xshift=0.8em,yshift=0em]dot3.east) {$\mathbi{h}_{l_7}$};

-\node [anchor=north,wnode,font=\footnotesize] (w1) at ([xshift=0em,yshift=-1em]wr1.south) {$w_1$\ :\ I};
-\node [anchor=north,wnode,font=\footnotesize] (w2) at ([xshift=0em,yshift=-1em]wr2.south) {$w_2$\ :\ love};
-\node [anchor=north,wnode,font=\footnotesize] (w3) at ([xshift=0em,yshift=-1em]wr3.south) {$w_3$\ :\ dogs};
+\node [anchor=north,wnode,font=\footnotesize] (w1) at ([xshift=0em,yshift=-0.7em]wr1.south) {$w_1$\ :\ I};
+\node [anchor=north,wnode,font=\footnotesize] (w2) at ([xshift=0em,yshift=-0.7em]wr2.south) {$w_2$\ :\ love};
+\node [anchor=north,wnode,font=\footnotesize] (w3) at ([xshift=0em,yshift=-0.7em]wr3.south) {$w_3$\ :\ dogs};

-\node [anchor=north,wnode,font=\footnotesize] (w4) at ([xshift=0em,yshift=-1em]sr1.south) {$l_1$\ :\ S};
-\node [anchor=north,dotnode] (dot4) at ([xshift=0em,yshift=-2.4em]dot1.south) {$\cdots$};
-\node [anchor=north,wnode,font=\footnotesize] (w5) at ([xshift=0em,yshift=-1em]sr2.south) {$l_3$\ :\ PRN};
-\node [anchor=north,dotnode] (dot5) at ([xshift=0em,yshift=-2.2em]dot2.south) {$\cdots$};
-\node [anchor=north,wnode,font=\footnotesize] (w6) at ([xshift=0em,yshift=-1em]sr3.south) {$l_5$\ :\ VBP};
-\node [anchor=north,dotnode] (dot6) at ([xshift=0em,yshift=-2.3em]dot3.south) {$\cdots$};
-\node [anchor=north,wnode,font=\footnotesize] (w7) at ([xshift=0em,yshift=-1em]sr4.south) {$l_7$\ :\ NNS};
+\node [anchor=north,wnode,font=\footnotesize] (w4) at ([xshift=0em,yshift=-0.7em]sr1.south) {$l_1$\ :\ S};
+\node [anchor=north,dotnode] (dot4) at ([xshift=0em,yshift=-2em]dot1.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w5) at ([xshift=0em,yshift=-0.7em]sr2.south) {$l_3$\ :\ PRN};
+\node [anchor=north,dotnode] (dot5) at ([xshift=0em,yshift=-2em]dot2.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w6) at ([xshift=0em,yshift=-0.7em]sr3.south) {$l_5$\ :\ VBP};
+\node [anchor=north,dotnode] (dot6) at ([xshift=0em,yshift=-2em]dot3.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w7) at ([xshift=0em,yshift=-0.7em]sr4.south) {$l_7$\ :\ NNS};

-\node [anchor=south,circle,draw,minimum size=1.2em] (c1) at ([xshift=2.5em,yshift=2em]wr2.north){};
+\node [anchor=south,circle,draw,minimum size=1.2em] (c1) at ([xshift=2.5em,yshift=1.5em]wr2.north){};
 \node [anchor=west,circle,draw,minimum size=1.2em] (c2) at ([xshift=8em,yshift=0em]c1.east){};
 \node [anchor=west,circle,draw,minimum size=1.2em] (c3) at ([xshift=8em,yshift=0em]c2.east){};

-\node [anchor=south,srnode] (m1) at ([xshift=0em,yshift=2em]c1.north) {$\mathbi{h}_{l_1}$};
+\node [anchor=south,srnode] (m1) at ([xshift=0em,yshift=1em]c1.north) {$\mathbi{h}_{l_1}$};
 \node [anchor=south,wrnode] (m2) at ([xshift=0em,yshift=0em]m1.north) {$\mathbi{h}_{w_1}$};
-\node [anchor=south,srnode] (m3) at ([xshift=0em,yshift=2em]c2.north) {$\mathbi{h}_{l_5}$};
+\node [anchor=south,srnode] (m3) at ([xshift=0em,yshift=1em]c2.north) {$\mathbi{h}_{l_5}$};
 \node [anchor=south,wrnode] (m4) at ([xshift=0em,yshift=0em]m3.north) {$\mathbi{h}_{w_2}$};
-\node [anchor=south,srnode] (m5) at ([xshift=0em,yshift=2em]c3.north) {$\mathbi{h}_{l_7}$};
+\node [anchor=south,srnode] (m5) at ([xshift=0em,yshift=1em]c3.north) {$\mathbi{h}_{l_7}$};
 \node [anchor=south,wrnode] (m6) at ([xshift=0em,yshift=0em]m5.north) {$\mathbi{h}_{w_3}$};


@@ -55,10 +55,10 @@

 \begin{pgfonlayer}{background}
 \node [rectangle,inner sep=0.5em,draw=blue!80,dashed,very thick,rounded corners=10pt] [fit = (wr1) (wr3) (w1) (w3)] (box1) {};
-\node [rectangle,inner sep=0.5em,draw=yellow!80,dashed,very thick,rounded corners=10pt] [fit = (sr1) (sr4) (w4) (w7)] (box2) {};
-\node [rectangle,minimum height=5em,inner sep=0.6em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m1) (m2)] (box3) {};
-\node [rectangle,minimum height=5em,inner sep=0.6em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m3) (m4)] (box4) {};
-\node [rectangle,minimum height=5em,inner sep=0.6em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m5) (m6)] (box5) {};
+\node [rectangle,inner sep=0.5em,draw=orange!80,dashed,very thick,rounded corners=10pt] [fit = (sr1) (sr4) (w4) (w7)] (box2) {};
+\node [rectangle,inner sep=0.5em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m1) (m2)] (box3) {};
+\node [rectangle,inner sep=0.5em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m3) (m4)] (box4) {};
+\node [rectangle,inner sep=0.5em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (m5) (m6)] (box5) {};
 \end{pgfonlayer}

 \node [anchor=south,wnode] (h1) at ([xshift=0em,yshift=0.1em]box3.north) {${\mathbi{h}'}_1$\ :\ };
@@ -73,9 +73,9 @@
 \draw [->,thick] ([xshift=0em,yshift=0em]w6.north) -- ([xshift=0em,yshift=0em]sr3.south);
 \draw [->,thick] ([xshift=0em,yshift=0em]w7.north) -- ([xshift=0em,yshift=0em]sr4.south);

-\draw [->,thick] ([xshift=0em,yshift=0.7em]dot4.north) -- ([xshift=0em,yshift=-0.7em]dot1.south);
-\draw [->,thick] ([xshift=0em,yshift=0.7em]dot5.north) -- ([xshift=0em,yshift=-0.7em]dot2.south);
-\draw [->,thick] ([xshift=0em,yshift=0.7em]dot6.north) -- ([xshift=0em,yshift=-0.7em]dot3.south);
+\draw [->,thick] ([xshift=0em,yshift=0.6em]dot4.north) -- ([xshift=0em,yshift=-0.7em]dot1.south);
+\draw [->,thick] ([xshift=0em,yshift=0.6em]dot5.north) -- ([xshift=0em,yshift=-0.7em]dot2.south);
+\draw [->,thick] ([xshift=0em,yshift=0.6em]dot6.north) -- ([xshift=0em,yshift=-0.7em]dot3.south);

 \draw [<->,thick] ([xshift=0em,yshift=0em]wr1.east) -- ([xshift=0em,yshift=0em]wr2.west);
 \draw [<->,thick] ([xshift=0em,yshift=0em]wr2.east) -- ([xshift=0em,yshift=0em]wr3.west);
@@ -96,13 +96,13 @@
 \draw[->,thick] ([xshift=0em,yshift=-0em]wr3.north)..controls +(north:2em) and +(south:1em)..([xshift=-0em,yshift=-0em]c3.south west) ;
 \draw[->,thick] ([xshift=0em,yshift=-0em]sr4.north)..controls +(north:2em) and +(east:0em)..([xshift=-0em,yshift=-0em]c3.east) ;

-\draw [->,thick] ([xshift=0em,yshift=0em]c1.north) -- ([xshift=0em,yshift=0em]box3.south);
-\draw [->,thick] ([xshift=0em,yshift=0em]c2.north) -- ([xshift=0em,yshift=0em]box4.south);
-\draw [->,thick] ([xshift=0em,yshift=0em]c3.north) -- ([xshift=0em,yshift=0em]box5.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]c1.north) -- ([xshift=0em,yshift=0em]m1.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]c2.north) -- ([xshift=0em,yshift=0em]m3.south);
+\draw [->,thick] ([xshift=0em,yshift=0em]c3.north) -- ([xshift=0em,yshift=0em]m5.south);

-\node [anchor=north] (r1) at ([xshift=0em,yshift=-1em]w2.south) {Word RNN};
-\node [anchor=north] (r2) at ([xshift=3em,yshift=-1em]w5.south) {Syntax RNN};
-\node [anchor=north] (label1) at ([xshift=0em,yshift=-4em]dot4.south) {(a) Parallel structure};
+\node [anchor=north,font=\small] (r1) at ([xshift=0em,yshift=-1em]w2.south) {Word RNN};
+\node [anchor=north,font=\small] (r2) at ([xshift=3em,yshift=-1em]w5.south) {Syntax RNN};
+\node [anchor=north,font=\small] (label1) at ([xshift=0em,yshift=-3em]dot4.south) {(a) Parallel structure};

 \end{scope}
 }

--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-2.tex
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-2.tex
@@ -4,7 +4,7 @@
 \begin{tikzpicture}

 \tikzstyle{wrnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=blue!30]
-\tikzstyle{srnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=yellow!30]
+\tikzstyle{srnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=orange!30]
 \tikzstyle{dotnode}=[inner sep=0mm,minimum height=0.5em,minimum width=1.5em]
 \tikzstyle{wnode}=[inner sep=0mm,minimum height=1.8em]

@@ -48,9 +48,9 @@
 \node [anchor=south,wnode] (w10) at ([xshift=0em,yshift=0.5em]c3.north) {$\mathbi{e}_{w_2}$};

 \begin{pgfonlayer}{background}
-\node [rectangle,minimum height=5em,inner sep=0.6em,fill=ugreen!20,rounded corners=8pt] [fit = (c1) (w8)] (box6) {};
-\node [rectangle,minimum height=5em,inner sep=0.6em,fill=ugreen!20,rounded corners=8pt] [fit = (c2) (w9)] (box7) {};
-\node [rectangle,minimum height=5em,inner sep=0.6em,fill=ugreen!20,rounded corners=8pt] [fit = (c3) (w10)] (box8) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=green!20,rounded corners=8pt] [fit = (c1) (w8)] (box6) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=green!20,rounded corners=8pt] [fit = (c2) (w9)] (box7) {};
+\node [rectangle,minimum height=5em,inner sep=0.6em,fill=green!20,rounded corners=8pt] [fit = (c3) (w10)] (box8) {};
 \end{pgfonlayer}

 \node [anchor=south,wrnode] (wr1) at ([xshift=0em,yshift=1em]box6.north) {$\mathbi{h}_{w_1}$};
@@ -63,7 +63,7 @@

 \begin{pgfonlayer}{background}
 \node [rectangle,minimum width=20em,minimum height=13em,inner sep=0.5em,draw=blue!80,dashed,very thick,rounded corners=10pt] [fit = (h1) (w1) (h3) (c3)] (box1) {};
-\node [rectangle,inner sep=0.5em,draw=yellow!80,dashed,very thick,rounded corners=10pt] [fit = (sr1) (sr4) (w4) (w7)] (box2) {};
+\node [rectangle,inner sep=0.5em,draw=orange!80,dashed,very thick,rounded corners=10pt] [fit = (sr1) (sr4) (w4) (w7)] (box2) {};
 \node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (wr1)] (box3) {};
 \node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (wr2)] (box4) {};
 \node [rectangle,inner sep=0.4em,fill=gray!20,draw=black,dashed,very thick,rounded corners=8pt] [fit = (wr3)] (box5) {};

--- a/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-3.tex
+++ b/Chapter15/Figures/figure-three-fusion-methods-of-tree-structure-information-3.tex
@@ -3,12 +3,12 @@
 \begin{center}
 \begin{tikzpicture}

-\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=1.8em,minimum width=3em,rounded corners=5pt,fill=red!30]
+\tikzstyle{hnode}=[rectangle,inner sep=0mm,minimum height=1.6em,minimum width=3em,rounded corners=5pt,fill=red!30]
 \tikzstyle{dotnode}=[inner sep=0mm,minimum height=0.5em,minimum width=1.5em]
-\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.8em]
+\tikzstyle{wnode}=[inner sep=0mm,minimum height=1.6em]
 {\small
-\begin{scope}[]
-
+\begin{scope}[scale=1]
+\tikzstyle{every node}=[scale=1]
 \node [anchor=west,hnode] (n1) at (0,0) {$\mathbi{h}_{1}$};
 \node [anchor=west,hnode] (n2) at ([xshift=1em,yshift=0em]n1.east) {$\mathbi{h}_{2}$};
 \node [anchor=west,dotnode] (dot1) at ([xshift=1em,yshift=0em]n2.east) {$\cdots$};
@@ -18,14 +18,14 @@
 \node [anchor=west,dotnode] (dot3) at ([xshift=1em,yshift=0em]n4.east) {$\cdots$};
 \node [anchor=west,hnode] (n5) at ([xshift=1em,yshift=0em]dot3.east) {$\mathbi{h}_{10}$};

-\node [anchor=north,wnode,font=\footnotesize] (w1) at ([xshift=0em,yshift=-1em]n1.south) {$l_1$\ :\ S};
-\node [anchor=north,wnode,font=\footnotesize] (w2) at ([xshift=0em,yshift=-1em]n2.south) {$l_3$\ :\ NP};
-\node [anchor=north,dotnode] (dot4) at ([xshift=0em,yshift=-2.4em]dot1.south) {$\cdots$};
-\node [anchor=north,wnode,font=\footnotesize] (w3) at ([xshift=0em,yshift=-1em]n3.south) {$w_1$\ :\ I};
-\node [anchor=north,dotnode] (dot5) at ([xshift=0em,yshift=-2.2em]dot2.south) {$\cdots$};
-\node [anchor=north,wnode,font=\footnotesize] (w4) at ([xshift=0em,yshift=-1em]n4.south) {$w_2$\ :\ love};
-\node [anchor=north,dotnode] (dot6) at ([xshift=0em,yshift=-2.3em]dot3.south) {$\cdots$};
-\node [anchor=north,wnode,font=\footnotesize] (w5) at ([xshift=0em,yshift=-1em]n5.south) {$w_3$\ :\ dogs};
+\node [anchor=north,wnode,font=\footnotesize] (w1) at ([xshift=0em,yshift=-0.7em]n1.south) {$l_1$\ :\ S};
+\node [anchor=north,wnode,font=\footnotesize] (w2) at ([xshift=0em,yshift=-0.7em]n2.south) {$l_3$\ :\ NP};
+\node [anchor=north,dotnode] (dot4) at ([xshift=0em,yshift=-2em]dot1.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w3) at ([xshift=0em,yshift=-0.7em]n3.south) {$w_1$\ :\ I};
+\node [anchor=north,dotnode] (dot5) at ([xshift=0em,yshift=-2em]dot2.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w4) at ([xshift=0em,yshift=-0.7em]n4.south) {$w_2$\ :\ love};
+\node [anchor=north,dotnode] (dot6) at ([xshift=0em,yshift=-2em]dot3.south) {$\cdots$};
+\node [anchor=north,wnode,font=\footnotesize] (w5) at ([xshift=0em,yshift=-0.7em]n5.south) {$w_3$\ :\ dogs};


 \node [anchor=south,wnode] (h1) at ([xshift=0em,yshift=0.3em]n3.north) {${\mathbi{h}'}_1$\ :\ };
@@ -41,10 +41,10 @@
 \end{pgfonlayer}


-\node [anchor=east] (r1) at ([xshift=-2em,yshift=0em]box1.west) {Word RNN};
-
+\node [anchor=east,font=\small] (r1) at ([xshift=-2em,yshift=0em]box1.west) {Hybrid RNN};

-\node [anchor=south west,wnode] (l1) at ([xshift=1em,yshift=6em]r1.north west) {Pre-order traversal of the parse tree yields the sequence:};
+{\small
+\node [anchor=south west,wnode] (l1) at ([xshift=1em,yshift=5em]r1.north west) {Pre-order traversal of the parse tree yields the sequence:};
 \node [anchor=north west,wnode,align=center] (l2) at ([xshift=0.5em,yshift=-0.6em]l1.north east) {S\\[0.5em]$l_1$};
 \node [anchor=north west,wnode,align=center] (l3) at ([xshift=0.5em,yshift=0em]l2.north east) {NP\\[0.5em]$l_2$};
 \node [anchor=north west,wnode,align=center] (l4) at ([xshift=0.5em,yshift=0em]l3.north east) {PRN\\[0.5em]$l_3$};
@@ -55,7 +55,7 @@
 \node [anchor=north west,wnode,align=center] (l9) at ([xshift=0.5em,yshift=0em]l8.north east) {NP\\[0.5em]$l_6$};
 \node [anchor=north west,wnode,align=center] (l10) at ([xshift=0.5em,yshift=0em]l9.north east) {NNS\\[0.5em]$l_7$};
 \node [anchor=north west,wnode,align=center] (l11) at ([xshift=0.5em,yshift=0em]l10.north east) {dogs\\[0.5em]$w_3$};
-
+}


 \draw [->,thick] ([xshift=0em,yshift=0em]w1.north) -- ([xshift=0em,yshift=0em]n1.south);
@@ -65,9 +65,9 @@
 \draw [->,thick] ([xshift=0em,yshift=0em]w5.north) -- ([xshift=0em,yshift=0em]n5.south);


-\draw [->,thick] ([xshift=0em,yshift=0.7em]dot4.north) -- ([xshift=0em,yshift=-0.7em]dot1.south);
-\draw [->,thick] ([xshift=0em,yshift=0.7em]dot5.north) -- ([xshift=0em,yshift=-0.7em]dot2.south);
-\draw [->,thick] ([xshift=0em,yshift=0.7em]dot6.north) -- ([xshift=0em,yshift=-0.7em]dot3.south);
+\draw [->,thick] ([xshift=0em,yshift=0.6em]dot4.north) -- ([xshift=0em,yshift=-0.7em]dot1.south);
+\draw [->,thick] ([xshift=0em,yshift=0.6em]dot5.north) -- ([xshift=0em,yshift=-0.7em]dot2.south);
+\draw [->,thick] ([xshift=0em,yshift=0.6em]dot6.north) -- ([xshift=0em,yshift=-0.7em]dot3.south);


 \draw [<->,thick] ([xshift=0em,yshift=0em]n1.east) -- ([xshift=0em,yshift=0em]n2.west);
@@ -79,7 +79,7 @@
 \draw [<->,thick] ([xshift=0em,yshift=0em]dot3.east) -- ([xshift=0em,yshift=0em]n5.west);


-\node [anchor=north] (label2) at ([xshift=-2em,yshift=-2em]w3.south) {(c) Hybrid structure};
+\node [anchor=north,font=\small] (label2) at ([xshift=-2em,yshift=-1em]w3.south) {(c) Hybrid structure};

 \end{scope}
 }

--- a/Chapter15/chapter15.tex
+++ b/Chapter15/chapter15.tex
--- a/Chapter16/chapter16.tex
+++ b/Chapter16/chapter16.tex
@@ -88,11 +88,11 @@
 %----------------------------------------------
 \begin{itemize}
    \vspace{0.5em}
-    \item Word dropout: each word in the sentence is dropped with probability $\funp{P}_{\rm{Drop}}$.
+    \item {\small\bfnew{Word dropout}}: each word in the sentence is dropped with probability $\funp{P}_{\rm{Drop}}$.
    \vspace{0.5em}
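-    \item Word masking: each word in the sentence is replaced by an extra <Mask> token with probability $\funp{P}_{\rm{Mask}}$. <Mask> acts like a placeholder: some words in the sentence are hidden, so the exact meaning of the word at that position cannot be known.
+    \item {\small\bfnew{Word masking}}: each word in the sentence is replaced by an extra <Mask> token with probability $\funp{P}_{\rm{Mask}}$. <Mask> acts like a placeholder: some words in the sentence are hidden, so the exact meaning of the word at that position cannot be known.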
    \vspace{0.5em}
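-    \item Word shuffling: the positions of some words that are close to each other in the sentence are swapped at random.
+    \item {\small\bfnew{Word shuffling}}: the positions of some words that are close to each other in the sentence are swapped at random.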
    \vspace{0.5em}
 \end{itemize}
 %----------------------------------------------
@@ -112,11 +112,11 @@
 %----------------------------------------------
 \begin{itemize}
    \vspace{0.5em}
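-    \item Adding noise to monolingual data. An end-to-end model predicts the reordering of source-language sentences; it shares parameters with the encoder of the neural machine translation model, which strengthens the encoder's feature extraction\upcite{DBLP:conf/emnlp/ZhangZ16};
+    \item {\small\bfnew{Adding noise to monolingual data}}. An end-to-end model predicts the reordering of source-language sentences; it shares parameters with the encoder of the neural machine translation model, which strengthens the encoder's feature extraction\upcite{DBLP:conf/emnlp/ZhangZ16};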
    \vspace{0.5em}
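-    \item Training a denoising autoencoder. The noised sentence is the input and the original sentence is the output, and the pair is used to train a denoising autoencoder. This idea is widely used in unsupervised machine translation; see Section \ref{unsupervised-NMT} for details;
+    \item {\small\bfnew{Training a denoising autoencoder}}. The noised sentence is the input and the original sentence is the output, and the pair is used to train a denoising autoencoder. This idea is widely used in unsupervised machine translation; see Section \ref{unsupervised-NMT} for details;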
    \vspace{0.5em}
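-    \item Adding noise to pseudo data. For example, the methods mentioned above that add noise to pseudo data usually also use these three noising methods to increase the diversity of the pseudo data;
+    \item {\small\bfnew{Adding noise to pseudo data}}. For example, the methods mentioned above that add noise to pseudo data usually also use these three noising methods to increase the diversity of the pseudo data;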
    \vspace{0.5em}
 \end{itemize}
 %----------------------------------------------
@@ -512,9 +512,9 @@

 \begin{itemize}
 \vspace{0.5em}
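-\item Unsupervised distribution matching. This step uses unsupervised methods to obtain a noisy initial dictionary $D$.
+\item {\small\bfnew{Unsupervised distribution matching}}. This step uses unsupervised methods to obtain a noisy initial dictionary $D$.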
 \vspace{0.5em}
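-\item Supervised fine-tuning. Using the two monolingual word embeddings and the seed dictionary learned in the first step, an alignment algorithm is run to fine-tune iteratively, for example {\small\bfnew{Procrustes Analysis}}\index{普氏分析}\index{Procrustes Analysis}\upcite{1966ASchnemann}.
+\item {\small\bfnew{Supervised fine-tuning}}. Using the two monolingual word embeddings and the seed dictionary learned in the first step, an alignment algorithm is run to fine-tune iteratively, for example {\small\bfnew{Procrustes Analysis}}\index{普氏分析}\index{Procrustes Analysis}\upcite{1966ASchnemann}.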
 \vspace{0.5em}
 \end{itemize}

@@ -542,9 +542,9 @@

 \begin{itemize}
 \vspace{0.5em}
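-\item Methods based on generative adversarial networks\upcite{DBLP:conf/iclr/LampleCRDJ18,DBLP:conf/acl/ZhangLLS17,DBLP:conf/emnlp/XuYOW18,DBLP:conf/naacl/MohiuddinJ19}. Here the generator produces the mapping $\mathbi{W}$ and the discriminator distinguishes randomly sampled elements of $\mathbi{W} \mathbi{X}$ from those of $\mathbi{Y}$; once the two are jointly optimized to convergence, the mapping $\mathbi{W}$ is obtained.
+\item {\small\bfnew{Methods based on generative adversarial networks}}\upcite{DBLP:conf/iclr/LampleCRDJ18,DBLP:conf/acl/ZhangLLS17,DBLP:conf/emnlp/XuYOW18,DBLP:conf/naacl/MohiuddinJ19}. Here the generator produces the mapping $\mathbi{W}$ and the discriminator distinguishes randomly sampled elements of $\mathbi{W} \mathbi{X}$ from those of $\mathbi{Y}$; once the two are jointly optimized to convergence, the mapping $\mathbi{W}$ is obtained.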
 \vspace{0.5em}
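-\item Methods based on Gromov-Wasserstein\upcite{DBLP:conf/emnlp/Alvarez-MelisJ18,DBLP:conf/lrec/GarneauGBDL20,DBLP:journals/corr/abs-1811-01124,DBLP:conf/emnlp/XuYOW18}. The Wasserstein distance is a function that defines the distance between two probability distributions on a metric space. In this task it measures the similarity between word pairs in different languages; using the fact that the spaces are approximately isomorphic, one can define objective functions, and optimizing such an objective also yields the mapping $\mathbi{W}$.
+\item {\small\bfnew{Methods based on Gromov-Wasserstein}}\upcite{DBLP:conf/emnlp/Alvarez-MelisJ18,DBLP:conf/lrec/GarneauGBDL20,DBLP:journals/corr/abs-1811-01124,DBLP:conf/emnlp/XuYOW18}. The Wasserstein distance is a function that defines the distance between two probability distributions on a metric space. In this task it measures the similarity between word pairs in different languages; using the fact that the spaces are approximately isomorphic, one can define objective functions, and optimizing such an objective also yields the mapping $\mathbi{W}$.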
 \vspace{0.5em}
 \end{itemize}

@@ -675,10 +675,10 @@
 \parinterval Unsupervised neural machine translation relies on two further key techniques:
 \begin{itemize}
 \vspace{0.5em}
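-\item Vocabulary sharing: words that are identical in the source and target languages, such as Arabic numerals or some entity names, use a single word embedding instead of one embedding per language. This tells the model that the word means the same thing in both languages, implicitly introducing a supervision signal for word translation. In unsupervised neural machine translation, vocabulary sharing is more effective when combined with subword segmentation, because subwords have wide coverage; for example, many different words can contain the same subword.
+\item {\small\bfnew{Vocabulary sharing}}: words that are identical in the source and target languages, such as Arabic numerals or some entity names, use a single word embedding instead of one embedding per language. This tells the model that the word means the same thing in both languages, implicitly introducing a supervision signal for word translation. In unsupervised neural machine translation, vocabulary sharing is more effective when combined with subword segmentation, because subwords have wide coverage; for example, many different words can contain the same subword.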

 \vspace{0.5em}
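-\item Model sharing: as in multilingual translation systems, a single translation model performs both forward translation (source language $\to$ target language) and backward translation (target language $\to$ source language). This reduces the number of model parameters. Moreover, the two translation directions regularize each other, which lowers the risk of overfitting.
+\item {\small\bfnew{Model sharing}}: as in multilingual translation systems, a single translation model performs both forward translation (source language $\to$ target language) and backward translation (target language $\to$ source language). This reduces the number of model parameters. Moreover, the two translation directions regularize each other, which lowers the risk of overfitting.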
 \vspace{0.5em}
 \end{itemize}

@@ -752,9 +752,9 @@

 \begin{itemize}
 \vspace{0.5em}
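-\item Data-based methods. Bilingual data from the source domain or monolingual data from the target domain is used for data selection or data augmentation, increasing the amount of data available for model training.
+\item {\small\bfnew{Data-based methods}}. Bilingual data from the source domain or monolingual data from the target domain is used for data selection or data augmentation, increasing the amount of data available for model training.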
 \vspace{0.5em}
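-\item Model-based methods. Model architectures, training strategies, and inference methods are designed specifically for domain adaptation.
+\item {\small\bfnew{Model-based methods}}. Model architectures, training strategies, and inference methods are designed specifically for domain adaptation.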
 \vspace{0.5em}
 \end{itemize}


--- a/Chapter17/Figures/figure-cache.tex
+++ b/Chapter17/Figures/figure-cache.tex
@@ -17,10 +17,10 @@
 \node[anchor=south,font=\footnotesize,inner sep=0pt] at ([yshift=0.2em]value.north){value};
 \node[anchor=south,font=\footnotesize,inner sep=0pt] (cache)at ([yshift=2em,xshift=1.5em]key.north){\small\bfnew{Cache}};

-\node[draw,anchor=east,minimum size=1.8em,fill=orange!15] (dt) at ([yshift=2.1em,xshift=-4em]key.west){${\mathbi{d}}_{t}$};
+\node[draw,anchor=east,thick,minimum size=1.8em,fill=orange!15] (dt) at ([yshift=2.1em,xshift=-4em]key.west){${\mathbi{d}}_{t}$};
 \node[anchor=north,font=\footnotesize] (readlab) at ([xshift=2.8em,yshift=0.3em]dt.north){\red{Read}};
-\node[draw,anchor=east,minimum size=1.8em,fill=ugreen!15] (st) at ([xshift=-3.7em]dt.west){${\mathbi{s}}_{t}$};
-\node[draw,anchor=east,minimum size=1.8em,fill=red!15] (st2) at ([xshift=-0.85em,yshift=3.5em]dt.west){$ \widetilde{\mathbi{s}}_{t}$};
+\node[draw,anchor=east,thick,minimum size=1.8em,fill=ugreen!15] (st) at ([xshift=-3.7em]dt.west){${\mathbi{s}}_{t}$};
+\node[draw,anchor=east,thick,minimum size=1.8em,fill=red!15] (st2) at ([xshift=-0.85em,yshift=3.5em]dt.west){$ \widetilde{\mathbi{s}}_{t}$};

 %\node[draw,anchor=north,circle,inner sep=0pt, minimum size=1.2em,fill=yellow] (add) at ([yshift=-1em]st2.south){+};
 \node[draw,thick,inner sep=0pt, minimum size=1.1em, circle] (add) at ([yshift=-1.5em]st2.south){};
@@ -29,7 +29,7 @@

 \node[anchor=north,inner sep=0pt,font=\footnotesize,text=red] at ([xshift=-0em,yshift=-0.5em]add.south){Fuse};

-\node[draw,anchor=east,minimum size=1.8em,fill=yellow!15] (ct) at ([xshift=-2em,yshift=-3.5em]st.west){$ {\mathbi{C}}_{t}$};
+\node[draw,anchor=east,thick,minimum size=1.8em,fill=yellow!15] (ct) at ([xshift=-2em,yshift=-3.5em]st.west){$ {\mathbi{C}}_{t}$};
 \node[anchor=north,font=\footnotesize] (matchlab) at ([xshift=6.7em,yshift=-0.1em]ct.north){\red{Match}};

 \node[anchor=east] (y) at ([xshift=-6em,yshift=1em]st.west){$\mathbi{y}_{t-1}$};
@@ -53,12 +53,12 @@
 %node[above,font=\footnotesize,text=red,rotate=25]{reading}
 \draw[-latex,dashed,very thick,out=-5,in=-170] (ct.0) to ([yshift=-2.5em]box.180);
 %node[above,font=\footnotesize,text=red,pos=0.7,rotate=8]{matching}
-\draw[-,very thick,out=0,in=-135](st.0) to (add.-135);
-\draw[-,very thick,out=180,in=-45](dt.180) to (add.-45);
-\draw[-latex,very thick] (add.90) -- (st2.-90);
-\draw[-latex,very thick,out=100,in=-100] (ct.90) to (output.-90);
-\draw[-latex,very thick,out=180,in=-100] (st2.180) to (output.-90);
-\draw[-latex,very thick,out=80,in=-100] (y.90) to (output.-90);
-\draw[-latex,very thick] (output.90) -- ([yshift=1em]output.90);
-\draw[-latex,very thick] ([yshift=-1.2em]yt.-90) -- (yt.-90);
+\draw[-,thick,out=0,in=-135](st.0) to (add.-135);
+\draw[-,thick,out=180,in=-45](dt.180) to (add.-45);
+\draw[-latex,thick] (add.90) -- (st2.-90);
+\draw[-latex,thick,out=100,in=-100] (ct.90) to (output.-90);
+\draw[-latex,thick,out=180,in=-100] (st2.180) to (output.-90);
+\draw[-latex,thick,out=80,in=-100] (y.90) to (output.-90);
+\draw[-latex,thick] (output.90) -- ([yshift=1em]output.90);
+\draw[-latex,thick] ([yshift=-1.2em]yt.-90) -- (yt.-90);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -160,11 +160,11 @@
 %----------------------------------------------------------------------------------------------------
 \begin{itemize}
    \vspace{0.5em}
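-    \item Error propagation. A serious problem with the cascaded model is that if the text produced by the speech recognition model contains errors, these errors are likely to be amplified during translation, so the final translation deviates considerably. For example, if recognition omits the sentence-final question particle “吗”, the translation model will render a question as a statement.
+    \item {\small\bfnew{Error propagation}}. A serious problem with the cascaded model is that if the text produced by the speech recognition model contains errors, these errors are likely to be amplified during translation, so the final translation deviates considerably. For example, if recognition omits the sentence-final question particle “吗”, the translation model will render a question as a statement.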
    \vspace{0.5em}
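-    \item Translation efficiency. Because the speech recognition model and the text annotation model can only run serially, translation is relatively slow, whereas many real-world scenarios require low-latency translation.
+    \item {\small\bfnew{Translation efficiency}}. Because the speech recognition model and the text annotation model can only run serially, translation is relatively slow, whereas many real-world scenarios require low-latency translation.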
    \vspace{0.5em}
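-    \item Loss of paralinguistic information in speech. When speech is recognized as text, information such as tone of voice, emotion, and pitch is lost, yet the same sentence may mean different things when spoken in different tones. In practice, moreover, speech recognition output usually contains no punctuation, so an extra post-processing model is needed to restore it, which adds further computational cost.
+    \item {\small\bfnew{Loss of paralinguistic information in speech}}. When speech is recognized as text, information such as tone of voice, emotion, and pitch is lost, yet the same sentence may mean different things when spoken in different tones. In practice, moreover, speech recognition output usually contains no punctuation, so an extra post-processing model is needed to restore it, which adds further computational cost.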
    \vspace{0.5em}
 \end{itemize}
 %----------------------------------------------------------------------------------------------------
@@ -199,9 +199,9 @@
 %----------------------------------------------------------------------------------------------------
 \begin{itemize}
    \vspace{0.5em}
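-    \item Scarce training data. Although training data for speech recognition and for text translation is plentiful, parallel data that maps source-language speech directly to target-language text is very limited, so end-to-end speech translation is inherently a low-resource translation task.
+    \item {\small\bfnew{Scarce training data}}. Although training data for speech recognition and for text translation is plentiful, parallel data that maps source-language speech directly to target-language text is very limited, so end-to-end speech translation is inherently a low-resource translation task.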
    \vspace{0.5em}
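-    \item Higher modeling complexity. In speech recognition, the model learns to generate the text sequence corresponding to the speech; the alignment between input and output is fairly simple and involves no reordering. In text translation, the model learns to generate the target-language sequence corresponding to a source-language sequence; it only needs to learn the mapping between languages and involves no change of modality. A speech translation model must learn to generate target-language text from speech, which is a more complex task.
+    \item {\small\bfnew{Higher modeling complexity}}. In speech recognition, the model learns to generate the text sequence corresponding to the speech; the alignment between input and output is fairly simple and involves no reordering. In text translation, the model learns to generate the target-language sequence corresponding to a source-language sequence; it only needs to learn the mapping between languages and involves no change of modality. A speech translation model must learn to generate target-language text from speech, which is a more complex task.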
    \vspace{0.5em}
 \end{itemize}
 %----------------------------------------------------------------------------------------------------
@@ -231,9 +231,9 @@
 %----------------------------------------------------------------------------------------------------
 \begin{itemize}
    \vspace{0.5em}
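-    \item The alignment between input and output is monotonic. That is, a later input only predicts output that is the same as, or comes after, what earlier inputs predicted. For the example in Figure \ref{fig:17-8}, if input position t has already predicted the character l, positions after t will not predict the earlier characters h and e.
+    \item {\small\bfnew{The alignment between input and output is monotonic}}. That is, a later input only predicts output that is the same as, or comes after, what earlier inputs predicted. For the example in Figure \ref{fig:17-8}, if input position t has already predicted the character l, positions after t will not predict the earlier characters h and e.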
    \vspace{0.5em}
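-    \item The relation between input and output is many-to-one. That is, several inputs map to the same output. This is natural for speech sequences: each input position covers only a very short span of speech features, so it takes several inputs to correspond to one output character.
+    \item {\small\bfnew{The relation between input and output is many-to-one}}. That is, several inputs map to the same output. This is natural for speech sequences: each input position covers only a very short span of speech features, so it takes several inputs to correspond to one output character.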
    \vspace{0.5em}
 \end{itemize}
 %----------------------------------------------------------------------------------------------------
@@ -604,7 +604,7 @@
 \noindent 之后，分别计算词级和句子级注意力模型。需要注意的是句子级注意力添加了一个前馈全连接网络子层FFN。其具体计算方式如下：

 \begin{eqnarray}
-\mathbi{s}^j&=&\textrm{WordAttention}(\mathbi{q}_{w},\mathbi{h}^{j},\mathbi{h}^{j})
+\mathbi{s}^k&=&\textrm{WordAttention}(\mathbi{q}_{w},\mathbi{h}^{k},\mathbi{h}^{k})
 \label{eq:17-3-7}\\
 \mathbi{d}_t&=&\textrm{FFN}(\textrm{SentAttention}(\mathbi{q}_{s},\mathbi{s},\mathbi{s}))
 \label{eq:17-3-9}

--- a/Chapter3/chapter3.tex
+++ b/Chapter3/chapter3.tex
@@ -173,9 +173,9 @@ Interests $\to$ \; Interest/s & selected $\to$ \; se/lect/ed & processed $\to$ \

 \begin{itemize}
 \vspace{0.5em}
-\item 训练。利用标注数据，对统计模型的参数进行学习。
+\item {\small\sffamily\bfseries{训练}}。利用标注数据，对统计模型的参数进行学习。
 \vspace{0.5em}
-\item 预测。利用学习到的模型和参数，对新的句子进行切分。这个过程也可以被看作是利用学习到的模型在新的数据上进行推断。
+\item {\small\sffamily\bfseries{预测}}。利用学习到的模型和参数，对新的句子进行切分。这个过程也可以被看作是利用学习到的模型在新的数据上进行推断。
 \vspace{0.5em}
 \end{itemize}

@@ -244,9 +244,9 @@ $计算这种切分的概率值。

 \begin{itemize}
 \vspace{0.5em}
-\item BIO（Beginning-inside-outside）格式。以命名实体识别为例，B代表一个命名实体的开始，I表示一个命名实体的其它部分，O表示一个非命名实体单元。
+\item {\small\sffamily\bfseries{BIO格式}}（Beginning-inside-outside）。以命名实体识别为例，B代表一个命名实体的开始，I表示一个命名实体的其它部分，O表示一个非命名实体单元。
 \vspace{0.5em}
-\item BIOES格式。与BIO格式相比，多出了标签E（End）和S（Single）。仍然以命名实体识别为例，E和S分别用于标注一个命名实体的结束位置和仅含一个单词的命名实体。
+\item {\small\sffamily\bfseries{BIOES格式}}。与BIO格式相比，多出了标签E（End）和S（Single）。仍然以命名实体识别为例，E和S分别用于标注一个命名实体的结束位置和仅含一个单词的命名实体。
 \vspace{0.5em}
 \end{itemize}
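The difference between the two tagging formats is easiest to see on a concrete sentence. The following is a minimal sketch that converts entity spans into BIO or BIOES tags. The function name `spans_to_tags`, the span representation (start inclusive, end exclusive) and the example sentence are hypothetical choices for illustration.

```python
def spans_to_tags(n_tokens, spans, scheme="BIO"):
    """Turn entity spans into a tag sequence.

    spans: list of (start, end) token indices, end exclusive.
    scheme: "BIO" or "BIOES".
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if scheme == "BIOES" and end - start == 1:
            tags[start] = "S"          # single-token entity
            continue
        tags[start] = "B"              # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = "I"              # inside the entity
        if scheme == "BIOES":
            tags[end - 1] = "E"        # last token of the entity
    return tags


tokens = ["Barack", "Obama", "visited", "Paris", "."]
entities = [(0, 2), (3, 4)]
print(spans_to_tags(len(tokens), entities, "BIO"))
print(spans_to_tags(len(tokens), entities, "BIOES"))
```

Both calls mark the same entities. BIOES additionally makes the end of an entity and single-token entities explicit, at the price of a larger label set.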

@@ -284,9 +284,9 @@ $计算这种切分的概率值。

 \begin{itemize}
 \vspace{0.5em}
-\item 样本在这些特征上的差异度，即特征对于样本的区分能力。比如，可以考虑优先选择样本特征值方差较大即区分能力强的特征\footnote{方差如果很小，意味着样本在这个特征上基本上没有差异，那么这个特征对于样本的区分并没有什么用。}；
+\item {\small\sffamily\bfseries{样本在这些特征上的差异度}}，即特征对于样本的区分能力。比如，可以考虑优先选择样本特征值方差较大即区分能力强的特征\footnote{方差如果很小，意味着样本在这个特征上基本上没有差异，那么这个特征对于样本的区分并没有什么用。}；
 \vspace{0.5em}
-\item 特征与任务目标的相关性。优先选择相关性高的特征。
+\item {\small\sffamily\bfseries{特征与任务目标的相关性}}。优先选择相关性高的特征。
 \vspace{0.5em}
 \end{itemize}
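The variance criterion in the first item can be sketched in a few lines. The code below ranks feature columns by their population variance. The function name `rank_features_by_variance` and the toy data are illustrative assumptions, and a real system would normally combine this with the relevance criterion in the second item.

```python
from statistics import pvariance


def rank_features_by_variance(samples):
    """Return feature indices sorted by decreasing variance, plus the variances.

    samples: list of equally long feature vectors (one per sample).
    """
    columns = list(zip(*samples))                      # one tuple per feature
    variances = [pvariance(column) for column in columns]
    order = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return order, variances


samples = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 1.0],
    [1.0, 9.0, 0.0],
    [1.0, 11.0, 1.0],
]
order, variances = rank_features_by_variance(samples)
print(order, variances)
```

The first feature is constant across all samples, so its variance is zero and it ends up last in the ranking: it cannot help to tell samples apart.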

@@ -378,11 +378,11 @@ $计算这种切分的概率值。

 \begin{itemize}
 \vspace{0.5em}
-\item 隐含状态序列的概率计算：即给定模型（转移概率和发射概率），根据可见状态序列（抛硬币的结果）计算在该模型下得到这个结果的概率，这个问题的求解需要用到前后向算法\upcite{baum1970maximization}。
+\item {\small\sffamily\bfseries{隐含状态序列的概率计算}}，即给定模型（转移概率和发射概率），根据可见状态序列（抛硬币的结果）计算在该模型下得到这个结果的概率，这个问题的求解需要用到前后向算法\upcite{baum1970maximization}。
 \vspace{0.5em}
-\item 参数学习：即给定硬币种类（隐含状态数量），根据多个可见状态序列（抛硬币的结果）估计模型的参数（转移概率），这个问题的求解需要用到EM算法\upcite{1977Maximum}。
+\item {\small\sffamily\bfseries{参数学习}}，即给定硬币种类（隐含状态数量），根据多个可见状态序列（抛硬币的结果）估计模型的参数（转移概率），这个问题的求解需要用到EM算法\upcite{1977Maximum}。
 \vspace{0.5em}
-\item 解码：即给定模型（转移概率和发射概率）和可见状态序列（抛硬币的结果），计算在可见状态序列的情况下，最可能出现的对应的状态序列，这个问题的求解需要用到基于动态规划的方法，通常也被称作{\small\sffamily\bfseries{维特比算法}}\index{维特比算法}（Viterbi Algorithm）\index{Viterbi Algorithm}\upcite{1967Error}。
+\item {\small\sffamily\bfseries{解码}}，即给定模型（转移概率和发射概率）和可见状态序列（抛硬币的结果），计算在可见状态序列的情况下，最可能出现的对应的状态序列，这个问题的求解需要用到基于动态规划的方法，通常也被称作{\small\sffamily\bfseries{维特比算法}}\index{维特比算法}（Viterbi Algorithm）\index{Viterbi Algorithm}\upcite{1967Error}。
 \vspace{0.5em}
 \end{itemize}
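As an illustration of the decoding problem in the last item, the following is a minimal sketch of the Viterbi algorithm for the coin-tossing setting. All probabilities, the state names `fair` and `biased`, and the function signature are made up for the example and are not taken from the book.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # choose the best predecessor of s (dynamic programming step)
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][obs[t]]
            back[t][s] = prev
    # follow the back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


states = ("fair", "biased")
start_p = {"fair": 0.5, "biased": 0.5}
trans_p = {"fair": {"fair": 0.9, "biased": 0.1},
           "biased": {"fair": 0.1, "biased": 0.9}}
emit_p = {"fair": {"H": 0.5, "T": 0.5},
          "biased": {"H": 0.9, "T": 0.1}}
print(viterbi("HHHH", states, start_p, trans_p, emit_p))
```

The table `best` has one cell per (position, state) pair, so the cost grows linearly with the sequence length instead of exponentially as in brute-force enumeration of state sequences.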

@@ -531,7 +531,7 @@ Z(\seq{x})&=&\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1}

 \parinterval 无论在日常生活中还是在研究工作中，都会遇到各种各样的分类问题，例如挑选西瓜时需要区分“好瓜”和“坏瓜”、编辑看到一篇新闻稿件时要对稿件进行分门别类。事实上，在机器学习中，对“分类任务”的定义会更宽泛而并不拘泥于“类别”的概念，在对样本进行预测时，只要预测标签集合是有限的且预测标签是离散的，就可认定其为分类任务。

-\parinterval 具体来说，分类任务目标是训练一个可以根据输入数据预测离散标签的{\small\sffamily\bfseries{分类器}}\index{分类器}（Classifier\index{Classifier}），也可称为分类模型。在有监督的分类任务中\footnote{与之相对应的，还有无监督、半监督分类任务，不过这些内容不是本书讨论的重点。读者可以参看参考文献\upcite{周志华2016机器学习,李航2019统计学习方法}对相关概念进行了解。}，训练数据集合通常由形似$(\boldsymbol{x}_i,y_i)$的带标注数据构成，$\boldsymbol{x}_i=(x_{i1},x_{i2},\ldots,x_{ik})$作为分类器的输入数据（通常被称作一个训练样本），其中$x_{ij}$表示样本$\boldsymbol{x}_i$的第$j$个特征；$y_i$作为输入数据对应的{\small\sffamily\bfseries{标签}}\index{标签}（Label）\index{Label}，反映了输入数据对应的“类别”。若标签集合大小为$n$，则分类任务的本质是通过对训练数据集合的学习，建立一个从$k$ 维样本空间到$n$维标签空间的映射关系。更确切地说，分类任务的最终目标是学习一个条件概率分布$\funp{P}(y|\boldsymbol{x})$，这样对于输入$\boldsymbol{x}$可以找到概率最大的$y$作为分类结果输出。
+\parinterval 具体来说，分类任务目标是训练一个可以根据输入数据预测离散标签的{\small\sffamily\bfseries{分类器}}\index{分类器}（Classifier\index{Classifier}），也可称为分类模型。在有监督的分类任务中\footnote{与之相对应的，还有无监督、半监督分类任务，不过这些内容不是本书讨论的重点。读者可以参看参考文献\upcite{周志华2016机器学习,李航2019统计学习方法}对相关概念进行了解。}，训练数据集合通常由形似$({\mathbi{x}}^{[i]},y^{[i]})$的带标注数据构成，${\mathbi{x}}^{[i]}=(x^{[i]}_1,\ldots,x^{[i]}_k)$作为分类器的输入数据（通常被称作一个训练样本），其中$x^{[i]}_j$表示样本$\mathbi{x}^{[i]}$的第$j$个特征；$y^{[i]}$作为输入数据对应的{\small\sffamily\bfseries{标签}}\index{标签}（Label）\index{Label}，反映了输入数据对应的“类别”。若标签集合大小为$n$，则分类任务的本质是通过对训练数据集合的学习，建立一个从$k$ 维样本空间到$n$维标签空间的映射关系。更确切地说，分类任务的最终目标是学习一个条件概率分布$\funp{P}(y|{\mathbi{x}})$，这样对于输入${\mathbi{x}}$可以找到概率最大的$y$作为分类结果输出。

 \parinterval 与概率图模型一样，分类模型中也依赖特征定义。其定义形式与\ref{sec3:feature}节的描述一致，这里不再赘述。分类任务一般根据类别数量分为二分类任务和多分类任务，二分类任务是最经典的分类任务，只需要对输出进行非零即一的预测。多分类任务则可以有多种处理手段，比如，可以将其“拆解”为多个二分类任务求解，或者直接让模型输出多个类别中的一个。在命名实体识别中，往往会使用多类别分类模型。比如，在BIO标注下，有三个类别（B、I和O）。一般来说，类别数量越多分类的难度也越大。比如，BIOES标注包含5个类别，因此使用同样的分类器，它要比BIO标注下的分类问题难度大。此外，更多的类别有助于准确地刻画目标问题。因此在实践中需要在类别数量和分类难度之间找到一种平衡。

@@ -547,15 +547,15 @@ Z(\seq{x})&=&\sum_{\seq{y}}\exp(\sum_{i=1}^m\sum_{j=1}^k\lambda_{j}F_{j}(y_{i-1}

 \begin{itemize}
 \vspace{0.5em}
-\item $K$-近邻分类算法。$K$-近邻分类算法通过计算不同特征值之间的距离进行分类，这种方法适用于可以提取到数值型特征\footnote{即可以用数值大小对某方面特征进行衡量。}的分类问题。该方法的基本思想为：将提取到的特征分别作为坐标轴，建立一个$k$维坐标系（对应特征数量为$k$的情况），此时每个样本都将成为该$k$维空间的一个点，将未知样本与已知类别样本的空间距离作为分类依据进行分类，比如，考虑与输入样本最近的$K$个样本的类别进行分类。
+\item {\small\sffamily\bfseries{$K$-近邻分类算法}}。$K$-近邻分类算法通过计算不同特征值之间的距离进行分类，这种方法适用于可以提取到数值型特征\footnote{即可以用数值大小对某方面特征进行衡量。}的分类问题。该方法的基本思想为：将提取到的特征分别作为坐标轴，建立一个$k$维坐标系（对应特征数量为$k$的情况），此时每个样本都将成为该$k$维空间的一个点，将未知样本与已知类别样本的空间距离作为分类依据进行分类，比如，考虑与输入样本最近的$K$个样本的类别进行分类。
 \vspace{0.5em}
-\item 支持向量机。支持向量机是一种二分类模型，其思想是通过线性超平面将不同输入划分为正例和负例，并使线性超平面与不同输入的距离都达到最大。与$K$-近邻分类算法类似，支持向量机也适用于可以提取到数值型特征的分类问题。
+\item {\small\sffamily\bfseries{支持向量机}}。支持向量机是一种二分类模型，其思想是通过线性超平面将不同输入划分为正例和负例，并使线性超平面与不同输入的距离都达到最大。与$K$-近邻分类算法类似，支持向量机也适用于可以提取到数值型特征的分类问题。
 \vspace{0.5em}
-\item 最大熵模型。最大熵模型是根据最大熵原理提出的一种分类模型，其基本思想是：以在训练数据集中学习到的经验知识作为一种“约束”，并在符合约束的前提下，在若干合理的条件概率分布中选择“使条件熵最大”的模型。
+\item {\small\sffamily\bfseries{最大熵模型}}。最大熵模型是根据最大熵原理提出的一种分类模型，其基本思想是：以在训练数据集中学习到的经验知识作为一种“约束”，并在符合约束的前提下，在若干合理的条件概率分布中选择“使条件熵最大”的模型。
 \vspace{0.5em}
-\item 决策树分类算法。决策树分类算法是一种基于实例的归纳学习方法：将样本中某些决定性特征作为决策树的节点，根据特征表现进行对样本划分，最终根节点到每个叶子节点均形成一条分类的路径规则。这种分类方法适用于可以提取到离散型特征\footnote{即特征值是离散的。}的分类问题。
+\item {\small\sffamily\bfseries{决策树分类算法}}。决策树分类算法是一种基于实例的归纳学习方法：将样本中某些决定性特征作为决策树的节点，根据特征表现对样本进行划分，最终根节点到每个叶子节点均形成一条分类的路径规则。这种分类方法适用于可以提取到离散型特征\footnote{即特征值是离散的。}的分类问题。
 \vspace{0.5em}
-\item 朴素贝叶斯分类算法。朴素贝叶斯算法是以贝叶斯定理为基础并且假设特征之间相互独立的方法，以特征之间相互独立作为前提假设，学习从输入到输出的联合概率分布，并以后验概率最大的输出作为最终类别。
+\item {\small\sffamily\bfseries{朴素贝叶斯分类算法}}。朴素贝叶斯算法以贝叶斯定理为基础，以特征之间相互独立作为前提假设，学习从输入到输出的联合概率分布，并以后验概率最大的输出作为最终类别。
 \vspace{0.5em}
 \end{itemize}
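Of the classifiers listed above, $K$-nearest neighbours is the simplest to write down. The sketch below assumes Euclidean distance and majority voting. The function name `knn_predict` and the toy data are hypothetical, and other distance measures or tie-breaking rules are equally possible.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    train: list of (feature_vector, label) pairs with numeric features.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

Each sample is a point in a $k$-dimensional feature space (here two-dimensional), and no training step is needed beyond storing the labelled samples.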


--- a/Chapter4/chapter4.tex
+++ b/Chapter4/chapter4.tex
@@ -763,9 +763,9 @@ d&=&t \frac{s}{\sqrt{n}}

 \begin{itemize}
 \vspace{0.5em}
-\item 找出译文中翻译错误的短语。要求预测出一个能够捕捉短语内部单词翻译错误、单词漏译以及单词顺序错误的标签序列。该序列中每个标签都对应着一个短语，若是短语不存在任何错误，则标签为“OK”；若是短语内部存在单词翻译错误和单词漏译，则标签为“BAD”；若短语内部的单词顺序存在问题，则标签为“BAD\_word\_order”。图\ref{fig:4-12}中的连线表示单词之间的对齐关系，蓝色虚线框标出了每个短语的范围，图\ref{fig:4-12}中的Phrase-target tags即为该过程中需要预测的质量标签序列。
+\item {\small\sffamily\bfseries{找出译文中翻译错误的短语}}。要求预测出一个能够捕捉短语内部单词翻译错误、单词漏译以及单词顺序错误的标签序列。该序列中每个标签都对应着一个短语，若是短语不存在任何错误，则标签为“OK”；若是短语内部存在单词翻译错误和单词漏译，则标签为“BAD”；若短语内部的单词顺序存在问题，则标签为“BAD\_word\_order”。图\ref{fig:4-12}中的连线表示单词之间的对齐关系，蓝色虚线框标出了每个短语的范围，图\ref{fig:4-12}中的Phrase-target tags即为该过程中需要预测的质量标签序列。
 \vspace{0.5em}
-\item 找出译文中短语之间漏译错误。短语级质量评估任务同时也要求预测一个能够捕捉到短语间的漏译现象的质量标签序列，在译文端短语的两侧位置进行预测，若某位置未出现漏译，则该位置的质量标签为“OK”，否则为“BAD\_omission”。图\ref{fig:4-12}中的Gap tags即为该过程中的质量标签序列。
+\item {\small\sffamily\bfseries{找出译文中短语之间漏译错误}}。短语级质量评估任务同时也要求预测一个能够捕捉到短语间的漏译现象的质量标签序列，在译文端短语的两侧位置进行预测，若某位置未出现漏译，则该位置的质量标签为“OK”，否则为“BAD\_omission”。图\ref{fig:4-12}中的Gap tags即为该过程中的质量标签序列。
 \vspace{0.5em}
 \end{itemize}

@@ -785,13 +785,13 @@ d&=&t \frac{s}{\sqrt{n}}

 \begin{itemize}
 \vspace{0.5em}
-\item 区分“人工翻译”和“机器翻译”。在早期的工作中，研究人员试图训练一个能够区分人工翻译和机器翻译的二分类器完成句子级的质量评估\upcite{gamon2005sentence}，将被分类器判断为“人工翻译”的机器译文视为优秀的译文，将被分类器判断为“机器翻译”的机器译文视为较差的译文。一方面，这种评估方式不够直观，另一方面，这种评估方式并不十分准确，因为通过人工比对发现很多被判定为“机器翻译”的译文具有与人们期望的人类翻译相同的质量水平。
+\item {\small\sffamily\bfseries{区分“人工翻译”和“机器翻译”}}。在早期的工作中，研究人员试图训练一个能够区分人工翻译和机器翻译的二分类器完成句子级的质量评估\upcite{gamon2005sentence}，将被分类器判断为“人工翻译”的机器译文视为优秀的译文，将被分类器判断为“机器翻译”的机器译文视为较差的译文。一方面，这种评估方式不够直观，另一方面，这种评估方式并不十分准确，因为通过人工比对发现很多被判定为“机器翻译”的译文具有与人们期望的人类翻译相同的质量水平。
 \vspace{0.5em}
-\item 预测反映译文句子质量的“质量标签”。在同一时期，研究人员们也尝试使用人工为机器译文分配能够反映译文质量的标签\upcite{DBLP:conf/lrec/Quirk04}，例如“不可接受”、“一定程度上可接受”、“ 可接受”、“ 理想”等类型的质量标签，同时将获取机器译文的质量标签作为句子级质量评估的任务目标。
+\item {\small\sffamily\bfseries{预测反映译文句子质量的“质量标签”}}。在同一时期，研究人员们也尝试使用人工为机器译文分配能够反映译文质量的标签\upcite{DBLP:conf/lrec/Quirk04}，例如“不可接受”、“一定程度上可接受”、“可接受”、“理想”等类型的质量标签，同时将获取机器译文的质量标签作为句子级质量评估的任务目标。
 \vspace{0.5em}
-\item 预测译文句子的相对排名。当相对排序（详见\ref{sec:human-eval-scoring}节）的译文评价方法被引入后，给出机器译文的相对排名成为句子级质量评估的任务目标。
+\item {\small\sffamily\bfseries{预测译文句子的相对排名}}。当相对排序（详见\ref{sec:human-eval-scoring}节）的译文评价方法被引入后，给出机器译文的相对排名成为句子级质量评估的任务目标。
 \vspace{0.5em}
-\item 预测译文句子的后编辑工作量。在最近的研究中，句子级的质量评估一直在探索各种类型的离散或连续的后编辑标签。例如，通过测量以秒为单位的后编辑时间对译文句子进行评分；通过测量预测后编辑过程所需的击键数对译文句子进行评分；通过计算{\small\sffamily\bfseries{人工译后编辑距离}}\index{人工译后编辑距离}（Human Translation Error Rate，HTER）\index{Human Translation Error Rate}，即在后编辑过程中编辑（插入/删除/替换）数量与参考翻译长度的占比率对译文句子进行评分。HTER的计算公式为：
+\item {\small\sffamily\bfseries{预测译文句子的后编辑工作量}}。在最近的研究中，句子级的质量评估一直在探索各种类型的离散或连续的后编辑标签。例如，通过测量以秒为单位的后编辑时间对译文句子进行评分；通过测量预测后编辑过程所需的击键数对译文句子进行评分；通过计算{\small\sffamily\bfseries{人工译后编辑距离}}\index{人工译后编辑距离}（Human Translation Error Rate，HTER）\index{Human Translation Error Rate}，即在后编辑过程中编辑（插入/删除/替换）数量与参考翻译长度的占比率对译文句子进行评分。HTER的计算公式为：
 \vspace{0.5em}
 \begin{eqnarray}
 \textrm{HTER}&=& \frac{\mbox{编辑操作数目}}{\mbox{翻译后编辑结果长度}}
@@ -829,9 +829,9 @@ d&=&t \frac{s}{\sqrt{n}}

 \begin{itemize}
 \vspace{0.5em}
-\item 阅读理解测试得分情况。以往衡量文档译文质量的主要方法是采用理解测试\upcite{,DBLP:conf/icassp/JonesGSGHRW05}，即利用提前设计好的与文档相关的阅读理解题目（包括多项选择题类型和问答题类型）对母语为目标语言的多个测试者进行测试，将代表测试者在给定文档上的问卷中的所有问题所得到的分数作为质量标签。
+\item {\small\sffamily\bfseries{阅读理解测试得分情况}}。以往衡量文档译文质量的主要方法是采用理解测试\upcite{DBLP:conf/icassp/JonesGSGHRW05}，即利用提前设计好的与文档相关的阅读理解题目（包括多项选择题类型和问答题类型）对母语为目标语言的多个测试者进行测试，将代表测试者在给定文档上的问卷中的所有问题所得到的分数作为质量标签。
 \vspace{0.5em}
-\item 后编辑工作量。 最近的研究工作中，多是采用对文档译文进行后编辑的工作量评估文档译文的质量。为了准确获取文档后编辑的工作量，两阶段后编辑方法被提出\upcite{DBLP:conf/eamt/ScartonZVGS15}，即第一阶段对文档中的句子单独在无语境情况下进行后编辑，第二阶段将所有句子重新合并成文档后再进行后编辑。两阶段中后编辑工作量的总和越多，意味着文档译文质量越差。
+\item {\small\sffamily\bfseries{后编辑工作量}}。最近的研究工作中，多是采用对文档译文进行后编辑的工作量评估文档译文的质量。为了准确获取文档后编辑的工作量，两阶段后编辑方法被提出\upcite{DBLP:conf/eamt/ScartonZVGS15}，即第一阶段对文档中的句子单独在无语境情况下进行后编辑，第二阶段将所有句子重新合并成文档后再进行后编辑。两阶段中后编辑工作量的总和越多，意味着文档译文质量越差。
 \vspace{0.5em}
 \end{itemize}

@@ -855,22 +855,22 @@ d&=&t \frac{s}{\sqrt{n}}

 \begin{itemize}
 \vspace{0.5em}
-\item 表示/特征学习模块：用于在数据中提取能够反映翻译结果质量的“特征”；
+\item {\small\sffamily\bfseries{表示/特征学习模块}}：用于在数据中提取能够反映翻译结果质量的“特征”；
 \vspace{0.5em}
-\item 质量评估模块：基于句子的表示结果，利用机器学习算法预测翻译结果的质量。
+\item {\small\sffamily\bfseries{质量评估模块}}：基于句子的表示结果，利用机器学习算法预测翻译结果的质量。
 \end{itemize}

 \parinterval 在传统机器学习的观点下，句子都是由某些特征表示的。因此需要人工设计能够对译文质量评估有指导性作用的特征\upcite{DBLP:conf/wmt/Bicici13a,DBLP:conf/wmt/SouzaBTN13,DBLP:conf/wmt/BiciciW14,DBLP:conf/wmt/SouzaGBTN14,DBLP:conf/wmt/Espla-GomisSF15}，常用的特征有：

 \begin{itemize}
 \vspace{0.5em}
-\item 复杂度特征：反映了翻译一个源文的难易程度，翻译难度越大，译文质量低的可能性就越大；
+\item {\small\sffamily\bfseries{复杂度特征}}：反映了翻译一个源文的难易程度，翻译难度越大，译文质量低的可能性就越大；
 \vspace{0.5em}
-\item 流畅度特征：反映了译文的自然度、流畅度、语法合理程度；
+\item {\small\sffamily\bfseries{流畅度特征}}：反映了译文的自然度、流畅度、语法合理程度；
 \vspace{0.5em}
-\item 置信度特征：反映了机器翻译系统对输出的译文的置信程度；
+\item {\small\sffamily\bfseries{置信度特征}}：反映了机器翻译系统对输出的译文的置信程度；
 \vspace{0.5em}
-\item 充分度特征：反映了源文和机器译文在不同语言层次上的密切程度或关联程度。
+\item {\small\sffamily\bfseries{充分度特征}}：反映了源文和机器译文在不同语言层次上的密切程度或关联程度。
 \vspace{0.5em}
 \end{itemize}

@@ -897,11 +897,11 @@ d&=&t \frac{s}{\sqrt{n}}

 \begin{itemize}
 \vspace{0.5em}
-\item 判断人工后编辑的工作量。人工后编辑工作中有两个不可避免的问题：1）待编辑的机器译文是否值得改？2）待编辑的机器译文需要修改哪里？对于一些质量较差的机器译文来说，人工重译远远比修改译文的效率高，后编辑人员可以借助质量评估系统提供的指标筛选出值得进行后编辑的机器译文，另一方面，质量评估模型可以为每条机器译文提供{错误内容、错误类型、错误严重程度}的注释，这些内容将帮助后编辑人员准确定位到需要修改的位置，同时在一定程度上提示后编辑人员采取何种修改策略，势必能大大减少后编辑的工作内容。
+\item {\small\sffamily\bfseries{判断人工后编辑的工作量}}。人工后编辑工作中有两个不可避免的问题：1）待编辑的机器译文是否值得改？2）待编辑的机器译文需要修改哪里？对于一些质量较差的机器译文来说，人工重译远远比修改译文的效率高，后编辑人员可以借助质量评估系统提供的指标筛选出值得进行后编辑的机器译文，另一方面，质量评估模型可以为每条机器译文提供{错误内容、错误类型、错误严重程度}的注释，这些内容将帮助后编辑人员准确定位到需要修改的位置，同时在一定程度上提示后编辑人员采取何种修改策略，势必能大大减少后编辑的工作内容。
 \vspace{0.5em}
-\item 自动识别并更正翻译错误。质量评估和{\small\sffamily\bfseries{自动后编辑}}\index{自动后编辑}（Automatic Post-editing，APE）\index{Automatic Post-editing}也是很有潜力的应用方向。因为质量评估可以预测出错的位置，进而可以使用自动方法修正这些错误。但是，在这种应用模式中，质量评估的精度是非常关键的，因为如果预测错误可能会产生错误的修改，甚至带来整体译文质量的下降。
+\item {\small\sffamily\bfseries{自动识别并更正翻译错误}}。质量评估和{\small\sffamily\bfseries{自动后编辑}}\index{自动后编辑}（Automatic Post-editing，APE）\index{Automatic Post-editing}也是很有潜力的应用方向。因为质量评估可以预测出错的位置，进而可以使用自动方法修正这些错误。但是，在这种应用模式中，质量评估的精度是非常关键的，因为如果预测错误可能会产生错误的修改，甚至带来整体译文质量的下降。
 \vspace{0.5em}
-\item 辅助外语交流和学习。例如，在很多社交网站上，用户会利用外语进行交流。质量评估模型可以提示该用户输入的内容中存在的用词、语法等问题，这样用户可以重新对内容进行修改。甚至质量评估可以帮助外语学习者发现外语使用中的问题，例如，对于一个英语初学者，如果能提示他/她写的句子中的明显错误，对他/她的外语学习是非常有帮助的。
+\item {\small\sffamily\bfseries{辅助外语交流和学习}}。例如，在很多社交网站上，用户会利用外语进行交流。质量评估模型可以提示该用户输入的内容中存在的用词、语法等问题，这样用户可以重新对内容进行修改。甚至质量评估可以帮助外语学习者发现外语使用中的问题，例如，对于一个英语初学者，如果能提示他/她写的句子中的明显错误，对他/她的外语学习是非常有帮助的。
 \vspace{0.5em}
 \end{itemize}


--- a/Chapter5/Figures/figure-calculation-formula&iterative-process-of-function.tex
+++ b/Chapter5/Figures/figure-calculation-formula&iterative-process-of-function.tex
@@ -11,7 +11,7 @@
    \node [anchor=west] (eq2) at (eq1.east) {$=$\ };
    \draw [-] ([xshift=0.3em]eq2.east) -- ([xshift=11.6em]eq2.east);
    \node [anchor=south west] (eq3) at ([xshift=1em]eq2.east) {$\sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^{[k]})$};
-    \node [anchor=north west] (eq4) at (eq2.east) {$\sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;s^{[k]},t^{[k]})$};
+    \node [anchor=north west] (eq4) at (eq2.east) {$\sum_{s'_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s'_u|t_v;s^{[k]},t^{[k]})$};

   {
    \node [anchor=south] (label1) at ([yshift=-6em,xshift=3em]eq1.north west) {利用这个公式计算};

--- a/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
+++ b/Chapter5/Figures/figure-em-algorithm-flow-chart.tex
@@ -14,7 +14,7 @@
 \node [anchor=north west] (line7) at ([yshift=-0.1em]line6.south west) {4: \quad \quad  \textbf{foreach} $k = 1$ to $K$ \textbf{do}};
 \node [anchor=north west] (line8) at ([yshift=-0.1em]line7.south west) {5: \quad \quad \quad \footnotesize{$c_{\mathbb{E}}(\seq{s}_u|\seq{t}_v;\seq{s}^{[k]},\seq{t}^{[k]}) = \sum\limits_{j=1}^{|\seq{s}^{[k]}|} \delta(s_j,s_u) \sum\limits_{i=0}^{|\seq{t}^{[k]}|} \delta(t_i,t_v) \cdot \frac{f(s_u|t_v)}{\sum_{i=0}^{l}f(s_u|t_i)}$}\normalsize{}};
 \node [anchor=north west] (line9) at ([yshift=-0.1em]line8.south west) {6: \quad \quad \textbf{foreach} $t_v$ appearing in at least one of $\{\seq{t}^{[1]},...,\seq{t}^{[K]}\}$ \textbf{do}};
-\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \quad \quad \quad $\lambda_{t_v}^{'} = \sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
+\node [anchor=north west] (line10) at ([yshift=-0.1em]line9.south west) {7: \quad \quad \quad $\lambda_{t_v}^{'} = \sum_{s'_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s'_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})$};
 \node [anchor=north west] (line11) at ([yshift=-0.1em]line10.south west) {8: \quad \quad \quad \textbf{foreach} $s_u$ appearing in at least one of $\{\seq{s}^{[1]},...,\seq{s}^{[K]}\}$ \textbf{do}};
 \node [anchor=north west] (line12) at ([yshift=-0.1em]line11.south west) {9: \quad \quad \quad \quad $f(s_u|t_v) = \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \cdot (\lambda_{t_v}^{'})^{-1}$};
 \node [anchor=north west] (line13) at ([yshift=-0.1em]line12.south west) {10: \textbf{return} $f(\cdot|\cdot)$};
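The loop structure shown in this figure (expected counts per sentence pair, the normaliser $\lambda_{t_v}^{'}$, then the update of $f(s_u|t_v)$) can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the book's implementation: the function name `train_model1`, the `<null>` token standing for $t_0$, the uniform initialisation and the toy corpus are all assumptions.

```python
from collections import defaultdict

NULL = "<null>"  # assumed name for t_0, the empty target word


def train_model1(corpus, iterations=10):
    """EM estimation of f(s|t); corpus is a list of (source_words, target_words)."""
    pairs = [(src, [NULL] + tgt) for src, tgt in corpus]
    src_vocab = {word for src, _ in pairs for word in src}
    uniform = 1.0 / len(src_vocab)
    f = defaultdict(lambda: uniform)          # uniform initial f(s|t)
    for _ in range(iterations):
        count = defaultdict(float)            # expected counts c_E(s_u|t_v)
        total = defaultdict(float)            # lambda'_{t_v}: counts summed over s'_u
        for src, tgt in pairs:                # foreach k = 1 to K
            for s_word in src:
                norm = sum(f[(s_word, t_word)] for t_word in tgt)
                for t_word in tgt:
                    expected = f[(s_word, t_word)] / norm
                    count[(s_word, t_word)] += expected
                    total[t_word] += expected
        # f(s_u|t_v) = sum_k c_E(s_u|t_v) / lambda'_{t_v}
        f = {pair: c / total[pair[1]] for pair, c in count.items()}
    return f


corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
f = train_model1(corpus)
print(sorted(((s, round(p, 3)) for (s, t), p in f.items() if t == "the"),
             key=lambda x: -x[1]))
```

Because every expected count is divided by the total for its target word, the values $f(\cdot|t_v)$ sum to one for each $t_v$, which is exactly the normalisation constraint that the primed summation variable $s'_u$ expresses.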

--- a/Chapter5/chapter5.tex
+++ b/Chapter5/chapter5.tex
@@ -330,7 +330,7 @@ $\seq{t}^{[2]}$ = So\; ,\; what\; is\; human\; \underline{translation}\; ?
 \label{eq:5-7}
 \end{eqnarray}

-\parinterval 公式\eqref{eq:5-7}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。
+\parinterval 公式\eqref{eq:5-7}相当于在函数$g(\cdot)$上做了归一化，这样等式右端的结果具有一些概率的属性，比如，$0 \le \frac{g(\seq{s},\seq{t})}{\sum_{\seq{t'}}g(\seq{s},\seq{t'})} \le 1$。 具体来说，对于源语言句子$\seq{s}$，枚举其所有的翻译结果，并把所对应的函数$g(\cdot)$相加作为分母，而分子是某个翻译结果$\seq{t}$所对应的$g(\cdot)$的值。

 \parinterval 上述过程初步建立了句子级翻译模型，并没有直接求$\funp{P}(\seq{t}|\seq{s})$，而是把问题转化为对$g(\cdot)$的设计和计算上。但是，面临着两个新的问题：

@@ -1024,13 +1024,13 @@ f(s_u|t_v) &= &\lambda_{t_v}^{-1} \cdot \funp{P}(\seq{s}| \seq{t}) \cdot c_{\mat

 \parinterval 为了满足$f(\cdot|\cdot)$的概率归一化约束，易得$\lambda_{t_v}^{'}$为：
 \begin{eqnarray}
-\lambda_{t_v}^{'}&=&\sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})
+\lambda_{t_v}^{'}&=&\sum\limits_{s'_u} c_{\mathbb{E}}(s'_u|t_v;\seq{s},\seq{t})
 \label{eq:5-43}
 \end{eqnarray}

 \parinterval 因此，$f(s_u|t_v)$的计算式可再一步变换成下式：
 \begin{eqnarray}
-f(s_u|t_v)&=&\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s_u} c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t}) }
+f(s_u|t_v)&=&\frac{c_{\mathbb{E}}(s_u|t_v;\seq{s},\seq{t})}  { \sum\limits_{s'_u} c_{\mathbb{E}}(s'_u|t_v;\seq{s},\seq{t}) }
 \label{eq:5-44}
 \end{eqnarray}


--- a/Chapter6/chapter6.tex
+++ b/Chapter6/chapter6.tex
@@ -335,13 +335,13 @@ p_0+p_1                            & = & 1 \label{eq:6-21}

 \parinterval 另外，可以用$\odot_{i}$表示位置为$[i]$的目标语言单词对应的那些源语言单词位置的平均值，如果这个平均值不是整数则对它向上取整。比如在本例中，目标语句中第4个cept. （“.”）对应在源语言句子中的第5个单词。可表示为${\odot}_{4}=5$。

-\parinterval 利用这些新引进的概念，模型4对模型3的扭曲度进行了修改。主要是把扭曲度分解为两类参数。对于$[i]$对应的源语言单词列表($\tau_{[i]}$)中的第一个单词($\tau_{[i]1}$），它的扭曲度用如下公式计算：
+\parinterval 利用这些新引进的概念，模型4对模型3的扭曲度进行了修改，主要是把扭曲度分解为两类参数。当$[i]>0$时，对于$[i]$对应的源语言单词列表（$\tau_{[i]}$）中的第一个单词（$\tau_{[i]1}$），它的扭曲度用如下公式计算：
 \begin{eqnarray}
 \funp{P}(\pi_{[i]1}=j|{\pi}_1^{[i]-1},{\tau}_0^l,{\varphi}_0^l,\seq{t}) & = & d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))
 \label{eq:6-22}
 \end{eqnarray}

-\noindent 其中，第$i$个目标语言单词生成的第$k$个源语言单词的位置用变量$\pi_{ik}$表示。而对于列表($\tau_{[i]}$)中的其他的单词($\tau_{[i]k},1 < k \le \varphi_{[i]}$)的扭曲度，用如下公式计算：
+\noindent 其中，第$i$个目标语言单词生成的第$k$个源语言单词的位置用变量$\pi_{ik}$表示。当$[i]>0$时，列表（$\tau_{[i]}$）中其他单词（$\tau_{[i]k},1 < k \le \varphi_{[i]}$）的扭曲度用如下公式计算：

 \begin{eqnarray}
 \funp{P}(\pi_{[i]k}=j|{\pi}_{[i]1}^{k-1},\pi_1^{[i]-1},\tau_0^l,\varphi_0^l,\seq{t}) & = & d_{>1}(j-\pi_{[i]k-1}|B(s_j))

--- a/Chapter7/chapter7.tex
+++ b/Chapter7/chapter7.tex
@@ -652,14 +652,14 @@ dr & = & {\rm{start}}_i-{\rm{end}}_{i-1}-1

 \parinterval 想要得到最优的特征权重，最简单的方法是枚举所有特征权重可能的取值，然后评价每组权重所对应的翻译性能，最后选择最优的特征权重作为调优的结果。但是特征权重是一个实数值，因此可以考虑把实数权重进行量化，即把权重看作是在固定间隔上的取值，比如，每隔0.01取值。即使是这样，同时枚举多个特征的权重也是非常耗时的工作，当特征数量增多时这种方法的效率仍然很低。

-\parinterval 这里介绍一种更加高效的特征权重调优方法$\ \dash \ ${\small\bfnew{最小错误率训练}}\index{最小错误率训练}（Minimum Error Rate Training\index{Minimum Error Rate Training}，MERT）。最小错误率训练是统计机器翻译发展中代表性工作，也是机器翻译领域原创的重要技术方法之一\upcite{DBLP:conf/acl/Och03}。最小错误率训练假设：翻译结果相对于标准答案的错误是可度量的，进而可以通过降低错误数量的方式来找到最优的特征权重。假设有样本集合$S = \{(s_1,\seq{r}_1),...,(s_N,\seq{r}_N)\}$，$s_i$为样本中第$i$个源语言句子，$\seq{r}_i$为相应的参考译文。注意，$\seq{r}_i$ 可以包含多个参考译文。$S$通常被称为{\small\bfnew{调优集合}}\index{调优集合}（Tuning Set）\index{Tuning Set}。对于$S$中的每个源语句子$s_i$，机器翻译模型会解码出$n$-best推导$\hat{\seq{d}}_{i} = \{\hat{d}_{ij}\}$，其中$\hat{d}_{ij}$表示对于源语言句子$s_i$得到的第$j$个最好的推导。$\{\hat{d}_{ij}\}$可以被定义如下：
+\parinterval 这里介绍一种更加高效的特征权重调优方法$\ \dash \ ${\small\bfnew{最小错误率训练}}\index{最小错误率训练}（Minimum Error Rate Training\index{Minimum Error Rate Training}，MERT）。最小错误率训练是统计机器翻译发展中代表性工作，也是机器翻译领域原创的重要技术方法之一\upcite{DBLP:conf/acl/Och03}。最小错误率训练假设：翻译结果相对于标准答案的错误是可度量的，进而可以通过降低错误数量的方式来找到最优的特征权重。假设有样本集合$S = \{(s^{[1]},\seq{r}^{[1]}),...,(s^{[N]},\seq{r}^{[N]})\}$，$s^{[i]}$为样本中第$i$个源语言句子，$\seq{r}^{[i]}$为相应的参考译文。注意，$\seq{r}^{[i]}$ 可以包含多个参考译文。$S$通常被称为{\small\bfnew{调优集合}}\index{调优集合}（Tuning Set）\index{Tuning Set}。对于$S$中的每个源语句子$s^{[i]}$，机器翻译模型会解码出$n$-best推导$\hat{\seq{d}}^{[i]} = \{\hat{d}_{j}^{[i]}\}$，其中$\hat{d}_{j}^{[i]}$表示对于源语言句子$s^{[i]}$得到的第$j$个最好的推导。$\{\hat{d}_{j}^{[i]}\}$可以被定义如下：

 \begin{eqnarray}
-\{\hat{d}_{ij}\} & = & \arg\max_{\{d_{ij}\}} \sum_{i=1}^{M} \lambda_i \cdot h_i (d,\seq{t},\seq{s})
+\{\hat{d}_{j}^{[i]}\} & = & \arg\max_{\{d_{j}^{[i]}\}} \sum_{k=1}^{M} \lambda_k \cdot h_k (d,\seq{t}^{[i]},\seq{s}^{[i]})
 \label{eq:7-17}
 \end{eqnarray}

-\parinterval 对于每个样本都可以得到$n$-best推导集合，整个数据集上的推导集合被记为$\hat{\seq{D}} = \{\hat{\seq{d}}_{1},...,\hat{\seq{d}}_{s}\}$。进一步，令所有样本的参考译文集合为$\seq{R} = \{\seq{r}_1,...,\seq{r}_N\}$。最小错误率训练的目标就是降低$\hat{\seq{D}}$相对于$\seq{R}$的错误。也就是，通过调整不同特征的权重$\lambda = \{ \lambda_i \}$，让错误率最小，形式化描述为：
+\parinterval 对于每个样本都可以得到$n$-best推导集合，整个数据集上的推导集合被记为$\hat{\seq{D}} = \{\hat{\seq{d}}^{[1]},...,\hat{\seq{d}}^{[N]}\}$。进一步，令所有样本的参考译文集合为$\seq{R} = \{\seq{r}^{[1]},...,\seq{r}^{[N]}\}$。最小错误率训练的目标就是降低$\hat{\seq{D}}$相对于$\seq{R}$的错误。也就是，通过调整不同特征的权重$\lambda = \{ \lambda_i \}$，让错误率最小，形式化描述为：
 \begin{eqnarray}
 \hat{\lambda} & = & \arg\min_{\lambda} \textrm{Error}(\hat{\seq{D}},\seq{R})
 \label{eq:7-18}

--- a/Chapter8/Figures/figure-phrase-structure-tree-and-dependency-tree.tex
+++ b/Chapter8/Figures/figure-phrase-structure-tree-and-dependency-tree.tex
@@ -23,8 +23,8 @@
 \node [anchor=west] (t4) at ([xshift=0.5em,]t3.east) {ball};

 \draw [->] ([xshift=0em]t3.north) .. controls +(north:1em) and +(north:1em) .. ([xshift=-0.2em]t4.north);
-\draw [->] ([xshift=0.2em]t4.north) .. controls +(north:2.5em) and +(north:2.5em) .. ([xshift=0.2em]t2.north);
-\draw [->] ([xshift=0.0em]t1.north) .. controls +(north:2.5em) and +(north:2.5em) .. ([xshift=-0.2em]t2.north);
+\draw [<-] ([xshift=0.2em]t4.north) .. controls +(north:2.5em) and +(north:2.5em) .. ([xshift=0.2em]t2.north);
+\draw [<-] ([xshift=0.0em]t1.north) .. controls +(north:2.5em) and +(north:2.5em) .. ([xshift=-0.2em]t2.north);

 \node [anchor=north west] (cap2) at ([yshift=-0.2em,xshift=-0.5em]t2.south west) {\small{(b) 依存树}};
 \end{scope}

--- a/Chapter8/chapter8.tex
+++ b/Chapter8/chapter8.tex
@@ -532,9 +532,9 @@ span\textrm{[0,4]}&=&\textrm{“猫} \quad \textrm{喜欢} \quad \textrm{吃} \q

 \begin{itemize}
 \vspace{0.5em}
-\item 剪枝：在CKY中，每个跨度都可以生成非常多的推导（局部翻译假设）。理论上，这些推导的数量会和跨度大小成指数关系。显然不可能保存如此大量的翻译推导。对于这个问题，常用的办法是只保留top-$k$个推导。也就是每个局部结果只保留最好的$k$个，即束剪枝。在极端情况下，当$k$=1时，这个方法就变成了贪婪的方法；
+\item {\small\bfnew{剪枝}}：在CKY中，每个跨度都可以生成非常多的推导（局部翻译假设）。理论上，这些推导的数量会和跨度大小成指数关系。显然不可能保存如此大量的翻译推导。对于这个问题，常用的办法是只保留top-$k$个推导。也就是每个局部结果只保留最好的$k$个，即束剪枝。在极端情况下，当$k$=1时，这个方法就变成了贪婪的方法；
 \vspace{0.5em}
-\item $n$-best结果的生成：$n$-best推导（译文）的生成是统计机器翻译必要的功能。比如，最小错误率训练中就需要最好的$n$个结果用于特征权重调优。在基于CKY的方法中，整个句子的翻译结果会被保存在最大跨度所对应的结构中。因此一种简单的$n$-best生成方法是从这个结构中取出排名最靠前的$n$个结果。另外，也可以考虑自上而下遍历CKY生成的推导空间，得到更好的$n$-best结果\upcite{huang2005better}。
+\item {\small\bfnew{$n$-best结果的生成}}：$n$-best推导（译文）的生成是统计机器翻译必要的功能。比如，最小错误率训练中就需要最好的$n$个结果用于特征权重调优。在基于CKY的方法中，整个句子的翻译结果会被保存在最大跨度所对应的结构中。因此一种简单的$n$-best生成方法是从这个结构中取出排名最靠前的$n$个结果。另外，也可以考虑自上而下遍历CKY生成的推导空间，得到更好的$n$-best结果\upcite{huang2005better}。
 \end{itemize}
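Both items above reduce to the same operation: keeping the best few scored derivations of a span. The sketch below is a minimal version of it. The function name `beam_prune`, the (score, derivation) tuple layout and the example scores are hypothetical.

```python
import heapq


def beam_prune(hypotheses, k):
    """Keep the k highest-scoring (score, derivation) pairs of one span."""
    return heapq.nlargest(k, hypotheses, key=lambda hyp: hyp[0])


# candidate derivations of one span, scored by model score (higher is better)
cell = [
    (-2.3, "derivation-a"),
    (-0.7, "derivation-b"),
    (-1.5, "derivation-c"),
    (-4.1, "derivation-d"),
]
print(beam_prune(cell, 2))   # beam pruning with k = 2
print(beam_prune(cell, 1))   # k = 1 degenerates to greedy search
```

Applying the same call with $k=n$ to the cell of the largest span gives the simple $n$-best list described in the second item.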
 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION

--- a/Chapter9/Figures/figure-absolute-loss.tex
+++ b/Chapter9/Figures/figure-absolute-loss.tex
@@ -8,12 +8,12 @@
 \draw[->,thick] (-6,0) -- (5,0);
 \draw[->,thick] (-5,-4) -- (-5,5);

-\draw [<-] (-2.5,4) -- (-2,5) node [pos=1,right,inner sep=2pt] {\footnotesize{答案$\tilde{\mathbi{y}}_i$}};
+\draw [<-] (-2.5,4) -- (-2,5) node [pos=1,right,inner sep=2pt] {\footnotesize{答案${\mathbi{y}}^{[i]}$}};
 {
-\draw [<-] (-3,-3) -- (-2.5,-2) node [pos=0,left,inner sep=2pt] {\footnotesize{预测${\mathbi{y}}_i$}};}
+\draw [<-] (-3,-3) -- (-2.5,-2) node [pos=0,left,inner sep=2pt] {\footnotesize{预测${\hat{\mathbi{y}}}^{[i]}$}};}

 {
-\draw [<-] (2.3,1) -- (3.3,2) node [pos=1,right,inner sep=2pt] {\footnotesize{偏差$|\tilde{\mathbi{y}}_i - {\mathbi{y}}_i|$}};
+\draw [<-] (2.3,1) -- (3.3,2) node [pos=1,right,inner sep=2pt] {\footnotesize{偏差$|{\mathbi{y}}^{[i]} - {\hat{\mathbi{y}}}^{[i]}|$}};
 \foreach \x in {-3.8,-3.7,...,3.0}{
    \pgfmathsetmacro{\p}{- 1/14 * (\x + 4) * (\x + 1) * (\x - 1) * (\x - 3)};
    \pgfmathsetmacro{\q}{- 1/14 * (4*\x*\x*\x + 3*\x*\x - 26*\x - 1)};

--- a/Chapter9/chapter9.tex
+++ b/Chapter9/chapter9.tex
@@ -142,11 +142,11 @@

 \begin{itemize}
 \vspace{0.5em}
-\item 特征的构造需要耗费大量的时间和精力。在传统机器学习的特征工程方法中，特征提取都是基于人力完成的，该过程往往依赖于大量的先验假设，会导致相关系统的研发周期也大大增加；
+\item {\small\sffamily\bfseries{特征的构造需要耗费大量的时间和精力}}。在传统机器学习的特征工程方法中，特征提取都是基于人力完成的，该过程往往依赖于大量的先验假设，会导致相关系统的研发周期也大大增加；
 \vspace{0.5em}
-\item 最终的系统性能强弱非常依赖特征的选择。有一句话在业界广泛流传：“数据和特征决定了机器学习的上限”，但是人的智力和认知是有限的，因此人工设计的特征的准确性和覆盖度会存在瓶颈；
+\item {\small\sffamily\bfseries{最终的系统性能强弱非常依赖特征的选择}}。有一句话在业界广泛流传：“数据和特征决定了机器学习的上限”，但是人的智力和认知是有限的，因此人工设计的特征的准确性和覆盖度会存在瓶颈；
 \vspace{0.5em}
-\item 通用性差。针对不同的任务，传统机器学习的特征工程方法需要选择出不同的特征，在某个任务上表现很好的特征在其他任务上可能没有效果。
+\item {\small\sffamily\bfseries{通用性差}}。针对不同的任务，传统机器学习的特征工程方法需要选择出不同的特征，在某个任务上表现很好的特征在其他任务上可能没有效果。
 \vspace{0.5em}
 \end{itemize}

@@ -645,11 +645,11 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe

 \begin{itemize}
 \vspace{0.5em}
-\item 对问题建模，即定义输入$ \{x_i\} $的形式；
+\item {\small\sffamily\bfseries{对问题建模}}，即定义输入$ \{x_i\} $的形式；
 \vspace{0.5em}
-\item 设计有效的决策模型，即定义$ y $；
+\item {\small\sffamily\bfseries{设计有效的决策模型}}，即定义$ y $；
 \vspace{0.5em}
-\item 得到模型参数（如权重$ \{w_i\} $）的最优值。
+\item {\small\sffamily\bfseries{得到模型参数}}（如权重$ \{w_i\} $）的最优值。
 \vspace{0.5em}
 \end{itemize}

@@ -1128,7 +1128,7 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma

 \subsection{损失函数}

-\parinterval 在神经网络的有监督学习中，训练模型的数据是由输入和正确答案所组成的样本构成的。假设有多个输入样本$ \{{\mathbi{x}}_1,{\mathbi{x}}_2,\dots,{\mathbi{x}}_n\} $，每一个$ {\mathbi{x}}_i $都对应一个正确答案$ \widetilde{\mathbi{y}}_i $，$ \{{\mathbi{x}}_i,\widetilde{\mathbi{y}}_i\} $就构成一个优化神经网络的{\small\sffamily\bfseries{训练数据集合}}\index{训练数据集合}（Training Data Set）\index{Training Data Set}。对于一个神经网络模型${\mathbi{y}}=f({\mathbi{x}}) $,每个$ {\mathbi{x}}_i $也会有一个输出$ {\mathbi{y}}_i $。如果可以度量正确答案$ \widetilde{\mathbi{y}}_i $和神经网络输出$ {\mathbi{y}}_i $之间的偏差，进而通过调整网络参数减小这种偏差，就可以得到更好的模型。
+\parinterval 在神经网络的有监督学习中，训练模型的数据是由输入和正确答案所组成的样本构成的。假设有多个输入样本$ \{{\mathbi{x}}^{[1]},\dots,{\mathbi{x}}^{[n]}\} $，每一个$ {\mathbi{x}}^{[i]}$都对应一个正确答案$ {\mathbi{y}}^{[i]} $，$ \{{\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]}\} $就构成一个优化神经网络的{\small\sffamily\bfseries{训练数据集合}}\index{训练数据集合}（Training Data Set）\index{Training Data Set}。对于一个神经网络模型${\mathbi{y}}=f({\mathbi{x}}) $，每个$ {\mathbi{x}}^{[i]} $也会有一个输出$ {\hat{\mathbi{y}}}^{[i]} $。如果可以度量正确答案$ {\mathbi{y}}^{[i]} $和神经网络输出$ {\hat{\mathbi{y}}}^{[i]} $之间的偏差，进而通过调整网络参数减小这种偏差，就可以得到更好的模型。

 %----------------------------------------------
 \begin{figure}[htp]
@@ -1139,9 +1139,9 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma
 \end{figure}
 %-------------------------------------------

-\parinterval 通常，可以通过设计{\small\sffamily\bfseries{损失函数}}\index{损失函数}（Loss Function）\index{Loss Function}来度量正确答案$ \widetilde{\mathbi{y}}_i $和神经网络输出$ {\mathbi{y}}_i $之间的偏差。而这个损失函数往往充当训练的{\small\sffamily\bfseries{目标函数}}\index{目标函数}（Objective Function）\index{Objective Function}，神经网络训练就是通过不断调整神经网络内部的参数而使损失函数最小化。图\ref{fig:9-42}展示了一个绝对值损失函数的实例。
+\parinterval 通常，可以通过设计{\small\sffamily\bfseries{损失函数}}\index{损失函数}（Loss Function）\index{Loss Function}来度量正确答案$ {\mathbi{y}}^{[i]} $和神经网络输出$ {\hat{\mathbi{y}}}^{[i]} $之间的偏差。而这个损失函数往往充当训练的{\small\sffamily\bfseries{目标函数}}\index{目标函数}（Objective Function）\index{Objective Function}，神经网络训练就是通过不断调整神经网络内部的参数而使损失函数最小化。图\ref{fig:9-42}展示了一个绝对值损失函数的实例。

-\parinterval 这里用$ Loss(\widetilde{\mathbi{y}}_i,{\mathbi{y}}_i) $表示网络输出$ {\mathbi{y}}_i $相对于答案$ \widetilde{\mathbi{y}}_i $的损失，简记为$ L $。表\ref{tab:9-3}是几种常见损失函数的定义。需要注意的是，没有一种损失函数可以适用于所有的问题。损失函数的选择取决于许多因素，包括：数据中是否有离群点、模型结构的选择、是否易于找到函数的导数以及预测结果的置信度等。对于相同的神经网络，不同的损失函数会对训练得到的模型产生不同的影响。对于新的问题，如果无法找到已有的、适合于该问题的损失函数，研究人员也可以自定义损失函数。因此设计新的损失函数也是神经网络中有趣的研究方向。
+\parinterval 这里用$ Loss({\mathbi{y}}^{[i]},{\hat{\mathbi{y}}}^{[i]}) $表示网络输出$ {\hat{\mathbi{y}}}^{[i]} $相对于答案$ {\mathbi{y}}^{[i]} $的损失，简记为$ L $。表\ref{tab:9-3}是几种常见损失函数的定义。需要注意的是，没有一种损失函数可以适用于所有的问题。损失函数的选择取决于许多因素，包括：数据中是否有离群点、模型结构的选择、是否易于找到函数的导数以及预测结果的置信度等。对于相同的神经网络，不同的损失函数会对训练得到的模型产生不同的影响。对于新的问题，如果无法找到已有的、适合于该问题的损失函数，研究人员也可以自定义损失函数。因此设计新的损失函数也是神经网络中有趣的研究方向。

 %--------------------------------------------------------------------
 \begin{table}[htp]
@@ -1152,19 +1152,19 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma
 \begin{tabular}{l | l l}
 \rule{0pt}{15pt}     名称 & 定义 & 应用  \\
 \hline
-\rule{0pt}{15pt}     0-1损失 & $ L=\begin{cases} 0 & \widetilde{\mathbi{y}}_i={\mathbi{y}}_i \\1 & \widetilde{\mathbi{y}}_i\not ={\mathbi{y}}_i\end{cases} $ & 感知机  \\
-\rule{0pt}{15pt}     Hinge损失 & $ L={\textrm {max}}(0,1-\widetilde{\mathbi{y}}_i\cdot {\mathbi{y}}_i) $ & SVM  \\
-\rule{0pt}{15pt}     绝对值损失 & $ L=\vert \widetilde{\mathbi{y}}_i-{\mathbi{y}}_i\vert $ & 回归  \\
-\rule{0pt}{15pt}     Logistic损失 & $ L={\textrm{log}}(1+\widetilde{\mathbi{y}}_i\cdot {\mathbi{y}}_i) $ & 回归  \\
-\rule{0pt}{15pt}     平方损失 & $ L={(\widetilde{\mathbi{y}}_i-{\mathbi{y}}_i)}^2 $ & 回归  \\
-\rule{0pt}{15pt}     指数损失 & $ L={\textrm{exp}}(-\widetilde{\mathbi{y}}_i\cdot {\mathbi{y}}_i) $ & AdaBoost  \\
-\rule{0pt}{15pt}     交叉熵损失 & $ L=-\sum_{k}{{\mathbi{y}}_{i}[k]}{\textrm {log}} {\widetilde{\mathbi{y}}_{i}[k]} $ & 多分类  \\
-\rule{0pt}{15pt}     & 其中，${\mathbi{y}}_{i}[k]$ 表示 ${\mathbi{y}}_i$的第$k$维
+\rule{0pt}{15pt}     0-1损失 & $ L=\begin{cases} 0 & {\mathbi{y}}^{[i]}={\hat{\mathbi{y}}}^{[i]} \\1 & {\mathbi{y}}^{[i]}\not ={\hat{\mathbi{y}}}^{[i]}\end{cases} $ & 感知机  \\
+\rule{0pt}{15pt}     Hinge损失 & $ L={\textrm {max}}(0,1-{\mathbi{y}}^{[i]}\cdot {\hat{\mathbi{y}}}^{[i]}) $ & SVM  \\
+\rule{0pt}{15pt}     绝对值损失 & $ L=\vert {\mathbi{y}}^{[i]}-{\hat{\mathbi{y}}}^{[i]}\vert $ & 回归  \\
+\rule{0pt}{15pt}     Logistic损失 & $ L={\textrm{log}}(1+{\textrm{exp}}(-{\mathbi{y}}^{[i]}\cdot {\hat{\mathbi{y}}}^{[i]})) $ & 回归  \\
+\rule{0pt}{15pt}     平方损失 & $ L={({\mathbi{y}}^{[i]}-{\hat{\mathbi{y}}}^{[i]})}^2 $ & 回归  \\
+\rule{0pt}{15pt}     指数损失 & $ L={\textrm{exp}}(-{\mathbi{y}}^{[i]}\cdot{\hat{\mathbi{y}}}^{[i]}) $ & AdaBoost  \\
+\rule{0pt}{15pt}     交叉熵损失 & $ L=-\sum_{k}{\mathbi{y}}^{[i]}_{k}{\textrm {log}} {\hat{\mathbi{y}}}^{[i]}_{k} $ & 多分类  \\
+\rule{0pt}{15pt}     & 其中，${\mathbi{y}}^{[i]}_{k}$ 表示 ${\mathbi{y}}^{[i]}$的第$k$维
 \end{tabular}
 \end{table}
 %--------------------------------------------------------------------

-\parinterval 在实际系统开发中，损失函数中除了损失项（即用来度量正确答案$ \widetilde{\mathbi{y}}_i $和神经网络输出$ {\mathbi{y}}_i $之间的偏差的部分）之外，还可以包括正则项，比如L1正则和L2正则。设置正则项本质上是要加入一些偏置，使模型在优化的过程中偏向某个方向多一些。关于正则项的内容将在\ref{sec:9.4.5}节介绍。
+\parinterval 在实际系统开发中，损失函数中除了损失项（即用来度量正确答案$ {\mathbi{y}}^{[i]} $和神经网络输出$ {\hat{\mathbi{y}}}^{[i]} $之间的偏差的部分）之外，还可以包括正则项，比如L1正则和L2正则。设置正则项本质上是要加入一些偏置，使模型在优化的过程中偏向某个方向多一些。关于正则项的内容将在\ref{sec:9.4.5}节介绍。
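Several of the losses discussed in this subsection can be written as plain functions, which makes their behaviour easy to compare. The sketch below uses `y` for the correct answer and `y_hat` for the network output. The function names are hypothetical, the scalar losses assume real-valued inputs, and the cross-entropy is given in its common form $-\sum_k y_k \log \hat{y}_k$ with `y` a probability distribution (for example a one-hot vector).

```python
import math


def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1


def hinge_loss(y, y_hat):
    # y is expected to be +1 or -1
    return max(0.0, 1.0 - y * y_hat)


def absolute_loss(y, y_hat):
    return abs(y - y_hat)


def squared_loss(y, y_hat):
    return (y - y_hat) ** 2


def exponential_loss(y, y_hat):
    return math.exp(-y * y_hat)


def cross_entropy_loss(y, y_hat):
    # terms with y_k = 0 contribute nothing and are skipped
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)


print(absolute_loss(1.0, 0.5), squared_loss(1.0, 0.5), hinge_loss(1, 0.5))
print(cross_entropy_loss([0, 1, 0], [0.2, 0.5, 0.3]))
```

The absolute and squared losses differ mainly in how strongly they penalise large deviations, which is one reason why the presence of outliers influences the choice of loss.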

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -1172,14 +1172,14 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma

 \subsection{基于梯度的参数优化}\label{sec9:para-training}

-\parinterval 对于第$ i $个样本$ ({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i) $，把损失函数$ L(\widetilde{\mathbi{y}}_i,{\mathbi{y}}_i) $看作是参数$ \bm \theta $的函数\footnote{为了简化描述，可以用$
-\bm{\theta} $表示神经网络中的所有参数，包括各层的权重矩阵${\mathbi{W}}^{[1]}\dots{\mathbi{W}}^{[n]}$和偏置向量${\mathbi{b}}^{[1]}\dots{\mathbi{b}}^{[n]}$等。}，因为输出$ {\mathbi{y}}_i $是由输入$ {\mathbi{x}}_i $和模型参数$ \bm \theta $决定，因此也把损失函数写为$ L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta}) $。下式描述了参数学习的过程：
+\parinterval 对于第$ i $个样本$ ({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]}) $，把损失函数$ L({\mathbi{y}}^{[i]},{\hat{\mathbi{y}}}^{[i]}) $看作是参数$ \bm \theta $的函数\footnote{为了简化描述，可以用$
+\bm{\theta} $表示神经网络中的所有参数，包括各层的权重矩阵${\mathbi{W}}^{[1]}\dots{\mathbi{W}}^{[n]}$和偏置向量${\mathbi{b}}^{[1]}\dots{\mathbi{b}}^{[n]}$等。}，因为模型输出$ {\hat{\mathbi{y}}}^{[i]}$是由输入$ {\mathbi{x}}^{[i]}$和模型参数$ \bm \theta $决定，因此也把损失函数写为$ L({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]};{\bm \theta}) $。下式描述了参数学习的过程：
 \begin{eqnarray}
-\widehat{\bm\theta}&=&\mathop{\arg\min}_{\bm \theta}\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})}
+\widehat{\bm\theta}&=&\mathop{\arg\min}_{\bm \theta}\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]};{\bm \theta})}
 \label{eq:9-28}
 \end{eqnarray}

-\noindent 其中，$ \widehat{\bm \theta} $表示在训练数据上使损失的平均值达到最小的参数，$n$为训练数据总量。$ \frac{1}{n}\sum \limits_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})} $也被称作{\small\sffamily\bfseries{代价函数}}\index{代价函数}（Cost Function）\index{Cost Function}，它是损失函数均值期望的估计，记为$ J({\bm \theta}) $。
+\noindent 其中，$ \widehat{\bm \theta} $表示在训练数据上使损失的平均值达到最小的参数，$n$为训练数据总量。$ \frac{1}{n}\sum \limits_{i=1}^{n}{L({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]};{\bm \theta})} $也被称作{\small\sffamily\bfseries{代价函数}}\index{代价函数}（Cost Function）\index{Cost Function}，它是损失函数均值期望的估计，记为$ J({\bm \theta}) $。

 \parinterval 参数优化的核心问题是：找到使代价函数$ J({\bm\theta}) $达到最小的$ \bm \theta $。然而$ J({\bm\theta}) $可能会包含大量的参数，比如，基于神经网络的机器翻译模型的参数量可能会超过一亿个。这时不可能用手动方法进行调参。为了实现高效的参数优化，比较常用的手段是使用{\small\bfnew{梯度下降方法}}\index{梯度下降方法}（The Gradient Descent Method）\index{The Gradient Descent Method}。

@@ -1220,7 +1220,7 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma

 \parinterval 批量梯度下降是梯度下降方法中最原始的形式，这种梯度下降方法在每一次迭代时使用所有的样本进行参数更新。参数优化的目标函数如下：
 \begin{eqnarray}
-J({\bm \theta})&=&\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})}
+J({\bm \theta})&=&\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]};{\bm \theta})}
 \label{eq:9-30}
 \end{eqnarray}

@@ -1238,11 +1238,11 @@ J({\bm \theta})&=&\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{

 \parinterval 随机梯度下降（简称SGD）不同于批量梯度下降，每次迭代只使用一个样本对参数进行更新。SGD的目标函数如下：
 \begin{eqnarray}
-J({\bm \theta})&=&L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})
+J({\bm \theta})&=&L({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]};{\bm \theta})
 \label{eq:9-31}
 \end{eqnarray}

-\noindent 由于每次只随机选取一个样本$({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i)$进行优化，这样更新的计算代价低，参数更新的速度大大加快，而且也适用于利用少量样本进行在线学习的情况\footnote{比如，训练数据不是一次给定的，而是随着模型的使用不断追加的。这时，需要不断地用新的训练样本更新模型，这种模式也被称作{\scriptsize\bfnew{在线学习}}（Online Learning）。}。
+\noindent 由于每次只随机选取一个样本$({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]})$进行优化，这样更新的计算代价低，参数更新的速度大大加快，而且也适用于利用少量样本进行在线学习的情况\footnote{比如，训练数据不是一次给定的，而是随着模型的使用不断追加的。这时，需要不断地用新的训练样本更新模型，这种模式也被称作{\scriptsize\bfnew{在线学习}}（Online Learning）。}。

 \parinterval 因为随机梯度下降算法每次优化的只是某一个样本上的损失，所以它的问题也非常明显：单个样本上的损失无法代表在全部样本上的损失，因此参数更新的效率低，方法收敛速度极慢。即使在目标函数为强凸函数的情况下，SGD仍旧无法做到线性收敛。

@@ -1256,7 +1256,7 @@ J({\bm \theta})&=&L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})

 \parinterval 为了综合批量梯度下降和随机梯度下降的优缺点，在实际应用中一般采用这两个算法的折中\ \dash \ 小批量梯度下降。其思想是：每次迭代计算一小部分训练数据的损失函数，并对参数进行更新。这一小部分数据被称为一个批次（mini-batch或者batch）。小批量梯度下降的参数优化的目标函数如下：
 \begin{eqnarray}
-J({\bm \theta})&=&\frac{1}{m}\sum_{i=j}^{j+m-1}{L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})}
+J({\bm \theta})&=&\frac{1}{m}\sum_{i=j}^{j+m-1}{L({\mathbi{x}}^{[i]},{\mathbi{y}}^{[i]};{\bm \theta})}
 \label{eq:9-32}
 \end{eqnarray}
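
The three objectives in Equations \ref{eq:9-30}--\ref{eq:9-32} differ only in how many samples contribute to each update. The following is a minimal Python sketch of that idea, not the book's implementation: the linear model, the squared loss, the learning rate and the helper names `grad` and `minibatch_gd` are all assumptions made for illustration. Setting `batch_size` to the data size gives batch gradient descent, and setting it to 1 gives SGD.

```python
import random

def grad(theta, batch):
    """Gradient of the mean of 0.5 * (theta . x - y)^2 over the batch."""
    g = [0.0] * len(theta)
    for x, y in batch:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for k, xi in enumerate(x):
            g[k] += err * xi / len(batch)
    return g

def minibatch_gd(data, batch_size, lr=0.5, epochs=200, seed=0):
    """Mini-batch gradient descent on a toy linear model (hypothetical helper)."""
    rng = random.Random(seed)
    data = list(data)                      # copy so the caller's list is not shuffled
    theta = [0.0] * len(data[0][0])
    for _ in range(epochs):
        rng.shuffle(data)                  # visit the samples in a new order each epoch
        for start in range(0, len(data), batch_size):
            g = grad(theta, data[start:start + batch_size])
            theta = [t - lr * gk for t, gk in zip(theta, g)]
    return theta

# Toy, noise-free data generated from a known parameter vector.
rng = random.Random(1)
true_theta = [2.0, -1.0]
data = []
for _ in range(64):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, sum(t * xi for t, xi in zip(true_theta, x))))

theta_batch = minibatch_gd(data, batch_size=len(data))  # batch gradient descent
theta_mini = minibatch_gd(data, batch_size=8)           # mini-batch gradient descent
theta_sgd = minibatch_gd(data, batch_size=1)            # stochastic gradient descent
```

On this noise-free toy problem all three variants recover `true_theta`; on real, noisy data the trade-off between update cost and gradient variance discussed in the text is what decides the batch size.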

@@ -1729,7 +1729,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \begin{spacing}{1.6}
 \begin{itemize}
 \vspace{0.5em}
-\item $ \frac{\partial L}{\partial {\mathbi{h}}^K} $表示损失函数$ L $相对网络输出$ {\mathbi{h}}^K $的梯度。比如，对于平方损失$ L=\frac{1}{2}{\Vert \widetilde {\mathbi{y}}-{\mathbi{h}}^K\Vert}^2 $，有$ \frac{\partial L}{\partial {\mathbi{h}}^K}= \widetilde{\mathbi{y}} -{\mathbi{h}}^K $。计算结束后，将$ \frac{\partial L}{\partial {\mathbi{h}}^K} $向前传递。
+\item $ \frac{\partial L}{\partial {\mathbi{h}}^K} $ denotes the gradient of the loss function $ L $ with respect to the network output $ {\mathbi{h}}^K $. For example, for the squared loss $ L=\frac{1}{2}{\Vert {\mathbi{y}}-{\mathbi{h}}^K\Vert}^2 $, we have $ \frac{\partial L}{\partial {\mathbi{h}}^K}= {\mathbi{h}}^K-{\mathbi{y}} $. Once computed, $ \frac{\partial L}{\partial {\mathbi{h}}^K} $ is passed back toward the preceding layers.
 \vspace{0.5em}
 \item $ \frac{\partial f^T({\mathbi{s}}^K)}{\partial {\mathbi{s}}^K} $表示激活函数相对于其输入$ {\mathbi{s}}^K $的梯度。比如，对于Sigmoid函数$ f({\mathbi{s}})=\frac{1}{1+{\textrm e}^{- {\mathbi{s}}}}$，有$ \frac{\partial f({\mathbi{s}})}{\partial {\mathbi{s}}}=f({\mathbi{s}}) (1-f({\mathbi{s}}))$
 \vspace{0.5em}
@@ -1899,11 +1899,11 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f

 \begin{itemize}
 \vspace{0.3em}
-\item 输入层（词的分布式表示层），即把输入的离散的单词变为分布式表示对应的实数向量；
+\item {\small\sffamily\bfseries{输入层}}（词的分布式表示层），即把输入的离散的单词变为分布式表示对应的实数向量；
 \vspace{0.3em}
-\item 隐藏层，即将得到的词的分布式表示进行线性和非线性变换；
+\item {\small\sffamily\bfseries{隐藏层}}，即将得到的词的分布式表示进行线性和非线性变换；
 \vspace{0.3em}
-\item 输出层（Softmax层），根据隐藏层的输出预测单词的概率分布。
+\item {\small\sffamily\bfseries{输出层}}（Softmax层），根据隐藏层的输出预测单词的概率分布。
 \vspace{0.3em}
 \end{itemize}

@@ -2076,11 +2076,11 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f

 \parinterval  为了方便理解，看一个简单的例子。假如现在有个“预测下一个单词”的任务：有这样一个句子“屋里 要 摆放 一个 \rule[-3pt]{1cm}{0.05em}”，其中下划线的部分表示需要预测的下一个单词。如果模型在训练数据中看到过类似于“摆放 一个 桌子”这样的片段，那么就可以很自信的预测出“桌子”。另一方面，很容易知道，实际上与“桌子”相近的单词，如“椅子”，也是可以预测的单词的。但是，“椅子”恰巧没有出现在训练数据中，这时如果用One-hot编码来表示单词，显然无法把“椅子”填到下划线处；而如果使用单词的分布式表示，很容易就知道 “桌子”与“椅子”是相似的，因此预测“ 椅子”在一定程度上也是合理的。
 \begin{example}
-屋里 要 摆放 一个 \_\_\_\_\_ \hspace{0.5em} \quad \quad 预测下个词
+屋里\ 要\ 摆放\ 一个 \_\_\_\_\_ \hspace{0.5em} \quad \quad 预测下个词

-\hspace{2em} 屋里 要 摆放 一个{ \red{桌子}} \hspace{3.2em}见过
+\hspace{2em} 屋里\ 要\ 摆放\ 一个\ { \red{桌子}} \hspace{3.2em}见过

-\hspace{2em} 屋里 要 摆放 一个{ \blue{椅子}} \hspace{3.2em}没见过，但是仍然是合理预测
+\hspace{2em} 屋里\ 要\ 摆放\ 一个\ { \blue{椅子}} \hspace{3.2em}没见过，但是仍然是合理预测
 \end{example}

 \parinterval  关于单词的分布式表示还有一个经典的例子：通过词嵌入可以得到如下关系：$\textrm{“国王”}=\textrm{“女王”}-\textrm{“女人”} +\textrm{“男人”}$。从这个例子可以看出，词嵌入也具有一些代数性质，比如，词的分布式表示可以通过加、减等代数运算相互转换。图\ref{fig:9-66}展示了词嵌入在一个二维平面上的投影，不难发现，含义相近的单词分布比较临近。
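
The algebraic property mentioned above can be illustrated with a small Python sketch. The two-dimensional vectors below are invented for illustration only (real embeddings are learned and have far more dimensions), the English keys stand in for the Chinese words in the text, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy embeddings: first coordinate ~ "royalty", second ~ "maleness".
emb = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, -0.9],
}

def cosine(u, v):
    """Cosine similarity between two 2-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def analogy(a, b, c):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# "queen" - "woman" + "man" lands on "king" for these toy vectors.
result = analogy("queen", "woman", "man")
```

With learned embeddings the match is only approximate, which is why the nearest neighbor under cosine similarity is used rather than exact equality.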
@@ -2116,9 +2116,9 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
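 \parinterval  Word embeddings have become standard equipment in many natural language processing systems and have spawned many interesting research directions. Looked at soberly, however, they still have problems: each word corresponds to a single vector, so for polysemous words, whose sense must be resolved from context, a plain word embedding cannot cope. A well-known example: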

 \begin{example}
-Aaron is an employee of {\red{\underline{apple}}}.
+Aaron is an employee of {\red{\underline{apple}}}\ .

-\hspace{2em} He finally ate the {\red{\underline{apple}}}.
+\hspace{2em} He finally ate the {\red{\underline{apple}}}\ .
 \end{example}

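 \parinterval  The meaning of ``apple'' clearly differs between these two sentences. In the first, the context word ``employee'' helps us determine that ``apple'' is a company name rather than a fruit. A word embedding, however, yields only one vector for the word, so it cannot distinguish the two cases. The example suggests that a word in a sentence should not be considered in isolation; its context should be taken into account as well. In other words, what is needed is a representation model that captures the contextual information in the sentence.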

--- a/ChapterAppend/chapterappend.tex
+++ b/ChapterAppend/chapterappend.tex
@@ -71,7 +71,7 @@
 \vspace{0.5em}
 \item GroundHog。GroundHog\upcite{bahdanau2014neural}基于Theano\upcite{al2016theano}框架，由蒙特利尔大学LISA 实验室使用Python语言编写的一个框架，旨在提供灵活而高效的方式来实现复杂的循环神经网络模型。它提供了包括LSTM在内的多种模型。Bahdanau等人在此框架上又编写了GroundHog神经机器翻译系统。该系统也作为了很多论文的基线系统。网址：\url{https://github.com/lisa-groundhog/GroundHog}
 \vspace{0.5em}
-\item Nematus。Nematus\upcite{DBLP:journals/corr/SennrichFCBHHJL17}是英国爱丁堡大学开发的，基于Theano框架的神经机器翻译系统。该系统使用GRU作为隐层单元，支持多层网络。Nematus 编码端有正向和反向的编码方式，可以同时提取源语句子中的上下文信息。该系统的一个优点是，它可以支持输入端有多个特征的输入（例如词的词性等）。网址：\url{https://github.com/EdinburghNLP/nematus}
+\item Nematus。Nematus\upcite{DBLP:journals/corr/SennrichFCBHHJL17}是英国爱丁堡大学开发的，基于Theano框架的神经机器翻译系统。该系统使用GRU作为隐层单元，支持多层网络。Nematus 编码端有正向和反向的编码方式，可以同时提取源语言句子中的上下文信息。该系统的一个优点是，它可以支持输入端有多个特征的输入（例如词的词性等）。网址：\url{https://github.com/EdinburghNLP/nematus}
 \vspace{0.5em}
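 \item ZophRNN. ZophRNN\upcite{zoph2016simple} is a system developed in C++ by Barret Zoph and colleagues at the University of Southern California. ZophRNN can train both sequence representation models (such as language models) and sequence-to-sequence models (such as neural machine translation models). When training a neural machine translation system, ZophRNN also supports multi-source input. URL: \url{https://github.com/isi-nlp/Zoph\_RNN}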
 \vspace{0.5em}
@@ -119,7 +119,7 @@

 \begin{itemize}
 \vspace{0.5em}
-\item CCMT（全国机器翻译大会），前身为CWMT（全国机器翻译研讨会）是国内机器翻译领域的旗舰会议，自2005年起已经组织多次机器翻译评测，对国内机器翻译相关技术的发展产生了深远影响。该评测主要针对汉语、英语以及国内的少数民族语言（蒙古语、藏语、维吾尔语等）进行评测，领域包括新闻、口语、政府文件等，不同语言方向对应的领域也有所不同。评价方式不同届略有不同，主要采用自动评价的方式，自CWMT\ 2013起则针对某些领域增设人工评价。自动评价的指标一般包括BLEU-SBP、BLEU-NIST、TER、METEOR、NIST、GTM、mWER、mPER 以及ICT 等，其中以BLEU-SBP 为主，汉语为目标语的翻译采用基于字符的评价方式，面向英语的翻译采用基于词的评价方式。每年该评测吸引国内外近数十家企业及科研机构参赛，业内认可度极高。关于CCMT的更多信息可参考中文信息学会机器翻译专业委员会相关页面：\url{http://sc.cipsc.org.cn/mt/index.php/CWMT.html}。
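+\item CCMT (China Conference on Machine Translation), formerly CWMT (China Workshop on Machine Translation), is the flagship machine translation conference in China. It has organized machine translation evaluations many times since 2005 and has had a profound influence on the development of machine translation technology in China. The evaluation mainly covers Chinese, English and China's minority languages (Mongolian, Tibetan, Uyghur, etc.), in domains including news, spoken language and government documents; the domains differ across language directions. The evaluation protocol varies slightly from year to year. It relies mainly on automatic evaluation, and since CWMT\ 2013 human evaluation has been added for some domains. The automatic metrics generally include BLEU-SBP, BLEU-NIST, TER, METEOR, NIST, GTM, mWER, mPER and ICT, with BLEU-SBP as the primary one. Translation into Chinese is evaluated at the character level, and translation into English at the word level. Each year the evaluation attracts dozens of companies and research institutions from China and abroad, and it is highly regarded in the field. More information about CCMT is available on the page of the Machine Translation Committee of the Chinese Information Processing Society of China: \url{http://sc.cipsc.org.cn/mt/index.php/CWMT.html}.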
 \vspace{0.5em}
 \item WMT由Special Interest Group for Machine Translation（SIGMT）主办，会议自2006年起每年召开一次，是一个涉及机器翻译多种任务的综合性会议，包括多领域翻译评测任务、质量评价任务以及其他与机器翻译的相关任务（如文档对齐评测等）。现在WMT已经成为机器翻译领域的旗舰评测会议，很多研究工作都以WMT评测结果作为基准。WMT评测涉及的语言范围较广，包括英语、德语、芬兰语、捷克语、罗马尼亚语等十多种语言，翻译方向一般以英语为核心，探索英语与其他语言之间的翻译性能，领域包括新闻、信息技术、生物医学。最近，也增加了无指导机器翻译等热门问题。WMT在评价方面类似于CCMT，也采用人工评价与自动评价相结合的方式，自动评价的指标一般为BLEU、TER 等。此外，WMT公开了所有评测数据，因此也经常被机器翻译相关人员所使用。更多WMT的机器翻译评测相关信息可参考SIGMT官网：\url{http://www.sigmt.org/}。
 \vspace{0.5em}
@@ -234,100 +234,102 @@

 \section{IBM模型2训练方法}

-IBM模型2与模型1的训练过程完全一样，本质上都是EM方法，因此可以直接复用{\chapterfive}中训练模型1的流程。对于句对$(\mathbf{s},\mathbf{t})$，$m=|\mathbf{s}|$，$l=|\mathbf{t}|$，E-Step的计算公式如下，其中参数$f(s_j|t_i)$与IBM模型1 一样：
+\parinterval IBM模型2与模型1的训练过程完全一样，本质上都是EM方法，因此可以直接复用{\chapterfive}中训练模型1的流程。对于源语言句子$\seq{s}=\{s_1,\dots,s_m\}$和目标语言句子$\seq{t}=\{t_1,\dots,t_l\}$，E-Step的计算公式如下：

 \begin{eqnarray}
-c(s_u|t_v;\mathbf{s},\mathbf{t}) &=&\sum\limits_{j=1}^{m} \sum\limits_{i=0}^{l} \frac{f(s_u|t_v)a(i|j,m,l) \delta(s_j,s_u)\delta (t_i,t_v) }   {\sum_{k=0}^{l} f(s_u|t_k)a(k|j,m,l)} \\
-c(i|j,m,l;\mathbf{s},\mathbf{t}) &=&\frac{f(s_j|t_i)a(i|j,m,l)}   {\sum_{k=0}^{l} f(s_j|t_k)a(k,j,m,l)}
+c(s_u|t_v;\seq{s},\seq{t}) &=&\sum\limits_{j=1}^{m} \sum\limits_{i=0}^{l} \frac{f(s_u|t_v)a(i|j,m,l) \delta(s_j,s_u)\delta (t_i,t_v) }   {\sum_{k=0}^{l} f(s_u|t_k)a(k|j,m,l)} \\
+c(i|j,m,l;\seq{s},\seq{t}) &=&\frac{f(s_j|t_i)a(i|j,m,l)}   {\sum_{k=0}^{l} f(s_j|t_k)a(k|j,m,l)}
 \label{eq:append-1}
 \end{eqnarray}

-\parinterval M-Step的计算公式如下，其中参数$a(i|j,m,l)$表示调序概率：
+\noindent M-Step的计算公式如下：

 \begin{eqnarray}
-f(s_u|t_v) &=&\frac{c(s_u|t_v;\mathbf{s},\mathbf{t}) }    {\sum_{s_u} c(s_u|t_v;\mathbf{s},\mathbf{t})} \\
-a(i|j,m,l) &=&\frac{c(i|j;\mathbf{s},\mathbf{t})}  {\sum_{i}c(i|j;\mathbf{s},\mathbf{t})}
+f(s_u|t_v) &=&\frac{c(s_u|t_v;\seq{s},\seq{t}) }    {\sum_{s'_u} c(s'_u|t_v;\seq{s},\seq{t})} \\
+a(i|j,m,l) &=&\frac{c(i|j,m,l;\seq{s},\seq{t})}  {\sum_{i'}c(i'|j,m,l;\seq{s},\seq{t})}
 \label{eq:append-2}
 \end{eqnarray}

-对于由$K$个样本组成的训练集$\{(\mathbf{s}^{[1]},\mathbf{t}^{[1]}),...,(\mathbf{s}^{[K]},\mathbf{t}^{[K]})\}$，可以将M-Step的计算调整为：
+\noindent 其中，$f(s_u|t_v)$与IBM模型1 一样表示目标语言单词$t_v$到源语言单词$s_u$的翻译概率，$a(i|j,m,l)$表示调序概率。
+
+\parinterval 对于由$K$个样本组成的训练集$\{(\seq{s}^{[1]},\seq{t}^{[1]}),...,(\seq{s}^{[K]},\seq{t}^{[K]})\}$，可以将M-Step的计算调整为：

 \begin{eqnarray}
-f(s_u|t_v) &=&\frac{\sum_{k=1}^{K}c_{\mathbb{E}}(s_u|t_v;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) }    {\sum_{s_u} \sum_{k=1}^{K} c_{\mathbb{E}}(s_u|t_v;\mathbf{s}^{[k]},\mathbf{t}^{[k]})} \\
-a(i|j,m,l) &=&\frac{\sum_{k=1}^{K}c_{\mathbb{E}}(i|j;\mathbf{s}^{[k]},\mathbf{t}^{[k]})}  {\sum_{i}\sum_{k=1}^{K}c_{\mathbb{E}}(i|j;\mathbf{s}^{[k]},\mathbf{t}^{[k]})}
+f(s_u|t_v) &=&\frac{\sum_{k=1}^{K}c(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) }    {\sum_{s'_u} \sum_{k=1}^{K} c(s'_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]})} \\
+a(i|j,m,l) &=&\frac{\sum_{k=1}^{K}c(i|j,m^{[k]},l^{[k]};\seq{s}^{[k]},\seq{t}^{[k]})}  {\sum_{i'}\sum_{k=1}^{K}c(i'|j,m^{[k]},l^{[k]};\seq{s}^{[k]},\seq{t}^{[k]})}
 \label{eq:append-3}
 \end{eqnarray}

+\noindent 其中，$m^{[k]}=|\seq{s}^{[k]}|$，$l^{[k]}=|\seq{t}^{[k]}|$。
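
The E-step and M-step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the book's code: the function names `init_params` and `em_step`, the `<null>` token standing for the empty word $t_0$, the uniform initialization and the three-sentence toy corpus are all hypothetical. The dictionary `f` holds $f(s_u|t_v)$ keyed by `(s_word, t_word)`, and `a` holds $a(i|j,m,l)$ keyed by `(i, j, m, l)`.

```python
from collections import defaultdict

NULL = "<null>"  # assumed token for target position 0

def init_params(corpus):
    """Uniform initial values for f(s|t) and a(i|j,m,l)."""
    src_vocab = {w for s, _ in corpus for w in s}
    f = defaultdict(lambda: 1.0 / len(src_vocab))
    a = {}
    for s, t in corpus:
        m, l = len(s), len(t)
        for j in range(1, m + 1):
            for i in range(l + 1):
                a[(i, j, m, l)] = 1.0 / (l + 1)
    return f, a

def em_step(corpus, f, a):
    """One EM iteration for IBM Model 2 over the whole training set."""
    c_f = defaultdict(float)   # expected counts c(s_u|t_v)
    c_a = defaultdict(float)   # expected counts c(i|j,m,l)
    for s, t in corpus:        # E-step
        t = [NULL] + t
        m, l = len(s), len(t) - 1
        for j in range(1, m + 1):
            sj = s[j - 1]
            z = sum(f[(sj, t[k])] * a[(k, j, m, l)] for k in range(l + 1))
            for i in range(l + 1):
                p = f[(sj, t[i])] * a[(i, j, m, l)] / z   # posterior of a_j = i
                c_f[(sj, t[i])] += p
                c_a[(i, j, m, l)] += p
    # M-step: normalize f over source words, a over target positions i.
    tot_f = defaultdict(float)
    for (sw, tw), c in c_f.items():
        tot_f[tw] += c
    tot_a = defaultdict(float)
    for (i, j, m, l), c in c_a.items():
        tot_a[(j, m, l)] += c
    new_f = {k: c / tot_f[k[1]] for k, c in c_f.items()}
    new_a = {k: c / tot_a[k[1:]] for k, c in c_a.items()}
    return new_f, new_a

# Toy corpus of (source sentence, target sentence) pairs.
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]
f, a = init_params(corpus)
for _ in range(10):
    f, a = em_step(corpus, f, a)
```

The sketch accumulates the expected counts of all sentence pairs before normalizing, which corresponds to the training-set version of the M-step in Equation \eqref{eq:append-3}.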
+
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

 \section{IBM模型3训练方法}
-\parinterval IBM模型3的参数估计与模型1和模型2采用相同的方法。这里直接给出辅助函数。
+\parinterval IBM模型3的参数估计与模型1和模型2采用相同的方法，辅助函数被定义如下：
 \begin{eqnarray}
-h(t,d,n,p, \lambda,\mu, \nu, \zeta) & = &  \funp{P}_{\theta}(\mathbf{s}|\mathbf{t})-\sum_{t}\lambda_{t}\big(\sum_{s}t(s|t)-1\big)  \nonumber \\
+h(t,d,n,p, \lambda,\mu, \nu, \zeta) & = &  \funp{P}_{\theta}(\seq{s}|\seq{t})-\sum_{t_v}\lambda_{t_v}\big(\sum_{s_u}t(s_u|t_v)-1\big)  \nonumber \\
 & & -\sum_{i}\mu_{iml}\big(\sum_{j}d(j|i,m,l)-1\big) \nonumber \\
-& & -\sum_{t}\nu_{t}\big(\sum_{\varphi}n(\varphi|t)-1\big)-\zeta(p^0+p^1-1)
+& & -\sum_{t_v}\nu_{t_v}\big(\sum_{\varphi}n(\varphi|t_v)-1\big)-\zeta(p_0+p_1-1)
 \label{eq:1.1}
 \end{eqnarray}

-\parinterval 由于篇幅所限这里略去了推导步骤直接给出具体公式。
-\begin{eqnarray}
-c(s|t,\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{j=1}^{m} (\delta(s_j,s) \cdot \delta(t_{a_{j}},t))\big] \label{eq:1.2} \\
-c(j|i,m,l;\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \delta(i,a_j)\big] \label{eq:1.3} \\
-c(\varphi|t;\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{i=1}^{l}\delta(\varphi,\varphi_{i})\delta(t,t_i)\big]
-\label{eq:1.4}
-\end{eqnarray}
-
+\parinterval 这里略去推导步骤，直接给出不同参数对应的期望频次计算公式，如下：
 \begin{eqnarray}
-c(0|\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t})  \times (m-2\varphi_0) \big] \label{eq:1.5} \\
-c(1|\mathbf{s},\mathbf{t}) & = & \sum_{\mathbf{a}}\big[\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \varphi_0 \big] \label{eq:1.6}
+c(s_u|t_v;\seq{s},\seq{t}) & = & \sum_{\seq{a}}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times \sum_{j=1}^{m} (\delta(s_j,s_u) \cdot \delta(t_{a_{j}},t_v))\big] \label{eq:1.2} \\
+c(j|i,m,l;\seq{s},\seq{t}) & = & \sum_{\seq{a}}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times \delta(i,a_j)\big] \label{eq:1.3} \\
+c(\varphi|t_v;\seq{s},\seq{t}) & = & \sum_{\seq{a}}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times \sum_{i=1}^{l}\delta(\varphi,\varphi_{i})\delta(t_v,t_i)\big] \label{eq:1.4} \\
+c(0|\seq{s},\seq{t}) & = & \sum_{\seq{a}}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t})  \times (m-2\varphi_0) \big] \label{eq:1.5} \\
+c(1|\seq{s},\seq{t}) & = & \sum_{\seq{a}}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times \varphi_0 \big] \label{eq:1.6}
 \end{eqnarray}

 \parinterval 进一步，对于由$K$个样本组成的训练集，有：
 \begin{eqnarray}
-t(s|t) & = & \lambda_{t}^{-1} \times \sum_{k=1}^{K}c(s|t;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.7} \\
-d(j|i,m,l) & = & \mu_{iml}^{-1} \times \sum_{k=1}^{K}c(j|i,m,l;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.8} \\
-n(\varphi|t) & = & \nu_{t}^{-1} \times \sum_{k=1}^{K}c(\varphi |t;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.9} \\
-p_x & = & \zeta^{-1} \sum_{k=1}^{K}c(x;\mathbf{s}^{[k]},\mathbf{t}^{[k]}) \label{eq:1.10}
+t(s_u|t_v) & = & \lambda_{t_v}^{-1} \times \sum_{k=1}^{K}c(s_u|t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \label{eq:1.7} \\
+d(j|i,m,l) & = & \mu_{iml}^{-1} \times \sum_{k=1}^{K}c(j|i,m,l;\seq{s}^{[k]},\seq{t}^{[k]}) \label{eq:1.8} \\
+n(\varphi|t_v) & = & \nu_{t_v}^{-1} \times \sum_{k=1}^{K}c(\varphi |t_v;\seq{s}^{[k]},\seq{t}^{[k]}) \label{eq:1.9} \\
+p_x & = & \zeta^{-1} \sum_{k=1}^{K}c(x;\seq{s}^{[k]},\seq{t}^{[k]}) \label{eq:1.10}
 \end{eqnarray}

-\parinterval 在模型3中，因为繁衍率的引入，并不能像模型1和模型2那样，在保证正确性的情况下加速参数估计的过程。这就使得每次迭代过程中，都不得不面对大小为$(l+1)^m$的词对齐空间。遍历所有$(l+1)^m$个词对齐所带来的高时间复杂度显然是不能被接受的。因此就要考虑能否仅利用词对齐空间中的部分词对齐对这些参数进行估计。比较简单的方法是仅使用Viterbi对齐来进行参数估计，这里Viterbi 词对齐可以被简单的看作搜索到的最好词对齐。遗憾的是，在模型3中并没有方法直接获得Viterbi对齐。这样只能采用一种折中的策略，即仅考虑那些使得$\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t})$ 达到较高值的词对齐。这里把这部分词对齐组成的集合记为$S$。式\ref{eq:1.2}可以被修改为：
+\parinterval 在模型3中，因为繁衍率的引入，并不能像模型1那样，通过简单的数学技巧加速参数估计的过程（见{\chapterfive}）。因此在计算公式\eqref{eq:1.2}-\eqref{eq:1.6}时，我们不得不面对大小为$(l+1)^m$的词对齐空间。遍历所有$(l+1)^m$个词对齐所带来的高时间复杂度显然是不能被接受的。因此就要考虑能否仅利用词对齐空间中的部分词对齐对这些参数进行估计。比较简单的方法是仅使用Viterbi对齐来进行参数估计，这里Viterbi 词对齐可以被简单的看作搜索到的最好词对齐。遗憾的是，在模型3中并没有方法直接获得Viterbi对齐。这样只能采用一种折中的策略，即仅考虑那些使得$\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t})$ 达到较高值的词对齐。这里把这部分词对齐组成的集合记为$S$。以公式\eqref{eq:1.2}为例，它可以被修改为：
 \begin{eqnarray}
-c(s|t,\mathbf{s},\mathbf{t}) &\approx & \sum_{\mathbf{a} \in S}\big[\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times \sum_{j=1}^{m}(\delta(s_j,\mathbf{s}) \cdot \delta(t_{a_{j}},\mathbf{t})) \big]
+c(s_u|t_v;\seq{s},\seq{t}) &\approx & \sum_{\seq{a} \in S}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times \sum_{j=1}^{m}(\delta(s_j,s_u) \cdot \delta(t_{a_{j}},t_v)) \big]
 \label{eq:1.11}
 \end{eqnarray}

-\parinterval 同理可以获得式\ref{eq:1.3}-\ref{eq:1.6}的修改结果。进一步，在IBM模型3中，可以定义$S$如下：
+\parinterval Equations \eqref{eq:1.3}-\eqref{eq:1.6} can be modified in the same way. Further, in IBM Model 3, $S$ can be defined as follows:
 \begin{eqnarray}
-S &=& N(b^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} N(b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))))
+S &=& N(b^{\infty}(V(\seq{s}|\seq{t};2))) \cup (\mathop{\cup}\limits_{ij} N(b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\seq{s}|\seq{t},2))))
 \label{eq:1.12}
 \end{eqnarray}

 \parinterval 为了理解这个公式，先介绍几个概念。
 \begin{itemize}
-\item $V(\mathbf{s}|\mathbf{t})$表示Viterbi词对齐，$V(\mathbf{s}|\mathbf{t},1)$、$V(\mathbf{s}|\mathbf{t},2)$和$V(\mathbf{s}|\mathbf{t},3)$就分别对应了模型1、2 和3 的Viterbi 词对齐；
-\item 把那些满足第$j$个源语言单词对应第$i$个目标语言单词（$a_j=i$）的词对齐构成的集合记为$\mathbf{A}_{i \leftrightarrow j}(\mathbf{s},\mathbf{t})$。通常称这些对齐中$j$和$i$被``钉''在了一起。在$\mathbf{A}_{i \leftrightarrow j}(\mathbf{s},\mathbf{t})$中使$\funp{P}(\mathbf{a}|\mathbf{s},\mathbf{t})$达到最大的那个词对齐被记为$V_{i \leftrightarrow j}(\mathbf{s},\mathbf{t})$；
-\item 如果两个词对齐，通过交换两个词对齐连接就能互相转化，则称它们为邻居。一个词对齐$\mathbf{a}$的所有邻居记为$N(\mathbf{a})$。
+\item $V(\seq{s}|\seq{t})$表示Viterbi词对齐，$V(\seq{s}|\seq{t},1)$、$V(\seq{s}|\seq{t},2)$和$V(\seq{s}|\seq{t},3)$就分别对应了模型1、2 和3 的Viterbi 词对齐；
+\item 把那些满足第$j$个源语言单词对应第$i$个目标语言单词（$a_j=i$）的词对齐构成的集合记为$\seq{a}_{i \leftrightarrow j}(\seq{s},\seq{t})$。通常称这些对齐中$j$和$i$被``钉''在了一起。在$\seq{a}_{i \leftrightarrow j}(\seq{s},\seq{t})$中使$\funp{P}(\seq{a}|\seq{s},\seq{t})$达到最大的那个词对齐被记为$V_{i \leftrightarrow j}(\seq{s}|\seq{t})$；
+\item 如果两个词对齐，通过交换两个词对齐连接就能互相转化，则称它们为邻居。一个词对齐$\seq{a}$的所有邻居记为$N(\seq{a})$。
 \end{itemize}

 \vspace{0.5em}
-\parinterval 公式\ref{eq:1.12}中，$b^{\infty}(V(\mathbf{s}|\mathbf{t};2))$ 和 $b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))$ 分别是对 $V(\mathbf{s}|\mathbf{t};3)$ 和 $V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},3)$ 的估计。在计算$S$的过程中，需要知道一个对齐$\bf{a}$的邻居$\bf{a}^{'}$的概率，即通过$\funp{P}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t})$计算$\funp{P}_{\theta}(\mathbf{a}',\mathbf{s}|\mathbf{t})$。在模型3中，如果$\bf{a}$和$\bf{a}'$仅区别于某个源语单词对齐到的目标位置上（$a_j \neq a_{j}'$），那么
+\parinterval In Equation \eqref{eq:1.12}, $b^{\infty}(V(\seq{s}|\seq{t};2))$ and $b_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\seq{s}|\seq{t},2))$ are estimates of $V(\seq{s}|\seq{t};3)$ and $V_{i \leftrightarrow j}(\seq{s}|\seq{t},3)$, respectively. When computing $S$, we need the probability of a neighbor $\seq{a}'$ of an alignment $\seq{a}$, that is, we compute $\funp{P}_{\theta}(\seq{a}',\seq{s}|\seq{t})$ from $\funp{P}_{\theta}(\seq{a},\seq{s}|\seq{t})$. In Model 3, if $\seq{a}$ and $\seq{a}'$ differ only in that the alignment of one source-language word $s_j$ changes from $a_j$ to $a_{j}'$, where neither $a_j$ nor $a'_j$ is zero, then, letting $a_j=i$ and $a'_{j}=i'$, we have

 \begin{eqnarray}
-\funp{P}_{\theta}(\mathbf{a}',\mathbf{s}|\mathbf{t}) & = & \funp{P}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t}) \cdot  \nonumber \\
+\funp{P}_{\theta}(\seq{a}',\seq{s}|\seq{t}) & = & \funp{P}_{\theta}(\seq{a},\seq{s}|\seq{t}) \cdot  \nonumber \\
                                                                                   &     & \frac{\varphi_{i'}+1}{\varphi_i} \cdot \frac{n(\varphi_{i'}+1|t_{i'})}{n(\varphi_{i'}|t_{i'})} \cdot \frac{n(\varphi_{i}-1|t_{i})}{n(\varphi_{i}|t_{i})} \cdot \nonumber \\
                                                                                   &     & \frac{t(s_j|t_{i'})}{t(s_{j}|t_{i})} \cdot \frac{d(j|i',m,l)}{d(j|i,m,l)}
 \label{eq:1.13}
 \end{eqnarray}

-\parinterval 如果$\bf{a}$和$\bf{a}'$区别于两个位置$j_1$和$j_2$的对齐上，$a_{j_{1}}=a_{j_{2}^{'}}$且$a_{j_{2}}=a_{j_{1}^{'}}$，那么
+\parinterval 如果$\seq{a}$和$\seq{a}'$区别于两个位置$j_1$和$j_2$的对齐，即$a_{j_{1}}=a'_{j_{2}}$且$a_{j_{2}}=a'_{j_{1}}$，那么
 \begin{eqnarray}
-\funp{P}_{\theta}(\mathbf{a'},\mathbf{s}|\mathbf{t}) &=& \funp{P}_{\theta}(\mathbf{a},\mathbf{s}|\mathbf{t}) \cdot \frac{t(s_{j_{2}}|t_{a_{j_{2}}})}{t(s_{j_{1}}|t_{a_{j_{1}}})} \cdot \frac{d(j_{2}|a_{j_{2}},m,l)}{d(j_{1}|a_{j_{1}},m,l)}
+\funp{P}_{\theta}(\seq{a}',\seq{s}|\seq{t}) &=& \funp{P}_{\theta}(\seq{a},\seq{s}|\seq{t}) \cdot \nonumber \\
+                                                                 &  & \frac{t(s_{j_{1}}|t_{a_{j_{2}}})}{t(s_{j_{1}}|t_{a_{j_{1}}})} \cdot \frac{t(s_{j_{2}}|t_{a_{j_{1}}})}{t(s_{j_{2}}|t_{a_{j_{2}}})} \cdot \nonumber \\
+                                                                 &  & \frac{d(j_{1}|a_{j_{2}},m,l)}{d(j_{1}|a_{j_{1}},m,l)} \cdot \frac{d(j_{2}|a_{j_{1}},m,l)}{d(j_{2}|a_{j_{2}},m,l)}
 \label{eq:1.14}
 \end{eqnarray}
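
The swap case in Equation \eqref{eq:1.14} can be checked numerically with a short Python sketch. Everything below is illustrative: the function names, the 1-based lists with a placeholder at index 0, and the random probability tables are assumptions, and both swapped positions are assumed to be aligned to non-empty target words. Because a swap leaves every fertility unchanged, only the translation and distortion factors differ between the two alignments.

```python
import math
import random

def swap_ratio(a, j1, j2, s, t, t_prob, d_prob):
    """P(a', s | t) / P(a, s | t) when a' swaps the alignments of j1 and j2."""
    m, l = len(s) - 1, len(t) - 1
    i1, i2 = a[j1], a[j2]
    lex = (t_prob[(s[j1], t[i2])] / t_prob[(s[j1], t[i1])]) \
        * (t_prob[(s[j2], t[i1])] / t_prob[(s[j2], t[i2])])
    dist = (d_prob[(j1, i2, m, l)] / d_prob[(j1, i1, m, l)]) \
         * (d_prob[(j2, i1, m, l)] / d_prob[(j2, i2, m, l)])
    return lex * dist

def lex_dist_score(a, s, t, t_prob, d_prob):
    """Product of the translation and distortion factors of an alignment."""
    m, l = len(s) - 1, len(t) - 1
    score = 1.0
    for j in range(1, m + 1):
        score *= t_prob[(s[j], t[a[j]])] * d_prob[(j, a[j], m, l)]
    return score

# Toy sentence pair; index 0 is a placeholder (source) or the empty word (target).
rng = random.Random(0)
s = [None, "s1", "s2", "s3"]
t = ["<null>", "t1", "t2"]
m, l = 3, 2
t_prob = {(sw, tw): rng.uniform(0.1, 1.0) for sw in s[1:] for tw in t[1:]}
d_prob = {(j, i, m, l): rng.uniform(0.1, 1.0)
          for j in range(1, m + 1) for i in range(1, l + 1)}

a = [None, 1, 2, 2]          # a_1 = 1, a_2 = 2, a_3 = 2
a_swapped = [None, 2, 1, 2]  # alignments of positions 1 and 2 exchanged
ratio = swap_ratio(a, 1, 2, s, t, t_prob, d_prob)
```

Computing the ratio touches only four translation and four distortion entries, which is what makes scoring the neighbors of an alignment cheap compared with rescoring each neighbor from scratch.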

-\parinterval 相比整个词对齐空间，$S$只是一个非常小的子集，因此运算复杂度可以被大大降低。可以看到，模型3的参数估计过程是建立在模型1和模型2的参数估计结果上的。这不仅是因为模型3要利用模型2的Viterbi对齐，而且还因为模型3参数的初值也要直接利用模型2的参数。从这个角度说，模型1，2，3是有序的且向前依赖的。单独的对模型3的参数进行估计是极其困难的。实际上IBM的模型4和模型5也具有这样的性质，即它们都可以利用前一个模型参数估计的结果作为自身参数的初始值。
+\parinterval 相比整个词对齐空间，$S$只是一个非常小的子集，因此计算时间可以被大大降低。可以看到，模型3的参数估计过程是建立在模型1和模型2的参数估计结果上的。这不仅是因为模型3要利用模型2的Viterbi对齐，而且还因为模型3参数的初值也要直接利用模型2的参数。从这个角度说，模型1、2、3是有序的且向前依赖的。单独的对模型3的参数进行估计是较为困难的。实际上IBM的模型4和模型5也具有这样的性质，即它们都可以利用前一个模型参数估计的结果作为自身参数的初始值。

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -335,17 +337,17 @@ S &=& N(b^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} 

 \section{IBM模型4训练方法}

-\parinterval 模型4的参数估计基本与模型3一致。需要修改的是扭曲度的估计公式，对于目标语第$i$个cept.生成的第一单词，可以得到（假设有$K$个训练样本）：
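+\parinterval Parameter estimation for Model 4 is essentially the same as for Model 3. What changes is the estimation formula for the distortion. For the first word generated by the $i$-th cept.\ of the target language, we obtain (assuming $K$ training samples):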
 \begin{eqnarray}
-d_1(\Delta_j|ca,cb) &=& \mu_{1cacb}^{-1} \times \sum_{k=1}^{K}c_1(\Delta_j|ca,cb;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
+d_1(\Delta_j|ca,cb) &=& \mu_{1cacb}^{-1} \times \sum_{k=1}^{K}c_1(\Delta_j|ca,cb;\seq{s}^{[k]},\seq{t}^{[k]})
 \label{eq:1.15}
 \end{eqnarray}

 其中，

 \begin{eqnarray}
-c_1(\Delta_j|ca,cb;\mathbf{s},\mathbf{t})           & = & \sum_{\mathbf{a}}\big[\funp{P}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_1(\Delta_j|ca,cb;\mathbf{a},\mathbf{s},\mathbf{t})\big] \label{eq:1.16} \\
-s_1(\Delta_j|ca,cb;\rm{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \big[\varepsilon(\varphi_i) \cdot \delta(\pi_{i1}-\odot _{i},\Delta_j) \cdot \nonumber \\
+c_1(\Delta_j|ca,cb;\seq{s},\seq{t})           & = & \sum_{\seq{a}}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times z_1(\Delta_j|ca,cb;\seq{a},\seq{s},\seq{t})\big] \label{eq:1.16} \\
+z_1(\Delta_j|ca,cb;\seq{a},\seq{s},\seq{t}) & = & \sum_{i=1}^l \big[\varepsilon(\varphi_i) \cdot \delta(\pi_{i1}-\odot _{i},\Delta_j) \cdot \nonumber \\
                                                                           &     & \delta(A(t_{i-1}),ca) \cdot \delta(B(\tau_{i1}),cb) \big] \label{eq:1.17}
 \end{eqnarray}

@@ -359,29 +361,31 @@ s_1(\Delta_j|ca,cb;\rm{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \big[\vareps
 \label{eq:1.21}
 \end{eqnarray}

-对于目标语第$i$个cept.生成的其他单词（非第一个单词），可以得到：
+对于目标语言的第$i$个cept.生成的其他单词（非第一个单词），可以得到：

 \begin{eqnarray}
-d_{>1}(\Delta_j|cb) &=& \mu_{>1cb}^{-1} \times \sum_{k=1}^{K}c_{>1}(\Delta_j|cb;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
+d_{>1}(\Delta_j|cb) &=& \mu_{>1cb}^{-1} \times \sum_{k=1}^{K}c_{>1}(\Delta_j|cb;\seq{s}^{[k]},\seq{t}^{[k]})
 \label{eq:1.18}
 \end{eqnarray}

 其中，

 \begin{eqnarray}
-c_{>1}(\Delta_j|cb;\mathbf{s},\mathbf{t})                  & = & \sum_{\mathbf{a}}\big[\textrm{p}_{\theta}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_{>1}(\Delta_j|cb;\mathbf{a},\mathbf{s},\mathbf{t}) \big] \label{eq:1.19} \\
-s_{>1}(\Delta_j|cb;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \big[\varepsilon(\varphi_i-1)\sum_{k=2}^{\varphi_i}\delta(\pi_{[i]k}-\pi_{[i]k-1},\Delta_j) \cdot \nonumber ß\\
+c_{>1}(\Delta_j|cb;\seq{s},\seq{t})                  & = & \sum_{\seq{a}}\big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times z_{>1}(\Delta_j|cb;\seq{a},\seq{s},\seq{t}) \big] \label{eq:1.19} \\
+z_{>1}(\Delta_j|cb;\seq{a},\seq{s},\seq{t}) & = & \sum_{i=1}^l \big[\varepsilon(\varphi_i-1)\sum_{k=2}^{\varphi_i}\delta(\pi_{[i]k}-\pi_{[i]k-1},\Delta_j) \cdot \nonumber \\
                                                                                  &    & \delta(B(\tau_{[i]k}),cb) \big] \label{eq:1.20}
 \end{eqnarray}

-\noindent 这里，$ca$和$cb$分别表示目标语言和源语言的某个词类。模型4需要像模型3一样，通过定义一个词对齐集合$S$，使得每次迭代都在$S$上进行，进而降低运算量。模型4中$S$的定义为：
+\noindent 这里，$ca$和$cb$分别表示目标语言和源语言的某个词类。注意，在公式\eqref{eq:1.17}和\eqref{eq:1.20}中，求和操作$\sum_{i=1}^l$是从$i=1$开始计算，而不是从$i=0$。这实际上跟IBM模型4的定义相关，因为$d_{1}(j-{\odot}_{i-1}|A(t_{[i-1]}),B(s_j))$和$d_{>1}(j-\pi_{[i]k-1}|B(s_j))$是从$[i]>0$开始定义的，详细信息可以参考{\chaptersix}的内容。
+
+\parinterval 模型4 需要像模型3 一样，通过定义一个词对齐集合$S$，使得每次训练迭代都在$S$ 上进行，进而降低运算量。模型4 中$S$的定义为：

 \begin{eqnarray}
-\textrm{S} &=& N(\tilde{b}^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} N(\tilde{b}_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))))
+S &=& N(\tilde{b}^{\infty}(V(\seq{s}|\seq{t};2))) \cup (\mathop{\cup}\limits_{ij} N(\tilde{b}_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\seq{s}|\seq{t},2))))
 \label{eq:1.22}
 \end{eqnarray}

-\parinterval 对于一个对齐$\mathbf{a}$，可用模型3对它的邻居进行排名，即按$\funp{P}_{\theta}(b(\mathbf{a})|\mathbf{s},\mathbf{t};3)$排序，其中$b(\mathbf{a})$表示$\mathbf{a}$的邻居。$\tilde{b}(\mathbf{a})$ 表示这个排名表中满足$\funp{P}_{\theta}(\mathbf{a}'|\mathbf{s},\mathbf{t};4) > \funp{P}_{\theta}⁡(\mathbf{a}|\mathbf{s},\mathbf{t};4)$的最高排名的$\mathbf{a}'$。 同理可知$\tilde{b}_{i \leftrightarrow j}^{\infty}(\mathbf{a})$ 的意义。这里之所以不用模型3中采用的方法直接利用$b^{\infty}(\mathbf{a})$得到模型4中高概率的对齐，是因为模型4中要想获得某个对齐$\mathbf{a}$的邻居$\mathbf{a}'$必须做很大调整，比如：调整$\tau_{[i]1}$和$\odot_{i}$等等。这个过程要比模型3的相应过程复杂得多。因此在模型4中只能借助于模型3的中间步骤来进行参数估计。
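+\parinterval For an alignment $\seq{a}$, Model 3 can be used to rank its neighbors, i.e., to sort them by $\funp{P}_{\theta}(b(\seq{a})|\seq{s},\seq{t};3)$, where $b(\seq{a})$ denotes a neighbor of $\seq{a}$. $\tilde{b}(\seq{a})$ denotes the highest-ranked $\seq{a}'$ in this list that satisfies $\funp{P}_{\theta}(\seq{a}'|\seq{s},\seq{t};4) > \funp{P}_{\theta}(\seq{a}|\seq{s},\seq{t};4)$. The meaning of $\tilde{b}_{i \leftrightarrow j}^{\infty}(\seq{a})$ follows analogously. The reason for not using $b^{\infty}(\seq{a})$ directly, as in Model 3, to obtain high-probability alignments for Model 4 is that in Model 4 obtaining a neighbor $\seq{a}'$ of an alignment $\seq{a}$ requires substantial adjustments, for example to $\tau_{[i]1}$ and $\odot_{i}$. This process is much more complicated than the corresponding one in Model 3. Therefore, parameter estimation in Model 4 has to rely on the intermediate steps of Model 3.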
 \setlength{\belowdisplayskip}{3pt}%调整空白大小

 %----------------------------------------------------------------------------------------
@@ -389,54 +393,54 @@ s_{>1}(\Delta_j|cb;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \big[\va
 %----------------------------------------------------------------------------------------

 \section{IBM模型5训练方法}
-\parinterval 模型5的参数估计过程也模型4的过程基本一致，二者的区别在于扭曲度的估计公式。在模型5中，对于目标语第$i$个cept.生成的第一单词，可以得到（假设有$K$个训练样本）：
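+\parinterval Parameter estimation for Model 5 is also essentially the same as for Model 4; the two differ in the estimation formulas for the distortion. In Model 5, for the first word generated by the $i$-th cept.\ of the target language, we obtain (assuming $K$ training samples):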

 \begin{eqnarray}
-d_1(\Delta_j|cb) &=& \mu_{1cb}^{-1} \times \sum_{k=1}^{K}c_1(\Delta_j|cb;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
+d_1(\Delta_j|cb) &=& \mu_{1cb}^{-1} \times \sum_{k=1}^{K}c_1(\Delta_j|cb;\seq{s}^{[k]},\seq{t}^{[k]})
 \label{eq:1.23}
 \end{eqnarray}

 其中，

 \begin{eqnarray}
-c_1(\Delta_j|cb,v_x,v_y;\mathbf{s},\mathbf{t})                   & = & \sum_{\mathbf{a}}\Big[ \funp{P}(\mathbf{s},\mathbf{a}|\mathbf{t}) \times s_1(\Delta_j|cb,v_x,v_y;\mathbf{a},\mathbf{s},\mathbf{t}) \Big] \label{eq:1.24} \\
-s_1(\Delta_j|cb,v_x,v_y;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l \Big [ \varepsilon(\varphi_i) \cdot \delta(v_{\pi_{i1}},\Delta_j) \cdot \delta(v_{\odot _{i-1}},v_x) \nonumber \\
+c_1(\Delta_j|cb,v_x,v_y;\seq{s},\seq{t})                   & = & \sum_{\seq{a}}\Big[ \funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times z_1(\Delta_j|cb,v_x,v_y;\seq{a},\seq{s},\seq{t}) \Big] \label{eq:1.24} \\
+z_1(\Delta_j|cb,v_x,v_y;\seq{a},\seq{s},\seq{t}) & = & \sum_{i=1}^l \Big [ \varepsilon(\varphi_i) \cdot \delta(v_{\pi_{i1}},\Delta_j) \cdot \delta(v_{\odot _{i-1}},v_x) \nonumber \\
                                                                                          &    & \cdot \delta(v_m-\varphi_i+1,v_y) \cdot \delta(v_{\pi_{i1}},v_{\pi_{i1}-1} )\Big] \label{eq:1.25}
 \end{eqnarray}


-对于目标语第$i$个cept.生成的其他单词（非第一个单词），可以得到：
+对于目标语言的第$i$个cept.生成的其他单词（非第一个单词），可以得到：

 \begin{eqnarray}
-d_{>1}(\Delta_j|cb,v) &=& \mu_{>1cb}^{-1} \times \sum_{k=1}^{K}c_{>1}(\Delta_j|cb,v;\mathbf{s}^{[k]},\mathbf{t}^{[k]})
+d_{>1}(\Delta_j|cb,v) &=& \mu_{>1cb}^{-1} \times \sum_{k=1}^{K}c_{>1}(\Delta_j|cb,v;\seq{s}^{[k]},\seq{t}^{[k]})
 \label{eq:1.26}
 \end{eqnarray}

 其中，

 \begin{eqnarray}
-c_{>1}(\Delta_j|cb,v;\mathbf{s},\mathbf{t})                   & =  & \sum_{\mathbf{a}}\Big[\funp{P}(\mathbf{a},\mathbf{s}|\mathbf{t}) \times s_{>1}(\Delta_j|cb,v;\mathbf{a},\mathbf{s},\mathbf{t}) \Big] \label{eq:1.27} \\
-s_{>1}(\Delta_j|cb,v;\mathbf{a},\mathbf{s},\mathbf{t}) & = & \sum_{i=1}^l\Big[\varepsilon(\varphi_i-1)\sum_{k=2}^{\varphi_i} \big[\delta(v_{\pi_{ik}}-v_{\pi_{[i]k}-1},\Delta_j)  \nonumber \\
+c_{>1}(\Delta_j|cb,v;\seq{s},\seq{t})                   & =  & \sum_{\seq{a}}\Big[\funp{P}_{\theta}(\seq{s},\seq{a}|\seq{t}) \times z_{>1}(\Delta_j|cb,v;\seq{a},\seq{s},\seq{t}) \Big] \label{eq:1.27} \\
+z_{>1}(\Delta_j|cb,v;\seq{a},\seq{s},\seq{t}) & = & \sum_{i=1}^l\Big[\varepsilon(\varphi_i-1)\sum_{k=2}^{\varphi_i} \big[\delta(v_{\pi_{ik}}-v_{\pi_{[i]k}-1},\Delta_j)  \nonumber \\
                                                                                    &     & \cdot \delta(B(\tau_{[i]k}) ,cb) \cdot \delta(v_m-v_{\pi_{i(k-1)}}-\varphi_i+k,v) \nonumber \\
                                                                                    &     & \cdot \delta(v_{\pi_{i1}},v_{\pi_{i1}-1}) \big] \Big] \label{eq:1.28}
 \end{eqnarray}

 \vspace{0.5em}

-\parinterval 从式\ref{eq:1.24}中可以看出因子$\delta(v_{\pi_{i1}},v_{\pi_{i1}-1})$保证了，即使对齐$\mathbf{a}$不合理（一个源语言位置对应多个目标语言位置）也可以避免在这个不合理的对齐上计算结果。需要注意的是因子$\delta(v_{\pi_{p1}},v_{\pi_{p1-1}})$，确保了$\mathbf{a}$中不合理的部分不产生坏的影响，而$\mathbf{a}$中其他正确的部分仍会参与迭代。
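+\parinterval As Equation \eqref{eq:1.25} shows, the factor $\delta(v_{\pi_{i1}},v_{\pi_{i1}-1})$ ensures that, even if the alignment $\seq{a}$ is not well-formed (one source-language position corresponding to multiple target-language positions), nothing is counted on the ill-formed part. In other words, this factor keeps the ill-formed part of $\seq{a}$ from having a harmful effect, while the remaining valid parts of $\seq{a}$ still take part in the iteration.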

-\parinterval 不过上面的参数估计过程与IBM前4个模型的参数估计过程并不完全一样。IBM前4个模型在每次迭代中，可以在给定$\mathbf{s}$、$\mathbf{t}$和一个对齐$\mathbf{a}$的情况下直接计算并更新参数。但是在模型5的参数估计过程中（如公式\ref{eq:1.24}），需要模拟出由$\mathbf{t}$生成$\mathbf{s}$的过程才能得到正确的结果，因为从$\mathbf{t}$、$\mathbf{s}$和$\mathbf{a}$中是不能直接得到 的正确结果的。具体说，就是要从目标语言句子的第一个单词开始到最后一个单词结束，依次生成每个目标语言单词对应的源语言单词，每处理完一个目标语言单词就要暂停，然后才能计算式\ref{eq:1.24}中求和符号里面的内容。这也就是说即使给定了$\mathbf{s}$、$\mathbf{t}$和一个对齐$\mathbf{a}$，也不能直接在它们上进行计算，必须重新模拟$\mathbf{t}$到$\mathbf{s}$的生成过程。
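+\parinterval However, the parameter estimation procedure above is not quite the same as that of the first four IBM models. In each iteration, the first four models can compute and update their parameters directly, given $\seq{s}$, $\seq{t}$ and an alignment $\seq{a}$. In Model 5 (e.g., Equation \eqref{eq:1.24}), by contrast, the process by which $\seq{t}$ generates $\seq{s}$ must be simulated to obtain the correct result, because the correct result cannot be read off directly from $\seq{t}$, $\seq{s}$ and $\seq{a}$. Concretely, starting from the first word of the target-language sentence and ending with the last, the source-language words corresponding to each target-language word are generated in turn; after each target-language word is processed the procedure must pause, and only then can the content inside the summation in Equation \eqref{eq:1.24} be computed.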

-\parinterval 从前面的分析可以看出，虽然模型5比模型4更精确，但是模型5过于复杂以至于给参数估计增加了计算量（对于每组$\mathbf{t}$、$\mathbf{s}$和$\mathbf{a}$都要模拟$\mathbf{t}$生成$\mathbf{s}$的翻译过程）。因此模型5的系统实现是一个挑战。
+\parinterval 从前面的分析可以看出，虽然模型5比模型4更精确，但是模型5过于复杂以至于给参数估计增加了计算量（对于每组$\seq{t}$、$\seq{s}$和$\seq{a}$都要模拟$\seq{t}$生成$\seq{s}$的翻译过程）。因此模型5的系统实现是一个挑战。

 \parinterval 在模型5中同样需要定义一个词对齐集合$S$，使得每次迭代都在$S$上进行。可以对$S$进行如下定义
 \begin{eqnarray}
-\textrm{S} &=& N(\tilde{\tilde{b}}^{\infty}(V(\mathbf{s}|\mathbf{t};2))) \cup (\mathop{\cup}\limits_{ij} N(\tilde{\tilde{b}}_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\mathbf{s}|\mathbf{t},2))))
+S &=& N(\tilde{\tilde{b}}^{\infty}(V(\seq{s}|\seq{t};2))) \cup (\mathop{\cup}\limits_{ij} N(\tilde{\tilde{b}}_{i \leftrightarrow j}^{\infty}(V_{i \leftrightarrow j}(\seq{s}|\seq{t},2))))
 \label{eq:1.29}
 \end{eqnarray}
 \vspace{0.5em}

-\noindent 其中，$\tilde{\tilde{b}}(\mathbf{a})$借用了模型4中$\tilde{b}(\mathbf{a})$的概念。不过$\tilde{\tilde{b}}(\mathbf{a})$表示在利用模型3进行排名的列表中满足$\funp{P}_{\theta}(\mathbf{a}'|\mathbf{s},\mathbf{t};5)$的最高排名的词对齐，这里$\mathbf{a}'$表示$\mathbf{a}$的邻居。
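+\noindent where $\tilde{\tilde{b}}(\seq{a})$ borrows the notion of $\tilde{b}(\seq{a})$ from Model 4. Here, however, $\tilde{\tilde{b}}(\seq{a})$ denotes the highest-ranked word alignment $\seq{a}'$, in the list ranked by Model 3, that satisfies $\funp{P}_{\theta}(\seq{a}'|\seq{s},\seq{t};5) > \funp{P}_{\theta}(\seq{a}|\seq{s},\seq{t};5)$, where $\seq{a}'$ is a neighbor of $\seq{a}$.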
 \end{appendices}