Merge branch 'mengxia' into 'caorunzhe'

Mengxia See merge request !742

Commit a3cd04b7 authored Dec 29, 2020 by 孟霞
--- a/Chapter14/Figures/figure-reproduction-rate2.tex
+++ b/Chapter14/Figures/figure-reproduction-rate2.tex
-\definecolor{taupegray}{rgb}{0.55, 0.52, 0.54}
-\definecolor{bananamania}{rgb}{0.98, 0.91, 0.71}
-%%% outline
-%-------------------------------------------------------------------------
-\begin{tikzpicture}
-\tikzstyle{every node}=[scale=0.7]
-	\tikzstyle{layer} = [draw=black!70,thick, minimum width=7.5em,rounded corners=2pt,inner ysep=6pt,font=\footnotesize,align=center]
-	\tikzstyle{line} = [line width=1pt,->]
-	\tikzstyle{cir} = [draw,circle,minimum size=1em, thick,inner sep=0pt]
-	
-	%encoder
-	\node[layer,fill=red!15] (src_emb) at (0,0){\scriptsize\textbf{Input Embedding}};
-	\node[anchor=south,layer,fill=yellow!20] (src_sa) at ([yshift=2.8em]src_emb.north){\scriptsize\textbf{Self-Attention}};
-	\node[anchor=south,layer,fill=gray!20] (src_norm1) at ([yshift=0.6em]src_sa.north){\scriptsize\textbf{Add \& LayerNorm}};
-	\node[anchor=south,layer,fill=orange!20] (src_ff) at ([yshift=0.6em]src_norm1.north){\scriptsize\textbf{Feed Forward}\\  \scriptsize\textbf{Network}};
-	\node[anchor=south,layer,fill=gray!20] (src_norm2) at ([yshift=0.6em]src_ff.north){\scriptsize\textbf{Add \& LayerNorm}};
-	\node[anchor=south,layer,fill=blue!20] (src_sf) at ([yshift=1.6em]src_norm2.north){\scriptsize\textbf{Softmax}};
-	
-	%decoder
-	\node[anchor=west,layer,fill=red!15] (tgt_emb) at ([xshift=4.4em]src_emb.east){\scriptsize\textbf{Output Embedding}};
-	\node[anchor=south,layer,fill=yellow!20] (tgt_sa) at ([yshift=2.8em]tgt_emb.north){\scriptsize\textbf{Self-Attention}};
-	\node[anchor=south,layer,fill=yellow!20] (tgt_pa) at ([yshift=1.4em]tgt_sa.north){\scriptsize\textbf{Positional Attention}};
-	\node[anchor=south,layer,fill=gray!20] (tgt_norm1) at ([yshift=0.6em]tgt_pa.north){\scriptsize\textbf{Add \& LayerNorm}};
-	\node[anchor=south,layer,fill=yellow!20] (tgt_eda) at ([yshift=1.4em]tgt_norm1.north){\scriptsize\textbf{Encoder-Decoder} \\  \scriptsize\textbf{Attention}};
-	\node[anchor=south,layer,fill=gray!20] (tgt_norm2) at ([yshift=0.6em]tgt_eda.north){\scriptsize\textbf{Add \& LayerNorm}};
-	\node[anchor=south,layer,fill=orange!20] (tgt_ff) at ([yshift=0.6em]tgt_norm2.north){\scriptsize\textbf{Feed Forward}\\  \scriptsize\textbf{Network}};
-	\node[anchor=south,layer,fill=gray!20] (tgt_norm3) at ([yshift=0.6em]tgt_ff.north){\scriptsize\textbf{Add \& LayerNorm}};
-	\node[anchor=south,layer,fill=green!20] (tgt_linear) at ([yshift=1.1em]tgt_norm3.north){\scriptsize\textbf{Linear}};
-	\node[anchor=south,layer,fill=blue!20] (tgt_sf) at ([yshift=0.6em]tgt_linear.north){\scriptsize\textbf{Softmax}};
-	
-	\node[font=\footnotesize,anchor=south] (w3) at ([yshift=0.8em]src_sf.north){\scriptsize\textbf{2}};
-	\node[font=\footnotesize,anchor=east] (w2) at ([xshift=-0.5em]w3.west){\scriptsize\textbf{1}};
-	\node[font=\footnotesize,anchor=east] (w1) at ([xshift=-0.5em]w2.west){\scriptsize\textbf{1}};
-	\node[font=\footnotesize,anchor=west] (w4) at ([xshift=0.5em]w3.east){\scriptsize\textbf{0}};
-	\node[font=\footnotesize,anchor=west] (w5) at ([xshift=0.5em]w4.east){\scriptsize\textbf{1}};
-	\node[font=\footnotesize,anchor=south] (output) at ([yshift=0.8em]tgt_sf.north){\scriptsize\sffamily\bfseries{我们\quad 完全\quad 接受\quad 它\quad 。}};
-	\node[font=\footnotesize,anchor=north] (src) at ([yshift=-0.8em]src_emb.south){\scriptsize\textbf{We totally accept it .}};
-	\node[font=\footnotesize,anchor=north] (tgt) at ([yshift=-0.8em]tgt_emb.south){\scriptsize\textbf{We totally accept accept .}};
-	
-	\node[cir] (src_add) at (0,1.6em) {};
-	\node[cir,fill=orange!7] (src_pos) at (-2.5em,1.6em) {};
-	
-	\node[cir] (tgt_add) at (9.7em,1.6em) {};
-	\node[cir,fill=orange!7] (tgt_pos) at (12.2em,1.6em) {};
-	
-	\node[cir,fill=orange!7] (tgt_pos2) at ([xshift=3em,yshift=-1.74em]tgt_pa.north) {};
-	\draw[line] (tgt_pos2.180) -- ([yshift=-0.6em]tgt_pa.south) -- (tgt_pa.south);
-	\draw[line] (tgt_pos2.180) -- ([xshift=1.8em,yshift=-0.6em]tgt_pa.south) -- ([xshift=1.8em]tgt_pa.south);
-	
-	\draw[-,thick] (src_add.90) -- (src_add.-90);
-	\draw[-,thick] (src_add.0) -- (src_add.180);
-	\draw[-,thick,] (src_pos.180) .. controls ([xshift=0.8em,yshift=0.8em]src_pos.180) and ([xshift=-0.8em,yshift=-0.8em]src_pos.0) ..(src_pos.0);
-	\draw[-,thick] (tgt_add.90) -- (tgt_add.-90);
-	\draw[-,thick] (tgt_add.0) -- (tgt_add.180);
-	\draw[-,thick,] (tgt_pos.180) .. controls ([xshift=0.8em,yshift=0.8em]tgt_pos.180) and ([xshift=-0.8em,yshift=-0.8em]tgt_pos.0) ..(tgt_pos.0);
-	\draw[-,thick,] (tgt_pos2.180) .. controls ([xshift=0.8em,yshift=0.8em]tgt_pos2.180) and ([xshift=-0.8em,yshift=-0.8em]tgt_pos2.0) ..(tgt_pos2.0);
-	
-	\draw[line] (src_emb.north) -- (src_add.south);
-	\draw[line] (src_add.north) -- (src_sa.south);
-	\draw[line] (src_sa.north) -- (src_norm1.south);
-	\draw[line] (src_norm1.north) -- (src_ff.south);
-	\draw[line] (src_ff.north) -- (src_norm2.south);
-	\draw[line] (src_norm2.north) -- (src_sf.south);
-	\draw[line] (tgt_emb.north) -- (tgt_add.south);
-	\draw[line] (tgt_add.north) -- (tgt_sa.south);
-	\draw[line] (tgt_sa.north) -- ([yshift=0.5em]tgt_sa.north) -- ([xshift=-1.8em,yshift=0.5em]tgt_sa.north)--([xshift=-1.8em]tgt_pa.south);
-	\draw[line] (tgt_pa.north) -- (tgt_norm1.south);
-	\draw[line] (tgt_eda.north) -- (tgt_norm2.south);
-	\draw[line] (tgt_norm2.north) -- (tgt_ff.south);
-	\draw[line] (tgt_ff.north) -- (tgt_norm3.south);
-	\draw[line] (tgt_norm3.north) -- (tgt_linear.south);
-	
-	\draw[line] (src_pos.0) -- (src_add.180);
-	\draw[line] (tgt_pos.180) -- (tgt_add.0);
-	\draw[line] (src_sf.north) -- (w3.south);
-	\draw[line] (tgt_sf.north) -- (output.south);
-	\draw[line] (src.north) -- (src_emb.south);
-	
-	\draw[line,<->,out=-25,in=-155] ([xshift=-2em]src_sa.south) to ([xshift=2em]src_sa.south);
-	\draw[line] (src_norm2.north) -- ([yshift=0.5em]src_norm2.north) -- ([xshift=4em,yshift=0.5em]src_norm2.north) -- ([xshift=4em,yshift=-0.95em]src_norm2.north) -- ([xshift=-1.8em,yshift=-0.6em]tgt_eda.south) -- ([xshift=-1.8em]tgt_eda.south);
-	\draw[line] (src_norm2.north) -- ([yshift=0.5em]src_norm2.north) -- ([xshift=4em,yshift=0.5em]src_norm2.north) -- ([xshift=4em,yshift=-0.95em]src_norm2.north)--  ([yshift=-0.6em]tgt_eda.south) -- (tgt_eda.south);
-	\draw[line,] (tgt_norm1.north) -- ([yshift=0.5em]tgt_norm1.north) -- ([yshift=0.5em,xshift=1.8em]tgt_norm1.north) -- ([xshift=1.8em]tgt_eda.south);
-	\draw[line,<->,out=-25,in=-155] ([xshift=-2em]tgt_sa.south) to ([xshift=2em]tgt_sa.south);
-	
-\begin{pgfonlayer}{background}
-{
-\node[draw=taupegray,thick,fill=ugreen!10,inner sep=0pt,minimum height=13em,minimum width=9.5em,rounded corners=4pt,drop shadow] (box1) at (0em,7em){};
-\node[draw=taupegray,thick,fill=yellow!10,inner sep=0pt,minimum height=4.7em,minimum width=9.5em,rounded corners=4pt,drop shadow] (box2) at (0em,13.6em){};
-\node[draw=taupegray,thick,fill=blue!7,inner sep=0pt,minimum height=23.6em,minimum width=10.5em,rounded corners=4pt,drop shadow] (box3) at (9.7em,10.7em){};
-}
-\end{pgfonlayer}
-
-     \node[] at ([yshift=1.5em]box2.north){\normalsize{译文长度：5}};
-     \node[] at ([xshift=-2em,yshift=0.5em]box2.west){\normalsize{繁衍率}};
-     \node[] at ([xshift=-2em,yshift=-0.5em]box2.west){\normalsize{预测器}};
-     \node[] at ([xshift=-2em]box1.west){\normalsize{编码器}};
-	 \node[] at ([xshift=-1em,yshift=-3.8em]box1.west){{$M \times$}};
-     \node[] at ([xshift=2em]box3.east){\normalsize{解码器}};
-	 \node[] at ([xshift=1em,yshift=-7.5em]box3.east){{$\times N$}};
-	 \draw[line,dotted,violet] (box2.north) -- ([yshift=1em]box2.north) -- ([yshift=1em,xshift=4.7em]box2.north) -- ([xshift=-2.4em]tgt_emb.west) -- (tgt_emb.west);
-	 \draw[line,-,dotted,violet,] (src_emb.east) -- ([xshift=-2em]tgt_emb.west);
-
-\end{tikzpicture}
-
-
-
-
--- a/Chapter17/Figures/figure-cache.tex
+++ b/Chapter17/Figures/figure-cache.tex

 \begin{tikzpicture}
+%\tikzstyle{every node}=[scale=0.8]
 	\tikzstyle{prob}=[minimum width=0.4em, fill=blue!15,inner sep=0pt]
-\node[draw,fill=red!20,inner sep=0pt,minimum width=3em,minimum height=5em](key) at (0,0){};
+\node[draw,fill=yellow!15,inner sep=0pt,minimum width=3em,minimum height=5em](key) at (0,0){};
 \draw[] ([yshift=0.5em]key.180) -- ([yshift=0.5em]key.0);
 \draw[] ([yshift=1.5em]key.180) -- ([yshift=1.5em]key.0);
 \draw[] ([yshift=-0.5em]key.180) -- ([yshift=-0.5em]key.0);
 \draw[] ([yshift=-1.5em]key.180) -- ([yshift=-1.5em]key.0);
-\node[draw,fill=blue!20,inner sep=0pt,minimum width=3em,minimum height=5em](value) at (3em,0){};
+\node[draw,fill=ugreen!15,inner sep=0pt,minimum width=3em,minimum height=5em](value) at (3em,0){};
 \draw[] ([yshift=0.5em]value.180) -- ([yshift=0.5em]value.0);
 \draw[] ([yshift=1.5em]value.180) -- ([yshift=1.5em]value.0);
 \draw[] ([yshift=-0.5em]value.180) -- ([yshift=-0.5em]value.0);
@@ -16,36 +17,42 @@
 \node[anchor=south,font=\footnotesize,inner sep=0pt] at ([yshift=0.2em]value.north){value};
 \node[anchor=south,font=\footnotesize,inner sep=0pt] (cache)at ([yshift=2em,xshift=1.5em]key.north){\small\bfnew{Cache}};

-\node[draw,anchor=east,minimum size=2.4em] (dt) at ([yshift=1.4em,xshift=-4em]key.west){$\mathbi{d}_\mathbi{t}$};
-\node[draw,anchor=east,minimum size=2.4em] (st) at ([xshift=-4em]dt.west){$\mathbi{s}_\mathbi{t}$};
-\node[draw,anchor=east,minimum size=2.4em] (st2) at ([xshift=-0.8em,yshift=4em]dt.west){$ \widetilde{\mathbi{s}}_\mathbi{t}$};
+\node[draw,anchor=east,minimum size=1.8em,fill=orange!15] (dt) at ([yshift=2.1em,xshift=-4em]key.west){${\mathbi{d}}_{t}$};
+\node[anchor=north,font=\footnotesize] (readlab) at ([xshift=2.8em,yshift=0.3em]dt.north){\red{reading}};
+\node[draw,anchor=east,minimum size=1.8em,fill=ugreen!15] (st) at ([xshift=-3.7em]dt.west){${\mathbi{s}}_{t}$};
+\node[draw,anchor=east,minimum size=1.8em,fill=red!15] (st2) at ([xshift=-0.85em,yshift=3.5em]dt.west){$ \widetilde{\mathbi{s}}_{t}$};

-\node[draw,anchor=north,circle,inner sep=0pt, minimum size=1.2em,fill=yellow] (add) at ([yshift=-1em]st2.south){+};
+%\node[draw,anchor=north,circle,inner sep=0pt, minimum size=1.2em,fill=yellow] (add) at ([yshift=-1em]st2.south){+};
+\node[draw,thick,inner sep=0pt, minimum size=1.1em, circle] (add) at ([yshift=-1.5em]st2.south){};
+\draw[-,thick] (add.0) -- (add.180);
+\draw[-,thick] (add.90) -- (add.-90);

-\node[anchor=north,inner sep=0pt,font=\footnotesize,text=red] at ([yshift=-1em]add.south){combining};
+\node[anchor=north,inner sep=0pt,font=\footnotesize,text=red] at ([xshift=-0.08em,yshift=-1em]add.south){combining};

-\node[draw,anchor=east,minimum size=2.4em] (ct) at ([xshift=-3em,yshift=-3.5em]st.west){$ \widetilde{\mathbi{C}}_\mathbi{t}$};
+\node[draw,anchor=east,minimum size=1.8em,fill=yellow!15] (ct) at ([xshift=-2em,yshift=-3.5em]st.west){$ {\mathbi{C}}_{t}$};
+\node[anchor=north,font=\footnotesize] (matchlab) at ([xshift=6.7em,yshift=-0.1em]ct.north){\red{matching}};

-\node[anchor=east] (y) at ([xshift=-6em,yshift=1em]st.west){$\mathbi{y}_{\mathbi{t}-\mathbi{1}}$};
+\node[anchor=east] (y) at ([xshift=-6em,yshift=1em]st.west){$\mathbi{y}_{t-1}$};

-\node[draw,anchor=east,minimum width=8em,minimum height=1.6em,fill=blue!20] (output) at ([xshift=-2.6em,yshift=2.6em]st2.west){};
+\node[draw,anchor=east,minimum width=7em,minimum height=1.4em,fill=blue!20] (output) at ([xshift=-2.6em,yshift=2.6em]st2.west){};

-\node[anchor=south] (yt) at ([yshift=5em]output.north){$\mathbi{y}_{\mathbi{t}}$};
-\draw[] ([xshift=-0.8em]output.90) -- ([xshift=-0.8em]output.-90);
-\draw[] ([xshift=-2.4em]output.90) -- ([xshift=-2.4em]output.-90);
-\draw[] ([xshift=0.8em]output.90) -- ([xshift=0.8em]output.-90);
-\draw[] ([xshift=2.4em]output.90) -- ([xshift=2.4em]output.-90);
+\node[anchor=south] (yt) at ([yshift=4.2em]output.north){$\mathbi{y}_{t}$};
+\draw[] ([xshift=-0.7em]output.90) -- ([xshift=-0.7em]output.-90);
+\draw[] ([xshift=-2.1em]output.90) -- ([xshift=-2.1em]output.-90);
+\draw[] ([xshift=0.7em]output.90) -- ([xshift=0.7em]output.-90);
+\draw[] ([xshift=2.1em]output.90) -- ([xshift=2.1em]output.-90);

-\foreach \x/\y in {1/2,2/1,3/5,4/1,5/1,6/1,7/3,8/4,9/2,10/3,11/5,12/5,13/2,14/5,15/5,16/5,17/13,18/2,19/4,20/1}
-	\node[draw=blue!20,anchor=south,prob,minimum height=0.2em*\y] at ([yshift=1em,xshift=-4.2em+0.4em*\x]output.north){};
+\foreach \x/\y in {1/2,2/1,3/5,4/1,5/1,6/1,7/3,8/4,9/2,10/3,11/5,12/10,13/2,14/5,15/5,16/5,17/5}
+	\node[draw=blue!25,anchor=south,prob,minimum height=0.2em*\y] at ([yshift=1em,xshift=-3.65em+0.4em*\x]output.north){};
 	
 \begin{pgfonlayer}{background}
 \node[inner sep=3pt,draw,dotted,rounded corners=2pt,very thick][fit=(key)(value)(cache)](box){};
 \end{pgfonlayer}

-\draw[-latex,dashed,very thick,out=-145,in=10] ([yshift=1.6em]box.180) to node[above,font=\footnotesize,text=red,rotate=25]{reading}(dt.0);
-
-\draw[-latex,dashed,very thick,out=-5,in=-170] (ct.0) to node[above,font=\footnotesize,text=red,pos=0.7,rotate=8]{matching}([yshift=-2.5em]box.180);
+\draw[-latex,dashed,very thick,out=-145,in=10] ([yshift=1em]box.180) to (dt.0);
+%node[above,font=\footnotesize,text=red,rotate=25]{reading}
+\draw[-latex,dashed,very thick,out=-5,in=-170] (ct.0) to ([yshift=-2.5em]box.180);
+%node[above,font=\footnotesize,text=red,pos=0.7,rotate=8]{matching}
 \draw[-,very thick,out=0,in=-135](st.0) to (add.-135);
 \draw[-,very thick,out=180,in=-45](dt.180) to (add.-45);
 \draw[-latex,very thick] (add.90) -- (st2.-90);
@@ -53,5 +60,5 @@
 \draw[-latex,very thick,out=180,in=-100] (st2.180) to (output.-90);
 \draw[-latex,very thick,out=80,in=-100] (y.90) to (output.-90);
 \draw[-latex,very thick] (output.90) -- ([yshift=1em]output.90);
-\draw[-latex,very thick] ([yshift=-1em]yt.-90) -- (yt.-90);
+\draw[-latex,very thick] ([yshift=-1.2em]yt.-90) -- (yt.-90);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-layer.tex
+++ b/Chapter17/Figures/figure-layer.tex
@@ -4,10 +4,10 @@

 \foreach \x in {1,2,3,4}
 	\node[draw,inner sep=0pt,minimum height=1em,minimum width=1.6em,fill=red!30,rounded corners=1pt] (c1_\x) at (0em+2em*\x, 0em){};
-
+\node[anchor=north] (hpre) at ([yshift=1.8em]c1_1.north) {${\mathbi{h}}^ {\textrm{pre}j}$};
 \foreach \x in {1,2,3}
 	\node[draw,inner sep=0pt,minimum height=1em,minimum width=1.6em,fill=red!30,rounded corners=1pt] (c2_\x) at (11em+2em*\x, 0em){};
-
+\node[anchor=north] (hpre) at ([yshift=1.8em]c2_1.north) {${\mathbi{h}}^ {\textrm{pre}1}$};
 \foreach \x in {1,2,3,4,5}
 	\node[draw,inner sep=0pt,minimum height=1em,minimum width=1.6em,fill=red!30,rounded corners=1pt] (c3_\x) at (18.4em+2em*\x, 0em){};

@@ -17,6 +17,8 @@
 %\node[inner sep=0pt,minimum size=1em,fill=ugreen,circle] (c5) at (9em, 7em){};
 \node[draw,inner sep=0pt,minimum size=1.2em,fill=green!20,circle] (qs) at (18.6em, 6.4em){};
 \node[draw,inner sep=0pt,minimum size=1.2em,fill=green!20,circle] (qw) at (18.6em, 4.4em){};
+\node[anchor=north] (qslab) at ([xshift=-0.8em,yshift=1em]qs.north) {${\mathbi{q}}^s$};
+\node[anchor=north] (qwlab) at ([xshift=-0.8em,yshift=1em]qw.north) {${\mathbi{q}}^w$};

 \node[draw,thick,inner sep=0pt, minimum size=1.2em, circle] (sigma) at (24.4em, 8em){};
 \draw[-,thick] (sigma.0) -- (sigma.180);
@@ -25,6 +27,9 @@
 \node[draw,fill=orange!30,inner sep=0pt, minimum size=1.2em, circle] (add1) at (5em, 3em){};
 \node[draw,fill=orange!30,inner sep=0pt, minimum size=1.2em, circle] (add2) at (15em, 3em){};
 \node[draw,fill=orange!30,inner sep=0pt, minimum size=1.2em, circle] (add3) at (10em, 5.2em){};
+\node[anchor=north] (cond) at ([xshift=-1em,yshift=0.5em]add3.north) {${\mathbi{d}}$};
+\node[anchor=north] (cons1) at ([xshift=-1em,yshift=0.5em]add2.north) {${\mathbi{s}}^1$};
+\node[anchor=north] (consj) at ([xshift=-1em,yshift=0.5em]add1.north) {${\mathbi{s}}^j$};
 \begin{pgfonlayer}{background}
 \node[draw,rounded corners=2pt,drop shadow,fill=white, minimum width=8.3em][fit=(c1_1)(c1_4)](box1){};
 \node[draw,rounded corners=2pt,drop shadow,fill=white,minimum width=6.4em][fit=(c2_1)(c2_3)](box2){};
@@ -35,10 +40,12 @@

 \node[draw=violet,densely dotted,minimum width=1.9em, minimum height=2.1em,very thick] (n1) at (24.4em,0em){};
 \node[draw=violet,densely dotted,minimum width=1.8em, minimum height=2em,very thick] (n2) at (24.4em,10.4em){};
-\node[] at (24.4em, -1.5em){$\mathbi{x}_\mathbi{t}$};
+%\node[] at (24.4em, -1.5em){$\mathbi{x}_\mathbi{t}$};
 \node[text=ublue] at (10.5em, 0em) {\small\bfnew{...}};
+\node[text=ublue] (hh) at (-0.8em, 0em) {\small\bfnew{...}};

-\draw[->,thick, out=70, in=-120] ([yshift=0.1em]c1_1.90) node[xshift=-0.4em,yshift=1.2em]{$ \mathbi{h}_ \mathbi{i}^ \mathbi{j}$}to ([yshift=-0.1em]add1.-90);
+\draw[->,thick, out=70, in=-120] ([yshift=0.1em]c1_1.90) to ([yshift=-0.1em]add1.-90);
+%node[xshift=-0.4em,yshift=1.2em]{$ \mathbi{h}^ {\textrm j}$}
 \draw[->,thick, out=80, in=-100] ([yshift=0.1em]c1_2.90) to ([yshift=-0.1em]add1.-90);
 \draw[->,thick, out=100, in=-80] ([yshift=0.1em]c1_3.90) to ([yshift=-0.1em]add1.-90);
 \draw[->,thick, out=110, in=-60] ([yshift=0.1em]c1_4.90) to ([yshift=-0.1em]add1.-90);
@@ -48,19 +55,19 @@
 \draw[->,thick, out=110, in=-70] ([yshift=0.1em]c2_3.90) to ([yshift=-0.1em]add2.-90);


-\draw[->,thick, out=30, in=-130] ([yshift=0.1em]add1.90) node[xshift=-0.4em,yshift=1.1em]{$ \mathbi{s}^ \mathbi{j}$} to ([yshift=-0.1em]add3.-120);
+\draw[->,thick, out=30, in=-130] ([yshift=0.1em]add1.90) to ([yshift=-0.1em]add3.-120);
 \draw[->,thick, out=150, in=-50] ([yshift=0.1em]add2.90) to ([yshift=-0.1em]add3.-70);
-\draw[->,thick, ugreen!60,out=160,in=-10] ([xshift=-0.1em]qs.160) node[xshift=-0.3em,yshift=0.1em,above,text=black]{$ \mathbi{q}_\mathbi{s}$} to ([xshift=0.1em]add3.0);
-\draw[->,thick, ugreen!60,out=180,in=0] ([xshift=-0.1em]qw.180) node[xshift=-0.3em,yshift=0.4em,above,text=black]{$ \mathbi{q}_\mathbi{w}$} to ([xshift=0.1em]add2.0);
+\draw[->,thick, ugreen!60,out=160,in=-10] ([xshift=-0.1em]qs.160) to ([xshift=0.1em]add3.0);
+\draw[->,thick, ugreen!60,out=180,in=0] ([xshift=-0.1em]qw.180) to ([xshift=0.1em]add2.0);
 \draw[->,thick, ugreen!60,out=170,in=-10] ([xshift=-0.1em]qw.160) to ([xshift=0.1em]add1.0);

 \draw[->,thick] ([yshift=0.1em]n1.135) .. controls ([xshift=-2em]n1.130) and ([xshift=2em]qw.0) .. ([xshift=0.1em]qw.0);
 \draw[->,thick] ([yshift=0.1em]n1.120) .. controls ([xshift=-2em,yshift=1em]n1.120) and ([xshift=3em]qs.0) .. ([xshift=0.1em]qs.0);
-\draw[->,thick] ([yshift=0.1em]n1.90) node[yshift=1em,right]{$ \mathbi{h}_\mathbi{t}$}-- ([yshift=-0.1em]sigma.-90);
+\draw[->,thick] ([yshift=0.1em]n1.90) node[yshift=0.5em,right]{$ {\mathbi{h}}_{\textrm{t}}$}-- ([yshift=-0.1em]sigma.-90);
 \draw[->,thick] ([yshift=0.1em]sigma.90) -- ([yshift=-0.1em]n2.-90);
-\draw[->,thick] ([yshift=0.1em]n2.90) -- node[right]{$ \widetilde{\mathbi{h}}_\mathbi{t}$}([yshift=2em]n2.90);
+\draw[->,thick] ([yshift=0.1em]n2.90) -- node[right]{$ \widetilde{\mathbi{h}}_{\textrm{t}}$}([yshift=2em]n2.90);

-\draw[decorate,decoration={brace, mirror},gray, thick] ([yshift=-2em]box1.-180) -- node[font=\scriptsize,text=black,below]{前几句}([yshift=-2em]box2.0);
+\draw[decorate,decoration={brace, mirror},gray, thick] ([yshift=-2em]hh.-180) -- node[font=\scriptsize,text=black,below]{前几句}([yshift=-2em]box2.0);
 \draw[decorate,decoration={brace, mirror},gray, thick] ([yshift=-2em]box3.-180) -- node[font=\scriptsize,text=black,below]{当前句}([yshift=-2em]box3.0);
 \draw[->, thick, rounded corners=2pt] ([yshift=0.1em]add3.90) -- ([yshift=2.1em]add3.90) -- ([xshift=-0.1em]sigma.180);


--- a/Chapter17/Figures/figure-picture-translation.tex
+++ b/Chapter17/Figures/figure-picture-translation.tex
@@ -2,15 +2,15 @@
 \tikzstyle{every node}=[scale=0.9]
 \begin {scope}
 \node[draw=white,scale=0.6] (input) at (0,0){\includegraphics[width=0.62\textwidth]{./Chapter17/Figures/figure-bank-without-attention.png}};(1.9,-1.4);
-\node[anchor=south] (english1) at ([xshift=0em,yshift=-2.5em]input.south) {\begin{tabular}{l}{\large\bfnew{英语}}{\Large{：A medium sized child}}\end{tabular}};
-\node[anchor=south] (english2) at ([xshift=1.9em,yshift=-1.2em]english1.south) {\begin{tabular}{l}{\Large{jumps off a dusty {\red{\underline{bank}}}.}} \end{tabular}};
-\end {scope}
-\node[draw,thick,inner sep=0pt,minimum height=16em,minimum width=19em,rounded corners=8pt][fit = (input) (english1)(english2)] (box1) at (0em,-1.5em){};
-\begin {scope}[xshift=1.45in,yshift=-0.2in]
-\draw[-,thick] (0,0.2) -- (1,0.2) -- (1,0.4) --(1.5,0) -- (1,-0.4) -- (1,-0.2) -- (0,-0.2) -- (0,0.2);
-\end {scope}
-\begin {scope}[xshift=4.4in,yshift=-0.2in]
-\node[anchor=east] (de1) {\begin{tabular}{l}{\large\bfnew{汉语}}{\Large{：一个半大孩子从尘土}}\end{tabular}};
-\node[anchor=south] (de2) at ([xshift=2em,yshift=-1.5em]de1.south) {\begin{tabular}{l}{\Large{飞扬的{\red{\underline{河床}}}上跳下来。}} \end{tabular}};
+\node[anchor=south] (english1) at ([xshift=-0.4em,yshift=-3.5em]input.south) {\begin{tabular}{l}{\normalsize\bfnew{英语}}{\large{：A medium sized child}}\end{tabular}};
+\node[anchor=south] (english2) at ([xshift=1.8em,yshift=-1.2em]english1.south) {\begin{tabular}{l}{\large{jumps off a dusty {\red{\underline{bank}}}.}} \end{tabular}};
+\draw[decorate,decoration={brace,amplitude=4mm},very thick] ([xshift=7em]input.90) -- ([xshift=5.7em,yshift=0.5em]english2.270);
+
+\node[anchor=east,rectangle,thick,rounded corners,minimum width=3.5em,minimum height=2.5em,text centered,draw=black!70,fill=red!25](trans)at ([xshift=8em,yshift=5.3em]english1.east){\normalsize{翻译模型}};
+
+\draw[->,very thick]([xshift=-1.65em]trans.west) to (trans.west);
+\draw[->,very thick](trans.east) to ([xshift=1.65em]trans.east);
+\node[anchor=east] (de1) at ([xshift=5.85cm,yshift=-0.1em]trans.east) {\begin{tabular}{l}{\normalsize\bfnew{汉语}}{\normalsize{：一个半大孩子从尘土飞扬}}\end{tabular}};
+\node[anchor=south] (de2) at ([xshift=0em,yshift=-1.5em]de1.south) {\begin{tabular}{l}{\normalsize{的{\red{\underline{河床}}}上跳下来。}} \end{tabular}};
 \end {scope}
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-twodecoding.tex
+++ b/Chapter17/Figures/figure-twodecoding.tex
-\tikzstyle{encoder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1.4cm,text centered,draw=black!70,fill=red!25]
-\tikzstyle{decoder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1.4cm,text centered,draw=black!70,fill=blue!15]
-\tikzstyle{attention} = [rectangle,thick,rounded corners,minimum width=2.6cm,minimum height=0.9cm,text centered,draw=black!70,fill=green!25]
+\tikzstyle{encoder} = [rectangle,thick,rounded corners,minimum width=4.3em,minimum height=2.2em,text centered,draw=black!70,fill=red!25]
+\tikzstyle{decoder} = [rectangle,thick,rounded corners,minimum width=4.3em,minimum height=2.2em,text centered,draw=black!70,fill=blue!15]
+\tikzstyle{attention} = [rectangle,thick,rounded corners,minimum width=2.6cm,minimum height=2.2em,text centered,draw=black!70,fill=green!25]

-\begin{tikzpicture}[node distance = 0,scale = 1]
-\tikzstyle{every node}=[scale=1]
-\node(encoder_left)[encoder]{\large{编码器}};
-\node(encoder_right)[encoder, right of = encoder_left, xshift=3cm]{\large{编码器}};
-\node(decoder_left)[decoder, above of = encoder_left, yshift=2.7cm]{\large{解码器}};
-\node(decoder_right)[decoder, above of = encoder_right, yshift=2.7cm]{\large{解码器}};
-\node(text_left)[below of = encoder_left, yshift=-2.2cm]{\large{前文}};
-\node(text_right)[below of = encoder_right, yshift=-2.2cm]{\large{源语}};
-\node(text_top)[above of = decoder_right, yshift=2cm]{\large{句子级翻译结果}};
-\node(title_1)[above of = text_top, xshift=-1.5cm, yshift=1.3cm]{\large\bfnew{一阶段解码}};
-\node(ground2)[rectangle,very thick,rounded corners,minimum width=5cm,minimum height=5.3cm,right of = encoder_right,xshift=6cm,yshift=1.4cm,draw=black,dashed]{};
-\node(ground1)[rectangle,thick,rounded corners,minimum width=3.3cm,minimum height=4.5cm,right of = encoder_right,xshift=5.5cm,yshift=1.4cm,draw=black,fill=yellow!15]{};
-\node(attention_below)[attention, right of = encoder_right, xshift=5.5cm]{\large{注意力机制}};
-\node(attention_above)[attention, above of = attention_below, yshift=1.4cm]{\large{注意力机制}};
-\node(ffn)[attention, above of = attention_above, yshift=1.4cm, fill=blue!8]{\large{前馈神经网络}};
+\begin{tikzpicture}[node distance = 0,scale = 0.75]
+\tikzstyle{every node}=[scale=0.75]
+\node(decoder_left)[decoder]{\normalsize{解码器}};
+\node(decoder_right)[decoder, right of = decoder_left, xshift=2.2cm]{\normalsize{解码器}};
+\node(encoder_left)[encoder, above of = decoder_left, yshift=1.6cm]{\normalsize{编码器}};
+\node(encoder_right)[encoder, above of = decoder_right, yshift=1.6cm]{\normalsize{编码器}};
+\node(text_left)[below of = encoder_left, yshift=1.5cm]{\normalsize{前文}};
+\node(text_right)[below of = encoder_right, yshift=1.5cm]{\normalsize{源语}};
+\node(text_top)[above of = decoder_right, yshift=-1.6cm]{\normalsize{句子级翻译结果}};
+\node(title_1)[above of = text_left, xshift=1.1cm, yshift=3cm]{\large\bfnew{一阶段解码}};
+\node(ground2)[rectangle,very thick,rounded corners,minimum width=5cm,minimum height=5.8cm,right of = decoder_right,xshift=5.3cm,yshift=1.6cm,draw=black,dashed]{};
+\node(ground1)[rectangle,thick,rounded corners,minimum width=3.3cm,minimum height=5cm,right of = decoder_right,xshift=4.8cm,yshift=1.58cm,draw=black,fill=yellow!15]{};
+\node(attention_below)[attention, right of = decoder_right, xshift=4.8cm]{\normalsize{注意力机制}};
+\node(attention_above)[attention, above of = attention_below, yshift=1.6cm]{\normalsize{注意力机制}};
+\node(ffn)[attention, above of = attention_above, yshift=1.6cm, fill=blue!8]{\normalsize{前馈神经网络}};
 \node(n)[right of = attention_above, xshift=2.4cm,scale=1.5]{$\times N$};
-\node(text_2)[above of = ffn, yshift=1.9cm]{\large{上下文修正结果}};
-\node(title_2)[above of = text_2, xshift=0.5cm,yshift=1.3cm]{\large\bfnew{二阶段解码}};
-\node(text_rright)[right of = text_right, xshift=5.5cm]{\large{句子级翻译结果}};
+\node(text_2)[above of = ffn, yshift=1.9cm]{\normalsize{上下文修正结果}};
+\node(title_2)[right of = title_1, xshift=6.3cm]{\large\bfnew{二阶段解码}};
+%\node(text_rright)[right of = text_right, xshift=5.5cm]{\normalsize{句子级翻译结果}};

-\draw[->,very thick]([yshift=0.2cm]text_left.north)to(encoder_left.south);
-\draw[->,very thick]([yshift=0.2cm]text_right.north)to(encoder_right.south);
-\draw[->,very thick](encoder_left.north)to(decoder_left.south);
-\draw[->,very thick](encoder_right.north)to(decoder_right.south);
-\draw[->,very thick](decoder_right.north)to([yshift=-0.1cm]text_top.south);
-\draw[->,very thick]([yshift=0.2cm]text_rright.north)to(attention_below.south);
+\draw[->,very thick]([yshift=-0.1cm]text_left.south)to(encoder_left.north);
+\draw[->,very thick]([yshift=-0.1cm]text_right.south)to(encoder_right.north);
+\draw[->,very thick](encoder_left.south)to(decoder_left.north);
+\draw[->,very thick](encoder_right.south)to(decoder_right.north);
+\draw[->,very thick](decoder_right.south)to([yshift=0.1cm]text_top.north);
+%\draw[->,very thick]([yshift=0.2cm]text_rright.north)to(attention_below.south);
 \draw[->,very thick](attention_below.north)to(attention_above.south);
 \draw[->,very thick](attention_above.north)to([yshift=-0.05cm]ffn.south);
 \draw[->,very thick](ffn.north)to([yshift=-0.05cm]text_2.south);
-\draw[-,very thick,dashed]([xshift=2cm,yshift=-0.2cm]text_right.east)to([xshift=2cm,yshift=9cm]text_right.east);
-\draw[-,very thick]([yshift=0.5cm]encoder_left.north)--([yshift=0.5cm,xshift=4.5cm]encoder_left.north)--([xshift=-2.68cm]attention_below.west)--(attention_below.west);
-\draw[-,very thick](decoder_left.north)--([yshift=0.5cm]decoder_left.north)--([yshift=0.5cm,xshift=4.7cm]decoder_left.north)--([xshift=-2.48cm]attention_above.west)--(attention_above.west);
+\draw[->,very thick]([yshift=-0.05em]text_top.south) -- ([yshift=-4.8em]decoder_right.south) -- ([yshift=-4.78em]attention_below.south) --(attention_below.south);
+\draw[-,very thick,dashed]([xshift=1.25cm,yshift=-3cm]decoder_right.east)to([xshift=1.25cm,yshift=6.5cm]decoder_right.east);
+\draw[->,very thick,draw=gray,rounded corners=2pt] (encoder_left.south)--([yshift=-0.3cm]encoder_left.south)--([yshift=-0.3cm,xshift=3.42cm]encoder_left.south)--([xshift=-2.25cm]attention_above.west)--(attention_above.west);
+\draw[->,very thick,draw=gray,rounded corners=2pt] (encoder_right.south)--([yshift=-0.3cm]encoder_right.south)--([yshift=-0.3cm,xshift=3.42cm]encoder_left.south)--([xshift=-2.25cm]attention_above.west)--(attention_above.west);
+\draw[->,very thick,draw=gray,rounded corners=2pt](decoder_left.south)--([yshift=-0.3cm]decoder_left.south)--([yshift=-0.3cm,xshift=3.42cm]decoder_left.south)--([xshift=-2.25cm]attention_below.west)--(attention_below.west);
+\draw[->,very thick,draw=gray,rounded corners=2pt](decoder_right.south)--([yshift=-0.3cm]decoder_right.south)--([yshift=-0.3cm,xshift=3.42cm]decoder_left.south)--([xshift=-2.25cm]attention_below.west)--(attention_below.west);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -464,7 +464,7 @@

 \section{篇章级翻译}

-\parinterval 目前大多数机器翻译系统都是句子级的，这种系统的输入和输出都是以句子为单位，并基于句子之间相互独立的假设。然而句子级模型却缺少了对句子间上下文信息的建模，篇章级翻译则是用于解决该问题，进而改善机器翻译在整个篇章上的翻译质量。篇章级翻译的目的就是对句子间的上下文信息进行建模，改善机器翻译在整个篇章上的质量。篇章级翻译的概念在很早就已经被提出\upcite{DBLP:journals/ac/Bar-Hillel60}。随着近几年神经机器翻译取得了巨大进展，如何使用篇章级上下文信息成为进一步改善机器翻译质量的重要方向\upcite{DBLP:journals/corr/abs-1912-08494,DBLP:journals/corr/abs-1901-09115}。本节我们将主要从篇章级机器翻译的评价、建模方法等角度展开介绍。
+\parinterval 目前大多数机器翻译系统都是句子级的，这种系统的输入和输出均以句子为单位，且基于“句子之间相互独立”的假设，却缺少了对篇章上下文信息的建模，因而在需要依赖上下文的翻译场景中其翻译效果总是不尽人意。篇章级翻译的目的就是通过对篇章上下文信息进行建模来解决该问题，进而改善机器翻译在整个篇章上的翻译质量。篇章级翻译的概念在很早就已经被提出\upcite{DBLP:journals/ac/Bar-Hillel60}，随着近几年神经机器翻译取得了巨大进展，如何使用篇章上下文信息成为进一步改善机器翻译质量的重要方向\upcite{DBLP:journals/corr/abs-1912-08494,DBLP:journals/corr/abs-1901-09115}。本节我们将主要从篇章级机器翻译的评价、建模方法等角度展开介绍。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -472,7 +472,7 @@

 \subsection{什么是篇章级翻译}

-\parinterval “篇章”在这里指一系列连续的段落或者句子所构成的整体，其中各个句子间从形式和内容上都具有一定的连贯性和一致性\upcite{jurafsky2000speech}。这些联系主要体现在{\small\sffamily\bfseries{衔接}}\index{衔接}（Cohesion \index{Cohesion}）以及{\small\sffamily\bfseries{连贯}}\index{连贯}（Coherence \index{Coherence}）两个方面。其中衔接体现在显性的语言成分和结构上，包括篇章中句子间语法和词汇上的联系，而连贯体现在各个句子之间逻辑和语义上的联系。因此，篇章级翻译的目的就是要考虑到这些上下文之间的联系，从而生成相比句子级翻译更连贯和准确的翻译结果（如实例\ref{eg:17-1}）。但是由于不同语言的特性多种多样，上下文信息在篇章级翻译中的作用也不尽相同。比如在德语中名词是分词性的，因此在代词翻译的过程中需要根据其先行词的词性进行区分，而这种现象在其它不区分词性的语言中是不存在的。这导致篇章级翻译在不同的语种中可能对应多种不同的上下文现象。
+\parinterval “篇章”在这里指一系列连续的段落或句子所构成的整体，其中各个句子间从形式和内容上都具有一定的连贯性和一致性\upcite{jurafsky2000speech}。这些联系主要体现在{\small\sffamily\bfseries{衔接}}\index{衔接}（Cohesion \index{Cohesion}）以及{\small\sffamily\bfseries{连贯}}\index{连贯}（Coherence \index{Coherence}）两个方面。其中衔接体现在显性的语言成分和结构上，包括篇章中句子间的语法和词汇的联系，而连贯体现在各个句子之间的逻辑和语义的联系上。因此，篇章级翻译的目的就是要将这些上下文之间的联系考虑在内，从而生成比句子级翻译更连贯和准确的翻译结果（如实例\ref{eg:17-1}）。但是由于不同语言的特性多种多样，上下文信息在篇章级翻译中的作用也不尽相同。比如在德语中名词是分词性的，因此在代词翻译的过程中需要根据其先行词的词性进行区分，而这种现象在其它不区分词性的语言中是不存在的。这意味着篇章级翻译在不同的语种中可能对应多种不同的上下文现象。

 \begin{example}
 上下文句子：我上周针对这个问题做出解释并咨询了他的意见。
@@ -486,14 +486,14 @@
 \label{eg:17-1}
 \end{example}

-\parinterval 正是由于这种上下文现象的多样性，使得篇章级翻译模型的性能评价相对困难。目前篇章级机器翻译主要针对一些常见上下文的现象，比如代词翻译、省略、连接和词汇衔接等，而{\chapterfour}介绍的BLEU等通用自动评价指标通常对这些上下文现象不敏感，篇章级翻译需要采用一些专用方法来对这些具体的现象进行评价。之前已经有一些研究工作针对具体的上下文现象提出了相应的评价标准并且在篇章级翻译中得到应用\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/acl/VoitaST19}，但是目前并没有达成共识，这也在一定程度上阻碍了篇章级机器翻译的进一步发展。我们将在ref{sec:17-3-2}节中对这些评价标准进行介绍。
+\parinterval 正是由于这种上下文现象的多样性，使评价篇章级翻译模型的性能变得相对困难。目前篇章级机器翻译主要针对一些常见的上下文现象，比如代词翻译、省略、连接和词汇衔接等，而{\chapterfour}介绍的BLEU等通用自动评价指标通常对这些上下文现象不敏感，篇章级翻译需要采用一些专用方法来对这些具体现象进行评价。之前已经有一些研究工作针对具体的上下文现象提出了相应的评价标准并且在篇章级翻译中得到应用\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/acl/VoitaST19}，但是目前并没有达成共识，这也在一定程度上阻碍了篇章级机器翻译的进一步发展。我们将在\ref{sec:17-3-2}节中对这些评价标准进行介绍。

-\parinterval 从建模的角度来看，篇章级翻译需要引入额外的上下文信息，来解决上述上下文现象。在统计机器翻译时代就已经有一些相关工作，这些工作都是针对某一具体的上下文现象进行建模，比如篇章结构\upcite{DBLP:conf/anlp/MarcuCW00,foster2010translating,DBLP:conf/eacl/LouisW14}、代词回指\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/wmt/NagardK10,DBLP:conf/eamt/LuongP16,}、词汇衔接\upcite{tiedemann2010context,DBLP:conf/emnlp/GongZZ11,DBLP:conf/ijcai/XiongBZLL13,xiao2011document}和篇章连接词\upcite{DBLP:conf/sigdial/MeyerPZC11,DBLP:conf/hytra/MeyerP12,}等。但是由于统计机器翻译本身流程复杂，依赖于许多组件和针对上下文现象所精心构造的特征，其建模方法相对比较困难。到了神经机器翻译时代，翻译质量相比统计机器翻译取得了大幅提升\upcite{DBLP:conf/nips/SutskeverVL14,bahdanau2014neural,vaswani2017attention}，这也鼓励研究人员进一步探索利用篇章上下文的信息\upcite{DBLP:conf/emnlp/LaubliS018}。近几年，相关工作不断涌现并且取得了一些阶段性进展\upcite{DBLP:journals/corr/abs-1912-08494}。
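+\parinterval From a modeling perspective, document-level translation needs to bring in additional context in order to handle the contextual phenomena described above. Related work already existed in the era of statistical machine translation, with each approach modeling one specific contextual phenomenon, such as discourse structure\upcite{DBLP:conf/anlp/MarcuCW00,foster2010translating,DBLP:conf/eacl/LouisW14}, pronominal anaphora\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/wmt/NagardK10,DBLP:conf/eamt/LuongP16}, lexical cohesion\upcite{tiedemann2010context,DBLP:conf/emnlp/GongZZ11,DBLP:conf/ijcai/XiongBZLL13,xiao2011document}, and discourse connectives\upcite{DBLP:conf/sigdial/MeyerPZC11,DBLP:conf/hytra/MeyerP12}. However, because the statistical machine translation pipeline is itself complex and relies on many components and on features carefully engineered for each contextual phenomenon, such modeling was relatively difficult. In the era of neural machine translation, translation quality has improved substantially over statistical machine translation\upcite{DBLP:conf/nips/SutskeverVL14,bahdanau2014neural,vaswani2017attention}, which has encouraged researchers to explore further how to exploit document-level context\upcite{DBLP:conf/emnlp/LaubliS018}. In recent years, related work has kept emerging and has achieved some preliminary progress\upcite{DBLP:journals/corr/abs-1912-08494}.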

-\parinterval
-区别于篇章级统计机器翻译，篇章级神经机器翻译通常采用直接对上下文句子进行建模的端到端的方式。这种方法不再需要针对某一具体的上下文现象构造相应的特征，而是通过翻译模型本身从上下文句子中抽取和融合相应的上下文信息。通常情况下，待翻译句子的上下文信息一般来自于近距离的上下文，篇章级机器翻译可以采用局部建模的手段将前一句或者周围几句作为上下文送入模型。针对长距离的上下文现象，也可以使用全局建模的手段直接从篇章所有其他句子中提取上下文信息。近几年多数研究工作都在探索更有效的局部建模或者全局建模的方法，主要包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器结构\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构\upcite{DBLP:conf/emnlp/WangTWL17,DBLP:conf/emnlp/TanZXZ19,Werlen2018DocumentLevelNM,DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20} 以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}四类。
+\parinterval 
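+Unlike document-level statistical machine translation, document-level neural machine translation usually models the context sentences directly in an end-to-end manner. This approach no longer requires constructing features for a particular contextual phenomenon; instead, the translation model itself extracts and fuses the relevant information from the context sentences. In most cases the context needed for the sentence being translated comes from nearby sentences, so document-level machine translation can use local modeling and feed the previous sentence, or a few surrounding sentences, into the model as context. For long-range contextual phenomena, global modeling can be used to extract context directly from all other sentences in the document. Most recent work explores more effective local or global modeling methods, which fall into four categories: modifying the input\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}, multi-encoder architectures\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}, hierarchical architectures\upcite{DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20}, and cache-based methods\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}.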

-\parinterval 此外，篇章级机器翻译面临的另外一个挑战是数据稀缺。篇章级机器翻译所需要的双语数据需要保留篇章边界，数量相比于句子级双语数据要少很多。除了在之前提到的端到端做法中采用预训练或者参数共享的手段（见{\chaptersixteen}），也可以采用另外的建模手段来缓解数据稀缺问题。比如在句子级翻译模型推断过程中，通过目标端篇章级语言模型\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827}来引入上下文信息，或者对句子级的解码结果进行修正\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19,DBLP:conf/emnlp/VoitaST19}。这种方法能够充分利用句子级的双语数据，并且在一定程度上缓解篇章级双语数据稀缺问题。
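+\parinterval Another challenge facing document-level machine translation is data scarcity. The bilingual data it requires must preserve document boundaries, and far less of it is available than sentence-level bilingual data. Besides applying pre-training or parameter sharing in the end-to-end methods mentioned above (see {\chaptersixteen}), other modeling approaches can also alleviate the problem. For example, during inference with a sentence-level translation model, context can be introduced on the target side through a document-level language model\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827}, or the sentence-level decoding results can be revised afterwards\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19,DBLP:conf/emnlp/VoitaST19}. Such methods make full use of sentence-level bilingual data and, to some extent, alleviate the scarcity of document-level bilingual data.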

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -501,11 +501,11 @@

 \subsection{Evaluation of Document-Level Translation}\label{sec:17-3-2}

-\parinterval BLEU等自动评价指标能够在一定程度上反映译文的整体质量，但是并不能有效地评估篇章级翻译模型的性能。这是由于传统测试数据中出现篇章级上下文现象的比例相对较少，并且$n$-gram的匹配很难检测到一些具体的语言现象，这使得研究人员很难通过BLEU的涨幅来判断篇章级翻译模型的效果。
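+\parinterval Automatic metrics such as BLEU reflect the overall quality of a translation to some extent, but they cannot effectively assess document-level translation models. In conventional test data, document-level contextual phenomena occur relatively rarely, and $n$-gram matching can hardly detect specific linguistic phenomena, so it is difficult for researchers to judge the effectiveness of a document-level translation model from BLEU scores.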

-\parinterval 为此，研究人员总结了篇章级机器翻译任务中存在的上下文现象，并基于此设计了相应的自动评价指标。比如针对代词翻译现象，首先使用词对齐寻找源语言中代词在译文和参考译文中的对应位置，然后通过计数计算最终的准确率和召回率等指标\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/discomt/WerlenP17}。针对词汇衔接现象，使用{\small\sffamily\bfseries{词汇链}}\index{词汇链}（Lexical Chain\index{Lexical Chain}）\footnote{词汇链指篇章中语义相关的词所构成的序列}等来获取相应分数，然后通过加权平均的方式对BLEU和METEOR等指标进行扩展\upcite{DBLP:conf/emnlp/WongK12,DBLP:conf/discomt/GongZZ15}。{\red{针对篇章连接词}}，使用候选词典和词对齐对源语中连接词的正确翻译结果进行计数，计算其准确率\upcite{DBLP:conf/cicling/HajlaouiP13}。
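+\parinterval For this reason, researchers have summarized the contextual phenomena that arise in machine translation and designed corresponding automatic metrics. For pronoun translation in documents, a word alignment tool is first used to locate, in both the system output and the reference, the positions corresponding to each source-language pronoun; metrics such as precision and recall are then computed\upcite{DBLP:conf/iwslt/HardmeierF10,DBLP:conf/discomt/WerlenP17} to evaluate the quality of the document-level translation. For lexical cohesion, {\small\sffamily\bfseries{lexical chains}}\index{Lexical Chain}\footnote{A lexical chain is a sequence of semantically related words in a document.} and similar devices are used to obtain a score that reflects the quality of lexical cohesion, which is then combined by weighting with conventional metrics such as BLEU or METEOR\upcite{DBLP:conf/emnlp/WongK12,DBLP:conf/discomt/GongZZ15}. For discourse connectives, a candidate dictionary and a word alignment tool are used to count the correct translations of source-language connectives and to compute their accuracy\upcite{DBLP:conf/cicling/HajlaouiP13}.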
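+
+\parinterval As a rough sketch of the pronoun-oriented metrics (the notation is assumed for illustration, and the cited metrics differ in their details): let $c$ be the number of source pronouns whose aligned word in the system output matches their aligned word in the reference, $n_{\textrm{hyp}}$ the number of source pronouns aligned to some word in the system output, and $n_{\textrm{ref}}$ the number aligned to some word in the reference. Precision and recall then take the usual form
+\begin{eqnarray}
+\textrm{Precision}&=&\frac{c}{n_{\textrm{hyp}}}\nonumber\\
+\textrm{Recall}&=&\frac{c}{n_{\textrm{ref}}}\nonumber
+\end{eqnarray}
+Similarly, a cohesion score $m_{\textrm{coh}}$ obtained from lexical chains can be combined with a conventional metric such as BLEU as $\alpha\cdot m_{\textrm{coh}}+(1-\alpha)\cdot m_{\textrm{BLEU}}$, where the weight $\alpha\in[0,1]$ is a hypothetical tuning parameter.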

-\parinterval 除了自动评价指标，也有一些研究人员针对特有的上下文现象手工构造了相应的测试套件。例如，可以采用对比测试的方式。测试集中每一个测试样例都包含一个正确翻译的结果，以及多个错误结果，一个理想的模型应该对正确的翻译评价最高，排名在所有错误答案之上,此时就可以通过模型是否能挑选出正确答案来评估其性能。这种方法可以很好地衡量模型在某一特定上下文现象上的处理能力，比如词义消歧\upcite{DBLP:conf/wmt/RiosMS18}、代词翻译\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/MullerRVS18}和一些衔接问题\upcite{DBLP:conf/acl/VoitaST19}等。但是该方法缺点在于其使用范围受限于测试集的语种和规模，扩展性较差。
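+\parinterval Beyond automatic metrics, some researchers have manually constructed test suites that target particular contextual phenomena in order to evaluate translation quality. Each test case in such a suite contains one correct translation and several incorrect ones. An ideal translation model should score the correct translation highest, ranking it above all incorrect ones, so the model can be evaluated by whether it picks out the correct translation. This approach measures well how a translation model handles a specific contextual phenomenon, such as word sense disambiguation\upcite{DBLP:conf/wmt/RiosMS18}, pronoun translation\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/MullerRVS18}, and certain cohesion problems\upcite{DBLP:conf/acl/VoitaST19}. Its drawback is that its applicability is limited by the languages and the size of the test set, so it scales poorly.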
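+
+\parinterval The selection criterion above can be written compactly. As a sketch under assumed notation (none of these symbols come from the cited test suites): let the suite contain $N$ test cases, let $y_n^{+}$ be the correct translation of the $n$-th case and $\mathcal{E}_n$ its set of incorrect translations, and let $s(\cdot)$ be the score that the model assigns to a candidate, for example its log-probability given the source sentence and its context. The accuracy on the suite is then
+\begin{eqnarray}
+\textrm{Acc}&=&\frac{\big|\{n: s(y_n^{+})>\max_{y\in\mathcal{E}_n}s(y)\}\big|}{N}
+\nonumber
+\end{eqnarray}
+that is, the fraction of test cases in which the correct translation outscores every incorrect one.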

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -513,17 +513,16 @@

 \subsection{Modeling Document-Level Translation}

-\parinterval 篇章级神经机器翻译不再针对具体的上下文现象构造特征，而是对篇章上下文直接进行建模。在理想情况下，这种方法将以整个篇章为单位作为模型的输入和输出。然而由于现实中篇章对应的序列长度过长，因此直接对整个篇章对应序列建模难度很大，使得主流的序列到序列模型很难达到良好的效果甚至难以训练。一种思路是采用能够处理超长序列的模型对篇章信息建模，比如，使用{\chapterfifteen}中提到的处理长序列的Transformer模型就是针对该问题的一个有效的解决方法\upcite{DBLP:conf/iclr/KitaevKL20}。不过，这类模型并不针对篇章级翻译的具体翻译问题，因此并不是篇章级翻译中的主流方法。
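+\parinterval Document-level neural machine translation no longer constructs features for specific contextual phenomena; instead it models the context of the sentences in a document directly. Ideally, this approach would take the entire document as the unit of model input and output. In practice, however, the sequence corresponding to a document is too long, which makes modeling it directly very difficult: mainstream sequence-to-sequence models can hardly achieve good results on it and may even be hard to train. One idea is to model the document sequence with a model that can handle very long sequences; for example, the long-sequence Transformer models mentioned in {\chapterfifteen} are an effective solution to this problem\upcite{DBLP:conf/iclr/KitaevKL20}. However, such models do not target the specific problems of document-level translation, so they are not the mainstream approach.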

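 \parinterval The common end-to-end approach today still starts from sentence-level translation: additional modules build abstract representations of the context sentences in the document, and the relevant context is then extracted and integrated into the translation of the current sentence. Formally, document-level translation is modeled as follows: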
-
 \begin{eqnarray}
 \funp{P}(\seq{Y}|\seq{X})&=&\prod_{i=1}^{T}{\funp{P}(Y_i|X_i,D_i)}
-\label{eq:17-3-1}\\
-D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}
+\label{eq:17-3-1}
 \end{eqnarray}
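+where $\seq{X}$ and $\seq{Y}$ are the source-language and target-language documents, $X_i$ and $Y_i$ are the $i$-th sentences of the source-language and target-language documents, and $T$ is the number of sentences in the document\footnote{To simplify the problem, we assume that the source side and the target side contain the same number of sentences $T$.}. $D_i$ denotes the set of context sentences used when translating the $i$-th sentence. Ideally, $D_i$ contains all sentences of the source and target documents other than the $i$-th, but given the requirements of different task scenarios and the efficiency of the model in practice, document-level neural machine translation usually uses only a subset of them as context input.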
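+
+\parinterval As an illustration of a restricted context (an assumed instantiation, not a choice prescribed here), suppose $D_i$ contains only the previous source sentence and the previous target sentence, i.e., $D_i=\{X_{i-1},Y_{i-1}\}$. Equation \eqref{eq:17-3-1} then becomes
+\begin{eqnarray}
+\funp{P}(\seq{Y}|\seq{X})&=&\prod_{i=1}^{T}{\funp{P}(Y_i|X_i,X_{i-1},Y_{i-1})}
+\nonumber
+\end{eqnarray}
+where the context is taken to be empty for $i=1$. Setting $D_i=\emptyset$ for every $i$ recovers ordinary sentence-by-sentence translation.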

-其中$\seq{X}$和$\seq{Y}$分别为源语言篇章和目标语言篇章，$X_i$和$Y_i$分别为源语言篇章和目标语言篇章中的某个句子，$X_{-i}$和$Y_{-i}$分别为去掉第$i$个句子的源语言篇章和目标语言，$T$表示篇章中句子的数目\footnote{为了简化问题，我们假设源语言端和目标语言段具有相同的句子数目$T$}。$D_i$表示翻译第个句子时所对应的上下文句子集合，代表源语言篇章和目标语言篇章中其它的句子。考虑到不同的任务场景需求与模型的应用效率，篇章级神经机器翻译在建模的时候通常仅使用一部分作为上下文句子输入。对应的，篇章级神经机器翻译主要需要考虑两个问题：1）上下文范围的选取，比如上下文句子的多少\upcite{agrawal2018contextual,DBLP:conf/emnlp/WerlenRPH18,DBLP:conf/naacl/MarufMH19}，是否考虑目标端上下文句子\upcite{DBLP:conf/discomt/TiedemannS17,agrawal2018contextual}等；2）不同的上下文范围也对应着不同的建模方式，即如何从上下文句子中提取上下文信息，并且融入到翻译模型中。接下来将对一些典型的建模方法进行介绍，包括改进输入\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}、多编码器结构\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}、层次结构\upcite{DBLP:conf/emnlp/WangTWL17,DBLP:conf/emnlp/TanZXZ19,Werlen2018DocumentLevelNM,DBLP:conf/naacl/MarufMH19,DBLP:conf/acl/HaffariM18,DBLP:conf/emnlp/YangZMGFZ19,DBLP:conf/ijcai/ZhengYHCB20} 以及基于缓存的方法\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ1}。
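+\parinterval The choice of context scope is a key consideration in document-level neural machine translation, for example how many context sentences to use\upcite{agrawal2018contextual,DBLP:conf/emnlp/WerlenRPH18,DBLP:conf/naacl/MarufMH19} and whether to include target-side context sentences\upcite{DBLP:conf/discomt/TiedemannS17,agrawal2018contextual}. In addition, different context scopes correspond to different modeling approaches\footnote{That is, how to extract context from the context sentences and integrate it into the translation model.}. The following subsections introduce several typical modeling methods, including modifying the input\upcite{DBLP:conf/discomt/TiedemannS17,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/wmt/GonzalesMS17,DBLP:journals/corr/abs-1910-07481}, multi-encoder architectures\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/acl/TitovSSV18,DBLP:conf/emnlp/ZhangLSZXZL18}, hierarchical architectures\upcite{DBLP:conf/emnlp/WangTWL17,DBLP:conf/emnlp/TanZXZ19,Werlen2018DocumentLevelNM}, and cache-based methods\upcite{DBLP:conf/coling/KuangXLZ18,DBLP:journals/tacl/TuLSZ18}.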

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -531,7 +530,7 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}

 \subsubsection{1. Modifying the Input}

-\parinterval 一种简单的方法是直接复用传统的序列到序列模型，将上下文句子与当前句子拼接作为模型输入。如实例\ref{eg:17-3-1}所示，这种做法不需要改动模型结构，操作简单，适用于包括基于循环神经网络\upcite{DBLP:conf/discomt/TiedemannS17}和Transformer\upcite{agrawal2018contextual,DBLP:conf/discomt/ScherrerTL19}在内的神经机器翻译系统。但是由于过长的序列会导致模型难以训练，通常只会采用局部的上下文句子进行拼接，比如源语言端前一句或者周围几句\upcite{DBLP:conf/discomt/TiedemannS17}。同时，引入目标语言端的上下文\upcite{DBLP:conf/naacl/BawdenSBH18,agrawal2018contextual,DBLP:conf/discomt/ScherrerTL19}，比如在解码时拼接目标语言端上下文和当前句同样会带来一定的性能提升。但是过大的窗口在推断时会导致错误累计的问题\upcite{agrawal2018contextual}，因此通常只考虑目标语端的前一句。
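+\parinterval A simple method is to reuse a conventional sequence-to-sequence model directly, concatenating the sentence to be translated with its context sentences and feeding the result to the model. As Example \ref{eg:17-3-1} shows, this requires no change to the model architecture, is easy to implement, and applies to neural machine translation systems based on recurrent neural networks\upcite{DBLP:conf/discomt/TiedemannS17} as well as on the Transformer\upcite{agrawal2018contextual,DBLP:conf/discomt/ScherrerTL19}. However, because overly long sequences make the model hard to train, usually only local context sentences are concatenated, for example only the previous source-side sentence or a few surrounding ones\upcite{DBLP:conf/discomt/TiedemannS17}. Target-side context can also be introduced\upcite{DBLP:conf/naacl/BawdenSBH18,agrawal2018contextual,DBLP:conf/discomt/ScherrerTL19}: concatenating the target-side context with the current sentence during decoding likewise brings some performance gain. But too large a window leads to error accumulation at inference time\upcite{agrawal2018contextual}, so usually only the previous target-side sentence is considered.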

 \begin{example}
 Training input of the conventional model:
@@ -543,14 +542,14 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}
 \vspace{0.5em}
 \qquad\ Training input of the modified model:

-\hspace{10em}源语言：{\red{他们在哪？ <sep> }}你看到了吗？
+\hspace{10em}Source: {\red{他们在哪？\ <sep>\ }}你看到了吗？

 \hspace{10em}Target: Do you see them?

 \label{eg:17-3-1}
 \end{example}

-\parinterval 其他改进输入的做法相比于拼接的方法要复杂一些，首先需要对篇章进行处理，得到词汇链\footnote{词汇链指篇章中语义相关的词所构成的序列}\upcite{DBLP:conf/wmt/GonzalesMS17}或者篇章嵌入\upcite{DBLP:journals/corr/abs-1910-07481}等信息，然后融入到当前句子的序列表示中，送入模型进行翻译。这种方式中上下文信息来自于预先提取的篇章表示，但是这种表示是否适合机器翻译还有待论证。
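+\parinterval Other ways of modifying the input are more involved than concatenation. The document is first processed to obtain information such as lexical chains\footnote{A lexical chain is a sequence of semantically related words in a document.}\upcite{DBLP:conf/wmt/GonzalesMS17} or document embeddings\upcite{DBLP:journals/corr/abs-1910-07481}; this information is then incorporated into the sequence representation of the current sentence, which is fed into the model. Whether such pre-extracted document representations are suitable as context for machine translation remains to be demonstrated.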

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -566,25 +565,21 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}
 \end{figure}
 %----------------------------------------------

-\parinterval 区别于在输入上进行改进，另一种思路是对传统的编码器-解码器框架进行更改，采用额外的编码器来编码上下文句子，称之为多编码器结构\upcite{DBLP:conf/acl/LiLWJXZLL20,DBLP:conf/acl/LiLWJXZLL20,DBLP:conf/discomt/SugiyamaY19}。这种结构最早被应用在基于循环神经网络的篇章级翻译模型\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/coling/KuangX18,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/pacling/YamagishiK19}，并且在Transformer模型上同样适用\upcite{DBLP:journals/corr/abs-1805-10163,DBLP:conf/emnlp/ZhangLSZXZL18}。图\ref{fig:17-18}展示了一个基于Transformer模型的多编码器结构，基于源语言当前待翻译句子的编码表示$\mathbi{h}$和上下文句子的编码表示$\mathbi{h}_{pre}$，模型首先通过注意力机制提取句子间上下文信息$\mathbi{d}$：
-
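+\parinterval Instead of modifying the input, another idea is to change the conventional encoder-decoder framework by introducing an additional encoder to encode the context sentences; this is called the multi-encoder architecture\upcite{DBLP:conf/acl/LiLWJXZLL20,DBLP:conf/discomt/SugiyamaY19}. This architecture was first applied in document-level translation models based on recurrent neural networks\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/coling/KuangX18,DBLP:conf/naacl/BawdenSBH18,DBLP:conf/pacling/YamagishiK19}, and was later shown to work for the Transformer as well\upcite{DBLP:journals/corr/abs-1805-10163,DBLP:conf/emnlp/ZhangLSZXZL18}. Figure \ref{fig:17-18} shows a Transformer-based multi-encoder architecture. Given the encoded representation $\mathbi{h}$ of the source sentence currently being translated and the encoded representation $\mathbi{h}^{\textrm{pre}}$ of the context sentence, the model first extracts the context information $\mathbi{d}$ through an attention mechanism: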
 \begin{eqnarray}
-\mathbi{d}&=&Attention(\mathbi{h},\mathbi{h}_{pre},\mathbi{h}_{pre})
+\mathbi{d}&=&\textrm{Attention}(\mathbi{h},\mathbi{h}^{\textrm{pre}},\mathbi{h}^{\textrm{pre}})
 \label{eq:17-3-3}
 \end{eqnarray}
-
-在注意力机制中，$\mathbi{h}$作为query（查询），$\mathbi{h}_{pre}$作为key（键）和value（值）。然后通过门控机制将每个位置的编码表示和上下文信息进行融合，具体方式如下：
-
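+Here $\mathbi{h}$ serves as the query, and $\mathbi{h}^{\textrm{pre}}$ as the key and value. A gating mechanism then fuses the encoded representation at each position of the sentence being translated with the context information for that position, as follows: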
 \begin{eqnarray}
 \widetilde{\mathbi{h}_{t}}&=&\lambda_{t}\mathbi{h}_{t}+(1-\lambda_{t})\mathbi{d}_{t}
 \label{eq:17-3-4}\\
 \lambda_{t}&=&\sigma(\mathbi{W}_{\lambda}[\mathbi{h}_{t};\mathbi{d}_{t}]+\mathbi{b}_{\lambda})
 \label{eq:17-3-5}
 \end{eqnarray}
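+where $\widetilde{\mathbi{h}}$ is the final sequence representation fused with context information, and $\widetilde{\mathbi{h}_{t}}$ is its representation at position $t$. $\mathbi{W}_{\lambda}$ and $\mathbi{b}_{\lambda}$ are learnable parameters of the model, and $\sigma$ is the Sigmoid function, used to obtain the gate weight $\lambda$. Besides performing the fusion outside the decoder, $\mathbi{h}^{\textrm{pre}}$ can also be fed into the decoder and fused there with a similar mechanism\upcite{DBLP:conf/emnlp/ZhangLSZXZL18}.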
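+
+\parinterval To make Equations \eqref{eq:17-3-3}--\eqref{eq:17-3-5} more concrete, the following is a sketch under one assumption: that $\textrm{Attention}$ is the standard scaled dot-product attention of the Transformer, with the learned projection matrices omitted for brevity (the text above does not fix the variant). Treating $\mathbi{h}$ and $\mathbi{h}^{\textrm{pre}}$ as matrices with one row per position, and writing $d_k$ for the hidden size, the context information is
+\begin{eqnarray}
+\mathbi{d}&=&\textrm{Softmax}\Big(\frac{\mathbi{h}(\mathbi{h}^{\textrm{pre}})^{\textrm{T}}}{\sqrt{d_k}}\Big)\mathbi{h}^{\textrm{pre}}
+\nonumber
+\end{eqnarray}
+so $\mathbi{d}$ has one row $\mathbi{d}_t$ for each position $t$ of the current sentence, which is what the gate consumes. The gate itself is an interpolation: as $\lambda_t$ approaches 1 the output $\widetilde{\mathbi{h}_{t}}$ approaches $\mathbi{h}_{t}$ and the context is ignored, and as $\lambda_t$ approaches 0 it approaches $\mathbi{d}_{t}$. For example, if $\mathbi{W}_{\lambda}[\mathbi{h}_{t};\mathbi{d}_{t}]+\mathbi{b}_{\lambda}=0$, then $\lambda_t=\sigma(0)=0.5$ and $\widetilde{\mathbi{h}_{t}}=(\mathbi{h}_{t}+\mathbi{d}_{t})/2$.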

-其中$\widetilde{\mathbi{h}}$为融合了上下文信息的最终序列表示结果，$\widetilde{\mathbi{h}_{t}}$为其中第$t$个位置的表示。$\mathbi{W}_{\lambda}$和$\mathbi{b}_{\lambda}$为模型可学习的参数，$\sigma$为Sigmoid函数，用来获取门控权值$\lambda$。
-
-\parinterval 除了在解码端外部进行融合，也可以将$\mathbi{h}_{pre}$送入解码器，在解码器中采用类似的机制进行融合\upcite{DBLP:conf/emnlp/ZhangLSZXZL18}。此外，多编码器结构由于引入了额外的模块，模型整体参数量大大增加，会导致其难以训练。为此一些研究人员提出使用句子级模型预训练的方式来初始化模型参数\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/emnlp/ZhangLSZXZL18}，或者使用编码器参数共享的手段来减小模型复杂度\upcite{DBLP:conf/pacling/YamagishiK19,DBLP:conf/coling/KuangX18,DBLP:journals/corr/abs-1805-10163}。
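+\parinterval In addition, because the multi-encoder architecture introduces extra modules, the total number of parameters grows considerably, which can make the model hard to train. Some researchers therefore initialize the model parameters by pre-training a sentence-level model\upcite{DBLP:journals/corr/JeanLFC17,DBLP:conf/emnlp/ZhangLSZXZL18}, or share parameters between the two encoders to reduce model complexity\upcite{DBLP:conf/pacling/YamagishiK19,DBLP:conf/coling/KuangX18,DBLP:journals/corr/abs-1805-10163}.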

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -601,24 +596,22 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}
 \end{figure}
 %----------------------------------------------

-\parinterval 多编码器结构通过额外的编码器对前一句进行编码，但是无法处理多个上下文句子的情况。为了能够捕捉到更充分的上下文信息，可以采用层次结构来对更多的上下文句子进行建模。层次结构可以有效的处理更长的上下文序列，以及序列内不同单元之间的相互作用。类似的思想也成功的应用在基于树的翻译模型中（{\chaptereight}和{\chapterfifteen}）。
-
-\parinterval 图\ref{fig:17-19}描述了一个基于层次注意力的模型结构\upcite{DBLP:conf/emnlp/WerlenRPH18}。首先通过翻译模型的编码器获取前文$k$个句子的序列编码表示$(\mathbi{h}^k,\dots,\mathbi{h}^2,\mathbi{h}^1)$，然后使用层次注意力机制从这些编码表示中提取上下文信息$\mathbi{d}$，进而可以和当前句子的编码表示$\mathbi{h}$融合，得到一个上下文相关的当前句子表示$\widetilde{\mathbi{h}}$。其中层次注意力的计算过程也是分为两步，第一步针对前文每个句子的词序列表示$\mathbi{h}^{j}$，使用词级注意力提取从各个句子的上下文信息$\mathbi{s}^{j}$，然后在这$k$个句子级上下文表示$\mathbi{s}=(\mathbi{s}^k,\dots,\mathbi{s}^2,\mathbi{s}^1)$基础上，使用句子级注意力提取最终的上下文信息。具体计算过程如下所示：
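+\parinterval The multi-encoder architecture encodes the previous sentence with an additional encoder, but it cannot handle multiple context sentences. To capture richer context, a hierarchical architecture can be used to model more context sentences. A hierarchical architecture can effectively handle longer context sequences as well as the interactions among different units within the sequence. Similar ideas have also been applied successfully in tree-based translation models ({\chaptereight} and {\chapterfifteen}).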

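+\parinterval Figure \ref{fig:17-19} depicts a model based on hierarchical attention\upcite{DBLP:conf/emnlp/WerlenRPH18}. The encoder of the translation model first produces the word-sequence encodings $(\mathbi{h}^{\textrm{pre}1},\mathbi{h}^{\textrm{pre}2},\dots,\mathbi{h}^{\textrm{pre}k})$ of the preceding $k$ sentences. For the word-sequence encoding $\mathbi{h}^{\textrm{pre}j}$ of each preceding sentence, word-level attention extracts sentence-level context information $\mathbi{s}^{j}$; then, on top of these $k$ pieces of sentence-level context information $\mathbi{s}=(\mathbi{s}^1,\mathbi{s}^2,\dots,\mathbi{s}^k)$, sentence-level attention extracts the final document context information $\mathbi{d}$. Because obtaining $\mathbi{d}$ involves attention at two different levels, word and sentence, the procedure is called hierarchical attention. To strengthen the model's representational capacity, hierarchical attention does not use the encoding $\mathbi{h}_{t}$ of position $t$ in the current sentence directly as the query; instead, two linear transformations $f_w$ and $f_s$ produce the queries $\mathbi{q}_{w}$ and $\mathbi{q}_{s}$ for word-level and sentence-level attention, respectively. In addition, a feed-forward sublayer FFN is added after the sentence-level attention. The computation is as follows: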
 \begin{eqnarray}
 \mathbi{q}_{w}&=&f_w(\mathbi{h}_t)
 \label{eq:17-3-6}\\
-\mathbi{s}^j&=&WordAttention(\mathbi{q}_{w},\mathbi{h}^{j},\mathbi{h}^{j})
+\mathbi{s}^j&=&\textrm{WordAttention}(\mathbi{q}_{w},\mathbi{h}^{\textrm{pre}j},\mathbi{h}^{\textrm{pre}j})
 \label{eq:17-3-7}\\
 \mathbi{q}_{s}&=&f_s(\mathbi{h}_t)
 \label{eq:17-3-8}\\
-\mathbi{d}_t&=&FFN(SentAttention(\mathbi{q}_{s},\mathbi{s},\mathbi{s}))
+\mathbi{d}_t&=&\textrm{FFN}(\textrm{SentAttention}(\mathbi{q}_{s},\mathbi{s},\mathbi{s}))
 \label{eq:17-3-9}
 \end{eqnarray}
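+After the final context information $\mathbi{d}$ is obtained, the model again uses the gating mechanism (Equations \eqref{eq:17-3-4} and \eqref{eq:17-3-5}) to fuse it with $\mathbi{h}$, yielding a context-aware representation $\widetilde{\mathbi{h}}$ of the current sentence.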
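+
+\parinterval The two attention levels in Equations \eqref{eq:17-3-7} and \eqref{eq:17-3-9} can be spelled out as weighted sums. The following is a simplified sketch that assumes single-head, unscaled dot-product scoring (the cited model may use a different scoring function); $\mathbi{h}^{\textrm{pre}j}_{m}$ denotes the encoding of the $m$-th word of the $j$-th context sentence, and $\alpha_{j,m}$ and $\beta_{j}$ are names introduced here for the attention weights:
+\begin{eqnarray}
+\alpha_{j,m}&=&\frac{\exp(\mathbi{q}_{w}^{\textrm{T}}\mathbi{h}^{\textrm{pre}j}_{m})}{\sum_{m'}\exp(\mathbi{q}_{w}^{\textrm{T}}\mathbi{h}^{\textrm{pre}j}_{m'})}\nonumber\\
+\mathbi{s}^{j}&=&\sum_{m}\alpha_{j,m}\mathbi{h}^{\textrm{pre}j}_{m}\nonumber\\
+\beta_{j}&=&\frac{\exp(\mathbi{q}_{s}^{\textrm{T}}\mathbi{s}^{j})}{\sum_{j'=1}^{k}\exp(\mathbi{q}_{s}^{\textrm{T}}\mathbi{s}^{j'})}\nonumber\\
+\mathbi{d}_{t}&=&\textrm{FFN}\Big(\sum_{j=1}^{k}\beta_{j}\mathbi{s}^{j}\Big)\nonumber
+\end{eqnarray}
+Since both queries are computed from $\mathbi{h}_{t}$, the sentence summaries $\mathbi{s}^{j}$ and the weights $\beta_{j}$ differ from one position $t$ of the current sentence to the next.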

-其中$\mathbi{h}_{t}$表示当前句子第$t$个位置的编码表示。为了增强模型表示能力，首先通过$f_w$和$f_s$两个线性变换分别获取词级注意力和句子级注意力的查询$\mathbi{q}_{w}$和$\mathbi{q}_{s}$，另外在句子级注意力之后添加了一个前馈全连接网络子层FFN。在获得上下文表示$\mathbi{d}_{t}$后，模型同样采用门控机制（如公式\eqref{eq:17-3-4}和公式\eqref{eq:17-3-5}）与$\mathbi{h}_{t}$进行融合来得到最终的编码表示$\widetilde{\mathbi{h}_{t}}$。
-
-\parinterval 通过层次注意力，模型可以在词级和句子级两个维度从多个句子中提取更充分的上下文信息，并且可以同时用于解码端来获取目标端的上下文信息。基于层次注意力，为了进一步编码整个篇章的上下文信息，研究人员提出选择性注意力\upcite{DBLP:conf/naacl/MarufMH19}来对篇章中整体上下文进行有选择的信息提取。此外，也有研究人员使用循环神经网络\upcite{DBLP:conf/emnlp/WangTWL17}、记忆网络\upcite{DBLP:conf/acl/HaffariM18}、胶囊网络\upcite{DBLP:conf/emnlp/YangZMGFZ19}和片段级相对注意力\upcite{DBLP:conf/ijcai/ZhengYHCB20}等结构来对多个上下文句子进行上下文信息提取。
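+\parinterval With hierarchical attention, the model can extract richer context from multiple sentences at both the word level and the sentence level. Besides the encoder side, it can also be used on the decoder side to obtain target-side context. Building on hierarchical attention, and in order to encode the context of the whole document, researchers have proposed selective attention\upcite{DBLP:conf/naacl/MarufMH19} to extract information selectively from the overall document context. Other researchers have used recurrent neural networks\upcite{DBLP:conf/emnlp/WangTWL17}, memory networks\upcite{DBLP:conf/acl/HaffariM18}, capsule networks\upcite{DBLP:conf/emnlp/YangZMGFZ19}, and segment-level relative attention\upcite{DBLP:conf/ijcai/ZhengYHCB20} to extract context from multiple context sentences.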

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -626,17 +619,14 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}

 \subsubsection{4. Cache-Based Methods}

-\parinterval 除了以上提到的建模方法，还有一类基于缓存的方法\upcite{DBLP:journals/tacl/TuLSZ18,DBLP:conf/coling/KuangXLZ18}。这种方法最大的特点在于将篇章翻译看作一个连续的过程，然后在这个过程中通过一个额外的缓存来记录一些相关信息，最后在每个句子解码的过程中使用这个缓存来提供上下文信息。图\ref{fig:17-20}描述了一种基于缓存的篇章级翻译模型结构\upcite{DBLP:journals/tacl/TuLSZ18}。在这里，翻译模型基于循环神经网络（参考{\chapterten}），但是这种方法同样适用于包括Transformer在内的其他神经机器翻译模型。模型中篇章上下文的建模依赖于缓存的读和写操作。其中读操作以及与目标端表示的融合方法和层次结构中提到的方法类似，同样使用注意力机制以及门控机制来获取最终的目标端表示$\widetilde{\mathbi{s}_{t}}$。而缓存的写操作则是在每个句子翻译结束后，将句子中每个词${y}_{t}$对应的表示对$<\mathbi{c}_{t},\mathbi{s}_{t}>$作为注意力的键和值按照一定规则写入缓存。其中，$\mathbi{c}_{t}$和$\mathbi{s}_{t}$分别表示第$t$个目标词所对应的源语表示和解码器隐层状态。如果${y}_{t}$不存在于缓存，则写入其中的空槽或者替换最久未使用的键值对；如果${y}_{t}$存在于缓存，则将对应的键值对进行更新:
-
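+\parinterval Beyond the modeling methods above, there is also a family of cache-based methods\upcite{DBLP:journals/tacl/TuLSZ18,DBLP:conf/coling/KuangXLZ18}. Their defining feature is that document translation is treated as a continuous process: an additional cache records relevant information during this process, and the cache then supplies context while each sentence is decoded. Figure \ref{fig:17-20} depicts a cache-based document-level translation model\upcite{DBLP:journals/tacl/TuLSZ18}. Here the translation model is based on recurrent neural networks (see {\chapterten}), but the method applies equally to other neural machine translation models, including the Transformer. In this model, document context is modeled through read and write operations on the cache. A write stores, according to certain rules, the context vector $\mathbi{c}_r$ of a target word from the translation history and its decoder hidden state $\mathbi{s}_r$ into the cache as a key and a value, respectively. A read uses the context vector $\mathbi{c}_t$ of the $t$-th word of the sentence being translated as the query, matches it against every key in the cache, and sums the values weighted by the matching probabilities; this gives the context information $\mathbi{d}_t$ for that position, and together these make up the document context information $\mathbi{d}$ of the sentence being translated. The decoder hidden state $\mathbi{s}_t$ is likewise fused with the context information $\mathbi{d}_t$ of the corresponding position through a gating mechanism. Because the cache has limited capacity, its contents are also updated according to rules: after the current sentence has been translated, if the information for the word $y_t$ has not yet been written to the cache, it is written into an empty slot or replaces the least recently used key-value pair; if $y_t$ is already in the cache as part of the translation history, the corresponding key-value pair is updated as follows: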
 \begin{eqnarray}
 \mathbi{k}_{i}&=&(\mathbi{k}_{i}+\mathbi{c}_{t})/2
 \label{eq:17-3-10}\\
 \mathbi{v}_{i}&=&(\mathbi{v}_{i}+\mathbi{s}_{t})/2
 \label{eq:17-3-11}
 \end{eqnarray}
-
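 where $i$ is the position of $y_t$ in the cache, and $\mathbi{k}_{i}$ and $\mathbi{v}_{i}$ are the corresponding key and value in the cache. Since what this method caches are word-level representations of the target-side history, it can address some lexical cohesion problems, such as lexical consistency and certain collocation issues, and produce more coherent translations.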
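+
+\parinterval The read operation described above can be sketched as follows, assuming that the match between the query and a key is a dot product normalized by a Softmax over all cache slots, and that the gate has the same form as in Equation \eqref{eq:17-3-4} (both are assumptions; the cited model may differ in these details, and $p_i$ is a name introduced here for the matching probability):
+\begin{eqnarray}
+p_{i}&=&\frac{\exp(\mathbi{c}_{t}^{\textrm{T}}\mathbi{k}_{i})}{\sum_{i'}\exp(\mathbi{c}_{t}^{\textrm{T}}\mathbi{k}_{i'})}\nonumber\\
+\mathbi{d}_{t}&=&\sum_{i}p_{i}\mathbi{v}_{i}\nonumber\\
+\widetilde{\mathbi{s}_{t}}&=&\lambda_{t}\mathbi{s}_{t}+(1-\lambda_{t})\mathbi{d}_{t}\nonumber
+\end{eqnarray}
+Note also that the averaging in Equations \eqref{eq:17-3-10} and \eqref{eq:17-3-11} halves the weight of the existing entry at every update, so after $n$ updates the originally stored key and value contribute with weight $2^{-n}$, and recent occurrences of a word dominate its cache entry.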
-
 %----------------------------------------------
 \begin{figure}[htp]
    \centering
@@ -652,26 +642,11 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}

 \subsection{Incorporating Document Context at Inference Time}

-\parinterval 上一节介绍的建模方法主要针对上下文句子进行建模，通过端到端的方式进行上下文信息的提取和融合。由于篇章级双语数据相对稀缺，这种复杂的篇章级翻译模型很难通过直接训练取得很好的效果，通常会采用两阶段训练或参数共享的方式。此外，相比之下句子级双语数据更为丰富，在此基础上训练得到的模型性能通常能够达到预期。因此，一个自然的想法是基于高质量句子级翻译模型，在推断过程中结合上下文信息方法来构造篇章级翻译模型。比如通过结合目标语言端的篇章级语言模型\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827}来引入上下文信息，或者通过两阶段解码\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19}和后编辑\upcite{DBLP:conf/emnlp/VoitaST19}的方法在句子级翻译结果上进行修正。
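+\parinterval The modeling methods in the previous section mainly model the context sentences of the sentence being translated, extracting and fusing context information end to end. Because document-level bilingual data is relatively scarce, such complex document-level translation models are hard to train well directly; two-stage training or parameter sharing is typically used to ease the problem. Sentence-level bilingual data, by contrast, is far more plentiful, so sentence-level translation models usually reach the expected quality on individual sentences. A natural idea is therefore to start from a high-quality sentence-level translation model and to build a document-level translation model by incorporating context during inference.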

-\parinterval 相比于篇章级双语数据，篇章级单语数据更容易获取。在双语数据稀缺的情况下，通过引入目标语言端的篇章级语言模型可以更充分的利用这些单语数据。最简单的做法是在翻译模型的分数基础上加上语言模型的分数\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/corr/abs-2010-12827}，既可以在推断的搜索过程中作为模型最终打分，也可以将其作为重排序阶段的一种特征。其次，也可以使用噪声信道模型对篇章级翻译进行建模\upcite{DBLP:journals/tacl/YuSSLKBD20}。使用贝叶斯规则，将篇章翻译问题转换成如下形式（参考5.3节内容）：
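+\parinterval Introducing a target-side document-level language model into a sentence-level translation model\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/tacl/YuSSLKBD20,DBLP:journals/corr/abs-2010-12827} is a common way of incorporating context. Compared with document-level bilingual data, document-level monolingual data is easier to obtain. When bilingual data is scarce, a target-side document-level language model makes fuller use of such monolingual data. The simplest approach is to add the language model score to the score of the sentence-level translation model\upcite{DBLP:conf/discomt/GarciaCE19,DBLP:journals/corr/abs-2010-12827}; the combined score can serve as the final model score during search at inference time, or as a feature in a reranking stage.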
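+
+\parinterval As a sketch of this score combination (the subscripts and the weight $\beta$ are assumptions introduced for illustration; the cited systems may weight or normalize the terms differently), a candidate translation $Y_i$ of the $i$-th sentence can be scored as
+\begin{eqnarray}
+\textrm{score}(Y_i)&=&\log\funp{P}_{\textrm{TM}}(Y_i|X_i)+\beta\cdot\log\funp{P}_{\textrm{LM}}(Y_i|Y_{<i})
+\nonumber
+\end{eqnarray}
+where $\funp{P}_{\textrm{TM}}$ is the sentence-level translation model, $\funp{P}_{\textrm{LM}}$ is the target-side document-level language model conditioned on the previously translated sentences $Y_{<i}$, and $\beta$ balances the two terms. With $\beta=1$ this is the plain sum of the two log-scores, and with $\beta=0$ it reduces to sentence-level translation.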

-\begin{eqnarray}
-\widehat{Y}&=&\argmax_{Y}\funp{P}(Y|X)\\
-&=&\argmax_{Y}\underbrace{\funp{P}(X|Y)}_{\textrm{信道模型}}\times\underbrace{\funp{P}(Y)}_{\textrm{语言模型}}
-\label{eq:17-3-12}
-\end{eqnarray}
-
-其中$X$和$Y$分别表示源语言端和目标语言端篇章。进一步，可以得到近似形式：
-
-\begin{eqnarray}
-\widehat{Y}&\approx&\argmax_{Y}\prod_{i=1}^{T}{\funp{P}(X_i|Y_i)\times\funp{P}(Y_i|Y_{<i})}
-\label{eq:17-3-13}
-\end{eqnarray}
-
-通过这种生成式模型，只需要使用句子级的翻译模型以及目标端的篇章级翻译模型，避免了对篇章级双语数据的依赖。
-
-\parinterval 另一种改进方法不影响句子级翻译模型的推断过程，而是在完成翻译后使用额外的模块进行第二阶段的解码，通过两阶段的解码来引入上下文信息\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19}。如图\ref{fig:17-21}所示，这种两阶段解码的做法相当于将篇章级翻译的问题进行了分离和简化，适用于篇章级双语数据稀缺的场景。基于类似的思想，有研究人员使用后编辑的做法对翻译结果进行修正\upcite{DBLP:conf/emnlp/VoitaST19}。区别于两阶段解码的方法，后编辑的方法无需参考源语信息，只是基于目标语言端的连续翻译结果来提供上下文信息。通过这种方式，可以降低对篇章级双语数据的需求。
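+\parinterval Another way to introduce context during inference is to revise the sentence-level translations through two-pass decoding\upcite{DBLP:conf/aaai/XiongH0W19,DBLP:conf/acl/VoitaST19} and post-editing\upcite{DBLP:conf/emnlp/VoitaST19}. This approach leaves the inference procedure of the sentence-level translation model untouched; after translation is complete, an additional module performs a second pass of decoding. As Figure \ref{fig:17-21} shows, two-pass decoding effectively separates and simplifies the document-level translation problem and suits scenarios where document-level bilingual data is scarce. Following a similar idea, some researchers use post-editing to revise the translations. Unlike two-pass decoding, post-editing does not need to consult the source text; it relies only on the consecutive target-side translations to provide context. In this way, the need for document-level bilingual data can be reduced.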

 %----------------------------------------------
 \begin{figure}[htp]
@@ -681,7 +656,6 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}
    \label{fig:17-21}
 \end{figure}
 %----------------------------------------------
-
 %----------------------------------------------------------------------------------------
 %    NEW SECTION
 %----------------------------------------------------------------------------------------