Merge branch 'caorunzhe' into 'master'

Caorunzhe (see merge request !697)

Commit 52b6cbac authored Dec 24, 2020 by 曹润柘
--- a/Chapter17/Figures/figure-an-end-to-end-voice-translation-model-based-on-transformer.tex
+++ b/Chapter17/Figures/figure-an-end-to-end-voice-translation-model-based-on-transformer.tex
@@ -17,11 +17,11 @@
 \node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
 \node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};

-\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=2em]de_ffn.north){Softmax};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=1.6em]de_ffn.north){Softmax};
 \node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){STLoss};

-\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){语音特征\\(FilterBank/MFCC)};
-\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){目标文本\\(Embedding)};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Speech Feature\\(FilterBank/MFCC)};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1.1em]de_add.south){Target Text\\(Embedding)};

 \node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Position\\(Embedding)};
 \node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Position\\(Embedding)};
@@ -40,8 +40,8 @@
 \draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
 \draw[->,rounded corners=2pt] ([yshift=0.1em]en_ffn.90) -- ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
 \begin{pgfonlayer}{background}
-\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(en_sa)(en_ffn)]{};
-\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(de_sa)(de_ca)(de_ffn)]{};
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(en_sa)(en_ffn)]{};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(de_sa)(de_ca)(de_ffn)]{};
 \end{pgfonlayer}

 \node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};

--- a/Chapter17/Figures/figure-application-of-multimodal-machine-translation-to-multitask-learning.tex
+++ b/Chapter17/Figures/figure-application-of-multimodal-machine-translation-to-multitask-learning.tex
-\tikzstyle{coder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1cm,text centered,draw=black,fill=red!25]
+\tikzstyle{coder} = [rectangle,rounded corners,minimum height=2.2em,minimum width=4.3em,text centered,draw=black,fill=red!25]
 \begin{tikzpicture}[node distance = 0,scale = 1]
 \tikzstyle{every node}=[scale=1]
 \node(x)[]{x};
@@ -6,7 +6,7 @@
 \node(decoder_left)[coder, above of = encoder, yshift=6em,fill=blue!25]{{解码器}};
 \node(y_hat)[above of = decoder_left, yshift=4em]{{$\rm y'$}};
 \node(y)[above of = decoder_left, xshift=-6em]{{$\rm y$}};
-\node(decoder_right)[coder, above of = encoder, xshift=12em,fill=yellow!25]{{解码器}};
+\node(decoder_right)[coder, above of = encoder, xshift=11em,fill=yellow!25]{{解码器}};

 \node(figure)[draw=white,above of = decoder_right,yshift=6.5em,scale=0.25] {\includegraphics[width=0.62\textwidth]{./Chapter17/Figures/figure-bank-without-attention.png}};

@@ -14,6 +14,6 @@
 \draw[->,thick](encoder)to(decoder_left)node[right,xshift=-0.1cm,yshift=-1.25cm,scale=1.0]{翻译};
 \draw[->,thick](decoder_left)to(y_hat);
 \draw[->,thick](y)to(decoder_left);
-\draw[->,thick](encoder)to(decoder_right)node[left,xshift=-3.8em,yshift=0.25cm,scale=1.0]{生成图片};
+\draw[->,thick](encoder)to(decoder_right)node[left,xshift=-3.1em,yshift=0.25cm,scale=1.0]{生成图片};
 \draw[->,thick](decoder_right)to(figure);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-image-description-of-encoder-decoder-framework.tex
+++ b/Chapter17/Figures/figure-image-description-of-encoder-decoder-framework.tex
@@ -20,11 +20,11 @@

 \foreach \x in {1,2}{
 \draw[-,line width=2pt] (A\x) -- ([xshift=3.6em]A\x) -- ([xshift=3.6em,yshift=-3em]A\x) -- ([yshift=-3em]A\x) -- (A\x) -- ([xshift=1em]A\x);
-\draw[-, very thick] (B\x) -- (C\x) -- (D\x) -- (B\x);
-\draw[-, very thick,fill=black] ([xshift=-0.6em,yshift=-1.2em]B\x)  -- ([xshift=-0.3em,yshift=-1em]B\x) -- ([yshift=-1.2em]B\x) --([xshift=0.3em,yshift=-1em]B\x) -- ([xshift=0.6em,yshift=-1.2em]B\x) -- (D\x) -- (C\x) -- ([xshift=-0.6em,yshift=-1.2em]B\x);
-\draw[-, very thick,fill=black] (E\x) -- ([xshift=0.2em,yshift=0.3em]E\x) -- ([xshift=0.33em]F\x) -- (F\x) -- (E\x);
+\draw[-, thick] (B\x) -- (C\x) -- (D\x) -- (B\x);
+\draw[-, thick,fill=black] ([xshift=-0.6em,yshift=-1.2em]B\x)  -- ([xshift=-0.3em,yshift=-1em]B\x) -- ([yshift=-1.2em]B\x) --([xshift=0.3em,yshift=-1em]B\x) -- ([xshift=0.6em,yshift=-1.2em]B\x) -- (D\x) -- (C\x) -- ([xshift=-0.6em,yshift=-1.2em]B\x);
+\draw[-, thick,fill=black] (E\x) -- ([xshift=0.2em,yshift=0.3em]E\x) -- ([xshift=0.33em]F\x) -- (F\x) -- (E\x);
 \node[circle,inner sep=0pt,minimum size=0.4em,fill=black] at ([xshift=-0.7em,yshift=-0.2em]B\x){};
-\node[draw,rounded corners=2pt,fill=yellow!20,minimum width=2.3cm,minimum height=1cm](cnn\x) at ([xshift=9.5em,yshift=-1.5em]A\x){CNN};
+\node[draw,rounded corners=2pt,fill=yellow!20,minimum height=2.2em,minimum width=4.3em](cnn\x) at ([xshift=9.5em,yshift=-1.5em]A\x){CNN};
 \node[draw,circle,fill=green!20,font=\footnotesize,anchor=west,inner sep=3pt] (h\x_2) at ([xshift=3em,yshift=0.66em]cnn\x.east){$h_2$};
 \node[draw,circle,fill=green!20,font=\footnotesize,anchor=south,inner sep=3pt] (h\x_1) at ([yshift=1em]h\x_2.north){$h_1$};
 \node[font=\footnotesize,anchor=north] (h\x_c) at ([yshift=-0.6em]h\x_2.south){$\cdots$};
@@ -36,11 +36,11 @@
 \node[draw,thick,rounded corners=2pt,densely dashed,inner ysep=1.2em,inner xsep=0.4em,label={above:图像特征向量}][fit=(h2_1)(h2_2)(h2_n)](box2){};
 \end{pgfonlayer}

-\node[anchor=west,draw,rounded corners=2pt,fill=blue!20,minimum width=2.3cm,minimum height=1cm] (decoder1)at ([xshift=3em]box1.east){解码器};
+\node[anchor=west,draw,rounded corners=2pt,fill=blue!20,minimum height=2.2em,minimum width=4.3em] (decoder1)at ([xshift=3em]box1.east){解码器};
 \node[anchor=west,draw,circle,inner sep=0pt,minimum size=1.4em] (add)at ([xshift=2em,yshift=1.6em]box2.east){};
 \draw[] (add.0) -- (add.180);
 \draw[] (add.90) -- (add.-90);
-\node[anchor=west,draw,rounded corners=2pt,fill=blue!20,minimum width=2.3cm,minimum height=1cm] (decoder2)at ([xshift=6em]box2.east){解码器};
+\node[anchor=west,draw,rounded corners=2pt,fill=blue!20,minimum height=2.2em,minimum width=4.3em] (decoder2)at ([xshift=6em]box2.east){解码器};


 \draw[->,thick] ([xshift=-2.7em]cnn1.180) -- ([xshift=-0.1em]cnn1.180);

--- a/Chapter17/Figures/figure-layer.tex
+++ b/Chapter17/Figures/figure-layer.tex

-\begin{tikzpicture}
+\begin{tikzpicture}[node distance = 0,scale = 0.8]
+\tikzstyle{every node}=[scale=0.8]

 \foreach \x in {1,2,3,4}
-	\node[inner sep=0pt,minimum size=1em,fill=ublue,circle] (c1_\x) at (0em+1.6em*\x, 0em){};
+	\node[draw,inner sep=0pt,minimum height=1em,minimum width=1.6em,fill=red!30,rounded corners=1pt] (c1_\x) at (0em+2em*\x, 0em){};

-\foreach \x in {1,2,3,4,5,6}
-	\node[inner sep=0pt,minimum size=1em,fill=ublue,circle] (c2_\x) at (8.4em+1.6em*\x, 0em){};
+\foreach \x in {1,2,3}
+	\node[draw,inner sep=0pt,minimum height=1em,minimum width=1.6em,fill=red!30,rounded corners=1pt] (c2_\x) at (11em+2em*\x, 0em){};

 \foreach \x in {1,2,3,4,5}
-	\node[inner sep=0pt,minimum size=1em,fill=ublue,circle] (c3_\x) at (20em+1.6em*\x, 0em){};
+	\node[draw,inner sep=0pt,minimum height=1em,minimum width=1.6em,fill=red!30,rounded corners=1pt] (c3_\x) at (18.4em+2em*\x, 0em){};,minimum width=1em

 \foreach \x in {1,2,3,4,5}
-	\node[inner sep=0pt,minimum size=1em,fill=orange,circle] (c4_\x) at (20em+1.6em*\x, 9.4em){};
+	\node[draw,inner sep=0pt,minimum height=1em,minimum width=1.6em,fill=blue!30,rounded corners=1pt] (c4_\x) at (18.4em+2em*\x, 10.4em){};

-\node[inner sep=0pt,minimum size=1em,fill=ugreen,circle] (c5) at (9em, 7em){};
-\node[inner sep=0pt,minimum size=1.2em,fill=ugreen,circle] (qs) at (18.6em, 5em){};
-\node[inner sep=0pt,minimum size=1.2em,fill=ugreen,circle] (qw) at (18.6em, 3em){};
+%\node[inner sep=0pt,minimum size=1em,fill=ugreen,circle] (c5) at (9em, 7em){};
+\node[draw,inner sep=0pt,minimum size=1.2em,fill=green!20,circle] (qs) at (18.6em, 6.4em){};
+\node[draw,inner sep=0pt,minimum size=1.2em,fill=green!20,circle] (qw) at (18.6em, 4.4em){};

-\node[fill=orange,inner sep=0pt, minimum size=1.2em, circle, text=white] (sigma) at (24.8em, 7em){\small\bfnew{$\sigma$}};
+\node[draw,thick,inner sep=0pt, minimum size=1.2em, circle] (sigma) at (24.4em, 8em){};
+\draw[-,thick] (sigma.0) -- (sigma.180);
+\draw[-,thick] (sigma.90) -- (sigma.-90);

-\node[fill=ugreen,inner sep=0pt, minimum size=1.2em, circle, text=white] (add1) at (4em, 3em){\small\bfnew{+}};
-\node[fill=ugreen,inner sep=0pt, minimum size=1.2em, circle, text=white] (add2) at (14em, 3em){\small\bfnew{+}};
-\node[fill=ugreen,inner sep=0pt, minimum size=1.2em, circle, text=white] (add3) at (9em, 5em){\small\bfnew{+}};
+\node[draw,fill=orange!30,inner sep=0pt, minimum size=1.2em, circle] (add1) at (5em, 3em){};
+\node[draw,fill=orange!30,inner sep=0pt, minimum size=1.2em, circle] (add2) at (15em, 3em){};
+\node[draw,fill=orange!30,inner sep=0pt, minimum size=1.2em, circle] (add3) at (10em, 5.2em){};
 \begin{pgfonlayer}{background}
-\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c1_1)(c1_4)](box1){};
-\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c2_1)(c2_6)](box2){};
-\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c3_1)(c3_5)](box3){};
-\node[draw,rounded corners=2pt,drop shadow,fill=white][fit=(c4_1)(c4_5)](box4){};
-\node[draw,rounded corners=2pt,inner xsep=6pt,drop shadow,fill=white][fit=(c5)](box5){};
+\node[draw,rounded corners=2pt,drop shadow,fill=white, minimum width=8.3em][fit=(c1_1)(c1_4)](box1){};
+\node[draw,rounded corners=2pt,drop shadow,fill=white,minimum width=6.4em][fit=(c2_1)(c2_3)](box2){};
+\node[draw,rounded corners=2pt,drop shadow,fill=white,minimum width=10.5em][fit=(c3_1)(c3_5)](box3){};
+\node[draw,rounded corners=2pt,drop shadow,fill=white,minimum width=10.3em][fit=(c4_1)(c4_5)](box4){};
+%\node[draw,rounded corners=2pt,inner xsep=6pt,drop shadow,fill=white][fit=(c5)](box5){};
 \end{pgfonlayer}

-\node[draw,dash pattern=on 3pt off 1pt,minimum width=1.6em, minimum height=2em,very thick] (n1) at (24.8em,0em){};
-\node[draw,dash pattern=on 3pt off 1pt,minimum width=1.6em, minimum height=2em,very thick] (n2) at (24.8em,9.4em){};
-\node[] at (24.8em, -1.5em){$\mathbi{x}_\mathbi{t}$};
-\node[text=ublue] at (8.2em, 0em) {\small\bfnew{...}};
-
-\draw[-latex, out=70, in=-120] (c1_1.90) node[xshift=-0.4em,yshift=1.2em]{$ \mathbi{h}_ \mathbi{i}^ \mathbi{j}$}to (add1.-90);
-\draw[-latex, out=80, in=-100] (c1_2.90) to (add1.-90);
-\draw[-latex, out=100, in=-80] (c1_3.90) to (add1.-90);
-\draw[-latex, out=110, in=-60] (c1_4.90) to (add1.-90);
-
-\draw[-latex, out=60, in=-140] (c2_1.90) to (add2.-90);
-\draw[-latex, out=70, in=-120] (c2_2.90) to (add2.-90);
-\draw[-latex, out=80, in=-100] (c2_3.90) to (add2.-90);
-\draw[-latex, out=100, in=-80] (c2_4.90) to (add2.-90);
-\draw[-latex, out=110, in=-60] (c2_5.90) to (add2.-90);
-\draw[-latex, out=120, in=-40] (c2_6.90) to (add2.-90);
-
-\draw[-latex, out=20, in=-150] (add1.90) node[xshift=-0.4em,yshift=1.2em]{$ \mathbi{s}^ \mathbi{j}$} to (add3.-90);
-\draw[-latex, out=160, in=-30] (add2.90) to (add3.-90);
-\draw[-latex] (add3.90) -- (box5.-90);
-\draw[-latex] (box5.0) -- node[xshift=-3em,above]{$ \mathbi{d}_\mathbi{t}$}(sigma.180);
-\draw[-latex, ugreen!60] (qs.180) node[xshift=-1em,above,text=black]{$ \mathbi{q}_\mathbi{s}$}-- (add3.0);
-\draw[-, ugreen!60] (qw.180) node[xshift=-1em,above,text=black]{$ \mathbi{q}_\mathbi{w}$}-- (add2.0);
-\draw[-latex, ugreen!60] (add2.180) -- (add1.0);
-
-\draw[-latex] (n1.130) -- (qw.0);
-\draw[-latex] (n1.120) -- (qs.0);
-\draw[-latex] (n1.90) node[yshift=1em,right]{$ \mathbi{h}_\mathbi{t}$}-- (sigma.-90);
-\draw[-latex] (sigma.90) -- (n2.-90);
-\draw[-latex] (n2.90) -- node[right]{$ \widetilde{\mathbi{h}}_\mathbi{t}$}([yshift=2em]n2.90);
+\node[draw=violet,densely dotted,minimum width=1.9em, minimum height=2.1em,very thick] (n1) at (24.4em,0em){};
+\node[draw=violet,densely dotted,minimum width=1.8em, minimum height=2em,very thick] (n2) at (24.4em,10.4em){};
+\node[] at (24.4em, -1.5em){$\mathbi{x}_\mathbi{t}$};
+\node[text=ublue] at (10.5em, 0em) {\small\bfnew{...}};
+
+\draw[->,thick, out=70, in=-120] ([yshift=0.1em]c1_1.90) node[xshift=-0.4em,yshift=1.2em]{$ \mathbi{h}_ \mathbi{i}^ \mathbi{j}$}to ([yshift=-0.1em]add1.-90);
+\draw[->,thick, out=80, in=-100] ([yshift=0.1em]c1_2.90) to ([yshift=-0.1em]add1.-90);
+\draw[->,thick, out=100, in=-80] ([yshift=0.1em]c1_3.90) to ([yshift=-0.1em]add1.-90);
+\draw[->,thick, out=110, in=-60] ([yshift=0.1em]c1_4.90) to ([yshift=-0.1em]add1.-90);
+
+\draw[->,thick, out=70, in=-110] ([yshift=0.1em]c2_1.90) to ([yshift=-0.1em]add2.-90);
+\draw[->,thick, out=90, in=-90] ([yshift=0.1em]c2_2.90) to ([yshift=-0.1em]add2.-90);
+\draw[->,thick, out=110, in=-70] ([yshift=0.1em]c2_3.90) to ([yshift=-0.1em]add2.-90);
+
+
+\draw[->,thick, out=30, in=-130] ([yshift=0.1em]add1.90) node[xshift=-0.4em,yshift=1.1em]{$ \mathbi{s}^ \mathbi{j}$} to ([yshift=-0.1em]add3.-120);
+\draw[->,thick, out=150, in=-50] ([yshift=0.1em]add2.90) to ([yshift=-0.1em]add3.-70);
+\draw[->,thick, ugreen!60,out=160,in=-10] ([xshift=-0.1em]qs.160) node[xshift=-0.3em,yshift=0.1em,above,text=black]{$ \mathbi{q}_\mathbi{s}$} to ([xshift=0.1em]add3.0);
+\draw[->,thick, ugreen!60,out=180,in=0] ([xshift=-0.1em]qw.180) node[xshift=-0.3em,yshift=0.4em,above,text=black]{$ \mathbi{q}_\mathbi{w}$} to ([xshift=0.1em]add2.0);
+\draw[->,thick, ugreen!60,out=170,in=-10] ([xshift=-0.1em]qw.160) to ([xshift=0.1em]add1.0);
+
+\draw[->,thick] ([yshift=0.1em]n1.135) .. controls ([xshift=-2em]n1.130) and ([xshift=2em]qw.0) .. ([xshift=0.1em]qw.0);
+\draw[->,thick] ([yshift=0.1em]n1.120) .. controls ([xshift=-2em,yshift=1em]n1.120) and ([xshift=3em]qs.0) .. ([xshift=0.1em]qs.0);
+\draw[->,thick] ([yshift=0.1em]n1.90) node[yshift=1em,right]{$ \mathbi{h}_\mathbi{t}$}-- ([yshift=-0.1em]sigma.-90);
+\draw[->,thick] ([yshift=0.1em]sigma.90) -- ([yshift=-0.1em]n2.-90);
+\draw[->,thick] ([yshift=0.1em]n2.90) -- node[right]{$ \widetilde{\mathbi{h}}_\mathbi{t}$}([yshift=2em]n2.90);

 \draw[decorate,decoration={brace, mirror},gray, thick] ([yshift=-2em]box1.-180) -- node[font=\scriptsize,text=black,below]{前几句}([yshift=-2em]box2.0);
 \draw[decorate,decoration={brace, mirror},gray, thick] ([yshift=-2em]box3.-180) -- node[font=\scriptsize,text=black,below]{当前句}([yshift=-2em]box3.0);
+\draw[->, thick, rounded corners=2pt] ([yshift=0.1em]add3.90) -- ([yshift=2.1em]add3.90) -- ([xshift=-0.1em]sigma.180);


 %annotation
-\node[fill=ublue,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a1) at (2em,-4.5em) {};
+\node[fill=red!30,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a1) at (2em,-4.5em) {};
 \node[anchor=west,font=\footnotesize] (w1) at ([xshift=0.4em]a1.east) {编码表示};

-\node[anchor=west,fill=ugreen,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a2) at ([xshift=2em]w1.east) {};
+\node[anchor=west,fill=orange!30,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a2) at ([xshift=2em]w1.east) {};
 \node[anchor=west,font=\footnotesize] (w2)at ([xshift=0.4em]a2.east) {层次注意力};

-\node[anchor=west,fill=orange,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a3) at ([xshift=2em]w2.east) {};
+\node[anchor=west,fill=blue!30,rounded corners=1pt,inner sep=0pt,minimum size=1em] (a3) at ([xshift=2em]w2.east) {};
 \node[anchor=west,font=\footnotesize] at ([xshift=0.4em]a3.east) {融合上下文信息的编码表示};
 \end{tikzpicture}


--- a/Chapter17/Figures/figure-modeling-a-global-approach-to-visual-characteristics.tex
+++ b/Chapter17/Figures/figure-modeling-a-global-approach-to-visual-characteristics.tex
@@ -24,13 +24,12 @@
 \draw[-, very thick,fill=black] ([xshift=-0.6em,yshift=-1.2em]B\x)  -- ([xshift=-0.3em,yshift=-1em]B\x) -- ([yshift=-1.2em]B\x) --([xshift=0.3em,yshift=-1em]B\x) -- ([xshift=0.6em,yshift=-1.2em]B\x) -- (D\x) -- (C\x) -- ([xshift=-0.6em,yshift=-1.2em]B\x);
 \draw[-, very thick,fill=black] (E\x) -- ([xshift=0.2em,yshift=0.3em]E\x) -- ([xshift=0.33em]F\x) -- (F\x) -- (E\x);
 \node[circle,inner sep=0pt,minimum size=0.4em,fill=black] at ([xshift=-0.7em,yshift=-0.2em]B\x){};
-\node[draw,rounded corners=2pt,fill=yellow!20,minimum width=2.3cm,minimum height=1cm](cnn\x) at ([xshift=1.8em,yshift=3.6em]A\x){CNN};
+\node[draw,rounded corners=2pt,fill=yellow!20,minimum width=2.3cm,minimum height=2.2em](cnn\x) at ([xshift=1.8em,yshift=3.6em]A\x){CNN};
 }
-
-\node[draw,anchor=south,rounded corners=2pt,minimum width=4.0cm,minimum height=1cm,fill=red!20](encoder) at ([yshift=2.6em,xshift=2.2em]cnn1.north){编码器};
+\node[draw,anchor=south,rounded corners=2pt,minimum width=4.0cm,minimum height=2.2em,fill=red!20](encoder) at ([yshift=2.6em,xshift=2.2em]cnn1.north){编码器};
 \node[anchor=north,font=\Large](x) at ([xshift=2.5em,yshift=-3.4em]encoder.south){$\seq{x}$};

-\node[draw,anchor=south,rounded corners=2pt,minimum width=4.0cm,minimum height=1cm,fill=blue!20](decoder) at ([yshift=2.6em,xshift=2.2em]cnn2.north){解码器};
+\node[draw,anchor=south,rounded corners=2pt,minimum width=4.0cm,minimum height=2.2em,fill=blue!20](decoder) at ([yshift=2.6em,xshift=2.2em]cnn2.north){解码器};
 \node[anchor=north,font=\Large](y) at ([xshift=2.5em,yshift=-3.4em]decoder.south){$\seq{y}$};
 \node[anchor=south,font=\Large](y_1) at ([yshift=3em]decoder.north){$\seq{y}'$};


--- a/Chapter17/Figures/figure-picture-translation.tex
+++ b/Chapter17/Figures/figure-picture-translation.tex
+\begin{tikzpicture}[node distance = 0]
+\tikzstyle{every node}=[scale=0.9]
+\begin {scope}
+\node[draw=white,scale=0.6] (input) at (0,0){\includegraphics[width=0.62\textwidth]{./Chapter17/Figures/figure-bank-without-attention.png}};(1.9,-1.4);
+\node[anchor=south] (english1) at ([xshift=0em,yshift=-2.5em]input.south) {\begin{tabular}{l}{\large\bfnew{英语}}{\Large{：A medium sized child}}\end{tabular}};
+\node[anchor=south] (english2) at ([xshift=1.9em,yshift=-1.2em]english1.south) {\begin{tabular}{l}{\Large{jumps off a dusty {\red{\underline{bank}}}.}} \end{tabular}};
+\end {scope}
+\node[draw,thick,inner sep=0pt,minimum height=16em,minimum width=19em,rounded corners=8pt][fit = (input) (english1)(english2)] (box1) at (0em,-1.5em){};
+\begin {scope}[xshift=1.45in,yshift=-0.2in]
+\draw[-,thick] (0,0.2) -- (1,0.2) -- (1,0.4) --(1.5,0) -- (1,-0.4) -- (1,-0.2) -- (0,-0.2) -- (0,0.2);
+\end {scope}
+\begin {scope}[xshift=4.4in,yshift=-0.2in]
+\node[anchor=east] (de1) {\begin{tabular}{l}{\large\bfnew{汉语}}{\Large{：一个半大孩子从尘土}}\end{tabular}};
+\node[anchor=south] (de2) at ([xshift=2em,yshift=-1.5em]de1.south) {\begin{tabular}{l}{\Large{飞扬的{\red{\underline{河床}}}上跳下来。}} \end{tabular}};
+\end {scope}
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-speech-recognition-model-based-on-transformer.tex
+++ b/Chapter17/Figures/figure-speech-recognition-model-based-on-transformer.tex
@@ -17,11 +17,11 @@
 \node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
 \node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};

-\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=2em]de_ffn.north){Softmax};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=1.6em]de_ffn.north){Softmax};
 \node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){Output Probabilities};

 \node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Speech Feature\\(FilterBank/MFCC)};
-\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){Transcription\\(Embedding)};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1.1em]de_add.south){Transcription\\(Embedding)};

 \node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Position\\(Embedding)};
 \node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Position\\(Embedding)};
@@ -40,8 +40,8 @@
 \draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
 \draw[->,rounded corners=2pt] ([yshift=0.1em]en_ffn.90) -- ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
 \begin{pgfonlayer}{background}
-\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(en_sa)(en_ffn)](box1){};
-\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(de_sa)(de_ca)(de_ffn)](box2){};
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(en_sa)(en_ffn)](box1){};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(de_sa)(de_ca)(de_ffn)](box2){};
 \end{pgfonlayer}

 \node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};

--- a/Chapter17/Figures/figure-speech-translation-model-based-on-CTC.tex
+++ b/Chapter17/Figures/figure-speech-translation-model-based-on-CTC.tex
@@ -19,11 +19,11 @@

 \node[layer,anchor=south,fill=blue!20] (en_sf) at ([yshift=3em]en_ffn.north){Softmax};
 \node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=2em]de_ffn.north){Softmax};
-\node[layer,anchor=south,fill=orange!20] (en_output) at ([yshift=1.4em]en_sf.north){CTC输出};
-\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){语音翻译输出};
+\node[layer,anchor=south,fill=orange!20] (en_output) at ([yshift=1.4em]en_sf.north){CTC Output};
+\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){ST Output};

-\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){语音特征\\(FilterBank/MFCC)};
-\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){目标文本\\(Embedding)};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Speech Feature\\(FilterBank/MFCC)};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){Target Text\\(Embedding)};

 \node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Position\\(Embedding)};
 \node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Position\\(Embedding)};
@@ -44,13 +44,13 @@
 \draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
 \draw[->,rounded corners=2pt] ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
 \begin{pgfonlayer}{background}
-\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(en_sa)(en_ffn)]{};
-\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(de_sa)(de_ca)(de_ffn)]{};
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(en_sa)(en_ffn)]{};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt,dashed,thick][fit=(de_sa)(de_ca)(de_ffn)]{};
 \end{pgfonlayer}

 \node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};
 \node[anchor=west,font=\scriptsize,text=red] at ([xshift=0.1em]box2.east){$\times N$};
 \node[anchor=east,font=\scriptsize] at ([xshift=-0.1em]en_cnn.west){$2 \times$};
-\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){语音翻译\\编码器};
-\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){语音翻译\\解码器};
+\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){ST\\Encoder};
+\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){ST\\Decoder};
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-the-encoder-explicitly-incorporates-semantic-information.tex
+++ b/Chapter17/Figures/figure-the-encoder-explicitly-incorporates-semantic-information.tex
@@ -10,24 +10,24 @@
 \node(bank)[word, below of = jump, yshift=-0.75cm, fill=blue!65]{bank};
 \node(sky)[word, below of = bank, yshift=-0.75cm, fill=blue!30]{sky};
 \node(tree)[word, below of = sky, yshift=-0.75cm, fill=blue!15]{tree};
-\node(cir)[circle,very thick, minimum width=0.6cm, xshift=8cm,  draw=black]{};
-\node(decoder)[rectangle, rounded corners, minimum width=2.5cm, minimum height=1.2cm, right of = cir,xshift=3cm, draw=black, fill=blue!25]{\large{解码器}};
+\node(cir)[circle,thick, minimum width=0.6cm, xshift=8cm,  draw=black]{};
+\node(decoder)[rectangle, rounded corners, minimum height=2.2em,minimum width=4.3em, right of = cir,xshift=3cm, draw=black, fill=blue!25]{\large{解码器}};
 \node(yn_1)[below of = decoder,yshift=-2cm,scale=1.2]{$\rm y_{n-1}$};
 \node(yn_2)[above of = decoder,yshift=2cm,scale=1.2]{$\rm y'_{n-1}$(bank)};

-\draw[->, very thick]([xshift=0.1cm]figure.east)to([xshift=2cm]figure.east);
-\draw[-,very thick]([xshift=-0.03cm]cir.east)to([xshift=0.03cm]cir.west);
-\draw[-,very thick]([yshift=0.03cm]cir.south)to([yshift=-0.03cm]cir.north);
-\draw[->, very thick]([xshift=0.1cm]cir.east)to([xshift=-0.1cm]decoder.west);
-\draw[->, very thick](yn_1)to([yshift=-0.1cm]decoder.south);
-\draw[->, very thick]([yshift=0.1cm]decoder.north)to(yn_2);
+\draw[->, thick]([xshift=0.1cm]figure.east)to([xshift=2cm]figure.east);
+\draw[-,thick]([xshift=-0.03cm]cir.east)to([xshift=0.03cm]cir.west);
+\draw[-,thick]([yshift=0.03cm]cir.south)to([yshift=-0.03cm]cir.north);
+\draw[->, thick]([xshift=0.1cm]cir.east)to([xshift=-0.1cm]decoder.west);
+\draw[->, thick](yn_1)to([yshift=-0.1cm]decoder.south);
+\draw[->, thick]([yshift=0.1cm]decoder.north)to(yn_2);

 \draw[->, thick, color=blue!45]([xshift=0.05cm]river.east)to([xshift=-0.05cm]cir.west);
 \draw[->, thick, color=blue!45]([xshift=0.05cm]mountain.east)to([xshift=-0.05cm]cir.west);
 \draw[->, thick, color=blue!15]([xshift=0.05cm]child.east)to([xshift=-0.05cm]cir.west);
 \draw[->, thick, color=blue!25]([xshift=0.05cm]man.east)to([xshift=-0.05cm]cir.west);
 \draw[->, thick, color=blue!30]([xshift=0.05cm]jump.east)to([xshift=-0.05cm]cir.west);
-\draw[->, thick, color=blue!65]([xshift=0.05cm]bank.east)to([xshift=-0.05cm]cir.west);
+\draw[->, very thick, color=blue!65]([xshift=0.05cm]bank.east)to([xshift=-0.05cm]cir.west);
 \draw[->, thick, color=blue!30]([xshift=0.05cm]sky.east)to([xshift=-0.05cm]cir.west);
 \draw[->, thick, color=blue!15]([xshift=0.05cm]tree.east)to([xshift=-0.05cm]cir.west);
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-three-ways-of-dual-decoder-speech-translation.tex
+++ b/Chapter17/Figures/figure-three-ways-of-dual-decoder-speech-translation.tex
-\tikzstyle{coder} = [rectangle,thick,rounded corners,minimum width=2.3cm,minimum height=1cm,text centered,draw=black!70,fill=red!20]
+\tikzstyle{coder} = [rectangle,thick,rounded corners,minimum height=2.2em,minimum width=4.3em,text centered,draw=black!70,fill=red!20]

 \begin{tikzpicture}[node distance = 0,scale = 0.75]
 \tikzstyle{every node}=[scale=0.75]
@@ -19,7 +19,7 @@
 \node [anchor=south,scale=1.2] (node1) at ([xshift=-2.0em,yshift=6em]decoder_1.north) {{$x,y$：语言数据}};
 \node [anchor=north,scale=1.2] (node2) at ([xshift=0.6em]node1.south){{$s$：语音数据}};
 %%%%%%%%%%%%%%%%%%%%%%%%级联
-\node(encoder-2)[coder]at ([xshift=10.0em]encoder.east){\large{编码器}};
+\node(encoder-2)[coder]at ([xshift=12.0em]encoder.east){\large{编码器}};
 \node(decoder_1-2)[coder,above of =encoder-2,yshift=1.4cm,fill=blue!20]{\large{解码器}};
 \node(decoder_2-2)[coder,above of =decoder_1-2, yshift=1.4cm,fill=yellow!20]{\large{解码器}};
 \node(s-2)[below of = encoder-2,yshift=-1.8cm,scale=1.6]{$s$};

--- a/Chapter17/Figures/figure-traditional-methods-of-image-description.tex
+++ b/Chapter17/Figures/figure-traditional-methods-of-image-description.tex
@@ -17,14 +17,14 @@
 \node(surd-1)[right of = text_3-1, xshift=2cm,scale=1.5]{\textcolor{red}{$\surd$}};
 \node(text_4-1)[description, right of = figure-1, xshift=5.2cm,yshift=-0.9cm,fill=color_blue]{\textcolor{white}{男人戴着眼镜。}};
 \node(point-1)[right of = figure-1, xshift=5cm,yshift=-1.4cm,scale=1.5]{...};
-\draw[->,very thick](figure-1)to([xshift=-0.1cm]ground-1.west);
+\draw[->,thick](figure-1)to([xshift=-0.1cm]ground-1.west);

 \node(figure)[draw=white,scale=0.25]at ([xshift=20.0em]figure-1.east){\includegraphics[width=0.62\textwidth]{./Chapter17/Figures/figure-dog-with-hat.png}};
 \node(ground)[rectangle,rounded corners, minimum width=5cm, minimum height=1.5cm,right of = figure, xshift=5cm,yshift=-2.6em,fill=blue!20]{\large{图片中有\underline{\textcolor{red}{狗}}，\underline{\textcolor{red}{帽子}}，\underline{\quad\ }。}};
 \node(dog)[rectangle,rounded corners, minimum width=1cm, minimum height=0.7cm,right of = figure, xshift=3cm,yshift=1.5cm,thick, draw=color_orange,fill=color_orange!50]{狗};
 \node(hat)[rectangle,rounded corners, minimum width=1.5cm, minimum height=0.7cm,right of = figure, xshift=4.5cm,yshift=1.5cm,thick, draw=color_green,fill=color_green!50]{帽子};
-\draw[->, very thick,color=black!60](figure.east)to([xshift=-0.1cm]dog.west)node[left,xshift=-0.2cm,yshift=-0.1cm,color=black]{图片检测};
-\draw[->, very thick,color=black!60]([yshift=-0.1cm]hat.south)to([yshift=0.1cm]ground.north)node[right,xshift=-0.2cm,yshift=0.5cm,color=black]{模板填充};
+\draw[->, thick,color=black!60](figure.east)to([xshift=-0.1cm]dog.west)node[left,xshift=-0.2cm,yshift=-0.1cm,color=black]{图片检测};
+\draw[->, thick,color=black!60]([yshift=-0.1cm]hat.south)to([yshift=0.1cm]ground.north)node[right,xshift=-0.2cm,yshift=0.5cm,color=black]{模板填充};

 \node [anchor=north](pos1)at ([xshift=-3.8em,yshift=-0.5em]ground-1.south){（a）基于检索的图像描述生成范式};
 \node [anchor=north](pos2)at ([xshift=-3.8em,yshift=-0.5em]ground.south){（b）基于模板的图像描述生成范式};

--- a/Chapter17/Figures/figure-word-lattice.tex
+++ b/Chapter17/Figures/figure-word-lattice.tex
@@ -18,22 +18,22 @@

 \draw[->] (n0.0) -- node[word,above]{of /0.343}(n2.180);
 \draw[->] (n0.60) -- node[word,above,rotate=40]{a /0.499}(n1.-150);
-\draw[->] (n0.-50) -- node[word,above,rotate=-20]{our /0.116}(n3.150);
-\draw[->] (n0.-70) .. controls ([xshift=-8em]n4.180) and ([xshift=-8em]n4.180) .. node[above,word,xshift=3em,yshift=-0.6em]{that /0.039} (n4.180);
-\draw[->] (n4.0) .. node[word,above,xshift=-2em,yshift=-0.4em]{hostage /1} controls ([xshift=5em]n4.0) and ([yshift=-6em]n6.-90) .. (n6.-90);
-\draw[->] (n2.-90) -- node[word,above,rotate=-18,pos=0.55]{house /0.125}(n7.180);
+\draw[->] (n0.-50) -- node[word,above,rotate=-20]{their /0.116}(n3.150);
+\draw[->] (n0.-70) .. controls ([xshift=-8em]n4.180) and ([xshift=-8em]n4.180) .. node[above,word,xshift=3em,yshift=-0.6em]{that /0.042} (n4.180);
+\draw[->] (n4.0) .. node[word,above,xshift=-2em,yshift=-0.4em]{hospital /1} controls ([xshift=5em]n4.0) and ([yshift=-6em]n6.-90) .. (n6.-90);
+\draw[->] (n2.-90) -- node[word,above,rotate=-18,pos=0.55]{house /0.127}(n7.180);
 \draw[->] (n3.-10) node[word,above,xshift=3.6em,yshift=-0.8em]{conference /1} .. controls ([xshift=4.6em,yshift=-1.8em]n3.-10) and ([yshift=-1.6em,xshift=-3em]n10.-135) .. (n10.-135);
 \draw[->] (n7.0) -- node[word,above]{which /1}(n10.180);
-\draw[->] (n2.0) -- node[word,above,pos=0.5]{hostages /0.300}(n6.180);
+\draw[->] (n2.0) -- node[word,above,pos=0.5]{hospital /0.300}(n6.180);
 \draw[->] (n2.45) -- node[word,above,rotate=18,pos=0.3]{a /0.573}(n11.-135);
-\draw[->,rounded corners=1em] (n1.-45) node[word,above,xshift=1.4em,yshift=-1.3em,rotate=-43]{house /0.078} -- ([yshift=-0.4em,xshift=-1em]n11.-90) -- (n7.100);
+\draw[->,rounded corners=1em] (n1.-45) node[word,above,xshift=1.4em,yshift=-1.3em,rotate=-43]{house /0.079} -- ([yshift=-0.4em,xshift=-1em]n11.-90) -- (n7.100);
 \draw[->] (n1.20) node[word,above,xshift=4em]{conference /0.734} .. controls ([xshift=8em]n1.20) and  ([xshift=-0.6em,yshift=2.2em]n5.110) .. (n5.110);
 \draw[->] (n11.0) -- node[word,above]{conference /1}(n5.180);
 \draw[->] (n5.-90) ..node[word,above,xshift=1.4em]{is /0.773} controls ([yshift=-1.6em]n5.-90) and ([xshift=-3em]n6.150) .. (n6.150);
-\draw[->] (n5.0) node[word, above,xshift=1.4em]{as /0.226}.. controls ([xshift=2.6em]n5.0) and ([xshift=-0.6em,yshift=2em]n6.120) .. (n6.120);
+\draw[->] (n5.0) node[word, above,xshift=1.4em]{as /0.227}.. controls ([xshift=2.6em]n5.0) and ([xshift=-0.6em,yshift=2em]n6.120) .. (n6.120);

 \coordinate (a) at ([xshift=6em,yshift=3em]n1);
-\draw[->] (n1.60) .. controls ([xshift=3em,yshift=2em]n1.60) and ([xshift=-2em]a) .. (a) node[word,above,xshift=1em]{hostage /0.187}.. controls ([xshift=8em]a) and ([xshift=-0.6em,yshift=6em]n6.90) .. (n6.90);
+\draw[->] (n1.60) .. controls ([xshift=3em,yshift=2em]n1.60) and ([xshift=-2em]a) .. (a) node[word,above,xshift=1em]{hospital /0.187}.. controls ([xshift=8em]a) and ([xshift=-0.6em,yshift=6em]n6.90) .. (n6.90);
 \draw[->] (n10.0) -- node[above,word,pos=0.4,rotate=30]{is /1}(n6.-135);
 \draw[->] (n6.0) -- node[above,word,yshift=0.2em]{being /1}(n8.180);
 \draw[->] (n8.0) -- node[above,word,yshift=0.3em]{recorded /1}(n9.180);

--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -35,11 +35,17 @@

 \parinterval 长期以来，机器翻译的任务都是指句子级翻译。主要原因在于，句子级的翻译建模可以大大简化问题，使得机器翻译方法更容易进行实践和验证。但是人类使用语言的过程并不是孤立在一个个句子上进行的。这个问题可以类比于我们学习语言的过程：小孩成长过程中会接受视觉、听觉、触觉等多种信号，这些信号的共同作用使得他们产生对客观世界的“认识”，同时促使其使用“语言”进行表达。从这个角度说，语言能力并不是由单一因素形成的，它往往伴随着其他信息的相互作用，比如，当我们翻译一句话的时候，会用到看到的画面、听到的语调、甚至前面说过句子中的信息。

-\parinterval 从广义上讲，当前句子以外的信息都可以被看作是一种上下文。比如，图XXX中，需要把英语句子“XXX”翻译为汉语。但是，其中的“bank”有多个含义，因此仅仅使用英语句子本身的信息可能会将其翻译为“银行”，而非正确的译文“河床”。但是，图XXX中也提供了这个英语句子所对应的图片，显然图片中直接展示了河床，这时“bank”是没有歧义的。通常也会把这种使用图片和文字一起进行机器翻译的任务称作多模态机器翻译（参考文献）。
+\parinterval 从广义上讲，当前句子以外的信息都可以被看作是一种上下文。比如，图\ref{fig:17-1-18}中，需要把英语句子“A medium sized child jumps off a dusty bank”翻译为汉语。但是，其中的“bank”有多个含义，因此仅仅使用英语句子本身的信息可能会将其翻译为“银行”，而非正确的译文“河岸”。但是，图\ref{fig:17-1-18}中也提供了这个英语句子所对应的图片，显然图片中直接展示了河岸，这时“bank”是没有歧义的。通常也会把这种使用图片和文字一起进行机器翻译的任务称作{\small\bfnew{多模态机器翻译}}\index{多模态机器翻译}（Multi-Modal Machine Translation）\index{Multi-Modal Machine Translation}。

-\parinterval 图图
-
-\parinterval 所谓模态（Modality）是指某一种信息来源。例如，视觉、听觉、嗅觉、味觉都可以被看作是不同的模态。因此视频、语音、文字等都可以被看作是承载这些模态的媒介。在机器翻译中使用多模态这个概念，更多是为了区分某些不同于文字的信息。除了图像等视觉模态信息，机器翻译也可以利用语音模态信息。比如，直接对语音进行翻译，甚至直接用语音表达出翻译结果。
+%----------------------------------------------
+\begin{figure}[htp]
+    \centering
+\input{./Chapter17/Figures/figure-picture-translation}
+    \caption{多模态机器翻译实例}
+    \label{fig:17-1-18}
+\end{figure}
+%-------------------------------------------
+\parinterval {\small\bfnew{模态}}\index{模态}（Modality）\index{Modality}是指某一种信息来源。例如，视觉、听觉、嗅觉、味觉都可以被看作是不同的模态。因此视频、语音、文字等都可以被看作是承载这些模态的媒介。在机器翻译中使用多模态这个概念，更多是为了区分某些不同于文字的信息。除了图像等视觉模态信息，机器翻译也可以利用语音模态信息。比如，直接对语音进行翻译，甚至直接用语音表达出翻译结果。

 \parinterval 此外，除了不同信息源所引入的上下文，机器翻译也可以利用文字本身的上下文。比如，翻译一篇文章中的某个句子时，可以根据整个篇章的内容进行翻译。显然这种篇章的语境是有助于机器翻译的。在本章后面的内容中，会就机器翻译中使用不同上下文（多模态和篇章信息）的方法展开讨论。

@@ -69,7 +75,7 @@

 \parinterval 经过上面的描述，音频的表示实际上是一个非常长的采样点序列，这导致了直接使用现有的深度学习技术处理音频序列较为困难。并且，原始的音频信号中可能包含着较多的噪声、环境声或冗余信息也会对模型产生干扰。因此，一般会对音频序列进行处理来提取声学特征，具体为将长序列的采样点序列转换为短序列的特征向量序列，再用于下游系统模块。虽然已有一些工作不依赖特征提取，直接在原始的采样点序列上进行声学建模和模型训练\upcite{DBLP:conf/interspeech/SainathWSWV15}，但目前的主流方法仍然是基于声学特征进行建模\upcite{DBLP:conf/icassp/MohamedHP12}。

-\parinterval 声学特征提取的第一步是预处理。其流程主要是对音频进行预加重、分帧和加窗。预加重用来提升音频信号中的高频部分，目的是使频谱更加平滑。分帧（原理如图\ref{fig17-2}）是基于短时平稳假设，即根据生物学特征，语音信号是一个缓慢变化的过程，10ms~30ms的信号片段是相对平稳的。基于这个假设，一般将每25ms作为一帧来提取特征，这个时间称为{\small\bfnew{帧长}}\index{帧长}（Frame Length）\index{Frame Length}。同时，为了保证不同帧之间的信号平滑性，使每两个相邻帧之间存在一定的重合部分。一般每隔10ms取一帧，这个时长称为{\small\bfnew{帧移}}\index{帧移}（Frame Shift）\index{Frame Shift}。为了缓解分帧带来的频谱泄漏，对每帧的信号进行加窗处理使其幅度在两段渐变到0，一般采用的是{\small\bfnew{汉明窗}}\index{汉明窗}（Hamming）\index{Hamming}。
+\parinterval 声学特征提取的第一步是预处理。其流程主要是对音频进行预加重、分帧和加窗。预加重用来提升音频信号中的高频部分，目的是使频谱更加平滑。分帧（原理如图\ref{fig:17-2}）是基于短时平稳假设，即根据生物学特征，语音信号是一个缓慢变化的过程，10ms$\sim$30ms的信号片段是相对平稳的。基于这个假设，一般将每25ms作为一帧来提取特征，这个时间称为{\small\bfnew{帧长}}\index{帧长}（Frame Length）\index{Frame Length}。同时，为了保证不同帧之间的信号平滑性，使每两个相邻帧之间存在一定的重合部分。一般每隔10ms取一帧，这个时长称为{\small\bfnew{帧移}}\index{帧移}（Frame Shift）\index{Frame Shift}。为了缓解分帧带来的频谱泄漏，对每帧的信号进行加窗处理使其幅度在两端渐变到0，一般采用的是{\small\bfnew{汉明窗}}\index{汉明窗}（Hamming）\index{Hamming}。
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
@@ -79,7 +85,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------

-\parinterval 经过了上述的预处理操作，可以得到音频对应的帧序列，之后通过不同的操作来提取不同类型的声学特征。常用的声学特征包括{\small\bfnew{Mel频率倒谱系数}}\index{Mel频率倒谱系数}（Mel-Frequency Cepstral Coefficient, MFCC）\index{Mel-Frequency Cepstral Coefficient}、{\small\bfnew{感知线性预测系数}}\index{感知线性预测系数}（Perceptual Lienar Predictive, PLP）\index{Perceptual Lienar Predictive}、{\small\bfnew{滤波器组}}\index{滤波器组}（Filter-bank, Fbank）\index{Filter-bank}等。MFCC、PLP和Fbank特征都需要对预处理后的音频做{\small\bfnew{短时傅里叶变换}}\index{短时傅里叶变换}（Short-time Fourier Tranform, STFT）\index{Short-time Fourier Tranform}，得到具有规律的线性分辨率。之后再经过特定的操作，得到各种声学特征。不同声学特征的特点是不同的，MFCC去相关性较好，PLP抗噪性强，FBank可以保留更多的语音原始特征。在语音翻译中，比较常用的声学特征为FBank或MFCC\upcite{洪青阳2020语音识别原理与应用}。
+\parinterval 经过了上述的预处理操作，可以得到音频对应的帧序列，之后通过不同的操作来提取不同类型的声学特征。常用的声学特征包括{\small\bfnew{Mel频率倒谱系数}}\index{Mel频率倒谱系数}（Mel-Frequency Cepstral Coefficient，MFCC）\index{Mel-Frequency Cepstral Coefficient}、{\small\bfnew{感知线性预测系数}}\index{感知线性预测系数}（Perceptual Linear Predictive，PLP）\index{Perceptual Linear Predictive}、{\small\bfnew{滤波器组}}\index{滤波器组}（Filter-bank，Fbank）\index{Filter-bank}等。MFCC、PLP和Fbank特征都需要对预处理后的音频做{\small\bfnew{短时傅里叶变换}}\index{短时傅里叶变换}（Short-time Fourier Transform，STFT）\index{Short-time Fourier Transform}，得到具有规律的线性分辨率。之后再经过特定的操作，得到各种声学特征。不同声学特征的特点是不同的，MFCC去相关性较好，PLP抗噪性强，FBank可以保留更多的语音原始特征。在语音翻译中，比较常用的声学特征为FBank或MFCC\upcite{洪青阳2020语音识别原理与应用}。

 \parinterval 某种程度上讲，提取到的声学特征可以理解为计算机视觉中的像素特征，或者自然语言处理中的词嵌入表示。不同之处在于，声学特征更加复杂多变，可能存在着较多的噪声和冗余信息。此外，相比对应的文字序列，音频提取到的特征序列长度要大十倍以上。比如，人类正常交流中每秒钟一般可以说2-3个字，而每秒钟的语音可以提取得到100帧的特征序列。巨大的长度比差异也为语音翻译中对声学特征建模带来了困难。

@@ -246,7 +252,7 @@

 %----------------------------------------------------------------------------------------------------

-\parinterval 此外，研究人员们还探索了很多其他方法来提高语音翻译模型的性能。利用在海量的无标注语音数据上预训练的{\small\bfnew{自监督}}\index{自监督}（Self-supervised）\index{Self-supervised}模型作为一个特征提取器，将从语音中提取的特征作为语音翻译模型的输入，可以有效提高模型的性能\upcite{DBLP:conf/interspeech/WuWPG20}。相比语音翻译模型，文本翻译模型任务更加简单，因此一种思想是利用文本翻译模型来指导语音翻译模型，比如通过知识蒸馏\upcite{DBLP:conf/interspeech/LiuXZHWWZ19}、正则化\upcite{DBLP:conf/emnlp/AlinejadS20}等方法。为了简化语音翻译模型的学习，可以通过课程学习的策略，使模型从语音识别任务，逐渐过渡到语音翻译任务，这种由易到难的训练策略可以使模型训练更加充分\upcite{DBLP:journals/corr/abs-1802-06003,DBLP:conf/acl/WangWLZY20}。
+\parinterval 此外，研究人员还探索了很多其他方法来提高语音翻译模型的性能。利用在海量的无标注语音数据上预训练的{\small\bfnew{自监督}}\index{自监督}（Self-supervised）\index{Self-supervised}模型作为一个特征提取器，将从语音中提取的特征作为语音翻译模型的输入，可以有效提高模型的性能\upcite{DBLP:conf/interspeech/WuWPG20}。相比语音翻译模型，文本翻译模型任务更加简单，因此一种思想是利用文本翻译模型来指导语音翻译模型，比如通过知识蒸馏\upcite{DBLP:conf/interspeech/LiuXZHWWZ19}、正则化\upcite{DBLP:conf/emnlp/AlinejadS20}等方法。为了简化语音翻译模型的学习，可以通过课程学习的策略，使模型从语音识别任务，逐渐过渡到语音翻译任务，这种由易到难的训练策略可以使模型训练更加充分\upcite{DBLP:journals/corr/abs-1802-06003,DBLP:conf/acl/WangWLZY20}。

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
@@ -271,7 +277,7 @@

 \subsection{基于图像增强的文本翻译}

-\parinterval 在文本翻译中引入图像信息是最典型的多模态机器翻译任务。虽然多模态机器翻译还是一种从源语言文字到目标语言文字的转换，但是在转换的过程中，融入了其他模态的信息减少了歧义的产生。例如前文提到的通过与源语言相关的图像信息，将“A medium sized  child jumps off of a dusty bank”中“bank”译为“河岸”而不是“银行”，通过给定一张相关的图片，机器翻译模型就可以利用视觉信息更好的理解歧义词，避免产生歧义。换句话说，对于同一图像或者视觉场景的描述，源语言和目标语言描述的本质意义是一致的，只不过，体现在语言上会有表达方法上的差异。那么，图像就会存在一些源语言和目标语言的隐含对齐“约束”，将这种“约束”融入到机器翻译系统，会让模型加深对某些歧义词语上下文的理解，从而进一步提高机器翻译质量。
+\parinterval 在文本翻译中引入图像信息是最典型的多模态机器翻译任务。虽然多模态机器翻译还是一种从源语言文字到目标语言文字的转换，但是在转换的过程中，融入了其他模态的信息减少了歧义的产生。例如前文提到的通过与源语言相关的图像信息，将“A medium sized child jumps off of a dusty bank”中“bank”翻译为“河岸”而不是“银行”，通过给定一张相关的图片，机器翻译模型就可以利用视觉信息更好地理解歧义词，避免产生歧义。换句话说，对于同一图像或者视觉场景的描述，源语言和目标语言描述的本质意义是一致的，只不过，体现在语言上会有表达方法上的差异。那么，图像就会存在一些源语言和目标语言的隐含对齐“约束”，将这种“约束”融入到机器翻译系统，会让模型加深对某些歧义词语上下文的理解，从而进一步提高机器翻译质量。

 \parinterval WMT机器翻译评测在2016年首次将融合图像和文本的多模态机器翻译作为机器翻译和跨语言图像描述的共享任务\upcite{DBLP:conf/wmt/SpeciaFSE16}，这项任务也受到了广泛的研究\upcite{DBLP:conf/wmt/CaglayanABGBBMH17,DBLP:conf/wmt/LibovickyHTBP16}。如何融入视觉信息，更好的理解多模态上下文语义是多模态机器翻译研究的热点，大体的研究方向包括基于特征融合的方法\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:journals/corr/abs-1712-03449,DBLP:conf/wmt/HelclLV18}、基于多任务学习的方法\upcite{DBLP:conf/ijcnlp/ElliottK17,DBLP:conf/acl/YinMSZYZL20}。接下来将从这两个方向，对多模态机器翻译的研究展开介绍。

@@ -285,7 +291,7 @@

 \begin{itemize}
    \vspace{0.5em}
-    \item 图像信息不全都是有用的，往往存在一些与源语言或目标语言无关的信息，作为全局特征会引入噪音
+    \item 图像信息不全都是有用的，往往存在一些与源语言或目标语言无关的信息，作为全局特征会引入噪音。
    \vspace{0.5em}
    \item 图像信息作为源语言的一部分或者初始化状态，间接参与目标语言单词的生成，在循环神经网络信息传递的过程中，图像信息会有一定的损失。
    \vspace{0.5em}
@@ -327,7 +333,7 @@

 \noindent 其中，${\alpha}_{i,j}$是注意力权重，它表示目标语言第$j$个位置与图片编码状态序列第$i$个位置的相关性大小，计算方式与{\chapterten}描述的注意力函数一致。

-\parinterval 这里，将每个时间步编码器的输出$\mathbi{h}_{i}$看作源图像序列位置$i$的表示结果。图3说明了模型在生成目标词“man”时，图像经过注意力机制对图像区域关注度的可视化效果，可以看到，经过注意力机制后，模型更注重的是与目标词相关的图像部分。当然，多模态机器翻译的输入还包括源语言文字序列。通常，源语言文字对于翻译的作用比图像更大\upcite{DBLP:conf/acl/YaoW20}。从这个角度说，图像信息更多的是作为文字信息的补充，而不是替代。除此之外，注意力机制在多模态机器翻译中也有很多研究，不仅仅在解码器端将经过注意力机制的文本特征和视觉特征作为解码输入的一部分，还有的工作在编码端将源语言与图像信息进行注意力建模\upcite{DBLP:journals/corr/abs-1712-03449,DBLP:conf/acl/YaoW20}，得到更好的源语言特征表示。
+\parinterval 这里，将每个时间步编码器的输出$\mathbi{h}_{i}$看作源图像序列位置$i$的表示结果。图\ref{fig:17-12}说明了模型在生成目标词“bank”时，图像经过注意力机制对图像区域关注度的可视化效果，可以看到，经过注意力机制后，模型更注重的是与目标词相关的图像部分。当然，多模态机器翻译的输入还包括源语言文字序列。通常，源语言文字对于翻译的作用比图像更大\upcite{DBLP:conf/acl/YaoW20}。从这个角度说，图像信息更多的是作为文字信息的补充，而不是替代。除此之外，注意力机制在多模态机器翻译中也有很多研究，不仅仅在解码器端将经过注意力机制的文本特征和视觉特征作为解码输入的一部分，还有的工作在编码器端将源语言与图像信息进行注意力建模\upcite{DBLP:journals/corr/abs-1712-03449,DBLP:conf/acl/YaoW20}，得到更好的源语言特征表示。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -402,7 +408,7 @@

 \parinterval 要想使编码器-解码器框架在图像描述中充分发挥作用，编码器也要更好的表示图像信息。对于编码器的改进，大多也是从这个方向出发。通常，体现在向编码器中添加图像的语义信息\upcite{DBLP:conf/cvpr/YouJWFL16,DBLP:conf/cvpr/ChenZXNSLC17,DBLP:journals/pami/FuJCSZ17}和位置信息\upcite{DBLP:conf/cvpr/ChenZXNSLC17,DBLP:conf/ijcai/LiuSWWY17}。

-\parinterval 图像的语义信息一般是指图像中存在的实体、属性、场景等等。如图\ref{fig:17-16}所示，从图像中利用属性或实体检测器提取出“child”、“river”、“bank”等等的属性词和实体词作为图像的语义信息，提取全局的图像特征初始化循环神经网络，再利用注意力机制计算目标词与属性词或实体词之间的注意力权重，根据该权重计算上下文向量，从而将编码语义信息送入解码端\upcite{DBLP:conf/cvpr/YouJWFL16}，在解码‘bank’单词时，会更关注图像语义信息中的‘bank’。当然，除了图像中的实体和属性作为语义信息外，也可以将图片的场景信息也加入到编码器当中\upcite{DBLP:journals/pami/FuJCSZ17}。有关如何做属性、实体和场景的检测，涉及到目标检测任务的工作，例如Faster-RCNN\upcite{DBLP:journals/pami/RenHG017}、YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}等等,这里不过多赘述。
+\parinterval 图像的语义信息一般是指图像中存在的实体、属性、场景等等。如图\ref{fig:17-16}所示，从图像中利用属性或实体检测器提取出“child”、“river”、“bank”等等的属性词和实体词作为图像的语义信息，提取全局的图像特征初始化循环神经网络，再利用注意力机制计算目标词与属性词或实体词之间的注意力权重，根据该权重计算上下文向量，从而将编码语义信息送入解码器端\upcite{DBLP:conf/cvpr/YouJWFL16}，在解码单词“bank”时，会更关注图像语义信息中的“bank”。当然，除了图像中的实体和属性作为语义信息外，也可以将图片的场景信息加入到编码器当中\upcite{DBLP:journals/pami/FuJCSZ17}。有关如何做属性、实体和场景的检测，涉及到目标检测任务的工作，例如Faster-RCNN\upcite{DBLP:journals/pami/RenHG017}、YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}等等，这里不过多赘述。

 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
@@ -413,7 +419,7 @@
 \end{figure}
 %----------------------------------------------------------------------------------------------------

-\parinterval 以上的方法大都是将图像中的实体、属性、场景等映射到文字上，并把这些信息显式地添加到编码器端。令一种方式，把图像中的语义特征隐式地作用到编码器端\upcite{DBLP:conf/cvpr/ChenZXNSLC17}。例如，可以图像数据可以分解为三个通道（红、绿、蓝），简单来说，就是将图像的每一个像素点按照红色、绿色、蓝色分成三个部分，这样就将图像分成了三个通道。在很多图像中，不同通道随伴随的特征是不一样的，可以将其作用于编码器端。另一种方法是基于位置信息的编码器增强。位置信息指的是图像中对象（物体）的位置。利用目标检测技术检测系统获得图中的对象和对应的特征，这样就确定了图中的对象位置。显然，这些信息也可以加入到编码端，以加强编码器的表示能力\upcite{DBLP:conf/eccv/YaoPLM18}。
+\parinterval 以上的方法大都是将图像中的实体、属性、场景等映射到文字上，并把这些信息显式地添加到编码器端。另一种方式是把图像中的语义特征隐式地作用到编码器端\upcite{DBLP:conf/cvpr/ChenZXNSLC17}。例如，图像数据可以分解为三个通道（红、绿、蓝），简单来说，就是将图像的每一个像素点按照红色、绿色、蓝色分成三个部分，这样就将图像分成了三个通道。在很多图像中，不同通道伴随的特征是不一样的，可以将其作用于编码器端。另一种方法是基于位置信息的编码器增强。位置信息指的是图像中对象（物体）的位置。利用目标检测技术获得图中的对象和对应的特征，这样就确定了图中的对象位置。显然，这些信息也可以加入到编码器端，以加强编码器的表示能力\upcite{DBLP:conf/eccv/YaoPLM18}。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION
@@ -421,8 +427,9 @@

 \subsubsection{3. 解码器的改进}

-\parinterval 由于解码器输出的是语言文字序列，因此需要考虑语言的特点对其进行改进。 例如，解码过程中， “the”,“on”，“at”这种介词或者冠词与图像的相关性较低，这时图像信息的引入就会产生负面影响\upcite{DBLP:conf/cvpr/LuXPS17}。因此，可以通过门等结构，控制视觉信号作用于文字生成的程度。另外,在解码过程中，生成的每个单词对应着图像的区域可能是不同的。因此也可以设计更为有效的注意力机制来捕捉解码端对不同图像局部信息的关注程度\upcite{DBLP:conf/cvpr/00010BT0GZ18}。 
-\parinterval 除了在解码端更好的使生成文本与图像特征相互作用以外，还有一些其他的解码器端改进的方向。例如：用其它结构（如卷积神经网络或者Transformer）代替解码器端循环神经网络\upcite{DBLP:conf/cvpr/AnejaDS18}。或者使用更深层的神经网络学习动词或者名词等视觉中不易表现出来的单词\upcite{DBLP:journals/mta/FangWCT18}，其思想与深层神经机器翻译模型有相通之处（{\chapterfifteen}）。
+\parinterval 由于解码器输出的是语言文字序列，因此需要考虑语言的特点对其进行改进。例如，解码过程中，“the”、“on”、“at”这种介词或者冠词与图像的相关性较低，这时图像信息的引入就会产生负面影响\upcite{DBLP:conf/cvpr/LuXPS17}。因此，可以通过门等结构，控制视觉信号作用于文字生成的程度。另外，在解码过程中，生成的每个单词对应着图像的区域可能是不同的。因此也可以设计更为有效的注意力机制来捕捉解码器端对不同图像局部信息的关注程度\upcite{DBLP:conf/cvpr/00010BT0GZ18}。
+
+\parinterval 除了在解码器端更好地使生成文本与图像特征相互作用以外，还有一些其他的解码器端改进的方向。例如：用其它结构（如卷积神经网络或者Transformer）代替解码器端循环神经网络\upcite{DBLP:conf/cvpr/AnejaDS18}。或者使用更深层的神经网络学习动词或者名词等视觉中不易表现出来的单词\upcite{DBLP:journals/mta/FangWCT18}，其思想与深层神经机器翻译模型有相通之处（{\chapterfifteen}）。

 %----------------------------------------------------------------------------------------
 %    NEW SUB-SECTION
@@ -434,7 +441,7 @@

 \parinterval 计算机视觉领域，图像风格转移、图像语义分割、图像超分辨率等任务，都可以被视为{\small\bfnew{图像到图像的翻译}}\index{图像到图像的翻译}（Image-to-Image Translation）\index{Image-to-Image Translation}问题。与机器翻译类似，这些问题的共同目标是学习从一个对象到另一个对象的映射，只不过这里的对象是指图像，而非机器翻译中的文字。例如，给定物体的轮廓生成真实物体照片或者给定白天照片生成夜晚的照片等。图像到图像的翻译有广阔的应用场景，如图片补全、风格迁移等。

-\parinterval 对抗神经网络被广泛地应用再图像到图像的翻译任务当中\upcite{DBLP:conf/nips/GoodfellowPMXWOCB14,DBLP:conf/nips/ZhuZPDEWS17,DBLP:journals/corr/abs-1908-06616}。实际上，这类方法非常适合图像生成类的任务。简单来说，对抗生成网络包括两个部分分别是：生成器和判别器。基于输入生成器生成一个结果，而判别器要判别生成的结果和真实结果是否是相同的，对抗的思想是，通过强化生成器的生成能力和判别器的判别能力，当生成器生成的结果可以“骗”过判别器时，即判别器无法分清真实结果和生成结果，认为模型学到了这种映射关系。在图像到图像的翻译中，根据输入图像，生成器生成预测图像，判别器判别是否为目标图像，多次迭代后，生成图像被判别为目标图像时，则模型学习到了“翻译能力”。以上的工作都是有监督的，即基于对齐的图像对数据集，但是，这种数据的标注是极为费时费力的，所以有很多的工作也基于无监督的方法展开\upcite{DBLP:conf/iccv/ZhuPIE17,DBLP:conf/iccv/YiZTG17,DBLP:conf/nips/LiuBK17}，这里不过多赘述。
+\parinterval 对抗神经网络被广泛地应用在图像到图像的翻译任务当中\upcite{DBLP:conf/nips/GoodfellowPMXWOCB14,DBLP:conf/nips/ZhuZPDEWS17,DBLP:journals/corr/abs-1908-06616}。实际上，这类方法非常适合图像生成类的任务。简单来说，对抗生成网络包括两个部分分别是：生成器和判别器。基于输入生成器生成一个结果，而判别器要判别生成的结果和真实结果是否是相同的，对抗的思想是，通过强化生成器的生成能力和判别器的判别能力，当生成器生成的结果可以“骗”过判别器时，即判别器无法分清真实结果和生成结果，认为模型学到了这种映射关系。在图像到图像的翻译中，根据输入图像，生成器生成预测图像，判别器判别是否为目标图像，多次迭代后，生成图像被判别为目标图像时，则模型学习到了“翻译能力”。以上的工作都是有监督的，即基于对齐的图像对数据集，但是，这种数据的标注是极为费时费力的，所以有很多的工作也基于无监督的方法展开\upcite{DBLP:conf/iccv/ZhuPIE17,DBLP:conf/iccv/YiZTG17,DBLP:conf/nips/LiuBK17}，这里不过多赘述。

 \parinterval {\small\bfnew{文本到图像的翻译}}\index{文本到图像的翻译}（Text-to-Image Translation）\index{Text-to-Image Translation}是指给定描述物体颜色和形状等细节的一段自然语言文字，生成对应的图像。该任务也可以看作是图像描述任务的逆任务。目前方法上大部分基于对抗神经网络\upcite{DBLP:conf/icml/ReedAYLSL16,DBLP:journals/corr/DashGALA17,DBLP:conf/nips/ReedAMTSL16}。基本流程为：首先利用自然语言处理技术提取出文本特征，然后将文本特征作为后续图像生成的约束，由对抗神经网络中的生成器（Generator）根据该约束生成图像，再由鉴别器（Discriminator）鉴定其生成效果。

@@ -452,15 +459,19 @@

 \subsection{什么是篇章级翻译}

-\parinterval “篇章”在这里指一系列连续的段落或者句子所构成的整体，其中各个句子间从形式和内容上都具有一定的连贯性和一致性\upcite{jurafsky2000speech}。这些联系主要体现在{\small\sffamily\bfseries{衔接}}\index{衔接}（Cohesion \index{Cohesion}）以及{\small\sffamily\bfseries{连贯}}\index{连贯}（Coherence \index{Coherence}）两个方面。其中衔接体现在显性的语言成分和结构上，包括篇章中句子间语法和词汇上的联系，而连贯体现在各个句子之间逻辑和语义上的联系。因此，篇章级翻译的目的就是要考虑到这些上下文之间的联系，从而生成相比句子级翻译更连贯和准确的翻译结果（如表\ref{tab:17-3-1}）。但是由于不同语言的特性多种多样，上下文信息在篇章级翻译中的作用也不尽相同。比如在德语中名词是分词性的，因此在代词翻译的过程中需要根据其先行词的词性进行区分，而这种现象在其它不区分词性的语言中是不存在的。这导致篇章级翻译在不同的语种中可能对应多种不同的上下文现象。
+\parinterval “篇章”在这里指一系列连续的段落或者句子所构成的整体，其中各个句子间从形式和内容上都具有一定的连贯性和一致性\upcite{jurafsky2000speech}。这些联系主要体现在{\small\sffamily\bfseries{衔接}}\index{衔接}（Cohesion \index{Cohesion}）以及{\small\sffamily\bfseries{连贯}}\index{连贯}（Coherence \index{Coherence}）两个方面。其中衔接体现在显性的语言成分和结构上，包括篇章中句子间语法和词汇上的联系，而连贯体现在各个句子之间逻辑和语义上的联系。因此，篇章级翻译的目的就是要考虑到这些上下文之间的联系，从而生成相比句子级翻译更连贯和准确的翻译结果（如实例\ref{eg:17-1}）。但是由于不同语言的特性多种多样，上下文信息在篇章级翻译中的作用也不尽相同。比如在德语中名词是分词性的，因此在代词翻译的过程中需要根据其先行词的词性进行区分，而这种现象在其它不区分词性的语言中是不存在的。这导致篇章级翻译在不同的语种中可能对应多种不同的上下文现象。

-%----------------------------------------------------------------------------------------------------
-\begin{figure}[htp]
-\centering
-\caption{篇章级翻译中时态一致性的问题}
-\label{tab:17-3-1}
-\end{figure}
-%----------------------------------------------------------------------------------------------------
+\begin{example}
+上下文句子：我上周针对这个问题做出解释并咨询了他的意见。
+
+\hspace{2em} 待翻译句子：他也同意我的看法。
+
+\hspace{2em} 句子级翻译结果：He also agrees with me.
+
+\hspace{2em} 篇章级翻译结果：{\red{And}} he {\red{agreed}} with me.
+
+\label{eg:17-1}
+\end{example}

 \parinterval 正是由于这种上下文现象的多样性，使得篇章级翻译模型的性能评价相对困难。目前篇章级机器翻译主要针对一些常见上下文的现象，比如代词翻译、省略、连接和词汇衔接等，而{\chapterfour}介绍的BLEU等通用自动评价指标通常对这些上下文现象不敏感，篇章级翻译需要采用一些专用方法来对这些具体的现象进行评价。之前已经有一些研究工作针对具体的上下文现象提出了相应的评价标准并且在篇章级翻译中得到应用\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/acl/VoitaST19}，但是目前并没有达成共识，这也在一定程度上阻碍了篇章级机器翻译的进一步发展。我们将在\ref{sec:17-3-2}节中对这些评价标准进行介绍。

@@ -526,7 +537,7 @@ D_i&\subseteq&\{X_{-i},Y_{-i}\} \label{eq:17-3-2}
 \label{eg:17-3-1}
 \end{example}

-\parinterval 其他改进输入的做法相比于拼接的方法要复杂一些，首先需要对篇章进行处理，得到词汇链（Lexical Chain）\footnote{词汇链指篇章中语义相关的词所构成的序列}\upcite{DBLP:conf/wmt/GonzalesMS17}或者篇章嵌入\upcite{DBLP:journals/corr/abs-1910-07481}等信息，然后融入到当前句子的序列表示中，送入模型进行翻译。这种方式中上下文信息来自于预先提取的篇章表示，但是这种表示是否适合机器翻译还有待论证。
+\parinterval 其他改进输入的做法相比于拼接的方法要复杂一些，首先需要对篇章进行处理，得到词汇链\footnote{词汇链指篇章中语义相关的词所构成的序列}\upcite{DBLP:conf/wmt/GonzalesMS17}或者篇章嵌入\upcite{DBLP:journals/corr/abs-1910-07481}等信息，然后融入到当前句子的序列表示中，送入模型进行翻译。这种方式中上下文信息来自于预先提取的篇章表示，但是这种表示是否适合机器翻译还有待论证。

 %----------------------------------------------------------------------------------------
 %    NEW SUBSUB-SECTION

--- a/Chapter18/chapter18.tex
+++ b/Chapter18/chapter18.tex
@@ -249,7 +249,48 @@
 %    NEW SECTION
 %----------------------------------------------------------------------------------------

-%\section{机器翻译的应用场景}
+\section{机器翻译的应用场景}
+\parinterval 机器翻译有着十分广泛的应用，下面看一下机器翻译在生活中的具体应用形式：
+
+\parinterval （一）网页翻译
+
+\parinterval 进入信息爆炸的时代之后，互联网上海量的数据随处可得，然而由于国家和地区语言的不同，网络上的数据也呈现出多语言的特性。当人们在遇到包含不熟悉语言的网页时，无法及时有效地获取其中的信息。因此，对不同语言的网页进行翻译是必不可少的一步。由于网络上网页的数量数不胜数，依靠人工对网页进行翻译是不切实际的，相反，机器翻译十分适合这个任务。目前，市场上有很多浏览器提供网页翻译的服务，极大地简化了人们从网络上获取不同语言信息的难度。
+
+\parinterval （二）科技文献翻译
+
+\parinterval 在专利等科技文献翻译中，往往需要将文献翻译为英语或者其他语言，比如摘要翻译。以往这种翻译工作通常由人工来完成。由于对翻译结果的质量要求较高，因此要求翻译人员具有相关背景知识，这导致译员资源稀缺。特别是，近几年国内专利申请数不断增加，这给人工翻译带来了很大的负担。相比于人工翻译，机器翻译可以在短时间内完成大量的专利翻译，同时结合术语词典和人工校对等方式，可以保证专利的翻译质量。同时，以专利为代表的科技文献往往具有很强的领域性，针对各类领域文本进行单独优化，机器翻译的品质可以大大提高。因此，机器翻译在专利翻译等行业有十分广泛的应用前景。
+
+\parinterval （三）视频字幕翻译
+
+\parinterval 随着互联网的普及，人们可以通过互联网接触到大量境外影视作品。由于人们可能没有相应的外语能力，通常需要专业人员对字幕进行翻译。因此，这些境外视频的传播受限于字幕翻译的速度和准确度。现在的一些视频网站在使用语音识别为视频生成源语字幕的同时，通过机器翻译技术为各种语言的受众提供质量尚可的目标语言字幕，这种方式为人们提供了极大的便利。
+
+\parinterval （四）社交
+
+\parinterval 社交是人们的重要社会活动。人们可以通过各种各样的社交软件做到即时通讯，进行协作或者分享自己的观点。然而受限于语言问题，人们的社交范围往往不会超出自己所掌握的语种范围，很难方便地进行跨语言社交。随着机器翻译技术的发展，越来越多的社交软件开始支持自动翻译，用户可以轻易地将各种语言的内容翻译成自己的母语，方便了人们的交流，让语言问题不再成为社交的障碍。
+
+\parinterval （五）同声传译
+
+\parinterval 在一些国际会议中，与会者来自许多不同的国家，为了保证会议的流畅，通常需要专业译员进行同声传译。同声传译需要在不打断演讲的同时，不间断地将讲话内容进行口译，对翻译人员的素质要求极高，成本高昂。现在，一些会议开始采用语音识别来将语音转换成文本，同时使用机器翻译技术进行翻译的方式，达到同步翻译的目的。这项技术已经得到了多个企业的关注，并在很多重要会议上进行尝试，取得了很好的反响。不过同声传译要达到真正实用还需一定时间的打磨，特别是会议场景下，准确进行语音识别和翻译仍然具有挑战性。
+
+\parinterval （六）医药领域翻译
+
+\parinterval 在医药领域中，从药品研发、临床试验到药品注册，都有着大量的翻译需求。比如，在新药注册阶段，限定申报时间的同时，更是对翻译质量有着极高的要求。由于医药领域专业词汇量庞大、单词冗长复杂、术语准确且文体专业性强，翻译难度明显高于其他领域，人工翻译的方式代价大且很难满足效率的要求。为此，机器翻译近几年在医药领域取得广泛应用。在针对医药领域进行优化后，机器翻译质量可以很好地满足翻译的要求。
+
+\parinterval （七）中国传统语言文化的翻译
+
+\parinterval 中国几千年的历史留下了极为宝贵的文化遗产，而其中，文言文作为古代书面语，具有言文分离、行文简练的特点，易于流传。言文分离的特点使得文言文和现在的标准汉语具有一定的区别。为了更好发扬中国传统文化，我们需要对文言文进行翻译。而文言文古奥难懂，人们需要具备一定的文言文知识背景才能准确翻译。机器翻译技术也可以帮助人们快速完成文言文的翻译。除此之外，机器翻译技术同样可以用于古诗生成和对联生成等任务。
+
+\parinterval （八）全球化
+
+\parinterval 在经济全球化的今天，很多企业都有国际化的需求，企业员工或多或少地会遇到一些跨语言阅读和交流的情况，比如阅读进口产品的说明书，跨国公司之间的邮件、说明文件等等。相比于成本较高的人工翻译，机器翻译往往是一种很好的选择。在一些质量要求不高的翻译场景中，机器翻译可以得到应用。
+
+\parinterval （九）翻译机
+
+\parinterval 出于商务、学术交流或者旅游的目的，人们在出国时会面临着跨语言交流的问题。近几年，随着出境人数的增加，不少企业推出了翻译机产品。通过结合机器翻译、语音识别和图像识别技术，翻译机实现了图像翻译和语音翻译的功能。用户可以很便捷地获取一些外语图像文字和语音信息，同时可以通过翻译机进行对话，降低跨语言交流门槛。
+
+\parinterval （十）翻译结果后编辑
+
+\parinterval 翻译结果后编辑是指在机器翻译的结果之上，通过少量的人工编辑来进一步完善机器译文。在传统的人工翻译过程中，翻译人员完全依靠人工的方式进行翻译，这虽然保证了翻译质量，但是时间成本高。相对应的，机器翻译具有速度快和成本低的优势。在一些领域，目前的机器翻译质量已经可以很大程度上减小翻译人员的工作量，翻译人员可以在机器翻译的辅助下，花费相对较小的代价来完成翻译。

 %----------------------------------------------------------------------------------------
 %    NEW SECTION