Merge branch 'caorunzhe' into 'master'

Caorunzhe See merge request !672

Commit 64111b9e authored Dec 22, 2020 by 曹润柘
--- a/Chapter17/Figures/figure-an-end-to-end-voice-translation-model-based-on-transformer.tex
+++ b/Chapter17/Figures/figure-an-end-to-end-voice-translation-model-based-on-transformer.tex
+\begin{tikzpicture}
+	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=5em]
+	\tikzstyle{word}=[font=\scriptsize]
+\node[layer,fill=red!20] (en_sa) at (0,0){Multi-Head \\ Attention};
+\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.4em]en_sa.north){Feed Forward \\ Network};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north] (en_add) at ([yshift=-1.4em]en_sa.south){};
+\draw[] (en_add.90) -- (en_add.-90);
+\draw[] (en_add.0) -- (en_add.180);
+\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.4em]en_add.south){CNN};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west] (de_add) at ([xshift=7em]en_add.east){};
+\draw[] (de_add.90) -- (de_add.-90);
+\draw[] (de_add.0) -- (de_add.180);
+\node[layer,anchor=south,fill=red!20] (de_sa) at ([yshift=1.4em]de_add.north){Masked \\Multi-Head\\Attention};
+\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
+\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=2em]de_ffn.north){Softmax};
+\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){STLoss};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Speech Feature\\(FilterBank/MFCC)};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){Target Text\\(Embedding)};
+\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Position\\(Embedding)};
+\node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Position\\(Embedding)};
+\draw[->] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
+\draw[->] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
+\draw[->] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
+\draw[->] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_ffn.-90);
+\draw[->] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
+\draw[->] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
+\draw[->] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_ca.-90);
+\draw[->] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_ffn.-90);
+\draw[->] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]sf.-90);
+\draw[->] ([yshift=0.1em]sf.90) -- ([yshift=-0.1em]output.-90);
+\draw[->] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
+\draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
+\draw[->,rounded corners=2pt] ([yshift=0.1em]en_ffn.90) -- ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
+\begin{pgfonlayer}{background}
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(en_sa)(en_ffn)](box1){};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(de_sa)(de_ca)(de_ffn)](box2){};
+\end{pgfonlayer}
+\node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};
+\node[anchor=west,font=\scriptsize,text=red] at ([xshift=0.1em]box2.east){$\times N$};
+\node[anchor=east,font=\scriptsize] at ([xshift=-0.1em]en_cnn.west){$2 \times$};
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-examples-of-CTC-predictive-word-sequences.tex
+++ b/Chapter17/Figures/figure-examples-of-CTC-predictive-word-sequences.tex
+\begin{tikzpicture}
+\node[draw=white] (input) at (0,0){\includegraphics[width=0.62\textwidth]{./Chapter17/Figures/figure-hello-audio.png}};
+\node[anchor=east,font=\scriptsize,align=center]  (a1) at  ([xshift=2.0em]input.west) {Audio input};
+\node[minimum width=17.4em,minimum height=2.9em,draw=white,line width=3pt] at (0.3em,-0.02em){};
+\node[anchor=north,draw,rounded corners=2pt,minimum width=16em, minimum height=2.2em,fill=yellow!20] (box) at ([xshift=0.4em]input.south){};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w1) at ([xshift=0.2em]box.west){{h}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w2) at ([xshift=0.2em]w1.east){{e}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w3) at ([xshift=0.2em]w2.east){{e}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w4) at ([xshift=0.2em]w3.east){{$\epsilon$}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w5) at ([xshift=0.2em]w4.east){{l}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w6) at ([xshift=0.2em]w5.east){{$\epsilon$}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w7) at ([xshift=0.2em]w6.east){{l}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w8) at ([xshift=0.2em]w7.east){{l}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w9) at ([xshift=0.2em]w8.east){{o}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w10) at ([xshift=0.2em]w9.east){{o}};
+\node[anchor=west,minimum width=1.2em,minimum height=2.2em] (w11) at ([xshift=0.2em]w10.east){{!}};
+\draw[very thick] (w1.south west) -- (w1.south east);
+\draw[very thick] (w2.south west) -- (w2.south east);
+\draw[very thick] (w3.south west) -- (w3.south east);
+\draw[very thick] (w5.south west) -- (w5.south east);
+\draw[very thick] (w7.south west) -- (w7.south east);
+\draw[very thick] (w8.south west) -- (w8.south east);
+\draw[very thick] (w9.south west) -- (w9.south east);
+\draw[very thick] (w10.south west) -- (w10.south east);
+\draw[very thick] (w11.south west) -- (w11.south east);
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (m1) at ([yshift=-1em]w1.south){{h}};
+\node[anchor=north,minimum width=2.64em,minimum height=1.4em,fill=gray!30] (m2) at ([yshift=-1em,xshift=0.72em]w2.south){{e}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (m3) at ([yshift=-1em]w4.south){};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (m4) at ([yshift=-1em]w5.south){{l}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (m5) at ([yshift=-1em]w6.south){};
+\node[anchor=north,minimum width=2.64em,minimum height=1.4em,fill=gray!30] (m6) at ([yshift=-1em,xshift=0.72em]w7.south){{l}};
+\node[anchor=north,minimum width=2.64em,minimum height=1.4em,fill=gray!30] (m7) at ([yshift=-1em,xshift=0.72em]w9.south){{o}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (m8) at ([yshift=-1em]w11.south){{!}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (o1) at ([yshift=-3.8em]w1.south){{h}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (o2) at ([yshift=-3.8em]w2.south){{e}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (o3) at ([yshift=-3.8em]w3.south){{l}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (o4) at ([yshift=-3.8em]w4.south){{l}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (o5) at ([yshift=-3.8em]w5.south){{o}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] (o6) at ([yshift=-3.8em]w6.south){{!}};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] at ([yshift=-3.8em]w7.south){};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] at ([yshift=-3.8em]w8.south){};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] at ([yshift=-3.8em]w9.south){};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] at ([yshift=-3.8em]w10.south){};
+\node[anchor=north,minimum width=1.2em,minimum height=1.4em,fill=gray!30] at ([yshift=-3.8em]w11.south){};
+\draw[blue!40,fill=blue!30,opacity=0.7] (w1.south west) -- (w1.south east) -- (o1.south east) -- (o1.south west) -- (w1.south west);
+\draw[blue!40,fill=blue!30,opacity=0.7] (w2.south west) -- (w3.south east) -- (m2.south east) .. controls ([yshift=-0.3em]m2.south east) and ([yshift=0.3em]o2.north east) .. (o2.north east) -- (o2.south east) -- (o2.south west) -- (w2.south west);
+\draw[blue!40,fill=blue!30,opacity=0.7] (w5.south west) -- (w5.south east) -- (m4.south east) .. controls ([yshift=-0.3em]m4.south east) and ([yshift=0.3em]o3.north east) .. (o3.north east) -- (o3.south east) -- (o3.south west) -- (o3.north west) .. controls ([yshift=0.3em]o3.north west) and ([yshift=-0.3em]m4.south west) .. (m4.south west) -- (w5.south west);
+\draw[blue!40,fill=blue!30,opacity=0.7] (w7.south west) -- (w8.south east) -- (m6.south east) .. controls ([yshift=-0.3em]m6.south east) and ([yshift=0.3em]o4.north east) .. (o4.north east) -- (o4.south east) -- (o4.south west) -- (o4.north west) .. controls ([yshift=0.3em]o4.north west) and ([yshift=-0.3em]m6.south west) .. (m6.south west) -- (w7.south west);
+\draw[blue!40,fill=blue!30,opacity=0.7] (w9.south west) -- (w10.south east) -- (m7.south east) .. controls ([yshift=-0.1em]m7.south east) and ([yshift=0.2em]o5.north east) .. (o5.north east) -- (o5.south east) -- (o5.south west) -- (o5.north west) .. controls ([yshift=0.1em]o5.north west) and ([yshift=-0.5em]m7.south west) .. (m7.south west) -- (w9.south west);
+\draw[blue!40,fill=blue!30,opacity=0.7] (w11.south west) -- (w11.south east) -- (m8.south east) .. controls ([yshift=-0.4em]m8.south east) and ([yshift=0.1em]o6.north east) .. (o6.north east) -- (o6.south east) -- (o6.south west) -- (o6.north west) .. controls ([yshift=0.1em]o6.north west) and ([yshift=-0.5em]m8.south west) .. (m8.south west) -- (w11.south west);
+\node[anchor=north,font=\scriptsize,align=center] (a2) at  ([yshift=-1.4em]a1.south) {Predicted letter sequence};
+\node[anchor=north,font=\scriptsize,align=center] (a3) at  ([yshift=-1.8em]a2.south) {Merge repeated letters \\ and drop $\epsilon$};
+\node[anchor=north,font=\scriptsize,align=center] (a4) at  ([yshift=-0.6em]a3.south) {Final output};
+\end{tikzpicture}
\ No newline at end of file
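
The figure above walks through the CTC post-processing rule on the example "hello!": merge consecutive repeated letters, then drop the blank symbol. A minimal sketch of that rule follows. It assumes the blank is written as the literal string "ε" (mirroring the figure's $\epsilon$), and the function name `ctc_collapse` is hypothetical, not taken from this repository.

```python
from itertools import groupby

BLANK = "ε"  # assumed blank symbol, mirroring $\epsilon$ in the figure


def ctc_collapse(frames, blank=BLANK):
    """Merge consecutive repeated symbols, then drop blanks."""
    # groupby yields one key per run of equal adjacent symbols
    merged = [symbol for symbol, _ in groupby(frames)]
    return "".join(symbol for symbol in merged if symbol != blank)


# Frame-level predictions as drawn in the figure
frames = ["h", "e", "e", "ε", "l", "ε", "l", "l", "o", "o", "!"]
print(ctc_collapse(frames))
```

The blank between the two runs of "l" is what keeps them from merging into a single letter, which is why the figure places an $\epsilon$ there.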
--- a/Chapter17/Figures/figure-hello-audio.png
+++ b/Chapter17/Figures/figure-hello-audio.png
--- a/Chapter17/Figures/figure-image-description-of-encoder-decoder-framework.tex
+++ b/Chapter17/Figures/figure-image-description-of-encoder-decoder-framework.tex
+\begin{tikzpicture}[scale=0.8]
+\tikzstyle{every node}=[scale=0.8]
+%figure 1
+\coordinate (A1) at (0, 0);
+\coordinate (B1) at ([xshift=1.5em,yshift=-0.4em]A1);
+\coordinate (C1) at ([xshift=0.3em,yshift=-2.6em]A1);
+\coordinate (D1) at ([xshift=2.7em,yshift=-2.6em]A1);
+\coordinate (E1) at ([xshift=2.4em,yshift=-1.5em]A1);
+\coordinate (F1) at ([xshift=0.3em]D1);
+%figure 2
+\coordinate (A2) at ([yshift=-15em]A1);
+\coordinate (B2) at ([xshift=1.5em,yshift=-0.4em]A2);
+\coordinate (C2) at ([xshift=0.3em,yshift=-2.6em]A2);
+\coordinate (D2) at ([xshift=2.7em,yshift=-2.6em]A2);
+\coordinate (E2) at ([xshift=2.4em,yshift=-1.5em]A2);
+\coordinate (F2) at ([xshift=0.3em]D2);
+\foreach \x in {1,2}{
+\draw[-,line width=2pt] (A\x) -- ([xshift=3.6em]A\x) -- ([xshift=3.6em,yshift=-3em]A\x) -- ([yshift=-3em]A\x) -- (A\x) -- ([xshift=1em]A\x);
+\draw[-, very thick] (B\x) -- (C\x) -- (D\x) -- (B\x);
+\draw[-, very thick,fill=black] ([xshift=-0.6em,yshift=-1.2em]B\x)  -- ([xshift=-0.3em,yshift=-1em]B\x) -- ([yshift=-1.2em]B\x) --([xshift=0.3em,yshift=-1em]B\x) -- ([xshift=0.6em,yshift=-1.2em]B\x) -- (D\x) -- (C\x) -- ([xshift=-0.6em,yshift=-1.2em]B\x);
+\draw[-, very thick,fill=black] (E\x) -- ([xshift=0.2em,yshift=0.3em]E\x) -- ([xshift=0.33em]F\x) -- (F\x) -- (E\x);
+\node[circle,inner sep=0pt,minimum size=0.4em,fill=black] at ([xshift=-0.7em,yshift=-0.2em]B\x){};
+\node[draw,rounded corners=2pt,fill=yellow!20,minimum width=2.3cm,minimum height=1cm](cnn\x) at ([xshift=9.5em,yshift=-1.5em]A\x){CNN};
+\node[draw,circle,fill=green!20,font=\footnotesize,anchor=west,inner sep=3pt] (h\x_2) at ([xshift=3em,yshift=0.66em]cnn\x.east){$h_2$};
+\node[draw,circle,fill=green!20,font=\footnotesize,anchor=south,inner sep=3pt] (h\x_1) at ([yshift=1em]h\x_2.north){$h_1$};
+\node[font=\footnotesize,anchor=north] (h\x_c) at ([yshift=-0.6em]h\x_2.south){$\cdots$};
+\node[draw,circle,fill=green!20,font=\footnotesize,anchor=north,inner sep=3pt] (h\x_n) at ([yshift=-0.6em]h\x_c.south){$h_n$};
+}
+\begin{pgfonlayer}{background}
+\node[draw,thick,rounded corners=2pt,densely dashed,inner ysep=1.2em,inner xsep=0.4em,label={above:Image feature vectors}][fit=(h1_1)(h1_2)(h1_n)](box1){};
+\node[draw,thick,rounded corners=2pt,densely dashed,inner ysep=1.2em,inner xsep=0.4em,label={above:Image feature vectors}][fit=(h2_1)(h2_2)(h2_n)](box2){};
+\end{pgfonlayer}
+\node[anchor=west,draw,rounded corners=2pt,fill=blue!20,minimum width=2.3cm,minimum height=1cm] (decoder1)at ([xshift=3em]box1.east){Decoder};
+\node[anchor=west,draw,circle,inner sep=0pt,minimum size=1.4em] (add)at ([xshift=2em,yshift=1.6em]box2.east){};
+\draw[] (add.0) -- (add.180);
+\draw[] (add.90) -- (add.-90);
+\node[anchor=west,draw,rounded corners=2pt,fill=blue!20,minimum width=2.3cm,minimum height=1cm] (decoder2)at ([xshift=6em]box2.east){Decoder};
+\draw[->,thick] ([xshift=-2.7em]cnn1.180) -- ([xshift=-0.1em]cnn1.180);
+\draw[->,thick] ([xshift=-2.7em]cnn2.180) -- ([xshift=-0.1em]cnn2.180);
+\draw[->,thick] ([xshift=0.1em]cnn1.0) -- ([xshift=-0.1em]box1.180);
+\draw[->,thick] ([xshift=0.1em]cnn2.0) -- ([xshift=-0.1em]box2.180);
+\draw[->,thick] ([xshift=0.1em]box1.0) -- ([xshift=-0.1em]decoder1.180);
+\draw[->,thick] ([xshift=0.1em]h2_1.0) -- (add.180);
+\draw[->,thick] ([xshift=0.1em]h2_2.0) -- (add.180);
+\draw[->,thick] ([xshift=0.1em]h2_c.0) -- (add.180);
+\draw[->,thick] ([xshift=0.1em]h2_n.0) -- (add.180);
+\draw[->,thick,out=20,in=130] ([xshift=0.1em]add.45) to ([xshift=-0em,yshift=0.1em]decoder2.north west);
+\draw[->,thick,out=200,in=-45] ([xshift=-0.1em]decoder2.west) to ([yshift=-0.1em]add.-90);
+\node [anchor=north](pos1) at ([yshift=-1.0em]box1.south) {(a) Without attention mechanism};
+\node [anchor=north](pos2) at ([yshift=-1.0em]box2.south) {(b) With attention mechanism};
+\end{tikzpicture}
+%------------------------------------------------------------------------------------------------------------
--- a/Chapter17/Figures/figure-image-translation-task.tex
+++ b/Chapter17/Figures/figure-image-translation-task.tex
+\begin{tikzpicture}[scale=0.6]
+\tikzstyle{every node}=[scale=0.6]
+%figure 1
+\coordinate (A1) at (0, 0);
+\coordinate (B1) at ([xshift=1.5em,yshift=-0.4em]A1);
+\coordinate (C1) at ([xshift=0.3em,yshift=-2.6em]A1);
+\coordinate (D1) at ([xshift=2.7em,yshift=-2.6em]A1);
+\coordinate (E1) at ([xshift=2.4em,yshift=-1.5em]A1);
+\coordinate (F1) at ([xshift=0.3em]D1);
+\coordinate (G1) at ([xshift=0.3em,yshift=-5em]A1);
+\coordinate (H1) at ([xshift=0.4em,yshift=-1.6em]G1);
+\coordinate (I1) at ([xshift=0.4em,yshift=-2.0em]G1);
+\coordinate (J1) at ([xshift=0.4em,yshift=-2.5em]G1);
+\coordinate (K1) at ([xshift=0.4em,yshift=-3.0em]G1);
+\coordinate (L1) at ([xshift=0.4em,yshift=-3.5em]G1);
+\coordinate (G2) at ([xshift=8em,yshift=-2.5em]A1);
+\coordinate (H2) at ([xshift=0.4em,yshift=-1.6em]G2);
+\coordinate (I2) at ([xshift=0.4em,yshift=-2.0em]G2);
+\coordinate (J2) at ([xshift=0.4em,yshift=-2.5em]G2);
+\coordinate (K2) at ([xshift=0.4em,yshift=-3.0em]G2);
+\coordinate (L2) at ([xshift=0.4em,yshift=-3.5em]G2);
+%figure 2
+\coordinate (A2) at ([yshift=-0.5em,xshift=7em]G2);
+\coordinate (B2) at ([xshift=1.5em,yshift=-0.4em]A2);
+\coordinate (C2) at ([xshift=0.3em,yshift=-2.6em]A2);
+\coordinate (D2) at ([xshift=2.7em,yshift=-2.6em]A2);
+\coordinate (E2) at ([xshift=2.4em,yshift=-1.5em]A2);
+\coordinate (F2) at ([xshift=0.3em]D2);
+\coordinate (G3) at ([xshift=8em,yshift=0.5em]A2);
+\coordinate (H3) at ([xshift=0.4em,yshift=-1.6em]G3);
+\coordinate (I3) at ([xshift=0.4em,yshift=-2.0em]G3);
+\coordinate (J3) at ([xshift=0.4em,yshift=-2.5em]G3);
+\coordinate (K3) at ([xshift=0.4em,yshift=-3.0em]G3);
+\coordinate (L3) at ([xshift=0.4em,yshift=-3.5em]G3);
+%figure 3
+\coordinate (A3) at ([yshift=-0.5em,xshift=7em]G3);
+\coordinate (B3) at ([xshift=1.5em,yshift=-0.4em]A3);
+\coordinate (C3) at ([xshift=0.3em,yshift=-2.6em]A3);
+\coordinate (D3) at ([xshift=2.7em,yshift=-2.6em]A3);
+\coordinate (E3) at ([xshift=2.4em,yshift=-1.5em]A3);
+\coordinate (F3) at ([xshift=0.3em]D3);
+\coordinate (A4) at ([xshift=8em]A3);
+\coordinate (B4) at ([xshift=1.5em,yshift=-0.4em]A4);
+\coordinate (C4) at ([xshift=0.3em,yshift=-2.6em]A4);
+\coordinate (D4) at ([xshift=2.7em,yshift=-2.6em]A4);
+\coordinate (E4) at ([xshift=2.4em,yshift=-1.5em]A4);
+\coordinate (F4) at ([xshift=0.3em]D4);
+%figure 4
+\coordinate (G4) at ([xshift=7.6em,yshift=0.5em]A4);
+\coordinate (H4) at ([xshift=0.4em,yshift=-1.6em]G4);
+\coordinate (I4) at ([xshift=0.4em,yshift=-2.0em]G4);
+\coordinate (J4) at ([xshift=0.4em,yshift=-2.5em]G4);
+\coordinate (K4) at ([xshift=0.4em,yshift=-3.0em]G4);
+\coordinate (L4) at ([xshift=0.4em,yshift=-3.5em]G4);
+\coordinate (A5) at ([yshift=-0.5em,xshift=8em]G4);
+\coordinate (B5) at ([xshift=1.5em,yshift=-0.4em]A5);
+\coordinate (C5) at ([xshift=0.3em,yshift=-2.6em]A5);
+\coordinate (D5) at ([xshift=2.7em,yshift=-2.6em]A5);
+\coordinate (E5) at ([xshift=2.4em,yshift=-1.5em]A5);
+\coordinate (F5) at ([xshift=0.3em]D5);
+\foreach \x in {1,2,3,4,5}{
+\draw[-,line width=2pt] (A\x) -- ([xshift=3.6em]A\x) -- ([xshift=3.6em,yshift=-3em]A\x) -- ([yshift=-3em]A\x) -- (A\x) -- ([xshift=1em]A\x);
+\draw[-, very thick] (B\x) -- (C\x) -- (D\x) -- (B\x);
+\draw[-, very thick,fill=black] ([xshift=-0.6em,yshift=-1.2em]B\x)  -- ([xshift=-0.3em,yshift=-1em]B\x) -- ([yshift=-1.2em]B\x) --([xshift=0.3em,yshift=-1em]B\x) -- ([xshift=0.6em,yshift=-1.2em]B\x) -- (D\x) -- (C\x) -- ([xshift=-0.6em,yshift=-1.2em]B\x);
+\draw[-, very thick,fill=black] (E\x) -- ([xshift=0.2em,yshift=0.3em]E\x) -- ([xshift=0.33em]F\x) -- (F\x) -- (E\x);
+\node[circle,inner sep=0pt,minimum size=0.4em,fill=black] at ([xshift=-0.7em,yshift=-0.2em]B\x){};
+}
+\foreach \y in {1,2,3,4}{
+\draw[-,line width=2pt] (G\y) -- ([xshift=1.6em]G\y) -- ([xshift=3em,yshift=-1.4em]G\y) -- ([xshift=3em,yshift=-4em]G\y) -- ([yshift=-4em]G\y) -- (G\y) -- ([xshift=1em]G\y);
+\draw[-,line width=2pt] ([xshift=1.6em]G\y) -- ([xshift=1.5em,yshift=-1.4em]G\y) -- ([xshift=3em,yshift=-1.4em]G\y) ;
+\draw[-,line width=1.6pt] (H\y) -- ([xshift=0.6em]H\y);
+\draw[-,line width=1.6pt] (I\y) -- ([xshift=2em]I\y);
+\draw[-,line width=1.6pt] (J\y) -- ([xshift=2em]J\y);
+\draw[-,line width=1.6pt] (K\y) -- ([xshift=2em]K\y);
+\draw[-,line width=1.6pt] (L\y) -- ([xshift=2em]L\y);
+}
+\draw[-,thick] ([yshift=4em,xshift=5em]G2) -- ([yshift=-8em,xshift=5em]G2);
+\draw[-,thick] ([yshift=4em,xshift=5em]G3) -- ([yshift=-8em,xshift=5em]G3);
+\draw[-,thick] ([yshift=4.5em,xshift=5.6em]A4) -- ([yshift=-7.5em,xshift=5.6em]A4);
+\node [draw,single arrow,minimum height=2.4em,single arrow head extend=0.4em] (arrow1) at ([xshift=-2.4em,yshift=-2em]G2) {};
+\node [draw,single arrow,minimum height=2.4em,single arrow head extend=0.4em] (arrow2) at ([xshift=-2.4em,yshift=-2em]G3) {};
+\node [draw,single arrow,minimum height=2.4em,single arrow head extend=0.4em] (arrow3) at ([xshift=-2.4em,yshift=-1.5em]A4) {};
+\node [draw,single arrow,minimum height=2.4em,single arrow head extend=0.4em] (arrow4) at ([xshift=-2.5em,yshift=-1.5em]A5) {};
+\node[anchor=north,font=\small,scale=1.5] at ([yshift=-6em]arrow1.south){(a) Multimodal machine translation};
+\node[anchor=north,font=\small,scale=1.5] at ([yshift=-6em]arrow2.south){(b) Image-to-text translation};
+\node[anchor=north,font=\small,scale=1.5] at ([yshift=-6em]arrow3.south){(c) Image-to-image translation};
+\node[anchor=north,font=\small,scale=1.5] at ([yshift=-6em]arrow4.south){(d) Text-to-image translation};
+\end{tikzpicture}
+%------------------------------------------------------------------------------------------------------------
--- a/Chapter17/Figures/figure-modeling-a-global-approach-to-visual-characteristics.tex
+++ b/Chapter17/Figures/figure-modeling-a-global-approach-to-visual-characteristics.tex
+\begin{tikzpicture}[scale=0.8]
+\tikzstyle{every node}=[scale=0.8]
+%figure 1
+\coordinate (A1) at (0, 0);
+\coordinate (B1) at ([xshift=1.5em,yshift=-0.4em]A1);
+\coordinate (C1) at ([xshift=0.3em,yshift=-2.6em]A1);
+\coordinate (D1) at ([xshift=2.7em,yshift=-2.6em]A1);
+\coordinate (E1) at ([xshift=2.4em,yshift=-1.5em]A1);
+\coordinate (F1) at ([xshift=0.3em]D1);
+%figure 2
+\coordinate (A2) at ([xshift=15em]A1);
+\coordinate (B2) at ([xshift=1.5em,yshift=-0.4em]A2);
+\coordinate (C2) at ([xshift=0.3em,yshift=-2.6em]A2);
+\coordinate (D2) at ([xshift=2.7em,yshift=-2.6em]A2);
+\coordinate (E2) at ([xshift=2.4em,yshift=-1.5em]A2);
+\coordinate (F2) at ([xshift=0.3em]D2);
+\foreach \x in {1,2}{
+\draw[-,line width=2pt] (A\x) -- ([xshift=3.6em]A\x) -- ([xshift=3.6em,yshift=-3em]A\x) -- ([yshift=-3em]A\x) -- (A\x) -- ([xshift=1em]A\x);
+\draw[-, very thick] (B\x) -- (C\x) -- (D\x) -- (B\x);
+\draw[-, very thick,fill=black] ([xshift=-0.6em,yshift=-1.2em]B\x)  -- ([xshift=-0.3em,yshift=-1em]B\x) -- ([yshift=-1.2em]B\x) --([xshift=0.3em,yshift=-1em]B\x) -- ([xshift=0.6em,yshift=-1.2em]B\x) -- (D\x) -- (C\x) -- ([xshift=-0.6em,yshift=-1.2em]B\x);
+\draw[-, very thick,fill=black] (E\x) -- ([xshift=0.2em,yshift=0.3em]E\x) -- ([xshift=0.33em]F\x) -- (F\x) -- (E\x);
+\node[circle,inner sep=0pt,minimum size=0.4em,fill=black] at ([xshift=-0.7em,yshift=-0.2em]B\x){};
+\node[draw,rounded corners=2pt,fill=yellow!20,minimum width=2.3cm,minimum height=1cm](cnn\x) at ([xshift=1.8em,yshift=3.6em]A\x){CNN};
+}
+\node[draw,anchor=south,rounded corners=2pt,minimum width=4.0cm,minimum height=1cm,fill=red!20](encoder) at ([yshift=2.6em,xshift=2.2em]cnn1.north){Encoder};
+\node[anchor=north,font=\Large](x) at ([xshift=2.5em,yshift=-3.4em]encoder.south){$\seq{x}$};
+\node[draw,anchor=south,rounded corners=2pt,minimum width=4.0cm,minimum height=1cm,fill=blue!20](decoder) at ([yshift=2.6em,xshift=2.2em]cnn2.north){Decoder};
+\node[anchor=north,font=\Large](y) at ([xshift=2.5em,yshift=-3.4em]decoder.south){$\seq{y}$};
+\node[anchor=south,font=\Large](y_1) at ([yshift=3em]decoder.north){$\seq{y}'$};
+\draw[->,thick] ([yshift=-2.1em]cnn1.south) -- ([yshift=-0.1em]cnn1.south);
+\draw[->,thick] ([yshift=-2.1em]cnn2.south) -- ([yshift=-0.1em]cnn2.south);
+\draw[->,thick] ([yshift=0.1em]cnn1.north) -- ([yshift=2.4em]cnn1.north);
+\draw[->,thick] ([yshift=0.1em]cnn2.north) -- ([yshift=2.4em]cnn2.north);
+\draw[->,thick] ([yshift=0.3em]x.north) -- ([yshift=4.5em]x.south);
+\draw[->,thick] ([yshift=0.3em]y.north) -- ([yshift=4.7em]y.south);
+\draw[->,thick] ([xshift=0.1em]encoder.east) -- ([xshift=-0.1em]decoder.west);
+\draw[->,thick] ([yshift=0.1em]decoder.north) -- ([yshift=-0.1em]y_1.south);
+\node [anchor=south,scale=1.2] (node1) at ([xshift=-2.0em,yshift=2.5em]encoder.north) {{$x,y$: bilingual data}};
+\end{tikzpicture}
+%------------------------------------------------------------------------------------------------------------
--- a/Chapter17/Figures/figure-speech-recognition-model-based-on-transformer.tex
+++ b/Chapter17/Figures/figure-speech-recognition-model-based-on-transformer.tex
+\begin{tikzpicture}
+	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=5em]
+	\tikzstyle{word}=[font=\scriptsize]
+\node[layer,fill=red!20] (en_sa) at (0,0){Multi-Head \\ Attention};
+\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.4em]en_sa.north){Feed Forward \\ Network};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north] (en_add) at ([yshift=-1.4em]en_sa.south){};
+\draw[] (en_add.90) -- (en_add.-90);
+\draw[] (en_add.0) -- (en_add.180);
+\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.4em]en_add.south){CNN};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west] (de_add) at ([xshift=7em]en_add.east){};
+\draw[] (de_add.90) -- (de_add.-90);
+\draw[] (de_add.0) -- (de_add.180);
+\node[layer,anchor=south,fill=red!20] (de_sa) at ([yshift=1.4em]de_add.north){Masked \\Multi-Head\\Attention};
+\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
+\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=2em]de_ffn.north){Softmax};
+\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){Output Probabilities};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Speech Feature\\(FilterBank/MFCC)};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){Transcription\\(Embedding)};
+\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Position\\(Embedding)};
+\node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Position\\(Embedding)};
+\draw[->] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
+\draw[->] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
+\draw[->] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
+\draw[->] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_ffn.-90);
+\draw[->] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
+\draw[->] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
+\draw[->] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_ca.-90);
+\draw[->] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_ffn.-90);
+\draw[->] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]sf.-90);
+\draw[->] ([yshift=0.1em]sf.90) -- ([yshift=-0.1em]output.-90);
+\draw[->] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
+\draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
+\draw[->,rounded corners=2pt] ([yshift=0.1em]en_ffn.90) -- ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
+\begin{pgfonlayer}{background}
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(en_sa)(en_ffn)](box1){};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(de_sa)(de_ca)(de_ffn)](box2){};
+\end{pgfonlayer}
+\node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};
+\node[anchor=west,font=\scriptsize,text=red] at ([xshift=0.1em]box2.east){$\times N$};
+\node[anchor=east,font=\scriptsize] at ([xshift=-0.1em]en_cnn.west){$2 \times$};
+\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){ASR \\ Encoder};
+\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){ASR \\ Decoder};
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-speech-translation-model-based-on-CTC.tex
+++ b/Chapter17/Figures/figure-speech-translation-model-based-on-CTC.tex
+\begin{tikzpicture}
+	\tikzstyle{layer}=[draw,rounded corners=2pt,font=\scriptsize,align=center,minimum width=5em]
+	\tikzstyle{word}=[font=\scriptsize]
+\node[layer,fill=red!20] (en_sa) at (0,0){Multi-Head \\ Attention};
+\node[layer,anchor=south,fill=green!20] (en_ffn) at ([yshift=1.4em]en_sa.north){Feed Forward \\ Network};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=north] (en_add) at ([yshift=-1.4em]en_sa.south){};
+\draw[] (en_add.90) -- (en_add.-90);
+\draw[] (en_add.0) -- (en_add.180);
+\node[layer,anchor=north,fill=yellow!20] (en_cnn) at ([yshift=-1.4em]en_add.south){CNN};
+\node[draw,circle,inner sep=0pt, minimum size=1em,anchor=west] (de_add) at ([xshift=7em]en_add.east){};
+\draw[] (de_add.90) -- (de_add.-90);
+\draw[] (de_add.0) -- (de_add.180);
+\node[layer,anchor=south,fill=red!20] (de_sa) at ([yshift=1.4em]de_add.north){Masked \\Multi-Head\\Attention};
+\node[layer,anchor=south,fill=red!20] (de_ca) at ([yshift=1.4em]de_sa.north){Multi-Head \\ Attention};
+\node[layer,anchor=south,fill=green!20] (de_ffn) at ([yshift=1.4em]de_ca.north){Feed Forward \\ Network};
+\node[layer,anchor=south,fill=blue!20] (en_sf) at ([yshift=3em]en_ffn.north){Softmax};
+\node[layer,anchor=south,fill=blue!20] (sf) at ([yshift=2em]de_ffn.north){Softmax};
+\node[layer,anchor=south,fill=orange!20] (en_output) at ([yshift=1.4em]en_sf.north){CTC Output};
+\node[layer,anchor=south,fill=orange!20] (output) at ([yshift=1.4em]sf.north){Speech Translation Output};
+\node[anchor=north,font=\scriptsize,align=center] (en_input) at ([yshift=-1em]en_cnn.south){Speech Feature\\(FilterBank/MFCC)};
+\node[anchor=north,font=\scriptsize,align=center] (de_input) at ([yshift=-1em]de_add.south){Target Text\\(Embedding)};
+\node[anchor=east,font=\scriptsize,align=center] (en_pos) at ([xshift=-2em]en_add.west){Position\\(Embedding)};
+\node[anchor=west,font=\scriptsize,align=center] (de_pos) at ([xshift=2em]de_add.east){Position\\(Embedding)};
+\draw[->] (en_input.90) -- ([yshift=-0.1em]en_cnn.-90);
+\draw[->] ([yshift=0.1em]en_cnn.90) -- ([yshift=-0.1em]en_add.-90);
+\draw[->] ([yshift=0.1em]en_add.90) -- ([yshift=-0.1em]en_sa.-90);
+\draw[->] ([yshift=0.1em]en_sa.90) -- ([yshift=-0.1em]en_ffn.-90);
+\draw[->] (de_input.90) -- ([yshift=-0.1em]de_add.-90);
+\draw[->] ([yshift=0.1em]de_add.90) -- ([yshift=-0.1em]de_sa.-90);
+\draw[->] ([yshift=0.1em]de_sa.90) -- ([yshift=-0.1em]de_ca.-90);
+\draw[->] ([yshift=0.1em]de_ca.90) -- ([yshift=-0.1em]de_ffn.-90);
+\draw[->] ([yshift=0.1em]en_ffn.90) -- ([yshift=-0.1em]en_sf.-90);
+\draw[->] ([yshift=0.1em]en_sf.90) -- ([yshift=-0.1em]en_output.-90);
+\draw[->] ([yshift=0.1em]de_ffn.90) -- ([yshift=-0.1em]sf.-90);
+\draw[->] ([yshift=0.1em]sf.90) -- ([yshift=-0.1em]output.-90);
+\draw[->] ([xshift=0.1em]en_pos.0) -- ([xshift=-0.1em]en_add.180);
+\draw[->] ([xshift=-0.1em]de_pos.180) -- ([xshift=0.1em]de_add.0);
+\draw[->,rounded corners=2pt] ([yshift=2em]en_ffn.90) -- ([xshift=4em,yshift=2em]en_ffn.90) -- ([xshift=-1.5em]de_ca.west) -- ([xshift=-0.1em]de_ca.west);
+\begin{pgfonlayer}{background}
+\node[draw=ugreen,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(en_sa)(en_ffn)](box1){};
+\node[draw=red,rounded corners=2pt,inner xsep=6pt,inner ysep=8pt][fit=(de_sa)(de_ca)(de_ffn)](box2){};
+\end{pgfonlayer}
+\node[anchor=east,font=\scriptsize,text=ugreen] at ([xshift=-0.1em]box1.west){$N \times$};
+\node[anchor=west,font=\scriptsize,text=red] at ([xshift=0.1em]box2.east){$\times N$};
+\node[anchor=east,font=\scriptsize] at ([xshift=-0.1em]en_cnn.west){$2 \times$};
+\node[anchor=east,font=\scriptsize,align=center,text=ugreen] at ([xshift=-0.1em,yshift=3em]box1.west){Speech Translation\\Encoder};
+\node[anchor=west,font=\scriptsize,align=center,text=red] at ([xshift=0.1em,yshift=5em]box2.east){Speech Translation\\Decoder};
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-three-ways-of-dual-decoder-speech-translation.tex
+++ b/Chapter17/Figures/figure-three-ways-of-dual-decoder-speech-translation.tex
@@ -15,7 +15,9 @@
 \draw[->,thick](decoder_2.north)to(y.south);
 \draw[->,thick](encoder.north)--([yshift=0.7cm]encoder.north)--([xshift=-4.16em,yshift=0.7cm]encoder.north)--(decoder_1.south);
 \draw[->,thick](encoder.north)--([yshift=0.7cm]encoder.north)--([xshift=4.16em,yshift=0.7cm]encoder.north)--(decoder_2.south);
-\node [anchor=north](pos1) at (s.south) {(a) Single-encoder, dual-decoder approach};
+\node [anchor=north,scale = 1.2](pos1) at (s.south) {(a) Single-encoder, dual-decoder approach};
+\node [anchor=south,scale=1.2] (node1) at ([xshift=-2.0em,yshift=6em]decoder_1.north) {{$x,y$: language data}};
+\node [anchor=north,scale=1.2] (node2) at ([xshift=0.6em]node1.south){{$s$: speech data}};
 %%%%%%%%%%%%%%%%%%%%%%%% cascaded
 \node(encoder-2)[coder]at ([xshift=10.0em]encoder.east){\large{Encoder}};
 \node(decoder_1-2)[coder,above of =encoder-2,yshift=1.4cm,fill=blue!20]{\large{Decoder}};
@@ -27,7 +29,7 @@
 \draw[->,thick](encoder-2.north)to(decoder_1-2.south);
 \draw[->,thick](decoder_1-2.north)to(decoder_2-2.south);
 \draw[->,thick](decoder_2-2.north)to(y-2.south);
-\node [anchor=north](pos2) at (s-2.south) {(b) Cascaded-encoder approach};
+\node [anchor=north,scale = 1.2](pos2) at (s-2.south) {(b) Cascaded-encoder approach};
 %%%%%%%%%%%%%%%%%%%%%%%% joint
 \node(encoder-3)[coder]at([xshift=10.0em]encoder-2.east){\large{Encoder}};
 \node(decoder_1-3)[coder,above of =encoder-3,xshift=-1.6cm,yshift=2.8cm,fill=blue!20]{\large{Decoder}};
@@ -40,5 +42,5 @@
 \draw[->,thick](decoder_2-3.north)to(y-3.south);
 \draw[->,thick](encoder-3.north)--([yshift=0.7cm]encoder-3.north)--([xshift=-4.16em,yshift=0.7cm]encoder-3.north)--(decoder_1-3.south);
 \draw[->,thick](encoder-3.north)--([yshift=0.7cm]encoder-3.north)--([xshift=4.16em,yshift=0.7cm]encoder-3.north)--(decoder_2-3.south);
-\node [anchor=north](pos3) at (s-3.south) {(c) Joint encoder approach};
+\node [anchor=north,scale = 1.2](pos3) at (s-3.south) {(c) Joint encoder approach};
 \end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/Figures/figure-word-lattice.tex
+++ b/Chapter17/Figures/figure-word-lattice.tex
+\begin{tikzpicture}
+	\tikzstyle{node}=[circle,minimum size=1.2em,draw,inner sep=0pt,fill=yellow!20,font=\footnotesize,thick]
+	\tikzstyle{word}=[font=\scriptsize]
+\node[node] (n0) at (0,0) {0};
+\node[anchor=west,node] (n2) at ([xshift=3em]n0.east){2};
+\node[anchor=west,node] (n6) at ([xshift=13em]n2.east){6};
+\node[anchor=west,node] (n8) at ([xshift=2.4em]n6.east){8};
+\node[anchor=west,node] (n9) at ([xshift=2.4em]n8.east){9};
+\node[anchor=south,node] (n1) at ([xshift=0.6em,yshift=3.2em]n2.north){1};
+\node[anchor=north,node] (n3) at ([xshift=2.2em,yshift=-1.6em]n2.south){3};
+\node[anchor=north,node] (n7) at ([xshift=5.2em,yshift=-0.8em]n2.south){7};
+\node[anchor=west,node] (n10) at ([xshift=4em]n7.east){10};
+\node[anchor=south,node] (n11) at ([yshift=3.0em]n7.north){11};
+\node[anchor=south,node] (n5) at ([yshift=3.0em]n10.north){5};
+\node[anchor=north,node] (n4) at ([xshift=6em,yshift=-1.6em]n3.south){4};
+\draw[->] (n0.0) -- node[word,above]{of /0.343}(n2.180);
+\draw[->] (n0.60) -- node[word,above,rotate=40]{a /0.499}(n1.-150);
+\draw[->] (n0.-50) -- node[word,above,rotate=-20]{our /0.116}(n3.150);
+\draw[->] (n0.-70) .. controls ([xshift=-8em]n4.180) and ([xshift=-8em]n4.180) .. node[above,word,xshift=3em,yshift=-0.6em]{that /0.039} (n4.180);
+\draw[->] (n4.0) .. node[word,above,xshift=-2em,yshift=-0.4em]{hostage /1} controls ([xshift=5em]n4.0) and ([yshift=-6em]n6.-90) .. (n6.-90);
+\draw[->] (n2.-90) -- node[word,above,rotate=-18,pos=0.55]{house /0.125}(n7.180);
+\draw[->] (n3.-10) node[word,above,xshift=3.6em,yshift=-0.8em]{conference /1} .. controls ([xshift=4.6em,yshift=-1.8em]n3.-10) and ([yshift=-1.6em,xshift=-3em]n10.-135) .. (n10.-135);
+\draw[->] (n7.0) -- node[word,above]{which /1}(n10.180);
+\draw[->] (n2.0) -- node[word,above,pos=0.5]{hostages /0.300}(n6.180);
+\draw[->] (n2.45) -- node[word,above,rotate=18,pos=0.3]{a /0.573}(n11.-135);
+\draw[->,rounded corners=1em] (n1.-45) node[word,above,xshift=1.4em,yshift=-1.3em,rotate=-43]{house /0.078} -- ([yshift=-0.4em,xshift=-1em]n11.-90) -- (n7.100);
+\draw[->] (n1.20) node[word,above,xshift=4em]{conference /0.734} .. controls ([xshift=8em]n1.20) and  ([xshift=-0.6em,yshift=2.2em]n5.110) .. (n5.110);
+\draw[->] (n11.0) -- node[word,above]{conference /1}(n5.180);
+\draw[->] (n5.-90) ..node[word,above,xshift=1.4em]{is /0.773} controls ([yshift=-1.6em]n5.-90) and ([xshift=-3em]n6.150) .. (n6.150);
+\draw[->] (n5.0) node[word, above,xshift=1.4em]{as /0.226}.. controls ([xshift=2.6em]n5.0) and ([xshift=-0.6em,yshift=2em]n6.120) .. (n6.120);
+\coordinate (a) at ([xshift=6em,yshift=3em]n1);
+\draw[->] (n1.60) .. controls ([xshift=3em,yshift=2em]n1.60) and ([xshift=-2em]a) .. (a) node[word,above,xshift=1em]{hostage /0.187}.. controls ([xshift=8em]a) and ([xshift=-0.6em,yshift=6em]n6.90) .. (n6.90);
+\draw[->] (n10.0) -- node[above,word,pos=0.4,rotate=30]{is /1}(n6.-135);
+\draw[->] (n6.0) -- node[above,word,yshift=0.2em]{being /1}(n8.180);
+\draw[->] (n8.0) -- node[above,word,yshift=0.3em]{recorded /1}(n9.180);
+\end{tikzpicture}
\ No newline at end of file
--- a/Chapter17/chapter17.tex
+++ b/Chapter17/chapter17.tex
@@ -109,6 +109,7 @@
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-speech-recognition-model-based-on-transformer}
 \caption{Transformer-based speech recognition model}
 \label{fig:17-2-3}
 \end{figure}
@@ -119,6 +120,7 @@
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-word-lattice.tex}
 \caption{Example of a word lattice}
 \label{fig:17-2-4}
 \end{figure}
@@ -167,6 +169,7 @@
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-an-end-to-end-voice-translation-model-based-on-transformer}
 \caption{Transformer-based end-to-end speech translation model}
 \label{fig:17-2-5}
 \end{figure}
@@ -194,6 +197,7 @@
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-examples-of-CTC-predictive-word-sequences}
 \caption{Example of CTC predicting a word sequence}
 \label{fig:17-2-6}
 \end{figure}
@@ -211,13 +215,14 @@
 \end{itemize}
 %----------------------------------------------------------------------------------------------------
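-\parinterval Applying CTC to speech translation is straightforward: it only requires adding an extra output layer on top of the encoder. This introduces a strong supervision signal without adding many parameters, which improves the convergence of the model.
+\parinterval Applying CTC to speech translation is straightforward: it only requires adding an extra output layer on top of the encoder (Figure \ref{fig:17-8}). This introduces a strong supervision signal without adding many parameters, which improves the convergence of the model.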
 %----------------------------------------------------------------------------------------------------
 \begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-speech-translation-model-based-on-CTC}
 \caption{CTC-based speech translation model}
-\label{fig:17-2-7}
+\label{fig:17-8}
 \end{figure}
 %----------------------------------------------------------------------------------------------------
@@ -249,14 +254,15 @@
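 \section{Image Translation}
-\parinterval Visual information often accounts for at least as large a share of what humans perceive as linguistic information, if not more. Visual information usually takes the form of images, and in recent years multimodal machine translation that incorporates images has been widely studied. Put simply, multimodal machine translation (Figure a) generates the target language by combining the source language with information from other modalities, such as images. Machine translation that incorporates images is still ``translation'' in the narrow sense, since it is essentially translation from a source language to a target language, that is, from text to text. Conversion from image to text (Figure b), such as {\small\bfnew{image captioning}}\index{Image Captioning}, which generates a description of the content of a given image, can be regarded as ``translation'' in a broad sense. This broad sense covers not only image-to-text conversion but also image-to-image (Figure c) and even text-to-image (Figure d) conversion. Here, these image-related translation tasks are collectively referred to as image translation.
+\parinterval Visual information often accounts for at least as large a share of what humans perceive as linguistic information, if not more. Visual information usually takes the form of images, and in recent years multimodal machine translation that incorporates images has been widely studied. Put simply, multimodal machine translation (Figure \ref{fig:17-10}(a)) generates the target language by combining the source language with information from other modalities, such as images. Machine translation that incorporates images is still ``translation'' in the narrow sense, since it is essentially translation from a source language to a target language, that is, from text to text. Conversion from image to text (Figure \ref{fig:17-10}(b)), such as {\small\bfnew{image captioning}}\index{Image Captioning}, which generates a description of the content of a given image, can be regarded as ``translation'' in a broad sense. This broad sense covers not only image-to-text conversion but also image-to-image (Figure \ref{fig:17-10}(c)) and even text-to-image (Figure \ref{fig:17-10}(d)) conversion. Here, these image-related translation tasks are collectively referred to as image translation.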
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-image-translation-task.tex}
 \caption{Image translation tasks}
-\label{tab:17-2-1-c}
+\label{fig:17-10}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
 %----------------------------------------------------------------------------------------
@@ -275,7 +281,7 @@
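 \subsubsection{1. Feature-fusion-based methods}
-\parinterval Earlier work typically treats the image information as part of the input sentence\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:conf/wmt/HuangLSOD16}, or uses it to initialize the encoder or decoder states\upcite{DBLP:conf/emnlp/CalixtoL17,Elliott2015MultilingualID,DBLP:conf/wmt/MadhyasthaWS17}. As shown in Figure 2, image features are usually extracted with a convolutional neural network; for details on convolutional neural networks, see {\chaptereleven}. The global visual feature obtained from the convolutional neural network undergoes a dimension transformation and is then introduced into the model either as part of the source-language input or as an initial state. However, introducing image information in this way has the following two drawbacks:
+\parinterval Earlier work typically treats the image information as part of the input sentence\upcite{DBLP:conf/emnlp/CalixtoL17,DBLP:conf/wmt/HuangLSOD16}, or uses it to initialize the encoder or decoder states\upcite{DBLP:conf/emnlp/CalixtoL17,Elliott2015MultilingualID,DBLP:conf/wmt/MadhyasthaWS17}. As shown in Figure \ref{fig:17-11}, image features are usually extracted with a convolutional neural network; for details on convolutional neural networks, see {\chaptereleven}. The global visual feature obtained from the convolutional neural network undergoes a dimension transformation and is then introduced into the model either as part of the source-language input or as an initial state. However, introducing image information in this way has the following two drawbacks: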
 \begin{itemize}
    \vspace{0.5em}
@@ -286,11 +292,12 @@
 \end{itemize}
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-modeling-a-global-approach-to-visual-characteristics}
 \caption{Approach to modeling global visual features}
-\label{tab:17-2-2-c}
+\label{fig:17-11}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
 \parinterval The noise problem naturally brings up the attention mechanism. An earlier chapter gave the following example:
@@ -302,11 +309,11 @@
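 \parinterval We want to fill in the blank with ``吃饭'' (have a meal) or ``吃东西'' (eat something) because, while reading the sentence, we attend to key information such as ``没/吃饭'' (have not/eaten) and ``很/饿'' (very/hungry). This is the problem the attention mechanism addresses in natural language processing: when generating a target-language word, the source-language segments that are more relevant to it should be reflected in the representation of the source sentence, instead of treating all source-language words equally. Attention is used in multimodal machine translation in the same way: when generating a target word, the model should focus on the image regions related to that word and pay less attention to the others, which reduces noise. In addition, attention lets the image information participate directly in generating the target language, which addresses the loss of image information as it passes through the encoder.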
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
 \caption{Comparison for the target word ``bank'' before and after applying attention}
 \label{tab:17-2-3-c}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
 \parinterval So how does multimodal machine translation compute the context vector? Following Chapter 10, a detailed explanation is given here (see Figure 10.19):
@@ -332,11 +339,11 @@
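 \parinterval As shown in Figure 4, the multimodal machine translation task can be decomposed into two subtasks: machine translation and image generation\upcite{DBLP:conf/ijcnlp/ElliottK17}. Machine translation is the main task and image generation is the auxiliary task; image generation here means generating the corresponding image from an image description, and it is discussed later. A single encoder models the source-language data, and two decoders (a translation decoder and an image decoder) learn the translation task and the image generation task. The top task-specific layers learn features particular to each task, while the shared bottom layers learn a richer representation of text features. In addition, research on visual question answering has shown\upcite{DBLP:conf/nips/LuYBP16} that multi-layer attention is not advisable in multimodal tasks because it causes severe overfitting. From another perspective, improving generalization through multi-task learning is also an effective way to prevent overfitting. Similar ideas are widely used in multimodal natural language processing, for example in image captioning and visual question answering\upcite{DBLP:conf/iccv/AntolALMBZP15}.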
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
 \caption{Multi-task learning applied to multimodal machine translation}
 \label{tab:17-2-4-c}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
 %----------------------------------------------------------------------------------------
@@ -348,11 +355,11 @@
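 \parinterval Image-to-text conversion can also be regarded as translation in a broad sense; put simply, the source language is replaced by an image. Image captioning is the most typical task of this kind. Although it is not the focus of this book, the relevant techniques are briefly introduced here to keep the coverage of multimodal translation complete. Image captioning generates a textual description for a given image; it is sometimes also called picture description or image caption generation. The task must solve how to understand the image and how to generate a description based on that understanding. It therefore involves both natural language processing and computer vision and is quite challenging. Image captioning also has broad applications in image retrieval, assistive navigation for the visually impaired, and human-computer interaction, which makes it valuable to study.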
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
 \caption{Traditional image captioning methods}
 \label{tab:17-2-5-c}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
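 \parinterval Traditional image captioning follows two paradigms: retrieval-based methods and template-based methods. Retrieval-based methods (Figure 5, left) select a sentence from a given set of candidate captions as the description of the image; the drawback is that the selected sentence may match the image poorly. Template-based methods (Figure 5, right) detect visual features in the image and fill the detected content into a predesigned template; the drawback is that the generated descriptions are rigid, as if they were all cast from the same mold. In recent years, convolutional neural networks have proven highly effective in computer vision and recurrent neural networks in natural language processing. Inspired by the encoder-decoder framework of machine translation, a framework that uses a convolutional neural network as the encoder to encode the image and a recurrent neural network as the decoder to generate the description has gradually become the basic paradigm for image captioning. This section starts from this basic encoder-decoder paradigm\upcite{DBLP:conf/cvpr/VinyalsTBE15,DBLP:conf/icml/XuBKCCSZB15} and then introduces improvements to the encoder and to the decoder.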
@@ -363,14 +370,15 @@
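 \subsubsection{1. Basic framework}
-\parinterval Inspired by neural machine translation, the encoder-decoder framework has also been applied to image captioning. The encoder converts the input image into a new ``representation'' that contains all the information in the input image, and the decoder then converts this representation into the output description. Figure XX (top) shows the encoder-decoder framework applied to image captioning\upcite{DBLP:conf/cvpr/VinyalsTBE15}. First, a convolutional neural network extracts image features into a vector representation of suitable length. Then, a long short-term memory (LSTM) network decodes it to generate the textual description, in a process similar to decoding in machine translation. This modeling approach has a shortcoming: a generated word does not necessarily need all of the image information, and feeding the global image information into the model may introduce noise and make the representation inaccurate. To address this limitation, the model in Figure XX (bottom)\upcite{DBLP:conf/icml/XuBKCCSZB15} introduces an attention mechanism. With attention, when generating different words the model no longer attends only to the global image features but to the image features it ``should'' attend to.
+\parinterval Inspired by neural machine translation, the encoder-decoder framework has also been applied to image captioning. The encoder converts the input image into a new ``representation'' that contains all the information in the input image, and the decoder then converts this representation into the output description. Figure \ref{fig:17-15}(a) shows the encoder-decoder framework applied to image captioning\upcite{DBLP:conf/cvpr/VinyalsTBE15}. First, a convolutional neural network extracts image features into a vector representation of suitable length. Then, a long short-term memory (LSTM) network decodes it to generate the textual description, in a process similar to decoding in machine translation. This modeling approach has a shortcoming: a generated word does not necessarily need all of the image information, and feeding the global image information into the model may introduce noise and make the representation inaccurate. To address this limitation, the model in Figure \ref{fig:17-15}(b)\upcite{DBLP:conf/icml/XuBKCCSZB15} introduces an attention mechanism. With attention, when generating different words the model no longer attends only to the global image features but to the image features it ``should'' attend to.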
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
+\input{./Chapter17/Figures/figure-image-description-of-encoder-decoder-framework}
 \caption{Encoder-decoder framework for image captioning}
-\label{tab:17-2-6-c}
+\label{fig:17-15}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
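 \parinterval Image captioning largely follows the encoder-decoder framework. The following describes improvements on the encoder side and on the decoder side. Overall, these improvements address the following two problems: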
@@ -394,11 +402,11 @@
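 \parinterval The semantic information of an image generally refers to the entities, attributes, scenes, and so on that appear in it. As shown in Figure XX, attribute or entity detectors extract attribute and entity words such as ``child'', ``river'', and ``bank'' from the image as its semantic information. The global image feature is extracted to initialize the recurrent neural network, and an attention mechanism computes attention weights between the target word and the attribute or entity words. The context vector computed from these weights feeds the encoded semantic information to the decoder\upcite{DBLP:conf/cvpr/YouJWFL16}, so that when decoding the word ``bank'' the model pays more attention to ``bank'' in the image's semantic information. Besides entities and attributes, the scene information of the image can also be added to the encoder\upcite{DBLP:journals/pami/FuJCSZ17}. Detecting attributes, entities, and scenes relies on object detection work such as Faster-RCNN\upcite{DBLP:journals/pami/RenHG017} and YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}, which is not discussed further here.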
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
 \caption{Explicitly incorporating semantic information into the encoder}
 \label{tab:17-2-6-c}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
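 \parinterval Most of the methods above map the entities, attributes, scenes, and so on in the image to words and add this information to the encoder explicitly. Another approach applies the semantic features of the image to the encoder implicitly\upcite{DBLP:conf/cvpr/ChenZXNSLC17}. For example, image data can be decomposed into three channels (red, green, and blue); put simply, each pixel is split into red, green, and blue components, which divides the image into three channels. In many images, different channels carry different features, and these can be applied on the encoder side. A further approach strengthens the encoder with position information, that is, the positions of the objects in the image. An object detection system obtains the objects in the image and their corresponding features, which determines where the objects are. This information can also be added to the encoder to strengthen its representation ability\upcite{DBLP:conf/eccv/YaoPLM18}.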
@@ -443,11 +451,11 @@
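 \parinterval ``Document'' here refers to a whole formed by a series of consecutive paragraphs or sentences, in which the sentences exhibit a degree of coherence and consistency in both form and content\upcite{jurafsky2000speech}. These connections are mainly reflected in two aspects: {\small\sffamily\bfseries{cohesion}}\index{Cohesion} and {\small\sffamily\bfseries{coherence}}\index{Coherence}. Cohesion is reflected in explicit linguistic elements and structures, including grammatical and lexical links between the sentences of a document, while coherence is reflected in the logical and semantic links between sentences. The goal of document-level translation is therefore to take these contextual links into account and produce translations that are more coherent and accurate than sentence-level translation (see Table \ref{tab:17-3-1}). However, because languages differ widely in their properties, context plays different roles in document-level translation. For example, German nouns have grammatical gender, so a pronoun must be translated according to the gender of its antecedent, a phenomenon that does not exist in languages without grammatical gender. As a result, document-level translation may involve different contextual phenomena for different languages.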
 %----------------------------------------------------------------------------------------------------
-\begin{table}[htp]
+\begin{figure}[htp]
 \centering
 \caption{The tense consistency problem in document-level translation}
 \label{tab:17-3-1}
-\end{table}
+\end{figure}
 %----------------------------------------------------------------------------------------------------
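 \parinterval Precisely because of this diversity of contextual phenomena, evaluating document-level translation models is relatively difficult. Current document-level machine translation mainly targets some common contextual phenomena, such as pronoun translation, ellipsis, connectives, and lexical cohesion, whereas general-purpose automatic metrics such as BLEU, introduced in {\chapterfour}, are usually insensitive to these phenomena. Document-level translation therefore needs dedicated methods to evaluate these specific phenomena. Earlier work has proposed evaluation criteria for specific contextual phenomena and applied them to document-level translation\upcite{DBLP:conf/naacl/BawdenSBH18,DBLP:conf/acl/VoitaST19}, but no consensus has been reached, which has to some extent hindered further progress in document-level machine translation. These evaluation criteria are introduced in Section \ref{sec:17-3-2}.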