update (book)

Commit c2b4e5c0 authored Mar 10, 2020 by xiaotong
--- a/Book/Chapter1/chapter1.tex
+++ b/Book/Chapter1/chapter1.tex
@@ -322,7 +322,7 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
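 \parinterval Manual evaluation is time-consuming and laborious, and it is also somewhat subjective: even the same text may be understood differently by different people at different times. Automatic evaluation is therefore also favored by developers of machine translation systems. Although automatic evaluation is less accurate than manual evaluation, it is fast, cheap, and highly consistent. Moreover, as evaluation techniques have advanced, automatic evaluation has become a reasonably good guide that helps us quickly gauge the quality of current machine translation output. Automatic evaluation has become an important branch of the machine translation field, and no fewer than several dozen automatic evaluation methods have been proposed. We cannot list them all here; to ease the discussion in later chapters, only some representative methods are briefly introduced.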

-\subsubsection{BLEU Evaluation}\index{Chapter1.5.2.1}
+\subsubsection{BLEU}\index{Chapter1.5.2.1}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
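 \parinterval The most widely used automatic evaluation metric today is BLEU. BLEU stands for Bilingual Evaluation Understudy and was first proposed by IBM in 2002\cite{papineni2002bleu}. It uses $n$-gram matching to measure the similarity between a machine translation and the reference translation: the closer the machine output is to the human reference, the higher its quality is judged to be. An $n$-gram is a unit made up of $n$ consecutive words. The larger $n$ is, the longer the matched fragments considered in the evaluation.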
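
As a rough illustration of the $n$-gram matching described above, the following Python sketch computes a clipped $n$-gram precision and a brevity penalty for one candidate against one reference. The function names, the toy sentences, and the single-reference setting are assumptions made for illustration. The penalty $e^{(1-r/c)}$ for $c<r$ follows the formula in this chapter; returning 1 otherwise is assumed from the standard definition. It is not a full BLEU implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """n-gram precision where each candidate count is clipped by the reference count."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(c, r):
    """Penalty for a candidate of length c against a reference of length r."""
    if c == 0:
        return 0.0
    # Assumption: no penalty when the candidate is at least as long as the reference.
    return 1.0 if c >= r else math.exp(1.0 - r / c)


# Hypothetical toy example (not taken from the book).
candidate = "the the the cat".split()
reference = "the cat is on the mat".split()

p1 = clipped_precision(candidate, reference, 1)  # "the" is counted at most twice
p2 = clipped_precision(candidate, reference, 2)
bp = brevity_penalty(len(candidate), len(reference))
print(p1, p2, bp)
```

Clipping is what keeps a candidate that repeats a frequent word from receiving full credit for every repetition.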

@@ -366,7 +366,7 @@ e^{(1-\frac{r}{c})}& c<r

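 \parinterval From the perspective of the development of machine translation, the significance of BLEU is that it gives system developers a simple, efficient, and repeatable means of automatic evaluation, so that developing a machine translation system no longer has to depend on manual evaluation. BLEU also introduced several innovations, including $n$-gram matching, clipped counts, and the brevity penalty, and many evaluation metrics, NIST among them, were inspired by BLEU. Of course, BLEU is not perfect and is often criticized. For example, it depends on reference translations, its results sometimes disagree with human judgments, and it assesses translation quality purely in terms of matching, without really considering whether the meaning of the sentence has been translated correctly. Nevertheless, BLEU is without doubt still the most commonly used evaluation method in machine translation. Until a better alternative is found, BLEU remains the standard evaluation metric in machine translation research.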

-\subsubsection{TER Evaluation}\index{Chapter1.5.2.2}
+\subsubsection{TER}\index{Chapter1.5.2.2}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
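 \parinterval  TER stands for Translation Edit Rate. It is a distance-based evaluation method used to assess the amount of post-editing work that a machine translation requires\cite{snover2006study}. Here, the distance between two sequences is defined as the minimum number of edit operations needed to convert one into the other. The more operations are needed, the larger the distance and the lower the similarity between the sequences; conversely, the smaller the distance, the more easily one sentence can be rewritten into the other and the higher the similarity. The edit operations used by TER are insertion, deletion, substitution, and shift. The distance computed from insertions, deletions, and substitutions is called the edit distance, and the score is given in the form of an error rate: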
 \begin{eqnarray}
@@ -386,7 +386,7 @@ Candidate：cat is standing in the ground

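 \parinterval Unlike BLEU, distance-based evaluation is a typical ``error rate'' measure, and similar ideas are widely used in fields such as speech recognition. In machine translation, besides TER there are very similar methods such as WER and PER, which differ only slightly in how an ``error'' is defined. Note that researchers often do not use BLEU or TER alone but combine the two, for example by using BLEU – TER as the evaluation metric (the symbol between BLEU and TER is a minus sign).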
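
The edit distance behind this family of error rates can be sketched in a few lines of Python. The sketch below counts only word-level insertions, deletions, and substitutions and divides by the reference length; the shift operation that TER also allows is deliberately left out, so this is closer to WER than to full TER. The function names and the reference sentence are hypothetical; only the candidate sentence appears in this chapter.

```python
def edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions turning hyp into ref."""
    prev = list(range(len(ref) + 1))  # distance from an empty hypothesis prefix
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # distance from hyp[:i] to an empty reference prefix
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def error_rate(hyp, ref):
    """Edit distance normalized by the reference length (shifts not modeled)."""
    return edit_distance(hyp, ref) / len(ref)


hyp = "cat is standing in the ground".split()
ref = "the cat is standing on the ground".split()  # hypothetical reference

print(edit_distance(hyp, ref), error_rate(hyp, ref))
```

Here one insertion ("the") and one substitution ("in" to "on") are enough, so the distance is 2 over a reference of 7 words.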

-\subsubsection{Checkpoint Evaluation}\index{Chapter1.5.2.3}
+\subsubsection{Checkpoint-Based Evaluation}\index{Chapter1.5.2.3}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
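 \parinterval  Metrics such as BLEU and TER can assess the overall quality of a translation, but they lack a fine-grained evaluation of specific problems. Researchers often need to know whether a system can handle a particular problem rather than obtain a single general score. Checkpoint-based methods build on exactly this idea\cite{shiwen1993automatic}. Their advantage is that, besides giving an overall evaluation of a machine translation system, they assess its translation ability on individual linguistic points, which makes it easy to compare the performance of different translation models. Because checkpoint-based evaluation rests on a fairly complete linguistic classification scheme, it has also been used many times for quality assessment in machine translation competitions.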

@@ -477,9 +477,13 @@ His house is on the south bank of the river.
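 \parinterval Socializing is an important human activity. In today's Internet era, a multitude of social applications have made life far more convenient and have also changed the way people live. Through these applications people can communicate instantly, collaborate, or share their views. Limited by language, however, people's social circles rarely extend beyond the languages they know, and cross-lingual social interaction remains difficult. With the development of machine translation, more and more social applications support automatic translation, so users can easily translate content in various languages into their native language. This facilitates communication and means that language is no longer a barrier to social interaction.\\ \\ \\ \\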

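 \section{Open-Source Projects and Evaluation Campaigns}\index{Chapter1.7}
+
+\parinterval From a practical point of view, the development of machine translation owes much to two driving forces: open-source systems and evaluation campaigns. By sharing code, open-source systems allow the latest research results to spread quickly and make experimental results reproducible. Evaluation campaigns let machine translation research groups compete and compare their work fairly on a common platform, jointly pushing the field forward. In addition, open-source projects foster collaboration between teams, allowing researchers to concentrate their efforts on a shared platform.
+
 \subsection{Open-Source Machine Translation Systems}\index{Chapter1.7.1}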
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
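-\parinterval From a practical point of view, the development of machine translation owes much to two driving forces: open-source systems and evaluation campaigns. By sharing code, open-source systems allow the latest research results to spread quickly and make experimental results reproducible. Evaluation campaigns let machine translation research groups compete and compare their work fairly on a common platform, jointly pushing the field forward. In addition, open-source projects foster collaboration between teams, allowing researchers to concentrate their efforts on a shared platform.
+
+Some excellent open-source machine translation systems are listed below.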

 \subsubsection{Open-Source Statistical Machine Translation Systems}\index{Chapter1.7.1.1}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

--- a/Book/Chapter2/chapter2.tex
+++ b/Book/Chapter2/chapter2.tex
@@ -384,13 +384,17 @@
 \begin{definition}
 Word

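-The smallest unit of language that can be used independently: vocabulary.——《新华字典》
+The smallest unit of language that can be used independently: vocabulary.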
+\begin{flushright}——《新华字典》\end{flushright}

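-A word is the smallest unit that carries semantic or pragmatic content and can be uttered on its own.——《维基百科》
+A word is the smallest unit that carries semantic or pragmatic content and can be uttered on its own.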
+\begin{flushright}——《维基百科》\end{flushright}

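-The basic unit in a sentence that expresses a complete concept and can be used freely and independently.——《国语辞典》
+The basic unit in a sentence that expresses a complete concept and can be used freely and independently.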
+\begin{flushright}——《国语辞典》\end{flushright}

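-An utterance in speech, or a sentence in a poem, article, or play.——《现代汉语词典》
+An utterance in speech, or a sentence in a poem, article, or play.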
+\begin{flushright}——《现代汉语词典》\end{flushright}
 \end{definition}
 %-------------------------------------------

@@ -934,9 +938,11 @@ c_{\textrm{KN}}(\cdot) & = & \begin{cases} \textrm{count}(\cdot)\quad for\ the\ 
 \begin{definition}
 Parsing

-Parsing is the analysis of the grammatical function of the words in a sentence.——《百度百科》
+Parsing is the analysis of the grammatical function of the words in a sentence.
+\begin{flushright}——《百度百科》\end{flushright}

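-In natural languages or computer languages, parsing is the process of analyzing a string of symbols using formal grammar rules.——《维基百科（译文）》
+In natural languages or computer languages, parsing is the process of analyzing a string of symbols using formal grammar rules.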
+\begin{flushright}——《维基百科（译文）》\end{flushright}
 \end{definition}
 %-------------------------------------------
 \parinterval The definitions above involve three important concepts in parsing:
@@ -1114,7 +1120,7 @@ s_0 \overset{r_1}{\Rightarrow} s_1 \overset{r_2}{\Rightarrow} s_2 \overset{r_3}{
 \item
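 \item $N$ is a set of non-terminal symbols
 \item $\Sigma$ is a set of terminal symbols
-\item $R$ is a set of rules (productions); each rule $r \in R$ has the form $X \to Y_1Y_2...Y_n$, where $X \in N$ and $Y_i \in N \cup \Sigma$, and each $r$ is associated with a probability that indicates how likely it is to be applied.
+\item $R$ is a set of rules (productions); each rule $r \in R$ has the form $p:X \to Y_1Y_2...Y_n$, where $X \in N$ and $Y_i \in N \cup \Sigma$, and each $r$ is associated with a probability $p$ that indicates how likely it is to be applied.
 \item $S$ is a set of start symbols with $S \subseteq N$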
 \end{itemize}
 \end{definition}
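
To make the role of the rule probability $p$ concrete, here is a minimal Python sketch of a toy probabilistic grammar. The grammar, the rule probabilities, and the function names are hypothetical. Two standard assumptions that this excerpt does not state are used: the probabilities of all rules sharing a left-hand side sum to 1, and the probability of a derivation is the product of the probabilities of the rules it applies.

```python
import math
from collections import defaultdict

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability p.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("cats",)): 0.6,
    ("NP", ("dogs",)): 0.4,
    ("VP", ("sleep",)): 0.7,
    ("VP", ("run",)): 0.3,
}


def is_normalized(rules, tol=1e-9):
    """Check that the probabilities of rules with the same left-hand side sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return all(abs(total - 1.0) <= tol for total in totals.values())


def derivation_probability(derivation, rules):
    """Product of the probabilities of the rules used in a derivation."""
    return math.prod(rules[rule] for rule in derivation)


# S => NP VP => cats VP => cats sleep
derivation = [("S", ("NP", "VP")), ("NP", ("cats",)), ("VP", ("sleep",))]
prob = derivation_probability(derivation, RULES)
print(is_normalized(RULES), prob)
```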

--- a/Book/Chapter3/Chapter3.tex
+++ b/Book/Chapter3/Chapter3.tex
@@ -34,7 +34,7 @@
 \end{figure}
 %-------------------------------------------

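-\parinterval The example above reflects some of the knowledge people use when translating. First, the word order of the two languages may differ, and the translation has to follow the conventions of the target language; this is what we usually call the \textbf{fluency} problem \textbf{(fluency)} of translation. Second, the source-language words must be translated accurately\footnote{Of course, in some cases of free translation, or for function words, translation is not needed.}; this is what we usually call the \textbf{accuracy} and \textbf{adequacy} problem \textbf{(adequacy)} of translation. To achieve these goals, the traditional view holds that the translation process consists of three steps (Figure \ref{fig:3-2}):
+\parinterval The example above reflects some of the knowledge people use when translating. First, the word order of the two languages may differ, and the translation has to follow the conventions of the target language; this is what we usually call the \textbf{fluency} problem (fluency) of translation. Second, the source-language words must be translated accurately\footnote{Of course, in some cases of free translation, or for function words, translation is not needed.}; this is what we usually call the \textbf{accuracy} and \textbf{adequacy} problem (adequacy) of translation. To achieve these goals, the traditional view holds that the translation process consists of three steps (Figure \ref{fig:3-2}):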

 \begin{itemize}
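 \item \textbf{Analysis:} Segment the source sentence into, or represent it as, the smallest units that can be processed. In word-based translation models the smallest processing unit is the word, so analysis can be understood here simply as word segmentation\footnote{As later chapters show, analysis also includes deeper analysis of linguistic structure; here, to emphasize the word-based view, the problem is reduced to its simplest case.}.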

--- a/Book/Chapter6/Chapter6.tex
+++ b/Book/Chapter6/Chapter6.tex
@@ -890,7 +890,7 @@ $\textrm{a}(\cdot)$可以被看作是目标语表示和源语言表示的一种`
 \begin{figure}[htp]
 \centering
 \input{./Chapter6/Figures/figure-Query-model-corresponding-to-traditional-query-model-vs-attention-mechanism}
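-\caption{Query models corresponding to a traditional query model (a) vs. the attention mechanism (b)}
+\caption{A traditional query model (a) and the query model corresponding to the attention mechanism (b)}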
 \label{fig:6-25}
 \end{figure}
 %----------------------------------------------
@@ -937,9 +937,9 @@ $\textrm{a}(\cdot)$可以被看作是目标语表示和源语言表示的一种`
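 \parinterval Applying Equation \ref{eqC6.29} to neural machine translation raises several basic questions: 1) the choice of loss function; 2) the parameter initialization strategy, that is, how to set $\mathbf{w}_0$; 3) the optimization strategy and the learning rate schedule; 4) training acceleration. These questions are discussed below.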
 %%%%%%%%%%%%%%%%%%
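 \subsubsection{Loss Function}\index{Chapter6.3.5.1}
-\parinterval Because neural machine translation outputs a probability distribution at every target-language position, describing how likely each word is to appear there, we need to know the ``loss'' of the distribution output at the current position relative to the reference answer. The cross-entropy loss is commonly used for this\footnote{\ \ Baidu Baike: \url{https://baike.baidu.com/item/\%E4\%BA\%A4\%E5\%8F\%89\%E7\%86\%B5/8983241?fr=aladdin}}. Let $\mathbf{y}$ denote the distribution output by the machine translation model and $\hat{\mathbf{y}}$ the reference answer. The cross-entropy loss can then be defined as $L_{ce}(\mathbf{y},\hat{\mathbf{y}}) = - \sum_{k=1}^{|V|} \hat{\mathbf{y}}[k] \textrm{log} (\mathbf{y}[k])$, where $\mathbf{y}[k]$ and $\hat{\mathbf{y}}[k]$ denote the $k$-th dimension of the vectors $\mathbf{y}$ and $\hat{\mathbf{y}}$, and $|V|$ is the dimension of the output vector (equal to the vocabulary size). For the model's output distributions $\mathbf{Y} = \{ \mathbf{y}_1,\mathbf{y}_2,\ldots, \mathbf{y}_n \}$ and the reference distributions $\hat{\mathbf{Y}}=\{ \hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2,\ldots,\hat{\mathbf{y}}_n \}$, the loss function can be defined as
+\parinterval Because neural machine translation outputs a probability distribution at every target-language position, describing how likely each word is to appear there, we need to know the ``loss'' of the distribution output at the current position relative to the reference answer. The cross-entropy loss is commonly used for this\footnote{\ \ Baidu Baike: \url{https://baike.baidu.com/item/\%E4\%BA\%A4\%E5\%8F\%89\%E7\%86\%B5/8983241?fr=aladdin}}. Let $\mathbf{y}$ denote the distribution output by the machine translation model and $\hat{\mathbf{y}}$ the reference answer. The cross-entropy loss can then be defined as $L_{\textrm{ce}}(\mathbf{y},\hat{\mathbf{y}}) = - \sum_{k=1}^{|V|} \hat{\mathbf{y}}[k] \textrm{log} (\mathbf{y}[k])$, where $\mathbf{y}[k]$ and $\hat{\mathbf{y}}[k]$ denote the $k$-th dimension of the vectors $\mathbf{y}$ and $\hat{\mathbf{y}}$, and $|V|$ is the dimension of the output vector (equal to the vocabulary size). For the model's output distributions $\mathbf{Y} = \{ \mathbf{y}_1,\mathbf{y}_2,\ldots, \mathbf{y}_n \}$ and the reference distributions $\hat{\mathbf{Y}}=\{ \hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2,\ldots,\hat{\mathbf{y}}_n \}$, the loss function can be defined as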
 \begin{equation}
-L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{ce}(\mathbf{y}_j,\hat{\mathbf{y}}_j)
+L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\hat{\mathbf{y}}_j)
 \label{eqC6.30}
 \end{equation}
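
A small Python sketch of this sentence-level loss is given below, assuming one-hot reference distributions and a three-word vocabulary (both hypothetical). The sketch weights the logarithm of the model's probability by the reference distribution, which is the usual convention and avoids taking the logarithm of the zeros in a one-hot reference; the function names are illustrative.

```python
import math


def cross_entropy(pred, gold):
    """-sum_k gold[k] * log(pred[k]); entries with gold[k] == 0 contribute nothing."""
    return -sum(g * math.log(p) for p, g in zip(pred, gold) if g > 0)


def sequence_loss(preds, golds):
    """Sum of the per-position cross-entropy losses over a target sentence."""
    return sum(cross_entropy(p, g) for p, g in zip(preds, golds))


# Hypothetical example: two target positions, vocabulary size |V| = 3.
preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # model output distributions
golds = [[1, 0, 0], [0, 1, 0]]              # one-hot reference answers

loss = sequence_loss(preds, golds)
print(loss)
```

With one-hot references the loss reduces to the negative log-probability that the model assigns to the correct word at each position.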

@@ -1005,7 +1005,7 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{ce}(\mathbf{y}_j,\hat{\mathbf{y
 \centering
 %\includegraphics[scale=0.7]{./Chapter6/Figures/Big learning rate vs Small learning rate.png}
 \input{./Chapter6/Figures/figure-convergence&lr}
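-\caption{Convergence of the function when the learning rate is too small (left) vs. too large (right) }
+\caption{Learning rate too small (left) vs. learning rate too large (right) }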
 \label{fig:6-27}
 \end{figure}
 %----------------------------------------------
@@ -1198,6 +1198,7 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{ce}(\mathbf{y}_j,\hat{\mathbf{y
 \end{equation}

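 \noindent Clearly, the shorter the target sentence $\mathbf{y}$ is, the smaller the value of $\textrm{lp}(\mathbf{y})$; because $\textrm{log P}(\mathbf{y} | \mathbf{x})$ is negative, the sentence score $\textrm{score} ( \mathbf{y} , \mathbf{x})$ then becomes smaller as well. In other words, the model penalizes translations that are too short. When the coverage is high, the score is likewise lowered. This penalty mechanism makes the model score more reasonable and thus helps us select higher-quality translations.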
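
The effect of the length penalty can be checked numerically with a short Python sketch. The exact definition of $\textrm{lp}(\mathbf{y})$ is not visible in this excerpt, so the sketch assumes the form used in the GNMT paper, $\textrm{lp}(\mathbf{y}) = (5+|\mathbf{y}|)^{\alpha}/(5+1)^{\alpha}$, with a hypothetical $\alpha = 0.6$, and it leaves out the coverage term entirely. The function names are illustrative.

```python
def length_penalty(length, alpha=0.6):
    """Assumed GNMT-style length penalty; grows with the translation length."""
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)


def normalized_score(log_prob, length, alpha=0.6):
    """Log-probability divided by the length penalty (coverage term omitted)."""
    return log_prob / length_penalty(length, alpha)


# Two hypothetical translations with the same log-probability but different lengths.
short_score = normalized_score(-10.0, 5)
long_score = normalized_score(-10.0, 20)
print(short_score, long_score)
```

Because the log-probability is negative, dividing by the smaller penalty of the shorter translation yields the lower score, which matches the behavior described above.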
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
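 \subsection{Example: GNMT}\index{Chapter6.3.7}
 \parinterval Recurrent neural networks have many successful applications in machine translation; for example, systems such as RNNSearch\cite{bahdanau2014neural} and Nematus\\ \cite{DBLP:journals/corr/SennrichFCBHHJL17} have been used as experimental systems by many researchers. Among the many systems based on recurrent neural networks, GNMT is the most successful\cite{Wu2016GooglesNM}. GNMT is the neural machine translation system released by Google in 2016. Before GNMT, neural machine translation had three weaknesses: slow training and inference, a lack of robustness in translating rare words, and occasional failure to translate every word of the source sentence. GNMT effectively addressed these problems.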
@@ -1220,7 +1221,7 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{ce}(\mathbf{y}_j,\hat{\mathbf{y
 % 表
 \begin{table}[htp]
 \centering
-\caption{GNMT and the best models of the time}
+\caption{Comparison of GNMT with other translation models\cite{Wu2016GooglesNM}}
 \label{tab:gnmt vs state-of-the-art models}
 \begin{tabular}{l l l l}
 \multicolumn{1}{l|}{\multirow{2}{*}{\#}} & \multicolumn{2}{c}{\textbf{BLEU}} & \multirow{2}{*}{\textbf{CPU decoding time}} \\
@@ -1303,7 +1304,7 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{ce}(\mathbf{y}_j,\hat{\mathbf{y
 \begin{figure}[htp]
 \centering
 \input{./Chapter6/Figures/figure-Dependencies-between-words-in-a-recurrent-neural-network}
-\caption{Dependencies between words in the attention mechanism}
+\caption{Dependencies between words in the self-attention mechanism}
 \label{fig:6-35}
 \end{figure}
 %----------------------------------------------
@@ -1549,13 +1550,9 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{ce}(\mathbf{y}_j,\hat{\mathbf{y
 \parinterval The multi-head mechanism is computed as follows:
 %-------------------------------------------------------
 \begin{eqnarray}
-\begin{array}{ll}
-\textrm{MultiHead}&(\mathbf{Q}, \mathbf{K} , \mathbf{V}) =
-\textrm{Concat} (\mathbf{head}_1, ... , \mathbf{head}_h ) \mathbf{W}^o \\
-&\textrm{where}\ \mathbf{head}_i  =\textrm{Attention} (\mathbf{Q}\mathbf{W}_i^Q ,
- \mathbf{K}\mathbf{W}_i^K  , \mathbf{V}\mathbf{W}_i^V )
+\textrm{MultiHead}(\mathbf{Q}, \mathbf{K} , \mathbf{V})& = & \textrm{Concat} (\mathbf{head}_1, ... , \mathbf{head}_h ) \mathbf{W}^o \\
+\textrm{where}\ \mathbf{head}_i & = &\textrm{Attention} (\mathbf{Q}\mathbf{W}_i^Q , \mathbf{K}\mathbf{W}_i^K  , \mathbf{V}\mathbf{W}_i^V )
 \label{eqC6.46}
-\end{array}
 \end{eqnarray}
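
The formula can be traced step by step with a dependency-free Python sketch. It assumes that $\textrm{Attention}$ is scaled dot-product attention, $\textrm{Softmax}(\mathbf{Q}\mathbf{K}^{T}/\sqrt{d_k})\mathbf{V}$, which is not shown in this excerpt. The helper names and the toy projection matrices (each head simply selects two of the four input dimensions, and $\mathbf{W}^o$ is the identity) are hypothetical and chosen only so that the result is easy to inspect.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Assumed scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Concat(head_1, ..., head_h) W^o with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    concat = [sum((head[i] for head in heads), []) for i in range(len(q))]
    return matmul(concat, w_o)


# Toy setting: 2 positions, model dimension 4, h = 2 heads of dimension 2.
x = [[1, 0, 1, 0], [0, 1, 0, 1]]
sel_a = [[1, 0], [0, 1], [0, 0], [0, 0]]  # head 1 reads dimensions 0 and 1
sel_b = [[0, 0], [0, 0], [1, 0], [0, 1]]  # head 2 reads dimensions 2 and 3
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]

out = multi_head(x, x, x, [sel_a, sel_b], [sel_a, sel_b], [sel_a, sel_b], identity)
print(out)
```

Each head attends within its own two-dimensional subspace, and the concatenation restores the model dimension before the output projection.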

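 \parinterval The benefit of the multi-head mechanism is that it allows the model to learn in different representation subspaces. Many experiments have found that heads in different representation spaces capture different information. For example, when a Transformer is used to process natural language, some heads capture syntactic information and others capture lexical information.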
@@ -1670,7 +1667,7 @@ lrate = d_{model}^{-0.5} \cdot \textrm{min} (step^{-0.5} , step \cdot warmup\_st
 \begin{figure}[htp]
 \centering
 \input{./Chapter6/Figures/figure-lrate-of-transformer}
-\caption{Learning rate adjustment curve of the Transformer model}
+\caption{Learning rate curve of the Transformer model}
 \label{fig:6-52}
 \end{figure}
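
The shape of this curve can be reproduced with a short Python sketch. The formula in the hunk header above is cut off after `warmup\_st`, so the sketch assumes the schedule from the original Transformer paper, $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup\_steps^{-1.5})$. The defaults $d_{model}=512$ and $warmup\_steps=4000$ and the function name are likewise assumptions.

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """Assumed schedule: linear warm-up, then decay proportional to step ** -0.5.

    step must be a positive integer.
    """
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until warmup_steps and falls afterwards.
for step in (1000, 2000, 4000, 8000, 16000):
    print(step, transformer_lrate(step))

peak = transformer_lrate(4000)
```

Under these assumed settings the peak is about $0.7 \times 10^{-3}$, which is consistent with the vertical axis range of the figure.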
 %----------------------------------------------
@@ -1711,12 +1708,12 @@ lrate = d_{model}^{-0.5} \cdot \textrm{min} (step^{-0.5} , step \cdot warmup\_st
 % 表
 \begin{table}[htp]
 \centering
-\caption{Experimental comparison of three Transformers}
+\caption{Comparison of three Transformer models}
 \label{tab:word-translation-examples}
 \begin{tabular}{l | l l l}

-\multirow{2}{*}{\#}   & \multicolumn{2}{c}{\textbf{BLEU}} & \multirow{2}{*}{\textbf{params}} \\
-                      & \textbf{EN-DE}  & \textbf{EN-FR}  &                                  \\ \hline
+\multirow{2}{*}{\#}   & \multicolumn{2}{c}{BLEU} & \multirow{2}{*}{params} \\
+                      & EN-DE  & EN-FR  &                                  \\ \hline
 Transformer Base      & 27.3            & 38.1            & 65$\times 10^{6}$                \\
 Transformer Big       & 28.4            & 41.8            & 213$\times 10^{6}$               \\
 Transformer Deep (48 layers) & 30.2            & 43.1            & 194$\times 10^{6}$              \\

--- a/Book/Chapter6/Figures/figure-Automatically-generate-instances-of-couplets.tex
+++ b/Book/Chapter6/Figures/figure-Automatically-generate-instances-of-couplets.tex
@@ -2,8 +2,8 @@

 \begin{tikzpicture}
 \begin{scope}
-\tikzstyle{lnode} = [minimum height=2.5em,minimum width=12em,inner sep=3pt,very thick,rounded corners=2pt,draw=red!75!black,fill=red!5];
-\tikzstyle{rnode} = [minimum height=2.5em,minimum width=12em,inner sep=3pt,very thick,rounded corners=2pt,draw=blue!75!black,fill=blue!5];
+\tikzstyle{lnode} = [minimum height=2.5em,minimum width=12em,inner sep=3pt,rounded corners=2pt,draw=red!75!black,fill=red!5];
+\tikzstyle{rnode} = [minimum height=2.5em,minimum width=12em,inner sep=3pt,rounded corners=2pt,draw=blue!75!black,fill=blue!5];
 \tikzstyle{standard} = [rounded corners=3pt]

 \node [lnode,anchor=west] (l1) at (0,0) {上联：翠竹千支歌盛世};
@@ -22,4 +22,4 @@
 \node [rnode,anchor=west] (l10) at ([xshift=1em]l9.east) {下联：春回大地满园红};

 \end{scope}
-\end{tikzpicture}
\ No newline at end of file
+\end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-Comparison-of-the-number-of-padding-in-batch.tex
+++ b/Book/Chapter6/Figures/figure-Comparison-of-the-number-of-padding-in-batch.tex
@@ -25,4 +25,5 @@

 \node [rectangle,inner sep=0.5em,rounded corners=2pt,very thick,dotted,draw=ugreen!80] [fit = (s1) (s3) (p1) (p3)] (box0) {};
 \node [rectangle,inner sep=0.5em,rounded corners=2pt,very thick,dotted,draw=ugreen!80] [fit = (s4) (s6) (p4) (p5)] (box0) {};
-\end{tikzpicture}
\ No newline at end of file
+
+\end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-Example-of-automatic-translation-of-classical-Chinese.tex
+++ b/Book/Chapter6/Figures/figure-Example-of-automatic-translation-of-classical-Chinese.tex
@@ -3,13 +3,13 @@

 \begin{frame}{}

- \begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=red!5!white,colframe=red!75!black]
+ \begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=red!5!white,colframe=red!75!black,boxrule=1pt]
 {
 \small{古文：侍卫步军都指挥使、彰信节度使李继勋营于寿州城南，唐刘仁赡伺继勋无备，出兵击之，杀士卒数百人，焚其攻具。}
 }
 \end{tcolorbox}
 \vspace{-0.4em}
- \begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=blue!5!white,colframe=blue!75!black]
+ \begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=blue!5!white,colframe=blue!75!black,boxrule=1pt]
 {
 \small{现代文：侍卫步军都指挥使、彰信节度使李继勋在寿州城南扎营，唐刘仁赡窥伺李继勋没有防备，出兵攻打他，杀死士兵几百人，烧毁李继勋的攻城器}
 }
@@ -17,13 +17,13 @@

 \vspace{0.2em}

-\begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=red!5!white,colframe=red!75!black]
+\begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=red!5!white,colframe=red!75!black,boxrule=1pt]
 {
 \small{古文：其后人稍稍识之，多延至其家，使为弟子论学。}
 }
 \end{tcolorbox}
 \vspace{-0.4em}
- \begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=blue!5!white,colframe=blue!75!black]
+ \begin{tcolorbox}[size=normal,left=2mm,right=1mm,colback=blue!5!white,colframe=blue!75!black,boxrule=1pt]
 {
 \small{现代文：后来的人渐渐认识他，多把他请到家里，让他为弟子讲授学问。}
 }
@@ -32,4 +32,4 @@
 \vspace{-0.8em}


-\end{frame}
\ No newline at end of file
+\end{frame}
--- a/Book/Chapter6/Figures/figure-GRU01.tex
+++ b/Book/Chapter6/Figures/figure-GRU01.tex
@@ -74,20 +74,20 @@
                \draw[emph] (aux71) -| (aux32) -| (aux44);
                \node[opnode,circle,draw=red,thick] () at (aux44) {$\sigma$};
            }
-            
+
        \end{scope}

        \begin{scope}
            \node[wordnode,anchor=south] () at (aux71) {$\mathbf{h}_{t-1}$};
            \node[wordnode,anchor=west] () at (aux12) {$\mathbf{x}_t$};
-            
+
        \end{scope}

       \node[] (tanh) at (aux46){};

        \begin{pgfonlayer}{background}
-            \node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=6pt,fit=(aux22) (aux76) (z76) (tanh)] (GRU) {};
+            \node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=6pt,fit=(aux22) (aux76) (z76) (tanh)] (GRU) {};
        \end{pgfonlayer}


-    \end{tikzpicture}
\ No newline at end of file
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-GRU02.tex
+++ b/Book/Chapter6/Figures/figure-GRU02.tex
@@ -87,20 +87,20 @@
                \node[opnode,circle,draw=red,thick] () at (aux45) {$\sigma$};
                \node[opnode,rectangle,rounded corners=2pt,inner sep=2pt,font=\tiny,draw=red,thick] () at (aux65) {$1-$};
            }
-            
+
        \end{scope}

        \begin{scope}
            \node[wordnode,anchor=south] () at (aux71) {$\mathbf{h}_{t-1}$};
            \node[wordnode,anchor=west] () at (aux12) {$\mathbf{x}_t$};
-            
+
        \end{scope}
-        
+
        \node[] (tanh) at (aux46){};

        \begin{pgfonlayer}{background}
-            \node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=6pt,fit=(aux22) (aux76) (z76) (tanh)] (GRU) {};
+            \node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=6pt,fit=(aux22) (aux76) (z76) (tanh)] (GRU) {};
        \end{pgfonlayer}


-    \end{tikzpicture}
\ No newline at end of file
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-GRU03.tex
+++ b/Book/Chapter6/Figures/figure-GRU03.tex
@@ -105,7 +105,7 @@
                \node[opnode,circle,draw=red,thick] () at (aux75) {X};
                \node[opnode,circle,draw=red,thick] () at (aux76) {\textbf{+}};
            }
-            
+
        \end{scope}

        \begin{scope}
@@ -118,8 +118,8 @@
        \end{scope}

        \begin{pgfonlayer}{background}
-            \node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=6pt,fit=(aux22) (aux76) (z76) (tanh)] (GRU) {};
+            \node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=6pt,fit=(aux22) (aux76) (z76) (tanh)] (GRU) {};
        \end{pgfonlayer}


-    \end{tikzpicture}
\ No newline at end of file
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-LSTM01.tex
+++ b/Book/Chapter6/Figures/figure-LSTM01.tex
@@ -80,21 +80,21 @@
                \draw[-latex,emph] (aux12) -- (aux22) -- (aux23) -- (f53);
                \node[opnode,circle,draw=red,thick] () at (aux33) {$\sigma$};
            }
-            
+
        \end{scope}

        \begin{scope}
            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\mathbf{h}_{t-1}$};
            \node[wordnode,anchor=west] () at (aux12) {$\mathbf{x}_t$};
            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\mathbf{c}_{t-1}$};
-           
+
        \end{scope}

           \node[ ] (o27) at (aux27) { };

        \begin{pgfonlayer}{background}
-            \node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
+            \node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
        \end{pgfonlayer}

-   
-    \end{tikzpicture}
\ No newline at end of file
+
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-LSTM02.tex
+++ b/Book/Chapter6/Figures/figure-LSTM02.tex
@@ -102,14 +102,14 @@
            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux21) {$\mathbf{h}_{t-1}$};
            \node[wordnode,anchor=west] () at (aux12) {$\mathbf{x}_t$};
            \node[wordnode,anchor=south] () at ([xshift=0.5\base]aux51) {$\mathbf{c}_{t-1}$};
-           
+
        \end{scope}
-        
+
         \node[ ] (o27) at (aux27) { };

        \begin{pgfonlayer}{background}
-            \node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
+            \node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
        \end{pgfonlayer}
-        

-    \end{tikzpicture}
\ No newline at end of file
+
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-LSTM03.tex
+++ b/Book/Chapter6/Figures/figure-LSTM03.tex
@@ -109,7 +109,7 @@
                \node[opnode,circle,draw=red,thick] (f53) at (aux53) {X};
                \node[opnode,circle,draw=red,thick] (u55) at (aux55) {\textbf{+}};
            }
-           
+
        \end{scope}

        \begin{scope}
@@ -119,14 +119,14 @@
            {
                \node[wordnode,anchor=south] () at ([xshift=-0.5\base]aux59) {$\mathbf{c}_{t}$};
            }
-           
+
        \end{scope}
-        
+
        \node[ ] (o27) at (aux27) { };

        \begin{pgfonlayer}{background}
-            \node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
+            \node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
        \end{pgfonlayer}


-    \end{tikzpicture}
\ No newline at end of file
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-LSTM04.tex
+++ b/Book/Chapter6/Figures/figure-LSTM04.tex
@@ -127,7 +127,7 @@
                \draw[-latex,emph] (o27) -- (aux29);
                \draw[-latex,emph] (o27) -| (aux68);
            }
-           
+
        \end{scope}

        \begin{scope}
@@ -144,8 +144,8 @@
        \end{scope}

        \begin{pgfonlayer}{background}
-            \node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
+            \node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
        \end{pgfonlayer}


-    \end{tikzpicture}
\ No newline at end of file
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-convergence&lr.tex
+++ b/Book/Chapter6/Figures/figure-convergence&lr.tex
 \definecolor{ublue}{rgb}{0.152,0.250,0.545}
 \begin{tikzpicture}
 \begin{axis}[
-  name=s1,  
-  width=7cm, height=4cm, 
+  name=s1,
+  width=7cm, height=4cm,
  xtick={-4,-3,-2,-1,0,1,2,3,4},
  ytick={0,1,...,4},
  xticklabel style={opacity=0},
  yticklabel style={opacity=0},
-  xlabel={\textbf{$\textrm{W}_t$}},
-  ylabel={\textbf{L($\textrm{W}_t$)}},
+  xlabel={$w$},
+  ylabel={$L(w)$},
  axis line style={->},
  xlabel style={xshift=2.2cm,yshift=1.2cm},
  ylabel style={rotate=-90,xshift=1.5cm,yshift=1.6cm},
@@ -26,16 +26,16 @@
 \end{axis}
 \begin{axis}[
  at={(s1.south)},
-  anchor=south, 
+  anchor=south,
  xshift=6cm,
  yshift=0cm,
-  width=7cm, height=4cm, 
+  width=7cm, height=4cm,
  xtick={-4,-3,-2,-1,0,1,2,3,4},
  ytick={0,1,...,4},
  xticklabel style={opacity=0},
  yticklabel style={opacity=0},
-  xlabel={\textbf{$\textrm{W}_t$}},
-  ylabel={\textbf{L($\textrm{W}_t$)}},
+  xlabel={$w$},
+  ylabel={$L(w)$},
  axis line style={->},
  xlabel style={xshift=2.2cm,yshift=1.2cm},
  ylabel style={rotate=-90,xshift=1.5cm,yshift=1.6cm},
@@ -52,4 +52,4 @@
 \addplot [quiver={u=-x-(x/abs(x))*(1+x^2-4)^(1/2),v=-0.7},domain=-3.13:2.6,->,samples=2,red!60,ultra thick] {x^2/4};
 \addplot [draw=ublue,fill=red,mark=*] coordinates{(0,0)};
 \end{axis}
-\end{tikzpicture}
\ No newline at end of file
+\end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-lrate-of-transformer.tex
+++ b/Book/Chapter6/Figures/figure-lrate-of-transformer.tex
@@ -7,8 +7,8 @@
      width=.60\textwidth,
      height=.40\textwidth,
      legend style={at={(0.60,0.08)}, anchor=south west},
-      xlabel={\footnotesize{num update (10k)}},
-      ylabel={\footnotesize{Learn rate  (\scriptsize{$10^{-3}$)}}},
+      xlabel={\footnotesize{更新步数 (10k)}},
+      ylabel={\footnotesize{学习率  (\scriptsize{$10^{-3}$)}}},
      ylabel style={yshift=-1em},xlabel style={yshift=0.0em},
      yticklabel style={/pgf/number format/precision=2,/pgf/number format/fixed zerofill},
      ymin=0,ymax=0.9, ytick={0.2, 0.4, 0.6, 0.8},

--- a/Book/Chapter6/Figures/figure-self-att-vs-enco-deco-att.tex
+++ b/Book/Chapter6/Figures/figure-self-att-vs-enco-deco-att.tex
 \begin{tikzpicture}
-   
+

 \node[rounded corners=1pt,minimum width=11.0em,minimum height=2.0em,fill=pink!30,draw=black](p1) at (0,0) {\small{Self-Attention}};

@@ -13,11 +13,11 @@

 \node[anchor=north,rounded corners=1pt,minimum width=11.0em,minimum height=3.5em,draw=ugreen!70,very thick,dotted](p1-1) at ([yshift=-5.2em]p1.south) {\small{Representation of each decoder position}};

-\draw [->,thick,dashed] (word3.south) .. controls +(south:1em) and +(north:1em) .. (p1-1.north);
+\draw [->,thick,dashed] (word3.south) .. controls +(south:1.5em) and +(north:1.5em) .. ([xshift=-0.4em]p1-1.north);
 \draw [->,thick,dashed](word1.south) --(p1-1.north);
-\draw [->,thick,dashed] (word2.south) .. controls +(south:1em) and +(north:1em) .. (p1-1.north);
+\draw [->,thick,dashed] (word2.south) .. controls +(south:1.0em) and +(north:1.5em) .. ([xshift=0.4em]p1-1.north);

-\node[anchor=north](caption1) at ([xshift=0.0em,yshift=-9.5em]p1.south){\small{(a)Input of Self-Attention}};
+\node[anchor=north](caption1) at ([xshift=0.0em,yshift=-9.5em]p1.south){\small{(a) Input of Self-Attention}};
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \node[anchor=west,rounded corners=1pt,minimum width=14.0em,minimum height=2.0em,fill=pink!30,draw=black](p2) at ([xshift=5.0em]p1.east){\small{Encoder-Decoder Attention}};

@@ -48,11 +48,11 @@

 \draw[<-,thick,dashed]([xshift=-3.6em,yshift=-3.2em]word1-2.north)--([xshift=-3.6em,yshift=-3.2em]p2.south);
 \draw[<-,thick,dashed]([xshift=3.6em,yshift=-3.2em]word1-2.north)--([xshift=3.6em,yshift=-3.2em]p2.south);
-\draw [->,thick,dashed] (word1-2.south) .. controls +(south:1em) and +(north:1em) .. ([yshift=0.3em]p2-3.north);
+\draw [->,thick,dashed] (word1-2.south) .. controls +(south:1em) and +(north:1.5em) .. ([yshift=0.3em,xshift=-0.4em]p2-3.north);


-\node[anchor=north](caption2) at ([xshift=0.0em,yshift=-9.5em]p2.south){\small{(b)Input of Encoder-Decoder Attention}};
+\node[anchor=north](caption2) at ([xshift=0.0em,yshift=-9.5em]p2.south){\small{(b) Input of Encoder-Decoder Attention}};



-    \end{tikzpicture}
\ No newline at end of file
+    \end{tikzpicture}
--- a/Book/Chapter6/Figures/figure-softmax.tex
+++ b/Book/Chapter6/Figures/figure-softmax.tex
 \definecolor{ublue}{rgb}{0.152,0.250,0.545}
 \begin{tikzpicture}
-\begin{axis}[  
-  width=8cm, height=5cm, 
+\begin{axis}[
+  width=8cm, height=5cm,
  xtick={-6,-4,...,6},
  ytick={0,0.5,1},
-  xlabel={\small{\textbf{x}}},
-  ylabel={\small{\textbf{Softmax(x)}}},
+  xlabel={\small{$x$}},
+  ylabel={\small{Softmax($x$)}},
  xlabel style={xshift=3.0cm,yshift=1cm},
  axis y line=middle,
  ylabel style={xshift=-2.4cm,yshift=-0.2cm},
@@ -22,4 +22,4 @@
 \end{axis}
 \end{tikzpicture}

-%---------------------------------------------------------------------
\ No newline at end of file
+%---------------------------------------------------------------------
--- a/Book/Chapter6/Figures/figure-the-whole-of-LSTM.tex
+++ b/Book/Chapter6/Figures/figure-the-whole-of-LSTM.tex
-  
-  
+
+



@@ -154,25 +154,25 @@
 \end{scope}

 \begin{pgfonlayer}{background}
-\node[draw,very thick,rectangle,fill=blue!30!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
+\node[draw,very thick,rectangle,fill=blue!10!white,rounded corners=5pt,inner sep=4pt,fit=(aux22) (aux58) (u55) (o27)] (LSTM) {};
 \end{pgfonlayer}

 \begin{scope}
 {
 % forget gate formula
-\node[formulanode,anchor=south east,text width=3.4cm] () at ([shift={(4\base,1.5\base)}]aux51) {Forget gate\\$\mathbf{f}_t=\sigma(\mathbf{W}_f[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_f)$};
+\node[formulanode,anchor=south east,text width=10em] () at ([shift={(4\base,1.5\base)}]aux51) {Forget gate\\$\mathbf{f}_t=\sigma(\mathbf{W}_f[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_f)$};
 }
 {
 % input gate formula
-\node[formulanode,anchor=north east] () at ([shift={(4\base,-1.5\base)}]aux21) {Input gate\\$\mathbf{i}_t=\sigma(\mathbf{W}_i[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_i)$\\$\hat{\mathbf{c}}_t=\mathrm{tanh}(\mathbf{W}_c[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_c)$};
+\node[formulanode,anchor=north east,text width=10em] () at ([shift={(4\base,-1.5\base)}]aux21) {Input gate\\$\mathbf{i}_t=\sigma(\mathbf{W}_i[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_i)$\\$\hat{\mathbf{c}}_t=\mathrm{tanh}(\mathbf{W}_c[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_c)$};
 }
 {
 % cell update formula
-\node[formulanode,anchor=south west,text width=3.02cm] () at ([shift={(-4\base,1.5\base)}]aux59) {Memory update\\$\mathbf{c}_{t}=\mathbf{f}_t\cdot \mathbf{c}_{t-1}+\mathbf{i}_t\cdot \hat{\mathbf{c}}_t$};
+\node[formulanode,anchor=south west,text width=10em] () at ([shift={(-4\base,1.5\base)}]aux59) {Memory update\\$\mathbf{c}_{t}=\mathbf{f}_t\cdot \mathbf{c}_{t-1}+\mathbf{i}_t\cdot \hat{\mathbf{c}}_t$};
 }
 {
 % output gate formula
-\node[formulanode,anchor=north west] () at ([shift={(-4\base,-1.5\base)}]aux29) {Output gate\\$\mathbf{o}_t=\sigma(\mathbf{W}_o[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_o)$\\$\mathbf{h}_{t}=\mathbf{o}_t\cdot \mathrm{tanh}(\mathbf{c}_{t})$};
+\node[formulanode,anchor=north west,text width=10em] () at ([shift={(-4\base,-1.5\base)}]aux29) {Output gate\\$\mathbf{o}_t=\sigma(\mathbf{W}_o[\mathbf{h}_{t-1},\mathbf{x}_t]+\mathbf{b}_o)$\\$\mathbf{h}_{t}=\mathbf{o}_t\cdot \mathrm{tanh}(\mathbf{c}_{t})$};
 }
 \end{scope}
 \end{tikzpicture}

--- a/Book/mt-book.bbl
+++ b/Book/mt-book.bbl