fixed formatting issues

Commit b71c66fc authored Mar 08, 2020 by xiaotong
--- a/.gitignore
+++ b/.gitignore
@@ -27,3 +27,5 @@ Section07-Towards-Strong-NMT-Systems/section07.pdf
 Book/mt-book.run.xml
 Book/mt-book-xelatex.bcf
 Book/mt-book-xelatex.idx
+Book/mt-book-xelatex.run.xml
+Book/mt-book-xelatex.synctex(busy)
--- a/Book/Chapter2/chapter2.tex
+++ b/Book/Chapter2/chapter2.tex
@@ -87,7 +87,7 @@
 %Table 1--------------------------------------------------------------------
 \begin{table}[htp]
 \centering
-\caption{Probability distribution of the discrete variable A}
+\caption{Probability distribution of the discrete variable $A$}
 \begin{tabular}{c|c c c c c c}
 \rule{0pt}{15pt}     A & $a_1=1$ & $a_2=2$ & $a_3=3$ & $a_4=4$ & $a_5=5$ & $a_6=6$\\
               \hline
@@ -101,7 +101,7 @@

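 \parinterval For a discrete variable $A$, $\textrm{P}(A=a)$ is a definite value that expresses how likely the event $A=a$ is. For a continuous variable, the probability at a single point is not meaningful; only the probability of falling within an interval can be computed. For this reason, the \textbf{probability distribution function $\textrm{F}(x)$} and the \textbf{probability density function $\textrm{f}(x)$} are used to describe the distribution of a random variable in a uniform way. The distribution function $\textrm{F}(x)$ gives the probability that the variable does not exceed a given value; it is the cumulative form of probability. Let $A$ be a random variable and $a$ any real number; the function $\textrm{F}(a)=\textrm{P}\{A\leq a\}$, $-\infty<a<\infty$, is defined as the distribution function of $A$. With the distribution function, the probabilities of any random variable can be expressed clearly.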

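-\parinterval For a continuous variable, we cannot list every probability value as we can for a discrete variable; instead, a probability density function describes the distribution. The probability density function reflects how fast probability changes within an interval: its value is the rate of change of probability, and the probability of the continuous variable is obtained by integrating the density. Let $\textrm{f}(x) \geq 0$ be the probability density function of the continuous variable $X$; the distribution function of $X$ can then be written as $\textrm{F}(x)=\int_{-\infty}^x \textrm{f}(t)dt \ (x\in R)$.
+\parinterval For a continuous variable, we cannot list every probability value as we can for a discrete variable; instead, a probability density function describes the distribution. The probability density function reflects how fast probability changes within an interval: its value is the rate of change of probability, and the probability of the continuous variable is obtained by integrating the density. Let $\textrm{f}(x) \geq 0$ be the probability density function of the continuous variable $X$; the distribution function of $X$ can then be written as $\textrm{F}(x)=\int_{-\infty}^x \textrm{f}(t)dt \ (x\in \mathbb{R})$.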

 %----------------------------------------------
 % Figure 2.3
@@ -310,7 +310,7 @@

 \parinterval The self-information of an event $x$ is given by:
 \begin{eqnarray}
-\textrm{I}(x)=-log\textrm{P}(x)
+\textrm{I}(x)=-\log\textrm{P}(x)
 \label{eqC2.17-new}
 \end{eqnarray}

@@ -327,17 +327,17 @@

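 \parinterval Self-information deals with a single outcome only. To quantify the uncertainty, or the amount of information, in an entire probability distribution, we can use information entropy, defined as: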
 \begin{eqnarray}
-\textrm{H}(x)=\sum_{x \in \textrm{X}}[ \textrm{P}(x) \textrm{I}(x)] =- \sum_{x \in \textrm{X} } [\textrm{P}(x)log(\textrm{P}(x)) ]
+\textrm{H}(x)=\sum_{x \in \textrm{X}}[ \textrm{P}(x) \textrm{I}(x)] =- \sum_{x \in \textrm{X} } [\textrm{P}(x)\log(\textrm{P}(x)) ]
 \label{eqC2.18-new}
 \end{eqnarray}

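-\parinterval The information entropy of a distribution is the expected amount of information of an event drawn from that distribution. For example, suppose there are four teams $a$, $b$, $c$ and $d$, whose probabilities of winning the championship are $P1$, $P2$, $P3$ and $P4$. Someone who has no interest in the matches but wants to know which team won can find out with two yes/no questions by bisection. If, however, we know that team $c$ is much stronger than the other three, a single guess is usually enough. In the former case the information about which team wins, and hence the entropy, is relatively high; in the latter case both are relatively low. We can therefore conclude that a sharply peaked distribution has lower entropy, and that the closer a distribution is to uniform, the higher its entropy.
+\parinterval The information entropy of a distribution is the expected amount of information of an event drawn from that distribution. For example, suppose there are four teams $a$, $b$, $c$ and $d$, whose probabilities of winning the championship are $P_1$, $P_2$, $P_3$ and $P_4$. Someone who has no interest in the matches but wants to know which team won can find out with two yes/no questions by bisection. If, however, we know that team $c$ is much stronger than the other three, a single guess is usually enough. In the former case the information about which team wins, and hence the entropy, is relatively high; in the latter case both are relatively low. We can therefore conclude that a sharply peaked distribution has lower entropy, and that the closer a distribution is to uniform, the higher its entropy.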

 \subsubsection{(2) KL Distance}\index{Chapter2.2.5.2}

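 \parinterval If there are two separate probability distributions P$(x)$ and Q$(x)$ over the same random variable $X$, the KL distance (Kullback-Leibler divergence), also known as relative entropy, can be used to measure how much they differ. It is defined as: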
 \begin{eqnarray}
-\textrm{D}_{KL}(P\parallel Q) & = & \sum_{x \in \textrm{X}} [ \textrm{P}(x)\log \frac{\textrm{P}(x) }{ \textrm{Q}(x) } ]  \nonumber \\
+\textrm{D}_{\textrm{KL}}(\textrm{P}\parallel \textrm{Q}) & = & \sum_{x \in \textrm{X}} [ \textrm{P}(x)\log \frac{\textrm{P}(x) }{ \textrm{Q}(x) } ]  \nonumber \\
                                                                                       & = & \sum_{x \in \textrm{X} }[ \textrm{P}(x)(\log\textrm{P}(x)-\log \textrm{Q}(x))]
 \label{eqC2.19-new}
 \end{eqnarray}
@@ -346,9 +346,9 @@

 \vspace{0.5em}
 \begin{itemize}
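-\item Non-negativity: $\textrm{D}_{KL} (P \parallel Q) \geqslant 0$, with equality if and only if $P$ and $Q$ are the same distribution in the discrete case, or are equal almost everywhere in the continuous case. Put simply, $P$ and $Q$ are equivalent.
+\item Non-negativity: $\textrm{D}_{\textrm{KL}} (\textrm{P} \parallel \textrm{Q}) \geqslant 0$, with equality if and only if $P$ and $Q$ are the same distribution in the discrete case, or are equal almost everywhere in the continuous case. Put simply, $P$ and $Q$ are equivalent.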
 \vspace{0.5em}
-\item Asymmetry: in general $\textrm{D}_{KL} (P \parallel Q) \neq \textrm{D}_{KL} (Q  \parallel P)$, so the $KL$ distance is not a distance in the everyday sense. Because of this asymmetry, the choice between $\textrm{D}_{KL} (P  \parallel Q)$ and $\textrm{D}_{KL} (Q  \parallel P)$ can have a significant effect.
+\item Asymmetry: in general $\textrm{D}_{\textrm{KL}} (\textrm{P} \parallel \textrm{Q}) \neq \textrm{D}_{\textrm{KL}} (\textrm{Q}  \parallel \textrm{P})$, so the $KL$ distance is not a distance in the everyday sense. Because of this asymmetry, the choice between $\textrm{D}_{\textrm{KL}} (\textrm{P}  \parallel \textrm{Q})$ and $\textrm{D}_{\textrm{KL}} (\textrm{Q}  \parallel \textrm{P})$ can have a significant effect.
 \end{itemize}
 \vspace{0.5em}

@@ -356,7 +356,7 @@

 \parinterval Cross entropy is a concept closely related to the KL distance. It is defined as:
 \begin{eqnarray}
-\textrm{H}(P,Q)=-\sum_{x \in \textrm{X}} [\textrm{P}(x) log \textrm{Q}(x) ]
+\textrm{H}(\textrm{P},\textrm{Q})=-\sum_{x \in \textrm{X}} [\textrm{P}(x) \log \textrm{Q}(x) ]
 \label{eqC2.20-new}
 \end{eqnarray}


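Note on the chapter2.tex hunks above: the entropy, KL distance, and cross-entropy formulas they touch can be checked numerically. The following is a minimal sketch, not code from the book. The function names, the base-2 logarithm (the formulas leave the base unspecified), and the "peaked" probabilities for the four-team example are all assumptions made for illustration.

```python
import math


def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x); terms with P(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log2(P(x) / Q(x)).

    Assumes Q(x) > 0 wherever P(x) > 0.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)


def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x), under the same assumption on Q."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)


# Four teams a, b, c, d with equal chances: two yes/no questions are needed.
uniform = [0.25, 0.25, 0.25, 0.25]
# Hypothetical values for the case where team c is much stronger.
peaked = [0.05, 0.05, 0.85, 0.05]

print("H(uniform) =", entropy(uniform))  # 2.0 bits
print("H(peaked)  =", entropy(peaked))
print("D_KL(peaked || uniform) =", kl_divergence(peaked, uniform))
print("D_KL(uniform || peaked) =", kl_divergence(uniform, peaked))
print("H(peaked, uniform)      =", cross_entropy(peaked, uniform))
```

With these inputs the uniform distribution has exactly 2 bits of entropy, matching the two bisection steps in the team example, while the peaked one has less. The two KL values differ, which illustrates the asymmetry property, and the cross entropy equals the entropy plus the KL distance, which is the sense in which the two concepts are closely related.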
--- a/Book/Chapter3/Chapter3.tex
+++ b/Book/Chapter3/Chapter3.tex
@@ -11,7 +11,7 @@
 \chapterimage{chapter_head_1} % Chapter heading image
 %Equations after 1.7 are shifted back by one
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\chapter{Word-based Translation Models}
+\chapter{Word-based Machine Translation Models}

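 \parinterval Modeling translation with statistical methods was an important milestone in the development of machine translation, and the idea still shapes today's statistical and neural machine translation methods. Although the technology keeps advancing and traditional statistical models are no longer ``new'', they continue to offer important lessons for current machine translation research. While following the state of the art and looking to the future, we should also reflect calmly on what earlier researchers have given us. With this in mind, this chapter introduces the founding work of statistical machine translation\ \dash \ the IBM models, which proposed the idea of translating with statistical models and introduced the important concept of word alignment into the modeling. The IBM models were proposed by Peter F. Brown et al. in the early 1990s\cite{brown1993mathematics}. Objectively speaking, the vision of this work and its understanding of the problem went beyond what many people at the time could see, and the family of methods and new problems it gave rise to occupied researchers for nearly ten more years. Even today, some of the ideas in the IBM models still influence much research.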
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

--- a/Book/Chapter6/Chapter6.tex
+++ b/Book/Chapter6/Chapter6.tex
--- a/Book/Chapter6/Figures/figure-a-simple-example-for-tl.tex
+++ b/Book/Chapter6/Figures/figure-a-simple-example-for-tl.tex
@@ -9,14 +9,14 @@
 \node [pos=0.4,left,xshift=-36em,yshift=7em,font=\small] (original0) {\quad Source (Chinese) input:};
 \node [pos=0.4,left,xshift=-22em,yshift=7em,font=\small] (original1) {
 \begin{tabular}[t]{l}
-\parbox{14em}{''我''、''很''、''好''、''<eos>'' }
+\parbox{14em}{``我''、``很''、``好''、``<eos>'' }
 \end{tabular}
 };
 %Translation 1--------------mt1
 \node[font=\small] (mt1) at ([xshift=0em,yshift=-1em]original0.south) {Target (English) output:};
 \node[font=\small] (ts1) at ([xshift=0em,yshift=-1em]original1.south)  {
 \begin{tabular}[t]{l}
-\parbox{14em}{''I''、''am''、''fine''、''<eos>''}
+\parbox{14em}{``I''、``am''、``fine''、``<eos>''}
 \end{tabular}
 };


--- a/Book/Chapter6/Figures/figure-encoder-decoder-process.tex
+++ b/Book/Chapter6/Figures/figure-encoder-decoder-process.tex
@@ -20,12 +20,11 @@
 \node (cell010) at ([xshift=-9em,yshift=0em]cell01.west){\quad};

 %\rightarrow {}
-\node [anchor=west,minimum width=1.5em,minimum size=1.5em] (cell07) at (cell06.east) {\hspace{0.07em}\footnotesize{--->}};
+\node [anchor=west,minimum width=1.5em,minimum size=1.5em] (cell07) at (cell06.east) {\hspace{0.07em}\footnotesize{$\longrightarrow$}};
 \node [anchor=west,minimum width=1.5em,minimum size=1.5em] (cell08) at (cell06.east){\small{
 \hspace{0.6em}
 \begin{tabular}{l}
-Source-language\\sentence
-{\red''representation''}
+Source-language sentence ``representation''
 \end{tabular}
 }
 };

--- a/Book/Chapter6/Figures/figure-process-of-5.tex
+++ b/Book/Chapter6/Figures/figure-process-of-5.tex
@@ -44,7 +44,7 @@

 \node(eq1) at ([xshift=0.5em,yshift=0]bra.east){=};

-\node(sof1) at ([xshift=2em,yshift=0]eq1.east){softmax(};
+\node(sof1) at ([xshift=2em,yshift=0]eq1.east){Softmax(};

 %-----------------------------------------------------------
 %QK+MASK
@@ -103,7 +103,7 @@
 %------------------------------
 %Second row
 \node(eq2) at  ([xshift=0em,yshift=-6em]eq1.south){=};
-\node(sof2) at ([xshift=2em,yshift=0]eq2.east){softmax(};
+\node(sof2) at ([xshift=2em,yshift=0]eq2.east){Softmax(};
 %Middle pink matrix
 \node(mid) at  ([xshift=1.5em,yshift=0em]sof2.east){
 \begin{tabular}{|l|l|l|}