update chapter 5

Commit a93c4d0c authored Mar 24, 2020 by 曹润柘
--- a/Book/Chapter6/Chapter6.tex
+++ b/Book/Chapter6/Chapter6.tex
@@ -1322,15 +1322,15 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\hat{

 \parinterval 举个例子，如图\ref{fig:6-36}所示，一个汉语句子包含5个词。这里，我们用$\mathbf{h}$(``你'')表示``你''当前的表示结果。如果把``你''看作目标，这时$\mathrm{query}$就是$\mathbf{h}$(``你'')，$\mathrm{key}$和$\mathrm{value}$\\是图中所有位置的表示，即：{$\mathbf{h}$(``你'')、$\mathbf{h}$(``什么'')、$\mathbf{h}$(``也'')、$\mathbf{h}$(``没'')、$\mathbf{h}$(``学'')}。在自注意力模型中，首先计算$\mathrm{query}$和$\mathrm{key}$的相关度，这里用$\alpha_i$表示$\mathbf{h}$(``你'')和位置$i$的表示之间的相关性。然后，把$\alpha_i$作为权重，对不同位置上的$\mathrm{value}$进行加权求和。最终，得到新的表示结果$\tilde{\mathbf{h}}$ (``你'' )：
 \begin{eqnarray}
-\tilde{\mathbf{h}} (\textrm{``你''} ) = \alpha_0 {\mathbf{h}} (\textrm{``你''} )
-+ \alpha_1 {\mathbf{h}} (\textrm{``什么 ''})
-+ \alpha_2 {\mathbf{h}} (\textrm{``也''} )
-+ \alpha_3 {\mathbf{h}} (\textrm{``没''} )
-+\alpha_4 {\mathbf{h}} (\textrm{``学''} )
+\tilde{\mathbf{h}} (\textrm{``你''} ) = \alpha_1 {\mathbf{h}} (\textrm{``你''} )
++ \alpha_2 {\mathbf{h}} (\textrm{``什么 ''})
++ \alpha_3 {\mathbf{h}} (\textrm{``也''} )
++ \alpha_4 {\mathbf{h}} (\textrm{``没''} )
++\alpha_5 {\mathbf{h}} (\textrm{``学''} )
 \label{eqC6.40}
 \end{eqnarray}

-\parinterval 同理，也可以用同样的方法处理这个句子中的其它单词。可以看出，在注意力机制中，我们并不是使用类似于循环神经网络的记忆能力去访问历史信息。序列中所有单词之间的信息都是通过同一种操作（$\mathrm{query}$和$\mathrm{key}$的相关度）进行处理。这样，表示结果$\tilde{\mathbf{h}} (\textrm{``你''})$在包含``你''这个单词的信息的同时，也包含了序列中其它词的信息。也就是，序列中每一个位置的表示结果中，都包含了其它位置的信息。从这个角度说，$\tilde{\mathbf{h}} (\textrm{``你''})$已经不在是单词''你''自身的表示结果，而是一种在单词``你''的位置上的全局信息的表示。
+\parinterval 同理，也可以用同样的方法处理这个句子中的其它单词。可以看出，在注意力机制中，我们并不是使用类似于循环神经网络的记忆能力去访问历史信息。序列中所有单词之间的信息都是通过同一种操作（$\mathrm{query}$和$\mathrm{key}$的相关度）进行处理。这样，表示结果$\tilde{\mathbf{h}} (\textrm{``你''})$在包含``你''这个单词的信息的同时，也包含了序列中其它词的信息。也就是，序列中每一个位置的表示结果中，都包含了其它位置的信息。从这个角度说，$\tilde{\mathbf{h}} (\textrm{``你''})$已经不再是单词''你''自身的表示结果，而是一种在单词``你''的位置上的全局信息的表示。

 \parinterval 通常，也把生成\{ $\tilde{\mathbf{h}}(\mathbf{w}_i)$ \}的过程称为\textbf{特征提取}，而实现这个过程的模型被称为特征提取器。循环神经网络、自注意力模型都是典型的特征提取器。特征提取是神经机器翻译系统的关键步骤，在随后的内容中可以看到自注意力模型是一个非常适合机器翻译任务的特征提取器。

@@ -1473,7 +1473,7 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\hat{
 \label{eqC6.45}
 \end{eqnarray}

-\noindent 首先，我们通过对$\mathbf{Q}$和$\mathbf{K}$的转置进行点乘操作，计算得到一个维度大小为$L \times L$的相关性矩阵，即$\mathbf{Q}\mathbf{K}^{T}$，它表示一个序列上任意两个位置（$i, i’$）的相关性。再通过系数1/$\sqrt{d_k}$进行放缩操作，放缩可以尽量减少相关性矩阵的方差，具体体现在运算过程中实数矩阵中的数值不会过大，有利于模型训练。在此基础上，我们通过对相关性矩阵累加一个掩码矩阵，来屏蔽掉矩阵中的无用信息。比如，在编码端对句子的补齐，在解码端则屏蔽掉未来信息，这一部分内容将在下一小节进行详细介绍。随后我们使用Softmax函数对相关性矩阵在行的维度上进行归一化操作，这可以理解为对第$i$行进行归一化，结果对应了$\mathbf{V}$中的不同位置上向量的注意力权重。对于$\mathrm{value}$的加权求和，可以直接用相关性性系数和$\mathbf{V}$进行矩阵乘法得到，即$\textrm{Softmax}
+\noindent 首先，我们通过对$\mathbf{Q}$和$\mathbf{K}$的转置进行点乘操作，计算得到一个维度大小为$L \times L$的相关性矩阵，即$\mathbf{Q}\mathbf{K}^{T}$，它表示一个序列上任意两个位置（$i, i’$）的相关性。再通过系数1/$\sqrt{d_k}$进行放缩操作，放缩可以尽量减少相关性矩阵的方差，具体体现在运算过程中实数矩阵中的数值不会过大，有利于模型训练。在此基础上，我们通过对相关性矩阵累加一个掩码矩阵，来屏蔽掉矩阵中的无用信息。比如，在编码端对句子的补齐，在解码端则屏蔽掉未来信息，这一部分内容将在下一小节进行详细介绍。随后我们使用Softmax函数对相关性矩阵在行的维度上进行归一化操作，这可以理解为对第$i$行进行归一化，结果对应了$\mathbf{V}$中的不同位置上向量的注意力权重。对于$\mathrm{value}$的加权求和，可以直接用相关性系数和$\mathbf{V}$进行矩阵乘法得到，即$\textrm{Softmax}
 ( \frac{\mathbf{Q}\mathbf{K}^{T}} {\sqrt{d_k}} + \mathbf{Mask} )$和$\mathbf{V}$进行矩阵乘。最终我们就得到了自注意力的输出，它和输入的$\mathbf{V}$的大小是一模一样的。图\ref{fig:6-43}展示了点乘注意力计算的全过程。

 %----------------------------------------------
@@ -1504,7 +1504,7 @@ L(\mathbf{Y},\hat{\mathbf{Y}}) = \sum_{j=1}^n L_{\textrm{ce}}(\mathbf{y}_j,\hat{
 \parinterval 我们在公式\ref{eqC6.45}中提到了Mask（掩码），它的目的是对向量中某些值进行掩盖，避免无关位置的数值对运算造成影响。Transformer中的Mask主要应用在注意力机制中的相关性系数计算，具体方式是在相关性系数矩阵上累加一个Mask矩阵。该矩阵在需要Mask的位置的值为负无穷-inf（具体实现时是一个绝对值非常大的负数，比如-1e9），其余位置为0，这样在进行了Softmax归一化操作之后，被掩码掉的位置计算得到的权重便近似为0，也就是说对无用信息分配的权重为0，从而避免了其对结果产生影响。Transformer包含两种Mask：

 \begin{itemize}
-\item Padding Mask。在批量处理多个样本时（训练或解码），由于要对源言语和目标语言的输入进行批次化处理，而每个批次内序列的长度不一样，为了方便对批次内序列进行矩阵表示，需要进行对齐操作，即在较短的序列后面填充0来占位（padding操作）。而这些填充的位置没有意义，不参与注意力机制的计算，因此，需要进行Mask操作，屏蔽其影响。
+\item Padding Mask。在批量处理多个样本时（训练或解码），由于要对源语言和目标语言的输入进行批次化处理，而每个批次内序列的长度不一样，为了方便对批次内序列进行矩阵表示，需要进行对齐操作，即在较短的序列后面填充0来占位（padding操作）。而这些填充的位置没有意义，不参与注意力机制的计算，因此，需要进行Mask操作，屏蔽其影响。

 \item Future Mask。对于解码器来说，由于在预测的时候是自左向右进行的，即第$t$时刻解码器的输出只能依赖于$t$时刻之前的输出。且为了保证训练解码一致，避免在训练过程中观测到目标语端每个位置未来的信息，因此需要对未来信息进行屏蔽。具体的做法是：构造一个上三角值全为-inf的Mask矩阵，也就是说，在解码端计算中，在当前位置，我们通过Future Mask把序列之后的信息屏蔽掉了，避免了$t$之后的位置对当前的计算产生影响。图\ref{fig:6-45}给出了一个具体的实例。


--- a/Book/Chapter6/Figures/figure-Example-of-self-attention-mechanism-calculation.tex
+++ b/Book/Chapter6/Figures/figure-Example-of-self-attention-mechanism-calculation.tex
@@ -4,29 +4,28 @@
 \begin{tikzpicture}
 \begin{scope}

-\tikzstyle{rnode} = [draw,minimum width=3.5em,minimum height=1.2em]
-
-\node [rnode,anchor=south west,fill=red!20!white] (value1) at (0,0) {\scriptsize{$\textbf{h}(\textrm{``你''})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value2) at ([xshift=1em]value1.south east) {\scriptsize{$\textbf{h}(\textrm{``什么''})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value3) at ([xshift=1em]value2.south east) {\scriptsize{$\textbf{h}(\textrm{``也''})$}};
-\node [rnode,anchor=south west,fill=red!20!white] (value4) at ([xshift=1em]value3.south east) {\scriptsize{$\textbf{h}(\textrm{``没''})$}};
-
-\node [rnode,anchor=south west,fill=green!20!white] (key1) at ([yshift=0.2em]value1.north west) {\scriptsize{$\textbf{h}(\textrm{``你''})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key2) at ([yshift=0.2em]value2.north west) {\scriptsize{$\textbf{h}(\textrm{``什么''})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key3) at ([yshift=0.2em]value3.north west) {\scriptsize{$\textbf{h}(\textrm{``也''})$}};
-\node [rnode,anchor=south west,fill=green!20!white] (key4) at ([yshift=0.2em]value4.north west) {\scriptsize{$\textbf{h}(\textrm{``没''})$}};
-
-\node [rnode,anchor=east] (query) at ([xshift=-2em]key1.west) {\scriptsize{$\textbf{s}(\textrm{``you''})$}};
-\node [anchor=east] (querylabel) at ([xshift=-0.2em]query.west) {\scriptsize{query}};
-
-\draw [->] ([yshift=1pt,xshift=6pt]query.north) .. controls +(90:1em) and +(90:1em) .. ([yshift=1pt]key1.north);
-\draw [->] ([yshift=1pt,xshift=3pt]query.north) .. controls +(90:1.5em) and +(90:1.5em) .. ([yshift=1pt]key2.north);
-\draw [->] ([yshift=1pt]query.north) .. controls +(90:2em) and +(90:2em) .. ([yshift=1pt]key3.north);
-\draw [->] ([yshift=1pt,xshift=-3pt]query.north) .. controls +(90:2.5em) and +(90:2.5em) .. ([yshift=1pt]key4.north);
-\node [anchor=south east] (alpha1) at ([xshift=1em]key1.north east) {\scriptsize{$\alpha_1=.4$}};
-\node [anchor=south east] (alpha2) at ([xshift=1em]key2.north east) {\scriptsize{$\alpha_2=.4$}};
-\node [anchor=south east] (alpha3) at ([xshift=1em]key3.north east) {\scriptsize{$\alpha_3=0$}};
-\node [anchor=south east] (alpha4) at ([xshift=1em]key4.north east) {\scriptsize{$\alpha_4=.1$}};
+\tikzstyle{rnode} = [draw,minimum width=2.8em,minimum height=1.2em]
+
+
+
+\node [rnode,anchor=south west,fill=green!20!white] (key11) at (0,0) {\scriptsize{$\textbf{h}(\textrm{``你''})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key12) at ([xshift=0.8em]key11.south east) {\scriptsize{$\textbf{h}(\textrm{``什么''})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key13) at ([xshift=0.8em]key12.south east) {\scriptsize{$\textbf{h}(\textrm{``也''})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key14) at ([xshift=0.8em]key13.south east) {\scriptsize{$\textbf{h}(\textrm{``没''})$}};
+\node [rnode,anchor=south west,fill=green!20!white] (key15) at ([xshift=0.8em]key14.south east) {\scriptsize{$\textbf{h}(\textrm{``学''})$}};
+
+\node [rnode,anchor=east] (query1) at ([xshift=-1em]key11.west) {\scriptsize{$\textbf{h}(\textrm{``你''})$}};
+
+\draw [->] ([yshift=1pt,xshift=4pt]query1.north) .. controls +(90:0.6em) and +(90:0.6em) .. ([yshift=1pt]key11.north);
+\draw [->] ([yshift=1pt,xshift=0pt]query1.north) .. controls +(90:1.0em) and +(90:1.0em) .. ([yshift=1pt]key12.north);
+\draw [->] ([yshift=1pt,xshift=-4pt]query1.north) .. controls +(90:1.4em) and +(90:1.4em) .. ([yshift=1pt]key13.north);
+\draw [->] ([yshift=1pt,xshift=-8pt]query1.north) .. controls +(90:1.8em) and +(90:1.8em) .. ([yshift=1pt]key14.north);
+\draw [->] ([yshift=1pt,xshift=-12pt]query1.north) .. controls +(90:2.2em) and +(90:2.2em) .. ([yshift=1pt]key15.north);
+\node [anchor=south west] (alpha11) at ([xshift=0.3em]key11.north) {\scriptsize{$\alpha_1$}};
+\node [anchor=south west] (alpha12) at ([xshift=0.3em]key12.north) {\scriptsize{$\alpha_2$}};
+\node [anchor=south west] (alpha13) at ([xshift=0.3em]key13.north) {\scriptsize{$\alpha_3$}};
+\node [anchor=south west] (alpha14) at ([xshift=0.3em]key14.north) {\scriptsize{$\alpha_4$}};
+\node [anchor=south west] (alpha15) at ([xshift=0.3em]key15.north) {\scriptsize{$\alpha_5$}};

 \end{scope}
 \end{tikzpicture}
\ No newline at end of file
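
Reviewer sketch (not part of the commit): the passages touched above describe self-attention as a weighted sum of value vectors, computed as Softmax(QK^T / sqrt(d_k) + Mask) V, with a padding mask and a future mask. The pure-Python sketch below may help check that reading. Everything in it is an assumption made for illustration: the function names (softmax, future_mask, padding_mask, attention), the toy vectors in H, and the constant -1e9 standing in for -inf. It is not the book's or any library's implementation.

```python
import math

# Assumption: a large negative constant stands in for -inf in the mask.
NEG_INF = -1e9


def softmax(row):
    """Normalize one row of scores into attention weights."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def future_mask(length):
    """L x L mask whose strictly upper triangle is NEG_INF (hides later positions)."""
    return [[NEG_INF if j > i else 0.0 for j in range(length)]
            for i in range(length)]


def padding_mask(length, valid_len):
    """L x L mask that hides key positions at index valid_len and beyond (padding)."""
    return [[0.0 if j < valid_len else NEG_INF for j in range(length)]
            for _ in range(length)]


def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k) + Mask) V.

    Q, K, V are lists of equal-length vectors; returns (output, weights).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    output, weights = [], []
    for i, q in enumerate(Q):
        # Relevance of position i to every position j, scaled by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        if mask is not None:
            scores = [s + m for s, m in zip(scores, mask[i])]
        alpha = softmax(scores)  # row-wise normalization
        weights.append(alpha)
        # Weighted sum of the value vectors with weights alpha.
        output.append([sum(alpha[j] * V[j][c] for j in range(len(V)))
                       for c in range(len(V[0]))])
    return output, weights


# Hypothetical 4-dimensional representations for a 5-word sentence.
H = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]

if __name__ == "__main__":
    # Self-attention: query, key and value all come from the same sequence.
    out, alpha = attention(H, H, H, mask=future_mask(len(H)))
    for row in alpha:
        print([round(a, 3) for a in row])
```

In this sketch the output has one vector per input position, with the same dimensionality as V, and each row of weights sums to 1. With the future mask applied, every weight above the diagonal comes out as zero, so position t only mixes in positions up to t.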