NiuTrans / mtbookv2
Commit afbe1e67, authored Jan 14, 2021 by zengxin
Merge branch 'zengxin' into 'caorunzhe'
Zengxin: see merge request !907
Parents: 14c46f04, 33bd6ea0
Showing 5 changed files with 51 additions and 50 deletions
Chapter10/chapter10.tex  +0 -0
Chapter11/Figures/figure-structural-comparison-a.tex  +10 -9
Chapter11/Figures/figure-structural-comparison-b.tex  +34 -34
Chapter11/chapter11.tex  +0 -0
Chapter12/chapter12.tex  +7 -7
Chapter10/chapter10.tex
(diff collapsed)
Chapter11/Figures/figure-structural-comparison-a.tex
@@ -4,13 +4,13 @@
 \begin{tikzpicture}[node distance = 0cm]
 \node(num1)[num]{$\mathbi{e}_1$};
-\node(num2)[num,right of = num1,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_2$}};
-\node(num3)[num,right of = num2,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_3$}};
-\node(num4)[num,right of = num3,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_4$}};
-\node(num5)[num,right of = num4,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_5$}};
-\node(num6)[num,right of = num5,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_6$}};
-\node(num7)[num,right of = num6,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_7$}};
-\node(num8)[num,right of = num7,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_8$}};
+\node(num2)[num,right of = num1,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_2$}};
+\node(num3)[num,right of = num2,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_3$}};
+\node(num4)[num,right of = num3,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_4$}};
+\node(num5)[num,right of = num4,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_5$}};
+\node(num6)[num,right of = num5,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_6$}};
+\node(num7)[num,right of = num6,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_7$}};
+\node(num8)[num,right of = num7,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_8$}};
 \node(num9)[num,right of = num8,xshift = 1.2cm]{$\mathbi{e}_9$};
 %\node(A)[below of = num2,yshift = -0.6cm]{A};
 %\node(B)[below of = num8,yshift = -0.6cm]{B};
@@ -23,8 +23,8 @@
 \draw[->, thick, color = blue!80](num6.east)--(num7.west);
 \draw[->, thick, color = blue!80](num7.east)--(num8.west);
-\draw[->,thick,color = black!70] (num1) -- (num2);
-\draw[->,thick,color =black!70] (num8) -- (num9);
+\draw[->,thick,color = black!85] (num1) -- (num2);
+\draw[->,thick,color =black!85] (num8) -- (num9);
 \end{tikzpicture}
\ No newline at end of file
Chapter11/Figures/figure-structural-comparison-b.tex
@@ -4,13 +4,13 @@
 \begin{tikzpicture}[node distance = 0cm]
 \node(num1_0)[num, fill = blue!40]{\textcolor{white}{$\mathbi{0}$}};
 \node(num1_1)[num,right of = num1_0,xshift = 1.2cm]{$\mathbi{e}_1$};
-\node(num1_2)[num,right of = num1_1,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_2$}};
-\node(num1_3)[num,right of = num1_2,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_3$}};
-\node(num1_4)[num,right of = num1_3,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_4$}};
-\node(num1_5)[num,right of = num1_4,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_5$}};
-\node(num1_6)[num,right of = num1_5,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_6$}};
-\node(num1_7)[num,right of = num1_6,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_7$}};
-\node(num1_8)[num,right of = num1_7,xshift = 1.2cm]{\textcolor{blue!70}{$\mathbi{e}_8$}};
+\node(num1_2)[num,right of = num1_1,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_2$}};
+\node(num1_3)[num,right of = num1_2,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_3$}};
+\node(num1_4)[num,right of = num1_3,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_4$}};
+\node(num1_5)[num,right of = num1_4,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_5$}};
+\node(num1_6)[num,right of = num1_5,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_6$}};
+\node(num1_7)[num,right of = num1_6,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_7$}};
+\node(num1_8)[num,right of = num1_7,xshift = 1.2cm]{\textcolor{blue!85}{$\mathbi{e}_8$}};
 \node(num1_9)[num,right of = num1_8,xshift = 1.2cm]{$\mathbi{e}_9$};
 \node(num1_10)[num,right of = num1_9,xshift = 1.2cm, fill = blue!40]{$\mathbi{0}$};
 %\node(A)[below of = num2,yshift = -0.6cm]{A};
@@ -19,11 +19,11 @@
 \node(num2_0)[num,above of = num1_0,yshift = 1.2cm, fill = blue!40]{\textcolor{white}{$\mathbi{0}$}};
 \node(num2_1)[num,right of = num2_0,xshift = 1.2cm]{\textbf2};
 \node(num2_2)[num,right of = num2_1,xshift = 1.2cm]{\textbf2};
-\node(num2_3)[num,right of = num2_2,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_4)[num,right of = num2_3,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_5)[num,right of = num2_4,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_6)[num,right of = num2_5,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
-\node(num2_7)[num,right of = num2_6,xshift = 1.2cm]{\textbf{\textcolor{blue!70}2}};
+\node(num2_3)[num,right of = num2_2,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_4)[num,right of = num2_3,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_5)[num,right of = num2_4,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_6)[num,right of = num2_5,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
+\node(num2_7)[num,right of = num2_6,xshift = 1.2cm]{\textbf{\textcolor{blue!85}2}};
 \node(num2_8)[num,right of = num2_7,xshift = 1.2cm]{\textbf2};
 \node(num2_9)[num,right of = num2_8,xshift = 1.2cm]{\textbf2};
 \node(num2_10)[num,right of = num2_9,xshift = 1.2cm, fill = blue!40]{$\mathbi{0}$};
@@ -32,9 +32,9 @@
 \node(num3_1)[num,right of = num3_0,xshift = 1.2cm]{\textbf3};
 \node(num3_2)[num,right of = num3_1,xshift = 1.2cm]{\textbf3};
 \node(num3_3)[num,right of = num3_2,xshift = 1.2cm]{\textbf3};
-\node(num3_4)[num,right of = num3_3,xshift = 1.2cm]{\textbf{\textcolor{blue!70}3}};
-\node(num3_5)[num,right of = num3_4,xshift = 1.2cm]{\textbf{\textcolor{blue!70}3}};
-\node(num3_6)[num,right of = num3_5,xshift = 1.2cm]{\textbf{\textcolor{blue!70}3}};
+\node(num3_4)[num,right of = num3_3,xshift = 1.2cm]{\textbf{\textcolor{blue!85}3}};
+\node(num3_5)[num,right of = num3_4,xshift = 1.2cm]{\textbf{\textcolor{blue!85}3}};
+\node(num3_6)[num,right of = num3_5,xshift = 1.2cm]{\textbf{\textcolor{blue!85}3}};
 \node(num3_7)[num,right of = num3_6,xshift = 1.2cm]{\textbf3};
 \node(num3_8)[num,right of = num3_7,xshift = 1.2cm]{\textbf3};
 \node(num3_9)[num,right of = num3_8,xshift = 1.2cm]{\textbf3};
@@ -45,7 +45,7 @@
 \node(num4_2)[num,right of = num4_1,xshift = 1.2cm]{\textbf4};
 \node(num4_3)[num,right of = num4_2,xshift = 1.2cm]{\textbf4};
 \node(num4_4)[num,right of = num4_3,xshift = 1.2cm]{\textbf4};
-\node(num4_5)[num,right of = num4_4,xshift = 1.2cm]{\textbf{\textcolor{blue!60}4}};
+\node(num4_5)[num,right of = num4_4,xshift = 1.2cm]{\textbf{\textcolor{blue!80}4}};
 \node(num4_6)[num,right of = num4_5,xshift = 1.2cm]{\textbf4};
 \node(num4_7)[num,right of = num4_6,xshift = 1.2cm]{\textbf4};
 \node(num4_8)[num,right of = num4_7,xshift = 1.2cm]{\textbf4};
@@ -58,19 +58,19 @@
 \draw[->, thick](num1_1.north)--([xshift=-0.1em,yshift=-0.1em]num2_2.south);
 \draw[->, thick](num2_1.north)--([xshift=-0.1em,yshift=-0.1em]num3_2.south);
 \draw[->, thick](num3_1.north)--([xshift=-0.1em,yshift=-0.1em]num4_2.south);
-\draw[->, thick, color = blue!60](num1_2.north)--([xshift=-0.1em,yshift=-0.1em]num2_3.south);
+\draw[->, thick, color = blue!80](num1_2.north)--([xshift=-0.1em,yshift=-0.1em]num2_3.south);
 \draw[->, thick](num2_2.north)--([xshift=-0.1em,yshift=-0.1em]num3_3.south);
 \draw[->, thick](num3_2.north)--([xshift=-0.1em,yshift=-0.1em]num4_3.south);
-\draw[->, thick, color = blue!60](num1_3.north)--([xshift=-0.1em,yshift=-0.1em]num2_4.south);
-\draw[->, thick, color = blue!60](num2_3.north)--([xshift=-0.1em,yshift=-0.1em]num3_4.south);
+\draw[->, thick, color = blue!80](num1_3.north)--([xshift=-0.1em,yshift=-0.1em]num2_4.south);
+\draw[->, thick, color = blue!80](num2_3.north)--([xshift=-0.1em,yshift=-0.1em]num3_4.south);
 \draw[->, thick](num3_3.north)--([xshift=-0.1em,yshift=-0.1em]num4_4.south);
-\draw[->, thick, color = blue!60](num1_4.north)--([xshift=-0.1em,yshift=-0.1em]num2_5.south);
-\draw[->, thick, color = blue!60](num2_4.north)--([xshift=-0.1em,yshift=-0.1em]num3_5.south);
-\draw[->, thick, color = blue!60](num3_4.north)--([xshift=-0.1em,yshift=-0.1em]num4_5.south);
-\draw[->, thick, color = blue!60](num1_5.north)--([xshift=-0.1em,yshift=-0.1em]num2_6.south);
-\draw[->, thick, color = blue!60](num2_5.north)--([xshift=-0.1em,yshift=-0.1em]num3_6.south);
+\draw[->, thick, color = blue!80](num1_4.north)--([xshift=-0.1em,yshift=-0.1em]num2_5.south);
+\draw[->, thick, color = blue!80](num2_4.north)--([xshift=-0.1em,yshift=-0.1em]num3_5.south);
+\draw[->, thick, color = blue!80](num3_4.north)--([xshift=-0.1em,yshift=-0.1em]num4_5.south);
+\draw[->, thick, color = blue!80](num1_5.north)--([xshift=-0.1em,yshift=-0.1em]num2_6.south);
+\draw[->, thick, color = blue!80](num2_5.north)--([xshift=-0.1em,yshift=-0.1em]num3_6.south);
 \draw[->, thick](num3_5.north)--([xshift=-0.1em,yshift=-0.1em]num4_6.south);
-\draw[->, thick, color = blue!60](num1_6.north)--([xshift=-0.1em,yshift=-0.1em]num2_7.south);
+\draw[->, thick, color = blue!80](num1_6.north)--([xshift=-0.1em,yshift=-0.1em]num2_7.south);
 \draw[->, thick](num2_6.north)--([xshift=-0.1em,yshift=-0.1em]num3_7.south);
 \draw[->, thick](num3_6.north)--([xshift=-0.1em,yshift=-0.1em]num4_7.south);
 \draw[->, thick](num1_7.north)--([xshift=-0.1em,yshift=-0.1em]num2_8.south);
@@ -86,19 +86,19 @@
 \draw[->, thick](num1_3.north)--([xshift=0.1em,yshift=-0.1em]num2_2.south);
 \draw[->, thick](num2_3.north)--([xshift=0.1em,yshift=-0.1em]num3_2.south);
 \draw[->, thick](num3_3.north)--([xshift=0.1em,yshift=-0.1em]num4_2.south);
-\draw[->, thick, color = blue!60](num1_4.north)--([xshift=0.1em,yshift=-0.1em]num2_3.south);
+\draw[->, thick, color = blue!80](num1_4.north)--([xshift=0.1em,yshift=-0.1em]num2_3.south);
 \draw[->, thick](num2_4.north)--([xshift=0.1em,yshift=-0.1em]num3_3.south);
 \draw[->, thick](num3_4.north)--([xshift=0.1em,yshift=-0.1em]num4_3.south);
-\draw[->, thick, color = blue!60](num1_5.north)--([xshift=0.1em,yshift=-0.1em]num2_4.south);
-\draw[->, thick, color = blue!60](num2_5.north)--([xshift=0.1em,yshift=-0.1em]num3_4.south);
+\draw[->, thick, color = blue!80](num1_5.north)--([xshift=0.1em,yshift=-0.1em]num2_4.south);
+\draw[->, thick, color = blue!80](num2_5.north)--([xshift=0.1em,yshift=-0.1em]num3_4.south);
 \draw[->, thick](num3_5.north)--([xshift=0.1em,yshift=-0.1em]num4_4.south);
-\draw[->, thick, color = blue!60](num1_6.north)--([xshift=0.1em,yshift=-0.1em]num2_5.south);
-\draw[->, thick, color = blue!60](num2_6.north)--([xshift=0.1em,yshift=-0.1em]num3_5.south);
-\draw[->, thick, color = blue!60](num3_6.north)--([xshift=0.1em,yshift=-0.1em]num4_5.south);
-\draw[->, thick, color = blue!60](num1_7.north)--([xshift=0.1em,yshift=-0.1em]num2_6.south);
-\draw[->, thick, color = blue!60](num2_7.north)--([xshift=0.1em,yshift=-0.1em]num3_6.south);
+\draw[->, thick, color = blue!80](num1_6.north)--([xshift=0.1em,yshift=-0.1em]num2_5.south);
+\draw[->, thick, color = blue!80](num2_6.north)--([xshift=0.1em,yshift=-0.1em]num3_5.south);
+\draw[->, thick, color = blue!80](num3_6.north)--([xshift=0.1em,yshift=-0.1em]num4_5.south);
+\draw[->, thick, color = blue!80](num1_7.north)--([xshift=0.1em,yshift=-0.1em]num2_6.south);
+\draw[->, thick, color = blue!80](num2_7.north)--([xshift=0.1em,yshift=-0.1em]num3_6.south);
 \draw[->, thick](num3_7.north)--([xshift=0.1em,yshift=-0.1em]num4_6.south);
-\draw[->, thick, color = blue!60](num1_8.north)--([xshift=0.1em,yshift=-0.1em]num2_7.south);
+\draw[->, thick, color = blue!80](num1_8.north)--([xshift=0.1em,yshift=-0.1em]num2_7.south);
 \draw[->, thick](num2_8.north)--([xshift=0.1em,yshift=-0.1em]num3_7.south);
 \draw[->, thick](num3_8.north)--([xshift=0.1em,yshift=-0.1em]num4_7.south);
 \draw[->, thick](num1_9.north)--([xshift=0.1em,yshift=-0.1em]num2_8.south);
Chapter11/chapter11.tex
(diff collapsed)
Chapter12/chapter12.tex
@@ -132,7 +132,7 @@
 \multicolumn{1}{l|}{ConvS2S} & 25.16 & 40.46 & 1.5$\times 10^{20}$ \\
 \multicolumn{1}{l|}{MoE} & 26.03 & 40.56 & 1.2$\times 10^{20}$ \\
 \multicolumn{1}{l|}{Transformer (Base Model)} & 27.3 & 38.1 & 3.3$\times 10^{18}$ \\
-\multicolumn{1}{l|}{Transformer (Big Model)} & {\small\sffamily\bfseries{28.4}} & {\small\sffamily\bfseries{41.8}} & 2.3$\times 10^{19}$ \\
+\multicolumn{1}{l|}{Transformer (Big Model)} & {\small\bfnew{28.4}} & {\small\bfnew{41.8}} & 2.3$\times 10^{19}$ \\
 \end{tabular}
 \end{table}
 %----------------------------------------------
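@@ -158,19 +158,19 @@
 \begin{itemize}
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{Self-Attention Sub-layer}}\index{自注意力子层}\index{Self-Attention Sub-layer}: uses the self-attention mechanism to compute a new representation of the input sequence;
+\item {\small\bfnew{Self-Attention Sub-layer}}\index{自注意力子层}\index{Self-Attention Sub-layer}: uses the self-attention mechanism to compute a new representation of the input sequence;
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{Feed-Forward Sub-layer}}\index{前馈神经网络子层}\index{Feed-Forward Sub-layer}: applies a fully connected feed-forward network to further transform the sequence of input vectors;
+\item {\small\bfnew{Feed-Forward Sub-layer}}\index{前馈神经网络子层}\index{Feed-Forward Sub-layer}: applies a fully connected feed-forward network to further transform the sequence of input vectors;
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{Residual Connection}} (labelled "Add"): both the self-attention sub-layer and the feed-forward sub-layer have an extra connection that runs directly from input to output, that is, a shortcut across the sub-layer. Residual connections make information propagation in deep networks more effective;
+\item {\small\bfnew{Residual Connection}} (labelled "Add"): both the self-attention sub-layer and the feed-forward sub-layer have an extra connection that runs directly from input to output, that is, a shortcut across the sub-layer. Residual connections make information propagation in deep networks more effective;
 \vspace{0.5em}
-\item {\small\sffamily\bfseries{Layer Normalization}}: before the self-attention sub-layer and the feed-forward sub-layer produce their final output, the output vectors are layer-normalized to regularize the range of their values, which makes later processing easier.
+\item {\small\bfnew{Layer Normalization}}: before the self-attention sub-layer and the feed-forward sub-layer produce their final output, the output vectors are layer-normalized to regularize the range of their values, which makes later processing easier.
 \vspace{0.5em}
 \end{itemize}
 \parinterval The operations above make up one Transformer layer. The modules run in the following order: Self-Attention $\to$ Residual Connection $\to$ Layer Normalization $\to$ Feed Forward Network $\to$ Residual Connection $\to$ Layer Normalization. An encoder can stack several such layers; for example, a six-layer encoder applies these operations in every layer. The output of the top layer serves as the result of the whole encoding and is passed to the decoder.
-\parinterval The decoder is structured much like the encoder. It also consists of several layers, each containing everything found in an encoder layer: a self-attention sub-layer, a feed-forward sub-layer, residual connections, and layer normalization. In addition, to capture source-language information, the decoder introduces an extra {\small\sffamily\bfseries{Encoder-Decoder Attention Sub-layer}}\index{编码-解码注意力子层}\index{Encoder-Decoder Attention Sub-layer}. This new sub-layer helps the model use the representation of the source sentence to generate the representation at each target-language position. The encoder-decoder attention sub-layer is still based on the self-attention mechanism, so its structure is the same as that of the self-attention sub-layer; only the definitions of $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ differ. For example, in the decoder's self-attention sub-layer, $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ are identical: all equal the representation of each decoder position. In the encoder-decoder attention sub-layer, $\mathrm{query}$ is the representation of each decoder position, while $\mathrm{key}$ and $\mathrm{value}$ are identical and equal the representation of each encoder position. Figure \ref{fig:12-5} shows how the inputs of these two attention sub-layers differ.
+\parinterval The decoder is structured much like the encoder. It also consists of several layers, each containing everything found in an encoder layer: a self-attention sub-layer, a feed-forward sub-layer, residual connections, and layer normalization. In addition, to capture source-language information, the decoder introduces an extra {\small\bfnew{Encoder-Decoder Attention Sub-layer}}\index{编码-解码注意力子层}\index{Encoder-Decoder Attention Sub-layer}. This new sub-layer helps the model use the representation of the source sentence to generate the representation at each target-language position. The encoder-decoder attention sub-layer is still based on the self-attention mechanism, so its structure is the same as that of the self-attention sub-layer; only the definitions of $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ differ. For example, in the decoder's self-attention sub-layer, $\mathrm{query}$, $\mathrm{key}$, and $\mathrm{value}$ are identical: all equal the representation of each decoder position. In the encoder-decoder attention sub-layer, $\mathrm{query}$ is the representation of each decoder position, while $\mathrm{key}$ and $\mathrm{value}$ are identical and equal the representation of each encoder position. Figure \ref{fig:12-5} shows how the inputs of these two attention sub-layers differ.
 %----------------------------------------------
 \begin{figure}[htp]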
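The layer ordering quoted in this hunk (Self-Attention, Residual Connection, Layer Normalization, Feed Forward Network, Residual Connection, Layer Normalization) can be sketched in plain Python. This is only an illustration under stated assumptions, not code from the book or from NiuTrans: the function names are hypothetical, `uniform_attention` and `relu_ffn` are toy stand-ins for the real sub-layers, and the layer normalization here has no learned gain or bias.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(seq, sublayer):
    """Residual connection ("Add") followed by layer normalization, per position."""
    out = sublayer(seq)
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(seq, out)]

def encoder_layer(seq, self_attention, feed_forward):
    """Self-Attention -> Add -> LayerNorm -> Feed Forward -> Add -> LayerNorm."""
    seq = add_and_norm(seq, self_attention)
    return add_and_norm(seq, feed_forward)

def encoder(seq, layers):
    """Apply the layers in order; the top layer's output is the encoding."""
    for self_attention, feed_forward in layers:
        seq = encoder_layer(seq, self_attention, feed_forward)
    return seq

# Toy stand-ins (assumptions for illustration only).
def uniform_attention(seq):
    """Every position attends equally to all positions (plain averaging)."""
    n = len(seq)
    mean = [sum(col) / n for col in zip(*seq)]
    return [list(mean) for _ in seq]

def relu_ffn(seq):
    """Element-wise ReLU in place of a real two-layer feed-forward network."""
    return [[max(0.0, v) for v in x] for x in seq]

if __name__ == "__main__":
    source = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 2.0, 5.0]]
    six_layer_encoder = [(uniform_attention, relu_ffn)] * 6
    for vector in encoder(source, six_layer_encoder):
        print([round(v, 3) for v in vector])
```

Passing the sub-layers in as functions keeps the ordering logic separate from what each sub-layer computes, so a real attention or feed-forward implementation could be substituted without touching `encoder_layer`.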
@@ -319,7 +319,7 @@
 \subsection{Multi-head Attention}
-\parinterval Another important technique used in Transformer is {\small\sffamily\bfseries{Multi-head Attention}}\index{多头注意力机制}\index{Multi-head Attention}. "Multi-head" can be understood as splitting the original $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$ evenly into several parts along the hidden dimension. With $h$ parts, this yields $\mathbi{Q} = \{ \mathbi{Q}_1,...,\mathbi{Q}_h \}$, $\mathbi{K} = \{ \mathbi{K}_1,...,\mathbi{K}_h \}$, $\mathbi{V} = \{ \mathbi{V}_1,...,\mathbi{V}_h \}$. Multi-head attention runs the attention computation independently on each resulting slice of $\mathbi{Q}$, $\mathbi{K}$, $\mathbi{V}$; that is, the result of the $i$-th head is $\mathbi{head}_i = \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i,\mathbi{V}_i)$.
+\parinterval Another important technique used in Transformer is {\small\bfnew{Multi-head Attention}}\index{多头注意力机制}\index{Multi-head Attention}. "Multi-head" can be understood as splitting the original $\mathbi{Q}$, $\mathbi{K}$, and $\mathbi{V}$ evenly into several parts along the hidden dimension. With $h$ parts, this yields $\mathbi{Q} = \{ \mathbi{Q}_1,...,\mathbi{Q}_h \}$, $\mathbi{K} = \{ \mathbi{K}_1,...,\mathbi{K}_h \}$, $\mathbi{V} = \{ \mathbi{V}_1,...,\mathbi{V}_h \}$. Multi-head attention runs the attention computation independently on each resulting slice of $\mathbi{Q}$, $\mathbi{K}$, $\mathbi{V}$; that is, the result of the $i$-th head is $\mathbi{head}_i = \textrm{Attention}(\mathbi{Q}_i,\mathbi{K}_i,\mathbi{V}_i)$.
 \parinterval The computation of multi-head attention is described in detail below, following Figure \ref{fig:12-12}:
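As a side note to the paragraph changed in this hunk, the splitting of Q, K, and V into h slices along the hidden dimension can be sketched in plain Python. This is a minimal sketch under assumptions, not the book's procedure: the function names are hypothetical, `attention` is assumed to be the standard scaled dot-product attention (this excerpt does not define it), and the linear projections that normally surround the split and the final concatenation are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors (assumed form)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def split_heads(M, h):
    """Cut every row of M into h equal slices along the hidden dimension."""
    d = len(M[0])
    if d % h != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    step = d // h
    return [[row[i * step:(i + 1) * step] for row in M] for i in range(h)]

def multi_head_attention(Q, K, V, h):
    """head_i = Attention(Q_i, K_i, V_i); head outputs are concatenated per position."""
    heads = [attention(Qi, Ki, Vi)
             for Qi, Ki, Vi in zip(split_heads(Q, h),
                                   split_heads(K, h),
                                   split_heads(V, h))]
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]

if __name__ == "__main__":
    # Self-attention case: query, key, and value are the same representation.
    X = [[1.0, 0.0, 0.0, 1.0],
         [0.0, 1.0, 1.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
    for row in multi_head_attention(X, X, X, h=2):
        print([round(v, 3) for v in row])
```

With `h=1` the function reduces to a single call to `attention`, which is a convenient sanity check when changing the slicing code.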