Merge branch 'mengxia' into 'caorunzhe'

Mengxia. See merge request !726

Commit 3d57d2a0 authored Dec 28, 2020 by 孟霞
--- a/Chapter9/Figures/figure-activate.tex
+++ b/Chapter9/Figures/figure-activate.tex
@@ -6,7 +6,7 @@
 \foreach \x in {-1.0,-0.5,0.0,0.5,1.0}{\draw(\x,0)--(\x,0.05)node[below,outer sep=2pt,font=\scriptsize]at(\x,0){\x};}
 \foreach \y in {1.0,0.5}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
 \draw[color=red ,domain=-1.4:1, line width=1pt]plot(\x,{ln(1+(exp(\x))});
-\node[black,anchor=south] at (0,1.4) {\small $y = \ln(1+{\textrm e}^x)$};
+\node[black,anchor=south] at (0,1.6) {\small $y = \ln(1+{\textrm e}^x)$};
 \node [anchor=south east,inner sep=1pt] (labela) at (0.8,-2) {\small{(a) Softplus}};
 \end{scope}
@@ -21,7 +21,7 @@
      \pgfmathresult};}
 \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(-0.15,\y){\y};}
 \draw[color=red,domain=-1.4:1.4, line width=1pt]plot(\x,{1/(1+(exp(-5*\x)))});
-\node[black,anchor=south] at (0,1.4) {\small $y = \frac{1}{1+{\textrm e}^{-x}}$};
+\node[black,anchor=south] at (0,1.6) {\small $y = \frac{1}{1+{\textrm e}^{-x}}$};
 \node [anchor=south east,inner sep=1pt] (labelb) at (0.8,-2) {\small{(b) Sigmoid}};
 \end{scope}
 %%%------------------------------------------------------------------------------------------------------------
@@ -34,41 +34,41 @@
        \foreach \x in {-1.0,-0.5,0.0,0.5,1.0}{\draw(\x,0)--(\x,0.05)node[below,outer sep=2pt,font=\scriptsize]at(\x,0){\x};}
        \foreach \y in {,-1.0-0.5,0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
        \draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{tanh(\x)});
-        \node[black,anchor=south] at (0,1.4) {\small $y = \frac{{\textrm e}^{x}-{\textrm e}^{-x}}{{e}^{x}+e^{-x}}$};
+        \node[black,anchor=south] at (0,1.6) {\small $y = \frac{{\textrm e}^{x}-{\textrm e}^{-x}}{{e}^{x}+e^{-x}}$};
 \node [anchor=south east,inner sep=1pt] (labelc) at (0.8,-2) {\small{(c) Tanh}};
 \end{scope}
 %%%------------------------------------------------------------------------------------------------------------
-\begin{scope}[yshift=-1.7in]
+\begin{scope}[yshift=-1.8in]
  \draw[->, line width=1pt](-1.4,0)--(1.4,0)node[left,below,font=\scriptsize]{$x$};
        \draw[->, line width=1pt](0,-1.4)--(0,1.4)node[right,font=\scriptsize]{$y$};
        \foreach \x in {-1.0,-0.5,0.0,0.5,1.0}{\draw(\x,0)--(\x,0.05)node[below,outer sep=2pt,font=\scriptsize]at(\x,0){\x};}
        \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
        \draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{max(\x,0)});
-        \node[black,anchor=south] at (0,1.4) {\small $y =\max (0, x)$};
+        \node[black,anchor=south] at (0,1.6) {\small $y =\max (0, x)$};
 \node [anchor=south east,inner sep=1pt] (labeld) at (0.8,-2) {\small{(d) ReLU}};
 \end{scope}
 %%%------------------------------------------------------------------------------------------------------------
-\begin{scope}[yshift=-1.7in,xshift=1.6in]
+\begin{scope}[yshift=-1.8in,xshift=1.6in]
        \draw[->, line width=1pt](-1.4,0)--(1.4,0)node[left,below,font=\scriptsize]{$x$};
        \draw[->, line width=1pt](0,-1.4)--(0,1.4)node[right,font=\scriptsize]{$y$};
        \foreach \x in {-1.0,-0.5,0.0,0.5,1.0}{\draw(\x,0)--(\x,0.05)node[below,outer sep=2pt,font=\scriptsize]at(\x,0){\x};}
        \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(-0.15,\y){\y};}
        \draw[color=red ,domain=-1.4:1.4, line width=1pt]plot(\x,{exp(-1*((\x)^2))});
-        \node[black,anchor=south] at (0,1.4) {\small $y =e^{-x^2}$};
+        \node[black,anchor=south] at (0,1.6) {\small $y =e^{-x^2}$};
 \node [anchor=south east,inner sep=1pt] (labele) at (0.8,-2) {\small{(e) Gaussian}};
 \end{scope}
 %%%------------------------------------------------------------------------------------------------------------
-\begin{scope}[yshift=-1.7in,xshift=3.2in]
+\begin{scope}[yshift=-1.8in,xshift=3.2in]
        \draw[->, line width=1pt](-1.4,0)--(1.4,0)node[left,below,font=\scriptsize]{$x$};
        \draw[->, line width=1pt](0,-1.4)--(0,1.4)node[right,font=\scriptsize]{$y$};
        \foreach \x in {-1.0,-0.5,0.0,0.5,1.0}{\draw(\x,0)--(\x,0.05)node[below,outer sep=2pt,font=\scriptsize]at(\x,0){\x};}
        \foreach \y in {0.5,1.0}{\draw(0,\y)--(0.05,\y)node[left,outer sep=2pt,font=\scriptsize]at(0,\y){\y};}
        \draw[color=red ,domain=-1:1, line width=1pt]plot(\x,\x);
-        \node[black,anchor=south] at (0,1.4) {\small $y =x$};
+        \node[black,anchor=south] at (0,1.6) {\small $y =x$};
 \node [anchor=south east,inner sep=1pt] (labelf) at (0.8,-2) {\small{(f) Identity}};
 \end{scope}
 \end{tikzpicture}

--- a/Chapter9/Figures/figure-bias.tex
+++ b/Chapter9/Figures/figure-bias.tex
@@ -5,12 +5,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\small{(a)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0,0) -- (0,1) -- (1.5,1);}
 \end{scope}
 %---------------------------------------------------------------------------------------------
@@ -18,12 +20,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\small{(b)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-2$}}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-2$}}};}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.25,0) -- (0.25,1) -- (1.5,1);}
 \end{scope}
 %-----------------------------------------------------------------------------------------------
@@ -31,12 +35,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\small{(c)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,1) -- (1.5,1);}
 \end{scope}
 \end{tikzpicture}

--- a/Chapter9/Figures/figure-broadcast.tex
+++ b/Chapter9/Figures/figure-broadcast.tex
@@ -9,7 +9,7 @@
    \addtocounter{mycount1}{1};
  }
 \node [anchor=south] (varlabel) at (0,0.6) {$\mathbi{s}$};
-\node [anchor=north] (labelc) at (0,-0.7) {\small{(a)}};
+\node [anchor=north] (labelc) at (0,-0.7) {\small{(a)张量$\mathbi{s}$}};
 \end{scope}
 \begin{scope}[xshift=2.1in]
@@ -21,7 +21,7 @@
    \addtocounter{mycount1}{1};
  }
 \node [anchor=south] (varlabel) at (0,0.1) {$\mathbi{b}$};
-\node [anchor=north] (labelc) at (0,-0.7) {\small{(b)}};
+\node [anchor=north] (labelc) at (0,-0.7) {\small{(b)张量$\mathbi{b}$}};
 \end{scope}
@@ -51,7 +51,7 @@
  }
 \node [anchor=center] (plabel) at (-4.5em,0) {\huge{$\mathbf{+}$}};
 \node [anchor=south] (varlabel) at (0,0.6) {$\mathbi{b}$};
-\node [anchor=north] (labelc) at (0,-0.7) {\small{(c)}};
+\node [anchor=north] (labelc) at (0,-0.7) {\small{(c)张量的单元加运算}};
 \end{scope}
 \begin{scope}[yshift=-1in,xshift=3in]
 \setcounter{mycount1}{2}

--- a/Chapter9/Figures/figure-embedding-matrix.tex
+++ b/Chapter9/Figures/figure-embedding-matrix.tex
@@ -9,12 +9,12 @@
 \end{pgfonlayer}
 \draw [->,thick] ([yshift=-1em]box.south)--([yshift=-0.1em]box.south) node [pos=0,below] (bottom1) {\small{单词$w$的One-hot表示}};
-\draw [->,thick] ([yshift=0.1em]box.north)--([yshift=1em]box.north) node [pos=1,above] (top1) {\scriptsize{$\mathbi{e}$=(8,.2,-1,.9,...,1)}};
+\draw [->,thick] ([yshift=0.1em]box.north)--([yshift=1em]box.north) node [pos=1,above] (top1) {\scriptsize{$\mathbi{e}$=(8,0.2,-1,0.9,...,1)}};
 \node [anchor=north] (bottom2) at ([yshift=0.3em]bottom1.south) {\scriptsize{$\mathbi{o}$=(0,0,1,0,...,0)}};
 \node [anchor=south] (top2) at ([yshift=-0.3em]top1.north) {\small{单词$w$的分布式表示}};
 {
-\node [anchor=north west,fill=red!20!white] (cmatrix) at ([xshift=3em,yshift=1.0em]c.north east) {\scriptsize{$\begin{pmatrix} 1 & .2 & -.2 & 8 & ... & 0 \\ .6 & .8 & -2 & 1 & ... & -.2 \\ 8 & .2 & -1 & .9 & ... & 2.3 \\ 1 & 1.2 & -.9 & 3 & ... & .2 \\ ... & ... & ... & ... & ... & ... \\ 1 & .3 & 3 & .9 & ... & 5.1 \end{pmatrix}$}};
+\node [anchor=north west,fill=red!20!white] (cmatrix) at ([xshift=3em,yshift=1.0em]c.north east) {\scriptsize{$\begin{pmatrix} 1 & 0.2 & -0.2 & 8 & ... & 0 \\ 0.6 & 0.8 & -2 & 1 & ... & -0.2 \\ 8 & 0.2 & -1 & 0.9 & ... & 2.3 \\ 1 & 1.2 & -0.9 & 3 & ... & 0.2 \\ ... & ... & ... & ... & ... & ... \\ 1 & 0.3 & 3 & 0.9 & ... & 5.1 \end{pmatrix}$}};
 \node [anchor=west,inner sep=2pt,fill=red!30!white] (c) at (e.east) {\small{$\mathbi{C}$}};
 \draw [<-,thick] (c.east) -- ([xshift=3em]c.east);
 }

--- a/Chapter9/Figures/figure-embedding.tex
+++ b/Chapter9/Figures/figure-embedding.tex
@@ -2,8 +2,8 @@
 \begin{tikzpicture}
 {
 \begin{scope}[xshift=2in]
-\node [anchor=north west] (o1) at (0,0) {\footnotesize{$\begin{bmatrix} .1 \\ -1 \\ 2 \\ ... \\ 0 \end{bmatrix}$}};
+\node [anchor=north west] (o1) at (0,0) {\footnotesize{$\begin{bmatrix} 0.1 \\ -1 \\ 2 \\ ... \\ 0 \end{bmatrix}$}};
-\node [anchor=north west] (o2) at ([xshift=1em]o1.north east) {\footnotesize{$\begin{bmatrix} 1 \\ 2 \\ .2 \\ ... \\ -1 \end{bmatrix}$}};
+\node [anchor=north west] (o2) at ([xshift=1em]o1.north east) {\footnotesize{$\begin{bmatrix} 1 \\ 2 \\ 0.2 \\ ... \\ -1 \end{bmatrix}$}};
 \node [anchor=north east] (v) at ([xshift=-0em]o1.north west) {\footnotesize{$\begin{matrix} \textrm{\ \ \ 属性}_1 \\ \textrm{\ \ \ 属性}_2 \\ \textrm{\ \ \ 属性}_3 \\ ... \\ \textrm{属性}_{512} \end{matrix}$}};
 \node [anchor=south] (w1) at (o1.north) {\footnotesize{桌子}};
 \node [anchor=south] (w2) at (o2.north) {\footnotesize{椅子}};

--- a/Chapter9/Figures/figure-fit.tex
+++ b/Chapter9/Figures/figure-fit.tex
@@ -61,6 +61,8 @@
 {
 \draw [->,thick] (-1.6,0) -- (1.6,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.45,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
@@ -72,6 +74,10 @@
 {
 \draw [->,thick] (-1.6,0) -- (1.6,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.45,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
+\node [anchor=east,inner sep=1pt] (label1) at (0,0.85) {\tiny{1}};
+\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \draw [-,very thick,red,domain=-1.5:1.5,samples=100] plot (\x,{0.2 * (\x +0.4)^3 + 1.2 - 0.3 *(\x + 0.8)^2});
 }
 {
@@ -153,6 +159,10 @@
 \begin{scope}[xshift=2.1in,yshift=0.1in]
 \draw [->,thick] (-1.6,0) -- (1.6,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.45,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
+\node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
+\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \draw [-] (-0.05,1) -- (0.05,1);
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,0.7) -- (0.7,0.7) -- (0.7,0) -- (1.5,0);}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.7,0) -- (0.7,0.6) -- (0.9,0.6) -- (0.9,0) -- (1.5,0);}
@@ -163,6 +173,10 @@
 {
 \draw [->,thick] (-1.6,0) -- (1.6,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.45,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
+\node [anchor=east,inner sep=1pt] (label1) at (0,0.85) {\tiny{1}};
+\node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \draw [-,very thick,red,domain=-1.5:1.5,samples=100] plot (\x,{0.2 * (\x +0.4)^3 + 1.2 - 0.3 *(\x + 0.8)^2});
 }
 \foreach \n in {0.5}{

--- a/Chapter9/Figures/figure-piecewise.tex
+++ b/Chapter9/Figures/figure-piecewise.tex
@@ -9,6 +9,8 @@
 {
 \draw [->,thick] (-2.2,0) -- (2.2,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.95,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=north,inner sep=1pt] (labelb) at (0,-0.2) {\small{(b)}};
 }
@@ -32,6 +34,8 @@
 {
 \draw [->,thick] (-2.2,0) -- (2.2,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.95,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1.18) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};

--- a/Chapter9/Figures/figure-save.tex
+++ b/Chapter9/Figures/figure-save.tex
@@ -7,7 +7,7 @@
    \node [fill=green!15,inner sep=0pt,minimum height=0.49cm,minimum width=0.49cm](vector1) at (\x,0.25) {$\number\value{mycount1}$};
    \addtocounter{mycount1}{1};
 }
-\node [anchor=north] (labela) at ([xshift=-1.2em,yshift=-0em]vector1.south) {\small{(a) }};
+\node [anchor=north] (labela) at ([xshift=-1.2em,yshift=-0.3em]vector1.south) {\small{(a)1阶张量 }};
 \end{scope}
 \begin{scope}[xshift=1.2in]
@@ -21,7 +21,7 @@
    \node [fill=red!15,inner sep=0pt,minimum height=0.49cm,minimum width=0.49cm] at (\x,0.25) {$\number\value{mycount2}$};
    \addtocounter{mycount2}{1};
 }
-\node [anchor=north] (labelb) at ([xshift=0.3em,yshift=-0em]vector2.south) {\small{(b) }};
+\node [anchor=north] (labelb) at ([xshift=0.3em,yshift=-0.3em]vector2.south) {\small{(b)2阶张量 }};
 \end{scope}
 \begin{scope}[yshift=-0.6in]
@@ -45,9 +45,9 @@
 }
 \draw[decorate,thick,decoration={brace,mirror,raise=0.2em}] (0,-0.2) -- (2.95,-0.2);
 \draw[decorate,thick,decoration={brace,mirror,raise=0.2em}] (3.05,-0.2) -- (6,-0.2);
-\node [anchor=north] (subtensor1) at (1.5,-0.4) {\footnotesize{$3 \times 2$ sub-tensor}};
+\node [anchor=north] (subtensor1) at (1.5,-0.4) {\footnotesize{$3 \times 2$ 子张量}};
-\node [anchor=north] (subtensor1) at (4.5,-0.4) {\footnotesize{$3 \times 2$ sub-tensor}};
+\node [anchor=north] (subtensor1) at (4.5,-0.4) {\footnotesize{$3 \times 2$ 子张量}};
-\node [anchor=north] (labelc) at (3,-0.8) {\small{(c)}};
+\node [anchor=north] (labelc) at (3,-1.1) {\small{(c)1阶张量}};
 \end{scope}
 \end{tikzpicture}

--- a/Chapter9/Figures/figure-translation.tex
+++ b/Chapter9/Figures/figure-translation.tex
@@ -5,9 +5,11 @@
 \node[neuron,anchor=north] (a1) at (0,0) {};
 \draw[->,thick] ([xshift=-2em,yshift=0em]a1.south) to ([xshift=3em,yshift=0em]a1.south);
 \draw[->,thick] ([xshift=0em,yshift=-4em]a1.west) to ([xshift=0em,yshift=2em]a1.west);
-\node[below] at ([xshift=0.5em,yshift=-1em]a1.west){0};
+\node [anchor=south] (heng1) at ([xshift=2.5em,yshift=-0.8em]a1.south) {\scriptsize{$x$}};
-\node[below] at ([xshift=2em,yshift=-1em]a1.west){1};
+\node [anchor=west] (zong1) at ([xshift=-1em,yshift=1.8em]a1.west) {\scriptsize{$y$}};
-\node[below] at ([xshift=-0.5em,yshift=2em]a1.west){1};
+\node[below] at ([xshift=0.5em,yshift=-1em]a1.west){\footnotesize{0}};
+\node[below] at ([xshift=2em,yshift=-1em]a1.west){\footnotesize{1}};
+\node[below] at ([xshift=-0.5em,yshift=1.5em]a1.west){\footnotesize{1}};
 \node [anchor=west] (x) at ([xshift=-0.7em,yshift=1em]a1.south) {\Large{$\textbf{F}$}};
 {
@@ -15,9 +17,11 @@
 \node[neuron,anchor=north] (a2) at ([xshift=10em,yshift=0em]a1.south) {};
 \draw[->,thick] ([xshift=-2em,yshift=0em]a2.north) to ([xshift=3em,yshift=0em]a2.north);
 \draw[->,thick] ([xshift=0em,yshift=-2em]a2.west) to ([xshift=0em,yshift=4em]a2.west);
-\node[above] at ([xshift=0.5em,yshift=1em]a2.west){0};
+\node [anchor=south] (heng1) at ([xshift=2.5em,yshift=1.25em]a2.south) {\scriptsize{$x$}};
-\node[above] at ([xshift=2em,yshift=1em]a2.west){1};
+\node [anchor=west] (zong1) at ([xshift=-1em,yshift=3.85em]a2.west) {\scriptsize{$y$}};
-\node[below] at ([xshift=-0.5em,yshift=0em]a2.west){-1};
+\node[above] at ([xshift=0.5em,yshift=1em]a2.west){\footnotesize{0}};
+\node[above] at ([xshift=2em,yshift=1em]a2.west){\footnotesize{1}};
+\node[below] at ([xshift=-0.5em,yshift=0em]a2.west){\footnotesize{-1}};
 \node [anchor=west] (x) at ([xshift=-3.5cm,yshift=2em]a2.north) {\scriptsize{
    $\mathbi{W}=\begin{pmatrix}
    1&0&0\\
@@ -37,9 +41,11 @@
 \node[neuron,anchor=north] (a3) at ([xshift=11em,yshift=2.05em]a2.south) {};
 \draw[->,thick] ([xshift=-3em,yshift=0em]a3.north) to ([xshift=2em,yshift=0em]a3.north);
 \draw[->,thick] ([xshift=-1em,yshift=-2em]a3.west) to ([xshift=-1em,yshift=4em]a3.west);
-\node[above] at ([xshift=-0.5em,yshift=1em]a3.west){0};
+\node [anchor=south] (heng1) at ([xshift=1.5em,yshift=1.2em]a3.south) {\scriptsize{$x$}};
-\node[above] at ([xshift=1em,yshift=1em]a3.west){1};
+\node [anchor=west] (zong1) at ([xshift=-2em,yshift=3.8em]a3.west) {\scriptsize{$y$}};
-\node[left] at ([xshift=-0.75em,yshift=-0.5em]a3.west){-1};
+\node[above] at ([xshift=-0.5em,yshift=1em]a3.west){\footnotesize{0}};
+\node[above] at ([xshift=1em,yshift=1em]a3.west){\footnotesize{1}};
+\node[left] at ([xshift=-0.75em,yshift=-0.5em]a3.west){\footnotesize{-1}};
 \node [anchor=west,rotate = 180] (x) at ([xshift=0.7em,yshift=1em]a3.south) {\Large{$\textbf{F}$}};

--- a/Chapter9/Figures/figure-w1.tex
+++ b/Chapter9/Figures/figure-w1.tex
@@ -5,12 +5,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\small{(a)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {\scriptsize{\ $w_{11}=100$}\\[-0ex] {\scriptsize{\ $b_1=-4$}}};}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,1) -- (1.5,1);}
 \end{scope}
 %---------------------------------------------------------------------------------------------
@@ -18,12 +20,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\small{(b)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w'_{11}=0.9$}}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {{\scriptsize{\ $w'_{11}=0.9$}}};}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.8,0) -- (0.5,0) -- (0.5,0.9) -- (1.8,0.9);}
 \end{scope}
 %-----------------------------------------------------------------------------------------------
@@ -32,12 +36,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\small{(c)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w'_{11}=0.7$}}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {{\scriptsize{\ $w'_{11}=0.7$}}};}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0.5,0) -- (0.5,0.7) -- (1.5,0.7);}
 \end{scope}

--- a/Chapter9/Figures/figure-w2.tex
+++ b/Chapter9/Figures/figure-w2.tex
@@ -5,6 +5,8 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
@@ -18,6 +20,8 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
@@ -32,6 +36,8 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};

--- a/Chapter9/Figures/figure-weight.tex
+++ b/Chapter9/Figures/figure-weight.tex
@@ -5,12 +5,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labela) at (0.2,-0.5) {\small{(a)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {\scriptsize{\ $w_{11}=1$}\\[-0ex] \scriptsize{\ $b_1=0$}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {\scriptsize{\ $w_{11}=1$}\\[-0ex] \scriptsize{\ $b_1=0$}};}
 {\draw [-,very thick,ublue,domain=-1.5:1.5,samples=100] plot (\x,{1/(1+exp(-2*\x))});}
 \end{scope}
 %---------------------------------------------------------------------------------------------
@@ -19,11 +21,13 @@
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
 \draw [-] (-0.05,1) -- (0.05,1);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labelb) at (0.2,-0.5) {\small{(b)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=10$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {{\scriptsize{\ $w_{11}=10$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
 {\draw [-,very thick,ublue,domain=-1.5:1.5,samples=100] plot (\x,{1/(1+exp(-4*\x))});}
 \end{scope}
 %-----------------------------------------------------------------------------------------------
@@ -31,12 +35,14 @@
 {
 \draw [->,thick] (-1.8,0) -- (1.8,0);
 \draw [->,thick] (0,0) -- (0,2);
+\node [anchor=south] (heng1) at (1.6,-0.35) {\scriptsize{$x$}};
+\node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
 \draw [-] (-0.05,1) -- (0.05,1);
 \node [anchor=east,inner sep=1pt] (label1) at (0,1) {\tiny{1}};
 \node [anchor=south east,inner sep=1pt] (label2) at (0,0) {\tiny{0}};
 \node [anchor=south east,inner sep=1pt] (labelc) at (0.2,-0.5) {\small{(c)}};
 }
-{\node [anchor=north west,align=left] (wblabel) at (-1.8,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
+{\node [anchor=north west,align=left] (wblabel) at (-2,2) {{\scriptsize{\ $w_{11}=100$}}\\[-0ex] \scriptsize{\ $b_1=0$}};}
 {\draw [-,very thick,ublue,rounded corners=0.1em] (-1.5,0) -- (0,0) -- (0,1) -- (1.5,1);}
 \end{scope}
 \end{tikzpicture}
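
Review note: the figure hunks above add the same pair of axis-label nodes (`heng1`, `zong1`) by hand to every panel, varying only the x-coordinate of the $x$ label (1.6, 1.45, or 1.95). A minimal sketch of how that repetition could be factored out is given below. The macro name `\axislabels` is hypothetical and is not part of this merge request; the coordinates and node names are copied from the added lines.

```latex
% Hypothetical helper (an assumption, not part of this commit):
% places the x- and y-axis labels that the hunks above add panel by panel.
% #1 = x-coordinate of the "x" label, e.g. 1.6, 1.45 or 1.95.
\newcommand{\axislabels}[1]{%
  \node [anchor=south] (heng1) at (#1,-0.35) {\scriptsize{$x$}};
  \node [anchor=south] (zong1) at (-0.2,1.6) {\scriptsize{$y$}};
}

% Usage inside a panel's scope, equivalent to the two added \node lines:
% \begin{scope}
%   \draw [->,thick] (-1.8,0) -- (1.8,0);
%   \draw [->,thick] (0,0) -- (0,2);
%   \axislabels{1.6}
% \end{scope}
```

The macro would need to be defined in the preamble or before the first `tikzpicture` that uses it. It does not fit the panels in `figure-translation.tex`, which position their labels relative to node anchors rather than at absolute coordinates.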

--- a/Chapter9/chapter9.tex
+++ b/Chapter9/chapter9.tex
@@ -222,7 +222,7 @@
 \parinterval 本章默认使用行向量，如$ \mathbi{a}=(a_1, a_2, a_3) $，$ \mathbi{a} $对应的列向量记为$ \mathbi{a}^{\textrm T} $。
-\parinterval {\small\sffamily\bfseries{矩阵}}\index{矩阵}（Matrix）\index{Matrix}：矩阵是一个按照长方阵列排列的实数集合，最早来自于方程组的系数及常数所构成的方阵。在计算机领域，通常将矩阵看作二维数组。这里用符号$ \mathbi{A}$表示一个矩阵，如果该矩阵有$ m $行$ n $列，那么有$\mathbi{A}\in {\mathbb R}^{m\times n} $。矩阵中的每个元素都被一个行索引和一个列索引所确定，例如，$ a_{ij} $表示第$ i $行、第$ j $列的矩阵元素。如下，公式\eqref{eq:9-3}中$ \mathbi{A} $定义了一个2行2列的矩阵。
+\parinterval {\small\sffamily\bfseries{矩阵}}\index{矩阵}（Matrix）\index{Matrix}：矩阵是一个按照长方阵列排列的实数集合，最早来自于方程组的系数及常数所构成的方阵。在计算机领域，通常将矩阵看作二维数组。这里用符号$ \mathbi{A}$表示一个矩阵，如果该矩阵有$ m $行$ n $列，那么有$\mathbi{A}\in {\mathbb R}^{m\times n} $。矩阵中的每个元素都被一个行索引和一个列索引所确定，例如，$ a_{ij} $表示第$ i $行、第$ j $列的矩阵元素。如下，下式中的$ \mathbi{A} $定义了一个2行2列的矩阵。
 \begin{eqnarray}
 \mathbi{A}& = & \begin{pmatrix}
   a_{11} & a_{12}\\
@@ -241,14 +241,14 @@
 \subsubsection{2. 矩阵的转置}
-\parinterval {\small\sffamily\bfseries{转置}}\index{转置}（Transpose）\index{Transpose}是矩阵的重要操作之一。矩阵的转置可以看作是将矩阵以对角线为镜像进行翻转：假设$\mathbi{A}$为$ m $行$ n $列的矩阵，第$ i $行、第$ j $ 列的元素是$ a_{ij} $，即：$\mathbi{A}={(a_{ij})}_{m\times n} $，把$ m\times n $矩阵$\mathbi{A}$的行换成同序数的列得到一个$ n\times m $矩阵，则得到$ \mathbi{A}$的转置矩阵，记为${\mathbi{A}}^{\textrm T} $，且${\mathbi{A}}^{\textrm T}={(a_{ji})}_{n\times m} $。例如，对于公式\eqref{eq:9-100}中的矩阵，
+\parinterval {\small\sffamily\bfseries{转置}}\index{转置}（Transpose）\index{Transpose}是矩阵的重要操作之一。矩阵的转置可以看作是将矩阵以对角线为镜像进行翻转：假设$\mathbi{A}$为$ m $行$ n $列的矩阵，第$ i $行、第$ j $ 列的元素是$ a_{ij} $，即：$\mathbi{A}={(a_{ij})}_{m\times n} $，把$ m\times n $矩阵$\mathbi{A}$的行换成同序数的列得到一个$ n\times m $矩阵，则得到$ \mathbi{A}$的转置矩阵，记为${\mathbi{A}}^{\textrm T} $，且${\mathbi{A}}^{\textrm T}={(a_{ji})}_{n\times m} $。例如，对于下式中的矩阵，
 \begin{eqnarray}
 \mathbi{A} & = & \begin{pmatrix} 1 & 3 & 2 & 6\\5 & 4 & 8 & 2\end{pmatrix}
 \label{eq:9-100}
 \end{eqnarray}
-\noindent 它转置的结果如公式\eqref{eq:9-101}所示
+\noindent 它转置的结果如下：
 \begin{eqnarray}
 {\mathbi{A}}^{\textrm T} & = &\begin{pmatrix} 1 & 5\\3 & 4\\2 & 8\\6 & 2\end{pmatrix}
@@ -263,7 +263,7 @@
 \subsubsection{3. 矩阵加法和数乘}
-\parinterval 矩阵加法又被称作{\small\sffamily\bfseries{按元素加法}}\index{按元素加法}（Element-wise Addition）\index{Element-wise Addition}。它是指两个矩阵把其相对应元素加在一起的运算，通常的矩阵加法被定义在两个形状相同的矩阵上。两个$ m\times n $矩阵$ \mathbi{A}$和$ \mathbi{B} $的和，标记为$ \mathbi{A} + \mathbi{B}$，它也是个$ m\times n $矩阵，其内的各元素为其相对应元素相加后的值，即如果矩阵$ {\mathbi{C}}= {\mathbi{A}} + {\mathbi{B}} $，则$ c_{ij} = a_{ij} + b_{ij} $。公式\eqref{eq:9-4}展示了矩阵之间进行加法的计算过程。
+\parinterval 矩阵加法又被称作{\small\sffamily\bfseries{按元素加法}}\index{按元素加法}（Element-wise Addition）\index{Element-wise Addition}。它是指两个矩阵把其相对应元素加在一起的运算，通常的矩阵加法被定义在两个形状相同的矩阵上。两个$ m\times n $矩阵$ \mathbi{A}$和$ \mathbi{B} $的和，标记为$ \mathbi{A} + \mathbi{B}$，它也是个$ m\times n $矩阵，其内的各元素为其相对应元素相加后的值，即如果矩阵$ {\mathbi{C}}= {\mathbi{A}} + {\mathbi{B}} $，则$ c_{ij} = a_{ij} + b_{ij} $。下式展示了矩阵之间进行加法的计算过程。
 \begin{eqnarray}
 \begin{pmatrix}
   1 & 3\\
@@ -334,7 +334,7 @@
 \subsubsection{4. 矩阵乘法和矩阵点乘}
-\parinterval 矩阵乘法是矩阵运算中最重要的操作之一，为了与矩阵点乘区分，通常也把矩阵乘法叫做矩阵叉乘。假设$ {\mathbi{A}} $为$ m\times p $的矩阵，$ {\mathbi{B}} $为$ p\times n $的矩阵，对$ {\mathbi{A}}$和$ {\mathbi{B}} $作矩阵乘法的结果是一个$ m\times n $的矩阵$ {\mathbi{C}} $，其中矩阵$ {\mathbi{C}} $中第$ i $行、第$ j $列的元素可以如公式\eqref{eq:9-6}表示为：
+\parinterval 矩阵乘法是矩阵运算中最重要的操作之一，为了与矩阵点乘区分，通常也把矩阵乘法叫做矩阵叉乘。假设$ {\mathbi{A}} $为$ m\times p $的矩阵，$ {\mathbi{B}} $为$ p\times n $的矩阵，对$ {\mathbi{A}}$和$ {\mathbi{B}} $作矩阵乘法的结果是一个$ m\times n $的矩阵$ {\mathbi{C}} $，其中矩阵$ {\mathbi{C}} $中第$ i $行、第$ j $列的元素可以表示为：
 \begin{eqnarray}
 {({\mathbi{A}}{\mathbi{B}})}_{ij} &=& \sum_{k=1}^p a_{ik}b_{kj}
 \label{eq:9-6}
@@ -384,7 +384,7 @@
 \label{eq:9-104}
 \end{eqnarray}
-\parinterval 矩阵点乘的计算如公式\eqref{eq:9-8}所示：
+\parinterval 矩阵点乘的计算方式如下：
 \begin{eqnarray}
 {\mathbi{C}} & = & {\mathbi{A}}\odot {\mathbi{B}} \nonumber \\
          & = & \begin{pmatrix}
@@ -444,20 +444,20 @@ f(c{\mathbi{v}})&=&cf({\mathbi{v}})
 \subsubsection{6. 范数}
-\parinterval 工程领域，经常会使用被称为{\small\bfnew{范数}}\index{范数}（Norm）\index{Norm}的函数衡量向量大小，范数为向量空间内的所有向量赋予非零的正长度或大小。对于一个$n$维向量$ {\mathbi{x}} $，一个常见的范数函数为$ l_p $ 范数，通常表示为$ {\Vert{\mathbi{x}}\Vert}_p $ ，其中$p\ge 0$，是一个标量形式的参数。常用的$ p $的取值有$ 1 $、$ 2 $、$ \infty $等。范数的计算方式如公式\eqref{eq:9-14}所示：
+\parinterval 工程领域，经常会使用被称为{\small\bfnew{范数}}\index{范数}（Norm）\index{Norm}的函数衡量向量大小，范数为向量空间内的所有向量赋予非零的正长度或大小。对于一个$n$维向量$ {\mathbi{x}} $，一个常见的范数函数为$ l_p $ 范数，通常表示为$ {\Vert{\mathbi{x}}\Vert}_p $ ，其中$p\ge 0$，是一个标量形式的参数。常用的$ p $的取值有$ 1 $、$ 2 $、$ \infty $等。范数的计算方式如下：
 \begin{eqnarray}
 l_p({\mathbi{x}}) & = & {\Vert{\mathbi{x}}\Vert}_p \nonumber \\
               & = & {\left (\sum_{i=1}^{n}{{\vert x_{i}\vert}^p}\right )}^{\frac{1}{p}}
 \label{eq:9-14}
 \end{eqnarray}
-\parinterval $ l_1 $范数为向量的各个元素的绝对值之和，如公式\eqref{eq:9-15}所示：
+\parinterval $ l_1 $范数为向量的各个元素的绝对值之和：
 \begin{eqnarray}
 {\Vert{\mathbi{x}}\Vert}_1&=&\sum_{i=1}^{n}{\vert x_{i}\vert}
 \label{eq:9-15}
 \end{eqnarray}
-\parinterval $ l_2 $范数为向量的各个元素平方和的二分之一次方，如公式\eqref{eq:9-16}所示：
+\parinterval $ l_2 $范数为向量的各个元素平方和的二分之一次方：
 \begin{eqnarray}
 {\Vert{\mathbi{x}}\Vert}_2&=&\sqrt{\sum_{i=1}^{n}{{x_{i}}^2}} \nonumber \\
                                      &=&\sqrt{{\mathbi{x}}^{\textrm T}{\mathbi{x}}}
@@ -466,7 +466,7 @@ l_p({\mathbi{x}}) & = & {\Vert{\mathbi{x}}\Vert}_p \nonumber \\
 \parinterval $ l_2 $范数被称为{\small\bfnew{欧几里得范数}}\index{欧几里得范数}（Euclidean Norm）\index{Euclidean Norm}。从几何角度，向量也可以表示为从原点出发的一个带箭头的有向线段，其$ l_2 $范数为线段的长度，也常被称为向量的模。$ l_2 $ 范数在机器学习中非常常用。向量$ {\mathbi{x}} $的$ l_2 $范数经常简化表示为$ \Vert{\mathbi{x}}\Vert $，可以通过点积$ {\mathbi{x}}^{\textrm T}{\mathbi{x}} $进行计算。
-\parinterval $ l_{\infty} $范数为向量的各个元素的最大绝对值，如公式\eqref{eq:9-17}所示：
+\parinterval $ l_{\infty} $范数为向量的各个元素的最大绝对值：
 \begin{eqnarray}
 {\Vert{\mathbi{x}}\Vert}_{\infty}&=&{\textrm{max}}\{x_1,x_2,\dots,x_n\}
 \label{eq:9-17}
@@ -484,7 +484,7 @@ l_p({\mathbi{x}}) & = & {\Vert{\mathbi{x}}\Vert}_p \nonumber \\
 \vspace{0.5em}
 \end{itemize}
-\parinterval 在深度学习中，有时候希望衡量矩阵的大小，这时可以考虑使用 {\small\bfnew{Frobenius 范数}}\index{Frobenius 范数}（Frobenius Norm）\index{Frobenius Norm}。公式\eqref{eq:9-18}展示了其计算方式：
+\parinterval 在深度学习中，有时候希望衡量矩阵的大小，这时可以考虑使用 {\small\bfnew{Frobenius 范数}}\index{Frobenius 范数}（Frobenius Norm）\index{Frobenius Norm}，其计算方式如下：
 \begin{eqnarray}
 {\Vert{\mathbi{A}}\Vert}_F&=&\sqrt{\sum_{i,j} a_{i,j}^2}
 \label{eq:9-18}
@@ -514,7 +514,7 @@ l_p({\mathbi{x}}) & = & {\Vert{\mathbi{x}}\Vert}_p \nonumber \\
 \subsubsection{1. 感知机\ \dash \ 最简单的人工神经元模型}
 \vspace{0.5em}
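-\parinterval The perceptron is one instance of an artificial neuron. Proposed in the 1950s and 1960s, it has had a profound influence on neural network research. The perceptron model is shown in Figure \ref{fig:9-5}. Its input is an $n$-dimensional binary vector $ {\mathbi{x}}=(x_1,x_2,\dots,x_n) $, where $ x_i=0 $ or $ 1 $. The weights are ${\mathbi{w}}=(w_1,w_2,\dots,w_n) $, with one weight $ w_i $ per input variable. The bias $ b $ is a real-valued variable ($ -\sigma $). The output is also binary, that is, $ y=0 $ or $ 1 $. The value of $ y $ is determined by whether the weighted sum of the inputs is greater than (or less than) a threshold $ \sigma $ (Equation \eqref{eq:9-19}):
+\parinterval The perceptron is one instance of an artificial neuron. Proposed in the 1950s and 1960s, it has had a profound influence on neural network research. The perceptron model is shown in Figure \ref{fig:9-5}. Its input is an $n$-dimensional binary vector $ {\mathbi{x}}=(x_1,x_2,\dots,x_n) $, where $ x_i=0 $ or $ 1 $. The weights are ${\mathbi{w}}=(w_1,w_2,\dots,w_n) $, with one weight $ w_i $ per input variable. The bias $ b $ is a real-valued variable ($ -\sigma $). The output is also binary, that is, $ y=0 $ or $ 1 $. The value of $ y $ is determined by whether the weighted sum of the inputs is greater than (or less than) a threshold $ \sigma $: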
 \begin{eqnarray}
 y&=&\begin{cases} 0 & \sum_{i}{x_i\cdot w_i}-\sigma <0\\1 & \sum_{i}{x_i\cdot w_i}-\sigma \geqslant 0\end{cases}
 \label{eq:9-19}
@@ -541,7 +541,7 @@ y&=&\begin{cases} 0 & \sum_{i}{x_i\cdot w_i}-\sigma <0\\1 & \sum_{i}{x_i\cdot w_
 \vspace{0.5em}
 \end{itemize}
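-\parinterval How should a decision be made in this case? Suppose your girlfriend very much wants to go to the concert with you, but the venue is far away and a ticket costs 500 yuan. If these factors are all equally important to you (that is, $ w_1=w_2=w_3 $; assume all three weights are set to 1), you obtain a combined score, as shown in Equation \eqref{eq:9-20}:
+\parinterval How should a decision be made in this case? Suppose your girlfriend very much wants to go to the concert with you, but the venue is far away and a ticket costs 500 yuan. If these factors are all equally important to you (that is, $ w_1=w_2=w_3 $; assume all three weights are set to 1), you obtain a combined score: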
 \begin{eqnarray}
 x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumber \\
                                                                     & = & 1
@@ -566,7 +566,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \vspace{-1em}
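 \subsubsection{2. Weights Inside a Neuron}
-\parinterval In the example above, the connection weights represent how important each input factor is to the final output. To reach a satisfactory decision, the weights need to be adjusted repeatedly. If you are a miser, you will care more about the ticket price, so you will weigh the factors unevenly, for example $ w_1=0.5 $, $ w_2=2 $, $ w_3=0.5 $. The resulting perceptron model is shown in Figure \ref{fig:9-7}. In this case, your girlfriend very much wants to go to the concert with you, but the venue is far away and a ticket costs 500 yuan, so you end up not going. Equation \eqref{eq:9-21} shows this decision process:
+\parinterval In the example above, the connection weights represent how important each input factor is to the final output. To reach a satisfactory decision, the weights need to be adjusted repeatedly. If you are a miser, you will care more about the ticket price, so you will weigh the factors unevenly, for example $ w_1=0.5 $, $ w_2=2 $, $ w_3=0.5 $. The resulting perceptron model is shown in Figure \ref{fig:9-7}. In this case, your girlfriend very much wants to go to the concert with you, but the venue is far away and a ticket costs 500 yuan, so you end up not going. The decision process is as follows: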
 \begin{eqnarray}
 \sum_{i}{x_i\cdot w_i} & = & 0\cdot 0.5+0\cdot 2+1\cdot 0.5 \nonumber \\
                                   & = & 0.5 \nonumber \\
@@ -610,7 +610,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \end{figure}
 %-------------------------------------------
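-\parinterval Now make a decision with the modified model: your girlfriend very much wants to go with you, but the venue is 20km away and a ticket costs 500 yuan. This gives $ x_1=10/20 $, $ x_2=150/500 $, $ x_3=1 $. The decision process is then as shown in Equation \eqref{eq:9-22}:
+\parinterval Now make a decision with the modified model: your girlfriend very much wants to go with you, but the venue is 20km away and a ticket costs 500 yuan. This gives $ x_1=10/20 $, $ x_2=150/500 $, $ x_3=1 $. The decision process is then as follows: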
 \begin{eqnarray}
 \sum_{i}{x_i\cdot w_i} & = & 0.5\cdot 0.5+0.3\cdot 2+1\cdot 0.5 \nonumber \\
                                   & = & 1.35 \nonumber \\
@@ -672,7 +672,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
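 \parinterval To build a multi-layer neural network, the simple neuron described earlier first needs to be extended by grouping several neurons into a ``layer''. For example, many practical problems require several outputs at once. In that case, several identical neurons can be placed side by side, each with its own output. Together they form a ``layer'', which gives a single-layer neural network. Each neuron in a single-layer network has its own set of weights and its own output, and the different outputs can be viewed as descriptions of the same thing from different angles.
-\parinterval As a simple example, a weather forecast often needs to predict temperature, humidity, and wind force, which means that a single-layer neural network used for the forecast needs 3 neurons. As shown in Figure \ref{fig:9-10}, the weight matrix is then given by Equation \eqref{eq:9-105}:
+\parinterval As a simple example, a weather forecast often needs to predict temperature, humidity, and wind force, which means that a single-layer neural network used for the forecast needs 3 neurons. As shown in Figure \ref{fig:9-10}, the weight matrix is then: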
 \begin{eqnarray}
 {\mathbi{W}}&=&\begin{pmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\end{pmatrix}
@@ -699,7 +699,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \end{figure}
 %-------------------------------------------
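-\parinterval In a neural network, for an input vector $ {\mathbi{x}}\in {\mathbb R}^m $, one layer first maps it to $ {\mathbb R}^n $ by a linear transformation and then applies an activation function to obtain ${\mathbi{y}}\in {\mathbb R}^n $. Returning to the weather example, every neuron receives the same input, and the weight matrix $ {\mathbi{W}} $ is a $ 2\times 3 $ matrix whose element $ w_{ij} $ is the weight of $ x_{i} $ in the $ j $-th neuron. Assuming neuron 1 is responsible for predicting temperature, $ w_{i1} $ means: when predicting temperature, how strongly the input $ x_{i} $ influences it. In addition, the biases $ b_{1} $, $ b_{2} $, $ b_{3} $ of all neurons make up the bias vector $ {\mathbi{b}}$. In this example, the weight matrix is $ {\mathbi{W}}=\begin{pmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\end{pmatrix} $ and the bias vector is $ {\mathbi{b}}=(b_1,b_2,b_3) $.
+\parinterval In a neural network, for an input vector $ {\mathbi{x}}\in {\mathbb R}^m $, one layer first maps it to $ {\mathbb R}^n $ by a linear transformation and then applies an activation function to obtain ${\mathbi{y}}\in {\mathbb R}^n $. Returning to the weather example, every neuron receives the same input, and the weight matrix $ {\mathbi{W}} $ is a $ 2\times 3 $ matrix whose element $ w_{ij} $ is the weight of $ x_{i} $ in the $ j $-th neuron. Assuming neuron 1 is responsible for predicting temperature, $ w_{i1} $ means how strongly the input $ x_{i} $ influences the temperature prediction. In addition, the biases $ b_{1} $, $ b_{2} $, $ b_{3} $ of all neurons make up the bias vector $ {\mathbi{b}}$. In this example, the weight matrix is $ {\mathbi{W}}=\begin{pmatrix} w_{11} & w_{12} & w_{13}\\ w_{21} & w_{22} & w_{23}\end{pmatrix} $ and the bias vector is $ {\mathbi{b}}=(b_1,b_2,b_3) $.
 \parinterval So what is the essence of a linear transformation?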
@@ -707,7 +707,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \vspace{0.5em}
 \item From the algebraic point of view, for a linear space $ \textrm V $, any $ {\mathbi{a}}$, ${\mathbi{b}}\in {\textrm V} $, and any $ \alpha $ in the number field, a linear transformation $ T(\cdot) $ must satisfy $ T({\mathbi{a}}+{\mathbi{b}})=T({\mathbi{a}})+T({\mathbi{b}}) $ and $ T(\alpha {\mathbi{a}})=\alpha T({\mathbi{a}}) $;
 \vspace{0.5em}
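-\item From the geometric point of view, in the expression ${\mathbi{x}}\cdot {\mathbi{W}}+{\mathbi{b}}$, right-multiplying ${\mathbi{x}}$ by ${\mathbi{W}}$ amounts to applying a rotation to $ {\mathbi{x}} $. For example, right-multiply the three points $ (0,0) $, $ (0,1) $, $ (1,0) $ and the rectangular region they enclose by the matrix shown in Equation \eqref{eq:9-106}:
+\item From the geometric point of view, in the expression ${\mathbi{x}}\cdot {\mathbi{W}}+{\mathbi{b}}$, right-multiplying ${\mathbi{x}}$ by ${\mathbi{W}}$ amounts to applying a rotation to $ {\mathbi{x}} $. For example, right-multiply the three points $ (0,0) $, $ (0,1) $, $ (1,0) $ and the rectangular region they enclose by the following matrix: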
    \begin{eqnarray}
    {\mathbi{W}}&=&\begin{pmatrix} 1 & 0 & 0\\ 0 & -1 & 0\\ 0 & 0 & 1\end{pmatrix}
@@ -930,20 +930,20 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
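 \parinterval For a single-layer neural network, the term ${\mathbi{x}}\cdot {\mathbi{W}} $ in $ {\mathbi{y}}=f({\mathbi{x}}\cdot{\mathbi{W}}+{\mathbi{b}}) $ is a linear transformation of the input ${\mathbi{x}} $, where ${\mathbi{x}}$ is the input tensor and $ {\mathbi{W}}$ is the weight matrix. Note that $ {\mathbi{x}}\cdot {\mathbi{W}} $ denotes matrix multiplication, not tensor multiplication.
-\parinterval How is a tensor multiplied by a matrix? First recall the linear algebra from Section \ref{sec:9.2.1}. Let $ {\mathbi{A}} $ be an $ m\times p $ matrix and $ {\mathbi{B}} $ a $ p\times n $ matrix. The matrix product of ${\mathbi{A}} $ and ${\mathbi{B}}$ is an $ m\times n $ matrix ${\mathbi{C}}$, whose element in row $ i $ and column $ j $ can be written as Equation \eqref{eq:9-24}:
+\parinterval How is a tensor multiplied by a matrix? First recall the linear algebra from Section \ref{sec:9.2.1}. Let $ {\mathbi{A}} $ be an $ m\times p $ matrix and $ {\mathbi{B}} $ a $ p\times n $ matrix. The matrix product of ${\mathbi{A}} $ and ${\mathbi{B}}$ is an $ m\times n $ matrix ${\mathbi{C}}$, whose element in row $ i $ and column $ j $ can be written as: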
 \begin{eqnarray}
 {({\mathbi{A}}{\mathbi{B}})}_{ij}&=&\sum_{k=1}^{p}{a_{ik}b_{kj}}
 \label{eq:9-24}
 \end{eqnarray}
-\noindent For example, for $ {\mathbi{A}}= \begin{pmatrix} a_{11} & a_{12} & a_{13}\\a_{21} & a_{22} & a_{23}\end{pmatrix} $ and $ {\mathbi{B}}= \begin{pmatrix} b_{11} & b_{12}\\b_{21} & b_{22}\\b_{31} & b_{32}\end{pmatrix} $, Equation \eqref{eq:9-108} shows how the two matrices are multiplied:
+\noindent For example, for $ {\mathbi{A}}= \begin{pmatrix} a_{11} & a_{12} & a_{13}\\a_{21} & a_{22} & a_{23}\end{pmatrix} $ and $ {\mathbi{B}}= \begin{pmatrix} b_{11} & b_{12}\\b_{21} & b_{22}\\b_{31} & b_{32}\end{pmatrix} $, the two matrices are multiplied as follows:
 \begin{eqnarray}
 {\mathbi{C}} & = & {\mathbi{A}}{\mathbi{B}} \nonumber \\
                & = & \begin{pmatrix} a_{11}b_{11}+a_{12}b_{21}+a_{13}b_{31} & a_{11}b_{12}+a_{12}b_{22}+a_{13}b_{32}\\a_{21}b_{11}+a_{22}b_{21}+a_{23}b_{31} & a_{21}b_{12}+a_{22}b_{22}+a_{23}b_{32}\end{pmatrix}
 \label{eq:9-108}
 \end{eqnarray}
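The element-wise definition of the matrix product above can be sketched directly in plain Python. The function name `matmul` and the numeric matrices are illustrative assumptions; the loop structure mirrors the sum over $ k $ in the definition of $ ({\mathbi{A}}{\mathbi{B}})_{ij} $.

```python
def matmul(a, b):
    """Multiply an m x p matrix by a p x n matrix (lists of rows)."""
    m, p, n = len(a), len(b), len(b[0])
    if any(len(row) != p for row in a):
        raise ValueError("columns of a must equal rows of b")
    # c[i][j] = sum over k of a[i][k] * b[k][j]
    return [[sum(a[i][k] * b[k][j] for k in range(p)) for j in range(n)]
            for i in range(m)]


# Hypothetical 2 x 3 and 3 x 2 matrices, matching the shapes in the example.
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
C = matmul(A, B)
print(C)  # [[58, 64], [139, 154]]
```

The result is a $ 2\times 2 $ matrix: for instance, the top-left entry is $1\cdot 7+2\cdot 9+3\cdot 11=58$.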
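-\parinterval Matrix multiplication extends to higher-order tensors: for a tensor ${\mathbi{x}}$ to be multiplied by a matrix $ {\mathbi{W}}$, the size of the last dimension of $ {\mathbi{x}} $ must equal the number of rows of ${\mathbi{W}}$. That is, if the tensor ${\mathbi{x}} $ has shape $ \cdot \times n $, then ${\mathbi{W}} $ must be an $ n\times \cdot $ matrix. Equation \eqref{eq:9-25} is an example:
+\parinterval Matrix multiplication extends to higher-order tensors: for a tensor ${\mathbi{x}}$ to be multiplied by a matrix $ {\mathbi{W}}$, the size of the last dimension of $ {\mathbi{x}} $ must equal the number of rows of ${\mathbi{W}}$. That is, if the tensor ${\mathbi{x}} $ has shape $ \cdot \times n $, then ${\mathbi{W}} $ must be an $ n\times \cdot $ matrix. The following is an example: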
 \begin{eqnarray}
 {\mathbi{x}}(1:4,1:4,{\red{1:4}})\;\;\times\;\; {{\mathbi{W}}({\red{1:4}},1:2)}&=&{\mathbi{s}}(1:4,1:4,1:2)
 \label{eq:9-25}
@@ -971,7 +971,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \begin{itemize}
 \vspace{0.5em}
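-\item The element-wise addition in $ {\mathbi{s}}+{\mathbi{b}} $ adds the values at every position of the tensors. In the example above, $ {\mathbi{s}} $ is a 3rd-order tensor of shape $ (1:4,1:4,1:2) $, while $ {\mathbi{b}}$ is a vector with 4 elements. How is element-wise addition carried out when the shapes differ? This calls for {\small\sffamily\bfseries{broadcasting}}\index{Broadcasting}: two arrays are broadcast-compatible if the lengths of their trailing axes (the dimensions counted from the end) match or one of them has length 1. Broadcasting is performed along the missing or length-1 dimensions, and it is a common way of computing in deep learning frameworks. As a concrete example, in Figure \ref{fig:9-28} $ {\mathbi{s}} $ is a $ 2\times 4 $ matrix and $ {\mathbi{b}} $ is a vector of length 4. When the two are added element-wise, broadcasting copies $ {\mathbi{b}} $ along the first dimension and then adds it to $ {\mathbi{s}} $.
+\item The element-wise addition in $ {\mathbi{s}}+{\mathbi{b}} $ adds the values at every position of the tensors. In the example above, $ {\mathbi{s}} $ is a 3rd-order tensor of shape $ (1:4,1:4,1:2) $, while $ {\mathbi{b}}$ is a vector with 4 elements. How is element-wise addition carried out when the shapes differ? This calls for {\small\sffamily\bfseries{broadcasting}}\index{Broadcasting} (Broadcast Mechanism\index{Broadcast Mechanism}): two arrays are broadcast-compatible if the lengths of their trailing axes (the dimensions counted from the end) match or one of them has length 1. Broadcasting is performed along the missing or length-1 dimensions, and it is a common way of computing in deep learning frameworks. As a concrete example, in Figure \ref{fig:9-28} $ {\mathbi{s}} $ is a $ 2\times 4 $ matrix and $ {\mathbi{b}} $ is a vector of length 4. When the two are added element-wise, broadcasting copies $ {\mathbi{b}} $ along the first dimension and then adds it to $ {\mathbi{s}} $.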
 %----------------------------------------------
 \begin{figure}[htp]
@@ -982,7 +982,7 @@ x_1\cdot w_1+x_2\cdot w_2+x_3\cdot w_3 & = & 0\cdot 1+0\cdot 1+1\cdot 1 \nonumbe
 \end {figure}
 %-------------------------------------------
 \vspace{0.5em}
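-\item Besides element-wise addition, subtraction and multiplication can also be applied between tensors. An activation can also be applied to a tensor; here this is called {\small\bfnew{vectorization}}\index{Vectorization} of the function. For example, to apply the ReLU activation to a vector (a 1st-order tensor), Equation \eqref{eq:9-26} gives the ReLU activation function:
+\item Besides element-wise addition, subtraction and multiplication can also be applied between tensors. An activation can also be applied to a tensor; here this is called {\small\bfnew{vectorization}}\index{Vectorization} of the function. For example, to apply the ReLU activation to a vector (a 1st-order tensor), the ReLU activation function is: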
 \begin{eqnarray}
 f(x)&=&\begin{cases} 0 & x\le 0 \\x & x>0\end{cases}
 \label{eq:9-26}
@@ -1014,7 +1014,7 @@ f(x)&=&\begin{cases} 0 & x\le 0 \\x & x>0\end{cases}
 \begin{figure}[htp]
 \centering
 \input{./Chapter9/Figures/figure-save}
-\caption{Physical storage of 1st-order (a), 2nd-order (b), and 3rd-order (c) tensors}
+\caption{Physical storage of tensors of different orders}
 \label{fig:9-29}
 \end{figure}
 %-------------------------------------------
@@ -1173,7 +1173,7 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma
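 \subsection{Gradient-Based Parameter Optimization}\label{sec9:para-training}
 \parinterval For the $ i $-th sample $ ({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i) $, the loss function $ L(\widetilde{\mathbi{y}}_i,{\mathbi{y}}_i) $ is viewed as a function of the parameters $ \bm \theta $\footnote{To simplify the description, $
-\bm{\theta} $ can be used to denote all the parameters of the neural network, including the weight matrices ${\mathbi{W}}^{[1]}\dots{\mathbi{W}}^{[n]}$ and the bias vectors ${\mathbi{b}}^{[1]}\dots{\mathbi{b}}^{[n]}$ of every layer.}. Because the output $ {\mathbi{y}}_i $ is determined by the input $ {\mathbi{x}}_i $ and the model parameters $ \bm \theta $, the loss function is also written as $ L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta}) $. The parameter learning process can be described by Equation \eqref{eq:9-28}:
+\bm{\theta} $ can be used to denote all the parameters of the neural network, including the weight matrices ${\mathbi{W}}^{[1]}\dots{\mathbi{W}}^{[n]}$ and the bias vectors ${\mathbi{b}}^{[1]}\dots{\mathbi{b}}^{[n]}$ of every layer.}. Because the output $ {\mathbi{y}}_i $ is determined by the input $ {\mathbi{x}}_i $ and the model parameters $ \bm \theta $, the loss function is also written as $ L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta}) $. The following equation describes the parameter learning process: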
 \begin{eqnarray}
 \widehat{\bm\theta}&=&\mathop{\arg\min}_{\bm \theta}\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})}
 \label{eq:9-28}
@@ -1200,7 +1200,7 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma
 \end{figure}
 %-------------------------------------------
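-\parinterval When gradient descent is applied, the parameters ${\bm \theta}$ must first be initialized. In deep learning, parameters should generally be initialized to random numbers that are not too large. Once ${\bm \theta}$ has been initialized, the model is updated repeatedly. The {\small\sffamily\bfseries{update rule}}\index{Update Rule} for the parameters is shown in Equation \eqref{eq:9-29}:
+\parinterval When gradient descent is applied, the parameters ${\bm \theta}$ must first be initialized. In deep learning, parameters should generally be initialized to random numbers that are not too large. Once ${\bm \theta}$ has been initialized, the model is updated repeatedly. The {\small\sffamily\bfseries{update rule}}\index{Update Rule} for the parameters is as follows: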
 \begin{eqnarray}
 {\bm \theta}_{t+1}&=&{\bm \theta}_{t}-\alpha \cdot \frac{\partial J({\bm \theta})}{\partial {\bm \theta}}
 \label{eq:9-29}
@@ -1218,7 +1218,7 @@ y&=&{\textrm{Sigmoid}}({\textrm{Tanh}}({\mathbi{x}}\cdot {\mathbi{W}}^{[1]}+{\ma
 \noindent {\small\sffamily\bfseries{1) Batch Gradient Descent\index{Batch Gradient Descent}}}
 \vspace{0.5em}
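-\parinterval Batch gradient descent is the most basic form of gradient descent: every iteration uses all samples to update the parameters. The objective function for parameter optimization is shown in Equation \eqref{eq:9-30}:
+\parinterval Batch gradient descent is the most basic form of gradient descent: every iteration uses all samples to update the parameters. The objective function for parameter optimization is as follows: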
 \begin{eqnarray}
 J({\bm \theta})&=&\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})}
 \label{eq:9-30}
@@ -1236,7 +1236,7 @@ J({\bm \theta})&=&\frac{1}{n}\sum_{i=1}^{n}{L({\mathbi{x}}_i,\widetilde{\mathbi{
 \noindent {\small\sffamily\bfseries{2) Stochastic Gradient Descent\index{Stochastic Gradient Descent}}}
 \vspace{0.5em}
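-\parinterval Stochastic gradient descent (SGD) differs from batch gradient descent in that each iteration uses only one sample to update the parameters. The objective function of SGD is shown in Equation \eqref{eq:9-31}
+\parinterval Stochastic gradient descent (SGD) differs from batch gradient descent in that each iteration uses only one sample to update the parameters. The objective function of SGD is as follows: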
 \begin{eqnarray}
 J({\bm \theta})&=&L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})
 \label{eq:9-31}
@@ -1254,7 +1254,7 @@ J({\bm \theta})&=&L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})
 \noindent {\small\sffamily\bfseries{3) Mini-batch Gradient Descent\index{Mini-batch Gradient Descent}}}
 \vspace{0.5em}
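-\parinterval To balance the advantages and disadvantages of batch gradient descent and stochastic gradient descent, practical applications usually adopt a compromise between the two\ \dash \ mini-batch gradient descent. The idea is that each iteration computes the loss on a small portion of the training data and updates the parameters. This small portion of data is called a batch (mini-batch or batch). The objective function for parameter optimization in mini-batch gradient descent is shown in Equation \eqref{eq:9-32}:
+\parinterval To balance the advantages and disadvantages of batch gradient descent and stochastic gradient descent, practical applications usually adopt a compromise between the two\ \dash \ mini-batch gradient descent. The idea is that each iteration computes the loss on a small portion of the training data and updates the parameters. This small portion of data is called a batch (mini-batch or batch). The objective function for parameter optimization in mini-batch gradient descent is as follows: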
 \begin{eqnarray}
 J({\bm \theta})&=&\frac{1}{m}\sum_{i=j}^{j+m-1}{L({\mathbi{x}}_i,\widetilde{\mathbi{y}}_i;{\bm \theta})}
 \label{eq:9-32}
@@ -1389,7 +1389,7 @@ $+2x^2+x+1)$ & \ \ $(x^4+2x^3+2x^2+x+1)$ & $+6x+1$ \\
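 \subsubsection{3. Variants and Improvements of Gradient-Based Methods}\label{sec:9.4.2.3}
-\parinterval  Parameter optimization is usually based on gradient descent: at each update step $ t $, the parameters are updated in the direction opposite to the gradient, as shown in Equation \eqref{eq:9-200}:
+\parinterval  Parameter optimization is usually based on gradient descent: at each update step $ t $, the parameters are updated in the direction opposite to the gradient, as follows: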
 \begin{eqnarray}
 {\bm \theta}_{t+1}&=&{\bm \theta}_{t}-\alpha \cdot \frac{\partial J({\bm \theta}_t)}{\partial {\bm \theta}_t}
 \label{eq:9-200}
@@ -1547,7 +1547,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
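 \parinterval  During training, if the initial parameter values are too large and the gradient of every layer is greater than 1, the partial derivatives at each layer become large during backpropagation, so the gradient grows exponentially until it exceeds the range that floating-point numbers can represent. This is the gradient explosion phenomenon. When it happens, the parameters in the parts of the model near the input are updated faster than those far from the input, which makes the network very unstable. In extreme cases, the parameter values become very large or even overflow. A common remedy for gradient explosion is {\small\sffamily\bfseries{gradient clipping}}\index{Gradient Clipping}.
-\parinterval    The idea of gradient clipping is to set a clipping threshold. When the gradient is updated, if it exceeds this threshold, it is forced back within that range. Let the gradient be ${\mathbi{g}}$ and the clipping threshold be $\sigma $. The clipping process is shown in Equation \eqref{eq:9-43}:
+\parinterval    The idea of gradient clipping is to set a clipping threshold. When the gradient is updated, if it exceeds this threshold, it is forced back within that range. Let the gradient be ${\mathbi{g}}$ and the clipping threshold be $\sigma $. The clipping process can be described by the following equation: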
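The exponential growth described above can be illustrated with a toy calculation. This is only a sketch under a strong simplifying assumption: every layer multiplies the backpropagated gradient by the same scalar factor. The function name `gradient_after_layers` and the chosen numbers are hypothetical.

```python
import math


def gradient_after_layers(per_layer_factor, num_layers, start=1.0):
    """Scale a gradient by the same factor once per layer, as a toy model."""
    g = start
    for _ in range(num_layers):
        g *= per_layer_factor  # each layer contributes one multiplicative factor
    return g


print(gradient_after_layers(1.5, 10))    # already about 57.7 after 10 layers
print(gradient_after_layers(1.5, 2000))  # exceeds the float range and becomes inf
print(math.isinf(gradient_after_layers(1.5, 2000)))
```

With a factor only slightly above 1, the value stays moderate for shallow networks but leaves the representable range of double-precision floats once the depth is large enough.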
 \begin{eqnarray}
 {\mathbi{g}}&=&{\textrm{min}}(\frac{\sigma}{\Vert {\mathbi{g}}\Vert},1){\mathbi{g}}
 \label{eq:9-43}
@@ -1585,7 +1585,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \end{figure}
 %-------------------------------------------
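-Compared with a plain stack of layers, a residual network provides cross-layer connections. This structure is very beneficial in backpropagation. For example, for one training sample with loss function $L$, the gradient at $ \mathbi x_l $ can be computed as in Equation \eqref{eq:9-45}:
+Compared with a plain stack of layers, a residual network provides cross-layer connections. This structure is very beneficial in backpropagation. For example, for one training sample with loss function $L$, the gradient at $ \mathbi x_l $ can be computed as follows: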
 \begin{eqnarray}
 \frac{\partial L}{\partial {\mathbi{x}}_l}&=&\frac{\partial L}{\partial {\mathbi{x}}_{l+1}} \cdot  \frac{\partial {\mathbi{x}}_{l+1}}{\partial {\mathbi{x}}_l}\nonumber\\
 &=&\frac{\partial L}{\partial {\mathbi{x}}_{l+1}} \cdot \left(1+\frac{\partial F({\mathbi{x}}_l)}{\partial {\mathbi{x}}_l}\right)\nonumber\\
@@ -1650,7 +1650,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \vspace{0.5em}
 \item  $ {\mathbi{h}}^K $: the output of the whole network;
 \vspace{0.5em}
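-\item  $ {\mathbi{s}}^k $: the result of the linear transformation in layer $ k $, computed as shown in Equation \eqref{eq:9-109}:
+\item  $ {\mathbi{s}}^k $: the result of the linear transformation in layer $ k $, computed as follows: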
       \begin{eqnarray}
       {\mathbi{s}}^k & = & {\mathbi{h}}^{k-1}{\mathbi{W}}^k \nonumber \\
                   & = & \sum{h_j^{k-1}w_{j,i}^k}
@@ -1661,7 +1661,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \vspace{0.5em}
 \end{itemize}
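-\parinterval  Thus, in layer $ k $ of the neural network, the forward computation is as shown in Equation \eqref{eq:9-46}:
+\parinterval  Thus, in layer $ k $ of the neural network, the forward computation can be described as: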
 \begin{eqnarray}
 {\mathbi{h}}^k & = & f^k({\mathbi{s}}^k) \nonumber \\
            & = & f^k({\mathbi{h}}^{k-1}{\mathbi{W}}^k)
@@ -1716,7 +1716,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \end{figure}
 %-------------------------------------------
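-\parinterval  In the first stage, the goal is to obtain the gradient of the loss function $ L $ with respect to the intermediate state $ {\mathbi{s}}^K $ of layer $ K $. Let $ {\bm \pi}^K= \frac{\partial L}{\partial{\mathbi{s}}^K} $. The chain rule then gives Equation \eqref{eq:9-49}:
+\parinterval  In the first stage, the goal is to obtain the gradient of the loss function $ L $ with respect to the intermediate state $ {\mathbi{s}}^K $ of layer $ K $. Let $ {\bm \pi}^K= \frac{\partial L}{\partial{\mathbi{s}}^K} $. The chain rule then gives: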
 \begin{eqnarray}
 {\bm \pi}^K&=& \frac{\partial L}{\partial {\mathbi{s}}^K}\nonumber\\
 &=&\frac{\partial L}{\partial {\mathbi{h}}^K}\cdot \frac{\partial {\mathbi{h}}^K}{\partial {\mathbi{s}}^K}\nonumber\\
@@ -1805,7 +1805,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
 \vspace{0.5em}
 \end{itemize}
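-\parinterval  These two steps are very similar to backpropagation at the output layer. Applying the chain rule gives Equation \eqref{eq:9-54}:
+\parinterval  These two steps are very similar to backpropagation at the output layer. Applying the chain rule gives: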
 \begin{eqnarray}
 \frac{\partial L}{\partial {\mathbi{s}}^k}&=&\frac{\partial L}{\partial {\mathbi{h}}^k}\cdot \frac{\partial {\mathbi{h}}^k}{\partial {\mathbi{s}}^k}\nonumber\\
 &=&\frac{\partial L}{\partial {\mathbi{h}}^k}\cdot \frac{\partial f^k({\mathbi{s}}^k)}{\partial {\mathbi{s}}^k}
@@ -1849,13 +1849,13 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
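 \subsection{Language Models Based on Feed-Forward Neural Networks}
-\parinterval  Recall from {\chaptertwo} that the language modeling problem is defined as follows: given a word sequence $ w_1w_2\dots w_m$, how likely is that sequence? The probability of the word sequence can be obtained by the chain rule, as shown in Equation \eqref{eq:9-57}:
+\parinterval  Recall from {\chaptertwo} that the language modeling problem is defined as follows: given a word sequence $ w_1w_2\dots w_m$, how likely is that sequence? The probability of the word sequence can be obtained by the chain rule: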
 \begin{eqnarray}
 \funp{P}(w_1w_2\dots w_m)&=&\funp{P}(w_1)\funp{P}(w_2|w_1)\funp{P}(w_3|w_1w_2)\dots \funp{P}(w_m|w_1\dots w_{m-1})
 \label{eq:9-57}
 \end{eqnarray}
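The chain-rule factorization above can be sketched as a product of conditional probabilities. In the following minimal Python example, the table `toy_model`, its probability values, and the function name `sequence_probability` are all hypothetical and chosen only for illustration; they are not estimates from any corpus.

```python
def sequence_probability(words, model):
    """Multiply P(w_m | w_1 ... w_{m-1}) over all positions m."""
    prob = 1.0
    for m, word in enumerate(words):
        history = tuple(words[:m])        # all words before position m
        prob *= model[(history, word)]    # conditional probability lookup
    return prob


# Hypothetical conditional probabilities, keyed by (history, next word).
toy_model = {
    ((), "a"): 0.5,             # P(a)
    (("a",), "b"): 0.5,         # P(b | a)
    (("a", "b"), "c"): 0.25,    # P(c | a b)
}

print(sequence_probability(["a", "b", "c"], toy_model))  # 0.0625
```

Note that the lookup key grows with the position in the sequence: the last factor conditions on the entire preceding history.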
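-\parinterval  Because $ \funp{P}(w_m|w_1\dots w_{m-1}) $ has to model a history of $ m-1 $ words, this model is still very complex. This motivates the $n$-gram language model, which relies on a local history, as shown in Equation \eqref{eq:9-58}:
+\parinterval  Because $ \funp{P}(w_m|w_1\dots w_{m-1}) $ has to model a history of $ m-1 $ words, this model is still very complex. This motivates the $n$-gram language model, which relies on a local history: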
 \begin{eqnarray}
 \funp{P}(w_m|w_1\dots w_{m-1})&=&\funp{P}(w_m|w_{m-n+1}\dots w_{m-1})
 \label{eq:9-58}
@@ -1869,7 +1869,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
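 \noindent Here, $ w_{m-n+1}\dots w_m $ is also called an $n$-gram, that is, a unit of $ n $ consecutive words. The $n$-gram language model is a typical model based on discrete representations. In this model, all words are treated as discrete symbols, so different words are ``completely'' different. On the other hand, language is highly diverse, and accurate statistics for all $n$-grams cannot be obtained even from a very large corpus. Many $n$-grams never appear in the training data at all. Because no direct connection is established between different $n$-grams, $n$-gram language models often suffer from data sparsity. For example, the word ``景色'' (scenery) may have been seen in the training data while the test data contains ``风景'' (landscape), which happens never to occur in the training data. Even though the two words mean the same thing, the $n$-gram language model still treats ``风景'' as an out-of-vocabulary word and assigns it a very low probability.
-\parinterval  The root of the problem above is that the $n$-gram language model uses a discrete representation for words: each word corresponds, in isolation, to one index in the vocabulary, and words have no semantic ``overlap'' with one another. Neural language models reformulate the problem. Instead of explicitly counting the frequencies of discrete $n$-grams, they directly design a neural network model $ g(\cdot)$ to estimate the probability of generating a word, as in Equation \eqref{eq:9-59}:
+\parinterval  The root of the problem above is that the $n$-gram language model uses a discrete representation for words: each word corresponds, in isolation, to one index in the vocabulary, and words have no semantic ``overlap'' with one another. Neural language models reformulate the problem. Instead of explicitly counting the frequencies of discrete $n$-grams, they directly design a neural network model $ g(\cdot)$ to estimate the probability of generating a word, as shown below: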
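The data sparsity problem described above can be reproduced with a tiny count-based bigram model. This is a minimal sketch assuming unsmoothed relative-frequency estimates (a smoothed model would assign a small nonzero probability instead of zero); the corpus, the English stand-in words ``scenery'' and ``landscape'', and the function names are hypothetical.

```python
from collections import Counter


def train_bigram(corpus):
    """Count histories and bigrams over a list of tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            histories[prev] += 1          # how often prev is followed by anything
            bigrams[(prev, cur)] += 1     # how often prev is followed by cur
    return histories, bigrams


def bigram_prob(prev, cur, histories, bigrams):
    """Relative-frequency estimate of P(cur | prev), without smoothing."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / histories[prev]


corpus = [["the", "scenery", "is", "beautiful"],
          ["the", "scenery", "is", "calm"]]
histories, bigrams = train_bigram(corpus)

print(bigram_prob("the", "scenery", histories, bigrams))    # seen: 1.0
print(bigram_prob("the", "landscape", histories, bigrams))  # unseen synonym: 0.0
```

The model has no way to transfer what it learned about ``scenery'' to ``landscape'', because the two words are unrelated symbols as far as the counts are concerned.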
 \begin{eqnarray}
 \funp{P}(w_m|w_1\dots w_{m-1})&=&g(w_1\dots w_m)
 \label{eq:9-59}
@@ -1915,7 +1915,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
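 \subsubsection{2. Input Layer}
-\parinterval  $ {\mathbi{o}}_{i-3} $, $ {\mathbi{o}}_{i-2} $, and $ {\mathbi{o}}_{i-1} $ are the inputs of this language model (green boxes). The input is the one-hot vector of each word (such as $ w_{i-1}$ and $ w_{i-2}$ above), whose dimension equals the vocabulary size. Each one-hot vector has a single entry equal to 1 and all others equal to 0. For example, $ (0,0,1,\dots,0) $ denotes the third word in the vocabulary. The one-hot vector is then multiplied by a matrix $ \mathbi{C} $ to obtain the distributed representation of the word (purple boxes). Let $ {\mathbi{o}}_i $ be the one-hot representation of the $ i $-th word and $ {\mathbi{e}}_i $ its distributed representation. Then $ {\mathbi{e}}_i $ is computed as shown in Equation \eqref{eq:9-60}:
+\parinterval  $ {\mathbi{o}}_{i-3} $, $ {\mathbi{o}}_{i-2} $, and $ {\mathbi{o}}_{i-1} $ are the inputs of this language model (green boxes). The input is the one-hot vector of each word (such as $ w_{i-1}$ and $ w_{i-2}$ above), whose dimension equals the vocabulary size. Each one-hot vector has a single entry equal to 1 and all others equal to 0. For example, $ (0,0,1,\dots,0) $ denotes the third word in the vocabulary. The one-hot vector is then multiplied by a matrix $ \mathbi{C} $ to obtain the distributed representation of the word (purple boxes). Let $ {\mathbi{o}}_i $ be the one-hot representation of the $ i $-th word and $ {\mathbi{e}}_i $ its distributed representation. Then $ {\mathbi{e}}_i $ is computed as follows: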
 \begin{eqnarray}
 {\mathbi{e}}_i&=&{\mathbi{o}}_i\cdot{\mathbi{C}}
 \label{eq:9-60}
@@ -1929,7 +1929,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
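 \subsubsection{3. Hidden Layer and Output Layer}
-\parinterval  The three vectors $ {\mathbi{e}}_0 $, $ {\mathbi{e}}_1 $, and $ {\mathbi{e}}_2 $ are concatenated, passed through two layers of the network, and finally fed to the Softmax function (orange box) to produce the output. The procedure is given by Equations \eqref{eq:9-61} and \eqref{eq:9-62}:
+\parinterval  The three vectors $ {\mathbi{e}}_0 $, $ {\mathbi{e}}_1 $, and $ {\mathbi{e}}_2 $ are concatenated, passed through two layers of the network, and finally fed to the Softmax function (orange box) to produce the output. The procedure is: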
 \begin{eqnarray}
 {\mathbi{y}}&=&{\textrm{Softmax}}({\mathbi{h}}_0{\mathbi{U}})\label{eq:9-61}\\
@@ -1939,7 +1939,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
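 \noindent  Here, the output $ {\mathbi{y}}$ is a distribution over the vocabulary $V$ that represents $\funp{P}(w_i|w_{i-1},w_{i-2},w_{i-3}) $. $ {\mathbi{U}}$, ${\mathbi{H}}$, and ${\mathbi{d}}$ are parameters of the model. Thus, for a given word $w_i$, its probability can be read from $y_i$, where $y_i$ denotes the $i$-th dimension of the vector ${\mathbi{y}}$.
-\parinterval The role of Softmax($\cdot$) is to turn a $|V|$-dimensional input vector (that is, ${\mathbi{h}}_0{\mathbi{U}}$) into a $|V|$-dimensional distribution. Let ${\bm \tau}$ denote the input vector of Softmax($\cdot$). The Softmax function can be defined as Equation \eqref{eq:9-120}:
+\parinterval The role of Softmax($\cdot$) is to turn a $|V|$-dimensional input vector (that is, ${\mathbi{h}}_0{\mathbi{U}}$) into a $|V|$-dimensional distribution. Let ${\bm \tau}$ denote the input vector of Softmax($\cdot$). The Softmax function can be defined as: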
 \begin{eqnarray}
 \textrm{Softmax}(\tau_i)&=&\frac{\textrm{exp}(\tau_i)}  {\sum_{i'=1}^{|V|} \textrm{exp}(\tau_{i'})}
@@ -1987,7 +1987,7 @@ z_t&=&\gamma z_{t-1}+(1-\gamma) \frac{\partial J}{\partial {\theta}_t} \cdot  \f
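 \parinterval  The long-distance dependency problem can be addressed with a {\small\sffamily\bfseries{recurrent neural network}}\index{Recurrent Neural Network} (RNN). By introducing a special structure, the recurrent unit, a recurrent neural network can model a history of arbitrary length, which to some extent solves the limited-history problem of traditional $n$-gram language models. Building on this advantage, the {\small\sffamily\bfseries{recurrent neural network language model}} (RNNLM)\index{RNNLM} emerged\upcite{mikolov2010recurrent}.
-\parinterval  In a recurrent neural network, both the input and the output are sequences, denoted $ ({\mathbi{x}}_1,\dots,{\mathbi{x}}_m) $ and $ ({\mathbi{y}}_1,\dots,\\ {\mathbi{y}}_m) $. Both can be viewed as time series, where each time step $ t $ has an input $ {\mathbi{x}}_t $ and an output $ {\mathbi{y}}_t $. The core of a recurrent neural network is the {\small\sffamily\bfseries{recurrent unit}}\index{RNN Cell} (RNN Cell), which reads the output of the recurrent unit at the previous time step and the input at the current time step, and produces the output of the recurrent unit at the current time step. Figure \ref{fig:9-62} shows a simple recurrent unit. At time step $ t $, the output of the recurrent unit is defined as Equation \eqref{eq:9-63}:
+\parinterval  In a recurrent neural network, both the input and the output are sequences, denoted $ ({\mathbi{x}}_1,\dots,{\mathbi{x}}_m) $ and $ ({\mathbi{y}}_1,\dots,\\ {\mathbi{y}}_m) $. Both can be viewed as time series, where each time step $ t $ has an input $ {\mathbi{x}}_t $ and an output $ {\mathbi{y}}_t $. The core of a recurrent neural network is the {\small\sffamily\bfseries{recurrent unit}}\index{RNN Cell} (RNN Cell), which reads the output of the recurrent unit at the previous time step and the input at the current time step, and produces the output of the recurrent unit at the current time step. Figure \ref{fig:9-62} shows a simple recurrent unit. At time step $ t $, the output of the recurrent unit is defined as: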
 \begin{eqnarray}
 {\mathbi{h}}_t&=&{\textrm{Tanh}}({\mathbi{x}}_t{\mathbi{U}}+{\mathbi{h}}_{t-1}{\mathbi{W}})
 \label{eq:9-63}