rule matching of adjacent variables

Commit a6a1f901 authored Jan 13, 2020 by xiaotong
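
The slides added in this commit describe matching a rule's leaf sequence (words and variables) against chart spans. The idea can be sketched in Python as below. This is only an illustration under stated assumptions, not code from this repository: the chart is assumed to be a dict from half-open spans (i, j) to the set of labels that have a derivation there, and the names match_pattern, count_segmentations, and the ("word", ...) / ("var", ...) encoding are hypothetical.

```python
from math import comb


def match_pattern(pattern, words, chart, start, end):
    """Yield every way `pattern` can cover words[start:end].

    pattern: list of ("word", token) or ("var", label) pairs (assumed encoding).
    chart:   dict mapping (i, j) to the set of labels derived on that span.
    Each result is a list of (symbol, i, j) triples, one per pattern symbol.
    """
    if not pattern:
        if start == end:  # the pattern must consume the whole span
            yield []
        return
    (kind, value), rest = pattern[0], pattern[1:]
    if kind == "word":
        # A word is matched directly against the next token.
        if start < end and words[start] == value:
            for tail in match_pattern(rest, words, chart, start + 1, end):
                yield [(value, start, start + 1)] + tail
        return
    # A variable: loop over its right boundary, leaving at least one
    # token for each remaining symbol, and look the label up in the chart.
    for mid in range(start + 1, end - len(rest) + 1):
        if value in chart.get((start, mid), ()):
            for tail in match_pattern(rest, words, chart, mid, end):
                yield [(value, start, mid)] + tail


def count_segmentations(n, m):
    """Number of ways to cut n tokens into m non-empty adjacent spans."""
    return comb(n - 1, m - 1)


# Example sentence and spans taken from the slides.
words = "阿都拉 对 自己 四 个 多 月 以来 的 施政 表现 感到 满意".split()
chart = {(0, 1): {"NP"}, (2, 11): {"NP"}, (11, 13): {"VP"}}
pattern = [("var", "NP"), ("word", "对"), ("var", "NP"), ("var", "VP")]
matches = list(match_pattern(pattern, words, chart, 0, len(words)))
```

With this toy chart, the only match places the second NP on [2, 11] and the VP on [11, 13]. The adjacent pair "NP VP" covers 11 tokens, so count_segmentations(11, 2) gives 10 candidate boundaries to test. The sketch also loops over boundaries for a variable that is followed by a word; a real matcher could anchor on the word instead.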
--- a/Section04-Phrasal-and-Syntactic-Models/section04-test.tex
+++ b/Section04-Phrasal-and-Syntactic-Models/section04-test.tex
@@ -156,9 +156,15 @@
 \begin{frame}{String-based decoding - rule matching}

 \begin{itemize}
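-\item Compared with tree-based decoding, string-based decoding is much harder to implement, mainly because for every span the decoder must check whether each rule matches
+\item Compared with tree-based decoding, string-based decoding is much harder to implement, because for every span the decoder must check whether each rule matches
+	\begin{itemize}
+	\item Matching a rule means matching the leaf sequence of its tree fragment, i.e. a string of words and variables
+	\item<2-> Words can be matched directly
+	\item<3-> A variable matches only if the chart node for the corresponding span holds a derivation with the corresponding label
+	\end{itemize}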
 \end{itemize}

+\vspace{-1em}
 \begin{center}
 \begin{tikzpicture}

@@ -179,14 +185,14 @@
 \node [anchor=south,align=center] (box2label) at (box2.north) {[{\blue 2},{\blue 11}]\\NP};
 \node [anchor=south,align=center] (box3label) at (box3.north) {[{\blue 11},{\blue 13}]\\VP};
 }
+\visible<2->{
+\node [draw,thick,purple,inner sep=0pt] (box4) [fit = (sw2)] {};
+}
 \end{pgfonlayer}

 \draw[decorate,decoration={brace,mirror,,amplitude=3mm}] (sw1.south west) -- (sw4.south east);

-\node [anchor=north] (label) at ([yshift=-1em]sw3.south) {Rule matching on span [{\blue 0},{\blue 13}]};
-\node [anchor=north] (rule) at ([yshift=-0.3em]label.south) {{\footnotesize e.g. IP({\color{red} NP$_1$} VP(PP(P(对) {\color{ugreen} NP$_2$}) {\color{orange} VP$_3$}))}};
-\node [anchor=north west] (rule2) at ([yshift=0.2em]rule.south west) {{\footnotesize \hspace{2.8em} $\to$ NP$_1$ VP$_3$ with NP$_2$}};
-
+\node [anchor=north] (label) at ([yshift=-1em]sw3.south) {Matching ``NP 对 NP VP'' on span [{\blue 0},{\blue 13}]};
 }

 \end{scope}
@@ -194,6 +200,77 @@
 \end{tikzpicture}
 \end{center}

+\vspace{-1em}
+
+\begin{itemize}
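+\item<4-> If the word-and-variable sequence to match contains no adjacent variables, the rule is in lexicalized normal form (LNF), and matching it takes O(1) time
+	\begin{itemize}
+	\item For example, the rules of hierarchical phrase-based systems are in LNF, so rule matching is easy to implement
+	\item Clearly, the rule in the example above is not in LNF
+	\end{itemize}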
+\end{itemize}
+
+\end{frame}
+
+%%%------------------------------------------------------------------------------------------------------------
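+%%%  String-based decoding - matching adjacent variables, increased complexity
+\begin{frame}{String-based decoding - matching adjacent variables}
+\begin{itemize}
+\item However, if the string to match contains adjacent variables, matching gets harder: finding the boundary between two adjacent variables requires an extra loop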
+\end{itemize}
+
+\vspace{-0.5em}
+
+\begin{center}
+\begin{tikzpicture}
+
+\begin{scope}
+{\scriptsize
+
+\node [anchor=west] (sw11) at (0,0) {阿都拉$_1$};
+\node [anchor=west] (sw12) at ([xshift=0.1em]sw11.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw13) at ([xshift=0.1em]sw12.east) {自己$_3$ 四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$\ 施政$_{10}$\ 表现$_{11}$ 感到$_{12}$ };
+\node [anchor=west,fill=orange!20] (sw14) at ([xshift=0.2em]sw13.east) {满意$_{13}$};
+
+\node [anchor=north west] (sw21) at ([yshift=-0.3em]sw11.south west) {阿都拉$_1$};
+\node [anchor=west] (sw22) at ([xshift=0.1em]sw21.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw23) at ([xshift=0.1em]sw22.east) {自己$_3$ 四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$\ 施政$_{10}$\ 表现$_{11}$};
+\node [anchor=west,fill=orange!20] (sw24) at ([xshift=0.2em]sw23.east) {感到$_{12}$ 满意$_{13}$};
+
+\node [anchor=north west] (sw31) at ([yshift=-0.3em]sw21.south west) {阿都拉$_1$};
+\node [anchor=west] (sw32) at ([xshift=0.1em]sw31.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw33) at ([xshift=0.1em]sw32.east) {自己$_3$ 四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$\ 施政$_{10}$};
+\node [anchor=west,fill=orange!20] (sw34) at ([xshift=0.2em]sw33.east) {表现$_{11}$ 感到$_{12}$ 满意$_{13}$};
+
+\node [anchor=north] (dots) at ([yshift=-0.5em]sw33.south) {...};
+
+\node [anchor=north west] (sw41) at ([yshift=-1.8em]sw31.south west) {阿都拉$_1$};
+\node [anchor=west] (sw42) at ([xshift=0.1em]sw41.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw43) at ([xshift=0.1em]sw42.east) {自己$_3$ };
+\node [anchor=west,fill=orange!20] (sw44) at ([xshift=0.2em]sw43.east) {四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$ 施政$_{10}$ 表现$_{11}$ 感到$_{12}$ 满意$_{13}$};
+
+\node [anchor=south] (label) at ([yshift=0.3em]sw13.north) {\footnotesize{Matching ``NP 对 NP VP'' on span [{\blue 0},{\blue 13}]}};
+
+\node [anchor=north west,minimum size=1.2em,fill=green!20] (np) at ([yshift=-1.0em,xshift=0.3em]sw41.south west) {};
+\node [anchor=west] (nplabel) at (np.east) {NP (the second one)};
+\node [anchor=west,minimum size=1.2em,fill=orange!20] (vp) at ([xshift=1.0em]nplabel.east) {};
+\node [anchor=west] (vplabel) at (vp.east) {VP};
+
+}
+\end{scope}
+
+\end{tikzpicture}
+\end{center}
+
+\vspace{-0.5em}
+
+\begin{itemize}
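+\item<2-> In theory, for a string of $n$ words, matching $m$ adjacent variables takes $O(n^{m-1})$ time, since each of the $m-1$ boundaries between them can fall at $O(n)$ positions
+	\begin{itemize}
+	\item As a result, matching unlexicalized rules with several variables greatly increases the system's running time, yet such rules are common in syntax-based systems
+	\end{itemize}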
+\end{itemize}
+
 \end{frame}

 %%%------------------------------------------------------------------------------------------------------------

--- a/Section04-Phrasal-and-Syntactic-Models/section04.tex
+++ b/Section04-Phrasal-and-Syntactic-Models/section04.tex
@@ -4925,6 +4925,128 @@ NP-BAR(NN$_1$ NP-BAR$_2$) $\to$ NN$_1$ NP-BAR$_2$
 \end{frame}

 %%%------------------------------------------------------------------------------------------------------------
+%%%  String-based decoding - rule matching
+\begin{frame}{String-based decoding - rule matching}
+
+\begin{itemize}
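+\item Compared with tree-based decoding, string-based decoding is much harder to implement, because for every span the decoder must check whether each rule matches
+	\begin{itemize}
+	\item Matching a rule means matching the leaf sequence of its tree fragment, i.e. a string of words and variables
+	\item<2-> Words can be matched directly
+	\item<3-> A variable matches only if the chart node for the corresponding span holds a derivation with the corresponding label
+	\end{itemize}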
+\end{itemize}
+
+\vspace{-1em}
+\begin{center}
+\begin{tikzpicture}
+
+\begin{scope}
+{\scriptsize
+
+\node [anchor=west] (sw1) at (0,0) {阿都拉$_1$};
+\node [anchor=west] (sw2) at ([xshift=0.1em]sw1.east) {对$_2$};
+\node [anchor=west] (sw3) at ([xshift=0.1em]sw2.east) {自己$_3$ 四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$\ 施政$_{10}$\ 表现$_{11}$};
+\node [anchor=west] (sw4) at ([xshift=0.2em]sw3.east) {感到$_{12}$ 满意$_{13}$};
+
+\begin{pgfonlayer}{background}
+\visible<3->{
+\node [fill=red!20,inner sep=0pt] (box1) [fit = (sw1)] {};
+\node [fill=green!20,inner sep=0pt] (box2) [fit = (sw3)] {};
+\node [fill=orange!20,inner sep=0pt] (box3) [fit = (sw4)] {};
+\node [anchor=south,align=center] (box1label) at (box1.north) {[{\blue 0},{\blue 1}]\\NP};
+\node [anchor=south,align=center] (box2label) at (box2.north) {[{\blue 2},{\blue 11}]\\NP};
+\node [anchor=south,align=center] (box3label) at (box3.north) {[{\blue 11},{\blue 13}]\\VP};
+}
+\visible<2->{
+\node [draw,thick,purple,inner sep=0pt] (box4) [fit = (sw2)] {};
+}
+\end{pgfonlayer}
+
+\draw[decorate,decoration={brace,mirror,,amplitude=3mm}] (sw1.south west) -- (sw4.south east);
+
+\node [anchor=north] (label) at ([yshift=-1em]sw3.south) {Matching ``NP 对 NP VP'' on span [{\blue 0},{\blue 13}]};
+}
+
+\end{scope}
+
+\end{tikzpicture}
+\end{center}
+
+\vspace{-1em}
+
+\begin{itemize}
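+\item<4-> If the word-and-variable sequence to match contains no adjacent variables, the rule is in lexicalized normal form (LNF), and matching it takes O(1) time
+	\begin{itemize}
+	\item For example, the rules of hierarchical phrase-based systems are in LNF, so rule matching is easy to implement
+	\item Clearly, the rule in the example above is not in LNF
+	\end{itemize}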
+\end{itemize}
+
+\end{frame}
+
+%%%------------------------------------------------------------------------------------------------------------
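+%%%  String-based decoding - matching adjacent variables, increased complexity
+\begin{frame}{String-based decoding - matching adjacent variables}
+\begin{itemize}
+\item However, if the string to match contains adjacent variables, matching gets harder: finding the boundary between two adjacent variables requires an extra loop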
+\end{itemize}
+
+\vspace{-0.5em}
+
+\begin{center}
+\begin{tikzpicture}
+
+\begin{scope}
+{\scriptsize
+
+\node [anchor=west] (sw11) at (0,0) {阿都拉$_1$};
+\node [anchor=west] (sw12) at ([xshift=0.1em]sw11.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw13) at ([xshift=0.1em]sw12.east) {自己$_3$ 四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$\ 施政$_{10}$\ 表现$_{11}$ 感到$_{12}$ };
+\node [anchor=west,fill=orange!20] (sw14) at ([xshift=0.2em]sw13.east) {满意$_{13}$};
+
+\node [anchor=north west] (sw21) at ([yshift=-0.3em]sw11.south west) {阿都拉$_1$};
+\node [anchor=west] (sw22) at ([xshift=0.1em]sw21.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw23) at ([xshift=0.1em]sw22.east) {自己$_3$ 四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$\ 施政$_{10}$\ 表现$_{11}$};
+\node [anchor=west,fill=orange!20] (sw24) at ([xshift=0.2em]sw23.east) {感到$_{12}$ 满意$_{13}$};
+
+\node [anchor=north west] (sw31) at ([yshift=-0.3em]sw21.south west) {阿都拉$_1$};
+\node [anchor=west] (sw32) at ([xshift=0.1em]sw31.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw33) at ([xshift=0.1em]sw32.east) {自己$_3$ 四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$\ 施政$_{10}$};
+\node [anchor=west,fill=orange!20] (sw34) at ([xshift=0.2em]sw33.east) {表现$_{11}$ 感到$_{12}$ 满意$_{13}$};
+
+\node [anchor=north] (dots) at ([yshift=-0.5em]sw33.south) {...};
+
+\node [anchor=north west] (sw41) at ([yshift=-1.8em]sw31.south west) {阿都拉$_1$};
+\node [anchor=west] (sw42) at ([xshift=0.1em]sw41.east) {对$_2$};
+\node [anchor=west,fill=green!20] (sw43) at ([xshift=0.1em]sw42.east) {自己$_3$ };
+\node [anchor=west,fill=orange!20] (sw44) at ([xshift=0.2em]sw43.east) {四$_4$\ 个$_5$\ 多$_6$\ 月$_7$\ 以来$_8$\ 的$_9$ 施政$_{10}$ 表现$_{11}$ 感到$_{12}$ 满意$_{13}$};
+
+\node [anchor=south] (label) at ([yshift=0.3em]sw13.north) {\footnotesize{Matching ``NP 对 NP VP'' on span [{\blue 0},{\blue 13}]}};
+
+\node [anchor=north west,minimum size=1.2em,fill=green!20] (np) at ([yshift=-1.0em,xshift=0.3em]sw41.south west) {};
+\node [anchor=west] (nplabel) at (np.east) {NP (the second one)};
+\node [anchor=west,minimum size=1.2em,fill=orange!20] (vp) at ([xshift=1.0em]nplabel.east) {};
+\node [anchor=west] (vplabel) at (vp.east) {VP};
+
+}
+\end{scope}
+
+\end{tikzpicture}
+\end{center}
+
+\vspace{-0.5em}
+
+\begin{itemize}
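+\item<2-> In theory, for a string of $n$ words, matching $m$ adjacent variables takes $O(n^{m-1})$ time, since each of the $m-1$ boundaries between them can fall at $O(n)$ positions
+	\begin{itemize}
+	\item As a result, matching unlexicalized rules with several variables greatly increases the system's running time, yet such rules are common in syntax-based systems
+	\end{itemize}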
+\end{itemize}
+
+\end{frame}
+
+%%%------------------------------------------------------------------------------------------------------------
 %%%  String-based decoding
 \begin{frame}{Binarization + CKY}
 % NiuTrans Manual, my EMNLP paper, and earlier documents