wording (word segmentation)

Commit 3a186ef1 authored Aug 30, 2020 by xiaotong
--- a/Chapter3/Figures/figure-examples-of-chinese-word-segmentation-based-on-1-gram-model.tex
+++ b/Chapter3/Figures/figure-examples-of-chinese-word-segmentation-based-on-1-gram-model.tex
@@ -28,7 +28,7 @@

 \draw [->,very thick,ublue] ([xshift=0.2em]corpus.east) -- ([xshift=4.2em]corpus.east)  node [pos=0.5, above] {\color{red}{\scriptsize{统计学习}}};

-\draw [->,very thick,ublue] ([xshift=0.2em]model.east) -- ([xshift=4.2em]model.east)  node [pos=0.5, above] {\color{red}{\scriptsize{搜索\&计算}}};
+\draw [->,very thick,ublue] ([xshift=0.2em]model.east) -- ([xshift=4.2em]model.east)  node [pos=0.5, above] {\color{red}{\scriptsize{预测}}};

 {\scriptsize
 \node [anchor=north west] (sentlabel) at ([xshift=6.2em,yshift=-1em]model.north east) {\color{red}{自动分词系统}};
@@ -53,7 +53,7 @@
 \end{pgfonlayer}

 {
-\draw [-,thick,red,dotted] ([yshift=0.3em]modellabel.north) ..controls +(north:0.5) and +(south:0.5).. ([xshift=-3em]label1content.south);
+\draw [-,thick,dotted] ([yshift=0.3em]modellabel.north) ..controls +(north:0.5) and +(south:0.5).. ([xshift=-3em]label1content.south);
 }

 }
@@ -122,7 +122,7 @@
 \end{pgfonlayer}

 {
-\draw [-,thick,red,dotted] (segcontent.north) ..controls +(north:0.7) and +(south:0.7).. (segsystem.south);
+\draw [-,thick,dotted] (segcontent.north) ..controls +(north:0.7) and +(south:0.7).. (segsystem.south);
 }

 \end{tikzpicture}

--- a/Chapter3/Figures/figure-word-segmentation-based-on-statistics.tex
+++ b/Chapter3/Figures/figure-word-segmentation-based-on-statistics.tex
@@ -44,7 +44,7 @@
 }

 {
-\draw [->,very thick,ublue] ([xshift=0.2em]model.east) -- ([xshift=4.2em]model.east)  node [pos=0.5, above] {\color{red}{\scriptsize{搜索\&计算}}};
+\draw [->,very thick,ublue] ([xshift=0.2em]model.east) -- ([xshift=4.2em]model.east)  node [pos=0.5, above] {\color{red}{\scriptsize{推断}}};
 }

 {\scriptsize
@@ -71,7 +71,7 @@
 \node [anchor=east,draw,dashed,red,thick,minimum width=13em,minimum height=1.4em] (final) at (p2seg2.east) {};
 \node [anchor=west,red] (finallabel) at ([xshift=3.1em]sentlabel.east) {输出概率最大的结果};
 %\node [anchor=north east,red] (finallabel2) at ([yshift=0.5em]finallabel.south east) {的结果};
-\draw [->,thick,red] ([xshift=0.0em,yshift=-0.5em]final.north east) ..controls +(east:0.3) and +(south:0.0).. ([xshift=1.0em]finallabel.south);
+\draw [->,thick,red] ([xshift=0.0em,yshift=-0.5em]final.north east) ..controls +(east:0.2) and +(south:1.0).. ([xshift=2.0em]finallabel.south);
 }

 }

--- a/Chapter3/chapter3.tex
+++ b/Chapter3/chapter3.tex
@@ -169,7 +169,7 @@ Interests $\to$ \; Interest/s & selected $\to$ \; se/lect/ed & processed $\to$ \

 \subsubsection{1. 统计模型的学习与推断}

-\parinterval 在分词任务中，数据驱动主要指用已经分词切分好的数据``喂''给系统，这个数据也被称作{\small\bfnew{标注数据}}\index{标注数据}（Annotated Data）\index{Annotated Data}。在获得标注数据后，系统自动学习一个统计模型来描述分词的过程，而这个模型会把分词的``知识''作为参数保存在模型中。当送入一个新的需要分词的句子时，可以利用学习到的模型对所有可能的分词结果进行预测，并进行概率化的描述，最终选择概率最大的结果作为输出。这个方法就是基于统计的分词方法，其与{\chaptertwo}介绍的统计语言建模方法本质上是一样的。具体来说，可以分为两个步骤：
+\parinterval 统计分词也是一种典型的数据驱动方法。这种方法将已经经过分词的数据``喂''给系统，这个数据也被称作{\small\bfnew{标注数据}}\index{标注数据}（Annotated Data）\index{Annotated Data}。在获得标注数据后，系统自动学习一个统计模型来描述分词的过程，而这个模型会把分词的``知识''作为参数保存在模型中。当送入一个新的需要分词的句子时，可以利用学习到的模型对可能的分词结果进行概率化的描述，最终选择概率最大的结果作为输出。这个方法就是基于统计的分词方法，其与{\chaptertwo}介绍的统计语言建模方法本质上是一样的。具体来说，可以分为两个步骤：

 \begin{itemize}
 \vspace{0.5em}
@@ -218,11 +218,9 @@ $计算这种切分的概率值。
 \label{eq:3.2-1}
 \end{eqnarray}

-\parinterval 经过充分训练的统计模型$\funp{P}(\cdot)$就是得到的分词模型。对于输入的新句子S，通过这个模型找到最佳的分词结果$W^{*}$输出。假设输入句子S是``确实现在数据很多''，可以通过列举获得不同切分方式的概率，其中概率最高的切分方式，就是系统的目标输出。
+\parinterval 经过充分训练的统计模型$\funp{P}(\cdot)$就是得到的分词模型。对于输入的新句子$S$，通过这个模型找到最佳的分词结果输出。假设输入句子$S$是``确实现在数据很多''，可以通过列举获得不同切分方式的概率，其中概率最高的切分方式，就是系统的目标输出。

-\parinterval 这种分词方法也被称作基于1-gram语言模型的分词，或全概率分词，使用标注好的分词数据进行学习，获得分词模型。这种方法最大的优点是整个学习过程（模型训练过程）和推导过程（处理新句子进行切分的过程）都是全自动进行的。这种方法虽然简单，但是其效率很高，因此被广泛应用在工业界系统里。
-
-\parinterval 当然，真正的分词系统还需要解决很多其他问题，比如使用动态规划等方法高效搜索最优解以及如何处理未见过的词等等，由于本节的重点是介绍中文分词的基础方法和统计建模思想，因此无法覆盖所有中文分词的技术内容，有兴趣的读者可以参考\ref{sec3:summary}节的相关文献做进一步深入研究。
+\parinterval 这种分词方法也被称作基于1-gram语言模型的分词，或全概率分词。全概率分词最大的优点在于方法简单、效率高，因此被广泛应用在工业界系统里。它本质上就是一个1-gram语言模型，因此可以直接复用$n$-gram语言模型的训练方法和未登录词处理方法。与传统$n$-gram语言模型稍有不同的是，分词的预测过程需要找到一个给定字符串所有可能切分中1-gram语言模型得分最高的切分。因此，可以使用{\chaptertwo}中所描述的搜索算法实现这个预测过程，也可以使用动态规划方法快速找到最优切分结果。由于本节的重点是介绍中文分词的基础方法和统计建模思想，因此不会对相关搜索算法进行进一步展开，有兴趣的读者可以参考{\chaptertwo}和本章\ref{sec3:summary}节的相关文献做进一步深入研究。

 %----------------------------------------------------------------------------------------
 %    NEW SECTION
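
Note on the revised paragraph in Chapter3/chapter3.tex: it states that prediction means finding, among all possible segmentations of a string, the one with the highest 1-gram language model score, and that dynamic programming can find it quickly. The following is a minimal Python sketch of that idea, not the book's implementation. The probability table `UNIGRAM`, the unseen-character penalty `UNK_LOGPROB`, the length cap `MAX_WORD_LEN`, and the function name `segment` are all hypothetical and chosen only for illustration; a real system would estimate the probabilities from annotated data.

```python
import math

# Hypothetical 1-gram probabilities, invented for illustration only;
# a real system would estimate them from annotated (pre-segmented) data.
UNIGRAM = {
    "确实": 0.02, "实现": 0.02, "现在": 0.03, "数据": 0.03, "很多": 0.03,
    "确": 0.001, "实": 0.001, "现": 0.001, "在": 0.02,
    "数": 0.001, "据": 0.001, "很": 0.01, "多": 0.01,
}
UNK_LOGPROB = math.log(1e-8)  # assumed flat penalty for an unseen single character
MAX_WORD_LEN = 4              # assumed cap on candidate word length


def segment(sentence):
    """Return (best segmentation, its log-probability) under the 1-gram model."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of any split of sentence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = sentence[start:end]
            if word in UNIGRAM:
                logp = math.log(UNIGRAM[word])
            elif len(word) == 1:
                logp = UNK_LOGPROB  # keep every position reachable
            else:
                continue            # unknown multi-character span: not a candidate
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sentence to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1], best[n]


if __name__ == "__main__":
    words, logp = segment("确实现在数据很多")
    print("/".join(words), logp)
```

With these toy numbers, the example sentence from the text is split as 确实/现在/数据/很多. Working in log space avoids underflow when multiplying many small probabilities, and each prefix is scored once, so the search examines at most `MAX_WORD_LEN` candidate words per position instead of enumerating every segmentation.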