update append

Commit 8181d07d authored May 01, 2020 by 曹润柘
--- a/Book/ChapterAppend/ChapterAppend.tex
+++ b/Book/ChapterAppend/ChapterAppend.tex
@@ -8,13 +8,126 @@
 %----------------------------------------------------------------------------------------
 \renewcommand\figurename{图}% rename "figure" to 图
 \renewcommand\tablename{表}% rename "table" to 表
-\chapterimage{fig-NEU-9.jpg} % Chapter heading image
+\chapterimage{fig-NEU-1.jpg} % Chapter heading image
 %------------------------------------------------
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Chapter 1 appendix
 \begin{appendices}
 \chapter{Appendix A}
 \label{appendix-A}
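+\parinterval Data is indispensable when building a machine translation system. This is especially true of today's mainstream neural machine translation systems, whose performance is often limited by the size and quality of the available corpora. Fortunately, with the development of corpus linguistics, corpus resources for a number of major languages are now abundant.
+\parinterval To support further research, we summarize several commonly used benchmark datasets. They are widely used in the machine translation community, and a large body of prior work is available for reproduction and comparison. We have also collected some commonly used parallel corpora for readers to explore.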
+%%%%%%%%%%%%%%%%%%%%%
+\section{Benchmark Datasets}
+%----------------------------------------------
+% Table 1.1-1
+\begin{table}[htp]{
+\footnotesize
+\begin{center}
+\caption{Benchmark datasets}
+\label{tab:Reference-data-set}
+\begin{tabular}{p{1.6cm} | p{1.2cm} p{1.6cm} p{2.6cm} p{3.9cm}}
+{Task} & {Languages} & {Domain} & {Description} & {Dataset URL} \\
+\hline
+\rule{0pt}{15pt}WMT & En, Zh, De, Ru, etc. & News, medical, translation
+ & English-centric multilingual machine translation datasets covering a variety of tasks
+ & \url{http://www.statmt.org/wmt19/} \\
+\rule{0pt}{15pt}IWSLT & En, De, Fr, Cs, Zh, etc. & Spoken language translation
+ & Text translation datasets drawn from TED talks; relatively small in scale
+ & \url{https://wit3.fbk.eu/} \\
+\rule{0pt}{15pt}NIST & Zh-En, etc. & News translation & Evaluation sets with 4 reference translations per sentence; high quality
+ & \url{https://www.ldc.upenn.edu/collaborations/evaluations/nist} \\
+\end{tabular}
+\end{center}
+}\end{table}
+%-------------------------------------------
+%----------------------------------------------
+% Table 1.1-2
+\begin{table}[htp]{
+\footnotesize
+\begin{center}
+\begin{tabular}{p{1.6cm} | p{1.2cm} p{1.6cm} p{2.6cm} p{3.9cm}}
+\rule{0pt}{15pt}{Task} & {Languages} & {Domain} & {Description} & {Dataset URL} \\
+\hline
+\rule{0pt}{15pt}TVsub & Zh-En & Subtitle translation
+ & Extracted from TV series subtitles; intended for research on long-distance context in dialogue
+ & \url{https://github.com/longyuewangdcu/tvsub} \\
+\rule{0pt}{15pt}Flickr30K & En-De & Multimodal translation
+ & 31,783 images, each annotated with 5 sentences
+ & \url{http://shannon.cs.illinois.edu/DenotationGraph/} \\
+\rule{0pt}{15pt}Multi30K & En-De, En-Fr & Multimodal translation
+ & 31,014 images, each annotated with 5 sentences
+ & \url{http://www.statmt.org/wmt16/multimodal-task.html} \\
+\rule{0pt}{15pt}IAPRTC-12 & En-De & Multimodal translation & 20,000 images with corresponding annotations
+ & \url{https://www.imageclef.org/photodata} \\
+\rule{0pt}{15pt}IKEA & En-De, En-Fr & Multimodal translation & 3,600 images with corresponding annotations
+ & \url{https://github.com/sampalomad/IKEA-Dataset.git} \\
+\end{tabular}
+\end{center}
+}\end{table}
+%-------------------------------------------
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
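+\section{Parallel Corpora}
+\parinterval Training a neural machine translation system requires a large amount of bilingual data. Here we list some publicly available parallel corpora for readers to obtain.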
+\vspace{0.5em}
+\begin{itemize}
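+\item News Commentary Corpus: bilingual data covering 12 languages, including Chinese and English, and 64 language pairs, crawled from political and economic commentary on the Project Syndicate website. URL: \url{http://www.casmacat.eu/corpus/news-commentary.html}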
+\vspace{0.5em}
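+\item CWMT Corpus: Chinese-English parallel data collected and shared by the China Workshop on Machine Translation (CWMT) community, covering a variety of domains such as news, movie subtitles, novels, and government documents. URL: \url{http://nlp.nju.edu.cn/cwmt-wmt/}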
+\vspace{0.5em}
+\item Common Crawl Corpus: bilingual data from four languages (Czech, German, Russian, and French) into English, crawled from web pages. URL: \url{http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz}
+\vspace{0.5em}
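+\item Europarl Corpus: bilingual data from 20 European languages, such as Bulgarian and Czech, into English, drawn from the proceedings of the European Parliament. URL: \url{http://www.statmt.org/europarl/}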
+\vspace{0.5em}
+\item ParaCrawl Corpus: bilingual data from 23 European languages into English, obtained by web crawling. URL: \url{https://www.paracrawl.eu/index.php}
+\vspace{0.5em}
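+\item United Nations Parallel Corpus: bilingual data for 30 language pairs among the six official languages of the United Nations (Arabic, English, Spanish, French, Russian, and Chinese), drawn from official records and other meeting documents of the United Nations that are in the public domain. URL: \url{https://conferences.unite.un.org/UNCorpus/}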
+\vspace{0.5em}
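+\item TED Corpus: the TED conference publishes on its website the subtitles of talks given since 2007, together with translations into more than 100 languages. WIT3 collects and organizes this data for the convenience of researchers, and also provides the evaluation datasets for the annual IWSLT evaluation campaign. URL: \url{https://wit3.fbk.eu/}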
+\vspace{0.5em}
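+\item OpenSubtitles: collected by P. Lison and J. Tiedemann from the opensubtitles movie subtitle website; contains parallel data for 62 languages and 1,782 language pairs, making it a comparatively rich resource. URL: \url{http://opus.nlpl.eu/OpenSubtitles2018.php}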
+\vspace{0.5em}
+\item Wikititles Corpus: bilingual data covering 14 languages, including Gujarati, and 11 language pairs, drawn from Wikipedia titles. URL: \url{http://data.statmt.org/wikititles/v1/}
+\vspace{0.5em}
+\item CzEng: a Czech-English parallel corpus drawn from the domains of European legislation, information technology, and fiction. URL: \url{http://ufal.mff.cuni.cz/czeng/czeng17}
+\vspace{0.5em}
+\item Yandex Corpus: a Russian-English parallel corpus crawled from web pages. URL: \url{https://translate.yandex.ru/corpus}
+\vspace{0.5em}
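+\item Tilde MODEL Corpus: open multilingual data for European languages, comprising several datasets collected from portal sites in domains such as economics, news, government, and tourism. URL: \url{https://tilde-model.s3-eu-west-1.amazonaws.com/Tilde_MODEL_Corpus.html}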
+\vspace{0.5em}
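+\item Setimes Corpus: bilingual data covering 9 Balkan languages, including Croatian and Albanian, and 72 language pairs, drawn from news reports of the Southeast European Times. URL: \url{http://www.statmt.org/setimes/}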
+\vspace{0.5em}
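+\item TVsub: a Chinese-English dialogue corpus collected from TV series subtitles, containing more than 2 million sentence pairs; it can be used for research on the dialogue domain and on long-distance context. URL: \url{https://github.com/longyuewangdcu/tvsub}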
+\vspace{0.5em}
+\item Recipe Corpus: a Japanese-English recipe corpus created by Cookpad, containing more than 100,000 sentence pairs. URL: \url{http://lotus.kuee.kyoto-u.ac.jp/WAT/recipe-corpus/}
+\end{itemize}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
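+\section{Related Tools}
+\subsection{Data Preprocessing Tools}
+\parinterval Data processing is an important step in building a neural machine translation system. Here we list some open-source tools for readers to use.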
+\vspace{0.5em}
+\begin{itemize}
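+\item Moses: Moses provides many data preprocessing scripts and tools that are widely used by machine translation researchers, including punctuation normalization, tokenization, case conversion, and length filtering. URL: \url{https://github.com/moses-smt/mosesdecoder/tree/master/scripts}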
+\vspace{0.5em}
+\item Jieba: a commonly used Chinese word segmentation tool. URL: \url{https://github.com/fxsjy/jieba}
+\vspace{0.5em}
+\item Subword-nmt: a subword segmentation tool based on the BPE algorithm. URL: \url{https://github.com/rsennrich/subword-nmt}
+\end{itemize}
+\subsection{Evaluation Tools}
+\parinterval Many automatic evaluation metrics exist in machine translation, including BLEU, TER, and METEOR. Here we list some tools that implement these metrics.
+\vspace{0.5em}
+\begin{itemize}
+\item Moses: includes a general-purpose BLEU scoring script. URL: \url{https://github.com/moses-smt/mosesdecoder/tree/master/scripts/generic}
+\vspace{0.5em}
+\item Tercom: a tool for computing the automatic evaluation metric TER; only a Java version is available. URL: \url{http://www.cs.umd.edu/~snover/tercom/}
+\vspace{0.5em}
+\item Meteor: an implementation of the automatic evaluation metric METEOR. URL: \url{https://www.cs.cmu.edu/~alavie/METEOR/}
+\end{itemize}
 \end{appendices}
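
Note on the data preprocessing tools entry in ChapterAppend.tex: the length-filtering step that the entry attributes to the Moses scripts can be illustrated with a minimal Python sketch. This is not the Moses implementation. The function name `filter_by_length` and the thresholds `min_len`, `max_len`, and `max_ratio` are assumptions chosen for illustration, and sentences are assumed to be already tokenized with whitespace between tokens.

```python
from typing import Iterable, Iterator, Tuple


def filter_by_length(
    pairs: Iterable[Tuple[str, str]],
    min_len: int = 1,      # assumed lower bound on tokens per side
    max_len: int = 80,     # assumed upper bound on tokens per side
    max_ratio: float = 9.0,  # assumed cap on longer/shorter token ratio
) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs whose lengths look plausible.

    A pair is dropped when either side is shorter than min_len,
    longer than max_len, or when one side is more than max_ratio
    times as long as the other.
    """
    for src, tgt in pairs:
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if not (min_len <= n_src <= max_len):
            continue
        if not (min_len <= n_tgt <= max_len):
            continue
        shorter, longer = sorted((n_src, n_tgt))
        # Guard against empty sides before dividing.
        if shorter == 0 or longer / shorter > max_ratio:
            continue
        yield src, tgt
```

Filtering of this kind is usually applied after tokenization, because the token counts on which the thresholds operate depend on how the text was segmented.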

--- a/Book/mt-book-xelatex.idx
+++ b/Book/mt-book-xelatex.idx
@@ -8,15 +8,15 @@
 \indexentry{Chapter4.2.2.2|hyperpage}{16}
 \indexentry{Chapter4.2.2.3|hyperpage}{17}
 \indexentry{Chapter4.2.3|hyperpage}{18}
-\indexentry{Chapter4.2.3.1|hyperpage}{18}
+\indexentry{Chapter4.2.3.1|hyperpage}{19}
-\indexentry{Chapter4.2.3.2|hyperpage}{19}
+\indexentry{Chapter4.2.3.2|hyperpage}{20}
-\indexentry{Chapter4.2.3.3|hyperpage}{20}
+\indexentry{Chapter4.2.3.3|hyperpage}{21}
 \indexentry{Chapter4.2.4|hyperpage}{22}
 \indexentry{Chapter4.2.4.1|hyperpage}{22}
 \indexentry{Chapter4.2.4.2|hyperpage}{23}
-\indexentry{Chapter4.2.4.3|hyperpage}{24}
+\indexentry{Chapter4.2.4.3|hyperpage}{25}
 \indexentry{Chapter4.2.5|hyperpage}{25}
-\indexentry{Chapter4.2.6|hyperpage}{25}
+\indexentry{Chapter4.2.6|hyperpage}{26}
 \indexentry{Chapter4.2.7|hyperpage}{29}
 \indexentry{Chapter4.2.7.1|hyperpage}{30}
 \indexentry{Chapter4.2.7.2|hyperpage}{30}
@@ -24,32 +24,32 @@
 \indexentry{Chapter4.2.7.4|hyperpage}{32}
 \indexentry{Chapter4.3|hyperpage}{33}
 \indexentry{Chapter4.3.1|hyperpage}{36}
-\indexentry{Chapter4.3.1.1|hyperpage}{36}
+\indexentry{Chapter4.3.1.1|hyperpage}{37}
-\indexentry{Chapter4.3.1.2|hyperpage}{37}
+\indexentry{Chapter4.3.1.2|hyperpage}{38}
-\indexentry{Chapter4.3.1.3|hyperpage}{38}
+\indexentry{Chapter4.3.1.3|hyperpage}{39}
-\indexentry{Chapter4.3.1.4|hyperpage}{39}
+\indexentry{Chapter4.3.1.4|hyperpage}{40}
-\indexentry{Chapter4.3.2|hyperpage}{39}
+\indexentry{Chapter4.3.2|hyperpage}{40}
 \indexentry{Chapter4.3.3|hyperpage}{41}
 \indexentry{Chapter4.3.4|hyperpage}{42}
-\indexentry{Chapter4.3.5|hyperpage}{45}
+\indexentry{Chapter4.3.5|hyperpage}{46}
-\indexentry{Chapter4.4|hyperpage}{48}
+\indexentry{Chapter4.4|hyperpage}{49}
-\indexentry{Chapter4.4.1|hyperpage}{49}
+\indexentry{Chapter4.4.1|hyperpage}{51}
-\indexentry{Chapter4.4.2|hyperpage}{52}
+\indexentry{Chapter4.4.2|hyperpage}{51}
 \indexentry{Chapter4.4.2.1|hyperpage}{53}
-\indexentry{Chapter4.4.2.2|hyperpage}{54}
+\indexentry{Chapter4.4.2.2|hyperpage}{55}
-\indexentry{Chapter4.4.2.3|hyperpage}{56}
+\indexentry{Chapter4.4.2.3|hyperpage}{57}
-\indexentry{Chapter4.4.3|hyperpage}{57}
+\indexentry{Chapter4.4.3|hyperpage}{58}
-\indexentry{Chapter4.4.3.1|hyperpage}{58}
+\indexentry{Chapter4.4.3.1|hyperpage}{59}
 \indexentry{Chapter4.4.3.2|hyperpage}{62}
-\indexentry{Chapter4.4.3.3|hyperpage}{62}
+\indexentry{Chapter4.4.3.3|hyperpage}{63}
-\indexentry{Chapter4.4.3.4|hyperpage}{63}
+\indexentry{Chapter4.4.3.4|hyperpage}{64}
-\indexentry{Chapter4.4.3.5|hyperpage}{64}
+\indexentry{Chapter4.4.3.5|hyperpage}{65}
-\indexentry{Chapter4.4.4|hyperpage}{65}
+\indexentry{Chapter4.4.4|hyperpage}{66}
-\indexentry{Chapter4.4.4.1|hyperpage}{66}
+\indexentry{Chapter4.4.4.1|hyperpage}{67}
 \indexentry{Chapter4.4.4.2|hyperpage}{67}
-\indexentry{Chapter4.4.5|hyperpage}{69}
+\indexentry{Chapter4.4.5|hyperpage}{68}
-\indexentry{Chapter4.4.5|hyperpage}{70}
+\indexentry{Chapter4.4.5|hyperpage}{71}
-\indexentry{Chapter4.4.7|hyperpage}{72}
+\indexentry{Chapter4.4.7|hyperpage}{73}
-\indexentry{Chapter4.4.7.1|hyperpage}{73}
+\indexentry{Chapter4.4.7.1|hyperpage}{74}
-\indexentry{Chapter4.4.7.2|hyperpage}{74}
+\indexentry{Chapter4.4.7.2|hyperpage}{76}
-\indexentry{Chapter4.5|hyperpage}{76}
+\indexentry{Chapter4.5|hyperpage}{77}
--- a/Book/mt-book-xelatex.ptc
+++ b/Book/mt-book-xelatex.ptc
--- a/Book/mt-book-xelatex.tex
+++ b/Book/mt-book-xelatex.tex
@@ -110,14 +110,14 @@
 %	CHAPTERS
 %----------------------------------------------------------------------------------------
-\include{Chapter1/chapter1}
+%\include{Chapter1/chapter1}
-\include{Chapter2/chapter2}
+%\include{Chapter2/chapter2}
-\include{Chapter3/chapter3}
+%\include{Chapter3/chapter3}
 \include{Chapter4/chapter4}
-\include{Chapter5/chapter5}
+%\include{Chapter5/chapter5}
-\include{Chapter6/chapter6}
+%\include{Chapter6/chapter6}
 %\include{Chapter7/chapter7}
-\include{ChapterAppend/chapterappend}
+%\include{ChapterAppend/chapterappend}
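
Note on the evaluation tools entry in ChapterAppend.tex: for readers unfamiliar with BLEU, the following Python sketch computes a corpus-level score from the standard definition (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty). It is an illustration only, not the Moses scoring script. It applies no smoothing and assumes pre-tokenized input, so its scores need not match those of any particular tool. The names `ngram_counts` and `corpus_bleu` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(
    hypotheses: List[List[str]],
    references: List[List[List[str]]],  # one list of references per hypothesis
    max_n: int = 4,
) -> float:
    """Unsmoothed corpus-level BLEU in the range [0, 1]."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Use the reference whose length is closest to the hypothesis.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            max_ref_counts: Counter = Counter()
            for r in refs:
                max_ref_counts |= ngram_counts(r, n)  # element-wise maximum
            clipped = hyp_counts & max_ref_counts     # element-wise minimum
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any zero precision makes the score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

Accepting several references per hypothesis matches evaluation sets such as the NIST data listed in the benchmark table, which provide multiple reference translations per sentence.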