Commit 3eb18c4e by 曹润柘

update 17

parent 12cfc171
...@@ -10,7 +10,7 @@
\node(process_2)[process,fill=blue!20,right of = process_1,xshift=7.0cm,text width=4cm,align=center]{\baselineskip=4pt\LARGE{[[0.2,...,0.3], \qquad ..., \qquad [0.3,...,0.5]]}\par};
\node(text_2)[below of = process_2,yshift=-2cm,scale=1.5]{Speech features};
\node(process_3)[process,fill=orange!20,minimum width=6cm,minimum height=5cm,right of = process_2,xshift=8.2cm,text width=4cm,align=center]{};
\node(text_3)[below of = process_3,yshift=-3cm,scale=1.5]{Source-language text and its word lattice};
\node(cir_s)[cir,very thick, below of = process_3,xshift=-2.2cm,yshift=1.1cm]{\LARGE S};
\node(cir_a)[cir,right of = cir_s,xshift=1cm,yshift=0.8cm]{\LARGE a};
\node(cir_c)[cir,right of = cir_a,xshift=1.2cm,yshift=0cm]{\LARGE c};
......
...@@ -25,7 +25,7 @@
\parinterval Context-based translation is an important branch of machine translation. In traditional approaches, machine translation is defined as the problem of translating a single sentence. In reality, however, sentences rarely occur in isolation. For example, people express themselves through speech or convey information through images, and such speech and image content can appear alongside text in translation scenarios. Moreover, sentences usually live in paragraphs or documents, and understanding a sentence may require information from the whole paragraph or document. All of this contextual information can be exploited by machine translation.
\parinterval Building on sentence-level translation, this chapter extends the problem to translation in a larger context, covering three topics: speech translation, image translation, and document-level translation. All of them correspond to real needs in machine translation applications, and the use of multi-modal and other contextual information is also one of the current hot topics in natural language processing.
%----------------------------------------------------------------------------------------
% NEW SECTION
...@@ -33,9 +33,9 @@
\section{Machine Translation Needs More Context}
\parinterval For a long time, machine translation has meant sentence-level translation. The main reason is that modeling translation at the sentence level greatly simplifies the problem and makes machine translation methods easier to implement and validate. However, humans do not use language one isolated sentence at a time. An analogy can be drawn with how humans acquire language: as children grow up, they receive visual, auditory, tactile, and other signals, and the joint effect of these signals shapes their ``understanding'' of the world and drives them to express themselves through ``language''. From this perspective, language ability is not the product of a single factor; it arises from the interaction of many kinds of information. For example, when we translate a sentence, we draw on the scene we see, the intonation we hear, and even the sentences that were said before.
\parinterval In a broad sense, any information beyond the current sentence can be regarded as context. For example, in Figure \ref{fig:17-1}, the English sentence ``A girl jumps off a bank .'' needs to be translated into Chinese. The word ``bank'' has several meanings, so using only the information in the English sentence itself may yield the translation ``银行'' (a financial institution) rather than the correct ``河床'' (a river bank). However, Figure \ref{fig:17-1} also provides the image associated with this sentence, which clearly shows a river bank, so ``bank'' is no longer ambiguous. The task of translating from images and text together is usually called {\small\bfnew{multi-modal machine translation}}\index{多模态机器翻译}(Multi-Modal Machine Translation)\index{Multi-Modal Machine Translation}.
%----------------------------------------------
\begin{figure}[htp]
...@@ -62,7 +62,7 @@
\subsection{Audio Processing}
\parinterval For completeness, this section briefly reviews the basics of speech processing. Unlike text, audio is essentially a {\small\bfnew{waveform}}(Waveform)\index{Waveform} obtained after a series of signal-processing steps. Specifically, sound is a vibration of the air and can therefore be captured as an analog signal. The analog signal is continuous and is converted into a discrete digital signal by sampling, which records the amplitude of the sound at fixed time intervals. The sampling rate is the number of samples per second, measured in hertz (Hz); the higher the sampling rate, the smaller the loss, and in general it should be high enough that the original speech can be reconstructed from the discretized digital signal. Everyday devices such as mobile phones and computers typically use a sampling rate of 16kHz, i.e., 16,000 samples per second, while audio CDs use 44.1kHz. The samples are then quantized and stored as integers to save space, usually with 16-bit quantization. Multiplying the sampling rate by the number of quantization bits gives the {\small\bfnew{bit rate}}\index{比特率}(Bits Per Second,BPS)\index{Bits Per Second}, i.e., the number of bits the audio occupies per second. For example, audio sampled at 16kHz with 16-bit quantization has a bit rate of 256kb/s. The overall audio processing pipeline is shown in Figure \ref{fig:17-2}\upcite{洪青阳2020语音识别原理与应用,陈果果2020语音识别实战}.
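\parinterval The relation between sampling, quantization, and bit rate can be made concrete with a small sketch (a minimal example using NumPy; the sine-wave ``signal'' and all parameter values are illustrative assumptions, not taken from the text above):
\begin{verbatim}
import numpy as np

SAMPLE_RATE = 16000   # 16 kHz, typical for speech
BIT_DEPTH = 16        # 16-bit quantization

# Bit rate = sampling rate x quantization bits = 256 kb/s here.
bit_rate = SAMPLE_RATE * BIT_DEPTH
print(f"bit rate: {bit_rate / 1000:.0f} kb/s")

# "Sample" one second of a continuous 440 Hz tone at 16 kHz, then quantize
# the amplitudes to 16-bit signed integers, as done when storing raw audio.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
analog = np.sin(2 * np.pi * 440.0 * t)           # stand-in for the analog signal
samples = np.round(analog * 32767).astype(np.int16)
print(samples.dtype, samples.shape)              # int16 (16000,)
\end{verbatim}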
%----------------------------------------------------------------------------------------------------
\begin{figure}[htp]
...@@ -85,7 +85,7 @@
\end{figure}
%----------------------------------------------------------------------------------------------------
\parinterval After the preprocessing steps above, the audio is represented as a sequence of frames, from which different types of acoustic features can be extracted. Commonly used acoustic features include {\small\bfnew{Mel-frequency cepstral coefficients}}\index{Mel频率倒谱系数}(Mel-Frequency Cepstral Coefficient,MFCC)\index{Mel-Frequency Cepstral Coefficient}, {\small\bfnew{perceptual linear prediction}}\index{感知线性预测系数}(Perceptual Linear Predictive,PLP)\index{Perceptual Linear Predictive} and {\small\bfnew{filter banks}}\index{滤波器组}(Filter-bank,Fbank)\index{Filter-bank}. MFCC, PLP, and Fbank features all require applying a {\small\bfnew{short-time Fourier transform}}\index{短时傅里叶变换}(Short-time Fourier Transform,STFT)\index{Short-time Fourier Transform} to the preprocessed audio to obtain a regular, linear frequency resolution, after which feature-specific operations produce the final features. Different acoustic features have different strengths: MFCC features are well decorrelated, PLP features are robust to noise, and Fbank features preserve more of the original speech information. In speech translation, Fbank and MFCC are the most commonly used acoustic features\upcite{洪青阳2020语音识别原理与应用}.
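\parinterval The following sketch illustrates the framing and STFT steps with plain NumPy (the 25 ms window with a 10 ms shift, i.e., about 100 frames per second, is a common but assumed configuration; applying a Mel filter bank to the resulting power spectrum would give Fbank features, and a further DCT step would give MFCCs):
\begin{verbatim}
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    # Split the waveform into overlapping, windowed frames.
    frame_len = int(sample_rate * frame_ms / 1000)    # 400 samples
    shift = int(sample_rate * shift_ms / 1000)        # 160 samples
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return samples[idx] * np.hamming(frame_len)

def log_power_spectrum(frames, n_fft=512):
    # Short-time Fourier transform of each frame, then log power.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(power + 1e-10)

waveform = np.random.randn(16000)          # stand-in for 1 second of audio
feats = log_power_spectrum(frame_signal(waveform))
print(feats.shape)                         # (98, 257): roughly 100 frames/second
\end{verbatim}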
\parinterval The extracted acoustic features play a role analogous to pixel features in computer vision or word embeddings in natural language processing. The difference is that acoustic features are more complex and variable and may contain more noise and redundancy. In addition, the feature sequence extracted from audio is more than ten times longer than the corresponding text: in normal speech a person utters about 2-3 characters per second, while one second of speech yields about 100 feature frames. This large length ratio also makes acoustic feature modeling challenging.
...@@ -147,7 +147,7 @@
\parinterval As can be seen, a word lattice stores multiple search paths, and each path preserves the timing information of the input sequence as well as the decoding process. Translating from the lattice allows the translation model to mitigate errors introduced by the speech recognition model\upcite{DBLP:conf/acl/ZhangGCF19,DBLP:conf/acl/SperberNPW19}. However, end-to-end speech recognition models usually decode with beam search, and the length of the decoded sequence does not match that of the input, so the timing information available in traditional acoustic-model decoding is lost. As a result, lattice-based methods are mainly used with traditional speech recognition systems.
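\parinterval A word lattice can be viewed as a directed acyclic graph whose edges carry words together with their time spans and scores; the sketch below builds a tiny hypothetical lattice and extracts its best path by dynamic programming (all words, times, and scores are invented for illustration):
\begin{verbatim}
from collections import defaultdict

# Edges: (from_state, to_state, word, start_ms, end_ms, log_prob).
edges = [
    (0, 1, "we",  0,   220, -0.4), (0, 1, "v",   0,   220, -1.6),
    (1, 2, "can", 220, 480, -0.3), (1, 2, "cat", 220, 480, -2.0),
    (2, 3, "try", 480, 900, -0.5),
]

def best_path(edges, start=0, end=3):
    # 1-best path; states are assumed to be numbered in topological order.
    best = defaultdict(lambda: (float("-inf"), []))
    best[start] = (0.0, [])
    for u, v, word, s, e, lp in sorted(edges):
        score, words = best[u]
        if score + lp > best[v][0]:
            best[v] = (score + lp, words + [(word, s, e)])
    return best[end]

print(best_path(edges))
# (-1.2, [('we', 0, 220), ('can', 220, 480), ('try', 480, 900)])
\end{verbatim}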
\parinterval To reduce the impact of error propagation, one idea is to use a post-processing model that corrects errors in the recognition result before it is passed to the text translation model. The text can additionally be processed with {\small\bfnew{disfluency detection}}\index{顺滑}(Disfluency Detection\index{Disfluency Detection}) so that the input to the translation system is cleaner and more fluent, for example by removing filler words that mark pauses. This practice is widely used in industry, but because the models can only run one after another, it adds extra computation and latency. Another idea is to train a more robust text translation model that can cope with noise and errors in its input\upcite{DBLP:conf/acl/LiuTMCZ18}.
%----------------------------------------------------------------------------------------
% NEW SUB-SECTION
...@@ -249,7 +249,7 @@
\end{figure}
%----------------------------------------------------------------------------------------------------
\parinterval Another way to apply multi-task learning is to use two decoders that predict, respectively, the source-language sentence and the target-language sentence corresponding to the speech; Figure \ref{fig:17-10} shows three such architectures\upcite{DBLP:conf/naacl/AnastasopoulosC18,DBLP:conf/asru/BaharBN19}. Figure \ref{fig:17-10}(a) uses a single encoder with two decoders: based on the encoder representation, one decoder predicts the source-language sentence and the other the target-language sentence, so that the encoder is trained more thoroughly. The benefit is that the source-language text generation task assists the translation process, effectively providing an additional ``modality'' for the source-language speech. Figure \ref{fig:17-10}(b) instead cascades two decoders: the first decoder generates the source-language sentence, and the second decoder then generates the target-language sentence from the first decoder's representations. By adding an intermediate output, this method makes the model easier to train, but it also incurs extra decoding time because the two decoders must generate sequentially. The model in Figure \ref{fig:17-10}(c) goes one step further: its second decoder attends jointly to the encoder and the first decoder's representations, making fuller use of the available information.
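\parinterval The single-encoder, dual-decoder idea of Figure \ref{fig:17-10}(a) can be sketched as follows (a schematic PyTorch example; the module sizes, names, and plain Transformer building blocks are assumptions for illustration rather than the configurations used in the cited work, and causal decoder masks are omitted for brevity):
\begin{verbatim}
import torch
import torch.nn as nn

class DualDecoderST(nn.Module):
    # One speech encoder, two decoders: one produces the source-language
    # transcript, the other the target-language translation.
    def __init__(self, feat_dim=80, d_model=256, src_vocab=8000, tgt_vocab=8000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.asr_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 3)
        self.mt_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), 3)
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        self.src_out = nn.Linear(d_model, src_vocab)
        self.tgt_out = nn.Linear(d_model, tgt_vocab)

    def forward(self, speech_feats, src_tokens, tgt_tokens):
        memory = self.encoder(self.proj(speech_feats))   # shared speech encoder
        asr = self.src_out(self.asr_dec(self.src_embed(src_tokens), memory))
        mt = self.tgt_out(self.mt_dec(self.tgt_embed(tgt_tokens), memory))
        return asr, mt   # training sums the two cross-entropy losses

model = DualDecoderST()
asr_logits, mt_logits = model(torch.randn(2, 50, 80),
                              torch.randint(0, 8000, (2, 12)),
                              torch.randint(0, 8000, (2, 14)))
\end{verbatim}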
%----------------------------------------------------------------------------------------------------
\begin{figure}[htp]
\centering
...@@ -347,7 +347,7 @@
\end{figure}
%----------------------------------------------------------------------------------------------------
\parinterval How, then, does multi-modal machine translation compute the context vector? The description here follows Chapter 10. Assume the encoder outputs a state sequence $\{\mathbi{h}_1,...,\mathbi{h}_m\}$; note that this is not the state sequence of a source-language sentence but the state sequence of the image, extracted by convolution and related operations. Suppose the image feature has dimensions $16 \times 16 \times 512$, where the first two dimensions are the height and width of the image; it is then reshaped into a $256 \times 512$ state sequence, where $512$ is the dimension of each state. For a target-language position $j$, the context vector $\mathbi{C}_{j}$ is defined as a weighted sum of the encoder outputs:
\begin{eqnarray}
\mathbi{C}_{j}&=& \sum_{i}{{\alpha}_{i,j}{\mathbi{h}}_{i}}
\end{eqnarray}
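\parinterval The computation above can be illustrated with a few lines of NumPy: the $16 \times 16 \times 512$ feature map is reshaped into $256$ states of dimension $512$, attention weights ${\alpha}_{i,j}$ are obtained by a softmax over attention scores, and $\mathbi{C}_{j}$ is their weighted sum (the dot-product score against a decoder state is only an illustrative choice, and all tensors here are random stand-ins):
\begin{verbatim}
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

feature_map = np.random.randn(16, 16, 512)   # CNN features of the image
h = feature_map.reshape(256, 512)            # state sequence h_1, ..., h_256

s_j = np.random.randn(512)                   # decoder state at target position j
scores = h @ s_j                             # one attention score per image state
alpha = softmax(scores)                      # attention weights alpha_{i,j}
C_j = alpha @ h                              # context vector, shape (512,)
print(C_j.shape)
\end{verbatim}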
...@@ -412,7 +412,7 @@
\parinterval For the encoder-decoder framework to be fully effective in image captioning, the encoder must also represent the image well. Improvements to the encoder usually add semantic information\upcite{DBLP:conf/cvpr/YouJWFL16,DBLP:conf/cvpr/ChenZXNSLC17,DBLP:journals/pami/FuJCSZ17} or positional information\upcite{DBLP:conf/cvpr/ChenZXNSLC17,DBLP:conf/ijcai/LiuSWWY17} about the image.
\parinterval The semantic information of an image generally refers to the entities, attributes, and scenes it contains. As shown in Figure \ref{fig:17-17}, an attribute or entity detector extracts attribute and entity words such as ``girl'', ``river'' and ``bank'' from the image; these words are encoded as part of the image's semantic information, and an attention mechanism then computes attention weights between target-language words and these attribute or entity words\upcite{DBLP:conf/cvpr/YouJWFL16}. Besides entities and attributes, scene information of the image can also be fed into the encoder\upcite{DBLP:journals/pami/FuJCSZ17}. Detecting attributes, entities, and scenes is the subject of object detection work such as Faster-RCNN\upcite{DBLP:journals/pami/RenHG017} and YOLO\upcite{DBLP:journals/corr/abs-1804-02767,DBLP:journals/corr/abs-2004-10934}, which is not covered in detail here.
%----------------------------------------------------------------------------------------------------
\begin{figure}[htp]
......
...@@ -5090,7 +5090,7 @@ author = {Yoshua Bengio and
Tobias Weyand and
Marco Andreetto and
Hartwig Adam},
publisher = {CoRR},
year={2017},
}
@inproceedings{sifre2014rigid,
...@@ -11622,28 +11622,28 @@ author = {Zhuang Liu and
publisher = {International Conference on Machine Learning},
year = {2018}
}
@inproceedings{zhao2020dual,
title={Dual Learning: Theoretical Study and an Algorithmic Extension},
author={Zhao, Zhibing and Xia, Yingce and Qin, Tao and Xia, Lirong and Liu, Tie-Yan},
publisher = {arXiv preprint arXiv:2005.08238},
year={2020}
}
%%%%% chapter 16------------------------------------------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%% chapter 17------------------------------------------------------
@inproceedings{DBLP:journals/ac/Bar-Hillel60,
author = {Yehoshua Bar-Hillel},
title = {The Present Status of Automatic Translation of Languages},
publisher = {Advances in computers},
volume = {1},
pages = {91--163},
year = {1960}
}
@inproceedings{DBLP:journals/corr/abs-1901-09115,
author = {Andrei Popescu-Belis},
title = {Context in Neural Machine Translation: {A} Review of Models and Evaluations},
publisher = {CoRR},
volume = {abs/1901.09115},
year = {2019}
}
...@@ -11698,7 +11698,7 @@ author = {Zhuang Liu and
@inproceedings{tiedemann2010context,
title={Context adaptation in statistical machine translation using models with exponentially decaying cache},
author={Tiedemann, J{\"o}rg},
publisher={Annual Meeting of the Association for Computational Linguistics},
pages={8--15},
year={2010}
}
...@@ -11747,7 +11747,7 @@ author = {Zhuang Liu and
title = {Using Sense-labeled Discourse Connectives for Statistical Machine
Translation},
pages = {129--138},
publisher = {Annual Conference of the European Association for Machine Translation},
year = {2012}
}
@inproceedings{DBLP:conf/emnlp/LaubliS018,
...@@ -11760,12 +11760,12 @@ author = {Zhuang Liu and
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-1912-08494,
author = {Sameen Maruf and
Fahimeh Saleh and
Gholamreza Haffari},
title = {A Survey on Document-level Machine Translation: Methods and Evaluation},
publisher = {CoRR},
volume = {abs/1912.08494},
year = {2019}
}
...@@ -11777,21 +11777,20 @@ author = {Zhuang Liu and
publisher = {Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1910-07481,
author = {Valentin Mac{\'{e}} and
Christophe Servan},
title = {Using Whole Document Context in Neural Machine Translation},
publisher = {The International Workshop on Spoken Language Translation},
volume = {abs/1910.07481},
year = {2019}
}
@inproceedings{DBLP:journals/corr/JeanLFC17,
author = {S{\'{e}}bastien Jean and
Stanislas Lauly and
Orhan Firat and
Kyunghyun Cho},
title = {Does Neural Machine Translation Benefit from Larger Context?},
publisher = {CoRR},
volume = {abs/1704.05135},
year = {2017}
}
...@@ -11823,12 +11822,12 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2019}
}
@inproceedings{DBLP:journals/corr/abs-2010-12827,
author = {Amane Sugiyama and
Naoki Yoshinaga},
title = {Context-aware Decoder for Neural Machine Translation using a Target-side
Document-Level Language Model},
publisher = {CoRR},
volume = {abs/2010.12827},
year = {2020}
}
...@@ -11970,14 +11969,13 @@ author = {Zhuang Liu and
publisher = {International Joint Conference on Artificial Intelligence},
year = {2020}
}
@inproceedings{DBLP:journals/tacl/TuLSZ18,
author = {Zhaopeng Tu and
Yang Liu and
Shuming Shi and
Tong Zhang},
title = {Learning to Remember Translation History with a Continuous Cache},
publisher = {Transactions of the Association for Computational Linguistics},
volume = {6},
pages = {407--420},
year = {2018}
}
...@@ -12071,7 +12069,7 @@ author = {Zhuang Liu and
publisher = {{AAAI} Press},
year = {2019}
}
@inproceedings{DBLP:journals/tacl/YuSSLKBD20,
author = {Lei Yu and
Laurent Sartran and
Wojciech Stokowiec and
...@@ -12080,16 +12078,16 @@ author = {Zhuang Liu and
Phil Blunsom and
Chris Dyer},
title = {Better Document-Level Machine Translation with Bayes' Rule},
publisher = {Transactions of the Association for Computational Linguistics},
volume = {8},
pages = {346--360},
year = {2020}
}
@inproceedings{DBLP:journals/corr/abs-1903-04715,
author = {S{\'{e}}bastien Jean and
Kyunghyun Cho},
title = {Context-Aware Learning for Neural Machine Translation},
publisher = {CoRR},
volume = {abs/1903.04715},
year = {2019}
}
...@@ -12111,7 +12109,7 @@ author = {Zhuang Liu and
publisher = {Annual Conference of the European Association for Machine Translation},
year = {2019}
}
@inproceedings{DBLP:journals/corr/abs-1911-03110,
author = {Liangyou Li and
Xin Jiang and
Qun Liu},
...@@ -12149,33 +12147,28 @@ author = {Zhuang Liu and
publisher = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
year = {2012}
}
@inproceedings{DBLP:journals/ftsig/GalesY07,
author = {Mark J. F. Gales and
Steve J. Young},
title = {The Application of Hidden Markov Models in Speech Recognition},
publisher = {Found Trends Signal Process},
volume = {1},
number = {3},
pages = {195--304},
year = {2007}
}
@inproceedings{DBLP:journals/taslp/MohamedDH12,
author = {Abdel-rahman Mohamed and
George E. Dahl and
Geoffrey E. Hinton},
title = {Acoustic Modeling Using Deep Belief Networks},
publisher = {IEEE Transactions on Speech and Audio Processing},
volume = {20},
number = {1},
pages = {14--22},
year = {2012}
}
@inproceedings{DBLP:journals/spm/X12a,
author = {G Hinton and L Deng and D Yu and GE Dahl and B Kingsbury},
title = {Deep Neural Networks for Acoustic Modeling in Speech Recognition:
The Shared Views of Four Research Groups},
publisher = {IEEE Signal Processing Magazine},
volume = {29},
number = {6},
pages = {82--97},
year = {2012}
}
...@@ -12232,14 +12225,14 @@ author = {Zhuang Liu and
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{DBLP:journals/corr/BerardPSB16,
author = {Alexandre Berard and
Olivier Pietquin and
Christophe Servan and
Laurent Besacier},
title = {Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text
Translation},
publisher = {CoRR},
volume = {abs/1612.01744},
year = {2016}
}
...@@ -12277,16 +12270,14 @@ author = {Zhuang Liu and
publisher = {International Conference on Machine Learning},
year = {2006}
}
@inproceedings{DBLP:journals/jstsp/WatanabeHKHH17,
author = {Shinji Watanabe and
Takaaki Hori and
Suyoun Kim and
John R. Hershey and
Tomoki Hayashi},
title = {Hybrid CTC/Attention Architecture for End-to-End Speech Recognition},
publisher = {IEEE Journal of Selected Topics in Signal Processing},
volume = {11},
number = {8},
pages = {1240--1253},
year = {2017}
}
...@@ -12300,15 +12291,13 @@ author = {Zhuang Liu and
publisher = {IEEE Transactions on Acoustics, Speech, and Signal Processing},
year = {2017}
}
@inproceedings{DBLP:journals/pami/ShiBY17,
author = {Baoguang Shi and
Xiang Bai and
Cong Yao},
title = {An End-to-End Trainable Neural Network for Image-Based Sequence Recognition
and Its Application to Scene Text Recognition},
publisher = {{IEEE} Transactions on Pattern Analysis and Machine Intelligence},
volume = {39},
number = {11},
pages = {2298--2304},
year = {2017}
}
...@@ -12396,16 +12385,16 @@ author = {Zhuang Liu and
title = {Effectively pretraining a speech translation decoder with Machine
Translation data},
pages = {8014--8020},
publisher = {Conference on Empirical Methods in Natural Language Processing},
year = {2020}
}
@inproceedings{DBLP:journals/corr/abs-1802-06003,
author = {Takatomo Kano and
Sakriani Sakti and
Satoshi Nakamura},
title = {Structured-based Curriculum Learning for End-to-end English-Japanese
Speech Translation},
publisher = {CoRR},
volume = {abs/1802.06003},
year = {2018}
}
...@@ -12424,7 +12413,6 @@ author = {Zhuang Liu and
author = {Lawrence R. Rabiner and
Biing-Hwang Juang},
title = {Fundamentals of speech recognition},
series = {Prentice Hall signal processing series},
publisher = {Prentice Hall},
year = {1993}
}
...@@ -12561,12 +12549,12 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
@inproceedings{Elliott2015MultilingualID,
title={Multilingual Image Description with Neural Sequence Models},
author={Desmond Elliott and
Stella Frank and
Eva Hasler},
publisher = {arXiv: Computation and Language},
year={2015}
}
@inproceedings{DBLP:conf/wmt/MadhyasthaWS17,
...@@ -12579,13 +12567,12 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/corr/CaglayanBB16,
author = {Ozan Caglayan and
Lo{\"{\i}}c Barrault and
Fethi Bougares},
title = {Multimodal Attention for Neural Machine Translation},
publisher = {CoRR},
volume = {abs/1609.03976},
year = {2016}
}
@inproceedings{DBLP:conf/acl/CalixtoLC17,
...@@ -12597,13 +12584,12 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/corr/DelbrouckD17,
author = {Jean-Benoit Delbrouck and
St{\'{e}}phane Dupont},
title = {Multimodal Compact Bilinear Pooling for Multimodal Neural Machine
Translation},
publisher = {CoRR},
volume = {abs/1703.08084},
year = {2017}
}
@inproceedings{DBLP:conf/acl/LibovickyH17,
...@@ -12614,22 +12600,20 @@ author = {Zhuang Liu and
publisher = {Annual Meeting of the Association for Computational Linguistics},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1712-03449,
author = {Jean-Benoit Delbrouck and
St{\'{e}}phane Dupont},
title = {Modulating and attending the source image during encoding improves
Multimodal Translation},
publisher = {CoRR},
volume = {abs/1712.03449},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1807-11605,
author = {Hasan Sait Arslan and
Mark Fishel and
Gholamreza Anbarjafari},
title = {Doubly Attentive Transformer Machine Translation},
publisher = {CoRR},
volume = {abs/1807.11605},
year = {2018}
}
@inproceedings{DBLP:conf/wmt/HelclLV18,
...@@ -12721,7 +12705,6 @@ author = {Zhuang Liu and
Yoshua Bengio},
title = {Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention},
volume = {37},
pages = {2048--2057},
publisher = {International Conference on Machine Learning},
year = {2015}
...@@ -12751,7 +12734,7 @@ author = {Zhuang Liu and
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
year = {2017}
}
@inproceedings{DBLP:journals/pami/FuJCSZ17,
author = {Kun Fu and
Junqi Jin and
Runpeng Cui and
...@@ -12759,9 +12742,7 @@ author = {Zhuang Liu and
Changshui Zhang},
title = {Aligning Where to See and What to Tell: Image Captioning with Region-Based
Attention and Scene-Specific Contexts},
publisher = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {39},
number = {12},
pages = {2321--2334},
year = {2017}
}
...@@ -12772,8 +12753,6 @@ author = {Zhuang Liu and
Tao Mei},
title = {Exploring Visual Relationship for Image Captioning},
series = {Lecture Notes in Computer Science},
volume = {11218},
pages = {711--727},
publisher = {European Conference on Computer Vision},
year = {2018}
}
...@@ -12788,21 +12767,19 @@ author = {Zhuang Liu and
publisher = {International Joint Conference on Artificial Intelligence},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1804-02767,
author = {Joseph Redmon and
Ali Farhadi},
title = {YOLOv3: An Incremental Improvement},
publisher = {CoRR},
volume = {abs/1804.02767},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-2004-10934,
author = {Alexey Bochkovskiy and
Chien-Yao Wang and
Hong-Yuan Mark Liao},
title = {YOLOv4: Optimal Speed and Accuracy of Object Detection},
publisher = {CoRR},
volume = {abs/2004.10934},
year = {2020}
}
@inproceedings{DBLP:conf/cvpr/LuXPS17,
...@@ -12840,15 +12817,13 @@ author = {Zhuang Liu and
publisher = {ACM Multimedia},
year = {2017}
}
@inproceedings{DBLP:journals/mta/FangWCT18,
author = {Fang Fang and
Hanli Wang and
Yihao Chen and
Pengjie Tang},
title = {Looking deeper and transferring attention for image captioning},
publisher = {Multimedia Tools Applications},
volume = {77},
number = {23},
pages = {31159--31175},
year = {2018}
}
...@@ -12861,12 +12836,11 @@ author = {Zhuang Liu and
publisher = {IEEE Conference on Computer Vision and Pattern Recognition},
year = {2018}
}
@inproceedings{DBLP:journals/corr/abs-1805-09019,
author = {Qingzhong Wang and
Antoni B. Chan},
title = {{CNN+CNN:} Convolutional Decoders for Image Captioning},
publisher = {CoRR},
volume = {abs/1805.09019},
year = {2018}
}
@inproceedings{DBLP:conf/eccv/DaiYL18,
...@@ -12874,8 +12848,6 @@ author = {Zhuang Liu and
Deming Ye and
Dahua Lin},
title = {Rethinking the Form of Latent States in Image Captioning},
volume = {11209},
pages = {294--310},
publisher = {European Conference on Computer Vision},
year = {2018}
}
...@@ -12900,28 +12872,24 @@ author = {Zhuang Liu and
Alexander Kirillov and
Sergey Zagoruyko},
title = {End-to-End Object Detection with Transformers},
volume = {12346},
pages = {213--229},
publisher = {European Conference on Computer Vision},
year = {2020}
}
@inproceedings{DBLP:journals/tcsv/YuLYH20,
author = {Jun Yu and
Jing Li and
Zhou Yu and
Qingming Huang},
title = {Multimodal Transformer With Multi-View Visual Representation for Image
Captioning},
publisher = {IEEE Transactions on Circuits and Systems for Video Technology},
volume = {30},
number = {12},
pages = {4467--4480},
year = {2020}
}
@inproceedings{Huasong2020SelfAdaptiveNM,
title={Self-Adaptive Neural Module Transformer for Visual Question Answering},
author={Zhong Huasong and Jingyuan Chen and Chen Shen and Hanwang Zhang and Jianqiang Huang and Xian-Sheng Hua},
publisher = {IEEE Transactions on Multimedia},
year={2020},
pages={1-1}
}
...@@ -12944,7 +12912,6 @@ author = {Zhuang Liu and
Xiaokang Yang},
title = {Semantic Equivalent Adversarial Data Augmentation for Visual Question
Answering},
volume = {12364},
pages = {437--453},
publisher = { European Conference on Computer Vision},
year = {2020}
...@@ -12963,7 +12930,6 @@ author = {Zhuang Liu and
Yejin Choi and
Jianfeng Gao},
title = {Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
volume = {12375},
pages = {121--137},
publisher = { European Conference on Computer Vision},
year = {2020}
...@@ -13005,23 +12971,21 @@ author = {Zhuang Liu and
pages = {465--476},
year = {2017}
}
@inproceedings{DBLP:journals/corr/abs-1908-06616,
author = {Hajar Emami and
Majid Moradi Aliabadi and
Ming Dong and
Ratna Babu Chinnam},
title = {{SPA-GAN:} Spatial Attention {GAN} for Image-to-Image Translation},
publisher = {IEEE Transactions on Multimedia},
volume = {abs/1908.06616},
year = {2019}
}
@inproceedings{DBLP:journals/access/XiongWG19,
author = {Feng Xiong and
Qianqian Wang and
Quanxue Gao},
title = {Consistent Embedded {GAN} for Image-to-Image Translation},
publisher = {International Conference on Access Networks},
volume = {7},
pages = {126651--126661},
year = {2019}
}
...@@ -13063,12 +13027,11 @@ author = {Zhuang Liu and
Bernt Schiele and
Honglak Lee},
title = {Generative Adversarial Text to Image Synthesis},
volume = {48},
pages = {1060--1069},
publisher = {International Conference on Machine Learning},
year = {2016}
}
@inproceedings{DBLP:journals/corr/DashGALA17,
author = {Ayushman Dash and
John Cristian Borges Gamboa and
Sheraz Ahmed and
...@@ -13076,8 +13039,7 @@ author = {Zhuang Liu and
Muhammad Zeshan Afzal},
title = {{TAC-GAN} - Text Conditioned Auxiliary Classifier Generative Adversarial
Network},
publisher = {CoRR},
volume = {abs/1703.06412},
year = {2017}
}
@inproceedings{DBLP:conf/nips/ReedAMTSL16,
...@@ -13142,12 +13104,11 @@ author = {Zhuang Liu and
publisher = {Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year = {2018}
}
@inproceedings{DBLP:journals/corr/ChoE16,
author = {Kyunghyun Cho and
Masha Esipova},
title = {Can neural machine translation do simultaneous translation?},
publisher = {CoRR},
volume = {abs/1606.02012},
year = {2016}
}
@inproceedings{DBLP:conf/eacl/NeubigCGL17,
...@@ -13293,7 +13254,6 @@ author = {Zhuang Liu and
title = {A Reinforcement Learning Approach to Interactive-Predictive Neural
Machine Translation},
publisher = {CoRR},
volume = {abs/1805.01553},
year = {2018}
}
@inproceedings{DBLP:journals/mt/DomingoPC17,
...@@ -13302,8 +13262,6 @@ author = {Zhuang Liu and
Francisco Casacuberta},
title = {Segment-based interactive-predictive machine translation},
publisher = {Machine Translation},
volume = {31},
number = {4},
pages = {163--185},
year = {2017}
}
...@@ -13321,8 +13279,6 @@ author = {Zhuang Liu and
Juan Miguel Vilar},
title = {Statistical Approaches to Computer-Assisted Translation},
publisher = {Computer Linguistics},
volume = {35},
number = {1},
pages = {3--28},
year = {2009}
}
...@@ -13351,7 +13307,6 @@ author = {Zhuang Liu and
title = {TurboTransformers: An Efficient {GPU} Serving System For Transformer
Models},
publisher = {CoRR},
volume = {abs/2010.05680},
year = {2020}
}
@inproceedings{DBLP:conf/iclr/HuangCLWMW18,
......