Commit dc35a9b5 by xuchen

Initial commit for Fairseq-S2T

# JetBrains PyCharm IDE
.idea/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# macOS dir files
.DS_Store
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
# Checkpoints
checkpoints
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# dotenv
.env
# virtualenv
.venv
venv/
ENV/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
# Generated files
/fairseq/temporal_convolution_tbc
/fairseq/modules/*_layer/*_forward.cu
/fairseq/modules/*_layer/*_backward.cu
/fairseq/version.py
# data
data-bin/
# reranking
/examples/reranking/rerank_data
# Cython-generated C++ source files
/fairseq/data/data_utils_fast.cpp
/fairseq/data/token_block_utils_fast.cpp
# VSCODE
.vscode/ftp-sync.json
.vscode/settings.json
# Experimental Folder
experimental/*
# Weights and Biases logs
wandb/
[submodule "fairseq/model_parallel/megatron"]
path = fairseq/model_parallel/megatron
url = https://github.com/ngoyal2707/Megatron-LM
branch = fairseq
# Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <conduct@pytorch.org>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
# Contributing to Facebook AI Research Sequence-to-Sequence Toolkit (fairseq)
We want to make contributing to this project as easy and transparent as
possible.
## Pull Requests
We actively welcome your pull requests.
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
## License
By contributing to Facebook AI Research Sequence-to-Sequence Toolkit (fairseq),
you agree that your contributions will be licensed under the LICENSE file in
the root directory of this source tree.
MIT License
Copyright (c) Facebook, Inc. and its affiliates.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = python -msphinx
SPHINXPROJ = fairseq
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.wy-table-responsive table td kbd {
white-space: nowrap;
}
.wy-table-responsive table td {
white-space: normal !important;
}
.wy-table-responsive {
overflow: visible !important;
}
.. _Command-line Tools:
Command-line Tools
==================
Fairseq provides several command-line tools for training and evaluating models:
- :ref:`fairseq-preprocess`: Data pre-processing: build vocabularies and binarize training data
- :ref:`fairseq-train`: Train a new model on one or multiple GPUs
- :ref:`fairseq-generate`: Translate pre-processed data with a trained model
- :ref:`fairseq-interactive`: Translate raw text with a trained model
- :ref:`fairseq-score`: BLEU scoring of generated translations against reference translations
- :ref:`fairseq-eval-lm`: Language model evaluation
.. _fairseq-preprocess:
fairseq-preprocess
~~~~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.preprocess
.. argparse::
:module: fairseq.options
:func: get_preprocessing_parser
:prog: fairseq-preprocess
.. _fairseq-train:
fairseq-train
~~~~~~~~~~~~~
.. automodule:: fairseq_cli.train
.. argparse::
:module: fairseq.options
:func: get_training_parser
:prog: fairseq-train
.. _fairseq-generate:
fairseq-generate
~~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.generate
.. argparse::
:module: fairseq.options
:func: get_generation_parser
:prog: fairseq-generate
.. _fairseq-interactive:
fairseq-interactive
~~~~~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.interactive
.. argparse::
:module: fairseq.options
:func: get_interactive_generation_parser
:prog: fairseq-interactive
.. _fairseq-score:
fairseq-score
~~~~~~~~~~~~~
.. automodule:: fairseq_cli.score
.. argparse::
:module: fairseq_cli.score
:func: get_parser
:prog: fairseq-score
.. _fairseq-eval-lm:
fairseq-eval-lm
~~~~~~~~~~~~~~~
.. automodule:: fairseq_cli.eval_lm
.. argparse::
:module: fairseq.options
:func: get_eval_lm_parser
:prog: fairseq-eval-lm
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# fairseq documentation build configuration file, created by
# sphinx-quickstart on Fri Aug 17 21:45:30 2018.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
from fairseq import __version__
# source code directory, relative to this file, for sphinx-autobuild
sys.path.insert(0, os.path.abspath(".."))
source_suffix = [".rst"]
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
#
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.autodoc",
"sphinx.ext.intersphinx",
"sphinx.ext.viewcode",
"sphinx.ext.napoleon",
"sphinxarg.ext",
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# The master toctree document.
master_doc = "index"
# General information about the project.
project = "fairseq"
copyright = "Facebook AI Research (FAIR)"
author = "Facebook AI Research (FAIR)"
github_doc_root = "https://github.com/pytorch/fairseq/tree/master/docs/"
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = __version__
# The full version, including alpha/beta/rc tags.
release = __version__
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This patterns also effect to html_static_path and html_extra_path
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "sphinx"
highlight_language = "python"
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#
# html_theme_options = {}
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
html_context = {
"css_files": [
"_static/theme_overrides.css", # override wide tables in RTD theme
],
}
# Custom sidebar templates, must be a dictionary that maps document names
# to template names.
#
# This is required for the alabaster theme
# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
# html_sidebars = {
# '**': [
# 'about.html',
# 'navigation.html',
# 'relations.html', # needs 'show_related': True theme option to display
# 'searchbox.html',
# 'donate.html',
# ]
# }
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {
"numpy": ("http://docs.scipy.org/doc/numpy/", None),
"python": ("https://docs.python.org/", None),
"torch": ("https://pytorch.org/docs/master/", None),
}
.. role:: hidden
:class: hidden-section
.. _Criterions:
Criterions
==========
Criterions compute the loss function given the model and batch, roughly::
loss = criterion(model, batch)
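A new criterion can be registered with :func:`fairseq.criterions.register_criterion`.
As a rough sketch (the name and class below are hypothetical, modeled on the built-in
cross-entropy criterion)::

    import torch.nn.functional as F

    from fairseq.criterions import FairseqCriterion, register_criterion

    @register_criterion('my_cross_entropy')
    class MyCrossEntropy(FairseqCriterion):

        def forward(self, model, sample, reduce=True):
            # run the model, compute a summed NLL loss and return the
            # (loss, sample_size, logging_output) triple fairseq expects
            net_output = model(**sample['net_input'])
            lprobs = model.get_normalized_probs(net_output, log_probs=True)
            target = model.get_targets(sample, net_output)
            loss = F.nll_loss(
                lprobs.view(-1, lprobs.size(-1)),
                target.view(-1),
                ignore_index=self.padding_idx,
                reduction='sum' if reduce else 'none',
            )
            sample_size = sample['ntokens']
            logging_output = {
                'loss': loss.data,
                'ntokens': sample['ntokens'],
                'sample_size': sample_size,
            }
            return loss, sample_size, logging_output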
.. automodule:: fairseq.criterions
:members:
.. autoclass:: fairseq.criterions.FairseqCriterion
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.adaptive_loss.AdaptiveLoss
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.composite_loss.CompositeLoss
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.cross_entropy.CrossEntropyCriterion
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyCriterion
:members:
:undoc-members:
.. role:: hidden
:class: hidden-section
.. module:: fairseq.data
Data Loading and Utilities
==========================
.. _datasets:
Datasets
--------
**Datasets** define the data format and provide helpers for creating
mini-batches.
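For example, a :class:`~fairseq.data.FairseqDataset` is indexed like an ordinary
PyTorch dataset and provides a ``collater`` method for turning a list of samples
into a mini-batch (a sketch; ``dataset`` stands for any concrete instance)::

    # index individual examples, then collate them into a mini-batch
    samples = [dataset[i] for i in range(8)]
    batch = dataset.collater(samples)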
.. autoclass:: fairseq.data.FairseqDataset
:members:
.. autoclass:: fairseq.data.LanguagePairDataset
:members:
.. autoclass:: fairseq.data.MonolingualDataset
:members:
**Helper Datasets**
These datasets wrap other :class:`fairseq.data.FairseqDataset` instances and
provide additional functionality:
.. autoclass:: fairseq.data.BacktranslationDataset
:members:
.. autoclass:: fairseq.data.ConcatDataset
:members:
.. autoclass:: fairseq.data.ResamplingDataset
:members:
.. autoclass:: fairseq.data.RoundRobinZipDatasets
:members:
.. autoclass:: fairseq.data.TransformEosDataset
:members:
Dictionary
----------
.. autoclass:: fairseq.data.Dictionary
:members:
Iterators
---------
.. autoclass:: fairseq.data.CountingIterator
:members:
.. autoclass:: fairseq.data.EpochBatchIterator
:members:
.. autoclass:: fairseq.data.GroupedIterator
:members:
.. autoclass:: fairseq.data.ShardedIterator
:members:
[writers]
option-limit=0
Evaluating Pre-trained Models
=============================
First, download a pre-trained model along with its vocabularies:
.. code-block:: console
> curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf -
This model uses a `Byte Pair Encoding (BPE)
vocabulary <https://arxiv.org/abs/1508.07909>`__, so we'll have to apply
the encoding to the source text before it can be translated. This can be
done with the
`apply\_bpe.py <https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/apply_bpe.py>`__
script using the ``wmt14.en-fr.fconv-py/bpecodes`` file. ``@@`` is
used as a continuation marker and the original text can be easily
recovered with e.g. ``sed s/@@ //g`` or by passing the ``--remove-bpe``
flag to :ref:`fairseq-generate`. Prior to BPE, input text needs to be tokenized
using ``tokenizer.perl`` from
`mosesdecoder <https://github.com/moses-smt/mosesdecoder>`__.
Let's use :ref:`fairseq-interactive` to generate translations interactively.
Here, we use a beam size of 5 and preprocess the input with the Moses
tokenizer and the given Byte-Pair Encoding vocabulary. It will automatically
remove the BPE continuation markers and detokenize the output.
.. code-block:: console
> MODEL_DIR=wmt14.en-fr.fconv-py
> fairseq-interactive \
--path $MODEL_DIR/model.pt $MODEL_DIR \
--beam 5 --source-lang en --target-lang fr \
--tokenizer moses \
--bpe subword_nmt --bpe-codes $MODEL_DIR/bpecodes
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
Why is it rare to discover new marine mammal species?
S-0 Why is it rare to discover new marine mam@@ mal species ?
H-0 -0.0643349438905716 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
This generation script produces three types of outputs: a line prefixed
with *S* shows the source sentence after preprocessing (here, after BPE); *H* is the
hypothesis along with an average log-likelihood; and *P* is the
positional score per token position, including the
end-of-sentence marker which is omitted from the text.
Other types of output lines you might see are *D*, the detokenized hypothesis;
*T*, the reference target; *A*, alignment info; and *E*, the history of generation steps.
See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
full list of pre-trained models available.
Training a New Model
====================
The following tutorial is for machine translation. For an example of how
to use Fairseq for other tasks, such as :ref:`language modeling`, please see the
``examples/`` directory.
Data Pre-processing
-------------------
Fairseq contains example pre-processing scripts for several translation
datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT
2014 (English-German). To pre-process and binarize the IWSLT dataset:
.. code-block:: console
> cd examples/translation/
> bash prepare-iwslt14.sh
> cd ../..
> TEXT=examples/translation/iwslt14.tokenized.de-en
> fairseq-preprocess --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/iwslt14.tokenized.de-en
This will write binarized data that can be used for model training to
``data-bin/iwslt14.tokenized.de-en``.
Training
--------
Use :ref:`fairseq-train` to train a new model. Here are a few example settings that work
well for the IWSLT 2014 dataset:
.. code-block:: console
> mkdir -p checkpoints/fconv
> CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/iwslt14.tokenized.de-en \
--optimizer nag --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
--arch fconv_iwslt_de_en --save-dir checkpoints/fconv
By default, :ref:`fairseq-train` will use all available GPUs on your machine. Use the
``CUDA_VISIBLE_DEVICES`` environment variable to select specific GPUs and/or to
change the number of GPU devices that will be used.
Also note that the batch size is specified in terms of the maximum
number of tokens per batch (``--max-tokens``). You may need to use a
smaller value depending on the available GPU memory on your system.
Generation
----------
Once your model is trained, you can generate translations using
:ref:`fairseq-generate` **(for binarized data)** or
:ref:`fairseq-interactive` **(for raw text)**:
.. code-block:: console
> fairseq-generate data-bin/iwslt14.tokenized.de-en \
--path checkpoints/fconv/checkpoint_best.pt \
--batch-size 128 --beam 5
| [de] dictionary: 35475 types
| [en] dictionary: 24739 types
| data-bin/iwslt14.tokenized.de-en test 6750 examples
| model fconv
| loaded checkpoint checkpoints/fconv/checkpoint_best.pt
S-721 danke .
T-721 thank you .
...
To generate translations with only a CPU, use the ``--cpu`` flag. BPE
continuation markers can be removed with the ``--remove-bpe`` flag.
Advanced Training Options
=========================
Large mini-batch training with delayed updates
----------------------------------------------
The ``--update-freq`` option can be used to accumulate gradients from
multiple mini-batches and delay updating, creating a larger effective
batch size. Delayed updates can also improve training speed by reducing
inter-GPU communication costs and by saving idle time caused by variance
in workload across GPUs. See `Ott et al.
(2018) <https://arxiv.org/abs/1806.00187>`__ for more details.
To train on a single GPU with an effective batch size that is equivalent
to training on 8 GPUs:
.. code-block:: console
> CUDA_VISIBLE_DEVICES=0 fairseq-train --update-freq 8 (...)
Training with half precision floating point (FP16)
--------------------------------------------------
.. note::
FP16 training requires a Volta GPU and CUDA 9.1 or greater
Recent GPUs enable efficient half precision floating point computation,
e.g., using `Nvidia Tensor Cores
<https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html>`__.
Fairseq supports FP16 training with the ``--fp16`` flag:
.. code-block:: console
> fairseq-train --fp16 (...)
Distributed training
--------------------
Distributed training in fairseq is implemented on top of ``torch.distributed``.
The easiest way to launch jobs is with the `torch.distributed.launch
<https://pytorch.org/docs/stable/distributed.html#launch-utility>`__ tool.
For example, to train a large English-German Transformer model on 2 nodes each
with 8 GPUs (in total 16 GPUs), run the following command on each node,
replacing ``node_rank=0`` with ``node_rank=1`` on the second node and making
sure to update ``--master_addr`` to the IP address of the first node:
.. code-block:: console
> python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=2 --node_rank=0 --master_addr="192.168.1.1" \
--master_port=12345 \
$(which fairseq-train) data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 \
--dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--max-epoch 70 \
--fp16
On SLURM clusters, fairseq will automatically detect the number of nodes and
GPUs, but a port number must be provided:
.. code-block:: console
> salloc --gpus=16 --nodes 2 (...)
> srun fairseq-train --distributed-port 12345 (...).
Sharding very large datasets
----------------------------
It can be challenging to train over very large datasets, particularly if your
machine does not have much system RAM. Most tasks in fairseq support training
over "sharded" datasets, in which the original dataset has been preprocessed
into non-overlapping chunks (or "shards").
For example, instead of preprocessing all your data into a single "data-bin"
directory, you can split the data and create "data-bin1", "data-bin2", etc.
Then you can adapt your training command like so:
.. code-block:: console
> fairseq-train data-bin1:data-bin2:data-bin3 (...)
Training will now iterate over each shard, one by one, with each shard
corresponding to an "epoch", thus reducing system memory usage.
.. fairseq documentation master file, created by
sphinx-quickstart on Fri Aug 17 21:45:30 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
:github_url: https://github.com/pytorch/fairseq
fairseq documentation
=====================
Fairseq is a sequence modeling toolkit written in `PyTorch
<http://pytorch.org/>`_ that allows researchers and developers to
train custom models for translation, summarization, language modeling and other
text generation tasks.
.. toctree::
:maxdepth: 1
:caption: Getting Started
getting_started
command_line_tools
.. toctree::
:maxdepth: 1
:caption: Extending Fairseq
overview
tutorial_simple_lstm
tutorial_classifying_names
.. toctree::
:maxdepth: 2
:caption: Library Reference
tasks
models
criterions
optim
lr_scheduler
data
modules
Indices and tables
==================
* :ref:`genindex`
* :ref:`search`
.. role:: hidden
:class: hidden-section
.. _Learning Rate Schedulers:
Learning Rate Schedulers
========================
Learning Rate Schedulers update the learning rate over the course of training.
Learning rates can be updated after each update via :func:`step_update` or at
epoch boundaries via :func:`step`.
.. automodule:: fairseq.optim.lr_scheduler
:members:
.. autoclass:: fairseq.optim.lr_scheduler.FairseqLRScheduler
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.cosine_lr_scheduler.CosineSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.fixed_schedule.FixedSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.inverse_square_root_schedule.InverseSquareRootSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.reduce_lr_on_plateau.ReduceLROnPlateau
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.triangular_lr_scheduler.TriangularSchedule
:members:
:undoc-members:
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=python -msphinx
)
set SOURCEDIR=.
set BUILDDIR=_build
set SPHINXPROJ=fairseq
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The Sphinx module was not found. Make sure you have Sphinx installed,
echo.then set the SPHINXBUILD environment variable to point to the full
echo.path of the 'sphinx-build' executable. Alternatively you may add the
echo.Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
:end
popd
.. role:: hidden
:class: hidden-section
.. module:: fairseq.models
.. _Models:
Models
======
A Model defines the neural network's ``forward()`` method and encapsulates all
of the learnable parameters in the network. Each model also provides a set of
named *architectures* that define the precise network configuration (e.g.,
embedding dimension, number of layers, etc.).
Both the model type and architecture are selected via the ``--arch``
command-line argument. Once selected, a model may expose additional command-line
arguments for further configuration.
.. note::
All fairseq Models extend :class:`BaseFairseqModel`, which in turn extends
:class:`torch.nn.Module`. Thus any fairseq Model can be used as a
stand-alone Module in other PyTorch code.
Convolutional Neural Networks (CNN)
-----------------------------------
.. module:: fairseq.models.fconv
.. autoclass:: fairseq.models.fconv.FConvModel
:members:
.. autoclass:: fairseq.models.fconv.FConvEncoder
:members:
:undoc-members:
.. autoclass:: fairseq.models.fconv.FConvDecoder
:members:
Long Short-Term Memory (LSTM) networks
--------------------------------------
.. module:: fairseq.models.lstm
.. autoclass:: fairseq.models.lstm.LSTMModel
:members:
.. autoclass:: fairseq.models.lstm.LSTMEncoder
:members:
.. autoclass:: fairseq.models.lstm.LSTMDecoder
:members:
Transformer (self-attention) networks
-------------------------------------
.. module:: fairseq.models.transformer
.. autoclass:: fairseq.models.transformer.TransformerModel
:members:
.. autoclass:: fairseq.models.transformer.TransformerEncoder
:members:
.. autoclass:: fairseq.models.transformer.TransformerEncoderLayer
:members:
.. autoclass:: fairseq.models.transformer.TransformerDecoder
:members:
.. autoclass:: fairseq.models.transformer.TransformerDecoderLayer
:members:
Adding new models
-----------------
.. currentmodule:: fairseq.models
.. autofunction:: fairseq.models.register_model
.. autofunction:: fairseq.models.register_model_architecture
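As a rough sketch (the names below are hypothetical), a new model and a named
architecture are registered like this::

    from fairseq.models import (
        FairseqEncoderDecoderModel,
        register_model,
        register_model_architecture,
    )

    @register_model('my_model')
    class MyModel(FairseqEncoderDecoderModel):
        (...)

    @register_model_architecture('my_model', 'my_model_base')
    def my_model_base(args):
        # fill in any architecture defaults not overridden on the command line
        args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 512)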
.. autoclass:: fairseq.models.BaseFairseqModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoderDecoderModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoderModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqLanguageModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqMultiModel
:members:
:undoc-members:
.. autoclass:: fairseq.models.FairseqEncoder
:members:
.. autoclass:: fairseq.models.CompositeEncoder
:members:
.. autoclass:: fairseq.models.FairseqDecoder
:members:
.. _Incremental decoding:
Incremental decoding
--------------------
.. autoclass:: fairseq.models.FairseqIncrementalDecoder
:members:
:undoc-members:
Modules
=======
Fairseq provides several stand-alone :class:`torch.nn.Module` classes that may
be helpful when implementing a new :class:`~fairseq.models.BaseFairseqModel`.
.. automodule:: fairseq.modules
:members:
:undoc-members:
.. role:: hidden
:class: hidden-section
.. _optimizers:
Optimizers
==========
Optimizers update the Model parameters based on the gradients.
.. automodule:: fairseq.optim
:members:
.. autoclass:: fairseq.optim.FairseqOptimizer
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adadelta.Adadelta
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adagrad.Adagrad
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adafactor.FairseqAdafactor
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adam.FairseqAdam
:members:
:undoc-members:
.. autoclass:: fairseq.optim.fp16_optimizer.FP16Optimizer
:members:
:undoc-members:
.. autoclass:: fairseq.optim.nag.FairseqNAG
:members:
:undoc-members:
.. autoclass:: fairseq.optim.sgd.SGD
:members:
:undoc-members:
Overview
========
Fairseq can be extended through user-supplied `plug-ins
<https://en.wikipedia.org/wiki/Plug-in_(computing)>`_. We support five kinds of
plug-ins:
- :ref:`Models` define the neural network architecture and encapsulate all of the
learnable parameters.
- :ref:`Criterions` compute the loss function given the model outputs and targets.
- :ref:`Tasks` store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
- :ref:`Optimizers` update the Model parameters based on the gradients.
- :ref:`Learning Rate Schedulers` update the learning rate over the course of
training.
**Training Flow**
Given a ``model``, ``criterion``, ``task``, ``optimizer`` and ``lr_scheduler``,
fairseq implements the following high-level training flow::
for epoch in range(num_epochs):
itr = task.get_batch_iterator(task.dataset('train'))
for num_updates, batch in enumerate(itr):
task.train_step(batch, model, criterion, optimizer)
average_and_clip_gradients()
optimizer.step()
lr_scheduler.step_update(num_updates)
lr_scheduler.step(epoch)
where the default implementation for ``task.train_step`` is roughly::
def train_step(self, batch, model, criterion, optimizer, **unused):
loss = criterion(model, batch)
optimizer.backward(loss)
return loss
**Registering new plug-ins**
New plug-ins are *registered* through a set of ``@register`` function
decorators, for example::
@register_model('my_lstm')
class MyLSTM(FairseqEncoderDecoderModel):
(...)
Once registered, new plug-ins can be used with the existing :ref:`Command-line
Tools`. See the Tutorial sections for more detailed walkthroughs of how to add
new plug-ins.
**Loading plug-ins from another directory**
New plug-ins can be defined in a custom module stored on the user's system. To
import the module and make the plug-in available to *fairseq*, the command line
supports the ``--user-dir`` flag, which specifies a custom location for
additional modules to load into *fairseq*.
For example, assuming this directory tree::
/home/user/my-module/
└── __init__.py
with ``__init__.py``::
from fairseq.models import register_model_architecture
from fairseq.models.transformer import transformer_vaswani_wmt_en_de_big
@register_model_architecture('transformer', 'my_transformer')
def transformer_mmt_big(args):
transformer_vaswani_wmt_en_de_big(args)
the :ref:`fairseq-train` script can be invoked with the new architecture as follows::
fairseq-train ... --user-dir /home/user/my-module -a my_transformer --task translation
.. role:: hidden
:class: hidden-section
.. module:: fairseq.tasks
.. _Tasks:
Tasks
=====
Tasks store dictionaries and provide helpers for loading/iterating over
Datasets, initializing the Model/Criterion and calculating the loss.
Tasks can be selected via the ``--task`` command-line argument. Once selected, a
task may expose additional command-line arguments for further configuration.
Example usage::
# setup the task (e.g., load dictionaries)
task = fairseq.tasks.setup_task(args)
# build model and criterion
model = task.build_model(args)
criterion = task.build_criterion(args)
# load datasets
task.load_dataset('train')
task.load_dataset('valid')
# iterate over mini-batches of data
batch_itr = task.get_batch_iterator(
task.dataset('train'), max_tokens=4096,
)
for batch in batch_itr:
# compute the loss
loss, sample_size, logging_output = task.get_loss(
model, criterion, batch,
)
loss.backward()
Translation
-----------
.. autoclass:: fairseq.tasks.translation.TranslationTask
.. _language modeling:
Language Modeling
-----------------
.. autoclass:: fairseq.tasks.language_modeling.LanguageModelingTask
Adding new tasks
----------------
.. autofunction:: fairseq.tasks.register_task
.. autoclass:: fairseq.tasks.FairseqTask
:members:
:undoc-members:
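As a rough sketch (the task name and option below are hypothetical), a new task is
registered with a decorator and typically implements at least :func:`setup_task` and
:func:`load_dataset`::

    from fairseq.tasks import FairseqTask, register_task

    @register_task('my_dummy_task')
    class MyDummyTask(FairseqTask):

        @staticmethod
        def add_args(parser):
            # task-specific command-line options
            parser.add_argument('--my-option', type=int, default=1)

        @classmethod
        def setup_task(cls, args, **kwargs):
            # load dictionaries etc., then build the task instance
            return cls(args)

        def load_dataset(self, split, **kwargs):
            # load or construct a FairseqDataset for the given split
            (...)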
#train-subset: train-clean-100,train-clean-360,train-other-500
train-subset: train-clean-100
valid-subset: dev-clean
max-epoch: 100
max-update: 300000
num-workers: 0
patience: 10
no-progress-bar: True
log-interval: 100
seed: 1
report-accuracy: True
arch: s2t_transformer_s
share-decoder-input-output-embed: True
optimizer: adam
clip-norm: 10.0
lr-scheduler: inverse_sqrt
warmup-init-lr: 1e-7
warmup-updates: 10000
lr: 2e-3
#adam_betas: (0.9,0.98)
ctc-weight: 0.3
criterion: label_smoothed_cross_entropy_with_ctc
label_smoothing: 0.1
conv-kernel-sizes: 5,5
conv-channels: 1024
dropout: 0.1
activation-fn: relu
encoder-embed-dim: 256
encoder-ffn-embed-dim: 2048
encoder-layers: 12
decoder-layers: 6
encoder-attention-heads: 4
#decoder-embed-dim: 256
#decoder-ffn-embed-dim: 2048
#decoder-attention-heads: 4
#attention-dropout: 0.1
#activation-dropout: 0.1
#! /bin/bash
# decoding with the trained model
gpu_num=1
test_subset=(test-clean test-other)
exp_name=test
n_average=10
beam_size=5
max_tokens=40000
cmd="./run.sh
--stage 2
--stop_stage 2
--gpu_num ${gpu_num}
--exp_name ${exp_name}
--test_subset ${test_subset}
--n_average ${n_average}
--beam_size ${beam_size}
--max_tokens ${max_tokens}
"
echo $cmd
eval $cmd
gpu_num=1
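# wait until ${gpu_num} idle GPUs are reported by gpustat, then launch the decoding command built above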
while :
do
all_devices=$(seq 0 `gpustat | sed '1,2d' | wc -l`);
count=0
for dev in ${all_devices[@]}
do
line=`expr $dev + 2`
use=`gpustat -p | head -n $line | tail -1 | cut -d '|' -f4 | wc -w`
if [[ $use -eq 0 ]]; then
device[$count]=$dev
count=`expr $count + 1`
if [[ $count -eq $gpu_num ]]; then
break
fi
fi
done
if [[ ${#device[@]} -lt $gpu_num ]]; then
sleep 60s
else
echo "Run $cmd"
eval $cmd
sleep 10s
exit
fi
done
#!/usr/bin/env bash
# Copyright 2012 Johns Hopkins University (Author: Daniel Povey);
# Arnab Ghoshal, Karel Vesely
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# Parse command-line options.
# To be sourced by another script (as in ". parse_options.sh").
# Option format is: --option-name arg
# and shell variable "option_name" gets set to value "arg."
# The exception is --help, which takes no arguments, but prints the
# $help_message variable (if defined).
###
### The --config file options have lower priority to command line
### options, so we need to import them first...
###
# Now import all the configs specified by command-line, in left-to-right order
for ((argpos=1; argpos<$#; argpos++)); do
if [ "${!argpos}" == "--config" ]; then
argpos_plus1=$((argpos+1))
config=${!argpos_plus1}
[ ! -r $config ] && echo "$0: missing config '$config'" && exit 1
. $config # source the config file.
fi
done
###
### Now we process the command line options
###
while true; do
[ -z "${1:-}" ] && break; # break if there are no arguments
case "$1" in
# If the enclosing script is called with --help option, print the help
# message and exit. Scripts should put help messages in $help_message
--help|-h) if [ -z "$help_message" ]; then echo "No help found." 1>&2;
else printf "$help_message\n" 1>&2 ; fi;
exit 0 ;;
--*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
exit 1 ;;
# If the first command-line argument begins with "--" (e.g. --foo-bar),
# then work out the variable name as $name, which will equal "foo_bar".
--*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
# Next we test whether the variable in question is undefined -- if so it's
# an invalid option and we die. Note: $0 evaluates to the name of the
# enclosing script.
# The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
# is undefined. We then have to wrap this test inside "eval" because
# foo_bar is itself inside a variable ($name).
eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
oldval="`eval echo \\$$name`";
# Work out whether we seem to be expecting a Boolean argument.
if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
was_bool=true;
else
was_bool=false;
fi
# Set the variable to the right value-- the escaped quotes make it work if
# the option had spaces, like --cmd "queue.pl -sync y"
eval $name=\"$2\";
# Check that Boolean-valued arguments are really Boolean.
if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
exit 1;
fi
shift 2;
;;
*) break;
esac
done
# Check for an empty argument to the --cmd option, which can easily occur as a
# result of scripting errors.
[ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
true; # so this script returns exit code 0.
MAIN_ROOT=$PWD/../../..
KALDI_ROOT=$MAIN_ROOT/tools/kaldi
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/src/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/tools/chainer_ctc/ext/warp-ctc/build
. "${MAIN_ROOT}"/tools/activate_python.sh && . "${MAIN_ROOT}"/tools/extra_path.sh
export PATH=$MAIN_ROOT/utils:$MAIN_ROOT/espnet/bin:$PATH
export OMP_NUM_THREADS=1
# check extra module installation
if ! which tokenizer.perl > /dev/null; then
echo "Error: it seems that moses is not installed." >&2
echo "Error: please install moses as follows." >&2
echo "Error: cd ${MAIN_ROOT}/tools && make moses.done" >&2
return 1
fi
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
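# get_devices <gpu_num> <use_cpu>: poll gpustat until <gpu_num> idle GPUs
# (less than 10 MB of memory in use) are found and echo their ids as a
# comma-separated list; if <use_cpu> is 1, fall back to -1 (CPU) instead of waiting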
get_devices(){
gpu_num=$1
use_cpu=$2
device=()
while :
do
record=`mktemp -t temp.record.XXXXXX`
gpustat > $record
all_devices=$(seq 0 `cat $record | sed '1,2d' | wc -l`);
count=0
for dev in ${all_devices[@]}
do
line=`expr $dev + 2`
use=`cat $record | head -n $line | tail -1 | cut -d '|' -f3 | cut -d '/' -f1`
if [[ $use -lt 10 ]]; then
device[$count]=$dev
count=`expr $count + 1`
if [[ $count -eq $gpu_num ]]; then
break
fi
fi
done
if [[ ${#device[@]} -lt $gpu_num ]]; then
if [[ $use_cpu -eq 1 ]]; then
device=(-1)
else
sleep 60s
fi
else
break
fi
done
echo ${device[*]} | sed 's/ /,/g'
return $?
}
#! /bin/bash
# Processing LibriSpeech Datasets
# Copyright 2021 Natural Language Processing Laboratory
# Xu Chen (xuchenneu@163.com)
# Set bash to 'debug' mode: it will exit on
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'.
set -e
#set -u
set -o pipefail
export PYTHONIOENCODING=UTF-8
eval=1
time=$(date "+%m%d_%H%M")
stage=0
stop_stage=0
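# stages: -1 data download, 0 data preparation, 1 ASR training, 2 decoding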
######## hardware ########
# devices
device=()
gpu_num=8
update_freq=1
root_dir=~/Code/st/fairseq
pwd_dir=$PWD
# dataset
src_lang=en
lang=${src_lang}
dataset=librispeech
task=speech_to_text
vocab_type=unigram
vocab_size=10000
data_dir=~/Code/st/data/${dataset}
test_subset=(dev-clean dev-other test-clean test-other)
# exp
extra_tag=
extra_parameter=
exp_tag=baseline
exp_name=
# config
train_config=asr_train_ctc.yaml
data_config=config.yaml
# training setting
fp16=0
max_tokens=40000
step_valid=0
# decoding setting
n_average=10
beam_size=5
. ./local/parse_options.sh || exit 1;
if [[ $step_valid -eq 1 ]]; then
validate_interval=10000
save_interval=10000
no_epoch_checkpoints=1
save_interval_updates=5000
keep_interval_updates=3
else
validate_interval=1
keep_last_epochs=10
fi
# full path
train_config=$pwd_dir/conf/${train_config}
if [[ -z ${exp_name} ]]; then
exp_name=$(basename ${train_config%.*})_${exp_tag}${extra_tag}
fi
model_dir=$root_dir/../checkpoints/$dataset/$task/asr/${exp_name}
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
echo "stage -1: Data Download"
# pass
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
### Task dependent. You have to do the following data preparation yourself.
### But you can utilize Kaldi recipes in most cases
echo "stage 0: Data Preparation"
cmd="python ${root_dir}/examples/speech_to_text/prep_librispeech_data.py
--output-root ${data_dir}
--vocab-type ${vocab_type}
--vocab-size ${vocab_size}"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
[[ $eval -eq 1 ]] && eval $cmd
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: ASR Network Training"
[[ ! -d $data_dir ]] && echo "The data dir $data_dir does not exist!" && exit 1;
if [[ -z ${device} || ${#device[@]} -eq 0 ]]; then
if [[ ${gpu_num} -eq 0 ]]; then
device=()
else
source ./local/utils.sh
device=$(get_devices $gpu_num 0)
fi
fi
echo -e "dev=${device} data=$data_dir model=${model_dir}"
if [[ ! -d ${model_dir} ]]; then
mkdir -p ${model_dir}
else
echo "${model_dir} exists."
fi
cp ${BASH_SOURCE[0]} ${model_dir}
cp ${PWD}/train.sh ${model_dir}
cp ${train_config} ${model_dir}
cmd="python3 -u ${root_dir}/fairseq_cli/train.py
$data_dir
--config-yaml ${data_config}
--train-config ${train_config}
--task speech_to_text
--max-tokens ${max_tokens}
--update-freq ${update_freq}
--log-interval 100
--save-dir ${model_dir}
--tensorboard-logdir ${model_dir}"
if [[ -n ${extra_parameter} ]]; then
cmd="${cmd}
${extra_parameter}"
fi
if [[ ${gpu_num} -gt 0 ]]; then
cmd="${cmd}
--distributed-world-size $gpu_num
--ddp-backend no_c10d"
fi
if [[ $fp16 -eq 1 ]]; then
cmd="${cmd}
--fp16"
fi
if [[ -n $no_epoch_checkpoints && $no_epoch_checkpoints -eq 1 ]]; then
cmd="$cmd
--no-epoch-checkpoints"
fi
if [[ -n $validate_interval ]]; then
cmd="${cmd}
--validate-interval $validate_interval "
fi
if [[ -n $save_interval ]]; then
cmd="${cmd}
--save-interval $save_interval "
fi
if [[ -n $keep_last_epochs ]]; then
cmd="${cmd}
--keep-last-epochs $keep_last_epochs "
fi
if [[ -n $save_interval_updates ]]; then
cmd="${cmd}
--save-interval-updates $save_interval_updates"
if [[ -n $keep_interval_updates ]]; then
cmd="${cmd}
--keep-interval-updates $keep_interval_updates"
fi
fi
echo -e "\033[34mRun command: \n${cmd} \033[0m"
# save info
log=./history.log
echo "${time} | ${device} | $data_dir | ${model_dir} " >> $log
cat $log | tail -n 50 > tmp.log
mv tmp.log $log
export CUDA_VISIBLE_DEVICES=${device}
cmd="nohup ${cmd} >> ${model_dir}/train.log 2>&1 &"
if [[ $eval -eq 1 ]]; then
eval $cmd
sleep 2s
tail -n `wc -l ${model_dir}/train.log | awk '{print $1+1}'` -f ${model_dir}/train.log
fi
fi
wait
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: ASR Decoding"
if [[ ${n_average} -ne 1 ]]; then
# Average models
dec_model=avg_${n_average}_checkpoint.pt
cmd="python ${root_dir}/scripts/average_checkpoints.py
--inputs ${model_dir}
--num-epoch-checkpoints ${n_average}
--output ${model_dir}/${dec_model}"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
[[ $eval -eq 1 ]] && eval $cmd
else
dec_model=checkpoint_best.pt
fi
#tmp_file=$(mktemp ${model_dir}/tmp-XXXXX)
#trap 'rm -rf ${tmp_file}' EXIT
result_file=${model_dir}/decode_result
[[ -f ${result_file} ]] && rm ${result_file}
for subset in ${test_subset[@]}; do
subset=${subset}_asr
cmd="python ${root_dir}/fairseq_cli/generate.py
${data_dir}/$lang
--config-yaml ${data_config}
--gen-subset ${subset}
--task speech_to_text
--path ${model_dir}/${dec_model}
--results-path ${model_dir}
--max-tokens ${max_tokens}
--beam ${beam_size}
--scoring wer"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
if [[ $eval -eq 1 ]]; then
eval $cmd
tail -n 1 ${model_dir}/generate-${subset}.txt >> ${result_file}
fi
done
cat ${result_file}
fi
#! /bin/bash
# training the model
gpu_num=0
update_freq=1
extra_tag=
extra_parameter=
#extra_tag="${extra_tag}"
#extra_parameter="${extra_parameter} "
exp_tag=test
train_config=asr_train_ctc.yaml
max_tokens=4000
cmd="./run.sh
--stage 1
--stop_stage 1
--gpu_num ${gpu_num}
--update_freq ${update_freq}
--exp_tag ${exp_tag}
--train_config ${train_config}
--max_tokens ${max_tokens}
"
if [[ -n ${extra_tag} ]]; then
cmd="$cmd --extra_tag ${extra_tag}"
fi
if [[ -n ${extra_parameter} ]]; then
cmd="$cmd --extra_parameter ${extra_parameter}"
fi
echo $cmd
eval $cmd
train-subset: train_asr
valid-subset: dev_asr
max-epoch: 50
max-update: 100000
num-workers: 8
patience: 10
no-progress-bar: True
log-interval: 100
seed: 1
report-accuracy: True
#load-params:
#load-pretrained-encoder-from:
arch: s2t_transformer_s
share-decoder-input-output-embed: True
optimizer: adam
clip-norm: 10.0
lr-scheduler: inverse_sqrt
warmup-init-lr: 1e-7
warmup-updates: 10000
lr: 1e-3
#adam_betas: (0.9,0.98)
ctc-weight: 0.3
criterion: label_smoothed_cross_entropy_with_ctc
label_smoothing: 0.1
conv-kernel-sizes: 5,5
conv-channels: 1024
dropout: 0.1
activation-fn: relu
encoder-embed-dim: 256
encoder-ffn-embed-dim: 2048
encoder-layers: 12
decoder-layers: 6
encoder-attention-heads: 4
#decoder-embed-dim: 256
#decoder-ffn-embed-dim: 2048
#decoder-attention-heads: 4
#attention-dropout: 0.1
#activation-dropout: 0.1
#! /bin/bash
# decoding with the trained model
gpu_num=1
test_subset=(tst-COMMON)
exp_name=test
n_average=10
beam_size=5
max_tokens=40000
cmd="./run.sh
--stage 3
--stop_stage 3
--gpu_num ${gpu_num}
--exp_name ${exp_name}
--test_subset ${test_subset}
--n_average ${n_average}
--beam_size ${beam_size}
--max_tokens ${max_tokens}
"
echo $cmd
eval $cmd
gpu_num=1
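# wait until ${gpu_num} idle GPUs are reported by gpustat, then launch the decoding command built above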
while :
do
all_devices=$(seq 0 `gpustat | sed '1,2d' | wc -l`);
count=0
for dev in ${all_devices[@]}
do
line=`expr $dev + 2`
use=`gpustat -p | head -n $line | tail -1 | cut -d '|' -f4 | wc -w`
if [[ $use -eq 0 ]]; then
device[$count]=$dev
count=`expr $count + 1`
if [[ $count -eq $gpu_num ]]; then
break
fi
fi
done
if [[ ${#device[@]} -lt $gpu_num ]]; then
sleep 60s
else
echo "Run $cmd"
eval $cmd
sleep 10s
exit
fi
done
#!/usr/bin/env bash
# Copyright 2012 Johns Hopkins University (Author: Daniel Povey);
# Arnab Ghoshal, Karel Vesely
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# Parse command-line options.
# To be sourced by another script (as in ". parse_options.sh").
# Option format is: --option-name arg
# and shell variable "option_name" gets set to value "arg."
# The exception is --help, which takes no arguments, but prints the
# $help_message variable (if defined).
###
### The --config file options have lower priority to command line
### options, so we need to import them first...
###
# Now import all the configs specified by command-line, in left-to-right order
for ((argpos=1; argpos<$#; argpos++)); do
if [ "${!argpos}" == "--config" ]; then
argpos_plus1=$((argpos+1))
config=${!argpos_plus1}
[ ! -r $config ] && echo "$0: missing config '$config'" && exit 1
. $config # source the config file.
fi
done
###
### Now we process the command line options
###
while true; do
[ -z "${1:-}" ] && break; # break if there are no arguments
case "$1" in
# If the enclosing script is called with --help option, print the help
# message and exit. Scripts should put help messages in $help_message
--help|-h) if [ -z "$help_message" ]; then echo "No help found." 1>&2;
else printf "$help_message\n" 1>&2 ; fi;
exit 0 ;;
--*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
exit 1 ;;
# If the first command-line argument begins with "--" (e.g. --foo-bar),
# then work out the variable name as $name, which will equal "foo_bar".
--*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
# Next we test whether the variable in question is undefined -- if so it's
# an invalid option and we die. Note: $0 evaluates to the name of the
# enclosing script.
# The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
# is undefined. We then have to wrap this test inside "eval" because
# foo_bar is itself inside a variable ($name).
eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
oldval="`eval echo \\$$name`";
# Work out whether we seem to be expecting a Boolean argument.
if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
was_bool=true;
else
was_bool=false;
fi
# Set the variable to the right value-- the escaped quotes make it work if
# the option had spaces, like --cmd "queue.pl -sync y"
eval $name=\"$2\";
# Check that Boolean-valued arguments are really Boolean.
if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
exit 1;
fi
shift 2;
;;
*) break;
esac
done
# Check for an empty argument to the --cmd option, which can easily occur as a
# result of scripting errors.
[ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
true; # so this script returns exit code 0.
MAIN_ROOT=$PWD/../../..
KALDI_ROOT=$MAIN_ROOT/tools/kaldi
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/src/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/tools/chainer_ctc/ext/warp-ctc/build
. "${MAIN_ROOT}"/tools/activate_python.sh && . "${MAIN_ROOT}"/tools/extra_path.sh
export PATH=$MAIN_ROOT/utils:$MAIN_ROOT/espnet/bin:$PATH
export OMP_NUM_THREADS=1
# check extra module installation
if ! which tokenizer.perl > /dev/null; then
echo "Error: it seems that moses is not installed." >&2
echo "Error: please install moses as follows." >&2
echo "Error: cd ${MAIN_ROOT}/tools && make moses.done" >&2
return 1
fi
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
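# get_devices <gpu_num> <use_cpu>: poll gpustat until <gpu_num> idle GPUs
# (less than 10 MB of memory in use) are found and echo their ids as a
# comma-separated list; if <use_cpu> is 1, fall back to -1 (CPU) instead of waiting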
get_devices(){
gpu_num=$1
use_cpu=$2
device=()
while :
do
record=`mktemp -t temp.record.XXXXXX`
gpustat > $record
all_devices=$(seq 0 `cat $record | sed '1,2d' | wc -l`);
count=0
for dev in ${all_devices[@]}
do
line=`expr $dev + 2`
use=`cat $record | head -n $line | tail -1 | cut -d '|' -f3 | cut -d '/' -f1`
if [[ $use -lt 10 ]]; then
device[$count]=$dev
count=`expr $count + 1`
if [[ $count -eq $gpu_num ]]; then
break
fi
fi
done
if [[ ${#device[@]} -lt $gpu_num ]]; then
if [[ $use_cpu -eq 1 ]]; then
device=(-1)
else
sleep 60s
fi
else
break
fi
done
echo ${device[*]} | sed 's/ /,/g'
return $?
}
#! /bin/bash
# Processing MuST-C Datasets
# Copyright 2021 Natural Language Processing Laboratory
# Xu Chen (xuchenneu@163.com)
# Set bash to 'debug' mode: it will exit on
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'.
set -e
#set -u
set -o pipefail
export PYTHONIOENCODING=UTF-8
eval=1
time=$(date "+%m%d_%H%M")
stage=0
stop_stage=0
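# stages: -1 data download, 0 ASR data preparation, 1 ASR training, 2 decoding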
######## hardware ########
# devices
device=()
gpu_num=8
update_freq=1
root_dir=~/st/fairseq
pwd_dir=$PWD
# dataset
src_lang=en
lang=${src_lang}
dataset=mustc
task=speech_to_text
vocab_type=unigram
vocab_size=5000
data_dir=~/st/data/${dataset}
test_subset=(tst-COMMON)
# exp
extra_tag=
extra_parameter=
exp_tag=baseline
exp_name=
# config
train_config=asr_train_ctc.yaml
data_config=config_asr.yaml
# training setting
fp16=1
max_tokens=40000
step_valid=0
# decoding setting
n_average=10
beam_size=5
. ./local/parse_options.sh || exit 1;
if [[ $step_valid -eq 1 ]]; then
validate_interval=10000
save_interval=10000
no_epoch_checkpoints=1
save_interval_updates=5000
keep_interval_updates=3
else
validate_interval=1
keep_last_epochs=10
fi
# full path
train_config=$pwd_dir/conf/${train_config}
if [[ -z ${exp_name} ]]; then
exp_name=$(basename ${train_config%.*})_${exp_tag}${extra_tag}
fi
model_dir=$root_dir/../checkpoints/$dataset/$task/asr/${exp_name}
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
echo "stage -1: Data Download"
# pass
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
### Task dependent. You have to do the following data preparation yourself.
### But you can utilize Kaldi recipes in most cases
echo "stage 0: ASR Data Preparation"
cmd="python ${root_dir}/examples/speech_to_text/prep_mustc_data.py
--data-root ${data_dir}
--task asr
--vocab-type ${vocab_type}
--vocab-size ${vocab_size}"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
[[ $eval -eq 1 ]] && eval $cmd
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: ASR Network Training"
[[ ! -d $data_dir ]] && echo "The data dir $data_dir does not exist!" && exit 1;
if [[ -z ${device} || ${#device[@]} -eq 0 ]]; then
if [[ ${gpu_num} -eq 0 ]]; then
device=()
else
source ./local/utils.sh
device=$(get_devices $gpu_num 0)
fi
fi
echo -e "dev=${device} data=$data_dir model=${model_dir}"
if [[ ! -d ${model_dir} ]]; then
mkdir -p ${model_dir}
else
echo "${model_dir} exists."
fi
cp ${BASH_SOURCE[0]} ${model_dir}
cp ${PWD}/train.sh ${model_dir}
cp ${train_config} ${model_dir}
cmd="python3 -u ${root_dir}/fairseq_cli/train.py
$data_dir/$lang
--config-yaml ${data_config}
--train-config ${train_config}
--task speech_to_text
--max-tokens ${max_tokens}
--update-freq ${update_freq}
--log-interval 100
--save-dir ${model_dir}
--tensorboard-logdir ${model_dir}"
if [[ -n ${extra_parameter} ]]; then
cmd="${cmd}
${extra_parameter}"
fi
if [[ ${gpu_num} -gt 0 ]]; then
cmd="${cmd}
--distributed-world-size $gpu_num
--ddp-backend no_c10d"
fi
if [[ $fp16 -eq 1 ]]; then
cmd="${cmd}
--fp16"
fi
if [[ -n $no_epoch_checkpoints && $no_epoch_checkpoints -eq 1 ]]; then
cmd="$cmd
--no-epoch-checkpoints"
fi
if [[ -n $validate_interval ]]; then
cmd="${cmd}
--validate-interval $validate_interval "
fi
if [[ -n $save_interval ]]; then
cmd="${cmd}
--save-interval $save_interval "
fi
if [[ -n $keep_last_epochs ]]; then
cmd="${cmd}
--keep-last-epochs $keep_last_epochs "
fi
if [[ -n $save_interval_updates ]]; then
cmd="${cmd}
--save-interval-updates $save_interval_updates"
if [[ -n $keep_interval_updates ]]; then
cmd="${cmd}
--keep-interval-updates $keep_interval_updates"
fi
fi
echo -e "\033[34mRun command: \n${cmd} \033[0m"
# save info
log=./history.log
echo "${time} | ${device} | $data_dir | ${model_dir} " >> $log
cat $log | tail -n 50 > tmp.log
mv tmp.log $log
export CUDA_VISIBLE_DEVICES=${device}
cmd="nohup ${cmd} >> ${model_dir}/train.log 2>&1 &"
if [[ $eval -eq 1 ]]; then
eval $cmd
sleep 2s
tail -n `wc -l ${model_dir}/train.log | awk '{print $1+1}'` -f ${model_dir}/train.log
fi
fi
wait
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: ASR Decoding"
if [[ ${n_average} -ne 1 ]]; then
# Average models
dec_model=avg_${n_average}_checkpoint.pt
cmd="python ${root_dir}/scripts/average_checkpoints.py
--inputs ${model_dir}
--num-epoch-checkpoints ${n_average}
--output ${model_dir}/${dec_model}"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
[[ $eval -eq 1 ]] && eval $cmd
else
dec_model=checkpoint_best.pt
fi
#tmp_file=$(mktemp ${model_dir}/tmp-XXXXX)
#trap 'rm -rf ${tmp_file}' EXIT
result_file=${model_dir}/decode_result
[[ -f ${result_file} ]] && rm ${result_file}
for subset in ${test_subset[@]}; do
subset=${subset}_asr
cmd="python ${root_dir}/fairseq_cli/generate.py
${data_dir}/$lang
--config-yaml ${data_config}
--gen-subset ${subset}
--task speech_to_text
--path ${model_dir}/${dec_model}
--results-path ${model_dir}
--max-tokens ${max_tokens}
--beam ${beam_size}
--scoring wer"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
if [[ $eval -eq 1 ]]; then
eval $cmd
tail -n 1 ${model_dir}/generate-${subset}.txt >> ${result_file}
fi
done
cat ${result_file}
fi
#! /bin/bash
# training the model
gpu_num=8
update_freq=1
max_tokens=40000
extra_tag=
extra_parameter=
#extra_tag="${extra_tag}"
#extra_parameter="${extra_parameter} "
exp_tag=
train_config=asr_train_ctc.yaml
cmd="./run.sh
--stage 1
--stop_stage 1
--gpu_num ${gpu_num}
--update_freq ${update_freq}
--train_config ${train_config}
--max_tokens ${max_tokens}
"
if [[ -n ${exp_tag} ]]; then
cmd="$cmd --exp_tag ${exp_tag}"
fi
if [[ -n ${extra_tag} ]]; then
cmd="$cmd --extra_tag ${extra_tag}"
fi
if [[ -n ${extra_parameter} ]]; then
cmd="$cmd --extra_parameter ${extra_parameter}"
fi
echo $cmd
eval $cmd
train-subset: train_st
valid-subset: dev_st
max-epoch: 50
max-update: 100000
num-workers: 8
patience: 10
no-progress-bar: True
log-interval: 100
seed: 1
report-accuracy: True
#load-params:
#load-pretrained-encoder-from:
arch: s2t_transformer_s
share-decoder-input-output-embed: True
optimizer: adam
clip-norm: 10.0
lr-scheduler: inverse_sqrt
warmup-init-lr: 1e-7
warmup-updates: 10000
lr: 2e-3
#adam_betas: (0.9,0.98)
criterion: label_smoothed_cross_entropy
label_smoothing: 0.1
conv-kernel-sizes: 5,5
conv-channels: 1024
dropout: 0.1
activation-fn: relu
encoder-embed-dim: 256
encoder-ffn-embed-dim: 2048
encoder-layers: 12
decoder-layers: 6
encoder-attention-heads: 4
#decoder-embed-dim: 256
#decoder-ffn-embed-dim: 2048
#decoder-attention-heads: 4
#attention-dropout: 0.1
#activation-dropout: 0.1
train-subset: train_st
valid-subset: dev_st
max-epoch: 50
max-update: 100000
num-workers: 8
patience: 10
no-progress-bar: True
log-interval: 100
seed: 1
report-accuracy: True
#load-params:
#load-pretrained-encoder-from:
arch: s2t_transformer_s
share-decoder-input-output-embed: True
optimizer: adam
clip-norm: 10.0
lr-scheduler: inverse_sqrt
warmup-init-lr: 1e-7
warmup-updates: 10000
lr: 2e-3
#adam_betas: (0.9,0.98)
ctc-weight: 0.3
criterion: label_smoothed_cross_entropy_with_ctc
label_smoothing: 0.1
conv-kernel-sizes: 5,5
conv-channels: 1024
dropout: 0.1
activation-fn: relu
encoder-embed-dim: 256
encoder-ffn-embed-dim: 2048
encoder-layers: 12
decoder-layers: 6
encoder-attention-heads: 4
#decoder-embed-dim: 256
#decoder-ffn-embed-dim: 2048
#decoder-attention-heads: 4
#attention-dropout: 0.1
#activation-dropout: 0.1
#! /bin/bash
# training the model
gpu_num=1
test_subset=(tst-COMMON)
exp_name=test
n_average=10
beam_size=5
max_tokens=40000
cmd="./run.sh
--stage 2
--stop_stage 2
--gpu_num ${gpu_num}
--exp_name ${exp_name}
--test_subset ${test_subset}
--n_average ${n_average}
--beam_size ${beam_size}
--max_tokens ${max_tokens}
"
echo $cmd
eval $cmd
gpu_num=1
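# Wait until ${gpu_num} GPUs are reported idle by gpustat, collect their ids in
# ${device[@]}, then run ${cmd} (assumed to be defined beforehand) and exit.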
while :
do
all_devices=$(seq 0 `gpustat | sed '1,2d' | wc -l`);
count=0
for dev in ${all_devices[@]}
do
line=`expr $dev + 2`
use=`gpustat -p | head -n $line | tail -1 | cut -d '|' -f4 | wc -w`
if [[ $use -eq 0 ]]; then
device[$count]=$dev
count=`expr $count + 1`
if [[ $count -eq $gpu_num ]]; then
break
fi
fi
done
if [[ ${#device[@]} -lt $gpu_num ]]; then
sleep 60s
else
echo "Run $cmd"
eval $cmd
sleep 10s
exit
fi
done
#!/usr/bin/env bash
# Copyright 2012 Johns Hopkins University (Author: Daniel Povey);
# Arnab Ghoshal, Karel Vesely
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED ON AN *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABILITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# Parse command-line options.
# To be sourced by another script (as in ". parse_options.sh").
# Option format is: --option-name arg
# and shell variable "option_name" gets set to value "arg."
# The exception is --help, which takes no arguments, but prints the
# $help_message variable (if defined).
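# Example (illustrative): if the calling script defines "stage=0" and is run as
#   ./run.sh --stage 2 --exp_tag baseline
# then after sourcing this file the shell variables become stage=2 and
# exp_tag=baseline (options for variables the script never defined are errors).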
###
### The --config file options have lower priority to command line
### options, so we need to import them first...
###
# Now import all the configs specified by command-line, in left-to-right order
for ((argpos=1; argpos<$#; argpos++)); do
if [ "${!argpos}" == "--config" ]; then
argpos_plus1=$((argpos+1))
config=${!argpos_plus1}
[ ! -r $config ] && echo "$0: missing config '$config'" && exit 1
. $config # source the config file.
fi
done
###
### Now we process the command line options
###
while true; do
[ -z "${1:-}" ] && break; # break if there are no arguments
case "$1" in
# If the enclosing script is called with --help option, print the help
# message and exit. Scripts should put help messages in $help_message
--help|-h) if [ -z "$help_message" ]; then echo "No help found." 1>&2;
else printf "$help_message\n" 1>&2 ; fi;
exit 0 ;;
--*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
exit 1 ;;
# If the first command-line argument begins with "--" (e.g. --foo-bar),
# then work out the variable name as $name, which will equal "foo_bar".
--*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
# Next we test whether the variable in question is undefined -- if so it's
# an invalid option and we die. Note: $0 evaluates to the name of the
# enclosing script.
# The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
# is undefined. We then have to wrap this test inside "eval" because
# foo_bar is itself inside a variable ($name).
eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
oldval="`eval echo \\$$name`";
# Work out whether we seem to be expecting a Boolean argument.
if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
was_bool=true;
else
was_bool=false;
fi
# Set the variable to the right value-- the escaped quotes make it work if
# the option had spaces, like --cmd "queue.pl -sync y"
eval $name=\"$2\";
# Check that Boolean-valued arguments are really Boolean.
if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
exit 1;
fi
shift 2;
;;
*) break;
esac
done
# Check for an empty argument to the --cmd option, which can easily occur as a
# result of scripting errors.
[ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
true; # so this script returns exit code 0.
MAIN_ROOT=$PWD/../../..
KALDI_ROOT=$MAIN_ROOT/tools/kaldi
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/src/lib
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/tools/chainer_ctc/ext/warp-ctc/build
. "${MAIN_ROOT}"/tools/activate_python.sh && . "${MAIN_ROOT}"/tools/extra_path.sh
export PATH=$MAIN_ROOT/utils:$MAIN_ROOT/espnet/bin:$PATH
export OMP_NUM_THREADS=1
# check extra module installation
if ! which tokenizer.perl > /dev/null; then
echo "Error: it seems that moses is not installed." >&2
echo "Error: please install moses as follows." >&2
echo "Error: cd ${MAIN_ROOT}/tools && make moses.done" >&2
return 1
fi
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
get_devices(){
gpu_num=$1
use_cpu=$2
device=()
while :
do
record=`mktemp -t temp.record.XXXXXX`
gpustat > $record
all_devices=$(seq 0 `cat $record | sed '1,2d' | wc -l`);
count=0
for dev in ${all_devices[@]}
do
line=`expr $dev + 2`
use=`cat $record | head -n $line | tail -1 | cut -d '|' -f3 | cut -d '/' -f1`
if [[ $use -eq 0 ]]; then
device[$count]=$dev
count=`expr $count + 1`
if [[ $count -eq $gpu_num ]]; then
break
fi
fi
done
if [[ ${#device[@]} -lt $gpu_num ]]; then
if [[ $use_cpu -eq 1 ]]; then
device=(-1)
else
sleep 60s
fi
else
break
fi
done
echo ${device[*]} | sed 's/ /,/g'
return $?
}
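# Illustrative usage (after sourcing this file), as in the run.sh recipes:
#   device=$(get_devices ${gpu_num} 0)   # block until ${gpu_num} GPUs are idle, then print e.g. "0,1"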
#! /bin/bash
# Processing MuST-C Datasets
# Copyright 2021 Natural Language Processing Laboratory
# Xu Chen (xuchenneu@163.com)
# Set bash to 'debug' mode; it will exit on:
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'; -x prints commands.
set -e
#set -u
set -o pipefail
export PYTHONIOENCODING=UTF-8
eval=1
time=$(date "+%m%d_%H%M")
stage=0
stop_stage=0
######## hardware ########
# devices
device=()
gpu_num=8
update_freq=1
root_dir=~/st/fairseq
pwd_dir=$PWD
# dataset
src_lang=en
tgt_lang=de
lang=${src_lang}-${tgt_lang}
dataset=mustc
task=speech_to_text
vocab_type=unigram
asr_vocab_size=5000
vocab_size=8000
share_dict=0
data_dir=~/st/data/${dataset}
test_subset=(tst-COMMON)
# exp
extra_tag=
extra_parameter=
exp_tag=baseline
exp_name=
# config
train_config=st_train_ctc.yaml
# training setting
fp16=1
max_tokens=40000
step_valid=0
bleu_valid=0
# decoding setting
n_average=10
beam_size=5
. ./local/parse_options.sh || exit 1;
if [[ $step_valid -eq 1 ]]; then
validate_interval=10000
save_interval=10000
no_epoch_checkpoints=1
save_interval_updates=5000
keep_interval_updates=3
else
validate_interval=1
keep_last_epochs=10
fi
if [[ ${share_dict} -eq 1 ]]; then
data_config=config_st_share.yaml
else
data_config=config_st.yaml
fi
# full path
train_config=$pwd_dir/conf/${train_config}
if [[ -z ${exp_name} ]]; then
exp_name=$(basename ${train_config%.*})_${exp_tag}${extra_tag}
fi
model_dir=$root_dir/../checkpoints/$dataset/$task/st/${exp_name}
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
echo "stage -1: Data Download"
# pass
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
### Task dependent. You have to prepare the data in the following part by yourself.
### But you can utilize Kaldi recipes in most cases
echo "stage 0: ASR Data Preparation"
cmd="python ${root_dir}/examples/speech_to_text/prep_mustc_data.py
--data-root ${data_dir}
--task asr
--vocab-type ${vocab_type}
--vocab-size ${asr_vocab_size}"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
#[[ $eval -eq 1 && ${share_dict} -ne 1 ]] && eval $cmd
echo "stage 0: ST Data Preparation"
cmd="python ${root_dir}/examples/speech_to_text/prep_mustc_data.py
--data-root ${data_dir}
--task st
--add-src
--cmvn-type utterance
--vocab-type ${vocab_type}
--vocab-size ${vocab_size}"
if [[ $share_dict -eq 1 ]]; then
cmd="$cmd
--share"
else
cmd="$cmd
--asr-prefix spm_${vocab_type}${asr_vocab_size}_asr"
fi
echo -e "\033[34mRun command: \n${cmd} \033[0m"
[[ $eval -eq 1 ]] && eval ${cmd}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: ST Network Training"
[[ ! -d ${data_dir} ]] && echo "The data dir ${data_dir} does not exist!" && exit 1;
if [[ -z ${device} || ${#device[@]} -eq 0 ]]; then
if [[ ${gpu_num} -eq 0 ]]; then
device=()
else
source ./local/utils.sh
device=$(get_devices $gpu_num 0)
fi
fi
echo -e "dev=${device} data=${data_dir} model=${model_dir}"
if [[ ! -d ${model_dir} ]]; then
mkdir -p ${model_dir}
else
echo "${model_dir} exists."
fi
cp ${BASH_SOURCE[0]} ${model_dir}
cp ${PWD}/train.sh ${model_dir}
cp ${train_config} ${model_dir}
cmd="python3 -u ${root_dir}/fairseq_cli/train.py
${data_dir}/$lang
--config-yaml ${data_config}
--train-config ${train_config}
--task speech_to_text
--max-tokens ${max_tokens}
--update-freq ${update_freq}
--log-interval 100
--save-dir ${model_dir}
--tensorboard-logdir ${model_dir}"
if [[ -n ${extra_parameter} ]]; then
cmd="${cmd}
${extra_parameter}"
fi
if [[ ${gpu_num} -gt 0 ]]; then
cmd="${cmd}
--distributed-world-size $gpu_num
--ddp-backend no_c10d"
fi
if [[ $fp16 -eq 1 ]]; then
cmd="${cmd}
--fp16"
fi
if [[ $bleu_valid -eq 1 ]]; then
cmd="$cmd
--eval-bleu
--eval-bleu-args '{\"beam\": 1}'
--eval-tokenized-bleu
--eval-bleu-remove-bpe
--best-checkpoint-metric bleu
--maximize-best-checkpoint-metric"
fi
if [[ -n $no_epoch_checkpoints && $no_epoch_checkpoints -eq 1 ]]; then
cmd="$cmd
--no-epoch-checkpoints"
fi
if [[ -n $validate_interval ]]; then
cmd="${cmd}
--validate-interval $validate_interval "
fi
if [[ -n $save_interval ]]; then
cmd="${cmd}
--save-interval $save_interval "
fi
if [[ -n $keep_last_epochs ]]; then
cmd="${cmd}
--keep-last-epochs $keep_last_epochs "
fi
if [[ -n $save_interval_updates ]]; then
cmd="${cmd}
--save-interval-updates $save_interval_updates"
if [[ -n $keep_interval_updates ]]; then
cmd="${cmd}
--keep-interval-updates $keep_interval_updates"
fi
fi
echo -e "\033[34mRun command: \n${cmd} \033[0m"
# save info
log=./history.log
echo "${time} | ${device} | ${data_dir} | ${model_dir} " >> $log
cat $log | tail -n 50 > tmp.log
mv tmp.log $log
export CUDA_VISIBLE_DEVICES=${device}
cmd="nohup ${cmd} >> ${model_dir}/train.log 2>&1 &"
if [[ $eval -eq 1 ]]; then
eval $cmd
sleep 2s
tail -n `wc -l ${model_dir}/train.log | awk '{print $1+1}'` -f ${model_dir}/train.log
fi
fi
wait
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: ST Decoding"
if [[ ${n_average} -ne 1 ]]; then
# Average models
dec_model=avg_${n_average}_checkpoint.pt
cmd="python ${root_dir}/scripts/average_checkpoints.py
--inputs ${model_dir}
--num-epoch-checkpoints ${n_average}
--output ${model_dir}/${dec_model}"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
[[ $eval -eq 1 ]] && eval $cmd
else
dec_model=checkpoint_best.pt
fi
#tmp_file=$(mktemp ${model_dir}/tmp-XXXXX)
#trap 'rm -rf ${tmp_file}' EXIT
result_file=${model_dir}/decode_result
[[ -f ${result_file} ]] && rm ${result_file}
for subset in ${test_subset[@]}; do
subset=${subset}_st
cmd="python ${root_dir}/fairseq_cli/generate.py
${data_dir}/$lang
--config-yaml ${data_config}
--gen-subset ${subset}
--task speech_to_text
--path ${model_dir}/${dec_model}
--results-path ${model_dir}
--max-tokens ${max_tokens}
--beam ${beam_size}
--scoring sacrebleu"
echo -e "\033[34mRun command: \n${cmd} \033[0m"
if [[ $eval -eq 1 ]]; then
eval $cmd
tail -n 1 ${model_dir}/generate-${subset}.txt >> ${result_file}
fi
done
cat ${result_file}
fi
#! /bin/bash
# training the model
gpu_num=8
update_freq=1
max_tokens=40000
extra_tag=
extra_parameter=
#extra_tag="${extra_tag}"
#extra_parameter="${extra_parameter} "
exp_tag=
train_config=st_train_ctc.yaml
cmd="./run.sh
--stage 1
--stop_stage 1
--gpu_num ${gpu_num}
--update_freq ${update_freq}
--train_config ${train_config}
--max_tokens ${max_tokens}
"
if [[ -n ${exp_tag} ]]; then
cmd="$cmd --exp_tag ${exp_tag}"
fi
if [[ -n ${extra_tag} ]]; then
cmd="$cmd --extra_tag ${extra_tag}"
fi
if [[ -n ${extra_parameter} ]]; then
cmd="$cmd --extra_parameter ${extra_parameter}"
fi
echo $cmd
eval $cmd
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
try:
from fairseq.version import __version__ # noqa
except ImportError:
pass
# Adaptive Span
Adaptive Span is a novel self-attention mechanism that can learn its optimal
attention span. This allows us to significantly extend the maximum context size
used in Transformers, while maintaining control over the memory footprint
and computation time. It uses the Truncated BPTT technique for training,
as in [transformerXL](https://github.com/pytorch/fairseq/blob/master/examples/truncated_bptt/README.md).
Adaptive Span was introduced in the paper
[Adaptive Attention Span in Transformers](https://arxiv.org/abs/1905.07799),
which achieved state-of-the-art language modeling results at the time of publication.
We manage to reproduce their result in fairseq and keep most of the
[original implementation](https://github.com/facebookresearch/adaptive-span) untouched.
You can also refer to their sweep file if any hyperparameter combination is unclear.
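To make the mechanism concrete, here is a minimal, self-contained sketch of the soft span mask (illustrative only; the function and variable names are ours and the shapes are simplified relative to the fairseq implementation):

```python
import torch

def soft_span_mask(span_frac: torch.Tensor, max_span: int, ramp: int) -> torch.Tensor:
    """Mask over the last `max_span` positions: ~1 for the most recent
    `span_frac * max_span` steps, ~0 for older ones, with a linear ramp of
    width `ramp` in between, so the span length stays differentiable."""
    template = torch.linspace(1 - max_span, 0, steps=max_span)  # -(L-1) .. 0
    mask = (template + span_frac * max_span) / ramp + 1
    return mask.clamp(0, 1)

# toy check: a learnable fraction of a 512-step span
span_frac = torch.nn.Parameter(torch.tensor(0.25))
mask = soft_span_mask(span_frac, max_span=512, ramp=32)
print(mask[:4], mask[-4:])  # distant past ~0, most recent positions ~1
```

Because the mask is differentiable in `span_frac`, an L1 penalty on the span (weighted by `--aux-loss-scaler`) lets each head shrink its context to what it actually needs.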
##### 0. Setup
First you need to process the Enwik8 dataset; we use the pre-tokenized dataset
from [adaptive span paper](https://github.com/facebookresearch/adaptive-span/blob/master/get_data.sh).
You can download the dataset, and then run:
```bash
fairseq-preprocess --only-source --trainpref ~/data/enwik8/train.txt \
--validpref ~/data/enwik8/valid.txt --testpref ~/data/enwik8/test.txt \
--destdir ~/data/enwik8/data-bin/ --joined-dictionary --workers 20
```
##### 1. Train an Adaptive Span model on Enwik8
We will train a 12-layer Adaptive Span model following the [hyperparameters
used in the original
paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
The following command assumes 4 GPUs, so that the total batch size is 64
sequences (4 x 16). Training should take 2-3 days on 4 V100 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
--user-dir examples/adaptive_span \
--data ~/data/enwik8/data-bin/ \
--fp16 --fp16-no-flatten-grads --max-update 600000 \
--task truncated_bptt_lm --tokens-per-sample 512 --arch adaptive_span \
--n-layer 12 --d-model 512 --n-head 8 --d-inner 2048 --dropout 0.3 \
--attn-span 8192 --optimizer adagrad_with_grad_clip --adagrad-clip 0.03 \
--validate-interval-updates 1000 \
--lr-scheduler fixed --warmup-updates 32000 --batch-size-valid 32 \
--lr 0.07 --criterion adaptive_span_loss --batch-size 16 --update-freq 1 \
--seed 2 --log-format json --log-interval 25 --aux-loss-scaler 5e-07
```
This should land around 1.05 on validation, 1.03 on test. You can lower
`--aux-loss-scaler` for better performance (longer span). It gives a ~0.03 bpc
improvement over the transformerXL baseline here.
If training on a single GPU, set `--update-freq=4` to accumulate 4x gradients
and simulate training on 4 GPUs.
You can also reproduce the transformerXL result on enwik8 using this code base.
It should land around 1.06 on test, matching the [original paper](https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_enwik8_base.sh).
You can try it with:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
--user-dir examples/truncated_bptt \
~/data/enwik8/data-bin/ \
--task truncated_bptt_lm --fp16 --max-update 400000 \
--tokens-per-sample 512 --arch transformer_xl --n-layer 12 \
--d-model 512 --n-head 8 --d-head 64 --d-inner 2048 --dropout 0.1 \
--dropatt 0.0 --mem-len 512 --optimizer adam --clip-norm 0.25 \
--lr-scheduler cosine --warmup-updates 0 \
--lr 0.0 --lr 0.00025 --batch-size 15 \
--update-freq 1 --seed 2 --log-format json --log-interval 25 \
--fp16
```
##### 2. Evaluate
For Adaptive Span:
```bash
fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
--user-dir examples/adaptive_span \
--task truncated_bptt_lm --batch-size 8 --tokens-per-sample 512 --gen-subset test
```
For Transformer-XL evaluation:
```bash
fairseq-eval-lm ~/data/enwik8/data-bin/ --path model/checkpoint_best.pt \
--user-dir examples/truncated_bptt/ --task truncated_bptt_lm --batch-size 8 \
--tokens-per-sample 80 \
--model-overrides '{"mem_len":2100,"clamp_len":820,"same_length":True}' \
--gen-subset valid
```
*Note:* During training the model saw 512 tokens of context
(``--tokens-per-sample=512``), with batch size 8. These settings match the evaluation
settings from [the original
paper](https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8.sh).
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import importlib
import os
# automatically import any Python files in the current directory
cur_dir = os.path.dirname(__file__)
for file in os.listdir(cur_dir):
path = os.path.join(cur_dir, file)
if (
not file.startswith("_")
and not file.startswith(".")
and (file.endswith(".py") or os.path.isdir(path))
):
mod_name = file[: file.find(".py")] if file.endswith(".py") else file
module = importlib.import_module(__name__ + "." + mod_name)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
from torch.optim import Adagrad
from fairseq.optim import LegacyFairseqOptimizer, register_optimizer
@register_optimizer("adagrad_with_grad_clip")
class FairseqAdagradWithGradClip(LegacyFairseqOptimizer):
def __init__(self, args, params):
super().__init__(args)
self._optimizer = AdagradWithGradClip(params, **self.optimizer_config)
@staticmethod
def add_args(parser):
"""Add optimizer-specific arguments to the parser."""
# fmt: off
parser.add_argument('--weight-decay', '--wd', default=0.0, type=float, metavar='WD',
help='weight decay')
parser.add_argument('--adagrad-clip', default=0.0, type=float, metavar='D',
help='internal grad clip')
# fmt: on
@property
def optimizer_config(self):
"""
Return a kwarg dictionary that will be used to override optimizer
args stored in checkpoints. This allows us to load a checkpoint and
resume training using a different set of optimizer args, e.g., with a
different learning rate.
"""
return {
"lr": self.args.lr[0],
"weight_decay": self.args.weight_decay,
"grad_clip": self.args.adagrad_clip,
}
@property
def supports_flat_params(self):
return False
def _clip_grad(clr, grad, group_grad_clip):
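# Note: rather than modifying the gradient itself, this scales the per-update
# learning rate `clr` down when the gradient norm exceeds `group_grad_clip`,
# which bounds the step size while leaving the Adagrad accumulators based on
# the raw (unclipped) gradient.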
if group_grad_clip > 0:
norm = grad.norm(2).item()
if norm > group_grad_clip:
clr *= group_grad_clip / (norm + 1e-10)
return clr
class AdagradWithGradClip(Adagrad):
"""Adagrad algorithm with custom gradient clipping"""
def __init__(
self,
params,
lr=1e-2,
lr_decay=0,
weight_decay=0,
initial_accumulator_value=0,
grad_clip=0,
):
Adagrad.__init__(
self,
params,
lr=lr,
lr_decay=lr_decay,
weight_decay=weight_decay,
initial_accumulator_value=initial_accumulator_value,
)
self.defaults["grad_clip"] = grad_clip
self.param_groups[0].setdefault("grad_clip", grad_clip)
def step(self, closure=None):
loss = None
if closure is not None:
loss = closure()
for group in self.param_groups:
for p in group["params"]:
if p.grad is None:
continue
grad = p.grad.data
state = self.state[p]
state["step"] += 1
if group["weight_decay"] != 0:
if p.grad.data.is_sparse:
raise RuntimeError(
"weight_decay option is "
"not compatible with sparse "
"gradients"
)
grad = grad.add(group["weight_decay"], p.data)
clr = group["lr"] / (1 + (state["step"] - 1) * group["lr_decay"])
# clip
clr = _clip_grad(clr=clr, grad=grad, group_grad_clip=group["grad_clip"])
if grad.is_sparse:
# the update is non-linear so indices must be unique
grad = grad.coalesce()
grad_indices = grad._indices()
grad_values = grad._values()
size = grad.size()
def make_sparse(values):
constructor = grad.new
if grad_indices.dim() == 0 or values.dim() == 0:
return constructor().resize_as_(grad)
return constructor(grad_indices, values, size)
state["sum"].add_(make_sparse(grad_values.pow(2)))
std = state["sum"]._sparse_mask(grad)
std_values = std._values().sqrt_().add_(1e-10)
p.data.add_(-clr, make_sparse(grad_values / std_values))
else:
state["sum"].addcmul_(1, grad, grad)
std = state["sum"].sqrt().add_(1e-10)
p.data.addcdiv_(-clr, grad, std)
return loss
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class AdaptiveMask(nn.Module):
"""Soft masking function for adaptive size.
It masks out the last K values of an input. The masking value
goes from 1 to 0 gradually, so K can be learned with
back-propagation.
Args:
max_size: maximum size (i.e. input dimension)
ramp_size: size of the ramp going from 0 to 1
init_val: initial size proportion not to be masked out
shape: learn multiple sizes independent of each other
"""
def __init__(self, max_size, ramp_size, init_val=0, shape=(1,)):
nn.Module.__init__(self)
self._max_size = max_size
self._ramp_size = ramp_size
self.current_val = nn.Parameter(torch.zeros(*shape) + init_val)
mask_template = torch.linspace(1 - max_size, 0, steps=max_size)
self.register_buffer("mask_template", mask_template)
def forward(self, x):
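# mask value at position j (0 = oldest, max_size - 1 = newest) is
# clamp((j - (max_size - 1) + current_val * max_size) / ramp_size + 1, 0, 1):
# roughly the newest current_val * max_size positions pass through unchanged,
# while older ones fade to zero over a ramp of ramp_size steps.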
mask = self.mask_template.float() + self.current_val.float() * self._max_size
mask = mask / self._ramp_size + 1
mask = mask.clamp(0, 1)
if x.size(-1) < self._max_size:
# the input could have been trimmed beforehand to save computation
mask = mask.narrow(-1, self._max_size - x.size(-1), x.size(-1))
x = (x * mask).type_as(x)
return x
def get_current_max_size(self, include_ramp=True):
current_size = math.ceil(self.current_val.max().item() * self._max_size)
if include_ramp:
current_size += self._ramp_size
current_size = max(0, min(self._max_size, current_size))
return current_size
def get_current_avg_size(self, include_ramp=True):
current_size = math.ceil(
self.current_val.float().mean().item() * self._max_size
)
if include_ramp:
current_size += self._ramp_size
current_size = max(0, min(self._max_size, current_size))
return current_size
def clamp_param(self):
"""this need to be called after each update"""
self.current_val.data.clamp_(0, 1)
class AdaptiveSpan(nn.Module):
"""Adaptive attention span for Transformerself.
This module learns an attention span length from data for each
self-attention head.
Args:
attn_span: maximum attention span
adapt_span_loss: loss coefficient for the span length
adapt_span_ramp: length of the masking ramp
adapt_span_init: initial size ratio
adapt_span_cache: adapt cache size to reduce memory usage
"""
def __init__(
self,
attn_span,
adapt_span_ramp,
adapt_span_init,
n_head,
adapt_span_layer,
**kargs
):
nn.Module.__init__(self)
self._max_span = attn_span
self._n_head = n_head
self._adapt_span_layer = adapt_span_layer
if self._adapt_span_layer:
self._mask = AdaptiveMask(
max_size=self._max_span,
ramp_size=adapt_span_ramp,
init_val=adapt_span_init,
)
else:
self._mask = AdaptiveMask(
max_size=self._max_span,
ramp_size=adapt_span_ramp,
init_val=adapt_span_init,
shape=(n_head, 1, 1),
)
def forward(self, attn, normalize=True):
"""mask attention with the right span"""
# batch and head dimensions are merged together, so separate them first
self.clamp_param()
if self._adapt_span_layer:
attn = self._mask(attn)
else:
B = attn.size(0) # batch size
M = attn.size(1) # block size
attn = attn.reshape(B // self._n_head, self._n_head, M, -1)
attn = self._mask(attn)
attn = attn.view(B, M, -1)
return attn
def get_trim_len(self):
"""how much of memory can be trimmed to reduce computation"""
L = self._max_span
trim_len = min(L - 1, L - self._mask.get_current_max_size())
# too fine granularity might be bad for the memory management
trim_len = math.floor(trim_len / 64) * 64
return trim_len
def trim_memory(self, query, key, value, key_pe):
"""trim out unnecessary memory beforehand to reduce computation"""
trim_len = self.get_trim_len()
cache_size = key.size(1) - query.size(1)
trim_len_cache = trim_len - (self._max_span - cache_size)
if trim_len_cache > 0:
key = key[:, trim_len_cache:, :]
value = value[:, trim_len_cache:, :]
elif trim_len_cache < 0:
# cache is too short! this happens when validation resumes
# after a lot of updates.
key = F.pad(key, [0, 0, -trim_len_cache, 0])
value = F.pad(value, [0, 0, -trim_len_cache, 0])
if trim_len > 0:
if key_pe is not None:
key_pe = key_pe[:, :, trim_len:]
return key, value, key_pe
def get_cache_size(self):
"""determine how long the cache should be"""
trim_len = self.get_trim_len()
# give a buffer of 64 steps since a span might increase
# in future updates
return min(self._max_span, self._max_span - trim_len + 64)
def get_loss(self):
"""a loss term for regularizing the span length"""
return self._max_span * self._mask.current_val.float().mean()
def get_current_max_span(self):
return self._mask.get_current_max_size()
def get_current_avg_span(self):
return self._mask.get_current_avg_size()
def clamp_param(self):
self._mask.clamp_param()
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import math
from dataclasses import dataclass
import torch.nn.functional as F
from fairseq import metrics, utils
from fairseq.criterions import register_criterion
from fairseq.criterions.cross_entropy import CrossEntropyCriterion
from fairseq.dataclass import FairseqDataclass
from omegaconf import II
@dataclass
class AdaptiveSpanCriterionConfig(FairseqDataclass):
sentence_avg: bool = II("optimization.sentence_avg")
@register_criterion("adaptive_span_loss", dataclass=AdaptiveSpanCriterionConfig)
class AdaptiveSpanCriterion(CrossEntropyCriterion):
def __init__(self, task, sentence_avg):
super().__init__(task, sentence_avg)
def forward(self, model, sample, reduce=True):
"""Compute the loss for the given sample.
Returns a tuple with three elements:
1) the loss here is summed, different from the adaptive span code
2) the sample size, which is used as the denominator for the gradient
3) logging outputs to display while training
"""
net_output = model(**sample["net_input"])
loss, aux_loss, avg_span, max_span = self.compute_loss(
model, net_output, sample, reduce=reduce
)
sample_size = (
sample["target"].size(0) if self.sentence_avg else sample["ntokens"]
)
loss /= sample_size
total_loss = loss + aux_loss
sample_size = 1
logging_output = {
"loss": loss.data,
"ntokens": sample["ntokens"],
"nsentences": sample["target"].size(0),
"sample_size": sample_size,
"total_loss": total_loss.data,
"avg_span": avg_span * sample_size,
"max_span": max_span * sample_size,
}
return total_loss, sample_size, logging_output
def compute_loss(self, model, net_output, sample, reduce=True):
loss, _ = super().compute_loss(model, net_output, sample, reduce)
aux_loss = model.get_aux_loss()
avg_span = model.get_current_avg_span()
max_span = model.get_current_max_span()
return loss, aux_loss, avg_span, max_span
@staticmethod
def reduce_metrics(logging_outputs) -> None:
"""Aggregate logging outputs from data parallel training."""
loss_sum = sum(log.get("loss", 0) for log in logging_outputs)
ntokens = sum(log.get("ntokens", 0) for log in logging_outputs)
sample_size = sum(log.get("sample_size", 0) for log in logging_outputs)
total_loss_sum = sum(log.get("total_loss", 0) for log in logging_outputs)
avg_span_sum = sum(log.get("avg_span", 0) for log in logging_outputs)
max_span_sum = sum(log.get("max_span", 0) for log in logging_outputs)
# we divide by log(2) to convert the loss from base e to base 2
metrics.log_scalar(
"loss", loss_sum / sample_size / math.log(2), sample_size, round=3
)
metrics.log_scalar("avg_span", avg_span_sum / sample_size, sample_size, round=3)
metrics.log_scalar("max_span", max_span_sum / sample_size, sample_size, round=3)
# total loss contains the L1 norm on adaptive-span
metrics.log_scalar(
"total_loss",
total_loss_sum / sample_size / math.log(2),
sample_size,
round=3,
)
if sample_size != ntokens:
metrics.log_scalar(
"nll_loss", loss_sum / ntokens / math.log(2), ntokens, round=3
)
metrics.log_derived(
"ppl", lambda meters: utils.get_perplexity(meters["nll_loss"].avg)
)
else:
metrics.log_derived(
"ppl", lambda meters: utils.get_perplexity(meters["loss"].avg)
)
@staticmethod
def logging_outputs_can_be_summed() -> bool:
"""
Whether the logging outputs returned by `forward` can be summed
across workers prior to calling `reduce_metrics`. Setting this
to True will improve distributed training speed.
"""
return True
# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from fairseq.modules.layer_norm import LayerNorm
from .adaptive_span_attention import AdaptiveSpan
# Size notations:
# B = batch_size, H = d_model, M = block_size, L = attn_span
def _skew(X, pad_value):
"""shift every row 1 step to right"""
# X = B x M x L
B, M, L = X.size()
X = F.pad(X, (0, M + 1), value=pad_value) # B x M x (L+M+1)
X = X.view(B, -1) # B x ML+MM+M
X = X[:, :-M] # B x ML+MM
X = X.view(B, M, M + L) # B x M x L+M
return X
def _unskew(X):
"""reverse _skew operation"""
# X = B x M x L+M
B, M, L = X.size()
L -= M
X = X.view(B, -1) # B x ML+MM
X = F.pad(X, (0, M)) # B x ML+MM+M
X = X.view(B, M, M + L + 1) # B x M x L+M+1
X = X[:, :, :L] # B x M x L
return X
class SeqAttention(nn.Module):
"""Sequential self-attention layer.
Each token will attend to its previous fixed number of steps.
Note that attention doesn't include the current step itself.
"""
def __init__(self, d_model, n_head, attn_span, dropout, adapt_span_layer, **kargs):
nn.Module.__init__(self)
self.dropout = nn.Dropout(dropout)
self.d_model = d_model # size of a single head
self.attn_span = attn_span
self.adaptive_span = AdaptiveSpan(
attn_span=attn_span,
n_head=n_head,
adapt_span_layer=adapt_span_layer,
**kargs
)
def forward(self, query, key, value, key_pe):
# query size = B x M x H
# key, value sizes = B x (M+L) x H
key, value, key_pe = self.adaptive_span.trim_memory(query, key, value, key_pe)
# compute attention from context
# B x M (dest) x (M+L) (src)
attn_cont = torch.matmul(query, key.transpose(-1, -2))
attn_cont = _unskew(attn_cont) # B x M x L
# compute the effect of position embedding
attn_pos = torch.matmul(query, key_pe) # B x M x L_pos
attn = attn_cont + attn_pos
attn = attn / math.sqrt(self.d_model) # B x M X L_pos
attn = F.softmax(attn.float(), dim=-1).type_as(attn)
# trim attention lengths according to the learned span
attn = self.adaptive_span(attn)
attn = self.dropout(attn) # B x M X L_pos
attn_cont = _skew(attn, 0) # B x M X (L+M)
out = torch.matmul(attn_cont, value) # B x M x H
return out
def get_cache_size(self):
return self.adaptive_span.get_cache_size()
class MultiHeadSeqAttention(nn.Module):
def __init__(self, d_model, n_head, **kargs):
nn.Module.__init__(self)
assert d_model % n_head == 0
self.n_head = n_head
self.head_dim = d_model // n_head
self.attn = SeqAttention(d_model=self.head_dim, n_head=n_head, **kargs)
self.proj_query = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_query.weight)
self.proj_out = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_out.weight)
self.proj_val = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_val.weight)
self.proj_key = nn.Linear(d_model, d_model, bias=False)
nn.init.xavier_normal_(self.proj_key.weight)
def head_reshape(self, x):
K = self.n_head
D = self.head_dim
x = x.view(x.size()[:-1] + (K, D)) # B x (M+L) x K x D
x = x.transpose(1, 2).contiguous() # B x K x (M+L) x D
x = x.view(-1, x.size(-2), x.size(-1)) # B_K x (M+L) x D
return x
def forward(self, query, key, value, key_pe):
B = query.size(0)
K = self.n_head
D = self.head_dim
M = query.size(1)
query = self.proj_query(query)
query = self.head_reshape(query)
value = self.proj_val(value)
value = self.head_reshape(value)
key = self.proj_key(key)
key = self.head_reshape(key)
out = self.attn(query, key, value, key_pe) # B_K x M x D
out = out.view(B, K, M, D) # B x K x M x D
out = out.transpose(1, 2).contiguous() # B x M x K x D
out = out.view(B, M, -1) # B x M x K_D
out = self.proj_out(out)
return out
class FeedForwardLayer(nn.Module):
def __init__(self, d_model, d_inner, dropout, **kargs):
nn.Module.__init__(self)
self.fc1 = nn.Linear(d_model, d_inner)
self.fc2 = nn.Linear(d_inner, d_model)
nn.init.xavier_uniform_(self.fc1.weight)
nn.init.xavier_uniform_(self.fc2.weight)
self.dropout = nn.Dropout(dropout)
def forward(self, h):
h1 = F.relu(self.fc1(h))
h1 = self.dropout(h1)
h2 = self.fc2(h1)
return h2
class TransformerSeqLayer(nn.Module):
def __init__(self, d_model, **kargs):
nn.Module.__init__(self)
self.attn = MultiHeadSeqAttention(d_model=d_model, **kargs)
self.norm1 = LayerNorm(d_model)
self.ff = FeedForwardLayer(d_model=d_model, **kargs)
self.norm2 = LayerNorm(d_model)
def forward(self, h, h_cache, key_pe):
# h = B x M x H
# h_cache = B x L x H
h_all = torch.cat([h_cache, h], dim=1) # B x (M+L) x H
attn_out = self.attn(h, h_all, h_all, key_pe)
h = self.norm1(h + attn_out) # B x M x H
if self.ff is not None:
ff_out = self.ff(h)
out = self.norm2(h + ff_out) # B x M x H
else:
out = h
return out
def get_cache_size(self):
return self.attn.attn.get_cache_size()
class TransformerSeq(nn.Module):
def __init__(
self,
vocab_size,
d_model,
n_head,
n_layer,
attn_span,
emb_dropout,
aux_loss_scaler,
adapt_span_layer,
**kargs
):
nn.Module.__init__(self)
# token embeddings
self.in_emb = nn.Embedding(vocab_size, d_model)
nn.init.normal_(self.in_emb.weight, mean=0, std=d_model ** -0.5)
self.out_emb = nn.Linear(d_model, vocab_size)
self.aux_loss_scaler = aux_loss_scaler
if emb_dropout > 0:
self.emb_dropout = nn.Dropout(emb_dropout)
else:
self.emb_dropout = None
# position embeddings
self.key_pe = nn.Parameter(torch.randn(1, d_model // n_head, attn_span))
self.layers = nn.ModuleList()
self.layers.extend(
TransformerSeqLayer(
d_model=d_model,
n_head=n_head,
attn_span=attn_span,
adapt_span_layer=adapt_span_layer,
**kargs
)
for _ in range(n_layer)
)
def forward(self, x, h_cache, target=None):
# x size = B x M
block_size = x.size(1)
h = self.in_emb(x) # B x M x H
if self.emb_dropout is not None:
h = self.emb_dropout(h)
h_cache_next = []
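# for each layer, keep its last `cache_size` input states (detached) as the
# memory that the next training block will attend over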
for l, layer in enumerate(self.layers):
cache_size = layer.attn.attn.get_cache_size()
if cache_size > block_size:
h_cache_next_l = torch.cat(
[h_cache[l][:, -cache_size + block_size :, :], h], dim=1
).detach()
else:
h_cache_next_l = h[:, -cache_size:, :].detach()
h_cache_next.append(h_cache_next_l)
h = layer(h, h_cache[l], self.key_pe) # B x M x H
if self.emb_dropout is not None:
h = self.emb_dropout(h)
out = F.log_softmax(self.out_emb(h).float(), dim=-1).type_as(h)
dummy_loss = None
return out, h_cache_next, dummy_loss
def get_aux_loss(self):
loss = 0.0
for layer in self.layers:
loss += layer.attn.attn.adaptive_span.get_loss()
return self.aux_loss_scaler * loss
def get_current_max_span(self):
max_span = 0.0
for layer in self.layers:
max_span = max(
max_span, layer.attn.attn.adaptive_span.get_current_max_span()
)
return max_span
def get_current_avg_span(self):
avg_span = 0.0
for layer in self.layers:
avg_span += layer.attn.attn.adaptive_span.get_current_avg_span()
return avg_span / len(self.layers)
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import logging
from dataclasses import dataclass
from typing import Dict, List, Optional
import torch
from fairseq.dataclass import FairseqDataclass
from fairseq.models import (
FairseqIncrementalDecoder,
FairseqLanguageModel,
register_model,
)
from .adaptive_span_model import TransformerSeq as AdaptiveSpanTransformerModel
logger = logging.getLogger(__name__)
@dataclass
class AdaptiveSpanSmallConfig(FairseqDataclass):
# defaults come from https://github.com/facebookresearch/adaptive-span/blob/master/experiments/enwik8_small.sh
vocab_size: int = 50
d_model: int = 256
n_head: int = 4
d_inner: int = 1024
n_layer: int = 8
attn_span: int = 1024
dropout: float = 0.0
emb_dropout: float = 0.0
adapt_span_ramp: int = 32
adapt_span_init: float = 0.0
aux_loss_scaler: float = 0.000002
adapt_span_layer: bool = False
@register_model("adaptive_span", dataclass=AdaptiveSpanSmallConfig)
class AdaptiveSpanTransformer(FairseqLanguageModel):
@classmethod
def build_model(cls, cfg: AdaptiveSpanSmallConfig, task):
return cls(AdaptiveSpanDecoder(cfg, task))
def get_aux_loss(self):
return self.decoder.get_aux_loss()
def get_current_max_span(self):
return self.decoder.get_current_max_span()
def get_current_avg_span(self):
return self.decoder.get_current_avg_span()
class AdaptiveSpanDecoder(FairseqIncrementalDecoder):
def __init__(self, cfg, task):
super().__init__(task.target_dictionary)
self.config = cfg
config = AdaptiveSpanSmallConfig(
vocab_size=len(task.target_dictionary),
d_model=cfg.d_model,
n_head=cfg.n_head,
d_inner=cfg.d_inner,
n_layer=cfg.n_layer,
attn_span=cfg.attn_span,
dropout=cfg.dropout,
emb_dropout=cfg.emb_dropout,
adapt_span_ramp=cfg.adapt_span_ramp,
adapt_span_init=cfg.adapt_span_init,
aux_loss_scaler=cfg.aux_loss_scaler,
adapt_span_layer=cfg.adapt_span_layer,
)
logger.info(config)
self.model = AdaptiveSpanTransformerModel(**config.__dict__)
self._mems = None
def forward(
self,
src_tokens,
incremental_state: Optional[Dict[str, List[torch.Tensor]]] = None,
encoder_out=None,
):
bsz = src_tokens.size(0)
if incremental_state is not None: # used during inference
mems = self.get_incremental_state("mems")
src_tokens = src_tokens[:, -1:] # only keep the most recent token
else:
mems = self._mems
if mems is None:
# first time init
mems = self.init_hid_cache(bsz)
output = self.model(x=src_tokens, h_cache=mems,)
if incremental_state is not None:
self.set_incremental_state(incremental_state, "mems", output[1])
else:
self._mems = output[1]
return (output[0],)
def max_positions(self):
return self.config.attn_span
def init_hid_cache(self, batch_sz):
hid = []
for layer in self.model.layers:
param = next(self.model.parameters())
h = torch.zeros(
batch_sz,
layer.get_cache_size(),
self.config.d_model,
dtype=param.dtype,
device=param.device,
)
hid.append(h)
return hid
def get_aux_loss(self):
return self.model.get_aux_loss()
def get_current_max_span(self):
return self.model.get_current_max_span()
def get_current_avg_span(self):
return self.model.get_current_avg_span()
def reorder_incremental_state(
self,
incremental_state: Dict[str, Dict[str, Optional[torch.Tensor]]],
new_order: torch.Tensor,
):
"""Reorder incremental state.
This will be called when the order of the input has changed from the
previous time step. A typical use case is beam search, where the input
order changes between time steps based on the selection of beams.
"""
raise NotImplementedError("This is required for generation/beam search")
# mems = self.get_incremental_state(incremental_state, "mems")
# if mems is not None:
# new_mems = [mems_i.index_select(1, new_order) for mems_i in mems]
# self.set_incremental_state(incremental_state, "mems", new_mems)
../truncated_bptt/truncated_bptt_lm_task.py
#!/usr/bin/python3
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import fileinput
import hashlib
import sys
from multiprocessing import Pool
def get_hashes_and_lines(raw_line):
hash = hashlib.md5(raw_line).hexdigest()
return hash, raw_line
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--workers", type=int, default=10)
parser.add_argument("files", nargs="*", help="input files")
args = parser.parse_args()
seen = set()
with fileinput.input(args.files, mode="rb") as h:
pool = Pool(args.workers)
results = pool.imap_unordered(get_hashes_and_lines, h, 1000)
for i, (hash, raw_line) in enumerate(results):
if hash not in seen:
seen.add(hash)
sys.stdout.buffer.write(raw_line)
if i % 1000000 == 0:
print(i, file=sys.stderr, end="", flush=True)
elif i % 100000 == 0:
print(".", file=sys.stderr, end="", flush=True)
print(file=sys.stderr, flush=True)
if __name__ == "__main__":
main()
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import fileinput
from tqdm import tqdm
def main():
parser = argparse.ArgumentParser(
description=(
"Extract back-translations from the stdout of fairseq-generate. "
"If there are multiply hypotheses for a source, we only keep the first one. "
)
)
parser.add_argument("--output", required=True, help="output prefix")
parser.add_argument(
"--srclang", required=True, help="source language (extracted from H-* lines)"
)
parser.add_argument(
"--tgtlang", required=True, help="target language (extracted from S-* lines)"
)
parser.add_argument("--minlen", type=int, help="min length filter")
parser.add_argument("--maxlen", type=int, help="max length filter")
parser.add_argument("--ratio", type=float, help="ratio filter")
parser.add_argument("files", nargs="*", help="input files")
args = parser.parse_args()
def validate(src, tgt):
srclen = len(src.split(" ")) if src != "" else 0
tgtlen = len(tgt.split(" ")) if tgt != "" else 0
if (
(args.minlen is not None and (srclen < args.minlen or tgtlen < args.minlen))
or (
args.maxlen is not None
and (srclen > args.maxlen or tgtlen > args.maxlen)
)
or (
args.ratio is not None
and (max(srclen, tgtlen) / float(min(srclen, tgtlen)) > args.ratio)
)
):
return False
return True
def safe_index(toks, index, default):
try:
return toks[index]
except IndexError:
return default
with open(args.output + "." + args.srclang, "w") as src_h, open(
args.output + "." + args.tgtlang, "w"
) as tgt_h:
for line in tqdm(fileinput.input(args.files)):
if line.startswith("S-"):
tgt = safe_index(line.rstrip().split("\t"), 1, "")
elif line.startswith("H-"):
if tgt is not None:
src = safe_index(line.rstrip().split("\t"), 2, "")
if validate(src, tgt):
print(src, file=src_h)
print(tgt, file=tgt_h)
tgt = None
if __name__ == "__main__":
main()
#!/bin/bash
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt/subword_nmt
BPE_CODE=wmt18_en_de/code
SUBSAMPLE_SIZE=25000000
LANG=de
OUTDIR=wmt18_${LANG}_mono
orig=orig
tmp=$OUTDIR/tmp
mkdir -p $OUTDIR $tmp
URLS=(
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2007.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2008.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2009.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2010.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2011.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.de.shuffled.gz"
"http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.de.shuffled.gz"
"http://www.statmt.org/wmt15/training-monolingual-news-crawl-v2/news.2014.de.shuffled.v2.gz"
"http://data.statmt.org/wmt16/translation-task/news.2015.de.shuffled.gz"
"http://data.statmt.org/wmt17/translation-task/news.2016.de.shuffled.gz"
"http://data.statmt.org/wmt18/translation-task/news.2017.de.shuffled.deduped.gz"
)
FILES=(
"news.2007.de.shuffled.gz"
"news.2008.de.shuffled.gz"
"news.2009.de.shuffled.gz"
"news.2010.de.shuffled.gz"
"news.2011.de.shuffled.gz"
"news.2012.de.shuffled.gz"
"news.2013.de.shuffled.gz"
"news.2014.de.shuffled.v2.gz"
"news.2015.de.shuffled.gz"
"news.2016.de.shuffled.gz"
"news.2017.de.shuffled.deduped.gz"
)
cd $orig
for ((i=0;i<${#URLS[@]};++i)); do
file=${FILES[i]}
if [ -f $file ]; then
echo "$file already exists, skipping download"
else
url=${URLS[i]}
wget "$url"
fi
done
cd ..
if [ -f $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG} ]; then
echo "found monolingual sample, skipping shuffle/sample/tokenize"
else
gzip -c -d -k $(for FILE in "${FILES[@]}"; do echo $orig/$FILE; done) \
| shuf -n $SUBSAMPLE_SIZE \
| perl $NORM_PUNC $LANG \
| perl $REM_NON_PRINT_CHAR \
| perl $TOKENIZER -threads 8 -a -l $LANG \
> $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG}
fi
if [ -f $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG} ]; then
echo "found BPE monolingual sample, skipping BPE step"
else
python $BPEROOT/apply_bpe.py -c $BPE_CODE \
< $tmp/monolingual.${SUBSAMPLE_SIZE}.${LANG} \
> $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG}
fi
if [ -f $tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG} ]; then
echo "found deduplicated monolingual sample, skipping deduplication step"
else
python deduplicate_lines.py $tmp/bpe.monolingual.${SUBSAMPLE_SIZE}.${LANG} \
> $tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG}
fi
if [ -f $OUTDIR/bpe.monolingual.dedup.00.de ]; then
echo "found sharded data, skipping sharding step"
else
split --lines 1000000 --numeric-suffixes \
--additional-suffix .${LANG} \
$tmp/bpe.monolingual.dedup.${SUBSAMPLE_SIZE}.${LANG} \
$OUTDIR/bpe.monolingual.dedup.
fi
#!/bin/bash
# Adapted from https://github.com/facebookresearch/MIXER/blob/master/prepareData.sh
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=32000
URLS=(
"http://statmt.org/wmt13/training-parallel-europarl-v7.tgz"
"http://statmt.org/wmt13/training-parallel-commoncrawl.tgz"
"http://data.statmt.org/wmt18/translation-task/training-parallel-nc-v13.tgz"
"http://data.statmt.org/wmt18/translation-task/rapid2016.tgz"
"http://data.statmt.org/wmt17/translation-task/dev.tgz"
"http://statmt.org/wmt14/test-full.tgz"
)
FILES=(
"training-parallel-europarl-v7.tgz"
"training-parallel-commoncrawl.tgz"
"training-parallel-nc-v13.tgz"
"rapid2016.tgz"
"dev.tgz"
"test-full.tgz"
)
CORPORA=(
"training/europarl-v7.de-en"
"commoncrawl.de-en"
"training-parallel-nc-v13/news-commentary-v13.de-en"
"rapid2016.de-en"
)
if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit 1
fi
OUTDIR=wmt18_en_de
src=en
tgt=de
lang=en-de
prep=$OUTDIR
tmp=$prep/tmp
orig=orig
mkdir -p $orig $tmp $prep
cd $orig
for ((i=0;i<${#URLS[@]};++i)); do
file=${FILES[i]}
if [ -f $file ]; then
echo "$file already exists, skipping download"
else
url=${URLS[i]}
wget "$url"
if [ -f $file ]; then
echo "$url successfully downloaded."
else
echo "$url not successfully downloaded."
exit 1
fi
if [ ${file: -4} == ".tgz" ]; then
tar zxvf $file
elif [ ${file: -4} == ".tar" ]; then
tar xvf $file
fi
fi
done
cd ..
echo "pre-processing train data..."
for l in $src $tgt; do
rm $tmp/train.tags.$lang.tok.$l
for f in "${CORPORA[@]}"; do
cat $orig/$f.$l | \
perl $NORM_PUNC $l | \
perl $REM_NON_PRINT_CHAR | \
perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l
done
done
echo "pre-processing test data..."
for l in $src $tgt; do
if [ "$l" == "$src" ]; then
t="src"
else
t="ref"
fi
grep '<seg id' $orig/test-full/newstest2014-deen-$t.$l.sgm | \
sed -e 's/<seg id="[0-9]*">\s*//g' | \
sed -e 's/\s*<\/seg>\s*//g' | \
sed -e "s/\’/\'/g" | \
perl $TOKENIZER -threads 8 -a -l $l > $tmp/test.$l
echo ""
done
echo "splitting train and valid..."
for l in $src $tgt; do
awk '{if (NR%100 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/valid.$l
awk '{if (NR%100 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l
done
TRAIN=$tmp/train.de-en
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
cat $tmp/train.$l >> $TRAIN
done
echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE
for L in $src $tgt; do
for f in train.$L valid.$L test.$L; do
echo "apply_bpe.py to ${f}..."
python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
done
done
perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 250
perl $CLEAN -ratio 1.5 $tmp/bpe.valid $src $tgt $prep/valid 1 250
for L in $src $tgt; do
cp $tmp/bpe.test.$L $prep/test.$L
done
#!/bin/bash
if [ $# -ne 5 ]; then
echo "usage: $0 [dataset=wmt14/full] [langpair=en-de] [databin] [bpecode] [model]"
exit
fi
DATASET=$1
LANGPAIR=$2
DATABIN=$3
BPECODE=$4
MODEL=$5
SRCLANG=$(echo $LANGPAIR | cut -d '-' -f 1)
TGTLANG=$(echo $LANGPAIR | cut -d '-' -f 2)
BPEROOT=examples/backtranslation/subword-nmt/subword_nmt
if [ ! -e $BPEROOT ]; then
BPEROOT=subword-nmt/subword_nmt
if [ ! -e $BPEROOT ]; then
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
fi
fi
sacrebleu -t $DATASET -l $LANGPAIR --echo src \
| sacremoses tokenize -a -l $SRCLANG -q \
| python $BPEROOT/apply_bpe.py -c $BPECODE \
| fairseq-interactive $DATABIN --path $MODEL \
-s $SRCLANG -t $TGTLANG \
--beam 5 --remove-bpe --buffer-size 1024 --max-tokens 8000 \
| grep ^H- | cut -f 3- \
| sacremoses detokenize -l $TGTLANG -q \
| sacrebleu -t $DATASET -l $LANGPAIR
#!/bin/bash
if [ $# -ne 5 ]; then
echo "usage: $0 [dataset=wmt14/full] [langpair=en-de] [databin] [bpecode] [model]"
exit
fi
DATASET=$1
LANGPAIR=$2
DATABIN=$3
BPECODE=$4
MODEL=$5
SRCLANG=$(echo $LANGPAIR | cut -d '-' -f 1)
TGTLANG=$(echo $LANGPAIR | cut -d '-' -f 2)
BPEROOT=examples/backtranslation/subword-nmt/subword_nmt
if [ ! -e $BPEROOT ]; then
BPEROOT=subword-nmt/subword_nmt
if [ ! -e $BPEROOT ]; then
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
fi
fi
TMP_REF=$(mktemp)
sacrebleu -t $DATASET -l $LANGPAIR --echo ref -q \
| sacremoses normalize -l $TGTLANG -q \
| sacremoses tokenize -a -l $TGTLANG -q \
> $TMP_REF
sacrebleu -t $DATASET -l $LANGPAIR --echo src -q \
| sacremoses normalize -l $SRCLANG -q \
| sacremoses tokenize -a -l $SRCLANG -q \
| python $BPEROOT/apply_bpe.py -c $BPECODE \
| fairseq-interactive $DATABIN --path $MODEL \
-s $SRCLANG -t $TGTLANG \
--beam 5 --remove-bpe --buffer-size 1024 --max-tokens 8000 \
| grep ^H- | cut -f 3- \
| fairseq-score --ref $TMP_REF
rm -f $TMP_REF
# Fine-tuning BART on GLUE tasks
### 1) Download the data from GLUE website (https://gluebenchmark.com/tasks) using following commands:
```bash
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks all
```
### 2) Preprocess GLUE task data (same as RoBERTa):
```bash
./examples/roberta/preprocess_GLUE_tasks.sh glue_data <glue_task_name>
```
`glue_task_name` is one of the following:
`{ALL, QQP, MNLI, QNLI, MRPC, RTE, STS-B, SST-2, CoLA}`
Use `ALL` for preprocessing all the glue tasks.
### 3) Fine-tuning on GLUE task:
Example fine-tuning cmd for `RTE` task
```bash
TOTAL_NUM_UPDATES=2036 # 10 epochs through RTE for bsz 16
WARMUP_UPDATES=61 # 6 percent of the number of updates
LR=1e-05 # Peak LR for polynomial LR scheduler.
NUM_CLASSES=2
MAX_SENTENCES=16 # Batch size.
BART_PATH=/path/to/bart/model.pt
CUDA_VISIBLE_DEVICES=0,1 fairseq-train RTE-bin/ \
--restore-file $BART_PATH \
--batch-size $MAX_SENTENCES \
--max-tokens 4400 \
--task sentence_prediction \
--add-prev-output-tokens \
--layernorm-embedding \
--share-all-embeddings \
--share-decoder-input-output-embed \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 \
--arch bart_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --fp16-init-scale 4 --threshold-loss-scale 1 --fp16-scale-window 128 \
--max-epoch 10 \
--find-unused-parameters \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric;
```
For each GLUE task, you will need to use the following cmd-line arguments:
Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
`--num-classes` | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 1
`--lr` | 5e-6 | 1e-5 | 1e-5 | 1e-5 | 5e-6 | 2e-5 | 2e-5 | 2e-5
`bsz` | 128 | 32 | 32 | 32 | 128 | 64 | 64 | 32
`--total-num-update` | 30968 | 33112 | 113272 | 1018 | 5233 | 1148 | 1334 | 1799
`--warmup-updates` | 1858 | 1986 | 6796 | 61 | 314 | 68 | 80 | 107
For `STS-B` additionally add `--regression-target --best-checkpoint-metric loss` and remove `--maximize-best-checkpoint-metric`.
**Note:**
a) `--total-num-update` is used by the `polynomial_decay` scheduler and is calculated for `--max-epoch=10` and `--batch-size=32/64/128`, depending on the task.
b) The above command-line arguments and hyperparameters were tested on an Nvidia `V100` GPU with `32GB` of memory for each task. Depending on the GPU memory available to you, you can increase `--update-freq` and reduce `--batch-size`.
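The schedule values above follow from the training-set size, the batch size, and the 10-epoch budget, with warm-up set to 6% of the total updates. A minimal sketch of that arithmetic (the example dataset size below is illustrative, not a value taken from this README):
```python
import math

def glue_schedule(num_train_examples, batch_size, epochs=10, warmup_ratio=0.06):
    # --total-num-update for the polynomial_decay scheduler: updates per epoch * epochs.
    updates_per_epoch = math.ceil(num_train_examples / batch_size)
    total_num_update = updates_per_epoch * epochs
    # --warmup-updates: "6 percent of the number of updates".
    warmup_updates = int(warmup_ratio * total_num_update)
    return total_num_update, warmup_updates

# Hypothetical task with 10,000 training examples at batch size 32:
print(glue_schedule(10_000, 32))  # (3130, 187)
```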
### Inference on GLUE task
After training the model as described in the previous step, you can perform inference with the checkpoints in the `checkpoints/` directory using the following Python code snippet:
```python
from fairseq.models.bart import BARTModel
bart = BARTModel.from_pretrained(
'checkpoints/',
checkpoint_file='checkpoint_best.pt',
data_name_or_path='RTE-bin'
)
label_fn = lambda label: bart.task.label_dictionary.string(
[label + bart.task.label_dictionary.nspecial]
)
ncorrect, nsamples = 0, 0
bart.cuda()
bart.eval()
with open('glue_data/RTE/dev.tsv') as fin:
fin.readline()
for index, line in enumerate(fin):
tokens = line.strip().split('\t')
sent1, sent2, target = tokens[1], tokens[2], tokens[3]
tokens = bart.encode(sent1, sent2)
prediction = bart.predict('sentence_classification_head', tokens).argmax().item()
prediction_label = label_fn(prediction)
ncorrect += int(prediction_label == target)
nsamples += 1
print('| Accuracy: ', float(ncorrect)/float(nsamples))
```
# BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
[https://arxiv.org/abs/1910.13461](https://arxiv.org/abs/1910.13461)
## Introduction
BART is a sequence-to-sequence model trained with a denoising pretraining objective. We show that this pretraining objective is more generic: BART matches [RoBERTa](../roberta) results on SQuAD and GLUE and achieves state-of-the-art results on summarization (XSum, CNN/DailyMail), long-form generative question answering (ELI5), and dialogue response generation (ConvAI2). See the associated paper for more details.
## Pre-trained models
Model | Description | # params | Download
---|---|---|---
`bart.base` | BART model with 6 encoder and decoder layers | 140M | [bart.base.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.base.tar.gz)
`bart.large` | BART model with 12 encoder and decoder layers | 400M | [bart.large.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz)
`bart.large.mnli` | `bart.large` finetuned on `MNLI` | 400M | [bart.large.mnli.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.mnli.tar.gz)
`bart.large.cnn` | `bart.large` finetuned on `CNN-DM` | 400M | [bart.large.cnn.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.cnn.tar.gz)
`bart.large.xsum` | `bart.large` finetuned on `Xsum` | 400M | [bart.large.xsum.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/bart.large.xsum.tar.gz)
## Results
**[GLUE (Wang et al., 2019)](https://gluebenchmark.com/)**
_(dev set, single model, single-task finetuning)_
Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B
---|---|---|---|---|---|---|---|---
`roberta.large` | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4
`bart.large` | 89.9 | 94.9 | 92.5 | 87.0 | 96.6 | 90.4 | 62.8 | 91.2
**[SQuAD (Rajpurkar et al., 2018)](https://rajpurkar.github.io/SQuAD-explorer/)**
_(dev set, no additional data used)_
Model | SQuAD 1.1 EM/F1 | SQuAD 2.0 EM/F1
---|---|---
`roberta.large` | 88.9/94.6 | 86.5/89.4
`bart.large` | 88.8/94.6 | 86.1/89.2
**[CNN/Daily Mail](http://nlpprogress.com/english/summarization.html)**
_(test set, no additional data used)_
Model | R1 | R2 | RL
---|---|---|---
`BERTSUMEXTABS` | 42.13 | 19.60 | 39.18
`bart.large` | 44.16 | 21.28 | 40.90
## Example usage
##### Load BART from torch.hub (PyTorch >= 1.1):
```python
import torch
bart = torch.hub.load('pytorch/fairseq', 'bart.large')
bart.eval() # disable dropout (or leave in train mode to finetune)
```
##### Load BART (for PyTorch 1.0 or custom models):
```bash
# Download bart.large model
wget https://dl.fbaipublicfiles.com/fairseq/models/bart.large.tar.gz
tar -xzvf bart.large.tar.gz
```
```python
# Load the model in fairseq
from fairseq.models.bart import BARTModel
bart = BARTModel.from_pretrained('/path/to/bart.large', checkpoint_file='model.pt')
bart.eval()  # disable dropout (or leave in train mode to finetune)
```
##### Apply Byte-Pair Encoding (BPE) to input text:
```python
tokens = bart.encode('Hello world!')
assert tokens.tolist() == [0, 31414, 232, 328, 2]
bart.decode(tokens) # 'Hello world!'
```
##### Extract features from BART:
```python
# Extract the last layer's features
last_layer_features = bart.extract_features(tokens)
assert last_layer_features.size() == torch.Size([1, 5, 1024])
# Extract all layers' features from the decoder (layer 0 is the embedding layer)
all_layers = bart.extract_features(tokens, return_all_hiddens=True)
assert len(all_layers) == 13
assert torch.all(all_layers[-1] == last_layer_features)
```
##### Use BART for sentence-pair classification tasks:
```python
# Download BART already finetuned for MNLI
bart = torch.hub.load('pytorch/fairseq', 'bart.large.mnli')
bart.eval() # disable dropout for evaluation
# Encode a pair of sentences and make a prediction
tokens = bart.encode('BART is a seq2seq model.', 'BART is not sequence to sequence.')
bart.predict('mnli', tokens).argmax() # 0: contradiction
# Encode another pair of sentences
tokens = bart.encode('BART is denoising autoencoder.', 'BART is version of autoencoder.')
bart.predict('mnli', tokens).argmax() # 2: entailment
```
##### Register a new (randomly initialized) classification head:
```python
bart.register_classification_head('new_task', num_classes=3)
logprobs = bart.predict('new_task', tokens)
```
##### Batched prediction:
```python
import torch
from fairseq.data.data_utils import collate_tokens
bart = torch.hub.load('pytorch/fairseq', 'bart.large.mnli')
bart.eval()
batch_of_pairs = [
['BART is a seq2seq model.', 'BART is not sequence to sequence.'],
['BART is denoising autoencoder.', 'BART is version of autoencoder.'],
]
batch = collate_tokens(
[bart.encode(pair[0], pair[1]) for pair in batch_of_pairs], pad_idx=1
)
logprobs = bart.predict('mnli', batch)
print(logprobs.argmax(dim=1))
# tensor([0, 2])
```
##### Using the GPU:
```python
bart.cuda()
bart.predict('new_task', tokens)
```
#### Filling masks:
BART can be used to fill multiple `<mask>` tokens in the input.
```python
bart = torch.hub.load('pytorch/fairseq', 'bart.base')
bart.eval()
bart.fill_mask(['The cat <mask> on the <mask>.'], topk=3, beam=10)
# [[('The cat was on the ground.', tensor(-0.6183)), ('The cat was on the floor.', tensor(-0.6798)), ('The cat sleeps on the couch.', tensor(-0.6830))]]
```
Note that by default we enforce the output length to match the input length.
This can be disabled by setting `match_source_len=False`:
```python
bart.fill_mask(['The cat <mask> on the <mask>.'], topk=3, beam=10, match_source_len=False)
# [[('The cat was on the ground.', tensor(-0.6185)), ('The cat was asleep on the couch.', tensor(-0.6276)), ('The cat was on the floor.', tensor(-0.6800))]]
```
Example code to fill masks for a batch of sentences using the GPU:
```python
bart.cuda()
bart.fill_mask(['The cat <mask> on the <mask>.', 'The dog <mask> on the <mask>.'], topk=3, beam=10)
# [[('The cat was on the ground.', tensor(-0.6183)), ('The cat was on the floor.', tensor(-0.6798)), ('The cat sleeps on the couch.', tensor(-0.6830))], [('The dog was on the ground.', tensor(-0.6190)), ('The dog lay on the ground.', tensor(-0.6711)), ('The dog was asleep on the couch', tensor(-0.6796))]]
```
#### Evaluating the `bart.large.mnli` model:
Example Python code snippet to evaluate accuracy on the MNLI `dev_matched` set:
```python
label_map = {0: 'contradiction', 1: 'neutral', 2: 'entailment'}
ncorrect, nsamples = 0, 0
bart.cuda()
bart.eval()
with open('glue_data/MNLI/dev_matched.tsv') as fin:
fin.readline()
for index, line in enumerate(fin):
tokens = line.strip().split('\t')
sent1, sent2, target = tokens[8], tokens[9], tokens[-1]
tokens = bart.encode(sent1, sent2)
prediction = bart.predict('mnli', tokens).argmax().item()
prediction_label = label_map[prediction]
ncorrect += int(prediction_label == target)
nsamples += 1
print('| Accuracy: ', float(ncorrect)/float(nsamples))
# Expected output: 0.9010
```
#### Evaluating the `bart.large.cnn` model:
- Follow the instructions [here](https://github.com/abisee/cnn-dailymail) to download and process the data into files such that `test.source` and `test.target` have one line for each non-tokenized sample.
- For simpler preprocessing, you can also `wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz`, although there is no guarantee of identical scores.
- `huggingface/transformers` has a simpler interface that supports [single-gpu](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq/run_eval.py) and [multi-gpu](https://github.com/huggingface/transformers/blob/master/examples/legacy/seq2seq/run_distributed_eval.py) beam search.
In `huggingface/transformers`, the BART models' paths are `facebook/bart-large-cnn` and `facebook/bart-large-xsum`.
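As a quick, hedged illustration of that `transformers` interface (assuming the `transformers` package is installed; this is a convenience check, not the evaluation pipeline used for the reported numbers):
```python
# Sketch: summarize the first CNN-DM test article with the Hugging Face checkpoint.
# Assumes `transformers` is installed and `cnn_dm/test.source` exists locally.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
with open("cnn_dm/test.source") as f:
    article = f.readline().strip()
summary = summarizer(article, max_length=140, min_length=55, truncation=True)
print(summary[0]["summary_text"])
```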
In `fairseq`, summaries can be generated using:
```bash
cp data-bin/cnn_dm/dict.source.txt checkpoints/
python examples/bart/summarize.py \
--model-dir pytorch/fairseq \
--model-file bart.large.cnn \
--src cnn_dm/test.source \
--out cnn_dm/test.hypo
```
For calculating rouge, install `files2rouge` from [here](https://github.com/pltrdy/files2rouge).
```bash
export CLASSPATH=/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
# Tokenize hypothesis and target files.
cat test.hypo | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.tokenized
cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.hypo.target
files2rouge test.hypo.tokenized test.hypo.target
# Expected output: (ROUGE-2 Average_F: 0.21238)
```
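If setting up `files2rouge` and CoreNLP is not convenient, a rough cross-check is possible with the `rouge_score` Python package; note this is an assumed alternative tool that does not reproduce the official tokenization, so scores may differ slightly:
```python
# Rough ROUGE-2 cross-check with the `rouge_score` package (assumed installed);
# not the files2rouge/PTBTokenizer pipeline used for the reported numbers.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
with open("test.hypo") as h, open("test.target") as r:
    scores = [scorer.score(ref.strip(), hyp.strip())["rouge2"].fmeasure
              for hyp, ref in zip(h, r)]
print(f"ROUGE-2 F (approx.): {sum(scores) / len(scores):.5f}")
```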
## Finetuning
- [Finetuning on GLUE](README.glue.md)
- [Finetuning on CNN-DM](README.summarization.md)
## Citation
```bibtex
@article{lewis2019bart,
  title = {BART: Denoising Sequence-to-Sequence Pre-training for Natural
           Language Generation, Translation, and Comprehension},
  author = {Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and
            Abdelrahman Mohamed and Omer Levy and Veselin Stoyanov and
            Luke Zettlemoyer},
  journal = {arXiv preprint arXiv:1910.13461},
  year = {2019},
}
```
# Fine-tuning BART on CNN-Dailymail summarization task
### 1) Download the CNN and Daily Mail data and preprocess it into data files with non-tokenized cased samples.
Follow the instructions [here](https://github.com/abisee/cnn-dailymail) to download the original CNN and Daily Mail datasets. To preprocess the data, refer to the pointers in [this issue](https://github.com/pytorch/fairseq/issues/1391) or check out the code [here](https://github.com/artmatsak/cnn-dailymail).
For XSum, follow the instructions [here](https://github.com/EdinburghNLP/XSum) to download the original Extreme Summarization dataset, or check out the code [here](https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset). Please keep the dataset raw: do not apply any tokenization or BPE.
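A small sanity check that the resulting split files are parallel and still raw (line counts match, no subword markers) can help before BPE preprocessing; a minimal sketch, with the `cnn_dm/` paths as assumptions about where you placed the files:
```python
# Sanity-check the raw splits before BPE: the paths below are assumptions.
for split in ("train", "val", "test"):
    with open(f"cnn_dm/{split}.source") as s, open(f"cnn_dm/{split}.target") as t:
        src, tgt = s.readlines(), t.readlines()
    assert len(src) == len(tgt), f"{split}: {len(src)} sources vs {len(tgt)} targets"
    # Rough heuristic: subword-nmt style BPE would leave "@@ " markers behind.
    assert not any("@@ " in line for line in src[:1000]), f"{split}: looks BPE-encoded already"
    print(split, len(src), "examples")
```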
### 2) BPE preprocess:
```bash
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe'
wget -N 'https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt'
TASK=cnn_dm
for SPLIT in train val
do
for LANG in source target
do
python -m examples.roberta.multiprocessing_bpe_encoder \
--encoder-json encoder.json \
--vocab-bpe vocab.bpe \
--inputs "$TASK/$SPLIT.$LANG" \
--outputs "$TASK/$SPLIT.bpe.$LANG" \
--workers 60 \
--keep-empty;
done
done
```
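Each line of `$SPLIT.bpe.$LANG` is the space-joined sequence of GPT-2 BPE token ids for the corresponding raw line. A minimal sketch of the encoding on a single sentence (the import path is an assumption based on the fairseq source tree and may differ across versions):
```python
# Encode one line with the GPT-2 BPE files downloaded above (encoder.json / vocab.bpe).
# Import path is an assumption; adjust if your fairseq version lays it out differently.
from fairseq.data.encoders.gpt2_bpe_utils import get_encoder

bpe = get_encoder("encoder.json", "vocab.bpe")
ids = bpe.encode("Police arrested a suspect after a lengthy stand-off.")
print(" ".join(map(str, ids)))  # what multiprocessing_bpe_encoder writes per line
print(bpe.decode(ids))          # round-trips back to the raw sentence
```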
### 3) Binarize dataset:
```bash
fairseq-preprocess \
--source-lang "source" \
--target-lang "target" \
--trainpref "${TASK}/train.bpe" \
--validpref "${TASK}/val.bpe" \
--destdir "${TASK}-bin/" \
--workers 60 \
--srcdict dict.txt \
--tgtdict dict.txt;
```
### 4) Fine-tuning on CNN-DM summarization task:
Example fine-tuning command for CNN-DM:
```bash
TOTAL_NUM_UPDATES=20000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=2048
UPDATE_FREQ=4
BART_PATH=/path/to/bart/model.pt
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train cnn_dm-bin \
--restore-file $BART_PATH \
--max-tokens $MAX_TOKENS \
--task translation \
--source-lang source --target-lang target \
--truncate-source \
--layernorm-embedding \
--share-all-embeddings \
--share-decoder-input-output-embed \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--arch bart_large \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
--clip-norm 0.1 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--fp16 --update-freq $UPDATE_FREQ \
--skip-invalid-size-inputs-valid-test \
--find-unused-parameters;
```
The above is expected to run on `1` node with `8` 32GB V100 GPUs.
The expected training time is about `5 hours`. Training time can be reduced with distributed training on `4` nodes and `--update-freq 1`.
Use `TOTAL_NUM_UPDATES=15000` and `UPDATE_FREQ=2` for the XSum task.
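For reference, the effective batch size implied by these settings is up to `MAX_TOKENS x #GPUs x UPDATE_FREQ` tokens per optimizer step; a quick check of that arithmetic:
```python
# Effective tokens per optimizer step for the command above (an upper bound,
# since each GPU packs batches up to --max-tokens).
max_tokens, num_gpus, update_freq = 2048, 8, 4
print(max_tokens * num_gpus * update_freq)  # 65536
# 4 nodes x 8 GPUs with --update-freq 1 keeps the same effective batch:
print(2048 * 32 * 1)  # 65536
```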
### Inference for CNN-DM test data using the above trained checkpoint
After training the model as described in the previous step, you can perform inference with the checkpoints in the `checkpoints/` directory using `summarize.py`, for example:
```bash
cp data-bin/cnn_dm/dict.source.txt checkpoints/
python examples/bart/summarize.py \
--model-dir checkpoints \
--model-file checkpoint_best.pt \
--src cnn_dm/test.source \
--out cnn_dm/test.hypo
```
For XSum, which uses `beam=6`, `lenpen=1.0`, `max_len_b=60`, `min_len=10`:
```bash
cp data-bin/cnn_dm/dict.source.txt checkpoints/
python examples/bart/summarize.py \
--model-dir checkpoints \
--model-file checkpoint_best.pt \
--src cnn_dm/test.source \
--out cnn_dm/test.hypo \
--xsum-kwargs
```
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import torch
from fairseq.models.bart import BARTModel
import argparse
XSUM_KWARGS = dict(beam=6, lenpen=1.0, max_len_b=60, min_len=10, no_repeat_ngram_size=3)
CNN_KWARGS = dict(beam=4, lenpen=2.0, max_len_b=140, min_len=55, no_repeat_ngram_size=3)
@torch.no_grad()
def generate(bart, infile, outfile="bart_hypo.txt", bsz=32, n_obs=None, **eval_kwargs):
    count = 1
    # if n_obs is not None: bsz = min(bsz, n_obs)
    with open(infile) as source, open(outfile, "w") as fout:
        sline = source.readline().strip()
        slines = [sline]
        for sline in source:
            if n_obs is not None and count > n_obs:
                break
            if count % bsz == 0:
                # Batch is full: summarize it and start a new one.
                hypotheses_batch = bart.sample(slines, **eval_kwargs)
                for hypothesis in hypotheses_batch:
                    fout.write(hypothesis + "\n")
                    fout.flush()
                slines = []
            slines.append(sline.strip())
            count += 1
        if slines != []:
            # Summarize the final (possibly partial) batch.
            hypotheses_batch = bart.sample(slines, **eval_kwargs)
            for hypothesis in hypotheses_batch:
                fout.write(hypothesis + "\n")
                fout.flush()


def main():
    """
    Usage::

        python examples/bart/summarize.py \
            --model-dir $HOME/bart.large.cnn \
            --model-file model.pt \
            --src $HOME/data-bin/cnn_dm/test.source
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-dir",
        required=True,
        type=str,
        default="bart.large.cnn/",
        help="path containing model file and src_dict.txt",
    )
    parser.add_argument(
        "--model-file",
        default="checkpoint_best.pt",
        help="where in model_dir are weights saved",
    )
    parser.add_argument(
        "--src", default="test.source", help="text to summarize", type=str
    )
    parser.add_argument(
        "--out", default="test.hypo", help="where to save summaries", type=str
    )
    parser.add_argument("--bsz", default=32, help="batch size", type=int)
    parser.add_argument(
        "--n", default=None, help="how many examples to summarize", type=int
    )
    parser.add_argument(
        "--xsum-kwargs",
        action="store_true",
        default=False,
        help="if true use XSUM_KWARGS else CNN_KWARGS",
    )
    args = parser.parse_args()
    eval_kwargs = XSUM_KWARGS if args.xsum_kwargs else CNN_KWARGS
    if args.model_dir == "pytorch/fairseq":
        bart = torch.hub.load("pytorch/fairseq", args.model_file)
    else:
        bart = BARTModel.from_pretrained(
            args.model_dir,
            checkpoint_file=args.model_file,
            data_name_or_path=args.model_dir,
        )
    bart = bart.eval()
    if torch.cuda.is_available():
        bart = bart.cuda().half()
    generate(
        bart, args.src, bsz=args.bsz, n_obs=args.n, outfile=args.out, **eval_kwargs
    )


if __name__ == "__main__":
    main()
# Neural Machine Translation with Byte-Level Subwords
https://arxiv.org/abs/1909.03341
We provide an implementation of byte-level byte-pair encoding (BBPE), using IWSLT 2017 Fr-En translation as an example.
## Data
Get data and generate fairseq binary dataset:
```bash
bash ./get_data.sh
```
## Model Training
Train Transformer model with Bi-GRU embedding contextualization (implemented in `gru_transformer.py`):
```bash
# VOCAB=bytes
# VOCAB=chars
VOCAB=bbpe2048
# VOCAB=bpe2048
# VOCAB=bbpe4096
# VOCAB=bpe4096
# VOCAB=bpe16384
```
```bash
fairseq-train "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
--arch gru_transformer --encoder-layers 2 --decoder-layers 2 --dropout 0.3 --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--log-format 'simple' --log-interval 100 --save-dir "checkpoints/${VOCAB}" \
--batch-size 100 --max-update 100000 --update-freq 2
```
## Generation
`fairseq-generate` requires the byte-level (BBPE) decoder to convert the byte-level representation back into characters:
```bash
# BPE="--bpe bytes"
# BPE="--bpe characters"
BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe2048.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe2048.model"
# BPE="--bpe byte_bpe --sentencepiece-model-path data/spm_bbpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe4096.model"
# BPE="--bpe sentencepiece --sentencepiece-model data/spm_bpe16384.model"
```
```bash
fairseq-generate "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
--source-lang fr --gen-subset test --sacrebleu --path "checkpoints/${VOCAB}/checkpoint_last.pt" \
--tokenizer moses --moses-target-lang en ${BPE}
```
When using `fairseq-interactive`, the byte-level (BBPE) encoder/decoder is required to tokenize the input data and detokenize the model predictions:
```bash
fairseq-interactive "data/bin_${VOCAB}" --task translation --user-dir examples/byte_level_bpe/gru_transformer \
--path "checkpoints/${VOCAB}/checkpoint_last.pt" --input data/test.fr --tokenizer moses --moses-source-lang fr \
--moses-target-lang en ${BPE} --buffer-size 1000 --max-tokens 10000
```
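The byte-level representation that the BBPE encoder/decoder works with is simply the UTF-8 byte sequence of the text; a tiny illustration in plain Python (independent of fairseq):
```python
# Byte-level view of a French sentence: BBPE merges operate over these byte values,
# and decoding maps generated byte sequences back to UTF-8 characters.
text = "Où est la gare ?"
byte_seq = list(text.encode("utf-8"))
print(byte_seq)                                 # the accented "ù" spans two bytes
print(bytes(byte_seq).decode("utf-8") == text)  # True: lossless round trip
```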
## Results
| Vocabulary | Model | BLEU |
|:-------------:|:-------------:|:-------------:|
| Joint BPE 16k ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) | 512d LSTM 2+2 | 33.81 |
| Joint BPE 16k | Transformer base 2+2 (w/ GRU) | 36.64 (36.72) |
| Joint BPE 4k | Transformer base 2+2 (w/ GRU) | 35.49 (36.10) |
| Joint BBPE 4k | Transformer base 2+2 (w/ GRU) | 35.61 (35.82) |
| Joint BPE 2k | Transformer base 2+2 (w/ GRU) | 34.87 (36.13) |
| Joint BBPE 2k | Transformer base 2+2 (w/ GRU) | 34.98 (35.43) |
| Characters | Transformer base 2+2 (w/ GRU) | 31.78 (33.30) |
| Bytes | Transformer base 2+2 (w/ GRU) | 31.57 (33.62) |
## Citation
```bibtex
@misc{wang2019neural,
title={Neural Machine Translation with Byte-Level Subwords},
author={Changhan Wang and Kyunghyun Cho and Jiatao Gu},
year={2019},
eprint={1909.03341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Contact
Changhan Wang ([changhan@fb.com](mailto:changhan@fb.com)),
Kyunghyun Cho ([kyunghyuncho@fb.com](mailto:kyunghyuncho@fb.com)),
Jiatao Gu ([jgu@fb.com](mailto:jgu@fb.com))