This project adapts the [fairseq](https://github.com/pytorch/fairseq) toolkit for speech-to-text tasks, including speech recognition and speech translation.
It contains implementations of methods proposed by the NiuTrans Team, including the paper:

[Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders](https://arxiv.org/abs/2105.05752)
## Key Features
### Training
- Complete Kaldi-style recipes
- ASR, MT, and ST pipelines (bin)
- Training configuration read from YAML files
- CTC multi-task learning
- MT training in the ST-like way with an online tokenizer (experimental; may be slow)
- Speed perturbation during pre-processing (requires torchaudio ≥ 0.8.0)
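The YAML configuration typically mirrors fairseq command-line options. A minimal sketch, assuming illustrative field names and values (not the exact settings shipped with any recipe):

```yaml
# Illustrative train.yaml sketch; keys mirror fairseq CLI options.
# All values below are assumptions, not a recommended configuration.
arch: s2t_transformer_s
criterion: label_smoothed_cross_entropy_with_ctc
ctc-weight: 0.3        # weight of the CTC loss in multi-task learning
lr: 2e-3
warmup-updates: 10000
max-tokens: 40000
```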
### Model
- Conformer architecture
- Loading pre-trained modules for ST
- Relative position representation
- Stacked acoustic-and-textual encoding
- Progressive down-sampling for acoustic encoding
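As a rough illustration of progressive down-sampling, the acoustic sequence length shrinks at successive encoder stages. The factors below are an assumption for illustration, not the exact configuration used by the model:

```shell
# Illustrative only: acoustic sequence length after three 2x
# down-sampling stages (assumed factors), using ceiling division.
frames=1000
for f in 2 2 2; do
  frames=$(( (frames + f - 1) / f ))
done
echo "$frames"   # -> 125
```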
## Installation
...
...
We supply recipes for multiple benchmarks in the egs folder, covering machine translation, speech recognition, and speech translation corpora: currently the ASR pipeline for LibriSpeech and all pipelines (ASR, MT, and ST) for MuST-C.
We also provide a template for other benchmarks.
Here is an example for MuST-C:
...
...
mustc
├── asr
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
├── mt
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
└── st
    ├── binary.sh
    ├── conf/
    ├── decode.sh
    ├── ensemble.sh
    ├── local/
    ├── run.sh
    └── train.sh
```
* run.sh: the core script that runs the whole pipeline
* train.sh: calls run.sh for training
* decode.sh: calls run.sh for decoding
* binary.sh: generates the binarized datasets only
* conf/: the folder for the configuration files (.yaml)
* local/: the folder for utility shell scripts
  * monitor.sh: monitors the GPUs and launches the program automatically when they become free
  * parse_options.sh: parses the command-line parameters for run.sh
  * path.sh: currently unused
  * utils.sh: utility shell functions
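Kaldi-style option parsing, as performed by scripts like parse_options.sh, maps each `--name value` pair on the command line onto a shell variable of the same name. A minimal sketch of the convention (not the actual parse_options.sh implementation; variable names are illustrative):

```shell
#!/usr/bin/env bash
# Minimal sketch of Kaldi-style option parsing: defaults first, then
# any "--name value" argument overrides the variable "name"
# (dashes in the option name become underscores).
stage=1
gpu_num=1

while [ $# -gt 0 ]; do
  case "$1" in
    --*)
      name=$(echo "$1" | sed 's/^--//; s/-/_/g')
      eval "${name}=\"$2\""
      shift 2 ;;
    *) break ;;
  esac
done

echo "stage=${stage} gpu_num=${gpu_num}"
```

Running it as `./script.sh --stage 3 --gpu-num 2` would override both defaults before the pipeline proper begins.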
## Citations
```bibtex
@inproceedings{xu-etal-2021-stacked,
title = "Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders",