This project adapts the [fairseq](https://github.com/pytorch/fairseq) toolkit for speech-to-text tasks, including speech recognition and speech translation.
It contains the implementation of the following methods proposed by the NiuTrans Team:
[Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders](https://arxiv.org/abs/2105.05752)
## Key Features
### Training
- Support the Kaldi-style complete recipes
- ASR, MT, and ST pipelines (bin)
- Read the training configuration from a YAML file (see the sketch after this list)
- CTC multi-task learning
- MT training in the ST-like way with an online tokenizer (this may be slow)
- Speed perturbation during pre-processing (requires torchaudio ≥ 0.8.0)
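
The YAML-based configuration mentioned above can be illustrated with a small, hedged sketch. The file name, the keys, and their values are assumptions modeled on common fairseq training options, not this repository's documented schema:

```bash
# Hypothetical sketch only: the path and every key are assumptions meant to show
# the idea of a YAML training config, not this repository's exact schema.
cat > conf/example_train.yaml <<'EOF'
arch: s2t_transformer_s   # assumed architecture choice
ctc-weight: 0.3           # CTC multi-task learning weight (assumed key)
max-tokens: 40000
lr: 2e-3
EOF
```
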
### Model
- Conformer Architecture
- Load pre-trained modules (see the sketch after this list)
- Relative position representation
- Stacked acoustic-and-textual encoding
- Progressive down-sampling for acoustic encoding
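
To make the pre-trained module loading concrete, here is a hedged sketch written with stock fairseq speech-to-text arguments; the data paths and subsets are placeholders, and this is not necessarily the exact command that this repository's run.sh assembles from its YAML config.

```bash
# Hedged sketch: loading a pre-trained ASR encoder for ST training using standard
# fairseq options. Paths and subset names are placeholders; run.sh may build a
# different command from its YAML config.
fairseq-train ${MUSTC_DATA_DIR} \
  --task speech_to_text \
  --train-subset train_st --valid-subset dev_st \
  --config-yaml config_st.yaml \
  --arch s2t_transformer_s \
  --load-pretrained-encoder-from ${ASR_CHECKPOINT} \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 10000 \
  --max-tokens 40000 \
  --save-dir ${ST_SAVE_DIR}
```
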
## Installation
...
We supply recipes for multiple benchmarks in the egs folder, covering machine translation, speech recognition, and speech translation corpora. We also provide a template for other benchmarks.
Here is an example for MuST-C:
```
mustc
├── asr
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
├── mt
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
└── st
    ├── binary.sh
    ├── conf/
    ├── decode.sh
    ├── ensemble.sh
    ├── local/
    ├── run.sh
    └── train.sh
```
* run.sh: the core script that runs the whole pipeline (see the example invocation after this list)
* train.sh: calls run.sh for training
* decode.sh: calls run.sh for decoding
* binary.sh: generates the binarized datasets only
* conf: the folder for the configuration files (.yaml)
* local: the folder for utility shell scripts
  * monitor.sh: checks GPU availability to launch the program automatically
  * parse_options.sh: parses the command-line parameters for run.sh
  * path.sh: currently unused
  * utils.sh: utility shell functions
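
A hedged usage sketch follows. The directory path and the idea that the wrapper scripts can be launched directly are assumptions drawn from the tree and descriptions above; check each script (and parse_options.sh, which handles Kaldi-style `--name value` arguments) for the actual parameters.

```bash
# Hypothetical invocations for the MuST-C ST recipe; the path and the order of
# steps are assumptions based on the directory tree above, not verified commands.
cd egs/mustc/st

bash run.sh        # run the whole pipeline: data preparation, training, decoding
bash binary.sh     # generate the datasets only
bash train.sh      # training (calls run.sh internally)
bash decode.sh     # decoding (calls run.sh internally)
```
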
## Citations
```bibtex
@inproceedings{xu-etal-2021-stacked,
title = "Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders",
title = "Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders",