# NiuTrans-Fairseq-S2T

## Overview

This repository is an extension of the [Fairseq toolkit](https://github.com/pytorch/fairseq) specialized for speech-to-text (S2T) generation tasks. It provides comprehensive support for automatic speech recognition (ASR), machine translation (MT), and speech translation (ST), and it contains the implementations of methods proposed by the NiuTrans Team, such as [Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders](https://arxiv.org/abs/2105.05752).
## Key Features

- Complete recipes: Kaldi-style recipe support for ASR, MT, and ST tasks, ensuring a smooth workflow.
- Various configurations: an extensive collection of YAML configuration files to customize models for different tasks and scenarios.
- Easy reproduction: comprehensive support for the methods in our papers, including SATE, PDS, NAST, BiL-CTC, and more.
- Multiple inference strategies: greedy decoding, beam search, CTC decoding, CTC rescoring, and more.
- More features can be found in the **run.sh** file.

### Training

- Kaldi-style complete recipes
- ASR, MT, and ST pipelines (bin)
- Training configuration read from YAML files
- CTC multi-task learning (a schematic sketch follows the Installation section below)
- MT training in the ST-like way with an online tokenizer (this may be slow)
- Speed perturbation during pre-processing

### Model

- Conformer architecture
- Loading of pre-trained modules
- Relative position representation
- Stacked acoustic-and-textual encoding
- Progressive down-sampling for acoustic encoding

## Installation

1. Clone the repository:

```bash
git clone https://github.com/xuchennlp/S2T.git
```

2. Navigate to the project directory and install the required dependencies:

```bash
cd S2T
pip install -e .
```

Our tested environment: Python 3.8 and PyTorch 1.11.0.

An earlier tested environment used Python 3.6, torch 1.8, torchaudio 0.8.0, CUDA 10.2, and gcc ≥ 4.9 (we used version 5.4), along with the Python libraries pandas, sentencepiece, configargparse, gpustat, tensorboard, and editdistance. It also required apex and NCCL:

```bash
# apex (run inside the cloned apex repository)
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

```bash
# NCCL
make -j src.build CUDA_HOME=<path to cuda install>
```
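Among the training features above, CTC multi-task learning interpolates a CTC objective computed on the encoder output with the usual cross-entropy on the decoder. Below is a minimal, toolkit-independent PyTorch sketch of the combined loss; all tensor shapes, names, and the 0.3 weight are illustrative assumptions, not this toolkit's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)

def multitask_loss(enc_logits, dec_logits, transcripts, targets,
                   input_lengths, transcript_lengths, ctc_weight=0.3):
    """Interpolate a CTC loss on the encoder with cross-entropy on the decoder."""
    # enc_logits: (time, batch, vocab); nn.CTCLoss expects log-probabilities
    ctc = ctc_loss_fn(enc_logits.log_softmax(-1), transcripts,
                      input_lengths, transcript_lengths)
    # dec_logits: (batch, tgt_len, vocab); cross_entropy wants (batch, vocab, tgt_len)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce

# Toy usage: 50 encoder frames, batch of 2, vocabulary of 100 (0 is the blank)
enc = torch.randn(50, 2, 100)
dec = torch.randn(2, 10, 100)
transcripts = torch.randint(1, 100, (2, 12))  # CTC references (no blank labels)
targets = torch.randint(0, 100, (2, 10))      # decoder references
loss = multitask_loss(enc, dec, transcripts, targets,
                      torch.full((2,), 50), torch.full((2,), 12))
```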
## Quick Start
1. Download your dataset and process it into the format of the MuST-C dataset.
2. Run the shell script **run.sh** in the corresponding directory as follows:
```bash
# Set the ST_DIR environment variable to the parent directory of the S2T directory
export ST_DIR=/path/to/S2T/..
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2
```
- Stage 0 performs data processing, including audio feature extraction (not required for MT), vocabulary generation, and generation of the training and test files.
- Stage 1 performs model training, where multiple configurations are supported.
- Stage 2 performs model inference, where multiple strategies are supported.
- All details are available in **run.sh**.
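The audio features extracted in stage 0 are frame-level filterbank representations. As a minimal sketch of what this step computes, assuming 80-dimensional Kaldi-style log-mel filterbanks (the actual scripts under local/ may differ):

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load a mono waveform; the path is illustrative
waveform, sample_rate = torchaudio.load("sample.wav")

# 80-dim log-mel filterbanks: one 25 ms frame every 10 ms,
# a common setup for S2T models
fbank = kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,
    frame_shift=10.0,
    sample_frequency=sample_rate,
)
print(fbank.shape)  # (num_frames, 80)
```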
## Reproduction of Our Methods

### SATE: Stacked Acoustic-and-Textual Encoding (ACL 2021)
**Paper**: [Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders](https://aclanthology.org/2021.acl-long.204/)
**Highlights**: a simple and effective method that utilizes pre-trained ASR and MT models to improve the end-to-end ST model; it introduces an adapter to bridge the pre-trained encoders.
Here is an example on the MuST-C ST dataset.
```bash
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_sate
```
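The --train_config option names a YAML file under the conf/ directory of the recipe (here, reproduction_sate.yaml). As a rough sketch, such files map training hyperparameters to fairseq-style options; the keys below are illustrative assumptions, not the verified schema of this toolkit:

```yaml
# Hypothetical sketch of a conf/*.yaml training configuration
arch: s2t_transformer_s
optimizer: adam
lr: 2e-3
lr-scheduler: inverse_sqrt
warmup-updates: 10000
criterion: label_smoothed_cross_entropy
label-smoothing: 0.1
max-epoch: 50
```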
### PDS: Progressive Down-Sampling (ACL 2023 findings)
**Paper**: [Bridging the Granularity Gap for Acoustic Modeling](https://aclanthology.org/2023.findings-acl.688/)
**Highlights**: an effective method that facilitates the convergence of S2T tasks by increasing the modeling granularity of acoustic representations.
Here is an example on the MuST-C ST dataset. This method also supports the ASR task.
```bash
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds
```
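To convey the core idea of progressive down-sampling, here is a schematic PyTorch sketch in which each stage halves the sequence length before further encoding, moving from fine-grained frames toward coarser, more text-like units. The class name and layer choices are illustrative; the paper's actual architecture (e.g., its handling of multi-granularity representations) is not reproduced here.

```python
import torch
import torch.nn as nn

class DownsampleStage(nn.Module):
    """One schematic stage: halve the sequence length, then model it."""
    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encode = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):  # x: (batch, time, dim)
        x = self.down(x.transpose(1, 2)).transpose(1, 2)  # (batch, time/2, dim)
        return self.encode(x)

# Two stages shorten a 100-frame sequence to 25 positions
encoder = nn.Sequential(DownsampleStage(256), DownsampleStage(256))
out = encoder(torch.randn(4, 100, 256))
print(out.shape)  # torch.Size([4, 25, 256])
```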
### NAST: Non-Autoregressive Speech Translation (ACL 2023)
**Paper**: [CTC-based Non-autoregressive Speech Translation](https://aclanthology.org/2023.acl-long.744/)
**Highlights**: a non-autoregressive modeling method that relies only on CTC inference yet achieves results comparable to autoregressive methods.
Here is an example on the MuST-C ST dataset.
```bash
cd egs/mustc/st/
# Non-autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_nast
# Autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_ctc_aug
```
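Since NAST generates purely through CTC, decoding reduces to the standard CTC collapse rule: take the best label at each frame, merge consecutive repeats, and remove blanks. A minimal sketch, independent of this toolkit's implementation:

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding over (time, vocab) log-probabilities."""
    best = log_probs.argmax(dim=-1).tolist()  # frame-level best labels
    output, prev = [], blank
    for label in best:
        if label != prev and label != blank:
            output.append(label)
        prev = label
    return output

# Toy example: 6 frames over a 4-symbol vocabulary (0 is the blank)
logits = torch.randn(6, 4)
print(ctc_greedy_decode(logits.log_softmax(-1)))
```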
### BiL-CTC: Bilingual CTC (Submitted to ICASSP 2024)
**Paper**: [Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition](https://arxiv.org/abs/2309.12234/)
**Highlights**: introduces both cross-modal and cross-lingual CTC for S2T tasks and develops a novel implementation strategy called Synchronous BiL-CTC, which outperforms the traditional progressive strategy (the implementation in NAST).
Here is an example on the MuST-C ST dataset.
```bash
cd egs/mustc/st/
# Progressive BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_progressive
# Synchronous BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_synchronous
```

## Code Structure

We supply recipes for multiple benchmarks in the egs folder, covering machine translation, speech recognition, and speech translation corpora. We also provide a template for other benchmarks. Here is an example for MuST-C:

```markdown
mustc
├── asr
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
├── mt
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
└── st
    ├── binary.sh
    ├── conf/
    ├── decode.sh
    ├── local/
    ├── run.sh
    └── train.sh
```

* run.sh: the core script that runs the whole pipeline
* train.sh: calls run.sh for training
* decode.sh: calls run.sh for decoding
* binary.sh: generates the binarized datasets alone
* conf: the folder that stores the configuration files (.yaml)
* local: the folder that stores utility scripts
* monitor.sh: checks the GPUs to launch the program automatically
* parse_options.sh: parses the command-line parameters for run.sh
* utils.sh: shared shell utility functions
## Citations
```bibtex
@inproceedings{xu-etal-2021-stacked,
    title = "Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders",
    author = "Xu, Chen and Hu, Bojie and Li, Yanyang and Zhang, Yuhao and Huang, Shen and Ju, Qi and Xiao, Tong and Zhu, Jingbo",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.204",
    doi = "10.18653/v1/2021.acl-long.204",
    pages = "2619--2630",
}
```
## Acknowledgments
- The Fairseq community for the base toolkit
- The ESPnet community for their open-source speech processing toolkit
- NiuTrans Team for their contributions and research
Finally, thank you to everyone who has helped me during my research career.
I sincerely hope that everyone can enjoy the pleasure of research.
## Feedback
If you have any questions, feel free to contact xuchennlp[at]outlook.com.