Commit f76b8cb3 by xuchen

README

parent 0cbb797d
# Speech-to-Text (S2T) toolkit
## Overview
This repository is an extension of the [Fairseq toolkit](https://github.com/pytorch/fairseq) specialized for speech-to-text (S2T) generation tasks. It provides comprehensive support for Automatic Speech Recognition (ASR), Machine Translation (MT), and Speech Translation (ST), and contains the implementations of the methods proposed by the NiuTrans Team.

## Features
- Complete recipes: Kaldi-style recipe support for ASR, MT, and ST tasks, ensuring a smooth workflow.
- Various configurations: An extensive collection of YAML configuration files to customize models for different tasks and scenarios (see the sketch after this list).
- Easy reproduction: Comprehensive support for the methods in our papers, including SATE, PDS, CTC-NAST, BiL-CTC, and more.
- Multiple inference strategies: Greedy decoding, beam search, CTC decoding, CTC rescoring, and more.
- Training options: CTC multi-task learning, speed perturbation during pre-processing, and MT training in the ST style with an online tokenizer (note: this may be slow).
- Model options: Conformer architecture, relative position representation, loading of pre-trained modules, stacked acoustic-and-textual encoding, and progressive down-sampling for acoustic encoding.
- More features can be found in **run.sh**.
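For example, each recipe keeps its YAML files under conf/ and run.sh selects one by name via --train_config. A minimal sketch (reproduction_sate is the SATE configuration used in the reproduction examples below; the name is presumably passed without the .yaml suffix):

```bash
cd egs/mustc/st/
ls conf/                              # YAML configurations available for this recipe
./run.sh --stage 1 --stop_stage 1 --train_config reproduction_sate
```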
## Installation
1. Clone the repository:
```bash
git clone https://github.com/xuchennlp/S2T.git
```
2. Navigate to the project directory and install the required dependencies:
```bash
cd S2T
pip install -e .
```
Note: we have only tested the following environment: Python 3.8 and PyTorch 1.11.0. Optionally, [NVIDIA apex](https://github.com/NVIDIA/apex) and NCCL can be installed for faster multi-GPU training.
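A quick sanity check of the installation (assuming the package keeps upstream Fairseq's `fairseq` module name, since this repository extends that toolkit):

```bash
python -c "import fairseq; print(fairseq.__version__)"
```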
## Quick Start
1. Download your dataset and process it into the format of the MUST-C dataset.
2. Run the shell script **run.sh** in the corresponding directory:
```bash
# Set the ST_DIR environment variable to the parent directory of the S2T directory
export ST_DIR=/path/to/S2T/..
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2
```
- Stage 0 performs data processing, including audio feature extraction (not required for MT), vocabulary generation, and generation of the training and test files.
- Stage 1 performs model training, where multiple configurations are supported.
- Stage 2 performs model inference, where multiple decoding strategies are supported (see the sketch after this list).
- All details are available in **run.sh**.
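The stage range can also be narrowed to repeat a single step; for example, once data processing and training have finished, inference alone can be re-run:

```bash
# Re-run only stage 2 (inference), reusing the data and model produced by earlier stages
./run.sh --stage 2 --stop_stage 2
```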
## Reproduction of our methods
### SATE: Stacked Acoustic and Textual Encoding (ACL 2021)
**Paper**: [Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders](https://aclanthology.org/2021.acl-long.204/)
**Highlights**: a simple and effective method that utilizes pre-trained ASR and MT models to improve the end-to-end ST model; introduces an adapter to bridge the pre-trained encoders.
Here is an example on the MUST-C ST dataset.
```bash
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_sate
```
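Since SATE integrates pre-trained ASR and MT models, a natural workflow trains those models first with the sibling recipes (this ordering is our reading of the method, not an explicit step in run.sh):

```bash
cd egs/mustc/asr/ && ./run.sh --stage 0 --stop_stage 1   # pre-train the ASR model
cd ../mt/ && ./run.sh --stage 0 --stop_stage 1           # pre-train the MT model
cd ../st/ && ./run.sh --stage 0 --stop_stage 2 --train_config reproduction_sate
```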
### PDS: Progressive Down-Sampling (ACL 2023 Findings)
**Paper**: [Bridging the Granularity Gap for Acoustic Modeling](https://aclanthology.org/2023.findings-acl.688/)
**Highlights**: an effective method that facilitates the convergence of S2T training by increasing the modeling granularity of acoustic representations.
Here is an example on the MUST-C ST dataset. This method also supports the ASR task.
```bash
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds
```
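Because PDS also supports the ASR task, the corresponding recipe can presumably be driven the same way (assuming a reproduction_pds configuration also exists under egs/mustc/asr/conf/):

```bash
cd egs/mustc/asr/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds
```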
### NAST: Non-Autoregressive Speech Translation (ACL 2023)
**Paper**: [CTC-based Non-autoregressive Speech Translation](https://aclanthology.org/2023.acl-long.744/)
**Highlights**: a non-autoregressive modeling method that relies only on CTC inference and achieves results comparable to autoregressive methods.
Here is an example on the MUST-C ST dataset.
```bash
cd egs/mustc/st/
# Non-autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_nast
# Autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_ctc_aug
```
### BiL-CTC: Bilingual CTC (Submitted to ICASSP 2024)
**Paper**: [Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition](https://arxiv.org/abs/2309.12234/)
**Highlights**: introduces both cross-modal and cross-lingual CTC for S2T tasks and develops a novel implementation strategy called Synchronous BiL-CTC that outperforms the traditional progressive strategy (the implementation in NAST).
Here is an example on the MUST-C ST dataset.
```bash
cd egs/mustc/st/
# Progressive BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_progressive
# Synchronous BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_synchronous
```

## Code Structure
We supply the recipes for multiple benchmarks in the egs folder, including machine translation, speech recognition, and speech translation corpora. Besides, we also provide a template for other benchmarks.
Here is an example for MuST-C:
```markdown
mustc
├── asr
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
├── mt
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
└── st
    ├── binary.sh
    ├── conf/
    ├── decode.sh
    ├── local/
    ├── run.sh
    └── train.sh
```
* run.sh: the core script that includes the whole pipeline
* train.sh: calls run.sh for training (see the sketch after this list)
* decode.sh: calls run.sh for decoding
* binary.sh: generates the datasets alone
* conf: the folder that stores the configuration files (.yaml)
* local: the folder that stores utility scripts
* monitor.sh: checks the GPUs and runs the program automatically
* parse_options.sh: parses the parameters for run.sh
* utils.sh: utility shell functions
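As a sketch of how the wrapper scripts relate to run.sh (the exact flags each wrapper forwards are defined in the scripts themselves; the --train_config argument below is an assumption carried over from the run.sh examples above):

```bash
cd egs/mustc/st/
./train.sh --train_config reproduction_sate   # wraps run.sh for the training stage
./decode.sh                                   # wraps run.sh for the decoding stage
```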
## Acknowledgments
- Fairseq community for the base toolkit
- ESPnet community for the base toolkit
- NiuTrans Team for their contributions and research

Finally, thank you to everyone who has helped me during my research career. I sincerely hope that everyone can enjoy the pleasure of research.

## Feedback
If you have any questions, feel free to contact xuchennlp[at]outlook.com.

## Citations
```bibtex
@inproceedings{xu-etal-2021-stacked,
title = "Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders",
author = "Xu, Chen and
Hu, Bojie and
Li, Yanyang and
Zhang, Yuhao and
Huang, Shen and
Ju, Qi and
Xiao, Tong and
Zhu, Jingbo",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.204",
doi = "10.18653/v1/2021.acl-long.204",
pages = "2619--2630",
}
```