Commit f76b8cb3 by xuchen

README

parent 0cbb797d
# Speech-to-Text (S2T) toolkit
## Overview
This repository is an extension of the [Fairseq toolkit](https://github.com/pytorch/fairseq) specialized for speech-to-text (S2T) generation tasks. It provides comprehensive support for Automatic Speech Recognition (ASR), Machine Translation (MT), and Speech Translation (ST), and contains the implementations of the methods proposed by the NiuTrans Team.

## Features
- Complete recipes: Kaldi-style recipe support for ASR, MT, and ST tasks, ensuring a smooth workflow.
- Various configurations: An extensive collection of YAML configuration files to customize models for different tasks and scenarios (see the sketch after this list).
- Easy reproduction: Comprehensive support for the methods in our papers, including SATE, PDS, CTC-NAST, BiL-CTC, and more.
- Multiple inference strategies: Greedy decoding, beam search, CTC decoding, CTC rescoring, and more.
- Training options: CTC multi-task learning, speed perturbation during pre-processing, and MT training in the ST style with an online tokenizer (note: this may be slow).
- Model options: Conformer architecture, relative position representation, loading of pre-trained modules, stacked acoustic-and-textual encoding, and progressive down-sampling for acoustic encoding.
- More features can be found in **run.sh**.
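For example, each recipe keeps its YAML files under conf/ and run.sh selects one by name via --train_config. A minimal sketch (reproduction_sate is the SATE configuration used in the reproduction examples below; the name is presumably passed without the .yaml suffix):

```bash
cd egs/mustc/st/
ls conf/                              # YAML configurations available for this recipe
./run.sh --stage 1 --stop_stage 1 --train_config reproduction_sate
```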
## Installation
1. Clone the repository:
```bash
git clone https://github.com/xuchennlp/S2T.git
```
2. Navigate to the project directory and install the required dependencies:
```bash
cd S2T
pip install -e .
```
Note: we have only tested the following environment: Python 3.8 and PyTorch 1.11.0. Optionally, [NVIDIA apex](https://github.com/NVIDIA/apex) and NCCL can be installed for faster multi-GPU training.
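A quick sanity check of the installation (assuming the package keeps upstream Fairseq's `fairseq` module name, since this repository extends that toolkit):

```bash
python -c "import fairseq; print(fairseq.__version__)"
```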
## Quick Start
1. Download your dataset and process it into the format of the MUST-C dataset.
2. Run the shell script **run.sh** in the corresponding directory:
```bash
# Set the ST_DIR environment variable to the parent directory of the S2T directory
export ST_DIR=/path/to/S2T/..
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2
```
- Stage 0 performs data processing, including audio feature extraction (not required for MT), vocabulary generation, and generation of the training and test files.
- Stage 1 performs model training, where multiple configurations are supported.
- Stage 2 performs model inference, where multiple decoding strategies are supported (see the sketch after this list).
- All details are available in **run.sh**.
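The stage range can also be narrowed to repeat a single step; for example, once data processing and training have finished, inference alone can be re-run:

```bash
# Re-run only stage 2 (inference), reusing the data and model produced by earlier stages
./run.sh --stage 2 --stop_stage 2
```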
## Reproduction of our methods
### SATE: Stacked Acoustic and Textual Encoding (ACL 2021)
**Paper**: [Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders](https://aclanthology.org/2021.acl-long.204/)
**Highlights**: a simple and effective method that utilizes pre-trained ASR and MT models to improve the end-to-end ST model; introduces an adapter to bridge the pre-trained encoders.
Here is an example on the MUST-C ST dataset.
```bash
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_sate
```
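Since SATE integrates pre-trained ASR and MT models, a natural workflow trains those models first with the sibling recipes (this ordering is our reading of the method, not an explicit step in run.sh):

```bash
cd egs/mustc/asr/ && ./run.sh --stage 0 --stop_stage 1   # pre-train the ASR model
cd ../mt/ && ./run.sh --stage 0 --stop_stage 1           # pre-train the MT model
cd ../st/ && ./run.sh --stage 0 --stop_stage 2 --train_config reproduction_sate
```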
### PDS: Progressive Down-Sampling (ACL 2023 Findings)
**Paper**: [Bridging the Granularity Gap for Acoustic Modeling](https://aclanthology.org/2023.findings-acl.688/)
**Highlights**: an effective method that facilitates the convergence of S2T training by increasing the modeling granularity of acoustic representations.
Here is an example on the MUST-C ST dataset. This method also supports the ASR task.
```bash
cd egs/mustc/st/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds
```
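Because PDS also supports the ASR task, the corresponding recipe can presumably be driven the same way (assuming a reproduction_pds configuration also exists under egs/mustc/asr/conf/):

```bash
cd egs/mustc/asr/
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_pds
```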
### NAST: Non-Autoregressive Speech Translation (ACL 2023)
**Paper**: [CTC-based Non-autoregressive Speech Translation](https://aclanthology.org/2023.acl-long.744/)
**Highlights**: a non-autoregressive modeling method that relies only on CTC inference and achieves results comparable to autoregressive methods.
Here is an example on the MUST-C ST dataset.
```bash
cd egs/mustc/st/
# Non-autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_nast
# Autoregressive modeling
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_ctc_aug
```
### BiL-CTC: Bilingual CTC (Submitted to ICASSP 2024)
**Paper**: [Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition](https://arxiv.org/abs/2309.12234/)
**Highlights**: introduces both cross-modal and cross-lingual CTC for S2T tasks and develops a novel implementation strategy called Synchronous BiL-CTC that outperforms the traditional progressive strategy (the implementation in NAST).
Here is an example on the MUST-C ST dataset.
```bash
cd egs/mustc/st/
# Progressive BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_progressive
# Synchronous BiL-CTC
./run.sh --stage 0 --stop_stage 2 --train_config reproduction_bil_ctc_synchronous
```

## Code Structure
We supply the recipes for multiple benchmarks in the egs folder, including machine translation, speech recognition, and speech translation corpora. Besides, we also provide a template for other benchmarks.
Here is an example for MuST-C:
```markdown
mustc
├── asr
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
├── mt
│   ├── binary.sh
│   ├── conf/
│   ├── decode.sh
│   ├── local/
│   ├── run.sh
│   └── train.sh
└── st
    ├── binary.sh
    ├── conf/
    ├── decode.sh
    ├── local/
    ├── run.sh
    └── train.sh
```
* run.sh: the core script that includes the whole pipeline
* train.sh: calls run.sh for training (see the sketch after this list)
* decode.sh: calls run.sh for decoding
* binary.sh: generates the datasets alone
* conf: the folder that stores the configuration files (.yaml)
* local: the folder that stores utility scripts
* monitor.sh: checks the GPUs and runs the program automatically
* parse_options.sh: parses the parameters for run.sh
* utils.sh: utility shell functions
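As a sketch of how the wrapper scripts relate to run.sh (the exact flags each wrapper forwards are defined in the scripts themselves; the --train_config argument below is an assumption carried over from the run.sh examples above):

```bash
cd egs/mustc/st/
./train.sh --train_config reproduction_sate   # wraps run.sh for the training stage
./decode.sh                                   # wraps run.sh for the decoding stage
```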
## Acknowledgments
- Fairseq community for the base toolkit
- ESPnet community for the base toolkit
- NiuTrans Team for their contributions and research

Finally, thank you to everyone who has helped me during my research career. I sincerely hope that everyone can enjoy the pleasure of research.

## Feedback
If you have any questions, feel free to contact xuchennlp[at]outlook.com.

## Citations
```bibtex
@inproceedings{xu-etal-2021-stacked,
title = "Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders",
author = "Xu, Chen and
Hu, Bojie and
Li, Yanyang and
Zhang, Yuhao and
Huang, Shen and
Ju, Qi and
Xiao, Tong and
Zhu, Jingbo",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.204",
doi = "10.18653/v1/2021.acl-long.204",
pages = "2619--2630",
}
```