A few points still confuse me:

1. Whether to scale the embeddings at the input layer (I tried replacing the scaling with layer normalization);
2. The exact configuration of weight sharing between the output projection matrix and the embedding matrix in the adapter (I noticed that inconsistent variances between the two lead to poor results);
3. Most puzzling of all, the variance of the activations grows layer by layer as the computation proceeds (I am not sure whether this is expected; I will compare against the behavior of the latest code).

In short, implementation details matter a great deal to final performance, even when the differences are subtle.
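The three points above can be illustrated numerically. This is a small NumPy sketch, not fairseq code: the dimension `d = 512`, vocabulary size, and the six-layer loop are illustrative assumptions, and the unit-variance noise stands in for attention/FFN sublayer outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                # model dimension (illustrative)
vocab = 10000                          # vocabulary size (illustrative)

# (1) Input scaling: with embeddings initialized as N(0, 1/sqrt(d)), each
# element has variance 1/d; multiplying by sqrt(d) at the input layer
# restores unit variance, putting embeddings on the same scale as the
# sinusoidal positional encodings.
emb = rng.normal(0.0, d ** -0.5, size=(vocab, d))
tokens = rng.integers(0, vocab, size=1000)
x = emb[tokens]
x_scaled = x * np.sqrt(d)
print(np.var(x), np.var(x_scaled))     # roughly 1/d vs. roughly 1

# (2) Weight tying: the output projection reuses emb, so the logit variance
# is var(h) * d * var(emb) = var(h) when var(emb) = 1/d.  Rescaling the
# shared matrix changes both the input and output ends at once, which is
# one way inconsistent variances in a tied setup degrade results.
logits = x_scaled @ emb.T
print(np.var(logits))                  # roughly 1

# (3) Residual-stream variance: if each layer adds a roughly unit-variance
# sublayer output to the stream, the variance grows about linearly with
# depth -- consistent with the layer-by-layer growth described above.
h = x_scaled
for layer in range(6):
    sublayer_out = rng.normal(0.0, 1.0, size=h.shape)  # stand-in for attn/FFN
    h = h + sublayer_out
    print(layer, round(float(np.var(h)), 2))
```

After six such additions the residual stream's variance is near 7 (initial 1 plus one per layer), so growth of this kind is expected arithmetic rather than a bug, though real sublayer outputs are neither independent nor exactly unit-variance.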