Commit 037d45dc by libei

WMT19 training system based on Tensor2Tensor-1.0.14

<?xml version="1.0" encoding="UTF-8"?>
<module type="PYTHON_MODULE" version="4">
<component name="NewModuleRootManager">
<content url="file://$MODULE_DIR$" />
<orderEntry type="inheritedJdk" />
<orderEntry type="sourceFolder" forTests="false" />
</component>
<component name="TestRunnerService">
<option name="PROJECT_TEST_RUNNER" value="Unittests" />
</component>
</module>
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="ProjectModuleManager">
<modules>
<module fileurl="file://$PROJECT_DIR$/.idea/WMT19.iml" filepath="$PROJECT_DIR$/.idea/WMT19.iml" />
</modules>
</component>
</project>
# T2T: Tensor2Tensor Transformers
[![PyPI
version](https://badge.fury.io/py/tensor2tensor.svg)](https://badge.fury.io/py/tensor2tensor)
[![GitHub
Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](https://github.com/tensorflow/tensor2tensor/issues)
[![Contributions
welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
[![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
[![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
[T2T](https://github.com/tensorflow/tensor2tensor) is a modular and extensible
library and binaries for supervised learning with TensorFlow and with support
for sequence tasks. It is actively used and maintained by researchers and
engineers within the Google Brain team. You can read more about Tensor2Tensor in
the recent [Google Research Blog post introducing
it](https://research.googleblog.com/2017/06/accelerating-deep-learning-research.html).
We're eager to collaborate with you on extending T2T, so please feel
free to [open an issue on
GitHub](https://github.com/tensorflow/tensor2tensor/issues) or
send along a pull request to add your dataset or model.
See [our contribution
doc](CONTRIBUTING.md) for details and our [open
issues](https://github.com/tensorflow/tensor2tensor/issues).
And chat with us and other users on
[Gitter](https://gitter.im/tensor2tensor/Lobby).
### Contents
* [Walkthrough](#walkthrough)
* [Installation](#installation)
* [Features](#features)
* [T2T Overview](#t2t-overview)
* [Datasets](#datasets)
* [Problems and Modalities](#problems-and-modalities)
* [Models](#models)
* [Hyperparameter Sets](#hyperparameter-sets)
* [Trainer](#trainer)
* [Adding your own components](#adding-your-own-components)
* [Adding a dataset](#adding-a-dataset)
---
## Walkthrough
Here's a walkthrough training a good English-to-German translation
model using the Transformer model from [*Attention Is All You
Need*](https://arxiv.org/abs/1706.03762) on WMT data.
```
pip install tensor2tensor
# See what problems, models, and hyperparameter sets are available.
# You can easily swap between them (and add new ones).
t2t-trainer --registry_help
PROBLEM=wmt_ende_tokens_32k
MODEL=transformer
HPARAMS=transformer_base_single_gpu
DATA_DIR=$HOME/t2t_data
TMP_DIR=/tmp/t2t_datagen
TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
# Generate data
t2t-datagen \
--data_dir=$DATA_DIR \
--tmp_dir=$TMP_DIR \
--num_shards=100 \
--problem=$PROBLEM
cp $TMP_DIR/tokens.vocab.* $DATA_DIR
# Train
# * If you run out of memory, add --hparams='batch_size=2048' or even 1024.
t2t-trainer \
--data_dir=$DATA_DIR \
--problems=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--output_dir=$TRAIN_DIR
# Decode
DECODE_FILE=$DATA_DIR/decode_this.txt
echo "Hello world" >> $DECODE_FILE
echo "Goodbye world" >> $DECODE_FILE
BEAM_SIZE=4
ALPHA=0.6
t2t-trainer \
--data_dir=$DATA_DIR \
--problems=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--output_dir=$TRAIN_DIR \
--train_steps=0 \
--eval_steps=0 \
--decode_beam_size=$BEAM_SIZE \
--decode_alpha=$ALPHA \
--decode_from_file=$DECODE_FILE
cat $DECODE_FILE.$MODEL.$HPARAMS.beam$BEAM_SIZE.alpha$ALPHA.decodes
```
---
## Installation
```
# Assumes tensorflow or tensorflow-gpu installed
pip install tensor2tensor
# Installs with tensorflow-gpu requirement
pip install tensor2tensor[tensorflow_gpu]
# Installs with tensorflow (cpu) requirement
pip install tensor2tensor[tensorflow]
```
Binaries:
```
# Data generator
t2t-datagen
# Trainer
t2t-trainer --registry_help
```
Library usage:
```
python -c "from tensor2tensor.models.transformer import Transformer"
```
---
## Features
* Many state of the art and baseline models are built-in and new models can be
added easily (open an issue or pull request!).
* Many datasets across modalities - text, audio, image - available for
generation and use, and new ones can be added easily (open an issue or pull
request for public datasets!).
* Models can be used with any dataset and input mode (or even multiple); all
modality-specific processing (e.g. embedding lookups for text tokens) is done
with `Modality` objects, which are specified per-feature in the dataset/task
specification.
* Support for multi-GPU machines and synchronous (1 master, many workers) and
asynchronous (independent workers synchronizing through a parameter server)
distributed training.
* Easily swap amongst datasets and models by command-line flag with the data
generation script `t2t-datagen` and the training script `t2t-trainer`.
---
## T2T Overview
### Datasets
**Datasets** are all standardized on `TFRecord` files with `tensorflow.Example`
protocol buffers. All datasets are registered and generated with the
[data
generator](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/bin/t2t-datagen)
and many common sequence datasets are already available for generation and use.
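To make the storage format concrete, here is a minimal sketch of writing one sentence pair as a `tensorflow.Example` (the `inputs`/`targets` feature names follow T2T's convention for sequence tasks; the file name and token ids are illustrative):
```python
import tensorflow as tf

def write_pair(writer, src_ids, tgt_ids):
    # one tensorflow.Example per sentence pair, token ids stored as int64 lists
    example = tf.train.Example(features=tf.train.Features(feature={
        "inputs": tf.train.Feature(int64_list=tf.train.Int64List(value=src_ids)),
        "targets": tf.train.Feature(int64_list=tf.train.Int64List(value=tgt_ids)),
    }))
    writer.write(example.SerializeToString())

with tf.python_io.TFRecordWriter("train.tfrecord") as writer:
    write_pair(writer, [5, 8, 2, 1], [7, 3, 1])  # toy token ids
```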
### Problems and Modalities
**Problems** define training-time hyperparameters for the dataset and task,
mainly by setting input and output **modalities** (e.g. symbol, image, audio,
label) and vocabularies, if applicable. All problems are defined in
[`problem_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem_hparams.py).
**Modalities**, defined in
[`modality.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/modality.py),
abstract away the input and output data types so that **models** may deal with
modality-independent tensors.
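Conceptually (this sketch is illustrative, not the actual `Modality` API), a symbol modality embeds token ids on the way into the model body and projects dense output back to vocabulary logits on the way out:
```python
import numpy as np

class SymbolModalitySketch:
    """Illustrative only: the two mappings a symbol modality abstracts away."""

    def __init__(self, vocab_size, hidden_size):
        # shared embedding matrix, reused as the output projection
        self.embedding = np.random.randn(vocab_size, hidden_size) * 0.02

    def bottom(self, ids):
        # input side: token ids [batch, length] -> dense [batch, length, hidden]
        return self.embedding[np.asarray(ids)]

    def top(self, body_output):
        # output side: dense [batch, length, hidden] -> logits [batch, length, vocab]
        return body_output @ self.embedding.T
```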
### Models
**`T2TModel`s** define the core tensor-to-tensor transformation, independent of
input/output modality or task. Models take dense tensors in and produce dense
tensors that may then be transformed in a final step by a **modality** depending
on the task (e.g. fed through a final linear transform to produce logits for a
softmax over classes). All models are imported in
[`models.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/models/models.py),
inherit from `T2TModel` - defined in
[`t2t_model.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/t2t_model.py)
- and are registered with
[`@registry.register_model`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/registry.py).
### Hyperparameter Sets
**Hyperparameter sets** are defined and registered in code with
[`@registry.register_hparams`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/registry.py)
and are encoded in
[`tf.contrib.training.HParams`](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/training/python/training/hparam.py)
objects. The `HParams` are available to both the problem specification and the
model. A basic set of hyperparameters are defined in
[`common_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/models/common_hparams.py)
and hyperparameter set functions can compose other hyperparameter set functions.
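Composition simply means one registered hparams function calling another and overriding a few fields; a minimal sketch (the set name `transformer_big_batch` is made up for illustration):
```python
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

@registry.register_hparams
def transformer_big_batch():
    # start from an existing set, then override only what differs
    hparams = transformer.transformer_base()
    hparams.batch_size = 8192
    return hparams
```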
### Trainer
The **trainer** binary is the main entrypoint for training, evaluation, and
inference. Users can easily switch between problems, models, and hyperparameter
sets by using the `--model`, `--problems`, and `--hparams_set` flags. Specific
hyperparameters can be overridden with the `--hparams` flag. `--schedule` and
related flags control local and distributed training/evaluation
([distributed training documentation](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/docs/distributed_training.md)).
---
## Adding your own components
T2T's components are registered using a central registration mechanism that
enables easily adding new ones and easily swapping amongst them by command-line
flag. You can add your own components without editing the T2T codebase by
specifying the `--t2t_usr_dir` flag in `t2t-trainer`.
You can currently do so for models, hyperparameter sets, and modalities. Please
do submit a pull request if your component might be useful to others.
Here's an example with a new hyperparameter set:
```python
# In ~/usr/t2t_usr/my_registrations.py
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry
@registry.register_hparams
def transformer_my_very_own_hparams_set():
hparams = transformer.transformer_base()
hparams.hidden_size = 1024
...
```
```python
# In ~/usr/t2t_usr/__init__.py
from . import my_registrations
```
```
t2t-trainer --t2t_usr_dir=~/usr/t2t_usr --registry_help
```
You'll see under the registered HParams your
`transformer_my_very_own_hparams_set`, which you can directly use on the command
line with the `--hparams_set` flag.
## Adding a dataset
See the [data generators
README](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/README.md).
---
*Note: This is not an official Google product.*
deal-test2bpe.sh
Usage
Parameters you must set:
eval_dir=  # path of the eval directory (note the trailing /), e.g. eval_dir=*****/eval/
           # the eval directory holds one folder per test set (mt06, mt08, ...)
src_bpe=   # path of the source-side BPE vocabulary, e.g. for zh-->en translation, src_bpe=****/zh.bpe
input=input.token  # naming key for each test set's source file; the source file name must contain this key, usually left unchanged
output=input.bpe   # naming key for the BPE file generated from each test set's source; usually left unchanged
PYTHON=python3.6   # the python binary; inside a virtualenv this can be just python
APPLY_BPE=./subword-nmt-master/apply_bpe.py  # path of apply_bpe.py
How to run:
sh deal-test2bpe.sh
<!-- toc -->
---
# Environment Setup
## 1. Upgrade to Python 3.6
**If you are already on 3.6, skip this step.**
```bash
sudo yum update
sudo yum install yum-utils
sudo yum groupinstall development
sudo yum install https://centos7.iuscommunity.org/ius-release.rpm
sudo yum install python36u
python3.6 -V
sudo yum install python36u-pip
sudo yum install python36u-devel
```
## 2. Install sqlite3 (Aliyun only)
Needed to launch `tensorboard`; without it you get the error `No module named _sqlite3`
```bash
wget https://www.sqlite.org/2017/sqlite-autoconf-3170000.tar.gz --no-check-certificate
tar zxvf sqlite-autoconf-3170000.tar.gz
cd sqlite-autoconf-3170000
./configure --prefix=/usr/local/sqlite3 --disable-static --enable-fts5 --enable-json1 CFLAGS="-g -O2 -DSQLITE_ENABLE_FTS3=1 -DSQLITE_ENABLE_FTS4=1 -DSQLITE_ENABLE_RTREE=1"
make
sudo make install
```
Then recompile `python3.6` against it:
```bash
# the download is slow from Aliyun; you can download it on Windows and upload it instead
wget https://www.python.org/ftp/python/3.6.4/Python-3.6.4.tgz
tar zxvf Python-3.6.4.tgz
cd Python-3.6.4
LD_RUN_PATH=/usr/local/sqlite3/lib ./configure LDFLAGS="-L/usr/local/sqlite3/lib" CPPFLAGS="-I /usr/local/sqlite3/include"
LD_RUN_PATH=/usr/local/sqlite3/lib make
LD_RUN_PATH=/usr/local/sqlite3/lib sudo make install
```
## 3. Create a virtual environment
```bash
python3.6 -m venv 'your env name'
```
e.g. `python3.6 -m venv env-cwmt` creates an `env-cwmt` directory under the current directory
## 4. Activate the virtual environment
```bash
source 'your env name'/bin/activate
```
e.g. `source env-cwmt/bin/activate`; your shell prompt then becomes `(env-cwmt)***`
## 5. Configure the pip mirror
Point pip at the Tsinghua mirror:
```shell
mkdir ~/.pip
vi ~/.pip/pip.conf
```
with the following contents:
```shell
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
[install]
trusted-host=mirrors.aliyun.com
```
## 6. Install tensorflow
```bash
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow-gpu==1.3.0
```
On an Aliyun server, the Tsinghua mirror is much faster; use `tensorflow` version `1.3.0`
## 7. Install other packages
```bash
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple sympy
```
---
# Directory Layout
* bin/
model training, decoding, and scoring*
> training script: `train.sh`
>
> decoding scripts: `decoder.sh`, `translate_dataset.sh`
>
> scoring script*: **todo**
* data/
the datasets, organized as `language pair`/`version`-`segmentation`-`dataset`, e.g. `zh2en/v4-bpe32k-cwmt` means version `v4`, segmented with `bpe32k`, on the `cwmt` dataset
> the `train` directory holds the `tfRecord` files generated by `tgt-gen.py`, plus the BPE output
> the `eval` directory holds the validation set and all test sets; each evaluation set folder contains `input.bpe` (BPE of the source input), `input.token` (tokenized source input, not actually used) and `ref*` (one or more references)
* output/
the generated models, organized as `language pair`/`version`-`segmentation`-`dataset`/`tag`
> if ensemble decoding has been run, a directory holding the ensemble model is also created, e.g. `ensemble15`
* tensor2tensor/
the core code
* doc/
documentation and experiment notes
---
# Training Workflow
## Configuration
Edit `bin/train.sh` to configure the run.
### Hardware
- `dev`: the GPU devices to use, e.g. `dev=0,1,2,3`
- `gpu_fraction`: the memory fraction to occupy on each GPU, e.g. `gpu_fraction=0.95`; usually left unchanged
### Dataset
- `lang`: the translation direction, e.g. `lang=zh2en`
- `datatype`: the data type, e.g. `datatype=v4-bpe32k`
- `dataset`: the training dataset, e.g. `dataset=cwmt`
### Training parameters
- `model`: the registered model to use, e.g. `model=transformer`
- `param`: the registered hyperparameter set to use, e.g. `param=transformer_base`
- `train_step`: the number of updates to run, e.g. `train_step=103000`, which is roughly `10` epochs on the `cwmt700w` data
- `other_hparams`: training hyperparameters overridden on the fly, most commonly the batch size, e.g. `other_hparams='batch_size=2048'`; any other parameter must be registered in the code
- `tag`: the name of the current experiment, e.g. `tag=baseline-epoch20`; it only exists for your bookkeeping. **Change it for every new experiment**
## Usage
After updating the settings, simply run `./train.sh`. The script starts training and automatically runs validation on the `cpu` alongside multi-GPU training.
Then open `tensorboard` and watch the curves.
---
# Decoding Workflow
## Configuration
### Hardware
* `device`: the GPU devices to use, e.g. `device=(0 1 2 3)`
* > note: here `device` is a `shell` array, not a string as in `train.sh`; separate multiple values with spaces
>
> `device` may list several GPUs; each decoding process only uses one GPU, but with several GPUs we can run different experiments in parallel
### Evaluation
* `is_eval`: whether to evaluate (e.g. compute `BLEU`); `1` means decode + evaluate, `0` means decode only
* > set to `1` in this project
* `eval_tool`: the evaluation tool, either `multi-bleu` or `mteval`
* > set to `multi-bleu` in this project
* `lowercase`: whether to lowercase everything during evaluation, i.e. case-insensitive scoring; `1` means case-insensitive, `0` case-sensitive
* > currently only applies to `multi-bleu`; set to `0` in this project
### Dataset / vocabulary
* `lang`: the translation direction, e.g. `lang=zh2en`
* `datatype`: the data type, e.g. `datatype=v4-bpe32k`
* `dataset`: the training dataset, e.g. `dataset=cwmt`
* `evalset`: the test sets to be evaluated
* > this is a `shell` array; set several values separated by spaces
### Translation model
* `model`: the registered model, e.g. `model=transformer`
* `param`: the registered hyperparameter set, e.g. `param=transformer_base`
* `tag`: the experiment name of the model, e.g. `tag=baseline`; **remember to update it**
### Decoding parameters
* `beam_size`: the `beam` width, usually `12`
* `batch_size`: the decoding `batch` size, usually `32`; lower it on `OOM` errors
* `alphas`: the coefficient of length normalization (the length penalty applied by beam search); see the sketch after this list
* > this is a `shell` array; give several values and the script runs a `grid search` for the best `alpha` automatically
* `ensemble`: the number of checkpoints used for `checkpoint average`
* > for a single model, do not set this value; leave it empty
>
> for an ensemble, e.g. averaging the last `15` checkpoints, set it to `15`
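For reference, `alpha` is the exponent of the GNMT-style length penalty that Tensor2Tensor's beam search divides hypothesis scores by; a minimal sketch of the idea (function names here are illustrative):
```python
def length_penalty(length, alpha):
    # GNMT length normalization: ((5 + length) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob, length, alpha):
    # beam search ranks hypotheses by log-prob / penalty;
    # larger alpha favors longer translations, alpha=0 disables normalization
    return log_prob / length_penalty(length, alpha)

print(normalized_score(-8.0, 20, 1.2))  # a 20-token hypothesis with total log-prob -8.0
```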
## Usage
Decoding consists of 4 main steps:
* single model: search for the best hyperparameter `alpha` on the `validation set`
* single model: run all `test sets` with the best `alpha`
* `ensemble` model: search for the best hyperparameter `alpha` on the `validation set`
* `ensemble` model: run all `test sets` with the best `alpha`
> the `ensemble` model's `alpha` does not always need a fresh search, or the search range can be smaller (it is usually close to the single-model `alpha`, never drastically different)
>
> the `bp` and `ratio` values of the final decoding report tell you whether `alpha` needs further tuning
>
> the results of all decoding runs are collected and printed together when the program finishes
Suppose you have just finished training a model on `4 GPUs (0,1,2,3)`; set the `dataset` & `translation model` parameters accordingly (as in `train.sh`).
### 1. Single model, hyperparameter search
* set the decoding GPUs; use as many as you can
```shell
device=(0 1 2 3)
```
* set the validation set
```shell
evalset=(cwmt18-dev)
```
* set the values to grid search
```shell
alphas=(0.9 1.0 1.1 1.2 1.3)
```
Five values are used here; you may list more values than GPUs, the script schedules them automatically
* leave ensemble empty
```shell
ensemble=
```
### 2. Single model, run the test sets
- set the decoding GPUs; use as many as you can
```shell
device=(0 1 2 3)
```
- set the test sets, 8 in total
```shell
evalset=(cwmt17-dev wmt17-test mt06 mt08 mt12-nd mt12-nw mt12-wb exact2k)
```
- set the hyperparameter to the best value from the previous step, say `1.2`
```shell
alphas=(1.2)
```
- leave ensemble empty
```shell
ensemble=
```
### 3. Ensemble model, hyperparameter search
- set the decoding GPUs; use as many as you can
```shell
device=(0 1 2 3)
```
- set the validation set
```shell
evalset=(cwmt18-dev)
```
- set the values to grid search, **fewer than for the single model; just search around the single-model optimum**
```shell
alphas=(1.1 1.2 1.3)
```
- set ensemble, say averaging the last `15` checkpoints
```shell
ensemble=15
```
### 4. Ensemble model, run the test sets
- set the decoding GPUs; use as many as you can
```shell
device=(0 1 2 3)
```
- set the test sets, 8 in total
```shell
evalset=(cwmt17-dev wmt17-test mt06 mt08 mt12-nd mt12-nw mt12-wb exact2k)
```
- set the hyperparameter to the best value from the previous step, say `1.2`
```shell
alphas=(1.2)
```
- set ensemble
```shell
ensemble=15
```
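For the record, the `ensemble` option here is plain checkpoint averaging rather than run-time model ensembling: the variables of the last N checkpoints are averaged element-wise into a single model (the full script ships with this repo; the sketch below only shows the idea):
```python
def average_checkpoints(var_dicts):
    # var_dicts: one {variable_name: numpy_array} dict per checkpoint;
    # the result is the element-wise mean across all checkpoints
    n = float(len(var_dicts))
    return {name: sum(d[name] for d in var_dicts) / n for name in var_dicts[0]}
```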
#encoding=utf-8
import os
import sys
def calBatchNum(srcfile, dstfile, batchsize):
    # abort early if either file is missing
    try:
        open(srcfile, 'r').close()
    except IOError:
        print('srcfile does not exist!')
        return False
    try:
        open(dstfile, 'r').close()
    except IOError:
        print('dstfile does not exist!')
        return False
maxlist=[]
with open(srcfile, encoding='utf-8') as srclines, open(dstfile) as dstlines:
for srcline, dstline in zip(srclines, dstlines):
srclinelist = srcline.split(' ')
srclinenum = len(srclinelist)
dstlinelist = dstline.split(' ')
dstlinenum = len(dstlinelist)
maxlist.append(max(srclinenum, dstlinenum))
batchnum = 1
batchroom = batchsize
for i in range(0, len(maxlist)):
batchroom = batchroom - maxlist[i]
if batchroom < 0:
batchnum = batchnum + 1
batchroom = batchsize - maxlist[i]
            if batchroom < 0:
                print('cannot make room for sentence', i)
                return False
print('total batch number is', batchnum)
    return True
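# Worked example: with batchsize=8 and per-pair costs [3, 4, 5]
# (cost = max of source/target token counts), pairs 1-2 fill batch 1 (3+4=7 <= 8)
# and pair 3 opens batch 2, so the script reports "total batch number is 2".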
if __name__ == "__main__":
srcfile = ''
dstfile = ''
batchsizestr = ''
batchsize = 0
if len(sys.argv) == 4:
srcfile = sys.argv[1]
dstfile = sys.argv[2]
batchsizestr = sys.argv[3]
batchsize = int(batchsizestr)
else:
errorInfo = '****************** Error ******************\r\n'
errorInfo = errorInfo + 'Please input with srcfile path, dstfile path and batchsize\r\n'
        errorInfo = errorInfo + 'e.g.:$ python ./calBatchNum.py srcfile dstfile batchsize\r\n'
print (errorInfo)
exit(1)
calBatchNum(srcfile, dstfile, batchsize)
#! /usr/bin/bash
set -e
##################################SET PARAMS#######################################
# eval directory, e.g. ../data/zh2en/v4-bpe32k-cwmt18/eval/
eval_dir=./eval/
# source-side BPE vocabulary path, e.g. ../data/zh2en/v4-bpe32k-cwmt18/eval/src.bpe
src_bpe=./bpe0.5/zh.bpe
# source-side naming key: test set source files must contain input.token; the matching BPE output is named input.bpe
input=input.token
output=input.bpe
# the python binary
PYTHON=python
# path of the apply_bpe script
APPLY_BPE=./subword-nmt-master/apply_bpe.py
################################SET PARAMS#########################################
if [ ! -d "$eval_dir" ]; then
echo "$eval_dir is not exists."
exit 1
fi
echo "######## START RUN ########"
for file in $eval_dir/* ;do
{
if [ -d $file ]; then
flag1=1
for sub_file in $file/*;do
{
if [ -f $sub_file ]; then
if [[ $sub_file =~ $output ]]; then
flag1=2
echo " HAS EXISTS , PROCESS FILE FOLDER : $file "
break
fi
fi
}
done
if [ "$flag1" != "2" ]; then
flag2=3
for sub_file in $file/*;do
{
if [ -f $sub_file ]; then
if [[ $sub_file =~ $input ]]; then
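# ${var/%token/bpe} replaces the trailing "token" with "bpe" to name the BPE output file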
sub_file_bpe=${sub_file/%"token"/"bpe"}
#echo $sub_file_bpe
cmd="$PYTHON $APPLY_BPE -c $src_bpe -i $sub_file -o $sub_file_bpe"
$cmd
echo " CREATE SUCCESSFUL , PROCESS FILE FOLDER : $file "
flag2=4
fi
fi
}
done
if [ "$flag2" != "4" ]; then
echo "WARNING: make sure source filename in the $file contains key, $key "
fi
fi
fi
}
done
echo "######## END OF PROGRAM ########"
###########################################
### configuration file for SMT ###
### ###
### 2013-04-19 ###
###########################################
# punct mapping dictionary for detoken
param="Punct-Mapping-Dict" value="./punctuation.mapping.dat"
# system log path
param="system-log" value="./system.detoken.log"
, , NULL
. 。 NULL
( ( NULL
) ) NULL
; ; NULL
! ! NULL
? ? NULL
' ‘ ’
" “ ”
I exist only so that an empty folder can be committed.
If you need this folder, just replace me; otherwise please ignore me.
And if you still need more folders, go ahead and create them yourself~
I exist only so that an empty folder can be committed.
If you need this folder, just replace me; otherwise please ignore me.
And if you still need more folders, go ahead and create them yourself~
I exist only so that an empty folder can be committed.
If you need this folder, just replace me; otherwise please ignore me.
And if you still need more folders, go ahead and create them yourself~
I exist only so that an empty folder can be committed.
If you need this folder, just replace me; otherwise please ignore me.
And if you still need more folders, go ahead and create them yourself~
I exist only so that an empty folder can be committed.
If you need this folder, just replace me; otherwise please ignore me.
And if you still need more folders, go ahead and create them yourself~
I exist only so that an empty folder can be committed.
If you need this folder, just replace me; otherwise please ignore me.
And if you still need more folders, go ahead and create them yourself~
/*
* $Id:
* 0008
*
* $File:
* basic_method.cpp
*
* $Proj:
* Decoder for Statistical Machine Translation
*
* $Func:
* basic method
*
* $Version:
* 0.0.1
*
* $Created by:
* Qiang Li
*
* $Email
* liqiangneu@gmail.com
*
* $Last Modified by:
* 2014-04-11,16:56
* 2012-12-04,16:29
*/
#include "basic_method.h"
namespace basic_method {
/*
* $Name: Split
* $Function: Split string with char
* $Date: 2014-04-11
*/
bool BasicMethod::Split(const string &phraseTable, const char &splitchar, vector< string > &dest) {
string::size_type splitPos = phraseTable.find(splitchar);
string::size_type lastSplitPos = 0;
string tempString;
while (splitPos != string::npos) {
tempString = phraseTable.substr(lastSplitPos, splitPos - lastSplitPos);
if (!tempString.empty()) {
dest.push_back(tempString);
}
lastSplitPos = splitPos + 1;
splitPos = phraseTable.find(splitchar, lastSplitPos);
}
if (lastSplitPos < phraseTable.size()) {
tempString = phraseTable.substr(lastSplitPos);
dest.push_back(tempString);
}
if (!dest.empty()) {
return true;
} else {
return false;
}
}
/*
* $Name: splitWithStr
* $Function: Split string with string
* $Date: 2014-04-11
*/
bool BasicMethod::SplitWithStr(const string &src, const string &separator, vector< string > &dest) {
string str = src;
string substring;
string::size_type start = 0, index = 0;
string::size_type separator_len = separator.size();
while (index != string::npos && start < src.size()) {
index = src.find(separator, start);
if (index == 0) {
start = start + separator_len;
continue;
}
if (index == string::npos) {
dest.push_back(src.substr(start));
break;
}
dest.push_back(src.substr(start,index-start));
start = index + separator_len;
}
return true;
}
bool BasicMethod::Replace_String(string & original , const string & source_str , const string & target_str)
{
string::size_type pos = 0;
string::size_type src_len = source_str.size();
string::size_type tgt_len = target_str.size();
while( (pos = original.find(source_str, pos)) != string::npos)
{
original.replace(pos, src_len, target_str);
pos += tgt_len;
}
return true;
}
int BasicMethod::Get_Word_Count(string & input , char sep)
{
RmEndSpace(input);
int word_count = 0;
string::size_type split_pos = input.find(sep);
string::size_type last_split_pos = 0;
while (split_pos != string::npos)
{
word_count++;
last_split_pos = split_pos + 1;
split_pos = input.find(sep, last_split_pos);
}
return ++word_count;
}
/*
* $Name: size_tToString
* $Function:
* $Date: 2014-04-11
*/
string BasicMethod::size_tToString(size_t &source) {
stringstream oss;
oss << source;
return oss.str();
}
/*
* $Name: intToString
* $Function:
* $Date: 2014-04-11
*/
string BasicMethod::intToString(int &source) {
stringstream oss;
oss << source;
return oss.str();
}
/*
* $Name: ConvertCharToString
* $Function:
* $Date: 2014-04-11
*/
string BasicMethod::ConvertCharToString( char &input_char ) {
stringstream oss;
oss << input_char;
return oss.str();
}
/*
* $Name: ClearIllegalChar
* $Function:
* $Date: 2014-04-11
*/
bool BasicMethod::ClearIllegalChar( string &str ) {
string::size_type pos = 0;
while( ( pos = str.find( "\r", pos ) ) != string::npos ) {
str.replace( pos, 1, "" );
}
pos = 0;
while( ( pos = str.find( "\n", pos ) ) != string::npos ) {
str.replace( pos, 1, "" );
}
return true;
}
/*
* $Name: toUpper
* $Function:
* $Date: 2014-04-11
*/
bool BasicMethod::toUpper( string &str ) {
for( string::size_type i = 0; i < str.size(); ++i ) {
if( islower( ( unsigned char )str.at( i ) ) ) {
str.at( i ) = toupper( ( unsigned char )str.at( i ) );
}
}
return true;
}
/*
* $Name: toLower
* $Function:
* $Date: 2014-04-11
*/
bool BasicMethod::ToLower(string &str) {
for (string::size_type i = 0; i < str.size(); ++i) {
if (isupper((unsigned char)str.at(i))) {
str.at(i) = tolower((unsigned char)str.at(i));
}
}
return true;
}
/*
* $Name: RmEndSpace
* $Function: no trailing space
* $Date: 2014-04-11
*/
bool BasicMethod::RmEndSpace(string &str) {
if (str != "") {
string tmpStr;
int pos = (int)str.length() - 1;
while (pos >= 0 && str[ pos ] == ' ') {
--pos;
}
tmpStr = str.substr(0, pos + 1);
str = tmpStr;
}
return true;
}
/*
* $Name: RmStartSpace
* $Function: no leading space
* $Date: 2014-04-11
*/
bool BasicMethod::RmStartSpace(string &str) {
string tmpStr;
size_t pos = 0;
for( string::iterator iter = str.begin(); iter != str.end(); ++iter ) {
if( *iter != ' ' ) {
tmpStr = str.substr( pos, str.length() - pos );
break;
} else {
++pos;
}
}
str = tmpStr;
return true;
}
/*
* $Name: RemoveExtraSpace
* $Function: One space only between words
* $Date: 2014-04-11
*/
bool BasicMethod::RemoveExtraSpace( string &input_string, string &output_string ) {
char preceded_char = ' ';
for( string::iterator iter = input_string.begin(); iter != input_string.end(); ++ iter ) {
if( *iter == ' ' && preceded_char == ' ' ) {
continue;
} else {
output_string.push_back( *iter );
preceded_char = *iter;
}
}
return true;
}
/*
* $Name: deleteFileList
* $Function:
* $Date: 2014-04-11
*/
bool BasicMethod::deleteFileList( vector<string> &fileList, SystemCommand &systemCommand ) {
clock_t start,finish;
string command;
for( vector< string >::iterator iter = fileList.begin(); iter != fileList.end(); ++iter ) {
command = systemCommand.delete_command_ + *iter;
start = clock();
cerr<<"Delete\n"
<<" command : "<<command<<"\n"
<<" input : "<<*iter<<"\n"
<<flush;
system( command.c_str() );
finish = clock();
cerr<<" time : "<<(double)(finish-start)/CLOCKS_PER_SEC<<"s\n"
<<flush;
}
fileList.clear();
return true;
}
}
/****************************************************************************
* Project Name : NiuTrans Server Decoder
* File Name : basic_method.h
* Author : Wang Qiang
* Email : wangqiang@zjyatuo.com
* Create Time : 2016/1/15 11:11:35
* Copyright : Copyright (c) 2016 Shenyang YaTrans Network Technology Co., Ltd. All Rights Reserved.
*
* basic toolkit
*
****************************************************************************/
#ifndef DECODER_BASIC_METHOD_H_
#define DECODER_BASIC_METHOD_H_
#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <sstream>
#include <set>
#include <cstdio>
#include <cstdlib>
#include <cctype>
#include <ctime>
using namespace std;
namespace basic_method
{
class SystemCommand
{
public:
string sort_file_;
string delete_command_;
public:
SystemCommand(string &newSortFile, string &newDel) : sort_file_(newSortFile), delete_command_(newDel) {}
};
class BasicMethod
{
public:
typedef string::size_type STRPOS;
public:
bool Split(const string &phraseTable, const char &splitchar, vector< string > &dest);
public:
bool SplitWithStr(const string &src, const string &separator, vector< string > &dest);
public:
bool deleteFileList(vector< string > &fileList, SystemCommand &systemCommand);
public:
bool Replace_String(string & original , const string & source_str , const string & target_str);
int Get_Word_Count(string & input , char sep);
public:
string size_tToString(size_t &source);
string intToString(int &source);
string ConvertCharToString(char &input_char);
public:
bool ClearIllegalChar(string &str);
public:
bool toUpper(string &str);
bool ToLower(string &str);
public:
bool RmEndSpace(string &str);
bool RmStartSpace(string &str);
public:
bool RemoveExtraSpace(string &input_string, string &output_string);
};
}
#endif
/*
* $Id:
* 0033
*
* $File:
* detokenizer.h
*
* $Proj:
* Detokenizer for Statistical Machine Translation
*
* $Func:
* detokenizer
*
* $Version:
* 0.0.1
*
* $Created by:
* Qiang Li
*
* $Email
* liqiangneu@gmail.com
*
* $Last Modified by:
* 2013-03-17,20:16
*/
#ifndef DECODER_DETOKENIZER_H_
#define DECODER_DETOKENIZER_H_
#include <iostream>
#include <iomanip>
#include <map>
#include <utility>
#include <string>
#include <cctype>
#include <ctime>
#include "basic_method.h"
#ifndef WIN32
#include <sys/time.h>
#endif
using namespace std;
using namespace basic_method;
namespace decoder_detokenizer
{
class PunctuationMap: public BasicMethod{
public:
map< string, pair< string, string > > punctuation_dictionary_;
// English punctuation set, just for e2c translation
set< string > punctuation_set;
public:
bool LoadPunctuation( string &punctuation_file );
};
class Detokenizer: public BasicMethod {
public:
string language_;
string input_file_;
string output_file_;
public:
Detokenizer(){}
~Detokenizer(){}
public:
// for offline service
bool DetokenizerEn( map< string, string > &parameters );
bool DetokenizerZh( map< string, string > &parameters, PunctuationMap &punctuation_map );
public:
// for online service
bool DetokenizerEn( string &input_sentence, string &output_sentence );
bool DetokenizerZh( PunctuationMap &punctuation_map, string &input_sentence, string &output_sentence );
private:
bool ReplaceSpecChars( string &str );
bool DetokenEnStart( string &str );
bool DetokenZhStart( PunctuationMap &punctuation_map, string &str );
bool isAbbreviation( string &str );
bool isAlphaAndNumber( char character );
bool isLeftDelimiter( string &str );
bool isRightDelimiter( string &str );
bool isQuotMarks( string &str );
bool isHyphen( string &str );
bool isSpaces( string &str );
bool isDefinedMark( string &str );
bool CheckFilesInConf( map< string, string > &param );
bool CheckFileInConf( map< string, string > &param, string &fileKey );
bool PrintConfig();
public:
static bool PrintDetokenizerLogo();
};
}
#endif
/*
* $Id:
* 0004
*
* $File:
* interface.cpp
*
* $Proj:
* DetokenLib for Statistical Machine Translation
*
* $Func:
* interface
*
* $Version:
* 0.0.1
*
* $Created by:
* Qiang Li
*
* $Email
* liqiangneu@gmail.com
*
* $Last Modified by:
* 2014-01-12,20:45,
* 2014-01-10,13:14,
*/
#include "interface.h"
using namespace detoken_interface;
bool ParseParameterInConfig( string config, map< string, string > &param );
bool CheckFileInConf( map< string, string > &param, string &fileKey );
bool PrintDetokenLogo();
/*
* $Name: __Init
* $Function: Init for Detokenizer
* $Date: 2014-10-10
*/
void* __init(const char* config)
{
PrintDetokenLogo();
cerr<<"Parameters:\n"<<flush;
cerr<<" config_file_ : "<<config<<"\n"<<flush;
map< string, string > parameters_for_config;
ParseParameterInConfig( config, parameters_for_config );
DetokenInterface* interf = new DetokenInterface();
#ifdef SUPPORT_ONLINE_SERVICE_EC_
string key = "Punct-Mapping-Dict";
CheckFileInConf(parameters_for_config, key);
interf->punct_mapping_dict_file_name_ = parameters_for_config[key];
#endif
if( parameters_for_config.find( "system-log" ) == parameters_for_config.end() )
{
cerr<<"[Error] Please add parameter 'system-log' in your config file.\n"<<flush;
exit( 1 );
}
interf->system_log_file_name_ = parameters_for_config[ "system-log" ];
interf->system_log_.open(interf->system_log_file_name_.c_str(), ios::app);
if (!interf->system_log_)
{
cerr<<"ERROR: Please check the log path of \""<<interf->system_log_file_name_<<"\".\n"<<flush;
exit( 1 );
}
#ifdef SUPPORT_ONLINE_SERVICE_EC_
interf->punctuation_map_.LoadPunctuation(parameters_for_config["Punct-Mapping-Dict"]);
#endif
return ( void* ) interf;
}
/*
* $Name: __reload
* $Function: reload model for the detokenizer
* $Date: 2014-01-12
*/
void __reload ( void* class_handle ) {
cerr<<"Reload Detokn...\n"<<flush;
#ifdef SUPPORT_ONLINE_SERVICE_EC_
DetokenInterface* interf = (DetokenInterface*) class_handle;
interf->punctuation_map_.punctuation_dictionary_.clear();
interf->punctuation_map_.LoadPunctuation(interf->punct_mapping_dict_file_name_);
#endif
return;
}
/*
* $Name: __do_job
* $Function: do job for the detokenizer
* $Date: 2014-01-12
*/
//char* __do_job( void* class_handle, const char* msg_text, int print_log, const char* log_head )
char* __do_job(void* class_handle, const char* msg_text, const char* decoder_input, int sent_init, int print_log, const char* log_head)
{
#ifndef WIN32
timeval start_time, end_time;
gettimeofday( &start_time, NULL );
clock_t start_time_clock = clock();
clock_t end_time_clock = 0;
#else
clock_t start_time = clock();
clock_t end_time = 0;
#endif
cerr<<"Detokenizer...";
DetokenInterface* interf = (DetokenInterface*) class_handle;
string sentence(msg_text);
string final_translation_result;
#ifdef SUPPORT_ONLINE_SERVICE_CE_
Detokenizer detokenizer_handle;
detokenizer_handle.DetokenizerEn(sentence, final_translation_result);
#elif defined SUPPORT_ONLINE_SERVICE_EC_
Detokenizer detokenizer_handle;
detokenizer_handle.DetokenizerZh(interf->punctuation_map_, sentence, final_translation_result);
#endif
#ifdef WIN32
char* msg_res = new char[ final_translation_result.size() + 1 ];
strcpy_s( msg_res, final_translation_result.size() + 1, final_translation_result.c_str() );
#else
char* msg_res = new char[ final_translation_result.size() + 1 ];
strncpy( msg_res, final_translation_result.c_str(), final_translation_result.size() + 1 );
#endif
#ifndef WIN32
gettimeofday( &end_time, NULL );
double time = ( (double)( end_time.tv_sec - start_time.tv_sec ) * 1000000 + (double)(end_time.tv_usec - start_time.tv_usec) ) / 1000000;
end_time_clock = clock();
double time_clock = ( double )( end_time_clock - start_time_clock )/CLOCKS_PER_SEC;
#else
end_time = clock();
double time = ( double )( end_time - start_time )/CLOCKS_PER_SEC;
#endif
cerr<<"Done!\n"
<<"[INPUT ] "<<sentence<<"\n";
cerr<<"[DETOKEN] "<<final_translation_result<<"\n";
cerr<<"[time="<<time<<"s speed="<<1.000/time<<"sent/s] \n\n";
interf->system_log_<<"[INPUT ] "<<sentence<<"\n"
<<"[DETOKEN] "<<final_translation_result<<"\n";
interf->system_log_<<"[time="<<time<<"s speed="<<1.000/time<<"sent/s] \n\n";
return msg_res;
}
/*
* $Name: __destroy
* $Function: destroy the detokenizer
* $Date: 2014-01-12
*/
void __destroy( void* class_handle ) {
DetokenInterface* interf = (DetokenInterface*) class_handle;
interf->system_log_.clear();
interf->system_log_.close();
delete interf;
}
/*
* $Name: ParseParameterInConfig
* $Funtion:
* $Date: 2013-05-13
*/
bool ParseParameterInConfig( string config, map< string, string > &param ) {
ifstream inputConfigFile( config.c_str() );
if ( !inputConfigFile ) {
cerr<<"ERROR: Config File does not exist, exit!\n"<<flush;
exit( 1 );
}
string lineOfConfigFile;
while ( getline( inputConfigFile, lineOfConfigFile ) ) {
BasicMethod bm;
bm.ClearIllegalChar( lineOfConfigFile );
bm.RmStartSpace ( lineOfConfigFile );
bm.RmEndSpace ( lineOfConfigFile );
if( lineOfConfigFile == "" || *lineOfConfigFile.begin() == '#' ) {
continue;
} else if ( lineOfConfigFile.find( "param=\"" ) == lineOfConfigFile.npos
|| lineOfConfigFile.find( "value=\"" ) == lineOfConfigFile.npos ) {
continue;
} else {
string::size_type pos = lineOfConfigFile.find( "param=\"" );
pos += 7;
string key;
for ( ; lineOfConfigFile[ pos ] != '\"' && pos < lineOfConfigFile.length(); ++pos ) {
key += lineOfConfigFile[ pos ];
}
if ( lineOfConfigFile[ pos ] != '\"' ) {
continue;
}
pos = lineOfConfigFile.find( "value=\"" );
pos += 7;
string value;
for ( ; lineOfConfigFile[ pos ] != '\"' && pos < lineOfConfigFile.length(); ++pos ) {
value += lineOfConfigFile[ pos ];
}
if ( lineOfConfigFile[ pos ] != '\"' ) {
continue;
}
if ( param.find( key ) == param.end() ) {
param.insert( make_pair( key, value ) );
} else {
param[ key ] = value;
}
}
}
return true;
}
/*
* $Name: CheckFileInConf
* $Function:
* $Date: 2013-05-13
*/
bool CheckFileInConf( map< string, string > &param, string &fileKey ) {
if( param.find( fileKey ) != param.end() ) {
ifstream inFile( param[ fileKey ].c_str() );
if ( !inFile ) {
cerr<<"ERROR: Please check the path of \""<<fileKey<<"\".\n"<<flush;
exit( 1 );
}
inFile.clear();
inFile.close();
} else {
cerr<<"ERROR: Please add parameter \""<<fileKey<<"\" in your config file.\n"<<flush;
exit( 1 );
}
return true;
}
/*
* $Name:
* $Funtion:
* $Date:
*/
bool PrintDetokenLogo() {
cerr<<"####### SMT ####### SMT ####### SMT ####### SMT ####### SMT #######\n"
<<"# Detokenizer #\n"
<<"# Version 0.0.1 #\n"
<<"# NEUNLPLab/YAYI corp #\n"
<<"# liqiangneu@gmail.com #\n"
<<"####### SMT ####### SMT ####### SMT ####### SMT ####### SMT #######\n"
<<flush;
return true;
}
/*
* $Id:
* 0003
*
* $File:
* interface.h
*
* $Proj:
* DetokenLib for Statistical Machine Translation
*
* $Func:
* header file of interface
*
* $Version:
* 0.0.1
*
* $Created by:
* Qiang Li
*
* $Email
* liqiangneu@gmail.com
*
* $Last Modified by:
* 2014-01-10,13:14
*/
#ifndef DETOKENLIB_INTERFACE_H_
#define DETOKENLIB_INTERFACE_H_
#include <iostream>
#include <string>
#include <cstring>
#include "detokenizer.h"
using namespace std;
using namespace decoder_detokenizer;
//#define SUPPORT_ONLINE_SERVICE_CE_
#define SUPPORT_ONLINE_SERVICE_EC_
namespace detoken_interface {
class DetokenInterface {
public:
DetokenInterface(){}
~DetokenInterface(){}
#ifdef SUPPORT_ONLINE_SERVICE_EC_
public:
PunctuationMap punctuation_map_;
#endif
public:
string punct_mapping_dict_file_name_;
public:
string system_log_file_name_;
ofstream system_log_;
};
}
#ifdef WIN32
#define DLLEXPORT __declspec(dllexport)
#else
#define DLLEXPORT
#endif
extern "C" DLLEXPORT void* __init ( const char* config );
//extern "C" DLLEXPORT char* __do_job ( void* class_handle, const char* msg_text, int print_log, const char* log_head );
extern "C" DLLEXPORT char* __do_job(void* class_handle, const char* msg_text, const char* decoder_input, int sent_init, int print_log, const char* log_head);
extern "C" DLLEXPORT void __reload ( void* class_handle );
extern "C" DLLEXPORT void __destroy( void* class_handle );
#endif
/*
* $Id:
* 0002
*
* $File:
* main.cpp
*
* $Proj:
* RecaserLib for Statistical Machine Translation
*
* $Func:
* main function
*
* $Version:
* 0.0.1
*
* $Created by:
* Qiang Li
*
* $Email
* liqiangneu@gmail.com
*
* $Last Modified by:
* 2014-01-10,13:14
*/
#include "main.h"
int main( int argc, char * argv[] ) {
if( argc < 4 ) {
cerr<<"[USAGE] EXE CONFIG TEST OUTPUT\n"<<flush;
exit( 1 );
}
string config( argv[ 1 ] );
void* handle = ( void* )__init( config.c_str() );
cerr<<argv[ 2 ]<<"\n"<<flush;
ifstream infile( argv[ 2 ] );
if ( !infile ) {
cerr<<"Can not open file "<<argv[ 1 ]<<"\n"<<flush;
exit( 1 );
}
FILE *outfile = fopen(argv[3] , "w");
string sentence;
int lineNo = 0;
while ( getline( infile, sentence ) ) {
++lineNo;
if( lineNo % 10000 == 0 )
{
fprintf(stderr,"\r\tprocessed %d lines." , lineNo );
}
char * output = __do_job(handle, sentence.c_str(), sentence.c_str(), 0, 1, "");
fprintf( outfile , "%s\n" , output );
delete []output;
}
infile.clear();
infile.close();
fclose(outfile);
__destroy( handle );
return 0;
}
/*
* $Id:
* 0001
*
* $File:
* main.h
*
* $Proj:
* DetokenLib for Statistical Machine Translation
*
* $Func:
* header file of main function
*
* $Version:
* 0.0.1
*
* $Created by:
* Qiang Li
*
* $Email
* liqiangneu@gmail.com
*
* $Last Modified by:
* 2014-10-10,12:34
*/
#ifndef DETOKENLIB_MAIN_H_
#define DETOKENLIB_MAIN_H_
#include <iostream>
#include "interface.h"
#include "detokenizer.h"
using namespace std;
namespace main_func
{
}
#endif
sofile:
g++ -O2 -o ../lib/libPostProcessing.so *.cpp -fPIC -shared
exe:
g++ -O2 -o ../bin/Detoken *.cpp
我存在得意义是为了提交空文件夹。
如有你需要占用这个文件夹就把我替换掉吧,否则请无视我吧。
要是你还觉得文件夹不够用,那就自己去新建吧~
我存在得意义是为了提交空文件夹。
如有你需要占用这个文件夹就把我替换掉吧,否则请无视我吧。
要是你还觉得文件夹不够用,那就自己去新建吧~
## bpe file path
SRC_TRAIN_PATH=../../zh.dev.bpe
TGT_TRAIN_PATH=../../en.dev.bpe
SRC_DEV_PATH=../../zh.dev.bpe
TGT_DEV_PATH=../../en.dev.bpe
## model saved path
TRAINDATA_PATH=../../0813
##
python3 fine_tuning.py \
--src_trainfile_path=$SRC_TRAIN_PATH \
--tgt_trainfile_path=$TGT_TRAIN_PATH \
--src_devfile_path=$SRC_DEV_PATH \
--tgt_devfile_paths=$TGT_DEV_PATH \
--src_dic_path=../../zh2en_final/source_dic \
--tgt_dic_path=../../zh2en_final/target_dic \
--final_traindata_path=$TRAINDATA_PATH
#usage: model_dir model_num out_dir
import argparse
import os
import re
import tensorflow as tf
import numpy as np
import six
tf.logging.set_verbosity(tf.logging.INFO)
parser = argparse.ArgumentParser()
parser.add_argument('-model_dir', required=True, type=str, help='saved models path')
parser.add_argument('-model_num', required=True, type=int, help='ensembled model numbers, we use the last models')
parser.add_argument('-out_dir', required=True, type=str, help='output ensembled model path, do not set same as model_dir')
args = parser.parse_args()
assert os.path.exists(args.model_dir), 'check model dir!'
assert args.out_dir != args.model_dir, 'do not set model_dir == output_dir'
root_dir, dir_names, file_names = list(os.walk(args.model_dir))[0]
index_list = []
for file in file_names:
match = re.findall(r'model\.ckpt-(\d+)\.index', file)
if len(match) == 1:
index_list += match
# sort indices in descending order so the most recent checkpoint comes first
index_list = [int(i) for i in index_list]
index_list = sorted(index_list, reverse=True)
print('total find %d model index'%len(index_list))
print(index_list)
model_num = args.model_num
if args.model_num > len(index_list):
print('warning: you set model_num=%d, however only %d files are detected. so reset model_num=%d'%(args.model_num, len(index_list), len(index_list)))
model_num = len(index_list)
# get ensembled model index
index_list = index_list[:model_num]
print('using following index-model')
print(index_list)
if not os.path.exists(args.out_dir):
os.mkdir(args.out_dir)
"""
extract model parameters
"""
tf.logging.info("Reading variables and averaging checkpoints:")
checkpoints = [os.path.join(args.model_dir, 'model.ckpt-{}'.format(index)) for index in index_list]
for c in checkpoints:
tf.logging.info("%s ", c)
var_list = tf.contrib.framework.list_variables(checkpoints[0])
var_values, var_dtypes = {}, {}
for (name, shape) in var_list:
if not name.startswith("global_step"):
var_values[name] = np.zeros(shape)
for checkpoint in checkpoints:
reader = tf.contrib.framework.load_checkpoint(checkpoint)
for name in var_values:
tensor = reader.get_tensor(name)
var_dtypes[name] = tensor.dtype
var_values[name] += tensor
tf.logging.info("Read from checkpoint %s", checkpoint)
for name in var_values: # Average.
var_values[name] /= len(checkpoints)
tf_vars = [
    tf.get_variable(v, shape=var_values[v].shape, dtype=var_dtypes[v])
    for v in var_values
]
placeholders = [tf.placeholder(v.dtype, shape=v.shape) for v in tf_vars]
assign_ops = [tf.assign(v, p) for (v, p) in zip(tf_vars, placeholders)]
global_step = tf.Variable(
0, name="global_step", trainable=False, dtype=tf.int64)
saver = tf.train.Saver(tf.all_variables())
# Build a model consisting only of variables, set them to the average values.
with tf.Session() as sess:
sess.run(tf.initialize_all_variables())
for p, assign_op, (name, value) in zip(placeholders, assign_ops,
six.iteritems(var_values)):
sess.run(assign_op, {p: value})
# Use the built saver to save the averaged checkpoint.
saver.save(sess, os.path.join(args.out_dir, 'ensemble_%d'%model_num) , global_step=global_step)
tf.logging.info("Averaged checkpoints saved in %s", args.out_dir)
#!/usr/bin/perl -w
##################################################################################
#
# NiuTrans - SMT platform
# Copyright (C) 2011, NEU-NLPLab (http://www.nlplab.com/). All rights reserved.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public
# License as published by the Free Software Foundation; either
# version 2 of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public
# License along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
#
##################################################################################
#######################################
# version : 1.1.0
# Function : detokenizer
# Author : Qiang Li
# Email : liqiangneu@gmail.com
# Date : 08/06/2012
# Last Modified:
#######################################
use strict;
use Encode;
use utf8;
my $logo = "########### SCRIPT ########### SCRIPT ############ SCRIPT ##########\n".
"# #\n".
"# NiuTrans detokenizer (version 1.1.0) --www.nlplab.com #\n".
"# #\n".
"########### SCRIPT ########### SCRIPT ############ SCRIPT ##########\n";
print STDERR $logo;
my %param;
getParameter( @ARGV );
detokenize();
sub detokenize
{
open( INFILE, "<", $param{ "-in" } ) or die "Error: can not open file $param{ \"-in\" }.\n";
open( OUTPUT, ">", $param{ "-out" } ) or die "Error: can not open file $param{ \"-out\" }.\n";
my $sentNo = 0;
my $inputFileSent;
while( $inputFileSent = <INFILE> )
{
++$sentNo;
$inputFileSent =~ s/[\r\n]//g;
if( $inputFileSent =~ /^<.+>$/ || $inputFileSent =~ /^\s*$/ )
{
print OUTPUT $inputFileSent."\n";
}
else
{
my $detokenizeRes = startDetokenize( $inputFileSent );
print OUTPUT $detokenizeRes."\n";
}
print STDERR "\rProcessed $sentNo lines." if( $sentNo % 100 == 0 );
}
print STDERR "\rProcessed $sentNo lines.\n";
close( INFILE );
close( OUTPUT );
}
sub startDetokenize
{
my $sentence = $_[ 0 ];
$sentence =~ s/ \@\-\@ /-/g; # de-escape special chars
$sentence =~ s/\&bar;/\|/g; # factor separator
$sentence =~ s/\&lt;/\</g; # xml
$sentence =~ s/\&gt;/\>/g; # xml
$sentence =~ s/\&bra;/\[/g; # syntax non-terminal (legacy)
$sentence =~ s/\&ket;/\]/g; # syntax non-terminal (legacy)
$sentence =~ s/\&quot;/\"/g; # xml
$sentence =~ s/\&apos;/\'/g; # xml
$sentence =~ s/\&#91;/\[/g; # syntax non-terminal
$sentence =~ s/\&#93;/\]/g; # syntax non-terminal
$sentence =~ s/\&amp;/\&/g; # escape escape
my @words = split / +/,$sentence;
my $sentenceDetoken = "";
my %quoteCount = ( "\'" => 0, "\"" => 0 );
my $connector = " ";
my $wordCnt = 0;
my $preWord = "";
foreach my $word ( @words )
{
if( $word =~ /^[\p{IsSc}\(\[\{]+$/ )
{
$sentenceDetoken = $sentenceDetoken.$connector.$word;
$connector = "";
}
elsif( $word =~ /^[\,\.\?\!\:\;\\\%\}\]\)]+$/ )
{
$sentenceDetoken = $sentenceDetoken.$word;
$connector = " ";
}
elsif( ( $wordCnt > 0 ) && ( $word =~ /^[\'][\p{IsAlpha}]/ ) && ( $preWord =~ /[\p{IsAlnum}]$/ ) )
{
$sentenceDetoken = $sentenceDetoken.$word;
$connector = " ";
}
elsif( $word =~ /^[\'\"]+$/ )
{
if( !exists $quoteCount{ $word } )
{
$quoteCount{ $word } = 0;
}
if( ( $quoteCount{ $word } % 2 ) eq 0 )
{
if( ( $word eq "'" ) && ( $wordCnt > 0 ) && ( $preWord =~ /[s]$/ ) )
{
$sentenceDetoken = $sentenceDetoken.$word;
$connector = " ";
}
else
{
$sentenceDetoken = $sentenceDetoken.$connector.$word;
$connector = "";
++$quoteCount{ $word };
}
}
else
{
$sentenceDetoken = $sentenceDetoken.$word;
$connector = " ";
++$quoteCount{ $word };
}
}
else
{
$sentenceDetoken = $sentenceDetoken.$connector.$word;
$connector = " ";
}
$preWord = $word;
++$wordCnt;
}
$sentenceDetoken =~ s/ +/ /g;
$sentenceDetoken =~ s/^ //g;
$sentenceDetoken =~ s/ $//g;
$sentenceDetoken =~ s/^([[:punct:]\s]*)([[:alpha:]])(.*)$/$1\U$2\E$3/ if( $param{ "-upcase" } eq 1);
return $sentenceDetoken;
}
sub getParameter
{
if( ( scalar( @_ ) < 4 ) || ( scalar( @_ ) % 2 != 0 ) )
{
print STDERR "[USAGE]\n".
" NiuTrans-detokenizer.pl [OPTIONS]\n".
"[OPTION]\n".
" -in : Input File.\n".
" -out : Output File.\n".
" -upcase : Uppercase the first char [optional]\n".
" Default value is 1.\n".
"[EXAMPLE]\n".
" perl NiuTrans-detokenizer.pl [-in FILE]\n".
" [-out FILE]\n";
exit( 0 );
}
my $pos;
for( $pos = 0; $pos < scalar( @_ ); ++$pos )
{
my $key = $ARGV[ $pos ];
++$pos;
my $value = $ARGV[ $pos ];
$param{ $key } = $value;
}
if( !exists $param{ "-in" } )
{
print STDERR "Error: please assign \"-in\"!\n";
exit( 1 );
}
if( !exists $param{ "-out" } )
{
print STDERR "Error: please assign \"-out\"!\n";
exit( 1 );
}
if( !exists $param{ "-upcase" } )
{
$param{ "-upcase" } = 1;
}
elsif( $param{ "-upcase" } ne 1 )
{
$param{ "-upcase" } = 0;
}
}
#!/usr/bin/python
#coding: utf-8
__author__ = "Summer Rain"
from sys import argv
from time import time
from os import path, system, listdir
program = "mteval_sbp.linux"
include_path = "."
src_path = "src"
lib_path = "."
compile_flag = "-O3"
link_flag = ""
lib_flag = ""
if __name__ == "__main__":
files = [fe for fe in listdir(src_path) if fe.endswith(".cpp") or fe.endswith(".cc")]
ofes = []
for sfe in files:
#print "compiling %s ..." %(sfe)
print "%s %s %s" %("-" * 20, sfe, "-" * 20)
ofe = "." + sfe.replace(".cpp", ".o").replace(".cc", ".o")
time1 = path.getmtime(src_path + "/" + sfe)
if path.isfile(ofe):
time2 = path.getmtime(ofe)
else:
time2 = time1 - 1
if time1 > time2 or len(argv) == 2 and argv[1] == "clean":
cmd = "g++ -c %s/%s %s -o %s -I%s" %(src_path, sfe, compile_flag, ofe, include_path)
print cmd
if system(cmd) != 0:
exit(1)
else:
print "%s is the newest" %(ofe)
ofes.append(ofe)
if files != []:
print "-" * 40
cmd = "g++ %s -o %s %s %s -L%s" %(" ".join(ofes), program, link_flag, lib_flag, lib_path)
print cmd
system(cmd)
# -*- coding: utf-8 -*-
import codecs, re, sys
import argparse
parser = argparse.ArgumentParser(description='Convert translation results to the xml format that the official eval tools require.')
parser.add_argument('--src_testfile_path', required=True, help="the path of source test file.")
parser.add_argument('--refs_testfile_path', required=True, help="the paths of the reference files; separate multiple files with spaces or commas.")
parser.add_argument('--tst_testfile_path', required=True, help="the path of translation of source file.")
parser.add_argument('--output_path', required=True, help="the path of output.")
parser.add_argument('--srclang', required=True)
parser.add_argument('--tgtlang', required=True)
args = parser.parse_args()
src_testfile_path=args.src_testfile_path
refs_testfile_path=args.refs_testfile_path
tst_testfile_path=args.tst_testfile_path
output_path=args.output_path
srclang=args.srclang
tgtlang=args.tgtlang
def get_system_info(organization, system_identify, system_description_info):
system_label = []
system_label.append("<system site=\"" + organization + "\""+ " " + "sysid=\"" + system_identify + "\">")
for info in system_description_info:
system_label.append(info.strip())
system_label.append("</system>")
return system_label
def get_firstLine_info():
return ["<?xml version=\"1.0\" encoding=\"UTF-8\"?>"]
def get_secondLine_info(setclass, setid, srclang, tgtlang):
return ["<" + setclass + " setid=\"" + setid + "\"" + " " + "srclang=\"" + srclang + "\"" + " " + "trglang=\"" + tgtlang +"\">"]
def get_tstTail_info():
return ["</tstset>"]
def get_refTail_info():
return ["</refset>"]
def XMLformat(line):
    # escape xml special characters; the cursor skips past each inserted
    # entity so the replacement text is not re-scanned
    special_char = ["&", "<", ">", "\"", "'"]
    replace_char = ["&amp;", "&lt;", "&gt;", "&quot;", "&apos;"]
    count = 0
    while count < len(line):
        c = line[count]
        for cc_spec in range(len(special_char)):
            if c == special_char[cc_spec]:
                line = line[:count] + replace_char[cc_spec] + line[(count + 1):]
                count += len(replace_char[cc_spec]) - 1
                break
        count += 1
    return line
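# e.g. XMLformat('A & B < C') -> 'A &amp; B &lt; C'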
def handle_src_file(src_file, srclang, tgtlang, out_path):
    # the xml content this function returns
src_xml_content = []
    # header
src_xml_content.append(get_firstLine_info()[0])
src_xml_content.append(get_secondLine_info(setclass="srcset", setid= srclang + "_" + tgtlang + "_news_trans", srclang=srclang, tgtlang=tgtlang)[0])
    # body
src_xml_content.append("<DOC docid=" + "\"news\">")
count = 0
src_file_handle = codecs.open(src_file, "r", "utf_8_sig")
for src_line in src_file_handle.read().strip().split("\n"):
count += 1
src_xml_content.append("<seg id=" + "\"" + str(count) + "\">" + XMLformat(src_line.strip()) + "</seg>")
src_xml_content.append("</DOC>")
src_file_handle.close()
    # tail
src_xml_content.append("</srcset>")
    # write out
src_file_out = codecs.open(out_path + "src.txt.xml", "w", "utf_8_sig")
for src_line in src_xml_content:
src_file_out.write(src_line + "\n")
src_file_out.close()
return src_xml_content
def get_src_content(src_file, src_lang, tgt_lang, out_path, occasion):
src_xml_content = []
if occasion == True:
src_xml_content = handle_src_file(src_file, src_lang, tgt_lang, out_path)
else:
src_file_handle = codecs.open(src_file, "r", "utf_8_sig")
for line in src_file_handle.read().strip().split("\n"):
src_xml_content.append(line)
src_file_handle.close()
return src_xml_content[2:][:-1]
def main(src_file, refs_file, tst_file, out_path, src_lang="zh", tgt_lang="en", occasion=True):
    """
    :param src_file: path of the raw source test file
    :param refs_file: list of reference file paths
    :param tst_file: path of the system translation of the source file
    :param out_path: directory to write the xml files into
    :param src_lang: source language id
    :param tgt_lang: target language id
    :param occasion: True to first convert src_file itself into src.txt.xml
    :return:
    """
    # convert src_file into the xml format, or just read its existing xml content
src_xml_content = get_src_content(src_file, src_lang, tgt_lang, out_path, occasion)
    # walk src_xml_content and convert refs_file into the matching xml format
refs_xml_content = []
    transorg = 1  # 1 means data of the first reference, and so on
for ref_file in refs_file:
ref_site = "\"transorg" + str(transorg) + "\""
ref_file_handle = codecs.open(ref_file, "r", "utf_8_sig")
ref_lines = ref_file_handle.read().strip().split("\n")
ref_line_count, count = 0, 0
ref_line = ""
while count < len(src_xml_content):
# src xml line
src_line = src_xml_content[count]
# ref raw line
if ref_line_count < len(ref_lines):
ref_line = ref_lines[ref_line_count]
            # follow the structure of the src xml line
if "<DOC" in src_line:
# <DOC docid="news" site="transorg1">
refs_xml_content.append(src_line.strip()[:-1] + " " + "site=" + ref_site + ">")
elif "<p>" in src_line or "</p>" in src_line:
                # <p> and </p> stand on their own lines
refs_xml_content.append(src_line)
elif "</DOC>" in src_line:
                # </DOC> stands on its own line
refs_xml_content.append(src_line)
elif "<seg" in src_line:
                rs = re.match(r"<seg id=\"(.*?)\">", src_line)  # extract the id number inside <seg id="...">
if rs:
# rs.group(0) = <seg id=\"(.*?)\">
line = rs.group(0) + XMLformat(ref_line.strip()) + "</seg>"
refs_xml_content.append(line)
ref_line_count += 1
count += 1
transorg += 1
ref_file_handle.close()
# tail
refs_out_contents = get_firstLine_info() + get_secondLine_info("refset", src_lang + "_" + tgt_lang + "_news" , src_lang, tgt_lang) + \
refs_xml_content + get_refTail_info()
    # walk src_xml_content and convert the translation result to xml; note only one translation is supported for now
tst_file_handle = codecs.open(tst_file, "r", "utf_8_sig")
tst_lines = tst_file_handle.read().strip().split("\n")
transed_xml_content = []
count, tst_line_no = 0, 0
transed_line = ""
while count < len(src_xml_content):
src_line = src_xml_content[count]
if tst_line_no < len(tst_lines):
transed_line = tst_lines[tst_line_no]
if "<DOC" in src_line:
            # <DOC docid="doc name" sysid="system id">
transed_xml_content.append(src_line[:-1] + " sysid=\"" + src_lang + "_" + tgt_lang + "_trans" + "\">")
elif "<p>" in src_line or "</p>" in src_line:
transed_xml_content.append(src_line)
elif "</DOC" in src_line:
transed_xml_content.append(src_line)
elif "<seg" in src_line:
rs = re.match(r"<seg id=\"(.*?)\">", src_line)
if rs:
line = rs.group(0) + XMLformat(transed_line.strip()) + "</seg>"
transed_xml_content.append(line)
tst_line_no += 1
count += 1
tst_file_handle.close()
    # <system site="organization" sysid="system id"> tail
    transed_out_contents = get_firstLine_info() + get_secondLine_info("tstset", src_lang + "_" + tgt_lang + "_news", src_lang, tgt_lang) + \
                           get_system_info("Niu", src_lang + "_" + tgt_lang + "_trans", ["system description"]) + transed_xml_content + get_tstTail_info()
    ## save
ref_out_file = codecs.open(out_path + "ref.txt.xml", "w", "utf_8_sig")
max_size = len(refs_out_contents)
count = 0
while count < max_size:
out_line = refs_out_contents[count]
ref_out_file.write(out_line)
if count != (max_size - 1):
ref_out_file.write("\n")
count += 1
transed_out_file = codecs.open(out_path + "tst.txt.xml", "w", "utf_8_sig")
max_size = len(transed_out_contents)
count = 0
while count < max_size:
out_line = transed_out_contents[count]
transed_out_file.write(out_line)
if count != (max_size - 1):
transed_out_file.write("\n")
count += 1
ref_out_file.close()
transed_out_file.close()
# main("./eval/ensemble/mt12-wb/input.token", ["./eval/ensemble/mt12-wb/ref0","./eval/ensemble/mt12-wb/ref1", "./eval/ensemble/mt12-wb/ref2", "./eval/ensemble/mt12-wb/ref3"], "./eval/ensemble\mt12-wb/mt12-wb.ensemble", "./eval/ensemble\mt12-wb/")
print ("refs: " + refs_testfile_path)
refs_list = []
for line in refs_testfile_path.strip().split():
refs_list.append(line.strip())
if len(refs_list) == 1 and len(refs_testfile_path.strip().split(",")) > 1:
refs_list = []
for line in refs_testfile_path.strip().split(","):
refs_list.append(line.strip())
main(src_testfile_path, refs_list, tst_testfile_path, output_path, srclang, tgtlang)
print("xml format file has created.")
src_file=./input.token
refs_file=./ref0
tst_file=./mt06.ensemble
output_path=./
srclang=zh
tgtlang=en
python3 mteval_sbp.py --src_testfile_path=$src_file \
--refs_testfile_path=$refs_file \
--tst_testfile_path=$tst_file \
--output_path=$output_path \
--srclang=$srclang \
--tgtlang=$tgtlang
./Tool/mteval_sbp.linux -c -r ${output_path}"ref.txt.xml" -s ${output_path}"src.txt.xml" -t ${output_path}"tst.txt.xml" > ${output_path}"1best_result"
echo "eval result has been created."
#pragma once
#pragma warning(disable:4503)
#pragma warning(disable:4786)
#include "xmlfunc.h"
#include<string>
#include<vector>
#include<map>
using namespace std;
const int max_Ngram=9;
const int NIST_ORDER=5;
const int BLEU_ORDER=4;
typedef struct {
double cum;
double ind;
} _cum_ind;
typedef vector<_cum_ind> nscore_struct;
typedef basic_string<int> SENT;
typedef map<string,int> VOCAB;
typedef map<SENT,double> GRAMMAP;
typedef map<int,SENT> SEGMAP;
typedef map<string,SEGMAP> DOCMAP;
typedef map<string,DOCMAP> SITEMAP;
typedef pair<double,int> SCORE;
typedef map<int,SCORE> SEGSCORE; // segid, score
typedef map<string,pair<SCORE,SEGSCORE> > DOCSCORE; // docid,score
typedef map<string,pair<SCORE,DOCSCORE> > SITESCORE; // site,score
typedef map<string,double> SCOREMAP;
typedef map<string,nscore_struct> NSCOREMAP;
tstring get_ref_data(const string & setid, SITEMAP & docs, VOCAB & voc, int preserve_case, const tstring & fn);
tstring get_tst_data(const string & setid, SITEMAP & docs, const VOCAB & voc, int preserve_case, const tstring & fn);
tstring get_source_info(DOCMAP & srcs, const tstring & fn);
tstring sgm_get_ref_data(const string & setid, SITEMAP & docs, VOCAB & voc, int preserve_case, const tstring & fn);
tstring sgm_get_tst_data(const string & setid, SITEMAP & docs, const VOCAB & voc, int preserve_case, const tstring & fn);
tstring sgm_get_source_info(DOCMAP & srcs, const tstring & fn);
void compute_ngram_info(const SITEMAP & refs, GRAMMAP & ngram_info);
// Score NIST and BLEU
// parameter nist: non-zero for NIST, zero for BLEU
void score_system(const SITEMAP & refs, const SITEMAP & tsts, const GRAMMAP & ngram_info,const string & site, NSCOREMAP & SCOREmt,int nist, SITESCORE & score);
double mper_score_system(const SITEMAP & refs, const SITEMAP & tsts, const string & site, SITESCORE & score);
double mwer_score_system(const SITEMAP & refs, const SITEMAP & tsts, const string & site, SITESCORE & score);
double gtm_score_system(const SITEMAP & refs, const SITEMAP & tsts, const string & site, SITESCORE & score);
double ict_score_system(const SITEMAP & refs, const SITEMAP & tsts, const string & site, SITESCORE & score);
//void NormalizeText(string & s, const tstring & lang);
//void makelower(string & s);
//void setdetail(int dt);
//void setcase(int c);
typedef void (* PWORDSEGMENTER)(const string & sent, vector<string> & words, const tstring & lang);
void setwordsegmenter(PWORDSEGMENTER p);
void defaultwordsegmenter(const string & s, vector<string> & words, const tstring & lang);
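The typedef chain above encodes the scorer's central nesting: a site/system maps to documents, a document maps to segment ids, and a segment is a sentence stored as vocabulary-indexed word ids. A rough Python analogue of that layout, for intuition only (names and data hypothetical):

```
# SENT    ~ list of int           (words mapped to vocabulary ids)
# SEGMAP  ~ dict[int, SENT]       (segment id -> sentence)
# DOCMAP  ~ dict[str, SEGMAP]     (doc id -> segments)
# SITEMAP ~ dict[str, DOCMAP]     (site/system id -> documents)
vocab = {"hello": 0, "world": 1}          # VOCAB: word -> id
sitemap = {
    "Niu": {                              # site/system id
        "doc01": {                        # doc id
            1: [vocab["hello"], vocab["world"]],  # seg id -> word ids
        },
    },
}
print(sitemap)
```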
MT evaluation scorer began on 2018 Apr 18 at 17:57:59
Evaluation of en-to-zh translation using:
src set (1 docs, 1000 segs)
ref set (1 refs)
tst set (1 systems)
Scores of system:
NIST=6.9923 BLEU=0.2588 BLEU_SBP=0.2452 GTM=0.5959 mWER=0.6470 mPER=0.4637 ICT=0.2380
# ------------------------------------------------------------------------
Individual N-gram scoring
1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram 8-gram 9-gram
------ ------ ------ ------ ------ ------ ------ ------ ------
NIST: 5.4436 1.2857 0.2261 0.0303 0.0064 0.0019 0.0005 0.0000 0.0000 ""
BLEU: 0.6085 0.3340 0.2036 0.1286 0.0835 0.0552 0.0375 0.0254 0.0173 ""
# ------------------------------------------------------------------------
Cumulative N-gram scoring
1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram 8-gram 9-gram
------ ------ ------ ------ ------ ------ ------ ------ ------
NIST: 5.4436 6.7294 6.9555 6.9859 6.9923 6.9942 6.9946 6.9947 6.9947 ""
BLEU: 0.5832 0.4321 0.3315 0.2588 0.2047 0.1634 0.1316 0.1066 0.0867 ""
BLEU_SBP: 0.5525 0.4093 0.3140 0.2452 0.1939 0.1548 0.1247 0.1010 0.0821 ""
MT evaluation scorer ended on 2018 Apr 18 at 17:58:01
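As a sanity check on the log above: cumulative BLEU at order N should equal the brevity penalty times the geometric mean of the individual 1..N-gram precisions. A small sketch in Python; since this log does not print the brevity penalty, it is inferred here from the reported cumulative score.

```
import math

# Individual n-gram BLEU precisions from the table above (orders 1-4).
precisions = [0.6085, 0.3340, 0.2036, 0.1286]

# Geometric mean of the individual precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
# The log reports cumulative BLEU-4 = 0.2588, which implies the brevity penalty.
bp = 0.2588 / geo_mean
print(round(geo_mean, 4), round(bp, 3))  # ~0.2701 and ~0.958
```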
for args in "$@"
do
    detokenizedfile=$args".detoken"
    perl detoken.perl -l en < "$args" > "$detokenizedfile"
done
for args in "$@"
do
    detokenizedfile=$args".detoken"
    perl NiuTrans-detokenizer.pl -in "$args" -out "$detokenizedfile"
    # perl detoken.perl -l en < "$args" > "$detokenizedfile"
done
# coding=utf-8
import re
import sys
# Post-processing cases this script repairs: separated hyphens (" - ")
# and decimals split by tokenization ("1. 35" -> "1.35").
# (The author's notes also mention "℃ ?", which is not handled here.)
if __name__ == '__main__':
    print("python detoken_zzy.py infile outfile")
    file_in = open(sys.argv[1], "r", encoding="UTF-8")
    file_out = open(sys.argv[2], "w", encoding="UTF-8")
    count = 0
    while True:
        line = file_in.readline()
        if len(line) == 0:
            break
        # Re-attach hyphens that tokenization separated.
        line = line.replace(" - ", "-")
        line = line.replace(" -- ", "-")
        line = line.replace(" -", "-")
        line = line.replace("- ", "-")
        # Rejoin decimal numbers such as "1. 35" -> "1.35".
        dot = re.findall(r"\d\. \d", line)
        if len(dot) != 0:
            for item in dot:
                line = line.replace(item, item.replace(". ", "."))
                count += 1
        file_out.write(line)
    file_in.close()
    file_out.close()
    print("dot: " + str(count))
    print("detoken_zzy done!")
MT evaluation scorer began on 2018 Apr 17 at 14:01:04
Evaluation of en-to-zh translation using:
src set (1 docs, 1001 segs)
ref set (1 refs)
tst set (1 systems)
Scores of system:
NIST=7.2479 BLEU=0.3026 BLEU_SBP=0.2835 GTM=0.6093 mWER=0.6603 mPER=0.4596 ICT=0.2502
# ------------------------------------------------------------------------
Individual N-gram scoring
1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram 8-gram 9-gram
------ ------ ------ ------ ------ ------ ------ ------ ------
NIST: 5.5411 1.3476 0.2812 0.0625 0.0155 0.0064 0.0029 0.0012 0.0005 ""
BLEU: 0.6222 0.3820 0.2484 0.1683 0.1146 0.0786 0.0544 0.0372 0.0255 ""
# ------------------------------------------------------------------------
Cumulative N-gram scoring
1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram 8-gram 9-gram
------ ------ ------ ------ ------ ------ ------ ------ ------
NIST: 5.5411 6.8887 7.1699 7.2324 7.2479 7.2543 7.2572 7.2583 7.2588 ""
BLEU: 0.5964 0.4673 0.3732 0.3026 0.2471 0.2028 0.1670 0.1377 0.1137 ""
BLEU_SBP: 0.5587 0.4378 0.3496 0.2835 0.2315 0.1899 0.1564 0.1290 0.1065 ""
MT evaluation scorer ended on 2018 Apr 17 at 14:01:06
open in,"$ARGV[0]";
while($in=<in>)
{
chomp $in;
$in = lc($in);
print $in."\n";
}
set -e
basePath=./cwmt17-test/
src_file=$basePath"input.token"
refs_file=$basePath"ref.detoken"
tst_file=$basePath"trans"
output_path=$basePath
detoken=./detoken.sh
srclang=en
tgtlang=zh
refs_file_detoken=
IFS=',' arr=($refs_file)
for x in ${arr[@]}; do
    sh $detoken $x
    # Accumulate a comma-separated list; the trailing comma is stripped below.
    refs_file_detoken=${refs_file_detoken}${x}".detoken,"
done
refs_file_detoken=${refs_file_detoken%,*}
sh $detoken $tst_file
python3 detoken_zzy.py $tst_file".detoken" $tst_file".detoken.agv"
echo "$refs_file $tst_file detokenization finished; output is $refs_file_detoken ${tst_file}.detoken"
python3 ./Tool/mteval_sbp.py --src_testfile_path=$src_file \
    --refs_testfile_path=$refs_file \
    --tst_testfile_path=$tst_file".detoken.agv" \
    --output_path=$output_path \
    --srclang=$srclang \
    --tgtlang=$tgtlang
./Tool/mteval_sbp.linux -c -r $output_path"ref.txt.xml" -s $output_path"src.txt.xml" -t $output_path"tst.txt.xml" > $output_path"1best_result-detoken"
echo "eval result has been created."
set -e
basePath=./en2zh-cwmt17-test/
src_file=$basePath"input.token"
refs_file=$basePath"ref"
tst_file=$basePath"trans"
output_path=$basePath
detoken=detoken.sh
srclang=en
tgtlang=zh
refs_file_detoken=
IFS=',' arr=($refs_file)
for x in ${arr[@]}; do
    sh $detoken $x
    # Accumulate a comma-separated list; the trailing comma is stripped below.
    refs_file_detoken=${refs_file_detoken}${x}".detoken,"
done
refs_file_detoken=${refs_file_detoken%,*}
sh $detoken $tst_file
#python3 detoken_zzy.py $tst_file".detoken" $tst_file".detoken.agv"
echo "$refs_file $tst_file detokenization finished; output is $refs_file_detoken ${tst_file}.detoken"
python3 ./Tool/mteval_sbp.py --src_testfile_path=$src_file \
    --refs_testfile_path=$refs_file_detoken \
    --tst_testfile_path=$tst_file".detoken" \
    --output_path=$output_path \
    --srclang=$srclang \
    --tgtlang=$tgtlang
./Tool/mteval_sbp.linux -c -r $output_path"ref.txt.xml" -s $output_path"src.txt.xml" -t $output_path"tst.txt.xml" > $output_path"1best_result-detoken"
#perl ./Tool/mteval-v13a_Niu.pl -r $output_path"ref.txt.xml" -s $output_path"src.txt.xml" -t $output_path"tst.txt.xml" > $output_path"1best_result-detoken_m"
echo "eval result has been created."
set -e
basePath=./cwmt18-dev/
src_file=$basePath"input.token1000"
#refs_file=$basePath"ref0",$basePath"ref1",$basePath"ref2",$basePath"ref3"
refs_file=$basePath"ref1000"
#refs_file=$basePath"cwmt18.untoken.enpun1000"
tst_file=$basePath"cwmt18.rerank.1000"
output_path=$basePath
detoken=detoken.sh
srclang=zh
tgtlang=en
refs_file_detoken=
IFS=',' arr=($refs_file)
for x in ${arr[@]}; do
    sh $detoken $x
    # Accumulate a comma-separated list; the trailing comma is stripped below.
    refs_file_detoken=${refs_file_detoken}${x}".detoken,"
done
refs_file_detoken=${refs_file_detoken%,*}
sh $detoken $tst_file
python3 detoken_zzy.py $tst_file".detoken" $tst_file".detoken.agv"
echo "$refs_file $tst_file detokenization finished; output is $refs_file_detoken ${tst_file}.detoken"
python3 ./Tool/mteval_sbp.py --src_testfile_path=$src_file \
    --refs_testfile_path=$refs_file \
    --tst_testfile_path=$tst_file \
    --output_path=$output_path \
    --srclang=$srclang \
    --tgtlang=$tgtlang
./Tool/mteval_sbp.linux -c -r $output_path"ref.txt.xml" -s $output_path"src.txt.xml" -t $output_path"tst.txt.xml" > $output_path"1best_result-detoken"
echo "eval result has been created."
basePath=./baseline/exact2k/
src_file=$basePath"input.token"
refs_file=$basePath"ref"
tst_file=$basePath"tst.trans"
output_path=$basePath
srclang=zh
tgtlang=en
python3 ./Tool/mteval_sbp.py --src_testfile_path=$src_file \
--refs_testfile_path=$refs_file \
--tst_testfile_path=$tst_file \
--output_path=$output_path \
--srclang=$srclang \
--tgtlang=$tgtlang
./Tool/mteval_sbp.linux -c -r $output_path"ref.txt.xml" -s $output_path"src.txt.xml" -t $output_path"tst.txt.xml" > $output_path"1best_result"
echo "eval result has created."
# coding=utf-8
import re
import sys
# Chinese-side detokenization: re-attach hyphens, rejoin decimals split by
# tokenization ("1. 35" -> "1.35"), and map ASCII punctuation (. " ' : ,)
# to the corresponding Chinese full-width forms.
if __name__ == '__main__':
    print("python detoken_zh.py infile outfile")
    file_in = open(sys.argv[1], "r", encoding="UTF-8")
    file_out = open(sys.argv[2], "w", encoding="UTF-8")
    count = 0
    while True:
        line = file_in.readline()
        if len(line) == 0:
            break
        # Re-attach hyphens that tokenization separated.
        line = line.replace(" - ", "-")
        line = line.replace(" -- ", "-")
        line = line.replace(" -", "-")
        line = line.replace("- ", "-")
        # Rejoin decimal numbers such as "1. 35" -> "1.35".
        dot = re.findall(r"\d\. \d", line)
        if len(dot) != 0:
            for item in dot:
                line = line.replace(item, item.replace(". ", "."))
                item = item.replace(". ", ".")
                count += 1
        # Sentence-final "." (possibly before a closing quote) -> "。".
        full_point = re.findall(r"\.$|\.\"$", line)
        if len(full_point) != 0:
            for item in full_point:
                line = line.replace(item, item.replace(".", "。"))
                item = item.replace(".", "。")
                count += 1
        # Paired single quotes -> Chinese single quotes ‘...’.
        quote = re.findall(r"\'.+?\'", line)
        if len(quote) != 0:
            for item in quote:
                line = line.replace(item, item.replace("'", "‘", 1))
                item = item.replace("'", "‘", 1)
                line = line.replace(item, item.replace("'", "’", 1))
                item = item.replace("'", "’", 1)
                count += 2
        # Paired double quotes -> Chinese double quotes “...”.
        quote_double = re.findall(r"\".+?\"", line)
        if len(quote_double) != 0:
            for item in quote_double:
                line = line.replace(item, item.replace("\"", "“", 1))
                item = item.replace("\"", "“", 1)
                line = line.replace(item, item.replace("\"", "”", 1))
                item = item.replace("\"", "”", 1)
                count += 2
        # Colon not preceded by a digit -> fullwidth colon.
        colon = re.findall(r"[^\d]:", line)
        if len(colon) != 0:
            for item in colon:
                line = line.replace(item, item.replace(":", ":"))
                item = item.replace(":", ":")
                count += 1
        line = line.replace("•", "﹒")   # bullet -> fullwidth dot
        line = line.replace(",", ",")    # ASCII comma -> fullwidth comma
        file_out.write(line)
    file_in.close()
    file_out.close()
    print("all: " + str(count))
    print("detoken_zh done!")