Commit 5d63396a by huchi

SLTK, a Neural Sequence Labeling Toolkit. It includes character LSTM, word LSTM and CRF components.
parent 174d7757
NiuTrans.Tensor.vcxproj
NiuTrans.Tensor.vcxproj.filters
x64/
vc140.pdb
NiuTrans.Tensor.vcxproj.user
NiuTrans.Tensor.aps
# NiuTrans.Tensor: A Tensor Computation Library
## NiuTrans.Tensor
NiuTrans.Tensor is a toolkit developed as part of the NiuTrans open-source project. It provides complete tensor definition and computation functionality and can be used both for deep learning research and for building industrial systems. NiuTrans.Tensor has the following features:
* Small and simple, easy to modify
* Written in C, with highly optimized code
* Supports both CPU and GPU devices
* A rich set of tensor computation interfaces
* Callable from C/C++, Python, and other languages
## Installation
Before creating your project with the NiuTrans.Tensor toolkit, please note:
* If your project runs on a CPU, the toolkit supports high-performance math libraries; we recommend installing [MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/)
* If your project needs to run on a GPU, [CUDA](https://developer.nvidia.com/cuda-downloads) must be installed (version 9.0 or above); the CUDA toolkit provides a development environment for building high-performance GPU-accelerated applications
The NiuTrans.Tensor toolkit developed by the NiuTrans open-source project is built from source. Installation on Windows and Linux is described below.
### Windows
To use the NiuTrans.Tensor toolkit on Windows:
* First, include the NiuTrans.Tensor code in your project
* Reference three header files in your project (see the sketch below): XTensor.h, CHeader.h under core, and FHeader.h under function:
    * XTensor.h provides the XTensor class that we operate on
    * CHeader.h under core provides the tensor operations
    * FHeader.h under function provides the activation functions
* Use the namespace nts in your project
In addition, please refer to [NiuTrans.Tensor Environment Configuration](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/master/doc/Configuration.md) for the required environment setup.
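A minimal sketch of the setup just described; the include paths and the XTensor constructor signature follow the sources in this commit, while the CPU device (devID = -1) and the absent memory pool are illustrative assumptions:
```cpp
#include "XTensor.h"           // the XTensor class
#include "core/CHeader.h"      // tensor operations
#include "function/FHeader.h"  // activation functions

using namespace nts;

int main()
{
    /* a 2 x 3 float tensor on the CPU (devID = -1, no memory pool);
       data initialization is omitted in this sketch */
    int dims[2] = {2, 3};
    XTensor a(2, dims, X_FLOAT, 1.0F, -1, NULL);
    return 0;
}
```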
### Linux
To use the NiuTrans.Tensor toolkit on Linux, simply run make.sh. It generates tensorCPU and tensorGPU in the same directory, the CPU and GPU executables of NiuTrans.Tensor, respectively. Taking the feed-forward neural network language model as an example, the following command runs the provided test case on the GPU:
>./tensorGPU -test
For more detailed usage, see the [NiuTrans.Tensor development documentation](http://47.104.97.237/niutrans/site/niutensor/index.html)
## Development Team
The NiuTrans.Tensor tensor computation library is developed by the NiuTrans open-source team at the Natural Language Processing Lab of Northeastern University. It aims to provide complete tensor definition and computation functionality for deep learning research and the development of industrial systems.
## Versions
NiuTrans.Tensor version 0.1.0 - August 3, 2018
***Others***
devID: -1
useMem: true
isClean: true
isTrain: true
isShuffled: false
isSupervised: true
***Network Setting***
useAdam: false
useClip: false
useInitState: false
tagNum: 7
charDim: 100
lstmUnit: 100
batchSize: 10
maxEpoch: 50
maxLength: 123
stepsCheck: 1
dropoutP: 0.5F
clipLower: -5.0F
clipUpper: 5.0F
adamBeta1: 0.9F
adamBeta2: 0.999F
adamDelta: 1e-8F
adamBeta1T: 1.0F
adamBeta2T: 1.0F
learningRate: 0.02F
***File & Script***
tagScheme: iob
logFile: log
embFile: data/w2v
ckptPath: data/ckpt
mapFile: data/mappings
evalScript: tool/eval.py
devFile: data/dev.txt
testFile: data/test.txt
trainFile: data/train.txt
preScript: tool/pre-process.py
recordFile: data/record.txt
***Record***
loss: 4.03107
trainF1: 0
trainRecall: 0
trainAccuracy: 0
trainPrecision: 0
devF1: 100
devRecall: 100
devAccuracy: 98.91
devPrecision: 100
testF1: 100
testRecall: 100
testAccuracy: 98.36
testPrecision: 100
# NiuTrans.Tensor Environment Configuration
## Notes
The latest CUDA version, 9.2, does not yet support the latest version of VS2017. We therefore recommend CUDA 9.0 or 9.1 with VS2015, or installing the v140 toolset when using VS2017. Set the solution platform to x64.
## CUDA Configuration
With VS and CUDA installed and the environment variables configured, the key CUDA settings are listed below; all of them can be found under **Project -> Properties**.
>$(CUDA_PATH)\include
Add this to **VC++ Directories -> Include Directories**.
>$(CUDA_PATH)\lib\Win32
Add this to **VC++ Directories -> Library Directories**.
>cuda.lib;cudadevrt.lib;cudart.lib;cudart_static.lib;nvcuvid.lib;OpenCL.lib;cublas.lib;curand.lib;
Add these to **Linker -> Input -> Additional Dependencies**.
When the configuration is done, right-click the project, choose **Project Dependencies**, and select CUDA9.
Right-click each .cu file, open its properties, and set the item type to "CUDA C/C++" (it is easiest to search for the .cu files and set them all at once).
## Other Settings
Set **C/C++ -> General -> SDL checks** to No.
Under **C/C++ -> Preprocessor -> Preprocessor Definitions**, add
>USE_CUDA;USE_BLAS;WIN32;MKL;_DEBUG;_CRT_SECURE_NO_WARNINGS;_CONSOLE;
Set **Linker -> System -> SubSystem** to Console.
Set **General -> Character Set** to Use Unicode Character Set.
Set the arguments the executable needs under **Debugging -> Command Arguments**.
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TModel.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TModel::T2TModel()
{
devID = -1;
mem = NULL;
isLM = false;
isMT = false;
nhead = 1;
}
/* de-constructor */
T2TModel::~T2TModel()
{
delete mem;
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
*/
void T2TModel::InitModel(int argc, const char ** argv)
{
bool useMem = false;
LoadParamInt(argc, argv, "dev", &devID, -1);
LoadParamBool(argc, argv, "mem", &useMem, useMem);
LoadParamBool(argc, argv, "lm", &isLM, true);
LoadParamBool(argc, argv, "mt", &isMT, false);
LoadParamInt(argc, argv, "nhead", &nhead, 8);
if(useMem){
delete mem;
mem = new XMem(devID);
}
encoder.InitModel(argc, argv, isLM, isLM ? 1 : 0, devID, mem);
outputLayer.InitModel(argc, argv, devID, mem);
}
/*
make the encoding network
>> input - input tensor
>> mask - the mask for positions that are/not involved in computation
>> skipInputRes - indicates whether we skip the residual connection of the first layer
<< return - encoding result
*/
XTensor T2TModel::MakeEncoding(XTensor &input, XTensor &mask, bool skipInputRes)
{
return encoder.Make(input, mask, skipInputRes);
}
/*
make the entire network (with the output softmax layer)
>> input - input tensor
>> output - output tensor (distribution)
*/
void T2TModel::Make(XTensor &input, XTensor &output)
{
XTensor encoding;
if(isLM){
/* generate mask to see "previous" words only */
int len = input.GetDim(input.order - 2);
int * dims = new int[input.order + 1];
for(int i = 0; i < input.order; i++)
dims[i + 1] = input.GetDim(i);
dims[0] = nhead;
dims[input.order] = len;
XTensor mask(input.order + 1, dims, X_FLOAT, 1.0F, input.devID, input.mem);
/* an upper triangular matrix where the cells above the diagonal are set to -1e9 */
_SetDataLowTri(&mask, 1e9F, -1);
_ScaleAndShiftMe(&mask, 1.0F, -1e9F);
encoding = MakeEncoding(input, mask, true);
outputLayer.Make(encoding, output);
delete[] dims;
}
else{
ShowNTErrors("TODO!");
}
}
}
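For reference, a standalone sketch of the mask that the two steps above construct: _SetDataLowTri writes 1e9 into the visible (lower-triangular) positions and _ScaleAndShiftMe then shifts everything by -1e9, leaving 0 where attention is allowed and -1e9 at "future" positions so that they vanish after softmax. Whether the diagonal itself is visible depends on the shift argument of _SetDataLowTri; the sketch assumes a position can attend to itself:
#include <cstdio>

int main()
{
    const int len = 4;
    float m[len][len];
    /* emulate the mask: 0 where attention is allowed, -1e9 elsewhere */
    for (int i = 0; i < len; i++)
        for (int j = 0; j < len; j++)
            m[i][j] = (j <= i) ? 0.0F : -1e9F;
    for (int i = 0; i < len; i++) {
        for (int j = 0; j < len; j++)
            printf("%10.1e", m[i][j]);
        printf("\n");
    }
    return 0;
}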
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __UNARY_H__
#define __UNARY_H__
#include "../../XTensor.h"
namespace nts{
/* set every entry to its absolute value */
void _Absolute(const XTensor * a, XTensor * b);
/*
set every entry to its absolute value (do it in place)
keep the result in the input tensor a and return nothing
*/
void _AbsoluteMe(XTensor * a);
/*
set every entry to its absolute value (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Absolute(const XTensor & a);
/* set every entry to its exponential value */
void _Exp(const XTensor * a, XTensor * b);
/*
set every entry to its exponential value (do it in place)
keep the result in the input tensor a and return nothing
*/
void _ExpMe(XTensor * a);
/*
set every entry to its exponential value (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Exp(const XTensor & a);
/* set every entry to its logarithm value */
void _Log(const XTensor * a, XTensor * b);
/*
set every entry to its logarithm value (do it in place)
keep the result in the input tensor a and return nothing
*/
void _LogMe(XTensor * a);
/*
set every entry to its logarithm value (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Log(const XTensor & a);
/* set every entry to its sine value */
void _Sin(const XTensor * a, XTensor * b);
/*
set every entry to its sine value (do it in place)
keep the result in the input tensor a and return nothing
*/
void _SinMe(XTensor * a);
/*
set every entry to its sine value (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Sin(const XTensor & a);
/* set every entry to its cosine value */
void _Cos(const XTensor * a, XTensor * b);
/*
set every entry to its cosine value (do it in place)
keep the result in the input tensor a and return nothing
*/
void _CosMe(XTensor * a);
/*
set every entry to its cosine value (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Cos(const XTensor & a);
/* set every entry to its tangent value */
void _Tan(const XTensor * a, XTensor * b);
/*
set every entry to its tangent value (do it in place)
keep the result in the input tensor a and return nothing
*/
void _TanMe(XTensor * a);
/*
set every entry to its tangent value (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Tan(const XTensor & a);
/* set every entry to its rounded value */
//void _Round(const XTensor * a, XTensor * b);
/*
set every entry to its rounded value (do it in place)
keep the result in the input tensor a and return nothing
*/
//void _RoundMe(XTensor * a);
/*
set every entry to its rounded value (return an XTensor structure)
make a new tensor to keep the result and return it
*/
//XTensor Round(const XTensor & a);
}
#endif //end __UNARY_H__
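A hedged usage sketch of the three calling conventions declared above (out-of-place _Exp, in-place _ExpMe, and the XTensor-returning Exp). The constructor signature is taken from the sources in this commit; the include path, the CPU device (devID = -1), and the omitted data initialization are assumptions:
#include "Unary.h"

using namespace nts;

void UnaryDemo()
{
    int dims[2] = {2, 2};
    XTensor a(2, dims, X_FLOAT, 1.0F, -1, NULL); /* data setup omitted */
    XTensor b(2, dims, X_FLOAT, 1.0F, -1, NULL);

    _Exp(&a, &b);        /* out-of-place: write exp(a) into b */
    _ExpMe(&a);          /* in-place: overwrite a with exp(a) */
    XTensor c = Exp(a);  /* make and return a new tensor holding exp(a) */
}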
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "Concatenate.h"
#include "Merge.h"
#include "ConcatenateSolely.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
concatenate a list of tensors along a given dimension
Note that this is actually a wrapper that selects "ConcatenateSolely"
or "Merge" by means of the tensor shapes
>> smalls - a list of tensors for concatenation
>> big - the resulting tensor
>> dim - which dimension we perform the concatenation
*/
void _Concatenate(const XList * smalls, XTensor * big, int dim)
{
bool uniform = true;
for (int i = 1; i < smalls->count; i++) {
XTensor * a = (XTensor*)smalls->GetItem(i - 1);
XTensor * b = (XTensor*)smalls->GetItem(i);
CheckNTErrors((a && b), "Empty input tensors!");
if (!XTensor::IsSameShaped(a, b))
uniform = false;
}
if (uniform)
_Merge(smalls, big, dim);
else
_ConcatenateSolely(smalls, big, dim);
}
/*
concatenate a list of tensors along a given dimension (return an XTensor structure)
make a new tensor to keep the result and return it
Note that this is actually a wrapper that selects "ConcatenateSolely"
or "Merge" by means of the tensor shapes
>> smalls - a list of tensors for concatenation
>> dim - which dimension we perform the concatenation
<< return - the tensor resulting from concatenating a list of tensors along a given dimension
*/
XTensor Concatenate(const XList &smalls, int dim)
{
CheckNTErrors(smalls.count > 0, "Empty list!");
CheckNTErrors(dim >= 0, "Illegal dimension to concatenate!");
bool uniform = true;
for (int i = 1; i < smalls.count; i++) {
XTensor * a = (XTensor*)smalls.GetItem(i - 1);
XTensor * b = (XTensor*)smalls.GetItem(i);
CheckNTErrors((a && b), "Empty input tensors!");
if (!XTensor::IsSameShaped(a, b))
uniform = false;
}
XTensor * tensor = (XTensor*)smalls.GetItem(0);
int order = tensor->order;
int * dimSize = new int[order];
if (uniform) {
for (int i = 0; i < tensor->order; i++) {
if (i != dim)
dimSize[i] = tensor->dimSize[i];
else
dimSize[i] = tensor->dimSize[dim] * smalls.count;
}
float dr = (!tensor->isSparse) ? 1.0F : tensor->denseRatio;
XTensor big(order, dimSize, tensor->dataType, dr, tensor->devID, tensor->mem);
big.SetTMP();
/* call _Merge function */
_Merge(&smalls, &big, dim);
/* tensor connection */
XLink::MakeLink(&smalls, &big, SHAPE_MERGE);
XLink::AddParamToHeadInt(&big, dim);
/* destroy variables */
delete[] dimSize;
return big;
}
else {
for (int i = 0; i < tensor->order; i++)
if (i != dim)
dimSize[i] = tensor->dimSize[i];
int catDimSize = 0;
for (int i = 0; i < smalls.count; i++) {
XTensor * tensor = (XTensor*)smalls.GetItem(i);
catDimSize += tensor->dimSize[dim];
}
dimSize[dim] = catDimSize;
float dr = (!tensor->isSparse) ? 1.0F : tensor->denseRatio;
XTensor big(order, dimSize, tensor->dataType, dr, tensor->devID, tensor->mem);
big.SetTMP();
/* call _ConcatenateSolely function */
_ConcatenateSolely(&smalls, &big, dim);
/* tensor connection */
XLink::MakeLink(&smalls, &big, SHAPE_CONCATENATE);
XLink::AddParamToHeadInt(&big, dim);
/* destroy variables */
delete[] dimSize;
return big;
}
}
/*
concatenate two tensors along a given dimension
>> smallA - one tensor for concatenation
>> smallB - the other tensor for concatenation
>> big - the resulting tensor
>> dim - which dimension we perform the concatenation
*/
void _Concatenate(const XTensor * smallA, const XTensor * smallB, XTensor * big, int dim)
{
XList smalls(2);
smalls.Add(smallA);
smalls.Add(smallB);
_Concatenate(&smalls, big, dim);
}
/*
concatenate two tensors along a given dimension (return an XTensor structure).
make a new tensor to keep the result and return it.
>> smallA - one tensor for concatenation
>> smallB - the other tensor for concatenation
>> dim - which dimension we perform the concatenation
<< return - the tensor resulting from concatenating two tensors along a given dimension
*/
XTensor Concatenate(const XTensor &smallA, const XTensor &smallB, int dim)
{
CheckNTErrors(dim >= 0, "Illegal dimension to concatenate!");
XList smalls(2);
smalls.Add(&smallA);
smalls.Add(&smallB);
bool uniform = true;
for (int i = 1; i < smalls.count; i++) {
XTensor * a = (XTensor*)smalls.Get(i - 1);
XTensor * b = (XTensor*)smalls.Get(i);
CheckNTErrors((a && b), "Empty input tensors!");
if (!XTensor::IsSameShaped(a, b))
uniform = false;
}
XTensor * tensor = (XTensor*)smalls.Get(0);
int order = tensor->order;
int * dimSize = new int[order];
if (uniform) {
for (int i = 0; i < tensor->order; i++) {
if (i != dim)
dimSize[i] = tensor->dimSize[i];
else
dimSize[i] = tensor->dimSize[dim] * smalls.count;
}
float dr = (!tensor->isSparse) ? 1.0F : tensor->denseRatio;
XTensor big(order, dimSize, tensor->dataType, dr, tensor->devID, tensor->mem);
big.SetTMP();
/* call _Merge function */
_Merge(&smalls, &big, dim);
/* tensor connection */
XLink::MakeLink(&smalls, &big, SHAPE_MERGE);
XLink::AddParamToHeadInt(&big, dim);
/* destroy variables */
delete[] dimSize;
return big;
}
else {
for (int i = 0; i < tensor->order; i++)
if (i != dim)
dimSize[i] = tensor->dimSize[i];
int catDimSize = 0;
for (int i = 0; i < smalls.count; i++) {
XTensor * tensor = (XTensor*)smalls.Get(i);
catDimSize += tensor->dimSize[dim];
}
dimSize[dim] = catDimSize;
float dr = (!tensor->isSparse) ? 1.0F : tensor->denseRatio;
XTensor big(order, dimSize, tensor->dataType, dr, tensor->devID, tensor->mem);
big.SetTMP();
/* call _ConcatenateSolely function */
_ConcatenateSolely(&smalls, &big, dim);
/* tensor connection */
XLink::MakeLink(&smalls, &big, SHAPE_CONCATENATE);
XLink::AddParamToHeadInt(&big, dim);
/* destroy variables */
delete[] dimSize;
return big;
}
}
} // namespace nts(NiuTrans.Tensor)
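A hedged usage sketch of the wrapper above, under the same assumptions as the earlier sketches (include path, CPU device, data setup omitted). Two identically shaped inputs take the _Merge path:
#include "Concatenate.h"

using namespace nts;

void ConcatenateDemo()
{
    int dims[2] = {2, 3};
    XTensor a(2, dims, X_FLOAT, 1.0F, -1, NULL); /* data setup omitted */
    XTensor b(2, dims, X_FLOAT, 1.0F, -1, NULL);

    /* same shapes, so the wrapper dispatches to _Merge;
       concatenating along dimension 1 yields a 2 x 6 tensor */
    XTensor c = Concatenate(a, b, 1);
}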
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "../../XUtility.h"
#include "../../XName.h"
#include "ConcatenateSolely.h"
#include "MergeBlockLists.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
concatenate a list of tensors along a given dimension
>> smalls - a list of tensors for concatenation
>> big - the resulting tensor
>> dim - which dimension we perform the concatenation
*/
void _ConcatenateSolely(const XList * smalls, XTensor * big, int dim)
{
CheckNTErrors(big->order > dim && dim >= 0, "Illegal dimension to concatenate!");
int catDimSize = 0;
int dimRDI = big->order - dim - 1;
for (int i = 0; i < smalls->count; i++) {
XTensor * tensor = (XTensor*)smalls->GetItem(i);
CheckNTErrors((big->order == tensor->order), "Unmatched tensor orders!");
for (int j = 0; j < big->order; j++) {
if (j != dimRDI) {
CheckNTErrors((big->dimSizeRDI[j] == tensor->dimSizeRDI[j]), "Unmatched tensor sizes!");
}
else {
catDimSize += tensor->dimSizeRDI[j];
}
}
}
CheckNTErrors((catDimSize == big->dimSizeRDI[dimRDI]), "Unmatched tensor sizes!");
int stride = 1;
for (int i = 0; i < dimRDI; i++)
stride *= big->dimSizeRDI[i];
int blockNum = 1;
for (int i = dimRDI + 1; i < big->order; i++)
blockNum *= big->dimSizeRDI[i];
int offset = 0;
/*
two strategies are used - we can either resort to memcpy2d for the case of
concatenation of a few items, or use MergeBlockLists to merge a large number
of data blocks
*/
if (smalls->count <= MIN_TENSOR_CAT_NUM) {
for (int i = 0; i < smalls->count; i++) {
XTensor * tensor = (XTensor*)smalls->GetItem(i);
int sPitch = stride * tensor->dimSizeRDI[dimRDI] * tensor->unitSize;
int tPitch = stride * big->dimSizeRDI[dimRDI] * big->unitSize;
int mSize = sPitch;
int n = blockNum;
XMemCopy2D((char*)big->data + offset, tPitch, big->devID,
(char*)tensor->data, sPitch, tensor->devID,
mSize, n);
offset += sPitch;
}
}
else {
XList * sourceArrays = new XList(smalls->count);
int * blockSizes = new int[smalls->count];
for (int i = 0; i < smalls->count; i++) {
XTensor * tensor = (XTensor*)smalls->GetItem(i);
blockSizes[i] = stride * tensor->dimSizeRDI[dimRDI] * tensor->unitSize;
sourceArrays->Add(tensor->data);
}
_MergeBlockLists(sourceArrays, blockSizes, blockNum, big->data, big->mem);
delete[] blockSizes;
delete sourceArrays;
}
}
} // namespace nts(NiuTrans.Tensor)
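A worked example of the stride/block arithmetic above, under the RDI ("reverse dimension index") convention visible in this file, where dimSizeRDI[0] is the innermost dimension; the 2 x 3 x 4 shape is illustrative:
#include <cstdio>

int main()
{
    int order = 3;
    int dimSizeRDI[3] = {4, 3, 2};   /* a 2 x 3 x 4 tensor, dimensions reversed */
    int dim = 1;                     /* concatenate along the middle dimension */
    int dimRDI = order - dim - 1;    /* = 1 */

    int stride = 1;
    for (int i = 0; i < dimRDI; i++)
        stride *= dimSizeRDI[i];     /* = 4: contiguous entries per slice */

    int blockNum = 1;
    for (int i = dimRDI + 1; i < order; i++)
        blockNum *= dimSizeRDI[i];   /* = 2: independent blocks to copy */

    printf("stride = %d, blockNum = %d\n", stride, blockNum);
    return 0;
}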
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "MakeMergeBlockIndex.h"
#include "MakeMergeBlockIndex.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set target data block index for the data movement in split (device code)
>> blockIndex - index of the blocks
>> blockNum - number of the blocks
>> blockNumInMerge - size of the dimension along which we perform the merging operation
>> splitSizeInGrid - size of each data array to merge
>> gridSize - number of blocks in a grid (here a grid is a higher-level organization over blocks)
>> gridNum - number of grids
>> mem - the memory pool
*/
__global__
void KernelMakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge,
int splitSizeInGrid, int gridSize, int gridNum)
{
/* block index */
int i = blockDim.x * blockIdx.x + threadIdx.x;
/* grid index */
int k = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= blockNum || k >= gridNum)
return;
int j = blockNumInMerge * (i % splitSizeInGrid) + int(i / splitSizeInGrid);
/* i = source block index, j = target block index and k = (target) grid index */
blockIndex[i + gridSize * k] = j + gridSize * k;
}
/*
set target data block index for the data movement in split
>> devID - id of the GPU device
>> blockIndex - index of the blocks
>> blockNum - number of the blocks
>> blockNumInMerge - size of the dimension along which we perform the merging operation
>> splitSizeInGrid - size of each data array to merge
>> gridSize - number of blocks in a grid (here a grid is a higher-level organization over blocks)
>> gridNum - number of grids
>> mem - the memory pool
*/
void _CudaMakeMergeBlockIndex(int devID,
int * blockIndex, int blockNum, int blockNumInMerge,
int splitSizeInGrid, int gridSize, int gridNum)
{
int cudaGrids[3];
int cudaBlocks[3];
GDevs.GetCudaThread2D(devID, blockNum, gridNum, MAX_INT, cudaGrids, cudaBlocks);
int devIDBackup;
ProtectCudaDev(devID, devIDBackup);
KernelMakeMergeBlockIndex << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
(blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum);
BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
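To make the permutation computed by KernelMakeMergeBlockIndex concrete, a small host-side sketch (single grid, k = 0; the sizes are illustrative):
#include <cstdio>

int main()
{
    int blockNum = 6, blockNumInMerge = 2, splitSizeInGrid = 3;
    /* the same mapping as in the kernel above:
       source block i goes to target block j */
    for (int i = 0; i < blockNum; i++) {
        int j = blockNumInMerge * (i % splitSizeInGrid) + i / splitSizeInGrid;
        printf("block %d -> block %d\n", i, j);
    }
    return 0;
}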
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "MakeSplitBlockIndex.h"
#include "MakeSplitBlockIndex.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set target data block index for the data movement in split (device code)
>> blockIndex - block index
>> splitNum - number of splits
>> blockSplitSize - size of the split block
>> blockNum - number of data blocks
*/
__global__
void KernelMakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= blockNum)
return;
int j = (i % splitNum) * blockSplitSize + i / splitNum;
/* i = source block index, j = target block index */
blockIndex[i] = j;
}
/*
set target data block index for the data movement in split
>> devID - device id
>> blockIndex - block index
>> splitNum - number of splits
>> blockSplitSize - size of the split block
>> blockNum - number of data blocks
*/
void _CudaMakeSplitBlockIndex(int devID, int * blockIndex, int splitNum, int blockSplitSize, int blockNum)
{
int cudaGrids[3];
int cudaBlocks[3];
GDevs.GetCudaThread(devID, blockNum, cudaGrids, cudaBlocks);
int devIDBackup;
ProtectCudaDev(devID, devIDBackup);
KernelMakeSplitBlockIndex << <dim3(cudaGrids[0]), dim3(cudaBlocks[0]) >> >
(blockIndex, splitNum, blockSplitSize, blockNum);
BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XUtility.h"
#include "../../XTensor.h"
#include "MergeBlockLists.h"
#include "MergeBlockLists.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
copy a number of blocks (of different sizes) to target positions
>> sourceList - list of data arrays to copy from
>> sourceBlockSizes - sizes of the blocks (entry i is the size of block i)
>> sourceBlockNum - number of blocks to merge
>> targetList - list of data arrays to copy to
*/
__global__
void KernelCopyBlockLists(DTYPE * sourceList[], int * sourceBlockSizes, int sourceBlockNum, DTYPE * targetList[])
{
__shared__ int iBlockSizes[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE * iSourceList[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE * iTargetList[MAX_CUDA_THREAD_NUM_PER_BLOCK];
/* entry index in the block */
int i = blockDim.x * blockIdx.x + threadIdx.x;
/* block index */
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (j >= sourceBlockNum)
return;
if (threadIdx.x == 0) {
iBlockSizes[threadIdx.y] = sourceBlockSizes[j];
iSourceList[threadIdx.y] = sourceList[j];
iTargetList[threadIdx.y] = targetList[j];
}
__syncthreads();
if (i < iBlockSizes[threadIdx.y])
iTargetList[threadIdx.y][i] = iSourceList[threadIdx.y][i];
}
/*
merge data by blocks (cuda version)
>> sourceList - list of data arrays (heads of the blocks) to copy from
>> blockSizes - size of the blocks
>> blockNum - number of blocks
>> target - target data array
>> myMem - the memory pool
*/
void _CudaMergeBlockLists(const XList * sourceList, int * blockSizes, int blockNum, void * target, XMem * myMem)
{
CheckNTErrors((myMem != NULL), "No memory pool!");
CheckNTErrors((myMem->devID >= 0), "Wrong device to run!");
int newBlockListSize = sourceList->count * blockNum;
int minBlockSize = MAX_INT;
int maxBlockSize = -MAX_INT;
int realMaxBlockSize = 1;
DTYPE ** sourceArrays = new DTYPE*[newBlockListSize];
DTYPE ** targetArrays = new DTYPE*[newBlockListSize];
int * sizes = new int[newBlockListSize];
int * offsets = new int[sourceList->count];
memset(offsets, 0, sizeof(int) * sourceList->count);
int totalOffset = 0;
for (int k = 0; k < blockNum; k++) {
for (int i = 0; i < sourceList->count; i++) {
CheckNTErrors((blockSizes[i] % sizeof(DTYPE) == 0), "Unsupported block size!");
int j = k * sourceList->count + i;
sizes[j] = blockSizes[i] / sizeof(DTYPE);
sourceArrays[j] = (DTYPE*)sourceList->GetItem(i) + offsets[i];
targetArrays[j] = (DTYPE*)target + totalOffset;
offsets[i] += sizes[i];
totalOffset += sizes[i];
if (minBlockSize > blockSizes[i])
minBlockSize = blockSizes[i];
if (maxBlockSize < blockSizes[i])
maxBlockSize = blockSizes[i];
}
}
CheckNTErrors((minBlockSize % sizeof(DTYPE) == 0), "Unsupported block size!");
CheckNTErrors((maxBlockSize % sizeof(DTYPE) == 0), "Unsupported block size!");
realMaxBlockSize = maxBlockSize / sizeof(DTYPE);
int devIDBackup;
ProtectCudaDev(myMem->devID, devIDBackup);
int cudaGridSizes[3];
int cudaBlockSizes[3];
GDevs.GetCudaThread2D(myMem->devID, realMaxBlockSize, newBlockListSize, MAX_INT,
cudaGridSizes, cudaBlockSizes);
myMem->SetPinBuf();
int * sizesGPU = (int*)myMem->AllocBuf(myMem->devID, sizeof(int) * newBlockListSize, 256);
DTYPE ** sourceArraysGPU = (DTYPE**)myMem->AllocBuf(myMem->devID, sizeof(DTYPE*) * newBlockListSize, 256);
DTYPE ** targetArraysGPU = (DTYPE**)myMem->AllocBuf(myMem->devID, sizeof(DTYPE*) * newBlockListSize, 256);
XMemCopy(sizesGPU, myMem->devID, sizes, -1, sizeof(int) * newBlockListSize);
XMemCopy(sourceArraysGPU, myMem->devID, sourceArrays, -1, sizeof(DTYPE*) * newBlockListSize);
XMemCopy(targetArraysGPU, myMem->devID, targetArrays, -1, sizeof(DTYPE*) * newBlockListSize);
KernelCopyBlockLists << <dim3(cudaGridSizes[0], cudaGridSizes[1]), dim3(cudaBlockSizes[0], cudaBlockSizes[1]) >> >
(sourceArraysGPU, sizesGPU, newBlockListSize, targetArraysGPU);
myMem->BackToPinBuf();
delete[] sourceArrays;
delete[] targetArrays;
delete[] sizes;
delete[] offsets;
BacktoCudaDev(myMem->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-28
* It is extremely hot these days and I cannot sleep well. Fortunately we had a
* good lunch of Steamed Cold Noodles. This made me feel much better!
*/
#include "Transpose.h"
#include "Merge.h"
#include "../../XUtility.h"
#include "../../XName.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor transposition of dimensions i and j
b = transposed(a)
For an input tensor a, we transpose the dimensions i and j of it.
E.g., let a be a tensor of size x * y * z, i = 0, j = 2,
then the output will be a tensor of size z * y * x.
>> a - the input tensor
>> b - the output tensor by transpose tensor a with specified dimensions i and j
>> i - the transposed dimension
>> j - the transposed dimension
*/
void _Transpose(const XTensor * a, XTensor * b, const int i, const int j)
{
CheckNTErrors(a && b, "Empty tensors");
CheckNTErrors(a->order == b->order, "Wrong tensor orders");
CheckNTErrors(a->unitNum == b->unitNum && a->unitSize == b->unitSize, "Wrong tensor sizes");
CheckNTErrors(a->order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a->order > j && j >= 0, "index of dimension is out of scope!");
for(int k = 0; k < a->order; k++){
if(k == i){
CheckNTErrors(a->dimSize[k] == b->dimSize[j], "Wrong dimension size in transposition");
}
else if(k == j){
CheckNTErrors(a->dimSize[k] == b->dimSize[i], "Wrong dimension size in transposition");
}
else{
CheckNTErrors(a->dimSize[k] == b->dimSize[k], "Wrong dimension size in transposition");
}
}
if(i == j){
XMemCopy(b->data, b->devID, a->data, a->devID, b->unitNum * b->unitSize);
}
else{
int I = MIN(i, j);
int J = MAX(i, j);
int * dims = new int[a->order + 1];
for(int k = 0; k <= J; k++)
dims[k] = a->dimSize[k];
dims[J + 1] = -1;
for(int k = J + 1; k < a->order; k++)
dims[k + 1] = a->dimSize[k];
/* reshape tensor a from (..., n_I, ..., n_J, ...) => (..., n_I, ..., n_J, 1, ...) */
XTensor * aTMP = new XTensor(a->order + 1, dims, a->dataType, a->denseRatio, a->devID, a->mem);
aTMP->data = a->data;
for(int k = 0; k < I; k++)
dims[k] = a->dimSize[k];
for(int k = I + 1; k <= J; k++)
dims[k - 1] = a->dimSize[k];
dims[J] = a->dimSize[I];
for(int k = J + 1; k < a->order; k++)
dims[k] = a->dimSize[k];
/* reshape tensor b from (..., m_I, ..., m_J, ...) => (..., m_J, m_I, ...) */
b->Reshape(b->order, dims);
/* tensor (..., n_I, ..., n_J, 1, ...) => tensor (..., m_J, m_I, ...) */
_Merge(aTMP, b, J + 1, I);
memcpy(dims, a->dimSize, sizeof(int) * a->order);
dims[I] = a->dimSize[J];
dims[J] = a->dimSize[I];
/* reshape tensor b from (..., m_J, m_I, ...) => (..., m_J, ..., m_I, ...) */
b->Reshape(b->order, dims);
aTMP->data = NULL;
delete[] dims;
delete aTMP;
}
}
/*
tensor transposition of dimensions i and j (return an XTensor structure).
make a new tensor to keep the result and return it.
b = transposed(a)
For an input tensor a, we transpose the dimensions i and j of it.
E.g., let a be a tensor of size x * y * z, i = 0, j = 2,
then the output will be a tensor of size z * y * x.
>> a - the input tensor
>> i - the transposed dimension
>> j - the transposed dimension
<< return - the output tensor by transpose tensor a with specified dimensions i and j
*/
XTensor Transpose(const XTensor &a, const int i, const int j)
{
CheckNTErrors(a.order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a.order > j && j >= 0, "index of dimension is out of scope!");
int order = a.order;
int * dimSize = new int[order];
for(int k = 0; k < order; k++){
if(k == i)
dimSize[k] = a.dimSize[j];
else if(k == j)
dimSize[k] = a.dimSize[i];
else
dimSize[k] = a.dimSize[k];
}
float dr = (!a.isSparse) ? 1.0F : a.denseRatio;
XTensor b(order, dimSize, a.dataType, dr, a.devID, a.mem);
b.SetTMP();
/* call _Transpose function */
_Transpose(&a, &b, i, j);
/* tensor connection */
XLink::MakeLink(&a, NULL, &b, SHAPE_TRANSPOSE);
XLink::AddParamToHeadInt(&b, i);
XLink::AddParamToHeadInt(&b, j);
/* destroy variables */
delete[] dimSize;
return b;
}
}
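A hedged usage sketch matching the x * y * z example in the comments above (constructor signature from this commit; include path, CPU device, and omitted data setup are assumptions):
#include "Transpose.h"

using namespace nts;

void TransposeDemo()
{
    /* a 2 x 3 x 4 tensor; transposing dimensions 0 and 2 yields 4 x 3 x 2 */
    int dims[3] = {2, 3, 4};
    XTensor a(3, dims, X_FLOAT, 1.0F, -1, NULL); /* data setup omitted */
    XTensor b = Transpose(a, 0, 2);
}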
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-05
* It will rain tomorrow - end of the hot days :)
*/
#ifndef __TRANSPOSE_H__
#define __TRANSPOSE_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor transposition of dimensions i and j
b = transposed(a)
*/
void _Transpose(const XTensor * a, XTensor * b, const int i, const int j);
/*
tensor transposition of dimensions i and j (return an XTensor structure).
make a new tensor to keep the result and return it.
b = transposed(a)
*/
XTensor Transpose(const XTensor &a, const int i, const int j);
} // namespace nts(NiuTrans.Tensor)
#endif // __TRANSPOSE_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "Unsqueeze.h"
#include "MergeBlockLists.h"
#include "Unsqueeze.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
insert a dimension by copying the blocks x times
(where x is the size of the inserted dimension)
>> a - input tensor
>> b - output tensor
>> dim - where to insert the dimension
>> dSize - size of the newly-inserted dimension
*/
void _Unsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
{
CheckNTErrors((a && b), "Empty input tensors!");
CheckNTErrors((a->order == b->order - 1), "Unmatched tensors!");
CheckNTErrors((a->unitSize == b->unitSize), "Unmatched tensors!");
int dimRDI = b->order - dim - 1;
for (int i = 0; i < b->order; i++) {
if (i < dimRDI) {
CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i]), "Unmatched tensors!");
}
else if (i > dimRDI) {
CheckNTErrors((a->dimSizeRDI[i - 1] == b->dimSizeRDI[i]), "Unmatched tensors!");
}
else {
CheckNTErrors((dSize == b->dimSizeRDI[i]), "Unmatched tensors!");
}
}
int blockSize = 1;
int realBlockSize = 1;
int blockNumA = 1;
int blockNumB = 1;
for (int i = 0; i < dimRDI; i++)
blockSize *= a->dimSizeRDI[i];
realBlockSize = blockSize * a->unitSize;
blockNumA = a->unitNum / blockSize;
blockNumB = b->unitNum / blockSize;
CheckNTErrors((blockNumA * dSize == blockNumB), "Unmatched tensors!");
if (a->devID >= 0 || b->devID >= 0) {
#ifdef USE_CUDA
_CudaUnsqueeze(a, b, dim, dSize);
#else
ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
}
else {
XList * sourceArrays = new XList(blockNumB);
int * blockSizes = new int[blockNumB];
for (int i = 0; i < blockNumA; i++) {
char * ap = (char*)a->data + i * realBlockSize;
for (int j = 0; j < dSize; j++) {
sourceArrays->Add(ap);
blockSizes[i * dSize + j] = realBlockSize;
}
}
_MergeBlockLists(sourceArrays, blockSizes, 1, b->data, b->mem);
delete sourceArrays;
delete[] blockSizes;
}
}
/*
insert a dimension by copying the blocks x times
(where x is the size of the inserted dimension) (return an XTensor structure)
make a new tensor to keep the result and return it
>> a - input tensor
>> dim - where to insert the dimension
>> dSize - size of the newly-inserted dimension
<< return - a tensor by inserting a dimension by copying the blocks for x times
*/
XTensor Unsqueeze(const XTensor &a, int dim, int dSize)
{
int order = a.order + 1;
int * dimSize = new int[order];
for (int i = 0; i < order; i++) {
if (i < dim)
dimSize[i] = a.dimSize[i];
else if (i == dim)
dimSize[i] = dSize;
else
dimSize[i] = a.dimSize[i - 1];
}
float dr = (!a.isSparse) ? 1.0F : a.denseRatio;
XTensor b(order, dimSize, a.dataType, dr, a.devID, a.mem);
b.SetTMP();
/* call _Unsqueeze function */
_Unsqueeze(&a, &b, dim, dSize);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, SHAPE_UNSQUEEZE);
XLink::AddParamToHeadInt(&b, dim);
XLink::AddParamToHeadInt(&b, dSize);
/* destroy variables */
delete[] dimSize;
return b;
}
} // namespace nts(NiuTrans.Tensor)
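A hedged usage sketch of Unsqueeze under the same assumptions as the earlier sketches:
#include "Unsqueeze.h"

using namespace nts;

void UnsqueezeDemo()
{
    /* a 2 x 3 tensor; inserting a dimension of size 4 at position 1
       yields a 2 x 4 x 3 tensor whose new blocks are copies of the input */
    int dims[2] = {2, 3};
    XTensor a(2, dims, X_FLOAT, 1.0F, -1, NULL); /* data setup omitted */
    XTensor b = Unsqueeze(a, 1, 4);
}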
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __UNSQUEEZE_CUH__
#define __UNSQUEEZE_CUH__
#include "Unsqueeze.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* duplicate the data along a given dimension */
void _CudaUnsqueeze(const XTensor * a, XTensor * b, int dim, int dSize);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __UNSQUEEZE_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __UNSQUEEZE_H__
#define __UNSQUEEZE_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* insert a dimension by copying the blocks x times (where x is the size of the inserted dimension) */
void _Unsqueeze(const XTensor * a, XTensor * b, int dim, int dSize);
/*
insert a dimension by copying the blocks x times
(where x is the size of the inserted dimension) (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Unsqueeze(const XTensor &a, int dim, int dSize);
} // namespace nts(NiuTrans.Tensor)
#endif // __UNSQUEEZE_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "../movement/CopyValues.h"
#include "../../XUtility.h"
#include "../../XName.h"
#include "Sort.h"
#include "Sort.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
sort the tensor along a given dimension
>> a - input tensor
>> b - output tensor
>> index - index of the items in the resulting tensor
>> dim - the dimension along which the sorting is performed
*/
void _Sort(const XTensor * a, XTensor * b, XTensor * index, int dim)
{
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((dim >= 0 && dim < a->order), "Incorrect dimension specified!");
CheckNTErrors((a->order == index->order), "Unmatched input tensors!");
CheckNTErrors((index->dataType == X_INT), "Wrong data type!");
int dimRDI = a->order - dim - 1;
/* make the index tensor */
index->SetAscendingOrder(dim);
if (a->devID >= 0) {
#ifdef USE_CUDA
_CudaSortBig(a, b, index, index, dim);
#else
ShowNTErrors("Plesae specify USE_CUDA and recompile the code!");
#endif
}
else {
int stride = 1;
int strideNum = a->dimSizeRDI[dimRDI];
for (int i = 0; i < dimRDI; i++)
stride *= a->dimSizeRDI[i];
int blockNum = 1;
for (int i = dimRDI + 1; i < a->order; i++)
blockNum *= a->dimSizeRDI[i];
int blockSize = stride * strideNum;
_CopyValues(a, b);
for (int k = 0; k < blockNum; k++) {
for (int i = 0; i < stride; i++) {
void * dataB = (char*)b->data + (k * blockSize + i) * b->unitSize;
void * indexData = (char*)index->data + (k * blockSize + i) * sizeof(int);
/* we sort the data array along "dim" */
if (a->dataType == X_FLOAT)
XQSort(dataB, indexData, strideNum, a->unitSize, stride, CompXFloat);
else {
ShowNTErrors("TODO!");
}
}
}
}
}
/*
sort the tensor along a given dimension (do it in place)
keep the result in the input tensor a and return nothing
>> a - input tensor
>> index - index of the items in the resulting tensor
>> dim - the dimension along which the sorting is performed
*/
void _SortMe(XTensor * a, XTensor * index, int dim)
{
_Sort(a, a, index, dim);
}
/*
sort the tensor along a given dimension (put the results in the output tensors)
the input tensor is kept unchanged and tensor connections are built for the outputs
>> a - input tensor
>> b - output tensor
>> index - index of the items in the resulting tensor
>> dim - the dimension along which the sorting is performed
*/
void Sort(XTensor & a, XTensor & b, XTensor & index, int dim)
{
/* call _Sort function */
_Sort(&a, &b, &index, dim);
/* tensor connections */
XList list(2);
list.Add(&b);
list.Add(&index);
XLink::MakeLink(&a, &list, SORT_SORT);
XLink::AddParamToHeadInt(&b, dim);
XLink::AddParamToHeadInt(&index, dim);
}
} // namespace nts(NiuTrans.Tensor)
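A hedged usage sketch; note that the index tensor must be X_INT and have the same shape as the input, as checked in _Sort above. Include path, CPU device, and the omitted data setup are assumptions:
#include "Sort.h"

using namespace nts;

void SortDemo()
{
    int dims[2] = {3, 5};
    XTensor a(2, dims, X_FLOAT, 1.0F, -1, NULL);   /* data setup omitted */
    XTensor b(2, dims, X_FLOAT, 1.0F, -1, NULL);
    XTensor index(2, dims, X_INT, 1.0F, -1, NULL); /* must be X_INT */

    /* sort along dimension 0; b receives the sorted values and
       index the positions they came from */
    Sort(a, b, index, 0);
}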
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __SORT_CUH__
#define __SORT_CUH__
#include "Sort.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* sort the tensor along a given dimension */
void _CudaSortBig(const XTensor * a, XTensor * b, XTensor * indexA, XTensor * indexB, int dim, int k = -1);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __SORT_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __SORT_H__
#define __SORT_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* sort the data along a given dimension */
void _Sort(const XTensor * a, XTensor * b, XTensor * index, int dim);
/*
sort the data along a given dimension (do it in place)
keep the result in the input tensor a and return nothing
*/
void _SortMe(XTensor * a, XTensor * index, int dim);
/*
sort the data along a given dimension (put the results in the output tensors)
the input tensor is kept unchanged and tensor connections are built for the outputs
*/
void Sort(XTensor & a, XTensor & b, XTensor & index, int dim);
} // namespace nts(NiuTrans.Tensor)
#endif // __SORT_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "TopK.h"
#include "TopK.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
get the top-k items along a given dimension
>> a - input tensor
>> b - output tensor (top-k result)
>> index - index of the top-k items
>> dim - the dimension along which the sorting is performed
>> k - how many items returned after sorting
*/
void _TopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
{
CheckNTErrors((a->unitSize == b->unitSize), "Unmatched input tensors!");
CheckNTErrors((a->order == b->order), "Unmatched input tensors!");
CheckNTErrors((index == NULL || a->order == index->order), "Unmatched input tensors!");
CheckNTErrors((index == NULL || index->dataType == X_INT), "Wrong data type!");
int dimRDI = a->order - dim - 1;
for (int i = 0; i < a->order; i++) {
if (i == dimRDI) {
CheckNTErrors((b->dimSizeRDI[i] == k), "k is too large!");
CheckNTErrors((index == NULL || index->dimSizeRDI[i] == k), "Wrong size!");
}
else {
CheckNTErrors((b->dimSizeRDI[i] == a->dimSizeRDI[i]), "Wrong size!");
CheckNTErrors((index == NULL || index->dimSizeRDI[i] == a->dimSizeRDI[i]), "Wrong size!");
}
}
if (a->devID >= 0 || b->devID >= 0) {
#ifdef USE_CUDA
_CudaTopK(a, b, index, dim, k);
#else
ShowNTErrors("Plesae specify USE_CUDA and recompile the code!");
#endif
}
else {
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
int stride = 1;
int strideNumA = a->dimSizeRDI[dimRDI];
int strideNumB = b->dimSizeRDI[dimRDI];
for (int i = 0; i < dimRDI; i++)
stride *= a->dimSizeRDI[i];
int blockNum = 1;
for (int i = dimRDI + 1; i < a->order; i++)
blockNum *= a->dimSizeRDI[i];
int blockSizeA = stride * strideNumA;
int blockSizeB = stride * strideNumB;
XHeap<MIN_HEAP, DTYPE> heap(k);
for (int h = 0; h < blockNum; h++) {
for (int i = 0; i < stride; i++) {
DTYPE * dataA = (DTYPE*)a->data + (h * blockSizeA + i);
DTYPE * dataB = (DTYPE*)b->data + (h * blockSizeB + i);
int * indexData = (int*)index->data + (h * blockSizeB + i);
/* initialize the heap */
heap.Clear(DTYPE_MIN);
for (int j = 0; j < blockSizeA; j += stride) {
if (heap.count < heap.size) {
heap.Push(HeapNode<DTYPE>(j / stride, dataA[j]));
}
else {
if (dataA[j] > heap.Top().value)
heap.ReplaceTop(HeapNode<DTYPE>(j / stride, dataA[j]));
}
}
for (int j = strideNumA >= k ? k - 1 : strideNumA - 1; j >= 0; j--) {
HeapNode<DTYPE> node = heap.Pop();
dataB[j * stride] = node.value;
indexData[j * stride] = node.index;
}
}
}
}
}
/*
get the top-k items along a given dimension
>> a - input tensor
>> b - output tensor (top-k result)
>> index - index of the top-k items
>> dim - the dimension along which the sorting is performed
>> k - how many items returned after sorting
*/
void TopK(XTensor &a, XTensor &b, XTensor &index, int dim, int k)
{
_TopK(&a, &b, &index, dim, k);
/* tensor connection */
XList list(2);
list.Add(&b);
list.Add(&index);
XLink::MakeLink(&a, &list, SORT_TOPK);
XLink::AddParamToHeadInt(&b, dim);
XLink::AddParamToHeadInt(&index, k);
}
} // namespace nts(NiuTrans.Tensor)
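A hedged usage sketch of TopK; per the checks in _TopK above, b and index must have size k along dim and match a elsewhere. Include path, CPU device, and the omitted data setup are assumptions:
#include "TopK.h"

using namespace nts;

void TopKDemo()
{
    const int k = 2;
    int dimsA[2] = {4, 5};
    int dimsB[2] = {4, k};  /* size k along the sorted dimension */
    XTensor a(2, dimsA, X_FLOAT, 1.0F, -1, NULL); /* data setup omitted */
    XTensor b(2, dimsB, X_FLOAT, 1.0F, -1, NULL);
    XTensor index(2, dimsB, X_INT, 1.0F, -1, NULL);

    /* keep the k largest entries along dimension 1, together with
       the positions they came from */
    TopK(a, b, index, 1, k);
}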
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __TOPK_CUH__
#define __TOPK_CUH__
#include "TopK.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* get the top-k items along a given dimension */
void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __TOPK_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __TOPK_H__
#define __TOPK_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* get the top-k items along a given dimension */
void _TopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k);
/* get the top-k items along a given dimension */
void TopK(XTensor &a, XTensor &b, XTensor &index, int dim, int k);
} // namespace nts(NiuTrans.Tensor)
#endif // __TOPK_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-06-22
*/
#include "../../XUtility.h"
#include "FlushToMem.h"
#include "FlushToMem.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
flush a list of XTensor to GPU memory
>> mList - list of the tensors
>> devID - target GPU id
>> GPUMem - memory pool for the GPU
*/
void CPUToGPUFlush(XList * mList, int devID, XMem * GPUMem)
{
#ifdef USE_CUDA
CudaCPUToGPUFlush(mList, devID, GPUMem);
#endif
}
/* copy the data from GPU memory to CPU memory */
void GPUToCPUFlush(XTensor * tensor)
{
#ifdef USE_CUDA
CudaGPUToCPUFlush(tensor);
#endif
}
} // namespace nts(NiuTrans.Tensor)
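A hedged usage sketch of the two flush helpers, assuming a CUDA build, a tensor whose data has already been allocated and filled on the CPU, and an illustrative GPU id of 0:
#include "FlushToMem.h"

using namespace nts;

void FlushDemo()
{
    int dims[2] = {2, 3};
    XTensor a(2, dims, X_FLOAT, 1.0F, -1, NULL); /* a CPU tensor with data set */

    /* move a list of CPU tensors to GPU 0; passing NULL as the pool
       makes the helper allocate the GPU block with XMemAlloc */
    XList list(1);
    list.Add(&a);
    CPUToGPUFlush(&list, 0, NULL);

    /* bring the data back; the result lands in a.dataHost */
    GPUToCPUFlush(&a);
}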
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-06-14
*/
#include "FlushToMem.cuh"
#include "../../XUtility.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
flush a list of XTensor to GPU memory
>> mList - list of the tensors
>> devID - target GPU id
>> GPUMem - memory pool for the GPU
*/
void CudaCPUToGPUFlush(XList * mList, int devID, XMem * GPUMem)
{
if (mList == NULL || mList->count == 0)
return;
#ifdef USE_CUDA
int size = 0, p = 0;
int requiredSize = 0;
/* compute the required memory size */
for (int i = 0; i < mList->count; i++) {
XTensor * m = (XTensor*)mList->GetItem(i);
CheckNTErrors((m->devID < 0), "Cannot do gpu-flush on matrices that are already on GPUs.");
if (m->isSparse)
requiredSize = sizeof(int) + (sizeof(int) + m->unitSize) * m->unitNumNonZero;
else
reqiredSize = m->unitSize * m->unitNum;
size += reqiredSize;
}
char * data = new char[size];
char * GPUData = GPUMem != NULL ? (char*)GPUMem->Alloc(GPUMem->devID, size):
(char*)XMemAlloc(devID, size);
int pSize = 0;
/* place the data in a memory block */
for (int i = 0; i < mList->count; i++) {
XTensor * m = (XTensor*)mList->GetItem(i);
if (m->isSparse)
pSize = sizeof(int) + (sizeof(int) + m->unitSize) * m->unitNumNonZero;
else
pSize = m->unitSize * m->unitNum;
requiredSize = pSize;
memcpy(data + p, m->data, pSize);
if (m->dataHost != NULL)
delete[](char*)m->dataHost;
if(m->mem == NULL)
delete[] (char*)m->data;
m->dataHost = NULL;
m->data = GPUData + p;
m->devID = GPUMem != NULL ? GPUMem->devID : devID;
m->mem = GPUMem;
p += requiredSize;
}
/* copy from CPU memory to GPU memory */
cudaMemcpy(GPUData, data, size, cudaMemcpyHostToDevice);
delete[] data;
#endif
}
/* copy the data from GPU memory to CPU memory */
void CudaGPUToCPUFlush(XTensor * tensor)
{
CheckNTErrors((sizeof(DTYPE) == tensor->unitSize), "Unsupported data type.");
if (tensor->dataHost != NULL)
delete[](char*)tensor->dataHost;
if (tensor->isSparse) {
int num = int(tensor->unitNum * tensor->denseRatio + 1);
cudaMemcpy(&num, (DTYPE*)tensor->data, sizeof(int), cudaMemcpyDeviceToHost);
int tupleSize = sizeof(int) + sizeof(DTYPE);
int size = sizeof(int) + tupleSize*(num);
CheckNTErrors((size >= 0), "Illegal data size in the sparse matrix!");
tensor->dataHost = new char[size];
cudaMemcpy(tensor->dataHost, tensor->data, size, cudaMemcpyDeviceToHost);
}
else {
tensor->dataHost = new char[tensor->unitNum * tensor->unitSize];
if (tensor->data != NULL)
XMemCopy(tensor->dataHost, -1, tensor->data, tensor->devID, tensor->unitNum * tensor->unitSize);
else
memset(tensor->dataHost, 0, tensor->unitNum * tensor->unitSize);
}
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-06-14
*/
#ifndef __FLUSHTOMEM_CUH__
#define __FLUSHTOMEM_CUH__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* flush a list of XTensor to GPU memory */
void CudaCPUToGPUFlush(XList * mList, int devID, XMem * GPUMem);
/* copy the data from GPU memory to CPU memory */
void CudaGPUToCPUFlush(XTensor * tensor);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __FLUSHTOMEM_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-06-22
*/
#ifndef __FLUSHTOMEM_H__
#define __FLUSHTOMEM_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* flush a list of XTensor to GPU memory */
void CPUToGPUFlush(XList * mList, int devID, XMem * GPUMem);
/* copy the data from GPU memory to CPU memory */
void GPUToCPUFlush(XTensor * tensor);
} // namespace nts(NiuTrans.Tensor)
#endif // __FLUSHTOMEM_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-06-14
*/
#include "SetAscendingOrder.cuh"
#include "../../XDevice.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set the cell to the ascending order along a given dimension (kernel code)
>> data - the data array
>> stride - how many items we go over when moving to the next item along the dimension
>> strideNum - size of the given dimension
>> blockNum - block number
*/
__global__
void KernelSetAscendingOrder(int * data, int stride, int strideNum, int blockNum)
{
__shared__ int iBlock[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ int iOffset[MAX_CUDA_THREAD_NUM_PER_BLOCK];
/* index along the "stride" dimension */
int i = blockDim.x * blockIdx.x + threadIdx.x;
/* index along the leading dimension */
int j = blockDim.y * blockIdx.y + threadIdx.y;
if(i >= stride * blockNum || j >= strideNum)
return;
if(threadIdx.y == 0){
iBlock[threadIdx.x] = i / stride;
iOffset[threadIdx.x] = i % stride;
}
__syncthreads();
int * d = (int*)data + (iBlock[threadIdx.x] * strideNum + j) * stride + iOffset[threadIdx.x];
*d = j;
}
/*
set the cell to the ascending order along a given dimension
>> a - the tensor
>> dim - the dimension
*/
void CudaSetAscendingOrder(XTensor * a, int dim)
{
CheckNTErrors((a->dataType == X_INT), "TODO!");
int dimRDI = a->order - dim - 1;
int stride = 1;
int strideNum = a->dimSizeRDI[dimRDI];
for(int i = 0; i < dimRDI; i++)
stride *= a->dimSizeRDI[i];
int blockNum = 1;
for(int i = dimRDI + 1; i < a->order; i++)
blockNum *= a->dimSizeRDI[i];
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread2D(a->devID, strideNum, stride * blockNum, MAX_INT, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
KernelSetAscendingOrder<<<dim3(gridSize[1], gridSize[0]), dim3(blockSize[1], blockSize[0])>>>
((int*)a->data, stride, strideNum, blockNum);
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
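/* CPU sketch (not part of the library) of the indexing that
   KernelSetAscendingOrder above performs: the tensor is viewed as a
   (blockNum, strideNum, stride) array in memory, and each cell gets its
   coordinate j along the chosen dimension. This mirrors the kernel's address
   computation (iBlock * strideNum + j) * stride + iOffset. */
static void SetAscendingOrderSketch(int * data, int stride, int strideNum, int blockNum)
{
    for (int iBlock = 0; iBlock < blockNum; iBlock++)
        for (int j = 0; j < strideNum; j++)
            for (int iOffset = 0; iOffset < stride; iOffset++)
                data[(iBlock * strideNum + j) * stride + iOffset] = j;
}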
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-06-14
*/
#ifndef __SETASCENDINGORDER_CUH__
#define __SETASCENDINGORDER_CUH__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set the cell to the ascending order along a given dimension */
void CudaSetAscendingOrder(XTensor * a, int dim);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __SETASCENDINGORDER_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include <stdarg.h>
#include <math.h>
#include "XMatrixSegment.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
segment a 2d tensor (i.e., matrix) into blocks and run jobs in parallel
>> parallelRunner - parallel runner
>> job - the function to run
>> opNum - number of operations
>> rowNum - number of rows
>> colNum - number of columns
>> argNum - number of arguments of the jobs
>> ... - arguments of the jobs
*/
void RunParallel2D(XPRunner * parallelRunner, void * job,
int opNum, int rowNum, int colNum, int argNum, ...)
{
if (rowNum == 0 || colNum == 0)
return;
int jobNum = 1;
if (parallelRunner != NULL && (parallelRunner->method == PRUNNER_SINGLE || parallelRunner->method == PRUNNER_MULTIPLE)) {
if (opNum >= parallelRunner->minimumOPNum * parallelRunner->threadNum)
jobNum = parallelRunner->GetJobNum(rowNum * colNum);
}
CheckNTErrors(jobNum != 0, "TODO!");
/* argument list of the jobs */
XList * jobArgList = new XList(4);
va_list ap;
va_start(ap, argNum);
for (int i = 0; i < argNum; i++) {
void * p = va_arg(ap, void*);
jobArgList->Add(p);
}
va_end(ap);
/* prepare the necessary argument list for parallel processing */
XList * jobs = new XList(jobNum);
XList * args = new XList(jobNum);
int * indexList = new int[jobNum * 4 * 4];
/* segment the matrix into blocks */
int nblock = SegmentTensor2D(rowNum, colNum, jobNum, indexList);
/*
assign jobs
argument rules:
1. block information
2. other arguments
*/
for (int i = 0; i < jobNum; i++) {
XList * blockArgs = new XList(argNum + 4);
int * blockIndex = indexList + i * 4;
blockArgs->Add(blockIndex);
blockArgs->Add(blockIndex + 1);
blockArgs->Add(blockIndex + 2);
blockArgs->Add(blockIndex + 3);
for (int j = 0; j < argNum; j++)
blockArgs->Add(jobArgList->GetItem(j));
args->Add(blockArgs);
jobs->Add((void*)job);
}
/* shrink to the blocks actually created; free the extra argument lists to avoid a leak */
for (int i = nblock; i < args->count; i++)
delete (XList*)args->GetItem(i);
args->count = nblock;
jobs->count = nblock;
/* single job */
if (jobNum == 1)
((TFunction)job)((XList*)args->GetItem(0));
/* multiple jobs */
else
parallelRunner->Run(jobs, args);
/* free the memory */
delete[] indexList;
for (int i = 0; i < args->count; i++) {
XList * blockArgs = (XList*)args->GetItem(i);
delete blockArgs;
}
delete args;
delete jobs;
delete jobArgList;
}
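/* Illustrative job (not used anywhere in the library) that follows the
   argument rules documented above: items 0..3 of the argument list point to
   the block corners (x1, y1, x2, y2), and the user-supplied arguments follow
   from item 4 on. The job name and its two user arguments are made up for
   the example. */
static void ExampleBlockJob(XList * args)
{
    int x1 = *(int*)args->GetItem(0);            /* upper-left row */
    int y1 = *(int*)args->GetItem(1);            /* upper-left column */
    int x2 = *(int*)args->GetItem(2);            /* bottom-right row */
    int y2 = *(int*)args->GetItem(3);            /* bottom-right column */
    float * matrix = (float*)args->GetItem(4);   /* user argument #1: the data */
    int colNum = *(int*)args->GetItem(5);        /* user argument #2: row width */
    for (int i = x1; i <= x2; i++)               /* zero out the assigned block */
        for (int j = y1; j <= y2; j++)
            matrix[i * colNum + j] = 0;
}
/* it would be launched as:
   RunParallel2D(runner, (void*)ExampleBlockJob, opNum, rowNum, colNum, 2, matrix, &colNum); */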
/*
segment a block into sub-blocks
>> rowNum - number of rows
>> colNum - number of columns
>> blockNum - number of sub-blocks
>> blockIndex - upper-left and bottom-right corners of each sub-block
<< return - the number of resulting sub-blocks
*/
int SegmentTensor2D(int rowNum, int colNum, int blockNum, int * blockIndex)
{
int total = rowNum * colNum;
int rowSize = (int)ceil(sqrt((float)total / blockNum));
int colSize = rowSize;
/* a narrow matrix */
if (rowSize > colNum * 0.9) {
rowSize = colNum;
colSize = (int)ceil((float)rowNum / blockNum);
}
/* a short and wide matrix */
if (colSize > rowNum * 0.9) {
colSize = rowNum;
rowSize = (int)ceil((float)colNum / blockNum);
}
if (blockNum == 1) {
colSize = rowNum;
rowSize = colNum;
}
CheckNTErrors((colSize <= rowNum && rowSize <= colNum),
"Too large block!");
int x1, y1, x2, y2;
int xMax = rowNum - 1;
int yMax = colNum - 1;
int nblock = 0, nitem = 0;
int * indexList = blockIndex;
int xSegNum = int((float)rowNum / colSize);
int ySegNum = int((float)colNum / rowSize);
int marginBlockNum = blockNum - xSegNum * ySegNum;
/*
To maximize the number of resulting sub-blocks, we have to
make use of the margin block
*/
if (blockNum > 1 && marginBlockNum > 0) {
int margin = 0;
int step = 0;
if (rowNum < colNum) {
margin = int(((float)marginBlockNum / blockNum) * colNum);
step = (int)ceil((float)rowNum / marginBlockNum);
x1 = 0;
y1 = yMax - margin + 1;
x2 = step - 1;
y2 = yMax;
while (x2 <= xMax) {
int * blockIndex = indexList + nblock * 4;
blockIndex[0] = x1; blockIndex[1] = y1;
blockIndex[2] = x2; blockIndex[3] = y2;
nblock++;
nitem += (y2 - y1 + 1) * (x2 - x1 + 1);
if (x2 == xMax)
break;
x1 = x2 + 1;
x2 = x1 + step - 1;
if (x2 > xMax)
x2 = xMax;
}
yMax -= margin;
}
else {
margin = int(((float)marginBlockNum / blockNum) * rowNum);
step = (int)ceil((float)colNum / marginBlockNum);
x1 = xMax - margin + 1;
y1 = 0;
x2 = xMax;
y2 = step - 1;
while (y2 <= yMax) {
int * blockIndex = indexList + nblock * 4;
blockIndex[0] = x1; blockIndex[1] = y1;
blockIndex[2] = x2; blockIndex[3] = y2;
nblock++;
nitem += (y2 - y1 + 1) * (x2 - x1 + 1);
if (y2 == yMax)
break;
y1 = y2 + 1;
y2 = y1 + step - 1;
if (y2 > yMax)
y2 = yMax;
}
xMax -= margin;
}
colSize = (int)ceil((float)(xMax + 1) / xSegNum);
rowSize = (int)ceil((float)(yMax + 1) / ySegNum);
}
x1 = 0;
y1 = 0; // upper-left corner
x2 = colSize - 1;
y2 = rowSize - 1; // bottom-right corner
/* the main body of the matrix (after removing the margin block) */
while (x1 <= xMax) {
y1 = 0;
x2 = x1 + colSize - 1;
y2 = y1 + rowSize - 1;
if (x2 > xMax) {
x2 = xMax;
}
while (y2 <= yMax) {
int * blockIndex = indexList + nblock * 4;
blockIndex[0] = x1; blockIndex[1] = y1;
blockIndex[2] = x2; blockIndex[3] = y2;
nblock++;
nitem += (y2 - y1 + 1) * (x2 - x1 + 1);
if (y2 == yMax)
break;
y1 = y2 + 1;
y2 = y1 + rowSize - 1;
if (y2 > yMax)
y2 = yMax;
CheckNTErrors((nblock <= blockNum),
"Fail to segment the matrix!");
}
x1 = x2 + 1;
}
CheckNTErrors(nitem == rowNum * colNum,
"Fail to segment the matrix!");
return nblock;
}
/*
segment a block into sub-blocks (each block consists of a number of rows)
>> rowNum - number of rows
>> colNum - number of columns
>> blockNum - number of sub-blocks
>> blockIndex - upper-left and bottom-right corners of each sub-block
<< return - the number of resulting sub-blocks
*/
int SegmentTensor2DInRows(int rowNum, int colNum, int blockNum, int * blockIndex)
{
if (rowNum < blockNum) {
blockIndex[0] = 0;
blockIndex[1] = 0;
blockIndex[2] = rowNum - 1;
blockIndex[3] = colNum - 1;
return 1;
}
int segSize = (int)ceil((float)rowNum / blockNum);
int x1 = 0;
int x2 = x1 + segSize - 1;
int y1 = 0;
int y2 = colNum - 1;
int last = rowNum - 1;
int nblock = 0;
while (x1 <= last) {
x2 = x1 + segSize - 1;
if (x2 > last) {
x2 = last;
}
int * blockInfo = blockIndex + 4 * nblock;
blockInfo[0] = x1;
blockInfo[1] = y1;
blockInfo[2] = x2;
blockInfo[3] = y2;
nblock++;
if (x2 == last)
break;
x1 += segSize;
}
return nblock;
}
} // namespace nts(NiuTrans.Tensor)
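/* Worked example (illustrative, not part of the library) for
   SegmentTensor2DInRows above: segmenting a 10x4 matrix into 3 row blocks.
   segSize = ceil(10/3) = 4, so the expected corners are (0,0)-(3,3),
   (4,0)-(7,3) and (8,0)-(9,3); the last block takes the remaining rows. */
#include <cstdio>

static void SegmentTensor2DInRowsExample()
{
    int corners[3 * 4];
    int n = nts::SegmentTensor2DInRows(10, 4, 3, corners);
    for (int i = 0; i < n; i++)
        printf("block %d: (%d,%d)-(%d,%d)\n", i,
               corners[i * 4], corners[i * 4 + 1],
               corners[i * 4 + 2], corners[i * 4 + 3]);
}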
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __XMATRIXSEGMENT_H__
#define __XMATRIXSEGMENT_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* segmentation and parallel processing for 2d tensors (i.e., matrices) */
/* segment a 2d tensor (i.e., matrix) into blocks and run jobs in parallel */
void RunParallel2D(XPRunner * parallelRunner, void * job, int opNum, int rowNum, int colNum, int argNum, ...);
/* segment a block into sub-blocks */
int SegmentTensor2D(int rowNum, int colNum, int blockNum, int * blockIndex);
/* segment a block into sub-blocks (each block consists of a number of rows) */
int SegmentTensor2DInRows(int rowNum, int colNum, int blockNum, int * blockIndex);
} // namespace nts(NiuTrans.Tensor)
#endif // __XMATRIXSEGMENT_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
/* this is a header to include all functions in the "function" workspace */
#ifndef __FHEADER_H__
#define __FHEADER_H__
#include "../XTensor.h"
#include "HardTanH.h"
#include "Identity.h"
#include "LogSoftmax.h"
#include "Loss.h"
#include "Rectify.h"
#include "Sigmoid.h"
#include "Softmax.h"
#endif // __FHEADER_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#include <stdlib.h>
#include "../XName.h"
#include "HardTanH.h"
#include "HardTanH.cuh"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
hard tanh function
y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
>> x - input tensor
>> y - result
*/
void _HardTanH(const XTensor * x, XTensor * y)
{
#ifdef USE_CUDA
if(x->devID >= 0 || y->devID >= 0){
_CudaHardTanH(x, y);
return;
}
#endif
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
int n = x->GetSize();
DTYPE * ip = (DTYPE*)x->data;
DTYPE * op = (DTYPE*)y->data;
for(int i = 0; i < n; i++){
DTYPE p = ip[i];
if(p > 1.0)
p = 1.0;
else if(p < -1.0)
p = -1.0;
op[i] = p;
}
}
else
ShowNTErrors("TODO!");
}
/*
hard tanh function (return a XTensor structure)
make a new tensor to keep the result and return it
y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
>> x - input tensor
<< return - y
*/
XTensor HardTanH(const XTensor &x)
{
XTensor y(&x);
y.SetTMP();
/* call _HardTanH function */
_HardTanH(&x, &y);
/* tensor connection */
XLink::MakeLink(&x, NULL, &y, FUNC_HARDTANH);
return y;
}
/*
backward computation
dE/dx = dE/dy * dy/dx
hard tanh: y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
and dy/dx = 1 if -1 <= x <= 1
0 otherwise
>> gold - gold standard to measure error (or loss)
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lossName - type of loss function, e.g., cross entropy
*/
void _HardTanHBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName)
{
CheckNTErrors((gold == NULL || XTensor::IsSameShaped(gold, y)),
"The tensors must be of the same size!");
#ifdef USE_CUDA
if(x->devID >= 0 || y->devID >= 0){
_CudaHardTanHBackward(gold, y, x, dedy, dedx, lossName);
return;
}
#endif
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
/* calculate dE/dy */
if(lossName != NOLOSS)
_LossBackward(dedy, gold, y, lossName);
DTYPE * dedyp = (DTYPE*)dedy->data;
DTYPE * dedxp = (DTYPE*)dedx->data;
DTYPE * ip = (DTYPE*)x->data;
int size = y->unitNum;
/* dE/dx = dE/dy * dy/dx */
for(int i = 0; i < size; i++){
DTYPE s = ip[i];
if(s > 1.0 || s < -1.0)
dedxp[i] = 0;
else
dedxp[i] = dedyp[i];
}
}
else
ShowNTErrors("TODO!");
}
} // namespace nts(NiuTrans.Tensor)
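/* Self-contained numeric illustration (not part of the library) of the
   piecewise definition implemented by _HardTanH/_HardTanHBackward above:
   outside [-1, 1] the output saturates and the gradient is 0; inside, the
   function is the identity and the gradient passes through unchanged. */
#include <cstdio>

static void HardTanHSketch()
{
    float xs[4] = { -2.0F, -0.5F, 0.5F, 2.0F };
    for (int i = 0; i < 4; i++) {
        float x = xs[i];
        float y = x > 1.0F ? 1.0F : (x < -1.0F ? -1.0F : x);   /* forward */
        float dydx = (x >= -1.0F && x <= 1.0F) ? 1.0F : 0.0F;  /* dy/dx */
        printf("x = %5.2f  y = %5.2f  dy/dx = %.0f\n", x, y, dydx);
    }
}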
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#include "HardTanH.h"
#include "HardTanH.cuh"
#include "Loss.cuh"
#include "../XDevice.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
hard tanh forward computation (Cuda kernel)
y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
>> x - input data array
>> y - output data array
>> size - size of input/output
*/
__global__
void KernelHardtanhCompute(DTYPE * x, DTYPE * y, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
DTYPE p = x[i];
if(p > (DTYPE)1.0)
p = (DTYPE)1.0;
else if(p < (DTYPE)-1.0)
p = (DTYPE)-1.0;
y[i] = p;
}
}
/*
hard tanh forward computation (Cuda version)
y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
>> x - input tensor
>> y - output tensor
*/
void _CudaHardTanH(const XTensor * x, XTensor * y)
{
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
CheckNTErrors(!x->isSparse && !y->isSparse, "The hard tanh activation function does not support sparse tensors.");
CheckNTErrors(x->unitNum == y->unitNum, "The input and output tensors must be of the same length.");
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
KernelHardtanhCompute<<<dim3(gridSize[0]), dim3(blockSize[0])>>>((DTYPE*)x->data, (DTYPE*)y->data, x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
}
else{
ShowNTErrors("TODO!");
}
}
/*
hard tanh backward computation of dE/dx (Cuda kernel)
dy/dx = 1 if -1 <= x <= 1
0 otherwise
>> dedy - dE/dy
>> dedx - dE/dx
>> gold - gold standard
>> y - y of the function
>> x - x of the function
>> size - size of y/x
*/
__global__
void KernelHardtanhBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * gold, DTYPE * y, DTYPE * x, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
DTYPE s = x[i];
if(s > (DTYPE)1.0 || s < (DTYPE)-1.0)
dedx[i] = 0;
else
dedx[i] = dedy[i];
}
}
/*
backward computation (Cuda version)
dE/dx = dE/dy * dy/dx
hard tanh: y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
and dy/dx = 1 if -1 <= x <= 1
0 otherwise
>> gold - gold standard to measure error (or loss)
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lossName - type of loss function, e.g., cross entropy
*/
void _CudaHardTanHBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName)
{
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
/* calculate dE/dy */
if(lossName != NOLOSS)
_LossBackward(dedy, gold, y, lossName);
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
/* dE/dx = dE/dy * dy/dx */
KernelHardtanhBackward<<<dim3(gridSize[0]),dim3(blockSize[0])>>>
((DTYPE*)dedy->data,
(DTYPE*)dedx->data,
gold == NULL ? NULL : (DTYPE*)gold->data,
(DTYPE*)y->data, (DTYPE*)x->data,
x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
}
else
ShowNTErrors("TODO!");
}
#endif
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#ifndef __HARDTANH_CUH__
#define __HARDTANH_CUH__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
hard tanh function
y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
*/
void _CudaHardTanH(const XTensor * input, XTensor * output);
/* de/dx (Cuda version) */
void _CudaHardTanHBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __HARDTANH_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#ifndef __HARDTANH_H__
#define __HARDTANH_H__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
#define HTanH HardTanH
/*
hard tanh function
y = 1 if x > 1
x if -1 <= x <= 1
-1 if x < -1
*/
void _HardTanH(const XTensor * x, XTensor * y);
/* hard tanh function (return a XTensor structure) */
XTensor HardTanH(const XTensor &x);
/* de/dx */
void _HardTanHBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName);
} // namespace nts(NiuTrans.Tensor)
#endif // __HARDTANH_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-27
*/
#include "../XName.h"
#include "Identity.h"
#include "../XUtility.h"
#include "../core/movement/CopyValues.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
identity function y = x
>> x - input tensor
>> y - result
*/
void _Identity(const XTensor * x, XTensor * y)
{
_CopyValues(x, y);
}
/*
identity function y = x (return a XTensor structure)
make a new tensor to keep the result and return it
>> x - input tensor
<< return - y
*/
XTensor Identity(const XTensor &x)
{
XTensor y(&x);
y.SetTMP();
/* call _Identity function */
_Identity(&x, &y);
/* tensor connection */
XLink::MakeLink(&x, NULL, &y, FUNC_IDENTITY);
return y;
}
/*
backward computation for identity function y = x
dE/dx = dE/dy * dy/dx = dE/dy
>> gold - gold standard to measure error (or loss)
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lossName - type of loss function, e.g., cross entropy
*/
void _IdentityBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName)
{
CheckNTErrors((gold == NULL || XTensor::IsSameShaped(gold, y)),
"The tensors must be of the same size!");
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
{
/* calculate dE/dy */
if(lossName != NOLOSS)
_LossBackward(dedy, gold, y, lossName);
if(dedy->data != dedx->data)
_CopyValues(dedy, dedx);
}
else
ShowNTErrors("TODO!");
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-27
*/
#ifndef __IDENTITY_H__
#define __IDENTITY_H__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/* identity function y = x */
void _Identity(const XTensor * x, XTensor * y);
/* identity function y = x (return a XTensor structure) */
XTensor Identity(const XTensor &x);
/* de/dx */
void _IdentityBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName);
} // namespace nts(NiuTrans.Tensor)
#endif // __IDENTITY_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-26
*/
#ifndef __LOGSOFTMAX_CUH__
#define __LOGSOFTMAX_CUH__
#include "../XTensor.h"
#include "Loss.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (Cuda version) */
void _CudaLogSoftmax(const XTensor * input, XTensor * output, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (Cuda version) */
void _CudaLogSoftmaxSumMax(XTensor * x, XTensor * y, int leadDim, XTensor * sum, XTensor * max);
/* de/dx (Cuda version) */
void _CudaLogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
int leadDim,
LOSS_FUNCTION_NAME lossName);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __LOGSOFTMAX_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#ifndef __LOGSOFTMAX_H__
#define __LOGSOFTMAX_H__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) */
void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (return a XTensor structure) */
XTensor LogSoftmax(const XTensor &x, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (with both arguments x and y) */
void LogSoftmax(const XTensor &x, XTensor &y, int leadDim);
/* de/dx */
void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
int leadDim,
LOSS_FUNCTION_NAME lossName);
} // namespace nts(NiuTrans.Tensor)
#endif // __LOGSOFTMAX_H__
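/* Self-contained sketch (not part of the library) of the formula declared
   above in its numerically stable form: since log(e^x / \sum e^{x_i}) =
   (x - m) - log(\sum e^{x_i - m}) for any m, choosing m = max_i x_i avoids
   overflow in the exponentials. The sum/max arguments of
   _CudaLogSoftmaxSumMax in LogSoftmax.cuh suggest a similar shifted
   computation, but this sketch only illustrates the math. */
#include <cmath>

static void LogSoftmaxSketch(const float * x, float * y, int n)
{
    float m = x[0];
    for (int i = 1; i < n; i++)        /* m = max_i x_i */
        if (x[i] > m)
            m = x[i];
    float sum = 0;
    for (int i = 0; i < n; i++)
        sum += expf(x[i] - m);
    for (int i = 0; i < n; i++)
        y[i] = (x[i] - m) - logf(sum);
}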
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __LOSS_CUH__
#define __LOSS_CUH__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* compute the loss (cuda version) */
DTYPE _CudaLossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName,
bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg);
/* compute the loss in log scale (cuda version) */
DTYPE _CudaLossComputeForLogScale(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName,
int leadDim, int gBeg, int gLen, int oBeg);
/* backward computation for a single element (cuda version) */
DTYPE _CudaLossBackwardPoint(DTYPE t, DTYPE y, LOSS_FUNCTION_NAME LFName);
/* backward computation for (dense) vectors (cuda version) */
void _CudaLossBackward(XTensor * dedy, XTensor * t, XTensor * y,
LOSS_FUNCTION_NAME LFName,
int leadDim = -1, int tBeg = 0, int tLen = -1, int yBeg = 0);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __LOSS_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
* Good start of the new project - but full of meetings today.
*/
#ifndef __LOSS_H__
#define __LOSS_H__
#include "../XTensor.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
loss function name, e.g., crossentropy.
*/
enum LOSS_FUNCTION_NAME {
NOLOSS,
SQUAREDERROR, // loss = sum_{i} 0.5*(t_i - y_i)^2, where t_i is the gold standard and y_i is the model output
// dloss/dy_i = y_i - t_i
// it is actually a squared Euclidean distance
CROSSENTROPY, // loss = sum_{i} (-t_i * log(y_i)), where t and y are distributions
// dloss/dy_i = -t_i / y_i
ONEHOTERROR // loss = sum_{i} e_i
// where e_i = 0.5*(t_i - y_i)^2 if t_i = 1,
// e_i = 0 otherwise
};
/*
loss function to measure the "number" of errors
*/
/* compute the loss */
DTYPE _LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName,
bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg);
/* compute the loss (log version) */
DTYPE _LossComputeForLogScale(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName,
int leadDim, int gBeg, int gLen, int oBeg);
/* backward computation for a single element */
DTYPE _LossBackwardPoint(DTYPE t, DTYPE y, LOSS_FUNCTION_NAME LFName);
/* backward computation for (dense) vectors */
void _LossBackward(XTensor * dEdY, XTensor * t, XTensor * y,
LOSS_FUNCTION_NAME LFName,
int leadDim = -1, int tBeg = 0, int tLen = -1, int yBeg = 0);
} // namespace nts(NiuTrans.Tensor)
#endif // __LOSS_H__
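/* Worked numeric example of the formulas documented in LOSS_FUNCTION_NAME
   above (illustrative only). With gold t = (1, 0) and output y = (0.8, 0.2):
   squared error = 0.5*(1-0.8)^2 + 0.5*(0-0.2)^2 = 0.04, with gradient
   dloss/dy = y - t = (-0.2, 0.2); cross entropy = -1*log(0.8) - 0*log(0.2)
   ~= 0.223, with gradient dloss/dy_i = -t_i / y_i = (-1.25, 0). */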
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../XName.h"
#include "Rectify.h"
#include "Rectify.cuh"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
rectify function y = max(0, x)
>> x - input tensor
>> y - result
*/
void _Rectify(const XTensor * x, XTensor * y)
{
#ifdef USE_CUDA
if(x->devID >= 0 || y->devID >= 0){
_CudaRectify(x, y);
return;
}
#endif
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
DTYPE * ip = (DTYPE*)x->data;
DTYPE * op = (DTYPE*)y->data;
int n = x->GetSize();
for(int i = 0; i < n; i++){
DTYPE p = ip[i];
if(p < 0)
p = 0;
op[i] = p;
}
}
else
ShowNTErrors("TODO!");
}
/*
rectify function y = max(0, x) (return a XTensor structure)
make a new tensor to keep the result and return it
>> x - input tensor
<< return - y
*/
XTensor Rectify(const XTensor &x)
{
XTensor y(&x);
y.SetTMP();
/* call _Rectify function */
_Rectify(&x, &y);
/* tensor connection */
XLink::MakeLink(&x, NULL, &y, FUNC_RECTIFY);
return y;
}
/*
backward computation
dE/dx = dE/dy * dy/dx
rectified: y = max(0, x)
or
rectified: y = 0 if x < 0
x otherwise
and dy/dx = 0 if x < 0
1 otherwise
>> gold - gold standard to measure error (or loss)
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lossName - type of loss function, e.g., cross entropy
*/
void _RectifyBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName)
{
CheckNTErrors((gold == NULL || XTensor::IsSameShaped(gold, y)),
"The tensors must be of the same size!");
#ifdef USE_CUDA
if(x->devID >= 0 || y->devID >= 0){
_CudaRectifyBackward(gold, y, x, dedy, dedx, lossName);
return;
}
#endif
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
{
/* calculate dE/dy */
if(lossName != NOLOSS)
_LossBackward(dedy, gold, y, lossName);
DTYPE * dedyp = (DTYPE*)dedy->data;
DTYPE * dedxp = (DTYPE*)dedx->data;
DTYPE * ip = (DTYPE*)x->data;
int size = y->unitNum;
for(int i = 0; i < size; i++){
/* dE/ds = dE/dy * dy/ds = dE/dy */
DTYPE s = ip[i];
if(s < 0)
dedxp[i] = 0;
else
dedxp[i] = dedyp[i];
}
}
else
ShowNTErrors("TODO!");
}
} // namespace nts(NiuTrans.Tensor)
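/* Self-contained illustration (not part of the library) of the subgradient
   convention chosen above: ReLU is not differentiable at x = 0, and the
   backward pass here assigns dy/dx = 1 at that point, so only strictly
   negative inputs receive a zero gradient. */
#include <cstdio>

static void RectifySketch()
{
    float xs[3] = { -1.0F, 0.0F, 2.0F };
    for (int i = 0; i < 3; i++) {
        float x = xs[i];
        float y = x < 0 ? 0 : x;              /* forward: y = max(0, x) */
        float dydx = x < 0 ? 0.0F : 1.0F;     /* convention used in _RectifyBackward */
        printf("x = %4.1f  y = %4.1f  dy/dx = %.0f\n", x, y, dydx);
    }
}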
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "Rectify.h"
#include "Rectify.cuh"
#include "Loss.cuh"
#include "../XDevice.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
rectify computation (Cuda kernel)
rectify : y = x if x >= 0
0 if x < 0
>> x - input data array
>> y - output data array
>> size - size of input/output
*/
__global__
void KernelRectify(DTYPE * x, DTYPE * y, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
DTYPE p = x[i];
if(p < 0)
p = 0;
y[i] = p;
}
}
/*
rectify function y = max(0, x)
>> x - input tensor
>> y - result
*/
void _CudaRectify(const XTensor * x, XTensor * y)
{
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
CheckNTErrors(!x->isSparse && !y->isSparse, "The Rectify function does not support sparse matrices.");
CheckNTErrors(x->unitNum == y->unitNum, "The input and output vectors must be of the same length.");
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
KernelRectify<<<dim3(gridSize[0]), dim3(blockSize[0])>>>((DTYPE*)x->data, (DTYPE*)y->data, x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
}
else
ShowNTErrors("TODO!");
}
/*
rectify backward computation of dE/dx (Cuda kernel)
dy/dx = 1 if x >= 0
0 otherwise
>> dedy - dE/dy
>> dedx - dE/dx
>> gold - gold standard
>> y - output of the function
>> x - input of the function
>> size - size of output/input
*/
__global__
void KernelRectifyBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * gold, DTYPE * y, DTYPE * x, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
DTYPE s = x[i];
if(s >= 0)
dedx[i] = dedy[i];
else
dedx[i] = 0;
}
}
/*
backward computation (Cuda version)
dE/ds = dE/dy * dy/ds
rectify : y = s if s >= 0
0 if s < 0
and dy/ds = 1 if s >= 0
0 otherwise
>> gold - gold standard to measure error (or loss)
>> y - output of the activation function
>> x - input of the activation function
>> dedy - dE/dy
>> dedx - dE/dx
>> lossName - type of loss function, e.g., cross entropy
*/
void _CudaRectifyBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName)
{
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
/* calculate dE/dy */
if(lossName != NOLOSS)
_CudaLossBackward(dedy, gold, y, lossName);
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
/* dE/ds = dE/dy * dy/ds */
KernelRectifyBackward<<<dim3(gridSize[0]),dim3(blockSize[0])>>>
((DTYPE*)dedy->data,
(DTYPE*)dedx->data,
gold == NULL ? NULL : (DTYPE*)gold->data,
(DTYPE*)y->data, (DTYPE*)x->data,
x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
}
else
ShowNTErrors("TODO!");
}
#endif
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __RECTIFY_CUH__
#define __RECTIFY_CUH__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* rectify function y = max(0, x) (Cuda version) */
void _CudaRectify(const XTensor * input, XTensor * output);
/* de/dx (Cuda version) */
void _CudaRectifyBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __RECTIFY_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __RECTIFY_H__
#define __RECTIFY_H__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/* rectify function y = max(0, x) */
void _Rectify(const XTensor * x, XTensor * y);
/* rectify function y = max(0, x) (return a XTensor structure) */
XTensor Rectify(const XTensor &x);
/* de/dx */
void _RectifyBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName);
} // namespace nts(NiuTrans.Tensor)
#endif // __RECTIFY_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#include "../XName.h"
#include <math.h>
#include "Sigmoid.h"
#include "Sigmoid.cuh"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
sigmoid function y = 1/(1+exp(-x))
>> x - input tensor
>> y - result
*/
void _Sigmoid(const XTensor * x, XTensor * y)
{
#ifdef USE_CUDA
if(x->devID >= 0 || y->devID >= 0){
_CudaSigmoid(x, y);
return;
}
#endif
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
DTYPE * ip = (DTYPE*)x->data;
DTYPE * op = (DTYPE*)y->data;
int n = x->GetSize();
for(int i = 0; i < n; i++){
DTYPE p = ip[i];
op[i] = (DTYPE)1.0/((DTYPE)1.0+(DTYPE)exp(-p));
}
}
else
ShowNTErrors("TODO!");
}
/*
sigmoid function y = 1/(1+exp(-x)) (return a XTensor structure)
make a new tensor to keep the result and return it
>> x - input tensor
<< return - y
*/
XTensor Sigmoid(const XTensor &x)
{
XTensor y(&x);
y.SetTMP();
/* call _Sigmoid function */
_Sigmoid(&x, &y);
/* tensor connection */
XLink::MakeLink(&x, NULL, &y, FUNC_SIGMOID);
return y;
}
/*
backward computation
dE/dx = dE/dy * dy/dx
sigmoid: y = 1/(1+exp(-x))
and dy/dx = y * (1 - y)
>> gold - gold standard to measure the error (or loss)
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lossName - type of loss function, e.g., cross entropy
*/
void _SigmoidBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName)
{
CheckNTErrors((gold == NULL || XTensor::IsSameShaped(gold, y)),
"The tensors must be of the same size!");
#ifdef USE_CUDA
if(x->devID >= 0 || y->devID >= 0){
_CudaSigmoidBackward(gold, y, x, dedy, dedx, lossName);
return;
}
#endif
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
{
/* calculate dE/dy */
if(lossName != NOLOSS)
_LossBackward(dedy, gold, y, lossName);
DTYPE * dedyp = (DTYPE*)dedy->data;
DTYPE * dedxp = (DTYPE*)dedx->data;
DTYPE * op = (DTYPE*)y->data;
int size = y->unitNum;
/* dE/dx = dE/dy * dy/dx */
for(int i = 0; i < size; i++){
DTYPE yi = op[i];
dedxp[i] = dedyp[i] * yi * ((DTYPE)1.0 - yi);
}
}
else
ShowNTErrors("TODO!");
}
} // namespace nts(NiuTrans.Tensor)
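/* Self-contained check (not part of the library) of the derivative identity
   used by _SigmoidBackward above: for y = 1/(1+exp(-x)), dy/dx = y*(1-y).
   The forward finite-difference quotient should match y*(1-y) to about 1e-4. */
#include <cmath>
#include <cstdio>

static void SigmoidGradSketch()
{
    const float x = 0.3F, eps = 1e-3F;
    float y  = 1.0F / (1.0F + expf(-x));
    float y2 = 1.0F / (1.0F + expf(-(x + eps)));
    printf("analytic  dy/dx = %.6f\n", y * (1.0F - y));   /* about 0.244458 */
    printf("numerical dy/dx = %.6f\n", (y2 - y) / eps);
}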
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#include "Sigmoid.h"
#include "Sigmoid.cuh"
#include "Loss.cuh"
#include "../XDevice.h"
#ifdef USE_CUDA
// the CUDA stuff
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuda.h>
#endif
namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
sigmoid function y = 1/(1+exp(-x)) (Cuda kernel)
>> x - input data pointer
>> y - output data pointer
>> size - size of input/output
*/
__global__
void KernelSigmoidCompute(DTYPE * x, DTYPE * y, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
y[i] = 1/(1+exp(-x[i]));
}
}
/*
sigmoid function y = 1/(1+exp(-x)) (Cuda version)
>> x - input vector
>> y - result
*/
void _CudaSigmoid(const XTensor * x, XTensor * y)
{
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
CheckNTErrors(!x->isSparse && !y->isSparse, "The sigmoid activation function does not support sparse tensors.");
CheckNTErrors(x->unitNum == y->unitNum, "The input and output tensors must be of the same length.");
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
KernelSigmoidCompute<<<dim3(gridSize[0]), dim3(blockSize[0])>>>((DTYPE*)x->data, (DTYPE*)y->data, x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
}
else
ShowNTErrors("TODO!");
}
/*
sigmoid backward computation of dE/dx (Cuda kernel)
dE/dx = dE/dy * dy/dx
sigmoid: y = 1/(1+exp(-x))
and dy/dx = y * (1 - y)
>> dedy - dE/dy
>> dedx - dE/dx
>> gold - gold standard
>> y - output of the function
>> x - input of the function
>> size - size of output/input
*/
__global__
void KernelSigmoidBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * gold, DTYPE * y, DTYPE * x, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
dedx[i] = dedy[i] * y[i] * ((DTYPE)1.0 - y[i]);
}
}
/*
backward computation (Cuda version)
dE/dx = dE/dy * dy/dx
sigmoid: y = 1/(1+exp(-x))
and dy/dx = y * (1 - y)
>> gold - gold standard to measure error (or loss)
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lossName - type of loss function, e.g., cross entropy
*/
void _CudaSigmoidBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName)
{
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
/* calculate dE/dy */
if(lossName != NOLOSS)
_LossBackward(dedy, gold, y, lossName);
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
/* dE/dx = dE/dy * dy/dx */
KernelSigmoidBackward<<<dim3(gridSize[0]),dim3(blockSize[0])>>>
((DTYPE*)dedy->data,
(DTYPE*)dedx->data,
gold == NULL ? NULL : (DTYPE*)gold->data,
(DTYPE*)y->data, (DTYPE*)x->data,
x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
}
else
ShowNTErrors("TODO!");
}
#endif
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#ifndef __SIGMOID_CUH__
#define __SIGMOID_CUH__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* sigmoid function y = 1/(1+exp(-x)) (Cuda version) */
void _CudaSigmoid(const XTensor * input, XTensor * output);
/* de/dx (Cuda version) */
void _CudaSigmoidBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __SIGMOID_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-25
*/
#ifndef __SIGMOID_H__
#define __SIGMOID_H__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/* sigmoid function y = 1/(1+exp(-x)) */
void _Sigmoid(const XTensor * x, XTensor * y);
/* sigmoid function y = 1/(1+exp(-x)) (return a XTensor structure) */
XTensor Sigmoid(const XTensor &x);
/* de/dx */
void _SigmoidBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
LOSS_FUNCTION_NAME lossName);
} // namespace nts(NiuTrans.Tensor)
#endif // __SIGMOID_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-27
*/
#ifndef __SOFTMAX_CUH__
#define __SOFTMAX_CUH__
#include "../XTensor.h"
#include "Loss.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* softmax y = e^x / \sum_{i} e^{x_i} (Cuda version) */
void _CudaSoftmax(const XTensor * input, XTensor * output, int leadDim);
/* softmax y = e^x / \sum_{i} e^{x_i} (Cuda version) given the per-slice sum and max tensors */
void _CudaSoftmaxSumMax(const XTensor * x, XTensor * y, int leadDim, XTensor * sum, XTensor * max);
/* de/dx (Cuda version) */
void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
int leadDim,
LOSS_FUNCTION_NAME lossName);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __SOFTMAX_CUH__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-27
*/
#ifndef __SOFTMAX_H__
#define __SOFTMAX_H__
#include "../XTensor.h"
#include "Loss.h"
namespace nts{ // namespace nts(NiuTrans.Tensor)
/* softmax y = e^x / \sum_{i} e^{x_i} */
void _Softmax(const XTensor * x, XTensor * y, int leadDim);
/* softmax y = e^x / \sum_{i} e^{x_i} (return a XTensor structure) */
XTensor Softmax(const XTensor &x, int leadDim);
/* de/dx */
void _SoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx,
int leadDim,
LOSS_FUNCTION_NAME lossName);
} // namespace nts(NiuTrans.Tensor)
#endif // __SOFTMAX_H__
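/*
A minimal usage sketch of the interface declared above (illustrative only and
commented out; SoftmaxExample is a hypothetical name, not part of the library).
leadDim selects the dimension along which the normalization \sum_{i} e^{x_i}
is taken, so each slice along that dimension sums to 1.

    #include "Softmax.h"
    using namespace nts;

    void SoftmaxExample()
    {
        int dimSize[2] = {2, 3};
        DTYPE xData[2][3] = { {1.0F, 2.0F, 3.0F},
                              {0.0F, 0.0F, 0.0F} };
        XTensor * x = NewTensor(2, dimSize);
        XTensor * y = NewTensor(2, dimSize);
        x->SetData(xData, 6);

        _Softmax(x, y, 1);               // normalize along dimension 1
        XTensor yUser = Softmax(*x, 1);  // same result, returned by value

        // with equal inputs, the second slice becomes {1/3, 1/3, 1/3}

        delete x;
        delete y;
    }
*/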
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/
#include "../core/math/Unary.h"
#include "TAbsolute.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Absolute function.
Set every entry to its absolute value.
*/
bool TestAbsolute1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, -2.0F},
{0.5F, -4.0F},
{0.0F, 6.0F} };
DTYPE answer[3][2] = { {1.0F, 2.0F},
{0.5F, 4.0F},
{0.0F, 6.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Absolute function */
_Absolute(a, b);
_AbsoluteMe(aMe);
bUser = Absolute(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Absolute function */
_Absolute(aGPU, bGPU);
_AbsoluteMe(aMeGPU);
bUserGPU = Absolute(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
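/*
A note on the interface convention exercised above (and by TClip later in this
commit): for an operation F the toolkit provides three variants, _F(a, b)
which writes the result into a preallocated tensor b, _FMe(a) which modifies a
in place, and F(a) which returns the result as a new XTensor by value. The GPU
branch repeats the same checks on tensors created with the extended NewTensor
overload, whose trailing argument (0 here) appears to select the GPU device.
*/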
/* other cases */
/*
TODO!!
*/
/* test for Absolute Function */
bool TestAbsolute()
{
XPRINT(0, stdout, "[TEST Absolute] set every entry to its absolute value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestAbsolute1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/
#ifndef __TEST_ABSOLUTE_H__
#define __TEST_ABSOLUTE_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Absolute Function */
bool TestAbsolute();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_ABSOLUTE_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/
#include "../XTensor.h"
#include "TClip.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Clip function.
Clip every entry into the range [lower, upper] (here [-1.0, 1.0]).
*/
bool TestClip1()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
DTYPE aData[3][2] = { {1.0F, -2.0F},
{0.0F, 4.0F},
{5.0F, -6.0F} };
DTYPE answer[3][2] = { {1.0F, -1.0F},
{0.0F, 1.0F},
{1.0F, -1.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(aOrder, aDimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
aMe->SetData(aData, aUnitNum);
/* call Clip function */
_Clip(a, b, -1.0F, 1.0F);
_ClipMe(aMe, -1.0F, 1.0F);
bUser = Clip(*a, -1.0F, 1.0F);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) &&
aMe->CheckData(answer, aUnitNum, 1e-4F) &&
bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
aMeGPU->SetData(aData, aUnitNum);
/* call Clip function */
_Clip(aGPU, bGPU, -1.0F, 1.0F);
_ClipMe(aMeGPU, -1.0F, 1.0F);
bUserGPU = Clip(*aGPU, -1.0F, 1.0F);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) &&
aMeGPU->CheckData(answer, aUnitNum, 1e-4F) &&
bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] aDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] aDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Clip Function */
bool TestClip()
{
XPRINT(0, stdout, "[TEST Clip] set every entry to its clip value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestClip1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/
#ifndef __TEST_CLIP_H__
#define __TEST_CLIP_H__
#include "../core/math/Clip.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Clip Function */
extern "C"
bool TestClip();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_CLIP_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-14
*/
#ifndef __TEST_CONCATENATE_H__
#define __TEST_CONCATENATE_H__
#include "../core/shape/Concatenate.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Concatenate Function */
bool TestConcatenate();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_CONCATENATE_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-14
*/
#ifndef __TEST_CONCATENATESOLELY_H__
#define __TEST_CONCATENATESOLELY_H__
#include "../core/shape/ConcatenateSolely.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for ConcatenateSolely Function */
bool TestConcatenateSolely();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_CONCATENATESOLELY_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/
#include "TConvertDataType.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test ConvertDataType function.
In this case, the float32 data type is converted to the int32 data type.
*/
bool TestConvertDataType1()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{0.5F, 4.0F},
{5.0F, 6.0F} };
int answer[3][2] = { {1, 2},
{0, 4},
{5, 6} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(aOrder, aDimSize, X_INT);
/* initialize variables */
a->SetData(aData, aUnitNum);
b->SetZeroAll();
/* call ConvertDataType function */
_ConvertDataType(a, b);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_INT, 1.0F, 0);
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call ConvertDataType function */
_ConvertDataType(aGPU, bGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: test ConvertDataType function.
In this case, the int32 data type is converted to the float32 data type.
*/
bool TestConvertDataType2()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
int aData[3][2] = { {1, 2},
{0, 4},
{5, 6} };
DTYPE answer[3][2] = { {1.0F, 2.0F},
{0.0F, 4.0F},
{5.0F, 6.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize, X_INT);
XTensor * b = NewTensor(aOrder, aDimSize);
/* initialize variables */
a->SetData(aData, aUnitNum);
b->SetZeroAll();
/* call ConvertDataType function */
_ConvertDataType(a, b);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_INT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call ConvertDataType function */
_ConvertDataType(aGPU, bGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for ConvertDataType Function */
bool TestConvertDataType()
{
XPRINT(0, stdout, "[TEST ConvertDataType] convert data type \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestConvertDataType1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestConvertDataType2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
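/*
A note on the conversion semantics verified above: as the expected answers in
case 1 show, float-to-int conversion truncates toward zero (0.5F becomes 0),
while int-to-float is exact for these small values. A minimal usage sketch
(illustrative only and commented out; ConvertExample is a hypothetical name):

    #include "TConvertDataType.h"
    using namespace nts;

    void ConvertExample()
    {
        int dimSize[2] = {1, 2};
        DTYPE fData[1][2] = { {0.5F, 4.0F} };
        XTensor * f = NewTensor(2, dimSize);         // float32 by default
        XTensor * i = NewTensor(2, dimSize, X_INT);  // int32 target
        f->SetData(fData, 2);

        _ConvertDataType(f, i);                      // i now holds {0, 4}

        delete f;
        delete i;
    }
*/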