Commit 4388c982 by xuchen

Merge branch 'xiaotong-working'

parents dd6646ed 9d33e210
# NiuTrans.Tensor: A Tensor Computation Library
## NiuTrans.Tensor
NiuTrans.Tensor is a toolkit developed as part of the NiuTrans open-source project. It provides a complete set of tensor definition and computation functions and can be used for deep learning research as well as the development of industrial systems. NiuTrans.Tensor has the following features:
* Small and simple, easy to modify
* Written in C, with highly optimized code
* Supports both CPU and GPU devices
* A rich set of tensor computation interfaces
* Callable from C/C++, Python, and other languages
## Installation
Before creating your project with the NiuTrans.Tensor toolkit, note the following:
* If your project runs on the CPU, the system supports high-performance math libraries; we recommend installing [MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/).
* If your project needs to run on the GPU, you must install [CUDA](https://developer.nvidia.com/cuda-downloads) (version 9.0 or above). The CUDA toolkit provides a development environment for creating high-performance GPU-accelerated applications.
The NiuTrans.Tensor toolkit is built from source. Installation on Windows and Linux is described below.
### Windows
To use the NiuTrans.Tensor toolkit on Windows:
* First, include the NiuTrans.Tensor source code in your project.
* Reference these three header files in your project: XTensor.h, CHeader.h (under core), and FHeader.h (under function):
* XTensor.h provides the XTensor class that you operate on
* CHeader.h (under core) provides the tensor computation operations
* FHeader.h (under function) provides the activation functions
* Use the namespace nts in your project.
In addition, for the required environment configuration, see [NiuTrans.Tensor Environment Configuration](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/linye/doc/Configuration.md).
### Linux
To use the NiuTrans.Tensor toolkit on Linux, simply run make.sh. This generates tensorCPU and tensorGPU in the same directory, the CPU and GPU executables of NiuTrans.Tensor respectively. Taking the feed-forward neural language model as an example, the following command runs the provided test case on the GPU:
>./tensorGPU -test
For more detailed usage, see the [NiuTrans.Tensor development documentation](http://47.104.97.237/niutrans/site/niutensor/index.html).
## Development Team
The NiuTrans.Tensor tensor computation library is developed jointly by the Natural Language Processing Lab at Northeastern University, NiuTrans, and Xiaoniu Yazhi. It aims to provide complete tensor definition and computation functionality for deep learning research and the development of industrial systems.
## Versions
NiuTrans.Tensor version 0.1.0 - August 3, 2018
# NiuTrans.Tensor Environment Configuration
## Notes
The latest CUDA release, 9.2, does not yet support the latest version of VS2017. We therefore recommend using CUDA 9.0 or 9.1 with VS2015, or installing the v140 toolset if you use VS2017.
## CUDA Configuration
After installing VS and CUDA and configuring the environment variables, set the following key CUDA options, all found under **Project -> Properties**.
>$(CUDA_PATH)\include
Add this to **VC++ Directories -> Include Directories**.
>$(CUDA_PATH)\lib\Win32
Add this to **VC++ Directories -> Library Directories**.
>cuda.lib;cudadevrt.lib;cudart.lib;cudart_static.lib;nvcuvid.lib;OpenCL.lib;cublas.lib;curand.lib;
Add these to **Linker -> Input -> Additional Dependencies**.
After configuration, right-click the project, choose **Project Dependencies**, and select CUDA9.
Right-click each .cu file, open its properties, and set the item type to "CUDA C/C++" (it is easiest to search for all .cu files, select them, and set this at once).
## Other Settings
Set **C/C++ -> General -> SDL checks** to No.
Under **C/C++ -> Preprocessor -> Preprocessor Definitions**, add:
>USE_CUDA;USE_BLAS;WIN32;MKL;DEBUG;CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS_
CONSOLE;
Set **Linker -> System -> SubSystem** to Console.
Set **General -> Character Set** to Use Unicode Character Set.
Set the command-line arguments the executable needs under **Debugging -> Command Arguments**.
* If your project runs on the CPU, the system supports high-performance math libraries; we recommend installing [Intel® MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/).
* If your project needs to run on the GPU, you must install the [NVIDIA® CUDA® Toolkit](https://developer.nvidia.com/cuda-downloads) (version 9.0 or above). The CUDA toolkit provides a development environment for creating high-performance GPU-accelerated applications.
The NiuTrans.Tensor toolkit is built from source. Installation on Windows and Linux is described below.
### Windows
To use the NiuTrans.Tensor toolkit on Windows:
* First, include the NiuTrans.Tensor source code in your project.
* Reference these three header files in your project: XTensor.h, CHeader.h (under core), and FHeader.h (under function):
* XTensor.h provides the XTensor class that you operate on
* CHeader.h (under core) provides the tensor computation operations
* FHeader.h (under function) provides the activation functions
* Use the namespace nts in your project.
In addition, for the required environment configuration, see [NiuTrans.Tensor Environment Configuration](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/linye/doc/Configuration.md).
### Linux
To use the NiuTrans.Tensor toolkit on Linux, simply run make.sh. This generates tensorCPU and tensorGPU in the same directory, the CPU and GPU executables of NiuTrans.Tensor respectively. Taking the feed-forward neural language model as an example, the following command runs the provided test case on the GPU:
>./tensorGPU -test
## What Is a Tensor
In computer science, a tensor is usually defined as a quantity in an \\(n\\)-dimensional space with \\(n\\) components; such a tensor is essentially a multidimensional array. The order (or rank) of a tensor is the number of dimensions of this multidimensional array, or, put simply, the number of indices needed to address an element of the tensor. Conventionally, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space, an order-1 tensor is the vector \\((x,y,z)\\) representing a point, where \\(x\\), \\(y\\), and \\(z\\) are the point's coordinates along the three axes.
## How to Define a Tensor
If you are a C/C++ or Python user, defining a tensor with NiuTrans.Tensor in your program is very simple. First, download the NiuTrans.Tensor toolkit and unpack it to any directory, for example ~/NiuTrans.Tensor. Inside this directory you will find the source subdirectory, which holds the source code. Its structure is as follows:
* ~/NiuTrans.Tensor/source/tensor/XTensor.h - defines the tensor structure XTensor and the interfaces for constructing and destroying an XTensor
* ~/NiuTrans.Tensor/source/tensor/core - source files with the declarations and implementations of the tensor computation functions
    * arithmetic - source files for arithmetic operations
    * getandset - source files for getting and setting data
    * math - source files for mathematical operations
    * movement - source files for data movement
    * reduce - source files for reduction operations
    * shape - source files for shape transformations
    * sort - source files for sorting operations
* ~/NiuTrans.Tensor/source/tensor/function - source files for the various activation functions
* ~/NiuTrans.Tensor/source/tensor/test - source files for the unit tests
* ~/NiuTrans.Tensor/source/tensor/*.h(cpp) - unrelated to tensor definition; described later :)
Taking C/C++ as an example, you only need to include the header file XTensor.h in your source program to define tensors. Below is a simple sample program, sample.cpp.
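A minimal sketch of such a program, assuming only the XTensor interface described below (InitTensor2D, X_FLOAT, and the nts namespace), might look like this:
```
#include "XTensor.h"

using namespace nts;

int main(int argc, const char ** argv)
{
    /* declare a tensor variable */
    XTensor tensor;

    /* initialize it as a 50 * 100 matrix (an order-2 tensor)
       of single-precision floats */
    InitTensor2D(&tensor, 50, 100, X_FLOAT);

    return 0;
}
```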
Next, compile the program; the compiler needs to know the directory holding XTensor.h. For example, to compile sample.cpp with g++:
```
g++ sample.cpp -I~/NiuTrans.Tensor/source/tensor -o sample
```
sample.cpp uses XTensor, a class in NiuTrans.Tensor that defines the data structure a tensor needs. Through this class you can compute with, copy, and otherwise manipulate tensors. After a variable of type XTensor is declared, it must be initialized, that is, actually specified as a tensor: the size of each dimension, the data type of each element, the memory allocation, and so on. InitTensor2D() is one such initialization function; it initializes the tensor as a matrix and takes four arguments: a pointer to the tensor being initialized, the number of columns, the number of rows, and the element data type. Here X_FLOAT is an enumeration type defined by NiuTrans.Tensor denoting single-precision floating point. You can also use X_INT or X_DOUBLE to set the data type to 32-bit integer or double-precision floating point.
Tensor creation functions of this kind include:
| Operation | Function | Parameters |
| - | - | - |
| Create a dense 4-D tensor | XTensor * NewTensor4D(<br>const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) | d0 - size of the first dimension <br> d1 - size of the second dimension <br> d2 - size of the third dimension <br> d3 - size of the fourth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
| Create a dense 5-D tensor | XTensor * NewTensor5D(<br>const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) | d0 - size of the first dimension <br> d1 - size of the second dimension <br> d2 - size of the third dimension <br> d3 - size of the fourth dimension <br> d4 - size of the fifth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
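As a usage illustration (a sketch built only from the functions in the tables of this document):
```
/* create a 2 * 3 * 4 * 5 dense tensor of single-precision floats
   on the CPU (device ID -1), without a memory pool */
XTensor * t = NewTensor4D(2, 3, 4, 5, X_FLOAT, -1, NULL);

/* ... use the tensor ... */

/* release the tensor and its data */
DelTensor(t);
```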
## Accessing Tensor Contents
In C/C++, tensor contents are accessed through XTensor.h; including this single header file in your program is enough to define and work with tensors.
| Member Variable | Description |
| - | - |
| void * data | the array holding the elements |
| int devID | device ID, i.e. the ID of the CPU or GPU device whose memory the tensor occupies; -1 means CPU |
| int order | the order (number of dimensions) of the tensor, e.g. a matrix (2 dimensions) is an order-2 tensor |
| int dimSize[ ] | the size of each dimension of the tensor; index 0 is the first dimension |
| TENSOR_DATA_TYPE dataType | the data type of each element |
| int unitSize | the size of one element, similar to sizeof() |
| int unitNum | the number of elements |
| bool isSparse | whether the tensor is dense; an n * m dense matrix stores n * m values, while the storage of a sparse (non-dense) matrix depends on the number of nonzero elements |
| float denseRatio | density, i.e. the fraction of nonzero elements, a real number between 0 and 1; 0 means all elements are zero, 1 means all are nonzero |
Some of the methods defined in XTensor.h are described below; see the appendix for the full list:
| Operation | Function | Parameters |
| - | - | - |
| Set the size of each dimension | void SetDim(int * myDimSize) | myDimSize - size of each dimension |
| Get the size of a given dimension | int GetDim(const int dim) | dim - the dimension |
| Reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - order of the tensor <br> myDimSize - size of each dimension |
| Get the number of elements | int GetSize() | N/A |
| Fill the tensor from an array | void SetData(<br> const void * d, int num, int beg = 0) | d - the source array <br> num - size of the array <br> beg - offset in the tensor at which filling starts |
| Set all elements to 0 | void SetZeroAll(XStream * stream = NULL) | stream - multi-thread stream |
| Get a cell of a 2-D tensor | DTYPE Get2D(int ni, int mi = 0) | ni - row index <br> mi - column index |
| Set a cell of a 2-D tensor | bool Set2D(DTYPE value, int ni, int mi = 0) | value - cell value <br> ni - row index <br> mi - column index |
| Add to a cell of a 2-D tensor | bool Add2D(DTYPE value, int ni, int mi = 0) | value - value to add <br> ni - row index <br> mi - column index |
| Resize the tensor | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension, index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor, 1 means dense |
| Resize the tensor to match another | bool Resize(<br> const XTensor * myTensor) | myTensor - the reference tensor whose size is copied |
| Release the data of a given tensor | void DelTensor(<br>const XTensor * tensor) | tensor - the tensor |
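A brief sketch of these accessors in use, assuming a 2-D float tensor created as in the earlier example:
```
/* fill a 2 * 2 matrix from an array, then read and update cells */
float values[] = {1.0F, 2.0F, 3.0F, 4.0F};
tensor.SetData(values, 4);

DTYPE v = tensor.Get2D(0, 1);   /* read row 0, column 1 */
tensor.Set2D(5.0F, 0, 1);       /* overwrite the same cell */
tensor.Add2D(1.0F, 1, 0);       /* add 1.0 to row 1, column 0 */
```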
## Tensor Computation
NiuTrans.Tensor provides functions for tensor computation, mainly basic tensor operations and activation functions. This section describes these functions and how to use them. Taking element-wise multiplication (Multiply) as an example, NiuTrans.Tensor defines each operation in several forms:
* _Multiply: the output tensor must be specified explicitly; forward computation only
* MultiplyMe: the output tensor is the same as the input tensor (in place); forward computation only
* Multiply: the output tensor is returned to the caller; supports both forward and backward computation
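A sketch of the three calling conventions, using Multiply as the running example (the shapes of a, b, and c are assumed compatible):
```
XTensor a, b, c;
/* ... initialize a, b, and c with the same shape ... */

_Multiply(&a, &b, &c, 0);   /* result written into the existing tensor c */
_MultiplyMe(&a, &b);        /* in place: a is overwritten with a * b */
c = Multiply(a, b);         /* a new tensor is returned to the caller */
```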
### Arithmetic Operations (arithmetic)
This part covers the basic mathematical operations: addition, subtraction, multiplication, division, negation, absolute value, and so on.
#### Absolute Value (Absolute)
##### What is the absolute-value operation on a tensor?
The absolute-value operation takes the absolute value of every element of a tensor, producing a new tensor. For a \\(2 \times 3\\) matrix the process looks like this:
$$
\left(\begin{matrix}-1.0 & 2.0 & 3.0\\\\-4.0 & 5.0 & 6.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0\end{matrix}\right)
$$
##### Calling Absolute
NiuTrans.Tensor provides the absolute-value operation, defined in NiuTrans.Tensor/Tensor/core/arithmetic.
The calling conventions and parameters are as follows:
```
void _Absolute(const XTensor * a, XTensor * b)
void _AbsoluteMe(XTensor * a)
XTensor Absolute(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Absolute example
Example code using Absolute to take element-wise absolute values:
```
/* call Absolute function */
b = Absolute(*a);
```
For detailed example code of the absolute-value operation, see:
NiuTrans.Tensor/Tensor/test/TAbsolute.cpp
#### Matrix Multiplication (MatrixMul)
##### Calling MatrixMul
NiuTrans.Tensor provides matrix multiplication, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The function computes:
>c_{i,j} = trans(a_i) * trans(b_j) * alpha + c_{i,j} * beta
The calling conventions and parameters are as follows:
```
void _MatrixMul(XTensor * a, MATRIX_TRANS_TYPE transposedA, XTensor * b, MATRIX_TRANS_TYPE transposedB, XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
```
Parameters:
* a - input tensor 1
* transposedA - whether a is transposed
* b - input tensor 2
* transposedB - whether b is transposed
* c - output tensor
* alpha - coefficient α
* beta - coefficient β
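For example, with transposedA = transposedB = X_NOTRANS, multiplying a \\(2 \times 3\\) tensor a by a \\(3 \times 2\\) tensor b yields a \\(2 \times 2\\) tensor c, where \\(c_{i,j} = alpha \cdot \sum_k a_{i,k} b_{k,j} + beta \cdot c_{i,j}\\).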
##### MatrixMul example
Taking basic two-dimensional matrix multiplication as an example, the following code multiplies two matrices with MatrixMul:
```
/* call MatrixMul function */
t = MatrixMul(*s1, X_NOTRANS, *s2, X_NOTRANS);
```
Detailed example code for matrix multiplication can be found in the NiuTrans.Tensor test directory.
#### Element-wise Multiplication (Multiply)
##### Calling Multiply
NiuTrans.Tensor provides element-wise multiplication, computing the element-wise product of tensors. The function is defined in NiuTrans.Tensor/Tensor/core/arithmetic; the calling conventions and parameters are as follows:
```
_Multiply(XTensor * a, XTensor * b, XTensor * c, int leadingDim, DTYPE alpha = 0)
void _MultiplyMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0)
XTensor Multiply(const XTensor &a, const XTensor &b, DTYPE alpha = 0, int leadingDim = 0)
```
Parameters:
* a - input tensor 1
* b - input tensor 2
* c - output tensor
* leadingDim - the dimension along which the element-wise multiplication is performed
* alpha - coefficient
##### Multiply example
The following code uses Multiply to compute the element-wise product of tensors s1 and s2, storing the result in t:
```
/* call multiply function */
t = Multiply(*s1, *s2, 0);
```
Detailed example code for element-wise multiplication can be found in the NiuTrans.Tensor test directory.
#### Negation (Negate)
##### Calling Negate
NiuTrans.Tensor provides element-wise negation, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The calling conventions and parameters are as follows:
```
void _Negate(const XTensor * a, XTensor * b)
void _NegateMe(XTensor * a)
XTensor Negate(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Negate example
The following code uses Negate, where a is the tensor to be processed and b the resulting tensor:
```
/* call negate function */
b = Negate(*a);
```
For detailed example code of negation, see:
NiuTrans.Tensor/Tensor/test/TNegate.cpp
#### Sign Function (Sign)
##### What is the sign function of a tensor?
The sign function takes the sign of every element of a tensor. For a \\(3 \times 2\\) tensor the process looks like this:
$$
\left(\begin{matrix}1.0 & -2.0\\\\0.0 & 4.0\\\\5.0 & -6.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}1.0 & -1.0\\\\0.0 & 1.0\\\\1.0 & -1.0\end{matrix}\right)
$$
##### Calling Sign
NiuTrans.Tensor provides the sign function, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The calling conventions and parameters are as follows:
```
void _Sign(const XTensor * a, XTensor * b)
void _SignMe(XTensor * a)
XTensor Sign(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Sign example
The following code applies Sign, where a is the tensor to be processed and b the resulting tensor:
```
/* call Sign function */
b = Sign(*a);
```
For detailed example code of the sign function, see:
NiuTrans.Tensor/Tensor/test/TSign.cpp
#### Addition (Sum)
##### What is tensor addition?
Tensor addition adds two tensors element-wise, producing the tensor of their sum.
##### Calling Sum
NiuTrans.Tensor provides tensor addition, defined in NiuTrans.Tensor/Tensor/core/arithmetic. It adds tensors element-wise and produces the resulting tensor. The calling conventions are:
```
void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0)
void _SumMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0)
XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0)
```
Here a and b are the input tensors and c the result; if c is NULL the sum is stored into a. beta is a scaling parameter, with c = a + b * beta and a default of 1.0. The parameters are:
Parameters:
* a - input tensor 1
* b - input tensor 2
* c - output tensor
* beta - scaling parameter
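As a concrete instance of the scaling formula, with beta = 2.0 each element of the result is \\(c_{i,j} = a_{i,j} + 2.0 \cdot b_{i,j}\\); the default beta = 1.0 gives plain element-wise addition.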
##### Sum example
The following code calls Sum to add two tensors, storing the result in c:
```
/* call sum function */
c = Sum(*a, *b);
```
Detailed example code for tensor addition can be found in the NiuTrans.Tensor test directory.
#### SumByColumnTV
##### Calling SumByColumnTV
NiuTrans.Tensor provides the SumByColumnTV operation. The calling convention and parameters are as follows:
```
void _SumByColumnTV(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
```
Parameters:
* a - input tensor
* b - input vector
* c - output tensor
* beta - scaling parameter
SumByColumnTV computes c_col = a_col + b * β.
##### SumByColumnTV example
Example code for SumByColumnTV, where a is the input tensor, b the input vector, and c the column-wise sum of a and b:
```
/* call SumByColumnTV function */
_SumByColumnTV(a, b, c);
```
For detailed example code of SumByColumnTV, see:
NiuTrans.Tensor/Tensor/test/TSumByColumnTV.cpp
#### SumByColumnVT
##### Calling SumByColumnVT
NiuTrans.Tensor provides the SumByColumnVT operation. The calling convention and parameters are as follows:
```
_SumByColumnVT(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
```
Parameters:
* a - input vector
* b - input tensor
* c - output vector
* beta - scaling parameter
SumByColumnVT computes c = a + \sum_{col} b_col * β.
##### SumByColumnVT example
```
/* call SumByColumnVT function */
_SumByColumnVT(a, b, c);
```
For detailed example code of SumByColumnVT, see:
NiuTrans.Tensor/Tensor/test/TSumByColumnVT.cpp
### Getting and Setting Data (getandset)
This part covers data-type conversion and the operations that set or retrieve tensor data.
#### ConvertDataType
##### What is ConvertDataType?
ConvertDataType converts the data type of every element of a tensor to another data type.
##### Calling ConvertDataType
NiuTrans.Tensor provides the ConvertDataType operation. The calling convention and parameters are as follows:
```
void _ConvertDataType(const XTensor * input, XTensor * output)
```
Parameters:
* input - input tensor
* output - output tensor
##### ConvertDataType example
Example code for ConvertDataType; here the element type is converted from float32 to int32.
First, create tensor a with type float32 and tensor b with type int32:
```
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(aOrder, aDimSize, X_INT);
```
Then call ConvertDataType:
```
/* call ConvertDataType function */
_ConvertDataType(a, b);
```
For detailed example code of ConvertDataType, see:
NiuTrans.Tensor/Tensor/test/TConvertDataType.cpp
#### Selection (Select)
##### What is tensor selection?
Tensor selection extracts part of a tensor, either using an index tensor or by a range of positions along a given dimension.
##### Calling Select
NiuTrans.Tensor provides tensor selection. The calling conventions and parameters are as follows:
The first form selects from the tensor using an index matrix of 0s and 1s:
```
void _Select(const XTensor * a, XTensor * c, XTensor * indexCPU)
XTensor Select(const XTensor &a, XTensor &indexCPU)
```
Parameters:
* a - input tensor
* c - output tensor
* indexCPU - the selection-flag tensor
The second form selects a range of positions from the tensor:
```
void _SelectRange(const XTensor * a, XTensor * c, int dim, int low, int high)
XTensor SelectRange(const XTensor &a, int dim, int low, int high)
```
Parameters:
* a - input tensor
* c - output tensor
* dim - the dimension along which selection is performed
* low - lower bound of the selection range
* high - upper bound of the selection range
>Note that a selection range of [1,3] selects the values at index positions 1 and 2.
##### Select example
Example code for tensor selection, where s is the input tensor and t the output; here the range [1,3] is selected along the third dimension:
```
/* call SelectRange function */
t = SelectRange(*s, 2, 1, 3);
```
Detailed example code for tensor selection can be found in the NiuTrans.Tensor test directory.
#### SetData
##### Calling SetData
NiuTrans.Tensor provides the SetData operation. The calling conventions and parameters are as follows:
Set the tensor to a fixed value:
```
void _SetDataFixed(XTensor * tensor, void * valuePointer)
void SetDataFixed(XTensor &tensor, DTYPE p)
```
Parameters:
* tensor - input tensor
* valuePointer - pointer to the value
* p - the value to set
Set the tensor to a fixed integer value:
```
void _SetDataFixedInt(XTensor * tensor, int p)
```
Parameters:
* tensor - input tensor
* p - the fixed integer value
Set the tensor to a fixed single-precision float value:
```
void _SetDataFixedFloat(XTensor * tensor, float p)
```
Parameters:
* tensor - input tensor
* p - the fixed single-precision value
Set the tensor to a fixed double-precision float value:
```
void _SetDataFixedDouble(XTensor * tensor, double p)
```
Parameters:
* tensor - input tensor
* p - the fixed double-precision value
Fill the tensor from a uniform distribution:
```
void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
```
Parameters:
* tensor - input tensor
* low - lower bound
* high - upper bound
Fill the tensor from a normal distribution:
```
void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
```
Parameters:
* tensor - input tensor
* mean - mean
* standardDeviation - standard deviation
##### SetData example
```
s->SetDataRand(0.0, 1.0);
```
For detailed example code of SetData, see:
NiuTrans.Tensor/Tensor/test/TSetData.cpp
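A consolidated sketch of the setters above (s is a hypothetical float tensor created as in the earlier examples):
```
/* set every element of the tensor to 1.0 */
_SetDataFixedFloat(s, 1.0F);

/* overwrite it with uniform random values in [0, 1] */
_SetDataRand(s, 0.0, 1.0);

/* or with samples from a normal distribution
   with mean 0 and standard deviation 1 */
_SetDataRandN(s, 0.0, 1.0);
```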
### Mathematical Operations (math)
This part covers non-basic algebraic operations such as log, exp, and power.
#### Logarithm (Log)
##### What is the logarithm of a tensor?
The logarithm operation takes the logarithm of every element of a tensor, producing a new tensor.
##### Calling Log
NiuTrans.Tensor provides the Log operation. The calling conventions and parameters are as follows:
```
void _Log(const XTensor * a, XTensor * b)
void _LogMe(XTensor * a)
XTensor Log(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Log example
Example code for Log:
```
/* call Log function */
b = Log(*a);
```
For detailed example code of Log, see:
NiuTrans.Tensor/Tensor/test/TLog.cpp
#### Normalization (Normalize)
##### Calling Normalize
NiuTrans.Tensor provides the Normalize operation, which normalizes the input along a dimension as y = a * (x - mean) / sqrt(var + epsilon) + b. The calling conventions and parameters are as follows:
```
void _Normalize(const XTensor * input, XTensor * output, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
void _NormalizeMe(XTensor * input, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
XTensor Normalize(const XTensor &input, int dim, const XTensor &mean, const XTensor &var, const XTensor &a, const XTensor &b, DTYPE epsilon)
```
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which normalization is performed
* mean - the mean
* var - the variance
* a - the scale tensor
* b - the bias tensor
* epsilon - a small constant for numerical stability
##### Normalize example
Example code for Normalize:
```
/* call normalize function */
t = Normalize(*s, 0, *mean, *var, *a, *b, 0.0F);
```
Detailed example code for Normalize can be found in the NiuTrans.Tensor test directory.
#### Power (Power)
##### Calling Power
NiuTrans.Tensor provides the element-wise power operation. The calling conventions are:
```
void _Power(const XTensor * a, XTensor * b, DTYPE p)
void _PowerMe(XTensor * a, DTYPE p)
XTensor Power(const XTensor & a, DTYPE p)
```
Here a is the tensor being operated on and p the exponent. The parameters are:
Parameters:
* a - input tensor
* b - output tensor
* p - the exponent
##### Power example
The following code raises every element of a to the power 2.0:
```
/* call power function */
b = Power(*a, 2.0F);
```
Detailed example code for the power operation can be found in the NiuTrans.Tensor test directory.
#### Scale and Shift (ScaleAndShift)
##### Calling ScaleAndShift
NiuTrans.Tensor provides the scale-and-shift operation. The calling conventions are:
```
void _ScaleAndShift(const XTensor * a, XTensor * b, DTYPE scale, DTYPE shift = 0)
void _ScaleAndShiftMe(XTensor * a, DTYPE scale, DTYPE shift = 0)
XTensor ScaleAndShift(const XTensor &a, DTYPE scale, DTYPE shift = 0)
```
The result of the operation is p = p * scale + shift, where scale and shift are the scaling and shifting parameters. The parameters are:
Parameters:
* a - input tensor
* b - output tensor
* scale - scaling parameter
* shift - shifting parameter
##### ScaleAndShift example
Example code for scale-and-shift, where s is the input tensor, scaleFactor the scaling parameter, and shiftFactor the shifting parameter:
```
/* call ScaleAndShift function */
t = ScaleAndShift(*s, scaleFactor, shiftFactor);
```
For detailed example code of scale-and-shift, see:
NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp
### Data Movement (movement)
This part describes the data-copy functions.
#### CopyIndexed
##### What is CopyIndexed?
CopyIndexed copies, along a given dimension, the slices at the source indices into the positions given by the target indices.
##### Calling CopyIndexed
NiuTrans.Tensor provides the CopyIndexed operation. The calling conventions and parameters are as follows:
```
void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
XTensor CopyIndexed(const XTensor &s, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
```
Parameters:
* s - input tensor
* t - output tensor
* dim - the dimension along which CopyIndexed operates
* srcIndex - source indices, i.e. the indices of the values to copy along dim
* indexSize - the number of source indices
##### CopyIndexed example
Example code for CopyIndexed, where s is the input tensor and t the output; one element is copied to the target tensor along the given dimension, starting from the given index:
```
/* call CopyIndexed function */
t = CopyIndexed(*s, dim, srcIndex, indexSize, tgtIndex, copyNum);
```
For detailed example code of CopyIndexed, see:
NiuTrans.Tensor/Tensor/test/TCopyIndexed.cpp
#### Copy (CopyValues)
##### What is the copy operation on tensors?
Copying assigns the values of one tensor to another, i.e. it duplicates the tensor. For a \\(2 \times 4\\) tensor the process looks like this:
$$
\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
\left(
\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
$$
##### Calling CopyValues
NiuTrans.Tensor provides the tensor-copy operation. The calling conventions and parameters are as follows:
```
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL)
XTensor CopyValues(const XTensor &s, XStream * stream = NULL)
```
Parameters:
* s - input tensor
* t - output tensor
* stream - multi-thread stream
##### CopyValues example
Example code for tensor copy, where s is the input tensor and t the output:
```
/* call CopyValues function */
t = CopyValues(*s);
```
For detailed example code of tensor copy, see:
NiuTrans.Tensor/Tensor/test/TCopyValues.cpp
### Reduction Operations (reduce)
#### Reduce Max (ReduceMax)
##### What is reduce max?
Reduce max takes the maximum of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, reducing along dimension 0 and dimension 1 respectively looks like this:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
$$
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}3.0\\\\7.0\end{matrix}\right)
$$
##### Calling ReduceMax
NiuTrans.Tensor provides the ReduceMax operation, which obtains the maximum values along a given dimension. The calling conventions and parameters are as follows:
```
void _ReduceMax(const XTensor * input, XTensor * output, int dim)
XTensor ReduceMax(const XTensor &input, int dim)
```
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the maximum is taken
##### ReduceMax example
The following code calls ReduceMax; the two lines reduce along dimension 0 and dimension 1 respectively:
```
/* call reduce max function */
t = ReduceMax(*s, 0);
t = ReduceMax(*s, 1);
```
For detailed example code of reduce max, see:
NiuTrans.Tensor/Tensor/test/TReduceMax.cpp
#### Reduce Mean (ReduceMean)
##### Calling ReduceMean
NiuTrans.Tensor provides the ReduceMean operation. The calling conventions are:
```
void _ReduceMean(const XTensor * input, XTensor * output, int dim)
XTensor ReduceMean(const XTensor &input, int dim)
```
ReduceMean obtains the mean of the values along a given dimension. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the mean is taken
##### ReduceMean example
The following code calls ReduceMean; the two lines reduce along dimension 0 and dimension 1 respectively:
```
/* call reduce mean function */
t = ReduceMean(*s, 0);
t = ReduceMean(*s, 1);
```
For detailed example code of reduce mean, see:
NiuTrans.Tensor/Tensor/test/TReduceMean.cpp
#### Reduce Sum (ReduceSum)
##### What is reduce sum?
Reduce sum computes the sum of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, summing along dimension 0 and dimension 1 respectively looks like this:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 6.0 & 8.0 & 10.0\end{matrix}\right)
$$
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}6.0\\\\22.0\end{matrix}\right)
$$
##### Calling ReduceSum
NiuTrans.Tensor provides the ReduceSum operation. The calling conventions are:
```
void _ReduceSum(const XTensor * input, XTensor * output, int dim, const XTensor * shift = NULL, DTYPE power = (DTYPE)1.0F, bool isExp = false)
XTensor ReduceSum(const XTensor &input, int dim, const XTensor &shift = NULLTensor, DTYPE power = (DTYPE)1.0F, bool isExp = false)
```
Here shift defaults to NULL, power to 1.0F, and isExp to false. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the sum is taken
* shift - input offset, NULL by default
* power - the power applied to each element, 1.0F by default
* isExp - whether to exponentiate, false by default
##### ReduceSum example
The following code calls ReduceSum; the two lines reduce along dimension 0 and dimension 1 respectively:
```
/* call reduce sum function */
t1 = ReduceSum(*s, 0, *shift1);
t2 = ReduceSum(*s, 1, *shift2);
```
For detailed example code of reduce sum, see:
NiuTrans.Tensor/Tensor/test/TReduceSum.cpp
#### Reduce Sum Squared (ReduceSumSquared)
##### What is reduce sum squared?
Reduce sum squared computes, along a given dimension, the sum of the squared differences between the elements and a shift value. For a \\(2 \times 4\\) tensor reduced along dimension 0:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}8.0 & 8.0 & 8.0 & 8.0\end{matrix}\right)
$$
##### Calling ReduceSumSquared
NiuTrans.Tensor provides the ReduceSumSquared operation. The calling conventions are:
```
void _ReduceSumSquared(const XTensor * input, XTensor * output, int dim, const XTensor * shift)
XTensor ReduceSumSquared(const XTensor &input, int dim, const XTensor &shift)
```
ReduceSumSquared computes the sum of squared deviations of the elements along a given dimension.
##### ReduceSumSquared example
The following code calls ReduceSumSquared:
```
/* call reduce sum squared function */
t = ReduceSumSquared(*s, 0, *shift);
```
Detailed example code for ReduceSumSquared can be found in the NiuTrans.Tensor test directory.
#### Reduce Variance (ReduceVariance)
##### Calling ReduceVariance
NiuTrans.Tensor provides the ReduceVariance operation. The calling conventions are:
```
void _ReduceVariance(const XTensor * input, XTensor * output, int dim, const XTensor * mean)
XTensor ReduceVariance(const XTensor &input, int dim, const XTensor &mean)
```
ReduceVariance computes the variance of the elements along a given dimension.
##### ReduceVariance example
The following code calls ReduceVariance:
```
/* call reduce variance function */
t = ReduceVariance(*s, 0, *mean);
```
For detailed example code of ReduceVariance, see:
NiuTrans.Tensor/Tensor/test/TReduceVariance.cpp
### Shape Transformations (shape)
This part covers shape-changing functions such as split, merge, and reshape.
#### Concatenation (Concatenate)
##### Calling Concatenate
NiuTrans.Tensor provides concatenation of tensors. There are two calling conventions.
In the first, the operands are a list: the tensors to concatenate are stored in the list smalls and the result in the tensor big:
```
void _Concatenate(const XList * smalls, XTensor * big, int dim)
XTensor Concatenate(const XList &smalls, int dim)
```
Parameters:
* smalls - the list of tensors to concatenate
* big - output tensor
* dim - the dimension along which to concatenate
In the second, instead of a list, a sequence of tensors is concatenated directly:
```
void _Concatenate(const XTensor * smallA, const XTensor * smallB, XTensor * big, int dim)
XTensor Concatenate(const XTensor &smallA, const XTensor &smallB, int dim)
```
Parameters:
* smallA - input tensor 1
* smallB - input tensor 2
* big - output tensor
* dim - the dimension along which to concatenate
##### Concatenate example
Concatenating via a tensor list, where sList holds the tensors to concatenate and t is the result:
```
/* call concatenate function */
t = Concatenate(*sList, 1);
```
Concatenating a sequence of tensors directly, where s1 and s2 are the tensors to concatenate and t is the result:
```
/* call concatenate function */
t = Concatenate(*s1, *s2, 1);
```
For detailed example code of concatenation, see:
NiuTrans.Tensor/Tensor/test/TConcatenate.cpp
#### Merge (Merge)
##### What is the merge operation on tensors?
Merging is similar to concatenation: along a given dimension, it can merge one tensor into another tensor with a different number of dimensions, or merge all tensors in a list into one larger tensor.
In the first case, a \\(2 \times 2 \times 3\\) tensor is merged at dimension 1, with dimension 0 being folded in, yielding a \\(4 \times 3\\) tensor:
$$
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
\\\\ & \left(
\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
\right) \Biggr)
\end{aligned} \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
In the second case, two \\(2 \times 3\\) tensors are merged along dimension 0 into a \\(4 \times 3\\) tensor:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
##### Calling Merge
NiuTrans.Tensor provides the merge operation. There are two calling conventions.
In the first, one dimension of the source tensor s is merged; the result is tensor t, whereToMerge names the dimension merged into, and leadingDim names the dimension being folded in, e.g. (N/2, 2, M) -> (N, M). The parameters are listed below:
```
void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim = -1)
XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim = -1)
```
Parameters:
* s - input tensor
* t - output tensor
* whereToMerge - the dimension merged into
* leadingDim - the dimension folded in
In the second, the tensors to merge are stored in the list smalls and the result is tensor big; whereToMerge names the dimension merged along, e.g. 2 * (N/2, M) -> (N, M). The parameters are listed below:
```
void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
XTensor Merge(const XList &smalls, int whereToMerge)
```
Parameters:
* smalls - the list of tensors to merge
* big - output tensor
* whereToMerge - the dimension merged along
##### Merge example
The first form, where s is the tensor to merge and t the result; 1 selects dimension 1 as the merge target and 0 folds dimension 0 in:
```
/* call merge function */
t = Merge(*s, 1, 0);
```
The second form, where sList is the list of tensors to merge and t the result; 0 merges along dimension 0:
```
/* call merge function */
t = Merge(*sList, 0);
```
For detailed example code of merge, see:
NiuTrans.Tensor/Tensor/test/TMerge.cpp
#### Split (Split)
##### What is the split operation on tensors?
Splitting works along a given dimension: it can split one tensor into a tensor of different shape, or split a large tensor into a list of n smaller tensors.
In the first case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 parts, yielding a \\(2 \times 2 \times 3\\) tensor:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
\\\\ & \left(
\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
\right) \Biggr)
\end{aligned}
$$
In the second case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 parts, yielding two \\(2 \times 3\\) tensors:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
##### Calling Split
NiuTrans.Tensor provides two split operations.
In the first, one dimension of the source tensor is split; the result is tensor t, whereToSplit names the dimension to split, and splitNum is the number of parts, e.g. (N, M) -> (N/3, M, 3). The parameters are:
```
void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
```
Parameters:
* s - input tensor
* t - output tensor
* whereToSplit - the dimension to split
* splitNum - the number of parts
In the second, the tensor big is split along the dimension whereToSplit; the result is the list smalls of smaller tensors, with splitNum the number of parts, e.g. (N, M) -> 2 * (N/2, M). The parameters are:
```
void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
XList SplitList(const XTensor &big, int whereToSplit, int splitNum)
```
Parameters:
* big - input tensor
* smalls - the list holding the split-out tensors
* whereToSplit - the dimension to split
* splitNum - the number of parts
##### Split example
The first form, where s is the tensor to split and t the result; 0 splits along dimension 0 and 2 is the number of parts:
```
/* call split function */
t = Split(*s, 0, 2);
```
The second form, where s is the tensor to split and tList the list holding the results; 1 splits along dimension 1 and 2 is the number of parts:
```
/* call split function */
Split(*s, tList, 1, 2);
```
For detailed example code of split, see:
NiuTrans.Tensor/Tensor/test/TSplit.cpp
#### Unsqueeze
##### Calling Unsqueeze
NiuTrans.Tensor provides the Unsqueeze operation. The calling conventions and parameters are as follows:
```
void _Unsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
XTensor Unsqueeze(const XTensor &a, int dim, int dSize)
```
Parameters:
* a - input tensor
* b - output tensor
* dim - the position at which the new dimension is inserted
* dSize - the size of the inserted dimension
##### Unsqueeze example
Example code for Unsqueeze, where s is the input tensor and t1, t2 the outputs; the two lines insert a dimension of size 2 at positions 1 and 2 respectively:
```
/* call Unsqueeze function */
t1 = Unsqueeze(*s, 1, 2);
t2 = Unsqueeze(*s, 2, 2);
```
For detailed example code of Unsqueeze, see:
NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp
### Sorting Operations (sort)
This part covers sorting-related functions such as sort and topk.
#### Sort (Sort)
##### Calling Sort
NiuTrans.Tensor provides the Sort operation, which orders the elements of a tensor along a given dimension. The calling conventions and parameters are as follows:
```
void _Sort(const XTensor * a, XTensor * b, XTensor * index, int dim)
void _SortMe(XTensor * a, XTensor * index, int dim)
void Sort(XTensor & a, XTensor & b, XTensor & index, int dim)
```
Parameters:
* a - input tensor
* b - output tensor
* index - the indices of the elements in the output tensor
* dim - the dimension along which to sort
##### Sort example
Example code for Sort, where a is the tensor being operated on and index holds the indices of the elements in the result; here sorting is along dimension 0:
```
/* call Sort function */
Sort(*a, b, *index, 0);
```
For detailed example code of Sort, see NiuTrans.Tensor/Tensor/test/TSort.cpp
#### TopK
##### Calling TopK
NiuTrans.Tensor provides the TopK operation. The calling conventions and parameters are as follows:
```
void _TopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
void TopK(XTensor &a, XTensor &b, XTensor &index, int dim, int k)
```
Parameters:
* a - input tensor
* b - output tensor
* index - the indices of the results
* dim - the dimension along which TopK operates
* k - the number of largest values to take
##### TopK example
Example code for TopK, where s is the input tensor, t the output, and index the result indices; here the top k values are taken along dimension dim:
```
/* call TopK function */
int dim = 0;
int k = inputDimSize[dim];
TopK(s, t, index, dim, k);
```
For detailed example code of TopK, see NiuTrans.Tensor/Tensor/test/TTopK.cpp
### Activation Functions (function)
This part covers the activation functions and loss functions.
#### HardTanH
##### What is HardTanH?
HardTanH is an activation function, defined as:
>y = 1 if x > 1; y = x if -1 <= x <= 1; y = -1 if x < -1
##### Calling HardTanH
NiuTrans.Tensor provides the HardTanH activation function. The calling conventions and parameters are as follows:
```
void _HardTanH(const XTensor * x, XTensor * y)
XTensor HardTanH(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### HardTanH example
Example code for HardTanH, where x is the input tensor and y the output tensor:
```
/* call hardtanh function */
y = HardTanH(*x);
```
For detailed example code of HardTanH, see:
NiuTrans.Tensor/Tensor/test/THardTanH.cpp
#### Identity
##### What is Identity?
Identity is an activation function, defined as:
>y = x
##### Calling Identity
NiuTrans.Tensor provides the Identity activation function. The calling conventions and parameters are as follows:
```
void _Identity(const XTensor * x, XTensor * y)
XTensor Identity(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### Identity example
Example code for Identity, where x is the input tensor and y the output tensor:
```
/* call Identity function */
y = Identity(*x);
```
For detailed example code of Identity, see:
NiuTrans.Tensor/Tensor/test/TIdentity.cpp
#### LogSoftmax
##### What is LogSoftmax?
LogSoftmax is an activation function, defined as:
>y = log(e^x / \sum e^x)
##### Calling LogSoftmax
NiuTrans.Tensor provides the LogSoftmax activation function. The calling conventions and parameters are as follows:
```
void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
XTensor LogSoftmax(const XTensor &x, int leadDim)
```
Parameters:
* x - input tensor
* y - output tensor
* leadDim - the dimension along which the operation is performed
##### LogSoftmax example
Example code for LogSoftmax, where x is the input tensor and y the output tensor; here LogSoftmax is applied along dimension 1:
```
/* call LogSoftmax function */
y = LogSoftmax(*x, 1);
```
For detailed example code of LogSoftmax, see:
NiuTrans.Tensor/Tensor/test/TLogSoftmax.cpp
#### Loss
##### What is Loss?
A loss function measures the quality of a neural network model and serves as its optimization objective. The available definitions are:
>squared error : loss = sum_{i} 0.5*(gold_i - output_i)^2 <br />
cross entropy : loss = sum_{i} (-gold_i * log(output_i)) <br />
one hot error : loss = sum_{i} e_i, where e_i = 0.5*(t_i - y_i)^2 if t_i = 1, and e_i = 0 otherwise
##### Calling Loss
NiuTrans.Tensor provides the loss computation. The calling convention and parameters are as follows:
```
DTYPE _LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName, bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
```
Parameters:
* gold - the gold-standard answers
* output - the model's predictions
* LFName - the name of the loss function
* isLogOutput - whether the output is in log form
* leadDim - the dimension along which the loss is computed
* gBeg - starting position of the gold-standard answers along leadDim
* gLen - length of the gold-standard segment starting at gBeg
* oBeg - starting position of the model predictions along leadDim
##### Loss example
Example code for the loss computation:
```
/* call LossCompute function */
error = _LossCompute(gold, output, SQUAREDERROR, false, 0, 0, dimSize[0], 0);
```
For detailed example code of Loss, see:
NiuTrans.Tensor/Tensor/test/TLoss.cpp
#### Rectify
##### What is Rectify?
Rectify is an activation function, defined as:
>y = max(0, x)
##### Calling Rectify
NiuTrans.Tensor provides the Rectify activation function. The calling conventions and parameters are as follows:
```
void _Rectify(const XTensor * x, XTensor * y)
XTensor Rectify(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### Rectify example
Example code for Rectify, where x is the input tensor and y the output tensor:
```
/* call Rectify function */
y = Rectify(*x);
```
For detailed example code of Rectify, see:
NiuTrans.Tensor/Tensor/test/TRectify.cpp
#### Sigmoid
##### What is Sigmoid?
Sigmoid is an activation function, defined as:
>y = 1 / (1 + e^{-x})
##### Calling Sigmoid
NiuTrans.Tensor provides the Sigmoid activation function. The calling conventions and parameters are as follows:
```
void _Sigmoid(const XTensor * x, XTensor * y)
XTensor Sigmoid(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### Sigmoid example
Example code for Sigmoid, where x is the input tensor and y the output tensor:
```
/* call Sigmoid function */
y = Sigmoid(*x);
```
Detailed example code for Sigmoid can be found in the NiuTrans.Tensor test directory.
#### Softmax
##### What is Softmax?
Softmax is an activation function, defined as:
>y = e^x / \sum e^x
##### Calling Softmax
NiuTrans.Tensor provides the Softmax activation function. The calling conventions and parameters are as follows:
```
void _Softmax(const XTensor * x, XTensor * y, int leadDim)
XTensor Softmax(const XTensor &x, int leadDim)
```
Parameters:
* x - input tensor
* y - output tensor
* leadDim - the dimension along which the operation is performed
##### Softmax example
Example code for Softmax, where x is the input tensor and y the output tensor; here Softmax is applied along dimension 1:
```
/* call Softmax function */
y = Softmax(*x, 1);
```
For detailed example code of Softmax, see:
NiuTrans.Tensor/Tensor/test/TSoftmax.cpp
## Advanced Techniques
### Memory Pool
Memory is an essential resource for any running piece of software, and how efficiently it is managed has a large impact on overall system performance, especially on speed. Mainstream programming languages provide system-level interfaces for memory management (malloc and free in C, new and delete in C++, and so on), but because these interfaces must handle every possible usage pattern, they are not always the best fit for a particular workload (for example, one with strict speed requirements). Using the system-level interfaces directly has the following drawbacks:

1. Allocation and deallocation are expensive: to keep memory well utilized, the operating system selects and merges candidate memory blocks on every allocation or deallocation, and this extra work makes frequent memory operations costly.
2. Program efficiency degrades: because the requested block sizes vary, frequent use of the system-level interfaces tends to fragment the address space and slow the whole system down.
3. Memory leaks become easy to introduce: space obtained through the system-level interfaces generally has to be released explicitly by the developer, and a single oversight causes a leak; careful analysis of memory usage and dedicated leak-detection tools are therefore required.

Furthermore, when the system also manages memory on a GPU device, allocation and deallocation cost even more than for ordinary host memory: a device-memory operation interrupts what the CPU is doing and hands control to the GPU, so it is far more expensive than a host allocation, and frequent device-memory operations slow the system down even more severely.

To address these problems, the system supports a memory pool for managing storage space (both host memory and GPU memory). The idea is to request one large block from the system up front and let the program itself (the pool) manage that space. Allocations and releases then no longer require frequent calls into the system interfaces, which removes the cost of interrupts and best-block searches and also avoids fragmentation. Moreover, since the pool itself is requested only once, it cannot produce large-scale leaks across the system, which helps stability.

Concretely, using the memory pool (XMem) in the NiuTrans.Tensor toolkit takes only three steps: defining the pool, using it, and releasing it.

* Defining a memory pool

In the simplest case, defining a pool only requires a device ID, as in the following snippet.
```
// define a memory pool mem of type XMem
XMem * mem = new XMem(devID);
```
If more control is needed, the pool can be configured at definition time through parameters such as myMode, myBlockSize, myBlockNum and myBufSize, which set the usage mode, the block size, the number of blocks and the buffer size.

* Using a memory pool

Once the pool is defined, variables can be created and used on it. Taking tensor definition as an example:
```
// declare a variable tensor of type XTensor
XTensor tensor;

// initialize it on the memory pool as a 50 * 100 matrix (an order-2 tensor)
InitTensor2D(&tensor, 50, 100, X_FLOAT, -1, mem);
```
Compared with the earlier definitions that did not use a memory pool, the only change is to name the pool at definition time; no further work is needed.

* Releasing a memory pool

To release the pool completely, simply delete it:
```
// delete the memory pool mem
delete mem;
```
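Putting the three steps together, a minimal end-to-end sketch looks as follows (devID = -1 selects the CPU, and the single delete releases everything allocated on the pool):
```
/* define a memory pool on the CPU */
XMem * mem = new XMem(-1);

/* initialize a 50 * 100 matrix (an order-2 tensor) on the pool */
XTensor tensor;
InitTensor2D(&tensor, 50, 100, X_FLOAT, -1, mem);

/* ... computations on the tensor ... */

/* release the pool, and with it all space it manages */
delete mem;
```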
## Example 1: Matrix Multiplication
This example walks through a matrix multiplication: we first define the tensor dimensions, then initialize two matrices of sizes 2*3 and 3*2, assign their values with the SetData() method, and finally compute their product.
The complete matrix multiplication code is in NiuTrans.Tensor/Tensor/sample/mul/.
```
#include "mul.h"
namespace nts
{
void sampleMUL1()
{
DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
{ -4.0F, 5.0F, 6.0F } };
DTYPE bData[3][2] = { { 0.0F, -1.0F },
{ 1.0F, 2.0F },
{ 2.0F, 1.0F } };
DTYPE answer[2][2] = { { 8.0F, 6.0F },
{ 17.0F, 20.0F } };
/* a source tensor of size (2, 3) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 3;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a source tensor of size (3, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 3;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
/* a target tensor of size (2, 2) */
int resultOrder = 2;
int * resultDimSize = new int[resultOrder];
resultDimSize[0] = 2;
resultDimSize[1] = 2;
int resultUnitNum = 1;
for (int i = 0; i < resultOrder; i++)
resultUnitNum *= resultDimSize[i];
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * result = NewTensor(resultOrder, resultDimSize);
/* initialize variables */
a->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
result->SetZeroAll();
/* call MatrixMul function */
_MatrixMul(a, X_NOTRANS, b, X_NOTRANS, result);
result->Dump(stderr, "result:");
/* destroy variables */
delete[] aDimSize;
delete[] bDimSize;
delete[] resultDimSize;
delete a;
delete b;
delete result;
}
```
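As a quick sanity check against the answer array: the first row of the product is (1·0 + 2·1 + 3·2, 1·(-1) + 2·2 + 3·1) = (8, 6) and the second row is (-4·0 + 5·1 + 6·2, -4·(-1) + 5·2 + 6·1) = (17, 20), which is exactly what result->Dump prints.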
## Example 2: A Feed-Forward Neural Network
Below we implement a simple feed-forward neural network language model.
Language modeling is the task of building a mathematical model of a language. Before neural networks appeared, language models were usually built with statistical methods. The most common are n-gram models, which count the co-occurrence frequencies of word sequences in text and apply smoothing to correct the estimates for unseen word combinations, finally producing probabilities for consecutive words in the language. Compared with traditional statistical models, a neural language model learns similarities between words while learning their co-occurrence; relative to smoothing, it predicts unseen combinations of known words more effectively and achieves better performance.
The neural language model was first systematically proposed and studied in depth by Bengio et al. Its overall structure is similar to an ordinary feed-forward network: an input layer, hidden layers and an output layer, with connections between layers; each layer maps the vector it receives into another space as its output.
The main control flow of the feed-forward neural network language model is as follows:
```
int FNNLMMain(int argc, const char ** argv)
{
    if(argc == 0)
        return 1;

    FNNModel model;

    /* load arguments */
    LoadArgs(argc, argv, model);

    /* check the setting */
    Check(model);

    /* initialize model parameters */
    Init(model);

    /* learn model parameters */
    if(strcmp(trainFN, ""))
        Train(trainFN, shuffled, model);

    /* save the final model */
    if(strcmp(modelFN, "") && strcmp(trainFN, ""))
        Dump(modelFN, model);

    /* load the model if necessary */
    if(strcmp(modelFN, ""))
        Read(modelFN, model);

    /* test the model on the new data */
    if(strcmp(testFN, "") && strcmp(outputFN, ""))
        Test(testFN, outputFN, model);

    return 0;
}
```
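As the main() dispatch shown later in this document indicates, this entry point is reached by passing the -fnnlm flag on the command line; the remaining arguments are handed to LoadArgs, whose option names are not reproduced here.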
Initializing the model parameters:
```
/* initialize the model */
void Init(FNNModel &model)
{
    /* create embedding parameter matrix: vSize * eSize */
    InitModelTensor2D(model.embeddingW, model.vSize, model.eSize, model);

    /* create hidden layer parameter matrices */
    for(int i = 0; i < model.hDepth; i++){
        /* hidden layer parameter matrix: (n-1)eSize * hSize if it is the first layer,
           hSize * hSize otherwise */
        if(i == 0)
            InitModelTensor2D(model.hiddenW[i], (model.n - 1) * model.eSize, model.hSize, model);
        else
            InitModelTensor2D(model.hiddenW[i], model.hSize, model.hSize, model);

        /* bias term: a row vector of hSize entries */
        InitModelTensor1D(model.hiddenB[i], model.hSize, model);
    }

    /* create the output layer parameter matrix and bias term */
    int iSize = model.hDepth == 0 ? (model.n - 1) * model.eSize : model.hSize;
    InitModelTensor2D(model.outputW, iSize, model.vSize, model);
    InitModelTensor1D(model.outputB, model.vSize, model);

    /* then, we initialize model parameters using a uniform distribution in the range
       [-minmax, minmax] */
    model.embeddingW.SetDataRand(-minmax, minmax);
    model.outputW.SetDataRand(-minmax, minmax);
    for(int i = 0; i < model.hDepth; i++)
        model.hiddenW[i].SetDataRand(-minmax, minmax);

    /* all bias terms are set to zero */
    model.outputB.SetZeroAll();
    for(int i = 0; i < model.hDepth; i++)
        model.hiddenB[i].SetZeroAll();
}
```
The training procedure:
```
void Train(const char * train, bool isShuffled, FNNModel &model)
{
    char name[MAX_NAME_LENGTH];

    /* shuffle the data */
    if(isShuffled){
        sprintf(name, "%s-tmp", train);
        Shuffle(train, name);
    }
    else
        strcpy(name, train);

    int epoch = 0;
    int step = 0;
    int wordCount = 0;
    int wordCountTotal = 0;
    int ngramNum = 1;
    float loss = 0;
    bool isEnd = false;

    NGram * ngrams = new NGram[MAX_LINE_LENGTH_HERE];

    /* make a model to keep gradients */
    FNNModel grad;
    Copy(grad, model);

    /* XNet for automatic differentiation */
    XNet autoDiffer;

    double startT = GetClockSec();

    /* iterate for a number of epochs */
    for(epoch = 0; epoch < nEpoch; epoch++){

        /* data file */
        FILE * file = fopen(name, "rb");
        CheckErrors(file, "Cannot open the training file");

        wordCount = 0;
        loss = 0;
        ngramNum = 1;

        while(ngramNum > 0){

            /* load a minibatch of ngrams */
            ngramNum = LoadNGrams(file, model.n, ngrams, sentBatch, wordBatch);

            if (ngramNum <= 0)
                break;

            /* previous n - 1 words */
            XTensor inputs[MAX_N_GRAM];

            /* the predicted word */
            XTensor output;

            /* the gold standard */
            XTensor gold;

            /* make the input tensor for position i */
            for(int i = 0; i < model.n - 1; i++)
                MakeWordBatch(inputs[i], ngrams, ngramNum, i, model.vSize, model.devID, model.mem);

            /* make the gold tensor */
            MakeWordBatch(gold, ngrams, ngramNum, model.n - 1, model.vSize, model.devID, model.mem);

            if(!autoDiff){
                /* prepare an empty network for building the fnn */
                FNNNet net;

                /* gradient = 0 */
                Clear(grad);

                /* forward computation */
                Forward(inputs, output, model, net);

                /* backward computation to obtain gradients */
                Backward(inputs, output, gold, CROSSENTROPY, model, grad, net);

                /* update model parameters */
                Update(model, grad, learningRate, false);
            }
            else{
                /* forward + backward process */
                ForwardAutoDiff(inputs, output, model);

                /* automatic differentiation */
                autoDiffer.Backward(output, gold, CROSSENTROPY);

                /* update model parameters */
                Update(model, grad, learningRate, true);
            }

            /* get probabilities */
            float prob = GetProb(output, gold);

            loss += -prob;
            wordCount += ngramNum;
            wordCountTotal += ngramNum;

            if(++step >= nStep){
                isEnd = true;
                break;
            }

            if (step % 100 == 0) {
                double elapsed = GetClockSec() - startT;
                XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, ngram=%d, ppl=%.3f\n",
                        elapsed, step, epoch + 1, wordCountTotal, exp(loss / wordCount));
            }
        }

        fclose(file);

        if(isEnd)
            break;
    }

    double elapsed = GetClockSec() - startT;
    XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, ngram=%d, ppl=%.3f\n",
            elapsed, step, epoch, wordCountTotal, exp(loss / wordCount));
    XPRINT3(0, stderr, "[INFO] training finished (took %.1fs, step=%d and epoch=%d)\n",
            elapsed, step, epoch);

    delete[] ngrams;
}
```
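Note that Train supports two modes selected by the autoDiff flag: a hand-written backward pass (Clear, Forward, Backward, Update) and automatic differentiation (ForwardAutoDiff followed by autoDiffer.Backward); only the final boolean passed to Update differs between the two branches.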
Only the main parts of the code are walked through here; see NiuTrans.Tensor/source/sample/FNNLM.cpp for the full implementation.
Forward pass of the feed-forward network: after data preparation we have the language-model input (the previous n-1 words). Multiplying the input by the input-layer weight matrix w1 (the word embeddings) gives a vector representation of each input word:
>embedding = input * w1
The n-1 word vectors are then concatenated to form the final output of the input layer.
In the same way, the input-layer output is passed through the hidden layers and the output layer to obtain the final result:
>h = tanh(h_pre * w2 + b)
>y = softmax(h_last * w3)
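For concreteness, suppose n = 3 (two history words), eSize = 100, hSize = 256 and vSize = 10000 (illustrative numbers, not defaults taken from the source). Each input is then a batchSize × 10000 one-hot matrix, each embedding is batchSize × 100, the concatenated input-layer output is batchSize × 200, every hidden layer produces batchSize × 256 states, and the output layer yields batchSize × 10000 probabilities.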
The forward pass is implemented as follows:
```
/*
forward procedure
>> inputs - input word representations
>> output - output probability
>> model - the fnn model
>> net - the network that keeps the internal tensors generated in the process
*/
void Forward(XTensor inputs[], XTensor &output, FNNModel &model, FNNNet &net)
{
    int batchSize = -1;
    int n = model.n;
    int depth = model.hDepth;
    XList eList(n - 1);

    /* previous n - 1 words */
    for(int i = 0; i < n - 1; i++){
        XTensor &input = inputs[i];
        XTensor &w = model.embeddingW;
        XTensor &embedding = net.embeddings[i];

        if(batchSize == -1)
            batchSize = input.dimSize[0];
        else{
            CheckErrors(batchSize == input.dimSize[0], "Wrong input word representations!");
        }

        /* embedding output tensor of position i */
        InitModelTensor2D(embedding, batchSize, model.eSize, model);

        /* generate word embedding of position i:
           embedding = input * w */
        _MatrixMul(&input, X_NOTRANS, &w, X_NOTRANS, &embedding);

        eList.Add(&net.embeddings[i]);
    }

    /* concatenate word embeddings
       embeddingcat = cat(embedding_0...embedding_{n-1}) */
    InitModelTensor2D(net.embeddingCat, batchSize, (n - 1) * model.eSize, model);
    _Concatenate(&eList, &net.embeddingCat, 1);

    /* go over each hidden layer */
    for(int i = 0; i < depth; i++){
        XTensor &h_pre = i == 0 ? net.embeddingCat : net.hiddens[i - 1];
        XTensor &w = model.hiddenW[i];
        XTensor &b = model.hiddenB[i];
        XTensor &h = net.hiddens[i];
        XTensor &s = net.hiddenStates[i];

        InitModelTensor2D(h, batchSize, model.hSize, model);
        InitModelTensor2D(s, batchSize, model.hSize, model);

        /* generate hidden states of layer i:
           s = h_pre * w */
        _MatrixMul(&h_pre, X_NOTRANS, &w, X_NOTRANS, &s);

        /* make a 2d tensor for the bias term */
        XTensor b2D;
        InitTensor(&b2D, &s);
        _Unsqueeze(&b, &b2D, 0, batchSize);

        /* introduce bias term:
           s = s + b
           NOTE: the trick here is to extend b to a 2d tensor
                 to fit into the 2d representation in tensor summation */
        _Sum(&s, &b2D, &s);

        /* pass the state through the hard tanh function:
           h = tanh(s) */
        _HardTanH(&s, &h);
    }

    /* generate the output Pr(w_{n-1}|w_0...w_{n-2}):
       y = softmax(h_last * w)
       Note that this is the implementation as that in Bengio et al.'s paper.
       TODO: we add bias term here */
    {
        XTensor &h_last = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
        XTensor &w = model.outputW;
        XTensor &b = model.outputB;
        XTensor &s = net.stateLast;
        XTensor &y = output;

        InitModelTensor2D(s, batchSize, model.vSize, model);
        InitModelTensor2D(y, batchSize, model.vSize, model);

        /* s = h_last * w */
        _MatrixMul(&h_last, X_NOTRANS, &w, X_NOTRANS, &s);

        XTensor b2D;
        InitTensor(&b2D, &s);
        _Unsqueeze(&b, &b2D, 0, batchSize);

        _Sum(&s, &b2D, &s);

        /* y = softmax(s) */
        _LogSoftmax(&s, &y, 1);
    }
}
```
Backward pass: we first compute the total loss L from the forward output and the gold standard, and then apply gradient descent: back-propagation yields the derivative ∂L/∂w of L with respect to the parameters w of every layer, after which each parameter is updated by
>w_{k+1} = w_k - η * ∂L/∂w_k
where η is the learning rate.
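For instance, with learning rate η = 0.1, a parameter w_k = 1.0 and gradient ∂L/∂w_k = 0.5, the rule gives w_{k+1} = 1.0 - 0.1 × 0.5 = 0.95.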
The backward pass and the subsequent parameter update are implemented as follows:
```
/*
backward procedure
>> inputs - input word representations
>> output - output probability
>> gold - gold standard
>> loss - loss function name
>> model - the fnn model
>> grad - the model that keeps the gradient information
>> net - the network that keeps the internal tensors generated in the process
*/
void Backward(XTensor inputs[], XTensor &output, XTensor &gold, LOSS_FUNCTION_NAME loss,
              FNNModel &model, FNNModel &grad, FNNNet &net)
{
    int batchSize = output.GetDim(0);
    int n = model.n;
    int depth = model.hDepth;

    /* back-propagation for the output layer */
    XTensor &y = output;
    XTensor &s = net.stateLast;
    XTensor &x = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
    XTensor &w = model.outputW;
    XTensor &dedw = grad.outputW;
    XTensor &dedb = grad.outputB;
    XTensor deds(&y);
    XTensor dedx(&x);

    /* for y = softmax(s), we get dE/ds
       where E is the error function (defined by loss) */
    _LogSoftmaxBackward(&gold, &y, &s, NULL, &deds, 1, loss);

    /* for s = x * w, we get
       dE/dw_{i,j} = dE/ds_j * ds_j/dw_{i,j}
                   = dE/ds_j * x_{i}
       (where i and j are the row and column indices, and
       x is the top most hidden layer)
       so we know
       dE/dw = x^T * dE/ds */
    _MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);

    /* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
       specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
    _ReduceSum(&deds, &dedb, 0);

    /* then, we compute
       dE/dx_{j} = \sum_j' (dE/ds_{j'} * ds_{j'}/dx_j)
                 = \sum_j' (dE/ds_{j'} * w_{j, j'})
       i.e.,
       dE/dx = dE/ds * w^T */
    _MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);

    XTensor &gradPassed = dedx;
    XTensor dedsHidden;
    XTensor dedxBottom;
    if (depth > 0)
        InitTensor(&dedsHidden, &dedx);
    InitTensor(&dedxBottom, &net.embeddingCat);

    /* back-propagation from top to bottom in the stack of hidden layers
       for each layer, h = f(s)
                       s = x * w + b */
    for (int i = depth - 1; i >= 0; i--) {
        XTensor &h = net.hiddens[i];
        XTensor &s = net.hiddenStates[i];
        XTensor &x = i == 0 ? net.embeddingCat : net.hiddenStates[i - 1];
        XTensor &w = model.hiddenW[i];
        XTensor &dedh = gradPassed;  // gradient passed through the previous layer
        XTensor &dedx = i == 0 ? dedxBottom : dedh;
        XTensor &deds = dedsHidden;
        XTensor &dedw = grad.hiddenW[i];
        XTensor &dedb = grad.hiddenB[i];

        /* backpropagation through the activation function:
           dE/ds = dE/dh * dh/ds */
        _HardTanHBackward(NULL, &h, &s, &dedh, &deds, NOLOSS);

        /* gradient of the weight: dE/dw = x^T * dE/ds */
        _MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);

        /* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
           specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
        _ReduceSum(&deds, &dedb, 0);

        /* gradient of the input: dE/dx = dE/ds * w^T */
        _MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);

        if (i > 0)
            _CopyValues(&dedx, &gradPassed);
    }

    XList eList(n - 1);

    /* back-propagation for the embedding layer */
    for (int i = 0; i < n - 1; i++) {
        XTensor * dedy = NewTensor2D(batchSize, model.eSize, X_FLOAT, model.devID, model.mem);
        eList.Add(dedy);
    }

    /* gradient of the concatenation of the embedding layers */
    XTensor &dedyCat = depth > 0 ? dedxBottom : dedx;

    /* split the concatenation of gradients of the embeddings */
    _Split(&dedyCat, &eList, 1, n - 1);

    /* go over for each word */
    for (int i = 0; i < n - 1; i++) {
        XTensor * dedy = (XTensor*)eList.GetItem(i);
        XTensor &x = inputs[i];
        XTensor &dedw = grad.embeddingW;

        /* gradient of the embedding weight: dE/dw += x^T * dE/dy
           NOTE that we accumulate dE/dw here because the matrix w
           is shared by several layers (or words) */
        _MatrixMul(&x, X_TRANS, dedy, X_NOTRANS, &dedw, 1.0F, 1.0F);

        delete dedy;
    }
}
```
## Example 3: Recurrent Neural Networks
## The NiuTrans.Tensor Team
* 肖桐
* 李垠桥
* 许晨
* 姜雨帆
* 林野
* 张裕浩
* 胡驰
## Appendix
The member variables defined in XTensor.h are listed below:
| Member variable | Function |
| - | - |
| int id | ID of the tensor |
| XMem * mem | memory pool that the tensor uses |
| void * data | array that keeps the tensor elements |
| void * dataHost | copy of the data on host memory, activated only when the tensor runs on a GPU |
| void ** dataP | pointer to the address of the data |
| int devID | device ID: the CPU or GPU device on which the tensor's space is allocated; -1 means CPU |
| int order | order of the tensor, e.g., a matrix (of dimension 2) is an order-2 tensor |
| int dimSize[ ] | size of each dimension; index 0 is the first dimension |
| int dimSizeRDI[ ] | size of each dimension in transposed (reversed-index) mode; index 0 is the first dimension |
| TENSOR_DATA_TYPE dataType | data type of each data unit |
| int unitSize | size of a data unit, similar to sizeof() |
| int unitNum | number of data units |
| bool isSparse | whether the tensor is sparse |
| int unitNumNonZero | number of non-zero elements in a sparse matrix |
| float denseRatio | density: the ratio of non-zero units, a real number between 0 and 1; 0 means all units are zero and 1 means all units are non-zero |
| bool isShared | whether the data array is shared with other tensors |
| bool isDefaultDType | whether the data type used in the matrix is the default data type |
| bool isInGlobalMem | whether the data resides in global memory rather than in a memory pool |
| bool isAllValued[ ] | whether every dimension of the sparse matrix has non-zero elements |
| bool isInit | whether the tensor has been initialized |
| bool isTmp | whether the tensor was created as a temporary |
| bool isGrad | whether the tensor keeps its gradient when used as a model parameter |
| unsigned int visitMark | node visiting flag |
| XTensor * grad | gradient for back-propagation |
| XLink income | incoming hyperedge |
| XLink outgo | outgoing hyperedge |
The methods defined in XTensor.h are listed below:
| Function | Method | Parameters |
| - | - | - |
| constructor | XTensor() | N/A |
| destructor | ~XTensor() | N/A |
| initialize member variables | void Init() | N/A |
| destroy the data array | void DestroyData() | N/A |
| shallow copy of a tensor | void ShallowCopy(<br>const XTensor &tensor) | tensor - the tensor to copy from |
| overloaded assignment operator | XTensor& operator= (<br>const XTensor &tensor) | tensor - the tensor on the right-hand side |
| overloaded addition operator | XTensor operator+ (<br>const XTensor &tensor) | tensor - the tensor on the right-hand side |
| overloaded multiplication operator | XTensor operator* (<br>const XTensor &tensor) | tensor - the tensor on the right-hand side |
| linear transformation | XTensor Lin(<br>DTYPE scale, DTYPE shift = 0) | scale - the scaling factor <br> shift - the shift |
| check whether two tensors have the<br>same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b) | a - the first tensor to compare <br> b - the second tensor to compare |
| check whether three tensors have the<br>same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b, XTensor * c) | a - the first tensor to compare <br> b - the second tensor to compare <br> c - the third tensor to compare |
| set the size of each dimension | void SetDim(<br>int * myDimSize) | myDimSize - size of each dimension |
| get the size of a given dimension | int GetDim(<br>const int dim) | dim - the dimension in question |
| reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - order of the tensor <br> myDimSize - size of each dimension |
| get the number of elements | int GetSize() | N/A |
| get the size of the memory used | int GetDataSizeInChar() | N/A |
| initialize the tensor with a<br>uniform distribution | void SetDataRand(<br> DTYPE lower, DTYPE upper) | lower - the lower bound <br> upper - the upper bound |
| initialize the tensor with a<br>normal distribution | void SetDataRandn(<br> DTYPE mean, DTYPE standardDeviation) | mean - the mean <br> standardDeviation - the standard deviation |
| check whether the elements equal<br>a given answer array | bool CheckData(<br> const void * answer, int num, int beg = 0) | answer - the given array <br> num - size of the array <br> beg - position in the tensor where the check starts |
| set the data pointer | void SetDataPointer() | N/A |
| set the elements along a given<br>dimension in ascending order | void SetAscendingOrder(<br>int dim) | dim - the given dimension |
| get the value of the cell at a given index | DTYPE Get(int index[], int size = -1) | index - the given index <br> size - size of the index array |
| get the pointer to a cell | void * GetCell(<br>int * index, int size) | index - position of the element <br> size - size of the index array |
| get the default-type value of a cell<br>of an order-1 tensor | DTYPE Get1D(<br>int i) | i - index in the first dimension |
| get the default-type value of a cell<br>of an order-2 tensor | DTYPE Get2D(<br>int ni, int mi) const | ni - index in the first dimension <br> mi - index in the second dimension |
| get the default-type value of a cell<br>of an order-3 tensor | DTYPE Get3D(<br>int d0, int d1, int d2) | d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
| get the integer value of a cell<br>of an order-1 tensor | int Get1DInt(<br>int i) | i - index in the first dimension |
| get the integer value of a cell<br>of an order-2 tensor | int Get2DInt(<br>int ni, int mi) | ni - index in the first dimension <br> mi - index in the second dimension |
| get the integer value of a cell<br>of an order-3 tensor | int Get3DInt(<br>int d0, int d1, int d2) | d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
| get a value from a sparse tensor | DTYPE GetInSparse(int i) | i - position of the non-zero element in the sparse matrix |
| get the key of a tuple in a sparse tensor | int GetKeyInSparse(int i) | i - position of the non-zero element in the sparse matrix |
| set the value of a cell | bool Set(<br>DTYPE value, int index[], int size = -1) | value - the value to set <br> index - position of the element <br> size - size of the index array |
| set the value of a cell of an order-1 tensor | bool Set1D(<br>DTYPE value, int i) | value - the value to set <br> i - index in the first dimension |
| set the value of a cell of an order-2 tensor | bool Set2D(<br>DTYPE value, int ni, int mi) | value - the value to set <br> ni - index in the first dimension <br> mi - index in the second dimension |
| set the value of a cell of an order-3 tensor | bool Set3D(<br>DTYPE value, int d0, int d1, int d2) | value - the value to set <br> d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
| add a value to a cell of an order-2 tensor | bool Add2D(<br>DTYPE value, int ni, int mi = 0) | value - the value to add <br> ni - row index <br> mi - column index |
| get the number of non-zero elements<br>in a sparse matrix | int GetNonzeroSize() | N/A |
| mark the tensor as a temporary variable | void SetTMP(<br>bool myIsTmp = true) | myIsTmp - whether the tensor is temporary |
| set whether the tensor keeps its gradient | void SetGrad(<br>bool myIsGrad = true) | myIsGrad - whether the gradient is kept |
| resize the matrix | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
| resize the matrix without<br>allocating new space | bool ResizeWithNoData(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
| resize the matrix to the size<br>of another matrix | bool Resize(<br> const XTensor * myTensor) | myTensor - the reference tensor |
| create a tensor in the buffer | XTensor * NewTensorBuf( <br> const int myOrder, <br> const int * myDimSize, XMem * myMem, <br> const TENSOR_DATA_TYPE myDataType = <br> X_FLOAT, const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myMem - memory pool used by the tensor <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
| create a new tensor from a given tensor | XTensor * NewTensor(<br>XTensor * a, bool isFilledData = true) | a - the given tensor <br> isFilledData - whether to allocate the data space |
| free the data space of a given tensor | void DelTensor(<br>const XTensor * tensor) | tensor - the given tensor |
| free the data space of a given<br>tensor in the buffer | void DelTensorBuf(<br>const XTensor * tensor) | tensor - the given tensor |
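To make the tables above concrete, here is a small sketch exercising a few of the listed calls (NewTensor, SetDataRand, Get2D, Set2D and DelTensor); it is only an illustration and assumes the headers and namespace described in the installation notes:
```
#include <cstdio>
#include "XTensor.h"

using namespace nts;

void AppendixDemo()
{
    /* create a dense 2 * 3 tensor (order 2) */
    int dimSize[2] = {2, 3};
    XTensor * t = NewTensor(2, dimSize);

    /* fill it with uniform random values in [-1, 1] */
    t->SetDataRand(-1.0F, 1.0F);

    /* read one cell and write back twice its value */
    DTYPE v = t->Get2D(0, 1);
    t->Set2D(v * 2, 0, 1);

    /* print the tensor, then free it */
    t->Dump(stderr, "t:");
    DelTensor(t);
}
```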
#include <stdio.h>
#include "XNet.h"
#include "../tensor/XUtility.h"
#include "../tensor/function/FHeader.h"
#include "../tensor/core/CHeader.h"
#include "../sample/fnnlm/FNNLM.h"
#include "../sample/transformer/Transformer.h"

//#define CRTDBG_MAP_ALLOC
//#include <stdlib.h>
//#include <crtdbg.h>

void TransposeTest();
void SumDimTest();

using namespace nts;
using namespace fnnlm;
using namespace transformer;

int main( int argc, const char ** argv )
{
    if(argc > 1 && !strcmp(argv[1], "-fnnlm"))
        FNNLMMain(argc - 1, argv + 1);
    else if(argc > 1 && !strcmp(argv[1], "-t2t"))
        TransformerMain(argc - 1, argv + 1);
    else{
        fprintf(stderr, "Thanks for using NiuTrans.Network! This is a library for building\n");
        fprintf(stderr, "neural networks in an easy way. \n\n");
        fprintf(stderr, "Or run this program with \"-fnnlm\" for sample FNNLM!\n");
    }

    //_CrtDumpMemoryLeaks();

    return 0;
}

void TransposeTest()
{
#ifdef USE_CUDA
    XMem mem0(0, UNI_FREE, MILLION * 64, 1024, MILLION * 64);
    //XMem mem1(1, UNI_FREE, MILLION * 64, 1024, MILLION * 64);

    XTensor x;
    XTensor y;
    XTensor z;

    int loops = 2000;

    int B = 3 * 2 * 4;
    int K = 8 * 1;
    int N = 50;
    int H = 512 * 4;

    int nnn = GDevs.nGPU;

    InitTensor3D(&x, B, N, H, X_FLOAT, 0);
    InitTensor4D(&y, K, B, N, H/K, X_FLOAT, 0);
    InitTensor3D(&z, B, N, H, X_FLOAT, 0);

    cudaEvent_t ctime0;
    cudaEvent_t ctime1;
    cudaEvent_t ctime2;
    cudaEvent_t ctime3;
    cudaEvent_t ctime4;
    cudaEvent_t ctime5;

    float elapsedSplit = 0.0;
    float elapsedMerge = 0.0;
    float elapsedSum = 0.0;

    cudaEventCreate(&ctime0);
    cudaEventCreate(&ctime1);
    cudaEventCreate(&ctime2);
    cudaEventCreate(&ctime3);
    cudaEventCreate(&ctime4);
    cudaEventCreate(&ctime5);

    cudaEventRecord(ctime0, 0);
    double time0 = GetClock();
    for(int i = 0; i < loops; i++)
        _Split(&x, &y, 2, K);
    double time1 = GetClock();
    cudaEventRecord(ctime1, 0);
    cudaEventSynchronize(ctime1);
    cudaEventElapsedTime(&elapsedSplit, ctime0, ctime1);

    cudaEventRecord(ctime2, 0);
    double time2 = GetClock();
    for(int i = 0; i < loops; i++)
        _Merge(&y, &x, 3);
    double time3 = GetClock();
    cudaEventRecord(ctime3, 0);
    cudaEventSynchronize(ctime3);
    cudaEventElapsedTime(&elapsedMerge, ctime2, ctime3);

    cudaEventRecord(ctime4, 0);
    double time4 = GetClock();
    for(int i = 0; i < loops; i++)
        _Sum(&x, &z, &x);
    double time5 = GetClock();
    cudaEventRecord(ctime5, 0);
    cudaEventSynchronize(ctime5);
    cudaEventElapsedTime(&elapsedSum, ctime4, ctime5);

    fprintf(stderr, "split:%f merge:%f sum:%f\n", time1 - time0, time3 - time2, time5 - time4);
    fprintf(stderr, "split:%f merge:%f sum:%f\n", elapsedSplit, elapsedMerge, elapsedSum);
#endif
}

void SumDimTest()
{
    XTensor x;
    XTensor y;
    XTensor z;

    int a = 5;
    int b = 7;
    int c = 3;

    InitTensor3D(&x, a, b, c, X_FLOAT, -1);
    InitTensor1D(&y, c, X_FLOAT, -1);
    InitTensor3D(&z, a, b, c, X_FLOAT, -1);

    x.SetZeroAll();
    y.SetZeroAll();
    z.SetZeroAll();

    float * data = new float[x.unitNum];
    for(int i = 0; i < x.unitNum; i++)
        data[i] = (DTYPE)i;
    x.SetData(data, x.unitNum);

    for(int i = 0; i < y.unitNum; i++)
        data[i] = -(DTYPE)i;
    y.SetData(data, y.unitNum);

    _SumDim(&x, &y, &z, 2);

    z.Dump(stderr, "z:");

    delete[] data;
}
/* the tail of XFuncGrad::MakeGrad(XTensor * node) */
    else{
        ShowNTErrors("Wrong activation function type!");
    }

    node->visitMark = NODE_FINISHED;
}

/* indicates whether the node is for an activation function */
/* in XMathGrad::MakeGrad(XTensor * node) */
    if(operID == MATH_SUM)
        GradSum(node);
    else if(operID == MATH_SUMDIM)
        GradSumDim(node);
    else if(operID == MATH_MULTIPLY)
        GradMultiply(node);
    else if(operID == MATH_MATRIXMUL)
        GradMatrixMul(node);
    else if(operID == MATH_MATRIXMULBATCHED)
        GradMatrixMulBatched(node);
    else if (operID == MATH_LOG)
        GradLog(node);
    else if (operID == MATH_POWER)
        GradPower(node);
    else if (operID == MATH_NEGATE)
        GradNegate(node);
    else if (operID == MATH_SCALEANDSHIFT)
        GradScaleAndShift(node);
    else if (operID == MATH_DIV)
        GradDiv(node);
    else if (operID == MATH_SUB)
        GradSub(node);
    else if (operID == MATH_SIN)
        GradSin(node);
    else if (operID == MATH_COS)
        GradCos(node);
    else if (operID == MATH_TAN)
        GradTan(node);
    else if (operID == MATH_EXP)
        GradExp(node);
    else if (operID == MATH_NORMALIZE)
        GradNormalize(node);
    else if (operID == MATH_ABSOLUTE)
        GradAbsolute(node);
    else if (operID == MATH_SIGN)
        GradSign(node);
    else if (operID == MATH_ROUND)
        GradRound(node);
    else if (operID == MATH_CLIP)
        GradClip(node);
    else if (operID == REDUCE_REDUCEMEAN)
        GradReduceMean(node);
    else if (operID == REDUCE_REDUCESUM)
        GradReduceSum(node);
    else if (operID == REDUCE_REDUCESUMSQUARED)
        GradReduceSumSquared(node);
    else if (operID == REDUCE_REDUCEVARIANCE)
        GradReduceVariance(node);
    else{
        ShowNTErrors("TODO!");
    }
/* in XMathGrad::GradSum(XTensor * node) */
    XTensor * a = income.tails[0];
    XTensor * b = income.tails[1];
    DTYPE beta = income.GetParam(0);
    XNoder::MakeGrad(a);
    XNoder::MakeGrad(b);

    _Sum(a->grad, node->grad, a->grad);
    _Sum(b->grad, node->grad, b->grad, beta);

    node->visitMark = NODE_FINISHED;
}
/*
gradient for sum with one dimension
c = a + b * \beta
where the size of b is equal to dimension n of a, i.e., |b| = a.dimSize[n]
dE/da = dE/dc
dE/db = dE/dc * b.reduce(0,...,n-1,n+1,...) * \beta
*/
void XMathGrad::GradSumDim(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for SUMDIM!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
int n = income.GetParamInt(0);
DTYPE beta = income.GetParam(1);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_Sum(a->grad, node->grad, a->grad);
int order = a->order;
int dimSize[MAX_TENSOR_DIM_NUM];
memcpy(dimSize, a->dimSize, sizeof(int) * a->order);
if(n == order - 1){
int reshapedSize[MAX_TENSOR_DIM_NUM];
reshapedSize[0] = a->unitNum/dimSize[order - 1];
reshapedSize[1] = dimSize[order - 1];
/* we reshape dE/dc to a matrix whose column number is equal to the
size of b. Then we can reduce the matrix into a row vector. */
node->grad->Reshape(2, reshapedSize);
if(b->outgo.tailNum > 1){
XTensor * bGradTMP = NewTensorBuf(b->grad, b->devID, b->mem);
_ReduceSum(node->grad, bGradTMP, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(bGradTMP, beta);
_Sum(bGradTMP, b->grad, b->grad);
DelTensorBuf(bGradTMP);
}
else{
_ReduceSum(node->grad, b->grad, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(b->grad, beta);
}
node->grad->Reshape(order, dimSize);
}
else{
int reshapedSize[MAX_TENSOR_DIM_NUM];
reshapedSize[0] = 1;
reshapedSize[1] = dimSize[n];
reshapedSize[2] = 1;
for(int i = 0; i < order; i++){
if(i < n)
reshapedSize[0] *= dimSize[i];
}
reshapedSize[2] = a->unitNum / (reshapedSize[0] * reshapedSize[1]);
/* we reshape dE/dc to a 3D tensor of size (x, y, z) where y = |b|.
Then reduce along with z and x to obtain dE/db. */
node->grad->Reshape(3, reshapedSize);
XTensor * interGrad = NewTensorBuf(2, reshapedSize, b->dataType, b->denseRatio, b->devID, b->mem);
_ReduceSum(node->grad, interGrad, 2);
if(b->outgo.tailNum > 1){
XTensor * bGradTMP = NewTensorBuf(b->grad, b->devID, b->mem);
_ReduceSum(interGrad, bGradTMP, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(bGradTMP, beta);
_Sum(bGradTMP, b->grad, b->grad);
DelTensorBuf(bGradTMP);
}
else{
_ReduceSum(interGrad, b->grad, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(b->grad, beta);
}
node->grad->Reshape(order, dimSize);
DelTensorBuf(interGrad);
}
node->visitMark = NODE_FINISHED;
}

/* gradient for multiply (dot production): c = a * b
   the tail of XMathGrad::GradMultiply(XTensor * node) */
    CheckNTErrors(XTensor::IsSameShaped(a, b), "Wrong sized input tensors!");
    _Multiply(node->grad, b, a->grad, 1.0F);
    _Multiply(node->grad, a, b->grad, 1.0F);

    node->visitMark = NODE_FINISHED;
}

/* gradient for matrix multiply: c = matmul(a, b) * \alpha
   in XMathGrad::GradMatrixMul(XTensor * node) */
    XNoder::MakeGrad(a);
    XNoder::MakeGrad(b);

    XTensor * c = node;
    XTensor * dedc = node->grad;
    XTensor * deda = a->grad;
    XTensor * dedb = b->grad;

    if(deda->order == 2 && dedb->order == 2)
        GradMatrixMul(a, deda, transA, b, dedb, transB, dedc, alpha);
    else if(transA == X_NOTRANS && deda->order > 2 && dedb->order == 2){
        int orderBackupA = a->order;
        int orderBackupC = c->order;
        int dimsBackupA[MAX_TENSOR_DIM_NUM];
        int dimsBackupC[MAX_TENSOR_DIM_NUM];
        memcpy(dimsBackupA, a->dimSize, sizeof(int) * a->order);
        memcpy(dimsBackupC, c->dimSize, sizeof(int) * c->order);

        a->Reshape(a->unitNum/a->GetDim(-1), a->GetDim(-1));
        c->Reshape(c->unitNum/c->GetDim(-1), c->GetDim(-1));
        deda->Reshape(a->unitNum/a->GetDim(-1), a->GetDim(-1));
        dedc->Reshape(c->unitNum/c->GetDim(-1), c->GetDim(-1));

        GradMatrixMul(a, deda, transA, b, dedb, transB, dedc, alpha);

        a->Reshape(orderBackupA, dimsBackupA);
        c->Reshape(orderBackupC, dimsBackupC);
        deda->Reshape(orderBackupA, dimsBackupA);
        dedc->Reshape(orderBackupC, dimsBackupC);
    }
    else{
        ShowNTErrors("TODO!");
    }

    node->visitMark = NODE_FINISHED;
}

/*
gradient for matrix multiply: c = matmul(a, b) * \alpha
>> a - as it is
>> deda - dE/da
>> b - as it is
>> dedb - dE/db
>> dedc - dE/dc
>> alpha - the scalar
*/
void XMathGrad::GradMatrixMul(XTensor * a, XTensor * deda, MATRIX_TRANS_TYPE transA,
                              XTensor * b, XTensor * dedb, MATRIX_TRANS_TYPE transB,
                              XTensor * dedc, DTYPE alpha)
{
    /* c = a * b * \alpha */
    if(transA == X_NOTRANS && transB == X_NOTRANS){

        /* dE/da = dE/dc * b^T * \alpha */
        _MatrixMul(dedc, X_NOTRANS, b, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a^T * dE/dc * \alpha */
        _MatrixMul(a, X_TRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b * \alpha */
    else if(transA == X_TRANS && transB == X_NOTRANS){

        /* dE/da = (dE/dc * b^T)^T * \alpha
                 = b * dE/dc^T * \alpha */
        _MatrixMul(b, X_NOTRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a * dE/dc * \alpha */
        _MatrixMul(a, X_NOTRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a * b^T * \alpha */
    else if(transA == X_NOTRANS && transB == X_TRANS){

        /* dE/da = dE/dc * b * \alpha */
        _MatrixMul(dedc, X_NOTRANS, b, X_NOTRANS, deda, alpha, 1.0F);

        /* dE/db = (a^T * dE/dc)^T * \alpha
                 = dE/dc^T * a * \alpha */
        _MatrixMul(dedc, X_TRANS, a, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b^T * \alpha */
    else if(transA == X_TRANS && transB == X_TRANS){

        /* dE/da = (dE/dc * b)^T * \alpha
                 = b^T * dE/dc^T * \alpha */
        _MatrixMul(b, X_TRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = (a * dE/dc)^T * \alpha
                 = dE/dc^T * a^T * \alpha */
        _MatrixMul(dedc, X_TRANS, a, X_TRANS, dedb, alpha, 1.0F);
    }
}

/*
gradient for matrix multiply in batch mode.
for each batch: c_i = matmul(a_i, b_i) * \alpha
we have
dE/da_i = dE/dc_i * b_i^T * \alpha
dE/db_i = a_i^T * dE/dc_i * \alpha
>> node - the node (c) for backward computation
*/
void XMathGrad::GradMatrixMulBatched(XTensor * node)
{
    XLink &income = node->income;
    CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for MULTIPLY!");
    CheckNTErrors(income.paramNum == 3, "Wrong parameter number for MULTIPLY!");

    XTensor * a = income.tails[0];
    XTensor * b = income.tails[1];
    MATRIX_TRANS_TYPE transA = income.GetParamTrans(0);
    MATRIX_TRANS_TYPE transB = income.GetParamTrans(1);
    DTYPE alpha = income.GetParam(2);

    XNoder::MakeGrad(a);
    XNoder::MakeGrad(b);

    XTensor * c = node;
    XTensor * dedc = node->grad;
    XTensor * deda = a->grad;
    XTensor * dedb = b->grad;

    /* c = a * b * \alpha */
    if(transA == X_NOTRANS && transB == X_NOTRANS){

        /* dE/da = dE/dc * b^T * \alpha */
        _MatrixMulBatched(dedc, X_NOTRANS, b, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a^T * dE/dc * \alpha */
        _MatrixMulBatched(a, X_TRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b * \alpha */
    else if(transA == X_TRANS && transB == X_NOTRANS){

        /* dE/da = (dE/dc * b^T)^T * \alpha
                 = b * dE/dc^T * \alpha */
        _MatrixMulBatched(b, X_NOTRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a * dE/dc * \alpha */
        _MatrixMulBatched(a, X_NOTRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a * b^T * \alpha */
    else if(transA == X_NOTRANS && transB == X_TRANS){

        /* dE/da = dE/dc * b * \alpha */
        _MatrixMulBatched(dedc, X_NOTRANS, b, X_NOTRANS, deda, alpha, 1.0F);

        /* dE/db = (a^T * dE/dc)^T * \alpha
                 = dE/dc^T * a * \alpha */
        _MatrixMulBatched(dedc, X_TRANS, a, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b^T * \alpha */
    else if(transA == X_TRANS && transB == X_TRANS){

        /* dE/da = (dE/dc * b)^T * \alpha
                 = b^T * dE/dc^T * \alpha */
        _MatrixMulBatched(b, X_TRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = (a * dE/dc)^T * \alpha
                 = dE/dc^T * a^T * \alpha */
        _MatrixMulBatched(dedc, X_TRANS, a, X_TRANS, dedb, alpha, 1.0F);
    }

    node->visitMark = NODE_FINISHED;
}
/*
gradient for log
for
c = log(a)
we have
dE/da = dE/dc * 1/a
>> node - the node (c) for backward computation
*/
void XMathGrad::GradLog(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for LOG!");
XTensor * a = income.tails[0];
XNoder::MakeGrad(a);
_Div(node->grad, a, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
}
/*
gradient for power
for
c = pow(a,p)
we have
dE/da = (dE/dc) * p*a^(p-1)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradPower(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for POWER!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
DTYPE p = income.GetParam(0);
XNoder::MakeGrad(a);
_Power(a, b, (p-1)/p);
_ScaleAndShift(b, c, p);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for negate
for
c = -a
we have
dE/da = dE/dc * (-1)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradNegate(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for NEGATE!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_ScaleAndShift(node->grad, b, -1.0F);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for ScaleAndShift
for
c = a * scale + shift
we have
dE/da = dE/dc * scale
>> node - the node (c) for backward computation
*/
void XMathGrad::GradScaleAndShift(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SCALEANDSHIFT!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
DTYPE scale = income.GetParam(0);
XNoder::MakeGrad(a);
_ScaleAndShift(node->grad, b, scale);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for minus
for
c = a - b * \beta
we have
dE/da = dE/dc
dE/db = -dE/dc * \beta
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSub(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for SUBSTRACT!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
DTYPE beta = income.GetParam(0);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_Sum(a->grad, node->grad, a->grad);
_Sum(b->grad, node->grad, b->grad, -beta);
node->visitMark = NODE_FINISHED;
}
/*
gradient for divide
for
c = a / b
we have
dE/da = dE/dc / b
dE/db = dE/dc * a / -b^2
>> node - the node (c) for backward computation
*/
void XMathGrad::GradDiv(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for DIVIDE!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(b);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(b);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
CheckNTErrors(XTensor::IsSameShaped(a, b), "Wrong sized input tensors!");
_Div(node->grad, b, a->grad, 1.0F);
_Power(b, c, -2.0F);
_Multiply(a, c, d);
_ScaleAndShift(d, e, -1.0F);
_Multiply(node->grad, e, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
}
/*
gradient for exp
for
c = exp(a)
we have
dE/da = dE/dc * exp(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradExp(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for EXP!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Exp(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for sin
for
c = sin(a)
we have
dE/da = dE/dc * cos(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSin(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Cos(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for cos
for
c = cos(a)
we have
dE/da = dE/dc * -sin(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradCos(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for COS!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
XNoder::MakeGrad(a);
_Sin(a, b);
_ScaleAndShift(b, c, -1.0F);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for tan
for
c = tan(a)
we have
dE/da = dE/dc * 1/(cos(a))^2
>> node - the node (c) for backward computation
*/
void XMathGrad::GradTan(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TAN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
XNoder::MakeGrad(a);
_Cos(a, b);
_Power(b, c, -2.0F);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for normalize
>> node - the node (c) for backward computation
*/
void XMathGrad::GradNormalize(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 5, "Wrong input tensor number for NORMALIZE!");
XTensor * input = income.tails[0];
XTensor * mean = income.tails[1];
XTensor * var = income.tails[2];
XTensor * a = income.tails[3];
XTensor * b = income.tails[4];
XTensor * c = NewTensor(var);
XTensor * d = NewTensor(a);
XTensor * e = NewTensor(a);
XTensor * f = NewTensor(a);
XTensor * g = NewTensor(a);
XTensor * h = NewTensor(a);
XTensor * i = NewTensor(a);
XTensor * j = NewTensor(a);
XTensor * k = NewTensor(var);
XTensor * p = NewTensor(var);
XTensor * q = NewTensor(var);
XTensor * r = NewTensor(a);
XTensor * x = NewTensor(mean);
XTensor * y = NewTensor(mean);
XTensor * z = NewTensor(mean);
DTYPE epsilon = income.GetParam(1);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(input);
XNoder::MakeGrad(mean);
XNoder::MakeGrad(var);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
/* dEdinput */
_ScaleAndShift(var, c, 1.0F, epsilon);
_Unsqueeze(c, d, dim, n);
_Power(d, e, -0.5F);
_Multiply(a, e, f);
_Multiply(node->grad, f, input->grad, 1.0F);
/* dEdmean */
_ScaleAndShift(f, g, -1.0F);
_ReduceSum(g, x, dim);
_ReduceSum(node->grad, y, dim);
_Multiply(y, x, mean->grad, 1.0F);
/* dEdvar */
_Unsqueeze(mean, h, dim, n);
_Sub(input, h, i);
_Multiply(a, i, j);
_Power(var, k, -1.5F);
_ScaleAndShift(k, p, -0.5F);
_ReduceSum(j, z, dim);
_Multiply(z, p, q);
_Multiply(y, q, var->grad, 1.0F);
/* dEda */
_Multiply(i, e, r);
_Multiply(node->grad, r, a->grad, 1.0F);
/* dEdb */
_Sum(b->grad, node->grad, b->grad);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
delete f;
delete g;
delete h;
delete i;
delete j;
delete k;
delete p;
delete q;
delete r;
delete x;
delete y;
delete z;
}
/*
gradient for absolute
for
c = |a|
we have
dE/da = dE/dc a >= 0
-dE/dc a < 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradAbsolute(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for ABSOLUTE!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Sign(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for sign
for
c = sign(a)
we have
dE/da = 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSign(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIGN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
b->SetZeroAll();
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for round
for
c = round(a)
we have
dE/da = 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradRound(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for ROUND!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
b->SetZeroAll();
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for clip
we have
dE/da = 1 lower < a < upper
dE/da = 0 otherwise
>> node - the node (c) for backward computation
*/
void XMathGrad::GradClip(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for CLIP!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
DTYPE lower = income.GetParam(0);
DTYPE upper = income.GetParam(1);
XNoder::MakeGrad(a);
_ClipBackward(node, a, node->grad, a->grad, lower, upper);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for reduceMean
for
c = reduceMean(a, dim)
we have
dE/da = Unsqueeze(dE/dc) * 1/dimSizeA[dim]
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceMean(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
_Unsqueeze(node->grad, b, dim, n);
_ScaleAndShift(b, c, 1.0F/n);
_Sum(a->grad, c, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for reduceSum
for
c = reduceSum(a, dim)
we have
dE/da = Unsqueeze(dE/dc) * 1
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceSum(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
_Unsqueeze(node->grad, b, dim, n);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for reduceSumSquared
for
c = reduceSumSquared(a, dim, b)
we have
dE/da = Unsqueeze(dE/dc) * 2a
dE/db = Unsqueeze(dE/dc) * (-2b)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceSumSquared(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(a);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(c);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_ScaleAndShift(a, c, 2.0F);
_ScaleAndShift(b, d, -2.0F);
_Unsqueeze(node->grad, e, dim, n);
_Multiply(e, c, a->grad, 1.0F);
_Multiply(node->grad, d, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
}
/*
gradient for reduceVariance
for
c = reduceVariance(a, dim, b)
we have
dE/da = Unsqueeze(dE/dc) * 2a/dimSizeA[dim]
dE/db = Unsqueeze(dE/dc) * (-2a/dimSizeA[dim])
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceVariance(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(a);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_ScaleAndShift(a, c, 2.0F / n);
_ScaleAndShift(b, d, -2.0F / n);
_Unsqueeze(node->grad, e, dim, n);
_Multiply(e, c, a->grad, 1.0F);
_Multiply(node->grad, d, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
} }
}

/* from the private section of the XMathGrad declaration (XBackwardMath.h) */
    static
    void GradSum(XTensor * node);

    /* gradient for sum with one dimension: c = a + b * \beta
       where the size of b is equal to that of one dimension of a */
    static
    void GradSumDim(XTensor * node);

    /* gradient for multiply (dot production): c = a * b * \alpha */
    static
    void GradMultiply(XTensor * node);

    /* gradient for matrix multiply: c = matmul(a, b) * \alpha */
    static
    void GradMatrixMul(XTensor * node);
/* gradient for matrix multiply: c = matmul(a, b) * \alpha */
static
void GradMatrixMul(XTensor * a, XTensor * deda, MATRIX_TRANS_TYPE transA,
XTensor * b, XTensor * dedb, MATRIX_TRANS_TYPE transB,
XTensor * dedc, DTYPE alpha);
/* gradient for matrix multiply in batch mode.
for each batch: c_i = matmul(a_i, b_i) * \alpha */
static
void GradMatrixMulBatched(XTensor * node);
/* gradient for log: c = log(a) */
static
void GradLog(XTensor * node);
/* gradient for power */
static
void GradPower(XTensor * node);
/* gradient for negate */
static
void GradNegate(XTensor * node);
/* gradient for ScaleAndShift */
static
void GradScaleAndShift(XTensor * node);
/* gradient for Minus */
static
void GradSub(XTensor * node);
/* gradient for Divide */
static
void GradDiv(XTensor * node);
/* gradient for reduceMean */
static
void GradReduceMean(XTensor * node);
/* gradient for reduceSum */
static
void GradReduceSum(XTensor * node);
/* gradient for reduceSumSquared */
static
void GradReduceSumSquared(XTensor * node);
/* gradient for reduceVariance */
static
void GradReduceVariance(XTensor * node);
/* gradient for sin */
static
void GradSin(XTensor * node);
/* gradient for cos */
static
void GradCos(XTensor * node);
/* gradient for tan */
static
void GradTan(XTensor * node);
/* gradient for exp */
static
void GradExp(XTensor * node);
/* gradient for normalize */
static
void GradNormalize(XTensor * node);
/* gradient for absolute */
static
void GradAbsolute(XTensor * node);
/* gradient for sign */
static
void GradSign(XTensor * node);
/* gradient for clip */
static
void GradClip(XTensor * node);
/* gradient for round */
static
void GradRound(XTensor * node);
};

}

#endif
...@@ -43,6 +43,12 @@ void XShapeGrad::MakeGrad(XTensor * node) ...@@ -43,6 +43,12 @@ void XShapeGrad::MakeGrad(XTensor * node)
GradMergeList(node); GradMergeList(node);
else if(operID == SHAPE_UNSQUEEZE) else if(operID == SHAPE_UNSQUEEZE)
GradUnsqueeze(node); GradUnsqueeze(node);
else if(operID == SHAPE_SPLIT)
GradSplit(node);
else if(operID == SHAPE_SPLIT_LIST)
GradSplitList(node);
else if (operID == SHAPE_TRANSPOSE)
GradTranspose(node);
else{
ShowNTErrors("TODO!");
}
@@ -55,6 +61,13 @@ bool XShapeGrad::IsShapeOP(XTensor * node)
return (income.typeID & DATA_BASE) != 0;
}
/* post processing of a node */
void XShapeGrad::PostProcessing(XTensor * node, int typeID)
{
if(typeID == SHAPE_SPLIT_LIST)
GradSplitListPost(node);
}
/*
gradient for merge
for
@@ -134,6 +147,8 @@ void XShapeGrad::GradMerge(XTensor * node)
gradInputSmall.data = NULL;
delete[] dims;
node->visitMark = NODE_FINISHED;
}
/*
@@ -213,6 +228,120 @@ void XShapeGrad::GradMergeList(XTensor * node)
gradSmall.data = NULL;
delete[] dims;
}
node->visitMark = NODE_FINISHED;
}
/*
gradient computation for split:
for
c = split(a)
we have
dE/da = merge(dE/dc)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplit(XTensor * node)
{
XLink &income = node->income;
XTensor * input = income.tails[0];
int whereToSplit = income.GetParamInt(0);
int splitNum = income.GetParamInt(1);
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SPLIT!");
CheckNTErrors(node->order == input->order + 1, "Wrong tensor orders!");
CheckNTErrors(splitNum == node->dimSize[0], "Wrong split number!");
XNoder::MakeGrad(input);
/* we can simply merge the gradient tensor
if the input is used in splitting only */
if(input->outgo.tailNum == 1)
_Merge(node->grad, input->grad, whereToSplit + 1, 0);
/* if the tensor is used somewhere else, we need another SUM
for gradient accumulation */
else{
XTensor inputGradTMP(input);
_Merge(node->grad, &inputGradTMP, whereToSplit + 1, 0);
_Sum(input->grad, &inputGradTMP, input->grad);
}
node->visitMark = NODE_FINISHED;
}
/*
gradient computation for splitting
where we return the list of the splits
for
list(c_1, ...) = split(a)
we have
dE/da = merge(dE/dc_1, ...)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplitList(XTensor * node)
{
XLink &income = node->income;
XTensor * input = income.tails[0];
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SPLIT!");
CheckNTErrors(node->order == input->order + 1, "Wrong tensor orders!");
node->visitMark = NODE_DOING;
}
/*
gradient computation for splitting. We return
the list of the splits : list(c_1, ...) = split(a).
this method is called only when all nodes of splitting
have been processed. We do this in a post-processing
manner because we can fuse multiple memory copy jobs
into one. This is good for system speedup.
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplitListPost(XTensor * node)
{
/* we compute the gradient for the current node, rather than for
a child node, i.e., we use the outgoing edge here */
XLink &outgo = node->outgo;
XList splits(outgo.tailNum);
int whereToSplit = -1;
int splitNum = 0;
for(int i = 0; i < outgo.tailNum; i++){
XTensor * parent = (XTensor*)outgo.tails[i];
XLink &income = parent->income;
if(income.typeID == SHAPE_SPLIT_LIST){
int w = income.GetParamInt(0);
int splitID = income.GetParamInt(1);
if(whereToSplit < 0)
whereToSplit = w;
splitNum++;
CheckNTErrors(whereToSplit == w, "Wrong dimension for splitting");
CheckNTErrors(income.tailNum == 1, "Something wrong with outgoing edge!");
CheckNTErrors(splitNum - 1 == splitID, "Wrong split id!");
splits.Add(parent);
}
}
/* we can simply merge the gradient tensor
if the node is used in splitting only */
if(outgo.tailNum == splitNum){
_Merge(&splits, node->grad, whereToSplit + 1);
}
/* if the tensor is used as input to other nodes
somewhere else, we need another SUM for gradient
accumulation */
else{
XTensor nodeGradTMP(node);
_Merge(&splits, &nodeGradTMP, whereToSplit + 1);
_Sum(node->grad, &nodeGradTMP, node->grad);
}
}
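Since split only rearranges data, its backward pass is the inverse merge, plus one extra SUM when the input also feeds other nodes, which is exactly what the two branches above do. A small sketch of the same bookkeeping on flat arrays, independent of the XTensor API:

#include <cstdio>

int main()
{
    /* upstream gradients of the two splits of a length-6 input */
    float dc0[3] = {0.1F, 0.2F, 0.3F};
    float dc1[3] = {0.4F, 0.5F, 0.6F};

    /* dE/da may already hold contributions from other consumers of a */
    float dEda[6] = {1.0F, 1.0F, 1.0F, 1.0F, 1.0F, 1.0F};

    /* merge(dE/dc_0, dE/dc_1) and accumulate into dE/da */
    for(int i = 0; i < 3; i++){
        dEda[i] += dc0[i];
        dEda[3 + i] += dc1[i];
    }

    for(int i = 0; i < 6; i++)
        printf("dE/da[%d] = %.2f\n", i, dEda[i]);
    return 0;
}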
/*
@@ -239,6 +368,40 @@ void XShapeGrad::GradUnsqueeze(XTensor * node)
CheckNTErrors(output->unitNum == input->unitNum * dSize, "Wrong tensor size!");
_ReduceSum(output->grad, input->grad, dim);
node->visitMark = NODE_FINISHED;
}
/*
gradient for transposing a tensor
for
c = Transpose(a)
we have
dE/da = Transpose(dE/dc)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradTranspose(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TRANSPOSE!");
XTensor * output = node;
XTensor * input = income.tails[0];
XTensor * b = NewTensor(input);
XNoder::MakeGrad(input);
int i = income.GetParamInt(0);
int j = income.GetParamInt(1);
CheckNTErrors(input->order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(input->order > j && j >= 0, "index of dimension is out of scope!");
_Transpose(output->grad, b, i, j);
_Sum(input->grad, b, input->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
}
\ No newline at end of file
@@ -40,18 +40,41 @@ public:
static
bool IsShapeOP(XTensor * node);
/* post processing of a node */
static
void PostProcessing(XTensor * node, int typeId);
private:
/* gradient computation for merge: c = merge(a, b, ...) */
static
void GradMerge(XTensor * node);
/* gradient computation for merging a list of tensors : c = merge(list(a, b, ...)) */
static
void GradMergeList(XTensor * node);
/* gradient computation for split: c = split(a) */
static
void GradSplit(XTensor * node);
/* gradient computation for splitting. we return the list of the splits : list(c_1, ...) = split(a) */
static
void GradSplitList(XTensor * node);
/* gradient computation for splitting. we return the list of the splits : list(c_1, ...) = split(a).
this method is called only when all nodes of splitting have been processed. We do this in a post-processing
manner because we can fuse multiple memory copy jobs into one. This is good for system speedup. */
static
void GradSplitListPost(XTensor * node);
/* gradient computation for unsqueezing a tensor : c = unsqueeze(a) */
static
void GradUnsqueeze(XTensor * node);
/* gradient computation for transposing a tensor : c = transpose(a) */
static
void GradTranspose(XTensor * node);
};
}
@@ -46,6 +46,11 @@ unsigned int MakeNetID()
return id;
}
void XNetClearAll()
{
MUTEX_DELE(netMutex);
}
/* constructor */
XNet::XNet()
{
@@ -143,7 +148,7 @@ void XNet::Backward(XList &roots, XList &golds, LOSS_FUNCTION_NAME loss)
/* back-propagation from output to input */
for(int i = nodes.count - 1; i >= 0; i--){
XTensor * node = (XTensor*)nodes.Get(i);
if(node->visitMark == NODE_FINISHED)
continue;
@@ -176,6 +181,10 @@ void XNet::BackwardNode(XTensor * node)
return;
if(!XNoder::IsLeaf(node)){
/* post processing for parent nodes */
BackwardNodePost(node);
/* process the current node */
if(XMathGrad::IsMathOP(node))
XMathGrad::MakeGrad(node);
else if(XFuncGrad::IsFunc(node))
@@ -186,8 +195,24 @@ void XNet::BackwardNode(XTensor * node)
ShowNTErrors("Wrong node type!");
}
}
}
/*
backward computation (in post processing) for a given node
>> node - the node whose parent nodes are not processed yet. So
we do the job at the child node.
*/
void XNet::BackwardNodePost(XTensor * node)
{
bool isSplitList = false;
XLink &outgo = node->outgo;
for(int i = 0; i < outgo.tailNum; i++){
if(outgo.tails[i]->income.typeID == SHAPE_SPLIT_LIST)
isSplitList = true;
}
if(isSplitList)
XShapeGrad::PostProcessing(node, SHAPE_SPLIT_LIST);
}
/*
@@ -238,10 +263,11 @@ void XNet::TarjanVisit(XTensor * node, XList &orders, const unsigned int code)
if(node == NULL)
return;
//fprintf(stderr, "%d\n", node->id);
if(node->visitMark == code + 1){
ShowNTErrors("There is a cycle in the network\n");
}
else if(node->visitMark <= code){
node->visitMark = code + 1;
XLink &income = node->income;
for(int i = 0; i < income.tailNum; i++){
@@ -73,6 +73,9 @@ struct XNet
/* backward computation for a given node */
void BackwardNode(XTensor * node);
/* backward computation (in post processing) for a given node */
void BackwardNodePost(XTensor * node);
/* traverse the net and find the topological order by
depth-first search (Tarjan's algorithm) */
void Traverse(XTensor &root);
@@ -92,6 +95,7 @@ struct XNet
extern unsigned int netIDGlobal;
extern MUTEX_HANDLE netMutex;
extern unsigned int MakeNetID();
extern void XNetClearAll();
}
@@ -33,7 +33,7 @@
#include "../../tensor/function/FHeader.h"
#include "../../network/XNet.h"
namespace fnnlm
{
#define MAX_NAME_LENGTH 1024
@@ -57,7 +57,7 @@ void LoadArgs(int argc, const char ** argv, FNNModel &model);
void Init(FNNModel &model);
void Check(FNNModel &model);
void Copy(FNNModel &tgt, FNNModel &src);
void Clear(FNNModel &model, bool isNodeGrad);
void InitModelTensor1D(XTensor &tensor, int num, FNNModel &model);
void InitModelTensor2D(XTensor &tensor, int rowNum, int colNum, FNNModel &model);
void Train(const char * train, bool isShuffled, FNNModel &model);
@@ -153,43 +153,80 @@ load arguments
*/
void LoadArgs(int argc, const char ** argv, FNNModel &model)
{
fprintf(stderr, "args:\n");
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], "-train") && i + 1 < argc){
strcpy(trainFN, argv[i + 1]);
fprintf(stderr, " -train=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-model") && i + 1 < argc){
strcpy(modelFN, argv[i + 1]);
fprintf(stderr, " -model=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-test") && i + 1 < argc){
strcpy(testFN, argv[i + 1]);
fprintf(stderr, " -test=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-output") && i + 1 < argc){
strcpy(outputFN, argv[i + 1]);
fprintf(stderr, " -output=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-n") && i + 1 < argc){
model.n = atoi(argv[i + 1]);
fprintf(stderr, " -n=%d\n", model.n);
}
if(!strcmp(argv[i], "-esize") && i + 1 < argc){
model.eSize = atoi(argv[i + 1]);
fprintf(stderr, " -esize=%d\n", model.eSize);
}
if(!strcmp(argv[i], "-vsize") && i + 1 < argc){
model.vSize = atoi(argv[i + 1]);
fprintf(stderr, " -vsize=%d\n", model.vSize);
}
if(!strcmp(argv[i], "-hdepth") && i + 1 < argc){
model.hDepth = atoi(argv[i + 1]);
fprintf(stderr, " -hdepth=%d\n", model.hDepth);
}
if(!strcmp(argv[i], "-hsize") && i + 1 < argc){
model.hSize = atoi(argv[i + 1]);
fprintf(stderr, " -hsize=%d\n", model.hSize);
}
if(!strcmp(argv[i], "-lrate") && i + 1 < argc){
learningRate = (float)atof(argv[i + 1]);
fprintf(stderr, " -lrate=%f\n", learningRate);
}
if(!strcmp(argv[i], "-nstep") && i + 1 < argc){
nStep = atoi(argv[i + 1]);
fprintf(stderr, " -nstep=%d\n", nStep);
}
if(!strcmp(argv[i], "-nepoch") && i + 1 < argc){
nEpoch = atoi(argv[i + 1]);
fprintf(stderr, " -nepoch=%d\n", nEpoch);
}
if(!strcmp(argv[i], "-minmax") && i + 1 < argc){
minmax = (float)fabs(atof(argv[i + 1]));
fprintf(stderr, " -minmax=%f\n", minmax);
}
if(!strcmp(argv[i], "-batch") && i + 1 < argc){
sentBatch = atoi(argv[i + 1]);
fprintf(stderr, " -batch=%d\n", sentBatch);
}
if(!strcmp(argv[i], "-wbatch") && i + 1 < argc){
wordBatch = atoi(argv[i + 1]);
fprintf(stderr, " -wbatch=%d\n", wordBatch);
}
if(!strcmp(argv[i], "-shuffle")){
shuffled = true;
fprintf(stderr, " -shuffle=true\n");
}
if(!strcmp(argv[i], "-autodiff")){
autoDiff = true;
fprintf(stderr, " -autodiff=true\n");
}
if(!strcmp(argv[i], "-dev") && i + 1 < argc){
model.devID = atoi(argv[i + 1]);
fprintf(stderr, " -dev=%d\n", model.devID);
}
}
for(int i = 0; i < argc; i++){
@@ -203,6 +240,7 @@ void Check(FNNModel &model)
{
CheckErrors(model.n > 0 && model.n <= MAX_N_GRAM, "The LM order is out of range (use -n)!");
CheckErrors(model.vSize > 0, "no vocabulary size found (use -vsize)!");
CheckErrors(model.eSize > 0, "no embedding size found (use -esize)!");
}
/* make a hard copy of the fnn model */
@@ -230,16 +268,37 @@ void Copy(FNNModel &tgt, FNNModel &src)
}
}
/*
reset model parameters
>> model - the model whose parameter (gradient) is set to 0
>> isNodeGrad - indicates whether the tensor node keeps the
gradient information
*/
void Clear(FNNModel &model, bool isNodeGrad)
{
if (isNodeGrad) {
if(model.embeddingW.grad != NULL)
model.embeddingW.grad->SetZeroAll();
for (int i = 0; i < MAX_HIDDEN_NUM; i++) {
if(model.hiddenW[i].grad != NULL)
model.hiddenW[i].grad->SetZeroAll();
if(model.hiddenB[i].grad != NULL)
model.hiddenB[i].grad->SetZeroAll();
}
if(model.outputW.grad != NULL)
model.outputW.grad->SetZeroAll();
if(model.outputB.grad != NULL)
model.outputB.grad->SetZeroAll();
}
else {
model.embeddingW.SetZeroAll();
for (int i = 0; i < MAX_HIDDEN_NUM; i++) {
model.hiddenW[i].SetZeroAll();
model.hiddenB[i].SetZeroAll();
}
model.outputW.SetZeroAll();
model.outputB.SetZeroAll();
}
}
/*
@@ -401,7 +460,7 @@ void Train(const char * train, bool isShuffled, FNNModel &model)
FNNNet net;
/* gradient = 0 */
Clear(grad, false);
/* forward computation */
Forward(inputs, output, model, net);
@@ -413,6 +472,9 @@ void Train(const char * train, bool isShuffled, FNNModel &model)
Update(model, grad, learningRate, false);
}
else{
/* gradient = 0 */
Clear(model, true);
/* forward + backward process */
ForwardAutoDiff(inputs, output, model);
@@ -492,21 +554,24 @@ void Update(FNNModel &model, FNNModel &grad, float epsilon, bool isNodeGrad)
gradList.Add(&grad.embeddingW);
}
else{
gradList.Add(model.outputW.grad);
gradList.Add(model.outputB.grad);
for (int i = 0; i < model.hDepth; i++) {
gradList.Add(model.hiddenW[i].grad);
gradList.Add(model.hiddenB[i].grad);
}
gradList.Add(model.embeddingW.grad);
}
for (int i = 0; i < paraList.count; i++) {
XTensor * para = (XTensor*)paraList.GetItem(i);
XTensor * paraGrad = (XTensor*)gradList.GetItem(i);
//fprintf(stderr, "%d\n", i);
//paraGrad->Dump(stderr, "grad:", 10);
/* the delta rule */
_Sum(para, paraGrad, para, -epsilon);
}
@@ -516,7 +581,7 @@ void Update(FNNModel &model, FNNModel &grad, float epsilon, bool isNodeGrad)
get prediction probabilities of the gold words
>> output - output probabilities
>> gold - gold standard
>> wordProbs - probability of each word
<< return - probability of the batch
*/
float GetProb(XTensor &output, XTensor &gold, XTensor * wordProbs)
@@ -568,8 +633,10 @@ int LoadNGrams(FILE * file, int n, NGram * ngrams, int sentNum, int wordNum)
if(pin <= 0){
int len = (int)strlen(lineBuf);
while(lineBuf[len - 1] == '\r' || lineBuf[len - 1] == '\n'){
lineBuf[len - 1] = 0;
len--;
}
len = (int)strlen(lineBuf);
if(len == 0)
@@ -580,10 +647,11 @@ int LoadNGrams(FILE * file, int n, NGram * ngrams, int sentNum, int wordNum)
/* how many words are in the sentence */
int wNum = 0;
int i = 0;
for(i = pin; i < len; i++){
/* load word (id) separated by space or tab */
if((lineBuf[i] == ' ' || lineBuf[i] == '\t') && wSize > 0){
lineBuf[i] = 0;
wordBuf[wNum++] = atoi(lineBuf + i - wSize);
wSize = 0;
@@ -592,6 +660,9 @@ int LoadNGrams(FILE * file, int n, NGram * ngrams, int sentNum, int wordNum)
wSize++;
}
if(wSize > 0)
wordBuf[wNum++] = atoi(lineBuf + i - wSize);
wordBufCount = wNum;
lineNum++;
}
@@ -911,7 +982,6 @@ forward process (with tensor connections)
*/
void ForwardAutoDiff(XTensor inputs[], XTensor &output, FNNModel &model)
{
int n = model.n;
int depth = model.hDepth;
@@ -935,15 +1005,13 @@ void ForwardAutoDiff(XTensor inputs[], XTensor &output, FNNModel &model)
hidden = Merge(hidden, 2, 0);
/* hidden layers */
for(int i = 0; i < depth; i++)
hidden = MMul(hidden, model.hiddenW[i]) + model.hiddenB[i];
/* output layer */
output = LogSoftmax(MMul(hidden, model.outputW) + model.outputB, 1);
//XLink::ShowNetwork(stderr, &output);
}
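The rewritten loop above leans on the sum-with-one-dimension operation (c = a + b * \beta, whose gradient the new GradSumDim declaration provides) so the bias no longer has to be unsqueezed to the batch size by hand. A rough sketch of that broadcast sum on plain arrays, not the real XTensor calls:

#include <cstdio>

int main()
{
    const int rows = 2, cols = 3;
    float a[rows][cols] = {{1, 2, 3}, {4, 5, 6}};  /* e.g. hidden activations */
    float b[cols] = {0.1F, 0.2F, 0.3F};            /* bias matching one dimension of a */

    /* c = a + b * \beta with \beta = 1: b is repeated along the row dimension */
    for(int i = 0; i < rows; i++)
        for(int j = 0; j < cols; j++)
            a[i][j] += b[j];

    for(int i = 0; i < rows; i++)
        printf("%.1f %.1f %.1f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}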
/*
@@ -1040,18 +1108,23 @@ void Test(const char * test, const char * result, FNNModel &model)
/* the gold standard */
XTensor gold;
if (!autoDiff) {
/* prepare an empty network for building the fnn */
FNNNet net;
/* make the input tensor for position i */
for (int i = 0; i < model.n - 1; i++)
MakeWordBatch(inputs[i], ngrams, ngramNum, i, model.vSize, model.devID, model.mem);
/* make the gold tensor */
MakeWordBatch(gold, ngrams, ngramNum, model.n - 1, model.vSize, model.devID, model.mem);
/* forward computation */
Forward(inputs, output, model, net);
}
else {
ForwardAutoDiff(inputs, output, model);
}
/* prediction probabilities */
XTensor probs;
@@ -36,7 +36,7 @@
using namespace nts;
namespace fnnlm
{
#define _EXIT_(x)// exit(x)
@@ -126,7 +126,7 @@ struct FNNNet
XTensor output;
};
/* entry point of the program */
int FNNLMMain(int argc, const char ** argv);
};
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <math.h>
#include "T2TAttention.h"
#include "T2TUtility.h"
#include "T2TEmbedding.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TAttention::T2TAttention()
{
nhead = -1;
dk = -1;
dv = -1;
d = -1;
}
/* deconstructor */
T2TAttention::~T2TAttention()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TAttention::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "nhead", &nhead, 8);
LoadParamInt(argc, argv, "d", &dk, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &dv, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &d, DEFAULT_BEDDING_SIZE);
LoadParamFloat(argc, argv, "attminmax", &minmax, 0.08F);
InitTensor2D(&wk, d, dk, X_FLOAT, devID, mem);
InitTensor2D(&wq, d, dk, X_FLOAT, devID, mem);
InitTensor2D(&wv, d, dv, X_FLOAT, devID, mem);
wk.SetDataRand(-minmax, minmax);
wq.SetDataRand(-minmax, minmax);
wv.SetDataRand(-minmax, minmax);
}
/*
make the network
>> k - keys. It might be of size B * L * H
where B = batch size, L = sequence length,
and H = vector size of each position
>> q - queries
>> v - values
<< return - multi-attention result
*/
XTensor T2TAttention::Make(XTensor &k, XTensor &q, XTensor &v)
{
XTensor k2;
XTensor q2;
XTensor v2;
/* linear transformation before self-attention */
k2 = MMul(k, wk);
q2 = MMul(q, wq);
v2 = MMul(v, wv);
XTensor kheads;
XTensor qheads;
XTensor vheads;
/* multi head */
kheads = Split(k2, k2.order - 1, nhead);
qheads = Split(q2, q2.order - 1, nhead);
vheads = Split(v2, v2.order - 1, nhead);
XTensor att;
XTensor scalar;
/* scalar = softmax(Q * K^T / sqrt(dk)) * V */
scalar = Softmax(Linear(BMMul(qheads, X_NOTRANS, kheads, X_TRANS), 1/sqrt((float)dk)), -1);
att = BMMul(scalar, vheads);
/* concatenate the heads */
return Merge(att, att.order - 1);
}
}
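Make above computes softmax(Q K^T / sqrt(dk)) V per head. As a numeric sanity check of the scaled dot-product scores for one query against two keys (plain floats, nothing from the library):

#include <cstdio>
#include <cmath>

int main()
{
    const int dk = 4;
    float q[dk]  = {1, 0, 1, 1};
    float k0[dk] = {1, 1, 0, 0};
    float k1[dk] = {0, 1, 1, 1};

    /* scaled dot products s_i = (q . k_i) / sqrt(dk) */
    float s0 = 0, s1 = 0;
    for(int i = 0; i < dk; i++){
        s0 += q[i] * k0[i];
        s1 += q[i] * k1[i];
    }
    s0 /= sqrt((float)dk);   /* 1/2 = 0.5 */
    s1 /= sqrt((float)dk);   /* 2/2 = 1.0 */

    /* softmax over the two scores yields the attention weights (~0.3775, ~0.6225) */
    float e0 = exp(s0), e1 = exp(s1);
    printf("weights: %.4f %.4f\n", e0 / (e0 + e1), e1 / (e0 + e1));
    return 0;
}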
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TATTENTION_H__
#define __T2TATTENTION_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
/*
multi-head attention
y(Q, K, V) = cat(head_1, head_2, ..., head_n)
where head_i = Attention(Q * w_i^Q, K * w_i^K, V * w_i^V)
attention(Q, K, V) = softmax(Q * K^T/d_k^0.5) V
d_k = dimension size of K
*/
class T2TAttention
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* head number */
int nhead;
/* transformation matrix for K */
XTensor wk;
/* transformation matrix for Q */
XTensor wq;
/* transformation matrix for V */
XTensor wv;
/* size of transformed Q and K */
int dk;
/* size of transformed V */
int dv;
/* size of input Q, K and V */
int d;
public:
/* constructor */
T2TAttention();
/* de-constructor */
~T2TAttention();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &k, XTensor &q, XTensor &v);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TDECODER_H__
#define __T2TDECODER_H__
namespace transformer
{
class T2TDecoder
{
};
class AttDecoder : T2TDecoder
{
public:
/* initialize the model */
void InitModel(int argc, const char ** argv);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-01
*/
#include <math.h>
#include "T2TEmbedding.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TEmbedder::T2TEmbedder()
{
devID = -1;
mem = NULL;
vSize = -1;
maxLength = -1;
}
/* deconstructor */
T2TEmbedder::~T2TEmbedder()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TEmbedder::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
int d = 0;
LoadParamInt(argc, argv, "vsize", &vSize, -1);
LoadParamInt(argc, argv, "maxlen", &maxLength, 256);
LoadParamInt(argc, argv, "d", &eSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &d, DEFAULT_BEDDING_SIZE);
InitTensor2D(&w, vSize, eSize, X_FLOAT, devID, mem);
w.SetDataRandn(0, sqrt((float)eSize));
/* create the positional embedding matrix */
MakePosEmbedding(eSize, d, maxLength);
}
/*
make positional embeddings (of size eSize * length)
>> eSize - embedding size
>> d - model dimension used in the sinusoid denominator
>> length - length of the sequence
*/
void T2TEmbedder::MakePosEmbedding(int eSize, int d, int length)
{
InitTensor2D(&posEmbeddingBase, length, eSize, X_FLOAT, devID, mem);
float * data = new float[posEmbeddingBase.unitNum];
for(int pos = 0; pos < length; pos++){
float * dp = data + pos * eSize;
for(int k = 0; k < eSize; k++){
if(k % 2 == 0){
int i = k/2;
dp[k] = sin(pos/pow(10000.0F, 2.0F*i/d));
}
else{
int i = (k - 1)/2;
dp[k] = cos(pos/pow(10000.0F, 2.0F*i/d));
}
}
}
posEmbeddingBase.SetData(data, posEmbeddingBase.unitNum);
delete[] data;
}
/*
make the network
*/
XTensor T2TEmbedder::Make(XTensor &input)
{
CheckNTErrors(input.GetDim(-1) == vSize, "Wrong vocabulary size!");
CheckNTErrors(input.order > 1, "Wrong input tensor size!");
CheckNTErrors(input.dimSize[input.order - 2] < maxLength, "The sequence is too long!");
CheckNTErrors(vSize > 0, "set vocabulary size by \"-vsize\"");
CheckNTErrors(eSize > 0, "set embedding size by \"-esize\"");
int dims[MAX_TENSOR_DIM_NUM];
memcpy(dims, input.dimSize, input.order * sizeof(int));
dims[input.order - 1] = eSize;
bool match = (posEmbedding.order == input.order);
if(match){
for(int i = 0; i < input.order; i++){
if(dims[i] != posEmbedding.GetDim(i))
match = false;
}
}
/* we make positional embeddings first */
if(!match){
InitTensor(&posEmbedding, input.order, dims, X_FLOAT, 1.0F, devID, mem);
XTensor * posTMP = NewTensorBuf(2, dims + 1, X_FLOAT, 1.0F, devID, mem);
_CopyValues(&posEmbeddingBase, 0, posTMP->unitNum, posTMP, 0);
_Unsqueeze(posTMP, &posEmbedding, 0, dims[0]);
DelTensorBuf(posTMP);
}
XTensor wordEmbedding;
/* then we make word embeddings */
wordEmbedding = MMul(input, w);
/* we sum over the two embeddings */
return wordEmbedding + posEmbedding;
}
}
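MakePosEmbedding fills dp[k] with sin(pos/10000^(2i/d)) for even k and the matching cos for odd k. A quick standalone check of the first two components at position 1 (plain math, independent of the class):

#include <cstdio>
#include <cmath>

int main()
{
    const double d = 512;
    int pos = 1;

    /* k = 0 (even, i = 0) and k = 1 (odd, i = 0): the exponent 2i/d is 0,
       so the denominator 10000^(2i/d) is 1 */
    double pe0 = sin(pos / pow(10000.0, 2.0 * 0 / d));  /* sin(1) ~ 0.8415 */
    double pe1 = cos(pos / pow(10000.0, 2.0 * 0 / d));  /* cos(1) ~ 0.5403 */

    printf("pe(%d, 0) = %.4f, pe(%d, 1) = %.4f\n", pos, pe0, pos, pe1);
    return 0;
}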
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-01
*/
#ifndef __T2TEMBEDDING_H__
#define __T2TEMBEDDING_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
#define DEFAULT_BEDDING_SIZE 512
/*
embedding (of word at position i):
word embedding + positional embedding
*/
class T2TEmbedder
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* vocabulary size */
int vSize;
/* embedding size */
int eSize;
/* maximum length of the sequence */
int maxLength;
/* word embedding matrix */
XTensor w;
/* predefined positional embeddings. Reusing this table
speeds up the generation of positional embeddings. */
XTensor posEmbeddingBase;
/* positional embeddings */
XTensor posEmbedding;
public:
/* constructor */
T2TEmbedder();
/* de-constructor */
~T2TEmbedder();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make positional embeddings */
void MakePosEmbedding(int eSize, int d, int length);
/* make the network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <math.h>
#include "T2TEncoder.h"
#include "T2TLayerNormal.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
AttEncoder::AttEncoder()
{
}
/* de-constructor */
AttEncoder::~AttEncoder()
{
delete[] attentions;
delete[] fnns;
delete[] layerNorms;
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void AttEncoder::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
LoadParamInt(argc, argv, "nstack", &nlayer, 6);
LoadParamInt(argc, argv, "hsize", &hSize, 512);
LoadParamInt(argc, argv, "esize", &eSize, 512);
LoadParamInt(argc, argv, "vsize", &vSize, -1);
CheckNTErrors(nlayer >= 1, "We need at least one encoding layer!");
CheckNTErrors(vSize > 1, "set vocabulary size by \"-vsize\"");
/* embedding model */
embedder.InitModel(argc, argv, devID, mem);
attentions = new T2TAttention[nlayer];
fnns = new T2TFNN[nlayer];
layerNorms = new T2TLN[nlayer];
/* initialize the stacked layers */
for(int i = 0; i < nlayer; i++){
attentions[i].InitModel(argc, argv, myDevID, myMem);
fnns[i].InitModel(argc, argv, myDevID, myMem);
layerNorms[i].InitModel(argc, argv, myDevID, myMem);
}
}
/*
make the encoding network
>> input - the input tensor of the encoder
<< return - the output tensor of the encoder
*/
XTensor AttEncoder::Make(XTensor &input)
{
XTensor x;
x = embedder.Make(input);
for(int i = 0; i < nlayer; i++){
XTensor att;
XTensor ln;
XTensor fnn;
XTensor res;
/* self attention */
att = attentions[i].Make(x, x, x);
/* residual connection */
res = Sum(att, x);
/* TODO: dropout */
/* layer normalization */
ln = layerNorms[i].Make(res);
/* input of next layer */
x = ln;
/* fnn */
fnn = fnns[i].Make(x);
/* residual connection */
res = Sum(fnn, x);
/* TODO: dropout */
/* layer normalization */
ln = layerNorms[i].Make(res);
/* input of next layer */
x = ln;
}
return x;
}
}
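Each pass of the loop in Make applies the post-norm residual pattern twice: x = LN(x + Att(x)) and then x = LN(x + FNN(x)). A shape-free sketch of that control flow with made-up scalar stand-ins for the sublayers (the functions below are illustrative, not part of the library):

#include <cstdio>

static float Att(float x) { return 0.5F * x; }   /* stand-in for self-attention */
static float FNN(float x) { return x * x; }      /* stand-in for the feed-forward net */
static float LN(float x)  { return x * 0.5F; }   /* stand-in for layer normalization */

int main()
{
    float x = 1.0F;
    for(int i = 0; i < 2; i++){
        x = LN(x + Att(x));   /* attention, residual connection, normalization */
        x = LN(x + FNN(x));   /* feed-forward, residual connection, normalization */
    }
    printf("%.4f\n", x);
    return 0;
}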
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TENCODER_H__
#define __T2TENCODER_H__
#include "T2TFNN.h"
#include "T2TAttention.h"
#include "T2TEmbedding.h"
#include "T2TLayerNormal.h"
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
/*
base class of the encoder
*/
class T2TEncoder
{
public:
virtual
XTensor Make(XTensor &input) = 0;
};
/*
the encoder based on RNN
*/
class RNNEncoder : T2TEncoder
{
public:
XTensor Make(XTensor &input);
};
/*
the encoder based on self-attention
*/
class AttEncoder : T2TEncoder
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* layer number */
int nlayer;
/* hidden layer size of the FNN layer */
int hSize;
/* embedding size */
int eSize;
/* vocabulary size */
int vSize;
/* embedding of word at each position */
T2TEmbedder embedder;
/* FNN model of each layer */
T2TFNN * fnns;
/* attention model of each layer */
T2TAttention * attentions;
/* layer normalization */
T2TLN * layerNorms;
/* input tensor of the encoder */
XTensor * input;
/* output tensor of the encoder */
XTensor * output;
public:
/* constructor */
AttEncoder();
/* de-constructor */
~AttEncoder();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the encoding network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TFNN.h"
#include "T2TUtility.h"
#include "T2TEmbedding.h"
#include "../../tensor/core/CHeader.h"
#include "../../tensor/function/FHeader.h"
namespace transformer
{
/* constructor */
T2TFNN::T2TFNN()
{
inSize = -1;
outSize = -1;
hSize = -1;
}
/* deconstructor */
T2TFNN::~T2TFNN()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TFNN::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "d", &inSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &outSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "fnnh", &hSize, DEFAULT_BEDDING_SIZE);
LoadParamFloat(argc, argv, "fnnminmax", &minmax, 0.08F);
InitTensor2D(&w1, inSize, hSize, X_FLOAT, devID, mem);
InitTensor1D(&b1, hSize, X_FLOAT, devID, mem);
InitTensor2D(&w2, hSize, outSize, X_FLOAT, devID, mem);
InitTensor1D(&b2, outSize, X_FLOAT, devID, mem);
w1.SetDataRand(-minmax, minmax);
b1.SetDataRand(-minmax, minmax);
w2.SetDataRand(-minmax, minmax);
b2.SetDataRand(-minmax, minmax);
}
/*
make the network
y = max(0, x * w1 + b1) * w2 + b2
>> input - the input tensor
<< return - the output tensor
*/
XTensor T2TFNN::Make(XTensor &input)
{
XTensor t1;
/* t1 = max(0, x * w1 + b1) */
t1 = Rectify(MMul(input, X_NOTRANS, w1, X_NOTRANS) + b1);
/* result = t1 * w2 + b2 */
return MMul(t1, X_NOTRANS, w2, X_NOTRANS) + b2;
}
}
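For a concrete feel of y = max(0, x * w1 + b1) * w2 + b2, the same two-step computation on made-up scalars:

#include <cstdio>

int main()
{
    float x = 2.0F, w1 = -1.5F, b1 = 1.0F, w2 = 3.0F, b2 = 0.5F;

    /* t1 = max(0, x * w1 + b1): the rectifier zeroes the negative pre-activation */
    float pre = x * w1 + b1;            /* 2 * (-1.5) + 1 = -2 */
    float t1 = pre > 0 ? pre : 0;       /* rectified to 0 */

    /* y = t1 * w2 + b2 */
    printf("y = %.2f\n", t1 * w2 + b2); /* 0 * 3 + 0.5 = 0.50 */
    return 0;
}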
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TFNN_H__
#define __T2TFNN_H__
#include "../../tensor/XTensor.h"
using namespace nts;
namespace transformer
{
/* a fnn: y = max(0, x * w1 + b1) * w2 + b2 */
class T2TFNN
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* size of input vector */
int inSize;
/* size of output vector */
int outSize;
/* size of hidden layers */
int hSize;
/* matrix of transformation 1 */
XTensor w1;
/* bias of transformation 1 */
XTensor b1;
/* matrix of transformation 2 */
XTensor w2;
/* bias of transformation 2 */
XTensor b2;
public:
/* constructor */
T2TFNN();
/* deconstructor */
~T2TFNN();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TLayerNormal.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TLN::T2TLN()
{
devID = -1;
mem = NULL;
}
/* de-constructor */
T2TLN::~T2TLN()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TLN::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
}
/*
make the network
for each layer representation x, we have
y = (x - \mu)/\sigma
where \mu and \sigma are the mean and the standard deviation
of x along its last dimension
>> input - the input tensor
<< return - layer normalization output
*/
XTensor T2TLN::Make(XTensor &input)
{
XTensor &x = input;
XTensor mean;
XTensor variance;
XTensor standard;
XTensor meanFilled;
XTensor standardFilled;
/* \mu = (sum_i x_i)/m */
mean = ReduceMean(x, x.order - 1);
/* \sigma^2 = (sum_i (x_i - \mu)^2)/m */
variance = ReduceVariance(x, x.order - 1, mean);
/* standard = sqrt(variance) */
standard = Power(variance, 0.5F);
/* unsqueeze mean and standard deviation to fit them into
the same size of x */
meanFilled = Unsqueeze(mean, x.order - 1, x.GetDim(-1));
standardFilled = Unsqueeze(standard, x.order - 1, x.GetDim(-1));
/* x' = (x - \mu)/standard */
return (x - meanFilled)/standardFilled;
}
}
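A numeric check of the normalization above: for x = (1, 2, 3) the mean is 2 and the variance 2/3, so the output is roughly (-1.2247, 0, 1.2247). The same arithmetic on a plain array:

#include <cstdio>
#include <cmath>

int main()
{
    const int m = 3;
    float x[m] = {1, 2, 3};

    float mean = 0, var = 0;
    for(int i = 0; i < m; i++) mean += x[i] / m;
    for(int i = 0; i < m; i++) var += (x[i] - mean) * (x[i] - mean) / m;
    float standard = sqrt(var);   /* ~0.8165 */

    for(int i = 0; i < m; i++)
        printf("%.4f\n", (x[i] - mean) / standard);  /* -1.2247 0.0000 1.2247 */
    return 0;
}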
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TLAYERNORMAL_H__
#define __T2TLAYERNORMAL_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
class T2TLN
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
public:
/* constructor */
T2TLN();
/* de-constructor */
~T2TLN();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TModel.h"
#include "T2TUtility.h"
namespace transformer
{
/* constructor */
T2TModel::T2TModel()
{
devID = -1;
mem = NULL;
isLM = false;
isMT = false;
}
/* de-constructor */
T2TModel::~T2TModel()
{
delete mem;
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
*/
void T2TModel::InitModel(int argc, const char ** argv)
{
bool useMem = false;
LoadParamInt(argc, argv, "dev", &devID, -1);
LoadParamBool(argc, argv, "mem", &useMem, useMem);
LoadParamBool(argc, argv, "lm", &isLM, true);
LoadParamBool(argc, argv, "mt", &isMT, false);
if(useMem){
delete mem;
mem = new XMem(devID);
}
encoder.InitModel(argc, argv, devID, mem);
outputLayer.InitModel(argc, argv, devID, mem);
}
/*
make the encoding network
>> input - input tensor
<< return - encoding result
*/
XTensor T2TModel::MakeEncoding(XTensor &input)
{
return encoder.Make(input);
}
/*
make the entire network (with the output softmax layer)
>> input - input tensor
>> output - output tensor (distribution)
*/
void T2TModel::Make(XTensor &input, XTensor &output)
{
if(isLM){
XTensor encoding;
encoding = MakeEncoding(input);
outputLayer.Make(encoding, output);
}
else{
ShowNTErrors("TODO!");
}
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TMODEL_H__
#define __T2TMODEL_H__
#include "T2TFNN.h"
#include "T2TAttention.h"
#include "T2TEncoder.h"
#include "T2TDecoder.h"
#include "T2TOutput.h"
namespace transformer
{
class T2TModel
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* the encoder */
AttEncoder encoder;
/* the decoder */
AttDecoder decoder;
/* output layer */
T2TOutput outputLayer;
/* indicates whether the model is running for language modeling */
bool isLM;
/* indicates whether the model is running for machine translation */
bool isMT;
public:
/* constructor */
T2TModel();
/* de-constructor */
~T2TModel();
/* initialize the model */
void InitModel(int argc, const char ** argv);
/* make the encoding network */
XTensor MakeEncoding(XTensor &input);
/* make the entire network (with the output softmax layer) */
void Make(XTensor &input, XTensor &output);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TOutput.h"
#include "T2TUtility.h"
#include "T2TEmbedding.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TOutput::T2TOutput()
{
devID = -1;
mem = NULL;
vSize = -1;
inSize = -1;
hSize = -1;
}
/* de-constructor */
T2TOutput::~T2TOutput()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TOutput::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "vsize", &vSize, -1);
LoadParamInt(argc, argv, "d", &inSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &hSize, DEFAULT_BEDDING_SIZE);
LoadParamFloat(argc, argv, "outputminmax", &minmax, 0.08F);
InitTensor2D(&w, hSize, vSize, X_FLOAT, devID, mem);
w.SetDataRand(-minmax, minmax);
}
/*
make the network
y = logsoftmax(x * w)
>> input - input tensor
<< return - output tensor
*/
XTensor T2TOutput::Make(XTensor &input)
{
XTensor &x = input;
return LogSoftmax(MMul(x, w), -1);
}
/*
make the network (redefined output tensor)
>> input - input tensor
>> output - output tensor
*/
void T2TOutput::Make(XTensor &input, XTensor &output)
{
XTensor &x = input;
output = LogSoftmax(MMul(x, w), -1);
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TOUTPUT_H__
#define __T2TOUTPUT_H__
#include "../../tensor/function/FHeader.h"
using namespace nts;
namespace transformer
{
/* output layer */
class T2TOutput
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* vocabulary size */
int vSize;
/* input vector size */
int inSize;
/* vector size of the linear transformation */
int hSize;
/* transformation matrix */
XTensor w;
public:
/* constructor */
T2TOutput();
/* de-constructor */
~T2TOutput();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &input);
/* make the network (redefined output tensor) */
void Make(XTensor &input, XTensor &output);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-02
*/
#include <math.h>
#include "T2TTrainer.h"
#include "T2TUtility.h"
#include "../../tensor/XUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TTrainer::T2TTrainer()
{
devID = -1;
mem = NULL;
seqLen = NULL;
nseqBuf = 0;
nextSeq = -1;
}
/* de-constructor */
T2TTrainer::~T2TTrainer()
{
delete[] buf;
delete[] seqLen;
delete[] seqOffset;
}
/*
initialization
>> argc - number of arguments
>> argv - list of pointers to the arguments
*/
void T2TTrainer::Init(int argc, const char ** argv)
{
LoadParamInt(argc, argv, "dev", &devID, -1);
LoadParamFloat(argc, argv, "lrate", &lrate, 0.001F);
LoadParamInt(argc, argv, "sbatch", &sBatchSize, 1);
LoadParamInt(argc, argv, "wbatch", &wBatchSize, 1);
LoadParamInt(argc, argv, "nepoch", &nepoch, 1);
LoadParamInt(argc, argv, "nstep", &nstep, 1);
LoadParamInt(argc, argv, "vsize", &vSize, 1);
LoadParamBool(argc, argv, "sorted", &isLenSorted, false);
LoadParamInt(argc, argv, "bufsize", &bufSize, 50000);
buf = new int[bufSize];
seqLen = new int[bufSize];
seqOffset = new int[bufSize];
}
/*
train the model
>> fn - training data file
>> model - model to train
*/
void T2TTrainer::Train(const char * fn, T2TModel * model)
{
int epoch = 0;
int step = 0;
int wc = 0;
int wordCount = 0;
int wordCountTotal = 0;
bool isEnd = false;
float loss = 0;
XNet net;
double startT = GetClockSec();
for(epoch = 0; epoch < nepoch; epoch++){
FILE * file = fopen(fn, "rb");
CheckNTErrors(file, "cannot open training file!");
wordCount = 0;
/* batch of input sequences */
XTensor batch;
while(LoadBatch(file, &batch, 1, vSize, sBatchSize, wBatchSize, isLenSorted, wc)){
/* output probabilities */
XTensor output;
/* make the network */
model->Make(batch, output);
/* back-propagation for obtaining gradients */
net.Backward(output, batch, CROSSENTROPY);
/* update the parameters */
Update(model);
/* get the log-probability of the batch */
float prob = GetProb(&output, &batch, NULL);
loss += -prob;
wordCount += wc;
wordCountTotal += wc;
if(++step >= nstep){
isEnd = true;
break;
}
if (step % 1 == 0) {
double elapsed = GetClockSec() - startT;
XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, words=%d, ppl=%.3f\n",
elapsed, step, epoch + 1, wordCountTotal, exp(loss / wordCount));
}
}
fclose(file);
}
double elapsed = GetClockSec() - startT;
XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, words=%d, ppl=%.3f\n",
elapsed, step, epoch, wordCountTotal, exp(loss / wordCount));
XPRINT3(0, stderr, "[INFO] training finished (took %.1fs, step=%d and epoch=%d)\n",
elapsed, step, epoch);
}
char line[MAX_SEQUENCE_LENGTH];
/*
load data to buffer
>> file - where to load data
*/
int T2TTrainer::LoadBuf(FILE * file)
{
int lineCount = 0;
int seqCount = 0;
int wordCount = 0;
while(fgets(line, MAX_SEQUENCE_LENGTH - 1, file)){
int len = (int)strlen(line);
while(len > 0 && (line[len - 1] == '\r' || line[len - 1] == '\n')){
line[len - 1] = 0;
len--;
}
len = (int)strlen(line);
if(len == 0)
continue;
/* how many characters are in a word */
int wSize = 0;
/* how many words are in the sentence */
int wNum = 0;
int wNumLocal = 0;
int i = 0;
for(i = 0; i < len; i++){
/* load word (id) separated by space or tab */
if((line[i] == ' ' || line[i] == '\t') && wSize > 0){
line[i] = 0;
if(wSize == 3 && line[i - 1] == '|' && line[i - 2] == '|' && line[i - 3] == '|'){
seqLen[seqCount] = wNumLocal;
seqOffset[seqCount] = wordCount + wNum - wNumLocal;
seqCount++;
wNumLocal = 0;
}
else{
buf[wordCount + wNum++] = atoi(line + i - wSize);
wNumLocal++;
}
wSize = 0;
}
else
wSize++;
}
if(wSize > 0){
buf[wordCount + wNum++] = atoi(line + i - wSize);
wNumLocal++;
}
seqLen[seqCount] = wNumLocal;
seqOffset[seqCount] = wordCount + wNum - wNumLocal;
seqCount++;
wordCount += wNum;
lineCount++;
if(wordCount >= bufSize - MAX_SEQUENCE_LENGTH)
break;
}
nseqBuf = seqCount;
nextSeq = 0;
return lineCount;
}
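/* Input format, as reconstructed from the parsing code above (treat this as
   an illustration): one line per sample, word ids separated by spaces or
   tabs, with an optional "|||" token closing a sequence early. For example

       12 7 431 2 ||| 9 83 2

   is stored as two sequences with seqLen = {4, 3}, whose seqOffset entries
   point at positions 0 and 4 of buf. */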
/*
load a batch of sequences
>> file - the handle to the data file
>> batch - the batch
>> step - the step we go over when moving to the next sequence
>> vs - vocabulary size
>> sBatch - batch size of sequences
>> wBatch - batch size of words
>> isSorted - indicates whether the sequences are sorted by length
>> wCount - word count
*/
int T2TTrainer::LoadBatch(FILE * file, XTensor * batch, int step, int vs, int sBatch, int wBatch, bool isSorted, int &wCount)
{
wCount = 0;
if(nextSeq < 0 || nextSeq >= nseqBuf)
LoadBuf(file);
int seq = MAX(nextSeq, 0);
int wc = 0;
int wn = 0;
int sc = 0;
int max = 0;
while(seq + sc < nseqBuf){
wn = seqLen[seq + sc];
wc += wn;
sc += 1;
if(max < wn)
max = wn;
if(sc >= sBatch && wc >= wBatch)
break;
}
nextSeq = seq + sc;
if(sc > 0){
int dims[MAX_TENSOR_DIM_NUM];
dims[0] = sc;
dims[1] = max;
dims[2] = vs;
if(batch->order != 3 || batch->GetDim(0) != dims[0] ||
batch->GetDim(1) != dims[1] || batch->GetDim(2) != dims[2]){
InitTensor(batch, 3, dims, X_FLOAT, 1.0F, devID, mem);
}
batch->SetZeroAll();
/* this might be slow on GPUs :( */
for(int s = seq; s < seq + sc; s++){
for(int w = 0; w < seqLen[s]; w++){
batch->Set3D(1.0F, s - seq, w, buf[seqOffset[s] + w]);
wCount++;
}
}
}
return sc;
}
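/* Layout note (illustrative): the batch built above is a one-hot tensor of
   shape (sc, max, vs) where batch[s][w][id] = 1 iff word "id" occurs at
   position w of sequence s; padded positions remain all-zero after
   SetZeroAll(). Reading an entry back might look like

       float hit = batch->Get3D(s - seq, w, buf[seqOffset[s] + w]);

   assuming a Get3D accessor symmetric to the Set3D call used here. */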
/*
get word probabilities for a batch of sequences
>> output - word distribution for each position
>> gold - gold standard
>> wordProbs - word probability for gold prediction
*/
float T2TTrainer::GetProb(XTensor * output, XTensor * gold, XTensor * wordProbs)
{
XTensor probs;
InitTensor(&probs, output);
/* probs[i,j] = output[i,j] * gold[i,j] */
_Multiply(output, gold, &probs);
/* probability of each word */
XTensor wprobs;
InitTensor1D(&wprobs, output->unitNum/output->GetDim(-1), X_FLOAT, output->devID, output->mem);
int dims[2] = {output->unitNum/output->GetDim(-1), output->GetDim(-1)};
probs.Reshape(2, dims);
_ReduceSum(&probs, &wprobs, 1);
if(wordProbs != NULL)
_CopyValues(&wprobs, wordProbs);
/* reshape the tensor to fit it into the reduce procedure
TODO: XTensor supports scalars */
dims[0] = 1;
dims[1] = probs.unitNum;
probs.Reshape(2, dims);
/* probability for the batch */
XTensor result;
InitTensor1D(&result, 1, X_FLOAT, output->devID, output->mem);
_ReduceSum(&probs, &result, 1);
return result.Get1D(0);
}
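/* A worked note, under the assumption that "output" holds log-probabilities:
   with one-hot gold labels, multiplying and reducing as above picks out
   log P(w_j) for each position j, so the returned value is sum_j log P(w_j).
   Train() then accumulates loss = -sum_j log P(w_j) and reports
   ppl = exp(loss / wordCount). */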
/*
update the model by delta rule
>> model - the t2t model
*/
void T2TTrainer::Update(T2TModel * model)
{
XList ws(100);
ws.Add(&model->outputLayer.w);
for(int i = 0; i < model->encoder.nlayer; i++){
ws.Add(&model->encoder.fnns[i].w1);
ws.Add(&model->encoder.fnns[i].b1);
ws.Add(&model->encoder.fnns[i].w2);
ws.Add(&model->encoder.fnns[i].b2);
}
ws.Add(&model->encoder.embedder.w);
for(int i = 0; i < ws.count; i++){
XTensor * para = (XTensor*)ws.Get(i);
XTensor * paraGrad = para->grad;
CheckNTErrors(para != NULL, "NULL parameter tensor!");
CheckNTErrors(paraGrad != NULL, "NULL gradient tensor!");
/* the delta rule */
_Sum(para, paraGrad, para, -lrate);
}
}
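/* The delta rule above is plain stochastic gradient descent: for each
   parameter tensor w, _Sum(w, grad, w, -lrate) computes

       w <- w + (-lrate) * dL/dw

   i.e. a fixed-learning-rate step with no momentum or weight decay. */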
}
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-02
*/
#ifndef __T2TTRAINER_H__
#define __T2TTRAINER_H__
#include "T2TModel.h"
#include "../../tensor/function/FHeader.h"
#define MAX_SEQUENCE_LENGTH (1024 * 4)
using namespace nts;
namespace transformer
{
/* trainer of the T2T model */
class T2TTrainer
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* buffer for loading words */
int * buf;
/* buffer size */
int bufSize;
/* length of each sequence */
int * seqLen;
/* offset of the first word for each sequence */
int * seqOffset;
/* number of sequences in the buffer */
int nseqBuf;
/* offset for next sequence in the buffer */
int nextSeq;
/* indicates whether the sequence is sorted by length */
bool isLenSorted;
/* vocabulary size of the source side */
int vSize;
/* learning rate */
float lrate;
/* sentence batch size */
int sBatchSize;
/* word batch size */
int wBatchSize;
/* training epoch number */
int nepoch;
/* training step number */
int nstep;
public:
/* constructor */
T2TTrainer();
/* destructor */
~T2TTrainer();
/* initialize the trainer */
void Init(int argc, const char ** argv);
/* train the model */
void Train(const char * fn, T2TModel * model);
/* load data to buffer */
int LoadBuf(FILE * file);
/* load a batch of sequences */
int LoadBatch(FILE * file, XTensor * batch, int step, int vs, int sBatch, int wBatch, bool isSorted, int &wCount);
/* get word probabilities for a batch of sequences */
float GetProb(XTensor * output, XTensor * gold, XTensor * wordProbs);
/* update the model by delta rule */
void Update(T2TModel * model);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
namespace transformer
{
void LoadParamString(int argc, const char ** argv, const char * name, char * p, const char * defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
strcpy(p, argv[i + 1]);
//fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
strcpy(p, defaultP);
}
void LoadParamInt(int argc, const char ** argv, const char * name, int * p, int defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
*p = atoi(argv[i + 1]);
//fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
*p = defaultP;
}
void LoadParamBool(int argc, const char ** argv, const char * name, bool * p, bool defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname)){
*p = true;
//fprintf(stderr, " %s=%s\n", name, "true");
hit = true;
}
}
if(!hit)
*p = defaultP;
}
void LoadParamFloat(int argc, const char ** argv, const char * name, float * p, float defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
*p = (float)atof(argv[i + 1]);
//fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
*p = defaultP;
}
void ShowParams(int argc, const char ** argv)
{
fprintf(stderr, "args:\n");
for(int i = 0; i < argc; i++){
if(argv[i][0] == '-'){
if(i + 1 < argc && argv[i + 1][0] != '-')
fprintf(stderr, " %s=%s\n", argv[i], argv[i + 1]);
else
fprintf(stderr, " %s=yes\n", argv[i]);
}
}
fprintf(stderr, "\n");
}
}
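/* A usage sketch for the loaders above (variable names are hypothetical):

       int dev;
       float lr;
       LoadParamInt(argc, argv, "dev", &dev, -1);        // consumes "-dev 0"
       LoadParamFloat(argc, argv, "lrate", &lr, 0.001F); // consumes "-lrate 0.01"

   Each call scans all of argv, so the last occurrence of a flag wins, and a
   missing flag falls back to the given default. */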
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TUTILITY_H__
#define __T2TUTILITY_H__
#include <stdio.h>
namespace transformer
{
/* load arguments */
void LoadParamString(int argc, const char ** argv, const char * name, char * p, const char * defaultP);
void LoadParamInt(int argc, const char ** argv, const char * name, int * p, int defaultP);
void LoadParamBool(int argc, const char ** argv, const char * name, bool * p, bool defaultP);
void LoadParamFloat(int argc, const char ** argv, const char * name, float * p, float defaultP);
/* show arguments */
void ShowParams(int argc, const char ** argv);
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "Transformer.h"
#include "T2TModel.h"
#include "T2TUtility.h"
#include "T2TTrainer.h"
#include "../../tensor/XDevice.h"
namespace transformer
{
int TransformerMain(int argc, const char ** argv)
{
if(argc == 0)
return 1;
ShowParams(argc, argv);
char * trainFN = new char[MAX_LINE_LENGTH];
LoadParamString(argc, argv, "train", trainFN, "");
T2TModel model;
model.InitModel(argc, argv);
if(strcmp(trainFN, "")){
T2TTrainer trainer;
trainer.Init(argc, argv);
trainer.Train(trainFN, &model);
}
delete[] trainFN;
return 0;
}
}
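/* An illustrative entry point -- an assumption, since the real main() lives
   elsewhere in the repository. The first argument (the program name) is
   skipped so that parameter parsing starts at argv[1]:

       int main(int argc, const char ** argv)
       {
           return transformer::TransformerMain(argc - 1, argv + 1);
       }
*/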
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
*
* An implementation of the transformer system. See more details
* about the Transformer in
* "Attention Is All You Need" by Vaswani et al.
* https://arxiv.org/pdf/1706.03762.pdf
*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
* I start writing the code related to NMT - a long time since my last coding
* work on MT
*/
#ifndef __TRANSFORMER_H__
#define __TRANSFORMER_H__
#include "../../tensor/XGlobal.h"
#include "../../tensor/XTensor.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* entrance of the program */
int TransformerMain(int argc, const char ** argv);
}
#endif
...@@ -29,6 +29,7 @@ ...@@ -29,6 +29,7 @@
#include "XTensor.h" #include "XTensor.h"
#include "XDevice.h" #include "XDevice.h"
#include "./test/Test.h" #include "./test/Test.h"
#include "./core/CHeader.h"
//#define CRTDBG_MAP_ALLOC //#define CRTDBG_MAP_ALLOC
//#include <stdlib.h> //#include <stdlib.h>
...@@ -37,6 +38,7 @@ ...@@ -37,6 +38,7 @@
using namespace nts; using namespace nts;
void SmallTest(); void SmallTest();
void TransposeTest();
int main( int argc, const char ** argv ) int main( int argc, const char ** argv )
{ {
...@@ -92,3 +94,35 @@ void SmallTest() ...@@ -92,3 +94,35 @@ void SmallTest()
c.Dump(stderr, "c:"); c.Dump(stderr, "c:");
d.Dump(stderr, "d:"); d.Dump(stderr, "d:");
} }
void TransposeTest()
{
XTensor a;
XTensor b;
int I = 2;
int J = 3;
InitTensor4D(&a, 2, 3, 4, 5);
int * dims = new int[a.order];
memcpy(dims, a.dimSize, sizeof(int) * a.order);
dims[I] = a.dimSize[J];
dims[J] = a.dimSize[I];
InitTensor(&b, 4, dims);
a.SetZeroAll();
b.SetZeroAll();
float * data = new float[a.unitNum];
for(int i = 0; i < a.unitNum; i++)
data[i] = (float)i;
a.SetData(data, a.unitNum, 0);
_Transpose(&a, &b, I, J);
b.Dump(stderr, "b:");
delete[] data;
delete[] dims;
}
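/* Expected behaviour (illustrative, assuming _Transpose follows the usual
   convention): with a of shape 2x3x4x5 and I = 2, J = 3, b ends up with
   shape 2x3x5x4 and b[i][j][n][m] == a[i][j][m][n] for all valid indices. */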
...@@ -40,6 +40,7 @@ XDevManager GDevs; ...@@ -40,6 +40,7 @@ XDevManager GDevs;
/* constructor */ /* constructor */
XDevice::XDevice() XDevice::XDevice()
{ {
stream = NULL;
Clear(); Clear();
#ifdef USE_CUDA #ifdef USE_CUDA
...@@ -55,6 +56,8 @@ XDevice::~XDevice() ...@@ -55,6 +56,8 @@ XDevice::~XDevice()
MUTEX_DELE(cublasMutex); MUTEX_DELE(cublasMutex);
if(isHandleReady) if(isHandleReady)
cublasDestroy(cublasHandle); cublasDestroy(cublasHandle);
if(stream != NULL)
delete stream;
#endif #endif
} }
...@@ -118,6 +121,8 @@ void XDevice::Init(int myDevID) ...@@ -118,6 +121,8 @@ void XDevice::Init(int myDevID)
} }
else else
sprintf(name2, "GPU-%d %s", devID, name); sprintf(name2, "GPU-%d %s", devID, name);
stream = new XStream(0, devID);
#endif #endif
} }
...@@ -161,6 +166,14 @@ cublasHandle_t * XDevice::GetCublasHandle() ...@@ -161,6 +166,14 @@ cublasHandle_t * XDevice::GetCublasHandle()
return &cublasHandle; return &cublasHandle;
} }
/* get the stream of cuda */
cudaStream_t * XDevice::GetCudaStream()
{
CheckNTErrors(stream != NULL, "the stream is not initialized!");
return &stream->stream;
}
#endif // USE_CUDA #endif // USE_CUDA
/* switch to a device */ /* switch to a device */
...@@ -311,11 +324,19 @@ void XDevManager::Clear() ...@@ -311,11 +324,19 @@ void XDevManager::Clear()
/* get the handle of GPU */ /* get the handle of GPU */
cublasHandle_t * XDevManager::GetCudaHandle(const int devID) cublasHandle_t * XDevManager::GetCudaHandle(const int devID)
{ {
CheckNTErrors((devID < nGPU), "index of GPU is out of range."); CheckNTErrors(devID < nGPU, "index of GPU is out of range.");
return GPUs[devID].GetCublasHandle(); return GPUs[devID].GetCublasHandle();
} }
/* get the stream of cuda */
cudaStream_t * XDevManager::GetCudaStream(const int devID)
{
CheckNTErrors(devID < nGPU, "index of GPU is out of range.");
return GPUs[devID].GetCudaStream();
}
#endif #endif
/* /*
...@@ -384,13 +405,10 @@ int XDevManager::GetCudaThread2D(const int devID, const int n, const int m, int ...@@ -384,13 +405,10 @@ int XDevManager::GetCudaThread2D(const int devID, const int n, const int m, int
memset(gridSize, 0, sizeof(int) * 3); memset(gridSize, 0, sizeof(int) * 3);
memset(blockSize, 0, sizeof(int) * 3); memset(blockSize, 0, sizeof(int) * 3);
if(n <= 0 || m <= 0 || devID >= nGPU) if(n <= 0 || m <= 0)
return 1; return 1;
if(devID < 0){ CheckNTErrors(devID >= 0 && devID < nGPU, "Invalid GPU device id!");
XPRINT(0, stderr, "WARNING! You are calling the grid and block size computation function on a CPU!");
return 0;
}
#ifdef USE_CUDA #ifdef USE_CUDA
......
...@@ -25,6 +25,7 @@ ...@@ -25,6 +25,7 @@
#define __XDEVICE_H__ #define __XDEVICE_H__
#include "XThread.h" #include "XThread.h"
#include "XStream.h"
#ifdef USE_CUDA #ifdef USE_CUDA
...@@ -92,6 +93,9 @@ public: ...@@ -92,6 +93,9 @@ public:
/* specify whether Unified Virtual Address Space (UVA) is supported */ /* specify whether Unified Virtual Address Space (UVA) is supported */
bool isUVASupported; bool isUVASupported;
/* default stream for the device */
XStream * stream;
#ifdef USE_CUDA #ifdef USE_CUDA
/* mutex for handle (GPU cublas) */ /* mutex for handle (GPU cublas) */
...@@ -121,6 +125,9 @@ public: ...@@ -121,6 +125,9 @@ public:
#ifdef USE_CUDA #ifdef USE_CUDA
/* get cublas handle */ /* get cublas handle */
cublasHandle_t * GetCublasHandle(); cublasHandle_t * GetCublasHandle();
/* get the stream of cuda */
cudaStream_t * GetCudaStream();
#endif #endif
/* switch to a device */ /* switch to a device */
...@@ -178,6 +185,9 @@ public: ...@@ -178,6 +185,9 @@ public:
#ifdef USE_CUDA #ifdef USE_CUDA
/* get the handle of GPU */ /* get the handle of GPU */
cublasHandle_t * GetCudaHandle(const int devID); cublasHandle_t * GetCudaHandle(const int devID);
/* get the stream of cuda */
cudaStream_t * GetCudaStream(const int devID);
#endif #endif
/* get grid and block sizes that max potential */ /* get grid and block sizes that max potential */
......
...@@ -167,7 +167,9 @@ void XLink::SetType(int id) ...@@ -167,7 +167,9 @@ void XLink::SetType(int id)
type[0] = 0; type[0] = 0;
strcpy(type, GetOPName(id)); strcpy(type, GetOPName(id));
typeID = id; typeID = id;
CheckNTErrors(strcmp(type, "NULL"), "illegal edge type name!"); if(id != 0){
CheckNTErrors(strcmp(type, "NULL"), "illegal edge type name!");
}
} }
/* /*
...@@ -515,7 +517,7 @@ void XLink::CopyIncoming(const XTensor * reference, XTensor * target) ...@@ -515,7 +517,7 @@ void XLink::CopyIncoming(const XTensor * reference, XTensor * target)
tails.Add(tail); tails.Add(tail);
} }
MakeLink(&tails, target, reference->id); MakeLink(&tails, target, reference->income.typeID);
int paraNum = reference->income.paramNum; int paraNum = reference->income.paramNum;
target->income.paramNum = paraNum; target->income.paramNum = paraNum;
......
...@@ -208,22 +208,16 @@ void XList::Insert(int pos, void * item) ...@@ -208,22 +208,16 @@ void XList::Insert(int pos, void * item)
/* get the item at position i */ /* get the item at position i */
void * XList::GetItem(int i) const void * XList::GetItem(int i) const
{ {
if( i >= 0 && i < count ) CheckNTErrors(i >= 0 && i < count, "Index of a list item is out of scope!");
return items[i]; return items[i];
else
return NULL;
} }
/* get the integer-typed item at position i */ /* get the integer-typed item at position i */
int XList::GetItemInt(int i) int XList::GetItemInt(int i)
{ {
CheckNTErrors(isIntList, "An int list is required!"); CheckNTErrors(isIntList, "An int list is required!");
CheckNTErrors(i >= 0 && i < count, "Index of a list item is out of scope!");
if( i >= 0 && i < count ){ return *(int*)(items[i]);
return *(int*)(items[i]);
}
else
return 0;
} }
/* set the item at position i */ /* set the item at position i */
......
...@@ -181,7 +181,10 @@ void XMem::Free(int myDevID, void * mem) ...@@ -181,7 +181,10 @@ void XMem::Free(int myDevID, void * mem)
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
SetDevice(myDevID); SetDevice(myDevID);
CheckNTErrors(cudaFree((char*)mem) == cudaSuccess, "Cannot free the memory."); cudaError_t error = cudaFree((char*)mem);
if(error != cudaSuccess){
ShowNTErrors("Cannot free the memory.");
}
#else #else
ShowNTErrors("Please specify USE_CUDA for compiling this program."); ShowNTErrors("Please specify USE_CUDA for compiling this program.");
#endif #endif
......
...@@ -29,6 +29,22 @@ const char * GetOPName(int type) ...@@ -29,6 +29,22 @@ const char * GetOPName(int type)
if ((type & MATH_BASE) != 0){ if ((type & MATH_BASE) != 0){
if (type == MATH_ABSOLUTE) if (type == MATH_ABSOLUTE)
return "M_ABSOLUTE"; return "M_ABSOLUTE";
else if (type == MATH_EXP)
return "M_EXP";
else if (type == MATH_LOG)
return "M_LOG";
else if (type == MATH_SIN)
return "M_SIN";
else if (type == MATH_COS)
return "M_COS";
else if (type == MATH_TAN)
return "M_TAN";
else if (type == MATH_ROUND)
return "M_ROUND";
else if (type == MATH_CLIP)
return "M_CLIP";
else if (type == MATH_DIV)
return "M_DIV";
else if (type == MATH_MATRIXMUL) else if (type == MATH_MATRIXMUL)
return "M_MATRIXMUL"; return "M_MATRIXMUL";
else if (type == MATH_MATRIXMULBATCHED) else if (type == MATH_MATRIXMULBATCHED)
...@@ -37,18 +53,20 @@ const char * GetOPName(int type) ...@@ -37,18 +53,20 @@ const char * GetOPName(int type)
return "M_MULTIPLY"; return "M_MULTIPLY";
else if (type == MATH_NEGATE) else if (type == MATH_NEGATE)
return "M_NEGATE"; return "M_NEGATE";
else if (type == MATH_SIGN)
return "M_SIGN";
else if (type == MATH_SUM)
return "M_SUM";
else if (type == MATH_LOG)
return "M_LOG";
else if (type == MATH_NORMALIZE) else if (type == MATH_NORMALIZE)
return "M_NORMALIZE"; return "M_NORMALIZE";
else if (type == MATH_POWER) else if (type == MATH_POWER)
return "M_POWER"; return "M_POWER";
else if (type == MATH_SCALEANDSHIFT) else if (type == MATH_SCALEANDSHIFT)
return "M_SCALEANDSHIFT"; return "M_SCALEANDSHIFT";
else if (type == MATH_SIGN)
return "M_SIGN";
else if (type == MATH_SUM)
return "M_SUM";
else if (type == MATH_SUB)
return "M_SUB";
else if (type == MATH_SUMDIM)
return "M_SUMDIM";
else if (type == REDUCE_REDUCEMAX) else if (type == REDUCE_REDUCEMAX)
return "R_REDUCEMAX"; return "R_REDUCEMAX";
else if (type == REDUCE_REDUCEMEAN) else if (type == REDUCE_REDUCEMEAN)
......
...@@ -30,20 +30,30 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -30,20 +30,30 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* math operations */ /* math operations */
#define MATH_BASE 0x00001000 #define MATH_BASE 0x00001000
#define MATH_ABSOLUTE MATH_BASE + 1 #define MATH_ABSOLUTE MATH_BASE + 1
#define MATH_MATRIXMUL MATH_ABSOLUTE + 1 #define MATH_EXP MATH_ABSOLUTE + 1
#define MATH_LOG MATH_EXP + 1
#define MATH_SIN MATH_LOG + 1
#define MATH_COS MATH_SIN + 1
#define MATH_TAN MATH_COS + 1
#define MATH_ROUND MATH_TAN + 1
#define MATH_CLIP MATH_ROUND + 1
#define MATH_DIV MATH_CLIP + 1
#define MATH_MATRIXMUL MATH_DIV + 1
#define MATH_MATRIXMULBATCHED MATH_MATRIXMUL + 1 #define MATH_MATRIXMULBATCHED MATH_MATRIXMUL + 1
#define MATH_MULTIPLY MATH_MATRIXMULBATCHED + 1 #define MATH_MULTIPLY MATH_MATRIXMULBATCHED + 1
#define MATH_NEGATE MATH_MULTIPLY + 1 #define MATH_NEGATE MATH_MULTIPLY + 1
#define MATH_SIGN MATH_NEGATE + 1 #define MATH_NORMALIZE MATH_NEGATE + 1
#define MATH_SUM MATH_SIGN + 1
#define MATH_LOG MATH_SUM + 1
#define MATH_NORMALIZE MATH_LOG + 1
#define MATH_POWER MATH_NORMALIZE + 1 #define MATH_POWER MATH_NORMALIZE + 1
#define MATH_SCALEANDSHIFT MATH_POWER + 1 #define MATH_SCALEANDSHIFT MATH_POWER + 1
#define MATH_SIGN MATH_SCALEANDSHIFT + 1
#define MATH_SUM MATH_SIGN + 1
#define MATH_SUB MATH_SUM + 1
#define MATH_SUMDIM MATH_SUB + 1
#define REDUCE MATH_SCALEANDSHIFT + 1 #define REDUCE MATH_SUMDIM + 1
#define REDUCE_REDUCEMAX REDUCE + 1 #define REDUCE_REDUCEMAX REDUCE + 1
#define REDUCE_REDUCEMEAN REDUCE_REDUCEMAX + 1 #define REDUCE_REDUCEMEAN REDUCE_REDUCEMAX + 1
#define REDUCE_REDUCESUM REDUCE_REDUCEMEAN + 1 #define REDUCE_REDUCESUM REDUCE_REDUCEMEAN + 1
......
...@@ -84,7 +84,7 @@ void XStream::Create(int priority, int myDevID) ...@@ -84,7 +84,7 @@ void XStream::Create(int priority, int myDevID)
XDevice::SetGPUDevice(myDevID); XDevice::SetGPUDevice(myDevID);
//cudaStreamCreateWithPriority(&stream, cudaStreamDefault, priority); //cudaStreamCreateWithPriority(&stream, cudaStreamDefault, priority);
CheckNTErrors((cudaStreamCreate(&stream) == cudaSuccess), CheckNTErrors((cudaStreamCreate(&stream) == cudaSuccess),
"cannot create the cuda stream!"); "cannot create the cuda stream!");
XDevice::SetGPUDevice(backupDevID); XDevice::SetGPUDevice(backupDevID);
#endif #endif
devID = myDevID; devID = myDevID;
......
...@@ -42,6 +42,8 @@ ...@@ -42,6 +42,8 @@
#include "core/movement/CopyValues.h" #include "core/movement/CopyValues.h"
#include "core/arithmetic/Sum.h" #include "core/arithmetic/Sum.h"
#include "core/arithmetic/Multiply.h" #include "core/arithmetic/Multiply.h"
#include "core/arithmetic/Sub.h"
#include "core/arithmetic/Div.h"
#include "core/math/ScaleAndShift.h" #include "core/math/ScaleAndShift.h"
#ifdef USE_CUDA #ifdef USE_CUDA
...@@ -354,6 +356,18 @@ XTensor XTensor::operator* (const XTensor& tensor) ...@@ -354,6 +356,18 @@ XTensor XTensor::operator* (const XTensor& tensor)
return Multiply(*this, tensor); return Multiply(*this, tensor);
} }
/* overloading of the minus-sign */
XTensor XTensor::operator- (const XTensor& tensor)
{
return Sub(*this, tensor);
}
/* overloading of the division-sign */
XTensor XTensor::operator/ (const XTensor& tensor)
{
return Div(*this, tensor);
}
/* /*
linear transformation b = a * \scale + \shift linear transformation b = a * \scale + \shift
>> scale - the slope >> scale - the slope
...@@ -426,8 +440,12 @@ get the size of a given dimension ...@@ -426,8 +440,12 @@ get the size of a given dimension
int XTensor::GetDim(const int dim) int XTensor::GetDim(const int dim)
{ {
CheckNTErrors(dim < order, "dimension is out of range!"); CheckNTErrors(dim < order, "dimension is out of range!");
int d = dim;
if(dim < 0)
d = order + dim;
return dimSize[dim]; return dimSize[d];
} }
/* /*
...@@ -454,6 +472,27 @@ void XTensor::Reshape(const int myOrder, const int * myDimSize) ...@@ -454,6 +472,27 @@ void XTensor::Reshape(const int myOrder, const int * myDimSize)
memcpy(dimSizeRDI, dimsRDI, sizeof(int) * order); memcpy(dimSizeRDI, dimsRDI, sizeof(int) * order);
} }
/*
reshape the tensor to a vector
>> num - number of elements
*/
void XTensor::Reshape(const int num)
{
int dim = num;
Reshape(1, &dim);
}
/*
reshape the tensor to a matrix
>> rowNum - number of rows
>> colNum - number of columns
*/
void XTensor::Reshape(const int rowNum, const int colNum)
{
int dims[2] = {rowNum, colNum};
Reshape(2, dims);
}
/* get the number of items in the data array */ /* get the number of items in the data array */
int XTensor::GetSize() const int XTensor::GetSize() const
{ {
...@@ -560,25 +599,24 @@ set the tensor items by a uniform distribution in range [lower, upper] ...@@ -560,25 +599,24 @@ set the tensor items by a uniform distribution in range [lower, upper]
void XTensor::SetDataRand(DTYPE lower, DTYPE upper) void XTensor::SetDataRand(DTYPE lower, DTYPE upper)
{ {
// TODO: cuda code!!!!!!! // TODO: cuda code!!!!!!!
// TODO: replace float with DTYPE
if (data == NULL) if (data == NULL)
return; return;
// srand((unsigned)time(0)); // srand((unsigned)time(0));
DTYPE variance = upper - lower;
void * d = NULL; void * d = NULL;
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
d = new float[unitNum]; d = new float[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
DTYPE value = lower + (upper - lower) * (float)rand() / RAND_MAX; DTYPE value = lower + variance * (float)rand() / RAND_MAX;
*((float*)d + i) = value; *((float*)d + i) = value;
} }
} }
else if (dataType == X_DOUBLE) { else if (dataType == X_DOUBLE) {
d = new double[unitNum]; d = new double[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
*((double*)d + i) = lower + (upper - lower) * rand() / RAND_MAX; *((double*)d + i) = lower + variance * rand() / RAND_MAX;
} }
} }
else { else {
...@@ -588,15 +626,15 @@ void XTensor::SetDataRand(DTYPE lower, DTYPE upper) ...@@ -588,15 +626,15 @@ void XTensor::SetDataRand(DTYPE lower, DTYPE upper)
SetData(d, unitNum); SetData(d, unitNum);
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
delete[](float*)d; delete[] (float*)d;
} }
else { else {
delete[](double*)d; delete[] (double*)d;
} }
} }
/* a gauss distribution */ /* a gauss distribution (Box-Muller method) */
double GaussRand() double GaussRand(DTYPE mean, DTYPE standardDeviation)
{ {
// TODO: cuda code!!!!!!! // TODO: cuda code!!!!!!!
...@@ -606,8 +644,8 @@ double GaussRand() ...@@ -606,8 +644,8 @@ double GaussRand()
double pi = 3.141592654; double pi = 3.141592654;
if (phase == 0){ if (phase == 0){
u = rand() / (RAND_MAX + 1.0); u = (rand() + 1.0) / (RAND_MAX + 1.0);
v = rand() / (RAND_MAX + 1.0); v = (rand() + 1.0) / (RAND_MAX + 1.0);
z = sqrt(-2.0 * log(u))* sin(2.0 * pi * v); z = sqrt(-2.0 * log(u))* sin(2.0 * pi * v);
} }
else{ else{
...@@ -615,7 +653,7 @@ double GaussRand() ...@@ -615,7 +653,7 @@ double GaussRand()
} }
phase = 1 - phase; phase = 1 - phase;
return z; return mean + (z * standardDeviation);
} }
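/* Math note: the Box-Muller transform above maps independent uniforms
   u, v in (0, 1] to z = sqrt(-2 ln u) * sin(2 pi v), which is N(0, 1);
   mean + standardDeviation * z is then N(mean, standardDeviation^2). */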
/* /*
...@@ -626,7 +664,6 @@ set the tensor items by a normal distribution ...@@ -626,7 +664,6 @@ set the tensor items by a normal distribution
void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation) void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
{ {
// TODO: cuda code!!!!!!! // TODO: cuda code!!!!!!!
// TODO: replace float with DTYPE
if (data == NULL) if (data == NULL)
return; return;
...@@ -636,13 +673,13 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation) ...@@ -636,13 +673,13 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
d = new float[unitNum]; d = new float[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
*((float*)d + i) = (float)GaussRand(); *((float*)d + i) = (float)GaussRand(mean, standardDeviation);
} }
} }
else if (dataType == X_DOUBLE) { else if (dataType == X_DOUBLE) {
d = new double[unitNum]; d = new double[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
*((double*)d + i) = GaussRand(); *((double*)d + i) = GaussRand(mean, standardDeviation);
} }
} }
else { else {
...@@ -652,10 +689,10 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation) ...@@ -652,10 +689,10 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
SetData(d, unitNum); SetData(d, unitNum);
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
delete[](float*)d; delete[] (float*)d;
} }
else { else {
delete[](double*)d; delete[] (double*)d;
} }
} }
...@@ -1003,11 +1040,11 @@ set the value of a cell in a 3d tensor in default type ...@@ -1003,11 +1040,11 @@ set the value of a cell in a 3d tensor in default type
*/ */
bool XTensor::Set3D(DTYPE value, int d0, int d1, int d2) bool XTensor::Set3D(DTYPE value, int d0, int d1, int d2)
{ {
CheckNTErrors((order == 3), "Cannot get a 2d cell for a tensor whose order is not 2!"); CheckNTErrors(order == 3, "Cannot set a 3d cell for a tensor whose order is not 3!");
CheckNTErrors((d0 >= 0 && d1 < dimSize[0]), "dimension 0 is out of range!"); CheckNTErrors(d0 >= 0 && d0 < dimSize[0], "dimension 0 is out of range!");
CheckNTErrors((d2 >= 0 && d2 < dimSize[1]), "dimension 1 is out of range!"); CheckNTErrors(d1 >= 0 && d1 < dimSize[1], "dimension 1 is out of range!");
CheckNTErrors((d2 >= 0 && d2 < dimSize[2]), "dimension 1 is out of range!"); CheckNTErrors(d2 >= 0 && d2 < dimSize[2], "dimension 2 is out of range!");
CheckNTErrors((dataType == DEFAULT_DTYPE), "The tensor is not in default type."); CheckNTErrors(dataType == DEFAULT_DTYPE, "The tensor is not in default type.");
int dims[3] = {d0, d1, d2}; int dims[3] = {d0, d1, d2};
...@@ -1439,6 +1476,21 @@ void XTensor::Dump(FILE * file, const char * label, const int n, const int verbo ...@@ -1439,6 +1476,21 @@ void XTensor::Dump(FILE * file, const char * label, const int n, const int verbo
} }
/* /*
dump data to a file
>> tensor - tensor whose data is dumped
>> file - where to dump the data
>> label - label of the tensor
>> n - number of items to dump
>> verbose - verbose level
*/
void XTensor::Dump(const XTensor * tensor, FILE * file, const char * label, const int n, const int verbose)
{
XTensor a(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, tensor->devID, tensor->mem);
_CopyValues(tensor, &a);
a.Dump(file, label, n, verbose);
}
/*
read data from a file read data from a file
>> file - where to load the data >> file - where to load the data
>> label - label of the tensor >> label - label of the tensor
...@@ -1687,13 +1739,13 @@ void InitTensor(XTensor * tensor, ...@@ -1687,13 +1739,13 @@ void InitTensor(XTensor * tensor,
dims[0] = -abs(dims[0]); dims[0] = -abs(dims[0]);
tensor->Resize(myOrder, dims, myDataType, myDenseRatio); if (myDevID == CURRENT_GPU)
if(myDevID == CURRENT_GPU)
tensor->devID = XDevice::GetGPUDevice(); tensor->devID = XDevice::GetGPUDevice();
else else
tensor->devID = myDevID; tensor->devID = myDevID;
tensor->Resize(myOrder, dims, myDataType, myDenseRatio);
if(allocated) if(allocated)
XTensor::AllocateData(tensor); XTensor::AllocateData(tensor);
} }
...@@ -1870,28 +1922,47 @@ generate a XTensor which allocates data on the buffer ...@@ -1870,28 +1922,47 @@ generate a XTensor which allocates data on the buffer
>> myDimSize - the size of each dimension >> myDimSize - the size of each dimension
>> myMem - memory pool used to allocate the data array. >> myMem - memory pool used to allocate the data array.
we actually allocate the data on the buffer associated with we actually allocate the data on the buffer associated with
the memory pool. the memory pool
>> devID - device id
>> myDataType - unit size (e.g., int, float, and double) >> myDataType - unit size (e.g., int, float, and double)
>> myDenseRatio - how often an element has non-zero value >> myDenseRatio - how often an element has non-zero value
*/ */
XTensor * NewTensorBuf(const int myOrder, const int * myDimSize, XMem * myMem, XTensor * NewTensorBuf(const int myOrder, const int * myDimSize,
const TENSOR_DATA_TYPE myDataType, const float myDenseRatio) const TENSOR_DATA_TYPE myDataType, const float myDenseRatio,
const int devID, XMem * myMem)
{ {
CheckNTErrors(myMem != NULL, "No memory pool specified!");
int dims[MAX_TENSOR_DIM_NUM]; int dims[MAX_TENSOR_DIM_NUM];
memcpy(dims, myDimSize, sizeof(int) * myOrder); memcpy(dims, myDimSize, sizeof(int) * myOrder);
dims[0] = -abs(dims[0]); dims[0] = -abs(dims[0]);
XTensor * tensor = NewTensor(myOrder, dims, myDataType, myDenseRatio, -1, myMem); XTensor * tensor = NewTensor(myOrder, dims, myDataType, myDenseRatio, devID, myMem);
tensor->data = myMem->AllocBuf(myMem->devID, tensor->unitNum * tensor->unitSize);
if(myMem != NULL)
tensor->data = myMem->AllocBuf(myMem->devID, tensor->unitNum * tensor->unitSize);
else
tensor->data = XMemAlloc(devID, tensor->unitNum * tensor->unitSize);
return tensor; return tensor;
} }
/* /*
generate a XTensor which allocates data on the buffer
>> reference - reference tensor
>> devID - device id
>> myMem - memory pool used to allocate the data array.
we actually allocate the data on the buffer associated with
the memory pool
*/
XTensor * NewTensorBuf(const XTensor * reference, int devID, XMem * myMem)
{
return NewTensorBuf(reference->order, reference->dimSize,
reference->dataType, reference->denseRatio,
devID, myMem);
}
/*
generate a dense vector generate a dense vector
>> num - number of entries >> num - number of entries
>> myDataType - unit size (e.g., int, float, and double) >> myDataType - unit size (e.g., int, float, and double)
...@@ -2041,7 +2112,7 @@ XTensor * NewTensor(XTensor * a, bool isFilledData) ...@@ -2041,7 +2112,7 @@ XTensor * NewTensor(XTensor * a, bool isFilledData)
free the data space of a given tensor free the data space of a given tensor
>> tensor - pointer to the tensor >> tensor - pointer to the tensor
*/ */
void DelTensor(const XTensor * tensor) void DelTensor(XTensor * tensor)
{ {
delete tensor; delete tensor;
} }
...@@ -2050,10 +2121,13 @@ void DelTensor(const XTensor * tensor) ...@@ -2050,10 +2121,13 @@ void DelTensor(const XTensor * tensor)
free the data space of a given tensor (on the buffer) free the data space of a given tensor (on the buffer)
>> tensor - pointer to the tensor >> tensor - pointer to the tensor
*/ */
void DelTensorBuf(const XTensor * tensor) void DelTensorBuf(XTensor * tensor)
{ {
CheckNTErrors(tensor->mem != NULL, "No memory pool found!"); if(tensor->mem != NULL)
tensor->mem->ReleaseBuf(tensor->devID, tensor->unitNum * tensor->unitSize); tensor->mem->ReleaseBuf(tensor->devID, tensor->unitNum * tensor->unitSize);
else
XMemFree(tensor->devID, tensor->data);
tensor->data = NULL;
delete tensor; delete tensor;
} }
......
...@@ -45,12 +45,13 @@ namespace nts{ ...@@ -45,12 +45,13 @@ namespace nts{
struct XLink; struct XLink;
/* define the maximum number of dimensions in a tensor */ /* define the maximum number of dimensions in a tensor */
#define MAX_TENSOR_DIM_NUM 6 #define MAX_TENSOR_DIM_NUM 8
#define USE_BATCHED_STRIDED_MAT_MUL #define USE_BATCHED_STRIDED_MAT_MUL
#define MIN_TENSOR_SPLIT_NUM 10 #define MIN_TENSOR_SPLIT_NUM 0
#define MIN_TENSOR_SPLIT_LIST_NUM 1024 #define MIN_TENSOR_SPLIT_LIST_NUM 1024
#define MIN_TENSOR_CAT_NUM 8 #define MIN_TENSOR_CAT_NUM 8
/* computation flags */ /* computation flags */
#define UNSAFE_BUT_FAST_MEM #define UNSAFE_BUT_FAST_MEM
#define FAST_MATRIX #define FAST_MATRIX
...@@ -202,6 +203,12 @@ public: ...@@ -202,6 +203,12 @@ public:
/* overloading of the multiply-sign */ /* overloading of the multiply-sign */
XTensor operator* (const XTensor &tensor); XTensor operator* (const XTensor &tensor);
/* overloading of the minus-sign */
XTensor operator- (const XTensor &tensor);
/* overloading of the division-sign */
XTensor operator/ (const XTensor &tensor);
/* linear transformation */ /* linear transformation */
XTensor Lin(DTYPE scale, DTYPE shift = 0); XTensor Lin(DTYPE scale, DTYPE shift = 0);
...@@ -222,6 +229,12 @@ public: ...@@ -222,6 +229,12 @@ public:
/* reshape the tensor */ /* reshape the tensor */
void Reshape(const int order, const int * myDimSize); void Reshape(const int order, const int * myDimSize);
/* reshape the tensor to a vector */
void Reshape(const int num);
/* reshape the tensor to a matrix */
void Reshape(const int rowNum, const int colNum);
/* get the number of items in the data array */ /* get the number of items in the data array */
int GetSize() const; int GetSize() const;
...@@ -328,6 +341,10 @@ public: ...@@ -328,6 +341,10 @@ public:
/* dump data to a file */ /* dump data to a file */
void Dump(FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0); void Dump(FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0);
/* dump data to a file */
static
void Dump(const XTensor * tensor, FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0);
/* read data from a file */ /* read data from a file */
void Read(FILE * file, const char * label = NULL); void Read(FILE * file, const char * label = NULL);
...@@ -386,8 +403,12 @@ XTensor * NewTensor(const int myOrder, const int * myDimSize, const TENSOR_DATA_ ...@@ -386,8 +403,12 @@ XTensor * NewTensor(const int myOrder, const int * myDimSize, const TENSOR_DATA_
const float myDenseRatio = 1.0F, const int myDevID = -1, XMem * myMem = NULL); const float myDenseRatio = 1.0F, const int myDevID = -1, XMem * myMem = NULL);
/* generate a XTensor which allocates data on the buffer */ /* generate a XTensor which allocates data on the buffer */
XTensor * NewTensorBuf(const int myOrder, const int * myDimSize, XMem * myMem, XTensor * NewTensorBuf(const int myOrder, const int * myDimSize,
const TENSOR_DATA_TYPE myDataType = X_FLOAT, const float myDenseRatio = 1.0F); const TENSOR_DATA_TYPE myDataType = X_FLOAT, const float myDenseRatio = 1.0F,
const int myDevID = -1, XMem * myMem = NULL);
/* generate a XTensor which allocates data on the buffer */
XTensor * NewTensorBuf(const XTensor * reference, int devID, XMem * myMem);
/* generate a dense vector */ /* generate a dense vector */
XTensor * NewTensor1D(const int num, const TENSOR_DATA_TYPE myDataType = X_FLOAT, const int myDevID = -1, XTensor * NewTensor1D(const int num, const TENSOR_DATA_TYPE myDataType = X_FLOAT, const int myDevID = -1,
...@@ -417,10 +438,10 @@ XTensor * NewTensor5D(const int d0, const int d1, const int d2, const int d3, co ...@@ -417,10 +438,10 @@ XTensor * NewTensor5D(const int d0, const int d1, const int d2, const int d3, co
XTensor * NewTensor(XTensor * a, bool isFilledData = true); XTensor * NewTensor(XTensor * a, bool isFilledData = true);
/* free the data space of a given tensor */ /* free the data space of a given tensor */
void DelTensor(const XTensor * tensor); void DelTensor(XTensor * tensor);
/* free the data space of a given tensor (on the buffer) */ /* free the data space of a given tensor (on the buffer) */
void DelTensorBuf(const XTensor * tensor); void DelTensorBuf(XTensor * tensor);
} /* end of the nts (NiuTrans.Tensor) namespace */ } /* end of the nts (NiuTrans.Tensor) namespace */
......
...@@ -175,29 +175,38 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size) ...@@ -175,29 +175,38 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size)
return; return;
} }
#ifdef USE_CUDA #ifdef USE_CUDA
else if(devIDT >= 0 && devIDS < 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyHostToDevice);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyHostToDevice)");
}
}
else if(devIDT < 0 && devIDS >= 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToHost);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else{ else{
//if(devIDT == devIDS){ int devID = devIDT < 0 ? devIDS : devIDT;
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToDevice); int devIDBackup = 0;
cudaGetDevice(&devIDBackup);
cudaSetDevice(devID);
if(devIDT >= 0 && devIDS < 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyHostToDevice);
if(error != cudaSuccess){ if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)"); ShowNTErrors("cudaMemcpy error (cudaMemcpyHostToDevice)");
} }
/*} }
else if(devIDT < 0 && devIDS >= 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToHost);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else{ else{
CheckNTErrors((cudaMemcpyPeer(t, devIDT, s, devIDS, size) == cudaSuccess), //if(devIDT == devIDS){
"cudaMemcpy error (cudaMemcpyDeviceToDevice)"); cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToDevice);
}*/ if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}
/*}
else{
CheckNTErrors((cudaMemcpyPeer(t, devIDT, s, devIDS, size) == cudaSuccess),
"cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}*/
}
cudaSetDevice(devIDBackup);
} }
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
...@@ -208,6 +217,9 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size) ...@@ -208,6 +217,9 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size)
#ifdef USE_CUDA #ifdef USE_CUDA
void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size, cudaStream_t stream, int streamDevID) void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size, cudaStream_t stream, int streamDevID)
{ {
if(t == s)
return;
int devIDBackup = -1; int devIDBackup = -1;
if(streamDevID >= 0 && (devIDT >= 0 || devIDS >= 0)){ if(streamDevID >= 0 && (devIDT >= 0 || devIDS >= 0)){
CheckNTErrors((cudaGetDevice(&devIDBackup) == cudaSuccess), "Cannot get GPU device id!"); CheckNTErrors((cudaGetDevice(&devIDBackup) == cudaSuccess), "Cannot get GPU device id!");
...@@ -220,17 +232,23 @@ void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size ...@@ -220,17 +232,23 @@ void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size
return; return;
} }
else if(devIDT >= 0 && devIDS < 0){ else if(devIDT >= 0 && devIDS < 0){
CheckNTErrors((cudaMemcpyAsync(t, s, size, cudaMemcpyHostToDevice, stream) == cudaSuccess), cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyHostToDevice, stream);
"cudaMemcpyAsync error (cudaMemcpyHostToDevice)"); if(error != cudaSuccess){
ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyHostToDevice)");
}
} }
else if(devIDT < 0 && devIDS >= 0){ else if(devIDT < 0 && devIDS >= 0){
CheckNTErrors((cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToHost, stream) == cudaSuccess), cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToHost, stream);
"cudaMemcpyAsync error (cudaMemcpyDeviceToHost)"); if(error != cudaSuccess){
ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyDeviceToHost)");
}
} }
else{ else{
//if(devIDT == devIDS){ //if(devIDT == devIDS){
CheckNTErrors((cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToDevice, stream) == cudaSuccess), cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToDevice, stream);
"cudaMemcpyAsync error (cudaMemcpyDeviceToDevice)"); if(error != cudaSuccess){
ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyDeviceToDevice)");
}
//} //}
/*else{ /*else{
CheckNTErrors((cudaMemcpyPeerAsync(t, devIDT, s, devIDS, size, stream) == cudaSuccess), CheckNTErrors((cudaMemcpyPeerAsync(t, devIDT, s, devIDS, size, stream) == cudaSuccess),
...@@ -261,18 +279,69 @@ void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPit ...@@ -261,18 +279,69 @@ void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPit
return; return;
} }
#ifdef USE_CUDA #ifdef USE_CUDA
else if (devIDT >= 0 && devIDS < 0) { else{
CheckNTErrors((cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice) == cudaSuccess), int devID = devIDT < 0 ? devIDS : devIDT;
"cudaMemcpy2D error (cudaMemcpyHostToDevice)"); int devIDBackup = 0;
cudaGetDevice(&devIDBackup);
cudaSetDevice(devID);
if (devIDT >= 0 && devIDS < 0) {
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy2D error (cudaMemcpyHostToDevice)");
}
}
else if (devIDT < 0 && devIDS >= 0) {
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else {
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice);
if (error != cudaSuccess) {
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}
}
cudaSetDevice(devIDBackup);
} }
else if (devIDT < 0 && devIDS >= 0) { #else
CheckNTErrors((cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost) == cudaSuccess), ShowNTErrors("Please specify USE_CUDA and recompile the code!");
"cudaMemcpy error (cudaMemcpyDeviceToHost)"); #endif
}
void XMemCopy2DAsync(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n, XStream * stream)
{
if (t == s)
return;
if (devIDT < 0 && devIDS < 0) {
for(int i = 0; i < n; i++)
memcpy((char*)t + tPitch * i, (char*)s + sPitch * i, mSize);
return;
} }
else { #ifdef USE_CUDA
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice); else{
if (error != cudaSuccess) { CheckNTErrors(stream != NULL, "No stream found!");
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)"); cudaStream_t &cstream = stream->stream;
if (devIDT >= 0 && devIDS < 0) {
cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice, cstream);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy2D error (cudaMemcpyHostToDevice)");
}
}
else if (devIDT < 0 && devIDS >= 0) {
cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost, cstream);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else {
cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice, cstream);
if (error != cudaSuccess) {
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}
} }
} }
#else #else
......
...@@ -23,6 +23,7 @@ ...@@ -23,6 +23,7 @@
#include <stdio.h> #include <stdio.h>
#include "XGlobal.h" #include "XGlobal.h"
#include "XDevice.h"
#ifndef __XUTILITY_H__ #ifndef __XUTILITY_H__
#define __XUTILITY_H__ #define __XUTILITY_H__
...@@ -41,6 +42,7 @@ extern void XMemSet(void * p, int value, size_t size); ...@@ -41,6 +42,7 @@ extern void XMemSet(void * p, int value, size_t size);
extern void XMemSet(int devID, void * p, int value, size_t size); extern void XMemSet(int devID, void * p, int value, size_t size);
extern void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size); extern void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size);
extern void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n); extern void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n);
extern void XMemCopy2DAsync(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n, XStream * stream);
extern void * XMemAlloc(int devID, size_t size); extern void * XMemAlloc(int devID, size_t size);
extern void * XMemAllocOnDev(int devID, size_t size); extern void * XMemAllocOnDev(int devID, size_t size);
extern void XMemFree(int devID, void * p); extern void XMemFree(int devID, void * p);
......
...@@ -26,49 +26,63 @@ ...@@ -26,49 +26,63 @@
#include "../XTensor.h" #include "../XTensor.h"
#include "shape/Concatenate.h" #include "arithmetic/Div.h"
#include "shape/ConcatenateSolely.h"
#include "movement/CopyBlocks.h"
#include "movement/CopyBlocksInGrid.h"
#include "movement/CopyBlocksOnSite.h"
#include "movement/CopyData2D.h"
#include "movement/CopyIndexed.h"
#include "movement/CopyInGrid.h"
#include "movement/CopyValues.h"
#include "utilities/FlushToMem.h"
#include "shape/MakeMergeBlockIndex.h"
#include "shape/MakeSplitBlockIndex.h"
#include "arithmetic/MatrixMul.h" #include "arithmetic/MatrixMul.h"
#include "arithmetic/MatrixMul2D.h" #include "arithmetic/MatrixMul2D.h"
#include "arithmetic/MatrixMul2DMultiTheading.h" #include "arithmetic/MatrixMul2DMultiTheading.h"
#include "arithmetic/MatrixMul2DParallel.h" #include "arithmetic/MatrixMul2DParallel.h"
#include "arithmetic/MatrixMulBatched.h" #include "arithmetic/MatrixMulBatched.h"
#include "arithmetic/MatrixMULBatchedCPU.h"
#include "shape/Merge.h"
#include "shape/MergeBlockLists.h"
#include "arithmetic/Multiply.h" #include "arithmetic/Multiply.h"
#include "arithmetic/Negate.h" #include "arithmetic/Negate.h"
#include "arithmetic/Sign.h"
#include "arithmetic/Sub.h"
#include "arithmetic/Sum.h"
#include "arithmetic/SumByColumnTV.h"
#include "arithmetic/SumByColumnVT.h"
#include "arithmetic/SumDim.h"
#include "arithmetic/XTensorBLAS.h"
#include "getandset/ConvertDataType.h"
#include "getandset/Select.h"
#include "getandset/SetData.h"
#include "math/Clip.h"
#include "math/Normalize.h" #include "math/Normalize.h"
#include "shape/Permute.h"
#include "math/Power.h" #include "math/Power.h"
#include "math/ScaleAndShift.h"
#include "math/Unary.h"
#include "movement/CopyBlocks.h"
#include "movement/CopyBlocksInGrid.h"
#include "movement/CopyBlocksOnSite.h"
#include "movement/CopyData2D.h"
#include "movement/CopyIndexed.h"
#include "movement/CopyInGrid.h"
#include "movement/CopyValues.h"
#include "reduce/ReduceMax.h" #include "reduce/ReduceMax.h"
#include "reduce/ReduceMean.h" #include "reduce/ReduceMean.h"
#include "reduce/ReduceStandardVariance.h" #include "reduce/ReduceStandardVariance.h"
#include "reduce/ReduceSum.h" #include "reduce/ReduceSum.h"
#include "reduce/ReduceSumSquared.h" #include "reduce/ReduceSumSquared.h"
#include "reduce/ReduceVariance.h" #include "reduce/ReduceVariance.h"
#include "math/ScaleAndShift.h"
#include "getandset/Select.h" #include "shape/Concatenate.h"
#include "getandset/SetData.h" #include "shape/ConcatenateSolely.h"
#include "sort/Sort.h" #include "shape/MakeMergeBlockIndex.h"
#include "shape/MakeSplitBlockIndex.h"
#include "shape/Merge.h"
#include "shape/MergeBlockLists.h"
#include "shape/Permute.h"
#include "shape/Split.h" #include "shape/Split.h"
#include "arithmetic/Sum.h"
#include "arithmetic/SumByColumnTV.h"
#include "arithmetic/SumByColumnVT.h"
#include "sort/TopK.h"
#include "shape/Transpose.h" #include "shape/Transpose.h"
#include "shape/Unsqueeze.h" #include "shape/Unsqueeze.h"
#include "sort/Sort.h"
#include "sort/TopK.h"
#include "utilities/XMatrixSegment.h" #include "utilities/XMatrixSegment.h"
#include "arithmetic/XTensorBLAS.h" #include "utilities/FlushToMem.h"
#endif // __CHEADER_H__ #endif // __CHEADER_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "Div.h"
#include "Div.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the index of the item
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting
*/
void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
{
int leadingDimRDI = a->order - leadingDim - 1;
CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
"Unmatched tensors in multiplication!");
CheckNTErrors((a->order == b->order && a->order == c->order),
"Unmatched tensors!");
#ifdef USE_CUDA
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
_CudaDiv(a, b, c, alpha, leadingDim);
return;
}
#endif
int stride = 1;
int blockSizeA = 1;
int blockSizeB = 1;
int blockSizeC = 1;
int blockNum = 1;
int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
for (int i = 0; i < a->order; i++) {
if (i != leadingDimRDI) {
CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] && a->dimSizeRDI[i] == c->dimSizeRDI[i]),
"Unmatched tensors!");
}
if (i < leadingDimRDI)
stride *= a->dimSizeRDI[i];
}
blockSizeA = stride * dimensionSizeA;
blockSizeB = stride * dimensionSizeB;
blockSizeC = stride * dimensionSizeC;
blockNum = a->unitNum / blockSizeA;
if (!a->isSparse && !b->isSparse) {
if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
int size = a->unitNum;
DTYPE * ap = (DTYPE*)a->data;
DTYPE * bp = (DTYPE*)b->data;
DTYPE * cp = (DTYPE*)c->data;
if (alpha == 0) {
for (int i = 0; i < size; i++)
cp[i] = ap[i] / bp[i];
}
else {
for (int i = 0; i < size; i++)
cp[i] = ap[i] / bp[i] + alpha * cp[i];
}
}
else {
for (int k = 0; k < blockNum; k++) {
for (int ci = 0, ai = 0, bi = 0; ci < dimensionSizeC; ci++, ai++, bi++) {
if (ai >= dimensionSizeA)
ai = 0;
if (bi >= dimensionSizeB)
bi = 0;
DTYPE * ap = (DTYPE*)a->data + k * blockSizeA + ai * stride;
DTYPE * bp = (DTYPE*)b->data + k * blockSizeB + bi * stride;
DTYPE * cp = (DTYPE*)c->data + k * blockSizeC + ci * stride;
for (int j = 0; j < stride; j++)
cp[j] = ap[j] / bp[j] + cp[j] * alpha;
}
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
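/*
worked example of the block/stride addressing used above (illustration only;
it assumes nothing beyond the reversed-dimension dimSizeRDI convention of
this codebase):

    for a 3-order tensor of shape (2, 3, 4) with leadingDim = 1 we get
    leadingDimRDI = 3 - 1 - 1 = 1, and the loop above yields

        stride    = 4               // product of the dimensions below the leading one
        blockSize = 4 * 3 = 12      // stride * dimensionSize
        blockNum  = 24 / 12 = 2     // unitNum / blockSize

    so the item at (block k, slice i, offset j) lives at
    k * blockSize + i * stride + j, and the shorter of a and b is rewound
    (ai/bi reset to 0) so that it is broadcast along the leading dimension of c
*/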
/*
element-wise division of two tensors (do it on site)
keep the result in the input tensor a and return nothing
a(i) = a(i)/b(i) + \alpha * a(i)
where i is the index of the item
>> a - tensor a (where keep the result)
>> b - tensor b
>> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting
*/
void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha, int leadingDim)
{
_Div(a, b, a, alpha, leadingDim);
}
/*
element-wise division of two tensors (return a XTensor structure)
make a new tensor c to keep the result and return it
c(i) = a(i)/b(i)
where i is the index of the item
>> a - tensor a
>> b - tensor b
>> leadingDim - the dimension along which we perform broadcasting
<< return - the result of element-wise division of the tensors
*/
XTensor Div(const XTensor &a, const XTensor &b, int leadingDim)
{
CheckNTErrors(a.dimSize[leadingDim] == b.dimSize[leadingDim], "Unmatched leading dimensions in division!");
XTensor c(&a);
c.SetTMP();
/* call _Div function */
_Div(&a, &b, &c, 0, leadingDim);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_DIV);
XLink::AddParamToHeadInt(&c, leadingDim);
return c;
}
} // namespace nts(NiuTrans.Tensor)
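/*
usage sketch (illustration only; it assumes the NewTensor2D helper with the
(rowNum, colNum, dataType, devID, mem) signature used elsewhere in this
codebase, a CPU tensor with devID = -1 and no memory pool):

    XTensor * a = NewTensor2D(2, 3, DEFAULT_DTYPE, -1, NULL);
    XTensor * b = NewTensor2D(2, 3, DEFAULT_DTYPE, -1, NULL);
    for (int i = 0; i < 6; i++) {
        ((DTYPE*)a->data)[i] = (DTYPE)(i + 1);    // a = 1, 2, ..., 6
        ((DTYPE*)b->data)[i] = (DTYPE)2.0;        // b = 2 everywhere
    }
    XTensor c = Div(*a, *b);    // c(i) = a(i) / b(i), i.e., 0.5, 1.0, 1.5, ...
    delete a;
    delete b;
*/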
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "Div.h"
#include "Div.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
division of data arrays in an element-wise manner c(i) = a(i)/b(i)
>> a - data array a
>> b - data array b
>> c - result data array
>> size - size of c
*/
__global__
void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] / b[i];
}
/*
division of data arrays in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i)
>> a - data array a
>> b - data array b
>> c - result data array
>> size - size of c
>> alpha - the coefficient
*/
__global__
void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] / b[i] + alpha * c[i];
}
/*
division of two tensors in an element-wise manner c(i) = a(i)/b(i).
Note that a and b can be of different sizes here, i.e.,
|a_lead| <= |c_lead| and |b_lead| <= |c_lead|
where |a_lead| means the size of the leading dimension of a
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> stride - the number of items we go over when moving to the next item along the leading dimension in a block
>> ldSizeA - size of the leading dimension of a
>> ldSizeB - size of the leading dimension of b
>> ldSizeC - size of the leading dimension of c
>> blockNum - number of blocks
*/
template<int nonZeroAlpha> __global__
void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha,
int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum)
{
__shared__ DTYPE* ap[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE* bp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE* cp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= blockNum * stride || j >= ldSizeC)
return;
if (threadIdx.y == 0) {
int block = i / stride;
int size = block * stride;
ap[threadIdx.x] = a + size * ldSizeA;
bp[threadIdx.x] = b + size * ldSizeB;
cp[threadIdx.x] = c + size * ldSizeC;
}
__syncthreads();
int aj = j >= ldSizeA ? j % ldSizeA : j;
int bj = j >= ldSizeB ? j % ldSizeB : j;
int offseti = i % stride;
if (nonZeroAlpha == 0)
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti];
else
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti]
+ alpha * cp[threadIdx.x][j * ldSizeC + offseti];
}
/*
element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the item index
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> leadingDim - dimension along which we perform broadcasting
*/
void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
{
int leadingDimRDI = a->order - leadingDim - 1;
CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
"Unmatched tensors in multiplication!");
CheckNTErrors((a->order == b->order && a->order == c->order), "Unmatched tensors!");
int stride = 1;
int blockSizeA = 1;
int blockNum = 1;
int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
for (int i = 0; i < a->order; i++) {
if (i != leadingDimRDI) {
CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] &&
a->dimSizeRDI[i] == c->dimSizeRDI[i]),
"Unmatched tensors!");
}
if (i < leadingDimRDI)
stride *= a->dimSizeRDI[i];
}
blockSizeA = stride * dimensionSizeA;
blockNum = a->unitNum / blockSizeA;
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (!a->isSparse && !b->isSparse) {
if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
int cudaGridSize[3];
int cudaBlockSize[3];
if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
GDevs.GetCudaThread(a->devID, c->unitNum, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[0]), threads(cudaBlockSize[0]);
if (alpha == 0)
KernelDivElementWise << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum);
else
KernelDivElementWiseV2 << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum, alpha);
}
else {
GDevs.GetCudaThread2D(c->devID, stride * blockNum, dimensionSizeC, MAX_INT, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[0], cudaGridSize[1]), threads(cudaBlockSize[0], cudaBlockSize[1]);
if (alpha == 0) {
KernelDivElementWiseTensorDynamic<0> << <blocks, threads >> >
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, 0,
stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
}
else {
KernelDivElementWiseTensorDynamic<1> << <blocks, threads >> >
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, alpha,
stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
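/*
launch-shape note (illustration only): in the broadcasting branch above the
grid is two-dimensional: x enumerates the stride * blockNum item columns and
y enumerates the dimensionSizeC slices of the leading dimension. For c of
shape (2, 3, 4) with leadingDim = 1, GetCudaThread2D is asked for an 8 x 3
arrangement (stride = 4, blockNum = 2, dimensionSizeC = 3), so each thread
produces exactly one item of c.
*/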
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __DIV_CUH__
#define __DIV_CUH__
#include "Div.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) */
__global__
void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size);
/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
__global__
void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha);
/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
template<int nonZeroAlpha>__global__
void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha, int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum);
/* element-wise division of two tensors */
void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __DIV_CUH__
...@@ -16,31 +16,39 @@ ...@@ -16,31 +16,39 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/ */
#ifndef __LOG_H__ #ifndef __DIV_H__
#define __LOG_H__ #define __DIV_H__
#include "../../XTensor.h" #include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* set every entry to its log value */ /*
void _Log(const XTensor * a, XTensor * b); element-wise division of two tensors:
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the index of the element
*/
void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);
/* /*
set every entry to its log value (do it on site) element-wise division of two tensors (do it on site)
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
a(i) = a(i)/b(i) + \alpha * a(i)
where i is the index of the element
*/ */
void _LogMe(XTensor * a); void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0);
/* /*
set every entry to its log value (return a XTensor structure) element-wise division of two tensors (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
c(i) = a(i)/b(i)
where i is the index of the element
*/ */
XTensor Log(const XTensor & a); XTensor Div(const XTensor &a, const XTensor &b, int leadingDim = 0);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __LOG_H__ #endif // __DIV_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "MatrixMULBatchedCPU.h"
#include "MatrixMul2D.h"
#include "XTensorBLAS.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
matrix multiplication in batch mode (BLAS)
c_i = trans(a_i) * trans(b_i) * \alpha + c_i * \beta for each i in [0,count-1]
>> a - list of input matrices (2d tensors)
>> transposedA - indicates whether the matrices in a are transposed
>> b - another list of input matrices (2d tensors)
>> transposedB - indicates whether the matrices in b are transposed
>> c - list of output matrices (2d tensors)
>> alpha - scalar
>> beta - scalar
*/
void _MatrixMULBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA,
const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty input lists!");
CheckNTErrors(a->count == b->count && a->count == c->count, "Input lists must be of the same size!");
if (a->count == 0)
return;
bool isUniform = true;
for (int i = 1; i < a->count; i++) {
XTensor * aim = (XTensor*)a->GetItem(i - 1);
XTensor * bim = (XTensor*)b->GetItem(i - 1);
XTensor * cim = (XTensor*)c->GetItem(i - 1);
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
if (!XTensor::IsSameShaped(aim, ai) ||
!XTensor::IsSameShaped(bim, bi) ||
!XTensor::IsSameShaped(cim, ci))
{
isUniform = false;
break;
}
}
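/* note: isUniform is computed above but never read below; the fast path for
   uniformly-shaped inputs appears to have been removed, leaving the flag
   dead for now */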
for (int i = 0; i < a->count; i++) {
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
CheckNTErrors((ai->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((bi->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((ci->order == 2), "2d tensor (i.e., matrix) is required!");
#ifdef USE_BLAS
if (useBLAS)
_MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif
}
}
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
...@@ -24,8 +24,8 @@ ...@@ -24,8 +24,8 @@
#include "../../XName.h" #include "../../XName.h"
#include "MatrixMul.h" #include "MatrixMul.h"
#include "MatrixMul2D.h" #include "MatrixMul2D.h"
#include "MatrixMULBatchedCPU.h"
#include "XTensorBLAS.h" #include "XTensorBLAS.h"
#include "MatrixMulBatched.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -53,11 +53,29 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -53,11 +53,29 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
const XTensor * b, MATRIX_TRANS_TYPE transposedB, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha, DTYPE beta, XPRunner * parallelRunner) XTensor * c, DTYPE alpha, DTYPE beta, XPRunner * parallelRunner)
{ {
CheckNTErrors((a && b && c), "Empty input tensors!"); CheckNTErrors(a && b && c, "Empty input tensors!");
CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType), CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Input tensors should have the same data type!"); "Input tensors should have the same data type!");
CheckNTErrors((a->order >= 2 && b->order >= 2 && c->order >= 2), CheckNTErrors(a->order >= 2 && b->order >= 2 && c->order >= 2,
"Input tensors must have a order >= 2!"); "Input tensors must have a order >= 2!");
CheckNTErrors(c->order == a->order + b->order - 2, "Wrong tensor order!");
/* we transform a higher order tensor to a matrix to reduce the number
of calls to matrix multiplication */
if(transposedA == X_NOTRANS && a->order > 2 && b->order == 2){
int ncolA = a->dimSize[a->order - 1];
int ncolC = c->dimSize[c->order - 1];
XTensor * a2 = NewTensor2D(a->unitNum/ncolA, -ncolA, a->dataType, a->devID, a->mem);
XTensor * c2 = NewTensor2D(c->unitNum/ncolC, -ncolC, c->dataType, c->devID, c->mem);
a2->data = a->data;
c2->data = c->data;
_MatrixMul2D(a2, transposedA, b, transposedB, c2, alpha, beta, parallelRunner);
a2->data = NULL;
c2->data = NULL;
delete a2;
delete c2;
return;
}
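/* e.g., a of shape (2, 3, 4) times b of shape (4, 5): a is viewed as a
   6 x 4 matrix and c as a 6 x 5 matrix, so a single 2D multiplication
   replaces six small ones */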
int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1]; int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0]; int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
...@@ -144,10 +162,10 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -144,10 +162,10 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID); cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID);
_CudaBLASMatrixMULList(handle, _CudaBLASMatrixMULList(handle,
aList, transposedA, aList, transposedA,
bList, transposedB, bList, transposedB,
cList, aList->count, cList, aList->count,
alpha, beta); alpha, beta);
BacktoCudaDev(a->devID, devIDBackup); BacktoCudaDev(a->devID, devIDBackup);
#else #else
...@@ -156,9 +174,9 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -156,9 +174,9 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
} }
else { else {
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
_MatrixMULBatchedCPU(aList, transposedA, _MatrixMulBatchedCPU(aList, transposedA,
bList, transposedB, bList, transposedB,
cList, alpha, beta); cList, alpha, beta);
} }
for (int i = 0; i < aList->count; i++) { for (int i = 0; i < aList->count; i++) {
...@@ -251,9 +269,7 @@ XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA, ...@@ -251,9 +269,7 @@ XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA,
/* /*
matrix multiplication with no transposition c = a * b * alpha matrix multiplication with no transposition c = a * b * alpha
>> a - tensor a >> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b >> b - tensor b
>> transposedB - indicates whether teh matrices in b are transposed
>> alpha - a coefficient >> alpha - a coefficient
>> parallelRunner - parallel processing module >> parallelRunner - parallel processing module
<< return - the result of matrix multiplication << return - the result of matrix multiplication
......
...@@ -23,8 +23,8 @@ ...@@ -23,8 +23,8 @@
#include "../../XDevice.h" #include "../../XDevice.h"
#include "../../XName.h" #include "../../XName.h"
#include "MatrixMulBatched.h" #include "MatrixMulBatched.h"
#include "MatrixMULBatchedCPU.h"
#include "XTensorBLAS.h" #include "XTensorBLAS.h"
#include "MatrixMul2D.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -57,6 +57,43 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -57,6 +57,43 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
CheckNTErrors((a->order == b->order && a->order == c->order), CheckNTErrors((a->order == b->order && a->order == c->order),
"Input tensor and output tensor must have same order!"); "Input tensor and output tensor must have same order!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0)
_MatrixMulBatchedGPU(a, transposedA, b, transposedB, c, alpha, beta);
else
_MatrixMulBatchedCPU(a, transposedA, b, transposedB, c, alpha, beta);
}
/*
matrix multiplication of the two tensors
optimized for GPU
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = trans(ai) * trans(bi) * alpha + ci * beta
where trans() returns the transposed matrix if the flag is fired
>> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b
>> transposedB - indicates whether the matrices in b are transposed
>> c - where we keep a*b
>> alpha - a coefficient
>> beta - another coefficient
*/
void _MatrixMulBatchedGPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha, DTYPE beta)
{
#ifdef USE_CUDA
CheckNTErrors((a && b && c), "Empty input tensors!");
CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
"Input tensors should have the same data type!");
CheckNTErrors((a->order >= 2 && b->order >= 2 && c->order >= 2),
"Input tensors must have a order >= 2!");
CheckNTErrors((a->order == b->order && a->order == c->order),
"Input tensor and output tensor must have same order!");
CheckNTErrors(a->devID >= 0 && b->devID >= 0 && c->devID >= 0, "The tensors must be on GPUs");
int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1]; int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0]; int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1]; int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1];
...@@ -64,8 +101,7 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -64,8 +101,7 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
int cn = c->dimSizeRDI[1]; int cn = c->dimSizeRDI[1];
int cm = c->dimSizeRDI[0]; int cm = c->dimSizeRDI[0];
CheckNTErrors((am == bn && an == cn && bm == cm), CheckNTErrors((am == bn && an == cn && bm == cm), "Unmatched tensors in multiplication!");
"Unmatched tensors in multiplication!");
int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1]; int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1];
int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1]; int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1];
...@@ -81,76 +117,159 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -81,76 +117,159 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
blockNum *= a->dimSizeRDI[i]; blockNum *= a->dimSizeRDI[i];
} }
XList * aList = new XList(10); int devIDBackup = 0;
XList * bList = new XList(10); ProtectCudaDev(a->devID, devIDBackup);
XList * cList = new XList(10);
int aDimSize[2] = { -a->dimSizeRDI[1], a->dimSizeRDI[0] }; cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID);
int bDimSize[2] = { -b->dimSizeRDI[1], b->dimSizeRDI[0] }; _CudaBLASMatrixMULBatchedStrided(handle,
int cDimSize[2] = { -c->dimSizeRDI[1], c->dimSizeRDI[0] }; a->data, transposedA, a->dataType, aBlockSize,
b->data, transposedB, b->dataType, bBlockSize,
for (int p = 0; p < blockNum; p++) { c->data, c->dataType, cBlockSize, blockNum,
void * ap = (char*)a->data + aRealBlockSize * p; a->dimSizeRDI[1], a->dimSizeRDI[0],
void * bp = (char*)b->data + bRealBlockSize * p; b->dimSizeRDI[1], b->dimSizeRDI[0],
void * cp = (char*)c->data + cRealBlockSize * p; c->dimSizeRDI[1], c->dimSizeRDI[0], alpha, beta);
XTensor * ai = NewTensor(2, aDimSize, a->dataType, a->denseRatio, a->devID, a->mem);
XTensor * bi = NewTensor(2, bDimSize, b->dataType, b->denseRatio, b->devID, b->mem); BacktoCudaDev(a->devID, devIDBackup);
XTensor * ci = NewTensor(2, cDimSize, c->dataType, c->denseRatio, c->devID, c->mem); #endif
ai->data = ap; }
bi->data = bp;
ci->data = cp; /*
aList->Add(ai); matrix multiplication of the two tensors
bList->Add(bi); optimized for CPU
cList->Add(ci);
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = trans(ai) * trans(bi) * alpha + ci * beta
where trans() returns the transposed matrix if the flag is fired
>> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b
>> transposedB - indicates whether the matrices in b are transposed
>> c - where we keep a*b
>> alpha - a coefficient
>> beta - another coefficient
*/
void _MatrixMulBatchedCPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha, DTYPE beta)
{
CheckNTErrors((a && b && c), "Empty input tensors!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Input tensors should have the same data type!");
CheckNTErrors(a->order >= 2 && b->order >= 2 && c->order >= 2,
"Input tensors must have a order >= 2!");
CheckNTErrors(a->order == b->order && a->order == c->order,
"Input tensor and output tensor must have same order!");
int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1];
int bm = transposedB == X_TRANS ? b->dimSizeRDI[1] : b->dimSizeRDI[0];
int cn = c->dimSizeRDI[1];
int cm = c->dimSizeRDI[0];
CheckNTErrors(am == bn && an == cn && bm == cm, "Unmatched tensors in multiplication!");
int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1];
int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1];
int cBlockSize = c->dimSizeRDI[0] * c->dimSizeRDI[1];
int aRealBlockSize = aBlockSize * a->unitSize;
int bRealBlockSize = bBlockSize * b->unitSize;
int cRealBlockSize = cBlockSize * c->unitSize;
int blockNum = 1;
for (int i = 2; i < a->order; i++) {
CheckNTErrors((a->dimSizeRDI[i] == c->dimSizeRDI[i]), "Incorrect tensor sizes!");
CheckNTErrors((b->dimSizeRDI[i] == c->dimSizeRDI[i]), "Incorrect tensor sizes!");
blockNum *= a->dimSizeRDI[i];
} }
if (a->devID >= 0 && b->devID >= 0 && c->devID >= 0) { int aDimSize[2] = {-a->dimSizeRDI[1], a->dimSizeRDI[0]};
#ifdef USE_CUDA int bDimSize[2] = {-b->dimSizeRDI[1], b->dimSizeRDI[0]};
CheckNTErrors((a->devID == b->devID && a->devID == c->devID), int cDimSize[2] = {-c->dimSizeRDI[1], c->dimSizeRDI[0]};
"The code must be run on the same GPU!");
XTensor * ai = NewTensor2D(aDimSize[0], aDimSize[1], a->dataType, a->devID, a->mem);
int devIDBackup; XTensor * bi = NewTensor2D(bDimSize[0], bDimSize[1], b->dataType, b->devID, b->mem);
ProtectCudaDev(a->devID, devIDBackup); XTensor * ci = NewTensor2D(cDimSize[0], cDimSize[1], c->dataType, c->devID, c->mem);
cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID); for (int i = 0; i < blockNum; i++) {
_CudaBLASMatrixMULList(handle, ai->data = (char*)a->data + i * aRealBlockSize;
aList, transposedA, bi->data = (char*)b->data + i * bRealBlockSize;
bList, transposedB, ci->data = (char*)c->data + i * cRealBlockSize;
cList, aList->count, #ifdef USE_BLAS
alpha, beta); if (useBLAS)
_MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
BacktoCudaDev(a->devID, devIDBackup); else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); _MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif #endif
} }
else {
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
_MatrixMULBatchedCPU(aList, transposedA,
bList, transposedB,
cList, alpha, beta);
}
for (int i = 0; i < aList->count; i++) { ai->data = NULL;
XTensor * ai = (XTensor*)aList->GetItem(i); bi->data = NULL;
ai->data = NULL; ci->data = NULL;
delete ai; delete ai;
} delete bi;
delete ci;
}
for (int i = 0; i < bList->count; i++) { /*
XTensor * bi = (XTensor*)bList->GetItem(i); matrix multiplication in batch mode for list inputs (BLAS)
bi->data = NULL; c_i = trans(a_i) * trans(b_i) * \alpha + c_i * \beta for each i in [0,count-1]
delete bi; >> a - list of input matrices (2d tensors)
} >> transposedA - indicates whether the matrices in a are transposed
>> b - another list of input matrices (2d tensors)
>> transposedB - indicates whether the matrices in b are transposed
>> c - list of output matrices (2d tensors)
>> alpha - scalar
>> beta - scalar
*/
void _MatrixMulBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA,
const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty input lists!");
CheckNTErrors(a->count == b->count && a->count == c->count, "Input lists must be of the same size!");
if (a->count == 0)
return;
for (int i = 0; i < cList->count; i++) { bool isUniform = true;
XTensor * ci = (XTensor*)cList->GetItem(i); for (int i = 1; i < a->count; i++) {
ci->data = NULL; XTensor * aim = (XTensor*)a->GetItem(i - 1);
delete ci; XTensor * bim = (XTensor*)b->GetItem(i - 1);
XTensor * cim = (XTensor*)c->GetItem(i - 1);
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
if (!XTensor::IsSameShaped(aim, ai) ||
!XTensor::IsSameShaped(bim, bi) ||
!XTensor::IsSameShaped(cim, ci))
{
isUniform = false;
break;
}
} }
delete aList; for (int i = 0; i < a->count; i++) {
delete bList; XTensor * ai = (XTensor*)a->GetItem(i);
delete cList; XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
CheckNTErrors((ai->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((bi->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((ci->order == 2), "2d tensor (i.e., matrix) is required!");
#ifdef USE_BLAS
if (useBLAS)
_MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif
}
} }
/* /*
...@@ -212,4 +331,60 @@ XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const ...@@ -212,4 +331,60 @@ XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const
return c; return c;
} }
/*
matrix multiplication of the two tensors (return a XTensor structure)
c = a * b * alpha
make a new tensor to keep the result and return it
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = ai * bi * alpha + ci * beta
>> a - tensor a
>> b - tensor b
>> alpha - a coefficient
>> parallelRunner - parallel processing module
<< return - the result of matrix multiplication of the two tensors
*/
XTensor MatrixMulBatched(const XTensor &a, const XTensor &b,
DTYPE alpha, XPRunner * parallelRunner)
{
CheckNTErrors(a.dataType == b.dataType, "Input tensors should have the same data type!");
CheckNTErrors(a.order >= 2 && b.order >= 2, "Input tensors must have a order >= 2!");
CheckNTErrors(a.order == b.order, "Input tensors must have the same order!");
int an = a.dimSizeRDI[1];
int am = a.dimSizeRDI[0];
int bn = b.dimSizeRDI[1];
int bm = b.dimSizeRDI[0];
CheckNTErrors(am == bn, "Unmatched tensors in multiplication!");
int order = a.order;
int sub = 0;
int * dimSize = new int[order];
for (int i = 0; i < a.order - 2; i++)
dimSize[sub++] = a.dimSize[i];
dimSize[sub++] = an;
dimSize[sub++] = bm;
float dr = (!a.isSparse || !b.isSparse) ? 1.0F : MAX(a.denseRatio, b.denseRatio);
XTensor c(order, dimSize, a.dataType, dr, a.devID, a.mem);
c.SetTMP();
/* call _MatrixMulBatched function */
_MatrixMulBatched(&a, X_NOTRANS, &b, X_NOTRANS, &c, alpha, 0, parallelRunner);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_MATRIXMULBATCHED);
XLink::AddParamToHeadTrans(&c, X_NOTRANS);
XLink::AddParamToHeadTrans(&c, X_NOTRANS);
XLink::AddParamToHead(&c, alpha);
/* destroy variables */
delete[] dimSize;
return c;
}
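/*
usage sketch (illustration only; it assumes the NewTensor helper with the
(order, dimSize, dataType, denseRatio, devID, mem) signature used elsewhere
in this codebase):

    int dimsA[3] = {8, 2, 3};    // 8 matrices of size 2 x 3
    int dimsB[3] = {8, 3, 4};    // 8 matrices of size 3 x 4
    XTensor * a = NewTensor(3, dimsA, DEFAULT_DTYPE, 1.0F, -1, NULL);
    XTensor * b = NewTensor(3, dimsB, DEFAULT_DTYPE, 1.0F, -1, NULL);
    XTensor c = MatrixMulBatched(*a, *b);    // c has shape (8, 2, 4)
    delete a;
    delete b;
*/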
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
...@@ -26,6 +26,8 @@ ...@@ -26,6 +26,8 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
#define BMMul MatrixMulBatched
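/* BMMul is a shorthand alias, so BMMul(a, b) is identical to
   MatrixMulBatched(a, b) */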
/* /*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
...@@ -37,6 +39,28 @@ where trans() returns the transposed matrix if the flag is fired ...@@ -37,6 +39,28 @@ where trans() returns the transposed matrix if the flag is fired
void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB, void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0, XPRunner * parallelRunner = NULL); XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0, XPRunner * parallelRunner = NULL);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
optimized for GPU
*/
void _MatrixMulBatchedGPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
optimized for CPU
*/
void _MatrixMulBatchedCPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta (for list inputs)
optimized for CPU
*/
void _MatrixMulBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA, const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/* /*
matrix multiplication of the two tensors (return a XTensor structure) c = trans(a) * trans(b) * alpha matrix multiplication of the two tensors (return a XTensor structure) c = trans(a) * trans(b) * alpha
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
...@@ -49,6 +73,17 @@ where trans() returns the transposed matrix if the flag is fired ...@@ -49,6 +73,17 @@ where trans() returns the transposed matrix if the flag is fired
XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB, XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB,
DTYPE alpha = (DTYPE)1.0, XPRunner * parallelRunner = NULL); DTYPE alpha = (DTYPE)1.0, XPRunner * parallelRunner = NULL);
/*
matrix multiplication of the two tensors (return a XTensor structure) c = a * b * alpha
make a new tensor to keep the result and return it
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = ai * bi * alpha + ci * beta
*/
XTensor MatrixMulBatched(const XTensor &a, const XTensor &b,
DTYPE alpha = (DTYPE)1.0, XPRunner * parallelRunner = NULL);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __MATRIXMULBATCHED_H__ #endif // __MATRIXMULBATCHED_H__
\ No newline at end of file
...@@ -32,9 +32,9 @@ element-wise product of two tensors ...@@ -32,9 +32,9 @@ element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i) c(i) = a(i)*b(i) + \alpha * c(i)
where i is the index of the item where i is the index of the item
>> a - matrix a >> a - tensor a
>> b - matrix b >> b - tensor b
>> c - result matrix >> c - result tensor
>> alpha - the coefficient >> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting >> leadingDim - the dimension along which we perform broadcasting
*/ */
......
...@@ -104,9 +104,9 @@ void KernelMulElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE al ...@@ -104,9 +104,9 @@ void KernelMulElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE al
int offseti = i % stride; int offseti = i % stride;
if (nonZeroAlpha == 0) if (nonZeroAlpha == 0)
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj* ldSizeA + offseti] * bp[threadIdx.x][bj* ldSizeB + offseti]; cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti];
else else
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj* ldSizeA + offseti] * bp[threadIdx.x][bj* ldSizeB + offseti] + cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti] +
alpha * cp[threadIdx.x][j * ldSizeC + offseti]; alpha * cp[threadIdx.x][j * ldSizeC + offseti];
} }
......
...@@ -76,7 +76,7 @@ XTensor Sign(const XTensor & a) ...@@ -76,7 +76,7 @@ XTensor Sign(const XTensor & a)
XTensor b(&a); XTensor b(&a);
b.SetTMP(); b.SetTMP();
/* call _ScaleAndShift function */ /* call _Sign function */
_Sign(&a, &b); _Sign(&a, &b);
/* tensor connections */ /* tensor connections */
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "../../XUtility.h"
#include "Sub.h"
#include "Sub.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor subtraction c = a - b * \beta
>> a - a tensor
>> b - another tensor
>> c - where we put a-b*\beta (it must not be NULL)
>> beta - the scaling factor
*/
void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == b->unitNum && a->unitNum == c->unitNum,
"Unmatched tensors in subtraction!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched tensors in subtraction!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
#ifdef USE_CUDA
if (a == c) {
int P2PAccesible = 0;
#ifdef CUDA_UVA
cudaDeviceCanAccessPeer(&P2PAccesible, a->devID, b->devID);
#endif
if ((a->devID < 0 && b->devID >= 0) ||
(a->devID >= 0 && b->devID < 0) ||
(a->devID >= 0 && b->devID >= 0 && a->devID != b->devID && !P2PAccesible))
{
ShowNTErrors("Cannot run this method on multiple devices simultaneously!");
}
else
_CudaSub(a, b, c, beta);
}
else
_CudaSub(a, b, c, beta);
#endif
}
else {
if (!a->isSparse && !b->isSparse) {
CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in subtraction!");
if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE &&
c->dataType == DEFAULT_DTYPE)
{
DTYPE * ap = (DTYPE*)a->data;
DTYPE * bp = (DTYPE*)b->data;
DTYPE * cp = (DTYPE*)c->data;
/* unrolling */
int num = a->unitNum;
if (num % 4 == 0) {
for (int i = 0; i < num; i += 4) {
cp[i] = ap[i] - bp[i] * beta;
cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
cp[i + 2] = ap[i + 2] - bp[i + 2] * beta;
cp[i + 3] = ap[i + 3] - bp[i + 3] * beta;
}
}
else if (num % 2 == 0) {
for (int i = 0; i < num; i += 2) {
cp[i] = ap[i] - bp[i] * beta;
cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
}
}
else {
for (int i = 0; i < num; i++) {
cp[i] = ap[i] - bp[i] * beta;
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
}
/*
tensor subtraction a = a - b * \beta (do it on site)
keep the result in the tensor a and return nothing
>> a - a tensor
>> b - another tensor
>> beta - the scaling factor
*/
void _SubMe(XTensor * a, const XTensor * b, DTYPE beta)
{
_Sub(a, b, a, beta);
}
/*
tensor subtraction c = a - b * \beta (return a XTensor structure)
make a new tensor c to keep the result and return it
>> a - a tensor
>> b - another tensor
>> beta - the scaling factor
<< return - the result of tensor subtraction
*/
XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta)
{
XTensor c(&a);
c.SetTMP();
/* call _Sub function */
_Sub(&a, &b, &c, beta);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUB);
XLink::AddParamToHead(&c, beta);
return c;
}
} // namespace nts(NiuTrans.Tensor)
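/*
usage sketch (illustration only; it assumes the NewTensor2D helper with the
(rowNum, colNum, dataType, devID, mem) signature used elsewhere in this
codebase):

    XTensor * a = NewTensor2D(2, 2, DEFAULT_DTYPE, -1, NULL);
    XTensor * b = NewTensor2D(2, 2, DEFAULT_DTYPE, -1, NULL);
    for (int i = 0; i < 4; i++) {
        ((DTYPE*)a->data)[i] = (DTYPE)i;
        ((DTYPE*)b->data)[i] = (DTYPE)1.0;
    }
    XTensor c = Sub(*a, *b, (DTYPE)0.5);    // c(i) = a(i) - 0.5 * b(i)
    delete a;
    delete b;
*/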
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XDevice.h"
#include "../../XUtility.h"
#include "Sub.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
subtraction of data arrays (CUDA Kernel)
c = a - b * \beta
>> a - A matrix
>> b - another matrix
>> c - where we put a-b
>> size - the size of a/b/c
>> beta - the coefficient
*/
__global__
void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] - b[i] * beta;
}
/*
tensor subtraction c = a - b * \beta (cuda version)
>> a - a tensor
>> b - another tensor
>> c - where we put a-b*\beta.
>> beta - the scaling factor
*/
void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors((a->unitNum == b->unitNum && a->unitNum == c->unitNum),
"Unmatched tensors in subtraction!");
CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
"Unmatched tensors in subtraction!");
CheckNTErrors((a->devID == b->devID && a->devID == c->devID),
"The tensors must be on the same device!");
int devIDBackup = XDevice::GetGPUDevice();
XDevice::SetGPUDevice(a->devID);
if (!a->isSparse && !b->isSparse) {
CheckNTErrors(!c->isSparse, "Illegal use of sparse matrix in addition!");
if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE &&
c->dataType == DEFAULT_DTYPE)
{
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
KernelSUB << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, a->unitNum, beta);
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
XDevice::SetGPUDevice(devIDBackup);
}
/*
tensor subtraction over data arrays c = a - b * \beta (cuda version) with an input handle
>> devID - device ID (MUST >= 0)
>> handle - cuda handle
>> a - an array
>> b - another array
>> c - where we put a-b
>> size - size of the array
>> beta - the coefficient
*/
void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
{
if (size == 0)
return;
if (c == NULL)
c = a;
CheckNTErrors((a && b && c), "Empty arrays in addition!");
int devIDBackup;
ProtectCudaDev(devID, devIDBackup);
if (c == a) {
/* axpy computes a = coef * b + a, so the coefficient must be -beta
to yield the subtraction a = a - b * beta */
DTYPE coef = -beta;
#ifdef DOUBELPRICSION
cublasDaxpy(*handle, size, &coef, b, 1, a, 1);
#else
cublasSaxpy(*handle, size, &coef, b, 1, a, 1);
#endif
}
else {
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(devID, size, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
KernelSUB<<<blocks, threads>>>((DTYPE*)a, (DTYPE*)b, (DTYPE*)c, size, beta);
}
BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __SUB_CUH__
#define __SUB_CUH__
#include "Sub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* subtraction of data arrays (CUDA Kernel) */
__global__
void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
/* tensor subtraction c = a - b * \beta (cuda version) */
void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c = NULL, DTYPE beta = (DTYPE)1.0);
/* tensor subtraction c = a - b * \beta (cuda version) with an input handle */
void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __SUB_CUH__
/* NiuTrans.Tensor - an open-source tensor library /* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University. * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved. * All rights reserved.
* *
* Licensed under the Apache License, Version 2.0 (the "License"); * Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License. * you may not use this file except in compliance with the License.
* You may obtain a copy of the License at * You may obtain a copy of the License at
* *
* http://www.apache.org/licenses/LICENSE-2.0 * http://www.apache.org/licenses/LICENSE-2.0
* *
* Unless required by applicable law or agreed to in writing, software * Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, * distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and * See the License for the specific language governing permissions and
* limitations under the License. * limitations under the License.
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/ * Today is the first day of August. It's still very hot.
*/
#ifndef __ABSOLUTE_H__ #ifndef __SUB_H__
#define __ABSOLUTE_H__ #define __SUB_H__
#include "../../XTensor.h" #include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* set every entry to its absolute value */ /* tensor subtraction c = a - b * \beta */
void _Absolute(const XTensor * a, XTensor * b); void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0);
/* /*
set every entry to its absolute value (do it on site) tensor subtraction a = a - b * \beta
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
*/ */
void _AbsoluteMe(XTensor * a); void _SubMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0);
/* /*
set every entry to its absolute value (return a XTensor structure) tensor subtraction c = a - b * \beta
make a new tensor to keep the result and return it make a new tensor c to keep the result and return it
*/ */
XTensor Absolute(const XTensor & a); XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __ABSOLUTE_H__ #endif // __SUB_H__
...@@ -22,8 +22,10 @@ ...@@ -22,8 +22,10 @@
#include "../../XTensor.h" #include "../../XTensor.h"
#include "../../XName.h" #include "../../XName.h"
#include "../../XUtility.h" #include "../../XUtility.h"
#include "../movement/CopyValues.h"
#include "Sum.h" #include "Sum.h"
#include "Sum.cuh" #include "Sum.cuh"
#include "SumDim.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -43,8 +45,12 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta) ...@@ -43,8 +45,12 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType, CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched tensors in addition!"); "Unmatched tensors in addition!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) { if(beta == 0){
_CopyValues(a, c);
return;
}
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
if (a == c) { if (a == c) {
int P2PAccesible = 0; int P2PAccesible = 0;
...@@ -67,7 +73,7 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta) ...@@ -67,7 +73,7 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
} }
else { else {
if (!a->isSparse && !b->isSparse) { if (!a->isSparse && !b->isSparse) {
CheckNTErrors(!c->isSparse, "Illegal use of sparse matrix in addition!"); CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in addition!");
if (a->dataType == DEFAULT_DTYPE && if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE &&
...@@ -123,6 +129,33 @@ void _SumMe(XTensor * a, const XTensor * b, DTYPE beta) ...@@ -123,6 +129,33 @@ void _SumMe(XTensor * a, const XTensor * b, DTYPE beta)
{ {
_Sum(a, b, a, beta); _Sum(a, b, a, beta);
} }
/*
return the dimension index if the sum can be performed as SumDim (see SumDim.h for details)
>> a - a tensor
>> b - another tensor for sum
*/
int GetSumDimIndex(const XTensor &a, const XTensor &b)
{
if(a.order < b.order)
return -1;
int hitCount = 0;
int hitDim = -1;
for(int i = 0; i < b.order; i++){
if(b.dimSize[b.order - 1 - i] == 1)
continue;
else if(b.dimSize[b.order - 1 - i] == a.dimSize[a.order - 1 - i]){
hitCount++;
hitDim = a.order - b.order + i;
}
}
if(hitCount == 1)
return hitDim;
else
return -1;
}
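/* e.g., for a of shape (4, 5) and b of shape (5) the tail dimensions match
   and hitCount == 1, so 1 is returned and Sum() below dispatches to the
   broadcasting _SumDim path; for b of shape (4, 5) both dimensions match
   (hitCount == 2), -1 is returned, and the plain element-wise _Sum is used */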
/* /*
tensor summation c = a + b * \beta (return a XTensor structure) tensor summation c = a + b * \beta (return a XTensor structure)
...@@ -137,13 +170,29 @@ XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta) ...@@ -137,13 +170,29 @@ XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta)
{ {
XTensor c(&a); XTensor c(&a);
c.SetTMP(); c.SetTMP();
int n = GetSumDimIndex(a, b);
if(n == -1){
/* call _Sum function */
_Sum(&a, &b, &c, beta);
/* call _Sum function */ /* tensor connections */
_Sum(&a, &b, &c, beta); XLink::MakeLink(&a, &b, &c, MATH_SUM);
XLink::AddParamToHead(&c, beta);
}
else if(n >= 0 && n < a.order){
/* call _SumDim function */
_SumDim(&a, &b, &c, n, beta);
/* tensor connections */ /* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUM); XLink::MakeLink(&a, &b, &c, MATH_SUMDIM);
XLink::AddParamToHead(&c, beta); XLink::AddParamToHeadInt(&c, n);
XLink::AddParamToHead(&c, beta);
}
else{
ShowNTErrors("Something is wrong!");
}
return c; return c;
} }
......
...@@ -20,6 +20,7 @@ ...@@ -20,6 +20,7 @@
*/ */
#include "../../XDevice.h" #include "../../XDevice.h"
#include "../../XUtility.h"
#include "Sum.cuh" #include "Sum.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#include "Sum.h"
#include "SumDim.h"
#include "SumDim.cuh"
#include "../../XName.h"
#include "../movement/CopyValues.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor summation
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> c - where we put a+b*\beta (it must not be NULL)
>> n - the dimension index
>> beta - the scaling factor
*/
void _SumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == c->unitNum, "Unmatched tensors in addition!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched data types in addition!");
CheckNTErrors(a->order == c->order, "The input tensors do not have the same order in addition!");
CheckNTErrors(!a->isSparse && !b->isSparse && !c->isSparse, "Dense tensors are required!");
CheckNTErrors(a->dimSize[n] == b->unitNum, "Wrong tensor size!");
if(beta == 0){
_CopyValues(a, c);
return;
}
if(XTensor::IsSameShaped(a, b)){
_Sum(a, b, c, beta);
return;
}
if(a->devID >= 0 || b->devID >= 0 || c->devID >= 0){
#ifdef USE_CUDA
_CudaSumDim(a, b, c, n, beta);
#else
ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
}
else{
int stride = 1;
int blockSize = a->dimSize[n];
int blockNum = 1;
for(int i = a->order - 1; i >= 0; i--){
if(i > n)
stride *= a->dimSize[i];
else if(i < n)
blockNum *= a->dimSize[i];
}
if (a->dataType == DEFAULT_DTYPE){
int num = a->unitNum;
if(stride > 1){
for(int i = 0, j = 0; i < num; i += stride, j++){
DTYPE * ap = (DTYPE*)a->data + i;
DTYPE bv = *((DTYPE*)b->data + j % blockSize) * beta;
DTYPE * cp = (DTYPE*)c->data + i;
for(int k = 0; k < stride; k++)
cp[k] = ap[k] + bv;
}
}
else if(stride == 1){
DTYPE * bp = (DTYPE*)b->data;
for(int i = 0; i < num; i += blockSize){
DTYPE * ap = (DTYPE*)a->data + i;
DTYPE * cp = (DTYPE*)c->data + i;
if(beta == 1.0F){
for(int j = 0; j < blockSize; j++)
cp[j] = ap[j] + bp[j];
}
else{
for(int j = 0; j < blockSize; j++)
cp[j] = ap[j] + bp[j] * beta;
}
}
}
else{
ShowNTErrors("Something is wrong!");
}
}
else {
ShowNTErrors("TODO!");
}
}
}
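/* e.g., for a of shape (2, 3, 4) and n = 1 (so b holds 3 items): stride = 4,
   blockSize = 3, blockNum = 2; the stride > 1 branch adds b[j % 3] * beta to
   each run of 4 consecutive items, sweeping b once per block of 12 items */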
/*
tensor summation (do it on site)
keep the result in the input tensor and return nothing
a = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> n - the dimension index
>> beta - the scaling factor
*/
void _SumDim(XTensor * a, const XTensor * b, int n, DTYPE beta)
{
_SumDim(a, b, a, n, beta);
}
/*
tensor summation (return a XTensor structure and make tensor connections)
make a new tensor to keep the result and return it
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> n - the dimension index
>> beta - the scaling factor
<< return - the result tensor by tensor summation
*/
XTensor SumDim(const XTensor &a, const XTensor &b, int n, DTYPE beta)
{
XTensor c(&a);
c.SetTMP();
/* call _SumDim function */
_SumDim(&a, &b, &c, n, beta);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUMDIM);
XLink::AddParamToHeadInt(&c, n);
XLink::AddParamToHead(&c, beta);
return c;
}
} // namespace nts(NiuTrans.Tensor)
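/*
usage sketch (illustration only; it assumes the NewTensor2D/NewTensor helpers
with the signatures used elsewhere in this codebase): adding a bias vector to
every row of a matrix by broadcasting over dimension 1.

    int biasSize[1] = {32};
    XTensor * x = NewTensor2D(64, 32, DEFAULT_DTYPE, -1, NULL);
    XTensor * bias = NewTensor(1, biasSize, DEFAULT_DTYPE, 1.0F, -1, NULL);
    XTensor y = SumDim(*x, *bias, 1, (DTYPE)1.0);    // y[i][j] = x[i][j] + bias[j]
    delete x;
    delete bias;
*/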
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#include "SumDim.cuh"
#include "../../XDevice.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
tensor summation of a tensor and a row vector
c = a + b * \beta
where a is a tensor and b is a row vector
>> a - pointer to the data array of a
>> b - pointer to the data array of b
>> c - pointer to the data array of c
>> rowNum - number of rows of a and c
>> colNum - number of columns of a and c (i.e., the size of b)
>> beta - the scaling factor
*/
template <class T, bool betaFired>
__global__
void KernelAddWithRow(T * a, T * b, T * c, int rowNum, int colNum, T beta)
{
__shared__ T bv[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int col = blockDim.x * blockIdx.x + threadIdx.x;
int row = blockDim.y * blockIdx.y + threadIdx.y;
if(col >= colNum || row >= rowNum)
return;
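/* stage b in shared memory: the first thread row of the block loads b[col]
once, then all rows of the block reuse it after the barrier */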
if(threadIdx.y == 0)
bv[threadIdx.x] = b[col];
__syncthreads();
int offset = colNum * row + col;
if(betaFired)
c[offset] = a[offset] + bv[threadIdx.x] * beta;
else
c[offset] = a[offset] + bv[threadIdx.x];
}
/*
tensor summation of a tensor and a column vector
c = a + b * \beta
where a is a tensor and b is a column vector
>> a - pointer to the data array of a
>> b - pointer to the data array of b
>> c - pointer to the data array of c
>> rowNum - number of rows of a and c (i.e., the size of b)
>> colNum - number of columns of a and c
>> blockSize - size of a block (matrix), i.e., rowNum * colNum
>> blockNum - number of matrices
>> beta - the scaling factor
*/
template <class T, bool betaFired>
__global__
void KernelAddWithCol(T * a, T * b, T * c, int rowNum, int colNum, int blockSize, int blockNum, T beta)
{
__shared__ T bv[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int colIndex = blockDim.x * blockIdx.x + threadIdx.x;
int row = blockDim.y * blockIdx.y + threadIdx.y;
int col = colIndex % colNum;
int block = colIndex / colNum;
if(row >= rowNum || block >= blockNum)
return;
if(threadIdx.x == 0)
bv[threadIdx.y] = b[row];
__syncthreads();
int offset = block * blockSize + row * colNum + col;
if(betaFired)
c[offset] = a[offset] + bv[threadIdx.y] * beta;
else
c[offset] = a[offset] + bv[threadIdx.y];
}
/*
tensor summation (cuda version)
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> c - where we put a+b*\beta. we save it in a if c is NULL
>> n - the dimension index
>> beta - the scaling factor
*/
void _CudaSumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == c->unitNum, "Unmatched tensors in addition!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched data types in addition!");
CheckNTErrors(a->order == c->order, "The input tensors do not have the same order in addition!");
CheckNTErrors(!a->isSparse && !b->isSparse && !c->isSparse, "Dense tensors are required!");
CheckNTErrors(a->dimSize[n] == b->unitNum, "Wrong tensor size!");
int stride = 1;
int blockSize = a->dimSize[n];
int blockNum = 1;
for(int i = a->order - 1; i >= 0; i--){
if(i > n)
stride *= a->dimSize[i];
else if(i < n)
blockNum *= a->dimSize[i];
}
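/*
an added note: stride == 1 means that n is the last dimension of a, so b is
added as a row vector (KernelAddWithRow); when stride > 1, each block of a is
viewed as a blockSize * stride matrix and b is added as a column vector
(KernelAddWithCol)
*/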
int cudaGrids[3];
int cudaBlocks[3];
int devIDBackup = 0;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE){
if(stride > 1){
GDevs.GetCudaThread2D(a->devID, stride * blockNum, blockSize, MAX_INT, cudaGrids, cudaBlocks);
if(beta == (DTYPE)1.0F)
KernelAddWithCol<DTYPE, false> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockSize, stride, blockSize * stride, blockNum, beta);
else
KernelAddWithCol<DTYPE, true> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockSize, stride, blockSize * stride, blockNum, beta);
}
else if(stride == 1){
GDevs.GetCudaThread2D(a->devID, blockSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
if(beta == (DTYPE)1.0F)
KernelAddWithRow<DTYPE, false> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockNum, blockSize, beta);
else
KernelAddWithRow<DTYPE, true> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockNum, blockSize, beta);
}
else{
ShowNTErrors("Something is wrong!");
}
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#ifndef __SUMDIM_CUH__
#define __SUMDIM_CUH__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting (cuda version) */
void _CudaSumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta = (DTYPE)1.0);
#endif
} // namespace nts(NiuTrans.Tensor)
#endif // __SUMDIM_CUH__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
 * It reached 39 degrees centigrade around 3:00 pm in Shenyang
*/
#ifndef __SUMDIM_H__
#define __SUMDIM_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting */
void _SumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta = (DTYPE)1.0);
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting. we keep the result in the input tensor a and return nothing */
void _SumDim(XTensor * a, const XTensor * b, int n, DTYPE beta = (DTYPE)1.0);
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting. We make a new tensor c to keep the result and return it */
XTensor SumDim(const XTensor &a, const XTensor &b, int n, DTYPE beta = (DTYPE)1.0);
} // namespace nts(NiuTrans.Tensor)
#endif // __SUMDIM_H__
...@@ -20,6 +20,7 @@ ...@@ -20,6 +20,7 @@
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-05-08 * $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-05-08
*/ */
#include <math.h>
#include "SetData.h" #include "SetData.h"
#include "SetData.cuh" #include "SetData.cuh"
#include "../../XUtility.h" #include "../../XUtility.h"
...@@ -37,6 +38,43 @@ ...@@ -37,6 +38,43 @@
namespace nts{ // namespace nts(NiuTrans.Tensor) namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
fill the input tensor with values according to the method described in
"Understanding the difficulty of training deep feedforward neural networks" - Glorot, X. & Bengio, Y. (2010),
using a uniform distribution. The resulting tensor will have values sampled from U(-a, a),
where a = gain * sqrt(2 / (fanIn + fanOut)) * sqrt(3). Also known as Glorot initialization.
>> tensor - the tensor whose data array would be initialized
>> gain - an optional scaling factor
*/
void _SetDataFanInOut(XTensor * tensor, DTYPE gain)
{
CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
CheckNTErrors(tensor->order >= 2, "the tensor dimension must be no less than 2!");
int fanIn = 1;
int fanOut = 1;
int order = tensor->order;
if (order == 2) {
fanIn = tensor->dimSize[1];
fanOut = tensor->dimSize[0];
}
else {
int numInputFmaps = tensor->dimSize[1];
int numOutputFmaps = tensor->dimSize[0];
/* the receptive field size is the product of the kernel dimensions */
int receptiveFieldSize = 1;
for (int i = 2; i < order; i++)
receptiveFieldSize *= tensor->dimSize[i];
fanIn = numInputFmaps * receptiveFieldSize;
fanOut = numOutputFmaps * receptiveFieldSize;
}
DTYPE std = gain * (DTYPE)sqrt(2.0 / (fanIn + fanOut));
DTYPE a = (DTYPE)sqrt(3.0) * std;
_SetDataRand(tensor, -a, a);
}
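/*
a short usage sketch (added for illustration; NewTensor2D is the 2D-tensor
constructor declared in XTensor.h):

XTensor * w = NewTensor2D(512, 256, X_FLOAT);
_SetDataFanInOut(w);   // gain = 1.0F

here fanIn = 256 and fanOut = 512, so the entries are sampled from U(-a, a)
with a = sqrt(3) * sqrt(2 / 768), i.e., roughly 0.088
*/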
/* /*
generate data items with a fixed value p generate data items with a fixed value p
>> tensor - the tensor whose data array would be initialized >> tensor - the tensor whose data array would be initialized
...@@ -65,7 +103,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer) ...@@ -65,7 +103,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
} }
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
CudaSetDataFixedInt(tensor, p); _CudaSetDataFixedInt(tensor, p);
#endif #endif
} }
} }
...@@ -88,7 +126,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer) ...@@ -88,7 +126,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
} }
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
CudaSetDataFixedFloat(tensor, p); _CudaSetDataFixedFloat(tensor, p);
#endif #endif
} }
} }
...@@ -111,7 +149,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer) ...@@ -111,7 +149,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
} }
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
CudaSetDataFixedDouble(tensor, p); _CudaSetDataFixedDouble(tensor, p);
#endif #endif
} }
} }
...@@ -137,7 +175,7 @@ generate data items with a fixed value p (in integer) ...@@ -137,7 +175,7 @@ generate data items with a fixed value p (in integer)
*/ */
void _SetDataFixedInt(XTensor * tensor, int p) void _SetDataFixedInt(XTensor * tensor, int p)
{ {
CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT"); CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");
if(p == 0) if(p == 0)
tensor->SetZeroAll(); tensor->SetZeroAll();
...@@ -152,7 +190,7 @@ generate data items with a fixed value p (in float) ...@@ -152,7 +190,7 @@ generate data items with a fixed value p (in float)
*/ */
void _SetDataFixedFloat(XTensor * tensor, float p) void _SetDataFixedFloat(XTensor * tensor, float p)
{ {
CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_INT"); CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
if(p == 0) if(p == 0)
tensor->SetZeroAll(); tensor->SetZeroAll();
...@@ -167,7 +205,7 @@ generate data items with a fixed value p (in double) ...@@ -167,7 +205,7 @@ generate data items with a fixed value p (in double)
*/ */
void _SetDataFixedDouble(XTensor * tensor, double p) void _SetDataFixedDouble(XTensor * tensor, double p)
{ {
CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_INT"); CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");
if(p == 0) if(p == 0)
tensor->SetZeroAll(); tensor->SetZeroAll();
...@@ -176,32 +214,32 @@ void _SetDataFixedDouble(XTensor * tensor, double p) ...@@ -176,32 +214,32 @@ void _SetDataFixedDouble(XTensor * tensor, double p)
} }
/* /*
generate data items with a uniform distribution in [low,high] generate data items with a uniform distribution in [lower, upper]
>> tensor - the tensor whose data array would be initialized >> tensor - the tensor whose data array would be initialized
>> low - lower value of the range >> lower - lower value of the range
>> high - higher value of the range >> upper - upper value of the range
*/ */
void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high) void _SetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper)
{ {
CheckNTErrors(upper > lower, "the upper value must be greater than the lower value!");
if(tensor == NULL) if(tensor == NULL)
return; return;
/* CPU code */ /* CPU code */
if(tensor->devID < 0){ if(tensor->devID < 0){
DTYPE variance = high - low; DTYPE variance = upper - lower;
srand((unsigned)time(NULL));
if(tensor->dataType == X_FLOAT){ if(tensor->dataType == X_FLOAT){
float * d = (float*)tensor->data; float * d = (float*)tensor->data;
for(int i = 0; i < tensor->unitNum; i++){ for(int i = 0; i < tensor->unitNum; i++){
d[i] = variance * ((float)rand()/RAND_MAX) + low; d[i] = variance * ((float)rand()/RAND_MAX) + lower;
} }
} }
else if(tensor->dataType == X_DOUBLE){ else if(tensor->dataType == X_DOUBLE){
double * d = (double*)tensor->data; double * d = (double*)tensor->data;
for(int i = 0; i < tensor->unitNum; i++){ for(int i = 0; i < tensor->unitNum; i++){
d[i] = variance * ((double)rand()/RAND_MAX) + low; d[i] = variance * ((double)rand()/RAND_MAX) + lower;
} }
} }
else{ else{
...@@ -215,12 +253,27 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high) ...@@ -215,12 +253,27 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
TODO: generate data points on GPUs straightforwardly. TODO: generate data points on GPUs straightforwardly.
*/ */
else{ else{
XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1); #ifdef USE_CUDA
_SetDataRand(t2, low, high); _CudaSetDataRand(tensor, lower, upper);
_CopyValues(t2, tensor); #endif
delete t2; //XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1);
//_SetDataRand(t2, low, high);
//_CopyValues(t2, tensor);
//delete t2;
} }
} }
/*
generate data items with a normal distribution with specified mean and standard deviation
>> tensor - the tensor whose data array would be initialized
>> mean - mean or expectation of the distribution
>> standardDeviation - standard deviation of the distribution
void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
{
// TODO: rewrite it and add cuda code!!!!!!!
tensor->SetDataRandn(mean, standardDeviation);
}
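/*
a short usage sketch for the random initializers (added for illustration;
NewTensor2D is the 2D-tensor constructor declared in XTensor.h):

XTensor * t = NewTensor2D(3, 3, X_FLOAT);
_SetDataRand(t, 0.0F, 1.0F);    // uniform entries in [0, 1]
_SetDataRandN(t, 0.0F, 0.02F);  // overwrite with N(0, 0.02^2) samples
*/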
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
...@@ -21,7 +21,10 @@ ...@@ -21,7 +21,10 @@
* I'm surprised that I did not write this file till today. * I'm surprised that I did not write this file till today.
*/ */
#include <curand.h>
#include <time.h>
#include "SetData.cuh" #include "SetData.cuh"
#include <curand_kernel.h>
#include "../../XDevice.h" #include "../../XDevice.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -46,7 +49,7 @@ generate data items with a fixed value p (in int) ...@@ -46,7 +49,7 @@ generate data items with a fixed value p (in int)
>> tensor - the tensor for initialization >> tensor - the tensor for initialization
>> p - the initial value >> p - the initial value
*/ */
void CudaSetDataFixedInt(XTensor * tensor, int p) void _CudaSetDataFixedInt(XTensor * tensor, int p)
{ {
CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!"); CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");
...@@ -86,7 +89,7 @@ generate data items with a fixed value p (in float) ...@@ -86,7 +89,7 @@ generate data items with a fixed value p (in float)
>> tensor - the tensor for initialization >> tensor - the tensor for initialization
>> p - the initial value >> p - the initial value
*/ */
void CudaSetDataFixedFloat(XTensor * tensor, float p) void _CudaSetDataFixedFloat(XTensor * tensor, float p)
{ {
CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!"); CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
...@@ -126,7 +129,7 @@ generate data items with a fixed value p (in double) ...@@ -126,7 +129,7 @@ generate data items with a fixed value p (in double)
>> tensor - the tensor for initialization >> tensor - the tensor for initialization
>> p - the initial value >> p - the initial value
*/ */
void CudaSetDataFixedDouble(XTensor * tensor, double p) void _CudaSetDataFixedDouble(XTensor * tensor, double p)
{ {
CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!"); CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");
...@@ -146,4 +149,75 @@ void CudaSetDataFixedDouble(XTensor * tensor, double p) ...@@ -146,4 +149,75 @@ void CudaSetDataFixedDouble(XTensor * tensor, double p)
BacktoCudaDev(tensor->devID, devIDBackup); BacktoCudaDev(tensor->devID, devIDBackup);
} }
/*
set data array with a uniform distribution in [lower, upper] (CUDA Kernel)
>> d - float datatype pointer to the data array
>> size - size of the array
>> lower - lower value of the range
>> variance - the width of the range, i.e., upper - lower
*/
__global__
void KernelSetDataRandFloat(float * d, int size, DTYPE lower, DTYPE variance)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
d[i] = d[i] * variance + lower;
}
}
/*
set data array with a uniform distribution in [lower, upper] (CUDA Kernel)
>> d - double datatype pointer to the data array
>> size - size of the array
>> lower - lower value of the range
>> variance - the width of the range, i.e., upper - lower
*/
__global__
void KernelSetDataRandDouble(double * d, int size, DTYPE lower, DTYPE variance)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
d[i] = d[i] * variance + lower;
}
}
/*
generate data items with a uniform distribution in [lower, upper]
>> tensor - the tensor whose data array would be initialized
>> lower - lower value of the range
>> upper - upper value of the range
*/
void _CudaSetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper)
{
CheckNTErrors(upper > lower, "the upper value must be greater than the lower value!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(tensor->devID, tensor->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(tensor->devID, devIDBackup);
curandGenerator_t gen;
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(gen, (unsigned long long)time(NULL));
/* generate raw uniform samples with the data type of the tensor */
if (tensor->dataType == X_FLOAT)
curandGenerateUniform(gen, (float*)tensor->data, tensor->unitNum);
else if (tensor->dataType == X_DOUBLE)
curandGenerateUniformDouble(gen, (double*)tensor->data, tensor->unitNum);
else
ShowNTErrors("TODO!");
curandDestroyGenerator(gen);
DTYPE variance = upper - lower;
if (tensor->dataType == X_FLOAT)
KernelSetDataRandFloat <<<blocks, threads >>>((float*)tensor->data, tensor->unitNum, lower, variance);
else if (tensor->dataType == X_DOUBLE)
KernelSetDataRandDouble <<<blocks, threads >>>((double*)tensor->data, tensor->unitNum, lower, variance);
BacktoCudaDev(tensor->devID, devIDBackup);
}
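/*
an added note on the implementation: curandGenerateUniform produces raw
samples in (0, 1]; the kernels above then rescale them in place as
d[i] = d[i] * (upper - lower) + lower, i.e., samples in (lower, upper]
*/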
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
...@@ -29,13 +29,16 @@ ...@@ -29,13 +29,16 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* generate data items with a fixed value p (in int) */ /* generate data items with a fixed value p (in int) */
void CudaSetDataFixedInt(XTensor * tensor, int p); void _CudaSetDataFixedInt(XTensor * tensor, int p);
/* generate data items with a fixed value p (in float) */ /* generate data items with a fixed value p (in float) */
void CudaSetDataFixedFloat(XTensor * tensor, float p); void _CudaSetDataFixedFloat(XTensor * tensor, float p);
/* generate data items with a fixed value p (in double) */ /* generate data items with a fixed value p (in double) */
void CudaSetDataFixedDouble(XTensor * tensor, double p); void _CudaSetDataFixedDouble(XTensor * tensor, double p);
/* generate data items with a uniform distribution in [lower, upper] */
void _CudaSetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
......
...@@ -27,6 +27,9 @@ ...@@ -27,6 +27,9 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* generate data items with Xavier (Glorot uniform) initialization */
void _SetDataFanInOut(XTensor * tensor, DTYPE gain = 1.0F);
/* generate data items with a fixed value p */ /* generate data items with a fixed value p */
void _SetDataFixed(XTensor * tensor, void * valuePointer); void _SetDataFixed(XTensor * tensor, void * valuePointer);
...@@ -42,8 +45,8 @@ void _SetDataFixedFloat(XTensor * tensor, float p); ...@@ -42,8 +45,8 @@ void _SetDataFixedFloat(XTensor * tensor, float p);
/* generate data items with a fixed value p (in double) */ /* generate data items with a fixed value p (in double) */
void _SetDataFixedDouble(XTensor * tensor, double p); void _SetDataFixedDouble(XTensor * tensor, double p);
/* generate data items with a uniform distribution in [low,high] */ /* generate data items with a uniform distribution in [lower, upper] */
void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high); void _SetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper);
/* generate data items with a normal distribution with specified mean and standard deviation */ /* generate data items with a normal distribution with specified mean and standard deviation */
void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation); void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation);
......
...@@ -16,67 +16,130 @@ ...@@ -16,67 +16,130 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#include <math.h>
#include "../../XTensor.h" #include "../../XTensor.h"
#include "../../XName.h" #include "../../XName.h"
#include "Absolute.h" #include "Clip.h"
#include "Absolute.cuh" #include "Clip.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* /*
set every entry to its absolute value set every entry to its clip value
>> a - input tensor we are processing >> a - input tensor we are processing
>> b - output tensor we are processing >> b - output tensor we are processing
>> lower - the lower border
>> upper - the upper border
*/ */
void _Absolute(const XTensor * a, XTensor * b) void _Clip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper)
{ {
#ifdef USE_CUDA #ifdef USE_CUDA
/* run it on GPUs */ /* run it on GPUs */
if (a->devID >= 0) { if (a->devID >= 0) {
_CudaAbsolute(a, b); _CudaClip(a, b, lower, upper);
return; return;
} }
#endif #endif
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!"); CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * d = (DTYPE*)a->data;
DTYPE * db = (DTYPE*)b->data; DTYPE * d = (DTYPE*)a->data;
for (int i = 0; i < a->unitNum; i++) DTYPE * db = (DTYPE*)b->data;
db[i] = (DTYPE)fabs(d[i]); for (int i = 0; i < a->unitNum; i++) {
if (d[i] > upper)
db[i] = upper;
else if (d[i] < lower)
db[i] = lower;
else
db[i] = d[i];
}
} }
/* /*
set every entry to its absolute value (do it on site) set every entry to its clip value (do it on site)
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
>> a - the tensor we are processing >> a - the tensor we are processing
>> lower - the lower border
>> upper - the upper border
*/ */
void _AbsoluteMe(XTensor * a) void _ClipMe(XTensor * a, DTYPE lower, DTYPE upper)
{ {
_Absolute(a, a); _Clip(a, a, lower, upper);
} }
/* /*
set every entry to its absolute value (return a XTensor structure) set every entry to its clip value (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
>> a - input tensor we are processing >> a - input tensor we are processing
<< return - the absolute value of input tensor >> lower - the lower border
>> upper - the upper border
<< return - the clip value of the input tensor
*/ */
XTensor Absolute(const XTensor & a) XTensor Clip(const XTensor & a, DTYPE lower, DTYPE upper)
{
XTensor b(&a);
b.SetTMP();
/* call _Clip function */
_Clip(&a, &b, lower, upper);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_CLIP);
XLink::AddParamToHead(&b, lower);
XLink::AddParamToHead(&b, upper);
return b;
}
/*
backward computation
dE/dx = dE/dy * dy/dx
clip: y = upper if x > upper
x if lower <= x <= upper
lower if x < lower
and dy/dx = 1 if lower <= x <= upper
0 otherwise
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lower - the lower border
>> upper - the upper border
*/
void _ClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper)
{ {
XTensor b(&a);
b.SetTMP();
/* call _Absolute function */
_Absolute(&a, &b);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_ABSOLUTE);
return b; #ifdef USE_CUDA
if (x->devID >= 0) {
_CudaClipBackward(y, x, dedy, dedx, lower, upper);
return;
} }
#endif
if (x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE) {
DTYPE * dedyp = (DTYPE*)dedy->data;
DTYPE * dedxp = (DTYPE*)dedx->data;
DTYPE * ip = (DTYPE*)x->data;
int size = y->unitNum;
/* dE/dx = dE/dy * dy/dx */
for (int i = 0; i < size; i++) {
DTYPE s = ip[i];
if (s > upper || s < lower)
dedxp[i] = 0;
else
dedxp[i] = dedyp[i];
}
}
else
ShowNTErrors("TODO!");
}
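/*
a minimal usage sketch (added for illustration; NewTensor1D is the 1D-tensor
constructor declared in XTensor.h):

XTensor * a = NewTensor1D(5, X_FLOAT);
_SetDataRand(a, -2.0F, 2.0F);
XTensor b = Clip(*a, -1.0F, 1.0F);  // entries outside [-1, 1] are set to the nearest border
*/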
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
...@@ -16,78 +16,162 @@ ...@@ -16,78 +16,162 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#include "../../XDevice.h" #include "../../XDevice.h"
#include "../../XTensor.h" #include "../../XTensor.h"
#include "Absolute.h" #include "Clip.h"
#include "Absolute.cuh" #include "Clip.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA #ifdef USE_CUDA
/* /*
set each entry to its absolute value (CUDA Kernel) set each entry to its clip value (CUDA Kernel)
>> a - pointer to input data array >> a - pointer to input data array
>> b - pointer to output data array >> b - pointer to output data array
>> lower - the lower border
>> upper - the upper border
>> size - size of the data array >> size - size of the data array
*/ */
__global__ __global__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size) void KernelClip(DTYPE * a, DTYPE * b, DTYPE lower, DTYPE upper, int size)
{ {
int i = blockDim.x * blockIdx.x + threadIdx.x; int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) if (i < size) {
b[i] = fabs(a[i]); if (a[i] > upper)
b[i] = upper;
else if (a[i] < lower)
b[i] = lower;
else
b[i] = a[i];
}
} }
/* /*
set each entry to its absolute value (CUDA Kernel) set each entry to its clip value with float16 data type value (CUDA Kernel)
This is for float16 computation This is for float16 computation
>> a - pointer to input data array >> a - pointer to input data array
>> b - pointer to output data array >> b - pointer to output data array
>> lower - the lower border
>> upper - the upper border
>> size - size of the data array >> size - size of the data array
*/ */
__global__ __global__
void KernelAbsolute(__half * a, __half * b, int size) void KernelClip(__half * a, __half * b, DTYPE lower, DTYPE upper, int size)
{ {
return; return;
} }
/* /*
set each entry to its absolute value set each entry to its clip value
>> a - input tensor >> a - input tensor we are processing
>> b - output tensor >> b - output tensor we are processing
>> lower - the lower border
>> upper - the upper border
*/ */
void _CudaAbsolute(const XTensor * a, XTensor * b) void _CudaClip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper)
{ {
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!"); CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->isSparse == false), "TODO!"); CheckNTErrors((a->isSparse == false), "TODO!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int gridSize[3]; int devIDBackup;
int blockSize[3]; ProtectCudaDev(a->devID, devIDBackup);
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize); if (a->dataType == DEFAULT_DTYPE) {
KernelClip << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, lower, upper, a->unitNum);
}
else if (a->dataType == X_FLOAT16) {
KernelClip << <blocks, threads >> >((__half*)a->data, (__half*)b->data, lower, upper, a->unitNum);
}
else {
ShowNTErrors("TODO!");
}
dim3 blocks(gridSize[0]); BacktoCudaDev(a->devID, devIDBackup);
dim3 threads(blockSize[0]); }
/*
clip backward computation of dE/dx (Cuda kernel)
int devIDBackup; dy/dx = 1 if lower <= x <= upper
ProtectCudaDev(a->devID, devIDBackup); 0 otherwise
if (a->dataType == DEFAULT_DTYPE) { >> dedy - dE/dy
KernelAbsolute << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum); >> dedx - dE/dx
>> y - output of the function
>> x - input of the function
>> lower - the lower border
>> upper - the upper border
>> size - size of the data array
*/
__global__
void KernelClipBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * y, DTYPE * x, DTYPE lower, DTYPE upper, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
DTYPE s = x[i];
if (s > upper || s < lower)
dedx[i] = 0;
else
dedx[i] = dedy[i];
} }
else if (a->dataType == X_FLOAT16) { }
KernelAbsolute << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
/*
backward computation (Cuda version)
dE/dx = dE/dy * dy/dx
clip: y = upper if x > upper
x if lower <= x <= upper
lower if x < lower
and dy/dx = 1 if lower <= x <= upper
0 otherwise
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lower - the lower border
>> upper - the upper border
*/
void _CudaClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper)
{
if (x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE) {
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
/* dE/dx = dE/dy * dy/dx */
KernelClipBackward <<<dim3(gridSize[0]), dim3(blockSize[0])>>>
((DTYPE*)dedy->data,
(DTYPE*)dedx->data,
(DTYPE*)y->data, (DTYPE*)x->data,
lower, upper,
x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
} }
else { else
ShowNTErrors("TODO!"); ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
} }
#endif // USE_CUDA #endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/
#ifndef __CLIP_CUH__
#define __CLIP_CUH__
#include "Clip.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set each entry to its clip value (CUDA Kernel) */
__global__
void KernelClip(DTYPE * a, DTYPE * b, DTYPE lower, DTYPE upper, int size);
/* set each entry to its clip value (CUDA Kernel) with float16 data type*/
__global__
void KernelClip(__half * a, __half * b, DTYPE lower, DTYPE upper, int size);
/* set each entry to its clip value */
void _CudaClip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper);
/* backward of Clip function (CUDA Kernel) */
__global__
void KernelClipBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * y, DTYPE * x, DTYPE lower, DTYPE upper, int size);
/* backward of Clip function */
void _CudaClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __CLIP_CUH__
\ No newline at end of file
...@@ -16,67 +16,36 @@ ...@@ -16,67 +16,36 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#ifndef __CLIP_H__
#define __CLIP_H__
#include "../../XTensor.h" #include "../../XTensor.h"
#include "../../XName.h"
#include "Log.h"
#include "Log.cuh"
#include <math.h>
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* /* set every entry to its clip value */
set every entry to its log value (do it on site) void _Clip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper);
>> a - input tensor we are processing
>> b - output tensor we are processing
*/
void _Log(const XTensor * a, XTensor * b)
{
#ifdef USE_CUDA
/* run it on GPUs */
if (a->devID >= 0) {
_CudaLog(a, b);
return;
}
#endif
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * d = (DTYPE*)a->data;
DTYPE * db = (DTYPE*)b->data;
for (int i = 0; i < a->unitNum; i++)
db[i] = (DTYPE)log(d[i]);
}
/* /*
set every entry to its log value set every entry to its clip value (do it on site)
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
>> a - the tensor we are processing
*/ */
void _LogMe(XTensor * a) void _ClipMe(XTensor * a, DTYPE lower, DTYPE upper);
{
_Log(a, a);
}
/* /*
set every entry to its log value (return a XTensor structure) set every entry to its clip value (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
>> a - input tensor we are processing
<< return - the log value of the input tensor
*/ */
XTensor Log(const XTensor & a) XTensor Clip(const XTensor & a, DTYPE lower, DTYPE upper);
{
XTensor b(&a); /*
b.SetTMP(); backward of Clip function
*/
/* call _Log function */ void _ClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper);
_Log(&a, &b);
} // namespace nts(NiuTrans.Tensor)
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_LOG); #endif // __CLIP_H__
return b;
}
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "Log.h"
#include "Log.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set each entry to its log value (CUDA Kernel)
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelLog(DTYPE * a, DTYPE * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
b[i] = log(a[i]);
}
/*
set each entry to its log value (CUDA Kernel)
This is for float16 computation
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelLog(__half * a, __half * b, int size)
{
return;
}
/*
set each entry to its log value
>> a - input tensor
>> b - output tensor
*/
void _CudaLog(const XTensor * a, XTensor * b)
{
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->isSparse == false), "TODO!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE) {
KernelLog << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
}
else if (a->dataType == X_FLOAT16) {
KernelLog << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
...@@ -110,7 +110,7 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim, ...@@ -110,7 +110,7 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim,
int cudaBlockSize[3]; int cudaBlockSize[3];
GDevs.GetCudaThread2D(input->devID, strideNum, stride * blockNum, GDevs.GetCudaThread2D(input->devID, strideNum, stride * blockNum,
MAX_INT, cudaGridSize, cudaBlockSize); MAX_INT, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[1], cudaGridSize[0]); dim3 blocks(cudaGridSize[1], cudaGridSize[0]);
dim3 threads(cudaBlockSize[1], cudaBlockSize[0]); dim3 threads(cudaBlockSize[1], cudaBlockSize[0]);
...@@ -119,9 +119,9 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim, ...@@ -119,9 +119,9 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim,
ProtectCudaDev(a->devID, devIDBackup); ProtectCudaDev(a->devID, devIDBackup);
KernelNormalize << <blocks, threads >> >((DTYPE*)input->data, (DTYPE*)output->data, KernelNormalize << <blocks, threads >> >((DTYPE*)input->data, (DTYPE*)output->data,
(DTYPE*)mean->data, (DTYPE*)var->data, (DTYPE*)mean->data, (DTYPE*)var->data,
(DTYPE*)a->data, (DTYPE*)b->data, epsilon, (DTYPE*)a->data, (DTYPE*)b->data, epsilon,
stride, strideNum, blockNum); stride, strideNum, blockNum);
BacktoCudaDev(a->devID, devIDBackup); BacktoCudaDev(a->devID, devIDBackup);
} }
......
#include <math.h>
#include "../../XName.h"
#include "Unary.h"
#include "Unary.cuh"
namespace nts{
#ifdef USE_CUDA
/* define three macros separately, specifying the respective function names */
#define _SIMPLE_UNARY_FUNCTION(_funcName, _cudaFuncName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
/* run it on GPUs */ \
if (a->devID >= 0) { \
_cudaFuncName(a, b); \
return; \
} \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
}
#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a) \
{ \
_funcName(a, a); \
}
#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a) \
{ \
XTensor b(&a); \
b.SetTMP(); \
_funcName(&a, &b); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
_SIMPLE_UNARY_FUNCTION(_Absolute, _CudaAbsolute, fabs)
_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
_SIMPLE_UNARY_FUNCTION(_Exp, _CudaExp, exp)
_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
_SIMPLE_UNARY_FUNCTION(_Log, _CudaLog, log)
_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
_SIMPLE_UNARY_FUNCTION(_Sin, _CudaSin, sin)
_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
_SIMPLE_UNARY_FUNCTION(_Cos, _CudaCos, cos)
_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
_SIMPLE_UNARY_FUNCTION(_Tan, _CudaTan, tan)
_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
_SIMPLE_UNARY_FUNCTION(_Round, _CudaRound, round)
_SIMPLE_UNARY_FUNCTION_ME(_RoundMe, _Round)
SIMPLE_UNARY_FUNCTION(Round, _Round, MATH_ROUND)
#else
/* define three macros separately, specifying the respective function names */
#define _SIMPLE_UNARY_FUNCTION(_funcName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
}
#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a) \
{ \
_funcName(a, a); \
}
#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a) \
{ \
XTensor b(&a); \
b.SetTMP(); \
_funcName(&a, &b); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
_SIMPLE_UNARY_FUNCTION(_Absolute, fabs)
_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
_SIMPLE_UNARY_FUNCTION(_Exp, exp)
_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
_SIMPLE_UNARY_FUNCTION(_Log, log)
_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
_SIMPLE_UNARY_FUNCTION(_Sin, sin)
_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
_SIMPLE_UNARY_FUNCTION(_Cos, cos)
_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
_SIMPLE_UNARY_FUNCTION(_Tan, tan)
_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
_SIMPLE_UNARY_FUNCTION(_Round, round)
_SIMPLE_UNARY_FUNCTION_ME(_RoundMe, _Round)
SIMPLE_UNARY_FUNCTION(Round, _Round, MATH_ROUND)
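/*
an added note for reference: in this CPU-only branch the three macros expand,
e.g., for Absolute, to the equivalent of

void _Absolute(const XTensor * a, XTensor * b)
{ ... db[i] = (DTYPE)fabs(d[i]); ... }

void _AbsoluteMe(XTensor * a)
{ _Absolute(a, a); }

XTensor Absolute(const XTensor &a)
{ ... _Absolute(&a, &b); XLink::MakeLink(&a, NULL, &b, MATH_ABSOLUTE); return b; }
*/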
#endif
}
\ No newline at end of file
#include <math.h>
#include "../../XDevice.h"
#include "../../XName.h"
#include "Unary.cuh"
namespace nts {
#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc) \
__global__ \
void Kernel##funcName(DTYPE * a, DTYPE * b, int size) \
{ \
int i = blockDim.x * blockIdx.x + threadIdx.x; \
\
if (i < size) \
b[i] = (DTYPE)origFunc(a[i]); \
} \
__global__ \
void Kernel##funcName(__half * a, __half * b, int size) \
{ \
return; \
} \
void _Cuda##funcName(const XTensor * a, XTensor * b) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->isSparse == false), "TODO!"); \
\
int gridSize[3]; \
int blockSize[3]; \
\
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize); \
\
dim3 blocks(gridSize[0]); \
dim3 threads(blockSize[0]); \
\
int devIDBackup; \
ProtectCudaDev(a->devID, devIDBackup); \
\
if (a->dataType == DEFAULT_DTYPE) { \
Kernel##funcName << <blocks, threads >> > \
((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum); \
} \
else if (a->dataType == X_FLOAT16) { \
Kernel##funcName << <blocks, threads >> > \
((__half*)a->data, (__half*)b->data, a->unitNum); \
} \
else { \
ShowNTErrors("TODO!"); \
} \
\
BacktoCudaDev(a->devID, devIDBackup); \
}
SIMPLE_UNARY_FUNCTION_GPU(Absolute, fabs)
SIMPLE_UNARY_FUNCTION_GPU(Exp, exp)
SIMPLE_UNARY_FUNCTION_GPU(Log, log)
SIMPLE_UNARY_FUNCTION_GPU(Sin, sin)
SIMPLE_UNARY_FUNCTION_GPU(Cos, cos)
SIMPLE_UNARY_FUNCTION_GPU(Tan, tan)
SIMPLE_UNARY_FUNCTION_GPU(Round, round)
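/*
an added note: each invocation above generates three symbols, a DTYPE kernel
(e.g., KernelExp), a __half stub kernel with the same name, and a host-side
wrapper (e.g., _CudaExp) that picks the kernel according to the data type
*/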
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __UNARY_CUH__
#define __UNARY_CUH__
#include "../../XTensor.h"
#include "Unary.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set each entry to its absolute value (CUDA Kernel) */
__global__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
__global__
void KernelAbsolute(__half * a, __half * b, int size);
/* set each entry to its absolute value */
void _CudaAbsolute(const XTensor * a, XTensor * b);
/* set each entry to its exponent value (CUDA Kernel) */
__global__
void KernelExp(DTYPE * a, DTYPE * b, int size);
/* set each entry to its exponent value (CUDA Kernel) with float16 data type*/
__global__
void KernelExp(__half * a, __half * b, int size);
/* set each entry to its exponent value */
void _CudaExp(const XTensor * a, XTensor * b);
/* set each entry to its logarithm value (CUDA Kernel) */
__global__
void KernelLog(DTYPE * a, DTYPE * b, int size);
/* set each entry to its logarithm value (CUDA Kernel) with float16 data type*/
__global__
void KernelLog(__half * a, __half * b, int size);
/* set each entry to its logarithm value */
void _CudaLog(const XTensor * a, XTensor * b);
/* set each entry to its sine value (CUDA Kernel) */
__global__
void KernelSin(DTYPE * a, DTYPE * b, int size);
/* set each entry to its sine value (CUDA Kernel) with float16 data type*/
__global__
void KernelSin(__half * a, __half * b, int size);
/* set each entry to its sine value */
void _CudaSin(const XTensor * a, XTensor * b);
/* set each entry to its cosine value (CUDA Kernel) */
__global__
void KernelCos(DTYPE * a, DTYPE * b, int size);
/* set each entry to its cosine value (CUDA Kernel) with float16 data type*/
__global__
void KernelCos(__half * a, __half * b, int size);
/* set each entry to its cosine value */
void _CudaCos(const XTensor * a, XTensor * b);
/* set each entry to its tangent value (CUDA Kernel) */
__global__
void KernelTan(DTYPE * a, DTYPE * b, int size);
/* set each entry to its tangent value (CUDA Kernel) with float16 data type*/
__global__
void KernelTan(__half * a, __half * b, int size);
/* set each entry to its tangent value */
void _CudaTan(const XTensor * a, XTensor * b);
/* set each entry to its round value (CUDA Kernel) */
__global__
void KernelRound(DTYPE * a, DTYPE * b, int size);
/* set each entry to its round value (CUDA Kernel) with float16 data type*/
__global__
void KernelRound(__half * a, __half * b, int size);
/* set each entry to its round value */
void _CudaRound(const XTensor * a, XTensor * b);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __UNARY_CUH__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __UNARY_H__
#define __UNARY_H__
#include "../../XTensor.h"
namespace nts{
/* set every entry to its absolute value */
void _Absolute(const XTensor * a, XTensor * b);
/*
set every entry to its absolute value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _AbsoluteMe(XTensor * a);
/*
set every entry to its absolute value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Absolute(const XTensor & a);
/* set every entry to its exponent value */
void _Exp(const XTensor * a, XTensor * b);
/*
set every entry to its exponent value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _ExpMe(XTensor * a);
/*
set every entry to its exponent value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Exp(const XTensor & a);
/* set every entry to its logarithm value */
void _Log(const XTensor * a, XTensor * b);
/*
set every entry to its logarithm value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _LogMe(XTensor * a);
/*
set every entry to its logarithm value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Log(const XTensor & a);
/* set every entry to its sine value */
void _Sin(const XTensor * a, XTensor * b);
/*
set every entry to its sine value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _SinMe(XTensor * a);
/*
set every entry to its sine value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Sin(const XTensor & a);
/* set every entry to its cosine value */
void _Cos(const XTensor * a, XTensor * b);
/*
set every entry to its cosine value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _CosMe(XTensor * a);
/*
set every entry to its cosine value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Cos(const XTensor & a);
/* set every entry to its tangent value */
void _Tan(const XTensor * a, XTensor * b);
/*
set every entry to its tangent value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _TanMe(XTensor * a);
/*
set every entry to its tangent value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Tan(const XTensor & a);
/* set every entry to its round value */
void _Round(const XTensor * a, XTensor * b);
/*
set every entry to its round value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _RoundMe(XTensor * a);
/*
set every entry to its round value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Round(const XTensor & a);
}
#endif //end __UNARY_H__
\ No newline at end of file
...@@ -35,24 +35,33 @@ copy a number of blocks to target positions ...@@ -35,24 +35,33 @@ copy a number of blocks to target positions
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
>> myMem - the memory pool >> myMem - the memory pool
>> devID - device id
*/ */
void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem) void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
{ {
if (myMem != NULL && myMem->devID >= 0) { if (myMem != NULL)
devID = myMem->devID;
if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
/* copy the index from host to device */ /* copy the index from host to device */
int * targetBlocksTMP = (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)); int * targetBlocksTMP = myMem != NULL ?
(int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)):
(int*)XMemAlloc(devID, blockNum * sizeof(int));
XMemCopy(targetBlocksTMP, myMem->devID, targetBlocks, -1, blockNum * sizeof(int)); XMemCopy(targetBlocksTMP, devID, targetBlocks, -1, blockNum * sizeof(int));
_CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocksTMP, myMem); _CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocksTMP, devID);
myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int)); if(myMem != NULL)
myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int));
else
XMemFree(devID, targetBlocksTMP);
#else #else
ShowNTErrors("Plesae specify USE_CUDA and recompile the code!"); ShowNTErrors("Plesae specify USE_CUDA and recompile the code!");
#endif #endif
} }
else { else {
_CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocks, myMem); _CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocks, devID);
} }
} }
...@@ -65,11 +74,12 @@ copy a number of blocks source source positions to target positions ...@@ -65,11 +74,12 @@ copy a number of blocks source source positions to target positions
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
>> myMem - the memory pool >> myMem - the memory pool
>> devID - device id
*/ */
void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID) void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
{ {
if (myMem != NULL) if (myMem != NULL)
CheckNTErrors((myMem->devID == devID), "DevIDs are different between memory pool and input devID!"); devID = myMem->devID;
if (devID >= 0) { if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
......
...@@ -27,7 +27,7 @@ ...@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* copy a number of blocks to target positions */ /* copy a number of blocks to target positions */
void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem); void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
/* copy a number of blocks from source positions to target positions */ /* copy a number of blocks from source positions to target positions */
void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID); void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
......
@@ -223,8 +223,11 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN
    int cudaGrids[3];
    int cudaBlocks[3];
    int threadNum = MIN(MAX(blockSize, blockNum), MAX_CUDA_THREAD_NUM_PER_BLOCK);

    int devIDBackup;
    ProtectCudaDev(myMem->devID, devIDBackup);

    GDevs.GetCudaThread2D(myMem->devID, threadNum, gridNum * blockNum, INT_MAX, cudaGrids, cudaBlocks);
    cudaBlocks[1] = 1;
@@ -237,39 +240,41 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN
    if (blockNum == 4) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 4, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 4, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else if (blockNum == 6) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 6, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 6, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else if (blockNum == 8) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 8, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 8, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else if (blockNum == 12) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 12, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                    ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 12, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                    ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else {
        KernelCopyBlocksInGrid<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                     ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }

    BacktoCudaDev(myMem->devID, devIDBackup);
}
#endif // USE_CUDA
...
@@ -34,29 +34,35 @@ all the data has been on the device (CPU/GPU) already.
>> blockNum - number of blocks
>> target - target data array
>> targetBlocks - target positions of the copy
>> devID - device id
*/
void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID)
{
    if (devID >= 0) {
#ifdef USE_CUDA
        _CudaCopyBlocks(source, blockSize, blockNum, target, targetBlocks, devID);
#else
ShowNTErrors("Plesae specify USE_CUDA and recompile the code!"); ShowNTErrors("Plesae specify USE_CUDA and recompile the code!");
#endif
    }
    else {
        /*
        The following code should be fine with GPUs, but too many
        kernel calls would slow down the system. We prefer to use
        one kernel to do block copy in batch (kernel fusion).
        */
        if (blockSize == sizeof(int)) {
            for (int i = 0, b = 0; i < blockNum; i++, b += blockSize) {
                *(int*)((char*)target + targetBlocks[i] * blockSize) =
                *(int*)((char*)source + b);
            }
        }
        else {
            for (int i = 0, b = 0; i < blockNum; i++, b += blockSize) {
                XMemCopy((char*)target + targetBlocks[i] * blockSize, devID,
                         (char*)source + b, devID, blockSize);
            }
        }
    }
}

} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
@@ -36,39 +36,48 @@ NOTE that this version makes more use of the 2d threads in cuda
>> target - target data array
>> targetBlocks - target positions of the copy
*/
template<class T>
__global__
void KernelCopyBlocks(T * source, int blockSize, int blockNum, T * target, int * targetBlocks)
{
    /* entry index in the block */
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    /* block index */
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    if (i >= blockSize || j >= blockNum)
        return;

    T * s = source + blockSize * j;
    T * t = target + blockSize * targetBlocks[j];

    t[i] = s[i];
}
/*
copy a number of blocks to target positions
NOTE that this version flattens the copy into 1d threads: each thread
moves exactly one element
>> source - data array (head of the blocks) to copy from
>> blockSize - size of block
>> blockNum - number of blocks
>> totalSize - total number of elements (blockSize * blockNum)
>> target - target data array
>> targetBlocks - target positions of the copy
*/
template<class T>
__global__
void KernelCopyBlocksV2(T * source, int blockSize, int blockNum, int totalSize, T * target, int * targetBlocks)
{
    /* entry index in the flattened data array */
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i >= totalSize)
        return;

    int targetBlockID = targetBlocks[i / blockSize];
    int targetOffset = i % blockSize;

    *(target + blockSize * targetBlockID + targetOffset) = source[i];
}
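/*
Host-side sketch (an editor's illustration, not code from this commit):
launching KernelCopyBlocksV2. Because the kernel is purely element-wise,
a plain 1d grid covering blockSize * blockNum elements suffices; the
256-thread block size below is an assumption, whereas the library derives
its launch shape via GDevs.GetCudaThread.
*/
template<class T>
void LaunchCopyBlocksV2Sketch(T * source, int blockSize, int blockNum, T * target, int * targetBlocks)
{
    int totalSize = blockSize * blockNum;
    int threads = 256;                                /* assumed thread-block size */
    int grids = (totalSize + threads - 1) / threads;  /* ceiling division over all elements */
    KernelCopyBlocksV2<T> <<<dim3(grids), dim3(threads)>>>
                           (source, blockSize, blockNum, totalSize, target, targetBlocks);
}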
/*
@@ -78,29 +87,42 @@ copy a number of blocks to target positions (cuda version)
>> blockNum - number of blocks
>> target - target data array
>> targetBlocks - target positions of the copy (on the device)
>> devID - device id
*/
void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID)
{
    CheckNTErrors(devID >= 0, "Wrong device to run!");

    int cudaGrids[3];
    int cudaBlocks[3];

    int devIDBackup;
    ProtectCudaDev(devID, devIDBackup);

    if (blockSize % sizeof(double) == 0) {
        int bSize = blockSize / sizeof(double);
        GDevs.GetCudaThread(devID, bSize * blockNum, cudaGrids, cudaBlocks);
        KernelCopyBlocksV2<double> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
                                    ((double*)source, bSize, blockNum, bSize * blockNum, (double*)target, targetBlocks);
        //GDevs.GetCudaThread2D(devID, bSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
        //KernelCopyBlocks<double> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
        //                          ((double*)source, bSize, blockNum, (double*)target, targetBlocks);
    }
    else if (blockSize % sizeof(float) == 0) {
        int bSize = blockSize / sizeof(float);
        GDevs.GetCudaThread(devID, bSize * blockNum, cudaGrids, cudaBlocks);
        KernelCopyBlocksV2<float> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
                                   ((float*)source, bSize, blockNum, bSize * blockNum, (float*)target, targetBlocks);
        //GDevs.GetCudaThread2D(devID, bSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
        //KernelCopyBlocks<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
        //                         ((float*)source, bSize, blockNum, (float*)target, targetBlocks);
    }
    else {
        ShowNTErrors("Unsupported block size!");
    }

    BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA

} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
@@ -28,15 +28,11 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA

/* copy a number of blocks to target positions (cuda version) */
void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID);

#endif // USE_CUDA

} // namespace nts(NiuTrans.Tensor)

#endif // __COPYBLOCKS_CUH__
\ No newline at end of file
@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

/* copy a number of blocks to target positions (on site) */
void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID);

} // namespace nts(NiuTrans.Tensor)
...
@@ -75,6 +75,9 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i
    CheckNTErrors(devID >= 0, "Wrong device to run!");
    CheckNTErrors((blockSize % sizeof(DTYPE) == 0), "Unsupported block size!");

    int devIDBackup;
    ProtectCudaDev(devID, devIDBackup);

    /* copy the index to the GPU memory */
    int * sourceBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
    int * targetBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
@@ -97,6 +100,8 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i
        XMemFree(devID, sourceBlocksTMP);
        XMemFree(devID, targetBlocksTMP);
    }

    BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA
...
@@ -37,8 +37,8 @@ copy indexed sub-tensors
>> indexSize - length of srcIndex (and tgtIndex)
>> tgtIndex - index of the target sub-tensors
>> copyNum - number of the sub-tensors we copy for each source index,
             e.g., for srcIndex = [1,4] and copyNum = 2,
             we actually copy the source sub-tensors 1, 2, 4, 5
*/
void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
{
@@ -73,17 +73,23 @@ void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int i
    int * realSrcIndex = new int[realIndexSize];
    int * realTgtIndex = new int[realIndexSize];
    for (int i = 0; i < indexOffsetNum; i++) {
        int base = i * indexSize * copyNum;
        int baseSrc = i * leadDimSizeSrc;
        int baseTgt = i * leadDimSizeTgt;
        for (int j = 0; j < indexSize; j++) {
            int offset = base + j * copyNum;
            int * rsi = realSrcIndex + offset;
            int * rti = realTgtIndex + offset;
            for (int k = 0; k < copyNum; k++) {
                rsi[k] = baseSrc + srcIndex[j] + k;
                rti[k] = baseTgt + tgtIndex[j] + k;
            }
        }
    }

    for (int i = 0; i < indexSize; i++) {
        CheckNTErrors((srcIndex[i] < blockNumSrc), "Index is out of scope!");
        CheckNTErrors((tgtIndex[i] < blockNumTgt), "Index is out of scope!");
    }

    _CopyBlocks(s->data, blockSizeSrc * s->unitSize, realSrcIndex, realIndexSize, t->data, realTgtIndex, s->mem, s->devID);
...
@@ -20,6 +20,7 @@
*/

#include "../../XName.h"
#include "../../XUtility.h"
#include "CopyValues.h"
#include "CopyValues.cuh"
@@ -35,14 +36,14 @@ copy s to t
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream)
{
    CheckNTErrors((s != NULL && t != NULL), "The input tensor and output tensor must be nonempty!");
    CheckNTErrors((s->data != NULL), "Cannot copy an empty data array!");
    CheckNTErrors((t->data != NULL), "Cannot copy to an empty data array!");
    CheckNTErrors((s->unitNum == t->unitNum), "Unmatched data item number!");

    if ((s->dataType == X_FLOAT16 && t->dataType == X_FLOAT) ||
        (s->dataType == X_FLOAT && t->dataType == X_FLOAT16)) {
        CheckNTErrors(((s->devID < 0 && t->devID < 0) || s->devID == t->devID),
                      "The code must be run on the same device!");
        CheckNTErrors((s->isSparse || t->isSparse), "TODO!");
        ConvertDataType(s->devID, s->data, s->dataType, t->data, t->dataType, s->unitNum);
    }
@@ -69,6 +70,34 @@ void _CopyValues(const XTensor * s, XTensor * t, XStream * stream)
}

/*
copy a segment of s to t
>> s - source
>> sBeg - beginning of the segment
>> sLen - length of the segment
>> t - target
>> tBeg - beginning of the segment on the target side
>> stream - the stream for creating the job pipeline
*/
void _CopyValues(const XTensor * s, const int sBeg, const int sLen, XTensor * t, const int tBeg, XStream * stream)
{
    CheckNTErrors(s != NULL && t != NULL, "The input tensor and output tensor must be nonempty!");
    CheckNTErrors(s->data != NULL && t->data != NULL, "Cannot copy an empty data array!");
    CheckNTErrors(s->unitSize == t->unitSize, "The input tensors must be of the same unit size!");
    CheckNTErrors(s->unitNum > sBeg && sBeg >= 0 && sLen <= s->unitNum, "Wrong segment on the source side");
    CheckNTErrors(t->unitNum > tBeg && tBeg >= 0, "Wrong segment on the target side");
if (!s->isSparse && !t->isSparse) {
XMemCopy((char*)t->data + tBeg * t->unitSize, t->devID,
(char*)s->data + sBeg * s->unitSize, s->devID,
s->unitSize * sLen);
}
else {
ShowNTErrors("TODO!");
}
}
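/*
Usage sketch (an editor's illustration, not code from this commit): copy
64 units of tensor a, starting at unit 128, into tensor b at unit offset 0
with the new segment overload. Both tensors are assumed dense, of the same
unit size, and with enough room on the target side.
*/
void CopySegmentExample(XTensor * a, XTensor * b)
{
    _CopyValues(a, 128, 64, b, 0);
}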
/*
copy s to t (return a XTensor structure) copy s to t (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
......
@@ -29,6 +29,9 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* copy s to t */
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL);
/* copy a segment of s to t */
void _CopyValues(const XTensor * s, const int sBeg, const int sLen, XTensor * t, const int tBeg, XStream * stream = NULL);
/*
copy s to t (return a XTensor structure)
make a new tensor to keep the result and return it
...
@@ -29,6 +29,71 @@ namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
warp-level max reduction of float data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
float shflDownReduceMax(float input)
{
float output;
asm volatile(
"{"
".reg .f32 r0;"
".reg .pred p;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"setp.lt.f32 p, %1, r0; "
"@p mov.f32 %1,r0;"
"mov.f32 %0,%1;"
"}"
: "=f"(output) : "f"(input));
return output;
}
/*
warp-level max reduction of int data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
int shflDownReduceMax(int input)
{
int output;
asm volatile(
"{"
".reg .s32 r0;"
".reg .pred p;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"setp.lt.s32 p, %1, r0; "
"@p mov.s32 %1,r0;"
"mov.s32 %0,%1;"
"}"
: "=r"(output) : "r"(input));
return output;
}
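/*
Equivalent sketch (an editor's illustration, not code from this commit):
the same warp-level max reduction written with the __shfl_down_sync
intrinsic (CUDA 9+), which compiles to the shfl.down instructions
hand-written above; a full 32-thread warp is assumed to participate.
*/
__device__ __forceinline__
float shflDownReduceMaxIntrinsic(float value)
{
    /* halve the active distance each step: 16, 8, 4, 2, 1 */
    for (int offset = 16; offset > 0; offset >>= 1)
        value = fmaxf(value, __shfl_down_sync(0xffffffff, value, offset));
    return value;
}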
/*
reduce a tensor to another that keeps the max value along a dimension - slow version
Given a block of data, we go over each dimension i in the stride and we have
@@ -191,25 +256,19 @@ void KernelReduceMaxFast(DTYPE * input, DTYPE * output,
    DTYPE value = j < strideNum ? inputData[j * stride + iOffset] : FLOAT_MIN;
    DTYPE value2 = j + blockDim.y < strideNum ? inputData[(j + blockDim.y) * stride + iOffset] : FLOAT_MIN;

    value = MAX(value, value2);
    value = shflDownReduceMax(value);
    if ((tid & 0x1f) == 0) { data[tid / 32] = value; }

    __syncthreads();

    if (tid < 32) {
        if (tid < blockDim.y / 32)
            value = data[tid];
        else
            value = FLOAT_MIN;
        value = shflDownReduceMax(value);
        if (tid == 0 && blockIdx.y < reducedStrideNum)
            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = value;
    }
}
/*
@@ -326,6 +385,105 @@ void KernelReduceMaxSimpleFast(DTYPE * input, DTYPE * output,
    op[offset] = max;
}
/*
allocate the number of warps according to the GPU's SM count
*/
inline void continuousStorageThreadAllocation(dim3& grid, dim3& block, long long vectorNum, int vectorSize)
{
int warpNum = 4;
if (vectorNum < 20 * 8){
warpNum = 8;
if (vectorNum < 20 * 4){
warpNum = 16;
if (vectorNum < 20 * 2)
warpNum = 32;
}
}
int minWarpNum = vectorSize / 32;
if (vectorSize % 32 != 0) minWarpNum++;
warpNum = min(warpNum, minWarpNum);
grid.x = vectorNum;
grid.y = 1;
grid.z = 1;
block.x = 1;
block.y = warpNum * 32;
block.z = 1;
}
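/*
Worked example (an editor's illustration): for vectorNum = 100 vectors of
vectorSize = 1000 elements, vectorNum < 20 * 8 sets warpNum = 8 and the
inner tests do not fire; minWarpNum = ceil(1000 / 32) = 32 does not lower
it. The launch is therefore grid = (100, 1, 1) and block = (1, 256, 1):
one thread block per vector, eight warps each.
*/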
/*
fold threads.x into blocks.x so that threads.y can be padded up to a full
warp and the warp-level primitives above can be used
*/
inline void adjustThreadForUseWarpOptimization(dim3& blocks, dim3& threads)
{
if (threads.x > 1) {
blocks.x *= threads.x;
threads.x = 1;
}
if (threads.y < 32)
threads.y = 32;
}
/*
in some cases we use fewer blocks to improve efficiency
*/
__global__
void KernelReduceMaxOpLessBlocks(DTYPE * input, DTYPE * output, int strideNum, int blockNum)
{
int idx = threadIdx.x % 32;
int idy = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
int startIndex = idy * strideNum;
DTYPE threadMax = FLOAT_MIN;
for (int i = idx; i < strideNum; i += 32) {
threadMax = max(input[startIndex + i], threadMax);
}
threadMax = shflDownReduceMax(threadMax);
if (idx == 0)
output[idy] = threadMax;
}
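/*
Mapping note (an editor's illustration): threads are grouped into 32-lane
warps; warp idy owns row idy of the input (strideNum elements), each lane
strides through the row 32 elements at a time, and lane 0 writes the
warp-reduced max to output[idy]. The <<<blockNum / 4, 128>>> launch used
below packs four warps per block, so it assumes blockNum is a multiple of 4.
*/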
/*
max reduction using the PTX warp shuffle above
*/
__global__
void KernelReduceMaxOp(DTYPE * input, DTYPE * output,int stride, int strideNum,
int reducedStrideNum,int blockSize, int blockNum)
{
__shared__ DTYPE iData[MAX_CUDA_THREAD_NUM_PER_BLOCK / 32];
unsigned int tid = threadIdx.y;
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= stride * blockNum)
return;
/* first level reduction */
int k = i / stride;
int iOffset = i % stride;
DTYPE threadMax = FLOAT_MIN;
DTYPE * data = iData + threadIdx.x * blockDim.y;
DTYPE * inputData = input + k * blockSize;
for (int it = j; it < strideNum; it += blockDim.y){
threadMax = max(inputData[it * stride + iOffset], threadMax);
}
__syncthreads();
threadMax = shflDownReduceMax(threadMax);
if ((tid & 0x1f) == 0) { data[tid / 32] = threadMax; }
__syncthreads();
/* use one warp to reduce remaining data */
if (tid < 32){
if (tid < blockDim.y / 32)
threadMax = data[tid];
        else threadMax = FLOAT_MIN; /* pad with the identity of max, not 0 */
threadMax = shflDownReduceMax(threadMax);
if (tid == 0 && blockIdx.y < reducedStrideNum)
output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = threadMax;
}
}
/*
get the max-valued items along a dimension of the tensor (cuda version).
For a 1-dimensional data array a,
@@ -382,130 +540,147 @@ void _CudaReduceMax(const XTensor * input, XTensor * output, int dim)
    int devIDBackup;
    ProtectCudaDev(input->devID, devIDBackup);
    if (stride == 1 && blockNum >= 10) {
        dim3 grids;
        dim3 blocks;
        continuousStorageThreadAllocation(grids, blocks, (long long)blockNum, strideNum);
        if (blocks.y > 128) {
            KernelReduceMaxOp <<<grids, blocks>>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, grids.y, blockSize, blockNum);
        }
        else {
            KernelReduceMaxOpLessBlocks <<<blockNum / 4, 128>>> ((DTYPE *)input->data, (DTYPE*)output->data, strideNum, blockNum);
        }
    }
    else {
        do {
            if (input->dataType == DEFAULT_DTYPE) {
                DTYPE * iData = NULL;
                DTYPE * oData = NULL;
                if (iter == 0) {
                    iData = (DTYPE*)input->data;
                    oData = buf1;
                }
                else if (iter % 2 == 1) {
                    iData = buf1;
                    oData = buf2;
                }
                else {
                    iData = buf2;
                    oData = buf1;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum < 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    KernelReduceMax <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
            }
            else if (input->dataType == X_FLOAT16) {
                __half * buf1ft16 = (__half *)buf1;
                __half * buf2ft16 = (__half *)buf2;
                __half * iData = NULL;
                __half * oData = NULL;
                if (iter == 0) {
                    iData = (__half*)input->data;
                    oData = buf1ft16;
                }
                else if (iter % 2 == 1) {
                    iData = buf1ft16;
                    oData = buf2ft16;
                }
                else {
                    iData = buf2ft16;
                    oData = buf1ft16;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum < 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    KernelReduceMax <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
            }

            strideNum = cudaGridSize[0];
            blockSize = cudaGridSize[0];

            iter++;
        } while (strideNum > 1);
    }

    BacktoCudaDev(input->devID, devIDBackup);
...
@@ -27,6 +27,57 @@ namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
warp-level sum reduction of float data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
float shflDownReduceSum(float input)
{
float output;
asm volatile(
"{"
".reg .f32 r0;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"add.f32 %0, r0, %1;"
"}"
: "=f"(output) : "f"(input));
return output;
}
/*
warp-level sum reduction of int data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
int shflDownReduceSum(int input)
{
int output;
asm volatile(
"{"
".reg .s32 r0;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"add.s32 %0, r0, %1;"
"}"
: "=r"(output) : "r"(input));
return output;
}
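/*
Equivalent sketch (an editor's illustration, not code from this commit):
the same warp-level sum written with the __shfl_down_sync intrinsic
(CUDA 9+); the PTX above simply unrolls these five shuffle steps by hand,
and a full 32-thread warp is assumed to participate.
*/
__device__ __forceinline__
float shflDownReduceSumIntrinsic(float value)
{
    /* halve the active distance each step: 16, 8, 4, 2, 1 */
    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_down_sync(0xffffffff, value, offset);
    return value;
}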
/*
reduce a tensor to another that keeps the sum along a dimension - slow version
Given a block of data, we go over each dimension i in the stride and we have
@@ -96,7 +147,6 @@ void KernelReduceSum(DTYPE * input, DTYPE * output,
        __syncthreads();
    }
    /* write result for this block to the output array */
    if (threadIdx.y == 0 && blockIdx.y < reducedStrideNum)
        output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = iData[threadIdx.x * blockDim.y];
@@ -276,25 +326,19 @@ void KernelReduceSumFast(DTYPE * input, DTYPE * output,
        value2 = exp(value2);
    }

    value = value + value2;

    __syncthreads();

    value = shflDownReduceSum(value);
    if ((tid & 0x1f) == 0) { data[tid / 32] = value; }
    __syncthreads();

    if (tid < 32) {
        if (tid < blockDim.y / 32)
            value = data[tid];
        else
            value = 0;
        value = shflDownReduceSum(value);
        if (tid == 0 && blockIdx.y < reducedStrideNum)
            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = value;
    }
}
/*
@@ -430,6 +474,174 @@ void KernelReduceSumFast(__half * input, __half * output,
#endif
}
/*
if the data storage is discontinuous, use this way to reduce
*/
__global__
void KernelReduceSumDiscontinuousStorage(DTYPE * input, DTYPE * output, int stride,
                                         int strideNum, int blockNum, DTYPE * shift, DTYPE power, bool isExp)
{
    //int idx = blockIdx.x * blockDim.x + threadIdx.x;
    //int endIndex = (idx+1) * strideNum;
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    /* the grid is rounded up, so threads beyond the last output element must exit */
    if (idx >= stride * blockNum)
        return;

    int blockIndex = idx / stride;
    int offsetInBlock = idx % stride;

    /* note that shift, power and isExp are not applied in this path */
    DTYPE ans = 0;
#pragma unroll
    for (int i = stride * strideNum * blockIndex + offsetInBlock;
         i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum;
         i += stride) {
        ans += input[i];
    }
    output[idx] = ans;
}
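/*
Indexing example (an editor's illustration): with stride = 4 and
strideNum = 3, thread idx = 6 gets blockIndex = 1 and offsetInBlock = 2,
so it sums input elements 14, 18 and 22 (one column of the second block)
into output[6]. Neighbouring threads read neighbouring addresses on every
iteration, so the global loads are coalesced.
*/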
__global__
void KernelReduceSumOp(DTYPE * input, DTYPE * output,
int stride, int strideNum, int reducedStrideNum,
int blockSize, int blockNum,
DTYPE * shift, DTYPE power, bool isExp)
{
__shared__ DTYPE iData[MAX_CUDA_THREAD_NUM_PER_BLOCK / 32];
__shared__ DTYPE bias[MAX_CUDA_THREAD_NUM_PER_BLOCK];
unsigned int tid = threadIdx.y;
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= stride * blockNum)
return;
if (threadIdx.y == 0)
bias[threadIdx.x] = shift != NULL ? shift[i] : 0;
__syncthreads();
/* first level reduction */
int k = i / stride;
int iOffset = i % stride;
DTYPE threadSum = 0;
DTYPE * data = iData + threadIdx.x * blockDim.y;
DTYPE * inputData = input + k * blockSize;
for (int it = j; it < strideNum; it += blockDim.y){
DTYPE value = inputData[it * stride + iOffset] - bias[threadIdx.x];
if (power != (DTYPE)1.0) {
if (power == (DTYPE)2.0) {
value = value * value;
}
else if (power == (DTYPE)0.5) {
value = sqrt(value);
}
else {
value = pow(value, power);
}
}
if (isExp) value = exp(value);
threadSum += value;
}
__syncthreads();
threadSum = shflDownReduceSum(threadSum);
if ((tid & 0x1f) == 0) { data[tid / 32] = threadSum; }
__syncthreads();
if (tid < 32){
if (tid < blockDim.y / 32)
threadSum = data[tid];
else threadSum = 0;
threadSum = shflDownReduceSum(threadSum);
if (tid == 0 && blockIdx.y < reducedStrideNum)
output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = threadSum;
}
}
__global__
void KernelReduceSumOpLessBlocks(DTYPE * input, DTYPE * output,
int strideNum, int blockNum,
DTYPE * shift, DTYPE power, bool isExp)
{
__shared__ DTYPE bias[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int idx = threadIdx.x % 32;
int idy = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
if (idx == 0)
bias[threadIdx.x / 32] = shift != NULL ? shift[idy] : 0;
int startIndex = idy * strideNum;
DTYPE threadSum = 0;
for (int i = idx; i < strideNum; i += 32) {
DTYPE value = input[startIndex + i] - bias[threadIdx.x / 32];
if (power != (DTYPE)1.0) {
if (power == (DTYPE)2.0) {
value = value * value;
}
else if (power == (DTYPE)0.5) {
value = sqrt(value);
}
else {
value = pow(value, power);
}
}
if (isExp) value = exp(value);
threadSum += value;
}
threadSum = shflDownReduceSum(threadSum);
if (idx == 0)
output[idy] = threadSum;
}
/*
allocate the number of warps according to the GPU's SM count
*/
inline void continuousStorageThreadAllocation(dim3& grid, dim3& block, long long vectorNum, int vectorSize)
{
int warpNum = 4;
if (vectorNum < 20 * 8) {
warpNum = 8;
if (vectorNum < 20 * 4) {
warpNum = 16;
if (vectorNum < 20 * 2)
warpNum = 32;
}
}
int minWarpNum = vectorSize / 32;
if (vectorSize % 32 != 0) minWarpNum++;
warpNum = min(warpNum, minWarpNum);
grid.x = vectorNum;
grid.y = 1;
grid.z = 1;
block.x = 1;
block.y = warpNum * 32;
block.z = 1;
}
/*
in this situation block.x * grid.x gives one thread per output element, so
global memory reads are coalesced
*/
inline void discontinuousStorageNoShareMemThreadAllocation(dim3& grid, dim3& block, int stride, int blockNum)
{
block.x = 512;
block.y = 1;
if ((stride * blockNum) % 512 == 0)
grid.x = (stride * blockNum) / 512;
else
grid.x = (stride * blockNum) / 512 + 1;
grid.y = 1;
}
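/*
Equivalently (an editor's note): grid.x is the ceiling division
(stride * blockNum + 511) / 512, i.e. just enough 512-thread blocks to
give one thread per output element.
*/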
/*
fold threads.x into blocks.x so that threads.y can be padded up to a full
warp and the warp-level primitives above can be used
*/
inline void adjustThreadForUseWarpOptimization(dim3& blocks, dim3& threads)
{
if (threads.x > 1){
blocks.x *= threads.x;
threads.x = 1;
}
    if (threads.y < 32)
threads.y = 32;
}
/*
sum the items along a dimension of the tensor (cuda version).
For a 1-dimensional data array a,
@@ -495,137 +707,158 @@ void _CudaReduceSum(const XTensor * input, XTensor * output, int dim, const XTen
    int devIDBackup;
    ProtectCudaDev(input->devID, devIDBackup);
    if (stride == 1 && blockNum >= 10) {
        dim3 grids;
        dim3 blocks;
        continuousStorageThreadAllocation(grids, blocks, (long long)blockNum, strideNum);
        if (blocks.y > 128)
            KernelReduceSumOp <<<grids, blocks>>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, grids.y, blockSize, blockNum, sp, power, isExp);
        else
            KernelReduceSumOpLessBlocks <<<blockNum / 4, 128>>> ((DTYPE *)input->data, (DTYPE*)output->data, strideNum, blockNum, sp, power, isExp);
    }
    else if (stride != 1 && stride * blockNum > 4096) {
        //GDevs->GetGridAndBlockSize2D(devID, stride * blockNum, strideNum, MAX_INT, cudaGridSize, cudaBlockSize);
        //unsigned int* goutput = (unsigned int *)input->data;
        //convert2uintV2 <<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>> ((float*)input->data, goutput, stride, strideNum, blockNum, strideNum * blockNum * stride);
        dim3 grid, block;
        discontinuousStorageNoShareMemThreadAllocation(grid, block, stride, blockNum);
        KernelReduceSumDiscontinuousStorage <<<grid, block>>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, blockNum, sp, power, isExp);
    }
    else {
        do {
            if (input->dataType == DEFAULT_DTYPE) {
                DTYPE * iData = NULL;
                DTYPE * oData = NULL;
                if (iter == 0) {
                    iData = (DTYPE*)input->data;
                    oData = buf1;
                }
                else if (iter % 2 == 1) {
                    iData = buf1;
                    oData = buf2;
                }
                else {
                    iData = buf2;
                    oData = buf1;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum <= 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    KernelReduceSum <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
            }
            else if (input->dataType == X_FLOAT16) {
                __half * buf1ft16 = (__half *)buf1;
                __half * buf2ft16 = (__half *)buf2;
                __half * spft16 = (__half *)sp;
                unsigned short power2 = FloatToFloat16(power);
                __half * powerft16p = (__half*)&power2;
                __half * iData = NULL;
                __half * oData = NULL;
                if (iter == 0) {
                    iData = (__half*)input->data;
                    oData = buf1ft16;
                }
                else if (iter % 2 == 1) {
                    iData = buf1ft16;
                    oData = buf2ft16;
                }
                else {
                    iData = buf2ft16;
                    oData = buf1ft16;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum < 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    KernelReduceSum <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
            }

            strideNum = cudaGridSize[0];
            blockSize = cudaGridSize[0];
            sp = NULL;
            power = (DTYPE)1.0;
            isExp = false;

            iter++;
        } while (strideNum > 1);
    }

    BacktoCudaDev(input->devID, devIDBackup);

    if (mem != NULL)
......
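/* Note (editor's sketch, illustration only -- not part of this commit): the
   do-while loop above implements a multi-round reduction. Each kernel launch
   folds "strideNum" items per vector into "cudaGridSize[0]" partial results,
   which become the strideNum of the next round (ping-ponging between buf1 and
   buf2) until one value per vector remains. A CPU analogue of that round
   structure, with 256 standing in for the per-round launch width: */
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> data(100000, 1.0f);
    const int roundSize = 256;  /* items folded together per round */
    while (data.size() > 1) {
        std::vector<float> next((data.size() + roundSize - 1) / roundSize, 0.0f);
        for (size_t i = 0; i < data.size(); ++i)
            next[i / roundSize] += data[i];   /* one "block" emits one partial sum */
        data.swap(next);                      /* partial sums feed the next round */
    }
    printf("sum = %f\n", data[0]);            /* prints 100000.000000 */
    return 0;
}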
@@ -33,14 +33,14 @@ set target data block index for the data movement in merge
>> splitSizeInGrid - size of each data array to merge
>> gridSize - number of blocks in a grid (here a grid is a higher-level organization upon blocks)
>> gridNum - number of grids
->> mem - the memory pool
+>> devID - device id
*/
void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge,
-                          int splitSizeInGrid, int gridSize, int gridNum, XMem * mem)
+                          int splitSizeInGrid, int gridSize, int gridNum, int devID)
{
-    if (mem != NULL && mem->devID >= 0) {
+    if (devID >= 0) {
#ifdef USE_CUDA
-        _CudaMakeMergeBlockIndex(mem->devID, blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum);
+        _CudaMakeMergeBlockIndex(devID, blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum);
#else
        ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
......
@@ -28,7 +28,7 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* set target data block index for the data movement in merge */
void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge,
-                          int splitSizeInGrid, int gridSize, int gridNum, XMem * mem);
+                          int splitSizeInGrid, int gridSize, int gridNum, int devID);

} // namespace nts(NiuTrans.Tensor)
......
@@ -31,13 +31,13 @@ set target data block index for the data movement in split
>> splitNum - number of splits
>> blockSplitSize - size of the split block
>> blockNum - number of data blocks
->> mem - the memory pool
+>> devID - device id
*/
-void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, XMem * mem)
+void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, int devID)
{
-    if (mem != NULL && mem->devID >= 0) {
+    if (devID >= 0) {
#ifdef USE_CUDA
-        _CudaMakeSplitBlockIndex(mem->devID, blockIndex, splitNum, blockSplitSize, blockNum);
+        _CudaMakeSplitBlockIndex(devID, blockIndex, splitNum, blockSplitSize, blockNum);
#else
        ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
......
@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

/* set target data block index for the data movement in split */
-void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, XMem * mem);
+void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, int devID);

} // namespace nts(NiuTrans.Tensor)
......
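/* Note (editor's): the four changes above share one pattern -- the XMem* parameter
   becomes a plain device id, so callers without a memory pool can still route the
   work to a GPU. Hypothetical before/after call sites built from this diff:

       _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, mem);      // old: device read from mem->devID
       _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, s->devID); // new: device passed explicitly

   With the old signature, mem == NULL forced the CPU path even for GPU tensors. */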
@@ -42,10 +42,13 @@ e.g., (N/3, M, 3) -> (N, M)
*/
void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
{
+    if(leadingDim < 0)
+        leadingDim = 0;
+
    int whereToMergeRDI = s->order - whereToMerge - 1;
    int leadingDimRDI = s->order - leadingDim - 1;
    if (leadingDimRDI < 0)
        leadingDimRDI = s->order - 1;

    CheckNTErrors((s != NULL && t != NULL), "Invalid tensors!");
    CheckNTErrors((s->devID == t->devID || (s->devID < 0 && t->devID < 0)),
@@ -60,8 +63,12 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
            CheckNTErrors((t->dimSizeRDI[i] == s->dimSizeRDI[i] * s->dimSizeRDI[leadingDimRDI]),
                          "Unmatched tensor sizes!");
        }
+        else if (i < leadingDimRDI){
+            CheckNTErrors((s->dimSizeRDI[i] == t->dimSizeRDI[i]),
+                          "Unmatched tensor sizes!");
+        }
        else if (i > leadingDimRDI) {
-            CheckNTErrors((s->dimSizeRDI[i - 1] == t->dimSizeRDI[i]),
+            CheckNTErrors((s->dimSizeRDI[i] == t->dimSizeRDI[i - 1]),
                          "Unmatched tensor sizes!");
        }
    }
@@ -119,28 +126,24 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
        int realBlockSize = blockSize * t->unitSize;

        int * blockIndex = (int*)(mem != NULL ?
                                  mem->AllocBuf(mem->devID, blockNum * gridNum * sizeof(int)) :
-                                 XMemAlloc(mem->devID, blockNum * gridNum * sizeof(int)));
+                                 XMemAlloc(s->devID, blockNum * gridNum * sizeof(int)));

-        _MakeMergeBlockIndex(blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum, mem);
+        _MakeMergeBlockIndex(blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum, s->devID);

-        _CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, mem);
+        _CopyBlocksOnSite(s->data, realBlockSize, blockNum * gridNum, dataTMP, blockIndex, s->devID);

        if (mem != NULL)
            mem->ReleaseBuf(mem->devID, blockNum * gridNum * sizeof(int));
        else
-            XMemFree(mem->devID, blockIndex);
+            XMemFree(s->devID, blockIndex);

-        /* copy from tmp to target */
-        XMemCopy(t->data, t->devID, dataTMP, s->devID, size);
-
        if (!isOnSameDevice) {
            XMemCopy(t->data, t->devID, dataTMP, s->devID, size);

            if (mem != NULL)
                mem->ReleaseBuf(mem->devID, size);
            else
-                XMemFree(mem->devID, dataTMP);
+                XMemFree(s->devID, dataTMP);
        }
    }
}
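/* Note (editor's worked example, illustration only): for the header comment's case
   (N/3, M, 3) -> (N, M) with whereToMerge = 0 and the last axis as the leading one,
   _Merge stacks the three (N/3) x M slices along dimension 0. A CPU reference under
   the assumption of row-major storage with the last dimension contiguous: */
#include <vector>

std::vector<float> mergeRef(const std::vector<float> & s, int n3, int m, int c)
{
    std::vector<float> t((size_t)c * n3 * m);       /* target shape: (c * n3, m) */
    for (int k = 0; k < c; k++)                     /* slice k of the leading axis ... */
        for (int i = 0; i < n3; i++)
            for (int j = 0; j < m; j++)             /* ... becomes row block k of t */
                t[((size_t)k * n3 + i) * m + j] = s[((size_t)i * m + j) * c + k];
    return t;
}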
@@ -163,7 +166,7 @@ XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim)
    CheckNTErrors(leadingDim < whereToMerge, "Invalid leading dimension!");

    if (leadingDim < 0)
        leadingDim = 0;

    int order = s.order - 1;
    int * dimSize = new int[order];
@@ -205,7 +208,7 @@ merge small tensors into a big tensor
*/
void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
{
    CheckNTErrors((smalls != NULL), "Invalid list!");
    CheckNTErrors((smalls->count > 0), "Empty list!");

    bool uniform = true;
@@ -233,7 +236,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
    int mergedNum = smalls->count;

    XTensor * s0 = (XTensor*)smalls->GetItem(0);
    int whereToMergeRDI = s0->order - whereToMerge - 1;
    for (int i = 0; i < s0->order; i++) {
        if (i <= whereToMergeRDI)
            blockSize *= s0->dimSizeRDI[i];
@@ -268,10 +271,10 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
    }
    /* merging with fewer kernel/API calls??? (I'm not sure about it!! may remove this later) */
    else {
-        int* dimSizeTMP = new int[MAX_TENSOR_DIM_NUM];
-        for (int i = 0; i < MAX_TENSOR_DIM_NUM; i++)
-            dimSizeTMP[i] = -smallsItem0->dimSizeRDI[i];
-        dimSizeTMP[smallsItem0->order] = -mergeNum;
+        int* dimSizeTMP = new int[smallsItem0->order + 1];
+        for (int i = 0; i < smallsItem0->order; i++)
+            dimSizeTMP[i + 1] = -smallsItem0->dimSize[i];
+        dimSizeTMP[0] = -mergeNum;

        XMem * mem = smallsItem0->mem;
        XTensor * tensorTMP = new XTensor(smallsItem0->order + 1, dimSizeTMP,
@@ -283,7 +286,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
        if (uniform)
            dataTMP = smallsItem0->data;
        else
-            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(mem->devID, size);
+            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(big->devID, size);

        tensorTMP->data = dataTMP;
@@ -295,18 +298,17 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
        }
    }

-    _Merge(tensorTMP, big, whereToMerge);
+    _Merge(tensorTMP, big, whereToMerge + 1);

    delete[] dimSizeTMP;

    tensorTMP->data = NULL;
-    dataTMP = NULL;
    delete tensorTMP;

    if ((!uniform) && (mem != NULL))
        mem->ReleaseBuf(mem->devID, size);
    else
-        XMemFree(mem->devID, dataTMP);
+        XMemFree(big->devID, dataTMP);
    }
}
......
@@ -109,6 +109,9 @@ void _CudaMergeBlockLists(const XList * sourceList, int * blockSizes, int blockNum, ...)
    CheckNTErrors((maxBlockSize % sizeof(DTYPE) == 0), "Unsupported block size!");

    realMaxBlockSize = maxBlockSize / sizeof(DTYPE);

+    int devIDBackup;
+    ProtectCudaDev(myMem->devID, devIDBackup);
+
    int cudaGridSizes[3];
    int cudaBlockSizes[3];
@@ -135,6 +138,8 @@ void _CudaMergeBlockLists(const XList * sourceList, int * blockSizes, int blockNum, ...)
    delete[] targetArrays;
    delete[] sizes;
    delete[] offsets;
+
+    BacktoCudaDev(myMem->devID, devIDBackup);
}
#endif // USE_CUDA
......
@@ -24,6 +24,7 @@
#include "MakeSplitBlockIndex.h"
#include "../../XName.h"
#include "../../XTensor.h"
+#include "../../XDevice.h"
#include "../../XUtility.h"
#include "../movement/CopyBlocksOnSite.h"
@@ -88,10 +89,33 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
        int n = blockNum / splitNum;
        int sStep = blockSize * s->unitSize;
        int tStep = n * tPitch;
-        for (int k = 0; k < splitNum; k++) {
-            XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
-                       (char*)s->data + k * sStep, sPitch, s->devID,
-                       mSize, n);
-        }
+        if(t->devID < 0){
+            for (int k = 0; k < splitNum; k++) {
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)s->data + k * sStep, sPitch, s->devID,
+                           mSize, n);
+            }
+        }
+        else{
+#ifdef USE_CUDA
+#ifdef STREAMED_MEMCPOPY
+            XStream * stream = GDevs.GPUs[t->devID].stream;
+            for (int k = 0; k < splitNum; k++) {
+                XMemCopy2DAsync((char*)t->data + k * tStep, tPitch, t->devID,
+                                (char*)s->data + k * sStep, sPitch, s->devID,
+                                mSize, n, stream);
+            }
+            stream->StreamSynchronize();
+#else
+            for (int k = 0; k < splitNum; k++) {
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)s->data + k * sStep, sPitch, s->devID,
+                           mSize, n);
+            }
+#endif
+#else
+            ShowNTErrors("Please specify USE_CUDA and recompile the code!");
+#endif
+        }
    }
}
else {
@@ -108,17 +132,17 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
        int blockSplitSize = blockNum / splitNum;

        int * blockIndex = (int*)(mem != NULL ?
                                  mem->AllocBuf(mem->devID, blockNum * sizeof(int)) :
-                                 XMemAlloc(mem->devID, blockNum * sizeof(int)));
+                                 XMemAlloc(s->devID, blockNum * sizeof(int)));

-        _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, mem);
+        _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, s->devID);

-        _CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, mem);
+        _CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, s->devID);

        if (mem != NULL)
            mem->ReleaseBuf(mem->devID, blockNum * sizeof(int));
        else
-            XMemFree(mem->devID, blockIndex);
+            XMemFree(s->devID, blockIndex);

        /* copy from tmp to target */
        if (!isOnSameDevice) {
@@ -127,7 +151,7 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
            if (mem != NULL)
                mem->ReleaseBuf(mem->devID, size);
            else
-                XMemFree(mem->devID, dataTMP);
+                XMemFree(s->devID, dataTMP);
        }
    }
}
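/* Note (editor's sketch, illustration only): the STREAMED_MEMCPOPY branch above
   enqueues all per-split 2D copies on one stream and waits once, instead of
   issuing splitNum blocking copies. A standalone CUDA analogue using the raw
   runtime API rather than the XMemCopy2D/XMemCopy2DAsync wrappers: */
#include <cuda_runtime.h>

void splitCopyAsync(char * dst, size_t dPitch, const char * src, size_t sPitch,
                    size_t width, size_t height, int splitNum,
                    size_t sStep, size_t tStep, cudaStream_t stream)
{
    for (int k = 0; k < splitNum; k++)        /* enqueue without blocking */
        cudaMemcpy2DAsync(dst + k * tStep, dPitch,
                          src + k * sStep, sPitch,
                          width, height, cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);            /* one wait for the whole batch */
}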
@@ -144,6 +168,8 @@ make a new tensor to keep the result and return it
XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
{
    CheckNTErrors(&s, "Invalid tensors!");
+    CheckNTErrors(s.dimSize[whereToSplit] % splitNum == 0,
+                  "The dimension cannot be split due to the improper split number");

    int order = s.order + 1;
    int * dimSize = new int[order];
@@ -226,20 +252,46 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
        int n = blockNum / splitNum;
        int sStep = blockSize * big->unitSize;
        int tStep = 0;
-        for (int k = 0; k < splitNum; k++) {
-            XTensor * t = (XTensor*)smalls->GetItem(k);
-            XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
-                       (char*)big->data + k * sStep, sPitch, big->devID,
-                       mSize, n);
-        }
+        if(big->devID < 0){
+            for (int k = 0; k < splitNum; k++) {
+                XTensor * t = (XTensor*)smalls->GetItem(k);
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)big->data + k * sStep, sPitch, big->devID,
+                           mSize, n);
+            }
+        }
+        else{
+#ifdef USE_CUDA
+#ifdef STREAMED_MEMCPOPY
+            XStream * stream = GDevs.GPUs[big->devID].stream;
+            for (int k = 0; k < splitNum; k++) {
+                XTensor * t = (XTensor*)smalls->GetItem(k);
+                XMemCopy2DAsync((char*)t->data + k * tStep, tPitch, t->devID,
+                                (char*)big->data + k * sStep, sPitch, big->devID,
+                                mSize, n, stream);
+            }
+            stream->StreamSynchronize();
+#else
+            for (int k = 0; k < splitNum; k++) {
+                XTensor * t = (XTensor*)smalls->GetItem(k);
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)big->data + k * sStep, sPitch, big->devID,
+                           mSize, n);
+            }
+#endif
+#else
+            ShowNTErrors("Please specify USE_CUDA and recompile the code!");
+#endif
+        }
    }
}
    /* splitting with fewer kernel/API calls??? (I'm not sure about it!! may remove this later) */
    else {
-        int* dimSizeTMP = new int[MAX_TENSOR_DIM_NUM];
-        for (int i = 0; i < MAX_TENSOR_DIM_NUM; i++)
-            dimSizeTMP[i] = -big->dimSize[i];
-        dimSizeTMP[whereToSplit] /= splitNum;
-        dimSizeTMP[big->order] = -splitNum;
+        int* dimSizeTMP = new int[big->order + 1];
+        for (int i = 0; i < big->order; i++)
+            dimSizeTMP[i + 1] = -big->dimSize[i];
+        dimSizeTMP[whereToSplit + 1] /= splitNum;
+        dimSizeTMP[0] = -splitNum;

        XMem * mem = big->mem;
        XTensor* tensorTMP = new XTensor(big->order + 1, dimSizeTMP, big->dataType, big->denseRatio, big->devID, mem);
@@ -251,7 +303,7 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
        if (uniform) {
            dataTMP = first->data;
        }
        else {
-            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(mem->devID, size);
+            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(big->devID, size);
        }

        tensorTMP->data = dataTMP;
@@ -270,13 +322,12 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
        delete[] dimSizeTMP;

        tensorTMP->data = NULL;
-        dataTMP = NULL;
        delete tensorTMP;

        if ((!uniform) && (mem != NULL))
            mem->ReleaseBuf(mem->devID, size);
        else
-            XMemFree(mem->devID, dataTMP);
+            XMemFree(big->devID, dataTMP);
    }
}
......
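/* Note (editor's): the temporary tensor above is built with negated dimension
   sizes (e.g., dimSizeTMP[0] = -splitNum). Judging from how tensorTMP->data is
   assigned externally and reset to NULL before deletion, a negative size appears
   to tell the XTensor constructor not to allocate a buffer of its own, so the
   temporary can alias dataTMP. The net effect is that a tensor of shape
   (d_0, ..., d_n) is viewed as (splitNum, d_0, ..., d_whereToSplit / splitNum,
   ..., d_n) and handed to a single _Split call instead of splitNum copies. */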
@@ -26,6 +26,8 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

+#define STREAMED_MEMCPOPY
+
/*
transform a tensor by splitting it
e.g., (M, N) -> (M, N/3, 3)
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
 * $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-28
 * It is extremely hot these days and I cannot sleep well. Fortunately we had
 * a good lunch of Steamed Cold Noodles. This made me feel much better!
 */
#include "Transpose.h"
#include "Merge.h"
#include "../../XUtility.h"
#include "../../XName.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor transposition of dimensions i and j
b = transposed(a)

For an input tensor a, we transpose its dimensions i and j.
E.g., let a be a tensor of size x * y * z, with i = 0 and j = 2;
then the output will be a tensor of size z * y * x.

>> a - the input tensor
>> b - the output tensor, i.e., tensor a with dimensions i and j transposed
>> i - the first dimension to transpose
>> j - the second dimension to transpose
*/
void _Transpose(const XTensor * a, XTensor * b, const int i, const int j)
{
CheckNTErrors(a && b, "Empty tensors");
CheckNTErrors(a->order == b->order, "Wrong tensor orders");
CheckNTErrors(a->unitNum == b->unitNum && a->unitSize == b->unitSize, "Wrong tensor sizes");
CheckNTErrors(a->order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a->order > j && j >= 0, "index of dimension is out of scope!");
for(int k = 0; k < a->order; k++){
if(k == i){
CheckNTErrors(a->dimSize[k] == b->dimSize[j], "Wrong dimension size in transposition");
}
else if(k == j){
CheckNTErrors(a->dimSize[k] == b->dimSize[i], "Wrong dimension size in transposition");
}
else{
CheckNTErrors(a->dimSize[k] == b->dimSize[k], "Wrong dimension size in transposition");
}
}
if(i == j){
XMemCopy(b->data, b->devID, a->data, a->devID, b->unitNum * b->unitSize);
}
else{
int I = MIN(i, j);
int J = MAX(i, j);
int * dims = new int[a->order + 1];
for(int k = 0; k <= J; k++)
dims[k] = a->dimSize[k];
dims[J + 1] = -1;
for(int k = J + 1; k < a->order; k++)
dims[k + 1] = a->dimSize[k];
        /* reshape tensor a from (..., n_I, ..., n_J, ...) => (..., n_I, ..., n_J, 1, ...) */
XTensor * aTMP = new XTensor(a->order + 1, dims, a->dataType, a->denseRatio, a->devID, a->mem);
aTMP->data = a->data;
for(int k = 0; k < I; k++)
dims[k] = a->dimSize[k];
for(int k = I + 1; k <= J; k++)
dims[k - 1] = a->dimSize[k];
dims[J] = a->dimSize[I];
for(int k = J + 1; k < a->order; k++)
dims[k] = a->dimSize[k];
        /* reshape tensor b from (..., m_I, ..., m_J, ...) => (..., m_J, m_I, ...) */
b->Reshape(b->order, dims);
/* tensor (..., n_I, ..., n_J, 1, ...) => tensor (..., m_J, m_I, ...) */
_Merge(aTMP, b, J + 1, I);
memcpy(dims, a->dimSize, sizeof(int) * a->order);
dims[I] = a->dimSize[J];
dims[J] = a->dimSize[I];
        /* reshape tensor b from (..., m_J, m_I, ...) => (..., m_J, ..., m_I, ...) */
b->Reshape(b->order, dims);
aTMP->data = NULL;
delete[] dims;
delete aTMP;
}
}
/*
tensor transposition of dimensions i and j (return an XTensor structure).
make a new tensor to keep the result and return it.
b = transposed(a)

For an input tensor a, we transpose its dimensions i and j.
E.g., let a be a tensor of size x * y * z, with i = 0 and j = 2;
then the output will be a tensor of size z * y * x.

>> a - the input tensor
>> i - the first dimension to transpose
>> j - the second dimension to transpose
<< return - tensor a with dimensions i and j transposed
*/
XTensor Transpose(const XTensor &a, const int i, const int j)
{
CheckNTErrors(a.order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a.order > j && j >= 0, "index of dimension is out of scope!");
int order = a.order;
int * dimSize = new int[order];
for(int k = 0; k < order; k++){
if(k == i)
dimSize[k] = a.dimSize[j];
else if(k == j)
dimSize[k] = a.dimSize[i];
else
dimSize[k] = a.dimSize[k];
}
float dr = (!a.isSparse) ? 1.0F : a.denseRatio;
XTensor b(order, dimSize, a.dataType, dr, a.devID, a.mem);
b.SetTMP();
/* call _Transpose function */
_Transpose(&a, &b, i, j);
/* tensor connection */
XLink::MakeLink(&a, NULL, &b, SHAPE_TRANSPOSE);
XLink::AddParamToHeadInt(&b, i);
XLink::AddParamToHeadInt(&b, j);
/* destroy variables */
delete[] dimSize;
return b;
}
}
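/* Note (editor's usage sketch, illustration only): transposing dimensions 0 and 2
   of a 2 x 3 x 4 tensor yields a 4 x 3 x 2 tensor; element (i, j, k) moves to
   (k, j, i). Assuming the XTensor constructor used in this file: */
void transposeDemo()
{
    int dims[3] = {2, 3, 4};
    nts::XTensor a(3, dims, X_FLOAT, 1.0F, -1, NULL);   /* dense tensor on the CPU */
    a.SetZeroAll();
    nts::XTensor b = nts::Transpose(a, 0, 2);           /* b has shape 4 x 3 x 2 */
}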
@@ -27,27 +27,18 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

-#define transpose _Transpose_
-
/*
-generate a transposed 1D/2D tensor
+tensor transposition of dimensions i and j
b = transposed(a)
*/
-void _Transpose(XTensor * a, XTensor * b);
+void _Transpose(const XTensor * a, XTensor * b, const int i, const int j);

-/*
-transpose a 1D/2D tensor (do it on site).
-keep the result in the input tensor and return nothing.
-a = transposed(a)
-*/
-void _TransposeMe(XTensor * a);
-
/*
-make a transposed 1D/2D tensor (return a XTensor structure).
+tensor transposition of dimensions i and j (return an XTensor structure).
make a new tensor to keep the result and return it.
b = transposed(a)
*/
-XTensor Transpose(XTensor &a);
+XTensor Transpose(const XTensor &a, const int i, const int j);

} // namespace nts(NiuTrans.Tensor)
......
@@ -32,12 +32,108 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
void KernelUnsqueezeFlat(void * s, int blockSize, int totalSize, void * t, int n)
{
/* index of data items */
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= blockSize)
return;
T value = ((T*)s)[i];
T * tData = (T*)t;
__syncthreads();
for (int k = i; k < totalSize; k += blockSize)
tData[k] = value;
}
/*
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
void KernelUnsqueezeFlatBigram(void * s, int blockSize, int totalSize, void * t, int n)
{
/* index of data items */
int i = (blockDim.x * blockIdx.x + threadIdx.x) * 2;
if (i >= blockSize)
return;
T value = ((T*)s)[i];
T value2 = ((T*)s)[i + 1];
T * tData = (T*)t;
__syncthreads();
for (int k = i; k < totalSize; k += blockSize){
tData[k] = value;
tData[k + 1] = value2;
}
}
/*
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
void KernelUnsqueezeFlat2D(void * s, int blockSize, int totalSize, void * t, int n)
{
__shared__ T data[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ int offsets[MAX_CUDA_THREAD_NUM_PER_BLOCK];
/* index of data items */
int i = blockDim.x * blockIdx.x + threadIdx.x;
    /* index of the copies along the inserted dimension */
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= blockSize || j >= n)
return;
if(threadIdx.y == 0)
data[threadIdx.x] = ((T*)s)[i];
if(threadIdx.x == 0)
offsets[threadIdx.y] = blockSize * j;
__syncthreads();
((T*)t)[offsets[threadIdx.y] + i] = data[threadIdx.x];
}
/*
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> blockNum - number of the blocks
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
-void KernelUnsqueeze(void * s, int blockSize, int blockNum, void * t, int n)
+void KernelUnsqueeze(void * s, int blockSize, int blockNum, int totalSize, void * t, int n)
{
    /* index of data items */
    int i = blockDim.x * blockIdx.x + threadIdx.x;
@@ -51,11 +147,10 @@ void KernelUnsqueeze(void * s, int blockSize, int blockNum, int totalSize, void * t, int n)
    MTYPE offset = blockSize * j;
    T value = ((T*)s)[offset + i];
    T * tData = (T*)t + offset * n;
-    int length = blockSize * n;

    __syncthreads();

-    for (int k = i; k < length; k += blockSize)
+    for (int k = i; k < totalSize; k += blockSize)
        tData[k] = value;
}
@@ -83,21 +178,71 @@ void _CudaUnsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
    int cudaGrids[3];
    int cudaBlocks[3];

-    GDevs.GetCudaThread2D(a->devID, blockSize, blockNumA, MAX_INT, cudaGrids, cudaBlocks);
-
    int devIDBackup = 0;
    ProtectCudaDev(a->devID, devIDBackup);

-    if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
-        KernelUnsqueeze<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
-                                (a->data, blockSize, blockNumA, b->data, dSize);
-    }
-    else if (a->dataType == X_INT && b->dataType == X_INT) {
-        KernelUnsqueeze<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
-                              (a->data, blockSize, blockNumA, b->data, dSize);
-    }
-    else {
-        ShowNTErrors("TODO!");
-    }
+    if(blockNumA > 1){
+        GDevs.GetCudaThread2D(a->devID, blockSize, blockNumA, MAX_INT, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueeze<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                    (a->data, blockSize, blockNumA, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueeze<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                  (a->data, blockSize, blockNumA, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else if(blockNumA == 1 && blockSize < MAX_CUDA_THREAD_NUM_PER_BLOCK){
+        GDevs.GetCudaThread2D(a->devID, blockSize, dSize, MAX_CUDA_THREAD_NUM_PER_BLOCK/4, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueezeFlat2D<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                          (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueezeFlat2D<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                        (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else if(blockNumA == 1 && blockSize % 2 == 0){
+        GDevs.GetCudaThread(a->devID, blockSize/2, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueezeFlatBigram<float> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                              (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueezeFlatBigram<int> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                            (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else if(blockNumA == 1){
+        GDevs.GetCudaThread(a->devID, blockSize, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueezeFlat<float> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                        (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueezeFlat<int> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                      (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else{
+        ShowNTErrors("Something is wrong!");
+    }

    BacktoCudaDev(a->devID, devIDBackup);
......
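/* Note (editor's worked example, illustration only): Unsqueeze inserts a new axis
   of size dSize and replicates the data along it; for a = (M, N) and dim = 1 the
   result is b = (M, dSize, N), i.e., each of the M blocks of length N is copied
   dSize times. The kernels above are specializations of that copy (Flat: one
   block; FlatBigram: two items per thread; Flat2D: staged via shared memory).
   A CPU reference for checking them: */
#include <vector>

std::vector<float> unsqueezeRef(const std::vector<float> & a,
                                int blockNum, int blockSize, int dSize)
{
    std::vector<float> b((size_t)blockNum * dSize * blockSize);
    for (int j = 0; j < blockNum; j++)         /* each source block ... */
        for (int d = 0; d < dSize; d++)        /* ... is repeated dSize times */
            for (int i = 0; i < blockSize; i++)
                b[((size_t)j * dSize + d) * blockSize + i] = a[(size_t)j * blockSize + i];
    return b;
}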
@@ -25,6 +25,7 @@
#include "TopK.h"
#include "TopK.cuh"
#include "Sort.cuh"
+
+#define WORKERSNUM 64

namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -363,6 +364,436 @@ void KernelTopK2(T * input, int stride, int strideNum, int blockNum, int k, T minValue, T * output, int * index)
}

/*
get the top-k items
>> input - the input data array
>> stride - number of items we go over when we move to the next item along a given dimension
>> strideNum - size of the given dimension
>> blockNum - number of data blocks
>> k - as it is
>> minValue - min value of an item
>> output - the output data array
>> index - the output index array
*/
template<class T> __global__
void KernelTopK3(T * input, int stride, int strideNum, int blockNum, int k, T minValue, T * output, int * index)
{
__shared__ CudaHeapNode<T> heapData[(SHARED_MEMORY_SIZE - 1024 * sizeof(T)) / sizeof(CudaHeapNode<T>)];
__shared__ T eachHeapMaxValue[1024];
    /* optimization of the k size: the parameter must be more than half of k */
int parameter = 0;
/* worker index */
int i = blockDim.x * blockIdx.x + threadIdx.x;
    /* index of the data array along the given dimension */
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= strideNum || i >= blockDim.x || j >= stride * blockNum)
return;
int blockIndex = j / stride;
int offsetInBlock = j % stride;
T * d = input + stride * strideNum * blockIndex + offsetInBlock;
CudaXHeap<MIN_HEAP, T> heap(k - parameter, heapData + k * (threadIdx.y * blockDim.x + threadIdx.x));
__syncthreads();
/* go over the data array and build the heap */
int indexOffset = blockDim.x;
int dataOffset = stride * blockDim.x;
if (i + (heap.size - 1) * indexOffset < strideNum) {
int p = i;
int q = i * stride;
for (int m = 0; m < heap.size; m++) {
heap.Push(p, d[q]);
p += indexOffset;
q += dataOffset;
}
for (; p < strideNum; p += indexOffset, q += dataOffset) {
T v = d[q];
if (v > heap.topValue) {
heap.ReplaceTop(p, v);
}
}
}
else {
for (int p = i, q = i * stride; p < strideNum; p += indexOffset, q += dataOffset) {
heap.Push(p, d[q]);
}
}
    /* fill the heap if not enough items have been processed */
while (heap.count < heap.size) {
heap.Push(-1, minValue);
}
__syncthreads();
    /* merge the heaps in another way */
T minData = minValue;
int heapLimit = heap.count / 2;
if (heapLimit % 2 == 0 && heapLimit != 0) heapLimit -= 1;
for (int counter = heap.count - 1; counter >= heapLimit; --counter) {
if (minData < heap.items[counter].value)
minData = heap.items[counter].value;
}
eachHeapMaxValue[threadIdx.y * blockDim.x + threadIdx.x] = minData;
    // needs more optimization
if (i == 0) {
int threadLimit = (threadIdx.y + 1) * blockDim.x;
CudaXHeap<MIN_HEAP, T> chooseHeap(k, heapData + k * ((blockDim.x * blockDim.y) + threadIdx.y));
int counter = threadIdx.y * blockDim.x;
for (; counter < threadIdx.y * blockDim.x + k; ++counter) {
chooseHeap.Push(counter, eachHeapMaxValue[counter]);
}
for (; counter < threadLimit; ++counter) {
if (eachHeapMaxValue[counter]>chooseHeap.items[0].value) {
chooseHeap.ReplaceTop(counter, eachHeapMaxValue[counter]);
}
}
CudaXHeap<MIN_HEAP, T> ansHeapData(k, k - parameter, heapData + k * chooseHeap.items[0].index);
int miss = parameter;
for (counter = 1; counter < k; ++counter) {
chooseHeap.items[0] = chooseHeap.items[chooseHeap.count - 1];
chooseHeap.count--;
chooseHeap.Down(0);
CudaHeapNode<T> * cmpHeapData = heapData + k * (chooseHeap.items[0].index);
int cmpHeapLimit = 0;
if (counter + heapLimit <= k - parameter){
cmpHeapLimit = heapLimit;
}
            /* take the max data from the min-heap, so start searching from the leaf nodes */
for (int iterator = k - 1 - parameter; iterator >= cmpHeapLimit; --iterator){
if (miss > 0){
ansHeapData.Push(cmpHeapData[iterator].index, cmpHeapData[iterator].value);
miss--;
}
else if (ansHeapData.items[0].value < cmpHeapData[iterator].value){
ansHeapData.ReplaceTop(cmpHeapData[iterator].index, cmpHeapData[iterator].value);
}
}
}
int offset = stride * k * blockIndex + offsetInBlock;
T * dOutput = output + offset;
int * indexOutput = index + offset;
for (int q = 0; q < k; ++q){
dOutput[stride * q] = ansHeapData.items[q].value;
indexOutput[stride * q] = ansHeapData.items[q].index;
}
}
}
__device__ __forceinline__
unsigned getLaneMaskLe()
{
unsigned mask;
asm("mov.u32 %0, %%lanemask_le;" : "=r"(mask));
return mask;
}
__device__ __forceinline__
int getLaneId()
{
int laneId;
asm("mov.s32 %0, %laneid;" : "=r"(laneId));
return laneId;
}
__device__
unsigned convert(float v)
{
unsigned x = __float_as_int(v);
unsigned mask = (x & 0x80000000) ? 0xffffffff : 0x80000000;
return (x ^ mask);
}
__device__
float convert(unsigned int v)
{
float x = __uint_as_float(v);
return x;
}
__device__
float deconvert(unsigned int v)
{
unsigned int mask = (v & 0x80000000) ? 0x80000000 : 0xffffffff;
return __int_as_float(v ^ mask);
}
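/* Note (editor's check, illustration only): convert() is a monotonic map from
   float to unsigned int -- positive floats get the sign bit set, negative floats
   are bitwise inverted -- so unsigned comparison of the images agrees with float
   comparison of the originals. That is what lets radix select run on raw bits.
   A standalone host-side verification of the same bit trick: */
#include <cstdio>
#include <cstring>
#include <cstdint>

static uint32_t convertRef(float v)
{
    uint32_t x;
    std::memcpy(&x, &v, sizeof(x));
    return x ^ ((x & 0x80000000u) ? 0xffffffffu : 0x80000000u);
}

int main()
{
    const float vals[] = { -3.5f, -0.0f, 0.0f, 1.0f, 2.25f };  /* sorted ascending */
    for (int i = 0; i + 1 < 5; i++)   /* images must be nondecreasing: prints 1 four times */
        printf("%d\n", convertRef(vals[i]) <= convertRef(vals[i + 1]));
    return 0;
}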
__global__
void convert2uintV2(float* input, unsigned int *output, int stride, int strideNum, int blockNum, int size)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int idy = blockDim.y * blockIdx.y + threadIdx.y;
int blockIndex = idy / stride;
    int offsetInBlock = idy % stride;
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum && i < size;
i += stride * blockDim.x){
output[i] = convert(input[i]);
}
}
__global__
void deconvert2floatV2(unsigned int * input, float *output, int stride, int strideNum, int blockNum, int size)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int idy = blockDim.y * blockIdx.y + threadIdx.y;
//int strideNum = (int)strideNumSize;
//if (flag) strideNum = strideNumSize[idy];
int blockIndex = idy / stride;
    int offsetInBlock = idy % stride;
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum && i < size;
i += stride * blockDim.x){
output[i] = deconvert(input[i]);
}
}
__device__
void radixCount(unsigned int *data, int limit, int *posCount, unsigned int mask, int maskDesire, unsigned int desire, int stride, int strideNum, int blockNum)
{
    /* the idx-th thread in one vector */
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    /* the idy-th vector in one tensor */
    int idy = blockDim.y * blockIdx.y + threadIdx.y;
    int blockIndex = idy / stride;
    int offsetInBlock = idy % stride;
for (int j = idx*stride + stride * strideNum * blockIndex + offsetInBlock;
j< stride * strideNum * blockIndex + offsetInBlock + stride*strideNum && j<limit;
j += stride * WORKERSNUM) {
if ((data[j] & maskDesire) == desire) {
if (data[j] & mask) {
posCount[(idy % (512 / WORKERSNUM))*blockDim.x + idx]++;
}
}
}
}
/* This is a fast way to check thread status within a warp;
   note that the thread number must be a multiple of 32 */
__device__
void gpuCheckWarp(int *smem, bool in, int *carry, int *index)
{
int vote = __ballot_sync(0xffffffff, in);
*index = __popc(getLaneMaskLe() & vote);
*carry = __popc(vote);
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int warp = idx / 32;
int warpNum = blockDim.x / 32;
if (getLaneId() == 0) {
/* save each warp carry */
smem[warp + warpNum * threadIdx.y] = *carry;
}
__syncthreads();
    /* use one thread to accumulate the carries over all warps */
if (idx == 0) {
for (int i = 1 + warpNum * threadIdx.y; i < warpNum * (threadIdx.y + 1); ++i) {
smem[i] += smem[i - 1];
}
}
__syncthreads();
if (warp % warpNum) {
*index += smem[warpNum * threadIdx.y + warp - 1];
}
*carry = smem[warpNum * threadIdx.y + warpNum - 1];
}
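/* Note (editor's worked example): suppose lanes 1, 4 and 9 of a warp pass "in".
   Then vote = __ballot_sync(...) = 0b1000010010. For lane 4, getLaneMaskLe()
   keeps bits 0..4, so *index = __popc(mask & vote) = 2: lane 4 holds the 2nd
   passing item (a 1-based prefix count), while *carry = __popc(vote) = 3 is the
   warp total. The per-warp carries are then prefix-summed in shared memory to
   produce offsets across the whole thread block. */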
/*
collect the data larger than the pattern and return them as the answer
*/
__device__
void collectNumber(unsigned int *data, int stride, int strideNum, int limit,
unsigned int pattern, float *ans, int *ansIndex, int k)
{
int idy = blockDim.y * blockIdx.y + threadIdx.y;
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int blockIndex = idy / stride;
int offsetInBlock = idy % stride;
    /* for counting each warp's temporary carry */
__shared__ int smem[32];
int carry;
int index;
int vectorLimit = stride * strideNum * blockIndex + offsetInBlock + stride * strideNum;
int alibnStrideNum = strideNum;
if (alibnStrideNum % blockDim.x) alibnStrideNum = alibnStrideNum + blockDim.x - (alibnStrideNum % blockDim.x);
int vectorAlibnLimit = stride * strideNum * blockIndex + offsetInBlock + stride * alibnStrideNum;
int ansArrayIndex = stride * k * blockIndex + offsetInBlock;
int ansSize = 0;
__syncthreads();
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
i < vectorAlibnLimit; i += stride * WORKERSNUM){
bool hasTopk = false;
        if (i < vectorLimit && data[i] > pattern){
hasTopk = true;
}
gpuCheckWarp(smem, hasTopk, &carry, &index);
if (carry > 0) {
if (hasTopk) {
ans[ansArrayIndex + (index - 1) * stride] = deconvert(data[i]);
ansIndex[ansArrayIndex + (index - 1) * stride] = i - stride * strideNum * blockIndex;
}
ansArrayIndex += carry * stride;
ansSize += carry;
}
__syncthreads();
}
if (ansSize < k){
int ramindNum = k - ansSize;
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock; i < vectorAlibnLimit; i += stride * WORKERSNUM) {
bool hasTopk = false;
if (i < vectorLimit && data[i] == pattern) {
hasTopk = true;
}
gpuCheckWarp(smem, hasTopk, &carry, &index);
            if (carry > 0) {
                int checkTmpIndex = ansArrayIndex + (index - 1) * stride;
                /* avoid pointer boundary overflow: for instance, if one more index
                   is needed but two indices fit, we should filter out the larger one */
                if (hasTopk && checkTmpIndex < stride * k * blockIndex + offsetInBlock + stride * k) {
ans[checkTmpIndex] = deconvert(pattern);
ansIndex[checkTmpIndex] = i - stride * strideNum * blockIndex;
}
ramindNum -= carry;
ansArrayIndex += carry * stride;
if (ramindNum <= 0) break;
}
__syncthreads();
}
}
}
/*
This is the old way: one thread collects the numbers. It is very slow, so we dropped it.
*/
__device__
void collectNumberOld(unsigned int *data, int n, int k, unsigned int pattern, unsigned int *ans, int *indexNum, int stride, int strideNum)
{
int idy = blockDim.y * blockIdx.y + threadIdx.y;
int blockIndex = idy / stride;
int offsetInBlock = idy % stride;
int cot = 0;
for (int i = stride * strideNum * blockIndex + offsetInBlock, j = 0; j < strideNum; j++, i += stride) {
if (data[i] > pattern) {
ans[cot] = data[i];
indexNum[cot++] = j;
}
}
    /* if cot < k, the remaining values must equal the desired pattern */
if (cot < k) {
for (int i = cot; i < k; ++i) {
ans[i] = pattern;
}
        /* collect the remaining indices whose data values equal the pattern */
for (int i = stride * strideNum * blockIndex + offsetInBlock, j = 0; j < strideNum; j++, i += stride) {
if (data[i] == pattern) {
indexNum[cot++] = j;
if (cot == k) break;
}
}
}
}
/*
When k is very big, we cannot use shared memory for the calculation, so we use the radix select algorithm
*/
template<class T> __global__
void KernelTopKRadixSelect(unsigned int * input, int stride, int strideNum,
int blockNum, int k, T minValue, T * output, int* index, int limit)
{
    /* the idx-th thread in one vector */
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    /* the idy-th vector in one tensor */
    int idy = blockDim.y * blockIdx.y + threadIdx.y;
    /* use the optimization or not */
    //int strideNum = (int)strideNumSize;
    //if (isOptimization) strideNum = strideNumSize[idy];
    if (idy >= stride * blockNum) return;
int maskDesire = 0;
unsigned int mask = 0x80000000;
unsigned int desire = 0;
__shared__ int posCount[32 * 32];
int tmpK = k;
int flag = 1;
#pragma unroll
for (int i = 0; i < 32; i++){
        /* we need to clear the shared memory in every iteration */
posCount[idx + blockDim.x*(idy % (512 / WORKERSNUM))] = 0;
if (flag)
radixCount(input, stride*strideNum*blockNum, posCount, mask, maskDesire, desire, stride, strideNum, blockNum);
__syncthreads();
int sumCount = 0;
#pragma unroll
for (int j = 0; j < WORKERSNUM; j++) {
sumCount += posCount[(idy % (512 / WORKERSNUM))*blockDim.x + j];
}
__syncthreads();
if (tmpK<sumCount) {
/* this position should be 1 */
desire = mask^desire;
}
else {
/* zoom out the k size,this position should be 0 */
tmpK = tmpK - sumCount;
if (tmpK == 0){
desire = (~(maskDesire >> 1)) | desire;
/* avoid Synchronize deadlock ,can't use break,so we use flag */
//break;
flag = 0;
}
}
maskDesire = mask^maskDesire;
mask = mask >> 1;
}
__syncthreads();
/* old way to collect number */
/*
if (idx == 0)
{
unsigned int* uintOutput = new unsigned int;
int* tmpIndex = new int;
    //*******************something wrong***************************
cudaMalloc((void **)&uintOutput, sizeof(unsigned int)* k);
cudaMalloc((void **)&tmpIndex, sizeof(unsigned int)*k);
//*************************************************************
collectNumberOld(input, limit, k, desire, uintOutput, tmpIndex, stride, strideNum);
int blockIndex = idy / stride;
int offsetInBlock = idy% stride;
for (int i = stride * k * blockIndex + offsetInBlock, j = 0; j < k; j++, i += stride)
{
//for(int i = )
output[i] = deconvert(uintOutput[j]);
index[i] = tmpIndex[j];
}
}
__syncthreads();
*/
collectNumber(input, stride, strideNum, limit, desire, output, index, k);
}
/*
get the top-k items along a given dimension

>> a - input tensor
>> b - output tensor (top-k result)
@@ -388,7 +819,13 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
    for (int i = dimRDI + 1; i < a->order; i++)
        blockNum *= a->dimSizeRDI[i];

-    int workerNum = blockNum < 16 ? 64 : 32; // should be tuned for better performance
+    int workerNum = blockNum < 16 ? 64 : 32;
+
+    /* adjust the thread number according to the size of k, to fit the shared memory size */
+    if (k < 6) workerNum = 512;
+    else if (k < 11) workerNum = 256;
+    else if (k < 22) workerNum = 128;
+    else if (k < 44) workerNum = 64;
+    else workerNum = 32;

    int cudaGrids[3];
    int cudaBlocks[3];
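/* Note (editor's worked arithmetic): the schedule above keeps the heap storage
   (cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) within
   the shared-memory budget checked below. Assuming cudaBlocks[0] = workerNum,
   cudaBlocks[1] = 1 and 4-byte items: k = 10 with 256 workers needs
   257 * 10 * 8 = 20,560 bytes, while k = 44 falls through to 32 workers,
   needing only 33 * 44 * 8 = 11,616 bytes. */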
@@ -397,29 +834,15 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
                          workerNum, stride * blockNum, MAX_INT,
                          cudaGrids, cudaBlocks);

-    for (int i = 0; i < 2; i++) {
-        if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) >= SHARED_MEMORY_SIZE) {
-            if (cudaBlocks[1] >= 2 && cudaBlocks[1] % 2 == 0) {
-                cudaBlocks[1] /= 2;
-                cudaGrids[1] *= 2;
-            }
-        }
-        if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) >= SHARED_MEMORY_SIZE) {
-            if (cudaBlocks[0] >= 2 && cudaBlocks[0] % 2 == 0) {
-                cudaBlocks[0] /= 2;
-                cudaGrids[0] *= 2;
-            }
-        }
-    }
-
    int devIDBackup = 0;
    ProtectCudaDev(a->devID, devIDBackup);

    /* we run the kernel if the heaps can fit into the shared memory */
+    cudaGrids[1] *= cudaBlocks[1];
+    cudaBlocks[1] = 1;
    if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) < SHARED_MEMORY_SIZE) {
        if (a->dataType == DEFAULT_DTYPE) {
-            KernelTopK2<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+            KernelTopK3<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                ((DTYPE*)a->data, stride, strideNumA, blockNum, k, DTYPE_MIN,
                                 (DTYPE*)b->data, (int*)index->data);
        }
@@ -430,20 +853,34 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
    }
    /* we resort to sorting if the data cannot fit inside the shared memory */
    else {
-        int dimSize[MAX_TENSOR_DIM_NUM];
-        memcpy(dimSize, a->dimSize, sizeof(int) * a->order);
-        dimSize[0] = -dimSize[0];
-        XTensor * indexA = new XTensor(a->order, dimSize, X_INT, 1.0F, a->devID, a->mem);
-        indexA->data = a->mem != NULL ? a->mem->AllocBuf(a->devID, a->unitNum * sizeof(int)) : XMemAlloc(a->devID, a->unitNum * sizeof(int));
-
-        /* make the index tensor */
-        indexA->SetAscendingOrder(dim);
-
-        _CudaSortBig(a, b, indexA, index, dim, k);
-
-        if (a->mem != NULL)
-            a->mem->ReleaseBuf(a->devID, a->unitNum * sizeof(int));
-        delete indexA;
+        int workerNum = WORKERSNUM;
+
+        GDevs.GetCudaThread2D(a->mem->devID,
+                              workerNum, stride * blockNum, MAX_INT,
+                              cudaGrids, cudaBlocks);
+
+        if (a->dataType == DEFAULT_DTYPE) {
+            unsigned int * goutput = (unsigned int *)a->data;
+            /* the two conversion launches below take almost the same time; either works */
+            convert2uintV2 <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                            ((float*)a->data, goutput, stride, strideNumA, blockNum, strideNumA * blockNum * stride);
+            //convert2uintV2 <<<dim3(1, stride * blockNum), dim3(512, 1)>>>
+            //                ((float*)a->data, goutput, stride, strideNumA, blockNum, strideNumA * blockNum * stride);
+            KernelTopKRadixSelect<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                          (goutput, stride, strideNumA, blockNum, k, DTYPE_MIN,
+                                           (DTYPE *)b->data, (int *)index->data, stride * strideNumA * blockNum);
+            deconvert2floatV2 <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                               ((unsigned int *)a->data, (float *)goutput, stride, strideNumA, blockNum, strideNumA * blockNum * stride);
+        }
    }

    BacktoCudaDev(a->devID, devIDBackup);
......
@@ -117,7 +117,7 @@ void CudaGPUToCPUFlush(XTensor * tensor)
    else {
        tensor->dataHost = new char[tensor->unitNum * tensor->unitSize];
        if (tensor->data != NULL)
-            cudaMemcpy(tensor->dataHost, tensor->data, tensor->unitNum * tensor->unitSize, cudaMemcpyDeviceToHost);
+            XMemCopy(tensor->dataHost, -1, tensor->data, tensor->devID, tensor->unitNum * tensor->unitSize);
        else
            memset(tensor->dataHost, 0, tensor->unitNum * tensor->unitSize);
    }
......
@@ -116,8 +116,7 @@ void _HardTanHBackward(XTensor * gold, XTensor * y, XTensor * x,
}
#endif

-    if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
-    {
+    if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
        /* calculate dE/dy */
        if(lossName != NOLOSS)
            _LossBackward(dedy, gold, y, lossName);
......
@@ -38,6 +38,17 @@ log scale softmax y = log(e^x / \sum_{i} e^{x_i})
*/
void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
{
+    CheckNTErrors(!x->isSparse && !y->isSparse, "TODO!");
+    CheckNTErrors(x && y, "Empty input tensors!");
+
+    if(leadDim < 0)
+        leadDim = x->order - 1;
+
+    if(y->dimSize[leadDim] == 1){
+        y->SetZeroAll();
+        return;
+    }
+
    int leadDimRDI = x->order - leadDim - 1;
    if (!x->isSparse && !y->isSparse &&
        x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
...@@ -68,25 +79,27 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -68,25 +79,27 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
blockSize = stride * dimensionSize; blockSize = stride * dimensionSize;
blockNum = y->unitNum / blockSize; blockNum = y->unitNum / blockSize;
max = NewTensor(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem); max = NewTensorBuf(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem);
sum = NewTensor(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem); sum = NewTensorBuf(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem);
max->data = mem != NULL ? (char*)mem->AllocBuf(mem->devID, max->unitNum * max->unitSize) : XMemAlloc(max->devID, max->unitNum * max->unitSize);
sum->data = mem != NULL ? (char*)mem->AllocBuf(mem->devID, sum->unitNum * sum->unitSize) : XMemAlloc(sum->devID, sum->unitNum * sum->unitSize);
_ReduceMax(x, max, leadDim); _ReduceMax(x, max, leadDim);
_ReduceSum(x, sum, leadDim, max, 1.0F, true); _ReduceSum(x, sum, leadDim, max, 1.0F, true);
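/* a reminder of the numerically stable identity used here: with m = max_i x_i and
   s = \sum_i e^{x_i - m}, log softmax is y_j = x_j - m - log(s); subtracting the
   max before exponentiation keeps e^{x_j - m} within floating-point range */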
if (x->devID >= 0) { if (x->devID >= 0) {
int dims[2]; if(leadDimRDI == 0){
dims[0] = -stride; blockSize = y->unitNum;
dims[1] = dimensionSize; blockNum = 1;
blockx = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); blockx = NewTensor2D(blockSize/dimensionSize, -dimensionSize, x->dataType, x->devID, mem);
blocky = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); blocky = NewTensor2D(blockSize/dimensionSize, -dimensionSize, x->dataType, x->devID, mem);
dims[0] = -stride; blockMax = NewTensor2D(blockSize/dimensionSize, -1, x->dataType, x->devID, mem);
dims[1] = 1; blockSum = NewTensor2D(blockSize/dimensionSize, -1, x->dataType, x->devID, mem);
blockMax = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); }
blockSum = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); else{
blockx = NewTensor2D(-stride, dimensionSize, x->dataType, x->devID, mem);
blocky = NewTensor2D(-stride, dimensionSize, x->dataType, x->devID, mem);
blockMax = NewTensor2D(-stride, 1, x->dataType, x->devID, mem);
blockSum = NewTensor2D(-stride, 1, x->dataType, x->devID, mem);
}
} }
for (int k = 0; k < blockNum; k++) { for (int k = 0; k < blockNum; k++) {
...@@ -123,7 +136,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -123,7 +136,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
blockMax->data = mp; blockMax->data = mp;
blockSum->data = sp; blockSum->data = sp;
#ifdef USE_CUDA #ifdef USE_CUDA
_CudaLogSoftmaxSumMax(blockx, blocky, leadDim, blockSum, blockMax); if(leadDimRDI == 0)
_CudaLogSoftmaxSumMax(blockx, blocky, 1, blockSum, blockMax);
else
_CudaLogSoftmaxSumMax(blockx, blocky, leadDim, blockSum, blockMax);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif #endif
...@@ -134,21 +150,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -134,21 +150,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
} }
} }
if (x->devID < 0) { DelTensorBuf(max);
if (mem != NULL) { DelTensorBuf(sum);
mem->ReleaseBuf(mem->devID, max->unitNum * max->unitSize);
mem->ReleaseBuf(mem->devID, sum->unitNum * sum->unitSize); if (x->devID >= 0) {
}
else {
XMemFree(max->devID, max->data);
XMemFree(sum->devID, sum->data);
max->data = NULL;
sum->data = NULL;
}
delete max;
delete sum;
}
else {
delete blockx; delete blockx;
delete blocky; delete blocky;
delete blockMax; delete blockMax;
...@@ -184,6 +189,27 @@ XTensor LogSoftmax(const XTensor &x, int leadDim) ...@@ -184,6 +189,27 @@ XTensor LogSoftmax(const XTensor &x, int leadDim)
return y; return y;
} }
/*
log scale softmax y = log(e^x / \sum_{i} e^{x_i})
compute the result in the given output tensor y (no new tensor is created)
>> x - input vector
>> y - output vector
>> leadDim - leading dimension (along which we perform reduction)
*/
void LogSoftmax(const XTensor &x, XTensor &y, int leadDim)
{
if(!XTensor::IsSameShaped(&x, &y))
InitTensor(&y, &x);
/* call _LogSoftmax function */
_LogSoftmax(&x, &y, leadDim);
/* tensor connection */
XLink::MakeLink(&x, NULL, &y, FUNC_LOGSOFTMAX);
XLink::AddParamToHeadInt(&y, leadDim);
}
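/*
a minimal usage sketch for the in-place variant above (illustrative only; the
shape (4, 8), the InitTensor2D/SetDataRand helpers from XTensor.h, and the
choice of leading dimension are assumptions, not part of this change):

    XTensor x, y;
    InitTensor2D(&x, 4, 8, X_FLOAT);
    x.SetDataRand(-1.0F, 1.0F);
    LogSoftmax(x, y, 1);    // y is re-initialized to x's shape if they differ
*/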
/* /*
backward computation for dense matrices with default data type backward computation for dense matrices with default data type
...@@ -255,6 +281,9 @@ void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -255,6 +281,9 @@ void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
CheckNTErrors((!dedx->isSparse), "The gradient matrix must be dense!"); CheckNTErrors((!dedx->isSparse), "The gradient matrix must be dense!");
CheckNTErrors((gold != NULL), "The gold standard cannot be empty!"); CheckNTErrors((gold != NULL), "The gold standard cannot be empty!");
if(leadDim < 0)
leadDim = y->order - 1;
int leadDimRDI = y->order - leadDim - 1; int leadDimRDI = y->order - leadDim - 1;
#ifdef USE_CUDA #ifdef USE_CUDA
if (gold->devID >= 0) { if (gold->devID >= 0) {
......
...@@ -33,6 +33,9 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim); ...@@ -33,6 +33,9 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (return a XTensor structure) */ /* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (return a XTensor structure) */
XTensor LogSoftmax(const XTensor &x, int leadDim); XTensor LogSoftmax(const XTensor &x, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (with both argument of x and y) */
void LogSoftmax(const XTensor &x, XTensor &y, int leadDim);
/* de/dx */ /* de/dx */
void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx, XTensor * dedy, XTensor * dedx,
......
...@@ -24,7 +24,7 @@ ...@@ -24,7 +24,7 @@
#include "../XDevice.h" #include "../XDevice.h"
#include "../core/math/Power.h" #include "../core/math/Power.h"
#include "../core/math/ScaleAndShift.h" #include "../core/math/ScaleAndShift.h"
#include "../core/math/Log.h" #include "../core/math/Unary.h"
#include "../core/arithmetic/Negate.h" #include "../core/arithmetic/Negate.h"
#include "../core/arithmetic/Sum.h" #include "../core/arithmetic/Sum.h"
#include "../core/arithmetic/Multiply.h" #include "../core/arithmetic/Multiply.h"
......
...@@ -37,6 +37,9 @@ softmax y = e^x / \sum_{i} e^{x_i} ...@@ -37,6 +37,9 @@ softmax y = e^x / \sum_{i} e^{x_i}
*/ */
void _Softmax(const XTensor * x, XTensor * y, int leadDim) void _Softmax(const XTensor * x, XTensor * y, int leadDim)
{ {
if(leadDim < 0)
leadDim = x->order - 1;
int leadDimRDI = x->order - leadDim - 1; int leadDimRDI = x->order - leadDim - 1;
if(!x->isSparse && !y->isSparse && x->dataType == y->dataType){ if(!x->isSparse && !y->isSparse && x->dataType == y->dataType){
int * dimSize = new int[x->order - 1]; int * dimSize = new int[x->order - 1];
...@@ -182,10 +185,14 @@ void _SoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -182,10 +185,14 @@ void _SoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
int leadDim, int leadDim,
LOSS_FUNCTION_NAME lossName) LOSS_FUNCTION_NAME lossName)
{ {
CheckNTErrors((dedx->isSparse == false), "The gradient tensor must be dense!"); CheckNTErrors(dedx->isSparse == false, "The gradient tensor must be dense!");
CheckNTErrors((gold != NULL), "Incorrect x gold standard tensor!"); CheckNTErrors(gold != NULL || lossName == NOLOSS, "Gold standard is required for computing loss!");
if(leadDim < 0)
leadDim = y->order - 1;
int leadDimRDI = y->order - leadDim - 1; int leadDimRDI = y->order - leadDim - 1;
#ifdef USE_CUDA #ifdef USE_CUDA
if(y->devID >= 0){ if(y->devID >= 0){
_CudaSoftmaxBackward(gold, y, x, dedy, dedx, leadDim, lossName); _CudaSoftmaxBackward(gold, y, x, dedy, dedx, leadDim, lossName);
......
...@@ -156,6 +156,50 @@ void KernelSoftmaxComputeTensor(__half * x, __half * max, __half * sum, __half * ...@@ -156,6 +156,50 @@ void KernelSoftmaxComputeTensor(__half * x, __half * max, __half * sum, __half *
} }
/* /*
broadcast a float value from lane 0 to every lane of a warp via inline PTX
*/
__device__ __forceinline__
float broadcast(float input)
{
float output;
asm(
"{"
"shfl.idx.b32 %0,%1,0x0,0x1f;"
"}"
:"=f"(output) : "f"(input)
);
return output;
}
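/*
shfl.idx.b32 above is the pre-Volta form of the warp shuffle; on sm_70 and later
the synchronized intrinsic is the safer equivalent. A sketch of the same lane-0
broadcast (an alternative, not part of the original code):
*/
__device__ __forceinline__
float broadcastSync(float input)
{
    /* every active lane receives lane 0's value */
    return __shfl_sync(0xffffffff, input, 0);
}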
/*
use a warp-level broadcast to optimize the softmax computation: lane 0 of each
warp loads the per-row max and sum once and shares them with the remaining
lanes, avoiding redundant global-memory reads
*/
__global__
void KernelSoftmaxComputeTensorUseBroadcast(DTYPE * input, DTYPE * max, DTYPE * sum, DTYPE * output,
int stride, int strideNum, int blockNum)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
int i2 = j % stride;
int blockSize = stride * strideNum;
if (j < stride * blockNum) {
DTYPE sumData, maxData;
if (i % 32 == 0) {
sumData = sum[j];
maxData = max[j];
}
sumData = broadcast(sumData);
maxData = broadcast(maxData);
if (i < strideNum){
int offset = int(j / stride) * blockSize + i * stride + i2;
output[offset] = exp(input[offset] - maxData) / sumData;
}
}
}
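/* correctness note: the broadcast relies on each warp holding 32 consecutive
   x-threads of the same row j, so lane 0 is exactly the thread with i % 32 == 0;
   the other lanes pass an uninitialized value into broadcast(), which is safe
   because only lane 0's register is read. The launch logic in _CudaSoftmaxSumMax
   below pads cudaBlockSize[0] up to a full warp for this reason */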
/*
softmax y = e^x / \sum_{i} e^{x_i} (Cuda version) softmax y = e^x / \sum_{i} e^{x_i} (Cuda version)
>> x - x vector >> x - x vector
>> y - result >> y - result
...@@ -183,20 +227,42 @@ void _CudaSoftmaxSumMax(const XTensor * x, XTensor * y, int leadDim, XTensor * s ...@@ -183,20 +227,42 @@ void _CudaSoftmaxSumMax(const XTensor * x, XTensor * y, int leadDim, XTensor * s
int cudaGridSize[3]; int cudaGridSize[3];
int cudaBlockSize[3]; int cudaBlockSize[3];
GDevs.GetCudaThread2D(x->devID, stride * blockNum, dimensionSize, MAX_INT, cudaGridSize, cudaBlockSize); if (leadDim != 0 || dimensionSize <= 10){
/* allocate thread num for old function */
GDevs.GetCudaThread2D(x->devID, stride * blockNum, dimensionSize, MAX_INT, cudaGridSize, cudaBlockSize);
}
else {
/* allocate thread num for new function */
GDevs.GetCudaThread2D(x->devID, dimensionSize, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
if (cudaBlockSize[0] < 32) {
/* use at least a warp */
cudaBlockSize[0] = 32;
if (cudaBlockSize[1] > 32) {
cudaGridSize[1] = int(ceil(float(stride * blockNum) / 32));
cudaBlockSize[1] = 32;
}
}
}
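/* a plausible reading of the threshold above (an inference, not documented in the
   source): when the reduced dimension has 10 or fewer elements, a 32-thread warp
   per row would sit mostly idle, so the original kernel is kept for small dims */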
int devIDBackup; int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup); ProtectCudaDev(x->devID, devIDBackup);
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){ if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
KernelSoftmaxComputeTensor<<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>> if (leadDim != 0 || dimensionSize <= 10) {
((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data, KernelSoftmaxComputeTensor <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
stride, dimensionSize, stride * dimensionSize, blockNum, stride * blockNum); ((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data,
stride, dimensionSize, stride * dimensionSize, blockNum, stride * blockNum);
}
else {
KernelSoftmaxComputeTensorUseBroadcast <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data,
stride, dimensionSize, blockNum);
}
} }
else if(x->dataType == X_FLOAT16 && y->dataType == X_FLOAT16){ else if(x->dataType == X_FLOAT16 && y->dataType == X_FLOAT16){
KernelSoftmaxComputeTensor<<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>> KernelSoftmaxComputeTensor <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
((__half*)x->data, (__half*)max->data, (__half*)sum->data, (__half*)y->data, ((__half*)x->data, (__half*)max->data, (__half*)sum->data, (__half*)y->data,
stride, dimensionSize, blockNum); stride, dimensionSize, blockNum);
} }
else{ else{
ShowNTErrors("TODO!"); ShowNTErrors("TODO!");
...@@ -239,6 +305,9 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -239,6 +305,9 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
CheckNTErrors((x->devID == y->devID), "Matrices used in log softmax are not on the same GPU."); CheckNTErrors((x->devID == y->devID), "Matrices used in log softmax are not on the same GPU.");
CheckNTErrors((y->order >= 1), "Empty tensor!"); CheckNTErrors((y->order >= 1), "Empty tensor!");
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){ if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
CheckNTErrors((lossName == CROSSENTROPY || CheckNTErrors((lossName == CROSSENTROPY ||
...@@ -284,8 +353,14 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -284,8 +353,14 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
/* make a matrix to keep \beta */ /* make a matrix to keep \beta */
XTensor * beta = new XTensor(y->order - 1, dimSize, y->dataType, y->denseRatio, y->devID, mem); XTensor * beta = new XTensor(y->order - 1, dimSize, y->dataType, y->denseRatio, y->devID, mem);
ytmp->data = mem->AllocBuf(mem->devID, y->unitNum * y->unitSize); if(mem != NULL){
beta->data = mem->AllocBuf(mem->devID, beta->unitNum * beta->unitSize); ytmp->data = mem->AllocBuf(mem->devID, y->unitNum * y->unitSize);
beta->data = mem->AllocBuf(mem->devID, beta->unitNum * beta->unitSize);
}
else{
ytmp->data = XMemAlloc(y->devID, y->unitNum * y->unitSize);
beta->data = XMemAlloc(y->devID, beta->unitNum * beta->unitSize);
}
/* \beta = \sum_i (dE/dy_i * y_i) */ /* \beta = \sum_i (dE/dy_i * y_i) */
_Multiply(dedy, y, ytmp, 0, 0); _Multiply(dedy, y, ytmp, 0, 0);
...@@ -298,8 +373,18 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -298,8 +373,18 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
/* dE/ds_j = y_j * ytmp = y_j * (dE/dy_j - \beta) */ /* dE/ds_j = y_j * ytmp = y_j * (dE/dy_j - \beta) */
_Multiply(y, ytmp, dedx, 0, 0); _Multiply(y, ytmp, dedx, 0, 0);
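/* derivation (standard softmax Jacobian): with y_j = e^{x_j} / \sum_i e^{x_i},
   dy_j/dx_k = y_j * (delta_{jk} - y_k), hence
   dE/dx_k = \sum_j dE/dy_j * y_j * (delta_{jk} - y_k)
           = y_k * (dE/dy_k - \beta), where \beta = \sum_j dE/dy_j * y_j */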
mem->ReleaseBuf(mem->devID, y->unitNum * y->unitSize);
mem->ReleaseBuf(mem->devID, beta->unitNum * beta->unitSize); if(mem != NULL){
mem->ReleaseBuf(mem->devID, y->unitNum * y->unitSize);
mem->ReleaseBuf(mem->devID, beta->unitNum * beta->unitSize);
}
else{
XMemFree(y->devID, ytmp->data);
XMemFree(y->devID, beta->data);
}
ytmp->data = NULL;
beta->data = NULL;
delete[] dimSize; delete[] dimSize;
delete ytmp; delete ytmp;
...@@ -311,6 +396,8 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -311,6 +396,8 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
} }
else else
ShowNTErrors("TODO!"); ShowNTErrors("TODO!");
BacktoCudaDev(x->devID, devIDBackup);
} }
#endif #endif
......
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/ */
#include "../core/math/Unary.h"
#include "TAbsolute.h" #include "TAbsolute.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -30,14 +31,14 @@ Set every entry to its absolute value. ...@@ -30,14 +31,14 @@ Set every entry to its absolute value.
bool TestAbsolute1() bool TestAbsolute1()
{ {
/* a tensor of size (3, 2) */ /* a tensor of size (3, 2) */
int aOrder = 2; int order = 2;
int * aDimSize = new int[aOrder]; int * dimSize = new int[order];
aDimSize[0] = 3; dimSize[0] = 3;
aDimSize[1] = 2; dimSize[1] = 2;
int aUnitNum = 1; int unitNum = 1;
for (int i = 0; i < aOrder; i++) for (int i = 0; i < order; i++)
aUnitNum *= aDimSize[i]; unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, -2.0F}, DTYPE aData[3][2] = { {1.0F, -2.0F},
{0.5F, -4.0F}, {0.5F, -4.0F},
...@@ -50,14 +51,14 @@ bool TestAbsolute1() ...@@ -50,14 +51,14 @@ bool TestAbsolute1()
bool cpuTest = true; bool cpuTest = true;
/* create tensors */ /* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize); XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(aOrder, aDimSize); XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize); XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser; XTensor bUser;
/* initialize variables */ /* initialize variables */
a->SetData(aData, aUnitNum); a->SetData(aData, unitNum);
aMe->SetData(aData, aUnitNum); aMe->SetData(aData, unitNum);
/* call Absolute function */ /* call Absolute function */
_Absolute(a, b); _Absolute(a, b);
...@@ -65,21 +66,21 @@ bool TestAbsolute1() ...@@ -65,21 +66,21 @@ bool TestAbsolute1()
bUser = Absolute(*a); bUser = Absolute(*a);
/* check results */ /* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F); cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
bool gpuTest = true; bool gpuTest = true;
/* create tensor */ /* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU; XTensor bUserGPU;
/* Initialize variables */ /* Initialize variables */
aGPU->SetData(aData, aUnitNum); aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, aUnitNum); aMeGPU->SetData(aData, unitNum);
/* call Absolute function */ /* call Absolute function */
_Absolute(aGPU, bGPU); _Absolute(aGPU, bGPU);
...@@ -87,7 +88,7 @@ bool TestAbsolute1() ...@@ -87,7 +88,7 @@ bool TestAbsolute1()
bUserGPU = Absolute(*aGPU); bUserGPU = Absolute(*aGPU);
/* check results */ /* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F); gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */ /* destroy variables */
delete a; delete a;
...@@ -96,7 +97,7 @@ bool TestAbsolute1() ...@@ -96,7 +97,7 @@ bool TestAbsolute1()
delete aGPU; delete aGPU;
delete bGPU; delete bGPU;
delete aMeGPU; delete aMeGPU;
delete[] aDimSize; delete[] dimSize;
return cpuTest && gpuTest; return cpuTest && gpuTest;
#else #else
...@@ -104,7 +105,7 @@ bool TestAbsolute1() ...@@ -104,7 +105,7 @@ bool TestAbsolute1()
delete a; delete a;
delete b; delete b;
delete aMe; delete aMe;
delete[] aDimSize; delete[] dimSize;
return cpuTest; return cpuTest;
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -22,7 +22,6 @@ ...@@ -22,7 +22,6 @@
#ifndef __TEST_ABSOLUTE_H__ #ifndef __TEST_ABSOLUTE_H__
#define __TEST_ABSOLUTE_H__ #define __TEST_ABSOLUTE_H__
#include "../core/arithmetic/Absolute.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/
#include "../XTensor.h"
#include "TClip.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Clip function.
Clip every entry to the range [-1.0, 1.0].
*/
bool TestClip1()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
DTYPE aData[3][2] = { {1.0F, -2.0F},
{0.0F, 4.0F},
{5.0F, -6.0F} };
DTYPE answer[3][2] = { {1.0F, -1.0F},
{0.0F, 1.0F},
{1.0F, -1.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(aOrder, aDimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
aMe->SetData(aData, aUnitNum);
/* call Clip function */
_Clip(a, b, -1.0F, 1.0F);
_ClipMe(aMe, -1.0F, 1.0F);
bUser = Clip(*a, -1.0F, 1.0F);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) &&
aMe->CheckData(answer, aUnitNum, 1e-4F) &&
bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
aMeGPU->SetData(aData, aUnitNum);
/* call Clip function */
_Clip(aGPU, bGPU, -1.0F, 1.0F);
_ClipMe(aMeGPU, -1.0F, 1.0F);
bUserGPU = Clip(*aGPU, -1.0F, 1.0F);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) &&
aMeGPU->CheckData(answer, aUnitNum, 1e-4F) &&
bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] aDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] aDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Clip Function */
bool TestClip()
{
XPRINT(0, stdout, "[TEST Clip] set every entry to its clip value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestClip1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -16,19 +16,19 @@ ...@@ -16,19 +16,19 @@
*/ */
/* /*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-06-15 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#ifndef __TEST_MATRIXMULBATCHEDCPU_H__ #ifndef __TEST_CLIP_H__
#define __TEST_MATRIXMULBATCHEDCPU_H__ #define __TEST_CLIP_H__
#include "../core/arithmetic/MatrixMULBatchedCPU.h" #include "../core/math/Clip.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for MatrixMulBatchedCPU Function */ /* test for Clip Function */
extern "C" extern "C"
bool TestMatrixMulBatchedCPU(); bool TestClip();
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_MATRIXMULBATCHEDCPU_H__ #endif // __TEST_CLIP_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TCos.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Cos function.
Set every entry to its cosine value.
*/
bool TestCos1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {0.5403F, -0.4161F},
{0.5403F, -0.4161F},
{1.0F, 0.8776F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Cos function */
_Cos(a, b);
_CosMe(aMe);
bUser = Cos(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Cos function */
_Cos(aGPU, bGPU);
_CosMe(aMeGPU);
bUserGPU = Cos(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Cos Function */
bool TestCos()
{
XPRINT(0, stdout, "[TEST Cos] set every entry to its cosine value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestCos1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_SIN_H__
#define __TEST_SIN_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Sin Function */
bool TestSin();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SIN_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_COS_H__
#define __TEST_COS_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Cos Function */
bool TestCos();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_COS_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "TDiv.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0.
*/
bool TestDiv1()
{
/* a source tensor of size (2, 2) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 2;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 2) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 2;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 2) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 2;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE sData2[2][2] = { {1.0F, 1.0F},
{4.0F, 9.0F} };
DTYPE answer[2][2] = { {0.0F, 1.0F},
{0.5F, 0.3333F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
XTensor * tMe = NewTensor(tOrder, tDimSize);
XTensor tUser;
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
tMe->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call Div function */
_Div(s1, s2, t, 0, 0);
_DivMe(tMe, s2, 0, 0);
tUser = Div(*s1, *s2, 0);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum, 1e-4F) &&
tMe->CheckData(answer, tUnitNum, 1e-4F) &&
tUser.CheckData(answer, tUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor tUserGPU;
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
tMeGPU->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call Div function */
_Div(sGPU1, sGPU2, tGPU, 0, 0);
_DivMe(tMeGPU, sGPU2, 0, 0);
tUserGPU = Div(*sGPU1, *sGPU2, 0);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum, 1e-4F) &&
tMeGPU->CheckData(answer, tUnitNum, 1e-4F) &&
tUserGPU.CheckData(answer, tUnitNum, 1e-4F);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete tMeGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Div Function */
bool TestDiv()
{
XPRINT(0, stdout, "[TEST Div] element-wise division of two tensors \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestDiv1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __TEST_DIV_H__
#define __TEST_DIV_H__
#include "../core/arithmetic/Div.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Div Function */
extern "C"
bool TestDiv();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_DIV_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TExp.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Exp function.
Set every entry to its exponential value.
*/
bool TestExp1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {2.7183F, 7.3891F},
{0.3679F, 0.1353F},
{1.0F, 1.6487F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Exp function */
_Exp(a, b);
_ExpMe(aMe);
bUser = Exp(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) &&
aMe->CheckData(answer, unitNum, 1e-4F) &&
bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Exp function */
_Exp(aGPU, bGPU);
_ExpMe(aMeGPU);
bUserGPU = Exp(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) &&
aMeGPU->CheckData(answer, unitNum, 1e-4F) &&
bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Exp Function */
bool TestExp()
{
XPRINT(0, stdout, "[TEST Exp] set every entry to its exponent value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestExp1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -16,20 +16,16 @@ ...@@ -16,20 +16,16 @@
*/ */
/* /*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/ */
#ifndef __MATRIXMULBATCHEDCPU_H__ #ifndef __TEST_EXP_H__
#define __MATRIXMULBATCHEDCPU_H__ #define __TEST_EXP_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* matrix multiplication in batch mode (CPU code) */ /* test for Exp Function */
void _MatrixMULBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA, const XList * b, MATRIX_TRANS_TYPE transposedB, bool TestExp();
XList * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_EXP_H__
#endif // __MATRIXMULBATCHEDCPU_H__
\ No newline at end of file
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/ */
#include "../core/math/Unary.h"
#include "TLog.h" #include "TLog.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -30,14 +31,14 @@ Set every entry to its log value. ...@@ -30,14 +31,14 @@ Set every entry to its log value.
bool TestLog1() bool TestLog1()
{ {
/* a tensor of size (3, 2) */ /* a tensor of size (3, 2) */
int aOrder = 2; int order = 2;
int * aDimSize = new int[aOrder]; int * dimSize = new int[order];
aDimSize[0] = 3; dimSize[0] = 3;
aDimSize[1] = 2; dimSize[1] = 2;
int aUnitNum = 1; int unitNum = 1;
for (int i = 0; i < aOrder; i++) for (int i = 0; i < order; i++)
aUnitNum *= aDimSize[i]; unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F}, DTYPE aData[3][2] = { {1.0F, 2.0F},
{0.5F, 4.0F}, {0.5F, 4.0F},
...@@ -50,14 +51,14 @@ bool TestLog1() ...@@ -50,14 +51,14 @@ bool TestLog1()
bool cpuTest = true; bool cpuTest = true;
/* create tensors */ /* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize); XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(aOrder, aDimSize); XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize); XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser; XTensor bUser;
/* initialize variables */ /* initialize variables */
a->SetData(aData, aUnitNum); a->SetData(aData, unitNum);
aMe->SetData(aData, aUnitNum); aMe->SetData(aData, unitNum);
/* call Log function */ /* call Log function */
_Log(a, b); _Log(a, b);
...@@ -65,21 +66,21 @@ bool TestLog1() ...@@ -65,21 +66,21 @@ bool TestLog1()
bUser = Log(*a); bUser = Log(*a);
/* check results */ /* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F); cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
bool gpuTest = true; bool gpuTest = true;
/* create tensor */ /* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU; XTensor bUserGPU;
/* Initialize variables */ /* Initialize variables */
aGPU->SetData(aData, aUnitNum); aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, aUnitNum); aMeGPU->SetData(aData, unitNum);
/* call Log function */ /* call Log function */
_Log(aGPU, bGPU); _Log(aGPU, bGPU);
...@@ -87,7 +88,7 @@ bool TestLog1() ...@@ -87,7 +88,7 @@ bool TestLog1()
bUserGPU = Log(*aGPU); bUserGPU = Log(*aGPU);
/* check results */ /* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F); gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */ /* destroy variables */
delete a; delete a;
...@@ -96,7 +97,7 @@ bool TestLog1() ...@@ -96,7 +97,7 @@ bool TestLog1()
delete aGPU; delete aGPU;
delete bGPU; delete bGPU;
delete aMeGPU; delete aMeGPU;
delete[] aDimSize; delete[] dimSize;
return cpuTest && gpuTest; return cpuTest && gpuTest;
#else #else
...@@ -104,7 +105,7 @@ bool TestLog1() ...@@ -104,7 +105,7 @@ bool TestLog1()
delete a; delete a;
delete b; delete b;
delete aMe; delete aMe;
delete[] aDimSize; delete[] dimSize;
return cpuTest; return cpuTest;
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -22,8 +22,6 @@ ...@@ -22,8 +22,6 @@
#ifndef __TEST_LOG_H__ #ifndef __TEST_LOG_H__
#define __TEST_LOG_H__ #define __TEST_LOG_H__
#include "../core/math/Log.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Log Function */ /* test for Log Function */
......
...@@ -16,8 +16,8 @@ ...@@ -16,8 +16,8 @@
*/ */
/* /*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02
*/ */
#ifndef __TEST_LOGSOFTMAX_H__ #ifndef __TEST_LOGSOFTMAX_H__
#define __TEST_LOGSOFTMAX_H__ #define __TEST_LOGSOFTMAX_H__
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-06-15
*/
#include "TMatrixMULBatchedCPU.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: matrix multiplication in batch mode (CPU code).
In this case, aList=2*(2, 3), bList=2*(3, 2) -> c=2*(2, 2), transposedA=X_NOTRANS, transposedB=X_NOTRANS.
*/
bool TestMatrixMulBatchedCPU1()
{
/* create list */
XList * aList = new XList();
XList * bList = new XList();
XList * cList = new XList();
/* a source tensor of size (2, 3) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 3;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a source tensor of size (3, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 3;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
/* a target tensor of size (2, 2) */
int cOrder = 2;
int * cDimSize = new int[cOrder];
cDimSize[0] = 2;
cDimSize[1] = 2;
int cUnitNum = 1;
for (int i = 0; i < cOrder; i++)
cUnitNum *= cDimSize[i];
DTYPE aData1[2][3] = { {1.0F, 2.0F, 3.0F},
{-4.0F, 5.0F, 6.0F} };
DTYPE aData2[2][3] = { {1.0F, -2.0F, -3.0F},
{-4.0F, 3.0F, 2.0F} };
DTYPE bData1[3][2] = { {0.0F, -1.0F},
{1.0F, 2.0F},
{2.0F, 1.0F} };
DTYPE bData2[3][2] = { {0.0F, 1.0F},
{3.0F, 2.0F},
{2.0F, 1.0F} };
DTYPE answer1[2][2] = { {8.0F, 6.0F},
{17.0F, 20.0F} };
DTYPE answer2[2][2] = { {-12.0F, -6.0F},
{13.0F, 4.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a1 = NewTensor(aOrder, aDimSize);
XTensor * a2 = NewTensor(aOrder, aDimSize);
XTensor * b1 = NewTensor(bOrder, bDimSize);
XTensor * b2 = NewTensor(bOrder, bDimSize);
XTensor * c1 = NewTensor(cOrder, cDimSize);
XTensor * c2 = NewTensor(cOrder, cDimSize);
/* initialize variables */
a1->SetData(aData1, aUnitNum);
a2->SetData(aData2, aUnitNum);
b1->SetData(bData1, bUnitNum);
b2->SetData(bData2, bUnitNum);
c1->SetZeroAll();
c2->SetZeroAll();
/* add tensors to list */
aList->Add(a1);
aList->Add(a2);
bList->Add(b1);
bList->Add(b2);
cList->Add(c1);
cList->Add(c2);
/* call MatrixMULBatchedCPU function */
_MatrixMULBatchedCPU(aList, X_NOTRANS, bList, X_NOTRANS, cList);
/* check results */
cpuTest = c1->CheckData(answer1, cUnitNum) && c2->CheckData(answer2, cUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensors */
XTensor * aGPU1 = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * aGPU2 = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU1 = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU2 = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU1 = NewTensor(cOrder, cDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU2 = NewTensor(cOrder, cDimSize, X_FLOAT, 1.0F, 0);
/* initialize variables */
aGPU1->SetData(aData1, aUnitNum);
aGPU2->SetData(aData2, aUnitNum);
bGPU1->SetData(bData1, bUnitNum);
bGPU2->SetData(bData2, bUnitNum);
cGPU1->SetZeroAll();
cGPU2->SetZeroAll();
/* clear list */
aList->Clear();
bList->Clear();
cList->Clear();
/* add tensors to list */
aList->Add(aGPU1);
aList->Add(aGPU2);
bList->Add(bGPU1);
bList->Add(bGPU2);
cList->Add(cGPU1);
cList->Add(cGPU2);
/* call MatrixMULBatchedCPU function */
_MatrixMULBatchedCPU(aList, X_NOTRANS, bList, X_NOTRANS, cList);
/* check results */
gpuTest = cGPU1->CheckData(answer1, cUnitNum) && gpuTest;
gpuTest = cGPU2->CheckData(answer2, cUnitNum) && gpuTest;
/* destroy variables */
delete a1;
delete a2;
delete b1;
delete b2;
delete c1;
delete c2;
delete aGPU1;
delete aGPU2;
delete bGPU1;
delete bGPU2;
delete cGPU1;
delete cGPU2;
delete[] aDimSize;
delete[] bDimSize;
delete[] cDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a1;
delete a2;
delete b1;
delete b2;
delete c1;
delete c2;
delete[] aDimSize;
delete[] bDimSize;
delete[] cDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for MatrixMulBatchedCPU Function */
extern "C"
bool TestMatrixMulBatchedCPU()
{
XPRINT(0, stdout, "[TEST MATRIXMULBATCHEDCPU] matrix multiplication in batch mode (CPU code) \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestMatrixMulBatchedCPU1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -25,133 +25,10 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -25,133 +25,10 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* /*
case 1: element-wise product of two tensors case 1: element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 1) (2, 1) -> (2, 1), leadingDim=0, alpha=0.
*/
bool TestMultiply1()
{
/* a source tensor of size (2, 1) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 1;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 1) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 1;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 1) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 1;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][1] = { {0.0F},
{1.0F} };
DTYPE sData2[2][1] = { {2.0F},
{3.0F} };
DTYPE answer[2][1] = { {0.0F},
{3.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
XTensor * tMe = NewTensor(tOrder, tDimSize);
XTensor tUser;
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
tMe->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call Multiply function */
_Multiply(s1, s2, t, 0, 0);
_MultiplyMe(tMe, s2, 0, 0);
tUser = Multiply(*s1, *s2, 0);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum)
&& tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor tUserGPU;
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
tMeGPU->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call Multiply function */
_Multiply(sGPU1, sGPU2, tGPU, 0, 0);
_MultiplyMe(tMeGPU, sGPU2, 0, 0);
tUserGPU = Multiply(*sGPU1, *sGPU2, 0);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum)
&& tMeGPU->CheckData(answer, tUnitNum) && tUserGPU.CheckData(answer, tUnitNum);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete tMeGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i) c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0. In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0.
*/ */
bool TestMultiply2() bool TestMultiply1()
{ {
/* a source tensor of size (2, 2) */ /* a source tensor of size (2, 2) */
int sOrder1 = 2; int sOrder1 = 2;
...@@ -212,8 +89,9 @@ bool TestMultiply2() ...@@ -212,8 +89,9 @@ bool TestMultiply2()
tUser = Multiply(*s1, *s2, 0); tUser = Multiply(*s1, *s2, 0);
/* check results */ /* check results */
cpuTest = t->CheckData(answer, tUnitNum) cpuTest = t->CheckData(answer, tUnitNum) &&
&& tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum); tMe->CheckData(answer, tUnitNum) &&
tUser.CheckData(answer, tUnitNum);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
...@@ -270,113 +148,6 @@ bool TestMultiply2() ...@@ -270,113 +148,6 @@ bool TestMultiply2()
#endif // USE_CUDA #endif // USE_CUDA
} }
/*
case 3: element-wise product of two tensors, c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=1, alpha=0.
*/
bool TestMultiply3()
{
/* a source tensor of size (2, 2) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 2;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 2) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 2;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 2) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 2;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE sData2[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE answer[2][2] = { {0.0F, 1.0F},
{4.0F, 9.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call MultiplyElementWise function */
_Multiply(s1, s2, t, 0, 1);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call MultiplyElementWise function */
_Multiply(sGPU1, sGPU2, tGPU, 0, 1);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */ /* other cases */
/* /*
TODO!! TODO!!
...@@ -398,26 +169,6 @@ bool TestMultiply() ...@@ -398,26 +169,6 @@ bool TestMultiply()
else else
XPRINT(0, stdout, ">> case 1 passed!\n"); XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestMultiply2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* case 3 test */
caseFlag = TestMultiply3();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 3 failed!\n");
}
else
XPRINT(0, stdout, ">> case 3 passed!\n");
/* other cases test */ /* other cases test */
/* /*
TODO!! TODO!!
......
...@@ -19,16 +19,17 @@ ...@@ -19,16 +19,17 @@
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15
*/ */
#ifndef __TEST_MULTIPLYELEMENTWISE_H__ #ifndef __TEST_MULTIPLY_H__
#define __TEST_MULTIPLYELEMENTWISE_H__ #define __TEST_MULTIPLY_H__
#include "../core/arithmetic/Multiply.h" #include "../core/arithmetic/Multiply.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for MultiplyElementWise Function */ /* test for Multiply Function */
extern "C" extern "C"
bool TestMultiply(); bool TestMultiply();
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_MULTIPLYELEMENTWISE_H__
#endif // __TEST_MULTIPLY_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TRound.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Round function.
Set every entry to its rounded value.
*/
bool TestRound1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.3F, 2.7F},
{-1.3F, -2.7F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {1.0F, 3.0F},
{-1.0F, -3.0F},
{0.0F, 1.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Round function */
_Round(a, b);
_RoundMe(aMe);
bUser = Round(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) &&
aMe->CheckData(answer, unitNum, 1e-4F) &&
bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Round function */
_Round(aGPU, bGPU);
_RoundMe(aMeGPU);
bUserGPU = Round(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) &&
aMeGPU->CheckData(answer, unitNum, 1e-4F) &&
bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
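/* A minimal usage sketch (hypothetical helper, illustration only, not part of
   the test suite), showing the three calling conventions that every case above
   exercises: function style writing into a preallocated result, object style
   returning a new XTensor, and the in-place "-Me" style. */
void RoundUsageSketch()
{
    int dims[2] = {3, 2};
    DTYPE data[3][2] = { {1.3F, 2.7F},
                         {-1.3F, -2.7F},
                         {0.0F, 0.5F} };

    XTensor * x = NewTensor(2, dims);
    XTensor * y = NewTensor(2, dims);
    x->SetData(data, 6);

    _Round(x, y);            /* function style: writes round(x) into y */
    XTensor z = Round(*x);   /* object style: returns the result tensor */
    _RoundMe(x);             /* in-place style: overwrites x */

    delete x;
    delete y;
}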
/* other cases */
/*
TODO!!
*/
/* test for Round Function */
bool TestRound()
{
XPRINT(0, stdout, "[TEST Round] set every entry to its round value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestRound1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
@@ -16,26 +16,16 @@
*/
/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-03
*/
-#include "Absolute.h"
+#ifndef __TEST_ROUND_H__
+#define __TEST_ROUND_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
-#ifdef USE_CUDA
-/* set each entry to its absolute value (CUDA Kernel) */
-__global__
-void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
-/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
-__global__
-void KernelAbsolute(__half * a, __half * b, int size);
-/* set each entry to its absolute value */
-void _CudaAbsolute(const XTensor * a, XTensor * b);
-#endif // USE_CUDA
+/* test for Round Function */
+bool TestRound();
} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_ROUND_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TSin.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Sin function.
Set every entry to its sine value.
*/
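/* The expected values below are sin(x) to four decimal places
   (e.g. sin(1.0) = 0.841471 -> 0.8415F), which is why the checks
   use a tolerance of 1e-4F. */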
bool TestSin1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {0.8415F, 0.9093F},
{-0.8415F, -0.9093F},
{0.0F, 0.4794F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Sin function */
_Sin(a, b);
_SinMe(aMe);
bUser = Sin(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Sin function */
_Sin(aGPU, bGPU);
_SinMe(aMeGPU);
bUserGPU = Sin(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Sin Function */
bool TestSin()
{
XPRINT(0, stdout, "[TEST Sin] set every entry to its sine value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSin1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
@@ -16,31 +16,16 @@
*/
/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
-#ifndef __LOG_CUH__
+#ifndef __TEST_SIN_H__
-#define __LOG_CUH__
+#define __TEST_SIN_H__
-#include "Log.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
-#ifdef USE_CUDA
-/* set each entry to its log value (CUDA Kernel) */
-__global__
-void KernelLog(DTYPE * a, DTYPE * b, int size);
-/* set each entry to its log value (CUDA Kernel) with float16 data type*/
-__global__
-void KernelLog(__half * a, __half * b, int size);
-/* set each entry to its log value */
-void _CudaLog(const XTensor * a, XTensor * b);
-#endif // USE_CUDA
+/* test for Sin Function */
+bool TestSin();
} // namespace nts(NiuTrans.Tensor)
-#endif // __LOG_CUH__
+#endif // __TEST_SIN_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "TSub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* case 1: tensor subtraction c = a - b * \beta */
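/* beta defaults to 1.0, so this case checks plain element-wise subtraction:
   e.g. answer[0][0] = 0.0F - 1.0F = -1.0F */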
bool TestSub1()
{
/* a tensor of size (2, 4) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 2;
dimSize[1] = 4;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
{-7.0F, -9.0F, -11.0F, -13.0F} };
DTYPE answer[2][4] = { {-1.0F, 2.0F, 5.0F, 8.0F},
{11.0F, 14.0F, 17.0F, 20.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * c = NewTensor(order, dimSize);
XTensor * cMe = NewTensor(order, dimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, unitNum);
cMe->SetData(aData, unitNum);
b->SetData(bData, unitNum);
c->SetZeroAll();
/* call Sub function */
_Sub(a, b, c);
_SubMe(cMe, b);
cUser = Sub(*a, *b);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
cMeGPU->SetData(aData, unitNum);
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
/* call Sub function */
_Sub(aGPU, bGPU, cGPU);
_SubMe(cMeGPU, bGPU);
cUserGPU = Sub(*aGPU, *bGPU);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
&& cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* case 2: tensor subtraction c = a - b * \beta */
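/* with beta = 0.5, each entry is c[i][j] = a[i][j] - b[i][j] * 0.5,
   e.g. answer[0][0] = 0.0F - 1.0F * 0.5F = -0.5F */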
bool TestSub2()
{
/* a tensor of size (2, 4) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 2;
dimSize[1] = 4;
int unitNum = 1;
for (int i = 0; i < order; i++) {
unitNum *= dimSize[i];
}
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
{-7.0F, -9.0F, -11.0F, -13.0F} };
DTYPE answer[2][4] = { {-0.5F, 1.5F, 3.5F, 5.5F},
{7.5F, 9.5F, 11.5F, 13.5F} };
float beta = 0.5F;
/* CPU test */
bool cpuTest = true;
/* create tensor */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * c = NewTensor(order, dimSize);
XTensor * cMe = NewTensor(order, dimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, unitNum);
cMe->SetData(aData, unitNum);
b->SetData(bData, unitNum);
c->SetZeroAll();
/* call Sub function */
_Sub(a, b, c, beta);
_SubMe(cMe, b, beta);
cUser = Sub(*a, *b, beta);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
cMeGPU->SetData(aData, unitNum);
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
/* call Sub function */
_Sub(aGPU, bGPU, cGPU, beta);
_SubMe(cMeGPU, bGPU, beta);
cUserGPU = Sub(*aGPU, *bGPU, beta);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
&& cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
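/* A minimal usage sketch (hypothetical helper, illustration only, not part of
   the test suite): the same three calling conventions apply to the scaled
   subtraction c = a - b * beta tested above; beta may be omitted, in which
   case it defaults to 1.0 as in case 1. */
void SubUsageSketch()
{
    int dims[2] = {2, 4};
    DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
                          {4.0F, 5.0F, 6.0F, 7.0F} };
    DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
                          {-7.0F, -9.0F, -11.0F, -13.0F} };

    XTensor * a = NewTensor(2, dims);
    XTensor * b = NewTensor(2, dims);
    XTensor * c = NewTensor(2, dims);
    a->SetData(aData, 8);
    b->SetData(bData, 8);

    _Sub(a, b, c, 0.5F);             /* function style: c = a - b * 0.5 */
    XTensor d = Sub(*a, *b, 0.5F);   /* object style: returns a new tensor */
    _SubMe(a, b, 0.5F);              /* in-place style: a = a - b * 0.5 */

    delete a;
    delete b;
    delete c;
}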
/* other cases */
/*
TODO!!
*/
/* test for Sub Function */
bool TestSub()
{
XPRINT(0, stdout, "[TEST SUB] tensor subtraction c = a - b * beta\n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSub1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestSub2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __TEST_SUB_H__
#define __TEST_SUB_H__
#include "../core/arithmetic/Sub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Sub Function */
bool TestSub();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SUB_H__
@@ -16,8 +16,8 @@
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
*/
#include "TSum.h"
@@ -59,14 +59,14 @@ bool TestSum1()
b->SetData(bData, unitNum);
c->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(a, b, c);
_SumMe(cMe, b);
cUser = Sum(*a, *b);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
@@ -85,14 +85,14 @@ bool TestSum1()
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(aGPU, bGPU, cGPU);
_SumMe(cMeGPU, bGPU);
cUserGPU = Sum(*aGPU, *bGPU);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum)
&& cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum);
/* destroy variables */
delete a;
@@ -155,14 +155,14 @@ bool TestSum2()
b->SetData(bData, unitNum);
c->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(a, b, c, beta);
_SumMe(cMe, b, beta);
cUser = Sum(*a, *b, beta);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
@@ -181,14 +181,14 @@ bool TestSum2()
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(aGPU, bGPU, cGPU, beta);
_SumMe(cMeGPU, bGPU, beta);
cUserGPU = Sum(*aGPU, *bGPU, beta);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum)
&& cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum);
/* destroy variables */
delete a;
......
@@ -16,8 +16,8 @@
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
*/
#ifndef __TEST_SUM_H__
#define __TEST_SUM_H__
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
*/
#include "TSumDim.h"
#include "../core/arithmetic/SumDim.h"
#include "../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: tensor summation c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
*/
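/* here b matches dimension 0 of a (size 2), so each row i of a is shifted
   by b[i]: row 0 gets +1.0F, row 1 gets -1.0F */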
bool TestSumDim1()
{
/* a tensor of size (2, 4) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 4;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2) */
int bOrder = 1;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2] = {1.0F, -1.0F};
DTYPE answer[2][4] = { {1.0F, 2.0F, 3.0F, 4.0F},
{3.0F, 4.0F, 5.0F, 6.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * c = NewTensor(aOrder, aDimSize);
XTensor * cMe = NewTensor(aOrder, aDimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
cMe->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
c->SetZeroAll();
/* call SumDim function */
_SumDim(a, b, c, 0);
_SumDim(cMe, b, 0);
cUser = SumDim(*a, *b, 0);
/* check results */
cpuTest = c->CheckData(answer, aUnitNum)
&& cMe->CheckData(answer, aUnitNum)
&& cUser.CheckData(answer, aUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
cMeGPU->SetData(aData, aUnitNum);
bGPU->SetData(bData, bUnitNum);
cGPU->SetZeroAll();
/* call sum function */
_SumDim(aGPU, bGPU, cGPU, 0);
_SumDim(cMeGPU, bGPU, 0);
cUserGPU = SumDim(*aGPU, *bGPU, 0);
/* check results */
gpuTest = cGPU->CheckData(answer, aUnitNum)
&& cMeGPU->CheckData(answer, aUnitNum)
&& cUserGPU.CheckData(answer, aUnitNum);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: tensor summation c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
*/
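/* here b holds 4 entries, matching the size of dimension 1 of a, so its
   flattened values {1, -1, -1, 1} are added along each row:
   answer[0] = {0+1, 1-1, 2-1, 3+1} = {1, 0, 1, 4} */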
bool TestSumDim2()
{
/* a tensor of size (2, 4) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 4;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][2] = { {1.0F, -1.0F},
{-1.0F, 1.0F} };
DTYPE answer[2][4] = { {1.0F, 0.0F, 1.0F, 4.0F},
{5.0F, 4.0F, 5.0F, 8.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * c = NewTensor(aOrder, aDimSize);
XTensor * cMe = NewTensor(aOrder, aDimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
cMe->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
c->SetZeroAll();
/* call SumDim function */
_SumDim(a, b, c, 1);
_SumDim(cMe, b, 1);
cUser = SumDim(*a, *b, 1);
/* check results */
cpuTest = c->CheckData(answer, aUnitNum)
&& cMe->CheckData(answer, aUnitNum)
&& cUser.CheckData(answer, aUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
cMeGPU->SetData(aData, aUnitNum);
bGPU->SetData(bData, bUnitNum);
cGPU->SetZeroAll();
/* call sum function */
_SumDim(aGPU, bGPU, cGPU, 1);
_SumDim(cMeGPU, bGPU, 1);
cUserGPU = SumDim(*aGPU, *bGPU, 1);
/* check results */
gpuTest = cGPU->CheckData(answer, aUnitNum)
&& cMeGPU->CheckData(answer, aUnitNum)
&& cUserGPU.CheckData(answer, aUnitNum);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
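/* A minimal usage sketch (hypothetical helper, illustration only, not part of
   the test suite): a common use of the broadcasting tested above is adding a
   bias vector to every row of a matrix along the matching dimension. */
void SumDimUsageSketch()
{
    int xDims[2] = {2, 4};
    int biasDims[1] = {4};
    DTYPE xData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
                          {4.0F, 5.0F, 6.0F, 7.0F} };
    DTYPE biasData[4] = {0.1F, 0.2F, 0.3F, 0.4F};

    XTensor * x = NewTensor(2, xDims);
    XTensor * bias = NewTensor(1, biasDims);
    x->SetData(xData, 8);
    bias->SetData(biasData, 4);

    /* bias has the same size as dimension 1 of x, so it is broadcast
       over rows, in place: x[i][j] += bias[j] */
    _SumDim(x, bias, 1);

    delete x;
    delete bias;
}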
/* other cases */
/*
TODO!!
*/
/* test for SumDim Function */
bool TestSumDim()
{
XPRINT(0, stdout, "[TEST SUMDIM] tensor summation c = a + b * beta by broadcasting\n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSumDim1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestSumDim2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
 * I finished my summer holidays and went back to studying.
*/
#ifndef __TEST_SUMDIM_H__
#define __TEST_SUMDIM_H__
#include "../core/arithmetic/SumDim.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for SumDim Function */
extern "C"
bool TestSumDim();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SUMDIM_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TTan.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Tan function.
Set every entry to its tangent value.
*/
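/* Reference values are tan(x) to four decimal places; note that
   tan(2.0) = -2.1850F is negative because 2 rad lies past pi/2. */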
bool TestTan1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {1.5574F, -2.1850F},
{-1.5574F, 2.1850F},
{0.0F, 0.5463F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Tan function */
_Tan(a, b);
_TanMe(aMe);
bUser = Tan(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Tan function */
_Tan(aGPU, bGPU);
_TanMe(aMeGPU);
bUserGPU = Tan(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Tan Function */
bool TestTan()
{
XPRINT(0, stdout, "[TEST Tan] set every entry to its tangent value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestTan1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_TAN_H__
#define __TEST_TAN_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Tan Function */
bool TestTan();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_TAN_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/
#include "TTranspose.h"
#include "../core/movement/CopyValues.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Transpose function.
tensor transposition of dimensions i and j
*/
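/* for a 2-D tensor this is the ordinary matrix transpose:
   b[j][i] = a[i][j], so a of shape (3, 2) becomes b of shape (2, 3) */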
bool TestTranspose1()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 3) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 3;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} };
DTYPE answer[2][3] = { {1.0F, 3.0F, 5.0F},
{2.0F, 4.0F, 6.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(a, b, 0, 1);
bUser = Transpose(*a, 0, 1);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
&& bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(aGPU, bGPU, 0, 1);
bUserGPU = Transpose(*aGPU, 0, 1);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
&& bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: test Transpose function.
tensor transposition of dimensions i and j
*/
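/* Note: the expected data below is consistent with viewing a (4, 3, 2) as a
   (4, 6) matrix, transposing it to (6, 4), and reshaping the result to
   (2, 3, 4); i.e. dimension 0 is exchanged with the flattened trailing
   dimensions rather than with dimension 2 entry-wise. */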
bool TestTranspose2()
{
/* a tensor of size (4, 3, 2) */
int aOrder = 3;
int * aDimSize = new int[aOrder];
aDimSize[0] = 4;
aDimSize[1] = 3;
aDimSize[2] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 3, 4) */
int bOrder = 3;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 3;
bDimSize[2] = 4;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[4][3][2] = { { {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} },
{ {2.0F, 4.0F},
{4.0F, 7.0F},
{6.0F, 8.0F} },
{ {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} },
{ {2.0F, 4.0F},
{4.0F, 7.0F},
{6.0F, 8.0F} },};
DTYPE answer[2][3][4] = { { {1.0F, 2.0F, 1.0F, 2.0F},
{2.0F, 4.0F, 2.0F, 4.0F},
{3.0F, 4.0F, 3.0F, 4.0F} },
{ {4.0F, 7.0F, 4.0F, 7.0F},
{5.0F, 6.0F, 5.0F, 6.0F},
{6.0F, 8.0F, 6.0F, 8.0F} } };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(a, b, 0, 2);
bUser = Transpose(*a, 0, 2);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
&& bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(aGPU, bGPU, 0, 2);
bUserGPU = Transpose(*aGPU, 0, 2);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
&& bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Transpose Function */
bool TestTranspose()
{
XPRINT(0, stdout, "[TEST TRANSPOSE] tensor transposition with specified dimensions \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestTranspose1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestTranspose2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
*/
#ifndef __TEST_TRANSPOSE_H__
#define __TEST_TRANSPOSE_H__
#include "../core/shape/Transpose.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Transpose Function */
bool TestTranspose();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_TRANSPOSE_H__
@@ -30,17 +30,20 @@ bool Test()
XPRINT(0, stdout, "Testing the XTensor utilities ... \n\n");
wrong = !TestAbsolute() || wrong;
+wrong = !TestClip() || wrong;
wrong = !TestConcatenate() || wrong;
wrong = !TestConcatenateSolely() || wrong;
+wrong = !TestCos() || wrong;
wrong = !TestConvertDataType() || wrong;
wrong = !TestCopyIndexed() || wrong;
wrong = !TestCopyValues() || wrong;
+wrong = !TestDiv() || wrong;
+wrong = !TestExp() || wrong;
wrong = !TestLog() || wrong;
wrong = !TestMatrixMul() || wrong;
wrong = !TestMatrixMul2D() || wrong;
wrong = !TestMatrixMul2DParallel() || wrong;
wrong = !TestMatrixMulBatched() || wrong;
-wrong = !TestMatrixMulBatchedCPU() || wrong;
wrong = !TestMerge() || wrong;
wrong = !TestMultiply() || wrong;
wrong = !TestNegate() || wrong;
@@ -51,17 +54,23 @@ bool Test()
wrong = !TestReduceSum() || wrong;
wrong = !TestReduceSumSquared() || wrong;
wrong = !TestReduceVariance() || wrong;
+wrong = !TestRound() || wrong;
wrong = !TestScaleAndShift() || wrong;
wrong = !TestSelect() || wrong;
wrong = !TestSetAscendingOrder() || wrong;
wrong = !TestSetData() || wrong;
wrong = !TestSign() || wrong;
+wrong = !TestSin() || wrong;
wrong = !TestSort() || wrong;
wrong = !TestSplit() || wrong;
+wrong = !TestSub() || wrong;
wrong = !TestSum() || wrong;
wrong = !TestSumByColumnTV() || wrong;
wrong = !TestSumByColumnVT() || wrong;
-wrong = !TestTopK() || wrong;
+wrong = !TestSumDim() || wrong;
+wrong = !TestTan() || wrong;
+wrong = !TestTranspose() || wrong;
+//wrong = !TestTopK() || wrong;
wrong = !TestUnsqueeze() || wrong;
wrong = !TestXMem() || wrong;
......
@@ -23,17 +23,20 @@
#define __TEST_H__
#include "TAbsolute.h"
+#include "TClip.h"
#include "TConcatenate.h"
#include "TConcatenateSolely.h"
+#include "TCos.h"
#include "TConvertDataType.h"
#include "TCopyIndexed.h"
#include "TCopyValues.h"
+#include "TDiv.h"
+#include "TExp.h"
#include "TLog.h"
#include "TMatrixMul.h"
#include "TMatrixMul2D.h"
#include "TMatrixMul2DParallel.h"
#include "TMatrixMulBatched.h"
-#include "TMatrixMULBatchedCPU.h"
#include "TMerge.h"
#include "TMultiply.h"
#include "TNegate.h"
@@ -44,16 +47,22 @@
#include "TReduceSum.h"
#include "TReduceSumSquared.h"
#include "TReduceVariance.h"
+#include "TRound.h"
#include "TScaleAndShift.h"
#include "TSelect.h"
#include "TSetAscendingOrder.h"
#include "TSetData.h"
#include "TSign.h"
+#include "TSin.h"
#include "TSort.h"
#include "TSplit.h"
+#include "TSub.h"
#include "TSum.h"
#include "TSumByColumnTV.h"
#include "TSumByColumnVT.h"
+#include "TSumDim.h"
+#include "TTan.h"
+#include "TTranspose.h"
#include "TTopK.h"
#include "TUnsqueeze.h"
#include "TXMem.h"
......