Commit d664c0a0 by xuchen

1. add macro to implement unary function 2. add sub and div function 3. merge code with the latest branch of xiaotong-working
parent 7e9d7015
<script type="text/javascript" async src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"> </script>
# NiuTrans.Tensor: A Tensor Computation Library
## NiuTrans.Tensor
## What is a tensor
In computer science, a tensor is usually defined as a quantity in an $n$-dimensional space with $n$ components; in essence it is a multidimensional array. The order (or rank) of a tensor is the number of dimensions of this array, or simply the number of indices needed to address one of its elements. By convention, a 0th-order tensor is a scalar, a 1st-order tensor is a vector, and a 2nd-order tensor is a matrix. For example, in three-dimensional space a 1st-order tensor is the vector $(x,y,z)$ representing a point, where $x$, $y$ and $z$ are its coordinates on the three axes.
Tensors are an efficient modeling tool that expresses complex problems in a unified, concise way. Suppose Jiang Yingjun needs 2 jin of beef and 5 jin of potatoes for dinner, and at the market beef costs 32 yuan per jin and potatoes 2 yuan per jin; the total cost is $2 \times 32 + 5 \times 2 = 74$ yuan. Described with tensors, a 1st-order tensor $a=(2,5)$ holds the amount of each food, another 1st-order tensor $b=(32,2)$ holds the prices, and a 0th-order tensor $c$ holds the total cost, computed as
$$
\begin{aligned}
c & = a \times b^T \\
& = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 \\ 2\end{matrix}\right) \\
& = 2 \times 32 + 5 \times 2 \\
& = 74
\end{aligned}
$$
where $b^T$ is the transpose of the row vector $b$, i.e. a column vector, and $\times$ denotes vector multiplication. The next day Jiang Yingjun goes to another market, where beef costs 35 yuan per jin and potatoes 1 yuan per jin. To get the total cost at each of the two markets, redefine $b$ as the 2nd-order tensor $\left(\begin{matrix}32 & 2 \\ 35 & 1\end{matrix}\right)$ and the total cost $c$ as a 2nd-order tensor as well. Then
$$
\begin{aligned}
c & = a \times b^T \\
& = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 & 35 \\ 2 & 1\end{matrix}\right) \\
& = \left(\begin{matrix}74 & 75\end{matrix}\right)
\end{aligned}
$$
That is, the two markets cost 74 and 75 yuan respectively. Tensors let us model varied and complex problems: the definitions of $a$, $b$ and $c$ above can be extended to higher-order tensors covering different days, markets and recipes, and the same formula $c = a \times b^T$ still describes the problem.
Many real-world problems can be written as tensor expressions, i.e. combinations of tensors and operations on them expressed as arithmetic expressions. This kind of modeling underlies modern neural networks and deep learning; in many machine-learning toolkits, tensor computation is the basic unit of forward and backward propagation and is used very widely.
## How to define a tensor
If you use C/C++ or Python, defining a tensor with NiuTrans.Tensor in your program is straightforward. First, download the NiuTrans.Tensor package and unpack it to any directory, for example ~/NTS. The NTS directory contains a source subdirectory that holds the source code, organized as follows:
* ~/NTS/source/XTensor.h - defines the tensor structure XTensor and the interfaces for creating and destroying it
* ~/NTS/source/core - source files with the declarations and implementations of tensor operations
    * arithmetic - source files for arithmetic operations
    * getandset - source files for getting and setting data
    * math - source files for mathematical operations
    * movement - source files for data movement
    * reduce - source files for reduction operations
    * shape - source files for shape transformation
    * sort - source files for sorting operations
* ~/NTS/source/function - source files for the various activation functions
* ~/NTS/source/test - source files for the unit tests
* ~/NTS/source/*.h(cpp) - not related to tensor definition; described later :)
}
```
Next, compile the program above; the compiler needs the directory that contains XTensor.h. For example, with g++:
```
g++ sample.cpp -I~/NTS/source -o sample
```
| Create a 4-D dense tensor | XTensor * NewTensor4D(<br>const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) | d0 - size of dimension 1 <br> d1 - size of dimension 2 <br> d2 - size of dimension 3 <br> d3 - size of dimension 4 <br> myDataType - data type of the tensor <br> myDevID - ID of the device the tensor lives on <br> myMem - memory pool used by the tensor |
| Create a 5-D dense tensor | XTensor * NewTensor5D(<br>const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) | d0 - size of dimension 1 <br> d1 - size of dimension 2 <br> d2 - size of dimension 3 <br> d3 - size of dimension 4 <br> d4 - size of dimension 5 <br> myDataType - data type of the tensor <br> myDevID - ID of the device the tensor lives on <br> myMem - memory pool used by the tensor |
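As a quick illustration (not part of the original manual), the sketch below creates a small 4-D dense tensor with NewTensor4D, using the default data type X_FLOAT and device ID -1 (CPU) listed above, and releases it with DelTensor:
```
/* create a 2 x 3 x 4 x 5 dense tensor on the CPU (devID = -1) */
XTensor * t = NewTensor4D(2, 3, 4, 5, X_FLOAT, -1, NULL);

/* ... use the tensor ... */

/* free the tensor and its data */
DelTensor(t);
```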
## Devices
## Accessing the contents of a tensor
In C/C++ the contents of a tensor are accessed through XTensor.h; including this single header file in a source program is all that is needed to define tensors.
| Member | Description |
| - | - |
| void * data | array holding the elements |
| int devID | device ID: the CPU or GPU device on which the tensor's memory is allocated; -1 means CPU |
| int order | number of dimensions; e.g. a matrix (2 dimensions) is a 2nd-order tensor |
| int dimSize[ ] | size of each dimension; index 0 is the first dimension |
| TENSOR_DATA_TYPE dataType | data type of each data unit |
| int unitSize | size of a data unit, similar to sizeof() |
| int unitNum | number of data units |
| bool isSparse | dense or sparse: an n * m dense matrix stores n * m values, while the storage of a sparse matrix depends on the number of non-zero elements |
| float denseRatio | density: the fraction of non-zero units, a real number between 0 and 1; 0 means all units are zero, 1 means all units are non-zero |
Some of the methods defined in XTensor.h are listed below; see the appendix for details:
| Operation | Function | Parameters |
| - | - | - |
| Check whether two tensors have the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b) | a - first tensor to compare <br> b - second tensor to compare |
| Check whether three tensors have the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b, XTensor * c) | a - first tensor <br> b - second tensor <br> c - third tensor |
| Set the size of each dimension | void SetDim(int * myDimSize) | myDimSize - size of each dimension |
| Get the size of a given dimension | int GetDim(const int dim) | dim - the dimension |
| Reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - number of dimensions <br> myDimSize - size of each dimension |
| Get the number of elements | int GetSize() | N/A |
| Get the data-unit size of a given data type | int GetUnitSize(<br> TENSOR_DATA_TYPE myDataType) | myDataType - the data type |
| Set all elements to 0 | void SetZeroAll(XStream * stream = NULL) | stream - multi-thread stream |
| Fill the tensor from an array | void SetData(<br> const void * d, int num, int beg = 0) | d - source array <br> num - array size <br> beg - position in the tensor at which filling starts |
| Initialize the tensor from a uniform distribution | void SetDataRand(<br> DTYPE lower, DTYPE upper) | lower - lower bound <br> upper - upper bound |
| Initialize the tensor from a normal distribution | void SetDataRandn(<br> DTYPE mean, DTYPE standardDeviation) | mean - mean <br> standardDeviation - standard deviation |
| Set the elements of a given dimension in ascending order | void SetAscendingOrder(int dim) | dim - the dimension |
| Get a value from a 2-D tensor | DTYPE Get2D(int ni, int mi = 0) | ni - row index <br> mi - column index |
| Set a cell of a 2-D tensor | bool Set2D(DTYPE value, int ni, int mi = 0) | value - cell value <br> ni - row index <br> mi - column index |
| Add to a cell of a 2-D tensor | bool Add2D(DTYPE value, int ni, int mi = 0) | value - value to add <br> ni - row index <br> mi - column index |
| Resize the tensor | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - number of dimensions <br> myDimSize - size of each dimension, index 0 is the first dimension <br> myDataType - data type <br> myDenseRatio - density, 1 for a dense tensor |
| Resize the tensor to match another tensor | bool Resize(<br> const XTensor * myTensor) | myTensor - the reference tensor |
| Create a new tensor from a given tensor | XTensor * NewTensor(<br>XTensor * a, bool isFilledData = true) | a - the given tensor <br> isFilledData - whether to allocate the data space |
| Free the data space of a given tensor | void DelTensor(<br>const XTensor * tensor) | tensor - the given tensor |
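For orientation, here is a small sketch (not part of the original manual) that exercises a few of these methods; it assumes the InitTensor2D helper used in the memory-pool example later in this document:
```
/* a 2 x 2 matrix */
XTensor t;
InitTensor2D(&t, 2, 2);

/* fill it from an array, then read and modify single cells */
DTYPE values[4] = {1.0F, 2.0F, 3.0F, 4.0F};
t.SetData(values, 4);
DTYPE v = t.Get2D(0, 1);    /* v == 2.0 */
t.Set2D(v + 1.0F, 1, 0);    /* write 3.0 into row 1, column 0 */
```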
## Tensor operations
NiuTrans.Tensor provides functions for tensor computation, mainly basic tensor operations and activation functions; this section describes these functions and how to use them. Taking element-wise multiplication (Multiply) as an example, an operation usually comes in several forms:
* _Multiply: the output tensor must be supplied by the caller; forward computation only
* MultiplyMe: the result overwrites the input tensor; forward computation only
* Multiply: the output tensor is returned to the caller; supports both forward and backward computation
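A minimal sketch of the three calling styles, using the Multiply signatures documented later in this section (a, b and c are assumed to be already-initialized XTensor pointers of matching shape):
```
/* low-level form: the caller supplies the output tensor c */
_Multiply(a, b, c, 0);

/* in-place form: a is overwritten with the element-wise product of a and b */
_MultiplyMe(a, b);

/* high-level form: a new tensor holding the result is returned */
XTensor d = Multiply(*a, *b);
```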
### Arithmetic (arithmetic)
This part covers basic arithmetic: addition, subtraction, multiplication, division, negation, and so on.
#### Matrix multiplication (MatrixMul)
##### What is matrix multiplication between tensors?
Matrix multiplication multiplies two matrices and produces a new result matrix. Multiplying a $2 \times 3$ matrix by a $3 \times 2$ matrix works as shown below; the result is $2 \times 2$:
$$
\left(\begin{matrix}1.0 & 2.0 & 3.0\\-4.0 & 5.0 & 6.0\end{matrix}\right) ×
\left(\begin{matrix}0.0 & -1.0\\1.0 & 2.0\\2.0 & 1.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}8.0 & 6.0\\17.0 & 20.0\end{matrix}\right)
$$
##### Calling matrix multiplication
NiuTrans.Tensor implements matrix multiplication in NiuTrans.Tensor/Tensor/core/arithmetic. The function computes
> c_{i,j} = trans(a_i) * trans(b_j) * alpha + c_{i,j} * beta
The calling conventions and parameters are:
```
void _MatrixMul(XTensor * a, MATRIX_TRANS_TYPE transposedA, XTensor * b, MATRIX_TRANS_TYPE transposedB, XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
```
Parameters:
* a - input tensor 1
* transposedA - whether a is transposed
* b - input tensor 2
* transposedB - whether b is transposed
* c - output tensor
* alpha - coefficient α
* beta - coefficient β
##### A matrix multiplication snippet
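The snippet that belongs here is elided in this excerpt; the following hedged sketch shows a typical call, modeled on the sample program at the end of this document:
```
/* a: 2 x 3, b: 3 x 2 */
XTensor a, b, c;
InitTensor2D(&a, 2, 3);
InitTensor2D(&b, 3, 2);
a.SetDataRand(-1.0F, 1.0F);
b.SetDataRand(-1.0F, 1.0F);

/* call MatrixMul function: c = a * b */
c = MatrixMul(a, X_NOTRANS, b, X_NOTRANS);
```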
For the complete example see:
NiuTrans.Tensor/Tensor/test/TMatrixMul.cpp
#### Element-wise multiplication (Multiply)
##### What is element-wise multiplication of tensors?
Element-wise multiplication multiplies the elements of two tensors position by position; multiplying two $2 \times 2$ tensors works like this:
$$
\left(\begin{matrix}0.0 & 1.0\\2.0 & 3.0\end{matrix}\right) ·
\left(\begin{matrix}0.0 & 1.0\\2.0 & 3.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0\\4.0 & 9.0\end{matrix}\right)
$$
##### Calling element-wise multiplication
NiuTrans.Tensor provides element-wise multiplication of tensors, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The calling conventions and parameters are:
```
_Multiply(XTensor * a, XTensor * b, XTensor * c, int leadingDim, DTYPE alpha = 0)
void _MultiplyMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0)
XTensor Multiply(const XTensor &a, const XTensor &b, DTYPE alpha = 0, int leadingDim = 0)
```
Parameters:
* a - input tensor 1
* b - input tensor 2
* c - output tensor
* leadingDim - the dimension along which the element-wise product is taken
* alpha - coefficient
##### An element-wise multiplication snippet
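The snippet that belongs here is elided in this excerpt; a minimal sketch using the signature above (a, b and c are assumed to be initialized tensors of the same shape):
```
/* call Multiply function: c = a * b, element by element */
_Multiply(a, b, c, 0);
```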
For the complete example see:
NiuTrans.Tensor/Tensor/test/TMultiply.cpp
#### Negation (Negate)
##### What is tensor negation?
Negation negates every element of a tensor, and the new elements form the result tensor; negating a $3 \times 2$ tensor works like this:
$$
\left(\begin{matrix}1.0 & -2.0\\-3.0 & 4.0\\5.0 & -6.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}-1.0 & 2.0\\3.0 & -4.0\\-5.0 & 6.0\end{matrix}\right)
$$
##### Calling negation
NiuTrans.Tensor provides element-wise negation, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The calling convention and parameters are:
```
void _Negate(XTensor * a)
```
Parameters:
* a - input tensor
##### A negation snippet
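The snippet that belongs here is elided in this excerpt; a minimal sketch (a is assumed to be an initialized tensor; the operation is in place):
```
/* call Negate function: negate every element of a */
_Negate(a);
```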
For the complete example see:
NiuTrans.Tensor/Tensor/test/TNegate.cpp
#### Addition (Sum)
##### What is tensor addition?
Tensor addition adds n tensors to produce a new result tensor: each element of the result is the sum of the corresponding elements of the operands, and the operands have the same dimensions as the result. Adding two $2 \times 3$ tensors works like this:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 \\ 3.0 & 4.0 & 5.0\end{matrix}\right) +
\left(\begin{matrix}0.5 & 1.5 & 2.5 \\ 3.5 & 4.5 & 5.5\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.5 & 2.5 & 4.5 \\ 6.5 & 8.5 & 10.5\end{matrix}\right)
$$
##### Calling tensor addition
NiuTrans.Tensor provides tensor addition, defined in NiuTrans.Tensor/Tensor/core/arithmetic; it adds tensors element by element and produces the result tensor. The calling conventions are:
```
void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0)
void _SumMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0)
XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0)
```
Here a and b are the input tensors and c is the result; if c is NULL the result is stored in a. beta is a scaling parameter: c = a + b * beta, with beta defaulting to 1.0. The parameters are:
Parameters:
* a - input tensor 1
* b - input tensor 2
* c - output tensor
* beta - scaling parameter
##### A tensor addition snippet
Calling Sum to add two tensors looks like this; in this example the result is stored in c:
```
/* call sum function */
_Sum(a, b, c);
```
For the complete example see:
NiuTrans.Tensor/Tensor/test/TSum.cpp
##### What is SumByColumnTV?
SumByColumnTV adds a tensor and a vector column by column; the result has the same dimensions as the tensor. For a $2 \times 4$ tensor and a $2 \times 1$ vector:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) + \left(\begin{matrix}1.0\\0.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
$$
##### Calling SumByColumnTV
NiuTrans.Tensor provides the SumByColumnTV operation; the calling convention and parameters are:
```
void _SumByColumnTV(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
```
Parameters:
* a - input tensor
* b - input vector
* c - output tensor
* beta - scaling parameter
The computation performed by SumByColumnTV is c_col = a_col + b * beta.
##### A SumByColumnTV snippet
SumByColumnTV example code; a is the input tensor, b is the input vector, and c is the column-wise sum of a and b:
```
/* call SumByColumnTV function */
_SumByColumnTV(a, b, c);
```
For the complete example see:
NiuTrans.Tensor/Tensor/test/TSumByColumnTV.cpp
##### What is SumByColumnVT?
SumByColumnVT adds a vector and a tensor column by column; the result has the same dimensions as the vector. For a $2 \times 1$ vector and a $2 \times 4$ tensor:
$$
\left(\begin{matrix}1.0\\0.0\end{matrix}\right) + \left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}7.0\\22.0\end{matrix}\right)
$$
##### Calling SumByColumnVT
NiuTrans.Tensor provides the SumByColumnVT operation; the calling convention and parameters are:
```
_SumByColumnVT(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
```
Parameters:
* a - input vector
* b - input tensor
* c - output vector
* beta - scaling parameter
The computation performed by SumByColumnVT is c = a + \sum_{col} b_col * beta.
##### A SumByColumnVT snippet
```
/* call SumByColumnVT function */
_SumByColumnVT(a, b, c);
```
For the complete example see:
NiuTrans.Tensor/Tensor/test/TSumByColumnVT.cpp
### Getting and setting data (getandset)
This part covers data-type conversion and functions for setting and getting data.
#### Selection (Select)
##### What is tensor selection?
Select picks elements at specified positions along a specified dimension of a tensor. In the example below, the elements at indices 1 and 2 along dimension 2 of a $2 \times 2 \times 4$ tensor are selected into the target tensor, giving a $2 \times 2 \times 2$ tensor:
$$
\begin{aligned}
\Biggl(
& \left(
\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right),\\
& \left(
\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\5.0 & 6.0 & 7.0 & 8.0\end{matrix}
\right)
\Biggr)
\end{aligned} \rightarrow
\begin{aligned}
\Biggl(
& \left(
\begin{matrix}1.0 & 2.0\\5.0 & 6.0\end{matrix}
\right),\\
& \left(
\begin{matrix}2.0 & 3.0\\6.0 & 7.0\end{matrix}
\right)
\Biggr)
\end{aligned}
$$
##### Calling tensor selection
NiuTrans.Tensor provides tensor selection; the calling conventions and parameters are as follows.
The first form selects with an index matrix of 0s and 1s:
```
void _Select(const XTensor * a, XTensor * c, XTensor * indexCPU)
XTensor Select(const XTensor &a, XTensor &indexCPU)
```
Parameters:
* a - input tensor
* c - output tensor
* indexCPU - selection flags
The second form selects a range of positions:
```
void _SelectRange(const XTensor * a, XTensor * c, int dim, int low, int high)
XTensor SelectRange(const XTensor &a, int dim, int low, int high)
```
Parameters:
* a - input tensor
* dim - the dimension along which selection is performed
* low - lower bound of the selected range
* high - upper bound of the selected range
* c - output tensor
> Note that a selection range of [1,3] selects the values at index positions 1 and 2 (the upper bound is exclusive).
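A minimal sketch of the range form, matching the $2 \times 2 \times 4$ example above (a and c are assumed to be initialized tensors of the appropriate shapes):
```
/* call SelectRange function: keep indices 1 and 2 of dimension 2 */
_SelectRange(a, c, 2, 1, 3);
```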
For the complete example see:
NiuTrans.Tensor/Tensor/test/TSelect.cpp
#### SetData
##### What is SetData?
SetData initializes a tensor, for example with random values in a given range; initializing a $2 \times 4$ tensor in the range [0.0, 1.0] looks like:
$$
\left(\begin{matrix}0.0 & 0.0 & 0.0 & 0.0\\0.0 & 0.0 & 0.0 & 0.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.1 & 0.5 & 0.3 & 0.9\\0.8 & 0.5 & 0.5 & 0.2\end{matrix}\right)
$$
##### Calling SetData
NiuTrans.Tensor provides the SetData family of operations; the calling conventions and parameters are as follows.
Set the tensor to a fixed value:
``` ```
void SetDataFixed(XTensor * tensor, void * valuePointer)
``` ```
Parameters:
* tensor - input tensor
* valuePointer - pointer to the value
Set the tensor to an integer value:
```
void SetDataFixedInt(XTensor * tensor, int p)
```
Parameters:
* tensor - input tensor
* p - the integer value
Set the tensor to a single-precision float value:
```
void SetDataFixedFloat(XTensor * tensor, float p)
```
Parameters:
* tensor - input tensor
* p - the float value
Set the tensor to a double-precision float value:
```
void SetDataFixedDouble(XTensor * tensor, double p)
```
Parameters:
* tensor - input tensor
* p - the double value
Initialize the tensor with random values from a uniform distribution:
```
void SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
```
Parameters:
* tensor - input tensor
* low - lower bound
* high - upper bound
Initialize the tensor from a normal distribution:
```
void SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
```
Parameters:
* tensor - input tensor
* mean - mean
* standardDeviation - standard deviation
##### A SetData snippet
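A minimal sketch using the functions above (s is assumed to be an already-initialized XTensor pointer):
```
/* set every element of s to 1.0 */
SetDataFixedFloat(s, 1.0F);

/* then overwrite s with uniform random values in [0.0, 1.0] */
SetDataRand(s, 0.0F, 1.0F);
```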
For the complete example see:
NiuTrans.Tensor/Tensor/test/TSetData.cpp
### Mathematical operations (math)
This part covers operations beyond basic algebra, such as log, exp and abs.
#### Normalize
NiuTrans.Tensor provides the Normalize operation; the calling conventions and parameters are:
```
void _Normalize(const XTensor * input, XTensor * output, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
void _NormalizeMe(XTensor * input, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
XTensor Normalize(const XTensor &input, int dim, const XTensor &mean, const XTensor &var, const XTensor &a, const XTensor &b, DTYPE epsilon)
```
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which normalization is performed
* mean - the mean
* var - the variance
* a - the scale parameter
* b - the bias parameter
* epsilon - a small constant added for numerical stability
For the complete example see:
NiuTrans.Tensor/Tensor/test/TNormalize.cpp
#### Power
##### What is the tensor power operation?
Exponentiation raises each element of a tensor to a power, producing a new tensor; raising a $3 \times 2$ tensor to the power 2.0 looks like:
$$
\left(\begin{matrix}1.0 & 2.0\\3.0 & 4.0\\5.0 & 6.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}1.0 & 4.0\\9.0 & 16.0\\25.0 & 36.0\end{matrix}\right)
$$
##### Calling the power operation
NiuTrans.Tensor provides an element-wise power operation; the calling convention is:
```
void _Power(XTensor * a, DTYPE p)
```
Here a is the tensor being operated on and p is the exponent. The parameters are:
Parameters:
* a - input tensor
* p - the exponent
##### A power operation snippet
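A minimal sketch matching the $3 \times 2$ example above (a is assumed to be an initialized tensor; the operation is in place):
```
/* call Power function: square every element of a */
_Power(a, 2.0F);
```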
For the complete example see:
NiuTrans.Tensor/Tensor/test/TPower.cpp
#### Scale and shift (ScaleAndShift)
##### What is scale-and-shift?
Scale-and-shift computes p = p * scale + shift, where scale and shift are the scaling and shifting parameters. A $2 \times 4$ tensor scaled by 2.0 and shifted by 0.5 becomes:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)
$$
##### Calling scale-and-shift
NiuTrans.Tensor provides the scale-and-shift operation; the calling conventions are:
```
void _ScaleAndShift(const XTensor * a, XTensor * b, DTYPE scale, DTYPE shift = 0)
void _ScaleAndShiftMe(XTensor * a, DTYPE scale, DTYPE shift = 0)
XTensor ScaleAndShift(const XTensor &a, DTYPE scale, DTYPE shift = 0)
```
The result is p = p * scale + shift, where scale and shift are the scaling and shifting parameters. The parameters are:
Parameters:
* a - input tensor
* b - output tensor
* scale - scale parameter
* shift - shift parameter
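A minimal sketch matching the example above (input and output are assumed to be initialized tensors of the same shape):
```
/* call ScaleAndShift function: output = input * 2.0 + 0.5 */
_ScaleAndShift(input, output, 2.0F, 0.5F);
```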
For the complete example see:
NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp
### Data movement (movement)
This part covers the data-copying functions.
#### Copy (CopyValues)
##### What is tensor copying?
Copying assigns the values of one tensor to another tensor; copying a $2 \times 4$ tensor looks like:
$$
\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
\left(
\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
$$
##### Calling tensor copy
NiuTrans.Tensor provides the tensor copy operation; the calling conventions and parameters are:
```
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL)
XTensor CopyValues(const XTensor &s, XStream * stream = NULL)
```
Parameters:
* s - input tensor
* t - output tensor
* stream - multi-thread stream
##### A tensor copy snippet
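A minimal sketch (s and t are assumed to be initialized tensors of the same shape):
```
/* call CopyValues function: t = s */
_CopyValues(s, t);
```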
For the complete example see:
NiuTrans.Tensor/Tensor/test/TCopyValues.cpp
#### CopyIndexed
##### What is CopyIndexed?
CopyIndexed copies a tensor at specified index positions. In the example below a $2 \times 2 \times 3$ tensor is copied along dimension 2, taking 1 element at each of the start indices 0 and 2, which yields a $2 \times 2 \times 2$ tensor:
$$
\begin{aligned}
\Biggl(
& \left(
\begin{matrix}0.0 & -1.0 & 2.0\\2.0 & 1.0 & 3.0\end{matrix}\right),\\
& \left(
\begin{matrix}1.0 & 2.0 & 4.0\\3.0 & 1.0 & 2.0\end{matrix}
\right),\\
& \left(
\begin{matrix}-1.0 & 3.0 & 2.0\\1.0 & -1.0 & 0.0\end{matrix}
\right)
\Biggr)
\end{aligned} \rightarrow
\begin{aligned}
\Biggl(
& \left(
\begin{matrix}0.0 & 2.0\\2.0 & 3.0\end{matrix}\right),\\
& \left(
\begin{matrix}1.0 & 4.0\\3.0 & 2.0\end{matrix}
\right),\\
& \left(
\begin{matrix}-1.0 & 2.0\\1.0 & 0.0\end{matrix}
\right)
\Biggr)
\end{aligned}
$$
##### Calling CopyIndexed
NiuTrans.Tensor provides the CopyIndexed operation; the calling conventions and parameters are:
```
void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
XTensor CopyIndexed(const XTensor &s, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
```
Parameters:
* s - input tensor
* t - output tensor
* dim - the dimension along which CopyIndexed is performed
* srcIndex - source indices, i.e. the indices along dim of the values to copy
* indexSize - number of source indices
* tgtIndex - target indices, i.e. the positions in the target tensor that receive the copied values
* copyNum - number of elements copied for each index
##### A CopyIndexed snippet
```
/* call CopyIndexed function */
_CopyIndexed(s, t, 2, srcIndex, indexSize, tgtIndex, 1);
```
For the complete example see:
NiuTrans.Tensor/Tensor/test/TCopyIndexed.cpp
### Reduction (reduce)
#### Reduce maximum (ReduceMax)
##### What is reduce-max?
Reduce-max takes the maximum of a tensor along a given dimension; reducing a $2 \times 4$ tensor along dimension 0 and along dimension 1 works as follows:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
$$

$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}3.0\\7.0\end{matrix}\right)
$$
##### Calling reduce-max
NiuTrans.Tensor provides the ReduceMax operation, which takes the maximum along a specified dimension; the calling conventions are:
```
void _ReduceMax(const XTensor * input, XTensor * output, int dim)
XTensor ReduceMax(const XTensor &input, int dim)
```
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the reduction is performed
For the complete example see:
NiuTrans.Tensor/Tensor/test/TReduceMax.cpp
#### Reduce sum (ReduceSum)
##### What is reduce-sum?
Reduce-sum sums a tensor along a given dimension; summing a $2 \times 4$ tensor along dimension 0 and along dimension 1 works as follows:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 6.0 & 8.0 & 10.0\end{matrix}\right)
$$

$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}6.0\\22.0\end{matrix}\right)
$$
##### Calling reduce-sum
NiuTrans.Tensor provides the ReduceSum operation; the calling conventions are:
```
void _ReduceSum(const XTensor * input, XTensor * output, int dim, const XTensor * shift = NULL, DTYPE power = (DTYPE)1.0F, bool isExp = false)
XTensor ReduceSum(const XTensor &input, int dim, const XTensor &shift = NULLTensor, DTYPE power = (DTYPE)1.0F, bool isExp = false)
```
shift defaults to NULL, power to 1.0F and isExp to false. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the reduction is performed
* shift - an optional tensor subtracted from the input before reduction (default NULL)
* power - the exponent applied to each (shifted) element before summation (default 1.0F)
* isExp - whether each (shifted) element is exponentiated before summation (default false)
For the complete example see:
NiuTrans.Tensor/Tensor/test/TReduceSum.cpp
#### Reduce mean (ReduceMean)
##### What is reduce-mean?
Reduce-mean computes the mean of a tensor along a given dimension; taking the mean of a $2 \times 4$ tensor along dimension 0 and along dimension 1 works as follows:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}2.0 & 3.0 & 4.0 & 5.0\end{matrix}\right)
$$

$$
\left(\begin{matrix}1.0 & 1.0 & 3.0 & 3.0\\4.0 & 4.0 & 6.0 & 6.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}2.0\\5.0\end{matrix}\right)
$$
##### Calling reduce-mean
NiuTrans.Tensor provides the ReduceMean operation; the calling conventions are:
```
void _ReduceMean(const XTensor * input, XTensor * output, int dim)
XTensor ReduceMean(const XTensor &input, int dim)
```
ReduceMean computes the mean of the elements along a specified dimension. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the reduction is performed
For the complete example see:
NiuTrans.Tensor/Tensor/test/TReduceMean.cpp
#### Reduce sum of squares (ReduceSumSquared)
##### What is ReduceSumSquared?
ReduceSumSquared reduces a tensor along a given dimension by summing the squared differences between the elements and a given shift; reducing a $2 \times 4$ tensor along dimension 0 looks like:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}8.0 & 8.0 & 8.0 & 8.0\end{matrix}\right)
$$
##### Calling ReduceSumSquared
NiuTrans.Tensor provides the ReduceSumSquared operation; the calling conventions are:
```
void _ReduceSumSquared(const XTensor * input, XTensor * output, int dim, const XTensor * shift)
XTensor ReduceSumSquared(const XTensor &input, int dim, const XTensor &shift)
```
ReduceSumSquared computes, along a specified dimension, the sum of the squared differences between the elements and shift. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the reduction is performed
* shift - the tensor subtracted from the input before squaring
For the complete example see:
NiuTrans.Tensor/Tensor/test/TReduceSumSquared.cpp
#### Reduce variance (ReduceVariance)
##### What is reduce-variance?
Reduce-variance computes, along a given dimension, the variance of the elements relative to a given mean; reducing a $2 \times 4$ tensor along dimension 0 looks like:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 4.0 & 4.0 & 4.0\end{matrix}\right)
$$
##### Calling ReduceVariance
NiuTrans.Tensor provides the ReduceVariance operation; the calling conventions are:
```
void _ReduceVariance(const XTensor * input, XTensor * output, int dim, const XTensor * mean)
XTensor ReduceVariance(const XTensor &input, int dim, const XTensor &mean)
```
ReduceVariance computes, along a specified dimension, the variance of the elements relative to mean. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the reduction is performed
* mean - the mean tensor
##### A ReduceVariance snippet
```
/* call ReduceVariance function along dimension 0 */
_ReduceVariance(input, output, 0, mean);
```
For the complete example see:
NiuTrans.Tensor/Tensor/test/TReduceVariance.cpp
### Shape transformation (shape)
This part covers shape-changing functions such as split, merge and reshape.
#### Concatenation (Concatenate)
##### What is concatenation?
Concatenation joins a sequence of tensors, or all tensors in a list, along a given dimension into one larger tensor; concatenating tensors of size $2 \times 1$ and $2 \times 2$ works like this:
$$
\left(\begin{matrix}0.0\\1.0\end{matrix}\right) +
\left(\begin{matrix}2.0 & 3.0\\4.0 & 5.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 2.0 & 3.0\\1.0 & 4.0 & 5.0\end{matrix}\right)
$$
##### Calling concatenation
NiuTrans.Tensor provides concatenation of tensors in two forms.
In the first form the operands are given as a list: the tensors to concatenate are stored in the list smalls and the result is stored in big:
```
void _Concatenate(const XList * smalls, XTensor * big, int dim)
XTensor Concatenate(const XList &smalls, int dim)
```
Parameters:
* smalls - the list of tensors to concatenate
* big - output tensor
* dim - the dimension along which to concatenate
In the second form the operands are individual tensors rather than a list:
```
void _Concatenate(const XTensor * smallA, const XTensor * smallB, XTensor * big, int dim)
XTensor Concatenate(const XTensor &smallA, const XTensor &smallB, int dim)
```
Parameters:
* smallA - input tensor 1
* smallB - input tensor 2
* big - output tensor
* dim - the dimension along which to concatenate
##### A concatenation snippet
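A minimal sketch of the second form, matching the $2 \times 1$ and $2 \times 2$ example above (smallA, smallB and big are assumed to be initialized tensors):
```
/* call Concatenate function: join smallA and smallB along dimension 1 */
_Concatenate(smallA, smallB, big, 1);
```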
For the complete example see:
NiuTrans.Tensor/Tensor/test/TConcatenate.cpp
#### Split
##### What is splitting?
Splitting works along a given dimension of a tensor: a tensor can be split into another tensor with an extra dimension, or a large tensor can be split into a list of n smaller tensors.
In the first case, splitting a $4 \times 3$ tensor along dimension 0 into 2 parts gives a $2 \times 2 \times 3$ tensor:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right),
\\ & \left(
\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}
\right) \Biggr)
\end{aligned}
$$
In the second case, splitting a $4 \times 3$ tensor along dimension 0 into 2 parts gives two $2 \times 3$ tensors:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
##### Calling split
NiuTrans.Tensor provides two forms of splitting.
The first form splits one dimension of the source tensor s; the result is the tensor t, whereToSplit is the dimension to split, and splitNum is the number of parts, e.g. (N, M) -> (N/3, M, 3). The parameters are:
```
void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
```
Parameters:
* s - input tensor
* t - output tensor
* whereToSplit - the dimension along which to split
* splitNum - number of parts
The second form splits the tensor big along the dimension whereToSplit into a list smalls of smaller tensors; splitNum is the number of parts, e.g. (N, M) -> 2 * (N/2, M). The parameters are:
```
void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
XList SplitList(const XTensor &big, int whereToSplit, int splitNum)
```
Parameters:
* big - input tensor
* smalls - the list that receives the split tensors
* whereToSplit - the dimension along which to split
* splitNum - number of parts
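A minimal sketch of the first form, matching the $4 \times 3$ example above (s and t are assumed to be initialized tensors):
```
/* call Split function: cut dimension 0 of s into 2 parts */
_Split(s, t, 0, 2);
```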
For the complete example see:
NiuTrans.Tensor/Tensor/test/TSplit.cpp
#### Merge
##### What is merging?
Merging is similar to concatenation: along a given dimension, one tensor can be merged into another tensor with different dimensions, or all tensors in a list can be merged into one larger tensor.
In the first case, a $2 \times 2 \times 3$ tensor is merged at dimension 1, with dimension 0 as the dimension being merged, giving a $4 \times 3$ tensor:
$$
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right),
\\ & \left(
\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}
\right) \Biggr)
\end{aligned} \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
In the second case, two $2 \times 3$ tensors are merged along dimension 0 into a $4 \times 3$ tensor:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
##### Calling merge
NiuTrans.Tensor provides merging in two forms.
The first form merges one dimension of the source tensor s into another dimension; the result is t, whereToMerge is the dimension into which to merge, and leadingDim is the dimension being merged, e.g. (N/2, 2, M) -> (N, M). The parameters are:
```
void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim = -1)
XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim = -1)
```
Parameters:
* s - input tensor
* t - output tensor
* whereToMerge - the dimension into which to merge
* leadingDim - the dimension that is merged
In the second form the tensors to merge are stored in the list smalls and the result is big; whereToMerge is the dimension along which to merge, e.g. 2 * (N/2, M) -> (N, M). The parameters are:
```
void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
XTensor Merge(const XList &smalls, int whereToMerge)
```
Parameters:
* smalls - the list of tensors to merge
* big - output tensor
* whereToMerge - the dimension along which to merge
For the complete example see:
NiuTrans.Tensor/Tensor/test/TMerge.cpp
#### Unsqueeze
##### What is Unsqueeze?
Unsqueeze returns a new tensor with an extra dimension of a given size inserted at a given position; the returned tensor shares the same underlying data as the source tensor. Unsqueezing a $2 \times 3$ tensor at dimension 1 and at dimension 2, inserting a new dimension of size 2 in each case, looks like:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 1.0 & 2.0\\0.0 & 1.0 & 2.0\end{matrix}\right),
\\ & \left(
\begin{matrix}3.0 & 4.0 & 5.0\\3.0 & 4.0 & 5.0\end{matrix}
\right) \Biggr)
\end{aligned}
$$

$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 0.0\\1.0 & 1.0\\2.0 & 2.0\end{matrix}\right),
\\ & \left(
\begin{matrix}3.0 & 3.0\\4.0 & 4.0\\5.0 & 5.0\end{matrix}
\right) \Biggr)
\end{aligned}
$$
##### Calling Unsqueeze
NiuTrans.Tensor provides the Unsqueeze operation; the calling conventions and parameters are:
```
void _Unsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
XTensor Unsqueeze(const XTensor &a, int dim, int dSize)
```
Parameters:
* a - input tensor
* b - output tensor
* dim - the position at which the new dimension is inserted
* dSize - size of the inserted dimension
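A minimal sketch matching the $2 \times 3$ example above (s, t1 and t2 are assumed to be initialized tensors of the appropriate shapes):
```
/* call Unsqueeze function: insert a dimension of size 2 at position 1 and at position 2 */
_Unsqueeze(s, t1, 1, 2);
_Unsqueeze(s, t2, 2, 2);
```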
For the complete example see:
NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp
### Sorting (sort)
This part covers sorting-related functions such as sort and topk.
#### Sort
##### What is Sort?
Sort sorts the elements of a tensor along a given dimension; sorting a $2 \times 4$ tensor along dimension 0 looks like:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\\0.0 & 1.0 & 2.0 & 3.0\end{matrix}\right)
$$
##### Calling Sort
NiuTrans.Tensor provides the Sort operation; the calling convention and parameters are:
```
void _Sort(XTensor * a, XTensor * index, int dim)
```
Parameters:
* a - input tensor
* index - indices of the elements in the output tensor
* dim - the dimension along which to sort
##### A Sort snippet
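A minimal sketch (a and b are assumed to be initialized tensors of the same shape; the sort is along dimension 0, as in the example above):
```
/* call Sort function */
_Sort(a, b, 0);
```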
#### TopK
##### What is TopK?
TopK sorts the elements of a tensor and returns the k largest (or smallest) values together with their indices; TopK can be applied along a given dimension. Taking the top 2 of a $2 \times 4$ tensor along dimension 0 looks like:
$$
\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
\begin{aligned}
outputAnswer: & \left(
\begin{matrix}5.0 & 3.0 & 7.0 & 8.0\\4.0 & 1.0 & 2.0 & 6.0\end{matrix}\right)\\
indexAnswer: & \left(
\begin{matrix}0 & 1 & 1 & 0\\1 & 0 & 0 & 1\end{matrix}\right)
\end{aligned}
$$
##### Calling TopK
NiuTrans.Tensor provides the TopK operation; the calling convention and parameters are:
```
void _TopK(XTensor * a, XTensor * b, XTensor * index, int dim, int k)
```
Parameters:
* a - input tensor
* b - output tensor
* index - indices of the output values
* dim - the dimension along which TopK is performed
* k - the number of largest values to take
##### A TopK snippet
```
/* call TopK function */
_TopK(input, outputA, indexA, dim, k);
```
For the complete TopK example see NiuTrans.Tensor/Tensor/test/TTopK.cpp
### Activation functions (function)
This part covers the activation functions and loss functions.
#### Rectify
NiuTrans.Tensor provides the Rectify activation function; the calling convention and parameters are:
```
void _Rectify(const XTensor * x, XTensor * y)
```
Parameters:
* x - input tensor
* y - output tensor
Rectify example code; x is the input tensor and y is the output tensor:
```
/* call Rectify function */
_Rectify(x, y);
```
A detailed code example for Rectify is provided in the test directory.
#### HardTanH
NiuTrans.Tensor provides the HardTanH activation function; the calling convention and parameters are:
```
void _HardTanH(const XTensor * x, XTensor * y)
```
Parameters:
* x - input tensor
* y - output tensor
HardTanH example code; x is the input tensor and y is the output tensor:
```
/* call hardtanh function */
_HardTanH(x, y);
```
A detailed code example for HardTanH is provided in the test directory.
#### Identity
NiuTrans.Tensor provides the Identity activation function; the calling convention and parameters are:
```
void _Identity(const XTensor * x, XTensor * y)
```
Parameters:
* x - input tensor
* y - output tensor
Identity example code; x is the input tensor and y is the output tensor:
```
/* call Identity function */
_Identity(x, y);
```
A detailed code example for Identity is provided in the test directory.
#### LogSoftmax
NiuTrans.Tensor provides the LogSoftmax activation function; the calling convention and parameters are:
```
void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
```
Parameters:
* x - input tensor
* y - output tensor
* leadDim - the dimension along which LogSoftmax is computed
LogSoftmax example code; x is the input tensor, y is the output tensor, and LogSoftmax is computed along dimension 1:
```
/* call LogSoftmax function */
_LogSoftmax(x, y, 1);
```
A detailed code example for LogSoftmax is provided in the test directory.
#### Sigmoid
NiuTrans.Tensor provides the Sigmoid activation function; the calling convention and parameters are:
```
void _Sigmoid(const XTensor * x, XTensor * y)
```
Parameters:
* x - input tensor
* y - output tensor
Sigmoid example code; x is the input tensor and y is the output tensor:
```
/* call Sigmoid function */
_Sigmoid(x, y);
```
A detailed code example for Sigmoid is provided in the test directory.
#### Softmax
NiuTrans.Tensor provides the Softmax activation function; the calling convention and parameters are:
```
void _Softmax(const XTensor * x, XTensor * y, int leadDim)
```
Parameters:
* x - input tensor
* y - output tensor
* leadDim - the dimension along which Softmax is computed
Softmax example code; x is the input tensor, y is the output tensor, and Softmax is computed along dimension 1:
```
/* call Softmax function */
_Softmax(x, y, 1);
```
A detailed code example for Softmax is provided in the test directory.
#### Loss
NiuTrans.Tensor provides loss computation; the calling convention and parameters are:
```
DTYPE LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName, bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
```
Parameters: Parameters:
有关Loss的详细代码示例见:
NiuTrans.Tensor/Tensor/test/TLoss.cpp
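除测试代码外,下面给出一个按上述函数签名整理的最小调用示例。注意,这只是一个示意性的写法:其中损失函数名CROSSENTROPY以及各位置参数的取值均为假设,具体请以LOSS_FUNCTION_NAME的定义和TLoss.cpp中的用法为准。
```
/* a sketch of calling LossCompute; the enum value CROSSENTROPY and
   the position arguments below are illustrative assumptions */
XTensor gold;
XTensor output;
/* ... initialize gold and output with the same shape ... */
DTYPE error = LossCompute(&gold, &output, CROSSENTROPY, false, 0, 0, gold.GetDim(0), 0);
```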
### 内存池
内存作为计算机软件运行过程中不可或缺的一项重要资源,在软件开发过程中具有十分重要的地位。对于一个软件系统而言,如何更高效地进行内存管理将对系统整体性能,尤其是运行速度方面产生很大程度的影响。对于内存的管理一般来说主要包括分配、追踪以及释放,通过相应的接口即可简单地在内存空间上进行变量的定义、使用以及删除等操作。
虽然目前而言,主流编程语言均会为开发人员提供相应的系统级接口(如C语言中的malloc和free,C++中的new和delete等),但这类接口在设计时需要兼顾各种使用情况,并不一定能够最适用于当前的使用需求(如对速度具有较高要求等),因此直接使用系统级的内存管理接口存在以下弊端:
1. 内存申请、释放时间消耗大:由于操作系统在进行内存管理的时候需要保证内存空间得到有效地使用,因此在执行内存申请操作的时候,系统将会根据“最先匹配”或“最优匹配”等算法在内存空间中找到一处闲置内存进行分配。同理,在对内存空间进行释放的时候,为方便后续空间的申请,系统也会在释放的过程中适时地合并空闲内存区域,保证系统中存在大块连续内存。诸如此类的操作虽然说能够使得内存空间的使用更加高效,但也给这些操作带来了许多额外的时间开销,导致频繁地对内存进行操作耗时较大。
2. 程序执行效率低:由于所申请内存块的大小不定,当频繁使用系统级接口进行内存管理的时候容易在存储空间中产生大量内存碎片,拖慢系统的执行效率。
3. 易发生内存泄漏:使用系统级接口对内存空间进行申请的时候,一般来说需要程序开发人员显性地对空间进行释放,一旦疏忽将导致内存泄漏情况的发生,严重情况下会使得软件甚至系统发生崩溃。因此使用系统级接口进行内存管理需要谨慎对存储空间的使用情况进行分析,使用相关检测工具对内存泄漏情况进行有效地核查。
此外,当系统中存在对GPU设备上的显存空间进行管理的时候,申请、释放操作所产生的时间代价相对普通内存来说更大。不同于内存空间的申请,在申请或释放显存的时候需要对CPU正在执行的操作进行中断,交由GPU设备进行显存的操作,因此这部分产生的时间消耗远比内存申请来说大得多,最终导致频繁地对显存空间进行操作会更严重地拖慢系统整体的执行效率。
针对以上问题,本系统支持使用内存池(Memory Pool)来对系统中的存储空间(包括内存和显存)进行管理。内存池的概念主要是在对存储空间进行使用之前,预先从系统中申请一整块的空间,由程序自身(内存池)对这部分空间进行管理。这样做的好处在于对存储空间的申请、释放等操作不需要频繁调用系统的相应接口,降低了其中中断、搜寻最优块等操作的耗时,同时也不易产生内存碎片。此外,由于内存池的申请是一次性的操作,因此不会在系统全局产生大规模内存泄漏的情况,对系统的稳定性会有所助益。
具体来说,想要在NiuTrans.Tensor的工具包中使用内存池(XMem)进行操作,只需要三个步骤:内存池的定义,使用以及释放。
* 内存池的定义
定义一个内存池,最简单的方式是仅指定一个设备ID,下面是一段示例代码。
```
// 定义一个内存池mem,它的类型是XMem
XMem * mem = new XMem(devID);
```
若需要更具体地指定内存池的信息,可以在定义内存池的时候通过myMode、myBlockSize、myBlockNum、myBufSize等参数设置内存池的使用模式、内存块大小、内存块数量以及缓存区大小。
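例如,下面是一段带有完整参数的定义示例。其中各参数的取值仅作示意,具体含义请以XMem构造函数的声明为准。
```
// 在编号为0的GPU设备上定义一个内存池
// 依次指定设备ID、使用模式、内存块大小、内存块数量以及缓存区大小(取值仅作示意)
XMem * mem = new XMem(0, UNI_FREE, MILLION * 64, 1024, MILLION * 64);
```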
* 内存池的使用
在定义好内存池之后,我们即可在该空间上进行变量的定义及使用了,这里以张量的定义为例,下面是一段示例代码。
```
// 声明一个变量tensor,它的类型是XTensor
XTensor tensor;
// 在内存池上初始化这个变量为50行*100列的矩阵(2阶张量)
InitTensor2D(&tensor, 50, 100, X_FLOAT, -1, mem);
```
我们可以看到,上述代码相对之前未使用内存池时的定义方式而言,仅需在定义的时候指定所使用的内存池即可,无需更复杂的操作。
* 内存池的释放
当希望完全释放内存池的时候,我们仅需直接将其删除即可,下面是一段示例代码。
```
// 删除内存池mem
delete mem;
```
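将上述三个步骤合在一起,一个最小的完整用例如下(仅作示意):
```
// 在CPU(devID为-1)上定义一个内存池
XMem * mem = new XMem(-1);

// 在内存池上初始化一个50行*100列的矩阵(2阶张量)
XTensor tensor;
InitTensor2D(&tensor, 50, 100, X_FLOAT, -1, mem);

// ... 在tensor上进行各种计算 ...

// 使用完毕后删除内存池
delete mem;
```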
## 实例1:矩阵乘法
NiuTrans.Tensor提供的矩阵乘法实例如下所示,详细代码见NiuTrans.Tensor/Tensor/sample/mul/
```
#include "mul.h"
namespace nts
{
void sampleMUL()
{
DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
{ -4.0F, 5.0F, 6.0F } };
DTYPE bData[3][2] = { { 0.0F, -1.0F },
{ 1.0F, 2.0F },
{ 2.0F, 1.0F } };
DTYPE answer[2][2] = { { 8.0F, 6.0F },
{ 17.0F, 20.0F } };
XTensor a;
XTensor b;
XTensor result;

InitTensor2D(&a, 2, 3);
InitTensor2D(&b, 3, 2);

a.SetData(aData, 6);
b.SetData(bData, 6);
result = MatrixMul(a, X_NOTRANS, b, X_NOTRANS);
result.Dump(stderr, "result:");
if (result.CheckData(answer, 4))
fprintf(stderr, "answer is right\n");
}
void sampleMUL1()
{
DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
{ -4.0F, 5.0F, 6.0F } };
DTYPE bData[3][2] = { { 0.0F, -1.0F },
{ 1.0F, 2.0F },
{ 2.0F, 1.0F } };
DTYPE answer[2][2] = { { 8.0F, 6.0F },
{ 17.0F, 20.0F } };
/* a source tensor of size (2, 3) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 3;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a source tensor of size (3, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 3;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
/* a target tensor of size (2, 2) */
int resultOrder = 2;
int * resultDimSize = new int[resultOrder];
resultDimSize[0] = 2;
resultDimSize[1] = 2;
int resultUnitNum = 1;
for (int i = 0; i < resultOrder; i++)
resultUnitNum *= resultDimSize[i];
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * result = NewTensor(resultOrder, resultDimSize);
a->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
result->SetZeroAll();
_MatrixMul(a, X_NOTRANS, b, X_NOTRANS, result);
result->Dump(stderr, "result:");
}
}
```
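上述两段代码计算的均是同一个矩阵乘法,其期望结果(即代码中的answer)可以手工验证:
$$
\left(\begin{matrix}1 & 2 & 3 \\ -4 & 5 & 6\end{matrix}\right) \times \left(\begin{matrix}0 & -1 \\ 1 & 2 \\ 2 & 1\end{matrix}\right) = \left(\begin{matrix}8 & 6 \\ 17 & 20\end{matrix}\right)
$$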
## 实例2:前馈神经网络
NiuTrans.Tensor提供的语言模型任务上的前馈神经网络实例部分代码如下所示,主要是关于前馈神经网络语言模型上前向和反向训练的处理过程,详细代码见NiuTrans.Tensor/Tensor/sample/fnnlm/
```
/*
forward procedure
>> inputs - input word representations
>> output - output probability
>> model - the fnn model
>> net - the network that keeps the internal tensors generated in the process
*/
void Forward(XTensor inputs[], XTensor &output, FNNModel &model, FNNNet &net)
{
int batchSize = -1;
int n = model.n;
int depth = model.hDepth;
XList eList(n - 1);
/* previous n - 1 words */
for(int i = 0; i < n - 1; i++){
XTensor &input = inputs[i];
XTensor &w = model.embeddingW;
XTensor &embedding = net.embeddings[i];
if(batchSize == -1)
batchSize = input.dimSize[0];
else{
CheckErrors(batchSize == input.dimSize[0], "Wrong input word representations!");
}
/* embedding output tensor of position i */
InitModelTensor2D(embedding, batchSize, model.eSize, model);
/* generate word embedding of position i:
embedding = input * w */
_MatrixMul(&input, X_NOTRANS, &w, X_NOTRANS, &embedding);
eList.Add(&net.embeddings[i]);
}
/* concatenate word embeddings
embeddingcat = cat(embedding_0...embedding_{n-1}) */
InitModelTensor2D(net.embeddingCat, batchSize, (n - 1) * model.eSize, model);
_Concatenate(&eList, &net.embeddingCat, 1);
/* go over each hidden layer */
for(int i = 0; i < depth; i++){
XTensor &h_pre = i == 0 ? net.embeddingCat : net.hiddens[i - 1];
XTensor &w = model.hiddenW[i];
XTensor &b = model.hiddenB[i];
XTensor &h = net.hiddens[i];
XTensor &s = net.hiddenStates[i];
InitModelTensor2D(h, batchSize, model.hSize, model);
InitModelTensor2D(s, batchSize, model.hSize, model);
/* generate hidden states of layer i:
s = h_pre * w */
_MatrixMul(&h_pre, X_NOTRANS, &w, X_NOTRANS, &s);
/* make a 2d tensor for the bias term */
XTensor b2D;
InitTensor(&b2D, &s);
_Unsqueeze(&b, &b2D, 0, batchSize);
/* introduce bias term:
s = s + b
NOTE: the trick here is to extend b to a 2d tensor
to fit into the 2d representation in tensor summation */
_Sum(&s, &b2D, &s);
/* pass the state through the hard tanh function:
h = tanh(s) */
_HardTanH(&s, &h);
}
/* generate the output Pr(w_{n-1}|w_0...w_{n-2}):
y = softmax(h_last * w)
Note that this is the same implementation as that in Bengio et al.'s paper.
TODO: we add bias term here */
{
XTensor &h_last = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
XTensor &w = model.outputW;
XTensor &b = model.outputB;
XTensor &s = net.stateLast;
XTensor &y = output;
InitModelTensor2D(s, batchSize, model.vSize, model);
InitModelTensor2D(y, batchSize, model.vSize, model);
/* s = h_last * w */
_MatrixMul(&h_last, X_NOTRANS, &w, X_NOTRANS, &s);
XTensor b2D;
InitTensor(&b2D, &s);
_Unsqueeze(&b, &b2D, 0, batchSize);
_Sum(&s, &b2D, &s);
/* y = softmax(s) */
_LogSoftmax(&s, &y, 1);
}
}
/*
backward procedure
>> inputs - input word representations
>> output - output probability
>> gold - gold standard
>> loss - loss function name
>> model - the fnn model
>> grad - the model that keeps the gradient information
>> net - the network that keeps the internal tensors generated in the process
*/
void Backward(XTensor inputs[], XTensor &output, XTensor &gold, LOSS_FUNCTION_NAME loss,
FNNModel &model, FNNModel &grad, FNNNet &net)
{
int batchSize = output.GetDim(0);
int n = model.n;
int depth = model.hDepth;
/* back-propagation for the output layer */
XTensor &y = output;
XTensor &s = net.stateLast;
XTensor &x = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
XTensor &w = model.outputW;
XTensor &dedw = grad.outputW;
XTensor &dedb = grad.outputB;
XTensor deds(&y);
XTensor dedx(&x);
/* for y = softmax(s), we get dE/ds
where E is the error function (define by loss) */
_LogSoftmaxBackward(&gold, &y, &s, NULL, &deds, 1, loss);
/* for s = x * w, we get
dE/w_{i,j} = dE/ds_j * ds/dw_{i,j}
= dE/ds_j * x_{i}
(where i and j are the row and column indices, and
x is the top most hidden layer)
so we know
dE/dw = x^T * dE/ds */
_MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);
/* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
_ReduceSum(&deds, &dedb, 0);
/* then, we compute
dE/dx_{j} = \sum_j' (dE/ds_{j'} * ds_{j'}/dx_j)
= \sum_j' (dE/ds_{j'} * w_{j, j'})
i.e.,
dE/dx = dE/ds * w^T */
_MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);
XTensor &gradPassed = dedx;
XTensor dedsHidden;
XTensor dedxBottom;
if (depth > 0)
InitTensor(&dedsHidden, &dedx);
InitTensor(&dedxBottom, &net.embeddingCat);
/* back-propagation from top to bottom in the stack of hidden layers
for each layer, h = f(s)
s = x * w + b */
for (int i = depth - 1; i >= 0; i--) {
XTensor &h = net.hiddens[i];
XTensor &s = net.hiddenStates[i];
XTensor &x = i == 0 ? net.embeddingCat : net.hiddenStates[i - 1];
XTensor &w = model.hiddenW[i];
XTensor &dedh = gradPassed; // gradient passed through the previous layer
XTensor &dedx = i == 0 ? dedxBottom : dedh;
XTensor &deds = dedsHidden;
XTensor &dedw = grad.hiddenW[i];
XTensor &dedb = grad.hiddenB[i];
/* backpropagation through the activation function:
dE/ds = dE/dh * dh/ds */
_HardTanHBackward(NULL, &h, &s, &dedh, &deds, NOLOSS);
/* gradient of the weight: dE/dw = x^T * dE/ds */
_MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);
/* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
_ReduceSum(&deds, &dedb, 0);
/* gradient of the input: dE/dx = dE/ds * w^T */
_MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);
if (i > 0)
_CopyValues(&dedx, &gradPassed);
}
XList eList(n - 1);
/* back-propagation for the embedding layer */
for (int i = 0; i < n - 1; i++) {
XTensor * dedy = NewTensor2D(batchSize, model.eSize, X_FLOAT, model.devID, model.mem);
eList.Add(dedy);
}
/* gradient of the concatenation of the embedding layers */
XTensor &dedyCat = depth > 0 ? dedxBottom : dedx;
/* split the concatenation of gradients of the embeddings */
_Split(&dedyCat, &eList, 1, n - 1);
/* go over for each word */
for (int i = 0; i < n - 1; i++) {
XTensor * dedy = (XTensor*)eList.GetItem(i);
XTensor &x = inputs[i];
XTensor &dedw = grad.embeddingW;
/* gradient of the embedding weight: dE/dw += x^T * dE/dy
NOTE that we accumulate dE/dw here because the matrix w
is shared by several layers (or words) */
_MatrixMul(&x, X_TRANS, dedy, X_NOTRANS, &dedw, 1.0F, 1.0F);
delete dedy;
}
}
```
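上述前向过程可以概括为如下计算(按代码逻辑整理,省略维度细节):
$$
\begin{aligned}
e_i & = x_i \times W_{emb} \\
h_0 & = {\rm concat}(e_0, \dots, e_{n-2}) \\
h_k & = {\rm HardTanH}(h_{k-1} \times W_k + b_k) \\
y & = {\rm LogSoftmax}(h_{depth} \times W_{out} + b_{out})
\end{aligned}
$$
其中$W_{emb}$、$W_k$、$b_k$、$W_{out}$、$b_{out}$分别对应代码中的model.embeddingW、model.hiddenW[i]、model.hiddenB[i]、model.outputW和model.outputB。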
## 实例3:循环神经网络

## 致谢
| 成员变量 | 功能 |
| - | - |
| int id | 张量标识 |
| XMem * mem | 张量所使用的内存池 |
| void * data | 保存元素的数据数组 |
| void * dataHost | 主机内存上的数据副本,只在GPU上运行时被激活 |
| void ** dataP | 指向数据地址的指针 |
| int devID | 设备ID,指张量所申请的空间所在CPU或者GPU设备的编号,-1表示CPU |
| int order | 张量的维度,例如:一个矩阵(维度为2)是一个二维张量 |
| int dimSize[ ] | 张量中每一维度的大小,索引0表示第1维 |
| int dimSizeRDI[ ] | 转置模式下张量中每一维度的大小,索引0表示第1维 |
| TENSOR_DATA_TYPE dataType | 每个数据单元的数据类型 |
| int unitSize | 数据单元的大小,类似于sizeof() |
| int unitNum | 数据单元的数量 |
| int unitNumNonZero | 稀疏矩阵中非零元素个数 |
| float denseRatio | 稠密度,指非零单元的比例,是介于0和1之间的一个实数,0表示所有单元全为零,1表示全为非零单元。|
| bool isShared | 标志数据数组是否被其他张量所共享 |
| bool isDefaultDType | 矩阵中使用的数据类型是否是属于默认数据类型 |
| bool isInGlobalMem | 标志数据是否在全局内存而不是内存池中 |
| bool isAllValued[ ] | 标志稀疏矩阵中是否每个维度都具有非零元素 |
| bool isInit | 张量是否被初始化 |
| bool isTmp | 张量是否为临时创建 |
| bool isGrad | 当使用模型参数时张量是否保持梯度 |
| unsigned int visitMark | 节点访问标志 |
| XTensor * grad | 反向传播的梯度 |
| XLink income | 超边的入边 |
| XLink outgo | 超边的出边 |
在XTensor.h头文件中定义的方法说明:

| 功能 | 函数 | 参数 |
| - | - | - |
| 构造函数 | XTensor() | N/A |
| 析构函数 | ~XTensor() | N/A |
| 初始化成员变量 | void Init() | N/A |
| 销毁数据 | void DestroyData() | N/A |
| 张量的浅层复制 | void ShallowCopy(<br>const XTensor &tensor) | tensor - 进行复制的张量 |
| 重载等于符号 | XTensor& operator= (<br>const XTensor &tensor) | tensor - 重载的张量 |
| 重载加法符号 | XTensor operator+ (<br>const XTensor &tensor) | tensor - 重载的张量 |
| 重载乘法符号 | XTensor operator* (<br>const XTensor &tensor) | tensor - 重载的张量 |
| 线性变换 | XTensor Lin(<br>DTYPE scale, DTYPE shift = 0) | scale - 缩放参数 <br> shift - 偏移参数 |
| 判断两个张量数据类型<br>和大小是否相同 | static bool IsIdentical(<br> XTensor * a, XTensor * b) | a - 进行比较的第一个张量 <br> b - 进行比较的第二个张量 |
| 判断三个张量数据类型<br>和大小是否相同 | static bool IsIdentical(<br> XTensor * a, XTensor * b, XTensor * c) | a - 进行比较的第一个张量 <br> b - 进行比较的第二个张量 <br> c - 进行比较的第三个张量 |
| 设置张量每一维度的大小 | void SetDim(<br>int * myDimSize) | myDimSize - 张量每一维度的大小 |
| 得到张量中给定的维度大小 | int GetDim(<br>const int dim) | dim - 张量的维度 |
| 重新调整矩阵维度 | void Reshape(<br> const int order, const int * myDimSize) | order - 张量的维度 <br> myDimSize - 张量每一维的大小 |
| 得到张量中元素数量 | int GetSize() | N/A |
| 得到内存使用大小 | int GetDataSizeInChar() | N/A |
| 设置张量服从均匀分布 | void SetDataRand(<br> DTYPE lower, DTYPE upper) | lower - 最小值 <br> upper - 最大值 |
| 设置张量服从正态分布 | void SetDataRandn(<br> DTYPE mean, DTYPE standardDeviation) | mean - 均值 <br> standardDeviation - 标准差 |
| 检查张量中元素是否相同 | bool CheckData(<br> const void * answer, int num, int beg = 0) | answer - 给定数组 <br> num - 数组大小 <br> beg - 赋值时从张量的第几位开始 |
| 设置数据指针 | void SetDataPointer() | N/A |
| 将给定维度中元素<br> 设置为升序 | void SetAscendingOrder(<br>int dim) | dim - 给定维度 |
| 得到索引指向的单元的值 | DTYPE Get(int index[], int size = -1) | index - 给定索引 <br> size - 矩阵大小 |
| 获取张量中元素指针 | void * GetCell(<br>int * index, int size) | index - 元素位置 <br> size - 矩阵大小 |
| 获取一维张量中元素的<br>默认类型值 | DTYPE Get1D(<br>int i) | i - 第一维 |
| 获取二维张量中元素的<br>默认类型值 | DTYPE Get2D(<br>int ni, int mi) const | ni - 第一维 <br> mi - 第二维 |
| 获取三维张量中元素的<br>默认类型值 | DTYPE Get3D(<br>int d0, int d1, int d2) | d0 - 第一维 <br> d1 - 第二维 <br> d2 - 第三维 |
| 获取一维张量中元素的<br>整形值 |int Get1DInt(<br>int i) | i - 第一维 |
| 获取二维张量中元素的<br>整形值 | int Get2DInt(<br>int ni, int mi) | ni - 第一维 <br> mi - 第二维 |
| 获取三维张量中元素的整形值 | int Get3DInt(<br>int d0, int d1, int d2) | d0 - 第一维 <br> d1 - 第二维 <br> d2 - 第三维 |
| 获取稀疏张量的值 | DTYPE GetInSparse(int i) | i - 稀疏矩阵中非0元素位置 |
| 获取稀疏张量中<br> 元组的键值 | int GetKeyInSparse(int i) | i - 稀疏矩阵中非0元素位置 |
| 设置单元中的值 | bool Set(<br>DTYPE value, int index[], int size = -1) | value - 值 <br> index - 元素位置 <br> size - 矩阵大小 |
| 设置一维张量中的单元值 | bool Set1D(<br>DTYPE value, int i) | value - 值 <br> i - 第一维 |
| 设置二维张量中的单元值 | bool Set2D(<br>DTYPE value, int ni, int mi) | value - 值 <br> ni - 第一维 <br> mi - 第二维 |
| 设置三维张量中的单元值 | bool Set3D(<br>DTYPE value, int d0, int d1, int d2) | value - 值 <br> d0 - 第一维 <br> d1 - 第二维 <br> d2 - 第三维 |
| 增加二维张量中<br> 的单元值 | bool Add2D(<br>DTYPE value, int ni, int mi = 0) | value - 单元值 <br> ni - 行值 <br> mi - 列值 |
| 获取稀疏矩阵中<br> 非零元素数量 | int GetNonzeroSize() | N/A |
| 设置张量为临时变量 | void SetTMP(<br>bool myIsTmp = true) | myIsTmp - 是否为临时变量 |
| 张量是否保持梯度 | void SetGrad(<br>bool myIsGrad = true) | myIsGrad - 是否保持梯度 |
| 将矩阵重置为特定大小 | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - 张量的维度 <br> myDimSize - 张量每一维的大小,索引0表示第一维 <br> myDataType - 张量的数据类型 <br> myDenseRatio - 张量的稠密度,1表示稠密张量 |
| 将矩阵重置为特定大小<br>并不申请新空间 | bool ResizeWithNoData(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - 张量的维度 <br> myDimSize - 张量每一维的大小,索引0表示第一维 <br> myDataType - 张量的数据类型 <br> myDenseRatio - 张量的稠密度,1表示稠密张量 |
| 将矩阵重置为<br> 另一矩阵大小 | bool Resize(<br> const XTensor * myTensor) | myTensor - 重置矩阵大小的参考矩阵 |
| 在缓冲区创建张量 | XTensor * NewTensorBuf( <br> const int myOrder, <br> const int * myDimSize, XMem * myMem, <br> const TENSOR_DATA_TYPE myDataType = <br> X_FLOAT, const float myDenseRatio = 1.0F) | myOrder - 张量的维度 <br> myDimSize - 张量每一维的大小,索引0表示第一维 <br> myMem - 张量所使用的内存池 <br> myDataType - 张量的数据类型 <br> myDenseRatio - 张量的稠密度,1表示稠密张量 |
| 依据给定张量<br>复制一个新的张量 | XTensor * NewTensor(<br>XTensor * a, bool isFilledData = true) | a - 给定张量 <br> isFilledData - 是否申请张量中的数据空间 |
| 依据给定张量<br>释放数据空间 | void DelTensor(<br>const XTensor * tensor) | tensor - 给定张量 |
| 依据给定张量<br>在缓存中释放数据空间 | void DelTensorBuf(<br>const XTensor * tensor) | tensor - 给定张量 |
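结合上表,下面给出一段创建并释放张量的简单示例(仅作示意):
```
/* create a 2 * 3 tensor on the default device (CPU) */
int dimSize[2] = {2, 3};
XTensor * t = NewTensor(2, dimSize);

/* initialize it with zeros */
t->SetZeroAll();

/* release the tensor when it is no longer needed */
DelTensor(t);
```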
...@@ -21,6 +21,7 @@ ...@@ -21,6 +21,7 @@
#include <stdio.h>
#include "XNet.h"
#include "../tensor/XUtility.h"
#include "../tensor/function/FHeader.h"
#include "../tensor/core/CHeader.h"
#include "../sample/fnnlm/FNNLM.h"
...@@ -29,13 +30,20 @@ ...@@ -29,13 +30,20 @@
//#include <stdlib.h>
//#include <crtdbg.h>

void TransposeTest();
void SumDimTest();

using namespace nts;
using namespace fnnlm;

int main( int argc, const char ** argv )
{
//TransposeTest();
//return 0;
//SumDimTest();
//return 0;
if(argc > 1 && !strcmp(argv[1], "-test")) if(argc > 1 && !strcmp(argv[1], "-test"))
1;//Test(); 1;//Test();
else if(argc > 1 && !strcmp(argv[1], "-fnnlm")) else if(argc > 1 && !strcmp(argv[1], "-fnnlm"))
...@@ -47,6 +55,8 @@ int main( int argc, const char ** argv ) ...@@ -47,6 +55,8 @@ int main( int argc, const char ** argv )
fprintf(stderr, "Or run this program with \"-fnnlm\" for sample FNNLM!\n"); fprintf(stderr, "Or run this program with \"-fnnlm\" for sample FNNLM!\n");
} }
return 0;
XNet net; XNet net;
XTensor a; XTensor a;
XTensor b; XTensor b;
...@@ -80,3 +90,116 @@ int main( int argc, const char ** argv ) ...@@ -80,3 +90,116 @@ int main( int argc, const char ** argv )
return 0;
}
void TransposeTest()
{
#ifdef USE_CUDA
XMem mem0(0, UNI_FREE, MILLION * 64, 1024, MILLION * 64);
//XMem mem1(1, UNI_FREE, MILLION * 64, 1024, MILLION * 64);
XTensor x;
XTensor y;
XTensor z;
int loops = 2000;
int B = 3 * 2 * 4;
int K = 8 * 1;
int N = 50;
int H = 512 * 4;
int nnn = GDevs.nGPU;
InitTensor3D(&x, B, N, H, X_FLOAT, 0);
InitTensor4D(&y, K, B, N, H/K, X_FLOAT, 0);
InitTensor3D(&z, B, N, H, X_FLOAT, 0);
cudaEvent_t ctime0;
cudaEvent_t ctime1;
cudaEvent_t ctime2;
cudaEvent_t ctime3;
cudaEvent_t ctime4;
cudaEvent_t ctime5;
float elapsedSplit = 0.0;
float elapsedMerge = 0.0;
float elapsedSum = 0.0;
cudaEventCreate(&ctime0);
cudaEventCreate(&ctime1);
cudaEventCreate(&ctime2);
cudaEventCreate(&ctime3);
cudaEventCreate(&ctime4);
cudaEventCreate(&ctime5);
cudaEventRecord(ctime0, 0);
double time0 = GetClock();
for(int i = 0; i < loops; i++)
_Split(&x, &y, 2, K);
double time1 = GetClock();
cudaEventRecord(ctime1, 0);
cudaEventSynchronize(ctime1);
cudaEventElapsedTime(&elapsedSplit, ctime0, ctime1);
cudaEventRecord(ctime2, 0);
double time2 = GetClock();
for(int i = 0; i < loops; i++)
_Merge(&y, &x, 3);
double time3 = GetClock();
cudaEventRecord(ctime3, 0);
cudaEventSynchronize(ctime3);
cudaEventElapsedTime(&elapsedMerge, ctime2, ctime3);
cudaEventRecord(ctime4, 0);
double time4 = GetClock();
for(int i = 0; i < loops; i++)
_Sum(&x, &z, &x);
double time5 = GetClock();
cudaEventRecord(ctime5, 0);
cudaEventSynchronize(ctime5);
cudaEventElapsedTime(&elapsedSum, ctime4, ctime5);
fprintf(stderr, "split:%f merge:%f sum:%f\n", time1 - time0, time3 - time2, time5 - time4);
fprintf(stderr, "split:%f merge:%f sum:%f\n", elapsedSplit, elapsedMerge, elapsedSum);
#endif
}
void SumDimTest()
{
XTensor x;
XTensor y;
XTensor z;
int a = 5;
int b = 7;
int c = 3;
InitTensor3D(&x, a, b, c, X_FLOAT, -1);
InitTensor1D(&y, c, X_FLOAT, -1);
InitTensor3D(&z, a, b, c, X_FLOAT, -1);
x.SetZeroAll();
y.SetZeroAll();
z.SetZeroAll();
float * data = new float[x.unitNum];
for(int i = 0; i < x.unitNum; i++)
data[i] = (DTYPE)i;
x.SetData(data, x.unitNum);
for(int i = 0; i < y.unitNum; i++)
data[i] = -(DTYPE)i;
y.SetData(data, y.unitNum);
_SumDim(&x, &y, &z, 2);
z.Dump(stderr, "z:");
delete[] data;
}
...@@ -63,6 +63,8 @@ void XFuncGrad::MakeGrad(XTensor * node) ...@@ -63,6 +63,8 @@ void XFuncGrad::MakeGrad(XTensor * node)
else{
ShowNTErrors("Wrong activation function type!");
}

node->visitMark = NODE_FINISHED;
}

/* indicates whether the node is for an activation function */
...@@ -37,10 +37,46 @@ void XMathGrad::MakeGrad(XTensor * node) ...@@ -37,10 +37,46 @@ void XMathGrad::MakeGrad(XTensor * node)
if(operID == MATH_SUM)
GradSum(node);
else if(operID == MATH_SUMDIM)
GradSumDim(node);
else if(operID == MATH_MULTIPLY) else if(operID == MATH_MULTIPLY)
GradMultiply(node); GradMultiply(node);
else if(operID == MATH_MATRIXMUL) else if(operID == MATH_MATRIXMUL)
GradMatrixMul(node); GradMatrixMul(node);
else if (operID == MATH_LOG)
GradLog(node);
else if (operID == MATH_POWER)
GradPower(node);
else if (operID == MATH_NEGATE)
GradNegate(node);
else if (operID == MATH_SCALEANDSHIFT)
GradScaleAndShift(node);
else if (operID == MATH_DIV)
GradDiv(node);
else if (operID == MATH_SUB)
GradSub(node);
else if (operID == MATH_SIN)
GradSin(node);
else if (operID == MATH_COS)
GradCos(node);
else if (operID == MATH_TAN)
GradTan(node);
else if (operID == MATH_EXP)
GradExp(node);
else if (operID == MATH_NORMALIZE)
GradNormalize(node);
else if (operID == MATH_ABSOLUTE)
GradAbsolute(node);
else if (operID == MATH_SIGN)
GradSign(node);
else if (operID == REDUCE_REDUCEMEAN)
GradReduceMean(node);
else if (operID == REDUCE_REDUCESUM)
GradReduceSum(node);
else if (operID == REDUCE_REDUCESUMSQUARED)
GradReduceSumSquared(node);
else if (operID == REDUCE_REDUCEVARIANCE)
GradReduceVariance(node);
else{
ShowNTErrors("TODO!");
}
...@@ -70,11 +106,108 @@ void XMathGrad::GradSum(XTensor * node) ...@@ -70,11 +106,108 @@ void XMathGrad::GradSum(XTensor * node)
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
DTYPE beta = income.GetParam(0);

XNoder::MakeGrad(a);
XNoder::MakeGrad(b);

_Sum(a->grad, node->grad, a->grad);
_Sum(b->grad, node->grad, b->grad, beta);
node->visitMark = NODE_FINISHED;
}
/*
gradient for sum with one dimension
c = a + b * \beta
where the size of b is equal to dimension n of a, i.e., |b| = a.dimSize[n]
dE/da = dE/dc
dE/db = dE/dc.reduce(0,...,n-1,n+1,...) * \beta
*/
void XMathGrad::GradSumDim(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for SUMDIM!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
int n = income.GetParamInt(0);
DTYPE beta = income.GetParam(1);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_Sum(a->grad, node->grad, a->grad);
int order = a->order;
int dimSize[MAX_TENSOR_DIM_NUM];
memcpy(dimSize, a->dimSize, sizeof(int) * a->order);
if(n == order - 1){
int reshapedSize[MAX_TENSOR_DIM_NUM];
reshapedSize[0] = a->unitNum/dimSize[order - 1];
reshapedSize[1] = dimSize[order - 1];
/* we reshape dE/dc to a matrix whose column number is equal to the
size of b. Then we can reduce the matrix into a row vector. */
node->grad->Reshape(2, reshapedSize);
if(b->outgo.tailNum > 1){
XTensor * bGradTMP = NewTensorBuf(b->grad, b->devID, b->mem);
_ReduceSum(node->grad, bGradTMP, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(bGradTMP, beta);
_Sum(bGradTMP, b->grad, b->grad);
DelTensorBuf(bGradTMP);
}
else{
_ReduceSum(node->grad, b->grad, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(b->grad, beta);
}
node->grad->Reshape(order, dimSize);
}
else{
int reshapedSize[MAX_TENSOR_DIM_NUM];
reshapedSize[0] = 1;
reshapedSize[1] = dimSize[n];
reshapedSize[2] = 1;
for(int i = 0; i < order; i++){
if(i < n)
reshapedSize[0] *= dimSize[i];
}
reshapedSize[2] = a->unitNum / (reshapedSize[0] * reshapedSize[1]);
/* we reshape dE/dc to a 3D tensor of size (x, y, z) where y = |b|.
Then reduce along with z and x to obtain dE/db. */
node->grad->Reshape(3, reshapedSize);
XTensor * interGrad = NewTensorBuf(2, reshapedSize, b->dataType, b->denseRatio, b->devID, b->mem);
_ReduceSum(node->grad, interGrad, 2);
if(b->outgo.tailNum > 1){
XTensor * bGradTMP = NewTensorBuf(b->grad, b->devID, b->mem);
_ReduceSum(interGrad, bGradTMP, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(bGradTMP, beta);
_Sum(bGradTMP, b->grad, b->grad);
DelTensorBuf(bGradTMP);
}
else{
_ReduceSum(interGrad, b->grad, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(b->grad, beta);
}
node->grad->Reshape(order, dimSize);
DelTensorBuf(interGrad);
}
node->visitMark = NODE_FINISHED;
} }
/* /*
...@@ -99,6 +232,8 @@ void XMathGrad::GradMultiply(XTensor * node) ...@@ -99,6 +232,8 @@ void XMathGrad::GradMultiply(XTensor * node)
CheckNTErrors(XTensor::IsSameShaped(a, b), "Wrong sized input tensors!");

_Multiply(node->grad, b, a->grad, 1.0F);
_Multiply(node->grad, a, b->grad, 1.0F);

node->visitMark = NODE_FINISHED;
}

/*
...@@ -167,6 +302,557 @@ void XMathGrad::GradMatrixMul(XTensor * node) ...@@ -167,6 +302,557 @@ void XMathGrad::GradMatrixMul(XTensor * node)
/* dE/db = a * dE/dc * \alpha */
_MatrixMul(a, X_NOTRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
}
node->visitMark = NODE_FINISHED;
}
/*
gradient for log
for
c = log(a)
we have
dE/da = dE/dc * 1/a
>> node - the node (c) for backward computation
*/
void XMathGrad::GradLog(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for LOG!");
XTensor * a = income.tails[0];
XNoder::MakeGrad(a);
_Div(node->grad, a, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
}
/*
gradient for power
for
c = pow(a,p)
we have
dE/da = (dE/dc) * p*a^(p-1)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradPower(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for POWER!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
DTYPE p = income.GetParam(0);
XNoder::MakeGrad(a);
_Power(a, b, p - 1.0F);
_ScaleAndShift(b, c, p);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for negate
for
c = -a
we have
dE/da = dE/dc * (-1)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradNegate(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for NEGATE!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_ScaleAndShift(node->grad, b, -1.0F);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for ScaleAndShift
for
c = a * scale + shift
we have
dE/da = dE/dc * scale
>> node - the node (c) for backward computation
*/
void XMathGrad::GradScaleAndShift(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SCALEANDSHIFT!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
DTYPE scale = income.GetParam(0);
XNoder::MakeGrad(a);
_ScaleAndShift(node->grad, b, scale);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for minus
for
c = a - b * \beta
we have
dE/da = dE/dc
dE/db = -dE/dc * \beta
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSub(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for SUBSTRACT!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
DTYPE beta = income.GetParam(0);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_Sum(a->grad, node->grad, a->grad);
_Sum(b->grad, node->grad, b->grad, -beta);
node->visitMark = NODE_FINISHED;
}
/*
gradient for divide
for
c = a / b
we have
dE/da = dE/dc / b
dE/db = dE/dc * a / -b^2
>> node - the node (c) for backward computation
*/
void XMathGrad::GradDiv(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for DIVIDE!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(b);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(b);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
CheckNTErrors(XTensor::IsSameShaped(a, b), "Wrong sized input tensors!");
_Div(node->grad, b, a->grad, 1.0F);
_Power(b, c, -2.0F);
_Multiply(a, c, d);
_ScaleAndShift(d, e, -1.0F);
_Multiply(node->grad, e, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
}
/*
gradient for exp
for
c = exp(a)
we have
dE/da = dE/dc * exp(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradExp(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for EXP!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Exp(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for sin
for
c = sin(a)
we have
dE/da = dE/dc * cos(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSin(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Cos(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for cos
for
c = cos(a)
we have
dE/da = dE/dc * -sin(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradCos(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for COS!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
XNoder::MakeGrad(a);
_Sin(a, b);
_ScaleAndShift(b, c, -1.0F);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for tan
for
c = tan(a)
we have
dE/da = dE/dc * 1/(cos(a))^2
>> node - the node (c) for backward computation
*/
void XMathGrad::GradTan(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TAN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
XNoder::MakeGrad(a);
_Cos(a, b);
_Power(b, c, -2.0F);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for normalize
>> node - the node (c) for backward computation
*/
void XMathGrad::GradNormalize(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 5, "Wrong input tensor number for NORMALIZE!");
XTensor * input = income.tails[0];
XTensor * mean = income.tails[1];
XTensor * var = income.tails[2];
XTensor * a = income.tails[3];
XTensor * b = income.tails[4];
XTensor * c = NewTensor(a);
XTensor * d = NewTensor(a);
XTensor * e = NewTensor(a);
XTensor * f = NewTensor(a);
XTensor * g = NewTensor(a);
XTensor * h = NewTensor(a);
XTensor * i = NewTensor(a);
XTensor * j = NewTensor(a);
XTensor * k = NewTensor(a);
XTensor * p = NewTensor(a);
XTensor * q = NewTensor(a);
XTensor * r = NewTensor(a);
DTYPE epsilon = income.GetParamInt(0);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(input);
XNoder::MakeGrad(mean);
XNoder::MakeGrad(var);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
/* dEdinput */
_ScaleAndShift(var, c, 1.0F, epsilon);
_Unsqueeze(c, d, dim, n);
_Power(d, e, -0.5F);
_Multiply(a, e, f);
_Multiply(node->grad, f, input->grad, 1.0F);
/* dEdmean */
_ScaleAndShift(f, g, -1.0F);
_Multiply(node->grad, g, mean->grad, 1.0F);
/* dEdvar */
_Unsqueeze(mean, h, dim, n);
_Sub(input, h, i);
_Multiply(a, i, j);
_Power(var, k, -1.5F);
_ScaleAndShift(k, p, -0.5F);
_Multiply(j, p, q);
_Multiply(node->grad, q, var->grad, 1.0F);
/* dEda */
_Multiply(i, e, r);
_Multiply(node->grad, r, a->grad, 1.0F);
/* dEdb */
_Sum(b->grad, node->grad, b->grad);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
delete f;
delete g;
delete h;
delete i;
delete j;
delete k;
delete p;
delete q;
delete r;
}
/*
gradient for absolute
for
c = |a|
we have
dE/da = dE/dc a >= 0
-dE/dc a < 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradAbsolute(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for ABSOLUTE!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Sign(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for sign
for
c = sign(a)
we have
dE/da = 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSign(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIGN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
b->SetZeroAll();
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for reduceMean
for
c = reduceMean(a, dim)
we have
dE/da = Unsqueeze(dE/dc) * 1/dimSizeA[dim]
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceMean(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
_Unsqueeze(node->grad, b, dim, n);
_ScaleAndShift(b, c, 1.0F / n);
_Sum(a->grad, c, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for reduceSum
for
c = reduceSum(a, dim)
we have
dE/da = Unsqueeze(dE/dc) * 1
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceSum(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
_Unsqueeze(node->grad, b, dim, n);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for reduceSumSquared
for
c = reduceSumSquared(a, dim, b)
we have
dE/da = Unsqueeze(dE/dc) * 2a
dE/db = Unsqueeze(dE/dc) * (-2b)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceSumSquared(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(a);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(c);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_ScaleAndShift(a, c, 2.0F);
_ScaleAndShift(b, d, -2.0F);
_Unsqueeze(node->grad, e, dim, n);
_Multiply(e, c, a->grad, 1.0F);
_Multiply(node->grad, d, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
}
/*
gradient for reduceVariance
for
c = reduceVariance(a, dim, b)
we have
dE/da = Unsqueeze(dE/dc) * 2a/dimSizeA[dim]
dE/db = Unsqueeze(dE/dc) * (-2a/dimSizeA[dim])
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceVariance(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(a);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_ScaleAndShift(a, c, 2.0F / n);
_ScaleAndShift(b, d, -2.0F / n);
_Unsqueeze(node->grad, e, dim, n);
_Multiply(e, c, a->grad, 1.0F);
_Multiply(node->grad, d, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
}
}
...@@ -44,6 +44,11 @@ private: ...@@ -44,6 +44,11 @@ private:
static
void GradSum(XTensor * node);
/* gradient for sum with one dimension: c = a + b * \beta
where the size of b is equal to that of one dimension of a */
static
void GradSumDim(XTensor * node);
/* gradient for multiply (dot product): c = a * b */
static
void GradMultiply(XTensor * node);
...@@ -51,6 +56,74 @@ private: ...@@ -51,6 +56,74 @@ private:
/* gradient for matrix multiply: c = matmul(a, b) */
static
void GradMatrixMul(XTensor * node);
/* gradient for log: c = log(a) */
static
void GradLog(XTensor * node);
/* gradient for power */
static
void GradPower(XTensor * node);
/* gradient for negate */
static
void GradNegate(XTensor * node);
/* gradient for ScaleAndShift */
static
void GradScaleAndShift(XTensor * node);
/* gradient for Minus */
static
void GradSub(XTensor * node);
/* gradient for Divide */
static
void GradDiv(XTensor * node);
/* gradient for reduceMean */
static
void GradReduceMean(XTensor * node);
/* gradient for reduceSum */
static
void GradReduceSum(XTensor * node);
/* gradient for reduceSumSquared */
static
void GradReduceSumSquared(XTensor * node);
/* gradient for reduceVariance */
static
void GradReduceVariance(XTensor * node);
/* gradient for sin */
static
void GradSin(XTensor * node);
/* gradient for cos */
static
void GradCos(XTensor * node);
/* gradient for tan */
static
void GradTan(XTensor * node);
/* gradient for exp */
static
void GradExp(XTensor * node);
/* gradient for normalize */
static
void GradNormalize(XTensor * node);
/* gradient for absolute */
static
void GradAbsolute(XTensor * node);
/* gradient for sign */
static
void GradSign(XTensor * node);
};

}
...@@ -43,6 +43,12 @@ void XShapeGrad::MakeGrad(XTensor * node) ...@@ -43,6 +43,12 @@ void XShapeGrad::MakeGrad(XTensor * node)
GradMergeList(node);
else if(operID == SHAPE_UNSQUEEZE)
GradUnsqueeze(node);
else if(operID == SHAPE_SPLIT)
GradSplit(node);
else if(operID == SHAPE_SPLIT_LIST)
GradSplitList(node);
else if (operID == SHAPE_TRANSPOSE)
GradTranspose(node);
else{
ShowNTErrors("TODO!");
}
...@@ -55,6 +61,13 @@ bool XShapeGrad::IsShapeOP(XTensor * node) ...@@ -55,6 +61,13 @@ bool XShapeGrad::IsShapeOP(XTensor * node)
return (income.typeID & DATA_BASE) != 0;
}
/* post processing of a node */
void XShapeGrad::PostProcessing(XTensor * node, int typeID)
{
if(typeID == SHAPE_SPLIT_LIST)
GradSplitListPost(node);
}
/* /*
gradient for merge gradient for merge
for for
...@@ -134,6 +147,8 @@ void XShapeGrad::GradMerge(XTensor * node) ...@@ -134,6 +147,8 @@ void XShapeGrad::GradMerge(XTensor * node)
gradInputSmall.data = NULL;

delete[] dims;
node->visitMark = NODE_FINISHED;
}

/*
...@@ -213,6 +228,120 @@ void XShapeGrad::GradMergeList(XTensor * node) ...@@ -213,6 +228,120 @@ void XShapeGrad::GradMergeList(XTensor * node)
gradSmall.data = NULL;
delete[] dims;
}
node->visitMark = NODE_FINISHED;
}
/*
gradient computation for split:
for
c = split(a)
we have
dE/da = merge(dE/dc)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplit(XTensor * node)
{
XLink &income = node->income;
XTensor * input = income.tails[0];
int whereToSplit = income.GetParamInt(0);
int splitNum = income.GetParamInt(1);
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SPLIT!");
CheckNTErrors(node->order == input->order + 1, "Wrong tensor orders!");
CheckNTErrors(splitNum == node->dimSize[0], "Wrong split number!");
XNoder::MakeGrad(input);
/* we can simply merge the gradient tensor
if the input is used in spliting only */
if(input->outgo.tailNum == 1)
_Merge(node->grad, input->grad, whereToSplit + 1, 0);
/* if the tensor is used somewhere else, we need another SUM
for gradient accumulation */
else{
XTensor inputGradTMP(input);
_Merge(node->grad, &inputGradTMP, whereToSplit + 1, 0);
_Sum(input->grad, &inputGradTMP, input->grad);
}
node->visitMark = NODE_FINISHED;
}
/*
gradient computation for spliting
where we return the list of the splits
for
list(c_1, ...) = split(a)
we have
dE/da = merge(dE/c_1, ...)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplitList(XTensor * node)
{
XLink &income = node->income;
XTensor * input = income.tails[0];
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SPLIT!");
CheckNTErrors(node->order == input->order + 1, "Wrong tensor orders!");
node->visitMark = NODE_DOING;
}
/*
gradient computation for spliting. We return
the list of the splits : list(c_1, ...) = split(a).
this method is called only when all nodes of spliting
have been processed. We do this in a post-processing
manner because we can fuze multiple memory copy jobs
one time. This is good for system speed up.
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplitListPost(XTensor * node)
{
/* we compute the gradient for current node, rather than for
child node, i.e., we use the outgoing edge here */
XLink &outgo = node->outgo;
XList splits(outgo.tailNum);
int whereToSplit = -1;
int splitNum = 0;
for(int i = 0; i < outgo.tailNum; i++){
XTensor * parent = (XTensor*)outgo.tails[i];
XLink &income = parent->income;
if(income.typeID == SHAPE_SPLIT_LIST){
int w = income.GetParamInt(0);
int splitID = income.GetParamInt(1);
if(whereToSplit < 0)
whereToSplit = w;
splitNum++;
CheckNTErrors(whereToSplit == w, "Wrong dimension for spliting");
CheckNTErrors(income.tailNum == 1, "Something wrong with outgoing edge!");
CheckNTErrors(splitNum - 1 == splitID, "Wrong split id!");
splits.Add(parent);
}
}
/* we can simply merge the gradient tensor
if the node is used in spliting only */
if(outgo.tailNum == splitNum){
_Merge(&splits, node->grad, whereToSplit + 1);
}
/* if the tensor is used as input to other nodes
somewhere else, we need another SUM for gradient
accumulation */
else{
XTensor nodeGradTMP(node);
_Merge(&splits, &nodeGradTMP, whereToSplit + 1);
_Sum(node->grad, &nodeGradTMP, node->grad);
}
}

/*
...@@ -239,6 +368,40 @@ void XShapeGrad::GradUnsqueeze(XTensor * node) ...@@ -239,6 +368,40 @@ void XShapeGrad::GradUnsqueeze(XTensor * node)
CheckNTErrors(output->unitNum == input->unitNum * dSize, "Wrong tensor size!");

_ReduceSum(output->grad, input->grad, dim);
node->visitMark = NODE_FINISHED;
}
/*
gradient for transposing a tensor
for
c = Transpose(a)
we have
dE/da = Transpose(dE/dc)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradTranspose(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TRANSPOSE!");
XTensor * output = node;
XTensor * input = income.tails[0];
XTensor * b = NewTensor(input);
XNoder::MakeGrad(input);
int i = income.GetParamInt(0);
int j = income.GetParamInt(1);
CheckNTErrors(input->order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(input->order > j && j >= 0, "index of dimension is out of scope!");
_Transpose(output->grad, b, i, j);
_Sum(input->grad, b, input->grad);
node->visitMark = NODE_FINISHED;
delete b;
} }
}
...@@ -40,18 +40,41 @@ public: ...@@ -40,18 +40,41 @@ public:
static
bool IsShapeOP(XTensor * node);
/* post processing of a node */
static
void PostProcessing(XTensor * node, int typeId);
private:
/* gradient computation for merge: c = merge(a, b, ...) */
static
void GradMerge(XTensor * node);
/* gradient computation for merging a list of tensors : c = merge(list(a, b, ...)) */
static
void GradMergeList(XTensor * node);
/* gradient computation for split: c = split(a) */
static
void GradSplit(XTensor * node);
/* gradient computation for spliting. we return the list of the splits : list(c_1, ...) = split(a) */
static
void GradSplitList(XTensor * node);
/* gradient computation for spliting. we return the list of the splits : list(c_1, ...) = split(a).
this method is called only when all nodes of spliting have been processed. We do this in a post-processing
manner because we can fuze multiple memory copy jobs one time. This is good for system speed up. */
static
void GradSplitListPost(XTensor * node);
/* gradient computation for unsqueezing a tensor : c = unsqueeze(a) */
static
void GradUnsqueeze(XTensor * node);
/* gradient computation for transposing a tensor : c = transpose(a) */
static
void GradTranspose(XTensor * node);
};

}
...@@ -143,7 +143,7 @@ void XNet::Backward(XList &roots, XList &golds, LOSS_FUNCTION_NAME loss) ...@@ -143,7 +143,7 @@ void XNet::Backward(XList &roots, XList &golds, LOSS_FUNCTION_NAME loss)
/* back-propagation from output to input */
for(int i = nodes.count - 1; i >= 0; i--){
XTensor * node = (XTensor*)nodes.Get(i);

if(node->visitMark == NODE_FINISHED)
continue;
...@@ -176,6 +176,10 @@ void XNet::BackwardNode(XTensor * node) ...@@ -176,6 +176,10 @@ void XNet::BackwardNode(XTensor * node)
return;

if(!XNoder::IsLeaf(node)){
/* post processing for parent nodes */
BackwardNodePost(node);
/* process the current node */
if(XMathGrad::IsMathOP(node))
XMathGrad::MakeGrad(node);
else if(XFuncGrad::IsFunc(node))
...@@ -186,8 +190,24 @@ void XNet::BackwardNode(XTensor * node) ...@@ -186,8 +190,24 @@ void XNet::BackwardNode(XTensor * node)
ShowNTErrors("Wrong node type!"); ShowNTErrors("Wrong node type!");
} }
} }
}
/*
backward computation (in post processing) for a given node
>> node - the node whose parent nodes are not processed yet. So
we do the job at the child node.
*/
void XNet::BackwardNodePost(XTensor * node)
{
bool isSplitList = false;
XLink &outgo = node->outgo;
for(int i = 0; i < outgo.tailNum; i++){
if(outgo.tails[i]->income.typeID == SHAPE_SPLIT_LIST)
isSplitList = true;
}
if(isSplitList)
XShapeGrad::PostProcessing(node, SHAPE_SPLIT_LIST);
}

/*
...@@ -73,6 +73,9 @@ struct XNet ...@@ -73,6 +73,9 @@ struct XNet
/* backward computation for a given node */
void BackwardNode(XTensor * node);
/* backward computation (in post processing) for a given node */
void BackwardNodePost(XTensor * node);
/* traverse the net and find the topological order by
depth-first search (Tarjan's algorithm) */
void Traverse(XTensor &root);
...@@ -33,7 +33,7 @@ ...@@ -33,7 +33,7 @@
#include "../../tensor/function/FHeader.h" #include "../../tensor/function/FHeader.h"
#include "../../network/XNet.h" #include "../../network/XNet.h"
namespace samplefnnlm namespace fnnlm
{ {
#define MAX_NAME_LENGTH 1024 #define MAX_NAME_LENGTH 1024
...@@ -57,7 +57,7 @@ void LoadArgs(int argc, const char ** argv, FNNModel &model); ...@@ -57,7 +57,7 @@ void LoadArgs(int argc, const char ** argv, FNNModel &model);
void Init(FNNModel &model);
void Check(FNNModel &model);
void Copy(FNNModel &tgt, FNNModel &src);
void Clear(FNNModel &model, bool isNodeGrad);
void InitModelTensor1D(XTensor &tensor, int num, FNNModel &model);
void InitModelTensor2D(XTensor &tensor, int rowNum, int colNum, FNNModel &model);
void Train(const char * train, bool isShuffled, FNNModel &model);
...@@ -153,43 +153,80 @@ load arguments ...@@ -153,43 +153,80 @@ load arguments
*/
void LoadArgs(int argc, const char ** argv, FNNModel &model)
{
fprintf(stderr, "args:\n");
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], "-train") && i + 1 < argc){
strcpy(trainFN, argv[i + 1]);
fprintf(stderr, " -train=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-model") && i + 1 < argc){
strcpy(modelFN, argv[i + 1]);
fprintf(stderr, " -model=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-test") && i + 1 < argc){
strcpy(testFN, argv[i + 1]);
fprintf(stderr, " -test=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-output") && i + 1 < argc){
strcpy(outputFN, argv[i + 1]);
fprintf(stderr, " -output=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-n") && i + 1 < argc){
model.n = atoi(argv[i + 1]);
fprintf(stderr, " -n=%d\n", model.n);
}
if(!strcmp(argv[i], "-esize") && i + 1 < argc){
model.eSize = atoi(argv[i + 1]);
fprintf(stderr, " -esize=%d\n", model.eSize);
}
if(!strcmp(argv[i], "-vsize") && i + 1 < argc){
model.vSize = atoi(argv[i + 1]);
fprintf(stderr, " -vsize=%d\n", model.vSize);
}
if(!strcmp(argv[i], "-hdepth") && i + 1 < argc){
model.hDepth = atoi(argv[i + 1]);
fprintf(stderr, " -hdepth=%d\n", model.hDepth);
}
if(!strcmp(argv[i], "-hsize") && i + 1 < argc){
model.hSize = atoi(argv[i + 1]);
fprintf(stderr, " -hsize=%d\n", model.hSize);
}
if(!strcmp(argv[i], "-lrate") && i + 1 < argc){
learningRate = (float)atof(argv[i + 1]);
fprintf(stderr, " -lrate=%f\n", learningRate);
}
if(!strcmp(argv[i], "-nstep") && i + 1 < argc){
nStep = atoi(argv[i + 1]);
fprintf(stderr, " -nstep=%d\n", nStep);
}
if(!strcmp(argv[i], "-nepoch") && i + 1 < argc){
nEpoch = atoi(argv[i + 1]);
fprintf(stderr, " -nepoch=%d\n", nEpoch);
}
if(!strcmp(argv[i], "-minmax") && i + 1 < argc){
minmax = (float)fabs(atof(argv[i + 1]));
fprintf(stderr, " -minmax=%f\n", minmax);
}
if(!strcmp(argv[i], "-batch") && i + 1 < argc){
sentBatch = atoi(argv[i + 1]);
fprintf(stderr, " -batch=%d\n", sentBatch);
}
if(!strcmp(argv[i], "-wbatch") && i + 1 < argc){
wordBatch = atoi(argv[i + 1]);
fprintf(stderr, " -wbatch=%d\n", wordBatch);
}
if(!strcmp(argv[i], "-shuffle")){
shuffled = true;
fprintf(stderr, " -shuffle=true\n");
}
if(!strcmp(argv[i], "-autodiff")){
autoDiff = true;
fprintf(stderr, " -autodiff=true\n");
}
if(!strcmp(argv[i], "-dev") && i + 1 < argc){
model.devID = atoi(argv[i + 1]);
fprintf(stderr, " -dev=%d\n", model.devID);
}
}
for(int i = 0; i < argc; i++){
...@@ -230,16 +267,37 @@ void Copy(FNNModel &tgt, FNNModel &src) ...@@ -230,16 +267,37 @@ void Copy(FNNModel &tgt, FNNModel &src)
}
}

/*
reset model parameters
>> model - the model whose parameter (gradient) is set to 0
>> isNodeGrad - indicates whether the tensor node keeps the
gradient information
*/
void Clear(FNNModel &model, bool isNodeGrad)
{
if (isNodeGrad) {
if(model.embeddingW.grad != NULL)
model.embeddingW.grad->SetZeroAll();
for (int i = 0; i < MAX_HIDDEN_NUM; i++) {
if(model.hiddenW[i].grad != NULL)
model.hiddenW[i].grad->SetZeroAll();
if(model.hiddenB[i].grad != NULL)
model.hiddenB[i].grad->SetZeroAll();
}
if(model.outputW.grad != NULL)
model.outputW.grad->SetZeroAll();
if(model.outputB.grad != NULL)
model.outputB.grad->SetZeroAll();
}
else {
model.embeddingW.SetZeroAll();
for (int i = 0; i < MAX_HIDDEN_NUM; i++) {
model.hiddenW[i].SetZeroAll();
model.hiddenB[i].SetZeroAll();
}
model.outputW.SetZeroAll();
model.outputB.SetZeroAll();
} }
model.outputW.SetZeroAll();
model.outputB.SetZeroAll();
} }
/*
@@ -401,7 +459,7 @@ void Train(const char * train, bool isShuffled, FNNModel &model)
            FNNNet net;

            /* gradient = 0 */
            Clear(grad, false);

            /* forward computation */
            Forward(inputs, output, model, net);
@@ -413,6 +471,9 @@ void Train(const char * train, bool isShuffled, FNNModel &model)
                Update(model, grad, learningRate, false);
            }
            else{
                /* gradient = 0 */
                Clear(model, true);

                /* forward + backward process */
                ForwardAutoDiff(inputs, output, model);
@@ -492,21 +553,24 @@ void Update(FNNModel &model, FNNModel &grad, float epsilon, bool isNodeGrad)
        gradList.Add(&grad.embeddingW);
    }
    else{
        gradList.Add(model.outputW.grad);
        gradList.Add(model.outputB.grad);
        for (int i = 0; i < model.hDepth; i++) {
            gradList.Add(model.hiddenW[i].grad);
            gradList.Add(model.hiddenB[i].grad);
        }
        gradList.Add(model.embeddingW.grad);
    }

    for (int i = 0; i < paraList.count; i++) {
        XTensor * para = (XTensor*)paraList.GetItem(i);
        XTensor * paraGrad = (XTensor*)gradList.GetItem(i);

        //fprintf(stderr, "%d\n", i);
        //paraGrad->Dump(stderr, "grad:", 10);

        /* the delta rule */
        _Sum(para, paraGrad, para, -epsilon);
    }
@@ -911,7 +975,6 @@ forward process (with tensor connections)
*/
void ForwardAutoDiff(XTensor inputs[], XTensor &output, FNNModel &model)
{
    int n = model.n;
    int depth = model.hDepth;
@@ -935,15 +998,13 @@ void ForwardAutoDiff(XTensor inputs[], XTensor &output, FNNModel &model)
    hidden = Merge(hidden, 2, 0);

    /* hidden layers */
    for(int i = 0; i < depth; i++)
        hidden = MMul(hidden, model.hiddenW[i]) + model.hiddenB[i];

    /* output layer */
    output = LogSoftmax(MMul(hidden, model.outputW) + model.outputB, 1);

    //XLink::ShowNetwork(stderr, &output);
}
/*
@@ -1040,18 +1101,23 @@ void Test(const char * test, const char * result, FNNModel &model)
        /* the gold standard */
        XTensor gold;

        if (!autoDiff) {
            /* prepare an empty network for building the fnn */
            FNNNet net;

            /* make the input tensor for position i */
            for (int i = 0; i < model.n - 1; i++)
                MakeWordBatch(inputs[i], ngrams, ngramNum, i, model.vSize, model.devID, model.mem);

            /* make the gold tensor */
            MakeWordBatch(gold, ngrams, ngramNum, model.n - 1, model.vSize, model.devID, model.mem);

            /* forward computation */
            Forward(inputs, output, model, net);
        }
        else {
            ForwardAutoDiff(inputs, output, model);
        }

        /* prediction probabilities */
        XTensor probs;
......
@@ -36,7 +36,7 @@
using namespace nts;

namespace fnnlm
{

#define _EXIT_(x)// exit(x)
@@ -126,7 +126,7 @@ struct FNNNet
    XTensor output;
};

/* entry point of the program */
int FNNLMMain(int argc, const char ** argv);

};
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <math.h>
#include "T2TAttention.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TAttention::T2TAttention()
{
nhead = -1;
dk = -1;
dv = -1;
d = -1;
}
/* deconstructor */
T2TAttention::~T2TAttention()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TAttention::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "nhead", &nhead, 8);
LoadParamInt(argc, argv, "dk", &dk, 512);
LoadParamInt(argc, argv, "dv", &dv, 512);
LoadParamInt(argc, argv, "d", &d, 512);
LoadParamFloat(argc, argv, "attminmax", &minmax, 0.08F);
InitTensor2D(&wk, d, dk, X_FLOAT, devID, mem);
InitTensor2D(&wq, d, dk, X_FLOAT, devID, mem);
InitTensor2D(&wv, d, dv, X_FLOAT, devID, mem);
wk.SetDataRand(-minmax, minmax);
wq.SetDataRand(-minmax, minmax);
wv.SetDataRand(-minmax, minmax);
}
/*
make the network
>> k - keys. It might be of size B * L * H
where B = batch size, L = sequence length,
and H = vector size of each position
>> q - queries
>> v - values
<< return - multi-attention result
*/
XTensor * T2TAttention::Make(XTensor * k, XTensor * q, XTensor * v)
{
XTensor k2;
XTensor q2;
XTensor v2;
    /* linear transformation before self-attention */
k2 = MMul(*k, wk);
q2 = MMul(*q, wq);
v2 = MMul(*v, wv);
XTensor kheads;
XTensor qheads;
XTensor vheads;
/* multi head */
kheads = Split(k2, k2.order - 1, nhead);
qheads = Split(q2, q2.order - 1, nhead);
vheads = Split(v2, v2.order - 1, nhead);
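    /* each Split divides the last dimension (dk or dv) into nhead sub-vectors,
       so every attention head works on its own slice of the projected representation */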
XTensor att;
XTensor scalar;
/* scalar = softmax(Q * K^T / sqrt(dk)) * V */
scalar = Softmax(Linear(BMMul(qheads, X_NOTRANS, kheads, X_TRANS), 1/sqrt((float)dk)), -1);
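    /* BMMul(..., X_TRANS) computes Q * K^T for every head, and Linear(..., 1/sqrt(dk))
       applies the scaling factor of scaled dot-product attention before the softmax */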
att = MMul(scalar, vheads);
XTensor * result = new XTensor();
/* concatenate the heads */
*result = Merge(att, -1);
return result;
}
}
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TATTENTION_H__
#define __T2TATTENTION_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
/*
multi-head attention
y(Q, K, V) = cat(head_1, head_2, ..., head_n)
where head_i = Attention(Q * w_i^Q, K * w_i^K, V * w_i^V)
attention(Q, K, V) = softmax(Q * K^T/d_k^0.5) V
d_k = dimension size of K
*/
class T2TAttention
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* head number */
int nhead;
/* transformation matrix for K */
XTensor wk;
/* transformation matrix for Q */
XTensor wq;
/* transformation matrix for V */
XTensor wv;
/* size of transformed Q and K */
int dk;
/* size of transformed V */
int dv;
/* size of input Q, K and V */
int d;
public:
/* constructor */
T2TAttention();
/* de-constructor */
~T2TAttention();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor * Make(XTensor * k, XTensor * q, XTensor * v);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TDECODER_H__
#define __T2TDECODER_H__
namespace transformer
{
class T2TDecoder
{
};
class AttDecoder : T2TDecoder
{
public:
/* initialize the model */
void InitModel(int argc, const char ** argv);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-01
*/
#include <math.h>
#include "T2TEmbedding.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TEmbedder::T2TEmbedder()
{
devID = -1;
mem = NULL;
vSize = -1;
maxLength = -1;
}
/* deconstructor */
T2TEmbedder::~T2TEmbedder()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TEmbedder::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
    int d = 0;

    LoadParamInt(argc, argv, "vsize", &vSize, -1);
    LoadParamInt(argc, argv, "maxlen", &maxLength, 256);
    LoadParamInt(argc, argv, "d", &d, 256);

    /* eSize was never initialized before use; assuming the embedding size equals the model dimension d */
    eSize = d;

    InitTensor2D(&w, vSize, eSize, X_FLOAT, devID, mem);
w.SetDataRandn(0, sqrt((float)eSize));
/* create the positional embedding matrix */
MakePosEmbedding(eSize, d, maxLength);
}
/*
make positional embeddings (of size length * eSize)
eSize - embedding size
d - dimension used in the sin/cos denominator
length - length of the sequence
*/
void T2TEmbedder::MakePosEmbedding(int eSize, int d, int length)
{
    InitTensor2D(&posEmbeddingBase, length, eSize, X_FLOAT, devID, mem);
    float * data = new float[posEmbeddingBase.unitNum];
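    /* following "Attention Is All You Need":
       PE(pos, 2i)   = sin(pos / 10000^(2i/d))
       PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) */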
for(int pos = 0; pos < length; pos++){
float * dp = data + pos * eSize;
for(int k = 0; k < eSize; k++){
if(k % 2 == 0){
int i = k/2;
dp[k] = sin(pos/pow(10000.0F, 2.0F*i/d));
}
else{
int i = (k - 1)/2;
dp[k] = cos(pos/pow(10000.0F, 2.0F*i/d));
}
}
}
    posEmbeddingBase.SetData(data, posEmbeddingBase.unitNum);
delete[] data;
}
/*
make the network
*/
XTensor * T2TEmbedder::Make(XTensor * input)
{
CheckNTErrors(input->GetDim(-1) == vSize, "Wrong vocabulary size!");
CheckNTErrors(input->order > 1, "Wrong input tensor size!");
CheckNTErrors(input->dimSize[input->order - 2] < maxLength, "The sequence is too long!");
int dims[MAX_TENSOR_DIM_NUM];
    memcpy(dims, input->dimSize, sizeof(int) * input->order);
dims[0] = eSize;
bool match = (posEmbedding.order == input->order);
if(match){
for(int i = 0; i < input->order; i++){
if(dims[i] != posEmbedding.GetDim(i))
match = false;
}
}
/* we make positional embeddings first */
if(!match){
InitTensor(&posEmbedding, input->order, dims, X_FLOAT, 1.0F, devID, mem);
XTensor * posTMP = NewTensorBuf(2, dims, X_FLOAT, 1.0F, devID, mem);
_CopyValues(&posEmbeddingBase, 0, posTMP->unitNum, posTMP, 0);
int dims2[MAX_TENSOR_DIM_NUM];
dims2[0] = dims[0];
dims2[1] = dims[1];
dims2[2] = posEmbedding.unitNum / (dims[0] * dims[1]);
posEmbedding.Reshape(3, dims2);
_Unsqueeze(posTMP, &posEmbedding, 0, dims2[2]);
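        /* the Unsqueeze call replicates the precomputed table along the leading
           (batch-like) dimension so that posEmbedding matches the input shape */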
posEmbedding.Reshape(input->order, dims);
DelTensorBuf(posTMP);
}
XTensor wordEmbedding;
/* then we make word embeddings */
wordEmbedding = MMul(*input, w);
XTensor * result = new XTensor();
/* we sum over the two embeddings */
*result = wordEmbedding + posEmbedding;
return result;
}
}
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-01
*/
#ifndef __T2TEMBEDDING_H__
#define __T2TEMBEDDING_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
/*
embedding (of word at position i):
word embedding + positional embedding
*/
class T2TEmbedder
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* vocabulary size */
int vSize;
/* embedding size */
int eSize;
/* maximum length of the sequence */
int maxLength;
/* word embedding matrix */
XTensor w;
    /* predefined positional embeddings. They speed up the embedding process
       because the table is computed once and then reused. */
XTensor posEmbeddingBase;
/* positional embeddings */
XTensor posEmbedding;
public:
/* constructor */
T2TEmbedder();
/* de-constructor */
~T2TEmbedder();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make positional embeddings */
void MakePosEmbedding(int eSize, int d, int length);
/* make the network */
XTensor * Make(XTensor * input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <math.h>
#include "T2TEncoder.h"
#include "T2TLayerNormal.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
AttEncoder::AttEncoder()
{
}
/* de-constructor */
AttEncoder::~AttEncoder()
{
delete[] attentions;
delete[] fnns;
delete[] layerNorms;
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void AttEncoder::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
LoadParamInt(argc, argv, "nstack", &nlayer, 6);
LoadParamInt(argc, argv, "hsize", &hSize, 512);
LoadParamInt(argc, argv, "esize", &eSize, 512);
LoadParamInt(argc, argv, "vsize", &vSize, -1);
    CheckNTErrors(nlayer >= 1, "We have one encoding layer at least!");
CheckNTErrors(vSize > 1, "set vocabulary size by \"-vsize\"");
/* embedding model */
embedder.InitModel(argc, argv, devID, mem);
attentions = new T2TAttention[nlayer];
fnns = new T2TFNN[nlayer];
layerNorms = new T2TLN[nlayer];
/* initialize the stacked layers */
for(int i = 0; i < nlayer; i++){
attentions[i].InitModel(argc, argv, myDevID, myMem);
fnns[i].InitModel(argc, argv, myDevID, myMem);
layerNorms[i].InitModel(argc, argv, myDevID, myMem);
}
}
/*
make the encoding network
>> input - the input tensor of the encoder
<< return - the output tensor of the encoder
*/
XTensor * AttEncoder::Make(XTensor * input)
{
XTensor * x = embedder.Make(input);
for(int i = 0; i < nlayer; i++){
XTensor * att;
XTensor * ln;
XTensor * fnn;
XTensor res;
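        /* one encoder layer: self-attention with a residual connection and layer
           normalization, followed by the position-wise FNN with another residual
           connection and normalization (post-norm ordering) */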
/* self attention */
att = attentions[i].Make(x, x, x);
/* residual connection */
res = Sum(*att, *x);
/* TODO: dropout */
/* layer normalization */
ln = layerNorms[i].Make(&res);
/* input of next layer */
x = ln;
/* fnn */
fnn = fnns[i].Make(x);
/* residual connection */
res = Sum(*fnn, *x);
/* TODO: dropout */
/* layer normalization */
ln = layerNorms[i].Make(&res);
/* input of next layer */
x = ln;
}
return x;
}
}
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TENCODER_H__
#define __T2TENCODER_H__
#include "T2TFNN.h"
#include "T2TAttention.h"
#include "T2TEmbedding.h"
#include "T2TLayerNormal.h"
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
/*
base class of the encoder
*/
class T2TEncoder
{
public:
virtual
XTensor * Make(XTensor * input) = 0;
};
/*
the encoder based on RNN
*/
class RNNEncoder : T2TEncoder
{
public:
XTensor * Make(XTensor * input);
};
/*
the encoder based on self-attention
*/
class AttEncoder : T2TEncoder
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* layer number */
int nlayer;
/* hidden layer size of the FNN layer */
int hSize;
/* embedding size */
int eSize;
/* vocabulary size */
int vSize;
/* embedding of word at each position */
T2TEmbedder embedder;
/* FNN model of each layer */
T2TFNN * fnns;
/* attention model of each layer */
T2TAttention * attentions;
/* layer normalization */
T2TLN * layerNorms;
/* input tensor of the encoder */
XTensor * input;
/* output tensor of the encoder */
XTensor * output;
public:
/* constructor */
AttEncoder();
/* de-constructor */
~AttEncoder();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the encoding network */
XTensor * Make(XTensor * input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TFNN.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
#include "../../tensor/function/FHeader.h"
namespace transformer
{
/* constructor */
T2TFNN::T2TFNN()
{
inSize = -1;
outSize = -1;
hSize = -1;
}
/* deconstructor */
T2TFNN::~T2TFNN()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TFNN::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "d", &inSize, 512);
LoadParamInt(argc, argv, "d", &outSize, 512);
LoadParamInt(argc, argv, "fnnh", &hSize, 512);
LoadParamFloat(argc, argv, "fnnminmax", &minmax, 0.08F);
InitTensor2D(&w1, inSize, hSize, X_FLOAT, devID, mem);
InitTensor1D(&b1, hSize, X_FLOAT, devID, mem);
InitTensor2D(&w2, hSize, outSize, X_FLOAT, devID, mem);
InitTensor1D(&b2, outSize, X_FLOAT, devID, mem);
w1.SetDataRand(-minmax, minmax);
b1.SetDataRand(-minmax, minmax);
w2.SetDataRand(-minmax, minmax);
b2.SetDataRand(-minmax, minmax);
}
/*
make the network
y = max(0, x * w1 + b1) * w2 + b2
>> input - the input tensor
>> return - the output tensor
*/
XTensor * T2TFNN::Make(XTensor * input)
{
XTensor t1;
XTensor * result = new XTensor();
/* t1 = max(0, x * w1 + b1) */
t1 = Rectify(MMul(*input, X_NOTRANS, w1, X_NOTRANS) + b1);
/* result = t1 * w2 + b2 */
*result = MMul(t1, X_NOTRANS, w2, X_NOTRANS) + b2;
return result;
}
}
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TFNN_H__
#define __T2TFNN_H__
#include "../../tensor/XTensor.h"
using namespace nts;
namespace transformer
{
/* a fnn: y = max(0, x * w1 + b1) * w2 + b2 */
class T2TFNN
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* size of input vector */
int inSize;
/* size of output vector */
int outSize;
/* size of hidden layers */
int hSize;
/* matrix of transformation 1 */
XTensor w1;
/* bias of transformation 1 */
XTensor b1;
/* matrix of transformation 2 */
XTensor w2;
/* bias of transformation 2 */
XTensor b2;
public:
/* constructor */
T2TFNN();
/* deconstructor */
~T2TFNN();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor * Make(XTensor * input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TLayerNormal.h"
namespace transformer
{
/* constructor */
T2TLN::T2TLN()
{
devID = -1;
mem = NULL;
}
/* de-constructor */
T2TLN::~T2TLN()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TLN::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
}
/*
make the network
for each layer representation x, we have
y = (x - mean(x)) / sqrt(var(x) + epsilon), optionally followed by a learned scale and bias
>> input - the input tensor
>> return - layer normalization output
*/
XTensor * T2TLN::Make(XTensor * input)
{
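    /* TODO: the normalization itself is not implemented yet; NULL is returned as a placeholder */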
return NULL;
}
}
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TLAYERNORMAL_H__
#define __T2TLAYERNORMAL_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
class T2TLN
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
public:
/* constructor */
T2TLN();
/* de-constructor */
~T2TLN();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor * Make(XTensor * input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TModel.h"
#include "T2TUtility.h"
namespace transformer
{
/* constructor */
T2TModel::T2TModel()
{
devID = -1;
mem = NULL;
isLM = false;
isMT = false;
}
/* de-constructor */
T2TModel::~T2TModel()
{
delete mem;
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
*/
void T2TModel::InitModel(int argc, const char ** argv)
{
bool useMem = false;
LoadParamInt(argc, argv, "dev", &devID, -1);
LoadParamBool(argc, argv, "mem", &useMem, useMem);
LoadParamBool(argc, argv, "lm", &isLM, true);
LoadParamBool(argc, argv, "mt", &isMT, false);
if(useMem){
delete mem;
mem = new XMem(devID);
}
encoder.InitModel(argc, argv, devID, mem);
outputLayer.InitModel(argc, argv, devID, mem);
}
/*
make the encoding network
>> input - input tensor
<< return - encoding result
*/
XTensor * T2TModel::MakeEncoding(XTensor * input)
{
return encoder.Make(input);
}
/*
make the entire network (with the output softmax layer)
>> input - input tensor
>> output - output tensor (distribution)
*/
void T2TModel::Make(XTensor * input, XTensor * output)
{
if(isLM){
XTensor * encoding = MakeEncoding(input);
outputLayer.Make(encoding, output);
}
else{
ShowNTErrors("TODO!");
}
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TMODEL_H__
#define __T2TMODEL_H__
#include "T2TFNN.h"
#include "T2TAttention.h"
#include "T2TEncoder.h"
#include "T2TDecoder.h"
#include "T2TOutput.h"
namespace transformer
{
class T2TModel
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* the encoder */
AttEncoder encoder;
/* the decoder */
AttDecoder decoder;
/* output layer */
T2TOutput outputLayer;
/* indicates whether the model is running for language modeling */
bool isLM;
/* indicates whether the model is running for machine translation */
bool isMT;
public:
/* constructor */
T2TModel();
/* de-constructor */
~T2TModel();
/* initialize the model */
void InitModel(int argc, const char ** argv);
/* make the encoding network */
XTensor * MakeEncoding(XTensor * input);
/* make the entire network (with the output softmax layer) */
void Make(XTensor * input, XTensor * output);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TOutput.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TOutput::T2TOutput()
{
devID = -1;
mem = NULL;
vSize = -1;
inSize = -1;
hSize = -1;
}
/* de-constructor */
T2TOutput::~T2TOutput()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TOutput::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
    LoadParamInt(argc, argv, "vsize", &vSize, -1);
    LoadParamInt(argc, argv, "hsize", &inSize, 512);
    LoadParamInt(argc, argv, "hsize", &hSize, 512);

    /* w is used in Make() but was never allocated; the initialization range is an assumption */
    InitTensor2D(&w, inSize, vSize, X_FLOAT, devID, mem);
    w.SetDataRand(-0.08F, 0.08F);
}
/*
make the network
y = softmax(x * w)
>> input - input tensor
<< return - output tensor
*/
XTensor * T2TOutput::Make(XTensor * input)
{
XTensor &x = *input;
XTensor * result = new XTensor();
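    /* project the input onto the vocabulary with w and normalize with
       log-softmax over the last dimension */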
*result = LogSoftmax(MMul(x, w), -1);
return result;
}
/*
make the network (redefined output tensor)
>> input - input tensor
>> output - output tensor
*/
void T2TOutput::Make(XTensor * input, XTensor * output)
{
XTensor &x = *input;
*output = LogSoftmax(MMul(x, w), -1);
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TOUTPUT_H__
#define __T2TOUTPUT_H__
#include "../../tensor/function/FHeader.h"
using namespace nts;
namespace transformer
{
/* output layer */
class T2TOutput
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* vocabulary size */
int vSize;
/* input vector size */
int inSize;
/* vector size of the linear transformation */
int hSize;
/* transformation matrix */
XTensor w;
public:
/* constructor */
T2TOutput();
/* de-constructor */
~T2TOutput();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor * Make(XTensor * input);
/* make the network (redefined output tensor) */
void Make(XTensor * input, XTensor * output);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-02
*/
#include "T2TTrainer.h"
#include "T2TUtility.h"
namespace transformer
{
/* constructor */
T2TTrainer::T2TTrainer()
{
    buf = NULL;
    seqLen = NULL;
    seqOffset = NULL;
    nseqBuf = 0;
    nextSeq = -1;
}

/* de-constructor */
T2TTrainer::~T2TTrainer()
{
    delete[] buf;
    delete[] seqLen;
    delete[] seqOffset;
}
/*
initialization
>> argc - number of arguments
>> argv - list of pointers to the arguments
*/
void T2TTrainer::Init(int argc, const char ** argv)
{
LoadParamFloat(argc, argv, "lrate", &lrate, 0.001F);
LoadParamInt(argc, argv, "sbatch", &sBatchSize, 1);
LoadParamInt(argc, argv, "wbatch", &wBatchSize, 1);
LoadParamInt(argc, argv, "nepoch", &nepoch, 1);
LoadParamInt(argc, argv, "nstep", &nstep, 1);
int maxUnitInBuf;
LoadParamInt(argc, argv, "bufsize", &maxUnitInBuf, 20000);
buf = new int[maxUnitInBuf];
seqLen = new int[maxUnitInBuf];
seqOffset = new int[maxUnitInBuf];
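    /* buf, seqLen and seqOffset share the same capacity: the buffer holds at most
       maxUnitInBuf word ids, and never more sequences than that */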
}
/*
train the model
>> fn - training data file
>> model - model to train
*/
void T2TTrainer::Train(const char * fn, T2TModel * model)
{
}
char line[MAX_SEQUENCE_LENGTH];
/*
load data to buffer
>> file - where to load data
*/
int T2TTrainer::LoadBuf(FILE * file)
{
int lineCount = 0;
int seqCount = 0;
int wordCount = 0;
while(fgets(line, MAX_SEQUENCE_LENGTH - 1, file)){
int len = (int)strlen(line);
if(line[len - 1] == '\r')
line[len - 1] = 0;
len = (int)strlen(line);
if(len == 0)
continue;
/* how many characters are in a word */
int wSize = 0;
/* how many words are in the sentence */
int wNum = 0;
int wNumLocal = 0;
for(int i = 0; i < len; i++){
            /* load word (id) separated by space or tab */
if((line[i] == ' ' || line[i] == '\t' || i == len - 1) && wSize > 0){
line[i] = 0;
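                /* a token made of three '|' characters ("|||") marks the end of a
                   sequence inside the line; anything else is parsed as a word id */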
if(wSize == 3 && line[i - 1] == '|' && line[i - 2] == '|' && line[i - 3] == '|'){
seqLen[seqCount] = wNumLocal;
seqOffset[seqCount] = wordCount + wNum - wNumLocal;
seqCount++;
wNumLocal = 0;
}
else{
buf[wNum++] = atoi(line + i - wSize);
wNumLocal++;
}
wSize = 0;
}
else
wSize++;
}
seqLen[seqCount] = wNumLocal;
seqOffset[seqCount] = wordCount + wNum - wNumLocal;
seqCount++;
wordCount += wNum;
lineCount++;
if(wordCount >= wBatchSize)
break;
if(lineCount >= sBatchSize)
break;
}
nseqBuf = seqCount;
nextSeq = 0;
return lineCount;
}
/*
load a batch of sequences
>> file - the handle to the data file
>> batch - the batch
>> step - the step we go over when move to the next sequence
>> vs - vocabulary size
>> sBatch - batch size of sequences
>> wBatch - batch size of words
>> isSorted - indicates whether the sequences are sorted by length
*/
int T2TTrainer::LoadBatch(FILE * file, XTensor * batch, int step, int vs, int sBatch, int wBatch, bool isSorted)
{
if(nextSeq >= nseqBuf)
LoadBuf(file);
int seq = nextSeq;
int wc = 0;
int sc = 0;
int max = 0;
    while(seq + sc < nseqBuf){
        int len = seqLen[seq + sc];
        wc += len;
        sc += 1;

        /* the longest sequence determines the (padded) length dimension of the batch */
        if(max < len)
            max = len;

        if(sc >= sBatch && wc >= wBatch)
            break;
    }
if(sc > 0){
int dims[MAX_TENSOR_DIM_NUM];
dims[0] = sc;
dims[1] = max;
dims[2] = vs;
if(batch->order != 3 || batch->GetDim(0) != dims[0] ||
batch->GetDim(1) != dims[1] || batch->GetDim(2) != dims[2]){
InitTensor(batch, 3, dims, X_FLOAT, 1.0F, devID, mem);
}
batch->SetZeroAll();
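        /* the batch is a one-hot representation: entry (sequence, position, word id) is set to 1 */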
for(int s = seq; s < seq + sc; s++){
for(int w = 0; w < seqLen[s]; w++){
batch->Set3D(1.0F, s - seq, w, buf[seqOffset[s] + w]);
}
}
}
    /* where the next batch starts in the buffer */
    nextSeq = seq + sc;

    return sc;
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-02
*/
#ifndef __T2TTRAINER_H__
#define __T2TTRAINER_H__
#include "T2TModel.h"
#include "../../tensor/function/FHeader.h"
#define MAX_SEQUENCE_LENGTH 1024 * 64
using namespace nts;
namespace transformer
{
/* trainer of the T2T model */
class T2TTrainer
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* buffer for loading words */
int * buf;
/* length of each sequence */
int * seqLen;
/* offset of the first word for each sequence */
int * seqOffset;
/* number of sequences in the buffer */
int nseqBuf;
/* offset for next sequence in the buffer */
int nextSeq;
/* vocabulary size of the source side */
int vSize;
/* learning rate */
float lrate;
/* sentence batch size */
int sBatchSize;
/* word batch size */
int wBatchSize;
/* training epoch number */
int nepoch;
    /* training step number */
int nstep;
public:
/* constructor */
T2TTrainer();
/* de-constructor */
~T2TTrainer();
/* initialize the trainer */
void Init(int argc, const char ** argv);
/* train the model */
void Train(const char * fn, T2TModel * model);
/* load data to buffer */
int LoadBuf(FILE * file);
/* load a batch of sequences */
int LoadBatch(FILE * file, XTensor * batch, int step, int vs, int sBatch, int wBatch, bool isSorted);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
namespace transformer
{
void LoadParamString(int argc, const char ** argv, const char * name, char * p, char * defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
            strcpy(p, argv[i + 1]);
fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
strcpy(p, defaultP);
}
void LoadParamInt(int argc, const char ** argv, const char * name, int * p, int defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
*(int*)p = atoi(argv[i + 1]);
fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
*p = defaultP;
}
void LoadParamBool(int argc, const char ** argv, const char * name, bool * p, bool defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
        if(!strcmp(argv[i], vname)){
            *(bool*)p = true;
            fprintf(stderr, " %s=%s\n", name, "true");
            hit = true;
        }
}
if(!hit)
*p = defaultP;
}
void LoadParamFloat(int argc, const char ** argv, const char * name, float * p, float defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
        if(!strcmp(argv[i], vname) && i + 1 < argc){
            *p = (float)atof(argv[i + 1]);
            fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
            hit = true;
        }
}
if(!hit)
*p = defaultP;
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TUTILITY_H__
#define __T2TUTILITY_H__
#include <stdio.h>
namespace transformer
{
/* load model parameters */
void LoadParamString(int argc, const char ** argv, const char * name, char * p, char * defaultP);
void LoadParamInt(int argc, const char ** argv, const char * name, int * p, int defaultP);
void LoadParamBool(int argc, const char ** argv, const char * name, bool * p, bool defaultP);
void LoadParamFloat(int argc, const char ** argv, const char * name, float * p, float defaultP);
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "Transformer.h"
namespace transformer
{
int TransformerMain(int argc, const char ** argv)
{
return 0;
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
*
 * An implementation of the transformer system. See more details
 * about it in
* "Attention Is All You Need" by Vaswani et al.
* https://arxiv.org/pdf/1706.03762.pdf
*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
* I start writing the code related to NMT - a long time since my last coding
* work on MT
*/
#ifndef __TRANSFORMER_H__
#define __TRANSFORMER_H__
#include "../../tensor/XGlobal.h"
#include "../../tensor/XTensor.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* entry point of the program */
int TransformerMain(int argc, const char ** argv);
}
#endif
\ No newline at end of file
@@ -29,6 +29,7 @@
#include "XTensor.h"
#include "XDevice.h"
#include "./test/Test.h"
#include "./core/CHeader.h"

//#define CRTDBG_MAP_ALLOC
//#include <stdlib.h>
@@ -36,7 +37,9 @@
using namespace nts;

void SetDataTest();
void SmallTest();
void TransposeTest();

int main( int argc, const char ** argv )
{
@@ -92,3 +95,35 @@ void SmallTest()
    c.Dump(stderr, "c:");
    d.Dump(stderr, "d:");
}
void TransposeTest()
{
XTensor a;
XTensor b;
int I = 2;
int J = 3;
InitTensor4D(&a, 2, 3, 4, 5);
int * dims = new int[a.order];
memcpy(dims, a.dimSize, sizeof(int) * a.order);
dims[I] = a.dimSize[J];
dims[J] = a.dimSize[I];
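    /* b gets the shape of a with dimensions I and J exchanged */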
InitTensor(&b, 4, dims);
a.SetZeroAll();
b.SetZeroAll();
float * data = new float[a.unitNum];
for(int i = 0; i < a.unitNum; i++)
data[i] = (float)i;
a.SetData(data, a.unitNum, 0);
_Transpose(&a, &b, I, J);
b.Dump(stderr, "b:");
    delete[] data;
    delete[] dims;
}
@@ -40,6 +40,7 @@ XDevManager GDevs;
/* constructor */
XDevice::XDevice()
{
    stream = NULL;
    Clear();

#ifdef USE_CUDA
@@ -55,6 +56,8 @@ XDevice::~XDevice()
    MUTEX_DELE(cublasMutex);
    if(isHandleReady)
        cublasDestroy(cublasHandle);
    if(stream != NULL)
        delete stream;
#endif
}
@@ -118,6 +121,8 @@ void XDevice::Init(int myDevID)
    }
    else
        sprintf(name2, "GPU-%d %s", devID, name);

    stream = new XStream(0, devID);
#endif
}
@@ -161,6 +166,14 @@ cublasHandle_t * XDevice::GetCublasHandle()
    return &cublasHandle;
}

/* get the stream of cuda */
cudaStream_t * XDevice::GetCudaStream()
{
    CheckNTErrors(stream != NULL, "the stream is not initialized!");

    return &stream->stream;
}

#endif // USE_CUDA

/* switch to a device */
@@ -311,11 +324,19 @@ void XDevManager::Clear()
/* get the handle of GPU */
cublasHandle_t * XDevManager::GetCudaHandle(const int devID)
{
    CheckNTErrors(devID < nGPU, "index of GPU is out of range.");

    return GPUs[devID].GetCublasHandle();
}

/* get the stream of cuda */
cudaStream_t * XDevManager::GetCudaStream(const int devID)
{
    CheckNTErrors(devID < nGPU, "index of GPU is out of range.");

    return GPUs[devID].GetCudaStream();
}

#endif

/*
@@ -384,13 +405,10 @@ int XDevManager::GetCudaThread2D(const int devID, const int n, const int m, int
    memset(gridSize, 0, sizeof(int) * 3);
    memset(blockSize, 0, sizeof(int) * 3);

    if(n <= 0 || m <= 0)
        return 1;

    CheckNTErrors(devID >= 0 && devID < nGPU, "Invalid GPU device id!");

#ifdef USE_CUDA
......
@@ -25,6 +25,7 @@
#define __XDEVICE_H__

#include "XThread.h"
#include "XStream.h"

#ifdef USE_CUDA
@@ -92,6 +93,9 @@ public:
    /* specify whether Unified Virtual Address Space (UVA) is supported */
    bool isUVASupported;

    /* default stream for the device */
    XStream * stream;

#ifdef USE_CUDA
    /* mutex for handle (GPU cublas) */
@@ -121,6 +125,9 @@ public:
#ifdef USE_CUDA
    /* get cublas handle */
    cublasHandle_t * GetCublasHandle();

    /* get the stream of cuda */
    cudaStream_t * GetCudaStream();
#endif

    /* switch to a device */
@@ -178,6 +185,9 @@ public:
#ifdef USE_CUDA
    /* get the handle of GPU */
    cublasHandle_t * GetCudaHandle(const int devID);

    /* get the stream of cuda */
    cudaStream_t * GetCudaStream(const int devID);
#endif

    /* get grid and block sizes that max potential */
......
@@ -167,7 +167,9 @@ void XLink::SetType(int id)
    type[0] = 0;
    strcpy(type, GetOPName(id));
    typeID = id;
    if(id != 0){
        CheckNTErrors(strcmp(type, "NULL"), "illegal edge type name!");
    }
}

/*
@@ -515,7 +517,7 @@ void XLink::CopyIncoming(const XTensor * reference, XTensor * target)
        tails.Add(tail);
    }

    MakeLink(&tails, target, reference->income.typeID);

    int paraNum = reference->income.paramNum;
    target->income.paramNum = paraNum;
......
@@ -208,22 +208,16 @@ void XList::Insert(int pos, void * item)
/* get the item at position i */
void * XList::GetItem(int i) const
{
    CheckNTErrors(i >= 0 && i < count, "Index of a list item is out of scope!");
    return items[i];
}

/* get the integer-typed item at position i */
int XList::GetItemInt(int i)
{
    CheckNTErrors(isIntList, "An int list is required!");
    CheckNTErrors(i >= 0 && i < count, "Index of a list item is out of scope!");
    return *(int*)(items[i]);
}

/* set the item at position i */
......
@@ -181,7 +181,10 @@ void XMem::Free(int myDevID, void * mem)
    else{
#ifdef USE_CUDA
        SetDevice(myDevID);
        cudaError_t error = cudaFree((char*)mem);
        if(error != cudaSuccess){
            ShowNTErrors("Cannot free the memory.");
        }
#else
        ShowNTErrors("Please specify USE_CUDA for compiling this program.");
#endif
......
@@ -29,20 +29,34 @@ const char * GetOPName(int type)
    if ((type & MATH_BASE) != 0){
        if (type == MATH_ABSOLUTE)
            return "M_ABSOLUTE";
        else if (type == MATH_EXP)
            return "M_EXP";
        else if (type == MATH_LOG)
            return "M_LOG";
        else if (type == MATH_SIN)
            return "M_SIN";
        else if (type == MATH_COS)
            return "M_COS";
        else if (type == MATH_TAN)
            return "M_TAN";
        else if (type == MATH_MATRIXMUL)
            return "M_MATRIXMUL";
        else if (type == MATH_MATRIXMULBATCHED)
            return "M_MATRIXMULBATCHED";
        else if (type == MATH_MULTIPLY)
            return "M_MULTIPLY";
        else if (type == MATH_DIV)
            return "M_DIV";
        else if (type == MATH_NEGATE)
            return "M_NEGATE";
        else if (type == MATH_SIGN)
            return "M_SIGN";
        else if (type == MATH_SUM)
            return "M_SUM";
        else if (type == MATH_SUB)
            return "M_SUB";
        else if (type == MATH_SUMDIM)
            return "M_SUMDIM";
        else if (type == MATH_NORMALIZE)
            return "M_NORMALIZE";
        else if (type == MATH_POWER)
......
@@ -31,15 +31,23 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* math operations */
#define MATH_BASE 0x00001000

#define MATH_ABSOLUTE MATH_BASE + 1
#define MATH_EXP MATH_ABSOLUTE + 1
#define MATH_LOG MATH_EXP + 1
#define MATH_SIN MATH_LOG + 1
#define MATH_COS MATH_SIN + 1
#define MATH_TAN MATH_COS + 1
#define MATH_NEGATE MATH_TAN + 1
#define MATH_MATRIXMUL MATH_NEGATE + 1
#define MATH_MATRIXMULBATCHED MATH_MATRIXMUL + 1
#define MATH_MULTIPLY MATH_MATRIXMULBATCHED + 1
#define MATH_DIV MATH_MULTIPLY + 1
#define MATH_SIGN MATH_DIV + 1
#define MATH_SUM MATH_SIGN + 1
#define MATH_SUB MATH_SUM + 1
#define MATH_SUMDIM MATH_SUB + 1
#define MATH_NORMALIZE MATH_SUMDIM + 1
#define MATH_POWER MATH_NORMALIZE + 1
#define MATH_SCALEANDSHIFT MATH_POWER + 1
......
...@@ -84,7 +84,7 @@ void XStream::Create(int priority, int myDevID)
    XDevice::SetGPUDevice(myDevID);
    //cudaStreamCreateWithPriority(&stream, cudaStreamDefault, priority);
    CheckNTErrors((cudaStreamCreate(&stream) == cudaSuccess),
                  "cannot create the cuda stream!");
    XDevice::SetGPUDevice(backupDevID);
#endif
    devID = myDevID;
......
...@@ -426,8 +426,12 @@ get the size of a given dimension
int XTensor::GetDim(const int dim)
{
    CheckNTErrors(dim < order, "dimension is out of range!");

    int d = dim;
    if(dim < 0)
        d = order - 1;

    return dimSize[d];
}
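A small usage sketch (my own, not from the commit) of the new negative-index behaviour; note that any negative argument currently selects the last dimension.

```cpp
int dims[3] = {2, 3, 4};
XTensor * t = NewTensor(3, dims, X_FLOAT);
int last  = t->GetDim(-1);   // 4: a negative dim maps to the last dimension
int first = t->GetDim(0);    // 2
DelTensor(t);
```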
/*

...@@ -1439,6 +1443,21 @@ void XTensor::Dump(FILE * file, const char * label, const int n, const int verbo
}

/*
dump data to a file
>> tensor - tensor whose data is dumped
>> file - where to dump the data
>> label - label of the tensor
>> n - number of items to dump
>> verbose - verbose level
*/
void XTensor::Dump(const XTensor * tensor, FILE * file, const char * label, const int n, const int verbose)
{
XTensor a(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, tensor->devID, tensor->mem);
_CopyValues(tensor, &a);
a.Dump(file, label, n, verbose);
}
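A usage sketch for the new static overload (my own example): because it copies the tensor into a temporary first, it also works for tensors whose data lives on a GPU.

```cpp
int dims[2] = {2, 2};
XTensor * onGPU = NewTensor(2, dims, X_FLOAT, 1.0F, 0);   // devID 0: first GPU
// ... fill onGPU ...
XTensor::Dump(onGPU, stderr, "onGPU");                    // copy to a temporary, then dump
DelTensor(onGPU);
```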
/*
read data from a file
>> file - where to load the data
>> label - label of the tensor
...@@ -1687,13 +1706,13 @@ void InitTensor(XTensor * tensor,
    dims[0] = -abs(dims[0]);

    if (myDevID == CURRENT_GPU)
        tensor->devID = XDevice::GetGPUDevice();
    else
        tensor->devID = myDevID;

    tensor->Resize(myOrder, dims, myDataType, myDenseRatio);

    if(allocated)
        XTensor::AllocateData(tensor);
}
...@@ -1870,28 +1889,47 @@ generate a XTensor which allocates data on the buffer
>> myDimSize - the size of each dimension
>> myMem - memory pool used to allocating the data array.
           we actually allocate the data on the buffer associated with
           the memory pool
>> devID - device id
>> myDataType - unit size (e.g., int, float, and double)
>> myDenseRatio - how often an element has non-zero value
*/
XTensor * NewTensorBuf(const int myOrder, const int * myDimSize,
                       const TENSOR_DATA_TYPE myDataType, const float myDenseRatio,
                       const int devID, XMem * myMem)
{
    int dims[MAX_TENSOR_DIM_NUM];
    memcpy(dims, myDimSize, sizeof(int) * myOrder);

    dims[0] = -abs(dims[0]);

    XTensor * tensor = NewTensor(myOrder, dims, myDataType, myDenseRatio, devID, myMem);

    if(myMem != NULL)
        tensor->data = myMem->AllocBuf(myMem->devID, tensor->unitNum * tensor->unitSize);
    else
        tensor->data = XMemAlloc(devID, tensor->unitNum * tensor->unitSize);

    return tensor;
}
/*
generate a XTensor which allocates data on the buffer
>> reference - reference tensor
>> devID - device id
>> myMem - memory pool used to allocating the data array.
we actually allocate the data on the buffer associated with
the memory pool
*/
XTensor * NewTensorBuf(const XTensor * reference, int devID, XMem * myMem)
{
return NewTensorBuf(reference->order, reference->dimSize,
reference->dataType, reference->denseRatio,
devID, myMem);
}
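A sketch of how the reference-based overload pairs with DelTensorBuf for scratch storage (my own example; `input` stands for any existing XTensor):

```cpp
// Allocate a scratch tensor shaped like 'input': from the memory pool's buffer
// when input.mem is set, otherwise from plain memory on the given device.
XTensor * tmp = NewTensorBuf(&input, input.devID, input.mem);
_CopyValues(&input, tmp);
// ... use tmp ...
DelTensorBuf(tmp);   // releases the buffer (or frees the raw block) and deletes the object
```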
/*
generate a dense vector generate a dense vector
>> num - number of entries >> num - number of entries
>> myDataType - unit size (e.g., int, float, and double) >> myDataType - unit size (e.g., int, float, and double)
...@@ -2041,7 +2079,7 @@ XTensor * NewTensor(XTensor * a, bool isFilledData)
free the data space of a given tensor
>> tensor - pointer to the tensor
*/
void DelTensor(XTensor * tensor)
{
    delete tensor;
}

...@@ -2050,10 +2088,13 @@ void DelTensor(const XTensor * tensor)
free the data space of a given tensor (on the buffer)
>> tensor - pointer to the tensor
*/
void DelTensorBuf(XTensor * tensor)
{
    if(tensor->mem != NULL)
        tensor->mem->ReleaseBuf(tensor->devID, tensor->unitNum * tensor->unitSize);
    else
        XMemFree(tensor->devID, tensor->data);
    tensor->data = NULL;
    delete tensor;
}
......
...@@ -45,12 +45,13 @@ namespace nts{
struct XLink;

/* define the maximum number of dimensions in a tensor */
#define MAX_TENSOR_DIM_NUM 8
#define USE_BATCHED_STRIDED_MAT_MUL
#define MIN_TENSOR_SPLIT_NUM 0
#define MIN_TENSOR_SPLIT_LIST_NUM 1024
#define MIN_TENSOR_CAT_NUM 8

/* computation flags */
#define UNSAFE_BUT_FAST_MEM
#define FAST_MATRIX
...@@ -328,6 +329,10 @@ public:
    /* dump data to a file */
    void Dump(FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0);

    /* dump data to a file */
    static
    void Dump(const XTensor * tensor, FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0);

    /* read data from a file */
    void Read(FILE * file, const char * label = NULL);
...@@ -386,8 +391,12 @@ XTensor * NewTensor(const int myOrder, const int * myDimSize, const TENSOR_DATA_
                    const float myDenseRatio = 1.0F, const int myDevID = -1, XMem * myMem = NULL);

/* generate a XTensor which allocates data on the buffer */
XTensor * NewTensorBuf(const int myOrder, const int * myDimSize,
                       const TENSOR_DATA_TYPE myDataType = X_FLOAT, const float myDenseRatio = 1.0F,
                       const int myDevID = -1, XMem * myMem = NULL);

/* generate a XTensor which allocates data on the buffer */
XTensor * NewTensorBuf(const XTensor * reference, int devID, XMem * myMem);

/* generate a dense vector */
XTensor * NewTensor1D(const int num, const TENSOR_DATA_TYPE myDataType = X_FLOAT, const int myDevID = -1,

...@@ -417,10 +426,10 @@ XTensor * NewTensor5D(const int d0, const int d1, const int d2, const int d3, co
XTensor * NewTensor(XTensor * a, bool isFilledData = true);

/* free the data space of a given tensor */
void DelTensor(XTensor * tensor);

/* free the data space of a given tensor (on the buffer) */
void DelTensorBuf(XTensor * tensor);

} /* end of the nts (NiuTrans.Tensor) namespace */
......
...@@ -175,29 +175,38 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size)
        return;
    }
#ifdef USE_CUDA
    else{
        int devID = devIDT < 0 ? devIDS : devIDT;
        int devIDBackup = 0;
        cudaGetDevice(&devIDBackup);
        cudaSetDevice(devID);

        if(devIDT >= 0 && devIDS < 0){
            cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyHostToDevice);
            if(error != cudaSuccess){
                ShowNTErrors("cudaMemcpy error (cudaMemcpyHostToDevice)");
            }
        }
        else if(devIDT < 0 && devIDS >= 0){
            cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToHost);
            if(error != cudaSuccess){
                ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
            }
        }
        else{
            //if(devIDT == devIDS){
            cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToDevice);
            if(error != cudaSuccess){
                ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
            }
            /*}
            else{
                CheckNTErrors((cudaMemcpyPeer(t, devIDT, s, devIDS, size) == cudaSuccess),
                              "cudaMemcpy error (cudaMemcpyDeviceToDevice)");
            }*/
        }
        cudaSetDevice(devIDBackup);
    }
#else
    ShowNTErrors("Please specify USE_CUDA and recompile the code!");
...@@ -208,6 +217,9 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size)
#ifdef USE_CUDA
void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size, cudaStream_t stream, int streamDevID)
{
    if(t == s)
        return;

    int devIDBackup = -1;
    if(streamDevID >= 0 && (devIDT >= 0 || devIDS >= 0)){
        CheckNTErrors((cudaGetDevice(&devIDBackup) == cudaSuccess), "Cannot get GPU device id!");

...@@ -220,17 +232,23 @@ void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size
        return;
    }
    else if(devIDT >= 0 && devIDS < 0){
        cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyHostToDevice, stream);
        if(error != cudaSuccess){
            ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyHostToDevice)");
        }
    }
    else if(devIDT < 0 && devIDS >= 0){
        cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToHost, stream);
        if(error != cudaSuccess){
            ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyDeviceToHost)");
        }
    }
    else{
        //if(devIDT == devIDS){
        cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToDevice, stream);
        if(error != cudaSuccess){
            ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyDeviceToDevice)");
        }
        //}
        /*else{
            CheckNTErrors((cudaMemcpyPeerAsync(t, devIDT, s, devIDS, size, stream) == cudaSuccess),
...@@ -261,18 +279,69 @@ void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPit
        return;
    }
#ifdef USE_CUDA
    else{
        int devID = devIDT < 0 ? devIDS : devIDT;
        int devIDBackup = 0;
        cudaGetDevice(&devIDBackup);
        cudaSetDevice(devID);

        if (devIDT >= 0 && devIDS < 0) {
            cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice);
            if(error != cudaSuccess){
                ShowNTErrors("cudaMemcpy2D error (cudaMemcpyHostToDevice)");
            }
        }
        else if (devIDT < 0 && devIDS >= 0) {
            cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost);
            if(error != cudaSuccess){
                ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
            }
        }
        else {
            cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice);
            if (error != cudaSuccess) {
                ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
            }
        }
        cudaSetDevice(devIDBackup);
    }
#else
    ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
}

void XMemCopy2DAsync(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n, XStream * stream)
{
    if (t == s)
        return;

    if (devIDT < 0 && devIDS < 0) {
        for(int i = 0; i < n; i++)
            memcpy((char*)t + tPitch * i, (char*)s + sPitch * i, mSize);
        return;
    }
#ifdef USE_CUDA
    else{
        CheckNTErrors(stream != NULL, "No stream found!");
        cudaStream_t &cstream = stream->stream;
        if (devIDT >= 0 && devIDS < 0) {
            cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice, cstream);
            if(error != cudaSuccess){
                ShowNTErrors("cudaMemcpy2D error (cudaMemcpyHostToDevice)");
            }
        }
        else if (devIDT < 0 && devIDS >= 0) {
            cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost, cstream);
            if(error != cudaSuccess){
                ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
            }
        }
        else {
            cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice, cstream);
            if (error != cudaSuccess) {
                ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
            }
        }
    }
#else
......
...@@ -23,6 +23,7 @@
#include <stdio.h>
#include "XGlobal.h"
#include "XDevice.h"

#ifndef __XUTILITY_H__
#define __XUTILITY_H__

...@@ -41,6 +42,7 @@ extern void XMemSet(void * p, int value, size_t size);
extern void XMemSet(int devID, void * p, int value, size_t size);
extern void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size);
extern void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n);
extern void XMemCopy2DAsync(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n, XStream * stream);
extern void * XMemAlloc(int devID, size_t size);
extern void * XMemAllocOnDev(int devID, size_t size);
extern void XMemFree(int devID, void * p);
......
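For readers new to these helpers, a short sketch (my own) of the device-id convention they share: a negative id means host memory, a non-negative id names a GPU.

```cpp
float hostBuf[16] = {0};
void * devBuf = XMemAlloc(0, sizeof(hostBuf));        // allocate on GPU 0
XMemCopy(devBuf, 0, hostBuf, -1, sizeof(hostBuf));    // host -> device
XMemCopy(hostBuf, -1, devBuf, 0, sizeof(hostBuf));    // device -> host
XMemFree(0, devBuf);
```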
...@@ -26,49 +26,62 @@
#include "../XTensor.h"

#include "arithmetic/Div.h"
#include "arithmetic/MatrixMul.h"
#include "arithmetic/MatrixMul2D.h"
#include "arithmetic/MatrixMul2DMultiTheading.h"
#include "arithmetic/MatrixMul2DParallel.h"
#include "arithmetic/MatrixMulBatched.h"
#include "arithmetic/Multiply.h"
#include "arithmetic/Negate.h"
#include "arithmetic/Sign.h"
#include "arithmetic/Sub.h"
#include "arithmetic/Sum.h"
#include "arithmetic/SumByColumnTV.h"
#include "arithmetic/SumByColumnVT.h"
#include "arithmetic/SumDim.h"
#include "arithmetic/XTensorBLAS.h"
#include "getandset/ConvertDataType.h"
#include "getandset/Select.h"
#include "getandset/SetData.h"
#include "math/Normalize.h"
#include "math/Power.h"
#include "math/ScaleAndShift.h"
#include "math/Unary.h"
#include "movement/CopyBlocks.h"
#include "movement/CopyBlocksInGrid.h"
#include "movement/CopyBlocksOnSite.h"
#include "movement/CopyData2D.h"
#include "movement/CopyIndexed.h"
#include "movement/CopyInGrid.h"
#include "movement/CopyValues.h"
#include "reduce/ReduceMax.h"
#include "reduce/ReduceMean.h"
#include "reduce/ReduceStandardVariance.h"
#include "reduce/ReduceSum.h"
#include "reduce/ReduceSumSquared.h"
#include "reduce/ReduceVariance.h"
#include "shape/Concatenate.h"
#include "shape/ConcatenateSolely.h"
#include "shape/MakeMergeBlockIndex.h"
#include "shape/MakeSplitBlockIndex.h"
#include "shape/Merge.h"
#include "shape/MergeBlockLists.h"
#include "shape/Permute.h"
#include "shape/Split.h"
#include "shape/Transpose.h"
#include "shape/Unsqueeze.h"
#include "sort/Sort.h"
#include "sort/TopK.h"
#include "utilities/XMatrixSegment.h"
#include "utilities/FlushToMem.h"

#endif // __CHEADER_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
*/
#include <math.h>
#include "../../XTensor.h"
#include "../../XName.h"
#include "Absolute.h"
#include "Absolute.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
set every entry to its absolute value
>> a - input tensor we are processing
>> b - output tensor we are processing
*/
void _Absolute(const XTensor * a, XTensor * b)
{
#ifdef USE_CUDA
/* run it on GPUs */
if (a->devID >= 0) {
_CudaAbsolute(a, b);
return;
}
#endif
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * d = (DTYPE*)a->data;
DTYPE * db = (DTYPE*)b->data;
for (int i = 0; i < a->unitNum; i++)
db[i] = (DTYPE)fabs(d[i]);
}
/*
set every entry to its absolute value (do it on site)
keep the result in the input tensor a and return nothing
>> a - the tensor we are processing
*/
void _AbsoluteMe(XTensor * a)
{
_Absolute(a, a);
}
/*
set every entry to its absolute value (return a XTensor structure)
make a new tensor to keep the result and return it
>> a - input tensor we are processing
<< return - the absolute value of input tensor
*/
XTensor Absolute(const XTensor & a)
{
XTensor b(&a);
b.SetTMP();
/* call _Absolute function */
_Absolute(&a, &b);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_ABSOLUTE);
return b;
}
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
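The commit message mentions a macro for implementing unary functions; Absolute above shows the `_Fn` / `_FnMe` / `Fn` pattern such a macro would generate. The sketch below is my own illustration of the idea — the macro name and exact body are assumptions, not necessarily what math/Unary.h defines.

```cpp
// Hypothetical macro: generates the CPU "_Fn" and in-place "_FnMe" pair
// from a scalar function such as exp, log, sin, cos, or tan.
#define _SIMPLE_UNARY_FUNCTION(funcName, origFunc)                     \
void _##funcName(const XTensor * a, XTensor * b)                       \
{                                                                      \
    CheckNTErrors(XTensor::IsSameShaped(a, b), "Unmatched tensors!");  \
    DTYPE * d  = (DTYPE*)a->data;                                      \
    DTYPE * db = (DTYPE*)b->data;                                      \
    for (int i = 0; i < a->unitNum; i++)                               \
        db[i] = (DTYPE)origFunc(d[i]);                                 \
}                                                                      \
void _##funcName##Me(XTensor * a)                                      \
{                                                                      \
    _##funcName(a, a);                                                 \
}

_SIMPLE_UNARY_FUNCTION(Exp, exp)
_SIMPLE_UNARY_FUNCTION(Log, log)
_SIMPLE_UNARY_FUNCTION(Sin, sin)
```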
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "Absolute.h"
#include "Absolute.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set each entry to its absolute value (CUDA Kernel)
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
b[i] = fabs(a[i]);
}
/*
set each entry to its absolute value (CUDA Kernel)
This is for float16 computation
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelAbsolute(__half * a, __half * b, int size)
{
return;
}
/*
set each entry to its absolute value
>> a - input tensor
>> b - output tensor
*/
void _CudaAbsolute(const XTensor * a, XTensor * b)
{
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->isSparse == false), "TODO!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE) {
KernelAbsolute << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
}
else if (a->dataType == X_FLOAT16) {
KernelAbsolute << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "Div.h"
#include "Div.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the index of the item
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting
*/
void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
{
int leadingDimRDI = a->order - leadingDim - 1;
CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
"Unmatched tensors in multiplication!");
CheckNTErrors((a->order == b->order && a->order == c->order),
"Unmatched tensors!");
#ifdef USE_CUDA
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
_CudaDiv(a, b, c, alpha, leadingDim);
return;
}
#endif
int stride = 1;
int blockSizeA = 1;
int blockSizeB = 1;
int blockSizeC = 1;
int blockNum = 1;
int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
for (int i = 0; i < a->order; i++) {
if (i != leadingDimRDI) {
CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] && a->dimSizeRDI[i] == c->dimSizeRDI[i]),
"Unmatched tensors!");
}
if (i < leadingDimRDI)
stride *= a->dimSizeRDI[i];
}
blockSizeA = stride * dimensionSizeA;
blockSizeB = stride * dimensionSizeB;
blockSizeC = stride * dimensionSizeC;
blockNum = a->unitNum / blockSizeA;
if (!a->isSparse && !b->isSparse) {
if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
int size = a->unitNum;
DTYPE * ap = (DTYPE*)a->data;
DTYPE * bp = (DTYPE*)b->data;
DTYPE * cp = (DTYPE*)c->data;
if (alpha == 0) {
for (int i = 0; i < size; i++)
cp[i] = ap[i] / bp[i];
}
else {
for (int i = 0; i < size; i++)
cp[i] = ap[i] / bp[i] + alpha * cp[i];
}
}
else {
for (int k = 0; k < blockNum; k++) {
for (int ci = 0, ai = 0, bi = 0; ci < dimensionSizeC; ci++, ai++, bi++) {
if (ai >= dimensionSizeA)
ai = 0;
if (bi >= dimensionSizeB)
bi = 0;
DTYPE * ap = (DTYPE*)a->data + k * blockSizeA + ai * stride;
DTYPE * bp = (DTYPE*)b->data + k * blockSizeB + bi * stride;
DTYPE * cp = (DTYPE*)c->data + k * blockSizeC + ci * stride;
for (int j = 0; j < stride; j++)
cp[j] = ap[j] / bp[j] + cp[j] * alpha;
}
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
/*
element-wise division of two tensors (do it on site)
keep the result in the input tensor a and return nothing
a(i) = a(i)/b(i) + \alpha * a(i)
where i is the index of the item
>> a - tensor a (where keep the result)
>> b - tensor b
>> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting
*/
void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha, int leadingDim)
{
_Div(a, b, a, alpha, leadingDim);
}
/*
element-wise division of two tensors (return a XTensor structure)
make a new tensor c to keep the result and return it
c(i) = a(i)/b(i)
where i is the index of the item
>> a - tensor a
>> b - tensor b
>> leadingDim - the dimension along which we perform broadcasting
<< return - the product of the tensors
*/
XTensor Div(const XTensor &a, const XTensor &b, int leadingDim)
{
CheckNTErrors(a.dimSize[leadingDim] == b.dimSize[leadingDim], "TODO!");
XTensor c(&a);
c.SetTMP();
/* call _Div function */
_Div(&a, &b, &c, 0, leadingDim);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_DIV);
XLink::AddParamToHeadInt(&c, leadingDim);
return c;
}
} // namespace nts(NiuTrans.Tensor)
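A short usage sketch for the new division op (my own example, relying on the NewTensor/DelTensor declarations shown earlier):

```cpp
int dims[2] = {2, 3};
XTensor * a = NewTensor(2, dims, X_FLOAT);
XTensor * b = NewTensor(2, dims, X_FLOAT);
// ... fill a and b with non-zero values ...

XTensor c = Div(*a, *b);   // c(i) = a(i) / b(i); the link is recorded as MATH_DIV
_DivMe(a, b);              // in-place variant: a(i) = a(i) / b(i)

DelTensor(a);
DelTensor(b);
```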
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "Div.h"
#include "Div.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
division of data arrays in an element-wise manner c(i) = a(i)/b(i)
>> a - data array a
>> b - data array b
>> c - result data array
>> size - size of c
*/
__global__
void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] / b[i];
}
/*
division of data arrays in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i)
>> a - data array a
>> b - data array b
>> c - result data array
>> size - size of c
>> alpha - the coefficient
*/
__global__
void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] / b[i] + alpha * c[i];
}
/*
division of two tensors in an element-wise manner c(i) = a(i)/b(i).
Note that a and b can be of different sizes here, i.e.,
|a_lead| <= |c_lead| and |b_lead| <= |c_lead|
where |a_lead| means the size of the leading dimension of a
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> stride - the number of items we go over when move next along the leading dimension in a block
>> ldSizeA - size of the leading dimension of a
>> ldSizeB - size of the leading dimension of b
>> ldSizeC - size of the leading dimension of c
>> blockNum - number of blocks
*/
template<int nonZeroAlpha> __global__
void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha,
int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum)
{
__shared__ DTYPE* ap[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE* bp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE* cp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= blockNum * stride || j >= ldSizeC)
return;
if (threadIdx.y == 0) {
int block = i / stride;
int size = block * stride;
ap[threadIdx.x] = a + size * ldSizeA;
bp[threadIdx.x] = b + size * ldSizeB;
cp[threadIdx.x] = c + size * ldSizeC;
}
__syncthreads();
int aj = j >= ldSizeA ? j % ldSizeA : j;
int bj = j >= ldSizeB ? j % ldSizeB : j;
int offseti = i % stride;
if (nonZeroAlpha == 0)
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti];
else
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti]
+ alpha * cp[threadIdx.x][j * ldSizeC + offseti];
}
/*
element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the item index
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> leadingDim - dimension along which we perform broadcasting
*/
void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
{
int leadingDimRDI = a->order - leadingDim - 1;
CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
"Unmatched tensors in multiplication!");
CheckNTErrors((a->order == b->order && a->order == c->order), "Unmatched tensors!");
int stride = 1;
int blockSizeA = 1;
int blockNum = 1;
int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
for (int i = 0; i < a->order; i++) {
if (i != leadingDimRDI) {
CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] &&
a->dimSizeRDI[i] == c->dimSizeRDI[i]),
"Unmatched tensors!");
}
if (i < leadingDimRDI)
stride *= a->dimSizeRDI[i];
}
blockSizeA = stride * dimensionSizeA;
blockNum = a->unitNum / blockSizeA;
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (!a->isSparse && !b->isSparse) {
if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
int cudaGridSize[3];
int cudaBlockSize[3];
if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
GDevs.GetCudaThread(a->devID, c->unitNum, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[0]), threads(cudaBlockSize[0]);
if (alpha == 0)
KernelDivElementWise << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum);
else
KernelDivElementWiseV2 << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum, alpha);
}
else {
GDevs.GetCudaThread2D(c->devID, stride * blockNum, dimensionSizeC, MAX_INT, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[0], cudaGridSize[1]), threads(cudaBlockSize[0], cudaBlockSize[1]);
if (alpha == 0) {
KernelDivElementWiseTensorDynamic<0> << <blocks, threads >> >
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, 0,
stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
}
else {
KernelDivElementWiseTensorDynamic<1> << <blocks, threads >> >
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, alpha,
stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
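The simple element-wise path above only needs one thread per element; GDevs.GetCudaThread hides the arithmetic, but a minimal sketch of the usual grid-size computation (my own, with an assumed block size) looks like this:

```cpp
// a, b, c are device pointers holding 'unitNum' elements each.
void launchDivSketch(DTYPE * a, DTYPE * b, DTYPE * c, int unitNum)
{
    const int threadsPerBlock = 128;                                       // assumed; GDevs picks this per device
    const int blocks = (unitNum + threadsPerBlock - 1) / threadsPerBlock;  // ceil(unitNum / threadsPerBlock)
    KernelDivElementWise<<<blocks, threadsPerBlock>>>(a, b, c, unitNum);
}
```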
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __DIV_CUH__
#define __DIV_CUH__
#include "Div.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) */
__global__
void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size);

/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
__global__
void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha);

/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
template<int nonZeroAlpha>__global__
void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha, int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum);
/* element-wise division of two tensors */
void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __DIV_CUH__
...@@ -16,31 +16,39 @@
 */

/*
 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
 */

#ifndef __DIV_H__
#define __DIV_H__

#include "../../XTensor.h"

namespace nts { // namespace nts(NiuTrans.Tensor)

/*
element-wise division of two tensors:
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the index of the element
*/
void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);

/*
element-wise division of two tensors (do it on site)
keep the result in the input tensor a and return nothing
a(i) = a(i)/b(i) + \alpha * a(i)
where i is the index of the element
*/
void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0);

/*
element-wise division of two tensors (return a XTensor structure)
make a new tensor to keep the result and return it
c(i) = a(i)/b(i)
where i is the index of the element
*/
XTensor Div(const XTensor &a, const XTensor &b, int leadingDim = 0);

} // namespace nts(NiuTrans.Tensor)

#endif // __DIV_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "MatrixMULBatchedCPU.h"
#include "MatrixMul2D.h"
#include "XTensorBLAS.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
matrix multiplication in batch mode (BLAS)
c_i = trans(a_i) * trans(b_i) * \alpha + c_i * \beta for each i in [0,count-1]
>> a - list of input matrices (2d tensors)
>> transposedA - indicate whether the matrix a is transposed
>> b - another list of input matrices (2d tensors)
>> transposedB - indicate whether the matrix b is transposed
>> c - output matrix (2d tensor)
>> alpha - scalar
>> beta - scalar
*/
void _MatrixMULBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA,
const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty input lists!");
CheckNTErrors(a->count == b->count && a->count == c->count, "Input lists must be of the same size!");
if (a->count == 0)
return;
bool isUniform = true;
for (int i = 1; i < a->count; i++) {
XTensor * aim = (XTensor*)a->GetItem(i - 1);
XTensor * bim = (XTensor*)b->GetItem(i - 1);
XTensor * cim = (XTensor*)c->GetItem(i - 1);
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
if (!XTensor::IsSameShaped(aim, ai) ||
!XTensor::IsSameShaped(bim, bi) ||
!XTensor::IsSameShaped(cim, ci))
{
isUniform = false;
break;
}
}
for (int i = 0; i < a->count; i++) {
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
CheckNTErrors((ai->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((bi->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((ci->order == 2), "2d tensor (i.e., matrix) is required!");
#ifdef USE_BLAS
if (useBLAS)
_MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif
}
//}
}
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
...@@ -24,8 +24,8 @@
#include "../../XName.h"
#include "MatrixMul.h"
#include "MatrixMul2D.h"
#include "XTensorBLAS.h"
#include "MatrixMulBatched.h"

namespace nts { // namespace nts(NiuTrans.Tensor)

...@@ -156,9 +156,9 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
    }
    else {
        CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
        _MatrixMulBatchedCPU(aList, transposedA,
                             bList, transposedB,
                             cList, alpha, beta);
    }

    for (int i = 0; i < aList->count; i++) {
......
...@@ -23,8 +23,8 @@
#include "../../XDevice.h"
#include "../../XName.h"
#include "MatrixMulBatched.h"
#include "XTensorBLAS.h"
#include "MatrixMul2D.h"

namespace nts { // namespace nts(NiuTrans.Tensor)

...@@ -57,6 +57,43 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
    CheckNTErrors((a->order == b->order && a->order == c->order),
                  "Input tensor and output tensor must have same order!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0)
_MatrixMulBatchedGPU(a, transposedA, b, transposedB, c, alpha, beta);
else
_MatrixMulBatchedCPU(a, transposedA, b, transposedB, c, alpha, beta);
}
/*
matrix multiplication of the two tensors
optimized for GPU
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = trans(ai) * trans(bi) * alpha + ci * beta
where trans() returns the transposed matrix if the flag is fired
>> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b
>> transposedB - indicates whether the matrices in b are transposed
>> c - where we keep a*b
>> alpha - a coefficient
>> beta - another coefficient
*/
void _MatrixMulBatchedGPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha, DTYPE beta)
{
#ifdef USE_CUDA
CheckNTErrors((a && b && c), "Empty input tensors!");
CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
"Input tensors should have the same data type!");
CheckNTErrors((a->order >= 2 && b->order >= 2 && c->order >= 2),
"Input tensors must have a order >= 2!");
CheckNTErrors((a->order == b->order && a->order == c->order),
"Input tensor and output tensor must have same order!");
CheckNTErrors(a->devID >= 0 && b->devID >= 0 && c->devID >= 0, "The tensors must be on GPUs");
    int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
    int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
    int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1];

...@@ -64,8 +101,7 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
    int cn = c->dimSizeRDI[1];
    int cm = c->dimSizeRDI[0];

    CheckNTErrors((am == bn && an == cn && bm == cm), "Unmatched tensors in multiplication!");

    int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1];
    int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1];

...@@ -81,76 +117,154 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
        blockNum *= a->dimSizeRDI[i];
    }
    cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID);
    _CudaBLASMatrixMULBatchedStrided(handle,
                                     a->data, transposedA, a->dataType, aBlockSize,
                                     b->data, transposedB, b->dataType, bBlockSize,
                                     c->data, c->dataType, cBlockSize, blockNum,
                                     a->dimSizeRDI[1], a->dimSizeRDI[0],
                                     b->dimSizeRDI[1], b->dimSizeRDI[0],
                                     c->dimSizeRDI[1], c->dimSizeRDI[0], alpha, beta);
#endif
}

/*
matrix multiplication of the two tensors
optimized for CPU

for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = trans(ai) * trans(bi) * alpha + ci * beta
where trans() returns the transposed matrix if the flag is fired

>> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b
>> transposedB - indicates whether the matrices in b are transposed
>> c - where we keep a*b
>> alpha - a coefficient
>> beta - another coefficient
*/
void _MatrixMulBatchedCPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
                          const XTensor * b, MATRIX_TRANS_TYPE transposedB,
                          XTensor * c, DTYPE alpha, DTYPE beta)
{
    CheckNTErrors((a && b && c), "Empty input tensors!");
    CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
                  "Input tensors should have the same data type!");
    CheckNTErrors((a->order >= 2 && b->order >= 2 && c->order >= 2),
                  "Input tensors must have a order >= 2!");
    CheckNTErrors((a->order == b->order && a->order == c->order),
                  "Input tensor and output tensor must have same order!");

    int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
    int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
    int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1];
    int bm = transposedB == X_TRANS ? b->dimSizeRDI[1] : b->dimSizeRDI[0];
    int cn = c->dimSizeRDI[1];
    int cm = c->dimSizeRDI[0];

    CheckNTErrors((am == bn && an == cn && bm == cm), "Unmatched tensors in multiplication!");

    int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1];
    int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1];
    int cBlockSize = c->dimSizeRDI[0] * c->dimSizeRDI[1];
    int aRealBlockSize = aBlockSize * a->unitSize;
    int bRealBlockSize = bBlockSize * b->unitSize;
    int cRealBlockSize = cBlockSize * c->unitSize;
    int blockNum = 1;

    for (int i = 2; i < a->order; i++) {
        CheckNTErrors((a->dimSizeRDI[i] == c->dimSizeRDI[i]), "Incorrect tensor sizes!");
        CheckNTErrors((b->dimSizeRDI[i] == c->dimSizeRDI[i]), "Incorrect tensor sizes!");
        blockNum *= a->dimSizeRDI[i];
    }

    int aDimSize[2] = {-a->dimSizeRDI[1], a->dimSizeRDI[0]};
    int bDimSize[2] = {-b->dimSizeRDI[1], b->dimSizeRDI[0]};
    int cDimSize[2] = {-c->dimSizeRDI[1], c->dimSizeRDI[0]};

    XTensor * ai = NewTensor2D(aDimSize[0], aDimSize[1], a->dataType, a->devID, a->mem);
    XTensor * bi = NewTensor2D(bDimSize[0], bDimSize[1], b->dataType, b->devID, b->mem);
    XTensor * ci = NewTensor2D(cDimSize[0], cDimSize[1], c->dataType, c->devID, c->mem);

    for (int i = 0; i < blockNum; i++) {
        ai->data = (char*)a->data + i * aRealBlockSize;
        bi->data = (char*)b->data + i * bRealBlockSize;
        ci->data = (char*)c->data + i * cRealBlockSize;
#ifdef USE_BLAS
        if (useBLAS)
            _MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
        else
            _MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else
        _MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif
    }

    ai->data = NULL;
    bi->data = NULL;
    ci->data = NULL;
    delete ai;
    delete bi;
    delete ci;
}

/*
matrix multiplication in batch mode for list inputs (BLAS)
c_i = trans(a_i) * trans(b_i) * \alpha + c_i * \beta for each i in [0,count-1]
>> a - list of input matrices (2d tensors)
>> transposedA - indicate whether the matrix a is transposed
>> b - another list of input matrices (2d tensors)
>> transposedB - indicate whether the matrix b is transposed
>> c - output matrix (2d tensor)
>> alpha - scalar
>> beta - scalar
*/
void _MatrixMulBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA,
                          const XList * b, MATRIX_TRANS_TYPE transposedB,
                          XList * c, DTYPE alpha, DTYPE beta)
{
    CheckNTErrors(a && b && c, "Empty input lists!");
    CheckNTErrors(a->count == b->count && a->count == c->count, "Input lists must be of the same size!");

    if (a->count == 0)
        return;

    bool isUniform = true;
    for (int i = 1; i < a->count; i++) {
        XTensor * aim = (XTensor*)a->GetItem(i - 1);
        XTensor * bim = (XTensor*)b->GetItem(i - 1);
        XTensor * cim = (XTensor*)c->GetItem(i - 1);
        XTensor * ai = (XTensor*)a->GetItem(i);
        XTensor * bi = (XTensor*)b->GetItem(i);
        XTensor * ci = (XTensor*)c->GetItem(i);
        if (!XTensor::IsSameShaped(aim, ai) ||
            !XTensor::IsSameShaped(bim, bi) ||
            !XTensor::IsSameShaped(cim, ci))
        {
            isUniform = false;
            break;
        }
    }

    for (int i = 0; i < a->count; i++) {
        XTensor * ai = (XTensor*)a->GetItem(i);
        XTensor * bi = (XTensor*)b->GetItem(i);
        XTensor * ci = (XTensor*)c->GetItem(i);
        CheckNTErrors((ai->order == 2), "2d tensor (i.e., matrix) is required!");
        CheckNTErrors((bi->order == 2), "2d tensor (i.e., matrix) is required!");
        CheckNTErrors((ci->order == 2), "2d tensor (i.e., matrix) is required!");
#ifdef USE_BLAS
        if (useBLAS)
            _MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
        else
            _MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else
        _MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif
    }
}

/*
......
...@@ -26,6 +26,8 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

#define BMMul MatrixMulBatched

/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta

...@@ -37,6 +39,28 @@ where trans() returns the transposed matrix if the flag is fired
void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
                       XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0, XPRunner * parallelRunner = NULL);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
optimized for GPU
*/
void _MatrixMulBatchedGPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
optimized for CPU
*/
void _MatrixMulBatchedCPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta (for list inputs)
optimized for CPU
*/
void _MatrixMulBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA, const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/* /*
matrix multiplication of the two tensors (return a XTensor structure) c = trans(a) * trans(b) * alpha matrix multiplication of the two tensors (return a XTensor structure) c = trans(a) * trans(b) * alpha
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
......
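A sketch of calling the batched multiplication through the new BMMul alias (my own example; the XTensor-returning MatrixMulBatched wrapper is declared further down in this header):

```cpp
int dimsA[3] = {8, 2, 3};   // 8 matrices of size 2 x 3
int dimsB[3] = {8, 3, 4};   // 8 matrices of size 3 x 4
XTensor * a = NewTensor(3, dimsA, X_FLOAT);
XTensor * b = NewTensor(3, dimsB, X_FLOAT);
// ... fill a and b ...
XTensor c = BMMul(*a, X_NOTRANS, *b, X_NOTRANS);   // 8 matrices of size 2 x 4
DelTensor(a);
DelTensor(b);
```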
...@@ -32,9 +32,9 @@ element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i)
where i is the index of the item

>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting
*/
......
...@@ -104,9 +104,9 @@ void KernelMulElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE al
    int offseti = i % stride;

    if (nonZeroAlpha == 0)
        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti];
    else
        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti] +
                                                 alpha * cp[threadIdx.x][j * ldSizeC + offseti];
}
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "../../XUtility.h"
#include "Sub.h"
#include "Sub.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor subtraction c = a - b * \beta
>> a - a tensor
>> b - another tensor
>> c - where we put a-b*\beta. we save it in a if c is NULL
>> beta - the scaling factor
*/
void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
    CheckNTErrors(a->unitNum == b->unitNum && a->unitNum == c->unitNum,
                  "Unmatched tensors in subtraction!");
    CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
                  "Unmatched tensors in subtraction!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
#ifdef USE_CUDA
if (a == c) {
int P2PAccesible = 0;
#ifdef CUDA_UVA
cudaDeviceCanAccessPeer(&P2PAccesible, a->devID, b->devID);
#endif
if ((a->devID < 0 && b->devID >= 0) ||
(a->devID >= 0 && b->devID < 0) ||
(a->devID >= 0 && b->devID >= 0 && a->devID != b->devID && !P2PAccesible))
{
ShowNTErrors("Cannot run this method on multiple devices simultaneously!");
}
else
_CudaSub(a, b, c, beta);
}
else
_CudaSub(a, b, c, beta);
#endif
}
else {
if (!a->isSparse && !b->isSparse) {
            CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in subtraction!");
if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE &&
c->dataType == DEFAULT_DTYPE)
{
DTYPE * ap = (DTYPE*)a->data;
DTYPE * bp = (DTYPE*)b->data;
DTYPE * cp = (DTYPE*)c->data;
/* unrolling */
int num = a->unitNum;
if (num % 4 == 0) {
for (int i = 0; i < num; i += 4) {
cp[i] = ap[i] - bp[i] * beta;
cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
cp[i + 2] = ap[i + 2] - bp[i + 2] * beta;
cp[i + 3] = ap[i + 3] - bp[i + 3] * beta;
}
}
else if (num % 2 == 0) {
for (int i = 0; i < num; i += 2) {
cp[i] = ap[i] - bp[i] * beta;
cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
}
}
else {
for (int i = 0; i < num; i++) {
cp[i] = ap[i] - bp[i] * beta;
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
}
/*
tensor subtraction a = a - b * \beta (do it on site)
keep the result in the tensor a and return nothing
>> a - a tensor
>> b - another tensor
>> beta - the scaling factor
*/
void _SubMe(XTensor * a, const XTensor * b, DTYPE beta)
{
_Sub(a, b, a, beta);
}
/*
tensor subtraction c = a - b * \beta (return a XTensor structure)
make a new tensor c to keep the result and return it
>> a - a tensor
>> b - another tensor
>> beta - the scaling factor
<< return - the result of tensor subtraction
*/
XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta)
{
XTensor c(&a);
c.SetTMP();
/* call _Sub function */
_Sub(&a, &b, &c, beta);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUB);
XLink::AddParamToHead(&c, beta);
return c;
}
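
/*
Illustrative usage sketch (not part of the original file): how the three
subtraction entry points relate. The NewTensor(order, dimSize, dataType,
denseRatio, devID) constructor and the _SetDataFixedFloat helper are assumed
to be visible here as declared elsewhere in the library; sizes are arbitrary.
*/
void SubUsageSketch()
{
    int dims[2] = {2, 3};
    XTensor * a = NewTensor(2, dims, X_FLOAT, 1.0F, -1);
    XTensor * b = NewTensor(2, dims, X_FLOAT, 1.0F, -1);
    XTensor * c = NewTensor(2, dims, X_FLOAT, 1.0F, -1);

    _SetDataFixedFloat(a, 3.0F);
    _SetDataFixedFloat(b, 2.0F);

    _Sub(a, b, c, 0.5F);             /* c = a - b * 0.5, i.e., every entry becomes 2.0 */
    _SubMe(a, b);                    /* a = a - b, beta defaults to 1.0 */
    XTensor d = Sub(*a, *b, 2.0F);   /* d = a - b * 2.0, with tensor connections recorded */

    delete a; delete b; delete c;
}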
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XDevice.h"
#include "../../XUtility.h"
#include "Sub.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
subtraction of data arrays (CUDA Kernel)
c = a - b * \beta
>> a - A matrix
>> b - another matrix
>> c - where we put a-b
>> size - the size of a/b/c
>> beta - the coefficient
*/
__global__
void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] - b[i] * beta;
}
/*
tensor subtraction c = a - b * \beta (cuda version)
>> a - a tensor
>> b - another tensor
>> c - where we put a-b*\beta.
>> beta - the scaling factor
*/
void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
    CheckNTErrors((a->unitNum == b->unitNum && a->unitNum == c->unitNum),
                  "Unmatched tensors in subtraction!");
    CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
                  "Unmatched tensors in subtraction!");
    CheckNTErrors((a->devID == b->devID && a->devID == c->devID),
                  "The tensors must be on the same device!");
int devIDBackup = XDevice::GetGPUDevice();
XDevice::SetGPUDevice(a->devID);
if (!a->isSparse && !b->isSparse) {
        CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in subtraction!");
if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE &&
c->dataType == DEFAULT_DTYPE)
{
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
KernelSUB << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, a->unitNum, beta);
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
XDevice::SetGPUDevice(devIDBackup);
}
/* subtraction over arrays
tensor subtraction c = a - b * \beta (cuda version) with an input handle
>> devID - device ID (MUST >= 0)
>> handle - cuda handle
>> a - an array
>> b - another array
>> c - where we put a-b
>> size - size of the array
>> beta - the coefficient
*/
void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
{
if (size == 0)
return;
if (c == NULL)
c = a;
    CheckNTErrors((a && b && c), "Empty arrays in subtraction!");
int devIDBackup;
ProtectCudaDev(devID, devIDBackup);
    if (c == a) {
        /* axpy computes a = alpha * b + a, so pass -beta to obtain a = a - b * beta */
        DTYPE scale = -beta;
#ifdef DOUBELPRICSION
        cublasDaxpy(*handle, size, &scale, b, 1, a, 1);
#else
        cublasSaxpy(*handle, size, &scale, b, 1, a, 1);
#endif
    }
else {
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(devID, size, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
KernelSUB<<<blocks, threads>>>((DTYPE*)a, (DTYPE*)b, (DTYPE*)c, size, beta);
}
BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __SUB_CUH__
#define __SUB_CUH__
#include "Sub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* subtraction of data arrays (CUDA Kernel) */
__global__
void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
/* tensor subtraction c = a - b * \beta (cuda version) */
void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c = NULL, DTYPE beta = (DTYPE)1.0);
/* tensor subtraction c = a - b * \beta (cuda version) with an input handle */
void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __SUB_CUH__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
 * All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/*
 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
 * Today is the first day of August. It's still very hot.
 */

#ifndef __SUB_H__
#define __SUB_H__

#include "../../XTensor.h"

namespace nts { // namespace nts(NiuTrans.Tensor)

/* tensor subtraction c = a - b * \beta */
void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0);

/*
tensor subtraction a = a - b * \beta
keep the result in the input tensor a and return nothing
*/
void _SubMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0);

/*
tensor subtraction c = a - b * \beta
make a new tensor c to keep the result and return it
*/
XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0);

} // namespace nts(NiuTrans.Tensor)

#endif // __SUB_H__
@@ -24,6 +24,7 @@
#include "../../XUtility.h"
#include "Sum.h"
#include "Sum.cuh"
#include "SumDim.h"

namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -67,7 +68,7 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
    }
    else {
        if (!a->isSparse && !b->isSparse) {
            CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in addition!");

            if (a->dataType == DEFAULT_DTYPE &&
                b->dataType == DEFAULT_DTYPE &&
@@ -123,6 +124,33 @@ void _SumMe(XTensor * a, const XTensor * b, DTYPE beta)
{
    _Sum(a, b, a, beta);
}
/*
return the dimension index if the summation can be performed as SumDim (see SumDim.h for more details)
>> a - a tensor
>> b - another tensor for sum
*/
int GetSumDimIndex(const XTensor &a, const XTensor &b)
{
if(a.order < b.order)
return -1;
int hitCount = 0;
int hitDim = -1;
for(int i = 0; i < b.order; i++){
if(b.dimSize[b.order - 1 - i] == 1)
continue;
else if(b.dimSize[b.order - 1 - i] == a.dimSize[a.order - 1 - i]){
hitCount++;
hitDim = a.order - b.order + i;
}
}
if(hitCount == 1)
return hitDim;
else
return -1;
}
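
/*
Worked example (illustrative, not part of the original file): for a with shape
(8, 4, 5) and b with shape (5), the loop above compares trailing dimensions,
finds exactly one match (size 5), and returns hitDim = a.order - b.order + i
= 3 - 1 + 0 = 2, so Sum() below dispatches to _SumDim(&a, &b, &c, 2, beta).
If b matched more than one dimension (say b had shape (4, 5)), hitCount would
be 2 and -1 would be returned, so Sum() would not use the broadcasting path.
*/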
/*
tensor summation c = a + b * \beta (return a XTensor structure)
@@ -137,13 +165,29 @@ XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta)
{
    XTensor c(&a);
    c.SetTMP();

    int n = GetSumDimIndex(a, b);

    if(n == -1){
        /* call _Sum function */
        _Sum(&a, &b, &c, beta);

        /* tensor connections */
        XLink::MakeLink(&a, &b, &c, MATH_SUM);
        XLink::AddParamToHead(&c, beta);
    }
    else if(n >= 0 && n < a.order){
        /* call _SumDim function */
        _SumDim(&a, &b, &c, n, beta);

        /* tensor connections */
        XLink::MakeLink(&a, &b, &c, MATH_SUMDIM);
        XLink::AddParamToHeadInt(&c, n);
        XLink::AddParamToHead(&c, beta);
    }
    else{
        ShowNTErrors("Something is wrong!");
    }

    return c;
}
...
@@ -20,6 +20,7 @@
 */

#include "../../XDevice.h"
#include "../../XUtility.h"
#include "Sum.cuh"

namespace nts { // namespace nts(NiuTrans.Tensor)
...
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#include "Sum.h"
#include "SumDim.h"
#include "SumDim.cuh"
#include "../../XName.h"
#include "../movement/CopyValues.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor summation
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> c - where we put a+b*\beta. we save it in a if c is NULL
>> n - the dimension index
>> beta - the scaling factor
*/
void _SumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == c->unitNum, "Unmatched tensors in addition!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched data types in addition!");
CheckNTErrors(a->order == c->order, "The input tensors do not have the same order in addition!");
CheckNTErrors(!a->isSparse && !b->isSparse && !c->isSparse, "Dense tensors are required!");
CheckNTErrors(a->dimSize[n] == b->unitNum, "Wrong tensor size!");
if(beta == 0){
_CopyValues(a, c);
return;
}
if(XTensor::IsSameShaped(a, b)){
_Sum(a, b, c, beta);
return;
}
if(a->devID >= 0 || b->devID >= 0 || c->devID >= 0){
#ifdef USE_CUDA
_CudaSumDim(a, b, c, n, beta);
#else
ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
}
else{
int stride = 1;
int blockSize = a->dimSize[n];
int blockNum = 1;
for(int i = a->order - 1; i >= 0; i--){
if(i > n)
stride *= a->dimSize[i];
else if(i < n)
blockNum *= a->dimSize[i];
}
if (a->dataType == DEFAULT_DTYPE){
int num = a->unitNum;
if(stride > 1){
for(int i = 0, j = 0; i < num; i += stride, j++){
DTYPE * ap = (DTYPE*)a->data + i;
DTYPE bv = *((DTYPE*)b->data + j % blockSize) * beta;
DTYPE * cp = (DTYPE*)c->data + i;
for(int k = 0; k < stride; k++)
cp[k] = ap[k] + bv;
}
}
else if(stride == 1){
DTYPE * bp = (DTYPE*)b->data;
for(int i = 0; i < num; i += blockSize){
DTYPE * ap = (DTYPE*)a->data + i;
DTYPE * cp = (DTYPE*)c->data + i;
if(beta == 1.0F){
for(int j = 0; j < blockSize; j++)
cp[j] = ap[j] + bp[j];
}
else{
for(int j = 0; j < blockSize; j++)
cp[j] = ap[j] + bp[j] * beta;
}
}
}
else{
ShowNTErrors("Something is wrong!");
}
}
else {
ShowNTErrors("TODO!");
}
}
}
/*
tensor summation (do it on site)
keep the result in the input tensor and return nothing
a = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> n - the dimension index
>> beta - the scaling factor
*/
void _SumDim(XTensor * a, const XTensor * b, int n, DTYPE beta)
{
_SumDim(a, b, a, n, beta);
}
/*
tensor summation (return a XTensor structure and make tensor connections)
make a new tensor to keep the result and return it
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> n - the dimension index
>> beta - the scaling factor
<< return - the result tensor by tensor summation
*/
XTensor SumDim(const XTensor &a, const XTensor &b, int n, DTYPE beta)
{
XTensor c(&a);
c.SetTMP();
    /* call _SumDim function */
_SumDim(&a, &b, &c, n, beta);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUMDIM);
XLink::AddParamToHeadInt(&c, n);
XLink::AddParamToHead(&c, beta);
return c;
}
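
/*
Illustrative usage sketch (not part of the original file): adding a bias
vector to every row of a 2D tensor via the broadcasting summation above.
NewTensor(order, dimSize, dataType, denseRatio, devID) and _SetDataFixedFloat
are assumed to be visible here as declared elsewhere in the library.
*/
void SumDimUsageSketch()
{
    int xDims[2] = {64, 512};
    int bDims[1] = {512};
    XTensor * x = NewTensor(2, xDims, X_FLOAT, 1.0F, -1);
    XTensor * bias = NewTensor(1, bDims, X_FLOAT, 1.0F, -1);

    _SetDataFixedFloat(x, 1.0F);
    _SetDataFixedFloat(bias, 0.5F);

    /* y[i][j] = x[i][j] + bias[j]: bias is broadcast along dimension 1 of x */
    XTensor y = SumDim(*x, *bias, 1);

    /* in-place variant: x = x + bias * 0.1 along dimension 1 */
    _SumDim(x, bias, 1, 0.1F);

    delete x;
    delete bias;
}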
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#include "SumDim.cuh"
#include "../../XDevice.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
tensor summation of a tensor and a row vector
c = a + b * \beta
where a is a tensor and b is a row vector
>> a - pointer to the data array of a
>> b - pointer to the data array of b
>> c - pointer to the data array of c
>> rowNum - number of rows of a and c
>> colNum - number of columns of a and c (i.e., the size of b)
>> beta - the scaling factor
*/
template <class T, bool betaFired>
__global__
void KernelAddWithRow(T * a, T * b, T * c, int rowNum, int colNum, T beta)
{
__shared__ T bv[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int col = blockDim.x * blockIdx.x + threadIdx.x;
int row = blockDim.y * blockIdx.y + threadIdx.y;
if(col >= colNum || row >= rowNum)
return;
if(threadIdx.y == 0)
bv[threadIdx.x] = b[col];
__syncthreads();
int offset = colNum * row + col;
if(betaFired)
c[offset] = a[offset] + bv[threadIdx.x] * beta;
else
c[offset] = a[offset] + bv[threadIdx.x];
}
/*
tensor summation of a tensor and a column vector
c = a + b * \beta
where a is a tensor and b is a column vector
>> a - pointer to the data array of a
>> b - pointer to the data array of b
>> c - pointer to the data array of c
>> rowNum - number of rows of a and c (i.e., the size of b)
>> colNum - number of columns of a and c
>> blockSize - size of a block (matrix), i.e., rowNum * colNum
>> blockNum - number of matrices
>> beta - the scaling factor
*/
template <class T, bool betaFired>
__global__
void KernelAddWithCol(T * a, T * b, T * c, int rowNum, int colNum, int blockSize, int blockNum, T beta)
{
__shared__ T bv[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int colIndex = blockDim.x * blockIdx.x + threadIdx.x;
int row = blockDim.y * blockIdx.y + threadIdx.y;
int col = colIndex % colNum;
int block = colIndex / colNum;
if(row >= rowNum || block >= blockNum)
return;
if(threadIdx.x == 0)
bv[threadIdx.y] = b[row];
__syncthreads();
int offset = block * blockSize + row * colNum + col;
if(betaFired)
c[offset] = a[offset] + bv[threadIdx.y] * beta;
else
c[offset] = a[offset] + bv[threadIdx.y];
}
/*
tensor summation (cuda version)
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> c - where we put a+b*\beta. we save it in a if c is NULL
>> n - the dimension index
>> beta - the scaling factor
*/
void _CudaSumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == c->unitNum, "Unmatched tensors in addition!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched data types in addition!");
CheckNTErrors(a->order == c->order, "The input tensors do not have the same order in addition!");
CheckNTErrors(!a->isSparse && !b->isSparse && !c->isSparse, "Dense tensors are required!");
CheckNTErrors(a->dimSize[n] == b->unitNum, "Wrong tensor size!");
int stride = 1;
int blockSize = a->dimSize[n];
int blockNum = 1;
for(int i = a->order - 1; i >= 0; i--){
if(i > n)
stride *= a->dimSize[i];
else if(i < n)
blockNum *= a->dimSize[i];
}
int cudaGrids[3];
int cudaBlocks[3];
int devIDBackup = 0;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE){
if(stride > 1){
GDevs.GetCudaThread2D(a->devID, stride * blockNum, blockSize, MAX_INT, cudaGrids, cudaBlocks);
if(beta == (DTYPE)1.0F)
KernelAddWithCol<DTYPE, false> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockSize, stride, blockSize * stride, blockNum, beta);
else
KernelAddWithCol<DTYPE, true> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockSize, stride, blockSize * stride, blockNum, beta);
}
else if(stride == 1){
GDevs.GetCudaThread2D(a->devID, blockSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
if(beta == (DTYPE)1.0F)
KernelAddWithRow<DTYPE, false> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockNum, blockSize, beta);
else
KernelAddWithRow<DTYPE, true> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockNum, blockSize, beta);
}
else{
ShowNTErrors("Something is wrong!");
}
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#ifndef __SUMDIM_CUH__
#define __SUMDIM_CUH__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting (cuda version) */
void _CudaSumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta = (DTYPE)1.0);
#endif
} // namespace nts(NiuTrans.Tensor)
#endif // __SUMDIM_CUH__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
 * It reached 39 degrees centigrade around 3:00 pm in Shenyang
*/
#ifndef __SUMDIM_H__
#define __SUMDIM_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting */
void _SumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta = (DTYPE)1.0);
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting. we keep the result in the input tensor a and return nothing */
void _SumDim(XTensor * a, const XTensor * b, int n, DTYPE beta = (DTYPE)1.0);
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting. We make a new tensor c to keep the result and return it */
XTensor SumDim(const XTensor &a, const XTensor &b, int n, DTYPE beta = (DTYPE)1.0);
} // namespace nts(NiuTrans.Tensor)
#endif // __SUMDIM_H__
@@ -20,6 +20,7 @@
 * $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-05-08
 */

#include <math.h>
#include "SetData.h"
#include "SetData.cuh"
#include "../../XUtility.h"
@@ -37,6 +38,43 @@
namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
Fills the input Tensor or Variable with values according to the method described in
"Understanding the difficulty of training deep feedforward neural networks" - Glorot, X. & Bengio, Y. (2010),
using a uniform distribution. The resulting tensor will have values sampled from :math:`U(-a, a)`
where :math:`a = gain \times \sqrt{2 / (fan\_in + fan\_out)} \times \sqrt{3}`. Also known as Glorot initialisation.
>> tensor - the tensor whose data array would be initialized
>> gain - an optional scaling factor
*/
void _SetDataFanInOut(XTensor * tensor, DTYPE gain)
{
CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
CheckNTErrors(tensor->order >= 2, "the tensor dimension must be no less than 2!");
int fanIn = 1;
int fanOut = 1;
int order = tensor->order;
if (order == 2) {
fanIn = tensor->dimSize[1];
fanOut = tensor->dimSize[0];
}
else {
int numInputFmaps = tensor->dimSize[1];
int numOutputFmaps = tensor->dimSize[0];
int receptiveFieldSize = 0;
for (int i = 2; i < order; i++)
receptiveFieldSize += tensor->dimSize[i];
fanIn = numInputFmaps * receptiveFieldSize;
fanOut = numOutputFmaps * receptiveFieldSize;
}
DTYPE std = gain * sqrt(2.0/(fanIn + fanOut));
DTYPE a = sqrt(3.0) * std;
_SetDataRand(tensor, -a, a);
}
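
/*
Worked example (illustrative, not part of the original file): for a 2D weight
tensor of shape (512, 1024) with gain = 1, fanOut = 512 and fanIn = 1024, so
    std = sqrt(2 / (1024 + 512)) ~= 0.0361
    a   = sqrt(3) * std          ~= 0.0625
and every entry is drawn uniformly from [-0.0625, 0.0625].
*/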
/*
generate data items with a fixed value p
>> tensor - the tensor whose data array would be initialized
@@ -65,7 +103,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
        }
        else{
#ifdef USE_CUDA
            _CudaSetDataFixedInt(tensor, p);
#endif
        }
    }
@@ -88,7 +126,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
        }
        else{
#ifdef USE_CUDA
            _CudaSetDataFixedFloat(tensor, p);
#endif
        }
    }
@@ -111,7 +149,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
        }
        else{
#ifdef USE_CUDA
            _CudaSetDataFixedDouble(tensor, p);
#endif
        }
    }
@@ -137,7 +175,7 @@ generate data items with a fixed value p (in integer)
*/
void _SetDataFixedInt(XTensor * tensor, int p)
{
    CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");

    if(p == 0)
        tensor->SetZeroAll();
@@ -152,7 +190,7 @@ generate data items with a fixed value p (in float)
*/
void _SetDataFixedFloat(XTensor * tensor, float p)
{
    CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");

    if(p == 0)
        tensor->SetZeroAll();
@@ -167,7 +205,7 @@ generate data items with a fixed value p (in double)
*/
void _SetDataFixedDouble(XTensor * tensor, double p)
{
    CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");

    if(p == 0)
        tensor->SetZeroAll();
@@ -183,6 +221,8 @@ generate data items with a uniform distribution in [low,high]
*/
void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
{
    CheckNTErrors(high > low, "the high value must be greater than low value!");

    if(tensor == NULL)
        return;
@@ -215,10 +255,13 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
    TODO: generate data points on GPUs straightforwardly.
    */
    else{
#ifdef USE_CUDA
        _CudaSetDataRand(tensor, low, high);
#endif
        //XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1);
        //_SetDataRand(t2, low, high);
        //_CopyValues(t2, tensor);
        //delete t2;
    }
}
...
@@ -21,7 +21,10 @@
 * I'm surprised that I did not write this file till today.
 */

#include <curand.h>
#include <time.h>
#include "SetData.cuh"
#include <curand_kernel.h>
#include "../../XDevice.h"

namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -46,7 +49,7 @@ generate data items with a fixed value p (in int)
>> tensor - the tensor for initialization
>> p - the initial value
*/
void _CudaSetDataFixedInt(XTensor * tensor, int p)
{
    CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");
@@ -86,7 +89,7 @@ generate data items with a fixed value p (in float)
>> tensor - the tensor for initialization
>> p - the initial value
*/
void _CudaSetDataFixedFloat(XTensor * tensor, float p)
{
    CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
@@ -126,7 +129,7 @@ generate data items with a fixed value p (in double)
>> tensor - the tensor for initialization
>> p - the initial value
*/
void _CudaSetDataFixedDouble(XTensor * tensor, double p)
{
    CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");
@@ -146,4 +149,115 @@ void _CudaSetDataFixedDouble(XTensor * tensor, double p)
    BacktoCudaDev(tensor->devID, devIDBackup);
}
/*
call curand_init function on each kernel with the same random seed
and init the rng states
*/
__global__
void KernelInitializeCurand(curandState * state, unsigned long seed)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
curand_init(seed, i, 0, &state[i]);
}
/* draw a uniformly distributed random float from the curand state at position i */
__device__
float GenerateFloat(curandState* globalState, int i)
{
//copy state to local mem
curandState localState = globalState[i];
//apply uniform distribution with calculated random
float randNum = curand_uniform(&localState);
//update state
globalState[i] = localState;
//return value
return randNum;
}
/* draw a uniformly distributed random double from the curand state at position i */
__device__
double GenerateDouble(curandState* globalState, int i)
{
//copy state to local mem
curandState localState = globalState[i];
//apply uniform distribution with calculated random
double randNum = curand_uniform_double(&localState);
//update state
globalState[i] = localState;
//return value
return randNum;
}
/*
set data array with a uniform distribution in [low, high]
>> deviceStates - the state of curand
>> d - float datatype pointer to the data array
>> size - size of the array
>> low - low value of the range
>> variance - the width of the range, i.e., high - low
*/
__global__
void KernelSetDataRandFloat(curandState* deviceStates, float * d, int size, DTYPE low, DTYPE variance)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
float randNum = GenerateFloat(deviceStates, i);
d[i] = randNum * variance + low;
}
}
/*
set data array with a uniform distribution in [low, high]
>> deviceStates - the state of curand
>> d - double datatype pointer to the data array
>> size - size of the array
>> low - low value of the range
>> variance - the width of the range, i.e., high - low
*/
__global__
void KernelSetDataRandDouble(curandState* deviceStates, double * d, int size, DTYPE low, DTYPE variance)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
double randNum = GenerateDouble(deviceStates, i);
d[i] = randNum * variance + low;
}
}
/*
generate data items with a uniform distribution in [low,high]
>> tensor - the tensor whose data array would be initialized
>> low - lower value of the range
>> high - higher value of the range
*/
void _CudaSetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
{
CheckNTErrors(high > low, "the high value must be greater than low value!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(tensor->devID, tensor->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(tensor->devID, devIDBackup);
    curandState * deviceStates;
    /* one curand state per launched thread (a single state would overflow in the init kernel) */
    cudaMalloc(&deviceStates, sizeof(curandState) * gridSize[0] * blockSize[0]);
    DTYPE variance = high - low;

    KernelInitializeCurand<<<blocks, threads>>>(deviceStates, unsigned(time(NULL)));

    if (tensor->dataType == X_FLOAT)
        KernelSetDataRandFloat <<<blocks, threads >>>(deviceStates, (float*)tensor->data, tensor->unitNum, low, variance);
    else if (tensor->dataType == X_DOUBLE)
        KernelSetDataRandDouble<<<blocks, threads >>>(deviceStates, (double*)tensor->data, tensor->unitNum, low, variance);

    cudaFree(deviceStates);

    BacktoCudaDev(tensor->devID, devIDBackup);
}
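
/*
Illustrative usage sketch (not part of the original file): filling a tensor
that lives on GPU 0 with values drawn uniformly from [-0.1, 0.1]. The generic
NewTensor(order, dimSize, dataType, denseRatio, devID) constructor is assumed
here; the host-side _SetDataRand front end dispatches to _CudaSetDataRand
above whenever the tensor sits on a GPU.
*/
void SetDataRandUsageSketch()
{
    int dims[2] = {128, 256};
    XTensor * w = NewTensor(2, dims, X_FLOAT, 1.0F, 0);   /* dense tensor on device 0 */
    _SetDataRand(w, -0.1F, 0.1F);                         /* runs the curand kernels above */
    delete w;
}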
} // namespace nts(NiuTrans.Tensor)
@@ -29,13 +29,16 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

/* generate data items with a fixed value p (in int) */
void _CudaSetDataFixedInt(XTensor * tensor, int p);

/* generate data items with a fixed value p (in float) */
void _CudaSetDataFixedFloat(XTensor * tensor, float p);

/* generate data items with a fixed value p (in double) */
void _CudaSetDataFixedDouble(XTensor * tensor, double p);

/* generate data items with a uniform distribution in [low,high] */
void _CudaSetDataRand(XTensor * tensor, DTYPE low, DTYPE high);

} // namespace nts(NiuTrans.Tensor)
...
@@ -27,6 +27,9 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

/* generate data items with a xavier initialization */
void _SetDataFanInOut(XTensor * tensor, DTYPE gain = 1.0F);

/* generate data items with a fixed value p */
void _SetDataFixed(XTensor * tensor, void * valuePointer);
...
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "Log.h"
#include "Log.cuh"
#include <math.h>
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
set every entry to its log value (do it on site)
>> a - input tensor we are processing
>> b - output tensor we are processing
*/
void _Log(const XTensor * a, XTensor * b)
{
#ifdef USE_CUDA
/* run it on GPUs */
if (a->devID >= 0) {
_CudaLog(a, b);
return;
}
#endif
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * d = (DTYPE*)a->data;
DTYPE * db = (DTYPE*)b->data;
for (int i = 0; i < a->unitNum; i++)
db[i] = (DTYPE)log(d[i]);
}
/*
set every entry to its log value
keep the result in the input tensor a and return nothing
>> a - the tensor we are processing
*/
void _LogMe(XTensor * a)
{
_Log(a, a);
}
/*
set every entry to its log value (return a XTensor structure)
make a new tensor to keep the result and return it
>> a - input tensor we are processing
<< return - the log value of the input tensor
*/
XTensor Log(const XTensor & a)
{
XTensor b(&a);
b.SetTMP();
/* call _Log function */
_Log(&a, &b);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_LOG);
return b;
}
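
/*
Illustrative usage sketch (not part of the original file): the three Log entry
points follow the same pattern as the other unary operations. NewTensor and
_SetDataFixedFloat are assumed to be visible here; no clamping is done, so
non-positive entries follow the usual C math rules (log(0) = -inf, log of a
negative number = NaN).
*/
void LogUsageSketch()
{
    int dims[2] = {2, 2};
    XTensor * a = NewTensor(2, dims, X_FLOAT, 1.0F, -1);
    _SetDataFixedFloat(a, 2.718282F);   /* roughly e, so the logs come out near 1 */

    XTensor b = Log(*a);                /* new tensor holding log(a), MATH_LOG is recorded */
    _LogMe(a);                          /* in-place: a now holds the log of its old values */

    delete a;
}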
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "Log.h"
#include "Log.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set each entry to its log value (CUDA Kernel)
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelLog(DTYPE * a, DTYPE * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
b[i] = log(a[i]);
}
/*
set each entry to its log value (CUDA Kernel)
This is for float16 computation
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelLog(__half * a, __half * b, int size)
{
return;
}
/*
set each entry to its log value
>> a - input tensor
>> b - output tensor
*/
void _CudaLog(const XTensor * a, XTensor * b)
{
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->isSparse == false), "TODO!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE) {
KernelLog << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
}
else if (a->dataType == X_FLOAT16) {
KernelLog << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#include <math.h>
#include "../../XName.h"
#include "Unary.h"
#include "Unary.cuh"
namespace nts{
#ifdef USE_CUDA
/* define three macros separately, specifying the respective function names */
#define _SIMPLE_UNARY_FUNCTION(_funcName, _cudaFuncName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
/* run it on GPUs */ \
if (a->devID >= 0) { \
_cudaFuncName(a, b); \
return; \
} \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
}
#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a) \
{ \
_funcName(a, a); \
}
#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a) \
{ \
XTensor b(&a); \
b.SetTMP(); \
_funcName(&a, &b); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
_SIMPLE_UNARY_FUNCTION(_Absolute, _CudaAbsolute, fabs)
_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
_SIMPLE_UNARY_FUNCTION(_Exp, _CudaExp, exp)
_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
_SIMPLE_UNARY_FUNCTION(_Log, _CudaLog, log)
_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
_SIMPLE_UNARY_FUNCTION(_Sin, _CudaSin, sin)
_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
_SIMPLE_UNARY_FUNCTION(_Cos, _CudaCos, cos)
_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
_SIMPLE_UNARY_FUNCTION(_Tan, _CudaTan, tan)
_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
#else
/* define three macros separately, specifying the respective function names */
#define _SIMPLE_UNARY_FUNCTION(_funcName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
}
#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a) \
{ \
_funcName(a, a); \
}
#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a) \
{ \
XTensor b(&a); \
b.SetTMP(); \
_funcName(&a, &b); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
_SIMPLE_UNARY_FUNCTION(_Absolute, fabs)
_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
_SIMPLE_UNARY_FUNCTION(_Exp, exp)
_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
_SIMPLE_UNARY_FUNCTION(_Log, log)
_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
_SIMPLE_UNARY_FUNCTION(_Sin, sin)
_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
_SIMPLE_UNARY_FUNCTION(_Cos, cos)
_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
_SIMPLE_UNARY_FUNCTION(_Tan, tan)
_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
#endif
}
\ No newline at end of file
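
/*
Illustrative usage sketch (not part of the original file): every name passed to
the macros above expands to the same three-function pattern, so Absolute, Exp,
Log, Sin, Cos and Tan are all called the same way. The sketch assumes the nts
namespace, NewTensor and _SetDataFixedFloat are visible at this point.
*/
using namespace nts;

void UnaryUsageSketch()
{
    int dims[1] = {8};
    XTensor * a = NewTensor(1, dims, X_FLOAT, 1.0F, -1);
    XTensor * b = NewTensor(1, dims, X_FLOAT, 1.0F, -1);
    _SetDataFixedFloat(a, 0.5F);

    XTensor s = Sin(*a);   /* returns a new tensor and records MATH_SIN in the network */
    _CosMe(a);             /* overwrites a with cos(a) */
    _Exp(a, b);            /* writes exp(a) into a pre-allocated output tensor */

    delete a;
    delete b;
}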
#include <math.h>
#include "../../XDevice.h"
#include "../../XName.h"
#include "Unary.cuh"
namespace nts {
#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc) \
__global__ \
void Kernel##funcName(DTYPE * a, DTYPE * b, int size) \
{ \
int i = blockDim.x * blockIdx.x + threadIdx.x; \
\
if (i < size) \
b[i] = (DTYPE)origFunc(a[i]); \
} \
__global__ \
void Kernel##funcName(__half * a, __half * b, int size) \
{ \
return; \
} \
void _Cuda##funcName(const XTensor * a, XTensor * b) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->isSparse == false), "TODO!"); \
\
int gridSize[3]; \
int blockSize[3]; \
\
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize); \
\
dim3 blocks(gridSize[0]); \
dim3 threads(blockSize[0]); \
\
int devIDBackup; \
ProtectCudaDev(a->devID, devIDBackup); \
\
if (a->dataType == DEFAULT_DTYPE) { \
Kernel##funcName << <blocks, threads >> > \
((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum); \
} \
else if (a->dataType == X_FLOAT16) { \
Kernel##funcName << <blocks, threads >> > \
((__half*)a->data, (__half*)b->data, a->unitNum); \
} \
else { \
ShowNTErrors("TODO!"); \
} \
\
BacktoCudaDev(a->devID, devIDBackup); \
} \
SIMPLE_UNARY_FUNCTION_GPU(Absolute, fabs)
SIMPLE_UNARY_FUNCTION_GPU(Exp, exp)
SIMPLE_UNARY_FUNCTION_GPU(Log, log)
SIMPLE_UNARY_FUNCTION_GPU(Sin, sin)
SIMPLE_UNARY_FUNCTION_GPU(Cos, cos)
SIMPLE_UNARY_FUNCTION_GPU(Tan, tan)
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __UNARY_CUH__
#define __UNARY_CUH__
#include "../../XTensor.h"
#include "Unary.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set each entry to its absolute value (CUDA Kernel) */
__global__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
__global__
void KernelAbsolute(__half * a, __half * b, int size);
/* set each entry to its absolute value */
void _CudaAbsolute(const XTensor * a, XTensor * b);
/* set each entry to its exponent value (CUDA Kernel) */
__global__
void KernelExp(DTYPE * a, DTYPE * b, int size);
/* set each entry to its exponent value (CUDA Kernel) with float16 data type*/
__global__
void KernelExp(__half * a, __half * b, int size);
/* set each entry to its exponent value */
void _CudaExp(const XTensor * a, XTensor * b);
/* set each entry to its logarithm value (CUDA Kernel) */
__global__
void KernelLog(DTYPE * a, DTYPE * b, int size);
/* set each entry to its logarithm value (CUDA Kernel) with float16 data type*/
__global__
void KernelLog(__half * a, __half * b, int size);
/* set each entry to its logarithm value */
void _CudaLog(const XTensor * a, XTensor * b);
/* set each entry to its sine value (CUDA Kernel) */
__global__
void KernelSin(DTYPE * a, DTYPE * b, int size);
/* set each entry to its sine value (CUDA Kernel) with float16 data type*/
__global__
void KernelSin(__half * a, __half * b, int size);
/* set each entry to its sine value */
void _CudaSin(const XTensor * a, XTensor * b);
/* set each entry to its cosine value (CUDA Kernel) */
__global__
void KernelCos(DTYPE * a, DTYPE * b, int size);
/* set each entry to its cosine value (CUDA Kernel) with float16 data type*/
__global__
void KernelCos(__half * a, __half * b, int size);
/* set each entry to its cosine value */
void _CudaCos(const XTensor * a, XTensor * b);
/* set each entry to its tangent value (CUDA Kernel) */
__global__
void KernelTan(DTYPE * a, DTYPE * b, int size);
/* set each entry to its tangent value (CUDA Kernel) with float16 data type*/
__global__
void KernelTan(__half * a, __half * b, int size);
/* set each entry to its tangent value */
void _CudaTan(const XTensor * a, XTensor * b);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __UNARY_CUH__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __UNARY_H__
#define __UNARY_H__
#include "../../XTensor.h"
namespace nts{
/* set every entry to its absolute value */
void _Absolute(const XTensor * a, XTensor * b);
/*
set every entry to its absolute value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _AbsoluteMe(XTensor * a);
/*
set every entry to its absolute value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Absolute(const XTensor & a);
/* set every entry to its exponent value */
void _Exp(const XTensor * a, XTensor * b);
/*
set every entry to its exponent value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _ExpMe(XTensor * a);
/*
set every entry to its exponent value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Exp(const XTensor & a);
/* set every entry to its logarithm value */
void _Log(const XTensor * a, XTensor * b);
/*
set every entry to its logarithm value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _LogMe(XTensor * a);
/*
set every entry to its logarithm value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Log(const XTensor & a);
/* set every entry to its sine value */
void _Sin(const XTensor * a, XTensor * b);
/*
set every entry to its sine value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _SinMe(XTensor * a);
/*
set every entry to its sine value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Sin(const XTensor & a);
/* set every entry to its cosine value */
void _Cos(const XTensor * a, XTensor * b);
/*
set every entry to its cosine value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _CosMe(XTensor * a);
/*
set every entry to its cosine value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Cos(const XTensor & a);
/* set every entry to its tangent value */
void _Tan(const XTensor * a, XTensor * b);
/*
set every entry to its tangent value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _TanMe(XTensor * a);
/*
set every entry to its tangent value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Tan(const XTensor & a);
}
#endif //end __UNARY_H__
\ No newline at end of file
...@@ -35,24 +35,33 @@ copy a number of blocks to target positions ...@@ -35,24 +35,33 @@ copy a number of blocks to target positions
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
>> myMem - the memory pool >> myMem - the memory pool
>> devID - device id
*/ */
void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem) void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
{ {
if (myMem != NULL && myMem->devID >= 0) { if (myMem != NULL)
devID = myMem->devID;
if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
/* copy the index from host to device */ /* copy the index from host to device */
int * targetBlocksTMP = (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)); int * targetBlocksTMP = myMem != NULL ?
(int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)):
(int*)XMemAlloc(devID, blockNum * sizeof(int));
XMemCopy(targetBlocksTMP, myMem->devID, targetBlocks, -1, blockNum * sizeof(int)); XMemCopy(targetBlocksTMP, myMem->devID, targetBlocks, -1, blockNum * sizeof(int));
_CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocksTMP, myMem); _CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocksTMP, devID);
myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int)); if(myMem != NULL)
myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int));
else
XMemFree(devID, targetBlocksTMP);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif #endif
} }
else { else {
_CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocks, myMem); _CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocks, devID);
} }
} }
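/*
Hedged sketch of the device-resolution rule used by the reworked _CopyBlocks above:
a memory pool, when given, decides where the copy runs; otherwise the explicit devID
argument is used, with a negative value meaning host memory. ResolveCopyDevice is a
hypothetical helper that only spells the rule out.
*/
int ResolveCopyDevice(XMem * myMem, int devID)
{
    if (myMem != NULL)
        devID = myMem->devID;   /* the pool's location overrides the argument */
    return devID;               /* >= 0: GPU id, < 0: CPU */
}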
...@@ -65,11 +74,12 @@ copy a number of blocks source source positions to target positions ...@@ -65,11 +74,12 @@ copy a number of blocks source source positions to target positions
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
>> myMem - the memory pool >> myMem - the memory pool
>> devID - device id
*/ */
void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID) void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
{ {
if (myMem != NULL) if (myMem != NULL)
CheckNTErrors((myMem->devID == devID), "DevIDs are different between memory pool and input devID!"); devID = myMem->devID;
if (devID >= 0) { if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
......
...@@ -27,7 +27,7 @@ ...@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* copy a number of blocks to target positions */ /* copy a number of blocks to target positions */
void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem); void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
/* copy a number of blocks from source positions to target positions */ /* copy a number of blocks from source positions to target positions */
void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID); void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
......
...@@ -223,8 +223,11 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN ...@@ -223,8 +223,11 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN
int cudaGrids[3]; int cudaGrids[3];
int cudaBlocks[3]; int cudaBlocks[3];
int threadNum = MIN(MAX(blockSize, blockNum), MAX_CUDA_THREAD_NUM_PER_BLOCK); int threadNum = MIN(MAX(blockSize, blockNum), MAX_CUDA_THREAD_NUM_PER_BLOCK);
int devIDBackup;
ProtectCudaDev(myMem->devID, devIDBackup);
GDevs.GetCudaThread2D(myMem->devID, threadNum, gridNum * blockNum, INT_MAX, cudaGrids, cudaBlocks); GDevs.GetCudaThread2D(myMem->devID, threadNum, gridNum * blockNum, INT_MAX, cudaGrids, cudaBlocks);
cudaBlocks[1] = 1; cudaBlocks[1] = 1;
...@@ -237,39 +240,41 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN ...@@ -237,39 +240,41 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN
if (blockNum == 4) { if (blockNum == 4) {
if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum) if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
KernelCopyBlocksInGridFast<int, 4, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 4, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
else else
KernelCopyBlocksInGridFast<int, 4, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 4, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
} }
else if (blockNum == 6) { else if (blockNum == 6) {
if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum) if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
KernelCopyBlocksInGridFast<int, 6, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 6, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
else else
KernelCopyBlocksInGridFast<int, 6, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 6, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
} }
else if (blockNum == 8) { else if (blockNum == 8) {
if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum) if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
KernelCopyBlocksInGridFast<int, 8, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 8, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
else else
KernelCopyBlocksInGridFast<int, 8, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 8, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
} }
else if (blockNum == 12) { else if (blockNum == 12) {
if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum) if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
KernelCopyBlocksInGridFast<int, 12, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 12, 2> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
else else
KernelCopyBlocksInGridFast<int, 12, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGridFast<int, 12, 1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
} }
else { else {
KernelCopyBlocksInGrid<int> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > KernelCopyBlocksInGrid<int> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((int*)source, blockSize, blockNum, gridNum, (int*)target, index); ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
} }
BacktoCudaDev(myMem->devID, devIDBackup);
} }
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -34,29 +34,35 @@ all the data has been on the device (CPU/GPU) already. ...@@ -34,29 +34,35 @@ all the data has been on the device (CPU/GPU) already.
>> blockNum - number of blocks >> blockNum - number of blocks
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
>> myMem - the memory pool >> devID - device id
*/ */
void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem) void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID)
{ {
if (myMem != NULL && myMem->devID >= 0) { if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
_CudaCopyBlocks(source, blockSize, blockNum, target, targetBlocks, myMem); _CudaCopyBlocks(source, blockSize, blockNum, target, targetBlocks, devID);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif #endif
} }
else { else {
int devID = myMem != NULL ? myMem->devID : -1;
/* /*
The following code should be fine with GPUs, but too many The following code should be fine with GPUs, but too many
kernel calls would slow down the system. We prefer to use kernel calls would slow down the system. We prefer to use
one kernel to do block copy in batch (kernel fusion). one kernel to do block copy in batch (kernel fusion).
*/ */
for (int i = 0, b = 0; i < blockNum; i++, b += blockSize) { if(blockSize == sizeof(int)){
XMemCopy((char*)target + targetBlocks[i] * blockSize, devID, for (int i = 0, b = 0; i < blockNum; i++, b += blockSize) {
(char*)source + b, devID, blockSize); *(int*)((char*)target + targetBlocks[i] * blockSize) =
*(int*)((char*)source + b);
}
}
else{
for (int i = 0, b = 0; i < blockNum; i++, b += blockSize) {
XMemCopy((char*)target + targetBlocks[i] * blockSize, devID,
(char*)source + b, devID, blockSize);
}
} }
} }
} }
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
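/*
CPU sketch of the fast path added above: when blockSize equals sizeof(int), each block
is a single 4-byte value, so the scatter reduces to plain assignments and avoids one
XMemCopy call per block. ScatterIntBlocks is a hypothetical reference, not library code.
*/
void ScatterIntBlocks(const int * source, int * target, const int * targetBlocks, int blockNum)
{
    for (int i = 0; i < blockNum; i++)
        target[targetBlocks[i]] = source[i];   /* targetBlocks[i] is the destination block index */
}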
...@@ -36,39 +36,48 @@ NOTE that this version makes more use of the 2d threads in cuda ...@@ -36,39 +36,48 @@ NOTE that this version makes more use of the 2d threads in cuda
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
*/ */
template<int miniBlockSize> template<class T>
__global__ __global__
void KernelCopyBlocks(DTYPE * source, int blockSize, int blockNum, DTYPE * target, int * targetBlocks) void KernelCopyBlocks(T * source, int blockSize, int blockNum, T * target, int * targetBlocks)
{ {
/* entry index in the block */ /* entry index in the block */
int i = (blockDim.x * blockIdx.x + threadIdx.x) * miniBlockSize; int i = blockDim.x * blockIdx.x + threadIdx.x;
/* block index */ /* block index */
int j = blockDim.y * blockIdx.y + threadIdx.y; int j = blockDim.y * blockIdx.y + threadIdx.y;
if (j >= blockNum) if (i >= blockSize || j >= blockNum)
return; return;
/* target position */ T * s = source + blockSize * j;
int k = targetBlocks[j]; T * t = target + blockSize * targetBlocks[j];
DTYPE * s = source + blockSize * j; t[i] = s[i];
DTYPE * t = target + blockSize * k; }
if (i < blockSize) { /*
if (miniBlockSize == 4) { copy a number of blocks to target positions
t[i] = s[i]; NOTE that this version makes more use of the 2d threads in cuda
t[i + 1] = s[i + 1]; >> source - data array (head of the blocks) to copy from
t[i + 2] = s[i + 2]; >> blockSize - size of block
t[i + 3] = s[i + 3]; >> blockNum - number of blocks
} >> target - target data array
else if (miniBlockSize <= 1) { >> targetBlocks - target positions of the copy
t[i] = s[i]; */
} template<class T>
else { __global__
printf("something wrong!"); void KernelCopyBlocksV2(T * source, int blockSize, int blockNum, int totalSize, T * target, int * targetBlocks)
} {
} /* entry index in the block */
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= totalSize)
return;
int targetBlockID = targetBlocks[i / blockSize];
int targetOffset = i % blockSize;
*(target + blockSize * targetBlockID + targetOffset) = source[i];
} }
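/*
Host-side reference (hypothetical, for reading the kernel only) of the flattened indexing
used by KernelCopyBlocksV2: element i belongs to source block i / blockSize and lands at
offset i % blockSize inside the block chosen by targetBlocks.
*/
template<class T>
void CopyBlocksV2Reference(const T * source, int blockSize, int totalSize, T * target, const int * targetBlocks)
{
    for (int i = 0; i < totalSize; i++) {
        int targetBlockID = targetBlocks[i / blockSize];
        int targetOffset  = i % blockSize;
        target[blockSize * targetBlockID + targetOffset] = source[i];
    }
}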
/* /*
...@@ -78,29 +87,42 @@ copy a number of blocks to target positions (cuda version) ...@@ -78,29 +87,42 @@ copy a number of blocks to target positions (cuda version)
>> blockNum - number of blocks >> blockNum - number of blocks
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy (on the device) >> targetBlocks - target positions of the copy (on the device)
>> myMem - memory pool >> devID - device id
*/ */
void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem) void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID)
{ {
CheckNTErrors((myMem != NULL), "No memory pool!"); CheckNTErrors(devID >= 0, "Wrong device to run!");
CheckNTErrors((myMem->devID >= 0), "Wrong device to run!");
CheckNTErrors((blockSize % sizeof(DTYPE) == 0), "Unsupported block size!");
int cudaGrids[3]; int cudaGrids[3];
int cudaBlocks[3]; int cudaBlocks[3];
int bSize = blockSize / sizeof(DTYPE);
if (bSize % 4 == 0) { int devIDBackup;
GDevs.GetCudaThread2D(myMem->devID, bSize / 4, blockNum, MAX_INT, cudaGrids, cudaBlocks); ProtectCudaDev(devID, devIDBackup);
KernelCopyBlocks<4> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((DTYPE*)source, bSize, blockNum, (DTYPE*)target, targetBlocks); if(blockSize % sizeof(double) == 0){
int bSize = blockSize / sizeof(double);
GDevs.GetCudaThread(devID, bSize * blockNum, cudaGrids, cudaBlocks);
KernelCopyBlocksV2<double> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0]) >>>
((double*)source, bSize, blockNum, bSize * blockNum, (double*)target, targetBlocks);
//GDevs.GetCudaThread2D(devID, bSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
//KernelCopyBlocks<double> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >>>
// ((double*)source, bSize, blockNum, (double*)target, targetBlocks);
}
else
if(blockSize % sizeof(float) == 0){
int bSize = blockSize / sizeof(float);
GDevs.GetCudaThread(devID, bSize * blockNum, cudaGrids, cudaBlocks);
KernelCopyBlocksV2<float> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0]) >>>
((float*)source, bSize, blockNum, bSize * blockNum, (float*)target, targetBlocks);
//GDevs.GetCudaThread2D(devID, bSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
//KernelCopyBlocks<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >>>
// ((float*)source, bSize, blockNum, (float*)target, targetBlocks);
} }
else { else{
GDevs.GetCudaThread2D(myMem->devID, bSize, blockNum, MAX_INT, cudaGrids, cudaBlocks); ShowNTErrors("Unsupported block size!");
KernelCopyBlocks<1> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
((DTYPE*)source, bSize, blockNum, (DTYPE*)target, targetBlocks);
} }
BacktoCudaDev(devID, devIDBackup);
} }
#endif // USE_CUDA #endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
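/*
Hedged sketch of the dispatch used by _CudaCopyBlocks above: the copy runs on the widest
unit that divides the block size, which halves the thread count for 8-byte-aligned blocks.
PickCopyUnitSize is a hypothetical helper that only restates that decision.
*/
int PickCopyUnitSize(int blockSizeInBytes)
{
    if (blockSizeInBytes % (int)sizeof(double) == 0)
        return (int)sizeof(double);   /* launch KernelCopyBlocksV2<double> */
    if (blockSizeInBytes % (int)sizeof(float) == 0)
        return (int)sizeof(float);    /* launch KernelCopyBlocksV2<float> */
    return -1;                        /* unsupported block size -> error */
}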
...@@ -28,15 +28,11 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -28,15 +28,11 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA #ifdef USE_CUDA
/* copy a number of blocks to target positions */
__global__
void KernelCopyBlocks(DTYPE * source, int blockSize, int blockNum, DTYPE * target, int * targetBlocks);
/* copy a number of blocks to target positions (cuda version) */ /* copy a number of blocks to target positions (cuda version) */
void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem); void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID);
#endif // USE_CUDA #endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __COPYBLOCKS_CUH__ #endif // __COPYBLOCKS_CUH__
\ No newline at end of file
...@@ -27,7 +27,7 @@ ...@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* copy a number of blocks to target positions (on site) */ /* copy a number of blocks to target positions (on site) */
void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem); void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
......
...@@ -75,6 +75,9 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i ...@@ -75,6 +75,9 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i
CheckNTErrors(devID >= 0, "Wrong device to run!"); CheckNTErrors(devID >= 0, "Wrong device to run!");
CheckNTErrors((blockSize % sizeof(DTYPE) == 0), "Unsupported block size!"); CheckNTErrors((blockSize % sizeof(DTYPE) == 0), "Unsupported block size!");
int devIDBackup;
ProtectCudaDev(devID, devIDBackup);
/* copy the index to the GPU memory */ /* copy the index to the GPU memory */
int * sourceBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int)); int * sourceBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
int * targetBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int)); int * targetBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
...@@ -97,6 +100,8 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i ...@@ -97,6 +100,8 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i
XMemFree(devID, sourceBlocksTMP); XMemFree(devID, sourceBlocksTMP);
XMemFree(devID, targetBlocksTMP); XMemFree(devID, targetBlocksTMP);
} }
BacktoCudaDev(devID, devIDBackup);
} }
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -37,8 +37,8 @@ copy indexed sub-tensors ...@@ -37,8 +37,8 @@ copy indexed sub-tensors
>> indexSize - length of srcIndex (and tgtIndex) >> indexSize - length of srcIndex (and tgtIndex)
>> tgtIndex - index of the target sub-tensors >> tgtIndex - index of the target sub-tensors
>> copyNum - number of the sub-tensors we copy for each source index, >> copyNum - number of the sub-tensors we copy for each source index,
e.g., for srcIndex = [1,4] and copyNum = 2, e.g., for srcIndex = [1,4] and copyNum = 2,
we actually copy the source sub-tensors 1, 2, 4, 5 we actually copy the source sub-tensors 1, 2, 4, 5
*/ */
void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum) void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
{ {
...@@ -73,17 +73,23 @@ void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int i ...@@ -73,17 +73,23 @@ void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int i
int * realSrcIndex = new int[realIndexSize]; int * realSrcIndex = new int[realIndexSize];
int * realTgtIndex = new int[realIndexSize]; int * realTgtIndex = new int[realIndexSize];
for (int i = 0; i < indexOffsetNum; i++) { for (int i = 0; i < indexOffsetNum; i++) {
int base = i * indexSize * copyNum;
int baseSrc = i * leadDimSizeSrc;
int baseTgt = i * leadDimSizeTgt;
for (int j = 0; j < indexSize; j++) { for (int j = 0; j < indexSize; j++) {
int offset = base + j * copyNum;
int * rsi = realSrcIndex + offset;
int * rti = realTgtIndex + offset;
for (int k = 0; k < copyNum; k++) { for (int k = 0; k < copyNum; k++) {
realSrcIndex[i * indexSize * copyNum + j * copyNum + k] = i * leadDimSizeSrc + srcIndex[j] + k; rsi[k] = baseSrc + srcIndex[j] + k;
realTgtIndex[i * indexSize * copyNum + j * copyNum + k] = i * leadDimSizeTgt + tgtIndex[j] + k; rti[k] = baseTgt + tgtIndex[j] + k;
} }
} }
} }
for (int i = 0; i < indexSize; i++) { for (int i = 0; i < indexSize; i++) {
CheckNTErrors((srcIndex[i] < blockNumSrc), "Index is out of range!"); CheckNTErrors((srcIndex[i] < blockNumSrc), "Index is out of scope!");
CheckNTErrors((tgtIndex[i] < blockNumTgt), "Index is out of range!"); CheckNTErrors((tgtIndex[i] < blockNumTgt), "Index is out of scope!");
} }
_CopyBlocks(s->data, blockSizeSrc * s->unitSize, realSrcIndex, realIndexSize, t->data, realTgtIndex, s->mem, s->devID); _CopyBlocks(s->data, blockSizeSrc * s->unitSize, realSrcIndex, realIndexSize, t->data, realTgtIndex, s->mem, s->devID);
......
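/*
Worked example (hypothetical code) of the index expansion performed by _CopyIndexed above,
matching the header comment "srcIndex = [1,4] and copyNum = 2 copies sub-tensors 1, 2, 4, 5";
a single index offset (indexOffsetNum = 1) is assumed so the base terms vanish.
*/
void ExpandIndexExample()
{
    int srcIndex[2] = {1, 4}, tgtIndex[2] = {0, 2};
    int indexSize = 2, copyNum = 2;
    int realSrcIndex[4], realTgtIndex[4];
    for (int j = 0; j < indexSize; j++) {
        for (int k = 0; k < copyNum; k++) {
            realSrcIndex[j * copyNum + k] = srcIndex[j] + k;   /* -> 1, 2, 4, 5 */
            realTgtIndex[j * copyNum + k] = tgtIndex[j] + k;   /* -> 0, 1, 2, 3 */
        }
    }
}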
...@@ -20,6 +20,7 @@ ...@@ -20,6 +20,7 @@
*/ */
#include "../../XName.h" #include "../../XName.h"
#include "../../XUtility.h"
#include "CopyValues.h" #include "CopyValues.h"
#include "CopyValues.cuh" #include "CopyValues.cuh"
...@@ -42,7 +43,7 @@ void _CopyValues(const XTensor * s, XTensor * t, XStream * stream) ...@@ -42,7 +43,7 @@ void _CopyValues(const XTensor * s, XTensor * t, XStream * stream)
if ((s->dataType == X_FLOAT16 && t->dataType == X_FLOAT) || if ((s->dataType == X_FLOAT16 && t->dataType == X_FLOAT) ||
(s->dataType == X_FLOAT && t->dataType == X_FLOAT16)) { (s->dataType == X_FLOAT && t->dataType == X_FLOAT16)) {
CheckNTErrors(((s->devID < 0 && t->devID < 0) || s->devID == t->devID), CheckNTErrors(((s->devID < 0 && t->devID < 0) || s->devID == t->devID),
"The code must be run on the same device!"); "The code must be run on the same device!");
CheckNTErrors((s->isSparse || t->isSparse), "TODO!"); CheckNTErrors((s->isSparse || t->isSparse), "TODO!");
ConvertDataType(s->devID, s->data, s->dataType, t->data, t->dataType, s->unitNum); ConvertDataType(s->devID, s->data, s->dataType, t->data, t->dataType, s->unitNum);
} }
...@@ -69,6 +70,34 @@ void _CopyValues(const XTensor * s, XTensor * t, XStream * stream) ...@@ -69,6 +70,34 @@ void _CopyValues(const XTensor * s, XTensor * t, XStream * stream)
} }
/* /*
copy a segment of s to t
>> s - source
>> sBeg - beginning of the segment
>> sLen - length of the segment
>> t - target
>> tBeg - beginning of the segment on the target side
>> stream - the stream for creating the job pipeline
*/
void _CopyValues(const XTensor * s, const int sBeg, const int sLen, XTensor * t, const int tBeg, XStream * stream)
{
CheckNTErrors(s != NULL && t != NULL, "The input tensor and output tensor must be nonempty!");
CheckNTErrors(s->data != NULL && t->data != NULL, "Cannot copy from an empty data array!");
CheckNTErrors(s->unitSize == t->unitSize, "The input tensors must be of the same unit size!");
CheckNTErrors(sBeg >= 0 && sLen >= 0 && sBeg + sLen <= s->unitNum, "Wrong segment on the source side");
CheckNTErrors(tBeg >= 0 && tBeg + sLen <= t->unitNum, "Wrong segment on the target side");
if (!s->isSparse && !t->isSparse) {
XMemCopy((char*)t->data + tBeg * t->unitSize, t->devID,
(char*)s->data + sBeg * s->unitSize, s->devID,
s->unitSize * sLen);
}
else {
ShowNTErrors("TODO!");
}
}
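/*
Hedged usage sketch of the segment-copy overload above (CopySegmentSketch and the chosen
offsets are hypothetical): copy 16 units of s, starting at unit 8, to the beginning of t;
the copy is device-aware because it goes through XMemCopy.
*/
void CopySegmentSketch(const XTensor * s, XTensor * t)
{
    _CopyValues(s, 8, 16, t, 0, NULL);   /* s[8..23] -> t[0..15] */
}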
/*
copy s to t (return a XTensor structure) copy s to t (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
......
...@@ -29,6 +29,9 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -29,6 +29,9 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* copy s to t */ /* copy s to t */
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL); void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL);
/* copy a segment of s to t */
void _CopyValues(const XTensor * s, const int sBeg, const int sLen, XTensor * t, const int tBeg, XStream * stream = NULL);
/* /*
copy s to t (return a XTensor structure) copy s to t (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
......
...@@ -33,14 +33,14 @@ set target data block index for the data movement in merge ...@@ -33,14 +33,14 @@ set target data block index for the data movement in merge
>> splitSizeInGrid - size of each data array to merge >> splitSizeInGrid - size of each data array to merge
>> gridSize - number of blocks in a grid (here a grid is a higher-level organization of blocks) >> gridSize - number of blocks in a grid (here a grid is a higher-level organization of blocks)
>> gridNum - number of grids >> gridNum - number of grids
>> mem - the memory pool >> devID - device id
*/ */
void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge, void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge,
int splitSizeInGrid, int gridSize, int gridNum, XMem * mem) int splitSizeInGrid, int gridSize, int gridNum, int devID)
{ {
if (mem != NULL && mem->devID >= 0) { if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
_CudaMakeMergeBlockIndex(mem->devID, blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum); _CudaMakeMergeBlockIndex(devID, blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif #endif
......
...@@ -28,7 +28,7 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -28,7 +28,7 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* set target data block index for the data movement in merge */ /* set target data block index for the data movement in merge */
void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge, void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge,
int splitSizeInGrid, int gridSize, int gridNum, XMem * mem); int splitSizeInGrid, int gridSize, int gridNum, int devID);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
......
...@@ -31,13 +31,13 @@ set target data block index for the data movement in split ...@@ -31,13 +31,13 @@ set target data block index for the data movement in split
>> splitNum - number of splits >> splitNum - number of splits
>> blockSplitSize - size of the splitted block >> blockSplitSize - size of the splitted block
>> blockNum - number of data blocks >> blockNum - number of data blocks
>> mem - the memory pool >> devID - device id
*/ */
void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, XMem * mem) void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, int devID)
{ {
if (mem != NULL && mem->devID >= 0) { if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
_CudaMakeSplitBlockIndex(mem->devID, blockIndex, splitNum, blockSplitSize, blockNum); _CudaMakeSplitBlockIndex(devID, blockIndex, splitNum, blockSplitSize, blockNum);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif #endif
......
...@@ -27,7 +27,7 @@ ...@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* set target data block index for the data movement in split */ /* set target data block index for the data movement in split */
void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, XMem * mem); void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, int devID);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
......
...@@ -42,10 +42,13 @@ e.g., (N/3, M, 3) -> (N, M) ...@@ -42,10 +42,13 @@ e.g., (N/3, M, 3) -> (N, M)
*/ */
void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim) void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
{ {
int whereToMergeRDI = s->order - whereToMerge - 1; if(leadingDim < 0)
int leadingDimRDI = s->order - leadingDim - 1; leadingDim = 0;
int whereToMergeRDI = s->order - whereToMerge - 1;
int leadingDimRDI = s->order - leadingDim - 1;
if (leadingDimRDI < 0) if (leadingDimRDI < 0)
leadingDimRDI = s->order - 1; leadingDimRDI = s->order - 1;
CheckNTErrors((s != NULL && t != NULL), "Invalid tensors!"); CheckNTErrors((s != NULL && t != NULL), "Invalid tensors!");
CheckNTErrors((s->devID == t->devID || (s->devID < 0 && t->devID < 0)), CheckNTErrors((s->devID == t->devID || (s->devID < 0 && t->devID < 0)),
...@@ -60,8 +63,12 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim) ...@@ -60,8 +63,12 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
CheckNTErrors((t->dimSizeRDI[i] == s->dimSizeRDI[i] * s->dimSizeRDI[leadingDimRDI]), CheckNTErrors((t->dimSizeRDI[i] == s->dimSizeRDI[i] * s->dimSizeRDI[leadingDimRDI]),
"Unmatched tensor sizes!"); "Unmatched tensor sizes!");
} }
else if (i < leadingDimRDI){
CheckNTErrors((s->dimSizeRDI[i] == t->dimSizeRDI[i]),
"Unmatched tensor sizes!");
}
else if (i > leadingDimRDI) { else if (i > leadingDimRDI) {
CheckNTErrors((s->dimSizeRDI[i - 1] == t->dimSizeRDI[i]), CheckNTErrors((s->dimSizeRDI[i] == t->dimSizeRDI[i - 1]),
"Unmatched tensor sizes!"); "Unmatched tensor sizes!");
} }
} }
...@@ -119,28 +126,24 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim) ...@@ -119,28 +126,24 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
int realBlockSize = blockSize * t->unitSize; int realBlockSize = blockSize * t->unitSize;
int * blockIndex = (int*)(mem != NULL ? int * blockIndex = (int*)(mem != NULL ?
mem->AllocBuf(mem->devID, blockNum * gridNum * sizeof(int)) : mem->AllocBuf(mem->devID, blockNum * gridNum * sizeof(int)) :
XMemAlloc(mem->devID, blockNum * gridNum * sizeof(int))); XMemAlloc(s->devID, blockNum * gridNum * sizeof(int)));
_MakeMergeBlockIndex(blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum, mem); _MakeMergeBlockIndex(blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum, s->devID);
_CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, mem); _CopyBlocksOnSite(s->data, realBlockSize, blockNum * gridNum, dataTMP, blockIndex, s->devID);
if (mem != NULL) if (mem != NULL)
mem->ReleaseBuf(mem->devID, blockNum * gridNum * sizeof(int)); mem->ReleaseBuf(mem->devID, blockNum * gridNum * sizeof(int));
else else
XMemFree(mem->devID, blockIndex); XMemFree(s->devID, blockIndex);
/* copy from tmp to target */
XMemCopy(t->data, t->devID, dataTMP, s->devID, size);
if (!isOnSameDevice) { if (!isOnSameDevice) {
XMemCopy(t->data, t->devID, dataTMP, s->devID, size); XMemCopy(t->data, t->devID, dataTMP, s->devID, size);
if (mem != NULL) if (mem != NULL)
mem->ReleaseBuf(mem->devID, size); mem->ReleaseBuf(mem->devID, size);
else else
XMemFree(mem->devID, dataTMP); XMemFree(s->devID, dataTMP);
} }
} }
} }
...@@ -163,7 +166,7 @@ XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim) ...@@ -163,7 +166,7 @@ XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim)
CheckNTErrors(leadingDim < whereToMerge, "Invalid leading dimension!"); CheckNTErrors(leadingDim < whereToMerge, "Invalid leading dimension!");
if (leadingDim < 0) if (leadingDim < 0)
leadingDim = 0; leadingDim = 0;
int order = s.order - 1; int order = s.order - 1;
int * dimSize = new int[order]; int * dimSize = new int[order];
...@@ -205,7 +208,7 @@ merge small tensors into a big tensor ...@@ -205,7 +208,7 @@ merge small tensors into a big tensor
*/ */
void _Merge(const XList * smalls, XTensor * big, int whereToMerge) void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
{ {
CheckNTErrors((smalls != NULL), "Invalid list!"); CheckNTErrors((smalls != NULL), "Invalid list!");
CheckNTErrors((smalls->count > 0), "Empty list!"); CheckNTErrors((smalls->count > 0), "Empty list!");
bool uniform = true; bool uniform = true;
...@@ -233,7 +236,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge) ...@@ -233,7 +236,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
int mergedNum = smalls->count; int mergedNum = smalls->count;
XTensor * s0 = (XTensor*)smalls->GetItem(0); XTensor * s0 = (XTensor*)smalls->GetItem(0);
int whereToMergeRDI = s0->order - whereToMerge - 1; int whereToMergeRDI = s0->order - whereToMerge - 1;
for (int i = 0; i < s0->order; i++) { for (int i = 0; i < s0->order; i++) {
if (i <= whereToMergeRDI) if (i <= whereToMergeRDI)
blockSize *= s0->dimSizeRDI[i]; blockSize *= s0->dimSizeRDI[i];
...@@ -268,10 +271,10 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge) ...@@ -268,10 +271,10 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
} }
/* merging with fewer kernel/api calls??? (i'm not sure about it!! may remove this later) */ /* merging with fewer kernel/api calls??? (i'm not sure about it!! may remove this later) */
else { else {
int* dimSizeTMP = new int[MAX_TENSOR_DIM_NUM]; int* dimSizeTMP = new int[smallsItem0->order + 1];
for (int i = 0; i < MAX_TENSOR_DIM_NUM; i++) for (int i = 0; i < smallsItem0->order; i++)
dimSizeTMP[i] = -smallsItem0->dimSizeRDI[i]; dimSizeTMP[i + 1] = -smallsItem0->dimSize[i];
dimSizeTMP[smallsItem0->order] = -mergeNum; dimSizeTMP[0] = -mergeNum;
XMem * mem = smallsItem0->mem; XMem * mem = smallsItem0->mem;
XTensor * tensorTMP = new XTensor(smallsItem0->order + 1, dimSizeTMP, XTensor * tensorTMP = new XTensor(smallsItem0->order + 1, dimSizeTMP,
...@@ -283,7 +286,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge) ...@@ -283,7 +286,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
if (uniform) if (uniform)
dataTMP = smallsItem0->data; dataTMP = smallsItem0->data;
else else
dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(mem->devID, size); dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(big->devID, size);
tensorTMP->data = dataTMP; tensorTMP->data = dataTMP;
...@@ -295,18 +298,17 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge) ...@@ -295,18 +298,17 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
} }
} }
_Merge(tensorTMP, big, whereToMerge); _Merge(tensorTMP, big, whereToMerge + 1);
delete[] dimSizeTMP; delete[] dimSizeTMP;
tensorTMP->data = NULL;
dataTMP = NULL;
tensorTMP->data = NULL;
delete tensorTMP; delete tensorTMP;
if ((!uniform) && (mem != NULL)) if ((!uniform) && (mem != NULL))
mem->ReleaseBuf(mem->devID, size); mem->ReleaseBuf(mem->devID, size);
else else
XMemFree(mem->devID, dataTMP); XMemFree(big->devID, dataTMP);
} }
} }
......
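/*
Worked shape example (hypothetical, sizes only) of the reshape trick above: a list of
mergeNum = 3 small tensors of shape (2, 4) is viewed as one temporary tensor of shape
(3, 2, 4) whose data is the concatenation of the smalls, so the list merge reduces to a
single two-tensor _Merge call; merging along the first dimension of the smalls then
yields a big tensor of shape (6, 4). Allocation and the negative-size convention are
left out here.
*/
void MergeListShapeExample()
{
    int mergeNum = 3;
    int smallDimSize[2] = {2, 4};
    int dimSizeTMP[3];
    dimSizeTMP[0] = mergeNum;                 /* extra leading dimension: one slot per small tensor */
    for (int i = 0; i < 2; i++)
        dimSizeTMP[i + 1] = smallDimSize[i];  /* -> (3, 2, 4) */
}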
...@@ -24,6 +24,7 @@ ...@@ -24,6 +24,7 @@
#include "MakeSplitBlockIndex.h" #include "MakeSplitBlockIndex.h"
#include "../../XName.h" #include "../../XName.h"
#include "../../XTensor.h" #include "../../XTensor.h"
#include "../../XDevice.h"
#include "../../XUtility.h" #include "../../XUtility.h"
#include "../movement/CopyBlocksOnSite.h" #include "../movement/CopyBlocksOnSite.h"
...@@ -88,10 +89,33 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum) ...@@ -88,10 +89,33 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
int n = blockNum / splitNum; int n = blockNum / splitNum;
int sStep = blockSize * s->unitSize; int sStep = blockSize * s->unitSize;
int tStep = n * tPitch; int tStep = n * tPitch;
for (int k = 0; k < splitNum; k++) { if(t->devID < 0){
XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID, for (int k = 0; k < splitNum; k++) {
(char*)s->data + k * sStep, sPitch, s->devID, XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
mSize, n); (char*)s->data + k * sStep, sPitch, s->devID,
mSize, n);
}
}
else{
#ifdef USE_CUDA
#ifdef STREAMED_MEMCPOPY
XStream * stream = GDevs.GPUs[t->devID].stream;
for (int k = 0; k < splitNum; k++) {
XMemCopy2DAsync((char*)t->data + k * tStep, tPitch, t->devID,
(char*)s->data + k * sStep, sPitch, s->devID,
mSize, n, stream);
}
stream->StreamSynchronize();
#else
for (int k = 0; k < splitNum; k++) {
XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
(char*)s->data + k * sStep, sPitch, s->devID,
mSize, n);
}
#endif
#else
ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
} }
} }
else { else {
...@@ -108,17 +132,17 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum) ...@@ -108,17 +132,17 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
int blockSplitSize = blockNum / splitNum; int blockSplitSize = blockNum / splitNum;
int * blockIndex = (int*)(mem != NULL ? int * blockIndex = (int*)(mem != NULL ?
mem->AllocBuf(mem->devID, blockNum * sizeof(int)) : mem->AllocBuf(mem->devID, blockNum * sizeof(int)) :
XMemAlloc(mem->devID, blockNum * sizeof(int))); XMemAlloc(s->devID, blockNum * sizeof(int)));
_MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, mem); _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, s->devID);
_CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, mem); _CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, s->devID);
if (mem != NULL) if (mem != NULL)
mem->ReleaseBuf(mem->devID, blockNum * sizeof(int)); mem->ReleaseBuf(mem->devID, blockNum * sizeof(int));
else else
XMemFree(mem->devID, blockIndex); XMemFree(s->devID, blockIndex);
/* copy from tmp to target */ /* copy from tmp to target */
if (!isOnSameDevice) { if (!isOnSameDevice) {
...@@ -127,7 +151,7 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum) ...@@ -127,7 +151,7 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
if (mem != NULL) if (mem != NULL)
mem->ReleaseBuf(mem->devID, size); mem->ReleaseBuf(mem->devID, size);
else else
XMemFree(mem->devID, dataTMP); XMemFree(s->devID, dataTMP);
} }
} }
} }
...@@ -226,20 +250,46 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum) ...@@ -226,20 +250,46 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
int n = blockNum / splitNum; int n = blockNum / splitNum;
int sStep = blockSize * big->unitSize; int sStep = blockSize * big->unitSize;
int tStep = 0; int tStep = 0;
for (int k = 0; k < splitNum; k++) {
XTensor * t = (XTensor*)smalls->GetItem(k); if(big->devID < 0){
XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID, for (int k = 0; k < splitNum; k++) {
(char*)big->data + k * sStep, sPitch, big->devID, XTensor * t = (XTensor*)smalls->GetItem(k);
mSize, n); XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
(char*)big->data + k * sStep, sPitch, big->devID,
mSize, n);
}
}
else{
#ifdef USE_CUDA
#ifdef STREAMED_MEMCPOPY
XStream * stream = GDevs.GPUs[big->devID].stream;
for (int k = 0; k < splitNum; k++) {
XTensor * t = (XTensor*)smalls->GetItem(k);
XMemCopy2DAsync((char*)t->data + k * tStep, tPitch, t->devID,
(char*)big->data + k * sStep, sPitch, big->devID,
mSize, n, stream);
}
stream->StreamSynchronize();
#else
for (int k = 0; k < splitNum; k++) {
XTensor * t = (XTensor*)smalls->GetItem(k);
XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
(char*)big->data + k * sStep, sPitch, big->devID,
mSize, n);
}
#endif
#else
ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
} }
} }
/* splitting with fewer kernel/api calls??? (i'm not sure about it!! may remove this later) */ /* splitting with fewer kernel/api calls??? (i'm not sure about it!! may remove this later) */
else { else {
int* dimSizeTMP = new int[MAX_TENSOR_DIM_NUM]; int* dimSizeTMP = new int[big->order + 1];
for (int i = 0; i < MAX_TENSOR_DIM_NUM; i++) for (int i = 0; i < big->order; i++)
dimSizeTMP[i] = -big->dimSize[i]; dimSizeTMP[i + 1] = -big->dimSize[i];
dimSizeTMP[whereToSplit] /= splitNum; dimSizeTMP[whereToSplit + 1] /= splitNum;
dimSizeTMP[big->order] = -splitNum; dimSizeTMP[0] = -splitNum;
XMem * mem = big->mem; XMem * mem = big->mem;
XTensor* tensorTMP = new XTensor(big->order + 1, dimSizeTMP, big->dataType, big->denseRatio, big->devID, mem); XTensor* tensorTMP = new XTensor(big->order + 1, dimSizeTMP, big->dataType, big->denseRatio, big->devID, mem);
...@@ -251,7 +301,7 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum) ...@@ -251,7 +301,7 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
dataTMP = first->data; dataTMP = first->data;
} }
else { else {
dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(mem->devID, size); dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(big->devID, size);
} }
tensorTMP->data = dataTMP; tensorTMP->data = dataTMP;
...@@ -270,13 +320,12 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum) ...@@ -270,13 +320,12 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
delete[] dimSizeTMP; delete[] dimSizeTMP;
tensorTMP->data = NULL; tensorTMP->data = NULL;
dataTMP = NULL;
delete tensorTMP; delete tensorTMP;
if ((!uniform) && (mem != NULL)) if ((!uniform) && (mem != NULL))
mem->ReleaseBuf(mem->devID, size); mem->ReleaseBuf(mem->devID, size);
else else
XMemFree(mem->devID, dataTMP); XMemFree(big->devID, dataTMP);
} }
} }
......
...@@ -26,6 +26,8 @@ ...@@ -26,6 +26,8 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
#define STREAMED_MEMCPOPY
/* /*
transform a tensor by splitting it transform a tensor by splitting it
e.g., (M, N) -> (M, N/3, 3) e.g., (M, N) -> (M, N/3, 3)
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-28
* It is extremely hot these days and I cannot sleep well. Fortunately we had
* a good lunch of Steamed Cold Noodles. This made me feel much better!
*/
#include "Transpose.h"
#include "Merge.h"
#include "../../XUtility.h"
#include "../../XName.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor transposition of dimensions i and j
b = transposed(a)
For an input tensor a, we transpose its dimensions i and j.
E.g., let a be a tensor of size x * y * z, i = 0, j = 2,
then the output will be a tensor of size z * y * x.
>> a - the input tensor
>> b - the output tensor by transpose tensor a with specified dimensions i and j
>> i - the first dimension to transpose
>> j - the second dimension to transpose
*/
void _Transpose(const XTensor * a, XTensor * b, const int i, const int j)
{
CheckNTErrors(a && b, "Empty tensors");
CheckNTErrors(a->order == b->order, "Wrong tensor orders");
CheckNTErrors(a->unitNum == b->unitNum && a->unitSize == b->unitSize, "Wrong tensor sizes");
CheckNTErrors(a->order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a->order > j && j >= 0, "index of dimension is out of scope!");
for(int k = 0; k < a->order; k++){
if(k == i){
CheckNTErrors(a->dimSize[k] == b->dimSize[j], "Wrong dimension size in transposition");
}
else if(k == j){
CheckNTErrors(a->dimSize[k] == b->dimSize[i], "Wrong dimension size in transposition");
}
else{
CheckNTErrors(a->dimSize[k] == b->dimSize[k], "Wrong dimension size in transposition");
}
}
if(i == j){
XMemCopy(b->data, b->devID, a->data, a->devID, b->unitNum * b->unitSize);
}
else{
int I = MIN(i, j);
int J = MAX(i, j);
int * dims = new int[a->order + 1];
for(int k = 0; k <= J; k++)
dims[k] = a->dimSize[k];
dims[J + 1] = -1;
for(int k = J + 1; k < a->order; k++)
dims[k + 1] = a->dimSize[k];
/* reshape tensor a from (..., n_I, ..., n_J, ...) => (..., n_I, ..., n_J, 1, ...) */
XTensor * aTMP = new XTensor(a->order + 1, dims, a->dataType, a->denseRatio, a->devID, a->mem);
aTMP->data = a->data;
for(int k = 0; k < I; k++)
dims[k] = a->dimSize[k];
for(int k = I + 1; k <= J; k++)
dims[k - 1] = a->dimSize[k];
dims[J] = a->dimSize[I];
for(int k = J + 1; k < a->order; k++)
dims[k] = a->dimSize[k];
/* reshape tensor b from (..., m_I, ..., m_J, ...) => (..., m_J, m_I, ...) */
b->Reshape(b->order, dims);
/* tensor (..., n_I, ..., n_J, 1, ...) => tensor (..., m_J, m_I, ...) */
_Merge(aTMP, b, J + 1, I);
memcpy(dims, a->dimSize, sizeof(int) * a->order);
dims[I] = a->dimSize[J];
dims[J] = a->dimSize[I];
/* reshape tensor b from (..., m_J, m_I, ...) => (..., m_J, ..., m_I, ...) */
b->Reshape(b->order, dims);
aTMP->data = NULL;
delete[] dims;
delete aTMP;
}
}
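/*
Worked trace of the reshape+merge strategy above for a of size x * y * z with i = 0, j = 2
(so I = 0, J = 2):
    aTMP view of a:          (x, y, z, 1)
    b reshaped, step 1:      (y, z, x)    - dimension I is moved behind dimension J
    _Merge(aTMP, b, J+1, I)  fills b's data in that layout
    b reshaped, step 2:      (z, y, x)    - the final transposed shape
*/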
/*
tensor transposition of dimensions i and j (return a XTensor structure).
make a new tensor to keep the result and return it.
b = transposed(a)
For an input tensor a, we transpose its dimensions i and j.
E.g., let a be a tensor of size x * y * z, i = 0, j = 2,
then the output will be a tensor of size z * y * x.
>> a - the input tensor
>> i - the first dimension to transpose
>> j - the second dimension to transpose
<< return - the output tensor by transpose tensor a with specified dimensions i and j
*/
XTensor Transpose(const XTensor &a, const int i, const int j)
{
CheckNTErrors(a.order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a.order > j && j >= 0, "index of dimension is out of scope!");
int order = a.order;
int * dimSize = new int[order];
for(int k = 0; k < order; k++){
if(k == i)
dimSize[k] = a.dimSize[j];
else if(k == j)
dimSize[k] = a.dimSize[i];
else
dimSize[k] = a.dimSize[k];
}
float dr = (!a.isSparse) ? 1.0F : a.denseRatio;
XTensor b(order, dimSize, a.dataType, dr, a.devID, a.mem);
b.SetTMP();
/* call _Transpose function */
_Transpose(&a, &b, i, j);
/* tensor connection */
XLink::MakeLink(&a, NULL, &b, SHAPE_TRANSPOSE);
XLink::AddParamToHeadInt(&b, i);
XLink::AddParamToHeadInt(&b, j);
/* destroy variables */
delete[] dimSize;
return b;
}
}
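/*
Hedged usage sketch (TransposeUsageSketch is hypothetical): swapping dimensions 0 and 2 of
an x * y * z tensor yields a z * y * x tensor; the functional form allocates the result and
records the operation through XLink for backward computation.
*/
void TransposeUsageSketch(const XTensor & a)
{
    XTensor b = Transpose(a, 0, 2);   /* b.dimSize = { a.dimSize[2], a.dimSize[1], a.dimSize[0] } */
}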
...@@ -27,27 +27,18 @@ ...@@ -27,27 +27,18 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
#define transpose _Transpose_
/* /*
generate a transposed 1D/2D tensor tensor transposition of dimensions i and j
b = transposed(a) b = transposed(a)
*/ */
void _Transpose(XTensor * a, XTensor * b); void _Transpose(const XTensor * a, XTensor * b, const int i, const int j);
/*
transpose a 1D/2D tensor (do it on site).
keep the result in the input tensor and return nothing.
a = transposed(a)
*/
void _TransposeMe(XTensor * a);
/* /*
make a transposed 1D/2D tensor (return a XTensor structure). tensor transposition of dimensions i and j (return a XTensor structure).
make a new tensor to keep the result and return it. make a new tensor to keep the result and return it.
b = transposed(a) b = transposed(a)
*/ */
XTensor Transpose(XTensor &a); XTensor Transpose(const XTensor &a, const int i, const int j);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
......
...@@ -32,12 +32,108 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -32,12 +32,108 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
insert a dimension by copying the blocks n times (where n is the size of the inserted dimension) insert a dimension by copying the blocks n times (where n is the size of the inserted dimension)
>> s - pointer to the source data array >> s - pointer to the source data array
>> blockSize - size of a block >> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of blocks to copy data
*/
template<class T>
__global__
void KernelUnsqueezeFlat(void * s, int blockSize, int totalSize, void * t, int n)
{
/* index of data items */
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= blockSize)
return;
T value = ((T*)s)[i];
T * tData = (T*)t;
__syncthreads();
for (int k = i; k < totalSize; k += blockSize)
tData[k] = value;
}
/*
insert a dimension by copying the blocks n times (where n is the size of the inserted dimension)
>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of blocks to copy data
*/
template<class T>
__global__
void KernelUnsqueezeFlatBigram(void * s, int blockSize, int totalSize, void * t, int n)
{
/* index of data items */
int i = (blockDim.x * blockIdx.x + threadIdx.x) * 2;
if (i >= blockSize)
return;
T value = ((T*)s)[i];
T value2 = ((T*)s)[i + 1];
T * tData = (T*)t;
__syncthreads();
for (int k = i; k < totalSize; k += blockSize){
tData[k] = value;
tData[k + 1] = value2;
}
}
/*
insert a dimension by copying the blocks n times (where n is the size of the inserted dimension)
>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of blocks to copy data
*/
template<class T>
__global__
void KernelUnsqueezeFlat2D(void * s, int blockSize, int totalSize, void * t, int n)
{
__shared__ T data[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ int offsets[MAX_CUDA_THREAD_NUM_PER_BLOCK];
/* index of data items */
int i = blockDim.x * blockIdx.x + threadIdx.x;
/* index of data items */
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= blockSize || j >= n)
return;
if(threadIdx.y == 0)
data[threadIdx.x] = ((T*)s)[i];
if(threadIdx.x == 0)
offsets[threadIdx.y] = blockSize * j;
__syncthreads();
((T*)t)[offsets[threadIdx.y] + i] = data[threadIdx.x];
}
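/*
Host-side reference (hypothetical) of what the "Flat" unsqueeze kernels above compute:
a single source block of blockSize elements is replicated n times in the target; the
kernels only differ in how threads are laid out over this loop nest.
*/
template<class T>
void UnsqueezeFlatReference(const T * s, int blockSize, T * t, int n)
{
    for (int i = 0; i < blockSize; i++)
        for (int k = 0; k < n; k++)
            t[k * blockSize + i] = s[i];
}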
/*
insert a dimension by copying the blocks n times (where n is the size of the inserted dimension)
>> s - pointer to the source data array
>> blockSize - size of a block
>> blockNum - number of the blocks >> blockNum - number of the blocks
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array >> t - pointer to the target data array
>> n - number of blocks to copy data
*/ */
template<class T> template<class T>
__global__ __global__
void KernelUnsqueeze(void * s, int blockSize, int blockNum, void * t, int n) void KernelUnsqueeze(void * s, int blockSize, int blockNum, int totalSize, void * t, int n)
{ {
/* index of data items */ /* index of data items */
int i = blockDim.x * blockIdx.x + threadIdx.x; int i = blockDim.x * blockIdx.x + threadIdx.x;
...@@ -51,11 +147,10 @@ void KernelUnsqueeze(void * s, int blockSize, int blockNum, void * t, int n) ...@@ -51,11 +147,10 @@ void KernelUnsqueeze(void * s, int blockSize, int blockNum, void * t, int n)
MTYPE offset = blockSize * j; MTYPE offset = blockSize * j;
T value = ((T*)s)[offset + i]; T value = ((T*)s)[offset + i];
T * tData = (T*)t + offset * n; T * tData = (T*)t + offset * n;
int length = blockSize * n;
__syncthreads(); __syncthreads();
for (int k = i; k < length; k += blockSize) for (int k = i; k < totalSize; k += blockSize)
tData[k] = value; tData[k] = value;
} }
...@@ -83,21 +178,71 @@ void _CudaUnsqueeze(const XTensor * a, XTensor * b, int dim, int dSize) ...@@ -83,21 +178,71 @@ void _CudaUnsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
int cudaGrids[3]; int cudaGrids[3];
int cudaBlocks[3]; int cudaBlocks[3];
GDevs.GetCudaThread2D(a->devID, blockSize, blockNumA, MAX_INT, cudaGrids, cudaBlocks);
int devIDBackup = 0; int devIDBackup = 0;
ProtectCudaDev(a->devID, devIDBackup); ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == X_FLOAT && a->dataType == X_FLOAT) { if(blockNumA > 1){
KernelUnsqueeze<float> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > GDevs.GetCudaThread2D(a->devID, blockSize, blockNumA, MAX_INT, cudaGrids, cudaBlocks);
(a->data, blockSize, blockNumA, b->data, dSize);
if (a->dataType == X_FLOAT && a->dataType == X_FLOAT) {
KernelUnsqueeze<float> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
(a->data, blockSize, blockNumA, blockSize * dSize, b->data, dSize);
}
else if (a->dataType == X_INT && a->dataType == X_INT) {
KernelUnsqueeze<int> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
(a->data, blockSize, blockNumA, blockSize * dSize, b->data, dSize);
}
else {
ShowNTErrors("TODO!");
}
}
else if(blockNumA == 1 && blockSize < MAX_CUDA_THREAD_NUM_PER_BLOCK){
GDevs.GetCudaThread2D(a->devID, blockSize, dSize, MAX_CUDA_THREAD_NUM_PER_BLOCK/4, cudaGrids, cudaBlocks);
if (a->dataType == X_FLOAT && a->dataType == X_FLOAT) {
KernelUnsqueezeFlat2D<float> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
(a->data, blockSize, blockSize * dSize, b->data, dSize);
}
else if (a->dataType == X_INT && a->dataType == X_INT) {
KernelUnsqueezeFlat2D<int> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
(a->data, blockSize, blockSize * dSize, b->data, dSize);
}
else {
ShowNTErrors("TODO!");
}
}
else if(blockNumA == 1 && blockSize % 2 == 0){
GDevs.GetCudaThread(a->devID, blockSize/2, cudaGrids, cudaBlocks);
if (a->dataType == X_FLOAT && a->dataType == X_FLOAT) {
KernelUnsqueezeFlatBigram<float> << <dim3(cudaGrids[0]), dim3(cudaBlocks[0]) >> >
(a->data, blockSize, blockSize * dSize, b->data, dSize);
}
else if (a->dataType == X_INT && a->dataType == X_INT) {
KernelUnsqueezeFlatBigram<int> << <dim3(cudaGrids[0]), dim3(cudaBlocks[0]) >> >
(a->data, blockSize, blockSize * dSize, b->data, dSize);
}
else {
ShowNTErrors("TODO!");
}
} }
else if (a->dataType == X_INT && a->dataType == X_INT) { else if(blockNumA == 1){
KernelUnsqueeze<int> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> > GDevs.GetCudaThread(a->devID, blockSize, cudaGrids, cudaBlocks);
(a->data, blockSize, blockNumA, b->data, dSize);
if (a->dataType == X_FLOAT && a->dataType == X_FLOAT) {
KernelUnsqueezeFlat<float> << <dim3(cudaGrids[0]), dim3(cudaBlocks[0]) >> >
(a->data, blockSize, blockSize * dSize, b->data, dSize);
}
else if (a->dataType == X_INT && a->dataType == X_INT) {
KernelUnsqueezeFlat<int> << <dim3(cudaGrids[0]), dim3(cudaBlocks[0]) >> >
(a->data, blockSize, blockSize * dSize, b->data, dSize);
}
else {
ShowNTErrors("TODO!");
}
} }
else { else{
ShowNTErrors("TODO!"); ShowNTErrors("Something is wrong!");
} }
BacktoCudaDev(a->devID, devIDBackup); BacktoCudaDev(a->devID, devIDBackup);
......
...@@ -117,7 +117,7 @@ void CudaGPUToCPUFlush(XTensor * tensor) ...@@ -117,7 +117,7 @@ void CudaGPUToCPUFlush(XTensor * tensor)
else { else {
tensor->dataHost = new char[tensor->unitNum * tensor->unitSize]; tensor->dataHost = new char[tensor->unitNum * tensor->unitSize];
if (tensor->data != NULL) if (tensor->data != NULL)
cudaMemcpy(tensor->dataHost, tensor->data, tensor->unitNum * tensor->unitSize, cudaMemcpyDeviceToHost); XMemCopy(tensor->dataHost, -1, tensor->data, tensor->devID, tensor->unitNum * tensor->unitSize);
else else
memset(tensor->dataHost, 0, tensor->unitNum * tensor->unitSize); memset(tensor->dataHost, 0, tensor->unitNum * tensor->unitSize);
} }
......
...@@ -38,6 +38,17 @@ log scale softmax y = log(e^x / \sum_{i} e^{x_i}) ...@@ -38,6 +38,17 @@ log scale softmax y = log(e^x / \sum_{i} e^{x_i})
*/ */
void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
{ {
CheckNTErrors(x && y, "Empty input tensors!");
CheckNTErrors(!x->isSparse && !y->isSparse, "TODO!");
if(leadDim < 0)
leadDim = x->order - 1;
if(y->dimSize[leadDim] == 1){
y->SetZeroAll();
return;
}
int leadDimRDI = x->order - leadDim - 1; int leadDimRDI = x->order - leadDim - 1;
if (!x->isSparse && !y->isSparse && if (!x->isSparse && !y->isSparse &&
x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE) x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
...@@ -68,25 +79,27 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -68,25 +79,27 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
blockSize = stride * dimensionSize; blockSize = stride * dimensionSize;
blockNum = y->unitNum / blockSize; blockNum = y->unitNum / blockSize;
max = NewTensor(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem); max = NewTensorBuf(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem);
sum = NewTensor(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem); sum = NewTensorBuf(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem);
max->data = mem != NULL ? (char*)mem->AllocBuf(mem->devID, max->unitNum * max->unitSize) : XMemAlloc(max->devID, max->unitNum * max->unitSize);
sum->data = mem != NULL ? (char*)mem->AllocBuf(mem->devID, sum->unitNum * sum->unitSize) : XMemAlloc(sum->devID, sum->unitNum * sum->unitSize);
_ReduceMax(x, max, leadDim); _ReduceMax(x, max, leadDim);
_ReduceSum(x, sum, leadDim, max, 1.0F, true); _ReduceSum(x, sum, leadDim, max, 1.0F, true);
if (x->devID >= 0) { if (x->devID >= 0) {
int dims[2]; if(leadDimRDI == 0){
dims[0] = -stride; blockSize = y->unitNum;
dims[1] = dimensionSize; blockNum = 1;
blockx = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); blockx = NewTensor2D(blockSize/dimensionSize, -dimensionSize, x->dataType, x->devID, mem);
blocky = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); blocky = NewTensor2D(blockSize/dimensionSize, -dimensionSize, x->dataType, x->devID, mem);
dims[0] = -stride; blockMax = NewTensor2D(blockSize/dimensionSize, -1, x->dataType, x->devID, mem);
dims[1] = 1; blockSum = NewTensor2D(blockSize/dimensionSize, -1, x->dataType, x->devID, mem);
blockMax = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); }
blockSum = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); else{
blockx = NewTensor2D(-stride, dimensionSize, x->dataType, x->devID, mem);
blocky = NewTensor2D(-stride, dimensionSize, x->dataType, x->devID, mem);
blockMax = NewTensor2D(-stride, 1, x->dataType, x->devID, mem);
blockSum = NewTensor2D(-stride, 1, x->dataType, x->devID, mem);
}
} }
for (int k = 0; k < blockNum; k++) { for (int k = 0; k < blockNum; k++) {
...@@ -123,7 +136,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -123,7 +136,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
blockMax->data = mp; blockMax->data = mp;
blockSum->data = sp; blockSum->data = sp;
#ifdef USE_CUDA #ifdef USE_CUDA
_CudaLogSoftmaxSumMax(blockx, blocky, leadDim, blockSum, blockMax); if(leadDimRDI == 0)
_CudaLogSoftmaxSumMax(blockx, blocky, 1, blockSum, blockMax);
else
_CudaLogSoftmaxSumMax(blockx, blocky, leadDim, blockSum, blockMax);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif #endif
...@@ -135,18 +151,8 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -135,18 +151,8 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
} }
if (x->devID < 0) { if (x->devID < 0) {
if (mem != NULL) { DelTensorBuf(max);
mem->ReleaseBuf(mem->devID, max->unitNum * max->unitSize); DelTensorBuf(sum);
mem->ReleaseBuf(mem->devID, sum->unitNum * sum->unitSize);
}
else {
XMemFree(max->devID, max->data);
XMemFree(sum->devID, sum->data);
max->data = NULL;
sum->data = NULL;
}
delete max;
delete sum;
} }
else { else {
delete blockx; delete blockx;
...@@ -184,6 +190,27 @@ XTensor LogSoftmax(const XTensor &x, int leadDim) ...@@ -184,6 +190,27 @@ XTensor LogSoftmax(const XTensor &x, int leadDim)
return y; return y;
} }
/*
log scale softmax y = log(e^x / \sum_{i} e^{x_i})
keep the result in the output tensor y (this overload does not create and return a new tensor)
>> x - input vector
>> y - output vector
>> leadDim - leading dimension (along which we perform reduction)
*/
void LogSoftmax(const XTensor &x, XTensor &y, int leadDim)
{
if(!XTensor::IsSameShaped(&x, &y))
InitTensor(&y, &x);
/* call _LogSoftmax function */
_LogSoftmax(&x, &y, leadDim);
/* tensor connection */
XLink::MakeLink(&x, NULL, &y, FUNC_LOGSOFTMAX);
XLink::AddParamToHeadInt(&y, leadDim);
}
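The overload above fills a caller-supplied tensor `y` instead of returning a new one, re-initializing `y` when its shape differs from `x`. A minimal usage sketch (illustrative only; it reuses calls that appear elsewhere in this commit, and the data values are made up):

```cpp
/* sketch: apply the new overload to a small (3, 2) tensor,
   normalizing along the last dimension (leadDim = 1) */
int order = 2;
int dimSize[2] = {3, 2};

XTensor * x = NewTensor(order, dimSize);
XTensor y;                 /* shaped inside LogSoftmax via InitTensor */

DTYPE xData[3][2] = { {0.0F, 1.0F},
                      {2.0F, 3.0F},
                      {4.0F, 5.0F} };
x->SetData(xData, 6);

LogSoftmax(*x, y, 1);      /* y now holds log(e^x / \sum_i e^{x_i}) row by row */

delete x;
```

Internally, `_LogSoftmax` appears to use the usual max-shifted form $y_i = (x_i - m) - \log \sum_j e^{x_j - m}$ with $m = \max_j x_j$, which is what the `_ReduceMax` and shifted `_ReduceSum` calls above compute.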
/* /*
backward computation for dense matrices with default data type backward computation for dense matrices with default data type
......
...@@ -33,6 +33,9 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim); ...@@ -33,6 +33,9 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (return a XTensor structure) */ /* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (return a XTensor structure) */
XTensor LogSoftmax(const XTensor &x, int leadDim); XTensor LogSoftmax(const XTensor &x, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (with both x and y given as arguments) */ /* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (with both x and y given as arguments) */
void LogSoftmax(const XTensor &x, XTensor &y, int leadDim);
/* de/dx */ /* de/dx */
void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx, XTensor * dedy, XTensor * dedx,
......
...@@ -24,7 +24,7 @@ ...@@ -24,7 +24,7 @@
#include "../XDevice.h" #include "../XDevice.h"
#include "../core/math/Power.h" #include "../core/math/Power.h"
#include "../core/math/ScaleAndShift.h" #include "../core/math/ScaleAndShift.h"
#include "../core/math/Log.h" #include "../core/math/Unary.h"
#include "../core/arithmetic/Negate.h" #include "../core/arithmetic/Negate.h"
#include "../core/arithmetic/Sum.h" #include "../core/arithmetic/Sum.h"
#include "../core/arithmetic/Multiply.h" #include "../core/arithmetic/Multiply.h"
......
...@@ -37,6 +37,9 @@ softmax y = e^x / \sum_{i} e^{x_i} ...@@ -37,6 +37,9 @@ softmax y = e^x / \sum_{i} e^{x_i}
*/ */
void _Softmax(const XTensor * x, XTensor * y, int leadDim) void _Softmax(const XTensor * x, XTensor * y, int leadDim)
{ {
if(leadDim < 0)
leadDim = x->order - 1;
int leadDimRDI = x->order - leadDim - 1; int leadDimRDI = x->order - leadDim - 1;
if(!x->isSparse && !y->isSparse && x->dataType == y->dataType){ if(!x->isSparse && !y->isSparse && x->dataType == y->dataType){
int * dimSize = new int[x->order - 1]; int * dimSize = new int[x->order - 1];
......
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/ */
#include "../core/math/Unary.h"
#include "TAbsolute.h" #include "TAbsolute.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -30,14 +31,14 @@ Set every entry to its absolute value. ...@@ -30,14 +31,14 @@ Set every entry to its absolute value.
bool TestAbsolute1() bool TestAbsolute1()
{ {
/* a tensor of size (3, 2) */ /* a tensor of size (3, 2) */
int aOrder = 2; int order = 2;
int * aDimSize = new int[aOrder]; int * dimSize = new int[order];
aDimSize[0] = 3; dimSize[0] = 3;
aDimSize[1] = 2; dimSize[1] = 2;
int aUnitNum = 1; int unitNum = 1;
for (int i = 0; i < aOrder; i++) for (int i = 0; i < order; i++)
aUnitNum *= aDimSize[i]; unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, -2.0F}, DTYPE aData[3][2] = { {1.0F, -2.0F},
{0.5F, -4.0F}, {0.5F, -4.0F},
...@@ -50,14 +51,14 @@ bool TestAbsolute1() ...@@ -50,14 +51,14 @@ bool TestAbsolute1()
bool cpuTest = true; bool cpuTest = true;
/* create tensors */ /* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize); XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(aOrder, aDimSize); XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize); XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser; XTensor bUser;
/* initialize variables */ /* initialize variables */
a->SetData(aData, aUnitNum); a->SetData(aData, unitNum);
aMe->SetData(aData, aUnitNum); aMe->SetData(aData, unitNum);
/* call Absolute function */ /* call Absolute function */
_Absolute(a, b); _Absolute(a, b);
...@@ -65,21 +66,21 @@ bool TestAbsolute1() ...@@ -65,21 +66,21 @@ bool TestAbsolute1()
bUser = Absolute(*a); bUser = Absolute(*a);
/* check results */ /* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F); cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
bool gpuTest = true; bool gpuTest = true;
/* create tensor */ /* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU; XTensor bUserGPU;
/* Initialize variables */ /* Initialize variables */
aGPU->SetData(aData, aUnitNum); aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, aUnitNum); aMeGPU->SetData(aData, unitNum);
/* call Absolute function */ /* call Absolute function */
_Absolute(aGPU, bGPU); _Absolute(aGPU, bGPU);
...@@ -87,7 +88,7 @@ bool TestAbsolute1() ...@@ -87,7 +88,7 @@ bool TestAbsolute1()
bUserGPU = Absolute(*aGPU); bUserGPU = Absolute(*aGPU);
/* check results */ /* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F); gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */ /* destroy variables */
delete a; delete a;
...@@ -96,7 +97,7 @@ bool TestAbsolute1() ...@@ -96,7 +97,7 @@ bool TestAbsolute1()
delete aGPU; delete aGPU;
delete bGPU; delete bGPU;
delete aMeGPU; delete aMeGPU;
delete[] aDimSize; delete[] dimSize;
return cpuTest && gpuTest; return cpuTest && gpuTest;
#else #else
...@@ -104,7 +105,7 @@ bool TestAbsolute1() ...@@ -104,7 +105,7 @@ bool TestAbsolute1()
delete a; delete a;
delete b; delete b;
delete aMe; delete aMe;
delete[] aDimSize; delete[] dimSize;
return cpuTest; return cpuTest;
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -22,7 +22,6 @@ ...@@ -22,7 +22,6 @@
#ifndef __TEST_ABSOLUTE_H__ #ifndef __TEST_ABSOLUTE_H__
#define __TEST_ABSOLUTE_H__ #define __TEST_ABSOLUTE_H__
#include "../core/arithmetic/Absolute.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TCos.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Cos function.
Set every entry to its cosine value.
*/
bool TestCos1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {0.5403F, -0.4161F},
{0.5403F, -0.4161F},
{1.0F, 0.8776F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Cos function */
_Cos(a, b);
_CosMe(aMe);
bUser = Cos(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Cos function */
_Cos(aGPU, bGPU);
_CosMe(aMeGPU);
bUserGPU = Cos(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Cos Function */
bool TestCos()
{
XPRINT(0, stdout, "[TEST Cos] set every entry to its cosine value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestCos1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
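The tests in this file (and the Sin/Exp/Log tests below) call three flavours of each unary operation: `_Cos(a, b)`, the in-place `_CosMe(a)`, and the returning `Cos(*a)`. According to the commit message these are now generated by a macro in `core/math/Unary.h`, which is not shown in this diff. A rough, purely illustrative sketch of such a macro for CPU float tensors (names, paths, and details are assumptions, not the repository's actual definition):

```cpp
#include <math.h>
#include "../../XTensor.h"   /* for XTensor and DTYPE (assumed path) */

/* illustrative only: generate _Op(a, b) and the in-place _OpMe(a)
   from a single definition; the real Unary.h may differ */
#define SIMPLE_UNARY_FUNCTION(funcName, origFunc)              \
void _##funcName(const XTensor * a, XTensor * b)               \
{                                                              \
    DTYPE * ad = (DTYPE*)a->data;                              \
    DTYPE * bd = (DTYPE*)b->data;                              \
    for (int i = 0; i < a->unitNum; i++)                       \
        bd[i] = (DTYPE)origFunc(ad[i]);                        \
}                                                              \
void _##funcName##Me(XTensor * a)                              \
{                                                              \
    _##funcName(a, a);                                         \
}

SIMPLE_UNARY_FUNCTION(Cos, cos)
SIMPLE_UNARY_FUNCTION(Sin, sin)
SIMPLE_UNARY_FUNCTION(Exp, exp)
```

The returning form used by the tests (`bUser = Cos(*a)`) would additionally create the result tensor and record the operation with `XLink::MakeLink`, as the `LogSoftmax` overload earlier in this diff does.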
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_SIN_H__
#define __TEST_SIN_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Sin Function */
bool TestSin();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SIN_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_COS_H__
#define __TEST_COS_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Cos Function */
bool TestCos();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_COS_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "TDiv.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0.
*/
bool TestDiv1()
{
/* a source tensor of size (2, 2) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 2;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 2) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 2;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 2) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 2;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE sData2[2][2] = { {1.0F, 1.0F},
{4.0F, 9.0F} };
DTYPE answer[2][2] = { {0.0F, 1.0F},
{0.5F, 0.3333F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
XTensor * tMe = NewTensor(tOrder, tDimSize);
XTensor tUser;
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
tMe->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call Div function */
_Div(s1, s2, t, 0, 0);
_DivMe(tMe, s2, 0, 0);
tUser = Div(*s1, *s2, 0);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum, 1e-4F) &&
tMe->CheckData(answer, tUnitNum, 1e-4F) &&
tUser.CheckData(answer, tUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor tUserGPU;
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
tMeGPU->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call Div function */
_Div(sGPU1, sGPU2, tGPU, 0, 0);
_DivMe(tMeGPU, sGPU2, 0, 0);
tUserGPU = Div(*sGPU1, *sGPU2, 0);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum, 1e-4F) &&
tMeGPU->CheckData(answer, tUnitNum, 1e-4F) &&
tUserGPU.CheckData(answer, tUnitNum, 1e-4F);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete tMeGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Div Function */
bool TestDiv()
{
XPRINT(0, stdout, "[TEST Div] element-wise division of two tensors \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestDiv1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
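For reference, the expected values in case 1 follow directly from the element-wise rule $c(i) = a(i)/b(i) + \alpha \, c(i)$ with $\alpha = 0$:

$$
c = \left(\begin{matrix}0/1 & 1/1 \\ 2/4 & 3/9\end{matrix}\right) = \left(\begin{matrix}0 & 1 \\ 0.5 & 0.3333\end{matrix}\right)
$$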
...@@ -16,19 +16,20 @@ ...@@ -16,19 +16,20 @@
*/ */
/* /*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-06-15 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/ */
#ifndef __TEST_MATRIXMULBATCHEDCPU_H__ #ifndef __TEST_DIV_H__
#define __TEST_MATRIXMULBATCHEDCPU_H__ #define __TEST_DIV_H__
#include "../core/arithmetic/MatrixMULBatchedCPU.h" #include "../core/arithmetic/Div.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for MatrixMulBatchedCPU Function */ /* test for Div Function */
extern "C" extern "C"
bool TestMatrixMulBatchedCPU(); bool TestDiv();
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_MATRIXMULBATCHEDCPU_H__
#endif // __TEST_DIV_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TExp.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Exp function.
Set every entry to its exponential value.
*/
bool TestExp1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {2.7183F, 7.3891F},
{0.3679F, 0.1353F},
{1.0F, 1.6487F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Exp function */
_Exp(a, b);
_ExpMe(aMe);
bUser = Exp(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Exp function */
_Exp(aGPU, bGPU);
_ExpMe(aMeGPU);
bUserGPU = Exp(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Exp Function */
bool TestExp()
{
XPRINT(0, stdout, "[TEST Exp] set every entry to its exponent value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestExp1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -16,20 +16,16 @@ ...@@ -16,20 +16,16 @@
*/ */
/* /*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/ */
#ifndef __MATRIXMULBATCHEDCPU_H__ #ifndef __TEST_EXP_H__
#define __MATRIXMULBATCHEDCPU_H__ #define __TEST_EXP_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* matrix multiplication in batch mode (CPU code) */ /* test for Exp Function */
void _MatrixMULBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA, const XList * b, MATRIX_TRANS_TYPE transposedB, bool TestExp();
XList * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_EXP_H__
#endif // __MATRIXMULBATCHEDCPU_H__
\ No newline at end of file
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/ */
#include "../core/math/Unary.h"
#include "TLog.h" #include "TLog.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -30,14 +31,14 @@ Set every entry to its log value. ...@@ -30,14 +31,14 @@ Set every entry to its log value.
bool TestLog1() bool TestLog1()
{ {
/* a tensor of size (3, 2) */ /* a tensor of size (3, 2) */
int aOrder = 2; int order = 2;
int * aDimSize = new int[aOrder]; int * dimSize = new int[order];
aDimSize[0] = 3; dimSize[0] = 3;
aDimSize[1] = 2; dimSize[1] = 2;
int aUnitNum = 1; int unitNum = 1;
for (int i = 0; i < aOrder; i++) for (int i = 0; i < order; i++)
aUnitNum *= aDimSize[i]; unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F}, DTYPE aData[3][2] = { {1.0F, 2.0F},
{0.5F, 4.0F}, {0.5F, 4.0F},
...@@ -50,14 +51,14 @@ bool TestLog1() ...@@ -50,14 +51,14 @@ bool TestLog1()
bool cpuTest = true; bool cpuTest = true;
/* create tensors */ /* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize); XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(aOrder, aDimSize); XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize); XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser; XTensor bUser;
/* initialize variables */ /* initialize variables */
a->SetData(aData, aUnitNum); a->SetData(aData, unitNum);
aMe->SetData(aData, aUnitNum); aMe->SetData(aData, unitNum);
/* call Log function */ /* call Log function */
_Log(a, b); _Log(a, b);
...@@ -65,21 +66,21 @@ bool TestLog1() ...@@ -65,21 +66,21 @@ bool TestLog1()
bUser = Log(*a); bUser = Log(*a);
/* check results */ /* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F); cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
bool gpuTest = true; bool gpuTest = true;
/* create tensor */ /* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU; XTensor bUserGPU;
/* Initialize variables */ /* Initialize variables */
aGPU->SetData(aData, aUnitNum); aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, aUnitNum); aMeGPU->SetData(aData, unitNum);
/* call Log function */ /* call Log function */
_Log(aGPU, bGPU); _Log(aGPU, bGPU);
...@@ -87,7 +88,7 @@ bool TestLog1() ...@@ -87,7 +88,7 @@ bool TestLog1()
bUserGPU = Log(*aGPU); bUserGPU = Log(*aGPU);
/* check results */ /* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F); gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */ /* destroy variables */
delete a; delete a;
...@@ -96,7 +97,7 @@ bool TestLog1() ...@@ -96,7 +97,7 @@ bool TestLog1()
delete aGPU; delete aGPU;
delete bGPU; delete bGPU;
delete aMeGPU; delete aMeGPU;
delete[] aDimSize; delete[] dimSize;
return cpuTest && gpuTest; return cpuTest && gpuTest;
#else #else
...@@ -104,7 +105,7 @@ bool TestLog1() ...@@ -104,7 +105,7 @@ bool TestLog1()
delete a; delete a;
delete b; delete b;
delete aMe; delete aMe;
delete[] aDimSize; delete[] dimSize;
return cpuTest; return cpuTest;
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -22,8 +22,6 @@ ...@@ -22,8 +22,6 @@
#ifndef __TEST_LOG_H__ #ifndef __TEST_LOG_H__
#define __TEST_LOG_H__ #define __TEST_LOG_H__
#include "../core/math/Log.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Log Function */ /* test for Log Function */
......
...@@ -16,8 +16,8 @@ ...@@ -16,8 +16,8 @@
*/ */
/* /*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02
*/ */
#ifndef __TEST_LOGSOFTMAX_H__ #ifndef __TEST_LOGSOFTMAX_H__
#define __TEST_LOGSOFTMAX_H__ #define __TEST_LOGSOFTMAX_H__
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-06-15
*/
#include "TMatrixMULBatchedCPU.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: matrix multiplication in batch mode (CPU code).
In this case, aList=2*(2, 3), bList=2*(3, 2) -> c=2*(2, 2), transposedA=X_NOTRANS, transposedB=X_NOTRANS.
*/
bool TestMatrixMulBatchedCPU1()
{
/* create list */
XList * aList = new XList();
XList * bList = new XList();
XList * cList = new XList();
/* a source tensor of size (2, 3) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 3;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a source tensor of size (3, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 3;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
/* a target tensor of size (2, 2) */
int cOrder = 2;
int * cDimSize = new int[cOrder];
cDimSize[0] = 2;
cDimSize[1] = 2;
int cUnitNum = 1;
for (int i = 0; i < cOrder; i++)
cUnitNum *= cDimSize[i];
DTYPE aData1[2][3] = { {1.0F, 2.0F, 3.0F},
{-4.0F, 5.0F, 6.0F} };
DTYPE aData2[2][3] = { {1.0F, -2.0F, -3.0F},
{-4.0F, 3.0F, 2.0F} };
DTYPE bData1[3][2] = { {0.0F, -1.0F},
{1.0F, 2.0F},
{2.0F, 1.0F} };
DTYPE bData2[3][2] = { {0.0F, 1.0F},
{3.0F, 2.0F},
{2.0F, 1.0F} };
DTYPE answer1[2][2] = { {8.0F, 6.0F},
{17.0F, 20.0F} };
DTYPE answer2[2][2] = { {-12.0F, -6.0F},
{13.0F, 4.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a1 = NewTensor(aOrder, aDimSize);
XTensor * a2 = NewTensor(aOrder, aDimSize);
XTensor * b1 = NewTensor(bOrder, bDimSize);
XTensor * b2 = NewTensor(bOrder, bDimSize);
XTensor * c1 = NewTensor(cOrder, cDimSize);
XTensor * c2 = NewTensor(cOrder, cDimSize);
/* initialize variables */
a1->SetData(aData1, aUnitNum);
a2->SetData(aData2, aUnitNum);
b1->SetData(bData1, aUnitNum);
b2->SetData(bData2, aUnitNum);
c1->SetZeroAll();
c2->SetZeroAll();
/* add tensors to list */
aList->Add(a1);
aList->Add(a2);
bList->Add(b1);
bList->Add(b2);
cList->Add(c1);
cList->Add(c2);
/* call MatrixMULBatchedCPU function */
_MatrixMULBatchedCPU(aList, X_NOTRANS, bList, X_NOTRANS, cList);
/* check results */
cpuTest = c1->CheckData(answer1, cUnitNum) && c2->CheckData(answer2, cUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensors */
XTensor * aGPU1 = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * aGPU2 = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU1 = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU2 = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU1 = NewTensor(cOrder, cDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU2 = NewTensor(cOrder, cDimSize, X_FLOAT, 1.0F, 0);
/* initialize variables */
aGPU1->SetData(aData1, aUnitNum);
aGPU2->SetData(aData2, aUnitNum);
bGPU1->SetData(bData1, aUnitNum);
bGPU2->SetData(bData2, aUnitNum);
cGPU1->SetZeroAll();
cGPU2->SetZeroAll();
/* clear list */
aList->Clear();
bList->Clear();
cList->Clear();
/* add tensors to list */
aList->Add(aGPU1);
aList->Add(aGPU2);
bList->Add(bGPU1);
bList->Add(bGPU2);
cList->Add(cGPU1);
cList->Add(cGPU2);
/* call MatrixMULBatchedCPU function */
_MatrixMULBatchedCPU(aList, X_NOTRANS, bList, X_NOTRANS, cList);
/* check results */
gpuTest = cGPU1->CheckData(answer1, cUnitNum) && gpuTest;
gpuTest = cGPU2->CheckData(answer2, cUnitNum) && gpuTest;
/* destroy variables */
delete a1;
delete a2;
delete b1;
delete b2;
delete c1;
delete c2;
delete aGPU1;
delete aGPU2;
delete bGPU1;
delete bGPU2;
delete cGPU1;
delete cGPU2;
delete[] aDimSize;
delete[] bDimSize;
delete[] cDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a1;
delete a2;
delete b1;
delete b2;
delete c1;
delete c2;
delete[] aDimSize;
delete[] bDimSize;
delete[] cDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for MatrixMulBatchedCPU Function */
extern "C"
bool TestMatrixMulBatchedCPU()
{
XPRINT(0, stdout, "[TEST MATRIXMULBATCHEDCPU] matrix multiplication in batch mode (CPU code) \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestMatrixMulBatchedCPU1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
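The expected values in this batched test are two independent $2 \times 3$ by $3 \times 2$ products computed pair by pair over the lists; written out for the first pair:

$$
c_1 = a_1 \times b_1 = \left(\begin{matrix}1 & 2 & 3 \\ -4 & 5 & 6\end{matrix}\right) \times \left(\begin{matrix}0 & -1 \\ 1 & 2 \\ 2 & 1\end{matrix}\right) = \left(\begin{matrix}8 & 6 \\ 17 & 20\end{matrix}\right)
$$

and `answer2` is obtained the same way from `aData2` and `bData2`.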
...@@ -25,133 +25,10 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -25,133 +25,10 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* /*
case 1: element-wise product of two tensors case 1: element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 1) (2, 1) -> (2, 1), leadingDim=0, alpha=0.
*/
bool TestMultiply1()
{
/* a source tensor of size (2, 1) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 1;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 1) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 1;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 1) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 1;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][1] = { {0.0F},
{1.0F} };
DTYPE sData2[2][1] = { {2.0F},
{3.0F} };
DTYPE answer[2][1] = { {0.0F},
{3.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
XTensor * tMe = NewTensor(tOrder, tDimSize);
XTensor tUser;
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
tMe->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call Multiply function */
_Multiply(s1, s2, t, 0, 0);
_MultiplyMe(tMe, s2, 0, 0);
tUser = Multiply(*s1, *s2, 0);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum)
&& tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor tUserGPU;
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
tMeGPU->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call Multiply function */
_Multiply(sGPU1, sGPU2, tGPU, 0, 0);
_MultiplyMe(tMeGPU, sGPU2, 0, 0);
tUserGPU = Multiply(*sGPU1, *sGPU2, 0);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum)
&& tMeGPU->CheckData(answer, tUnitNum) && tUserGPU.CheckData(answer, tUnitNum);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete tMeGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i) c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0. In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0.
*/ */
bool TestMultiply2() bool TestMultiply1()
{ {
/* a source tensor of size (2, 2) */ /* a source tensor of size (2, 2) */
int sOrder1 = 2; int sOrder1 = 2;
...@@ -212,8 +89,9 @@ bool TestMultiply2() ...@@ -212,8 +89,9 @@ bool TestMultiply2()
tUser = Multiply(*s1, *s2, 0); tUser = Multiply(*s1, *s2, 0);
/* check results */ /* check results */
cpuTest = t->CheckData(answer, tUnitNum) cpuTest = t->CheckData(answer, tUnitNum) &&
&& tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum); tMe->CheckData(answer, tUnitNum) &&
tUser.CheckData(answer, tUnitNum);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
...@@ -270,113 +148,6 @@ bool TestMultiply2() ...@@ -270,113 +148,6 @@ bool TestMultiply2()
#endif // USE_CUDA #endif // USE_CUDA
} }
/*
case 3: element-wise product of two tensors, c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=1, alpha=0.
*/
bool TestMultiply3()
{
/* a source tensor of size (2, 2) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 2;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 2) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 2;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 2) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 2;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE sData2[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE answer[2][2] = { {0.0F, 1.0F},
{4.0F, 9.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call MultiplyElementWise function */
_Multiply(s1, s2, t, 0, 1);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call MultiplyElementWise function */
_Multiply(sGPU1, sGPU2, tGPU, 0, 1);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */ /* other cases */
/* /*
TODO!! TODO!!
...@@ -398,26 +169,6 @@ bool TestMultiply() ...@@ -398,26 +169,6 @@ bool TestMultiply()
else else
XPRINT(0, stdout, ">> case 1 passed!\n"); XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestMultiply2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* case 3 test */
caseFlag = TestMultiply3();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 3 failed!\n");
}
else
XPRINT(0, stdout, ">> case 3 passed!\n");
/* other cases test */ /* other cases test */
/* /*
TODO!! TODO!!
......
...@@ -19,16 +19,17 @@ ...@@ -19,16 +19,17 @@
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15
*/ */
#ifndef __TEST_MULTIPLYELEMENTWISE_H__ #ifndef __TEST_MULTIPLY_H__
#define __TEST_MULTIPLYELEMENTWISE_H__ #define __TEST_MULTIPLY_H__
#include "../core/arithmetic/Multiply.h" #include "../core/arithmetic/Multiply.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for MultiplyElementWise Function */ /* test for Multiply Function */
extern "C" extern "C"
bool TestMultiply(); bool TestMultiply();
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_MULTIPLYELEMENTWISE_H__
#endif // __TEST_MULTIPLY_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TSin.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Sin function.
Set every entry to its sine value.
*/
bool TestSin1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {0.8415F, 0.9093F},
{-0.8415F, -0.9093F},
{0.0F, 0.4794F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Sin function */
_Sin(a, b);
_SinMe(aMe);
bUser = Sin(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Sin function */
_Sin(aGPU, bGPU);
_SinMe(aMeGPU);
bUserGPU = Sin(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Sin Function */
bool TestSin()
{
XPRINT(0, stdout, "[TEST Sin] set every entry to its sine value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSin1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -16,26 +16,16 @@ ...@@ -16,26 +16,16 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/ */
#include "Absolute.h" #ifndef __TEST_SIN_H__
#define __TEST_SIN_H__
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA /* test for Sin Function */
bool TestSin();
/* set each entry to its absolute value (CUDA Kernel) */ } // namespace nts(NiuTrans.Tensor)
__global__ #endif // __TEST_SIN_H__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
__global__
void KernelAbsolute(__half * a, __half * b, int size);
/* set each entry to its absolute value */
void _CudaAbsolute(const XTensor * a, XTensor * b);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "TSub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* case 1: tensor subtraction c = a - b * \beta */
bool TestSub1()
{
/* a tensor of size (2, 4) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 2;
dimSize[1] = 4;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
{-7.0F, -9.0F, -11.0F, -13.0F} };
DTYPE answer[2][4] = { {-1.0F, 2.0F, 5.0F, 8.0F},
{11.0F, 14.0F, 17.0F, 20.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * c = NewTensor(order, dimSize);
XTensor * cMe = NewTensor(order, dimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, unitNum);
cMe->SetData(aData, unitNum);
b->SetData(bData, unitNum);
c->SetZeroAll();
/* call Sub function */
_Sub(a, b, c);
_SubMe(cMe, b);
cUser = Sub(*a, *b);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
cMeGPU->SetData(aData, unitNum);
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
/* call Sub function */
_Sub(aGPU, bGPU, cGPU);
_SubMe(cMeGPU, bGPU);
cUserGPU = Sub(*aGPU, *bGPU);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
&& cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* case 2: tensor subtraction c = a - b * \beta */
bool TestSub2()
{
/* a tensor of size (2, 4) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 2;
dimSize[1] = 4;
int unitNum = 1;
for (int i = 0; i < order; i++) {
unitNum *= dimSize[i];
}
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
{-7.0F, -9.0F, -11.0F, -13.0F} };
DTYPE answer[2][4] = { {-0.5F, 1.5F, 3.5F, 5.5F},
{7.5F, 9.5F, 11.5F, 13.5F} };
float beta = 0.5F;
/* CPU test */
bool cpuTest = true;
/* create tensor */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * c = NewTensor(order, dimSize);
XTensor * cMe = NewTensor(order, dimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, unitNum);
cMe->SetData(aData, unitNum);
b->SetData(bData, unitNum);
c->SetZeroAll();
/* call Sub function */
_Sub(a, b, c, beta);
_SubMe(cMe, b, beta);
cUser = Sub(*a, *b, beta);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
cMeGPU->SetData(aData, unitNum);
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
/* call Sub function */
_Sub(aGPU, bGPU, cGPU, beta);
_SubMe(cMeGPU, bGPU, beta);
cUserGPU = Sub(*aGPU, *bGPU, beta);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
&& cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Sub Function */
bool TestSub()
{
XPRINT(0, stdout, "[TEST SUB] tensor subtraction c = a - b * beta\n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSub1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestSub2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
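The expected values in case 2 follow from $c = a - b\,\beta$ with $\beta = 0.5$:

$$
c = \left(\begin{matrix}0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7\end{matrix}\right) - 0.5 \left(\begin{matrix}1 & -1 & -3 & -5 \\ -7 & -9 & -11 & -13\end{matrix}\right) = \left(\begin{matrix}-0.5 & 1.5 & 3.5 & 5.5 \\ 7.5 & 9.5 & 11.5 & 13.5\end{matrix}\right)
$$

Case 1 uses the default $\beta = 1$, so there `answer` is simply $a - b$.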
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __TEST_SUB_H__
#define __TEST_SUB_H__
#include "../core/arithmetic/Sub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Sub Function */
bool TestSub();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SUB_H__
...@@ -16,8 +16,8 @@ ...@@ -16,8 +16,8 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30 * $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
*/ */
#include "TSum.h" #include "TSum.h"
...@@ -59,14 +59,14 @@ bool TestSum1() ...@@ -59,14 +59,14 @@ bool TestSum1()
b->SetData(bData, unitNum); b->SetData(bData, unitNum);
c->SetZeroAll(); c->SetZeroAll();
/* call sum function */ /* call Sum function */
_Sum(a, b, c); _Sum(a, b, c);
_SumMe(cMe, b); _SumMe(cMe, b);
cUser = Sum(*a, *b); cUser = Sum(*a, *b);
/* check results */ /* check results */
cpuTest = c->CheckData(answer, unitNum) cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum); && cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
...@@ -85,14 +85,14 @@ bool TestSum1() ...@@ -85,14 +85,14 @@ bool TestSum1()
bGPU->SetData(bData, unitNum); bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll(); cGPU->SetZeroAll();
/* call sum function */ /* call Sum function */
_Sum(aGPU, bGPU, cGPU); _Sum(aGPU, bGPU, cGPU);
_SumMe(cMeGPU, bGPU); _SumMe(cMeGPU, bGPU);
cUserGPU = Sum(*aGPU, *bGPU); cUserGPU = Sum(*aGPU, *bGPU);
/* check results */ /* check results */
gpuTest = cGPU->CheckData(answer, unitNum) gpuTest = cGPU->CheckData(answer, unitNum)
&& cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum); && cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum);
/* destroy variables */ /* destroy variables */
delete a; delete a;
...@@ -155,14 +155,14 @@ bool TestSum2() ...@@ -155,14 +155,14 @@ bool TestSum2()
b->SetData(bData, unitNum); b->SetData(bData, unitNum);
c->SetZeroAll(); c->SetZeroAll();
/* call sum function */ /* call Sum function */
_Sum(a, b, c, beta); _Sum(a, b, c, beta);
_SumMe(cMe, b, beta); _SumMe(cMe, b, beta);
cUser = Sum(*a, *b, beta); cUser = Sum(*a, *b, beta);
/* check results */ /* check results */
cpuTest = c->CheckData(answer, unitNum) cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum); && cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
...@@ -181,14 +181,14 @@ bool TestSum2() ...@@ -181,14 +181,14 @@ bool TestSum2()
bGPU->SetData(bData, unitNum); bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll(); cGPU->SetZeroAll();
/* call sum function */ /* call Sum function */
_Sum(aGPU, bGPU, cGPU, beta); _Sum(aGPU, bGPU, cGPU, beta);
_SumMe(cMeGPU, bGPU, beta); _SumMe(cMeGPU, bGPU, beta);
cUserGPU = Sum(*aGPU, *bGPU, beta); cUserGPU = Sum(*aGPU, *bGPU, beta);
/* check results */ /* check results */
gpuTest = cGPU->CheckData(answer, unitNum) gpuTest = cGPU->CheckData(answer, unitNum)
&& cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum); && cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum);
/* destroy variables */ /* destroy variables */
delete a; delete a;
......
...@@ -16,8 +16,8 @@ ...@@ -16,8 +16,8 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30 * $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
*/ */
#ifndef __TEST_SUM_H__ #ifndef __TEST_SUM_H__
#define __TEST_SUM_H__ #define __TEST_SUM_H__
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
*/
#include "TSumDim.h"
#include "../core/arithmetic/SumDim.h"
#include "../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: tensor summation c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
*/
bool TestSumDim1()
{
/* a tensor of size (2, 4) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 4;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2) */
int bOrder = 1;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2] = {1.0F, -1.0F};
DTYPE answer[2][4] = { {1.0F, 2.0F, 3.0F, 4.0F},
{3.0F, 4.0F, 5.0F, 6.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * c = NewTensor(aOrder, aDimSize);
XTensor * cMe = NewTensor(aOrder, aDimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
cMe->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
c->SetZeroAll();
/* call SumDim function */
_SumDim(a, b, c, 0);
_SumDim(cMe, b, 0);
cUser = SumDim(*a, *b, 0);
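/*
three interfaces are exercised: _SumDim writes the result into the pre-allocated
tensor c, the two-argument form accumulates b into cMe in place, and SumDim
returns a new tensor that is assigned to cUser
*/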
/* check results */
cpuTest = c->CheckData(answer, aUnitNum)
&& cMe->CheckData(answer, aUnitNum)
&& cUser.CheckData(answer, aUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
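/* the extra NewTensor arguments request X_FLOAT data and presumably select a dense
   ratio of 1.0F and device 0 (the first GPU), in contrast to the default CPU tensors created above */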
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
cMeGPU->SetData(aData, aUnitNum);
bGPU->SetData(bData, bUnitNum);
cGPU->SetZeroAll();
/* call SumDim function */
_SumDim(aGPU, bGPU, cGPU, 0);
_SumDim(cMeGPU, bGPU, 0);
cUserGPU = SumDim(*aGPU, *bGPU, 0);
/* check results */
gpuTest = cGPU->CheckData(answer, aUnitNum)
&& cMeGPU->CheckData(answer, aUnitNum)
&& cUserGPU.CheckData(answer, aUnitNum);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: tensor summation c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
*/
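/*
note: here "size" means the total number of entries: b is a (2, 2) tensor with
4 units, matching the size of dimension 1 of a, so (as the answer below shows)
b is flattened to {1, -1, -1, 1} and added to every row of a
*/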
bool TestSumDim2()
{
/* a tensor of size (2, 4) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 4;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][2] = { {1.0F, -1.0F},
{-1.0F, 1.0F} };
DTYPE answer[2][4] = { {1.0F, 0.0F, 1.0F, 4.0F},
{5.0F, 4.0F, 5.0F, 8.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * c = NewTensor(aOrder, aDimSize);
XTensor * cMe = NewTensor(aOrder, aDimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
cMe->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
c->SetZeroAll();
/* call SumDim function */
_SumDim(a, b, c, 1);
_SumDim(cMe, b, 1);
cUser = SumDim(*a, *b, 1);
/* check results */
cpuTest = c->CheckData(answer, aUnitNum)
&& cMe->CheckData(answer, aUnitNum)
&& cUser.CheckData(answer, aUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
cMeGPU->SetData(aData, aUnitNum);
bGPU->SetData(bData, bUnitNum);
cGPU->SetZeroAll();
/* call SumDim function */
_SumDim(aGPU, bGPU, cGPU, 1);
_SumDim(cMeGPU, bGPU, 1);
cUserGPU = SumDim(*aGPU, *bGPU, 1);
/* check results */
gpuTest = cGPU->CheckData(answer, aUnitNum)
&& cMeGPU->CheckData(answer, aUnitNum)
&& cUserGPU.CheckData(answer, aUnitNum);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for SumDim Function */
bool TestSumDim()
{
XPRINT(0, stdout, "[TEST SUMDIM] tensor summation c = a + b * beta by broadcasting\n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSumDim1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestSumDim2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
 * I finished my summer holidays and went back to study.
*/
#ifndef __TEST_SUMDIM_H__
#define __TEST_SUMDIM_H__
#include "../core/arithmetic/SumDim.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for SumDim Function */
extern "C"
bool TestSumDim();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SUMDIM_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TTan.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Tan function.
Set every entry to its tangent value.
*/
bool TestTan1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {1.5574F, -2.1850F},
{-1.5574F, 2.1850F},
{0.0F, 0.5463F} };
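/* the expected values are the element-wise tangents of aData (tan(1.0) = 1.5574,
   tan(2.0) = -2.1850, tan(0.5) = 0.5463, rounded to four decimals); they are
   compared below with a tolerance of 1e-4F */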
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Tan function */
_Tan(a, b);
_TanMe(aMe);
bUser = Tan(*a);
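/* _Tan writes into b, _TanMe overwrites aMe in place, and Tan returns a new
   tensor assigned to bUser; all three results are checked against the same answer */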
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Tan function */
_Tan(aGPU, bGPU);
_TanMe(aMeGPU);
bUserGPU = Tan(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Tan Function */
bool TestTan()
{
XPRINT(0, stdout, "[TEST Tan] set every entry to its tangent value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestTan1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
@@ -16,31 +16,16 @@
*/
/*
- * $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
-#ifndef __LOG_CUH__
-#define __LOG_CUH__
-#include "Log.h"
+#ifndef __TEST_TAN_H__
+#define __TEST_TAN_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
-#ifdef USE_CUDA
-/* set each entry to its log value (CUDA Kernel) */
-__global__
-void KernelLog(DTYPE * a, DTYPE * b, int size);
-/* set each entry to its log value (CUDA Kernel) with float16 data type*/
-__global__
-void KernelLog(__half * a, __half * b, int size);
-/* set each entry to its log value */
-void _CudaLog(const XTensor * a, XTensor * b);
-#endif // USE_CUDA
+/* test for Tan Function */
+bool TestTan();
} // namespace nts(NiuTrans.Tensor)
-#endif // __LOG_CUH__
+#endif // __TEST_TAN_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/
#include "TTranspose.h"
#include "../core/movement/CopyValues.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Transpose function.
tensor transposition of dimensions i and j
*/
bool TestTranspose1()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 3) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 3;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} };
DTYPE answer[2][3] = { {1.0F, 3.0F, 5.0F},
{2.0F, 4.0F, 6.0F} };
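/* for a 2-D tensor, transposing dimensions 0 and 1 is the ordinary matrix
   transpose, so the (3, 2) input above becomes this (2, 3) answer */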
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(a, b, 0, 1);
bUser = Transpose(*a, 0, 1);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
&& bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(aGPU, bGPU, 0, 1);
bUserGPU = Transpose(*aGPU, 0, 1);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
&& bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: test Transpose function.
tensor transposition of dimensions i and j
*/
bool TestTranspose2()
{
/* a tensor of size (4, 3, 2) */
int aOrder = 3;
int * aDimSize = new int[aOrder];
aDimSize[0] = 4;
aDimSize[1] = 3;
aDimSize[2] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 3, 4) */
int bOrder = 3;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 3;
bDimSize[2] = 4;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[4][3][2] = { { {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} },
{ {2.0F, 4.0F},
{4.0F, 7.0F},
{6.0F, 8.0F} },
{ {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} },
{ {2.0F, 4.0F},
{4.0F, 7.0F},
{6.0F, 8.0F} },};
DTYPE answer[2][3][4] = { { {1.0F, 2.0F, 1.0F, 2.0F},
{2.0F, 4.0F, 2.0F, 4.0F},
{3.0F, 4.0F, 3.0F, 4.0F} },
{ {4.0F, 7.0F, 4.0F, 7.0F},
{5.0F, 6.0F, 5.0F, 6.0F},
{6.0F, 8.0F, 6.0F, 8.0F} } };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(a, b, 0, 2);
bUser = Transpose(*a, 0, 2);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
&& bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(aGPU, bGPU, 0, 2);
bUserGPU = Transpose(*aGPU, 0, 2);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
&& bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Transpose Function */
bool TestTranspose()
{
XPRINT(0, stdout, "[TEST TRANSPOSE] tensor transposition with specified dimensions \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestTranspose1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestTranspose2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
*/
#ifndef __TEST_TRANSPOSE_H__
#define __TEST_TRANSPOSE_H__
#include "../core/shape/Transpose.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Transpose Function */
bool TestTranspose();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_TRANSPOSE_H__
@@ -32,15 +32,17 @@ bool Test()
wrong = !TestAbsolute() || wrong;
wrong = !TestConcatenate() || wrong;
wrong = !TestConcatenateSolely() || wrong;
+wrong = !TestCos() || wrong;
wrong = !TestConvertDataType() || wrong;
wrong = !TestCopyIndexed() || wrong;
wrong = !TestCopyValues() || wrong;
+wrong = !TestDiv() || wrong;
+wrong = !TestExp() || wrong;
wrong = !TestLog() || wrong;
wrong = !TestMatrixMul() || wrong;
wrong = !TestMatrixMul2D() || wrong;
wrong = !TestMatrixMul2DParallel() || wrong;
wrong = !TestMatrixMulBatched() || wrong;
-wrong = !TestMatrixMulBatchedCPU() || wrong;
wrong = !TestMerge() || wrong;
wrong = !TestMultiply() || wrong;
wrong = !TestNegate() || wrong;
@@ -56,11 +58,16 @@ bool Test()
wrong = !TestSetAscendingOrder() || wrong;
wrong = !TestSetData() || wrong;
wrong = !TestSign() || wrong;
+wrong = !TestSin() || wrong;
wrong = !TestSort() || wrong;
wrong = !TestSplit() || wrong;
+wrong = !TestSub() || wrong;
wrong = !TestSum() || wrong;
wrong = !TestSumByColumnTV() || wrong;
wrong = !TestSumByColumnVT() || wrong;
+wrong = !TestSumDim() || wrong;
+wrong = !TestTan() || wrong;
+wrong = !TestTranspose() || wrong;
wrong = !TestTopK() || wrong;
wrong = !TestUnsqueeze() || wrong;
wrong = !TestXMem() || wrong;
...
@@ -25,15 +25,17 @@
#include "TAbsolute.h"
#include "TConcatenate.h"
#include "TConcatenateSolely.h"
+#include "TCos.h"
#include "TConvertDataType.h"
#include "TCopyIndexed.h"
#include "TCopyValues.h"
+#include "TDiv.h"
+#include "TExp.h"
#include "TLog.h"
#include "TMatrixMul.h"
#include "TMatrixMul2D.h"
#include "TMatrixMul2DParallel.h"
#include "TMatrixMulBatched.h"
-#include "TMatrixMULBatchedCPU.h"
#include "TMerge.h"
#include "TMultiply.h"
#include "TNegate.h"
@@ -49,11 +51,16 @@
#include "TSetAscendingOrder.h"
#include "TSetData.h"
#include "TSign.h"
+#include "TSin.h"
#include "TSort.h"
#include "TSplit.h"
+#include "TSub.h"
#include "TSum.h"
#include "TSumByColumnTV.h"
#include "TSumByColumnVT.h"
+#include "TSumDim.h"
+#include "TTan.h"
+#include "TTranspose.h"
#include "TTopK.h"
#include "TUnsqueeze.h"
#include "TXMem.h"
...