merge with the latest branch of xuchen

Commit 67bbdfd2 authored Aug 02, 2018 by xuchen
--- a/doc/manual.md
+++ b/doc/manual.md
+<script type="text/javascript" async src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"> </script>
+
 # NiuTrans.Tensor Tensor Computation Library

 ## NiuTrans.Tensor
@@ -27,39 +29,46 @@

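 ## What Is a Tensor

-In computer science, a tensor is usually defined as a quantity in \\(n\\)-dimensional space that has \\(n\\) components; in essence, such a tensor is a multidimensional array. The order (or rank) of a tensor is the number of dimensions of that array or, put simply, the number of indices needed to address one of its elements. By convention, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space an order-1 tensor is the vector \\((x,y,z)\\) represented by a point, where \\(x\\), \\(y\\) and \\(z\\) are the coordinates of the point on the three axes.
+In computer science, a tensor is usually defined as a quantity in $n$-dimensional space that has $n$ components; in essence, such a tensor is a multidimensional array. The order (or rank) of a tensor is the number of dimensions of that array or, put simply, the number of indices needed to address one of its elements. By convention, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space an order-1 tensor is the vector $(x,y,z)$ represented by a point, where $x$, $y$ and $z$ are the coordinates of the point on the three axes.

-Tensors are an efficient tool for mathematical modeling: they express complex problems in a uniform, concise way. For example, suppose Jiang Yingjun needs 2 jin of beef and 5 jin of potatoes to cook a meal, and at the market beef costs 32 yuan per jin and potatoes cost 2 yuan per jin. The total cost is then \\(2 \times 32 + 5 \times 2 = 74\\) yuan. In tensor terms, an order-1 tensor \\(a=(2,5)\\) holds the weight of each food and another order-1 tensor \\(b=(32,2)\\) holds the price of each food. An order-0 tensor \\(c\\) then holds the total cost, computed as follows:
+Tensors are an efficient tool for mathematical modeling: they express complex problems in a uniform, concise way. For example, suppose Jiang Yingjun needs 2 jin of beef and 5 jin of potatoes to cook a meal, and at the market beef costs 32 yuan per jin and potatoes cost 2 yuan per jin. The total cost is then $2 \times 32 + 5 \times 2 = 74$ yuan. In tensor terms, an order-1 tensor $a=(2,5)$ holds the weight of each food and another order-1 tensor $b=(32,2)$ holds the price of each food. An order-0 tensor $c$ then holds the total cost, computed as follows: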

 $$
 \begin{aligned}
-  c & = a \times b^T \\\\
-    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 \\\\ 2\end{matrix}\right) \\\\
-    & = 2 \times 32 + 5 \times 2 \\\\
+  c & = a \times b^T \\
+    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 \\ 2\end{matrix}\right) \\
+    & = 2 \times 32 + 5 \times 2 \\
    & = 74
 \end{aligned}
 $$

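-Here \\(b^T\\) is the transpose of the row vector \\(b\\), that is, a column vector, and \\(\times\\) denotes vector multiplication. The next day, Jiang Yingjun goes to a different market, where beef costs 35 yuan per jin and potatoes cost 1 yuan per jin. To get the total cost at each of the two markets, redefine \\(b\\) as the order-2 tensor \\(\left(\begin{matrix}32 & 2 \\\\ 35 & 1\end{matrix}\right)\\) and define the total cost \\(c\\) as an order-2 tensor. As before,
+Here $b^T$ is the transpose of the row vector $b$, that is, a column vector, and $\times$ denotes vector multiplication. The next day, Jiang Yingjun goes to a different market, where beef costs 35 yuan per jin and potatoes cost 1 yuan per jin. To get the total cost at each of the two markets, redefine $b$ as the order-2 tensor $\left(\begin{matrix}32 & 2 \\ 35 & 1\end{matrix}\right)$ and define the total cost $c$ as an order-2 tensor. As before,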

 $$
 \begin{aligned}
-  c & = a \times b^T \\\\
-    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 & 35 \\\\ 2 & 1\end{matrix}\right) \\\\
+  c & = a \times b^T \\
+    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 & 35 \\ 2 & 1\end{matrix}\right) \\
    & = \left(\begin{matrix}74 & 75\end{matrix}\right)
 \end{aligned}
 $$

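-That is, the totals at the two markets are 74 yuan and 75 yuan. Tensors can therefore model varied and complex problems. For instance, \\(a\\), \\(b\\) and \\(c\\) above can be extended to higher-order tensors to cover different times, markets and recipes; however the situation changes, the same formula \\(c = a \times b^T\\) still describes the problem.
+That is, the totals at the two markets are 74 yuan and 75 yuan. Tensors can therefore model varied and complex problems. For instance, $a$, $b$ and $c$ above can be extended to higher-order tensors to cover different times, markets and recipes; however the situation changes, the same formula $c = a \times b^T$ still describes the problem.

 Many real-world problems can be written as tensor expressions, that is, arithmetic expressions that combine and compute with tensors. This way of modeling underlies modern neural network models and deep learning methods. In many machine learning tools, tensor computation has become the basic unit of procedures such as the forward and backward propagation of neural networks, and it is used very widely.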
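
 The shopping example can be checked with a few lines of C++. The sketch below is only an illustration: `TotalCost` is a hypothetical helper written with `std::vector`, not a NiuTrans.Tensor function, and it assumes that each row of `b` holds the prices at one market.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Hypothetical helper (not part of NiuTrans.Tensor): multiplies the row
 // vector a (one weight per food) by the transpose of the price table b
 // (one row per market) and returns one total per market.
 std::vector<double> TotalCost(const std::vector<double>& a,
                               const std::vector<std::vector<double>>& b)
 {
     std::vector<double> c;
     for (const std::vector<double>& prices : b) {
         assert(prices.size() == a.size());   // every market prices every food
         double total = 0.0;
         for (std::size_t i = 0; i < a.size(); ++i)
             total += a[i] * prices[i];
         c.push_back(total);
     }
     return c;
 }
 ```

 Called with $a=(2,5)$ and the price rows $(32,2)$ and $(35,1)$, the helper returns the two totals from the text, 74 and 75.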

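 ## How to Define a Tensor

-If you use C/C++ or Python, defining tensors with NiuTrans.Tensor in your program is straightforward. First, download the NiuTrans.Tensor package (source???) and unpack it to any directory, for example ~/NTS. Inside NTS you will find the source subdirectory, which holds the source code. It is organized as follows:
+If you use C/C++ or Python, defining tensors with NiuTrans.Tensor in your program is straightforward. First, download the NiuTrans.Tensor package and unpack it to any directory, for example ~/NTS. Inside NTS you will find the source subdirectory, which holds the source code. It is organized as follows: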

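 * ~/NTS/source/XTensor.h - defines the tensor structure XTensor and the interfaces for creating and destroying XTensor objects
 * ~/NTS/source/core - source files with the declarations and implementations of the tensor computation functions
+    * arithmetic - source files for arithmetic operations
+    * getandset - source files for getting and setting data
+    * math - source files for mathematical operations
+    * movement - source files for data movement
+    * reduce - source files for reduction operations
+    * shape - source files for shape transformations
+    * sort - source files for sorting operations
 * ~/NTS/source/function - source files for the activation functions
 * ~/NTS/source/test - source files for the unit tests
 * ~/NTS/source/*.h(cpp) - unrelated to tensor definition; introduced later :)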
@@ -84,7 +93,7 @@ int main(int argc, const char ** argv)
 }
 ```

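-Next, compile the program above. The compiler must be told which directory contains the XTensor.h header. For example, to compile sample.cpp with g++ (if you are using Visual Studio, see here???)
+Next, compile the program above. The compiler must be told which directory contains the XTensor.h header. For example, to compile sample.cpp with g++: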

 ```
 g++ sample.cpp -I ~/NTS/source -o sample
@@ -190,8 +199,6 @@ int main(int argc, const char ** argv)
 | Create a 4-D dense tensor | XTensor * NewTensor4D(<br>const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device that holds the tensor <br> myMem - memory pool used by the tensor |
 | Create a 5-D dense tensor | XTensor * NewTensor5D(<br>const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br>  d4 - size of the fifth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device that holds the tensor <br> myMem - memory pool used by the tensor |

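-## Devices
-
 ## Accessing Tensor Contents

 In C/C++, the contents of a tensor are accessed through XTensor.h; including the XTensor.h header in a source file is all that is needed to define tensors.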
@@ -204,71 +211,73 @@ int main(int argc, const char ** argv)
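 | void * data | data array that holds the elements |
 | int devID | device ID: the index of the CPU or GPU device on which the tensor's memory is allocated; -1 means CPU |
 | int order | order of the tensor; for example, a matrix (order 2) is a two-dimensional tensor |
-| int dimSize<br> [MAX_TENSOR_DIM_NUM] | size of each dimension of the tensor; index 0 is the first dimension |
+| int dimSize[ ] | size of each dimension of the tensor; index 0 is the first dimension |
 | TENSOR_DATA_TYPE dataType | data type of each data unit |
 | int unitSize | size of a data unit, similar to sizeof() |
 | int unitNum | number of data units |
 | bool isSparse | whether the tensor is sparse. An n * m dense matrix stores n * m values, whereas the amount of data in a sparse (non-dense) matrix depends on its number of non-zero elements.|
 | float denseRatio | density: the fraction of non-zero units, a real number between 0 and 1; 0 means all units are zero and 1 means all units are non-zero.|

-Methods defined in the XTensor.h header:
+Some of the methods defined in the XTensor.h header are described below; see the appendix for details:

 | Purpose | Function  | Parameters |
 | - | - | - |
-| Check whether two tensors have<br>the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b) | a - first tensor to compare <br> b - second tensor to compare |
-| Check whether three tensors have<br>the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b, XTensor * c) | a - first tensor to compare <br> b - second tensor to compare <br> c - third tensor to compare |
 | Set the size of every dimension | void SetDim(int * myDimSize) |myDimSize - size of each dimension of the tensor |
 | Get the size of a given dimension | int GetDim(const int dim) | dim - index of the dimension |
 | Reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - order of the tensor <br> myDimSize - size of each dimension |
 | Get the number of elements | int GetSize() | N/A |
-| Get the unit size of<br> a given data type | int GetUnitSize(<br> TENSOR_DATA_TYPE myDataType) | myDataType - the given data type |
-| Set all elements to 0 | void SetZeroAll(XStream * stream = NULL) | stream - multi-threading stream|
 | Assign the tensor from an array | void SetData(<br> const void * d, int num, int beg = 0) | d - source array  <br> num - number of array elements <br> beg - position in the tensor at which assignment starts |
-| Fill the tensor from a uniform distribution | void SetDataRand(<br> DTYPE lower, DTYPE upper) | lower - minimum value <br> upper - maximum value |
-| Fill the tensor from a normal distribution | void SetDataRandn(<br> DTYPE mean, DTYPE standardDeviation) | mean - mean <br> standardDeviation - standard deviation |
-| Set the elements along a given<br> dimension in ascending order | void SetAscendingOrder(int dim) | dim - the given dimension |
+| Set all elements to 0 | void SetZeroAll(XStream * stream = NULL) | stream - multi-threading stream|
 | Get a value of a 2-D tensor | DTYPE Get2D(int ni, int mi = 0) | ni - row index <br> mi - column index |
 | Set the value of a cell<br> in a 2-D tensor | bool Set2D(DTYPE value, int ni, int mi = 0) | value - cell value <br> ni - row index <br> mi - column index |
 | Add to the value of a cell<br> in a 2-D tensor | bool Add2D(DTYPE value, int ni, int mi = 0) | value - value to add <br> ni - row index <br> mi - column index |
 | Resize the tensor to a given size | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means a dense tensor |
 | Resize the tensor to the size<br> of another tensor | bool Resize(<br> const XTensor * myTensor) | myTensor - reference tensor whose size is used |
-| Create a new tensor<br>as a copy of a given tensor | XTensor * NewTensor(<br>XTensor * a, bool isFilledData = true) | a - the given tensor <br>  isFilledData - whether to allocate the data space of the tensor |
 | Free the data space<br>of a given tensor | void DelTensor(<br>const XTensor * tensor) | tensor - the given tensor |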

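 ## Tensor Computation

-NiuTrans.Tensor provides functions for tensor computation, mainly basic tensor operations and activation functions. This section introduces these functions and gives usage examples.
+NiuTrans.Tensor provides functions for tensor computation, mainly basic tensor operations and activation functions. This section introduces these functions and gives usage examples. Taking element-wise multiplication (Multiply) as an example, NiuTrans.Tensor defines its functions in the following forms: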
+
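+* _Multiply: the caller must specify the output tensor; forward computation only
+* _MultiplyMe: the output tensor is the same as the input tensor (in-place); forward computation only
+* Multiply: the output tensor is returned to the caller; supports both forward and backward computation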
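
 The forms differ only in how the result reaches the caller, as the signatures of `_Multiply`, `_MultiplyMe` and `Multiply` later in this section show (explicit output pointer, in-place update, returned value). The sketch below mimics these three shapes with a plain `std::vector<float>` standing in for a tensor. The `toy*` names are hypothetical, and gradient (backward) support is not modelled.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 typedef std::vector<float> Vec;   // stand-in for a tensor in this sketch

 // Explicit-output shape: the caller supplies the output c.
 void toyMultiplyInto(const Vec* a, const Vec* b, Vec* c)
 {
     assert(a->size() == b->size() && c->size() == a->size());
     for (std::size_t i = 0; i < a->size(); ++i)
         (*c)[i] = (*a)[i] * (*b)[i];
 }

 // In-place shape: the result overwrites the first input.
 void toyMultiplyMe(Vec* a, const Vec* b)
 {
     toyMultiplyInto(a, b, a);     // element-wise, so aliasing a and c is safe
 }

 // Returning shape: a new tensor is created and handed back to the caller.
 Vec toyMultiply(const Vec& a, const Vec& b)
 {
     Vec c(a.size());
     toyMultiplyInto(&a, &b, &c);
     return c;
 }
 ```

 The returning shape is the most convenient to compose in expressions, while the explicit-output and in-place shapes avoid allocating a new buffer.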

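-### arithmetic
+### Arithmetic Computation (arithmetic)

 This part covers arithmetic operations such as addition, subtraction, multiplication, division and negation.

 #### Matrix Multiplication (MatrixMul)

 ##### What is matrix multiplication between tensors?
-Matrix multiplication multiplies two matrices to produce a new result matrix. Multiplying two matrices of dimensions \\(2 \times 3\\) and \\(3 \times 2\\) works as follows; the result matrix has dimension \\(2 \times 2\\):
+Matrix multiplication multiplies two matrices to produce a new result matrix. Multiplying two matrices of dimensions $2 \times 3$ and $3 \times 2$ works as follows; the result matrix has dimension $2 \times 2$: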

 $$
-\left(\begin{matrix}1.0 & 2.0 & 3.0\\\\-4.0 & 5.0 & 6.0\end{matrix}\right) × 
-\left(\begin{matrix}0.0 & -1.0\\\\1.0 & 2.0\\\\2.0 & 1.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}8.0 & 6.0\\\\17.0 & 20.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & 2.0 & 3.0\\-4.0 & 5.0 & 6.0\end{matrix}\right) × 
+\left(\begin{matrix}0.0 & -1.0\\1.0 & 2.0\\2.0 & 1.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}8.0 & 6.0\\17.0 & 20.0\end{matrix}\right)
 $$

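 ##### Calling matrix multiplication

-NiuTrans.Tensor provides matrix multiplication, defined in NiuTrans.Tensor/Tensor/core. The call signature and parameters are as follows:
+NiuTrans.Tensor provides matrix multiplication, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The operation computes:
+>$c_{i,j} = \mathrm{trans}(a_i) \times \mathrm{trans}(b_j) \times \alpha + c_{i,j} \times \beta$
+
+The call signatures and parameters are as follows: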
 ```
 void _MatrixMul(XTensor * a, MATRIX_TRANS_TYPE transposedA, XTensor * b, MATRIX_TRANS_TYPE transposedB, XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
+
+XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
 ```
 Parameters: 

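-* a - operand tensor 1
-* transposedA - whether operand tensor 1 is transposed
-* b - operand tensor 2
-* transposedB - whether operand tensor 2 is transposed
-* c - operand tensor 3
-* alpha - coefficient
-* beta - coefficient
+* a - input tensor 1
+* transposedA - whether a is transposed
+* b - input tensor 2
+* transposedB - whether b is transposed
+* c - output tensor
+* alpha - coefficient α
+* beta - coefficient β

 ##### Example snippet for matrix multiplication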
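
 To make the roles of the transpose flags and of alpha and beta concrete, here is a minimal sketch of the same computation on row-major arrays. It assumes that trans(·) means an optional transpose and that each output element is a row-by-column dot product; `Mat` and `ToyMatrixMul` are hypothetical names, not the library's XTensor or `_MatrixMul`.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Row-major matrix used only for this sketch.
 struct Mat {
     std::size_t rows, cols;
     std::vector<double> v;
     double at(std::size_t r, std::size_t c) const { return v[r * cols + c]; }
 };

 // c = op(a) * op(b) * alpha + c * beta, where op(x) is x or its transpose.
 void ToyMatrixMul(const Mat& a, bool transA, const Mat& b, bool transB,
                   Mat& c, double alpha = 1.0, double beta = 0.0)
 {
     const std::size_t n = transA ? a.cols : a.rows;   // rows of op(a)
     const std::size_t k = transA ? a.rows : a.cols;   // shared dimension
     const std::size_t m = transB ? b.rows : b.cols;   // columns of op(b)
     assert(k == (transB ? b.cols : b.rows));
     assert(c.rows == n && c.cols == m);
     for (std::size_t i = 0; i < n; ++i)
         for (std::size_t j = 0; j < m; ++j) {
             double dot = 0.0;
             for (std::size_t p = 0; p < k; ++p)
                 dot += (transA ? a.at(p, i) : a.at(i, p)) *
                        (transB ? b.at(j, p) : b.at(p, j));
             c.v[i * m + j] = dot * alpha + c.v[i * m + j] * beta;
         }
 }
 ```

 With the $2 \times 3$ and $3 \times 2$ matrices from the example above, alpha = 1 and beta = 0, the sketch yields the rows (8, 6) and (17, 20).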

@@ -285,26 +294,30 @@ NiuTrans.Tensor/Tensor/test/TMatrixMul.cpp

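 ##### What is element-wise tensor multiplication?

-Element-wise multiplication multiplies the elements of two tensors position by position. For two tensors of dimension \\(2 \times 2\\) it works as follows:
+Element-wise multiplication multiplies the elements of two tensors position by position. For two tensors of dimension $2 \times 2$ it works as follows: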

 $$
-\left(\begin{matrix}0.0 & 1.0\\\\2.0 & 3.0\end{matrix}\right)  ·
-\left(\begin{matrix}0.0 & 1.0\\\\2.0 & 3.0\end{matrix}\right)  \rightarrow 
-\left(\begin{matrix}0.0 & 1.0\\\\4.0 & 9.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0\\2.0 & 3.0\end{matrix}\right)  ·
+\left(\begin{matrix}0.0 & 1.0\\2.0 & 3.0\end{matrix}\right)  \rightarrow 
+\left(\begin{matrix}0.0 & 1.0\\4.0 & 9.0\end{matrix}\right)
 $$

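 ##### Calling element-wise multiplication

-NiuTrans.Tensor provides element-wise multiplication, which computes the element-wise product of tensors. The function is defined in NiuTrans.Tensor/Tensor/core. The call signature and parameters are as follows:
+NiuTrans.Tensor provides element-wise multiplication, which computes the element-wise product of tensors. The function is defined in NiuTrans.Tensor/Tensor/core/arithmetic. The call signatures and parameters are as follows: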
 ```
 _Multiply(XTensor * a, XTensor * b, XTensor * c, int leadingDim, DTYPE alpha = 0)
+
+void _MultiplyMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0)
+
+XTensor Multiply(const XTensor &a, const XTensor &b, DTYPE alpha = 0, int leadingDim = 0)
 ```
 Parameters: 

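-* a - operand tensor 1
-* b - operand tensor 2
-* c - result tensor
-* leadingDim - ???
+* a - input tensor 1
+* b - input tensor 2
+* c - output tensor
+* leadingDim - dimension along which the element-wise multiplication is performed
 * alpha - coefficient

 ##### Example snippet for element-wise multiplication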
@@ -322,22 +335,22 @@ NiuTrans.Tensor/Tensor/test/TMultiply.cpp

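 ##### What is tensor negation?

-Negating a tensor negates every element to obtain a new element; together the new elements form the result tensor. For a tensor of dimension \\(3 \times 2\\) negation works as follows:
+Negating a tensor negates every element to obtain a new element; together the new elements form the result tensor. For a tensor of dimension $3 \times 2$ negation works as follows: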

 $$
-\left(\begin{matrix}1.0 & -2.0\\\\-3.0 & 4.0\\\\5.0 & -6.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}-1.0 & 2.0\\\\3.0 & -4.0\\\\-5.0 & 6.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & -2.0\\-3.0 & 4.0\\5.0 & -6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}-1.0 & 2.0\\3.0 & -4.0\\-5.0 & 6.0\end{matrix}\right)
 $$

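 ##### Calling tensor negation

-NiuTrans.Tensor provides tensor negation, which negates a tensor element by element. The function is defined in NiuTrans.Tensor/Tensor/core. The call signature and parameters are as follows:
+NiuTrans.Tensor provides tensor negation, which negates a tensor element by element. The function is defined in NiuTrans.Tensor/Tensor/core/arithmetic. The call signature and parameters are as follows: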
 ```
-_Negate(XTensor * a)
+void _Negate(XTensor * a)
 ```
 Parameters: 

-* a - operand tensor
+* a - input tensor

 ##### Example snippet for tensor negation

@@ -353,35 +366,39 @@ NiuTrans.Tensor/Tensor/test/TNegate.cpp
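 #### Addition (Sum)

 ##### What is tensor addition?
-Tensor addition adds n tensors to produce a new result tensor: each element of the result is the sum of the elements at the same position in the operands. The operands and the result have the same dimensions. Adding two tensors of dimension \\(2\times 3\\) works as follows:
+Tensor addition adds n tensors to produce a new result tensor: each element of the result is the sum of the elements at the same position in the operands. The operands and the result have the same dimensions. Adding two tensors of dimension $2\times 3$ works as follows: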

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 \\\\ 3.0 & 4.0 & 5.0\end{matrix}\right) + 
-\left(\begin{matrix}0.5 & 1.5 & 2.5 \\\\ 3.5 & 4.5 & 5.5\end{matrix}\right) \rightarrow
-\left(\begin{matrix}0.5 & 2.5 & 4.5 \\\\ 6.5 & 8.5 & 10.5\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 \\ 3.0 & 4.0 & 5.0\end{matrix}\right) + 
+\left(\begin{matrix}0.5 & 1.5 & 2.5 \\ 3.5 & 4.5 & 5.5\end{matrix}\right) \rightarrow
+\left(\begin{matrix}0.5 & 2.5 & 4.5 \\ 6.5 & 8.5 & 10.5\end{matrix}\right)
 $$

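 ##### Calling tensor addition

-NiuTrans.Tensor provides tensor addition, defined in NiuTrans.Tensor/Tensor/core. The operation adds tensors element by element and produces the result tensor. It is called as follows:
+NiuTrans.Tensor provides tensor addition, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The operation adds tensors element by element and produces the result tensor. It is called as follows: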
 ```
-_Sum(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
+void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0)
+
+void _SumMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0)
+
+XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0)
 ```
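 Here a and b are the input tensors and c is the result tensor; if c is NULL, the sum is stored in a. beta is a scaling parameter applied as c = a + b * beta, and it defaults to 1.0. The parameters of tensor addition in NiuTrans.Tensor are as follows: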

 Parameters: 

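-* a - operand tensor 1
-* b - operand tensor 2
-* c - result tensor; if c is null, the result is stored in a
+* a - input tensor 1
+* b - input tensor 2
+* c - output tensor
 * beta - scaling parameter

 ##### Example snippet for tensor addition

-Calling Sum to add two tensors looks as follows; in this example the result is stored directly in a:
+Calling Sum to add two tensors looks as follows; in this example the result is stored in c: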
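
 The scaling formula c = a + b * beta can be sketched on flat arrays as follows. `ToySum` is a hypothetical stand-in that returns the result, in the spirit of the returning form; it is not the library function.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Element-wise c = a + b * beta on flat arrays of equal length.
 // ToySum is a hypothetical name used for illustration only.
 std::vector<float> ToySum(const std::vector<float>& a,
                           const std::vector<float>& b, float beta = 1.0f)
 {
     assert(a.size() == b.size());
     std::vector<float> c(a.size());
     for (std::size_t i = 0; i < a.size(); ++i)
         c[i] = a[i] + b[i] * beta;
     return c;
 }
 ```

 With beta left at 1.0 this is plain element-wise addition, which reproduces the $2 \times 3$ example above.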
 ```
 /* call sum function */
-_Sum(a, b);
+_Sum(a, b, c);
 ```
 For a complete code example, see:

@@ -391,24 +408,24 @@ NiuTrans.Tensor/Tensor/test/TSum.cpp

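 ##### What is SumByColumnTV?

-SumByColumnTV adds a Vector to a Tensor column by column; the result has the same dimensions as the Tensor. For a \\(2 \times 4\\) Tensor and a \\(2 \times 1\\) Vector, SumByColumnTV works as follows:
+SumByColumnTV adds a Vector to a Tensor column by column; the result has the same dimensions as the Tensor. For a $2 \times 4$ Tensor and a $2 \times 1$ Vector, SumByColumnTV works as follows: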

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) + \left(\begin{matrix}1.0\\\\0.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) + \left(\begin{matrix}1.0\\0.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
 $$ 

 ##### Calling SumByColumnTV

 NiuTrans.Tensor provides the SumByColumnTV operation. The call signature and parameters are as follows:
 ```
-_SumByColumnTV(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
+void _SumByColumnTV(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
 ```
 Parameters:

-* a - operand tensor
-* b - operand vector
-* c - result tensor
+* a - input tensor
+* b - input vector
+* c - output tensor
 * beta - scaling parameter

 SumByColumnTV computes $c_{col} = a_{col} + b \times \beta$
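
 As a minimal sketch of this formula, the helper below adds the scaled vector b to every column of a row-major matrix. `ToySumByColumnTV` and its explicit `rows`/`cols` arguments are assumptions made for the illustration, not the library interface.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // For every column of the rows x cols matrix a: c_col = a_col + b * beta,
 // where b has one entry per row. Hypothetical helper for illustration.
 std::vector<float> ToySumByColumnTV(const std::vector<float>& a,
                                     std::size_t rows, std::size_t cols,
                                     const std::vector<float>& b,
                                     float beta = 1.0f)
 {
     assert(a.size() == rows * cols && b.size() == rows);
     std::vector<float> c(a.size());
     for (std::size_t r = 0; r < rows; ++r)
         for (std::size_t col = 0; col < cols; ++col)
             c[r * cols + col] = a[r * cols + col] + b[r] * beta;
     return c;
 }
 ```

 The result keeps the shape of the tensor; only the vector is broadcast across the columns.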
@@ -418,7 +435,7 @@ Parameters:
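 The SumByColumnTV example code is shown below, where a is the input tensor, b is the input vector and c is the result of adding a and b column by column: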
 ```
 /* call SumByColumnTV function */
-_SumByColumnTV(a, b, c);
+_SumByColumnTV(a, b, c);
 ```
 For a complete SumByColumnTV code example, see:

@@ -428,11 +445,11 @@ NiuTrans.Tensor/Tensor/test/TSumByColumnTV.cpp

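 ##### What is SumByColumnVT?

-SumByColumnVT adds a Tensor to a Vector column by column; the result has the same dimensions as the Vector. For a \\(2 \times 1\\) Vector and a \\(2 \times 4\\) Tensor, SumByColumnVT works as follows:
+SumByColumnVT adds a Tensor to a Vector column by column; the result has the same dimensions as the Vector. For a $2 \times 1$ Vector and a $2 \times 4$ Tensor, SumByColumnVT works as follows: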

 $$
-\left(\begin{matrix}1.0\\\\0.0\end{matrix}\right) + \left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}7.0\\\\22.0\end{matrix}\right)
+\left(\begin{matrix}1.0\\0.0\end{matrix}\right) + \left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}7.0\\22.0\end{matrix}\right)
 $$ 

 ##### Calling SumByColumnVT
@@ -443,9 +460,9 @@ _SumByColumnVT(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
 ```
 Parameters:

-* a - operand vector
-* b - operand tensor
-* c - result vector
+* a - input vector
+* b - input tensor
+* c - output vector
 * beta - scaling parameter

 SumByColumnVT computes $c = a + \sum_{col} b_{col} \times \beta$
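
 The following sketch accumulates every column of the tensor into the vector, scaled by beta. `ToySumByColumnVT` is a hypothetical helper on row-major data, written only to illustrate the formula.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // c = a + sum over columns of (b_col * beta), where a has one entry per
 // row of the rows x cols matrix b. Hypothetical helper for illustration.
 std::vector<float> ToySumByColumnVT(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t rows, std::size_t cols,
                                     float beta = 1.0f)
 {
     assert(a.size() == rows && b.size() == rows * cols);
     std::vector<float> c(a);
     for (std::size_t r = 0; r < rows; ++r)
         for (std::size_t col = 0; col < cols; ++col)
             c[r] += b[r * cols + col] * beta;
     return c;
 }
 ```

 For the Vector (1, 0) and the $2 \times 4$ Tensor above, the sketch gives (7, 22), matching the worked example.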
@@ -461,7 +478,7 @@ _SumByColumnVT(a, b, c);

 NiuTrans.Tensor/Tensor/test/TSumByColumnVT.cpp

-### getandset
+### Getting and Setting Tensor Data (getandset)

 This part covers data type conversion as well as operations for setting and getting data.

@@ -469,25 +486,25 @@ NiuTrans.Tensor/Tensor/test/TSumByColumnVT.cpp

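 ##### What is the tensor selection operation?

-Select picks out part of a tensor at given positions along a given dimension. For a \\(2 \times 2 \times 4\\) tensor the selection works as follows; this example selects the elements at indices 1 and 2 along dimension 2 and stores them in the target tensor, giving a tensor of dimension \\(2 \times 2 \times 2\\):
+Select picks out part of a tensor at given positions along a given dimension. For a $2 \times 2 \times 4$ tensor the selection works as follows; this example selects the elements at indices 1 and 2 along dimension 2 and stores them in the target tensor, giving a tensor of dimension $2 \times 2 \times 2$: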

 $$
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right),\\\\ 
+\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right),\\ 
 & \left( 
-\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\\\5.0 & 6.0 & 7.0 & 8.0\end{matrix}
+\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\5.0 & 6.0 & 7.0 & 8.0\end{matrix}
 \right)
 \Biggr)
 \end{aligned} \rightarrow 
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}1.0 & 2.0\\\\5.0 & 6.0\end{matrix}
-\right),\\\\ 
+\begin{matrix}1.0 & 2.0\\5.0 & 6.0\end{matrix}
+\right),\\ 
 & \left( 
-\begin{matrix}2.0 & 3.0\\\\6.0 & 7.0\end{matrix}
+\begin{matrix}2.0 & 3.0\\6.0 & 7.0\end{matrix}
 \right)  
 \Biggr)
 \end{aligned}
@@ -496,8 +513,25 @@ $$
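 ##### Calling tensor selection

 NiuTrans.Tensor provides the tensor selection operation. The call signatures and parameters are as follows:
+
+The first form selects from the tensor with an index matrix made up of 0s and 1s: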
+```
+void _Select(const XTensor * a, XTensor * c, XTensor * indexCPU)
+
+XTensor Select(const XTensor &a, XTensor &indexCPU)
+```
+
+Parameters:
+
+* a - input tensor
+* c - output tensor
+* indexCPU - selection flags for the tensor
+
+The second form selects from the tensor by a range of positions:
 ```
-_SelectRange(XTensor * a, int dim, int low, int high, XTensor * c)
+void _SelectRange(const XTensor * a, XTensor * c, int dim, int low, int high)
+
+XTensor SelectRange(const XTensor &a, int dim, int low, int high)
 ```
 Parameters:

@@ -505,7 +539,7 @@ Parameters:
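 * dim - dimension along which the selection is performed
 * low - lower bound of the selection range
 * high - upper bound of the selection range
-* c - result tensor
+* c - output tensor

 >Note that a selection range of [1,3] selects the values at indices 1 and 2; the upper bound itself is excluded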
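
 The range semantics can be sketched on a flat row-major array. `ToySelectRange` below is a hypothetical helper, not the library's `_SelectRange`; it takes the shape explicitly and treats `high` as exclusive, as the note describes.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Keeps the indices low .. high-1 along dimension dim of a row-major
 // tensor with the given shape. Hypothetical helper for illustration.
 std::vector<float> ToySelectRange(const std::vector<float>& a,
                                   const std::vector<int>& shape,
                                   int dim, int low, int high)
 {
     assert(dim >= 0 && dim < (int)shape.size());
     assert(0 <= low && low < high && high <= shape[dim]);
     std::size_t outer = 1, inner = 1;   // sizes before and after dim
     for (int i = 0; i < dim; ++i) outer *= shape[i];
     for (std::size_t i = dim + 1; i < shape.size(); ++i) inner *= shape[i];
     assert(a.size() == outer * shape[dim] * inner);
     std::vector<float> c;
     for (std::size_t o = 0; o < outer; ++o)
         for (int k = low; k < high; ++k)            // high is excluded
             for (std::size_t in = 0; in < inner; ++in)
                 c.push_back(a[(o * shape[dim] + k) * inner + in]);
     return c;
 }
 ```

 Selecting the range [1,3] along dimension 2 of the $2 \times 2 \times 4$ example tensor keeps indices 1 and 2 and produces the $2 \times 2 \times 2$ result shown earlier.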

@@ -524,23 +558,70 @@ NiuTrans.Tensor/Tensor/test/TSelect.cpp

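 ##### What is SetData?

-SetData initializes a tensor with random values in a given range. For a \\(2 \times 4\\) tensor and the range [0.0,1.0], SetData works as follows:
+SetData initializes a tensor with random values in a given range. For a $2 \times 4$ tensor and the range [0.0,1.0], SetData works as follows: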

 $$
-\left(\begin{matrix}0.0 & 0.0 & 0.0 & 0.0\\\\0.0 & 0.0 & 0.0 & 0.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.1 & 0.5 & 0.3 & 0.9\\\\0.8 & 0.5 & 0.5 & 0.2\end{matrix}\right)
+\left(\begin{matrix}0.0 & 0.0 & 0.0 & 0.0\\0.0 & 0.0 & 0.0 & 0.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.1 & 0.5 & 0.3 & 0.9\\0.8 & 0.5 & 0.5 & 0.2\end{matrix}\right)
 $$ 

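 ##### Calling SetData

 NiuTrans.Tensor provides the SetData operations. The call signatures and parameters are as follows:
+
+Set the tensor to a fixed value:
+```
+void SetDataFixed(XTensor * tensor, void * valuePointer)
+```
+Parameters:
+
+* tensor - input tensor
+* valuePointer - pointer to the value
+
+Set the tensor to an integer value:
+```
+void SetDataFixedInt(XTensor * tensor, int p)
+```
+Parameters:
+
+* tensor - input tensor
+* p - fixed integer value
+
+Set the tensor to a single-precision floating-point value:
+```
+void SetDataFixedFloat(XTensor * tensor, float p)
+```
+Parameters:
+
+* tensor - input tensor
+* p - fixed single-precision floating-point value
+
+Set the tensor to a double-precision floating-point value:
+```
+void SetDataFixedDouble(XTensor * tensor, double p)
+```
+Parameters:
+
+* tensor - input tensor
+* p - fixed double-precision floating-point value
+
+Fill the tensor with random values in a range:
+```
+void SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
+```
+Parameters:
+
+* tensor - input tensor
+* low - lower bound
+* high - upper bound
+
+Fill the tensor from a normal distribution:
 ```
-SetDataRand(DTYPE lower, DTYPE upper)
+void SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
 ```
 Parameters:

-* lower - lower bound
-* upper - upper bound
+* tensor - input tensor
+* mean - mean
+* standardDeviation - standard deviation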

 ##### Example snippet for SetData

@@ -553,7 +634,7 @@ s->SetDataRand(0.0, 1.0);

 NiuTrans.Tensor/Tensor/test/TSetData.cpp

-### math
+### Mathematical Operations (math)

 This part covers operations beyond basic arithmetic, such as log, exp and abs.

@@ -568,7 +649,11 @@ NiuTrans.Tensor/Tensor/test/TSetData.cpp

 NiuTrans.Tensor provides the Normalize operation. The call signatures and parameters are as follows:
 ```
-_Normalize(XTensor * input, XTensor * output, int dim, XTensor * mean, XTensor * var, XTensor * a, XTensor * b, DTYPE epsilon)
+void _Normalize(const XTensor * input, XTensor * output, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
+
+void _NormalizeMe(XTensor * input, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
+
+XTensor Normalize(const XTensor &input, int dim, const XTensor &mean, const XTensor &var, const XTensor &a, const XTensor &b, DTYPE epsilon)
 ```
 Parameters:

@@ -596,25 +681,25 @@ NiuTrans.Tensor/Tensor/test/TNormalize.cpp

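 ##### What is the tensor power operation?

-The power operation on a tensor raises every element of the tensor to a given power to produce a new tensor. For a tensor of dimension \\(3 \times 2\\) and the power 2.0 it works as follows:
+The power operation on a tensor raises every element of the tensor to a given power to produce a new tensor. For a tensor of dimension $3 \times 2$ and the power 2.0 it works as follows: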


 $$
-\left(\begin{matrix}1.0 & 2.0\\\\3.0 & 4.0\\\\5.0 & 6.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}1.0 & 4.0\\\\9.0 & 16.0\\\\25.0 & 36.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & 2.0\\3.0 & 4.0\\5.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & 4.0\\9.0 & 16.0\\25.0 & 36.0\end{matrix}\right)
 $$

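 ##### Calling the tensor power operation

 NiuTrans.Tensor provides the tensor power operation, which raises a tensor to a power element by element. It is called as follows: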
 ```
-_Power(XTensor * a, DTYPE p)
+void _Power(XTensor * a, DTYPE p)
 ```
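 Here a is the tensor operated on and p is the exponent. The parameters are as follows:

 Parameters: 

-* a - operand tensor
+* a - input tensor
 * p - exponent

 ##### Example snippet for the tensor power operation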
@@ -632,24 +717,29 @@ NiuTrans.Tensor/Tensor/test/TPower.cpp

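 ##### What is scaling and shifting a tensor?

-Scaling and shifting a tensor computes p = p * scale + shift, where scale and shift are the scaling and shifting parameters. For a \\(2 \times 4\\) tensor with a scale of 2.0 and a shift of 0.5 it works as follows:
+Scaling and shifting a tensor computes p = p * scale + shift, where scale and shift are the scaling and shifting parameters. For a $2 \times 4$ tensor with a scale of 2.0 and a shift of 0.5 it works as follows: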

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)
 $$

 ##### Calling scale and shift

 NiuTrans.Tensor provides the scale-and-shift operation. It is called as follows:
 ```
-_ScaleAndShift(XTensor * a, DTYPE scale, DTYPE shift)
+void _ScaleAndShift(const XTensor * a, XTensor * b, DTYPE scale, DTYPE shift = 0)
+
+void _ScaleAndShiftMe(XTensor * a, DTYPE scale, DTYPE shift = 0)
+
+XTensor ScaleAndShift(const XTensor &a, DTYPE scale, DTYPE shift = 0)
 ```
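 The result of the operation is p = p * scale + shift, where scale and shift are the scaling and shifting parameters. The parameters are as follows:

 Parameters:

 * a - input tensor
+* b - output tensor
 * scale - scaling parameter
 * shift - shifting parameter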
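
 The formula p = p * scale + shift is applied independently to every element. A minimal sketch, with `ToyScaleAndShift` as a hypothetical helper on a flat array rather than the library function:

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Returns b with b[i] = a[i] * scale + shift for every element.
 // Hypothetical helper for illustration.
 std::vector<float> ToyScaleAndShift(const std::vector<float>& a,
                                     float scale, float shift = 0.0f)
 {
     std::vector<float> b(a.size());
     for (std::size_t i = 0; i < a.size(); ++i)
         b[i] = a[i] * scale + shift;
     return b;
 }
 ```

 With scale 2.0 and shift 0.5, the values 0.0 through 7.0 become 0.5, 2.5, and so on up to 14.5, as in the example above.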

@@ -664,7 +754,7 @@ _ScaleAndShift(input, scaleFactor, shiftFactor);

 NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp

-### movement
+### Data Movement (movement)

 This part introduces the functions for copying data.

@@ -672,24 +762,26 @@ NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp

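 ##### What is the tensor copy operation?

-Copying assigns the values of one tensor to another tensor. For a \\(2 \times 4\\) tensor the copy works as follows:
+Copying assigns the values of one tensor to another tensor. For a $2 \times 4$ tensor the copy works as follows: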

 $$
-\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
+\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
 \left(
-\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
+\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
 $$

 ##### Calling the tensor copy operation

 NiuTrans.Tensor provides the tensor copy operation. The call signatures and parameters are as follows:
 ```
-_CopyValues(XTensor * s, XTensor * t, XStream * stream)
+void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL)
+
+XTensor CopyValues(const XTensor &s, XStream * stream = NULL)
 ```
 Parameters:

 * s - input tensor
-* t - output result tensor
+* t - output tensor
 * stream - multi-threading stream

 ##### Example snippet for tensor copy
@@ -707,30 +799,30 @@ NiuTrans.Tensor/Tensor/test/TCopyValues.cpp

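 ##### What is the CopyIndexed operation?

-CopyIndexed copies a tensor at given index positions. For a \\(3 \times 2 \times 3\\) tensor the copy works as follows; this example copies, along dimension 2, one element starting at each of the source indices 0 and 2, and the resulting tensor has dimension \\(3 \times 2 \times 2\\):
+CopyIndexed copies a tensor at given index positions. For a $3 \times 2 \times 3$ tensor the copy works as follows; this example copies, along dimension 2, one element starting at each of the source indices 0 and 2, and the resulting tensor has dimension $3 \times 2 \times 2$: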

 $$
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}0.0 & -1.0 & 2.0\\\\2.0 & 1.0 & 3.0\end{matrix}\right),\\\\ 
+\begin{matrix}0.0 & -1.0 & 2.0\\2.0 & 1.0 & 3.0\end{matrix}\right),\\ 
 & \left( 
-\begin{matrix}1.0 & 2.0 & 4.0\\\\3.0 & 1.0 & 2.0\end{matrix}
-\right),\\\\ 
+\begin{matrix}1.0 & 2.0 & 4.0\\3.0 & 1.0 & 2.0\end{matrix}
+\right),\\ 
 & \left( 
-\begin{matrix}-1.0 & 3.0 & 2.0\\\\1.0 & -1.0 & 0.0\end{matrix}
+\begin{matrix}-1.0 & 3.0 & 2.0\\1.0 & -1.0 & 0.0\end{matrix}
 \right)  
 \Biggr)
 \end{aligned} \rightarrow 
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}0.0 & 2.0\\\\2.0 & 3.0\end{matrix}\right),\\\\ 
+\begin{matrix}0.0 & 2.0\\2.0 & 3.0\end{matrix}\right),\\ 
 & \left( 
-\begin{matrix}1.0 & 4.0\\\\3.0 & 2.0\end{matrix}
-\right),\\\\ 
+\begin{matrix}1.0 & 4.0\\3.0 & 2.0\end{matrix}
+\right),\\ 
 & \left( 
-\begin{matrix}-1.0 & 2.0\\\\1.0 & 0.0\end{matrix}
+\begin{matrix}-1.0 & 2.0\\1.0 & 0.0\end{matrix}
 \right)  
 \Biggr)
 \end{aligned}
@@ -740,12 +832,14 @@ $$

 NiuTrans.Tensor provides the CopyIndexed operation. The call signatures and parameters are as follows:
 ```
-_CopyIndexed(XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
+void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
+
+XTensor CopyIndexed(const XTensor &s, int dim, int * srcIndex,int indexSize, int * tgtIndex, int copyNum)
 ```
 Parameters:

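 * s - input tensor
-* t - output result tensor
+* t - output tensor
 * dim - dimension along which CopyIndexed is applied
 * srcIndex - source indices, that is, the indices along dim of the values to copy
 * indexSize - number of source indices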
@@ -763,29 +857,31 @@ _CopyIndexed(s, t, 2, srcIndex, indexSize, tgtIndex, 1);

 NiuTrans.Tensor/Tensor/test/TCopyIndexed.cpp

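-### reduce
+### Reduction Operations (reduce)

 #### Max Reduction (ReduceMax)

 ##### What is the max reduction of a tensor?

-The max reduction takes the maximum values of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, taking the maximum along dimension 0 and along dimension 1 works as follows:
+The max reduction takes the maximum values of a tensor along one of its dimensions. For a $2 \times 4$ tensor, taking the maximum along dimension 0 and along dimension 1 works as follows: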

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
 $$

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}3.0\\\\7.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}3.0\\7.0\end{matrix}\right)
 $$
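
 For a two-dimensional tensor, the two cases above can be sketched as follows. `ToyReduceMax` is a hypothetical helper on row-major data with explicit `rows` and `cols`; it is not the library's `_ReduceMax`, which works on tensors of any order.

 ```cpp
 #include <algorithm>
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Maximum along dimension dim (0 or 1) of a row-major rows x cols matrix.
 // The reduced dimension disappears from the result.
 std::vector<float> ToyReduceMax(const std::vector<float>& in,
                                 std::size_t rows, std::size_t cols, int dim)
 {
     assert(dim == 0 || dim == 1);
     assert(rows > 0 && cols > 0 && in.size() == rows * cols);
     const std::size_t outLen = (dim == 0) ? cols : rows;
     const std::size_t redLen = (dim == 0) ? rows : cols;
     std::vector<float> out(outLen);
     for (std::size_t o = 0; o < outLen; ++o) {
         float best = (dim == 0) ? in[o] : in[o * cols];
         for (std::size_t r = 1; r < redLen; ++r) {
             const float v = (dim == 0) ? in[r * cols + o] : in[o * cols + r];
             best = std::max(best, v);
         }
         out[o] = best;
     }
     return out;
 }
 ```

 Reducing along dimension 0 leaves one value per column, and reducing along dimension 1 leaves one value per row.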

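 ##### Calling the max reduction

 NiuTrans.Tensor provides the ReduceMax operation, which returns the maximum values of a tensor along a given dimension. The call signatures and parameters are as follows: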
 ```
-_ReduceMax(XTensor * input, XTensor * output, int dim)
+void _ReduceMax(const XTensor * input, XTensor * output, int dim)
+
+XTensor ReduceMax(const XTensor &input, int dim)
 ```
 Parameters:

@@ -809,23 +905,25 @@ NiuTrans.Tensor/Tensor/test/TReduceMax.cpp

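 ##### What is the sum reduction of a tensor?

-The sum reduction computes the sum of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, summing along dimension 0 and along dimension 1 works as follows:
+The sum reduction computes the sum of a tensor along one of its dimensions. For a $2 \times 4$ tensor, summing along dimension 0 and along dimension 1 works as follows: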

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}4.0 & 6.0 & 8.0 & 10.0\end{matrix}\right)
 $$

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}6.0\\\\22.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}6.0\\22.0\end{matrix}\right)
 $$

 ##### Calling the sum reduction

 NiuTrans.Tensor provides the ReduceSum operation. It is called as follows:
 ```
-_ReduceSum(XTensor * input, XTensor * output, int dim, XTensor * shift, DTYPE power, bool isExp)
+void _ReduceSum(const XTensor * input, XTensor * output, int dim, const XTensor * shift = NULL, DTYPE power = (DTYPE)1.0F, bool isExp = false)
+
+XTensor ReduceSum(const XTensor &input, int dim, const XTensor &shift = NULLTensor, DTYPE power = (DTYPE)1.0F, bool isExp = false)
 ```
 Here shift defaults to NULL, power defaults to 1.0F and isExp defaults to false. The parameters of the sum reduction are as follows:

@@ -854,23 +952,25 @@ NiuTrans.Tensor/Tensor/test/TReduceSum.cpp

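 ##### What is the mean reduction of a tensor?

-The mean reduction computes the mean of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, taking the mean along dimension 0 and along dimension 1 works as follows:
+The mean reduction computes the mean of a tensor along one of its dimensions. For a $2 \times 4$ tensor, taking the mean along dimension 0 and along dimension 1 works as follows: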

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}2.0 & 3.0 & 4.0 & 5.0\end{matrix}\right)
 $$

 $$
-\left(\begin{matrix}1.0 & 1.0 & 3.0 & 3.0\\\\4.0 & 4.0 & 6.0 & 6.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}2.0\\\\5.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & 1.0 & 3.0 & 3.0\\4.0 & 4.0 & 6.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}2.0\\5.0\end{matrix}\right)
 $$
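
 The sum and mean reductions are closely related: the mean is the sum divided by the number of elements reduced. The sketch below shows both for a row-major matrix; `ToyReduceSum` and `ToyReduceMean` are hypothetical helpers and ignore the optional shift, power and isExp arguments of the library's ReduceSum.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Sum along dimension dim (0 or 1) of a row-major rows x cols matrix.
 std::vector<float> ToyReduceSum(const std::vector<float>& in,
                                 std::size_t rows, std::size_t cols, int dim)
 {
     assert(dim == 0 || dim == 1);
     assert(in.size() == rows * cols);
     std::vector<float> out(dim == 0 ? cols : rows, 0.0f);
     for (std::size_t r = 0; r < rows; ++r)
         for (std::size_t c = 0; c < cols; ++c)
             out[dim == 0 ? c : r] += in[r * cols + c];
     return out;
 }

 // Mean along dimension dim: the sum divided by the reduced length.
 std::vector<float> ToyReduceMean(const std::vector<float>& in,
                                  std::size_t rows, std::size_t cols, int dim)
 {
     std::vector<float> out = ToyReduceSum(in, rows, cols, dim);
     const std::size_t n = (dim == 0) ? rows : cols;   // elements averaged
     assert(n > 0);
     for (float& v : out)
         v /= static_cast<float>(n);
     return out;
 }
 ```

 Applied to the matrices in the sum and mean examples, these helpers reproduce the rows and columns shown there.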

 ##### Calling the mean reduction

 NiuTrans.Tensor provides the ReduceMean operation. It is called as follows:
 ```
-_ReduceMean(XTensor * input, XTensor * output, int dim)
+void _ReduceMean(const XTensor * input, XTensor * output, int dim)
+
+XTensor ReduceMean(const XTensor &input, int dim)
 ```
 ReduceMean returns the mean values of a tensor along a given dimension. The parameters of the mean reduction are as follows:

@@ -896,10 +996,10 @@ NiuTrans.Tensor/Tensor/test/TReduceMean.cpp

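 ##### What is the sum-of-squares reduction of a tensor?

-The sum-of-squares reduction computes, along one dimension of a tensor, the sum of the squared deviations of the elements from a shift value (for example the mean). For a \\(2 \times 4\\) tensor the reduction along dimension 0 works as follows:
+The sum-of-squares reduction computes, along one dimension of a tensor, the sum of the squared deviations of the elements from a shift value (for example the mean). For a $2 \times 4$ tensor the reduction along dimension 0 works as follows: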

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}8.0 & 8.0 & 8.0 & 8.0\end{matrix}\right)
 $$

@@ -907,7 +1007,9 @@ $$

 NiuTrans.Tensor provides the ReduceSumSquared operation. It is called as follows:
 ```
-_ReduceSumSquared(XTensor * input, XTensor * output, int dim, XTensor * shift)
+void _ReduceSumSquared(const XTensor * input, XTensor * output, int dim, const XTensor * shift)
+
+XTensor ReduceSumSquared(const XTensor &input, int dim, const XTensor &shift)
 ```
 ReduceSumSquared computes the sum of the squared deviations of the elements from shift along a given dimension. The parameters are as follows:

@@ -933,10 +1035,10 @@ NiuTrans.Tensor/Tensor/test/TReduceSumSquared.cpp

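 ##### What is the variance reduction of a tensor?

-The variance reduction computes the variance of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor the variance along dimension 0 is computed as follows:
+The variance reduction computes the variance of a tensor along one of its dimensions. For a $2 \times 4$ tensor the variance along dimension 0 is computed as follows: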

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}4.0 & 4.0 & 4.0 & 4.0\end{matrix}\right)
 $$
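
 The values 8.0 (ReduceSumSquared example) and 4.0 (ReduceVariance example) can be reproduced with the sketch below. It assumes that the column means (2, 3, 4, 5) are passed as the shift and that the variance divides by the number of rows, which is what the example values suggest; both helpers are hypothetical and reduce along dimension 0 only.

 ```cpp
 #include <cassert>
 #include <cstddef>
 #include <vector>

 // Sum over the rows (dimension 0) of (x - shift)^2 for each column of a
 // row-major rows x cols matrix. Hypothetical helper, not the library API.
 std::vector<float> ToySumSquaredDim0(const std::vector<float>& in,
                                      std::size_t rows, std::size_t cols,
                                      const std::vector<float>& shift)
 {
     assert(in.size() == rows * cols && shift.size() == cols);
     std::vector<float> out(cols, 0.0f);
     for (std::size_t r = 0; r < rows; ++r)
         for (std::size_t c = 0; c < cols; ++c) {
             const float d = in[r * cols + c] - shift[c];
             out[c] += d * d;
         }
     return out;
 }

 // Variance along dimension 0, assuming division by the number of rows.
 std::vector<float> ToyVarianceDim0(const std::vector<float>& in,
                                    std::size_t rows, std::size_t cols,
                                    const std::vector<float>& mean)
 {
     assert(rows > 0);
     std::vector<float> out = ToySumSquaredDim0(in, rows, cols, mean);
     for (float& v : out)
         v /= static_cast<float>(rows);
     return out;
 }
 ```

 For the first column (0, 4) with mean 2, the squared deviations are 4 and 4: their sum is 8 and their average is 4.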

@@ -944,7 +1046,9 @@ $$

 NiuTrans.Tensor provides the ReduceVariance operation. It is called as follows:
 ```
-_ReduceVariance(XTensor * input, XTensor * output, int dim, XTensor * mean)
+void _ReduceVariance(const XTensor * input, XTensor * output, int dim, const XTensor * mean)
+
+XTensor ReduceVariance(const XTensor &input, int dim, const XTensor &mean)
 ```
 ReduceVariance computes the variance of the elements of a tensor along a given dimension. The parameters are as follows:

@@ -966,7 +1070,7 @@ _ReduceVariance(input, output, 0, mean);

 NiuTrans.Tensor/Tensor/test/TReduceVariance.cpp

-### shape
+### Shape Transformation (shape)

 This part covers the functions that change the shape of a tensor, such as split, merge and reshape.

@@ -974,36 +1078,41 @@ NiuTrans.Tensor/Tensor/test/TReduceVariance.cpp

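 ##### What is tensor concatenation?

-Concatenation joins a series of tensors, or all tensors in a list, along one dimension to form a larger tensor. Concatenating two tensors of dimensions \\(2 \times 1\\) and \\(2 \times 2\\) works as follows:
+Concatenation joins a series of tensors, or all tensors in a list, along one dimension to form a larger tensor. Concatenating two tensors of dimensions $2 \times 1$ and $2 \times 2$ works as follows: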

 $$
-\left(\begin{matrix}0.0\\\\1.0\end{matrix}\right) +
-\left(\begin{matrix}2.0 & 3.0\\\\4.0 & 5.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.0 & 2.0 & 3.0\\\\1.0 & 4.0 & 5.0\end{matrix}\right)
+\left(\begin{matrix}0.0\\1.0\end{matrix}\right) +
+\left(\begin{matrix}2.0 & 3.0\\4.0 & 5.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 2.0 & 3.0\\1.0 & 4.0 & 5.0\end{matrix}\right)
 $$
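
As a rough illustration of the data movement behind this example, the following standalone sketch concatenates two row-major matrices along dimension 1. `Matrix2D` and `concatenateDim1` are hypothetical names introduced here; they are not part of NiuTrans.Tensor.

```cpp
#include <cstddef>
#include <vector>

// A dense 2D tensor stored in row-major order (hypothetical helper type).
struct Matrix2D {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
};

// Concatenate a and b along dimension 1 (columns).
// Assumes a.rows == b.rows.
Matrix2D concatenateDim1(const Matrix2D& a, const Matrix2D& b)
{
    Matrix2D big{a.rows, a.cols + b.cols, {}};
    big.data.reserve(big.rows * big.cols);
    for (std::size_t r = 0; r < a.rows; ++r) {
        // Row r of the result is row r of a followed by row r of b.
        for (std::size_t c = 0; c < a.cols; ++c)
            big.data.push_back(a.data[r * a.cols + c]);
        for (std::size_t c = 0; c < b.cols; ++c)
            big.data.push_back(b.data[r * b.cols + c]);
    }
    return big;
}
```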

 ##### Calling tensor concatenation

 NiuTrans.Tensor provides concatenation between tensors. It is called as follows:
-```
-_Concatenate(XList * smalls, XTensor * big, int dim)
-_Concatenate(XTensor * smallA, XTensor * smallB, XTensor * big, int dim)
-```
+
 The first form operates on a list: the tensors to be concatenated are stored in the list smalls, and the result is written to the tensor big:
+```
+void _Concatenate(const XList * smalls, XTensor * big, int dim)

+XTensor Concatenate(const XList &smalls, int dim)
+```
 Parameters:

 * smalls - the list of tensors to concatenate
-* big - 结果张量
+* big - the output tensor
 * dim - the dimension along which to concatenate

 The second form concatenates two tensors passed directly as arguments instead of taking them from a list:
+```
+void _Concatenate(const XTensor * smallA, const XTensor * smallB, XTensor * big, int dim)

+XTensor Concatenate(const XTensor &smallA, const XTensor &smallB, int dim)
+```
 Parameters:

-* smallA - 操作张量1
-* smallB - 操作张量2
-* big - 结果张量
+* smallA - the first input tensor
+* smallB - the second input tensor
+* big - the output tensor
 * dim - the dimension along which to concatenate

 ##### Tensor concatenation code snippet
@@ -1028,47 +1137,52 @@ NiuTrans.Tensor/Tensor/test/TConcatenate.cpp

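 Splitting operates along one dimension of a tensor. It can either turn one tensor into another tensor of a different shape, or cut a large tensor into a list of n smaller tensors.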

-第一种情况下将维度为\\(4 \times 3\\)张量沿着维度0进行切分，切分份数为2，得到维度为\\(2 \times 2 \times 3\\)的张量的过程如下所示：
+In the first case, a $4 \times 3$ tensor is split along dimension 0 into 2 parts, giving a $2 \times 2 \times 3$ tensor:

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
-\\\\ & \left( 
-\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
+\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right),
+\\ & \left( 
+\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}
 \right) \Biggr)
 \end{aligned}
 $$

-在第二种情况下将维度为\\(4 \times 3\\)张量沿着维度0进行切分，切分份数为2，得到两个维度均为\\(2 \times 3\\)的张量的过程如下所示：
+In the second case, a $4 \times 3$ tensor is split along dimension 0 into 2 parts, giving two $2 \times 3$ tensors:

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.0 & 2.0 & 3.0\\\\1.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
 $$
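
The second case (one tensor into a list of smaller tensors) can be sketched without the library as follows. `splitDim0` is a hypothetical helper on a row-major buffer; it assumes the number of rows is divisible by `splitNum`.

```cpp
#include <cstddef>
#include <vector>

// Split a row-major rows x cols matrix along dimension 0 into splitNum
// equally sized pieces. Assumes rows % splitNum == 0.
// Hypothetical helper, not the NiuTrans.Tensor _Split function.
std::vector<std::vector<float>> splitDim0(const std::vector<float>& big,
                                          std::size_t rows, std::size_t cols,
                                          std::size_t splitNum)
{
    using Diff = std::vector<float>::difference_type;
    const std::size_t chunk = rows / splitNum * cols;  // elements per piece

    std::vector<std::vector<float>> smalls;
    smalls.reserve(splitNum);
    for (std::size_t i = 0; i < splitNum; ++i) {
        // Rows are contiguous in row-major order, so each piece is one
        // contiguous slice of the source buffer.
        auto first = big.begin() + static_cast<Diff>(i * chunk);
        smalls.emplace_back(first, first + static_cast<Diff>(chunk));
    }
    return smalls;
}
```

Because splitting along dimension 0 of a row-major buffer yields contiguous slices, the first case ($4 \times 3 \rightarrow 2 \times 2 \times 3$) holds the same values in the same order; only the shape description changes.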

 ##### Calling tensor splitting

 NiuTrans.Tensor provides two splitting operations. They are called as follows:
-```
-_Split(XTensor * s, XTensor * t, int whereToSplit, int splitNum)
-_Split(XTensor * big, XList * smalls, int whereToSplit, int splitNum)
-```
+
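 The first form splits one dimension of the source tensor s and writes the result to the tensor t. whereToSplit is the dimension to split and splitNum is the number of parts, for example (N, M) -> (N/3, M, 3). The parameters are described below: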
+```
+void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)

+XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
+```
 Parameters:

-* s - 操作张量
-* t - 结果张量
+* s - the input tensor
+* t - the output tensor
 * whereToSplit - the dimension along which to split
 * splitNum - the number of parts

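 The second form splits the tensor big along the dimension whereToSplit and stores the resulting smaller tensors in the list smalls. splitNum is the number of parts, for example (N, M) -> 2 * (N/2, M). The parameters are described below: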
+```
+void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)

+XList SplitList(const XTensor &big, int whereToSplit, int splitNum)
+```
 Parameters:

-* big - 操作张量
+* big - the input tensor
 * smalls - the list that receives the split tensors
 * whereToSplit - the dimension along which to split
 * splitNum - the number of parts
@@ -1096,44 +1210,49 @@ NiuTrans.Tensor/Tensor/test/TSplit.cpp

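 Merging is similar to concatenation and operates along one dimension of a tensor. It can either merge one tensor into a tensor of a different shape, or merge all tensors in a list into one larger tensor.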

-在第一种情况下将维度为\\(2 \times 2 \times 3\\)的张量在维度1进行合并，进行合并的维度为0，得到维度为\\(4 \times 3\\)的张量的过程如下所示：
+In the first case, a $2 \times 2 \times 3$ tensor is merged along dimension 1, with dimension 0 as the dimension that is merged in, giving a $4 \times 3$ tensor:

 $$
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
-\\\\ & \left( 
-\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
+\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right),
+\\ & \left( 
+\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}
 \right) \Biggr)
 \end{aligned} \rightarrow 
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
 $$

-在第二种情况下将两个维度均为\\(2 \times 3\\)的张量沿着维度0合并为维度为\\(4 \times 3\\)的张量的过程如下所示：
+In the second case, two $2 \times 3$ tensors are merged along dimension 0 into a $4 \times 3$ tensor:

 $$
-\left(\begin{matrix}0.0 & 2.0 & 3.0\\\\1.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
 $$ 

 ##### Calling tensor merging

 NiuTrans.Tensor provides the merge operation on tensors. It is called as follows:
-```
-_Merge(XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
-_Merge(XList * smalls, XTensor * big, int whereToMerge)
-```
+
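 The first form merges one dimension of the source tensor s and writes the result to the tensor t. whereToMerge is the dimension along which to merge and leadingDim is the dimension that is merged in, for example (N/2, 2, M) -> (N, M). The parameters are described below: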
+```
+void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim = -1)

+XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim = -1)
+```
 Parameters:

-* s - 操作张量
-* t - 结果张量
+* s - the input tensor
+* t - the output tensor
 * whereToMerge - the dimension along which to merge
 * leadingDim - the dimension that is merged in

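 The second form takes the tensors to be merged from the list smalls and writes the result to the tensor big. whereToMerge is the dimension along which to merge, for example 2 * (N/2, M) -> (N, M). The parameters are described below: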
+```
+void _Merge(const XList * smalls, XTensor * big, int whereToMerge)

+XTensor Merge(const XList &smalls, int whereToMerge)
+```
 Parameters:

 * smalls - the list of tensors to merge
@@ -1161,26 +1280,26 @@ NiuTrans.Tensor/Tensor/test/TMerge.cpp

 ##### What is Unsqueeze?

-Unsqueeze的作用是通过对张量进行操作，返回一个新的在指定维度插入新维度的张量，这个返回的张量与源张量共享相同的基础数据，一个\\(2 \times 3\\)的张量在维度1和2分别进行Unsqueeze的操作如下所示，插入新的维度大小均为2：
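+Unsqueeze returns a new tensor with a new dimension inserted at the specified position; the returned tensor shares the same underlying data as the source tensor. Applying Unsqueeze to a $2 \times 3$ tensor at dimension 1 and at dimension 2, each time inserting a dimension of size 2, proceeds as follows: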

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow 
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0\\\\0.0 & 1.0 & 2.0\end{matrix}\right),
-\\\\ & \left( 
-\begin{matrix}3.0 & 4.0 & 5.0\\\\3.0 & 4.0 & 5.0\end{matrix}
+\begin{matrix}0.0 & 1.0 & 2.0\\0.0 & 1.0 & 2.0\end{matrix}\right),
+\\ & \left( 
+\begin{matrix}3.0 & 4.0 & 5.0\\3.0 & 4.0 & 5.0\end{matrix}
 \right) \Biggr)
 \end{aligned}
 $$

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow  
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow  
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 0.0\\\\1.0 & 1.0\\\\2.0 & 2.0\end{matrix}\right),
-\\\\ & \left( 
-\begin{matrix}3.0 & 3.0\\\\4.0 & 4.0\\\\5.0 & 5.0\end{matrix}
+\begin{matrix}0.0 & 0.0\\1.0 & 1.0\\2.0 & 2.0\end{matrix}\right),
+\\ & \left( 
+\begin{matrix}3.0 & 3.0\\4.0 & 4.0\\5.0 & 5.0\end{matrix}
 \right) \Biggr)
 \end{aligned}
 $$
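
The two results above follow one indexing rule, which the sketch below makes explicit. It is a standalone illustration with a hypothetical helper `unsqueezeCopy`, not the NiuTrans.Tensor function, and it copies the data into a new buffer rather than sharing it. The caller passes `outer` (the product of the dimension sizes before the insertion point) and `inner` (the product of the sizes from the insertion point on).

```cpp
#include <cstddef>
#include <vector>

// Insert a new dimension of size dSize into a row-major tensor viewed as
// [outer][inner]; the result is laid out as [outer][dSize][inner].
// Hypothetical helper; this version copies the data.
std::vector<float> unsqueezeCopy(const std::vector<float>& in,
                                 std::size_t outer, std::size_t inner,
                                 std::size_t dSize)
{
    std::vector<float> out;
    out.reserve(in.size() * dSize);
    for (std::size_t o = 0; o < outer; ++o)
        for (std::size_t k = 0; k < dSize; ++k)      // repeat the slice dSize times
            for (std::size_t i = 0; i < inner; ++i)
                out.push_back(in[o * inner + i]);
    return out;
}
```

For the $2 \times 3$ tensor, inserting at dimension 1 corresponds to `outer = 2, inner = 3` (each row is repeated), and inserting at dimension 2 corresponds to `outer = 6, inner = 1` (each element is repeated).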
@@ -1189,12 +1308,14 @@ $$

 NiuTrans.Tensor provides the Unsqueeze operation on tensors. The call and its parameters are shown below:
 ```
-_Unsqueeze(XTensor * a, XTensor * b, int dim, int dSize)
+void _Unsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
+
+XTensor Unsqueeze(const XTensor &a, int dim, int dSize)
 ```
 Parameters:

 * a - the input tensor
-* b - 输出结果张量
+* b - the output tensor
 * dim - the position at which the new dimension is inserted
 * dSize - the size of the inserted dimension

@@ -1210,7 +1331,7 @@ _Unsqueeze(s, t2, 2, 2);

 NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp

-### sort
+### Sorting (sort)

 This part covers sorting-related functions, such as sort and topk.

@@ -1218,23 +1339,23 @@ NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp

 ##### What is Sort?

-Sort操作是对张量中元素沿着指定的维度进行排序，一个\\(2 \times 4\\)的张量沿着维度0进行Sort操作过程如下所示：
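+Sort orders the elements of a tensor along a specified dimension. Sorting a $2 \times 4$ tensor along dimension 0 (in descending order in this example) proceeds as follows: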

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\\\\0.0 & 1.0 & 2.0 & 3.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\\0.0 & 1.0 & 2.0 & 3.0\end{matrix}\right)
 $$

 ##### Calling Sort

 NiuTrans.Tensor provides the Sort operation on tensors. The call and its parameters are shown below:
 ```
-_Sort(XTensor * a, XTensor * index, int dim)
+void _Sort(XTensor * a, XTensor * index, int dim)
 ```
 Parameters:

-* a - 操作张量
-* index - 结果张量中元素的索引
+* a - the input tensor
+* index - the indices of the elements of the output tensor
 * dim - the dimension along which to sort

 ##### Sort code snippet
@@ -1250,15 +1371,15 @@ _Sort(a, b, 0);

 ##### What is TopK?

-TopK操作是通过对张量中元素进行排序，得到最大或最小的k个元素值及其对应的索引值，在张量中，可以沿着某一维度进行TopK操作，一个\\(2 \times 4\\)的张量沿着维度0进行Top-2操作过程如下所示：
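+TopK sorts the elements of a tensor and returns the k largest or smallest values together with their indices. It can be applied along one dimension of a tensor. Taking the top 2 of a $2 \times 4$ tensor along dimension 0 proceeds as follows: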

 $$
-\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow 
 \begin{aligned}
 outputAnswer: & \left(
-\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)\\\\ +
-\\\\ indexAnswer: & \left(
-\begin{matrix}0 & 1 & 1 & 0\\\\1 & 0 & 0 & 1\end{matrix}\right)
+\begin{matrix}5.0 & 3.0 & 7.0 & 8.0\\4.0 & 1.0 & 2.0 & 6.0\end{matrix}\right)\\ +
+\\ indexAnswer: & \left(
+\begin{matrix}0 & 1 & 1 & 0\\1 & 0 & 0 & 1\end{matrix}\right)
 \end{aligned}
 $$
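
A standalone sketch of the same selection, for checking values and indices by hand. `TopKResult` and `topKDim0` are hypothetical names, not the NiuTrans.Tensor API; the sketch takes the k largest values per column and assumes `k <= rows`.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Result of a top-k selection along dimension 0 (hypothetical helper type).
struct TopKResult {
    std::vector<float> values;         // k x cols, row-major
    std::vector<std::size_t> indices;  // k x cols, row-major
};

// For every column of a row-major rows x cols matrix, pick the k largest
// values and the row indices they came from. Assumes k <= rows.
TopKResult topKDim0(const std::vector<float>& a,
                    std::size_t rows, std::size_t cols, std::size_t k)
{
    TopKResult res;
    res.values.resize(k * cols);
    res.indices.resize(k * cols);

    std::vector<std::size_t> order(rows);
    for (std::size_t c = 0; c < cols; ++c) {
        // Sort the row indices of this column by descending value.
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::stable_sort(order.begin(), order.end(),
                         [&](std::size_t i, std::size_t j) {
                             return a[i * cols + c] > a[j * cols + c];
                         });
        for (std::size_t i = 0; i < k; ++i) {
            res.values[i * cols + c] = a[order[i] * cols + c];
            res.indices[i * cols + c] = order[i];
        }
    }
    return res;
}
```

A full sort per column keeps the sketch short; for large inputs and small k, `std::partial_sort` would be the usual alternative.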

@@ -1266,12 +1387,12 @@ $$

 NiuTrans.Tensor provides the TopK operation on tensors. The call and its parameters are shown below:
 ```
-_TopK(XTensor * a, XTensor * b, XTensor * index, int dim, int k)
+void _TopK(XTensor * a, XTensor * b, XTensor * index, int dim, int k)
 ```
 Parameters:

 * a - the input tensor
-* b - 输出结果张量
+* b - the output tensor
 * index - the indices of the output values
 * dim - the dimension along which to apply TopK
 * k - the number of largest values to take
@@ -1287,7 +1408,7 @@ _TopK(input, outputA, indexA, dim, k);
 ```
 For a detailed TopK code example, see NiuTrans.Tensor/Tensor/test/TTopK.cpp

-### function
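+### Activation functions (function)

 This part covers activation functions and loss functions.

@@ -1302,7 +1423,7 @@

 NiuTrans.Tensor provides the Rectify activation function for tensors. The call and its parameters are shown below: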
 ```
-Rectify(XTensor * x, XTensor * y)
+void _Rectify(const XTensor * x, XTensor * y)
 ```
 Parameters:

@@ -1314,7 +1435,7 @@ Parameters:
 Example code for Rectify, where x is the input tensor and y is the output tensor:
 ```
 /* call Rectify function */
-Rectify(x, y);
+_Rectify(x, y);
 ```
 For a detailed Rectify code example, see:

@@ -1333,7 +1454,7 @@

 NiuTrans.Tensor provides the HardTanH activation function for tensors. The call and its parameters are shown below:
 ```
-HardTanH(XTensor * x, XTensor * y)
+void _HardTanH(const XTensor * x, XTensor * y)
 ```
 Parameters:

@@ -1345,7 +1466,7 @@ Parameters:
 Example code for HardTanH, where x is the input tensor and y is the output tensor:
 ```
 /* call hardtanh function */
-HardTanH(x, y);
+_HardTanH(x, y);
 ```
 For a detailed HardTanH code example, see:

@@ -1362,7 +1483,7 @@

 NiuTrans.Tensor provides the Identity activation function for tensors. The call and its parameters are shown below:
 ```
-Identity(XTensor * x, XTensor * y)
+void _Identity(const XTensor * x, XTensor * y)
 ```
 Parameters:

@@ -1374,7 +1495,7 @@ Parameters:
 Example code for Identity, where x is the input tensor and y is the output tensor:
 ```
 /* call Identity function */
-Identity(x, y);
+_Identity(x, y);
 ```
 For a detailed Identity code example, see:

@@ -1391,7 +1512,7 @@

 NiuTrans.Tensor provides the LogSoftmax activation function for tensors. The call and its parameters are shown below:
 ```
-LogSoftmax(XTensor * x, XTensor * y, int leadDim)
+void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
 ```
 Parameters:

@@ -1404,7 +1525,7 @@ Parameters:
 Example code for LogSoftmax, where x is the input tensor and y is the output tensor. In this example LogSoftmax is applied along dimension 1:
 ```
 /* call LogSoftmax function */
-LogSoftmax(x, y, 1);
+_LogSoftmax(x, y, 1);
 ```
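
To make the "along dimension 1" part concrete, here is a standalone sketch of log-softmax over each row of a row-major matrix. `logSoftmaxDim1` is a hypothetical helper, not the library function; subtracting the row maximum before exponentiating is a common numerical-stability technique and is an assumption of this sketch, not a statement about how NiuTrans.Tensor implements it. It assumes `cols > 0`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Log-softmax over each row (dimension 1) of a row-major rows x cols matrix.
// Hypothetical helper for illustration only.
std::vector<float> logSoftmaxDim1(const std::vector<float>& x,
                                  std::size_t rows, std::size_t cols)
{
    std::vector<float> y(x.size());
    for (std::size_t r = 0; r < rows; ++r) {
        const float* in = x.data() + r * cols;
        float* out = y.data() + r * cols;

        // Subtract the row maximum first so that exp() cannot overflow.
        float maxVal = *std::max_element(in, in + cols);
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            sum += std::exp(in[c] - maxVal);
        float logSum = std::log(sum);

        // log softmax(x)_c = x_c - max - log(sum_j exp(x_j - max))
        for (std::size_t c = 0; c < cols; ++c)
            out[c] = in[c] - maxVal - logSum;
    }
    return y;
}
```
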
 For a detailed LogSoftmax code example, see:

@@ -1421,7 +1542,7 @@

 NiuTrans.Tensor provides the Sigmoid activation function for tensors. The call and its parameters are shown below:
 ```
-Sigmoid(XTensor * x, XTensor * y)
+void _Sigmoid(const XTensor * x, XTensor * y)
 ```
 Parameters:

@@ -1433,7 +1554,7 @@ Parameters:
 Example code for Sigmoid, where x is the input tensor and y is the output tensor:
 ```
 /* call Sigmoid function */
-Sigmoid(x, y);
+_Sigmoid(x, y);
 ```
 For a detailed Sigmoid code example, see:

@@ -1450,7 +1571,7 @@

 NiuTrans.Tensor provides the Softmax activation function for tensors. The call and its parameters are shown below:
 ```
-Softmax(XTensor * x, XTensor * y, int leadDim)
+void _Softmax(const XTensor * x, XTensor * y, int leadDim)
 ```
 Parameters:

@@ -1463,7 +1584,7 @@ Parameters:
 Example code for Softmax, where x is the input tensor and y is the output tensor. In this example Softmax is applied along dimension 1:
 ```
 /* call Softmax function */
-Softmax(x, y, 1);
+_Softmax(x, y, 1);
 ```
 For a detailed Softmax code example, see:

@@ -1485,7 +1606,7 @@ one hot error : loss = sum_{i} e_i <br />

 NiuTrans.Tensor provides loss computation for tensors. The call and its parameters are shown below:
 ```
-LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName,bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
+DTYPE LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName, bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
 ```
 Parameters:

@@ -1513,10 +1634,365 @@ NiuTrans.Tensor/Tensor/test/TLoss.cpp

 ### Memory pool

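+Memory is an indispensable resource for running software and plays an important role in software development. How efficiently a system manages memory strongly affects its overall performance, especially its running speed. Memory management generally covers allocation, tracking and release; through the corresponding interfaces, variables can be defined, used and deleted in memory.
+Mainstream programming languages provide system-level interfaces for this (such as malloc and free in C, or new and delete in C++). Because these interfaces must be designed for all kinds of usage, they are not necessarily the best fit for a specific need, such as a strict speed requirement. Using them directly has the following drawbacks:
+1. Allocation and release are slow: to keep memory well utilized, the operating system searches for a free region using an algorithm such as first fit or best fit whenever memory is requested. Likewise, when memory is released, the system merges adjacent free regions where appropriate so that large contiguous blocks remain available. These steps make memory use more efficient, but they add overhead to every call, so frequent memory operations become expensive.
+2. Low execution efficiency: because the requested blocks vary in size, frequent use of the system-level interfaces tends to produce many memory fragments, which slows the system down.
+3. Memory leaks are easy to introduce: memory obtained through the system-level interfaces generally has to be released explicitly by the developer. An oversight causes a memory leak, which in severe cases can crash the software or even the system. Using these interfaces therefore requires careful analysis of how storage is used, together with leak-detection tools.
+
+In addition, when the system also manages device memory on a GPU, allocation and release cost more than for ordinary memory. Unlike a host memory request, allocating or releasing device memory interrupts the work the CPU is doing and hands the operation to the GPU device, so it takes far longer than a host allocation. Frequent operations on device memory therefore slow the whole system down even more.
+To address these problems, the system supports managing storage (both host memory and device memory) with a memory pool. The idea is to request one large block from the system before any storage is used, and to let the program itself (the memory pool) manage that block. Allocation and release then no longer call the system interfaces frequently, which avoids the cost of interrupts and of searching for the best block, and also makes fragmentation less likely. Moreover, because the pool is requested in a single operation, it does not cause large-scale memory leaks across the system, which helps stability.
+Specifically, using a memory pool (XMem) in the NiuTrans.Tensor toolkit takes only three steps: defining the pool, using it, and releasing it.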
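+* Defining a memory pool
+
+In the simplest case, a memory pool is defined by specifying only a device ID. Example code:
+```
+// Define a memory pool mem of type XMem
+XMem * mem = new XMem(devID);
+```
+To configure the pool in more detail, pass the parameters myMode, myBlockSize, myBlockNum and myBufSize when defining it to set the pool's usage mode, block size, number of blocks and buffer size.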
+
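+* Using a memory pool
+
+Once the pool is defined, variables can be defined and used in its space. Taking the definition of a tensor as an example:
+```
+// Declare a variable tensor of type XTensor
+XTensor tensor;
+
+// Initialize it on the memory pool as a matrix (order-2 tensor) with 50 rows and 100 columns
+InitTensor2D(&tensor, 50, 100, X_FLOAT, -1, mem);
+```
+Compared with the earlier definition without a memory pool, the only change is that the pool is specified when the tensor is defined; nothing more is needed.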
+
+* Releasing a memory pool
+
+To release the memory pool completely, simply delete it. Example code:
+```
+// Delete the memory pool mem
+delete mem;
+```
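
The idea described above (request one large block up front, then hand out pieces of it without calling the system allocator again) can be sketched as a minimal bump allocator. `SimplePool` is a hypothetical class written for this illustration; it is not XMem and says nothing about how XMem is implemented internally.

```cpp
#include <cstddef>
#include <vector>

// A minimal bump-style memory pool: one buffer is reserved at construction
// and each allocation just advances an offset inside it.
// Hypothetical class for illustration only.
class SimplePool {
public:
    explicit SimplePool(std::size_t capacity) : buffer_(capacity), used_(0) {}

    // Returns a pointer into the pool, or nullptr if the pool is exhausted.
    void* alloc(std::size_t size,
                std::size_t align = alignof(std::max_align_t))
    {
        // Round the current offset up to the requested alignment.
        std::size_t start = (used_ + align - 1) / align * align;
        if (start > buffer_.size() || size > buffer_.size() - start)
            return nullptr;
        used_ = start + size;
        return buffer_.data() + start;
    }

    // Release everything at once; individual blocks are never freed.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;  // the single block requested up front
    std::size_t used_;                   // offset of the first free byte
};
```

The trade-off in this design is that single allocations cannot be returned to the pool: memory is reclaimed only by `reset()` or when the pool itself is destroyed, which mirrors the "delete the pool" step above.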
+
 ## Example 1: Matrix multiplication

+The matrix multiplication example provided with NiuTrans.Tensor is shown below. The full code is in NiuTrans.Tensor/Tensor/sample/mul/
+
+```
+#include "mul.h"
+
+namespace nts
+{
+void sampleMUL()
+{
+    DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
+                          { -4.0F, 5.0F, 6.0F } };
+    DTYPE bData[3][2] = { { 0.0F, -1.0F },
+                          { 1.0F, 2.0F },
+                          { 2.0F, 1.0F } };
+    DTYPE answer[2][2] = { { 8.0F, 6.0F },
+                           { 17.0F, 20.0F } };
+
+    XTensor a;
+    XTensor b;
+    XTensor result;
+
+    InitTensor2D(&a, 2, 3);
+    InitTensor2D(&b, 3, 2);
+
+    a.SetData(aData, 6);
+    b.SetData(bData, 6);
+
+    result = MatrixMul(a, X_NOTRANS, b, X_NOTRANS);
+
+    result.Dump(stderr, "result:");
+
+    if (result.CheckData(answer, 4))
+        fprintf(stderr, "answer is right\n");
+
+}
+
+void sampleMUL1()
+{
+    DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
+                          { -4.0F, 5.0F, 6.0F } };
+    DTYPE bData[3][2] = { { 0.0F, -1.0F },
+                          { 1.0F, 2.0F },
+                          { 2.0F, 1.0F } };
+    DTYPE answer[2][2] = { { 8.0F, 6.0F },
+                           { 17.0F, 20.0F } };
+
+    /* a source tensor of size (2, 3) */
+    int aOrder = 2;
+    int * aDimSize = new int[aOrder];
+    aDimSize[0] = 2;
+    aDimSize[1] = 3;
+
+    int aUnitNum = 1;
+    for (int i = 0; i < aOrder; i++)
+        aUnitNum *= aDimSize[i];
+
+    /* a source tensor of size (3, 2) */
+    int bOrder = 2;
+    int * bDimSize = new int[bOrder];
+    bDimSize[0] = 3;
+    bDimSize[1] = 2;
+
+    int bUnitNum = 1;
+    for (int i = 0; i < bOrder; i++)
+        bUnitNum *= bDimSize[i];
+
+    /* a target tensor of size (2, 2) */
+    int resultOrder = 2;
+    int * resultDimSize = new int[resultOrder];
+    resultDimSize[0] = 2;
+    resultDimSize[1] = 2;
+
+    int resultUnitNum = 1;
+    for (int i = 0; i < resultOrder; i++)
+        resultUnitNum *= resultDimSize[i];
+
+    XTensor * a = NewTensor(aOrder, aDimSize);
+    XTensor * b = NewTensor(bOrder, bDimSize);
+    XTensor * result = NewTensor(resultOrder, resultDimSize);
+
+    a->SetData(aData, aUnitNum);
+    b->SetData(bData, bUnitNum);
+    result->SetZeroAll();
+
+    _MatrixMul(a, X_NOTRANS, b, X_NOTRANS, result);
+
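+    result->Dump(stderr, "result:");
+
+    /* release the tensors and dimension arrays allocated above */
+    delete a;
+    delete b;
+    delete result;
+    delete[] aDimSize;
+    delete[] bDimSize;
+    delete[] resultDimSize;
+}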
+}
+```
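
The expected `answer` in the sample can be checked independently of the library with a plain triple loop. `matMul` below is a hypothetical standalone helper on row-major buffers, not the `MatrixMul` function used above.

```cpp
#include <cstddef>
#include <vector>

// Multiply an n x k matrix a by a k x m matrix b (both row-major) and
// return the n x m product. Hypothetical helper for checking the sample.
std::vector<float> matMul(const std::vector<float>& a,
                          const std::vector<float>& b,
                          std::size_t n, std::size_t k, std::size_t m)
{
    std::vector<float> c(n * m, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < m; ++j)
                c[i * m + j] += a[i * k + p] * b[p * m + j];
    return c;
}
```

With the sample data, the first row gives $1 \cdot 0 + 2 \cdot 1 + 3 \cdot 2 = 8$ and $1 \cdot (-1) + 2 \cdot 2 + 3 \cdot 1 = 6$; the second row gives $17$ and $20$, matching `answer`.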
+
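 ## Example 2: Feedforward neural network

+The following is part of the feedforward neural network example that NiuTrans.Tensor provides for the language modeling task. It shows the forward pass and the backward pass of a feedforward neural network language model. The full code is in NiuTrans.Tensor/Tensor/sample/fnnlm/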
+
+```
+/*
+forward procedure
+>> inputs - input word representations
+>> output - output probability
+>> model - the fnn model
+>> net - the network that keeps the internal tensors generated in the process
+*/
+void Forward(XTensor inputs[], XTensor &output, FNNModel &model, FNNNet &net)
+{
+    int batchSize = -1;
+    int n = model.n;
+    int depth = model.hDepth;
+    XList eList(n - 1);
+
+    /* previous n - 1 words */
+    for(int i = 0; i < n - 1; i++){
+        XTensor &input = inputs[i];
+        XTensor &w = model.embeddingW;
+        XTensor &embedding = net.embeddings[i];
+
+        if(batchSize == -1)
+            batchSize = input.dimSize[0];
+        else{
+            CheckErrors(batchSize == input.dimSize[0], "Wrong input word representations!");
+        }
+
+        /* embedding output tensor of position i */
+        InitModelTensor2D(embedding, batchSize, model.eSize, model);
+
+        /* generate word embedding of position i:
+           embedding = input * w   */
+        _MatrixMul(&input, X_NOTRANS, &w, X_NOTRANS, &embedding);
+
+        eList.Add(&net.embeddings[i]);
+    }
+
+    /* concatenate word embeddings
+       embeddingcat = cat(embedding_0...embedding_{n-1}) */
+    InitModelTensor2D(net.embeddingCat, batchSize, (n - 1) * model.eSize, model);
+    _Concatenate(&eList, &net.embeddingCat, 1);
+
+    /* go over each hidden layer */
+    for(int i = 0; i < depth; i++){
+        XTensor &h_pre = i == 0 ? net.embeddingCat : net.hiddens[i - 1];
+        XTensor &w = model.hiddenW[i];
+        XTensor &b = model.hiddenB[i];
+        XTensor &h = net.hiddens[i];
+        XTensor &s = net.hiddenStates[i];
+
+        InitModelTensor2D(h, batchSize, model.hSize, model);
+        InitModelTensor2D(s, batchSize, model.hSize, model);
+
+        /* generate hidden states of layer i: 
+           s = h_pre * w    */
+        _MatrixMul(&h_pre, X_NOTRANS, &w, X_NOTRANS, &s);
+
+        /* make a 2d tensor for the bias term */
+        XTensor b2D;
+        InitTensor(&b2D, &s);
+        _Unsqueeze(&b, &b2D, 0, batchSize);
+
+        /* introduce bias term:
+           s = s + b
+           NOTE: the trick here is to extend b to a 2d tensor
+                 to fit into the 2d representation in tensor summation */
+        _Sum(&s, &b2D, &s);
+
+        /* pass the state through the hard tanh function:
+           h = hardtanh(s) */
+        _HardTanH(&s, &h);
+    }
+
+    /* generate the output Pr(w_{n-1}|w_0...w_{n-2}):
+       y = logsoftmax(h_last * w + b)
+       Note that this follows the formulation in Bengio et al.'s paper.
+       A bias term b is added to the output layer here. */
+    {
+        XTensor &h_last = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
+        XTensor &w = model.outputW;
+        XTensor &b = model.outputB;
+        XTensor &s = net.stateLast;
+        XTensor &y = output;
+
+        InitModelTensor2D(s, batchSize, model.vSize, model);
+        InitModelTensor2D(y, batchSize, model.vSize, model);
+
+        /* s = h_last * w  */
+        _MatrixMul(&h_last, X_NOTRANS, &w, X_NOTRANS, &s);
+
+        XTensor b2D;
+        InitTensor(&b2D, &s);
+        _Unsqueeze(&b, &b2D, 0, batchSize);
+
+        _Sum(&s, &b2D, &s);
+
+        /* y = logsoftmax(s) */
+        _LogSoftmax(&s, &y, 1);
+    }   
+}
+
+/*
+backward procedure
+>> inputs - input word representations
+>> output - output probability
+>> gold - gold standard
+>> loss - loss function name
+>> model - the fnn model
+>> grad - the model that keeps the gradient information
+>> net - the network that keeps the internal tensors generated in the process
+*/
+void Backward(XTensor inputs[], XTensor &output, XTensor &gold, LOSS_FUNCTION_NAME loss, 
+              FNNModel &model,  FNNModel &grad, FNNNet &net)
+{
+    int batchSize = output.GetDim(0);
+    int n = model.n;
+    int depth = model.hDepth;
+
+    /* back-propagation for the output layer */
+    XTensor &y = output;
+    XTensor &s = net.stateLast;
+    XTensor &x = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
+    XTensor &w = model.outputW;
+    XTensor &dedw = grad.outputW;
+    XTensor &dedb = grad.outputB;
+    XTensor deds(&y);
+    XTensor dedx(&x);
+
+    /* for y = logsoftmax(s), we get dE/ds
+        where E is the error function (defined by loss) */
+    _LogSoftmaxBackward(&gold, &y, &s, NULL, &deds, 1, loss);
+
+    /* for s = x * w, we get 
+       dE/dw_{i,j} = dE/ds_j * ds/dw_{i,j} 
+                  = dE/ds_j * x_{i}
+       (where i and j are the row and column indices, and
+        x is the top most hidden layer)
+       so we know 
+       dE/dw = x^T * dE/ds */
+    _MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);
+
+    /* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
+    specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
+    _ReduceSum(&deds, &dedb, 0);
+
+    /* then, we compute 
+       dE/dx_{j} = \sum_j' (dE/ds_{j'} * ds_{j'}/dx_j) 
+                 = \sum_j' (dE/ds_{j'} * w_{j, j'})
+       i.e., 
+       dE/dx = dE/ds * w^T */
+    _MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);
+
+    XTensor &gradPassed = dedx;
+    XTensor dedsHidden;
+    XTensor dedxBottom;
+    if (depth > 0)
+        InitTensor(&dedsHidden, &dedx);
+    InitTensor(&dedxBottom, &net.embeddingCat);
+
+    /* back-propagation from top to bottom in the stack of hidden layers
+       for each layer, h = f(s)
+                       s = x * w + b */
+    for (int i = depth - 1; i >= 0; i--) {
+        XTensor &h = net.hiddens[i];
+        XTensor &s = net.hiddenStates[i];
+        XTensor &x = i == 0 ? net.embeddingCat : net.hiddens[i - 1];  // input of layer i, as used in Forward
+        XTensor &w = model.hiddenW[i];
+        XTensor &dedh = gradPassed;  // gradient passed through the previous layer
+        XTensor &dedx = i == 0 ? dedxBottom : dedh;
+        XTensor &deds = dedsHidden;
+        XTensor &dedw = grad.hiddenW[i];
+        XTensor &dedb = grad.hiddenB[i];
+        
+        /* backpropagation through the activation function: 
+           dE/ds = dE/dh * dh/ds */
+        _HardTanHBackward(NULL, &h, &s, &dedh, &deds, NOLOSS);
+
+        /* gradient of the weight: dE/dw = x^T * dE/ds   */
+        _MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);
+
+        /* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
+           specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
+        _ReduceSum(&deds, &dedb, 0);
+
+        /* gradient of the input: dE/dx = dE/ds * w^T    */
+        _MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);
+
+        if (i > 0)
+            _CopyValues(&dedx, &gradPassed);
+    }
+
+    XList eList(n - 1);
+
+    /* back-propagation for the embedding layer */
+    for (int i = 0; i < n - 1; i++) {
+        XTensor * dedy = NewTensor2D(batchSize, model.eSize, X_FLOAT, model.devID, model.mem);
+        eList.Add(dedy);
+    }
+
+    /* gradient of the concatenation of the embedding layers */
+    XTensor &dedyCat = depth > 0 ? dedxBottom : dedx;
+
+    /* split the concatenation of gradients of the embeddings */
+    _Split(&dedyCat, &eList, 1, n - 1);
+
+    /* go over for each word */
+    for (int i = 0; i < n - 1; i++) {
+        XTensor * dedy = (XTensor*)eList.GetItem(i);
+        XTensor &x = inputs[i];
+        XTensor &dedw = grad.embeddingW;
+
+        /* gradient of the embedding weight: dE/dw += x^T * dE/dy 
+           NOTE that we accumulate dE/dw here because the matrix w
+           is shared by several layers (or words) */
+        _MatrixMul(&x, X_TRANS, dedy, X_NOTRANS, &dedw, 1.0F, 1.0F);
+
+        delete dedy;
+    }
+}
+```
+
 ## Example 3: Recurrent neural network

 ## Acknowledgements
@@ -1527,13 +2003,15 @@ NiuTrans.Tensor/Tensor/test/TLoss.cpp

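 | Member variable | Description |
 | - | - |
+| int id | ID of the tensor |
 | XMem * mem | memory pool used by the tensor |
 | void * data | data array that holds the elements |
 | void * dataHost | copy of the data in host memory; only active when running on a GPU |
+| void ** dataP | pointer to the address of the data |
 | int devID | device ID: the index of the CPU or GPU device on which the tensor's space is allocated; -1 means CPU |
 | int order | order of the tensor; for example, a matrix is a tensor of order 2 |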
-| int dimSize<br> [MAX_TENSOR_DIM_NUM] | 张量中每一维度的大小，索引0表示第1维 |
-| int dimSizeRDI<br> [MAX_TENSOR_DIM_NUM] | 转置模式下张量中每一维度的大小，索引0表示第1维 |
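+| int dimSize[ ] | size of each dimension of the tensor; index 0 is the first dimension |
+| int dimSizeRDI[ ] | size of each dimension of the tensor in transposed mode; index 0 is the first dimension |
 | TENSOR_DATA_TYPE dataType | data type of each data unit |
 | int unitSize | size of a data unit, similar to sizeof() |
 | int unitNum | number of data units |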
@@ -1541,17 +2019,34 @@ NiuTrans.Tensor/Tensor/test/TLoss.cpp
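 | int unitNumNonZero | number of non-zero elements in a sparse matrix |
 | float denseRatio | density: the proportion of non-zero units, a real number between 0 and 1; 0 means all units are zero and 1 means all units are non-zero |
 | bool isShared | whether the data array is shared with other tensors |
+| bool isDefaultDType | whether the data type in use is the default data type |
 | bool isInGlobalMem | whether the data is in global memory rather than in a memory pool |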
-| bool isAllValued<br> [MAX_TENSOR_DIM_NUM] | 标志稀疏矩阵中是否每个维度都具有非零元素 |
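+| bool isAllValued[ ] | whether every dimension of the sparse matrix has non-zero elements |
+| bool isInit | whether the tensor has been initialized |
+| bool isTmp | whether the tensor was created as a temporary |
+| bool isGrad | whether the tensor keeps a gradient when used as a model parameter |
+| unsigned int visitMark | node visit mark |
+| XTensor * grad | gradient for back-propagation |
+| XLink income | incoming hyperedge |
+| XLink outgo | outgoing hyperedge |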

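 Methods defined in the XTensor.h header file:

 | Purpose | Function | Parameters |
 | - | - | - |
+| Constructor | XTensor() | N/A |
+| Destructor | ~XTensor() | N/A |
+| Initialize the member variables | void Init() | N/A |
+| Destroy the data | void DestroyData() | N/A |
+| Shallow copy of a tensor | void ShallowCopy(<br>const XTensor &tensor) | tensor - the tensor to copy |
+| Overloaded assignment operator | XTensor& operator= (<br>const XTensor &tensor) | tensor - the tensor operand |
+| Overloaded addition operator | XTensor  operator+ (<br>const XTensor &tensor) | tensor - the tensor operand |
+| Overloaded multiplication operator | XTensor  operator* (<br>const XTensor &tensor) | tensor - the tensor operand |
+| Linear transformation | XTensor Lin(<br>DTYPE scale, DTYPE shift = 0) | scale - scaling factor <br> shift - offset |
 | Check whether two tensors have<br>the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b) | a - the first tensor to compare <br> b - the second tensor to compare |
 | Check whether three tensors have<br>the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b, XTensor * c) | a - the first tensor to compare <br> b - the second tensor to compare <br> c - the third tensor to compare |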
-| 设置张量每一维度的大小 | void SetDim(int * myDimSize) |myDimSize - 张量每一维度的大小 |
-| 得到张量中给定的维度大小 | int GetDim(const int dim) | dim - 张量的维度 |
+| Set the size of each dimension | void SetDim(<br>int * myDimSize) | myDimSize - size of each dimension of the tensor |
+| Get the size of a given dimension | int GetDim(<br>const int dim) | dim - index of the dimension |
 | Reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - order of the tensor <br> myDimSize - size of each dimension of the tensor |
 | Get the number of elements | int GetSize() | N/A |
 | Get the memory usage | int GetDataSizeInChar() | N/A |
@@ -1561,15 +2056,26 @@ NiuTrans.Tensor/Tensor/test/TLoss.cpp
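 | Fill the tensor from a uniform distribution | void SetDataRand(<br> DTYPE lower, DTYPE upper) | lower - minimum value <br> upper - maximum value |
 | Fill the tensor from a normal distribution | void SetDataRandn(<br> DTYPE mean, DTYPE standardDeviation) | mean - mean <br> standardDeviation - standard deviation |
 | Check whether the tensor elements<br>match the given data | bool CheckData(<br> const void * answer, int num, int beg = 0) | answer - the given array <br> num - size of the array <br> beg - position in the tensor at which the comparison starts |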
-| 将给定维度中元素<br> 设置为升序 | void SetAscendingOrder(int dim) | dim - 给定维度 |
-| 获取张量中元素指针 | void * GetCell(int * index, int size)    | index - 元素位置 <br> size-矩阵大小 |
-| 获取二维张量中元素指针 | void * GetCell2D(int ni, int mi = 0) | ni - 行值 <br> mi - 列值 |
-| 获取二维张量的值 | DTYPE Get2D(int ni, int mi = 0) | ni - 行值 <br> mi - 列值 |
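+| Set the data pointer | void SetDataPointer() | N/A |
+| Set the elements along a given<br> dimension in ascending order | void SetAscendingOrder(<br>int dim) | dim - the given dimension |
+| Get the value of the cell at an index | DTYPE Get(int index[], int size = -1) | index - the given index <br> size - matrix size |
+| Get a pointer to an element | void * GetCell(<br>int * index, int size)    | index - position of the element <br> size - matrix size |
+| Get an element of a 1D tensor<br>as the default type | DTYPE Get1D(<br>int i) | i - index in the first dimension |
+| Get an element of a 2D tensor<br>as the default type | DTYPE Get2D(<br>int ni, int mi) const | ni - index in the first dimension <br> mi - index in the second dimension |
+| Get an element of a 3D tensor<br>as the default type | DTYPE Get3D(<br>int d0, int d1, int d2) | d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
+| Get an element of a 1D tensor<br>as an integer | int Get1DInt(<br>int i) | i - index in the first dimension |
+| Get an element of a 2D tensor<br>as an integer | int Get2DInt(<br>int ni, int mi) | ni - index in the first dimension <br> mi - index in the second dimension |
+| Get an element of a 3D tensor<br>as an integer | int Get3DInt(<br>int d0, int d1, int d2) | d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
 | Get a value from a sparse tensor | DTYPE GetInSparse(int i) | i - position among the non-zero elements of the sparse matrix |
 | Get the key of a tuple<br> in a sparse tensor | int GetKeyInSparse(int i) | i - position among the non-zero elements of the sparse matrix |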
-| 设置二维张量中<br> 的单元值 | bool Set2D(DTYPE value, int ni, int mi = 0) | value - 单元值 <br> ni - 行值 <br> mi - 列值 |
-| 增加二维张量中<br> 的单元值 | bool Add2D(DTYPE value, int ni, int mi = 0) | value - 单元值 <br> ni - 行值 <br> mi - 列值 |
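+| Set the value of a cell | bool Set(<br>DTYPE value, int index[], int size = -1) | value - the value <br> index - position of the element <br> size - matrix size |
+| Set a cell of a 1D tensor | bool Set1D(<br>DTYPE value, int i) | value - the value <br> i - index in the first dimension |
+| Set a cell of a 2D tensor | bool Set2D(<br>DTYPE value, int ni, int mi) | value - the value <br> ni - index in the first dimension <br> mi - index in the second dimension |
+| Set a cell of a 3D tensor | bool Set3D(<br>DTYPE value, int d0, int d1, int d2) | value - the value <br> d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
+| Add to a cell<br> of a 2D tensor | bool Add2D(<br>DTYPE value, int ni, int mi = 0) | value - the value to add <br> ni - row index <br> mi - column index |
 | Get the number of non-zero<br> elements in a sparse matrix | int GetNonzeroSize() | N/A |
+| Mark the tensor as a temporary | void SetTMP(<br>bool myIsTmp = true) | myIsTmp - whether the tensor is a temporary |
+| Set whether the tensor keeps a gradient | void SetGrad(<br>bool myIsGrad = true) | myIsGrad - whether to keep a gradient |
 | Resize the tensor to a given size | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means a dense tensor |
 | Resize the tensor to a given size<br>without allocating new space | bool ResizeWithNoData(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means a dense tensor |
 | Resize the tensor to the size<br> of another tensor | bool Resize(<br> const XTensor * myTensor) | myTensor - the reference tensor whose size is used |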

--- a/source/network/XBackwardMath.cpp
+++ b/source/network/XBackwardMath.cpp
@@ -43,6 +43,40 @@ void XMathGrad::MakeGrad(XTensor * node)
        GradMultiply(node);
    else if(operID == MATH_MATRIXMUL)
        GradMatrixMul(node);
+    else if (operID == MATH_LOG)
+        GradLog(node);
+    else if (operID == MATH_POWER)
+        GradPower(node);
+    else if (operID == MATH_NEGATE)
+        GradNegate(node);
+    else if (operID == MATH_SCALEANDSHIFT)
+        GradScaleAndShift(node);
+    else if (operID == MATH_DIV)
+        GradDiv(node);
+    else if (operID == MATH_SUB)
+        GradSub(node);
+    else if (operID == MATH_SIN)
+        GradSin(node);
+    else if (operID == MATH_COS)
+        GradCos(node);
+    else if (operID == MATH_TAN)
+        GradTan(node);
+    else if (operID == MATH_EXP)
+        GradExp(node);
+    else if (operID == MATH_NORMALIZE)
+        GradNormalize(node);
+    else if (operID == MATH_ABSOLUTE)
+        GradAbsolute(node);
+    else if (operID == MATH_SIGN)
+        GradSign(node);
+    else if (operID == REDUCE_REDUCEMEAN)
+        GradReduceMean(node);
+    else if (operID == REDUCE_REDUCESUM)
+        GradReduceSum(node);
+    else if (operID == REDUCE_REDUCESUMSQUARED)
+        GradReduceSumSquared(node);
+    else if (operID == REDUCE_REDUCEVARIANCE)
+        GradReduceVariance(node);
    else{
        ShowNTErrors("TODO!");
    }
@@ -72,6 +106,7 @@ void XMathGrad::GradSum(XTensor * node)
    XTensor * a = income.tails[0];
    XTensor * b = income.tails[1];
    DTYPE beta = income.GetParam(0);
+
    XNoder::MakeGrad(a);
    XNoder::MakeGrad(b);

@@ -271,4 +306,553 @@ void XMathGrad::GradMatrixMul(XTensor * node)
    node->visitMark = NODE_FINISHED;
 }

+/*
+gradient for log
+for
+c = log(a)
+we have
+dE/da = dE/dc * 1/a
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradLog(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for LOG!");
+
+    XTensor * a = income.tails[0];
+
+    XNoder::MakeGrad(a);
+
+    _Div(node->grad, a, a->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+}
+
+/*
+gradient for power
+for
+c = pow(a,p)
+we have
+dE/da = (dE/dc) * p*a^(p-1)
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradPower(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for POWER!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+    XTensor * c = NewTensor(a);
+
+    DTYPE p = income.GetParam(0);
+
+    XNoder::MakeGrad(a);
+
+    _Power(a, b, p - 1.0F);
+    _ScaleAndShift(b, c, p);
+    _Multiply(node->grad, c, a->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+    delete c;
+}
+
+/*
+gradient for negate
+for
+c = -a
+we have
+dE/da = dE/dc * (-1)
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradNegate(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for NEGATE!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    _ScaleAndShift(node->grad, b, -1.0F);
+    _Sum(a->grad, b, a->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for ScaleAndShift
+for
+c = a * scale + shift
+we have
+dE/da = dE/dc * scale
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradScaleAndShift(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SCALEANDSHIFT!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    DTYPE scale = income.GetParam(0);
+
+    XNoder::MakeGrad(a);
+
+    _ScaleAndShift(node->grad, b, scale);
+    _Sum(a->grad, b, a->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for minus
+for
+c =  a - b * \beta
+we have
+dE/da = dE/dc
+dE/db = -dE/dc * \beta
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradSub(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for SUBTRACT!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = income.tails[1];
+    DTYPE beta = income.GetParam(0);
+
+    XNoder::MakeGrad(a);
+    XNoder::MakeGrad(b);
+
+    _Sum(a->grad, node->grad, a->grad);
+    _Sum(b->grad, node->grad, b->grad, -beta);
+
+    node->visitMark = NODE_FINISHED;
+}
+
+/*
+gradient for divide
+for
+c =  a / b
+we have
+dE/da = dE/dc / b
+dE/db = dE/dc * a / -b^2
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradDiv(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for DIVIDE!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = income.tails[1];
+    XTensor * c = NewTensor(b);
+    XTensor * d = NewTensor(b);
+    XTensor * e = NewTensor(b);
+
+    XNoder::MakeGrad(a);
+    XNoder::MakeGrad(b);
+
+    CheckNTErrors(XTensor::IsSameShaped(a, b), "Wrong sized input tensors!");
+
+    _Div(node->grad, b, a->grad, 1.0F);
+    _Power(b, c, -2.0F);
+    _Multiply(a, c, d);
+    _ScaleAndShift(d, e, -1.0F);
+    _Multiply(node->grad, e, b->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete c;
+    delete d;
+    delete e;
+}
+
+/*
+gradient for exp
+for
+c = exp(a)
+we have
+dE/da = dE/dc * exp(a)
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradExp(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for EXP!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    _Exp(a, b);
+    _Multiply(node->grad, b, a->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for sin
+for
+c = sin(a)
+we have
+dE/da = dE/dc * cos(a)
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradSin(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIN!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    _Cos(a, b);
+    _Multiply(node->grad, b, a->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for cos
+for
+c = cos(a)
+we have
+dE/da = dE/dc * -sin(a)
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradCos(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for COS!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+    XTensor * c = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    _Sin(a, b);
+    _ScaleAndShift(b, c, -1.0F);
+    _Multiply(node->grad, c, a->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+    delete c;
+}
+
+/*
+gradient for tan
+for
+c = tan(a)
+we have
+dE/da = dE/dc * 1/(cos(a))^2
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradTan(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TAN!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+    XTensor * c = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    _Cos(a, b);
+    _Power(b, c, -2.0F);
+    _Multiply(node->grad, c, a->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+    delete c;
+}
+
+/*
+gradient for normalize
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradNormalize(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 5, "Wrong input tensor number for NORMALIZE!");
+
+    XTensor * input = income.tails[0];
+    XTensor * mean = income.tails[1];
+    XTensor * var = income.tails[2];
+    XTensor * a = income.tails[3];
+    XTensor * b = income.tails[4];
+    XTensor * c = NewTensor(a);
+    XTensor * d = NewTensor(a);
+    XTensor * e = NewTensor(a);
+    XTensor * f = NewTensor(a);
+    XTensor * g = NewTensor(a);
+    XTensor * h = NewTensor(a);
+    XTensor * i = NewTensor(a);
+    XTensor * j = NewTensor(a);
+    XTensor * k = NewTensor(a);
+    XTensor * p = NewTensor(a);
+    XTensor * q = NewTensor(a);
+    XTensor * r = NewTensor(a);
+    DTYPE epsilon = income.GetParam(1); /* assumes epsilon is stored after dim (parameter 0) */
+
+    int dim = income.GetParamInt(0);
+    int n = a->GetDim(dim);
+    XNoder::MakeGrad(input);
+    XNoder::MakeGrad(mean);
+    XNoder::MakeGrad(var);
+    XNoder::MakeGrad(a);
+    XNoder::MakeGrad(b);
+
+    /* dEdinput */
+    _ScaleAndShift(var, c, 1.0F, epsilon);
+    _Unsqueeze(c, d, dim, n);
+    _Power(d, e, -0.5F);
+    _Multiply(a, e, f);
+    _Multiply(node->grad, f, input->grad, 1.0F);
+
+    /* dEdmean */
+    _ScaleAndShift(f, g, -1.0F);
+    _Multiply(node->grad, g, mean->grad, 1.0F);
+
+    /* dEdvar */
+    _Unsqueeze(mean, h, dim, n);
+    _Sub(input, h, i);
+    _Multiply(a, i, j);
+    _Power(var, k, -1.5F);
+    _ScaleAndShift(k, p, -0.5F);
+    _Multiply(j, p, q);
+    _Multiply(node->grad, q, var->grad, 1.0F);
+
+    /* dEda */
+    _Multiply(i, e, r);
+    _Multiply(node->grad, r, a->grad, 1.0F);
+
+    /* dEdb */
+    _Sum(b->grad, node->grad, b->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete c;
+    delete d;
+    delete e;
+    delete f;
+    delete g;
+    delete h;
+    delete i;
+    delete j;
+    delete k;
+    delete p;
+    delete q;
+    delete r;
+}
+
+/*
+gradient for absolute
+for
+c = |a|
+we have
+dE/da = dE/dc   a >= 0
+        -dE/dc  a < 0
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradAbsolute(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for ABSOLUTE!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    _Sign(a, b);
+    _Multiply(node->grad, b, a->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for sign
+for
+c = sign(a)
+we have
+dE/da = 0
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradSign(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIGN!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    b->SetZeroAll();
+    _Sum(a->grad, b, a->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for reduceMean
+for
+c = reduceMean(a, dim)
+we have
+dE/da = Unsqueeze(dE/dc) * 1/dimSizeA[dim]
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradReduceMean(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+    XTensor * c = NewTensor(a);
+
+    int dim = income.GetParamInt(0);
+    int n = a->GetDim(dim);
+    XNoder::MakeGrad(a);
+
+    _Unsqueeze(node->grad, b, dim, n);
+    _ScaleAndShift(b, c, 1.0F / n);
+    _Sum(a->grad, c, a->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+    delete c;
+}
+
+/*
+gradient for reduceSum
+for
+c = reduceSum(a, dim)
+we have
+dE/da = Unsqueeze(dE/dc) * 1
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradReduceSum(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    int dim = income.GetParamInt(0);
+    int n = a->GetDim(dim);
+    XNoder::MakeGrad(a);
+
+    _Unsqueeze(node->grad, b, dim, n);
+    _Sum(a->grad, b, a->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for reduceSumSquared
+for
+c = reduceSumSquared(a, dim, b)
+we have
+dE/da = Unsqueeze(dE/dc) * 2a
+dE/db = Unsqueeze(dE/dc) * (-2b)
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradReduceSumSquared(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = income.tails[1];
+    XTensor * c = NewTensor(a);
+    XTensor * d = NewTensor(b);
+    XTensor * e = NewTensor(c);
+
+    int dim = income.GetParamInt(0);
+    int n = a->GetDim(dim);
+    XNoder::MakeGrad(a);
+    XNoder::MakeGrad(b);
+
+    _ScaleAndShift(a, c, 2.0F);
+    _ScaleAndShift(b, d, -2.0F);
+    _Unsqueeze(node->grad, e, dim, n);
+    _Multiply(e, c, a->grad, 1.0F);
+    _Multiply(node->grad, d, b->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete c;
+    delete d;
+    delete e;
+}
+
+/*
+gradient for reduceVariance
+for
+c = reduceVariance(a, dim, b)
+we have
+dE/da = Unsqueeze(dE/dc) * 2a/dimSizeA[dim]
+dE/db = Unsqueeze(dE/dc) * (-2a/dimSizeA[dim])
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradReduceVariance(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = income.tails[1];
+    XTensor * c = NewTensor(a);
+    XTensor * d = NewTensor(b);
+    XTensor * e = NewTensor(a);
+
+    int dim = income.GetParamInt(0);
+    int n = a->GetDim(dim);
+    XNoder::MakeGrad(a);
+    XNoder::MakeGrad(b);
+
+    _ScaleAndShift(a, c, 2.0F / n);
+    _ScaleAndShift(b, d, -2.0F / n);
+    _Unsqueeze(node->grad, e, dim, n);
+    _Multiply(e, c, a->grad, 1.0F);
+    _Multiply(node->grad, d, b->grad, 1.0F);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete c;
+    delete d;
+    delete e;
+}
+
 }
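The element-wise rules in the comments above (for example `dE/da = (dE/dc) * p*a^(p-1)` for `c = pow(a,p)`) can be sanity-checked against a central finite difference. The sketch below is independent of NiuTrans.Tensor and works on plain `double` values; `PowerGrad`, `NumericGrad` and `CheckPowerGrad` are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Analytic derivative of c = pow(a, p) with respect to a: p * a^(p-1).
// Hypothetical helper; it mirrors the rule stated in the GradPower comment.
double PowerGrad(double a, double p)
{
    return p * std::pow(a, p - 1.0);
}

// Central finite difference of f at x, used as a reference value.
double NumericGrad(const std::function<double(double)> &f, double x, double h = 1e-5)
{
    assert(h > 0.0);
    return (f(x + h) - f(x - h)) / (2.0 * h);
}

// True when the analytic and numeric derivatives agree within tol.
bool CheckPowerGrad(double a, double p, double tol = 1e-6)
{
    double numeric = NumericGrad([p](double x) { return std::pow(x, p); }, a);
    return std::fabs(numeric - PowerGrad(a, p)) < tol;
}
```

The same pattern would apply to the log, sin, cos, tan and exp rules by swapping the function and its analytic derivative.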
--- a/source/network/XBackwardMath.h
+++ b/source/network/XBackwardMath.h
@@ -56,6 +56,74 @@ private:
    /* gradient for matrix multiply: c = matmul(a, b) */
    static
    void GradMatrixMul(XTensor * node);
+
+    /* gradient for log: c =  log(a) */
+    static
+    void GradLog(XTensor * node);
+
+    /* gradient for power */
+    static
+    void GradPower(XTensor * node);
+
+    /* gradient for negate */
+    static
+    void GradNegate(XTensor * node);
+
+    /* gradient for ScaleAndShift */
+    static
+    void GradScaleAndShift(XTensor * node);
+
+    /* gradient for Minus */
+    static
+    void GradSub(XTensor * node);
+
+    /* gradient for Divide */
+    static
+    void GradDiv(XTensor * node);
+
+    /* gradient for reduceMean */
+    static
+    void GradReduceMean(XTensor * node);
+
+    /* gradient for reduceSum */
+    static
+    void GradReduceSum(XTensor * node);
+
+    /* gradient for reduceSumSquared */
+    static
+    void GradReduceSumSquared(XTensor * node);
+
+    /* gradient for reduceVariance */
+    static
+    void GradReduceVariance(XTensor * node);
+
+    /* gradient for sin */
+    static
+    void GradSin(XTensor * node);
+
+    /* gradient for cos */
+    static
+    void GradCos(XTensor * node);
+
+    /* gradient for tan */
+    static
+    void GradTan(XTensor * node);
+
+    /* gradient for exp */
+    static
+    void GradExp(XTensor * node);
+
+    /* gradient for normalize */
+    static
+    void GradNormalize(XTensor * node);
+
+    /* gradient for absolute */
+    static
+    void GradAbsolute(XTensor * node);
+
+    /* gradient for sign */
+    static
+    void GradSign(XTensor * node);
 };

 }
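For the reduce gradients declared above, the rule for `c = reduceMean(a, dim)` is `dE/da = Unsqueeze(dE/dc) * 1/dimSizeA[dim]`. A minimal stand-alone sketch follows, assuming a row-major `n x m` matrix reduced over its first dimension; `AccumulateReduceMeanGrad` is a hypothetical helper, not a library function. Note that the scale has to be computed in floating point, since the integer expression `1 / n` is zero for any `n > 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Backward step for c = reduceMean(a) over the first dimension of a
// row-major n x m matrix: gradA[r][col] += gradC[col] / n.
// Hypothetical helper on flat arrays, for illustration only.
void AccumulateReduceMeanGrad(const std::vector<float> &gradC, std::size_t n,
                              std::vector<float> &gradA)
{
    assert(n > 0 && gradA.size() == n * gradC.size());

    // floating-point division; an integer 1 / n would truncate to 0
    const float scale = 1.0F / static_cast<float>(n);
    const std::size_t m = gradC.size();

    for (std::size_t r = 0; r < n; r++)
        for (std::size_t col = 0; col < m; col++)
            gradA[r * m + col] += gradC[col] * scale;
}
```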

--- a/source/network/XBackwardShape.cpp
+++ b/source/network/XBackwardShape.cpp
@@ -47,6 +47,8 @@ void XShapeGrad::MakeGrad(XTensor * node)
        GradSplit(node);
    else if(operID == SHAPE_SPLIT_LIST)
        GradSplitList(node);
+    else if (operID == SHAPE_TRANSPOSE)
+        GradTranspose(node);
    else{
        ShowNTErrors("TODO!");
    }
@@ -370,4 +372,36 @@ void XShapeGrad::GradUnsqueeze(XTensor * node)
    node->visitMark = NODE_FINISHED;
 }

+/*
+gradient for transposing a tensor
+for
+c = Transpose(a)
+we have
+dE/da = Transpose(dE/dc)
+>> node - the node (c) for backward computation
+*/
+void XShapeGrad::GradTranspose(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TRANSPOSE!");
+
+    XTensor * output = node;
+    XTensor * input = income.tails[0];
+    XTensor * b = NewTensor(input);
+    XNoder::MakeGrad(input);
+
+    int i = income.GetParamInt(0);
+    int j = income.GetParamInt(1);
+
+    CheckNTErrors(input->order > i && i >= 0, "index of dimension is out of scope!");
+    CheckNTErrors(input->order > j && j >= 0, "index of dimension is out of scope!");
+
+    _Transpose(output->grad, b, i, j);
+    _Sum(input->grad, b, input->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
 }
\ No newline at end of file
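The rule used by `GradTranspose`, `dE/da = Transpose(dE/dc)`, can be illustrated on a plain 2D array. This is only a sketch under stated assumptions: `Matrix`, `Transpose2D` and `AccumulateTransposeGrad` are hypothetical names, and the real function swaps two arbitrary dimensions `i` and `j` of an `XTensor` rather than the rows and columns of a matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix; a hypothetical stand-in for a 2D tensor.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
};

// Returns the transpose of m.
Matrix Transpose2D(const Matrix &m)
{
    Matrix t{m.cols, m.rows, std::vector<float>(m.data.size())};
    for (std::size_t r = 0; r < m.rows; r++)
        for (std::size_t c = 0; c < m.cols; c++)
            t.data[c * m.rows + r] = m.data[r * m.cols + c];
    return t;
}

// Backward step for c = Transpose(a): add Transpose(dE/dc) to dE/da.
void AccumulateTransposeGrad(const Matrix &gradC, Matrix &gradA)
{
    assert(gradC.rows == gradA.cols && gradC.cols == gradA.rows);
    Matrix t = Transpose2D(gradC);
    for (std::size_t i = 0; i < gradA.data.size(); i++)
        gradA.data[i] += t.data[i];
}
```

Accumulating into `gradA` instead of overwriting it matches the `_Sum(input->grad, b, input->grad)` step, so a tensor consumed by several nodes collects the contributions of all of them.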
--- a/source/network/XBackwardShape.h
+++ b/source/network/XBackwardShape.h
@@ -71,6 +71,10 @@ private:
    static
    void GradUnsqueeze(XTensor * node);

+    /* gradient computation for transposing a tensor : c = Transpose(a) */
+    static
+    void GradTranspose(XTensor * node);
+    
 };

 }

--- a/source/tensor/Main.cpp
+++ b/source/tensor/Main.cpp
@@ -37,6 +37,7 @@

 using namespace nts;

+void SetDataTest();
 void SmallTest();
 void TransposeTest();


--- a/source/tensor/XName.cpp
+++ b/source/tensor/XName.cpp
@@ -29,22 +29,34 @@ const char * GetOPName(int type)
    if ((type & MATH_BASE) != 0){
        if (type == MATH_ABSOLUTE)
            return "M_ABSOLUTE";
+        else if (type == MATH_EXP)
+            return "M_EXP";
+        else if (type == MATH_LOG)
+            return "M_LOG";
+        else if (type == MATH_SIN)
+            return "M_SIN";
+        else if (type == MATH_COS)
+            return "M_COS";
+        else if (type == MATH_TAN)
+            return "M_TAN";
        else if (type == MATH_MATRIXMUL)
            return "M_MATRIXMUL";
        else if (type == MATH_MATRIXMULBATCHED)
            return "M_MATRIXMULBATCHED";
        else if (type == MATH_MULTIPLY)
            return "M_MULTIPLY";
+        else if (type == MATH_DIV)
+            return "M_DIV";
        else if (type == MATH_NEGATE)
            return "M_NEGATE";
        else if (type == MATH_SIGN)
            return "M_SIGN";
        else if (type == MATH_SUM)
            return "M_SUM";
+        else if (type == MATH_SUB)
+            return "M_SUB";
        else if (type == MATH_SUMDIM)
            return "M_SUMDIM";
-        else if (type == MATH_LOG)
-            return "M_LOG";
        else if (type == MATH_NORMALIZE)
            return "M_NORMALIZE";
        else if (type == MATH_POWER)

--- a/source/tensor/XName.h
+++ b/source/tensor/XName.h
@@ -31,16 +31,23 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
 /* math operations */
 #define MATH_BASE               0x00001000
 #define MATH_ABSOLUTE           MATH_BASE + 1
-#define MATH_MATRIXMUL          MATH_ABSOLUTE + 1
+#define MATH_EXP                MATH_ABSOLUTE + 1
+#define MATH_LOG                MATH_EXP + 1
+#define MATH_SIN                MATH_LOG + 1
+#define MATH_COS                MATH_SIN + 1
+#define MATH_TAN                MATH_COS + 1
+
+#define MATH_NEGATE             MATH_TAN + 1
+#define MATH_MATRIXMUL          MATH_NEGATE + 1
 #define MATH_MATRIXMULBATCHED   MATH_MATRIXMUL + 1
 #define MATH_MULTIPLY           MATH_MATRIXMULBATCHED + 1
-#define MATH_NEGATE             MATH_MULTIPLY + 1
-#define MATH_SIGN               MATH_NEGATE + 1
+#define MATH_DIV                MATH_MULTIPLY + 1
+#define MATH_SIGN               MATH_DIV + 1
 #define MATH_SUM                MATH_SIGN + 1
-#define MATH_SUMDIM             MATH_SUM + 1
+#define MATH_SUB                MATH_SUM + 1
+#define MATH_SUMDIM             MATH_SUB + 1

-#define MATH_LOG                MATH_SUMDIM + 1
-#define MATH_NORMALIZE          MATH_LOG + 1
+#define MATH_NORMALIZE          MATH_SUMDIM + 1
 #define MATH_POWER              MATH_NORMALIZE + 1
 #define MATH_SCALEANDSHIFT      MATH_POWER + 1
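Chained definitions of this kind give each operation a distinct id only as long as every macro builds on the one directly before it. As a sketch of one alternative, assuming C++11, an enum lets the compiler assign consecutive values, and a `switch` over it fails to compile if two labels ever share a value. The `OP_`-prefixed names and `MathOpName` are hypothetical and cover only a few of the ids; they are not part of the library.

```cpp
#include <cassert>

// Hypothetical enum mirroring a few ids from XName.h; not part of the library.
enum MathOp : int {
    OP_MATH_BASE = 0x00001000,
    OP_MATH_ABSOLUTE,    // OP_MATH_BASE + 1
    OP_MATH_EXP,         // each enumerator is the previous one plus 1
    OP_MATH_LOG,
    OP_MATH_SIN,
    OP_MATH_COS,
    OP_MATH_TAN,
    OP_MATH_NEGATE,
    OP_MATH_MATRIXMUL
};

// Same membership test as in GetOPName: the base bit marks math operations.
bool IsMathOp(int type)
{
    return (type & OP_MATH_BASE) != 0;
}

// Duplicate case values are a compile-time error, so two operations
// cannot silently end up with the same id.
const char * MathOpName(MathOp op)
{
    assert(IsMathOp(op));
    switch (op) {
        case OP_MATH_ABSOLUTE:  return "M_ABSOLUTE";
        case OP_MATH_EXP:       return "M_EXP";
        case OP_MATH_LOG:       return "M_LOG";
        case OP_MATH_SIN:       return "M_SIN";
        case OP_MATH_COS:       return "M_COS";
        case OP_MATH_TAN:       return "M_TAN";
        case OP_MATH_NEGATE:    return "M_NEGATE";
        case OP_MATH_MATRIXMUL: return "M_MATRIXMUL";
        default:                return "M_UNKNOWN";
    }
}
```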


--- a/source/tensor/core/CHeader.h
+++ b/source/tensor/core/CHeader.h
@@ -26,49 +26,62 @@

 #include "../XTensor.h"

-#include "shape/Concatenate.h"
-#include "shape/ConcatenateSolely.h"
-#include "movement/CopyBlocks.h"
-#include "movement/CopyBlocksInGrid.h"
-#include "movement/CopyBlocksOnSite.h"
-#include "movement/CopyData2D.h"
-#include "movement/CopyIndexed.h"
-#include "movement/CopyInGrid.h"
-#include "movement/CopyValues.h"
-#include "utilities/FlushToMem.h"
-#include "shape/MakeMergeBlockIndex.h"
-#include "shape/MakeSplitBlockIndex.h"
+#include "arithmetic/Div.h"
 #include "arithmetic/MatrixMul.h"
 #include "arithmetic/MatrixMul2D.h"
 #include "arithmetic/MatrixMul2DMultiTheading.h"
 #include "arithmetic/MatrixMul2DParallel.h"
 #include "arithmetic/MatrixMulBatched.h"
-#include "shape/Merge.h"
-#include "shape/MergeBlockLists.h"
 #include "arithmetic/Multiply.h"
 #include "arithmetic/Negate.h"
+#include "arithmetic/Sign.h"
+#include "arithmetic/Sub.h"
+#include "arithmetic/Sum.h"
+#include "arithmetic/SumByColumnTV.h"
+#include "arithmetic/SumByColumnVT.h"
+#include "arithmetic/SumDim.h"
+#include "arithmetic/XTensorBLAS.h"
+
+#include "getandset/ConvertDataType.h"
+#include "getandset/Select.h"
+#include "getandset/SetData.h"
+
 #include "math/Normalize.h"
-#include "shape/Permute.h"
 #include "math/Power.h"
+#include "math/ScaleAndShift.h"
+#include "math/Unary.h"
+
+
+#include "movement/CopyBlocks.h"
+#include "movement/CopyBlocksInGrid.h"
+#include "movement/CopyBlocksOnSite.h"
+#include "movement/CopyData2D.h"
+#include "movement/CopyIndexed.h"
+#include "movement/CopyInGrid.h"
+#include "movement/CopyValues.h"
+
 #include "reduce/ReduceMax.h"
 #include "reduce/ReduceMean.h"
 #include "reduce/ReduceStandardVariance.h"
 #include "reduce/ReduceSum.h"
 #include "reduce/ReduceSumSquared.h"
 #include "reduce/ReduceVariance.h"
-#include "math/ScaleAndShift.h"
-#include "getandset/Select.h"
-#include "getandset/SetData.h"
-#include "sort/Sort.h"
+
+#include "shape/Concatenate.h"
+#include "shape/ConcatenateSolely.h"
+#include "shape/MakeMergeBlockIndex.h"
+#include "shape/MakeSplitBlockIndex.h"
+#include "shape/Merge.h"
+#include "shape/MergeBlockLists.h"
+#include "shape/Permute.h"
 #include "shape/Split.h"
-#include "arithmetic/Sum.h"
-#include "arithmetic/SumByColumnTV.h"
-#include "arithmetic/SumByColumnVT.h"
-#include "arithmetic/SumDim.h"
-#include "sort/TopK.h"
 #include "shape/Transpose.h"
 #include "shape/Unsqueeze.h"
+
+#include "sort/Sort.h"
+#include "sort/TopK.h"
+
 #include "utilities/XMatrixSegment.h"
-#include "arithmetic/XTensorBLAS.h"
+#include "utilities/FlushToMem.h"

 #endif // __CHEADER_H__
--- a/source/tensor/core/arithmetic/Absolute.cpp
+++ b/source/tensor/core/arithmetic/Absolute.cpp
-/* NiuTrans.Tensor - an open-source tensor library
-* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
-* All rights reserved.
-*
-* Licensed under the Apache License, Version 2.0 (the "License");
-* you may not use this file except in compliance with the License.
-* You may obtain a copy of the License at
-*
-*   http://www.apache.org/licenses/LICENSE-2.0
-*
-* Unless required by applicable law or agreed to in writing, software
-* distributed under the License is distributed on an "AS IS" BASIS,
-* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-* See the License for the specific language governing permissions and
-* limitations under the License.
-*/
-
-/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
-*/
-
-#include <math.h>
-#include "../../XTensor.h"
-#include "../../XName.h"
-#include "Absolute.h"
-#include "Absolute.cuh"
-
-namespace nts { // namespace nts(NiuTrans.Tensor)
-
-/*
-set every entry to its absolute value
->> a - input tensor we are processing
->> b - output tensor we are processing
-*/
-void _Absolute(const XTensor * a, XTensor * b)
-{
-#ifdef USE_CUDA
-    /* run it on GPUs */
-    if (a->devID >= 0) {
-        _CudaAbsolute(a, b);
-    return;
-}
-#endif
-
-    CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
-    CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
-    DTYPE * d = (DTYPE*)a->data;
-    DTYPE * db = (DTYPE*)b->data;
-    for (int i = 0; i < a->unitNum; i++)
-        db[i] = (DTYPE)fabs(d[i]);
-}
-
-/*
-set every entry to its absolute value (do it on site)
-keep the result in the input tensor a and return nothing
->> a - the tensor we are processing
-*/
-void _AbsoluteMe(XTensor * a)
-{
-    _Absolute(a, a);
-}
-
-/*
-set every entry to its absolute value (return a XTensor structure)
-make a new tensor to keep the result and return it
->> a - input tensor we are processing
-<< return - the absolute value of input tensor
-*/
-XTensor Absolute(const XTensor & a)
-{
-    XTensor b(&a);
-    b.SetTMP();
-    
-    /* call _Absolute function */
-    _Absolute(&a, &b);
-    
-    /* tensor connections */
-    XLink::MakeLink(&a, NULL, &b, MATH_ABSOLUTE);
-    
-    return b;
-}
-} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/arithmetic/Absolute.cu
+++ b/source/tensor/core/arithmetic/Absolute.cu
-/* NiuTrans.Tensor - an open-source tensor library
-* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
-* All rights reserved.
-*
-* Licensed under the Apache License, Version 2.0 (the "License");
-* you may not use this file except in compliance with the License.
-* You may obtain a copy of the License at
-*
-*   http://www.apache.org/licenses/LICENSE-2.0
-*
-* Unless required by applicable law or agreed to in writing, software
-* distributed under the License is distributed on an "AS IS" BASIS,
-* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-* See the License for the specific language governing permissions and
-* limitations under the License.
-*/
-
-/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
-*/
-
-#include "../../XDevice.h"
-#include "../../XTensor.h"
-#include "Absolute.h"
-#include "Absolute.cuh"
-
-namespace nts { // namespace nts(NiuTrans.Tensor)
-
-#ifdef USE_CUDA
-/*
-set each entry to its absolute value (CUDA Kernel)
->> a - pointer to input data array
->> b - pointer to output data array
->> size - size of the data array
-*/
-__global__
-void KernelAbsolute(DTYPE * a, DTYPE * b, int size)
-{
-    int i = blockDim.x * blockIdx.x + threadIdx.x;
-
-    if (i < size)
-        b[i] = fabs(a[i]);
-}
-
-/*
-set each entry to its absolute value (CUDA Kernel)
-This is for float16 computation
->> a - pointer to input data array
->> b - pointer to output data array
->> size - size of the data array
-*/
-__global__
-void KernelAbsolute(__half * a, __half * b, int size)
-{
-    return;
-}
-
-/*
-set each entry to its absolute value
->> a - input tensor
->> b - output tensor
-*/
-void _CudaAbsolute(const XTensor * a, XTensor * b)
-{
-    CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
-    CheckNTErrors((a->isSparse == false), "TODO!");
-
-    int gridSize[3];
-    int blockSize[3];
-
-    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
-
-    dim3 blocks(gridSize[0]);
-    dim3 threads(blockSize[0]);
-
-    int devIDBackup;
-    ProtectCudaDev(a->devID, devIDBackup);
-
-    if (a->dataType == DEFAULT_DTYPE) {
-        KernelAbsolute << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
-    }
-    else if (a->dataType == X_FLOAT16) {
-        KernelAbsolute << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
-    }
-    else {
-        ShowNTErrors("TODO!");
-    }
-
-    BacktoCudaDev(a->devID, devIDBackup);
-}
-
-#endif // USE_CUDA
-} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/core/arithmetic/Div.cpp
+++ b/source/tensor/core/arithmetic/Div.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#include "../../XTensor.h"
+#include "../../XName.h"
+#include "Div.h"
+#include "Div.cuh"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+element-wise division of two tensors
+
+c(i) = a(i)/b(i) + \alpha * c(i)
+where i is the index of the item
+
+>> a - tensor a
+>> b - tensor b
+>> c - result tensor
+>> alpha - the coefficient
+>> leadingDim - the dimension along which we perform broadcasting
+*/
+void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
+{
+    int leadingDimRDI = a->order - leadingDim - 1;
+    CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
+                  "Unmatched tensors in division!");
+    CheckNTErrors((a->order == b->order && a->order == c->order), 
+                  "Unmatched tensors!");
+
+#ifdef USE_CUDA
+    if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
+        _CudaDiv(a, b, c, alpha, leadingDim);
+        return;
+    }
+#endif
+
+    int stride = 1;
+    int blockSizeA = 1;
+    int blockSizeB = 1;
+    int blockSizeC = 1;
+    int blockNum = 1;
+    int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
+    int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
+    int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
+
+    for (int i = 0; i < a->order; i++) {
+        if (i != leadingDimRDI) {
+            CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] && a->dimSizeRDI[i] == c->dimSizeRDI[i]),
+                          "Unmatched tensors!");
+        }
+        if (i < leadingDimRDI)
+            stride *= a->dimSizeRDI[i];
+    }
+
+    blockSizeA = stride * dimensionSizeA;
+    blockSizeB = stride * dimensionSizeB;
+    blockSizeC = stride * dimensionSizeC;
+    blockNum = a->unitNum / blockSizeA;
+
+    if (!a->isSparse && !b->isSparse) {
+        if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
+            if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
+                int size = a->unitNum;
+                DTYPE * ap = (DTYPE*)a->data;
+                DTYPE * bp = (DTYPE*)b->data;
+                DTYPE * cp = (DTYPE*)c->data;
+                if (alpha == 0) {
+                    for (int i = 0; i < size; i++)
+                        cp[i] = ap[i] / bp[i];
+                }
+                else {
+                    for (int i = 0; i < size; i++)
+                        cp[i] = ap[i] / bp[i] + alpha * cp[i];
+                }
+            }
+            else {
+                for (int k = 0; k < blockNum; k++) {
+
+                    for (int ci = 0, ai = 0, bi = 0; ci < dimensionSizeC; ci++, ai++, bi++) {
+                        if (ai >= dimensionSizeA)
+                            ai = 0;
+                        if (bi >= dimensionSizeB)
+                            bi = 0;
+                        DTYPE * ap = (DTYPE*)a->data + k * blockSizeA + ai * stride;
+                        DTYPE * bp = (DTYPE*)b->data + k * blockSizeB + bi * stride;
+                        DTYPE * cp = (DTYPE*)c->data + k * blockSizeC + ci * stride;
+                        for (int j = 0; j < stride; j++)
+                            cp[j] = ap[j] / bp[j] + cp[j] * alpha;
+                    }
+                }
+            }
+        }
+        else {
+            // TODO!!
+            ShowNTErrors("TODO!");
+        }
+    }
+    else {
+        // TODO!!
+        ShowNTErrors("TODO!");
+    }
+}
+
+/*
+element-wise division of two tensors (do it on site)
+keep the result in the input tensor a and return nothing
+
+a(i) = a(i)/b(i) + \alpha * a(i)
+where i is the index of the item
+
+>> a - tensor a (where the result is kept)
+>> b - tensor b
+>> alpha - the coefficient
+>> leadingDim - the dimension along which we perform broadcasting
+*/
+void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha, int leadingDim)
+{
+    _Div(a, b, a, alpha, leadingDim);
+}
+
+/*
+element-wise division of two tensors (return an XTensor structure)
+make a new tensor c to keep the result and return it
+
+c(i) = a(i)/b(i)
+where i is the index of the item
+
+>> a - tensor a
+>> b - tensor b
+>> leadingDim - the dimension along which we perform broadcasting
+<< return - the quotient of the tensors
+*/
+XTensor Div(const XTensor &a, const XTensor &b, int leadingDim)
+{
+    CheckNTErrors(a.dimSize[leadingDim] == b.dimSize[leadingDim], "Unmatched leading dimensions in division!");
+
+    XTensor c(&a);
+    c.SetTMP();
+    
+    /* call _Div function */
+    _Div(&a, &b, &c, 0, leadingDim);
+    
+    /* tensor connections */
+    XLink::MakeLink(&a, &b, &c, MATH_DIV);
+    XLink::AddParamToHeadInt(&c, leadingDim);
+    
+    return c;
+}
+
+} // namespace nts(NiuTrans.Tensor)
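Note on the broadcasting loop in `_Div` above: when the tensors differ only in the size of the leading dimension, the index into `a` and `b` wraps around while the index into `c` runs over the full range. The following is a minimal stand-alone sketch of that loop, assuming tensors flattened to `blockNum x leadingDim x stride` and using plain `std::vector<float>` instead of `XTensor`. The name `DivBroadcastRef` and its parameters are hypothetical and not part of NiuTrans.Tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of NiuTrans.Tensor).
// Computes c = a / b + alpha * c, where a, b and c are laid out as
// blockNum x ld x stride and only the leading-dimension sizes
// (ldA, ldB, ldC) may differ. Indices into a and b wrap around.
void DivBroadcastRef(const std::vector<float>& a, const std::vector<float>& b,
                     std::vector<float>& c, float alpha,
                     int blockNum, int stride, int ldA, int ldB, int ldC)
{
    assert(ldA > 0 && ldB > 0 && ldA <= ldC && ldB <= ldC);
    assert(a.size() == static_cast<std::size_t>(blockNum) * ldA * stride);
    assert(b.size() == static_cast<std::size_t>(blockNum) * ldB * stride);
    assert(c.size() == static_cast<std::size_t>(blockNum) * ldC * stride);

    for (int k = 0; k < blockNum; k++) {
        for (int ci = 0; ci < ldC; ci++) {
            int ai = ci % ldA;  // wrap around the leading dimension of a
            int bi = ci % ldB;  // wrap around the leading dimension of b
            const float* ap = a.data() + (k * ldA + ai) * stride;
            const float* bp = b.data() + (k * ldB + bi) * stride;
            float* cp = c.data() + (k * ldC + ci) * stride;
            for (int j = 0; j < stride; j++)
                cp[j] = ap[j] / bp[j] + alpha * cp[j];
        }
    }
}
```

For example, dividing a 2x2 array `{2, 4, 6, 8}` by a 1x2 array `{2, 4}` with `alpha = 0` reuses the single row of `b` for both rows of `a`.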
--- a/source/tensor/core/arithmetic/Div.cu
+++ b/source/tensor/core/arithmetic/Div.cu
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
+*/
+
+#include "../../XDevice.h"
+#include "../../XTensor.h"
+#include "Div.h"
+#include "Div.cuh"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+#ifdef USE_CUDA
+/*
+division of data arrays in an element-wise manner c(i) = a(i)/b(i)
+>> a - data array a
+>> b - data array b
+>> c - result data array
+>> size - size of c
+*/
+__global__
+void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+
+    if (i < size)
+        c[i] = a[i] / b[i];
+}
+
+/*
+division of data arrays in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i)
+>> a - data array a
+>> b - data array b
+>> c - result data array
+>> size - size of c
+>> alpha - the coefficient
+*/
+__global__
+void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+
+    if (i < size)
+        c[i] = a[i] / b[i] + alpha * c[i];
+}
+
+/*
+division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i).
+Note that a and b can be of different sizes here, i.e.,
+|a_lead| <= |c_lead| and |b_lead| <= |c_lead|
+where |a_lead| means the size of the leading dimension of a
+>> a - tensor a
+>> b - tensor b
+>> c - result tensor
+>> alpha - the coefficient
+>> stride - the number of items we skip when moving to the next position along the leading dimension in a block
+>> ldSizeA - size of the leading dimension of a
+>> ldSizeB - size of the leading dimension of b
+>> ldSizeC - size of the leading dimension of c
+>> blockNum - number of blocks
+*/
+template<int nonZeroAlpha> __global__
+void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha,
+    int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum)
+{
+    __shared__ DTYPE* ap[MAX_CUDA_THREAD_NUM_PER_BLOCK];
+    __shared__ DTYPE* bp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
+    __shared__ DTYPE* cp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
+
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    int j = blockDim.y * blockIdx.y + threadIdx.y;
+
+    if (i >= blockNum * stride || j >= ldSizeC)
+        return;
+
+    if (threadIdx.y == 0) {
+        int block = i / stride;
+        int size = block * stride;
+        ap[threadIdx.x] = a + size * ldSizeA;
+        bp[threadIdx.x] = b + size * ldSizeB;
+        cp[threadIdx.x] = c + size * ldSizeC;
+    }
+
+    __syncthreads();
+
+    int aj = j >= ldSizeA ? j % ldSizeA : j;
+    int bj = j >= ldSizeB ? j % ldSizeB : j;
+    int offseti = i % stride;
+
+    if (nonZeroAlpha == 0)
+        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti];
+    else
+        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti]
+                                                 + alpha * cp[threadIdx.x][j * ldSizeC + offseti];
+}
+
+/*
+element-wise division of two tensors
+c(i) = a(i)/b(i) + \alpha * c(i)
+where i is the item index
+>> a - tensor a
+>> b - tensor b
+>> c - result tensor
+>> alpha - the coefficient
+>> leadingDim - dimension along which we perform broadcasting
+*/
+void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
+{
+    int leadingDimRDI = a->order - leadingDim - 1;
+    CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
+                  "Unmatched tensors in division!");
+    CheckNTErrors((a->order == b->order && a->order == c->order), "Unmatched tensors!");
+
+    int stride = 1;
+    int blockSizeA = 1;
+    int blockNum = 1;
+    int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
+    int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
+    int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
+
+    for (int i = 0; i < a->order; i++) {
+        if (i != leadingDimRDI) {
+            CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] &&
+                           a->dimSizeRDI[i] == c->dimSizeRDI[i]),
+                          "Unmatched tensors!");
+        }
+        if (i < leadingDimRDI)
+            stride *= a->dimSizeRDI[i];
+    }
+
+    blockSizeA = stride * dimensionSizeA;
+    blockNum = a->unitNum / blockSizeA;
+
+    int devIDBackup;
+    ProtectCudaDev(a->devID, devIDBackup);
+
+    if (!a->isSparse && !b->isSparse) {
+        if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
+            int cudaGridSize[3];
+            int cudaBlockSize[3];
+
+            if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
+                GDevs.GetCudaThread(a->devID, c->unitNum, cudaGridSize, cudaBlockSize);
+                dim3 blocks(cudaGridSize[0]), threads(cudaBlockSize[0]);
+
+                if (alpha == 0)
+                    KernelDivElementWise << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum);
+                else
+                    KernelDivElementWiseV2 << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum, alpha);
+            }
+            else {
+                GDevs.GetCudaThread2D(c->devID, stride * blockNum, dimensionSizeC, MAX_INT, cudaGridSize, cudaBlockSize);
+                dim3 blocks(cudaGridSize[0], cudaGridSize[1]), threads(cudaBlockSize[0], cudaBlockSize[1]);
+
+                if (alpha == 0) {
+                    KernelDivElementWiseTensorDynamic<0> << <blocks, threads >> >
+                        ((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, 0,
+                        stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
+                }
+                else {
+                    KernelDivElementWiseTensorDynamic<1> << <blocks, threads >> >
+                        ((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, alpha,
+                        stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
+                }
+            }
+        }
+        else {
+            // TODO!!
+            ShowNTErrors("TODO!");
+        }
+    }
+    else {
+        // TODO!!
+        ShowNTErrors("TODO!");
+    }
+
+    BacktoCudaDev(a->devID, devIDBackup);
+}
+
+#endif // USE_CUDA
+
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/arithmetic/Div.cuh
+++ b/source/tensor/core/arithmetic/Div.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#ifndef __DIV_CUH__
+#define __DIV_CUH__
+
+#include "Div.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+#ifdef USE_CUDA
+
+/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) */
+__global__
+void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size);
+
+/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
+__global__
+void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha);
+
+/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
+template<int nonZeroAlpha>__global__
+void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha, int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum);
+
+/* element-wise division of two tensors */
+void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);
+
+#endif // USE_CUDA
+
+} // namespace nts(NiuTrans.Tensor)
+
+#endif // __DIV_CUH__
+
--- a/source/tensor/core/math/Log.h
+++ b/source/tensor/core/math/Log.h
@@ -16,31 +16,39 @@
 */

 /*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
-*/
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */

-#ifndef __LOG_H__
-#define __LOG_H__
+#ifndef __DIV_H__
+#define __DIV_H__

 #include "../../XTensor.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)

-/* set every entry to its log value */
-void _Log(const XTensor * a, XTensor * b);
+/* 
+element-wise division of two tensors:
+c(i) = a(i)/b(i) + \alpha * c(i) 
+where i is the index of the element
+*/
+void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);

 /* 
-set every entry to its log value (do it on site)
+element-wise division of two tensors (do it on site)
 keep the result in the input tensor a and return nothing
+a(i) = a(i)/b(i) + \alpha * a(i) 
+where i is the index of the element 
 */
-void _LogMe(XTensor * a);
+void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0);

 /* 
-set every entry to its log value (return a XTensor structure)
+element-wise division of two tensors (return a XTensor structure)
 make a new tensor to keep the result and return it
+c(i) = a(i)/b(i)
+where i is the index of the element 
 */
-XTensor Log(const XTensor & a);
+XTensor Div(const XTensor &a, const XTensor &b, int leadingDim = 0);

 } // namespace nts(NiuTrans.Tensor)

-#endif // __LOG_H__
+#endif // __DIV_H__
\ No newline at end of file
--- a/source/tensor/core/arithmetic/Multiply.cpp
+++ b/source/tensor/core/arithmetic/Multiply.cpp
@@ -32,9 +32,9 @@ element-wise product of two tensors
 c(i) = a(i)*b(i) + \alpha * c(i)
 where i is the index of the item

->> a - matrix a
->> b - matrix b
->> c - result matrix
+>> a - tensor a
+>> b - tensor b
+>> c - result tensor
 >> alpha - the coefficient
 >> leadingDim - the dimension along which we perform broadcasting
 */

--- a/source/tensor/core/arithmetic/Multiply.cu
+++ b/source/tensor/core/arithmetic/Multiply.cu
@@ -104,9 +104,9 @@ void KernelMulElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE al
    int offseti = i % stride;

    if (nonZeroAlpha == 0)
-        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj* ldSizeA + offseti] * bp[threadIdx.x][bj* ldSizeB + offseti];
+        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti];
    else
-        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj* ldSizeA + offseti] * bp[threadIdx.x][bj* ldSizeB + offseti] +
+        cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti] +
        alpha * cp[threadIdx.x][j * ldSizeC + offseti];
 }


--- a/source/tensor/core/arithmetic/Sub.cpp
+++ b/source/tensor/core/arithmetic/Sub.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#include "../../XTensor.h"
+#include "../../XName.h"
+#include "../../XUtility.h"
+#include "Sub.h"
+#include "Sub.cuh"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+tensor subtraction c = a - b * \beta
+
+>> a - a tensor
+>> b - another tensor
+>> c - where we put a-b*\beta (must not be NULL; pass a to compute in place)
+>> beta - the scaling factor
+*/
+void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
+{
+    CheckNTErrors(a && b && c, "Empty tensor input!");
+    CheckNTErrors(a->unitNum == b->unitNum && a->unitNum == c->unitNum,
+                  "Unmatched tensors in subtraction!");
+    CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
+                  "Unmatched tensors in subtraction!");
+
+    if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
+
+#ifdef USE_CUDA
+        if (a == c) {
+            int P2PAccesible = 0;
+#ifdef CUDA_UVA
+            cudaDeviceCanAccessPeer(&P2PAccesible, a->devID, b->devID);
+#endif
+            if ((a->devID < 0 && b->devID >= 0) ||
+                (a->devID >= 0 && b->devID < 0) ||
+                (a->devID >= 0 && b->devID >= 0 && a->devID != b->devID && !P2PAccesible))
+            {
+                ShowNTErrors("Cannot run this method on multiple devices simultaneously!");
+            }
+            else
+                _CudaSub(a, b, c, beta);
+        }
+        else
+            _CudaSub(a, b, c, beta);
+
+#endif
+    }
+    else {
+        if (!a->isSparse && !b->isSparse) {
+            CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in subtraction!");
+    
+            if (a->dataType == DEFAULT_DTYPE &&
+                b->dataType == DEFAULT_DTYPE &&
+                c->dataType == DEFAULT_DTYPE)
+            {
+                DTYPE * ap = (DTYPE*)a->data;
+                DTYPE * bp = (DTYPE*)b->data;
+                DTYPE * cp = (DTYPE*)c->data;
+    
+                /* unrolling */
+                int num = a->unitNum;
+                if (num % 4 == 0) {
+                    for (int i = 0; i < num; i += 4) {
+                        cp[i] = ap[i] - bp[i] * beta;
+                        cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
+                        cp[i + 2] = ap[i + 2] - bp[i + 2] * beta;
+                        cp[i + 3] = ap[i + 3] - bp[i + 3] * beta;
+                    }
+                }
+                else if (num % 2 == 0) {
+                    for (int i = 0; i < num; i += 2) {
+                        cp[i] = ap[i] - bp[i] * beta;
+                        cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
+                    }
+                }
+                else {
+                    for (int i = 0; i < num; i++) {
+                        cp[i] = ap[i] - bp[i] * beta;
+                    }
+                }
+            }
+            else {
+                // TODO!!
+                ShowNTErrors("TODO!");
+            }
+        }
+        else {
+            // TODO!!
+            ShowNTErrors("TODO!");
+        }
+    }
+}
+    
+/*
+tensor subtraction a = a - b * \beta (do it on site)
+keep the result in the tensor a and return nothing
+
+>> a - a tensor
+>> b - another tensor
+>> beta - the scaling factor
+*/
+void _SubMe(XTensor * a, const XTensor * b, DTYPE beta)
+{
+    _Sub(a, b, a, beta);
+}
+    
+/*
+tensor subtraction c = a - b * \beta (return a XTensor structure)
+make a new tensor c to keep the result and return it
+
+>> a - a tensor
+>> b - another tensor
+>> beta - the scaling factor
+<< return - the result of tensor subtraction
+*/
+XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta)
+{
+    XTensor c(&a);
+    c.SetTMP();
+
+    /* call _Sub function */
+    _Sub(&a, &b, &c, beta);
+    
+    /* tensor connections */
+    XLink::MakeLink(&a, &b, &c, MATH_SUB);
+    XLink::AddParamToHead(&c, beta);
+    
+    return c;
+}
+
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/core/arithmetic/Sub.cu
+++ b/source/tensor/core/arithmetic/Sub.cu
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#include "../../XDevice.h"
+#include "../../XUtility.h"
+#include "Sub.cuh"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+#ifdef USE_CUDA
+
+/*
+subtraction of data arrays (CUDA Kernel)
+c = a - b * \beta
+>> a - a data array
+>> b - another data array
+>> c - where we put a-b*\beta
+>> size - the size of a/b/c
+>> beta - the coefficient
+*/
+__global__
+void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+
+    if (i < size)
+        c[i] = a[i] - b[i] * beta;
+}
+
+/*
+tensor subtraction c = a - b * \beta (cuda version)
+>> a - a tensor
+>> b - another tensor
+>> c - where we put a-b*\beta.
+>> beta - the scaling factor
+*/
+void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
+{
+    CheckNTErrors(a && b && c, "Empty tensor input!");
+    CheckNTErrors((a->unitNum == b->unitNum && a->unitNum == c->unitNum),
+                  "Unmatched tensors in subtraction!");
+    CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
+                  "Unmatched tensors in subtraction!");
+    CheckNTErrors((a->devID == b->devID && a->devID == c->devID),
+                  "The tensors must be on the same device!");
+
+    int devIDBackup = XDevice::GetGPUDevice();
+    XDevice::SetGPUDevice(a->devID);
+
+    if (!a->isSparse && !b->isSparse) {
+        CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in subtraction!");
+
+        if (a->dataType == DEFAULT_DTYPE &&
+            b->dataType == DEFAULT_DTYPE &&
+            c->dataType == DEFAULT_DTYPE)
+        {
+            int gridSize[3], blockSize[3];
+
+            GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
+            dim3 blocks(gridSize[0]);
+            dim3 threads(blockSize[0]);
+            KernelSUB << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, a->unitNum, beta);
+        }
+        else {
+            // TODO!!
+            ShowNTErrors("TODO!");
+        }
+    }
+    else {
+        // TODO!!
+        ShowNTErrors("TODO!");
+    }
+
+    XDevice::SetGPUDevice(devIDBackup);
+}
+
+/* subtraction over arrays
+tensor subtraction c = a - b * \beta (cuda version) with an input handle
+>> devID - device ID (MUST >= 0)
+>> handle - cuda handle
+>> a - an array
+>> b - another array
+>> c - where we put a-b*\beta (a is used if c is NULL)
+>> size - size of the array
+>> beta - the coefficient
+*/
+void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
+{
+    if (size == 0)
+        return;
+
+    if (c == NULL)
+        c = a;
+
+    CheckNTErrors((a && b && c), "Empty arrays in subtraction!");
+
+    int devIDBackup;
+    ProtectCudaDev(devID, devIDBackup);
+
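+    if (c == a) {
+        /* axpy computes y = \alpha * x + y, so we pass -\beta to get a = a - b * \beta */
+        DTYPE negBeta = -beta;
+#ifdef DOUBELPRICSION
+        cublasDaxpy(*handle, size, &negBeta, b, 1, a, 1);
+#else
+        cublasSaxpy(*handle, size, &negBeta, b, 1, a, 1);
+#endif
+    }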
+    else {
+        int gridSize[3], blockSize[3];
+
+        GDevs.GetCudaThread(devID, size, gridSize, blockSize);
+
+        dim3 blocks(gridSize[0]);
+        dim3 threads(blockSize[0]);
+
+        KernelSUB<<<blocks, threads>>>((DTYPE*)a, (DTYPE*)b, (DTYPE*)c, size, beta);
+    }
+
+    BacktoCudaDev(devID, devIDBackup);
+}
+
+#endif // USE_CUDA
+
+} // namespace nts(NiuTrans.Tensor)
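Note on the in-place branch of `_CudaSubWithHandle`: the BLAS `axpy` routine computes `y = alpha * x + y`, so a subtraction `a = a - beta * b` can only be expressed through it by passing `-beta` as the scaling factor. The sketch below illustrates this on the CPU with a reference `axpy`. `AxpyRef` and `SubInPlaceViaAxpy` are hypothetical names introduced for illustration, not cuBLAS or NiuTrans.Tensor functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of BLAS axpy: y = alpha * x + y.
// Hypothetical CPU stand-in for cublasSaxpy, for illustration only.
void AxpyRef(float alpha, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); i++)
        y[i] = alpha * x[i] + y[i];
}

// In-place subtraction a = a - beta * b expressed through axpy.
// The sign of beta is flipped because axpy always adds.
void SubInPlaceViaAxpy(std::vector<float>& a, const std::vector<float>& b, float beta)
{
    AxpyRef(-beta, b, a);
}
```

With `a = {5, 7}`, `b = {1, 2}` and `beta = 2`, the call leaves `a = {3, 3}`; passing `beta` unchanged would instead add and give `{7, 11}`.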
--- a/source/tensor/core/arithmetic/Sub.cuh
+++ b/source/tensor/core/arithmetic/Sub.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#ifndef __SUB_CUH__
+#define __SUB_CUH__
+
+#include "Sub.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+#ifdef USE_CUDA
+
+/* subtraction of data arrays (CUDA Kernel) */
+__global__
+void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
+
+/* tensor subtraction c = a - b * \beta (cuda version) */
+void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c = NULL, DTYPE beta = (DTYPE)1.0);
+
+/*  tensor subtraction c = a - b * \beta (cuda version) with an input handle */
+void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
+
+#endif // USE_CUDA
+
+} // namespace nts(NiuTrans.Tensor)
+
+#endif // __SUB_CUH__
--- a/source/tensor/core/arithmetic/Absolute.h
+++ b/source/tensor/core/arithmetic/Absolute.h
 /* NiuTrans.Tensor - an open-source tensor library
-* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
-* All rights reserved.
-*
-* Licensed under the Apache License, Version 2.0 (the "License");
-* you may not use this file except in compliance with the License.
-* You may obtain a copy of the License at
-*
-*   http://www.apache.org/licenses/LICENSE-2.0
-*
-* Unless required by applicable law or agreed to in writing, software
-* distributed under the License is distributed on an "AS IS" BASIS,
-* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-* See the License for the specific language governing permissions and
-* limitations under the License.
-*/
+ * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
+ * All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */

 /*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
-*/
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ * Today is the first day of August. It's still very hot.
+ */

-#ifndef __ABSOLUTE_H__
-#define __ABSOLUTE_H__
+#ifndef __SUB_H__
+#define __SUB_H__

 #include "../../XTensor.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)

-/* set every entry to its absolute value */
-void _Absolute(const XTensor * a, XTensor * b);
+/* tensor subtraction c = a - b * \beta */
+void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0);

 /* 
-set every entry to its absolute value (do it on site)
+tensor subtraction a = a - b * \beta
 keep the result in the input tensor a and return nothing
 */
-void _AbsoluteMe(XTensor * a);
+void _SubMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0);
    
 /*
-set every entry to its absolute value (return a XTensor structure)
-make a new tensor to keep the result and return it
+tensor subtraction c = a - b * \beta
+make a new tensor c to keep the result and return it
 */
-XTensor Absolute(const XTensor & a);
+XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0);

 } // namespace nts(NiuTrans.Tensor)

-#endif // __ABSOLUTE_H__
+#endif // __SUB_H__
--- a/source/tensor/core/arithmetic/SumDim.cpp
+++ b/source/tensor/core/arithmetic/SumDim.cpp
@@ -116,7 +116,8 @@ void _SumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE bet
 }
    
 /*
-tensor summation (on site)
+tensor summation (do it on site)
+keep the result in the input tensor and return nothing

 a = a + b * \beta
 where the size of b is equal to the n-th dimension of a,
@@ -133,7 +134,8 @@ void _SumDim(XTensor * a, const XTensor * b, int n, DTYPE beta)
 }
    
 /*
-tensor summation (return a structure and make tensor connections)
+tensor summation (return a XTensor structure and make tensor connections)
+make a new tensor to keep the result and return it

 c = a + b * \beta
 where the size of b is equal to the n-th dimension of a,
@@ -141,9 +143,9 @@ i.e., a is summed with b by broadcasting

 >> a - a tensor
 >> b - another tensor whose size is equal to that of dimension n of a
->> c - where we put a+b*\beta. we save it in a if c is NULL
 >> n - the dimension index
 >> beta - the scaling factor
+<< return - the result tensor by tensor summation
 */
 XTensor SumDim(const XTensor &a, const XTensor &b, int n, DTYPE beta)
 {

--- a/source/tensor/core/getandset/SetData.cpp
+++ b/source/tensor/core/getandset/SetData.cpp
@@ -20,6 +20,7 @@
 * $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-05-08
 */

+#include <math.h>
 #include "SetData.h"
 #include "SetData.cuh"
 #include "../../XUtility.h"
@@ -38,6 +39,43 @@
 namespace nts{ // namespace nts(NiuTrans.Tensor)

 /*
+Fill the input tensor with values according to the method described in 
+"Understanding the difficulty of training deep feedforward neural networks" - Glorot, X. & Bengio, Y. (2010), 
+using a uniform distribution. The resulting tensor will have values sampled from U(-a, a) 
+where a = gain * \sqrt{2 / (fanIn + fanOut)} * \sqrt{3}. Also known as Glorot initialisation.
+
+>> tensor - the tensor whose data array would be initialized
+>> gain - an optional scaling factor
+*/
+void _SetDataFanInOut(XTensor * tensor, DTYPE gain)
+{
+    CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
+    CheckNTErrors(tensor->order >= 2, "the tensor dimension must be no less than 2!");
+
+    int fanIn = 1;
+    int fanOut = 1;
+
+    int order = tensor->order;
+    if (order == 2) {
+        fanIn = tensor->dimSize[1];
+        fanOut = tensor->dimSize[0];
+    }
+    else {
+        int numInputFmaps = tensor->dimSize[1];
+        int numOutputFmaps = tensor->dimSize[0];
+        int receptiveFieldSize = 1;
+        for (int i = 2; i < order; i++)
+            receptiveFieldSize *= tensor->dimSize[i];
+        fanIn = numInputFmaps * receptiveFieldSize;
+        fanOut = numOutputFmaps * receptiveFieldSize;
+    }
+
+    DTYPE std = gain * sqrt(2.0/(fanIn + fanOut));
+    DTYPE a = sqrt(3.0) * std;
+    _SetDataRand(tensor, -a, a);
+}
+
+/* 
 generate data items with a fixed value p 
 >> tensor - the tensor whose data array would be initialized
 >> p - pointer to the number for initializing the tensor
@@ -65,7 +103,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
        }
        else{
 #ifdef USE_CUDA
-            CudaSetDataFixedInt(tensor, p);
+            _CudaSetDataFixedInt(tensor, p);
 #endif
        }
    }
@@ -88,7 +126,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
        }
        else{
 #ifdef USE_CUDA
-            CudaSetDataFixedFloat(tensor, p);
+            _CudaSetDataFixedFloat(tensor, p);
 #endif
        }
    }
@@ -111,7 +149,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
        }
        else{
 #ifdef USE_CUDA
-            CudaSetDataFixedDouble(tensor, p);
+            _CudaSetDataFixedDouble(tensor, p);
 #endif
        }
    }
@@ -137,7 +175,7 @@ generate data items with a fixed value p (in integer)
 */
 void _SetDataFixedInt(XTensor * tensor, int p)
 {
-    CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT");
+    CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");

    if(p == 0)
        tensor->SetZeroAll();
@@ -152,7 +190,7 @@ generate data items with a fixed value p (in float)
 */
 void _SetDataFixedFloat(XTensor * tensor, float p)
 {
-    CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_INT");
+    CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");

    if(p == 0)
        tensor->SetZeroAll();
@@ -167,7 +205,7 @@ generate data items with a fixed value p (in double)
 */
 void _SetDataFixedDouble(XTensor * tensor, double p)
 {
-    CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_INT");
+    CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");

    if(p == 0)
        tensor->SetZeroAll();
@@ -183,6 +221,8 @@ generate data items with a uniform distribution in [low,high]
 */
 void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
 {
+    CheckNTErrors(high > low, "the high value must be greater than low value!");
+
    if(tensor == NULL)
        return;
    
@@ -215,10 +255,13 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
    TODO: generate data points on GPUs straightforwardly.
    */
    else{
-        XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1);
-        _SetDataRand(t2, low, high);
-        _CopyValues(t2, tensor);
-        delete t2;
+#ifdef USE_CUDA
+        _CudaSetDataRand(tensor, low, high);
+#endif
+        //XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1);
+        //_SetDataRand(t2, low, high);
+        //_CopyValues(t2, tensor);
+        //delete t2;
    }
 }
    

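Note on `_SetDataFanInOut`: the sampling bound follows directly from the formula in its comment, since `sqrt(3) * gain * sqrt(2 / (fanIn + fanOut))` equals `gain * sqrt(6 / (fanIn + fanOut))`. The sketch below computes that bound from a shape alone, assuming the usual Glorot convention that the receptive field size is the product of all dimensions after the first two. `GlorotUniformBound` is a hypothetical helper, not part of NiuTrans.Tensor.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the bound a of U(-a, a) for Glorot
// uniform initialisation of a tensor with the given shape.
// dimSize[0] is taken as the output size and dimSize[1] as the input size.
double GlorotUniformBound(const std::vector<int>& dimSize, double gain)
{
    assert(dimSize.size() >= 2);

    // product of the trailing dimensions (1 for a plain matrix)
    double receptiveFieldSize = 1.0;
    for (std::size_t i = 2; i < dimSize.size(); i++)
        receptiveFieldSize *= dimSize[i];

    double fanIn = dimSize[1] * receptiveFieldSize;
    double fanOut = dimSize[0] * receptiveFieldSize;

    // sqrt(3) * gain * sqrt(2 / (fanIn + fanOut))
    return gain * std::sqrt(6.0 / (fanIn + fanOut));
}
```

For a 2x4 matrix with `gain = 1` this gives `fanIn + fanOut = 6` and therefore a bound of 1, so values would be drawn from U(-1, 1).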
--- a/source/tensor/core/getandset/SetData.cu
+++ b/source/tensor/core/getandset/SetData.cu
@@ -21,7 +21,10 @@
 * I'm surprised that I did not write this file till today.
 */

+#include <curand.h>
+#include <time.h>
 #include "SetData.cuh"
+#include <curand_kernel.h>
 #include "../../XDevice.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -46,7 +49,7 @@ generate data items with a fixed value p (in int)
 >> tensor - the tensor for initialization
 >> p - the initial value
 */
-void CudaSetDataFixedInt(XTensor * tensor, int p)
+void _CudaSetDataFixedInt(XTensor * tensor, int p)
 {
    CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");

@@ -86,7 +89,7 @@ generate data items with a fixed value p (in float)
 >> tensor - the tensor for initialization
 >> p - the initial value
 */
-void CudaSetDataFixedFloat(XTensor * tensor, float p)
+void _CudaSetDataFixedFloat(XTensor * tensor, float p)
 {
    CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");

@@ -126,7 +129,7 @@ generate data items with a fixed value p (in double)
 >> tensor - the tensor for initialization
 >> p - the initial value
 */
-void CudaSetDataFixedDouble(XTensor * tensor, double p)
+void _CudaSetDataFixedDouble(XTensor * tensor, double p)
 {
    CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");

@@ -146,4 +149,115 @@ void CudaSetDataFixedDouble(XTensor * tensor, double p)
    BacktoCudaDev(tensor->devID, devIDBackup);
 }

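+/* 
+call curand_init in each thread with the same random seed
+(the thread index serves as the sequence number) to initialize the rng states
+>> state - the array of curand states, one per thread
+>> seed - the random seed
+*/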
+__global__ 
+void KernelInitializeCurand(curandState * state, unsigned long seed)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    curand_init(seed, i, 0, &state[i]);
+}
+
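+/*
+generate a random float with the i-th curand state
+>> globalState - the array of curand states
+>> i - index of the state to use
+<< return - a sample from the uniform distribution over (0, 1]
+*/
+__device__ 
+float GenerateFloat(curandState* globalState, int i)
+{
+    /* copy the state to local memory */
+    curandState localState = globalState[i];
+    /* draw a sample from the uniform distribution */
+    float randNum = curand_uniform(&localState);
+    /* write the updated state back */
+    globalState[i] = localState;
+
+    return randNum;
+}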
+
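+/*
+generate a random double with the i-th curand state
+>> globalState - the array of curand states
+>> i - index of the state to use
+<< return - a sample from the uniform distribution over (0, 1]
+*/
+__device__ 
+double GenerateDouble(curandState* globalState, int i)
+{
+    /* copy the state to local memory */
+    curandState localState = globalState[i];
+    /* draw a sample from the uniform distribution */
+    double randNum = curand_uniform_double(&localState);
+    /* write the updated state back */
+    globalState[i] = localState;
+
+    return randNum;
+}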
+
+/* 
+set data array with a uniform distribution in the range from low to low + variance
+>> deviceStates - the states of curand, one per thread
+>> d - float datatype pointer to the data array 
+>> size - size of the array
+>> low - lower bound of the range
+>> variance - width of the range, i.e., high - low
+*/
+__global__
+void KernelSetDataRandFloat(curandState* deviceStates, float * d, int size, DTYPE low, DTYPE variance)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    
+    if (i < size) {
+        float randNum = GenerateFloat(deviceStates, i);
+        d[i] = randNum * variance + low;
+    }
+}
+
+/* 
+set data array with a uniform distribution in the range from low to low + variance
+>> deviceStates - the states of curand, one per thread
+>> d - double datatype pointer to the data array
+>> size - size of the array
+>> low - lower bound of the range
+>> variance - width of the range, i.e., high - low
+*/
+__global__
+void KernelSetDataRandDouble(curandState* deviceStates, double * d, int size, DTYPE low, DTYPE variance)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    
+    if (i < size){
+        double randNum = GenerateDouble(deviceStates, i);
+        d[i] = randNum * variance + low;
+    }
+}
+
+/*
+generate data items with a uniform distribution in [low,high]
+>> tensor - the tensor whose data array would be initialized
+>> low - lower value of the range
+>> high - higher value of the range
+*/
+void _CudaSetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
+{
+    CheckNTErrors(high > low, "the high value must be greater than low value!");
+
+    int gridSize[3];
+    int blockSize[3];
+
+    GDevs.GetCudaThread(tensor->devID, tensor->unitNum, gridSize, blockSize);
+
+    dim3 blocks(gridSize[0]);
+    dim3 threads(blockSize[0]);
+
+    int devIDBackup;
+    ProtectCudaDev(tensor->devID, devIDBackup);
+    
+    /* one rng state per launched thread: KernelInitializeCurand writes state[i] in every thread */
+    curandState *deviceStates;
+    cudaMalloc(&deviceStates, sizeof(curandState) * gridSize[0] * blockSize[0]);
+    DTYPE variance = high - low;
+
+    KernelInitializeCurand<<<blocks, threads>>>(deviceStates, unsigned(time(NULL)));
+    if (tensor->dataType == X_FLOAT)
+        KernelSetDataRandFloat <<<blocks, threads >>>(deviceStates, (float*)tensor->data, tensor->unitNum, low, variance);
+    else if (tensor->dataType == X_DOUBLE)
+        KernelSetDataRandDouble <<<blocks, threads >>>(deviceStates, (double*)tensor->data, tensor->unitNum, low, variance);
+
+    /* release the rng states */
+    cudaFree(deviceStates);
+
+    BacktoCudaDev(tensor->devID, devIDBackup);
+}
+
 } // namespace nts(NiuTrans.Tensor)
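
Note on the scaling step in `KernelSetDataRandFloat` / `KernelSetDataRandDouble`: each kernel draws a unit-interval sample and maps it into the target range with `randNum * variance + low`. The host-side sketch below illustrates only that arithmetic. `FillUniform` is a hypothetical helper, not part of NiuTrans.Tensor, and `std::mt19937` stands in for curand as an assumption made for portability.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper (not a NiuTrans.Tensor function): fill `data` with
// samples drawn uniformly from the range that starts at `low` and has
// width `high - low`.
void FillUniform(std::vector<float> & data, float low, float high, unsigned seed)
{
    assert(high > low);                  // mirrors the CheckNTErrors guard in the diff

    std::mt19937 rng(seed);              // host generator, stands in for curand
    std::uniform_real_distribution<float> unit(0.0F, 1.0F);

    float variance = high - low;         // same name as the kernel parameter
    for (float & x : data)
        x = unit(rng) * variance + low;  // scale the unit sample, then shift it
}
```

Calling `FillUniform(v, -0.5F, 0.5F, 42u)` twice with the same seed yields the same values, which is convenient in tests. The CUDA version in the diff seeds with `time(NULL)`, so its output differs from run to run.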
--- a/source/tensor/core/getandset/SetData.cuh
+++ b/source/tensor/core/getandset/SetData.cuh
@@ -29,13 +29,16 @@
 namespace nts { // namespace nts(NiuTrans.Tensor)

 /* generate data items with a fixed value p (in int) */
-void CudaSetDataFixedInt(XTensor * tensor, int p);
+void _CudaSetDataFixedInt(XTensor * tensor, int p);

 /* generate data items with a fixed value p (in float) */
-void CudaSetDataFixedFloat(XTensor * tensor, float p);
+void _CudaSetDataFixedFloat(XTensor * tensor, float p);

 /* generate data items with a fixed value p (in double) */
-void CudaSetDataFixedDouble(XTensor * tensor, double p);
+void _CudaSetDataFixedDouble(XTensor * tensor, double p);
+
+/* generate data items with a uniform distribution in [low,high] */
+void _CudaSetDataRand(XTensor * tensor, DTYPE low, DTYPE high);

 } // namespace nts(NiuTrans.Tensor)


--- a/source/tensor/core/getandset/SetData.h
+++ b/source/tensor/core/getandset/SetData.h
@@ -27,6 +27,9 @@

 namespace nts { // namespace nts(NiuTrans.Tensor)

+/* generate data items with Xavier initialization */
+void _SetDataFanInOut(XTensor * tensor, DTYPE gain = 1.0F);
+
 /* generate data items with a fixed value p */
 void _SetDataFixed(XTensor * tensor, void * valuePointer);


--- a/source/tensor/core/math/Log.cpp
+++ b/source/tensor/core/math/Log.cpp
-/* NiuTrans.Tensor - an open-source tensor library
-* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
-* All rights reserved.
-*
-* Licensed under the Apache License, Version 2.0 (the "License");
-* you may not use this file except in compliance with the License.
-* You may obtain a copy of the License at
-*
-*   http://www.apache.org/licenses/LICENSE-2.0
-*
-* Unless required by applicable law or agreed to in writing, software
-* distributed under the License is distributed on an "AS IS" BASIS,
-* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-* See the License for the specific language governing permissions and
-* limitations under the License.
-*/
-
-/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
-*/
-
-#include "../../XTensor.h"
-#include "../../XName.h"
-#include "Log.h"
-#include "Log.cuh"
-#include <math.h>
-
-namespace nts { // namespace nts(NiuTrans.Tensor)
-
-/*
-set every entry to its log value (do it on site)
->> a - input tensor we are processing
->> b - output tensor we are processing
-*/
-void _Log(const XTensor * a, XTensor * b)
-{
-#ifdef USE_CUDA
-    /* run it on GPUs */
-    if (a->devID >= 0) {
-        _CudaLog(a, b);
-    return;
-    }
-#endif
-
-    CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
-    CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
-    DTYPE * d = (DTYPE*)a->data;
-    DTYPE * db = (DTYPE*)b->data;
-    for (int i = 0; i < a->unitNum; i++)
-        db[i] = (DTYPE)log(d[i]);
-}
-
-/*
-set every entry to its log value
-keep the result in the input tensor a and return nothing
->> a - the tensor we are processing
-*/
-void _LogMe(XTensor * a)
-{
-    _Log(a, a);
-}
-
-/*
-set every entry to its log value (return a XTensor structure)
-make a new tensor to keep the result and return it
->> a - input tensor we are processing
-<< return - the log value of the input tensor
-*/
-XTensor Log(const XTensor & a)
-{
-    XTensor b(&a);
-    b.SetTMP();
-    
-    /* call _Log function */
-    _Log(&a, &b);
-    
-    /* tensor connections */
-    XLink::MakeLink(&a, NULL, &b, MATH_LOG);
-    
-    return b;
-}
-} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/math/Log.cu
+++ b/source/tensor/core/math/Log.cu
-/* NiuTrans.Tensor - an open-source tensor library
-* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
-* All rights reserved.
-*
-* Licensed under the Apache License, Version 2.0 (the "License");
-* you may not use this file except in compliance with the License.
-* You may obtain a copy of the License at
-*
-*   http://www.apache.org/licenses/LICENSE-2.0
-*
-* Unless required by applicable law or agreed to in writing, software
-* distributed under the License is distributed on an "AS IS" BASIS,
-* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-* See the License for the specific language governing permissions and
-* limitations under the License.
-*/
-
-/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
-*/
-
-#include "../../XDevice.h"
-#include "../../XTensor.h"
-#include "Log.h"
-#include "Log.cuh"
-
-namespace nts { // namespace nts(NiuTrans.Tensor)
-
-#ifdef USE_CUDA
-/*
-set each entry to its log value (CUDA Kernel)
->> a - pointer to input data array
->> b - pointer to output data array
->> size - size of the data array
-*/
-__global__
-void KernelLog(DTYPE * a, DTYPE * b, int size)
-{
-    int i = blockDim.x * blockIdx.x + threadIdx.x;
-
-    if (i < size)
-        b[i] = log(a[i]);
-}
-
-/*
-set each entry to its log value (CUDA Kernel)
-This is for float16 computation
->> a - pointer to input data array
->> b - pointer to output data array
->> size - size of the data array
-*/
-__global__
-void KernelLog(__half * a, __half * b, int size)
-{
-    return;
-}
-
-/*
-set each entry to its log value
->> a - input tensor
->> b - output tensor
-*/
-void _CudaLog(const XTensor * a, XTensor * b)
-{
-    CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
-    CheckNTErrors((a->isSparse == false), "TODO!");
-
-    int gridSize[3];
-    int blockSize[3];
-
-    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
-
-    dim3 blocks(gridSize[0]);
-    dim3 threads(blockSize[0]);
-
-    int devIDBackup;
-    ProtectCudaDev(a->devID, devIDBackup);
-
-    if (a->dataType == DEFAULT_DTYPE) {
-        KernelLog << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
-    }
-    else if (a->dataType == X_FLOAT16) {
-        KernelLog << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
-    }
-    else {
-        ShowNTErrors("TODO!");
-    }
-
-    BacktoCudaDev(a->devID, devIDBackup);
-}
-
-#endif // USE_CUDA
-} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/core/math/Unary.cpp
+++ b/source/tensor/core/math/Unary.cpp
+#include <math.h>
+#include "../../XName.h"
+#include "Unary.h"
+#include "Unary.cuh"
+
+namespace nts{
+    
+
+#ifdef USE_CUDA
+/* define three macros separately, specifying the respective function names */
+#define _SIMPLE_UNARY_FUNCTION(_funcName, _cudaFuncName, origFunc)          \
+void _funcName(const XTensor * a, XTensor * b)                              \
+{                                                                           \
+    /* run it on GPUs */                                                    \
+    if (a->devID >= 0) {                                                    \
+        _cudaFuncName(a, b);                                                \
+        return;                                                             \
+    }                                                                       \
+    CheckNTErrors((XTensor::IsSameShaped(a, b)),                            \
+                  "Input tensors should have the same shape!");             \
+    CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");                 \
+    DTYPE * d = (DTYPE*)a->data;                                            \
+    DTYPE * db = (DTYPE*)b->data;                                           \
+    for (int i = 0; i < a->unitNum; i++)                                    \
+        db[i] = (DTYPE)origFunc(d[i]);                                      \
+}
+
+#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName)                   \
+void _funcNameMe(XTensor * a)                                               \
+{                                                                           \
+    _funcName(a, a);                                                        \
+}        
+
+#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId)             \
+XTensor funcName(const XTensor &a)                                          \
+{                                                                           \
+    XTensor b(&a);                                                          \
+    b.SetTMP();                                                             \
+    _funcName(&a, &b);                                                      \
+    XLink::MakeLink(&a, NULL, &b, operationId);                             \
+    return b;                                                               \
+}
+
+_SIMPLE_UNARY_FUNCTION(_Absolute, _CudaAbsolute, fabs)
+_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
+SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
+
+_SIMPLE_UNARY_FUNCTION(_Exp, _CudaExp, exp)
+_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
+SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
+
+_SIMPLE_UNARY_FUNCTION(_Log, _CudaLog, log)
+_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
+SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
+
+_SIMPLE_UNARY_FUNCTION(_Sin, _CudaSin, sin)
+_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
+SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
+
+_SIMPLE_UNARY_FUNCTION(_Cos, _CudaCos, cos)
+_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
+SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
+
+_SIMPLE_UNARY_FUNCTION(_Tan, _CudaTan, tan)
+_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
+SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
+#else
+/* define three macros separately, specifying the respective function names */
+#define _SIMPLE_UNARY_FUNCTION(_funcName, origFunc)                         \
+void _funcName(const XTensor * a, XTensor * b)                              \
+{                                                                           \
+    CheckNTErrors((XTensor::IsSameShaped(a, b)),                            \
+                  "Input tensors should have the same shape!");             \
+    CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");                 \
+    DTYPE * d = (DTYPE*)a->data;                                            \
+    DTYPE * db = (DTYPE*)b->data;                                           \
+    for (int i = 0; i < a->unitNum; i++)                                    \
+        db[i] = (DTYPE)origFunc(d[i]);                                      \
+}
+
+#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName)                   \
+void _funcNameMe(XTensor * a)                                               \
+{                                                                           \
+    _funcName(a, a);                                                        \
+}        
+
+#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId)             \
+XTensor funcName(const XTensor &a)                                          \
+{                                                                           \
+    XTensor b(&a);                                                          \
+    b.SetTMP();                                                             \
+    _funcName(&a, &b);                                                      \
+    XLink::MakeLink(&a, NULL, &b, operationId);                             \
+    return b;                                                               \
+}
+
+_SIMPLE_UNARY_FUNCTION(_Absolute, fabs)
+_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
+SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
+
+_SIMPLE_UNARY_FUNCTION(_Exp, exp)
+_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
+SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
+
+_SIMPLE_UNARY_FUNCTION(_Log, log)
+_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
+SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
+
+_SIMPLE_UNARY_FUNCTION(_Sin, sin)
+_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
+SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
+
+_SIMPLE_UNARY_FUNCTION(_Cos, cos)
+_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
+SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
+
+_SIMPLE_UNARY_FUNCTION(_Tan, tan)
+_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
+SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
+#endif
+
+}
\ No newline at end of file
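
Note on the macro pattern in `Unary.cpp`: three macros generate, for each math function, an out-of-place routine, an in-place ("Me") routine, and a value-returning routine. The sketch below reproduces that pattern on a plain `std::vector<float>` so it can be compiled without the library. `Vec`, the `SKETCH_*` macros, and the `Sketch*` functions are hypothetical names chosen for illustration, and the device dispatch and `XLink` bookkeeping of the real code are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for XTensor: a flat array of floats.
typedef std::vector<float> Vec;

// Generate `void nameInto(const Vec &, Vec &)`, applying origFunc elementwise.
#define SKETCH_UNARY_INTO(nameInto, origFunc)                        \
void nameInto(const Vec & a, Vec & b)                                \
{                                                                    \
    assert(a.size() == b.size());                                    \
    for (std::size_t i = 0; i < a.size(); i++)                       \
        b[i] = (float)origFunc(a[i]);                                \
}

// Generate the in-place variant by passing the same array as input and output.
#define SKETCH_UNARY_ME(nameMe, nameInto)                            \
void nameMe(Vec & a)                                                 \
{                                                                    \
    nameInto(a, a);                                                  \
}

// Generate the variant that allocates and returns the result.
#define SKETCH_UNARY_RET(name, nameInto)                             \
Vec name(const Vec & a)                                              \
{                                                                    \
    Vec b(a.size());                                                 \
    nameInto(a, b);                                                  \
    return b;                                                        \
}

SKETCH_UNARY_INTO(SketchAbsInto, std::fabs)
SKETCH_UNARY_ME(SketchAbsMe, SketchAbsInto)
SKETCH_UNARY_RET(SketchAbs, SketchAbsInto)

SKETCH_UNARY_INTO(SketchExpInto, std::exp)
SKETCH_UNARY_ME(SketchExpMe, SketchExpInto)
SKETCH_UNARY_RET(SketchExp, SketchExpInto)
```

Adding another elementwise function then takes three lines, which is presumably the motivation for replacing the hand-written `Log.cpp` with `Unary.cpp` in this commit.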
--- a/source/tensor/core/math/Unary.cu
+++ b/source/tensor/core/math/Unary.cu
+#include <math.h>
+#include "../../XDevice.h"
+#include "../../XName.h"
+#include "Unary.cuh"
+
+namespace nts {
+
+#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc)                   \
+__global__                                                              \
+void Kernel##funcName(DTYPE * a, DTYPE * b, int size)                   \
+{                                                                       \
+    int i = blockDim.x * blockIdx.x + threadIdx.x;                      \
+                                                                        \
+    if (i < size)                                                       \
+        b[i] = (DTYPE)origFunc(a[i]);                                   \
+}                                                                       \
+__global__                                                              \
+void Kernel##funcName(__half * a, __half * b, int size)                 \
+{                                                                       \
+    return;                                                             \
+}                                                                       \
+void _Cuda##funcName(const XTensor * a, XTensor * b)                    \
+{                                                                       \
+    CheckNTErrors((XTensor::IsSameShaped(a, b)),                        \
+                  "Input tensors should have the same shape!");         \
+    CheckNTErrors((a->isSparse == false), "TODO!");                     \
+                                                                        \
+    int gridSize[3];                                                    \
+    int blockSize[3];                                                   \
+                                                                        \
+    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);     \
+                                                                        \
+    dim3 blocks(gridSize[0]);                                           \
+    dim3 threads(blockSize[0]);                                         \
+                                                                        \
+    int devIDBackup;                                                    \
+    ProtectCudaDev(a->devID, devIDBackup);                              \
+                                                                        \
+    if (a->dataType == DEFAULT_DTYPE) {                                 \
+        Kernel##funcName << <blocks, threads >> >                       \
+                     ((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);    \
+    }                                                                   \
+    else if (a->dataType == X_FLOAT16) {                                \
+        Kernel##funcName << <blocks, threads >> >                       \
+                     ((__half*)a->data, (__half*)b->data, a->unitNum);  \
+    }                                                                   \
+    else {                                                              \
+        ShowNTErrors("TODO!");                                          \
+    }                                                                   \
+                                                                        \
+    BacktoCudaDev(a->devID, devIDBackup);                               \
+}                                                                       \
+
+SIMPLE_UNARY_FUNCTION_GPU(Absolute, fabs)
+SIMPLE_UNARY_FUNCTION_GPU(Exp, exp)
+SIMPLE_UNARY_FUNCTION_GPU(Log, log)
+SIMPLE_UNARY_FUNCTION_GPU(Sin, sin)
+SIMPLE_UNARY_FUNCTION_GPU(Cos, cos)
+SIMPLE_UNARY_FUNCTION_GPU(Tan, tan)
+
+}
\ No newline at end of file
--- a/source/tensor/core/math/Unary.cuh
+++ b/source/tensor/core/math/Unary.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#ifndef __UNARY_CUH__
+#define __UNARY_CUH__
+
+#include "../../XTensor.h"
+#include "Unary.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+#ifdef USE_CUDA
+
+/* set each entry to its absolute value (CUDA Kernel) */
+__global__
+void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
+/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelAbsolute(__half * a, __half * b, int size);
+/* set each entry to its absolute value */
+void _CudaAbsolute(const XTensor * a, XTensor * b);
+
+/* set each entry to its exponential value (CUDA Kernel) */
+__global__
+void KernelExp(DTYPE * a, DTYPE * b, int size);
+/* set each entry to its exponential value (CUDA Kernel) with float16 data type */
+__global__
+void KernelExp(__half * a, __half * b, int size);
+/* set each entry to its exponential value */
+void _CudaExp(const XTensor * a, XTensor * b);
+
+/* set each entry to its logarithm value (CUDA Kernel) */
+__global__
+void KernelLog(DTYPE * a, DTYPE * b, int size);
+/* set each entry to its logarithm value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelLog(__half * a, __half * b, int size);
+/* set each entry to its logarithm value */
+void _CudaLog(const XTensor * a, XTensor * b);
+
+/* set each entry to its sine value (CUDA Kernel) */
+__global__
+void KernelSin(DTYPE * a, DTYPE * b, int size);
+/* set each entry to its sine value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelSin(__half * a, __half * b, int size);
+/* set each entry to its sine value */
+void _CudaSin(const XTensor * a, XTensor * b);
+
+/* set each entry to its cosine value (CUDA Kernel) */
+__global__
+void KernelCos(DTYPE * a, DTYPE * b, int size);
+/* set each entry to its cosine value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelCos(__half * a, __half * b, int size);
+/* set each entry to its cosine value */
+void _CudaCos(const XTensor * a, XTensor * b);
+
+/* set each entry to its tangent value (CUDA Kernel) */
+__global__
+void KernelTan(DTYPE * a, DTYPE * b, int size);
+/* set each entry to its tangent value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelTan(__half * a, __half * b, int size);
+/* set each entry to its tangent value */
+void _CudaTan(const XTensor * a, XTensor * b);
+
+#endif // USE_CUDA
+
+} // namespace nts(NiuTrans.Tensor)
+
+#endif // __UNARY_CUH__
\ No newline at end of file
--- a/source/tensor/core/math/Unary.h
+++ b/source/tensor/core/math/Unary.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#ifndef __UNARY_H__
+#define __UNARY_H__
+
+#include "../../XTensor.h"
+
+namespace nts{
+
+/* set every entry to its absolute value */
+void _Absolute(const XTensor * a, XTensor * b);
+/* 
+set every entry to its absolute value (do it on site)
+keep the result in the input tensor a and return nothing
+*/
+void _AbsoluteMe(XTensor * a);
+/* 
+set every entry to its absolute value (return a XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Absolute(const XTensor & a);
+
+/* set every entry to its exponential value */
+void _Exp(const XTensor * a, XTensor * b);
+/* 
+set every entry to its exponential value (do it on site)
+keep the result in the input tensor a and return nothing
+*/
+void _ExpMe(XTensor * a);
+/* 
+set every entry to its exponential value (return a XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Exp(const XTensor & a);
+
+/* set every entry to its logarithm value */
+void _Log(const XTensor * a, XTensor * b);
+/* 
+set every entry to its logarithm value (do it on site)
+keep the result in the input tensor a and return nothing
+*/
+void _LogMe(XTensor * a);
+/* 
+set every entry to its logarithm value (return a XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Log(const XTensor & a);
+
+/* set every entry to its sine value */
+void _Sin(const XTensor * a, XTensor * b);
+/* 
+set every entry to its sine value (do it on site)
+keep the result in the input tensor a and return nothing
+*/
+void _SinMe(XTensor * a);
+/* 
+set every entry to its sine value (return a XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Sin(const XTensor & a);
+
+/* set every entry to its cosine value */
+void _Cos(const XTensor * a, XTensor * b);
+/* 
+set every entry to its cosine value (do it on site)
+keep the result in the input tensor a and return nothing
+*/
+void _CosMe(XTensor * a);
+/* 
+set every entry to its cosine value (return a XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Cos(const XTensor & a);
+
+/* set every entry to its tangent value */
+void _Tan(const XTensor * a, XTensor * b);
+/* 
+set every entry to its tangent value (do it on site)
+keep the result in the input tensor a and return nothing
+*/
+void _TanMe(XTensor * a);
+/* 
+set every entry to its tangent value (return a XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Tan(const XTensor & a);
+
+}
+#endif //end __UNARY_H__
\ No newline at end of file
--- a/source/tensor/core/shape/Transpose.cpp
+++ b/source/tensor/core/shape/Transpose.cpp
@@ -24,12 +24,22 @@
 #include "Transpose.h"
 #include "Merge.h"
 #include "../../XUtility.h"
+#include "../../XName.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)

 /*
 tensor transposition of dimensions i and j
 b = transposed(a) 
+
+For an input tensor a, we transpose its dimensions i and j.
+E.g., let a be a tensor of size x * y * z, i = 0, j = 2, 
+then the output will be a tensor of size z * y * x.
+
+>> a - the input tensor
+>> b - the output tensor, i.e., tensor a with dimensions i and j swapped
+>> i - the first dimension to swap
+>> j - the second dimension to swap
 */
 void _Transpose(const XTensor * a, XTensor * b, const int i, const int j)
 {
@@ -96,4 +106,52 @@ void _Transpose(const XTensor * a, XTensor * b, const int i, const int j)
    }
 }

+/*
+tensor transposition of dimensions i and j (return a XTensor structure).
+make a new tensor to keep the result and return it.
+b = transposed(a)
+
+For an input tensor a, we transpose its dimensions i and j.
+E.g., let a be a tensor of size x * y * z, i = 0, j = 2, 
+then the output will be a tensor of size z * y * x.
+
+>> a - the input tensor
+>> i - the first dimension to swap
+>> j - the second dimension to swap
+<< return - the output tensor, i.e., tensor a with dimensions i and j swapped
+*/
+XTensor Transpose(const XTensor &a, const int i, const int j)
+{
+    CheckNTErrors(a.order > i && i >= 0, "index of dimension is out of scope!");
+    CheckNTErrors(a.order > j && j >= 0, "index of dimension is out of scope!");
+
+    int order = a.order;
+    int * dimSize = new int[order];
+    for(int k = 0; k < order; k++){
+        if(k == i)
+            dimSize[k] = a.dimSize[j];
+        else if(k == j)
+            dimSize[k] = a.dimSize[i];
+        else
+            dimSize[k] = a.dimSize[k];
+    }
+
+    float dr = (!a.isSparse) ? 1.0F : a.denseRatio;
+    XTensor b(order, dimSize, a.dataType, dr, a.devID, a.mem);
+    b.SetTMP();
+
+    /* call _Transpose function */
+    _Transpose(&a, &b, i, j);
+    
+    /* tensor connection */
+    XLink::MakeLink(&a, NULL, &b, SHAPE_TRANSPOSE);
+    XLink::AddParamToHeadInt(&b, i);
+    XLink::AddParamToHeadInt(&b, j);
+
+    /* destroy variables */
+    delete[] dimSize;
+
+    return b;
+}
+
 }
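
Note on the shape computation in `Transpose`: before calling `_Transpose`, the new function builds the output dimension sizes by swapping entries i and j of the input sizes. The sketch below isolates that step. `TransposedDims` is a hypothetical helper, and it uses `std::vector<int>` rather than the `new int[order]` buffer of the diff, an assumption made so that no manual `delete[]` is needed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a NiuTrans.Tensor function): return the dimension
// sizes of a tensor after swapping its dimensions i and j.
std::vector<int> TransposedDims(const std::vector<int> & dimSize, int i, int j)
{
    int order = (int)dimSize.size();
    assert(i >= 0 && i < order);   // same bounds as the CheckNTErrors calls
    assert(j >= 0 && j < order);

    std::vector<int> result(dimSize);
    result[i] = dimSize[j];        // dimension i takes the size of dimension j
    result[j] = dimSize[i];        // and vice versa; all other sizes are kept
    return result;
}
```

For a tensor of size x * y * z with i = 0 and j = 2, this gives z * y * x, matching the example in the function comment. When i equals j the sizes are returned unchanged.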
--- a/source/tensor/core/shape/Transpose.h
+++ b/source/tensor/core/shape/Transpose.h
@@ -34,13 +34,6 @@ b = transposed(a)
 void _Transpose(const XTensor * a, XTensor * b, const int i, const int j);

 /* 
-tensor transposition of dimensions i and j (do this on site)
-keep the result in the input tensor and return nothing.
-a = transposed(a) 
-*/
-void _TransposeMe(XTensor * a, const int i, const int j);
-
-/* 
 tensor transposition of dimensions i and j (return a XTensor structure).
 make a new tensor to keep the result and return it.
 b = transposed(a)

--- a/source/tensor/function/Loss.cu
+++ b/source/tensor/function/Loss.cu
@@ -24,7 +24,7 @@
 #include "../XDevice.h"
 #include "../core/math/Power.h"
 #include "../core/math/ScaleAndShift.h"
-#include "../core/math/Log.h"
+#include "../core/math/Unary.h"
 #include "../core/arithmetic/Negate.h"
 #include "../core/arithmetic/Sum.h"
 #include "../core/arithmetic/Multiply.h"

--- a/source/tensor/test/TAbsolute.cpp
+++ b/source/tensor/test/TAbsolute.cpp
@@ -19,6 +19,7 @@
 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
 */

+#include "../core/math/Unary.h"
 #include "TAbsolute.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -30,14 +31,14 @@ Set every entry to its absolute value.
 bool TestAbsolute1()
 {
 	/* a tensor of size (3, 2) */
-	int aOrder = 2;
-	int * aDimSize = new int[aOrder];
-	aDimSize[0] = 3;
-	aDimSize[1] = 2;
+	int order = 2;
+	int * dimSize = new int[order];
+	dimSize[0] = 3;
+	dimSize[1] = 2;

-	int aUnitNum = 1;
-	for (int i = 0; i < aOrder; i++)
-		aUnitNum *= aDimSize[i];
+	int unitNum = 1;
+	for (int i = 0; i < order; i++)
+		unitNum *= dimSize[i];

 	DTYPE aData[3][2] = { {1.0F, -2.0F}, 
 	                      {0.5F, -4.0F},
@@ -50,14 +51,14 @@ bool TestAbsolute1()
 	bool cpuTest = true;

 	/* create tensors */
-	XTensor * a = NewTensor(aOrder, aDimSize);
-	XTensor * b = NewTensor(aOrder, aDimSize);
-	XTensor * aMe = NewTensor(aOrder, aDimSize);
+	XTensor * a = NewTensor(order, dimSize);
+	XTensor * b = NewTensor(order, dimSize);
+	XTensor * aMe = NewTensor(order, dimSize);
    XTensor bUser;

 	/* initialize variables */
-	a->SetData(aData, aUnitNum);
-    aMe->SetData(aData, aUnitNum);
+	a->SetData(aData, unitNum);
+    aMe->SetData(aData, unitNum);

 	/* call Absolute function */
    _Absolute(a, b);
@@ -65,21 +66,21 @@ bool TestAbsolute1()
    bUser = Absolute(*a);

 	/* check results */
-	cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F);
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
    
 #ifdef USE_CUDA
 	/* GPU test */
 	bool gpuTest = true;

 	/* create tensor */
-	XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
-	XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
-	XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    XTensor bUserGPU;

 	/* Initialize variables */
-	aGPU->SetData(aData, aUnitNum);
-    aMeGPU->SetData(aData, aUnitNum);
+	aGPU->SetData(aData, unitNum);
+    aMeGPU->SetData(aData, unitNum);

 	/* call Absolute function */
    _Absolute(aGPU, bGPU);
@@ -87,7 +88,7 @@ bool TestAbsolute1()
    bUserGPU = Absolute(*aGPU);

 	/* check results */
-	gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);

 	/* destroy variables */
 	delete a;
@@ -96,7 +97,7 @@ bool TestAbsolute1()
    delete aGPU;
    delete bGPU;
    delete aMeGPU;
-	delete[] aDimSize;
+	delete[] dimSize;

 	return cpuTest && gpuTest;
 #else
@@ -104,7 +105,7 @@ bool TestAbsolute1()
 	delete a;
 	delete b;
 	delete aMe;
-	delete[] aDimSize;
+	delete[] dimSize;

 	return cpuTest;
 #endif // USE_CUDA

--- a/source/tensor/test/TAbsolute.h
+++ b/source/tensor/test/TAbsolute.h
@@ -22,7 +22,6 @@
 #ifndef __TEST_ABSOLUTE_H__
 #define __TEST_ABSOLUTE_H__

-#include "../core/arithmetic/Absolute.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)


--- a/source/tensor/test/TCos.cpp
+++ b/source/tensor/test/TCos.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#include "../core/math/Unary.h"
+#include "TCos.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: test Cos function.
+Set every entry to its cosine value.
+*/
+bool TestCos1()
+{
+	/* a tensor of size (3, 2) */
+	int order = 2;
+	int * dimSize = new int[order];
+	dimSize[0] = 3;
+	dimSize[1] = 2;
+
+	int unitNum = 1;
+	for (int i = 0; i < order; i++)
+		unitNum *= dimSize[i];
+
+	DTYPE aData[3][2] = { {1.0F, 2.0F}, 
+	                      {-1.0F, -2.0F},
+	                      {0.0F, 0.5F} };
+	DTYPE answer[3][2] = { {0.5403F, -0.4161F},
+	                       {0.5403F, -0.4161F},
+	                       {1.0F, 0.8776F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+	XTensor * aMe = NewTensor(order, dimSize);
+    XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, unitNum);
+	aMe->SetData(aData, unitNum);
+
+	/* call Cos function */
+	_Cos(a, b);
+	_CosMe(aMe);
+    bUser = Cos(*a);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
+    
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, unitNum);
+	aMeGPU->SetData(aData, unitNum);
+
+	/* call Cos function */
+    _Cos(aGPU, bGPU);
+	_CosMe(aMeGPU);
+    bUserGPU = Cos(*aGPU);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+    delete aGPU;
+    delete bGPU;
+    delete aMeGPU;
+	delete[] dimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+	delete[] dimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Cos Function */
+bool TestCos()
+{
+	XPRINT(0, stdout, "[TEST Cos] set every entry to its cosine value \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestCos1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> Failed!\n");
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
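Review note: `TestCos1` compares results against four-decimal reference values with a tolerance of `1e-4F`. The following is a minimal standalone sketch of that comparison, not the library's code. It assumes `CheckData` performs an element-wise absolute-tolerance check; `checkData`, `cosAll`, and `cosMatchesExpected` are hypothetical names introduced here for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for XTensor::CheckData (assumed behavior): true when
// every element of `data` is within `tolerance` of the matching `answer` entry.
bool checkData(const std::vector<float>& data,
               const std::vector<float>& answer,
               float tolerance)
{
    if (data.size() != answer.size())
        return false;
    for (std::size_t i = 0; i < data.size(); i++) {
        if (std::fabs(data[i] - answer[i]) > tolerance)
            return false;
    }
    return true;
}

// Apply cosine to every element and return the result as a new vector.
std::vector<float> cosAll(const std::vector<float>& input)
{
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); i++)
        output[i] = std::cos(input[i]);
    return output;
}

// Same input and reference values as TestCos1, flattened row by row.
bool cosMatchesExpected()
{
    const std::vector<float> aData  = {1.0F, 2.0F, -1.0F, -2.0F, 0.0F, 0.5F};
    const std::vector<float> answer = {0.5403F, -0.4161F, 0.5403F, -0.4161F, 1.0F, 0.8776F};
    return checkData(cosAll(aData), answer, 1e-4F);
}
```

The reference values are rounded to four decimals, so an exact comparison would fail; a tolerance of `1e-4F` is wide enough to absorb the rounding.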
--- a/source/tensor/test/TCos.h
+++ b/source/tensor/test/TCos.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#ifndef __TEST_SIN_H__
+#define __TEST_SIN_H__
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Sin Function */
+bool TestSin();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_SIN_H__
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#ifndef __TEST_COS_H__
+#define __TEST_COS_H__
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Cos Function */
+bool TestCos();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_COS_H__
--- a/source/tensor/test/TDiv.cpp
+++ b/source/tensor/test/TDiv.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#include "TDiv.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* 
+case 1: element-wise division of two tensors
+c(i) = a(i)/b(i) + \alpha * c(i)
+In this case, (2, 2) / (2, 2) -> (2, 2), leadingDim=0, alpha=0.
+*/
+bool TestDiv1()
+{
+	/* a source tensor of size (2, 2) */
+	int sOrder1 = 2;
+	int * sDimSize1 = new int[sOrder1];
+	sDimSize1[0] = 2;
+	sDimSize1[1] = 2;
+
+	int sUnitNum1 = 1;
+	for (int i = 0; i < sOrder1; i++)
+		sUnitNum1 *= sDimSize1[i];
+
+	/* a source tensor of size (2, 2) */
+	int sOrder2 = 2;
+	int * sDimSize2 = new int[sOrder2];
+	sDimSize2[0] = 2;
+	sDimSize2[1] = 2;
+
+	int sUnitNum2 = 1;
+	for (int i = 0; i < sOrder2; i++)
+		sUnitNum2 *= sDimSize2[i];
+
+	/* a target tensor of size (2, 2) */
+	int tOrder = 2;
+	int * tDimSize = new int[tOrder];
+	tDimSize[0] = 2;
+	tDimSize[1] = 2;
+
+	int tUnitNum = 1;
+	for (int i = 0; i < tOrder; i++)
+		tUnitNum *= tDimSize[i];
+
+	DTYPE sData1[2][2] = { {0.0F, 1.0F},
+	                       {2.0F, 3.0F} };
+	DTYPE sData2[2][2] = { {1.0F, 1.0F},
+	                       {4.0F, 9.0F} };
+	DTYPE answer[2][2] = { {0.0F, 1.0F},
+	                       {0.5F, 0.3333F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * s1 = NewTensor(sOrder1, sDimSize1);
+	XTensor * s2 = NewTensor(sOrder2, sDimSize2);
+	XTensor * t = NewTensor(tOrder, tDimSize);
+    XTensor * tMe = NewTensor(tOrder, tDimSize);
+    XTensor tUser;
+
+	/* initialize variables */
+	s1->SetData(sData1, sUnitNum1);
+	tMe->SetData(sData1, sUnitNum1);
+	s2->SetData(sData2, sUnitNum2);
+	t->SetZeroAll();
+
+	/* call Div function */
+	_Div(s1, s2, t, 0, 0);
+	_DivMe(tMe, s2, 0, 0);
+    tUser = Div(*s1, *s2, 0);
+
+	/* check results */
+	cpuTest = t->CheckData(answer, tUnitNum, 1e-4F) && 
+              tMe->CheckData(answer, tUnitNum, 1e-4F) && 
+              tUser.CheckData(answer, tUnitNum, 1e-4F);
+
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
+	XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
+	XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
+    XTensor tUserGPU;
+
+	/* Initialize variables */
+	sGPU1->SetData(sData1, sUnitNum1);
+	tMeGPU->SetData(sData1, sUnitNum1);
+	sGPU2->SetData(sData2, sUnitNum2);
+	tGPU->SetZeroAll();
+
+	/* call Div function */
+	_Div(sGPU1, sGPU2, tGPU, 0, 0);
+	_DivMe(tMeGPU, sGPU2, 0, 0);
+    tUserGPU = Div(*sGPU1, *sGPU2, 0);
+
+	/* check results */
+	gpuTest = tGPU->CheckData(answer, tUnitNum, 1e-4F) && 
+              tMeGPU->CheckData(answer, tUnitNum, 1e-4F) && 
+              tUserGPU.CheckData(answer, tUnitNum, 1e-4F);
+
+	/* destroy variables */
+    delete s1;
+    delete s2;
+    delete t;
+    delete tMe;
+    delete sGPU1;
+    delete sGPU2;
+    delete tGPU;
+    delete tMeGPU;
+    delete[] sDimSize1;
+    delete[] sDimSize2;
+    delete[] tDimSize;
+
+	return cpuTest && gpuTest;
+#else
+    /* destroy variables */
+    delete s1;
+    delete s2;
+    delete t;
+    delete tMe;
+    delete[] sDimSize1;
+    delete[] sDimSize2;
+    delete[] tDimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Div Function */
+bool TestDiv()
+{
+	XPRINT(0, stdout, "[TEST Div] element-wise division of two tensors \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestDiv1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> Failed!\n");
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
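Review note: the comment in `TestDiv1` gives the formula `c(i) = a(i)/b(i) + \alpha * c(i)`. The sketch below works through that formula on flat arrays. It is a rough illustration under stated assumptions, not the `_Div` implementation: `divElementWise` is a hypothetical helper, all three vectors are assumed to have the same length, and `leadingDim` is ignored.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: c(i) = a(i) / b(i) + alpha * c(i) for every index i.
// Assumes a, b and c have the same number of elements.
void divElementWise(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& c,
                    float alpha)
{
    for (std::size_t i = 0; i < c.size(); i++)
        c[i] = a[i] / b[i] + alpha * c[i];
}
```

With `alpha = 0` and the data from `TestDiv1` (`{0, 1, 2, 3}` divided by `{1, 1, 4, 9}`), the previous contents of `c` do not contribute, which is why the test can zero the target tensor first and compare against the plain quotients.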
--- a/source/tensor/test/TDiv.h
+++ b/source/tensor/test/TDiv.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#ifndef __TEST_DIV_H__
+#define __TEST_DIV_H__
+
+#include "../core/arithmetic/Div.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Div Function */
+extern "C"
+bool TestDiv();
+
+} // namespace nts(NiuTrans.Tensor)
+
+#endif // __TEST_DIV_H__
--- a/source/tensor/test/TExp.cpp
+++ b/source/tensor/test/TExp.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#include "../core/math/Unary.h"
+#include "TExp.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: test Exp function.
+Set every entry to its exponential value.
+*/
+bool TestExp1()
+{
+	/* a tensor of size (3, 2) */
+	int order = 2;
+	int * dimSize = new int[order];
+	dimSize[0] = 3;
+	dimSize[1] = 2;
+
+	int unitNum = 1;
+	for (int i = 0; i < order; i++)
+		unitNum *= dimSize[i];
+
+	DTYPE aData[3][2] = { {1.0F, 2.0F}, 
+	                      {-1.0F, -2.0F},
+	                      {0.0F, 0.5F} };
+	DTYPE answer[3][2] = { {2.7183F, 7.3891F},
+	                       {0.3679F, 0.1353F},
+	                       {1.0F, 1.6487F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+	XTensor * aMe = NewTensor(order, dimSize);
+    XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, unitNum);
+	aMe->SetData(aData, unitNum);
+
+	/* call Exp function */
+	_Exp(a, b);
+	_ExpMe(aMe);
+    bUser = Exp(*a);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
+    
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, unitNum);
+	aMeGPU->SetData(aData, unitNum);
+
+	/* call Exp function */
+    _Exp(aGPU, bGPU);
+	_ExpMe(aMeGPU);
+    bUserGPU = Exp(*aGPU);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+    delete aGPU;
+    delete bGPU;
+    delete aMeGPU;
+	delete[] dimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+	delete[] dimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Exp Function */
+bool TestExp()
+{
+	XPRINT(0, stdout, "[TEST Exp] set every entry to its exponent value \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestExp1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> Failed!\n");
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
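Review note: each test case in this commit derives `unitNum` by multiplying the entries of `dimSize` in a hand-written loop. As a possible simplification, the same product can be computed with the standard library. `unitCount` is a hypothetical helper name and is not part of NiuTrans.Tensor.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: number of elements in a tensor with the given
// dimension sizes, i.e. the product of all entries (1 for an empty list).
int unitCount(const std::vector<int>& dimSize)
{
    return std::accumulate(dimSize.begin(), dimSize.end(), 1,
                           std::multiplies<int>());
}
```

For the (3, 2) tensors used in the unary tests this gives 6, matching the value the loop produces.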
--- a/source/tensor/core/arithmetic/Absolute.cuh
+++ b/source/tensor/test/TExp.h
@@ -16,26 +16,16 @@
 */

 /*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
 */

-#include "Absolute.h"
+#ifndef __TEST_EXP_H__
+#define __TEST_EXP_H__

 namespace nts { // namespace nts(NiuTrans.Tensor)

-#ifdef USE_CUDA
-
-/* set each entry to its absolute value (CUDA Kernel) */
-__global__
-void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
-
-/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
-__global__
-void KernelAbsolute(__half * a, __half * b, int size);
-
-/* set each entry to its absolute value */
-void _CudaAbsolute(const XTensor * a, XTensor * b);
-
-#endif // USE_CUDA
+/* test for Exp Function */
+bool TestExp();

 } // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_EXP_H__
--- a/source/tensor/test/TLog.cpp
+++ b/source/tensor/test/TLog.cpp
@@ -19,6 +19,7 @@
 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
 */

+#include "../core/math/Unary.h"
 #include "TLog.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -30,14 +31,14 @@ Set every entry to its log value.
 bool TestLog1()
 {
 	/* a tensor of size (3, 2) */
-	int aOrder = 2;
-	int * aDimSize = new int[aOrder];
-	aDimSize[0] = 3;
-	aDimSize[1] = 2;
+	int order = 2;
+	int * dimSize = new int[order];
+	dimSize[0] = 3;
+	dimSize[1] = 2;

-	int aUnitNum = 1;
-	for (int i = 0; i < aOrder; i++)
-		aUnitNum *= aDimSize[i];
+	int unitNum = 1;
+	for (int i = 0; i < order; i++)
+		unitNum *= dimSize[i];

 	DTYPE aData[3][2] = { {1.0F, 2.0F}, 
 	                      {0.5F, 4.0F},
@@ -50,14 +51,14 @@ bool TestLog1()
 	bool cpuTest = true;

 	/* create tensors */
-	XTensor * a = NewTensor(aOrder, aDimSize);
-    XTensor * b = NewTensor(aOrder, aDimSize);
-	XTensor * aMe = NewTensor(aOrder, aDimSize);
+	XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+	XTensor * aMe = NewTensor(order, dimSize);
    XTensor bUser;

 	/* initialize variables */
-	a->SetData(aData, aUnitNum);
-	aMe->SetData(aData, aUnitNum);
+	a->SetData(aData, unitNum);
+	aMe->SetData(aData, unitNum);

 	/* call Log function */
 	_Log(a, b);
@@ -65,21 +66,21 @@ bool TestLog1()
    bUser = Log(*a);

 	/* check results */
-	cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F);
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
    
 #ifdef USE_CUDA
 	/* GPU test */
 	bool gpuTest = true;

 	/* create tensor */
-	XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
-	XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
-	XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    XTensor bUserGPU;

 	/* Initialize variables */
-	aGPU->SetData(aData, aUnitNum);
-	aMeGPU->SetData(aData, aUnitNum);
+	aGPU->SetData(aData, unitNum);
+	aMeGPU->SetData(aData, unitNum);

 	/* call Log function */
    _Log(aGPU, bGPU);
@@ -87,7 +88,7 @@ bool TestLog1()
    bUserGPU = Log(*aGPU);

 	/* check results */
-	gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);

 	/* destroy variables */
 	delete a;
@@ -96,7 +97,7 @@ bool TestLog1()
    delete aGPU;
    delete bGPU;
    delete aMeGPU;
-	delete[] aDimSize;
+	delete[] dimSize;

 	return cpuTest && gpuTest;
 #else
@@ -104,7 +105,7 @@ bool TestLog1()
 	delete a;
 	delete b;
 	delete aMe;
-	delete[] aDimSize;
+	delete[] dimSize;

 	return cpuTest;
 #endif // USE_CUDA

--- a/source/tensor/test/TLog.h
+++ b/source/tensor/test/TLog.h
@@ -22,8 +22,6 @@
 #ifndef __TEST_LOG_H__
 #define __TEST_LOG_H__

-#include "../core/math/Log.h"
-
 namespace nts { // namespace nts(NiuTrans.Tensor)

 /* test for Log Function */

--- a/source/tensor/test/TLogSoftmax.h
+++ b/source/tensor/test/TLogSoftmax.h
@@ -16,8 +16,8 @@
 */

 /*
-* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02
-*/
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02
+ */

 #ifndef __TEST_LOGSOFTMAX_H__
 #define __TEST_LOGSOFTMAX_H__

--- a/source/tensor/test/TMultiply.cpp
+++ b/source/tensor/test/TMultiply.cpp
@@ -26,132 +26,9 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
 /* 
 case 1: element-wise product of two tensors
 c(i) = a(i)*b(i) + \alpha * c(i)
-In this case, (2, 1)  (2, 1) -> (2, 1), leadingDim=0, alpha=0.
-*/
-bool TestMultiply1()
-{
-	/* a source tensor of size (2, 1) */
-	int sOrder1 = 2;
-	int * sDimSize1 = new int[sOrder1];
-	sDimSize1[0] = 2;
-	sDimSize1[1] = 1;
-
-	int sUnitNum1 = 1;
-	for (int i = 0; i < sOrder1; i++)
-		sUnitNum1 *= sDimSize1[i];
-
-	/* a source tensor of size (2, 1) */
-	int sOrder2 = 2;
-	int * sDimSize2 = new int[sOrder2];
-	sDimSize2[0] = 2;
-	sDimSize2[1] = 1;
-
-	int sUnitNum2 = 1;
-	for (int i = 0; i < sOrder2; i++)
-		sUnitNum2 *= sDimSize2[i];
-
-	/* a target tensor of size (2, 1) */
-	int tOrder = 2;
-	int * tDimSize = new int[tOrder];
-	tDimSize[0] = 2;
-	tDimSize[1] = 1;
-
-	int tUnitNum = 1;
-	for (int i = 0; i < tOrder; i++)
-		tUnitNum *= tDimSize[i];
-
-	DTYPE sData1[2][1] = { {0.0F}, 
-                           {1.0F} };
-	DTYPE sData2[2][1] = { {2.0F},
-                           {3.0F} };
-	DTYPE answer[2][1] = { {0.0F},
-                           {3.0F} };
-
-	/* CPU test */
-	bool cpuTest = true;
-
-	/* create tensors */
-	XTensor * s1 = NewTensor(sOrder1, sDimSize1);
-	XTensor * s2 = NewTensor(sOrder2, sDimSize2);
-	XTensor * t = NewTensor(tOrder, tDimSize);
-	XTensor * tMe = NewTensor(tOrder, tDimSize);
-    XTensor tUser;
-
-	/* initialize variables */
-	s1->SetData(sData1, sUnitNum1);
-	tMe->SetData(sData1, sUnitNum1);
-	s2->SetData(sData2, sUnitNum2);
-	t->SetZeroAll();
-
-	/* call Multiply function */
-	_Multiply(s1, s2, t, 0, 0);
-	_MultiplyMe(tMe, s2, 0, 0);
-    tUser = Multiply(*s1, *s2, 0);
-
-	/* check results */
-	cpuTest = t->CheckData(answer, tUnitNum) 
-        && tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum);
-
-#ifdef USE_CUDA
-	/* GPU test */
-	bool gpuTest = true;
-
-	/* create tensor */
-	XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
-	XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
-	XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
-	XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
-    XTensor tUserGPU;
-
-	/* Initialize variables */
-	sGPU1->SetData(sData1, sUnitNum1);
-	tMeGPU->SetData(sData1, sUnitNum1);
-	sGPU2->SetData(sData2, sUnitNum2);
-	tGPU->SetZeroAll();
-
-	/* call Multiply function */
-	_Multiply(sGPU1, sGPU2, tGPU, 0, 0);
-	_MultiplyMe(tMeGPU, sGPU2, 0, 0);
-    tUserGPU = Multiply(*sGPU1, *sGPU2, 0);
-
-	/* check results */
-	gpuTest = tGPU->CheckData(answer, tUnitNum)
-        && tMeGPU->CheckData(answer, tUnitNum) && tUserGPU.CheckData(answer, tUnitNum);
-
-	/* destroy variables */
-    delete s1;
-    delete s2;
-    delete t;
-    delete tMe;
-    delete sGPU1;
-    delete sGPU2;
-    delete tGPU;
-    delete tMeGPU;
-    delete[] sDimSize1;
-    delete[] sDimSize2;
-    delete[] tDimSize;
-
-	return cpuTest && gpuTest;
-#else
-    /* destroy variables */
-    delete s1;
-    delete s2;
-    delete t;
-    delete tMe;
-    delete[] sDimSize1;
-    delete[] sDimSize2;
-    delete[] tDimSize;
-
-	return cpuTest;
-#endif // USE_CUDA
-}
-
-/* 
-case 2: element-wise product of two tensors
-c(i) = a(i)*b(i) + \alpha * c(i)
 In this case, (2, 2)  (2, 2) -> (2, 2), leadingDim=0, alpha=0.
 */
-bool TestMultiply2()
+bool TestMultiply1()
 {
 	/* a source tensor of size (2, 2) */
 	int sOrder1 = 2;
@@ -212,8 +89,9 @@ bool TestMultiply2()
    tUser = Multiply(*s1, *s2, 0);

 	/* check results */
-	cpuTest = t->CheckData(answer, tUnitNum) 
-        && tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum);
+	cpuTest = t->CheckData(answer, tUnitNum) && 
+              tMe->CheckData(answer, tUnitNum) && 
+              tUser.CheckData(answer, tUnitNum);

 #ifdef USE_CUDA
 	/* GPU test */
@@ -270,113 +148,6 @@ bool TestMultiply2()
 #endif // USE_CUDA
 }

-/* 
-case 3: element-wise product of two tensors, c(i) = a(i)*b(i) + \alpha * c(i)
-In this case, (2, 2)  (2, 2) -> (2, 2), leadingDim=1, alpha=0.
-*/
-bool TestMultiply3()
-{
-	/* a source tensor of size (2, 2) */
-	int sOrder1 = 2;
-	int * sDimSize1 = new int[sOrder1];
-	sDimSize1[0] = 2;
-	sDimSize1[1] = 2;
-
-	int sUnitNum1 = 1;
-	for (int i = 0; i < sOrder1; i++)
-		sUnitNum1 *= sDimSize1[i];
-
-	/* a source tensor of size (2, 2) */
-	int sOrder2 = 2;
-	int * sDimSize2 = new int[sOrder2];
-	sDimSize2[0] = 2;
-	sDimSize2[1] = 2;
-
-	int sUnitNum2 = 1;
-	for (int i = 0; i < sOrder2; i++)
-		sUnitNum2 *= sDimSize2[i];
-
-	/* a target tensor of size (2, 2) */
-	int tOrder = 2;
-	int * tDimSize = new int[tOrder];
-	tDimSize[0] = 2;
-	tDimSize[1] = 2;
-
-	int tUnitNum = 1;
-	for (int i = 0; i < tOrder; i++)
-		tUnitNum *= tDimSize[i];
-
-	DTYPE sData1[2][2] = { {0.0F, 1.0F},
-	                       {2.0F, 3.0F} };
-	DTYPE sData2[2][2] = { {0.0F, 1.0F},
-	                       {2.0F, 3.0F} };
-	DTYPE answer[2][2] = { {0.0F, 1.0F},
-	                       {4.0F, 9.0F} };
-
-	/* CPU test */
-	bool cpuTest = true;
-
-	/* create tensors */
-	XTensor * s1 = NewTensor(sOrder1, sDimSize1);
-	XTensor * s2 = NewTensor(sOrder2, sDimSize2);
-	XTensor * t = NewTensor(tOrder, tDimSize);
-
-	/* initialize variables */
-	s1->SetData(sData1, sUnitNum1);
-	s2->SetData(sData2, sUnitNum2);
-	t->SetZeroAll();
-
-	/* call MultiplyElementWise function */
-	_Multiply(s1, s2, t, 0, 1);
-
-	/* check results */
-	cpuTest = t->CheckData(answer, tUnitNum);
-
-#ifdef USE_CUDA
-	/* GPU test */
-	bool gpuTest = true;
-
-	/* create tensor */
-	XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
-	XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
-	XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
-
-	/* Initialize variables */
-	sGPU1->SetData(sData1, sUnitNum1);
-	sGPU2->SetData(sData2, sUnitNum2);
-	tGPU->SetZeroAll();
-
-	/* call MultiplyElementWise function */
-	_Multiply(sGPU1, sGPU2, tGPU, 0, 1);
-
-	/* check results */
-	gpuTest = tGPU->CheckData(answer, tUnitNum);
-
-	/* destroy variables */
-    delete s1;
-    delete s2;
-    delete t;
-    delete sGPU1;
-    delete sGPU2;
-    delete tGPU;
-    delete[] sDimSize1;
-    delete[] sDimSize2;
-    delete[] tDimSize;
-
-	return cpuTest && gpuTest;
-#else
-    /* destroy variables */
-    delete s1;
-    delete s2;
-    delete t;
-    delete[] sDimSize1;
-    delete[] sDimSize2;
-    delete[] tDimSize;
-
-	return cpuTest;
-#endif // USE_CUDA
-}
-
 /* other cases */
 /*
 TODO!!
@@ -398,26 +169,6 @@ bool TestMultiply()
 	else
 		XPRINT(0, stdout, ">> case 1 passed!\n");

-	/* case 2 test */
-	caseFlag = TestMultiply2();
-
-	if (!caseFlag) {
-		returnFlag = false;
-		XPRINT(0, stdout, ">> case 2 failed!\n");
-	}
-	else
-		XPRINT(0, stdout, ">> case 2 passed!\n");
-
-	/* case 3 test */
-	caseFlag = TestMultiply3();
-
-	if (!caseFlag) {
-		returnFlag = false;
-		XPRINT(0, stdout, ">> case 3 failed!\n");
-	}
-	else
-		XPRINT(0, stdout, ">> case 3 passed!\n");
-
 	/* other cases test */
 	/*
 	TODO!!

--- a/source/tensor/test/TMultiply.h
+++ b/source/tensor/test/TMultiply.h
@@ -19,16 +19,17 @@
 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15
 */

-#ifndef __TEST_MULTIPLYELEMENTWISE_H__
-#define __TEST_MULTIPLYELEMENTWISE_H__
+#ifndef __TEST_MULTIPLY_H__
+#define __TEST_MULTIPLY_H__

 #include "../core/arithmetic/Multiply.h"

 namespace nts { // namespace nts(NiuTrans.Tensor)

-/* test for MultiplyElementWise Function */
+/* test for Multiply Function */
 extern "C"
 bool TestMultiply();

 } // namespace nts(NiuTrans.Tensor)
-#endif // __TEST_MULTIPLYELEMENTWISE_H__
+
+#endif // __TEST_MULTIPLY_H__
--- a/source/tensor/test/TSin.cpp
+++ b/source/tensor/test/TSin.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#include "../core/math/Unary.h"
+#include "TSin.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: test Sin function.
+Set every entry to its sine value.
+*/
+bool TestSin1()
+{
+	/* a tensor of size (3, 2) */
+	int order = 2;
+	int * dimSize = new int[order];
+	dimSize[0] = 3;
+	dimSize[1] = 2;
+
+	int unitNum = 1;
+	for (int i = 0; i < order; i++)
+		unitNum *= dimSize[i];
+
+	DTYPE aData[3][2] = { {1.0F, 2.0F}, 
+	                      {-1.0F, -2.0F},
+	                      {0.0F, 0.5F} };
+	DTYPE answer[3][2] = { {0.8415F, 0.9093F},
+	                       {-0.8415F, -0.9093F},
+	                       {0.0F, 0.4794F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+	XTensor * aMe = NewTensor(order, dimSize);
+    XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, unitNum);
+	aMe->SetData(aData, unitNum);
+
+	/* call Sin function */
+	_Sin(a, b);
+	_SinMe(aMe);
+    bUser = Sin(*a);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
+    
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, unitNum);
+	aMeGPU->SetData(aData, unitNum);
+
+	/* call Sin function */
+    _Sin(aGPU, bGPU);
+	_SinMe(aMeGPU);
+    bUserGPU = Sin(*aGPU);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+    delete aGPU;
+    delete bGPU;
+    delete aMeGPU;
+	delete[] dimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+	delete[] dimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Sin Function */
+bool TestSin()
+{
+	XPRINT(0, stdout, "[TEST Sin] set every entry to its sine value \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestSin1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> Failed!\n");
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
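Review note: the Sin test above compares results against expected values rounded to four decimals, using a tolerance of 1e-4. Below is a minimal standalone sketch of that kind of check. `allClose` and `sinRef` are hypothetical helpers written for illustration; they are not the library's `CheckData` or `Sin`, whose exact comparison rule is not shown in this diff.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the library's CheckData): returns true when every
// element of `result` is within `tolerance` of the matching element of `expected`.
bool allClose(const std::vector<float>& result,
              const std::vector<float>& expected,
              float tolerance)
{
    if (result.size() != expected.size())
        return false;
    for (std::size_t i = 0; i < result.size(); i++)
        if (std::fabs(result[i] - expected[i]) > tolerance)
            return false;
    return true;
}

// Hypothetical reference: apply std::sin to every element.
std::vector<float> sinRef(const std::vector<float>& a)
{
    std::vector<float> b(a.size());
    for (std::size_t i = 0; i < a.size(); i++)
        b[i] = std::sin(a[i]);
    return b;
}
```

With the test's input and expected values, the comparison succeeds at 1e-4 but would fail at a much tighter tolerance such as 1e-6, because the expected values are rounded.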
--- a/source/tensor/core/math/Log.cuh
+++ b/source/tensor/test/TSin.h
@@ -16,31 +16,16 @@
 */

 /*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
 */

-#ifndef __LOG_CUH__
-#define __LOG_CUH__
-
-#include "Log.h"
+#ifndef __TEST_SIN_H__
+#define __TEST_SIN_H__

 namespace nts { // namespace nts(NiuTrans.Tensor)

-#ifdef USE_CUDA
-
-/* set each entry to its log value (CUDA Kernel) */
-__global__
-void KernelLog(DTYPE * a, DTYPE * b, int size);
-
-/* set each entry to its log value (CUDA Kernel) with float16 data type*/
-__global__
-void KernelLog(__half * a, __half * b, int size);
-
-/* set each entry to its log value */
-void _CudaLog(const XTensor * a, XTensor * b);
-
-#endif // USE_CUDA
+/* test for Sin Function */
+bool TestSin();

 } // namespace nts(NiuTrans.Tensor)
-
-#endif // __LOG_CUH__
\ No newline at end of file
+#endif // __TEST_SIN_H__
--- a/source/tensor/test/TSub.cpp
+++ b/source/tensor/test/TSub.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#include "TSub.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* case 1: tensor subtraction c = a - b * \beta, where \beta is not passed (the expected values correspond to \beta = 1) */
+bool TestSub1()
+{
+    /* a tensor of size (2, 4) */
+    int order = 2;
+    int * dimSize = new int[order];
+    dimSize[0] = 2;
+    dimSize[1] = 4;
+
+    int unitNum = 1;
+    for (int i = 0; i < order; i++)
+        unitNum *= dimSize[i];
+
+    DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
+                          {4.0F, 5.0F, 6.0F, 7.0F} };
+    DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F}, 
+                          {-7.0F, -9.0F, -11.0F, -13.0F} };
+    DTYPE answer[2][4] = { {-1.0F, 2.0F, 5.0F, 8.0F},
+                           {11.0F, 14.0F, 17.0F, 20.0F} };
+
+    /* CPU test */
+    bool cpuTest = true;
+
+    /* create tensors */
+    XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+    XTensor * c = NewTensor(order, dimSize);
+    XTensor * cMe = NewTensor(order, dimSize);
+    XTensor cUser;
+
+    /* initialize variables */
+    a->SetData(aData, unitNum);
+    cMe->SetData(aData, unitNum);
+    b->SetData(bData, unitNum);
+    c->SetZeroAll();
+
+    /* call Sub function */
+    _Sub(a, b, c);
+    _SubMe(cMe, b);
+    cUser = Sub(*a, *b);
+
+    /* check results */
+    cpuTest = c->CheckData(answer, unitNum)
+              && cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
+
+#ifdef USE_CUDA
+    /* GPU test */
+    bool gpuTest = true;
+
+    /* create tensor */
+    XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor cUserGPU;
+
+    /* Initialize variables */
+    aGPU->SetData(aData, unitNum);
+    cMeGPU->SetData(aData, unitNum);
+    bGPU->SetData(bData, unitNum);
+    cGPU->SetZeroAll();
+
+    /* call Sub function */
+    _Sub(aGPU, bGPU, cGPU);
+    _SubMe(cMeGPU, bGPU);
+    cUserGPU = Sub(*aGPU, *bGPU);
+
+    /* check results */
+    gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
+              && cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
+    
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete aGPU;
+    delete bGPU;
+    delete cGPU;
+    delete cMeGPU;
+    delete[] dimSize;
+
+    return cpuTest && gpuTest;
+#else
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete[] dimSize;
+
+    return cpuTest;
+#endif // USE_CUDA
+}
+
+/* case 2: tensor subtraction c = a - b * \beta, with \beta = 0.5 */
+bool TestSub2()
+{
+    /* a tensor of size (2, 4) */
+    int order = 2;
+    int * dimSize = new int[order];
+    dimSize[0] = 2;
+    dimSize[1] = 4;
+
+    int unitNum = 1;
+    for (int i = 0; i < order; i++) {
+        unitNum *= dimSize[i];
+    }
+    DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
+                          {4.0F, 5.0F, 6.0F, 7.0F} };
+    DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F}, 
+                          {-7.0F, -9.0F, -11.0F, -13.0F} };
+    DTYPE answer[2][4] = { {-0.5F, 1.5F, 3.5F, 5.5F},
+                           {7.5F, 9.5F, 11.5F, 13.5F} };
+    float beta = 0.5F;
+
+    /* CPU test */
+    bool cpuTest = true;
+
+    /* create tensor */
+    XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+    XTensor * c = NewTensor(order, dimSize);
+    XTensor * cMe = NewTensor(order, dimSize);
+    XTensor cUser;
+
+    /* initialize variables */
+    a->SetData(aData, unitNum);
+    cMe->SetData(aData, unitNum);
+    b->SetData(bData, unitNum);
+    c->SetZeroAll();
+
+    /* call Sub function */
+    _Sub(a, b, c, beta);
+    _SubMe(cMe, b, beta);
+    cUser = Sub(*a, *b, beta);
+
+    /* check results */
+    cpuTest = c->CheckData(answer, unitNum)
+              && cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
+
+#ifdef USE_CUDA
+    /* GPU test */
+    bool gpuTest = true;
+
+    /* create tensor */
+    XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor cUserGPU;
+
+    /* Initialize variables */
+    aGPU->SetData(aData, unitNum);
+    cMeGPU->SetData(aData, unitNum);
+    bGPU->SetData(bData, unitNum);
+    cGPU->SetZeroAll();
+
+    /* call Sub function */
+    _Sub(aGPU, bGPU, cGPU, beta);
+    _SubMe(cMeGPU, bGPU, beta);
+    cUserGPU = Sub(*aGPU, *bGPU, beta);
+
+    /* check results */
+    gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
+              && cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
+
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete aGPU;
+    delete bGPU;
+    delete cGPU;
+    delete cMeGPU;
+    delete[] dimSize;
+
+    return cpuTest && gpuTest;
+#else
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete[] dimSize;
+
+    return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+    TODO!!
+*/
+
+/* test for Sub Function */
+bool TestSub()
+{
+    XPRINT(0, stdout, "[TEST SUB] tensor subtraction c = a - b * beta\n");
+    bool returnFlag = true, caseFlag = true;
+
+    /* case 1 test */
+    caseFlag = TestSub1();
+    if (!caseFlag) {
+        returnFlag = false;
+        XPRINT(0, stdout, ">> case 1 failed!\n");
+    }
+    else
+        XPRINT(0, stdout, ">> case 1 passed!\n");
+
+    /* case 2 test */
+    caseFlag = TestSub2();
+    if (!caseFlag) {
+        returnFlag = false;
+        XPRINT(0, stdout, ">> case 2 failed!\n");
+    }
+    else
+        XPRINT(0, stdout, ">> case 2 passed!\n");
+
+    /* other cases test */
+    /*
+        TODO!!
+    */
+
+    if (returnFlag) {
+        XPRINT(0, stdout, ">> All Passed!\n");
+    }
+    else
+        XPRINT(0, stdout, ">> Failed!\n");
+
+    XPRINT(0, stdout, "\n");
+
+    return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
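Review note: both Sub cases compute c = a - b * beta element-wise. The expected arrays can be cross-checked with a minimal standalone sketch. `subRef` is a hypothetical reference helper written for illustration, not a NiuTrans.Tensor function, and the default of 1 for `beta` is an assumption inferred from the expected values in case 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of NiuTrans.Tensor):
// element-wise c = a - b * beta for two tensors with the same number of units.
// The default beta = 1 is an assumption inferred from case 1 of the test.
std::vector<float> subRef(const std::vector<float>& a,
                          const std::vector<float>& b,
                          float beta = 1.0f)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); i++)
        c[i] = a[i] - b[i] * beta;
    return c;
}
```

All values in the two test cases are exactly representable as floats, so the CPU checks can reasonably use the default tolerance of `CheckData`.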
--- a/source/tensor/test/TSub.h
+++ b/source/tensor/test/TSub.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+ * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
+ */
+
+#ifndef __TEST_SUB_H__
+#define __TEST_SUB_H__
+
+#include "../core/arithmetic/Sub.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Sub Function */
+bool TestSub();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_SUB_H__
--- a/source/tensor/test/TSum.cpp
+++ b/source/tensor/test/TSum.cpp
@@ -16,8 +16,8 @@
 */

 /*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
-*/
+ * $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
+ */

 #include "TSum.h"

@@ -59,7 +59,7 @@ bool TestSum1()
    b->SetData(bData, unitNum);
    c->SetZeroAll();

-    /* call sum function */
+    /* call Sum function */
    _Sum(a, b, c);
    _SumMe(cMe, b);
    cUser = Sum(*a, *b);
@@ -85,7 +85,7 @@ bool TestSum1()
    bGPU->SetData(bData, unitNum);
    cGPU->SetZeroAll();

-    /* call sum function */
+    /* call Sum function */
    _Sum(aGPU, bGPU, cGPU);
    _SumMe(cMeGPU, bGPU);
    cUserGPU = Sum(*aGPU, *bGPU);
@@ -155,7 +155,7 @@ bool TestSum2()
    b->SetData(bData, unitNum);
    c->SetZeroAll();

-    /* call sum function */
+    /* call Sum function */
    _Sum(a, b, c, beta);
    _SumMe(cMe, b, beta);
    cUser = Sum(*a, *b, beta);
@@ -181,7 +181,7 @@ bool TestSum2()
    bGPU->SetData(bData, unitNum);
    cGPU->SetZeroAll();

-    /* call sum function */
+    /* call Sum function */
    _Sum(aGPU, bGPU, cGPU, beta);
    _SumMe(cMeGPU, bGPU, beta);
    cUserGPU = Sum(*aGPU, *bGPU, beta);

--- a/source/tensor/test/TSum.h
+++ b/source/tensor/test/TSum.h
@@ -16,8 +16,8 @@
 */

 /*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
-*/
+ * $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
+ */

 #ifndef __TEST_SUM_H__
 #define __TEST_SUM_H__

--- a/source/tensor/test/TSumDim.cpp
+++ b/source/tensor/test/TSumDim.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
+*/
+
+#include "TSumDim.h"
+#include "../core/arithmetic/SumDim.h"
+#include "../XTensor.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: tensor summation c = a + b * \beta
+where b is a vector of size 2, equal to the size of dimension n = 0 of a,
+i.e., b is broadcast along dimension 0 and added to a
+(\beta is not passed; the expected values correspond to \beta = 1)
+*/
+bool TestSumDim1()
+{
+    /* a tensor of size (2, 4) */
+    int aOrder = 2;
+    int * aDimSize = new int[aOrder];
+    aDimSize[0] = 2;
+    aDimSize[1] = 4;
+
+    int aUnitNum = 1;
+    for (int i = 0; i < aOrder; i++)
+        aUnitNum *= aDimSize[i];
+
+    /* a tensor of size (2) */
+    int bOrder = 1;
+    int * bDimSize = new int[bOrder];
+    bDimSize[0] = 2;
+
+    int bUnitNum = 1;
+    for (int i = 0; i < bOrder; i++)
+        bUnitNum *= bDimSize[i];
+
+    DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
+                          {4.0F, 5.0F, 6.0F, 7.0F} };
+    DTYPE bData[2] = {1.0F, -1.0F};
+    DTYPE answer[2][4] = { {1.0F, 2.0F, 3.0F, 4.0F},
+                           {3.0F, 4.0F, 5.0F, 6.0F} };
+
+    /* CPU test */
+    bool cpuTest = true;
+
+    /* create tensors */
+    XTensor * a = NewTensor(aOrder, aDimSize);
+    XTensor * b = NewTensor(bOrder, bDimSize);
+    XTensor * c = NewTensor(aOrder, aDimSize);
+    XTensor * cMe = NewTensor(aOrder, aDimSize);
+    XTensor cUser;
+
+    /* initialize variables */
+    a->SetData(aData, aUnitNum);
+    cMe->SetData(aData, aUnitNum);
+    b->SetData(bData, bUnitNum);
+    c->SetZeroAll();
+
+    /* call SumDim function */
+    _SumDim(a, b, c, 0);
+    _SumDim(cMe, b, 0);
+    cUser = SumDim(*a, *b, 0);
+
+    /* check results */
+    cpuTest = c->CheckData(answer, aUnitNum)
+              && cMe->CheckData(answer, aUnitNum) 
+              && cUser.CheckData(answer, aUnitNum);
+
+#ifdef USE_CUDA
+    /* GPU test */
+    bool gpuTest = true;
+
+    /* create tensor */
+    XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+    XTensor cUserGPU;
+
+    /* Initialize variables */
+    aGPU->SetData(aData, aUnitNum);
+    cMeGPU->SetData(aData, aUnitNum);
+    bGPU->SetData(bData, bUnitNum);
+    cGPU->SetZeroAll();
+
+    /* call SumDim function */
+    _SumDim(aGPU, bGPU, cGPU, 0);
+    _SumDim(cMeGPU, bGPU, 0);
+    cUserGPU = SumDim(*aGPU, *bGPU, 0);
+
+    /* check results */
+    gpuTest = cGPU->CheckData(answer, aUnitNum)
+              && cMeGPU->CheckData(answer, aUnitNum) 
+              && cUserGPU.CheckData(answer, aUnitNum);
+
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete aGPU;
+    delete bGPU;
+    delete cGPU;
+    delete cMeGPU;
+    delete[] aDimSize;
+    delete[] bDimSize;
+
+    return cpuTest && gpuTest;
+#else
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete[] aDimSize;
+    delete[] bDimSize;
+
+    return cpuTest;
+#endif // USE_CUDA
+}
+
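+/*
+case 2: tensor summation c = a + b * \beta
+where b is a (2, 2) tensor whose number of units (4) equals the size of dimension n = 1 of a,
+i.e., the units of b are broadcast along dimension 1 and added to a
+(\beta is not passed; the expected values correspond to \beta = 1)
+*/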
+bool TestSumDim2()
+{
+    /* a tensor of size (2, 4) */
+    int aOrder = 2;
+    int * aDimSize = new int[aOrder];
+    aDimSize[0] = 2;
+    aDimSize[1] = 4;
+
+    int aUnitNum = 1;
+    for (int i = 0; i < aOrder; i++)
+        aUnitNum *= aDimSize[i];
+
+    /* a tensor of size (2, 2) */
+    int bOrder = 2;
+    int * bDimSize = new int[bOrder];
+    bDimSize[0] = 2;
+    bDimSize[1] = 2;
+
+    int bUnitNum = 1;
+    for (int i = 0; i < bOrder; i++)
+        bUnitNum *= bDimSize[i];
+
+    DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
+                          {4.0F, 5.0F, 6.0F, 7.0F} };
+    DTYPE bData[2][2] = { {1.0F, -1.0F},
+                          {-1.0F, 1.0F} };
+    DTYPE answer[2][4] = { {1.0F, 0.0F, 1.0F, 4.0F},
+                           {5.0F, 4.0F, 5.0F, 8.0F} };
+
+    /* CPU test */
+    bool cpuTest = true;
+
+    /* create tensors */
+    XTensor * a = NewTensor(aOrder, aDimSize);
+    XTensor * b = NewTensor(bOrder, bDimSize);
+    XTensor * c = NewTensor(aOrder, aDimSize);
+    XTensor * cMe = NewTensor(aOrder, aDimSize);
+    XTensor cUser;
+
+    /* initialize variables */
+    a->SetData(aData, aUnitNum);
+    cMe->SetData(aData, aUnitNum);
+    b->SetData(bData, bUnitNum);
+    c->SetZeroAll();
+
+    /* call SumDim function */
+    _SumDim(a, b, c, 1);
+    _SumDim(cMe, b, 1);
+    cUser = SumDim(*a, *b, 1);
+
+    /* check results */
+    cpuTest = c->CheckData(answer, aUnitNum)
+              && cMe->CheckData(answer, aUnitNum) 
+              && cUser.CheckData(answer, aUnitNum);
+
+#ifdef USE_CUDA
+    /* GPU test */
+    bool gpuTest = true;
+
+    /* create tensor */
+    XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+    XTensor cUserGPU;
+
+    /* Initialize variables */
+    aGPU->SetData(aData, aUnitNum);
+    cMeGPU->SetData(aData, aUnitNum);
+    bGPU->SetData(bData, bUnitNum);
+    cGPU->SetZeroAll();
+
+    /* call SumDim function */
+    _SumDim(aGPU, bGPU, cGPU, 1);
+    _SumDim(cMeGPU, bGPU, 1);
+    cUserGPU = SumDim(*aGPU, *bGPU, 1);
+
+    /* check results */
+    gpuTest = cGPU->CheckData(answer, aUnitNum)
+              && cMeGPU->CheckData(answer, aUnitNum) 
+              && cUserGPU.CheckData(answer, aUnitNum);
+
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete aGPU;
+    delete bGPU;
+    delete cGPU;
+    delete cMeGPU;
+    delete[] aDimSize;
+    delete[] bDimSize;
+
+    return cpuTest && gpuTest;
+#else
+    /* destroy variables */
+    delete a;
+    delete b;
+    delete c;
+    delete cMe;
+    delete[] aDimSize;
+    delete[] bDimSize;
+
+    return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+    TODO!!
+*/
+
+/* test for SumDim Function */
+bool TestSumDim()
+{
+    XPRINT(0, stdout, "[TEST SUMDIM] tensor summation c = a + b * beta by broadcasting\n");
+    bool returnFlag = true, caseFlag = true;
+
+    /* case 1 test */
+    caseFlag = TestSumDim1();
+    if (!caseFlag) {
+        returnFlag = false;
+        XPRINT(0, stdout, ">> case 1 failed!\n");
+    }
+    else
+        XPRINT(0, stdout, ">> case 1 passed!\n");
+
+    /* case 2 test */
+    caseFlag = TestSumDim2();
+    if (!caseFlag) {
+        returnFlag = false;
+        XPRINT(0, stdout, ">> case 2 failed!\n");
+    }
+    else
+        XPRINT(0, stdout, ">> case 2 passed!\n");
+
+    /* other cases test */
+    /*
+        TODO!!
+    */
+
+    if (returnFlag) {
+        XPRINT(0, stdout, ">> All Passed!\n");
+    }
+    else
+        XPRINT(0, stdout, ">> Failed!\n");
+
+    XPRINT(0, stdout, "\n");
+
+    return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
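Review note: the two SumDim cases rely on broadcasting b along dimension n of a. The sketch below illustrates that semantics on flat arrays, assuming row-major storage. `sumDimRef` is a hypothetical reference helper written for illustration, not a NiuTrans.Tensor function; b is passed as a flat list of `dims[n]` values, and the default of 1 for `beta` is an assumption inferred from the expected values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not part of NiuTrans.Tensor):
// c = a + b * beta, where b holds dims[n] values that are broadcast
// along dimension n of the row-major tensor a.
std::vector<float> sumDimRef(const std::vector<float>& a,
                             const std::vector<int>& dims,
                             const std::vector<float>& b,
                             int n, float beta = 1.0f)
{
    assert(n >= 0 && n < static_cast<int>(dims.size()));
    assert(b.size() == static_cast<std::size_t>(dims[n]));

    // stride = number of units covered by one step along dimension n
    std::size_t stride = 1;
    for (std::size_t i = static_cast<std::size_t>(n) + 1; i < dims.size(); i++)
        stride *= static_cast<std::size_t>(dims[i]);

    std::vector<float> c(a.size());
    for (std::size_t k = 0; k < a.size(); k++) {
        // index of unit k along dimension n
        std::size_t idx = (k / stride) % static_cast<std::size_t>(dims[n]);
        c[k] = a[k] + b[idx] * beta;
    }
    return c;
}
```

For case 1 (n = 0) each row of a receives one value of b; for case 2 (n = 1) the four units of b are added to every row. Both reproduce the `answer` arrays in the tests.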
--- a/source/tensor/test/TSumDim.h
+++ b/source/tensor/test/TSumDim.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
+* I finish my summer holidays and go back to study.
+*/
+
+#ifndef __TEST_SUMDIM_H__
+#define __TEST_SUMDIM_H__
+
+#include "../core/arithmetic/SumDim.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for SumDim Function */
+extern "C"
+bool TestSumDim();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_SUMDIM_H__
--- a/source/tensor/test/TTan.cpp
+++ b/source/tensor/test/TTan.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#include "../core/math/Unary.h"
+#include "TTan.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: test Tan function.
+Set every entry to its tangent value.
+*/
+bool TestTan1()
+{
+	/* a tensor of size (3, 2) */
+	int order = 2;
+	int * dimSize = new int[order];
+	dimSize[0] = 3;
+	dimSize[1] = 2;
+
+	int unitNum = 1;
+	for (int i = 0; i < order; i++)
+		unitNum *= dimSize[i];
+
+	DTYPE aData[3][2] = { {1.0F, 2.0F}, 
+	                      {-1.0F, -2.0F},
+	                      {0.0F, 0.5F} };
+	DTYPE answer[3][2] = { {1.5574F, -2.1850F},
+	                       {-1.5574F, 2.1850F},
+	                       {0.0F, 0.5463F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+	XTensor * aMe = NewTensor(order, dimSize);
+    XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, unitNum);
+	aMe->SetData(aData, unitNum);
+
+	/* call Tan function */
+	_Tan(a, b);
+	_TanMe(aMe);
+    bUser = Tan(*a);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
+    
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, unitNum);
+	aMeGPU->SetData(aData, unitNum);
+
+	/* call Tan function */
+    _Tan(aGPU, bGPU);
+	_TanMe(aMeGPU);
+    bUserGPU = Tan(*aGPU);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+    delete aGPU;
+    delete bGPU;
+    delete aMeGPU;
+	delete[] dimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+	delete[] dimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Tan Function */
+bool TestTan()
+{
+	XPRINT(0, stdout, "[TEST Tan] set every entry to its tangent value \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestTan1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> Failed!\n");
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/test/TTan.h
+++ b/source/tensor/test/TTan.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#ifndef __TEST_TAN_H__
+#define __TEST_TAN_H__
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Tan Function */
+bool TestTan();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_TAN_H__
--- a/source/tensor/test/TTranspose.cpp
+++ b/source/tensor/test/TTranspose.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
+*/
+
+#include "TTranspose.h"
+#include "../core/movement/CopyValues.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: test Transpose function.
+transpose dimensions 0 and 1 of a (3, 2) tensor, giving a (2, 3) tensor
+*/
+bool TestTranspose1()
+{
+	/* a tensor of size (3, 2) */
+	int aOrder = 2;
+	int * aDimSize = new int[aOrder];
+	aDimSize[0] = 3;
+	aDimSize[1] = 2;
+
+	int aUnitNum = 1;
+	for (int i = 0; i < aOrder; i++)
+		aUnitNum *= aDimSize[i];
+
+    /* a tensor of size (2, 3) */
+	int bOrder = 2;
+	int * bDimSize = new int[bOrder];
+	bDimSize[0] = 2;
+	bDimSize[1] = 3;
+
+	int bUnitNum = 1;
+	for (int i = 0; i < bOrder; i++)
+		bUnitNum *= bDimSize[i];
+
+	DTYPE aData[3][2] = { {1.0F, 2.0F}, 
+	                      {3.0F, 4.0F},
+	                      {5.0F, 6.0F} };
+	DTYPE answer[2][3] = { {1.0F, 3.0F, 5.0F},
+	                       {2.0F, 4.0F, 6.0F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(aOrder, aDimSize);
+	XTensor * b = NewTensor(bOrder, bDimSize);
+    XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, aUnitNum);
+
+	/* call Transpose function */
+    _Transpose(a, b, 0, 1);
+    bUser = Transpose(*a, 0, 1);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
+              && bUser.CheckData(answer, aUnitNum, 1e-4F);
+
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
+    XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, aUnitNum);
+
+	/* call Transpose function */
+    _Transpose(aGPU, bGPU, 0, 1);
+    bUserGPU = Transpose(*aGPU, 0, 1);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
+              && bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+    delete aGPU;
+    delete bGPU;
+	delete[] aDimSize;
+	delete[] bDimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete[] aDimSize;
+	delete[] bDimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/*
+case 2: test Transpose function.
+transpose dimensions 0 and 2 of a (4, 3, 2) tensor, giving a (2, 3, 4) tensor
+*/
+bool TestTranspose2()
+{
+	/* a tensor of size (4, 3, 2) */
+	int aOrder = 3;
+	int * aDimSize = new int[aOrder];
+	aDimSize[0] = 4;
+	aDimSize[1] = 3;
+	aDimSize[2] = 2;
+
+	int aUnitNum = 1;
+	for (int i = 0; i < aOrder; i++)
+		aUnitNum *= aDimSize[i];
+
+    /* a tensor of size (2, 3, 4) */
+	int bOrder = 3;
+	int * bDimSize = new int[bOrder];
+	bDimSize[0] = 2;
+	bDimSize[1] = 3;
+	bDimSize[2] = 4;
+
+	int bUnitNum = 1;
+	for (int i = 0; i < bOrder; i++)
+		bUnitNum *= bDimSize[i];
+
+	DTYPE aData[4][3][2] = { { {1.0F, 2.0F}, 
+	                           {3.0F, 4.0F},
+	                           {5.0F, 6.0F} },
+                             { {2.0F, 4.0F}, 
+	                           {4.0F, 7.0F},
+	                           {6.0F, 8.0F} },
+                             { {1.0F, 2.0F}, 
+	                           {3.0F, 4.0F},
+	                           {5.0F, 6.0F} },
+                             { {2.0F, 4.0F}, 
+	                           {4.0F, 7.0F},
+	                           {6.0F, 8.0F} },};
+	DTYPE answer[2][3][4] = { { {1.0F, 2.0F, 1.0F, 2.0F},
+                                {2.0F, 4.0F, 2.0F, 4.0F},
+                                {3.0F, 4.0F, 3.0F, 4.0F} },
+                              { {4.0F, 7.0F, 4.0F, 7.0F},
+                                {5.0F, 6.0F, 5.0F, 6.0F},
+                                {6.0F, 8.0F, 6.0F, 8.0F} } };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(aOrder, aDimSize);
+	XTensor * b = NewTensor(bOrder, bDimSize);
+    XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, aUnitNum);
+
+	/* call Transpose function */
+    _Transpose(a, b, 0, 2);
+    bUser = Transpose(*a, 0, 2);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
+              && bUser.CheckData(answer, aUnitNum, 1e-4F);
+
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
+    XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, aUnitNum);
+
+	/* call Transpose function */
+    _Transpose(aGPU, bGPU, 0, 2);
+    bUserGPU = Transpose(*aGPU, 0, 2);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
+              && bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+    delete aGPU;
+    delete bGPU;
+	delete[] aDimSize;
+	delete[] bDimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete[] aDimSize;
+	delete[] bDimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Transpose Function */
+bool TestTranspose()
+{
+	XPRINT(0, stdout, "[TEST TRANSPOSE] tensor transposition with specified dimensions \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestTranspose1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else {
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+	}
+
+	/* case 2 test */
+	caseFlag = TestTranspose2();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 2 failed!\n");
+	}
+	else {
+		XPRINT(0, stdout, ">> case 2 passed!\n");
+	}
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else {
+		XPRINT(0, stdout, ">> Failed!\n");
+	}
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/test/TTranspose.h
+++ b/source/tensor/test/TTranspose.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
+*/
+
+#ifndef __TEST_TRANSPOSE_H__
+#define __TEST_TRANSPOSE_H__
+
+#include "../core/shape/Transpose.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Transpose Function */
+bool TestTranspose();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_TRANSPOSE_H__
--- a/source/tensor/test/Test.cpp
+++ b/source/tensor/test/Test.cpp
@@ -32,9 +32,12 @@ bool Test()
    wrong = !TestAbsolute() || wrong;
    wrong = !TestConcatenate() || wrong;
    wrong = !TestConcatenateSolely() || wrong;
+    wrong = !TestCos() || wrong;
    wrong = !TestConvertDataType() || wrong;
    wrong = !TestCopyIndexed() || wrong;
    wrong = !TestCopyValues() || wrong;
+    wrong = !TestDiv() || wrong;
+    wrong = !TestExp() || wrong;
    wrong = !TestLog() || wrong;
    wrong = !TestMatrixMul() || wrong;
    wrong = !TestMatrixMul2D() || wrong;
@@ -55,11 +58,16 @@ bool Test()
    wrong = !TestSetAscendingOrder() || wrong;
    wrong = !TestSetData() || wrong;
    wrong = !TestSign() || wrong;
+    wrong = !TestSin() || wrong;
    wrong = !TestSort() || wrong;
    wrong = !TestSplit() || wrong;
+    wrong = !TestSub() || wrong;
    wrong = !TestSum() || wrong;
    wrong = !TestSumByColumnTV() || wrong;
    wrong = !TestSumByColumnVT() || wrong;
+    wrong = !TestSumDim() || wrong;
+    wrong = !TestTan() || wrong;
+    wrong = !TestTranspose() || wrong;
    wrong = !TestTopK() || wrong;
    wrong = !TestUnsqueeze() || wrong;
    wrong = !TestXMem() || wrong;
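
A note on the `wrong = !TestX() || wrong;` lines in `Test()`: the test call is the left operand of `||`, so it is evaluated even when an earlier suite has already failed, and every suite gets to print its result. The sketch below illustrates that operand order with hypothetical stand-ins. `PassingCase`, `FailingCase`, `RunAll`, `RunAllShortCircuit` and the `callCount` counter are assumptions made for illustration and are not part of NiuTrans.Tensor.

```cpp
#include <cassert>
#include <cstdio>

namespace demo { // hypothetical namespace, not part of NiuTrans.Tensor

/* counts how many test cases were actually executed */
int callCount = 0;

/* stand-ins for real test suites such as TestTranspose() */
bool PassingCase() { ++callCount; return true; }
bool FailingCase() { ++callCount; return false; }

/* mirrors the pattern used in Test(): the call is the left operand of ||,
   so it always runs and a failure is remembered in "wrong" */
bool RunAll()
{
    bool wrong = false;
    wrong = !FailingCase() || wrong;
    wrong = !PassingCase() || wrong;
    wrong = !PassingCase() || wrong;
    return !wrong;
}

/* counter-example: with the operands swapped, || short-circuits once
   "wrong" is true and the remaining cases are never executed */
bool RunAllShortCircuit()
{
    bool wrong = false;
    wrong = wrong || !FailingCase();
    wrong = wrong || !PassingCase();
    wrong = wrong || !PassingCase();
    return !wrong;
}

} // namespace demo
```

Calling `demo::RunAll()` executes all three cases and reports failure, while `demo::RunAllShortCircuit()` also reports failure but stops executing cases after the first failing one. Keeping the call on the left is what lets the suite print a pass/fail line for every operation in a single run.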

--- a/source/tensor/test/Test.h
+++ b/source/tensor/test/Test.h
@@ -25,9 +25,12 @@
 #include "TAbsolute.h"
 #include "TConcatenate.h"
 #include "TConcatenateSolely.h"
+#include "TCos.h"
 #include "TConvertDataType.h"
 #include "TCopyIndexed.h"
 #include "TCopyValues.h"
+#include "TDiv.h"
+#include "TExp.h"
 #include "TLog.h"
 #include "TMatrixMul.h"
 #include "TMatrixMul2D.h"
@@ -48,11 +51,16 @@
 #include "TSetAscendingOrder.h"
 #include "TSetData.h"
 #include "TSign.h"
+#include "TSin.h"
 #include "TSort.h"
 #include "TSplit.h"
+#include "TSub.h"
 #include "TSum.h"
 #include "TSumByColumnTV.h"
 #include "TSumByColumnVT.h"
+#include "TSumDim.h"
+#include "TTan.h"
+#include "TTranspose.h"
 #include "TTopK.h"
 #include "TUnsqueeze.h"
 #include "TXMem.h"