Merge with li branch

f8a37184 · liyinqiao · 0ca350a3 · 9b11391e · f8a37184 · f8a37184
Commit f8a37184 authored Jul 12, 2018 by liyinqiao
--- a/doc/manual.md
+++ b/doc/manual.md
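 # NiuTrans.Tensor: A Tensor Computation Library
 ## NiuTrans.Tensor
 NiuTrans.Tensor is a toolkit developed by the NiuTrans open-source project. It provides complete tensor definition and computation facilities and can be used for deep learning research and for building industrial systems. NiuTrans.Tensor has the following features:
 * Small and simple, easy to modify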
@@ -12,38 +11,52 @@ NiuTrans.Tensor
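 ## Installing NiuTrans.Tensor
+Before creating a project that uses the NiuTrans.Tensor toolkit, note the following:
+* For projects that run on the CPU, the system supports high-performance math libraries; installing [Intel® MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/) is recommended.
+* For projects that run on the GPU, install the [NVIDIA®CUDA®Toolkit](https://developer.nvidia.com/cuda-downloads), version 9.0 or later. The CUDA Toolkit provides the development environment for building high-performance GPU-accelerated applications.
+To use the NiuTrans.Tensor toolkit developed by the NiuTrans open-source project:
+* First, include the NiuTrans.Tensor source code in your project
+* Include three header files: XTensor.h, CHeader.h in core, and FHeader.h in function:
+    * XTensor.h provides the XTensor class that we operate on
+    * CHeader.h in core provides tensor operations
+    * FHeader.h in function provides activation functions
+* Use the namespace nts in your project
 ## What is a tensor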
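-In computer science, a tensor is usually defined as a quantity in $n$-dimensional space that has $n$ components; such a quantity is essentially a multidimensional array. The order (or rank) of a tensor is the number of dimensions of this array or, put simply, the number of indices needed to address each element of the tensor. Typically, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space, an order-1 tensor is the vector $(x,y,z)$ represented by a point in the space, where $x$, $y$ and $z$ are the coordinates of the point on the three axes.
+In computer science, a tensor is usually defined as a quantity in \\(n\\)-dimensional space that has \\(n\\) components; such a tensor is essentially a multidimensional array. The order (or rank) of a tensor is the number of dimensions of this array or, put simply, the number of indices needed to address each element of the tensor. Typically, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space, an order-1 tensor is the vector \\((x,y,z)\\) represented by a point in the space, where \\(x\\), \\(y\\) and \\(z\\) are the coordinates of the point on the three axes.
-Tensors are an efficient mathematical modeling tool that can express complex problems in a unified, concise way. For example, suppose Jiang Yingjun needs 2 jin of beef and 5 jin of potatoes to cook a meal, and at the market beef costs 32 yuan per jin and potatoes 2 yuan per jin. The total cost is then $2 \times 32 + 5 \times 2 = 74$ yuan. Described with tensors, an order-1 tensor $a=(2,5)$ gives the weight needed of each food, another order-1 tensor $b=(32,2)$ gives the price of each food, and an order-0 tensor $c$ gives the total price, computed as follows
+Tensors are an efficient mathematical modeling tool that can express complex problems in a unified, concise way. For example, suppose Jiang Yingjun needs 2 jin of beef and 5 jin of potatoes to cook a meal, and at the market beef costs 32 yuan per jin and potatoes 2 yuan per jin. The total cost is then \\(2 \times 32 + 5 \times 2 = 74\\) yuan. Described with tensors, an order-1 tensor \\(a=(2,5)\\) gives the weight needed of each food, another order-1 tensor \\(b=(32,2)\\) gives the price of each food, and an order-0 tensor \\(c\\) gives the total price, computed as follows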
 $$
 \begin{aligned}
-  c & = a \times b^T \\
+  c & = a \times b^T \\\\
-    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 \\ 2\end{matrix}\right) \\
+    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 \\\\ 2\end{matrix}\right) \\\\
-    & = 2 \times 32 + 5 \times 2 \\
+    & = 2 \times 32 + 5 \times 2 \\\\
    & = 74
 \end{aligned}
 $$
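-where $b^T$ is the transpose of the row vector $b$, that is, a column vector, and $\times$ denotes vector multiplication. The next day, Jiang Yingjun goes to a different market, where beef costs 35 yuan per jin and potatoes 1 yuan per jin. To get the total price at each of the two markets, we can redefine $b$ as an order-2 tensor $\left(\begin{matrix}12 & 2\\35 & 1\end{matrix}\right)$ and define the total price $c$ as an order-2 tensor. Likewise we have
+where \\(b^T\\) is the transpose of the row vector \\(b\\), that is, a column vector, and \\(\times\\) denotes vector multiplication. The next day, Jiang Yingjun goes to a different market, where beef costs 35 yuan per jin and potatoes 1 yuan per jin. To get the total price at each of the two markets, we can redefine \\(b\\) as an order-2 tensor \\(\left(\begin{matrix}32 & 2 \\\\ 35 & 1\end{matrix}\right)\\) and define the total price \\(c\\) as an order-2 tensor. Likewise we have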
 $$
 \begin{aligned}
-  c & = a \times b^T \\
+  c & = a \times b^T \\\\
-    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}12 & 35 \\ 2 & 1\end{matrix}\right) \\
+    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 & 35 \\\\ 2 & 1\end{matrix}\right) \\\\
    & = \left(\begin{matrix}74 & 75\end{matrix}\right)
 \end{aligned}
 $$
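-That is, the totals at the two markets are 74 yuan and 75 yuan. As this shows, tensors can model varied and complex problems. For instance, the definitions of $a$, $b$ and $c$ above can be extended to higher-order tensors to handle different times, markets and recipes, yet however the situation changes, the problem can still be described by the same formula $c = a \times b^T$.
+That is, the totals at the two markets are 74 yuan and 75 yuan. As this shows, tensors can model varied and complex problems. For instance, the definitions of \\(a\\), \\(b\\) and \\(c\\) above can be extended to higher-order tensors to handle different times, markets and recipes, yet however the situation changes, the problem can still be described by the same formula \\(c = a \times b^T\\).
 Many real-world problems can be described as tensor expressions, that is, with the combination and computation of tensors written as arithmetic expressions. This style of modeling also forms the basis of modern neural network models and deep learning methods. In many machine learning toolkits, tensor computation has become the basic unit of processes such as forward and backward propagation in neural networks, and it is very widely used.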
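The shopping example can be checked with a few lines of plain C++. This is only a sketch of the arithmetic \\(c = a \times b^T\\); it does not use NiuTrans.Tensor, and the function name `TotalCost` and the one-row-per-market layout of `b` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ sketch (not the NiuTrans.Tensor API): computes c = a x b^T,
// where a holds the quantities and each row of b holds one market's prices.
std::vector<double> TotalCost(const std::vector<double> & a,
                              const std::vector<std::vector<double>> & b)
{
    std::vector<double> c;
    for (const std::vector<double> & prices : b) {
        assert(prices.size() == a.size());   // one price per food item
        double total = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            total += a[i] * prices[i];       // quantity times unit price
        c.push_back(total);                  // one total per market
    }
    return c;
}
```

Calling `TotalCost` with the quantities (2, 5) and the price rows (32, 2) and (35, 1) reproduces the two totals derived above. Adding a third market only means adding another row to `b`; the formula itself stays the same.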
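 ## How to define a tensor
-If you are a C/C++ or Python user, defining tensors with NiuTrans.Tensor in your program is very simple. First, download the NiuTrans.Tensor toolkit (source???) and compress it to any directory, for example ~/NTS. In the NTS directory we will have found the source subdirectory, which holds the source code. The structure of the source subdirectory is as follows:
+If you are a C/C++ or Python user, defining tensors with NiuTrans.Tensor in your program is very simple. First, download the NiuTrans.Tensor toolkit (source???) and extract it to any directory, for example ~/NTS. In the NTS directory we will find the source subdirectory, which holds the source code. The structure of the source subdirectory is as follows:
 * ~/NTS/source/XTensor.h - defines the tensor structure XTensor and the interfaces for constructing and destroying an XTensor
 * ~/NTS/source/core - source files containing the declarations and implementations of the tensor computation functions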
@@ -52,8 +65,8 @@ $$
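 * ~/NTS/source/*.h(cpp) - unrelated to tensor definition, described later :)
 Taking C/C++ as an example, including the XTensor.h header in a source file is all that is needed to define tensors. Below is a simple example program, sample.cpp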
+```
-<pre><code>#include "XTensor.h"          // include the header that defines XTensor
+#include "XTensor.h"          // include the header that defines XTensor
 using namespace nts;          // use the namespace nts, where XTensor lives
@@ -61,49 +74,54 @@ int main(int argc, const char ** argv)
 {
    // declare a variable named tensor of type XTensor
    XTensor tensor;                         
    // initialize it as a matrix (order-2 tensor) with 50 columns and 100 rows
    InitTensor2D(&tensor, 50, 100, X_FLOAT);
    // the tensor is now ready to use
    return 0;
 }
-</code></pre>
+```
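 Next, compile the source program above. This requires specifying the directory that contains the XTensor.h header. For example, compile sample.cpp with g++ (if you are using Visual Studio, see here???)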
-<pre><code>g++ sample.cpp -I~/NTS/source -o sample</code></pre>
+```
+g++ sample.cpp -I~/NTS/source -o sample
+```
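 sample.cpp uses XTensor, a class in NiuTrans.Tensor that defines the data structure a tensor needs. We can use this class to compute with tensors, copy them, and so on. After a variable of type XTensor is declared, it must be initialized, that is, actually specified as a tensor: for example, by setting the size of each dimension, setting the data type of each unit, and allocating memory for the tensor. InitTensor2D() is one such initialization function. It initializes a tensor as a matrix and takes four parameters: a pointer to the tensor being initialized, the number of columns, the number of rows, and the data type of the units. Here X_FLOAT is an enumeration value defined by NiuTrans.Tensor that denotes single-precision floating point. We can also use X_INT or X_DOUBLE to set the data type to 32-bit integer or double-precision floating point.
 NiuTrans.Tensor also provides other ways to define tensors. For example, a tensor can be created directly by calling a function and can be released explicitly. Below is a sample (sample2.cpp):
-<pre><code>#include "XTensor.h"         // include the header that defines XTensor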
+```
+#include "XTensor.h"         // include the header that defines XTensor
 using namespace nts;         // use the namespace nts, where XTensor lives
 int main(int argc, const char ** argv)
 {
    // create a single-precision floating-point tensor: a matrix with 50 columns and 100 rows
-    XTensor * tensor = NewTensor2D(&tensor, 50, 100, X_FLOAT);  
+    XTensor * tensor = NewTensor2D(50, 100, X_FLOAT);  
    // the tensor is now ready to use
    // release the tensor
    DelTensor(tensor);
    return 0;
 }
-</code></pre>
+```
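-The NewTensor2D and DelTensor functions used in sample2.cpp form a pair: the former creates a tensor and returns a pointer to it, and the latter releases the contents of the tensor that the pointer refers to. This approach suits C-style development.
+The NewTensor2D and DelTensor functions used in sample2.cpp form a pair: the former creates a tensor and returns a pointer to it, and the latter releases the contents of the tensor that the pointer refers to. This approach suits C/C++-style development.
 > Note that all tensors in NiuTrans.Tensor are "dense" by default, that is, every unit in the tensor is allocated space, and that space is contiguous. In some cases only a few units of a tensor are non-zero; such tensors can use a "sparse" representation, which saves storage space.
 To define a sparse tensor, specify one extra parameter on top of the existing ones: the density. The density is the proportion of non-zero units, a real number between 0 and 1, where 0 means all units are zero and 1 means all units are non-zero. By default every tensor has density 1. The following example shows how to define different kinds of tensors (sample3.cpp)
-<pre><code>#include "XTensor.h"         // include the header that defines XTensor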
+```
+#include "XTensor.h"         // include the header that defines XTensor
 using namespace nts;         // use the namespace nts, where XTensor lives
@@ -111,88 +129,1413 @@ int main(int argc, const char ** argv)
 {
    // create a single-precision floating-point tensor: a matrix with 50 columns and 100 rows
    // this tensor is dense
-    XTensor * tensor0 = NewTensor2D(&tensor, 50, 100, X_FLOAT);
+    XTensor * tensor0 = NewTensor2D(50, 100, X_FLOAT);
    // create a single-precision floating-point tensor: a matrix with 50 columns and 100 rows
    // this tensor is dense
-    XTensor * tensor1 = NewTensor2D(&tensor, 50, 100, X_FLOAT, 1.0F);
+    XTensor * tensor1 = NewTensor2D(50, 100, X_FLOAT, 1.0F);
    // create a single-precision floating-point tensor: a matrix with 50 columns and 100 rows
    // this tensor is sparse, with 10% of its units non-zero
-    XTensor * tensor2 = NewTensor2D(&tensor, 50, 100, X_FLOAT, 0.1F);  
+    XTensor * tensor2 = NewTensor2D(50, 100, X_FLOAT, 0.1F);  
    // tensor0, tensor1 and tensor2 are now ready to use
    // release these tensors
    DelTensor(tensor0);
    DelTensor(tensor1);
    DelTensor(tensor2);
    return 0;
 }
-</code></pre>
+```
 The basic functions for defining tensors are as follows:
-Function | Signature | Parameters 
+| Function | Signature | Parameters |
-: | - | - 
+| - | - | - |
-Initialize a tensor | void InitTensor(<br>XTensor * tensor, const int myOrder, <br> const int * myDimSize, const float myDenseRatio, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | tensor - pointer to the tensor to initialize <br> myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDenseRatio - density of the tensor; 1 means dense <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Initialize a tensor | void InitTensor(<br>XTensor * tensor, const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br> const float myDenseRatio, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | tensor - pointer to the tensor to initialize <br> myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDenseRatio - density of the tensor; 1 means dense <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Initialize a dense tensor | void InitTensor(<br>XTensor * tensor, const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> | tensor - pointer to the tensor to initialize <br> myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor 
+| Initialize a dense tensor | void InitTensor(<br>XTensor * tensor, const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> | tensor - pointer to the tensor to initialize <br> myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Create an empty tensor | XTensor * NewTensor() | N/A
+| Create an empty tensor | XTensor * NewTensor() | N/A |
-Create a tensor | XTensor * NewTensor(<br>const int myOrder, <br> const int * myDimSize, const float myDenseRatio, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br>  | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDenseRatio - density of the tensor; 1 means dense <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Create a tensor | XTensor * NewTensor(<br>const int myOrder, <br> const int * myDimSize, const float myDenseRatio, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDenseRatio - density of the tensor; 1 means dense <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Create a dense tensor | XTensor * NewTensor(<br>const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br>| myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Create a dense tensor | XTensor * NewTensor(<br>const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Destroy a tensor | void DelTensor(const XTensor * tensor) | tensor - pointer to the tensor to destroy 
+| Destroy a tensor | void DelTensor(const XTensor * tensor)   | tensor - pointer to the tensor to destroy |
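 Notes on the functions above:
 * The device ID is the index of the CPU or GPU device on which the tensor's memory is allocated; -1 means the CPU
-* XMem is a memory/GPU-memory pool class defined in NiuTrans.Tensor that manages host memory (or GPU memory) in a unified way. For more on device IDs and XMem, see the next section.
+* XMem is a memory/GPU-memory pool class in NiuTrans.Tensor that manages host memory (or GPU memory) in a unified way. For more on device IDs and XMem, see the next section.
 * TENSOR_DATA_TYPE defines the data type of a tensor, including:
-Type | Description 
+| Type | Description |
- | -
+| - | - |
-X_INT | 32-bit integer
+| X_INT | 32-bit integer |
-X_FLOAT | 32-bit floating point
+| X_FLOAT | 32-bit floating point |
-X_DOUBLE | 64-bit floating point
+| X_DOUBLE | 64-bit floating point |
-X_INT8 | 8-bit integer (planned)
+| X_INT8 | 8-bit integer (planned) |
-X_FLOAT16 | 16-bit floating point (planned)
+| X_FLOAT16 | 16-bit floating point (planned) |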
 In addition, NiuTrans.Tensor provides further ways to initialize and create tensors:
-Function | Signature | Parameters 
+| Function | Signature | Parameters |
-: | - | - 
+| - | - | - |
-Initialize as a dense vector | void InitTensor1D(<br>XTensor * tensor, const int num, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> | tensor - pointer to the tensor to initialize <br>  num - size of the vector <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Initialize as a dense vector | void InitTensor1D(<br>XTensor * tensor, const int num, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> | tensor - pointer to the tensor to initialize <br>  num - size of the vector <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Initialize as a dense matrix | void InitTensor2D(<br>XTensor * tensor, const int colNum, const int rowNum, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  colNum - number of columns <br> rowNum - number of rows <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Initialize as a dense matrix | void InitTensor2D(<br>XTensor * tensor, const int colNum, const int rowNum, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  colNum - number of columns <br> rowNum - number of rows <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Initialize as a dense 3D tensor | void InitTensor3D(<br>XTensor * tensor, <br> const int d0, const int d1, const int d2, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Initialize as a dense 3D tensor | void InitTensor3D(<br>XTensor * tensor, <br> const int d0, const int d1, const int d2, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Initialize as a dense 4D tensor | void InitTensor4D(<br>XTensor * tensor, <br> const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Initialize as a dense 4D tensor | void InitTensor4D(<br>XTensor * tensor, <br> const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Initialize as a dense 5D tensor | void InitTensor5D(<br>XTensor * tensor, <br> const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br>  d4 - size of the fifth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Initialize as a dense 5D tensor | void InitTensor5D(<br>XTensor * tensor, <br> const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | tensor - pointer to the tensor to initialize <br>  d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br>  d4 - size of the fifth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Create a dense vector | XTensor * NewTensor1D(<br>const int num, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> | num - size of the vector <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Create a dense vector | XTensor * NewTensor1D(<br>const int num, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> | num - size of the vector <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Create a dense matrix | XTensor * NewTensor2D(<br>const int colNum, const int rowNum, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> | colNum - number of columns <br> rowNum - number of rows <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Create a dense matrix | XTensor * NewTensor2D(<br>const int colNum, const int rowNum, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> | colNum - number of columns <br> rowNum - number of rows <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Create a dense 3D tensor | XTensor * NewTensor3D(<br> const int d0, const int d1, const int d2, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Create a dense 3D tensor | XTensor * NewTensor3D(<br> const int d0, const int d1, const int d2, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Create a dense 4D tensor | XTensor * NewTensor4D(<br>const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Create a dense 4D tensor | XTensor * NewTensor4D(<br>const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
-Create a dense 5D tensor | XTensor * NewTensor5D(<br>const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br>  d4 - size of the fifth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor
+| Create a dense 5D tensor | XTensor * NewTensor5D(<br>const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) <br> <br> <br> <br> | d0 - size of the first dimension <br>  d1 - size of the second dimension <br>  d2 - size of the third dimension <br>  d3 - size of the fourth dimension <br>  d4 - size of the fifth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
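-Other open questions??
+## Devices
-* Compilation: include the source files directly, or link against a library?
-* Do environment variables need to be changed?
-* Is the namespace nt, ntts, or nts?
-## Devices and memory pools
 ## Accessing the contents of a tensor
+In C/C++, we access the contents of a tensor through XTensor.h; including the XTensor.h header in a source file is all that is needed to define tensors.
+Member variables defined in the XTensor.h header: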
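+| Member variable | Description |
+| - | - |
+| XMem * mem | Memory pool used by the tensor |
+| void * data | Data array that stores the elements |
+| void * dataHost | Copy of the data in host memory; only active when running on a GPU |
+| int devID | Device ID: the index of the CPU or GPU device on which the tensor's memory is allocated; -1 means the CPU |
+| int order | Order of the tensor; for example, a matrix is an order-2 (two-dimensional) tensor |
+| int dimSize<br> [MAX_TENSOR_DIM_NUM] | Size of each dimension of the tensor; index 0 is the first dimension |
+| int dimSizeRDI<br> [MAX_TENSOR_DIM_NUM] | Size of each dimension of the tensor in transposed mode; index 0 is the first dimension |
+| TENSOR_DATA_TYPE dataType | Data type of each data unit |
+| int unitSize | Size of a data unit, similar to sizeof() |
+| int unitNum | Number of data units |
+| bool isSparse | Whether the tensor is sparse. An n * m dense matrix stores n * m values, whereas the amount of data in a sparse (non-dense) matrix depends on the number of non-zero elements. |
+| int unitNumNonZero | Number of non-zero elements in a sparse matrix |
+| float denseRatio | Density: the proportion of non-zero units, a real number between 0 and 1, where 0 means all units are zero and 1 means all units are non-zero. |
+| bool isShared | Whether the data array is shared with other tensors |
+| bool isInGlobalMem | Whether the data resides in global memory rather than in a memory pool |
+| bool isAllValued<br> [MAX_TENSOR_DIM_NUM] | Whether every dimension of the sparse matrix contains non-zero elements |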
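+Methods defined in the XTensor.h header:
+| Function | Signature | Parameters |
+| - | - | - |
+| Check whether two tensors have<br>the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b) | a - first tensor to compare <br> b - second tensor to compare |
+| Check whether three tensors have<br>the same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b, XTensor * c) | a - first tensor to compare <br> b - second tensor to compare <br> c - third tensor to compare |
+| Set the size of each dimension | void SetDim(int * myDimSize) |myDimSize - size of each dimension |
+| Get the size of a given dimension | int GetDim(const int dim) | dim - the dimension to query |
+| Reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - order of the tensor <br> myDimSize - size of each dimension |
+| Get the number of elements | int GetSize() | N/A |
+| Get the amount of memory used | int GetDataSizeInChar() | N/A |
+| Get the unit size of<br> a given data type | int GetUnitSize(<br> TENSOR_DATA_TYPE myDataType) | myDataType - the given data type |
+| Set all elements to 0 | void SetZeroAll(XStream * stream = NULL) | stream - multi-thread stream |
+| Assign the tensor from an array | void SetData(<br> const void * d, int num, int beg = 0) | d - source array  <br> num - size of the array <br> beg - position in the tensor at which assignment starts |
+| Fill the tensor from a uniform distribution | void SetDataRand(<br> DTYPE lower, DTYPE upper) | lower - minimum value <br> upper - maximum value |
+| Fill the tensor from a normal distribution | void SetDataRandn(<br> DTYPE mean, DTYPE standardDeviation) | mean - mean <br> standardDeviation - standard deviation |
+| Check whether the tensor's elements match a given array | bool CheckData(<br> const void * answer, int num, int beg = 0) | answer - the given array <br> num - size of the array <br> beg - position in the tensor at which the comparison starts |
+| Set the elements along a given<br> dimension in ascending order | void SetAscendingOrder(int dim) | dim - the given dimension |
+| Get a pointer to an element | void * GetCell(int * index, int size)    | index - position of the element <br> size - size of the matrix |
+| Get a pointer to an element of a 2D tensor | void * GetCell2D(int ni, int mi = 0) | ni - row index <br> mi - column index |
+| Get a value from a 2D tensor | DTYPE Get2D(int ni, int mi = 0) | ni - row index <br> mi - column index |
+| Get a value from a sparse tensor | DTYPE GetInSparse(int i) | i - position among the non-zero elements of the sparse matrix |
+| Get the key of a tuple<br> in a sparse tensor | int GetKeyInSparse(int i) | i - position among the non-zero elements of the sparse matrix |
+| Set a cell of<br> a 2D tensor | bool Set2D(DTYPE value, int ni, int mi = 0) | value - cell value <br> ni - row index <br> mi - column index |
+| Add to a cell of<br> a 2D tensor | bool Add2D(DTYPE value, int ni, int mi = 0) | value - value to add <br> ni - row index <br> mi - column index |
+| Get the number of non-zero<br> elements in a sparse matrix | int GetNonzeroSize() | N/A |
+| Resize the matrix to a given size | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
+| Resize the matrix to a given size<br>without allocating new space | bool ResizeWithNoData(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
+| Resize the matrix to the<br> size of another matrix | bool Resize(<br> const XTensor * myTensor) | myTensor - reference matrix whose size is used |
+| Find an element of a sparse<br> matrix by binary search | bool BinarySearch(<br> int key, DTYPE &value, void * &position) | key - position of the element in the sparse matrix <br> value - value of the element <br> position - coordinate position of the element |
+| Flush the data to<br> the target device | void FlushToMem(XMem * targetMem) | targetMem - target device |
+| Allocate memory for a<br> matrix in global memory | static void AllocateData(<br> XTensor * matrix, <br> XMem * myMem = NULL, <br> bool useBuf = false) | matrix - matrix to allocate memory for <br> myMem - memory pool in which to allocate the space, if any <br> useBuf - whether to use the buffer |
+| Free the memory of a<br> matrix in global memory | static void FreeData(<br> XTensor * matrix, <br> XMem * myMem = NULL, <br> bool useBuf = false) | matrix - matrix whose memory is freed <br> myMem - memory pool in which the space was allocated, if any <br> useBuf - whether to use the buffer |
+| Create a tensor in the buffer | XTensor * NewTensorBuf( <br> const int myOrder,  <br> const int * myDimSize, XMem * myMem, <br> const TENSOR_DATA_TYPE myDataType = <br> X_FLOAT, const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myMem - memory pool used by the tensor <br>  myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
+| Create a new tensor as a<br>copy of a given tensor | XTensor * NewTensor(<br>XTensor * a, bool isFilledData = true) | a - the given tensor <br>  isFilledData - whether to allocate the tensor's data space |
+| Release the data space<br>of a given tensor | void DelTensor(<br>const XTensor * tensor) | tensor - the given tensor |
+| Release the data space of a<br>given tensor in the buffer | void DelTensorBuf(<br>const XTensor * tensor) | tensor - the given tensor |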
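 ## Tensor computation
-### Addition (Sum)
+NiuTrans.Tensor provides functions for tensor computation, mainly basic tensor operations and activation functions. This section introduces these functions together with their usage and examples.
+### arithmetic
+This part covers arithmetic operations such as addition, subtraction, multiplication, division and negation.
+#### Matrix multiplication (MatrixMul)
+##### What is matrix multiplication between tensors?
+Matrix multiplication multiplies two matrices to produce a new result matrix. Multiplying matrices of shapes \\(2 \times 3\\) and \\(3 \times 2\\) proceeds as shown below, and the result matrix has shape \\(2 \times 2\\):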
+$$
+\left(\begin{matrix}1.0 & 2.0 & 3.0\\\\-4.0 & 5.0 & 6.0\end{matrix}\right) \times 
+\left(\begin{matrix}0.0 & -1.0\\\\1.0 & 2.0\\\\2.0 & 1.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}8.0 & 6.0\\\\17.0 & 20.0\end{matrix}\right)
+$$
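+##### Calling matrix multiplication
+NiuTrans.Tensor provides a matrix multiplication operation, defined in NiuTrans.Tensor/Tensor/core. Its call signature and parameters are as follows:
+```
+void _MatrixMul(XTensor * a, MATRIX_TRANS_TYPE transposedA, XTensor * b, MATRIX_TRANS_TYPE transposedB, XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
+```
+Parameters: 
+* a - operand tensor 1
+* transposedA - whether operand tensor 1 is transposed
+* b - operand tensor 2
+* transposedB - whether operand tensor 2 is transposed
+* c - tensor 3, which receives the result
+* alpha - coefficient
+* beta - coefficient
+##### Matrix multiplication example snippet
+Taking the most basic two-dimensional matrix multiplication as an example, the following sample code multiplies matrices with MatrixMul:
+```
+/* call MatrixMul function */
+_MatrixMul(s1, X_NOTRANS, s2, X_NOTRANS, t);
+```
+For a detailed code example of matrix multiplication, see: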
+NiuTrans.Tensor/Tensor/test/TMatrixMul.cpp
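To make the arithmetic of the worked example concrete, here is a minimal reference sketch in plain C++. It is not the library implementation: the name `MatMulSketch`, the row-major `std::vector` storage, and the reading of the two coefficients as `c = alpha * (a x b) + beta * c` (the usual GEMM convention) are assumptions made for illustration, and the transposition flags are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// a is n x k, b is k x m, c is n x m, all stored row-major.
// Assumed semantics: c = alpha * (a x b) + beta * c.
void MatMulSketch(const std::vector<double> & a, const std::vector<double> & b,
                  std::vector<double> & c,
                  std::size_t n, std::size_t k, std::size_t m,
                  double alpha = 1.0, double beta = 0.0)
{
    assert(a.size() == n * k && b.size() == k * m && c.size() == n * m);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            double dot = 0.0;                       // row i of a times column j of b
            for (std::size_t p = 0; p < k; ++p)
                dot += a[i * k + p] * b[p * m + j];
            c[i * m + j] = alpha * dot + beta * c[i * m + j];
        }
    }
}
```

With the \\(2 \times 3\\) and \\(3 \times 2\\) matrices from the example and the default coefficients, the sketch yields the same \\(2 \times 2\\) result as shown in the formula.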
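+#### Element-wise multiplication (Multiply)
+##### What is element-wise multiplication of tensors?
+Element-wise multiplication multiplies the elements of two tensors position by position. For two tensors of shape \\(2 \times 2\\) it proceeds as follows: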
+$$
+\left(\begin{matrix}0.0 & 1.0\\\\2.0 & 3.0\end{matrix}\right)  ·
+\left(\begin{matrix}0.0 & 1.0\\\\2.0 & 3.0\end{matrix}\right)  \rightarrow 
+\left(\begin{matrix}0.0 & 1.0\\\\4.0 & 9.0\end{matrix}\right)
+$$
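+##### Calling element-wise multiplication
+NiuTrans.Tensor provides an element-wise multiplication operation that computes the position-by-position product of two tensors. The function is defined in NiuTrans.Tensor/Tensor/core; its call signature and parameters are as follows:
+```
+_Multiply(XTensor * a, XTensor * b, XTensor * c, int leadingDim, DTYPE alpha = 0)
+```
+Parameters: 
+* a - operand tensor 1
+* b - operand tensor 2
+* c - result tensor
+* leadingDim - ???
+* alpha - coefficient
+##### Element-wise multiplication example snippet
+The following call uses Multiply to multiply the tensors s1 and s2 element by element and stores the result in t:
+```
+/* call multiply function */
+_Multiply(s1, s2, t, 0);
+```
+For a detailed code example of element-wise multiplication, see: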
+NiuTrans.Tensor/Tensor/test/TMultiply.cpp
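The element-wise product can be sketched in a few lines of plain C++. The function name `MultiplySketch` is hypothetical, and the role given to `alpha` here (`c[i] = a[i] * b[i] + alpha * c[i]`) is an assumption, since the text above only calls it a coefficient; the `leadingDim` parameter is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// Assumed semantics: c[i] = a[i] * b[i] + alpha * c[i].
void MultiplySketch(const std::vector<double> & a, const std::vector<double> & b,
                    std::vector<double> & c, double alpha = 0.0)
{
    assert(a.size() == b.size() && a.size() == c.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] * b[i] + alpha * c[i];   // multiply matching positions
}
```

With `alpha` left at 0 the previous contents of `c` are ignored, which matches the worked example above.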
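+#### Negation (Negate)
+##### What is tensor negation?
+Negating a tensor negates each of its elements, and the new elements together form the result tensor. Negating a tensor of shape \\(3 \times 2\\) proceeds as follows: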
+$$
+\left(\begin{matrix}1.0 & -2.0\\\\-3.0 & 4.0\\\\5.0 & -6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}-1.0 & 2.0\\\\3.0 & -4.0\\\\-5.0 & 6.0\end{matrix}\right)
+$$
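+##### Calling tensor negation
+NiuTrans.Tensor provides a negation operation that negates a tensor element by element. The function is defined in NiuTrans.Tensor/Tensor/core; its call signature and parameters are as follows:
+```
+_Negate(XTensor * a)
+```
+Parameters: 
+* a - operand tensor
+##### Tensor negation example snippet
+The following call uses Negate to negate a tensor, where a is the tensor to process:
+```
+/* call negate function */
+_Negate(a);
+```
+For a detailed code example of tensor negation, see: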
+NiuTrans.Tensor/Tensor/test/TNegate.cpp
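+#### Addition (Sum)
+##### What is tensor addition?
+Tensor addition adds n tensors to produce a new result tensor. The element at each position of the result is the sum of the operands' elements at that position, and the operands and the result all have the same shape. Adding two tensors of shape \\(2\times 3\\) proceeds as follows: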
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 \\\\ 3.0 & 4.0 & 5.0\end{matrix}\right) + 
+\left(\begin{matrix}0.5 & 1.5 & 2.5 \\\\ 3.5 & 4.5 & 5.5\end{matrix}\right) \rightarrow
+\left(\begin{matrix}0.5 & 2.5 & 4.5 \\\\ 6.5 & 8.5 & 10.5\end{matrix}\right)
+$$
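+##### Calling tensor addition
+NiuTrans.Tensor provides a tensor addition operation, defined in NiuTrans.Tensor/Tensor/core. It adds tensors element by element and produces the result tensor. It is called as follows:
+```
+_Sum(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
+```
+Here a and b are the input tensors and c is the result tensor; if c is NULL, the sum is stored in a. beta is a scaling parameter applied as c = a + b * beta, and it defaults to 1.0. The parameters are described below:
+Parameters: 
+* a - operand tensor 1
+* b - operand tensor 2
+* c - result tensor; if c is NULL, the result is stored in a
+* beta - scaling parameter
+##### Tensor addition example snippet
+The following call uses Sum to add two tensors; in this example the result is stored directly in a:
+```
+/* call sum function */
+_Sum(a, b);
+```
+For a detailed code example, see: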
+NiuTrans.Tensor/Tensor/test/TSum.cpp
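The scaling formula c = a + b * beta is easy to check by hand. Below is a small plain C++ sketch of it; `SumSketch` is a hypothetical name, it works on flat `std::vector` data rather than on XTensor, and it always returns a new result instead of modeling the "store into a when c is NULL" behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// Computes c = a + b * beta element by element.
std::vector<double> SumSketch(const std::vector<double> & a,
                              const std::vector<double> & b,
                              double beta = 1.0)
{
    assert(a.size() == b.size());            // operands must have the same shape
    std::vector<double> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i] * beta;
    return c;
}
```

With beta at its default of 1.0 this is ordinary addition and reproduces the \\(2 \times 3\\) example; other values of beta scale the second operand before it is added.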
+#### SumByColumnTV
+##### What is SumByColumnTV?
+SumByColumnTV adds a Vector to a Tensor column by column; the result has the same shape as the Tensor. The SumByColumnTV operation on a \\(2 \times 4\\) Tensor and a \\(2 \times 1\\) Vector proceeds as follows:
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) + \left(\begin{matrix}1.0\\\\0.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
+$$ 
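+##### Calling SumByColumnTV
+NiuTrans.Tensor provides the SumByColumnTV operation on tensors. Its call signature and parameters are as follows:
+```
+_SumByColumnTV(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
+```
+Parameters:
+* a - operand tensor
+* b - operand vector
+* c - result tensor
+* beta - scaling parameter
+The operation performed by SumByColumnTV is \\(c_{col} = a_{col} + b \times \beta\\)
+#####  SumByColumnTV example snippet
+Sample code for SumByColumnTV is shown below, where a is the input tensor, b is the input vector, and c is the result of adding a and b column by column:
+```
+/* call SumByColumnTV function */
+_SumByColumnTV(a, b, c);
+```
+For a detailed code example of SumByColumnTV, see: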
+NiuTrans.Tensor/Tensor/test/TSumByColumnTV.cpp
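The column-wise rule can be written out as a short plain C++ sketch. It is an illustration only: `SumByColumnTVSketch` is a hypothetical name, the tensor is a row-major `rows x cols` matrix in a flat `std::vector`, and the vector has one entry per row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// Adds the vector b (one entry per row) to every column of the
// row-major rows x cols matrix a: c_col = a_col + b * beta.
std::vector<double> SumByColumnTVSketch(const std::vector<double> & a,
                                        const std::vector<double> & b,
                                        std::size_t rows, std::size_t cols,
                                        double beta = 1.0)
{
    assert(a.size() == rows * cols && b.size() == rows);
    std::vector<double> c(a.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t j = 0; j < cols; ++j)
            c[r * cols + j] = a[r * cols + j] + b[r] * beta;
    return c;
}
```

Applied to the \\(2 \times 4\\) tensor and the vector (1.0, 0.0) from the example, only the first row changes, because the second vector entry is zero.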
+#### SumByColumnVT
+##### What is SumByColumnVT?
+SumByColumnVT adds the columns of a Tensor to a Vector; the result has the same shape as the Vector. The SumByColumnVT operation on a \\(2 \times 1\\) Vector and a \\(2 \times 4\\) Tensor proceeds as follows:
+$$
+\left(\begin{matrix}1.0\\\\0.0\end{matrix}\right) + \left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}7.0\\\\22.0\end{matrix}\right)
+$$ 
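+##### Calling SumByColumnVT
+NiuTrans.Tensor provides the SumByColumnVT operation on tensors. Its call signature and parameters are as follows:
+```
+_SumByColumnVT(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
+```
+Parameters:
+* a - operand vector
+* b - operand tensor
+* c - result vector
+* beta - scaling parameter
+The operation performed by SumByColumnVT is \\(c = a + \sum_{col} b_{col} \times \beta\\)
+#####  SumByColumnVT example snippet
+Sample code for SumByColumnVT is shown below, where a is the input vector, b is the input tensor, and c is the result of adding a and b column by column:
+```
+/* call SumByColumnVT function */
+_SumByColumnVT(a, b, c);
+```
+For a detailed code example of SumByColumnVT, see: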
+NiuTrans.Tensor/Tensor/test/TSumByColumnVT.cpp
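The same idea in the other direction, where every column of the tensor is accumulated into the vector, can be sketched as follows. Again this is plain C++ for illustration: `SumByColumnVTSketch` is a hypothetical name and the tensor is assumed to be a row-major `rows x cols` matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// Accumulates every column of the row-major rows x cols matrix b
// into the vector a: c = a + sum over columns of (b_col * beta).
std::vector<double> SumByColumnVTSketch(const std::vector<double> & a,
                                        const std::vector<double> & b,
                                        std::size_t rows, std::size_t cols,
                                        double beta = 1.0)
{
    assert(a.size() == rows && b.size() == rows * cols);
    std::vector<double> c(a);                 // start from the vector
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t j = 0; j < cols; ++j)
            c[r] += b[r * cols + j] * beta;   // add row r of each column
    return c;
}
```

For the example above, the first entry becomes 1 + (0 + 1 + 2 + 3) = 7 and the second becomes 0 + (4 + 5 + 6 + 7) = 22.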
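+### getandset
+This part covers data type conversion, setting data, getting data and similar operations.
+#### Selection (Select)
+##### What is tensor selection?
+Select picks part of a tensor at specified positions along a specified dimension. The selection on a \\(2 \times 2 \times 4\\) tensor is shown below; this example selects the elements at index positions 1 and 2 along dimension 2 and stores them in the target tensor, giving a tensor of shape \\(2 \times 2 \times 2\\):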
+$$
+\begin{aligned}
+\Biggl( 
+& \left( 
+\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right),\\\\ 
+& \left( 
+\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\\\5.0 & 6.0 & 7.0 & 8.0\end{matrix}
+\right)
+\Biggr)
+\end{aligned} \rightarrow 
+\begin{aligned}
+\Biggl( 
+& \left( 
+\begin{matrix}1.0 & 2.0\\\\5.0 & 6.0\end{matrix}
+\right),\\\\ 
+& \left( 
+\begin{matrix}2.0 & 3.0\\\\6.0 & 7.0\end{matrix}
+\right)  
+\Biggr)
+\end{aligned}
+$$
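+##### Calling tensor selection
+NiuTrans.Tensor provides a selection operation on tensors. Its call signature and parameters are as follows:
+```
+_SelectRange(XTensor * a, int dim, int low, int high, XTensor * c)
+```
+Parameters:
+* a - input tensor
+* dim - the dimension along which to select
+* low - lower bound of the selection range
+* high - upper bound of the selection range
+* c - result tensor
+>Note that a selection range of [1,3] selects the values at index positions 1 and 2; that is, the upper bound is exclusive
+#####  Tensor selection example snippet
+Sample code for tensor selection is shown below, where s is the input tensor and t is the output tensor; the selection is made along the third dimension with the range [1,3]:
+```
+/* call SelectRange function */
+_SelectRange(s, 2, 1, 3, t);
+```
+For a detailed code example of tensor selection, see: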
+NiuTrans.Tensor/Tensor/test/TSelect.cpp
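The range semantics (lower bound included, upper bound excluded) are easy to get wrong, so here is a plain C++ sketch of the selection. It is not the library code: `SelectRangeSketch` is a hypothetical name, the tensor is stored row-major in a flat `std::vector`, and its shape is passed explicitly.

```cpp
#include <cassert>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// Keeps the index positions low, low+1, ..., high-1 along dimension dim
// of a row-major tensor s with the given shape.
std::vector<double> SelectRangeSketch(const std::vector<double> & s,
                                      const std::vector<int> & shape,
                                      int dim, int low, int high)
{
    const int order = static_cast<int>(shape.size());
    assert(0 <= dim && dim < order);
    int outer = 1, inner = 1;
    for (int i = 0; i < dim; ++i) outer *= shape[i];          // dims before dim
    for (int i = dim + 1; i < order; ++i) inner *= shape[i];  // dims after dim
    const int len = shape[dim];
    assert(0 <= low && low < high && high <= len);
    assert(static_cast<int>(s.size()) == outer * len * inner);

    std::vector<double> t;
    t.reserve(static_cast<std::size_t>(outer * (high - low) * inner));
    for (int o = 0; o < outer; ++o)
        for (int d = low; d < high; ++d)          // high itself is excluded
            for (int in = 0; in < inner; ++in)
                t.push_back(s[(o * len + d) * inner + in]);
    return t;
}
```

Calling it with shape (2, 2, 4), `dim = 2`, `low = 1` and `high = 3` keeps two positions per row and reproduces the \\(2 \times 2 \times 2\\) result of the example.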
+#### SetData
+##### What is SetData?
+SetData randomly initializes a tensor with values in a given range. Applying SetData to a \\(2 \times 4\\) tensor with the range [0.0,1.0] proceeds as follows:
+$$
+\left(\begin{matrix}0.0 & 0.0 & 0.0 & 0.0\\\\0.0 & 0.0 & 0.0 & 0.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.1 & 0.5 & 0.3 & 0.9\\\\0.8 & 0.5 & 0.5 & 0.2\end{matrix}\right)
+$$ 
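+##### Calling SetData
+NiuTrans.Tensor provides the SetData operation on tensors. Its call signature and parameters are as follows:
+```
+SetDataRand(DTYPE lower, DTYPE upper)
+```
+Parameters:
+* lower - lower bound of the values
+* upper - upper bound of the values
+#####  SetData example snippet
+Sample code for SetData is shown below; this example randomly initializes the tensor s with values in the range [0.0,1.0]:
+```
+/* call SetData function */
+s->SetDataRand(0.0, 1.0);
+```
+For a detailed code example of SetData, see: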
+NiuTrans.Tensor/Tensor/test/TSetData.cpp
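For readers who want to see what uniform random initialization amounts to, the following plain C++ sketch fills a flat array using the standard `<random>` facilities. How SetDataRand draws its numbers internally is not stated here, so the generator, the `seed` parameter and the name `FillUniformSketch` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// Fills t with pseudo-random values drawn uniformly between lower and upper.
void FillUniformSketch(std::vector<double> & t, double lower, double upper,
                       unsigned seed = 0)
{
    assert(lower < upper);
    std::mt19937 gen(seed);                                  // fixed seed: repeatable runs
    std::uniform_real_distribution<double> dist(lower, upper);
    for (double & v : t)
        v = dist(gen);
}
```

A fixed seed makes the values repeatable, which is convenient in tests; in real training code the seed would normally vary between runs.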
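+### math
+This part covers non-basic algebraic operations, including log, exp, abs and so on.
+#### Normalization (Normalize)
+##### What is tensor normalization?
+Neural networks need normalization (Normalize) to reduce the influence that variables with large values have on the model. The Normalize function is defined as:
+>\\(y = a \times \frac{x - mean}{\sqrt{variance + \epsilon}} + b\\)
+##### Calling Normalize
+NiuTrans.Tensor provides the Normalize operation on tensors. Its call signature and parameters are as follows:
+```
+_Normalize(XTensor * input, XTensor * output, int dim, XTensor * mean, XTensor * var, XTensor * a, XTensor * b, DTYPE epsilon)
+```
+Parameters:
+* input - input tensor
+* output - output tensor
+* dim - the dimension along which the mean and variance are computed
+* mean - mean
+* var - variance
+* a - scale
+* b - bias
+* epsilon - parameter that guards against a variance of 0
+#####  Normalize example snippet
+Sample code for Normalize is shown below:
+```
+/* call normalize function */
+_Normalize(s, t, 0, mean, var, a, b, 0.0);
+```
+For a detailed code example of Normalize, see: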
+NiuTrans.Tensor/Tensor/test/TNormalize.cpp
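The Normalize formula can be tried on a small array with the sketch below. It is a simplification for illustration: `NormalizeSketch` is a hypothetical name, and the mean, variance, scale and bias are single scalars here, whereas the library call above takes them as tensors along a chosen dimension.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference sketch, not the NiuTrans.Tensor implementation.
// Applies y = a * (x - mean) / sqrt(variance + epsilon) + b to each element.
std::vector<double> NormalizeSketch(const std::vector<double> & x,
                                    double mean, double variance,
                                    double a, double b, double epsilon)
{
    assert(variance + epsilon > 0.0);        // avoid dividing by zero
    const double denom = std::sqrt(variance + epsilon);
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * (x[i] - mean) / denom + b;
    return y;
}
```

For the input (1, 3), whose mean is 2 and whose population variance is 1, a scale of 2 and a bias of 0.5 map the values to -1.5 and 2.5. A small positive epsilon keeps the denominator away from zero when the variance vanishes.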
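+#### Exponentiation (Power)
+##### What is the power operation on tensors?
+The power operation on a tensor raises every element of the tensor to a given power to produce a new tensor. Raising a tensor of shape \\(3 \times 2\\) to the power 2.0 proceeds as follows: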
+$$
+\left(\begin{matrix}1.0 & 2.0\\\\3.0 & 4.0\\\\5.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & 4.0\\\\9.0 & 16.0\\\\25.0 & 36.0\end{matrix}\right)
+$$
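+##### Calling the power operation
+NiuTrans.Tensor provides a power operation that raises a tensor to a power element by element. It is called as follows:
+```
+_Power(XTensor * a, DTYPE p)
+```
+Here a is the tensor to operate on and p is the exponent. The parameters are described below:
+Parameters: 
+* a - operand tensor
+* p - exponent
+##### Power operation example snippet
+Below is sample code that calls Power to raise a to the power 2.0:
+```
+/* call power function */
+_Power(a, 2.0);
+```
+For a detailed code example of the power operation, see: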
+NiuTrans.Tensor/Tensor/test/TPower.cpp
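+#### Scale and Shift
+##### What is scaling and shifting a tensor?
+Scaling and shifting a tensor is computed as p = p * scale + shift, where scale and shift are the scaling and shifting parameters. Scaling and shifting a \\(2 \times 4\\) tensor with a scale of 2.0 and a shift of 0.5 proceeds as follows: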
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)
+$$
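+##### Calling scale and shift
+NiuTrans.Tensor provides a scale-and-shift operation on tensors. It is called as follows:
+```
+_ScaleAndShift(XTensor * a, DTYPE scale, DTYPE shift)
+```
+The result of the operation is p = p * scale + shift, where scale and shift are the scaling and shifting parameters. The parameters are described below:
+Parameters:
+* a - input tensor
+* scale - scaling parameter
+* shift - shifting parameter
+##### Scale and shift example snippet
+Sample code for scaling and shifting a tensor is shown below, where input is the tensor to operate on, scaleFactor is the scaling parameter, and shiftFactor is the shifting parameter:
+```
+/* call ScaleAndShift function */
+_ScaleAndShift(input, scaleFactor, shiftFactor);
+```
+For a detailed code example of scale and shift, see: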
+NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp
+### movement
+This part covers the data copying functions.
+#### CopyValues
+##### What is the tensor copy operation?
+Copying assigns the values of one tensor to another. Copying a \\(2 \times 4\\) tensor looks like this:
+$$
+\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
+\left(
+\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
+$$
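+##### Calling CopyValues
+NiuTrans.Tensor provides a copy operation for tensors. Its signature and parameters are:
+```
+_CopyValues(XTensor * s, XTensor * t, XStream * stream)
+```
+Parameters:
+* s - the input tensor
+* t - the output tensor
+* stream - the stream used for multi-threaded execution
+##### CopyValues example
+In the following snippet, input is the tensor to copy and output is the result tensor:
+```
+/* call CopyValues function */
+_CopyValues(input, output);
+```
+For a complete CopyValues example, see:
+NiuTrans.Tensor/Tensor/test/TCopyValues.cpp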
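+#### CopyIndexed
+##### What is the CopyIndexed operation?
+CopyIndexed copies a tensor at specified index positions. The following shows a \\(3 \times 2 \times 3\\) tensor being copied: along dimension 2, one element is copied starting at each of the source indices 0 and 2, which yields a \\(3 \times 2 \times 2\\) tensor: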
+$$
+\begin{aligned}
+\Biggl( 
+& \left( 
+\begin{matrix}0.0 & -1.0 & 2.0\\\\2.0 & 1.0 & 3.0\end{matrix}\right),\\\\ 
+& \left( 
+\begin{matrix}1.0 & 2.0 & 4.0\\\\3.0 & 1.0 & 2.0\end{matrix}
+\right),\\\\ 
+& \left( 
+\begin{matrix}-1.0 & 3.0 & 2.0\\\\1.0 & -1.0 & 0.0\end{matrix}
+\right)  
+\Biggr)
+\end{aligned} \rightarrow 
+\begin{aligned}
+\Biggl( 
+& \left( 
+\begin{matrix}0.0 & 2.0\\\\2.0 & 3.0\end{matrix}\right),\\\\ 
+& \left( 
+\begin{matrix}1.0 & 4.0\\\\3.0 & 2.0\end{matrix}
+\right),\\\\ 
+& \left( 
+\begin{matrix}-1.0 & 2.0\\\\1.0 & 0.0\end{matrix}
+\right)  
+\Biggr)
+\end{aligned}
+$$
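+##### Calling CopyIndexed
+NiuTrans.Tensor provides a CopyIndexed operation for tensors. Its signature and parameters are:
+```
+_CopyIndexed(XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
+```
+Parameters:
+* s - the input tensor
+* t - the output tensor
+* dim - the dimension along which CopyIndexed operates
+* srcIndex - the source indices, i.e. the positions along dim of the values to copy
+* indexSize - the number of source indices
+* tgtIndex - the target indices, i.e. the positions of the copied values in the output tensor
+* copyNum - the number of elements to copy starting from each source index
+##### CopyIndexed example
+In the following snippet, s is the tensor to operate on and t is the result tensor; along the third dimension (dimension 2), one element is copied to the target tensor for each start index:
+```
+/* call CopyIndexed function */
+_CopyIndexed(s, t, 2, srcIndex, indexSize, tgtIndex, 1);
+```
+For a complete CopyIndexed example, see:
+NiuTrans.Tensor/Tensor/test/TCopyIndexed.cpp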
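The roles of srcIndex, tgtIndex and copyNum may be easier to follow in a small standalone sketch. The code below is an illustration under assumptions, not the library routine: `CopyIndexedLastDim` is a hypothetical helper that only handles the last dimension of a row-major buffer, with all leading dimensions flattened into `outer` rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the library routine: copies selected positions of the
// last dimension. `s` is row-major with `outer` rows of `srcLast` entries each;
// the result has `outer` rows of `tgtLast` entries each.
std::vector<float> CopyIndexedLastDim(const std::vector<float>& s,
                                      std::size_t outer, std::size_t srcLast,
                                      std::size_t tgtLast,
                                      const std::vector<std::size_t>& srcIndex,
                                      const std::vector<std::size_t>& tgtIndex,
                                      std::size_t copyNum)
{
    assert(s.size() == outer * srcLast);
    assert(srcIndex.size() == tgtIndex.size());
    std::vector<float> t(outer * tgtLast, 0.0F);
    for (std::size_t r = 0; r < outer; ++r) {
        for (std::size_t k = 0; k < srcIndex.size(); ++k) {
            // copy copyNum consecutive elements starting at srcIndex[k]
            for (std::size_t c = 0; c < copyNum; ++c) {
                assert(srcIndex[k] + c < srcLast && tgtIndex[k] + c < tgtLast);
                t[r * tgtLast + tgtIndex[k] + c] = s[r * srcLast + srcIndex[k] + c];
            }
        }
    }
    return t;
}
```

For the worked example above, the three 2×3 blocks flatten to 6 rows of 3 entries; calling the helper with srcIndex {0, 2}, tgtIndex {0, 1} and copyNum 1 keeps the first and last entry of every row.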
+### reduce
+#### ReduceMax
+##### What is the reduce-max operation?
+ReduceMax takes the maximum of a tensor along one of its dimensions. Applying it to a \\(2 \times 4\\) tensor along dimension 0 and along dimension 1 gives, respectively:
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
+$$
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}3.0\\\\7.0\end{matrix}\right)
+$$
+##### Calling ReduceMax
+NiuTrans.Tensor provides a ReduceMax operation that returns the maximum values of a tensor along a given dimension. Its signature and parameters are:
+```
+_ReduceMax(XTensor * input, XTensor * output, int dim)
+```
+Parameters:
+* input - the input tensor
+* output - the output tensor
+* dim - the dimension along which the maximum is taken
+##### ReduceMax example
+The following snippet calls ReduceMax; the two calls reduce along dimension 0 and dimension 1 respectively:
+```
+/* call reduce max function */
+_ReduceMax(a, reduce_a, 0);
+_ReduceMax(b, reduce_b, 1);
+```
+For a complete ReduceMax example, see:
+NiuTrans.Tensor/Tensor/test/TReduceMax.cpp
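+#### ReduceSum
+##### What is the reduce-sum operation?
+ReduceSum computes the sum of a tensor along one of its dimensions. Applying it to a \\(2 \times 4\\) tensor along dimension 0 and along dimension 1 gives, respectively: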
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}4.0 & 6.0 & 8.0 & 10.0\end{matrix}\right)
+$$
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}6.0\\\\22.0\end{matrix}\right)
+$$
+##### Calling ReduceSum
+NiuTrans.Tensor provides a ReduceSum operation for tensors:
+```
+_ReduceSum(XTensor * input, XTensor * output, int dim, XTensor * shift, DTYPE power, bool isExp)
+```
+Parameters:
+* input - the input tensor
+* output - the output tensor
+* dim - the dimension along which the sum is taken
+* shift - the shift applied to the input, NULL by default
+* power - the exponent applied to each element, 1.0F by default
+* isExp - whether to exponentiate each element, false by default
-### 缩放和偏移（Scale and Shift）
+##### ReduceSum example
+The following snippet calls ReduceSum; the two calls reduce along dimension 0 and dimension 1 respectively:
+```
+/* call reduce sum function */
+_ReduceSum(a, reduce_a, 0);
+_ReduceSum(b, reduce_b, 1);
+```
+For a complete ReduceSum example, see:
+NiuTrans.Tensor/Tensor/test/TReduceSum.cpp
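What "reducing along a dimension" means for a 2-D tensor can be shown with a short standalone sketch. `ReduceSum2D` is a hypothetical helper, not the library function: it assumes a row-major matrix and ignores the shift, power and isExp options.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums a row-major rows x cols matrix along `dim`.
// dim == 0 collapses the rows (one value per column);
// dim == 1 collapses the columns (one value per row).
std::vector<float> ReduceSum2D(const std::vector<float>& x,
                               std::size_t rows, std::size_t cols, int dim)
{
    assert(x.size() == rows * cols);
    assert(dim == 0 || dim == 1);
    std::vector<float> out(dim == 0 ? cols : rows, 0.0F);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            out[dim == 0 ? j : i] += x[i * cols + j];
    return out;
}
```

For the 2×4 tensor holding 0.0 through 7.0, reducing along dimension 0 gives (4.0, 6.0, 8.0, 10.0) and reducing along dimension 1 gives (6.0, 22.0). ReduceMax and ReduceMean follow the same traversal with a different accumulation step.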
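+#### ReduceMean
+##### What is the reduce-mean operation?
+ReduceMean computes the mean of a tensor along one of its dimensions. Applying it to a \\(2 \times 4\\) tensor along dimension 0 and along dimension 1 gives, respectively: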
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}2.0 & 3.0 & 4.0 & 5.0\end{matrix}\right)
+$$
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.5\\\\5.5\end{matrix}\right)
+$$
+##### Calling ReduceMean
+NiuTrans.Tensor provides a ReduceMean operation that returns the mean values of a tensor along a given dimension:
+```
+_ReduceMean(XTensor * input, XTensor * output, int dim)
+```
+Parameters:
+* input - the input tensor
+* output - the output tensor
+* dim - the dimension along which the mean is taken
+##### ReduceMean example
+The following snippet calls ReduceMean; the two calls reduce along dimension 0 and dimension 1 respectively:
+```
+/* call reduce mean function */
+_ReduceMean(a, reduce_a, 0);
+_ReduceMean(b, reduce_b, 1);
+```
+For a complete ReduceMean example, see:
+NiuTrans.Tensor/Tensor/test/TReduceMean.cpp
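+#### ReduceSumSquared
+##### What is the reduce-sum-squared operation?
+ReduceSumSquared computes, along one dimension of a tensor, the sum of the squared differences between each element and a shift value. This sum is the numerator of the variance, not the variance itself. The following shows a \\(2 \times 4\\) tensor reduced along dimension 0, with the per-column means \\((2.0, 3.0, 4.0, 5.0)\\) as the shift: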
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}8.0 & 8.0 & 8.0 & 8.0\end{matrix}\right)
+$$
+##### Calling ReduceSumSquared
+NiuTrans.Tensor provides a ReduceSumSquared operation that computes the sum of squared differences from the shift along a given dimension:
+```
+_ReduceSumSquared(XTensor * input, XTensor * output, int dim, XTensor * shift)
+```
+Parameters:
+* input - the input tensor
+* output - the output tensor
+* dim - the dimension along which the squared differences are summed
+* shift - the shift subtracted from the input
+##### ReduceSumSquared example
+The following snippet calls ReduceSumSquared along dimension 0:
+```
+/* call reduce sum squared function */
+_ReduceSumSquared(input, output, 0, shift);
+```
+For a complete ReduceSumSquared example, see:
+NiuTrans.Tensor/Tensor/test/TReduceSumSquared.cpp
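+#### ReduceVariance
+##### What is the reduce-variance operation?
+ReduceVariance computes the variance of a tensor along one of its dimensions (the standard deviation is the square root of this value). The following shows a \\(2 \times 4\\) tensor reduced along dimension 0: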
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}4.0 & 4.0 & 4.0 & 4.0\end{matrix}\right)
+$$
+##### Calling ReduceVariance
+NiuTrans.Tensor provides a ReduceVariance operation that computes the variance of the elements along a given dimension:
+```
+_ReduceVariance(XTensor * input, XTensor * output, int dim, XTensor * mean)
+```
+Parameters:
+* input - the input tensor
+* output - the output tensor
+* dim - the dimension along which the variance is taken
+* mean - the mean
+##### ReduceVariance example
+The following snippet calls ReduceVariance along dimension 0:
+```
+/* call reduce variance function */
+_ReduceVariance(input, output, 0, mean);
+```
+For a complete ReduceVariance example, see:
+NiuTrans.Tensor/Tensor/test/TReduceVariance.cpp
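The relationship between the ReduceSumSquared and ReduceVariance examples above (8.0 versus 4.0 for the same input) can be checked with a small standalone sketch. Both functions below are hypothetical helpers, not library code; they assume a row-major 2-D matrix reduced along dimension 0 and a variance that divides by the number of rows, which is what the worked examples imply.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each column j of a row-major rows x cols matrix,
// returns the sum over i of (x[i][j] - shift[j])^2.
std::vector<float> SumSquaredDim0(const std::vector<float>& x,
                                  std::size_t rows, std::size_t cols,
                                  const std::vector<float>& shift)
{
    assert(x.size() == rows * cols && shift.size() == cols);
    std::vector<float> out(cols, 0.0F);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            float d = x[i * cols + j] - shift[j];
            out[j] += d * d;
        }
    }
    return out;
}

// Hypothetical helper: per-column variance given the column means,
// i.e. the sum of squared deviations divided by the number of rows.
std::vector<float> VarianceDim0(const std::vector<float>& x,
                                std::size_t rows, std::size_t cols,
                                const std::vector<float>& mean)
{
    assert(rows > 0);
    std::vector<float> out = SumSquaredDim0(x, rows, cols, mean);
    for (float& v : out)
        v /= static_cast<float>(rows);
    return out;
}
```

For the 2×4 tensor holding 0.0 through 7.0 and the column means (2.0, 3.0, 4.0, 5.0), every column has squared deviations 4.0 + 4.0 = 8.0, and dividing by the 2 rows gives 4.0.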
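+### shape
+This part covers the functions that change a tensor's shape, such as split, merge, and reshape.
+#### Concatenate
+##### What is tensor concatenation?
+Concatenation joins a series of tensors, or all the tensors in a list, along one dimension to form a larger tensor. Concatenating two tensors of shapes \\(2 \times 1\\) and \\(2 \times 2\\) looks like this: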
+$$
+\left(\begin{matrix}0.0\\\\1.0\end{matrix}\right) +
+\left(\begin{matrix}2.0 & 3.0\\\\4.0 & 5.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 2.0 & 3.0\\\\1.0 & 4.0 & 5.0\end{matrix}\right)
+$$
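+##### Calling Concatenate
+NiuTrans.Tensor provides two ways to concatenate tensors:
+```
+_Concatenate(XList * smalls, XTensor * big, int dim)
+_Concatenate(XTensor * smallA, XTensor * smallB, XTensor * big, int dim)
+```
+The first form operates on a list: the tensors to concatenate are stored in the list smalls, and the result is stored in the tensor big:
+Parameters:
+* smalls - the list of tensors to concatenate
+* big - the result tensor
+* dim - the dimension along which to concatenate
+The second form takes the two tensors to concatenate directly instead of a list:
+Parameters:
+* smallA - the first tensor
+* smallB - the second tensor
+* big - the result tensor
+* dim - the dimension along which to concatenate
+##### Concatenate examples
+The following snippet concatenates a list of tensors; sList is the list of tensors to concatenate and t is the result tensor:
+```
+/* call concatenate function */
+_Concatenate(&sList, t, 1);
+```
+The following snippet concatenates two tensors directly; s1 and s2 are the tensors to concatenate and t is the result tensor:
+```
+/* call concatenate function */
+_Concatenate(s1, s2, t, 1);
+```
+For a complete Concatenate example, see:
+NiuTrans.Tensor/Tensor/test/TConcatenate.cpp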
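+#### Split
+##### What is tensor splitting?
+Splitting operates along one dimension of a tensor: it can either reshape one tensor into another tensor, or cut a large tensor into a list of n smaller tensors.
+In the first case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 parts, which yields a \\(2 \times 2 \times 3\\) tensor: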
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\begin{aligned}
+\Biggl( & \left( 
+\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
+\right) \Biggr)
+\end{aligned}
+$$
+In the second case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 parts, which yields two \\(2 \times 3\\) tensors:
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+$$
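+##### Calling Split
+NiuTrans.Tensor provides two split operations:
+```
+_Split(XTensor * s, XTensor * t, int whereToSplit, int splitNum)
+_Split(XTensor * big, XList * smalls, int whereToSplit, int splitNum)
+```
+The first form splits one dimension of the source tensor and stores the result in the tensor t; whereToSplit is the dimension to split and splitNum is the number of parts, e.g. (N, M) -> (N/3, M, 3):
+Parameters:
+* s - the tensor to split
+* t - the result tensor
+* whereToSplit - the dimension along which to split
+* splitNum - the number of parts
+The second form splits the tensor big along the dimension whereToSplit and stores the resulting smaller tensors in the list smalls; splitNum is the number of parts, e.g. (N, M) -> 2 * (N/2, M):
+Parameters:
+* big - the tensor to split
+* smalls - the list that receives the split tensors
+* whereToSplit - the dimension along which to split
+* splitNum - the number of parts
+##### Split examples
+The following snippet shows the first form; s is the tensor to split, t is the result tensor, 0 means the split runs along dimension 0, and 2 is the number of parts:
+```
+/* call split function */
+_Split(s, t, 0, 2);
+```
+The following snippet shows the second form; s is the tensor to split, tList is the list that receives the result tensors, 1 means the split runs along dimension 1, and 2 is the number of parts:
+```
+/* call split function */
+_Split(s, &tList, 1, 2);
+```
+For a complete Split example, see:
+NiuTrans.Tensor/Tensor/test/TSplit.cpp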
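+#### Merge
+##### What is tensor merging?
+Merging is similar to concatenation. It operates along one dimension of a tensor: it can either merge one tensor into another tensor of a different shape, or merge all the tensors in a list into one larger tensor.
+In the first case, a \\(2 \times 2 \times 3\\) tensor is merged with whereToMerge set to 1 and leadingDim set to 0, which yields a \\(4 \times 3\\) tensor: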
+$$
+\begin{aligned}
+\Biggl( & \left( 
+\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
+\right) \Biggr)
+\end{aligned} \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+$$
+In the second case, two \\(2 \times 3\\) tensors are merged along dimension 0 into a \\(4 \times 3\\) tensor:
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+$$ 
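+##### Calling Merge
+NiuTrans.Tensor provides two merge operations:
+```
+_Merge(XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
+_Merge(XList * smalls, XTensor * big, int whereToMerge)
+```
+The first form merges one dimension of the source tensor and stores the result in the tensor t; whereToMerge is the dimension in which the merge takes place and leadingDim is the dimension that is merged away, e.g. (N/2, 2, M) -> (N, M):
+Parameters:
+* s - the tensor to merge
+* t - the result tensor
+* whereToMerge - the dimension in which the merge takes place
+* leadingDim - the dimension that is merged away
+The second form merges the tensors stored in the list smalls into the tensor big; whereToMerge is the dimension along which to merge, e.g. 2 * (N/2, M) -> (N, M):
+Parameters:
+* smalls - the list of tensors to merge
+* big - the result tensor
+* whereToMerge - the dimension along which to merge
+##### Merge examples
+The following snippet shows the first form; s is the tensor to merge, t is the result tensor, 1 is whereToMerge, and 0 is leadingDim:
+```
+/* call merge function */
+_Merge(s, t, 1, 0);
+```
+The following snippet shows the second form; sList is the list of tensors to merge, t is the result tensor, and 0 means the merge runs along dimension 0:
+```
+/* call merge function */
+_Merge(&sList, t, 0);
+```
+For a complete Merge example, see:
+NiuTrans.Tensor/Tensor/test/TMerge.cpp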
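+#### Unsqueeze
+##### What is Unsqueeze?
+Unsqueeze returns a new tensor with an extra dimension inserted at a specified position; the returned tensor shares the same underlying data as the source tensor. The following shows a \\(2 \times 3\\) tensor unsqueezed at dimension 1 and at dimension 2 respectively, with the inserted dimension having size 2 in both cases: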
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow 
+\begin{aligned}
+\Biggl( & \left( 
+\begin{matrix}0.0 & 1.0 & 2.0\\\\0.0 & 1.0 & 2.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}3.0 & 4.0 & 5.0\\\\3.0 & 4.0 & 5.0\end{matrix}
+\right) \Biggr)
+\end{aligned}
+$$
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow  
+\begin{aligned}
+\Biggl( & \left( 
+\begin{matrix}0.0 & 0.0\\\\1.0 & 1.0\\\\2.0 & 2.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}3.0 & 3.0\\\\4.0 & 4.0\\\\5.0 & 5.0\end{matrix}
+\right) \Biggr)
+\end{aligned}
+$$
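+##### Calling Unsqueeze
+NiuTrans.Tensor provides an Unsqueeze operation for tensors. Its signature and parameters are:
+```
+_Unsqueeze(XTensor * a, XTensor * b, int dim, int dSize)
+```
+Parameters:
+* a - the input tensor
+* b - the output tensor
+* dim - the position at which the new dimension is inserted
+* dSize - the size of the inserted dimension
+##### Unsqueeze example
+In the following snippet, s is the tensor to operate on and t1 and t2 are the result tensors; the two calls insert a dimension of size 2 at dimension 1 and at dimension 2 respectively:
+```
+/* call Unsqueeze function */
+_Unsqueeze(s, t1, 1, 2);
+_Unsqueeze(s, t2, 2, 2);
+```
+For a complete Unsqueeze example, see:
+NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp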
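The two worked examples above repeat each slice dSize times along the inserted dimension. The sketch below reproduces that layout on a flat row-major buffer. It is an illustration only: `UnsqueezeCopy` is a hypothetical helper that copies the data into a new buffer, and it says nothing about how the library stores or shares memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: inserts a dimension of size dSize at position `dim` of a
// row-major tensor with the given shape, repeating the data along the new axis.
std::vector<float> UnsqueezeCopy(const std::vector<float>& a,
                                 const std::vector<std::size_t>& shape,
                                 std::size_t dim, std::size_t dSize)
{
    assert(dim <= shape.size());
    // outer: number of slices before the new axis; inner: elements per slice
    std::size_t outer = 1;
    std::size_t inner = 1;
    for (std::size_t i = 0; i < shape.size(); ++i) {
        if (i < dim)
            outer *= shape[i];
        else
            inner *= shape[i];
    }
    assert(a.size() == outer * inner);

    std::vector<float> b;
    b.reserve(a.size() * dSize);
    for (std::size_t o = 0; o < outer; ++o)
        for (std::size_t k = 0; k < dSize; ++k)
            for (std::size_t j = 0; j < inner; ++j)
                b.push_back(a[o * inner + j]);
    return b;
}
```

For the 2×3 input above, dim 1 gives a 2×2×3 result in which each row appears twice, and dim 2 gives a 2×3×2 result in which each element appears twice.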
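+### sort
+This part covers the sorting functions, such as sort and topk.
+#### Sort
+##### What is Sort?
+Sort orders the elements of a tensor along a specified dimension (in descending order in this example). Sorting a \\(2 \times 4\\) tensor along dimension 0 looks like this: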
+$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\\\\0.0 & 1.0 & 2.0 & 3.0\end{matrix}\right)
+$$
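+##### Calling Sort
+NiuTrans.Tensor provides a Sort operation for tensors. Its signature and parameters are:
+```
+_Sort(XTensor * a, XTensor * index, int dim)
+```
+Parameters:
+* a - the tensor to sort
+* index - the indices of the elements in the result tensor
+* dim - the dimension along which to sort
+##### Sort example
+In the following snippet, a is the tensor to sort and b receives the indices of the elements in the result; the sort runs along dimension 0:
+```
+/* call Sort function */
+_Sort(a, b, 0);
+```
+For a complete Sort example, see:
+NiuTrans.Tensor/Tensor/test/TSort.cpp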
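+#### TopK
+##### What is TopK?
+TopK sorts the elements of a tensor and returns the k largest or smallest values together with their indices. It can be applied along any one dimension of a tensor. Taking the top 2 of a \\(2 \times 4\\) tensor along dimension 0 looks like this: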
+$$
+\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow 
+\begin{aligned}
+outputAnswer: & \left(
+\begin{matrix}5.0 & 3.0 & 7.0 & 8.0\\\\4.0 & 1.0 & 2.0 & 6.0\end{matrix}\right)\\\\
+\\\\ indexAnswer: & \left(
+\begin{matrix}0 & 1 & 1 & 0\\\\1 & 0 & 0 & 1\end{matrix}\right)
+\end{aligned}
+$$
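+##### Calling TopK
+NiuTrans.Tensor provides a TopK operation for tensors. Its signature and parameters are:
+```
+_TopK(XTensor * a, XTensor * b, XTensor * index, int dim, int k)
+```
+Parameters:
+* a - the input tensor
+* b - the output tensor
+* index - the indices of the output values
+* dim - the dimension along which TopK operates
+* k - the number of largest values to take
+##### TopK example
+In the following snippet, input is the tensor to operate on, outputA is the result tensor, and indexA receives the result indices; k is set to the size of dimension 0, so the call takes the top 2 along dimension 0:
+```
+/* call TopK function */
+int dim = 0;
+int k = inputDimSize[dim];
+_TopK(input, outputA, indexA, dim, k);
+```
+For a complete TopK example, see:
+NiuTrans.Tensor/Tensor/test/TTopK.cpp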
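The pairing of output values and indices can be reproduced with a short standalone sketch. `TopKDim0` and `TopKResult` are hypothetical names introduced for illustration, not library types; the sketch assumes a row-major 2-D matrix, works along dimension 0 only, and returns the k largest values per column in descending order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: both buffers are k x cols, row-major.
struct TopKResult {
    std::vector<float> values;  // the k largest values of each column
    std::vector<int> indices;   // the input row each value came from
};

// Hypothetical helper: top-k along dimension 0 of a rows x cols matrix.
TopKResult TopKDim0(const std::vector<float>& a,
                    std::size_t rows, std::size_t cols, std::size_t k)
{
    assert(a.size() == rows * cols);
    assert(k <= rows);
    TopKResult r{std::vector<float>(k * cols), std::vector<int>(k * cols)};
    std::vector<int> order(rows);
    for (std::size_t j = 0; j < cols; ++j) {
        // sort the row indices of column j by descending value
        std::iota(order.begin(), order.end(), 0);
        std::stable_sort(order.begin(), order.end(), [&](int p, int q) {
            return a[p * cols + j] > a[q * cols + j];
        });
        for (std::size_t i = 0; i < k; ++i) {
            r.values[i * cols + j] = a[order[i] * cols + j];
            r.indices[i * cols + j] = order[i];
        }
    }
    return r;
}
```

For the 2×4 input above with k = 2, the first column holds 5.0 and 4.0, so its values stay in place with indices 0 and 1, while the second column holds 1.0 and 3.0, so its values swap and the indices become 1 and 0.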
+### function
+This part covers activation functions and loss functions.
+#### Rectify
+##### What is Rectify?
+Rectify is an activation function, defined as:
+>y = max(0, x)
+##### Calling Rectify
+NiuTrans.Tensor provides the Rectify activation function for tensors. Its signature and parameters are:
+```
+Rectify(XTensor * x, XTensor * y)
+```
+Parameters:
+* x - the input tensor
+* y - the output tensor
+##### Rectify example
+In the following snippet, x is the input tensor and y is the output tensor:
+```
+/* call Rectify function */
+Rectify(x, y);
+```
+For a complete Rectify example, see:
+NiuTrans.Tensor/Tensor/test/TRectify.cpp
+#### HardTanH
+##### What is HardTanH?
+HardTanH is an activation function, defined as:
+>y =  1 &nbsp;&nbsp;if x > 1 \
+&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; x &nbsp;&nbsp;if -1 <= x <= 1 \
+&nbsp;&nbsp; &nbsp; -1 &nbsp;&nbsp;if x < -1
+##### Calling HardTanH
+NiuTrans.Tensor provides the HardTanH activation function for tensors. Its signature and parameters are:
+```
+HardTanH(XTensor * x, XTensor * y)
+```
+Parameters:
+* x - the input tensor
+* y - the output tensor
+##### HardTanH example
+In the following snippet, x is the input tensor and y is the output tensor:
+```
+/* call hardtanh function */
+HardTanH(x, y);
+```
+For a complete HardTanH example, see:
+NiuTrans.Tensor/Tensor/test/THardTanH.cpp
+#### Identity
+##### What is Identity?
+Identity is an activation function, defined as:
+>y = x
+##### Calling Identity
+NiuTrans.Tensor provides the Identity activation function for tensors. Its signature and parameters are:
+```
+Identity(XTensor * x, XTensor * y)
+```
+Parameters:
+* x - the input tensor
+* y - the output tensor
+##### Identity example
+In the following snippet, x is the input tensor and y is the output tensor:
+```
+/* call Identity function */
+Identity(x, y);
+```
+For a complete Identity example, see:
+NiuTrans.Tensor/Tensor/test/TIdentity.cpp
+#### LogSoftmax
+##### What is LogSoftmax?
+LogSoftmax is an activation function, defined as:
+>y = log(e^x / \sum_{i} e^{x_i})
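+##### Calling LogSoftmax
+NiuTrans.Tensor provides the LogSoftmax activation function for tensors. Its signature and parameters are:
+```
+LogSoftmax(XTensor * x, XTensor * y, int leadDim)
+```
+Parameters:
+* x - the input tensor
+* y - the output tensor
+* leadDim - the dimension along which the function is applied
+##### LogSoftmax example
+In the following snippet, x is the input tensor and y is the output tensor; LogSoftmax is applied along dimension 1:
+```
+/* call LogSoftmax function */
+LogSoftmax(x, y, 1);
+```
+For a complete LogSoftmax example, see:
+NiuTrans.Tensor/Tensor/test/TLogSoftmax.cpp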
+#### Sigmoid
+##### What is Sigmoid?
+Sigmoid is an activation function, defined as:
+>y = 1/(1+exp(-x))
+##### Calling Sigmoid
+NiuTrans.Tensor provides the Sigmoid activation function for tensors. Its signature and parameters are:
+```
+Sigmoid(XTensor * x, XTensor * y)
+```
+Parameters:
+* x - the input tensor
+* y - the output tensor
+##### Sigmoid example
+In the following snippet, x is the input tensor and y is the output tensor:
+```
+/* call Sigmoid function */
+Sigmoid(x, y);
+```
+For a complete Sigmoid example, see:
+NiuTrans.Tensor/Tensor/test/TSigmoid.cpp
+#### Softmax
+##### What is Softmax?
+Softmax is an activation function, defined as:
+>y = e^x / \sum_{i} e^{x_i}
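+##### Calling Softmax
+NiuTrans.Tensor provides the Softmax activation function for tensors. Its signature and parameters are:
+```
+Softmax(XTensor * x, XTensor * y, int leadDim)
+```
+Parameters:
+* x - the input tensor
+* y - the output tensor
+* leadDim - the dimension along which the function is applied
+##### Softmax example
+In the following snippet, x is the input tensor and y is the output tensor; Softmax is applied along dimension 1:
+```
+/* call Softmax function */
+Softmax(x, y, 1);
+```
+For a complete Softmax example, see:
+NiuTrans.Tensor/Tensor/test/TSoftmax.cpp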
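The Softmax and LogSoftmax definitions can be sketched for a single vector as follows. This is a minimal illustration, not the library code: `LogSoftmax1D` and `Softmax1D` are hypothetical helpers, they handle one 1-D slice rather than a leadDim of a larger tensor, and subtracting the maximum before exponentiating is a common numerical-stability choice that is assumed here rather than taken from the source.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: y_i = x_i - log(sum_j exp(x_j)) for one vector.
std::vector<float> LogSoftmax1D(const std::vector<float>& x)
{
    assert(!x.empty());
    // shifting by the maximum leaves the result unchanged but avoids overflow
    float maxValue = *std::max_element(x.begin(), x.end());
    float sum = 0.0F;
    for (float v : x)
        sum += std::exp(v - maxValue);
    float logSum = maxValue + std::log(sum);

    std::vector<float> y;
    y.reserve(x.size());
    for (float v : x)
        y.push_back(v - logSum);
    return y;
}

// Hypothetical helper: y_i = exp(x_i) / sum_j exp(x_j) for one vector.
std::vector<float> Softmax1D(const std::vector<float>& x)
{
    std::vector<float> y = LogSoftmax1D(x);
    for (float& v : y)
        v = std::exp(v);
    return y;
}
```

The Softmax outputs are positive and sum to 1, and two equal inputs each receive probability 0.5, i.e. a LogSoftmax value of log(0.5).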
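+#### Loss
+##### What is Loss?
+A loss function measures how well a neural network model performs and serves as the objective to optimize. The supported loss functions are defined as: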
+>squared error : loss = sum_{i} 0.5*(gold_i - output_i)^2 \
+cross entropy : loss = sum_{i} (-gold_i * log(output_i)) \
+one hot error : loss = sum_{i} e_i \
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; where e_i = 0.5*(t_i - y_i)^2 &nbsp;&nbsp;if t_i = 1, \
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;e_i = 0 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; otherwise
+##### Calling Loss
+NiuTrans.Tensor provides loss functions for tensors. The signature and parameters are:
+```
+LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName,bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
+```
+Parameters:
+* gold - the gold-standard answer
+* output - the model prediction
+* LFName - the name of the loss function
+* isLogOutput - whether the output is in log scale
+* leadDim - the leading dimension of the output
+* gBeg - the position along leadDim at which to start reading the gold standard
+* gLen - the length of the segment read starting from gBeg
+* oBeg - the position along leadDim at which to start reading the model prediction
+##### Loss example
+A sample call to LossCompute:
+```
+/* call LossCompute function */
+error = LossCompute(gold, output, SQUAREDERROR, false, 0, 0, dimSize[0], 0);
+```
+For a complete Loss example, see:
+NiuTrans.Tensor/Tensor/test/TLoss.cpp
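The three loss definitions above can be written out for a pair of plain vectors. The functions below are hypothetical helpers for illustration and do not reproduce `LossCompute` (no leadDim, gBeg, gLen or oBeg handling). In the cross-entropy sketch, terms with gold_i = 0 are skipped so that 0 * log(0) is treated as 0; that convention is an assumption, not something the source states.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// squared error: loss = sum_i 0.5 * (gold_i - output_i)^2
float SquaredError(const std::vector<float>& gold, const std::vector<float>& output)
{
    assert(gold.size() == output.size());
    float loss = 0.0F;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        float d = gold[i] - output[i];
        loss += 0.5F * d * d;
    }
    return loss;
}

// cross entropy: loss = sum_i (-gold_i * log(output_i)),
// skipping terms with gold_i == 0 (assumed convention)
float CrossEntropy(const std::vector<float>& gold, const std::vector<float>& output)
{
    assert(gold.size() == output.size());
    float loss = 0.0F;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        if (gold[i] != 0.0F)
            loss += -gold[i] * std::log(output[i]);
    }
    return loss;
}

// one hot error: only positions with gold_i == 1 contribute 0.5 * (gold_i - output_i)^2
float OneHotError(const std::vector<float>& gold, const std::vector<float>& output)
{
    assert(gold.size() == output.size());
    float loss = 0.0F;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        if (gold[i] == 1.0F) {
            float d = gold[i] - output[i];
            loss += 0.5F * d * d;
        }
    }
    return loss;
}
```

For gold = (1, 0) and output = (0.5, 0.5), the squared error is 0.25, the cross entropy is log(2), and the one-hot error is 0.125.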
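 ## Advanced techniques
+### Memory pool
 ## Example 1: matrix multiplication
 ## Example 2: feed-forward neural network
 ## Example 3: recurrent neural network
 ## Acknowledgements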
\ No newline at end of file
--- a/source/tensor/XTensor.cpp
+++ b/source/tensor/XTensor.cpp
@@ -542,14 +542,14 @@ void XTensor::SetDataRand(DTYPE lower, DTYPE upper)
    if (dataType == X_FLOAT) {
        d = new float[unitNum];
        for (int i = 0; i < unitNum; i++) {
-            DTYPE value = lower + upper * (float)rand() / RAND_MAX;
+            DTYPE value = lower + (upper - lower) * (float)rand() / RAND_MAX;
            *((float*)d + i) = value;
        }
    }
    else if (dataType == X_DOUBLE) {
        d = new double[unitNum];
        for (int i = 0; i < unitNum; i++) {
-            *((double*)d + i) = rand() / RAND_MAX;
+            *((double*)d + i) = lower + (upper - lower) * rand() / RAND_MAX;
        }
    }
    else {
@@ -922,8 +922,10 @@ set the value of a cell
 >> index - index of the cell for each dimension
 >> 
 */
-bool XTensor::Set(DTYPE value, int * index, int size)
+bool XTensor::Set(DTYPE value, int index[], int size)
 {
+	CheckNTErrors((dataType == DEFAULT_DTYPE), "The tensor is not in default type.");
    return SetToDevice(devID, GetCell(index, size), value);
 }

--- a/source/tensor/core/arithmetic/Absolute.cpp
+++ b/source/tensor/core/arithmetic/Absolute.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "../../XTensor.h"
+#include "Absolute.h"
+#include "Absolute.cuh"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/*
+set every entry to its absolute value
+>> a - the tensor we are processing
+*/
+void Absolute(XTensor * a)
+{
+#ifdef USE_CUDA
+    /* run it on GPUs */
+    if (a->devID >= 0) {
+        CudaAbsolute(a);
+        return;
+    }
+#endif
+    CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
+    DTYPE * d = (DTYPE*)a->data;
+    for (int i = 0; i < a->unitNum; i++)
+        d[i] = (DTYPE)fabs(d[i]);
+}
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/arithmetic/Absolute.cu
+++ b/source/tensor/core/arithmetic/Absolute.cu
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "../../XDevice.h"
+#include "../../XTensor.h"
+#include "Absolute.h"
+#include "Absolute.cuh"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+#ifdef USE_CUDA
+/*
+set each entry to its absolute value (CUDA Kernel)
+>> d - pointer to the data array
+>> size - size of the data array
+*/
+__global__
+void KernelAbsolute(DTYPE * d, int size)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    if (i < size)
+        d[i] = fabs(d[i]);
+}
+/*
+set each entry to its absolute value (CUDA Kernel)
+This is for float16 computation
+>> d - pointer to the data array
+>> size - size of the data array
+*/
+__global__
+void KernelAbsolute(__half * d, int size)
+{
+    /* float16 is not supported yet: this kernel leaves the data unchanged */
+    return;
+}
+/*
+set each entry to its absolute value
+>> a - the tensor
+*/
+extern "C"
+void CudaAbsolute(XTensor * a)
+{
+    CheckNTErrors((a->isSparse == false), "TODO!");
+    int gridSize[3];
+    int blockSize[3];
+    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
+    dim3 blocks(gridSize[0]);
+    dim3 threads(blockSize[0]);
+    int devIDBackup;
+    ProtectCudaDev(a->devID, devIDBackup);
+    if (a->dataType == DEFAULT_DTYPE) {
+        KernelAbsolute << <blocks, threads >> >((DTYPE*)a->data, a->unitNum);
+    }
+    else if (a->dataType == X_FLOAT16) {
+        KernelAbsolute << <blocks, threads >> >((__half*)a->data, a->unitNum);
+    }
+    else {
+        ShowNTErrors("TODO!");
+    }
+    BacktoCudaDev(a->devID, devIDBackup);
+}
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/core/arithmetic/Absolute.cuh
+++ b/source/tensor/core/arithmetic/Absolute.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "Absolute.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+#ifdef USE_CUDA
+/* set each entry to its absolute value (CUDA Kernel) */
+__global__
+void KernelAbsolute(DTYPE * d, int size);
+/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelAbsolute(__half * d, int size);
+/* set each entry to its absolute value */
+extern "C"
+void CudaAbsolute(XTensor * a);
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/arithmetic/Absolute.h
+++ b/source/tensor/core/arithmetic/Absolute.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#ifndef __ABSOLUTE_H__
+#define __ABSOLUTE_H__
+#include "../../XTensor.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/* set every entry to its absolute value */
+extern "C"
+void Absolute(XTensor * a);
+} // namespace nts(NiuTrans.Tensor)
+#endif // __ABSOLUTE_H__
--- a/source/tensor/core/arithmetic/MatrixMulBatched.cpp
+++ b/source/tensor/core/arithmetic/MatrixMulBatched.cpp
@@ -89,9 +89,9 @@ void MatrixMulBatched(XTensor * a, MATRIX_TRANS_TYPE transposedA,
        void * ap = (char*)a->data + aRealBlockSize * p;
        void * bp = (char*)b->data + bRealBlockSize * p;
        void * cp = (char*)c->data + cRealBlockSize * p;
-        XTensor * ai = new XTensor(2, aDimSize, a->dataType, a->denseRatio, a->devID, a->mem);
+        XTensor * ai = NewTensor(2, aDimSize, a->dataType, a->denseRatio, a->devID, a->mem);
-        XTensor * bi = new XTensor(2, bDimSize, b->dataType, b->denseRatio, b->devID, b->mem);
+        XTensor * bi = NewTensor(2, bDimSize, b->dataType, b->denseRatio, b->devID, b->mem);
-        XTensor * ci = new XTensor(2, cDimSize, c->dataType, c->denseRatio, c->devID, c->mem);
+        XTensor * ci = NewTensor(2, cDimSize, c->dataType, c->denseRatio, c->devID, c->mem);
        ai->data = ap;
        bi->data = bp;
        ci->data = cp;

--- a/source/tensor/core/arithmetic/Sign.cpp
+++ b/source/tensor/core/arithmetic/Sign.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "../../XTensor.h"
+#include "Sign.h"
+#include "Sign.cuh"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/*
+set every entry to its sign value
+>> a - the tensor we are processing
+*/
+void Sign(XTensor * a)
+{
+#ifdef USE_CUDA
+    /* run it on GPUs */
+    if (a->devID >= 0) {
+        CudaSign(a);
+        return;
+    }
+#endif
+    CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
+    DTYPE * d = (DTYPE*)a->data;
+    for (int i = 0; i < a->unitNum; i++) {
+        if (d[i] > 0)
+            d[i] = 1.0F;
+        else if (d[i] == 0)
+            d[i] = 0.0F;
+        else
+            d[i] = -1.0F;
+    }
+}
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/arithmetic/Sign.cu
+++ b/source/tensor/core/arithmetic/Sign.cu
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "../../XDevice.h"
+#include "../../XTensor.h"
+#include "Sign.h"
+#include "Sign.cuh"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+#ifdef USE_CUDA
+/*
+set each entry to its sign value (CUDA Kernel)
+>> d - pointer to the data array
+>> size - size of the data array
+*/
+__global__
+void KernelSign(DTYPE * d, int size)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    if (i < size) {
+        if (d[i] > 0)
+            d[i] = 1.0F;
+        else if (d[i] == 0)
+            d[i] = 0.0F;
+        else
+            d[i] = -1.0F;
+    }
+}
+/*
+set each entry to its sign value (CUDA Kernel)
+This is for float16 computation
+>> d - pointer to the data array
+>> size - size of the data array
+*/
+__global__
+void KernelSign(__half * d, int size)
+{
+    /* float16 is not supported yet: this kernel leaves the data unchanged */
+    return;
+}
+/*
+set each entry to its sign value
+>> a - the tensor
+*/
+extern "C"
+void CudaSign(XTensor * a)
+{
+    CheckNTErrors((a->isSparse == false), "TODO!");
+    int gridSize[3];
+    int blockSize[3];
+    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
+    dim3 blocks(gridSize[0]);
+    dim3 threads(blockSize[0]);
+    int devIDBackup;
+    ProtectCudaDev(a->devID, devIDBackup);
+    if (a->dataType == DEFAULT_DTYPE) {
+        KernelSign << <blocks, threads >> >((DTYPE*)a->data, a->unitNum);
+    }
+    else if (a->dataType == X_FLOAT16) {
+        KernelSign << <blocks, threads >> >((__half*)a->data, a->unitNum);
+    }
+    else {
+        ShowNTErrors("TODO!");
+    }
+    BacktoCudaDev(a->devID, devIDBackup);
+}
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/core/arithmetic/Sign.cuh
+++ b/source/tensor/core/arithmetic/Sign.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "Sign.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+#ifdef USE_CUDA
+/* set each entry to its sign value (CUDA Kernel) */
+__global__
+void KernelSign(DTYPE * d, int size);
+/* set each entry to its sign value (CUDA Kernel) with float16 data type */
+__global__
+void KernelSign(__half * d, int size);
+/* set each entry to its sign value */
+extern "C"
+void CudaSign(XTensor * a);
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/arithmetic/Sign.h
+++ b/source/tensor/core/arithmetic/Sign.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#ifndef __SIGN_H__
+#define __SIGN_H__
+#include "../../XTensor.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/* set every entry to its sign value */
+extern "C"
+void Sign(XTensor * a);
+} // namespace nts(NiuTrans.Tensor)
+#endif // __SIGN_H__
--- a/source/tensor/core/arithmetic/SumByColumnVT.cu
+++ b/source/tensor/core/arithmetic/SumByColumnVT.cu
@@ -52,7 +52,7 @@ void KernelADDByColumnVT(DTYPE * a, DTYPE * b, DTYPE * c, int colNum, int rowNum
        DTYPE * bp = b + (rowNum * k + row) * colNum;
        if (colNum % 4 == 0) {
            for (int i = 0; i < colNum; i += 4)
-                sum += bp[i] + bp[i + 1] + b[i + 2] + b[i + 3];
+                sum += bp[i] + bp[i + 1] + bp[i + 2] + bp[i + 3];
        }
        else if (colNum % 2 == 0) {
            for (int i = 0; i < colNum; i += 2)

--- a/source/tensor/core/getandset/ConvertDataType.cpp
+++ b/source/tensor/core/getandset/ConvertDataType.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "../../XTensor.h"
+#include "ConvertDataType.h"
+#include "ConvertDataType.cuh"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/*
+convert data type
+>> input - input tensor
+>> output - output tensor
+*/
+void ConvertTensorDataType(XTensor * input, XTensor * output)
+{
+    CheckNTErrors(XTensor::IsIdentical(input, output), "Input and Output are different in type or size!");
+    if (input->dataType == output->dataType)
+        return;
+#ifdef USE_CUDA
+    /* run it on GPUs */
+    if (input->devID >= 0) {
+        CudaConvertDataType(input, output);
+        return;
+    }
+#endif
+    if (input->dataType == X_FLOAT && output->dataType == X_INT) {
+        float * inputData = (float*)input->data;
+        int * outputData = (int*)output->data;
+        for (int i = 0; i < input->unitNum; i++) 
+            outputData[i] = (int)inputData[i];
+    }
+    else if (input->dataType == X_INT && output->dataType == X_FLOAT) {
+        int * inputData = (int*)input->data;
+        float * outputData = (float*)output->data;
+        for (int i = 0; i < input->unitNum; i++) 
+            outputData[i] = (float)inputData[i];
+    }
+    else
+        ShowNTErrors("Unsupported data types for conversion!");
+}
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
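Review note: the float-to-int path above relies on a C-style cast, which truncates toward zero rather than rounding to nearest. The sketch below illustrates that behavior on the host; `FloatToIntReference` is a hypothetical helper, not part of the library, and it uses `std::vector` instead of `XTensor`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side illustration of the X_FLOAT -> X_INT branch.
// The (int) cast drops the fractional part, i.e. it truncates toward zero.
inline std::vector<int> FloatToIntReference(const std::vector<float> & in)
{
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); i++)
        out[i] = (int)in[i];
    return out;
}
```

Callers that expect rounding to the nearest integer would need to round explicitly before converting.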
--- a/source/tensor/core/getandset/ConvertDataType.cu
+++ b/source/tensor/core/getandset/ConvertDataType.cu
@@ -21,6 +21,7 @@
 #include "../../XTensor.h"
 #include "../../XDevice.h"
+#include "ConvertDataType.cuh"
 namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -49,6 +50,24 @@ void KernelFloat16ToFloat(__half * s, float * t, int size)
    }
 }
+__global__ 
+void KernelFloatToInt(float * inputData, int * outputData, int size)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    if (i < size){
+        outputData[i] = (int)(inputData[i]);
+    }
+}
+__global__ 
+void KernelIntToFloat(int * inputData, float * outputData, int size)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    if (i < size){
+        outputData[i] = (float)(inputData[i]);
+    }
+}
 /* 
 data conversion (cuda code) 
@@ -88,6 +107,39 @@ void CudaConvertDataType(int devID, void * s, TENSOR_DATA_TYPE typeS, void * t, 
    ProtectCudaDev(devID, devIDBackup);
 }
+/*
+convert data type (cuda code) 
+>> input - input tensor
+>> output - output tensor
+*/
+void CudaConvertDataType(XTensor * input, XTensor * output)
+{
+    CheckNTErrors(XTensor::IsIdentical(input, output), "Input and Output are different in type or size!");
+    if (input->dataType == output->dataType)
+        return;
+    int gridSize[3];
+    int blockSize[3];
+    GDevs.GetCudaThread(input->devID, input->unitNum, gridSize, blockSize);
+    dim3 blocks(gridSize[0]);
+    dim3 threads(blockSize[0]);
+    int devIDBackup;
+    ProtectCudaDev(input->devID, devIDBackup);
+    if(input->dataType == X_FLOAT && output->dataType == X_INT)
+        KernelFloatToInt<<<blocks, threads>>>((float*)input->data, (int*)output->data, input->unitNum);
+    else if(input->dataType == X_INT && output->dataType == X_FLOAT)
+        KernelIntToFloat<<<blocks, threads>>>((int*)input->data, (float*)output->data, input->unitNum);
+    else{
+        ShowNTErrors("Unsupported data types for conversion!");
+    }
+    BacktoCudaDev(input->devID, devIDBackup);
+}
 #endif // USE_CUDA
 } // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/getandset/ConvertDataType.cuh
+++ b/source/tensor/core/getandset/ConvertDataType.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "ConvertDataType.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+#ifdef USE_CUDA
+/* convert data type from X_FLOAT to X_FLOAT16 (CUDA Kernel) */
+__global__
+void KernelFloatToFloat16(float * s, __half * t, int size);
+/* convert data type from X_FLOAT16 to X_FLOAT (CUDA Kernel) */
+__global__
+void KernelFloat16ToFloat(__half * s, float * t, int size);
+/* convert data type from X_FLOAT to X_INT (CUDA Kernel) */
+__global__
+void KernelFloatToInt(float * inputData, int * outputData, int size);
+/* convert data type from X_INT to X_FLOAT (CUDA Kernel) */
+__global__
+void KernelIntToFloat(int * inputData, float * outputData, int size);
+/* convert data type */
+void CudaConvertDataType(XTensor * input, XTensor * output);
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/getandset/ConvertDataType.h
+++ b/source/tensor/core/getandset/ConvertDataType.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#ifndef __CONVERTDATATYPE_H__
+#define __CONVERTDATATYPE_H__
+#include "../../XTensor.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/* convert data type */
+void ConvertTensorDataType(XTensor * input, XTensor * output);
+} // namespace nts(NiuTrans.Tensor)
+#endif // __CONVERTDATATYPE_H__
--- a/source/tensor/core/math/Log.cpp
+++ b/source/tensor/core/math/Log.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include <math.h>
+#include "../../XTensor.h"
+#include "Log.h"
+#include "Log.cuh"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/*
+set every entry to its log value
+>> a - the tensor we are processing
+*/
+void Log(XTensor * a)
+{
+#ifdef USE_CUDA
+    /* run it on GPUs */
+    if (a->devID >= 0) {
+        CudaLog(a);
+        return;
+    }
+#endif
+    CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
+    DTYPE * d = (DTYPE*)a->data;
+    for (int i = 0; i < a->unitNum; i++)
+        d[i] = (DTYPE)log(d[i]);
+}
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/math/Log.cu
+++ b/source/tensor/core/math/Log.cu
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "../../XDevice.h"
+#include "../../XTensor.h"
+#include "Log.h"
+#include "Log.cuh"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+#ifdef USE_CUDA
+/*
+set each entry to its log value (CUDA Kernel)
+>> d - pointer to the data array
+>> size - size of the data array
+*/
+__global__
+void KernelLog(DTYPE * d, int size)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    if (i < size)
+        d[i] = log(d[i]);
+}
+/*
+set each entry to its log value (CUDA Kernel)
+This is for float16 computation (not implemented yet: the kernel returns immediately)
+>> d - pointer to the data array
+>> size - size of the data array
+*/
+__global__
+void KernelLog(__half * d, int size)
+{
+    return;
+}
+/*
+set each entry to its log value
+>> a - the tensor
+*/
+extern "C"
+void CudaLog(XTensor * a)
+{
+    CheckNTErrors((a->isSparse == false), "TODO!");
+    int gridSize[3];
+    int blockSize[3];
+    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
+    dim3 blocks(gridSize[0]);
+    dim3 threads(blockSize[0]);
+    int devIDBackup;
+    ProtectCudaDev(a->devID, devIDBackup);
+    if (a->dataType == DEFAULT_DTYPE) {
+        KernelLog<<<blocks, threads>>>((DTYPE*)a->data, a->unitNum);
+    }
+    else if (a->dataType == X_FLOAT16) {
+        KernelLog<<<blocks, threads>>>((__half*)a->data, a->unitNum);
+    }
+    else {
+        ShowNTErrors("TODO!");
+    }
+    BacktoCudaDev(a->devID, devIDBackup);
+}
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/core/math/Log.cuh
+++ b/source/tensor/core/math/Log.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#include "Log.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+#ifdef USE_CUDA
+/* set each entry to its log value (CUDA Kernel) */
+__global__
+void KernelLog(DTYPE * d, int size);
+/* set each entry to its log value (CUDA Kernel) with float16 data type */
+__global__
+void KernelLog(__half * d, int size);
+/* set each entry to its log value */
+extern "C"
+void CudaLog(XTensor * a);
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
--- a/source/tensor/core/math/Log.h
+++ b/source/tensor/core/math/Log.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+/*
+* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+*/
+#ifndef __LOG_H__
+#define __LOG_H__
+#include "../../XTensor.h"
+namespace nts { // namespace nts(NiuTrans.Tensor)
+/* set every entry to its log value */
+extern "C"
+void Log(XTensor * a);
+} // namespace nts(NiuTrans.Tensor)
+#endif // __LOG_H__
--- a/source/tensor/core/math/ScaleAndShift.cuh
+++ b/source/tensor/core/math/ScaleAndShift.cuh
@@ -37,6 +37,7 @@ __global__
 void KernelScaleAndShift(__half * a, __half * b, int size, __half scale, __half shift);
 /* scale and shift all tensor entries b = a * scale + shift (cuda version) */
+extern "C" 
 void _CudaScaleAndShift(const XTensor * a, XTensor * b, DTYPE scale, DTYPE shift);
 #endif // USE_CUDA

--- a/source/tensor/core/movement/CopyBlocks.cpp
+++ b/source/tensor/core/movement/CopyBlocks.cpp
@@ -66,18 +66,19 @@ copy a number of blocks source source positions to target positions
 >> targetBlocks - target positions of the copy
 >> myMem - the memory pool
 */
-void CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem)
+void CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
 {
-    if (myMem != NULL && myMem->devID >= 0) {
+    if (myMem != NULL)
+        CheckNTErrors((myMem->devID == devID), "DevIDs are different between memory pool and input devID!");
+    if (devID >= 0) {
 #ifdef USE_CUDA
-        CudaCopyBlocksSelected(source, blockSize, sourceBlocks, blockNum, target, targetBlocks, myMem);
+        CudaCopyBlocksSelected(source, blockSize, sourceBlocks, blockNum, target, targetBlocks, myMem, devID);
 #else
        ShowNTErrors("Please specify USE_CUDA and recompile the code!");
 #endif
    }
    else {
-        int devID = myMem != NULL ? myMem->devID : -1;
        /* 
        The following code should be fine with GPUs, but too many
        kernel calls would slow down the system. We prefer to use

--- a/source/tensor/core/movement/CopyBlocks.h
+++ b/source/tensor/core/movement/CopyBlocks.h
@@ -30,7 +30,7 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
 void CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem);
 /* copy a number of blocks from source positions to target positions */
-void CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem);
+void CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
 } // namespace nts(NiuTrans.Tensor)

--- a/source/tensor/core/movement/CopyBlocksSelected.cu
+++ b/source/tensor/core/movement/CopyBlocksSelected.cu
@@ -70,28 +70,33 @@ copy a number of blocks from source positions to target positions (cuda version)
 >> targetBlocks - target positions of the copy
 >> myMem - memory pool
 */
-void CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem)
+void CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
 {
-    CheckNTErrors((myMem != NULL), "No memory pool!");
+    CheckNTErrors((devID >= 0), "Wrong device to run!");
-    CheckNTErrors((myMem->devID >= 0), "Wrong device to run!");
    CheckNTErrors((blockSize % sizeof(DTYPE) == 0), "Unsupported block size!");
    /* copy the index to the GPU memory */
-    int * sourceBlocksTMP = (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int));
+    int * sourceBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
-    int * targetBlocksTMP = (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int));
+    int * targetBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
-    XMemCopy(sourceBlocksTMP, myMem->devID, sourceBlocks, -1, blockNum * sizeof(int));
+    XMemCopy(sourceBlocksTMP, devID, sourceBlocks, -1, blockNum * sizeof(int));
-    XMemCopy(targetBlocksTMP, myMem->devID, targetBlocks, -1, blockNum * sizeof(int));
+    XMemCopy(targetBlocksTMP, devID, targetBlocks, -1, blockNum * sizeof(int));
    int cudaGrids[3];
    int cudaBlocks[3];
-    GDevs.GetCudaThread2D(myMem->devID, blockSize / sizeof(DTYPE), blockNum, MAX_INT, cudaGrids, cudaBlocks);
+    GDevs.GetCudaThread2D(devID, blockSize / sizeof(DTYPE), blockNum, MAX_INT, cudaGrids, cudaBlocks);
    KernelCopyBlocksSelected << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
                               ((DTYPE*)source, blockSize / sizeof(DTYPE), sourceBlocksTMP, blockNum, (DTYPE*)target, targetBlocksTMP);
-    myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int));
+    if (myMem != NULL) {
-    myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int));
+        myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int));
+        myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int));
+    }
+    else {
+        XMemFree(devID, sourceBlocksTMP);
+        XMemFree(devID, targetBlocksTMP);
+    }
 }
 #endif // USE_CUDA

--- a/source/tensor/core/movement/CopyBlocksSelected.cuh
+++ b/source/tensor/core/movement/CopyBlocksSelected.cuh
@@ -34,7 +34,7 @@ void KernelCopyBlocksSelected(DTYPE * source, int blockSize, int * sourceBlocks,
 /* copy a number of blocks from source positions to target positions (cuda version) */
 extern "C"
-void CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem);
+void CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
 #endif // USE_CUDA

--- a/source/tensor/core/movement/CopyIndexed.cpp
+++ b/source/tensor/core/movement/CopyIndexed.cpp
@@ -84,7 +84,7 @@ bool CopyIndexed(XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSiz
        CheckNTErrors((tgtIndex[i] < blockNumTgt), "Index is out of range!");
    }
-    CopyBlocks(s->data, blockSizeSrc * s->unitSize, realSrcIndex, realIndexSize, t->data, realTgtIndex, s->mem);
+    CopyBlocks(s->data, blockSizeSrc * s->unitSize, realSrcIndex, realIndexSize, t->data, realTgtIndex, s->mem, s->devID);
    delete[] realSrcIndex;
    delete[] realTgtIndex;

--- a/source/tensor/core/movement/CopyValues.h
+++ b/source/tensor/core/movement/CopyValues.h
@@ -27,8 +27,9 @@
 namespace nts { // namespace nts(NiuTrans.Tensor)
 /* copy s to t */
+extern "C"
 bool CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL);
 } // namespace nts(NiuTrans.Tensor)
 #endif // __COPYVALUES_H__
\ No newline at end of file
--- a/source/tensor/function/Loss.cpp
+++ b/source/tensor/function/Loss.cpp
@@ -21,6 +21,7 @@
 #include <math.h>
 #include "Loss.h"
+#include "Loss.cuh"
 namespace nts{ // namespace nts(NiuTrans.Tensor)
@@ -43,137 +44,142 @@ compute the loss
 DTYPE LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName,
                  bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
 {
-    CheckNTErrors((gLen >= 0 && gLen <= output->unitNum), "Illegal input length!");
-    CheckNTErrors((XTensor::IsIdentical(gold, output)), "The input tensors must be of the same size!");
-    CheckNTErrors((gold->dimSizeRDI[0] == 1 && output->dimSizeRDI[0] == 1), "TODO!");
-    CheckNTErrors((gold->order > leadDim && leadDim >= 0), "Illegal leading dimension!");
-    CheckNTErrors((gold->dataType == DEFAULT_DTYPE && output->dataType == DEFAULT_DTYPE),
-                         "TODO!");
-    int leadDimRDI = output->order - leadDim - 1;
-    int dimensionSize = output->dimSizeRDI[leadDimRDI];
-    int stride = 1;
-    int blockSize = 1;
-    int blockNum = 1;
-    for(int i = 0; i < leadDimRDI; i++)
-        stride *= output->dimSizeRDI[i];
-    blockSize = stride * dimensionSize;
-    blockNum = output->unitNum / blockSize;
-    if(isLogOutput)
-        return LossComputeForLogScale(gold, output, LFName, leadDim, gBeg, gLen, oBeg);
-    DTYPE * gp = (DTYPE*)gold->data;
-    DTYPE * op = (DTYPE*)output->data;
    DTYPE error = 0.0F;
+    if (output->devID < 0) {
-    /* 
+        CheckNTErrors((gLen >= 0 && gLen <= output->unitNum), "Illegal input length!");
-    squared error 
+        CheckNTErrors((XTensor::IsIdentical(gold, output)), "The input tensors must be of the same size!");
-    loss = sum_{i} 0.5*(gold_i - output_i)^2
+        CheckNTErrors((gold->dimSizeRDI[0] == 1 && output->dimSizeRDI[0] == 1), "TODO!");
-    where gold_i is the gold standard and output_i is the model prediction
+        CheckNTErrors((gold->order > leadDim && leadDim >= 0), "Illegal leading dimension!");
-    */
+        CheckNTErrors((gold->dataType == DEFAULT_DTYPE && output->dataType == DEFAULT_DTYPE),
-    if(LFName == SQUAREDERROR){
+                             "TODO!");
-        if(gold->isSparse){
-            CheckNTErrors((gBeg == 0 && gLen == dimensionSize), "TODO!");
+        int leadDimRDI = output->order - leadDim - 1;
-            for(int i = 0; i < blockSize; i++){
+        int dimensionSize = output->dimSizeRDI[leadDimRDI];
-                DTYPE diff = 0 - *(op + oBeg + i);
+        int stride = 1;
-                error += (DTYPE)0.5 * diff * diff;
+        int blockSize = 1;
-            }
+        int blockNum = 1;
-            int num = gold->GetNonzeroSize();
-            for(int i = 0; i < num; i++){
+        for(int i = 0; i < leadDimRDI; i++)
-                int key = gold->GetKeyInSparse(i);
+            stride *= output->dimSizeRDI[i];
-                DTYPE value = gold->GetInSparse(i);
+        blockSize = stride * dimensionSize;
-                int offset = key - gBeg;
+        blockNum = output->unitNum / blockSize;
-                DTYPE diff = value - *(op + oBeg + offset);
-                error += (DTYPE)0.5 * diff * diff;
+        if(isLogOutput)
-                DTYPE diff2 = 0 - *(op + oBeg + offset);
+            return LossComputeForLogScale(gold, output, LFName, leadDim, gBeg, gLen, oBeg);
-                error -= (DTYPE)0.5 * diff2 * diff2;
-            }
+        DTYPE * gp = (DTYPE*)gold->data;
-        }
+        DTYPE * op = (DTYPE*)output->data;
-        else{
-            for(int k = 0; k < blockNum; k++){
+        /* 
-                int bg = k * blockSize + gBeg * stride;
+        squared error 
-                int og = k * blockSize + oBeg * stride;
+        loss = sum_{i} 0.5*(gold_i - output_i)^2
-                int size = stride * gLen;
+        where gold_i is the gold standard and output_i is the model prediction
-                for(int i = 0; i < size; i++){
+        */
-                    DTYPE diff = *(gp + bg + i) - *(op + og + i);
+        if(LFName == SQUAREDERROR){
+            if(gold->isSparse){
+                CheckNTErrors((gBeg == 0 && gLen == dimensionSize), "TODO!");
+                for(int i = 0; i < blockSize; i++){
+                    DTYPE diff = 0 - *(op + oBeg + i);
                    error += (DTYPE)0.5 * diff * diff;
                }
+                int num = gold->GetNonzeroSize();
+                for(int i = 0; i < num; i++){
+                    int key = gold->GetKeyInSparse(i);
+                    DTYPE value = gold->GetInSparse(i);
+                    int offset = key - gBeg;
+                    DTYPE diff = value - *(op + oBeg + offset);
+                    error += (DTYPE)0.5 * diff * diff;
+                    DTYPE diff2 = 0 - *(op + oBeg + offset);
+                    error -= (DTYPE)0.5 * diff2 * diff2;
+                }
+            }
+            else{
+                for(int k = 0; k < blockNum; k++){
+                    int bg = k * blockSize + gBeg * stride;
+                    int og = k * blockSize + oBeg * stride;
+                    int size = stride * gLen;
+                    for(int i = 0; i < size; i++){
+                        DTYPE diff = *(gp + bg + i) - *(op + og + i);
+                        error += (DTYPE)0.5 * diff * diff;
+                    }
+                }
            }
        }
-    }
-    /* 
+        /* 
-    cross entropy
+        cross entropy
-    loss = sum_{i} (-gold_i * log(output_i))
+        loss = sum_{i} (-gold_i * log(output_i))
-    where gold and output are distributions 
+        where gold and output are distributions 
-    */
+        */
-    if(LFName == CROSSENTROPY){
+        if(LFName == CROSSENTROPY){
-        if(gold->isSparse){
+            if(gold->isSparse){
-            CheckNTErrors((gBeg == 0 && gLen == dimensionSize), "TODO!");
+                CheckNTErrors((gBeg == 0 && gLen == dimensionSize), "TODO!");
-            int num = gold->GetNonzeroSize();
+                int num = gold->GetNonzeroSize();
-            for(int i = 0; i < num; i++){
+                for(int i = 0; i < num; i++){
-                int key = gold->GetKeyInSparse(i);
+                    int key = gold->GetKeyInSparse(i);
-                DTYPE value = gold->GetInSparse(i);
+                    DTYPE value = gold->GetInSparse(i);
-                int offset = key - gBeg;
+                    int offset = key - gBeg;
-                error += -value * (DTYPE)log((*(op + oBeg + offset)));
+                    error += -value * (DTYPE)log((*(op + oBeg + offset)));
+                }
            }
-        }
+            else{
-        else{
+                for(int k = 0; k < blockNum; k++){
-            for(int k = 0; k < blockNum; k++){
+                    int bg = k * blockSize + gBeg * stride;
-                int bg = k * blockSize + gBeg * stride;
+                    int og = k * blockSize + oBeg * stride;
-                int og = k * blockSize + oBeg * stride;
+                    int size = stride * gLen;
-                int size = stride * gLen;
+                    for(int i = 0; i < size; i++){
-                for(int i = 0; i < size; i++){
+                        error += -(*(gp + bg + i)) * (DTYPE)log(*(op + og + i));
-                    error += -(*(gp + bg + i)) * (DTYPE)log(*(op + og + i));
+                    }
                }
            }
        }
-    }
+        /*
-    /*
+        one hot error
-    one hot error
+        loss = sum_{i} e_i 
-    loss = sum_{i} e_i 
+        where e_i = 0.5*(t_i - y_i)^2 if t_i = 1, 
-    where e_i = 0.5*(t_i - y_i)^2 if t_i = 1, 
+              e_i = 0 otherwise
-          e_i = 0 otherwise
+        */
-    */
+        if(LFName == ONEHOTERROR){
-    if(LFName == ONEHOTERROR){
+            if(gold->isSparse){
-        if(gold->isSparse){
+                CheckNTErrors((gBeg == 0 && gLen == dimensionSize), "TODO!");
-            CheckNTErrors((gBeg == 0 && gLen == dimensionSize), "TODO!");
+                for(int i = 0; i < blockSize; i++){
-            for(int i = 0; i < blockSize; i++){
+                    DTYPE diff = 0 - *(op + oBeg + i);
-                DTYPE diff = 0 - *(op + oBeg + i);
+                    error += (DTYPE)0.5 * diff * diff;
-                error += (DTYPE)0.5 * diff * diff;
+                }
-            }
+                int num = gold->GetNonzeroSize();
-            int num = gold->GetNonzeroSize();
+                for(int i = 0; i < num; i++){
-            for(int i = 0; i < num; i++){
+                    int key = gold->GetKeyInSparse(i);
-                int key = gold->GetKeyInSparse(i);
+                    DTYPE value = gold->GetInSparse(i);
-                DTYPE value = gold->GetInSparse(i);
+                    int offset = key - gBeg;
-                int offset = key - gBeg;
-                if(value >= 1.0F)
-                    continue;
-                DTYPE diff0 = 0 - *(op + oBeg + offset);
+                    if(value >= 1.0F)
-                error += (DTYPE)0.5 * diff0 * diff0;
-                DTYPE diff = value - *(op + oBeg + offset);
-                error += (DTYPE)0.5 * diff * diff;
-                DTYPE diff2 = 0 - *(op + oBeg + offset);
-                error -= (DTYPE)0.5 * diff2 * diff2;
-            }
-        }
-        else{
-            for(int k = 0; k < blockNum; k++){
-                int size = stride * gLen;
-                for(int i = 0; i < size; i++){
-                    if(*(gp + gBeg + i) >= 1.0F)
                        continue;
-                    DTYPE diff = *(gp + gBeg + i) - *(op + oBeg + i);
+                    DTYPE diff0 = 0 - *(op + oBeg + offset);
+                    error += (DTYPE)0.5 * diff0 * diff0;
+                    DTYPE diff = value - *(op + oBeg + offset);
                    error += (DTYPE)0.5 * diff * diff;
+                    DTYPE diff2 = 0 - *(op + oBeg + offset);
+                    error -= (DTYPE)0.5 * diff2 * diff2;
+                }
+            }
+            else{
+                for(int k = 0; k < blockNum; k++){
+                    int size = stride * gLen;
+                    for(int i = 0; i < size; i++){
+                        if(*(gp + gBeg + i) < 1.0F)
+                            continue;
+                        DTYPE diff = *(gp + gBeg + i) - *(op + oBeg + i);
+                        error += (DTYPE)0.5 * diff * diff;
+                    }
                }
            }
        }
    }
+    else {
+        error = CudaLossCompute(gold, output, LFName, isLogOutput, leadDim, gBeg, gLen, oBeg);
+    }
    return error;
 }
@@ -374,94 +380,104 @@ void LossBackward(XTensor * dedy, XTensor * t, XTensor * y,
                  LOSS_FUNCTION_NAME LFName, 
                  int leadDim, int tBeg, int tLen, int yBeg)
 {
-    CheckNTErrors((tLen < y->unitNum), "Illegal input length!");
+    if (y->devID < 0) {
-    CheckNTErrors((XTensor::IsIdentical(t, y)&& XTensor::IsIdentical(dedy, y)), 
+        CheckNTErrors((tLen <= y->unitNum), "Illegal input length!");
-                        "The input tensors must be of the same size!");
+        CheckNTErrors((XTensor::IsIdentical(t, y)&& XTensor::IsIdentical(dedy, y)), 
-    CheckNTErrors((t->dimSizeRDI[0] == 1 && y->dimSizeRDI[0] == 1 && dedy->dimSizeRDI[0] == 1), "TODO!");
+                            "The input tensors must be of the same size!");
-    CheckNTErrors((t->order > leadDim && leadDim >= 0), "Illegal leading dimension!");
+        CheckNTErrors(((dedy->devID == t->devID) && (dedy->devID == y->devID)), "Tensors must be on the same device!");
-    CheckNTErrors((t->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE),
+        CheckNTErrors((t->order > leadDim), "Illegal leading dimension!");
-                         "TODO!");
+        CheckNTErrors((t->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE),
+                             "TODO!");
-    int leadDimRDI = leadDim >= 0 ? y->order - leadDim - 1 : -1;
-    if(leadDimRDI < 0){
+        int leadDimRDI = leadDim >= 0 ? y->order - leadDim - 1 : -1;
-        leadDimRDI = y->dimSizeRDI[y->order - 1];
+        if(leadDimRDI < 0){
-        tBeg = 0;
+            leadDimRDI = y->order - 1;
-        yBeg = 0;
+            tBeg = 0;
-        tLen = y->dimSizeRDI[leadDimRDI];
+            yBeg = 0;
-    }
+            tLen = y->dimSizeRDI[leadDimRDI];
+        }
-    int dimensionSize = y->dimSizeRDI[leadDimRDI];
-    int stride = 1;
-    int blockSize = 1;
-    int blockNum = 1;
-    for(int i = 0; i < leadDimRDI; i++)
-        stride *= y->dimSizeRDI[i];
-    blockSize = stride * dimensionSize;
-    blockNum = y->unitNum / blockSize;
-    DTYPE * tp = (DTYPE*)t->data;
-    DTYPE * yp = (DTYPE*)y->data;
-    DTYPE * dedyp = (DTYPE*)dedy->data;
-    CheckNTErrors((t->dataType == DEFAULT_DTYPE && 
-                   y->dataType == DEFAULT_DTYPE && 
-                   dedy->dataType == DEFAULT_DTYPE),
-                   "Input vectors are not in default type!");
-    /* 
+        int dimensionSize = y->dimSizeRDI[leadDimRDI];
-    squared error 
+        int stride = 1;
-    loss = sum_{i} 0.5*(t_i - y_i)^2, where t_i is the gold standard and y_i is the model output
+        int blockSize = 1;
-    dloss/dy_i = y_i - t_i
+        int blockNum = 1;
-    */
-    if(LFName == SQUAREDERROR){
+        for(int i = 0; i < leadDimRDI; i++)
-        if(t->isSparse){
+            stride *= y->dimSizeRDI[i];
-            CheckNTErrors((tBeg == 0 && tLen == dimensionSize), "TODO!");
+        blockSize = stride * dimensionSize;
-            int num = t->GetNonzeroSize();
+        blockNum = y->unitNum / blockSize;
-            for(int i = 0; i < num; i++){
-                int key = t->GetKeyInSparse(i);
+        DTYPE * tp = (DTYPE*)t->data;
-                DTYPE value = t->GetInSparse(i);
+        DTYPE * yp = (DTYPE*)y->data;
-                if(key >= tBeg && key < tBeg + tLen)
+        DTYPE * dedyp = (DTYPE*)dedy->data;
-                    *(dedyp + yBeg + key - tBeg) = -value;
-            }
+        CheckNTErrors((t->dataType == DEFAULT_DTYPE && 
-            for(int i = 0; i < tLen; i++){
+                       y->dataType == DEFAULT_DTYPE && 
-                *(dedyp + yBeg + i) += *(yp + yBeg + i);
+                       dedy->dataType == DEFAULT_DTYPE),
+                       "Input vectors are not in default type!");
+        /* 
+        squared error 
+        loss = sum_{i} 0.5*(t_i - y_i)^2, where t_i is the gold standard and y_i is the model output
+        dloss/dy_i = y_i - t_i
+        */
+        if(LFName == SQUAREDERROR){
+            if(t->isSparse){
+                CheckNTErrors((tBeg == 0 && tLen == dimensionSize), "TODO!");
+                int num = t->GetNonzeroSize();
+                for(int i = 0; i < num; i++){
+                    int key = t->GetKeyInSparse(i);
+                    DTYPE value = t->GetInSparse(i);
+                    if(key >= tBeg && key < tBeg + tLen)
+                        *(dedyp + yBeg + key - tBeg) = -value;
+                }
+                for(int i = 0; i < tLen; i++){
+                    *(dedyp + yBeg + i) += *(yp + yBeg + i);
+                }
            }
-        }
+            else{
-        else{
+                for(int k = 0; k < blockNum; k++){
-            for(int k = 0; k < blockNum; k++){
+                    int bg = k * blockSize + tBeg * stride;
-                int bg = k * blockSize + tBeg * stride;
+                    int yg = k * blockSize + yBeg * stride;
-                int yg = k * blockSize + yBeg * stride;
+                    int size = stride * tLen;
-                int size = stride * tLen;
+                    for(int i = 0; i < size; i++){
-                for(int i = 0; i < size; i++){
+                        *(dedyp + yg + i) = *(yp + yg + i) - *(tp + bg + i);
-                    *(dedyp + bg + i) = *(yp + yBeg + i) - *(tp + yg + i);
+                    }
                }
            }
        }
-    }
-    /* 
+        /* 
-    cross entropy
+        cross entropy
-    loss = sum_{i} (-t_i * log(y_i)), where t and y are distributions 
+        loss = sum_{i} (-t_i * log(y_i)), where t and y are distributions 
-    dloss/dy_i = -t_i / y_i
+        dloss/dy_i = -t_i / y_i
-    */
+        */
-    if(LFName == CROSSENTROPY){
+        if(LFName == CROSSENTROPY){
-        if(t->isSparse){
+            if(t->isSparse){
-            memset(dedyp + yBeg, 0, sizeof(DTYPE) * tLen);
+                memset(dedyp + yBeg, 0, sizeof(DTYPE) * tLen);
-            int num = t->GetNonzeroSize();
+                int num = t->GetNonzeroSize();
-            for(int i = 0; i < num; i++){
+                for(int i = 0; i < num; i++){
-                int key = t->GetKeyInSparse(i);
+                    int key = t->GetKeyInSparse(i);
-                DTYPE value = t->GetInSparse(i);
+                    DTYPE value = t->GetInSparse(i);
-                if(key >= tBeg && key < tBeg + tLen)
+                    if(key >= tBeg && key < tBeg + tLen)
-                    *(dedyp + yBeg + key - tBeg) = -value/(DTYPE)*(yp + yBeg + key - tBeg);
+                        *(dedyp + yBeg + key - tBeg) = -value/(DTYPE)*(yp + yBeg + key - tBeg);
+                }
            }
-        }
+            else{
-        else{
+                for (int i = 0; i < blockNum; i++) {
-            for(int i = 0; i < tLen; i++){
+                    for (int j = 0; j < stride; j++) {
-                *(dedyp + yBeg + i) = -(DTYPE)*(tp + tBeg + i)/(DTYPE)*(yp + yBeg + i);
+                        for (int k = 0; k < tLen; k++) {
+                            *(dedyp + i * stride * dimensionSize + j + stride * (yBeg + k)) = -(DTYPE)*(tp + i * stride * dimensionSize
+                                + j + stride * (tBeg + k)) / (DTYPE)*(yp +  i * stride * dimensionSize + j + stride * (yBeg + k));
+                        }
+                    }
+                }
            }
        }
    }
+    else {
+        CudaLossBackward(dedy, t, y, LFName, leadDim, tBeg, tLen, yBeg);
+    }
 }
 } // namespace nts(NiuTrans.Tensor)
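
The dense cross-entropy branch of `LossBackward` above walks the tensor as `blockNum` blocks of `dimensionSize * stride` elements and writes `-t/y` along the leading dimension. A minimal sketch of that indexing on plain buffers follows. The function name and the use of `std::vector<float>` are assumptions made for illustration and are not part of NiuTrans.Tensor; the sketch also assumes `tBeg + len` and `yBeg + len` do not exceed `dimensionSize`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of NiuTrans.Tensor: computes
// dedy = -t / y along one dimension of a dense, row-major buffer.
//   blockNum      - product of the dimension sizes before the leading one
//   dimensionSize - size of the leading dimension
//   stride        - product of the dimension sizes after the leading one
//   tBeg, yBeg    - first index along the leading dimension for t and y
//   len           - number of entries to process along that dimension
void crossEntropyBackward(std::vector<float>& dedy,
                          const std::vector<float>& t,
                          const std::vector<float>& y,
                          int blockNum, int dimensionSize, int stride,
                          int tBeg, int yBeg, int len)
{
    for (int b = 0; b < blockNum; b++) {
        for (int s = 0; s < stride; s++) {
            // first element of this (block, stride position) pair
            std::size_t base = (std::size_t)b * stride * dimensionSize + s;
            for (int k = 0; k < len; k++) {
                std::size_t ti = base + (std::size_t)stride * (tBeg + k);
                std::size_t yi = base + (std::size_t)stride * (yBeg + k);
                dedy[yi] = -t[ti] / y[yi];
            }
        }
    }
}
```

For a tensor of size (1, 3) with the last dimension as the leading one, the call would use `blockNum = 1`, `dimensionSize = 3`, `stride = 1`.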
--- a/source/tensor/function/Loss.cu
+++ b/source/tensor/function/Loss.cu
@@ -22,6 +22,14 @@
 #include "Loss.h"
 #include "Loss.cuh"
 #include "../XDevice.h"
+#include "../core/math/Power.h"
+#include "../core/math/ScaleAndShift.h"
+#include "../core/math/Log.h"
+#include "../core/arithmetic/Negate.h"
+#include "../core/arithmetic/Sum.h"
+#include "../core/arithmetic/Multiply.h"
+#include "../core/reduce/ReduceSum.h"
+#include "../core/movement/CopyValues.h"
 namespace nts{ // namespace nts(NiuTrans.Tensor)
@@ -46,7 +54,126 @@ compute the loss
 DTYPE CudaLossCompute(XTensor * gold, XTensor * y, LOSS_FUNCTION_NAME LFName,
                      bool isLogOutput, int leadDim, int gBeg, int gLen, int yBeg)
 {
-    return 0;
+    CheckNTErrors((gLen >= 0 && gLen <= y->unitNum), "Illegal input length!");
+    CheckNTErrors((XTensor::IsIdentical(gold, y)), "The input tensors must be of the same size!");
+    CheckNTErrors((gold->dimSizeRDI[0] == 1 && y->dimSizeRDI[0] == 1), "TODO!");
+    CheckNTErrors((gold->order > leadDim && leadDim >= 0), "Illegal leading dimension!");
+    CheckNTErrors((gold->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE),
+                         "TODO!");
+    CheckNTErrors((gold->devID == y->devID), "Tensors must be on the same device!");
+    CheckNTErrors((gold->devID >= 0), "Tensors must be on GPU device!");
+    CheckNTErrors((gLen == gold->dimSize[leadDim] && gBeg == 0 && yBeg == 0), "TODO!");
+    if(isLogOutput)
+        return LossComputeForLogScale(gold, y, LFName, leadDim, gBeg, gLen, yBeg);
+    DTYPE error = 0.0F;
+    /* 
+    squared error 
+    loss = sum_{i} 0.5*(gold_i - output_i)^2
+    where gold_i is the gold standard and output_i is the model prediction
+    */
+    if(LFName == SQUAREDERROR){
+        XTensor * diff = NewTensor(gold->order, gold->dimSize, gold->dataType, gold->denseRatio, gold->devID, gold->mem);
+        _Sum(gold, y, diff, -1.0F);
+        Power(diff, 2.0F);
+        _ScaleAndShiftMe(diff, 0.5F, 0.0F);
+        int reduceTimes = diff->order;
+        for (int i = 0; i < reduceTimes; i++) {
+            int diffOrder = diff->order - 1;
+            int * diffDimSize = new int[diffOrder];
+            memcpy(diffDimSize, diff->dimSize + 1, diffOrder * sizeof(int));
+            XTensor * diffNew = NewTensor(diffOrder, diffDimSize, X_FLOAT, 1.0F, diff->devID, diff->mem);
+            int reducePlace = diff->dimSize[0] == 1 ? 1 : 0;
+            ReduceSum(diff, diffNew, reducePlace);
+            if (diffNew->order == 1) {
+                diffNew->order = 2;
+                diffNew->dimSize[1] = diffNew->dimSize[0];
+                diffNew->dimSize[0] = 1;
+                diffNew->dimSizeRDI[1] = 1;
+            }
+            delete diff;
+            diff = diffNew;
+            delete[] diffDimSize;
+        }
+        error = diff->Get2D(0, 0);
+        delete diff;
+    }
+    /* 
+    cross entropy
+    loss = sum_{i} (-gold_i * log(output_i))
+    where gold and output are distributions 
+    */
+    if(LFName == CROSSENTROPY){
+        XTensor * diff = NewTensor(y->order, y->dimSize, y->dataType, y->denseRatio, y->devID, y->mem);
+        CopyValues(y, diff);
+        Log(diff);
+        _Multiply(gold, diff, diff);
+        Negate(diff);
+        int reduceTimes = diff->order;
+        for (int i = 0; i < reduceTimes; i++) {
+            int diffOrder = diff->order - 1;
+            int * diffDimSize = new int[diffOrder];
+            memcpy(diffDimSize, diff->dimSize + 1, diffOrder * sizeof(int));
+            XTensor * diffNew = NewTensor(diffOrder, diffDimSize, X_FLOAT, 1.0F, diff->devID, diff->mem);
+            int reducePlace = diff->dimSize[0] == 1 ? 1 : 0;
+            ReduceSum(diff, diffNew, reducePlace);
+            if (diffNew->order == 1) {
+                diffNew->order = 2;
+                diffNew->dimSize[1] = diffNew->dimSize[0];
+                diffNew->dimSize[0] = 1;
+                diffNew->dimSizeRDI[1] = 1;
+            }
+            delete diff;
+            diff = diffNew;
+            delete[] diffDimSize;
+        }
+        error = diff->Get2D(0, 0);
+        delete diff;
+    }
+    /*
+    one hot error
+    loss = sum_{i} e_i 
+    where e_i = 0.5*(t_i - y_i)^2 if t_i = 1, 
+          e_i = 0 otherwise
+    */
+    if(LFName == ONEHOTERROR){
+        XTensor * diff = NewTensor(gold->order, gold->dimSize, gold->dataType, gold->denseRatio, gold->devID, gold->mem);
+        XTensor * yOnehot = NewTensor(y->order, y->dimSize, y->dataType, y->denseRatio, y->devID, y->mem);
+        CopyValues(y, yOnehot);
+        _Multiply(gold, y, yOnehot);
+        _Sum(gold, yOnehot, diff, -1.0F);
+        Power(diff, 2.0F);
+        _ScaleAndShiftMe(diff, 0.5F, 0.0F);
+        int reduceTimes = diff->order;
+        for (int i = 0; i < reduceTimes; i++) {
+            int diffOrder = diff->order - 1;
+            int * diffDimSize = new int[diffOrder];
+            memcpy(diffDimSize, diff->dimSize + 1, diffOrder * sizeof(int));
+            XTensor * diffNew = NewTensor(diffOrder, diffDimSize, X_FLOAT, 1.0F, diff->devID, diff->mem);
+            int reducePlace = diff->dimSize[0] == 1 ? 1 : 0;
+            ReduceSum(diff, diffNew, reducePlace);
+            if (diffNew->order == 1) {
+                diffNew->order = 2;
+                diffNew->dimSize[1] = diffNew->dimSize[0];
+                diffNew->dimSize[0] = 1;
+                diffNew->dimSizeRDI[1] = 1;
+            }
+            delete diff;
+            diff = diffNew;
+            delete[] diffDimSize;
+        }
+        error = diff->Get2D(0, 0);
+        delete diff;
+        delete yOnehot;
+    }
+    return error;
    // TODO: call cuda kernels for computing the errors
 }
@@ -140,13 +267,25 @@ backward computation for cross entropy (Cuda kernel)
 >> size - size of the vector (dedy)
 */
 extern "C" __global__ 
-void KernelLossBackwardCrossEntropy(DTYPE * dedy, DTYPE * t, DTYPE * y, int size)
+void KernelLossBackwardCrossEntropy(DTYPE * dedy, DTYPE * t, DTYPE * y, int tBeg, int tLen, int yBeg, int blockNum, int stride, int dimensionSize)
 {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    if (i >= stride * dimensionSize * blockNum)
+        return;
-    if (i < size){
+    int blockNumIndex = i / (stride * dimensionSize);
+    int blockNumTail = i % (stride * dimensionSize);
+    int dimensionSizeIndex = blockNumTail / stride;
+    int strideIndex = blockNumTail % stride;
+    if (dimensionSizeIndex >= tLen)
+        return;
+    dedy[blockNumIndex * stride * dimensionSize + strideIndex + stride * (yBeg + dimensionSizeIndex)] = -t[blockNumIndex * stride * dimensionSize + 
+        strideIndex + stride * (tBeg + dimensionSizeIndex)] / y[blockNumIndex * stride * dimensionSize + strideIndex + stride * (yBeg + dimensionSizeIndex)];
+    /*if (i < size){
        dedy[i] =  -t[i]/y[i];
-    }
+    }*/
 }
 /* 
@@ -193,9 +332,11 @@ void CudaLossBackward(XTensor * dedy, XTensor * t, XTensor * y,
                      LOSS_FUNCTION_NAME LFName, 
                      int leadDim, int tBeg, int tLen, int yBeg)
 {
+    CheckNTErrors((tLen <= y->unitNum), "Illegal input length!");
    CheckNTErrors((XTensor::IsIdentical(t, y)&& XTensor::IsIdentical(dedy, y)), 
                        "The input tensors must be of the same size!");
-    CheckNTErrors((t->dimSizeRDI[0] == 1 && y->dimSizeRDI[0] == 1 && dedy->dimSizeRDI[1] == 1), "TODO!");
+    CheckNTErrors(((dedy->devID == t->devID) && (dedy->devID == y->devID)), "Tensors must be on the same device!");
+    CheckNTErrors((t->order > leadDim), "Illegal leading dimension!");
    CheckNTErrors((t->dataType == DEFAULT_DTYPE && 
                         y->dataType == DEFAULT_DTYPE && 
                         dedy->dataType == DEFAULT_DTYPE),
@@ -208,21 +349,25 @@ void CudaLossBackward(XTensor * dedy, XTensor * t, XTensor * y,
                        "The vectors must be on the same GPU.");
    CheckNTErrors((tBeg == yBeg), "TODO!");
-    int leadDimRDI = y->order - leadDim - 1;
+    int leadDimRDI = leadDim >= 0 ? y->order - leadDim - 1 : -1;
    if(leadDimRDI < 0){
-        leadDimRDI = y->dimSizeRDI[y->order - 1];
+        leadDimRDI = y->order - 1;
        tBeg = 0;
        yBeg = 0;
        tLen = y->dimSizeRDI[leadDimRDI];
    }
+    int dimensionSize = y->dimSizeRDI[leadDimRDI];
    int stride = 1;
    int blockSize = 1;
+    int blockNum = 1;
    int size = 1;
    for(int i = 0; i < leadDimRDI; i++)
        stride *= y->dimSizeRDI[i];
    size = tLen * stride;
+    blockSize = stride * dimensionSize;
+    blockNum = y->unitNum / blockSize;
    int cudaGridSize[3], cudaBlockSize[3];
@@ -265,7 +410,7 @@ void CudaLossBackward(XTensor * dedy, XTensor * t, XTensor * y,
            ShowNTErrors("TODO!");
        }
        else if(size == y->unitNum){
-            KernelLossBackwardCrossEntropy<<<blocks, threads>>>(dedyp, tp, yp, tLen);
+            KernelLossBackwardCrossEntropy<<<blocks, threads>>>(dedyp, tp, yp, tBeg, tLen, yBeg, blockNum, stride, dimensionSize);
        }
        else{
            KernelLossBackwardCrossEntropyBlock<<<blocks, threads>>>(dedyp, tp, yp, blockSize, tBeg * stride, tLen * stride, y->unitNum);
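
The reworked `KernelLossBackwardCrossEntropy` turns a flat thread index into a (block, leading-dimension position, stride position) triple before touching memory. The CPU-side model below is a sketch of that arithmetic only. `mapThread` is a hypothetical name; the parameters mirror the kernel's, and it assumes `tBeg + tLen` and `yBeg + tLen` stay within `dimensionSize`. Because valid flat indices run from `0` to `total - 1`, the range guard in the sketch is written as `i >= total`.

```cpp
// Hypothetical CPU model of the per-thread index arithmetic; not library code.
// Returns true and fills tIndex / yIndex when "thread" i has an element to
// process, and false when the thread should exit without writing.
bool mapThread(int i, int tBeg, int tLen, int yBeg,
               int blockNum, int stride, int dimensionSize,
               int& tIndex, int& yIndex)
{
    const int total = blockNum * stride * dimensionSize;
    if (i >= total)                    // i == total is already out of range
        return false;

    const int block = i / (stride * dimensionSize);
    const int tail  = i % (stride * dimensionSize);
    const int k     = tail / stride;   // position along the leading dimension
    const int s     = tail % stride;   // position inside the stride
    if (k >= tLen)                     // outside the requested slice
        return false;

    const int base = block * stride * dimensionSize + s;
    tIndex = base + stride * (tBeg + k);
    yIndex = base + stride * (yBeg + k);
    return true;
}
```

With `blockNum = 2`, `stride = 1` and `dimensionSize = 3`, indices 0 to 5 map to themselves and index 6 is rejected.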

--- a/source/tensor/function/Rectify.cu
+++ b/source/tensor/function/Rectify.cu
@@ -97,7 +97,7 @@ void KernelRectifyBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * gold, DTYPE * y, 
    if (i < size){
        DTYPE s = x[i];
        if(s >= 0)
-            dedx[i] = 1;
+            dedx[i] = dedy[i];
        else
            dedx[i] = 0;
    }
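
The one-line change in `KernelRectifyBackward` applies the chain rule: the gradient with respect to the input is the incoming gradient `dedy` where the unit is active, not the constant 1. A minimal reference version on plain vectors is sketched below. `rectifyBackward` is a hypothetical name, and the `x >= 0` convention is copied from the kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not library code): backward pass of
// y = max(0, x). By the chain rule dE/dx_i = dE/dy_i * dy_i/dx_i, where
// dy_i/dx_i is taken as 1 for x_i >= 0 and 0 otherwise.
std::vector<float> rectifyBackward(const std::vector<float>& dedy,
                                   const std::vector<float>& x)
{
    std::vector<float> dedx(x.size());
    for (std::size_t i = 0; i < x.size(); i++)
        dedx[i] = (x[i] >= 0.0f) ? dedy[i] : 0.0f;
    return dedx;
}
```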

--- a/source/tensor/function/Softmax.cu
+++ b/source/tensor/function/Softmax.cu
@@ -248,7 +248,7 @@ void CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
                       "Unknown loss function.");
        if(lossName == CROSSENTROPY || lossName == SQUAREDERROR){
-            ShowNTErrors("TODO!");
+            _Sum(y, gold, dedx, -1.0F);
        }
        else if(lossName == ONEHOTERROR){
            ShowNTErrors("TODO!");
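
Assuming `_Sum(a, b, c, beta)` computes `c = a + beta * b`, the new call stores `y - gold` in `dedx`. For cross entropy over a softmax output whose gold distribution sums to 1, this is the standard closed-form gradient with respect to the softmax input. The sketch below covers only that cross-entropy case; both function names are hypothetical and the code is a plain-vector reference, not the library's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference code, not the library implementation.
std::vector<float> softmax(const std::vector<float>& x)
{
    // find the maximum; subtracting it below avoids overflow in exp
    float maxValue = x[0];
    for (float v : x)
        maxValue = std::fmax(maxValue, v);

    std::vector<float> y(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); i++) {
        y[i] = std::exp(x[i] - maxValue);
        sum += y[i];
    }
    for (float& v : y)
        v /= sum;
    return y;
}

// dE/dx for E = -sum_i gold_i * log(y_i) with y = softmax(x),
// assuming the entries of gold sum to 1.
std::vector<float> softmaxCrossEntropyBackward(const std::vector<float>& gold,
                                               const std::vector<float>& y)
{
    std::vector<float> dedx(y.size());
    for (std::size_t i = 0; i < y.size(); i++)
        dedx[i] = y[i] - gold[i];
    return dedx;
}
```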

--- a/source/tensor/test/TConcatenate.cpp
+++ b/source/tensor/test/TConcatenate.cpp
@@ -483,9 +483,9 @@ bool TestConcatenate4()
    delete sGPU1;
    delete sGPU2;
    delete tGPU;
-    delete[] sDimSize1;
+    //delete[] sDimSize1;
-    delete[] sDimSize2;
+    //delete[] sDimSize2;
-    delete[] tDimSize;
+    //delete[] tDimSize;
 	return cpuTest && gpuTest;
 #else

--- a/source/tensor/test/TIdentity.cpp
+++ b/source/tensor/test/TIdentity.cpp
@@ -30,15 +30,15 @@ Identity function: y = x
 */
 bool TestIdentity1()
 {
-    /* a input tensor of size (2, 3) */
+    /* a tensor of size (2, 3) */
-    int sOrder = 2;
+    int order = 2;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 2;
+    dimSize[0] = 2;
-    sDimSize[1] = 3;
+    dimSize[1] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
    DTYPE xData[2][3] = { {0.0F, 1.0F, 2.0F}, 
                          {0.5F, 0.7F, 1.4F} };
@@ -49,47 +49,50 @@ bool TestIdentity1()
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
    y->SetZeroAll();
    /* call Identity function */
    Identity(x, y);
    /* check result */
-    cpuTest = y->CheckData(answer, sUnitNum);
+    cpuTest = y->CheckData(answer, unitNum);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
    /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
    yGPU->SetZeroAll();
    /* call Identity function */
    Identity(xGPU, yGPU);
    /* check result */
-    gpuTest = yGPU->CheckData(answer, sUnitNum);
+    gpuTest = yGPU->CheckData(answer, unitNum);
    /* destroy variables */
-    delete x, y;
+    delete x;
-    delete xGPU, yGPU;
+    delete y;
-    delete[] sDimSize;
+    delete xGPU;
+    delete yGPU;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
    /* destroy variables */
-    delete x, y;
+    delete x;
-    delete[] sDimSize;
+    delete y;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
@@ -98,35 +101,39 @@ bool TestIdentity1()
 /* 
 case 2: test IdentityBackward function.
 IdentityBackward function: dE/dx = dE/dy * dy/dx = dE/dy
+In this case, lossName=CROSSENTROPY.
 */
 bool TestIdentity2()
 {
-    int sOrder = 2;
+    /* a tensor of size (1, 3) */
-    int * sDimSize = new int[sOrder];
+    int order = 2;
-    sDimSize[0] = 1;
+    int * dimSize = new int[order];
-    sDimSize[1] = 3;
+    dimSize[0] = 1;
+    dimSize[1] = 3;
-    int sUnitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    int unitNum = 1;
-        sUnitNum *= sDimSize[i];
+    for (int i = 0; i < order; i++)
+        unitNum *= dimSize[i];
-    DTYPE xData[1][3] = { {0.0F, 1.0F, 2.0F} };
-    DTYPE gData[1][3] = { {0.0F, 0.0F, 1.0F} };
+    DTYPE xData[3] = {1.0F, 1.0F, 2.0F};
-    DTYPE dedxAnswer[3] = {0.090031F, 0.244728F, -0.334759F};
+    DTYPE gData[3] = {0.0F, 0.0F, 1.0F};
+    DTYPE yAnswer[3] = {1.0F, 1.0F, 2.0F};
+    DTYPE dedyAnswer[3] = {0.0F, 0.0F, -0.5F};
+    DTYPE dedxAnswer[3] = {0.0F, 0.0F, -0.5F};
    /* CPU test */
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
-    XTensor * g = NewTensor(sOrder, sDimSize);
+    XTensor * g = NewTensor(order, dimSize);
-    XTensor * dedy = NewTensor(sOrder, sDimSize);
+    XTensor * dedy = NewTensor(order, dimSize);
-    XTensor * dedx = NewTensor(sOrder, sDimSize);
+    XTensor * dedx = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
-    g->SetData(gData, sUnitNum);
+    g->SetData(gData, unitNum);
    y->SetZeroAll();
    dedx->SetZeroAll();
    dedy->SetZeroAll();
@@ -138,22 +145,24 @@ bool TestIdentity2()
    IdentityBackward(g, y, x, dedy, dedx, CROSSENTROPY);
    /* check result */
-    cpuTest = dedx->CheckData(dedxAnswer, sUnitNum, 1e-4F);
+    cpuTest = y->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedx->CheckData(dedxAnswer, unitNum, 1e-4F)
+              && dedy->CheckData(dedyAnswer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
        /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * gGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * gGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedyGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedyGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedxGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedxGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
-    gGPU->SetData(gData, sUnitNum);
+    gGPU->SetData(gData, unitNum);
    yGPU->SetZeroAll();
    dedxGPU->SetZeroAll();
    dedyGPU->SetZeroAll();
@@ -165,7 +174,9 @@ bool TestIdentity2()
    IdentityBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, CROSSENTROPY);
    /* check result */
-    gpuTest = dedxGPU->CheckData(dedxAnswer, sUnitNum, 1e-4F);
+    gpuTest = yGPU->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedxGPU->CheckData(dedxAnswer, unitNum, 1e-4F)
+              && dedyGPU->CheckData(dedyAnswer, unitNum, 1e-4F);
    /* destroy variables */
    delete x;
@@ -178,7 +189,7 @@ bool TestIdentity2()
    delete gGPU;
    delete dedxGPU;
    delete dedyGPU;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
@@ -188,7 +199,7 @@ bool TestIdentity2()
    delete g;
    delete dedx;
    delete dedy;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
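
The expected arrays in `TestIdentity2` can be checked by hand: the identity function gives `y = x`, cross entropy gives `dedy = -g / y`, and the identity backward pass copies `dedy` into `dedx`. The sketch below reproduces that arithmetic on plain vectors. The struct and function names are hypothetical. Note that an input containing 0 at a position where `g` is also 0 would yield `0/0` here, which is presumably why the test data now starts at 1.0.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the three tensors the test checks.
struct IdentityResult {
    std::vector<float> y, dedy, dedx;
};

// Forward: y = x. Loss gradient (cross entropy): dedy_i = -g_i / y_i.
// Backward through the identity: dedx = dedy.
IdentityResult identityWithCrossEntropy(const std::vector<float>& x,
                                        const std::vector<float>& g)
{
    IdentityResult r;
    r.y = x;
    r.dedy.resize(x.size());
    for (std::size_t i = 0; i < x.size(); i++)
        r.dedy[i] = -g[i] / r.y[i];
    r.dedx = r.dedy;
    return r;
}
```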

--- a/source/tensor/test/TLogSoftmax.cpp
+++ b/source/tensor/test/TLogSoftmax.cpp
@@ -30,15 +30,15 @@ LogSoftmax function: y = log(e^x / \sum_{i} e^{x_i})
 */
 bool TestLogSoftmax1()
 {
-    /* a input tensor of size (2, 3) */
+    /* a tensor of size (2, 3) */
-    int sOrder = 2;
+    int order = 2;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 2;
+    dimSize[0] = 2;
-    sDimSize[1] = 3;
+    dimSize[1] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
    DTYPE xData[2][3] = { {0.0F, 1.0F, 2.0F}, 
                          {0.5F, 0.7F, 1.4F} };
@@ -49,50 +49,50 @@ bool TestLogSoftmax1()
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
    y->SetZeroAll();
    /* call LogSoftmax function */
    LogSoftmax(x, y, 1);
    /* check result */
-    cpuTest = y->CheckData(answer, sUnitNum, 1e-4F);
+    cpuTest = y->CheckData(answer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
    /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
    yGPU->SetZeroAll();
    /* call LogSoftmax function */
    LogSoftmax(xGPU, yGPU, 1);
    /* check result */
-    gpuTest = yGPU->CheckData(answer, sUnitNum, 1e-4F);
+    gpuTest = yGPU->CheckData(answer, unitNum, 1e-4F);
    /* destroy variables */
    delete x;
    delete y;
    delete xGPU;
    delete yGPU;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
    /* destroy variables */
    delete x;
    delete y;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
@@ -102,75 +102,78 @@ bool TestLogSoftmax1()
 case 2: test LogSoftmaxBackward function.
 dE/dx = dE/dy * dy/dx
 log softmax: y_i = log(e^{x_i} / \sum_{k} e^{x_k})
+In this case, LossName=CROSSENTROPY.
 */
 bool TestLogSoftmax2()
 {
-    /* a input tensor of size (3) */
+    /* a tensor of size (1, 3) */
-    int sOrder = 1;
+    int order = 2;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 3;
+    dimSize[0] = 1;
+    dimSize[1] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
-    DTYPE xData[3] = {0.0F, 1.0F, 2.0F};
+    DTYPE xData[1][3] = {0.0F, 1.0F, 2.0F};
-    DTYPE gData[3] = {0.5F, 0.8F, 1.5F};
+    DTYPE gData[1][3] = {0.5F, 0.8F, 1.5F};
-    DTYPE yAnswer[3] = {-2.4076F, -1.4076F, -0.4076F};
+    DTYPE yAnswer[1][3] = {-2.4076F, -1.4076F, -0.4076F};
-    DTYPE dedxAnswer[3] = {-0.409969F, -0.555272F, -0.834759F};
+    DTYPE dedxAnswer[1][3] = {-0.4100F, -0.5553F, -0.8348F};
    /* CPU test */
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
-    XTensor * g = NewTensor(sOrder, sDimSize);
+    XTensor * g = NewTensor(order, dimSize);
-    XTensor * dedy = NewTensor(sOrder, sDimSize);
+    XTensor * dedy = NewTensor(order, dimSize);
-    XTensor * dedx = NewTensor(sOrder, sDimSize);
+    XTensor * dedx = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
-    g->SetData(gData, sUnitNum);
+    g->SetData(gData, unitNum);
    y->SetZeroAll();
    dedx->SetZeroAll();
    dedy->SetZeroAll();
    /* call LogSoftmax function */
-    LogSoftmax(x, y, 0);
+    LogSoftmax(x, y, 1);
    /* call LogSoftmaxBackward function */
-    LogSoftmaxBackward(g, y, x, dedy, dedx, 0, CROSSENTROPY);
+    LogSoftmaxBackward(g, y, x, dedy, dedx, 1, CROSSENTROPY);
    /* check result */
-    cpuTest = y->CheckData(yAnswer, sUnitNum, 1e-4F) && dedx->CheckData(dedxAnswer, sUnitNum, 1e-4F);
+    cpuTest = y->CheckData(yAnswer, unitNum, 1e-4F) 
+              && dedx->CheckData(dedxAnswer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
    /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * gGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * gGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedyGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedyGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedxGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedxGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
-    gGPU->SetData(gData, sUnitNum);
+    gGPU->SetData(gData, unitNum);
    yGPU->SetZeroAll();
    dedxGPU->SetZeroAll();
    dedyGPU->SetZeroAll();
    /* call LogSoftmax function */
-    LogSoftmax(xGPU, yGPU, 0);
+    LogSoftmax(xGPU, yGPU, 1);
    /* call LogSoftmaxBackward function */
-    LogSoftmaxBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, 0, CROSSENTROPY);
+    LogSoftmaxBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, 1, CROSSENTROPY);
    /* check result */
-    gpuTest = yGPU->CheckData(yAnswer, sUnitNum, 1e-4F) && dedxGPU->CheckData(dedxAnswer, sUnitNum, 1e-4F);
+    gpuTest = yGPU->CheckData(yAnswer, unitNum, 1e-4F) && dedxGPU->CheckData(dedxAnswer, unitNum, 1e-4F);
    /* destroy variables */
    delete x;
@@ -183,7 +186,7 @@ bool TestLogSoftmax2()
    delete gGPU;
    delete dedxGPU;
    delete dedyGPU;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
@@ -193,7 +196,7 @@ bool TestLogSoftmax2()
    delete g;
    delete dedx;
    delete dedy;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
@@ -203,37 +206,38 @@ bool TestLogSoftmax2()
 case 3: test LogSoftmaxBackward function.
 dE/dx = dE/dy * dy/dx
 log softmax: y_i = log(e^{x_i} / \sum_{k} e^{x_k})
+In this case, LossName=SQUAREDERROR.
 */
 bool TestLogSoftmax3()
 {
    /* a tensor of size (1, 3) */
-    int sOrder = 2;
+    int order = 2;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 1;
+    dimSize[0] = 1;
-    sDimSize[1] = 3;
+    dimSize[1] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
-    DTYPE xData[1][3] = { {0.0F, 1.0F, 2.0F} };
+    DTYPE xData[1][3] = {0.0F, 1.0F, 2.0F};
-    DTYPE gData[1][3] = { {0.5F, 0.8F, 1.5F} };
+    DTYPE gData[1][3] = {0.5F, 0.8F, 1.5F};
    DTYPE yAnswer[1][3] = {-2.4076F, -1.4076F, -0.4076F};
-    DTYPE dedxAnswer[1][3] = {-0.409969F, -0.555272F, -0.834759F};
+    DTYPE dedxAnswer[1][3] = {-0.4100F, -0.5553F, -0.8348F};
    /* CPU test */
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
-    XTensor * g = NewTensor(sOrder, sDimSize);
+    XTensor * g = NewTensor(order, dimSize);
-    XTensor * dedy = NewTensor(sOrder, sDimSize);
+    XTensor * dedy = NewTensor(order, dimSize);
-    XTensor * dedx = NewTensor(sOrder, sDimSize);
+    XTensor * dedx = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
-    g->SetData(gData, sUnitNum);
+    g->SetData(gData, unitNum);
    y->SetZeroAll();
    dedx->SetZeroAll();
    dedy->SetZeroAll();
@@ -242,25 +246,26 @@ bool TestLogSoftmax3()
    LogSoftmax(x, y, 1);
    /* call LogSoftmaxBackward function */
-    LogSoftmaxBackward(g, y, x, dedy, dedx, 1, CROSSENTROPY);
+    LogSoftmaxBackward(g, y, x, dedy, dedx, 1, SQUAREDERROR);
    /* check result */
-    cpuTest = y->CheckData(yAnswer, sUnitNum, 1e-4F) && dedx->CheckData(dedxAnswer, sUnitNum, 1e-4F);
+    cpuTest = y->CheckData(yAnswer, unitNum, 1e-4F) 
+              && dedx->CheckData(dedxAnswer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
    /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * gGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * gGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedyGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedyGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedxGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedxGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
-    gGPU->SetData(gData, sUnitNum);
+    gGPU->SetData(gData, unitNum);
    yGPU->SetZeroAll();
    dedxGPU->SetZeroAll();
    dedyGPU->SetZeroAll();
@@ -269,10 +274,11 @@ bool TestLogSoftmax3()
    LogSoftmax(xGPU, yGPU, 1);
    /* call LogSoftmaxBackward function */
-    LogSoftmaxBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, 1, CROSSENTROPY);
+    LogSoftmaxBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, 1, SQUAREDERROR);
    /* check result */
-    gpuTest = yGPU->CheckData(yAnswer, sUnitNum, 1e-4F) && dedxGPU->CheckData(dedxAnswer, sUnitNum, 1e-4F);
+    gpuTest = yGPU->CheckData(yAnswer, unitNum, 1e-4F) 
+              && dedxGPU->CheckData(dedxAnswer, unitNum, 1e-3F);
    /* destroy variables */
    delete x;
@@ -285,7 +291,7 @@ bool TestLogSoftmax3()
    delete gGPU;
    delete dedxGPU;
    delete dedyGPU;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
@@ -295,13 +301,12 @@ bool TestLogSoftmax3()
    delete g;
    delete dedx;
    delete dedy;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
 }
 /* other cases */
 /*
    TODO!!
@@ -310,7 +315,7 @@ bool TestLogSoftmax3()
 /* test for LogSoftmax Function */
 bool TestLogSoftmax()
 {
-    XPRINT(0, stdout, "[TEST LogSoftmax] test log softmax function and its backward computation \n");
+    XPRINT(0, stdout, "[TEST LogSoftmax] logsoftmax function and its backward computation \n");
    bool returnFlag = true, caseFlag = true;
    /* case 1 test */
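Review note: the expected values in case 3 can be reproduced without the library. The sketch below is a standalone reference, not NiuTrans.Tensor code; the names `LogSoftmaxRef` and `ExpMinusGold` are hypothetical. It evaluates y_i = log(e^{x_i} / \sum_{k} e^{x_k}) for one row, and it also shows that the `dedxAnswer` values for SQUAREDERROR coincide numerically with exp(y_i) - g_i. That second point is an observation about the test data only, not a statement of how LogSoftmaxBackward is implemented.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference log softmax over a single row (hypothetical helper).
// The row maximum is subtracted before exponentiating for numerical stability.
std::vector<double> LogSoftmaxRef(const std::vector<double> & x)
{
    assert(!x.empty());
    double maxValue = x[0];
    for (double v : x)
        if (v > maxValue)
            maxValue = v;

    double sum = 0.0;
    for (double v : x)
        sum += std::exp(v - maxValue);
    const double logSum = maxValue + std::log(sum);

    std::vector<double> y;
    for (double v : x)
        y.push_back(v - logSum);
    return y;
}

// exp(y_i) - g_i for each item (hypothetical helper).
// This only reproduces the numbers in dedxAnswer of case 3.
std::vector<double> ExpMinusGold(const std::vector<double> & y,
                                 const std::vector<double> & g)
{
    assert(y.size() == g.size());
    std::vector<double> d;
    for (std::size_t i = 0; i < y.size(); i++)
        d.push_back(std::exp(y[i]) - g[i]);
    return d;
}
```

With x = (0, 1, 2) and g = (0.5, 0.8, 1.5) this gives y close to (-2.4076, -1.4076, -0.4076) and differences close to (-0.4100, -0.5553, -0.8348), which agrees with `yAnswer` and `dedxAnswer` within the 1e-4 tolerance used by the CPU check.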

--- a/source/tensor/test/TLoss.cpp
+++ b/source/tensor/test/TLoss.cpp
@@ -20,15 +20,15 @@
 */
 #include "../core/math/ScaleAndShift.h"
-#include "../function/Loss.h"
+#include "TLoss.h"
 namespace nts { // namespace nts(NiuTrans.Tensor)
 /* 
-case 1: test LossCompute function 
+case 1: test LossCompute function.
 In this case, Loss function name = SQUAREDERROR.
 loss = sum_{i} 0.5*(t_i - y_i)^2, 
-where t_i is the gold standard and y_i is the model output
+where t_i is the gold standard and y_i is the model output.
 */
 bool TestLoss1()
 {
@@ -102,10 +102,10 @@ bool TestLoss1()
 }
 /* 
-case 2: test LossCompute function 
+case 2: test LossCompute function.
 In this case, Loss function name = CROSSENTROPY.
 loss = sum_{i} (-t_i * log(y_i))
-where t_i is the gold standard and y_i is the model output
+where t_i is the gold standard and y_i is the model output.
 */
 bool TestLoss2()
 {
@@ -179,10 +179,10 @@ bool TestLoss2()
 }
 /* 
-case 3: test LossCompute function 
+case 3: test LossCompute function.
 In this case, Loss function name = ONEHOTERROR.
 loss = sum_{i} e_i
-where e_i = 0.5*(t_i - y_i)^2 if t_i = 1, e_i = 0 otherwise
+where e_i = 0.5*(t_i - y_i)^2 if t_i = 1, e_i = 0 otherwise.
 */
 bool TestLoss3()
 {
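Review note: the three loss definitions quoted in the comments above can be written out directly. This is a minimal standalone sketch of those formulas only; the function names are hypothetical and it does not use or describe the library's LossCompute signature. The cross entropy version assumes every y_i is strictly positive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// loss = sum_{i} 0.5*(t_i - y_i)^2
double SquaredErrorRef(const std::vector<double> & t, const std::vector<double> & y)
{
    assert(t.size() == y.size());
    double loss = 0.0;
    for (std::size_t i = 0; i < t.size(); i++)
        loss += 0.5 * (t[i] - y[i]) * (t[i] - y[i]);
    return loss;
}

// loss = sum_{i} (-t_i * log(y_i)), assuming y_i > 0 for all i
double CrossEntropyRef(const std::vector<double> & t, const std::vector<double> & y)
{
    assert(t.size() == y.size());
    double loss = 0.0;
    for (std::size_t i = 0; i < t.size(); i++)
        loss += -t[i] * std::log(y[i]);
    return loss;
}

// loss = sum_{i} e_i, where e_i = 0.5*(t_i - y_i)^2 if t_i = 1 and 0 otherwise
double OneHotErrorRef(const std::vector<double> & t, const std::vector<double> & y)
{
    assert(t.size() == y.size());
    double loss = 0.0;
    for (std::size_t i = 0; i < t.size(); i++)
        if (t[i] == 1.0)
            loss += 0.5 * (t[i] - y[i]) * (t[i] - y[i]);
    return loss;
}
```

For t = (1, 0) and y = (0.5, 0.5), the squared error counts both items, while the one-hot error counts only the item whose gold value is 1.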

--- a/source/tensor/test/TMatrixMulBatched.cpp
+++ b/source/tensor/test/TMatrixMulBatched.cpp
@@ -19,6 +19,7 @@
 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-06-15
 */
+#include "../XTensor.h"
 #include "TMatrixMulBatched.h"
 namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -105,7 +106,7 @@ bool TestMatrixMulBatched1()
    /* check results */
    gpuTest = tGPU->CheckData(answer, tUnitNum);
    /* destroy variables */
    delete s1;
    delete s2;

--- a/source/tensor/test/TRectify.cpp
+++ b/source/tensor/test/TRectify.cpp
@@ -29,25 +29,15 @@ In this case, y = max(0, x)
 */
 bool TestRectify1()
 {
-    /* a x tensor of size (2, 3) */
+    /* a tensor of size (2, 3) */
-    int xOrder = 2;
+    int order = 2;
-    int * xDimSize = new int[xOrder];
+    int * dimSize = new int[order];
-    xDimSize[0] = 2;
+    dimSize[0] = 2;
-    xDimSize[1] = 3;
+    dimSize[1] = 3;
-    int xUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < xOrder; i++)
+    for (int i = 0; i < order; i++)
-        xUnitNum *= xDimSize[i];
+        unitNum *= dimSize[i];
-    /* a y tensor of size (2, 3) */
-    int yOrder = 2;
-    int * yDimSize = new int[yOrder];
-    yDimSize[0] = 2;
-    yDimSize[1] = 3;
-    int yUnitNum = 1;
-    for (int i = 0; i < yOrder; i++)
-        yUnitNum *= yDimSize[i];
    DTYPE xData[2][3] = { {0.0F, -1.0F, 2.0F},
                          {3.0F, -4.0F, -5.0F} };
@@ -58,52 +48,50 @@ bool TestRectify1()
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(xOrder, xDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(yOrder, yDimSize);
+    XTensor * y = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, xUnitNum);
+    x->SetData(xData, unitNum);
    y->SetZeroAll();
    /* call Rectify function */
    Rectify(x, y);
    /* check results */
-    cpuTest = y->CheckData(answer, yUnitNum);
+    cpuTest = y->CheckData(answer, unitNum);
 #ifdef USE_CUDA
 	/* GPU test */
 	bool gpuTest = true;
 	/* create tensor */
-	XTensor * xGPU = NewTensor(xOrder, xDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-	XTensor * yGPU = NewTensor(yOrder, yDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
 	/* Initialize variables */
-	xGPU->SetData(xData, xUnitNum);
+	xGPU->SetData(xData, unitNum);
 	yGPU->SetZeroAll();
 	/* call Rectify function */
 	Rectify(xGPU, yGPU);
 	/* check results */
-	gpuTest = yGPU->CheckData(answer, yUnitNum);
+	gpuTest = yGPU->CheckData(answer, unitNum);
 	/* destroy variables */
 	delete x;
    delete y;
    delete xGPU;
    delete yGPU;
-	delete[] xDimSize;
+	delete[] dimSize;
-    delete[] yDimSize;
 	return cpuTest && gpuTest;
 #else
 	/* destroy variables */
 	delete x;
    delete y;
-	delete[] xDimSize;
+	delete[] dimSize;
-    delete[] yDimSize;
 	return cpuTest;
 #endif // USE_CUDA
@@ -117,73 +105,83 @@ In this case, lossName=CROSSENTROPY.
 */
 bool TestRectify2()
 {
-	/* a x tensor of size (2, 3) */
+	/* a tensor of size (2, 3) */
-	int xOrder = 2;
+	int order = 2;
-	int * xDimSize = new int[xOrder];
+	int * dimSize = new int[order];
-	xDimSize[0] = 2;
+	dimSize[0] = 2;
-	xDimSize[1] = 3;
+	dimSize[1] = 3;
-	int xUnitNum = 1;
+	int unitNum = 1;
-	for (int i = 0; i < xOrder; i++)
+	for (int i = 0; i < order; i++)
-		xUnitNum *= xDimSize[i];
+		unitNum *= dimSize[i];
 	DTYPE xData[2][3] = { {1.0F, 1.0F, 2.0F},
 	                      {2.0F, 4.0F, 5.0F} };
-	DTYPE yData[2][3] = { {1.0F, 1.0F, 2.0F},
-	                      {2.0F, 4.0F, 5.0F} };
 	DTYPE goldData[2][3] = { {1.0F, 1.0F, 1.0F},
 	                         {1.0F, 1.0F, 1.0F} };
-	DTYPE dedyData[2][3] = { {-1.0F, -1.0F, -0.5F},
+    DTYPE yAnswer[2][3] = { {1.0F, 1.0F, 2.0F},
-	                         {-0.5F, -0.25F, -0.2F} };
+	                        {2.0F, 4.0F, 5.0F} };
-	DTYPE answer[2][3] = { {-1.0F, -1.0F, -0.5F},
+	DTYPE dedyAnswer[2][3] = { {-1.0F, -1.0F, -0.5F},
-	                       {-0.5F, -0.25F, -0.2F} };
+	                           {-0.5F, -0.25F, -0.2F} };
+	DTYPE dedxAnswer[2][3] = { {-1.0F, -1.0F, -0.5F},
+	                           {-0.5F, -0.25F, -0.2F} };
 	/* CPU test */
 	bool cpuTest = true;
 	/* create tensors */
-	XTensor * x = NewTensor(xOrder, xDimSize);
+	XTensor * x = NewTensor(order, dimSize);
-	XTensor * y = NewTensor(xOrder, xDimSize);
+	XTensor * y = NewTensor(order, dimSize);
-	XTensor * gold = NewTensor(xOrder, xDimSize);
+	XTensor * gold = NewTensor(order, dimSize);
-	XTensor * dedy = NewTensor(xOrder, xDimSize);
+	XTensor * dedy = NewTensor(order, dimSize);
-	XTensor * dedx = NewTensor(xOrder, xDimSize);
+	XTensor * dedx = NewTensor(order, dimSize);
 	/* initialize variables */
-	x->SetData(xData, xUnitNum);
+	x->SetData(xData, unitNum);
-	y->SetData(yData, xUnitNum);
+	gold->SetData(goldData, unitNum);
-	gold->SetData(goldData, xUnitNum);
+	y->SetZeroAll();
-	dedy->SetData(dedyData, xUnitNum);
+	dedy->SetZeroAll();
 	dedx->SetZeroAll();
+    /* call Rectify function */
+    Rectify(x, y);
 	/* call RectifyBackward function */
-	RectifyBackward(gold, y, x, dedy, dedx, NOLOSS);
+	RectifyBackward(gold, y, x, dedy, dedx, CROSSENTROPY);
 	/* check results */
-	cpuTest = dedx->CheckData(answer, xUnitNum);
+    cpuTest = y->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedx->CheckData(dedxAnswer, unitNum, 1e-4F)
+              && dedy->CheckData(dedyAnswer, unitNum, 1e-4F);
 #ifdef USE_CUDA
 	/* GPU test */
 	bool gpuTest = true;
 	/* create tensors */
-	XTensor * xGPU = NewTensor(xOrder, xDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-	XTensor * yGPU = NewTensor(xOrder, xDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-	XTensor * goldGPU = NewTensor(xOrder, xDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * goldGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-	XTensor * dedyGPU = NewTensor(xOrder, xDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * dedyGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-	XTensor * dedxGPU = NewTensor(xOrder, xDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * dedxGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
 	/* initialize variables */
-	xGPU->SetData(xData, xUnitNum);
+	xGPU->SetData(xData, unitNum);
-	yGPU->SetData(yData, xUnitNum);
+	goldGPU->SetData(goldData, unitNum);
-	goldGPU->SetData(goldData, xUnitNum);
+	yGPU->SetZeroAll();
-	dedyGPU->SetData(dedyData, xUnitNum);
+	dedyGPU->SetZeroAll();
 	dedxGPU->SetZeroAll();
+    /* call Rectify function */
+    Rectify(xGPU, yGPU);
 	/* call RectifyBackward function */
-	RectifyBackward(goldGPU, yGPU, xGPU, dedyGPU, dedxGPU, NOLOSS);
+	RectifyBackward(goldGPU, yGPU, xGPU, dedyGPU, dedxGPU, CROSSENTROPY);
 	/* check results */
-	gpuTest = dedxGPU->CheckData(answer, xUnitNum);
+    gpuTest = yGPU->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedxGPU->CheckData(dedxAnswer, unitNum, 1e-4F)
+              && dedyGPU->CheckData(dedyAnswer, unitNum, 1e-4F);
 	/* destroy variables */
    delete x;
@@ -196,7 +194,7 @@ bool TestRectify2()
    delete dedyGPU;
    delete dedxGPU;
    delete goldGPU;
-	delete[] xDimSize;
+	delete[] dimSize;
 	return cpuTest && gpuTest;
 #else
@@ -206,7 +204,7 @@ bool TestRectify2()
    delete dedy;
    delete dedx;
    delete gold;
-	delete[] xDimSize;
+	delete[] dimSize;
 	return cpuTest;
 #endif // USE_CUDA
@@ -220,7 +218,7 @@ TODO!!
 /* test for Rectify Function */
 bool TestRectify()
 {
-    XPRINT(0, stdout, "[TEST RECTIFY] test rectify and its backward computation \n");
+    XPRINT(0, stdout, "[TEST RECTIFY] rectify function and its backward computation \n");
    bool returnFlag = true, caseFlag = true;
    /* case 1 test */
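Review note: the new `yAnswer`, `dedyAnswer` and `dedxAnswer` tables in TestRectify2 can be cross-checked by hand. The sketch below is a standalone reference under stated assumptions, not the library's RectifyBackward: it takes E = sum_{i} (-gold_i * log(y_i)), so dE/dy_i = -gold_i / y_i, and passes the gradient through where x_i > 0. `RectifyCrossEntropyRef` and `RectifyResult` are hypothetical names, and the code assumes y_i > 0 for every item, as is the case for the test data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Outputs of the forward pass and the two gradients (hypothetical type).
struct RectifyResult {
    std::vector<double> y;
    std::vector<double> dedy;
    std::vector<double> dedx;
};

// y = max(0, x); dE/dy = -gold / y; dE/dx = dE/dy if x > 0, else 0.
// Assumes every y_i is positive so that the division is well defined.
RectifyResult RectifyCrossEntropyRef(const std::vector<double> & x,
                                     const std::vector<double> & gold)
{
    assert(x.size() == gold.size());
    RectifyResult r;
    for (std::size_t i = 0; i < x.size(); i++) {
        const double y = x[i] > 0.0 ? x[i] : 0.0;
        const double dedy = -gold[i] / y;
        const double dedx = x[i] > 0.0 ? dedy : 0.0;
        r.y.push_back(y);
        r.dedy.push_back(dedy);
        r.dedx.push_back(dedx);
    }
    return r;
}
```

For x = (1, 1, 2, 2, 4, 5) and gold values all equal to 1, this yields dedy = dedx = (-1, -1, -0.5, -0.5, -0.25, -0.2), matching the tables added in this hunk.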

--- a/source/tensor/test/TSetAscendingOrder.cpp
+++ b/source/tensor/test/TSetAscendingOrder.cpp
@@ -23,8 +23,7 @@
 namespace nts { // namespace nts(NiuTrans.Tensor)
-/* case 1: set the cell to the ascending order along a given dimension.
+/* case 1: set the cell to the ascending order along a given dimension. */
-*/
 bool TestSetAscendingOrder1()
 {
    /* an input tensor of size (2, 4) */
@@ -50,7 +49,6 @@ bool TestSetAscendingOrder1()
    s->SetZeroAll();
    /* call SetAscendingOrder function */
    s->SetAscendingOrder(1);
    /* check results */

--- a/source/tensor/test/TSetData.cpp
+++ b/source/tensor/test/TSetData.cpp
@@ -23,7 +23,10 @@
 namespace nts { // namespace nts(NiuTrans.Tensor)
-/* case 1: set the cell to the ascending order along a given dimension. */
+/* 
+case 1: test SetDataRand function.
+set the tensor items by a uniform distribution in range [lower, upper]. 
+*/
 bool TestSetData1()
 {
    /* an input tensor of size (2, 4) */
@@ -44,7 +47,7 @@ bool TestSetData1()
    /* create tensors */
    XTensor * s = NewTensor(sOrder, sDimSize);
-    /* call SetData function */
+    /* call SetDataRand function */
    s->SetDataRand(0.0, 1.0);
    /* check results */

--- a/source/tensor/test/TSigmoid.cpp
+++ b/source/tensor/test/TSigmoid.cpp
@@ -25,102 +25,71 @@
 namespace nts { // namespace nts(NiuTrans.Tensor)
 /* 
-case 1: test Sigmoid function and SigmoidBackward function.
+case 1: test Sigmoid function.
 sigmoid function: y = 1/(1+exp(-x))
-backward computation: dE/ds = dE/dy * dy/dx
 */
 bool TestSigmoid1()
 {
    /* an input tensor of size (3) */
-    int sOrder = 1;
+    int order = 1;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 3;
+    dimSize[0] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
    DTYPE xData[3] = {0.0F, 1.0F, 2.0F};
-    DTYPE gData[3] = {0.4F, 0.8F, 1.0F};
+    DTYPE answer[3] = {0.5F, 0.7311F, 0.8808F};
-    DTYPE dedyData[3] = {-0.8F, -1.094F, -1.135F};
-    DTYPE yAnswer[3] = {0.5F, 0.731F, 0.881F};
-    DTYPE dedxAnswer[3] = {-0.2F, -0.215F, -0.119F};
    /* CPU test */
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
-    XTensor * g = NewTensor(sOrder, sDimSize);
-    XTensor * dedy = NewTensor(sOrder, sDimSize);
-    XTensor * dedx = NewTensor(sOrder, sDimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
-    g->SetData(gData, sUnitNum);
-    dedy->SetData(dedyData, sUnitNum);
    y->SetZeroAll();
-    dedx->SetZeroAll();
    /* call Sigmoid function */
    Sigmoid(x, y);
-    /* call SigmoidBackward function */
-    SigmoidBackward(g, y, x, dedy, dedx, NOLOSS);
    /* check result */
-    cpuTest = y->CheckData(yAnswer, sUnitNum) && dedx->CheckData(dedxAnswer, sUnitNum);
+    cpuTest = y->CheckData(answer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
        /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * gGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedyGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedxGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
-    gGPU->SetData(gData, sUnitNum);
-    dedyGPU->SetData(dedyData, sUnitNum);
    yGPU->SetZeroAll();
-    dedxGPU->SetZeroAll();
    /* call Sigmoid function */
    Sigmoid(xGPU, yGPU);
-    /* call SigmoidBackward function */
-    SigmoidBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, NOLOSS);
    /* check result */
-    gpuTest = yGPU->CheckData(yAnswer, sUnitNum) && dedxGPU->CheckData(dedxAnswer, sUnitNum);
+    gpuTest = yGPU->CheckData(answer, unitNum, 1e-4F);
    /* destroy variables */
    delete x;
    delete y;
-    delete g;
-    delete dedx;
-    delete dedy;
    delete xGPU;
    delete yGPU;
-    delete gGPU;
+    delete[] dimSize;
-    delete dedxGPU;
-    delete dedyGPU;
-    delete[] sDimSize;
    return cpuTest && gpuTest;
 #else
    /* destroy variables */
    delete x;
    delete y;
-    delete g;
+    delete[] dimSize;
-    delete dedx;
-    delete dedy;
-    delete[] sDimSize;
    return cpuTest;
 #endif // USE_CUDA
@@ -129,70 +98,72 @@ bool TestSigmoid1()
 /* 
 case 2: test Sigmoid function and SigmoidBackward function.
 sigmoid function: y = 1/(1+exp(-x))
-backward computation: dE/ds = dE/dy * dy/dx
+backward computation: 
+dE/dx = dE/dy * dy/dx
+dy/dx = y * (1 - y)
+In this case, LossName=CROSSENTROPY.
 */
 bool TestSigmoid2()
 {
    /* an input tensor of size (3) */
-    int sOrder = 1;
+    int order = 1;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 3;
+    dimSize[0] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
    DTYPE xData[3] = {0.0F, 1.0F, 2.0F};
    DTYPE gData[3] = {0.4F, 0.8F, 1.0F};
-    DTYPE dedyData[3] = {-0.8F, -1.094F, -1.135F};
+    DTYPE yAnswer[3] = {0.5F, 0.7311F, 0.8808F};
-    DTYPE yAnswer[3] = {0.5F, 0.731F, 0.881F};
+    DTYPE dedyAnswer[3] = {-0.8F, -1.0943F, -1.1353F};
-    DTYPE dedxAnswer[3] = {-0.2F, -0.215F, -0.119F};
+    DTYPE dedxAnswer[3] = {-0.2F, -0.2151F, -0.1192F};
    /* CPU test */
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
-    XTensor * g = NewTensor(sOrder, sDimSize);
+    XTensor * g = NewTensor(order, dimSize);
-    XTensor * dedy = NewTensor(sOrder, sDimSize);
+    XTensor * dedy = NewTensor(order, dimSize);
-    XTensor * dedx = NewTensor(sOrder, sDimSize);
+    XTensor * dedx = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
-    g->SetData(gData, sUnitNum);
+    g->SetData(gData, unitNum);
-    dedy->SetZeroAll();
    y->SetZeroAll();
+    dedy->SetZeroAll();
    dedx->SetZeroAll();
    /* call Sigmoid function */
    Sigmoid(x, y);
-    /* initialize variables */
-    dedy->SetData(dedyData, sUnitNum);
    /* call SigmoidBackward function */
    SigmoidBackward(g, y, x, dedy, dedx, CROSSENTROPY);
    /* check result */
-    cpuTest = y->CheckData(yAnswer, sUnitNum) && dedx->CheckData(dedxAnswer, sUnitNum);
+    cpuTest = y->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedx->CheckData(dedxAnswer, unitNum, 1e-4F)
+              && dedy->CheckData(dedyAnswer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
        /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * gGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * gGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedyGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedyGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedxGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedxGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
-    gGPU->SetData(gData, sUnitNum);
+    gGPU->SetData(gData, unitNum);
-    dedyGPU->SetZeroAll();
    yGPU->SetZeroAll();
+    dedyGPU->SetZeroAll();
    dedxGPU->SetZeroAll();
    /* call Sigmoid function */
@@ -202,8 +173,9 @@ bool TestSigmoid2()
    SigmoidBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, CROSSENTROPY);
    /* check result */
-    gpuTest = yGPU->CheckData(yAnswer, sUnitNum) && dedxGPU->CheckData(dedxAnswer, sUnitNum);
+    gpuTest = yGPU->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedxGPU->CheckData(dedxAnswer, unitNum, 1e-4F)
+              && dedyGPU->CheckData(dedyAnswer, unitNum, 1e-4F);
    /* destroy variables */
    delete x;
    delete y;
@@ -215,7 +187,7 @@ bool TestSigmoid2()
    delete gGPU;
    delete dedxGPU;
    delete dedyGPU;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
@@ -225,7 +197,7 @@ bool TestSigmoid2()
    delete g;
    delete dedx;
    delete dedy;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
@@ -251,6 +223,16 @@ bool TestSigmoid()
    }
    else
        XPRINT(0, stdout, ">> case 1 passed!\n");
+    /* case 2 test */
+    caseFlag = TestSigmoid2();
+    if (!caseFlag) {
+        returnFlag = false;
+        XPRINT(0, stdout, ">> case 2 failed!\n");
+    }
+    else
+        XPRINT(0, stdout, ">> case 2 passed!\n");
    /* other cases test */
    /*
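Review note: the rounded constants introduced in TestSigmoid2 follow from the formulas in its header comment. The sketch below is a standalone check, not library code; `SigmoidCrossEntropyRef` and `SigmoidResult` are hypothetical names, and dE/dy = -g / y is assumed for the cross entropy loss E = sum_{i} (-g_i * log(y_i)).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Forward output and gradients for one vector (hypothetical type).
struct SigmoidResult {
    std::vector<double> y;
    std::vector<double> dedy;
    std::vector<double> dedx;
};

// y = 1/(1+exp(-x)); dE/dy = -g/y (assumed cross entropy gradient);
// dE/dx = dE/dy * dy/dx with dy/dx = y * (1 - y).
SigmoidResult SigmoidCrossEntropyRef(const std::vector<double> & x,
                                     const std::vector<double> & g)
{
    assert(x.size() == g.size());
    SigmoidResult r;
    for (std::size_t i = 0; i < x.size(); i++) {
        const double y = 1.0 / (1.0 + std::exp(-x[i]));
        const double dedy = -g[i] / y;
        const double dedx = dedy * y * (1.0 - y);
        r.y.push_back(y);
        r.dedy.push_back(dedy);
        r.dedx.push_back(dedx);
    }
    return r;
}
```

With x = (0, 1, 2) and g = (0.4, 0.8, 1.0) the results agree with `yAnswer`, `dedyAnswer` and `dedxAnswer` to within 1e-4, which is the tolerance now passed to CheckData.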

--- a/source/tensor/test/TSoftmax.cpp
+++ b/source/tensor/test/TSoftmax.cpp
@@ -31,68 +31,69 @@ softmax function: y = e^x / \sum_{i} e^{x_i}
 */
 bool TestSoftmax1()
 {
-    /* a input tensor of size (2, 3) */
+    /* a tensor of size (2, 3) */
-    int sOrder = 2;
+    int order = 2;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 2;
+    dimSize[0] = 2;
-    sDimSize[1] = 3;
+    dimSize[1] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
    DTYPE xData[2][3] = { {0.0F, 1.0F, 2.0F}, 
                          {0.5F, 0.7F, 1.4F} };
-    DTYPE answer[2][3] = { {0.09003057F, 0.24472848F, 0.66524094F}, 
+    DTYPE answer[2][3] = { {0.0900F, 0.2447F, 0.6652F}, 
-                           {0.21362929F, 0.2609274F , 0.52544326F} };
+                           {0.2136F, 0.2609F, 0.5254F} };
    /* CPU test */
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
    y->SetZeroAll();
    /* call Softmax function */
    Softmax(x, y, 1);
    /* check result */
-    cpuTest = y->CheckData(answer, sUnitNum);
+    cpuTest = y->CheckData(answer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
    /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
    yGPU->SetZeroAll();
    /* call Softmax function */
    Softmax(xGPU, yGPU, 1);
    /* check result */
-    gpuTest = yGPU->CheckData(answer, sUnitNum);
+    gpuTest = yGPU->CheckData(answer, unitNum, 1e-4F);
    /* destroy variables */
    delete x;
    delete y;
    delete xGPU;
    delete yGPU;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
    /* destroy variables */
-    delete x, y;
+    delete x;
-    delete[] sDimSize;
+    delete y;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
@@ -101,62 +102,66 @@ bool TestSoftmax1()
 /* 
 case 2: test SoftmaxBackward function.
 SoftmaxBackward function: dE/dx_j = -gold_j + y_j
+In this case, LossName=CROSSENTROPY.
 */
 bool TestSoftmax2()
 {
    /* an input tensor of size (1, 3) */
-    int sOrder = 2;
+    int order = 2;
-    int * sDimSize = new int[sOrder];
+    int * dimSize = new int[order];
-    sDimSize[0] = 1;
+    dimSize[0] = 1;
-    sDimSize[1] = 3;
+    dimSize[1] = 3;
-    int sUnitNum = 1;
+    int unitNum = 1;
-    for (int i = 0; i < sOrder; i++)
+    for (int i = 0; i < order; i++)
-        sUnitNum *= sDimSize[i];
+        unitNum *= dimSize[i];
    DTYPE xData[1][3] = { {0.0F, 1.0F, 2.0F} };
    DTYPE gData[1][3] = { {0.0F, 0.0F, 1.0F} };
-    DTYPE dedxAnswer[3] = {0.090031F, 0.244728F, -0.334759F};
+    DTYPE yAnswer[1][3] = { {0.0900F, 0.2447F, 0.6652F} };
+    DTYPE dedxAnswer[1][3] = {0.0900F, 0.2447F, -0.3347F};
    /* CPU test */
    bool cpuTest = true;
    /* create tensors */
-    XTensor * x = NewTensor(sOrder, sDimSize);
+    XTensor * x = NewTensor(order, dimSize);
-    XTensor * y = NewTensor(sOrder, sDimSize);
+    XTensor * y = NewTensor(order, dimSize);
-    XTensor * g = NewTensor(sOrder, sDimSize);
+    XTensor * g = NewTensor(order, dimSize);
-    XTensor * dedy = NewTensor(sOrder, sDimSize);
+    XTensor * dedy = NewTensor(order, dimSize);
-    XTensor * dedx = NewTensor(sOrder, sDimSize);
+    XTensor * dedx = NewTensor(order, dimSize);
    /* initialize variables */
-    x->SetData(xData, sUnitNum);
+    x->SetData(xData, unitNum);
-    g->SetData(gData, sUnitNum);
+    g->SetData(gData, unitNum);
    y->SetZeroAll();
    dedx->SetZeroAll();
    dedy->SetZeroAll();
    /* call Softmax function */
    Softmax(x, y, 1);
+    /* call SoftmaxBackward function */
    SoftmaxBackward(g, y, x, dedy, dedx, 1, CROSSENTROPY);
    /* check result */
-    cpuTest = dedx->CheckData(dedxAnswer, sUnitNum);
+    cpuTest = y->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedx->CheckData(dedxAnswer, unitNum, 1e-4F);
 #ifdef USE_CUDA
    /* GPU test */
    bool gpuTest = true;
        /* create tensors */
-    XTensor * xGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * xGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * yGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * yGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * gGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * gGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedyGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedyGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
-    XTensor * dedxGPU = NewTensor(sOrder, sDimSize, X_FLOAT, 1.0F, 0);
+    XTensor * dedxGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
    /* initialize variables */
-    xGPU->SetData(xData, sUnitNum);
+    xGPU->SetData(xData, unitNum);
-    gGPU->SetData(gData, sUnitNum);
+    gGPU->SetData(gData, unitNum);
    yGPU->SetZeroAll();
    dedxGPU->SetZeroAll();
    dedyGPU->SetZeroAll();
@@ -168,7 +173,8 @@ bool TestSoftmax2()
    SoftmaxBackward(gGPU, yGPU, xGPU, dedyGPU, dedxGPU, 1, CROSSENTROPY);
    /* check result */
-    gpuTest = dedxGPU->CheckData(dedxAnswer, sUnitNum);
+    gpuTest = yGPU->CheckData(yAnswer, unitNum, 1e-4F)
+              && dedxGPU->CheckData(dedxAnswer, unitNum, 1e-4F);
    /* destroy variables */
    delete x;
@@ -181,7 +187,7 @@ bool TestSoftmax2()
    delete gGPU;
    delete dedxGPU;
    delete dedyGPU;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest && gpuTest;
 #else
@@ -191,7 +197,7 @@ bool TestSoftmax2()
    delete g;
    delete dedx;
    delete dedy;
-    delete[] sDimSize;
+    delete[] dimSize;
    return cpuTest;
 #endif // USE_CUDA
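Review note: the shortened softmax constants can be verified independently. The following is a standalone sketch, not the library's Softmax or SoftmaxBackward; `SoftmaxRef` and `SoftmaxCrossEntropyGradRef` are hypothetical names. It implements y_i = e^{x_i} / \sum_{k} e^{x_k} for one row and the gradient dE/dx_j = -gold_j + y_j quoted in the case 2 comment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over a single row, with the row maximum subtracted for stability.
std::vector<double> SoftmaxRef(const std::vector<double> & x)
{
    assert(!x.empty());
    double maxValue = x[0];
    for (double v : x)
        if (v > maxValue)
            maxValue = v;

    double sum = 0.0;
    std::vector<double> y;
    for (double v : x) {
        y.push_back(std::exp(v - maxValue));
        sum += y.back();
    }
    for (double & v : y)
        v /= sum;
    return y;
}

// dE/dx_j = -gold_j + y_j, as stated in the case 2 comment.
std::vector<double> SoftmaxCrossEntropyGradRef(const std::vector<double> & y,
                                               const std::vector<double> & gold)
{
    assert(y.size() == gold.size());
    std::vector<double> dedx;
    for (std::size_t j = 0; j < y.size(); j++)
        dedx.push_back(y[j] - gold[j]);
    return dedx;
}
```

For x = (0, 1, 2) and gold = (0, 0, 1), the last gradient item is about -0.33476, so the rounded expectation -0.3347F in this hunk is still inside the 1e-4 tolerance, though only by a small margin.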

--- a/source/tensor/test/TSplit.cpp
+++ b/source/tensor/test/TSplit.cpp
@@ -181,14 +181,20 @@ bool TestSplit2()
    gpuTest = tGPU->CheckData(answer, tUnitNum);
    /* destroy variables */
-	delete s, t, sGPU, tGPU;
+	delete s;
-	delete[] sDimSize, tDimSize;
+    delete t;
+    delete sGPU;
+    delete tGPU;
+	delete[] sDimSize;
+	delete[] tDimSize;
 	return cpuTest && gpuTest;
 #else
    /* destroy variables */
-	delete s, t;
+	delete s;
-	delete[] sDimSize, tDimSize;
+    delete t;
+	delete[] sDimSize;
+	delete[] tDimSize;
 	return cpuTest;
 #endif // USE_CUDA
@@ -295,14 +301,25 @@ bool TestSplit3()
 	gpuTest = tGPU1->CheckData(answer1, tUnitNum1) && tGPU2->CheckData(answer2, tUnitNum2);
    /* destroy variables */
-	delete s, t1, t2, sGPU, tGPU1, tGPU2;
+	delete s;
-	delete[] sDimSize, tDimSize1, tDimSize2;
+    delete t1;
+    delete t2;
+    delete sGPU;
+    delete tGPU1;
+    delete tGPU2;
+	delete[] sDimSize;
+	delete[] tDimSize1;
+	delete[] tDimSize2;
 	return cpuTest && gpuTest;
 #else
    /* destroy variables */
-    delete s, t1, t2;
+	delete s;
-	delete[] sDimSize, tDimSize1, tDimSize2;
+    delete t1;
+    delete t2;
+	delete[] sDimSize;
+	delete[] tDimSize1;
+	delete[] tDimSize2;
 	return cpuTest;
 #endif // USE_CUDA
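Review note: splitting `delete s, t;` into separate statements is a real fix, not a style change. In C++ the comma in `delete a, b;` is the comma operator, so the statement parses as `(delete a), b;` and only the first pointer is released. The sketch below demonstrates this with a hypothetical `Tracked` type that counts destructor calls; it is an illustration and not part of the test suite. Compilers typically emit an unused-value warning for the comma form.

```cpp
// Hypothetical type that counts how many instances have been destroyed.
struct Tracked {
    static int destroyed;
    ~Tracked() { ++destroyed; }
};
int Tracked::destroyed = 0;

// Returns the number of objects destroyed by the comma form.
int CountDestroyedByCommaDelete()
{
    Tracked * a = new Tracked;
    Tracked * b = new Tracked;
    Tracked::destroyed = 0;
    delete a, b;                      // parsed as (delete a), b; only a is destroyed
    const int count = Tracked::destroyed;
    delete b;                         // release b explicitly so the demo does not leak
    return count;
}

// Returns the number of objects destroyed by one delete per pointer.
int CountDestroyedBySeparateDeletes()
{
    Tracked * a = new Tracked;
    Tracked * b = new Tracked;
    Tracked::destroyed = 0;
    delete a;
    delete b;
    return Tracked::destroyed;
}
```

The same reasoning applies to `delete[] sDimSize, tDimSize;`, where only the first array was being freed before this change.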

--- a/source/tensor/test/Test.cpp
+++ b/source/tensor/test/Test.cpp
@@ -31,12 +31,12 @@ bool Test()
    wrong = !TestConcatenate() || wrong;
    wrong = !TestConcatenateSolely() || wrong;
-    //wrong = !TestCopyIndexed() || wrong;
+    wrong = !TestCopyIndexed() || wrong;
    wrong = !TestCopyValues() || wrong;
    wrong = !TestMatrixMul() || wrong;
    wrong = !TestMatrixMul2D() || wrong;
    wrong = !TestMatrixMul2DParallel() || wrong;
-    //wrong = !TestMatrixMulBatched() || wrong;
+    wrong = !TestMatrixMulBatched() || wrong;
    wrong = !TestMatrixMulBatchedCPU() || wrong;
    wrong = !TestMerge() || wrong;
    wrong = !TestMultiply() || wrong;
@@ -56,18 +56,18 @@ bool Test()
    wrong = !TestSplit() || wrong;
    wrong = !TestSum() || wrong;
    wrong = !TestSumByColumnTV() || wrong;
-    //wrong = !TestSumByColumnVT() || wrong;
+    wrong = !TestSumByColumnVT() || wrong;
    wrong = !TestTopK() || wrong;
    wrong = !TestUnsqueeze() || wrong;
    wrong = !TestXMem() || wrong;
-    //wrong = !TestHardTanH() || wrong;
+    wrong = !TestHardTanH() || wrong;
-    //wrong = !TestIdentity() || wrong;
+    wrong = !TestIdentity() || wrong;
-    //wrong = !TestLogSoftmax() || wrong;
+    wrong = !TestLogSoftmax() || wrong;
-    //wrong = !TestLoss() || wrong;
+    wrong = !TestLoss() || wrong;
-    //wrong = !TestRectify() || wrong;
+    wrong = !TestRectify() || wrong;
-    //wrong = !TestSigmoid() || wrong;
+    wrong = !TestSigmoid() || wrong;
-    //wrong = !TestSoftmax() || wrong;
+    wrong = !TestSoftmax() || wrong;
    /* other test */
    /*