NiuTrans.Tensor - version 0.1.0!

Commit 9d33e210 authored Aug 04, 2018 by xuchen
--- a/README.md
+++ b/README.md
-NiuTrans.Tensor Tensor Computation Library
\ No newline at end of file
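+# NiuTrans.Tensor Tensor Computation Library
+
+## NiuTrans.Tensor
+
+NiuTrans.Tensor is a toolkit developed by the NiuTrans open-source project. It provides complete tensor definitions and tensor computation functions, and can be used for deep learning research and for building industrial systems. NiuTrans.Tensor has the following features:
+
+* Small and simple, easy to modify
+* Written in C/C++, with highly optimized code
+* Supports both CPU and GPU devices
+* A rich set of tensor computation interfaces
+* Callable from C/C++, Python, and other languages
+
+## Installation
+
+Before creating a project that uses the NiuTrans.Tensor toolkit, note the following:
+
+* If the project runs on the CPU, the system supports high-performance math libraries; installing [MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/) is recommended.
+* If the project runs on the GPU, [CUDA](https://developer.nvidia.com/cuda-downloads) version 9.0 or later must be installed. The CUDA toolkit provides the development environment for building high-performance GPU-accelerated applications.
+
+The NiuTrans.Tensor toolkit is built from source. The installation steps for Windows and Linux are given below.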
+
+### Windows
+
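+To use the NiuTrans.Tensor toolkit on Windows:
+
+* First, include the NiuTrans.Tensor source code in your project
+* Include three header files in the project: XTensor.h, CHeader.h in core, and FHeader.h in function:
+    * XTensor.h provides the XTensor class that we operate on
+    * CHeader.h in core provides the tensor computation operations
+    * FHeader.h in function provides the activation functions
+* Use the namespace nts in the project
+
+For the required environment configuration, see [NiuTrans.Tensor Environment Configuration](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/linye/doc/Configuration.md).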
+
+### Linux
+
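+To use the NiuTrans.Tensor toolkit on Linux, run make.sh. It generates tensorCPU and tensorGPU in the same directory; these are the CPU and GPU executables of NiuTrans.Tensor. Taking the feed-forward neural network language model as an example, the following command runs the provided test cases on the GPU:
+>./tensorGPU -test
+
+For more details, see the [NiuTrans.Tensor development documentation](http://47.104.97.237/niutrans/site/niutensor/index.html)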
+
+
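+## Development Team
+
+The NiuTrans.Tensor tensor computation library is developed jointly by the Natural Language Processing Laboratory of Northeastern University (东北大学), 小牛翻译 (NiuTrans), and 小牛雅智. It aims to provide complete tensor definitions and tensor computation functions for deep learning research and for building industrial systems.
+
+## Release History
+
+NiuTrans.Tensor version 0.1.0 - August 3, 2018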
\ No newline at end of file
--- a/doc/Configuration.md
+++ b/doc/Configuration.md
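+# NiuTrans.Tensor Environment Configuration
+
+## Notes
+
+CUDA 9.2, the latest version, does not yet support the latest version of VS2017. We therefore recommend CUDA 9.0 or 9.1, together with either VS2015 or VS2017 with the v140 toolset installed.
+
+## CUDA Configuration
+
+Once VS and CUDA are installed and the environment variables are set, the key CUDA configuration options are as follows. All of them can be found under **Project -> Properties**.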
+
+>$(CUDA_PATH)\include
+
+Add the path above to **VC++ Directories -> Include Directories**.
+
+>$(CUDA_PATH)\lib\Win32
+
+Add the path above to **VC++ Directories -> Library Directories**.
+
+>cuda.lib;cudadevrt.lib;cudart.lib;cudart_static.lib;nvcuvid.lib;OpenCL.lib;cublas.lib;curand.lib;
+
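+Add the libraries above to **Linker -> Input -> Additional Dependencies**.
+
+After this configuration, right-click the **project -> Build Dependencies** and select CUDA9.
+Then right-click each .cu file, open Properties, and set Item Type to "CUDA C/C++" (it is easiest to search for the .cu files, select them all, and set the property once).
+
+## Other Configuration
+
+Set **C/C++ -> General -> SDL checks** to No.
+
+In **C/C++ -> Preprocessor -> Preprocessor Definitions**, add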
+
+>USE_CUDA;USE_BLAS;WIN32;MKL;DEBUG;CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS_
+CONSOLE;
+
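+Set **Linker -> System -> SubSystem** to Console.
+
+Under **General -> Character Set**, use the Unicode character set.
+
+Under **Debugging -> Command Arguments**, set the arguments required by the executable.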
+
+
--- a/doc/manual.md
+++ b/doc/manual.md
-<script type="text/javascript" async src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"> </script>
-
 # NiuTrans.Tensor Tensor Computation Library

 ## NiuTrans.Tensor
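@@ -18,50 +16,61 @@
 * If the project runs on the CPU, the system supports high-performance math libraries; installing [Intel® MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/) is recommended.
 * If the project runs on the GPU, the [NVIDIA®CUDA®Toolkit](https://developer.nvidia.com/cuda-downloads), version 9.0 or later, must be installed. The CUDA toolkit provides the development environment for building high-performance GPU-accelerated applications.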

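-When using the NiuTrans.Tensor toolkit developed by the NiuTrans open-source project:
+The NiuTrans.Tensor toolkit is built from source. The installation steps for Windows and Linux are given below.
+
+### Windows
+
+To use the NiuTrans.Tensor toolkit on Windows:

 * First, include the NiuTrans.Tensor source code in your project
-* Include three header files: XTensor.h, CHeader.h in core, and FHeader.h in function:
+* Include three header files in the project: XTensor.h, CHeader.h in core, and FHeader.h in function:
     * XTensor.h provides the XTensor class that we operate on
     * CHeader.h in core provides the tensor computation operations
     * FHeader.h in function provides the activation functions
 * Use the namespace nts in the project

+For the required environment configuration, see [NiuTrans.Tensor Environment Configuration](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/linye/doc/Configuration.md).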
+
+### Linux
+
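+To use the NiuTrans.Tensor toolkit on Linux, run make.sh. It generates tensorCPU and tensorGPU in the same directory; these are the CPU and GPU executables of NiuTrans.Tensor. Taking the feed-forward neural network language model as an example, the following command runs the provided test cases on the GPU: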
+>./tensorGPU -test
+
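 ## What Is a Tensor

-In computer science, a tensor is usually defined as a quantity in an $n$-dimensional space that has $n$ components; in essence, such a tensor is a multidimensional array. The order (or rank) of a tensor is the number of dimensions of this array, or, put simply, the number of indices needed to address one element of the tensor. Typically, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space an order-1 tensor is the vector $(x,y,z)$ represented by a point in space, where $x$, $y$, and $z$ are the coordinates of the point on the three axes.
+In computer science, a tensor is usually defined as a quantity in an \\(n\\)-dimensional space that has \\(n\\) components; in essence, such a tensor is a multidimensional array. The order (or rank) of a tensor is the number of dimensions of this array, or, put simply, the number of indices needed to address one element of the tensor. Typically, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space an order-1 tensor is the vector \\((x,y,z)\\) represented by a point in space, where \\(x\\), \\(y\\), and \\(z\\) are the coordinates of the point on the three axes.

-Tensors are an efficient mathematical modeling tool that can express complex problems in a uniform, concise way. For example, suppose Jiang Yingjun needs 2 jin of beef and 5 jin of potatoes to cook a meal, and at the market beef costs 32 yuan per jin and potatoes 2 yuan per jin. The total cost is then $2 \times 32 + 5 \times 2 = 74$ yuan. In tensor terms, we can use an order-1 tensor $a=(2,5)$ for the required weight of each food and another order-1 tensor $b=(32,2)$ for the price of each food. Finally, an order-0 tensor $c$ represents the total price, computed as follows
+Tensors are an efficient mathematical modeling tool that can express complex problems in a uniform, concise way. For example, suppose Jiang Yingjun needs 2 jin of beef and 5 jin of potatoes to cook a meal, and at the market beef costs 32 yuan per jin and potatoes 2 yuan per jin. The total cost is then \\(2 \times 32 + 5 \times 2 = 74\\) yuan. In tensor terms, we can use an order-1 tensor \\(a=(2,5)\\) for the required weight of each food and another order-1 tensor \\(b=(32,2)\\) for the price of each food. Finally, an order-0 tensor \\(c\\) represents the total price, computed as follows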

 $$
 \begin{aligned}
-  c & = a \times b^T \\
-    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 \\ 2\end{matrix}\right) \\
-    & = 2 \times 32 + 5 \times 2 \\
+  c & = a \times b^T \\\\
+    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 \\\\ 2\end{matrix}\right) \\\\
+    & = 2 \times 32 + 5 \times 2 \\\\
    & = 74
 \end{aligned}
 $$

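-Here $b^T$ denotes the transpose of the row vector $b$, that is, a column vector, and $\times$ denotes vector multiplication. The next day, Jiang Yingjun goes to a different market, where beef costs 35 yuan per jin and potatoes 1 yuan per jin. To obtain the total price at each of the two markets, $b$ can be redefined as an order-2 tensor $\left(\begin{matrix}32 & 2 \\ 35 & 1\end{matrix}\right)$ and the total price $c$ as an order-2 tensor. As before,
+Here \\(b^T\\) denotes the transpose of the row vector \\(b\\), that is, a column vector, and \\(\times\\) denotes vector multiplication. The next day, Jiang Yingjun goes to a different market, where beef costs 35 yuan per jin and potatoes 1 yuan per jin. To obtain the total price at each of the two markets, \\(b\\) can be redefined as an order-2 tensor \\(\left(\begin{matrix}32 & 2 \\\\ 35 & 1\end{matrix}\right)\\) and the total price \\(c\\) as an order-2 tensor. As before,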

 $$
 \begin{aligned}
-  c & = a \times b^T \\
-    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 & 35 \\ 2 & 1\end{matrix}\right) \\
+  c & = a \times b^T \\\\
+    & = \left(\begin{matrix}2 & 5\end{matrix}\right) \times \left(\begin{matrix}32 & 35 \\\\ 2 & 1\end{matrix}\right) \\\\
    & = \left(\begin{matrix}74 & 75\end{matrix}\right)
 \end{aligned}
 $$

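-That is, shopping at the two markets costs 74 yuan and 75 yuan respectively. As this shows, tensors can model diverse and complex problems. For example, the definitions of $a$, $b$, and $c$ above can be extended to higher-order tensors to handle different times, different markets, and different recipes. However the situation changes, the same formula $c = a \times b^T$ describes the problem.
+That is, shopping at the two markets costs 74 yuan and 75 yuan respectively. As this shows, tensors can model diverse and complex problems. For example, the definitions of \\(a\\), \\(b\\), and \\(c\\) above can be extended to higher-order tensors to handle different times, different markets, and different recipes. However the situation changes, the same formula \\(c = a \times b^T\\) describes the problem.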
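The shopping example can be checked with a few lines of plain C++. This is only a sketch of the arithmetic in \\(c = a \times b^T\\); it deliberately avoids the NiuTrans.Tensor API, and the function name `RowTimesMatrix` is a hypothetical helper introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row vector times matrix: c[j] = sum_i a[i] * m[i][j].
// Illustration only; this is not a NiuTrans.Tensor function.
std::vector<float> RowTimesMatrix(const std::vector<float>& a,
                                  const std::vector<std::vector<float>>& m) {
    assert(a.size() == m.size());  // one matrix row per component of a
    std::vector<float> c(m.empty() ? 0 : m[0].size(), 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(m[i].size() == c.size());
        for (std::size_t j = 0; j < c.size(); ++j)
            c[j] += a[i] * m[i][j];
    }
    return c;
}
```

Calling it with `a = (2, 5)` and the transposed price matrix, whose rows are `(32, 35)` and `(2, 1)`, yields the two totals from the text, 74 and 75.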

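 Many real-world problems can be described as tensor expressions, that is, by writing the combination and computation of tensors as arithmetic expressions. This way of modeling also forms the basis of modern neural network models and deep learning methods. In many machine learning tools, tensor computation has become the basic unit of processes such as forward and backward propagation in neural networks, and it is very widely used.

 ## How to Define a Tensor

-If you are a C/C++ or Python user, defining tensors with NiuTrans.Tensor in your program is very simple. First, download the NiuTrans.Tensor toolkit and extract it to any directory, for example ~/NTS. Inside NTS there is a source subdirectory, which holds the source code. The source subdirectory is organized as follows:
+If you are a C/C++ or Python user, defining tensors with NiuTrans.Tensor in your program is very simple. First, download the NiuTrans.Tensor toolkit and extract it to any directory, for example ~/NiuTrans.Tensor. Inside NiuTrans.Tensor there is a source subdirectory, which holds the source code. The source subdirectory is organized as follows:

-* ~/NTS/source/XTensor.h - defines the tensor structure XTensor and the interfaces for creating and destroying XTensor objects
-* ~/NTS/source/core - source files with the declarations and implementations of the tensor computation functions
+* ~/NiuTrans.Tensor/source/tensor/XTensor.h - defines the tensor structure XTensor and the interfaces for creating and destroying XTensor objects
+* ~/NiuTrans.Tensor/source/tensor/core - source files with the declarations and implementations of the tensor computation functions
     * arithmetic - source files for arithmetic operations
     * getandset - source files for data access (getting and setting values)
     * math - source files for mathematical operations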
@@ -69,9 +78,9 @@ $$
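     * reduce - source files for reduction operations
     * shape - source files for shape transformations
     * sort - source files for sorting operations
-* ~/NTS/source/function - source files for the activation functions
-* ~/NTS/source/test - source files for the unit tests
-* ~/NTS/source/*.h(cpp) - unrelated to tensor definition, introduced later :)
+* ~/NiuTrans.Tensor/source/tensor/function - source files for the activation functions
+* ~/NiuTrans.Tensor/source/tensor/test - source files for the unit tests
+* ~/NiuTrans.Tensor/source/tensor/*.h(cpp) - unrelated to tensor definition, introduced later :)

 Taking C/C++ as an example, including the XTensor.h header in the source program is all that is needed to define a tensor. Below is a simple sample program, sample.cpp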
 ```
@@ -96,7 +105,7 @@ int main(int argc, const char ** argv)
 Next, compile the program above. This step requires specifying the directory that contains the XTensor.h header. For example, to compile sample.cpp with g++

 ```
-g++ sample.cpp -I$HOME/NTS/source -o sample
+g++ sample.cpp -I$HOME/NiuTrans.Tensor/source/tensor -o sample
 ```

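 sample.cpp uses XTensor, a class in NiuTrans.Tensor that defines the data structure a tensor needs. With this class we can compute with tensors, copy them, and so on. After a variable of type XTensor is declared, it must be initialized, or in other words actually specified as a tensor: for example, by setting the size of each dimension, setting the data type of each element, and allocating memory for the tensor. InitTensor2D() is one such initialization function. It initializes the tensor as a matrix and takes four parameters: a pointer to the tensor being initialized, the number of columns of the matrix, the number of rows of the matrix, and the data type of the elements. Here X_FLOAT is an enumeration value defined by NiuTrans.Tensor that denotes single-precision floating point. We can also use X_INT or X_DOUBLE to specify 32-bit integers or double-precision floating point.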
@@ -245,17 +254,55 @@

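 ### Arithmetic (arithmetic)

-This part mainly covers the basic mathematical operations: addition, subtraction, multiplication, division, negation, and so on.
+This part mainly covers the basic mathematical operations: addition, subtraction, multiplication, division, negation, absolute value, and so on.
+
+#### Absolute Value (Absolute)
+
+##### What is the absolute-value operation on a tensor?
+The absolute-value operation takes the absolute value of every element of a tensor and produces a new tensor. For a \\(2 \times 3\\) matrix it works as follows: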
+
+$$
+\left(\begin{matrix}-1.0 & 2.0 & 3.0\\\\-4.0 & 5.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0\end{matrix}\right)
+$$
+
+##### Calling Absolute
+
+NiuTrans.Tensor provides the absolute-value operation for tensors, defined in NiuTrans.Tensor/Tensor/core/arithmetic
+
+The call signatures and parameters are as follows:
+```
+void _Absolute(const XTensor * a, XTensor * b)
+
+void _AbsoluteMe(XTensor * a)
+
+XTensor Absolute(const XTensor & a)
+```
+Parameters: 
+
+* a - input tensor
+* b - output tensor
+
+##### Absolute code snippet
+
+Sample code that uses Absolute to take the absolute value of a tensor:
+```
+/* call Absolute function */
+b = Absolute(*a);
+```
+For a detailed code example of the absolute-value operation, see:
+
+NiuTrans.Tensor/Tensor/test/TAbsolute.cpp
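The difference between an in-place call and a value-returning call can be sketched in plain C++. The helper names `AbsoluteInPlace` and `AbsoluteCopy` are hypothetical and only loosely mirror the roles of `_AbsoluteMe` and `Absolute`; the code does not use the NiuTrans.Tensor API.

```cpp
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// In-place variant: overwrite every element with its absolute value.
void AbsoluteInPlace(Matrix& a) {
    for (auto& row : a)
        for (auto& x : row)
            x = std::fabs(x);
}

// Value-returning variant: the input is left untouched and a new matrix is returned.
Matrix AbsoluteCopy(const Matrix& a) {
    Matrix b = a;
    AbsoluteInPlace(b);
    return b;
}
```

Applied to the \\(2 \times 3\\) matrix from the example above, `AbsoluteCopy` returns the matrix with -1.0 and -4.0 replaced by 1.0 and 4.0, while the argument keeps its original values.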

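 #### Matrix Multiplication (MatrixMul)

 ##### What is matrix multiplication between tensors?
-Matrix multiplication multiplies two matrices and produces a new result matrix. Multiplying a $2 \times 3$ matrix by a $3 \times 2$ matrix proceeds as follows; the result is a $2 \times 2$ matrix:
+Matrix multiplication multiplies two matrices and produces a new result matrix. Multiplying a \\(2 \times 3\\) matrix by a \\(3 \times 2\\) matrix proceeds as follows; the result is a \\(2 \times 2\\) matrix: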

 $$
-\left(\begin{matrix}1.0 & 2.0 & 3.0\\-4.0 & 5.0 & 6.0\end{matrix}\right) × 
-\left(\begin{matrix}0.0 & -1.0\\1.0 & 2.0\\2.0 & 1.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}8.0 & 6.0\\17.0 & 20.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & 2.0 & 3.0\\\\-4.0 & 5.0 & 6.0\end{matrix}\right) × 
+\left(\begin{matrix}0.0 & -1.0\\\\1.0 & 2.0\\\\2.0 & 1.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}8.0 & 6.0\\\\17.0 & 20.0\end{matrix}\right)
 $$
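The product shown above can be reproduced with a short triple loop. This is a minimal sketch in plain C++ under the assumption of dense, row-major nested vectors; `MatMul2D` is a hypothetical name and not the library's MatrixMul.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// c[i][j] = sum_k a[i][k] * b[k][j]; a is n x m, b is m x p, c is n x p.
Matrix MatMul2D(const Matrix& a, const Matrix& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    const std::size_t p = b.empty() ? 0 : b[0].size();
    Matrix c(n, std::vector<float>(p, 0.0f));
    for (std::size_t i = 0; i < n; ++i) {
        assert(a[i].size() == m);  // inner dimensions must agree
        for (std::size_t k = 0; k < m; ++k)
            for (std::size_t j = 0; j < p; ++j)
                c[i][j] += a[i][k] * b[k][j];
    }
    return c;
}
```

For the two matrices in the example, the first row of the result is \\(1 \cdot 0 + 2 \cdot 1 + 3 \cdot 2 = 8\\) and \\(1 \cdot (-1) + 2 \cdot 2 + 3 \cdot 1 = 6\\), which matches the displayed result.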

 ##### Calling MatrixMul
@@ -284,7 +331,7 @@ Parameters:
 Taking the most basic two-dimensional matrix multiplication as an example, sample code that uses MatrixMul:
 ```
 /* call MatrixMul function */
-_MatrixMul(s1, X_NOTRANS, s2, X_NOTRANS, t);
+t = MatrixMul(*s1, X_NOTRANS, *s2, X_NOTRANS);
 ```
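 For a detailed code example of matrix multiplication, see:

@@ -294,12 +341,12 @@ NiuTrans.Tensor/Tensor/test/TMatrixMul.cpp

 ##### What is element-wise tensor multiplication?

-Element-wise multiplication multiplies the elements of two tensors position by position. For two $2 \times 2$ tensors it works as follows:
+Element-wise multiplication multiplies the elements of two tensors position by position. For two \\(2 \times 2\\) tensors it works as follows: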

 $$
-\left(\begin{matrix}0.0 & 1.0\\2.0 & 3.0\end{matrix}\right)  ·
-\left(\begin{matrix}0.0 & 1.0\\2.0 & 3.0\end{matrix}\right)  \rightarrow 
-\left(\begin{matrix}0.0 & 1.0\\4.0 & 9.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0\\\\2.0 & 3.0\end{matrix}\right)  ·
+\left(\begin{matrix}0.0 & 1.0\\\\2.0 & 3.0\end{matrix}\right)  \rightarrow 
+\left(\begin{matrix}0.0 & 1.0\\\\4.0 & 9.0\end{matrix}\right)
 $$
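To make the contrast with matrix multiplication concrete, here is a minimal plain-C++ sketch of the element-wise product. `MultiplyElementwise` is a hypothetical helper, not the library's Multiply, and it assumes both operands have identical shapes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// c[i][j] = a[i][j] * b[i][j]; both operands must have the same shape.
Matrix MultiplyElementwise(const Matrix& a, const Matrix& b) {
    assert(a.size() == b.size());
    Matrix c = a;
    for (std::size_t i = 0; i < a.size(); ++i) {
        assert(a[i].size() == b[i].size());
        for (std::size_t j = 0; j < a[i].size(); ++j)
            c[i][j] *= b[i][j];
    }
    return c;
}
```

Multiplying the \\(2 \times 2\\) matrix from the example with itself squares each entry, giving 0, 1, 4, and 9.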

 ##### Calling Multiply
@@ -325,7 +372,7 @@ Parameters:
 A sample call that uses Multiply to multiply tensors s1 and s2 element-wise, storing the result in t:
 ```
 /* call multiply function */
-_Multiply(s1, s2, t, 0);
+t = Multiply(*s1, *s2, 0);
 ```
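 For a detailed code example of element-wise multiplication, see:

@@ -335,43 +382,85 @@ NiuTrans.Tensor/Tensor/test/TMultiply.cpp

 ##### What is the negation operation on a tensor?

-Negation negates every element of a tensor, and the new elements together form the result tensor. For a $3 \times 2$ tensor it works as follows:
+Negation negates every element of a tensor, and the new elements together form the result tensor. For a \\(3 \times 2\\) tensor it works as follows: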

 $$
-\left(\begin{matrix}1.0 & -2.0\\-3.0 & 4.0\\5.0 & -6.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}-1.0 & 2.0\\3.0 & -4.0\\-5.0 & 6.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & -2.0\\\\-3.0 & 4.0\\\\5.0 & -6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}-1.0 & 2.0\\\\3.0 & -4.0\\\\-5.0 & 6.0\end{matrix}\right)
 $$

 ##### Calling Negate

 NiuTrans.Tensor provides tensor negation, which negates a tensor element by element. The function is defined in NiuTrans.Tensor/Tensor/core/arithmetic. Its call signatures and parameters are as follows:
 ```
-void _Negate(XTensor * a)
+void _Negate(const XTensor * a, XTensor * b)
+
+void _NegateMe(XTensor * a)
+
+XTensor Negate(const XTensor & a)
 ```
 Parameters: 

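 * a - input tensor
+* b - output tensor

 ##### Negate code snippet

-A sample call that uses Negate to negate a tensor, where a is the tensor to process:
+A sample call that uses Negate to negate a tensor, where a is the tensor to process and b is the result tensor:
 ```
 /* call negate function */
-_Negate(a);
+b = Negate(*a);
 ```
 For a detailed code example of tensor negation, see: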

 NiuTrans.Tensor/Tensor/test/TNegate.cpp

+#### Sign Function (Sign)
+
+##### What is the sign function on a tensor?
+
+The sign function returns the sign of every element of a tensor. For a \\(3 \times 2\\) tensor it works as follows:
+
+$$
+\left(\begin{matrix}1.0 & -2.0\\\\0.0 & 4.0\\\\5.0 & -6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & -1.0\\\\0.0 & 1.0\\\\1.0 & -1.0\end{matrix}\right)
+$$
+
+##### Calling Sign
+
+NiuTrans.Tensor provides the sign function for tensors, defined in NiuTrans.Tensor/Tensor/core/arithmetic. Its call signatures and parameters are as follows:
+```
+void _Sign(const XTensor * a, XTensor * b)
+
+void _SignMe(XTensor * a)
+
+XTensor Sign(const XTensor & a)
+```
+Parameters: 
+
+* a - input tensor
+* b - output tensor
+
+##### Sign code snippet
+
+A sample call that uses Sign, where a is the tensor to process and b is the result tensor:
+```
+/* call Sign function */
+b = Sign(*a);
+```
+For a detailed code example of the sign function, see:
+
+NiuTrans.Tensor/Tensor/test/TSign.cpp
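The mapping used in the example (positive to 1.0, negative to -1.0, zero to 0.0) can be written as a branch-free expression. The following plain-C++ sketch uses the hypothetical names `SignOf` and `SignCopy` and does not call the library.

```cpp
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Returns 1 for positive x, -1 for negative x, and 0 for zero.
float SignOf(float x) {
    return static_cast<float>((x > 0.0f) - (x < 0.0f));
}

// Applies SignOf to every element and returns a new matrix.
Matrix SignCopy(const Matrix& a) {
    Matrix b = a;
    for (auto& row : b)
        for (auto& x : row)
            x = SignOf(x);
    return b;
}
```

The zero entry in the second row of the example stays 0.0, which is the case that a simple `x > 0 ? 1 : -1` would get wrong.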
+
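 #### Addition (Sum)

 ##### What is tensor addition?
-Tensor addition adds n tensors to produce a new result tensor. The element at each position of the result is the sum of the operands' elements at that position, and the operands and the result all have the same dimensions. Adding two $2\times 3$ tensors proceeds as follows:
+Tensor addition adds n tensors to produce a new result tensor. The element at each position of the result is the sum of the operands' elements at that position, and the operands and the result all have the same dimensions. Adding two \\(2\times 3\\) tensors proceeds as follows: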

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 \\ 3.0 & 4.0 & 5.0\end{matrix}\right) + 
-\left(\begin{matrix}0.5 & 1.5 & 2.5 \\ 3.5 & 4.5 & 5.5\end{matrix}\right) \rightarrow
-\left(\begin{matrix}0.5 & 2.5 & 4.5 \\ 6.5 & 8.5 & 10.5\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 \\\\ 3.0 & 4.0 & 5.0\end{matrix}\right) + 
+\left(\begin{matrix}0.5 & 1.5 & 2.5 \\\\ 3.5 & 4.5 & 5.5\end{matrix}\right) \rightarrow
+\left(\begin{matrix}0.5 & 2.5 & 4.5 \\\\ 6.5 & 8.5 & 10.5\end{matrix}\right)
 $$

 ##### Calling Sum
@@ -398,7 +487,7 @@ Parameters:
 A call to Sum that adds two tensors follows; in this example the result is stored in c:
 ```
 /* call sum function */
-_Sum(a, b, c);
+c = Sum(*a, *b);
 ```
 For a detailed code example, see:

@@ -408,11 +497,11 @@ NiuTrans.Tensor/Tensor/test/TSum.cpp

 ##### What is SumByColumnTV?

-SumByColumnTV adds a Tensor and a Vector column by column; the result has the same dimensions as the Tensor. For a $2 \times 4$ Tensor and a $2 \times 1$ Vector it works as follows:
+SumByColumnTV adds a Tensor and a Vector column by column; the result has the same dimensions as the Tensor. For a \\(2 \times 4\\) Tensor and a \\(2 \times 1\\) Vector it works as follows:

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) + \left(\begin{matrix}1.0\\0.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) + \left(\begin{matrix}1.0\\\\0.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
 $$ 

 ##### Calling SumByColumnTV
@@ -445,11 +534,11 @@ NiuTrans.Tensor/Tensor/test/TSumByColumnTV.cpp

 ##### What is SumByColumnVT?

-SumByColumnVT adds a Vector and a Tensor column by column; the result has the same dimensions as the Vector. For a $2 \times 1$ Vector and a $2 \times 4$ Tensor it works as follows:
+SumByColumnVT adds a Vector and a Tensor column by column; the result has the same dimensions as the Vector. For a \\(2 \times 1\\) Vector and a \\(2 \times 4\\) Tensor it works as follows:

 $$
-\left(\begin{matrix}1.0\\0.0\end{matrix}\right) + \left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}7.0\\22.0\end{matrix}\right)
+\left(\begin{matrix}1.0\\\\0.0\end{matrix}\right) + \left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}7.0\\\\22.0\end{matrix}\right)
 $$ 
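The two column-wise sums differ in which operand fixes the output shape. The plain-C++ sketch below reproduces the two worked examples; the function names are hypothetical, and the semantics are inferred from the example values rather than taken from the library source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Tensor + Vector: the column vector v is added to every column of t,
// so the result keeps the shape of t: c[i][j] = t[i][j] + v[i].
Matrix SumByColumnTVSketch(const Matrix& t, const std::vector<float>& v) {
    assert(t.size() == v.size());
    Matrix c = t;
    for (std::size_t i = 0; i < c.size(); ++i)
        for (auto& x : c[i])
            x += v[i];
    return c;
}

// Vector + Tensor: every column of t is accumulated into v,
// so the result keeps the shape of v: c[i] = v[i] + sum_j t[i][j].
std::vector<float> SumByColumnVTSketch(const std::vector<float>& v, const Matrix& t) {
    assert(t.size() == v.size());
    std::vector<float> c = v;
    for (std::size_t i = 0; i < c.size(); ++i)
        for (float x : t[i])
            c[i] += x;
    return c;
}
```

With the inputs from the examples, the first function returns the \\(2 \times 4\\) matrix whose first row is shifted by 1.0, and the second returns \\((7.0, 22.0)\\), since \\(1 + 0 + 1 + 2 + 3 = 7\\) and \\(0 + 4 + 5 + 6 + 7 = 22\\).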

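 ##### Calling SumByColumnVT
@@ -482,29 +571,66 @@ NiuTrans.Tensor/Tensor/test/TSumByColumnVT.cpp

 This part covers data type conversion, setting data, getting data, and similar operations.

+#### ConvertDataType
+
+##### What is ConvertDataType?
+
+ConvertDataType converts the data type of every element of a tensor to another data type.
+
+##### Calling ConvertDataType
+
+NiuTrans.Tensor provides the ConvertDataType operation. Its call signature and parameters are as follows: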
+
+```
+void _ConvertDataType(const XTensor * input, XTensor * output)
+```
+Parameters:
+
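+* input - input tensor
+* output - output tensor
+
+#####  ConvertDataType code snippet
+
+Sample code for ConvertDataType follows. This example converts the element data type of a tensor from float32 to int32.
+
+First, create tensor a with type float32 and tensor b with type int32: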
+```
+/* create tensors */
+XTensor * a = NewTensor(aOrder, aDimSize);
+XTensor * b = NewTensor(aOrder, aDimSize, X_INT);
+```
+Then call the ConvertDataType function:
+```
+/* call ConvertDataType function */
+_ConvertDataType(a, b);
+```
+For a detailed code example of ConvertDataType, see:
+
+NiuTrans.Tensor/Tensor/test/TConvertDataType.cpp
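A float32 to int32 conversion has to decide what happens to the fractional part. The text does not say whether the library truncates or rounds, so the plain-C++ sketch below simply uses `static_cast<int>`, which truncates toward zero; treat that choice, and the helper name `ConvertFloatToInt`, as assumptions made for illustration.

```cpp
#include <vector>

// Element-wise float -> int conversion.
// static_cast<int> discards the fractional part (truncation toward zero);
// whether the library behaves the same way is an assumption.
std::vector<int> ConvertFloatToInt(const std::vector<float>& input) {
    std::vector<int> output;
    output.reserve(input.size());
    for (float x : input)
        output.push_back(static_cast<int>(x));
    return output;
}
```

Under this choice 1.9 becomes 1 and -2.7 becomes -2, so values are not rounded to the nearest integer.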
+
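 #### Selection (Select)

 ##### What is the selection operation on a tensor?

-Select picks the elements of a tensor at given positions along a given dimension. For a $2 \times 2 \times 4$ tensor it works as follows; this example selects the elements at indices 1 and 2 along dimension 2 and stores them in the target tensor, giving a $2 \times 2 \times 2$ tensor:
+Select picks the elements of a tensor at given positions along a given dimension. For a \\(2 \times 2 \times 4\\) tensor it works as follows; this example selects the elements at indices 1 and 2 along dimension 2 and stores them in the target tensor, giving a \\(2 \times 2 \times 2\\) tensor: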

 $$
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right),\\ 
+\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right),\\\\ 
 & \left( 
-\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\5.0 & 6.0 & 7.0 & 8.0\end{matrix}
+\begin{matrix}1.0 & 2.0 & 3.0 & 4.0\\\\5.0 & 6.0 & 7.0 & 8.0\end{matrix}
 \right)
 \Biggr)
 \end{aligned} \rightarrow 
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}1.0 & 2.0\\5.0 & 6.0\end{matrix}
-\right),\\ 
+\begin{matrix}1.0 & 2.0\\\\5.0 & 6.0\end{matrix}
+\right),\\\\ 
 & \left( 
-\begin{matrix}2.0 & 3.0\\6.0 & 7.0\end{matrix}
+\begin{matrix}2.0 & 3.0\\\\6.0 & 7.0\end{matrix}
 \right)  
 \Biggr)
 \end{aligned}
@@ -548,7 +674,7 @@ Parameters:
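 Sample code for tensor selection follows, where s is the input tensor and t is the result tensor. The selection covers the index range from 1 up to, but not including, 3 on the third dimension (dimension 2), that is, indices 1 and 2 as in the example above: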
 ```
 /* call SelectRange function */
-_SelectRange(s, 2, 1, 3, t);
+t = SelectRange(*s, 2, 1, 3);
 ```
 For a detailed code example of tensor selection, see:

@@ -558,11 +684,11 @@ NiuTrans.Tensor/Tensor/test/TSelect.cpp

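 ##### What is SetData?

-SetData initializes a tensor with random values from a given range. For a $2 \times 4$ tensor and the range [0.0,1.0] it works as follows:
+SetData initializes a tensor with random values from a given range. For a \\(2 \times 4\\) tensor and the range [0.0,1.0] it works as follows: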

 $$
-\left(\begin{matrix}0.0 & 0.0 & 0.0 & 0.0\\0.0 & 0.0 & 0.0 & 0.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.1 & 0.5 & 0.3 & 0.9\\0.8 & 0.5 & 0.5 & 0.2\end{matrix}\right)
+\left(\begin{matrix}0.0 & 0.0 & 0.0 & 0.0\\\\0.0 & 0.0 & 0.0 & 0.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.1 & 0.5 & 0.3 & 0.9\\\\0.8 & 0.5 & 0.5 & 0.2\end{matrix}\right)
 $$ 

 ##### Calling SetData
@@ -571,16 +697,19 @@

 Set the tensor to a fixed value:
 ```
-void SetDataFixed(XTensor * tensor, void * valuePointer)
+void _SetDataFixed(XTensor * tensor, void * valuePointer)
+
+void SetDataFixed(XTensor &tensor, DTYPE p)
 ```
 Parameters:

 * tensor - input tensor
 * valuePointer - pointer to the data
+* p - the value to set

 Set the tensor to an integer value:
 ```
-void SetDataFixedInt(XTensor * tensor, int p)
+void _SetDataFixedInt(XTensor * tensor, int p)
 ```
 Parameters:

@@ -589,7 +718,7 @@ Parameters:

 Set the tensor to a single-precision floating-point value:
 ```
-void SetDataFixedFloat(XTensor * tensor, float p)
+void _SetDataFixedFloat(XTensor * tensor, float p)
 ```
 Parameters:

@@ -598,7 +727,7 @@ Parameters:

 Set the tensor to a double-precision floating-point value:
 ```
-void SetDataFixedDouble(XTensor * tensor, double p)
+void _SetDataFixedDouble(XTensor * tensor, double p)
 ```
 Parameters:

@@ -607,7 +736,7 @@ Parameters:

 Set the tensor to random values within a range:
 ```
-void SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
+void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
 ```
 * tensor - input tensor
 * low - lower bound of the values
@@ -615,7 +744,7 @@ void SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)

 Set the tensor to random values from a normal distribution:
 ```
-void SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
+void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
 ```
 Parameters:

@@ -636,7 +765,39 @@ NiuTrans.Tensor/Tensor/test/TSetData.cpp

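 ### Mathematical Operations (math)

-This part covers operations beyond basic arithmetic, including log, exp, abs, and so on.
+This part covers operations beyond basic arithmetic, including log, exp, power, and so on.
+
+#### Logarithm (Log)
+
+##### What is the logarithm operation on a tensor?
+
+The logarithm operation takes the logarithm of every element of a tensor and produces a new tensor.
+
+##### Calling Log
+
+NiuTrans.Tensor provides the Log operation for tensors. Its call signatures and parameters are as follows: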
+```
+void _Log(const XTensor * a, XTensor * b)
+
+void _LogMe(XTensor * a)
+
+XTensor Log(const XTensor & a)
+```
+Parameters:
+
+* a - input tensor
+* b - output tensor
+
+#####  Log code snippet
+
+Sample code for Log:
+```
+/* call Log function */
+b = Log(*a);
+```
+For a detailed code example of Log, see:
+
+NiuTrans.Tensor/Tensor/test/TLog.cpp

 #### Normalization (Normalize)

@@ -671,7 +832,7 @@ Parameters:
 Sample code for Normalize:
 ```
 /* call normalize function */
-_Normalize(s, t, 0, mean, var, a, b, 0.0);
+t = Normalize(*s, 0, *mean, *var, *a, *b, 0.0F);
 ```
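 For a detailed code example of Normalize, see:

@@ -681,25 +842,30 @@ NiuTrans.Tensor/Tensor/test/TNormalize.cpp

 ##### What is the power operation on a tensor?

-The power operation raises every element of a tensor to a given power and produces a new tensor. For a $3 \times 2$ tensor and the power 2.0 it works as follows:
+The power operation raises every element of a tensor to a given power and produces a new tensor. For a \\(3 \times 2\\) tensor and the power 2.0 it works as follows: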


 $$
-\left(\begin{matrix}1.0 & 2.0\\3.0 & 4.0\\5.0 & 6.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}1.0 & 4.0\\9.0 & 16.0\\25.0 & 36.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & 2.0\\\\3.0 & 4.0\\\\5.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}1.0 & 4.0\\\\9.0 & 16.0\\\\25.0 & 36.0\end{matrix}\right)
 $$

 ##### Calling Power

 NiuTrans.Tensor provides the power operation, which raises a tensor to a power element by element. It is called as follows:
 ```
-void _Power(XTensor * a, DTYPE p)
+void _Power(const XTensor * a, XTensor * b, DTYPE p)
+
+void _PowerMe(XTensor * a, DTYPE p)
+
+XTensor Power(const XTensor & a, DTYPE p)
 ```
 Here a is the tensor operated on and p is the exponent. The parameters are:

 Parameters: 

 * a - input tensor
+* b - output tensor
 * p - exponent

 ##### Power code snippet
@@ -707,7 +873,7 @@ Parameters:
 Sample code that calls Power to raise a to the power 2.0:
 ```
 /* call power function */
-_Power(a, 2.0);
+b = Power(*a, 2.0F);
 ```
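 For a detailed code example of the power operation, see:

@@ -717,11 +883,11 @@ NiuTrans.Tensor/Tensor/test/TPower.cpp

 ##### What is scaling and shifting a tensor?

-Scaling and shifting computes p = p * scale + shift for every element p, where scale and shift are the scaling and shifting parameters. For a $2 \times 4$ tensor with scale 2.0 and shift 0.5 it works as follows:
+Scaling and shifting computes p = p * scale + shift for every element p, where scale and shift are the scaling and shifting parameters. For a \\(2 \times 4\\) tensor with scale 2.0 and shift 0.5 it works as follows: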

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)
 $$
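The formula p = p * scale + shift is a per-element affine map. A minimal plain-C++ sketch follows; `ScaleAndShiftCopy` is a hypothetical helper written for illustration and is not the library's ScaleAndShift.

```cpp
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Returns a new matrix with every element p replaced by p * scale + shift.
Matrix ScaleAndShiftCopy(const Matrix& a, float scale, float shift) {
    Matrix b = a;
    for (auto& row : b)
        for (auto& p : row)
            p = p * scale + shift;
    return b;
}
```

With scale 2.0 and shift 0.5, the element 7.0 in the example becomes \\(7.0 \times 2.0 + 0.5 = 14.5\\), the last entry of the displayed result.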

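 ##### Calling ScaleAndShift
@@ -748,7 +914,7 @@ Parameters:
 Sample code for scaling and shifting follows; the first argument is the input tensor, scaleFactor is the scaling parameter, and shiftFactor is the shifting parameter: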
 ```
 /* call ScaleAndShift function */
-_ScaleAndShift(input, scaleFactor, shiftFactor);
+t = ScaleAndShift(*s, scaleFactor, shiftFactor);
 ```
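 For a detailed code example of scaling and shifting, see:

@@ -758,71 +924,34 @@ NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp

 This part introduces the data copying functions.

-#### Copy (CopyValues)
-
-##### What is the copy operation on a tensor?
-
-Copying assigns the values of one tensor to another tensor. For a $2 \times 4$ tensor it works as follows: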
-
-$$
-\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
-\left(
-\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
-$$
-
-##### Calling CopyValues
-
-NiuTrans.Tensor provides the tensor copy operation. Its call signatures and parameters are as follows:
-```
-void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL)
-
-XTensor CopyValues(const XTensor &s, XStream * stream = NULL)
-```
-Parameters:
-
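-* s - input tensor
-* t - output tensor
-* stream - multi-threading stream
-
-#####  CopyValues code snippet
-
-Sample code for copying a tensor follows, where input is the tensor to copy and output is the result tensor:
-```
-/* call CopyValues function */
-_CopyValues(input, output);
-```
-For a detailed code example of tensor copying, see: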
-
-NiuTrans.Tensor/Tensor/test/TCopyValues.cpp
-
 #### CopyIndexed

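 ##### What is the CopyIndexed operation on a tensor?

-CopyIndexed copies a tensor at specified index positions. For a $3 \times 2 \times 3$ tensor it works as follows; this example copies 1 element starting at each of the indices 0 and 2 along dimension 2, giving a $3 \times 2 \times 2$ tensor:
+CopyIndexed copies a tensor at specified index positions. For a \\(3 \times 2 \times 3\\) tensor it works as follows; this example copies 1 element starting at each of the indices 0 and 2 along dimension 2, giving a \\(3 \times 2 \times 2\\) tensor: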

 $$
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}0.0 & -1.0 & 2.0\\2.0 & 1.0 & 3.0\end{matrix}\right),\\ 
+\begin{matrix}0.0 & -1.0 & 2.0\\\\2.0 & 1.0 & 3.0\end{matrix}\right),\\\\ 
 & \left( 
-\begin{matrix}1.0 & 2.0 & 4.0\\3.0 & 1.0 & 2.0\end{matrix}
-\right),\\ 
+\begin{matrix}1.0 & 2.0 & 4.0\\\\3.0 & 1.0 & 2.0\end{matrix}
+\right),\\\\ 
 & \left( 
-\begin{matrix}-1.0 & 3.0 & 2.0\\1.0 & -1.0 & 0.0\end{matrix}
+\begin{matrix}-1.0 & 3.0 & 2.0\\\\1.0 & -1.0 & 0.0\end{matrix}
 \right)  
 \Biggr)
 \end{aligned} \rightarrow 
 \begin{aligned}
 \Biggl( 
 & \left( 
-\begin{matrix}0.0 & 2.0\\2.0 & 3.0\end{matrix}\right),\\ 
+\begin{matrix}0.0 & 2.0\\\\2.0 & 3.0\end{matrix}\right),\\\\ 
 & \left( 
-\begin{matrix}1.0 & 4.0\\3.0 & 2.0\end{matrix}
-\right),\\ 
+\begin{matrix}1.0 & 4.0\\\\3.0 & 2.0\end{matrix}
+\right),\\\\ 
 & \left( 
-\begin{matrix}-1.0 & 2.0\\1.0 & 0.0\end{matrix}
+\begin{matrix}-1.0 & 2.0\\\\1.0 & 0.0\end{matrix}
 \right)  
 \Biggr)
 \end{aligned}
@@ -848,31 +977,68 @@ Parameters:

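 #####  CopyIndexed code snippet

-Sample code for CopyIndexed follows, where s is the input tensor and t is the result tensor; one element is copied to the target tensor from each starting index on the third dimension:
+Sample code for CopyIndexed follows, where s is the input tensor and t is the result tensor; one element is copied to the target tensor from each starting index on the specified dimension: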
 ```
 /* call CopyIndexed function */
-_CopyIndexed(s, t, 2, srcIndex, indexSize, tgtIndex, 1);
+t = CopyIndexed(*s, dim, srcIndex, indexSize, tgtIndex, copyNum);
 ```
 For a detailed code example of CopyIndexed, see:

 NiuTrans.Tensor/Tensor/test/TCopyIndexed.cpp

+#### Copy (CopyValues)
+
+##### What is the copy operation on a tensor?
+
+Copying assigns the values of one tensor to another tensor. For a \\(2 \times 4\\) tensor it works as follows:
+
+$$
+\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
+\left(
+\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
+$$
+
+##### Calling CopyValues
+
+NiuTrans.Tensor provides the tensor copy operation. Its call signatures and parameters are as follows:
+```
+void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL)
+
+XTensor CopyValues(const XTensor &s, XStream * stream = NULL)
+```
+Parameters:
+
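+* s - input tensor
+* t - output tensor
+* stream - multi-threading stream
+
+#####  CopyValues code snippet
+
+Sample code for copying a tensor follows, where s is the tensor to copy and t is the result tensor: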
+```
+/* call CopyValues function */
+t = CopyValues(*s);
+```
+For a detailed code example of tensor copying, see:
+
+NiuTrans.Tensor/Tensor/test/TCopyValues.cpp
+
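 ### Reduction (reduce)

 #### Reduce by Maximum (ReduceMax)

 ##### What is reduction by maximum?

-Reduction by maximum takes the maximum of a tensor along one of its dimensions. For a $2 \times 4$ tensor, reducing along dimension 0 and along dimension 1 gives, respectively:
+Reduction by maximum takes the maximum of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, reducing along dimension 0 and along dimension 1 gives, respectively: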

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
 $$

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}3.0\\7.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}3.0\\\\7.0\end{matrix}\right)
 $$
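The two reductions above differ only in which index is collapsed. The plain-C++ sketch below handles a 2-D matrix; `ReduceMax2D` is a hypothetical helper, not the library's ReduceMax, and it returns the result as a flat vector rather than a tensor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// dim == 0: maximum over the rows, one value per column.
// dim == 1: maximum over the columns, one value per row.
std::vector<float> ReduceMax2D(const Matrix& a, int dim) {
    assert(dim == 0 || dim == 1);
    assert(!a.empty() && !a[0].empty());
    std::vector<float> out;
    if (dim == 0) {
        out = a[0];
        for (std::size_t i = 1; i < a.size(); ++i) {
            assert(a[i].size() == out.size());
            for (std::size_t j = 0; j < out.size(); ++j)
                out[j] = std::max(out[j], a[i][j]);
        }
    } else {
        for (const auto& row : a)
            out.push_back(*std::max_element(row.begin(), row.end()));
    }
    return out;
}
```

For the example matrix, reducing along dimension 0 keeps the second row (4.0, 5.0, 6.0, 7.0), and reducing along dimension 1 keeps the last entry of each row (3.0 and 7.0).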

 ##### Calling ReduceMax
@@ -894,112 +1060,112 @@ Parameters:
 Sample code that calls ReduceMax follows; the two lines reduce along dimension 0 and along dimension 1 respectively:
 ```
 /* call reduce max function */
-_ReduceMax(a, reduce_a, 0);
-_ReduceMax(b, reduce_b, 1);
+t = ReduceMax(*s, 0);
+t = ReduceMax(*s, 1);
 ```
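 For a detailed code example of reduction by maximum, see:

 NiuTrans.Tensor/Tensor/test/TReduceMax.cpp

-#### Reduce by Sum (ReduceSum)
+#### Reduce by Mean (ReduceMean)

-##### What is reduction by sum?
+##### What is reduction by mean?

-Reduction by sum computes the sum of a tensor along one of its dimensions. For a $2 \times 4$ tensor, reducing along dimension 0 and along dimension 1 gives, respectively:
+Reduction by mean computes the mean of a tensor along one of its dimensions. The two examples below reduce one \\(2 \times 4\\) tensor along dimension 0 and another \\(2 \times 4\\) tensor along dimension 1: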

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}4.0 & 6.0 & 8.0 & 10.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}2.0 & 3.0 & 4.0 & 5.0\end{matrix}\right)
 $$

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}6.0\\22.0\end{matrix}\right)
+\left(\begin{matrix}1.0 & 1.0 & 3.0 & 3.0\\\\4.0 & 4.0 & 6.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}2.0\\\\5.0\end{matrix}\right)
 $$
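A mean reduction is a sum reduction followed by a division by the number of elements that were collapsed. The plain-C++ sketch below shows this for a 2-D matrix; `ReduceMean2D` is a hypothetical helper and not the library's ReduceMean.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// dim == 0: mean over the rows, one value per column.
// dim == 1: mean over the columns, one value per row.
std::vector<float> ReduceMean2D(const Matrix& a, int dim) {
    assert(dim == 0 || dim == 1);
    assert(!a.empty() && !a[0].empty());
    const std::size_t rows = a.size();
    const std::size_t cols = a[0].size();
    std::vector<float> out(dim == 0 ? cols : rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i) {
        assert(a[i].size() == cols);
        for (std::size_t j = 0; j < cols; ++j)
            out[dim == 0 ? j : i] += a[i][j];  // accumulate the sum first
    }
    const float count = static_cast<float>(dim == 0 ? rows : cols);
    for (float& x : out)
        x /= count;  // then divide by the number of collapsed elements
    return out;
}
```

Along dimension 0 each column of the matrix with rows (0, 1, 2, 3) and (4, 5, 6, 7) averages two values, for example \\((0 + 4) / 2 = 2\\). Along dimension 1 each row of the matrix with rows (1, 1, 3, 3) and (4, 4, 6, 6) averages four values, for example \\((1 + 1 + 3 + 3) / 4 = 2\\).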

-##### Calling ReduceSum
+##### Calling ReduceMean

-NiuTrans.Tensor provides the ReduceSum operation for tensors, called as follows:
+NiuTrans.Tensor provides the ReduceMean operation for tensors, called as follows:
 ```
-void _ReduceSum(const XTensor * input, XTensor * output, int dim, const XTensor * shift = NULL, DTYPE power = (DTYPE)1.0F, bool isExp = false)
+void _ReduceMean(const XTensor * input, XTensor * output, int dim)

-XTensor ReduceSum(const XTensor &input, int dim, const XTensor &shift = NULLTensor, DTYPE power = (DTYPE)1.0F, bool isExp = false)
+XTensor ReduceMean(const XTensor &input, int dim)
 ```
-Here shift defaults to NULL, power defaults to 1.0F, and isExp defaults to false. The parameters are:
+ReduceMean returns the mean of the values of a tensor along the specified dimension. The parameters are:

 Parameters:

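 * input - input tensor
 * output - output tensor
-* dim - the dimension along which the sum is taken
-* shift - offset applied to the input, NULL by default
-* power - power applied to each element, 1.0F by default
-* isExp - whether to take the exponential, false by default
+* dim - the dimension along which the mean is taken

-##### ReduceSum code snippet
+##### ReduceMean code snippet

-Sample code that calls ReduceSum follows; the two lines reduce along dimension 0 and along dimension 1 respectively:
+Sample code that calls ReduceMean follows; the two lines reduce along dimension 0 and along dimension 1 respectively: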
 ```
-/* call reduce sum function */
-_ReduceSum(a, reduce_a, 0);
-_ReduceSum(b, reduce_b, 1);
+/* call reduce mean function */
+t = ReduceMean(*s, 0);
+t = ReduceMean(*s, 1);
 ```
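-For a detailed code example of reduction by sum, see:
+For a detailed code example of reduction by mean, see:

-NiuTrans.Tensor/Tensor/test/TReduceSum.cpp
+NiuTrans.Tensor/Tensor/test/TReduceMean.cpp

-#### Reduce by Mean (ReduceMean)
+#### Reduce by Sum (ReduceSum)

-##### What is reduction by mean?
+##### What is reduction by sum?

-Reduction by mean computes the mean of a tensor along one of its dimensions. The two examples below reduce one $2 \times 4$ tensor along dimension 0 and another $2 \times 4$ tensor along dimension 1:
+Reduction by sum computes the sum of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, reducing along dimension 0 and along dimension 1 gives, respectively: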

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}2.0 & 3.0 & 4.0 & 5.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}4.0 & 6.0 & 8.0 & 10.0\end{matrix}\right)
 $$

 $$
-\left(\begin{matrix}1.0 & 1.0 & 3.0 & 3.0\\4.0 & 4.0 & 6.0 & 6.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}2.0\\5.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}6.0\\\\22.0\end{matrix}\right)
 $$

-##### Calling ReduceMean
+##### Calling ReduceSum

-NiuTrans.Tensor provides the ReduceMean operation for tensors, called as follows:
+NiuTrans.Tensor provides the ReduceSum operation for tensors, called as follows:
 ```
-void _ReduceMean(const XTensor * input, XTensor * output, int dim)
+void _ReduceSum(const XTensor * input, XTensor * output, int dim, const XTensor * shift = NULL, DTYPE power = (DTYPE)1.0F, bool isExp = false)

-XTensor ReduceMean(const XTensor &input, int dim)
+XTensor ReduceSum(const XTensor &input, int dim, const XTensor &shift = NULLTensor, DTYPE power = (DTYPE)1.0F, bool isExp = false)
 ```
-ReduceMean returns the mean of the values of a tensor along the specified dimension. The parameters are:
+Here shift defaults to NULL, power defaults to 1.0F, and isExp defaults to false. The parameters are:

 Parameters:

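 * input - input tensor
 * output - output tensor
-* dim - the dimension along which the mean is taken
+* dim - the dimension along which the sum is taken
+* shift - offset applied to the input, NULL by default
+* power - power applied to each element, 1.0F by default
+* isExp - whether to take the exponential, false by default

-##### ReduceMean code snippet
+##### ReduceSum code snippet

-Sample code that calls ReduceMean follows; the two lines reduce along dimension 0 and along dimension 1 respectively:
+Sample code that calls ReduceSum follows; the two lines reduce along dimension 0 and along dimension 1 respectively: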
 ```
-/* call reduce mean function */
-_ReduceMean(a, reduce_a, 0);
-_ReduceMean(b, reduce_b, 1);
+/* call reduce sum function */
+t1 = ReduceSum(*s, 0, *shift1);
+t2 = ReduceSum(*s, 1, *shift2);
 ```
-For a detailed code example of reduction by mean, see:
+For a detailed code example of reduction by sum, see:

-NiuTrans.Tensor/Tensor/test/TReduceMean.cpp
+NiuTrans.Tensor/Tensor/test/TReduceSum.cpp

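 #### Reduce by Sum of Squares (ReduceSumSquared)

 ##### What is reduction by sum of squares?

-Reduction by sum of squares works along one dimension of a tensor and sums the squared differences between the elements and a shift value. The values 8.0 below are consistent with using the per-column mean (2.0, 3.0, 4.0, 5.0) as the shift. For a $2 \times 4$ tensor reduced along dimension 0 it works as follows:
+Reduction by sum of squares works along one dimension of a tensor and sums the squared differences between the elements and a shift value. The values 8.0 below are consistent with using the per-column mean (2.0, 3.0, 4.0, 5.0) as the shift. For a \\(2 \times 4\\) tensor reduced along dimension 0 it works as follows: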

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}8.0 & 8.0 & 8.0 & 8.0\end{matrix}\right)
 $$

@@ -1025,7 +1191,7 @@ Parameters:
 Sample code that calls ReduceSumSquared:
 ```
 /* call reduce sum squared function */
-_ReduceSumSquared(input, output, 0, shift);
+t = ReduceSumSquared(*s, 0, *shift);
 ```
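 For a detailed code example of reduction by sum of squares, see:

@@ -1035,10 +1201,10 @@ NiuTrans.Tensor/Tensor/test/TReduceSumSquared.cpp

 ##### What is reduction by variance?

-Reduction by variance computes the variance of a tensor along one of its dimensions. The values 4.0 below are the variances of the columns (the sum of squared deviations, 8.0, divided by the 2 elements), not their standard deviations. For a $2 \times 4$ tensor reduced along dimension 0 it works as follows:
+Reduction by variance computes the variance of a tensor along one of its dimensions. The values 4.0 below are the variances of the columns (the sum of squared deviations, 8.0, divided by the 2 elements), not their standard deviations. For a \\(2 \times 4\\) tensor reduced along dimension 0 it works as follows: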

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
 \left(\begin{matrix}4.0 & 4.0 & 4.0 & 4.0\end{matrix}\right)
 $$

@@ -1064,7 +1230,7 @@ Parameters:
 The following snippet calls ReduceVariance:
 ```
 /* call reduce variance function */
-_ReduceVariance(input, output, 0, mean);
+t = ReduceVariance(*s, 0, *mean);
 ```
 For a complete ReduceVariance example, see:

@@ -1078,12 +1244,12 @@ NiuTrans.Tensor/Tensor/test/TReduceVariance.cpp

 ##### What is tensor concatenation?

-张量间的级联操作是沿着张量的某一维度，将一系列张量或是一个列表中的所有张量连接在一起组成一个更大的张量，将维度分别为$2 \times 1$和$2 \times 2$的两个张量进行级联过程如下所示：
+Concatenation joins a series of tensors, or all tensors in a list, along one dimension to form a larger tensor. Concatenating a \\(2 \times 1\\) tensor and a \\(2 \times 2\\) tensor proceeds as follows:

 $$
-\left(\begin{matrix}0.0\\1.0\end{matrix}\right) +
-\left(\begin{matrix}2.0 & 3.0\\4.0 & 5.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.0 & 2.0 & 3.0\\1.0 & 4.0 & 5.0\end{matrix}\right)
+\left(\begin{matrix}0.0\\\\1.0\end{matrix}\right) +
+\left(\begin{matrix}2.0 & 3.0\\\\4.0 & 5.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 2.0 & 3.0\\\\1.0 & 4.0 & 5.0\end{matrix}\right)
 $$
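
The layout change in this example can be sketched in standalone C++. `concatCols` is a hypothetical helper written for illustration and is not the library's Concatenate; it assumes row-major storage and concatenation along dimension 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of NiuTrans.Tensor: concatenates two
// row-major matrices with the same number of rows along dimension 1,
// i.e. places b's columns to the right of a's columns.
std::vector<float> concatCols(const std::vector<float>& a, std::size_t aCols,
                              const std::vector<float>& b, std::size_t bCols,
                              std::size_t rows)
{
    assert(a.size() == rows * aCols);
    assert(b.size() == rows * bCols);

    std::vector<float> out;
    out.reserve(rows * (aCols + bCols));
    for (std::size_t r = 0; r < rows; ++r) {
        // Copy row r of a, then row r of b.
        out.insert(out.end(), a.begin() + r * aCols, a.begin() + (r + 1) * aCols);
        out.insert(out.end(), b.begin() + r * bCols, b.begin() + (r + 1) * bCols);
    }
    return out;
}
```

Because the data is row-major, concatenating along dimension 1 interleaves the rows of the two inputs rather than appending one buffer to the other.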

 ##### Calling Concatenate
@@ -1120,186 +1286,189 @@ Parameters:
 The snippet below concatenates the tensors in a list: sList holds the tensors to concatenate and t is the result tensor:
 ```
 /* call concatenate function */
-_Concatenate(&sList, t, 1);
+t = Concatenate(*sList, 1);
 ```
 The snippet below concatenates tensors passed directly: s1 and s2 are the tensors to concatenate and t is the result tensor:
 ```
 /* call concatenate function */
-_Concatenate(s1, s2, t, 1);
+t = Concatenate(*s1, *s2, 1);
 ```
 For a complete concatenation example, see:

 NiuTrans.Tensor/Tensor/test/TConcatenate.cpp

-#### 切分（Split）
+#### Merge

-##### 什么是张量的切分操作？
+##### What is the merge operation?

-张量间的切分操作是沿着张量的某一维度，可以将一个张量切分成另一张量，也可以将一个大的张量切分成n个小的张量集合的列表。
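+Merging is similar to concatenation and works along one dimension of a tensor. It can either merge one tensor into another tensor with a different shape, or merge all tensors in a list into one larger tensor.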

-第一种情况下将维度为$4 \times 3$张量沿着维度0进行切分，切分份数为2，得到维度为$2 \times 2 \times 3$的张量的过程如下所示：
+In the first case, a \\(2 \times 2 \times 3\\) tensor is merged along dimension 1, with dimension 0 as the dimension being merged, giving a \\(4 \times 3\\) tensor:

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right),
-\\ & \left( 
-\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}
+\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
 \right) \Biggr)
-\end{aligned}
+\end{aligned} \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
 $$

-在第二种情况下将维度为$4 \times 3$张量沿着维度0进行切分，切分份数为2，得到两个维度均为$2 \times 3$的张量的过程如下所示：
+In the second case, two \\(2 \times 3\\) tensors are merged along dimension 0 into a \\(4 \times 3\\) tensor:

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.0 & 2.0 & 3.0\\1.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
-$$
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+$$ 

-##### 张量切分的调用
+##### Calling Merge

-NiuTrans.Tensor提供了两种张量切分操作，调用方法为：
+NiuTrans.Tensor provides two forms of the merge operation:

-在第一种调用方法中是将源张量中的某一维度进行Split操作，Split结果为张量t，whereToSplit为在哪一维度进行split操作，splitNum表示分成多少份，例如：(N, M) -> (N/3, M, 3)，参数说明如下所示:
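+The first form merges one dimension of the source tensor. The result is tensor t, whereToMerge is the dimension along which to merge, and leadingDim is the dimension that is merged, e.g. (N/2, 2, M) -> (N, M). The signatures and parameters are: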
 ```
-void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
+void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim = -1)

-XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
+XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim = -1)
 ```
 Parameters:

 * s - the input tensor
 * t - the output tensor
-* whereToSplit - 在指定维度进行split操作
-* splitNum - 分成多少份
+* whereToMerge - the dimension along which to merge
+* leadingDim - the dimension that is merged

-在第二种调用方法中是将所操作张量big按某一维度whereToSplit进行Split操作，操作结果为包含若干更小维度张量的列表smalls，splitNum表示分成多少份，例如：(N, M) -> 2 * (N/2, M)，参数说明如下所示:
+The second form merges the tensors stored in the list smalls into the tensor big; whereToMerge is the dimension along which to merge, e.g. 2 * (N/2, M) -> (N, M). The signatures and parameters are:
 ```
-void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
+void _Merge(const XList * smalls, XTensor * big, int whereToMerge)

-XList SplitList(const XTensor &big, int whereToSplit, int splitNum)
+XTensor Merge(const XList &smalls, int whereToMerge)
 ```
 Parameters:

-* big - 输入张量
-* smalls - 存放切分出张量的列表
-* whereToSplit - 在指定维度进行split操作
-* splitNum - 分成多少份
+* smalls - the list of tensors to merge
+* big - the result tensor
+* whereToMerge - the dimension along which to merge

-##### 张量切分片段示例
+##### Merge code snippets

-上述第一种张量切分片段示例如下所示，s为进行切分的张量，t为结果张量，0表示沿着维度0进行切分操作，2表示切分份数为2：
+A snippet for the first form: s is the tensor to merge, t is the result tensor, 1 is the dimension along which to merge, and 0 is the dimension that is merged:
 ```
-/* call split function */
-_Split(s, t, 0, 2);
+/* call merge function */
+t = Merge(*s, 1, 0);
 ```
-上述第二种张量切分片段示例如下所示，s为进行切分的张量，tList为存放结果张量的列表，1表示沿着维度1进行切分操作，2表示切分份数为2：
+A snippet for the second form: sList is the list of tensors to merge, t is the result tensor, and 0 is the dimension along which to merge:
 ```
-/* call split function */
-_Split(s, &tList, 1, 2);
+/* call merge function */
+t = Merge(*sList, 0);
 ```

-有关张量切分的详细代码示例见：
+For a complete merge example, see:

-NiuTrans.Tensor/Tensor/test/TSplit.cpp
+NiuTrans.Tensor/Tensor/test/TMerge.cpp

-#### 合并（Merge）
+#### Split

-##### 什么是张量的合并操作？
+##### What is the split operation?

-张量间的合并操作与级联有些类似，是沿着张量的某一维度，可以将一个张量合并为另一个维度不同的张量，也可以将一个列表中的所有张量合并在一起组成一个更大的张量。
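+Splitting works along one dimension of a tensor. It can either split one tensor into another tensor with a different shape, or split a large tensor into a list of n smaller tensors.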

-在第一种情况下将维度为$2 \times 2 \times 3$的张量在维度1进行合并，进行合并的维度为0，得到维度为$4 \times 3$的张量的过程如下所示：
+In the first case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 pieces, giving a \\(2 \times 2 \times 3\\) tensor:

 $$
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right),
-\\ & \left( 
-\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}
+\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
 \right) \Biggr)
-\end{aligned} \rightarrow 
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
+\end{aligned}
 $$

-在第二种情况下将两个维度均为$2 \times 3$的张量沿着维度0合并为维度为$4 \times 3$的张量的过程如下所示：
+In the second case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 pieces, giving two \\(2 \times 3\\) tensors:

 $$
-\left(\begin{matrix}0.0 & 2.0 & 3.0\\1.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\\0.1 & 1.1 & 2.1\\3.1 & 4.1 & 5.1\end{matrix}\right)
-$$ 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
+$$

-##### 张量合并操作的调用
+##### Calling Split

-NiuTrans.Tensor提供了张量的合并操作，调用方法为：
+NiuTrans.Tensor provides two forms of the split operation:

-在第一种调用方法中是将源张量中的某一维度进行Merge操作，Merge结果为张量t，whereToMerge为指定进行Merge操作的维度，leadingDim为指定将哪一维度Merge，例如：(N/2, 2, M) -> (N, M)，参数说明如下表所示:
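+The first form splits one dimension of the source tensor. The result is tensor t, whereToSplit is the dimension along which to split, and splitNum is the number of pieces, e.g. (N, M) -> (N/3, M, 3). The signatures and parameters are: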
 ```
-void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim = -1)
+void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)

-XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim = -1)
+XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
 ```
 Parameters:

 * s - the input tensor
 * t - the output tensor
-* whereToMerge - 沿着指定维度进行Merge操作
-* leadingDim - 把指定维度进行Merge操作
+* whereToSplit - the dimension along which to split
+* splitNum - the number of pieces

-在第二种调用方法中是将所操作张量存入列表smalls中，操作结果为张量big，whereToMerge为指定进行Merge操作的维度，例如：2 * (N/2, M) -> (N, M)，参数说明如下表所示:
+The second form splits the tensor big along dimension whereToSplit; the result is the list smalls of smaller tensors, and splitNum is the number of pieces, e.g. (N, M) -> 2 * (N/2, M). The signatures and parameters are:
 ```
-void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
+void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)

-XTensor Merge(const XList &smalls, int whereToMerge)
+XList SplitList(const XTensor &big, int whereToSplit, int splitNum)
 ```
 Parameters:

-* smalls - 存放进行合并张量的列表
-* big - 结果张量
-* whereToMerge - 沿着指定维度进行Merge操作
+* big - the input tensor
+* smalls - the list that receives the split tensors
+* whereToSplit - the dimension along which to split
+* splitNum - the number of pieces

-##### 张量合并片段示例
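+##### Split code snippets
+
+A snippet for the first form: s is the tensor to split, t is the result tensor, 0 is the dimension along which to split, and 2 is the number of pieces: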

-上述第一种张量合并片段示例如下所示，s为进行合并的张量，t为结果张量，1表示在维度1进行合并操作，0表示将维度0进行合并操作：
 ```
-/* call merge function */
-_Merge(s, t, 1, 0);
+/* call split function */
+t = Split(*s, 0, 2);
 ```
-上述第二种张量合并片段示例如下所示，sList为要进行合并的张量列表，t为结果张量，0表示沿着维度0进行合并操作：
+
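+A snippet for the second form: s is the tensor to split, tList is the list that receives the result tensors, 1 is the dimension along which to split, and 2 is the number of pieces: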
+
 ```
-/* call merge function */
-_Merge(&sList, t, 0);
+/* call split function */
+Split(*s, tList, 1, 2);
 ```

-有关张量合并的详细代码示例见：
+For a complete split example, see:

-NiuTrans.Tensor/Tensor/test/TMerge.cpp
+NiuTrans.Tensor/Tensor/test/TSplit.cpp
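
Split and Merge are inverses of each other, which a standalone C++ sketch can show for the list-based forms along dimension 0. `splitRows` and `mergeRows` are hypothetical helpers for illustration, assume row-major storage, and are not the library's Split and Merge.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage

// Hypothetical helper, not part of NiuTrans.Tensor: splits a rows x cols
// matrix along dimension 0 into splitNum equally sized pieces.
std::vector<Matrix> splitRows(const Matrix& big, std::size_t rows,
                              std::size_t cols, std::size_t splitNum)
{
    assert(big.size() == rows * cols);
    assert(splitNum > 0 && rows % splitNum == 0);

    const std::size_t pieceSize = (rows / splitNum) * cols;
    std::vector<Matrix> smalls;
    for (std::size_t i = 0; i < splitNum; ++i)
        smalls.emplace_back(big.begin() + i * pieceSize,
                            big.begin() + (i + 1) * pieceSize);
    return smalls;
}

// Hypothetical helper: merges the pieces back along dimension 0
// (the inverse of splitRows).
Matrix mergeRows(const std::vector<Matrix>& smalls)
{
    Matrix big;
    for (const Matrix& s : smalls)
        big.insert(big.end(), s.begin(), s.end());
    return big;
}
```

Along dimension 0 of a row-major matrix each piece is a contiguous slice of the buffer, so splitting and merging reduce to copying ranges.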

 #### Unsqueeze

 ##### What is Unsqueeze?

-Unsqueeze的作用是通过对张量进行操作，返回一个新的在指定维度插入新维度的张量，这个返回的张量与源张量共享相同的基础数据，一个$2 \times 3$的张量在维度1和2分别进行Unsqueeze的操作如下所示，插入新的维度大小均为2：
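+Unsqueeze returns a new tensor with a new dimension inserted at the specified position; the returned tensor shares the same underlying data as the source tensor. Below, a \\(2 \times 3\\) tensor is unsqueezed at dimension 1 and at dimension 2, with the inserted dimension having size 2 in both cases: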

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow 
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 1.0 & 2.0\\0.0 & 1.0 & 2.0\end{matrix}\right),
-\\ & \left( 
-\begin{matrix}3.0 & 4.0 & 5.0\\3.0 & 4.0 & 5.0\end{matrix}
+\begin{matrix}0.0 & 1.0 & 2.0\\\\0.0 & 1.0 & 2.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}3.0 & 4.0 & 5.0\\\\3.0 & 4.0 & 5.0\end{matrix}
 \right) \Biggr)
 \end{aligned}
 $$

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow  
+\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) \rightarrow  
 \begin{aligned}
 \Biggl( & \left( 
-\begin{matrix}0.0 & 0.0\\1.0 & 1.0\\2.0 & 2.0\end{matrix}\right),
-\\ & \left( 
-\begin{matrix}3.0 & 3.0\\4.0 & 4.0\\5.0 & 5.0\end{matrix}
+\begin{matrix}0.0 & 0.0\\\\1.0 & 1.0\\\\2.0 & 2.0\end{matrix}\right),
+\\\\ & \left( 
+\begin{matrix}3.0 & 3.0\\\\4.0 & 4.0\\\\5.0 & 5.0\end{matrix}
 \right) \Biggr)
 \end{aligned}
 $$
@@ -1322,11 +1491,13 @@ Parameters:
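 #####  Unsqueeze code snippet
 
 In the Unsqueeze snippet below, s is the input tensor and t1 and t2 are the result tensors; the two calls insert a dimension of size 2 at dimension 1 and at dimension 2 respectively: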
+
 ```
 /* call Unsqueeze function */
-_Unsqueeze(s, t1, 1, 2);
-_Unsqueeze(s, t2, 2, 2);
+t1 = Unsqueeze(*s, 1, 2);
+t2 = Unsqueeze(*s, 2, 2);
 ```
+
 For a complete Unsqueeze example, see:

 NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp
@@ -1339,56 +1510,69 @@ NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp

 ##### What is Sort?

-Sort操作是对张量中元素沿着指定的维度进行排序，一个$2 \times 4$的张量沿着维度0进行Sort操作过程如下所示：
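+Sort orders the elements of a tensor along the specified dimension. Sorting a \\(2 \times 4\\) tensor along dimension 0 (in descending order, as the result shows) proceeds as follows: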

 $$
-\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
-\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\\0.0 & 1.0 & 2.0 & 3.0\end{matrix}\right)
+\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\\\\0.0 & 1.0 & 2.0 & 3.0\end{matrix}\right)
 $$
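
The example above can be reproduced with a standalone C++ sketch. `sortColumnsDescending` is a hypothetical helper, not the library's Sort; it assumes row-major storage and the descending order seen in the example, and it also returns the original row of every sorted value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of NiuTrans.Tensor: sorts every column of a
// row-major rows x cols matrix in descending order (a sort along dimension 0).
// `sorted` receives the values and `index` the original row of each value.
void sortColumnsDescending(const std::vector<float>& a,
                           std::size_t rows, std::size_t cols,
                           std::vector<float>& sorted, std::vector<int>& index)
{
    assert(a.size() == rows * cols);
    sorted.assign(a.size(), 0.0f);
    index.assign(a.size(), 0);

    std::vector<int> order(rows);
    for (std::size_t c = 0; c < cols; ++c) {
        std::iota(order.begin(), order.end(), 0);
        // Stable sort keeps the original order of equal values.
        std::stable_sort(order.begin(), order.end(), [&](int x, int y) {
            return a[x * cols + c] > a[y * cols + c];
        });
        for (std::size_t r = 0; r < rows; ++r) {
            sorted[r * cols + c] = a[order[r] * cols + c];
            index[r * cols + c] = order[r];
        }
    }
}
```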

 ##### Calling Sort
 
 NiuTrans.Tensor provides the Sort operation for tensors; its signatures and parameters are:
+
 ```
-void _Sort(XTensor * a, XTensor * index, int dim)
+void _Sort(const XTensor * a, XTensor * b, XTensor * index, int dim)
+
+void _SortMe(XTensor * a, XTensor * index, int dim)
+
+void Sort(XTensor & a, XTensor & b, XTensor & index, int dim)
 ```
+
 Parameters:

 * a - the input tensor
+* b - the output tensor
 * index - the indices of the elements in the output tensor
 * dim - the dimension along which to sort

 #####  Sort code snippet

-Sort示例代码如下所示，a为进行操作的张量，b为结果张量中元素的索引，本例中沿着维度0进行Sort操作：
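+In the Sort snippet below, a is the tensor to sort, b is the result tensor, and index holds the indices of the elements of the result tensor; this example sorts along dimension 0: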
+
 ```
 /* call Sort function */
-_Sort(a, b, 0);
+Sort(*a, b, *index, 0);
 ```
+
 For a complete Sort example, see NiuTrans.Tensor/Tensor/test/TSort.cpp

 #### TopK

 ##### What is TopK?

-TopK操作是通过对张量中元素进行排序，得到最大或最小的k个元素值及其对应的索引值，在张量中，可以沿着某一维度进行TopK操作，一个$2 \times 4$的张量沿着维度0进行Top-2操作过程如下所示：
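+TopK sorts the elements of a tensor and returns the k largest or smallest values together with their indices. It can be applied along one dimension of a tensor. Taking the top 2 of a \\(2 \times 4\\) tensor along dimension 0 proceeds as follows: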

 $$
-\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow 
+\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow 
 \begin{aligned}
 outputAnswer: & \left(
-\begin{matrix}0.5 & 2.5 & 4.5 & 6.5\\8.5 & 10.5 & 12.5 & 14.5\end{matrix}\right)\\ +
-\\ indexAnswer: & \left(
-\begin{matrix}0 & 1 & 1 & 0\\1 & 0 & 0 & 1\end{matrix}\right)
+\begin{matrix}5.0 & 3.0 & 7.0 & 8.0\\\\4.0 & 1.0 & 2.0 & 6.0\end{matrix}\right)\\\\ +
+\\\\ indexAnswer: & \left(
+\begin{matrix}0 & 1 & 1 & 0\\\\1 & 0 & 0 & 1\end{matrix}\right)
 \end{aligned}
 $$
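
A standalone C++ sketch of a top-k selection along dimension 0 follows. `topKColumns` is a hypothetical helper, not the library's TopK; it assumes row-major storage and returns the k largest values of every column, largest first, with the rows they came from.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of NiuTrans.Tensor: for every column of a
// row-major rows x cols matrix, keeps the k largest values (largest first)
// and the rows they came from. Outputs are k x cols, row-major.
void topKColumns(const std::vector<float>& a, std::size_t rows,
                 std::size_t cols, std::size_t k,
                 std::vector<float>& values, std::vector<int>& index)
{
    assert(a.size() == rows * cols);
    assert(k <= rows);
    values.assign(k * cols, 0.0f);
    index.assign(k * cols, 0);

    std::vector<int> order(rows);
    for (std::size_t c = 0; c < cols; ++c) {
        std::iota(order.begin(), order.end(), 0);
        // Only the first k positions need to end up in sorted order.
        std::partial_sort(order.begin(), order.begin() + k, order.end(),
                          [&](int x, int y) {
                              return a[x * cols + c] > a[y * cols + c];
                          });
        for (std::size_t i = 0; i < k; ++i) {
            values[i * cols + c] = a[order[i] * cols + c];
            index[i * cols + c] = order[i];
        }
    }
}
```

For the input rows (5, 1, 2, 8) and (4, 3, 7, 6) with k = 2, the sketch yields the index rows (0, 1, 1, 0) and (1, 0, 0, 1).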

 ##### Calling TopK
 
 NiuTrans.Tensor provides the TopK operation for tensors; its signatures and parameters are:
+
 ```
-void _TopK(XTensor * a, XTensor * b, XTensor * index, int dim, int k)
+void _TopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
+
+void TopK(XTensor &a, XTensor &b, XTensor &index, int dim, int k)
 ```
+
 Parameters:

 * a - the input tensor
@@ -1399,48 +1583,21 @@ Parameters:

 #####  TopK code snippet

- TopK示例代码如下，input为输入的待操作张量，output输出结果张量，index为输出结果索引，本例中沿着维度0取Top-2：
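+In the TopK snippet below, s is the input tensor, t is the result tensor, and index holds the indices of the results; this example takes the top k along dimension dim: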
+
 ```
 /* call TopK function */
 int dim = 0;
 int k = inputDimSize[dim];
-_TopK(input, outputA, indexA, dim, k);
+TopK(s, t, index, dim, k);
 ```
+
 For a complete TopK example, see NiuTrans.Tensor/Tensor/test/TTopK.cpp

 ### Activation functions (function)
 
 This part introduces several activation functions and loss functions.

-#### Rectify
-
-##### 什么是Rectify？
-
-Rectify是一种激活函数，Rectify函数定义为：
->y = max(0, x)
-
-##### Rectify调用
-
-NiuTrans.Tensor提供了张量的Rectify激活函数，调用方法及参数说明如下所示:
-```
-void _Rectify(const XTensor * x, XTensor * y)
-```
-Parameters:
-
-* x - 输入张量
-* y - 输出张量
-
-#####  Rectify片段示例
-
-Rectify示例代码如下，其中x为输入的向量，y为输入的张量：
-```
-/* call Rectify function */
-_Rectify(x, y);
-```
-有关Rectify的详细代码示例见：
-
-NiuTrans.Tensor/Tensor/test/TRectify.cpp
-
 #### HardTanH

 ##### What is HardTanH?
@@ -1453,9 +1610,13 @@ HardTanH is an activation function, defined as:
 ##### Calling HardTanH
 
 NiuTrans.Tensor provides the HardTanH activation function for tensors; its signatures and parameters are:
+
 ```
 void _HardTanH(const XTensor * x, XTensor * y)
+
+XTensor HardTanH(const XTensor &x)
 ```
+
 Parameters:

 * x - the input tensor
@@ -1464,10 +1625,12 @@ Parameters:
 #####  HardTanH code snippet
 
 In the HardTanH snippet below, x is the input tensor and y is the output tensor:
+
 ```
 /* call hardtanh function */
-_HardTanH(x, y);
+y = HardTanH(*x);
 ```
+
 For a complete HardTanH example, see:

 NiuTrans.Tensor/Tensor/test/THardTanH.cpp
@@ -1482,9 +1645,13 @@ Identity is an activation function, defined as:
 ##### Calling Identity
 
 NiuTrans.Tensor provides the Identity activation function for tensors; its signatures and parameters are:
+
 ```
 void _Identity(const XTensor * x, XTensor * y)
+
+XTensor Identity(const XTensor &x)
 ```
+
 Parameters:

 * x - the input tensor
@@ -1493,10 +1660,12 @@ Parameters:
 #####  Identity code snippet
 
 In the Identity snippet below, x is the input tensor and y is the output tensor:
+
 ```
 /* call Identity function */
-_Identity(x, y);
+y = Identity(*x);
 ```
+
 For a complete Identity example, see:

 NiuTrans.Tensor/Tensor/test/TIdentity.cpp
@@ -1511,9 +1680,13 @@ LogSoftmax is an activation function, defined as:
 ##### Calling LogSoftmax
 
 NiuTrans.Tensor provides the LogSoftmax activation function for tensors; its signatures and parameters are:
+
 ```
 void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
+
+XTensor LogSoftmax(const XTensor &x, int leadDim)
 ```
+
 Parameters:

 * x - the input tensor
@@ -1523,14 +1696,94 @@ Parameters:
 #####  LogSoftmax code snippet
 
 In the LogSoftmax snippet below, x is the input tensor and y is the output tensor; this example applies LogSoftmax along dimension 1:
+
 ```
 /* call LogSoftmax function */
-_LogSoftmax(x, y, 1);
+y = LogSoftmax(*x, 1);
 ```
+
 For a complete LogSoftmax example, see:

 NiuTrans.Tensor/Tensor/test/TLogSoftmax.cpp

+#### Loss
+
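+##### What is Loss?
+
+A loss function measures how well a neural network model performs and serves as its optimization objective. The supported loss functions are defined as: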
+>squared error : loss = sum_{i} 0.5*(gold_i - output_i)^2 <br />
+cross entropy : loss = sum_{i} (-gold_i * log(output_i)) <br />
+one hot error : loss = sum_{i} e_i <br />
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; where e_i = 0.5*(t_i - y_i)^2 &nbsp;&nbsp;if t_i = 1, <br />
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;e_i = 0 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; otherwise
+
+
+##### Calling Loss
+
+NiuTrans.Tensor provides a loss computation function for tensors; its signature and parameters are:
+
+```
+DTYPE _LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName, bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
+```
+
+Parameters:
+
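+* gold - the gold standard (reference answer)
+* output - the model's predicted output
+* LFName - the name of the loss function
+* isLogOutput - whether the output is in log scale
+* leadDim - the leading dimension of the output
+* gBeg - the position along leadDim at which to start reading the gold standard
+* gLen - the offset in the gold standard counted from gBeg
+* oBeg - the position along leadDim at which to start reading the model output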
+
+#####  Loss code snippet
+
+The Loss snippet is shown below:
+
+```
+/* call LossCompute function */
+error = _LossCompute(gold, output, SQUAREDERROR, false, 0, 0, dimSize[0], 0);
+```
+
+For a complete Loss example, see:
+
+NiuTrans.Tensor/Tensor/test/TLoss.cpp
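
The three loss definitions in this section can be written out as a standalone C++ sketch over one gold/output pair. The helper names are hypothetical and unrelated to `_LossCompute`; the sketch ignores leadDim, gBeg, gLen, and oBeg and simply applies each formula to two vectors of equal length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers, not part of NiuTrans.Tensor.

// squared error: sum_i 0.5 * (gold_i - output_i)^2
float squaredError(const std::vector<float>& gold, const std::vector<float>& output)
{
    assert(gold.size() == output.size());
    float loss = 0.0f;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        const float diff = gold[i] - output[i];
        loss += 0.5f * diff * diff;
    }
    return loss;
}

// cross entropy: sum_i -gold_i * log(output_i). Terms with gold_i == 0 are
// skipped, so output_i only has to be positive where gold_i is non-zero.
float crossEntropy(const std::vector<float>& gold, const std::vector<float>& output)
{
    assert(gold.size() == output.size());
    float loss = 0.0f;
    for (std::size_t i = 0; i < gold.size(); ++i)
        if (gold[i] != 0.0f)
            loss += -gold[i] * std::log(output[i]);
    return loss;
}

// one hot error: squared error restricted to positions where gold_i == 1
float oneHotError(const std::vector<float>& gold, const std::vector<float>& output)
{
    assert(gold.size() == output.size());
    float loss = 0.0f;
    for (std::size_t i = 0; i < gold.size(); ++i) {
        if (gold[i] == 1.0f) {
            const float diff = gold[i] - output[i];
            loss += 0.5f * diff * diff;
        }
    }
    return loss;
}
```

For a one-hot gold vector, cross entropy reduces to the negative log of the probability assigned to the correct class.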
+
+#### Rectify
+
+##### What is Rectify?
+
+Rectify is an activation function, defined as:
+>y = max(0, x)
+
+##### Calling Rectify
+
+NiuTrans.Tensor provides the Rectify activation function for tensors; its signatures and parameters are:
+
+```
+void _Rectify(const XTensor * x, XTensor * y)
+
+XTensor Rectify(const XTensor &x)
+```
+
+
+Parameters:
+
+* x - the input tensor
+* y - the output tensor
+
+#####  Rectify code snippet
+
+In the Rectify snippet below, x is the input tensor and y is the output tensor:
+```
+/* call Rectify function */
+y = Rectify(*x);
+```
+For a complete Rectify example, see:
+
+NiuTrans.Tensor/Tensor/test/TRectify.cpp
+
 #### Sigmoid

 ##### What is Sigmoid?
@@ -1543,6 +1796,8 @@ Sigmoid is an activation function, defined as:
 NiuTrans.Tensor provides the Sigmoid activation function for tensors; its signatures and parameters are:
 ```
 void _Sigmoid(const XTensor * x, XTensor * y)
+
+XTensor Sigmoid(const XTensor &x)
 ```
 Parameters:

@@ -1554,7 +1809,7 @@ Parameters:
 In the Sigmoid snippet below, x is the input tensor and y is the output tensor:
 ```
 /* call Sigmoid function */
-_Sigmoid(x, y);
+y = Sigmoid(*x);
 ```
 For a complete Sigmoid example, see:

@@ -1572,6 +1827,8 @@ Softmax is an activation function, defined as:
 NiuTrans.Tensor provides the Softmax activation function for tensors; its signatures and parameters are:
 ```
 void _Softmax(const XTensor * x, XTensor * y, int leadDim)
+
+XTensor Softmax(const XTensor &x, int leadDim)
 ```
 Parameters:

@@ -1584,65 +1841,28 @@ Parameters:
 In the Softmax snippet below, x is the input tensor and y is the output tensor; this example applies Softmax along dimension 1:
 ```
 /* call Softmax function */
-_Softmax(x, y, 1);
+y = Softmax(*x, 1);
 ```
 For a complete Softmax example, see:

 NiuTrans.Tensor/Tensor/test/TSoftmax.cpp

-#### Loss
-
-##### 什么是Loss？
-
-Loss Function(损失函数)是用来衡量神经网络模型效果及优化目标的一种损失函数，函数定义为：
->squared error : loss = sum_{i} 0.5*(gold_i - output_i)^2 <br />
-cross entropy : loss = sum_{i} (-gold_i * log(output_i)) <br />
-one hot error : loss = sum_{i} e_i <br />
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; where e_i = 0.5*(t_i - y_i)^2 &nbsp;&nbsp;if t_i = 1, <br />
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;e_i = 0 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; otherwise
-
-
-##### Loss调用
-
-NiuTrans.Tensor提供了张量的Loss激活函数，调用方法及参数说明如下所示:
-```
-DTYPE LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName, bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
-```
-Parameters:
-
-* gold - 标准答案
-* output - 输出的模型预测结果
-* LFName - 损失函数名称
-* isLogOutput - 输出是否log
-* leadDim - 沿着指定维度进行输出
-* gBeg - 沿着指定维度leadDim从指定位置取标准答案
-* gLen - 从指定位置gBeg开始标准答案的偏移
-* oBeg - 沿着指定维度leadDim从指定位置开始输出模型预测结果
-
-#####  Loss片段示例
-
-Loss示例代码如下所示：
-```
-/* call LossCompute function */
-error = LossCompute(gold, output, SQUAREDERROR, false, 0, 0, dimSize[0], 0);
-```
-有关Loss的详细代码示例见：
-
-NiuTrans.Tensor/Tensor/test/TLoss.cpp
-
 ## Advanced techniques
 
 ### Memory pool

-内存作为计算机软件运行过程中不可或缺的一项重要资源，在软件开发过程中具有十分重要的地位。对于一个软件系统而言，如何更高效地进行内存管理将对系统整体性能，尤其是运行速度方面产生很大程度的影响。对于内存的管理一般来说主要包括分配、追踪以及释放，通过相应的接口即可简单地在内存空间上进行变量的定义、使用以及删除等操作。
-虽然目前而言，主流编程语言均会为开发人员提供相应的系统级接口（如C语言中的malloc和free，C++中的new和delete等），但这类接口在设计的时候由于需要考虑各种使用情况，因此并不一定能够最适用于目前的使用需求（如对速度具有较高要求等），因此直接使用系统级的内存管理接口存在以下弊端：
-1. 内存申请、释放时间消耗大：由于操作系统在进行内存管理的时候需要保证内存空间得到有效地使用，因此在执行内存申请操作的时候，系统将会根据“最先匹配”或“最优匹配”等算法在内存空间中找到一处闲置内存进行分配。同理，在对内存空间进行释放的时候，为方便后续空间的申请，系统也会在释放的过程中适时地合并空闲内存区域，保证系统中存在大块连续内存。诸如此类的操作虽然说能够使得内存空间的使用更加高效，但也给这些操作带来了许多额外的时间开销，导致频繁地对内存进行操作耗时较大。 
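+Memory is an indispensable resource for running software, and how efficiently a system manages it strongly affects overall performance, especially speed. Mainstream programming languages do provide system-level interfaces (such as malloc and free in C, or new and delete in C++), but these interfaces are designed to cover every kind of use and are therefore not necessarily the best fit for a particular need, such as high speed. Using the system-level memory management interfaces directly has the following drawbacks: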
+
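+1. Allocation and release are slow: the operating system has to keep memory well utilized, so on every allocation or release it selects among candidate blocks and merges free blocks to some extent. This adds considerable overhead and makes frequent memory operations expensive.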
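 2. Lower execution efficiency: because requested blocks vary in size, frequent use of the system-level interfaces tends to leave many memory fragments, which slows the system down.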
-3. 易发生内存泄漏：使用系统级接口对内存空间进行申请的时候，一般来说需要程序开发人员显性地对空间进行释放，一旦疏忽将导致内存泄漏情况的发生，严重情况下会使得软件甚至系统发生崩溃。因此使用系统级接口进行内存管理需要谨慎对存储空间的使用情况进行分析，使用相关检测工具对内存泄漏情况进行有效地核查。
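+3. Memory leaks are easy to cause: memory obtained through the system-level interfaces generally has to be released explicitly by the developer, and any oversight results in a leak. Using these interfaces therefore requires careful analysis of how storage is used and checking for leaks with appropriate detection tools.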

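 In addition, when the system also manages device memory on a GPU, allocation and release cost even more than for ordinary memory. Unlike a host allocation, allocating or releasing device memory interrupts the work the CPU is doing and hands the operation to the GPU device, so it takes far longer than a host allocation, and frequent device-memory operations slow the whole system down even more severely.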
+
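 To address these problems, the system supports a memory pool (Memory Pool) for managing storage, including both host memory and device memory. The idea is to request one large block from the system before any storage is used and let the program itself (the memory pool) manage that block. Allocations and releases then no longer call the system interfaces frequently, which saves the time spent on interrupts and on searching for the best block, and fragmentation becomes less likely. Because the pool itself is allocated only once, large-scale memory leaks across the system are also avoided, which helps stability.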
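
The idea of reserving one block up front and handing out pieces of it can be sketched in a few lines of standalone C++. `SimplePool` is a hypothetical illustration for host memory only; it is not the XMem class and says nothing about how XMem is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of the idea, unrelated to the actual XMem class:
// one block is reserved up front and carved into pieces on request.
class SimplePool {
public:
    explicit SimplePool(std::size_t capacity) : buffer_(capacity), used_(0)
    {
        assert(capacity > 0);
    }

    // Returns `size` bytes from the block, or nullptr if the pool is full.
    // Sizes are rounded up so every offset into the block stays a multiple
    // of the strictest fundamental alignment.
    void* alloc(std::size_t size)
    {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t rounded = (size + align - 1) / align * align;
        if (rounded > buffer_.size() - used_)
            return nullptr;
        void* p = buffer_.data() + used_;
        used_ += rounded;
        return p;
    }

    // Makes the whole block reusable without calling the system allocator.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;  // the single block obtained up front
    std::size_t used_;                   // bytes handed out so far
};
```

Each `alloc` is only an offset computation, and releasing everything is a single `reset`, which is where the savings over repeated system calls come from in this simplified design.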
+
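 To use a memory pool (XMem) in the NiuTrans.Tensor toolkit, only three steps are needed: define the pool, use it, and release it.
+
 * Defining a memory pool
 
 In the simplest case, defining a memory pool only requires a device ID. Sample code follows.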
@@ -1674,45 +1894,15 @@ delete mem;

 ## Example 1: matrix multiplication

-NiuTrans.Tensor提供的矩阵乘法实例如下所示，详细代码见NiuTrans.Tensor/Tensor/sample/mul/
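+Here is a matrix multiplication example. It first defines the tensor dimension sizes, then initializes two matrices of sizes 2*3 and 3*2, assigns their values with SetData(), and finally multiplies the two matrices.
+
+For the complete matrix multiplication code, see NiuTrans.Tensor/Tensor/sample/mul/.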

 ```
 #include "mul.h"

 namespace nts
 {
-void sampleMUL()
-{
-    DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
-                          { -4.0F, 5.0F, 6.0F } };
-    DTYPE bData[3][2] = { { 0.0F, -1.0F },
-                          { 1.0F, 2.0F },
-                          { 2.0F, 1.0F } };
-    DTYPE answer[2][2] = { { 8.0F, 6.0F },
-                           { 17.0F, 20.0F } };
-
-    XTensor a;
-    //XTensor * a = NewTensor();?
-    XTensor b;
-    XTensor result;
-
-    InitTensor2D(&a, 2, 3);
-    InitTensor2D(&b, 3, 2);
-
-    //a.GetSize;
-
-    a.SetData(aData, 6);
-    b.SetData(bData, 6);
-
-    result = MatrixMul(a, X_NOTRANS, b, X_NOTRANS);
-
-    result.Dump(stderr, "result:");
-
-    if (result.CheckData(answer, 4))
-        fprintf(stderr, "answer is right\n");
-
-}
-
 void sampleMUL1()
 {
    DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
@@ -1753,24 +1943,266 @@ void sampleMUL1()
    for (int i = 0; i < resultOrder; i++)
        resultUnitNum *= resultDimSize[i];

+    /* create tensors */
    XTensor * a = NewTensor(aOrder, aDimSize);
    XTensor * b = NewTensor(bOrder, bDimSize);
    XTensor * result = NewTensor(resultOrder, resultDimSize);

+    /* initialize variables */
    a->SetData(aData, aUnitNum);
    b->SetData(bData, bUnitNum);
    result->SetZeroAll();

+    /* call MatrixMul function */
    _MatrixMul(a, X_NOTRANS, b, X_NOTRANS, result);

    result->Dump(stderr, "result:");
-}
+
+    /* destroy variables */
+    delete[] aDimSize;
+    delete[] bDimSize;
+    delete[] resultDimSize;
+    delete a;
+    delete b;
+    delete result;
+}
 }
 ```
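
To check the expected result by hand, a plain reference multiplication can be sketched in standalone C++. `matMul` is a hypothetical helper, not the library's MatrixMul, and it assumes row-major storage with no transposition (the X_NOTRANS case). The values used in the usage note come from the sampleMUL function that this commit removes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference implementation, not part of NiuTrans.Tensor:
// multiplies a row-major n x k matrix by a row-major k x m matrix.
std::vector<float> matMul(const std::vector<float>& a, const std::vector<float>& b,
                          std::size_t n, std::size_t k, std::size_t m)
{
    assert(a.size() == n * k);
    assert(b.size() == k * m);

    std::vector<float> c(n * m, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < m; ++j)
                c[i * m + j] += a[i * k + p] * b[p * m + j];
    return c;
}
```

With a = {{1, 2, 3}, {-4, 5, 6}} and b = {{0, -1}, {1, 2}, {2, 1}}, the product is {{8, 6}, {17, 20}}, which matches the `answer` array of the removed sample.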

 ## Example 2: feedforward neural network

-NiuTrans.Tensor提供的语言模型任务上的前馈神经网络实例部分代码如下所示，主要是关于前馈神经网络语言模型上前向和反向训练的处理过程，详细代码见NiuTrans.Tensor/Tensor/sample/fnnlm/
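+Next we implement a simple feedforward neural network language model.
+
+Language modeling is the task of building a mathematical model of a language. Before neural networks, language models were generally statistical. The most common is the n-gram model, which counts how often sequences of words co-occur in text and uses smoothing to adjust for unseen word combinations, yielding the probability of different words appearing consecutively. Compared with these statistical models, a neural language model learns similarities between words while learning word combinations, which improves its predictions for unseen combinations of known words relative to smoothing and gives better performance.
+
+The neural language model was first systematically proposed and studied in depth by Bengio et al. Its overall structure resembles an ordinary feedforward neural network: an input layer, hidden layers, and an output layer, with connections between adjacent layers. Each layer maps the vector it receives into another space and emits the result as its output.
+
+The main flow of the feedforward neural network language model is: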
+
+```
+int FNNLMMain(int argc, const char ** argv)
+{
+    if(argc == 0)
+        return 1;
+
+    FNNModel model;
+
+    /* load arguments */
+    LoadArgs(argc, argv, model);
+
+    /* check the setting */
+    Check(model);
+
+    /* initialize model parameters */
+    Init(model);
+
+    /* learn model parameters */
+    if(strcmp(trainFN, ""))
+        Train(trainFN, shuffled, model);
+
+    /* save the final model */
+    if(strcmp(modelFN, "") && strcmp(trainFN, ""))
+        Dump(modelFN, model);
+
+    /* load the model if necessary */
+    if(strcmp(modelFN, ""))
+        Read(modelFN, model);
+
+    /* test the model on the new data */
+    if(strcmp(testFN, "") && strcmp(outputFN, ""))
+        Test(testFN, outputFN, model);
+
+    return 0;
+}
+```
+
+Initializing the model parameters:
+
+```
+/* initialize the model */
+void Init(FNNModel &model)
+{
+    /* create embedding parameter matrix: vSize * eSize */
+    InitModelTensor2D(model.embeddingW, model.vSize, model.eSize, model);
+    
+    /* create hidden layer parameter matrices */
+    for(int i = 0; i < model.hDepth; i++){
+        /* hidden layer parameter matrix: (n-1)eSize * hsize if it is the first layer
+                                           hsize * hsize otherwise */
+        if(i == 0)
+            InitModelTensor2D(model.hiddenW[i], (model.n - 1) * model.eSize, model.hSize, model);
+        else
+            InitModelTensor2D(model.hiddenW[i], model.hSize, model.hSize, model);
+        
+        /* bias term: a row vector of hSize entries */
+        InitModelTensor1D(model.hiddenB[i], model.hSize, model);
+    }
+    
+    /* create the output layer parameter matrix and bias term */
+    int iSize = model.hDepth == 0 ? (model.n - 1) * model.eSize : model.hSize;
+    InitModelTensor2D(model.outputW, iSize, model.vSize, model);
+    InitModelTensor1D(model.outputB, model.vSize, model);
+    
+    /* then, we initialize model parameters using a uniform distribution in range
+       of [-minmax, minmax] */
+    model.embeddingW.SetDataRand(-minmax, minmax);
+    model.outputW.SetDataRand(-minmax, minmax);
+    for(int i = 0; i < model.hDepth; i++)
+        model.hiddenW[i].SetDataRand(-minmax, minmax);
+    
+    /* all bias terms are set to zero */
+    model.outputB.SetZeroAll();
+    for(int i = 0; i < model.hDepth; i++)
+        model.hiddenB[i].SetZeroAll();
+}
+```
+
+The training procedure:
+
+```
+void Train(const char * train, bool isShuffled, FNNModel &model)
+{
+    char name[MAX_NAME_LENGTH];
+    
+    /* shuffle the data */
+    if(isShuffled){
+        sprintf(name, "%s-tmp", train);
+        Shuffle(train, name);
+    }
+    else
+        strcpy(name, train);
+    
+    int epoch = 0;
+    int step = 0;
+    int wordCount = 0;
+    int wordCountTotal = 0;
+    int ngramNum = 1;
+    float loss = 0;
+    bool isEnd = false;
+    
+    NGram * ngrams = new NGram[MAX_LINE_LENGTH_HERE];
+
+    /* make a model to keep gradients */
+    FNNModel grad;
+    Copy(grad, model);
+
+    /* XNet for automatic differentiation */
+    XNet autoDiffer;
+
+    double startT = GetClockSec();
+    
+    /* iterate for a number of epochs */
+    for(epoch = 0; epoch < nEpoch; epoch++){
+
+        /* data file */
+        FILE * file = fopen(name, "rb");
+        CheckErrors(file, "Cannot open the training file");
+
+        wordCount = 0;
+        loss = 0;
+        ngramNum = 1;
+
+        while(ngramNum > 0){
+            
+            /* load a minibatch of ngrams */
+            ngramNum = LoadNGrams(file, model.n, ngrams, sentBatch, wordBatch);
+
+            if (ngramNum <= 0)
+                break;
+
+            /* previous n - 1 words */
+            XTensor inputs[MAX_N_GRAM];
+
+            /* the predicted word */
+            XTensor output;
+
+            /* the gold standard */
+            XTensor gold;
+
+            /* make the input tensor for position i */
+            for(int i = 0; i < model.n - 1; i++)
+                MakeWordBatch(inputs[i], ngrams, ngramNum, i, model.vSize, model.devID, model.mem);
+
+            /* make the gold tensor */
+            MakeWordBatch(gold, ngrams, ngramNum, model.n - 1, model.vSize, model.devID, model.mem);
+
+            if(!autoDiff){
+                /* prepare an empty network for building the fnn */
+                FNNNet net;
+
+                /* gradient = 0 */
+                Clear(grad);
+
+                /* forward computation */
+                Forward(inputs, output, model, net);
+
+                /* backward computation to obtain gradients */
+                Backward(inputs, output, gold, CROSSENTROPY, model, grad, net);
+
+                /* update model parameters */
+                Update(model, grad, learningRate, false);
+            }
+            else{
+                /* forward + backward process */
+                ForwardAutoDiff(inputs, output, model);
+
+                /* automatic differentiation */
+                autoDiffer.Backward(output, gold, CROSSENTROPY);
+
+                /* update model parameters */
+                Update(model, grad, learningRate, true);
+            }
+                
+            /* get probabilities */
+            float prob = GetProb(output, gold);
+                
+            loss += -prob;
+            wordCount += ngramNum;
+            wordCountTotal += ngramNum;
+            
+            if(++step >= nStep){
+                isEnd = true;
+                break;
+            }
+
+            if (step % 100 == 0) {
+                double elapsed = GetClockSec() - startT;
+                XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, ngram=%d, ppl=%.3f\n",
+                           elapsed, step, epoch + 1, wordCountTotal, exp(loss / wordCount));
+            }
+        }
+
+        fclose(file);
+        
+        if(isEnd)
+            break;
+    }
+
+    double elapsed = GetClockSec() - startT;
+    
+    XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, ngram=%d, ppl=%.3f\n", 
+               elapsed, step, epoch, wordCountTotal, exp(loss / wordCount));
+    XPRINT3(0, stderr, "[INFO] training finished (took %.1fs, step=%d and epoch=%d)\n", 
+               elapsed, step, epoch);
+    
+    delete[] ngrams;
+}
+```
+
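+Only the main parts of the code are shown here; for the full code, see NiuTrans.Tensor/source/sample/FNNLM.cpp
+
+Forward pass: after data processing we have the input of the language model (n-1 words). Multiplying the input by the input-layer weight w1 (the word embeddings) gives the vector representation of each input word:
+
+>embedding = input * w1
+
+The vectors of the n-1 words are then concatenated to form the final output of the input layer.
+
+Likewise, the output of the input layer is passed through the hidden layers and the output layer to produce the final result:
+
+>h = tanh(h_pre*w2+b)
+
+>y = softmax(h_last*w3)
+
+The forward code is: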

 ```
 /*
@@ -1872,7 +2304,17 @@ void Forward(XTensor inputs[], XTensor &output, FNNModel &model, FNNNet &net)
        _LogSoftmax(&s, &y, 1);
    }   
 }
+```
+
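+Backward pass: first, the total loss L is computed from the final output of the forward pass and the gold standard. Backpropagation then gives the derivative ∂L/∂w of the loss with respect to the parameters w of each layer, and gradient descent updates the parameters according to
 
+>w_{k+1} = w_k - η * ∂L/∂w_k
+
+where η is the learning rate.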
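
The update rule itself is one line per parameter. The standalone C++ sketch below applies it to a flat parameter vector; `sgdUpdate` is a hypothetical helper for illustration and is not the `Update` function used in the sample code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of NiuTrans.Tensor: one gradient descent
// step, w <- w - eta * grad, applied element by element.
void sgdUpdate(std::vector<float>& w, const std::vector<float>& grad, float eta)
{
    assert(w.size() == grad.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] -= eta * grad[i];
}
```

For example, with w = (1.0, -2.0), gradient (0.5, -1.0), and η = 0.5, the updated parameters are (0.75, -1.5): each parameter moves against the sign of its gradient.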
+
+The code for the backward pass and the parameter update that follows it is:
+
+```
 /*
 backward procedure
 >> inputs - input word representations
@@ -1995,7 +2437,15 @@ void Backward(XTensor inputs[], XTensor &output, XTensor &gold, LOSS_FUNCTION_NA

 ## Example 3: recurrent neural network

-## 致谢
+## The NiuTrans.Tensor team
+
+* 肖桐
+* 李垠桥
+* 许晨
+* 姜雨帆
+* 林野
+* 张裕浩
+* 胡驰

 ## Appendix


--- a/source/network/XBackwardMath.cpp
+++ b/source/network/XBackwardMath.cpp
@@ -71,6 +71,10 @@ void XMathGrad::MakeGrad(XTensor * node)
        GradAbsolute(node);
    else if (operID == MATH_SIGN)
        GradSign(node);
+    else if (operID == MATH_ROUND)
+        GradRound(node);
+    else if (operID == MATH_CLIP)
+        GradClip(node);
    else if (operID == REDUCE_REDUCEMEAN)
        GradReduceMean(node);
    else if (operID == REDUCE_REDUCESUM)
@@ -725,7 +729,7 @@ void XMathGrad::GradNormalize(XTensor * node)
    XTensor * var = income.tails[2];
    XTensor * a = income.tails[3];
    XTensor * b = income.tails[4];
-    XTensor * c = NewTensor(a);
+    XTensor * c = NewTensor(var);
    XTensor * d = NewTensor(a);
    XTensor * e = NewTensor(a);
    XTensor * f = NewTensor(a);
@@ -733,11 +737,14 @@ void XMathGrad::GradNormalize(XTensor * node)
    XTensor * h = NewTensor(a);
    XTensor * i = NewTensor(a);
    XTensor * j = NewTensor(a);
-    XTensor * k = NewTensor(a);
-    XTensor * p = NewTensor(a);
-    XTensor * q = NewTensor(a);
+    XTensor * k = NewTensor(var);
+    XTensor * p = NewTensor(var);
+    XTensor * q = NewTensor(var);
    XTensor * r = NewTensor(a);
-    DTYPE epsilon = income.GetParam(0);
+    XTensor * x = NewTensor(mean);
+    XTensor * y = NewTensor(mean);
+    XTensor * z = NewTensor(mean);
+    DTYPE epsilon = income.GetParam(1);

    int dim = income.GetParamInt(0);
    int n = a->GetDim(dim);
@@ -756,7 +763,9 @@ void XMathGrad::GradNormalize(XTensor * node)

    /* dEdmean */
    _ScaleAndShift(f, g, -1.0F);
-    _Multiply(node->grad, g, mean->grad, 1.0F);
+    _ReduceSum(g, x, dim);
+    _ReduceSum(node->grad, y, dim);
+    _Multiply(y, x, mean->grad, 1.0F);

    /* dEdvar */
    _Unsqueeze(mean, h, dim, n);
@@ -764,8 +773,9 @@ void XMathGrad::GradNormalize(XTensor * node)
    _Multiply(a, i, j);
    _Power(var, k, -1.5F);
    _ScaleAndShift(k, p, -0.5F);
-    _Multiply(j, p, q);
-    _Multiply(node->grad, q, var->grad, 1.0F);
+    _ReduceSum(j, z, dim);
+    _Multiply(z, p, q);
+    _Multiply(y, q, var->grad, 1.0F);

    /* dEda */
    _Multiply(i, e, r);
@@ -788,6 +798,9 @@ void XMathGrad::GradNormalize(XTensor * node)
    delete p;
    delete q;
    delete r;
+    delete x;
+    delete y;
+    delete z;
 }

 /*
@@ -844,6 +857,60 @@ void XMathGrad::GradSign(XTensor * node)
 }

 /*
+gradient for round
+for
+c = round(a)
+we have
+dE/da = 0
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradRound(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for ROUND!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    XNoder::MakeGrad(a);
+
+    b->SetZeroAll();
+    _Sum(a->grad, b, a->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
+gradient for clip
+we have
+dE/da = dE/dc  if lower <= a <= upper
+dE/da = 0      otherwise
+>> node - the node (c) for backward computation
+*/
+void XMathGrad::GradClip(XTensor * node)
+{
+    XLink &income = node->income;
+    CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for CLIP!");
+
+    XTensor * a = income.tails[0];
+    XTensor * b = NewTensor(a);
+
+    DTYPE lower = income.GetParam(0);
+    DTYPE upper = income.GetParam(1);
+
+    XNoder::MakeGrad(a);
+
+    _ClipBackward(node, a, node->grad, a->grad, lower, upper);
+    _Sum(a->grad, b, a->grad);
+
+    node->visitMark = NODE_FINISHED;
+
+    delete b;
+}
+
+/*
 gradient for reduceMean
 for
 c = reduceMean(a, dim)

--- a/source/network/XBackwardMath.h
+++ b/source/network/XBackwardMath.h
@@ -135,6 +135,14 @@ private:
    /* gradient for sign */
    static
    void GradSign(XTensor * node);
+
+    /* gradient for clip */
+    static
+    void GradClip(XTensor * node);
+
+    /* gradient for round */
+    static
+    void GradRound(XTensor * node);
 };

 }

--- a/source/tensor/Main.cpp
+++ b/source/tensor/Main.cpp
@@ -37,7 +37,6 @@

 using namespace nts;

-void SetDataTest();
 void SmallTest();
 void TransposeTest();


--- a/source/tensor/XName.cpp
+++ b/source/tensor/XName.cpp
@@ -39,16 +39,26 @@ const char * GetOPName(int type)
            return "M_COS";
        else if (type == MATH_TAN)
            return "M_TAN";
+        else if (type == MATH_ROUND)
+            return "M_ROUND";
+        else if (type == MATH_CLIP)
+            return "M_CLIP";
+        else if (type == MATH_DIV)
+            return "M_DIV";
        else if (type == MATH_MATRIXMUL)
            return "M_MATRIXMUL";
        else if (type == MATH_MATRIXMULBATCHED)
            return "M_MATRIXMULBATCHED";
        else if (type == MATH_MULTIPLY)
            return "M_MULTIPLY";
-        else if (type == MATH_DIV)
-            return "M_DIV";
        else if (type == MATH_NEGATE)
            return "M_NEGATE";
+        else if (type == MATH_NORMALIZE)
+            return "M_NORMALIZE";
+        else if (type == MATH_POWER)
+            return "M_POWER";
+        else if (type == MATH_SCALEANDSHIFT)
+            return "M_SCALEANDSHIFT";
        else if (type == MATH_SIGN)
            return "M_SIGN";
        else if (type == MATH_SUM)
@@ -57,12 +67,6 @@ const char * GetOPName(int type)
            return "M_SUB";
        else if (type == MATH_SUMDIM)
            return "M_SUMDIM";
-        else if (type == MATH_NORMALIZE)
-            return "M_NORMALIZE";
-        else if (type == MATH_POWER)
-            return "M_POWER";
-        else if (type == MATH_SCALEANDSHIFT)
-            return "M_SCALEANDSHIFT";
        else if (type == REDUCE_REDUCEMAX)
            return "R_REDUCEMAX";
        else if (type == REDUCE_REDUCEMEAN)

--- a/source/tensor/XName.h
+++ b/source/tensor/XName.h
@@ -30,28 +30,30 @@ namespace nts { // namespace nts(NiuTrans.Tensor)

 /* math operations */
 #define MATH_BASE               0x00001000
+
 #define MATH_ABSOLUTE           MATH_BASE + 1
 #define MATH_EXP                MATH_ABSOLUTE + 1
 #define MATH_LOG                MATH_EXP + 1
 #define MATH_SIN                MATH_LOG + 1
 #define MATH_COS                MATH_SIN + 1
 #define MATH_TAN                MATH_COS + 1
+#define MATH_ROUND              MATH_TAN + 1

-#define MATH_NEGATE             MATH_TAN + 1
-#define MATH_MATRIXMUL          MATH_TAN + 1
+#define MATH_CLIP               MATH_ROUND + 1
+#define MATH_DIV                MATH_CLIP + 1
+#define MATH_MATRIXMUL          MATH_DIV + 1
 #define MATH_MATRIXMULBATCHED   MATH_MATRIXMUL + 1
 #define MATH_MULTIPLY           MATH_MATRIXMULBATCHED + 1
-#define MATH_DIV                MATH_MULTIPLY + 1
-#define MATH_SIGN               MATH_DIV + 1
+#define MATH_NEGATE             MATH_MULTIPLY + 1
+#define MATH_NORMALIZE          MATH_NEGATE + 1
+#define MATH_POWER              MATH_NORMALIZE + 1
+#define MATH_SCALEANDSHIFT      MATH_POWER + 1
+#define MATH_SIGN               MATH_SCALEANDSHIFT + 1
 #define MATH_SUM                MATH_SIGN + 1
 #define MATH_SUB                MATH_SUM + 1
 #define MATH_SUMDIM             MATH_SUB + 1

-#define MATH_NORMALIZE          MATH_SUMDIM + 1
-#define MATH_POWER              MATH_NORMALIZE + 1
-#define MATH_SCALEANDSHIFT      MATH_POWER + 1
-
-#define REDUCE                  MATH_SCALEANDSHIFT + 1
+#define REDUCE                  MATH_SUMDIM + 1
 #define REDUCE_REDUCEMAX        REDUCE + 1
 #define REDUCE_REDUCEMEAN       REDUCE_REDUCEMAX + 1
 #define REDUCE_REDUCESUM        REDUCE_REDUCEMEAN + 1

--- a/source/tensor/XTensor.cpp
+++ b/source/tensor/XTensor.cpp
@@ -599,25 +599,24 @@ set the tensor items by a uniform distribution in range [lower, upper]
 void XTensor::SetDataRand(DTYPE lower, DTYPE upper)
 {
    // TODO: cuda code!!!!!!!
-    // TODO: replace float with DTYPE

    if (data == NULL)
        return;

    // srand((unsigned)time(0));
-
+    DTYPE variance = upper - lower;
    void * d = NULL;
    if (dataType == X_FLOAT) {
        d = new float[unitNum];
        for (int i = 0; i < unitNum; i++) {
-            DTYPE value = lower + (upper - lower) * (float)rand() / RAND_MAX;
+            DTYPE value = lower + variance * (float)rand() / RAND_MAX;
            *((float*)d + i) = value;
        }
    }
    else if (dataType == X_DOUBLE) {
        d = new double[unitNum];
        for (int i = 0; i < unitNum; i++) {
-            *((double*)d + i) = lower + (upper - lower) * rand() / RAND_MAX;
+            *((double*)d + i) = lower + variance * rand() / RAND_MAX;
        }
    }
    else {
@@ -627,15 +626,15 @@ void XTensor::SetDataRand(DTYPE lower, DTYPE upper)
    SetData(d, unitNum);
    
    if (dataType == X_FLOAT) {
-        delete[](float*)d;
+        delete[] (float*)d;
    }
    else {
-        delete[](double*)d;
+        delete[] (double*)d;
    }
 }

-/* a gauss distribution */
-double GaussRand()
+/* a gauss distribution (Box-Muller method) */
+double GaussRand(DTYPE mean, DTYPE standardDeviation)
 {
    // TODO: cuda code!!!!!!!

@@ -645,8 +644,8 @@ double GaussRand()
    double pi = 3.141592654;

    if (phase == 0){
-        u = (rand() + 1) / (RAND_MAX + 1.0);
-        v = (rand() + 1) / (RAND_MAX + 1.0);
+        u = (rand() + 1.0) / (RAND_MAX + 1.0);
+        v = (rand() + 1.0) / (RAND_MAX + 1.0);
        z = sqrt(-2.0 * log(u))* sin(2.0 * pi * v);
    }
    else{
@@ -654,7 +653,7 @@ double GaussRand()
    }

    phase = 1 - phase;
-    return z;
+    return mean + (z * standardDeviation);
 }

 /* 
@@ -665,7 +664,6 @@ set the tensor items by a normal distribution
 void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
 {
    // TODO: cuda code!!!!!!!
-    // TODO: replace float with DTYPE

    if (data == NULL)
        return;
@@ -675,13 +673,13 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
    if (dataType == X_FLOAT) {
        d = new float[unitNum];
        for (int i = 0; i < unitNum; i++) {
-            *((float*)d + i) = (float)GaussRand();
+            *((float*)d + i) = (float)GaussRand(mean, standardDeviation);
        }
    }
    else if (dataType == X_DOUBLE) {
        d = new double[unitNum];
        for (int i = 0; i < unitNum; i++) {
-            *((double*)d + i) = GaussRand();
+            *((double*)d + i) = GaussRand(mean, standardDeviation);
        }
    }
    else {
@@ -691,10 +689,10 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
    SetData(d, unitNum);

    if (dataType == X_FLOAT) {
-        delete[](float*)d;
+        delete[] (float*)d;
    }
    else {
-        delete[](double*)d;
+        delete[] (double*)d;
    }
 }
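
The `GaussRand` change above shifts and scales a standard normal draw produced by the Box-Muller method. A stand-alone sketch of the same idea follows. It assumes the C library `rand()` as the uniform source, draws a fresh pair on every call instead of caching the second value, and uses the hypothetical name `GaussSample`. Writing `1.0` rather than `1` keeps the arithmetic in `double`, which matters on platforms where `RAND_MAX` equals `INT_MAX` and `rand() + 1` could overflow.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// one Box-Muller draw from a normal distribution N(mean, stdDev^2)
double GaussSample(double mean, double stdDev)
{
    assert(stdDev >= 0.0);
    const double pi = 3.14159265358979323846;

    // map rand() into (0, 1] so that log(u) is always finite
    double u = (std::rand() + 1.0) / (RAND_MAX + 1.0);
    double v = (std::rand() + 1.0) / (RAND_MAX + 1.0);

    // z follows the standard normal distribution
    double z = std::sqrt(-2.0 * std::log(u)) * std::cos(2.0 * pi * v);

    // shift and scale to the requested mean and standard deviation
    return mean + z * stdDev;
}
```

Averaging many samples should recover the requested mean and standard deviation to within sampling error.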


--- a/source/tensor/core/CHeader.h
+++ b/source/tensor/core/CHeader.h
@@ -46,6 +46,7 @@
 #include "getandset/Select.h"
 #include "getandset/SetData.h"

+#include "math/Clip.h"
 #include "math/Normalize.h"
 #include "math/Power.h"
 #include "math/ScaleAndShift.h"

--- a/source/tensor/core/arithmetic/Sign.cpp
+++ b/source/tensor/core/arithmetic/Sign.cpp
@@ -76,7 +76,7 @@ XTensor Sign(const XTensor & a)
    XTensor b(&a);
    b.SetTMP();

-    /* call _ScaleAndShift function */
+    /* call _Sign function */
    _Sign(&a, &b);

    /* tensor connections */

--- a/source/tensor/core/getandset/SetData.cpp
+++ b/source/tensor/core/getandset/SetData.cpp
@@ -214,34 +214,32 @@ void _SetDataFixedDouble(XTensor * tensor, double p)
 }

 /*
-generate data items with a uniform distribution in [low,high]
+generate data items with a uniform distribution in [lower, upper]
 >> tensor - the tensor whose data array would be initialized
->> low - lower value of the range
->> high - higher value of the range
+>> lower - lower value of the range
+>> upper - upper value of the range
 */
-void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
+void _SetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper)
 {
-    CheckNTErrors(high > low, "the high value must be greater than low value!");
+    CheckNTErrors(upper > lower, "the upper value must be greater than the lower value!");

    if(tensor == NULL)
        return;
    
     /* CPU code */
    if(tensor->devID < 0){
-        DTYPE variance = high - low;
-        
-        srand((unsigned)time(NULL));
+        DTYPE variance = upper - lower;
        
        if(tensor->dataType == X_FLOAT){
            float * d = (float*)tensor->data;
            for(int i = 0; i < tensor->unitNum; i++){
-                d[i] = variance * ((float)rand()/RAND_MAX) + low;
+                d[i] = variance * ((float)rand()/RAND_MAX) + lower;
            }
        }
        else if(tensor->dataType == X_DOUBLE){
            double * d = (double*)tensor->data;
            for(int i = 0; i < tensor->unitNum; i++){
-                d[i] = variance * ((double)rand()/RAND_MAX) + low;
+                d[i] = variance * ((double)rand()/RAND_MAX) + lower;
            }
        }
        else{
@@ -256,7 +254,7 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
    */
    else{
 #ifdef USE_CUDA
-        _CudaSetDataRand(tensor, low, high);
+        _CudaSetDataRand(tensor, lower, upper);
 #endif
        //XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1);
        //_SetDataRand(t2, low, high);
@@ -265,5 +263,17 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
    }
 }
    
+
+/*
+generate data items with a normal distribution with specified mean and standard deviation 
+>> mean - mean or expectation of the distribution
+>> standardDeviation - standard deviation of the distribution
+*/
+void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
+{
+    // TODO: rewrite it and add cuda code!!!!!!!
+    tensor->SetDataRandn(mean, standardDeviation);
+}
+
 } // namespace nts(NiuTrans.Tensor)

--- a/source/tensor/core/getandset/SetData.cu
+++ b/source/tensor/core/getandset/SetData.cu
@@ -150,61 +150,20 @@ void _CudaSetDataFixedDouble(XTensor * tensor, double p)
 }

 /* 
-call curand_init function on each kernel with the same random seed
-and init the rng states
-*/
-__global__ 
-void KernelInitializeCurand(curandState * state, unsigned long seed)
-{
-    int i = blockDim.x * blockIdx.x + threadIdx.x;
-    curand_init(seed, i, 0, &state[i]);
-}
-
-/* */
-__device__ 
-float GenerateFloat(curandState* globalState, int i)
-{
-    //copy state to local mem
-    curandState localState = globalState[i];
-    //apply uniform distribution with calculated random
-    float randNum = curand_uniform(&localState);
-    //update state
-    globalState[i] = localState;
-
-    //return value
-    return randNum;
-}
-
-/**/
-__device__ 
-double GenerateDouble(curandState* globalState, int i)
-{
-    //copy state to local mem
-    curandState localState = globalState[i];
-    //apply uniform distribution with calculated random
-    double randNum = curand_uniform_double(&localState);
-    //update state
-    globalState[i] = localState;
-    //return value
-    return randNum;
-}
-
-/* 
 set data array with a uniform distribution in [low, high] 
 >> deviceStates - the state of curand
 >> d - float datatype pointer to the data array 
 >> size - size of the array
->> low - low value of the range
->> high - high value of the range
+>> lower - lower value of the range
+>> variance - the width of the range (upper - lower)
 */
 __global__
-void KernelSetDataRandFloat(curandState* deviceStates, float * d, int size, DTYPE low, DTYPE variance)
+void KernelSetDataRandFloat(float * d, int size, DTYPE lower, DTYPE variance)
 {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    
    if (i < size) {
-        float randNum = GenerateFloat(deviceStates, i);
-        d[i] = randNum * variance + low;
+        d[i] = d[i] * variance + lower;
    }
 }
 /* 
@@ -212,29 +171,28 @@ set data array with a uniform distribution in [low, high]
 >> deviceStates - the state of curand
 >> d - double datatype pointer to the data array
 >> size - size of the array
->> low - low value of the range
->> high - high value of the range
+>> lower - lower value of the range
+>> variance - the width of the range (upper - lower)
 */
 __global__
-void KernelSetDataRandDouble(curandState* deviceStates, double * d, int size, DTYPE low, DTYPE variance)
+void KernelSetDataRandDouble(double * d, int size, DTYPE lower, DTYPE variance)
 {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    
    if (i < size){
-        double randNum = GenerateDouble(deviceStates, i);
-        d[i] = randNum * variance + low;
+        d[i] = d[i] * variance + lower;
    }
 }

 /*
-generate data items with a uniform distribution in [low,high]
+generate data items with a uniform distribution in [lower, upper]
 >> tensor - the tensor whose data array would be initialized
->> low - lower value of the range
->> high - higher value of the range
+>> lower - lower value of the range
+>> upper - upper value of the range
 */
-void _CudaSetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
+void _CudaSetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper)
 {
-    CheckNTErrors(high > low, "the high value must be greater than low value!");
+    CheckNTErrors(upper > lower, "the upper value must be greater than the lower value!");

    int gridSize[3];
    int blockSize[3];
@@ -247,15 +205,17 @@ void _CudaSetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
    int devIDBackup;
    ProtectCudaDev(tensor->devID, devIDBackup);
    
-    curandState *deviceStates;
-    cudaMalloc(&deviceStates, sizeof(curandState));
-    DTYPE variance = high - low;
+    curandGenerator_t gen;
+    curandCreateGenerator (&gen, CURAND_RNG_PSEUDO_DEFAULT);
+    curandSetPseudoRandomGeneratorSeed(gen, time(NULL));
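+    /* fill the buffer with values of the tensor's own data type */
+    if (tensor->dataType == X_DOUBLE)
+        curandGenerateUniformDouble(gen, (double*)tensor->data, tensor->unitNum);
+    else
+        curandGenerateUniform(gen, (float*)tensor->data, tensor->unitNum);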
+    curandDestroyGenerator(gen);
+    DTYPE variance = upper - lower;

-    KernelInitializeCurand<<<blocks, threads>>>(deviceStates, unsigned(time(NULL)));
    if (tensor->dataType == X_FLOAT)
-        KernelSetDataRandFloat <<<blocks, threads >>>(deviceStates, (float*)tensor->data, tensor->unitNum, low, variance);
+        KernelSetDataRandFloat <<<blocks, threads >>>((float*)tensor->data, tensor->unitNum, lower, variance);
    else if (tensor->dataType == X_DOUBLE)
-        KernelSetDataRandDouble <<<blocks, threads >>>(deviceStates, (double*)tensor->data, tensor->unitNum, low, variance);
+        KernelSetDataRandDouble <<<blocks, threads >>>((double*)tensor->data, tensor->unitNum, lower, variance);

    BacktoCudaDev(tensor->devID, devIDBackup);
 }

--- a/source/tensor/core/getandset/SetData.cuh
+++ b/source/tensor/core/getandset/SetData.cuh
@@ -37,8 +37,8 @@ void _CudaSetDataFixedFloat(XTensor * tensor, float p);
 /* generate data items with a fixed value p (in double) */
 void _CudaSetDataFixedDouble(XTensor * tensor, double p);

-/* generate data items with a uniform distribution in [low,high] */
-void _CudaSetDataRand(XTensor * tensor, DTYPE low, DTYPE high);
+/* generate data items with a uniform distribution in [lower, upper] */
+void _CudaSetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper);

 } // namespace nts(NiuTrans.Tensor)


--- a/source/tensor/core/getandset/SetData.h
+++ b/source/tensor/core/getandset/SetData.h
@@ -45,8 +45,8 @@ void _SetDataFixedFloat(XTensor * tensor, float p);
 /* generate data items with a fixed value p (in double) */
 void _SetDataFixedDouble(XTensor * tensor, double p);

-/* generate data items with a uniform distribution in [low,high] */
-void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high);
+/* generate data items with a uniform distribution in [lower, upper] */
+void _SetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper);

 /* generate data items with a normal distribution with specified mean and standard deviation */
 void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation);

--- a/source/tensor/core/math/Clip.cpp
+++ b/source/tensor/core/math/Clip.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
+*/
+
+#include "../../XTensor.h"
+#include "../../XName.h"
+#include "Clip.h"
+#include "Clip.cuh"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+set every entry to its clip value
+>> a - input tensor we are processing
+>> b - output tensor we are processing
+>> lower - the lower border
+>> upper - the upper border
+*/
+void _Clip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper)
+{
+#ifdef USE_CUDA
+	/* run it on GPUs */
+	if (a->devID >= 0) {
+		_CudaClip(a, b, lower, upper);
+		return;
+	}
+#endif
+
+	CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
+	CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
+
+	DTYPE * d = (DTYPE*)a->data;
+	DTYPE * db = (DTYPE*)b->data;
+	for (int i = 0; i < a->unitNum; i++) {
+		if (d[i] > upper)
+			db[i] = upper;
+		else if (d[i] < lower)
+			db[i] = lower;
+		else
+			db[i] = d[i];
+	}
+}
+
+/*
+set every entry to its clip value (do it on site)
+keep the result in the input tensor a and return nothing
+>> a - the tensor we are processing
+>> lower - the lower border
+>> upper - the upper border
+*/
+void _ClipMe(XTensor * a, DTYPE lower, DTYPE upper)
+{
+	_Clip(a, a, lower, upper);
+}
+
+/*
+set every entry to its clip value (return a XTensor structure)
+make a new tensor to keep the result and return it
+>> a - input tensor we are processing
+>> lower - the lower border
+>> upper - the upper border
+<< return - the clip value of the input tensor
+*/
+XTensor Clip(const XTensor & a, DTYPE lower, DTYPE upper)
+{
+	XTensor b(&a);
+	b.SetTMP();
+
+	/* call _Clip function */
+	_Clip(&a, &b, lower, upper);
+
+	/* tensor connections */
+	XLink::MakeLink(&a, NULL, &b, MATH_CLIP);
+	XLink::AddParamToHead(&b, lower);
+	XLink::AddParamToHead(&b, upper);
+
+	return b;
+}
+
+/*
+backward computation
+
+dE/dx = dE/dy * dy/dx
+
+clip: y = upper    if x > upper
+          x        if lower <= x <= upper
+          lower    if x < lower
+
+and dy/dx = 1    if lower <= x <= upper
+            0    otherwise
+
+>> y - output of the function
+>> x - input of the function
+>> dedy - dE/dy
+>> dedx - dE/dx
+>> lower - the lower border
+>> upper - the upper border
+*/
+void _ClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper) 
+{
+    
+#ifdef USE_CUDA
+    if (x->devID >= 0) {
+        _CudaClipBackward(y, x, dedy, dedx, lower, upper);
+        return;
+    }
+#endif
+
+    if (x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE) {
+        DTYPE * dedyp = (DTYPE*)dedy->data;
+        DTYPE * dedxp = (DTYPE*)dedx->data;
+        DTYPE * ip = (DTYPE*)x->data;
+        int size = y->unitNum;
+
+        /* dE/dx = dE/dy * dy/dx */
+        for (int i = 0; i < size; i++) {
+            DTYPE s = ip[i];
+            if (s > upper || s < lower)
+                dedxp[i] = 0;
+            else
+                dedxp[i] = dedyp[i];
+        }
+    }
+    else
+        ShowNTErrors("TODO!");
+}
+
+} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
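
As a framework-free illustration of what `_Clip` and `_ClipBackward` above compute, the following sketch applies the same element-wise rules to plain arrays. `ClipForward` and `ClipBackwardSketch` are hypothetical names, and unlike the library routines they return new vectors instead of writing into existing tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// forward: y[i] = upper if x[i] > upper, lower if x[i] < lower, x[i] otherwise
std::vector<float> ClipForward(const std::vector<float> &x, float lower, float upper)
{
    assert(lower <= upper);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); i++)
        y[i] = x[i] > upper ? upper : (x[i] < lower ? lower : x[i]);
    return y;
}

// backward: dE/dx[i] = dE/dy[i] if lower <= x[i] <= upper, 0 otherwise
std::vector<float> ClipBackwardSketch(const std::vector<float> &x,
                                      const std::vector<float> &dedy,
                                      float lower, float upper)
{
    assert(x.size() == dedy.size());
    std::vector<float> dedx(x.size());
    for (std::size_t i = 0; i < x.size(); i++)
        dedx[i] = (x[i] > upper || x[i] < lower) ? 0.0f : dedy[i];
    return dedx;
}
```

Entries that were clipped in the forward pass receive a zero gradient, while entries inside the range pass the incoming gradient through unchanged.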
--- a/source/tensor/core/math/Clip.cu
+++ b/source/tensor/core/math/Clip.cu
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
+*/
+
+#include "../../XDevice.h"
+#include "../../XTensor.h"
+#include "Clip.h"
+#include "Clip.cuh"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+#ifdef USE_CUDA
+/*
+set each entry to its clip value (CUDA Kernel)
+>> a - pointer to input data array
+>> b - pointer to output data array
+>> lower - the lower border
+>> upper - the upper border
+>> size - size of the data array
+*/
+__global__
+void KernelClip(DTYPE * a, DTYPE * b, DTYPE lower, DTYPE upper, int size)
+{
+	int i = blockDim.x * blockIdx.x + threadIdx.x;
+
+	if (i < size) {
+		if (a[i] > upper)
+			b[i] = upper;
+		else if (a[i] < lower)
+			b[i] = lower;
+		else
+			b[i] = a[i];
+	}
+}
+
+/*
+set each entry to its clip value with float16 data type value (CUDA Kernel)
+This is for float16 computation
+>> a - pointer to input data array
+>> b - pointer to output data array
+>> lower - the lower border
+>> upper - the upper border
+>> size - size of the data array
+*/
+__global__
+void KernelClip(__half * a, __half * b, DTYPE lower, DTYPE upper, int size)
+{
+	return;
+}
+
+/*
+set each entry to its clip value
+>> a - input tensor we are processing
+>> b - output tensor we are processing
+>> lower - the lower border
+>> upper - the upper border
+*/
+void _CudaClip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper)
+{
+	CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
+	CheckNTErrors((a->isSparse == false), "TODO!");
+
+	int gridSize[3];
+	int blockSize[3];
+
+	GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
+
+	dim3 blocks(gridSize[0]);
+	dim3 threads(blockSize[0]);
+
+	int devIDBackup;
+	ProtectCudaDev(a->devID, devIDBackup);
+
+	if (a->dataType == DEFAULT_DTYPE) {
+		KernelClip<<<blocks, threads>>>((DTYPE*)a->data, (DTYPE*)b->data, lower, upper, a->unitNum);
+	}
+	else if (a->dataType == X_FLOAT16) {
+		KernelClip<<<blocks, threads>>>((__half*)a->data, (__half*)b->data, lower, upper, a->unitNum);
+	}
+	else {
+		ShowNTErrors("TODO!");
+	}
+
+	BacktoCudaDev(a->devID, devIDBackup);
+}
+
+/*
+clip backward computation of dE/dx (Cuda kernel)
+
+dy/dx = 1     if lower <= x <= upper
+        0     otherwise
+
+>> dedy - dE/dy
+>> dedx - dE/dx
+>> y - output of the function
+>> x - input of the function
+>> lower - the lower border
+>> upper - the upper border
+>> size - size of the data array
+*/
+__global__
+void KernelClipBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * y, DTYPE * x, DTYPE lower, DTYPE upper, int size)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+
+    if (i < size) {
+        DTYPE s = x[i];
+        if (s > upper || s < lower)
+            dedx[i] = 0;
+        else
+            dedx[i] = dedy[i];
+    }
+}
+
+/*
+backward computation (Cuda version)
+
+dE/dx = dE/dy * dy/dx
+
+clip: y = upper    if x > upper
+          x        if lower <= x <= upper
+          lower    if x < lower
+
+and dy/dx = 1    if lower <= x <= upper
+            0    otherwise
+
+>> y - output of the function
+>> x - input of the function
+>> dedy - dE/dy
+>> dedx - dE/dx
+>> lower - the lower border
+>> upper - the upper border
+*/
+void _CudaClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper)
+{
+    if (x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE) {
+
+        int gridSize[3], blockSize[3];
+
+        GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
+
+        int devIDBackup;
+        ProtectCudaDev(x->devID, devIDBackup);
+
+        /* dE/dx = dE/dy * dy/dx */
+        KernelClipBackward <<<dim3(gridSize[0]), dim3(blockSize[0])>>>
+                             ((DTYPE*)dedy->data,
+                              (DTYPE*)dedx->data,
+                              (DTYPE*)y->data, (DTYPE*)x->data,
+                              lower, upper,
+                              x->unitNum);
+
+        BacktoCudaDev(x->devID, devIDBackup);
+    }
+    else
+        ShowNTErrors("TODO!");
+}
+
+
+#endif // USE_CUDA
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/core/math/Clip.cuh
+++ b/source/tensor/core/math/Clip.cuh
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
+*/
+
+#ifndef __CLIP_CUH__
+#define __CLIP_CUH__
+
+#include "Clip.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+#ifdef USE_CUDA
+
+/* set each entry to its clip value (CUDA Kernel) */
+__global__
+void KernelClip(DTYPE * a, DTYPE * b, DTYPE lower, DTYPE upper, int size);
+
+/* set each entry to its clip value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelClip(__half * a, __half * b, DTYPE lower, DTYPE upper, int size);
+
+/* set each entry to its clip value */
+void _CudaClip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper);
+
+/* backward of Clip function (CUDA Kernel) */
+__global__
+void KernelClipBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * y, DTYPE * x, DTYPE lower, DTYPE upper, int size);
+
+/* backward of Clip function */
+void _CudaClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper);
+
+#endif // USE_CUDA
+
+} // namespace nts(NiuTrans.Tensor)
+
+#endif // __CLIP_CUH__
\ No newline at end of file
--- a/source/tensor/core/math/Clip.h
+++ b/source/tensor/core/math/Clip.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
+*/
+
+#ifndef __CLIP_H__
+#define __CLIP_H__
+
+#include "../../XTensor.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* set every entry to its clip value */
+void _Clip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper);
+
+/*
+set every entry to its clip value (do it on site)
+keep the result in the input tensor a and return nothing
+*/
+void _ClipMe(XTensor * a, DTYPE lower, DTYPE upper);
+
+/*
+set every entry to its clip value  (return a XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Clip(const XTensor & a, DTYPE lower, DTYPE upper);
+
+/*
+backward of Clip function
+*/
+void _ClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper);
+
+} // namespace nts(NiuTrans.Tensor)
+
+#endif // __CLIP_H__
--- a/source/tensor/core/math/Unary.cpp
+++ b/source/tensor/core/math/Unary.cpp
@@ -64,6 +64,10 @@ SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
 _SIMPLE_UNARY_FUNCTION(_Tan, _CudaTan, tan)
 _SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
 SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
+
+_SIMPLE_UNARY_FUNCTION(_Round, _CudaRound, round)
+_SIMPLE_UNARY_FUNCTION_ME(_RoundMe, _Round)
+SIMPLE_UNARY_FUNCTION(Round, _Round, MATH_ROUND)
 #else
 /* define three macros separately, specify the respective function names */
 #define _SIMPLE_UNARY_FUNCTION(_funcName, origFunc)          \
@@ -117,6 +121,10 @@ SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
 _SIMPLE_UNARY_FUNCTION(_Tan, tan)
 _SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
 SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
+
+_SIMPLE_UNARY_FUNCTION(_Round, round)
+_SIMPLE_UNARY_FUNCTION_ME(_RoundMe, _Round)
+SIMPLE_UNARY_FUNCTION(Round, _Round, MATH_ROUND)
 #endif

 }
\ No newline at end of file
--- a/source/tensor/core/math/Unary.cu
+++ b/source/tensor/core/math/Unary.cu
@@ -5,51 +5,51 @@

 namespace nts {

-#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc)                   \
-__global__                                                              \
-void Kernel##funcName(DTYPE * a, DTYPE * b, int size)                   \
-{                                                                       \
-    int i = blockDim.x * blockIdx.x + threadIdx.x;                      \
-                                                                        \
-    if (i < size)                                                       \
-        b[i] = (DTYPE)origFunc(a[i]);                                   \
-}                                                                       \
-__global__                                                              \
-    void Kernel##funcName(__half * a, __half * b, int size)             \
-{                                                                       \
-    return;                                                             \
-}                                                                       \
-void _Cuda##funcName(const XTensor * a, XTensor * b)                    \
-{                                                                       \
-    CheckNTErrors((XTensor::IsSameShaped(a, b)),                        \
-                  "Input tensors should have the same type!");          \
-    CheckNTErrors((a->isSparse == false), "TODO!");                     \
-                                                                        \
-    int gridSize[3];                                                    \
-    int blockSize[3];                                                   \
-                                                                        \
-    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);     \
-                                                                        \
-    dim3 blocks(gridSize[0]);                                           \
-    dim3 threads(blockSize[0]);                                         \
-                                                                        \
-    int devIDBackup;                                                    \
-    ProtectCudaDev(a->devID, devIDBackup);                              \
-                                                                        \
-    if (a->dataType == DEFAULT_DTYPE) {                                 \
-        Kernel##funcName << <blocks, threads >> >                       \
-                     ((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);    \
-    }                                                                   \
-    else if (a->dataType == X_FLOAT16) {                                \
-        Kernel##funcName << <blocks, threads >> >                       \
-                     ((__half*)a->data, (__half*)b->data, a->unitNum);  \
-    }                                                                   \
-    else {                                                              \
-        ShowNTErrors("TODO!");                                          \
-    }                                                                   \
-                                                                        \
-    BacktoCudaDev(a->devID, devIDBackup);                               \
-}                                                                       \
+#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc)                       \
+__global__                                                                  \
+void Kernel##funcName(DTYPE * a, DTYPE * b, int size)                       \
+{                                                                           \
+    int i = blockDim.x * blockIdx.x + threadIdx.x;                          \
+                                                                            \
+    if (i < size)                                                           \
+        b[i] = (DTYPE)origFunc(a[i]);                                       \
+}                                                                           \
+__global__                                                                  \
+    void Kernel##funcName(__half * a, __half * b, int size)                 \
+{                                                                           \
+    return;                                                                 \
+}                                                                           \
+void _Cuda##funcName(const XTensor * a, XTensor * b)                        \
+{                                                                           \
+    CheckNTErrors((XTensor::IsSameShaped(a, b)),                            \
+                  "Input tensors should have the same shape!");             \
+    CheckNTErrors((a->isSparse == false), "TODO!");                         \
+                                                                            \
+    int gridSize[3];                                                        \
+    int blockSize[3];                                                       \
+                                                                            \
+    GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);         \
+                                                                            \
+    dim3 blocks(gridSize[0]);                                               \
+    dim3 threads(blockSize[0]);                                             \
+                                                                            \
+    int devIDBackup;                                                        \
+    ProtectCudaDev(a->devID, devIDBackup);                                  \
+                                                                            \
+    if (a->dataType == DEFAULT_DTYPE) {                                     \
+        Kernel##funcName << <blocks, threads >> >                           \
+                     ((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);        \
+    }                                                                       \
+    else if (a->dataType == X_FLOAT16) {                                    \
+        Kernel##funcName << <blocks, threads >> >                           \
+                     ((__half*)a->data, (__half*)b->data, a->unitNum);      \
+    }                                                                       \
+    else {                                                                  \
+        ShowNTErrors("TODO!");                                              \
+    }                                                                       \
+                                                                            \
+    BacktoCudaDev(a->devID, devIDBackup);                                   \
+}                                                                           \

 SIMPLE_UNARY_FUNCTION_GPU(Absolute, fabs)
 SIMPLE_UNARY_FUNCTION_GPU(Exp, exp)
@@ -57,5 +57,6 @@ SIMPLE_UNARY_FUNCTION_GPU(Log, log)
 SIMPLE_UNARY_FUNCTION_GPU(Sin, sin)
 SIMPLE_UNARY_FUNCTION_GPU(Cos, cos)
 SIMPLE_UNARY_FUNCTION_GPU(Tan, tan)
+SIMPLE_UNARY_FUNCTION_GPU(Round, round)

 }
\ No newline at end of file
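Reviewer note: `SIMPLE_UNARY_FUNCTION_GPU` stamps out one kernel and one launcher per scalar function. The same idea can be sketched on the CPU, assuming plain `std::vector<float>` buffers instead of `XTensor`. The macro `SKETCH_UNARY_FUNCTION` and the generated names below are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU analogue of the macro pattern: one macro generates an
// element-wise function from a scalar function.
#define SKETCH_UNARY_FUNCTION(funcName, origFunc)                        \
inline void funcName(const std::vector<float>& a, std::vector<float>& b) \
{                                                                        \
    assert(a.size() == b.size());                                        \
    for (std::size_t i = 0; i < a.size(); ++i)                           \
        b[i] = static_cast<float>(origFunc(a[i]));                       \
}

// One line per operation, mirroring the list of instantiations in the diff.
SKETCH_UNARY_FUNCTION(sketchAbsolute, std::fabs)
SKETCH_UNARY_FUNCTION(sketchRound, std::round)
```

With this pattern, adding a new operation costs one extra line, which is how `Round` is added in this commit.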
--- a/source/tensor/core/math/Unary.cuh
+++ b/source/tensor/core/math/Unary.cuh
@@ -83,6 +83,15 @@ void KernelTan(__half * a, __half * b, int size);
 /* set each entry to its tangent value */
 void _CudaTan(const XTensor * a, XTensor * b);

+/* set each entry to its round value (CUDA Kernel) */
+__global__
+void KernelRound(DTYPE * a, DTYPE * b, int size);
+/* set each entry to its round value (CUDA Kernel) with float16 data type*/
+__global__
+void KernelRound(__half * a, __half * b, int size);
+/* set each entry to its round value */
+void _CudaRound(const XTensor * a, XTensor * b);
+
 #endif // USE_CUDA

 } // namespace nts(NiuTrans.Tensor)

--- a/source/tensor/core/math/Unary.h
+++ b/source/tensor/core/math/Unary.h
@@ -104,5 +104,19 @@ make a new tensor to keep the result and return it
 */
 XTensor Tan(const XTensor & a);

+
+/* set every entry to its round value */
+void _Round(const XTensor * a, XTensor * b);
+/* 
+set every entry to its round value (in place)
+keep the result in the input tensor a and return nothing
+*/
+void _RoundMe(XTensor * a);
+/* 
+set every entry to its round value (return an XTensor structure)
+make a new tensor to keep the result and return it
+*/
+XTensor Round(const XTensor & a);
+
 }
 #endif //end __UNARY_H__
\ No newline at end of file
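Reviewer note: the three declarations follow the library's naming convention of an out-parameter form (`_Round`), an in-place form (`_RoundMe`) and a value-returning form (`Round`). The sketch below mirrors that convention on `std::vector<float>`. The function names are hypothetical. It uses `std::round`, which rounds halfway cases away from zero; the CPU path in `Unary.cpp` maps `_Round` to `round`, so the same behaviour is assumed there.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Out-parameter form, analogous to _Round(a, b). Hypothetical name.
void roundIntoSketch(const std::vector<float>& a, std::vector<float>& b)
{
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = std::round(a[i]);  // halfway cases round away from zero
}

// In-place form, analogous to _RoundMe(a). Hypothetical name.
void roundInPlaceSketch(std::vector<float>& a)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = std::round(a[i]);
}

// Value-returning form, analogous to Round(a). Hypothetical name.
std::vector<float> roundedSketch(const std::vector<float>& a)
{
    std::vector<float> b(a.size());
    roundIntoSketch(a, b);
    return b;
}
```

For the inputs 0.4, 2.5 and -2.5 all three forms give 0, 3 and -3.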
--- a/source/tensor/core/reduce/ReduceMax.cu
+++ b/source/tensor/core/reduce/ReduceMax.cu
@@ -29,6 +29,71 @@ namespace nts{ // namespace nts(NiuTrans.Tensor)

 #ifdef USE_CUDA

+
+/*
+use PTX code to reduce float data
+*/
+__device__ __forceinline__  
+float shflDownReduceMax(float input)
+{
+    float output;
+    asm volatile(
+        "{"
+        ".reg .f32 r0;"
+        ".reg .pred p;"
+        "shfl.down.b32  r0, %1, 0x10, 0x1f;"
+        "setp.lt.f32    p,%1,r0;"
+        "@p mov.f32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x8, 0xf;"
+        "setp.lt.f32    p,%1,r0;"
+        "@p mov.f32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x4, 0x7;"
+        "setp.lt.f32    p,%1,r0;"
+        "@p mov.f32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x2, 0x3;"
+        "setp.lt.f32    p,%1,r0;"
+        "@p mov.f32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x1, 0x1;"
+        "setp.lt.f32    p, %1, r0; "
+        "@p mov.f32     %1,r0;"
+        "mov.f32        %0,%1;"
+        "}"
+        : "=f"(output) : "f"(input));
+    return output;
+}
+
+/*
+use PTX code to reduce int data
+*/
+__device__ __forceinline__
+int shflDownReduceMax(int input)
+{
+    int output;
+    asm volatile(
+        "{"
+        ".reg .s32 r0;"
+        ".reg .pred p;"
+        "shfl.down.b32  r0, %1, 0x10, 0x1f;"
+        "setp.lt.s32    p,%1,r0;"
+        "@p mov.s32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x8, 0xf;"
+        "setp.lt.s32    p,%1,r0;"
+        "@p mov.s32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x4, 0x7;"
+        "setp.lt.s32    p,%1,r0;"
+        "@p mov.s32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x2, 0x3;"
+        "setp.lt.s32    p,%1,r0;"
+        "@p mov.s32     %1,r0;"
+        "shfl.down.b32  r0, %1, 0x1, 0x1;"
+        "setp.lt.s32    p, %1, r0; "
+        "@p mov.s32     %1,r0;"
+        "mov.s32        %0,%1;"
+        "}"
+        : "=r"(output) : "r"(input));
+    return output;
+}
+
 /* 
 reduce a tensor to another that keeps the max value along a dimension  - slow version
 Given a block of data, we go over each dimension i in the stride and we have
@@ -191,25 +256,19 @@ void KernelReduceMaxFast(DTYPE * input, DTYPE * output,
    DTYPE value  = j < strideNum ? inputData[j * stride + iOffset]: FLOAT_MIN;
    DTYPE value2 = j + blockDim.y < strideNum ? inputData[(j + blockDim.y) * stride + iOffset]: FLOAT_MIN;

-    /* load data into the shared mem */
-    data[tid] = MAX(value, value2);
-
+    value = MAX(value, value2);
+    value = shflDownReduceMax(value);
+    if ((tid & 0x1f) == 0) { data[tid / 32] = value; }
    __syncthreads();

-    /* unroll the warp */
-    if(goodSize >= 512) {if(tid < 256) {if(data[tid] < data[tid + 256]) data[tid] = data[tid + 256];} __syncthreads();}
-    if(goodSize >= 256) {if(tid < 128) {if(data[tid] < data[tid + 128]) data[tid] = data[tid + 128];} __syncthreads();}
-    if(goodSize >= 128) {if(tid <  64) {if(data[tid] < data[tid +  64]) data[tid] = data[tid +  64];} __syncthreads();}
-    if(goodSize >=  64) {if(tid <  32) {if(data[tid] < data[tid +  32]) data[tid] = data[tid +  32];} __syncthreads();}
-    if(goodSize >=  32) {if(tid <  16) {if(data[tid] < data[tid +  16]) data[tid] = data[tid +  16];} __syncthreads();}
-    if(goodSize >=  16) {if(tid <   8) {if(data[tid] < data[tid +   8]) data[tid] = data[tid +   8];} __syncthreads();}
-    if(goodSize >=   8) {if(tid <   4) {if(data[tid] < data[tid +   4]) data[tid] = data[tid +   4];} __syncthreads();}
-    if(goodSize >=   4) {if(tid <   2) {if(data[tid] < data[tid +   2]) data[tid] = data[tid +   2];} __syncthreads();}
-    if(goodSize >=   2) {if(tid <   1) {if(data[tid] < data[tid +   1]) data[tid] = data[tid +   1];} __syncthreads();}
-
-    /* write result for this block to the output array */
-    if(threadIdx.y == 0 && blockIdx.y < reducedStrideNum) 
-        output[(k * reducedStrideNum + blockIdx.y) * stride  + iOffset] = data[0];
+    if (tid < 32) {
+        if (tid < blockDim.y / 32)
+            value = data[tid];
+        else value = FLOAT_MIN;
+        value = shflDownReduceMax(value);
+        if (tid == 0 && blockIdx.y < reducedStrideNum)
+            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = value;
+    }
 }

 /*
@@ -326,6 +385,105 @@ void KernelReduceMaxSimpleFast(DTYPE * input, DTYPE * output,
    op[offset] = max;
 }

+/*
+choose the number of warps per block according to the GPU's SM count
+*/
+inline void continuousStorageThreadAllocation(dim3& grid, dim3& block, long long vectorNum, int vectorSize)
+{
+    int warpNum = 4;
+    if (vectorNum < 20 * 8){
+        warpNum = 8;
+        if (vectorNum < 20 * 4){
+            warpNum = 16;
+            if (vectorNum < 20 * 2)
+                warpNum = 32;
+        }
+    }
+    int minWarpNum = vectorSize / 32;
+    if (vectorSize % 32 != 0) minWarpNum++;
+    warpNum = min(warpNum, minWarpNum);
+
+    grid.x = vectorNum;
+    grid.y = 1;
+    grid.z = 1;
+    block.x = 1;
+    block.y = warpNum * 32;
+    block.z = 1;
+}
+
+/*
+adjust the thread layout (threads.x and threads.y) so that warp-level optimization can be used
+*/
+inline void adjustThreadForUseWarpOptimization(dim3& blocks, dim3& threads)
+{
+    if (threads.x > 1) {
+        blocks.x *= threads.x;
+        threads.x = 1;
+    }
+    if (threads.y < 32)
+        threads.y = 32;
+}
+
+/*
+In some cases, we use fewer blocks to improve efficiency
+*/
+__global__
+void KernelReduceMaxOpLessBlocks(DTYPE * input, DTYPE * output, int strideNum, int blockNum)
+{
+    int idx = threadIdx.x % 32;
+    int idy = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
+
+    int startIndex = idy * strideNum;
+    DTYPE threadMax = FLOAT_MIN;
+    for (int i = idx; i < strideNum; i += 32) {
+        threadMax = max(input[startIndex + i], threadMax);
+    }
+    threadMax = shflDownReduceMax(threadMax);
+    if (idx == 0) 
+        output[idy] = threadMax;
+}
+
+/*
+reduce with PTX code
+*/
+__global__
+void KernelReduceMaxOp(DTYPE * input, DTYPE * output,int stride, int strideNum, 
+                       int reducedStrideNum,int blockSize, int blockNum)
+{
+    __shared__ DTYPE iData[MAX_CUDA_THREAD_NUM_PER_BLOCK / 32];
+
+    unsigned int tid = threadIdx.y;
+    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
+    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= stride * blockNum)
+        return;
+
+    /* first level reduction */
+    int k = i / stride;
+    int iOffset = i % stride;
+
+    DTYPE threadMax = FLOAT_MIN;
+
+    DTYPE * data = iData + threadIdx.x * blockDim.y;
+    DTYPE * inputData = input + k * blockSize;
+    for (int it = j; it < strideNum; it += blockDim.y){
+        threadMax = max(inputData[it * stride + iOffset], threadMax);
+    }
+    __syncthreads();
+    threadMax = shflDownReduceMax(threadMax);
+    if ((tid & 0x1f) == 0) { data[tid / 32] = threadMax; }
+    __syncthreads();
+    /* use one warp to reduce remaining data */
+    if (tid < 32){
+        if (tid < blockDim.y / 32)
+            threadMax = data[tid];
+        else threadMax = FLOAT_MIN;
+        threadMax = shflDownReduceMax(threadMax);
+        if (tid == 0 && blockIdx.y < reducedStrideNum)
+            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = threadMax;
+    }
+}
+
 /* 
 get the max-valued items along a dimension of the tensor (cuda version). 
 For a 1-dimensional data array a,
@@ -382,130 +540,147 @@ void _CudaReduceMax(const XTensor * input, XTensor * output, int dim)
    int devIDBackup;
    ProtectCudaDev(input->devID, devIDBackup);

-    do{
-        if (input->dataType == DEFAULT_DTYPE) {
-            DTYPE * iData = NULL;
-            DTYPE * oData = NULL;
-            if (iter == 0) {
-                iData = (DTYPE*)input->data;
-                oData = buf1;
-            }
-            else if (iter % 2 == 1) {
-                iData = buf1;
-                oData = buf2;
-            }
-            else {
-                iData = buf2;
-                oData = buf1;
-            }
-
-            /* unroll the reduction procedure. The code is messy but it is faster. */
-            if (strideNum < 32) {
-                GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                KernelReduceMax << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else if (strideNum < 128) {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<64> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else if (strideNum < 256) {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<128> << <blocks, threads >> >(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else if (strideNum < 512) {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<256> << <blocks, threads >> >(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<512> << <blocks, threads >> >(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
+    if (stride == 1 && blockNum >= 10) {
+        dim3 grids;
+        dim3 blocks;
+        continuousStorageThreadAllocation(grids, blocks, (long long)blockNum, strideNum);
+        if (blocks.y > 128) {
+            KernelReduceMaxOp <<<grids, blocks >>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, grids.y, blockSize, blockNum);
        }
-        else if (input->dataType == X_FLOAT16) {
-            __half * buf1ft16 = (__half *)buf1;
-            __half * buf2ft16 = (__half *)buf2;
-            __half * iData = NULL;
-            __half * oData = NULL;
-            if (iter == 0) {
-                iData = (__half*)input->data;
-                oData = buf1ft16;
-            }
-            else if (iter % 2 == 1) {
-                iData = buf1ft16;
-                oData = buf2ft16;
+        else {
+            KernelReduceMaxOpLessBlocks <<<blockNum / 4, 128 >>> ((DTYPE *)input->data, (DTYPE*)output->data, strideNum, blockNum);
+        }
+    }
+    else {
+        do {
+            if (input->dataType == DEFAULT_DTYPE) {
+                DTYPE * iData = NULL;
+                DTYPE * oData = NULL;
+                if (iter == 0) {
+                    iData = (DTYPE*)input->data;
+                    oData = buf1;
+                }
+                else if (iter % 2 == 1) {
+                    iData = buf1;
+                    oData = buf2;
+                }
+                else {
+                    iData = buf2;
+                    oData = buf1;
+                }
+
+                /* unroll the reduction procedure. The code is messy but it is faster. */
+                if (strideNum < 32) {
+                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    KernelReduceMax <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else if (strideNum < 128) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceMaxFast<64> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else if (strideNum < 256) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceMaxFast<128> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else if (strideNum < 512) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceMaxFast<256> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceMaxFast<512> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
            }
-            else {
-                iData = buf2ft16;
-                oData = buf1ft16;
+            else if (input->dataType == X_FLOAT16) {
+                __half * buf1ft16 = (__half *)buf1;
+                __half * buf2ft16 = (__half *)buf2;
+                __half * iData = NULL;
+                __half * oData = NULL;
+                if (iter == 0) {
+                    iData = (__half*)input->data;
+                    oData = buf1ft16;
+                }
+                else if (iter % 2 == 1) {
+                    iData = buf1ft16;
+                    oData = buf2ft16;
+                }
+                else {
+                    iData = buf2ft16;
+                    oData = buf1ft16;
+                }
+
+                /* unroll the reduction procedure. The code is messy but it is faster. */
+                if (strideNum < 32) {
+                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    KernelReduceMax << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else if (strideNum < 128) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceMaxFast<64> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else if (strideNum < 256) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceMaxFast<128> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else if (strideNum < 512) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceMaxFast<256> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
+                else {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceMaxFast<512> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
+                }
            }

-            /* unroll the reduction procedure. The code is messy but it is faster. */
-            if (strideNum < 32) {
-                GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                KernelReduceMax << <blocks, threads >> >(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else if (strideNum < 128) {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<64> << <blocks, threads >> >(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else if (strideNum < 256) {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<128> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else if (strideNum < 512) {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<256> << <blocks, threads >> >(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-            else {
-                GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceMaxFast<512> << <blocks, threads >> >(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
-            }
-        }
-        
-        strideNum = cudaGridSize[0];
-        blockSize = cudaGridSize[0];
+            strideNum = cudaGridSize[0];
+            blockSize = cudaGridSize[0];

-        iter++;
+            iter++;

-    }while(strideNum > 1);
+        } while (strideNum > 1);
+    }

    BacktoCudaDev(input->devID, devIDBackup);


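Note on the `ReduceSum.cu` diff that follows: `KernelReduceSumFast` and `KernelReduceSumOp` drop the shared-memory tree and reduce in two stages. Each warp first sums its 32 values with shuffle-down steps (offsets 16, 8, 4, 2, 1), lane 0 of each warp stores a partial sum, and the first warp then reduces the zero-padded partial sums. Below is a minimal host-side C++ sketch of that pattern, not the kernel itself. The names `kWarpSize`, `warpReduceSum` and `blockReduceSum` are hypothetical, and the model assumes a block whose thread count is a multiple of 32.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model; lanes[i] stands for the register value of lane i.
constexpr std::size_t kWarpSize = 32;

// Mirrors "value += shuffle_down(value, offset)" for offset = 16, 8, 4, 2, 1.
// Only lane 0 is guaranteed to hold the complete sum afterwards.
float warpReduceSum(std::vector<float> lanes)
{
    assert(lanes.size() == kWarpSize);
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        std::vector<float> shuffled(lanes);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            // a lane whose source lane is out of range keeps its own value
            std::size_t src = lane + offset;
            shuffled[lane] = (src < kWarpSize) ? lanes[src] : lanes[lane];
        }
        for (std::size_t lane = 0; lane < kWarpSize; ++lane)
            lanes[lane] += shuffled[lane];
    }
    return lanes[0];
}

// Stage 1: every warp reduces its own values and lane 0 stores the result.
// Stage 2: the first warp reduces the partial sums, padded with zeros.
float blockReduceSum(const std::vector<float>& threadValues)
{
    assert(!threadValues.empty() && threadValues.size() % kWarpSize == 0);
    const std::size_t warpCount = threadValues.size() / kWarpSize;
    assert(warpCount <= kWarpSize);

    std::vector<float> partial(kWarpSize, 0.0f); // stands in for shared memory
    for (std::size_t w = 0; w < warpCount; ++w) {
        auto first = threadValues.begin() + static_cast<std::ptrdiff_t>(w * kWarpSize);
        auto last = first + static_cast<std::ptrdiff_t>(kWarpSize);
        partial[w] = warpReduceSum(std::vector<float>(first, last));
    }
    return warpReduceSum(partial);
}
```

For a block of 128 threads holding the values 1 to 128, `blockReduceSum` returns their sum, 8256. The zero padding in stage 2 corresponds to the `else value = 0;` branch in the kernels.
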
--- a/source/tensor/core/reduce/ReduceSum.cu
+++ b/source/tensor/core/reduce/ReduceSum.cu
@@ -27,6 +27,57 @@ namespace nts{ // namespace nts(NiuTrans.Tensor)

 #ifdef USE_CUDA

+/*
+warp-level sum of float values using inline PTX shuffle-down; the complete sum ends up in lane 0
+*/
+__device__ __forceinline__  
+float shflDownReduceSum(float input)
+{
+    float output;
+    asm volatile(
+        "{"
+        ".reg .f32 r0;"
+        "shfl.down.b32  r0, %1, 0x10, 0x1f;"
+        "add.f32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x8, 0xf;"
+        "add.f32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x4, 0x7;"
+        "add.f32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x2, 0x3;"
+        "add.f32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x1, 0x1;"
+        "add.f32        %0, r0, %1;"
+        "}"
+        : "=f"(output) : "f"(input));
+    return output;
+}
+
+/*
+warp-level sum of int values using inline PTX shuffle-down; the complete sum ends up in lane 0
+*/
+__device__ __forceinline__  
+int shflDownReduceSum(int input)
+{
+    int output;
+    asm volatile(
+        "{"
+        ".reg .s32 r0;"
+        "shfl.down.b32  r0, %1, 0x10, 0x1f;"
+        "add.s32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x8, 0xf;"
+        "add.s32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x4, 0x7;"
+        "add.s32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x2, 0x3;"
+        "add.s32        %1, r0, %1;"
+        "shfl.down.b32  r0, %1, 0x1, 0x1;"
+        "add.s32        %0, r0, %1;"
+        "}"
+        : "=r"(output) : "r"(input));
+    return output;
+}
+
+
 /* 
 reduce a tensor to another that keeps the sum along a dimension  - slow version
 Given a block of data, we go over each dimension i in the stride and we have
@@ -96,7 +147,6 @@ void KernelReduceSum(DTYPE * input, DTYPE * output,

        __syncthreads();
    }
-
    /* write result for this block to the output array */
    if (threadIdx.y == 0 && blockIdx.y < reducedStrideNum) 
        output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = iData[threadIdx.x * blockDim.y];
@@ -276,25 +326,19 @@ void KernelReduceSumFast(DTYPE * input, DTYPE * output,
            value2 = exp(value2);
    }

-    /* load data into the shared mem */
-    data[tid] = value + value2;
-
+    value = value + value2;
    __syncthreads();
-
-    /* unroll the warp */
-    if(goodSize >= 512) {if(tid < 256) {data[tid] += data[tid + 256];} __syncthreads();}
-    if(goodSize >= 256) {if(tid < 128) {data[tid] += data[tid + 128];} __syncthreads();}
-    if(goodSize >= 128) {if(tid <  64) {data[tid] += data[tid +  64];} __syncthreads();}
-    if(goodSize >= 64)  {if(tid <  32) {data[tid] += data[tid +  32];} __syncthreads();}
-    if(goodSize >= 32)  {if(tid <  16) {data[tid] += data[tid +  16];} __syncthreads();}
-    if(goodSize >= 16)  {if(tid <   8) {data[tid] += data[tid +   8];} __syncthreads();}
-    if(goodSize >=  8)  {if(tid <   4) {data[tid] += data[tid +   4];} __syncthreads();}
-    if(goodSize >=  4)  {if(tid <   2) {data[tid] += data[tid +   2];} __syncthreads();}
-    if(goodSize >=  2)  {if(tid <   1) {data[tid] += data[tid +   1];} __syncthreads();}
-
-    /* write result for this block to the output array */
-    if(threadIdx.y == 0 && blockIdx.y < reducedStrideNum) 
-        output[(k * reducedStrideNum + blockIdx.y) * stride  + iOffset] = data[0];
+    value = shflDownReduceSum(value);
+    if ((tid & 0x1f) == 0) { data[tid / 32] = value; }
+    __syncthreads();
+    if (tid < 32){
+        if (tid < blockDim.y / 32)
+            value = data[tid];
+        else
+            value = 0;
+        value = shflDownReduceSum(value);
+        if (tid == 0 && blockIdx.y < reducedStrideNum)
+            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = value;
+    }
 }

 /* 
@@ -430,6 +474,174 @@ void KernelReduceSumFast(__half * input, __half * output,
 #endif
 }

+/*
+if the data storage is discontinuous (stride != 1), reduce in this way
+*/
+__global__ 
+void KernelReduceSumDiscontinuousStorage(DTYPE * input, DTYPE * output, int stride, 
+                                         int strideNum, DTYPE * shift, DTYPE power, bool isExp)
+{
+    //int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    //int endIndex = (idx+1) * strideNum;
+    int idx = blockDim.x * blockIdx.x + threadIdx.x;
+    int blockIndex = idx / stride;
+    int offsetInBlock = idx% stride;
+    DTYPE ans = 0;
+#pragma unroll
+    for (int i = stride * strideNum * blockIndex + offsetInBlock;
+        i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum;
+        i += stride){
+        ans += input[i];
+    }
+    output[idx] = ans;
+}
+
+__global__
+void KernelReduceSumOp(DTYPE * input, DTYPE * output,
+    int stride, int strideNum, int reducedStrideNum,
+    int blockSize, int blockNum,
+    DTYPE * shift, DTYPE power, bool isExp)
+{
+    __shared__ DTYPE iData[MAX_CUDA_THREAD_NUM_PER_BLOCK / 32];
+    __shared__ DTYPE bias[MAX_CUDA_THREAD_NUM_PER_BLOCK];
+
+    unsigned int tid = threadIdx.y;
+    unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
+    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= stride * blockNum)
+        return;
+
+    if (threadIdx.y == 0)
+        bias[threadIdx.x] = shift != NULL ? shift[i] : 0;
+
+    __syncthreads();
+
+    /* first level reduction */
+    int k = i / stride;
+    int iOffset = i % stride;
+
+    DTYPE threadSum = 0;
+
+    DTYPE * data = iData + threadIdx.x * blockDim.y;
+    DTYPE * inputData = input + k * blockSize;
+    for (int it = j; it < strideNum; it += blockDim.y){
+        DTYPE value = inputData[it * stride + iOffset] - bias[threadIdx.x];
+        if (power != (DTYPE)1.0) {
+            if (power == (DTYPE)2.0) {
+                value = value * value;
+            }
+            else if (power == (DTYPE)0.5) {
+                value = sqrt(value);
+            }
+            else {
+                value = pow(value, power);
+            }
+        }
+        if (isExp) value = exp(value);
+        threadSum += value;
+    }
+    __syncthreads();
+    threadSum = shflDownReduceSum(threadSum);
+    if ((tid & 0x1f) == 0) { data[tid / 32] = threadSum; }
+    __syncthreads();
+    if (tid < 32){
+        if (tid < blockDim.y / 32)
+            threadSum = data[tid];
+        else threadSum = 0;
+        threadSum = shflDownReduceSum(threadSum);
+        if (tid == 0 && blockIdx.y < reducedStrideNum)
+            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = threadSum;
+    }
+
+}
+
+__global__
+void KernelReduceSumOpLessBlocks(DTYPE * input, DTYPE * output,
+    int strideNum, int blockNum,
+    DTYPE * shift, DTYPE power, bool isExp)
+{
+    __shared__ DTYPE bias[MAX_CUDA_THREAD_NUM_PER_BLOCK];
+    int idx = threadIdx.x % 32;
+    int idy = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
+
+    if (idx == 0)
+        bias[threadIdx.x / 32] = shift != NULL ? shift[idy] : 0;
+    int startIndex = idy * strideNum;
+    DTYPE threadSum = 0;
+    for (int i = idx; i < strideNum; i += 32) {
+        DTYPE value = input[startIndex + i] - bias[threadIdx.x / 32];
+        if (power != (DTYPE)1.0) {
+            if (power == (DTYPE)2.0) {
+                value = value * value;
+            }
+            else if (power == (DTYPE)0.5) {
+                value = sqrt(value);
+            }
+            else {
+                value = pow(value, power);
+            }
+        }
+        if (isExp) value = exp(value);
+        threadSum += value;
+    }
+    threadSum = shflDownReduceSum(threadSum);
+    if (idx == 0)
+        output[idy] = threadSum;
+}
+
+/*
+choose the number of warps per block from the number of vectors (the constant 20 presumably reflects the GPU's SM count)
+*/
+inline void continuousStorageThreadAllocation(dim3& grid, dim3& block, long long vectorNum, int vectorSize)
+{
+    int warpNum = 4;
+    if (vectorNum < 20 * 8) {
+        warpNum = 8;
+        if (vectorNum < 20 * 4) {
+            warpNum = 16;
+            if (vectorNum < 20 * 2)
+                warpNum = 32;
+        }
+    }
+    int minWarpNum = vectorSize / 32;
+    if (vectorSize % 32 != 0) minWarpNum++;
+    warpNum = min(warpNum, minWarpNum);
+
+    grid.x = vectorNum;
+    grid.y = 1;
+    grid.z = 1;
+    block.x = 1;
+    block.y = warpNum * 32;
+    block.z = 1;
+}
+
+/* 
+in this case block.x * grid.x threads cover all output elements, one thread per element, so that neighbouring threads read neighbouring addresses
+*/
+inline void discontinuousStorageNoShareMemThreadAllocation(dim3& grid, dim3& block, int stride, int blockNum)
+{
+    block.x = 512;
+    block.y = 1;
+    if ((stride * blockNum) % 512 == 0)
+        grid.x = (stride * blockNum) / 512;
+    else
+        grid.x = (stride * blockNum) / 512 + 1;
+    grid.y = 1;
+}
+
+/*
+fold threads.x into blocks.x and make threads.y at least one warp (32) so that the warp-shuffle reduction can be used
+*/
+inline void adjustThreadForUseWarpOptimization(dim3& blocks, dim3& threads)
+{
+    if (threads.x > 1){
+        blocks.x *= threads.x;
+        threads.x = 1;
+    }
+    if (threads.y<32)
+        threads.y = 32;
+}
+
 /* 
 sum the items along a dimension of the tensor (cuda version). 
 For a 1-dimensional data array a,
@@ -495,137 +707,158 @@ void _CudaReduceSum(const XTensor * input, XTensor * output, int dim, const XTen

    int devIDBackup;
    ProtectCudaDev(input->devID, devIDBackup);
-
-    do{
-        if(input->dataType == DEFAULT_DTYPE){
-            DTYPE * iData = NULL;
-            DTYPE * oData = NULL;
-            if (iter == 0) {
-                iData = (DTYPE*)input->data;
-                oData = buf1;
-            }
-            else if (iter % 2 == 1) {
-                iData = buf1;
-                oData = buf2;
-            }
-            else {
-                iData = buf2;
-                oData = buf1;
-            }
-            /* unroll the reduction procedure. The code is messy but it is faster. */
-            if(strideNum < 32){
-                GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                KernelReduceSum <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
-            }
-            else if(strideNum < 128){
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<64> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
-            }
-            else if(strideNum < 256){
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<128> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
-            }
-            else if(strideNum < 512){
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<256> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
-            }
-            else{
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (DTYPE*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<512> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
-            }
-        }
-        else if(input->dataType == X_FLOAT16){
-            __half * buf1ft16 = (__half *)buf1;
-            __half * buf2ft16 = (__half *)buf2;
-            __half * spft16 = (__half *)sp;
-            unsigned short power2 = FloatToFloat16(power);
-            __half * powerft16p = (__half*)&power2;
-            __half * iData = NULL;
-            __half * oData = NULL;
-            if (iter == 0) {
-                iData = (__half*)input->data;
-                oData = buf1ft16;
-            }
-            else if (iter % 2 == 1) {
-                iData = buf1ft16;
-                oData = buf2ft16;
-            }
-            else {
-                iData = buf2ft16;
-                oData = buf1ft16;
-            }
-
-            /* unroll the reduction procedure. The code is messy but it is faster. */
-            if(strideNum < 32){
-                GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                KernelReduceSum << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
-            }
-            else if(strideNum < 128){
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<64> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
-            }
-            else if(strideNum < 256){
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<128> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
-            }
-            else if(strideNum < 512){
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<256> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
+    if (stride == 1 && blockNum >= 10) {
+        dim3 grids;
+        dim3 blocks;
+        continuousStorageThreadAllocation(grids, blocks, (long long)blockNum, strideNum);
+        if (blocks.y > 128)
+            KernelReduceSumOp <<<grids, blocks >>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, grids.y, blockSize, blockNum, sp, power, isExp);
+        else
+            KernelReduceSumOpLessBlocks <<<blockNum / 4, 128 >>> ((DTYPE *)input->data, (DTYPE*)output->data, strideNum, blockNum, sp, power, isExp);
+    }
+    else if (stride != 1 && stride * blockNum > 4096){
+        //GDevs->GetGridAndBlockSize2D(devID, stride * blockNum, strideNum,MAX_INT, cudaGridSize, cudaBlockSize);
+        //unsigned int* goutput = (unsigned int *)input->data;
+        //convert2uintV2 << <dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >> > ((float*)input->data, goutput, stride, strideNum, blockNum, strideNum*blockNum*stride);
+        dim3 grid, block;
+        discontinuousStorageNoShareMemThreadAllocation(grid, block, stride, blockNum);
+        KernelReduceSumDiscontinuousStorage <<<grid, block >>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, sp, power, isExp);
+    }
+    else {
+        do {
+            if (input->dataType == DEFAULT_DTYPE) {
+                DTYPE * iData = NULL;
+                DTYPE * oData = NULL;
+                if (iter == 0) {
+                    iData = (DTYPE*)input->data;
+                    oData = buf1;
+                }
+                else if (iter % 2 == 1) {
+                    iData = buf1;
+                    oData = buf2;
+                }
+                else {
+                    iData = buf2;
+                    oData = buf1;
+                }
+                /* unroll the reduction procedure. The code is messy but it is faster. */
+                if (strideNum <= 32) {
+                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    KernelReduceSum <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
+                }
+                else if (strideNum < 128) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceSumFast<64> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
+                }
+                else if (strideNum < 256) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceSumFast<128> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
+                }
+                else if (strideNum < 512) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceSumFast<256> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
+                }
+                else {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (DTYPE*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
+                    adjustThreadForUseWarpOptimization(blocks, threads);
+                    KernelReduceSumFast<512> <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
+                }
            }
-            else{
-                GDevs.GetCudaThread2D(devID, MAX(strideNum/2+1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
-                dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
-                if (cudaGridSize[0] == 1)
-                    oData = (__half*)output->data;
-                CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
-                KernelReduceSumFast<512> <<<blocks, threads >>>(iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
+            else if (input->dataType == X_FLOAT16) {
+                __half * buf1ft16 = (__half *)buf1;
+                __half * buf2ft16 = (__half *)buf2;
+                __half * spft16 = (__half *)sp;
+                unsigned short power2 = FloatToFloat16(power);
+                __half * powerft16p = (__half*)&power2;
+                __half * iData = NULL;
+                __half * oData = NULL;
+                if (iter == 0) {
+                    iData = (__half*)input->data;
+                    oData = buf1ft16;
+                }
+                else if (iter % 2 == 1) {
+                    iData = buf1ft16;
+                    oData = buf2ft16;
+                }
+                else {
+                    iData = buf2ft16;
+                    oData = buf1ft16;
+                }
+
+                /* unroll the reduction procedure. The code is messy but it is faster. */
+                if (strideNum < 32) {
+                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    KernelReduceSum <<<blocks, threads >>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
+                }
+                else if (strideNum < 128) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceSumFast<64> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
+                }
+                else if (strideNum < 256) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceSumFast<128> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
+                }
+                else if (strideNum < 512) {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceSumFast<256> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
+                }
+                else {
+                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
+                    if (cudaGridSize[0] == 1)
+                        oData = (__half*)output->data;
+                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
+                    KernelReduceSumFast<512> << <blocks, threads >> > (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
+                }
            }
-        }
-
-        strideNum = cudaGridSize[0];
-        blockSize = cudaGridSize[0];
-        sp = NULL;
-        power = (DTYPE)1.0;
-        isExp = false;

-        iter++;
+            strideNum = cudaGridSize[0];
+            blockSize = cudaGridSize[0];
+            sp = NULL;
+            power = (DTYPE)1.0;
+            isExp = false;

-    }while(strideNum > 1);
+            iter++;

+        } while (strideNum > 1);
+    }
    BacktoCudaDev(input->devID, devIDBackup);

    if (mem != NULL)

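Note on `KernelReduceSumDiscontinuousStorage` in the diff above: when `stride != 1`, the items being summed are not adjacent in memory. The tensor is viewed as `[blockNum][strideNum][stride]`, each thread owns one output element and walks the middle axis in steps of `stride`. Below is a minimal host-side C++ sketch of that index arithmetic, where each loop iteration plays the role of one GPU thread. The function name `reduceSumStrided` is hypothetical, and like the kernel in this commit the sketch ignores `shift`, `power` and `isExp`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the strided sum: the input is laid out as
// [blockNum][strideNum][stride] and the middle axis is summed away.
std::vector<float> reduceSumStrided(const std::vector<float>& input,
                                    std::size_t stride,
                                    std::size_t strideNum,
                                    std::size_t blockNum)
{
    assert(input.size() == stride * strideNum * blockNum);
    std::vector<float> output(stride * blockNum, 0.0f);

    // idx corresponds to blockDim.x * blockIdx.x + threadIdx.x in the kernel
    for (std::size_t idx = 0; idx < output.size(); ++idx) {
        std::size_t blockIndex = idx / stride;
        std::size_t offsetInBlock = idx % stride;
        std::size_t first = stride * strideNum * blockIndex + offsetInBlock;

        float sum = 0.0f;
        for (std::size_t n = 0; n < strideNum; ++n)
            sum += input[first + n * stride]; // step over one full stride
        output[idx] = sum;
    }
    return output;
}
```

The loop bound `output.size()` is what keeps every index in range here. A kernel launched with a rounded-up grid, as `discontinuousStorageNoShareMemThreadAllocation` does when `stride * blockNum` is not a multiple of 512, needs an equivalent bounds check on `idx`.
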
--- a/source/tensor/core/sort/TopK.cu
+++ b/source/tensor/core/sort/TopK.cu
@@ -25,6 +25,7 @@
 #include "TopK.h"
 #include "TopK.cuh"
 #include "Sort.cuh"
+#define WORKERSNUM 64

 namespace nts { // namespace nts(NiuTrans.Tensor)

@@ -363,6 +364,436 @@ void KernelTopK2(T * input, int stride, int strideNum, int blockNum, int k, T mi
 }

 /*
+get the top-k items
+>> input - the input data array
+>> stride - number of items we go over when we move to the next item along a given dimension
+>> strideNum - size of the given dimension
+>> blockNum - number of data blocks
+>> k - number of top items to keep
+>> minValue - min value of an item
+>> output - the output data array
+>> index - the output index array
+*/
+template<class T> __global__
+void KernelTopK3(T * input, int stride, int strideNum, int blockNum, int k, T minValue, T * output, int * index)
+{
+    __shared__ CudaHeapNode<T> heapData[(SHARED_MEMORY_SIZE - 1024 * sizeof(T)) / sizeof(CudaHeapNode<T>)];
+    __shared__ T eachHeapMaxValue[1024];
+    /* optimization for the size of k: the parameter must be more than half of k */
+    int parameter = 0;
+
+    /* worker index */
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+
+    /* index of the data array along the given dimension */
+    int j = blockDim.y * blockIdx.y + threadIdx.y;
+
+    if (i >= strideNum || i >= blockDim.x || j >= stride * blockNum)
+        return;
+
+    int blockIndex = j / stride;
+    int offsetInBlock = j % stride;
+    T * d = input + stride * strideNum * blockIndex + offsetInBlock;
+
+    CudaXHeap<MIN_HEAP, T> heap(k - parameter, heapData + k * (threadIdx.y * blockDim.x + threadIdx.x));
+    __syncthreads();
+
+    /* go over the data array and build the heap */
+    int indexOffset = blockDim.x;
+    int dataOffset = stride * blockDim.x;
+
+    if (i + (heap.size - 1) * indexOffset < strideNum) {
+        int p = i;
+        int q = i * stride;
+        for (int m = 0; m < heap.size; m++) {
+            heap.Push(p, d[q]);
+            p += indexOffset;
+            q += dataOffset;
+        }
+
+        for (; p < strideNum; p += indexOffset, q += dataOffset) {
+            T v = d[q];
+            if (v > heap.topValue) {
+                heap.ReplaceTop(p, v);
+            }
+        }
+    }
+    else {
+        for (int p = i, q = i * stride; p < strideNum; p += indexOffset, q += dataOffset) {
+            heap.Push(p, d[q]);
+        }
+    }
+    /* fill the heap if not enough items were processed */
+    while (heap.count < heap.size) {
+        heap.Push(-1, minValue);
+    }
+    __syncthreads();
+
+    /* merge the per-thread heaps in a different way */
+    T minData = minValue;
+    int heapLimit = heap.count / 2;
+    if (heapLimit % 2 == 0 && heapLimit != 0) heapLimit -= 1;
+    for (int counter = heap.count - 1; counter >= heapLimit; --counter) {
+        if (minData < heap.items[counter].value)
+            minData = heap.items[counter].value;
+    }
+    eachHeapMaxValue[threadIdx.y * blockDim.x + threadIdx.x] = minData;
+    // needs more optimization
+    if (i == 0) {
+        int threadLimit = (threadIdx.y + 1) * blockDim.x;
+        CudaXHeap<MIN_HEAP, T> chooseHeap(k, heapData + k * ((blockDim.x * blockDim.y) + threadIdx.y));
+        int counter = threadIdx.y * blockDim.x;
+        for (; counter < threadIdx.y * blockDim.x + k; ++counter) {
+            chooseHeap.Push(counter, eachHeapMaxValue[counter]);
+        }
+        for (; counter < threadLimit; ++counter) {
+            if (eachHeapMaxValue[counter]>chooseHeap.items[0].value) {
+                chooseHeap.ReplaceTop(counter, eachHeapMaxValue[counter]);
+            }
+        }
+        CudaXHeap<MIN_HEAP, T>  ansHeapData(k, k - parameter, heapData + k * chooseHeap.items[0].index);
+        int miss = parameter;
+        for (counter = 1; counter < k; ++counter) {
+            chooseHeap.items[0] = chooseHeap.items[chooseHeap.count - 1];
+            chooseHeap.count--;
+            chooseHeap.Down(0);
+            CudaHeapNode<T> * cmpHeapData = heapData + k * (chooseHeap.items[0].index);
+            int cmpHeapLimit = 0;
+            if (counter + heapLimit <= k - parameter){
+                cmpHeapLimit = heapLimit;
+            }
+            /* the largest items of a min-heap sit near the leaves, so start the search from the leaf nodes */
+            for (int iterator = k - 1 - parameter; iterator >= cmpHeapLimit; --iterator){
+                if (miss > 0){
+                    ansHeapData.Push(cmpHeapData[iterator].index, cmpHeapData[iterator].value);
+                    miss--;
+                }
+                else if (ansHeapData.items[0].value < cmpHeapData[iterator].value){
+                    ansHeapData.ReplaceTop(cmpHeapData[iterator].index, cmpHeapData[iterator].value);
+                }
+            }
+        }
+        int offset = stride * k * blockIndex + offsetInBlock;
+        T * dOutput = output + offset;
+        int * indexOutput = index + offset;
+        for (int q = 0; q < k; ++q){
+            dOutput[stride * q] = ansHeapData.items[q].value;
+            indexOutput[stride * q] = ansHeapData.items[q].index;
+        }
+    }
+}
+
+
+__device__ __forceinline__ 
+unsigned getLaneMaskLe() 
+{
+    unsigned mask;
+    asm("mov.u32 %0, %%lanemask_le;" : "=r"(mask));
+    return mask;
+}
+
+__device__ __forceinline__ 
+int getLaneId() 
+{
+    int laneId;
+    asm("mov.s32 %0, %%laneid;" : "=r"(laneId));
+    return laneId;
+}
+
+__device__ 
+unsigned convert(float v)
+{
+    unsigned x = __float_as_int(v);
+    unsigned mask = (x & 0x80000000) ? 0xffffffff : 0x80000000;
+    return (x ^ mask);
+}
+
+__device__ 
+float convert(unsigned int v)
+{
+    float x = __uint_as_float(v);
+    return x;
+}
+
+__device__ 
+float deconvert(unsigned int v) 
+{
+    unsigned int mask = (v & 0x80000000) ? 0x80000000 : 0xffffffff;
+    return __int_as_float(v ^ mask);
+}
+
+__global__ 
+void convert2uintV2(float* input, unsigned int *output, int stride, int strideNum, int blockNum, int size)
+{
+    int idx = blockDim.x * blockIdx.x + threadIdx.x;
+    int idy = blockDim.y * blockIdx.y + threadIdx.y;
+    int blockIndex = idy / stride;
+    int offsetInBlock = idy% stride;
+#pragma unroll
+    for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
+        i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum && i < size;
+        i += stride * blockDim.x){
+        output[i] = convert(input[i]);
+    }
+}
+
+__global__ 
+void deconvert2floatV2(unsigned int * input, float *output, int stride, int strideNum, int blockNum, int size)
+{
+    int idx = blockDim.x * blockIdx.x + threadIdx.x;
+    int idy = blockDim.y * blockIdx.y + threadIdx.y;
+    //int strideNum = (int)strideNumSize;
+    //if (flag) strideNum = strideNumSize[idy];
+    int blockIndex = idy / stride;
+    int offsetInBlock = idy% stride;
+#pragma unroll
+    for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
+        i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum && i < size;
+        i += stride * blockDim.x){
+        output[i] = deconvert(input[i]);
+    }
+}
+
+__device__ 
+void radixCount(unsigned int *data, int limit, int *posCount, unsigned int mask, int maskDesire, unsigned int desire, int stride, int strideNum, int blockNum)
+{
+
+    /* the idx-th thread in one vector */
+    int idx = blockDim.x * blockIdx.x + threadIdx.x;
+    /* the idy-th vector in one tensor */
+    int idy = blockDim.y * blockIdx.y + threadIdx.y;
+    int blockIndex = idy / stride;
+    int offsetInBlock = idy % stride;
+    for (int j = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
+        j < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum && j < limit;
+        j += stride * WORKERSNUM) {
+        if ((data[j] & maskDesire) == desire) {
+            if (data[j] & mask) {
+                posCount[(idy % (512 / WORKERSNUM))*blockDim.x + idx]++;
+            }
+        }
+    }
+}
+
+/* A fast way to check thread status within a warp;
+   note that the thread count must be a multiple of 32 */
+__device__ 
+void gpuCheckWarp(int *smem, bool in, int *carry, int *index)
+{
+    int vote = __ballot_sync(0xffffffff, in);
+    *index = __popc(getLaneMaskLe() & vote);
+    *carry = __popc(vote);
+    int idx = blockDim.x * blockIdx.x + threadIdx.x;
+    int warp = idx / 32; 
+    int warpNum = blockDim.x / 32;
+
+    if (getLaneId() == 0) {
+        /* save each warp carry */
+        smem[warp + warpNum * threadIdx.y] = *carry; 
+    }
+    __syncthreads();
+    /* use one thread to accumulate the per-warp carries into a running total */
+    if (idx == 0) {
+        for (int i = 1 + warpNum * threadIdx.y; i < warpNum * (threadIdx.y + 1); ++i) {
+            smem[i] += smem[i - 1];
+        }
+    }
+    __syncthreads();
+    if (warp % warpNum) {
+        *index += smem[warpNum * threadIdx.y + warp - 1];
+    }
+    *carry = smem[warpNum * threadIdx.y + warpNum - 1];
+}
+
+/*
+collect the values greater than pattern and return them in ans
+*/
+__device__ 
+void collectNumber(unsigned int *data, int stride, int strideNum, int limit, 
+                    unsigned int pattern, float *ans, int *ansIndex, int k)
+{
+    int idy = blockDim.y * blockIdx.y + threadIdx.y;
+    int idx = blockDim.x * blockIdx.x + threadIdx.x;
+    int blockIndex = idy / stride;
+    int offsetInBlock = idy % stride;
+
+    /* holds each warp's temporary carry */
+    __shared__ int smem[32]; 
+    int carry;
+    int index;
+    int vectorLimit = stride * strideNum * blockIndex + offsetInBlock + stride * strideNum;
+    int alibnStrideNum = strideNum;
+    if (alibnStrideNum % blockDim.x) alibnStrideNum = alibnStrideNum + blockDim.x - (alibnStrideNum % blockDim.x);
+    int vectorAlibnLimit = stride * strideNum * blockIndex + offsetInBlock + stride * alibnStrideNum;
+    int ansArrayIndex = stride * k * blockIndex + offsetInBlock;
+
+    int ansSize = 0;
+    __syncthreads();
+
+#pragma unroll
+    for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
+        i < vectorAlibnLimit; i += stride * WORKERSNUM){
+
+        bool hasTopk = false;
+        if (i < vectorLimit && data[i] > pattern){
+            hasTopk = true;
+        }
+        gpuCheckWarp(smem, hasTopk, &carry, &index);
+        if (carry > 0) {
+            if (hasTopk) {
+                ans[ansArrayIndex + (index - 1) * stride] = deconvert(data[i]);
+                ansIndex[ansArrayIndex + (index - 1) * stride] = i - stride * strideNum * blockIndex;
+            }
+            ansArrayIndex += carry * stride;
+            ansSize += carry;
+        }
+        __syncthreads();
+    }
+    if (ansSize < k){
+        int ramindNum = k - ansSize;
+#pragma unroll
+        for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock; i < vectorAlibnLimit; i += stride * WORKERSNUM) {
+            bool hasTopk = false;
+            if (i < vectorLimit && data[i] == pattern) {
+                hasTopk = true;
+            }
+
+            gpuCheckWarp(smem, hasTopk, &carry, &index);
+
+            if (carry>0) {
+                int checkTmpIndex = ansArrayIndex + (index - 1) * stride;
+                /* guard against writing past the end of the output: for instance,
+                   if one more index is needed but two candidates match, drop the one with the larger index */
+                if (hasTopk && checkTmpIndex < stride * k * blockIndex + offsetInBlock + stride * k) {
+                    ans[checkTmpIndex] = deconvert(pattern);
+                    ansIndex[checkTmpIndex] = i - stride * strideNum * blockIndex;
+                }
+                ramindNum -= carry;
+                ansArrayIndex += carry * stride;
+                if (ramindNum <= 0) break;
+            }
+            __syncthreads();
+        }
+    }
+}
+
+/*
+This is the old approach: a single thread collects the numbers, which is very slow, so it is no longer used
+*/
+__device__ 
+void collectNumberOld(unsigned int *data, int n, int k, unsigned int pattern, unsigned int *ans, int *indexNum, int stride, int strideNum)
+{
+    int idy = blockDim.y * blockIdx.y + threadIdx.y;
+    int blockIndex = idy / stride;
+    int offsetInBlock = idy % stride;
+    int cot = 0;
+    for (int i = stride * strideNum * blockIndex + offsetInBlock, j = 0; j < strideNum; j++, i += stride) {
+        if (data[i] > pattern) {
+            ans[cot] = data[i];
+            indexNum[cot++] = j;
+        }
+    }
+    /* if cot < k, the remaining values must all equal pattern */
+    if (cot < k) {
+        for (int i = cot; i < k; ++i) {
+            ans[i] = pattern;
+        }
+        /* collect the remaining indices; their data values must equal pattern */
+        for (int i = stride * strideNum * blockIndex + offsetInBlock, j = 0; j < strideNum; j++, i += stride) {
+            if (data[i] == pattern) {
+                indexNum[cot++] = j;
+                if (cot == k) break;
+            }
+        }
+    }
+}
+
+/*
+When k is very large, the heaps do not fit in shared memory, so we use the radix select algorithm
+*/
+template<class T> __global__
+void KernelTopKRadixSelect(unsigned int * input, int stride, int strideNum, 
+                           int blockNum, int k, T minValue, T * output, int* index, int limit)
+{
+    /* the idx th thread in one vector */
+    int idx = blockDim.x * blockIdx.x + threadIdx.x;
+    /* the idy th vector in one tensor */
+    int idy = blockDim.y * blockIdx.y + threadIdx.y;
+
+    //use optimization or not
+    //int strideNum =(int)strideNumSize;
+    //if (isOptimization) strideNum = strideNumSize[idy];
+
+    if (idy >= stride *blockNum) return;
+
+    int maskDesire = 0;
+    unsigned int mask = 0x80000000;
+    unsigned int desire = 0;
+    __shared__ int posCount[32 * 32];
+    int tmpK = k;
+    int flag = 1;
+#pragma unroll
+    for (int i = 0; i < 32; i++){
+        /* the shared memory must be cleared in every iteration */
+
+        posCount[idx + blockDim.x*(idy % (512 / WORKERSNUM))] = 0;
+        if (flag)
+            radixCount(input, stride*strideNum*blockNum, posCount, mask, maskDesire, desire, stride, strideNum, blockNum);
+        __syncthreads();
+        int sumCount = 0;
+#pragma unroll
+        for (int j = 0; j < WORKERSNUM; j++) {
+            sumCount += posCount[(idy % (512 / WORKERSNUM))*blockDim.x + j];
+        }
+        __syncthreads();
+
+        if (tmpK < sumCount) {
+            /* this position should be 1 */
+            desire = mask^desire;
+        }
+        else {
+            /* reduce k by the count just skipped; this position should be 0 */
+            tmpK = tmpK - sumCount;
+            if (tmpK == 0){
+                desire = (~(maskDesire >> 1)) | desire;
+                /* a break here could deadlock at __syncthreads(), so we clear a flag instead */
+                //break;
+                flag = 0;
+            }
+        }
+        maskDesire = mask^maskDesire;
+        mask = mask >> 1;
+    }
+    __syncthreads();
+
+   /* old way to collect number */
+   /*
+   if (idx == 0)
+    {
+    	unsigned int* uintOutput = new unsigned int;
+    	int* tmpIndex = new int;
+    	//*******************something wrong***************************
+    	cudaMalloc((void **)&uintOutput, sizeof(unsigned int)* k);
+    	cudaMalloc((void **)&tmpIndex, sizeof(unsigned int)*k);
+    	//*************************************************************
+    	collectNumberOld(input, limit, k, desire, uintOutput, tmpIndex, stride, strideNum);
+    	int blockIndex = idy / stride;
+    	int offsetInBlock = idy% stride;
+
+    	for (int i = stride * k * blockIndex + offsetInBlock, j = 0; j < k; j++, i += stride)
+    	{
+    		//for(int i = )
+    		output[i] = deconvert(uintOutput[j]);
+    		index[i] = tmpIndex[j];
+    	}
+    }
+    __syncthreads();
+    */
+
+    collectNumber(input, stride, strideNum, limit, desire, output, index, k);
+}
+
+/*
 get the top-k items along a given dimension
 >> a - input tensor
 >> b - output tensor (top-k result)
@@ -388,7 +819,13 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
    for (int i = dimRDI + 1; i < a->order; i++)
        blockNum *= a->dimSizeRDI[i];

-    int workerNum = blockNum < 16 ? 64 : 32; // should be tuned for better performance
+    int workerNum = blockNum < 16 ? 64 : 32; 
+    /* adjust the thread count according to k so that the heaps fit into shared memory */
+    if (k < 6) workerNum = 512;
+    else if (k < 11) workerNum = 256;
+    else if (k < 22) workerNum = 128;
+    else if (k < 44) workerNum = 64;
+    else workerNum = 32;

    int cudaGrids[3];
    int cudaBlocks[3];
@@ -397,29 +834,15 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
        workerNum, stride * blockNum, MAX_INT,
        cudaGrids, cudaBlocks);

-    for (int i = 0; i < 2; i++) {
-        if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) >= SHARED_MEMORY_SIZE) {
-            if (cudaBlocks[1] >= 2 && cudaBlocks[1] % 2 == 0) {
-                cudaBlocks[1] /= 2;
-                cudaGrids[1] *= 2;
-            }
-        }
-
-        if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) >= SHARED_MEMORY_SIZE) {
-            if (cudaBlocks[0] >= 2 && cudaBlocks[0] % 2 == 0) {
-                cudaBlocks[0] /= 2;
-                cudaGrids[0] *= 2;
-            }
-        }
-    }
-
    int devIDBackup = 0;
    ProtectCudaDev(a->devID, devIDBackup);

    /* we run the kernel if the heaps can fit into the shared memory */
+    cudaGrids[1] *= cudaBlocks[1];
+    cudaBlocks[1] = 1;
    if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) < SHARED_MEMORY_SIZE) {
        if (a->dataType == DEFAULT_DTYPE) {
-            KernelTopK2<DTYPE> << <dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >> >
+            KernelTopK3<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >>>
                                 ((DTYPE*)a->data, stride, strideNumA, blockNum, k, DTYPE_MIN,
                                 (DTYPE*)b->data, (int*)index->data);
        }
@@ -430,20 +853,34 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
    }
    /* we resort to sorting if the data cannot fit inside the shared memory */
    else {
-        int dimSize[MAX_TENSOR_DIM_NUM];
-        memcpy(dimSize, a->dimSize, sizeof(int) * a->order);
-        dimSize[0] = -dimSize[0];
-        XTensor * indexA = new XTensor(a->order, dimSize, X_INT, 1.0F, a->devID, a->mem);
-        indexA->data = a->mem != NULL ? a->mem->AllocBuf(a->devID, a->unitNum * sizeof(int)) : XMemAlloc(a->devID, a->unitNum * sizeof(int));
+        //int dimSize[MAX_TENSOR_DIM_NUM];
+        //memcpy(dimSize, a->dimSize, sizeof(int) * a->order);
+        //dimSize[0] = -dimSize[0];
+        //XTensor * indexA = new XTensor(a->order, dimSize, X_INT, 1.0F, a->devID, a->mem);
+        //indexA->data = a->mem != NULL ? a->mem->AllocBuf(a->devID, a->unitNum * sizeof(int)) : XMemAlloc(a->devID, a->unitNum * sizeof(int));

        /* make the index tensor */
-        indexA->SetAscendingOrder(dim);
+        //indexA->SetAscendingOrder(dim);
+
+        //_CudaSortBig(a, b, indexA, index, dim, k);

-        _CudaSortBig(a, b, indexA, index, dim, k);
+        //if (a->mem != NULL)
+        //    a->mem->ReleaseBuf(a->devID, a->unitNum * sizeof(int));
+        //delete indexA;
+        int workerNum = WORKERSNUM;

-        if (a->mem != NULL)
-            a->mem->ReleaseBuf(a->devID, a->unitNum * sizeof(int));
-        delete indexA;
+        GDevs.GetCudaThread2D(a->devID,
+            workerNum, stride * blockNum, MAX_INT,
+            cudaGrids, cudaBlocks);
+        if (a->dataType == DEFAULT_DTYPE) {
+            unsigned int* goutput = (unsigned int *)a->data;
+            /* both ways take almost the same time to convert the data */
+            convert2uintV2 <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >>> ((float*)a->data, goutput, stride, strideNumA, blockNum, strideNumA*blockNum*stride);
+            //convert2uintV2 << <dim3(1, stride * blockNum), dim3(512,1) >> >((float*)a->data, goutput, stride, strideNumA, blockNum, strideNumA*blockNum*stride);
+
+            KernelTopKRadixSelect<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >>> (goutput, stride, strideNumA, blockNum, k, DTYPE_MIN, (DTYPE *)b->data, (int *)index->data, stride * strideNumA * blockNum);
+            deconvert2floatV2 <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1]) >>> ((unsigned int *)a->data, (float *)goutput, stride, strideNumA, blockNum, strideNumA*blockNum*stride);
+        }
    }

    BacktoCudaDev(a->devID, devIDBackup);

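Note on the radix-select path in `TopK.cu` above: the following is a minimal host-side C++ sketch, not part of this commit, of the two ideas the new kernels appear to rely on. First, a float is mapped to an unsigned key whose integer order matches the float order (the same bit trick as `convert` and `deconvert`). Second, the k-th largest key is found by fixing one bit at a time from the most significant bit down. The names `toOrderedKey`, `fromOrderedKey`, and `kthLargest` are hypothetical, and the selection loop is the textbook variant, so its tie handling may differ in detail from `KernelTopKRadixSelect`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Map a float to an unsigned key so that a < b implies key(a) < key(b).
// Positive floats get their sign bit set; negative floats have all bits flipped.
uint32_t toOrderedKey(float v)
{
    uint32_t x;
    std::memcpy(&x, &v, sizeof(x));
    uint32_t mask = (x & 0x80000000u) ? 0xffffffffu : 0x80000000u;
    return x ^ mask;
}

// Inverse of toOrderedKey.
float fromOrderedKey(uint32_t key)
{
    uint32_t mask = (key & 0x80000000u) ? 0x80000000u : 0xffffffffu;
    uint32_t x = key ^ mask;
    float v;
    std::memcpy(&v, &x, sizeof(v));
    return v;
}

// Return the k-th largest value (k is 1-based) using MSB-first radix select.
float kthLargest(const std::vector<float>& data, int k)
{
    assert(k >= 1 && k <= static_cast<int>(data.size()));
    uint32_t prefix = 0;     // bits of the answer decided so far
    uint32_t fixedBits = 0;  // which bit positions have been decided
    for (int bit = 31; bit >= 0; --bit) {
        uint32_t mask = 1u << bit;
        int ones = 0;
        // Count candidates (keys matching the prefix) that have a 1 at this bit.
        for (float v : data) {
            uint32_t key = toOrderedKey(v);
            if ((key & fixedBits) == prefix && (key & mask)) ++ones;
        }
        if (k <= ones) {
            prefix |= mask;  // the answer is among the candidates with a 1 here
        } else {
            k -= ones;       // skip the larger candidates; the answer has a 0 here
        }
        fixedBits |= mask;
    }
    return fromOrderedKey(prefix);
}
```

Once the k-th largest key is known, a second pass can collect every value above it and then fill the remaining slots with values equal to it, which is the role `collectNumber` plays in the kernel.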
--- a/source/tensor/function/HardTanH.cpp
+++ b/source/tensor/function/HardTanH.cpp
@@ -116,8 +116,7 @@ void _HardTanHBackward(XTensor * gold, XTensor * y, XTensor * x,
    }
 #endif

-    if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
-    {
+    if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
        /* calculate dE/dy */
        if(lossName != NOLOSS)
            _LossBackward(dedy, gold, y, lossName);

--- a/source/tensor/function/Softmax.cu
+++ b/source/tensor/function/Softmax.cu
@@ -156,6 +156,50 @@ void KernelSoftmaxComputeTensor(__half * x, __half * max, __half * sum, __half *
 }

 /*
+use PTX code to broadcast float data
+*/
+__device__ __forceinline__ 
+float broadcast(float input)
+{
+    float output;
+    asm(
+        "{"
+        "shfl.sync.idx.b32 %0,%1,0x0,0x1f,0xffffffff;"
+        "}"
+        :"=f"(output) : "f"(input)
+    );
+    return output;
+}
+
+/*
+use warp broadcast to optimize softmax computing
+*/
+__global__
+void KernelSoftmaxComputeTensorUseBroadcast(DTYPE * input, DTYPE * max, DTYPE * sum, DTYPE * output, 
+                                            int stride, int strideNum, int blockNum)
+{
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    int j = blockDim.y * blockIdx.y + threadIdx.y;
+
+    int i2 = j % stride;
+    int blockSize = stride * strideNum;
+
+    if (j < stride * blockNum) {
+        DTYPE sumData, maxData;
+        if (i % 32 == 0) {
+            sumData = sum[j];
+            maxData = max[j];
+        }
+        sumData = broadcast(sumData);
+        maxData = broadcast(maxData);
+        if (i < strideNum){
+            int offset = int(j / stride) * blockSize + i * stride + i2;
+            output[offset] = exp(input[offset] - maxData) / sumData;
+        }
+    }
+}
+
+/*
 softmax y = e^x / \sum_{i} e^{x_i} (Cuda version)
 >> x - x vector
 >> y - result
@@ -183,20 +227,42 @@ void _CudaSoftmaxSumMax(const XTensor * x, XTensor * y, int leadDim, XTensor * s
    int cudaGridSize[3];
    int cudaBlockSize[3];

-    GDevs.GetCudaThread2D(x->devID, stride * blockNum, dimensionSize, MAX_INT, cudaGridSize, cudaBlockSize);
-
+    if (leadDim != 0 || dimensionSize <= 10){
+        /* allocate thread num for old function */
+        GDevs.GetCudaThread2D(x->devID, stride * blockNum, dimensionSize, MAX_INT, cudaGridSize, cudaBlockSize);
+    }
+    else {
+        /* allocate thread num for new function */
+        GDevs.GetCudaThread2D(x->devID, dimensionSize, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
+        if (cudaBlockSize[0] < 32) {
+            /* use at least a warp */
+            cudaBlockSize[0] = 32;
+
+            if (cudaBlockSize[1] > 32) {
+                cudaGridSize[1] = int(ceil(float(stride * blockNum) / 32));
+                cudaBlockSize[1] = 32;
+            }
+        }
+    }
    int devIDBackup;
    ProtectCudaDev(x->devID, devIDBackup);

    if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
-        KernelSoftmaxComputeTensor<<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>>
-                                   ((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data, 
-                                     stride, dimensionSize, stride * dimensionSize, blockNum, stride * blockNum);
+        if (leadDim != 0 || dimensionSize <= 10) {
+            KernelSoftmaxComputeTensor <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
+                                         ((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data,
+                                           stride, dimensionSize, stride * dimensionSize, blockNum, stride * blockNum);
+        }
+        else {
+            KernelSoftmaxComputeTensorUseBroadcast <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
+                                                     ((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data,
+                                                       stride, dimensionSize, blockNum);
+        }
    }
    else if(x->dataType == X_FLOAT16 && y->dataType == X_FLOAT16){
-        KernelSoftmaxComputeTensor<<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>>
-                                   ((__half*)x->data, (__half*)max->data, (__half*)sum->data, (__half*)y->data, 
-                                     stride, dimensionSize, blockNum);
+        KernelSoftmaxComputeTensor <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
+                                     ((__half*)x->data, (__half*)max->data, (__half*)sum->data, (__half*)y->data, 
+                                       stride, dimensionSize, blockNum);
    }
    else{
        ShowNTErrors("TODO!");

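Note on the softmax change in `Softmax.cu` above: both kernels compute `exp(x - max) / sum` from a precomputed maximum and sum, and the new one only changes how `max` and `sum` reach each thread. The sketch below is a host-side C++ reference for a single vector, not part of this commit; the name `softmaxReference` is hypothetical, and it assumes `sum` is the sum of `exp(x_i - max)`, which the diff does not show.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax over one vector: y_i = exp(x_i - max) / sum_j exp(x_j - max).
std::vector<float> softmaxReference(const std::vector<float>& x)
{
    assert(!x.empty());
    // Subtracting the maximum keeps exp() from overflowing.
    float maxValue = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    for (float v : x)
        sum += std::exp(v - maxValue);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = std::exp(x[i] - maxValue) / sum;
    return y;
}
```

A reference like this can be compared against the output of either kernel within a small tolerance, which is one way to check that the broadcast variant does not change results.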
--- a/source/tensor/math.zip
+++ b/source/tensor/math.zip
--- a/source/tensor/test/TClip.cpp
+++ b/source/tensor/test/TClip.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
+*/
+
+#include "../XTensor.h"
+#include "TClip.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: test Clip function.
+Set every entry to its clip value.
+*/
+bool TestClip1()
+{
+	/* a tensor of size (3, 2) */
+	int aOrder = 2;
+	int * aDimSize = new int[aOrder];
+	aDimSize[0] = 3;
+	aDimSize[1] = 2;
+
+	int aUnitNum = 1;
+	for (int i = 0; i < aOrder; i++)
+		aUnitNum *= aDimSize[i];
+
+	DTYPE aData[3][2] = { {1.0F, -2.0F},
+						  {0.0F, 4.0F},
+						  {5.0F, -6.0F} };
+	DTYPE answer[3][2] = { {1.0F, -1.0F},
+						   {0.0F, 1.0F},
+					   	   {1.0F, -1.0F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(aOrder, aDimSize);
+	XTensor * b = NewTensor(aOrder, aDimSize);
+	XTensor * aMe = NewTensor(aOrder, aDimSize);
+	XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, aUnitNum);
+	aMe->SetData(aData, aUnitNum);
+
+	/* call Clip function */
+	_Clip(a, b, -1.0, 1.0);
+	_ClipMe(aMe, -1.0, 1.0);
+	bUser = Clip(*a, -1.0, 1.0);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && 
+              aMe->CheckData(answer, aUnitNum, 1e-4F) && 
+              bUser.CheckData(answer, aUnitNum, 1e-4F);
+
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
+	XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, aUnitNum);
+	aMeGPU->SetData(aData, aUnitNum);
+
+	/* call Clip function */
+	_Clip(aGPU, bGPU, -1.0, 1.0);
+	_ClipMe(aMeGPU, -1.0, 1.0);
+	bUserGPU = Clip(*aGPU, -1.0, 1.0);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && 
+              aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && 
+              bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+	delete aGPU;
+	delete bGPU;
+	delete aMeGPU;
+	delete[] aDimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+	delete[] aDimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Clip Function */
+bool TestClip()
+{
+	XPRINT(0, stdout, "[TEST Clip] set every entry to its clip value \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestClip1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> Failed!\n");
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/test/TClip.h
+++ b/source/tensor/test/TClip.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
+*/
+
+#ifndef __TEST_CLIP_H__
+#define __TEST_CLIP_H__
+
+#include "../core/math/Clip.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Clip Function */
+extern "C"
+bool TestClip();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_CLIP_H__
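Note on `TestClip1` above: the expected values are consistent with clamping each entry to the closed range [lower, upper]. The sketch below is a host-side C++ reference under that assumption, not part of this commit; `clipReference` is a hypothetical name and says nothing about how `_Clip` is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Clamp every entry of a to the range [lower, upper].
std::vector<float> clipReference(const std::vector<float>& a, float lower, float upper)
{
    assert(lower <= upper);
    std::vector<float> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = std::min(std::max(a[i], lower), upper);
    return b;
}
```

Applied to the flattened test input with bounds -1 and 1, this yields the same values as the `answer` array in the test.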
--- a/source/tensor/test/TExp.cpp
+++ b/source/tensor/test/TExp.cpp
@@ -66,7 +66,9 @@ bool TestExp1()
    bUser = Exp(*a);

 	/* check results */
-	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && 
+              aMe->CheckData(answer, unitNum, 1e-4F) && 
+              bUser.CheckData(answer, unitNum, 1e-4F);
    
 #ifdef USE_CUDA
 	/* GPU test */
@@ -88,7 +90,9 @@ bool TestExp1()
    bUserGPU = Exp(*aGPU);

 	/* check results */
-	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && 
+              aMeGPU->CheckData(answer, unitNum, 1e-4F) &&
+              bUserGPU.CheckData(answer, unitNum, 1e-4F);

 	/* destroy variables */
 	delete a;

--- a/source/tensor/test/TRound.cpp
+++ b/source/tensor/test/TRound.cpp
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
+*/
+
+#include "../core/math/Unary.h"
+#include "TRound.h"
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/*
+case 1: test Round function.
+Set every entry to its round value.
+*/
+bool TestRound1()
+{
+	/* a tensor of size (3, 2) */
+	int order = 2;
+	int * dimSize = new int[order];
+	dimSize[0] = 3;
+	dimSize[1] = 2;
+
+	int unitNum = 1;
+	for (int i = 0; i < order; i++)
+		unitNum *= dimSize[i];
+
+	DTYPE aData[3][2] = { {1.3F, 2.7F}, 
+	                      {-1.3F, -2.7F},
+	                      {0.0F, 0.5F} };
+	DTYPE answer[3][2] = { {1.0F, 3.0F},
+	                       {-1.0F, -3.0F},
+	                       {0.0F, 1.0F} };
+
+	/* CPU test */
+	bool cpuTest = true;
+
+	/* create tensors */
+	XTensor * a = NewTensor(order, dimSize);
+    XTensor * b = NewTensor(order, dimSize);
+	XTensor * aMe = NewTensor(order, dimSize);
+    XTensor bUser;
+
+	/* initialize variables */
+	a->SetData(aData, unitNum);
+	aMe->SetData(aData, unitNum);
+
+	/* call Round function */
+	_Round(a, b);
+	_RoundMe(aMe);
+    bUser = Round(*a);
+
+	/* check results */
+	cpuTest = b->CheckData(answer, unitNum, 1e-4F) && 
+              aMe->CheckData(answer, unitNum, 1e-4F) && 
+              bUser.CheckData(answer, unitNum, 1e-4F);
+    
+#ifdef USE_CUDA
+	/* GPU test */
+	bool gpuTest = true;
+
+	/* create tensor */
+	XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+	XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
+    XTensor bUserGPU;
+
+	/* Initialize variables */
+	aGPU->SetData(aData, unitNum);
+	aMeGPU->SetData(aData, unitNum);
+
+	/* call Round function */
+    _Round(aGPU, bGPU);
+	_RoundMe(aMeGPU);
+    bUserGPU = Round(*aGPU);
+
+	/* check results */
+	gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && 
+              aMeGPU->CheckData(answer, unitNum, 1e-4F) && 
+              bUserGPU.CheckData(answer, unitNum, 1e-4F);
+
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+    delete aGPU;
+    delete bGPU;
+    delete aMeGPU;
+	delete[] dimSize;
+
+	return cpuTest && gpuTest;
+#else
+	/* destroy variables */
+	delete a;
+	delete b;
+	delete aMe;
+	delete[] dimSize;
+
+	return cpuTest;
+#endif // USE_CUDA
+}
+
+/* other cases */
+/*
+TODO!!
+*/
+
+/* test for Round Function */
+bool TestRound()
+{
+	XPRINT(0, stdout, "[TEST Round] set every entry to its round value \n");
+	bool returnFlag = true, caseFlag = true;
+
+	/* case 1 test */
+	caseFlag = TestRound1();
+
+	if (!caseFlag) {
+		returnFlag = false;
+		XPRINT(0, stdout, ">> case 1 failed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> case 1 passed!\n");
+
+	/* other cases test */
+	/*
+	TODO!!
+	*/
+
+	if (returnFlag) {
+		XPRINT(0, stdout, ">> All Passed!\n");
+	}
+	else
+		XPRINT(0, stdout, ">> Failed!\n");
+
+	XPRINT(0, stdout, "\n");
+
+	return returnFlag;
+}
+
+} // namespace nts(NiuTrans.Tensor)
--- a/source/tensor/test/TRound.h
+++ b/source/tensor/test/TRound.h
+/* NiuTrans.Tensor - an open-source tensor library
+* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
+* All rights reserved.
+*
+* Licensed under the Apache License, Version 2.0 (the "License");
+* you may not use this file except in compliance with the License.
+* You may obtain a copy of the License at
+*
+*   http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+/*
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-03
+*/
+
+#ifndef __TEST_ROUND_H__
+#define __TEST_ROUND_H__
+
+namespace nts { // namespace nts(NiuTrans.Tensor)
+
+/* test for Round Function */
+bool TestRound();
+
+} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_ROUND_H__
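Note on `TestRound1` above: the expected values (0.5 becomes 1.0, -2.7 becomes -3.0) are consistent with rounding to the nearest integer with halfway cases rounded away from zero, which is what `std::round` does. The sketch below is a host-side C++ reference under that assumption, not part of this commit; `roundReference` is a hypothetical name, and whether `_Round` uses the same rule for every halfway case is not shown in the diff.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Round every entry to the nearest integer; halfway cases go away from zero.
std::vector<float> roundReference(const std::vector<float>& a)
{
    std::vector<float> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = std::round(a[i]);
    return b;
}
```

A second test case with a negative halfway value such as -0.5 would pin down the tie-breaking rule, which the current case leaves open.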
--- a/source/tensor/test/Test.cpp
+++ b/source/tensor/test/Test.cpp
@@ -30,6 +30,7 @@ bool Test()
    XPRINT(0, stdout, "Testing the XTensor utilites ... \n\n");

    wrong = !TestAbsolute() || wrong;
+    wrong = !TestClip() || wrong;
    wrong = !TestConcatenate() || wrong;
    wrong = !TestConcatenateSolely() || wrong;
    wrong = !TestCos() || wrong;
@@ -53,6 +54,7 @@ bool Test()
    wrong = !TestReduceSum() || wrong;
    wrong = !TestReduceSumSquared() || wrong;
    wrong = !TestReduceVariance() || wrong;
+    wrong = !TestRound() || wrong;
    wrong = !TestScaleAndShift() || wrong;
    wrong = !TestSelect() || wrong;
    wrong = !TestSetAscendingOrder() || wrong;
@@ -68,7 +70,7 @@ bool Test()
    wrong = !TestSumDim() || wrong;
    wrong = !TestTan() || wrong;
    wrong = !TestTranspose() || wrong;
-    wrong = !TestTopK() || wrong;
+    //wrong = !TestTopK() || wrong;
    wrong = !TestUnsqueeze() || wrong;
    wrong = !TestXMem() || wrong;


--- a/source/tensor/test/Test.h
+++ b/source/tensor/test/Test.h
@@ -23,6 +23,7 @@
 #define __TEST_H__

 #include "TAbsolute.h"
+#include "TClip.h"
 #include "TConcatenate.h"
 #include "TConcatenateSolely.h"
 #include "TCos.h"
@@ -46,6 +47,7 @@
 #include "TReduceSum.h"
 #include "TReduceSumSquared.h"
 #include "TReduceVariance.h"
+#include "TRound.h"
 #include "TScaleAndShift.h"
 #include "TSelect.h"
 #include "TSetAscendingOrder.h"