Commit b30fad5f by xuchen

Modify the implementation of the unary and binary operations using templates. It's so cool!

parent f7c6fb3b
# NiuTrans.Tensor环境配置

## Windows系统通过Visual Studio配置NiuTrans.Tensor项目

### 注意事项

* 我们仅测试了VS2015和CUDA9.0之后的版本,对于更早的版本并不清楚是否存在问题。
* VS2015版本可以直接使用;使用较新版本的VS(如VS2017)时,需要**安装组件“适用于桌面的 VC++ 2015.3 v14.00 (v140) 工具集”**。
* 建议先安装Visual Studio再安装CUDA。安装CUDA时,建议不要勾选Visual Studio Integration,否则有时可能会出错。CUDA安装完成后,解压CUDA安装文件(exe文件可以解压),将CUDAVisualStudioIntegration\extras\visual_studio_integration\MSBuildExtensions路径下的四个文件拷贝到下述路径中:
* VS2015
> C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\v140\BuildCustomizations
* VS2017(以下两个路径分别对应v140工具集和VS默认工具集的路径)
> C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\v140\BuildCustomizations
> C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\VCTargets\BuildCustomizations

### 新建项目

* 新建一个VC++空项目。
* 将菜单栏中的**解决方案平台**设置为x64(默认是x86)。
* **菜单栏->项目->属性**中,将平台设置为x64。
* 将源代码(source文件夹)拷贝到项目的根目录,然后选择**菜单栏->项目->显示所有文件**,解决方案中即可看到source文件夹;右键点击source,选择“包含在项目中”,即可将所有的*.h和*.cpp加入到本项目中。

### CUDA配置(无GPU设备可以跳过此步骤)

在VS项目中使用CUDA,需要设置项目的相关属性,以下配置选项均可在 **菜单栏->项目->属性** 中找到。

* **C/C++->预处理器->预处理器定义** 中,添加
> USE_CUDA;
* **VC++目录->包含目录** 中加入
> $(CUDA_PATH)\include
* **VC++目录->库目录** 中加入
> $(CUDA_PATH)\lib\Win32
* **链接器->输入->附加依赖项** 中加入以下库
> cuda.lib;cudadevrt.lib;cudart.lib;cudart_static.lib;nvcuvid.lib;OpenCL.lib;cublas.lib;curand.lib;
* 上述配置完成后,在**菜单栏->项目->生成自定义**中,勾选CUDA*(根据自己安装的CUDA版本自行选择)。
* 在所有的*.cu和*.cuh文件上右键,选择“包含在项目中”。

### 其他配置

注:以下选项也可在 **菜单栏->项目->属性** 中找到。

* **常规->平台工具集**,设置为Visual Studio 2015(v140)。
* **C/C++->常规->SDL检查**,设为否。
* **C/C++->预处理器->预处理器定义** 中,添加
> WIN32;_DEBUG;_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS_CONSOLE;
* **C/C++->预编译头->预编译头**,设置为不使用预编译头。
* **链接器->系统->子系统**,设置为控制台。
* **常规->字符集**,使用Unicode字符集。
* **调试->命令参数**,设置可执行文件所需要的参数(初始可以设置为-test,用来执行测试用例)。
......@@ -20,21 +20,38 @@ NiuTrans.Tensor是小牛开源项目所开发的一个工具包,提供了完
### Windows
若在Windows上使用NiuTrans.Tensor工具包,需要安装Visual Studio集成开发工具,具体的环境配置方法请参考[NiuTrans.Tensor环境配置](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/linye/doc/Configuration.md)。

环境配置完成后,可以使用 **-test** 命令行参数运行本项目的测试用例,如果最后输出

> OK! Everything is good!

则说明本项目配置成功。

<!-- * 首先需要将NiuTrans.Tensor代码包含在所创建的项目中
* 在所创建项目中需要引用XTensor.h、core里的CHeader.h和function里的FHeader.h这三个头文件:
  * 通过XTensor.h可以获取我们需要操作的XTensor类
  * 通过core里的CHeader.h可以对Tensor进行一些张量运算
  * 通过function里的FHeader.h可以调用一些激活函数
* 在所创建项目中使用命名空间nts -->
### Linux
若在Linux上使用NiuTrans.Tensor工具包,请使用[Makefile](https://github.com/NiuTrans/NiuTrans.Tensor/blob/master/Makefile)文件,修改其中的部分配置使其匹配当前系统环境,然后在Makefile文件所在目录执行 **make** 命令,即可在bin文件夹中生成NiuTrans.Tensor.CPU或NiuTrans.Tensor.GPU,分别对应于NiuTrans.Tensor工具包的CPU以及GPU的可执行文件,同时在lib目录下生成相应的动态链接库。

输入以下命令即可执行提供的测试用例:

> ./bin/NiuTrans.Tensor.CPU -test
> ./bin/NiuTrans.Tensor.GPU -test
如果最后输出
> OK! Everything is good!
则说明编译成功。
注意:若先生成CPU的可执行文件,之后如需生成GPU可执行文件,需要先执行make clean命令,删除生成CPU可执行文件时产生的中间结果,反之亦然。
## 什么是张量
......@@ -79,14 +96,15 @@ $$
* shape - 存放有关形状转换的源文件
* sort - 存放有关排序操作的源文件
* ~/NiuTrans.Tensor/source/tensor/function - 存放各种激活函数的源文件
* ~/NiuTrans.Tensor/source/tensor/loss - 存放各种损失函数的源文件
* ~/NiuTrans.Tensor/source/tensor/test - 存放单元测试的源文件
* ~/NiuTrans.Tensor/source/tensor/*.h(cpp) - 与张量定义不相关,后文介绍 :)
以C/C++为例,仅需要在源程序中引用XTensor.h头文件就可以完成张量的定义。下面是一个简单的示例程序sample.cpp
```
#inlucde "XTensor.h" // 引用XTensor定义的头文件
#include "XTensor.h" // 引用XTensor定义的头文件
using namepsace nt; // 使用XTensor所在的命名空间nt
using namepsace nts; // 使用XTensor所在的命名空间nt
int main(int argc, const char ** argv)
{
......@@ -113,9 +131,9 @@ g++ sample.cpp -I~/NiuTrans.Tensor/source/tensor -o sample
NiuTrans.Tensor也提供了其它方式定义张量。比如可以直接调用一个函数完成张量的创建,而且可以显性释放张量。下面是一段示例代码(sample2.cpp):
```
#inlucde "XTensor.h" // 引用XTensor定义的头文件
#include "XTensor.h" // 引用XTensor定义的头文件
using namepsace nt; // 使用XTensor所在的命名空间nt
using namepsace nts; // 使用XTensor所在的命名空间nt
int main(int argc, const char ** argv)
{
......@@ -139,9 +157,9 @@ sample2.cpp中使用的NewTensor2D和DelTensor是一组函数,前者生成张
```
#inlucde "XTensor.h" // 引用XTensor定义的头文件
#include "XTensor.h" // 引用XTensor定义的头文件
using namepsace nt; // 使用XTensor所在的命名空间nt
using namepsace nts; // 使用XTensor所在的命名空间nt
int main(int argc, const char ** argv)
{
......@@ -254,45 +272,49 @@ NiuTrans.Tensor提供关于张量计算的函数功能,主要包括一些基
### 代数计算(arithmetic)
此部分主要包括各种数学运算,加、减、乘、除、取负等。
#### 除法(Div)

##### 什么是张量的除法

利用张量的除法运算可以将两个张量相除并得到一个新的张量,两个维度分别为\\(2 \times 2\\)的张量相除过程如下所示:

$$
\left(\begin{matrix}0.0 & 1.0\\\\2.0 & 3.0\end{matrix}\right) ÷
\left(\begin{matrix}1.0 & 1.0\\\\4.0 & 9.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0\\\\0.5 & 0.3333\end{matrix}\right)
$$

##### 张量除法的调用

NiuTrans.Tensor提供了张量的除法操作,在NiuTrans.Tensor/Tensor/core/arithmetic中定义。

张量除法的调用方式以及参数说明如下所示:
```
void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0.0, int leadingDim = 0)
void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha = 0.0, int leadingDim = 0)
XTensor Div(const XTensor &a, const XTensor &b, DTYPE alpha = 0.0, int leadingDim = 0)
```

Parameters:

* a - 输入张量
* b - 输入张量
* c - 输出张量
* alpha - 系数
* leadingDim - 沿着某一维度执行广播操作

##### 张量除法片段示例

Div进行张量除法操作的示例代码为:
```
/* call Div function */
t = Div(*s1, *s2, 0);
```

有关张量除法的详细代码示例见:

NiuTrans.Tensor/Tensor/test/TDiv.cpp
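
下面再给出一段较完整的示意代码,演示如何结合前文的NewTensor2D/DelTensor创建张量并调用Div。其中SetDataRand的赋值方式与头文件包含路径仅为示意,实际接口请以源码及TDiv.cpp为准:
```
#include "XTensor.h"
#include "core/CHeader.h"

using namespace nts;

int main(int argc, const char ** argv)
{
    /* 创建两个2 x 2的张量(参见前文sample2.cpp中的NewTensor2D) */
    XTensor * s1 = NewTensor2D(2, 2, X_FLOAT);
    XTensor * s2 = NewTensor2D(2, 2, X_FLOAT);

    /* 随机赋值仅为示意,也可以按TDiv.cpp中的方式逐元素设置数据 */
    s1->SetDataRand(1.0F, 9.0F);
    s2->SetDataRand(1.0F, 9.0F);

    /* 按元素相除,alpha取默认值0 */
    XTensor t = Div(*s1, *s2, 0);

    /* 显性释放张量 */
    DelTensor(s1);
    DelTensor(s2);

    return 0;
}
```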
#### 矩阵乘法(MatrixMul)
......@@ -308,7 +330,7 @@ $$
##### 矩阵乘法的调用
NiuTrans.Tensor提供了矩阵乘法的计算操作,在NiuTrans.Tensor/Tensor/core/arithmetic中定义,函数定义为:
> c\_{i,j} = trans(a\_i) * trans(b\_j) * alpha + c\_{i,j} * beta
矩阵乘法的调用方式以及参数说明如下所示:
```
......@@ -349,15 +371,19 @@ $$
\left(\begin{matrix}0.0 & 1.0\\\\4.0 & 9.0\end{matrix}\right)
$$
> c\_{i,j} = a\_{i,j} * b\_{i,j} + c\_{i,j} * alpha
##### 张量点乘的调用
NiuTrans.Tensor提供了张量点乘的计算操作,用来计算张量中元素点乘结果,该函数在NiuTrans.Tensor/Tensor/core/arithmetic中定义,张量点乘的调用方式以及参数说明如下所示:
```
void _Multiply(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0.0, int leadingDim = 0)
void _MultiplyMe(XTensor * a, const XTensor * b, DTYPE alpha = 0.0, int leadingDim = 0)
XTensor Multiply(const XTensor &a, const XTensor &b, DTYPE alpha = 0.0, int leadingDim = 0)
```
Parameters:
......@@ -452,6 +478,47 @@ b = Sign(*a);
NiuTrans.Tensor/Tensor/test/TSign.cpp
#### 减法(Sub)
##### 什么是张量减法?
张量减法的目的是将两个张量相减得到一个新的结果张量,结果张量某一位置的元素数值为进行操作的张量在该位置上元素的差,在张量减法的计算过程中进行操作的张量与结果张量的维度相同,两个维度为\\(2\times 3\\)的张量减法过程如下所示:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 \\\\ 3.0 & 4.0 & 5.0\end{matrix}\right) -
\left(\begin{matrix}0.5 & 1.5 & 2.5 \\\\ 3.5 & 4.5 & 5.5\end{matrix}\right) \rightarrow
\left(\begin{matrix}-0.5 & -0.5 & -0.5 \\\\ -0.5 & -0.5 & -0.5\end{matrix}\right)
$$
##### 张量减法的调用
NiuTrans.Tensor提供了张量减法的计算操作,在NiuTrans.Tensor/Tensor/core/arithmetic中定义,该操作用来进行张量之间的按元素位置相减,并得到相减的结果张量,张量减法的调用方法为:
```
void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0)
void _SubMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0)
XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0)
```
其中a和b为输入张量,c为结果张量,若c为NULL则将相减结果存入a中,beta为一个缩放参数,缩放公式为:c = a - b * beta,beta默认为1.0,NiuTrans.Tensor中张量减法的调用方式以及参数说明如下所示:
Parameters:
* a - 输入张量1
* b - 输入张量2
* c - 输出张量
* beta - 缩放参数
##### 张量减法片段示例
调用Sub进行张量间的减法操作如下所示,在此例中将张量相减结果存入c中:
```
/* call Sub function */
c = Sub(*a, *b);
```
详细代码示例见:
NiuTrans.Tensor/Tensor/test/TSub.cpp
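
下面是一个稍完整的示意片段,展示缩放参数beta的用法。张量的创建与赋值方式仅为示意,具体请以源码及TSub.cpp为准:
```
#include "XTensor.h"
#include "core/CHeader.h"

using namespace nts;

void SampleSub()
{
    /* 创建两个2 x 3的张量(赋值方式仅为示意) */
    XTensor * a = NewTensor2D(2, 3, X_FLOAT);
    XTensor * b = NewTensor2D(2, 3, X_FLOAT);
    a->SetDataRand(0.0F, 1.0F);
    b->SetDataRand(0.0F, 1.0F);

    /* c = a - b * beta,beta默认为1.0 */
    XTensor c = Sub(*a, *b);

    /* 也可以显式指定缩放系数,例如 c2 = a - 0.5 * b */
    XTensor c2 = Sub(*a, *b, 0.5F);

    DelTensor(a);
    DelTensor(b);
}
```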
#### 加法(Sum)
##### 什么是张量加法?
......@@ -765,39 +832,49 @@ NiuTrans.Tensor/Tensor/test/TSetData.cpp
### 数学运算(math)
此部分包括各种非基本代数操作,包括:clip、exp、power等。
#### 张量裁剪(Clip)

##### 什么是张量的裁剪操作

张量的裁剪即将张量中每一元素都通过裁剪操作限定在某一范围内从而得到一个新的张量。

一个\\(2 \times 4\\)的张量在裁剪至[2, 5]的取值范围过程如下所示:

$$
\left(\begin{matrix} 0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix} 2.0 & 2.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 5.0 & 5.0\end{matrix}\right)
$$

##### Clip调用

NiuTrans.Tensor提供了张量的Clip操作,调用方法及参数说明如下所示:
```
void _Clip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper)
void _ClipMe(XTensor * a, DTYPE lower, DTYPE upper)
XTensor Clip(const XTensor & a, DTYPE lower, DTYPE upper)
```

Parameters:

* a - 输入张量
* b - 输出张量
* lower - 裁剪范围下限
* upper - 裁剪范围上限

##### Clip片段示例

Clip示例代码如下所示:
```
/* call Clip function */
b = Clip(*a, -1.0, 1.0);
```

有关Clip的详细代码示例见:

NiuTrans.Tensor/Tensor/test/TClip.cpp
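
下面是一个稍完整的示意片段,同时展示Clip与就地操作_ClipMe的用法。张量的创建与赋值方式仅为示意,具体请以源码及TClip.cpp为准:
```
#include "XTensor.h"
#include "core/CHeader.h"

using namespace nts;

void SampleClip()
{
    /* 创建一个2 x 4的张量(赋值方式仅为示意) */
    XTensor * a = NewTensor2D(2, 4, X_FLOAT);
    a->SetDataRand(-3.0F, 3.0F);

    /* 将每个元素限定在[-1.0, 1.0]范围内,结果保存在新张量b中 */
    XTensor b = Clip(*a, -1.0F, 1.0F);

    /* 也可以使用_ClipMe在原张量上就地裁剪 */
    _ClipMe(a, -1.0F, 1.0F);

    DelTensor(a);
}
```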
#### 标准化(Normalize)
......@@ -920,6 +997,22 @@ t = ScaleAndShift(*s, scaleFactor, shiftFactor);
NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp
#### 一元操作(Unary)
##### 什么是张量的一元操作?
张量的一元操作主要包括张量的取绝对值、取指、取对数等只需对单个张量进行操作的函数。
##### 张量一元操作的调用
NiuTrans.Tensor提供了一些关于张量的一元操作,主要包括Absolute、Ceil、Exp、Floor、IsNonZero、IsZero、Log、Round、Sqrt、Square、Sin、Cos、Tan,调用方法详见NiuTrans.Tensor/Tensor/core/math/Unary.h
##### 张量一元操作示例
有关张量一元操作的详细代码示例见:
NiuTrans.Tensor/Tensor/test
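
下面给出一个示意片段,演示若干一元操作的调用方式(均为逐元素计算,输入与输出形状相同)。张量的创建与赋值方式仅为示意,具体请以源码为准:
```
#include "XTensor.h"
#include "core/CHeader.h"

using namespace nts;

void SampleUnary()
{
    /* 创建一个2 x 3的张量(赋值方式仅为示意) */
    XTensor * a = NewTensor2D(2, 3, X_FLOAT);
    a->SetDataRand(-2.0F, 2.0F);

    XTensor b = Absolute(*a);   /* 取绝对值 */
    XTensor c = Exp(*a);        /* 取指数 */
    XTensor d = Square(*a);     /* 取平方 */
    XTensor e = Log(b);         /* 对取绝对值后的结果取对数 */

    DelTensor(a);
}
```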
### 数据移动(movement)
此部分主要介绍有关数据拷贝的函数。
......@@ -1004,6 +1097,8 @@ NiuTrans.Tensor提供了张量的拷贝操作,调用方法及参数说明如
```
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL)
void _CopyValues(const XTensor * s, const int sBeg, const int sLen, XTensor * t, const int tBeg, XStream * stream = NULL)
XTensor CopyValues(const XTensor &s, XStream * stream = NULL)
```
Parameters:
......@@ -1023,6 +1118,41 @@ t = CopyValues(*s);
NiuTrans.Tensor/Tensor/test/TCopyValues.cpp
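
下面是一个简单的示意片段,张量的创建与赋值方式仅为示意,具体请以源码及TCopyValues.cpp为准:
```
#include "XTensor.h"
#include "core/CHeader.h"

using namespace nts;

void SampleCopyValues()
{
    /* 创建源张量(赋值方式仅为示意) */
    XTensor * s = NewTensor2D(2, 3, X_FLOAT);
    s->SetDataRand(0.0F, 1.0F);

    /* 拷贝s中的全部数据,得到一个内容相同的新张量t */
    XTensor t = CopyValues(*s);

    DelTensor(s);
}
```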
#### 采集(Gather)
##### 什么是张量的采集操作?
张量的采集操作,即将张量中元素按给定索引取出。
##### 张量采集操作的调用
NiuTrans.Tensor提供了张量的采集操作,调用方法及参数说明如下所示:
```
void _Gather(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize)
XTensor Gather(const XTensor &s, int dim, int * srcIndex, int indexSize)
XTensor Gather(const XTensor &s, const XTensor &index)
```
Parameters:
* s - 输入张量
* t - 输出张量
* dim - 沿指定维度进行操作
* srcIndex - 给定索引
* indexSize - 给定索引大小
##### 张量采集片段示例
张量采集示例代码如下,其中s为输入的待操作张量:
```
/* call Gather function */
t = Gather(*s, dim, srcIndex, indexSize);
```
有关张量采集的详细代码示例见:
NiuTrans.Tensor/Tensor/test/TGather.cpp
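
下面给出一个示意片段,沿第0维按给定索引取出若干行。张量的创建与赋值方式仅为示意,具体请以源码及TGather.cpp为准:
```
#include "XTensor.h"
#include "core/CHeader.h"

using namespace nts;

void SampleGather()
{
    /* 创建一个4 x 3的源张量(赋值方式仅为示意) */
    XTensor * s = NewTensor2D(4, 3, X_FLOAT);
    s->SetDataRand(0.0F, 1.0F);

    /* 沿第0维取出下标为0和2的两行,得到一个2 x 3的结果张量 */
    int srcIndex[] = {0, 2};
    XTensor t = Gather(*s, 0, srcIndex, 2);

    DelTensor(s);
}
```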
### 规约操作(reduce)
#### 归约取最大值(ReduceMax)
......@@ -1107,6 +1237,7 @@ Parameters:
t = ReduceMean(*s, 0);
t = ReduceMean(*s, 1);
```
有关张量归约取均值的详细代码示例见:
NiuTrans.Tensor/Tensor/test/TReduceMean.cpp
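
下面给出一个示意片段,说明沿不同维度归约时结果形状的变化。张量的创建与赋值方式仅为示意,具体请以源码及TReduceMean.cpp为准:
```
#include "XTensor.h"
#include "core/CHeader.h"

using namespace nts;

void SampleReduceMean()
{
    /* 创建一个2 x 4的张量(赋值方式仅为示意) */
    XTensor * s = NewTensor2D(2, 4, X_FLOAT);
    s->SetDataRand(0.0F, 1.0F);

    /* 沿第0维归约:对每一列取均值,得到长度为4的向量 */
    XTensor t0 = ReduceMean(*s, 0);

    /* 沿第1维归约:对每一行取均值,得到长度为2的向量 */
    XTensor t1 = ReduceMean(*s, 1);

    DelTensor(s);
}
```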
......@@ -1557,7 +1688,7 @@ $$
\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
\begin{aligned}
outputAnswer: & \left(
\begin{matrix}5.0 & 3.0 & 7.0 & 8.0\\\\4.0 & 1.0 & 2.0 & 6.0\end{matrix}\right)\\\\ +
\\\\ indexAnswer: & \left(
\begin{matrix}0 & 1 & 1 & 0\\\\1 & 0 & 0 & 1\end{matrix}\right)
\end{aligned}
......@@ -1598,6 +1729,43 @@ TopK(s, t, index, dim, k);
此部分主要介绍一些激活函数和损失函数。
#### Dropout
##### 什么是Dropout?
Dropout是深度学习中常用的一种正则化手段,其目的是在每次训练迭代中按给定概率随机隐藏(丢弃)一部分单元,从而缓解过拟合。
##### Dropout调用
NiuTrans.Tensor提供了张量的Dropout激活函数,调用方法及参数说明如下所示:
```
void _Dropout(const XTensor * x, XTensor * y, unsigned int seed, DTYPE dropProb, int leadingDim = -1)
XTensor Dropout(const XTensor &x, DTYPE dropProb, int leadingDim = -1)
```
Parameters:
* x - 输入张量
* y - 输出张量
* seed - 随机种子
* dropProb - 随机将单元隐藏的概率
* leadingDim - 沿着指定维度进行操作
##### Dropout片段示例
Dropout示例代码如下:
```
/* call Dropout function */
y = Dropout(*x, dropProb);
```
有关Dropout的详细代码示例见:
NiuTrans.Tensor/Tensor/test/TDropout.cpp
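
下面给出一个示意片段。张量的创建与赋值方式、dropProb的取值仅为举例,具体请以源码及TDropout.cpp为准:
```
#include "XTensor.h"
#include "function/FHeader.h"

using namespace nts;

void SampleDropout()
{
    /* 创建输入张量x(赋值方式仅为示意) */
    XTensor * x = NewTensor2D(2, 8, X_FLOAT);
    x->SetDataRand(-1.0F, 1.0F);

    /* 以0.2的概率随机隐藏单元,一般仅在训练阶段调用 */
    DTYPE dropProb = 0.2F;
    XTensor y = Dropout(*x, dropProb);

    DelTensor(x);
}
```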
#### HardTanH
##### 什么是HardTanH?
......@@ -1730,7 +1898,7 @@ Parameters:
* gold - 标准答案
* output - 输出的模型预测结果
* LFName - 损失函数名称,目前支持SQUAREDERROR, CROSSENTROPY, ONEHOTERROR
* isLogOutput - 输出是否为对数(log)形式
* leadDim - 沿着指定维度进行输出
* gBeg - 沿着指定维度leadDim从指定位置取标准答案
......@@ -1847,6 +2015,36 @@ y = Softmax(*x, 1);
NiuTrans.Tensor/Tensor/test/TSoftmax.cpp
## 自动微分
NiuTrans.Tensor提供关于具有自动微分功能的反向传播函数,主要包括在进行神经网络反向传播过程中涉及到的几种形式。在本节中,主要对这些函数及其用法用例进行介绍,函数定义详见NiuTrans.Tensor/Network/XNet.h。
NiuTrans.Tensor中几种具有自动微分功能的反向传播函数接口如下:
```
void Backward(XTensor &root, XTensor &gold, LOSS_FUNCTION_NAME loss = NOLOSS)
void Backward(XTensor &root, XTensor &gold, XTensor &padding, LOSS_FUNCTION_NAME loss = NOLOSS)
void Backward(XTensor &root, LOSS_FUNCTION_NAME loss = NOLOSS)
void Backward(XList &roots, XList &golds, XList &paddings, LOSS_FUNCTION_NAME loss = NOLOSS)
void Backward(XList &roots, LOSS_FUNCTION_NAME loss = NOLOSS)
void Backward(XList &roots, XList &golds, LOSS_FUNCTION_NAME loss = NOLOSS)
```
Parameters:
* root - 根节点,指最后神经网络的输出,也是反向传播的起点
* gold - 标准答案
* padding - 指不需要计算梯度的位置
* loss - 损失函数的类型
有关Backward的具体用法详见:
NiuTrans.Tensor/Tensor/Sample中的具体示例
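
下面给出一个示意性的用法片段。这里假设Backward是XNet.h中XNet类的成员函数、h为网络最后一层线性变换的输出,且相关参数张量已如后文FNNLM示例中那样通过SetVarFlag标记为需要梯度;损失函数名称CROSSENTROPY与前文LossCompute中的一致,具体接口请以XNet.h及Sample中的示例为准:
```
#include "XTensor.h"
#include "core/CHeader.h"
#include "function/FHeader.h"
#include "XNet.h"

using namespace nts;

void SampleBackward(XTensor &h, XTensor &gold)
{
    /* 前向计算:这里仅以一个Softmax输出层为例 */
    XTensor output = Softmax(h, 1);

    /* 以output为根节点、gold为标准答案,按交叉熵损失做反向传播,
       梯度会自动累积到参与计算且需要梯度的张量上 */
    XNet net;
    net.Backward(output, gold, CROSSENTROPY);
}
```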
## 高级技巧
### 内存池
......@@ -2447,6 +2645,8 @@ void Backward(XTensor inputs[], XTensor &output, XTensor &gold, LOSS_FUNCTION_NA
* 张裕浩
* 胡驰
NiuTrans.Tensor张量计算库由东北大学自然语言处理实验室小牛开源团队开发,成员来自东北大学自然语言处理实验室、小牛翻译、小牛雅智,致力于为深度学习相关研究及工业系统的开发提供完整的张量定义及计算功能。
## 附录
在XTensor.h头文件中定义的成员变量说明:
......
......@@ -331,6 +331,7 @@ void Init(FNNModel &model)
{
/* create embedding parameter matrix: vSize * eSize */
InitModelTensor2D(model.embeddingW, model.vSize, model.eSize, model);
model.embeddingW.SetVarFlag();
/* create hidden layer parameter matrics */
for(int i = 0; i < model.hDepth; i++){
......@@ -340,15 +341,20 @@ void Init(FNNModel &model)
InitModelTensor2D(model.hiddenW[i], (model.n - 1) * model.eSize, model.hSize, model);
else
InitModelTensor2D(model.hiddenW[i], model.hSize, model.hSize, model);
model.hiddenW[i].SetVarFlag();
/* bias term: a row vector of hSize entries */
InitModelTensor1D(model.hiddenB[i], model.hSize, model);
model.hiddenB[i].SetVarFlag();
}
/* create the output layer parameter matrix and bias term */
int iSize = model.hDepth == 0 ? (model.n - 1) * model.eSize : model.hSize;
InitModelTensor2D(model.outputW, iSize, model.vSize, model);
model.outputW.SetVarFlag();
InitModelTensor1D(model.outputB, model.vSize, model);
model.outputB.SetVarFlag();
/* then, we initialize model parameters using a uniform distribution in range
of [-minmax, minmax] */
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
......@@ -51,7 +51,13 @@ bool CONST_TRUE = true;
int verboseLevel = 0;
bool useBLAS = false;
bool useCUDA = false;
#ifdef USE_CUDA
bool useCUDA = true;
#else
bool useCUDA = false;
#endif
FILE * tmpLog = NULL;
double myTime = 0;
......
......@@ -52,7 +52,6 @@
#include "math/Clip.h"
#include "math/Compare.h"
#include "math/Normalize.h"
#include "math/Power.h"
#include "math/ScaleAndShift.h"
#include "math/Unary.h"
......
......@@ -567,15 +567,17 @@ void _CudaSetDataRand(const XTensor * tensor, DTYPE lower, DTYPE upper)
ProtectCudaDev(tensor->devID, devIDBackup);
curandGenerator_t & gen = GDevs.GPUs[tensor->devID].gen;
curandGenerateUniform(gen , (float*)tensor->data , tensor->unitNum);
curandGenerateUniform(gen, (float*)tensor->data, tensor->unitNum);
DTYPE variance = upper - lower;
if(variance != 1.0F || lower != 0){
if (tensor->dataType == X_FLOAT)
KernelSetDataRandFloat <<<blocks, threads >>>((float*) tensor->data, tensor->unitNum, lower, variance);
KernelSetDataRandFloat <<<blocks, threads >>>
((float*) tensor->data, tensor->unitNum, lower, variance);
else if (tensor->dataType == X_DOUBLE)
KernelSetDataRandDouble <<<blocks, threads >>>((double*)tensor->data, tensor->unitNum, lower, variance);
KernelSetDataRandDouble <<<blocks, threads >>>
((double*)tensor->data, tensor->unitNum, lower, variance);
}
BacktoCudaDev(tensor->devID, devIDBackup);
......
......@@ -26,125 +26,124 @@
namespace nts {
int scale(int x, int scale)
template<class T1, class T2>
T1 descale(T1 x, T2 num)
{
return x * scale;
return (T1)(x / num);
}
float scale(float x, float scale)
template<class T1, class T2>
T1 power(T1 x, T2 num)
{
return x * scale;
}
int descale(int x, int descale)
{
return x / descale;
}
float descale(float x, float descale)
{
return x / descale;
}
int shift(int x, int shift)
if (num == 0)
return (T1)1.0;
else if (num == 0.5)
return (T1)sqrt(x);
else if (num == 2)
return x * x;
else {
if (x == 0 && num < 0)
return (T1)NAN;
else
return (T1)pow(x, num);
}
}
template<class T1, class T2>
T1 scale(T1 x, T2 num)
{
return x + shift;
return (T1)(x * num);
}
float shift(float x, float shift)
template<class T1, class T2>
T1 shift(T1 x, T2 num)
{
return x + shift;
return (T1)(x + num);
}
int mod(int x, int mod)
int mod(int x, int num)
{
return x % mod;
return x % num;
}
#ifdef USE_CUDA
/* define three marco separately, specify the respective function names (GPU mode) */
#define _SIMPLE_BINARY_FUNCTION_INT(_funcName, _cudaFuncName, origFunc) \
void _funcName(const XTensor * a, XTensor * b, int num) \
/* define three marco separately, specify the respective function names */
#define _SIMPLE_BINARY_FUNCTION(_funcName, _cudaFuncName, origFunc) \
template<class T> \
void _funcName(const XTensor * a, XTensor * b, T num) \
{ \
/* run it on GPUs */ \
if (a->devID >= 0) { \
if (useCUDA) { \
_cudaFuncName(a, b, num); \
return; \
} \
else \
ShowNTErrors("No GPU devices support!") \
} \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same data type!"); \
CheckNTErrors(a->dataType == X_INT && b->dataType == X_INT, \
"TODO!"); \
if (a->dataType == X_INT) { \
int * d = (int*)a->data; \
int * db = (int*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (int)origFunc(d[i], num); \
} \
#define _SIMPLE_BINARY_FUNCTION(_funcName, _cudaFuncName, origFunc) \
void _funcName(const XTensor * a, XTensor * b, float num) \
{ \
/* run it on GPUs */ \
if (a->devID >= 0) { \
_cudaFuncName(a, b, num); \
return; \
db[i] = (int)origFunc((int)d[i], (T)num); \
} \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same data type!"); \
CheckNTErrors(a->dataType == X_FLOAT && b->dataType == X_FLOAT, \
"TODO!"); \
else if (a->dataType == X_FLOAT) { \
float * d = (float*)a->data; \
float * db = (float*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (float)origFunc(d[i], num); \
}
#define _SIMPLE_BINARY_FUNCTION_ME_INT(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a, int num) \
{ \
_funcName(a, a, num); \
}
db[i] = (float)origFunc((float)d[i], (T)num); \
} \
else if (a->dataType == X_DOUBLE) { \
double * d = (double*)a->data; \
double * db = (double*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (double)origFunc((double)d[i], (T)num); \
} \
else \
ShowNTErrors("TO DO!"); \
} \
template void _funcName<int>(const XTensor*, XTensor*, int); \
template void _funcName<float>(const XTensor*, XTensor*, float); \
template void _funcName<double>(const XTensor*, XTensor*, double);
#define _SIMPLE_BINARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a, float num) \
template<class T> \
void _funcNameMe(XTensor * a, T num) \
{ \
_funcName(a, a, num); \
}
#define SIMPLE_BINARY_FUNCTION_ME_INT(funcNameMe, _funcName) \
void funcNameMe(XTensor &a, int num) \
{ \
_funcName(&a, &a, num); \
} \
template void _funcNameMe<int>(XTensor*, int); \
template void _funcNameMe<float>(XTensor*, float); \
template void _funcNameMe<double>(XTensor*, double);
#define SIMPLE_BINARY_FUNCTION_ME(funcNameMe, _funcName) \
void funcNameMe(XTensor &a, float num) \
template<class T> \
void funcNameMe(XTensor &a, T num) \
{ \
_funcName(&a, &a, num); \
}
} \
template void funcNameMe<int>(XTensor&, int); \
template void funcNameMe<float>(XTensor&, float); \
template void funcNameMe<double>(XTensor&, double);
#define SIMPLE_BINARY_FUNCTION(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a, float num) \
template<class T> \
XTensor funcName(const XTensor &a, T num) \
{ \
XTensor b(&a); \
b.SetTMPFlag(); \
_funcName(&a, &b, num); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
#define SIMPLE_BINARY_FUNCTION_INT(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a, int num) \
{ \
XTensor b(&a); \
b.SetTMPFlag(); \
_funcName(&a, &b, num); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
} \
template XTensor funcName<int>(const XTensor&, int); \
template XTensor funcName<float>(const XTensor&, float); \
template XTensor funcName<double>(const XTensor&, double);
#define SIMPLE_BINARY_FUNCTION_VOID(funcName, _funcName, operationId) \
void funcName(const XTensor &a, XTensor &b, float num) \
template<class T> \
void funcName(const XTensor &a, XTensor &b, T num) \
{ \
if (!b.isInit || !XTensor::IsSameShaped(&a, &b)) { \
InitTensor(&b, &a); \
......@@ -153,143 +152,39 @@ void funcName(const XTensor &a, XTensor &b, float num) \
if (b.enableGrad) { \
XLink::MakeLink(&a, NULL, &b, operationId); \
} \
}
} \
template void funcName<int>(const XTensor&, XTensor&, int); \
template void funcName<float>(const XTensor&, XTensor&, float); \
template void funcName<double>(const XTensor&, XTensor&, double);
#define SIMPLE_BINARY_FUNCTION_INT_VOID(funcName, _funcName, operationId) \
void funcName(const XTensor &a, XTensor &b, int num) \
{ \
if (!b.isInit || !XTensor::IsSameShaped(&a, &b)) { \
InitTensor(&b, &a); \
} \
_funcName(&a, &b, num); \
if (b.enableGrad) { \
XLink::MakeLink(&a, NULL, &b, operationId); \
} \
}
_SIMPLE_BINARY_FUNCTION(_Descale, _CudaDescale, descale)
_SIMPLE_BINARY_FUNCTION_ME(_DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION_ME(DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION(Descale, _Descale, MATH_DESCALE)
SIMPLE_BINARY_FUNCTION_VOID(Descale, _Descale, MATH_DESCALE)
_SIMPLE_BINARY_FUNCTION(_Mod, _CudaMod, mod)
_SIMPLE_BINARY_FUNCTION_ME(_ModMe, _Mod)
SIMPLE_BINARY_FUNCTION_ME(ModMe, _Mod)
SIMPLE_BINARY_FUNCTION(Mod, _Mod, MATH_MOD)
SIMPLE_BINARY_FUNCTION_VOID(Mod, _Mod, MATH_MOD)
_SIMPLE_BINARY_FUNCTION_INT(_Scale, _CudaScale, scale)
_SIMPLE_BINARY_FUNCTION_ME_INT(_ScaleMe, _Scale)
SIMPLE_BINARY_FUNCTION_ME_INT(ScaleMe, _Scale)
SIMPLE_BINARY_FUNCTION_INT(Scale, _Scale, MATH_SCALE)
SIMPLE_BINARY_FUNCTION_INT_VOID(Scale, _Scale, MATH_SCALE)
_SIMPLE_BINARY_FUNCTION(_Power, _CudaPower, power)
_SIMPLE_BINARY_FUNCTION_ME(_PowerMe, _Power)
SIMPLE_BINARY_FUNCTION_ME(PowerMe, _Power)
SIMPLE_BINARY_FUNCTION(Power, _Power, MATH_POWER)
SIMPLE_BINARY_FUNCTION_VOID(Power, _Power, MATH_POWER)
_SIMPLE_BINARY_FUNCTION(_Scale, _CudaScaleFloat, scale)
_SIMPLE_BINARY_FUNCTION(_Scale, _CudaScale, scale)
_SIMPLE_BINARY_FUNCTION_ME(_ScaleMe, _Scale)
SIMPLE_BINARY_FUNCTION_ME(ScaleMe, _Scale)
SIMPLE_BINARY_FUNCTION(Scale, _Scale, MATH_SCALE)
SIMPLE_BINARY_FUNCTION_VOID(Scale, _Scale, MATH_SCALE)
_SIMPLE_BINARY_FUNCTION_INT(_Descale, _CudaDescale, descale)
_SIMPLE_BINARY_FUNCTION_ME_INT(_DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION_ME_INT(DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION_INT(Descale, _Descale, MATH_DESCALE)
SIMPLE_BINARY_FUNCTION_INT_VOID(Descale, _Descale, MATH_DESCALE)
_SIMPLE_BINARY_FUNCTION(_Descale, _CudaDescaleFloat, descale)
_SIMPLE_BINARY_FUNCTION_ME(_DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION_ME(DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION(Descale, _Descale, MATH_DESCALE)
SIMPLE_BINARY_FUNCTION_VOID(Descale, _Descale, MATH_DESCALE)
_SIMPLE_BINARY_FUNCTION_INT(_Shift, _CudaShift, shift)
_SIMPLE_BINARY_FUNCTION_ME_INT(_ShiftMe, _Shift)
SIMPLE_BINARY_FUNCTION_ME_INT(ShiftMe, _Shift)
SIMPLE_BINARY_FUNCTION_INT(Shift, _Shift, MATH_SHIFT)
SIMPLE_BINARY_FUNCTION_INT_VOID(Shift, _Shift, MATH_SHIFT)
_SIMPLE_BINARY_FUNCTION(_Shift, _CudaShiftFloat, shift)
_SIMPLE_BINARY_FUNCTION(_Shift, _CudaShift, shift)
_SIMPLE_BINARY_FUNCTION_ME(_ShiftMe, _Shift)
SIMPLE_BINARY_FUNCTION_ME(ShiftMe, _Shift)
SIMPLE_BINARY_FUNCTION(Shift, _Shift, MATH_SHIFT)
SIMPLE_BINARY_FUNCTION_VOID(Shift, _Shift, MATH_SHIFT)
_SIMPLE_BINARY_FUNCTION_INT(_Mod, _CudaMod, mod)
_SIMPLE_BINARY_FUNCTION_ME_INT(_ModMe, _Mod)
SIMPLE_BINARY_FUNCTION_ME_INT(ModMe, _Mod)
SIMPLE_BINARY_FUNCTION_INT(Mod, _Mod, MATH_MOD)
SIMPLE_BINARY_FUNCTION_INT_VOID(Mod, _Mod, MATH_MOD)
#else
/* define three marco separately, specify the respective function names (CPU mode) */
#define _SIMPLE_BINARY_FUNCTION_INT(_funcName, origFunc) \
void _funcName(const XTensor * a, XTensor * b, int num) \
{ \
CheckNTErrors(a->devID < 0, "No GPU code is supported"); \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same data type!"); \
CheckNTErrors((a->dataType == X_INT&&b->dataType == X_INT), "TODO!"); \
int * d = (int*)a->data; \
int * db = (int*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (int)origFunc(d[i], num); \
} \
#define _SIMPLE_BINARY_FUNCTION(_funcName, origFunc) \
void _funcName(const XTensor * a, XTensor * b, float num) \
{ \
CheckNTErrors(a->devID < 0, "No GPU code is supported"); \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same data type!"); \
CheckNTErrors((a->dataType == X_FLOAT&&b->dataType == X_FLOAT), "TODO!");\
float * d = (float*)a->data; \
float * db = (float*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (float)origFunc(d[i], num); \
}
#define SIMPLE_BINARY_FUNCTION_ME_INT(funcName, _funcName) \
void funcName(XTensor &a, int num) \
{ \
_funcName(&a, &a, num); \
} \
#define SIMPLE_BINARY_FUNCTION_ME(funcName, _funcName) \
void funcName(XTensor &a, float num) \
{ \
_funcName(&a, &a, num); \
} \
#define SIMPLE_BINARY_FUNCTION_INT(funcName, _funcName) \
void funcName(const XTensor &a, XTensor &b, int num) \
{ \
_funcName(&a, &b, num); \
} \
#define SIMPLE_BINARY_FUNCTION(funcName, _funcName) \
void funcName(const XTensor &a, XTensor &b, float num) \
{ \
_funcName(&a, &b, num); \
} \
_SIMPLE_BINARY_FUNCTION_INT(_Scale, scale)
SIMPLE_BINARY_FUNCTION_ME_INT(_ScaleMe, _Scale)
SIMPLE_BINARY_FUNCTION_INT(Scale, _Scale)
_SIMPLE_BINARY_FUNCTION(_Scale, scale)
SIMPLE_BINARY_FUNCTION_ME(_ScaleMe, _Scale)
SIMPLE_BINARY_FUNCTION(Scale, _Scale)
_SIMPLE_BINARY_FUNCTION_INT(_Descale, descale)
SIMPLE_BINARY_FUNCTION_ME_INT(_DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION_INT(Descale, _Descale)
_SIMPLE_BINARY_FUNCTION(_Descale, descale)
SIMPLE_BINARY_FUNCTION_ME(_DescaleMe, _Descale)
SIMPLE_BINARY_FUNCTION(Descale, _Descale)
_SIMPLE_BINARY_FUNCTION_INT(_Shift, shift)
SIMPLE_BINARY_FUNCTION_ME_INT(_Shift, _Shift)
SIMPLE_BINARY_FUNCTION_INT(Shift, _Shift)
_SIMPLE_BINARY_FUNCTION(_Shift, shift)
SIMPLE_BINARY_FUNCTION_ME(_ShiftMe, _Shift)
SIMPLE_BINARY_FUNCTION(Shift, _Shift)
_SIMPLE_BINARY_FUNCTION_INT(_Mod, mod)
SIMPLE_BINARY_FUNCTION_ME_INT(_ModMe, _Mod)
SIMPLE_BINARY_FUNCTION_INT(Mod, _Mod)
#endif
} // namespace nts(NiuTrans.Tensor)
......@@ -21,6 +21,7 @@
#include <math.h>
#include "../../XDevice.h"
#include "../../XUtility.h"
#include "../../XName.h"
#include "Binary.h"
#include "Binary.cuh"
......@@ -30,59 +31,63 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
__device__
int cudascale(int x, int scale)
int BaseMod(int x, int base)
{
return x * scale;
return x % base;
}
template<class T1, class T2>
__device__
float cudascale(float x, float scale)
T1 BaseDescale(T1 x, T2 num)
{
return x * scale;
return x / num;
}
template<class T1, class T2>
__device__
int cudadescale(int x, int descale)
T1 BasePower(T1 x, T2 num)
{
return x / descale;
if (num == 0)
return (T1)1.0;
else if (num == 0.5)
return (T1)sqrt((float)x);
else if (num == 2)
return (T1)(x * x);
else {
if (x == 0 && num < 0)
return 1e20F;
else
return (T1)pow((float)x, (float)num);
}
}
template<class T1, class T2>
__device__
float cudadescale(float x, float descale)
T1 BaseScale(T1 x, T2 num)
{
return x / descale;
return x * num;
}
template<class T1, class T2>
__device__
int cudashift(int x, int shift)
T1 BaseShift(T1 x, T2 num)
{
return x + shift;
return x + num;
}
__device__
float cudashift(float x, float descale)
{
return x + descale;
}
__device__
int cudamod(int x, int mod)
{
return x % mod;
}
#define SIMPLE_BINARY_FUNCTION_GPU(funcName, origFunc) \
template<class T1, class T2> \
__global__ \
void Kernel##funcName(int * a, int * b, int size, int num) \
void Kernel##funcName(T1 * a, T1 * b, int size, T2 num) \
{ \
int i = blockDim.x * blockIdx.x + threadIdx.x; \
\
if (i < size) \
b[i] = (int)origFunc(a[i], num); \
b[i] = (T1)origFunc((T1)a[i], (T2)num); \
} \
\
void _Cuda##funcName(const XTensor * a, XTensor * b, int num) \
template<class T> \
void _Cuda##funcName(const XTensor * a, XTensor * b, T num) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
......@@ -99,63 +104,33 @@ void _Cuda##funcName(const XTensor * a, XTensor * b, int num) \
int devIDBackup; \
ProtectCudaDev(a->devID, devIDBackup); \
\
if (a->dataType == X_INT) { \
if (a->dataType == X_FLOAT) { \
Kernel##funcName<<<blocks, threads>>> \
((int*)a->data, (int*)b->data, a->unitNum, num); \
((float*)a->data, (float*)b->data, a->unitNum, (T)num); \
} \
else { \
ShowNTErrors("TODO!"); \
else if (a->dataType == X_DOUBLE) { \
Kernel##funcName<<<blocks, threads>>> \
((double*)a->data, (double*)b->data, a->unitNum, (T)num); \
} \
\
BacktoCudaDev(a->devID, devIDBackup); \
} \
#define SIMPLE_BINARY_FUNCTION_FLOAT_GPU(funcName, origFunc) \
__global__ \
void Kernel##funcName(float * a, float * b, int size, float num) \
{ \
int i = blockDim.x * blockIdx.x + threadIdx.x; \
\
if (i < size) \
b[i] = (float)origFunc(a[i], num); \
} \
\
\
void _Cuda##funcName(const XTensor * a, XTensor * b, float num) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->isSparse == false), "TODO!"); \
\
int gridSize[3]; \
int blockSize[3]; \
\
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize); \
\
dim3 blocks(gridSize[0]); \
dim3 threads(blockSize[0]); \
\
int devIDBackup; \
ProtectCudaDev(a->devID, devIDBackup); \
\
if (a->dataType == X_FLOAT) { \
else if (a->dataType == X_INT) { \
Kernel##funcName<<<blocks, threads>>> \
((float*)a->data, (float*)b->data, a->unitNum, num);\
((int*)a->data, (int*)b->data, a->unitNum, (T)num); \
} \
else { \
ShowNTErrors("TODO!"); \
} \
\
BacktoCudaDev(a->devID, devIDBackup); \
}
SIMPLE_BINARY_FUNCTION_GPU(Scale, cudascale)
SIMPLE_BINARY_FUNCTION_FLOAT_GPU(ScaleFloat, cudascale)
SIMPLE_BINARY_FUNCTION_GPU(Descale, cudadescale)
SIMPLE_BINARY_FUNCTION_FLOAT_GPU(DescaleFloat, cudadescale)
SIMPLE_BINARY_FUNCTION_GPU(Shift, cudashift)
SIMPLE_BINARY_FUNCTION_FLOAT_GPU(ShiftFloat, cudashift)
SIMPLE_BINARY_FUNCTION_GPU(Mod, cudamod)
} \
template void _Cuda##funcName<int>(const XTensor*, XTensor*, int); \
template void _Cuda##funcName<float>(const XTensor*, XTensor*, float); \
template void _Cuda##funcName<double>(const XTensor*, XTensor*, double);
SIMPLE_BINARY_FUNCTION_GPU(Descale, BaseDescale)
SIMPLE_BINARY_FUNCTION_GPU(Mod, BaseMod)
SIMPLE_BINARY_FUNCTION_GPU(Power, BasePower)
SIMPLE_BINARY_FUNCTION_GPU(Scale, BaseScale)
SIMPLE_BINARY_FUNCTION_GPU(Shift, BaseShift)
#endif // USE_CUDA
......
......@@ -29,38 +29,25 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* scale each entry (CUDA Kernel) */
__global__
void KernelScale(int * a, int * b, int size, int scale);
__global__
void KernelScale(int * a, int * b, int size, float scale);
/* scale each entry */
void _CudaScale(const XTensor * a, XTensor * b, int scale);
void _CudaScaleFloat(const XTensor * a, XTensor * b, float scale);
/* descale each entry (CUDA Kernel) */
__global__
void KernelDescale(int * a, int * b, int size, int scale);
__global__
void KernelDescale(int * a, int * b, int size, float scale);
/* descale each entry */
void _CudaDescale(const XTensor * a, XTensor * b, int scale);
void _CudaDescaleFloat(const XTensor * a, XTensor * b, float scale);
template<class T>
void _CudaDescale(const XTensor * a, XTensor * b, T num);
/* shift each entry (CUDA Kernel) */
__global__
void KernelShift(int * a, int * b, int size, int shift);
__global__
void KernelShift(int * a, int * b, int size, float shift);
/* shift each entry */
void _CudaShift(const XTensor * a, XTensor * b, int shift);
void _CudaShiftFloat(const XTensor * a, XTensor * b, float shift);
/* power each entry */
template<class T>
void _CudaPower(const XTensor * a, XTensor * b, T num);
/* mod each entry (CUDA Kernel) */
__global__
void KernelMod(int * a, int * b, int size, int base);
/* mod each entry */
void _CudaMod(const XTensor * a, XTensor * b, int base);
template<class T>
void _CudaMod(const XTensor * a, XTensor * b, T base);
/* scale each entry */
template<class T>
void _CudaScale(const XTensor * a, XTensor * b, T num);
/* shift each entry */
template<class T>
void _CudaShift(const XTensor * a, XTensor * b, T num);
#endif // USE_CUDA
......
......@@ -26,84 +26,110 @@
namespace nts { // namespace nts(NiuTrans.Tensor)
/* scale up tensor entires
b = a * scale */
void _Scale(const XTensor * a, XTensor * b, int scale);
void _Scale(const XTensor * a, XTensor * b, float scale);
/* scale up tensor entires (on site)
b = a * scale */
void _ScaleMe(XTensor * a, int scale);
void _ScaleMe(XTensor * a, float scale);
/* scale up tensor entires (on site)
b = a * scale */
void ScaleMe(XTensor & a, int scale);
void ScaleMe(XTensor & a, float scale);
/* scale up tensor entires
b = a * scale */
void Scale(const XTensor & a, XTensor & b, int scale);
void Scale(const XTensor & a, XTensor & b, float scale);
/* scale up tensor entires (return an XTensor structure)
b = a * scale */
XTensor Scale(const XTensor & a, int scale);
XTensor Scale(const XTensor & a, float scale);
/* descale tensor entires
b = a / scale */
void _Descale(const XTensor * a, XTensor * b, int scale);
void _Descale(const XTensor * a, XTensor * b, float scale);
b = a / num */
template<class T>
void _Descale(const XTensor * a, XTensor * b, T num);
/* descale tensor entires (on site)
b = a / scale */
void _DescaleMe(XTensor * a, int scale);
void _DescaleMe(XTensor * a, float scale);
b = a / num */
template<class T>
void _DescaleMe(XTensor * a, T num);
/* descale tensor entires (on site)
b = a / scale */
void DescaleMe(XTensor & a, int scale);
void DescaleMe(XTensor & a, float scale);
b = a / num */
template<class T>
void DescaleMe(XTensor & a, T num);
/* descale tensor entires
b = a / scale */
void Descale(const XTensor & a, XTensor & b, int scale);
void Descale(const XTensor & a, XTensor & b, float scale);
b = a / num */
template<class T>
void Descale(const XTensor & a, XTensor & b, T num);
/* descale tensor entires (return an XTensor structure)
b = a / scale */
XTensor Descale(const XTensor & a, int scale);
XTensor Descale(const XTensor & a, float scale);
b = a / num */
template<class T>
XTensor Descale(const XTensor & a, T num);
/* mod tensor entires
b = a % base */
template<class T>
void _Mod(const XTensor * a, XTensor * b, T base);
/* mod base entires (on site)
b = a % num */
template<class T>
void _ModMe(XTensor * a, T base);
/* mod tensor entires (on site)
b = a % base */
template<class T>
void ModMe(XTensor & a, T base);
/* mod tensor entires
b = a % base */
template<class T>
void Mod(const XTensor & a, XTensor & b, T base);
/* mod tensor entires (return an XTensor structure)
b = a % base */
template<class T>
XTensor Mod(const XTensor & a, T base);
/* get the power(x, y)
b = power(a, num) */
template<class T>
void _Power(const XTensor * a, XTensor * b, T scale);
/* get the power(x, y) (on site)
b = power(a, num) */
template<class T>
void _PowerMe(XTensor * a, T scale);
/* get the power(x, y) (on site)
b = power(a, num) */
template<class T>
void PowerMe(XTensor & a, T scale);
/* get the power(x, y)
b = power(a, num) */
template<class T>
void Power(const XTensor & a, XTensor & b, T scale);
/* get the power(x, y) (return an XTensor structure)
b = power(a, num) */
template<class T>
XTensor Power(const XTensor & a, T scale);
/* scale up tensor entires
b = a * num */
template<class T>
void _Scale(const XTensor * a, XTensor * b, T num);
/* scale up tensor entires (on site)
b = a * num */
template<class T>
void _ScaleMe(XTensor * a, T num);
/* scale up tensor entires (on site)
b = a * num */
template<class T>
void ScaleMe(XTensor & a, T num);
/* scale up tensor entires
b = a * num */
template<class T>
void Scale(const XTensor & a, XTensor & b, T num);
/* scale up tensor entires (return an XTensor structure)
b = a * num */
template<class T>
XTensor Scale(const XTensor & a, T num);
/* shift tensor entires
b = a + shift */
void _Shift(const XTensor * a, XTensor * b, int shift);
void _Shift(const XTensor * a, XTensor * b, float shift);
b = a + num */
template<class T>
void _Shift(const XTensor * a, XTensor * b, T num);
/* shift tensor entires (on site)
b = a + shift */
void _ShiftMe(XTensor * a, int shift);
void _ShiftMe(XTensor * a, float shift);
b = a + num */
template<class T>
void _ShiftMe(XTensor * a, T num);
/* shift tensor entires (on site)
b = a + shift */
void ShiftMe(XTensor & a, int shift);
void ShiftMe(XTensor & a, float shift);
b = a + num */
template<class T>
void ShiftMe(XTensor & a, T num);
/* shift tensor entires
b = a + shift */
void Shift(const XTensor & a, XTensor & b, int shift);
void Shift(const XTensor & a, XTensor & b, float shift);
b = a + num */
template<class T>
void Shift(const XTensor & a, XTensor & b, T num);
/* shift tensor entires (return an XTensor structure)
b = a + shift */
XTensor Shift(const XTensor & a, int shift);
XTensor Shift(const XTensor & a, float shift);
/* mod tensor entires
b = a % mod */
void _Mod(const XTensor * a, XTensor * b, int base);
/* mod tensor entires (on site)
b = a % mod */
void _ModMe(XTensor * a, int base);
/* mod tensor entires (on site)
b = a % mod */
void ModMe(XTensor & a, int base);
/* mod tensor entires
b = a % mod */
void Mod(const XTensor & a, XTensor & b, int base);
/* mod tensor entires (return an XTensor structure)
b = a + shift */
XTensor Mod(const XTensor & a, int shift);
b = a + num */
template<class T>
XTensor Shift(const XTensor & a, T num);
} // namespace nts(NiuTrans.Tensor)
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include <math.h>
#include "../../XTensor.h"
#include "../../XName.h"
#include "Power.h"
#include "Power.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
get the power(a, p)
>> a - input tensor
>> b - output tensor
>> p - parameter
*/
void _Power(const XTensor * a, XTensor * b, DTYPE p)
{
#ifdef USE_CUDA
/* run it on GPUs */
if (a->devID >= 0) {
_CudaPower(a, b, p);
return;
}
#endif
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * aData = (DTYPE*)a->data;
DTYPE * bData = (DTYPE*)b->data;
if (p == 0) {
for (int i = 0; i < a->unitNum; i++)
bData[i] = (DTYPE)1.0;
}
else if (p == (DTYPE)0.5) {
for (int i = 0; i < a->unitNum; i++)
bData[i] = (DTYPE)sqrt(aData[i]);
}
else if (p == (DTYPE)2.0) {
for (int i = 0; i < a->unitNum; i++)
bData[i] = aData[i] * aData[i];
}
else {
for (int i = 0; i < a->unitNum; i++) {
if (p < 0 && aData[i] == 0)
bData[i] = 1e20F;
else
bData[i] = (DTYPE)pow(aData[i], p);
}
}
}
/*
get the power(a, p) (do it on site)
keep the result in the input tensor a and return nothing
>> a - the tensor
>> p - parameter
*/
void _PowerMe(XTensor * a, DTYPE p)
{
_Power(a, a, p);
}
/*
get the power(a, p) (return an XTensor structure)
make a new tensor to keep the result and return it
>> a - input tensor
>> p - parameter
<< return - the power value of the input tensor
*/
XTensor Power(const XTensor & a, DTYPE p)
{
XTensor b(&a);
b.SetTMPFlag();
/* call _Power function */
_Power(&a, &b, p);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_POWER);
XLink::AddParamToHead(&b, p);
return b;
}
/*
get the power(a, p)
>> a - input tensor
>> b - output tensor
>> p - parameter
>> requireLink - if add operation to network
*/
void Power(const XTensor & a, XTensor & b, DTYPE p, bool requireLink)
{
if (!b.isInit || !XTensor::IsSameShaped(&a, &b)) {
InitTensor(&b, &a);
}
/* call _Power function */
_Power(&a, &b, p);
if (requireLink) {
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_POWER);
XLink::AddParamToHead(&b, p);
}
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "../movement/CopyValues.cuh"
#include "Power.h"
#include "Power.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set all entries to its root (CUDA Kernel)
>> a - input data array
>> b - output data array
>> size - size of the data array
*/
__global__
void KernelSqrtV2(DTYPE * a, DTYPE * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
b[i] = sqrt(a[i]);
}
/*
set all entries to its root (CUDA Kernel)
>> a - input data array
>> b - output data array
>> size - size of the data array
*/
__global__
void KernelSqrtV2(__half * a, __half * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
#if __CUDA_ARCH__ >= 530 || !defined(__CUDA_ARCH__)
if (i < size)
b[i] = hsqrt(a[i]);
#else
if (i < size)
b[i] = __float2half(sqrt(__half2float(a[i])));
#endif
}
/*
get power(d[i], p)
>> a - input data array
>> b - output data array
>> p - power
>> size - size of the data array
*/
__global__
void KernelPower(DTYPE * a, DTYPE * b, DTYPE p, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
DTYPE v = a[i];
if (p < 0 && v == 0)
b[i] = 1e20;
else
b[i] = pow(a[i], p);
}
}
/*
get power(d[i], p)
>> a - input data array
>> b - output data array
>> p - power
>> size - size of the data array
*/
__global__
void KernelPower(__half * a, __half * b, __half p, int size)
{
#if __CUDA_ARCH__ >= 530 || !defined(__CUDA_ARCH__)
#else
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
float v = __half2float(a[i]);
if (__half2float(p) < 0 && v == 0)
b[i] = __float2half(1e20);
else
b[i] = __float2half(pow(__half2float(a[i]), __half2float(p)));
}
#endif
}
/* get the power of the entries */
void _CudaPower(const XTensor * a, XTensor * b, DTYPE p)
{
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE) {
if (p == (DTYPE)0.5) {
KernelSqrtV2 << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
}
else if (p == (DTYPE)1.0) {
_CudaCopyValues(a, b);
}
else if (p != (DTYPE)1.0) {
KernelPower << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, p, a->unitNum);
}
}
else if (a->dataType == X_FLOAT16) {
if (p == (DTYPE)0.5) {
KernelSqrtV2 << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
}
else if (p != (DTYPE)1.0) {
ShowNTErrors("TODO!");
}
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __POWER_CUH__
#define __POWER_CUH__
#include "Power.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set all entries to its root (CUDA Kernel) */
__global__
void KernelSqrtV2(DTYPE * a, DTYPE * b, int size);
/* set all entries to its root (CUDA Kernel) */
__global__
void KernelSqrtV2(__half * a, __half * b, int size);
/* get the power of the entries */
void _CudaPower(const XTensor * a, XTensor * b, DTYPE p);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __POWER_CUH__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __POWER_H__
#define __POWER_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* get the power(x, y) */
void _Power(const XTensor * a, XTensor * b, DTYPE p);
/*
get the power(x, y) (do it on site)
keep the result in the input tensor a and return nothing
*/
void _PowerMe(XTensor * a, DTYPE p);
/*
get the power(x, y) (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Power(const XTensor & a, DTYPE p);
/* get the power(x, y) */
void Power(const XTensor & a, XTensor & b, DTYPE p, bool requireLink = false);
} // namespace nts(NiuTrans.Tensor)
#endif // __POWER_H__
......@@ -27,72 +27,82 @@
namespace nts{
DTYPE negate(DTYPE x) {
return -x;
template<class T>
T negate(T x) {
return (T)-x;
}
DTYPE square(DTYPE x)
template<class T>
T square(T x)
{
return x * x;
return (T)(x * x);
}
DTYPE round(DTYPE r)
template<class T>
T round(T r)
{
return (r > 0.0) ? (DTYPE)floor(r + 0.5) : (DTYPE)ceil(r - 0.5);
return (r > 0.0) ? (T)floor(r + 0.5) : (T)ceil(r - 0.5);
}
DTYPE sign(DTYPE r)
template<class T>
T sign(T r)
{
if (r > 0)
return 1.0F;
else if (r == 0)
return 0.0F;
if (r > 0.0)
return (T)1.0;
else if (r == 0.0)
return (T)0.0;
else
return -1.0F;
return (T)-1.0;
}
DTYPE isnonzero(DTYPE r)
template<class T>
T isnonzero(T r)
{
return (r != 0.0) ? (DTYPE)1.0 : (DTYPE)0.0;
return (r != 0.0) ? (T)1.0 : (T)0.0;
}
DTYPE iszero(DTYPE r)
template<class T>
T iszero(T r)
{
return (r == 0.0) ? (DTYPE)1.0 : (DTYPE)0.0;
return (r == 0.0) ? (T)1.0 : (T)0.0;
}
#ifdef USE_CUDA
/* define three marco separately, specify the respective function names (GPU mode) */
/* define three marco separately, specify the respective function names */
#define _SIMPLE_UNARY_FUNCTION(_funcName, _cudaFuncName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
/* run it on GPUs */ \
if (a->devID >= 0) { \
if (useCUDA) { \
_cudaFuncName(a, b); \
return; \
} \
else \
ShowNTErrors("No GPU devices support!") \
} \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
if (a->dataType == X_INT) { \
int * d = (int*)a->data; \
int * db = (int*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
}
#else
/* define three marco separately, specify the respective function names (CPU mode) */
#define _SIMPLE_UNARY_FUNCTION(_funcName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
db[i] = (int)origFunc(d[i]); \
} \
else if (a->dataType == X_FLOAT) { \
float * d = (float*)a->data; \
float * db = (float*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
db[i] = (float)origFunc(d[i]); \
} \
else if (a->dataType == X_DOUBLE) { \
double * d = (double*)a->data; \
double * db = (double*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (double)origFunc(d[i]); \
} \
else \
ShowNTErrors("TO DO!"); \
}
#endif
#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a) \
......@@ -128,7 +138,6 @@ void funcName(const XTensor & a, XTensor & b) \
} \
}
#ifdef USE_CUDA
_SIMPLE_UNARY_FUNCTION(_Absolute, _CudaAbsolute, fabs)
_SIMPLE_UNARY_FUNCTION(_Ceil, _CudaCeil, ceil)
_SIMPLE_UNARY_FUNCTION(_Exp, _CudaExp, exp)
......@@ -144,23 +153,6 @@ _SIMPLE_UNARY_FUNCTION(_Square, _CudaSquare, square)
_SIMPLE_UNARY_FUNCTION(_Sin, _CudaSin, sin)
_SIMPLE_UNARY_FUNCTION(_Cos, _CudaCos, cos)
_SIMPLE_UNARY_FUNCTION(_Tan, _CudaTan, tan)
#else
_SIMPLE_UNARY_FUNCTION(_Absolute, fabs)
_SIMPLE_UNARY_FUNCTION(_Ceil, ceil)
_SIMPLE_UNARY_FUNCTION(_Exp, exp)
_SIMPLE_UNARY_FUNCTION(_Floor, floor)
_SIMPLE_UNARY_FUNCTION(_IsNonZero, isnonzero)
_SIMPLE_UNARY_FUNCTION(_IsZero, iszero)
_SIMPLE_UNARY_FUNCTION(_Log, log)
_SIMPLE_UNARY_FUNCTION(_Negate, negate)
_SIMPLE_UNARY_FUNCTION(_Round, round)
_SIMPLE_UNARY_FUNCTION(_Sign, sign)
_SIMPLE_UNARY_FUNCTION(_Sqrt, sqrt)
_SIMPLE_UNARY_FUNCTION(_Square, square)
_SIMPLE_UNARY_FUNCTION(_Sin, sin)
_SIMPLE_UNARY_FUNCTION(_Cos, cos)
_SIMPLE_UNARY_FUNCTION(_Tan, tan)
#endif // USE_CUDA
_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
SIMPLE_UNARY_FUNCTION_ME(AbsoluteMe, _Absolute)
......
......@@ -24,66 +24,133 @@
#include "../../XName.h"
#include "Unary.h"
#include "Unary.cuh"
#include<cuda_runtime.h>
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
template<class T>
__device__
DTYPE cudanegate(DTYPE x)
T BaseCeil(T x)
{
return (T)ceil((float)x);
}
template<class T>
__device__
T BaseExp(T x)
{
return (T)exp((float)x);
}
template<class T>
__device__
T BaseFabs(T x)
{
return (T)fabs((float)x);
}
template<class T>
__device__
T BaseFloor(T x)
{
return (T)floor((float)x);
}
template<class T>
__device__
T BaseIsNonZero(T r)
{
return (r != (T)0.0) ? (T)1.0 : (T)0.0;
}
template<class T>
__device__
T BaseIsZero(T r)
{
return (r == (T)0.0) ? (T)1.0 : (T)0.0;
}
template<class T>
__device__
T BaseLog(T x)
{
return (T)log((float)x);
}
template<class T>
__device__
T BaseNegate(T x)
{
return -x;
}
template<class T>
__device__
T BaseSign(T r)
{
if (r > (T)0)
return 1.0;
else if (r == (T)0)
return 0.0;
else
return -1.0;
}
template<class T>
__device__
DTYPE cudasquare(DTYPE x)
T BaseSqrt(T x)
{
return (T)sqrt((float)x);
}
template<class T>
__device__
T BaseSquare(T x)
{
return x * x;
}
template<class T>
__device__
DTYPE cudaround(DTYPE r)
T BaseRound(T r)
{
return (r > 0.0) ? (DTYPE)floor(r + 0.5) : (DTYPE)ceil(r - 0.5);
return (r > (T)0.0) ? (T)BaseFloor(r + (T)0.5) : (T)BaseCeil(r - (T)0.5);
}
template<class T>
__device__
DTYPE cudasign(DTYPE r)
T BaseSin(T x)
{
if (r > 0)
return 1.0F;
else if (r == 0)
return 0.0F;
else
return -1.0F;
return (T)sin((float)x);
}
template<class T>
__device__
DTYPE cudaisnonzero(DTYPE r)
T BaseCos(T x)
{
return (r != 0.0) ? (DTYPE)1.0 : (DTYPE)0.0;
return (T)cos((float)x);
}
template<class T>
__device__
DTYPE cudaiszero(DTYPE r)
T BaseTan(T x)
{
return (r == 0.0) ? (DTYPE)1.0 : (DTYPE)0.0;
return (T)tan((float)x);
}
#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc) \
template<class T> \
__global__ \
void Kernel##funcName(DTYPE * a, DTYPE * b, int size) \
void Kernel##funcName(T * a, T * b, int size) \
{ \
int i = blockDim.x * blockIdx.x + threadIdx.x; \
\
if (i < size) \
b[i] = (DTYPE)origFunc(a[i]); \
} \
__global__ \
void Kernel##funcName(__half * a, __half * b, int size) \
{ \
return; \
b[i] = (T)origFunc(a[i]); \
} \
void _Cuda##funcName(const XTensor * a, XTensor * b) \
{ \
......@@ -102,9 +169,17 @@ void _Cuda##funcName(const XTensor * a, XTensor * b) \
int devIDBackup; \
ProtectCudaDev(a->devID, devIDBackup); \
\
if (a->dataType == DEFAULT_DTYPE) { \
if (a->dataType == X_FLOAT) { \
Kernel##funcName<<<blocks, threads>>> \
((float*)a->data, (float*)b->data, a->unitNum); \
} \
else if (a->dataType == X_DOUBLE) { \
Kernel##funcName<<<blocks, threads>>> \
((double*)a->data, (double*)b->data, a->unitNum); \
} \
else if (a->dataType == X_INT) { \
Kernel##funcName<<<blocks, threads>>> \
((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum); \
((int*)a->data, (int*)b->data, a->unitNum); \
} \
else if (a->dataType == X_FLOAT16) { \
Kernel##funcName<<<blocks, threads>>> \
......@@ -115,24 +190,26 @@ void _Cuda##funcName(const XTensor * a, XTensor * b) \
} \
\
BacktoCudaDev(a->devID, devIDBackup); \
} \
}
SIMPLE_UNARY_FUNCTION_GPU(Absolute, BaseFabs)
SIMPLE_UNARY_FUNCTION_GPU(Ceil, BaseCeil)
SIMPLE_UNARY_FUNCTION_GPU(Exp, BaseExp)
SIMPLE_UNARY_FUNCTION_GPU(Floor, BaseFloor)
SIMPLE_UNARY_FUNCTION_GPU(IsNonZero, BaseIsNonZero)
SIMPLE_UNARY_FUNCTION_GPU(IsZero, BaseIsZero)
SIMPLE_UNARY_FUNCTION_GPU(Log, BaseLog)
SIMPLE_UNARY_FUNCTION_GPU(Negate, BaseNegate)
SIMPLE_UNARY_FUNCTION_GPU(Round, BaseRound)
SIMPLE_UNARY_FUNCTION_GPU(Sign, BaseSign)
SIMPLE_UNARY_FUNCTION_GPU(Sqrt, BaseSqrt)
SIMPLE_UNARY_FUNCTION_GPU(Square, BaseSquare)
SIMPLE_UNARY_FUNCTION_GPU(Absolute, fabs)
SIMPLE_UNARY_FUNCTION_GPU(Ceil, ceil)
SIMPLE_UNARY_FUNCTION_GPU(Exp, exp)
SIMPLE_UNARY_FUNCTION_GPU(Floor, floor)
SIMPLE_UNARY_FUNCTION_GPU(IsNonZero, cudaisnonzero)
SIMPLE_UNARY_FUNCTION_GPU(IsZero, cudaiszero)
SIMPLE_UNARY_FUNCTION_GPU(Log, log)
SIMPLE_UNARY_FUNCTION_GPU(Negate, cudanegate)
SIMPLE_UNARY_FUNCTION_GPU(Round, cudaround)
SIMPLE_UNARY_FUNCTION_GPU(Sign, cudasign)
SIMPLE_UNARY_FUNCTION_GPU(Sqrt, sqrt)
SIMPLE_UNARY_FUNCTION_GPU(Square, cudasquare)
SIMPLE_UNARY_FUNCTION_GPU(Sin, sin)
SIMPLE_UNARY_FUNCTION_GPU(Cos, cos)
SIMPLE_UNARY_FUNCTION_GPU(Tan, tan)
SIMPLE_UNARY_FUNCTION_GPU(Sin, BaseSin)
SIMPLE_UNARY_FUNCTION_GPU(Cos, BaseCos)
SIMPLE_UNARY_FUNCTION_GPU(Tan, BaseTan)
#endif // USE_CUDA
......
......@@ -29,139 +29,49 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set each entry to its absolute value (CUDA Kernel) */
__global__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
__global__
void KernelAbsolute(__half * a, __half * b, int size);
/* set each entry to its absolute value */
void _CudaAbsolute(const XTensor * a, XTensor * b);
/* set each entry to its ceil value (CUDA Kernel) */
__global__
void KernelCeil(DTYPE * a, DTYPE * b, int size);
/* set each entry to its ceil value (CUDA Kernel) with float16 data type*/
__global__
void KernelCeil(__half * a, __half * b, int size);
/* set each entry to its ceil value */
void _CudaCeil(const XTensor * a, XTensor * b);
/* set each entry to its exponent value (CUDA Kernel) */
__global__
void KernelExp(DTYPE * a, DTYPE * b, int size);
/* set each entry to its exponent value (CUDA Kernel) with float16 data type*/
__global__
void KernelExp(__half * a, __half * b, int size);
/* set each entry to its exponent value */
void _CudaExp(const XTensor * a, XTensor * b);
/* set each entry to its floor value (CUDA Kernel) */
__global__
void KernelFloor(DTYPE * a, DTYPE * b, int size);
/* set each entry to its floor value (CUDA Kernel) with float16 data type*/
__global__
void KernelFloor(__half * a, __half * b, int size);
/* set each entry to its floor value */
void _CudaFloor(const XTensor * a, XTensor * b);
/* if source entry is non-zero, set target entry to be one, otherwise zero (CUDA Kernel) */
__global__
void KernelIsNonZero(DTYPE * a, DTYPE * b, int size);
/* if source entry is non-zero, set target entry to be one, otherwise zero (CUDA Kernel) with float16 data type*/
__global__
void KernelIsNonZero(__half * a, __half * b, int size);
/* if source entry is non-zero, set target entry to be one, otherwise zero */
void _CudaIsNonZero(const XTensor * a, XTensor * b);
/* if source entry is zero, set target entry to be one, otherwise zero (CUDA Kernel) */
__global__
void KernelIsZero(DTYPE * a, DTYPE * b, int size);
/* if source entry is zero, set target entry to be one, otherwise zero (CUDA Kernel) with float16 data type*/
__global__
void KernelIsZero(__half * a, __half * b, int size);
/* if source entry is zero, set target entry to be one, otherwise zero */
void _CudaIsZero(const XTensor * a, XTensor * b);
/* set each entry to its logarithm value (CUDA Kernel) */
__global__
void KernelLog(DTYPE * a, DTYPE * b, int size);
/* set each entry to its logarithm value (CUDA Kernel) with float16 data type*/
__global__
void KernelLog(__half * a, __half * b, int size);
/* set each entry to its logarithm value */
void _CudaLog(const XTensor * a, XTensor * b);
/* set each entry to its negative value (CUDA Kernel) */
__global__
void KernelNegate(DTYPE * a, DTYPE * b, int size);
/* set each entry to its negative value (CUDA Kernel) with float16 data type*/
__global__
void KernelNegate(__half * a, __half * b, int size);
/* set each entry to its negative value */
void _CudaNegate(const XTensor * a, XTensor * b);
/* set each entry to its round value (CUDA Kernel) */
__global__
void KernelRound(DTYPE * a, DTYPE * b, int size);
/* set each entry to its round value (CUDA Kernel) with float16 data type*/
__global__
void KernelRound(__half * a, __half * b, int size);
/* set each entry to its round value */
void _CudaRound(const XTensor * a, XTensor * b);
/* set each entry to its sign value (CUDA Kernel) */
__global__
void KernelSign(DTYPE * a, DTYPE * b, int size);
/* set each entry to its sign value (CUDA Kernel) with float16 data type*/
__global__
void KernelSign(__half * a, __half * b, int size);
/* set each entry to its sign value */
void _CudaSign(const XTensor * a, XTensor * b);
/* set each entry to its sqrt value (CUDA Kernel) */
__global__
void KernelSqrt(DTYPE * a, DTYPE * b, int size);
/* set each entry to its sqrt value (CUDA Kernel) with float16 data type*/
__global__
void KernelSqrt(__half * a, __half * b, int size);
/* set each entry to its sqrt value */
void _CudaSqrt(const XTensor * a, XTensor * b);
/* set each entry to its square value (CUDA Kernel) */
__global__
void KernelSquare(DTYPE * a, DTYPE * b, int size);
/* set each entry to its square value (CUDA Kernel) with float16 data type*/
__global__
void KernelSquare(__half * a, __half * b, int size);
/* set each entry to its square value */
void _CudaSquare(const XTensor * a, XTensor * b);
/* set each entry to its sine value (CUDA Kernel) */
__global__
void KernelSin(DTYPE * a, DTYPE * b, int size);
/* set each entry to its sine value (CUDA Kernel) with float16 data type*/
__global__
void KernelSin(__half * a, __half * b, int size);
/* set each entry to its sine value */
void _CudaSin(const XTensor * a, XTensor * b);
/* set each entry to its cosine value (CUDA Kernel) */
__global__
void KernelCos(DTYPE * a, DTYPE * b, int size);
/* set each entry to its cosine value (CUDA Kernel) with float16 data type*/
__global__
void KernelCos(__half * a, __half * b, int size);
/* set each entry to its cosine value */
void _CudaCos(const XTensor * a, XTensor * b);
/* set each entry to its tangent value (CUDA Kernel) */
__global__
void KernelTan(DTYPE * a, DTYPE * b, int size);
/* set each entry to its tangent value (CUDA Kernel) with float16 data type*/
__global__
void KernelTan(__half * a, __half * b, int size);
/* set each entry to its tangent value */
void _CudaTan(const XTensor * a, XTensor * b);
......
......@@ -22,9 +22,9 @@
#include "Loss.h"
#include "Loss.cuh"
#include "../XDevice.h"
#include "../core/math/Power.h"
#include "../core/math/ScaleAndShift.h"
#include "../core/math/Unary.h"
#include "../core/math/Binary.h"
#include "../core/arithmetic/Sum.h"
#include "../core/arithmetic/Multiply.h"
#include "../core/reduce/ReduceSum.h"
......
......@@ -19,6 +19,7 @@
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15
*/
#include "../core/math/Binary.h"
#include "../XUtility.h"
#include "TPower.h"
......
......@@ -22,8 +22,6 @@
#ifndef __TEST_POWER_H__
#define __TEST_POWER_H__
#include "../core/math/Power.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Power Function */
......