Commit 4388c982 by xuchen

Merge branch 'xiaotong-working'

parents dd6646ed 9d33e210
# NiuTrans.Tensor: A Tensor Computation Library
## NiuTrans.Tensor
NiuTrans.Tensor is a toolkit developed as part of the NiuTrans open-source project. It provides a complete set of tensor definition and computation functions and can be used for deep learning research as well as the development of industrial systems. NiuTrans.Tensor has the following features:
* Small and simple, easy to modify
* Written in C, with highly optimized code
* Supports both CPU and GPU devices
* A rich set of tensor computation interfaces
* Callable from C/C++, Python, and other languages
## Installation
Before creating your project with the NiuTrans.Tensor toolkit, note the following:
* If your project runs on the CPU, the system supports high-performance math libraries; we recommend installing [MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/).
* If your project needs to run on the GPU, you must install [CUDA](https://developer.nvidia.com/cuda-downloads) (version 9.0 or above). The CUDA toolkit provides a development environment for creating high-performance GPU-accelerated applications.
The NiuTrans.Tensor toolkit is built from source. Installation on Windows and Linux is described below.
### Windows
To use the NiuTrans.Tensor toolkit on Windows:
* First, include the NiuTrans.Tensor source code in your project.
* Reference these three header files in your project: XTensor.h, CHeader.h (under core), and FHeader.h (under function):
* XTensor.h provides the XTensor class that you operate on
* CHeader.h (under core) provides the tensor computation operations
* FHeader.h (under function) provides the activation functions
* Use the namespace nts in your project.
In addition, for the required environment configuration, see [NiuTrans.Tensor Environment Configuration](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/linye/doc/Configuration.md).
### Linux
To use the NiuTrans.Tensor toolkit on Linux, simply run make.sh. This generates tensorCPU and tensorGPU in the same directory, the CPU and GPU executables of NiuTrans.Tensor respectively. Taking the feed-forward neural language model as an example, the following command runs the provided test case on the GPU:
>./tensorGPU -test
For more detailed usage, see the [NiuTrans.Tensor development documentation](http://47.104.97.237/niutrans/site/niutensor/index.html).
## Development Team
The NiuTrans.Tensor tensor computation library is developed jointly by the Natural Language Processing Lab at Northeastern University, NiuTrans, and Xiaoniu Yazhi. It aims to provide complete tensor definition and computation functionality for deep learning research and the development of industrial systems.
## Versions
NiuTrans.Tensor version 0.1.0 - August 3, 2018
# NiuTrans.Tensor Environment Configuration
## Notes
The latest CUDA release, 9.2, does not yet support the latest version of VS2017. We therefore recommend using CUDA 9.0 or 9.1 with VS2015, or installing the v140 toolset if you use VS2017.
## CUDA Configuration
After installing VS and CUDA and configuring the environment variables, set the following key CUDA options, all found under **Project -> Properties**.
>$(CUDA_PATH)\include
Add this to **VC++ Directories -> Include Directories**.
>$(CUDA_PATH)\lib\Win32
Add this to **VC++ Directories -> Library Directories**.
>cuda.lib;cudadevrt.lib;cudart.lib;cudart_static.lib;nvcuvid.lib;OpenCL.lib;cublas.lib;curand.lib;
Add these to **Linker -> Input -> Additional Dependencies**.
After configuration, right-click the project, choose **Project Dependencies**, and select CUDA9.
Right-click each .cu file, open its properties, and set the item type to "CUDA C/C++" (it is easiest to search for all .cu files, select them, and set this at once).
## Other Settings
Set **C/C++ -> General -> SDL checks** to No.
Under **C/C++ -> Preprocessor -> Preprocessor Definitions**, add:
>USE_CUDA;USE_BLAS;WIN32;MKL;DEBUG;CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS_
CONSOLE;
Set **Linker -> System -> SubSystem** to Console.
Set **General -> Character Set** to Use Unicode Character Set.
Set the command-line arguments the executable needs under **Debugging -> Command Arguments**.
* If your project runs on the CPU, the system supports high-performance math libraries; we recommend installing [Intel® MKL](https://software.intel.com/en-us/mkl) or [OpenBLAS](http://www.openblas.net/).
* If your project needs to run on the GPU, you must install the [NVIDIA® CUDA® Toolkit](https://developer.nvidia.com/cuda-downloads) (version 9.0 or above). The CUDA toolkit provides a development environment for creating high-performance GPU-accelerated applications.
The NiuTrans.Tensor toolkit is built from source. Installation on Windows and Linux is described below.
### Windows
To use the NiuTrans.Tensor toolkit on Windows:
* First, include the NiuTrans.Tensor source code in your project.
* Reference these three header files in your project: XTensor.h, CHeader.h (under core), and FHeader.h (under function):
* XTensor.h provides the XTensor class that you operate on
* CHeader.h (under core) provides the tensor computation operations
* FHeader.h (under function) provides the activation functions
* Use the namespace nts in your project.
In addition, for the required environment configuration, see [NiuTrans.Tensor Environment Configuration](http://47.105.50.196/NiuTrans/NiuTrans.Tensor/blob/linye/doc/Configuration.md).
### Linux
To use the NiuTrans.Tensor toolkit on Linux, simply run make.sh. This generates tensorCPU and tensorGPU in the same directory, the CPU and GPU executables of NiuTrans.Tensor respectively. Taking the feed-forward neural language model as an example, the following command runs the provided test case on the GPU:
>./tensorGPU -test
## What Is a Tensor
In computer science, a tensor is usually defined as a quantity in an \\(n\\)-dimensional space with \\(n\\) components; such a tensor is essentially a multidimensional array. The order (or rank) of a tensor is the number of dimensions of this multidimensional array, or, put simply, the number of indices needed to address an element of the tensor. Conventionally, an order-0 tensor is a scalar, an order-1 tensor is a vector, and an order-2 tensor is a matrix. For example, in three-dimensional space, an order-1 tensor is the vector \\((x,y,z)\\) representing a point, where \\(x\\), \\(y\\), and \\(z\\) are the point's coordinates along the three axes.
## How to Define a Tensor
If you are a C/C++ or Python user, defining a tensor with NiuTrans.Tensor in your program is very simple. First, download the NiuTrans.Tensor toolkit and unpack it to any directory, for example ~/NiuTrans.Tensor. Inside this directory you will find the source subdirectory, which holds the source code. Its structure is as follows:
* ~/NiuTrans.Tensor/source/tensor/XTensor.h - defines the tensor structure XTensor and the interfaces for constructing and destroying an XTensor
* ~/NiuTrans.Tensor/source/tensor/core - source files with the declarations and implementations of the tensor computation functions
    * arithmetic - source files for arithmetic operations
    * getandset - source files for getting and setting data
    * math - source files for mathematical operations
    * movement - source files for data movement
    * reduce - source files for reduction operations
    * shape - source files for shape transformations
    * sort - source files for sorting operations
* ~/NiuTrans.Tensor/source/tensor/function - source files for the various activation functions
* ~/NiuTrans.Tensor/source/tensor/test - source files for the unit tests
* ~/NiuTrans.Tensor/source/tensor/*.h(cpp) - unrelated to tensor definition; described later :)
Taking C/C++ as an example, you only need to include the header file XTensor.h in your source program to define tensors. Below is a simple sample program, sample.cpp.
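A minimal sketch of such a program, assuming only the XTensor interface described below (InitTensor2D, X_FLOAT, and the nts namespace), might look like this:
```
#include "XTensor.h"

using namespace nts;

int main(int argc, const char ** argv)
{
    /* declare a tensor variable */
    XTensor tensor;

    /* initialize it as a 50 * 100 matrix (an order-2 tensor)
       of single-precision floats */
    InitTensor2D(&tensor, 50, 100, X_FLOAT);

    return 0;
}
```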
Next, compile the program; the compiler needs to know the directory holding XTensor.h. For example, to compile sample.cpp with g++:
```
g++ sample.cpp -I~/NiuTrans.Tensor/source/tensor -o sample
```
sample.cpp uses XTensor, a class in NiuTrans.Tensor that defines the data structure a tensor needs. Through this class you can compute with, copy, and otherwise manipulate tensors. After a variable of type XTensor is declared, it must be initialized, that is, actually specified as a tensor: the size of each dimension, the data type of each element, the memory allocation, and so on. InitTensor2D() is one such initialization function; it initializes the tensor as a matrix and takes four arguments: a pointer to the tensor being initialized, the number of columns, the number of rows, and the element data type. Here X_FLOAT is an enumeration type defined by NiuTrans.Tensor denoting single-precision floating point. You can also use X_INT or X_DOUBLE to set the data type to 32-bit integer or double-precision floating point.
Tensor creation functions of this kind include:
| Operation | Function | Parameters |
| - | - | - |
| Create a dense 4-D tensor | XTensor * NewTensor4D(<br>const int d0, const int d1, const int d2, const int d3, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) | d0 - size of the first dimension <br> d1 - size of the second dimension <br> d2 - size of the third dimension <br> d3 - size of the fourth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
| Create a dense 5-D tensor | XTensor * NewTensor5D(<br>const int d0, const int d1, const int d2, <br> const int d3, const int d4, <br> const TENSOR_DATA_TYPE myDataType = X_FLOAT, <br>const int myDevID = -1, XMem * myMem = NULL) | d0 - size of the first dimension <br> d1 - size of the second dimension <br> d2 - size of the third dimension <br> d3 - size of the fourth dimension <br> d4 - size of the fifth dimension <br> myDataType - data type of the tensor <br> myDevID - ID of the device holding the tensor <br> myMem - memory pool used by the tensor |
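As a usage illustration (a sketch built only from the functions in the tables of this document):
```
/* create a 2 * 3 * 4 * 5 dense tensor of single-precision floats
   on the CPU (device ID -1), without a memory pool */
XTensor * t = NewTensor4D(2, 3, 4, 5, X_FLOAT, -1, NULL);

/* ... use the tensor ... */

/* release the tensor and its data */
DelTensor(t);
```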
## Accessing Tensor Contents
In C/C++, tensor contents are accessed through XTensor.h; including this single header file in your program is enough to define and work with tensors.
| Member Variable | Description |
| - | - |
| void * data | the array holding the elements |
| int devID | device ID, i.e. the ID of the CPU or GPU device whose memory the tensor occupies; -1 means CPU |
| int order | the order (number of dimensions) of the tensor, e.g. a matrix (2 dimensions) is an order-2 tensor |
| int dimSize[ ] | the size of each dimension of the tensor; index 0 is the first dimension |
| TENSOR_DATA_TYPE dataType | the data type of each element |
| int unitSize | the size of one element, similar to sizeof() |
| int unitNum | the number of elements |
| bool isSparse | whether the tensor is dense; an n * m dense matrix stores n * m values, while the storage of a sparse (non-dense) matrix depends on the number of nonzero elements |
| float denseRatio | density, i.e. the fraction of nonzero elements, a real number between 0 and 1; 0 means all elements are zero, 1 means all are nonzero |
Some of the methods defined in XTensor.h are described below; see the appendix for the full list:
| Operation | Function | Parameters |
| - | - | - |
| Set the size of each dimension | void SetDim(int * myDimSize) | myDimSize - size of each dimension |
| Get the size of a given dimension | int GetDim(const int dim) | dim - the dimension |
| Reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - order of the tensor <br> myDimSize - size of each dimension |
| Get the number of elements | int GetSize() | N/A |
| Fill the tensor from an array | void SetData(<br> const void * d, int num, int beg = 0) | d - the source array <br> num - size of the array <br> beg - offset in the tensor at which filling starts |
| Set all elements to 0 | void SetZeroAll(XStream * stream = NULL) | stream - multi-thread stream |
| Get a cell of a 2-D tensor | DTYPE Get2D(int ni, int mi = 0) | ni - row index <br> mi - column index |
| Set a cell of a 2-D tensor | bool Set2D(DTYPE value, int ni, int mi = 0) | value - cell value <br> ni - row index <br> mi - column index |
| Add to a cell of a 2-D tensor | bool Add2D(DTYPE value, int ni, int mi = 0) | value - value to add <br> ni - row index <br> mi - column index |
| Resize the tensor | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension, index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor, 1 means dense |
| Resize the tensor to match another | bool Resize(<br> const XTensor * myTensor) | myTensor - the reference tensor whose size is copied |
| Release the data of a given tensor | void DelTensor(<br>const XTensor * tensor) | tensor - the tensor |
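A brief sketch of these accessors in use, assuming a 2-D float tensor created as in the earlier example:
```
/* fill a 2 * 2 matrix from an array, then read and update cells */
float values[] = {1.0F, 2.0F, 3.0F, 4.0F};
tensor.SetData(values, 4);

DTYPE v = tensor.Get2D(0, 1);   /* read row 0, column 1 */
tensor.Set2D(5.0F, 0, 1);       /* overwrite the same cell */
tensor.Add2D(1.0F, 1, 0);       /* add 1.0 to row 1, column 0 */
```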
## Tensor Computation
NiuTrans.Tensor provides functions for tensor computation, mainly basic tensor operations and activation functions. This section describes these functions and how to use them. Taking element-wise multiplication (Multiply) as an example, NiuTrans.Tensor defines each operation in several forms:
* _Multiply: the output tensor must be specified explicitly; forward computation only
* MultiplyMe: the output tensor is the same as the input tensor (in place); forward computation only
* Multiply: the output tensor is returned to the caller; supports both forward and backward computation
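A sketch of the three calling conventions, using Multiply as the running example (the shapes of a, b, and c are assumed compatible):
```
XTensor a, b, c;
/* ... initialize a, b, and c with the same shape ... */

_Multiply(&a, &b, &c, 0);   /* result written into the existing tensor c */
_MultiplyMe(&a, &b);        /* in place: a is overwritten with a * b */
c = Multiply(a, b);         /* a new tensor is returned to the caller */
```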
### Arithmetic Operations (arithmetic)
This part covers the basic mathematical operations: addition, subtraction, multiplication, division, negation, absolute value, and so on.
#### Absolute Value (Absolute)
##### What is the absolute-value operation on a tensor?
The absolute-value operation takes the absolute value of every element of a tensor, producing a new tensor. For a \\(2 \times 3\\) matrix the process looks like this:
$$
\left(\begin{matrix}-1.0 & 2.0 & 3.0\\\\-4.0 & 5.0 & 6.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0\end{matrix}\right)
$$
##### Calling Absolute
NiuTrans.Tensor provides the absolute-value operation, defined in NiuTrans.Tensor/Tensor/core/arithmetic.
The calling conventions and parameters are as follows:
```
void _Absolute(const XTensor * a, XTensor * b)
void _AbsoluteMe(XTensor * a)
XTensor Absolute(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Absolute example
Example code using Absolute to take element-wise absolute values:
```
/* call Absolute function */
b = Absolute(*a);
```
For detailed example code of the absolute-value operation, see:
NiuTrans.Tensor/Tensor/test/TAbsolute.cpp
#### Matrix Multiplication (MatrixMul)
##### Calling MatrixMul
NiuTrans.Tensor provides matrix multiplication, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The function computes:
>c_{i,j} = trans(a_i) * trans(b_j) * alpha + c_{i,j} * beta
The calling conventions and parameters are as follows:
```
void _MatrixMul(XTensor * a, MATRIX_TRANS_TYPE transposedA, XTensor * b, MATRIX_TRANS_TYPE transposedB, XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0)
```
Parameters:
* a - input tensor 1
* transposedA - whether a is transposed
* b - input tensor 2
* transposedB - whether b is transposed
* c - output tensor
* alpha - coefficient α
* beta - coefficient β
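For example, with transposedA = transposedB = X_NOTRANS, multiplying a \\(2 \times 3\\) tensor a by a \\(3 \times 2\\) tensor b yields a \\(2 \times 2\\) tensor c, where \\(c_{i,j} = alpha \cdot \sum_k a_{i,k} b_{k,j} + beta \cdot c_{i,j}\\).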
##### MatrixMul example
Taking basic two-dimensional matrix multiplication as an example, the following code multiplies two matrices with MatrixMul:
```
/* call MatrixMul function */
t = MatrixMul(*s1, X_NOTRANS, *s2, X_NOTRANS);
```
Detailed example code for matrix multiplication can be found in the NiuTrans.Tensor test directory.
#### Element-wise Multiplication (Multiply)
##### Calling Multiply
NiuTrans.Tensor provides element-wise multiplication, computing the element-wise product of tensors. The function is defined in NiuTrans.Tensor/Tensor/core/arithmetic; the calling conventions and parameters are as follows:
```
_Multiply(XTensor * a, XTensor * b, XTensor * c, int leadingDim, DTYPE alpha = 0)
void _MultiplyMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0)
XTensor Multiply(const XTensor &a, const XTensor &b, DTYPE alpha = 0, int leadingDim = 0)
```
Parameters:
* a - input tensor 1
* b - input tensor 2
* c - output tensor
* leadingDim - the dimension along which the element-wise multiplication is performed
* alpha - coefficient
##### Multiply example
The following code uses Multiply to compute the element-wise product of tensors s1 and s2, storing the result in t:
```
/* call multiply function */
t = Multiply(*s1, *s2, 0);
```
Detailed example code for element-wise multiplication can be found in the NiuTrans.Tensor test directory.
#### Negation (Negate)
##### Calling Negate
NiuTrans.Tensor provides element-wise negation, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The calling conventions and parameters are as follows:
```
void _Negate(const XTensor * a, XTensor * b)
void _NegateMe(XTensor * a)
XTensor Negate(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Negate example
The following code uses Negate, where a is the tensor to be processed and b the resulting tensor:
```
/* call negate function */
b = Negate(*a);
```
For detailed example code of negation, see:
NiuTrans.Tensor/Tensor/test/TNegate.cpp
#### Sign Function (Sign)
##### What is the sign function of a tensor?
The sign function takes the sign of every element of a tensor. For a \\(3 \times 2\\) tensor the process looks like this:
$$
\left(\begin{matrix}1.0 & -2.0\\\\0.0 & 4.0\\\\5.0 & -6.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}1.0 & -1.0\\\\0.0 & 1.0\\\\1.0 & -1.0\end{matrix}\right)
$$
##### Calling Sign
NiuTrans.Tensor provides the sign function, defined in NiuTrans.Tensor/Tensor/core/arithmetic. The calling conventions and parameters are as follows:
```
void _Sign(const XTensor * a, XTensor * b)
void _SignMe(XTensor * a)
XTensor Sign(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Sign example
The following code applies Sign, where a is the tensor to be processed and b the resulting tensor:
```
/* call Sign function */
b = Sign(*a);
```
For detailed example code of the sign function, see:
NiuTrans.Tensor/Tensor/test/TSign.cpp
#### Addition (Sum)
##### What is tensor addition?
Tensor addition adds two tensors element-wise, producing the tensor of their sum.
##### Calling Sum
NiuTrans.Tensor provides tensor addition, defined in NiuTrans.Tensor/Tensor/core/arithmetic. It adds tensors element-wise and produces the resulting tensor. The calling conventions are:
```
void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0)
void _SumMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0)
XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0)
```
Here a and b are the input tensors and c the result; if c is NULL the sum is stored into a. beta is a scaling parameter, with c = a + b * beta and a default of 1.0. The parameters are:
Parameters:
* a - input tensor 1
* b - input tensor 2
* c - output tensor
* beta - scaling parameter
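As a concrete instance of the scaling formula, with beta = 2.0 each element of the result is \\(c_{i,j} = a_{i,j} + 2.0 \cdot b_{i,j}\\); the default beta = 1.0 gives plain element-wise addition.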
##### Sum example
The following code calls Sum to add two tensors, storing the result in c:
```
/* call sum function */
c = Sum(*a, *b);
```
Detailed example code for tensor addition can be found in the NiuTrans.Tensor test directory.
#### SumByColumnTV
##### Calling SumByColumnTV
NiuTrans.Tensor provides the SumByColumnTV operation. The calling convention and parameters are as follows:
```
void _SumByColumnTV(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
```
Parameters:
* a - input tensor
* b - input vector
* c - output tensor
* beta - scaling parameter
SumByColumnTV computes c_col = a_col + b * β.
##### SumByColumnTV example
Example code for SumByColumnTV, where a is the input tensor, b the input vector, and c the column-wise sum of a and b:
```
/* call SumByColumnTV function */
_SumByColumnTV(a, b, c);
```
For detailed example code of SumByColumnTV, see:
NiuTrans.Tensor/Tensor/test/TSumByColumnTV.cpp
#### SumByColumnVT
##### Calling SumByColumnVT
NiuTrans.Tensor provides the SumByColumnVT operation. The calling convention and parameters are as follows:
```
_SumByColumnVT(XTensor * a, XTensor * b, XTensor * c, DTYPE beta)
```
Parameters:
* a - input vector
* b - input tensor
* c - output vector
* beta - scaling parameter
SumByColumnVT computes c = a + \sum_{col} b_col * β.
##### SumByColumnVT example
```
/* call SumByColumnVT function */
_SumByColumnVT(a, b, c);
```
For detailed example code of SumByColumnVT, see:
NiuTrans.Tensor/Tensor/test/TSumByColumnVT.cpp
### Getting and Setting Data (getandset)
This part covers data-type conversion and the operations that set or retrieve tensor data.
#### ConvertDataType
##### What is ConvertDataType?
ConvertDataType converts the data type of every element of a tensor to another data type.
##### Calling ConvertDataType
NiuTrans.Tensor provides the ConvertDataType operation. The calling convention and parameters are as follows:
```
void _ConvertDataType(const XTensor * input, XTensor * output)
```
Parameters:
* input - input tensor
* output - output tensor
##### ConvertDataType example
Example code for ConvertDataType; here the element type is converted from float32 to int32.
First, create tensor a with type float32 and tensor b with type int32:
```
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(aOrder, aDimSize, X_INT);
```
Then call ConvertDataType:
```
/* call ConvertDataType function */
_ConvertDataType(a, b);
```
For detailed example code of ConvertDataType, see:
NiuTrans.Tensor/Tensor/test/TConvertDataType.cpp
#### Selection (Select)
##### What is tensor selection?
Tensor selection extracts part of a tensor, either using an index tensor or by a range of positions along a given dimension.
##### Calling Select
NiuTrans.Tensor provides tensor selection. The calling conventions and parameters are as follows:
The first form selects from the tensor using an index matrix of 0s and 1s:
```
void _Select(const XTensor * a, XTensor * c, XTensor * indexCPU)
XTensor Select(const XTensor &a, XTensor &indexCPU)
```
Parameters:
* a - input tensor
* c - output tensor
* indexCPU - the selection-flag tensor
The second form selects a range of positions from the tensor:
```
void _SelectRange(const XTensor * a, XTensor * c, int dim, int low, int high)
XTensor SelectRange(const XTensor &a, int dim, int low, int high)
```
Parameters:
* a - input tensor
* c - output tensor
* dim - the dimension along which selection is performed
* low - lower bound of the selection range
* high - upper bound of the selection range
>Note that a selection range of [1,3] selects the values at index positions 1 and 2.
##### Select example
Example code for tensor selection, where s is the input tensor and t the output; here the range [1,3] is selected along the third dimension:
```
/* call SelectRange function */
t = SelectRange(*s, 2, 1, 3);
```
Detailed example code for tensor selection can be found in the NiuTrans.Tensor test directory.
#### SetData
##### Calling SetData
NiuTrans.Tensor provides the SetData operation. The calling conventions and parameters are as follows:
Set the tensor to a fixed value:
```
void _SetDataFixed(XTensor * tensor, void * valuePointer)
void SetDataFixed(XTensor &tensor, DTYPE p)
```
Parameters:
* tensor - input tensor
* valuePointer - pointer to the value
* p - the value to set
Set the tensor to a fixed integer value:
```
void _SetDataFixedInt(XTensor * tensor, int p)
```
Parameters:
* tensor - input tensor
* p - the fixed integer value
Set the tensor to a fixed single-precision float value:
```
void _SetDataFixedFloat(XTensor * tensor, float p)
```
Parameters:
* tensor - input tensor
* p - the fixed single-precision value
Set the tensor to a fixed double-precision float value:
```
void _SetDataFixedDouble(XTensor * tensor, double p)
```
Parameters:
* tensor - input tensor
* p - the fixed double-precision value
Fill the tensor from a uniform distribution:
```
void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
```
Parameters:
* tensor - input tensor
* low - lower bound
* high - upper bound
Fill the tensor from a normal distribution:
```
void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
```
Parameters:
* tensor - input tensor
* mean - mean
* standardDeviation - standard deviation
##### SetData example
```
s->SetDataRand(0.0, 1.0);
```
For detailed example code of SetData, see:
NiuTrans.Tensor/Tensor/test/TSetData.cpp
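A consolidated sketch of the setters above (s is a hypothetical float tensor created as in the earlier examples):
```
/* set every element of the tensor to 1.0 */
_SetDataFixedFloat(s, 1.0F);

/* overwrite it with uniform random values in [0, 1] */
_SetDataRand(s, 0.0, 1.0);

/* or with samples from a normal distribution
   with mean 0 and standard deviation 1 */
_SetDataRandN(s, 0.0, 1.0);
```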
### Mathematical Operations (math)
This part covers non-basic algebraic operations such as log, exp, and power.
#### Logarithm (Log)
##### What is the logarithm of a tensor?
The logarithm operation takes the logarithm of every element of a tensor, producing a new tensor.
##### Calling Log
NiuTrans.Tensor provides the Log operation. The calling conventions and parameters are as follows:
```
void _Log(const XTensor * a, XTensor * b)
void _LogMe(XTensor * a)
XTensor Log(const XTensor & a)
```
Parameters:
* a - input tensor
* b - output tensor
##### Log example
Example code for Log:
```
/* call Log function */
b = Log(*a);
```
For detailed example code of Log, see:
NiuTrans.Tensor/Tensor/test/TLog.cpp
#### Normalization (Normalize)
##### Calling Normalize
NiuTrans.Tensor provides the Normalize operation, which normalizes the input along a dimension as y = a * (x - mean) / sqrt(var + epsilon) + b. The calling conventions and parameters are as follows:
```
void _Normalize(const XTensor * input, XTensor * output, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
void _NormalizeMe(XTensor * input, int dim, const XTensor * mean, const XTensor * var, const XTensor * a, const XTensor * b, DTYPE epsilon)
XTensor Normalize(const XTensor &input, int dim, const XTensor &mean, const XTensor &var, const XTensor &a, const XTensor &b, DTYPE epsilon)
```
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which normalization is performed
* mean - the mean
* var - the variance
* a - the scale tensor
* b - the bias tensor
* epsilon - a small constant for numerical stability
##### Normalize example
Example code for Normalize:
```
/* call normalize function */
t = Normalize(*s, 0, *mean, *var, *a, *b, 0.0F);
```
Detailed example code for Normalize can be found in the NiuTrans.Tensor test directory.
#### Power (Power)
##### Calling Power
NiuTrans.Tensor provides the element-wise power operation. The calling conventions are:
```
void _Power(const XTensor * a, XTensor * b, DTYPE p)
void _PowerMe(XTensor * a, DTYPE p)
XTensor Power(const XTensor & a, DTYPE p)
```
Here a is the tensor being operated on and p the exponent. The parameters are:
Parameters:
* a - input tensor
* b - output tensor
* p - the exponent
##### Power example
The following code raises every element of a to the power 2.0:
```
/* call power function */
b = Power(*a, 2.0F);
```
Detailed example code for the power operation can be found in the NiuTrans.Tensor test directory.
#### Scale and Shift (ScaleAndShift)
##### Calling ScaleAndShift
NiuTrans.Tensor provides the scale-and-shift operation. The calling conventions are:
```
void _ScaleAndShift(const XTensor * a, XTensor * b, DTYPE scale, DTYPE shift = 0)
void _ScaleAndShiftMe(XTensor * a, DTYPE scale, DTYPE shift = 0)
XTensor ScaleAndShift(const XTensor &a, DTYPE scale, DTYPE shift = 0)
```
The result of the operation is p = p * scale + shift, where scale and shift are the scaling and shifting parameters. The parameters are:
Parameters:
* a - input tensor
* b - output tensor
* scale - scaling parameter
* shift - shifting parameter
##### ScaleAndShift example
Example code for scale-and-shift, where s is the input tensor, scaleFactor the scaling parameter, and shiftFactor the shifting parameter:
```
/* call ScaleAndShift function */
t = ScaleAndShift(*s, scaleFactor, shiftFactor);
```
For detailed example code of scale-and-shift, see:
NiuTrans.Tensor/Tensor/test/TScaleAndShift.cpp
### Data Movement (movement)
This part describes the data-copy functions.
#### CopyIndexed
##### What is CopyIndexed?
CopyIndexed copies, along a given dimension, the slices at the source indices into the positions given by the target indices.
##### Calling CopyIndexed
NiuTrans.Tensor provides the CopyIndexed operation. The calling conventions and parameters are as follows:
```
void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
XTensor CopyIndexed(const XTensor &s, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
```
Parameters:
* s - input tensor
* t - output tensor
* dim - the dimension along which CopyIndexed operates
* srcIndex - source indices, i.e. the indices of the values to copy along dim
* indexSize - the number of source indices
##### CopyIndexed example
Example code for CopyIndexed, where s is the input tensor and t the output; one element is copied to the target tensor along the given dimension, starting from the given index:
```
/* call CopyIndexed function */
t = CopyIndexed(*s, dim, srcIndex, indexSize, tgtIndex, copyNum);
```
For detailed example code of CopyIndexed, see:
NiuTrans.Tensor/Tensor/test/TCopyIndexed.cpp
#### Copy (CopyValues)
##### What is the copy operation on tensors?
Copying assigns the values of one tensor to another, i.e. it duplicates the tensor. For a \\(2 \times 4\\) tensor the process looks like this:
$$
\left(\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right) \rightarrow
\left(
\begin{matrix}5.0 & 1.0 & 2.0 & 8.0\\\\4.0 & 3.0 & 7.0 & 6.0\end{matrix}\right)
$$
##### Calling CopyValues
NiuTrans.Tensor provides the tensor-copy operation. The calling conventions and parameters are as follows:
```
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL)
XTensor CopyValues(const XTensor &s, XStream * stream = NULL)
```
Parameters:
* s - input tensor
* t - output tensor
* stream - multi-thread stream
##### CopyValues example
Example code for tensor copy, where s is the input tensor and t the output:
```
/* call CopyValues function */
t = CopyValues(*s);
```
For detailed example code of tensor copy, see:
NiuTrans.Tensor/Tensor/test/TCopyValues.cpp
### Reduction Operations (reduce)
#### Reduce Max (ReduceMax)
##### What is reduce max?
Reduce max takes the maximum of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, reducing along dimension 0 and dimension 1 respectively looks like this:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right)
$$
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}3.0\\\\7.0\end{matrix}\right)
$$
##### Calling ReduceMax
NiuTrans.Tensor provides the ReduceMax operation, which obtains the maximum values along a given dimension. The calling conventions and parameters are as follows:
```
void _ReduceMax(const XTensor * input, XTensor * output, int dim)
XTensor ReduceMax(const XTensor &input, int dim)
```
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the maximum is taken
##### ReduceMax example
The following code calls ReduceMax; the two lines reduce along dimension 0 and dimension 1 respectively:
```
/* call reduce max function */
t = ReduceMax(*s, 0);
t = ReduceMax(*s, 1);
```
For detailed example code of reduce max, see:
NiuTrans.Tensor/Tensor/test/TReduceMax.cpp
#### Reduce Mean (ReduceMean)
##### Calling ReduceMean
NiuTrans.Tensor provides the ReduceMean operation. The calling conventions are:
```
void _ReduceMean(const XTensor * input, XTensor * output, int dim)
XTensor ReduceMean(const XTensor &input, int dim)
```
ReduceMean obtains the mean of the values along a given dimension. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the mean is taken
##### ReduceMean example
The following code calls ReduceMean; the two lines reduce along dimension 0 and dimension 1 respectively:
```
/* call reduce mean function */
t = ReduceMean(*s, 0);
t = ReduceMean(*s, 1);
```
For detailed example code of reduce mean, see:
NiuTrans.Tensor/Tensor/test/TReduceMean.cpp
#### Reduce Sum (ReduceSum)
##### What is reduce sum?
Reduce sum computes the sum of a tensor along one of its dimensions. For a \\(2 \times 4\\) tensor, summing along dimension 0 and dimension 1 respectively looks like this:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}4.0 & 6.0 & 8.0 & 10.0\end{matrix}\right)
$$
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}6.0\\\\22.0\end{matrix}\right)
$$
##### Calling ReduceSum
NiuTrans.Tensor provides the ReduceSum operation. The calling conventions are:
```
void _ReduceSum(const XTensor * input, XTensor * output, int dim, const XTensor * shift = NULL, DTYPE power = (DTYPE)1.0F, bool isExp = false)
XTensor ReduceSum(const XTensor &input, int dim, const XTensor &shift = NULLTensor, DTYPE power = (DTYPE)1.0F, bool isExp = false)
```
Here shift defaults to NULL, power to 1.0F, and isExp to false. The parameters are:
Parameters:
* input - input tensor
* output - output tensor
* dim - the dimension along which the sum is taken
* shift - input offset, NULL by default
* power - the power applied to each element, 1.0F by default
* isExp - whether to exponentiate, false by default
##### ReduceSum example
The following code calls ReduceSum; the two lines reduce along dimension 0 and dimension 1 respectively:
```
/* call reduce sum function */
t1 = ReduceSum(*s, 0, *shift1);
t2 = ReduceSum(*s, 1, *shift2);
```
For detailed example code of reduce sum, see:
NiuTrans.Tensor/Tensor/test/TReduceSum.cpp
#### Reduce Sum Squared (ReduceSumSquared)
##### What is reduce sum squared?
Reduce sum squared computes, along a given dimension, the sum of the squared differences between the elements and a shift value. For a \\(2 \times 4\\) tensor reduced along dimension 0:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0 & 3.0\\\\4.0 & 5.0 & 6.0 & 7.0\end{matrix}\right) \rightarrow
\left(\begin{matrix}8.0 & 8.0 & 8.0 & 8.0\end{matrix}\right)
$$
##### Calling ReduceSumSquared
NiuTrans.Tensor provides the ReduceSumSquared operation. The calling conventions are:
```
void _ReduceSumSquared(const XTensor * input, XTensor * output, int dim, const XTensor * shift)
XTensor ReduceSumSquared(const XTensor &input, int dim, const XTensor &shift)
```
ReduceSumSquared computes the sum of squared deviations of the elements along a given dimension.
##### ReduceSumSquared example
The following code calls ReduceSumSquared:
```
/* call reduce sum squared function */
t = ReduceSumSquared(*s, 0, *shift);
```
Detailed example code for ReduceSumSquared can be found in the NiuTrans.Tensor test directory.
#### Reduce Variance (ReduceVariance)
##### Calling ReduceVariance
NiuTrans.Tensor provides the ReduceVariance operation. The calling conventions are:
```
void _ReduceVariance(const XTensor * input, XTensor * output, int dim, const XTensor * mean)
XTensor ReduceVariance(const XTensor &input, int dim, const XTensor &mean)
```
ReduceVariance computes the variance of the elements along a given dimension.
##### ReduceVariance example
The following code calls ReduceVariance:
```
/* call reduce variance function */
t = ReduceVariance(*s, 0, *mean);
```
For detailed example code of ReduceVariance, see:
NiuTrans.Tensor/Tensor/test/TReduceVariance.cpp
### Shape Transformations (shape)
This part covers shape-changing functions such as split, merge, and reshape.
#### Concatenation (Concatenate)
##### Calling Concatenate
NiuTrans.Tensor provides concatenation of tensors. There are two calling conventions.
In the first, the operands are a list: the tensors to concatenate are stored in the list smalls and the result in the tensor big:
```
void _Concatenate(const XList * smalls, XTensor * big, int dim)
XTensor Concatenate(const XList &smalls, int dim)
```
Parameters:
* smalls - the list of tensors to concatenate
* big - output tensor
* dim - the dimension along which to concatenate
In the second, instead of a list, a sequence of tensors is concatenated directly:
```
void _Concatenate(const XTensor * smallA, const XTensor * smallB, XTensor * big, int dim)
XTensor Concatenate(const XTensor &smallA, const XTensor &smallB, int dim)
```
Parameters:
* smallA - input tensor 1
* smallB - input tensor 2
* big - output tensor
* dim - the dimension along which to concatenate
##### Concatenate example
Concatenating via a tensor list, where sList holds the tensors to concatenate and t is the result:
```
/* call concatenate function */
t = Concatenate(*sList, 1);
```
Concatenating a sequence of tensors directly, where s1 and s2 are the tensors to concatenate and t is the result:
```
/* call concatenate function */
t = Concatenate(*s1, *s2, 1);
```
For detailed example code of concatenation, see:
NiuTrans.Tensor/Tensor/test/TConcatenate.cpp
#### Merge (Merge)
##### What is the merge operation on tensors?
Merging is similar to concatenation: along a given dimension, it can merge one tensor into another tensor with a different number of dimensions, or merge all tensors in a list into one larger tensor.
In the first case, a \\(2 \times 2 \times 3\\) tensor is merged at dimension 1, with dimension 0 being folded in, yielding a \\(4 \times 3\\) tensor:
$$
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
\\\\ & \left(
\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
\right) \Biggr)
\end{aligned} \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
In the second case, two \\(2 \times 3\\) tensors are merged along dimension 0 into a \\(4 \times 3\\) tensor:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
##### Calling Merge
NiuTrans.Tensor provides the merge operation. There are two calling conventions.
In the first, one dimension of the source tensor s is merged; the result is tensor t, whereToMerge names the dimension merged into, and leadingDim names the dimension being folded in, e.g. (N/2, 2, M) -> (N, M). The parameters are listed below:
```
void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim = -1)
XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim = -1)
```
Parameters:
* s - input tensor
* t - output tensor
* whereToMerge - the dimension merged into
* leadingDim - the dimension folded in
In the second, the tensors to merge are stored in the list smalls and the result is tensor big; whereToMerge names the dimension merged along, e.g. 2 * (N/2, M) -> (N, M). The parameters are listed below:
```
void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
XTensor Merge(const XList &smalls, int whereToMerge)
```
Parameters:
* smalls - the list of tensors to merge
* big - output tensor
* whereToMerge - the dimension merged along
##### Merge example
The first form, where s is the tensor to merge and t the result; 1 selects dimension 1 as the merge target and 0 folds dimension 0 in:
```
/* call merge function */
t = Merge(*s, 1, 0);
```
The second form, where sList is the list of tensors to merge and t the result; 0 merges along dimension 0:
```
/* call merge function */
t = Merge(*sList, 0);
```
For detailed example code of merge, see:
NiuTrans.Tensor/Tensor/test/TMerge.cpp
#### Split (Split)
##### What is the split operation on tensors?
Splitting works along a given dimension: it can split one tensor into a tensor of different shape, or split a large tensor into a list of n smaller tensors.
In the first case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 parts, yielding a \\(2 \times 2 \times 3\\) tensor:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\begin{aligned}
\Biggl( & \left(
\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right),
\\\\ & \left(
\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}
\right) \Biggr)
\end{aligned}
$$
In the second case, a \\(4 \times 3\\) tensor is split along dimension 0 into 2 parts, yielding two \\(2 \times 3\\) tensors:
$$
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\\\\0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right) \rightarrow
\left(\begin{matrix}0.0 & 1.0 & 2.0\\\\3.0 & 4.0 & 5.0\end{matrix}\right) + \left(\begin{matrix}0.1 & 1.1 & 2.1\\\\3.1 & 4.1 & 5.1\end{matrix}\right)
$$
##### Calling Split
NiuTrans.Tensor provides two split operations.
In the first, one dimension of the source tensor is split; the result is tensor t, whereToSplit names the dimension to split, and splitNum is the number of parts, e.g. (N, M) -> (N/3, M, 3). The parameters are:
```
void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
```
Parameters:
* s - input tensor
* t - output tensor
* whereToSplit - the dimension to split
* splitNum - the number of parts
In the second, the tensor big is split along the dimension whereToSplit; the result is the list smalls of smaller tensors, with splitNum the number of parts, e.g. (N, M) -> 2 * (N/2, M). The parameters are:
```
void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
XList SplitList(const XTensor &big, int whereToSplit, int splitNum)
```
Parameters:
* big - input tensor
* smalls - the list holding the split-out tensors
* whereToSplit - the dimension to split
* splitNum - the number of parts
##### Split example
The first form, where s is the tensor to split and t the result; 0 splits along dimension 0 and 2 is the number of parts:
```
/* call split function */
t = Split(*s, 0, 2);
```
The second form, where s is the tensor to split and tList the list holding the results; 1 splits along dimension 1 and 2 is the number of parts:
```
/* call split function */
Split(*s, tList, 1, 2);
```
For detailed example code of split, see:
NiuTrans.Tensor/Tensor/test/TSplit.cpp
#### Unsqueeze
##### Calling Unsqueeze
NiuTrans.Tensor provides the Unsqueeze operation. The calling conventions and parameters are as follows:
```
void _Unsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
XTensor Unsqueeze(const XTensor &a, int dim, int dSize)
```
Parameters:
* a - input tensor
* b - output tensor
* dim - the position at which the new dimension is inserted
* dSize - the size of the inserted dimension
##### Unsqueeze example
Example code for Unsqueeze, where s is the input tensor and t1, t2 the outputs; the two lines insert a dimension of size 2 at positions 1 and 2 respectively:
```
/* call Unsqueeze function */
t1 = Unsqueeze(*s, 1, 2);
t2 = Unsqueeze(*s, 2, 2);
```
For detailed example code of Unsqueeze, see:
NiuTrans.Tensor/Tensor/test/TUnsqueeze.cpp
### Sorting Operations (sort)
This part covers sorting-related functions such as sort and topk.
#### Sort (Sort)
##### Calling Sort
NiuTrans.Tensor provides the Sort operation, which orders the elements of a tensor along a given dimension. The calling conventions and parameters are as follows:
```
void _Sort(const XTensor * a, XTensor * b, XTensor * index, int dim)
void _SortMe(XTensor * a, XTensor * index, int dim)
void Sort(XTensor & a, XTensor & b, XTensor & index, int dim)
```
Parameters:
* a - input tensor
* b - output tensor
* index - the indices of the elements in the output tensor
* dim - the dimension along which to sort
##### Sort example
Example code for Sort, where a is the tensor being operated on and index holds the indices of the elements in the result; here sorting is along dimension 0:
```
/* call Sort function */
Sort(*a, b, *index, 0);
```
For detailed example code of Sort, see NiuTrans.Tensor/Tensor/test/TSort.cpp
#### TopK
##### Calling TopK
NiuTrans.Tensor provides the TopK operation. The calling conventions and parameters are as follows:
```
void _TopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
void TopK(XTensor &a, XTensor &b, XTensor &index, int dim, int k)
```
Parameters:
* a - input tensor
* b - output tensor
* index - the indices of the results
* dim - the dimension along which TopK operates
* k - the number of largest values to take
##### TopK example
Example code for TopK, where s is the input tensor, t the output, and index the result indices; here the top k values are taken along dimension dim:
```
/* call TopK function */
int dim = 0;
int k = inputDimSize[dim];
TopK(s, t, index, dim, k);
```
For detailed example code of TopK, see NiuTrans.Tensor/Tensor/test/TTopK.cpp
### Activation Functions (function)
This part covers the activation functions and loss functions.
#### HardTanH
##### What is HardTanH?
HardTanH is an activation function, defined as:
>y = 1 if x > 1; y = x if -1 <= x <= 1; y = -1 if x < -1
##### Calling HardTanH
NiuTrans.Tensor provides the HardTanH activation function. The calling conventions and parameters are as follows:
```
void _HardTanH(const XTensor * x, XTensor * y)
XTensor HardTanH(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### HardTanH example
Example code for HardTanH, where x is the input tensor and y the output tensor:
```
/* call hardtanh function */
y = HardTanH(*x);
```
For detailed example code of HardTanH, see:
NiuTrans.Tensor/Tensor/test/THardTanH.cpp
#### Identity
##### What is Identity?
Identity is an activation function, defined as:
>y = x
##### Calling Identity
NiuTrans.Tensor provides the Identity activation function. The calling conventions and parameters are as follows:
```
void _Identity(const XTensor * x, XTensor * y)
XTensor Identity(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### Identity example
Example code for Identity, where x is the input tensor and y the output tensor:
```
/* call Identity function */
y = Identity(*x);
```
For detailed example code of Identity, see:
NiuTrans.Tensor/Tensor/test/TIdentity.cpp
#### LogSoftmax
##### What is LogSoftmax?
LogSoftmax is an activation function, defined as:
>y = log(e^x / \sum e^x)
##### Calling LogSoftmax
NiuTrans.Tensor provides the LogSoftmax activation function. The calling conventions and parameters are as follows:
```
void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
XTensor LogSoftmax(const XTensor &x, int leadDim)
```
Parameters:
* x - input tensor
* y - output tensor
* leadDim - the dimension along which the operation is performed
##### LogSoftmax example
Example code for LogSoftmax, where x is the input tensor and y the output tensor; here LogSoftmax is applied along dimension 1:
```
/* call LogSoftmax function */
y = LogSoftmax(*x, 1);
```
For detailed example code of LogSoftmax, see:
NiuTrans.Tensor/Tensor/test/TLogSoftmax.cpp
#### Loss
##### What is Loss?
A loss function measures the quality of a neural network model and serves as its optimization objective. The available definitions are:
>squared error : loss = sum_{i} 0.5*(gold_i - output_i)^2 <br />
cross entropy : loss = sum_{i} (-gold_i * log(output_i)) <br />
one hot error : loss = sum_{i} e_i, where e_i = 0.5*(t_i - y_i)^2 if t_i = 1, and e_i = 0 otherwise
##### Calling Loss
NiuTrans.Tensor provides the loss computation. The calling convention and parameters are as follows:
```
DTYPE _LossCompute(XTensor * gold, XTensor * output, LOSS_FUNCTION_NAME LFName, bool isLogOutput, int leadDim, int gBeg, int gLen, int oBeg)
```
Parameters:
* gold - the gold-standard answers
* output - the model's predictions
* LFName - the name of the loss function
* isLogOutput - whether the output is in log form
* leadDim - the dimension along which the loss is computed
* gBeg - starting position of the gold-standard answers along leadDim
* gLen - length of the gold-standard segment starting at gBeg
* oBeg - starting position of the model predictions along leadDim
##### Loss example
Example code for the loss computation:
```
/* call LossCompute function */
error = _LossCompute(gold, output, SQUAREDERROR, false, 0, 0, dimSize[0], 0);
```
For detailed example code of Loss, see:
NiuTrans.Tensor/Tensor/test/TLoss.cpp
#### Rectify
##### What is Rectify?
Rectify is an activation function, defined as:
>y = max(0, x)
##### Calling Rectify
NiuTrans.Tensor provides the Rectify activation function. The calling conventions and parameters are as follows:
```
void _Rectify(const XTensor * x, XTensor * y)
XTensor Rectify(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### Rectify example
Example code for Rectify, where x is the input tensor and y the output tensor:
```
/* call Rectify function */
y = Rectify(*x);
```
For detailed example code of Rectify, see:
NiuTrans.Tensor/Tensor/test/TRectify.cpp
#### Sigmoid
##### What is Sigmoid?
Sigmoid is an activation function, defined as:
>y = 1 / (1 + e^{-x})
##### Calling Sigmoid
NiuTrans.Tensor provides the Sigmoid activation function. The calling conventions and parameters are as follows:
```
void _Sigmoid(const XTensor * x, XTensor * y)
XTensor Sigmoid(const XTensor &x)
```
Parameters:
* x - input tensor
* y - output tensor
##### Sigmoid example
Example code for Sigmoid, where x is the input tensor and y the output tensor:
```
/* call Sigmoid function */
y = Sigmoid(*x);
```
Detailed example code for Sigmoid can be found in the NiuTrans.Tensor test directory.
#### Softmax
##### What is Softmax?
Softmax is an activation function, defined as:
>y = e^x / \sum e^x
##### Calling Softmax
NiuTrans.Tensor provides the Softmax activation function. The calling conventions and parameters are as follows:
```
void _Softmax(const XTensor * x, XTensor * y, int leadDim)
XTensor Softmax(const XTensor &x, int leadDim)
```
Parameters:
* x - input tensor
* y - output tensor
* leadDim - the dimension along which the operation is performed
##### Softmax example
Example code for Softmax, where x is the input tensor and y the output tensor; here Softmax is applied along dimension 1:
```
/* call Softmax function */
y = Softmax(*x, 1);
```
For detailed example code of Softmax, see:
NiuTrans.Tensor/Tensor/test/TSoftmax.cpp
## Advanced Techniques
### Memory Pool
Memory is an essential resource for any running piece of software, and how efficiently it is managed has a large impact on overall system performance, especially on speed. Mainstream programming languages provide system-level interfaces for memory management (malloc and free in C, new and delete in C++, and so on), but because these interfaces must handle every possible usage pattern, they are not always the best fit for a particular workload (for example, one with strict speed requirements). Using the system-level interfaces directly has the following drawbacks:

1. Allocation and deallocation are expensive: to keep memory well utilized, the operating system selects and merges candidate memory blocks on every allocation or deallocation, and this extra work makes frequent memory operations costly.
2. Program efficiency degrades: because the requested block sizes vary, frequent use of the system-level interfaces tends to fragment the address space and slow the whole system down.
3. Memory leaks become easy to introduce: space obtained through the system-level interfaces generally has to be released explicitly by the developer, and a single oversight causes a leak; careful analysis of memory usage and dedicated leak-detection tools are therefore required.

Furthermore, when the system also manages memory on a GPU device, allocation and deallocation cost even more than for ordinary host memory: a device-memory operation interrupts what the CPU is doing and hands control to the GPU, so it is far more expensive than a host allocation, and frequent device-memory operations slow the system down even more severely.

To address these problems, the system supports a memory pool for managing storage space (both host memory and GPU memory). The idea is to request one large block from the system up front and let the program itself (the pool) manage that space. Allocations and releases then no longer require frequent calls into the system interfaces, which removes the cost of interrupts and best-block searches and also avoids fragmentation. Moreover, since the pool itself is requested only once, it cannot produce large-scale leaks across the system, which helps stability.

Concretely, using the memory pool (XMem) in the NiuTrans.Tensor toolkit takes only three steps: defining the pool, using it, and releasing it.

* Defining a memory pool

In the simplest case, defining a pool only requires a device ID, as in the following snippet.
```
// define a memory pool mem of type XMem
XMem * mem = new XMem(devID);
```
If more control is needed, the pool can be configured at definition time through parameters such as myMode, myBlockSize, myBlockNum and myBufSize, which set the usage mode, the block size, the number of blocks and the buffer size.

* Using a memory pool

Once the pool is defined, variables can be created and used on it. Taking tensor definition as an example:
```
// declare a variable tensor of type XTensor
XTensor tensor;

// initialize it on the memory pool as a 50 * 100 matrix (an order-2 tensor)
InitTensor2D(&tensor, 50, 100, X_FLOAT, -1, mem);
```
Compared with the earlier definitions that did not use a memory pool, the only change is to name the pool at definition time; no further work is needed.

* Releasing a memory pool

To release the pool completely, simply delete it:
```
// delete the memory pool mem
delete mem;
```
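Putting the three steps together, a minimal end-to-end sketch looks as follows (devID = -1 selects the CPU, and the single delete releases everything allocated on the pool):
```
/* define a memory pool on the CPU */
XMem * mem = new XMem(-1);

/* initialize a 50 * 100 matrix (an order-2 tensor) on the pool */
XTensor tensor;
InitTensor2D(&tensor, 50, 100, X_FLOAT, -1, mem);

/* ... computations on the tensor ... */

/* release the pool, and with it all space it manages */
delete mem;
```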
## Example 1: Matrix Multiplication
This example walks through a matrix multiplication: we first define the tensor dimensions, then initialize two matrices of sizes 2*3 and 3*2, assign their values with the SetData() method, and finally compute their product.
The complete matrix multiplication code is in NiuTrans.Tensor/Tensor/sample/mul/.
```
#include "mul.h"
namespace nts
{
void sampleMUL1()
{
DTYPE aData[2][3] = { { 1.0F, 2.0F, 3.0F },
{ -4.0F, 5.0F, 6.0F } };
DTYPE bData[3][2] = { { 0.0F, -1.0F },
{ 1.0F, 2.0F },
{ 2.0F, 1.0F } };
DTYPE answer[2][2] = { { 8.0F, 6.0F },
{ 17.0F, 20.0F } };
/* a source tensor of size (2, 3) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 3;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a source tensor of size (3, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 3;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
/* a target tensor of size (2, 2) */
int resultOrder = 2;
int * resultDimSize = new int[resultOrder];
resultDimSize[0] = 2;
resultDimSize[1] = 2;
int resultUnitNum = 1;
for (int i = 0; i < resultOrder; i++)
resultUnitNum *= resultDimSize[i];
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * result = NewTensor(resultOrder, resultDimSize);
/* initialize variables */
a->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
result->SetZeroAll();
/* call MatrixMul function */
_MatrixMul(a, X_NOTRANS, b, X_NOTRANS, result);
result->Dump(stderr, "result:");
/* destroy variables */
delete[] aDimSize;
delete[] bDimSize;
delete[] resultDimSize;
delete a;
delete b;
delete result;
}
```
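As a quick sanity check against the answer array: the first row of the product is (1·0 + 2·1 + 3·2, 1·(-1) + 2·2 + 3·1) = (8, 6) and the second row is (-4·0 + 5·1 + 6·2, -4·(-1) + 5·2 + 6·1) = (17, 20), which is exactly what result->Dump prints.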
## Example 2: A Feed-Forward Neural Network
Below we implement a simple feed-forward neural network language model.
Language modeling is the task of building a mathematical model of a language. Before neural networks appeared, language models were usually built with statistical methods. The most common are n-gram models, which count the co-occurrence frequencies of word sequences in text and apply smoothing to correct the estimates for unseen word combinations, finally producing probabilities for consecutive words in the language. Compared with traditional statistical models, a neural language model learns similarities between words while learning their co-occurrence; relative to smoothing, it predicts unseen combinations of known words more effectively and achieves better performance.
The neural language model was first systematically proposed and studied in depth by Bengio et al. Its overall structure is similar to an ordinary feed-forward network: an input layer, hidden layers and an output layer, with connections between layers; each layer maps the vector it receives into another space as its output.
The main control flow of the feed-forward neural network language model is as follows:
```
int FNNLMMain(int argc, const char ** argv)
{
    if(argc == 0)
        return 1;

    FNNModel model;

    /* load arguments */
    LoadArgs(argc, argv, model);

    /* check the setting */
    Check(model);

    /* initialize model parameters */
    Init(model);

    /* learn model parameters */
    if(strcmp(trainFN, ""))
        Train(trainFN, shuffled, model);

    /* save the final model */
    if(strcmp(modelFN, "") && strcmp(trainFN, ""))
        Dump(modelFN, model);

    /* load the model if necessary */
    if(strcmp(modelFN, ""))
        Read(modelFN, model);

    /* test the model on the new data */
    if(strcmp(testFN, "") && strcmp(outputFN, ""))
        Test(testFN, outputFN, model);

    return 0;
}
```
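As the main() dispatch shown later in this document indicates, this entry point is reached by passing the -fnnlm flag on the command line; the remaining arguments are handed to LoadArgs, whose option names are not reproduced here.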
Initializing the model parameters:
```
/* initialize the model */
void Init(FNNModel &model)
{
    /* create embedding parameter matrix: vSize * eSize */
    InitModelTensor2D(model.embeddingW, model.vSize, model.eSize, model);

    /* create hidden layer parameter matrices */
    for(int i = 0; i < model.hDepth; i++){
        /* hidden layer parameter matrix: (n-1)eSize * hSize if it is the first layer,
           hSize * hSize otherwise */
        if(i == 0)
            InitModelTensor2D(model.hiddenW[i], (model.n - 1) * model.eSize, model.hSize, model);
        else
            InitModelTensor2D(model.hiddenW[i], model.hSize, model.hSize, model);

        /* bias term: a row vector of hSize entries */
        InitModelTensor1D(model.hiddenB[i], model.hSize, model);
    }

    /* create the output layer parameter matrix and bias term */
    int iSize = model.hDepth == 0 ? (model.n - 1) * model.eSize : model.hSize;
    InitModelTensor2D(model.outputW, iSize, model.vSize, model);
    InitModelTensor1D(model.outputB, model.vSize, model);

    /* then, we initialize model parameters using a uniform distribution in the range
       [-minmax, minmax] */
    model.embeddingW.SetDataRand(-minmax, minmax);
    model.outputW.SetDataRand(-minmax, minmax);
    for(int i = 0; i < model.hDepth; i++)
        model.hiddenW[i].SetDataRand(-minmax, minmax);

    /* all bias terms are set to zero */
    model.outputB.SetZeroAll();
    for(int i = 0; i < model.hDepth; i++)
        model.hiddenB[i].SetZeroAll();
}
```
The training procedure:
```
void Train(const char * train, bool isShuffled, FNNModel &model)
{
    char name[MAX_NAME_LENGTH];

    /* shuffle the data */
    if(isShuffled){
        sprintf(name, "%s-tmp", train);
        Shuffle(train, name);
    }
    else
        strcpy(name, train);

    int epoch = 0;
    int step = 0;
    int wordCount = 0;
    int wordCountTotal = 0;
    int ngramNum = 1;
    float loss = 0;
    bool isEnd = false;

    NGram * ngrams = new NGram[MAX_LINE_LENGTH_HERE];

    /* make a model to keep gradients */
    FNNModel grad;
    Copy(grad, model);

    /* XNet for automatic differentiation */
    XNet autoDiffer;

    double startT = GetClockSec();

    /* iterate for a number of epochs */
    for(epoch = 0; epoch < nEpoch; epoch++){

        /* data file */
        FILE * file = fopen(name, "rb");
        CheckErrors(file, "Cannot open the training file");

        wordCount = 0;
        loss = 0;
        ngramNum = 1;

        while(ngramNum > 0){

            /* load a minibatch of ngrams */
            ngramNum = LoadNGrams(file, model.n, ngrams, sentBatch, wordBatch);

            if (ngramNum <= 0)
                break;

            /* previous n - 1 words */
            XTensor inputs[MAX_N_GRAM];

            /* the predicted word */
            XTensor output;

            /* the gold standard */
            XTensor gold;

            /* make the input tensor for position i */
            for(int i = 0; i < model.n - 1; i++)
                MakeWordBatch(inputs[i], ngrams, ngramNum, i, model.vSize, model.devID, model.mem);

            /* make the gold tensor */
            MakeWordBatch(gold, ngrams, ngramNum, model.n - 1, model.vSize, model.devID, model.mem);

            if(!autoDiff){
                /* prepare an empty network for building the fnn */
                FNNNet net;

                /* gradient = 0 */
                Clear(grad);

                /* forward computation */
                Forward(inputs, output, model, net);

                /* backward computation to obtain gradients */
                Backward(inputs, output, gold, CROSSENTROPY, model, grad, net);

                /* update model parameters */
                Update(model, grad, learningRate, false);
            }
            else{
                /* forward + backward process */
                ForwardAutoDiff(inputs, output, model);

                /* automatic differentiation */
                autoDiffer.Backward(output, gold, CROSSENTROPY);

                /* update model parameters */
                Update(model, grad, learningRate, true);
            }

            /* get probabilities */
            float prob = GetProb(output, gold);

            loss += -prob;
            wordCount += ngramNum;
            wordCountTotal += ngramNum;

            if(++step >= nStep){
                isEnd = true;
                break;
            }

            if (step % 100 == 0) {
                double elapsed = GetClockSec() - startT;
                XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, ngram=%d, ppl=%.3f\n",
                        elapsed, step, epoch + 1, wordCountTotal, exp(loss / wordCount));
            }
        }

        fclose(file);

        if(isEnd)
            break;
    }

    double elapsed = GetClockSec() - startT;
    XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, ngram=%d, ppl=%.3f\n",
            elapsed, step, epoch, wordCountTotal, exp(loss / wordCount));
    XPRINT3(0, stderr, "[INFO] training finished (took %.1fs, step=%d and epoch=%d)\n",
            elapsed, step, epoch);

    delete[] ngrams;
}
```
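Note that Train supports two modes selected by the autoDiff flag: a hand-written backward pass (Clear, Forward, Backward, Update) and automatic differentiation (ForwardAutoDiff followed by autoDiffer.Backward); only the final boolean passed to Update differs between the two branches.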
Only the main parts of the code are walked through here; see NiuTrans.Tensor/source/sample/FNNLM.cpp for the full implementation.
Forward pass of the feed-forward network: after data preparation we have the language-model input (the previous n-1 words). Multiplying the input by the input-layer weight matrix w1 (the word embeddings) gives a vector representation of each input word:
>embedding = input * w1
The n-1 word vectors are then concatenated to form the final output of the input layer.
In the same way, the input-layer output is passed through the hidden layers and the output layer to obtain the final result:
>h = tanh(h_pre * w2 + b)
>y = softmax(h_last * w3)
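For concreteness, suppose n = 3 (two history words), eSize = 100, hSize = 256 and vSize = 10000 (illustrative numbers, not defaults taken from the source). Each input is then a batchSize × 10000 one-hot matrix, each embedding is batchSize × 100, the concatenated input-layer output is batchSize × 200, every hidden layer produces batchSize × 256 states, and the output layer yields batchSize × 10000 probabilities.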
The forward pass is implemented as follows:
```
/*
forward procedure
>> inputs - input word representations
>> output - output probability
>> model - the fnn model
>> net - the network that keeps the internal tensors generated in the process
*/
void Forward(XTensor inputs[], XTensor &output, FNNModel &model, FNNNet &net)
{
    int batchSize = -1;
    int n = model.n;
    int depth = model.hDepth;
    XList eList(n - 1);

    /* previous n - 1 words */
    for(int i = 0; i < n - 1; i++){
        XTensor &input = inputs[i];
        XTensor &w = model.embeddingW;
        XTensor &embedding = net.embeddings[i];

        if(batchSize == -1)
            batchSize = input.dimSize[0];
        else{
            CheckErrors(batchSize == input.dimSize[0], "Wrong input word representations!");
        }

        /* embedding output tensor of position i */
        InitModelTensor2D(embedding, batchSize, model.eSize, model);

        /* generate word embedding of position i:
           embedding = input * w */
        _MatrixMul(&input, X_NOTRANS, &w, X_NOTRANS, &embedding);

        eList.Add(&net.embeddings[i]);
    }

    /* concatenate word embeddings
       embeddingcat = cat(embedding_0...embedding_{n-1}) */
    InitModelTensor2D(net.embeddingCat, batchSize, (n - 1) * model.eSize, model);
    _Concatenate(&eList, &net.embeddingCat, 1);

    /* go over each hidden layer */
    for(int i = 0; i < depth; i++){
        XTensor &h_pre = i == 0 ? net.embeddingCat : net.hiddens[i - 1];
        XTensor &w = model.hiddenW[i];
        XTensor &b = model.hiddenB[i];
        XTensor &h = net.hiddens[i];
        XTensor &s = net.hiddenStates[i];

        InitModelTensor2D(h, batchSize, model.hSize, model);
        InitModelTensor2D(s, batchSize, model.hSize, model);

        /* generate hidden states of layer i:
           s = h_pre * w */
        _MatrixMul(&h_pre, X_NOTRANS, &w, X_NOTRANS, &s);

        /* make a 2d tensor for the bias term */
        XTensor b2D;
        InitTensor(&b2D, &s);
        _Unsqueeze(&b, &b2D, 0, batchSize);

        /* introduce bias term:
           s = s + b
           NOTE: the trick here is to extend b to a 2d tensor
                 to fit into the 2d representation in tensor summation */
        _Sum(&s, &b2D, &s);

        /* pass the state through the hard tanh function:
           h = tanh(s) */
        _HardTanH(&s, &h);
    }

    /* generate the output Pr(w_{n-1}|w_0...w_{n-2}):
       y = softmax(h_last * w)
       Note that this is the implementation as that in Bengio et al.'s paper.
       TODO: we add bias term here */
    {
        XTensor &h_last = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
        XTensor &w = model.outputW;
        XTensor &b = model.outputB;
        XTensor &s = net.stateLast;
        XTensor &y = output;

        InitModelTensor2D(s, batchSize, model.vSize, model);
        InitModelTensor2D(y, batchSize, model.vSize, model);

        /* s = h_last * w */
        _MatrixMul(&h_last, X_NOTRANS, &w, X_NOTRANS, &s);

        XTensor b2D;
        InitTensor(&b2D, &s);
        _Unsqueeze(&b, &b2D, 0, batchSize);

        _Sum(&s, &b2D, &s);

        /* y = softmax(s) */
        _LogSoftmax(&s, &y, 1);
    }
}
```
Backward pass: we first compute the total loss L from the forward output and the gold standard, and then apply gradient descent: back-propagation yields the derivative ∂L/∂w of L with respect to the parameters w of every layer, after which each parameter is updated by
>w_{k+1} = w_k - η * ∂L/∂w_k
where η is the learning rate.
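For instance, with learning rate η = 0.1, a parameter w_k = 1.0 and gradient ∂L/∂w_k = 0.5, the rule gives w_{k+1} = 1.0 - 0.1 × 0.5 = 0.95.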
The backward pass and the subsequent parameter update are implemented as follows:
```
/*
backward procedure
>> inputs - input word representations
>> output - output probability
>> gold - gold standard
>> loss - loss function name
>> model - the fnn model
>> grad - the model that keeps the gradient information
>> net - the network that keeps the internal tensors generated in the process
*/
void Backward(XTensor inputs[], XTensor &output, XTensor &gold, LOSS_FUNCTION_NAME loss,
              FNNModel &model, FNNModel &grad, FNNNet &net)
{
    int batchSize = output.GetDim(0);
    int n = model.n;
    int depth = model.hDepth;

    /* back-propagation for the output layer */
    XTensor &y = output;
    XTensor &s = net.stateLast;
    XTensor &x = depth > 0 ? net.hiddens[depth - 1] : net.embeddingCat;
    XTensor &w = model.outputW;
    XTensor &dedw = grad.outputW;
    XTensor &dedb = grad.outputB;
    XTensor deds(&y);
    XTensor dedx(&x);

    /* for y = softmax(s), we get dE/ds
       where E is the error function (defined by loss) */
    _LogSoftmaxBackward(&gold, &y, &s, NULL, &deds, 1, loss);

    /* for s = x * w, we get
       dE/dw_{i,j} = dE/ds_j * ds_j/dw_{i,j}
                   = dE/ds_j * x_{i}
       (where i and j are the row and column indices, and
       x is the top most hidden layer)
       so we know
       dE/dw = x^T * dE/ds */
    _MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);

    /* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
       specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
    _ReduceSum(&deds, &dedb, 0);

    /* then, we compute
       dE/dx_{j} = \sum_j' (dE/ds_{j'} * ds_{j'}/dx_j)
                 = \sum_j' (dE/ds_{j'} * w_{j, j'})
       i.e.,
       dE/dx = dE/ds * w^T */
    _MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);

    XTensor &gradPassed = dedx;
    XTensor dedsHidden;
    XTensor dedxBottom;
    if (depth > 0)
        InitTensor(&dedsHidden, &dedx);
    InitTensor(&dedxBottom, &net.embeddingCat);

    /* back-propagation from top to bottom in the stack of hidden layers
       for each layer, h = f(s)
                       s = x * w + b */
    for (int i = depth - 1; i >= 0; i--) {
        XTensor &h = net.hiddens[i];
        XTensor &s = net.hiddenStates[i];
        XTensor &x = i == 0 ? net.embeddingCat : net.hiddenStates[i - 1];
        XTensor &w = model.hiddenW[i];
        XTensor &dedh = gradPassed;  // gradient passed through the previous layer
        XTensor &dedx = i == 0 ? dedxBottom : dedh;
        XTensor &deds = dedsHidden;
        XTensor &dedw = grad.hiddenW[i];
        XTensor &dedb = grad.hiddenB[i];

        /* backpropagation through the activation function:
           dE/ds = dE/dh * dh/ds */
        _HardTanHBackward(NULL, &h, &s, &dedh, &deds, NOLOSS);

        /* gradient of the weight: dE/dw = x^T * dE/ds */
        _MatrixMul(&x, X_TRANS, &deds, X_NOTRANS, &dedw);

        /* gradient of the bias: dE/db = dE/ds * 1 = dE/ds
           specifically dE/db_{j} = \sum_{i} dE/ds_{i,j} */
        _ReduceSum(&deds, &dedb, 0);

        /* gradient of the input: dE/dx = dE/ds * w^T */
        _MatrixMul(&deds, X_NOTRANS, &w, X_TRANS, &dedx);

        if (i > 0)
            _CopyValues(&dedx, &gradPassed);
    }

    XList eList(n - 1);

    /* back-propagation for the embedding layer */
    for (int i = 0; i < n - 1; i++) {
        XTensor * dedy = NewTensor2D(batchSize, model.eSize, X_FLOAT, model.devID, model.mem);
        eList.Add(dedy);
    }

    /* gradient of the concatenation of the embedding layers */
    XTensor &dedyCat = depth > 0 ? dedxBottom : dedx;

    /* split the concatenation of gradients of the embeddings */
    _Split(&dedyCat, &eList, 1, n - 1);

    /* go over for each word */
    for (int i = 0; i < n - 1; i++) {
        XTensor * dedy = (XTensor*)eList.GetItem(i);
        XTensor &x = inputs[i];
        XTensor &dedw = grad.embeddingW;

        /* gradient of the embedding weight: dE/dw += x^T * dE/dy
           NOTE that we accumulate dE/dw here because the matrix w
           is shared by several layers (or words) */
        _MatrixMul(&x, X_TRANS, dedy, X_NOTRANS, &dedw, 1.0F, 1.0F);

        delete dedy;
    }
}
```
## Example 3: Recurrent Neural Networks
## The NiuTrans.Tensor Team
* 肖桐
* 李垠桥
* 许晨
* 姜雨帆
* 林野
* 张裕浩
* 胡驰
## Appendix
The member variables defined in XTensor.h are listed below:
| Member variable | Function |
| - | - |
| int id | ID of the tensor |
| XMem * mem | memory pool that the tensor uses |
| void * data | array that keeps the tensor elements |
| void * dataHost | copy of the data on host memory, activated only when the tensor runs on a GPU |
| void ** dataP | pointer to the address of the data |
| int devID | device ID: the CPU or GPU device on which the tensor's space is allocated; -1 means CPU |
| int order | order of the tensor, e.g., a matrix (of dimension 2) is an order-2 tensor |
| int dimSize[ ] | size of each dimension; index 0 is the first dimension |
| int dimSizeRDI[ ] | size of each dimension in transposed (reversed-index) mode; index 0 is the first dimension |
| TENSOR_DATA_TYPE dataType | data type of each data unit |
| int unitSize | size of a data unit, similar to sizeof() |
| int unitNum | number of data units |
| bool isSparse | whether the tensor is sparse |
| int unitNumNonZero | number of non-zero elements in a sparse matrix |
| float denseRatio | density: the ratio of non-zero units, a real number between 0 and 1; 0 means all units are zero and 1 means all units are non-zero |
| bool isShared | whether the data array is shared with other tensors |
| bool isDefaultDType | whether the data type used in the matrix is the default data type |
| bool isInGlobalMem | whether the data resides in global memory rather than in a memory pool |
| bool isAllValued[ ] | whether every dimension of the sparse matrix has non-zero elements |
| bool isInit | whether the tensor has been initialized |
| bool isTmp | whether the tensor was created as a temporary |
| bool isGrad | whether the tensor keeps its gradient when used as a model parameter |
| unsigned int visitMark | node visiting flag |
| XTensor * grad | gradient for back-propagation |
| XLink income | incoming hyperedge |
| XLink outgo | outgoing hyperedge |
The methods defined in XTensor.h are listed below:
| Function | Method | Parameters |
| - | - | - |
| constructor | XTensor() | N/A |
| destructor | ~XTensor() | N/A |
| initialize member variables | void Init() | N/A |
| destroy the data array | void DestroyData() | N/A |
| shallow copy of a tensor | void ShallowCopy(<br>const XTensor &tensor) | tensor - the tensor to copy from |
| overloaded assignment operator | XTensor& operator= (<br>const XTensor &tensor) | tensor - the tensor on the right-hand side |
| overloaded addition operator | XTensor operator+ (<br>const XTensor &tensor) | tensor - the tensor on the right-hand side |
| overloaded multiplication operator | XTensor operator* (<br>const XTensor &tensor) | tensor - the tensor on the right-hand side |
| linear transformation | XTensor Lin(<br>DTYPE scale, DTYPE shift = 0) | scale - the scaling factor <br> shift - the shift |
| check whether two tensors have the<br>same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b) | a - the first tensor to compare <br> b - the second tensor to compare |
| check whether three tensors have the<br>same data type and size | static bool IsIdentical(<br> XTensor * a, XTensor * b, XTensor * c) | a - the first tensor to compare <br> b - the second tensor to compare <br> c - the third tensor to compare |
| set the size of each dimension | void SetDim(<br>int * myDimSize) | myDimSize - size of each dimension |
| get the size of a given dimension | int GetDim(<br>const int dim) | dim - the dimension in question |
| reshape the tensor | void Reshape(<br> const int order, const int * myDimSize) | order - order of the tensor <br> myDimSize - size of each dimension |
| get the number of elements | int GetSize() | N/A |
| get the size of the memory used | int GetDataSizeInChar() | N/A |
| initialize the tensor with a<br>uniform distribution | void SetDataRand(<br> DTYPE lower, DTYPE upper) | lower - the lower bound <br> upper - the upper bound |
| initialize the tensor with a<br>normal distribution | void SetDataRandn(<br> DTYPE mean, DTYPE standardDeviation) | mean - the mean <br> standardDeviation - the standard deviation |
| check whether the elements equal<br>a given answer array | bool CheckData(<br> const void * answer, int num, int beg = 0) | answer - the given array <br> num - size of the array <br> beg - position in the tensor where the check starts |
| set the data pointer | void SetDataPointer() | N/A |
| set the elements along a given<br>dimension in ascending order | void SetAscendingOrder(<br>int dim) | dim - the given dimension |
| get the value of the cell at a given index | DTYPE Get(int index[], int size = -1) | index - the given index <br> size - size of the index array |
| get the pointer to a cell | void * GetCell(<br>int * index, int size) | index - position of the element <br> size - size of the index array |
| get the default-type value of a cell<br>of an order-1 tensor | DTYPE Get1D(<br>int i) | i - index in the first dimension |
| get the default-type value of a cell<br>of an order-2 tensor | DTYPE Get2D(<br>int ni, int mi) const | ni - index in the first dimension <br> mi - index in the second dimension |
| get the default-type value of a cell<br>of an order-3 tensor | DTYPE Get3D(<br>int d0, int d1, int d2) | d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
| get the integer value of a cell<br>of an order-1 tensor | int Get1DInt(<br>int i) | i - index in the first dimension |
| get the integer value of a cell<br>of an order-2 tensor | int Get2DInt(<br>int ni, int mi) | ni - index in the first dimension <br> mi - index in the second dimension |
| get the integer value of a cell<br>of an order-3 tensor | int Get3DInt(<br>int d0, int d1, int d2) | d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
| get a value from a sparse tensor | DTYPE GetInSparse(int i) | i - position of the non-zero element in the sparse matrix |
| get the key of a tuple in a sparse tensor | int GetKeyInSparse(int i) | i - position of the non-zero element in the sparse matrix |
| set the value of a cell | bool Set(<br>DTYPE value, int index[], int size = -1) | value - the value to set <br> index - position of the element <br> size - size of the index array |
| set the value of a cell of an order-1 tensor | bool Set1D(<br>DTYPE value, int i) | value - the value to set <br> i - index in the first dimension |
| set the value of a cell of an order-2 tensor | bool Set2D(<br>DTYPE value, int ni, int mi) | value - the value to set <br> ni - index in the first dimension <br> mi - index in the second dimension |
| set the value of a cell of an order-3 tensor | bool Set3D(<br>DTYPE value, int d0, int d1, int d2) | value - the value to set <br> d0 - index in the first dimension <br> d1 - index in the second dimension <br> d2 - index in the third dimension |
| add a value to a cell of an order-2 tensor | bool Add2D(<br>DTYPE value, int ni, int mi = 0) | value - the value to add <br> ni - row index <br> mi - column index |
| get the number of non-zero elements<br>in a sparse matrix | int GetNonzeroSize() | N/A |
| mark the tensor as a temporary variable | void SetTMP(<br>bool myIsTmp = true) | myIsTmp - whether the tensor is temporary |
| set whether the tensor keeps its gradient | void SetGrad(<br>bool myIsGrad = true) | myIsGrad - whether the gradient is kept |
| resize the matrix | bool Resize(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
| resize the matrix without<br>allocating new space | bool ResizeWithNoData(<br> const int myOrder, <br> const int * myDimSize, <br> const TENSOR_DATA_TYPE myDataType = DEFAULT_DTYPE, <br> const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
| resize the matrix to the size<br>of another matrix | bool Resize(<br> const XTensor * myTensor) | myTensor - the reference tensor |
| create a tensor in the buffer | XTensor * NewTensorBuf( <br> const int myOrder, <br> const int * myDimSize, XMem * myMem, <br> const TENSOR_DATA_TYPE myDataType = <br> X_FLOAT, const float myDenseRatio = 1.0F) | myOrder - order of the tensor <br> myDimSize - size of each dimension; index 0 is the first dimension <br> myMem - memory pool used by the tensor <br> myDataType - data type of the tensor <br> myDenseRatio - density of the tensor; 1 means dense |
| create a new tensor from a given tensor | XTensor * NewTensor(<br>XTensor * a, bool isFilledData = true) | a - the given tensor <br> isFilledData - whether to allocate the data space |
| free the data space of a given tensor | void DelTensor(<br>const XTensor * tensor) | tensor - the given tensor |
| free the data space of a given<br>tensor in the buffer | void DelTensorBuf(<br>const XTensor * tensor) | tensor - the given tensor |
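To make the tables above concrete, here is a small sketch exercising a few of the listed calls (NewTensor, SetDataRand, Get2D, Set2D and DelTensor); it is only an illustration and assumes the headers and namespace described in the installation notes:
```
#include <cstdio>
#include "XTensor.h"

using namespace nts;

void AppendixDemo()
{
    /* create a dense 2 * 3 tensor (order 2) */
    int dimSize[2] = {2, 3};
    XTensor * t = NewTensor(2, dimSize);

    /* fill it with uniform random values in [-1, 1] */
    t->SetDataRand(-1.0F, 1.0F);

    /* read one cell and write back twice its value */
    DTYPE v = t->Get2D(0, 1);
    t->Set2D(v * 2, 0, 1);

    /* print the tensor, then free it */
    t->Dump(stderr, "t:");
    DelTensor(t);
}
```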
#include <stdio.h>
#include "XNet.h"
#include "../tensor/XUtility.h"
#include "../tensor/function/FHeader.h"
#include "../tensor/core/CHeader.h"
#include "../sample/fnnlm/FNNLM.h"
#include "../sample/transformer/Transformer.h"

//#define CRTDBG_MAP_ALLOC
//#include <stdlib.h>
//#include <crtdbg.h>

void TransposeTest();
void SumDimTest();

using namespace nts;
using namespace fnnlm;
using namespace transformer;

int main( int argc, const char ** argv )
{
    if(argc > 1 && !strcmp(argv[1], "-fnnlm"))
        FNNLMMain(argc - 1, argv + 1);
    else if(argc > 1 && !strcmp(argv[1], "-t2t"))
        TransformerMain(argc - 1, argv + 1);
    else{
        fprintf(stderr, "Thanks for using NiuTrans.Network! This is a library for building\n");
        fprintf(stderr, "neural networks in an easy way. \n\n");
        fprintf(stderr, "Or run this program with \"-fnnlm\" for sample FNNLM!\n");
    }

    //_CrtDumpMemoryLeaks();

    return 0;
}

void TransposeTest()
{
#ifdef USE_CUDA
    XMem mem0(0, UNI_FREE, MILLION * 64, 1024, MILLION * 64);
    //XMem mem1(1, UNI_FREE, MILLION * 64, 1024, MILLION * 64);

    XTensor x;
    XTensor y;
    XTensor z;

    int loops = 2000;

    int B = 3 * 2 * 4;
    int K = 8 * 1;
    int N = 50;
    int H = 512 * 4;

    int nnn = GDevs.nGPU;

    InitTensor3D(&x, B, N, H, X_FLOAT, 0);
    InitTensor4D(&y, K, B, N, H/K, X_FLOAT, 0);
    InitTensor3D(&z, B, N, H, X_FLOAT, 0);

    cudaEvent_t ctime0;
    cudaEvent_t ctime1;
    cudaEvent_t ctime2;
    cudaEvent_t ctime3;
    cudaEvent_t ctime4;
    cudaEvent_t ctime5;

    float elapsedSplit = 0.0;
    float elapsedMerge = 0.0;
    float elapsedSum = 0.0;

    cudaEventCreate(&ctime0);
    cudaEventCreate(&ctime1);
    cudaEventCreate(&ctime2);
    cudaEventCreate(&ctime3);
    cudaEventCreate(&ctime4);
    cudaEventCreate(&ctime5);

    cudaEventRecord(ctime0, 0);
    double time0 = GetClock();
    for(int i = 0; i < loops; i++)
        _Split(&x, &y, 2, K);
    double time1 = GetClock();
    cudaEventRecord(ctime1, 0);
    cudaEventSynchronize(ctime1);
    cudaEventElapsedTime(&elapsedSplit, ctime0, ctime1);

    cudaEventRecord(ctime2, 0);
    double time2 = GetClock();
    for(int i = 0; i < loops; i++)
        _Merge(&y, &x, 3);
    double time3 = GetClock();
    cudaEventRecord(ctime3, 0);
    cudaEventSynchronize(ctime3);
    cudaEventElapsedTime(&elapsedMerge, ctime2, ctime3);

    cudaEventRecord(ctime4, 0);
    double time4 = GetClock();
    for(int i = 0; i < loops; i++)
        _Sum(&x, &z, &x);
    double time5 = GetClock();
    cudaEventRecord(ctime5, 0);
    cudaEventSynchronize(ctime5);
    cudaEventElapsedTime(&elapsedSum, ctime4, ctime5);

    fprintf(stderr, "split:%f merge:%f sum:%f\n", time1 - time0, time3 - time2, time5 - time4);
    fprintf(stderr, "split:%f merge:%f sum:%f\n", elapsedSplit, elapsedMerge, elapsedSum);
#endif
}

void SumDimTest()
{
    XTensor x;
    XTensor y;
    XTensor z;

    int a = 5;
    int b = 7;
    int c = 3;

    InitTensor3D(&x, a, b, c, X_FLOAT, -1);
    InitTensor1D(&y, c, X_FLOAT, -1);
    InitTensor3D(&z, a, b, c, X_FLOAT, -1);

    x.SetZeroAll();
    y.SetZeroAll();
    z.SetZeroAll();

    float * data = new float[x.unitNum];
    for(int i = 0; i < x.unitNum; i++)
        data[i] = (DTYPE)i;
    x.SetData(data, x.unitNum);

    for(int i = 0; i < y.unitNum; i++)
        data[i] = -(DTYPE)i;
    y.SetData(data, y.unitNum);

    _SumDim(&x, &y, &z, 2);

    z.Dump(stderr, "z:");

    delete[] data;
}
/* the tail of XFuncGrad::MakeGrad(XTensor * node) */
    else{
        ShowNTErrors("Wrong activation function type!");
    }

    node->visitMark = NODE_FINISHED;
}

/* indicates whether the node is for an activation function */
/* in XMathGrad::MakeGrad(XTensor * node) */
    if(operID == MATH_SUM)
        GradSum(node);
    else if(operID == MATH_SUMDIM)
        GradSumDim(node);
    else if(operID == MATH_MULTIPLY)
        GradMultiply(node);
    else if(operID == MATH_MATRIXMUL)
        GradMatrixMul(node);
    else if(operID == MATH_MATRIXMULBATCHED)
        GradMatrixMulBatched(node);
    else if (operID == MATH_LOG)
        GradLog(node);
    else if (operID == MATH_POWER)
        GradPower(node);
    else if (operID == MATH_NEGATE)
        GradNegate(node);
    else if (operID == MATH_SCALEANDSHIFT)
        GradScaleAndShift(node);
    else if (operID == MATH_DIV)
        GradDiv(node);
    else if (operID == MATH_SUB)
        GradSub(node);
    else if (operID == MATH_SIN)
        GradSin(node);
    else if (operID == MATH_COS)
        GradCos(node);
    else if (operID == MATH_TAN)
        GradTan(node);
    else if (operID == MATH_EXP)
        GradExp(node);
    else if (operID == MATH_NORMALIZE)
        GradNormalize(node);
    else if (operID == MATH_ABSOLUTE)
        GradAbsolute(node);
    else if (operID == MATH_SIGN)
        GradSign(node);
    else if (operID == MATH_ROUND)
        GradRound(node);
    else if (operID == MATH_CLIP)
        GradClip(node);
    else if (operID == REDUCE_REDUCEMEAN)
        GradReduceMean(node);
    else if (operID == REDUCE_REDUCESUM)
        GradReduceSum(node);
    else if (operID == REDUCE_REDUCESUMSQUARED)
        GradReduceSumSquared(node);
    else if (operID == REDUCE_REDUCEVARIANCE)
        GradReduceVariance(node);
    else{
        ShowNTErrors("TODO!");
    }
/* in XMathGrad::GradSum(XTensor * node) */
    XTensor * a = income.tails[0];
    XTensor * b = income.tails[1];
    DTYPE beta = income.GetParam(0);
    XNoder::MakeGrad(a);
    XNoder::MakeGrad(b);

    _Sum(a->grad, node->grad, a->grad);
    _Sum(b->grad, node->grad, b->grad, beta);

    node->visitMark = NODE_FINISHED;
}
/*
gradient for sum with one dimension
c = a + b * \beta
where the size of b is equal to dimension n of a, i.e., |b| = a.dimSize[n]
dE/da = dE/dc
dE/db = dE/dc * b.reduce(0,...,n-1,n+1,...) * \beta
*/
void XMathGrad::GradSumDim(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for SUMDIM!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
int n = income.GetParamInt(0);
DTYPE beta = income.GetParam(1);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_Sum(a->grad, node->grad, a->grad);
int order = a->order;
int dimSize[MAX_TENSOR_DIM_NUM];
memcpy(dimSize, a->dimSize, sizeof(int) * a->order);
if(n == order - 1){
int reshapedSize[MAX_TENSOR_DIM_NUM];
reshapedSize[0] = a->unitNum/dimSize[order - 1];
reshapedSize[1] = dimSize[order - 1];
/* we reshape dE/dc to a matrix whose column number is equal to the
size of b. Then we can reduce the matrix into a row vector. */
node->grad->Reshape(2, reshapedSize);
if(b->outgo.tailNum > 1){
XTensor * bGradTMP = NewTensorBuf(b->grad, b->devID, b->mem);
_ReduceSum(node->grad, bGradTMP, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(bGradTMP, beta);
_Sum(bGradTMP, b->grad, b->grad);
DelTensorBuf(bGradTMP);
}
else{
_ReduceSum(node->grad, b->grad, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(b->grad, beta);
}
node->grad->Reshape(order, dimSize);
}
else{
int reshapedSize[MAX_TENSOR_DIM_NUM];
reshapedSize[0] = 1;
reshapedSize[1] = dimSize[n];
reshapedSize[2] = 1;
for(int i = 0; i < order; i++){
if(i < n)
reshapedSize[0] *= dimSize[i];
}
reshapedSize[2] = a->unitNum / (reshapedSize[0] * reshapedSize[1]);
/* we reshape dE/dc to a 3D tensor of size (x, y, z) where y = |b|.
Then reduce along with z and x to obtain dE/db. */
node->grad->Reshape(3, reshapedSize);
XTensor * interGrad = NewTensorBuf(2, reshapedSize, b->dataType, b->denseRatio, b->devID, b->mem);
_ReduceSum(node->grad, interGrad, 2);
if(b->outgo.tailNum > 1){
XTensor * bGradTMP = NewTensorBuf(b->grad, b->devID, b->mem);
_ReduceSum(interGrad, bGradTMP, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(bGradTMP, beta);
_Sum(bGradTMP, b->grad, b->grad);
DelTensorBuf(bGradTMP);
}
else{
_ReduceSum(interGrad, b->grad, 0);
if(beta != 1.0F)
_ScaleAndShiftMe(b->grad, beta);
}
node->grad->Reshape(order, dimSize);
DelTensorBuf(interGrad);
}
node->visitMark = NODE_FINISHED;
}

/* gradient for multiply (dot production): c = a * b
   the tail of XMathGrad::GradMultiply(XTensor * node) */
    CheckNTErrors(XTensor::IsSameShaped(a, b), "Wrong sized input tensors!");
    _Multiply(node->grad, b, a->grad, 1.0F);
    _Multiply(node->grad, a, b->grad, 1.0F);

    node->visitMark = NODE_FINISHED;
}

/* gradient for matrix multiply: c = matmul(a, b) * \alpha
   in XMathGrad::GradMatrixMul(XTensor * node) */
    XNoder::MakeGrad(a);
    XNoder::MakeGrad(b);

    XTensor * c = node;
    XTensor * dedc = node->grad;
    XTensor * deda = a->grad;
    XTensor * dedb = b->grad;

    if(deda->order == 2 && dedb->order == 2)
        GradMatrixMul(a, deda, transA, b, dedb, transB, dedc, alpha);
    else if(transA == X_NOTRANS && deda->order > 2 && dedb->order == 2){
        int orderBackupA = a->order;
        int orderBackupC = c->order;
        int dimsBackupA[MAX_TENSOR_DIM_NUM];
        int dimsBackupC[MAX_TENSOR_DIM_NUM];
        memcpy(dimsBackupA, a->dimSize, sizeof(int) * a->order);
        memcpy(dimsBackupC, c->dimSize, sizeof(int) * c->order);

        a->Reshape(a->unitNum/a->GetDim(-1), a->GetDim(-1));
        c->Reshape(c->unitNum/c->GetDim(-1), c->GetDim(-1));
        deda->Reshape(a->unitNum/a->GetDim(-1), a->GetDim(-1));
        dedc->Reshape(c->unitNum/c->GetDim(-1), c->GetDim(-1));

        GradMatrixMul(a, deda, transA, b, dedb, transB, dedc, alpha);

        a->Reshape(orderBackupA, dimsBackupA);
        c->Reshape(orderBackupC, dimsBackupC);
        deda->Reshape(orderBackupA, dimsBackupA);
        dedc->Reshape(orderBackupC, dimsBackupC);
    }
    else{
        ShowNTErrors("TODO!");
    }

    node->visitMark = NODE_FINISHED;
}

/*
gradient for matrix multiply: c = matmul(a, b) * \alpha
>> a - as it is
>> deda - dE/da
>> b - as it is
>> dedb - dE/db
>> dedc - dE/dc
>> alpha - the scalar
*/
void XMathGrad::GradMatrixMul(XTensor * a, XTensor * deda, MATRIX_TRANS_TYPE transA,
                              XTensor * b, XTensor * dedb, MATRIX_TRANS_TYPE transB,
                              XTensor * dedc, DTYPE alpha)
{
    /* c = a * b * \alpha */
    if(transA == X_NOTRANS && transB == X_NOTRANS){

        /* dE/da = dE/dc * b^T * \alpha */
        _MatrixMul(dedc, X_NOTRANS, b, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a^T * dE/dc * \alpha */
        _MatrixMul(a, X_TRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b * \alpha */
    else if(transA == X_TRANS && transB == X_NOTRANS){

        /* dE/da = (dE/dc * b^T)^T * \alpha
                 = b * dE/dc^T * \alpha */
        _MatrixMul(b, X_NOTRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a * dE/dc * \alpha */
        _MatrixMul(a, X_NOTRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a * b^T * \alpha */
    else if(transA == X_NOTRANS && transB == X_TRANS){

        /* dE/da = dE/dc * b * \alpha */
        _MatrixMul(dedc, X_NOTRANS, b, X_NOTRANS, deda, alpha, 1.0F);

        /* dE/db = (a^T * dE/dc)^T * \alpha
                 = dE/dc^T * a * \alpha */
        _MatrixMul(dedc, X_TRANS, a, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b^T * \alpha */
    else if(transA == X_TRANS && transB == X_TRANS){

        /* dE/da = (dE/dc * b)^T * \alpha
                 = b^T * dE/dc^T * \alpha */
        _MatrixMul(b, X_TRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = (a * dE/dc)^T * \alpha
                 = dE/dc^T * a^T * \alpha */
        _MatrixMul(dedc, X_TRANS, a, X_TRANS, dedb, alpha, 1.0F);
    }
}

/*
gradient for matrix multiply in batch mode.
for each batch: c_i = matmul(a_i, b_i) * \alpha
we have
dE/da_i = dE/dc_i * b_i^T * \alpha
dE/db_i = a_i^T * dE/dc_i * \alpha
>> node - the node (c) for backward computation
*/
void XMathGrad::GradMatrixMulBatched(XTensor * node)
{
    XLink &income = node->income;
    CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for MULTIPLY!");
    CheckNTErrors(income.paramNum == 3, "Wrong parameter number for MULTIPLY!");

    XTensor * a = income.tails[0];
    XTensor * b = income.tails[1];
    MATRIX_TRANS_TYPE transA = income.GetParamTrans(0);
    MATRIX_TRANS_TYPE transB = income.GetParamTrans(1);
    DTYPE alpha = income.GetParam(2);

    XNoder::MakeGrad(a);
    XNoder::MakeGrad(b);

    XTensor * c = node;
    XTensor * dedc = node->grad;
    XTensor * deda = a->grad;
    XTensor * dedb = b->grad;

    /* c = a * b * \alpha */
    if(transA == X_NOTRANS && transB == X_NOTRANS){

        /* dE/da = dE/dc * b^T * \alpha */
        _MatrixMulBatched(dedc, X_NOTRANS, b, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a^T * dE/dc * \alpha */
        _MatrixMulBatched(a, X_TRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b * \alpha */
    else if(transA == X_TRANS && transB == X_NOTRANS){

        /* dE/da = (dE/dc * b^T)^T * \alpha
                 = b * dE/dc^T * \alpha */
        _MatrixMulBatched(b, X_NOTRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = a * dE/dc * \alpha */
        _MatrixMulBatched(a, X_NOTRANS, dedc, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a * b^T * \alpha */
    else if(transA == X_NOTRANS && transB == X_TRANS){

        /* dE/da = dE/dc * b * \alpha */
        _MatrixMulBatched(dedc, X_NOTRANS, b, X_NOTRANS, deda, alpha, 1.0F);

        /* dE/db = (a^T * dE/dc)^T * \alpha
                 = dE/dc^T * a * \alpha */
        _MatrixMulBatched(dedc, X_TRANS, a, X_NOTRANS, dedb, alpha, 1.0F);
    }

    /* c = a^T * b^T * \alpha */
    else if(transA == X_TRANS && transB == X_TRANS){

        /* dE/da = (dE/dc * b)^T * \alpha
                 = b^T * dE/dc^T * \alpha */
        _MatrixMulBatched(b, X_TRANS, dedc, X_TRANS, deda, alpha, 1.0F);

        /* dE/db = (a * dE/dc)^T * \alpha
                 = dE/dc^T * a^T * \alpha */
        _MatrixMulBatched(dedc, X_TRANS, a, X_TRANS, dedb, alpha, 1.0F);
    }

    node->visitMark = NODE_FINISHED;
}
/*
gradient for log
for
c = log(a)
we have
dE/da = dE/dc * 1/a
>> node - the node (c) for backward computation
*/
void XMathGrad::GradLog(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for LOG!");
XTensor * a = income.tails[0];
XNoder::MakeGrad(a);
_Div(node->grad, a, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
}
/*
gradient for power
for
c = pow(a,p)
we have
dE/da = (dE/dc) * p*a^(p-1)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradPower(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for POWER!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
DTYPE p = income.GetParam(0);
XNoder::MakeGrad(a);
_Power(a, b, (p-1)/p);
_ScaleAndShift(b, c, p);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for negate
for
c = -a
we have
dE/da = dE/dc * (-1)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradNegate(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for NEGATE!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_ScaleAndShift(node->grad, b, -1.0F);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for ScaleAndShift
for
c = a * scale + shift
we have
dE/da = dE/dc * scale
>> node - the node (c) for backward computation
*/
void XMathGrad::GradScaleAndShift(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SCALEANDSHIFT!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
DTYPE scale = income.GetParam(0);
XNoder::MakeGrad(a);
_ScaleAndShift(node->grad, b, scale);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for minus
for
c = a - b * \beta
we have
dE/da = dE/dc
dE/db = -dE/dc * \beta
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSub(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for SUBSTRACT!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
DTYPE beta = income.GetParam(0);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_Sum(a->grad, node->grad, a->grad);
_Sum(b->grad, node->grad, b->grad, -beta);
node->visitMark = NODE_FINISHED;
}
/*
gradient for divide
for
c = a / b
we have
dE/da = dE/dc / b
dE/db = dE/dc * a / -b^2
>> node - the node (c) for backward computation
*/
void XMathGrad::GradDiv(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for DIVIDE!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(b);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(b);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
CheckNTErrors(XTensor::IsSameShaped(a, b), "Wrong sized input tensors!");
_Div(node->grad, b, a->grad, 1.0F);
_Power(b, c, -2.0F);
_Multiply(a, c, d);
_ScaleAndShift(d, e, -1.0F);
_Multiply(node->grad, e, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
}
/*
gradient for exp
for
c = exp(a)
we have
dE/da = dE/dc * exp(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradExp(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for EXP!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Exp(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for sin
for
c = sin(a)
we have
dE/da = dE/dc * cos(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSin(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Cos(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for cos
for
c = cos(a)
we have
dE/da = dE/dc * -sin(a)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradCos(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for COS!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
XNoder::MakeGrad(a);
_Sin(a, b);
_ScaleAndShift(b, c, -1.0F);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for tan
for
c = tan(a)
we have
dE/da = dE/dc * 1/(cos(a))^2
>> node - the node (c) for backward computation
*/
void XMathGrad::GradTan(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TAN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
XNoder::MakeGrad(a);
_Cos(a, b);
_Power(b, c, -2.0F);
_Multiply(node->grad, c, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for normalize
>> node - the node (c) for backward computation
*/
void XMathGrad::GradNormalize(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 5, "Wrong input tensor number for NORMALIZE!");
XTensor * input = income.tails[0];
XTensor * mean = income.tails[1];
XTensor * var = income.tails[2];
XTensor * a = income.tails[3];
XTensor * b = income.tails[4];
XTensor * c = NewTensor(var);
XTensor * d = NewTensor(a);
XTensor * e = NewTensor(a);
XTensor * f = NewTensor(a);
XTensor * g = NewTensor(a);
XTensor * h = NewTensor(a);
XTensor * i = NewTensor(a);
XTensor * j = NewTensor(a);
XTensor * k = NewTensor(var);
XTensor * p = NewTensor(var);
XTensor * q = NewTensor(var);
XTensor * r = NewTensor(a);
XTensor * x = NewTensor(mean);
XTensor * y = NewTensor(mean);
XTensor * z = NewTensor(mean);
DTYPE epsilon = income.GetParam(1);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(input);
XNoder::MakeGrad(mean);
XNoder::MakeGrad(var);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
/* dEdinput */
_ScaleAndShift(var, c, 1.0F, epsilon);
_Unsqueeze(c, d, dim, n);
_Power(d, e, -0.5F);
_Multiply(a, e, f);
_Multiply(node->grad, f, input->grad, 1.0F);
/* dEdmean */
_ScaleAndShift(f, g, -1.0F);
_ReduceSum(g, x, dim);
_ReduceSum(node->grad, y, dim);
_Multiply(y, x, mean->grad, 1.0F);
/* dEdvar */
_Unsqueeze(mean, h, dim, n);
_Sub(input, h, i);
_Multiply(a, i, j);
_Power(var, k, -1.5F);
_ScaleAndShift(k, p, -0.5F);
_ReduceSum(j, z, dim);
_Multiply(z, p, q);
_Multiply(y, q, var->grad, 1.0F);
/* dEda */
_Multiply(i, e, r);
_Multiply(node->grad, r, a->grad, 1.0F);
/* dEdb */
_Sum(b->grad, node->grad, b->grad);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
delete f;
delete g;
delete h;
delete i;
delete j;
delete k;
delete p;
delete q;
delete r;
delete x;
delete y;
delete z;
}
/*
gradient for absolute
for
c = |a|
we have
dE/da = dE/dc a >= 0
-dE/dc a < 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradAbsolute(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for ABSOLUTE!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
_Sign(a, b);
_Multiply(node->grad, b, a->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for sign
for
c = sign(a)
we have
dE/da = 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradSign(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SIGN!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
b->SetZeroAll();
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for round
for
c = round(a)
we have
dE/da = 0
>> node - the node (c) for backward computation
*/
void XMathGrad::GradRound(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for ROUND!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XNoder::MakeGrad(a);
b->SetZeroAll();
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for clip
we have
dE/da = 1 lower < a < upper
dE/da = 0 otherwise
>> node - the node (c) for backward computation
*/
void XMathGrad::GradClip(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for CLIP!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
DTYPE lower = income.GetParam(0);
DTYPE upper = income.GetParam(1);
XNoder::MakeGrad(a);
_ClipBackward(node, a, node->grad, a->grad, lower, upper);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for reduceMean
for
c = reduceMean(a, dim)
we have
dE/da = Unsqueeze(dE/dc) * 1/dimSizeA[dim]
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceMean(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
XTensor * c = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
_Unsqueeze(node->grad, b, dim, n);
_ScaleAndShift(b, c, 1.0F/n);
_Sum(a->grad, c, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
delete c;
}
/*
gradient for reduceSum
for
c = reduceSum(a, dim)
we have
dE/da = Unsqueeze(dE/dc) * 1
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceSum(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
_Unsqueeze(node->grad, b, dim, n);
_Sum(a->grad, b, a->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
/*
gradient for reduceSumSquared
for
c = reduceSumSquared(a, dim, b)
we have
dE/da = Unsqueeze(dE/dc) * 2a
dE/db = Unsqueeze(dE/dc) * (-2b)
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceSumSquared(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(a);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(c);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_ScaleAndShift(a, c, 2.0F);
_ScaleAndShift(b, d, -2.0F);
_Unsqueeze(node->grad, e, dim, n);
_Multiply(e, c, a->grad, 1.0F);
_Multiply(node->grad, d, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
}
/*
gradient for reduceVariance
for
c = reduceVariance(a, dim, b)
we have
dE/da = Unsqueeze(dE/dc) * 2a/dimSizeA[dim]
dE/db = Unsqueeze(dE/dc) * (-2a/dimSizeA[dim])
>> node - the node (c) for backward computation
*/
void XMathGrad::GradReduceVariance(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 2, "Wrong input tensor number for Reduce!");
XTensor * a = income.tails[0];
XTensor * b = income.tails[1];
XTensor * c = NewTensor(a);
XTensor * d = NewTensor(b);
XTensor * e = NewTensor(a);
int dim = income.GetParamInt(0);
int n = a->GetDim(dim);
XNoder::MakeGrad(a);
XNoder::MakeGrad(b);
_ScaleAndShift(a, c, 2.0F / n);
_ScaleAndShift(b, d, -2.0F / n);
_Unsqueeze(node->grad, e, dim, n);
_Multiply(e, c, a->grad, 1.0F);
_Multiply(node->grad, d, b->grad, 1.0F);
node->visitMark = NODE_FINISHED;
delete c;
delete d;
delete e;
} }
}

/* from the private section of the XMathGrad declaration (XBackwardMath.h) */
    static
    void GradSum(XTensor * node);

    /* gradient for sum with one dimension: c = a + b * \beta
       where the size of b is equal to that of one dimension of a */
    static
    void GradSumDim(XTensor * node);

    /* gradient for multiply (dot production): c = a * b * \alpha */
    static
    void GradMultiply(XTensor * node);

    /* gradient for matrix multiply: c = matmul(a, b) * \alpha */
    static
    void GradMatrixMul(XTensor * node);
/* gradient for matrix multiply: c = matmul(a, b) * \alpha */
static
void GradMatrixMul(XTensor * a, XTensor * deda, MATRIX_TRANS_TYPE transA,
XTensor * b, XTensor * dedb, MATRIX_TRANS_TYPE transB,
XTensor * dedc, DTYPE alpha);
/* gradient for matrix multiply in batch mode.
for each batch: c_i = matmul(a_i, b_i) * \alpha */
static
void GradMatrixMulBatched(XTensor * node);
/* gradient for log: c = log(a) */
static
void GradLog(XTensor * node);
/* gradient for power */
static
void GradPower(XTensor * node);
/* gradient for negate */
static
void GradNegate(XTensor * node);
/* gradient for ScaleAndShift */
static
void GradScaleAndShift(XTensor * node);
/* gradient for Minus */
static
void GradSub(XTensor * node);
/* gradient for Divide */
static
void GradDiv(XTensor * node);
/* gradient for reduceMean */
static
void GradReduceMean(XTensor * node);
/* gradient for reduceSum */
static
void GradReduceSum(XTensor * node);
/* gradient for reduceSumSquared */
static
void GradReduceSumSquared(XTensor * node);
/* gradient for reduceVariance */
static
void GradReduceVariance(XTensor * node);
/* gradient for sin */
static
void GradSin(XTensor * node);
/* gradient for cos */
static
void GradCos(XTensor * node);
/* gradient for tan */
static
void GradTan(XTensor * node);
/* gradient for exp */
static
void GradExp(XTensor * node);
/* gradient for normalize */
static
void GradNormalize(XTensor * node);
/* gradient for absolute */
static
void GradAbsolute(XTensor * node);
/* gradient for sign */
static
void GradSign(XTensor * node);
/* gradient for clip */
static
void GradClip(XTensor * node);
/* gradient for round */
static
void GradRound(XTensor * node);
};

}

#endif
...@@ -43,6 +43,12 @@ void XShapeGrad::MakeGrad(XTensor * node) ...@@ -43,6 +43,12 @@ void XShapeGrad::MakeGrad(XTensor * node)
GradMergeList(node); GradMergeList(node);
else if(operID == SHAPE_UNSQUEEZE) else if(operID == SHAPE_UNSQUEEZE)
GradUnsqueeze(node); GradUnsqueeze(node);
else if(operID == SHAPE_SPLIT)
GradSplit(node);
else if(operID == SHAPE_SPLIT_LIST)
GradSplitList(node);
else if (operID == SHAPE_TRANSPOSE)
GradTranspose(node);
else{
ShowNTErrors("TODO!");
}
@@ -55,6 +61,13 @@ bool XShapeGrad::IsShapeOP(XTensor * node)
return (income.typeID & DATA_BASE) != 0;
}
/* post processing of a node */
void XShapeGrad::PostProcessing(XTensor * node, int typeID)
{
if(typeID == SHAPE_SPLIT_LIST)
GradSplitListPost(node);
}
/*
gradient for merge
for
@@ -134,6 +147,8 @@ void XShapeGrad::GradMerge(XTensor * node)
gradInputSmall.data = NULL;
delete[] dims;
node->visitMark = NODE_FINISHED;
}
/*
@@ -213,6 +228,120 @@ void XShapeGrad::GradMergeList(XTensor * node)
gradSmall.data = NULL;
delete[] dims;
}
node->visitMark = NODE_FINISHED;
}
/*
gradient computation for split:
for
c = split(a)
we have
dE/da = merge(dE/dc)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplit(XTensor * node)
{
XLink &income = node->income;
XTensor * input = income.tails[0];
int whereToSplit = income.GetParamInt(0);
int splitNum = income.GetParamInt(1);
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SPLIT!");
CheckNTErrors(node->order == input->order + 1, "Wrong tensor orders!");
CheckNTErrors(splitNum == node->dimSize[0], "Wrong split number!");
XNoder::MakeGrad(input);
/* we can simply merge the gradient tensor
if the input is used in splitting only */
if(input->outgo.tailNum == 1)
_Merge(node->grad, input->grad, whereToSplit + 1, 0);
/* if the tensor is used somewhere else, we need another SUM
for gradient accumulation */
else{
XTensor inputGradTMP(input);
_Merge(node->grad, &inputGradTMP, whereToSplit + 1, 0);
_Sum(input->grad, &inputGradTMP, input->grad);
}
node->visitMark = NODE_FINISHED;
}
/*
gradient computation for splitting
where we return the list of the splits
for
list(c_1, ...) = split(a)
we have
dE/da = merge(dE/dc_1, ...)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplitList(XTensor * node)
{
XLink &income = node->income;
XTensor * input = income.tails[0];
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for SPLIT!");
CheckNTErrors(node->order == input->order + 1, "Wrong tensor orders!");
node->visitMark = NODE_DOING;
}
/*
gradient computation for splitting. We return
the list of the splits : list(c_1, ...) = split(a).
this method is called only when all nodes of splitting
have been processed. We do this in a post-processing
manner because we can fuse multiple memory copy jobs
into one. This is good for system speedup.
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradSplitListPost(XTensor * node)
{
/* we compute the gradient for the current node, rather than for
a child node, i.e., we use the outgoing edge here */
XLink &outgo = node->outgo;
XList splits(outgo.tailNum);
int whereToSplit = -1;
int splitNum = 0;
for(int i = 0; i < outgo.tailNum; i++){
XTensor * parent = (XTensor*)outgo.tails[i];
XLink &income = parent->income;
if(income.typeID == SHAPE_SPLIT_LIST){
int w = income.GetParamInt(0);
int splitID = income.GetParamInt(1);
if(whereToSplit < 0)
whereToSplit = w;
splitNum++;
CheckNTErrors(whereToSplit == w, "Wrong dimension for splitting");
CheckNTErrors(income.tailNum == 1, "Something wrong with outgoing edge!");
CheckNTErrors(splitNum - 1 == splitID, "Wrong split id!");
splits.Add(parent);
}
}
/* we can simply merge the gradient tensor
if the node is used in splitting only */
if(outgo.tailNum == splitNum){
_Merge(&splits, node->grad, whereToSplit + 1);
}
/* if the tensor is used as input to other nodes
somewhere else, we need another SUM for gradient
accumulation */
else{
XTensor nodeGradTMP(node);
_Merge(&splits, &nodeGradTMP, whereToSplit + 1);
_Sum(node->grad, &nodeGradTMP, node->grad);
}
}
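Since split only rearranges data, its backward pass is the inverse merge, plus one extra SUM when the input also feeds other nodes, which is exactly what the two branches above do. A small sketch of the same bookkeeping on flat arrays, independent of the XTensor API:

#include <cstdio>

int main()
{
    /* upstream gradients of the two splits of a length-6 input */
    float dc0[3] = {0.1F, 0.2F, 0.3F};
    float dc1[3] = {0.4F, 0.5F, 0.6F};

    /* dE/da may already hold contributions from other consumers of a */
    float dEda[6] = {1.0F, 1.0F, 1.0F, 1.0F, 1.0F, 1.0F};

    /* merge(dE/dc_0, dE/dc_1) and accumulate into dE/da */
    for(int i = 0; i < 3; i++){
        dEda[i] += dc0[i];
        dEda[3 + i] += dc1[i];
    }

    for(int i = 0; i < 6; i++)
        printf("dE/da[%d] = %.2f\n", i, dEda[i]);
    return 0;
}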
/*
@@ -239,6 +368,40 @@ void XShapeGrad::GradUnsqueeze(XTensor * node)
CheckNTErrors(output->unitNum == input->unitNum * dSize, "Wrong tensor size!");
_ReduceSum(output->grad, input->grad, dim);
node->visitMark = NODE_FINISHED;
}
/*
gradient for transposing a tensor
for
c = Transpose(a)
we have
dE/da = Transpose(dE/dc)
>> node - the node (c) for backward computation
*/
void XShapeGrad::GradTranspose(XTensor * node)
{
XLink &income = node->income;
CheckNTErrors(income.tailNum == 1, "Wrong input tensor number for TRANSPOSE!");
XTensor * output = node;
XTensor * input = income.tails[0];
XTensor * b = NewTensor(input);
XNoder::MakeGrad(input);
int i = income.GetParamInt(0);
int j = income.GetParamInt(1);
CheckNTErrors(input->order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(input->order > j && j >= 0, "index of dimension is out of scope!");
_Transpose(output->grad, b, i, j);
_Sum(input->grad, b, input->grad);
node->visitMark = NODE_FINISHED;
delete b;
}
}
\ No newline at end of file
@@ -40,18 +40,41 @@ public:
static
bool IsShapeOP(XTensor * node);
/* post processing of a node */
static
void PostProcessing(XTensor * node, int typeId);
private:
/* gradient computation for merge: c = merge(a, b, ...) */
static
void GradMerge(XTensor * node);
/* gradient computation for merging a list of tensors : c = merge(list(a, b, ...)) */
static
void GradMergeList(XTensor * node);
/* gradient computation for split: c = split(a) */
static
void GradSplit(XTensor * node);
/* gradient computation for splitting. we return the list of the splits : list(c_1, ...) = split(a) */
static
void GradSplitList(XTensor * node);
/* gradient computation for splitting. we return the list of the splits : list(c_1, ...) = split(a).
this method is called only when all nodes of splitting have been processed. We do this in a post-processing
manner because we can fuse multiple memory copy jobs into one. This is good for system speedup. */
static
void GradSplitListPost(XTensor * node);
/* gradient computation for unsqueezing a tensor : c = unsqueeze(a) */
static
void GradUnsqueeze(XTensor * node);
/* gradient computation for transposing a tensor : c = transpose(a) */
static
void GradTranspose(XTensor * node);
};
}
@@ -46,6 +46,11 @@ unsigned int MakeNetID()
return id;
}
void XNetClearAll()
{
MUTEX_DELE(netMutex);
}
/* constructor */
XNet::XNet()
{
@@ -143,7 +148,7 @@ void XNet::Backward(XList &roots, XList &golds, LOSS_FUNCTION_NAME loss)
/* back-propagation from output to input */
for(int i = nodes.count - 1; i >= 0; i--){
XTensor * node = (XTensor*)nodes.Get(i);
if(node->visitMark == NODE_FINISHED)
continue;
@@ -176,6 +181,10 @@ void XNet::BackwardNode(XTensor * node)
return;
if(!XNoder::IsLeaf(node)){
/* post processing for parent nodes */
BackwardNodePost(node);
/* process the current node */
if(XMathGrad::IsMathOP(node))
XMathGrad::MakeGrad(node);
else if(XFuncGrad::IsFunc(node))
@@ -186,8 +195,24 @@ void XNet::BackwardNode(XTensor * node)
ShowNTErrors("Wrong node type!");
}
}
}
/*
backward computation (in post processing) for a given node
>> node - the node whose parent nodes are not processed yet. So
we do the job at the child node.
*/
void XNet::BackwardNodePost(XTensor * node)
{
bool isSplitList = false;
XLink &outgo = node->outgo;
for(int i = 0; i < outgo.tailNum; i++){
if(outgo.tails[i]->income.typeID == SHAPE_SPLIT_LIST)
isSplitList = true;
}
if(isSplitList)
XShapeGrad::PostProcessing(node, SHAPE_SPLIT_LIST);
}
/*
@@ -238,10 +263,11 @@ void XNet::TarjanVisit(XTensor * node, XList &orders, const unsigned int code)
if(node == NULL)
return;
//fprintf(stderr, "%d\n", node->id);
if(node->visitMark == code + 1){
ShowNTErrors("There is a cycle in the network\n");
}
else if(node->visitMark <= code){
node->visitMark = code + 1;
XLink &income = node->income;
for(int i = 0; i < income.tailNum; i++){
@@ -73,6 +73,9 @@ struct XNet
/* backward computation for a given node */
void BackwardNode(XTensor * node);
/* backward computation (in post processing) for a given node */
void BackwardNodePost(XTensor * node);
/* traverse the net and find the topological order by
depth-first search (Tarjan's algorithm) */
void Traverse(XTensor &root);
@@ -92,6 +95,7 @@ struct XNet
extern unsigned int netIDGlobal;
extern MUTEX_HANDLE netMutex;
extern unsigned int MakeNetID();
extern void XNetClearAll();
}
@@ -33,7 +33,7 @@
#include "../../tensor/function/FHeader.h"
#include "../../network/XNet.h"
namespace fnnlm
{
#define MAX_NAME_LENGTH 1024
@@ -57,7 +57,7 @@ void LoadArgs(int argc, const char ** argv, FNNModel &model);
void Init(FNNModel &model);
void Check(FNNModel &model);
void Copy(FNNModel &tgt, FNNModel &src);
void Clear(FNNModel &model, bool isNodeGrad);
void InitModelTensor1D(XTensor &tensor, int num, FNNModel &model);
void InitModelTensor2D(XTensor &tensor, int rowNum, int colNum, FNNModel &model);
void Train(const char * train, bool isShuffled, FNNModel &model);
@@ -153,43 +153,80 @@ load arguments
*/
void LoadArgs(int argc, const char ** argv, FNNModel &model)
{
fprintf(stderr, "args:\n");
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], "-train") && i + 1 < argc){
strcpy(trainFN, argv[i + 1]);
fprintf(stderr, " -train=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-model") && i + 1 < argc){
strcpy(modelFN, argv[i + 1]);
fprintf(stderr, " -model=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-test") && i + 1 < argc){
strcpy(testFN, argv[i + 1]);
fprintf(stderr, " -test=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-output") && i + 1 < argc){
strcpy(outputFN, argv[i + 1]);
fprintf(stderr, " -output=%s\n", argv[i + 1]);
}
if(!strcmp(argv[i], "-n") && i + 1 < argc){
model.n = atoi(argv[i + 1]);
fprintf(stderr, " -n=%d\n", model.n);
}
if(!strcmp(argv[i], "-esize") && i + 1 < argc){
model.eSize = atoi(argv[i + 1]);
fprintf(stderr, " -esize=%d\n", model.eSize);
}
if(!strcmp(argv[i], "-vsize") && i + 1 < argc){
model.vSize = atoi(argv[i + 1]);
fprintf(stderr, " -vsize=%d\n", model.vSize);
}
if(!strcmp(argv[i], "-hdepth") && i + 1 < argc){
model.hDepth = atoi(argv[i + 1]);
fprintf(stderr, " -hdepth=%d\n", model.hDepth);
}
if(!strcmp(argv[i], "-hsize") && i + 1 < argc){
model.hSize = atoi(argv[i + 1]);
fprintf(stderr, " -hsize=%d\n", model.hSize);
}
if(!strcmp(argv[i], "-lrate") && i + 1 < argc){
learningRate = (float)atof(argv[i + 1]);
fprintf(stderr, " -lrate=%f\n", learningRate);
}
if(!strcmp(argv[i], "-nstep") && i + 1 < argc){
nStep = atoi(argv[i + 1]);
fprintf(stderr, " -nstep=%d\n", nStep);
}
if(!strcmp(argv[i], "-nepoch") && i + 1 < argc){
nEpoch = atoi(argv[i + 1]);
fprintf(stderr, " -nepoch=%d\n", nEpoch);
}
if(!strcmp(argv[i], "-minmax") && i + 1 < argc){
minmax = (float)fabs(atof(argv[i + 1]));
fprintf(stderr, " -minmax=%f\n", minmax);
}
if(!strcmp(argv[i], "-batch") && i + 1 < argc){
sentBatch = atoi(argv[i + 1]);
fprintf(stderr, " -batch=%d\n", sentBatch);
}
if(!strcmp(argv[i], "-wbatch") && i + 1 < argc){
wordBatch = atoi(argv[i + 1]);
fprintf(stderr, " -wbatch=%d\n", wordBatch);
}
if(!strcmp(argv[i], "-shuffle")){
shuffled = true;
fprintf(stderr, " -shuffle=true\n");
}
if(!strcmp(argv[i], "-autodiff")){
autoDiff = true;
fprintf(stderr, " -autodiff=true\n");
}
if(!strcmp(argv[i], "-dev") && i + 1 < argc){
model.devID = atoi(argv[i + 1]);
fprintf(stderr, " -dev=%d\n", model.devID);
}
}
for(int i = 0; i < argc; i++){
@@ -203,6 +240,7 @@ void Check(FNNModel &model)
{
CheckErrors(model.n > 0 && model.n <= MAX_N_GRAM, "The LM order is out of range (use -n)!");
CheckErrors(model.vSize > 0, "no vocabulary size found (use -vsize)!");
CheckErrors(model.eSize > 0, "no embedding size found (use -esize)!");
}
/* make a hard copy of the fnn model */
@@ -230,16 +268,37 @@ void Copy(FNNModel &tgt, FNNModel &src)
}
}
/*
reset model parameters
>> model - the model whose parameter (gradient) is set to 0
>> isNodeGrad - indicates whether the tensor node keeps the
gradient information
*/
void Clear(FNNModel &model, bool isNodeGrad)
{
if (isNodeGrad) {
if(model.embeddingW.grad != NULL)
model.embeddingW.grad->SetZeroAll();
for (int i = 0; i < MAX_HIDDEN_NUM; i++) {
if(model.hiddenW[i].grad != NULL)
model.hiddenW[i].grad->SetZeroAll();
if(model.hiddenB[i].grad != NULL)
model.hiddenB[i].grad->SetZeroAll();
}
if(model.outputW.grad != NULL)
model.outputW.grad->SetZeroAll();
if(model.outputB.grad != NULL)
model.outputB.grad->SetZeroAll();
}
else {
model.embeddingW.SetZeroAll();
for (int i = 0; i < MAX_HIDDEN_NUM; i++) {
model.hiddenW[i].SetZeroAll();
model.hiddenB[i].SetZeroAll();
}
model.outputW.SetZeroAll();
model.outputB.SetZeroAll();
}
}
/*
@@ -401,7 +460,7 @@ void Train(const char * train, bool isShuffled, FNNModel &model)
FNNNet net;
/* gradient = 0 */
Clear(grad, false);
/* forward computation */
Forward(inputs, output, model, net);
@@ -413,6 +472,9 @@ void Train(const char * train, bool isShuffled, FNNModel &model)
Update(model, grad, learningRate, false);
}
else{
/* gradient = 0 */
Clear(model, true);
/* forward + backward process */
ForwardAutoDiff(inputs, output, model);
@@ -492,21 +554,24 @@ void Update(FNNModel &model, FNNModel &grad, float epsilon, bool isNodeGrad)
gradList.Add(&grad.embeddingW);
}
else{
gradList.Add(model.outputW.grad);
gradList.Add(model.outputB.grad);
for (int i = 0; i < model.hDepth; i++) {
gradList.Add(model.hiddenW[i].grad);
gradList.Add(model.hiddenB[i].grad);
}
gradList.Add(model.embeddingW.grad);
}
for (int i = 0; i < paraList.count; i++) {
XTensor * para = (XTensor*)paraList.GetItem(i);
XTensor * paraGrad = (XTensor*)gradList.GetItem(i);
//fprintf(stderr, "%d\n", i);
//paraGrad->Dump(stderr, "grad:", 10);
/* the delta rule */
_Sum(para, paraGrad, para, -epsilon);
}
@@ -516,7 +581,7 @@ void Update(FNNModel &model, FNNModel &grad, float epsilon, bool isNodeGrad)
get prediction probabilities of the gold words
>> output - output probabilities
>> gold - gold standard
>> wordProbs - probability of each word
<< return - probability of the batch
*/
float GetProb(XTensor &output, XTensor &gold, XTensor * wordProbs)
@@ -568,8 +633,10 @@ int LoadNGrams(FILE * file, int n, NGram * ngrams, int sentNum, int wordNum)
if(pin <= 0){
int len = (int)strlen(lineBuf);
while(lineBuf[len - 1] == '\r' || lineBuf[len - 1] == '\n'){
lineBuf[len - 1] = 0;
len--;
}
len = (int)strlen(lineBuf);
if(len == 0)
@@ -580,10 +647,11 @@ int LoadNGrams(FILE * file, int n, NGram * ngrams, int sentNum, int wordNum)
/* how many words are in the sentence */
int wNum = 0;
int i = 0;
for(i = pin; i < len; i++){
/* load word (id) separated by space or tab */
if((lineBuf[i] == ' ' || lineBuf[i] == '\t') && wSize > 0){
lineBuf[i] = 0;
wordBuf[wNum++] = atoi(lineBuf + i - wSize);
wSize = 0;
@@ -592,6 +660,9 @@ int LoadNGrams(FILE * file, int n, NGram * ngrams, int sentNum, int wordNum)
wSize++;
}
if(wSize > 0)
wordBuf[wNum++] = atoi(lineBuf + i - wSize);
wordBufCount = wNum;
lineNum++;
}
@@ -911,7 +982,6 @@ forward process (with tensor connections)
*/
void ForwardAutoDiff(XTensor inputs[], XTensor &output, FNNModel &model)
{
int n = model.n;
int depth = model.hDepth;
@@ -935,15 +1005,13 @@ void ForwardAutoDiff(XTensor inputs[], XTensor &output, FNNModel &model)
hidden = Merge(hidden, 2, 0);
/* hidden layers */
for(int i = 0; i < depth; i++)
hidden = MMul(hidden, model.hiddenW[i]) + model.hiddenB[i];
/* output layer */
output = LogSoftmax(MMul(hidden, model.outputW) + model.outputB, 1);
//XLink::ShowNetwork(stderr, &output);
}
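The rewritten loop above leans on the sum-with-one-dimension operation (c = a + b * \beta, whose gradient the new GradSumDim declaration provides) so the bias no longer has to be unsqueezed to the batch size by hand. A rough sketch of that broadcast sum on plain arrays, not the real XTensor calls:

#include <cstdio>

int main()
{
    const int rows = 2, cols = 3;
    float a[rows][cols] = {{1, 2, 3}, {4, 5, 6}};  /* e.g. hidden activations */
    float b[cols] = {0.1F, 0.2F, 0.3F};            /* bias matching one dimension of a */

    /* c = a + b * \beta with \beta = 1: b is repeated along the row dimension */
    for(int i = 0; i < rows; i++)
        for(int j = 0; j < cols; j++)
            a[i][j] += b[j];

    for(int i = 0; i < rows; i++)
        printf("%.1f %.1f %.1f\n", a[i][0], a[i][1], a[i][2]);
    return 0;
}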
/*
@@ -1040,18 +1108,23 @@ void Test(const char * test, const char * result, FNNModel &model)
/* the gold standard */
XTensor gold;
if (!autoDiff) {
/* prepare an empty network for building the fnn */
FNNNet net;
/* make the input tensor for position i */
for (int i = 0; i < model.n - 1; i++)
MakeWordBatch(inputs[i], ngrams, ngramNum, i, model.vSize, model.devID, model.mem);
/* make the gold tensor */
MakeWordBatch(gold, ngrams, ngramNum, model.n - 1, model.vSize, model.devID, model.mem);
/* forward computation */
Forward(inputs, output, model, net);
}
else {
ForwardAutoDiff(inputs, output, model);
}
/* prediction probabilities */
XTensor probs;
@@ -36,7 +36,7 @@
using namespace nts;
namespace fnnlm
{
#define _EXIT_(x)// exit(x)
@@ -126,7 +126,7 @@ struct FNNNet
XTensor output;
};
/* entry point of the program */
int FNNLMMain(int argc, const char ** argv);
};
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <math.h>
#include "T2TAttention.h"
#include "T2TUtility.h"
#include "T2TEmbedding.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TAttention::T2TAttention()
{
nhead = -1;
dk = -1;
dv = -1;
d = -1;
}
/* deconstructor */
T2TAttention::~T2TAttention()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TAttention::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "nhead", &nhead, 8);
LoadParamInt(argc, argv, "d", &dk, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &dv, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &d, DEFAULT_BEDDING_SIZE);
LoadParamFloat(argc, argv, "attminmax", &minmax, 0.08F);
InitTensor2D(&wk, d, dk, X_FLOAT, devID, mem);
InitTensor2D(&wq, d, dk, X_FLOAT, devID, mem);
InitTensor2D(&wv, d, dv, X_FLOAT, devID, mem);
wk.SetDataRand(-minmax, minmax);
wq.SetDataRand(-minmax, minmax);
wv.SetDataRand(-minmax, minmax);
}
/*
make the network
>> k - keys. It might be of size B * L * H
where B = batch size, L = sequence length,
and H = vector size of each position
>> q - queries
>> v - values
<< return - multi-attention result
*/
XTensor T2TAttention::Make(XTensor &k, XTensor &q, XTensor &v)
{
XTensor k2;
XTensor q2;
XTensor v2;
/* linear transformation before self-attention */
k2 = MMul(k, wk);
q2 = MMul(q, wq);
v2 = MMul(v, wv);
XTensor kheads;
XTensor qheads;
XTensor vheads;
/* multi head */
kheads = Split(k2, k2.order - 1, nhead);
qheads = Split(q2, q2.order - 1, nhead);
vheads = Split(v2, v2.order - 1, nhead);
XTensor att;
XTensor scalar;
/* scalar = softmax(Q * K^T / sqrt(dk)) * V */
scalar = Softmax(Linear(BMMul(qheads, X_NOTRANS, kheads, X_TRANS), 1/sqrt((float)dk)), -1);
att = BMMul(scalar, vheads);
/* concatenate the heads */
return Merge(att, att.order - 1);
}
}
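Make above computes softmax(Q K^T / sqrt(dk)) V per head. As a numeric sanity check of the scaled dot-product scores for one query against two keys (plain floats, nothing from the library):

#include <cstdio>
#include <cmath>

int main()
{
    const int dk = 4;
    float q[dk]  = {1, 0, 1, 1};
    float k0[dk] = {1, 1, 0, 0};
    float k1[dk] = {0, 1, 1, 1};

    /* scaled dot products s_i = (q . k_i) / sqrt(dk) */
    float s0 = 0, s1 = 0;
    for(int i = 0; i < dk; i++){
        s0 += q[i] * k0[i];
        s1 += q[i] * k1[i];
    }
    s0 /= sqrt((float)dk);   /* 1/2 = 0.5 */
    s1 /= sqrt((float)dk);   /* 2/2 = 1.0 */

    /* softmax over the two scores yields the attention weights (~0.3775, ~0.6225) */
    float e0 = exp(s0), e1 = exp(s1);
    printf("weights: %.4f %.4f\n", e0 / (e0 + e1), e1 / (e0 + e1));
    return 0;
}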
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TATTENTION_H__
#define __T2TATTENTION_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
/*
multi-head attention
y(Q, K, V) = cat(head_1, head_2, ..., head_n)
where head_i = Attention(Q * w_i^Q, K * w_i^K, V * w_i^V)
attention(Q, K, V) = softmax(Q * K^T/d_k^0.5) V
d_k = dimension size of K
*/
class T2TAttention
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* head number */
int nhead;
/* transformation matrix for K */
XTensor wk;
/* transformation matrix for Q */
XTensor wq;
/* transformation matrix for V */
XTensor wv;
/* size of transformed Q and K */
int dk;
/* size of transformed V */
int dv;
/* size of input Q, K and V */
int d;
public:
/* constructor */
T2TAttention();
/* de-constructor */
~T2TAttention();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &k, XTensor &q, XTensor &v);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TDECODER_H__
#define __T2TDECODER_H__
namespace transformer
{
class T2TDecoder
{
};
class AttDecoder : T2TDecoder
{
public:
/* initialize the model */
void InitModel(int argc, const char ** argv);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-01
*/
#include <math.h>
#include "T2TEmbedding.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TEmbedder::T2TEmbedder()
{
devID = -1;
mem = NULL;
vSize = -1;
maxLength = -1;
}
/* deconstructor */
T2TEmbedder::~T2TEmbedder()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TEmbedder::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
int d = 0;
LoadParamInt(argc, argv, "vsize", &vSize, -1);
LoadParamInt(argc, argv, "maxlen", &maxLength, 256);
LoadParamInt(argc, argv, "d", &eSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &d, DEFAULT_BEDDING_SIZE);
InitTensor2D(&w, vSize, eSize, X_FLOAT, devID, mem);
w.SetDataRandn(0, sqrt((float)eSize));
/* create the positional embedding matrix */
MakePosEmbedding(eSize, d, maxLength);
}
/*
make positional embeddings (of size eSize * length)
>> eSize - embedding size
>> d - model dimension used in the sinusoid denominator
>> length - length of the sequence
*/
void T2TEmbedder::MakePosEmbedding(int eSize, int d, int length)
{
InitTensor2D(&posEmbeddingBase, length, eSize, X_FLOAT, devID, mem);
float * data = new float[posEmbeddingBase.unitNum];
for(int pos = 0; pos < length; pos++){
float * dp = data + pos * eSize;
for(int k = 0; k < eSize; k++){
if(k % 2 == 0){
int i = k/2;
dp[k] = sin(pos/pow(10000.0F, 2.0F*i/d));
}
else{
int i = (k - 1)/2;
dp[k] = cos(pos/pow(10000.0F, 2.0F*i/d));
}
}
}
posEmbeddingBase.SetData(data, posEmbeddingBase.unitNum);
delete[] data;
}
/*
make the network
*/
XTensor T2TEmbedder::Make(XTensor &input)
{
CheckNTErrors(input.GetDim(-1) == vSize, "Wrong vocabulary size!");
CheckNTErrors(input.order > 1, "Wrong input tensor size!");
CheckNTErrors(input.dimSize[input.order - 2] < maxLength, "The sequence is too long!");
CheckNTErrors(vSize > 0, "set vocabulary size by \"-vsize\"");
CheckNTErrors(eSize > 0, "set embedding size by \"-esize\"");
int dims[MAX_TENSOR_DIM_NUM];
memcpy(dims, input.dimSize, input.order * sizeof(int));
dims[input.order - 1] = eSize;
bool match = (posEmbedding.order == input.order);
if(match){
for(int i = 0; i < input.order; i++){
if(dims[i] != posEmbedding.GetDim(i))
match = false;
}
}
/* we make positional embeddings first */
if(!match){
InitTensor(&posEmbedding, input.order, dims, X_FLOAT, 1.0F, devID, mem);
XTensor * posTMP = NewTensorBuf(2, dims + 1, X_FLOAT, 1.0F, devID, mem);
_CopyValues(&posEmbeddingBase, 0, posTMP->unitNum, posTMP, 0);
_Unsqueeze(posTMP, &posEmbedding, 0, dims[0]);
DelTensorBuf(posTMP);
}
XTensor wordEmbedding;
/* then we make word embeddings */
wordEmbedding = MMul(input, w);
/* we sum over the two embeddings */
return wordEmbedding + posEmbedding;
}
}
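MakePosEmbedding fills dp[k] with sin(pos/10000^(2i/d)) for even k and the matching cos for odd k. A quick standalone check of the first two components at position 1 (plain math, independent of the class):

#include <cstdio>
#include <cmath>

int main()
{
    const double d = 512;
    int pos = 1;

    /* k = 0 (even, i = 0) and k = 1 (odd, i = 0): the exponent 2i/d is 0,
       so the denominator 10000^(2i/d) is 1 */
    double pe0 = sin(pos / pow(10000.0, 2.0 * 0 / d));  /* sin(1) ~ 0.8415 */
    double pe1 = cos(pos / pow(10000.0, 2.0 * 0 / d));  /* cos(1) ~ 0.5403 */

    printf("pe(%d, 0) = %.4f, pe(%d, 1) = %.4f\n", pos, pe0, pos, pe1);
    return 0;
}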
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-01
*/
#ifndef __T2TEMBEDDING_H__
#define __T2TEMBEDDING_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
#define DEFAULT_BEDDING_SIZE 512
/*
embedding (of word at position i):
word embedding + positional embedding
*/
class T2TEmbedder
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* vocabulary size */
int vSize;
/* embedding size */
int eSize;
/* maximum length of the sequence */
int maxLength;
/* word embedding matrix */
XTensor w;
/* predefined positional embeddings. Reusing this table
speeds up the generation of positional embeddings. */
XTensor posEmbeddingBase;
/* positional embeddings */
XTensor posEmbedding;
public:
/* constructor */
T2TEmbedder();
/* de-constructor */
~T2TEmbedder();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make positional embeddings */
void MakePosEmbedding(int eSize, int d, int length);
/* make the network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <math.h>
#include "T2TEncoder.h"
#include "T2TLayerNormal.h"
#include "T2TUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
AttEncoder::AttEncoder()
{
}
/* de-constructor */
AttEncoder::~AttEncoder()
{
delete[] attentions;
delete[] fnns;
delete[] layerNorms;
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void AttEncoder::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
LoadParamInt(argc, argv, "nstack", &nlayer, 6);
LoadParamInt(argc, argv, "hsize", &hSize, 512);
LoadParamInt(argc, argv, "esize", &eSize, 512);
LoadParamInt(argc, argv, "vsize", &vSize, -1);
CheckNTErrors(nlayer >= 1, "We need at least one encoding layer!");
CheckNTErrors(vSize > 1, "set vocabulary size by \"-vsize\"");
/* embedding model */
embedder.InitModel(argc, argv, devID, mem);
attentions = new T2TAttention[nlayer];
fnns = new T2TFNN[nlayer];
layerNorms = new T2TLN[nlayer];
/* initialize the stacked layers */
for(int i = 0; i < nlayer; i++){
attentions[i].InitModel(argc, argv, myDevID, myMem);
fnns[i].InitModel(argc, argv, myDevID, myMem);
layerNorms[i].InitModel(argc, argv, myDevID, myMem);
}
}
/*
make the encoding network
>> input - the input tensor of the encoder
<< return - the output tensor of the encoder
*/
XTensor AttEncoder::Make(XTensor &input)
{
XTensor x;
x = embedder.Make(input);
for(int i = 0; i < nlayer; i++){
XTensor att;
XTensor ln;
XTensor fnn;
XTensor res;
/* self attention */
att = attentions[i].Make(x, x, x);
/* residual connection */
res = Sum(att, x);
/* TODO: dropout */
/* layer normalization */
ln = layerNorms[i].Make(res);
/* input of next layer */
x = ln;
/* fnn */
fnn = fnns[i].Make(x);
/* residual connection */
res = Sum(fnn, x);
/* TODO: dropout */
/* layer normalization */
ln = layerNorms[i].Make(res);
/* input of next layer */
x = ln;
}
return x;
}
}
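Each pass of the loop in Make applies the post-norm residual pattern twice: x = LN(x + Att(x)) and then x = LN(x + FNN(x)). A shape-free sketch of that control flow with made-up scalar stand-ins for the sublayers (the functions below are illustrative, not part of the library):

#include <cstdio>

static float Att(float x) { return 0.5F * x; }   /* stand-in for self-attention */
static float FNN(float x) { return x * x; }      /* stand-in for the feed-forward net */
static float LN(float x)  { return x * 0.5F; }   /* stand-in for layer normalization */

int main()
{
    float x = 1.0F;
    for(int i = 0; i < 2; i++){
        x = LN(x + Att(x));   /* attention, residual connection, normalization */
        x = LN(x + FNN(x));   /* feed-forward, residual connection, normalization */
    }
    printf("%.4f\n", x);
    return 0;
}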
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TENCODER_H__
#define __T2TENCODER_H__
#include "T2TFNN.h"
#include "T2TAttention.h"
#include "T2TEmbedding.h"
#include "T2TLayerNormal.h"
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
/*
base class of the encoder
*/
class T2TEncoder
{
public:
virtual
XTensor Make(XTensor &input) = 0;
};
/*
the encoder based on RNN
*/
class RNNEncoder : T2TEncoder
{
public:
XTensor Make(XTensor &input);
};
/*
the encoder based on self-attention
*/
class AttEncoder : T2TEncoder
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* layer number */
int nlayer;
/* hidden layer size of the FNN layer */
int hSize;
/* embedding size */
int eSize;
/* vocabulary size */
int vSize;
/* embedding of word at each position */
T2TEmbedder embedder;
/* FNN model of each layer */
T2TFNN * fnns;
/* attention model of each layer */
T2TAttention * attentions;
/* layer normalization */
T2TLN * layerNorms;
/* input tensor of the encoder */
XTensor * input;
/* output tensor of the encoder */
XTensor * output;
public:
/* constructor */
AttEncoder();
/* de-constructor */
~AttEncoder();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the encoding network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TFNN.h"
#include "T2TUtility.h"
#include "T2TEmbedding.h"
#include "../../tensor/core/CHeader.h"
#include "../../tensor/function/FHeader.h"
namespace transformer
{
/* constructor */
T2TFNN::T2TFNN()
{
inSize = -1;
outSize = -1;
hSize = -1;
}
/* deconstructor */
T2TFNN::~T2TFNN()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TFNN::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "d", &inSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &outSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "fnnh", &hSize, DEFAULT_BEDDING_SIZE);
LoadParamFloat(argc, argv, "fnnminmax", &minmax, 0.08F);
InitTensor2D(&w1, inSize, hSize, X_FLOAT, devID, mem);
InitTensor1D(&b1, hSize, X_FLOAT, devID, mem);
InitTensor2D(&w2, hSize, outSize, X_FLOAT, devID, mem);
InitTensor1D(&b2, outSize, X_FLOAT, devID, mem);
w1.SetDataRand(-minmax, minmax);
b1.SetDataRand(-minmax, minmax);
w2.SetDataRand(-minmax, minmax);
b2.SetDataRand(-minmax, minmax);
}
/*
make the network
y = max(0, x * w1 + b1) * w2 + b2
>> input - the input tensor
<< return - the output tensor
*/
XTensor T2TFNN::Make(XTensor &input)
{
XTensor t1;
/* t1 = max(0, x * w1 + b1) */
t1 = Rectify(MMul(input, X_NOTRANS, w1, X_NOTRANS) + b1);
/* result = t1 * w2 + b2 */
return MMul(t1, X_NOTRANS, w2, X_NOTRANS) + b2;
}
}
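For a concrete feel of y = max(0, x * w1 + b1) * w2 + b2, the same two-step computation on made-up scalars:

#include <cstdio>

int main()
{
    float x = 2.0F, w1 = -1.5F, b1 = 1.0F, w2 = 3.0F, b2 = 0.5F;

    /* t1 = max(0, x * w1 + b1): the rectifier zeroes the negative pre-activation */
    float pre = x * w1 + b1;            /* 2 * (-1.5) + 1 = -2 */
    float t1 = pre > 0 ? pre : 0;       /* rectified to 0 */

    /* y = t1 * w2 + b2 */
    printf("y = %.2f\n", t1 * w2 + b2); /* 0 * 3 + 0.5 = 0.50 */
    return 0;
}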
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TFNN_H__
#define __T2TFNN_H__
#include "../../tensor/XTensor.h"
using namespace nts;
namespace transformer
{
/* a fnn: y = max(0, x * w1 + b1) * w2 + b2 */
class T2TFNN
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* size of input vector */
int inSize;
/* size of output vector */
int outSize;
/* size of hidden layers */
int hSize;
/* matrix of transformation 1 */
XTensor w1;
/* bias of transformation 1 */
XTensor b1;
/* matrix of transformation 2 */
XTensor w2;
/* bias of transformation 2 */
XTensor b2;
public:
/* constructor */
T2TFNN();
/* deconstructor */
~T2TFNN();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TLayerNormal.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TLN::T2TLN()
{
devID = -1;
mem = NULL;
}
/* de-constructor */
T2TLN::~T2TLN()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TLN::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
}
/*
make the network
for each layer representation x, we have
y = (x - \mu)/\sigma
where \mu and \sigma are the mean and the standard deviation
of x along its last dimension
>> input - the input tensor
<< return - layer normalization output
*/
XTensor T2TLN::Make(XTensor &input)
{
XTensor &x = input;
XTensor mean;
XTensor variance;
XTensor standard;
XTensor meanFilled;
XTensor standardFilled;
/* \mu = (sum_i x_i)/m */
mean = ReduceMean(x, x.order - 1);
/* \sigma^2 = (sum_i (x_i - \mu)^2)/m */
variance = ReduceVariance(x, x.order - 1, mean);
/* standard = sqrt(variance) */
standard = Power(variance, 0.5F);
/* unsqueeze mean and standard deviation to fit them into
the same size of x */
meanFilled = Unsqueeze(mean, x.order - 1, x.GetDim(-1));
standardFilled = Unsqueeze(standard, x.order - 1, x.GetDim(-1));
/* x' = (x - \mu)/standard */
return (x - meanFilled)/standardFilled;
}
}
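A numeric check of the normalization above: for x = (1, 2, 3) the mean is 2 and the variance 2/3, so the output is roughly (-1.2247, 0, 1.2247). The same arithmetic on a plain array:

#include <cstdio>
#include <cmath>

int main()
{
    const int m = 3;
    float x[m] = {1, 2, 3};

    float mean = 0, var = 0;
    for(int i = 0; i < m; i++) mean += x[i] / m;
    for(int i = 0; i < m; i++) var += (x[i] - mean) * (x[i] - mean) / m;
    float standard = sqrt(var);   /* ~0.8165 */

    for(int i = 0; i < m; i++)
        printf("%.4f\n", (x[i] - mean) / standard);  /* -1.2247 0.0000 1.2247 */
    return 0;
}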
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TLAYERNORMAL_H__
#define __T2TLAYERNORMAL_H__
#include "../../network/XNet.h"
using namespace nts;
namespace transformer
{
class T2TLN
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
public:
/* constructor */
T2TLN();
/* de-constructor */
~T2TLN();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &input);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TModel.h"
#include "T2TUtility.h"
namespace transformer
{
/* constructor */
T2TModel::T2TModel()
{
devID = -1;
mem = NULL;
isLM = false;
isMT = false;
}
/* de-constructor */
T2TModel::~T2TModel()
{
delete mem;
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
*/
void T2TModel::InitModel(int argc, const char ** argv)
{
bool useMem = false;
LoadParamInt(argc, argv, "dev", &devID, -1);
LoadParamBool(argc, argv, "mem", &useMem, useMem);
LoadParamBool(argc, argv, "lm", &isLM, true);
LoadParamBool(argc, argv, "mt", &isMT, false);
if(useMem){
delete mem;
mem = new XMem(devID);
}
encoder.InitModel(argc, argv, devID, mem);
outputLayer.InitModel(argc, argv, devID, mem);
}
/*
make the encoding network
>> input - input tensor
<< return - encoding result
*/
XTensor T2TModel::MakeEncoding(XTensor &input)
{
return encoder.Make(input);
}
/*
make the entire network (with the output softmax layer)
>> input - input tensor
>> output - output tensor (distribution)
*/
void T2TModel::Make(XTensor &input, XTensor &output)
{
if(isLM){
XTensor encoding;
encoding = MakeEncoding(input);
outputLayer.Make(encoding, output);
}
else{
ShowNTErrors("TODO!");
}
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TMODEL_H__
#define __T2TMODEL_H__
#include "T2TFNN.h"
#include "T2TAttention.h"
#include "T2TEncoder.h"
#include "T2TDecoder.h"
#include "T2TOutput.h"
namespace transformer
{
class T2TModel
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* the encoder */
AttEncoder encoder;
/* the decoder */
AttDecoder decoder;
/* output layer */
T2TOutput outputLayer;
/* indicates whether the model is running for language modeling */
bool isLM;
/* indicates whether the model is running for machine translation */
bool isMT;
public:
/* constructor */
T2TModel();
/* de-constructor */
~T2TModel();
/* initialize the model */
void InitModel(int argc, const char ** argv);
/* make the encoding network */
XTensor MakeEncoding(XTensor &input);
/* make the entire network (with the output softmax layer) */
void Make(XTensor &input, XTensor &output);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "T2TOutput.h"
#include "T2TUtility.h"
#include "T2TEmbedding.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TOutput::T2TOutput()
{
devID = -1;
mem = NULL;
vSize = -1;
inSize = -1;
hSize = -1;
}
/* de-constructor */
T2TOutput::~T2TOutput()
{
}
/*
initialize the model
>> argc - number of arguments
>> argv - list of pointers to the arguments
>> myDevID - device id
>> myMem - the memory pool
*/
void T2TOutput::InitModel(int argc, const char ** argv, int myDevID, XMem * myMem)
{
devID = myDevID;
mem = myMem;
float minmax = 0;
LoadParamInt(argc, argv, "vsize", &vSize, -1);
LoadParamInt(argc, argv, "d", &inSize, DEFAULT_BEDDING_SIZE);
LoadParamInt(argc, argv, "d", &hSize, DEFAULT_BEDDING_SIZE);
LoadParamFloat(argc, argv, "outputminmax", &minmax, 0.08F);
InitTensor2D(&w, hSize, vSize, X_FLOAT, devID, mem);
w.SetDataRand(-minmax, minmax);
}
/*
make the network
y = logsoftmax(x * w)
>> input - input tensor
<< return - output tensor
*/
XTensor T2TOutput::Make(XTensor &input)
{
XTensor &x = input;
return LogSoftmax(MMul(x, w), -1);
}
/*
make the network (redefined output tensor)
>> input - input tensor
>> output - output tensor
*/
void T2TOutput::Make(XTensor &input, XTensor &output)
{
XTensor &x = input;
output = LogSoftmax(MMul(x, w), -1);
}
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TOUTPUT_H__
#define __T2TOUTPUT_H__
#include "../../tensor/function/FHeader.h"
using namespace nts;
namespace transformer
{
/* output layer */
class T2TOutput
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* vocabulary size */
int vSize;
/* input vector size */
int inSize;
/* vector size of the linear transformation */
int hSize;
/* transformation matrix */
XTensor w;
public:
/* constructor */
T2TOutput();
/* de-constructor */
~T2TOutput();
/* initialize the model */
void InitModel(int argc, const char ** argv, int myDevID = -1, XMem * myMem = NULL);
/* make the network */
XTensor Make(XTensor &input);
/* make the network (redefined output tensor) */
void Make(XTensor &input, XTensor &output);
};
}
#endif
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-02
*/
#include <math.h>
#include "T2TTrainer.h"
#include "T2TUtility.h"
#include "../../tensor/XUtility.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* constructor */
T2TTrainer::T2TTrainer()
{
devID = -1;
mem = NULL;
seqLen = NULL;
nseqBuf = 0;
nextSeq = -1;
}
/* de-constructor */
T2TTrainer::~T2TTrainer()
{
delete[] buf;
delete[] seqLen;
delete[] seqOffset;
}
/*
initialization
>> argc - number of arguments
>> argv - list of pointers to the arguments
*/
void T2TTrainer::Init(int argc, const char ** argv)
{
LoadParamInt(argc, argv, "dev", &devID, -1);
LoadParamFloat(argc, argv, "lrate", &lrate, 0.001F);
LoadParamInt(argc, argv, "sbatch", &sBatchSize, 1);
LoadParamInt(argc, argv, "wbatch", &wBatchSize, 1);
LoadParamInt(argc, argv, "nepoch", &nepoch, 1);
LoadParamInt(argc, argv, "nstep", &nstep, 1);
LoadParamInt(argc, argv, "vsize", &vSize, 1);
LoadParamBool(argc, argv, "sorted", &isLenSorted, false);
LoadParamInt(argc, argv, "bufsize", &bufSize, 50000);
buf = new int[bufSize];
seqLen = new int[bufSize];
seqOffset = new int[bufSize];
}
/*
train the model
>> fn - training data file
>> model - model to train
*/
void T2TTrainer::Train(const char * fn, T2TModel * model)
{
int epoch = 0;
int step = 0;
int wc = 0;
int wordCount = 0;
int wordCountTotal = 0;
bool isEnd = false;
float loss = 0;
XNet net;
double startT = GetClockSec();
for(epoch = 0; epoch < nepoch; epoch++){
FILE * file = fopen(fn, "rb");
CheckNTErrors(file, "cannot open training file!");
wordCount = 0;
/* batch of input sequences */
XTensor batch;
while(LoadBatch(file, &batch, 1, vSize, sBatchSize, wBatchSize, isLenSorted, wc)){
/* output probabilities */
XTensor output;
/* make the network */
model->Make(batch, output);
/* back-propagation for obtaining gradients */
net.Backward(output, batch, CROSSENTROPY);
/* update the parameters */
Update(model);
/* get the log-probability of the batch */
float prob = GetProb(&output, &batch, NULL);
loss += -prob;
wordCount += wc;
wordCountTotal += wc;
if(++step >= nstep){
isEnd = true;
break;
}
if (step % 1 == 0) {
double elapsed = GetClockSec() - startT;
XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, words=%d, ppl=%.3f\n",
elapsed, step, epoch + 1, wordCountTotal, exp(loss / wordCount));
}
}
fclose(file);
}
double elapsed = GetClockSec() - startT;
XPRINT5(0, stderr, "[INFO] elapsed=%.1fs, step=%d, epoch=%d, words=%d, ppl=%.3f\n",
elapsed, step, epoch, wordCountTotal, exp(loss / wordCount));
XPRINT3(0, stderr, "[INFO] training finished (took %.1fs, step=%d and epoch=%d)\n",
elapsed, step, epoch);
}
char line[MAX_SEQUENCE_LENGTH];
/*
load data to buffer
>> file - where to load data
*/
int T2TTrainer::LoadBuf(FILE * file)
{
int lineCount = 0;
int seqCount = 0;
int wordCount = 0;
while(fgets(line, MAX_SEQUENCE_LENGTH - 1, file)){
int len = (int)strlen(line);
while(len > 0 && (line[len - 1] == '\r' || line[len - 1] == '\n')){
line[len - 1] = 0;
len--;
}
len = (int)strlen(line);
if(len == 0)
continue;
/* how many characters are in a word */
int wSize = 0;
/* how many words are in the sentence */
int wNum = 0;
int wNumLocal = 0;
int i = 0;
for(i = 0; i < len; i++){
/* load word (id) separated by space or tab */
if((line[i] == ' ' || line[i] == '\t') && wSize > 0){
line[i] = 0;
if(wSize == 3 && line[i - 1] == '|' && line[i - 2] == '|' && line[i - 3] == '|'){
seqLen[seqCount] = wNumLocal;
seqOffset[seqCount] = wordCount + wNum - wNumLocal;
seqCount++;
wNumLocal = 0;
}
else{
buf[wordCount + wNum++] = atoi(line + i - wSize);
wNumLocal++;
}
wSize = 0;
}
else
wSize++;
}
if(wSize > 0){
buf[wordCount + wNum++] = atoi(line + i - wSize);
wNumLocal++;
}
seqLen[seqCount] = wNumLocal;
seqOffset[seqCount] = wordCount + wNum - wNumLocal;
seqCount++;
wordCount += wNum;
lineCount++;
if(wordCount >= bufSize - MAX_SEQUENCE_LENGTH)
break;
}
nseqBuf = seqCount;
nextSeq = 0;
return lineCount;
}
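/* Input format, as reconstructed from the parsing code above (treat this as
   an illustration): one line per sample, word ids separated by spaces or
   tabs, with an optional "|||" token closing a sequence early. For example

       12 7 431 2 ||| 9 83 2

   is stored as two sequences with seqLen = {4, 3}, whose seqOffset entries
   point at positions 0 and 4 of buf. */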
/*
load a batch of sequences
>> file - the handle to the data file
>> batch - the batch
>> step - the step we go over when moving to the next sequence
>> vs - vocabulary size
>> sBatch - batch size of sequences
>> wBatch - batch size of words
>> isSorted - indicates whether the sequences are sorted by length
>> wCount - word count
*/
int T2TTrainer::LoadBatch(FILE * file, XTensor * batch, int step, int vs, int sBatch, int wBatch, bool isSorted, int &wCount)
{
wCount = 0;
if(nextSeq < 0 || nextSeq >= nseqBuf)
LoadBuf(file);
int seq = MAX(nextSeq, 0);
int wc = 0;
int wn = 0;
int sc = 0;
int max = 0;
while(seq + sc < nseqBuf){
wn = seqLen[seq + sc];
wc += wn;
sc += 1;
if(max < wn)
max = wn;
if(sc >= sBatch && wc >= wBatch)
break;
}
nextSeq = seq + sc;
if(sc > 0){
int dims[MAX_TENSOR_DIM_NUM];
dims[0] = sc;
dims[1] = max;
dims[2] = vs;
if(batch->order != 3 || batch->GetDim(0) != dims[0] ||
batch->GetDim(1) != dims[1] || batch->GetDim(2) != dims[2]){
InitTensor(batch, 3, dims, X_FLOAT, 1.0F, devID, mem);
}
batch->SetZeroAll();
/* this might be slow on GPUs :( */
for(int s = seq; s < seq + sc; s++){
for(int w = 0; w < seqLen[s]; w++){
batch->Set3D(1.0F, s - seq, w, buf[seqOffset[s] + w]);
wCount++;
}
}
}
return sc;
}
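/* Layout note (illustrative): the batch built above is a one-hot tensor of
   shape (sc, max, vs) where batch[s][w][id] = 1 iff word "id" occurs at
   position w of sequence s; padded positions remain all-zero after
   SetZeroAll(). Reading an entry back might look like

       float hit = batch->Get3D(s - seq, w, buf[seqOffset[s] + w]);

   assuming a Get3D accessor symmetric to the Set3D call used here. */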
/*
get word probabilities for a batch of sequences
>> output - word distribution for each position
>> gold - gold standard
>> wordProbs - word probability for gold prediction
*/
float T2TTrainer::GetProb(XTensor * output, XTensor * gold, XTensor * wordProbs)
{
XTensor probs;
InitTensor(&probs, output);
/* probs[i,j] = output[i,j] * gold[i,j] */
_Multiply(output, gold, &probs);
/* probability of each word */
XTensor wprobs;
InitTensor1D(&wprobs, output->unitNum/output->GetDim(-1), X_FLOAT, output->devID, output->mem);
int dims[2] = {output->unitNum/output->GetDim(-1), output->GetDim(-1)};
probs.Reshape(2, dims);
_ReduceSum(&probs, &wprobs, 1);
if(wordProbs != NULL)
_CopyValues(&wprobs, wordProbs);
/* reshape the tensor to fit it into the reduce procedure
TODO: XTensor supports scalars */
dims[0] = 1;
dims[1] = probs.unitNum;
probs.Reshape(2, dims);
/* probability for the batch */
XTensor result;
InitTensor1D(&result, 1, X_FLOAT, output->devID, output->mem);
_ReduceSum(&probs, &result, 1);
return result.Get1D(0);
}
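/* A worked note, under the assumption that "output" holds log-probabilities:
   with one-hot gold labels, multiplying and reducing as above picks out
   log P(w_j) for each position j, so the returned value is sum_j log P(w_j).
   Train() then accumulates loss = -sum_j log P(w_j) and reports
   ppl = exp(loss / wordCount). */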
/*
update the model by delta rule
>> model - the t2t model
*/
void T2TTrainer::Update(T2TModel * model)
{
XList ws(100);
ws.Add(&model->outputLayer.w);
for(int i = 0; i < model->encoder.nlayer; i++){
ws.Add(&model->encoder.fnns[i].w1);
ws.Add(&model->encoder.fnns[i].b1);
ws.Add(&model->encoder.fnns[i].w2);
ws.Add(&model->encoder.fnns[i].b2);
}
ws.Add(&model->encoder.embedder.w);
for(int i = 0; i < ws.count; i++){
XTensor * para = (XTensor*)ws.Get(i);
XTensor * paraGrad = para->grad;
CheckNTErrors(para != NULL, "NULL parameter tensor!");
CheckNTErrors(paraGrad != NULL, "NULL gradient tensor!");
/* the delta rule */
_Sum(para, paraGrad, para, -lrate);
}
}
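/* The delta rule above is plain stochastic gradient descent: for each
   parameter tensor w, _Sum(w, grad, w, -lrate) computes

       w <- w + (-lrate) * dL/dw

   i.e. a fixed-learning-rate step with no momentum or weight decay. */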
}
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-08-02
*/
#ifndef __T2TTRAINER_H__
#define __T2TTRAINER_H__
#include "T2TModel.h"
#include "../../tensor/function/FHeader.h"
#define MAX_SEQUENCE_LENGTH (1024 * 4)
using namespace nts;
namespace transformer
{
/* trainer of the T2T model */
class T2TTrainer
{
public:
/* device id */
int devID;
/* memory pool */
XMem * mem;
/* buffer for loading words */
int * buf;
/* buffer size */
int bufSize;
/* length of each sequence */
int * seqLen;
/* offset of the first word for each sequence */
int * seqOffset;
/* number of sequences in the buffer */
int nseqBuf;
/* offset for next sequence in the buffer */
int nextSeq;
/* indicates whether the sequence is sorted by length */
bool isLenSorted;
/* vocabulary size of the source side */
int vSize;
/* learning rate */
float lrate;
/* sentence batch size */
int sBatchSize;
/* word batch size */
int wBatchSize;
/* training epoch number */
int nepoch;
/* training step number */
int nstep;
public:
/* constructor */
T2TTrainer();
/* destructor */
~T2TTrainer();
/* initialize the trainer */
void Init(int argc, const char ** argv);
/* train the model */
void Train(const char * fn, T2TModel * model);
/* load data to buffer */
int LoadBuf(FILE * file);
/* load a batch of sequences */
int LoadBatch(FILE * file, XTensor * batch, int step, int vs, int sBatch, int wBatch, bool isSorted, int &wCount);
/* get word probabilities for a batch of sequences */
float GetProb(XTensor * output, XTensor * gold, XTensor * wordProbs);
/* update the model by delta rule */
void Update(T2TModel * model);
};
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
namespace transformer
{
void LoadParamString(int argc, const char ** argv, const char * name, char * p, const char * defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
strcpy(p, argv[i + 1]);
//fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
strcpy(p, defaultP);
}
void LoadParamInt(int argc, const char ** argv, const char * name, int * p, int defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
*p = atoi(argv[i + 1]);
//fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
*p = defaultP;
}
void LoadParamBool(int argc, const char ** argv, const char * name, bool * p, bool defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname)){
*p = true;
//fprintf(stderr, " %s=%s\n", name, "true");
hit = true;
}
}
if(!hit)
*p = defaultP;
}
void LoadParamFloat(int argc, const char ** argv, const char * name, float * p, float defaultP)
{
char vname[128];
vname[0] = '-';
strcpy(vname + 1, name);
bool hit = false;
for(int i = 0; i < argc; i++){
if(!strcmp(argv[i], vname) && i + 1 < argc){
*p = (float)atof(argv[i + 1]);
//fprintf(stderr, " %s=%s\n", name, argv[i + 1]);
hit = true;
}
}
if(!hit)
*p = defaultP;
}
void ShowParams(int argc, const char ** argv)
{
fprintf(stderr, "args:\n");
for(int i = 0; i < argc; i++){
if(argv[i][0] == '-'){
if(i + 1 < argc && argv[i + 1][0] != '-')
fprintf(stderr, " %s=%s\n", argv[i], argv[i + 1]);
else
fprintf(stderr, " %s=yes\n", argv[i]);
}
}
fprintf(stderr, "\n");
}
}
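/* A usage sketch for the loaders above (variable names are hypothetical):

       int dev;
       float lr;
       LoadParamInt(argc, argv, "dev", &dev, -1);        // consumes "-dev 0"
       LoadParamFloat(argc, argv, "lrate", &lr, 0.001F); // consumes "-lrate 0.01"

   Each call scans all of argv, so the last occurrence of a flag wins, and a
   missing flag falls back to the given default. */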
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#ifndef __T2TUTILITY_H__
#define __T2TUTILITY_H__
#include <stdio.h>
namespace transformer
{
/* load arguments */
void LoadParamString(int argc, const char ** argv, const char * name, char * p, const char * defaultP);
void LoadParamInt(int argc, const char ** argv, const char * name, int * p, int defaultP);
void LoadParamBool(int argc, const char ** argv, const char * name, bool * p, bool defaultP);
void LoadParamFloat(int argc, const char ** argv, const char * name, float * p, float defaultP);
/* show arguments */
void ShowParams(int argc, const char ** argv);
}
#endif
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
*/
#include "Transformer.h"
#include "T2TModel.h"
#include "T2TUtility.h"
#include "T2TTrainer.h"
#include "../../tensor/XDevice.h"
namespace transformer
{
int TransformerMain(int argc, const char ** argv)
{
if(argc == 0)
return 1;
ShowParams(argc, argv);
char * trainFN = new char[MAX_LINE_LENGTH];
LoadParamString(argc, argv, "train", trainFN, "");
T2TModel model;
model.InitModel(argc, argv);
if(strcmp(trainFN, "")){
T2TTrainer trainer;
trainer.Init(argc, argv);
trainer.Train(trainFN, &model);
}
delete[] trainFN;
return 0;
}
}
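/* An illustrative entry point -- an assumption, since the real main() lives
   elsewhere in the repository. The first argument (the program name) is
   skipped so that parameter parsing starts at argv[1]:

       int main(int argc, const char ** argv)
       {
           return transformer::TransformerMain(argc - 1, argv + 1);
       }
*/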
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
*
* An implementation of the transformer system. See more details
* about the Transformer in
* "Attention Is All You Need" by Vaswani et al.
* https://arxiv.org/pdf/1706.03762.pdf
*
* $Created by: XIAO Tong (xiaotong@mail.neu.edu.cn) 2018-07-31
* I start writing the code related to NMT - a long time since my last coding
* work on MT
*/
#ifndef __TRANSFORMER_H__
#define __TRANSFORMER_H__
#include "../../tensor/XGlobal.h"
#include "../../tensor/XTensor.h"
#include "../../tensor/core/CHeader.h"
namespace transformer
{
/* entrance of the program */
int TransformerMain(int argc, const char ** argv);
}
#endif
...@@ -29,6 +29,7 @@ ...@@ -29,6 +29,7 @@
#include "XTensor.h" #include "XTensor.h"
#include "XDevice.h" #include "XDevice.h"
#include "./test/Test.h" #include "./test/Test.h"
#include "./core/CHeader.h"
//#define CRTDBG_MAP_ALLOC //#define CRTDBG_MAP_ALLOC
//#include <stdlib.h> //#include <stdlib.h>
...@@ -37,6 +38,7 @@ ...@@ -37,6 +38,7 @@
using namespace nts; using namespace nts;
void SmallTest(); void SmallTest();
void TransposeTest();
int main( int argc, const char ** argv ) int main( int argc, const char ** argv )
{ {
...@@ -92,3 +94,35 @@ void SmallTest() ...@@ -92,3 +94,35 @@ void SmallTest()
c.Dump(stderr, "c:"); c.Dump(stderr, "c:");
d.Dump(stderr, "d:"); d.Dump(stderr, "d:");
} }
void TransposeTest()
{
XTensor a;
XTensor b;
int I = 2;
int J = 3;
InitTensor4D(&a, 2, 3, 4, 5);
int * dims = new int[a.order];
memcpy(dims, a.dimSize, sizeof(int) * a.order);
dims[I] = a.dimSize[J];
dims[J] = a.dimSize[I];
InitTensor(&b, 4, dims);
a.SetZeroAll();
b.SetZeroAll();
float * data = new float[a.unitNum];
for(int i = 0; i < a.unitNum; i++)
data[i] = (float)i;
a.SetData(data, a.unitNum, 0);
_Transpose(&a, &b, I, J);
b.Dump(stderr, "b:");
delete[] data;
delete[] dims;
}
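/* Expected behaviour (illustrative, assuming _Transpose follows the usual
   convention): with a of shape 2x3x4x5 and I = 2, J = 3, b ends up with
   shape 2x3x5x4 and b[i][j][n][m] == a[i][j][m][n] for all valid indices. */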
...@@ -40,6 +40,7 @@ XDevManager GDevs; ...@@ -40,6 +40,7 @@ XDevManager GDevs;
/* constructor */ /* constructor */
XDevice::XDevice() XDevice::XDevice()
{ {
stream = NULL;
Clear(); Clear();
#ifdef USE_CUDA #ifdef USE_CUDA
...@@ -55,6 +56,8 @@ XDevice::~XDevice() ...@@ -55,6 +56,8 @@ XDevice::~XDevice()
MUTEX_DELE(cublasMutex); MUTEX_DELE(cublasMutex);
if(isHandleReady) if(isHandleReady)
cublasDestroy(cublasHandle); cublasDestroy(cublasHandle);
if(stream != NULL)
delete stream;
#endif #endif
} }
...@@ -118,6 +121,8 @@ void XDevice::Init(int myDevID) ...@@ -118,6 +121,8 @@ void XDevice::Init(int myDevID)
} }
else else
sprintf(name2, "GPU-%d %s", devID, name); sprintf(name2, "GPU-%d %s", devID, name);
stream = new XStream(0, devID);
#endif #endif
} }
...@@ -161,6 +166,14 @@ cublasHandle_t * XDevice::GetCublasHandle() ...@@ -161,6 +166,14 @@ cublasHandle_t * XDevice::GetCublasHandle()
return &cublasHandle; return &cublasHandle;
} }
/* get the stream of cuda */
cudaStream_t * XDevice::GetCudaStream()
{
CheckNTErrors(stream != NULL, "the stream is not initialized!");
return &stream->stream;
}
#endif // USE_CUDA #endif // USE_CUDA
/* switch to a device */ /* switch to a device */
...@@ -311,11 +324,19 @@ void XDevManager::Clear() ...@@ -311,11 +324,19 @@ void XDevManager::Clear()
/* get the handle of GPU */ /* get the handle of GPU */
cublasHandle_t * XDevManager::GetCudaHandle(const int devID) cublasHandle_t * XDevManager::GetCudaHandle(const int devID)
{ {
CheckNTErrors((devID < nGPU), "index of GPU is out of range."); CheckNTErrors(devID < nGPU, "index of GPU is out of range.");
return GPUs[devID].GetCublasHandle(); return GPUs[devID].GetCublasHandle();
} }
/* get the stream of cuda */
cudaStream_t * XDevManager::GetCudaStream(const int devID)
{
CheckNTErrors(devID < nGPU, "index of GPU is out of range.");
return GPUs[devID].GetCudaStream();
}
#endif #endif
/* /*
...@@ -384,13 +405,10 @@ int XDevManager::GetCudaThread2D(const int devID, const int n, const int m, int ...@@ -384,13 +405,10 @@ int XDevManager::GetCudaThread2D(const int devID, const int n, const int m, int
memset(gridSize, 0, sizeof(int) * 3); memset(gridSize, 0, sizeof(int) * 3);
memset(blockSize, 0, sizeof(int) * 3); memset(blockSize, 0, sizeof(int) * 3);
if(n <= 0 || m <= 0 || devID >= nGPU) if(n <= 0 || m <= 0)
return 1; return 1;
if(devID < 0){ CheckNTErrors(devID >= 0 && devID < nGPU, "Invalid GPU device id!");
XPRINT(0, stderr, "WARNING! You are calling the grid and block size computation function on a CPU!");
return 0;
}
#ifdef USE_CUDA #ifdef USE_CUDA
......
...@@ -25,6 +25,7 @@ ...@@ -25,6 +25,7 @@
#define __XDEVICE_H__ #define __XDEVICE_H__
#include "XThread.h" #include "XThread.h"
#include "XStream.h"
#ifdef USE_CUDA #ifdef USE_CUDA
...@@ -92,6 +93,9 @@ public: ...@@ -92,6 +93,9 @@ public:
/* specify whether Unified Virtual Address Space (UVA) is supported */ /* specify whether Unified Virtual Address Space (UVA) is supported */
bool isUVASupported; bool isUVASupported;
/* default stream for the device */
XStream * stream;
#ifdef USE_CUDA #ifdef USE_CUDA
/* mutex for handle (GPU cublas) */ /* mutex for handle (GPU cublas) */
...@@ -121,6 +125,9 @@ public: ...@@ -121,6 +125,9 @@ public:
#ifdef USE_CUDA #ifdef USE_CUDA
/* get cublas handle */ /* get cublas handle */
cublasHandle_t * GetCublasHandle(); cublasHandle_t * GetCublasHandle();
/* get the stream of cuda */
cudaStream_t * GetCudaStream();
#endif #endif
/* switch to a device */ /* switch to a device */
...@@ -178,6 +185,9 @@ public: ...@@ -178,6 +185,9 @@ public:
#ifdef USE_CUDA #ifdef USE_CUDA
/* get the handle of GPU */ /* get the handle of GPU */
cublasHandle_t * GetCudaHandle(const int devID); cublasHandle_t * GetCudaHandle(const int devID);
/* get the stream of cuda */
cudaStream_t * GetCudaStream(const int devID);
#endif #endif
/* get grid and block sizes that max potential */ /* get grid and block sizes that max potential */
......
...@@ -167,7 +167,9 @@ void XLink::SetType(int id) ...@@ -167,7 +167,9 @@ void XLink::SetType(int id)
type[0] = 0; type[0] = 0;
strcpy(type, GetOPName(id)); strcpy(type, GetOPName(id));
typeID = id; typeID = id;
CheckNTErrors(strcmp(type, "NULL"), "illegal edge type name!"); if(id != 0){
CheckNTErrors(strcmp(type, "NULL"), "illegal edge type name!");
}
} }
/* /*
...@@ -515,7 +517,7 @@ void XLink::CopyIncoming(const XTensor * reference, XTensor * target) ...@@ -515,7 +517,7 @@ void XLink::CopyIncoming(const XTensor * reference, XTensor * target)
tails.Add(tail); tails.Add(tail);
} }
MakeLink(&tails, target, reference->id); MakeLink(&tails, target, reference->income.typeID);
int paraNum = reference->income.paramNum; int paraNum = reference->income.paramNum;
target->income.paramNum = paraNum; target->income.paramNum = paraNum;
......
...@@ -208,22 +208,16 @@ void XList::Insert(int pos, void * item) ...@@ -208,22 +208,16 @@ void XList::Insert(int pos, void * item)
/* get the item at position i */ /* get the item at position i */
void * XList::GetItem(int i) const void * XList::GetItem(int i) const
{ {
if( i >= 0 && i < count ) CheckNTErrors(i >= 0 && i < count, "Index of a list item is out of scope!");
return items[i]; return items[i];
else
return NULL;
} }
/* get the integer-typed item at position i */ /* get the integer-typed item at position i */
int XList::GetItemInt(int i) int XList::GetItemInt(int i)
{ {
CheckNTErrors(isIntList, "An int list is required!"); CheckNTErrors(isIntList, "An int list is required!");
CheckNTErrors(i >= 0 && i < count, "Index of a list item is out of scope!");
if( i >= 0 && i < count ){ return *(int*)(items[i]);
return *(int*)(items[i]);
}
else
return 0;
} }
/* set the item at position i */ /* set the item at position i */
......
...@@ -181,7 +181,10 @@ void XMem::Free(int myDevID, void * mem) ...@@ -181,7 +181,10 @@ void XMem::Free(int myDevID, void * mem)
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
SetDevice(myDevID); SetDevice(myDevID);
CheckNTErrors(cudaFree((char*)mem) == cudaSuccess, "Cannot free the memory."); cudaError_t error = cudaFree((char*)mem);
if(error != cudaSuccess){
ShowNTErrors("Cannot free the memory.");
}
#else #else
ShowNTErrors("Please specify USE_CUDA for compiling this program."); ShowNTErrors("Please specify USE_CUDA for compiling this program.");
#endif #endif
......
...@@ -29,6 +29,22 @@ const char * GetOPName(int type) ...@@ -29,6 +29,22 @@ const char * GetOPName(int type)
if ((type & MATH_BASE) != 0){ if ((type & MATH_BASE) != 0){
if (type == MATH_ABSOLUTE) if (type == MATH_ABSOLUTE)
return "M_ABSOLUTE"; return "M_ABSOLUTE";
else if (type == MATH_EXP)
return "M_EXP";
else if (type == MATH_LOG)
return "M_LOG";
else if (type == MATH_SIN)
return "M_SIN";
else if (type == MATH_COS)
return "M_COS";
else if (type == MATH_TAN)
return "M_TAN";
else if (type == MATH_ROUND)
return "M_ROUND";
else if (type == MATH_CLIP)
return "M_CLIP";
else if (type == MATH_DIV)
return "M_DIV";
else if (type == MATH_MATRIXMUL) else if (type == MATH_MATRIXMUL)
return "M_MATRIXMUL"; return "M_MATRIXMUL";
else if (type == MATH_MATRIXMULBATCHED) else if (type == MATH_MATRIXMULBATCHED)
...@@ -37,18 +53,20 @@ const char * GetOPName(int type) ...@@ -37,18 +53,20 @@ const char * GetOPName(int type)
return "M_MULTIPLY"; return "M_MULTIPLY";
else if (type == MATH_NEGATE) else if (type == MATH_NEGATE)
return "M_NEGATE"; return "M_NEGATE";
else if (type == MATH_SIGN)
return "M_SIGN";
else if (type == MATH_SUM)
return "M_SUM";
else if (type == MATH_LOG)
return "M_LOG";
else if (type == MATH_NORMALIZE) else if (type == MATH_NORMALIZE)
return "M_NORMALIZE"; return "M_NORMALIZE";
else if (type == MATH_POWER) else if (type == MATH_POWER)
return "M_POWER"; return "M_POWER";
else if (type == MATH_SCALEANDSHIFT) else if (type == MATH_SCALEANDSHIFT)
return "M_SCALEANDSHIFT"; return "M_SCALEANDSHIFT";
else if (type == MATH_SIGN)
return "M_SIGN";
else if (type == MATH_SUM)
return "M_SUM";
else if (type == MATH_SUB)
return "M_SUB";
else if (type == MATH_SUMDIM)
return "M_SUMDIM";
else if (type == REDUCE_REDUCEMAX) else if (type == REDUCE_REDUCEMAX)
return "R_REDUCEMAX"; return "R_REDUCEMAX";
else if (type == REDUCE_REDUCEMEAN) else if (type == REDUCE_REDUCEMEAN)
......
...@@ -30,20 +30,30 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -30,20 +30,30 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* math operations */ /* math operations */
#define MATH_BASE 0x00001000 #define MATH_BASE 0x00001000
#define MATH_ABSOLUTE MATH_BASE + 1 #define MATH_ABSOLUTE MATH_BASE + 1
#define MATH_MATRIXMUL MATH_ABSOLUTE + 1 #define MATH_EXP MATH_ABSOLUTE + 1
#define MATH_LOG MATH_EXP + 1
#define MATH_SIN MATH_LOG + 1
#define MATH_COS MATH_SIN + 1
#define MATH_TAN MATH_COS + 1
#define MATH_ROUND MATH_TAN + 1
#define MATH_CLIP MATH_ROUND + 1
#define MATH_DIV MATH_CLIP + 1
#define MATH_MATRIXMUL MATH_DIV + 1
#define MATH_MATRIXMULBATCHED MATH_MATRIXMUL + 1 #define MATH_MATRIXMULBATCHED MATH_MATRIXMUL + 1
#define MATH_MULTIPLY MATH_MATRIXMULBATCHED + 1 #define MATH_MULTIPLY MATH_MATRIXMULBATCHED + 1
#define MATH_NEGATE MATH_MULTIPLY + 1 #define MATH_NEGATE MATH_MULTIPLY + 1
#define MATH_SIGN MATH_NEGATE + 1 #define MATH_NORMALIZE MATH_NEGATE + 1
#define MATH_SUM MATH_SIGN + 1
#define MATH_LOG MATH_SUM + 1
#define MATH_NORMALIZE MATH_LOG + 1
#define MATH_POWER MATH_NORMALIZE + 1 #define MATH_POWER MATH_NORMALIZE + 1
#define MATH_SCALEANDSHIFT MATH_POWER + 1 #define MATH_SCALEANDSHIFT MATH_POWER + 1
#define MATH_SIGN MATH_SCALEANDSHIFT + 1
#define MATH_SUM MATH_SIGN + 1
#define MATH_SUB MATH_SUM + 1
#define MATH_SUMDIM MATH_SUB + 1
#define REDUCE MATH_SCALEANDSHIFT + 1 #define REDUCE MATH_SUMDIM + 1
#define REDUCE_REDUCEMAX REDUCE + 1 #define REDUCE_REDUCEMAX REDUCE + 1
#define REDUCE_REDUCEMEAN REDUCE_REDUCEMAX + 1 #define REDUCE_REDUCEMEAN REDUCE_REDUCEMAX + 1
#define REDUCE_REDUCESUM REDUCE_REDUCEMEAN + 1 #define REDUCE_REDUCESUM REDUCE_REDUCEMEAN + 1
......
...@@ -84,7 +84,7 @@ void XStream::Create(int priority, int myDevID) ...@@ -84,7 +84,7 @@ void XStream::Create(int priority, int myDevID)
XDevice::SetGPUDevice(myDevID); XDevice::SetGPUDevice(myDevID);
//cudaStreamCreateWithPriority(&stream, cudaStreamDefault, priority); //cudaStreamCreateWithPriority(&stream, cudaStreamDefault, priority);
CheckNTErrors((cudaStreamCreate(&stream) == cudaSuccess), CheckNTErrors((cudaStreamCreate(&stream) == cudaSuccess),
"cannot create the cuda stream!"); "cannot create the cuda stream!");
XDevice::SetGPUDevice(backupDevID); XDevice::SetGPUDevice(backupDevID);
#endif #endif
devID = myDevID; devID = myDevID;
......
...@@ -42,6 +42,8 @@ ...@@ -42,6 +42,8 @@
#include "core/movement/CopyValues.h" #include "core/movement/CopyValues.h"
#include "core/arithmetic/Sum.h" #include "core/arithmetic/Sum.h"
#include "core/arithmetic/Multiply.h" #include "core/arithmetic/Multiply.h"
#include "core/arithmetic/Sub.h"
#include "core/arithmetic/Div.h"
#include "core/math/ScaleAndShift.h" #include "core/math/ScaleAndShift.h"
#ifdef USE_CUDA #ifdef USE_CUDA
...@@ -354,6 +356,18 @@ XTensor XTensor::operator* (const XTensor& tensor) ...@@ -354,6 +356,18 @@ XTensor XTensor::operator* (const XTensor& tensor)
return Multiply(*this, tensor); return Multiply(*this, tensor);
} }
/* overloading of the minus-sign */
XTensor XTensor::operator- (const XTensor& tensor)
{
return Sub(*this, tensor);
}
/* overloading of the division-sign */
XTensor XTensor::operator/ (const XTensor& tensor)
{
return Div(*this, tensor);
}
/* /*
linear transformation b = a * \scale + \shift linear transformation b = a * \scale + \shift
>> scale - the slope >> scale - the slope
...@@ -426,8 +440,12 @@ get the size of a given dimension ...@@ -426,8 +440,12 @@ get the size of a given dimension
int XTensor::GetDim(const int dim) int XTensor::GetDim(const int dim)
{ {
CheckNTErrors(dim < order, "dimension is out of range!"); CheckNTErrors(dim < order, "dimension is out of range!");
int d = dim;
if(dim < 0)
d = order + dim;
return dimSize[dim]; return dimSize[d];
} }
/* /*
...@@ -454,6 +472,27 @@ void XTensor::Reshape(const int myOrder, const int * myDimSize) ...@@ -454,6 +472,27 @@ void XTensor::Reshape(const int myOrder, const int * myDimSize)
memcpy(dimSizeRDI, dimsRDI, sizeof(int) * order); memcpy(dimSizeRDI, dimsRDI, sizeof(int) * order);
} }
/*
reshape the tensor to a vector
>> num - number of elements
*/
void XTensor::Reshape(const int num)
{
int dim = num;
Reshape(1, &dim);
}
/*
reshape the tensor to a matrix
>> rowNum - number of rows
>> colNum - number of columns
*/
void XTensor::Reshape(const int rowNum, const int colNum)
{
int dims[2] = {rowNum, colNum};
Reshape(2, dims);
}
/* get the number of items in the data array */ /* get the number of items in the data array */
int XTensor::GetSize() const int XTensor::GetSize() const
{ {
...@@ -560,25 +599,24 @@ set the tensor items by a uniform distribution in range [lower, upper] ...@@ -560,25 +599,24 @@ set the tensor items by a uniform distribution in range [lower, upper]
void XTensor::SetDataRand(DTYPE lower, DTYPE upper) void XTensor::SetDataRand(DTYPE lower, DTYPE upper)
{ {
// TODO: cuda code!!!!!!! // TODO: cuda code!!!!!!!
// TODO: replace float with DTYPE
if (data == NULL) if (data == NULL)
return; return;
// srand((unsigned)time(0)); // srand((unsigned)time(0));
DTYPE variance = upper - lower;
void * d = NULL; void * d = NULL;
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
d = new float[unitNum]; d = new float[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
DTYPE value = lower + (upper - lower) * (float)rand() / RAND_MAX; DTYPE value = lower + variance * (float)rand() / RAND_MAX;
*((float*)d + i) = value; *((float*)d + i) = value;
} }
} }
else if (dataType == X_DOUBLE) { else if (dataType == X_DOUBLE) {
d = new double[unitNum]; d = new double[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
*((double*)d + i) = lower + (upper - lower) * rand() / RAND_MAX; *((double*)d + i) = lower + variance * rand() / RAND_MAX;
} }
} }
else { else {
...@@ -588,15 +626,15 @@ void XTensor::SetDataRand(DTYPE lower, DTYPE upper) ...@@ -588,15 +626,15 @@ void XTensor::SetDataRand(DTYPE lower, DTYPE upper)
SetData(d, unitNum); SetData(d, unitNum);
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
delete[](float*)d; delete[] (float*)d;
} }
else { else {
delete[](double*)d; delete[] (double*)d;
} }
} }
/* a gauss distribution */ /* a gauss distribution (Box-Muller method) */
double GaussRand() double GaussRand(DTYPE mean, DTYPE standardDeviation)
{ {
// TODO: cuda code!!!!!!! // TODO: cuda code!!!!!!!
...@@ -606,8 +644,8 @@ double GaussRand() ...@@ -606,8 +644,8 @@ double GaussRand()
double pi = 3.141592654; double pi = 3.141592654;
if (phase == 0){ if (phase == 0){
u = rand() / (RAND_MAX + 1.0); u = (rand() + 1.0) / (RAND_MAX + 1.0);
v = rand() / (RAND_MAX + 1.0); v = (rand() + 1.0) / (RAND_MAX + 1.0);
z = sqrt(-2.0 * log(u))* sin(2.0 * pi * v); z = sqrt(-2.0 * log(u))* sin(2.0 * pi * v);
} }
else{ else{
...@@ -615,7 +653,7 @@ double GaussRand() ...@@ -615,7 +653,7 @@ double GaussRand()
} }
phase = 1 - phase; phase = 1 - phase;
return z; return mean + (z * standardDeviation);
} }
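/* Math note: the Box-Muller transform above maps independent uniforms
   u, v in (0, 1] to z = sqrt(-2 ln u) * sin(2 pi v), which is N(0, 1);
   mean + standardDeviation * z is then N(mean, standardDeviation^2). */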
/* /*
...@@ -626,7 +664,6 @@ set the tensor items by a normal distribution ...@@ -626,7 +664,6 @@ set the tensor items by a normal distribution
void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation) void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
{ {
// TODO: cuda code!!!!!!! // TODO: cuda code!!!!!!!
// TODO: replace float with DTYPE
if (data == NULL) if (data == NULL)
return; return;
...@@ -636,13 +673,13 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation) ...@@ -636,13 +673,13 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
d = new float[unitNum]; d = new float[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
*((float*)d + i) = (float)GaussRand(); *((float*)d + i) = (float)GaussRand(mean, standardDeviation);
} }
} }
else if (dataType == X_DOUBLE) { else if (dataType == X_DOUBLE) {
d = new double[unitNum]; d = new double[unitNum];
for (int i = 0; i < unitNum; i++) { for (int i = 0; i < unitNum; i++) {
*((double*)d + i) = GaussRand(); *((double*)d + i) = GaussRand(mean, standardDeviation);
} }
} }
else { else {
...@@ -652,10 +689,10 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation) ...@@ -652,10 +689,10 @@ void XTensor::SetDataRandn(DTYPE mean, DTYPE standardDeviation)
SetData(d, unitNum); SetData(d, unitNum);
if (dataType == X_FLOAT) { if (dataType == X_FLOAT) {
delete[](float*)d; delete[] (float*)d;
} }
else { else {
delete[](double*)d; delete[] (double*)d;
} }
} }
...@@ -1003,11 +1040,11 @@ set the value of a cell in a 3d tensor in default type ...@@ -1003,11 +1040,11 @@ set the value of a cell in a 3d tensor in default type
*/ */
bool XTensor::Set3D(DTYPE value, int d0, int d1, int d2) bool XTensor::Set3D(DTYPE value, int d0, int d1, int d2)
{ {
CheckNTErrors((order == 3), "Cannot get a 2d cell for a tensor whose order is not 2!"); CheckNTErrors(order == 3, "Cannot set a 3d cell for a tensor whose order is not 3!");
CheckNTErrors((d0 >= 0 && d1 < dimSize[0]), "dimension 0 is out of range!"); CheckNTErrors(d0 >= 0 && d0 < dimSize[0], "dimension 0 is out of range!");
CheckNTErrors((d2 >= 0 && d2 < dimSize[1]), "dimension 1 is out of range!"); CheckNTErrors(d1 >= 0 && d1 < dimSize[1], "dimension 1 is out of range!");
CheckNTErrors((d2 >= 0 && d2 < dimSize[2]), "dimension 1 is out of range!"); CheckNTErrors(d2 >= 0 && d2 < dimSize[2], "dimension 2 is out of range!");
CheckNTErrors((dataType == DEFAULT_DTYPE), "The tensor is not in default type."); CheckNTErrors(dataType == DEFAULT_DTYPE, "The tensor is not in default type.");
int dims[3] = {d0, d1, d2}; int dims[3] = {d0, d1, d2};
...@@ -1439,6 +1476,21 @@ void XTensor::Dump(FILE * file, const char * label, const int n, const int verbo ...@@ -1439,6 +1476,21 @@ void XTensor::Dump(FILE * file, const char * label, const int n, const int verbo
} }
/* /*
dump data to a file
>> tensor - tensor whose data is dumped
>> file - where to dump the data
>> label - label of the tensor
>> n - number of items to dump
>> verbose - verbose level
*/
void XTensor::Dump(const XTensor * tensor, FILE * file, const char * label, const int n, const int verbose)
{
XTensor a(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, tensor->devID, tensor->mem);
_CopyValues(tensor, &a);
a.Dump(file, label, n, verbose);
}
/*
read data from a file read data from a file
>> file - where to load the data >> file - where to load the data
>> label - label of the tensor >> label - label of the tensor
...@@ -1687,13 +1739,13 @@ void InitTensor(XTensor * tensor, ...@@ -1687,13 +1739,13 @@ void InitTensor(XTensor * tensor,
dims[0] = -abs(dims[0]); dims[0] = -abs(dims[0]);
tensor->Resize(myOrder, dims, myDataType, myDenseRatio); if (myDevID == CURRENT_GPU)
if(myDevID == CURRENT_GPU)
tensor->devID = XDevice::GetGPUDevice(); tensor->devID = XDevice::GetGPUDevice();
else else
tensor->devID = myDevID; tensor->devID = myDevID;
tensor->Resize(myOrder, dims, myDataType, myDenseRatio);
if(allocated) if(allocated)
XTensor::AllocateData(tensor); XTensor::AllocateData(tensor);
} }
...@@ -1870,28 +1922,47 @@ generate a XTensor which allocates data on the buffer ...@@ -1870,28 +1922,47 @@ generate a XTensor which allocates data on the buffer
>> myDimSize - the size of each dimension >> myDimSize - the size of each dimension
>> myMem - memory pool used to allocate the data array. >> myMem - memory pool used to allocate the data array.
we actually allocate the data on the buffer associated with we actually allocate the data on the buffer associated with
the memory pool. the memory pool
>> devID - device id
>> myDataType - unit size (e.g., int, float, and double) >> myDataType - unit size (e.g., int, float, and double)
>> myDenseRatio - how often an element has non-zero value >> myDenseRatio - how often an element has non-zero value
*/ */
XTensor * NewTensorBuf(const int myOrder, const int * myDimSize, XMem * myMem, XTensor * NewTensorBuf(const int myOrder, const int * myDimSize,
const TENSOR_DATA_TYPE myDataType, const float myDenseRatio) const TENSOR_DATA_TYPE myDataType, const float myDenseRatio,
const int devID, XMem * myMem)
{ {
CheckNTErrors(myMem != NULL, "No memory pool specified!");
int dims[MAX_TENSOR_DIM_NUM]; int dims[MAX_TENSOR_DIM_NUM];
memcpy(dims, myDimSize, sizeof(int) * myOrder); memcpy(dims, myDimSize, sizeof(int) * myOrder);
dims[0] = -abs(dims[0]); dims[0] = -abs(dims[0]);
XTensor * tensor = NewTensor(myOrder, dims, myDataType, myDenseRatio, -1, myMem); XTensor * tensor = NewTensor(myOrder, dims, myDataType, myDenseRatio, devID, myMem);
tensor->data = myMem->AllocBuf(myMem->devID, tensor->unitNum * tensor->unitSize);
if(myMem != NULL)
tensor->data = myMem->AllocBuf(myMem->devID, tensor->unitNum * tensor->unitSize);
else
tensor->data = XMemAlloc(devID, tensor->unitNum * tensor->unitSize);
return tensor; return tensor;
} }
/* /*
generate a XTensor which allocates data on the buffer
>> reference - reference tensor
>> devID - device id
>> myMem - memory pool used to allocate the data array.
we actually allocate the data on the buffer associated with
the memory pool
*/
XTensor * NewTensorBuf(const XTensor * reference, int devID, XMem * myMem)
{
return NewTensorBuf(reference->order, reference->dimSize,
reference->dataType, reference->denseRatio,
devID, myMem);
}
/*
generate a dense vector generate a dense vector
>> num - number of entries >> num - number of entries
>> myDataType - unit size (e.g., int, float, and double) >> myDataType - unit size (e.g., int, float, and double)
...@@ -2041,7 +2112,7 @@ XTensor * NewTensor(XTensor * a, bool isFilledData) ...@@ -2041,7 +2112,7 @@ XTensor * NewTensor(XTensor * a, bool isFilledData)
free the data space of a given tensor free the data space of a given tensor
>> tensor - pointer to the tensor >> tensor - pointer to the tensor
*/ */
void DelTensor(const XTensor * tensor) void DelTensor(XTensor * tensor)
{ {
delete tensor; delete tensor;
} }
...@@ -2050,10 +2121,13 @@ void DelTensor(const XTensor * tensor) ...@@ -2050,10 +2121,13 @@ void DelTensor(const XTensor * tensor)
free the data space of a given tensor (on the buffer) free the data space of a given tensor (on the buffer)
>> tensor - pointer to the tensor >> tensor - pointer to the tensor
*/ */
void DelTensorBuf(const XTensor * tensor) void DelTensorBuf(XTensor * tensor)
{ {
CheckNTErrors(tensor->mem != NULL, "No memory pool found!"); if(tensor->mem != NULL)
tensor->mem->ReleaseBuf(tensor->devID, tensor->unitNum * tensor->unitSize); tensor->mem->ReleaseBuf(tensor->devID, tensor->unitNum * tensor->unitSize);
else
XMemFree(tensor->devID, tensor->data);
tensor->data = NULL;
delete tensor; delete tensor;
} }
......
...@@ -45,12 +45,13 @@ namespace nts{ ...@@ -45,12 +45,13 @@ namespace nts{
struct XLink; struct XLink;
/* define the maximum number of dimensions in a tensor */ /* define the maximum number of dimensions in a tensor */
#define MAX_TENSOR_DIM_NUM 6 #define MAX_TENSOR_DIM_NUM 8
#define USE_BATCHED_STRIDED_MAT_MUL #define USE_BATCHED_STRIDED_MAT_MUL
#define MIN_TENSOR_SPLIT_NUM 10 #define MIN_TENSOR_SPLIT_NUM 0
#define MIN_TENSOR_SPLIT_LIST_NUM 1024 #define MIN_TENSOR_SPLIT_LIST_NUM 1024
#define MIN_TENSOR_CAT_NUM 8 #define MIN_TENSOR_CAT_NUM 8
/* computation flags */ /* computation flags */
#define UNSAFE_BUT_FAST_MEM #define UNSAFE_BUT_FAST_MEM
#define FAST_MATRIX #define FAST_MATRIX
...@@ -202,6 +203,12 @@ public: ...@@ -202,6 +203,12 @@ public:
/* overloading of the multiply-sign */ /* overloading of the multiply-sign */
XTensor operator* (const XTensor &tensor); XTensor operator* (const XTensor &tensor);
/* overloading of the minus-sign */
XTensor operator- (const XTensor &tensor);
/* overloading of the division-sign */
XTensor operator/ (const XTensor &tensor);
/* linear transformation */ /* linear transformation */
XTensor Lin(DTYPE scale, DTYPE shift = 0); XTensor Lin(DTYPE scale, DTYPE shift = 0);
...@@ -222,6 +229,12 @@ public: ...@@ -222,6 +229,12 @@ public:
/* reshape the tensor */ /* reshape the tensor */
void Reshape(const int order, const int * myDimSize); void Reshape(const int order, const int * myDimSize);
/* reshape the tensor to a vector */
void Reshape(const int num);
/* reshape the tensor to a matrix */
void Reshape(const int rowNum, const int colNum);
/* get the number of items in the data array */ /* get the number of items in the data array */
int GetSize() const; int GetSize() const;
...@@ -328,6 +341,10 @@ public: ...@@ -328,6 +341,10 @@ public:
/* dump data to a file */ /* dump data to a file */
void Dump(FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0); void Dump(FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0);
/* dump data to a file */
static
void Dump(const XTensor * tensor, FILE * file, const char * label = NULL, const int n = -1, const int verbose = 0);
/* read data from a file */ /* read data from a file */
void Read(FILE * file, const char * label = NULL); void Read(FILE * file, const char * label = NULL);
...@@ -386,8 +403,12 @@ XTensor * NewTensor(const int myOrder, const int * myDimSize, const TENSOR_DATA_ ...@@ -386,8 +403,12 @@ XTensor * NewTensor(const int myOrder, const int * myDimSize, const TENSOR_DATA_
const float myDenseRatio = 1.0F, const int myDevID = -1, XMem * myMem = NULL); const float myDenseRatio = 1.0F, const int myDevID = -1, XMem * myMem = NULL);
/* generate a XTensor which allocates data on the buffer */ /* generate a XTensor which allocates data on the buffer */
XTensor * NewTensorBuf(const int myOrder, const int * myDimSize, XMem * myMem, XTensor * NewTensorBuf(const int myOrder, const int * myDimSize,
const TENSOR_DATA_TYPE myDataType = X_FLOAT, const float myDenseRatio = 1.0F); const TENSOR_DATA_TYPE myDataType = X_FLOAT, const float myDenseRatio = 1.0F,
const int myDevID = -1, XMem * myMem = NULL);
/* generate a XTensor which allocates data on the buffer */
XTensor * NewTensorBuf(const XTensor * reference, int devID, XMem * myMem);
/* generate a dense vector */ /* generate a dense vector */
XTensor * NewTensor1D(const int num, const TENSOR_DATA_TYPE myDataType = X_FLOAT, const int myDevID = -1, XTensor * NewTensor1D(const int num, const TENSOR_DATA_TYPE myDataType = X_FLOAT, const int myDevID = -1,
...@@ -417,10 +438,10 @@ XTensor * NewTensor5D(const int d0, const int d1, const int d2, const int d3, co ...@@ -417,10 +438,10 @@ XTensor * NewTensor5D(const int d0, const int d1, const int d2, const int d3, co
XTensor * NewTensor(XTensor * a, bool isFilledData = true); XTensor * NewTensor(XTensor * a, bool isFilledData = true);
/* free the data space of a given tensor */ /* free the data space of a given tensor */
void DelTensor(const XTensor * tensor); void DelTensor(XTensor * tensor);
/* free the data space of a given tensor (on the buffer) */ /* free the data space of a given tensor (on the buffer) */
void DelTensorBuf(const XTensor * tensor); void DelTensorBuf(XTensor * tensor);
} /* end of the nts (NiuTrans.Tensor) namespace */ } /* end of the nts (NiuTrans.Tensor) namespace */
......
...@@ -175,29 +175,38 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size) ...@@ -175,29 +175,38 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size)
return; return;
} }
#ifdef USE_CUDA #ifdef USE_CUDA
else if(devIDT >= 0 && devIDS < 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyHostToDevice);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyHostToDevice)");
}
}
else if(devIDT < 0 && devIDS >= 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToHost);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else{ else{
//if(devIDT == devIDS){ int devID = devIDT < 0 ? devIDS : devIDT;
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToDevice); int devIDBackup = 0;
cudaGetDevice(&devIDBackup);
cudaSetDevice(devID);
if(devIDT >= 0 && devIDS < 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyHostToDevice);
if(error != cudaSuccess){ if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)"); ShowNTErrors("cudaMemcpy error (cudaMemcpyHostToDevice)");
} }
/*} }
else if(devIDT < 0 && devIDS >= 0){
cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToHost);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else{ else{
CheckNTErrors((cudaMemcpyPeer(t, devIDT, s, devIDS, size) == cudaSuccess), //if(devIDT == devIDS){
"cudaMemcpy error (cudaMemcpyDeviceToDevice)"); cudaError_t error = cudaMemcpy(t, s, size, cudaMemcpyDeviceToDevice);
}*/ if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}
/*}
else{
CheckNTErrors((cudaMemcpyPeer(t, devIDT, s, devIDS, size) == cudaSuccess),
"cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}*/
}
cudaSetDevice(devIDBackup);
} }
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
...@@ -208,6 +217,9 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size) ...@@ -208,6 +217,9 @@ void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size)
#ifdef USE_CUDA #ifdef USE_CUDA
void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size, cudaStream_t stream, int streamDevID) void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size, cudaStream_t stream, int streamDevID)
{ {
if(t == s)
return;
int devIDBackup = -1; int devIDBackup = -1;
if(streamDevID >= 0 && (devIDT >= 0 || devIDS >= 0)){ if(streamDevID >= 0 && (devIDT >= 0 || devIDS >= 0)){
CheckNTErrors((cudaGetDevice(&devIDBackup) == cudaSuccess), "Cannot get GPU device id!"); CheckNTErrors((cudaGetDevice(&devIDBackup) == cudaSuccess), "Cannot get GPU device id!");
...@@ -220,17 +232,23 @@ void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size ...@@ -220,17 +232,23 @@ void XMemCopyAsync(void * t, int devIDT, const void * s, int devIDS, size_t size
return; return;
} }
else if(devIDT >= 0 && devIDS < 0){ else if(devIDT >= 0 && devIDS < 0){
CheckNTErrors((cudaMemcpyAsync(t, s, size, cudaMemcpyHostToDevice, stream) == cudaSuccess), cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyHostToDevice, stream);
"cudaMemcpyAsync error (cudaMemcpyHostToDevice)"); if(error != cudaSuccess){
ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyHostToDevice)");
}
} }
else if(devIDT < 0 && devIDS >= 0){ else if(devIDT < 0 && devIDS >= 0){
CheckNTErrors((cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToHost, stream) == cudaSuccess), cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToHost, stream);
"cudaMemcpyAsync error (cudaMemcpyDeviceToHost)"); if(error != cudaSuccess){
ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyDeviceToHost)");
}
} }
else{ else{
//if(devIDT == devIDS){ //if(devIDT == devIDS){
CheckNTErrors((cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToDevice, stream) == cudaSuccess), cudaError_t error = cudaMemcpyAsync(t, s, size, cudaMemcpyDeviceToDevice, stream);
"cudaMemcpyAsync error (cudaMemcpyDeviceToDevice)"); if(error != cudaSuccess){
ShowNTErrors("cudaMemcpyAsync error (cudaMemcpyDeviceToDevice)");
}
//} //}
/*else{ /*else{
CheckNTErrors((cudaMemcpyPeerAsync(t, devIDT, s, devIDS, size, stream) == cudaSuccess), CheckNTErrors((cudaMemcpyPeerAsync(t, devIDT, s, devIDS, size, stream) == cudaSuccess),
...@@ -261,18 +279,69 @@ void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPit ...@@ -261,18 +279,69 @@ void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPit
return; return;
} }
#ifdef USE_CUDA #ifdef USE_CUDA
else if (devIDT >= 0 && devIDS < 0) { else{
CheckNTErrors((cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice) == cudaSuccess), int devID = devIDT < 0 ? devIDS : devIDT;
"cudaMemcpy2D error (cudaMemcpyHostToDevice)"); int devIDBackup = 0;
cudaGetDevice(&devIDBackup);
cudaSetDevice(devID);
if (devIDT >= 0 && devIDS < 0) {
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy2D error (cudaMemcpyHostToDevice)");
}
}
else if (devIDT < 0 && devIDS >= 0) {
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else {
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice);
if (error != cudaSuccess) {
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}
}
cudaSetDevice(devIDBackup);
} }
else if (devIDT < 0 && devIDS >= 0) { #else
CheckNTErrors((cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost) == cudaSuccess), ShowNTErrors("Please specify USE_CUDA and recompile the code!");
"cudaMemcpy error (cudaMemcpyDeviceToHost)"); #endif
}
void XMemCopy2DAsync(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n, XStream * stream)
{
if (t == s)
return;
if (devIDT < 0 && devIDS < 0) {
for(int i = 0; i < n; i++)
memcpy((char*)t + tPitch * i, (char*)s + sPitch * i, mSize);
return;
} }
else { #ifdef USE_CUDA
cudaError_t error = cudaMemcpy2D(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice); else{
if (error != cudaSuccess) { CheckNTErrors(stream != NULL, "No stream found!");
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)"); cudaStream_t &cstream = stream->stream;
if (devIDT >= 0 && devIDS < 0) {
cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyHostToDevice, cstream);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy2D error (cudaMemcpyHostToDevice)");
}
}
else if (devIDT < 0 && devIDS >= 0) {
cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToHost, cstream);
if(error != cudaSuccess){
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToHost)");
}
}
else {
cudaError_t error = cudaMemcpy2DAsync(t, tPitch, s, sPitch, mSize, n, cudaMemcpyDeviceToDevice, cstream);
if (error != cudaSuccess) {
ShowNTErrors("cudaMemcpy error (cudaMemcpyDeviceToDevice)");
}
} }
} }
#else #else
......
...@@ -23,6 +23,7 @@ ...@@ -23,6 +23,7 @@
#include <stdio.h> #include <stdio.h>
#include "XGlobal.h" #include "XGlobal.h"
#include "XDevice.h"
#ifndef __XUTILITY_H__ #ifndef __XUTILITY_H__
#define __XUTILITY_H__ #define __XUTILITY_H__
...@@ -41,6 +42,7 @@ extern void XMemSet(void * p, int value, size_t size); ...@@ -41,6 +42,7 @@ extern void XMemSet(void * p, int value, size_t size);
extern void XMemSet(int devID, void * p, int value, size_t size); extern void XMemSet(int devID, void * p, int value, size_t size);
extern void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size); extern void XMemCopy(void * t, int devIDT, const void * s, int devIDS, size_t size);
extern void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n); extern void XMemCopy2D(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n);
extern void XMemCopy2DAsync(void * t, size_t tPitch, int devIDT, const void * s, size_t sPitch, int devIDS, size_t mSize, int n, XStream * stream);
extern void * XMemAlloc(int devID, size_t size); extern void * XMemAlloc(int devID, size_t size);
extern void * XMemAllocOnDev(int devID, size_t size); extern void * XMemAllocOnDev(int devID, size_t size);
extern void XMemFree(int devID, void * p); extern void XMemFree(int devID, void * p);
......
...@@ -26,49 +26,63 @@ ...@@ -26,49 +26,63 @@
#include "../XTensor.h" #include "../XTensor.h"
#include "shape/Concatenate.h" #include "arithmetic/Div.h"
#include "shape/ConcatenateSolely.h"
#include "movement/CopyBlocks.h"
#include "movement/CopyBlocksInGrid.h"
#include "movement/CopyBlocksOnSite.h"
#include "movement/CopyData2D.h"
#include "movement/CopyIndexed.h"
#include "movement/CopyInGrid.h"
#include "movement/CopyValues.h"
#include "utilities/FlushToMem.h"
#include "shape/MakeMergeBlockIndex.h"
#include "shape/MakeSplitBlockIndex.h"
#include "arithmetic/MatrixMul.h" #include "arithmetic/MatrixMul.h"
#include "arithmetic/MatrixMul2D.h" #include "arithmetic/MatrixMul2D.h"
#include "arithmetic/MatrixMul2DMultiTheading.h" #include "arithmetic/MatrixMul2DMultiTheading.h"
#include "arithmetic/MatrixMul2DParallel.h" #include "arithmetic/MatrixMul2DParallel.h"
#include "arithmetic/MatrixMulBatched.h" #include "arithmetic/MatrixMulBatched.h"
#include "arithmetic/MatrixMULBatchedCPU.h"
#include "shape/Merge.h"
#include "shape/MergeBlockLists.h"
#include "arithmetic/Multiply.h" #include "arithmetic/Multiply.h"
#include "arithmetic/Negate.h" #include "arithmetic/Negate.h"
#include "arithmetic/Sign.h"
#include "arithmetic/Sub.h"
#include "arithmetic/Sum.h"
#include "arithmetic/SumByColumnTV.h"
#include "arithmetic/SumByColumnVT.h"
#include "arithmetic/SumDim.h"
#include "arithmetic/XTensorBLAS.h"
#include "getandset/ConvertDataType.h"
#include "getandset/Select.h"
#include "getandset/SetData.h"
#include "math/Clip.h"
#include "math/Normalize.h" #include "math/Normalize.h"
#include "shape/Permute.h"
#include "math/Power.h" #include "math/Power.h"
#include "math/ScaleAndShift.h"
#include "math/Unary.h"
#include "movement/CopyBlocks.h"
#include "movement/CopyBlocksInGrid.h"
#include "movement/CopyBlocksOnSite.h"
#include "movement/CopyData2D.h"
#include "movement/CopyIndexed.h"
#include "movement/CopyInGrid.h"
#include "movement/CopyValues.h"
#include "reduce/ReduceMax.h" #include "reduce/ReduceMax.h"
#include "reduce/ReduceMean.h" #include "reduce/ReduceMean.h"
#include "reduce/ReduceStandardVariance.h" #include "reduce/ReduceStandardVariance.h"
#include "reduce/ReduceSum.h" #include "reduce/ReduceSum.h"
#include "reduce/ReduceSumSquared.h" #include "reduce/ReduceSumSquared.h"
#include "reduce/ReduceVariance.h" #include "reduce/ReduceVariance.h"
#include "math/ScaleAndShift.h"
#include "getandset/Select.h" #include "shape/Concatenate.h"
#include "getandset/SetData.h" #include "shape/ConcatenateSolely.h"
#include "sort/Sort.h" #include "shape/MakeMergeBlockIndex.h"
#include "shape/MakeSplitBlockIndex.h"
#include "shape/Merge.h"
#include "shape/MergeBlockLists.h"
#include "shape/Permute.h"
#include "shape/Split.h" #include "shape/Split.h"
#include "arithmetic/Sum.h"
#include "arithmetic/SumByColumnTV.h"
#include "arithmetic/SumByColumnVT.h"
#include "sort/TopK.h"
#include "shape/Transpose.h" #include "shape/Transpose.h"
#include "shape/Unsqueeze.h" #include "shape/Unsqueeze.h"
#include "sort/Sort.h"
#include "sort/TopK.h"
#include "utilities/XMatrixSegment.h" #include "utilities/XMatrixSegment.h"
#include "arithmetic/XTensorBLAS.h" #include "utilities/FlushToMem.h"
#endif // __CHEADER_H__ #endif // __CHEADER_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "Div.h"
#include "Div.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the index of the item
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting
*/
void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
{
int leadingDimRDI = a->order - leadingDim - 1;
CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
"Unmatched tensors in multiplication!");
CheckNTErrors((a->order == b->order && a->order == c->order),
"Unmatched tensors!");
#ifdef USE_CUDA
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
_CudaDiv(a, b, c, alpha, leadingDim);
return;
}
#endif
int stride = 1;
int blockSizeA = 1;
int blockSizeB = 1;
int blockSizeC = 1;
int blockNum = 1;
int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
for (int i = 0; i < a->order; i++) {
if (i != leadingDimRDI) {
CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] && a->dimSizeRDI[i] == c->dimSizeRDI[i]),
"Unmatched tensors!");
}
if (i < leadingDimRDI)
stride *= a->dimSizeRDI[i];
}
blockSizeA = stride * dimensionSizeA;
blockSizeB = stride * dimensionSizeB;
blockSizeC = stride * dimensionSizeC;
blockNum = a->unitNum / blockSizeA;
if (!a->isSparse && !b->isSparse) {
if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
int size = a->unitNum;
DTYPE * ap = (DTYPE*)a->data;
DTYPE * bp = (DTYPE*)b->data;
DTYPE * cp = (DTYPE*)c->data;
if (alpha == 0) {
for (int i = 0; i < size; i++)
cp[i] = ap[i] / bp[i];
}
else {
for (int i = 0; i < size; i++)
cp[i] = ap[i] / bp[i] + alpha * cp[i];
}
}
else {
for (int k = 0; k < blockNum; k++) {
for (int ci = 0, ai = 0, bi = 0; ci < dimensionSizeC; ci++, ai++, bi++) {
if (ai >= dimensionSizeA)
ai = 0;
if (bi >= dimensionSizeB)
bi = 0;
DTYPE * ap = (DTYPE*)a->data + k * blockSizeA + ai * stride;
DTYPE * bp = (DTYPE*)b->data + k * blockSizeB + bi * stride;
DTYPE * cp = (DTYPE*)c->data + k * blockSizeC + ci * stride;
for (int j = 0; j < stride; j++)
cp[j] = ap[j] / bp[j] + cp[j] * alpha;
}
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
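/*
worked example of the block/stride addressing used above (illustration only;
it assumes nothing beyond the reversed-dimension dimSizeRDI convention of
this codebase):

    for a 3-order tensor of shape (2, 3, 4) with leadingDim = 1 we get
    leadingDimRDI = 3 - 1 - 1 = 1, and the loop above yields

        stride    = 4               // product of the dimensions below the leading one
        blockSize = 4 * 3 = 12      // stride * dimensionSize
        blockNum  = 24 / 12 = 2     // unitNum / blockSize

    so the item at (block k, slice i, offset j) lives at
    k * blockSize + i * stride + j, and the shorter of a and b is rewound
    (ai/bi reset to 0) so that it is broadcast along the leading dimension of c
*/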
/*
element-wise division of two tensors (do it on site)
keep the result in the input tensor a and return nothing
a(i) = a(i)/b(i) + \alpha * a(i)
where i is the index of the item
>> a - tensor a (where keep the result)
>> b - tensor b
>> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting
*/
void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha, int leadingDim)
{
_Div(a, b, a, alpha, leadingDim);
}
/*
element-wise division of two tensors (return a XTensor structure)
make a new tensor c to keep the result and return it
c(i) = a(i)/b(i)
where i is the index of the item
>> a - tensor a
>> b - tensor b
>> leadingDim - the dimension along which we perform broadcasting
<< return - the result of element-wise division of the tensors
*/
XTensor Div(const XTensor &a, const XTensor &b, int leadingDim)
{
CheckNTErrors(a.dimSize[leadingDim] == b.dimSize[leadingDim], "Unmatched leading dimensions in division!");
XTensor c(&a);
c.SetTMP();
/* call _Div function */
_Div(&a, &b, &c, 0, leadingDim);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_DIV);
XLink::AddParamToHeadInt(&c, leadingDim);
return c;
}
} // namespace nts(NiuTrans.Tensor)
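/*
usage sketch (illustration only; it assumes the NewTensor2D helper with the
(rowNum, colNum, dataType, devID, mem) signature used elsewhere in this
codebase, a CPU tensor with devID = -1 and no memory pool):

    XTensor * a = NewTensor2D(2, 3, DEFAULT_DTYPE, -1, NULL);
    XTensor * b = NewTensor2D(2, 3, DEFAULT_DTYPE, -1, NULL);
    for (int i = 0; i < 6; i++) {
        ((DTYPE*)a->data)[i] = (DTYPE)(i + 1);    // a = 1, 2, ..., 6
        ((DTYPE*)b->data)[i] = (DTYPE)2.0;        // b = 2 everywhere
    }
    XTensor c = Div(*a, *b);    // c(i) = a(i) / b(i), i.e., 0.5, 1.0, 1.5, ...
    delete a;
    delete b;
*/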
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "Div.h"
#include "Div.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
division of data arrays in an element-wise manner c(i) = a(i)/b(i)
>> a - data array a
>> b - data array b
>> c - result data array
>> size - size of c
*/
__global__
void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] / b[i];
}
/*
division of data arrays in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i)
>> a - data array a
>> b - data array b
>> c - result data array
>> size - size of c
>> alpha - the coefficient
*/
__global__
void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] / b[i] + alpha * c[i];
}
/*
division of two tensors in an element-wise manner c(i) = a(i)/b(i).
Note that a and b can be of different sizes here, i.e.,
|a_lead| <= |c_lead| and |b_lead| <= |c_lead|
where |a_lead| means the size of the leading dimension of a
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> stride - the number of items we go over when moving to the next item along the leading dimension in a block
>> ldSizeA - size of the leading dimension of a
>> ldSizeB - size of the leading dimension of b
>> ldSizeC - size of the leading dimension of c
>> blockNum - number of blocks
*/
template<int nonZeroAlpha> __global__
void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha,
int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum)
{
__shared__ DTYPE* ap[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE* bp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ DTYPE* cp[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= blockNum * stride || j >= ldSizeC)
return;
if (threadIdx.y == 0) {
int block = i / stride;
int size = block * stride;
ap[threadIdx.x] = a + size * ldSizeA;
bp[threadIdx.x] = b + size * ldSizeB;
cp[threadIdx.x] = c + size * ldSizeC;
}
__syncthreads();
int aj = j >= ldSizeA ? j % ldSizeA : j;
int bj = j >= ldSizeB ? j % ldSizeB : j;
int offseti = i % stride;
if (nonZeroAlpha == 0)
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti];
else
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] / bp[threadIdx.x][bj * ldSizeB + offseti]
+ alpha * cp[threadIdx.x][j * ldSizeC + offseti];
}
/*
element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the item index
>> a - tensor a
>> b - tensor b
>> c - result tensor
>> alpha - the coefficient
>> leadingDim - dimension along which we perform broadcasting
*/
void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha, int leadingDim)
{
int leadingDimRDI = a->order - leadingDim - 1;
CheckNTErrors((a->unitNum <= c->unitNum && b->unitNum <= c->unitNum),
"Unmatched tensors in multiplication!");
CheckNTErrors((a->order == b->order && a->order == c->order), "Unmatched tensors!");
int stride = 1;
int blockSizeA = 1;
int blockNum = 1;
int dimensionSizeA = a->dimSizeRDI[leadingDimRDI];
int dimensionSizeB = b->dimSizeRDI[leadingDimRDI];
int dimensionSizeC = c->dimSizeRDI[leadingDimRDI];
for (int i = 0; i < a->order; i++) {
if (i != leadingDimRDI) {
CheckNTErrors((a->dimSizeRDI[i] == b->dimSizeRDI[i] &&
a->dimSizeRDI[i] == c->dimSizeRDI[i]),
"Unmatched tensors!");
}
if (i < leadingDimRDI)
stride *= a->dimSizeRDI[i];
}
blockSizeA = stride * dimensionSizeA;
blockNum = a->unitNum / blockSizeA;
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (!a->isSparse && !b->isSparse) {
if (a->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE) {
int cudaGridSize[3];
int cudaBlockSize[3];
if (a->unitNum == c->unitNum && b->unitNum == c->unitNum) {
GDevs.GetCudaThread(a->devID, c->unitNum, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[0]), threads(cudaBlockSize[0]);
if (alpha == 0)
KernelDivElementWise << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum);
else
KernelDivElementWiseV2 << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, c->unitNum, alpha);
}
else {
GDevs.GetCudaThread2D(c->devID, stride * blockNum, dimensionSizeC, MAX_INT, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[0], cudaGridSize[1]), threads(cudaBlockSize[0], cudaBlockSize[1]);
if (alpha == 0) {
KernelDivElementWiseTensorDynamic<0> << <blocks, threads >> >
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, 0,
stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
}
else {
KernelDivElementWiseTensorDynamic<1> << <blocks, threads >> >
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, alpha,
stride, dimensionSizeA, dimensionSizeB, dimensionSizeC, blockNum);
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
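/*
launch-shape note (illustration only): in the broadcasting branch above the
grid is two-dimensional: x enumerates the stride * blockNum item columns and
y enumerates the dimensionSizeC slices of the leading dimension. For c of
shape (2, 3, 4) with leadingDim = 1, GetCudaThread2D is asked for an 8 x 3
arrangement (stride = 4, blockNum = 2, dimensionSizeC = 3), so each thread
produces exactly one item of c.
*/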
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __DIV_CUH__
#define __DIV_CUH__
#include "Div.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) */
__global__
void KernelDivElementWise(DTYPE * a, DTYPE * b, DTYPE * c, int size);
/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
__global__
void KernelDivElementWiseV2(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE alpha);
/* division of two tensors in an element-wise manner c(i) = a(i)/b(i) + \alpha*c(i) */
template<int nonZeroAlpha>__global__
void KernelDivElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE alpha, int stride, int ldSizeA, int ldSizeB, int ldSizeC, int blockNum);
/* element-wise division of two tensors */
void _CudaDiv(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __DIV_CUH__
...@@ -16,31 +16,39 @@ ...@@ -16,31 +16,39 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/ */
#ifndef __LOG_H__ #ifndef __DIV_H__
#define __LOG_H__ #define __DIV_H__
#include "../../XTensor.h" #include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* set every entry to its log value */ /*
void _Log(const XTensor * a, XTensor * b); element-wise division of two tensors:
c(i) = a(i)/b(i) + \alpha * c(i)
where i is the index of the element
*/
void _Div(const XTensor * a, const XTensor * b, XTensor * c, DTYPE alpha = 0, int leadingDim = 0);
/* /*
set every entry to its log value (do it on site) element-wise division of two tensors (do it on site)
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
a(i) = a(i)/b(i) + \alpha * a(i)
where i is the index of the element
*/ */
void _LogMe(XTensor * a); void _DivMe(XTensor * a, const XTensor * b, DTYPE alpha = 0, int leadingDim = 0);
/* /*
set every entry to its log value (return a XTensor structure) element-wise division of two tensors (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
c(i) = a(i)/b(i)
where i is the index of the element
*/ */
XTensor Log(const XTensor & a); XTensor Div(const XTensor &a, const XTensor &b, int leadingDim = 0);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __LOG_H__ #endif // __DIV_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XTensor.h"
#include "MatrixMULBatchedCPU.h"
#include "MatrixMul2D.h"
#include "XTensorBLAS.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
matrix multiplication in batch mode (BLAS)
c_i = trans(a_i) * trans(b_i) * \alpha + c_i * \beta for each i in [0,count-1]
>> a - list of input matrices (2d tensors)
>> transposedA - indicates whether the matrices in a are transposed
>> b - another list of input matrices (2d tensors)
>> transposedB - indicates whether the matrices in b are transposed
>> c - list of output matrices (2d tensors)
>> alpha - scalar
>> beta - scalar
*/
void _MatrixMULBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA,
const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty input lists!");
CheckNTErrors(a->count == b->count && a->count == c->count, "Input lists must be of the same size!");
if (a->count == 0)
return;
bool isUniform = true;
for (int i = 1; i < a->count; i++) {
XTensor * aim = (XTensor*)a->GetItem(i - 1);
XTensor * bim = (XTensor*)b->GetItem(i - 1);
XTensor * cim = (XTensor*)c->GetItem(i - 1);
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
if (!XTensor::IsSameShaped(aim, ai) ||
!XTensor::IsSameShaped(bim, bi) ||
!XTensor::IsSameShaped(cim, ci))
{
isUniform = false;
break;
}
}
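/* note: isUniform is computed above but never read below; the fast path for
   uniformly-shaped inputs appears to have been removed, leaving the flag
   dead for now */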
for (int i = 0; i < a->count; i++) {
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
CheckNTErrors((ai->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((bi->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((ci->order == 2), "2d tensor (i.e., matrix) is required!");
#ifdef USE_BLAS
if (useBLAS)
_MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif
}
}
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
...@@ -24,8 +24,8 @@ ...@@ -24,8 +24,8 @@
#include "../../XName.h" #include "../../XName.h"
#include "MatrixMul.h" #include "MatrixMul.h"
#include "MatrixMul2D.h" #include "MatrixMul2D.h"
#include "MatrixMULBatchedCPU.h"
#include "XTensorBLAS.h" #include "XTensorBLAS.h"
#include "MatrixMulBatched.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -53,11 +53,29 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -53,11 +53,29 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
const XTensor * b, MATRIX_TRANS_TYPE transposedB, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha, DTYPE beta, XPRunner * parallelRunner) XTensor * c, DTYPE alpha, DTYPE beta, XPRunner * parallelRunner)
{ {
CheckNTErrors((a && b && c), "Empty input tensors!"); CheckNTErrors(a && b && c, "Empty input tensors!");
CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType), CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Input tensors should have the same data type!"); "Input tensors should have the same data type!");
CheckNTErrors((a->order >= 2 && b->order >= 2 && c->order >= 2), CheckNTErrors(a->order >= 2 && b->order >= 2 && c->order >= 2,
"Input tensors must have a order >= 2!"); "Input tensors must have a order >= 2!");
CheckNTErrors(c->order == a->order + b->order - 2, "Wrong tensor order!");
/* we transform a higher order tensor to a matrix to reduce the number
of calls to matrix multiplication */
if(transposedA == X_NOTRANS && a->order > 2 && b->order == 2){
int ncolA = a->dimSize[a->order - 1];
int ncolC = c->dimSize[c->order - 1];
XTensor * a2 = NewTensor2D(a->unitNum/ncolA, -ncolA, a->dataType, a->devID, a->mem);
XTensor * c2 = NewTensor2D(c->unitNum/ncolC, -ncolC, c->dataType, c->devID, c->mem);
a2->data = a->data;
c2->data = c->data;
_MatrixMul2D(a2, transposedA, b, transposedB, c2, alpha, beta, parallelRunner);
a2->data = NULL;
c2->data = NULL;
delete a2;
delete c2;
return;
}
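/* e.g., a of shape (2, 3, 4) times b of shape (4, 5): a is viewed as a
   6 x 4 matrix and c as a 6 x 5 matrix, so a single 2D multiplication
   replaces six small ones */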
int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1]; int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0]; int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
...@@ -144,10 +162,10 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -144,10 +162,10 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID); cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID);
_CudaBLASMatrixMULList(handle, _CudaBLASMatrixMULList(handle,
aList, transposedA, aList, transposedA,
bList, transposedB, bList, transposedB,
cList, aList->count, cList, aList->count,
alpha, beta); alpha, beta);
BacktoCudaDev(a->devID, devIDBackup); BacktoCudaDev(a->devID, devIDBackup);
#else #else
...@@ -156,9 +174,9 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -156,9 +174,9 @@ void _MatrixMul(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
} }
else { else {
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
_MatrixMULBatchedCPU(aList, transposedA, _MatrixMulBatchedCPU(aList, transposedA,
bList, transposedB, bList, transposedB,
cList, alpha, beta); cList, alpha, beta);
} }
for (int i = 0; i < aList->count; i++) { for (int i = 0; i < aList->count; i++) {
...@@ -251,9 +269,7 @@ XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA, ...@@ -251,9 +269,7 @@ XTensor MatrixMul(const XTensor &a, MATRIX_TRANS_TYPE transposedA,
/* /*
matrix multiplication with no transposition c = a * b * alpha matrix multiplication with no transposition c = a * b * alpha
>> a - tensor a >> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b >> b - tensor b
>> transposedB - indicates whether teh matrices in b are transposed
>> alpha - a coefficient >> alpha - a coefficient
>> parallelRunner - parallel processing module >> parallelRunner - parallel processing module
<< return - the result of matrix multiplication << return - the result of matrix multiplication
......
...@@ -23,8 +23,8 @@ ...@@ -23,8 +23,8 @@
#include "../../XDevice.h" #include "../../XDevice.h"
#include "../../XName.h" #include "../../XName.h"
#include "MatrixMulBatched.h" #include "MatrixMulBatched.h"
#include "MatrixMULBatchedCPU.h"
#include "XTensorBLAS.h" #include "XTensorBLAS.h"
#include "MatrixMul2D.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -57,6 +57,43 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -57,6 +57,43 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
CheckNTErrors((a->order == b->order && a->order == c->order), CheckNTErrors((a->order == b->order && a->order == c->order),
"Input tensor and output tensor must have same order!"); "Input tensor and output tensor must have same order!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0)
_MatrixMulBatchedGPU(a, transposedA, b, transposedB, c, alpha, beta);
else
_MatrixMulBatchedCPU(a, transposedA, b, transposedB, c, alpha, beta);
}
/*
matrix multiplication of the two tensors
optimized for GPU
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = trans(ai) * trans(bi) * alpha + ci * beta
where trans() returns the transposed matrix if the flag is fired
>> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b
>> transposedB - indicates whether the matrices in b are transposed
>> c - where we keep a*b
>> alpha - a coefficient
>> beta - another coefficient
*/
void _MatrixMulBatchedGPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha, DTYPE beta)
{
#ifdef USE_CUDA
CheckNTErrors((a && b && c), "Empty input tensors!");
CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
"Input tensors should have the same data type!");
CheckNTErrors((a->order >= 2 && b->order >= 2 && c->order >= 2),
"Input tensors must have a order >= 2!");
CheckNTErrors((a->order == b->order && a->order == c->order),
"Input tensor and output tensor must have same order!");
CheckNTErrors(a->devID >= 0 && b->devID >= 0 && c->devID >= 0, "The tensors must be on GPUs");
int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1]; int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0]; int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1]; int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1];
...@@ -64,8 +101,7 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -64,8 +101,7 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
int cn = c->dimSizeRDI[1]; int cn = c->dimSizeRDI[1];
int cm = c->dimSizeRDI[0]; int cm = c->dimSizeRDI[0];
CheckNTErrors((am == bn && an == cn && bm == cm), CheckNTErrors((am == bn && an == cn && bm == cm), "Unmatched tensors in multiplication!");
"Unmatched tensors in multiplication!");
int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1]; int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1];
int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1]; int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1];
...@@ -81,76 +117,159 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, ...@@ -81,76 +117,159 @@ void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
blockNum *= a->dimSizeRDI[i]; blockNum *= a->dimSizeRDI[i];
} }
XList * aList = new XList(10); int devIDBackup = 0;
XList * bList = new XList(10); ProtectCudaDev(a->devID, devIDBackup);
XList * cList = new XList(10);
int aDimSize[2] = { -a->dimSizeRDI[1], a->dimSizeRDI[0] }; cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID);
int bDimSize[2] = { -b->dimSizeRDI[1], b->dimSizeRDI[0] }; _CudaBLASMatrixMULBatchedStrided(handle,
int cDimSize[2] = { -c->dimSizeRDI[1], c->dimSizeRDI[0] }; a->data, transposedA, a->dataType, aBlockSize,
b->data, transposedB, b->dataType, bBlockSize,
for (int p = 0; p < blockNum; p++) { c->data, c->dataType, cBlockSize, blockNum,
void * ap = (char*)a->data + aRealBlockSize * p; a->dimSizeRDI[1], a->dimSizeRDI[0],
void * bp = (char*)b->data + bRealBlockSize * p; b->dimSizeRDI[1], b->dimSizeRDI[0],
void * cp = (char*)c->data + cRealBlockSize * p; c->dimSizeRDI[1], c->dimSizeRDI[0], alpha, beta);
XTensor * ai = NewTensor(2, aDimSize, a->dataType, a->denseRatio, a->devID, a->mem);
XTensor * bi = NewTensor(2, bDimSize, b->dataType, b->denseRatio, b->devID, b->mem); BacktoCudaDev(a->devID, devIDBackup);
XTensor * ci = NewTensor(2, cDimSize, c->dataType, c->denseRatio, c->devID, c->mem); #endif
ai->data = ap; }
bi->data = bp;
ci->data = cp; /*
aList->Add(ai); matrix multiplication of the two tensors
bList->Add(bi); optimized for CPU
cList->Add(ci);
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = trans(ai) * trans(bi) * alpha + ci * beta
where trans() returns the transposed matrix if the flag is fired
>> a - tensor a
>> transposedA - indicates whether the matrices in a are transposed
>> b - tensor b
>> transposedB - indicates whether the matrices in b are transposed
>> c - where we keep a*b
>> alpha - a coefficient
>> beta - another coefficient
*/
void _MatrixMulBatchedCPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA,
const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha, DTYPE beta)
{
CheckNTErrors((a && b && c), "Empty input tensors!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Input tensors should have the same data type!");
CheckNTErrors(a->order >= 2 && b->order >= 2 && c->order >= 2,
"Input tensors must have a order >= 2!");
CheckNTErrors(a->order == b->order && a->order == c->order,
"Input tensor and output tensor must have same order!");
int an = transposedA == X_TRANS ? a->dimSizeRDI[0] : a->dimSizeRDI[1];
int am = transposedA == X_TRANS ? a->dimSizeRDI[1] : a->dimSizeRDI[0];
int bn = transposedB == X_TRANS ? b->dimSizeRDI[0] : b->dimSizeRDI[1];
int bm = transposedB == X_TRANS ? b->dimSizeRDI[1] : b->dimSizeRDI[0];
int cn = c->dimSizeRDI[1];
int cm = c->dimSizeRDI[0];
CheckNTErrors(am == bn && an == cn && bm == cm, "Unmatched tensors in multiplication!");
int aBlockSize = a->dimSizeRDI[0] * a->dimSizeRDI[1];
int bBlockSize = b->dimSizeRDI[0] * b->dimSizeRDI[1];
int cBlockSize = c->dimSizeRDI[0] * c->dimSizeRDI[1];
int aRealBlockSize = aBlockSize * a->unitSize;
int bRealBlockSize = bBlockSize * b->unitSize;
int cRealBlockSize = cBlockSize * c->unitSize;
int blockNum = 1;
for (int i = 2; i < a->order; i++) {
CheckNTErrors((a->dimSizeRDI[i] == c->dimSizeRDI[i]), "Incorrect tensor sizes!");
CheckNTErrors((b->dimSizeRDI[i] == c->dimSizeRDI[i]), "Incorrect tensor sizes!");
blockNum *= a->dimSizeRDI[i];
} }
if (a->devID >= 0 && b->devID >= 0 && c->devID >= 0) { int aDimSize[2] = {-a->dimSizeRDI[1], a->dimSizeRDI[0]};
#ifdef USE_CUDA int bDimSize[2] = {-b->dimSizeRDI[1], b->dimSizeRDI[0]};
CheckNTErrors((a->devID == b->devID && a->devID == c->devID), int cDimSize[2] = {-c->dimSizeRDI[1], c->dimSizeRDI[0]};
"The code must be run on the same GPU!");
XTensor * ai = NewTensor2D(aDimSize[0], aDimSize[1], a->dataType, a->devID, a->mem);
int devIDBackup; XTensor * bi = NewTensor2D(bDimSize[0], bDimSize[1], b->dataType, b->devID, b->mem);
ProtectCudaDev(a->devID, devIDBackup); XTensor * ci = NewTensor2D(cDimSize[0], cDimSize[1], c->dataType, c->devID, c->mem);
cublasHandle_t * handle = a->mem != NULL ? a->mem->GetCublasHandle() : GDevs.GetCudaHandle(a->devID); for (int i = 0; i < blockNum; i++) {
_CudaBLASMatrixMULList(handle, ai->data = (char*)a->data + i * aRealBlockSize;
aList, transposedA, bi->data = (char*)b->data + i * bRealBlockSize;
bList, transposedB, ci->data = (char*)c->data + i * cRealBlockSize;
cList, aList->count, #ifdef USE_BLAS
alpha, beta); if (useBLAS)
_MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
BacktoCudaDev(a->devID, devIDBackup); else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); _MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif #endif
} }
else {
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
_MatrixMULBatchedCPU(aList, transposedA,
bList, transposedB,
cList, alpha, beta);
}
for (int i = 0; i < aList->count; i++) { ai->data = NULL;
XTensor * ai = (XTensor*)aList->GetItem(i); bi->data = NULL;
ai->data = NULL; ci->data = NULL;
delete ai; delete ai;
} delete bi;
delete ci;
}
for (int i = 0; i < bList->count; i++) { /*
XTensor * bi = (XTensor*)bList->GetItem(i); matrix multiplication in batch mode for list inputs (BLAS)
bi->data = NULL; c_i = trans(a_i) * trans(b_i) * \alpha + c_i * \beta for each i in [0,count-1]
delete bi; >> a - list of input matrices (2d tensors)
} >> transposedA - indicates whether the matrices in a are transposed
>> b - another list of input matrices (2d tensors)
>> transposedB - indicates whether the matrices in b are transposed
>> c - list of output matrices (2d tensors)
>> alpha - scalar
>> beta - scalar
*/
void _MatrixMulBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA,
const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty input lists!");
CheckNTErrors(a->count == b->count && a->count == c->count, "Input lists must be of the same size!");
if (a->count == 0)
return;
for (int i = 0; i < cList->count; i++) { bool isUniform = true;
XTensor * ci = (XTensor*)cList->GetItem(i); for (int i = 1; i < a->count; i++) {
ci->data = NULL; XTensor * aim = (XTensor*)a->GetItem(i - 1);
delete ci; XTensor * bim = (XTensor*)b->GetItem(i - 1);
XTensor * cim = (XTensor*)c->GetItem(i - 1);
XTensor * ai = (XTensor*)a->GetItem(i);
XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
if (!XTensor::IsSameShaped(aim, ai) ||
!XTensor::IsSameShaped(bim, bi) ||
!XTensor::IsSameShaped(cim, ci))
{
isUniform = false;
break;
}
} }
delete aList; for (int i = 0; i < a->count; i++) {
delete bList; XTensor * ai = (XTensor*)a->GetItem(i);
delete cList; XTensor * bi = (XTensor*)b->GetItem(i);
XTensor * ci = (XTensor*)c->GetItem(i);
CheckNTErrors((ai->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((bi->order == 2), "2d tensor (i.e., matrix) is required!");
CheckNTErrors((ci->order == 2), "2d tensor (i.e., matrix) is required!");
#ifdef USE_BLAS
if (useBLAS)
_MatrixMULCPU(ai, transposedA, bi, transposedB, ci, alpha, beta);
else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#else
_MatrixMul2D(ai, transposedA, bi, transposedB, ci, alpha, beta);
#endif
}
} }
/* /*
...@@ -212,4 +331,60 @@ XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const ...@@ -212,4 +331,60 @@ XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const
return c; return c;
} }
/*
matrix multiplication of the two tensors (return a XTensor structure)
c = a * b * alpha
make a new tensor to keep the result and return it
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = ai * bi * alpha + ci * beta
>> a - tensor a
>> b - tensor b
>> alpha - a coefficient
>> parallelRunner - parallel processing module
<< return - the result of matrix multiplication of the two tensors
*/
XTensor MatrixMulBatched(const XTensor &a, const XTensor &b,
DTYPE alpha, XPRunner * parallelRunner)
{
CheckNTErrors(a.dataType == b.dataType, "Input tensors should have the same data type!");
CheckNTErrors(a.order >= 2 && b.order >= 2, "Input tensors must have a order >= 2!");
CheckNTErrors(a.order == b.order, "Input tensors must have the same order!");
int an = a.dimSizeRDI[1];
int am = a.dimSizeRDI[0];
int bn = b.dimSizeRDI[1];
int bm = b.dimSizeRDI[0];
CheckNTErrors(am == bn, "Unmatched tensors in multiplication!");
int order = a.order;
int sub = 0;
int * dimSize = new int[order];
for (int i = 0; i < a.order - 2; i++)
dimSize[sub++] = a.dimSize[i];
dimSize[sub++] = an;
dimSize[sub++] = bm;
float dr = (!a.isSparse || !b.isSparse) ? 1.0F : MAX(a.denseRatio, b.denseRatio);
XTensor c(order, dimSize, a.dataType, dr, a.devID, a.mem);
c.SetTMP();
/* call _MatrixMulBatched function */
_MatrixMulBatched(&a, X_NOTRANS, &b, X_NOTRANS, &c, alpha, 0, parallelRunner);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_MATRIXMULBATCHED);
XLink::AddParamToHeadTrans(&c, X_NOTRANS);
XLink::AddParamToHeadTrans(&c, X_NOTRANS);
XLink::AddParamToHead(&c, alpha);
/* destroy variables */
delete[] dimSize;
return c;
}
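/*
usage sketch (illustration only; it assumes the NewTensor helper with the
(order, dimSize, dataType, denseRatio, devID, mem) signature used elsewhere
in this codebase):

    int dimsA[3] = {8, 2, 3};    // 8 matrices of size 2 x 3
    int dimsB[3] = {8, 3, 4};    // 8 matrices of size 3 x 4
    XTensor * a = NewTensor(3, dimsA, DEFAULT_DTYPE, 1.0F, -1, NULL);
    XTensor * b = NewTensor(3, dimsB, DEFAULT_DTYPE, 1.0F, -1, NULL);
    XTensor c = MatrixMulBatched(*a, *b);    // c has shape (8, 2, 4)
    delete a;
    delete b;
*/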
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
...@@ -26,6 +26,8 @@ ...@@ -26,6 +26,8 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
#define BMMul MatrixMulBatched
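/* BMMul is a shorthand alias, so BMMul(a, b) is identical to
   MatrixMulBatched(a, b) */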
/* /*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
...@@ -37,6 +39,28 @@ where trans() returns the transposed matrix if the flag is fired ...@@ -37,6 +39,28 @@ where trans() returns the transposed matrix if the flag is fired
void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB, void _MatrixMulBatched(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0, XPRunner * parallelRunner = NULL); XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0, XPRunner * parallelRunner = NULL);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
optimized for GPU
*/
void _MatrixMulBatchedGPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta
optimized for CPU
*/
void _MatrixMulBatchedCPU(const XTensor * a, MATRIX_TRANS_TYPE transposedA, const XTensor * b, MATRIX_TRANS_TYPE transposedB,
XTensor * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/*
matrix multiplication of the two tensors c = trans(a) * trans(b) * alpha + c * beta (for list inputs)
optimized for CPU
*/
void _MatrixMulBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA, const XList * b, MATRIX_TRANS_TYPE transposedB,
XList * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
/* /*
matrix multiplication of the two tensors (return a XTensor structure) c = trans(a) * trans(b) * alpha matrix multiplication of the two tensors (return a XTensor structure) c = trans(a) * trans(b) * alpha
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
...@@ -49,6 +73,17 @@ where trans() returns the transposed matrix if the flag is fired ...@@ -49,6 +73,17 @@ where trans() returns the transposed matrix if the flag is fired
XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB, XTensor MatrixMulBatched(const XTensor &a, MATRIX_TRANS_TYPE transposedA, const XTensor &b, MATRIX_TRANS_TYPE transposedB,
DTYPE alpha = (DTYPE)1.0, XPRunner * parallelRunner = NULL); DTYPE alpha = (DTYPE)1.0, XPRunner * parallelRunner = NULL);
/*
matrix multiplication of the two tensors (return a XTensor structure) c = a * b * alpha
make a new tensor to keep the result and return it
for each 2-dimensional data array in a (denoted as ai) and
each 2-dimensional data array in b (denoted as bi), we have
ci = ai * bi * alpha + ci * beta
*/
XTensor MatrixMulBatched(const XTensor &a, const XTensor &b,
DTYPE alpha = (DTYPE)1.0, XPRunner * parallelRunner = NULL);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __MATRIXMULBATCHED_H__ #endif // __MATRIXMULBATCHED_H__
\ No newline at end of file
...@@ -32,9 +32,9 @@ element-wise product of two tensors ...@@ -32,9 +32,9 @@ element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i) c(i) = a(i)*b(i) + \alpha * c(i)
where i is the index of the item where i is the index of the item
>> a - matrix a >> a - tensor a
>> b - matrix b >> b - tensor b
>> c - result matrix >> c - result tensor
>> alpha - the coefficient >> alpha - the coefficient
>> leadingDim - the dimension along which we perform broadcasting >> leadingDim - the dimension along which we perform broadcasting
*/ */
......
...@@ -104,9 +104,9 @@ void KernelMulElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE al ...@@ -104,9 +104,9 @@ void KernelMulElementWiseTensorDynamic(DTYPE * a, DTYPE * b, DTYPE * c, DTYPE al
int offseti = i % stride; int offseti = i % stride;
if (nonZeroAlpha == 0) if (nonZeroAlpha == 0)
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj* ldSizeA + offseti] * bp[threadIdx.x][bj* ldSizeB + offseti]; cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti];
else else
cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj* ldSizeA + offseti] * bp[threadIdx.x][bj* ldSizeB + offseti] + cp[threadIdx.x][j * ldSizeC + offseti] = ap[threadIdx.x][aj * ldSizeA + offseti] * bp[threadIdx.x][bj * ldSizeB + offseti] +
alpha * cp[threadIdx.x][j * ldSizeC + offseti]; alpha * cp[threadIdx.x][j * ldSizeC + offseti];
} }
......
...@@ -76,7 +76,7 @@ XTensor Sign(const XTensor & a) ...@@ -76,7 +76,7 @@ XTensor Sign(const XTensor & a)
XTensor b(&a); XTensor b(&a);
b.SetTMP(); b.SetTMP();
/* call _ScaleAndShift function */ /* call _Sign function */
_Sign(&a, &b); _Sign(&a, &b);
/* tensor connections */ /* tensor connections */
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XTensor.h"
#include "../../XName.h"
#include "../../XUtility.h"
#include "Sub.h"
#include "Sub.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor subtraction c = a - b * \beta
>> a - a tensor
>> b - another tensor
>> c - where we put a-b*\beta (it must not be NULL)
>> beta - the scaling factor
*/
void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == b->unitNum && a->unitNum == c->unitNum,
"Unmatched tensors in subtraction!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched tensors in subtraction!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
#ifdef USE_CUDA
if (a == c) {
int P2PAccesible = 0;
#ifdef CUDA_UVA
cudaDeviceCanAccessPeer(&P2PAccesible, a->devID, b->devID);
#endif
if ((a->devID < 0 && b->devID >= 0) ||
(a->devID >= 0 && b->devID < 0) ||
(a->devID >= 0 && b->devID >= 0 && a->devID != b->devID && !P2PAccesible))
{
ShowNTErrors("Cannot run this method on multiple devices simultaneously!");
}
else
_CudaSub(a, b, c, beta);
}
else
_CudaSub(a, b, c, beta);
#endif
}
else {
if (!a->isSparse && !b->isSparse) {
CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in subtraction!");
if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE &&
c->dataType == DEFAULT_DTYPE)
{
DTYPE * ap = (DTYPE*)a->data;
DTYPE * bp = (DTYPE*)b->data;
DTYPE * cp = (DTYPE*)c->data;
/* unrolling */
int num = a->unitNum;
if (num % 4 == 0) {
for (int i = 0; i < num; i += 4) {
cp[i] = ap[i] - bp[i] * beta;
cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
cp[i + 2] = ap[i + 2] - bp[i + 2] * beta;
cp[i + 3] = ap[i + 3] - bp[i + 3] * beta;
}
}
else if (num % 2 == 0) {
for (int i = 0; i < num; i += 2) {
cp[i] = ap[i] - bp[i] * beta;
cp[i + 1] = ap[i + 1] - bp[i + 1] * beta;
}
}
else {
for (int i = 0; i < num; i++) {
cp[i] = ap[i] - bp[i] * beta;
}
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
}
/*
tensor subtraction a = a - b * \beta (do it on site)
keep the result in the tensor a and return nothing
>> a - a tensor
>> b - another tensor
>> beta - the scaling factor
*/
void _SubMe(XTensor * a, const XTensor * b, DTYPE beta)
{
_Sub(a, b, a, beta);
}
/*
tensor subtraction c = a - b * \beta (return a XTensor structure)
make a new tensor c to keep the result and return it
>> a - a tensor
>> b - another tensor
>> beta - the scaling factor
<< return - the result of tensor subtraction
*/
XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta)
{
XTensor c(&a);
c.SetTMP();
/* call _Sub function */
_Sub(&a, &b, &c, beta);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUB);
XLink::AddParamToHead(&c, beta);
return c;
}
} // namespace nts(NiuTrans.Tensor)
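/*
usage sketch (illustration only; it assumes the NewTensor2D helper with the
(rowNum, colNum, dataType, devID, mem) signature used elsewhere in this
codebase):

    XTensor * a = NewTensor2D(2, 2, DEFAULT_DTYPE, -1, NULL);
    XTensor * b = NewTensor2D(2, 2, DEFAULT_DTYPE, -1, NULL);
    for (int i = 0; i < 4; i++) {
        ((DTYPE*)a->data)[i] = (DTYPE)i;
        ((DTYPE*)b->data)[i] = (DTYPE)1.0;
    }
    XTensor c = Sub(*a, *b, (DTYPE)0.5);    // c(i) = a(i) - 0.5 * b(i)
    delete a;
    delete b;
*/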
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "../../XDevice.h"
#include "../../XUtility.h"
#include "Sub.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
subtraction of data arrays (CUDA Kernel)
c = a - b * \beta
>> a - A matrix
>> b - another matrix
>> c - where we put a-b
>> size - the size of a/b/c
>> beta - the coefficient
*/
__global__
void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
c[i] = a[i] - b[i] * beta;
}
/*
tensor subtraction c = a - b * \beta (cuda version)
>> a - a tensor
>> b - another tensor
>> c - where we put a-b*\beta.
>> beta - the scaling factor
*/
void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors((a->unitNum == b->unitNum && a->unitNum == c->unitNum),
"Unmatched tensors in subtraction!");
CheckNTErrors((a->dataType == b->dataType && a->dataType == c->dataType),
"Unmatched tensors in subtraction!");
CheckNTErrors((a->devID == b->devID && a->devID == c->devID),
"The tensors must be on the same device!");
int devIDBackup = XDevice::GetGPUDevice();
XDevice::SetGPUDevice(a->devID);
if (!a->isSparse && !b->isSparse) {
CheckNTErrors(!c->isSparse, "Illegal use of sparse matrix in addition!");
if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE &&
c->dataType == DEFAULT_DTYPE)
{
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
KernelSUB << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data, a->unitNum, beta);
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
}
else {
// TODO!!
ShowNTErrors("TODO!");
}
XDevice::SetGPUDevice(devIDBackup);
}
/*
tensor subtraction over data arrays c = a - b * \beta (cuda version) with an input handle
>> devID - device ID (MUST >= 0)
>> handle - cuda handle
>> a - an array
>> b - another array
>> c - where we put a-b
>> size - size of the array
>> beta - the coefficient
*/
void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta)
{
if (size == 0)
return;
if (c == NULL)
c = a;
CheckNTErrors((a && b && c), "Empty arrays in addition!");
int devIDBackup;
ProtectCudaDev(devID, devIDBackup);
if (c == a) {
/* axpy computes a = coef * b + a, so the coefficient must be -beta
to yield the subtraction a = a - b * beta */
DTYPE coef = -beta;
#ifdef DOUBELPRICSION
cublasDaxpy(*handle, size, &coef, b, 1, a, 1);
#else
cublasSaxpy(*handle, size, &coef, b, 1, a, 1);
#endif
}
else {
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(devID, size, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
KernelSUB<<<blocks, threads>>>((DTYPE*)a, (DTYPE*)b, (DTYPE*)c, size, beta);
}
BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __SUB_CUH__
#define __SUB_CUH__
#include "Sub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* subtraction of data arrays (CUDA Kernel) */
__global__
void KernelSUB(DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
/* tensor subtraction c = a - b * \beta (cuda version) */
void _CudaSub(const XTensor * a, const XTensor * b, XTensor * c = NULL, DTYPE beta = (DTYPE)1.0);
/* tensor subtraction c = a - b * \beta (cuda version) with an input handle */
void _CudaSubWithHandle(int devID, cublasHandle_t * handle, DTYPE * a, DTYPE * b, DTYPE * c, int size, DTYPE beta = (DTYPE)1.0);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __SUB_CUH__
/* NiuTrans.Tensor - an open-source tensor library /* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University. * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved. * All rights reserved.
* *
* Licensed under the Apache License, Version 2.0 (the "License"); * Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License. * you may not use this file except in compliance with the License.
* You may obtain a copy of the License at * You may obtain a copy of the License at
* *
* http://www.apache.org/licenses/LICENSE-2.0 * http://www.apache.org/licenses/LICENSE-2.0
* *
* Unless required by applicable law or agreed to in writing, software * Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, * distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and * See the License for the specific language governing permissions and
* limitations under the License. * limitations under the License.
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/ * Today is the first day of August. It's still very hot.
*/
#ifndef __ABSOLUTE_H__ #ifndef __SUB_H__
#define __ABSOLUTE_H__ #define __SUB_H__
#include "../../XTensor.h" #include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* set every entry to its absolute value */ /* tensor subtraction c = a - b * \beta */
void _Absolute(const XTensor * a, XTensor * b); void _Sub(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta = (DTYPE)1.0);
/* /*
set every entry to its absolute value (do it on site) tensor subtraction a = a - b * \beta
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
*/ */
void _AbsoluteMe(XTensor * a); void _SubMe(XTensor * a, const XTensor * b, DTYPE beta = (DTYPE)1.0);
/* /*
set every entry to its absolute value (return a XTensor structure) tensor subtraction c = a - b * \beta
make a new tensor to keep the result and return it make a new tensor c to keep the result and return it
*/ */
XTensor Absolute(const XTensor & a); XTensor Sub(const XTensor &a, const XTensor &b, DTYPE beta = (DTYPE)1.0);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __ABSOLUTE_H__ #endif // __SUB_H__
...@@ -22,8 +22,10 @@ ...@@ -22,8 +22,10 @@
#include "../../XTensor.h" #include "../../XTensor.h"
#include "../../XName.h" #include "../../XName.h"
#include "../../XUtility.h" #include "../../XUtility.h"
#include "../movement/CopyValues.h"
#include "Sum.h" #include "Sum.h"
#include "Sum.cuh" #include "Sum.cuh"
#include "SumDim.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -43,8 +45,12 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta) ...@@ -43,8 +45,12 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType, CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched tensors in addition!"); "Unmatched tensors in addition!");
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) { if(beta == 0){
_CopyValues(a, c);
return;
}
if (a->devID >= 0 || b->devID >= 0 || c->devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
if (a == c) { if (a == c) {
int P2PAccesible = 0; int P2PAccesible = 0;
...@@ -67,7 +73,7 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta) ...@@ -67,7 +73,7 @@ void _Sum(const XTensor * a, const XTensor * b, XTensor * c, DTYPE beta)
} }
else { else {
if (!a->isSparse && !b->isSparse) { if (!a->isSparse && !b->isSparse) {
CheckNTErrors(!c->isSparse, "Illegal use of sparse matrix in addition!"); CheckNTErrors(!c->isSparse, "Illegal use of sparse tensor in addition!");
if (a->dataType == DEFAULT_DTYPE && if (a->dataType == DEFAULT_DTYPE &&
b->dataType == DEFAULT_DTYPE && b->dataType == DEFAULT_DTYPE &&
...@@ -123,6 +129,33 @@ void _SumMe(XTensor * a, const XTensor * b, DTYPE beta) ...@@ -123,6 +129,33 @@ void _SumMe(XTensor * a, const XTensor * b, DTYPE beta)
{ {
_Sum(a, b, a, beta); _Sum(a, b, a, beta);
} }
/*
return the dimension index if the sum can be performed as SumDim (see SumDim.h for details)
>> a - a tensor
>> b - another tensor for sum
*/
int GetSumDimIndex(const XTensor &a, const XTensor &b)
{
if(a.order < b.order)
return -1;
int hitCount = 0;
int hitDim = -1;
for(int i = 0; i < b.order; i++){
if(b.dimSize[b.order - 1 - i] == 1)
continue;
else if(b.dimSize[b.order - 1 - i] == a.dimSize[a.order - 1 - i]){
hitCount++;
hitDim = a.order - b.order + i;
}
}
if(hitCount == 1)
return hitDim;
else
return -1;
}
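/* e.g., for a of shape (4, 5) and b of shape (5) the tail dimensions match
   and hitCount == 1, so 1 is returned and Sum() below dispatches to the
   broadcasting _SumDim path; for b of shape (4, 5) both dimensions match
   (hitCount == 2), -1 is returned, and the plain element-wise _Sum is used */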
/* /*
tensor summation c = a + b * \beta (return a XTensor structure) tensor summation c = a + b * \beta (return a XTensor structure)
...@@ -137,13 +170,29 @@ XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta) ...@@ -137,13 +170,29 @@ XTensor Sum(const XTensor &a, const XTensor &b, DTYPE beta)
{ {
XTensor c(&a); XTensor c(&a);
c.SetTMP(); c.SetTMP();
int n = GetSumDimIndex(a, b);
if(n == -1){
/* call _Sum function */
_Sum(&a, &b, &c, beta);
/* call _Sum function */ /* tensor connections */
_Sum(&a, &b, &c, beta); XLink::MakeLink(&a, &b, &c, MATH_SUM);
XLink::AddParamToHead(&c, beta);
}
else if(n >= 0 && n < a.order){
/* call _SumDim function */
_SumDim(&a, &b, &c, n, beta);
/* tensor connections */ /* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUM); XLink::MakeLink(&a, &b, &c, MATH_SUMDIM);
XLink::AddParamToHead(&c, beta); XLink::AddParamToHeadInt(&c, n);
XLink::AddParamToHead(&c, beta);
}
else{
ShowNTErrors("Something is wrong!");
}
return c; return c;
} }
......
...@@ -20,6 +20,7 @@ ...@@ -20,6 +20,7 @@
*/ */
#include "../../XDevice.h" #include "../../XDevice.h"
#include "../../XUtility.h"
#include "Sum.cuh" #include "Sum.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#include "Sum.h"
#include "SumDim.h"
#include "SumDim.cuh"
#include "../../XName.h"
#include "../movement/CopyValues.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor summation
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> c - where we put a+b*\beta (it must not be NULL)
>> n - the dimension index
>> beta - the scaling factor
*/
void _SumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == c->unitNum, "Unmatched tensors in addition!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched data types in addition!");
CheckNTErrors(a->order == c->order, "The input tensors do not have the same order in addition!");
CheckNTErrors(!a->isSparse && !b->isSparse && !c->isSparse, "Dense tensors are required!");
CheckNTErrors(a->dimSize[n] == b->unitNum, "Wrong tensor size!");
if(beta == 0){
_CopyValues(a, c);
return;
}
if(XTensor::IsSameShaped(a, b)){
_Sum(a, b, c, beta);
return;
}
if(a->devID >= 0 || b->devID >= 0 || c->devID >= 0){
#ifdef USE_CUDA
_CudaSumDim(a, b, c, n, beta);
#else
ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
}
else{
int stride = 1;
int blockSize = a->dimSize[n];
int blockNum = 1;
for(int i = a->order - 1; i >= 0; i--){
if(i > n)
stride *= a->dimSize[i];
else if(i < n)
blockNum *= a->dimSize[i];
}
if (a->dataType == DEFAULT_DTYPE){
int num = a->unitNum;
if(stride > 1){
for(int i = 0, j = 0; i < num; i += stride, j++){
DTYPE * ap = (DTYPE*)a->data + i;
DTYPE bv = *((DTYPE*)b->data + j % blockSize) * beta;
DTYPE * cp = (DTYPE*)c->data + i;
for(int k = 0; k < stride; k++)
cp[k] = ap[k] + bv;
}
}
else if(stride == 1){
DTYPE * bp = (DTYPE*)b->data;
for(int i = 0; i < num; i += blockSize){
DTYPE * ap = (DTYPE*)a->data + i;
DTYPE * cp = (DTYPE*)c->data + i;
if(beta == 1.0F){
for(int j = 0; j < blockSize; j++)
cp[j] = ap[j] + bp[j];
}
else{
for(int j = 0; j < blockSize; j++)
cp[j] = ap[j] + bp[j] * beta;
}
}
}
else{
ShowNTErrors("Something is wrong!");
}
}
else {
ShowNTErrors("TODO!");
}
}
}
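/* e.g., for a of shape (2, 3, 4) and n = 1 (so b holds 3 items): stride = 4,
   blockSize = 3, blockNum = 2; the stride > 1 branch adds b[j % 3] * beta to
   each run of 4 consecutive items, sweeping b once per block of 12 items */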
/*
tensor summation (do it on site)
keep the result in the input tensor and return nothing
a = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> n - the dimension index
>> beta - the scaling factor
*/
void _SumDim(XTensor * a, const XTensor * b, int n, DTYPE beta)
{
_SumDim(a, b, a, n, beta);
}
/*
tensor summation (return a XTensor structure and make tensor connections)
make a new tensor to keep the result and return it
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> n - the dimension index
>> beta - the scaling factor
<< return - the result tensor by tensor summation
*/
XTensor SumDim(const XTensor &a, const XTensor &b, int n, DTYPE beta)
{
XTensor c(&a);
c.SetTMP();
/* call _SumDim function */
_SumDim(&a, &b, &c, n, beta);
/* tensor connections */
XLink::MakeLink(&a, &b, &c, MATH_SUMDIM);
XLink::AddParamToHeadInt(&c, n);
XLink::AddParamToHead(&c, beta);
return c;
}
} // namespace nts(NiuTrans.Tensor)
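/*
usage sketch (illustration only; it assumes the NewTensor2D/NewTensor helpers
with the signatures used elsewhere in this codebase): adding a bias vector to
every row of a matrix by broadcasting over dimension 1.

    int biasSize[1] = {32};
    XTensor * x = NewTensor2D(64, 32, DEFAULT_DTYPE, -1, NULL);
    XTensor * bias = NewTensor(1, biasSize, DEFAULT_DTYPE, 1.0F, -1, NULL);
    XTensor y = SumDim(*x, *bias, 1, (DTYPE)1.0);    // y[i][j] = x[i][j] + bias[j]
    delete x;
    delete bias;
*/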
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#include "SumDim.cuh"
#include "../../XDevice.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
tensor summation of a tensor and a row vector
c = a + b * \beta
where a is a tensor and b is a row vector
>> a - pointer to the data array of a
>> b - pointer to the data array of b
>> c - pointer to the data array of c
>> rowNum - number of rows of a and c
>> colNum - number of columns of a and c (i.e., the size of b)
>> beta - the scaling factor
*/
template <class T, bool betaFired>
__global__
void KernelAddWithRow(T * a, T * b, T * c, int rowNum, int colNum, T beta)
{
__shared__ T bv[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int col = blockDim.x * blockIdx.x + threadIdx.x;
int row = blockDim.y * blockIdx.y + threadIdx.y;
if(col >= colNum || row >= rowNum)
return;
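/* stage b in shared memory: the first thread row of the block loads b[col]
once, then all rows of the block reuse it after the barrier */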
if(threadIdx.y == 0)
bv[threadIdx.x] = b[col];
__syncthreads();
int offset = colNum * row + col;
if(betaFired)
c[offset] = a[offset] + bv[threadIdx.x] * beta;
else
c[offset] = a[offset] + bv[threadIdx.x];
}
/*
tensor summation of a tensor and a column vector
c = a + b * \beta
where a is a tensor and b is a column vector
>> a - pointer to the data array of a
>> b - pointer to the data array of b
>> c - pointer to the data array of c
>> rowNum - number of rows of a and c (i.e., the size of b)
>> colNum - number of columns of a and c
>> blockSize - size of a block (matrix), i.e., rowNum * colNum
>> blockNum - number of matrices
>> beta - the scaling factor
*/
template <class T, bool betaFired>
__global__
void KernelAddWithCol(T * a, T * b, T * c, int rowNum, int colNum, int blockSize, int blockNum, T beta)
{
__shared__ T bv[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int colIndex = blockDim.x * blockIdx.x + threadIdx.x;
int row = blockDim.y * blockIdx.y + threadIdx.y;
int col = colIndex % colNum;
int block = colIndex / colNum;
if(row >= rowNum || block >= blockNum)
return;
if(threadIdx.x == 0)
bv[threadIdx.y] = b[row];
__syncthreads();
int offset = block * blockSize + row * colNum + col;
if(betaFired)
c[offset] = a[offset] + bv[threadIdx.y] * beta;
else
c[offset] = a[offset] + bv[threadIdx.y];
}
/*
tensor summation (cuda version)
c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
>> a - a tensor
>> b - another tensor whose size is equal to that of dimension n of a
>> c - where we put a+b*\beta. we save it in a if c is NULL
>> n - the dimension index
>> beta - the scaling factor
*/
void _CudaSumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta)
{
CheckNTErrors(a && b && c, "Empty tensor input!");
CheckNTErrors(a->unitNum == c->unitNum, "Unmatched tensors in addition!");
CheckNTErrors(a->dataType == b->dataType && a->dataType == c->dataType,
"Unmatched data types in addition!");
CheckNTErrors(a->order == c->order, "The input tensors do not have the same order in addition!");
CheckNTErrors(!a->isSparse && !b->isSparse && !c->isSparse, "Dense tensors are required!");
CheckNTErrors(a->dimSize[n] == b->unitNum, "Wrong tensor size!");
int stride = 1;
int blockSize = a->dimSize[n];
int blockNum = 1;
for(int i = a->order - 1; i >= 0; i--){
if(i > n)
stride *= a->dimSize[i];
else if(i < n)
blockNum *= a->dimSize[i];
}
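/*
an added note: stride == 1 means that n is the last dimension of a, so b is
added as a row vector (KernelAddWithRow); when stride > 1, each block of a is
viewed as a blockSize * stride matrix and b is added as a column vector
(KernelAddWithCol)
*/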
int cudaGrids[3];
int cudaBlocks[3];
int devIDBackup = 0;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE){
if(stride > 1){
GDevs.GetCudaThread2D(a->devID, stride * blockNum, blockSize, MAX_INT, cudaGrids, cudaBlocks);
if(beta == (DTYPE)1.0F)
KernelAddWithCol<DTYPE, false> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockSize, stride, blockSize * stride, blockNum, beta);
else
KernelAddWithCol<DTYPE, true> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockSize, stride, blockSize * stride, blockNum, beta);
}
else if(stride == 1){
GDevs.GetCudaThread2D(a->devID, blockSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
if(beta == (DTYPE)1.0F)
KernelAddWithRow<DTYPE, false> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockNum, blockSize, beta);
else
KernelAddWithRow<DTYPE, true> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
((DTYPE*)a->data, (DTYPE*)b->data, (DTYPE*)c->data,
blockNum, blockSize, beta);
}
else{
ShowNTErrors("Something is wrong!");
}
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
*/
#ifndef __SUMDIM_CUH__
#define __SUMDIM_CUH__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting (cuda version) */
void _CudaSumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta = (DTYPE)1.0);
#endif
} // namespace nts(NiuTrans.Tensor)
#endif // __SUMDIM_CUH__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2018, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-29
 * It reached 39 degrees centigrade around 3:00 pm in Shenyang
*/
#ifndef __SUMDIM_H__
#define __SUMDIM_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting */
void _SumDim(const XTensor * a, const XTensor * b, XTensor * c, int n, DTYPE beta = (DTYPE)1.0);
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting. we keep the result in the input tensor a and return nothing */
void _SumDim(XTensor * a, const XTensor * b, int n, DTYPE beta = (DTYPE)1.0);
/* tensor summation c = a + b * \beta where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting. We make a new tensor c to keep the result and return it */
XTensor SumDim(const XTensor &a, const XTensor &b, int n, DTYPE beta = (DTYPE)1.0);
} // namespace nts(NiuTrans.Tensor)
#endif // __SUMDIM_H__
...@@ -20,6 +20,7 @@ ...@@ -20,6 +20,7 @@
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-05-08 * $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-05-08
*/ */
#include <math.h>
#include "SetData.h" #include "SetData.h"
#include "SetData.cuh" #include "SetData.cuh"
#include "../../XUtility.h" #include "../../XUtility.h"
...@@ -37,6 +38,43 @@ ...@@ -37,6 +38,43 @@
namespace nts{ // namespace nts(NiuTrans.Tensor) namespace nts{ // namespace nts(NiuTrans.Tensor)
/*
fill the input tensor with values according to the method described in
"Understanding the difficulty of training deep feedforward neural networks" - Glorot, X. & Bengio, Y. (2010),
using a uniform distribution. The resulting tensor will have values sampled from U(-a, a),
where a = gain * sqrt(2 / (fanIn + fanOut)) * sqrt(3). Also known as Glorot initialization.
>> tensor - the tensor whose data array would be initialized
>> gain - an optional scaling factor
*/
void _SetDataFanInOut(XTensor * tensor, DTYPE gain)
{
CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
CheckNTErrors(tensor->order >= 2, "the tensor dimension must be no less than 2!");
int fanIn = 1;
int fanOut = 1;
int order = tensor->order;
if (order == 2) {
fanIn = tensor->dimSize[1];
fanOut = tensor->dimSize[0];
}
else {
int numInputFmaps = tensor->dimSize[1];
int numOutputFmaps = tensor->dimSize[0];
/* the receptive field size is the product of the kernel dimensions */
int receptiveFieldSize = 1;
for (int i = 2; i < order; i++)
receptiveFieldSize *= tensor->dimSize[i];
fanIn = numInputFmaps * receptiveFieldSize;
fanOut = numOutputFmaps * receptiveFieldSize;
}
DTYPE std = gain * (DTYPE)sqrt(2.0 / (fanIn + fanOut));
DTYPE a = (DTYPE)sqrt(3.0) * std;
_SetDataRand(tensor, -a, a);
}
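/*
a short usage sketch (added for illustration; NewTensor2D is the 2D-tensor
constructor declared in XTensor.h):

XTensor * w = NewTensor2D(512, 256, X_FLOAT);
_SetDataFanInOut(w);   // gain = 1.0F

here fanIn = 256 and fanOut = 512, so the entries are sampled from U(-a, a)
with a = sqrt(3) * sqrt(2 / 768), i.e., roughly 0.088
*/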
/* /*
generate data items with a fixed value p generate data items with a fixed value p
>> tensor - the tensor whose data array would be initialized >> tensor - the tensor whose data array would be initialized
...@@ -65,7 +103,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer) ...@@ -65,7 +103,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
} }
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
CudaSetDataFixedInt(tensor, p); _CudaSetDataFixedInt(tensor, p);
#endif #endif
} }
} }
...@@ -88,7 +126,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer) ...@@ -88,7 +126,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
} }
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
CudaSetDataFixedFloat(tensor, p); _CudaSetDataFixedFloat(tensor, p);
#endif #endif
} }
} }
...@@ -111,7 +149,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer) ...@@ -111,7 +149,7 @@ void _SetDataFixed(XTensor * tensor, void * valuePointer)
} }
else{ else{
#ifdef USE_CUDA #ifdef USE_CUDA
CudaSetDataFixedDouble(tensor, p); _CudaSetDataFixedDouble(tensor, p);
#endif #endif
} }
} }
...@@ -137,7 +175,7 @@ generate data items with a fixed value p (in integer) ...@@ -137,7 +175,7 @@ generate data items with a fixed value p (in integer)
*/ */
void _SetDataFixedInt(XTensor * tensor, int p) void _SetDataFixedInt(XTensor * tensor, int p)
{ {
CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT"); CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");
if(p == 0) if(p == 0)
tensor->SetZeroAll(); tensor->SetZeroAll();
...@@ -152,7 +190,7 @@ generate data items with a fixed value p (in float) ...@@ -152,7 +190,7 @@ generate data items with a fixed value p (in float)
*/ */
void _SetDataFixedFloat(XTensor * tensor, float p) void _SetDataFixedFloat(XTensor * tensor, float p)
{ {
CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_INT"); CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
if(p == 0) if(p == 0)
tensor->SetZeroAll(); tensor->SetZeroAll();
...@@ -167,7 +205,7 @@ generate data items with a fixed value p (in double) ...@@ -167,7 +205,7 @@ generate data items with a fixed value p (in double)
*/ */
void _SetDataFixedDouble(XTensor * tensor, double p) void _SetDataFixedDouble(XTensor * tensor, double p)
{ {
CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_INT"); CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");
if(p == 0) if(p == 0)
tensor->SetZeroAll(); tensor->SetZeroAll();
...@@ -176,32 +214,32 @@ void _SetDataFixedDouble(XTensor * tensor, double p) ...@@ -176,32 +214,32 @@ void _SetDataFixedDouble(XTensor * tensor, double p)
} }
/* /*
generate data items with a uniform distribution in [low,high] generate data items with a uniform distribution in [lower, upper]
>> tensor - the tensor whose data array would be initialized >> tensor - the tensor whose data array would be initialized
>> low - lower value of the range >> lower - lower value of the range
>> high - higher value of the range >> upper - upper value of the range
*/ */
void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high) void _SetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper)
{ {
CheckNTErrors(upper > lower, "the upper value must be greater than the lower value!");
if(tensor == NULL) if(tensor == NULL)
return; return;
/* CPU code */ /* CPU code */
if(tensor->devID < 0){ if(tensor->devID < 0){
DTYPE variance = high - low; DTYPE variance = upper - lower;
srand((unsigned)time(NULL));
if(tensor->dataType == X_FLOAT){ if(tensor->dataType == X_FLOAT){
float * d = (float*)tensor->data; float * d = (float*)tensor->data;
for(int i = 0; i < tensor->unitNum; i++){ for(int i = 0; i < tensor->unitNum; i++){
d[i] = variance * ((float)rand()/RAND_MAX) + low; d[i] = variance * ((float)rand()/RAND_MAX) + lower;
} }
} }
else if(tensor->dataType == X_DOUBLE){ else if(tensor->dataType == X_DOUBLE){
double * d = (double*)tensor->data; double * d = (double*)tensor->data;
for(int i = 0; i < tensor->unitNum; i++){ for(int i = 0; i < tensor->unitNum; i++){
d[i] = variance * ((double)rand()/RAND_MAX) + low; d[i] = variance * ((double)rand()/RAND_MAX) + lower;
} }
} }
else{ else{
...@@ -215,12 +253,27 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high) ...@@ -215,12 +253,27 @@ void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high)
TODO: generate data points on GPUs straightforwardly. TODO: generate data points on GPUs straightforwardly.
*/ */
else{ else{
XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1); #ifdef USE_CUDA
_SetDataRand(t2, low, high); _CudaSetDataRand(tensor, lower, upper);
_CopyValues(t2, tensor); #endif
delete t2; //XTensor * t2 = NewTensor(tensor->order, tensor->dimSize, tensor->dataType, tensor->denseRatio, -1);
//_SetDataRand(t2, low, high);
//_CopyValues(t2, tensor);
//delete t2;
} }
} }
/*
generate data items with a normal distribution with specified mean and standard deviation
>> tensor - the tensor whose data array would be initialized
>> mean - mean or expectation of the distribution
>> standardDeviation - standard deviation of the distribution
void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation)
{
// TODO: rewrite it and add cuda code!!!!!!!
tensor->SetDataRandn(mean, standardDeviation);
}
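/*
a short usage sketch for the random initializers (added for illustration;
NewTensor2D is the 2D-tensor constructor declared in XTensor.h):

XTensor * t = NewTensor2D(3, 3, X_FLOAT);
_SetDataRand(t, 0.0F, 1.0F);    // uniform entries in [0, 1]
_SetDataRandN(t, 0.0F, 0.02F);  // overwrite with N(0, 0.02^2) samples
*/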
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
...@@ -21,7 +21,10 @@ ...@@ -21,7 +21,10 @@
* I'm surprised that I did not write this file till today. * I'm surprised that I did not write this file till today.
*/ */
#include <curand.h>
#include <time.h>
#include "SetData.cuh" #include "SetData.cuh"
#include <curand_kernel.h>
#include "../../XDevice.h" #include "../../XDevice.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -46,7 +49,7 @@ generate data items with a fixed value p (in int) ...@@ -46,7 +49,7 @@ generate data items with a fixed value p (in int)
>> tensor - the tensor for initialization >> tensor - the tensor for initialization
>> p - the initial value >> p - the initial value
*/ */
void CudaSetDataFixedInt(XTensor * tensor, int p) void _CudaSetDataFixedInt(XTensor * tensor, int p)
{ {
CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!"); CheckNTErrors(tensor->dataType == X_INT, "the tensor must be in X_INT!");
...@@ -86,7 +89,7 @@ generate data items with a fixed value p (in float) ...@@ -86,7 +89,7 @@ generate data items with a fixed value p (in float)
>> tensor - the tensor for initialization >> tensor - the tensor for initialization
>> p - the initial value >> p - the initial value
*/ */
void CudaSetDataFixedFloat(XTensor * tensor, float p) void _CudaSetDataFixedFloat(XTensor * tensor, float p)
{ {
CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!"); CheckNTErrors(tensor->dataType == X_FLOAT, "the tensor must be in X_FLOAT!");
...@@ -126,7 +129,7 @@ generate data items with a fixed value p (in double) ...@@ -126,7 +129,7 @@ generate data items with a fixed value p (in double)
>> tensor - the tensor for initialization >> tensor - the tensor for initialization
>> p - the initial value >> p - the initial value
*/ */
void CudaSetDataFixedDouble(XTensor * tensor, double p) void _CudaSetDataFixedDouble(XTensor * tensor, double p)
{ {
CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!"); CheckNTErrors(tensor->dataType == X_DOUBLE, "the tensor must be in X_DOUBLE!");
...@@ -146,4 +149,75 @@ void CudaSetDataFixedDouble(XTensor * tensor, double p) ...@@ -146,4 +149,75 @@ void CudaSetDataFixedDouble(XTensor * tensor, double p)
BacktoCudaDev(tensor->devID, devIDBackup); BacktoCudaDev(tensor->devID, devIDBackup);
} }
/*
set data array with a uniform distribution in [lower, upper] (CUDA Kernel)
>> d - float datatype pointer to the data array
>> size - size of the array
>> lower - lower value of the range
>> variance - the width of the range, i.e., upper - lower
*/
__global__
void KernelSetDataRandFloat(float * d, int size, DTYPE lower, DTYPE variance)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
d[i] = d[i] * variance + lower;
}
}
/*
set data array with a uniform distribution in [lower, upper] (CUDA Kernel)
>> d - double datatype pointer to the data array
>> size - size of the array
>> lower - lower value of the range
>> variance - the width of the range, i.e., upper - lower
*/
__global__
void KernelSetDataRandDouble(double * d, int size, DTYPE lower, DTYPE variance)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size){
d[i] = d[i] * variance + lower;
}
}
/*
generate data items with a uniform distribution in [lower, upper]
>> tensor - the tensor whose data array would be initialized
>> lower - lower value of the range
>> upper - upper value of the range
*/
void _CudaSetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper)
{
CheckNTErrors(upper > lower, "the upper value must be greater than the lower value!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(tensor->devID, tensor->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(tensor->devID, devIDBackup);
curandGenerator_t gen;
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
curandSetPseudoRandomGeneratorSeed(gen, (unsigned long long)time(NULL));
/* generate raw uniform samples with the data type of the tensor */
if (tensor->dataType == X_FLOAT)
curandGenerateUniform(gen, (float*)tensor->data, tensor->unitNum);
else if (tensor->dataType == X_DOUBLE)
curandGenerateUniformDouble(gen, (double*)tensor->data, tensor->unitNum);
else
ShowNTErrors("TODO!");
curandDestroyGenerator(gen);
DTYPE variance = upper - lower;
if (tensor->dataType == X_FLOAT)
KernelSetDataRandFloat <<<blocks, threads >>>((float*)tensor->data, tensor->unitNum, lower, variance);
else if (tensor->dataType == X_DOUBLE)
KernelSetDataRandDouble <<<blocks, threads >>>((double*)tensor->data, tensor->unitNum, lower, variance);
BacktoCudaDev(tensor->devID, devIDBackup);
}
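/*
an added note on the implementation: curandGenerateUniform produces raw
samples in (0, 1]; the kernels above then rescale them in place as
d[i] = d[i] * (upper - lower) + lower, i.e., samples in (lower, upper]
*/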
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
...@@ -29,13 +29,16 @@ ...@@ -29,13 +29,16 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* generate data items with a fixed value p (in int) */ /* generate data items with a fixed value p (in int) */
void CudaSetDataFixedInt(XTensor * tensor, int p); void _CudaSetDataFixedInt(XTensor * tensor, int p);
/* generate data items with a fixed value p (in float) */ /* generate data items with a fixed value p (in float) */
void CudaSetDataFixedFloat(XTensor * tensor, float p); void _CudaSetDataFixedFloat(XTensor * tensor, float p);
/* generate data items with a fixed value p (in double) */ /* generate data items with a fixed value p (in double) */
void CudaSetDataFixedDouble(XTensor * tensor, double p); void _CudaSetDataFixedDouble(XTensor * tensor, double p);
/* generate data items with a uniform distribution in [lower, upper] */
void _CudaSetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
......
...@@ -27,6 +27,9 @@ ...@@ -27,6 +27,9 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* generate data items with Xavier (Glorot uniform) initialization */
void _SetDataFanInOut(XTensor * tensor, DTYPE gain = 1.0F);
/* generate data items with a fixed value p */ /* generate data items with a fixed value p */
void _SetDataFixed(XTensor * tensor, void * valuePointer); void _SetDataFixed(XTensor * tensor, void * valuePointer);
...@@ -42,8 +45,8 @@ void _SetDataFixedFloat(XTensor * tensor, float p); ...@@ -42,8 +45,8 @@ void _SetDataFixedFloat(XTensor * tensor, float p);
/* generate data items with a fixed value p (in double) */ /* generate data items with a fixed value p (in double) */
void _SetDataFixedDouble(XTensor * tensor, double p); void _SetDataFixedDouble(XTensor * tensor, double p);
/* generate data items with a uniform distribution in [low,high] */ /* generate data items with a uniform distribution in [lower, upper] */
void _SetDataRand(XTensor * tensor, DTYPE low, DTYPE high); void _SetDataRand(XTensor * tensor, DTYPE lower, DTYPE upper);
/* generate data items with a normal distribution with specified mean and standard deviation */ /* generate data items with a normal distribution with specified mean and standard deviation */
void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation); void _SetDataRandN(XTensor * tensor, DTYPE mean, DTYPE standardDeviation);
......
...@@ -16,67 +16,130 @@ ...@@ -16,67 +16,130 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#include <math.h>
#include "../../XTensor.h" #include "../../XTensor.h"
#include "../../XName.h" #include "../../XName.h"
#include "Absolute.h" #include "Clip.h"
#include "Absolute.cuh" #include "Clip.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* /*
set every entry to its absolute value set every entry to its clip value
>> a - input tensor we are processing >> a - input tensor we are processing
>> b - output tensor we are processing >> b - output tensor we are processing
>> lower - the lower border
>> upper - the upper border
*/ */
void _Absolute(const XTensor * a, XTensor * b) void _Clip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper)
{ {
#ifdef USE_CUDA #ifdef USE_CUDA
/* run it on GPUs */ /* run it on GPUs */
if (a->devID >= 0) { if (a->devID >= 0) {
_CudaAbsolute(a, b); _CudaClip(a, b, lower, upper);
return; return;
} }
#endif #endif
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!"); CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * d = (DTYPE*)a->data;
DTYPE * db = (DTYPE*)b->data; DTYPE * d = (DTYPE*)a->data;
for (int i = 0; i < a->unitNum; i++) DTYPE * db = (DTYPE*)b->data;
db[i] = (DTYPE)fabs(d[i]); for (int i = 0; i < a->unitNum; i++) {
if (d[i] > upper)
db[i] = upper;
else if (d[i] < lower)
db[i] = lower;
else
db[i] = d[i];
}
} }
/* /*
set every entry to its absolute value (do it on site) set every entry to its clip value (do it on site)
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
>> a - the tensor we are processing >> a - the tensor we are processing
>> lower - the lower border
>> upper - the upper border
*/ */
void _AbsoluteMe(XTensor * a) void _ClipMe(XTensor * a, DTYPE lower, DTYPE upper)
{ {
_Absolute(a, a); _Clip(a, a, lower, upper);
} }
/* /*
set every entry to its absolute value (return a XTensor structure) set every entry to its clip value (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
>> a - input tensor we are processing >> a - input tensor we are processing
<< return - the absolute value of input tensor >> lower - the lower border
>> upper - the upper border
<< return - the clip value of the input tensor
*/ */
XTensor Absolute(const XTensor & a) XTensor Clip(const XTensor & a, DTYPE lower, DTYPE upper)
{
XTensor b(&a);
b.SetTMP();
/* call _Clip function */
_Clip(&a, &b, lower, upper);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_CLIP);
XLink::AddParamToHead(&b, lower);
XLink::AddParamToHead(&b, upper);
return b;
}
/*
backward computation
dE/dx = dE/dy * dy/dx
clip: y = upper if x > upper
x if lower <= x <= upper
lower if x < lower
and dy/dx = 1 if lower <= x <= upper
0 otherwise
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lower - the lower border
>> upper - the upper border
*/
void _ClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper)
{ {
XTensor b(&a);
b.SetTMP();
/* call _Absolute function */
_Absolute(&a, &b);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_ABSOLUTE);
return b; #ifdef USE_CUDA
if (x->devID >= 0) {
_CudaClipBackward(y, x, dedy, dedx, lower, upper);
return;
} }
#endif
if (x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE) {
DTYPE * dedyp = (DTYPE*)dedy->data;
DTYPE * dedxp = (DTYPE*)dedx->data;
DTYPE * ip = (DTYPE*)x->data;
int size = y->unitNum;
/* dE/dx = dE/dy * dy/dx */
for (int i = 0; i < size; i++) {
DTYPE s = ip[i];
if (s > upper || s < lower)
dedxp[i] = 0;
else
dedxp[i] = dedyp[i];
}
}
else
ShowNTErrors("TODO!");
}
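/*
a minimal usage sketch (added for illustration; NewTensor1D is the 1D-tensor
constructor declared in XTensor.h):

XTensor * a = NewTensor1D(5, X_FLOAT);
_SetDataRand(a, -2.0F, 2.0F);
XTensor b = Clip(*a, -1.0F, 1.0F);  // entries outside [-1, 1] are set to the nearest border
*/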
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
...@@ -16,78 +16,162 @@ ...@@ -16,78 +16,162 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#include "../../XDevice.h" #include "../../XDevice.h"
#include "../../XTensor.h" #include "../../XTensor.h"
#include "Absolute.h" #include "Clip.h"
#include "Absolute.cuh" #include "Clip.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA #ifdef USE_CUDA
/* /*
set each entry to its absolute value (CUDA Kernel) set each entry to its clip value (CUDA Kernel)
>> a - pointer to input data array >> a - pointer to input data array
>> b - pointer to output data array >> b - pointer to output data array
>> lower - the lower border
>> upper - the upper border
>> size - size of the data array >> size - size of the data array
*/ */
__global__ __global__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size) void KernelClip(DTYPE * a, DTYPE * b, DTYPE lower, DTYPE upper, int size)
{ {
int i = blockDim.x * blockIdx.x + threadIdx.x; int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) if (i < size) {
b[i] = fabs(a[i]); if (a[i] > upper)
b[i] = upper;
else if (a[i] < lower)
b[i] = lower;
else
b[i] = a[i];
}
} }
/* /*
set each entry to its absolute value (CUDA Kernel) set each entry to its clip value with float16 data type value (CUDA Kernel)
This is for float16 computation This is for float16 computation
>> a - pointer to input data array >> a - pointer to input data array
>> b - pointer to output data array >> b - pointer to output data array
>> lower - the lower border
>> upper - the upper border
>> size - size of the data array >> size - size of the data array
*/ */
__global__ __global__
void KernelAbsolute(__half * a, __half * b, int size) void KernelClip(__half * a, __half * b, DTYPE lower, DTYPE upper, int size)
{ {
return; return;
} }
/* /*
set each entry to its absolute value set each entry to its clip value
>> a - input tensor >> a - input tensor we are processing
>> b - output tensor >> b - output tensor we are processing
>> lower - the lower border
>> upper - the upper border
*/ */
void _CudaAbsolute(const XTensor * a, XTensor * b) void _CudaClip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper)
{ {
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!"); CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->isSparse == false), "TODO!"); CheckNTErrors((a->isSparse == false), "TODO!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int gridSize[3]; int devIDBackup;
int blockSize[3]; ProtectCudaDev(a->devID, devIDBackup);
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize); if (a->dataType == DEFAULT_DTYPE) {
KernelClip << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, lower, upper, a->unitNum);
}
else if (a->dataType == X_FLOAT16) {
KernelClip << <blocks, threads >> >((__half*)a->data, (__half*)b->data, lower, upper, a->unitNum);
}
else {
ShowNTErrors("TODO!");
}
dim3 blocks(gridSize[0]); BacktoCudaDev(a->devID, devIDBackup);
dim3 threads(blockSize[0]); }
/*
clip backward computation of dE/dx (Cuda kernel)
int devIDBackup; dy/dx = 1 if lower <= x <= upper
ProtectCudaDev(a->devID, devIDBackup); 0 otherwise
if (a->dataType == DEFAULT_DTYPE) { >> dedy - dE/dy
KernelAbsolute << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum); >> dedx - dE/dx
>> y - output of the function
>> x - input of the function
>> lower - the lower border
>> upper - the upper border
>> size - size of the data array
*/
__global__
void KernelClipBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * y, DTYPE * x, DTYPE lower, DTYPE upper, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
DTYPE s = x[i];
if (s > upper || s < lower)
dedx[i] = 0;
else
dedx[i] = dedy[i];
} }
else if (a->dataType == X_FLOAT16) { }
KernelAbsolute << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
/*
backward computation (Cuda version)
dE/dx = dE/dy * dy/dx
clip: y = upper if x > upper
x if lower <= x <= upper
lower if x < lower
and dy/dx = 1 if lower <= x <= upper
0 otherwise
>> y - output of the function
>> x - input of the function
>> dedy - dE/dy
>> dedx - dE/dx
>> lower - the lower border
>> upper - the upper border
*/
void _CudaClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper)
{
if (x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE) {
int gridSize[3], blockSize[3];
GDevs.GetCudaThread(x->devID, x->unitNum, gridSize, blockSize);
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
/* dE/dx = dE/dy * dy/dx */
KernelClipBackward <<<dim3(gridSize[0]), dim3(blockSize[0])>>>
((DTYPE*)dedy->data,
(DTYPE*)dedx->data,
(DTYPE*)y->data, (DTYPE*)x->data,
lower, upper,
x->unitNum);
BacktoCudaDev(x->devID, devIDBackup);
} }
else { else
ShowNTErrors("TODO!"); ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
} }
#endif // USE_CUDA #endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northestern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/
#ifndef __CLIP_CUH__
#define __CLIP_CUH__
#include "Clip.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set each entry to its clip value (CUDA Kernel) */
__global__
void KernelClip(DTYPE * a, DTYPE * b, DTYPE lower, DTYPE upper, int size);
/* set each entry to its clip value (CUDA Kernel) with float16 data type*/
__global__
void KernelClip(__half * a, __half * b, DTYPE lower, DTYPE upper, int size);
/* set each entry to its clip value */
void _CudaClip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper);
/* backward of Clip function (CUDA Kernel) */
__global__
void KernelClipBackward(DTYPE * dedy, DTYPE * dedx, DTYPE * y, DTYPE * x, DTYPE lower, DTYPE upper, int size);
/* backward of Clip function */
void _CudaClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __CLIP_CUH__
\ No newline at end of file
...@@ -16,67 +16,36 @@ ...@@ -16,67 +16,36 @@
*/ */
/* /*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#ifndef __CLIP_H__
#define __CLIP_H__
#include "../../XTensor.h" #include "../../XTensor.h"
#include "../../XName.h"
#include "Log.h"
#include "Log.cuh"
#include <math.h>
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* /* set every entry to its clip value */
set every entry to its log value (do it on site) void _Clip(const XTensor * a, XTensor * b, DTYPE lower, DTYPE upper);
>> a - input tensor we are processing
>> b - output tensor we are processing
*/
void _Log(const XTensor * a, XTensor * b)
{
#ifdef USE_CUDA
/* run it on GPUs */
if (a->devID >= 0) {
_CudaLog(a, b);
return;
}
#endif
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * d = (DTYPE*)a->data;
DTYPE * db = (DTYPE*)b->data;
for (int i = 0; i < a->unitNum; i++)
db[i] = (DTYPE)log(d[i]);
}
/* /*
set every entry to its log value set every entry to its clip value (do it on site)
keep the result in the input tensor a and return nothing keep the result in the input tensor a and return nothing
>> a - the tensor we are processing
*/ */
void _LogMe(XTensor * a) void _ClipMe(XTensor * a, DTYPE lower, DTYPE upper);
{
_Log(a, a);
}
/* /*
set every entry to its log value (return a XTensor structure) set every entry to its clip value (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
>> a - input tensor we are processing
<< return - the log value of the input tensor
*/ */
XTensor Log(const XTensor & a) XTensor Clip(const XTensor & a, DTYPE lower, DTYPE upper);
{
XTensor b(&a); /*
b.SetTMP(); backward of Clip function
*/
/* call _Log function */ void _ClipBackward(XTensor * y, XTensor * x, XTensor * dedy, XTensor * dedx, DTYPE lower, DTYPE upper);
_Log(&a, &b);
} // namespace nts(NiuTrans.Tensor)
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_LOG); #endif // __CLIP_H__
return b;
}
} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "Log.h"
#include "Log.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set each entry to its log value (CUDA Kernel)
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelLog(DTYPE * a, DTYPE * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
b[i] = log(a[i]);
}
/*
set each entry to its log value (CUDA Kernel)
This is for float16 computation
>> a - pointer to input data array
>> b - pointer to output data array
>> size - size of the data array
*/
__global__
void KernelLog(__half * a, __half * b, int size)
{
return;
}
/*
set each entry to its log value
>> a - input tensor
>> b - output tensor
*/
void _CudaLog(const XTensor * a, XTensor * b)
{
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
CheckNTErrors((a->isSparse == false), "TODO!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE) {
KernelLog << <blocks, threads >> >((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
}
else if (a->dataType == X_FLOAT16) {
KernelLog << <blocks, threads >> >((__half*)a->data, (__half*)b->data, a->unitNum);
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
...@@ -110,7 +110,7 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim, ...@@ -110,7 +110,7 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim,
int cudaBlockSize[3]; int cudaBlockSize[3];
GDevs.GetCudaThread2D(input->devID, strideNum, stride * blockNum, GDevs.GetCudaThread2D(input->devID, strideNum, stride * blockNum,
MAX_INT, cudaGridSize, cudaBlockSize); MAX_INT, cudaGridSize, cudaBlockSize);
dim3 blocks(cudaGridSize[1], cudaGridSize[0]); dim3 blocks(cudaGridSize[1], cudaGridSize[0]);
dim3 threads(cudaBlockSize[1], cudaBlockSize[0]); dim3 threads(cudaBlockSize[1], cudaBlockSize[0]);
...@@ -119,9 +119,9 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim, ...@@ -119,9 +119,9 @@ void _CudaNormalize(const XTensor * input, XTensor * output, int dim,
ProtectCudaDev(a->devID, devIDBackup); ProtectCudaDev(a->devID, devIDBackup);
KernelNormalize << <blocks, threads >> >((DTYPE*)input->data, (DTYPE*)output->data, KernelNormalize << <blocks, threads >> >((DTYPE*)input->data, (DTYPE*)output->data,
(DTYPE*)mean->data, (DTYPE*)var->data, (DTYPE*)mean->data, (DTYPE*)var->data,
(DTYPE*)a->data, (DTYPE*)b->data, epsilon, (DTYPE*)a->data, (DTYPE*)b->data, epsilon,
stride, strideNum, blockNum); stride, strideNum, blockNum);
BacktoCudaDev(a->devID, devIDBackup); BacktoCudaDev(a->devID, devIDBackup);
} }
......
#include <math.h>
#include "../../XName.h"
#include "Unary.h"
#include "Unary.cuh"
namespace nts{
#ifdef USE_CUDA
/* define three macros separately, specifying the respective function names */
#define _SIMPLE_UNARY_FUNCTION(_funcName, _cudaFuncName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
/* run it on GPUs */ \
if (a->devID >= 0) { \
_cudaFuncName(a, b); \
return; \
} \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
}
#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a) \
{ \
_funcName(a, a); \
}
#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a) \
{ \
XTensor b(&a); \
b.SetTMP(); \
_funcName(&a, &b); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
_SIMPLE_UNARY_FUNCTION(_Absolute, _CudaAbsolute, fabs)
_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
_SIMPLE_UNARY_FUNCTION(_Exp, _CudaExp, exp)
_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
_SIMPLE_UNARY_FUNCTION(_Log, _CudaLog, log)
_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
_SIMPLE_UNARY_FUNCTION(_Sin, _CudaSin, sin)
_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
_SIMPLE_UNARY_FUNCTION(_Cos, _CudaCos, cos)
_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
_SIMPLE_UNARY_FUNCTION(_Tan, _CudaTan, tan)
_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
_SIMPLE_UNARY_FUNCTION(_Round, _CudaRound, round)
_SIMPLE_UNARY_FUNCTION_ME(_RoundMe, _Round)
SIMPLE_UNARY_FUNCTION(Round, _Round, MATH_ROUND)
#else
/* define three macros separately, specifying the respective function names */
#define _SIMPLE_UNARY_FUNCTION(_funcName, origFunc) \
void _funcName(const XTensor * a, XTensor * b) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!"); \
DTYPE * d = (DTYPE*)a->data; \
DTYPE * db = (DTYPE*)b->data; \
for (int i = 0; i < a->unitNum; i++) \
db[i] = (DTYPE)origFunc(d[i]); \
}
#define _SIMPLE_UNARY_FUNCTION_ME(_funcNameMe, _funcName) \
void _funcNameMe(XTensor * a) \
{ \
_funcName(a, a); \
}
#define SIMPLE_UNARY_FUNCTION(funcName, _funcName, operationId) \
XTensor funcName(const XTensor &a) \
{ \
XTensor b(&a); \
b.SetTMP(); \
_funcName(&a, &b); \
XLink::MakeLink(&a, NULL, &b, operationId); \
return b; \
}
_SIMPLE_UNARY_FUNCTION(_Absolute, fabs)
_SIMPLE_UNARY_FUNCTION_ME(_AbsoluteMe, _Absolute)
SIMPLE_UNARY_FUNCTION(Absolute, _Absolute, MATH_ABSOLUTE)
_SIMPLE_UNARY_FUNCTION(_Exp, exp)
_SIMPLE_UNARY_FUNCTION_ME(_ExpMe, _Exp)
SIMPLE_UNARY_FUNCTION(Exp, _Exp, MATH_EXP)
_SIMPLE_UNARY_FUNCTION(_Log, log)
_SIMPLE_UNARY_FUNCTION_ME(_LogMe, _Log)
SIMPLE_UNARY_FUNCTION(Log, _Log, MATH_LOG)
_SIMPLE_UNARY_FUNCTION(_Sin, sin)
_SIMPLE_UNARY_FUNCTION_ME(_SinMe, _Sin)
SIMPLE_UNARY_FUNCTION(Sin, _Sin, MATH_SIN)
_SIMPLE_UNARY_FUNCTION(_Cos, cos)
_SIMPLE_UNARY_FUNCTION_ME(_CosMe, _Cos)
SIMPLE_UNARY_FUNCTION(Cos, _Cos, MATH_COS)
_SIMPLE_UNARY_FUNCTION(_Tan, tan)
_SIMPLE_UNARY_FUNCTION_ME(_TanMe, _Tan)
SIMPLE_UNARY_FUNCTION(Tan, _Tan, MATH_TAN)
_SIMPLE_UNARY_FUNCTION(_Round, round)
_SIMPLE_UNARY_FUNCTION_ME(_RoundMe, _Round)
SIMPLE_UNARY_FUNCTION(Round, _Round, MATH_ROUND)
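/*
an added note for reference: in this CPU-only branch the three macros expand,
e.g., for Absolute, to the equivalent of

void _Absolute(const XTensor * a, XTensor * b)
{ ... db[i] = (DTYPE)fabs(d[i]); ... }

void _AbsoluteMe(XTensor * a)
{ _Absolute(a, a); }

XTensor Absolute(const XTensor &a)
{ ... _Absolute(&a, &b); XLink::MakeLink(&a, NULL, &b, MATH_ABSOLUTE); return b; }
*/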
#endif
}
\ No newline at end of file
#include <math.h>
#include "../../XDevice.h"
#include "../../XName.h"
#include "Unary.cuh"
namespace nts {
#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc) \
__global__ \
void Kernel##funcName(DTYPE * a, DTYPE * b, int size) \
{ \
int i = blockDim.x * blockIdx.x + threadIdx.x; \
\
if (i < size) \
b[i] = (DTYPE)origFunc(a[i]); \
} \
__global__ \
void Kernel##funcName(__half * a, __half * b, int size) \
{ \
return; \
} \
void _Cuda##funcName(const XTensor * a, XTensor * b) \
{ \
CheckNTErrors((XTensor::IsSameShaped(a, b)), \
"Input tensors should have the same type!"); \
CheckNTErrors((a->isSparse == false), "TODO!"); \
\
int gridSize[3]; \
int blockSize[3]; \
\
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize); \
\
dim3 blocks(gridSize[0]); \
dim3 threads(blockSize[0]); \
\
int devIDBackup; \
ProtectCudaDev(a->devID, devIDBackup); \
\
if (a->dataType == DEFAULT_DTYPE) { \
Kernel##funcName << <blocks, threads >> > \
((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum); \
} \
else if (a->dataType == X_FLOAT16) { \
Kernel##funcName << <blocks, threads >> > \
((__half*)a->data, (__half*)b->data, a->unitNum); \
} \
else { \
ShowNTErrors("TODO!"); \
} \
\
BacktoCudaDev(a->devID, devIDBackup); \
}
SIMPLE_UNARY_FUNCTION_GPU(Absolute, fabs)
SIMPLE_UNARY_FUNCTION_GPU(Exp, exp)
SIMPLE_UNARY_FUNCTION_GPU(Log, log)
SIMPLE_UNARY_FUNCTION_GPU(Sin, sin)
SIMPLE_UNARY_FUNCTION_GPU(Cos, cos)
SIMPLE_UNARY_FUNCTION_GPU(Tan, tan)
SIMPLE_UNARY_FUNCTION_GPU(Round, round)
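/*
an added note: each invocation above generates three symbols, a DTYPE kernel
(e.g., KernelExp), a __half stub kernel with the same name, and a host-side
wrapper (e.g., _CudaExp) that picks the kernel according to the data type
*/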
}
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __UNARY_CUH__
#define __UNARY_CUH__
#include "../../XTensor.h"
#include "Unary.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set each entry to its absolute value (CUDA Kernel) */
__global__
void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
__global__
void KernelAbsolute(__half * a, __half * b, int size);
/* set each entry to its absolute value */
void _CudaAbsolute(const XTensor * a, XTensor * b);
/* set each entry to its exponent value (CUDA Kernel) */
__global__
void KernelExp(DTYPE * a, DTYPE * b, int size);
/* set each entry to its exponent value (CUDA Kernel) with float16 data type*/
__global__
void KernelExp(__half * a, __half * b, int size);
/* set each entry to its exponent value */
void _CudaExp(const XTensor * a, XTensor * b);
/* set each entry to its logarithm value (CUDA Kernel) */
__global__
void KernelLog(DTYPE * a, DTYPE * b, int size);
/* set each entry to its logarithm value (CUDA Kernel) with float16 data type*/
__global__
void KernelLog(__half * a, __half * b, int size);
/* set each entry to its logarithm value */
void _CudaLog(const XTensor * a, XTensor * b);
/* set each entry to its sine value (CUDA Kernel) */
__global__
void KernelSin(DTYPE * a, DTYPE * b, int size);
/* set each entry to its sine value (CUDA Kernel) with float16 data type*/
__global__
void KernelSin(__half * a, __half * b, int size);
/* set each entry to its sine value */
void _CudaSin(const XTensor * a, XTensor * b);
/* set each entry to its cosine value (CUDA Kernel) */
__global__
void KernelCos(DTYPE * a, DTYPE * b, int size);
/* set each entry to its cosine value (CUDA Kernel) with float16 data type*/
__global__
void KernelCos(__half * a, __half * b, int size);
/* set each entry to its cosine value */
void _CudaCos(const XTensor * a, XTensor * b);
/* set each entry to its tangent value (CUDA Kernel) */
__global__
void KernelTan(DTYPE * a, DTYPE * b, int size);
/* set each entry to its tangent value (CUDA Kernel) with float16 data type*/
__global__
void KernelTan(__half * a, __half * b, int size);
/* set each entry to its tangent value */
void _CudaTan(const XTensor * a, XTensor * b);
/* set each entry to its round value (CUDA Kernel) */
__global__
void KernelRound(DTYPE * a, DTYPE * b, int size);
/* set each entry to its round value (CUDA Kernel) with float16 data type*/
__global__
void KernelRound(__half * a, __half * b, int size);
/* set each entry to its round value */
void _CudaRound(const XTensor * a, XTensor * b);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __UNARY_CUH__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __UNARY_H__
#define __UNARY_H__
#include "../../XTensor.h"
namespace nts{
/* set every entry to its absolute value */
void _Absolute(const XTensor * a, XTensor * b);
/*
set every entry to its absolute value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _AbsoluteMe(XTensor * a);
/*
set every entry to its absolute value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Absolute(const XTensor & a);
/* set every entry to its exponent value */
void _Exp(const XTensor * a, XTensor * b);
/*
set every entry to its exponent value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _ExpMe(XTensor * a);
/*
set every entry to its exponent value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Exp(const XTensor & a);
/* set every entry to its logarithm value */
void _Log(const XTensor * a, XTensor * b);
/*
set every entry to its logarithm value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _LogMe(XTensor * a);
/*
set every entry to its logarithm value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Log(const XTensor & a);
/* set every entry to its sine value */
void _Sin(const XTensor * a, XTensor * b);
/*
set every entry to its sine value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _SinMe(XTensor * a);
/*
set every entry to its sine value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Sin(const XTensor & a);
/* set every entry to its cosine value */
void _Cos(const XTensor * a, XTensor * b);
/*
set every entry to its cosine value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _CosMe(XTensor * a);
/*
set every entry to its cosine value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Cos(const XTensor & a);
/* set every entry to its tangent value */
void _Tan(const XTensor * a, XTensor * b);
/*
set every entry to its tangent value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _TanMe(XTensor * a);
/*
set every entry to its tangent value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Tan(const XTensor & a);
/* set every entry to its round value */
void _Round(const XTensor * a, XTensor * b);
/*
set every entry to its round value (do it on site)
keep the result in the input tensor a and return nothing
*/
void _RoundMe(XTensor * a);
/*
set every entry to its round value (return a XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Round(const XTensor & a);
}
#endif //end __UNARY_H__
\ No newline at end of file
...@@ -35,24 +35,33 @@ copy a number of blocks to target positions ...@@ -35,24 +35,33 @@ copy a number of blocks to target positions
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
>> myMem - the memory pool >> myMem - the memory pool
>> devID - device id
*/ */
void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem) void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
{ {
if (myMem != NULL && myMem->devID >= 0) { if (myMem != NULL)
devID = myMem->devID;
if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
/* copy the index from host to device */ /* copy the index from host to device */
int * targetBlocksTMP = (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)); int * targetBlocksTMP = myMem != NULL ?
(int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)):
(int*)XMemAlloc(devID, blockNum * sizeof(int));
XMemCopy(targetBlocksTMP, myMem->devID, targetBlocks, -1, blockNum * sizeof(int)); XMemCopy(targetBlocksTMP, devID, targetBlocks, -1, blockNum * sizeof(int));
_CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocksTMP, myMem); _CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocksTMP, devID);
myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int)); if(myMem != NULL)
myMem->ReleaseBuf(myMem->devID, blockNum * sizeof(int));
else
XMemFree(devID, targetBlocksTMP);
#else #else
ShowNTErrors("Plesae specify USE_CUDA and recompile the code!"); ShowNTErrors("Plesae specify USE_CUDA and recompile the code!");
#endif #endif
} }
else { else {
_CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocks, myMem); _CopyBlocksOnSite(source, blockSize, blockNum, target, targetBlocks, devID);
} }
} }
...@@ -65,11 +74,12 @@ copy a number of blocks source source positions to target positions ...@@ -65,11 +74,12 @@ copy a number of blocks source source positions to target positions
>> target - target data array >> target - target data array
>> targetBlocks - target positions of the copy >> targetBlocks - target positions of the copy
>> myMem - the memory pool >> myMem - the memory pool
>> devID - device id
*/ */
void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID) void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID)
{ {
if (myMem != NULL) if (myMem != NULL)
CheckNTErrors((myMem->devID == devID), "DevIDs are different between memory pool and input devID!"); devID = myMem->devID;
if (devID >= 0) { if (devID >= 0) {
#ifdef USE_CUDA #ifdef USE_CUDA
......
...@@ -27,7 +27,7 @@ ...@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* copy a number of blocks to target positions */ /* copy a number of blocks to target positions */
void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem); void _CopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
/* copy a number of blocks from source positions to target positions */ /* copy a number of blocks from source positions to target positions */
void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID); void _CopyBlocks(void * source, int blockSize, int * sourceBlocks, int blockNum, void * target, int * targetBlocks, XMem * myMem, int devID);
......
@@ -223,8 +223,11 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN
    int cudaGrids[3];
    int cudaBlocks[3];
    int threadNum = MIN(MAX(blockSize, blockNum), MAX_CUDA_THREAD_NUM_PER_BLOCK);

    int devIDBackup;
    ProtectCudaDev(myMem->devID, devIDBackup);

    GDevs.GetCudaThread2D(myMem->devID, threadNum, gridNum * blockNum, INT_MAX, cudaGrids, cudaBlocks);
    cudaBlocks[1] = 1;
@@ -237,39 +240,41 @@ void _CudaCopyBlocksInGrid(void * source, int blockSize, int blockNum, int gridN
    if (blockNum == 4) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 4, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 4, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else if (blockNum == 6) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 6, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 6, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else if (blockNum == 8) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 8, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 8, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                   ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else if (blockNum == 12) {
        if ((SHARED_MEMORY_SIZE / itemSize - 2 * MAX_CUDA_THREAD_NUM_PER_BLOCK) >= 2 * cudaBlocks[0] * blockNum)
            KernelCopyBlocksInGridFast<int, 12, 2> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                    ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
        else
            KernelCopyBlocksInGridFast<int, 12, 1> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                                    ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }
    else {
        KernelCopyBlocksInGrid<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                     ((int*)source, blockSize, blockNum, gridNum, (int*)target, index);
    }

    BacktoCudaDev(myMem->devID, devIDBackup);
}
#endif // USE_CUDA
...
@@ -34,29 +34,35 @@ all the data has been on the device (CPU/GPU) already.
>> blockNum - number of blocks
>> target - target data array
>> targetBlocks - target positions of the copy
>> devID - device id
*/
void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID)
{
    if (devID >= 0) {
#ifdef USE_CUDA
        _CudaCopyBlocks(source, blockSize, blockNum, target, targetBlocks, devID);
#else
ShowNTErrors("Plesae specify USE_CUDA and recompile the code!"); ShowNTErrors("Plesae specify USE_CUDA and recompile the code!");
#endif
    }
    else {
        /*
        The following code should be fine with GPUs, but too many
        kernel calls would slow down the system. We prefer to use
        one kernel to do block copy in batch (kernel fusion).
        */
        if (blockSize == sizeof(int)) {
            for (int i = 0, b = 0; i < blockNum; i++, b += blockSize) {
                *(int*)((char*)target + targetBlocks[i] * blockSize) =
                *(int*)((char*)source + b);
            }
        }
        else {
            for (int i = 0, b = 0; i < blockNum; i++, b += blockSize) {
                XMemCopy((char*)target + targetBlocks[i] * blockSize, devID,
                         (char*)source + b, devID, blockSize);
            }
        }
    }
}

} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
@@ -36,39 +36,48 @@ NOTE that this version makes more use of the 2d threads in cuda
>> target - target data array
>> targetBlocks - target positions of the copy
*/
template<class T>
__global__
void KernelCopyBlocks(T * source, int blockSize, int blockNum, T * target, int * targetBlocks)
{
    /* entry index in the block */
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    /* block index */
    int j = blockDim.y * blockIdx.y + threadIdx.y;

    if (i >= blockSize || j >= blockNum)
        return;

    T * s = source + blockSize * j;
    T * t = target + blockSize * targetBlocks[j];

    t[i] = s[i];
}
/*
copy a number of blocks to target positions
NOTE that this version flattens the copy into 1d threads: each thread
moves exactly one element
>> source - data array (head of the blocks) to copy from
>> blockSize - size of block
>> blockNum - number of blocks
>> totalSize - total number of elements (blockSize * blockNum)
>> target - target data array
>> targetBlocks - target positions of the copy
*/
template<class T>
__global__
void KernelCopyBlocksV2(T * source, int blockSize, int blockNum, int totalSize, T * target, int * targetBlocks)
{
    /* entry index in the flattened data array */
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i >= totalSize)
        return;

    int targetBlockID = targetBlocks[i / blockSize];
    int targetOffset = i % blockSize;

    *(target + blockSize * targetBlockID + targetOffset) = source[i];
}
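/*
Host-side sketch (an editor's illustration, not code from this commit):
launching KernelCopyBlocksV2. Because the kernel is purely element-wise,
a plain 1d grid covering blockSize * blockNum elements suffices; the
256-thread block size below is an assumption, whereas the library derives
its launch shape via GDevs.GetCudaThread.
*/
template<class T>
void LaunchCopyBlocksV2Sketch(T * source, int blockSize, int blockNum, T * target, int * targetBlocks)
{
    int totalSize = blockSize * blockNum;
    int threads = 256;                                /* assumed thread-block size */
    int grids = (totalSize + threads - 1) / threads;  /* ceiling division over all elements */
    KernelCopyBlocksV2<T> <<<dim3(grids), dim3(threads)>>>
                           (source, blockSize, blockNum, totalSize, target, targetBlocks);
}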
/*
@@ -78,29 +87,42 @@ copy a number of blocks to target positions (cuda version)
>> blockNum - number of blocks
>> target - target data array
>> targetBlocks - target positions of the copy (on the device)
>> devID - device id
*/
void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID)
{
    CheckNTErrors(devID >= 0, "Wrong device to run!");

    int cudaGrids[3];
    int cudaBlocks[3];

    int devIDBackup;
    ProtectCudaDev(devID, devIDBackup);

    if (blockSize % sizeof(double) == 0) {
        int bSize = blockSize / sizeof(double);
        GDevs.GetCudaThread(devID, bSize * blockNum, cudaGrids, cudaBlocks);
        KernelCopyBlocksV2<double> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
                                    ((double*)source, bSize, blockNum, bSize * blockNum, (double*)target, targetBlocks);
        //GDevs.GetCudaThread2D(devID, bSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
        //KernelCopyBlocks<double> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
        //                          ((double*)source, bSize, blockNum, (double*)target, targetBlocks);
    }
    else if (blockSize % sizeof(float) == 0) {
        int bSize = blockSize / sizeof(float);
        GDevs.GetCudaThread(devID, bSize * blockNum, cudaGrids, cudaBlocks);
        KernelCopyBlocksV2<float> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
                                   ((float*)source, bSize, blockNum, bSize * blockNum, (float*)target, targetBlocks);
        //GDevs.GetCudaThread2D(devID, bSize, blockNum, MAX_INT, cudaGrids, cudaBlocks);
        //KernelCopyBlocks<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
        //                         ((float*)source, bSize, blockNum, (float*)target, targetBlocks);
    }
    else {
        ShowNTErrors("Unsupported block size!");
    }

    BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA

} // namespace nts(NiuTrans.Tensor)
\ No newline at end of file
@@ -28,15 +28,11 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA

/* copy a number of blocks to target positions (cuda version) */
void _CudaCopyBlocks(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID);

#endif // USE_CUDA

} // namespace nts(NiuTrans.Tensor)

#endif // __COPYBLOCKS_CUH__
\ No newline at end of file
@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

/* copy a number of blocks to target positions (on site) */
void _CopyBlocksOnSite(void * source, int blockSize, int blockNum, void * target, int * targetBlocks, int devID);

} // namespace nts(NiuTrans.Tensor)
...
@@ -75,6 +75,9 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i
    CheckNTErrors(devID >= 0, "Wrong device to run!");
    CheckNTErrors((blockSize % sizeof(DTYPE) == 0), "Unsupported block size!");

    int devIDBackup;
    ProtectCudaDev(devID, devIDBackup);

    /* copy the index to the GPU memory */
    int * sourceBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
    int * targetBlocksTMP = myMem != NULL ? (int*)myMem->AllocBuf(myMem->devID, blockNum * sizeof(int)) : (int *)XMemAlloc(devID, blockNum * sizeof(int));
@@ -97,6 +100,8 @@ void _CudaCopyBlocksSelected(void * source, int blockSize, int * sourceBlocks, i
        XMemFree(devID, sourceBlocksTMP);
        XMemFree(devID, targetBlocksTMP);
    }

    BacktoCudaDev(devID, devIDBackup);
}
#endif // USE_CUDA
...
@@ -37,8 +37,8 @@ copy indexed sub-tensors
>> indexSize - length of srcIndex (and tgtIndex)
>> tgtIndex - index of the target sub-tensors
>> copyNum - number of the sub-tensors we copy for each source index,
             e.g., for srcIndex = [1,4] and copyNum = 2,
             we actually copy the source sub-tensors 1, 2, 4, 5
*/
void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int indexSize, int * tgtIndex, int copyNum)
{
@@ -73,17 +73,23 @@ void _CopyIndexed(const XTensor * s, XTensor * t, int dim, int * srcIndex, int i
    int * realSrcIndex = new int[realIndexSize];
    int * realTgtIndex = new int[realIndexSize];
    for (int i = 0; i < indexOffsetNum; i++) {
        int base = i * indexSize * copyNum;
        int baseSrc = i * leadDimSizeSrc;
        int baseTgt = i * leadDimSizeTgt;
        for (int j = 0; j < indexSize; j++) {
            int offset = base + j * copyNum;
            int * rsi = realSrcIndex + offset;
            int * rti = realTgtIndex + offset;
            for (int k = 0; k < copyNum; k++) {
                rsi[k] = baseSrc + srcIndex[j] + k;
                rti[k] = baseTgt + tgtIndex[j] + k;
            }
        }
    }

    for (int i = 0; i < indexSize; i++) {
        CheckNTErrors((srcIndex[i] < blockNumSrc), "Index is out of scope!");
        CheckNTErrors((tgtIndex[i] < blockNumTgt), "Index is out of scope!");
    }

    _CopyBlocks(s->data, blockSizeSrc * s->unitSize, realSrcIndex, realIndexSize, t->data, realTgtIndex, s->mem, s->devID);
...
@@ -20,6 +20,7 @@
*/

#include "../../XName.h"
#include "../../XUtility.h"
#include "CopyValues.h"
#include "CopyValues.cuh"
@@ -35,14 +36,14 @@ copy s to t
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream)
{
    CheckNTErrors((s != NULL && t != NULL), "The input tensor and output tensor must be nonempty!");
    CheckNTErrors((s->data != NULL), "Cannot copy an empty data array!");
    CheckNTErrors((t->data != NULL), "Cannot copy to an empty data array!");
    CheckNTErrors((s->unitNum == t->unitNum), "Unmatched data item number!");

    if ((s->dataType == X_FLOAT16 && t->dataType == X_FLOAT) ||
        (s->dataType == X_FLOAT && t->dataType == X_FLOAT16)) {
        CheckNTErrors(((s->devID < 0 && t->devID < 0) || s->devID == t->devID),
                      "The code must be run on the same device!");
        CheckNTErrors((s->isSparse || t->isSparse), "TODO!");
        ConvertDataType(s->devID, s->data, s->dataType, t->data, t->dataType, s->unitNum);
    }
@@ -69,6 +70,34 @@ void _CopyValues(const XTensor * s, XTensor * t, XStream * stream)
}

/*
copy a segment of s to t
>> s - source
>> sBeg - beginning of the segment
>> sLen - length of the segment
>> t - target
>> tBeg - beginning of the segment on the target side
>> stream - the stream for creating the job pipeline
*/
void _CopyValues(const XTensor * s, const int sBeg, const int sLen, XTensor * t, const int tBeg, XStream * stream)
{
    CheckNTErrors(s != NULL && t != NULL, "The input tensor and output tensor must be nonempty!");
    CheckNTErrors(s->data != NULL && t->data != NULL, "Cannot copy an empty data array!");
    CheckNTErrors(s->unitSize == t->unitSize, "The input tensors must be of the same unit size!");
    CheckNTErrors(s->unitNum > sBeg && sBeg >= 0 && sLen <= s->unitNum, "Wrong segment on the source side");
    CheckNTErrors(t->unitNum > tBeg && tBeg >= 0, "Wrong segment on the target side");
if (!s->isSparse && !t->isSparse) {
XMemCopy((char*)t->data + tBeg * t->unitSize, t->devID,
(char*)s->data + sBeg * s->unitSize, s->devID,
s->unitSize * sLen);
}
else {
ShowNTErrors("TODO!");
}
}
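/*
Usage sketch (an editor's illustration, not code from this commit): copy
64 units of tensor a, starting at unit 128, into tensor b at unit offset 0
with the new segment overload. Both tensors are assumed dense, of the same
unit size, and with enough room on the target side.
*/
void CopySegmentExample(XTensor * a, XTensor * b)
{
    _CopyValues(a, 128, 64, b, 0);
}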
/*
copy s to t (return a XTensor structure) copy s to t (return a XTensor structure)
make a new tensor to keep the result and return it make a new tensor to keep the result and return it
......
@@ -29,6 +29,9 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* copy s to t */
void _CopyValues(const XTensor * s, XTensor * t, XStream * stream = NULL);
/* copy a segment of s to t */
void _CopyValues(const XTensor * s, const int sBeg, const int sLen, XTensor * t, const int tBeg, XStream * stream = NULL);
/*
copy s to t (return a XTensor structure)
make a new tensor to keep the result and return it
...
@@ -29,6 +29,71 @@ namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
warp-level max reduction of float data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
float shflDownReduceMax(float input)
{
float output;
asm volatile(
"{"
".reg .f32 r0;"
".reg .pred p;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"setp.lt.f32 p,%1,r0;"
"@p mov.f32 %1,r0;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"setp.lt.f32 p, %1, r0; "
"@p mov.f32 %1,r0;"
"mov.f32 %0,%1;"
"}"
: "=f"(output) : "f"(input));
return output;
}
/*
warp-level max reduction of int data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
int shflDownReduceMax(int input)
{
int output;
asm volatile(
"{"
".reg .s32 r0;"
".reg .pred p;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"setp.lt.s32 p,%1,r0;"
"@p mov.s32 %1,r0;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"setp.lt.s32 p, %1, r0; "
"@p mov.s32 %1,r0;"
"mov.s32 %0,%1;"
"}"
: "=r"(output) : "r"(input));
return output;
}
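/*
Equivalent sketch (an editor's illustration, not code from this commit):
the same warp-level max reduction written with the __shfl_down_sync
intrinsic (CUDA 9+), which compiles to the shfl.down instructions
hand-written above; a full 32-thread warp is assumed to participate.
*/
__device__ __forceinline__
float shflDownReduceMaxIntrinsic(float value)
{
    /* halve the active distance each step: 16, 8, 4, 2, 1 */
    for (int offset = 16; offset > 0; offset >>= 1)
        value = fmaxf(value, __shfl_down_sync(0xffffffff, value, offset));
    return value;
}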
/*
reduce a tensor to another that keeps the max value along a dimension - slow version
Given a block of data, we go over each dimension i in the stride and we have
@@ -191,25 +256,19 @@ void KernelReduceMaxFast(DTYPE * input, DTYPE * output,
    DTYPE value = j < strideNum ? inputData[j * stride + iOffset] : FLOAT_MIN;
    DTYPE value2 = j + blockDim.y < strideNum ? inputData[(j + blockDim.y) * stride + iOffset] : FLOAT_MIN;

    value = MAX(value, value2);
    value = shflDownReduceMax(value);
    if ((tid & 0x1f) == 0) { data[tid / 32] = value; }

    __syncthreads();

    if (tid < 32) {
        if (tid < blockDim.y / 32)
            value = data[tid];
        else
            value = FLOAT_MIN;
        value = shflDownReduceMax(value);
        if (tid == 0 && blockIdx.y < reducedStrideNum)
            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = value;
    }
}
/*
@@ -326,6 +385,105 @@ void KernelReduceMaxSimpleFast(DTYPE * input, DTYPE * output,
    op[offset] = max;
}
/*
allocate the number of warps according to the GPU's SM count
*/
inline void continuousStorageThreadAllocation(dim3& grid, dim3& block, long long vectorNum, int vectorSize)
{
int warpNum = 4;
if (vectorNum < 20 * 8){
warpNum = 8;
if (vectorNum < 20 * 4){
warpNum = 16;
if (vectorNum < 20 * 2)
warpNum = 32;
}
}
int minWarpNum = vectorSize / 32;
if (vectorSize % 32 != 0) minWarpNum++;
warpNum = min(warpNum, minWarpNum);
grid.x = vectorNum;
grid.y = 1;
grid.z = 1;
block.x = 1;
block.y = warpNum * 32;
block.z = 1;
}
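/*
Worked example (an editor's illustration): for vectorNum = 100 vectors of
vectorSize = 1000 elements, vectorNum < 20 * 8 sets warpNum = 8 and the
inner tests do not fire; minWarpNum = ceil(1000 / 32) = 32 does not lower
it. The launch is therefore grid = (100, 1, 1) and block = (1, 256, 1):
one thread block per vector, eight warps each.
*/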
/*
fold threads.x into blocks.x so that threads.y can be padded up to a full
warp and the warp-level primitives above can be used
*/
inline void adjustThreadForUseWarpOptimization(dim3& blocks, dim3& threads)
{
if (threads.x > 1) {
blocks.x *= threads.x;
threads.x = 1;
}
if (threads.y < 32)
threads.y = 32;
}
/*
in some cases we use fewer blocks to improve efficiency
*/
__global__
void KernelReduceMaxOpLessBlocks(DTYPE * input, DTYPE * output, int strideNum, int blockNum)
{
int idx = threadIdx.x % 32;
int idy = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
int startIndex = idy * strideNum;
DTYPE threadMax = FLOAT_MIN;
for (int i = idx; i < strideNum; i += 32) {
threadMax = max(input[startIndex + i], threadMax);
}
threadMax = shflDownReduceMax(threadMax);
if (idx == 0)
output[idy] = threadMax;
}
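/*
Mapping note (an editor's illustration): threads are grouped into 32-lane
warps; warp idy owns row idy of the input (strideNum elements), each lane
strides through the row 32 elements at a time, and lane 0 writes the
warp-reduced max to output[idy]. The <<<blockNum / 4, 128>>> launch used
below packs four warps per block, so it assumes blockNum is a multiple of 4.
*/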
/*
max reduction using the PTX warp shuffle above
*/
__global__
void KernelReduceMaxOp(DTYPE * input, DTYPE * output,int stride, int strideNum,
int reducedStrideNum,int blockSize, int blockNum)
{
__shared__ DTYPE iData[MAX_CUDA_THREAD_NUM_PER_BLOCK / 32];
unsigned int tid = threadIdx.y;
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= stride * blockNum)
return;
/* first level reduction */
int k = i / stride;
int iOffset = i % stride;
DTYPE threadMax = FLOAT_MIN;
DTYPE * data = iData + threadIdx.x * blockDim.y;
DTYPE * inputData = input + k * blockSize;
for (int it = j; it < strideNum; it += blockDim.y){
threadMax = max(inputData[it * stride + iOffset], threadMax);
}
__syncthreads();
threadMax = shflDownReduceMax(threadMax);
if ((tid & 0x1f) == 0) { data[tid / 32] = threadMax; }
__syncthreads();
/* use one warp to reduce remaining data */
if (tid < 32){
if (tid < blockDim.y / 32)
threadMax = data[tid];
        else threadMax = FLOAT_MIN; /* pad with the identity of max, not 0 */
threadMax = shflDownReduceMax(threadMax);
if (tid == 0 && blockIdx.y < reducedStrideNum)
output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = threadMax;
}
}
/*
get the max-valued items along a dimension of the tensor (cuda version).
For a 1-dimensional data array a,
@@ -382,130 +540,147 @@ void _CudaReduceMax(const XTensor * input, XTensor * output, int dim)
    int devIDBackup;
    ProtectCudaDev(input->devID, devIDBackup);
    if (stride == 1 && blockNum >= 10) {
        dim3 grids;
        dim3 blocks;
        continuousStorageThreadAllocation(grids, blocks, (long long)blockNum, strideNum);
        if (blocks.y > 128) {
            KernelReduceMaxOp <<<grids, blocks>>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, grids.y, blockSize, blockNum);
        }
        else {
            KernelReduceMaxOpLessBlocks <<<blockNum / 4, 128>>> ((DTYPE *)input->data, (DTYPE*)output->data, strideNum, blockNum);
        }
    }
    else {
        do {
            if (input->dataType == DEFAULT_DTYPE) {
                DTYPE * iData = NULL;
                DTYPE * oData = NULL;
                if (iter == 0) {
                    iData = (DTYPE*)input->data;
                    oData = buf1;
                }
                else if (iter % 2 == 1) {
                    iData = buf1;
                    oData = buf2;
                }
                else {
                    iData = buf2;
                    oData = buf1;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum < 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    KernelReduceMax <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceMaxFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
            }
            else if (input->dataType == X_FLOAT16) {
                __half * buf1ft16 = (__half *)buf1;
                __half * buf2ft16 = (__half *)buf2;
                __half * iData = NULL;
                __half * oData = NULL;
                if (iter == 0) {
                    iData = (__half*)input->data;
                    oData = buf1ft16;
                }
                else if (iter % 2 == 1) {
                    iData = buf1ft16;
                    oData = buf2ft16;
                }
                else {
                    iData = buf2ft16;
                    oData = buf1ft16;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum < 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    KernelReduceMax <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceMaxFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum);
                }
            }

            strideNum = cudaGridSize[0];
            blockSize = cudaGridSize[0];

            iter++;
        } while (strideNum > 1);
    }

    BacktoCudaDev(input->devID, devIDBackup);
...
@@ -27,6 +27,57 @@ namespace nts{ // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
warp-level sum reduction of float data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
float shflDownReduceSum(float input)
{
float output;
asm volatile(
"{"
".reg .f32 r0;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"add.f32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"add.f32 %0, r0, %1;"
"}"
: "=f"(output) : "f"(input));
return output;
}
/*
warp-level sum reduction of int data, implemented with inline PTX (shfl.down)
*/
__device__ __forceinline__
int shflDownReduceSum(int input)
{
int output;
asm volatile(
"{"
".reg .s32 r0;"
"shfl.down.b32 r0, %1, 0x10, 0x1f;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x8, 0xf;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x4, 0x7;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x2, 0x3;"
"add.s32 %1, r0, %1;"
"shfl.down.b32 r0, %1, 0x1, 0x1;"
"add.s32 %0, r0, %1;"
"}"
: "=r"(output) : "r"(input));
return output;
}
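/*
Equivalent sketch (an editor's illustration, not code from this commit):
the same warp-level sum written with the __shfl_down_sync intrinsic
(CUDA 9+); the PTX above simply unrolls these five shuffle steps by hand,
and a full 32-thread warp is assumed to participate.
*/
__device__ __forceinline__
float shflDownReduceSumIntrinsic(float value)
{
    /* halve the active distance each step: 16, 8, 4, 2, 1 */
    for (int offset = 16; offset > 0; offset >>= 1)
        value += __shfl_down_sync(0xffffffff, value, offset);
    return value;
}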
/*
reduce a tensor to another that keeps the sum along a dimension - slow version
Given a block of data, we go over each dimension i in the stride and we have
@@ -96,7 +147,6 @@ void KernelReduceSum(DTYPE * input, DTYPE * output,
        __syncthreads();
    }
    /* write result for this block to the output array */
    if (threadIdx.y == 0 && blockIdx.y < reducedStrideNum)
        output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = iData[threadIdx.x * blockDim.y];
@@ -276,25 +326,19 @@ void KernelReduceSumFast(DTYPE * input, DTYPE * output,
        value2 = exp(value2);
    }

    value = value + value2;

    __syncthreads();

    value = shflDownReduceSum(value);
    if ((tid & 0x1f) == 0) { data[tid / 32] = value; }
    __syncthreads();

    if (tid < 32) {
        if (tid < blockDim.y / 32)
            value = data[tid];
        else
            value = 0;
        value = shflDownReduceSum(value);
        if (tid == 0 && blockIdx.y < reducedStrideNum)
            output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = value;
    }
}
/*
@@ -430,6 +474,174 @@ void KernelReduceSumFast(__half * input, __half * output,
#endif
}
/*
if the data storage is discontinuous, use this way to reduce
*/
__global__
void KernelReduceSumDiscontinuousStorage(DTYPE * input, DTYPE * output, int stride,
                                         int strideNum, int blockNum, DTYPE * shift, DTYPE power, bool isExp)
{
    //int idx = blockIdx.x * blockDim.x + threadIdx.x;
    //int endIndex = (idx+1) * strideNum;
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    /* the grid is rounded up, so threads beyond the last output element must exit */
    if (idx >= stride * blockNum)
        return;

    int blockIndex = idx / stride;
    int offsetInBlock = idx % stride;

    /* note that shift, power and isExp are not applied in this path */
    DTYPE ans = 0;
#pragma unroll
    for (int i = stride * strideNum * blockIndex + offsetInBlock;
         i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum;
         i += stride) {
        ans += input[i];
    }
    output[idx] = ans;
}
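/*
Indexing example (an editor's illustration): with stride = 4 and
strideNum = 3, thread idx = 6 gets blockIndex = 1 and offsetInBlock = 2,
so it sums input elements 14, 18 and 22 (one column of the second block)
into output[6]. Neighbouring threads read neighbouring addresses on every
iteration, so the global loads are coalesced.
*/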
__global__
void KernelReduceSumOp(DTYPE * input, DTYPE * output,
int stride, int strideNum, int reducedStrideNum,
int blockSize, int blockNum,
DTYPE * shift, DTYPE power, bool isExp)
{
__shared__ DTYPE iData[MAX_CUDA_THREAD_NUM_PER_BLOCK / 32];
__shared__ DTYPE bias[MAX_CUDA_THREAD_NUM_PER_BLOCK];
unsigned int tid = threadIdx.y;
unsigned int j = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= stride * blockNum)
return;
if (threadIdx.y == 0)
bias[threadIdx.x] = shift != NULL ? shift[i] : 0;
__syncthreads();
/* first level reduction */
int k = i / stride;
int iOffset = i % stride;
DTYPE threadSum = 0;
DTYPE * data = iData + threadIdx.x * blockDim.y;
DTYPE * inputData = input + k * blockSize;
for (int it = j; it < strideNum; it += blockDim.y){
DTYPE value = inputData[it * stride + iOffset] - bias[threadIdx.x];
if (power != (DTYPE)1.0) {
if (power == (DTYPE)2.0) {
value = value * value;
}
else if (power == (DTYPE)0.5) {
value = sqrt(value);
}
else {
value = pow(value, power);
}
}
if (isExp) value = exp(value);
threadSum += value;
}
__syncthreads();
threadSum = shflDownReduceSum(threadSum);
if ((tid & 0x1f) == 0) { data[tid / 32] = threadSum; }
__syncthreads();
if (tid < 32){
if (tid < blockDim.y / 32)
threadSum = data[tid];
else threadSum = 0;
threadSum = shflDownReduceSum(threadSum);
if (tid == 0 && blockIdx.y < reducedStrideNum)
output[(k * reducedStrideNum + blockIdx.y) * stride + iOffset] = threadSum;
}
}
__global__
void KernelReduceSumOpLessBlocks(DTYPE * input, DTYPE * output,
int strideNum, int blockNum,
DTYPE * shift, DTYPE power, bool isExp)
{
__shared__ DTYPE bias[MAX_CUDA_THREAD_NUM_PER_BLOCK];
int idx = threadIdx.x % 32;
int idy = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
if (idx == 0)
bias[threadIdx.x / 32] = shift != NULL ? shift[idy] : 0;
int startIndex = idy * strideNum;
DTYPE threadSum = 0;
for (int i = idx; i < strideNum; i += 32) {
DTYPE value = input[startIndex + i] - bias[threadIdx.x / 32];
if (power != (DTYPE)1.0) {
if (power == (DTYPE)2.0) {
value = value * value;
}
else if (power == (DTYPE)0.5) {
value = sqrt(value);
}
else {
value = pow(value, power);
}
}
if (isExp) value = exp(value);
threadSum += value;
}
threadSum = shflDownReduceSum(threadSum);
if (idx == 0)
output[idy] = threadSum;
}
/*
allocate the number of warps according to the GPU's SM count
*/
inline void continuousStorageThreadAllocation(dim3& grid, dim3& block, long long vectorNum, int vectorSize)
{
int warpNum = 4;
if (vectorNum < 20 * 8) {
warpNum = 8;
if (vectorNum < 20 * 4) {
warpNum = 16;
if (vectorNum < 20 * 2)
warpNum = 32;
}
}
int minWarpNum = vectorSize / 32;
if (vectorSize % 32 != 0) minWarpNum++;
warpNum = min(warpNum, minWarpNum);
grid.x = vectorNum;
grid.y = 1;
grid.z = 1;
block.x = 1;
block.y = warpNum * 32;
block.z = 1;
}
/*
in this situation block.x * grid.x gives one thread per output element, so
global memory reads are coalesced
*/
inline void discontinuousStorageNoShareMemThreadAllocation(dim3& grid, dim3& block, int stride, int blockNum)
{
block.x = 512;
block.y = 1;
if ((stride * blockNum) % 512 == 0)
grid.x = (stride * blockNum) / 512;
else
grid.x = (stride * blockNum) / 512 + 1;
grid.y = 1;
}
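/*
Equivalently (an editor's note): grid.x is the ceiling division
(stride * blockNum + 511) / 512, i.e. just enough 512-thread blocks to
give one thread per output element.
*/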
/*
fold threads.x into blocks.x so that threads.y can be padded up to a full
warp and the warp-level primitives above can be used
*/
inline void adjustThreadForUseWarpOptimization(dim3& blocks, dim3& threads)
{
if (threads.x > 1){
blocks.x *= threads.x;
threads.x = 1;
}
    if (threads.y < 32)
threads.y = 32;
}
/*
sum the items along a dimension of the tensor (cuda version).
For a 1-dimensional data array a,
@@ -495,137 +707,158 @@ void _CudaReduceSum(const XTensor * input, XTensor * output, int dim, const XTen
    int devIDBackup;
    ProtectCudaDev(input->devID, devIDBackup);
    if (stride == 1 && blockNum >= 10) {
        dim3 grids;
        dim3 blocks;
        continuousStorageThreadAllocation(grids, blocks, (long long)blockNum, strideNum);
        if (blocks.y > 128)
            KernelReduceSumOp <<<grids, blocks>>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, grids.y, blockSize, blockNum, sp, power, isExp);
        else
            KernelReduceSumOpLessBlocks <<<blockNum / 4, 128>>> ((DTYPE *)input->data, (DTYPE*)output->data, strideNum, blockNum, sp, power, isExp);
    }
    else if (stride != 1 && stride * blockNum > 4096) {
        //GDevs->GetGridAndBlockSize2D(devID, stride * blockNum, strideNum, MAX_INT, cudaGridSize, cudaBlockSize);
        //unsigned int* goutput = (unsigned int *)input->data;
        //convert2uintV2 <<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>> ((float*)input->data, goutput, stride, strideNum, blockNum, strideNum * blockNum * stride);
        dim3 grid, block;
        discontinuousStorageNoShareMemThreadAllocation(grid, block, stride, blockNum);
        KernelReduceSumDiscontinuousStorage <<<grid, block>>> ((DTYPE *)input->data, (DTYPE*)output->data, stride, strideNum, blockNum, sp, power, isExp);
    }
    else {
        do {
            if (input->dataType == DEFAULT_DTYPE) {
                DTYPE * iData = NULL;
                DTYPE * oData = NULL;
                if (iter == 0) {
                    iData = (DTYPE*)input->data;
                    oData = buf1;
                }
                else if (iter % 2 == 1) {
                    iData = buf1;
                    oData = buf2;
                }
                else {
                    iData = buf2;
                    oData = buf1;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum <= 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    KernelReduceSum <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (DTYPE*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    adjustThreadForUseWarpOptimization(blocks, threads);
                    KernelReduceSumFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, sp, power, isExp);
                }
            }
            else if (input->dataType == X_FLOAT16) {
                __half * buf1ft16 = (__half *)buf1;
                __half * buf2ft16 = (__half *)buf2;
                __half * spft16 = (__half *)sp;
                unsigned short power2 = FloatToFloat16(power);
                __half * powerft16p = (__half*)&power2;
                __half * iData = NULL;
                __half * oData = NULL;
                if (iter == 0) {
                    iData = (__half*)input->data;
                    oData = buf1ft16;
                }
                else if (iter % 2 == 1) {
                    iData = buf1ft16;
                    oData = buf2ft16;
                }
                else {
                    iData = buf2ft16;
                    oData = buf1ft16;
                }

                /* unroll the reduction procedure. The code is messy but it is faster. */
                if (strideNum < 32) {
                    GDevs.GetCudaThread2D(devID, strideNum, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    KernelReduceSum <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else if (strideNum < 128) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 64), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 64), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<64> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else if (strideNum < 256) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 128), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 128), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<128> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else if (strideNum < 512) {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 256), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 256), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<256> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
                else {
                    GDevs.GetCudaThread2D(devID, MAX(strideNum / 2 + 1, 512), stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
                    dim3 blocks(cudaGridSize[1], cudaGridSize[0]), threads(cudaBlockSize[1], cudaBlockSize[0]);
                    if (cudaGridSize[0] == 1)
                        oData = (__half*)output->data;
                    CheckNTErrors((cudaBlockSize[0] >= 512), "Incorrect thread number when calling the cuda kernel!");
                    KernelReduceSumFast<512> <<<blocks, threads>>> (iData, oData, stride, strideNum, blocks.y, blockSize, blockNum, spft16, *powerft16p, isExp);
                }
            }

            strideNum = cudaGridSize[0];
            blockSize = cudaGridSize[0];
            sp = NULL;
            power = (DTYPE)1.0;
            isExp = false;

            iter++;
        } while (strideNum > 1);
    }

    BacktoCudaDev(input->devID, devIDBackup);

    if (mem != NULL)
......
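/* Note (editor's sketch, illustration only -- not part of this commit): the
   do-while loop above implements a multi-round reduction. Each kernel launch
   folds "strideNum" items per vector into "cudaGridSize[0]" partial results,
   which become the strideNum of the next round (ping-ponging between buf1 and
   buf2) until one value per vector remains. A CPU analogue of that round
   structure, with 256 standing in for the per-round launch width: */
#include <cstdio>
#include <vector>

int main()
{
    std::vector<float> data(100000, 1.0f);
    const int roundSize = 256;  /* items folded together per round */
    while (data.size() > 1) {
        std::vector<float> next((data.size() + roundSize - 1) / roundSize, 0.0f);
        for (size_t i = 0; i < data.size(); ++i)
            next[i / roundSize] += data[i];   /* one "block" emits one partial sum */
        data.swap(next);                      /* partial sums feed the next round */
    }
    printf("sum = %f\n", data[0]);            /* prints 100000.000000 */
    return 0;
}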
@@ -33,14 +33,14 @@ set target data block index for the data movement in merge
>> splitSizeInGrid - size of each data array to merge
>> gridSize - number of blocks in a grid (here a grid is a higher-level organization upon blocks)
>> gridNum - number of grids
->> mem - the memory pool
+>> devID - device id
*/
void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge,
-                          int splitSizeInGrid, int gridSize, int gridNum, XMem * mem)
+                          int splitSizeInGrid, int gridSize, int gridNum, int devID)
{
-    if (mem != NULL && mem->devID >= 0) {
+    if (devID >= 0) {
#ifdef USE_CUDA
-        _CudaMakeMergeBlockIndex(mem->devID, blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum);
+        _CudaMakeMergeBlockIndex(devID, blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum);
#else
        ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
......
@@ -28,7 +28,7 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* set target data block index for the data movement in merge */
void _MakeMergeBlockIndex(int * blockIndex, int blockNum, int blockNumInMerge,
-                          int splitSizeInGrid, int gridSize, int gridNum, XMem * mem);
+                          int splitSizeInGrid, int gridSize, int gridNum, int devID);

} // namespace nts(NiuTrans.Tensor)
......
@@ -31,13 +31,13 @@ set target data block index for the data movement in split
>> splitNum - number of splits
>> blockSplitSize - size of the split block
>> blockNum - number of data blocks
->> mem - the memory pool
+>> devID - device id
*/
-void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, XMem * mem)
+void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, int devID)
{
-    if (mem != NULL && mem->devID >= 0) {
+    if (devID >= 0) {
#ifdef USE_CUDA
-        _CudaMakeSplitBlockIndex(mem->devID, blockIndex, splitNum, blockSplitSize, blockNum);
+        _CudaMakeSplitBlockIndex(devID, blockIndex, splitNum, blockSplitSize, blockNum);
#else
        ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif
......
@@ -27,7 +27,7 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

/* set target data block index for the data movement in split */
-void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, XMem * mem);
+void _MakeSplitBlockIndex(int * blockIndex, int splitNum, int blockSplitSize, int blockNum, int devID);

} // namespace nts(NiuTrans.Tensor)
......
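/* Note (editor's): the four changes above share one pattern -- the XMem* parameter
   becomes a plain device id, so callers without a memory pool can still route the
   work to a GPU. Hypothetical before/after call sites built from this diff:

       _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, mem);      // old: device read from mem->devID
       _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, s->devID); // new: device passed explicitly

   With the old signature, mem == NULL forced the CPU path even for GPU tensors. */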
@@ -42,10 +42,13 @@ e.g., (N/3, M, 3) -> (N, M)
*/
void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
{
+    if(leadingDim < 0)
+        leadingDim = 0;
+
    int whereToMergeRDI = s->order - whereToMerge - 1;
    int leadingDimRDI = s->order - leadingDim - 1;
    if (leadingDimRDI < 0)
        leadingDimRDI = s->order - 1;

    CheckNTErrors((s != NULL && t != NULL), "Invalid tensors!");
    CheckNTErrors((s->devID == t->devID || (s->devID < 0 && t->devID < 0)),
@@ -60,8 +63,12 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
            CheckNTErrors((t->dimSizeRDI[i] == s->dimSizeRDI[i] * s->dimSizeRDI[leadingDimRDI]),
                          "Unmatched tensor sizes!");
        }
+        else if (i < leadingDimRDI){
+            CheckNTErrors((s->dimSizeRDI[i] == t->dimSizeRDI[i]),
+                          "Unmatched tensor sizes!");
+        }
        else if (i > leadingDimRDI) {
-            CheckNTErrors((s->dimSizeRDI[i - 1] == t->dimSizeRDI[i]),
+            CheckNTErrors((s->dimSizeRDI[i] == t->dimSizeRDI[i - 1]),
                          "Unmatched tensor sizes!");
        }
    }
@@ -119,28 +126,24 @@ void _Merge(const XTensor * s, XTensor * t, int whereToMerge, int leadingDim)
        int realBlockSize = blockSize * t->unitSize;

        int * blockIndex = (int*)(mem != NULL ?
                                  mem->AllocBuf(mem->devID, blockNum * gridNum * sizeof(int)) :
-                                 XMemAlloc(mem->devID, blockNum * gridNum * sizeof(int)));
+                                 XMemAlloc(s->devID, blockNum * gridNum * sizeof(int)));

-        _MakeMergeBlockIndex(blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum, mem);
+        _MakeMergeBlockIndex(blockIndex, blockNum, blockNumInMerge, splitSizeInGrid, gridSize, gridNum, s->devID);

-        _CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, mem);
+        _CopyBlocksOnSite(s->data, realBlockSize, blockNum * gridNum, dataTMP, blockIndex, s->devID);

        if (mem != NULL)
            mem->ReleaseBuf(mem->devID, blockNum * gridNum * sizeof(int));
        else
-            XMemFree(mem->devID, blockIndex);
+            XMemFree(s->devID, blockIndex);

-        /* copy from tmp to target */
-        XMemCopy(t->data, t->devID, dataTMP, s->devID, size);
-
        if (!isOnSameDevice) {
            XMemCopy(t->data, t->devID, dataTMP, s->devID, size);

            if (mem != NULL)
                mem->ReleaseBuf(mem->devID, size);
            else
-                XMemFree(mem->devID, dataTMP);
+                XMemFree(s->devID, dataTMP);
        }
    }
}
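/* Note (editor's worked example, illustration only): for the header comment's case
   (N/3, M, 3) -> (N, M) with whereToMerge = 0 and the last axis as the leading one,
   _Merge stacks the three (N/3) x M slices along dimension 0. A CPU reference under
   the assumption of row-major storage with the last dimension contiguous: */
#include <vector>

std::vector<float> mergeRef(const std::vector<float> & s, int n3, int m, int c)
{
    std::vector<float> t((size_t)c * n3 * m);       /* target shape: (c * n3, m) */
    for (int k = 0; k < c; k++)                     /* slice k of the leading axis ... */
        for (int i = 0; i < n3; i++)
            for (int j = 0; j < m; j++)             /* ... becomes row block k of t */
                t[((size_t)k * n3 + i) * m + j] = s[((size_t)i * m + j) * c + k];
    return t;
}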
@@ -163,7 +166,7 @@ XTensor Merge(const XTensor &s, int whereToMerge, int leadingDim)
    CheckNTErrors(leadingDim < whereToMerge, "Invalid leading dimension!");

    if (leadingDim < 0)
        leadingDim = 0;

    int order = s.order - 1;
    int * dimSize = new int[order];
@@ -205,7 +208,7 @@ merge small tensors into a big tensor
*/
void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
{
    CheckNTErrors((smalls != NULL), "Invalid list!");
    CheckNTErrors((smalls->count > 0), "Empty list!");

    bool uniform = true;
@@ -233,7 +236,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
    int mergedNum = smalls->count;

    XTensor * s0 = (XTensor*)smalls->GetItem(0);
    int whereToMergeRDI = s0->order - whereToMerge - 1;
    for (int i = 0; i < s0->order; i++) {
        if (i <= whereToMergeRDI)
            blockSize *= s0->dimSizeRDI[i];
@@ -268,10 +271,10 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
    }
    /* merging with fewer kernel/API calls??? (I'm not sure about it!! may remove this later) */
    else {
-        int* dimSizeTMP = new int[MAX_TENSOR_DIM_NUM];
-        for (int i = 0; i < MAX_TENSOR_DIM_NUM; i++)
-            dimSizeTMP[i] = -smallsItem0->dimSizeRDI[i];
-        dimSizeTMP[smallsItem0->order] = -mergeNum;
+        int* dimSizeTMP = new int[smallsItem0->order + 1];
+        for (int i = 0; i < smallsItem0->order; i++)
+            dimSizeTMP[i + 1] = -smallsItem0->dimSize[i];
+        dimSizeTMP[0] = -mergeNum;

        XMem * mem = smallsItem0->mem;
        XTensor * tensorTMP = new XTensor(smallsItem0->order + 1, dimSizeTMP,
@@ -283,7 +286,7 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
        if (uniform)
            dataTMP = smallsItem0->data;
        else
-            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(mem->devID, size);
+            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(big->devID, size);

        tensorTMP->data = dataTMP;
@@ -295,18 +298,17 @@ void _Merge(const XList * smalls, XTensor * big, int whereToMerge)
        }
    }

-    _Merge(tensorTMP, big, whereToMerge);
+    _Merge(tensorTMP, big, whereToMerge + 1);

    delete[] dimSizeTMP;

    tensorTMP->data = NULL;
-    dataTMP = NULL;
    delete tensorTMP;

    if ((!uniform) && (mem != NULL))
        mem->ReleaseBuf(mem->devID, size);
    else
-        XMemFree(mem->devID, dataTMP);
+        XMemFree(big->devID, dataTMP);
    }
}
......
@@ -109,6 +109,9 @@ void _CudaMergeBlockLists(const XList * sourceList, int * blockSizes, int blockNum, ...)
    CheckNTErrors((maxBlockSize % sizeof(DTYPE) == 0), "Unsupported block size!");

    realMaxBlockSize = maxBlockSize / sizeof(DTYPE);

+    int devIDBackup;
+    ProtectCudaDev(myMem->devID, devIDBackup);
+
    int cudaGridSizes[3];
    int cudaBlockSizes[3];
@@ -135,6 +138,8 @@ void _CudaMergeBlockLists(const XList * sourceList, int * blockSizes, int blockNum, ...)
    delete[] targetArrays;
    delete[] sizes;
    delete[] offsets;
+
+    BacktoCudaDev(myMem->devID, devIDBackup);
}
#endif // USE_CUDA
......
@@ -24,6 +24,7 @@
#include "MakeSplitBlockIndex.h"
#include "../../XName.h"
#include "../../XTensor.h"
+#include "../../XDevice.h"
#include "../../XUtility.h"
#include "../movement/CopyBlocksOnSite.h"
@@ -88,10 +89,33 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
        int n = blockNum / splitNum;
        int sStep = blockSize * s->unitSize;
        int tStep = n * tPitch;
-        for (int k = 0; k < splitNum; k++) {
-            XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
-                       (char*)s->data + k * sStep, sPitch, s->devID,
-                       mSize, n);
-        }
+        if(t->devID < 0){
+            for (int k = 0; k < splitNum; k++) {
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)s->data + k * sStep, sPitch, s->devID,
+                           mSize, n);
+            }
+        }
+        else{
+#ifdef USE_CUDA
+#ifdef STREAMED_MEMCPOPY
+            XStream * stream = GDevs.GPUs[t->devID].stream;
+            for (int k = 0; k < splitNum; k++) {
+                XMemCopy2DAsync((char*)t->data + k * tStep, tPitch, t->devID,
+                                (char*)s->data + k * sStep, sPitch, s->devID,
+                                mSize, n, stream);
+            }
+            stream->StreamSynchronize();
+#else
+            for (int k = 0; k < splitNum; k++) {
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)s->data + k * sStep, sPitch, s->devID,
+                           mSize, n);
+            }
+#endif
+#else
+            ShowNTErrors("Please specify USE_CUDA and recompile the code!");
+#endif
+        }
    }
}
else {
@@ -108,17 +132,17 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
        int blockSplitSize = blockNum / splitNum;

        int * blockIndex = (int*)(mem != NULL ?
                                  mem->AllocBuf(mem->devID, blockNum * sizeof(int)) :
-                                 XMemAlloc(mem->devID, blockNum * sizeof(int)));
+                                 XMemAlloc(s->devID, blockNum * sizeof(int)));

-        _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, mem);
+        _MakeSplitBlockIndex(blockIndex, splitNum, blockSplitSize, blockNum, s->devID);

-        _CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, mem);
+        _CopyBlocksOnSite(s->data, realBlockSize, blockNum, dataTMP, blockIndex, s->devID);

        if (mem != NULL)
            mem->ReleaseBuf(mem->devID, blockNum * sizeof(int));
        else
-            XMemFree(mem->devID, blockIndex);
+            XMemFree(s->devID, blockIndex);

        /* copy from tmp to target */
        if (!isOnSameDevice) {
@@ -127,7 +151,7 @@ void _Split(const XTensor * s, XTensor * t, int whereToSplit, int splitNum)
            if (mem != NULL)
                mem->ReleaseBuf(mem->devID, size);
            else
-                XMemFree(mem->devID, dataTMP);
+                XMemFree(s->devID, dataTMP);
        }
    }
}
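/* Note (editor's sketch, illustration only): the STREAMED_MEMCPOPY branch above
   enqueues all per-split 2D copies on one stream and waits once, instead of
   issuing splitNum blocking copies. A standalone CUDA analogue using the raw
   runtime API rather than the XMemCopy2D/XMemCopy2DAsync wrappers: */
#include <cuda_runtime.h>

void splitCopyAsync(char * dst, size_t dPitch, const char * src, size_t sPitch,
                    size_t width, size_t height, int splitNum,
                    size_t sStep, size_t tStep, cudaStream_t stream)
{
    for (int k = 0; k < splitNum; k++)        /* enqueue without blocking */
        cudaMemcpy2DAsync(dst + k * tStep, dPitch,
                          src + k * sStep, sPitch,
                          width, height, cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);            /* one wait for the whole batch */
}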
@@ -144,6 +168,8 @@ make a new tensor to keep the result and return it
XTensor Split(const XTensor &s, int whereToSplit, int splitNum)
{
    CheckNTErrors(&s, "Invalid tensors!");
+    CheckNTErrors(s.dimSize[whereToSplit] % splitNum == 0,
+                  "The dimension cannot be split due to the improper split number");

    int order = s.order + 1;
    int * dimSize = new int[order];
@@ -226,20 +252,46 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
        int n = blockNum / splitNum;
        int sStep = blockSize * big->unitSize;
        int tStep = 0;
-        for (int k = 0; k < splitNum; k++) {
-            XTensor * t = (XTensor*)smalls->GetItem(k);
-            XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
-                       (char*)big->data + k * sStep, sPitch, big->devID,
-                       mSize, n);
-        }
+        if(big->devID < 0){
+            for (int k = 0; k < splitNum; k++) {
+                XTensor * t = (XTensor*)smalls->GetItem(k);
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)big->data + k * sStep, sPitch, big->devID,
+                           mSize, n);
+            }
+        }
+        else{
+#ifdef USE_CUDA
+#ifdef STREAMED_MEMCPOPY
+            XStream * stream = GDevs.GPUs[big->devID].stream;
+            for (int k = 0; k < splitNum; k++) {
+                XTensor * t = (XTensor*)smalls->GetItem(k);
+                XMemCopy2DAsync((char*)t->data + k * tStep, tPitch, t->devID,
+                                (char*)big->data + k * sStep, sPitch, big->devID,
+                                mSize, n, stream);
+            }
+            stream->StreamSynchronize();
+#else
+            for (int k = 0; k < splitNum; k++) {
+                XTensor * t = (XTensor*)smalls->GetItem(k);
+                XMemCopy2D((char*)t->data + k * tStep, tPitch, t->devID,
+                           (char*)big->data + k * sStep, sPitch, big->devID,
+                           mSize, n);
+            }
+#endif
+#else
+            ShowNTErrors("Please specify USE_CUDA and recompile the code!");
+#endif
+        }
    }
}
    /* splitting with fewer kernel/API calls??? (I'm not sure about it!! may remove this later) */
    else {
-        int* dimSizeTMP = new int[MAX_TENSOR_DIM_NUM];
-        for (int i = 0; i < MAX_TENSOR_DIM_NUM; i++)
-            dimSizeTMP[i] = -big->dimSize[i];
-        dimSizeTMP[whereToSplit] /= splitNum;
-        dimSizeTMP[big->order] = -splitNum;
+        int* dimSizeTMP = new int[big->order + 1];
+        for (int i = 0; i < big->order; i++)
+            dimSizeTMP[i + 1] = -big->dimSize[i];
+        dimSizeTMP[whereToSplit + 1] /= splitNum;
+        dimSizeTMP[0] = -splitNum;

        XMem * mem = big->mem;
        XTensor* tensorTMP = new XTensor(big->order + 1, dimSizeTMP, big->dataType, big->denseRatio, big->devID, mem);
@@ -251,7 +303,7 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
        if (uniform) {
            dataTMP = first->data;
        }
        else {
-            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(mem->devID, size);
+            dataTMP = mem != NULL ? mem->AllocBuf(mem->devID, size) : XMemAlloc(big->devID, size);
        }

        tensorTMP->data = dataTMP;
@@ -270,13 +322,12 @@ void _Split(const XTensor * big, XList * smalls, int whereToSplit, int splitNum)
        delete[] dimSizeTMP;

        tensorTMP->data = NULL;
-        dataTMP = NULL;
        delete tensorTMP;

        if ((!uniform) && (mem != NULL))
            mem->ReleaseBuf(mem->devID, size);
        else
-            XMemFree(mem->devID, dataTMP);
+            XMemFree(big->devID, dataTMP);
    }
}
......
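/* Note (editor's): the temporary tensor above is built with negated dimension
   sizes (e.g., dimSizeTMP[0] = -splitNum). Judging from how tensorTMP->data is
   assigned externally and reset to NULL before deletion, a negative size appears
   to tell the XTensor constructor not to allocate a buffer of its own, so the
   temporary can alias dataTMP. The net effect is that a tensor of shape
   (d_0, ..., d_n) is viewed as (splitNum, d_0, ..., d_whereToSplit / splitNum,
   ..., d_n) and handed to a single _Split call instead of splitNum copies. */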
@@ -26,6 +26,8 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

+#define STREAMED_MEMCPOPY
+
/*
transform a tensor by splitting it
e.g., (M, N) -> (M, N/3, 3)
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
 * $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-07-28
 * It is extremely hot these days and I cannot sleep well. Fortunately we had
 * a good lunch of Steamed Cold Noodles. This made me feel much better!
 */
#include "Transpose.h"
#include "Merge.h"
#include "../../XUtility.h"
#include "../../XName.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
tensor transposition of dimensions i and j
b = transposed(a)

For an input tensor a, we transpose its dimensions i and j.
E.g., let a be a tensor of size x * y * z, with i = 0 and j = 2;
then the output will be a tensor of size z * y * x.

>> a - the input tensor
>> b - the output tensor, i.e., tensor a with dimensions i and j transposed
>> i - the first dimension to transpose
>> j - the second dimension to transpose
*/
void _Transpose(const XTensor * a, XTensor * b, const int i, const int j)
{
CheckNTErrors(a && b, "Empty tensors");
CheckNTErrors(a->order == b->order, "Wrong tensor orders");
CheckNTErrors(a->unitNum == b->unitNum && a->unitSize == b->unitSize, "Wrong tensor sizes");
CheckNTErrors(a->order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a->order > j && j >= 0, "index of dimension is out of scope!");
for(int k = 0; k < a->order; k++){
if(k == i){
CheckNTErrors(a->dimSize[k] == b->dimSize[j], "Wrong dimension size in transposition");
}
else if(k == j){
CheckNTErrors(a->dimSize[k] == b->dimSize[i], "Wrong dimension size in transposition");
}
else{
CheckNTErrors(a->dimSize[k] == b->dimSize[k], "Wrong dimension size in transposition");
}
}
if(i == j){
XMemCopy(b->data, b->devID, a->data, a->devID, b->unitNum * b->unitSize);
}
else{
int I = MIN(i, j);
int J = MAX(i, j);
int * dims = new int[a->order + 1];
for(int k = 0; k <= J; k++)
dims[k] = a->dimSize[k];
dims[J + 1] = -1;
for(int k = J + 1; k < a->order; k++)
dims[k + 1] = a->dimSize[k];
        /* reshape tensor a from (..., n_I, ..., n_J, ...) => (..., n_I, ..., n_J, 1, ...) */
XTensor * aTMP = new XTensor(a->order + 1, dims, a->dataType, a->denseRatio, a->devID, a->mem);
aTMP->data = a->data;
for(int k = 0; k < I; k++)
dims[k] = a->dimSize[k];
for(int k = I + 1; k <= J; k++)
dims[k - 1] = a->dimSize[k];
dims[J] = a->dimSize[I];
for(int k = J + 1; k < a->order; k++)
dims[k] = a->dimSize[k];
        /* reshape tensor b from (..., m_I, ..., m_J, ...) => (..., m_J, m_I, ...) */
b->Reshape(b->order, dims);
/* tensor (..., n_I, ..., n_J, 1, ...) => tensor (..., m_J, m_I, ...) */
_Merge(aTMP, b, J + 1, I);
memcpy(dims, a->dimSize, sizeof(int) * a->order);
dims[I] = a->dimSize[J];
dims[J] = a->dimSize[I];
        /* reshape tensor b from (..., m_J, m_I, ...) => (..., m_J, ..., m_I, ...) */
b->Reshape(b->order, dims);
aTMP->data = NULL;
delete[] dims;
delete aTMP;
}
}
/*
tensor transposition of dimensions i and j (return an XTensor structure).
make a new tensor to keep the result and return it.
b = transposed(a)

For an input tensor a, we transpose its dimensions i and j.
E.g., let a be a tensor of size x * y * z, with i = 0 and j = 2;
then the output will be a tensor of size z * y * x.

>> a - the input tensor
>> i - the first dimension to transpose
>> j - the second dimension to transpose
<< return - tensor a with dimensions i and j transposed
*/
XTensor Transpose(const XTensor &a, const int i, const int j)
{
CheckNTErrors(a.order > i && i >= 0, "index of dimension is out of scope!");
CheckNTErrors(a.order > j && j >= 0, "index of dimension is out of scope!");
int order = a.order;
int * dimSize = new int[order];
for(int k = 0; k < order; k++){
if(k == i)
dimSize[k] = a.dimSize[j];
else if(k == j)
dimSize[k] = a.dimSize[i];
else
dimSize[k] = a.dimSize[k];
}
float dr = (!a.isSparse) ? 1.0F : a.denseRatio;
XTensor b(order, dimSize, a.dataType, dr, a.devID, a.mem);
b.SetTMP();
/* call _Transpose function */
_Transpose(&a, &b, i, j);
/* tensor connection */
XLink::MakeLink(&a, NULL, &b, SHAPE_TRANSPOSE);
XLink::AddParamToHeadInt(&b, i);
XLink::AddParamToHeadInt(&b, j);
/* destroy variables */
delete[] dimSize;
return b;
}
}
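/* Note (editor's usage sketch, illustration only): transposing dimensions 0 and 2
   of a 2 x 3 x 4 tensor yields a 4 x 3 x 2 tensor; element (i, j, k) moves to
   (k, j, i). Assuming the XTensor constructor used in this file: */
void transposeDemo()
{
    int dims[3] = {2, 3, 4};
    nts::XTensor a(3, dims, X_FLOAT, 1.0F, -1, NULL);   /* dense tensor on the CPU */
    a.SetZeroAll();
    nts::XTensor b = nts::Transpose(a, 0, 2);           /* b has shape 4 x 3 x 2 */
}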
@@ -27,27 +27,18 @@
namespace nts { // namespace nts(NiuTrans.Tensor)

-#define transpose _Transpose_
-
/*
-generate a transposed 1D/2D tensor
+tensor transposition of dimensions i and j
b = transposed(a)
*/
-void _Transpose(XTensor * a, XTensor * b);
+void _Transpose(const XTensor * a, XTensor * b, const int i, const int j);

-/*
-transpose a 1D/2D tensor (do it on site).
-keep the result in the input tensor and return nothing.
-a = transposed(a)
-*/
-void _TransposeMe(XTensor * a);
-
/*
-make a transposed 1D/2D tensor (return a XTensor structure).
+tensor transposition of dimensions i and j (return an XTensor structure).
make a new tensor to keep the result and return it.
b = transposed(a)
*/
-XTensor Transpose(XTensor &a);
+XTensor Transpose(const XTensor &a, const int i, const int j);

} // namespace nts(NiuTrans.Tensor)
......
@@ -32,12 +32,108 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
void KernelUnsqueezeFlat(void * s, int blockSize, int totalSize, void * t, int n)
{
/* index of data items */
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i >= blockSize)
return;
T value = ((T*)s)[i];
T * tData = (T*)t;
__syncthreads();
for (int k = i; k < totalSize; k += blockSize)
tData[k] = value;
}
/*
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
void KernelUnsqueezeFlatBigram(void * s, int blockSize, int totalSize, void * t, int n)
{
/* index of data items */
int i = (blockDim.x * blockIdx.x + threadIdx.x) * 2;
if (i >= blockSize)
return;
T value = ((T*)s)[i];
T value2 = ((T*)s)[i + 1];
T * tData = (T*)t;
__syncthreads();
for (int k = i; k < totalSize; k += blockSize){
tData[k] = value;
tData[k + 1] = value2;
}
}
/*
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
void KernelUnsqueezeFlat2D(void * s, int blockSize, int totalSize, void * t, int n)
{
__shared__ T data[MAX_CUDA_THREAD_NUM_PER_BLOCK];
__shared__ int offsets[MAX_CUDA_THREAD_NUM_PER_BLOCK];
/* index of data items */
int i = blockDim.x * blockIdx.x + threadIdx.x;
    /* index of the copies along the inserted dimension */
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= blockSize || j >= n)
return;
if(threadIdx.y == 0)
data[threadIdx.x] = ((T*)s)[i];
if(threadIdx.x == 0)
offsets[threadIdx.y] = blockSize * j;
__syncthreads();
((T*)t)[offsets[threadIdx.y] + i] = data[threadIdx.x];
}
/*
insert a dimension by copying the blocks for n times (where n is the size of the inserted dimension)

>> s - pointer to the source data array
>> blockSize - size of a block
>> blockNum - number of the blocks
>> totalSize - total size of the blocks (i.e., blockSize * n)
>> t - pointer to the target data array
>> n - number of times the data is copied
*/
template<class T>
__global__
-void KernelUnsqueeze(void * s, int blockSize, int blockNum, void * t, int n)
+void KernelUnsqueeze(void * s, int blockSize, int blockNum, int totalSize, void * t, int n)
{
    /* index of data items */
    int i = blockDim.x * blockIdx.x + threadIdx.x;
@@ -51,11 +147,10 @@ void KernelUnsqueeze(void * s, int blockSize, int blockNum, int totalSize, void * t, int n)
    MTYPE offset = blockSize * j;
    T value = ((T*)s)[offset + i];
    T * tData = (T*)t + offset * n;
-    int length = blockSize * n;

    __syncthreads();

-    for (int k = i; k < length; k += blockSize)
+    for (int k = i; k < totalSize; k += blockSize)
        tData[k] = value;
}
@@ -83,21 +178,71 @@ void _CudaUnsqueeze(const XTensor * a, XTensor * b, int dim, int dSize)
    int cudaGrids[3];
    int cudaBlocks[3];

-    GDevs.GetCudaThread2D(a->devID, blockSize, blockNumA, MAX_INT, cudaGrids, cudaBlocks);
-
    int devIDBackup = 0;
    ProtectCudaDev(a->devID, devIDBackup);

-    if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
-        KernelUnsqueeze<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
-                                (a->data, blockSize, blockNumA, b->data, dSize);
-    }
-    else if (a->dataType == X_INT && b->dataType == X_INT) {
-        KernelUnsqueeze<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
-                              (a->data, blockSize, blockNumA, b->data, dSize);
-    }
-    else {
-        ShowNTErrors("TODO!");
-    }
+    if(blockNumA > 1){
+        GDevs.GetCudaThread2D(a->devID, blockSize, blockNumA, MAX_INT, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueeze<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                    (a->data, blockSize, blockNumA, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueeze<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                  (a->data, blockSize, blockNumA, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else if(blockNumA == 1 && blockSize < MAX_CUDA_THREAD_NUM_PER_BLOCK){
+        GDevs.GetCudaThread2D(a->devID, blockSize, dSize, MAX_CUDA_THREAD_NUM_PER_BLOCK/4, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueezeFlat2D<float> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                          (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueezeFlat2D<int> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                        (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else if(blockNumA == 1 && blockSize % 2 == 0){
+        GDevs.GetCudaThread(a->devID, blockSize/2, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueezeFlatBigram<float> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                              (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueezeFlatBigram<int> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                            (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else if(blockNumA == 1){
+        GDevs.GetCudaThread(a->devID, blockSize, cudaGrids, cudaBlocks);
+
+        if (a->dataType == X_FLOAT && b->dataType == X_FLOAT) {
+            KernelUnsqueezeFlat<float> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                        (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else if (a->dataType == X_INT && b->dataType == X_INT) {
+            KernelUnsqueezeFlat<int> <<<dim3(cudaGrids[0]), dim3(cudaBlocks[0])>>>
+                                      (a->data, blockSize, blockSize * dSize, b->data, dSize);
+        }
+        else {
+            ShowNTErrors("TODO!");
+        }
+    }
+    else{
+        ShowNTErrors("Something is wrong!");
+    }

    BacktoCudaDev(a->devID, devIDBackup);
......
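/* Note (editor's worked example, illustration only): Unsqueeze inserts a new axis
   of size dSize and replicates the data along it; for a = (M, N) and dim = 1 the
   result is b = (M, dSize, N), i.e., each of the M blocks of length N is copied
   dSize times. The kernels above are specializations of that copy (Flat: one
   block; FlatBigram: two items per thread; Flat2D: staged via shared memory).
   A CPU reference for checking them: */
#include <vector>

std::vector<float> unsqueezeRef(const std::vector<float> & a,
                                int blockNum, int blockSize, int dSize)
{
    std::vector<float> b((size_t)blockNum * dSize * blockSize);
    for (int j = 0; j < blockNum; j++)         /* each source block ... */
        for (int d = 0; d < dSize; d++)        /* ... is repeated dSize times */
            for (int i = 0; i < blockSize; i++)
                b[((size_t)j * dSize + d) * blockSize + i] = a[(size_t)j * blockSize + i];
    return b;
}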
@@ -25,6 +25,7 @@
#include "TopK.h"
#include "TopK.cuh"
#include "Sort.cuh"
+
+#define WORKERSNUM 64

namespace nts { // namespace nts(NiuTrans.Tensor)
@@ -363,6 +364,436 @@ void KernelTopK2(T * input, int stride, int strideNum, int blockNum, int k, T minValue, T * output, int * index)
}

/*
get the top-k items
>> input - the input data array
>> stride - number of items we go over when we move to the next item along a given dimension
>> strideNum - size of the given dimension
>> blockNum - number of data blocks
>> k - as it is
>> minValue - min value of an item
>> output - the output data array
>> index - the output index array
*/
template<class T> __global__
void KernelTopK3(T * input, int stride, int strideNum, int blockNum, int k, T minValue, T * output, int * index)
{
__shared__ CudaHeapNode<T> heapData[(SHARED_MEMORY_SIZE - 1024 * sizeof(T)) / sizeof(CudaHeapNode<T>)];
__shared__ T eachHeapMaxValue[1024];
    /* optimization of the k size: the parameter must be more than half of k */
int parameter = 0;
/* worker index */
int i = blockDim.x * blockIdx.x + threadIdx.x;
    /* index of the data array along the given dimension */
int j = blockDim.y * blockIdx.y + threadIdx.y;
if (i >= strideNum || i >= blockDim.x || j >= stride * blockNum)
return;
int blockIndex = j / stride;
int offsetInBlock = j % stride;
T * d = input + stride * strideNum * blockIndex + offsetInBlock;
CudaXHeap<MIN_HEAP, T> heap(k - parameter, heapData + k * (threadIdx.y * blockDim.x + threadIdx.x));
__syncthreads();
/* go over the data array and build the heap */
int indexOffset = blockDim.x;
int dataOffset = stride * blockDim.x;
if (i + (heap.size - 1) * indexOffset < strideNum) {
int p = i;
int q = i * stride;
for (int m = 0; m < heap.size; m++) {
heap.Push(p, d[q]);
p += indexOffset;
q += dataOffset;
}
for (; p < strideNum; p += indexOffset, q += dataOffset) {
T v = d[q];
if (v > heap.topValue) {
heap.ReplaceTop(p, v);
}
}
}
else {
for (int p = i, q = i * stride; p < strideNum; p += indexOffset, q += dataOffset) {
heap.Push(p, d[q]);
}
}
    /* fill the heap if not enough items have been processed */
while (heap.count < heap.size) {
heap.Push(-1, minValue);
}
__syncthreads();
    /* merge the heaps in another way */
T minData = minValue;
int heapLimit = heap.count / 2;
if (heapLimit % 2 == 0 && heapLimit != 0) heapLimit -= 1;
for (int counter = heap.count - 1; counter >= heapLimit; --counter) {
if (minData < heap.items[counter].value)
minData = heap.items[counter].value;
}
eachHeapMaxValue[threadIdx.y * blockDim.x + threadIdx.x] = minData;
    // needs more optimization
if (i == 0) {
int threadLimit = (threadIdx.y + 1) * blockDim.x;
CudaXHeap<MIN_HEAP, T> chooseHeap(k, heapData + k * ((blockDim.x * blockDim.y) + threadIdx.y));
int counter = threadIdx.y * blockDim.x;
for (; counter < threadIdx.y * blockDim.x + k; ++counter) {
chooseHeap.Push(counter, eachHeapMaxValue[counter]);
}
for (; counter < threadLimit; ++counter) {
if (eachHeapMaxValue[counter]>chooseHeap.items[0].value) {
chooseHeap.ReplaceTop(counter, eachHeapMaxValue[counter]);
}
}
CudaXHeap<MIN_HEAP, T> ansHeapData(k, k - parameter, heapData + k * chooseHeap.items[0].index);
int miss = parameter;
for (counter = 1; counter < k; ++counter) {
chooseHeap.items[0] = chooseHeap.items[chooseHeap.count - 1];
chooseHeap.count--;
chooseHeap.Down(0);
CudaHeapNode<T> * cmpHeapData = heapData + k * (chooseHeap.items[0].index);
int cmpHeapLimit = 0;
if (counter + heapLimit <= k - parameter){
cmpHeapLimit = heapLimit;
}
            /* take the max data from the min-heap, so start searching from the leaf nodes */
for (int iterator = k - 1 - parameter; iterator >= cmpHeapLimit; --iterator){
if (miss > 0){
ansHeapData.Push(cmpHeapData[iterator].index, cmpHeapData[iterator].value);
miss--;
}
else if (ansHeapData.items[0].value < cmpHeapData[iterator].value){
ansHeapData.ReplaceTop(cmpHeapData[iterator].index, cmpHeapData[iterator].value);
}
}
}
int offset = stride * k * blockIndex + offsetInBlock;
T * dOutput = output + offset;
int * indexOutput = index + offset;
for (int q = 0; q < k; ++q){
dOutput[stride * q] = ansHeapData.items[q].value;
indexOutput[stride * q] = ansHeapData.items[q].index;
}
}
}
__device__ __forceinline__
unsigned getLaneMaskLe()
{
unsigned mask;
asm("mov.u32 %0, %%lanemask_le;" : "=r"(mask));
return mask;
}
__device__ __forceinline__
int getLaneId()
{
int laneId;
asm("mov.s32 %0, %laneid;" : "=r"(laneId));
return laneId;
}
__device__
unsigned convert(float v)
{
unsigned x = __float_as_int(v);
unsigned mask = (x & 0x80000000) ? 0xffffffff : 0x80000000;
return (x ^ mask);
}
__device__
float convert(unsigned int v)
{
float x = __uint_as_float(v);
return x;
}
__device__
float deconvert(unsigned int v)
{
unsigned int mask = (v & 0x80000000) ? 0x80000000 : 0xffffffff;
return __int_as_float(v ^ mask);
}
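/* Note (editor's check, illustration only): convert() is a monotonic map from
   float to unsigned int -- positive floats get the sign bit set, negative floats
   are bitwise inverted -- so unsigned comparison of the images agrees with float
   comparison of the originals. That is what lets radix select run on raw bits.
   A standalone host-side verification of the same bit trick: */
#include <cstdio>
#include <cstring>
#include <cstdint>

static uint32_t convertRef(float v)
{
    uint32_t x;
    std::memcpy(&x, &v, sizeof(x));
    return x ^ ((x & 0x80000000u) ? 0xffffffffu : 0x80000000u);
}

int main()
{
    const float vals[] = { -3.5f, -0.0f, 0.0f, 1.0f, 2.25f };  /* sorted ascending */
    for (int i = 0; i + 1 < 5; i++)   /* images must be nondecreasing: prints 1 four times */
        printf("%d\n", convertRef(vals[i]) <= convertRef(vals[i + 1]));
    return 0;
}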
__global__
void convert2uintV2(float* input, unsigned int *output, int stride, int strideNum, int blockNum, int size)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int idy = blockDim.y * blockIdx.y + threadIdx.y;
int blockIndex = idy / stride;
    int offsetInBlock = idy % stride;
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum && i < size;
i += stride * blockDim.x){
output[i] = convert(input[i]);
}
}
__global__
void deconvert2floatV2(unsigned int * input, float *output, int stride, int strideNum, int blockNum, int size)
{
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int idy = blockDim.y * blockIdx.y + threadIdx.y;
//int strideNum = (int)strideNumSize;
//if (flag) strideNum = strideNumSize[idy];
int blockIndex = idy / stride;
    int offsetInBlock = idy % stride;
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
i < stride * strideNum * blockIndex + offsetInBlock + stride * strideNum && i < size;
i += stride * blockDim.x){
output[i] = deconvert(input[i]);
}
}
__device__
void radixCount(unsigned int *data, int limit, int *posCount, unsigned int mask, int maskDesire, unsigned int desire, int stride, int strideNum, int blockNum)
{
    /* the idx-th thread in one vector */
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    /* the idy-th vector in one tensor */
    int idy = blockDim.y * blockIdx.y + threadIdx.y;
    int blockIndex = idy / stride;
    int offsetInBlock = idy % stride;
for (int j = idx*stride + stride * strideNum * blockIndex + offsetInBlock;
j< stride * strideNum * blockIndex + offsetInBlock + stride*strideNum && j<limit;
j += stride * WORKERSNUM) {
if ((data[j] & maskDesire) == desire) {
if (data[j] & mask) {
posCount[(idy % (512 / WORKERSNUM))*blockDim.x + idx]++;
}
}
}
}
/* This is a fast way to check thread status within a warp;
   note that the thread number must be a multiple of 32 */
__device__
void gpuCheckWarp(int *smem, bool in, int *carry, int *index)
{
int vote = __ballot_sync(0xffffffff, in);
*index = __popc(getLaneMaskLe() & vote);
*carry = __popc(vote);
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int warp = idx / 32;
int warpNum = blockDim.x / 32;
if (getLaneId() == 0) {
/* save each warp carry */
smem[warp + warpNum * threadIdx.y] = *carry;
}
__syncthreads();
    /* use one thread to accumulate the carries over all warps */
if (idx == 0) {
for (int i = 1 + warpNum * threadIdx.y; i < warpNum * (threadIdx.y + 1); ++i) {
smem[i] += smem[i - 1];
}
}
__syncthreads();
if (warp % warpNum) {
*index += smem[warpNum * threadIdx.y + warp - 1];
}
*carry = smem[warpNum * threadIdx.y + warpNum - 1];
}
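/* Note (editor's worked example): suppose lanes 1, 4 and 9 of a warp pass "in".
   Then vote = __ballot_sync(...) = 0b1000010010. For lane 4, getLaneMaskLe()
   keeps bits 0..4, so *index = __popc(mask & vote) = 2: lane 4 holds the 2nd
   passing item (a 1-based prefix count), while *carry = __popc(vote) = 3 is the
   warp total. The per-warp carries are then prefix-summed in shared memory to
   produce offsets across the whole thread block. */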
/*
collect the data larger than the pattern and return them as the answer
*/
__device__
void collectNumber(unsigned int *data, int stride, int strideNum, int limit,
unsigned int pattern, float *ans, int *ansIndex, int k)
{
int idy = blockDim.y * blockIdx.y + threadIdx.y;
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int blockIndex = idy / stride;
int offsetInBlock = idy % stride;
    /* for counting each warp's temporary carry */
__shared__ int smem[32];
int carry;
int index;
int vectorLimit = stride * strideNum * blockIndex + offsetInBlock + stride * strideNum;
int alibnStrideNum = strideNum;
if (alibnStrideNum % blockDim.x) alibnStrideNum = alibnStrideNum + blockDim.x - (alibnStrideNum % blockDim.x);
int vectorAlibnLimit = stride * strideNum * blockIndex + offsetInBlock + stride * alibnStrideNum;
int ansArrayIndex = stride * k * blockIndex + offsetInBlock;
int ansSize = 0;
__syncthreads();
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock;
i < vectorAlibnLimit; i += stride * WORKERSNUM){
bool hasTopk = false;
        if (i < vectorLimit && data[i] > pattern){
hasTopk = true;
}
gpuCheckWarp(smem, hasTopk, &carry, &index);
if (carry > 0) {
if (hasTopk) {
ans[ansArrayIndex + (index - 1) * stride] = deconvert(data[i]);
ansIndex[ansArrayIndex + (index - 1) * stride] = i - stride * strideNum * blockIndex;
}
ansArrayIndex += carry * stride;
ansSize += carry;
}
__syncthreads();
}
if (ansSize < k){
int ramindNum = k - ansSize;
#pragma unroll
for (int i = idx * stride + stride * strideNum * blockIndex + offsetInBlock; i < vectorAlibnLimit; i += stride * WORKERSNUM) {
bool hasTopk = false;
if (i < vectorLimit && data[i] == pattern) {
hasTopk = true;
}
gpuCheckWarp(smem, hasTopk, &carry, &index);
            if (carry > 0) {
                int checkTmpIndex = ansArrayIndex + (index - 1) * stride;
                /* avoid pointer boundary overflow: for instance, if one more index
                   is needed but two indices fit, we should filter out the larger one */
                if (hasTopk && checkTmpIndex < stride * k * blockIndex + offsetInBlock + stride * k) {
ans[checkTmpIndex] = deconvert(pattern);
ansIndex[checkTmpIndex] = i - stride * strideNum * blockIndex;
}
ramindNum -= carry;
ansArrayIndex += carry * stride;
if (ramindNum <= 0) break;
}
__syncthreads();
}
}
}
/*
This is the old way: one thread collects the numbers. It is very slow, so we dropped it.
*/
__device__
void collectNumberOld(unsigned int *data, int n, int k, unsigned int pattern, unsigned int *ans, int *indexNum, int stride, int strideNum)
{
int idy = blockDim.y * blockIdx.y + threadIdx.y;
int blockIndex = idy / stride;
int offsetInBlock = idy % stride;
int cot = 0;
for (int i = stride * strideNum * blockIndex + offsetInBlock, j = 0; j < strideNum; j++, i += stride) {
if (data[i] > pattern) {
ans[cot] = data[i];
indexNum[cot++] = j;
}
}
    /* if cot < k, the remaining values must equal the desired pattern */
if (cot < k) {
for (int i = cot; i < k; ++i) {
ans[i] = pattern;
}
        /* collect the remaining indices whose data values equal the pattern */
for (int i = stride * strideNum * blockIndex + offsetInBlock, j = 0; j < strideNum; j++, i += stride) {
if (data[i] == pattern) {
indexNum[cot++] = j;
if (cot == k) break;
}
}
}
}
/*
When k is very big, we cannot use shared memory for the calculation, so we use the radix select algorithm
*/
template<class T> __global__
void KernelTopKRadixSelect(unsigned int * input, int stride, int strideNum,
int blockNum, int k, T minValue, T * output, int* index, int limit)
{
    /* the idx-th thread in one vector */
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    /* the idy-th vector in one tensor */
    int idy = blockDim.y * blockIdx.y + threadIdx.y;
    /* use the optimization or not */
    //int strideNum = (int)strideNumSize;
    //if (isOptimization) strideNum = strideNumSize[idy];
    if (idy >= stride * blockNum) return;
int maskDesire = 0;
unsigned int mask = 0x80000000;
unsigned int desire = 0;
__shared__ int posCount[32 * 32];
int tmpK = k;
int flag = 1;
#pragma unroll
for (int i = 0; i < 32; i++){
        /* we need to clear the shared memory in every iteration */
posCount[idx + blockDim.x*(idy % (512 / WORKERSNUM))] = 0;
if (flag)
radixCount(input, stride*strideNum*blockNum, posCount, mask, maskDesire, desire, stride, strideNum, blockNum);
__syncthreads();
int sumCount = 0;
#pragma unroll
for (int j = 0; j < WORKERSNUM; j++) {
sumCount += posCount[(idy % (512 / WORKERSNUM))*blockDim.x + j];
}
__syncthreads();
if (tmpK<sumCount) {
/* this position should be 1 */
desire = mask^desire;
}
else {
/* zoom out the k size,this position should be 0 */
tmpK = tmpK - sumCount;
if (tmpK == 0){
desire = (~(maskDesire >> 1)) | desire;
/* avoid Synchronize deadlock ,can't use break,so we use flag */
//break;
flag = 0;
}
}
maskDesire = mask^maskDesire;
mask = mask >> 1;
}
__syncthreads();
/* old way to collect number */
/*
if (idx == 0)
{
unsigned int* uintOutput = new unsigned int;
int* tmpIndex = new int;
    //*******************something wrong***************************
cudaMalloc((void **)&uintOutput, sizeof(unsigned int)* k);
cudaMalloc((void **)&tmpIndex, sizeof(unsigned int)*k);
//*************************************************************
collectNumberOld(input, limit, k, desire, uintOutput, tmpIndex, stride, strideNum);
int blockIndex = idy / stride;
int offsetInBlock = idy% stride;
for (int i = stride * k * blockIndex + offsetInBlock, j = 0; j < k; j++, i += stride)
{
//for(int i = )
output[i] = deconvert(uintOutput[j]);
index[i] = tmpIndex[j];
}
}
__syncthreads();
*/
collectNumber(input, stride, strideNum, limit, desire, output, index, k);
}
/*
get the top-k items along a given dimension

>> a - input tensor
>> b - output tensor (top-k result)
@@ -388,7 +819,13 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
    for (int i = dimRDI + 1; i < a->order; i++)
        blockNum *= a->dimSizeRDI[i];

-    int workerNum = blockNum < 16 ? 64 : 32; // should be tuned for better performance
+    int workerNum = blockNum < 16 ? 64 : 32;
+
+    /* adjust the thread number according to the size of k, to fit the shared memory size */
+    if (k < 6) workerNum = 512;
+    else if (k < 11) workerNum = 256;
+    else if (k < 22) workerNum = 128;
+    else if (k < 44) workerNum = 64;
+    else workerNum = 32;

    int cudaGrids[3];
    int cudaBlocks[3];
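/* Note (editor's worked arithmetic): the schedule above keeps the heap storage
   (cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) within
   the shared-memory budget checked below. Assuming cudaBlocks[0] = workerNum,
   cudaBlocks[1] = 1 and 4-byte items: k = 10 with 256 workers needs
   257 * 10 * 8 = 20,560 bytes, while k = 44 falls through to 32 workers,
   needing only 33 * 44 * 8 = 11,616 bytes. */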
@@ -397,29 +834,15 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
                          workerNum, stride * blockNum, MAX_INT,
                          cudaGrids, cudaBlocks);

-    for (int i = 0; i < 2; i++) {
-        if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) >= SHARED_MEMORY_SIZE) {
-            if (cudaBlocks[1] >= 2 && cudaBlocks[1] % 2 == 0) {
-                cudaBlocks[1] /= 2;
-                cudaGrids[1] *= 2;
-            }
-        }
-        if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) >= SHARED_MEMORY_SIZE) {
-            if (cudaBlocks[0] >= 2 && cudaBlocks[0] % 2 == 0) {
-                cudaBlocks[0] /= 2;
-                cudaGrids[0] *= 2;
-            }
-        }
-    }
-
    int devIDBackup = 0;
    ProtectCudaDev(a->devID, devIDBackup);

    /* we run the kernel if the heaps can fit into the shared memory */
+    cudaGrids[1] *= cudaBlocks[1];
+    cudaBlocks[1] = 1;
    if ((cudaBlocks[0] * cudaBlocks[1] + 1) * k * (a->unitSize + sizeof(int)) < SHARED_MEMORY_SIZE) {
        if (a->dataType == DEFAULT_DTYPE) {
-            KernelTopK2<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+            KernelTopK3<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
                                ((DTYPE*)a->data, stride, strideNumA, blockNum, k, DTYPE_MIN,
                                 (DTYPE*)b->data, (int*)index->data);
        }
@@ -430,20 +853,34 @@ void _CudaTopK(const XTensor * a, XTensor * b, XTensor * index, int dim, int k)
    }
    /* we resort to sorting if the data cannot fit inside the shared memory */
    else {
-        int dimSize[MAX_TENSOR_DIM_NUM];
-        memcpy(dimSize, a->dimSize, sizeof(int) * a->order);
-        dimSize[0] = -dimSize[0];
-        XTensor * indexA = new XTensor(a->order, dimSize, X_INT, 1.0F, a->devID, a->mem);
-        indexA->data = a->mem != NULL ? a->mem->AllocBuf(a->devID, a->unitNum * sizeof(int)) : XMemAlloc(a->devID, a->unitNum * sizeof(int));
-
-        /* make the index tensor */
-        indexA->SetAscendingOrder(dim);
-
-        _CudaSortBig(a, b, indexA, index, dim, k);
-
-        if (a->mem != NULL)
-            a->mem->ReleaseBuf(a->devID, a->unitNum * sizeof(int));
-        delete indexA;
+        int workerNum = WORKERSNUM;
+
+        GDevs.GetCudaThread2D(a->mem->devID,
+                              workerNum, stride * blockNum, MAX_INT,
+                              cudaGrids, cudaBlocks);
+
+        if (a->dataType == DEFAULT_DTYPE) {
+            unsigned int * goutput = (unsigned int *)a->data;
+            /* the two conversion launches below take almost the same time; either works */
+            convert2uintV2 <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                            ((float*)a->data, goutput, stride, strideNumA, blockNum, strideNumA * blockNum * stride);
+            //convert2uintV2 <<<dim3(1, stride * blockNum), dim3(512, 1)>>>
+            //                ((float*)a->data, goutput, stride, strideNumA, blockNum, strideNumA * blockNum * stride);
+            KernelTopKRadixSelect<DTYPE> <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                                          (goutput, stride, strideNumA, blockNum, k, DTYPE_MIN,
+                                           (DTYPE *)b->data, (int *)index->data, stride * strideNumA * blockNum);
+            deconvert2floatV2 <<<dim3(cudaGrids[0], cudaGrids[1]), dim3(cudaBlocks[0], cudaBlocks[1])>>>
+                               ((unsigned int *)a->data, (float *)goutput, stride, strideNumA, blockNum, strideNumA * blockNum * stride);
+        }
    }

    BacktoCudaDev(a->devID, devIDBackup);
......
@@ -117,7 +117,7 @@ void CudaGPUToCPUFlush(XTensor * tensor)
    else {
        tensor->dataHost = new char[tensor->unitNum * tensor->unitSize];
        if (tensor->data != NULL)
-            cudaMemcpy(tensor->dataHost, tensor->data, tensor->unitNum * tensor->unitSize, cudaMemcpyDeviceToHost);
+            XMemCopy(tensor->dataHost, -1, tensor->data, tensor->devID, tensor->unitNum * tensor->unitSize);
        else
            memset(tensor->dataHost, 0, tensor->unitNum * tensor->unitSize);
    }
......
@@ -116,8 +116,7 @@ void _HardTanHBackward(XTensor * gold, XTensor * y, XTensor * x,
}
#endif

-    if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
-    {
+    if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
        /* calculate dE/dy */
        if(lossName != NOLOSS)
            _LossBackward(dedy, gold, y, lossName);
......
@@ -38,6 +38,17 @@ log scale softmax y = log(e^x / \sum_{i} e^{x_i})
*/
void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
{
+    CheckNTErrors(!x->isSparse && !y->isSparse, "TODO!");
+    CheckNTErrors(x && y, "Empty input tensors!");
+
+    if(leadDim < 0)
+        leadDim = x->order - 1;
+
+    if(y->dimSize[leadDim] == 1){
+        y->SetZeroAll();
+        return;
+    }
+
    int leadDimRDI = x->order - leadDim - 1;
    if (!x->isSparse && !y->isSparse &&
        x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE)
...@@ -68,25 +79,27 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -68,25 +79,27 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
blockSize = stride * dimensionSize; blockSize = stride * dimensionSize;
blockNum = y->unitNum / blockSize; blockNum = y->unitNum / blockSize;
max = NewTensor(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem); max = NewTensorBuf(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem);
sum = NewTensor(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem); sum = NewTensorBuf(x->order - 1, dimSize, x->dataType, x->denseRatio, x->devID, mem);
max->data = mem != NULL ? (char*)mem->AllocBuf(mem->devID, max->unitNum * max->unitSize) : XMemAlloc(max->devID, max->unitNum * max->unitSize);
sum->data = mem != NULL ? (char*)mem->AllocBuf(mem->devID, sum->unitNum * sum->unitSize) : XMemAlloc(sum->devID, sum->unitNum * sum->unitSize);
_ReduceMax(x, max, leadDim); _ReduceMax(x, max, leadDim);
_ReduceSum(x, sum, leadDim, max, 1.0F, true); _ReduceSum(x, sum, leadDim, max, 1.0F, true);
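/* a reminder of the numerically stable identity used here: with m = max_i x_i and
   s = \sum_i e^{x_i - m}, log softmax is y_j = x_j - m - log(s); subtracting the
   max before exponentiation keeps e^{x_j - m} within floating-point range */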
if (x->devID >= 0) { if (x->devID >= 0) {
int dims[2]; if(leadDimRDI == 0){
dims[0] = -stride; blockSize = y->unitNum;
dims[1] = dimensionSize; blockNum = 1;
blockx = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); blockx = NewTensor2D(blockSize/dimensionSize, -dimensionSize, x->dataType, x->devID, mem);
blocky = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); blocky = NewTensor2D(blockSize/dimensionSize, -dimensionSize, x->dataType, x->devID, mem);
dims[0] = -stride; blockMax = NewTensor2D(blockSize/dimensionSize, -1, x->dataType, x->devID, mem);
dims[1] = 1; blockSum = NewTensor2D(blockSize/dimensionSize, -1, x->dataType, x->devID, mem);
blockMax = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); }
blockSum = NewTensor(2, dims, x->dataType, x->denseRatio, x->devID, mem); else{
blockx = NewTensor2D(-stride, dimensionSize, x->dataType, x->devID, mem);
blocky = NewTensor2D(-stride, dimensionSize, x->dataType, x->devID, mem);
blockMax = NewTensor2D(-stride, 1, x->dataType, x->devID, mem);
blockSum = NewTensor2D(-stride, 1, x->dataType, x->devID, mem);
}
} }
for (int k = 0; k < blockNum; k++) { for (int k = 0; k < blockNum; k++) {
...@@ -123,7 +136,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -123,7 +136,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
blockMax->data = mp; blockMax->data = mp;
blockSum->data = sp; blockSum->data = sp;
#ifdef USE_CUDA #ifdef USE_CUDA
_CudaLogSoftmaxSumMax(blockx, blocky, leadDim, blockSum, blockMax); if(leadDimRDI == 0)
_CudaLogSoftmaxSumMax(blockx, blocky, 1, blockSum, blockMax);
else
_CudaLogSoftmaxSumMax(blockx, blocky, leadDim, blockSum, blockMax);
#else #else
ShowNTErrors("Please specify USE_CUDA and recompile the code!"); ShowNTErrors("Please specify USE_CUDA and recompile the code!");
#endif #endif
...@@ -134,21 +150,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim) ...@@ -134,21 +150,10 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim)
} }
} }
if (x->devID < 0) { DelTensorBuf(max);
if (mem != NULL) { DelTensorBuf(sum);
mem->ReleaseBuf(mem->devID, max->unitNum * max->unitSize);
mem->ReleaseBuf(mem->devID, sum->unitNum * sum->unitSize); if (x->devID >= 0) {
}
else {
XMemFree(max->devID, max->data);
XMemFree(sum->devID, sum->data);
max->data = NULL;
sum->data = NULL;
}
delete max;
delete sum;
}
else {
delete blockx; delete blockx;
delete blocky; delete blocky;
delete blockMax; delete blockMax;
...@@ -184,6 +189,27 @@ XTensor LogSoftmax(const XTensor &x, int leadDim) ...@@ -184,6 +189,27 @@ XTensor LogSoftmax(const XTensor &x, int leadDim)
return y; return y;
} }
/*
log scale softmax y = log(e^x / \sum_{i} e^{x_i})
compute the result in the given output tensor y (no new tensor is created)
>> x - input vector
>> y - output vector
>> leadDim - leading dimension (along which we perform reduction)
*/
void LogSoftmax(const XTensor &x, XTensor &y, int leadDim)
{
if(!XTensor::IsSameShaped(&x, &y))
InitTensor(&y, &x);
/* call _LogSoftmax function */
_LogSoftmax(&x, &y, leadDim);
/* tensor connection */
XLink::MakeLink(&x, NULL, &y, FUNC_LOGSOFTMAX);
XLink::AddParamToHeadInt(&y, leadDim);
}
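/*
a minimal usage sketch for the in-place variant above (illustrative only; the
shape (4, 8), the InitTensor2D/SetDataRand helpers from XTensor.h, and the
choice of leading dimension are assumptions, not part of this change):

    XTensor x, y;
    InitTensor2D(&x, 4, 8, X_FLOAT);
    x.SetDataRand(-1.0F, 1.0F);
    LogSoftmax(x, y, 1);    // y is re-initialized to x's shape if they differ
*/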
/* /*
backward computation for dense matrices with default data type backward computation for dense matrices with default data type
...@@ -255,6 +281,9 @@ void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -255,6 +281,9 @@ void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
CheckNTErrors((!dedx->isSparse), "The gradient matrix must be dense!"); CheckNTErrors((!dedx->isSparse), "The gradient matrix must be dense!");
CheckNTErrors((gold != NULL), "The gold standard cannot be empty!"); CheckNTErrors((gold != NULL), "The gold standard cannot be empty!");
if(leadDim < 0)
leadDim = y->order - 1;
int leadDimRDI = y->order - leadDim - 1; int leadDimRDI = y->order - leadDim - 1;
#ifdef USE_CUDA #ifdef USE_CUDA
if (gold->devID >= 0) { if (gold->devID >= 0) {
......
...@@ -33,6 +33,9 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim); ...@@ -33,6 +33,9 @@ void _LogSoftmax(const XTensor * x, XTensor * y, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (return a XTensor structure) */ /* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (return a XTensor structure) */
XTensor LogSoftmax(const XTensor &x, int leadDim); XTensor LogSoftmax(const XTensor &x, int leadDim);
/* log scale softmax y = log(e^x / \sum_{i} e^{x_i}) (with both argument of x and y) */
void LogSoftmax(const XTensor &x, XTensor &y, int leadDim);
/* de/dx */ /* de/dx */
void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, void _LogSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
XTensor * dedy, XTensor * dedx, XTensor * dedy, XTensor * dedx,
......
...@@ -24,7 +24,7 @@ ...@@ -24,7 +24,7 @@
#include "../XDevice.h" #include "../XDevice.h"
#include "../core/math/Power.h" #include "../core/math/Power.h"
#include "../core/math/ScaleAndShift.h" #include "../core/math/ScaleAndShift.h"
#include "../core/math/Log.h" #include "../core/math/Unary.h"
#include "../core/arithmetic/Negate.h" #include "../core/arithmetic/Negate.h"
#include "../core/arithmetic/Sum.h" #include "../core/arithmetic/Sum.h"
#include "../core/arithmetic/Multiply.h" #include "../core/arithmetic/Multiply.h"
......
...@@ -37,6 +37,9 @@ softmax y = e^x / \sum_{i} e^{x_i} ...@@ -37,6 +37,9 @@ softmax y = e^x / \sum_{i} e^{x_i}
*/ */
void _Softmax(const XTensor * x, XTensor * y, int leadDim) void _Softmax(const XTensor * x, XTensor * y, int leadDim)
{ {
if(leadDim < 0)
leadDim = x->order - 1;
int leadDimRDI = x->order - leadDim - 1; int leadDimRDI = x->order - leadDim - 1;
if(!x->isSparse && !y->isSparse && x->dataType == y->dataType){ if(!x->isSparse && !y->isSparse && x->dataType == y->dataType){
int * dimSize = new int[x->order - 1]; int * dimSize = new int[x->order - 1];
...@@ -182,10 +185,14 @@ void _SoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -182,10 +185,14 @@ void _SoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
int leadDim, int leadDim,
LOSS_FUNCTION_NAME lossName) LOSS_FUNCTION_NAME lossName)
{ {
CheckNTErrors((dedx->isSparse == false), "The gradient tensor must be dense!"); CheckNTErrors(dedx->isSparse == false, "The gradient tensor must be dense!");
CheckNTErrors((gold != NULL), "Incorrect x gold standard tensor!"); CheckNTErrors(gold != NULL || lossName == NOLOSS, "Gold standard is required for computing loss!");
if(leadDim < 0)
leadDim = y->order - 1;
int leadDimRDI = y->order - leadDim - 1; int leadDimRDI = y->order - leadDim - 1;
#ifdef USE_CUDA #ifdef USE_CUDA
if(y->devID >= 0){ if(y->devID >= 0){
_CudaSoftmaxBackward(gold, y, x, dedy, dedx, leadDim, lossName); _CudaSoftmaxBackward(gold, y, x, dedy, dedx, leadDim, lossName);
......
...@@ -156,6 +156,50 @@ void KernelSoftmaxComputeTensor(__half * x, __half * max, __half * sum, __half * ...@@ -156,6 +156,50 @@ void KernelSoftmaxComputeTensor(__half * x, __half * max, __half * sum, __half *
} }
/* /*
broadcast a float value from lane 0 to every lane of a warp via inline PTX
*/
__device__ __forceinline__
float broadcast(float input)
{
float output;
asm(
"{"
"shfl.idx.b32 %0,%1,0x0,0x1f;"
"}"
:"=f"(output) : "f"(input)
);
return output;
}
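/*
shfl.idx.b32 above is the pre-Volta form of the warp shuffle; on sm_70 and later
the synchronized intrinsic is the safer equivalent. A sketch of the same lane-0
broadcast (an alternative, not part of the original code):
*/
__device__ __forceinline__
float broadcastSync(float input)
{
    /* every active lane receives lane 0's value */
    return __shfl_sync(0xffffffff, input, 0);
}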
/*
use a warp-level broadcast to optimize the softmax computation: lane 0 of each
warp loads the per-row max and sum once and shares them with the remaining
lanes, avoiding redundant global-memory reads
*/
__global__
void KernelSoftmaxComputeTensorUseBroadcast(DTYPE * input, DTYPE * max, DTYPE * sum, DTYPE * output,
int stride, int strideNum, int blockNum)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
int j = blockDim.y * blockIdx.y + threadIdx.y;
int i2 = j % stride;
int blockSize = stride * strideNum;
if (j < stride * blockNum) {
DTYPE sumData, maxData;
if (i % 32 == 0) {
sumData = sum[j];
maxData = max[j];
}
sumData = broadcast(sumData);
maxData = broadcast(maxData);
if (i < strideNum){
int offset = int(j / stride) * blockSize + i * stride + i2;
output[offset] = exp(input[offset] - maxData) / sumData;
}
}
}
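/* correctness note: the broadcast relies on each warp holding 32 consecutive
   x-threads of the same row j, so lane 0 is exactly the thread with i % 32 == 0;
   the other lanes pass an uninitialized value into broadcast(), which is safe
   because only lane 0's register is read. The launch logic in _CudaSoftmaxSumMax
   below pads cudaBlockSize[0] up to a full warp for this reason */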
/*
softmax y = e^x / \sum_{i} e^{x_i} (Cuda version) softmax y = e^x / \sum_{i} e^{x_i} (Cuda version)
>> x - x vector >> x - x vector
>> y - result >> y - result
...@@ -183,20 +227,42 @@ void _CudaSoftmaxSumMax(const XTensor * x, XTensor * y, int leadDim, XTensor * s ...@@ -183,20 +227,42 @@ void _CudaSoftmaxSumMax(const XTensor * x, XTensor * y, int leadDim, XTensor * s
int cudaGridSize[3]; int cudaGridSize[3];
int cudaBlockSize[3]; int cudaBlockSize[3];
GDevs.GetCudaThread2D(x->devID, stride * blockNum, dimensionSize, MAX_INT, cudaGridSize, cudaBlockSize); if (leadDim != 0 || dimensionSize <= 10){
/* allocate thread num for old function */
GDevs.GetCudaThread2D(x->devID, stride * blockNum, dimensionSize, MAX_INT, cudaGridSize, cudaBlockSize);
}
else {
/* allocate thread num for new function */
GDevs.GetCudaThread2D(x->devID, dimensionSize, stride * blockNum, MAX_INT, cudaGridSize, cudaBlockSize);
if (cudaBlockSize[0] < 32) {
/* use at least a warp */
cudaBlockSize[0] = 32;
if (cudaBlockSize[1] > 32) {
cudaGridSize[1] = int(ceil(float(stride * blockNum) / 32));
cudaBlockSize[1] = 32;
}
}
}
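/* a plausible reading of the threshold above (an inference, not documented in the
   source): when the reduced dimension has 10 or fewer elements, a 32-thread warp
   per row would sit mostly idle, so the original kernel is kept for small dims */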
int devIDBackup; int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup); ProtectCudaDev(x->devID, devIDBackup);
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){ if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
KernelSoftmaxComputeTensor<<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>> if (leadDim != 0 || dimensionSize <= 10) {
((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data, KernelSoftmaxComputeTensor <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
stride, dimensionSize, stride * dimensionSize, blockNum, stride * blockNum); ((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data,
stride, dimensionSize, stride * dimensionSize, blockNum, stride * blockNum);
}
else {
KernelSoftmaxComputeTensorUseBroadcast <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
((DTYPE*)x->data, (DTYPE*)max->data, (DTYPE*)sum->data, (DTYPE*)y->data,
stride, dimensionSize, blockNum);
}
} }
else if(x->dataType == X_FLOAT16 && y->dataType == X_FLOAT16){ else if(x->dataType == X_FLOAT16 && y->dataType == X_FLOAT16){
KernelSoftmaxComputeTensor<<<dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1])>>> KernelSoftmaxComputeTensor <<< dim3(cudaGridSize[0], cudaGridSize[1]), dim3(cudaBlockSize[0], cudaBlockSize[1]) >>>
((__half*)x->data, (__half*)max->data, (__half*)sum->data, (__half*)y->data, ((__half*)x->data, (__half*)max->data, (__half*)sum->data, (__half*)y->data,
stride, dimensionSize, blockNum); stride, dimensionSize, blockNum);
} }
else{ else{
ShowNTErrors("TODO!"); ShowNTErrors("TODO!");
...@@ -239,6 +305,9 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -239,6 +305,9 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
CheckNTErrors((x->devID == y->devID), "Matrices used in log softmax are not on the same GPU."); CheckNTErrors((x->devID == y->devID), "Matrices used in log softmax are not on the same GPU.");
CheckNTErrors((y->order >= 1), "Empty tensor!"); CheckNTErrors((y->order >= 1), "Empty tensor!");
int devIDBackup;
ProtectCudaDev(x->devID, devIDBackup);
if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){ if(x->dataType == DEFAULT_DTYPE && y->dataType == DEFAULT_DTYPE){
CheckNTErrors((lossName == CROSSENTROPY || CheckNTErrors((lossName == CROSSENTROPY ||
...@@ -284,8 +353,14 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -284,8 +353,14 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
/* make a matrix to keep \beta */ /* make a matrix to keep \beta */
XTensor * beta = new XTensor(y->order - 1, dimSize, y->dataType, y->denseRatio, y->devID, mem); XTensor * beta = new XTensor(y->order - 1, dimSize, y->dataType, y->denseRatio, y->devID, mem);
ytmp->data = mem->AllocBuf(mem->devID, y->unitNum * y->unitSize); if(mem != NULL){
beta->data = mem->AllocBuf(mem->devID, beta->unitNum * beta->unitSize); ytmp->data = mem->AllocBuf(mem->devID, y->unitNum * y->unitSize);
beta->data = mem->AllocBuf(mem->devID, beta->unitNum * beta->unitSize);
}
else{
ytmp->data = XMemAlloc(y->devID, y->unitNum * y->unitSize);
beta->data = XMemAlloc(y->devID, beta->unitNum * beta->unitSize);
}
/* \beta = \sum_i (dE/dy_i * y_i) */ /* \beta = \sum_i (dE/dy_i * y_i) */
_Multiply(dedy, y, ytmp, 0, 0); _Multiply(dedy, y, ytmp, 0, 0);
...@@ -298,8 +373,18 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -298,8 +373,18 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
/* dE/ds_j = y_j * ytmp = y_j * (dE/dy_j - \beta) */ /* dE/ds_j = y_j * ytmp = y_j * (dE/dy_j - \beta) */
_Multiply(y, ytmp, dedx, 0, 0); _Multiply(y, ytmp, dedx, 0, 0);
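/* derivation (standard softmax Jacobian): with y_j = e^{x_j} / \sum_i e^{x_i},
   dy_j/dx_k = y_j * (delta_{jk} - y_k), hence
   dE/dx_k = \sum_j dE/dy_j * y_j * (delta_{jk} - y_k)
           = y_k * (dE/dy_k - \beta), where \beta = \sum_j dE/dy_j * y_j */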
mem->ReleaseBuf(mem->devID, y->unitNum * y->unitSize);
mem->ReleaseBuf(mem->devID, beta->unitNum * beta->unitSize); if(mem != NULL){
mem->ReleaseBuf(mem->devID, y->unitNum * y->unitSize);
mem->ReleaseBuf(mem->devID, beta->unitNum * beta->unitSize);
}
else{
XMemFree(y->devID, ytmp->data);
XMemFree(y->devID, beta->data);
}
ytmp->data = NULL;
beta->data = NULL;
delete[] dimSize; delete[] dimSize;
delete ytmp; delete ytmp;
...@@ -311,6 +396,8 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x, ...@@ -311,6 +396,8 @@ void _CudaSoftmaxBackward(XTensor * gold, XTensor * y, XTensor * x,
} }
else else
ShowNTErrors("TODO!"); ShowNTErrors("TODO!");
BacktoCudaDev(x->devID, devIDBackup);
} }
#endif #endif
......
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/ */
#include "../core/math/Unary.h"
#include "TAbsolute.h" #include "TAbsolute.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -30,14 +31,14 @@ Set every entry to its absolute value. ...@@ -30,14 +31,14 @@ Set every entry to its absolute value.
bool TestAbsolute1() bool TestAbsolute1()
{ {
/* a tensor of size (3, 2) */ /* a tensor of size (3, 2) */
int aOrder = 2; int order = 2;
int * aDimSize = new int[aOrder]; int * dimSize = new int[order];
aDimSize[0] = 3; dimSize[0] = 3;
aDimSize[1] = 2; dimSize[1] = 2;
int aUnitNum = 1; int unitNum = 1;
for (int i = 0; i < aOrder; i++) for (int i = 0; i < order; i++)
aUnitNum *= aDimSize[i]; unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, -2.0F}, DTYPE aData[3][2] = { {1.0F, -2.0F},
{0.5F, -4.0F}, {0.5F, -4.0F},
...@@ -50,14 +51,14 @@ bool TestAbsolute1() ...@@ -50,14 +51,14 @@ bool TestAbsolute1()
bool cpuTest = true; bool cpuTest = true;
/* create tensors */ /* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize); XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(aOrder, aDimSize); XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize); XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser; XTensor bUser;
/* initialize variables */ /* initialize variables */
a->SetData(aData, aUnitNum); a->SetData(aData, unitNum);
aMe->SetData(aData, aUnitNum); aMe->SetData(aData, unitNum);
/* call Absolute function */ /* call Absolute function */
_Absolute(a, b); _Absolute(a, b);
...@@ -65,21 +66,21 @@ bool TestAbsolute1() ...@@ -65,21 +66,21 @@ bool TestAbsolute1()
bUser = Absolute(*a); bUser = Absolute(*a);
/* check results */ /* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F); cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
bool gpuTest = true; bool gpuTest = true;
/* create tensor */ /* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU; XTensor bUserGPU;
/* Initialize variables */ /* Initialize variables */
aGPU->SetData(aData, aUnitNum); aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, aUnitNum); aMeGPU->SetData(aData, unitNum);
/* call Absolute function */ /* call Absolute function */
_Absolute(aGPU, bGPU); _Absolute(aGPU, bGPU);
...@@ -87,7 +88,7 @@ bool TestAbsolute1() ...@@ -87,7 +88,7 @@ bool TestAbsolute1()
bUserGPU = Absolute(*aGPU); bUserGPU = Absolute(*aGPU);
/* check results */ /* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F); gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */ /* destroy variables */
delete a; delete a;
...@@ -96,7 +97,7 @@ bool TestAbsolute1() ...@@ -96,7 +97,7 @@ bool TestAbsolute1()
delete aGPU; delete aGPU;
delete bGPU; delete bGPU;
delete aMeGPU; delete aMeGPU;
delete[] aDimSize; delete[] dimSize;
return cpuTest && gpuTest; return cpuTest && gpuTest;
#else #else
...@@ -104,7 +105,7 @@ bool TestAbsolute1() ...@@ -104,7 +105,7 @@ bool TestAbsolute1()
delete a; delete a;
delete b; delete b;
delete aMe; delete aMe;
delete[] aDimSize; delete[] dimSize;
return cpuTest; return cpuTest;
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -22,7 +22,6 @@ ...@@ -22,7 +22,6 @@
#ifndef __TEST_ABSOLUTE_H__ #ifndef __TEST_ABSOLUTE_H__
#define __TEST_ABSOLUTE_H__ #define __TEST_ABSOLUTE_H__
#include "../core/arithmetic/Absolute.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/
#include "../XTensor.h"
#include "TClip.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Clip function.
Clip every entry to the range [-1.0, 1.0].
*/
bool TestClip1()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
DTYPE aData[3][2] = { {1.0F, -2.0F},
{0.0F, 4.0F},
{5.0F, -6.0F} };
DTYPE answer[3][2] = { {1.0F, -1.0F},
{0.0F, 1.0F},
{1.0F, -1.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(aOrder, aDimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
aMe->SetData(aData, aUnitNum);
/* call Clip function */
_Clip(a, b, -1.0F, 1.0F);
_ClipMe(aMe, -1.0F, 1.0F);
bUser = Clip(*a, -1.0F, 1.0F);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) &&
aMe->CheckData(answer, aUnitNum, 1e-4F) &&
bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
aMeGPU->SetData(aData, aUnitNum);
/* call Clip function */
_Clip(aGPU, bGPU, -1.0F, 1.0F);
_ClipMe(aMeGPU, -1.0F, 1.0F);
bUserGPU = Clip(*aGPU, -1.0F, 1.0F);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) &&
aMeGPU->CheckData(answer, aUnitNum, 1e-4F) &&
bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] aDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] aDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Clip Function */
bool TestClip()
{
XPRINT(0, stdout, "[TEST Clip] set every entry to its clip value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestClip1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -16,19 +16,19 @@ ...@@ -16,19 +16,19 @@
*/ */
/* /*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-06-15 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-08-03
*/ */
#ifndef __TEST_MATRIXMULBATCHEDCPU_H__ #ifndef __TEST_CLIP_H__
#define __TEST_MATRIXMULBATCHEDCPU_H__ #define __TEST_CLIP_H__
#include "../core/arithmetic/MatrixMULBatchedCPU.h" #include "../core/math/Clip.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for MatrixMulBatchedCPU Function */ /* test for Clip Function */
extern "C" extern "C"
bool TestMatrixMulBatchedCPU(); bool TestClip();
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_MATRIXMULBATCHEDCPU_H__ #endif // __TEST_CLIP_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TCos.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Cos function.
Set every entry to its cosine value.
*/
bool TestCos1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {0.5403F, -0.4161F},
{0.5403F, -0.4161F},
{1.0F, 0.8776F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Cos function */
_Cos(a, b);
_CosMe(aMe);
bUser = Cos(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Cos function */
_Cos(aGPU, bGPU);
_CosMe(aMeGPU);
bUserGPU = Cos(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Cos Function */
bool TestCos()
{
XPRINT(0, stdout, "[TEST Cos] set every entry to its cosine value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestCos1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_SIN_H__
#define __TEST_SIN_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Sin Function */
bool TestSin();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SIN_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_COS_H__
#define __TEST_COS_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Cos Function */
bool TestCos();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_COS_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "TDiv.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: element-wise division of two tensors
c(i) = a(i)/b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0.
*/
bool TestDiv1()
{
/* a source tensor of size (2, 2) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 2;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 2) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 2;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 2) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 2;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE sData2[2][2] = { {1.0F, 1.0F},
{4.0F, 9.0F} };
DTYPE answer[2][2] = { {0.0F, 1.0F},
{0.5F, 0.3333F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
XTensor * tMe = NewTensor(tOrder, tDimSize);
XTensor tUser;
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
tMe->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call Div function */
_Div(s1, s2, t, 0, 0);
_DivMe(tMe, s2, 0, 0);
tUser = Div(*s1, *s2, 0);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum, 1e-4F) &&
tMe->CheckData(answer, tUnitNum, 1e-4F) &&
tUser.CheckData(answer, tUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor tUserGPU;
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
tMeGPU->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call Div function */
_Div(sGPU1, sGPU2, tGPU, 0, 0);
_DivMe(tMeGPU, sGPU2, 0, 0);
tUserGPU = Div(*sGPU1, *sGPU2, 0);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum, 1e-4F) &&
tMeGPU->CheckData(answer, tUnitNum, 1e-4F) &&
tUserGPU.CheckData(answer, tUnitNum, 1e-4F);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete tMeGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Div Function */
bool TestDiv()
{
XPRINT(0, stdout, "[TEST Div] element-wise division of two tensors \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestDiv1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __TEST_DIV_H__
#define __TEST_DIV_H__
#include "../core/arithmetic/Div.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Div Function */
extern "C"
bool TestDiv();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_DIV_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TExp.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Exp function.
Set every entry to its exponential value.
*/
bool TestExp1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {2.7183F, 7.3891F},
{0.3679F, 0.1353F},
{1.0F, 1.6487F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Exp function */
_Exp(a, b);
_ExpMe(aMe);
bUser = Exp(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) &&
aMe->CheckData(answer, unitNum, 1e-4F) &&
bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Exp function */
_Exp(aGPU, bGPU);
_ExpMe(aMeGPU);
bUserGPU = Exp(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) &&
aMeGPU->CheckData(answer, unitNum, 1e-4F) &&
bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Exp Function */
bool TestExp()
{
XPRINT(0, stdout, "[TEST Exp] set every entry to its exponent value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestExp1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -16,20 +16,16 @@ ...@@ -16,20 +16,16 @@
*/ */
/* /*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/ */
#ifndef __MATRIXMULBATCHEDCPU_H__ #ifndef __TEST_EXP_H__
#define __MATRIXMULBATCHEDCPU_H__ #define __TEST_EXP_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* matrix multiplication in batch mode (CPU code) */ /* test for Exp Function */
void _MatrixMULBatchedCPU(const XList * a, MATRIX_TRANS_TYPE transposedA, const XList * b, MATRIX_TRANS_TYPE transposedB, bool TestExp();
XList * c, DTYPE alpha = (DTYPE)1.0, DTYPE beta = 0);
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_EXP_H__
#endif // __MATRIXMULBATCHEDCPU_H__
\ No newline at end of file
...@@ -19,6 +19,7 @@ ...@@ -19,6 +19,7 @@
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/ */
#include "../core/math/Unary.h"
#include "TLog.h" #include "TLog.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
...@@ -30,14 +31,14 @@ Set every entry to its log value. ...@@ -30,14 +31,14 @@ Set every entry to its log value.
bool TestLog1() bool TestLog1()
{ {
/* a tensor of size (3, 2) */ /* a tensor of size (3, 2) */
int aOrder = 2; int order = 2;
int * aDimSize = new int[aOrder]; int * dimSize = new int[order];
aDimSize[0] = 3; dimSize[0] = 3;
aDimSize[1] = 2; dimSize[1] = 2;
int aUnitNum = 1; int unitNum = 1;
for (int i = 0; i < aOrder; i++) for (int i = 0; i < order; i++)
aUnitNum *= aDimSize[i]; unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F}, DTYPE aData[3][2] = { {1.0F, 2.0F},
{0.5F, 4.0F}, {0.5F, 4.0F},
...@@ -50,14 +51,14 @@ bool TestLog1() ...@@ -50,14 +51,14 @@ bool TestLog1()
bool cpuTest = true; bool cpuTest = true;
/* create tensors */ /* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize); XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(aOrder, aDimSize); XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(aOrder, aDimSize); XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser; XTensor bUser;
/* initialize variables */ /* initialize variables */
a->SetData(aData, aUnitNum); a->SetData(aData, unitNum);
aMe->SetData(aData, aUnitNum); aMe->SetData(aData, unitNum);
/* call Log function */ /* call Log function */
_Log(a, b); _Log(a, b);
...@@ -65,21 +66,21 @@ bool TestLog1() ...@@ -65,21 +66,21 @@ bool TestLog1()
bUser = Log(*a); bUser = Log(*a);
/* check results */ /* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F) && aMe->CheckData(answer, aUnitNum, 1e-4F) && bUser.CheckData(answer, aUnitNum, 1e-4F); cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
bool gpuTest = true; bool gpuTest = true;
/* create tensor */ /* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0); XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU; XTensor bUserGPU;
/* Initialize variables */ /* Initialize variables */
aGPU->SetData(aData, aUnitNum); aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, aUnitNum); aMeGPU->SetData(aData, unitNum);
/* call Log function */ /* call Log function */
_Log(aGPU, bGPU); _Log(aGPU, bGPU);
...@@ -87,7 +88,7 @@ bool TestLog1() ...@@ -87,7 +88,7 @@ bool TestLog1()
bUserGPU = Log(*aGPU); bUserGPU = Log(*aGPU);
/* check results */ /* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F) && aMeGPU->CheckData(answer, aUnitNum, 1e-4F) && bUserGPU.CheckData(answer, aUnitNum, 1e-4F); gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */ /* destroy variables */
delete a; delete a;
...@@ -96,7 +97,7 @@ bool TestLog1() ...@@ -96,7 +97,7 @@ bool TestLog1()
delete aGPU; delete aGPU;
delete bGPU; delete bGPU;
delete aMeGPU; delete aMeGPU;
delete[] aDimSize; delete[] dimSize;
return cpuTest && gpuTest; return cpuTest && gpuTest;
#else #else
...@@ -104,7 +105,7 @@ bool TestLog1() ...@@ -104,7 +105,7 @@ bool TestLog1()
delete a; delete a;
delete b; delete b;
delete aMe; delete aMe;
delete[] aDimSize; delete[] dimSize;
return cpuTest; return cpuTest;
#endif // USE_CUDA #endif // USE_CUDA
......
...@@ -22,8 +22,6 @@ ...@@ -22,8 +22,6 @@
#ifndef __TEST_LOG_H__ #ifndef __TEST_LOG_H__
#define __TEST_LOG_H__ #define __TEST_LOG_H__
#include "../core/math/Log.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Log Function */ /* test for Log Function */
......
...@@ -16,8 +16,8 @@ ...@@ -16,8 +16,8 @@
*/ */
/* /*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02 * $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-02
*/ */
#ifndef __TEST_LOGSOFTMAX_H__ #ifndef __TEST_LOGSOFTMAX_H__
#define __TEST_LOGSOFTMAX_H__ #define __TEST_LOGSOFTMAX_H__
......
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-06-15
*/
#include "TMatrixMULBatchedCPU.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: matrix multiplication in batch mode (CPU code).
In this case, aList=2*(2, 3), bList=2*(3, 2) -> c=2*(2, 2), transposedA=X_NOTRANS, transposedB=X_NOTRANS.
*/
bool TestMatrixMulBatchedCPU1()
{
/* create list */
XList * aList = new XList();
XList * bList = new XList();
XList * cList = new XList();
/* a source tensor of size (2, 3) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 3;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a source tensor of size (3, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 3;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
/* a target tensor of size (2, 2) */
int cOrder = 2;
int * cDimSize = new int[cOrder];
cDimSize[0] = 2;
cDimSize[1] = 2;
int cUnitNum = 1;
for (int i = 0; i < cOrder; i++)
cUnitNum *= cDimSize[i];
DTYPE aData1[2][3] = { {1.0F, 2.0F, 3.0F},
{-4.0F, 5.0F, 6.0F} };
DTYPE aData2[2][3] = { {1.0F, -2.0F, -3.0F},
{-4.0F, 3.0F, 2.0F} };
DTYPE bData1[3][2] = { {0.0F, -1.0F},
{1.0F, 2.0F},
{2.0F, 1.0F} };
DTYPE bData2[3][2] = { {0.0F, 1.0F},
{3.0F, 2.0F},
{2.0F, 1.0F} };
DTYPE answer1[2][2] = { {8.0F, 6.0F},
{17.0F, 20.0F} };
DTYPE answer2[2][2] = { {-12.0F, -6.0F},
{13.0F, 4.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a1 = NewTensor(aOrder, aDimSize);
XTensor * a2 = NewTensor(aOrder, aDimSize);
XTensor * b1 = NewTensor(bOrder, bDimSize);
XTensor * b2 = NewTensor(bOrder, bDimSize);
XTensor * c1 = NewTensor(cOrder, cDimSize);
XTensor * c2 = NewTensor(cOrder, cDimSize);
/* initialize variables */
a1->SetData(aData1, aUnitNum);
a2->SetData(aData2, aUnitNum);
b1->SetData(bData1, bUnitNum);
b2->SetData(bData2, bUnitNum);
c1->SetZeroAll();
c2->SetZeroAll();
/* add tensors to list */
aList->Add(a1);
aList->Add(a2);
bList->Add(b1);
bList->Add(b2);
cList->Add(c1);
cList->Add(c2);
/* call MatrixMULBatchedCPU function */
_MatrixMULBatchedCPU(aList, X_NOTRANS, bList, X_NOTRANS, cList);
/* check results */
cpuTest = c1->CheckData(answer1, cUnitNum) && c2->CheckData(answer2, cUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensors */
XTensor * aGPU1 = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * aGPU2 = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU1 = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU2 = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU1 = NewTensor(cOrder, cDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU2 = NewTensor(cOrder, cDimSize, X_FLOAT, 1.0F, 0);
/* initialize variables */
aGPU1->SetData(aData1, aUnitNum);
aGPU2->SetData(aData2, aUnitNum);
bGPU1->SetData(bData1, bUnitNum);
bGPU2->SetData(bData2, bUnitNum);
cGPU1->SetZeroAll();
cGPU2->SetZeroAll();
/* clear list */
aList->Clear();
bList->Clear();
cList->Clear();
/* add tensors to list */
aList->Add(aGPU1);
aList->Add(aGPU2);
bList->Add(bGPU1);
bList->Add(bGPU2);
cList->Add(cGPU1);
cList->Add(cGPU2);
/* call MatrixMULBatchedCPU function */
_MatrixMULBatchedCPU(aList, X_NOTRANS, bList, X_NOTRANS, cList);
/* check results */
gpuTest = cGPU1->CheckData(answer1, cUnitNum) && gpuTest;
gpuTest = cGPU2->CheckData(answer2, cUnitNum) && gpuTest;
/* destroy variables */
delete a1;
delete a2;
delete b1;
delete b2;
delete c1;
delete c2;
delete aGPU1;
delete aGPU2;
delete bGPU1;
delete bGPU2;
delete cGPU1;
delete cGPU2;
delete[] aDimSize;
delete[] bDimSize;
delete[] cDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a1;
delete a2;
delete b1;
delete b2;
delete c1;
delete c2;
delete[] aDimSize;
delete[] bDimSize;
delete[] cDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for MatrixMulBatchedCPU Function */
extern "C"
bool TestMatrixMulBatchedCPU()
{
XPRINT(0, stdout, "[TEST MATRIXMULBATCHEDCPU] matrix multiplication in batch mode (CPU code) \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestMatrixMulBatchedCPU1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
...@@ -25,133 +25,10 @@ namespace nts { // namespace nts(NiuTrans.Tensor) ...@@ -25,133 +25,10 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
/* /*
case 1: element-wise product of two tensors case 1: element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 1) (2, 1) -> (2, 1), leadingDim=0, alpha=0.
*/
bool TestMultiply1()
{
/* a source tensor of size (2, 1) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 1;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 1) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 1;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 1) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 1;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][1] = { {0.0F},
{1.0F} };
DTYPE sData2[2][1] = { {2.0F},
{3.0F} };
DTYPE answer[2][1] = { {0.0F},
{3.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
XTensor * tMe = NewTensor(tOrder, tDimSize);
XTensor tUser;
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
tMe->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call Multiply function */
_Multiply(s1, s2, t, 0, 0);
_MultiplyMe(tMe, s2, 0, 0);
tUser = Multiply(*s1, *s2, 0);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum)
&& tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor * tMeGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
XTensor tUserGPU;
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
tMeGPU->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call Multiply function */
_Multiply(sGPU1, sGPU2, tGPU, 0, 0);
_MultiplyMe(tMeGPU, sGPU2, 0, 0);
tUserGPU = Multiply(*sGPU1, *sGPU2, 0);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum)
&& tMeGPU->CheckData(answer, tUnitNum) && tUserGPU.CheckData(answer, tUnitNum);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete tMeGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete tMe;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: element-wise product of two tensors
c(i) = a(i)*b(i) + \alpha * c(i) c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0. In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=0, alpha=0.
*/ */
bool TestMultiply2() bool TestMultiply1()
{ {
/* a source tensor of size (2, 2) */ /* a source tensor of size (2, 2) */
int sOrder1 = 2; int sOrder1 = 2;
...@@ -212,8 +89,9 @@ bool TestMultiply2() ...@@ -212,8 +89,9 @@ bool TestMultiply2()
tUser = Multiply(*s1, *s2, 0); tUser = Multiply(*s1, *s2, 0);
/* check results */ /* check results */
cpuTest = t->CheckData(answer, tUnitNum) cpuTest = t->CheckData(answer, tUnitNum) &&
&& tMe->CheckData(answer, tUnitNum) && tUser.CheckData(answer, tUnitNum); tMe->CheckData(answer, tUnitNum) &&
tUser.CheckData(answer, tUnitNum);
#ifdef USE_CUDA #ifdef USE_CUDA
/* GPU test */ /* GPU test */
...@@ -270,113 +148,6 @@ bool TestMultiply2() ...@@ -270,113 +148,6 @@ bool TestMultiply2()
#endif // USE_CUDA #endif // USE_CUDA
} }
/*
case 3: element-wise product of two tensors, c(i) = a(i)*b(i) + \alpha * c(i)
In this case, (2, 2) (2, 2) -> (2, 2), leadingDim=1, alpha=0.
*/
bool TestMultiply3()
{
/* a source tensor of size (2, 2) */
int sOrder1 = 2;
int * sDimSize1 = new int[sOrder1];
sDimSize1[0] = 2;
sDimSize1[1] = 2;
int sUnitNum1 = 1;
for (int i = 0; i < sOrder1; i++)
sUnitNum1 *= sDimSize1[i];
/* a source tensor of size (2, 2) */
int sOrder2 = 2;
int * sDimSize2 = new int[sOrder2];
sDimSize2[0] = 2;
sDimSize2[1] = 2;
int sUnitNum2 = 1;
for (int i = 0; i < sOrder2; i++)
sUnitNum2 *= sDimSize2[i];
/* a target tensor of size (2, 2) */
int tOrder = 2;
int * tDimSize = new int[tOrder];
tDimSize[0] = 2;
tDimSize[1] = 2;
int tUnitNum = 1;
for (int i = 0; i < tOrder; i++)
tUnitNum *= tDimSize[i];
DTYPE sData1[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE sData2[2][2] = { {0.0F, 1.0F},
{2.0F, 3.0F} };
DTYPE answer[2][2] = { {0.0F, 1.0F},
{4.0F, 9.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * s1 = NewTensor(sOrder1, sDimSize1);
XTensor * s2 = NewTensor(sOrder2, sDimSize2);
XTensor * t = NewTensor(tOrder, tDimSize);
/* initialize variables */
s1->SetData(sData1, sUnitNum1);
s2->SetData(sData2, sUnitNum2);
t->SetZeroAll();
/* call MultiplyElementWise function */
_Multiply(s1, s2, t, 0, 1);
/* check results */
cpuTest = t->CheckData(answer, tUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * sGPU1 = NewTensor(sOrder1, sDimSize1, X_FLOAT, 1.0F, 0);
XTensor * sGPU2 = NewTensor(sOrder2, sDimSize2, X_FLOAT, 1.0F, 0);
XTensor * tGPU = NewTensor(tOrder, tDimSize, X_FLOAT, 1.0F, 0);
/* Initialize variables */
sGPU1->SetData(sData1, sUnitNum1);
sGPU2->SetData(sData2, sUnitNum2);
tGPU->SetZeroAll();
/* call MultiplyElementWise function */
_Multiply(sGPU1, sGPU2, tGPU, 0, 1);
/* check results */
gpuTest = tGPU->CheckData(answer, tUnitNum);
/* destroy variables */
delete s1;
delete s2;
delete t;
delete sGPU1;
delete sGPU2;
delete tGPU;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete s1;
delete s2;
delete t;
delete[] sDimSize1;
delete[] sDimSize2;
delete[] tDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */ /* other cases */
/* /*
TODO!! TODO!!
...@@ -398,26 +169,6 @@ bool TestMultiply() ...@@ -398,26 +169,6 @@ bool TestMultiply()
else else
XPRINT(0, stdout, ">> case 1 passed!\n"); XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestMultiply2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* case 3 test */
caseFlag = TestMultiply3();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 3 failed!\n");
}
else
XPRINT(0, stdout, ">> case 3 passed!\n");
/* other cases test */ /* other cases test */
/* /*
TODO!! TODO!!
......
...@@ -19,16 +19,17 @@ ...@@ -19,16 +19,17 @@
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15 * $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15
*/ */
#ifndef __TEST_MULTIPLYELEMENTWISE_H__ #ifndef __TEST_MULTIPLY_H__
#define __TEST_MULTIPLYELEMENTWISE_H__ #define __TEST_MULTIPLY_H__
#include "../core/arithmetic/Multiply.h" #include "../core/arithmetic/Multiply.h"
namespace nts { // namespace nts(NiuTrans.Tensor) namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for MultiplyElementWise Function */ /* test for Multiply Function */
extern "C" extern "C"
bool TestMultiply(); bool TestMultiply();
} // namespace nts(NiuTrans.Tensor) } // namespace nts(NiuTrans.Tensor)
#endif // __TEST_MULTIPLYELEMENTWISE_H__
#endif // __TEST_MULTIPLY_H__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TRound.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Round function.
Set every entry to its rounded value.
*/
bool TestRound1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.3F, 2.7F},
{-1.3F, -2.7F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {1.0F, 3.0F},
{-1.0F, -3.0F},
{0.0F, 1.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Round function */
_Round(a, b);
_RoundMe(aMe);
bUser = Round(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) &&
aMe->CheckData(answer, unitNum, 1e-4F) &&
bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Round function */
_Round(aGPU, bGPU);
_RoundMe(aMeGPU);
bUserGPU = Round(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) &&
aMeGPU->CheckData(answer, unitNum, 1e-4F) &&
bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
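/* A minimal usage sketch (hypothetical helper, illustration only, not part of
   the test suite), showing the three calling conventions that every case above
   exercises: function style writing into a preallocated result, object style
   returning a new XTensor, and the in-place "-Me" style. */
void RoundUsageSketch()
{
    int dims[2] = {3, 2};
    DTYPE data[3][2] = { {1.3F, 2.7F},
                         {-1.3F, -2.7F},
                         {0.0F, 0.5F} };

    XTensor * x = NewTensor(2, dims);
    XTensor * y = NewTensor(2, dims);
    x->SetData(data, 6);

    _Round(x, y);            /* function style: writes round(x) into y */
    XTensor z = Round(*x);   /* object style: returns the result tensor */
    _RoundMe(x);             /* in-place style: overwrites x */

    delete x;
    delete y;
}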
/* other cases */
/*
TODO!!
*/
/* test for Round Function */
bool TestRound()
{
XPRINT(0, stdout, "[TEST Round] set every entry to its round value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestRound1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
@@ -16,26 +16,16 @@
*/
/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-03
*/
-#include "Absolute.h"
+#ifndef __TEST_ROUND_H__
+#define __TEST_ROUND_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
-#ifdef USE_CUDA
-/* set each entry to its absolute value (CUDA Kernel) */
-__global__
-void KernelAbsolute(DTYPE * a, DTYPE * b, int size);
-/* set each entry to its absolute value (CUDA Kernel) with float16 data type*/
-__global__
-void KernelAbsolute(__half * a, __half * b, int size);
-/* set each entry to its absolute value */
-void _CudaAbsolute(const XTensor * a, XTensor * b);
-#endif // USE_CUDA
+/* test for Round Function */
+bool TestRound();
} // namespace nts(NiuTrans.Tensor)
+#endif // __TEST_ROUND_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TSin.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Sin function.
Set every entry to its sine value.
*/
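/* The expected values below are sin(x) to four decimal places
   (e.g. sin(1.0) = 0.841471 -> 0.8415F), which is why the checks
   use a tolerance of 1e-4F. */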
bool TestSin1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {0.8415F, 0.9093F},
{-0.8415F, -0.9093F},
{0.0F, 0.4794F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Sin function */
_Sin(a, b);
_SinMe(aMe);
bUser = Sin(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Sin function */
_Sin(aGPU, bGPU);
_SinMe(aMeGPU);
bUserGPU = Sin(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Sin Function */
bool TestSin()
{
XPRINT(0, stdout, "[TEST Sin] set every entry to its sine value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSin1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
@@ -16,31 +16,16 @@
*/
/*
-* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-7-11
+* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
-#ifndef __LOG_CUH__
+#ifndef __TEST_SIN_H__
-#define __LOG_CUH__
+#define __TEST_SIN_H__
-#include "Log.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
-#ifdef USE_CUDA
-/* set each entry to its log value (CUDA Kernel) */
-__global__
-void KernelLog(DTYPE * a, DTYPE * b, int size);
-/* set each entry to its log value (CUDA Kernel) with float16 data type*/
-__global__
-void KernelLog(__half * a, __half * b, int size);
-/* set each entry to its log value */
-void _CudaLog(const XTensor * a, XTensor * b);
-#endif // USE_CUDA
+/* test for Sin Function */
+bool TestSin();
} // namespace nts(NiuTrans.Tensor)
-#endif // __LOG_CUH__
+#endif // __TEST_SIN_H__
\ No newline at end of file
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#include "TSub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* case 1: tensor subtraction c = a - b * \beta */
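/* beta defaults to 1.0, so this case checks plain element-wise subtraction:
   e.g. answer[0][0] = 0.0F - 1.0F = -1.0F */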
bool TestSub1()
{
/* a tensor of size (2, 4) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 2;
dimSize[1] = 4;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
{-7.0F, -9.0F, -11.0F, -13.0F} };
DTYPE answer[2][4] = { {-1.0F, 2.0F, 5.0F, 8.0F},
{11.0F, 14.0F, 17.0F, 20.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * c = NewTensor(order, dimSize);
XTensor * cMe = NewTensor(order, dimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, unitNum);
cMe->SetData(aData, unitNum);
b->SetData(bData, unitNum);
c->SetZeroAll();
/* call Sub function */
_Sub(a, b, c);
_SubMe(cMe, b);
cUser = Sub(*a, *b);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
cMeGPU->SetData(aData, unitNum);
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
/* call Sub function */
_Sub(aGPU, bGPU, cGPU);
_SubMe(cMeGPU, bGPU);
cUserGPU = Sub(*aGPU, *bGPU);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
&& cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* case 2: tensor subtraction c = a - b * \beta */
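/* with beta = 0.5, each entry is c[i][j] = a[i][j] - b[i][j] * 0.5,
   e.g. answer[0][0] = 0.0F - 1.0F * 0.5F = -0.5F */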
bool TestSub2()
{
/* a tensor of size (2, 4) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 2;
dimSize[1] = 4;
int unitNum = 1;
for (int i = 0; i < order; i++) {
unitNum *= dimSize[i];
}
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
{-7.0F, -9.0F, -11.0F, -13.0F} };
DTYPE answer[2][4] = { {-0.5F, 1.5F, 3.5F, 5.5F},
{7.5F, 9.5F, 11.5F, 13.5F} };
float beta = 0.5F;
/* CPU test */
bool cpuTest = true;
/* create tensor */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * c = NewTensor(order, dimSize);
XTensor * cMe = NewTensor(order, dimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, unitNum);
cMe->SetData(aData, unitNum);
b->SetData(bData, unitNum);
c->SetZeroAll();
/* call Sub function */
_Sub(a, b, c, beta);
_SubMe(cMe, b, beta);
cUser = Sub(*a, *b, beta);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
cMeGPU->SetData(aData, unitNum);
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
/* call Sub function */
_Sub(aGPU, bGPU, cGPU, beta);
_SubMe(cMeGPU, bGPU, beta);
cUserGPU = Sub(*aGPU, *bGPU, beta);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum, 1e-4F)
&& cMeGPU->CheckData(answer, unitNum, 1e-4F) && cUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
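/* A minimal usage sketch (hypothetical helper, illustration only, not part of
   the test suite): the same three calling conventions apply to the scaled
   subtraction c = a - b * beta tested above; beta may be omitted, in which
   case it defaults to 1.0 as in case 1. */
void SubUsageSketch()
{
    int dims[2] = {2, 4};
    DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
                          {4.0F, 5.0F, 6.0F, 7.0F} };
    DTYPE bData[2][4] = { {1.0F, -1.0F, -3.0F, -5.0F},
                          {-7.0F, -9.0F, -11.0F, -13.0F} };

    XTensor * a = NewTensor(2, dims);
    XTensor * b = NewTensor(2, dims);
    XTensor * c = NewTensor(2, dims);
    a->SetData(aData, 8);
    b->SetData(bData, 8);

    _Sub(a, b, c, 0.5F);             /* function style: c = a - b * 0.5 */
    XTensor d = Sub(*a, *b, 0.5F);   /* object style: returns a new tensor */
    _SubMe(a, b, 0.5F);              /* in-place style: a = a - b * 0.5 */

    delete a;
    delete b;
    delete c;
}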
/* other cases */
/*
TODO!!
*/
/* test for Sub Function */
bool TestSub()
{
XPRINT(0, stdout, "[TEST SUB] tensor subtraction c = a - b * beta\n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSub1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestSub2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-08-01
*/
#ifndef __TEST_SUB_H__
#define __TEST_SUB_H__
#include "../core/arithmetic/Sub.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Sub Function */
bool TestSub();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SUB_H__
@@ -16,8 +16,8 @@
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
*/
#include "TSum.h"
@@ -59,14 +59,14 @@ bool TestSum1()
b->SetData(bData, unitNum);
c->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(a, b, c);
_SumMe(cMe, b);
cUser = Sum(*a, *b);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
@@ -85,14 +85,14 @@ bool TestSum1()
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(aGPU, bGPU, cGPU);
_SumMe(cMeGPU, bGPU);
cUserGPU = Sum(*aGPU, *bGPU);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum)
&& cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum);
/* destroy variables */
delete a;
@@ -155,14 +155,14 @@ bool TestSum2()
b->SetData(bData, unitNum);
c->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(a, b, c, beta);
_SumMe(cMe, b, beta);
cUser = Sum(*a, *b, beta);
/* check results */
cpuTest = c->CheckData(answer, unitNum)
&& cMe->CheckData(answer, unitNum) && cUser.CheckData(answer, unitNum);
#ifdef USE_CUDA
/* GPU test */
@@ -181,14 +181,14 @@ bool TestSum2()
bGPU->SetData(bData, unitNum);
cGPU->SetZeroAll();
-/* call sum function */
+/* call Sum function */
_Sum(aGPU, bGPU, cGPU, beta);
_SumMe(cMeGPU, bGPU, beta);
cUserGPU = Sum(*aGPU, *bGPU, beta);
/* check results */
gpuTest = cGPU->CheckData(answer, unitNum)
&& cMeGPU->CheckData(answer, unitNum) && cUserGPU.CheckData(answer, unitNum);
/* destroy variables */
delete a;
......
@@ -16,8 +16,8 @@
*/
/*
* $Created by: LI Yinqiao (li.yin.qiao.2012@hotmail.com) 2018-04-30
*/
#ifndef __TEST_SUM_H__
#define __TEST_SUM_H__
......
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
*/
#include "TSumDim.h"
#include "../core/arithmetic/SumDim.h"
#include "../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: tensor summation c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
*/
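/* here b matches dimension 0 of a (size 2), so each row i of a is shifted
   by b[i]: row 0 gets +1.0F, row 1 gets -1.0F */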
bool TestSumDim1()
{
/* a tensor of size (2, 4) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 4;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2) */
int bOrder = 1;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2] = {1.0F, -1.0F};
DTYPE answer[2][4] = { {1.0F, 2.0F, 3.0F, 4.0F},
{3.0F, 4.0F, 5.0F, 6.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * c = NewTensor(aOrder, aDimSize);
XTensor * cMe = NewTensor(aOrder, aDimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
cMe->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
c->SetZeroAll();
/* call SumDim function */
_SumDim(a, b, c, 0);
_SumDim(cMe, b, 0);
cUser = SumDim(*a, *b, 0);
/* check results */
cpuTest = c->CheckData(answer, aUnitNum)
&& cMe->CheckData(answer, aUnitNum)
&& cUser.CheckData(answer, aUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
cMeGPU->SetData(aData, aUnitNum);
bGPU->SetData(bData, bUnitNum);
cGPU->SetZeroAll();
/* call sum function */
_SumDim(aGPU, bGPU, cGPU, 0);
_SumDim(cMeGPU, bGPU, 0);
cUserGPU = SumDim(*aGPU, *bGPU, 0);
/* check results */
gpuTest = cGPU->CheckData(answer, aUnitNum)
&& cMeGPU->CheckData(answer, aUnitNum)
&& cUserGPU.CheckData(answer, aUnitNum);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: tensor summation c = a + b * \beta
where the size of b is equal to the n-th dimension of a,
i.e., a is summed with b by broadcasting
*/
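/* here b holds 4 entries, matching the size of dimension 1 of a, so its
   flattened values {1, -1, -1, 1} are added along each row:
   answer[0] = {0+1, 1-1, 2-1, 3+1} = {1, 0, 1, 4} */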
bool TestSumDim2()
{
/* a tensor of size (2, 4) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 2;
aDimSize[1] = 4;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 2) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 2;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
{4.0F, 5.0F, 6.0F, 7.0F} };
DTYPE bData[2][2] = { {1.0F, -1.0F},
{-1.0F, 1.0F} };
DTYPE answer[2][4] = { {1.0F, 0.0F, 1.0F, 4.0F},
{5.0F, 4.0F, 5.0F, 8.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor * c = NewTensor(aOrder, aDimSize);
XTensor * cMe = NewTensor(aOrder, aDimSize);
XTensor cUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
cMe->SetData(aData, aUnitNum);
b->SetData(bData, bUnitNum);
c->SetZeroAll();
/* call SumDim function */
_SumDim(a, b, c, 1);
_SumDim(cMe, b, 1);
cUser = SumDim(*a, *b, 1);
/* check results */
cpuTest = c->CheckData(answer, aUnitNum)
&& cMe->CheckData(answer, aUnitNum)
&& cUser.CheckData(answer, aUnitNum);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor * cGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * cMeGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor cUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
cMeGPU->SetData(aData, aUnitNum);
bGPU->SetData(bData, bUnitNum);
cGPU->SetZeroAll();
/* call sum function */
_SumDim(aGPU, bGPU, cGPU, 1);
_SumDim(cMeGPU, bGPU, 1);
cUserGPU = SumDim(*aGPU, *bGPU, 1);
/* check results */
gpuTest = cGPU->CheckData(answer, aUnitNum)
&& cMeGPU->CheckData(answer, aUnitNum)
&& cUserGPU.CheckData(answer, aUnitNum);
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete aGPU;
delete bGPU;
delete cGPU;
delete cMeGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete c;
delete cMe;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
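/* A minimal usage sketch (hypothetical helper, illustration only, not part of
   the test suite): a common use of the broadcasting tested above is adding a
   bias vector to every row of a matrix along the matching dimension. */
void SumDimUsageSketch()
{
    int xDims[2] = {2, 4};
    int biasDims[1] = {4};
    DTYPE xData[2][4] = { {0.0F, 1.0F, 2.0F, 3.0F},
                          {4.0F, 5.0F, 6.0F, 7.0F} };
    DTYPE biasData[4] = {0.1F, 0.2F, 0.3F, 0.4F};

    XTensor * x = NewTensor(2, xDims);
    XTensor * bias = NewTensor(1, biasDims);
    x->SetData(xData, 8);
    bias->SetData(biasData, 4);

    /* bias has the same size as dimension 1 of x, so it is broadcast
       over rows, in place: x[i][j] += bias[j] */
    _SumDim(x, bias, 1);

    delete x;
    delete bias;
}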
/* other cases */
/*
TODO!!
*/
/* test for SumDim Function */
bool TestSumDim()
{
XPRINT(0, stdout, "[TEST SUMDIM] tensor summation c = a + b * beta by broadcasting\n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestSumDim1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestSumDim2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
 * I finished my summer holidays and went back to studying.
*/
#ifndef __TEST_SUMDIM_H__
#define __TEST_SUMDIM_H__
#include "../core/arithmetic/SumDim.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for SumDim Function */
extern "C"
bool TestSumDim();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_SUMDIM_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#include "../core/math/Unary.h"
#include "TTan.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Tan function.
Set every entry to its tangent value.
*/
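/* Reference values are tan(x) to four decimal places; note that
   tan(2.0) = -2.1850F is negative because 2 rad lies past pi/2. */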
bool TestTan1()
{
/* a tensor of size (3, 2) */
int order = 2;
int * dimSize = new int[order];
dimSize[0] = 3;
dimSize[1] = 2;
int unitNum = 1;
for (int i = 0; i < order; i++)
unitNum *= dimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{-1.0F, -2.0F},
{0.0F, 0.5F} };
DTYPE answer[3][2] = { {1.5574F, -2.1850F},
{-1.5574F, 2.1850F},
{0.0F, 0.5463F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(order, dimSize);
XTensor * b = NewTensor(order, dimSize);
XTensor * aMe = NewTensor(order, dimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, unitNum);
aMe->SetData(aData, unitNum);
/* call Tan function */
_Tan(a, b);
_TanMe(aMe);
bUser = Tan(*a);
/* check results */
cpuTest = b->CheckData(answer, unitNum, 1e-4F) && aMe->CheckData(answer, unitNum, 1e-4F) && bUser.CheckData(answer, unitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor * aMeGPU = NewTensor(order, dimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, unitNum);
aMeGPU->SetData(aData, unitNum);
/* call Tan function */
_Tan(aGPU, bGPU);
_TanMe(aMeGPU);
bUserGPU = Tan(*aGPU);
/* check results */
gpuTest = bGPU->CheckData(answer, unitNum, 1e-4F) && aMeGPU->CheckData(answer, unitNum, 1e-4F) && bUserGPU.CheckData(answer, unitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aMe;
delete aGPU;
delete bGPU;
delete aMeGPU;
delete[] dimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete aMe;
delete[] dimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Tan Function */
bool TestTan()
{
XPRINT(0, stdout, "[TEST Tan] set every entry to its tangent value \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestTan1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-31
*/
#ifndef __TEST_TAN_H__
#define __TEST_TAN_H__
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Tan Function */
bool TestTan();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_TAN_H__
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-12
*/
#include "TTranspose.h"
#include "../core/movement/CopyValues.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
case 1: test Transpose function.
tensor transposition of dimensions i and j
*/
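/* for a 2-D tensor this is the ordinary matrix transpose:
   b[j][i] = a[i][j], so a of shape (3, 2) becomes b of shape (2, 3) */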
bool TestTranspose1()
{
/* a tensor of size (3, 2) */
int aOrder = 2;
int * aDimSize = new int[aOrder];
aDimSize[0] = 3;
aDimSize[1] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 3) */
int bOrder = 2;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 3;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[3][2] = { {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} };
DTYPE answer[2][3] = { {1.0F, 3.0F, 5.0F},
{2.0F, 4.0F, 6.0F} };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(a, b, 0, 1);
bUser = Transpose(*a, 0, 1);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
&& bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(aGPU, bGPU, 0, 1);
bUserGPU = Transpose(*aGPU, 0, 1);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
&& bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/*
case 2: test Transpose function.
tensor transposition of dimensions i and j
*/
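/* Note: the expected data below is consistent with viewing a (4, 3, 2) as a
   (4, 6) matrix, transposing it to (6, 4), and reshaping the result to
   (2, 3, 4); i.e. dimension 0 is exchanged with the flattened trailing
   dimensions rather than with dimension 2 entry-wise. */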
bool TestTranspose2()
{
/* a tensor of size (4, 3, 2) */
int aOrder = 3;
int * aDimSize = new int[aOrder];
aDimSize[0] = 4;
aDimSize[1] = 3;
aDimSize[2] = 2;
int aUnitNum = 1;
for (int i = 0; i < aOrder; i++)
aUnitNum *= aDimSize[i];
/* a tensor of size (2, 3, 4) */
int bOrder = 3;
int * bDimSize = new int[bOrder];
bDimSize[0] = 2;
bDimSize[1] = 3;
bDimSize[2] = 4;
int bUnitNum = 1;
for (int i = 0; i < bOrder; i++)
bUnitNum *= bDimSize[i];
DTYPE aData[4][3][2] = { { {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} },
{ {2.0F, 4.0F},
{4.0F, 7.0F},
{6.0F, 8.0F} },
{ {1.0F, 2.0F},
{3.0F, 4.0F},
{5.0F, 6.0F} },
{ {2.0F, 4.0F},
{4.0F, 7.0F},
{6.0F, 8.0F} },};
DTYPE answer[2][3][4] = { { {1.0F, 2.0F, 1.0F, 2.0F},
{2.0F, 4.0F, 2.0F, 4.0F},
{3.0F, 4.0F, 3.0F, 4.0F} },
{ {4.0F, 7.0F, 4.0F, 7.0F},
{5.0F, 6.0F, 5.0F, 6.0F},
{6.0F, 8.0F, 6.0F, 8.0F} } };
/* CPU test */
bool cpuTest = true;
/* create tensors */
XTensor * a = NewTensor(aOrder, aDimSize);
XTensor * b = NewTensor(bOrder, bDimSize);
XTensor bUser;
/* initialize variables */
a->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(a, b, 0, 2);
bUser = Transpose(*a, 0, 2);
/* check results */
cpuTest = b->CheckData(answer, aUnitNum, 1e-4F)
&& bUser.CheckData(answer, aUnitNum, 1e-4F);
#ifdef USE_CUDA
/* GPU test */
bool gpuTest = true;
/* create tensor */
XTensor * aGPU = NewTensor(aOrder, aDimSize, X_FLOAT, 1.0F, 0);
XTensor * bGPU = NewTensor(bOrder, bDimSize, X_FLOAT, 1.0F, 0);
XTensor bUserGPU;
/* Initialize variables */
aGPU->SetData(aData, aUnitNum);
/* call Transpose function */
_Transpose(aGPU, bGPU, 0, 2);
bUserGPU = Transpose(*aGPU, 0, 2);
/* check results */
gpuTest = bGPU->CheckData(answer, aUnitNum, 1e-4F)
&& bUserGPU.CheckData(answer, aUnitNum, 1e-4F);
/* destroy variables */
delete a;
delete b;
delete aGPU;
delete bGPU;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest && gpuTest;
#else
/* destroy variables */
delete a;
delete b;
delete[] aDimSize;
delete[] bDimSize;
return cpuTest;
#endif // USE_CUDA
}
/* other cases */
/*
TODO!!
*/
/* test for Transpose Function */
bool TestTranspose()
{
XPRINT(0, stdout, "[TEST TRANSPOSE] tensor transposition with specified dimensions \n");
bool returnFlag = true, caseFlag = true;
/* case 1 test */
caseFlag = TestTranspose1();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 1 failed!\n");
}
else
XPRINT(0, stdout, ">> case 1 passed!\n");
/* case 2 test */
caseFlag = TestTranspose2();
if (!caseFlag) {
returnFlag = false;
XPRINT(0, stdout, ">> case 2 failed!\n");
}
else
XPRINT(0, stdout, ">> case 2 passed!\n");
/* other cases test */
/*
TODO!!
*/
if (returnFlag) {
XPRINT(0, stdout, ">> All Passed!\n");
}
else
XPRINT(0, stdout, ">> Failed!\n");
XPRINT(0, stdout, "\n");
return returnFlag;
}
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
 * Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: Xu Chen (email: hello_master1954@163.com) 2018-07-30
*/
#ifndef __TEST_TRANSPOSE_H__
#define __TEST_TRANSPOSE_H__
#include "../core/shape/Transpose.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Transpose Function */
bool TestTranspose();
} // namespace nts(NiuTrans.Tensor)
#endif // __TEST_TRANSPOSE_H__
@@ -30,17 +30,20 @@ bool Test()
XPRINT(0, stdout, "Testing the XTensor utilities ... \n\n");
wrong = !TestAbsolute() || wrong;
+wrong = !TestClip() || wrong;
wrong = !TestConcatenate() || wrong;
wrong = !TestConcatenateSolely() || wrong;
+wrong = !TestCos() || wrong;
wrong = !TestConvertDataType() || wrong;
wrong = !TestCopyIndexed() || wrong;
wrong = !TestCopyValues() || wrong;
+wrong = !TestDiv() || wrong;
+wrong = !TestExp() || wrong;
wrong = !TestLog() || wrong;
wrong = !TestMatrixMul() || wrong;
wrong = !TestMatrixMul2D() || wrong;
wrong = !TestMatrixMul2DParallel() || wrong;
wrong = !TestMatrixMulBatched() || wrong;
-wrong = !TestMatrixMulBatchedCPU() || wrong;
wrong = !TestMerge() || wrong;
wrong = !TestMultiply() || wrong;
wrong = !TestNegate() || wrong;
@@ -51,17 +54,23 @@ bool Test()
wrong = !TestReduceSum() || wrong;
wrong = !TestReduceSumSquared() || wrong;
wrong = !TestReduceVariance() || wrong;
+wrong = !TestRound() || wrong;
wrong = !TestScaleAndShift() || wrong;
wrong = !TestSelect() || wrong;
wrong = !TestSetAscendingOrder() || wrong;
wrong = !TestSetData() || wrong;
wrong = !TestSign() || wrong;
+wrong = !TestSin() || wrong;
wrong = !TestSort() || wrong;
wrong = !TestSplit() || wrong;
+wrong = !TestSub() || wrong;
wrong = !TestSum() || wrong;
wrong = !TestSumByColumnTV() || wrong;
wrong = !TestSumByColumnVT() || wrong;
-wrong = !TestTopK() || wrong;
+wrong = !TestSumDim() || wrong;
+wrong = !TestTan() || wrong;
+wrong = !TestTranspose() || wrong;
+//wrong = !TestTopK() || wrong;
wrong = !TestUnsqueeze() || wrong;
wrong = !TestXMem() || wrong;
......
@@ -23,17 +23,20 @@
#define __TEST_H__
#include "TAbsolute.h"
+#include "TClip.h"
#include "TConcatenate.h"
#include "TConcatenateSolely.h"
+#include "TCos.h"
#include "TConvertDataType.h"
#include "TCopyIndexed.h"
#include "TCopyValues.h"
+#include "TDiv.h"
+#include "TExp.h"
#include "TLog.h"
#include "TMatrixMul.h"
#include "TMatrixMul2D.h"
#include "TMatrixMul2DParallel.h"
#include "TMatrixMulBatched.h"
-#include "TMatrixMULBatchedCPU.h"
#include "TMerge.h"
#include "TMultiply.h"
#include "TNegate.h"
@@ -44,16 +47,22 @@
#include "TReduceSum.h"
#include "TReduceSumSquared.h"
#include "TReduceVariance.h"
+#include "TRound.h"
#include "TScaleAndShift.h"
#include "TSelect.h"
#include "TSetAscendingOrder.h"
#include "TSetData.h"
#include "TSign.h"
+#include "TSin.h"
#include "TSort.h"
#include "TSplit.h"
+#include "TSub.h"
#include "TSum.h"
#include "TSumByColumnTV.h"
#include "TSumByColumnVT.h"
+#include "TSumDim.h"
+#include "TTan.h"
+#include "TTranspose.h"
#include "TTopK.h"
#include "TUnsqueeze.h"
#include "TXMem.h"
......