Commit b30fad5f by xuchen

Modify the implementation of the unary and binary operations using templates. It's so cool!

parent f7c6fb3b
# NiuTrans.Tensor

## Configuring the NiuTrans.Tensor project with Visual Studio on Windows

### Notes

* We have only tested VS2015 with CUDA 9.0 and later; we do not know whether earlier versions have problems.
* VS2015 can be used as-is. When using a newer VS (e.g., VS2017), you must **install the component "VC++ 2015.3 v14.00 (v140) toolset for desktop"**.
* We recommend installing Visual Studio before CUDA. When installing CUDA, do not check Visual Studio Integration; it sometimes fails. After CUDA is installed, unpack the CUDA installer (the exe file can be extracted). Under CUDAVisualStudioIntegration\extras\visual_studio_integration\MSBuildExtensions there are four files; copy them to the path below.
* VS2015
> C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\v140\BuildCustomizations
* VS2017 (the two paths below correspond to the v140 toolset and the VS default toolset, respectively)
> C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\v140\BuildCustomizations
> C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\IDE\VC\VCTargets\BuildCustomizations

### Creating the project

* Create an empty VC++ project.
* Set the **solution platform** in the menu bar to x64 (the default is x86).
* In **Menu -> Project -> Properties**, set the platform to x64.
* Copy the source code (the source folder) into the project root, then choose **Menu -> Project -> Show All Files**. The source folder now appears in the solution; right-click source and choose "Include In Project" to add all of the *.h and *.cpp files to the project.

### CUDA settings (skip this step if you have no GPU)

Using CUDA in a VS project requires setting the relevant project properties. The following options can all be found under **Menu -> Project -> Properties**.

* In **VC++ Directories -> Include Directories**, add
> $(CUDA_PATH)\include
* In **VC++ Directories -> Library Directories**, add
> $(CUDA_PATH)\lib\Win32
* In **C/C++ -> Preprocessor -> Preprocessor Definitions**, add
> USE_CUDA;USE_BLAS;WIN32;MKL;_DEBUG;_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS_CONSOLE;
* In **Linker -> Input -> Additional Dependencies**, add the following libraries
> cuda.lib;cudadevrt.lib;cudart.lib;cudart_static.lib;nvcuvid.lib;OpenCL.lib;cublas.lib;curand.lib;
* With the above in place, check CUDA* under **Menu -> Project -> Build Customizations** (pick the entry matching your installed CUDA version).
* Right-click all of the *.cu and *.cuh files and include them in the project.
* Right-click a .cu file, open Properties, and set the Item Type to "CUDA C/C++" (it is easiest to search for .cu files and set them all at once).
* Once configured, right-click **Project -> Project Dependencies** and select CUDA9.

### Other settings

Note: the following options are likewise found under **Menu -> Project -> Properties**.

* **General -> Platform Toolset**: Visual Studio 2015 (v140).
* **C/C++ -> General -> SDL checks**: No.
* In **C/C++ -> Preprocessor -> Preprocessor Definitions**, add
> WIN32;_DEBUG;_CRT_SECURE_NO_WARNINGS;_CRT_SECURE_NO_WARNINGS_CONSOLE;
* **C/C++ -> Precompiled Headers -> Precompiled Header**: Not Using Precompiled Headers.
* **Linker -> System -> SubSystem**: Console.
* **General -> Character Set**: Use Unicode Character Set.
* **Debugging -> Command Arguments**: the arguments the executable needs (initially `-test`, which runs the test cases). A quick sanity check of the CUDA setup is sketched after this list.
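As a quick sanity check of the CUDA settings above, the following minimal program (written here for illustration; it is not part of NiuTrans.Tensor) should compile as a .cu file and list the visible devices. It only uses the CUDA runtime API, so cudart.lib from the dependency list is sufficient.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    /* count the CUDA devices the runtime can see */
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("found %d CUDA device(s)\n", count);
    return 0;
}
```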
......@@ -331,7 +331,8 @@ void Init(FNNModel &model)
{
/* create embedding parameter matrix: vSize * eSize */
InitModelTensor2D(model.embeddingW, model.vSize, model.eSize, model);
model.embeddingW.SetVarFlag();
/* create hidden layer parameter matrices */
for(int i = 0; i < model.hDepth; i++){
/* hidden layer parameter matrix: (n-1)eSize * hsize if it is the first layer
......@@ -340,15 +341,20 @@ void Init(FNNModel &model)
InitModelTensor2D(model.hiddenW[i], (model.n - 1) * model.eSize, model.hSize, model);
else
InitModelTensor2D(model.hiddenW[i], model.hSize, model.hSize, model);
model.hiddenW[i].SetVarFlag();
/* bias term: a row vector of hSize entries */
InitModelTensor1D(model.hiddenB[i], model.hSize, model);
model.hiddenB[i].SetVarFlag();
}
/* create the output layer parameter matrix and bias term */
int iSize = model.hDepth == 0 ? (model.n - 1) * model.eSize : model.hSize;
InitModelTensor2D(model.outputW, iSize, model.vSize, model);
model.outputW.SetVarFlag();
InitModelTensor1D(model.outputB, model.vSize, model);
model.outputB.SetVarFlag();
/* then, we initialize model parameters using a uniform distribution in range
of [-minmax, minmax] */
......
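The SetVarFlag() calls added above mark each parameter tensor right after it is created. A plausible reading (an assumption; the diff itself does not show the flag's definition) is that it tags the tensor as a trainable variable whose gradient the network should keep. A hypothetical helper that captures the repeated pattern:

```cpp
/* Hypothetical helper, not in the library: create a 2D parameter and
   mark it trainable in one step, mirroring the pattern above.
   Assumption: SetVarFlag() flags the tensor as a gradient-carrying
   variable for the autograd machinery. */
void InitParam2D(XTensor &w, int rowNum, int colNum, FNNModel &model)
{
    InitModelTensor2D(w, rowNum, colNum, model);
    w.SetVarFlag();
}
```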
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
......@@ -51,7 +51,13 @@ bool CONST_TRUE = true;
int verboseLevel = 0;
bool useBLAS = false;
#ifdef USE_CUDA
bool useCUDA = true;
#else
bool useCUDA = false;
#endif
FILE * tmpLog = NULL;
double myTime = 0;
......
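After this change, useCUDA reflects the USE_CUDA build flag instead of always starting out false. A sketch of how such a global might be consumed (a hypothetical helper, not library code; in NiuTrans.Tensor a devID below zero denotes the CPU):

```cpp
/* Hypothetical: pick a default device id from the build-time flag. */
int PickDefaultDevID()
{
    return useCUDA ? 0 : -1;   /* GPU 0 when built with USE_CUDA, else CPU */
}
```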
......@@ -52,7 +52,6 @@
#include "math/Clip.h"
#include "math/Compare.h"
#include "math/Normalize.h"
#include "math/Power.h"
#include "math/ScaleAndShift.h"
#include "math/Unary.h"
......
......@@ -44,7 +44,7 @@ where i is the index of the element
*/
void _MaskMe(XTensor * a, const XTensor * mask, DTYPE alpha = 0.0);
/*
mask entries of a given tensor (return an XTensor structure):
a(i) = a(i) if mask(i) is non-zero
a(i) = alpha if mask(i) = 0
......
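For reference, the masking rule documented above reduces to the following element-wise loop (a CPU sketch written for illustration, not the library's implementation):

```cpp
/* a[i] keeps its value where mask[i] != 0 and becomes alpha elsewhere */
void MaskSketch(float * a, const float * mask, int n, float alpha)
{
    for (int i = 0; i < n; i++)
        if (mask[i] == 0.0f)
            a[i] = alpha;
}
```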
......@@ -567,15 +567,17 @@ void _CudaSetDataRand(const XTensor * tensor, DTYPE lower, DTYPE upper)
ProtectCudaDev(tensor->devID, devIDBackup);
curandGenerator_t & gen = GDevs.GPUs[tensor->devID].gen;
curandGenerateUniform(gen, (float*)tensor->data, tensor->unitNum);
DTYPE variance = upper - lower;
if(variance != 1.0F || lower != 0){
if (tensor->dataType == X_FLOAT)
KernelSetDataRandFloat<<<blocks, threads>>>
                      ((float*) tensor->data, tensor->unitNum, lower, variance);
else if (tensor->dataType == X_DOUBLE)
KernelSetDataRandDouble<<<blocks, threads>>>
                       ((double*)tensor->data, tensor->unitNum, lower, variance);
}
BacktoCudaDev(tensor->devID, devIDBackup);
......
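curandGenerateUniform fills the buffer with samples in (0, 1], so the kernels launched above are left to apply an affine remap into [lower, upper); the launch is skipped only when the target range is already (0, 1]. The remap is assumed to be the following one-liner per element (illustrative kernel, not the library's):

```cpp
/* assumed behavior of KernelSetDataRandFloat: x -> lower + variance * x */
__global__ void RemapUniform(float * d, int size, float lower, float variance)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
        d[i] = lower + variance * d[i];
}
```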
......@@ -29,38 +29,25 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* descale each entry */
template<class T>
void _CudaDescale(const XTensor * a, XTensor * b, T num);

/* power each entry */
template<class T>
void _CudaPower(const XTensor * a, XTensor * b, T num);

/* mod each entry */
template<class T>
void _CudaMod(const XTensor * a, XTensor * b, T base);

/* scale each entry */
template<class T>
void _CudaScale(const XTensor * a, XTensor * b, T num);

/* shift each entry */
template<class T>
void _CudaShift(const XTensor * a, XTensor * b, T num);
#endif // USE_CUDA
......
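Since these function templates are declared here but defined in a .cu translation unit, the implementation presumably instantiates them explicitly for the supported scalar types; otherwise callers in other translation units would fail to link. For example (an assumption about the .cu side, which this diff does not show):

```cpp
template void _CudaScale<int>(const XTensor * a, XTensor * b, int num);
template void _CudaScale<float>(const XTensor * a, XTensor * b, float num);
template void _CudaShift<int>(const XTensor * a, XTensor * b, int num);
template void _CudaShift<float>(const XTensor * a, XTensor * b, float num);
```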
......@@ -26,84 +26,110 @@
namespace nts { // namespace nts(NiuTrans.Tensor)
/* descale tensor entries
b = a / num */
template<class T>
void _Descale(const XTensor * a, XTensor * b, T num);

/* descale tensor entries (on site)
b = a / num */
template<class T>
void _DescaleMe(XTensor * a, T num);

/* descale tensor entries (on site)
b = a / num */
template<class T>
void DescaleMe(XTensor & a, T num);

/* descale tensor entries
b = a / num */
template<class T>
void Descale(const XTensor & a, XTensor & b, T num);

/* descale tensor entries (return an XTensor structure)
b = a / num */
template<class T>
XTensor Descale(const XTensor & a, T num);

/* mod tensor entries
b = a % base */
template<class T>
void _Mod(const XTensor * a, XTensor * b, T base);

/* mod tensor entries (on site)
b = a % base */
template<class T>
void _ModMe(XTensor * a, T base);

/* mod tensor entries (on site)
b = a % base */
template<class T>
void ModMe(XTensor & a, T base);

/* mod tensor entries
b = a % base */
template<class T>
void Mod(const XTensor & a, XTensor & b, T base);

/* mod tensor entries (return an XTensor structure)
b = a % base */
template<class T>
XTensor Mod(const XTensor & a, T base);

/* get the power(x, y)
b = power(a, num) */
template<class T>
void _Power(const XTensor * a, XTensor * b, T scale);

/* get the power(x, y) (on site)
b = power(a, num) */
template<class T>
void _PowerMe(XTensor * a, T scale);

/* get the power(x, y) (on site)
b = power(a, num) */
template<class T>
void PowerMe(XTensor & a, T scale);

/* get the power(x, y)
b = power(a, num) */
template<class T>
void Power(const XTensor & a, XTensor & b, T scale);

/* get the power(x, y) (return an XTensor structure)
b = power(a, num) */
template<class T>
XTensor Power(const XTensor & a, T scale);

/* scale up tensor entries
b = a * num */
template<class T>
void _Scale(const XTensor * a, XTensor * b, T num);

/* scale up tensor entries (on site)
b = a * num */
template<class T>
void _ScaleMe(XTensor * a, T num);

/* scale up tensor entries (on site)
b = a * num */
template<class T>
void ScaleMe(XTensor & a, T num);

/* scale up tensor entries
b = a * num */
template<class T>
void Scale(const XTensor & a, XTensor & b, T num);

/* scale up tensor entries (return an XTensor structure)
b = a * num */
template<class T>
XTensor Scale(const XTensor & a, T num);

/* shift tensor entries
b = a + num */
template<class T>
void _Shift(const XTensor * a, XTensor * b, T num);

/* shift tensor entries (on site)
b = a + num */
template<class T>
void _ShiftMe(XTensor * a, T num);

/* shift tensor entries (on site)
b = a + num */
template<class T>
void ShiftMe(XTensor & a, T num);

/* shift tensor entries
b = a + num */
template<class T>
void Shift(const XTensor & a, XTensor & b, T num);

/* shift tensor entries (return an XTensor structure)
b = a + num */
template<class T>
XTensor Shift(const XTensor & a, T num);
} // namespace nts(NiuTrans.Tensor)
......
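A short usage sketch of the templated interface above. The tensor setup assumes the usual InitTensor helpers and SetDataRand from XTensor.h:

```cpp
XTensor a;
InitTensor2D(&a, 2, 2, X_FLOAT);
a.SetDataRand(-1.0F, 1.0F);

XTensor b = Scale(a, 2.0F);    /* T = float: b = a * 2 */
XTensor c = Shift(b, 1.0F);    /* T = float: c = b + 1 */
XTensor d = Power(c, 2.0F);    /* T = float: d = c ^ 2 */
```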
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include <math.h>
#include "../../XTensor.h"
#include "../../XName.h"
#include "Power.h"
#include "Power.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
/*
get the power(a, p)
>> a - input tensor
>> b - output tensor
>> p - parameter
*/
void _Power(const XTensor * a, XTensor * b, DTYPE p)
{
#ifdef USE_CUDA
/* run it on GPUs */
if (a->devID >= 0) {
_CudaPower(a, b, p);
return;
}
#endif
CheckNTErrors((a->dataType == DEFAULT_DTYPE), "TODO!");
DTYPE * aData = (DTYPE*)a->data;
DTYPE * bData = (DTYPE*)b->data;
if (p == 0) {
for (int i = 0; i < a->unitNum; i++)
bData[i] = (DTYPE)1.0;
}
else if (p == (DTYPE)0.5) {
for (int i = 0; i < a->unitNum; i++)
bData[i] = (DTYPE)sqrt(aData[i]);
}
else if (p == (DTYPE)2.0) {
for (int i = 0; i < a->unitNum; i++)
bData[i] = aData[i] * aData[i];
}
else {
for (int i = 0; i < a->unitNum; i++) {
if (p < 0 && aData[i] == 0)
bData[i] = 1e20F;
else
bData[i] = (DTYPE)pow(aData[i], p);
}
}
}
/*
get the power(a, p) (do it on site)
keep the result in the input tensor a and return nothing
>> a - the tensor
>> p - parameter
*/
void _PowerMe(XTensor * a, DTYPE p)
{
_Power(a, a, p);
}
/*
get the power(a, p) (return an XTensor structure)
make a new tensor to keep the result and return it
>> a - input tensor
>> p - parameter
<< return - the power value of the input tensor
*/
XTensor Power(const XTensor & a, DTYPE p)
{
XTensor b(&a);
b.SetTMPFlag();
/* call _Power function */
_Power(&a, &b, p);
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_POWER);
XLink::AddParamToHead(&b, p);
return b;
}
/*
get the power(a, p)
>> a - input tensor
>> b - output tensor
>> p - parameter
>> requireLink - if add operation to network
*/
void Power(const XTensor & a, XTensor & b, DTYPE p, bool requireLink)
{
if (!b.isInit || !XTensor::IsSameShaped(&a, &b)) {
InitTensor(&b, &a);
}
/* call _Power function */
_Power(&a, &b, p);
if (requireLink) {
/* tensor connections */
XLink::MakeLink(&a, NULL, &b, MATH_POWER);
XLink::AddParamToHead(&b, p);
}
}
} // namespace nts(NiuTrans.Tensor)
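A brief usage sketch for the wrappers above (illustrative; assumes a float tensor built with the standard InitTensor1D helper):

```cpp
XTensor t;
InitTensor1D(&t, 4, X_FLOAT);
t.SetDataRand(0.0F, 1.0F);

/* element-wise square root: p == 0.5 is special-cased in _Power */
XTensor r = Power(t, 0.5F);

/* element-wise square, written to a preallocated tensor, no autograd link */
XTensor out;
Power(t, out, 2.0F, false);
```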
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#include "../../XDevice.h"
#include "../../XTensor.h"
#include "../movement/CopyValues.cuh"
#include "Power.h"
#include "Power.cuh"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/*
set all entries to its root (CUDA Kernel)
>> a - input data array
>> b - output data array
>> size - size of the data array
*/
__global__
void KernelSqrtV2(DTYPE * a, DTYPE * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size)
b[i] = sqrt(a[i]);
}
/*
set all entries to its root (CUDA Kernel)
>> a - input data array
>> b - output data array
>> size - size of the data array
*/
__global__
void KernelSqrtV2(__half * a, __half * b, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
#if __CUDA_ARCH__ >= 530 || !defined(__CUDA_ARCH__)
if (i < size)
b[i] = hsqrt(a[i]);
#else
if (i < size)
b[i] = __float2half(sqrt(__half2float(a[i])));
#endif
}
/*
get power(d[i], p)
>> a - input data array
>> b - output data array
>> p - power
>> size - size of the data array
*/
__global__
void KernelPower(DTYPE * a, DTYPE * b, DTYPE p, int size)
{
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
DTYPE v = a[i];
if (p < 0 && v == 0)
b[i] = 1e20;
else
b[i] = pow(a[i], p);
}
}
/*
get power(d[i], p)
>> a - input data array
>> b - output data array
>> p - power
>> size - size of the data array
*/
__global__
void KernelPower(__half * a, __half * b, __half p, int size)
{
#if __CUDA_ARCH__ >= 530 || !defined(__CUDA_ARCH__)
#else
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < size) {
float v = __half2float(a[i]);
if (__half2float(p) < 0 && v == 0)
b[i] = __float2half(1e20);
else
b[i] = __float2half(pow(__half2float(a[i]), __half2float(p)));
}
#endif
}
/* get the power of the entries */
void _CudaPower(const XTensor * a, XTensor * b, DTYPE p)
{
CheckNTErrors((XTensor::IsSameShaped(a, b)), "Input tensors should have the same type!");
int gridSize[3];
int blockSize[3];
GDevs.GetCudaThread(a->devID, a->unitNum, gridSize, blockSize);
dim3 blocks(gridSize[0]);
dim3 threads(blockSize[0]);
int devIDBackup;
ProtectCudaDev(a->devID, devIDBackup);
if (a->dataType == DEFAULT_DTYPE) {
if (p == (DTYPE)0.5) {
KernelSqrtV2<<<blocks, threads>>>((DTYPE*)a->data, (DTYPE*)b->data, a->unitNum);
}
else if (p == (DTYPE)1.0) {
_CudaCopyValues(a, b);
}
else if (p != (DTYPE)1.0) {
KernelPower<<<blocks, threads>>>((DTYPE*)a->data, (DTYPE*)b->data, p, a->unitNum);
}
}
else if (a->dataType == X_FLOAT16) {
if (p == (DTYPE)0.5) {
KernelSqrtV2<<<blocks, threads>>>((__half*)a->data, (__half*)b->data, a->unitNum);
}
else if (p != (DTYPE)1.0) {
ShowNTErrors("TODO!");
}
}
else {
ShowNTErrors("TODO!");
}
BacktoCudaDev(a->devID, devIDBackup);
}
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __POWER_CUH__
#define __POWER_CUH__
#include "Power.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set all entries to its root (CUDA Kernel) */
__global__
void KernelSqrtV2(DTYPE * a, DTYPE * b, int size);
/* set all entries to its root (CUDA Kernel) */
__global__
void KernelSqrtV2(__half * a, __half * b, int size);
/* get the power of the entries */
void _CudaPower(const XTensor * a, XTensor * b, DTYPE p);
#endif // USE_CUDA
} // namespace nts(NiuTrans.Tensor)
#endif // __POWER_CUH__
/* NiuTrans.Tensor - an open-source tensor library
* Copyright (C) 2017, Natural Language Processing Lab, Northeastern University.
* All rights reserved.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/*
* $Created by: XIAO Tong (email: xiaotong@mail.neu.edu.cn) 2018-04-24
*/
#ifndef __POWER_H__
#define __POWER_H__
#include "../../XTensor.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* get the power(x, y) */
void _Power(const XTensor * a, XTensor * b, DTYPE p);
/*
get the power(x, y) (do it on site)
keep the result in the input tensor a and return nothing
*/
void _PowerMe(XTensor * a, DTYPE p);
/*
get the power(x, y) (return an XTensor structure)
make a new tensor to keep the result and return it
*/
XTensor Power(const XTensor & a, DTYPE p);
/* get the power(x, y) */
void Power(const XTensor & a, XTensor & b, DTYPE p, bool requireLink = false);
} // namespace nts(NiuTrans.Tensor)
#endif // __POWER_H__
......@@ -24,66 +24,133 @@
#include "../../XName.h"
#include "Unary.h"
#include "Unary.cuh"
#include <cuda_runtime.h>
namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
template<class T>
__device__
T BaseCeil(T x)
{
    return (T)ceil((float)x);
}

template<class T>
__device__
T BaseExp(T x)
{
    return (T)exp((float)x);
}

template<class T>
__device__
T BaseFabs(T x)
{
    return (T)fabs((float)x);
}

template<class T>
__device__
T BaseFloor(T x)
{
    return (T)floor((float)x);
}

template<class T>
__device__
T BaseIsNonZero(T r)
{
    return (r != (T)0.0) ? (T)1.0 : (T)0.0;
}

template<class T>
__device__
T BaseIsZero(T r)
{
    return (r == (T)0.0) ? (T)1.0 : (T)0.0;
}

template<class T>
__device__
T BaseLog(T x)
{
    return (T)log((float)x);
}

template<class T>
__device__
T BaseNegate(T x)
{
    return -x;
}

template<class T>
__device__
T BaseSign(T r)
{
    if (r > (T)0)
        return 1.0;
    else if (r == (T)0)
        return 0.0;
    else
        return -1.0;
}

template<class T>
__device__
T BaseSqrt(T x)
{
    return (T)sqrt((float)x);
}

template<class T>
__device__
T BaseSquare(T x)
{
    return x * x;
}

template<class T>
__device__
T BaseRound(T r)
{
    return (r > (T)0.0) ? (T)BaseFloor(r + (T)0.5) : (T)BaseCeil(r - (T)0.5);
}

template<class T>
__device__
T BaseSin(T x)
{
    return (T)sin((float)x);
}

template<class T>
__device__
T BaseCos(T x)
{
    return (T)cos((float)x);
}

template<class T>
__device__
T BaseTan(T x)
{
    return (T)tan((float)x);
}

#define SIMPLE_UNARY_FUNCTION_GPU(funcName, origFunc) \
template<class T> \
__global__ \
void Kernel##funcName(T * a, T * b, int size) \
{ \
    int i = blockDim.x * blockIdx.x + threadIdx.x; \
 \
    if (i < size) \
        b[i] = (T)origFunc(a[i]); \
} \
void _Cuda##funcName(const XTensor * a, XTensor * b) \
{ \
......@@ -102,9 +169,17 @@ void _Cuda##funcName(const XTensor * a, XTensor * b) \
int devIDBackup; \
ProtectCudaDev(a->devID, devIDBackup); \
\
if (a->dataType == X_FLOAT) { \
Kernel##funcName<<<blocks, threads>>> \
((float*)a->data, (float*)b->data, a->unitNum); \
} \
else if (a->dataType == X_DOUBLE) { \
Kernel##funcName<<<blocks, threads>>> \
((double*)a->data, (double*)b->data, a->unitNum); \
} \
else if (a->dataType == X_INT) { \
Kernel##funcName<<<blocks, threads>>> \
((int*)a->data, (int*)b->data, a->unitNum); \
} \
else if (a->dataType == X_FLOAT16) { \
Kernel##funcName<<<blocks, threads>>> \
......@@ -115,24 +190,26 @@ void _Cuda##funcName(const XTensor * a, XTensor * b) \
} \
\
BacktoCudaDev(a->devID, devIDBackup); \
}
SIMPLE_UNARY_FUNCTION_GPU(Absolute, BaseFabs)
SIMPLE_UNARY_FUNCTION_GPU(Ceil, BaseCeil)
SIMPLE_UNARY_FUNCTION_GPU(Exp, BaseExp)
SIMPLE_UNARY_FUNCTION_GPU(Floor, BaseFloor)
SIMPLE_UNARY_FUNCTION_GPU(IsNonZero, BaseIsNonZero)
SIMPLE_UNARY_FUNCTION_GPU(IsZero, BaseIsZero)
SIMPLE_UNARY_FUNCTION_GPU(Log, BaseLog)
SIMPLE_UNARY_FUNCTION_GPU(Negate, BaseNegate)
SIMPLE_UNARY_FUNCTION_GPU(Round, BaseRound)
SIMPLE_UNARY_FUNCTION_GPU(Sign, BaseSign)
SIMPLE_UNARY_FUNCTION_GPU(Sqrt, BaseSqrt)
SIMPLE_UNARY_FUNCTION_GPU(Square, BaseSquare)
SIMPLE_UNARY_FUNCTION_GPU(Sin, BaseSin)
SIMPLE_UNARY_FUNCTION_GPU(Cos, BaseCos)
SIMPLE_UNARY_FUNCTION_GPU(Tan, BaseTan)
#endif // USE_CUDA
......
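The macro above stamps out one templated kernel plus a host dispatcher per operation. Stripped of the XTensor plumbing, the pattern reduces to the following self-contained demo (hypothetical names, written for illustration):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

template<class T>
__device__ T BaseSquareDemo(T x) { return x * x; }

/* one kernel template, instantiated per element type by the caller */
template<class T>
__global__ void KernelSquareDemo(T * a, T * b, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size)
        b[i] = BaseSquareDemo(a[i]);
}

int main()
{
    const int n = 8;
    float h[8], out[8];
    for (int i = 0; i < n; i++)
        h[i] = (float)i;

    float *da, *db;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMemcpy(da, h, n * sizeof(float), cudaMemcpyHostToDevice);

    KernelSquareDemo<<<1, n>>>(da, db, n);   /* float instantiation */

    cudaMemcpy(out, db, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%g ", out[i]);
    printf("\n");

    cudaFree(da);
    cudaFree(db);
    return 0;
}
```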
......@@ -29,139 +29,49 @@ namespace nts { // namespace nts(NiuTrans.Tensor)
#ifdef USE_CUDA
/* set each entry to its absolute value */
void _CudaAbsolute(const XTensor * a, XTensor * b);

/* set each entry to its ceil value */
void _CudaCeil(const XTensor * a, XTensor * b);

/* set each entry to its exponent value */
void _CudaExp(const XTensor * a, XTensor * b);

/* set each entry to its floor value */
void _CudaFloor(const XTensor * a, XTensor * b);

/* if source entry is non-zero, set target entry to be one, otherwise zero */
void _CudaIsNonZero(const XTensor * a, XTensor * b);

/* if source entry is zero, set target entry to be one, otherwise zero */
void _CudaIsZero(const XTensor * a, XTensor * b);

/* set each entry to its logarithm value */
void _CudaLog(const XTensor * a, XTensor * b);

/* set each entry to its negative value */
void _CudaNegate(const XTensor * a, XTensor * b);

/* set each entry to its round value */
void _CudaRound(const XTensor * a, XTensor * b);

/* set each entry to its sign value */
void _CudaSign(const XTensor * a, XTensor * b);

/* set each entry to its sqrt value */
void _CudaSqrt(const XTensor * a, XTensor * b);

/* set each entry to its square value */
void _CudaSquare(const XTensor * a, XTensor * b);

/* set each entry to its sine value */
void _CudaSin(const XTensor * a, XTensor * b);

/* set each entry to its cosine value */
void _CudaCos(const XTensor * a, XTensor * b);

/* set each entry to its tangent value */
void _CudaTan(const XTensor * a, XTensor * b);
......
......@@ -22,9 +22,9 @@
#include "Loss.h"
#include "Loss.cuh"
#include "../XDevice.h"
#include "../core/math/Power.h"
#include "../core/math/ScaleAndShift.h"
#include "../core/math/Unary.h"
#include "../core/math/Binary.h"
#include "../core/arithmetic/Sum.h"
#include "../core/arithmetic/Multiply.h"
#include "../core/reduce/ReduceSum.h"
......
......@@ -19,6 +19,7 @@
* $Created by: Lin Ye (email: linye2015@outlook.com) 2018-06-15
*/
#include "../core/math/Binary.h"
#include "../XUtility.h"
#include "TPower.h"
......
......@@ -22,8 +22,6 @@
#ifndef __TEST_POWER_H__
#define __TEST_POWER_H__
#include "../core/math/Power.h"
namespace nts { // namespace nts(NiuTrans.Tensor)
/* test for Power Function */
......