I recently started learning C and CUDA programming, and I am porting a function to CUDA. I have some questions about the performance I see on two different systems. The C++ version is:
```cpp
#include <iostream>
#include <cmath>

void wavevector(int rx, int ry, int rz,
                double *wavevectorX, double *wavevectorY,
                double *wavevectorZ, double boxLength[3]) {
    int igx, igy, igz;
    int lgx, lgy, lgz;
    int igk;
    double waveX, waveY, waveZ;
    for (igz = 0; igz < rz; igz++) {
        lgz = ((igz + (rz / 2 - 1)) % rz) - (rz / 2 - 1);
        for (igy = 0; igy < ry; igy++) {
            lgy = ((igy + (ry / 2 - 1)) % ry) - (ry / 2 - 1);
            for (igx = 0; igx < rx / 2; igx++) {
                lgx = ((igx + (rx / 2 - 1)) % rx) - (rx / 2 - 1);
                igk = 1 + igx + igy * rx / 2 + igz * ry * rx / 2;
                waveX = 2 * M_PI / boxLength[0] * lgx;
                waveY = 2 * M_PI / boxLength[1] * lgy;
                waveZ = 2 * M_PI / boxLength[2] * lgz;
                wavevectorX[igk] = waveX;
                wavevectorY[igk] = waveY;
                wavevectorZ[igk] = waveZ;
            }
        }
    }
}

int main(void)
{
    std::cout << "======================== Start! ========================" << std::endl;
    int rx = 500, ry = 500, rz = 500;
    int N = rx * ry * rz / 2 + 1;
    double *wavevectorX;
    double *wavevectorY;
    double *wavevectorZ;
    double boxLength[3];
    boxLength[0] = 3;
    boxLength[1] = 3;
    boxLength[2] = 3;
    // Allocate host memory (the indices written by wavevector run from 1 upward)
    wavevectorX = new double[N + 1];
    wavevectorY = new double[N + 1];
    wavevectorZ = new double[N + 1];
    wavevector(rx, ry, rz,
               wavevectorX, wavevectorY, wavevectorZ,
               boxLength);
    std::cout << "========================= End! =========================" << std::endl;
    int start_p = 1;
    int end_p = 20;
    for (int i = start_p; i < end_p; i++) {
        std::cout << "wavevectorX[" << i << "] = " << wavevectorX[i] <<
                     " || wavevectorY[" << i << "] = " << wavevectorY[i] <<
                     " || wavevectorZ[" << i << "] = " << wavevectorZ[i] << std::endl;
    }
    // Free memory
    delete[] wavevectorX;
    delete[] wavevectorY;
    delete[] wavevectorZ;
    return 0;
}
```
I have ported it to CUDA (**Edit:** as suggested in the comments below, I switched to single precision to get better performance out of CUDA):
```cuda
#include <iostream>
#include <cmath>

// CUDA kernel to calculate wavevectors
__global__ void wavevector(int n, int rx, int ry, int rz,
                           float *wavevectorX, float *wavevectorY,
                           float *wavevectorZ, float bLx, float bLy,
                           float bLz) {
    int lgz, lgy, lgx, igk;
    float waveX, waveY, waveZ;
    int indX = blockIdx.x * blockDim.x + threadIdx.x;
    int indY = blockIdx.y * blockDim.y + threadIdx.y;
    int indZ = blockIdx.z * blockDim.z + threadIdx.z;
    if (indX >= rx / 2 || indY >= ry || indZ >= rz) {
        return;
    }
    igk = 1 + indX + (indY + indZ * ry) * rx / 2;
    lgz = ((indZ + (rz / 2 - 1)) % rz) - (rz / 2 - 1);
    lgy = ((indY + (ry / 2 - 1)) % ry) - (ry / 2 - 1);
    lgx = ((indX + (rx / 2 - 1)) % rx) - (rx / 2 - 1);
    waveX = float(2.0 * M_PI) / bLx * lgx;
    waveY = float(2.0 * M_PI) / bLy * lgy;
    waveZ = float(2.0 * M_PI) / bLz * lgz;
    wavevectorX[igk] = waveX;
    wavevectorY[igk] = waveY;
    wavevectorZ[igk] = waveZ;
}

int main(void)
{
    std::cout << "======================== Start! ========================" << std::endl;
    int rx = 500, ry = 500, rz = 500;
    int N = rx * ry * rz / 2;
    float *wavevectorX;
    float *wavevectorY;
    float *wavevectorZ;
    float boxLength[3];
    boxLength[0] = 3.0;
    boxLength[1] = 3.0;
    boxLength[2] = 3.0;
    // Allocate Unified Memory -- accessible from CPU or GPU.
    // The kernel writes indices 1..N, so N + 1 elements are needed.
    cudaMallocManaged(&wavevectorX, (N + 1) * sizeof(float));
    cudaMallocManaged(&wavevectorY, (N + 1) * sizeof(float));
    cudaMallocManaged(&wavevectorZ, (N + 1) * sizeof(float));
    // Prefetch the data to the GPU
    cudaMemPrefetchAsync(wavevectorX, (N + 1) * sizeof(float), 0);
    cudaMemPrefetchAsync(wavevectorY, (N + 1) * sizeof(float), 0);
    cudaMemPrefetchAsync(wavevectorZ, (N + 1) * sizeof(float), 0);
    int n_blocksx = 512;
    int n_blocksy = 1;
    int n_blocksz = 1;
    dim3 block(n_blocksx, n_blocksy, n_blocksz);
    dim3 grid((rx + n_blocksx - 1) / n_blocksx,
              (ry + n_blocksy - 1) / n_blocksy,
              (rz + n_blocksz - 1) / n_blocksz);
    wavevector<<<grid, block>>>(N, rx, ry, rz,
                                wavevectorX, wavevectorY, wavevectorZ,
                                boxLength[0], boxLength[1], boxLength[2]);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    std::cout << "========================= End! =========================" << std::endl;
    int start_p = 1;
    int end_p = 20;
    for (int i = start_p; i <= end_p; i++) {
        std::cout << "wavevectorX[" << i << "] = " << wavevectorX[i] <<
                     " || wavevectorY[" << i << "] = " << wavevectorY[i] <<
                     " || wavevectorZ[" << i << "] = " << wavevectorZ[i] << std::endl;
    }
    // Free memory
    cudaFree(wavevectorX);
    cudaFree(wavevectorY);
    cudaFree(wavevectorZ);
    return 0;
}
```
I am running these calculations on (MemAvailable obtained by running `cat /proc/meminfo`):

1. Google Colab: Intel(R) Xeon(R) CPU @ 2.30GHz, MemAvailable: 12281932 kB, and a Tesla T4 16 GB;
2. WSL 2 on Windows 11: Intel(R) Core(TM) i5-10500H CPU @ 2.50GHz, MemAvailable: 7617120 kB, and an NVIDIA GeForce RTX 3060 Laptop 6 GB.

Here is a table with the execution time of each code on these systems (I simply prepended the `time` command when running the calculations):
| System / Language | Execution time |
| -- | -- |
| Google Colab / C++ | real 0m1.438s |
| Google Colab / CUDA (still double precision) | real 0m0.356s |
| WSL 2 / C++ | real 0m0.456s |
| WSL 2 / CUDA | real ~0m0.8s |
**Edit:** Looking at the times of the C++ calculation and the CUDA one executed on the Tesla T4, the CUDA version is clearly an improvement (even though it still uses double precision in this case, and the C++ version was compiled with the `-O3` flag). The CUDA calculation on WSL 2, however, takes almost the same time as the C++ one on WSL 2.
**1:** Even though my GPU and CPU have roughly similar GFLOPS ratings, I would still expect a performance improvement from computing the `(indX, indY, indZ)` sets in parallel with `(512x1x1)` blocks, compared with the nested loops of the C++ version.
**2:** If I change `rx`, `ry` and `rz` from 500 to 1000, it runs fine on Google Colab, but on WSL 2 the C++ run gets killed and the CUDA version returns a segmentation fault when trying to print the results. In that case the data size would be `1000*1000*1000*sizeof(double)/2` (3906250 kB). As far as I understand, this should fit on both systems. For the C++ version, I read that it was probably killed by the kernel's OOM killer, but I am not sure about the CUDA version. What am I missing here? (**Edit:** using `(512x1x1)` blocks no longer causes the segmentation fault in the CUDA code.)
I would also appreciate any reading material and tips that could help me optimize this code :)
**Edit:** I compile the C++ version with `g++ -o wavevector_cpp wavevector_cpp.cpp -O3` and the CUDA version with `nvcc -o wavevector_cuda wavevector_cuda.cu -arch=sm_86` (for the RTX 3060).
**Edit:** I will run a profiler on the CUDA calculation to provide more information.
1 Answer
I did some reading and, according to this link, Unified Memory has limitations on WSL 2. So I ran a test: using the Google Colab environment (with a T4 GPU), I ran two versions of the same code, one allocating memory manually and one using Unified Memory.
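The Unified Memory version is the CUDA code from the question; as a sketch of the manually allocated variant (same kernel, with `cudaMalloc` and an explicit `cudaMemcpy` back to the host replacing `cudaMallocManaged`; the details here are my reconstruction, not the original listing):

```cuda
// Same kernel as in the question; only main() changes.
int main() {
    int rx = 500, ry = 500, rz = 500;
    int N = rx * ry * rz / 2;
    float boxLength[3] = {3.0f, 3.0f, 3.0f};

    // Manual allocation: device buffers plus host buffers for the copy-back.
    float *dX, *dY, *dZ;
    cudaMalloc(&dX, (N + 1) * sizeof(float));
    cudaMalloc(&dY, (N + 1) * sizeof(float));
    cudaMalloc(&dZ, (N + 1) * sizeof(float));
    float *hX = new float[N + 1];
    float *hY = new float[N + 1];
    float *hZ = new float[N + 1];

    dim3 block(512, 1, 1);
    dim3 grid((rx + 511) / 512, ry, rz);
    wavevector<<<grid, block>>>(N, rx, ry, rz, dX, dY, dZ,
                                boxLength[0], boxLength[1], boxLength[2]);

    // Explicit copies back to the host instead of unified-memory page migration.
    cudaMemcpy(hX, dX, (N + 1) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hY, dY, (N + 1) * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(hZ, dZ, (N + 1) * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dX); cudaFree(dY); cudaFree(dZ);
    delete[] hX; delete[] hY; delete[] hZ;
    return 0;
}
```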
Compiling them as `nvcc -o /content/src/wavevector_manual_k /content/src/wavevector_manual_k.cu -arch=sm_75` and `nvcc -o /content/src/wavevector_k /content/src/wavevector_k.cu -arch=sm_75`, I profiled both (run as `time nvprof /content/src/wavevector...`).
We can see that, on Colab, choosing manual memory allocation versus Unified Memory changes neither the runtime of the `wavevector` kernel nor, significantly, the total execution time. If the same codes are run on WSL 2, however, the total execution time with Unified Memory is roughly 5 to 6 seconds, versus roughly 0.622 s with manual memory allocation (an execution time comparable to the runs on Google Colab). I am sorry I have no profile from WSL 2; although I have read that profiling is supported on Windows 11, so far I have not managed to get it working.
Based on the documented information about Unified Memory performance on WSL 2 and on these tests, I conclude that the strangely long execution times I experienced earlier on WSL 2 were due to its limitations in handling Unified Memory.

Anyway: thanks to everyone who gave me useful performance tips in the comments on the question :)