CUDA超大规模并行程序设计.ppt
CUDA 超大规模并行程序设计,邓仰东清华大学微电子学研究所,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,Graphic Processing Unit(GPU),用于个人计算机、工作站和游戏机的专用图像显示设备显示卡nVidia和ATI(now AMD)是主要制造商Intel准备通过Larrabee进入这一市场主板集成Intel,3维图像流水线,Framebuffer,Texture,CPU,GPU,实时3维高速图形处理,一帧典型图像1M triangles3M vertices25M fragments30 frames/s30M triangles/s90M vertices/s750M fragments/s,传统GPU架构,Graphics program,Vertex processors,Fragment processors,Pixel operations,Output image,GPU的强大运算能力,数据级并行:计算一致性,专用存储器通道有效隐藏存储器延时,General Purpose Computing on GPU(GPGPU),GPGPU,核心思想用图形语言描述通用计算问题把数据映射到vertex或者fragment处理器但是硬件资源使用不充分存储器访问方式严重受限难以调试和查错高度图形处理和编程技巧,NVidia G200 Architecture,CUDA:Compute Unified Device Architecture,通用并行计算模型单指令、多数据执行模式(SIMD)所有线程执行同一段代码(1000s threads on the fly)大量并行计算资源处理不同数据隐藏存储器延时提升计算通信比例合并相邻地址的内存访问 快速线程切换1 cycleGPU vs.1000 cyclesCPU,混合计算模型,CUDA:集成CPU+GPU C应用程序CPU:顺序执行代码GPU=超大规模数据并行协处理器“批发”式执行大量细粒度线程,kernel 0,CPU Serial Code,CPU Serial Code,GPU Parallel Code,GPU Parallel Code,Concurrent execution!,kernel 1,CUDA成功案例,CUDA性能,BLAS3:127 GFLOPS/基本线性代数:matrix-matrixFFT:52 benchFFT*GFLOPSFDTD:1.2 Gcells/sec/计算电动力学SSEARCH:5.2 Gcells/sec/Smith-Waterman基因序列比较Black Scholes:4.7GOptions/sec/期权定价模型VMD:290 GFLOPS/分子动力学图形显示,Problem Instances for Sparse Matrix Vector Product(SMVP),SPMV Throughput on GTX280,SMVP Application:Static Timing Analysis,Adapted from Ramalingam,A.et.al.An Accurate Sparse Matrix Based Framework for Statistical Static Timing Analysis.ICCAD.2006.,Static Timing Analysis Results on GTX280,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,并行性的维度,1维y=a+b/y,a,b vectors2维P=M N/P,M,N matrices3维CT or MRI imaging,=,并行线程组织结构,Thread:并行的基本单位Thread block:互相合作的线程组Cooperative Thread Array(CTA)允许彼此同步通过快速共享内存交换数据以1维、2维或3维组织最多包含512个线程Grid:一组thread block以1维或2维组织共享全局内存Kernel:在GPU上执行的核心程序One kernel one grid,Parallel Program Organization in CUDA,Thread,Thread block,Grid,SP,Software,Hardware,SM,GPU,并行线程执行,调用kernel function 需要指定执行配置Threads和blocks具有IDsthreadIdx:1D,2D,or 3DblockIdx:1D,or 2D由此决定相应处理数据,_global_ void kernel(.);dim3 DimGrid(3,2);/6 thread blocks dim3 DimBlock(16,16);/256 threads per block kernel(.);,实例1:Element-Wise Addition,/CPU program/sum of two vectors a and bvoid add_cpu(float*a,float*b,int N)for(int idx=0;idxN;idx+)aidx+=bidx;void main().fun_add(a,b,N);,/CUDA program/sum of two vectors a and b _global_ void add_gpu(float*a,float*b,int N)Int idx=blockIdx.x*blockDim.x+threadIdx.x;if(idx(a,b,N);,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,CUDA Processing Flow,并行线程执行,SM内以(warp即32 threads)为单位并行执行Warp内线程执行同一条指令Half-warp是存储操作的基本单位,Warp,GPU负载分配,Global block scheduler管理thread block级并行从CPU获得线程组织信息根据硬件结构分配thread block到SM,Streaming Multiprocessor(SM),Streaming Multiprocessor执行Thread Blocks,线程以block为单位分配到SM视资源需求,一个SM分配至多8个blockSM in G80可以接受768个线程256(threads/block)*3 blocks 或128(threads/block)*6 blocks,etc.线程并发(concurrently)运行SM分配并维护线程IDSM管理并调度线程,Thread Life Cycle,Grid在GPU上启动Thread blocks顺序分配到SMs一般SM应有1 thread blockSM把线程组织为warpsSM调度并执行就绪的warpWarps和thread blocks 执行结束后释放资源GPU继续分发thread blocks,Example of Hiding Memory Latency,G80:执行warp全部线程的一条指令需要8个时钟cycle假定1 global memory access/8 instructionsA 400-cycle global memory latencyHow many warps are needed to tolerate the latency?,400 cycles*1 MEM/8 cycles=50 cycles per instruction on average50 cycles/4 cycles per warp=12.5 13 warps to keep an SM busy,Arithmetic Instruction Throughput,4 clock cyclesSingle-precision floating-point add,multiply,and multiply-add,Integer add,24-bit integer multiplicationBitwise operations,compare,min,max,type conversion instruction;,Note.1.A warp is issued in 4 cycles2.Arithmetic ops are pipelined.3.Still possible to have 8 ops ready in each cycle.,Arithmetic Instruction Throughput,16 clock cyclesReciprocal,reciprocal square root,32-bit Integer multiplicationOther functions are combinations of the abovey/x=rcp(x)*y/20 cycles per warpsqrt(x)=rcp(rsqrt(x)/32 cycles per warpInteger division and modulo operation are costly!,浮点数精度,GT200之前的GPUIEEE-754 Floating Point Standard单精度浮点数GT200增加双精度浮点数支持SP仍然只支持单精度浮点数每个SM配置一个双精度浮点单元双精度比单精度运算慢8-12倍,控制流(Control Flow),同一warp内的分支语句可能执行不同的指令路径不同指令路径的线程只能顺序执行每次执行warp中一条可能的路径N条指令路径1/N throughput只需要考虑同一warp即可,不同warp的不同的指令路径不具相关性 G80上使用指令预测技术加速指令执行,控制流(Control Flow),常见情况:分支条件是thread ID的函数时,容易导致divergenceExample with divergence:If(threadIdx.x 2)在thread block产生两条不同指令路径Branch granularity 2)也在thread block产生两条不同指令路径Branch granularity is a whole multiple of warp size同一warp的所有线程具备相同指令路径,线程同步,void _syncthreads();Barrier synchronization同步thread block之内的所有线程避免访问共享内存时发生RAW/WAR/WAW 冒险(hazard),_shared_ float scratch256;scratchthreadID=beginthreadID;_syncthreads();int left=scratchthreadID-1;,在此等待,直至所有线程到达才开始执行下面的代码,Dead-Lock with _syncthreads,Dead-lock ifSome threads have val larger than thresholdAnd others not,_global_ void compute(.)/do some computation for valif(val threshold)return;_syncthreads();/work with val,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA基础CUDA存储器CUDA程序设计工具程序优化,CUDA扩展语言结构,Declspecsglobal,device,shared,local,constantKeywordsthreadIdx,blockIdxthreadDim,blockDimIntrinsics_syncthreadsRuntime APIMemory,symbol,execution managementFunction launch,_device_ float filterN;_global_ void convolve(float*image)_shared_ float regionM;.regionthreadIdx=imagei;_syncthreads().imagej=result;/Allocate GPU memoryvoid*myimage=cudaMalloc(bytes)/100 blocks,10 threads per blockfoo(parameters);,存储器空间,R/W per-thread registers1-cycle latencyR/W per-thread local memorySlow register spilling to global memoryR/W per-block shared memory1-cycle latencyBut bank conflicts may drag downR/W per-grid global memory500-cycle latencyBut coalescing accessing could hide latencyRead only per-grid constant and texture memories500-cycle latencyBut cached,GPU Global Memory分配,cudaMalloc()分配显存中的global memory两个参数对象数组指针和数组尺寸cudaFree()释放显存中的global memory对象数组指针,int blk_sz=64;float*Md;int size=blk_sz*blk_sz*sizeof(float);cudaMalloc(void*),Host Device数据交换,cudaMemcpy()Memory data transferRequires four parametersPointer to destinationPointer to sourceNumber of bytes copiedType of transfer Host to Host,Host to Device,Device to Host,Device to Device,cudaMemcpy(Md,M.elements,size,cudaMemcpyHostToDevice);cudaMemcpy(M.elements,Md,size,cudaMemcpyDeviceToHost);,CUDA引入的新变量类型,_device_ 储存于GPU上的global memory空间 和应用程序具有相同的生命期(lifetime)可被grid中所有线程存取,CPU代码通过runtime函数存取 _constant_储存于GPU上的constant memory空间 和应用程序具有相同的生命期(lifetime)可被grid中所有线程存取,CPU代码通过runtime函数存取_shared_储存于GPU上thread block内的共享存储器和thread block具有相同的生命期(lifetime)只能被thread block内的线程存取Local变量储存于SM内的寄存器和local memory和thread具有相同的生命期(lifetime)Thread私有,CUDA函数定义,_global_ 定义kernel函数必须返回void_device_ 函数不能用&运算符取地址,不支持递归调用,不支持静态变量(static variable),不支持可变长度参数函数调用,CUDA数学函数,pow,sqrt,cbrt,hypot,exp,exp2,expm1,log,log2,log10,log1p,sin,cos,tan,asin,acos,atan,atan2,sinh,cosh,tanh,asinh,acosh,atanh,ceil,floor,trunc,round,etc.只支持标量运算许多函数有一个快速、较不精确的对应版本以”_”为前缀,如_sin()编译开关-use_fast_math强制生成该版本的目标码每个多处理器包含两个超越函数计算单元,实例2:矩阵相乘,矩阵数据类型 不属于CUDA!单精度浮点数width height个元素矩阵元素在elements中1-D数组存放矩阵数据Row-major storagetypedef struct int width;int height;float*elements;Matrix;,A,B,C,WM.width=N.heightI,M.height,M.width,N.width,实例2:矩阵相乘,C=A B of size WIDTH x WIDTH一个线程处理一个矩阵元素简化:假定 WIDTH x WIDTH 512只需要一个thread block线程载入A的一行和B的一列A和B的一对相应元素作一次乘法和一次加法,A,B,C,WIDTH,WIDTH,WIDTH,WIDTH,CUDA Implementation Host Side,/Matrix multiplication on the devicevoid Mul(const Matrix A,const Matrix B,Matrix C)int size=A.width A.width sizeof(float);/Load M and N to the device float*Ad,*Bd,*Cd;cudaMalloc(void*),CUDA Implementation Host Side,/Launch the device computation threads!dim3 dimGrid(1);dim3 dimBlock(M.width,M.width);Muld(Ad,Bd,Cd,M.width);/Read P from the device copyFromDeviceMatrix(C.elements,Cd);cudaMemCopy(C,Cd,N*size,cudaMemcpyDeviceToHost);/Free device matrices cudaFree(Ad);cudaFree(Bd);cudaFree(Cd);,CUDA Implementation Kernel,/Matrix multiplication kernel thread specification_global_ void Muld(float*Ad,float*Bd,float*Cd,int width)/2D Thread ID int tx=threadIdx.x;int ty=threadIdx.y;/cvalue is used to store the element of the matrix/that is computed by the thread float cvalue=0;,CUDA Implementation Kernel,A,B,C,WIDTH,WIDTH,WIDTH,WIDTH,ty,tx,for(int k=0;k width;+k)float ae=Adty*width+k;float be=Bd tx+k*width;cvalue+=ae*be;/Write the matrix to device memory;/each thread writes one element Cdty*width+tx=cvalue;,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA存储器Shared memoryGlobal memoryCUDA程序设计工具程序优化,共享存储器(Shared Memory),设置于streaming multiprocessor内部由一个线程块内部全部线程共享完全由软件控制访问一个地址只需要1个时钟周期,共享存储器结构,G80的共享存储器组织为16 banksAddressed in 4 bytesBank ID=4-byte address%16相邻4-byte地址映射相邻banks每一bank的带宽为4 bytes per clock cycle对同一bank的同时访问导致bank conflict 只能顺序处理仅限于同一线程块内的线程,Bank Addressing实例,No Bank ConflictsLinear addressing stride=1(s=1),No Bank ConflictsRandom 1:1 Permutation,_shared_ float shared256;float foo=sharedthreadIdx.x;,Bank Addressing实例,2-way bank conflictsLinear addressing stride=2(s=2),8-way bank conflictsLinear addressing stride=8(s=8),_shared_ float shared256;float foo=shared2*threadIdx.x;,_shared_ float shared256;float foo=shared8*threadIdx.x;,常见Bank Conflict模式,Shared memory存放2D浮点数组16x16-elelment shared memory1个线程处理矩阵的一行循环处理一行16个元素同一block的线程同时访问一列即 column 1 in purple16-way bank conflicts,Bank Indices without Padding,Bank,t15,解决方案,方案1:pad the rows在每行最后添加一个元素方案 2:transpose before processingSuffer bank conflicts during transposeBut possibly save them later,Bank Indices with Padding,Transpose,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA存储器Shared memoryGlobal memoryCUDA程序设计工具程序优化,全局内存(Global Memory),全局内存在G80上没有缓存Constant memory和texture memory有少量缓存存取延时400-600 clock cycles非常容易成为性能瓶颈优化是提高性能的关键!,Coalesced Global Memory Accesses,在half-warp层次对访问global memory进行协调访问连续global memory区域:64 bytes-each thread reads a word:int,float,128 bytes-each thread reads a double-word:int2,float2,256 bytes each thread reads a quad-word:int4,float4,额外限制:Global memory区域的起始地址必须是该区域数据类型尺寸的整数倍Warp中第k个线程访问第k个地址例外:可以有某些中间线程不参加Predicated access,divergence within a warp,Coalesced Global Memory Accesses,Non-Coalesced Global Memory Accesses,Non-Coalesced Global Memory Accesses,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA存储器Shared memoryGlobal memoryCUDA程序设计工具程序优化,下载CUDA软件,http:/CUDA driver硬件驱动CUDA toolkit工具包CUDA SDK程序范例及动态链接库CUDA Visual Profiler程序剖析工具,软件环境,GPU硬件和CUDA软件安装后:,CPU(Host),CUDA Libraries(CUFFT&CUBLAS),CUDA Runtime Libraries,CUDA Driver,Application,GPU(Device),CUDA程序的编译(compile),CUDA源文件被nvcc处理nvcc is a compiler drivernvcc输出:PTX(Parallel Thread eXecution)Virtual ISA for multiple GPU hardwareJust-In-Time compilation by CUDA runtimeGPU binaryDevice-specific binary objectStandard C codeWith explicit parallelism,DEBUG,make dbg=1CPU代码以debug模式编译可以用debugger(e.g.gdb,visual studio)运行但不能检查GPU代码的中间结果make emu=1在CPU上以emulation方式顺序运行可以使用printf()打印中间结果基本顺序执行但不能再现线程间的竞争(race)现象浮点运算结果可能有微小的差别,检查资源使用,使用-cubin flag编译开关检查.cubin文件的”code”部分architecture sm_10abiversion 0modname cubincode name=BlackScholesGPUlmem=0smem=68reg=20bar=0bincode 0 xa0004205 0 x04200780 0 x40024c09 0 x00200780,per thread local memory,per thread block shared memory,per thread registers,提纲,从GPGPU到CUDACUDA并行程序组织并行执行模型CUDA存储器Shared memoryGlobal memoryCUDA程序设计工具程序优化,针对GPU优化算法,最大化独立并行性最大化算术计算密度(math/bandwidth)重复计算往往优于访问存储器GPU spends its transistors on ALUs,not memory尽量在GPU上计算以避免与CPU传递数据即使低并行度运算也往往优于频繁的CPU-GPU数据传递,利用Shared Memory,几百倍快于global memory线程之间通过shared memory合作使用一个或少量线程装载和计算thread block内全部线程共享的数据Use it to avoid non-coalesced accessStage loads and stores in shared memory to re-order non-coalesceable addressing,有效并行,划分计算使得GPU各个SM负载均衡Many threads,many thread blocks降低资源使用,以便多个thread blocks在SM上运行Registers,shared memory,优化存储器访问的一致性,全局存储器延时:400-600 clock cycles经常成为性能瓶颈Coalesced vs.non-coalesced优化效果明显-数量级性能差别Experiment:Kernel:read a float,increment,write back3M floats(12MB),Times averaged over 10K runs12K blocks x 256 threads:356 s coalesced357 s coalesced,some threads dont participate3,494 s permuted/misaligned thread access使用texture memory优化空间局部行为Spatial locality,Uncoalesced float3 Code,_global_ void accessFloat3(float3*d_in,float3 d_out)int index=blockIdx.x*blockDim.x+threadIdx.x;float3 a=d_inindex;a.x+=2;a.y+=2;a.z+=2;d_outindex=a;,Uncoalesced float3 Code,float3需要12 bytes:float3 f=d_inthreadIdx.x;Each thread ends up executing 3 readssizeof(float3)4,8,or 16Half-warp reads three 64B non-contiguous regions,Coalescing float3 Access,A 3-step approach(256 threads/block),Global Memory,Shared memory,Shared memory,Coalescing float3 Access,Use shared memory to allow coalescing256 threads per blockA thread block needs sizeof(float3)x256 bytes of SMEMEach thread reads 3 scalar floats:Offsets:0,(threads/block),2*(threads/block)These will likely be processed by other threads,so syncProcessingEach thread retrieves its float3 from SMEM arrayCast the SMEM pointer to(float3*)Use thread ID as indexRest of the compute code does not change!,Coalescing float3 Access代码,Matrix Transpose,SDK Sample(“transpose”)解释通过shared memory实现coalescing 在小尺度数据即可显示优化的明显效果,Uncoalesced Transpose,_global_ void transpose_naive(float*odata,float*idata,int width,int height)unsigned int xIndex=blockDim.x*blockIdx.x+threadIdx.x;unsigned int yIndex=blockDim.y*blockIdx.y+threadIdx.y;if(xIndex width,tx,ty,Height,Width,Height,Width,Uncoalesced Transpose,Coalesced Transpose,假设:矩阵已被分解为方块(tile)Thread block(bx,by):Read the(bx,by)input tile,store into SMEMWrite the SMEM data to(by,bx)output tileThread(tx,ty):Reads element(tx,ty)from input tileWrites element(tx,ty)into output tileCoalescing is achieved if:Block/tile dimensions are multiples of 16,Coalesced Transpose,Coalesced Transpose,_global_ void transpose(float*odata,float*idata,int width,int height)_shared_ float blockBLOCK_DIM*BLOCK_DIM;unsigned int xBlock=blockDim.x*blockIdx.x;unsigned int yBlock=blockDim.y*blockIdx.y;unsigned int xIndex=xBlock+threadIdx.x;unsigned int yIndex=yBlock+threadIdx.y;unsigned int index_out,index_transpose;if(xIndex width,Shared Memory优化,Threads read SMEM with stride=16Bank conflictsSolutionAllocate an“extra”columnRead stride=17Threads read from consecutive banks,Coalesced Transpose with Shared Memory Optimization,_global_ void transpose_exp(float*odata,float*idata,int width,int height)_shared_ float blockBLOCK_DIMBLOCK_DIM+1;unsigned int xIndex=blockIdx.x*BLOCK_DIM+threadIdx.x;unsigned int yIndex=blockIdx.y*BLOCK_DIM+threadIdx.y;if(xIndex width),Transpose Timings,Speedups with coalescing128x128:0.011ms vs.0.022ms(2.0X speedup)512x512:0.07ms vs.0.33ms(4.5X speedup)1024x1024:0.30ms vs.1.92ms(6.4X speedup)1024x2048:0.79ms vs.6.6ms(8.4X speedup)(注意:优化shared memory bank conflicts带来 10%的speedup),参考书目,NVIDIA,nVidia CUDA Programming Guide,NVidia,2008.H.Nguyen(ed.),GPU Gems 3 3D and General Programming Techniques for GPUs,Addison Wesley,2007.T.Mattson,et al.,Patterns for Parallel Programming,Addison Wesley,2005.CUDA中文网站http:/,参考文献,Special issues on GPU computingIEEE,Proceedings of IEEE,Vol.96,No.5,May,2008.ACM,Queue,Vol.6,No.2,March/April,2008.Elsevier,Journal of Parallel and Distributed Computing,Vol.68,No.10,October,2008.,