并行计算机体系结构.ppt (Parallel Computer Architecture)

Resource ID: 2263237. Size: 5.43 MB. Length: 72 pages. Format: PPT.



Parallel Computer Architecture
Lecture 7: The Introduction of Multicore Processor
April 13, 2009
隋秀峰
2023/2/7, The Introduction of Multicore Processor, 2

Outline
- Driving forces behind multicore processors
- Key problems multicore processors must solve
- Current state of multicore processors
- Emerging technologies in multicore processors

Today's Processor
- Voltage level: a flashlight (1 volt)
- Current level: an oven (250 amps)
- Power level: a light bulb (100 watts)
- Area: a postage stamp (1 square inch)
- Performance: GFLOPS

What Is the Future Need?
- The need for performance is never-ending
  - Complaints from today's end users
  - Tomorrow's killer applications
- Next step: how can we get to 1 TFLOPS?

Tomorrow's Killer Application (RMS)

Driving Forces for Multicore: Wire Delay
- Consider a 1 Tflop/s sequential machine:
  - Data must travel some distance, r, to get from memory to the CPU.
  - To get 1 data element per cycle, data must make this trip 10^12 times per second at the speed of light, c = 3 x 10^8 m/s. Thus r < c/10^12 = 0.3 mm.
- Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  - Each byte gets a square about 3 Angstroms on a side, roughly the size of a small atom.
- No choice but parallelism

Driving Forces for Multicore: Heat

Managing the Heat Load

Driving Forces for Multicore: Leakage Current
- Leakage current: from minor nuisance to chip killer
- Dissipated power is proportional to C V^2 f
[Figure: power (W) versus process technology (nm): 250, 180, 130, 90, 70]

Driving Forces for Multicore: Manufacturing Cost
- Moore's second law (Rock's law)
- Demo of 0.06 micron CMOS

Technology Trends: Microprocessor Capacity
- 2X transistors per chip every 1.5 years, called "Moore's Law"
- Microprocessors have become smaller, denser, and more powerful.
- Not just processors: bandwidth, storage, etc.
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Moore's Law Still Holds
- No exponential is forever, but perhaps we can delay it forever

Means of Increasing Performance
- Increasing clock frequency
  - From 60 MHz to 3,800 MHz in 12 years
  - Has resulted in the expected performance increase
- Execution optimization
  - Its core is instruction-level parallelism

A Brief History of Micro-architecture Evolution
- Two axes:
  - Exploiting parallelism: much of the performance comes from parallelism
    - Bit-level parallelism
    - Instruction-level parallelism (ILP)
    - Thread-level parallelism (TLP)
  - Hiding memory latency

What Is Pipelining?
- Dave Patterson's laundry example: 4 people doing laundry
  - wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency
- In this example:
  - Sequential execution takes 4 * 90 min = 6 hours
  - Pipelined execution takes 30 + 4 * 40 + 20 min = 3.5 hours
- Bandwidth = loads/hour
  - BW = 4/6 loads/hour without pipelining
  - BW = 4/3.5 loads/hour with pipelining
- Pipelining helps bandwidth but not latency (90 min)
- Bandwidth is limited by the slowest pipeline stage
- Potential speedup = number of pipeline stages
[Figure: task order versus time, 6 PM to 9 PM]

VLIW

Means of Increasing Performance
- Execution optimization
  - More powerful instructions
  - Execution optimization (pipelining, branch prediction, execution of multiple instructions, reordering the instruction stream, etc.)
- The gain from exploiting ILP is diminishing
  - Inherent barriers that ILP must tackle: control dependence, data dependence

Means of Increasing Performance
- What is next?
  - The processor needs to be fed with TLP
  - Here the problem is essentially the same as parallel programming
- Technologies for TLP
  - Simultaneous multithreading (SMT). Example: Intel Hyper-Threading
  - Chip multiprocessing (CMP): the multicore processor

Micro-architecture Trends
Adapted from Johan De Gelas, "Quest for More Processing Power," AnandTech, Feb. 8, 2005.

Understanding SMT and CMP

Clarifying Concurrency vs. Parallelism
- Concurrency: two or more threads are in progress at the same time
- Parallelism: two or more threads are executing at the same time
  - Multiple cores needed

Simultaneous Multithreading (SMT)
- Minimal resource replication
- Provides instructions to overlap memory latency
- Separate threads exploit idle resources
[Figure: Context 1 and Context 2 above shared functional units, L1 cache, L2 cache, and main memory]

SMT: Simultaneous Multithreading
[Figure: issue slots for superscalar, multithreaded, and SMT processors]

Entering the Multicore Era
- Concurrency in the form of hardware multithreading has been around for a while.
  - Useful for hiding memory latencies.
  - Only about a 30% performance improvement, and only for particular applications.
- How can we continue to utilize the ever-higher transistor densities predicted by Moore's Law?
- Current view: performance improvements can continue by packing multiple processing cores onto a single chip, i.e., multicore.
- Multi-core = chip multiprocessing = tera-scale computing

Chip Multiprocessing
- Much larger degree of resource replication
- Two complete processing cores on each chip
- Outer levels of cache and the external interface are shared
- Greatly reduced resource contention compared to SMT
[Figure: Context 1 and Context 2, each with its own functional units and L1 cache, above a shared L2 cache and main memory]

How Do We Benefit from Multi-Core?
- New target for micro-architecture: high performance/power

Multi-Core Processors
- Improved cost/performance ratio
  - Minimal increases in architectural complexity provide significant increases in performance
  - Minimizes performance stalls, with a dramatic increase in overall effective system performance
- Greater EEP (energy-efficient performance) and scalability
- Cores enable thread-level parallelism
  - Multi-core architecture enables a divide-and-conquer strategy to perform more work in a given clock cycle.

Multi-Core Processors (cont.)
- What is special about many-core?
  - Explicit multithreading is required to speed up a single application
  - Core-to-core communication
    - Latency decreases
    - Bandwidth increases
  - Cache size per core will also shrink

Multi-Core Processors (cont.)

Latency Test on Intel Clovertown

What Is the Problem? Where Is the Innovation?
- What about the core?
  - Equal to the original one or not?
  - A simple core may be a good choice
- What about on-chip power control?
  - Fine-grained power control
- What about the interconnect between cores and other units?
  - X cores means X times the memory references
  - Requires higher throughput between cores and caches, within the cache hierarchy, and between the last-level cache and memory
  - Requires lower latency in those places
  - Four basic kinds of interconnect: buses, crossbars, tiny networks, and rings
  - Each has its own tradeoffs in throughput, latency, resource occupation, and ease of implementation
  - Each may be suitable at different levels

What Is the Problem? Where Is the Innovation?
- What about the cache? (NUCA: non-uniform cache architecture)
- "A NUCA Substrate for Flexible CMP Cache Sharing," Proc. the 19th Annual International Conference on Supercomputing, June 2005, pp. 31-40

Problems of Multicore Processors
- A multicore processor is in effect an on-chip parallel system
  - Hierarchical
  - Distributed
  - Speeding up a single application requires explicit multithreading
- For software, the core problem posed by multicore systems is the development of parallel programs, including programming and debugging them
- The software challenge of multicore processors

What Is the Problem? Where Is the Innovation?
- Where are the threads? Perhaps the biggest challenge
  - Make programmers write threaded programs: the world may be confused.
  - Automatic parallelization: mission impossible, but it can improve in some respects.
  - Provide ready-made threaded modules
- How do we control the high-level behavior of our programs?
  - Try to ease the programmer's burden
  - Looks good, but how?

How to Meet the Software Challenge on Multicore
- Have programmers write parallel programs
  - Inherit and optimize OpenMP, MPI, etc.
  - New programming languages: X10, etc.
  - Transactional memory
- Automatic parallelization
  - Difficult; after 20 years of development it is still not generally applicable
  - Speculative multithreading
- Implement parallel libraries
  - Intel MKL, ScaLAPACK
- How do we control the high-level behavior of programs?
- Other valuable work
  - Functional languages, dataflow, domain-specific languages

All of the above are still open issues.

Break!

Multicore Products Nowadays
- Lots of dual-core products now:
  - Intel: Pentium D and Pentium Extreme Edition, Core Duo (2), Woodcrest, Montecito
  - IBM PowerPC
  - AMD Opteron / Athlon 64
  - Sun UltraSPARC IV
- Systems with more than two cores are here, with more coming:
  - IBM Cell (asymmetric): one dual-threaded PowerPC core plus eight "synergistic processing elements"
  - Sun Niagara: eight cores, four hardware threads per core
  - General-purpose computation on graphics processors (GPGPU)
- Intel expects to produce 16- or even 32-core chips within a decade.

Architecture of Dual-Core Chips
- Intel Core Duo
  - Two physical cores in a package
  - Each with its own execution resources
  - Each with its own L1 cache: 32K instruction and 32K data
  - Both cores share the L2 cache: 2 MB, 8-way set associative; 64-byte line size; 10 clock cycles latency; write-back update policy
- AMD Opteron
  - Separate 1 Mbyte L2 caches
  - Improvement for memory affinity and thread affinity

Intel Multi-core Plan

Cell from IBM and Sony

Niagara from Sun

The Technologies Under Way
- Rethink concurrency and parallelism for multi-core
- New programming models and programming languages
- Hardware (and software) support for multithreading
  - Control-driven speculation: speculative multithreading
  - Data-driven speculation: program demultiplexing
  - Architectural thread enhancement: support for hardware threads, lightweight synchronization (monitor/mwait)

Rethinking Concurrency and Parallelism for Multi-core
- What we have seen for multi-core
  - More parallelism needs to be exploited: scaling may be more important
  - More heterogeneity needs to be exploited: task mapping may need to be revisited
  - Low latency and high bandwidth between cores on chip: fine-grained parallelism may need to be rethought

Rethinking Concurrency and Parallelism for Multi-core
- Make full use of multi-core resources
  - More parallelism
  - Hide memory access stalls (the well-known memory wall)

Index Computation Test on Clovertown
- Index computation is both compute-intensive and I/O-intensive
- 32 GB of web page data; the generated index is 4.5 GB

Index Computation Test on Clovertown (cont.)
- Some indexing phases are dominated by computation, others by I/O
- Consider dividing the indexing process into multiple pipeline stages, giving a pipelined indexing algorithm that makes full use of the system's computing resources
- Principles for dividing the pipeline stages
  - Resource independence: each stage uses its own resources
  - Similar duration: the stages take roughly the same time
- Fine-grained pipelined algorithm: overlap the execution of pipeline stages to achieve parallelism

Intel Clovertown Test Environment

Index Computation Test on Clovertown (cont.)
- Performance gain on a single core: the pipeline hides part of the document-read I/O time
- Performance gain on multiple cores: the computation is parallelized
- Tests used 1.5 GB of memory, with the test data and the index on the same disk
[Figure labels: performance improved by 8.2%; performance improved by 53.4%]

Rethinking Concurrency and Parallelism for Multi-core
- Processor affinity benefits task mapping
  - The parallel FFT computation in NPB gets a 14% performance increase with MPICH

Rethinking Concurrency and Parallelism for Multi-core
- Exploit dynamic and adaptive out-of-order execution patterns on multi-core and heterogeneous systems

The Technologies Under Way
- Rethink concurrency and parallelism for multi-core
- New programming models and programming languages
- Hardware (and software) support for multithreading
  - Control-driven speculation: speculative multithreading
  - Data-driven speculation: program demultiplexing
  - Architectural thread enhancement: support for hardware threads, lightweight synchronization (monitor/mwait)

Programming Models and Programming Languages
- Bridge application software to system software and hardware, to better express parallelism for such heterogeneous systems
  - Transactional memory
  - IBM X10
  - Sun Fortress
- Other meaningful explorations
  - Functional languages
  - Dataflow
  - Domain-specific languages

Transactional Memory: A Way to Ease Thread Programming
- Thread programming is tedious

Transactional Memory: A Way to Ease Thread Programming
- A transaction is a sequence of memory loads and stores that either commits or aborts
- If a transaction commits, all the loads and stores appear to have executed atomically
- If a transaction aborts, none of its stores take effect
- Transaction operations aren't visible until they commit or abort
- A simplified version of traditional ACID database transactions (no durability, for example)

Transactional Memory Example

Problems in Transactional Memory

Solutions for Transactional Memory

X10
- Unified support for multicore systems and cluster systems
- High productivity
  - The language design emphasizes portability and safety
- Performance
  - Extends the Java virtual machine
  - Provides means for manual performance tuning
- Built on the Java language
  - Inherits Java's core values: high productivity, portability, maturity, safety
  - Targets mainstream Java/C/C++ programmers

X10 Vision: Portable Productive Parallel Programming
- The X10 language defines the mapping from X10 objects and activities (X10 data structures) to X10 places
- X10 deployment defines the mapping from virtual X10 places to physical processing elements (PEs)

Overview of X10
- Dynamic parallelism with a partitioned global address space
- Places encapsulate the binding of activities and globally addressable data
- async (P) S: run statement S asynchronously at place P
- finish S: execute statement S, and wait for descendant asyncs to terminate
- atomic S: execute statement S atomically
  - No place-remote accesses permitted in an atomic section
- Storage classes: activity-local, place-local, partitioned global, immutable
- Deadlock safety: any X10 program written with async, atomic, and finish can never deadlock

X10 Program Example

    // X10 pseudo code
    main() {  // implicit finish
        Activity A0 (Part 1);
        async A1;
        async A2;
        try {
            finish {
                Activity A0 (Part 2);
                async A3;
                async A4;
            }
        } catch () { }
        Activity A0 (Part 3);
    }

[Figure: activity diagram with finish scopes, asyncs, Activity A0 (Parts 1 to 3), Activities A1 to A4, and an IndexOutOfBounds exception]

The Technologies Under Way
- Rethin
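
As a rough illustration of the `finish` and `async` semantics listed in the X10 overview, the sketch below mimics them in Python with a thread pool. It is not X10 and is not taken from the lecture: `Finish`, `spawn`, `failing_a4`, and `run_example` are hypothetical names, places and `atomic` are not modeled, and choosing A4 as the activity that raises is an assumption, since the slide only labels an IndexOutOfBounds exception.

```python
from concurrent.futures import ThreadPoolExecutor


class Finish:
    """Rough analogue of X10 `finish`: the block ends only after every
    task spawned through this object (including nested spawns) is done."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = []

    def spawn(self, fn, *args):
        """Rough analogue of X10 `async`: start fn(*args) and do not wait."""
        future = self._pool.submit(fn, *args)
        self._futures.append(future)
        return future

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        errors = []
        i = 0
        # The list can grow while we wait, because tasks may spawn more tasks.
        while i < len(self._futures):
            err = self._futures[i].exception()  # blocks until that task ends
            if err is not None:
                errors.append(err)
            i += 1
        self._pool.shutdown()
        if exc_type is None and errors:
            raise errors[0]  # surface a task failure at the end of the block
        return False  # never swallow an exception raised by the block itself


def failing_a4():
    # Assumption: A4 is the activity that fails in the slide's diagram.
    raise IndexError("simulated IndexOutOfBounds in A4")


def run_example():
    log = []
    with Finish() as outer:              # main() has an implicit finish
        log.append("A0 part 1")
        outer.spawn(log.append, "A1")
        outer.spawn(log.append, "A2")
        try:
            with Finish() as inner:      # explicit finish
                log.append("A0 part 2")
                inner.spawn(log.append, "A3")
                inner.spawn(failing_a4)
        except IndexError:
            log.append("caught")
        log.append("A0 part 3")
    return log
```

Calling `run_example()` returns the log of steps. A3 is always logged before the exception is caught, because the inner block waits for all of its tasks before it propagates the failure; A1 and A2 may appear anywhere after "A0 part 1".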

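
The timings in the laundry pipelining example can be reproduced with a short calculation. The following is a minimal Python sketch, assuming each stage handles one load at a time and a load may wait between stages; the function names are illustrative and not from the lecture.

```python
def sequential_minutes(stage_minutes, loads):
    """Total time when each load finishes all stages before the next starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Total time when a load enters a stage as soon as the stage is free."""
    stage_free_at = [0] * len(stage_minutes)  # when each stage becomes idle
    for _ in range(loads):
        ready = 0  # when this load leaves the previous stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + minutes
            ready = stage_free_at[s]
    return stage_free_at[-1]


stages = [30, 40, 20]  # wash, dry, fold (minutes), as in the laundry example
seq = sequential_minutes(stages, 4)   # 360 minutes = 6 hours
pipe = pipelined_minutes(stages, 4)   # 210 minutes = 3.5 hours
bandwidth_seq = 4 / (seq / 60)        # loads per hour without pipelining
bandwidth_pipe = 4 / (pipe / 60)      # loads per hour with pipelining
```

The pipelined total matches 30 + 4 * 40 + 20 because the 40-minute dryer is the slowest stage and therefore sets the rate at which loads complete.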