Parallel Computer Architecture (并行计算机体系结构.ppt)
Parallel Computer Architecture
Lecture 7: The Introduction of Multicore Processor
April 13, 2009
隋秀峰

2023/2/7, The Introduction of Multicore Processor

Outline
  Driving forces behind multicore processors
  Key problems multicore processors must solve
  Current state of multicore processors
  Emerging technologies in multicore processors

Today's Processor
  Voltage level: a flashlight (1 volt)
  Current level: an oven (250 amps)
  Power level: a light bulb (100 watts)
  Area: a postage stamp (1 square inch)
  Performance: GFLOPS

What is the future need?
  The need for performance is never-ending
  Complaints from today's end users
  Tomorrow's killer application
  Next step: how can we get to 1 TFLOPS?

Tomorrow's Killer Application (RMS)

Driving forces behind multicore: wire delay
  Consider a 1 Tflop/s sequential machine:
  Data must travel some distance, r, to get from memory to the CPU.
  To deliver 1 data element per cycle, data must make that trip 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm.
  Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area: each stored byte gets a cell about 3 Angstroms on a side, roughly the size of a small atom.
  No choice but parallelism.
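The wire-delay arithmetic above can be checked with a few lines of Python. This is only a sketch: the constants are the rounded values from the slide, and counting 1 Tbyte as 10^12 one-byte items tiled on a square grid is an assumption made here to recover the "3 Angstrom" figure. The function names are illustrative.

```python
# Worked check of the wire-delay argument, using the slide's rounded values.
C_LIGHT_M_PER_S = 3e8   # speed of light as given on the slide
ACCESSES_PER_S = 1e12   # one data element per cycle at 1 Tflop/s
STORED_ITEMS = 1e12     # assumption: 1 Tbyte counted as 10^12 one-byte items


def max_distance_m(speed=C_LIGHT_M_PER_S, rate=ACCESSES_PER_S):
    """Upper bound on the memory-to-CPU distance for one trip per cycle."""
    return speed / rate


def cell_side_m(square_side_m, items=STORED_ITEMS):
    """Side of one item's cell when `items` tile a square of the given side."""
    return square_side_m / items ** 0.5


r = max_distance_m()
side = cell_side_m(r)
print(f"r < {r * 1e3:.1f} mm")
print(f"cell side = {side * 1e10:.1f} Angstrom")
```

Under these assumptions the bound on r comes out at 0.3 mm and each cell is about 3 Angstroms on a side, which is the atomic scale the slide points to.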
Driving forces behind multicore: heat

Managing the Heat Load

Driving forces behind multicore: leakage current
  Leakage current: from minor nuisance to chip killer
  [Figure: power (W) versus process technology (250, 180, 130, 90, 70 nm)]
  Dissipated power ~ C V^2 f
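The relation "dissipated power ~ C V^2 f" covers dynamic (switching) power only; it does not model the leakage current that the slide highlights. As a minimal sketch of how that relation behaves, the snippet below compares a normalized baseline with a design whose voltage and frequency are both lowered by 20%. The function name and all numeric values are hypothetical and chosen only to make the ratio easy to read.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic (switching) power, proportional to C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical normalized baseline: C = V = f = 1.
base = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Illustrative assumption: voltage and frequency both reduced by 20%.
scaled = dynamic_power(capacitance=1.0, voltage=0.8, frequency=0.8)

ratio = scaled / base
print(f"relative dynamic power: {ratio:.3f}")
```

Because voltage enters squared, the combined 20% reduction roughly halves dynamic power in this toy setting, which is one reason several slower cores can be more power-efficient than one fast core.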
Driving forces behind multicore: manufacturing cost
  Moore's 2nd law (Rock's law)
  Demo of 0.06 micron CMOS

Technology Trends: Microprocessor Capacity
  2X transistors per chip every 1.5 years, called "Moore's Law"
  Microprocessors have become smaller, denser, and more powerful.
  Not just processors: bandwidth, storage, etc.
  Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Moore's Law Still Holds
  No exponential is forever, but perhaps we can delay it forever.

Means of Increasing Performance
  Increasing clock frequency
    From 60 MHz to 3,800 MHz in 12 years
    Has resulted in the expected performance increase
  Execution optimization
    The core technique is instruction-level parallelism

A brief history of micro-architecture evolution
  Two axes:
  Exploiting parallelism (much of the performance comes from parallelism)
    Bit-level parallelism
    Instruction-level parallelism (ILP)
    Thread-level parallelism (TLP)
  Hiding memory latency

What is Pipelining?
Dave Patterson's laundry example: 4 people doing laundry
  wash (30 min) + dry (40 min) + fold (20 min) = 90 min latency
  [Figure: laundry timeline from 6 PM to 9 PM; axes: task order, time]
In this example:
  Sequential execution takes 4 * 90 min = 6 hours
  Pipelined execution takes 30 + 4 * 40 + 20 min = 3.5 hours
  Bandwidth = loads/hour
    BW = 4/6 loads/hour without pipelining
    BW = 4/3.5 loads/hour with pipelining
  Pipelining helps bandwidth but not latency (90 min)
  Bandwidth is limited by the slowest pipeline stage
  Potential speedup = number of pipe stages

VLIW

Means of Increasing Performance
  Execution optimization
    More powerful instructions
    Execution optimization (pipelining, branch prediction, execution of multiple instructions, reordering the instruction stream, etc.)
  The gain from exploiting ILP is diminishing
  Inherent barriers ILP must overcome: control dependence, data dependence

Means of Increasing Performance
  What is next?
    The processor needs to be fed with TLP
    Here the problem is essentially the same as parallel programming
  Technologies for TLP
    Simultaneous multithreading (SMT). Example: Intel Hyper-Threading
    Chip multiprocessing (CMP): multi-core processor
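Returning to the laundry pipelining example earlier in this section (4 loads; wash 30 min, dry 40 min, fold 20 min), the timing figures can be reproduced with a short sketch. The function names are illustrative. The pipelined formula used here, sum of the stages plus (loads - 1) times the slowest stage, is a general form for identical loads that gives the same result as the slide's 30 + 4 * 40 + 20.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Identical loads flow through the stages; the slowest stage sets the pace."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)


stages = (30, 40, 20)  # wash, dry, fold, in minutes (from the slide)
loads = 4

seq = sequential_minutes(stages, loads)
pipe = pipelined_minutes(stages, loads)

print(f"sequential: {seq / 60:.1f} h, bandwidth {loads / (seq / 60):.2f} loads/h")
print(f"pipelined:  {pipe / 60:.1f} h, bandwidth {loads / (pipe / 60):.2f} loads/h")
print(f"speedup: {seq / pipe:.2f}x; single-load latency stays {sum(stages)} min")
```

The speedup here is below the number of stages (3) because the stages are unequal and only four loads pass through, so the fill and drain time is not amortized.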
Micro-architecture Trends
  Adapted from Johan De Gelas, "Quest for More Processing Power," AnandTech, Feb. 8, 2005.

Understanding SMT and CMP

Concurrency vs. Parallelism
  Concurrency: two or more threads are in progress at the same time
  Parallelism: two or more threads are executing at the same time
    Multiple cores needed

Simultaneous Multithreading (SMT)
  Minimal resource replication
  Provides instructions to overlap memory latency
  Separate threads exploit idle resources
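The concurrency versus parallelism distinction above can be illustrated with a toy scheduling model. This is not real threading: `finish_time` is a hypothetical helper that advances time in unit steps and lets at most `cores` unfinished tasks execute per step, taking turns round-robin. With one core, two tasks are both in progress but only one executes at any instant (concurrency); with two cores, both execute in the same step (parallelism).

```python
def finish_time(task_lengths, cores):
    """Steps until all tasks finish under round-robin time slicing.

    In each step, at most `cores` unfinished tasks execute one unit of work.
    """
    remaining = list(task_lengths)
    time = 0
    next_task = 0  # rotating start index so tasks take turns
    while any(r > 0 for r in remaining):
        ran = 0
        for offset in range(len(remaining)):
            i = (next_task + offset) % len(remaining)
            if remaining[i] > 0 and ran < cores:
                remaining[i] -= 1
                ran += 1
        next_task = (next_task + 1) % len(remaining)
        time += 1
    return time


tasks = [4, 4]  # two tasks, each needing 4 units of work (illustrative)
print("1 core (concurrent, interleaved):", finish_time(tasks, cores=1))
print("2 cores (parallel):", finish_time(tasks, cores=2))
```

In this model the single-core run needs as many steps as the total work, while the two-core run needs only as many as the longest task, which is the benefit that requires multiple cores.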
  [Diagram labels: Context1, Context2, Functional Units, L1 Cache, L2 Cache, Main Memory]

SMT: simultaneous multithreading
  [Figure: issue slots for Superscalar, Multithreaded, and SMT]

Moving to the Era of Multicore
  Concurrency in the form of hardware multithreading has been around for a while.
    Useful for hiding memory latencies.
    Only about a 30% performance improvement, and only for certain applications.
  How can we continue to utilize the ever-higher transistor densities predicted by Moore's Law?
  Current view: performance improvements can continue by packing multiple processing cores onto a single chip, i.e., multicore.
  Multi-core = Chip Multiprocessing = Tera-scale Computing

Chip Multiprocessing
  Much larger degree of resource replication
    Two complete processing cores on each chip
    Outer levels of cache and the external interface are shared
  Greatly reduced resource contention compared to SMT
  [Diagram labels: Context1, Context2, Functional Units (x2), L1 Cache (x2), L2 Cache, Main Memory]

What do we gain from Multi-Core?
  New target for micro-architecture: high performance/power

Multi-Core Processors
  Improved cost/performance ratio
    Minimal increases in architectural complexity provide significant increases in performance
    Minimizes performance stalls, with a dramatic increase in overall effective system performance
  Greater EEP (energy-efficient performance) and scalability
    Cores enable thread-level parallelism
    A multi-core architecture enables a divide-and-conquer strategy to perform more work in a given clock cycle.

Multi-Core Processors (cont.)
  What is special about many-core?
    Explicit multithreading is required to speed up single-application performance
    Core-to-core communication
      Latency decreases
      Bandwidth increases
    Cache size per core will also decrease

Multi-Core Processors (cont.)
  Latency test on Intel Clovertown

What is the problem? Where is the innovation?
  What about the core?
    Same as the original core or not?
    A simple core may be a good choice
  What about on-chip power control?
    Fine-grained power control
  What about the interconnect between cores and other units?
    X cores mean X times the memory references
      Requires higher throughput between cores and caches, within the cache hierarchy, and between the last-level cache and memory
      Requires lower latency in those places
    Four basic kinds of interconnect
      Buses, crossbars, tiny networks, and rings
      Each has its own tradeoffs in throughput, latency, resource occupation, and ease of implementation
      Each may be suitable at different levels

What is the problem? Where is the innovation?
  What about the cache? (NUCA: non-uniform cache architecture)