Google云计算技术MapReduce国外课件.ppt

资源ID：6506594 资源大小：1.01MB 全文页数：48页
资源格式： PPT 下载积分：15金币

快捷下载

会员登录下载

三方登录下载：

下载资源需要15金币

邮箱/手机：
温馨提示：	用户名和密码都是您填写的邮箱或者手机号，方便查询和重复下载（系统自动生成）
支付方式：
验证码：	换一换

加入VIP免费专享

账号：
密码：
验证码：	换一换
当日自动登录忘记密码？

友情提示

1、下载资料失败解决办法

2、PDF文件下载后，可能会被浏览器默认打开，此种情况可以点击浏览器菜单，保存网页到桌面，就可以正常下载了。

3、本站不支持迅雷下载，请使用电脑自带的IE浏览器，或者360浏览器、谷歌浏览器下载即可。

4、本站资源下载后的文档和图纸-无水印,预览文档经过压缩，下载后原文更清晰。

5、试题试卷类文档，如果标题没有明确说明有答案则都视为没有答案，请知晓。

网站客服

侵权投诉

Google云计算技术MapReduce国外课件.ppt

MapReduce:Simplified Data Processing on Large Clusters,Jeffrey Dean&Sanjay GhemawatOSDI04,“The density of transistors on a chip doubles every 18 months,for the same cost”(1965),The Free Lunch Is Almost Over!,The Future is Multi-core!,Web graphic Super ComputerJanet E.Ward,2000,Cluster of Desktops,The Future is Multi-core!,Replace specialized powerful Super-Computers with large clusters of commodity hardwareBut Distributed programming is inherently complex.,Googles MapReduce Paradigm,Platform for reliable,scalable parallel computingAbstracts issues of distributed and parallel environment from programmer.Runs over Google File Systems,What is MapReduce?,A programming model and an associated implementation(library)for processing and generating large data sets(on large clusters).A new abstraction allowing us to express the simple computations we were trying to perform but hides the messy details of parallelization,fault-tolerance,data distribution and load balancing in a library.,References,Jeffrey Dean,Sanjay Ghemawat:MapReduce:Simplified Data Processing on Large Clusters.OSDI 2004:137-150Also:Interpreting the Data:Parallel Analysis with Sawzall.Rob Pike,Sean Dorward,Robert Griesemer,Sean Quinlan.Google Labs.,Google File Systems(GFS),Highly scalable distributed file system for large data-intensive applications.Provides redundant storage of massive amounts of data on cheap and unreliable computersProvides a platform over which other systems like MapReduce,BigTable operate.,GFS Architecture,MapReduce:Insight,”Consider the problem of counting the number of occurrences of each word in a large collection of documents”How would you do it in parallel?,One possible solution,MapReduce Programming Model,Inspired from map and reduce operations commonly used in functional programming languages like Lisp.Users implement interface of two primary methods:1.Map:(key1,val1)(key2,val2)2.Reduce:(key2,val2)val3Many real world tasks are expressible in this model.Assumption:data has no correlation,or it is small.,Big picture,Map operation,Map,a pure function,written by the user,takes an input key/value pair and produces a set of intermediate key/value pairs.e.g.(docid,doc-content)Draw an analogy to SQL,map can be visualized as group-by clause of an aggregate query.,Reduce operation,On completion of map phase,all the intermediate values for a given output key are combined together into a list and given to a reducer.Can be visualized as aggregate function(e.g.,average)that is computed over all the rows with the same group-by attribute.,Example,The problem of counting the number of occurrences of each word in a large collection of documents.,Pseudo-code,map(String input_key,String input_value):/input_key:document name/input_value:document contents for each word w in input_value:EmitIntermediate(w,1);reduce(String output_key,Iterator intermediate_values):/output_key:a word/output_values:a list of counts int result=0;for each v in intermediate_values:result+=ParseInt(v);Emit(AsString(result);,More Examples,Distributed grep:Map:(key,whole doc/a line)(the matched line,key)Reduce:identity function,More Examples,Count of URL Access Frequency:Map:logs of web page requests(URL,1)Reduce:(URL,total count),More Examples,Reverse Web-Link Graph:Map:(source,target)(target,source)Reduce:(target,list(source)(target,list(source),MapReduce:Execution overview,Architecture,Master Data StructureTask state:idle,in-progress,completedIdentity of worker machine:for in-progress tasksLocation of intermediate file regions of map tasks.Receive from map tasksPush to reduce tasks.,Execution overview,Split input files(1)Master and workers(2)Map task workers(3)Buffering of results(4)Copying and sorting(5)Reduce workers(6)Return to user code(7),MapReduce:Execution overview,MapReduce:Example,MapReduce in Parallel:Example,MapReduce:Runtime Environment,Fault Management,Fault Tolerance in a word:redoMaster pings workers,re-schedules failed tasks.Note:Completed map tasks are re-executed on failure because their output is stored on the local disk.Master failure:redoSemantics in the presence of failures:Deterministic map/reduce function:Produce the same output as would have been produced by a non-faulting sequential execution of the entire programRely on atomic commits of map and reduce task outputs to achieve this property.,MapReduce:Fault Tolerance,Handled via re-execution of tasks.Task completion committed through master What happens if Mapper fails?Re-execute completed+in-progress map tasksWhat happens if Reducer fails?Re-execute in progress reduce tasksWhat happens if Master fails?Potential trouble!,MapReduce:Refinements Locality Optimization,Leverage GFS to schedule a map task on a machine that contains a replica of the corresponding input data.Thousands of machines read input at local disk speedWithout this,rack switches limit read rate,MapReduce:Refinements Redundant Execution,Slow workers are source of bottleneck,may delay completion time.Near end of phase,spawn backup tasks,one to finish first wins.Effectively utilizes computing power,reducing job completion time by a factor.,MapReduce:Refinements Skipping Bad Records,Map/Reduce functions sometimes fail for particular inputs.Fixing the Bug might not be possible:Third Party Libraries.On ErrorWorker sends signal to MasterIf multiple error on same record,skip record,MapReduce:Refinements Miscellaneous,Combiner Function at MapperSorting Guarantees within each reduce partition.Local execution for debugging/testing User-defined counters,MapReduce:,Walk through of One more Application,MapReduce:PageRank,PageRank models the behavior of a“random surfer”.C(t)is the out-degree of t,and(1-d)is a damping factor(random jump)The“random surfer”keeps clicking on successive links at random not taking content into consideration.Distributes its pages rank equally among all pages it links to.The dampening factor takes the surfer“getting bored”and typing arbitrary URL.,Computing PageRank,PageRank:Key Insights,Effect at each iteration is local.i+1th iteration depends only on ith iterationAt iteration i,PageRank for individual nodes can be computed independently,PageRank using MapReduce,Use Sparse matrix representation(M)Map each row of M to a list of PageRank“credit”to assign to out link neighbours.These prestige scores are reduced to a single PageRank value for a page by aggregating over them.,PageRank using MapReduce,Source of Image:Lin 2008,Phase 1:Process HTML,Map task takes(URL,page-content)pairs and maps them to(URL,(PRinit,list-of-urls)PRinit is the“seed”PageRank for URLlist-of-urls contains all pages pointed to by URLReduce task is just the identity function,Phase 2:PageRank Distribution,Reduce task gets(URL,url_list)and many(URL,val)valuesSum vals and fix up with d to get new PREmit(URL,(new_rank,url_list)Check for convergence using non parallel component,MapReduce:Some More Apps,Distributed Grep.Count of URL Access Frequency.Clustering(K-means)Graph Algorithms.Indexing Systems,MapReduce Programs In Google Source Tree,MapReduce Jobs run in Aug,2004,MapReduce:Extensions and similar apps,PIG(Yahoo)Hadoop(Apache)DryadLinq(Microsoft),Large Scale Systems Architecture using MapReduce,Take Home Messages,Although restrictive,provides good fit for many problems encountered in the practice of processing large data sets.Functional Programming Paradigm can be applied to large scale computation.Easy to use,hides messy details of parallelization,fault-tolerance,data distribution and load balancing from the programmers.And finally,if it works for Google,it should be handy!,

注意事项

本文（Google云计算技术MapReduce国外课件.ppt）为本站会员（牧羊曲112）主动上传，三一办公仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若此文所含内容侵犯了您的版权或隐私，请立即通知三一办公（点击联系客服），我们立即给予删除！

温馨提示：如果因为网速或其他原因下载失败请重新下载，重复下载不扣分。