Graduation Project Translation: Building a Cloud Computing Platform Based on Hadoop
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.
2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's reduce function via an iterator. This allows us to handle lists of values that are too large to fit in memory.
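To make the grouping and iterator behaviour described above concrete, the following is a minimal single-process Python sketch of the model. It is only an illustration, not Google's C++ library; the driver name run_mapreduce and its signature are invented for this sketch.

from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

def run_mapreduce(
    inputs: Iterable[Tuple[Any, Any]],
    mapper: Callable[[Any, Any], Iterable[Tuple[Any, Any]]],
    reducer: Callable[[Any, Iterator[Any]], Iterable[Any]],
) -> Dict[Any, List[Any]]:
    # Map phase: apply the user's map function to every input key/value pair
    # and group the intermediate values by intermediate key.
    intermediate: Dict[Any, List[Any]] = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in mapper(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce phase: hand each intermediate key, plus an iterator over its
    # values, to the user's reduce function. The real library streams these
    # values, which is what allows value lists larger than memory.
    output: Dict[Any, List[Any]] = {}
    for ikey, ivalues in intermediate.items():
        output[ikey] = list(reducer(ikey, iter(ivalues)))
    return output

In the real system the map tasks, the shuffle that groups values by key, and the reduce tasks all run distributed across many machines; this sketch collapses them into one process purely to show the data flow.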
2.1 Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just "1" in this simple example). The reduce function sums together all counts emitted for a particular word. In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++). Appendix A contains the full program text for this example.
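For comparison, here is an assumed Python transliteration of the pseudo-code above, run through the run_mapreduce sketch from Section 2. It is not the C++ program of Appendix A; the document names and contents are invented for the example.

def word_count_map(key: str, value: str):
    # key: document name; value: document contents
    for word in value.split():
        yield (word, "1")

def word_count_reduce(key: str, values):
    # key: a word; values: an iterator over counts, encoded as strings
    yield str(sum(int(v) for v in values))

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog jumps over the fox")]
counts = run_mapreduce(docs, word_count_map, word_count_reduce)
print(counts["fox"])  # ['2']: "fox" occurs once in each document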
2.2 Types

Even though the previous pseudo-code is written in terms of string inputs and outputs, conceptually the map and reduce functions supplied by the user have associated types:

    map    (k1, v1)        → list(k2, v2)
    reduce (k2, list(v2))  → list(v2)

I.e., the input keys and values are drawn from a different domain than the output keys and values. Furthermore, the intermediate keys and values are from the same domain as the output keys and values.

Our C++ implementation passes strings to and from the user-defined functions and leaves it to the user code to convert between strings and appropriate types.
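Expressed with Python's typing module (again only an assumed sketch of these signatures, not part of the paper), the two types could be written as:

from typing import Callable, Iterator, List, Tuple, TypeVar

K1 = TypeVar("K1"); V1 = TypeVar("V1")   # input key/value domain
K2 = TypeVar("K2"); V2 = TypeVar("V2")   # intermediate/output key/value domain

# map:    (k1, v1)       -> list((k2, v2))
MapFn = Callable[[K1, V1], List[Tuple[K2, V2]]]

# reduce: (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, Iterator[V2]], List[V2]]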
2.3 More Examples

Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations (a sketch of the last one follows the list).

Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs ⟨URL, 1⟩. The reduce function adds together all values for the same URL and emits a ⟨URL, total count⟩ pair.

Reverse Web-Link Graph: The map function outputs ⟨target, source⟩ pairs for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair ⟨target, list(source)⟩.
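As an illustration, the Reverse Web-Link Graph example might look as follows against the same run_mapreduce sketch. The link extraction is deliberately simplistic and the page data is invented for the example; a real job would parse HTML.

def reverse_links_map(source: str, page: str):
    # Emit (target, source) for every outgoing link found in the page named `source`.
    # Here a "link" is any whitespace-separated token of the form href="...".
    for token in page.split():
        if token.startswith('href="') and token.endswith('"'):
            yield (token[len('href="'):-1], source)

def reverse_links_reduce(target: str, sources):
    # Concatenate all source URLs that link to `target`.
    yield list(sources)

pages = [("a.html", 'href="c.html" href="b.html"'), ("b.html", 'href="c.html"')]
links = run_mapreduce(pages, reverse_links_map, reverse_links_reduce)
print(links["c.html"])  # [['a.html', 'b.html']]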