Skip to content

Latest commit

 

History

History
31 lines (19 loc) · 3.2 KB

2018.03.30-MapReduce.md

File metadata and controls

31 lines (19 loc) · 3.2 KB

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Summary

For companies like Google, they need to process extremely large data, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. While handling such distributed tasks, they have to face the issue of how to parallelize the computation, distribute the data, and handle failures, which obscures the original simple computation with large amounts of complex engineering work to deal with these issues.

This paper presents a new programming model, MapReduce, which is inspired by the map and reduce primitives in many functional languages. This model provides users with an abstraction that allows them to express the simple computations they are trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. In this paper, the authors also describe an implementation of the MapReduce interface towards the cluster-based computing environment in Google and discuss several useful refinements of the programming model.

Contributions

  • The authors present MapReduce, a new and simple abstraction for processing any size of data that fits into this programming model. They bring the idea from functional languages.
  • MapReduce is allowed to run over a large cluster of commodity machines. As well, it can handle a large amount of data.

Strengths

  • MapReduce provides a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations. It hides complex technical details under the simple interface and helps users deal with data partition problem, task scheduling problem, failure handling problem, the inter-machine communication problem. Moreover, such design can achieve high performance on large clusters of commodity PCs.
  • The distributed fashion and the use of a master node enable the system to tolerate node failures by simply rescheduling the work to other available nodes. Furthermore, the master node periodically stores checkpoints of the master data structures so that a new copy of the master node can be restarted and resumes the whole work from the last interrupted state.
  • MapReduce does not limit input and output types. In this case, users can implement their readers to retrieve data from different sources and can produce customed data types.

Weaknesses

  • Such programming model is not suitable for real-time data processing scenario. First, it relies on a distributed file system to store large amounts of data. Then, MapReduce framework starts to process the data. This two-step process may downgrade the performance.
  • MapReduce only supports a restricted programming model, which requires that tasks must be written as acyclic data-flow programs. Moreover, based on the programming model in this paper, the map function and the reduce function must be stateless, which imposes limitations that are felt in fields such as machine learning, where iterative algorithms may lead to data dependency among iterations.