From the course: Learning Hadoop

What is MapReduce?

- [Instructor] Foundationally, Hadoop was designed to work with a computational framework called MapReduce. What is this? Well, it's a programming paradigm for distributed compute and storage. It was initially designed to solve one problem: splitting massive text-based data sets in order to index the entire internet. It grew out of an open source implementation of technologies that were created internally at Google. If you remember back to earlier videos, there were two white papers, GFS for the storage and MapReduce for the compute. The compute has two main parts, map and reduce.

This is a very high-level conceptual diagram, which helps you start to intuitively understand what MapReduce does. In the map phase you have some input, and the different colors represent different chunks of text that we want to process. Then you have a shuffle phase, which is internal, and then you have a reduce phase, which groups the subsets of, in this case, words or text information together. Again, we'll be drilling into this in quite a bit of detail over the next few sections. But you can see that we have a map phase, a shuffle phase, and a reduce phase, and the circles represent units of computational work.

Key aspects of MapReduce are that it's an API, or a set of libraries, designed to run on commodity hardware. The whole idea is to make computation on massive amounts of data much cheaper than it would be with a traditional data analytics solution like a data warehouse or a data lake.

There are some concepts that are important to start to understand. A job is a unit of MapReduce work, or an instance. A map task runs on each worker node in the cluster. A reduce task runs on some of the worker nodes. The input data lives by default on HDFS, which you'll remember is triple replicated (though that's tunable), or, more often these days, on cheaper cloud storage, such as bucket-based storage.

The map part of MapReduce executes the map function on the data on each node, and these nodes are the worker nodes in the cluster. The output is a key-value pair on each node. Now, if we're doing text analysis, that's often the word and a count, but it doesn't have to be. MapReduce has been broadened, or extended, to analyze many different kinds of data, not just text data. So all of these functions are overridable or configurable if you have a different type of data or you want to do optimization.

The reduce part of MapReduce executes the reduce function on the data. It executes on some of the nodes, not necessarily all of them, and that's really important to understand. As we drill into optimization, a key technique is determining how many nodes to use, what size those nodes should be, how much data those nodes will receive, and how much data is being sent across the network. The idea of reduce is that you produce aggregates. So, in the canonical example, you count how many of each word there are. The mapper would output a one with the first word, a one with the second word, and so on. The reducer would output the total count of each particular word (so two of this word, three of that word, as pairs on some nodes), and it outputs a combined list.

MapReduce can be written in many different programming languages. Initially, most of the work I've seen out there was done in Java, but it's supported in many languages.
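To make those three phases concrete before we look at anything cluster-shaped, here is a minimal single-process sketch of the word-count flow just described. This is purely illustrative: the class and method names are my own, not Hadoop's API, and everything runs in order on one machine instead of in parallel across worker nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-process walk-through of map, shuffle, reduce.
public class MapReduceConcept {

    // Map phase: turn one chunk of text into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // Shuffle phase: group all values that share a key, so each word's
    // ones land in a single list. The framework does this internally.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: aggregate each key's values into a total count.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new HashMap<>();
        grouped.forEach((word, ones) ->
            totals.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    public static void main(String[] args) {
        String input = "the quick brown fox jumps over the lazy dog the end";
        // Prints something like {the=3, quick=1, brown=1, ...}
        System.out.println(reduce(shuffle(map(input))));
    }
}
```

In a real cluster, map runs in parallel on many chunks across worker nodes, and the shuffle moves each key's pairs over the network to whichever node runs the reduce for that key. That network movement is exactly the optimization target mentioned above.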
So this is a kind of starter pseudo-code. Now, I've put this all on one screen. In production environments you would often have separate classes, for testing and for portability, but we're at an early stage of learning here, so we're just putting it all in a script-like fashion on one page. We have a public class called MapReduce, and notice that we have three static methods: the main method, the map method, and the reduce method. The reason they're static, or global, is because MapReduce is designed to work functionally and have no shared state, which is one of the reasons it can scale. In the main method, notice there is a JobRunner utility that allows you to create a job instance, which will then be invoked by calling a map instance and a reduce instance.
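For comparison, here is roughly how that slide's pseudo-code maps onto Hadoop's actual Java API, using the canonical WordCount example. One difference to note: in real Hadoop code, map and reduce are methods on static nested Mapper and Reducer classes rather than bare static methods, and the job setup the slide delegates to a JobRunner utility is done through Hadoop's Job class.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map task: runs on each worker node against one split of the
    // input, emitting a (word, 1) pair for every token it sees.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // The reduce task: runs on some worker nodes after the shuffle,
    // receiving each word together with all of its ones and summing them.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // The driver: plays the role of the slide's JobRunner, wiring the
    // mapper and reducer into a job and submitting it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this as a JAR and run it with something like hadoop jar wordcount.jar WordCount /input /output, where the input and output arguments are HDFS or bucket-based paths.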
