Hadoop MapReduce is how Hadoop processes data. MapReduce reads its input from, and writes its output to, the Hadoop Distributed File System (HDFS), which handles distributing the data across the cluster. MapReduce is also how Hadoop parallelizes its operations: many concurrent Mapper and Reducer processes run across many machines. Mappers scan data from HDFS in a massively parallel fashion and emit key-value pairs. The framework groups the Mappers' output by key, and Reducers then process all of the values for each key together.
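To make that data flow concrete, here is a toy, single-process sketch of the same pattern in plain Java: a map step that emits (word, 1) pairs, a grouping step standing in for the shuffle, and a reduce step that sums per key. This is illustrative only, not the Hadoop API (names like MiniMapReduce are made up for this sketch):

```java
import java.util.*;
import java.util.stream.*;

// A toy, in-memory illustration of the MapReduce data flow.
// Not the Hadoop API -- just the same shape on a single machine.
public class MiniMapReduce {
  public static void main(String[] args) {
    List<String> lines = List.of("the quick fox", "the lazy dog");

    // Map phase: scan the input and emit (key, value) pairs.
    // Here: (word, 1) for every word seen.
    Stream<Map.Entry<String, Integer>> mapped = lines.stream()
        .flatMap(line -> Arrays.stream(line.split("\\s+")))
        .map(word -> Map.entry(word, 1));

    // Shuffle + Reduce phase: group values by key, then combine
    // the values for each key. Here: a per-word sum.
    Map<String, Integer> reduced = mapped.collect(
        Collectors.groupingBy(Map.Entry::getKey,
            Collectors.summingInt(Map.Entry::getValue)));

    reduced.forEach((word, count) -> System.out.println(word + "\t" + count));
  }
}
```

The real framework does exactly this, except the map and reduce steps run as separate processes on separate machines, and the grouping happens over the network.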

That is MapReduce in its entirety: a simple framework for computing that turns out to generalize well to many kinds of operations, such as JOINs, sorts, and aggregations.
Because of tools like Pig and Hive, you don't need to think in MapReduce terms, or program in MapReduce, to use Hadoop. But if you wonder how things operate under the covers… it is all Map (read the data, emit a key and a value) and Reduce (group all values by key, perform another operation). And be aware… the most common use of the output of a MapReduce job is to feed it into another MapReduce job. You don't have to get it right in one operation!
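For a feel of the real thing, here is the canonical word-count job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes). The Mapper emits (word, 1) pairs; the framework groups them by word; the Reducer sums the counts. Class names and I/O paths are illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: read one line of input, emit (word, 1) for each word.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: receive one word plus all of its 1s, emit (word, sum).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be launched with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`, with both directories living in HDFS. And because the output is itself just files of key-value pairs in HDFS, it can be fed straight into the next MapReduce job in a pipeline.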