Wednesday, October 7, 2015

MapReduce Part I: How It Works

Most of the time, when we hit the training for the very first time, we wonder what it is all about. Students who are not comfortable writing Java programs are really scared of it. To some extent that's fair, but if you really understand the concept and how MapReduce works, it's a piece of cake.

You just need to apply the logic of data processing, and that should work very well.

What if I don't know Java? Don't worry: Hadoop can run MapReduce programs written in various languages, for instance Python, R, Ruby, etc. Hadoop Streaming provides pre-written connectors for these languages; you just need to use the right streaming setup for your language to run MapReduce.

OK, so what is MapReduce? It is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

How does MapReduce work?
MapReduce works by breaking the processing into two phases: the Map phase and the Reduce phase.
Each phase has key-value pairs as input and output.

Input to the Map phase will be the raw data (the input file the user feeds in), which is read record by record as key/value pairs.

Key -> offset of the beginning of the line
Value -> entire record

We can apply filtering in the Map phase to select only the data required for processing.
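
For example, here is a minimal sketch of a Map phase in the classic Hadoop Java API. The class name and the length check are mine, purely for illustration; the point is the method signature, which matches the key/value description above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The framework calls map() once per record, handing it the
// (byte offset, line text) pair described above.
public class SkeletonMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   -> offset of the beginning of the line
        // value -> the entire record
        if (value.getLength() < 20) {   // illustrative filter: drop short/bad records
            return;
        }
        // ... extract what you need and emit it with context.write(...)
    }
}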

The output format (the key-value types) of the Map phase will be the input format of the Reduce phase, so both formats should match.

Let's take an example & see how it works.

I have some raw data: population counts from people who participated in a survey (sample):

00740211975..+10000+010101010010101010010101010001010100101010001.....
00740211975..+20000+010101010010101010010101010001010100101010001....
00740211980..+50000+010101010010101010010101010001010100101010001....
00740211980..+4000+010101010010101010010101010001010100101010001.....
00740211975..+40000+010101010010101010010101010001010100101010001....
00740211980..+19000+010101010010101010010101010001010100101010001....
00740211975..+12000+010101010010101010010101010001010100101010001...

Now I want to process this using the MapReduce paradigm. As soon as we give this raw data as input to MapReduce, the Map phase will begin execution, but the raw data is transformed into key-value pairs before it is taken up by the Map phase.

(0,    00740211975..+10000+010101010010101010010101010001010100101010001....)
(75,  00740211975..+20000+010101010010101010010101010001010100101010001....)
(150,00740211980..+50000+010101010010101010010101010001010100101010001....)
(225,00740211980..+4000+010101010010101010010101010001010100101010001..... )
....
....
....


Here (0, 75, 150, 225) are the keys, which are the byte offsets of each line within the file; every record in this sample is 75 bytes long, so each offset is 75 more than the previous one.

Our purpose is to find the maximum population that participated in the survey in each year. The Map function extracts the year and the population from each record and emits them as output to the Reduce phase.

So after the Map function executes, the output would look like this:

Map Phase: Output
(1975,10000)
(1975,20000)
(1980,50000)
(1980,4000)
(1975,40000)
(1980,19000)
(1975,12000)

Note that the output of the Map phase still looks like key-value pairs, but here the YEAR acts as the key and the population acts as the value.
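
In Java, that Map function could look roughly like the sketch below. It assumes the record layout from the sample above: the year sits at characters 7-10 of each line, and the population count sits between the first two '+' signs. The class name is mine.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxPopulationMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Assumed layout (from the sample data): "00740211975..+10000+..."
        String year = line.substring(7, 11);   // e.g. "1975"
        int first = line.indexOf('+');
        int second = line.indexOf('+', first + 1);
        if (first < 0 || second < 0) {
            return;                            // filtering: skip malformed records
        }
        int population = Integer.parseInt(line.substring(first + 1, second));
        context.write(new Text(year), new IntWritable(population)); // e.g. (1975, 10000)
    }
}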

This output is processed by the MapReduce framework before being sent to the Reduce function, and that processing is nothing but Shuffle & Sort. We will look at the Shuffle phase in detail in my next post, but the Sort phase processes the output and groups the key-value pairs by key. So the above data will look like this after the "Shuffle & Sort" process is done:

Shuffle & Sort: o/p

(1975,[10000,20000,40000,12000])
(1980,[50000,4000,19000])

Now this data will be given to the Reduce function, which iterates over each list and finds the maximum population that participated in the survey in that year. The output of the Reduce function will be:

Reduce Phase: o/p

(1975,40000)
(1980,50000)
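
And here is a matching Reduce function sketch in Java (again, the class name is mine). The framework has already grouped the values by key, so the reducer only has to walk each list and keep the maximum:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxPopulationReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> populations, Context context)
            throws IOException, InterruptedException {
        // For 1975 this receives (1975, [10000, 20000, 40000, 12000])
        // and emits (1975, 40000).
        int max = Integer.MIN_VALUE;
        for (IntWritable population : populations) {
            max = Math.max(max, population.get());
        }
        context.write(year, new IntWritable(max));
    }
}

Wiring these two classes together into a runnable job (the driver) is part of how Hadoop executes a MapReduce program, which is exactly what the next post gets into.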

In the next post, we will see how Hadoop handles this MapReduce job, why it plays such a key role, and why it is considered a core component.

See ya...
