Thursday, September 24, 2015

HDFS File Read : How It Works at a High Level

After you read this post, you will understand how the file read operation works in HDFS. It assumes you are reading the posts in this blog in order.

I would recommend reading my earlier post "HDFS Part 1" before continuing, so that you understand the architecture of HDFS first.

Let's consider the same data we used in the earlier post for the READ operation.

Source :Google


As soon as the client receives the read request from the user, it contacts the Name Node for the list of blocks that have to be read to fulfil the request.

Block locations are stored in the Name Node's metadata.

For each block there are 3 copies on 3 different Data Nodes (since the replication factor is 3 in our example), and the Name Node keeps the Data Node addresses for each block sorted by distance from the client. When the client asks for the blocks to read, the Name Node returns, for each block, the Data Nodes that are nearest in the network.

Once the list of Data Nodes is identified, the client starts reading the data blocks directly from those Data Nodes. When it has read all the blocks, it closes the connection & gives the result back to the user.

Note that the Name Node is not involved in the actual data transfer; it only hands out the addresses of the Data Nodes that hold the blocks.
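
To make this concrete, here is a minimal sketch of a client-side read using the Hadoop FileSystem Java API (the path is illustrative, and the configuration is assumed to point at your cluster). The single open() call is where the client asks the Name Node for block locations; the returned stream then pulls the data straight from the Data Nodes:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS should point at the Name Node.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() contacts the Name Node for block locations;
        // the stream then reads block data directly from Data Nodes.
        Path file = new Path("/user/demo/sample.txt");   // illustrative path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}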

Hope this clarifies the basics of how a file read works. Please post your comments/suggestions if any.





Tuesday, September 22, 2015

HDFS Part 1

Hadoop Distributed File System


After you read this post, you will understand what HDFS is, its architecture & how it interacts with the rest of the Hadoop ecosystem.

What's HDFS?

It is the distributed file system for Hadoop! When you have gigabytes of data, you can still manage with a standalone system. But what if you have terabytes, petabytes, zettabytes & so on? You will end up needing multiple machines working together to process the data & get the output you want.

A file system that splits a file into a series of chunks (blocks) & distributes them across the machines in a cluster is called a distributed file system. The idea has been around since before Hadoop! So what's special about HDFS then?

Well, it adds many features, such as:

> Processes data with data locality (computation moves to where the data is)
> Scalable
> Auto-replication of data when a block is lost/corrupted on a node in the cluster
> Written in Java, so it's platform independent
> A balancing algorithm to distribute load evenly & keep the cluster well balanced

and many more... 

HDFS Architecture :

Source : Hadoop.Apache.Org


HDFS consists of two main parts:
> Name Node & Data Node

It works on the master/slave concept (a server/client architecture). Here the Name Node acts as the master & the Data Nodes act as slaves.

So per cluster (a collection of nodes), only one master is allowed & you can have any number of slaves.

Name Node : the master, which takes care of the metadata (data about the data: a dictionary) of all the files/blocks/directories stored across the Data Nodes. (For instance, you can think of it as a "supervisor" in a company, who holds records of all employees, performs regular checks & makes decisions.)

Data Node : a slave, which is the actual workhorse & does all forms of read/write.

Client : acts as a bridge between the user & the cluster.

Note : Data Nodes & the Name Node can run on commodity hardware, i.e. affordable, readily available machines that need no special devices or equipment.

Let's take a FILE WRITE example & see how the HDFS architecture works.

Consider that you have a 1 GB file (a deliberately small size; you see the real power of Hadoop only with TBs or PBs of data) & want to WRITE it into the Hadoop environment.

Quick math on how the file will be split up into blocks:

1GB -> 1024 MB

Block size : 64 MB (the default in Hadoop 1.x; Hadoop 2.x defaults to 128 MB, but let's use 64 MB here).

1024 / 64 -> 16 blocks (so the file will be divided into 16 blocks).
Consider a replication factor of 3.

So each block will be stored as 3 copies, i.e. 16 * 3 -> 48 block replicas to be stored.
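
The same arithmetic as a tiny Java sketch, in case you want to try other file sizes (the block size and replication factor here are just the example values above, not read from a real cluster):

public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 1024;      // 1 GB file
        long blockSizeMb = 64;       // assumed block size for this example
        int replication = 3;         // assumed replication factor

        // Ceiling division: a partial final block still counts as one block.
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        long replicas = blocks * replication;

        System.out.println("Blocks: " + blocks + ", replicas stored: " + replicas);
        // Prints: Blocks: 16, replicas stored: 48
    }
}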

As soon as you issue the write command from the terminal (Ubuntu; we will see installation & examples in a later post), the client API carries the request from the user to the Name Node.

The Name Node performs the following checks before it acts on the write:
* Whether the user has write permission
* Whether the file already exists (remember, HDFS follows the Write Once, Read Many principle); if it already exists, an IOException is thrown

Once these checks pass, it creates a placeholder entry for the file in its metadata & signals the client to proceed.

The client then splits the file into a series of packets (chunks of each block) & pushes them onto a data queue. The DataStreamer is responsible for asking the Name Node for the list of Data Nodes where each block should be stored (based on the replication factor, default 3). With a replication factor of 3, three Data Nodes are identified for each block's replicas & a pipeline is formed between them. The packets are then transmitted through the pipeline in sequence.

As soon as the first Data Node stores a packet, it forwards it to the next Data Node in the pipeline, & so on until the last Data Node in the pipeline has stored it. Meanwhile the client maintains an ack queue & waits for acknowledgements from all the Data Nodes that have written the packet.

This process continues until all packets have been successfully transmitted & stored on the Data Nodes. Only then does the client signal the Name Node that the file has been written successfully.
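
For reference, a minimal client-side write looks like the sketch below (the path and content are illustrative). Behind the simple create()/write()/close() sequence, the client library performs the metadata check, the data queue and the Data Node pipeline described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads fs.defaultFS etc.
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/output.txt");     // illustrative path

        // create() throws an IOException if the file already exists
        // (when overwrite is false) -- the write-once check.
        try (FSDataOutputStream out = fs.create(file, false)) {
            out.writeBytes("hello hdfs\n");                 // data is packetized and
            out.writeBytes("written through the pipeline\n"); // replicated by the client library
        }
        fs.close();
    }
}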

Each Data Node sends a heartbeat signal to the Name Node every 3 seconds (by default), which keeps the Name Node's picture of the cluster up to date. If a Data Node stops responding for long enough (around 10 minutes with default settings), the Name Node marks that Data Node as DEAD. It then looks up the list of blocks the dead node was holding, finds which other Data Nodes hold replicas of them, & re-replicates those blocks onto other Data Nodes in the cluster, preserving fault tolerance & data availability.

Hope it's clear. We will see how the Hadoop read operation works in the next article. If you find this information useful, please comment.




Sunday, September 20, 2015

Hadoop EcoSystems

There are two main Core Components in Hadoop

I) HDFS(Hadoop Distributed File System)
II) MapReduce.

The many other components built around these two core components are often referred to as the Hadoop ecosystem:
-- PIG
-- HIVE
-- HBASE
-- FLUME
-- OOZIE
-- SQOOP

Let's look at each component at a high level; I will explain them in detail in later posts.

HDFS
Is responsible for storing data in the cluster & acts as the storage backbone of the Hadoop architecture. It divides data into a series of blocks (chunks) & stores them on different nodes in the cluster. It is a fault-tolerant system that keeps data available at all times by replicating it into n copies (default 3) & storing them on different nodes using a rack-awareness policy.

It has two main parts:
---- Name Node & Data Node

MapReduce
Is the computation (execution) part & works in batch-processing mode. It takes care of distributing the work around the cluster.

When you want to write MapReduce jobs directly, you need sound knowledge of Java! Yes, MapReduce is developed in Java & jobs are typically written in it too; a minimal sketch follows below.
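
Just to give a feel for it, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API (the class name and the input/output paths passed on the command line are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in a line of input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this into a jar & submit it with the hadoop jar command, passing the input & output directories as arguments.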

Oh no, I am not that good at Java! What should I do?

No worries, there are other utilities in the ecosystem that help you process your data without writing MapReduce code yourself.

HIVE
Are you a SQL programmer? Then this tool is for you. It was created at Facebook.

Many organizations have programmers who are good at SQL but not Java. With Hive you write HiveQL (HQL) queries to fetch data, and these in turn run MapReduce jobs behind the scenes.


PIG
If you are not good at SQL but can work with a scripting language, then this should be your go-to tool. Pig was developed at Yahoo!.
It has two parts:
-- Pig Latin (a high-level scripting/data-flow language)
-- The Pig runtime environment

It describes data processing as a step-by-step data flow: easy to understand, easy to follow.

SQOOP
If you have data in an RDBMS (MySQL, Oracle) & want to import it into Hadoop for processing, this utility helps a lot.

Sqoop (SQL-to-Hadoop) can also export data from Hadoop back to an RDBMS (such as MySQL, Oracle, etc.).


FLUME
> This utility helps collect, aggregate & move large amounts of log data.
> It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
> It uses a simple, extensible data model that allows for online analytic applications.

OOZIE
Once you have everything set up & the system is working fine, you will want it automated! This tool helps you schedule scripts/jobs at different frequencies, so it automatically runs the jobs on your behalf.

HBASE
HBase is an open-source, non-relational, distributed database modeled after Google's BigTable and written in Java.
It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.
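
As a small taste of what HBase looks like from Java, here is a minimal sketch assuming the HBase 1.x client API (the table name, column family & values are illustrative, and the table is assumed to already exist):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", column "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Lokesh"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}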

Many more components keep getting added to this ecosystem & some are still in the incubation stage. To learn more about the projects in the incubator: http://incubator.apache.org/





Saturday, September 19, 2015

Hadoop Questions & Answers

Watch this space!

Facts & Figures on Big Data

* Most of the data in the world was generated in just the last few years (< 4 to 5 years).
* The amount of digital information in existence is expected to grow from 3.2 zettabytes to 40 zettabytes.
* Every minute, we send more than 204 million emails, generate 1.8 million Facebook likes, send 278k tweets & upload 200k photos to Facebook.
* Google alone processes on average 40 thousand search queries per second, & around 3.5 billion in a single day.
* Around 100 hours of video are uploaded to YouTube every minute, and it would take you around 15 years to watch every video uploaded by users in one day.
* If you burned all of the data created in just one day onto DVDs, you could stack them on top of each other and reach the moon, twice.
* 570 new websites spring into existence every minute of every day.
* 1.9 million IT jobs will be created in the US by 2015 to carry out big data projects.

Sounds interesting? There are still many more.

Source -> http://www.slideshare.net/BernardMarr/big-data-25-facts 

What is Hadoop?

Hadoop in Short :
Hadoop is an Apache open-source framework designed for distributed storage & processing. It was designed with the assumption that hardware failure is the norm & that recovery should be handled by the software framework itself.

Founder History : 
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Hadoop 2 EcoSystem :

Source : Google

Hadoop Current Version : 

Apache Hadoop's released version was 2.7.1 when this blog was created. You should first understand how to read the version number, so that you are clear about what kind of patches/upgrades are being carried out & how to react to them.

Version numbers come in an x.y.z (three-part) format.

Version numbers usually consist of three numbers separated by dots, for example 2.7.1. These numbers have names. The leftmost number (2) is called the major version. The middle number (7) is called the minor version. The rightmost number (1) is called the revision, but it may also be referred to as a "point release" or "sub-minor version".

Many organizations are still running older releases such as Hadoop 1.2.1, but they can roll forward & upgrade to newer versions.

Hadoop Functionality :
1) Storage, analysis & statistics in one platform; not all tools that address the big data problem offer this combination. That's why Hadoop stands out.
2) It has two core components (HDFS & MapReduce).

HDFS stands for Hadoop Distributed File System.

Why Hadoop Came into existence?

When we are dealing with data, we need to look at a few V's:

Volume : the amount of data generated per second/minute/hour/day.
Variety : the types of data being generated.
Velocity : the speed at which data is generated or accessed.

Relational database systems struggle once volumes grow well beyond gigabytes, so organizations tend to archive data when it reaches a threshold. That's usually the end of the road for the archived data: it is no longer taken into account for analysis.

At the same time, it is also important to deal with the variety of data being generated through social media, mobile phones, automobiles, etc.

So it's hard to bring all that data together for computation & generate results. This is where Hadoop (with its distributed file system) plays a key role, by moving the computation to where the data resides. The data is stored across multiple nodes (a cluster) & the program/computation is shipped to those nodes to perform the analysis.

Clusters?
Consider a bunch of computers/laptops connected on a LAN. Each computer is referred to as a NODE, and the nodes connected to the same network switch are considered one RACK.

Multiple racks together form a cluster. Each rack holds anywhere from 2 to n nodes. For instance, see the picture below.

Source : Google

Who uses Hadoop?
It's a big list; many organizations have deployed Hadoop to analyze their data. You can review the list at http://wiki.apache.org/hadoop/PoweredBy

It's an open-source framework, so many companies tend to move towards it.


We will see each component in detail in the next posts.

Thanks for reading, please don't forget to hit like/comment.







What is BigData?

Big data is, quite simply, data that is too big for you to handle: the problem arises when you do not have sufficient capacity to store or process it. For instance, if you have an 8 GB SD card & have already used 7.5 GB, you cannot add another 650 MB of data to it, right?

So there is no definite size threshold for big data!

We have traditional database systems to store data that is structured (e.g. rows & columns, some specific format). But nowadays organizations care more about their customers/clients/investors & want to improve their business. To do that, they need to analyse all forms of data & make decisions from it.

Types of Data :
> Structured -- Databases
> Semi-Structured -- XML documents
> Unstructured -- Images, videos, text, PDFs, Word documents & so on

We are probably clear on structured / semi-structured data & have seen it before, but why would anyone want to analyse unstructured data?

For Instance :

As soon as you click a link on the web, a huge amount of data about the user is logged/generated behind the scenes. Analysing it reveals what kind of product/information the customer is looking for, so the customer can be reached accordingly.

By analysing images/videos, pattern matching can be done...

"Well, I am clear on all of this, but what is Hadoop doing here?" -- that should be your next question.

Watch for that in the next post.




If you find this post useful, please comment/like.








HadoopMakeItEasy

In this post, I would like to give you some insight into the prerequisites for learning Hadoop.

Are you interested in solving big data problems?
Are you interested in switching your career from any IT platform to big data?
Do you have basic/sound computer & programming knowledge?
Can you put in good effort to grasp new things quickly?
Can you work around the clock on POCs (proofs of concept)?

If you have "YES" for all, then don't think further!.. you are on right move.. 

Before you consult any training institute for Hadoop training, it is always recommended to prepare yourself with the basics: What is Hadoop? Why do we need it? Is it the only tool for solving big data problems?

These questions will certainly help you ask the right questions of the trainer & judge whether they can help you gain real knowledge & assist you in doing plenty of proofs of concept.

OK, where can I find that information?

#1) I strongly recommend "Hadoop: The Definitive Guide", a very good book for all beginners. If you find the book hard going, or you learn better by watching videos, see #2.
#2) There are plenty of videos available on YouTube which clearly explain what Hadoop is & why we need it.
#3) Check out -> http://www.tutorialspoint.com/hadoop/

If you are reading this post & want to learn more about Hadoop, please comment, so that I can put together a training roadmap for you! Completely FREE!

If you have any suggestions, please drop me an email @ lokeshwaran.a82@gmail.com




Logeswaran is a postgraduate (MTech) from SRM University, with 9.6 years of experience in the IT industry (at the time this blog was created). He has real-time exposure to the data science and analytics space.

In this blog, I will be sharing my experience with Hadoop, the prerequisites required to enrich your knowledge of this big data platform & how to create a roadmap.

I have experience in Base SAS, Advanced SAS, SQL, R, Java & SAS BI technologies.




Disclaimer : All content provided on this blog is for informational purposes only. The content shared here is purely what I have learnt & my practical knowledge. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided on this website.