Tuesday, September 22, 2015

HDFS Part 1

Hadoop Distributed File System


After you read this post, you will understand what HDFS is, its architecture, and how it interacts with the rest of the Hadoop ecosystem.

What's HDFS ? :

A Distributed File System for Hadoop! When you have gigabytes of data, you can still process it on a standalone system. But what if you have terabytes, petabytes, zettabytes and beyond? You will need multiple machines (CPUs) working together to process the data and produce the output you want.

The approach of splitting a file into a series of chunks (blocks) and distributing them across the machines in a cluster is what a distributed file system does. Distributed file systems existed before Hadoop, so what's special about HDFS?

Yes, it adds many features, such as:

> Processing the data close to where it lives (data locality)
> Scalability
> Auto-replication of data when a block crashes or gets corrupted on a node in the cluster
> Written in Java, so it's platform independent
> A balancing algorithm to keep the load evenly distributed and the cluster well balanced

and many more... 

HDFS Architecture :

[HDFS architecture diagram - Source : Hadoop.Apache.Org]


HDFS consists of two main parts:
> Name Node & Data Node

It works on the master/slave (server/client) concept: the Name Node acts as the master and the Data Nodes act as slaves.

So per cluster (a collection of nodes), only one master is allowed, and you can have N number of slaves.

Name Node : the master, which takes care of the metadata (data about the data: a dictionary) for all the files/blocks/directories stored on the Data Nodes. (For instance, you can think of it as a "supervisor" in a company, who holds the records of all employees, performs regular checks and makes the decisions.)

Data Node : a slave, which is the actual workhorse and does all forms of read/write.

Client : acts as a bridge between User & the Cluster.

Note : All Data Nodes/Name Nodes can run on commodity hardware (hardware that is affordable and easy to obtain - typically a low-cost, IBM PC-compatible system capable of running Linux, Microsoft Windows or MS-DOS without requiring any special devices or equipment).
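
To make the client's role a little more concrete, here is a minimal sketch (my own illustration, not part of the original post) of a Java program that connects to the Name Node through Hadoop's FileSystem API and lists the root directory. The Name Node address used here is just an example value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client only needs to know where the Name Node is;
        // the Name Node then tells it which Data Nodes actually hold the blocks.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // example address
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}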

Let's take a FILE WRITE example and see how the HDFS architecture handles it.

Consider you have a file of 1 GB (I'm deliberately picking a small size; you only see the real power of Hadoop with TBs and PBs of data) and want to WRITE it into the Hadoop environment.

Quick math on how the file will be split up into blocks:

1GB -> 1024 MB

Block Size : 64 MB (the default in Hadoop 1.x; Hadoop 2.x raised it to 128 MB, but let's stick with 64 MB here).

1024 / 64 -> 16 blocks (so the file will be divided into 16 blocks).
Consider you have a replication factor of 3.

So each block will be replicated into 3 copies, i.e. 16 * 3 -> 48 blocks to be stored.
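
Just to double-check that math, here is a tiny sketch (illustration only, using the same example numbers) that computes the block count and the total number of block copies:

public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 1024;   // the 1 GB example file
        long blockSizeMb = 64;    // block size assumed in this example
        int replication = 3;      // replication factor

        // Round up so a trailing partial block still counts as a block.
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
        long copies = blocks * replication;
        System.out.println(blocks + " blocks, " + copies + " block copies to store");
        // prints: 16 blocks, 48 block copies to store
    }
}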

As soon as you issue the command from the terminal (Ubuntu; we will cover installation & examples in a later post), the client API takes the request from the user to the Name Node.

The Name Node does the following checks before it takes any action on writing the data:
* Whether the user has write permission
* Whether the file already exists (remember, HDFS works on the WORM principle: WRITE ONCE, READ MANY; if the file already exists, it throws an IOException)

Once these checks pass, the Name Node creates a placeholder entry in its metadata for that file and signals the client to proceed.
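
From the client side, the same write-once rule shows up in a minimal sketch like the one below (my own example; the path and data are made up). Asking HDFS to create a file with the overwrite flag set to false fails if the file already exists:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceSketch {
    static void writeOnce(FileSystem fs, Path path) throws IOException {
        if (fs.exists(path)) {
            // HDFS is write once, read many: you cannot edit a file in place.
            throw new IOException("File already exists: " + path);
        }
        // overwrite = false, so HDFS itself also refuses to clobber an existing file
        try (FSDataOutputStream out = fs.create(path, false)) {
            out.writeUTF("hello hdfs");
        }
    }
}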

The client then splits the file/data into a series of packets (chunks of the blocks) and pushes them onto a data queue. The DataStreamer is responsible for communicating with the Name Node to get the list of Data Nodes where the blocks are to be stored (based on the replication factor, default 3). With a replication factor of 3, three Data Nodes are identified for each block's replicas and a pipeline is formed between them. The streamer then starts transmitting the packets through this pipeline in sequence.

As soon as the first Data Node stores a packet, it forwards it to the second Data Node in the pipeline, which forwards it to the third, and so on until the last node has stored it. All the while, the client keeps each packet in an ACK queue until acknowledgements come back from every Data Node that has written it.

This process continues until all packets have been successfully transmitted and stored on the Data Nodes. Only then does the client report "SUCCESS" to the Name Node, confirming that the file has been written.
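
Here is a small sketch of the client-side write that kicks off this pipeline (illustration only; the path and sizes are example values, and fs is a FileSystem handle obtained as in the earlier sketch). All of the data queue / ACK queue handling happens inside the output stream:

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWriteSketch {
    static void write(FileSystem fs) throws Exception {
        Path path = new Path("/user/loges/sample.txt"); // example path
        short replication = 3;                 // 3 copies of every block
        long blockSize = 64L * 1024 * 1024;    // 64 MB blocks, as in the example
        int bufferSize = 4096;

        // create() asks the Name Node for a pipeline of Data Nodes per block
        try (FSDataOutputStream out = fs.create(path, true, bufferSize, replication, blockSize)) {
            out.write("some sample data".getBytes("UTF-8"));
        }
        // close() returns only after the packets have been acknowledged,
        // and the client then reports the completed file to the Name Node.
    }
}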

Periodically (every 3 seconds by default), each Data Node communicates with the Name Node in the form of a heartbeat signal, through which the Name Node keeps its metadata up to date. If there is no response from a Data Node for too long (by default, roughly ten minutes), the Name Node marks that Data Node as DEAD. It then checks the list of blocks the dead node was holding, scans which other Data Nodes hold replicas of them, and auto-replicates those blocks onto other Data Nodes in the cluster to preserve fault tolerance and data availability.
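
For the curious, the "dead node" timeout is not a single setting; the Name Node derives it from two heartbeat-related properties. A rough sketch of that arithmetic, using the stock defaults of recent Hadoop releases (property names shown in the comments; check your version's hdfs-default.xml):

public class HeartbeatTimeoutSketch {
    public static void main(String[] args) {
        long heartbeatIntervalSec = 3;           // dfs.heartbeat.interval (seconds)
        long recheckIntervalMs = 5 * 60 * 1000;  // dfs.namenode.heartbeat.recheck-interval (ms)

        // Name Node rule of thumb: 2 * recheck interval + 10 * heartbeat interval
        long deadTimeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalSec * 1000;
        System.out.println("Data Node declared dead after ~" + deadTimeoutMs / 60000.0 + " minutes");
        // prints roughly 10.5 minutes
    }
}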

Hope this is clear. We will see how the Hadoop read operation works in the next article. If you find this information useful, please comment.



Disclaimer : All content provided on this blog is for informational purposes only. The content shared here is purely what I have learnt and my practical knowledge. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided on this website.

Comments:

  1. Hello Loges, I have a question about the HDFS architecture. I thought the client can't directly interact with the Data Nodes and should only go through the Name Node, but in the picture I can see the client reaching out to the Data Node directly, with the Name Node on top of it. Can you please clarify? Thanks
