Sunday, September 20, 2015

Hadoop Ecosystem

There are two main core components in Hadoop:

I) HDFS(Hadoop Distributed File System)
II) MapReduce.

Many other components built around these two core components are often referred to as the Hadoop ecosystem:
-- PIG
-- HIVE
-- HBASE
-- FLUME
-- OOZIE
-- SQOOP

Let's look at each component at a high level; I will explain each of them in detail in my next post.

HDFS 
HDFS is responsible for storing data across the cluster. It divides the data into a series of blocks (chunks) and stores them on different nodes in the cluster. It is fault tolerant: every block is replicated into n copies (3 by default) and the copies are placed on different nodes using a rack-awareness policy, so the data remains available at any time, even if a node fails.

It has two main parts:
-- NameNode (the master, often called the brain of HDFS, which keeps the filesystem metadata)
-- DataNode (the workers, which store the actual blocks)
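
To make this concrete, here is a minimal sketch (one of several ways to do it) that uses the HDFS Java API, org.apache.hadoop.fs.FileSystem, to write a small file and read back its replication factor and block size. The path /user/demo/hello.txt is just a placeholder, and the cluster address is assumed to come from a core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates each block
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Inspect the replication factor (default 3) and the block size
        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication : " + status.getReplication());
        System.out.println("Block size  : " + status.getBlockSize());

        fs.close();
    }
}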

MapReduce
MapReduce is the computation (execution) part and works in batch-processing mode. It takes care of distributing the work across the cluster.

When you want to work with MapReduce directly, you must have sound knowledge of Java! Yes, MapReduce itself is developed in Java, and MapReduce jobs are typically written in Java.
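
To give a feel for what that Java looks like, here is a minimal word-count sketch in the spirit of the classic Hadoop example; the class names and the input/output paths passed on the command line are purely illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would package this into a JAR and submit it with the hadoop jar command, passing an input and an output directory.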

Oh no, I am not that good at Java! What should I do?

No worries, the ecosystem has other utilities that let you process your data without writing MapReduce code yourself.

HIVE
Are you a SQL programmer? Then this tool is for you. Hive was created at Facebook.

Many organizations have programmers who are good at SQL but not at Java. With Hive you write your queries in HiveQL (HQL), and Hive runs MapReduce jobs behind the scenes to fetch the data.
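
As a small illustration, here is a sketch that runs an HQL query from Java through the HiveServer2 JDBC driver; the host, port, credentials, and the employees table are assumptions for this example, not something from this post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; hive-jdbc must be on the classpath
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical host/port/database
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement()) {

            // Plain SQL-like HQL; Hive turns this into MapReduce jobs behind the scenes
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}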


PIG
If you are not good at SQL but can work with a scripting language, then this should be your go-to tool. Yes, Pig was developed at Yahoo!.
It has two parts:
-- Pig Latin (a high-level scripting language)
-- The Pig runtime environment

It processes the data one step at a time, which makes it easy to understand and easy to follow.
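
As a rough sketch, Pig Latin statements can also be submitted from Java through the PigServer class (Pig scripts are more commonly run with the pig command; this embedding is just one option). The input file, its field layout, and the output path below are made up for illustration.

import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // "local" runs against the local filesystem; use "mapreduce" against a cluster
        PigServer pig = new PigServer("local");

        // Each Pig Latin statement transforms the data one step at a time
        pig.registerQuery("logs = LOAD 'input/access.log' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("hits  = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS n;");

        // Trigger execution and write the result back to the filesystem
        pig.store("hits", "output/hits_per_ip");
        pig.shutdown();
    }
}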

SQOOP
If you have data in an RDBMS (MySQL, Oracle, etc.) and want to import it into Hadoop for processing, this utility helps a lot.

Sqoop (short for "SQL-to-Hadoop") can also export data from Hadoop back to an RDBMS (such as MySQL, Oracle, etc.).
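
As an illustration, Sqoop's import tool can be invoked from the sqoop command line or programmatically; the sketch below uses Sqoop.runTool with a hypothetical MySQL connection string, table, and target directory.

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Same arguments you would pass to "sqoop import" on the command line
        String[] importArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",   // hypothetical source database
            "--username", "etl_user",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/data/raw/orders",              // destination directory in HDFS
            "--num-mappers", "1"
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}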


FLUME
> This utility helps collect large amounts of log data.
> It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
> It uses a simple, extensible data model that allows for online analytic applications.

OOZIE
If you have everything set up and the system is working fine, you will want it to be automated! Yes, this tool lets you schedule your scripts/jobs at different frequencies, so it performs the jobs automatically on your behalf.
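
As a rough sketch, a workflow can be submitted and monitored from Java with the Oozie client API; the Oozie URL, the HDFS application path (where the workflow.xml would live), and the property names below are assumptions for illustration.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Points at the Oozie server's REST endpoint (hypothetical host/port)
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties; APP_PATH is the HDFS directory holding workflow.xml
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/my-wf");
        conf.setProperty("inputDir", "/user/demo/input");
        conf.setProperty("outputDir", "/user/demo/output");

        // Submit and start the workflow, then poll until it finishes
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            System.out.println("Workflow " + jobId + " is still running...");
            Thread.sleep(10_000);
        }
        System.out.println("Workflow finished: " + oozie.getJobInfo(jobId).getStatus());
    }
}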

HBASE
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java.
It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop.
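
As a small illustration, here is a sketch using the HBase Java client API to write and read a single cell; the users table, the info column family, and the row key are assumed to exist already (for example, created beforehand from the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // Write a single cell: row "user1", column family "info", qualifier "email"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("user1@example.com"));
            table.put(put);

            // Random read of the same row by key
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println("email = " + Bytes.toString(email));
        }
    }
}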

Many more components keep getting added to this ecosystem, and some are still in the incubation stage! To learn more about the projects currently in the Apache Incubator, see http://incubator.apache.org/



Disclaimer: All content provided on this blog is for informational purposes only. The content shared here reflects purely what I have learnt and my practical knowledge. The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided on this website.

