Saturday, October 31, 2015

Apache Spark : Fundamentals

In this post, I would like to share my learning experience with Apache Spark (in-memory computation). Trust me, I was amazed by Spark's performance after working with MapReduce.

After you read this post, you will be able to explain:

- What is Apache Spark?
- Current version
- Purpose of Spark
- Which applications benefit from Spark
- Components of Spark
- Download & installation procedures
- MapReduce vs Spark comparison




What is Apache Spark?
-- It is a cluster computing framework for Large-Scale Data Processing.
-- It does not use MapReduce as an execution engine; it ships with its own distributed execution engine.

Current release?
-- The latest release of Spark is Spark 1.5.1, released on October 2, 2015


Purpose of Spark?
-- It is best known for its ability to keep large working datasets in memory between jobs.
-- This allows Spark to outperform the equivalent MapReduce workflow, where datasets are always loaded from disk.
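
As a quick illustration, here is a minimal Scala sketch that caches a filtered dataset so that two different actions reuse it from memory instead of re-reading the file from disk. The input path and the "ERROR" filter are assumptions, not part of any real workload.

import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheExample"))

    // Hypothetical input; any text file on HDFS or the local file system works.
    val logs = sc.textFile("hdfs:///data/access-logs")

    // cache() keeps the filtered dataset in executor memory, so the two
    // actions below do not re-read the file from disk.
    val errors = logs.filter(_.contains("ERROR")).cache()

    println("error lines: " + errors.count())
    println("distinct error lines: " + errors.distinct().count())

    sc.stop()
  }
}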

Which applications benefit from Spark?
-- Batch, interactive & streaming workloads
-- Iterative algorithms
----- an iterative algorithm applies a function to a file or dataset repeatedly until an exit condition is met; keeping the working set in memory saves a disk round-trip on every pass (see the sketch below).
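
Here is a minimal Scala sketch of such an iterative job. The data, the halving step, and the convergence threshold are all made up for illustration; the point is that the cached RDD is reused on every pass rather than re-read from disk.

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch"))

    // Toy dataset held in memory between passes.
    var data = sc.parallelize(1 to 1000000).map(_.toDouble).cache()
    var maxValue = data.max()

    // Apply the same function repeatedly until the exit condition is met.
    while (maxValue > 1.0) {
      data = data.map(_ / 2).cache()
      maxValue = data.max()
    }

    println("converged, max = " + maxValue)
    sc.stop()
  }
}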

Components of Spark?

 
[Image: the Spark stack - Spark SQL, Spark Streaming and MLlib layered on top of Spark Core. Source: Google]

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. The fundamental programming abstraction is called the Resilient Distributed Dataset (RDD), a logical collection of data partitioned across machines.

RDDs can be created in three ways:
> from an in-memory collection of objects
> from a dataset in external storage
> by transforming an existing RDD


The RDD abstraction is exposed through a language-integrated API in Java, Python, Scala, and R that resembles working with local, in-process collections. This keeps programming simple, because applications manipulate RDDs much the way they manipulate local collections of data.
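
A minimal Scala sketch of the three creation paths listed above (the HDFS path is an assumption):

import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddCreation"))

    // 1. From an in-memory collection of objects
    val numbers = sc.parallelize(List(1, 2, 3, 4, 5))

    // 2. From a dataset in external storage (the path is a placeholder)
    val lines = sc.textFile("hdfs:///data/input.txt")

    // 3. By transforming an existing RDD
    val squares = numbers.map(n => n * n)

    println(squares.collect().mkString(", "))
    println("lines in external file: " + lines.count())

    sc.stop()
  }
}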

A Spark cluster is composed of one Driver JVM and one or more Executor JVMs.
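
The sketch below shows, under assumed values, how an application asks the cluster manager for those executor JVMs. The master URL, memory and core settings are placeholders; the main method itself runs inside the driver JVM.

import org.apache.spark.{SparkConf, SparkContext}

object DriverExecutorSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DriverExecutorSketch")
      .setMaster("spark://master-host:7077")   // placeholder cluster manager URL
      .set("spark.executor.memory", "2g")      // heap size of each executor JVM
      .set("spark.executor.cores", "2")        // cores per executor

    // The SparkContext lives in the driver JVM; executors are launched for it.
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).reduce(_ + _))
    sc.stop()
  }
}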

Spark SQL:

It sits on top of Spark Core and provides the DataFrame abstraction, which supports both structured and semi-structured data.
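
A small Scala sketch of the DataFrame API against a line-delimited JSON file; the path and the name/age fields are assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DataFrameSketch"))
    val sqlContext = new SQLContext(sc)

    // Semi-structured input: one JSON object per line (path and fields assumed).
    val people = sqlContext.read.json("hdfs:///data/people.json")

    people.printSchema()                        // schema inferred from the JSON
    people.filter(people("age") > 21).show()    // DataFrame-style query

    // The same DataFrame can also be queried with SQL.
    people.registerTempTable("people")
    sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()

    sc.stop()
  }
}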


Spark Streaming:
> It has fast scheduling capability, which it leverages to perform streaming analytics.
> It ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
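
A minimal Scala sketch of a streaming word count; the 5-second batch interval and the localhost:9999 socket source are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch")

    // Each mini-batch covers 5 seconds of incoming data.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Text lines arriving on a TCP socket (host and port are placeholders).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Ordinary RDD-style transformations, applied to every mini-batch.
    val wordCounts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}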

Spark MLlib:
Spark MLlib is a distributed machine learning framework on top of Spark Core. Thanks in large part to the distributed, memory-based Spark architecture, it is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks run by the MLlib developers against the Alternating Least Squares (ALS) implementations, before Mahout itself gained a Spark interface), and it scales better than Vowpal Wabbit.
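
As a hedged illustration of MLlib's ALS, the Scala sketch below trains a recommendation model from a made-up "user,product,rating" CSV file; the path, rank and iteration count are assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AlsSketch"))

    // Ratings file with lines like "userId,productId,rating" (path/format assumed).
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Train a matrix-factorization model: rank 10, 10 iterations (arbitrary values).
    val model = ALS.train(ratings, 10, 10)

    // Predict how user 1 would rate product 42.
    println(model.predict(1, 42))

    sc.stop()
  }
}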

Download & Installation Procedures
- Download Apache Spark via Click here
--- Note: before you install Spark, you need to install Scala, since Spark itself is written in Scala. Don't worry if you do not know Scala; it is an easy language to learn and apply. Spark also ships with a REPL (Read-Eval-Print Loop) for both Scala and Python, which makes it quick and easy to explore datasets (a tiny example follows below).
- Installation procedures: Click here
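
For a first taste of the REPL, these are the kinds of lines you might type at the spark-shell prompt (the file path is an assumption; sc is pre-defined by the shell):

// Typed at the spark-shell prompt; sc is already defined by the shell.
// The file path is an assumption - point it at any text file you have.
val readme = sc.textFile("README.md")
readme.count()                              // number of lines
readme.filter(_.contains("Spark")).count()  // lines mentioning "Spark"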


MapReduce vs Spark Comparison

- Data computation: MapReduce computes on disk; Spark computes in memory.
- Fault tolerance: MapReduce achieves it through replication; Spark achieves it through its data storage model (RDD lineage).
- Performance: MapReduce hits the disk for every I/O, so jobs take longer; Spark can be up to 100 times faster because it executes in memory.
- Memory: MapReduce can run with whatever memory is available; Spark needs more memory, since it caches data for fast processing.
- Iterative computations: a poor fit for MapReduce; a very good fit for Spark (if you do not have enough memory for Spark, fall back to MapReduce).
- Interactive analysis: not supported by MapReduce; a good fit for Spark.
- Streaming: not supported by MapReduce; a good fit for Spark.
- Batch processing: supported by both.
- Native language: Java for MapReduce; Scala for Spark.
- Ease of use/programming: MapReduce is difficult to program directly in Java, though tools such as Pig and Hive help; Spark is easier to program and includes an interactive mode.
- Cost: both are open source.
- Hardware support: both run on commodity hardware.
- Processing big data: MapReduce is the best option when the data far exceeds available memory; Spark should be used only if you have roughly as much memory as the data you want to process.

I hope you found this post useful. In the next article, I will share more on the anatomy of Spark execution and how it works ...


Please do not forget to hit like, or leave a comment if you have any questions or suggestions for improvement.

