Saturday, October 31, 2015

Apache Spark : Fundamentals

In this post, I would like to share what I have learned about Apache Spark and its in-memory computation. Trust me, after working with MapReduce, I was amazed by Spark's performance.

After you read this post, you will be able to explain:

- What is Apache Spark?
- Current release
- Purpose of Spark
- List of applications that benefit from Spark
- Components of Spark
- Download & Installation Procedures
- MapReduce Vs Spark Comparison




What is Apache Spark?
-- It is a cluster computing framework for large-scale data processing.
-- It does not use MapReduce as an execution engine; it uses its own distributed runtime to execute work on a cluster.

Current release?
-- At the time of writing, the latest release is Spark 1.5.1, released on October 2, 2015.


Purpose of Spark?
-- It is best known for its ability to keep large working datasets in memory between jobs.
-- This allows Spark to outperform the equivalent MapReduce workflow, where datasets are always loaded from disk.

List of applications that benefit from Spark?
-- Batch, interactive, and streaming workloads
-- Iterative algorithms
----- An iterative algorithm applies a function to a specific file or dataset repeatedly until an exit condition is met.
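To make the idea of an iterative algorithm concrete, here is a minimal plain-Python sketch (not Spark code). The dataset is loaded once and stays in memory, and the same update function is applied to it repeatedly until an exit condition is met. The names `update` and `iterate_until`, the tolerance, and the sample data are assumptions made only for this illustration.

```python
def update(estimate, data):
    """One pass over the in-memory dataset: move the estimate halfway to the mean."""
    mean = sum(data) / len(data)
    return estimate + 0.5 * (mean - estimate)


def iterate_until(data, start, tolerance=1e-6, max_iterations=1000):
    """Apply `update` repeatedly until the estimate stops changing (the exit condition)."""
    estimate = start
    for iteration in range(1, max_iterations + 1):
        new_estimate = update(estimate, data)
        if abs(new_estimate - estimate) < tolerance:
            return new_estimate, iteration
        estimate = new_estimate
    return estimate, max_iterations


data = [2.0, 4.0, 6.0, 8.0]  # stays in memory across all iterations
result, iterations = iterate_until(data, start=0.0)
print(result, iterations)
```

Every iteration reads the same `data` again. That is why this kind of workload gains so much when the dataset is kept in memory instead of being reloaded from disk on each pass.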

Components of Spark?

 
[Figure: components of Spark. Source: Google]

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality. The fundamental programming abstraction is the Resilient Distributed Dataset (RDD), a logical collection of data partitioned across machines.

RDDs can be created in 3 ways:
> from an in-memory collection of objects
> using a dataset from external storage
> by transforming an existing RDD
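The three routes above can be sketched with a toy model in plain Python. `ToyRDD` and its methods are hypothetical names invented for this illustration; they are not Spark's API. In Spark itself, the three routes correspond to `SparkContext.parallelize`, `SparkContext.textFile`, and transformations such as `map`.

```python
import os
import tempfile


class ToyRDD:
    """Hypothetical stand-in for an RDD: a list of partitions, each a list of records."""

    def __init__(self, partitions):
        self.partitions = partitions

    @classmethod
    def from_collection(cls, items, num_partitions=2):
        """Way 1: split an in-memory collection of objects into partitions."""
        items = list(items)
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    @classmethod
    def from_text_file(cls, path, num_partitions=2):
        """Way 2: read a dataset from external storage, one record per line."""
        with open(path) as handle:
            return cls.from_collection(handle.read().splitlines(), num_partitions)

    def map(self, func):
        """Way 3: transform an existing ToyRDD into a new one."""
        return ToyRDD([[func(x) for x in part] for part in self.partitions])

    def collect(self):
        """Gather the records from all partitions into one local list."""
        return [x for part in self.partitions for x in part]


numbers = ToyRDD.from_collection([1, 2, 3, 4])

# Write a small temporary file to play the role of external storage.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("spark\nhadoop\n")
lines = ToyRDD.from_text_file(tmp.name)
os.remove(tmp.name)

squares = numbers.map(lambda x: x * x)
print(sorted(squares.collect()), sorted(lines.collect()))
```

Note that `map` returns a new `ToyRDD` and leaves the original untouched, which mirrors how transforming an RDD produces another RDD.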


The RDD abstraction is exposed through a language-integrated API in Java, Python, Scala, and R similar to local, in-process collections. This simplifies programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.

A Spark cluster is composed of one Driver JVM and one or many Executor JVMs.
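As a rough analogy for the driver/executor split, here is a plain-Python sketch in which threads stand in for the Executor JVMs. This is not how Spark is implemented; the names `run_task` and `driver` and the choice of three executors are assumptions made for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    """Work done by one executor: sum the records in its partition."""
    return sum(partition)


def driver(data, num_executors=3):
    """Split the data into partitions, dispatch one task per partition, combine the results."""
    partitions = [data[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_sums = list(executors.map(run_task, partitions))
    return sum(partial_sums)


total = driver(list(range(1, 101)))
print(total)
```

The driver only decides how to split the work and how to combine the partial results; the per-partition computation happens in the executors.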

Spark SQL:

Spark SQL sits on top of Spark Core and provides the DataFrame abstraction, which supports both structured and semi-structured data.


Spark Streaming:
> It uses Spark Core's fast scheduling capability to perform streaming analytics.
> It ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
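The mini-batch idea can be sketched in plain Python (not Spark code). Spark Streaming groups incoming records by time interval; to stay simple, this sketch assumes a fixed number of records per batch instead. The names `mini_batches` and `process` and the sample lines are made up for this illustration.

```python
from itertools import islice


def mini_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size mini-batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process(batch):
    """Per-batch transformation: count the words across all lines in the batch."""
    return sum(len(line.split()) for line in batch)


incoming = ["spark streaming", "mini batches of data", "rdd", "one more line", "end"]
counts = [process(batch) for batch in mini_batches(incoming, batch_size=2)]
print(counts)
```

Each mini-batch is handled with the same kind of transformation you would apply to a static dataset, which is the main appeal of this model.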

Spark MLlib:
Spark MLlib is a distributed machine learning framework on top of Spark Core. Due in large part to the distributed memory-based Spark architecture, it is as much as nine times as fast as the disk-based implementation used by Apache Mahout, and it scales better than Vowpal Wabbit. That figure comes from benchmarks run by the MLlib developers against the Alternating Least Squares (ALS) implementations, before Mahout itself gained a Spark interface.

Download & Installation Procedures
- Download Apache Spark: Click here
--- Note: before you install Spark, you need to install Scala, since Spark is developed in the Scala language. Don't worry if you do not know Scala; it is an easy language to learn and apply. Spark comes with a REPL (Read-Eval-Print Loop) for both Scala and Python, which makes it quick and easy to explore datasets.
- Installation procedures: Click here


MapReduce Vs Spark Comparison

Factor | Hadoop MapReduce | Apache Spark
Data computation | On disk | In memory
Fault tolerance | Achieved through replication | Achieved through its data storage model (RDDs), which can recompute lost partitions
Performance | Needs to access disk for each I/O operation, so it takes more time | Up to 100 times faster than MapReduce, since it uses in-memory execution
Memory | Can run with the available memory | Needs more memory, since it caches the data for fast processing
Iterative computations | Use MapReduce if you do not have enough memory for Spark | Yes, a good fit
Interactive analysis | No | Yes, a good fit
Streaming | No | Yes, a good fit
Batch processing | Yes | Yes
Native language | Java | Scala
Ease of use/programming | Difficult to program in Java, but many tools are available, such as Pig and Hive | Easier to program, and includes an interactive mode
Cost | Open source | Open source
Hardware support | Commodity | Commodity
Processing big data | Best option | Should be used only if you have at least as much memory as the data you want to process

I hope you find this post useful. In the next article, I will share more on the anatomy of Spark execution and how it works.


Please do not forget to hit like, or leave a comment if you have any questions or suggestions for improvement.