In this post, I would like to share what I learned about Apache Spark (in-memory computation). Trust me, after working with MapReduce, I was amazed by Spark's performance.
After you read this post, you will be able to explain:
- What is Apache Spark?
- Current release
- Purpose of Spark
- Applications that benefit from Spark
- Components of Spark
- Download & installation procedures
- MapReduce vs. Spark comparison
What is Apache Spark?
-- It is a cluster computing framework for large-scale data processing.
-- It does not use MapReduce as an execution engine.
Current release?
-- At the time of writing, the latest release is Spark 1.5.1, released on October 2, 2015.
Purpose of Spark?
-- It is best known for its ability to keep large working datasets in memory between jobs.
-- This allows Spark to outperform the equivalent MapReduce workflow, where datasets are always loaded from disk.
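To make the in-memory point concrete, here is a minimal pure-Python sketch (it is not Spark code). The names `load_from_disk`, `run_jobs_reloading`, and `run_jobs_cached` are hypothetical, and the "disk" is simulated with a counter. All it shows is how often the dataset is read when every job reloads it versus when it is kept in memory between jobs.

```python
# Toy illustration only: counts how many times a dataset is "read from disk".
disk_reads = 0

def load_from_disk():
    """Pretend to read a dataset from disk (hypothetical helper)."""
    global disk_reads
    disk_reads += 1
    return list(range(1, 11))  # stand-in for a large dataset

def run_jobs_reloading(jobs):
    """MapReduce-style workflow: every job reads its input from disk again."""
    return [job(load_from_disk()) for job in jobs]

def run_jobs_cached(jobs):
    """Spark-style workflow: load once, keep the data in memory, reuse it."""
    data = load_from_disk()
    return [job(data) for job in jobs]

jobs = [sum, max, len]  # three simple "jobs" over the same dataset

disk_reads = 0
reload_results = run_jobs_reloading(jobs)
reads_when_reloading = disk_reads

disk_reads = 0
cached_results = run_jobs_cached(jobs)
reads_when_cached = disk_reads

print(reload_results, reads_when_reloading)
print(cached_results, reads_when_cached)
```

Both workflows produce the same answers; only the number of simulated disk reads differs (one per job versus one in total).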
Applications that benefit from Spark?
-- Batch, interactive, and streaming workloads
-- Iterative algorithms
----- An iterative algorithm applies a function to a specific file or dataset repeatedly until an exit condition is met.
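As a rough illustration of "apply a function repeatedly until an exit condition is met", here is a small pure-Python sketch. I picked a one-dimensional k-means style update as the example; the algorithm choice and the names `assign_and_update` and `iterate_until_stable` are my own assumptions, not something Spark prescribes. Note that the same dataset is scanned on every pass, which is exactly where keeping it in memory helps.

```python
def assign_and_update(points, centers):
    """One pass over the dataset: assign each point to its nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Keep the old center if a cluster ends up empty.
    return [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]

def iterate_until_stable(points, centers, tolerance=1e-9, max_iterations=100):
    """Apply the same function to the same dataset until the exit condition is met."""
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_centers = assign_and_update(points, centers)
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < tolerance:  # exit condition: the centers stopped moving
            break
    return centers, iteration

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, iterations = iterate_until_stable(points, centers=[0.0, 5.0])
print(centers, iterations)  # [2.0, 11.0] 3
```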
Components of Spark?
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality. The fundamental programming abstraction is the Resilient Distributed Dataset (RDD), a logical collection of data partitioned across machines.
RDDs can be created in three ways:
> from an in-memory collection of objects
> from a dataset in external storage
> by transforming an existing RDD
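The three creation routes above can be sketched with a toy, single-machine stand-in. `ToyRDD` and its methods are hypothetical and are not Spark's API; they only mimic the idea of a partitioned collection. In PySpark the corresponding real calls are `SparkContext.parallelize`, `SparkContext.textFile`, and `RDD.map`.

```python
import os
import tempfile

class ToyRDD:
    """Hypothetical, single-machine stand-in for an RDD: a list of partitions."""

    def __init__(self, partitions):
        self.partitions = partitions

    @classmethod
    def from_collection(cls, items, num_partitions=2):
        """Way 1: from an in-memory collection of objects."""
        items = list(items)
        size = max(1, (len(items) + num_partitions - 1) // num_partitions)
        return cls([items[i:i + size] for i in range(0, len(items), size)])

    @classmethod
    def from_text_file(cls, path, num_partitions=2):
        """Way 2: from a dataset in external storage (here, a local text file)."""
        with open(path) as handle:
            lines = [line.rstrip("\n") for line in handle]
        return cls.from_collection(lines, num_partitions)

    def map(self, func):
        """Way 3: transform an existing RDD into a new one."""
        return ToyRDD([[func(x) for x in part] for part in self.partitions])

    def collect(self):
        """Gather every partition back into one local list."""
        return [x for part in self.partitions for x in part]

# Way 1 and way 3: a collection, then a transformation of it.
numbers = ToyRDD.from_collection([1, 2, 3, 4])
squares = numbers.map(lambda x: x * x)

# Way 2: a file on disk stands in for external storage.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("spark\nhadoop\nmahout\n")
lines = ToyRDD.from_text_file(tmp.name)
lengths = lines.map(len)
os.remove(tmp.name)

print(squares.collect())  # [1, 4, 9, 16]
print(lengths.collect())  # [5, 6, 6]
```

Unlike this toy, a real RDD spreads its partitions across machines and evaluates transformations lazily.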
The RDD abstraction is exposed through a language-integrated API in Java, Python, Scala, and R. This reduces programming complexity, because applications manipulate RDDs much as they would manipulate local, in-process collections of data.
A Spark application running on a cluster is composed of one driver JVM and one or more executor JVMs.
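The driver/executor split can be pictured with a small pure-Python sketch. This is only an analogy: threads from the standard library's `ThreadPoolExecutor` stand in for executor JVMs, and `driver` and `count_words` are hypothetical names. The driver splits the data into partitions, dispatches one task per partition, and combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Task run by an 'executor': count the words in one partition of lines."""
    return sum(len(line.split()) for line in partition)

def driver(lines, num_executors=2):
    """The 'driver': partition the data, dispatch the tasks, combine the results."""
    partitions = [lines[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(count_words, partitions))
    return sum(partial_counts)

lines = [
    "spark keeps data in memory",
    "mapreduce reads from disk",
    "both run on clusters",
]
total_words = driver(lines)
print(total_words)  # 13
```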
Spark SQL:
It sits on top of Spark Core and provides the DataFrame abstraction, which supports both structured and semi-structured data.
Spark Streaming:
> It uses Spark Core's fast scheduling capability to perform streaming analytics.
> It ingests data in mini-batches and performs RDD transformations on those mini-batches.
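Here is a minimal pure-Python sketch of the mini-batch idea (not Spark Streaming code). The helper names `mini_batches` and `process_batch` and the sample events are hypothetical. One simplification to be aware of: Spark Streaming forms its batches by time interval, while this sketch uses a fixed record count to stay short.

```python
from itertools import islice

def mini_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size mini-batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    """The same transformation runs on every mini-batch: here, count ERROR events."""
    return sum(1 for event in batch if event.startswith("ERROR"))

events = [
    "INFO start", "ERROR disk", "INFO ok",
    "ERROR net", "ERROR cpu", "INFO stop",
    "INFO bye",
]
errors_per_batch = [process_batch(batch) for batch in mini_batches(events, batch_size=3)]
print(errors_per_batch)  # [1, 2, 0]
```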
Spark MLlib:
Spark MLlib is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit.
Download & Installation Procedures
- Download Apache Spark: Click here
--- Note: Spark is developed in Scala, so you need to install Scala before you install Spark. Don't worry if you don't know Scala; it is an easy language to learn and apply. Spark comes with a REPL (read-eval-print loop) for both Scala and Python, which makes it quick and easy to explore datasets.
- Installation procedures: Click here
MapReduce Vs Spark Comparison
| Factor | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Data computation | On disk | In memory |
| Fault tolerance | Achieved through replication | Achieved through the RDD data storage model |
| Performance | Needs to access disk for each I/O, so it takes more time. | Up to 100 times faster than MapReduce, since it uses in-memory execution. |
| Memory | Can run with available memory | Needs more memory, since it needs to cache the data for fast processing. |
| Iterative computations | If you do not have enough memory to use Spark, then go with MapReduce. | Yes, a good fit |
| Interactive analysis | No | Yes, a good fit |
| Streaming | No | Yes, a good fit |
| Batch processing | Yes | Yes |
| Native language | Java | Scala |
| Ease of use/programming | Difficult to program in Java, but many tools are available, such as Pig and Hive. | Easier to program, and includes an interactive mode. |
| Cost | Open source | Open source |
| Hardware support | Commodity | Commodity |
| Processing big data | Best option | Should be used only if you have about as much memory as the data you want to process. |
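I hope you find this post useful. In the next article I will share more on the anatomy of Spark execution and how it works.
Please do not forget to hit like or leave a comment if you have any questions or suggestions for improvement.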