Apache Spark Overview

Let us go through some of the definitions of Apache Spark available online and offline.

Apache Spark is an open-source cluster-computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark was developed in response to limitations of MapReduce, which forces a particular linear dataflow structure on distributed programs: a MapReduce program reads input data from disk, maps a function across the data, reduces the results of the map, and stores the results back on disk. Spark's resilient distributed datasets (RDDs) function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. (Ref = Wikipedia)
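The rigid dataflow that Spark relaxes can be made concrete with a small sketch in plain Python (an illustrative analogy, not Hadoop or Spark code): a word-count job must pass through exactly one map phase and one reduce phase, with disk I/O at both ends elided here.

```python
from functools import reduce

# The linear MapReduce pipeline in miniature:
# read input -> map a function across it -> reduce the results -> write output.
def map_phase(lines):
    # Emit (word, 1) pairs, as a MapReduce mapper would.
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    # Sum the counts per key, as a MapReduce reducer would.
    def step(counts, pair):
        word, n = pair
        counts[word] = counts.get(word, 0) + n
        return counts
    return reduce(step, pairs, {})

lines = ["spark is fast", "spark is general"]
counts = reduce_phase(map_phase(lines))
print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'general': 1}
```

An iterative algorithm expressed this way must chain many such jobs, re-reading its input from disk each time; Spark instead keeps the working set in memory between steps.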

Apache Spark is a fast and general engine for large-scale data processing. Some of the highlights of Spark are: it can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk; applications can be written quickly in Java, Scala, Python, and R; and SQL, streaming, and complex analytics can be combined, as Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. Spark runs on Hadoop (YARN), Apache Mesos, standalone, or in the cloud (e.g. AWS EC2), and it can access diverse data sources including HDFS, Cassandra, HBase, and S3. (Ref = spark.apache.org)

Apache Spark is a fast, in-memory data processing engine with elegant and expressive APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets. Spark adds in-memory compute for ETL, machine learning, and data science workloads to Hadoop. Spark is one of many data access engines that work with Hadoop YARN. With Spark running on YARN, developers can create applications to exploit Spark's power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop. (Ref = Hortonworks site)

Apache Spark is a cluster computing platform designed to be fast and general purpose. Spark extends MapReduce to support more types of computations, such as iterative algorithms, interactive queries, and stream processing, in addition to the batch processing that MapReduce supports. Spark performs computations in memory, but is also efficient for applications running on disk. Spark offers simple APIs in Python, Java, Scala, and SQL, along with rich built-in libraries. Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra, but it can also run without Hadoop. (Ref = Learning Spark Book)

 

What is Resilient Distributed Dataset (RDD) mentioned in the definitions?

Apache Spark provides programmers with an API centered on a data structure called the resilient distributed dataset (RDD), which is Spark's main abstraction. An RDD is a read-only multiset of data items distributed over a cluster of machines, maintained in a fault-tolerant way. In simple terms, an RDD is a collection of items distributed across many computing nodes that can be manipulated in parallel. Spark's RDDs function as a working set for distributed programs that offers a restricted form of distributed shared memory. Spark Core is home to the API that defines RDDs. (Ref = Wikipedia, Learning Spark Book)
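As a rough analogy (plain Python with standard-library threads, not the actual Spark API), an RDD behaves like a read-only collection split into partitions, where a transformation such as map leaves the original untouched, produces a new collection, and is applied to each partition in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

# A toy stand-in for an RDD: an immutable collection split into partitions.
class TinyRDD:
    def __init__(self, items, num_partitions=2):
        items = list(items)
        size = max(1, -(-len(items) // num_partitions))  # ceil division
        self.partitions = tuple(tuple(items[i:i + size])
                                for i in range(0, len(items), size))

    def map(self, fn):
        # Apply fn to each partition in parallel; build a *new* collection,
        # since an RDD is read-only.
        with ThreadPoolExecutor() as pool:
            mapped = tuple(pool.map(lambda part: tuple(fn(x) for x in part),
                                    self.partitions))
        out = TinyRDD([])
        out.partitions = mapped
        return out

    def collect(self):
        # Gather all partitions back into a single local list.
        return [x for part in self.partitions for x in part]

rdd = TinyRDD([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8]
```

Real RDDs additionally record the lineage of transformations that produced them, so a lost partition can be recomputed from its inputs; that lineage, not replication, is where the fault tolerance comes from.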

Aside from the RDD-oriented functional style of programming, Spark provides two restricted forms of shared variables: broadcast variables reference read-only data that needs to be available on all nodes, while accumulators can be used to program reductions in an imperative style. (Ref = Wikipedia)
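These two forms can be mimicked in miniature with plain Python (a sketch of the semantics only, not the PySpark broadcast/accumulator API):

```python
# A broadcast variable is read-only lookup data visible to every node's tasks;
# an accumulator gathers per-task contributions back at the driver.
broadcast_prices = {"apple": 2, "banana": 1}   # read-only on every "node"
partitions = [["apple", "banana"], ["apple", "apple"]]

total = 0  # accumulator: tasks may only add to it; only the driver reads it
for partition in partitions:      # each iteration stands in for one task
    for item in partition:
        total += broadcast_prices[item]

print(total)  # 7
```

In real Spark the broadcast data is shipped to each node once (rather than with every task), and accumulator updates from the executors are merged on the driver.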

 

Important Spark Use Cases

Spark can be used in two categories of use cases: data science and data applications. Various Spark components support different data science tasks, such as interactive data analysis in Python and Scala (Spark Shell), data exploration using SQL (Spark SQL), machine learning and data analysis (MLlib), and calling out to external programs in MATLAB or R. For software developers building data applications, Spark hides the complexity of distributed systems programming and provides a simple way to parallelize such applications. (Ref = Learning Spark Book)
