Apache Spark Core
Real-Time In-Memory Analytics
Girish Khanzode
Contents
• Spark - In-Memory Data Sharing
• Programming Model
• Spark Context
• RDD - Resilient Distributed Datasets
• Program Execution
• Job Scheduling
• Fault Resiliency
• Transformations and Actions
• RDD Partitions
• RDD Transformations
• Memory Management
• RDD Benefits
• Resources
• References
What is Spark?
• An open-source cluster computing framework
• Leverages distributed memory
• Allows programs to load data into a cluster's memory and query it repeatedly
• Compared to Hadoop
– Scalability - can work with large data
– Fault tolerance - can self-recover
• Functional programming model
• Supports batch & streaming analysis
What is Spark?
• Separate, fast MapReduce-like engine
– In-memory data storage for very fast iterative queries
– General execution graphs and powerful optimizations
• Compatible with Hadoop storage APIs
– Can read / write to any Hadoop-supported system, including HDFS, HBase, Sequence
Files etc
• Faster Application Development - 2-5x less code
• Disk Execution Speed - 10× faster
• Memory Execution Speed – 100× faster
What is Spark?
• Apart from simple map and reduce operations, supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out of the box
• In-memory cluster computing
• Supports any existing Hadoop input / output format
• Spark is written in Scala
• Provides concise and consistent APIs in Scala, Java and Python
• Offers an interactive shell for Scala and Python
Spark Deployments – Cluster Manager Types
• Standalone (native Spark cluster)
• Hadoop YARN - Hadoop 2 resource manager
• Apache Mesos - generic cluster manager that can also handle MapReduce
• Local - a pseudo-distributed local mode for development or testing using the local file system
– Spark runs on a single machine with one executor per CPU core
Interfacing with Distributed Storage
• HDFS
• Cassandra
• Amazon S3
A Single Unified Platform for Big Data Analytics
Project History
• 2009 – Project started at UC Berkeley AMPLab
• 2010 – Open sourced under a BSD license
• 2013 – Donated to the Apache Software Foundation; license switched to Apache 2.0
• Feb 2014 – Became an Apache Top-Level Project
• November 2014 – Engineering team at Databricks used Spark to set a new world record in large-scale sorting
The Most Active Open Source Project in Big Data
[Chart: project contributors in the past year - Spark 125, Hadoop MapReduce 103, Giraph 32, Storm 25, Tez 17]
Hadoop Model
• Hadoop has an acyclic data flow model
– Load data -> process data -> write output -> finished
• Hadoop is slow due to replication, serialization and disk IO
• Hadoop is poorly suited to pipelining multiple jobs
• Cheaper DRAM makes main memory a better option than disk for holding intermediate results
Hadoop Model
[Diagram: each iteration reads its input from HDFS and writes its output back to HDFS - HDFS read -> Iteration 1 -> HDFS write -> HDFS read -> Iteration 2 -> ... -> Iteration n]
Spark Model
• MapReduce can share data across jobs only through stable storage such as a file system, which is slow
• Applications want to reuse intermediate results across multiple computations
– Work on the same dataset to optimize parameters in machine learning algorithms
– More complex, multi-stage applications (iterative graph algorithms and machine learning)
– More interactive ad-hoc queries
– Efficient primitives for data sharing across parallel jobs
• These challenges can be tackled by keeping intermediate results in memory
• Caching the data for multiple queries benefits interactive data analysis tools
Spark - In-Memory Data Sharing
10-100× faster than network and disk
[Diagram: input is processed once into distributed memory; iterations 1..n and results 1-3 are then served from memory instead of disk]
Spark Components
Stack
• Spark SQL
– Allows querying data via SQL as well as the Apache Hive variant of SQL (HQL) and supports many data sources, including Hive tables, Parquet and JSON
• Spark Streaming
– Component that enables processing live streams of data in an elegant, fault-tolerant, scalable and fast way
• MLlib
– Library containing common machine learning (ML) functionality, including algorithms such as classification, regression, clustering and collaborative filtering, that scales out across a cluster
Stack
• GraphX
– Library for manipulating graphs and performing graph-parallel computation
• Cluster Managers
– Spark is designed to efficiently scale up from one to many thousands of compute nodes. It can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos etc
– Spark also ships with a simple cluster manager of its own, called the Standalone Scheduler
Job Execution
Programming Model
• Spark programming model is based on parallelizable operators
• Parallelizable operators are higher-order functions that execute user-defined functions in
parallel
• A data flow is composed of any number of data sources, operators, and data sinks by
connecting their inputs and outputs
• Job description is based on directed acyclic graphs (DAG)
• Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern
• Since Spark is based on a DAG, it can follow the chain from child to parent to fetch any value, like a tree traversal
• The DAG supports fault tolerance
Programming Model
Directed - only in a single direction
Acyclic - no looping
How Spark Works
• User submits Jobs
• Every Spark application consists of a driver program that launches various
parallel operations on the cluster
• The driver program contains your application’s main function and defines
distributed datasets on the cluster, then applies operations to them
How Spark Works
• Driver programs access Spark through the SparkContext object, which represents a connection to a computing cluster
• The SparkContext can be used to build RDDs (Resilient Distributed Datasets) on which you can run a series of operations
• To run these operations, driver programs typically manage a number of nodes called executors
How Spark Works
• The SparkContext (driver) contacts the Cluster Manager, which assigns cluster resources
• Then it sends application code to the assigned Executors (distributing computation, not data)
• Finally it sends tasks to the Executors to run
Spark Context
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you should make your own
• import org.apache.spark.SparkContext
• import org.apache.spark.SparkContext._
• val sc = new SparkContext(master, appName, [sparkHome], [jars])
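A minimal sketch of a standalone driver program, using the SparkConf-based constructor; the app name, master URL and input path are illustrative:

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCountApp {
    def main(args: Array[String]): Unit = {
      // "local[2]" runs Spark on this machine with two worker threads (illustrative)
      val conf = new SparkConf().setAppName("WordCountApp").setMaster("local[2]")
      val sc = new SparkContext(conf)
      val lines = sc.textFile("input.txt") // hypothetical input path
      val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
      counts.collect().foreach(println)
      sc.stop()
    }
  }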
RDD - Resilient Distributed Datasets
• A distributed memory abstraction
• An immutable distributed collection of data partitioned across machines
in a cluster – provides scalability
• Immutability provides safety with parallel processing
• Distributed - stored in memory across the cluster
RDD - Resilient Distributed Datasets
• Stored in-memory - automatically rebuilt if a partition is lost
• In-memory storage makes it fast
• Facilitates two types of operations- transformation and action
• Lazily evaluated
• Type inferred
RDDs
• Fault-tolerant collection of elements that can be operated on in parallel
• Manipulated through various parallel operators using a diverse set of
transformations (map, filter, join etc)
• Fault recovery without costly replication
• Remembers the series of transformations that built an RDD (its lineage) to re-
compute lost data
• RDD operators are higher order functions
• Turn a collection into an RDD
– val a = sc.parallelize(Array(1, 2, 3))
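A small sketch of a transformation chain on such an RDD; the values are illustrative, and collect() is the action that triggers execution:

  val a = sc.parallelize(Array(1, 2, 3, 4, 5))
  val evens = a.filter(_ % 2 == 0)   // narrow transformation - no shuffle
  val doubled = evens.map(_ * 2)     // lineage: parallelize -> filter -> map
  doubled.collect()                  // Array(4, 8) - the action triggers execution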
RDDs
Program Execution
Program Execution
• When starting execution, the driver program builds up a graph where nodes are RDDs and edges are transformation steps
• No execution happens on the cluster until an action is encountered
• The driver program ships the execution graph as well as the code block to the cluster, where every worker server gets a copy
• The execution graph is a DAG
• Each DAG is an atomic unit of execution
Program Execution
• Each source node (no incoming edge) is an external data source or driver memory
• Each intermediate node is an RDD
• Each sink node (no outgoing edge) is an external data source or driver memory
• A green edge connecting to an RDD represents a transformation
• A red edge connecting to a sink node represents an action
Program Execution
How Spark Works?
• Spark is divided into various independent layers with responsibilities
• The first layer is the interpreter - Spark uses a Scala interpreter, with some modifications
• When code is typed in the Spark console (creating RDDs and applying operators), Spark creates an operator graph
• When an action is run, the Graph is submitted to a DAG Scheduler
• DAG scheduler divides operator graph into (map and reduce) stages
• A stage consists of tasks based on partitions of the input data
How Spark Works?
• The DAG scheduler pipelines operators together to optimize the graph
– Example - many map operators can be scheduled in a single stage
• The final result of a DAG scheduler is a set of stages that are passed on to the Task Scheduler
• The task scheduler launches tasks via cluster manager (Spark
Standalone/Yarn/Mesos)
• The task scheduler doesn’t know about dependencies among stages
• The Worker executes the tasks by starting a new JVM per job
• The worker knows only about the code that is passed to it
How Spark Works
Job Scheduling
• When an action on an RDD is executed, the scheduler builds a DAG of stages from
the RDD lineage graph
• A stage contains many pipelined transformations with narrow dependencies
• The boundary of a stage is set by
– Shuffles for wide dependencies
– Already computed partitions
Job Scheduling
• The scheduler launches tasks to compute missing partitions from each
stage until it computes the target RDD
• Tasks are assigned to machines based on data locality
• If a task needs a partition that is available in the memory of a node, the task is sent to that node
Job Scheduling
Data Shuffling
• Spark ships the code to a worker server where data processing happens
• But data movement cannot be completely eliminated
• Example - if the processing requires data residing in different partitions
to be grouped first, then data should be shuffled among worker servers
• Transformation operations are of two types – narrow and wide
Data Shuffling
• Narrow transformation
– Processing logic depends only on data already residing in the partition, so data shuffling is unnecessary
– Examples - filter(), sample(), map(), flatMap() etc
• Wide transformation
– Processing logic depends on data residing in multiple partitions, so data shuffling is needed to bring it together in one place
– Examples - groupByKey(), reduceByKey() etc
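A small sketch contrasting the two kinds (the sample data is illustrative):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

  // Narrow - each output partition depends on a single input partition, no shuffle
  val upper = pairs.map { case (k, v) => (k.toUpperCase, v) }

  // Wide - values for one key may live in several partitions, so a shuffle is required
  val sums = pairs.reduceByKey(_ + _)   // ("a", 4), ("b", 2)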
Data Shuffling
Narrow Transformation vs. Wide Transformation
RDD Joins
• Joining two RDDs affects the amount of data shuffled
• Spark provides two ways to join data – shuffle and broadcast
• Shuffle join - data of the two RDDs with the same key is redistributed to the same partition; the items of each RDD are shuffled across worker servers
• Broadcast join - one of the RDDs is broadcast and copied over to every partition
– If one RDD is significantly smaller than the other, a broadcast join reduces network traffic because only the small RDD needs to be copied to all worker servers while the large RDD is not shuffled at all
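A hedged sketch of a broadcast (map-side) join - the small RDD is collected to the driver, broadcast once, and looked up inside a map, so the big RDD is never shuffled (the data is illustrative):

  val big   = sc.parallelize(Seq((1, "click"), (2, "view"), (1, "buy")))
  val small = sc.parallelize(Seq((1, "alice"), (2, "bob")))

  val smallMap = sc.broadcast(small.collectAsMap())   // shipped once per worker
  val joined = big.flatMap { case (id, event) =>
    smallMap.value.get(id).map(user => (id, (event, user)))
  }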
RDD Joins
Shuffle Join vs. Broadcast Join
Fault Resiliency
• RDDs track the series of transformations used to build them (their lineage) to re-compute lost data
• messages = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
[Diagram: HDFS File -> filter(func = startsWith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]
Fault Resiliency
• RDDs maintain lineage information used to reconstruct lost partitions
• Logging lineage rather than the actual data
• No replication
• Recompute only the lost partitions of an RDD
Fault Resiliency
• Recovery may be time-consuming for RDDs with long lineage chains and
wide dependencies
• It is helpful to checkpoint some RDDs to stable storage
• Decision about which data to checkpoint is left to users
Fault Resiliency
• The DAG defines deterministic transformation steps between different partitions of data within each RDD
• Whenever a worker server crashes during the execution of a stage,
another worker server re-executes the stage from the beginning by
pulling the input data from its parent stage that has the output data
stored in local files
Fault Resiliency
• In case the result of the parent stage is not accessible (the worker server lost the file), the parent stage needs to be re-executed as well
• In a lineage of transformation steps, a failure at any step triggers re-execution from the last step whose output is still available
• Since the DAG itself is an atomic unit of execution, all the RDD values are forgotten after the DAG finishes its execution
Fault Resiliency
• Therefore, after the driver program finishes an action (which executes a DAG to completion), all the RDD values are forgotten, and if the program accesses the RDD again in a subsequent statement, the RDD needs to be recomputed from its dependents
• To reduce this repetitive processing, Spark provides a caching mechanism to remember RDDs in worker server memory (or on local disk)
• Once the execution planner finds that an RDD is already cached in memory, it uses the RDD right away without tracing back to its parent RDDs
• This way, the DAG is pruned once an RDD in the cache is reached
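A brief sketch of that mechanism (the path is elided, as in the slides):

  val logs = sc.textFile("hdfs://...")
  val errors = logs.filter(_.contains("ERROR")).cache()

  errors.count()    // first action - computes the partitions and caches them
  errors.take(10)   // later actions reuse the cached partitions; the lineage is pruned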
RDD Operators - Transformations
• Create a new dataset from an existing one - map, filter, distinct, union, sample, groupByKey, join, etc…
• RDD transformations allow creating dependencies between RDDs
• Dependencies are only steps for producing results (a program)
RDD Operators - Transformations
• Each RDD in a lineage chain (string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD
• Spark divides RDD dependencies into stages and tasks and sends those to workers for execution
• Lazy operators
Transformations - Lazy Evaluation
RDD Operators - Actions
• Return a value after running a computation
• Compute a result based on an RDD
• Result is returned to the driver program or saved to an external storage
system
• Typical RDD actions are count, first, collect, takeSample, foreach
Transformations and Actions
[Diagram: SparkContext -> transformations (lazily evaluated) -> action]
Transformations
• Set of operations on an RDD that define how its data should be transformed
• An operation such as map(), filter() or union() on an RDD that yields another RDD
• Transformations create a new RDD based on the existing RDD
• RDDs are immutable
• Lazily evaluated - data in RDDs is not processed until an action is performed
Transformations
• Why lazy execution? Because it lets Spark optimize the whole series of transformations applied to an RDD
• The Spark driver remembers the transformations applied to an RDD, so a lost partition can be reconstructed on some other machine in the cluster
• This resiliency is achieved via a lineage graph
Transformations
• Words - an RDD containing a reference to the lines RDD
• When the program executes, first lines' function is executed (load the data from a text file)
• Then words' function is executed on the resulting data (split lines into words)
• Spark is lazy, so nothing is executed unless some transformation or action is called that triggers job creation and execution (collect in this example)
• An RDD (a transformed RDD, too) is not 'a set of data', but a step in a program (might be the only step) telling Spark how to get the data and what to do with it
Transformations
• val lines = sc.textFile("...")
• val words = lines.flatMap(line => line.split(" "))
• val localwords = words.collect()
Actions
• Applies all transformations on the RDD and then performs the action to obtain results
• Operations that return a final value to the driver program or write data to an
external storage system
• After performing an action on an RDD, the result is returned to the driver program or written to the storage system
Actions
• Actions force the evaluation of the transformations required for the RDD
they were called on, since they need to actually produce output
• An action can be recognized by looking at its return value
– primitive and built-in types such as int, long, List<Object>, Array<Object>, …
Transformation Functions
• map(func)
• filter(func)
• flatMap(func)
• mapPartitions(func)
• mapPartitionsWithIndex(func)
• sample(withReplacement, fraction, seed)
• union(otherDataset)
• intersection(otherDataset)
• distinct([numTasks]))
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
Transformation Functions
• aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
• sortByKey([ascending], [numTasks])
• join(otherDataset, [numTasks])
• cogroup(otherDataset, [numTasks])
• cartesian(otherDataset)
• pipe(command, [envVars])
• coalesce(numPartitions)
• repartition(numPartitions)
• repartitionAndSortWithinPartitions(partitioner)
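A hedged sketch exercising a few of the transformations listed above (data and partition counts are illustrative):

  val nums  = sc.parallelize(1 to 10, 4)           // 4 partitions
  val pairs = nums.map(n => (n % 3, n))

  nums.filter(_ > 5)                               // keep 6..10
  nums.sample(withReplacement = false, 0.5, 42L)   // ~50% random sample
  nums.coalesce(2)                                 // shrink to 2 partitions without a shuffle
  pairs.groupByKey()                               // wide - shuffles all values per key
  pairs.reduceByKey(_ + _)                         // wide, but combines map-side first
  pairs.sortByKey()                                // range-partitions and sorts by key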
Action Functions
• reduce(func)
• collect()
• count()
• first()
• take(n)
• takeSample(withReplacement, num, [seed])
• takeOrdered(n, [ordering])
• saveAsTextFile(path)
• saveAsSequenceFile(path) (Java and Scala)
• saveAsObjectFile(path) (Java and Scala)
• countByKey()
• foreach(func)
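A short sketch of these actions on a small RDD (values and the output path are illustrative):

  val rdd = sc.parallelize(Seq(3, 1, 2, 1))

  rdd.reduce(_ + _)                   // 7
  rdd.count()                         // 4
  rdd.first()                         // 3
  rdd.take(2)                         // Array(3, 1)
  rdd.takeOrdered(2)                  // Array(1, 1)
  rdd.map(x => (x, 1)).countByKey()   // Map(3 -> 1, 1 -> 2, 2 -> 1)
  rdd.saveAsTextFile("out")           // hypothetical output directory
  rdd.foreach(println)                // runs on the executors, not the driver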
RDD Creation
• Read from data sources - HDFS, JSON files, text files - any kind of files
• Transforming other RDDs using parallel operations - transformations and actions
• An RDD keeps information about how it was derived from other RDDs
• An RDD has a set of partitions and a set of dependencies on parent RDDs
• Narrow dependency if it derives from only one parent
RDD Creation
• Wide dependency if it derives from multiple parents (such as a join of two parents)
• A function to compute its partitions from its parents
• Metadata about its partitioning scheme and data placement (preferred location to compute each partition)
• Partitioner (defines the strategy for partitioning its partitions)
Shared Variables
• When Spark runs a function in parallel as a set of tasks on different nodes, it ships
a copy of each variable used in the function to each task
• These variables are copied to each machine
• No updates to the variables on the remote machine are propagated back to the
driver program
• Spark does provide two limited types of shared variables for two common usage
patterns
– broadcast variables
– accumulators
Broadcast Variables
• A broadcast variable is a read-only variable made available from the driver
program that runs the SparkContext object to the nodes that will execute the
computation
• Useful in applications that need to make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms
• The broadcast values are not shipped to the nodes more than once
Broadcast Variables
• To create broadcast variables, call a method on SparkContext
– val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
• Spark attempts to distribute broadcast variables using efficient broadcast
algorithms to reduce communication cost
– For example, to give every node a copy of a large input dataset efficiently
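A sketch of reading the broadcast value inside a task via .value (the numbers are illustrative):

  val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
  val sizes = sc.parallelize(1 to 3).map(i => broadcastAList.value.size * i)
  sizes.collect()   // Array(5, 10, 15) - the list was shipped to each worker only once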
Accumulators
• An accumulator is also a variable that is broadcast to the worker nodes
• Variables that can only be added to through an associative operation
• The addition must be an associative operation so that the global accumulated
value can be correctly computed in parallel and returned to the driver program
• Used to implement counters and sums, efficiently in parallel
Accumulators
• Spark natively supports accumulators of numeric value types and standard
mutable collections, and programmers can extend for new types
• Only the driver program can read an accumulator’s value, not the task
• Each worker node can only access and add to its own local accumulator value
• Only the driver program can access the global value
• Accumulators are also accessed within the Spark code using the value method
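A sketch using the classic SparkContext.accumulator API of this era (newer Spark versions expose sc.longAccumulator instead); the path is elided as in the slides:

  val blankLines = sc.accumulator(0)   // driver-side initial value

  sc.textFile("hdfs://...").foreach { line =>
    if (line.isEmpty) blankLines += 1  // tasks may only add to it
  }
  println(blankLines.value)            // only the driver reads the global total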
RDD Partitions
RDD Partitions
• An RDD is divided into a number of partitions, which are atomic pieces of
information
• Partitions of an RDD can be stored on different nodes of a cluster
• RDD data is just a collection of partitions
• Logical division of data
• Derived from Hadoop MapReduce
• All input, intermediate and output data is represented as partitions
• Partitions are the basic unit of parallelism
Page Rank Controlled Partitions - Performance
[Chart: PageRank iteration time (s) - Hadoop 170.8, Basic Spark 72.0, Spark + Controlled Partitioning 23.0]
Partitioning - Immutability
• All partitions are immutable
• Each RDD has two sets of parallel operations - transformations and actions
• Every transformation generates new partitions
• Partition immutability is driven by the underlying storage, such as HDFS
• Partition immutability allows for fault recovery
Partitioning - Distribution
• Partitions derived from HDFS are distributed by default
• Partitions are also location aware
• Location awareness of partitions allows for data locality
• Computed data can also be distributed in memory using caching
Accessing Partitions
• Access a whole partition at a time rather than a single row
• Use the mapPartitions API of RDD
• Allows partition-wise operations which cannot be done by accessing single rows
Partitioning ofTransformed Data
• Partitioning is different for key/value pairs that are generated by a shuffle operation
• Partitioning is driven by the partitioner specified
• By default HashPartitioner is used
• You can use your own partitioner too
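A sketch of controlling the partitioner of a shuffled pair RDD (data and partition count are illustrative):

  import org.apache.spark.HashPartitioner

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val hashed = pairs.partitionBy(new HashPartitioner(8))   // 8 hash partitions
  val sums = hashed.reduceByKey(_ + _)   // reuses the existing partitioner - no extra shuffle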
Custom Partitioner
• Partition the data according to your data structure
• Custom partitioning allows control over the number of partitions and the distribution of data when grouping or reducing is done
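A hedged sketch of a custom Partitioner; the first-letter rule and the sample data are illustrative, not from the slides:

  import org.apache.spark.Partitioner

  // Route keys to partitions by the first letter of their string form (illustrative rule)
  class FirstLetterPartitioner(parts: Int) extends Partitioner {
    override def numPartitions: Int = parts
    override def getPartition(key: Any): Int =
      math.abs(key.toString.headOption.getOrElse(' ').hashCode) % parts
  }

  val words = sc.parallelize(Seq(("apple", 1), ("banana", 2), ("avocado", 3)))
  val byLetter = words.partitionBy(new FirstLetterPartitioner(26))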
Lookup Operation
• Partitioning allows faster lookups
• The lookup operation finds the values for a given key
• Using the partitioner, lookup determines which partition to look in
• Then it only needs to scan that partition
• If no partitioner is specified, it falls back to a filter
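A sketch of lookup on a partitioned pair RDD (the data is illustrative):

  import org.apache.spark.HashPartitioner

  val table = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
                .partitionBy(new HashPartitioner(4))
  table.lookup(1)   // Seq("a", "c") - the partitioner routes the probe to a single partition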
Laziness – Parent Dependency
• Each RDD has access to its parent RDD
• The value of the parent for the first RDD is nil
• Before computing its value, an RDD always computes its parent
• This chain of evaluation allows for laziness
Subclassing
• Each Spark operator creates an instance of a specific subclass of RDD
• The map operator results in a MappedRDD, flatMap in a FlatMappedRDD etc
• The subclass allows the RDD to remember the operation performed in the transformation
RDD Transformations
• val dataRDD = sc.textFile(args(1))
• val splitRDD = dataRDD.flatMap(value => value.split(" "))
Compute Function
• A function for evaluating each partition of an RDD
• An abstract method of RDD
• Each subclass of RDD, like MappedRDD or FilteredRDD, has to override this method
Lineage
• Transformations used to build an RDD
• RDDs are stored as a chain of objects capturing the lineage of each RDD
• val file = sc.textFile("hdfs://...")
• val sics = file.filter(_.contains("SICS"))
• val cachedSics = sics.cache()
• val ones = cachedSics.map(_ => 1)
• val count = ones.reduce(_+_)
RDD Actions
• val dataRDD = sc.textFile(args(1))
• val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
• flatMapRDD.collect()
• runJob API
– An API of RDD used to implement actions
– Allows taking each partition and evaluating it
– Internally used by all Spark actions
Memory Management
• If there is not enough space in memory for a new computed RDD partition, a
partition from the least recently used RDD is evicted
• Spark provides three options for storage of persistent RDDs
– In memory storage as de-serialized Java objects
– In memory storage as serialized Java objects
– On disk storage
• When an RDD is persisted, each node stores any partitions of the RDD that it
computes in memory - allows future actions to be much faster
Memory Management
• Persisting an RDD using persist() or cache() methods
• Storage levels
– MEMORY_ONLY
– MEMORY_AND_DISK
– MEMORY_ONLY_SER
– MEMORY_AND_DISK_SER
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...
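A sketch of choosing a storage level explicitly with persist() (the path is elided as in the slides):

  import org.apache.spark.storage.StorageLevel

  val rdd = sc.textFile("hdfs://...")
  rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spills to disk
  rdd.count()                                     // the first action materializes the cached partitions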
Caching
• cache() internally uses the persist API
• persist sets a specific storage level for a given RDD
• The Spark context tracks persistent RDDs
• A partition is put into memory by the block manager
Caching - Block Manager
• Handles all in-memory data in Spark
• Responsible for
– Cached data (BlockRDD)
– Shuffle data
– Broadcast data
• A partition is stored in a block with id (RDD.id, partition_index)
Working of Caching
• The partition iterator checks the storage level
• If a storage level is set, it calls cacheManager.getOrCompute(partition)
• Since the iterator runs for every RDD evaluation, caching is transparent to the user
Cache Performance
[Chart: execution time (s) vs. % of working set in cache - cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, fully cached 11.5]
Extending Spark API
• Extending the RDD API allows creating custom RDD structures
• Custom RDDs allow control over computation
• Possible to change partitioning, locality and evaluation depending upon requirements
Extending Spark API
• Custom operators on RDDs
– Domain-specific operators for specific RDDs
– Uses the Scala implicit mechanism
– Feels and works like a built-in operator
• Custom RDD
– Extend the RDD API to create a new RDD type
– Combined with custom operators, this makes the RDD API powerful
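A hedged sketch of the implicit mechanism; the countWords operator is illustrative, not part of Spark:

  import org.apache.spark.rdd.RDD

  object TextOps {
    // Enrich RDD[String] with a domain-specific operator (hypothetical example)
    implicit class RichTextRDD(rdd: RDD[String]) {
      def countWords(): RDD[(String, Int)] =
        rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    }
  }

  import TextOps._
  sc.textFile("hdfs://...").countWords()   // feels like a built-in operator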
RDD Benefits
• Data and intermediate results are stored in memory to speed up computation and located on adequate nodes for optimization
• Able to perform transformation operations on an RDD many times
• Lineage information about RDD transformations is kept for failure recovery - if a failure occurs while operating on a partition, the partition is recomputed
RDD Benefits - Persistence
• Default is in memory
• Able to place replicas on multiple nodes
• If data does not fit in memory, spills data to disk
• Better to make a checkpoint when a lineage is long or wide dependencies exist on a lineage - checkpointing is performed in the background
RDD Benefits
• Data locality works for narrow dependencies
• Intermediate results of wide dependencies are dumped to disk like mapper output
• Comparison to DSM (Distributed Shared Memory)
– Hard to implement fault tolerance on commodity servers
– RDDs are immutable, so it is easy to take a backup
– In DSM, tasks access the same memory locations and interfere with each other's updates
Resources
• https://github.com/aniket486/pig
• https://github.com/twitter/pig/tree/spork
• http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
• https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• http://databricks.com/categories/spark/
• http://www.spark-stack.org/
References
1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf
Chowdhury. Technical Report UCB/EECS-2011-82. July 2011
2. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale,
SOSP 2013, November 2013
3. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013
4. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013
5. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011
6. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications, Stanford University, Stanford, CA, February 2011
7. Spark: Cluster Computing with Working Sets, HotCloud 2010, Boston, MA, June 2010
8. http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-
Michael-Armbrust.pdf
9. https://github.com/apache/spark/tree/master/sql
Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode

More Related Content

What's hot

Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
Spark
SparkSpark
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
GauravBiswas9
 

What's hot (20)

Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Spark
SparkSpark
Spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 

Viewers also liked

df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on Spark
Alpine Data
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Data Con LA
 
Code Review and other aspects of project organization
Code Review and other aspects of project organizationCode Review and other aspects of project organization
Code Review and other aspects of project organization
Łukasz Dumiszewski
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
Namgee Lee
 
Programming in Spark - Lessons Learned in OpenAire project
Programming in Spark - Lessons Learned in OpenAire projectProgramming in Spark - Lessons Learned in OpenAire project
Programming in Spark - Lessons Learned in OpenAire project
Łukasz Dumiszewski
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
huguk
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Carol McDonald
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
Thành Nguyễn
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
Carol McDonald
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
Carol McDonald
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
Matthias Niehoff
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data
Jongwook Woo
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Carol McDonald
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Anastasios Skarlatidis
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
Taro L. Saito
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Ali Hodroj
 
Variance in scala
Variance in scalaVariance in scala
Variance in scala
LyleK
 
Types by Adform Research, Saulius Valatka
Types by Adform Research, Saulius ValatkaTypes by Adform Research, Saulius Valatka
Types by Adform Research, Saulius Valatka
Vasil Remeniuk
 

Viewers also liked (20)

df: Dataframe on Spark
df: Dataframe on Sparkdf: Dataframe on Spark
df: Dataframe on Spark
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 
Code Review and other aspects of project organization
Code Review and other aspects of project organizationCode Review and other aspects of project organization
Code Review and other aspects of project organization
 
SparkSQL and Dataframe
SparkSQL and DataframeSparkSQL and Dataframe
SparkSQL and Dataframe
 
Programming in Spark - Lessons Learned in OpenAire project
Programming in Spark - Lessons Learned in OpenAire projectProgramming in Spark - Lessons Learned in OpenAire project
Programming in Spark - Lessons Learned in OpenAire project
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
Analytics with Cassandra & Spark
Analytics with Cassandra & SparkAnalytics with Cassandra & Spark
Analytics with Cassandra & Spark
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Spark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in JapanSpark Internals - Hadoop Source Code Reading #16 in Japan
Spark Internals - Hadoop Source Code Reading #16 in Japan
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
Variance in scala
Variance in scalaVariance in scala
Variance in scala
 
Types by Adform Research, Saulius Valatka
Types by Adform Research, Saulius ValatkaTypes by Adform Research, Saulius Valatka
Types by Adform Research, Saulius Valatka
 

Similar to Apache Spark Core

Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
Anirudh
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
bhuvankumar3877
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
DeepaThirumurugan
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
Antonios Katsarakis
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
RUHULAMINHAZARIKA
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
Karan Alang
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
Dona Mary Philip
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
Rahul Borate
 
Apache spark
Apache sparkApache spark
Apache spark
Prashant Pranay
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
ShidrokhGoudarzi1
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
bhargavi804095
 
Spark core
Spark coreSpark core
Spark core
Prashant Gupta
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
mahchiev
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
SaiSriMadhuriYatam
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
Ramesh Mudunuri
 

Similar to Apache Spark Core (20)

Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE5_RK_BIG_DATA.pptx
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
An Introduction to Apache Spark
An Introduction to Apache SparkAn Introduction to Apache Spark
An Introduction to Apache Spark
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache spark
Apache sparkApache spark
Apache spark
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Apache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.pptApache Spark™ is a multi-language engine for executing data-S5.ppt
Apache Spark™ is a multi-language engine for executing data-S5.ppt
 
Spark core
Spark coreSpark core
Spark core
 
Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014Big Data Processing with Apache Spark 2014
Big Data Processing with Apache Spark 2014
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
 
Dec6 meetup spark presentation
Dec6 meetup spark presentationDec6 meetup spark presentation
Dec6 meetup spark presentation
 

More from Girish Khanzode

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
Girish Khanzode
 
Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
Girish Khanzode
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
Girish Khanzode
 
IR
IRIR
Machine Learning
Machine LearningMachine Learning
Machine Learning
Girish Khanzode
 
NLP
NLPNLP
NLTK
NLTKNLTK
NoSql
NoSqlNoSql
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Girish Khanzode
 
Hadoop
HadoopHadoop
Language R
Language RLanguage R
Language R
Girish Khanzode
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
Girish Khanzode
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
Girish Khanzode
 

More from Girish Khanzode (13)

Apache Spark Components
Apache Spark ComponentsApache Spark Components
Apache Spark Components
 
Data Visulalization
Data VisulalizationData Visulalization
Data Visulalization
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
IR
IRIR
IR
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
NLP
NLPNLP
NLP
 
NLTK
NLTKNLTK
NLTK
 
NoSql
NoSqlNoSql
NoSql
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Hadoop
HadoopHadoop
Hadoop
 
Language R
Language RLanguage R
Language R
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Funtional Programming
Funtional ProgrammingFuntional Programming
Funtional Programming
 

Recently uploaded

Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 
Performance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy EvertsPerformance Budgets for the Real World by Tammy Everts
Performance Budgets for the Real World by Tammy Everts
ScyllaDB
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
ScyllaDB
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
Aurora Consulting
 
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating AppsecGDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
James Anderson
 
Observability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetryObservability For You and Me with OpenTelemetry
Observability For You and Me with OpenTelemetry
Eric D. Schabell
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Vijayananda Mohire
 
Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024
The Digital Insurer
 
“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...
“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...
“Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” a Pres...
Edge AI and Vision Alliance
 
Research Directions for Cross Reality Interfaces
Research Directions for Cross Reality InterfacesResearch Directions for Cross Reality Interfaces
Research Directions for Cross Reality Interfaces
Mark Billinghurst
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
Apache Spark Core

  • 14. Spark - In-Memory Data Sharing [diagram: input is processed once into distributed memory, then reused across iterations and queries - 10-100× faster than network and disk]
  • 16. Stack • Spark SQL – allows querying data via SQL, as well as the Apache Hive variant of SQL (HiveQL), and supports many data sources including Hive tables, Parquet and JSON • Spark Streaming – component that enables processing live streams of data in an elegant, fault-tolerant, scalable and fast way • MLlib – library containing common machine learning (ML) functionality, including algorithms such as classification, regression, clustering and collaborative filtering, that scale out across a cluster
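A rough sketch of the Spark SQL component described above - a minimal SQL query over a JSON source, assuming the newer SparkSession entry point rather than the SQLContext of the Spark 1.x era, and a hypothetical people.json input file:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SqlSketch")
  .master("local[*]")
  .getOrCreate()

// Load a JSON source into a DataFrame and query it with SQL
val people = spark.read.json("people.json")  // hypothetical input file
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()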
  • 17. Stack • GraphX – library for manipulating graphs and performing graph-parallel computation • Cluster Managers – Spark is designed to scale efficiently from one to many thousands of compute nodes. It can run over a variety of cluster managers including Hadoop YARN, Apache Mesos etc – Spark also ships with a simple cluster manager of its own, called the Standalone Scheduler
  • 19. Programming Model • The Spark programming model is based on parallelizable operators • Parallelizable operators are higher-order functions that execute user-defined functions in parallel • A data flow is composed of any number of data sources, operators and data sinks by connecting their inputs and outputs • Job descriptions are based on directed acyclic graphs (DAG) • Spark allows programmers to develop complex, multi-step data pipelines using the DAG pattern • Since Spark is based on a DAG, it can follow a chain from child to parent to fetch any value, like a tree traversal • The DAG supports fault tolerance
  • 20. Programming Model Directed - only in a single direction Acyclic - no looping
  • 21. How Spark Works • The user submits a job • Every Spark application consists of a driver program that launches various parallel operations on the cluster • The driver program contains the application's main function and defines distributed datasets on the cluster, then applies operations to them
  • 22. How Spark Works • Driver programs access Spark through the SparkContext object, which represents a connection to a computing cluster • The SparkContext can be used to build RDDs (Resilient Distributed Datasets) on which a series of operations can be run • To run these operations, driver programs typically manage a number of nodes called executors
  • 23. How Spark Works • SparkContext (driver) contacts the Cluster Manager, which assigns cluster resources • Then it sends application code to the assigned Executors (distributing computation, not data) • Finally it sends tasks to the Executors to run
  • 25. Spark Context • Main entry point to Spark functionality • Available in the shell as variable sc • In standalone programs, you make your own • import org.apache.spark.SparkContext • import org.apache.spark.SparkContext._ • val sc = new SparkContext(master, appName, [sparkHome], [jars])
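A minimal standalone driver program along these lines (the master URL, app name and input path are illustrative assumptions):

import org.apache.spark.SparkContext

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // Local master with 4 worker threads; in production this would be
    // a Standalone, YARN or Mesos master URL
    val sc = new SparkContext("local[4]", "MinimalDriver")

    // Define a distributed dataset and apply operations to it
    val lines = sc.textFile("input.txt")  // hypothetical input file
    val errors = lines.filter(_.contains("ERROR")).count()

    println(s"Error lines: $errors")
    sc.stop()
  }
}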
  • 26. RDD - Resilient Distributed Datasets • A distributed memory abstraction • An immutable distributed collection of data partitioned across machines in a cluster – provides scalability • Immutability provides safety with parallel processing • Distributed - stored in memory across the cluster
  • 27. RDD - Resilient Distributed Datasets • Stored in-memory - automatically rebuilt if a partition is lost • In-memory storage makes it fast • Supports two types of operations - transformations and actions • Lazily evaluated • Type inferred
  • 28. RDDs • Fault-tolerant collection of elements that can be operated on in parallel • Manipulated through various parallel operators using a diverse set of transformations (map, filter, join etc) • Fault recovery without costly replication • Remembers the series of transformations that built an RDD (its lineage) to re-compute lost data • RDD operators are higher-order functions • Turn a collection into an RDD – val a = sc.parallelize(Array(1, 2, 3))
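Continuing the parallelize example - one transformation and one action, as a minimal sketch:

val a = sc.parallelize(Array(1, 2, 3))
val doubled = a.map(_ * 2)       // transformation: lazily defines a new RDD
val sum = doubled.reduce(_ + _)  // action: triggers execution, returns 12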
  • 29. RDDs
  • 31. Program Execution • The driver program, when starting execution, builds up a graph where nodes are RDDs and edges are transformation steps • No execution happens on the cluster until an action is encountered • The driver program ships the execution graph as well as the code block to the cluster, where every worker server gets a copy • The execution graph is a DAG • Each DAG is an atomic unit of execution
  • 32. Program Execution • Each source node (no incoming edge) is an external data source or driver memory • Each intermediate node is an RDD • Each sink node (no outgoing edge) is an external data source or driver memory • A green edge connecting to an RDD represents a transformation • A red edge connecting to a sink node represents an action
  • 34. How Spark Works? • Spark is divided into various independent layers with distinct responsibilities • The first layer is the interpreter - Spark uses a Scala interpreter, with some modifications • When code is typed into the Spark console (creating RDDs and applying operators), Spark creates an operator graph • When an action is run, the graph is submitted to the DAG Scheduler • The DAG Scheduler divides the operator graph into (map and reduce) stages • A stage consists of tasks based on partitions of the input data
  • 35. How Spark Works? • The DAG Scheduler pipelines operators together to optimize the graph – Example - many map operators can be scheduled in a single stage • The final result of the DAG Scheduler is a set of stages that are passed on to the Task Scheduler • The Task Scheduler launches tasks via the cluster manager (Spark Standalone / YARN / Mesos) • The Task Scheduler doesn't know about dependencies among stages • The worker executes the tasks by starting a new JVM per job • The worker knows only about the code that is passed to it
  • 37. Job Scheduling • When an action on an RDD is executed, the scheduler builds a DAG of stages from the RDD lineage graph • A stage contains many pipelined transformations with narrow dependencies • The boundaries of a stage are – shuffles for wide dependencies – already-computed partitions
  • 38. Job Scheduling • The scheduler launches tasks to compute missing partitions from each stage until it has computed the target RDD • Tasks are assigned to machines based on data locality • If a task needs a partition that is available in the memory of a node, the task is sent to that node
  • 40. Data Shuffling • Spark ships the code to the worker servers where data processing happens • But data movement cannot be completely eliminated • Example - if processing requires data residing in different partitions to be grouped first, the data must be shuffled among worker servers • Transformations come in two types – narrow and wide
  • 41. Data Shuffling • Narrow transformation – the processing logic depends only on data already residing in the partition, so data shuffling is unnecessary – Examples - filter(), sample(), map(), flatMap() etc • Wide transformation – the processing logic depends on data residing in multiple partitions, so data shuffling is needed to bring it together in one place – Examples - groupByKey(), reduceByKey() etc
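A small sketch contrasting the two kinds of transformation (assumes an existing SparkContext sc; the data is illustrative):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

// Narrow: each output partition depends on a single input partition,
// so no shuffle is needed
val pairs = words.filter(_.nonEmpty).map(w => (w, 1))

// Wide: rows with the same key may live in different partitions,
// so reduceByKey shuffles data across worker servers
val counts = pairs.reduceByKey(_ + _)
counts.collect()  // Array((a,3), (b,2), (c,1)) - order may vary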
  • 43. RDD Joins • Joining two RDDs affects the amount of data shuffled • Spark provides two ways to join data – shuffle join and broadcast join • Shuffle join - data of the two RDDs with the same key is redistributed to the same partition; the items in each RDD are shuffled across worker servers • Broadcast join - one of the RDDs is broadcast and copied over to every partition – If one RDD is significantly smaller than the other, a broadcast join reduces network traffic because only the small RDD needs to be copied to all worker servers, while the large RDD is not shuffled at all
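A hand-rolled broadcast (map-side) join sketch in the spirit of the slide, built from a broadcast variable (the datasets are illustrative):

// Broadcast the small RDD to every worker once
val small = sc.parallelize(Seq((1, "US"), (2, "UK")))
val smallMap = sc.broadcast(small.collectAsMap())

// The large RDD is joined in place - no shuffle of the large side
val large = sc.parallelize(Seq((1, 9.99), (2, 4.50), (1, 3.25)))
val joined = large.map { case (id, amount) =>
  (id, amount, smallMap.value.getOrElse(id, "unknown"))
}
joined.collect()  // Array((1,9.99,US), (2,4.5,UK), (1,3.25,US))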
  • 44. RDD Joins [diagram comparing a shuffle join with a broadcast join]
  • 45. Fault Resiliency • RDDs track the series of transformations used to build them (their lineage) to re-compute lost data • messages = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2]) [lineage diagram: HDFS File → filter(startswith(...)) → Filtered RDD → map(split(...)) → Mapped RDD]
  • 46. Fault Resiliency • RDDs maintain lineage information used to reconstruct lost partitions • Logging lineage rather than the actual data • No replication • Recompute only the lost partitions of an RDD
  • 47. Fault Resiliency • Recovery may be time-consuming for RDDs with long lineage chains and wide dependencies • It is helpful to checkpoint some RDDs to stable storage • Decision about which data to checkpoint is left to users
  • 48. Fault Resiliency • The DAG defines deterministic transformation steps between different partitions of data within each RDD • Whenever a worker server crashes during the execution of a stage, another worker server re-executes the stage from the beginning by pulling the input data from its parent stage, which has the output data stored in local files
  • 49. Fault Resiliency • In case the result of the parent stage is not accessible (the worker server lost the file), the parent stage needs to be re-executed as well • Imagine a lineage of transformation steps: any failure of a step will trigger a restart of execution from its last completed step • Since the DAG itself is an atomic unit of execution, all RDD values are forgotten after the DAG finishes its execution
  • 50. Fault Resiliency • Therefore, after the driver program finishes an action (which executes a DAG to completion), all RDD values are forgotten, and if the program accesses the RDD again in a subsequent statement, the RDD must be recomputed from its dependents • To reduce this repetitive processing, Spark provides a caching mechanism that remembers RDDs in worker server memory (or on local disk) • Once the execution planner finds that an RDD is already cached in memory, it uses that RDD right away without tracing back to its parent RDDs • This way, the DAG is pruned once a cached RDD is reached
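A sketch of that caching mechanism (the input path is an assumption):

val logs = sc.textFile("hdfs://.../logs")  // hypothetical input path
val errors = logs.filter(_.contains("ERROR")).cache()

// First action computes the full lineage and caches 'errors'
errors.count()

// Second action reuses the cached partitions - the DAG is pruned
// at 'errors' instead of re-reading and re-filtering the file
errors.filter(_.contains("FATAL")).count()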
  • 51. RDD Operators - Transformations • Create a new dataset from an existing one - map, filter, distinct, union, sample, groupByKey, join, etc. • RDD transformations create dependencies between RDDs • Dependencies are only steps for producing results (a program)
  • 52. RDD Operators - Transformations • Each RDD in the lineage chain (string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD • Spark divides RDD dependencies into stages and tasks and sends those to workers for execution • Lazy operators
  • 54. RDD Operators - Actions • Return a value after running a computation • Compute a result based on an RDD • The result is returned to the driver program or saved to an external storage system • Typical RDD actions are count, first, collect, takeSample, foreach
  • 55. Transformations and Actions [diagram: SparkContext → transformations (lazily evaluated) → action]
  • 56. Transformations • A set of operations on an RDD that define how its data should be transformed • An operation such as map(), filter() or union() on an RDD that yields another RDD • Transformations create a new RDD based on the existing one • RDDs are immutable • Lazily evaluated - data in RDDs is not processed until an action is performed
  • 57. Transformations • Why lazy execution? Because Spark expects to apply optimizations to the series of transformations on an RDD • The Spark driver remembers the transformations applied to an RDD – so a lost partition can be reconstructed on some other machine in the cluster • This resiliency is achieved via a lineage graph
  • 58. Transformations • words - an RDD containing a reference to the lines RDD • When the program executes, first lines' function is executed (load the data from a text file) • Then words' function is executed on the resulting data (split lines into words) • Spark is lazy, so nothing is executed unless some transformation or action is called that triggers job creation and execution (collect in this example) • An RDD (a transformed RDD, too) is not 'a set of data', but a step in a program (possibly the only step) telling Spark how to get the data and what to do with it
  • 59. Transformations • val lines = sc.textFile("...") • val words = lines.flatMap(line => line.split(" ")) • val localwords = words.collect()
  • 60. Actions • Apply all transformations on the RDD and then perform the action to obtain a result • Operations that return a final value to the driver program or write data to an external storage system • After performing an action on an RDD, the result is returned to the driver program or written to the storage system
  • 61. Actions • Actions force the evaluation of the transformations required for the RDD they are called on, since they need to actually produce output • An action can be recognized by looking at the return value – primitive and built-in types such as int, long, List<Object>, Array<Object>, … (action)
  • 62. Transformation Functions • map(func) • filter(func) • flatMap(func) • mapPartitions(func) • mapPartitionsWithIndex(func) • sample(withReplacement, fraction, seed) • union(otherDataset) • intersection(otherDataset) • distinct([numTasks])) • groupByKey([numTasks]) • reduceByKey(func, [numTasks])
  • 63. Transformation Functions • aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]) • sortByKey([ascending], [numTasks]) • join(otherDataset, [numTasks]) • cogroup(otherDataset, [numTasks]) • cartesian(otherDataset) • pipe(command, [envVars]) • coalesce(numPartitions) • repartition(numPartitions) • repartitionAndSortWithinPartitions(partitioner)
  • 64. Action Functions • reduce(func) • collect() • count() • first() • take(n) • takeSample(withReplacement, num, [seed]) • takeOrdered(n, [ordering]) • saveAsTextFile(path) • saveAsSequenceFile(path) (Java and Scala) • saveAsObjectFile(path) (Java and Scala) • countByKey() • foreach(func)
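A classic word count wires several of the functions listed above together (the input and output paths are assumptions):

val text = sc.textFile("hdfs://.../input.txt")
val counts = text
  .flatMap(_.split(" "))   // transformation
  .map(word => (word, 1))  // transformation
  .reduceByKey(_ + _)      // wide transformation (shuffle)

counts.take(10).foreach(println)            // action: 10 results to the driver
counts.saveAsTextFile("hdfs://.../counts")  // action: write to storage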
  • 65. RDD Creation • Read from data sources - HDFS, JSON files, text files - any kind of file • Transform other RDDs using parallel operations - transformations and actions • An RDD keeps information about how it was derived from other RDDs • An RDD has a set of partitions and a set of dependencies on parent RDDs • Narrow dependency if it derives from only one parent
  • 66. RDD Creation • Wide dependency if it has more than one parent (e.g. joining two parents) • A function to compute its partitions from its parents • Metadata about its partitioning scheme and data placement (preferred location to compute each partition) • Partitioner (defines the strategy for partitioning its partitions)
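These dependency types are visible on the RDD itself; a small sketch (the data is illustrative):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
val mapped = pairs.mapValues(_ + 1)  // narrow: one parent partition each
val grouped = pairs.groupByKey()     // wide: requires a shuffle

println(mapped.dependencies)   // e.g. List(org.apache.spark.OneToOneDependency@...)
println(grouped.dependencies)  // e.g. List(org.apache.spark.ShuffleDependency@...)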
  • 67. Shared Variables • When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task • These variables are copied to each machine • No updates to the variables on the remote machine are propagated back to the driver program • Spark does provide two limited types of shared variables for two common usage patterns – broadcast variables – accumulators
  • 68. Broadcast Variables • A broadcast variable is a read-only variable made available from the driver program that runs the SparkContext object to the nodes that execute the computation • Useful in applications that make the same data available to the worker nodes in an efficient manner, such as machine learning algorithms • Broadcast values are not shipped to the nodes more than once
  • 69. Broadcast Variables • To create a broadcast variable, call a method on SparkContext – val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e")) • Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost – For example, to give every node a copy of a large input dataset efficiently
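Workers read a broadcast variable through its value method; continuing the slide's example (the lookup data is illustrative):

val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))

// Every task on a node reads the same local copy via .value;
// the list is shipped to each worker at most once
val flags = sc.parallelize(Seq("a", "x", "c"))
  .map(item => (item, broadcastAList.value.contains(item)))

flags.collect()  // Array((a,true), (x,false), (c,true))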
  • 70. Accumulators • An accumulator is also a variable that is shipped to the worker nodes • A variable that can only be added to, through an associative operation • The addition must be an associative operation so that the global accumulated value can be correctly computed in parallel and returned to the driver program • Used to implement counters and sums efficiently in parallel
  • 71. Accumulators • Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend support to new types • Only the driver program can read an accumulator's value, not the tasks • Each worker node can only access and add to its own local accumulator value • Only the driver program can access the global value • Accumulators are accessed within Spark code using the value method
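A counter sketch using the classic sc.accumulator API of the Spark 1.x era this deck targets (newer versions use sc.longAccumulator; the input file is an assumption):

// Created on the driver; tasks may only add to it
val blankLines = sc.accumulator(0)

sc.textFile("input.txt").foreach { line =>  // hypothetical input file
  if (line.trim.isEmpty) blankLines += 1    // worker-side add
}

// Only the driver can read the global value
println(s"Blank lines: ${blankLines.value}")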
  • 73. RDD Partitions • An RDD is divided into a number of partitions, which are atomic pieces of information • Partitions of an RDD can be stored on different nodes of a cluster • RDD data is just a collection of partitions • Logical division of data • Concept derived from Hadoop MapReduce • All input, intermediate and output data is represented as partitions • Partitions are the basic unit of parallelism
  • 74. Page Rank - Controlled Partitions Performance [chart: per-iteration time (s) - Hadoop 170.75, basic Spark 72.03, Spark + controlled partitioning 23.01]
  • 75. Partitioning - Immutability • All partitions are immutable • Each RDD has 2 sets of parallel operations - transformations and actions • Every transformation generates a new partition • Partition immutability is driven by the underlying storage, like HDFS • Partition immutability allows for fault recovery
  • 76. Partitioning - Distribution • Partitions derived from HDFS are distributed by default • Partitions are also location-aware • Location awareness of partitions allows for data locality • For computed data, caching lets us distribute it in memory as well
  • 77. Accessing Partitions • A whole partition can be accessed together, rather than a single row at a time • Use the mapPartitions API of RDD • Allows partition-wise operations which cannot be done by accessing a single row
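A mapPartitions sketch - doing per-partition work that a per-row map() could not express as cheaply (the partition count and data are illustrative):

val nums = sc.parallelize(1 to 100, numSlices = 4)

val partialSums = nums.mapPartitions { iter =>
  // Runs once per partition, not once per element - also the right
  // place for expensive setup such as opening a database connection
  Iterator(iter.sum)
}
partialSums.collect()  // Array(325, 950, 1575, 2200)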
  • 78. Partitioning of Transformed Data • Partitioning is different for key/value pairs that are generated by a shuffle operation • Partitioning is driven by the partitioner specified • By default a HashPartitioner is used • You can use your own partitioner too
  • 79. Custom Partitioner • Partition the data according to your data structure • Custom partitioning allows control over the number of partitions and the distribution of data across partitions when grouping or reducing is done
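A minimal custom Partitioner sketch (the two-bucket routing scheme is an illustrative assumption):

import org.apache.spark.Partitioner

// Route 'premium' keys to partition 0, everything else to partition 1
class TierPartitioner extends Partitioner {
  def numPartitions: Int = 2
  def getPartition(key: Any): Int =
    if (key.toString.startsWith("premium")) 0 else 1
}

val orders = sc.parallelize(Seq(("premium-42", 100.0), ("basic-7", 5.0)))
val byTier = orders.partitionBy(new TierPartitioner)
byTier.glom().collect()  // inspect contents partition by partition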
  • 80. Lookup Operation • Partitioning allows faster lookups • The lookup operation looks up a given value by specifying its key • Using the partitioner, lookup determines which partition to look in • Then it only needs to look in that partition • If no partitioner is specified, it falls back to a filter
  • 81. Laziness – Parent Dependency • Each RDD has access to its parent RDD • The value of the parent for the first RDD is nil • Before computing its value, an RDD always computes its parent • This chain of evaluation allows for laziness
  • 82. Subclassing • Each Spark operator creates an instance of a specific subclass of RDD • The map operator results in a MappedRDD, flatMap in a FlatMappedRDD etc • The subclass allows an RDD to remember the operation that was performed in the transformation
  • 83. RDD Transformations • val dataRDD = sc.textFile(args(1)) • val splitRDD = dataRDD.flatMap(value => value.split(" ")) • Compute – a function for evaluation of each partition in an RDD – an abstract method of RDD – each subclass of RDD, like MappedRDD and FilteredRDD, has to override this method
  • 84. Compute Function • A function for evaluation of each partition in an RDD • An abstract method of RDD • Each subclass of RDD, like MappedRDD and FilteredRDD, has to override this method
  • 85. Lineage • Transformations used to build an RDD • RDDs are stored as chain of objects capturing the lineage of each RDD • val file = sc.textFile("hdfs://...") • val sics = file.filter(_.contains("SICS")) • val cachedSics = sics.cache() • val ones = cachedSics.map(_ => 1) • val count = ones.reduce(_+_)
  • 86. RDD Actions • val dataRDD = sc.textFile(args(1)) • val flatMapRDD = dataRDD.flatMap(value => value.split(" ")) • flatMapRDD.collect() • runJob API – an API of RDD for action implementation – allows taking each partition and evaluating it – used internally by all Spark actions
  • 87. Memory Management • If there is not enough space in memory for a newly computed RDD partition, a partition from the least recently used RDD is evicted • Spark provides three options for storage of persistent RDDs – in-memory storage as deserialized Java objects – in-memory storage as serialized Java objects – on-disk storage • When an RDD is persisted, each node stores any partitions of the RDD that it computes in memory - this allows future actions to be much faster
  • 88. Memory Management • Persist an RDD using the persist() or cache() methods • Storage levels – MEMORY_ONLY – MEMORY_AND_DISK – MEMORY_ONLY_SER – MEMORY_AND_DISK_SER – MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...
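Choosing a storage level explicitly - a short sketch (the input file is an assumption):

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("input.txt")  // hypothetical input file

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY);
// MEMORY_AND_DISK spills partitions that do not fit in memory to disk
val persisted = data.persist(StorageLevel.MEMORY_AND_DISK)

persisted.count()  // first action materializes and stores the partitions
persisted.first()  // later actions read from the persisted copy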
  • 89. Caching • cache() internally uses the persist API • persist sets a specific storage level for a given RDD • The Spark context tracks persistent RDDs • A partition is put into memory by the block manager
  • 90. Caching - Block Manager • Handles all in-memory data in Spark • Responsible for – cached data (BlockRDD) – shuffle data – broadcast data • A partition is stored in a block with id (RDD.id, partition_index)
  • 91. Working of Caching • The partition iterator checks the storage level • If a storage level is set, it calls cacheManager.getOrCompute(partition) • Since the iterator is run for each RDD evaluation, this is transparent to the user
  • 93. Extending Spark API • Extending the RDD API allows creating custom RDD structures • Custom RDDs allow control over computation • It is possible to change partitioning, locality and evaluation depending on requirements
  • 94. Extending Spark API • Custom operators on RDDs – domain-specific operators for specific RDDs – uses the Scala implicit mechanism – feels and works like a built-in operator • Custom RDDs – extend the RDD API to create a new RDD – combined with custom operators this makes the API powerful
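A sketch of that implicit mechanism - adding a hypothetical countErrors operator to any RDD[String]:

import org.apache.spark.rdd.RDD

object StringRDDOps {
  // The implicit class makes countErrors feel like a built-in operator
  implicit class RichStringRDD(rdd: RDD[String]) {
    def countErrors(): Long = rdd.filter(_.contains("ERROR")).count()
  }
}

import StringRDDOps._
val lines = sc.textFile("app.log")  // hypothetical input file
println(lines.countErrors())        // reads like a native RDD method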
  • 95. RDD Benefits • Data and intermediate results are stored in memory to speed up computation and located on suitable nodes for optimization • Transformation operations can be performed on an RDD many times • Lineage information about RDD transformations is kept for failure recovery - if a failure occurs while operating on a partition, the partition is recomputed
  • 96. RDD Benefits - Persistence • Default is in memory • Able to locate replicas on multiple nodes • If data does not fit in memory, it is spilled to disk • Better to make a checkpoint when a lineage is long or a wide dependency exists on the lineage - checkpointing is performed in the background
  • 97. RDD Benefits • Data locality works for narrow dependencies • Intermediate results in a wide dependency are dumped to disk like mapper output • Comparison to DSM (Distributed Shared Memory) – fault tolerance is hard to implement for DSM on commodity servers – RDDs are immutable, so taking a backup is easy – in DSM, tasks access the same memory locations and interfere with each other's updates
  • 98. Resources • https://github.com/aniket486/pig • https://github.com/twitter/pig/tree/spork • http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 • https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix • http://databricks.com/categories/spark/ • http://www.spark-stack.org/
  • 99. References 1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury. Technical Report UCB/EECS-2011-82. July 2011 2. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013, November 2013 3. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013 4. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013 5. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, NSDI 2011, March 2011 6. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications, Stanford University, Stanford, CA, February 2011 7. Spark: Cluster Computing with Working Sets, HotCloud 2010, Boston, MA, June 2010 8. http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf 9. https://github.com/apache/spark/tree/master/sql
  • 100. Thank You Check out my LinkedIn profile at https://in.linkedin.com/in/girishkhanzode
