1. What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be
fast and general purpose.
On the speed side, Spark extends the popular MapReduce
model to efficiently support more types of computations,
including interactive queries and stream processing.
On the generality side, Spark is designed to cover a wide
range of workloads that previously required separate
distributed systems, including batch applications, iterative
algorithms, interactive queries, and streaming.
Spark is designed to be highly accessible, offering simple APIs
in Python, Java, Scala, and SQL, and rich built-in libraries.
Spark itself is written in Scala, and runs on the Java Virtual
Machine (JVM).
3. The Spark stack
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and
more. Spark Core is also home to the API that defines resilient distributed datasets
(RDDs).
Spark SQL
Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL, HQL.
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data.
MLlib
Spark comes with a library containing common machine learning (ML) functionality,
called MLlib. MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting
functionality such as model evaluation and data import.
GraphX
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and
performing graph-parallel computations.
Cluster Managers
Under the hood, Spark is designed to efficiently scale up from one to many thousands of
compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety
of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster
manager included in Spark itself called the Standalone Scheduler.
4. Storage Layers for Spark
• Spark can create distributed datasets from any file
stored in the Hadoop distributed file system (HDFS) or
other storage systems supported by the Hadoop APIs
(including your local file system, Amazon S3,
Cassandra, Hive, HBase, etc.).
• It’s important to remember that Spark does not
require Hadoop; it simply has support for storage
systems implementing the Hadoop APIs.
• Spark supports text files, SequenceFiles, Avro,
Parquet, and any other Hadoop InputFormat.
5. Installing Spark
• Download and extract
Download a compressed TAR file, or tar ball.
You don’t need to have Hadoop, but if you have an existing Hadoop cluster or
HDFS installation, download the matching version.
Extract the tar.
Update the bashrc file
• Spark directory Contents.
README.md
Contains short instructions for getting started with Spark.
bin
Contains executable files that can be used to interact with Spark in various ways like
shell.
core, streaming, python, …
Contains the source code of major components of the Spark project.
examples
Contains some helpful Spark standalone jobs that you can look at and run to learn about
the Spark API.
6. Core Spark Concepts
• Every Spark application consists of a driver program
that launches various parallel operations on a cluster.
The driver program contains your application’s main
function and defines distributed datasets on the cluster,
then applies operations to them.
Driver programs access Spark through a
SparkContext object, which represents a
connection to a computing cluster.
Once you have a SparkContext, you can
use it to build RDDs.
To run operations on RDD, driver
programs typically manage a number of
nodes called executors.
7. Spark’s Python and Scala Shells
• Spark comes with interactive shells that enable ad hoc data
analysis.
• Unlike most other shells, however, which let you manipulate
data using the disk and memory on a single machine, Spark’s
shells allow you to interact with data that is distributed on
disk or in memory across many machines, and Spark takes
care of automatically distributing this processing.
• bin/pyspark and bin/spark-shell to open the respective shell.
• When the shell starts, you will notice a lot of log messages. The
verbosity can be reduced by lowering log4j.rootCategory from INFO to WARN in
conf/log4j.properties.
8. Standalone Applications
• The main difference from using it in the shell is that you need to
initialize your own SparkContext. After that, the API is the same.
• Initializing Spark in Scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("WordCount")
val sc = new SparkContext(conf)
• Once we have our build defined, we can easily package and run our
application using the bin/spark-submit script.
• The spark-submit script sets up a number of environment variables
used by Spark.
Maven build and run (mvn clean && mvn compile && mvn package)
$SPARK_HOME/bin/spark-submit \
--class com.oreilly.learningsparkexamples.mini.java.WordCount \
./target/learning-spark-mini-example-0.0.1.jar \
./README.md ./wordcounts
9. Word count Scala application
• // Create a Scala Spark Context.
• val conf = new SparkConf().setAppName("wordCount")
• val sc = new SparkContext(conf)
• // Load our input data.
• val input = sc.textFile(inputFile)
• // Split it up into words.
• val words = input.flatMap(line => line.split(" "))
• // Transform into pairs and count.
• val counts = words.map(word => (word, 1)).reduceByKey{case (x, y)
=> x + y}
• // Save the word count back out to a text file, causing evaluation.
• counts.saveAsTextFile(outputFile)
10. Resilient Distributed Dataset
• An RDD is simply an immutable distributed collection of
elements.
• Each RDD is split into multiple partitions, which may be
computed on different nodes of the cluster.
• Once created, RDDs offer two types of operations:
transformations and actions.
• Transformations construct a new RDD from a previous one.
• Actions, on the other hand, compute a result based on an
RDD, and either return it to the driver program or save it to
an external storage system.
• In Spark all work is expressed as either creating new RDDs,
transforming existing RDDs, or calling operations on RDDs
to compute a result.
11. Create RDDs
• Spark provides two ways to create RDDs:
loading an external dataset
val lines = sc.textFile("/path/to/README.md")
parallelizing a collection in your driver program.
The simplest way to create RDDs is to take an existing
collection in your program and pass it to SparkContext’s
parallelize() method.
Beyond prototyping and testing, this is not widely used
since it requires that you have your entire dataset in
memory on one machine.
val lines = sc.parallelize(List(1,2,3))
12. Lazy Evaluation
• Lazy evaluation means that when we call a transformation
on an RDD (for instance, calling map()), the operation is not
immediately performed. Instead, Spark internally records
metadata to indicate that this operation has been
requested.
• Rather than thinking of an RDD as containing specific data,
it is best to think of each RDD as consisting of instructions
on how to compute the data that we build up through
transformations. Loading data into an RDD is lazily
evaluated in the same way transformations are. So, when
we call sc.textFile(), the data is not loaded until it is
necessary.
• Spark uses lazy evaluation to reduce the number of passes
it has to take over our data by grouping operations
together.
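For example, a minimal sketch of lazy evaluation (assuming a log.txt file is available, as in the next slide): nothing is read or filtered until the action runs.
val lines = sc.textFile("log.txt")                          // nothing is read yet
val errors = lines.filter(line => line.contains("error"))   // still nothing is computed
println(errors.count())                                     // the action triggers loading and filtering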
13. RDD Operations(Transformations)
• Transformations are operations on RDDs that return a new RDD, such as
map() and filter().
• Transformed RDDs are computed lazily, only when you use them in an
action.
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
• Transformations operation does not mutate the existing inputRDD. Instead,
it returns a pointer to an entirely new RDD.
• InputRDD can still be reused later in the program
• Transformations can actually operate on any number of input RDDs. Like
Union to merge the RDDs.
• As you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage
graph. It uses this information to compute each RDD on demand and to
recover lost data if part of a persistent RDD is lost.
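A small sketch of how a lineage graph builds up (continuing the inputRDD/errorsRDD example above; warningsRDD and badLinesRDD are illustrative names):
val warningsRDD = inputRDD.filter(line => line.contains("warning"))
val badLinesRDD = errorsRDD.union(warningsRDD)   // badLinesRDD depends on both parent RDDs
If a partition of badLinesRDD is lost, Spark can recompute it from this lineage.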
14. Common Transformations
• Element-wise transformations
MAP
The map() transformation takes in a function and applies it to each
element in the RDD with the result of the function being the new
value of each element in the resulting RDD.
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
Filter
The filter() transformation takes in a function and returns an RDD
that only has elements that pass the filter() function.
FLATMAP
Sometimes we want to produce multiple output elements for each
input element. The operation to do this is called flatMap().
val lines = sc.parallelize(List("hello world", "hi"))
val words = lines.flatMap(line => line.split(" "))
words.first() // returns "hello"
16. Examples(Transformation)
• Create RDD / collect(Action)
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq)
rdd.collect()
• Filter
val filteredrdd = rdd.filter(x=>x>2)
• Distinct
val rdddist = rdd.distinct()
rdddist.collect()
• Map (square a number)
val rddsquare = rdd.map(x=>x*x);
• Flatmap
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"),
("Hadoop Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" "))
• Sample an RDD
rdd.sample(false,0.5).collect()
17. Examples(Transformation) Cont…
• Create RDD
val x = Seq(1,2,3)
val y = Seq(3,4,5)
val rdd1 = sc.parallelize(x)
val rdd2 = sc.parallelize(y)
• Union
rdd1.union(rdd2).collect()
• Intersection
rdd1.intersection(rdd2).collect()
• subtract
rdd1.subtract(rdd2).collect()
• Cartesian
rdd1.cartesian(rdd2).collect()
18. RDD Operations(Actions)
• Actions are operations that return a result to the driver
program or write it to storage, and kick off a
computation, such as count() and first().
• Actions force the evaluation of the transformations
required for the RDD they were called on, since they
need to actually produce output.
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
• It is important to note that each time we call a new
action, the entire RDD must be computed “from
scratch.”
20. Examples(Action)
• Reduce
The most common action on basic RDDs you will likely use is reduce(), which
takes a function that operates on two elements of the type in your RDD and
returns a new element of the same type. A simple example of such a function
is +, which we can use to sum our RDD. With reduce(), we can easily sum the
elements of our RDD, count the number of elements, and perform other types
of aggregations.
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq)
val sum = rdd.reduce((x,y)=> x+y)
• Fold
Similar to reduce() is fold(), which also takes a function with the same
signature as needed for reduce(), but in addition takes a “zero value” to be
used for the initial call on each partition. The zero value you provide should be
the identity element for your operation;
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq,2)
val rdd2 = rdd.fold(1)((x,y)=>x+y);
22. Examples(Action) cont.…
• takeOrdered(num)(ordering)
Reverse Order (Highest number)
val seq = Seq(3,9,2,3,5,4)
val rdd = sc.parallelize(seq,2)
rdd.takeOrdered(1)(Ordering[Int].reverse)
Custom Order (Highest based on age)
case class Person(name:String,age:Int)
val rdd = sc.parallelize(Array(("x",10),("y",14),("z",12)))
val rdd2 = rdd.map(x=>Person(x._1,x._2))
rdd2.takeOrdered(1)(Ordering[Int].reverse.on(x=>x.age))
(highest/lowest repeated word in word count program)
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop
Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x,1))
val rdd3 = rdd2.reduceByKey((x,y) => (x+y))
rdd3.takeOrdered(3)(Ordering[Int].on(x=>x._2)) //Lowest value
rdd3.takeOrdered(3)(Ordering[Int].reverse.on(x=>x._2)) //Highest value
23. Persist(Cache) RDD
• Spark’s RDDs are by default recomputed each time you run an action on them. If
you would like to reuse an RDD in multiple actions, you can ask Spark to persist it
using RDD.persist().
• After computing it the first time, Spark will store the RDD contents in memory
(partitioned across the machines in your cluster), and reuse them in future actions.
Persisting RDDs on disk instead of memory is also possible.
• If a node that has data persisted on it fails, Spark will recompute the lost partitions
of the data when needed.
• We can also replicate our data on multiple nodes if we want to be able to handle
node failure without slowdown.
• If you attempt to cache too much data to fit in memory, Spark will automatically
evict old partitions using a Least Recently Used (LRU) cache policy. Caching
unnecessary data can lead to eviction of useful data and more recomputation
time.
• RDDs come with a method called unpersist() that lets you manually remove them
from the cache.
• cache() is the same as calling persist() with the default storage level.
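A small sketch of persisting an RDD that is reused by two actions (MEMORY_ONLY is the default storage level):
import org.apache.spark.storage.StorageLevel
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_ONLY)   // marks the RDD for caching; nothing is computed yet
println(result.count())                    // first action computes the RDD and caches its partitions
println(result.collect().mkString(","))    // reuses the cached partitions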
24. Persistence levels
Note: persist()/cache() only take effect when the RDD is next computed, that is, when an
action runs on the RDD (or on an RDD derived from it) after the persist call; marking an
RDD as persisted does not trigger any computation by itself.
25. Spark Summary
• To summarize, every Spark program and shell
session will work as follows:
Create some input RDDs from external data.
Transform them to define new RDDs using
transformations like filter().
Ask Spark to persist() any intermediate RDDs that will
need to be reused. cache() is the same as calling
persist() with the default storage level.
Launch actions such as count() and first() to kick off a
parallel computation, which is then optimized and
executed by Spark.
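Putting the four steps together in one short sketch (assuming a local README.md, as used earlier):
val lines = sc.textFile("README.md")                           // 1. create an input RDD
val sparkLines = lines.filter(line => line.contains("Spark"))  // 2. transform it
sparkLines.persist()                                           // 3. persist intermediate results for reuse
println(sparkLines.count())                                    // 4. actions kick off the parallel computation
println(sparkLines.first())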
26. Pair RDD (Key/Value Pair)
• Spark provides special operations on RDDs containing
key/value pairs, called pair RDDs.
• Key/value RDDs are commonly used to perform
aggregations, and often we will do some initial ETL to get
our data into a key/value format.
• Key/value RDDs expose new operations (e.g., counting up
reviews for each product, grouping together data with the
same key, and grouping together two different RDDs)
• Creating Pair RDDs
Some loading formats directly return the Pair RDDs.
Using map function
Scala
val pairs = lines.map(x => (x.split(" ")(0), x))
27. Transformations on Pair RDDs
• Pair RDDs are allowed to use all the transformations
available to standard RDDs.
• Aggregations
When datasets are described in terms of key/value
pairs, it is common to want to aggregate statistics
across all elements with the same key. We have
looked at the fold(), aggregate(), and reduce() actions
on basic RDDs, and similar per-key transformations
exist on pair RDDs. Spark has a similar set of
operations that combines values that have the same
key. These operations return RDDs and thus are called
transformations rather than actions.
28. Transformations...cont…
• reduceByKey()
is quite similar to reduce(); both take a function and use it to combine values.
reduceByKey() runs several parallel reduce operations, one for each key in the
dataset, where each operation combines values that have the same key.
Because datasets can have very large numbers of keys, reduceByKey() is not
implemented as an action that returns a value to the user program. Instead, it
returns a new RDD consisting of each key and the reduced value for that key.
• Example Word count in Scala
val input = sc.textFile("s3://...")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((acc, value) => acc + value)
• We can actually implement word count even faster by using the
countByValue() function on the first RDD: input.flatMap(x =>x.split("
")).countByValue().
30. Transformations...cont…
• foldByKey()
is quite similar to fold(); both use a zero value of
the same type of the data in our RDD and
combination function. As with fold(), the provided
zero value for foldByKey() should have no impact
when added with your combination function to
another element.
Those familiar with the combiner concept from MapReduce should note that
calling reduceByKey() and foldByKey() will automatically perform combining locally
on each machine before computing global totals for each key. The user does not
need to specify a combiner. The more general combineByKey() interface allows you
to customize combining behavior.
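A small foldByKey() sketch (the pairs RDD is illustrative), summing the values per key with 0 as the zero value:
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.foldByKey(0)((x, y) => x + y)
sums.collect()   // e.g. Array((a,4), (b,2))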
31. Transformations...cont…
• combineByKey()
is the most general of the per-key aggregation functions. Most of the other
per-key combiners are implemented using it. Like aggregate(), combineByKey()
allows the user to return values that are not the same type as our input
data.
As combineByKey() goes through the elements in a partition, each element
either has a key it hasn’t seen before or has the same key as a previous
element.
If it’s a new element, combineByKey() uses a function we provide, called
createCombiner(), to create the initial value for the accumulator on that key. It’s
important to note that this happens the first time a key is found in each partition, rather
than only the first time the key is found in the RDD.
If it is a value we have seen before while processing that partition, it will instead use the
provided function, mergeValue(), with the current value for the accumulator for that key
and the new value.
Since each partition is processed independently, we can have multiple
accumulators for the same key. When we are merging the results from each
partition, if two or more partitions have an accumulator for the same key we
merge the accumulators using the user-supplied mergeCombiners() function.
32. Transformations...cont…
• Per-key average using combineByKey() in Scala
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
33. Transformations...cont…
• GroupByKey()
With keyed data a common use case is grouping our data by key—for
example, viewing all of a customer’s orders together.
If our data is already keyed in the way we want, groupByKey() will
group our data using the key in our RDD. On an RDD consisting of keys
of type K and values of type V, we get back an RDD of type [K,
Iterable[V]].
If you find yourself writing code where you groupByKey() and then use
a reduce() or fold() on the values, you can probably achieve the same
result more efficiently by using one of the per-key aggregation
functions. Rather than reducing the RDD to an in-memory value, we
reduce the data per key and get back an RDD with the reduced values
corresponding to each key. For example, rdd.reduceByKey(func)
produces the same RDD as rdd.groupByKey().mapValues(value =>
value.reduce(func)) but is more efficient as it avoids the step of
creating a list of values for each key.
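A small groupByKey() sketch (the orders RDD is illustrative):
val orders = sc.parallelize(List(("alice", "book"), ("bob", "pen"), ("alice", "lamp")))
val grouped = orders.groupByKey()        // RDD[(String, Iterable[String])]
grouped.mapValues(_.toList).collect()    // e.g. Array((alice,List(book, lamp)), (bob,List(pen)))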
34. Transformations...cont…
• cogroup
In addition to grouping data from a single RDD, we can group
data sharing the same key from multiple RDDs using a function
called cogroup(). cogroup() over two RDDs sharing the same key
type, K, with the respective value types V and W gives us back
RDD[(K, (Iterable[V], Iterable[W]))]. If one of the RDDs doesn’t
have elements for a given key that is present in the other RDD,
the corresponding Iterable is simply empty.
cogroup() gives us the power to group data from multiple RDDs.
cogroup() is used as a building block for the joins. However
cogroup() can be used for much more than just implementing
joins. We can also use it to implement intersect by key.
Additionally, cogroup() can work on three or more RDDs at
once.
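A small cogroup() sketch over two illustrative pair RDDs with String keys:
val rddA = sc.parallelize(List(("a", 1), ("b", 2)))
val rddB = sc.parallelize(List(("a", "x"), ("c", "y")))
rddA.cogroup(rddB).collect()
// one grouped entry per key, e.g. (a,(Iterable(1),Iterable(x))), (b,(Iterable(2),Iterable())), (c,(Iterable(),Iterable(y)))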
35. Transformations...cont…
• Joins
Joining data together is probably one of the most common operations on a
pair RDD, and we have a full range of options including right and left outer
joins, cross joins, and inner joins.
• Scala shell inner join
storeAddress = {(Store("Ritual"), "1026 Valencia St"), (Store("Philz"), "748 Van
Ness Ave"), (Store("Philz"), "3101 24th St"), (Store("Starbucks"), "Seattle")}
storeRating = {(Store("Ritual"), 4.9), (Store("Philz"), 4.8)}
storeAddress.join(storeRating) == {(Store("Ritual"), ("1026 Valencia St", 4.9)),
(Store("Philz"), ("748 Van Ness Ave", 4.8)), (Store("Philz"), ("3101 24th St",
4.8))}
• leftOuterJoin() and rightOuterJoin()
storeAddress.leftOuterJoin(storeRating) == {(Store("Ritual"),("1026 Valencia
St",Some(4.9))), (Store("Starbucks"),("Seattle",None)), (Store("Philz"),("748
Van Ness Ave",Some(4.8))), (Store("Philz"),("3101 24th St",Some(4.8)))}
storeAddress.rightOuterJoin(storeRating) == {(Store("Ritual"),(Some("1026
Valencia St"),4.9)), (Store("Philz"),(Some("748 Van Ness Ave"),4.8)),
(Store("Philz"), (Some("3101 24th St"),4.8))}
38. Tuning the level of parallelism
• When performing aggregations or grouping operations, we can ask Spark
to use a specific number of partitions. Spark will always try to infer a
sensible default value based on the size of your cluster, but in some cases
you will want to tune the level of parallelism for better performance.
val data = Seq(("a", 3), ("b", 4), ("a", 1))
sc.parallelize(data).reduceByKey((x, y) => x + y) // Default parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y,10) // Custom parallelism
• Repartitioning your data is a fairly expensive operation. Spark also has an
optimized version of repartition() function called coalesce() that allows
avoiding data movement, but only if you are decreasing the number of
RDD partitions.
• To know whether you can safely call coalesce(), you can check the size of
the RDD using rdd.partitions.size or rdd.getNumPartitions.
• To see each partition data of a RDD,
scala> rdd.mapPartitionsWithIndex( (index: Int, it: Iterator[(Int,Int)]) =>
it.toList.map(x => index + ":" + x).iterator).collect
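A small sketch of changing the partition count (illustrative values):
val nums = sc.parallelize(1 to 100, 10)
println(nums.getNumPartitions)   // 10
val fewer = nums.coalesce(2)     // decreasing partitions; avoids a full shuffle
println(fewer.getNumPartitions)  // 2
val more = nums.repartition(20)  // repartition() shuffles data to any partition count
println(more.getNumPartitions)   // 20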
39. Sorting Data
• Having sorted data is quite useful in many cases, especially when
you’re producing downstream output. We can sort an RDD with
key/value pairs provided that there is an ordering defined on the
key. Once we have sorted our data, any subsequent call on the
sorted data to collect() or save() will result in ordered data.
• Since we often want our RDDs in the reverse order, the sortByKey()
function takes a boolean parameter, ascending, indicating whether we
want the keys in ascending order (it defaults to true).
• Sometimes we want a different sort order entirely, and to support
this we can provide our own comparison function.
Example: ordering the word-count pairs below by their counts (a sketch of sorting integer keys as if they were strings follows after this example)
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"),
("Hadoop Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x,1))
val rdd3 = rdd2.reduceByKey((x,y) => (x+y))
rdd3.takeOrdered(3)(Ordering[Int].on(x=>x._2))
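A sketch of sorting integer keys as if they were strings, using sortByKey() with a custom implicit Ordering (the counts RDD is illustrative):
val counts = sc.parallelize(List((3, "c"), (20, "t"), (100, "h")))
implicit val sortIntegersByString: Ordering[Int] = new Ordering[Int] {
  override def compare(a: Int, b: Int): Int = a.toString.compareTo(b.toString)
}
counts.sortByKey().collect()   // keys ordered as strings: 100, 20, 3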
40. Actions available on Pair RDDs
• As with the transformations, all of the traditional actions available on the
base RDD are also available on pair RDDs. Some additional actions are
available on pair RDDs to take advantage of the key/value nature of the
data.
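For example, a few pair RDD actions (the pairs RDD is illustrative):
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.countByKey()     // Map(a -> 2, b -> 1): number of elements per key
pairs.collectAsMap()   // the RDD as a Map (one value kept per key)
pairs.lookup("a")      // Seq(1, 3): all values for key "a"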
41. Accumulators
• Accumulators, a type of shared variable, provide a simple syntax for aggregating values
from worker nodes back to the driver program.
• One of the most common uses of accumulators is to count events that occur
during job execution for debugging purposes.
val sc = new SparkContext(...)
val file = sc.textFile("file.txt")
val blankLines = sc.accumulator(0) // Create Accumulator[Int] initialized to 0
val callSigns = file.flatMap(line => {
if (line == "") {
blankLines += 1 // Add to the accumulator
}
line.split(" ")})
callSigns.saveAsTextFile("output.txt")
println("Blank lines: " + blankLines.value)
Note that we will see the right count only after we run the saveAsTextFile() action, because the
transformation above it, flatMap(), is lazy, so the side effect of incrementing the accumulator will
happen only when the flatMap() transformation is forced to occur by the saveAsTextFile() action.
Also note that tasks on worker nodes cannot access the accumulator’s value()—from the point of view
of these tasks, accumulators are write-only variables. This allows accumulators to be implemented
efficiently, without having to communicate every update.
42. Accumulators cont…
• Accumulators and Fault Tolerance
Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For
example, if the node running a partition of a map() operation crashes, Spark will rerun it on
another node; and even if the node does not crash but is simply much slower than other
nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and
take its result if that finishes. Even if no nodes fail, Spark may have to rerun a task to rebuild a
cached value that falls out of memory. The net result is therefore that the same function may
run multiple times on the same data depending on what happens on the cluster.
The end result is that for accumulators used in actions, Spark applies each task’s update to
each accumulator only once. Thus, if we want a reliable absolute value counter, regardless of
failures or multiple evaluations, we must put it inside an action like foreach().
• Custom Accumulators
Spark supports accumulators of type Int, Double, Long, and Float.
Spark also includes an API to define custom accumulator types and custom aggregation
operations.
Custom accumulators need to extend AccumulatorParam.
we can use any operation for add, provided that operation is commutative and associative.
An operation op is commutative if a op b = b op a for all values a, b.
An operation op is associative if (a op b) op c = a op (b op c) for all values a, b, and c.
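A sketch of a custom accumulator using the classic AccumulatorParam API shown above (StringSetParam and seenWords are illustrative names); set union is commutative and associative, so it is a valid add operation:
import org.apache.spark.AccumulatorParam
object StringSetParam extends AccumulatorParam[Set[String]] {
  def zero(initialValue: Set[String]): Set[String] = Set.empty[String]
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}
val seenWords = sc.accumulator(Set.empty[String])(StringSetParam)
sc.parallelize(List("hello world", "hi")).foreach { line =>
  seenWords += line.split(" ").toSet   // updated inside an action, so each task's update is applied once
}
println(seenWords.value)               // read back on the driver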
43. Broadcast Variables
• Broadcast variables, another type of shared variable, allow the program to efficiently send a large,
read-only value to all the worker nodes for use in one or more Spark operations.
Without a broadcast, it is expensive to ship a large value (such as the Array below) from the driver
with every task, and if we used the same object again later (for example, running the same code for
other files), it would be sent to each node once more.
val c = sc.broadcast(Array(100,200))
val a = sc.parallelize(List("This is Krishna", "Sureka learning the Spark"))
val d = a.flatMap(x=>x.split(" ")).map(x=>x+c.value(1))
• Optimizing Broadcasts
When we are broadcasting large values, it is important to choose a data serialization format
that is both fast and compact.
44. Working on a Per-Partition Basis
Working with data on a per-partition basis allows us to avoid redoing
setup work for each data item. Operations like opening a database
connection or creating a random number generator are examples of
setup steps that we wish to avoid doing for each element.
Spark has per-partition versions of map and foreach to help reduce the
cost of these operations by letting you run code only once for each
partition of an RDD.
Example-1 (mapPartitions)
val a = sc.parallelize(List(1,2,3,4,5,6),2)
scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.toList.map(y=>y+1).iterator})
scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.map(y=>y+1)})
scala> b.collect
Hello
Hello
res20: Array[Int] = Array(2, 3, 4, 5, 6, 7)
scala> a.mapPartitions(x=>x.filter(y => y > 1)).collect
res35: Array[Int] = Array(2, 3, 4, 5, 6)
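A sketch of the setup-per-partition idea described above: the helper object is built once per partition instead of once per element (ExpensiveParser is a hypothetical stand-in for e.g. a database connection):
class ExpensiveParser {
  def parse(s: String): Int = s.trim.length   // stand-in for real parsing work
}
val rawLines = sc.parallelize(List(" a ", " bb ", "ccc "), 2)
val lengths = rawLines.mapPartitions { iter =>
  val parser = new ExpensiveParser            // created once per partition, not per line
  iter.map(parser.parse)
}
lengths.collect()   // Array(1, 2, 3)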
45. Working on a Per-Partition Basis
Example-2 (mapPartitionWithIndex)
scala> val b = a.mapPartitionsWithIndex((index: Int, x: Iterator[Int]) => {println("Hello from " + index); if (index == 1) {x.map(y => y + 1)} else {x.map(y => y + 200)}})
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at mapPartitionsWithIndex at <console>:29
scala> val b = a.mapPartitionsWithIndex((index, x) => {println("Hello from " + index); if (index == 1) {x.map(y => y + 1)} else {x.map(y => y + 200)}})
scala> b.collect
Hello from 0
Hello from 1
res44: Array[Int] = Array(201, 202, 203, 5, 6, 7)