Top 5 mistakes when writing Spark applications

Top 5 mistakes when
writing Spark
applications
tiny.cloudera.com/spark-mistakes
Mark Grover | Software Engineer, Cloudera | @mark_grover
Ted Malaska | Technical Group Architect, Blizzard| @TedMalaska

2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook

3
Mistakes people make
when using Spark

4
Mistakes people we’ve made
when using Spark

5
Mistakes people make
when using Spark

7
# Executors, cores, memory !?!
• 6 Nodes
• 16 cores each
• 64 GB of RAM each

8
Decisions, decisions, decisions
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM

10
Answer #1 – Most granular
• Have smallest sized executors
possible
• 1 core each
• 64GB/node / 16 executors/node
= 4 GB/executor
• Total of 16 cores x 6 nodes
= 96 cores => 96 executors
Worker node
Executor 16
Executor 4
Executor 3
Executor 2
Executor 1

11
Answer #1 – Most granular
• Have smallest sized executors
possible
• 1 core each
• 64GB/node / 16 executors/node
= 4 GB/executor
• Total of 16 cores x 6 nodes
= 96 cores => 96 executors
Worker node
Executor 16
Executor 4
Executor 3
Executor 2
Executor 1

12
Why?
• Not using benefits of running multiple tasks in same executor

13
Answer #2 – Least granular
• 6 executors in total
=>1 executor per node
• 64 GB memory each
• 16 cores each
Worker node
Executor 1

14
Answer #2 – Least granular
=>1 executor per node
• 16 cores each
Worker node
Executor 1

15
Why?
• Need to leave some memory overhead for OS/Hadoop daemons

16
Answer #3 – with overhead
• 6 executors – 1 executor/node
• 15 cores each
Worker node
Executor 1
Overhead(1G,1 core)

17
Answer #3 – with overhead
• 6 executors – 1 executor/node
• 15 cores each
Worker node
Executor 1
Overhead(1G,1 core)

18
Let’s assume…
• You are running Spark on YARN, from here on…

19
3 things
• 3 other things to keep in mind

20
#1 – Memory overhead
• --executor-memory controls the heap size
• Need some overhead (controlled by
spark.yarn.executor.memory.overhead) for off heap memory
• Default is max(384MB, .07 * spark.executor.memory)

21
#2 - YARN AM needs a core: Client mode

22
#2 YARN AM needs a core: Cluster mode

23
#3 HDFS Throughput
• 15 cores per executor can lead to bad HDFS I/O throughput.
• Best is to keep under 5 cores per executor

24
Calculations
• 5 cores per executor
– For max HDFS throughput
• Cluster has 6 * 15 = 90 cores in total
after taking out Hadoop/Yarn daemon cores)
• 90 cores / 5 cores/executor
= 18 executors
• Each node has 3 executors
• 63 GB/3 = 21 GB, 21 x (1-0.07)
~ 19 GB
• 1 executor for AM => 17 executors
Overhead
Worker node
Executor 3
Executor 2
Executor 1

25
Correct answer
• 19 GB memory/executor
• 5 cores/executor
* Not etched in stone
Overhead
Worker node
Executor 3
Executor 2
Executor 1

26
Dynamic allocation helps with though, right?
• Dynamic allocation allows Spark to dynamically scale the cluster
resources allocated to your application based on the workload.
• Works with Spark-On-Yarn

27
Decisions with Dynamic Allocation
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM

28
Read more
• From a great blog post on this topic by Sandy Ryza:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-
part-2/

30
Application failure
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage
6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size
exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123) at
org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132) at
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
at
org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618) at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)

31
Why?
• No Spark shuffle block can be greater than 2 GB

32
Ok, what’s a shuffle block again?
• In MapReduce terminology, a file written from one Mapper for a Reducer
• The Reducer makes a local copy of this file (reducer local copy) and then
‘reduces’ it

33
Defining shuffle and partition
Each yellow arrow
in this diagram
represents a shuffle
block.
Each blue block is a
partition.

34
Once again
• Overflow exception if shuffle block size > 2 GB

35
What’s going on here?
• Spark uses ByteBuffer as abstraction for blocks
val buf = ByteBuffer.allocate(length.toInt)
• ByteBuffer is limited by Integer.MAX_SIZE (2 GB)!

36
Spark SQL
• Especially problematic for Spark SQL
• Default number of partitions to use when doing shuffles is 200
– This low number of partitions leads to high shuffle block size

37
Umm, ok, so what can I do?
1. Increase the number of partitions
– Thereby, reducing the average partition size
2. Get rid of skew in your data
– More on that later

38
Umm, how exactly?
• In Spark SQL, increase the value of
spark.sql.shuffle.partitions
• In regular Spark applications, use rdd.repartition() or
rdd.coalesce()(latter to reduce #partitions, if needed)

39
But, how many partitions should I have?
• Rule of thumb is around 128 MB per partition

40
But! There’s more!
• Spark uses a different data structure for bookkeeping during shuffles, when
the number of partitions is less than 2000, vs. more than 2000.

41
Don’t believe me?
• In MapStatus.scala
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]):
MapStatus = {
if (uncompressedSizes.length > 2000) {
HighlyCompressedMapStatus(loc, uncompressedSizes)
} else {
new CompressedMapStatus(loc, uncompressedSizes)
}
}

42
Ok, so what are you saying?
If number of partitions < 2000, but not by much, bump it to be slightly higher
than 2000.

43
Can you summarize, please?
• Don’t have too big partitions
–Your job will fail due to 2 GB limit
• Don’t have too few partitions
–Your job will be slow, not making using of parallelism
• Rule of thumb: ~128 MB per partition
• If #partitions < 2000, but close, bump to just > 2000
• Track SPARK-6235 for removing various 2 GB limits

45
Slow jobs on Join/Shuffle
• Your dataset takes 20 seconds to run over with a map job, but take 4 hours
when joined or shuffled. What wrong?

46
Mistake - Skew
Single Thread
Single Thread
Single Thread
Single Thread
Single Thread
Single Thread
Single Thread
Normal
Distributed
The Holy Grail of Distributed Systems

47
Mistake - Skew
Single ThreadNormal
Distributed
What about Skew, because that is a thing

48
• Salting
• Isolated Salting
• Isolated Map Joins
Mistake – Skew : Answers

49
• Normal Key: “Foo”
• Salted Key: “Foo” + random.nextInt(saltFactor)
Mistake – Skew : Salting

53
• Two Stage Aggregation
– Stage one to do operations on the salted keys
– Stage two to do operation access unsalted key results
Mistake – Skew : Salting
Data Source Map
Convert to
Salted Key & Value
Tuple
Reduce
By Salted Key
Map Convert results
to
Key & Value
Tuple
Reduce
By Key
Results

54
• Second Stage only required for Isolated Keys
Mistake – Skew : Isolated Salting
Data Source Map
Convert to
Key & Value
Isolate Key and
convert to
Salted Key & Value
Tuple
Reduce
By Key & Salted
Key
Filter Isolated
Keys
From Salted
Keys
Map Convert
results to
Key & Value
Tuple
Reduce
By Key
Union to Results

55
• Filter Out Isolated Keys and use Map Join/Aggregate on
those
• And normal reduce on the rest of the data
• This can remove a large amount of data being shuffled
Mistake – Skew : Isolated Map Join
Data Source Filter Normal
Keys
From Isolated
Keys
Reduce
By Normal Key
Union to Results
Map Join
For Isolated
Keys

56
Managing Parallelism
Cartesian Join
Map Task
Shuffle Tmp 1
Shuffle Tmp 2
Shuffle Tmp 3
Shuffle Tmp 4
Map Task
Shuffle Tmp 1
Shuffle Tmp 2
Shuffle Tmp 3
Shuffle Tmp 4
Map Task
Shuffle Tmp 1
Shuffle Tmp 2
Shuffle Tmp 3
Shuffle Tmp 4
ReduceTask
ReduceTask
ReduceTask
ReduceTask
Amount
of Data
Amount of Data
10x
100x
1000x
10000x
100000x
1000000x
Or more

57
Table YTable X
• How To fight Cartesian Join
– Nested Structures
A, 1
A, 2
A, 3
A, 4
A, 5
A, 6
Table X
A, 1, 4
A, 2, 4
A, 3, 4
A, 1, 5
A, 2, 5
A, 3, 5
A, 1, 6
A, 2, 6
A, 3, 6
JOIN OR
Table X
A
A, 1
A, 2
A, 3
A, 4
A, 5
A, 6

58
• How To fight Cartesian Join
– Nested Structures
create table nestedTable (
col1 string,
col2 string,
col3 array< struct<
col3_1: string,
col3_2: string>>
val rddNested = sc.parallelize(Array(
Row("a1", "b1", Seq(Row("c1_1", "c2_1"),
Row("c1_2", "c2_2"),
Row("c1_3", "c2_3"))),
Row("a2", "b2", Seq(Row("c1_2", "c2_2"),
Row("c1_3", "c2_3"),
Row("c1_4", "c2_4")))), 2)
=

60
Out of luck?
• Do you every run out of memory?
• Do you every have more then 20 stages?
• Is your driver doing a lot of work?

61
Mistake – DAG Management
• Shuffles are to be avoided
• ReduceByKey over GroupByKey
• TreeReduce over Reduce
• Use Complex/Nested Types

62
Mistake – DAG Management: Shuffles
• Map Side reduction, where possible
• Think about partitioning/bucketing ahead of time
• Do as much as possible with a single shuffle
• Only send what you have to send
• Avoid Skew and Cartesians

63
ReduceByKey over GroupByKey
• ReduceByKey can do almost anything that GroupByKey
can do
• Aggregations
• Windowing
• Use memory
• But you have more control
• ReduceByKey has a fixed limit of Memory requirements
• GroupByKey is unbound and dependent on data

64
TreeReduce over Reduce
• TreeReduce & Reduce return some result to driver
• TreeReduce does more work on the executors
• While Reduce bring everything back to the driver
Partition
Partition
Partition
Partition
Driver
100%
Partition
Partition
Partition
Partition
Driver
4
25%
25%
25%
25%

65
Complex Types
• Top N List
• Multiple types of Aggregations
• Windowing operations
• All in one pass

66
Complex Types
• Think outside of the box use objects to reduce by
• (Make something simple)

68
Ever seen this?
Exception in thread "main" java.lang.NoSuchMethodError:
com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
at org.apache.spark.util.collection.OpenHashSet.org
$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
at
org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
at
org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
at
org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at
org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
at…....

69
But!
• I already included protobuf in my app’s maven
dependencies?

70
Ah!
• My protobuf version doesn’t match with
Spark’s protobuf version!

71
Shading
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.2</version>
...
<relocations>
<relocation>
<pattern>com.google.protobuf</pattern>
<shadedPattern>com.company.my.protobuf</shadedPattern>
</relocation>
</relocations>

72
Future of shading
• Spark 2.0 has some libraries shaded
• Gauva is fully shaded

74
5 Mistakes
• Size up your executors right
• 2 GB limit on Spark shuffle blocks
• Evil thing about skew and cartesians
• Learn to manage your DAG, yo!
• Do shady stuff, don’t let classpath leaks mess
you up

75
THANK YOU.
tiny.cloudera.com/spark-mistakes
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska

Top 5 mistakes when writing Spark applications

Related slideshows

More Related Content

What's hot

What's hot (20)

Similar to Top 5 mistakes when writing Spark applications

Similar to Top 5 mistakes when writing Spark applications (20)

More from hadooparchbook

More from hadooparchbook (20)

Recently uploaded

Recently uploaded (20)

Top 5 mistakes when writing Spark applications