Top 5 mistakes when writing Spark applications
Mark Grover | Software Engineer, Cloudera | @mark_grover
Ted Malaska | Technical Group Architect, Blizzard | @TedMalaska
About the book
• @hadooparchbook
Mistakes people make when using Spark
Mistakes we’ve made when using Spark
Mistake # 1
# Executors, cores, memory !?!
• 6 Nodes
• 16 cores each
• 64 GB of RAM each
Decisions, decisions, decisions
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM
Spark Architecture recap
Answer #1 – Most granular
• Have smallest sized executors possible
• 1 core each
• 64 GB/node / 16 executors/node = 4 GB/executor
• Total of 16 cores x 6 nodes = 96 cores => 96 executors
[Diagram: one worker node running Executor 1 through Executor 16]
Why?
• Not using the benefits of running multiple tasks in the same executor
Answer #2 – Least granular
• 6 executors in total => 1 executor per node
• 64 GB memory each
• 16 cores each
[Diagram: one worker node running a single executor]
Why?
• Need to leave some memory and CPU overhead for the OS and Hadoop daemons
Answer #3 – with overhead
• 6 executors – 1 executor/node
• 63 GB memory each
• 15 cores each
[Diagram: one worker node running a single executor, plus overhead (1 GB, 1 core)]
Let’s assume…
• You are running Spark on YARN, from here on…
3 things
• 3 other things to keep in mind
#1 – Memory overhead
• --executor-memory controls the heap size
• Need some overhead (controlled by spark.yarn.executor.memoryOverhead) for off-heap memory
• Default is max(384 MB, 0.07 * spark.executor.memory)
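As a rough sketch of the default described above: the function below applies max(384 MB, 0.07 * executor memory) to a heap size given in MB. The object and method names are hypothetical, and the 0.07 factor is the one quoted on the slide; check the value used by your Spark version.

```scala
// Sketch of the default off-heap overhead rule quoted on the slide.
// Names are illustrative and not part of Spark's API.
object MemoryOverhead {
  val MinOverheadMb: Long = 384L   // floor from the slide
  val OverheadFactor: Double = 0.07 // fraction from the slide

  // Overhead that gets added on top of --executor-memory, in MB.
  def defaultOverheadMb(executorMemoryMb: Long): Long =
    math.max(MinOverheadMb, (OverheadFactor * executorMemoryMb).toLong)

  def main(args: Array[String]): Unit = {
    println(defaultOverheadMb(4096))  // small heap: the 384 MB floor applies
    println(defaultOverheadMb(19456)) // 19 GB heap: the 7% term applies
  }
}
```

The container that YARN must grant is the heap plus this overhead, which is why the heap cannot simply be set to all of the memory available per executor.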
#2 – YARN AM needs a core: Client mode
#2 – YARN AM needs a core: Cluster mode
#3 – HDFS Throughput
• 15 cores per executor can lead to bad HDFS I/O throughput.
• Best to keep to 5 or fewer cores per executor
Calculations
• 5 cores per executor
– For max HDFS throughput
• Cluster has 6 * 15 = 90 cores in total (after taking out Hadoop/YARN daemon cores)
• 90 cores / 5 cores/executor = 18 executors
• Each node has 3 executors
• 63 GB / 3 = 21 GB, 21 x (1 - 0.07) ~ 19 GB
• 1 executor for AM => 17 executors
[Diagram: one worker node running Executor 1 through Executor 3, plus overhead]
Correct answer
• 17 executors in total
• 19 GB memory/executor
• 5 cores/executor
* Not etched in stone
[Diagram: one worker node running Executor 1 through Executor 3, plus overhead]
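The sizing arithmetic above can be sketched in plain Scala. The names ClusterSpec, ExecutorPlan and ExecutorSizing are hypothetical; the 5 cores per executor, the 0.07 overhead factor, and the 1 core and 1 GB reserved per node are the assumptions used on the slides, not fixed rules.

```scala
// Sketch of the executor sizing arithmetic from the slides.
// All names are illustrative; none come from Spark itself.
final case class ClusterSpec(nodes: Int, coresPerNode: Int, memGbPerNode: Int)
final case class ExecutorPlan(numExecutors: Int, coresPerExecutor: Int, memGbPerExecutor: Int)

object ExecutorSizing {
  def plan(c: ClusterSpec,
           coresPerExecutor: Int = 5,        // slide assumption, for HDFS throughput
           overheadFraction: Double = 0.07   // slide assumption, off-heap overhead
          ): ExecutorPlan = {
    val usableCores = c.coresPerNode - 1     // leave 1 core for OS/Hadoop daemons
    val usableMemGb = c.memGbPerNode - 1     // leave 1 GB for OS/Hadoop daemons
    val executorsPerNode = usableCores / coresPerExecutor
    val containerGb = usableMemGb.toDouble / executorsPerNode
    val heapGb = math.floor(containerGb * (1 - overheadFraction)).toInt
    val total = c.nodes * executorsPerNode - 1 // one executor slot goes to the YARN AM
    ExecutorPlan(total, coresPerExecutor, heapGb)
  }

  def main(args: Array[String]): Unit = {
    val p = plan(ClusterSpec(nodes = 6, coresPerNode = 16, memGbPerNode = 64))
    println(s"--num-executors ${p.numExecutors} " +
      s"--executor-cores ${p.coresPerExecutor} --executor-memory ${p.memGbPerExecutor}G")
  }
}
```

For the 6-node example cluster this reproduces the 17 executors, 5 cores and 19 GB from the slide. As the slide says, the result is a starting point, not etched in stone.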
Dynamic allocation helps with this though, right?
• Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload.
• Works with Spark on YARN
Decisions with Dynamic Allocation
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM
Read more
• From a great blog post on this topic by Sandy Ryza:
Mistake # 2
Application failure
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at ...
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
Why?
• No Spark shuffle block can be greater than 2 GB
Ok, what’s a shuffle block again?
• In MapReduce terminology, a file written from one Mapper for a Reducer
• The Reducer makes a local copy of this file (reducer local copy) and then
‘reduces’ it
Defining shuffle and partition
Each yellow arrow in this diagram represents a shuffle block.
Each blue block is a partition.
Once again
• Overflow exception if shuffle block size > 2 GB
What’s going on here?
• Spark uses ByteBuffer as the abstraction for blocks
val buf = ByteBuffer.allocate(length.toInt)
• ByteBuffer is limited by Integer.MAX_VALUE (2 GB)!
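A minimal sketch of why an Int-sized buffer caps a block at 2 GB. It only illustrates the narrowing from Long to Int in the call shown above; it does not reproduce the exact code path inside Spark that raised the exception on the earlier slide. The object and method names are hypothetical.

```scala
import java.nio.ByteBuffer
import scala.util.Try

// Illustrates the Int limit behind ByteBuffer.allocate(length.toInt).
object BlockSizeLimit {
  // A Long length is narrowed to Int, exactly as in the slide's one-liner.
  def allocate(length: Long): Try[ByteBuffer] =
    Try(ByteBuffer.allocate(length.toInt))

  def main(args: Array[String]): Unit = {
    val threeGb = 3L * 1024 * 1024 * 1024
    println(threeGb > Int.MaxValue)       // the length does not fit in an Int
    println(threeGb.toInt)                // narrowing wraps around to a negative number
    println(allocate(threeGb).isFailure)  // allocate rejects a negative capacity
    println(allocate(1024).isSuccess)     // small blocks are fine
  }
}
```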
Spark SQL
• Especially problematic for Spark SQL
• Default number of partitions to use when doing shuffles is 200
– This low number of partitions leads to high shuffle block size
Umm, ok, so what can I do?
1. Increase the number of partitions
– Thereby, reducing the average partition size
2. Get rid of skew in your data
– More on that later
Umm, how exactly?
• In Spark SQL, increase the value of spark.sql.shuffle.partitions
• In regular Spark applications, use rdd.repartition() or rdd.coalesce() (the latter to reduce #partitions, if needed)
But, how many partitions should I have?
• Rule of thumb is around 128 MB per partition
But! There’s more!
• Spark uses a different data structure for bookkeeping during shuffles, when
the number of partitions is less than 2000, vs. more than 2000.
Don’t believe me?
• In MapStatus.scala
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
Ok, so what are you saying?
If number of partitions < 2000, but not by much, bump it to be slightly higher
than 2000.
Can you summarize, please?
• Don’t have partitions that are too big
– Your job will fail due to the 2 GB limit
• Don’t have too few partitions
– Your job will be slow, not making use of parallelism
• Rule of thumb: ~128 MB per partition
• If #partitions < 2000, but close, bump to just > 2000
• Track SPARK-6235 for removing various 2 GB limits
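The two rules of thumb in this summary can be combined into a small helper. This is only a sketch: the object and method names are hypothetical, and closeFraction (what counts as "close" to 2000) is an assumption of this example, since the slides do not define it.

```scala
// Sketch: pick a partition count from total data size using the slide's rules of thumb.
object PartitionCount {
  val TargetBytes: Long = 128L * 1024 * 1024 // ~128 MB per partition (slide rule of thumb)
  val Threshold: Int = 2000                  // MapStatus changes representation above this

  // closeFraction is an assumption of this sketch: counts within
  // [Threshold * closeFraction, Threshold] are bumped to just above the threshold.
  def suggest(totalBytes: Long, closeFraction: Double = 0.9): Int = {
    val byRule = math.max(1L, (totalBytes + TargetBytes - 1) / TargetBytes).toInt
    val closeToThreshold = byRule <= Threshold && byRule >= Threshold * closeFraction
    if (closeToThreshold) Threshold + 1 else byRule
  }

  def main(args: Array[String]): Unit = {
    val gb = 1024L * 1024 * 1024
    println(suggest(100 * gb)) // well below the threshold: plain 128 MB rule
    println(suggest(240 * gb)) // just under the threshold: bumped above it
    println(suggest(500 * gb)) // well above the threshold: plain 128 MB rule
  }
}
```

The result would then be passed to spark.sql.shuffle.partitions or rdd.repartition(), as described on the earlier slide.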
Mistake # 3
Slow jobs on Join/Shuffle
• Your dataset takes 20 seconds to run over with a map job, but takes 4 hours when joined or shuffled. What’s wrong?
Mistake – Skew
[Diagram labels: Single Thread (x7), Normal, Distributed]
The Holy Grail of Distributed Systems
Mistake – Skew
[Diagram labels: Single Thread, Normal, Distributed]
What about Skew, because that is a thing
Mistake – Skew : Answers
• Salting
• Isolated Salting
• Isolated Map Joins
Mistake – Skew : Salting
• Normal Key: “Foo”
• Salted Key: “Foo” + random.nextInt(saltFactor)
Managing Parallelism
Mistake – Skew: Salting
Mistake – Skew : Salting
• Two Stage Aggregation
– Stage one to do operations on the salted keys
– Stage two to do operations across the unsalted key results
[Flow: Data Source → Map: convert to salted key & value tuple → Reduce by salted key → Map: convert results to key & value tuple → Reduce by key → Results]
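The two-stage aggregation can be sketched with plain Scala collections standing in for RDDs; on Spark the two groupBy-and-sum steps would typically be two reduceByKey calls. The names are hypothetical, and the salt is kept as a separate tuple field (a variation on the slide's string concatenation) so that stripping it in stage two is unambiguous.

```scala
import scala.util.Random

// Sketch of salting with a two-stage sum, using local collections in place of RDDs.
object SaltedAggregation {
  def sumByKey(records: Seq[(String, Long)], saltFactor: Int, rnd: Random): Map[String, Long] = {
    // Map: convert to salted key & value, spreading a hot key over saltFactor sub-keys.
    val salted: Seq[((String, Int), Long)] =
      records.map { case (k, v) => ((k, rnd.nextInt(saltFactor)), v) }

    // Stage one: reduce by salted key.
    val stageOne: Map[(String, Int), Long] =
      salted.groupBy(_._1).map { case (sk, vs) => sk -> vs.map(_._2).sum }

    // Stage two: drop the salt and reduce the partial sums by the original key.
    stageOne.toSeq
      .map { case ((k, _), partial) => (k, partial) }
      .groupBy(_._1)
      .map { case (k, ps) => k -> ps.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    // "Foo" is the skewed key in this made-up input.
    val records = Seq.fill(1000)(("Foo", 1L)) ++ Seq(("Bar", 2L), ("Baz", 3L))
    println(sumByKey(records, saltFactor = 8, rnd = new Random(42)))
  }
}
```

This only works for operations whose partial results can be combined again, such as sums, counts, minimums and maximums.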
Mistake – Skew : Isolated Salting
• Second Stage only required for Isolated Keys
[Flow: Data Source → Map: convert to key & value; isolate key and convert to salted key & value tuple → Reduce by key & salted key → Filter isolated keys from salted keys → Map: convert results to key & value tuple → Reduce by key → Union to results]
Mistake – Skew : Isolated Map Join
• Filter out isolated keys and use Map Join/Aggregate on those
• And normal reduce on the rest of the data
• This can remove a large amount of data being shuffled
[Flow: Data Source → Filter normal keys from isolated keys → Reduce by normal key, and Map Join for isolated keys → Union to results]
Managing Parallelism
Cartesian Join
[Diagram: three map tasks, each writing Shuffle Tmp 1 through Shuffle Tmp 4, read by four reduce tasks. Amount of data: 10x, 100x, 1000x, 10000x, 100000x, 1000000x, or more]
Managing Parallelism
• How to fight Cartesian Join
– Nested Structures
Table X: (A, 1), (A, 2), (A, 3)
Table Y: (A, 4), (A, 5), (A, 6)
JOIN: (A, 1, 4), (A, 2, 4), (A, 3, 4), (A, 1, 5), (A, 2, 5), (A, 3, 5), (A, 1, 6), (A, 2, 6), (A, 3, 6)
OR nested under key A: (A, 1), (A, 2), (A, 3), (A, 4), (A, 5), (A, 6)
Managing Parallelism
• How to fight Cartesian Join
– Nested Structures
create table nestedTable (
  col1 string,
  col2 string,
  col3 array<struct<
    col3_1: string,
    col3_2: string>>)

import org.apache.spark.sql.Row
// sc is the SparkContext, as provided by spark-shell
val rddNested = sc.parallelize(Array(
  Row("a1", "b1", Seq(Row("c1_1", "c2_1"),
                      Row("c1_2", "c2_2"),
                      Row("c1_3", "c2_3"))),
  Row("a2", "b2", Seq(Row("c1_2", "c2_2"),
                      Row("c1_3", "c2_3"),
                      Row("c1_4", "c2_4")))), 2)
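To see why nesting helps, here is a small sketch in plain Scala collections (not Spark) using the Table X / Table Y values from the earlier slide. The names NestedVsJoin, Nested, join and nest are hypothetical. The join multiplies rows per key, while the nested form keeps one record per key whose size is the sum of the inputs.

```scala
// Sketch: row growth of a join on a repeated key versus a nested structure.
object NestedVsJoin {
  final case class Nested(key: String, x: Seq[Int], y: Seq[Int])

  // Join: every X row pairs with every Y row that has the same key.
  def join(x: Seq[(String, Int)], y: Seq[(String, Int)]): Seq[(String, Int, Int)] =
    for {
      (kx, vx) <- x
      (ky, vy) <- y
      if kx == ky
    } yield (kx, vx, vy)

  // Nest: one record per key, holding both value lists side by side.
  def nest(x: Seq[(String, Int)], y: Seq[(String, Int)]): Seq[Nested] = {
    val keys = (x.map(_._1) ++ y.map(_._1)).distinct
    keys.map { k =>
      Nested(k, x.filter(_._1 == k).map(_._2), y.filter(_._1 == k).map(_._2))
    }
  }

  def main(args: Array[String]): Unit = {
    val tableX = Seq(("A", 1), ("A", 2), ("A", 3))
    val tableY = Seq(("A", 4), ("A", 5), ("A", 6))
    println(join(tableX, tableY).size) // 3 x 3 joined rows for key A
    println(nest(tableX, tableY))      // a single nested record for key A
  }
}
```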
Mistake # 4
Out of luck?
• Do you ever run out of memory?
• Do you ever have more than 20 stages?
• Is your driver doing a lot of work?
Mistake – DAG Management
• Shuffles are to be avoided
• ReduceByKey over GroupByKey
• TreeReduce over Reduce
• Use Complex/Nested Types
Mistake – DAG Management: Shuffles
• Map Side reduction, where possible
• Think about partitioning/bucketing ahead of time
• Do as much as possible with a single shuffle
• Only send what you have to send
• Avoid Skew and Cartesians
ReduceByKey over GroupByKey
• ReduceByKey can do almost anything that GroupByKey can do
– Aggregations
– Windowing
– Use memory
– But you have more control
• ReduceByKey has a fixed limit on memory requirements
• GroupByKey is unbounded and dependent on data
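The difference in memory behavior can be sketched with local collections; this is an analogy for the two styles, not Spark's implementation, and the names are hypothetical. The grouping version holds every value for a key before aggregating, while the reducing version keeps a single running value per key.

```scala
// Sketch: per-key maximum computed in the groupByKey style and the reduceByKey style.
object ReduceVsGroup {
  // groupByKey style: materialize all values per key, then aggregate.
  // Memory per key grows with the number of values for that key.
  def maxByGrouping(records: Iterator[(String, Int)]): Map[String, Int] =
    records.toSeq.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).max }

  // reduceByKey style: fold each record into one running value per key.
  // Memory per key stays constant, whatever the data looks like.
  def maxByReducing(records: Iterator[(String, Int)]): Map[String, Int] =
    records.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).fold(v)(math.max(_, v)))
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", 1), ("b", 5), ("a", 3), ("b", 2))
    println(maxByGrouping(data.iterator))
    println(maxByReducing(data.iterator))
  }
}
```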
TreeReduce over Reduce
• TreeReduce & Reduce return some result to driver
• TreeReduce does more work on the executors
• While Reduce brings everything back to the driver
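The idea can be sketched on a local sequence of per-partition partial results. This is an illustration of the merge pattern only, with hypothetical names; in Spark the intermediate levels run on executors and only the last merge reaches the driver.

```scala
import scala.annotation.tailrec

// Sketch: flat reduce versus pairwise (tree) reduce over per-partition partial results.
object TreeReduceSketch {
  // Flat reduce: all partial results are merged in one place (the "driver").
  def flatReduce[A](partials: Seq[A])(f: (A, A) => A): A =
    partials.reduce(f)

  // Tree reduce: merge partial results pairwise, level by level,
  // so the final step only has to combine a couple of values.
  @tailrec
  def treeReduce[A](partials: Seq[A])(f: (A, A) => A): A = {
    require(partials.nonEmpty, "need at least one partial result")
    if (partials.size == 1) partials.head
    else treeReduce(partials.grouped(2).map(_.reduce(f)).toVector)(f)
  }

  def main(args: Array[String]): Unit = {
    val partials = Seq(1, 2, 3, 4, 5)
    println(flatReduce(partials)(_ + _))
    println(treeReduce(partials)(_ + _))
  }
}
```

Both give the same answer for an associative function; the difference is where the merging work happens.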
Complex Types
• Top N List
• Multiple types of Aggregations
• Windowing operations
• All in one pass
Complex Types
• Think outside the box: use objects to reduce by
• (Make something simple)
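One way to read "use objects to reduce by" is to fold every record into a small aggregate object, so that several aggregations and a top-N list come out of a single pass. The Stats class below is a hypothetical example, shown on a local collection; on Spark the same merge function could be handed to a reduce or reduceByKey.

```scala
// Sketch: a complex type that carries several aggregations plus a top-N list.
final case class Stats(count: Long, sum: Long, min: Long, max: Long, topN: List[Long]) {
  // Merging two Stats is associative, so it can serve as a reduce function.
  def merge(other: Stats, n: Int): Stats = Stats(
    count + other.count,
    sum + other.sum,
    math.min(min, other.min),
    math.max(max, other.max),
    (topN ++ other.topN).sorted(Ordering[Long].reverse).take(n)
  )
}

object Stats {
  // Stats for a single value.
  def of(v: Long): Stats = Stats(1L, v, v, v, List(v))
}

object OnePass {
  // Count, sum, min, max and top N, all in one pass over the values.
  def summarize(values: Seq[Long], n: Int): Stats =
    values.map(Stats.of).reduce((a, b) => a.merge(b, n))

  def main(args: Array[String]): Unit =
    println(summarize(Seq[Long](5, 1, 9, 3, 7), n = 3))
}
```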
Mistake # 5
Ever seen this?
Exception in thread "main" java.lang.NoSuchMethodError:;
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
But!
• I already included protobuf in my app’s Maven dependencies?
Ah!
• My protobuf version doesn’t match Spark’s protobuf version!
Future of shading
• Spark 2.0 has some libraries shaded
• Guava is fully shaded
5 Mistakes
• Size up your executors right
• 2 GB limit on Spark shuffle blocks
• Evil thing about skew and cartesians
• Learn to manage your DAG, yo!
• Do shady stuff, don’t let classpath leaks mess
you up
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska

Top 5 mistakes when writing Spark applications

  • 1. Top 5 mistakes when writing Spark applications Mark Grover | Software Engineer, Cloudera | @mark_grover Ted Malaska | Technical Group Architect, Blizzard| @TedMalaska
  • 2. 2 About the book • @hadooparchbook
  • 4. 4 Mistakes people we’ve made when using Spark
  • 7. 7 # Executors, cores, memory !?! • 6 Nodes • 16 cores each • 64 GB of RAM each
  • 8. 8 Decisions, decisions, decisions • Number of executors (--num-executors) • Cores for each executor (--executor-cores) • Memory for each executor (--executor-memory) • 6 nodes • 16 cores each • 64 GB of RAM
  • 10. 10 Answer #1 – Most granular • Have smallest sized executors possible • 1 core each • 64GB/node / 16 executors/node = 4 GB/executor • Total of 16 cores x 6 nodes = 96 cores => 96 executors Worker node Executor 16 Executor 4 Executor 3 Executor 2 Executor 1
  • 11. 11 Answer #1 – Most granular • Have smallest sized executors possible • 1 core each • 64GB/node / 16 executors/node = 4 GB/executor • Total of 16 cores x 6 nodes = 96 cores => 96 executors Worker node Executor 16 Executor 4 Executor 3 Executor 2 Executor 1
  • 12. 12 Why? • Not using benefits of running multiple tasks in same executor
  • 13. 13 Answer #2 – Least granular • 6 executors in total =>1 executor per node • 64 GB memory each • 16 cores each Worker node Executor 1
  • 14. 14 Answer #2 – Least granular • 6 executors in total =>1 executor per node • 64 GB memory each • 16 cores each Worker node Executor 1
  • 15. 15 Why? • Need to leave some memory overhead for OS/Hadoop daemons
  • 16. 16 Answer #3 – with overhead • 6 executors – 1 executor/node • 63 GB memory each • 15 cores each Worker node Executor 1 Overhead(1G,1 core)
  • 17. 17 Answer #3 – with overhead • 6 executors – 1 executor/node • 63 GB memory each • 15 cores each Worker node Executor 1 Overhead(1G,1 core)
  • 18. 18 Let’s assume… • You are running Spark on YARN, from here on…
  • 19. 19 3 things • 3 other things to keep in mind
  • 20. 20 #1 – Memory overhead • --executor-memory controls the heap size • Need some overhead (controlled by spark.yarn.executor.memoryOverhead) for off-heap memory • Default is max(384 MB, 0.07 * spark.executor.memory)
  • 21. 21 #2 - YARN AM needs a core: Client mode
  • 22. 22 #2 YARN AM needs a core: Cluster mode
  • 23. 23 #3 HDFS Throughput • 15 cores per executor can lead to bad HDFS I/O throughput. • Best to keep to 5 or fewer cores per executor
  • 24. 24 Calculations • 5 cores per executor – For max HDFS throughput • Cluster has 6 * 15 = 90 cores in total (after taking out Hadoop/YARN daemon cores) • 90 cores / 5 cores/executor = 18 executors • Each node has 3 executors • 63 GB / 3 = 21 GB, 21 x (1 - 0.07) ~ 19 GB • 1 executor for AM => 17 executors Overhead Worker node Executor 3 Executor 2 Executor 1
  • 25. 25 Correct answer • 17 executors in total • 19 GB memory/executor • 5 cores/executor * Not etched in stone Overhead Worker node Executor 3 Executor 2 Executor 1
  • 26. 26 Dynamic allocation helps with this though, right? • Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload. • Works with Spark on YARN
  • 27. 27 Decisions with Dynamic Allocation • Number of executors (--num-executors) • Cores for each executor (--executor-cores) • Memory for each executor (--executor-memory) • 6 nodes • 16 cores each • 64 GB of RAM
  • 28. 28 Read more • From a great blog post on this topic by Sandy Ryza: part-2/
  • 30. 30 Application failure 15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at at at at at at at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
  • 31. 31 Why? • No Spark shuffle block can be greater than 2 GB
  • 32. 32 Ok, what’s a shuffle block again? • In MapReduce terminology, a file written from one Mapper for a Reducer • The Reducer makes a local copy of this file (reducer local copy) and then ‘reduces’ it
  • 33. 33 Defining shuffle and partition Each yellow arrow in this diagram represents a shuffle block. Each blue block is a partition.
  • 34. 34 Once again • Overflow exception if shuffle block size > 2 GB
  • 35. 35 What’s going on here? • Spark uses ByteBuffer as the abstraction for blocks val buf = ByteBuffer.allocate(length.toInt) • ByteBuffer is limited by Integer.MAX_VALUE (2 GB)!
  • 36. 36 Spark SQL • Especially problematic for Spark SQL • Default number of partitions to use when doing shuffles is 200 – This low number of partitions leads to high shuffle block size
  • 37. 37 Umm, ok, so what can I do? 1. Increase the number of partitions – Thereby, reducing the average partition size 2. Get rid of skew in your data – More on that later
  • 38. 38 Umm, how exactly? • In Spark SQL, increase the value of spark.sql.shuffle.partitions • In regular Spark applications, use rdd.repartition() or rdd.coalesce()(latter to reduce #partitions, if needed)
  • 39. 39 But, how many partitions should I have? • Rule of thumb is around 128 MB per partition
  • 40. 40 But! There’s more! • Spark uses a different data structure for bookkeeping during shuffles, when the number of partitions is less than 2000, vs. more than 2000.
  • 41. 41 Don’t believe me? • In MapStatus.scala def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = { if (uncompressedSizes.length > 2000) { HighlyCompressedMapStatus(loc, uncompressedSizes) } else { new CompressedMapStatus(loc, uncompressedSizes) } }
  • 42. 42 Ok, so what are you saying? If number of partitions < 2000, but not by much, bump it to be slightly higher than 2000.
  • 43. 43 Can you summarize, please? • Don’t have partitions that are too big – Your job will fail due to the 2 GB limit • Don’t have too few partitions – Your job will be slow, not making use of parallelism • Rule of thumb: ~128 MB per partition • If #partitions < 2000, but close, bump to just > 2000 • Track SPARK-6235 for removing various 2 GB limits
  • 45. 45 Slow jobs on Join/Shuffle • Your dataset takes 20 seconds to run over with a map job, but takes 4 hours when joined or shuffled. What’s wrong?
  • 46. 46 Mistake - Skew Single Thread Single Thread Single Thread Single Thread Single Thread Single Thread Single Thread Normal Distributed The Holy Grail of Distributed Systems
  • 47. 47 Mistake - Skew Single ThreadNormal Distributed What about Skew, because that is a thing
  • 48. 48 • Salting • Isolated Salting • Isolated Map Joins Mistake – Skew : Answers
  • 49. 49 • Normal Key: “Foo” • Salted Key: “Foo” + random.nextInt(saltFactor) Mistake – Skew : Salting
  • 53. 53 • Two Stage Aggregation – Stage one to do operations on the salted keys – Stage two to do operations across the unsalted key results Mistake – Skew : Salting Data Source Map Convert to Salted Key & Value Tuple Reduce By Salted Key Map Convert results to Key & Value Tuple Reduce By Key Results
  • 54. 54 • Second Stage only required for Isolated Keys Mistake – Skew : Isolated Salting Data Source Map Convert to Key & Value Isolate Key and convert to Salted Key & Value Tuple Reduce By Key & Salted Key Filter Isolated Keys From Salted Keys Map Convert results to Key & Value Tuple Reduce By Key Union to Results
  • 55. 55 • Filter Out Isolated Keys and use Map Join/Aggregate on those • And normal reduce on the rest of the data • This can remove a large amount of data being shuffled Mistake – Skew : Isolated Map Join Data Source Filter Normal Keys From Isolated Keys Reduce By Normal Key Union to Results Map Join For Isolated Keys
  • 56. 56 Managing Parallelism Cartesian Join Map Task Shuffle Tmp 1 Shuffle Tmp 2 Shuffle Tmp 3 Shuffle Tmp 4 Map Task Shuffle Tmp 1 Shuffle Tmp 2 Shuffle Tmp 3 Shuffle Tmp 4 Map Task Shuffle Tmp 1 Shuffle Tmp 2 Shuffle Tmp 3 Shuffle Tmp 4 ReduceTask ReduceTask ReduceTask ReduceTask Amount of Data Amount of Data 10x 100x 1000x 10000x 100000x 1000000x Or more
  • 57. 57 Table YTable X • How To fight Cartesian Join – Nested Structures Managing Parallelism A, 1 A, 2 A, 3 A, 4 A, 5 A, 6 Table X A, 1, 4 A, 2, 4 A, 3, 4 A, 1, 5 A, 2, 5 A, 3, 5 A, 1, 6 A, 2, 6 A, 3, 6 JOIN OR Table X A A, 1 A, 2 A, 3 A, 4 A, 5 A, 6
  • 58. 58 • How To fight Cartesian Join – Nested Structures Managing Parallelism create table nestedTable ( col1 string, col2 string, col3 array< struct< col3_1: string, col3_2: string>> val rddNested = sc.parallelize(Array( Row("a1", "b1", Seq(Row("c1_1", "c2_1"), Row("c1_2", "c2_2"), Row("c1_3", "c2_3"))), Row("a2", "b2", Seq(Row("c1_2", "c2_2"), Row("c1_3", "c2_3"), Row("c1_4", "c2_4")))), 2) =
  • 60. 60 Out of luck? • Do you ever run out of memory? • Do you ever have more than 20 stages? • Is your driver doing a lot of work?
  • 61. 61 Mistake – DAG Management • Shuffles are to be avoided • ReduceByKey over GroupByKey • TreeReduce over Reduce • Use Complex/Nested Types
  • 62. 62 Mistake – DAG Management: Shuffles • Map Side reduction, where possible • Think about partitioning/bucketing ahead of time • Do as much as possible with a single shuffle • Only send what you have to send • Avoid Skew and Cartesians
  • 63. 63 ReduceByKey over GroupByKey • ReduceByKey can do almost anything that GroupByKey can do • Aggregations • Windowing • Use memory • But you have more control • ReduceByKey has a fixed limit of Memory requirements • GroupByKey is unbound and dependent on data
  • 64. 64 TreeReduce over Reduce • TreeReduce & Reduce return some result to driver • TreeReduce does more work on the executors • While Reduce brings everything back to the driver Partition Partition Partition Partition Driver 100% Partition Partition Partition Partition Driver 4 25% 25% 25% 25%
  • 65. 65 Complex Types • Top N List • Multiple types of Aggregations • Windowing operations • All in one pass
  • 66. 66 Complex Types • Think outside the box: use objects to reduce by • (Make something simple)
  • 68. 68 Ever seen this? Exception in thread "main" java.lang.NoSuchMethodError:; at $apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261) at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165) at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102) at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210) at…....
  • 69. 69 But! • I already included protobuf in my app’s maven dependencies?
  • 70. 70 Ah! • My protobuf version doesn’t match with Spark’s protobuf version!
  • 72. 72 Future of shading • Spark 2.0 has some libraries shaded • Guava is fully shaded
  • 74. 74 5 Mistakes • Size up your executors right • 2 GB limit on Spark shuffle blocks • Evil thing about skew and cartesians • Learn to manage your DAG, yo! • Do shady stuff, don’t let classpath leaks mess you up
  • 75. 75 THANK YOU. Mark Grover | @mark_grover Ted Malaska | @TedMalaska