Top 5 mistakes when writing Spark applications
tiny.cloudera.com/spark-mistakes
Mark Grover | Software Engineer, Cloudera | @mark_grover
Ted Malaska | Technical Group Architect, Blizzard | @TedMalaska
2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
3
Mistakes people make
when using Spark
4
Mistakes we’ve made
when using Spark
6
Mistake # 1
7
# Executors, cores, memory !?!
• 6 Nodes
• 16 cores each
• 64 GB of RAM each
8
Decisions, decisions, decisions
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM
9
Spark Architecture recap
10
Answer #1 – Most granular
• Have smallest sized executors possible
• 1 core each
• 64 GB/node / 16 executors/node = 4 GB/executor
• Total of 16 cores x 6 nodes = 96 cores => 96 executors
[Diagram: one worker node running 16 single-core executors, Executor 1 through Executor 16]
12
Why?
• Not using the benefits of running multiple tasks in the same executor
13
Answer #2 – Least granular
• 6 executors in total => 1 executor per node
• 64 GB memory each
• 16 cores each
[Diagram: worker node running a single executor]
15
Why?
• Need to leave some memory overhead for OS/Hadoop daemons
16
Answer #3 – with overhead
• 6 executors – 1 executor/node
• 63 GB memory each
• 15 cores each
[Diagram: worker node running a single executor, with 1 GB and 1 core set aside as overhead]
18
Let’s assume…
• You are running Spark on YARN, from here on…
19
3 things
• 3 other things to keep in mind
20
#1 – Memory overhead
• --executor-memory controls the heap size
• Need some overhead (controlled by spark.yarn.executor.memoryOverhead) for off-heap memory
• Default is max(384 MB, 0.07 * spark.executor.memory)
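The default above can be worked through as a small calculation. This is a minimal sketch that assumes the 384 MB floor and 0.07 factor quoted on the slide; the object and method names are made up for illustration, and truncating the fractional megabytes is an assumption.

```scala
object MemoryOverhead {
  // Off-heap overhead per the slide: max(384 MB, 0.07 * executor heap).
  // Truncating to whole megabytes is an assumption of this sketch.
  def overheadMb(executorMemoryMb: Long): Long =
    math.max(384L, (0.07 * executorMemoryMb).toLong)

  // What YARN has to grant for one executor: heap plus overhead.
  def containerMb(executorMemoryMb: Long): Long =
    executorMemoryMb + overheadMb(executorMemoryMb)

  def main(args: Array[String]): Unit = {
    // Small heap (2 GB): the 384 MB floor applies.
    println(containerMb(2048))
    // Large heap (19 GB): the 7% factor dominates.
    println(containerMb(19 * 1024))
  }
}
```

The point of the sketch is that the container YARN allocates is always larger than the value passed to --executor-memory, so node memory has to be budgeted against the container size, not the heap size.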
21
#2 – YARN AM needs a core: Client mode
22
#2 – YARN AM needs a core: Cluster mode
23
#3 – HDFS Throughput
• 15 cores per executor can lead to bad HDFS I/O throughput
• Best is to keep to 5 cores or fewer per executor
24
Calculations
• 5 cores per executor
– For max HDFS throughput
• Cluster has 6 * 15 = 90 cores in total (after taking out Hadoop/YARN daemon cores)
• 90 cores / 5 cores/executor = 18 executors
• Each node has 3 executors
• 63 GB / 3 = 21 GB, 21 x (1 - 0.07) ~ 19 GB
• 1 executor for AM => 17 executors
[Diagram: worker node running Executors 1 to 3, with overhead set aside]
25
Correct answer
• 17 executors in total
• 19 GB memory/executor
• 5 cores/executor
* Not etched in stone
[Diagram: worker node running Executors 1 to 3, with overhead set aside]
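The arithmetic behind these numbers can be written down as a short function. This is a sketch under the deck's own assumptions (1 core and 1 GB per node reserved, 5 cores per executor, a 0.07 overhead factor, one executor slot given to the YARN AM); the names `ExecutorSizing`, `Plan` and `plan` are hypothetical.

```scala
object ExecutorSizing {
  final case class Plan(executors: Int, coresPerExecutor: Int, memoryGbPerExecutor: Int)

  // Walks through the slide's arithmetic. All parameter names are illustrative.
  def plan(nodes: Int, coresPerNode: Int, memGbPerNode: Int,
           coresPerExecutor: Int = 5, overheadFraction: Double = 0.07): Plan = {
    val usableCores = coresPerNode - 1   // leave 1 core per node for OS/Hadoop daemons
    val usableMemGb = memGbPerNode - 1   // leave 1 GB per node as well
    val executorsPerNode = usableCores / coresPerExecutor
    val containerGb = usableMemGb / executorsPerNode
    // The heap is what remains once the off-heap overhead is carved out of the container.
    val heapGb = (containerGb * (1 - overheadFraction)).toInt
    val totalExecutors = nodes * executorsPerNode - 1 // one slot goes to the YARN AM
    Plan(totalExecutors, coresPerExecutor, heapGb)
  }

  def main(args: Array[String]): Unit =
    println(plan(nodes = 6, coresPerNode = 16, memGbPerNode = 64)) // Plan(17,5,19)
}
```

For the 6-node cluster used throughout the deck this reproduces 17 executors, 5 cores each, 19 GB each. As the slide says, the result is a starting point rather than a fixed rule.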
26
Dynamic allocation helps with this though, right?
• Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload.
• Works with Spark on YARN
27
Decisions with Dynamic Allocation
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)
• 6 nodes
• 16 cores each
• 64 GB of RAM
28
Read more
• From a great blog post on this topic by Sandy Ryza:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
29
Mistake # 2
30
Application failure
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
31
Why?
• No Spark shuffle block can be greater than 2 GB
32
Ok, what’s a shuffle block again?
• In MapReduce terminology, a file written from one Mapper for a Reducer
• The Reducer makes a local copy of this file (reducer local copy) and then ‘reduces’ it
33
Defining shuffle and partition
Each yellow arrow in this diagram represents a shuffle block.
Each blue block is a partition.
34
Once again
• Overflow exception if shuffle block size > 2 GB
35
What’s going on here?
• Spark uses ByteBuffer as abstraction for blocks
val buf = ByteBuffer.allocate(length.toInt)
• ByteBuffer is limited by Integer.MAX_VALUE (2 GB)!
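The limit can be seen without Spark at all. The sketch below only demonstrates the JVM behaviour behind the slide's snippet: a `Long` length narrowed with `toInt` wraps around past 2 GB, and `ByteBuffer.allocate` cannot take a capacity above `Int.MaxValue`. The object and method names are made up for illustration.

```scala
import java.nio.ByteBuffer

object TwoGbLimit {
  // Narrowing a Long length to Int, as in the slide's snippet, wraps around past 2 GB.
  def narrowed(length: Long): Int = length.toInt

  def main(args: Array[String]): Unit = {
    val threeGb = 3L * 1024 * 1024 * 1024
    println(Int.MaxValue)      // 2147483647, the largest capacity a ByteBuffer can have
    println(narrowed(threeGb)) // -1073741824
    // A negative capacity is rejected outright.
    try {
      ByteBuffer.allocate(narrowed(threeGb))
      println("allocated")
    } catch {
      case e: IllegalArgumentException => println("rejected: " + e)
    }
  }
}
```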
36
Spark SQL
• Especially problematic for Spark SQL
• Default number of partitions to use when doing shuffles is 200
– This low number of partitions leads to high shuffle block size
37
Umm, ok, so what can I do?
1. Increase the number of partitions
– Thereby, reducing the average partition size
2. Get rid of skew in your data
– More on that later
38
Umm, how exactly?
• In Spark SQL, increase the value of spark.sql.shuffle.partitions
• In regular Spark applications, use rdd.repartition() or rdd.coalesce() (the latter to reduce #partitions, if needed)
39
But, how many partitions should I have?
• Rule of thumb is around 128 MB per partition
40
But! There’s more!
• Spark uses a different data structure for bookkeeping during shuffles depending on whether the number of partitions is more than 2000 or not.
41
Don’t believe me?
• In MapStatus.scala
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]):
MapStatus = {
if (uncompressedSizes.length > 2000) {
HighlyCompressedMapStatus(loc, uncompressedSizes)
} else {
new CompressedMapStatus(loc, uncompressedSizes)
}
}
42
Ok, so what are you saying?
If the number of partitions is < 2000, but not by much, bump it to be slightly higher than 2000.
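Both rules of thumb (about 128 MB per partition, and bumping past 2000 when close) fit in one small helper. This is a sketch, not a Spark API: `PartitionCount`, `suggest` and especially `closeFraction`, which decides how near to 2000 counts as "not by much", are assumptions introduced here.

```scala
object PartitionCount {
  val TargetPartitionMb = 128L   // rule of thumb from the talk
  val CompressedThreshold = 2000 // threshold checked in MapStatus.scala

  // `closeFraction` (how near to 2000 counts as "close") is an assumption, not from the talk.
  def suggest(shuffleDataMb: Long, closeFraction: Double = 0.9): Int = {
    // Round up so that no partition exceeds the target size.
    val bySize = math.max(1L, (shuffleDataMb + TargetPartitionMb - 1) / TargetPartitionMb).toInt
    val nearThreshold =
      bySize >= CompressedThreshold * closeFraction && bySize <= CompressedThreshold
    if (nearThreshold) CompressedThreshold + 1 else bySize
  }

  def main(args: Array[String]): Unit = {
    println(suggest(100L * 1024)) // 100 GB of shuffle data: 800 partitions
    println(suggest(240L * 1024)) // 240 GB: 1920 by size, bumped to 2001
  }
}
```

The resulting number is what one might pass to spark.sql.shuffle.partitions or rdd.repartition().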
43
Can you summarize, please?
• Don’t have too big partitions
– Your job will fail due to 2 GB limit
• Don’t have too few partitions
– Your job will be slow, not making use of parallelism
• Rule of thumb: ~128 MB per partition
• If #partitions < 2000, but close, bump to just > 2000
• Track SPARK-6235 for removing various 2 GB limits
44
Mistake # 3
45
Slow jobs on Join/Shuffle
• Your dataset takes 20 seconds to run over with a map job, but takes 4 hours when joined or shuffled. What’s wrong?
46
Mistake - Skew
[Diagram: a normal single thread compared with the same work distributed across many single threads]
The Holy Grail of Distributed Systems
47
Mistake - Skew
[Diagram: normal single thread versus distributed]
What about Skew, because that is a thing
48
Mistake – Skew : Answers
• Salting
• Isolated Salting
• Isolated Map Joins
49
Mistake – Skew : Salting
• Normal Key: “Foo”
• Salted Key: “Foo” + random.nextInt(saltFactor)
50
Managing Parallelism
51
Mistake – Skew: Salting
52
Add Example Slide
53
Mistake – Skew : Salting
• Two Stage Aggregation
– Stage one to do operations on the salted keys
– Stage two to do operations across the unsalted key results
[Flow: Data Source → Map (convert to salted key & value tuple) → Reduce by salted key → Map (convert results to key & value tuple) → Reduce by key → Results]
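The two-stage aggregation can be sketched with plain Scala collections standing in for an RDD. Assumptions in this sketch: the salt is kept as a separate tuple element rather than concatenated onto the key string as on the earlier slide, the aggregation is a sum, and the names `SaltedAggregation` and `sumByKey` are hypothetical.

```scala
import scala.util.Random

object SaltedAggregation {
  // Stage one: key by (key, salt) and reduce. Stage two: drop the salt and reduce again.
  def sumByKey(records: Seq[(String, Long)], saltFactor: Int, rng: Random): Map[String, Long] = {
    val salted = records.map { case (k, v) => ((k, rng.nextInt(saltFactor)), v) }
    val stageOne = salted.groupBy(_._1).map { case (saltedKey, vs) => saltedKey -> vs.map(_._2).sum }
    // Convert to a Seq before dropping the salt, so partial sums for the same key are kept.
    val unsalted = stageOne.toSeq.map { case ((k, _), partial) => (k, partial) }
    unsalted.groupBy(_._1).map { case (k, ps) => k -> ps.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    // "Foo" is the skewed key: 1000 records against 3 for "Bar".
    val records = Seq.fill(1000)(("Foo", 1L)) ++ Seq.fill(3)(("Bar", 1L))
    println(sumByKey(records, saltFactor = 8, rng = new Random(42)))
  }
}
```

With a salt factor of 8, the 1000 "Foo" records are spread over up to 8 salted keys in stage one, so no single reducer has to handle all of them. Stage two then only combines at most 8 partial sums per key.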
54
Mistake – Skew : Isolated Salting
• Second Stage only required for Isolated Keys
[Flow: Data Source → Map (convert to key & value; isolate key and convert to salted key & value tuple) → Reduce by key & salted key → Filter isolated keys from salted keys → Map (convert results to key & value tuple) → Reduce by key → Union to Results]
55
Mistake – Skew : Isolated Map Join
• Filter out Isolated Keys and use Map Join/Aggregate on those
• And normal reduce on the rest of the data
• This can remove a large amount of data being shuffled
[Flow: Data Source → Filter normal keys from isolated keys → Reduce by normal key, or Map Join for isolated keys → Union to Results]
56
Managing Parallelism
Cartesian Join
[Diagram: three map tasks each write shuffle temp files 1 to 4, which feed four reduce tasks. A chart of the amount of data grows through 10x, 100x, 1000x, 10000x, 100000x, 1000000x or more]
57
Managing Parallelism
• How to fight Cartesian Join
– Nested Structures
Table X: (A, 1), (A, 2), (A, 3)
Table Y: (A, 4), (A, 5), (A, 6)
JOIN gives 9 rows: (A, 1, 4), (A, 2, 4), (A, 3, 4), (A, 1, 5), (A, 2, 5), (A, 3, 5), (A, 1, 6), (A, 2, 6), (A, 3, 6)
OR nest under key A in a single row: (A, 1), (A, 2), (A, 3), (A, 4), (A, 5), (A, 6)
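The row counts in this example can be checked with plain Scala collections. This is a sketch, not Spark code: `NestedVsJoin`, `Nested`, `join` and `nest` are illustrative names, and the nested form is modelled as one record per key holding both sides as collections.

```scala
object NestedVsJoin {
  final case class Nested(key: String, xs: Seq[Int], ys: Seq[Int])

  // A plain join on the key emits one row per matching pair.
  def join(x: Seq[(String, Int)], y: Seq[(String, Int)]): Seq[(String, Int, Int)] =
    for ((kx, vx) <- x; (ky, vy) <- y if kx == ky) yield (kx, vx, vy)

  // Nesting keeps one row per key and stores both sides as collections.
  def nest(x: Seq[(String, Int)], y: Seq[(String, Int)]): Seq[Nested] = {
    val keys = (x.map(_._1) ++ y.map(_._1)).distinct
    keys.map(k => Nested(k, x.filter(_._1 == k).map(_._2), y.filter(_._1 == k).map(_._2)))
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(("A", 1), ("A", 2), ("A", 3))
    val y = Seq(("A", 4), ("A", 5), ("A", 6))
    println(join(x, y).size) // 9 rows: 3 x 3 for the single key
    println(nest(x, y).size) // 1 row holding all 6 values
  }
}
```

The join output grows with the product of the two sides per key, while the nested form grows with their sum, which is why nesting helps when many rows share a key.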
58
Managing Parallelism
• How to fight Cartesian Join
– Nested Structures
create table nestedTable (
  col1 string,
  col2 string,
  col3 array< struct<
    col3_1: string,
    col3_2: string>>)
=
import org.apache.spark.sql.Row
// Each Row carries its col3 values as a nested sequence of Rows
val rddNested = sc.parallelize(Array(
  Row("a1", "b1", Seq(Row("c1_1", "c2_1"),
                      Row("c1_2", "c2_2"),
                      Row("c1_3", "c2_3"))),
  Row("a2", "b2", Seq(Row("c1_2", "c2_2"),
                      Row("c1_3", "c2_3"),
                      Row("c1_4", "c2_4")))), 2)
59
Mistake # 4
60
Out of luck?
• Do you ever run out of memory?
• Do you ever have more than 20 stages?
• Is your driver doing a lot of work?
61
Mistake – DAG Management
• Shuffles are to be avoided
• ReduceByKey over GroupByKey
• TreeReduce over Reduce
• Use Complex/Nested Types
62
Mistake – DAG Management: Shuffles
• Map Side reduction, where possible
• Think about partitioning/bucketing ahead of time
• Do as much as possible with a single shuffle
• Only send what you have to send
• Avoid Skew and Cartesians
63
ReduceByKey over GroupByKey
• ReduceByKey can do almost anything that GroupByKey can do
• Aggregations
• Windowing
• Use memory
• But you have more control
• ReduceByKey has a fixed limit on memory requirements
• GroupByKey is unbounded and dependent on the data
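The difference can be illustrated by counting how many records would cross the shuffle in each style. This is a simulation with plain Scala collections, not Spark code: each inner `Seq` stands in for one partition, and `CombineBeforeShuffle` and its methods are hypothetical names.

```scala
object CombineBeforeShuffle {
  type Partition = Seq[(String, Int)]

  // groupByKey style: every record crosses the shuffle untouched.
  def shuffledByGroup(partitions: Seq[Partition]): Int = partitions.map(_.size).sum

  // reduceByKey style: each partition is reduced locally first, so at most one
  // record per key per partition crosses the shuffle.
  def shuffledByReduce(partitions: Seq[Partition]): Int =
    partitions.map(p => p.map(_._1).distinct.size).sum

  // Local combine per partition, then a final merge of the partial sums.
  def sumByKey(partitions: Seq[Partition]): Map[String, Int] = {
    val combined = partitions.flatMap { p =>
      p.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq
    }
    combined.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val partitions: Seq[Partition] = Seq(
      Seq.fill(500)(("a", 1)) ++ Seq.fill(500)(("b", 1)),
      Seq.fill(1000)(("a", 1)))
    println(shuffledByGroup(partitions))  // 2000 records cross the shuffle
    println(shuffledByReduce(partitions)) // 3 records cross the shuffle
    println(sumByKey(partitions))
  }
}
```

The grouped path scales with the number of records, whereas the reduced path scales with the number of distinct keys per partition. That matches the slide's point about bounded versus data-dependent memory use.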
64
TreeReduce over Reduce
• TreeReduce & Reduce return some result to driver
• TreeReduce does more work on the executors
• While Reduce brings everything back to the driver
[Diagram: Reduce, with four partitions feeding the driver (labeled 100%). TreeReduce, with four partitions (labeled 25% each) combined before the driver.]
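The idea can be simulated outside Spark by merging partial results level by level before the last step. This is a sketch of the concept only: it uses a hypothetical `fanIn` parameter, whereas Spark's own `treeReduce` is configured by a tree depth, and the names `TreeReduceSketch` and `treeReduce` here are illustrative.

```scala
object TreeReduceSketch {
  // Merges partial results in groups of `fanIn` per level (standing in for work done on
  // executors) until at most `fanIn` remain, then does the last merge "on the driver".
  // Returns the result and how many partials the driver had to merge itself.
  @annotation.tailrec
  def treeReduce(partials: Seq[Long], fanIn: Int)(f: (Long, Long) => Long): (Long, Int) = {
    require(fanIn >= 2 && partials.nonEmpty, "need fanIn >= 2 and at least one partial")
    if (partials.size <= fanIn) (partials.reduce(f), partials.size)
    else treeReduce(partials.grouped(fanIn).map(_.reduce(f)).toSeq, fanIn)(f)
  }

  def main(args: Array[String]): Unit = {
    val partials = (1L to 16L).toSeq                 // one partial result per partition
    println((partials.reduce(_ + _), partials.size)) // plain reduce: driver merges all 16
    println(treeReduce(partials, fanIn = 2)(_ + _))  // tree reduce: driver merges only 2
  }
}
```

Both calls produce the same sum. What changes is how much of the merging lands on the driver, which matters when each partial result is large.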
65
Complex Types
• Top N List
• Multiple types of Aggregations
• Windowing operations
• All in one pass
66
Complex Types
• Think outside of the box: use objects to reduce by
• (Make something simple)
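One way to read "use objects to reduce by" is to carry several aggregations in a single accumulator object, so that a count, a sum and a top-N list come out of one pass. The sketch below uses plain Scala collections; `OnePassStats` and `Stats` are hypothetical names, not part of Spark.

```scala
object OnePassStats {
  // One accumulator object carries several aggregations at once.
  final case class Stats(count: Long, sum: Long, topN: List[Long]) {
    // Folds a single value into the accumulator.
    def add(v: Long, n: Int): Stats =
      Stats(count + 1, sum + v, (v :: topN).sorted(Ordering[Long].reverse).take(n))
    // Combines two accumulators, as a reduce over partitions would.
    def merge(o: Stats, n: Int): Stats =
      Stats(count + o.count, sum + o.sum, (topN ++ o.topN).sorted(Ordering[Long].reverse).take(n))
  }
  val Empty: Stats = Stats(0L, 0L, Nil)

  def main(args: Array[String]): Unit = {
    val n = 3
    // Two "partitions", folded independently and then merged.
    val left = Seq(5L, 1L, 9L).foldLeft(Empty)((acc, v) => acc.add(v, n))
    val right = Seq(7L, 3L).foldLeft(Empty)((acc, v) => acc.add(v, n))
    println(left.merge(right, n)) // Stats(5,25,List(9, 7, 5))
  }
}
```

Because `merge` is associative, an object like this could serve as the value type in a reduce, giving all three results with a single shuffle.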
67
Mistake # 5
68
Ever seen this?
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at…....
69
But!
• I already included protobuf in my app’s Maven dependencies?
70
Ah!
• My protobuf version doesn’t match with Spark’s protobuf version!
71
Shading
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  ...
  <relocations>
    <relocation>
      <pattern>com.google.protobuf</pattern>
      <shadedPattern>com.company.my.protobuf</shadedPattern>
    </relocation>
  </relocations>
  ...
</plugin>
72
Future of shading
• Spark 2.0 has some libraries shaded
• Guava is fully shaded
73
Summary
74
5 Mistakes
• Size up your executors right
• 2 GB limit on Spark shuffle blocks
• Evil thing about skew and cartesians
• Learn to manage your DAG, yo!
• Do shady stuff, don’t let classpath leaks mess you up
75
THANK YOU.
tiny.cloudera.com/spark-mistakes
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska

Architecting next generation big data platform
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Streaming architecture patterns
Streaming architecture patternsStreaming architecture patterns
Streaming architecture patterns
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Hadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorialHadoop application architectures - Fraud detection tutorial
Hadoop application architectures - Fraud detection tutorial
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
Architecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an exampleArchitecting application with Hadoop - using clickstream analytics as an example
Architecting application with Hadoop - using clickstream analytics as an example
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Hadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata LondonHadoop Application Architectures tutorial - Strata London
Hadoop Application Architectures tutorial - Strata London
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 

Recently uploaded

UNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -ManufactUNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -Manufact
Mr.C.Dineshbabu
 
Cyber security detailed ppt and understand
Cyber security detailed ppt and understandCyber security detailed ppt and understand
Cyber security detailed ppt and understand
docpain605501
 
The Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offerThe Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offer
ekyhonz
 
internship project presentation for reference.pptx
internship project presentation for reference.pptxinternship project presentation for reference.pptx
internship project presentation for reference.pptx
SaieJadhav1
 
Sea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy ResourcesSea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy Resources
21h16charis
 
Comerica Inc Annual report summary and financial updates
Comerica Inc Annual report summary and financial updatesComerica Inc Annual report summary and financial updates
Comerica Inc Annual report summary and financial updates
ssuserb8b8c7
 
r4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdf
r4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdfr4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdf
r4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdf
ArunKumar750226
 
Predicting damage in notched functionally graded materials plates thr...
Predicting  damage  in  notched  functionally  graded  materials  plates  thr...Predicting  damage  in  notched  functionally  graded  materials  plates  thr...
Predicting damage in notched functionally graded materials plates thr...
Barhm Mohamad
 
Human_assault project using jetson nano new
Human_assault project using jetson nano newHuman_assault project using jetson nano new
Human_assault project using jetson nano new
frostflash010
 
Defect Elimination Management - CMMS Success.pdf
Defect Elimination Management - CMMS Success.pdfDefect Elimination Management - CMMS Success.pdf
Defect Elimination Management - CMMS Success.pdf
David Johnston
 
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdfAC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
AMITKUMAR948425
 
software engineering software engineering
software engineering software engineeringsoftware engineering software engineering
software engineering software engineering
PrabhuB33
 
Introduction to Power System Engingeering
Introduction to Power System EngingeeringIntroduction to Power System Engingeering
Introduction to Power System Engingeering
Zamir Fatemi
 
FIRE OUTBREAK EMERGENCY FINAL REPORT.pdf
FIRE OUTBREAK EMERGENCY FINAL REPORT.pdfFIRE OUTBREAK EMERGENCY FINAL REPORT.pdf
FIRE OUTBREAK EMERGENCY FINAL REPORT.pdf
Dar es Salaam, Tanzania
 
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdfR18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
bibej11828
 
sensor networks unit wise 4 ppt units ppt
sensor networks unit wise 4  ppt units pptsensor networks unit wise 4  ppt units ppt
sensor networks unit wise 4 ppt units ppt
sarikasatya
 
Structural Dynamics and Earthquake Engineering
Structural Dynamics and Earthquake EngineeringStructural Dynamics and Earthquake Engineering
Structural Dynamics and Earthquake Engineering
tushardatta
 
CATIA V5 Automation VB script.........pdf
CATIA V5 Automation VB script.........pdfCATIA V5 Automation VB script.........pdf
CATIA V5 Automation VB script.........pdf
shahidad729
 
Design and Engineering Module 1 power point
Design and Engineering Module 1 power pointDesign and Engineering Module 1 power point
Design and Engineering Module 1 power point
ssuser76af31
 
Updated Limitations of Simplified Methods for Evaluating the Potential for Li...
Updated Limitations of Simplified Methods for Evaluating the Potential for Li...Updated Limitations of Simplified Methods for Evaluating the Potential for Li...
Updated Limitations of Simplified Methods for Evaluating the Potential for Li...
Robert Pyke
 

Recently uploaded (20)

UNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -ManufactUNIT-I-METAL CASTING PROCESSES -Manufact
UNIT-I-METAL CASTING PROCESSES -Manufact
 
Cyber security detailed ppt and understand
Cyber security detailed ppt and understandCyber security detailed ppt and understand
Cyber security detailed ppt and understand
 
The Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offerThe Pennsylvania State University degree Cert diploma offer
The Pennsylvania State University degree Cert diploma offer
 
internship project presentation for reference.pptx
internship project presentation for reference.pptxinternship project presentation for reference.pptx
internship project presentation for reference.pptx
 
Sea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy ResourcesSea Wave Energy - Renewable Energy Resources
Sea Wave Energy - Renewable Energy Resources
 
Comerica Inc Annual report summary and financial updates
Comerica Inc Annual report summary and financial updatesComerica Inc Annual report summary and financial updates
Comerica Inc Annual report summary and financial updates
 
r4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdf
r4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdfr4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdf
r4.OReilly.Hadoop.The.Definitive.Guide.4th.Edition.2015.pdf
 
Predicting damage in notched functionally graded materials plates thr...
Predicting  damage  in  notched  functionally  graded  materials  plates  thr...Predicting  damage  in  notched  functionally  graded  materials  plates  thr...
Predicting damage in notched functionally graded materials plates thr...
 
Human_assault project using jetson nano new
Human_assault project using jetson nano newHuman_assault project using jetson nano new
Human_assault project using jetson nano new
 
Defect Elimination Management - CMMS Success.pdf
Defect Elimination Management - CMMS Success.pdfDefect Elimination Management - CMMS Success.pdf
Defect Elimination Management - CMMS Success.pdf
 
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdfAC-AC Traction system of Indian Railway HHP Locomotives.pdf
AC-AC Traction system of Indian Railway HHP Locomotives.pdf
 
software engineering software engineering
software engineering software engineeringsoftware engineering software engineering
software engineering software engineering
 
Introduction to Power System Engingeering
Introduction to Power System EngingeeringIntroduction to Power System Engingeering
Introduction to Power System Engingeering
 
FIRE OUTBREAK EMERGENCY FINAL REPORT.pdf
FIRE OUTBREAK EMERGENCY FINAL REPORT.pdfFIRE OUTBREAK EMERGENCY FINAL REPORT.pdf
FIRE OUTBREAK EMERGENCY FINAL REPORT.pdf
 
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdfR18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
R18B.Tech.OpenElectivesWEF2021_22AdmittedBatch1.pdf
 
sensor networks unit wise 4 ppt units ppt
sensor networks unit wise 4  ppt units pptsensor networks unit wise 4  ppt units ppt
sensor networks unit wise 4 ppt units ppt
 
Structural Dynamics and Earthquake Engineering
Structural Dynamics and Earthquake EngineeringStructural Dynamics and Earthquake Engineering
Structural Dynamics and Earthquake Engineering
 
CATIA V5 Automation VB script.........pdf
CATIA V5 Automation VB script.........pdfCATIA V5 Automation VB script.........pdf
CATIA V5 Automation VB script.........pdf
 
Design and Engineering Module 1 power point
Design and Engineering Module 1 power pointDesign and Engineering Module 1 power point
Design and Engineering Module 1 power point
 
Updated Limitations of Simplified Methods for Evaluating the Potential for Li...
Updated Limitations of Simplified Methods for Evaluating the Potential for Li...Updated Limitations of Simplified Methods for Evaluating the Potential for Li...
Updated Limitations of Simplified Methods for Evaluating the Potential for Li...
 

Top 5 mistakes when writing Spark applications

  • 1. Top 5 mistakes when writing Spark applications
      tiny.cloudera.com/spark-mistakes
      Mark Grover | Software Engineer, Cloudera | @mark_grover
      Ted Malaska | Technical Group Architect, Blizzard | @TedMalaska
  • 2. About the book
      • @hadooparchbook
      • hadooparchitecturebook.com
      • github.com/hadooparchitecturebook
      • slideshare.com/hadooparchbook
  • 4. Mistakes we’ve made when using Spark
  • 7. # Executors, cores, memory !?!
      • 6 nodes
      • 16 cores each
      • 64 GB of RAM each
  • 8. Decisions, decisions, decisions
      • Number of executors (--num-executors)
      • Cores for each executor (--executor-cores)
      • Memory for each executor (--executor-memory)
      • Cluster: 6 nodes, 16 cores each, 64 GB of RAM each
  • 10. Answer #1 – Most granular
      • Have the smallest executors possible
      • 1 core each
      • 64 GB/node / 16 executors/node = 4 GB/executor
      • Total of 16 cores x 6 nodes = 96 cores => 96 executors
      • Diagram: one worker node holding Executor 1 through Executor 16
  • 12. Why?
      • Answer #1 does not use the benefits of running multiple tasks in the same executor
  • 13. Answer #2 – Least granular
      • 6 executors in total => 1 executor per node
      • 64 GB memory each
      • 16 cores each
      • Diagram: one worker node holding a single Executor 1
  • 15. Why?
      • Answer #2 leaves nothing for them: some memory overhead is needed for the OS and Hadoop daemons
  • 16. Answer #3 – with overhead
      • 6 executors – 1 executor/node
      • 63 GB memory each
      • 15 cores each
      • Diagram: one worker node holding Executor 1 plus overhead (1 GB, 1 core)
  • 18. Let’s assume…
      • You are running Spark on YARN from here on
  • 19. 3 things
      • 3 other things to keep in mind
  • 20. #1 – Memory overhead
      • --executor-memory controls the heap size
      • Some overhead is needed for off-heap memory (controlled by spark.yarn.executor.memoryOverhead)
      • Default is max(384 MB, 0.07 * spark.executor.memory)
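To make the default on slide 20 concrete, here is a minimal Scala sketch of the formula. The object and method names are illustrative and not part of Spark, and the 0.07 factor is simply the default quoted on the slide; other Spark versions may use a different value.

```scala
object MemoryOverhead {
  // Default off-heap overhead as quoted on the slide:
  // max(384 MB, 7% of the executor heap).
  def overheadMb(executorMemoryMb: Long): Long =
    math.max(384L, (0.07 * executorMemoryMb).toLong)

  // Memory YARN has to grant per executor container: heap plus overhead.
  def containerMb(executorMemoryMb: Long): Long =
    executorMemoryMb + overheadMb(executorMemoryMb)

  def main(args: Array[String]): Unit = {
    println(containerMb(2048L))  // small heap: the 384 MB floor applies
    println(containerMb(19456L)) // 19 GB heap: the 7% term applies
  }
}
```

The point of the sketch is that the container request is always larger than the value passed to --executor-memory, so the heap has to be sized with that gap in mind.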
  • 21. #2 – YARN AM needs a core: Client mode
  • 22. #2 – YARN AM needs a core: Cluster mode
  • 23. #3 – HDFS throughput
      • 15 cores per executor can lead to bad HDFS I/O throughput
      • Best is to keep to 5 cores or fewer per executor
  • 24. Calculations
      • 5 cores per executor, for max HDFS throughput
      • Cluster has 6 * 15 = 90 cores in total (after taking out Hadoop/YARN daemon cores)
      • 90 cores / 5 cores/executor = 18 executors
      • Each node has 3 executors
      • 63 GB / 3 = 21 GB; 21 x (1 - 0.07) ~ 19 GB
      • 1 executor for AM => 17 executors
      • Diagram: one worker node holding Executor 1, Executor 2, Executor 3 and overhead
  • 25. Correct answer
      • 17 executors in total
      • 19 GB memory/executor
      • 5 cores/executor
      • Note: not etched in stone
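The arithmetic on slides 24 and 25 can be sketched as a small function. This is only an illustration under the slides' assumptions (1 core and 1 GB reserved per node, 5 cores per executor, a 7% overhead fraction, one executor slot given up for the YARN AM); `ExecutorSizing` and `Plan` are hypothetical names.

```scala
object ExecutorSizing {
  final case class Plan(numExecutors: Int, coresPerExecutor: Int, memoryGbPerExecutor: Int)

  def plan(nodes: Int, coresPerNode: Int, memGbPerNode: Int,
           coresPerExecutor: Int = 5, overheadFraction: Double = 0.07): Plan = {
    val usableCores = coresPerNode - 1   // leave 1 core per node for OS/Hadoop daemons
    val usableMemGb = memGbPerNode - 1   // leave 1 GB per node for the same reason
    val executorsPerNode = usableCores / coresPerExecutor
    val containerGb = usableMemGb / executorsPerNode
    // Shrink the heap so that heap plus off-heap overhead fits in the container.
    val heapGb = (containerGb * (1 - overheadFraction)).toInt
    val totalExecutors = executorsPerNode * nodes - 1 // one slot goes to the YARN AM
    Plan(totalExecutors, coresPerExecutor, heapGb)
  }

  def main(args: Array[String]): Unit =
    println(plan(nodes = 6, coresPerNode = 16, memGbPerNode = 64)) // Plan(17,5,19)
}
```

As the slide says, the result is a starting point rather than a rule etched in stone; changing any of the assumptions above changes the plan.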
  • 26. Dynamic allocation helps with this though, right?
      • Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload
      • Works with Spark on YARN
  • 27. Decisions with dynamic allocation
      • Number of executors (--num-executors)
      • Cores for each executor (--executor-cores)
      • Memory for each executor (--executor-memory)
      • Cluster: 6 nodes, 16 cores each, 64 GB of RAM each
  • 28. Read more
      • From a great blog post on this topic by Sandy Ryza: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
  • 30. Application failure
      15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
          at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
          at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
          at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
          at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
          at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
          at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
          at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
          at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
  • 31. Why?
      • No Spark shuffle block can be greater than 2 GB
  • 32. Ok, what’s a shuffle block again?
      • In MapReduce terminology, a file written from one Mapper for a Reducer
      • The Reducer makes a local copy of this file (reducer local copy) and then ‘reduces’ it
  • 33. Defining shuffle and partition
      • In the slide’s diagram, each yellow arrow represents a shuffle block and each blue block is a partition
  • 34. Once again
      • Overflow exception if shuffle block size > 2 GB
  • 35. What’s going on here?
      • Spark uses ByteBuffer as the abstraction for blocks: val buf = ByteBuffer.allocate(length.toInt)
      • ByteBuffer is limited by Integer.MAX_VALUE (2 GB)!
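The limit on slide 35 comes from `java.nio.ByteBuffer` taking an `Int` capacity. The following sketch shows, outside of Spark, what happens to a block length that does not fit; `BlockSizeLimit` and `fitsInByteBuffer` are made-up names for illustration and are not Spark code.

```scala
import java.nio.ByteBuffer

object BlockSizeLimit {
  // ByteBuffer.allocate takes an Int, so the largest possible capacity
  // is Int.MaxValue bytes (just under 2 GiB).
  val MaxBlockBytes: Long = Int.MaxValue.toLong

  def fitsInByteBuffer(lengthBytes: Long): Boolean =
    lengthBytes >= 0 && lengthBytes <= MaxBlockBytes

  def main(args: Array[String]): Unit = {
    val threeGiB = 3L * 1024 * 1024 * 1024
    println(fitsInByteBuffer(threeGiB)) // false
    println(threeGiB.toInt)             // -1073741824: the narrowing conversion wraps around
    val ok = ByteBuffer.allocate(1024)  // a small block is fine
    println(ok.capacity())              // 1024
  }
}
```

A length above the limit either wraps to a nonsensical value on `toInt` or is rejected by the underlying API, which is why the only practical fix is to keep each block well below 2 GB.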
  • 36. Spark SQL
      • Especially problematic for Spark SQL
      • Default number of partitions to use when doing shuffles is 200
          – This low number of partitions leads to high shuffle block size
  • 37. Umm, ok, so what can I do?
      1. Increase the number of partitions
          – Thereby reducing the average partition size
      2. Get rid of skew in your data
          – More on that later
  • 38. Umm, how exactly?
      • In Spark SQL, increase the value of spark.sql.shuffle.partitions
      • In regular Spark applications, use rdd.repartition() or rdd.coalesce() (the latter to reduce the number of partitions, if needed)
  • 39. But, how many partitions should I have?
      • Rule of thumb is around 128 MB per partition
  • 40. But! There’s more!
      • Spark uses a different data structure for bookkeeping during shuffles when the number of partitions is more than 2000 than when it is 2000 or fewer
  • 41. Don’t believe me?
      • In MapStatus.scala:
          def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
            if (uncompressedSizes.length > 2000) {
              HighlyCompressedMapStatus(loc, uncompressedSizes)
            } else {
              new CompressedMapStatus(loc, uncompressedSizes)
            }
          }
  • 42. Ok, so what are you saying?
      • If the number of partitions is below 2000, but not by much, bump it to slightly above 2000
  • 43. Can you summarize, please?
      • Don’t have too big partitions
          – Your job will fail due to the 2 GB limit
      • Don’t have too few partitions
          – Your job will be slow, not making use of parallelism
      • Rule of thumb: ~128 MB per partition
      • If #partitions < 2000, but close, bump to just > 2000
      • Track SPARK-6235 for removing various 2 GB limits
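The two rules of thumb in the summary can be combined into one helper. This is a sketch, not a Spark API: `PartitionCount` is a hypothetical name, and because the slides do not define "close", the `closeFraction` parameter (within 10% below the threshold) is an assumption made purely for illustration.

```scala
object PartitionCount {
  val TargetPartitionBytes: Long = 128L * 1024 * 1024 // ~128 MB per partition
  val Threshold: Int = 2000                           // boundary checked in MapStatus.scala

  def suggest(shuffleBytes: Long, closeFraction: Double = 0.9): Int = {
    // Round up so that no partition exceeds the target size.
    val bySize =
      math.max(1L, (shuffleBytes + TargetPartitionBytes - 1) / TargetPartitionBytes).toInt
    val closeFloor = (Threshold * closeFraction).toInt
    // Just under the threshold: bump to just above it.
    if (bySize >= closeFloor && bySize <= Threshold) Threshold + 1 else bySize
  }

  def main(args: Array[String]): Unit = {
    val gib = 1024L * 1024 * 1024
    println(suggest(100L * gib)) // 800
    println(suggest(240L * gib)) // 2001 (1920 by size, bumped past the threshold)
  }
}
```

The returned number could then be used as the value for spark.sql.shuffle.partitions or passed to rdd.repartition(), as described on slide 38.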
  • 45. Slow jobs on Join/Shuffle
      • Your dataset takes 20 seconds to run over with a map job, but takes 4 hours when joined or shuffled. What’s wrong?
  • 46. Mistake – Skew
      • Diagram labels: Single Thread (repeated for each worker), Normal Distributed
      • The Holy Grail of Distributed Systems
  • 47. Mistake – Skew
      • Diagram labels: Single Thread, Normal Distributed
      • What about skew, because that is a thing
  • 48. Mistake – Skew: Answers
      • Salting
      • Isolated Salting
      • Isolated Map Joins
  • 49. Mistake – Skew: Salting
      • Normal key: “Foo”
      • Salted key: “Foo” + random.nextInt(saltFactor)
  • 53. Mistake – Skew: Salting
      • Two-stage aggregation
          – Stage one does the operations on the salted keys
          – Stage two does the operation across the unsalted key results
      • Flow: Data Source -> Map (convert to salted key & value tuple) -> Reduce by salted key -> Map (convert results to key & value tuple) -> Reduce by key -> Results
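The two-stage aggregation can be sketched with plain Scala collections standing in for an RDD. Two deliberate deviations from the slide, both assumptions of this sketch: the salt is kept as a separate tuple field rather than concatenated onto the string (so it can be stripped unambiguously), and `groupBy` plus `sum` stands in for a reduce-by-key. `SaltedAggregation` is a hypothetical name.

```scala
import scala.util.Random

object SaltedAggregation {
  def countByKey(records: Seq[String], saltFactor: Int, rng: Random): Map[String, Long] = {
    // Map: convert each record to a (salted key, value) tuple.
    val salted: Seq[((String, Int), Long)] =
      records.map(key => ((key, rng.nextInt(saltFactor)), 1L))

    // Stage one: reduce by salted key, so a hot key is spread over saltFactor groups.
    val stageOne: Map[(String, Int), Long] =
      salted.groupBy(_._1).map { case (saltedKey, rows) => saltedKey -> rows.map(_._2).sum }

    // Stage two: drop the salt and reduce the (few) partial results by the original key.
    stageOne.toSeq
      .map { case ((key, _), partial) => (key, partial) }
      .groupBy(_._1)
      .map { case (key, partials) => key -> partials.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    val records = Seq.fill(1000)("Foo") ++ Seq.fill(3)("Bar")
    println(countByKey(records, saltFactor = 8, rng = new Random(42)))
  }
}
```

The final counts do not depend on the random salts; the salts only change how the work for a hot key such as "Foo" is divided in stage one.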
  • 54. Mistake – Skew: Isolated Salting
      • Second stage only required for isolated keys
      • Diagram boxes: Data Source; Map (convert to key & value); Isolate key and convert to salted key & value tuple; Reduce by key & salted key; Filter isolated keys from salted keys; Map (convert results to key & value tuple); Reduce by key; Union to results
  • 55. Mistake – Skew: Isolated Map Join
      • Filter out isolated keys and use map join/aggregate on those
      • And normal reduce on the rest of the data
      • This can remove a large amount of data being shuffled
      • Diagram boxes: Data Source; Filter normal keys from isolated keys; Reduce by normal key; Map join for isolated keys; Union to results
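A rough sketch of the isolated map join idea, again using in-memory collections instead of RDDs. It assumes the skewed ("isolated") keys are already known and that the lookup side has one row per isolated key; in a real Spark job the small lookup map would typically be shipped to the executors rather than shuffled. All names here are hypothetical.

```scala
object IsolatedMapJoin {
  def join(facts: Seq[(String, Int)],
           dims: Seq[(String, String)],
           hotKeys: Set[String]): Seq[(String, Int, String)] = {
    // Filter isolated (hot) keys away from the normal keys.
    val (hotFacts, normalFacts) = facts.partition { case (k, _) => hotKeys.contains(k) }

    // Map join for isolated keys: a small in-memory lookup, no grouping of the big side.
    val hotLookup: Map[String, String] =
      dims.filter { case (k, _) => hotKeys.contains(k) }.toMap
    val hotJoined = hotFacts.flatMap { case (k, v) => hotLookup.get(k).map(d => (k, v, d)) }

    // Normal path: group the remaining rows by key (a stand-in for the shuffle join).
    val dimsByKey = dims.filterNot { case (k, _) => hotKeys.contains(k) }.groupBy(_._1)
    val normalJoined = normalFacts.flatMap { case (k, v) =>
      dimsByKey.getOrElse(k, Seq.empty).map { case (_, d) => (k, v, d) }
    }

    // Union to results.
    hotJoined ++ normalJoined
  }

  def main(args: Array[String]): Unit = {
    val facts = Seq(("hot", 1), ("hot", 2), ("a", 3))
    val dims = Seq(("hot", "H"), ("a", "A"))
    println(join(facts, dims, hotKeys = Set("hot")))
  }
}
```

Only the rows with normal keys go through the grouping step, which is where the reduction in shuffled data on the slide comes from.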
  • 56. Managing Parallelism: Cartesian Join
      • Diagram: three map tasks, each writing Shuffle Tmp 1 to Shuffle Tmp 4, read by four reduce tasks
      • Amount of data: 10x, 100x, 1000x, 10000x, 100000x, 1000000x or more
  • 57. Managing Parallelism
      • How to fight Cartesian join
          – Nested structures
      • Table X: (A, 1), (A, 2), (A, 3)
      • Table Y: (A, 4), (A, 5), (A, 6)
      • JOIN result: (A, 1, 4), (A, 2, 4), (A, 3, 4), (A, 1, 5), (A, 2, 5), (A, 3, 5), (A, 1, 6), (A, 2, 6), (A, 3, 6)
      • OR a nested structure: a single row for A holding (A, 1), (A, 2), (A, 3), (A, 4), (A, 5), (A, 6)
  • 58. Managing Parallelism
      • How to fight Cartesian join
          – Nested structures
      • Table definition:
          create table nestedTable (
            col1 string,
            col2 string,
            col3 array<struct<col3_1: string, col3_2: string>>)
      • Equivalent RDD:
          val rddNested = sc.parallelize(Array(
            Row("a1", "b1", Seq(Row("c1_1", "c2_1"), Row("c1_2", "c2_2"), Row("c1_3", "c2_3"))),
            Row("a2", "b2", Seq(Row("c1_2", "c2_2"), Row("c1_3", "c2_3"), Row("c1_4", "c2_4")))), 2)
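To illustrate the difference in row counts from slide 57, the sketch below compares a flat join with a nested structure using plain Scala collections. `NestedInsteadOfJoin` and `Nested` are hypothetical names, and the data is the small example from the slide.

```scala
object NestedInsteadOfJoin {
  final case class Nested(key: String, xs: Seq[Int], ys: Seq[Int])

  // Flat join: every X row for a key is paired with every Y row for that key.
  def flatJoin(x: Seq[(String, Int)], y: Seq[(String, Int)]): Seq[(String, Int, Int)] =
    for {
      (kx, vx) <- x
      (ky, vy) <- y if kx == ky
    } yield (kx, vx, vy)

  // Nested alternative: one row per key, with both value lists kept side by side.
  def nest(x: Seq[(String, Int)], y: Seq[(String, Int)]): Seq[Nested] = {
    val xs = x.groupBy(_._1)
    val ys = y.groupBy(_._1)
    (xs.keySet intersect ys.keySet).toSeq.sorted
      .map(k => Nested(k, xs(k).map(_._2), ys(k).map(_._2)))
  }

  def main(args: Array[String]): Unit = {
    val tableX = Seq(("A", 1), ("A", 2), ("A", 3))
    val tableY = Seq(("A", 4), ("A", 5), ("A", 6))
    println(flatJoin(tableX, tableY).size) // 9 rows
    println(nest(tableX, tableY))          // 1 row holding 3 + 3 values
  }
}
```

With n matching rows on each side, the flat join grows as n x n while the nested form grows as n + n, which is the blow-up the "Amount of Data" slide warns about.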
  • 60. Out of luck?
      • Do you ever run out of memory?
      • Do you ever have more than 20 stages?
      • Is your driver doing a lot of work?
  • 61. Mistake – DAG Management
      • Shuffles are to be avoided
      • ReduceByKey over GroupByKey
      • TreeReduce over Reduce
      • Use complex/nested types
  • 62. Mistake – DAG Management: Shuffles
      • Map-side reduction, where possible
      • Think about partitioning/bucketing ahead of time
      • Do as much as possible with a single shuffle
      • Only send what you have to send
      • Avoid skew and Cartesians
  • 63. ReduceByKey over GroupByKey
      • ReduceByKey can do almost anything that GroupByKey can do
      • Aggregations
      • Windowing
      • Use memory
      • But you have more control
      • ReduceByKey has a fixed limit on memory requirements
      • GroupByKey is unbounded and dependent on the data
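The memory argument on slide 63 can be illustrated without Spark. In the sketch below, the first function materializes every value for a key before aggregating (the GroupByKey style), while the second keeps a single running accumulator per key (the ReduceByKey style). These are local stand-ins with hypothetical names, not the Spark operators themselves.

```scala
object ReduceVsGroup {
  // GroupByKey style: all values for a key are held in memory at once,
  // so the footprint depends on how many values the largest key has.
  def sumViaGroup(pairs: Seq[(String, Long)]): Map[String, Long] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // ReduceByKey style: one accumulator per key, updated as records stream past,
  // so the footprint depends only on the number of distinct keys.
  def sumViaReduce(pairs: Iterator[(String, Long)]): Map[String, Long] =
    pairs.foldLeft(Map.empty[String, Long]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", 1L), ("b", 2L), ("a", 3L), ("a", 4L))
    println(sumViaGroup(data))
    println(sumViaReduce(data.iterator))
  }
}
```

Both functions return the same totals; the difference is only in how much has to be held in memory while getting there.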
  • 64. TreeReduce over Reduce
      • TreeReduce & Reduce return some result to the driver
      • TreeReduce does more work on the executors
      • While Reduce brings everything back to the driver
      • Diagram labels: Partition x4 -> Driver (100%); Partition x4 (25% each) -> Driver (4)
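The shape of a tree reduction can be sketched locally: partial results are combined pairwise, level by level, so the last step only sees a handful of values. This is a simplified simulation with a hypothetical name, not Spark's implementation; in Spark the intermediate levels would run on the executors.

```scala
object TreeReduceSketch {
  // Combine neighbouring partial results pairwise until one value remains.
  @scala.annotation.tailrec
  def treeReduce[A](partials: Seq[A])(combine: (A, A) => A): A = {
    require(partials.nonEmpty, "nothing to reduce")
    if (partials.size == 1) partials.head
    else treeReduce(partials.grouped(2).map(_.reduce(combine)).toVector)(combine)
  }

  def main(args: Array[String]): Unit = {
    val perPartitionSums = Seq(10, 20, 30, 40, 50)
    println(treeReduce(perPartitionSums)(_ + _)) // 150
  }
}
```

With a plain reduce, all five partial results would be merged in one place; here each level halves the number of values before the final merge.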
  • 65. Complex Types
      • Top N list
      • Multiple types of aggregations
      • Windowing operations
      • All in one pass
  • 66. Complex Types
      • Think outside of the box: use objects to reduce by
      • (Make something simple)
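One way to read "use objects to reduce by" is to reduce over a small case class that carries several aggregates at once. The sketch below computes a count, a sum and a top-N list in a single pass; `OnePassStats` and its members are hypothetical names, and the local `reduce` stands in for a distributed one.

```scala
object OnePassStats {
  // A complex value carrying several aggregates at the same time.
  final case class Stats(count: Long, sum: Long, topN: List[Long])

  def single(v: Long): Stats = Stats(1L, v, List(v))

  // Associative merge, so it can be used as the reduce function.
  def merge(a: Stats, b: Stats, n: Int): Stats =
    Stats(
      a.count + b.count,
      a.sum + b.sum,
      (a.topN ++ b.topN).sorted(Ordering[Long].reverse).take(n)
    )

  def summarize(values: Seq[Long], n: Int): Stats =
    values.map(single).reduce((a, b) => merge(a, b, n))

  def main(args: Array[String]): Unit =
    println(summarize(Seq(5L, 1L, 9L, 7L, 3L), n = 3)) // Stats(5,25,List(9, 7, 5))
}
```

Because every aggregate travels in the same object, the data is traversed (and, in a cluster, shuffled) once instead of once per metric.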
  • 68. Ever seen this?
      Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
          at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
          at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
          at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
          at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
          at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
          at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
          at …
  • 69. But!
      • I already included protobuf in my app’s Maven dependencies?
  • 70. Ah!
      • My protobuf version doesn’t match Spark’s protobuf version!
  • 72. Future of shading
      • Spark 2.0 has some libraries shaded
      • Guava is fully shaded
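For applications that still hit this kind of version conflict, shading relocates the application's copy of a library into a private package so it cannot clash with the version Spark brings. The fragment below is a sketch of what that might look like in a `build.sbt`, assuming the build uses the sbt-assembly plugin; the target package `myapp.shaded.protobuf` is a made-up name. Maven users would use the maven-shade-plugin's relocation feature to the same effect.

```scala
// build.sbt fragment (assumes sbt-assembly is enabled in project/plugins.sbt)
assembly / assemblyShadeRules := Seq(
  // Relocate the application's protobuf classes into a private package
  // (hypothetical name) so they do not collide with Spark's protobuf.
  ShadeRule.rename("com.google.protobuf.**" -> "myapp.shaded.protobuf.@1").inAll
)
```

The same pattern applies to other commonly conflicting libraries; only the package pattern on the left-hand side changes.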
  • 74. 5 Mistakes
      • Size up your executors right
      • 2 GB limit on Spark shuffle blocks
      • Evil thing about skew and Cartesians
      • Learn to manage your DAG, yo!
      • Do shady stuff, don’t let classpath leaks mess you up
  • 75. THANK YOU.
      tiny.cloudera.com/spark-mistakes
      Mark Grover | @mark_grover
      Ted Malaska | @TedMalaska