Project Tungsten
Advanced Apache Spark Meetup
Chris Fregly
Principal Data Solutions Engineer
Nov 12, 2015
Who Am I?

Streaming Data Engineer
Open Source Committer

Data Solutions Engineer

Apache Contributor
Principal Data Solutions Engineer
IBM Technology Center
Founder, Advanced Apache Spark Meetup
Due 2016
Advanced Apache Spark Meetup
Meetup Metrics
~1600 Members in just 4 mos!
4th Most Active Spark Meetup!!

Meetup Goals
  Dig deep into codebase of Spark and related projects
  Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface and share patterns and idioms of these well-designed, distributed, big data components
All Slides and Code Are Available!

Themes of this Talk
 Find Similarity
 Minimize Seeks
 Maximize Scans
 Customize for Workload
 Tune Performance At Every Layer
  Be Nice, Collaborate!
Like a Mom!!
①  Mechanical Sympathy
②  Recap of 100TB GraySort Challenge
③  Project Tungsten Deep Dive
Mechanical Sympathy
Hardware and software working together in harmony.

- Martin Thompson

Whatever your data structure, my array will beat it.

- Scott Meyers

 Every C++ Book, basically

Hair Sympathy

- Bruce Jenner
Spark and Mechanical Sympathy

(Spark 1.4-1.6+)
(Spark 1.1-1.2)
Minimize Memory and GC
Maximize CPU Cache Locality
Saturate Network I/O
Saturate Disk I/O
AlphaSort Technique: Sort 100 Bytes Recs
Dereference Not Required!

List [(Key, Pointer)]

Key is directly available for comparison

List [Pointer]

Must dereference key for comparison
Dereference for Key Comparison
CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: Not CPU Cache-line Friendly!

Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: CPU Cache-line Friendly!

Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU Cache-line Friendly!

* 4-byte pointers assume Compressed OOPs
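The 8-byte (key-prefix + pointer) layout can be sketched in Scala as below. This is a minimal illustration, not Spark's implementation: `KeyPrefixSort` and its helpers are hypothetical names, the "pointer" is just an index into an in-memory array of records, and every record is assumed to have at least 4 key bytes. Records whose prefixes tie would still need a full-key comparison, which this sketch skips (ties fall back to index order).

```scala
object KeyPrefixSort {
  // First 4 key bytes as an unsigned big-endian value (records are assumed to have >= 4 bytes).
  def keyPrefix(record: Array[Byte]): Long = {
    var prefix = 0L
    var i = 0
    while (i < 4) {
      prefix = (prefix << 8) | (record(i) & 0xFFL)
      i += 1
    }
    prefix
  }

  // Prefix in the high 32 bits, pointer (here: an array index) in the low 32 bits.
  // Flipping the sign bit makes signed Long ordering match unsigned prefix ordering.
  def pack(prefix: Long, pointer: Int): Long =
    ((prefix << 32) | (pointer & 0xFFFFFFFFL)) ^ Long.MinValue

  // The low 32 bits are untouched by the sign-bit flip.
  def pointerOf(packed: Long): Int = packed.toInt

  // Sorts one flat Array[Long]; no record is dereferenced during comparisons.
  def sortedPointers(records: Array[Array[Byte]]): Array[Int] = {
    val packed = Array.tabulate(records.length)(i => pack(keyPrefix(records(i)), i))
    java.util.Arrays.sort(packed)
    packed.map(pointerOf)
  }
}
```

Because the sort touches only a primitive `Array[Long]`, eight entries fit in one 64-byte cache line and comparisons never chase a pointer.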
Performance Comparison
Similar Trick: Direct Cache Access (DCA)
Pull out packet header alongside pointer to payload
CPU Cache Lines: Sequential vs. Random
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: matB is read column-wise (k varies fastest),

each step touches a different CPU cache line,

ineffective pre-fetching
CPU Cache Friendly Matrix Multiplication

// Transpose B (matBT has numColsB rows and numRowsB columns)
for (i <- 0 until numColsB)
  for (j <- 0 until numRowsB)
    matBT(i)(j) = matB(j)(i)

// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)
Good: Full CPU cache line, effective prefetching
OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
NEW: matBT references j before k
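A self-contained version of the two loops, as a rough sketch for experimenting outside the slides. `MatMulSketch`, `naive`, and `cacheFriendly` are hypothetical names; matrices are assumed to be non-empty and rectangular.

```scala
object MatMulSketch {
  type Matrix = Array[Array[Double]]

  // Naive: the innermost loop walks down a column of b, one row array per step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val res = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- b(0).indices; k <- b.indices)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Cache-friendly: transpose b once, then both operands are scanned sequentially.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val bT = Array.tabulate(b(0).length, b.length)((i, j) => b(j)(i))
    val res = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- bT.indices; k <- b.indices)
      res(i)(j) += a(i)(k) * bT(j)(k)
    res
  }
}
```

Both functions return the same product; only the memory access pattern of the second operand differs.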
Instrumenting and Monitoring CPU
Use Linux perf command!
Results of Matrix Multiply Comparison
Naïve Matrix Multiply
Cache-Friendly Matrix Multiply
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
JVM flags: -XX:+PreserveFramePointer -XX:-Inline
 55 hp
550 hp
Compare CPU Naïve & Cache-Friendly Matrix Multiplication
CPU Cache Naïve Tuple Counters
object CacheNaiveTupleIncrement {
  var tuple = (0, 0)

  def increment(leftIncrement: Int, rightIncrement: Int): (Int, Int) = {
    this.synchronized {
      // allocates a new Tuple2 on every call while holding the lock
      tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
      tuple
    }
  }
}
CPU Cache Naïve Case Class Counters
case class MyTuple(left: Int, right: Int)

object CacheNaiveCaseClassCounters {
  var tuple = new MyTuple(0, 0)

  def increment(leftIncrement: Int, rightIncrement: Int): MyTuple = {
    this.synchronized {
      // allocates a new MyTuple on every call while holding the lock
      tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement)
      tuple
    }
  }
}
CPU Cache Friendly Lock-Free Counters
import java.util.concurrent.atomic.AtomicLong

object CacheFriendlyLockFreeCounters {
  // a single Long (8 bytes) maintains 2 separate Ints (4 bytes each):
  // left counter in the high 32 bits, right counter in the low 32 bits
  val tuple = new AtomicLong()

  def increment(leftIncrement: Int, rightIncrement: Int): Long = {
    var originalLong = 0L
    var updatedLong = 0L
    do {
      originalLong = tuple.get()
      val originalRightInt = originalLong.toInt           // low 32 bits: right counter
      val originalLeftInt = (originalLong >>> 32).toInt   // high 32 bits: left counter
      val updatedRightInt = originalRightInt + rightIncrement // increment right counter
      val updatedLeftInt = originalLeftInt + leftIncrement    // increment left counter
      // pack both counters; mask the right counter so a negative value
      // cannot sign-extend into the left counter's bits
      updatedLong = (updatedLeftInt.toLong << 32) | (updatedRightInt & 0xFFFFFFFFL)
    } while (!tuple.compareAndSet(originalLong, updatedLong)) // retry if another thread won
    updatedLong
  }
}
Quiz: Why not @volatile?
Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
Results of Counters Comparison
Naïve Tuple Counters

Naïve Case Class Counters

Cache Friendly Lock-Free Counters
Profiling Visualizations: Flame Graphs
With Java Stack Traces!!
Example: Spark Word Count
Java Stack Traces are Good!

Plateaus are Bad!!
①  Mechanical Sympathy
②  Recap of 100TB GraySort Challenge
③  Project Tungsten Deep Dive
100TB GraySort Challenge
Sort 100TB of 100-Byte Records with 10-byte Keys
Custom Data Structs & Algos for Sort & Shuffle
Saturate Network and Disk I/O Controllers
100TB GraySort Challenge Results
Performance Goals
  Saturate Network I/O
  Saturate Disk I/O
(2013) (2014)
Winning Hardware Configuration

206 Workers, 1 Master (AWS EC2 i2.8xlarge)

32 Intel Xeon CPU E5-2670 @ 2.5 GHz

244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4

3 GBps mixed read/write disk I/O per node

AWS Placement Groups, VPC, Enhanced Networking

Single Root I/O Virtualization (SR-IOV)

10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
Winning Software Configuration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per CPU core

206 nodes * 32 cores = 6592 cores 

6592 cores * 4 = 26,368 partitions

6592 cores * 6 = 39,552 partitions

6592 cores * 4.25 ~= 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace

Required ~10s of sampling 79 keys from each partition
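The sampling step can be illustrated with a small range-partitioning sketch. It is not Spark's `RangePartitioner`; `RangePartitionSketch`, `bounds`, and `partitionOf` are hypothetical names, keys are simplified to `Long`, and the sample is assumed to be non-empty with distinct split points.

```scala
object RangePartitionSketch {
  // Pick numPartitions - 1 evenly spaced split points from a sorted sample of keys.
  def bounds(sample: Array[Long], numPartitions: Int): Array[Long] = {
    require(sample.nonEmpty && numPartitions >= 1)
    val sorted = sample.sorted
    Array.tabulate(numPartitions - 1) { i =>
      sorted(((i + 1).toLong * sorted.length / numPartitions).toInt)
    }
  }

  // Partition id = number of split points that are <= key, found by binary search.
  def partitionOf(key: Long, bounds: Array[Long]): Int = {
    val idx = java.util.Arrays.binarySearch(bounds, key)
    if (idx >= 0) idx + 1 else -(idx + 1)
  }
}
```

Because partition ids follow key order, concatenating the sorted partitions in id order yields a globally sorted output.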
New Sort Shuffle Manager for Spark 1.2
Original “hash-based” 
 New “sort-based”

①  Use less OS resources (socket buffers, file descriptors)
②  TimSort partitions in-memory
③  MergeSort partitions on-disk into a single master file
④  Serve partitions from master file: seek once, sequential scan
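Steps ③ and ④ can be sketched with an in-memory stand-in for the master file. This is only an illustration under simplifying assumptions: `SortShuffleSketch` is a hypothetical name, a byte array replaces the on-disk file, and partition ids are assumed to lie in `0 until numPartitions`.

```scala
import java.io.ByteArrayOutputStream

object SortShuffleSketch {
  // One contiguous data "file" plus an index with numPartitions + 1 offsets.
  final case class MapOutput(data: Array[Byte], offsets: Array[Int])

  def write(records: Seq[(Int, Array[Byte])], numPartitions: Int): MapOutput = {
    val sorted = records.sortBy(_._1) // stable sort by partition id
    val out = new ByteArrayOutputStream()
    val offsets = new Array[Int](numPartitions + 1)
    var next = 0 // next partition whose start offset has not been recorded yet
    for ((pid, bytes) <- sorted) {
      while (next <= pid) { offsets(next) = out.size(); next += 1 }
      out.write(bytes, 0, bytes.length)
    }
    // Remaining (empty) partitions and the end-of-file offset.
    while (next <= numPartitions) { offsets(next) = out.size(); next += 1 }
    MapOutput(out.toByteArray, offsets)
  }

  // "Seek once, scan sequentially": one offset lookup, then one contiguous read.
  def read(output: MapOutput, pid: Int): Array[Byte] =
    java.util.Arrays.copyOfRange(output.data, output.offsets(pid), output.offsets(pid + 1))
}
```

Each mapper produces one data file and one small index regardless of the number of reducers, which is what keeps file descriptor usage low.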
Asynchronous Network Module
Switch to asynchronous Netty vs. synchronous java.nio
Switch to zero-copy epoll

Use only kernel-space between disk and network controllers
Custom memory management

spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning

Reuse off-heap buffers (for example)

Increase to saturate hosts with multiple disks (8x800 SSD)
Details in SPARK-2468
Custom Algorithms and Data Structures
Optimized for sort & shuffle workloads

o.a.s.util.collection.TimSort[K,V]

Based on JDK 1.7 TimSort

Performs best with partially-sorted runs

Optimized for elements of (K,V) pairs

Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)

o.a.s.util.collection.AppendOnlyMap

Open addressing hash, quadratic probing

Array of [(K, V), (K, V)]

Good memory locality

Keys never removed, values only append
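The append-only, open-addressing idea can be sketched as follows. This is not the source of Spark's `AppendOnlyMap`; `AppendOnlySketch` is a hypothetical, fixed-capacity class (no growth or rehashing), and keys are assumed to be non-null.

```scala
object OpenAddressingSketch {
  // Fixed-capacity table; capacity must be a power of two so `& mask` replaces modulo.
  final class AppendOnlySketch[K, V](capacity: Int) {
    require(capacity > 1 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
    private val mask = capacity - 1
    // Interleaved layout: data(2 * i) holds the key, data(2 * i + 1) holds its value.
    private val data = new Array[Any](2 * capacity)
    private var size = 0

    // Returns the slot holding `key`, or the first empty slot on its probe sequence.
    private def slotFor(key: K): Int = {
      var pos = key.hashCode & mask
      var step = 1
      while (data(2 * pos) != null && data(2 * pos) != key) {
        pos = (pos + step) & mask // offsets 1, 3, 6, 10, ... from the home slot
        step += 1
      }
      pos
    }

    // Insert or update in place; keys are never removed.
    def changeValue(key: K, update: Option[V] => V): V = {
      val pos = slotFor(key)
      val isNew = data(2 * pos) == null
      if (isNew) {
        require(size + 1 < capacity, "table full; a real map would grow and rehash")
        data(2 * pos) = key
        size += 1
      }
      val old = if (isNew) None else Some(data(2 * pos + 1).asInstanceOf[V])
      val updated = update(old)
      data(2 * pos + 1) = updated
      updated
    }

    def get(key: K): Option[V] = {
      val pos = slotFor(key)
      if (data(2 * pos) == null) None else Some(data(2 * pos + 1).asInstanceOf[V])
    }
  }
}
```

Keeping each key next to its value in one array means a successful probe usually finds both in the same or an adjacent cache line.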
Daytona GraySort Challenge Goal Success

1.1 GBps/node network I/O (Reducers)

Theoretical max = 1.25 GBps for 10 Gbps Ethernet
3 GBps/node disk I/O (Mappers)

Aggregate Cluster Network I/O!
220 GBps / 206 nodes ~= 1.1 GBps per node
Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)

spark.shuffle.consolidateFiles (Mapper)

o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files

Increase spark.shuffle.file.buffer (Reducer)

Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors

Minimizes intermediate files and overall shuffle

More opportunity for PROCESS_LOCAL
SQL: BroadcastHashJoin vs. ShuffledHashJoin

spark.sql.autoBroadcastJoinThreshold

Use DataFrame.explain(true) or EXPLAIN to verify

Many Threads
(1 per CPU)
①  Mechanical Sympathy
②  Recap of 100TB GraySort Challenge
③  Project Tungsten Deep Dive
Project Tungsten
Data Structs & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
SPARK-7076 (Spark 1.4)
Quick Review of Project Tungsten Jiras

SPARK-7076 (Spark 1.4)
Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!

Network and Disk I/O bandwidth are relatively high

GraySort optimizations improved network & shuffle

Partitioning, pruning, and predicate pushdowns

Binary, compressed, columnar file formats (Parquet)
Yet Another Spark Shuffle Manager!
spark.shuffle.manager =

hash (Deprecated)

< 10,000 reducers

Output partition file hashes the key of (K,V) pair

Mapper creates an output file per partition 

Leads to M*P output files for all partitions

sort (GraySort Challenge)

> 10,000 reducers

Default from Spark 1.2-1.5

Mapper creates single output file for all partitions

Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory

Uses custom data structures and algorithms for sort-shuffle workload

Wins Daytona GraySort Challenge 

tungsten-sort (Project Tungsten)

Default since 1.5

Modification of existing sort-based shuffle

Uses sun.misc.Unsafe for self-managed memory and garbage collection

Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms

Perform joins, sorts, and other operators on both serialized and compressed byte buffers
CPU & Memory Optimizations
Custom Managed Memory

Reduces GC overhead

Both on and off heap

Exact size calculations
Direct Binary Processing

Operate on serialized/compressed arrays

Kryo can reorder/sort serialized records

LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms

o.a.s.sql.catalyst.expression.UnsafeRow

Code Generation (default in 1.5)

Generate source code from overall query plan

100+ UDFs converted to use code generation
Mostly Same Join Code, UnsafeProjection
Details in SPARK-7075
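sun.misc.Unsafe

Info: addressSize(), pageSize()

Objects: allocateInstance(), objectFieldOffset()

Classes: staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized()

Synchronization: monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt()

Arrays: arrayBaseOffset(), arrayIndexScale()

Memory: allocateMemory(), copyMemory(), freeMemory(), getAddress() (not guaranteed after GC), getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort(), getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble(), getObjectVolatile()/putObjectVolatile()

Used by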
Spark + sun.misc.Unsafe
joins.ShuffledHashOuterJoin (not yet converted)
unsafe.memory.MemoryAllocator (trait/interface)
Over 200 source
files affected!!
Traditional Java Object Row Layout
4-byte String

Multi-field Object

Custom Data Structures for Workload

(Dense Binary Row)

(Virtual Memory Address)

(Dense Binary HashMap)
Dense, 8-bytes per field (word-aligned)
AlphaSort-Style (Key + Pointer)
OS-Style Memory Paging
UnsafeRow Layout Example

Custom Memory Management

TaskMemoryManager & MemoryConsumer

Memory management: virtual memory allocation, paging

Off-heap: direct 64-bit address

On-heap: 13-bit page num + 27-bit page offset


64-bit word

(24-bit partition key, (13-bit page num, 27-bit page offset))


Primitive Array[Byte]
2^13 pages * 2^27 page size = 1 TB RAM per Task
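The 64-bit word above can be sketched with plain bit arithmetic. The field order (partition, page number, page offset from high to low bits) follows the slide; `PackedPointerSketch` and its method names are hypothetical and this is not the source of Spark's `PackedRecordPointer`.

```scala
object PackedPointerSketch {
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27 // 24 + 13 + 27 = 64 bits in total

  private val PageMask = (1L << PageBits) - 1
  private val OffsetMask = (1L << OffsetBits) - 1

  // Layout, high to low: 24-bit partition | 13-bit page number | 27-bit offset in page.
  def pack(partition: Int, page: Int, offset: Int): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits), "partition out of range")
    require(page >= 0 && page < (1 << PageBits), "page number out of range")
    require(offset >= 0 && offset < (1 << OffsetBits), "page offset out of range")
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset.toLong
  }

  def partition(packed: Long): Int = (packed >>> (PageBits + OffsetBits)).toInt
  def page(packed: Long): Int = ((packed >>> OffsetBits) & PageMask).toInt
  def offset(packed: Long): Int = (packed & OffsetMask).toInt

  // 2^13 pages * 2^27 bytes per page = 2^40 bytes = 1 TB addressable per task.
  val addressableBytes: Long = 1L << (PageBits + OffsetBits)
}
```

Sorting an array of such words by partition needs no object access at all, which is the same trick as the AlphaSort (key + pointer) layout.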


Uses BytesToBytesMap

In-place updates of serialized data

No object creation on hot-path

Improved external agg support

No OOMs for large, single key aggs


Combine 2 UnsafeRows into 1

TungstenAggregate & TungstenAggregationIterator

Operates directly on serialized, binary UnsafeRow

2 Steps: hash-based agg (grouping), then sort-based agg

Supports spilling and external merge sorting
Bitwise comparison on UnsafeRow

No need to calculate equals(), hashCode()

Row 1
Row 2
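Bitwise row comparison can be illustrated with a simplified dense row. This sketch is loosely modeled on the layout described above (a null-bitset word followed by 8 bytes per field) and is not Spark's `UnsafeRow`; `DenseRowSketch` is a hypothetical name, fields are restricted to nullable `Long` values, and at most 64 fields are supported.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

object DenseRowSketch {
  // Layout: one 8-byte null bitset word, then one 8-byte word per field.
  def encode(fields: Array[Option[Long]]): Array[Byte] = {
    require(fields.length <= 64, "this sketch uses a single null-bitset word")
    val buf = ByteBuffer.allocate(8 + 8 * fields.length)
    var nullBits = 0L
    for (i <- fields.indices if fields(i).isEmpty) nullBits |= (1L << i)
    buf.putLong(nullBits)
    // Null fields are written as 0 so that equal rows always have equal bytes.
    fields.foreach(f => buf.putLong(f.getOrElse(0L)))
    buf.array()
  }

  def isNull(row: Array[Byte], i: Int): Boolean =
    (ByteBuffer.wrap(row).getLong(0) & (1L << i)) != 0

  def getLong(row: Array[Byte], i: Int): Long =
    ByteBuffer.wrap(row).getLong(8 + 8 * i)

  // Equality and hashing work on raw bytes; no per-field equals()/hashCode() calls.
  def sameRow(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
  def rowHash(row: Array[Byte]): Int = Arrays.hashCode(row)
}
```

Two rows with the same field values always serialize to identical bytes here, so grouping keys can be compared and hashed without deserializing them.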
Surprisingly, not many code changes



Converts InternalRow to UnsafeRow





AlphaSort-Style Cache Friendly

2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat
simultaneously will prevent JIT inlining.
This affects sort & shuffle performance.
Supports merging compressed records
if compression CODEC supports it (LZF)
Efficient Spilling

Exact data size is known

No need to maintain heuristics & approximations

Controls amount of spilling
Spill merge on compressed, binary records!

If compression CODEC supports it

Exact Peak Memory
for Spark Jobs
Code Generation
Boxing causes excessive object creation 
Expensive expression tree evals per row
JVM can’t inline polymorphic impls
Codegen by-passes virtual function calls
Defer source code generation to each operator, UDF, UDAF
Use Scala quasiquote macros for Scala AST source code gen
Rewrite and optimize code for overall plan, 8-byte align, etc
Use Janino to compile generated source code into bytecode
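The motivation for code generation can be sketched by contrasting an interpreted expression tree with the code a generator would emit for it. Everything here is hypothetical (`CodegenSketch`, `Expr`, `genSource`) and far simpler than Spark's code generator; in particular, the "generated" function is written by hand rather than compiled with Janino, and columns are assumed to hold `Long` values.

```scala
object CodegenSketch {
  // Interpreted path: one virtual eval() call per node per row, with boxed values.
  sealed trait Expr { def eval(row: Array[Any]): Any }
  final case class Col(index: Int) extends Expr {
    def eval(row: Array[Any]): Any = row(index)
  }
  final case class Lit(value: Long) extends Expr {
    def eval(row: Array[Any]): Any = value
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def eval(row: Array[Any]): Any =
      left.eval(row).asInstanceOf[Long] + right.eval(row).asInstanceOf[Long]
  }
  final case class Mul(left: Expr, right: Expr) extends Expr {
    def eval(row: Array[Any]): Any =
      left.eval(row).asInstanceOf[Long] * right.eval(row).asInstanceOf[Long]
  }

  // Code generation path: each node contributes a source fragment for the whole expression.
  def genSource(e: Expr): String = e match {
    case Col(i)    => s"row($i)"
    case Lit(v)    => s"${v}L"
    case Add(l, r) => s"(${genSource(l)} + ${genSource(r)})"
    case Mul(l, r) => s"(${genSource(l)} * ${genSource(r)})"
  }

  // Hand-written equivalent of compiling genSource(Mul(Add(Col(0), Lit(1)), Col(1))):
  // primitive arithmetic only, no boxing and no virtual calls.
  def generated(row: Array[Long]): Long = (row(0) + 1L) * row(1)
}
```

The interpreted and generated paths compute the same value; the generated one is a single monomorphic method the JIT can inline.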
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Each Implements
Creating a Custom UDF with Codegen
Study existing implementations
Extend base trait

Register the function

Augment DataFrame with new UDF (Scala implicits)

Don’t forget about Python!

Who Benefits from Project Tungsten?
Users of DataFrames

All Spark SQL Queries


All RDDs

Serialization, Compression, and Aggregations
Performance Results
Query Time

OOM’d on
Large Dataset!
Thank You!!!
Chris Fregly @cfregly
IBM Spark Technology Center 
San Francisco, California
Relevant Links
Sign up for the book & global meetup!
Clone, contribute, and commit code!
Run all demos in your own environment with Docker!

  • 1. Power of data. Simplicity of design. Speed of innovation. IBM Spark Project Tungsten Advanced Apache Spark Meetup Chris Fregly Principal Data Solutions Engineer We’re Hiring - Only Nice People! Nov 12, 2015
  • 2. Who Am I? Streaming Data Engineer Open Source Committer
 Data Solutions Engineer
 Apache Contributor Principal Data Solutions Engineer IBM Technology Center Founder Advanced Apache Meetup Author Advanced . Due 2016 My Ma’s First Time in California
  • 3. Random Slide: More Ma “First Time” Pics In California Using Chopsticks Using “New” iPhone
  • 4. Upcoming Meetups and Conferences London Spark Meetup (Oct 12th) Scotland Data Science Meetup (Oct 13th) Dublin Spark Meetup (Oct 15th) Barcelona Spark Meetup (Oct 20th) Madrid Big Data Meetup (Oct 22nd) Paris Spark Meetup (Oct 26th) Amsterdam Spark Summit (Oct 27th) Brussels Spark Meetup (Oct 30th) Zurich Big Data Meetup (Nov 2nd) Geneva Spark Meetup (Nov 5th) San Francisco (Nov 10th) San Francisco Advanced Spark (Nov 12th) Oslo Big Data Hadoop Meetup (Nov 18th) Helsinki Spark Meetup (Nov 20th) Stockholm Spark Meetup (Nov 23rd) Copenhagen Spark Meetup (Nov 26th) Budapest Spark Meetup (Nov 27th) Singapore Strata Conference (Dec 1st) San Francisco Advanced Spark (Dec 8th) Mountain View Advanced Spark (Dec 10th) Toronto Spark Meetup (Dec 14th) Washington DC Spark Meetup (Jan 2016)
  • 5. Advanced Apache Spark Meetup Meetup Metrics ~1600 Members in just 4 mos! 4th Most Active Spark Meetup!! Meetup Goals Dig deep into codebase of Spark and related projects Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R Surface and share patterns and idioms of these well-designed, distributed, big data components THANKS TO ALL OF YOU!!
  • 6. All Slides and Code Are Available!
  • 7. Themes of this Talk Filter Off-Heap Parallelize Approximate Find Similarity Minimize Seeks Maximize Scans Customize for Workload Tune Performance At Every Layer Be Nice, Collaborate! Like a Mom!!
  • 8. Outline ① Mechanical Sympathy ② Recap of 100TB GraySort Challenge ③ Project Tungsten Deep Dive
  • 9. Mechanical Sympathy Hardware and software working together in harmony. - Martin Thompson Whatever your data structure, my array will beat it. - Scott Meyers Every C++ Book, basically Hair Sympathy - Bruce Jenner
  • 10. Spark and Mechanical Sympathy Project Tungsten (Spark 1.4-1.6+) GraySort Challenge (Spark 1.1-1.2) Minimize Memory and GC Maximize CPU Cache Locality Saturate Network I/O Saturate Disk I/O
  • 11. AlphaSort Technique: Sort 100 Bytes Recs Value Ptr Key Dereference Not Required! AlphaSort List [(Key, Pointer)] Key is directly available for comparison Naïve List [Pointer] Must dereference key for comparison Ptr Dereference for Key Comparison Key
  • 12. CPU Cache Line and Memory Sympathy Key (10 bytes) + Pointer (4 bytes*) = 14 bytes (*Compressed OOPs) Key Ptr Not CPU Cache-line Friendly! Ptr Key-Prefix 2x CPU Cache-line Friendly! Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes Key Ptr Pad /Pad CPU Cache-line Friendly!
  • 13. Performance Comparison
  • 14. Similar Trick: Direct Cache Access (DCA) Pull out packet header alongside pointer to payload
  • 15. CPU Cache Lines: Sequential vs. Random
  • 16. CPU Cache Naïve Matrix Multiplication // Dot product of each row & column vector for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res(i)(j) += matA(i)(k) * matB(k)(j) Bad: matB is read column-wise (k varies fastest), each step touches a different CPU cache line, ineffective pre-fetching
  • 17. CPU Cache Friendly Matrix Multiplication // Transpose B for (i <- 0 until numColsB) for (j <- 0 until numRowsB) matBT(i)(j) = matB(j)(i) // Modify dot product calculation for B Transpose for (i <- 0 until numRowsA) for (j <- 0 until numColsB) for (k <- 0 until numColsA) res(i)(j) += matA(i)(k) * matBT(j)(k) Good: Full CPU cache line, effective prefetching OLD: res(i)(j) += matA(i)(k) * matB(k)(j) Reference j before k
  • 18. Instrumenting and Monitoring CPU Use Linux perf command!
  • 19. Results of Matrix Multiply Comparison Naïve Matrix Multiply Cache-Friendly Matrix Multiply ~72x ~8x ~3x ~3x ~2x ~7x ~10x perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend (JVM flags: -XX:+PreserveFramePointer -XX:-Inline) ~10x 55 hp 550 hp
  • 20. Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication
  • 21. CPU Cache Naïve Tuple Counters object CacheNaiveTupleIncrement { var tuple = (0,0) … def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = { this.synchronized { tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement) tuple } } }
  • 22. CPU Cache Naïve Case Class Counters case class MyTuple(left: Int, right: Int) object CacheNaiveCaseClassCounters { var tuple = new MyTuple(0,0) … def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = { this.synchronized { tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement) tuple } } }
  • 23. CPU Cache Friendly Lock-Free Counters object CacheFriendlyLockFreeCounters { // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each) val tuple = new AtomicLong() … def increment(leftIncrement: Int, rightIncrement: Int) : Long = { var originalLong = 0L var updatedLong = 0L do { originalLong = tuple.get() val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter val updatedRightInt = originalRightInt + rightIncrement // increment right counter val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter updatedLong = (updatedLeftInt.toLong << 32) | (updatedRightInt & 0xFFFFFFFFL) // left counter in high bits, masked right counter in low bits } while (!tuple.compareAndSet(originalLong, updatedLong)) updatedLong } } Quiz: Why not @volatile?
  • 24. Demo! Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
  • 25. Results of Counters Comparison Naïve Tuple Counters Naïve Case Class Counters Cache Friendly Lock-Free Counters ~2x ~1.5x ~3.5x ~2x ~2x ~1.5x ~1.5x ~1.5x
  • 26. Profiling Visualizations: Flame Graphs With Java Stack Traces!! Example: Spark Word Count Java Stack Traces are Good! Plateaus are Bad!!
  • 27. Outline ① Mechanical Sympathy ② Recap of 100TB GraySort Challenge ③ Project Tungsten Deep Dive
  • 28. 100TB GraySort Challenge Sort 100TB of 100-Byte Records with 10-byte Keys Custom Data Structs & Algos for Sort & Shuffle Saturate Network and Disk I/O Controllers
  • 29. 100TB GraySort Challenge Results Performance Goals Saturate Network I/O Saturate Disk I/O (2013) (2014)
  • 30. Winning Hardware Configuration Compute 206 Workers, 1 Master (AWS EC2 i2.8xlarge) 32 Intel Xeon CPU E5-2670 @ 2.5 GHz 244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4 3 GBps mixed read/write disk I/O per node Network AWS Placement Groups, VPC, Enhanced Networking Single Root I/O Virtualization (SR-IOV) 10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
  • 31. Winning Software Configuration Spark 1.2, OpenJDK 1.7 Disable caching, compression, speculative execution, shuffle spill Force NODE_LOCAL task scheduling for optimal data locality HDFS 2.4.1 short-circuit local reads, 2x replication Empirically chose between 4-6 partitions per CPU core 206 nodes * 32 cores = 6592 cores 6592 cores * 4 = 26,368 partitions 6592 cores * 6 = 39,552 partitions 6592 cores * 4.25 ~= 28,000 partitions (empirical best) Range partitioning takes advantage of sequential keyspace Required ~10s of sampling 79 keys from each partition
  • 32. New Sort Shuffle Manager for Spark 1.2 Original “hash-based” New “sort-based” ① Use less OS resources (socket buffers, file descriptors) ② TimSort partitions in-memory ③ MergeSort partitions on-disk into a single master file ④ Serve partitions from master file: seek once, sequential scan
  • 33. Asynchronous Network Module Switch to asynchronous Netty vs. synchronous java.nio Switch to zero-copy epoll Use only kernel-space between disk and network controllers Custom memory management spark.shuffle.blockTransferService=netty Spark-Netty Performance Tuning Reuse off-heap buffers (for example) Increase to saturate hosts with multiple disks (8x800 SSD) Details in SPARK-2468
  • 34. Custom Algorithms and Data Structures Optimized for sort & shuffle workloads o.a.s.util.collection.TimSort[K,V] Based on JDK 1.7 TimSort Performs best with partially-sorted runs Optimized for elements of (K,V) pairs Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat) o.a.s.util.collection.AppendOnlyMap Open addressing hash, quadratic probing Array of [(K, V), (K, V)] Good memory locality Keys never removed, values only append
  • 35. Daytona GraySort Challenge Goal Success 1.1 GBps/node network I/O (Reducers) Theoretical max = 1.25 GBps for 10 Gbps Ethernet 3 GBps/node disk I/O (Mappers) Aggregate Cluster Network I/O! 220 GBps / 206 nodes ~= 1.1 GBps per node
  • 36. Shuffle Performance Tuning Tips Hash Shuffle Manager (Deprecated) spark.shuffle.consolidateFiles (Mapper) o.a.s.shuffle.FileShuffleBlockResolver Intermediate Files Increase spark.shuffle.file.buffer (Reducer) Increase spark.reducer.maxSizeInFlight if memory allows Use Smaller Number of Larger Executors Minimizes intermediate files and overall shuffle More opportunity for PROCESS_LOCAL SQL: BroadcastHashJoin vs. ShuffledHashJoin spark.sql.autoBroadcastJoinThreshold Use DataFrame.explain(true) or EXPLAIN to verify Many Threads (1 per CPU)
  • 37. Outline ① Mechanical Sympathy ② Recap of 100TB GraySort Challenge ③ Project Tungsten Deep Dive
  • 38. Project Tungsten Data Structs & Algos Operate Directly on Byte Arrays Maximize CPU Cache Locality, Minimize GC Utilize Dynamic Code Generation SPARK-7076 (Spark 1.4)
  • 39. Quick Review of Project Tungsten Jiras SPARK-7076 (Spark 1.4)
  • 40. Why is CPU the Bottleneck? CPU is used for serialization, hashing, compression! Network and Disk I/O bandwidth are relatively high GraySort optimizations improved network & shuffle Partitioning, pruning, and predicate pushdowns Binary, compressed, columnar file formats (Parquet)
  • 41. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark Yet Another Spark Shuffle Manager! spark.shuffle.manager = hash (Deprecated) < 10,000 reducers Output partition file hashes the key of (K,V) pair Mapper creates an output file per partition Leads to M*P output files for all partitions sort (GraySort Challenge) > 10,000 reducers Default from Spark 1.2-1.5 Mapper creates single output file for all partitions Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory Uses custom data structures and algorithms for sort-shuffle workload Wins Daytona GraySort Challenge tungsten-sort (Project Tungsten) Default since 1.5 Modification of existing sort-based shuffle Uses com.misc.Unsafe for self-managed memory and garbage collection Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms Perform joins, sorts, and other operators on both serialized and compressed byte buffers 41
  • 42. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark CPU & Memory Optimizations Custom Managed Memory Reduces GC overhead Both on and off heap Exact size calculations Direct Binary Processing Operate on serialized/compressed arrays Kryo can reorder/sort serialized records LZF can reorder/sort compressed records More CPU Cache-aware Data Structs & Algorithms o.a.s.sql.catalyst.expression.UnsafeRow Code Generation (default in 1.5) Generate source code from overall query plan 100+ UDFs converted to use code generation 42 UnsafeFixedWithAggregationMap TungstenAggregationIterator CodeGenerator GeneratorUnsafeRowJoiner UnsafeSortDataFormat UnsafeShuffleSortDataFormat PackedRecordPointer UnsafeRow UnsafeInMemorySorter UnsafeExternalSorter UnsafeShuffleWriter Mostly Same Join Code, UnsafeProjection UnsafeShuffleManager UnsafeShuffleInMemorySorter UnsafeShuffleExternalSorter Details in SPARK-7075
  • 43. Power of data. Simplicity of design. Speed of innovation. IBM Spark Power of data. Simplicity of design. Speed of innovation. IBM Spark sun.misc.Unsafe 43 Info addressSize() pageSize() Objects allocateInstance() objectFieldOffset() Classes staticFieldOffset() defineClass() defineAnonymousClass() ensureClassInitialized() Synchronization monitorEnter() tryMonitorEnter() monitorExit() compareAndSwapInt() putOrderedInt() Arrays arrayBaseOffset() arrayIndexScale() Memory allocateMemory() copyMemory() freeMemory() getAddress() – not guaranteed after GC getInt()/putInt() getBoolean()/putBoolean() getByte()/putByte() getShort()/putShort() getLong()/putLong() getFloat()/putFloat() getDouble()/putDouble() getObjectVolatile()/putObjectVolatile() Used by 
  • 44. Spark + sun.misc.Unsafe. org.apache.spark.sql.execution: aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD. org.apache.spark: unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor. org.apache.spark.sql.catalyst.expressions: regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions. Over 200 source files affected!!
  • 45. Traditional Java Object Row Layout. Figure labels: 4-byte String; Multi-field Object.
  • 46. Custom Data Structures for Workload. UnsafeRow (Dense Binary Row): dense, 8 bytes per field (word-aligned). BytesToBytesMap (Dense Binary HashMap): AlphaSort-style (key + pointer). TaskMemoryManager (Virtual Memory Address): OS-style memory paging.
  • 47. UnsafeRow Layout Example. Figure panels: Pre-Tungsten; Tungsten.
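A minimal sketch of the dense, word-aligned row idea from the slides above, written in Python with the standard `struct` module. It is an illustration only, not Spark's UnsafeRow code: the function names `pack_row` and `read_field` are hypothetical, and the exact layout (null bitset, then one 8-byte slot per field, then padded variable-length bytes, with offset and length packed into the slot) is a simplifying assumption.

```python
import struct

WORD = 8  # every slot in the fixed-width region is one 8-byte word


def pack_row(values):
    """Pack a list of int / str / None values into one dense binary row.

    Assumed layout: [null bitset words][one 8-byte slot per field][variable-length bytes]
    """
    n = len(values)
    bitset_words = (n + 63) // 64
    fixed_size = (bitset_words + n) * WORD
    null_bits = [0] * bitset_words
    slots = []
    var_region = bytearray()

    for i, v in enumerate(values):
        if v is None:
            null_bits[i // 64] |= 1 << (i % 64)  # mark field i as null
            slots.append(0)
        elif isinstance(v, int):
            slots.append(v & 0xFFFFFFFFFFFFFFFF)  # 64-bit two's-complement pattern
        else:
            data = v.encode("utf-8")
            offset = fixed_size + len(var_region)
            slots.append((offset << 32) | len(data))  # slot holds (offset, length)
            var_region += data
            var_region += b"\x00" * (-len(data) % WORD)  # pad to a word boundary

    header = struct.pack(f"<{bitset_words + n}Q", *null_bits, *slots)
    return header + bytes(var_region)


def read_field(row, n_fields, i, kind):
    """Read field i of a packed row; kind is "long" or "string"."""
    bitset_words = (n_fields + 63) // 64
    (word,) = struct.unpack_from("<Q", row, (i // 64) * WORD)
    if (word >> (i % 64)) & 1:
        return None
    (slot,) = struct.unpack_from("<Q", row, (bitset_words + i) * WORD)
    if kind == "long":
        return slot - (1 << 64) if slot >> 63 else slot
    offset, size = slot >> 32, slot & 0xFFFFFFFF
    return row[offset:offset + size].decode("utf-8")
```

Because every field sits at a computable offset, reading field `i` is a single indexed load rather than a chain of object dereferences.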
  • 48. Custom Memory Management. o.a.s.memory.TaskMemoryManager & MemoryConsumer: memory management (virtual memory allocation, paging); off-heap: direct 64-bit address; on-heap: 13-bit page num + 27-bit page offset. o.a.s.shuffle.sort.PackedRecordPointer: 64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset)). o.a.s.unsafe.types.UTF8String: primitive Array[Byte]. 2^13 pages * 2^27 page size = 1 TB RAM per task.
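The bit widths on the slide above can be checked with a small Python sketch. This is not Spark's PackedRecordPointer implementation: the helper names `pack_pointer` and `unpack_pointer` are hypothetical, and placing the partition in the high bits is an assumption made for illustration.

```python
PARTITION_BITS, PAGE_BITS, OFFSET_BITS = 24, 13, 27  # 24 + 13 + 27 = 64 bits


def pack_pointer(partition, page, offset):
    """Pack (partition, page number, offset in page) into one 64-bit integer."""
    if not 0 <= partition < (1 << PARTITION_BITS):
        raise ValueError("partition does not fit in 24 bits")
    if not 0 <= page < (1 << PAGE_BITS):
        raise ValueError("page number does not fit in 13 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 27 bits")
    return (partition << (PAGE_BITS + OFFSET_BITS)) | (page << OFFSET_BITS) | offset


def unpack_pointer(word):
    """Recover (partition, page number, offset in page) from a packed word."""
    offset = word & ((1 << OFFSET_BITS) - 1)
    page = (word >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    partition = word >> (PAGE_BITS + OFFSET_BITS)
    return partition, page, offset


# 2^13 pages, each 2^27 bytes, gives 2^40 bytes (1 TiB) of addressable memory.
ADDRESSABLE_BYTES = (1 << PAGE_BITS) * (1 << OFFSET_BITS)
```

With the partition in the high bits, sorting the packed words as plain integers groups records by partition without touching the records themselves.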
  • 49. Aggregations. o.a.s.sql.execution.UnsafeFixedWidthAggregationMap: uses BytesToBytesMap; in-place updates of serialized data; no object creation on hot path; improved external agg support; no OOM’s for large, single-key aggs. o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner: combines 2 UnsafeRows into 1. o.a.s.sql.execution.aggregate.TungstenAggregate & TungstenAggregationIterator: operate directly on serialized, binary UnsafeRow; 2 steps: hash-based agg (grouping), then sort-based agg; support spilling and external merge sorting.
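The "in-place updates of serialized data" point can be sketched in Python as follows. The class `FixedWidthAggMap` is hypothetical and is not Spark's UnsafeFixedWidthAggregationMap; it assumes each aggregation buffer is a fixed-width (sum, count) record stored in one contiguous byte buffer. Python still creates temporary integers, so the sketch shows only the layout idea, not the allocation savings.

```python
import struct


class FixedWidthAggMap:
    """Maps key bytes to a fixed-width (sum, count) record in one bytearray."""

    RECORD = struct.Struct("<qq")  # two signed 64-bit values: sum, count

    def __init__(self):
        self.buffer = bytearray()  # all aggregation records, back to back
        self.offsets = {}          # key bytes -> offset of its record in buffer

    def add(self, key, value):
        off = self.offsets.get(key)
        if off is None:
            # First time this key is seen: append a zeroed record.
            off = len(self.buffer)
            self.offsets[key] = off
            self.buffer += self.RECORD.pack(0, 0)
        total, count = self.RECORD.unpack_from(self.buffer, off)
        # Overwrite the record where it lives instead of building a new one.
        self.RECORD.pack_into(self.buffer, off, total + value, count + 1)

    def get(self, key):
        return self.RECORD.unpack_from(self.buffer, self.offsets[key])
```

Because every record has the same width, the memory used is exactly `len(buffer)`, which is the kind of exact size accounting the slides contrast with heuristics.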
  • 50. Equality. Bitwise comparison on UnsafeRow; no need to calculate equals(), hashCode(). Figure: Row 1 equals Row 2.
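A small Python sketch of bitwise row equality, assuming rows are encoded as fixed-width binary records. The helper names are hypothetical; the approach is only valid when equal values always encode to identical bytes, for example when padding is zeroed.

```python
import struct


def encode_row(user_id, amount):
    """Encode a two-field row as 16 bytes (two little-endian signed 64-bit values)."""
    return struct.pack("<qq", user_id, amount)


def rows_equal(a, b):
    """Compare two encoded rows byte for byte, without decoding any field."""
    return a == b


row1 = encode_row(7, 42)
row2 = encode_row(7, 42)
row3 = encode_row(7, 43)
```

Here `rows_equal(row1, row2)` is true and `rows_equal(row1, row3)` is false, and neither comparison inspects individual fields.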
  • 51. Joins. Surprisingly, not many code changes. o.a.s.sql.catalyst.expressions.UnsafeProjection: converts InternalRow to UnsafeRow.
  • 52. Sorting. o.a.s.util.collection.unsafe.sort: UnsafeSortDataFormat, UnsafeInMemorySorter, UnsafeExternalSorter, RecordPointerAndKeyPrefix. UnsafeShuffleWriter. AlphaSort-style, cache friendly: Ptr + Key-Prefix, 2x CPU cache-line friendly! Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining; this affects sort & shuffle performance. Supports merging compressed records if the compression codec supports it (LZF).
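The AlphaSort-style (pointer, key-prefix) sort on the slide above can be sketched in Python. This is an illustration, not Spark's UnsafeInMemorySorter: `key_prefix` and `sort_pointers` are hypothetical names, list indices stand in for record pointers, and an 8-byte prefix is an assumption.

```python
import struct
from functools import cmp_to_key


def key_prefix(key):
    """First 8 bytes of the key as a big-endian unsigned int (zero-padded)."""
    return struct.unpack(">Q", key[:8].ljust(8, b"\x00"))[0]


def sort_pointers(records):
    """Return record indices ("pointers") in ascending key order."""
    entries = [(key_prefix(rec), i) for i, rec in enumerate(records)]

    def compare(a, b):
        if a[0] != b[0]:
            return -1 if a[0] < b[0] else 1  # decided by the prefix alone
        ka, kb = records[a[1]], records[b[1]]  # tie: dereference the pointers
        return (ka > kb) - (ka < kb)

    return [i for _, i in sorted(entries, key=cmp_to_key(compare))]
```

Most comparisons are settled by the integer prefix stored next to the pointer, so the full record is fetched only when two prefixes are equal.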
  • 53. Spilling. Efficient spilling: exact data size is known; no need to maintain heuristics & approximations; controls amount of spilling. Spill merge on compressed, binary records, if the compression codec supports it! Exact peak memory for Spark jobs: UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes().
  • 54. Code Generation. Problem: boxing causes excessive object creation; expensive expression tree evals per row; JVM can’t inline polymorphic impls. Solution: codegen bypasses virtual function calls; defer source code generation to each operator, UDF, UDAF; use Scala quasiquote macros for Scala AST source code gen; rewrite and optimize code for overall plan, 8-byte align, etc.; use Janino to compile generated source code into bytecode.
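To make the problem/solution contrast above concrete, here is a toy Python analogue. It is not Spark's codegen: the tuple-based expression format and the function names are hypothetical, and Python's built-in `compile`/`exec` stands in for the role Janino plays on the JVM. `interpret` walks the expression tree for every row, while `compile_expr` generates source once and returns a flat function.

```python
# Expression tree nodes: ("col", index), ("lit", value), ("add", l, r), ("mul", l, r)


def interpret(expr, row):
    """Evaluate the tree by walking it for every row (the slow path)."""
    tag = expr[0]
    if tag == "col":
        return row[expr[1]]
    if tag == "lit":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    return left + right if tag == "add" else left * right


def gen_source(expr):
    """Turn the tree into a single source-code expression string."""
    tag = expr[0]
    if tag == "col":
        return f"row[{expr[1]}]"
    if tag == "lit":
        return repr(expr[1])
    op = "+" if tag == "add" else "*"
    return f"({gen_source(expr[1])} {op} {gen_source(expr[2])})"


def compile_expr(expr):
    """Generate source for the whole tree once and compile it to a function."""
    source = f"def evaluate(row):\n    return {gen_source(expr)}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["evaluate"]


expr = ("add", ("mul", ("col", 0), ("lit", 2)), ("col", 1))
evaluate = compile_expr(expr)
```

For this tree the generated body is `((row[0] * 2) + row[1])`, so each row is evaluated without any per-node dispatch.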
  • 55. Spark SQL UDF Code Generation. 100+ UDFs now generating code; more to come in Spark 1.6+. Details in SPARK-8159, SPARK-9571. Each implements Expression.genCode()!
  • 56. Creating a Custom UDF with Codegen. Study existing implementations. Extend base trait: o.a.s.sql.catalyst.expressions.Expression.genCode(). Register the function: o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction(). Augment DataFrame with new UDF (Scala implicits): o.a.s.sql.functions.scala. Don’t forget about Python!
  • 57. Who Benefits from Project Tungsten? Users of DataFrames. All Spark SQL queries (Catalyst). All RDDs: serialization, compression, and aggregations.
  • 58. Performance Results. Figure panels: Query Time; Garbage Collection. Callout: OOM’d on large dataset!
  • 59. Thank You!!! Chris Fregly, @cfregly, IBM Spark Technology Center, San Francisco, California. Relevant links: Sign up for the book & global meetup! Clone, contribute, and commit code! Run all demos in your own environment with Docker!