Spark SQL Bucketing at Facebook
Cheng Su, Facebook
#UnifiedDataAnalytics #SparkAISummit
About me
Cheng Su
• Software Engineer at Facebook (Data Infrastructure Organization)
• Working on the Spark team
• Previously worked on the Hive/Corona team
Agenda
• Spark at Facebook
• What is Bucketing
• Spark Bucketing Optimizations (JIRA: SPARK-19256)
• Bucketing Compatibility across SQL Engines
• The Road Ahead
Spark at Facebook
What is Bucketing
Pre-shuffle and (optionally) pre-sort when writing a table.
Avoid shuffle and (optionally) sort when reading the table.
[Figure: writing table user(id, info). Writing a normal table stores the rows in arrival order. Writing a bucketed sorted table first applies shuffle(id) and sort(id), so that rows with the same id land in the same output file (file0, file1, ... file9) in sorted order.]
What is Bucketing (query plan)
SQL query to create bucketed table:
CREATE TABLE user
(id INT, info STRING)
CLUSTERED BY (id)
SORTED BY (id)
INTO 8 BUCKETS
SQL query to write bucketed table:
INSERT OVERWRITE
TABLE user
SELECT id, info
FROM . . .
WHERE . . .
Query plan to write bucketed table:
InsertIntoTable
  Sort(id)
    ShuffleExchange(id, 8, HashFunc)
      . . .
What is Bucketing (write path)
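The write path can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `bucket_id` and `write_bucketed` are hypothetical names, and the hash is a toy one (the integer id itself), not the hash function a real engine uses.

```python
from collections import defaultdict

def bucket_id(key, num_buckets):
    # Assumption: toy hash (the integer id itself), not an engine's real hash function.
    return key % num_buckets

def write_bucketed(rows, num_buckets, sort=True):
    """Return one list of (id, info) rows per bucket, optionally sorted by id."""
    buckets = defaultdict(list)
    for row in rows:
        # Pre-shuffle: every row with the same id goes to the same bucket.
        buckets[bucket_id(row[0], num_buckets)].append(row)
    # Pre-sort: order each bucket by id before it is written as a file.
    return [
        sorted(buckets[b], key=lambda r: r[0]) if sort else buckets[b]
        for b in range(num_buckets)
    ]

rows = [(2, "a"), (1, "b"), (5, "c"), (1, "d"), (2, "e"), (4, "f"), (3, "g"), (0, "h")]
files = write_bucketed(rows, num_buckets=4)
for i, f in enumerate(files):
    print(f"file{i}: {[r[0] for r in f]}")
```

Each inner list stands in for one bucket file; a reader that knows the bucket count and hash can rely on this layout instead of shuffling again.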
Spark Bucketing Optimizations (join)
Avoid shuffle and sort when sort-merge-joining bucketed tables
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to sort-merge-join two non-bucketed tables:
SortMergeJoin
  Sort(id)
    Shuffle(id)
      TableScan(L)
  Sort(id)
    Shuffle(id)
      TableScan(R)
Query plan to sort-merge-join two bucketed tables with the same number of buckets:
SortMergeJoin
  TableScan(L)
  TableScan(R)
[Figure: sort merge join of non-bucketed tables. Each task shuffles the output of Table Scan L and Table Scan R, sorts it, then joins.]
Sort merge join
- Shuffle both tables
- Sort both tables
- Join by buffering one side and streaming the bigger one
[Figure: sort merge join of bucketed sorted tables. Each task joins matching buckets of Table Scan L and Table Scan R directly.]
Sort merge join of bucketed sorted tables
- Join by buffering one side and streaming the bigger one
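A minimal sketch of why matching buckets can be joined directly, assuming both tables use the same hash and bucket count and each bucket is already sorted by id. The names `merge_join` and `bucketed_join` are hypothetical, and this is an illustration rather than Spark's SortMergeJoin implementation.

```python
def merge_join(left, right):
    """Inner-join two lists of (id, value) rows that are already sorted by id."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Buffer the run of equal keys on the left side ...
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            # ... and stream the right side past that buffer.
            while j < len(right) and right[j][0] == lk:
                out.extend((lk, lrow[1], right[j][1]) for lrow in left[i:i_end])
                j += 1
            i = i_end
    return out

def bucketed_join(left_buckets, right_buckets):
    # Bucket b of L can only match bucket b of R, so no shuffle or sort is needed.
    if len(left_buckets) != len(right_buckets):
        raise ValueError("this sketch requires the same number of buckets")
    return [row for lb, rb in zip(left_buckets, right_buckets) for row in merge_join(lb, rb)]

left = [[(0, "l0"), (4, "l4")], [(1, "l1"), (1, "l1b"), (5, "l5")]]
right = [[(0, "r0"), (4, "r4"), (4, "r4b")], [(1, "r1"), (9, "r9")]]
joined = bucketed_join(left, right)
```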
Avoid shuffle when shuffled-hash-joining bucketed tables
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to shuffled-hash-join two non-bucketed tables:
ShuffledHashJoin
  Shuffle(id)
    TableScan(L)
  Shuffle(id)
    TableScan(R)
Query plan to shuffled-hash-join two bucketed tables with the same number of buckets:
ShuffledHashJoin
  TableScan(L)
  TableScan(R)
[Figure: shuffled hash join of non-bucketed tables. Each task shuffles the output of Table Scan L and Table Scan R, builds a hash table on one side, then joins.]
Shuffled hash join
- Shuffle both tables
- Join by hashing one side and streaming the bigger one
[Figure: shuffled hash join of bucketed tables. Each task builds a hash table from one bucket and joins it with the matching bucket of the other table, with no shuffle.]
Shuffled hash join of bucketed tables
- Join by hashing one side and streaming the bigger one
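The hash-join variant only needs co-located buckets, not sorted ones. The sketch below assumes both tables are bucketed with the same hash and bucket count; `hash_join_bucket` and `bucketed_hash_join` are hypothetical names used for illustration.

```python
from collections import defaultdict

def hash_join_bucket(build_rows, stream_rows):
    """Build a hash table on one bucket, then stream the matching bucket of the other table."""
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)
    # Probe the hash table once per streamed row; row order inside a bucket does not matter.
    return [(key, b, s) for key, s in stream_rows for b in table.get(key, [])]

def bucketed_hash_join(build_buckets, stream_buckets):
    # Matching keys always live in the same bucket index, so no shuffle is needed.
    if len(build_buckets) != len(stream_buckets):
        raise ValueError("this sketch requires the same number of buckets")
    return [
        row
        for bb, sb in zip(build_buckets, stream_buckets)
        for row in hash_join_bucket(bb, sb)
    ]

build = [[(4, "b4"), (0, "b0"), (8, "b8")], [(5, "b5"), (1, "b1")]]
stream = [[(0, "s0"), (4, "s4"), (4, "s4b")], [(9, "s9"), (1, "s1")]]
joined = bucketed_hash_join(build, stream)
```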
Avoid shuffle and sort of the bucketed table when joining a non-bucketed table with a bucketed table
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to sort-merge-join non-bucketed table (L) with bucketed table (R):
SortMergeJoin
  Sort(id)
    Shuffle(id)
      TableScan(L)
  TableScan(R)
[Figure: sort merge join of Table Scan L (non-bucketed) and Table Scan R (bucketed). Only the rows of L pass through Shuffle and Sort before the Join.]
Sort merge join of non-bucketed and bucketed tables
- Shuffle non-bucketed table
- Sort non-bucketed table
- Join by buffering one side and streaming the bigger one
Avoid shuffle and sort when joining bucketed tables with different numbers of buckets (coalesce the bigger one)
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to join 4-bucket table (L) with 16-bucket table (R):
SortMergeJoin
  TableScan(L)
  SortedCoalesce(4)
    TableScan(R)
SortedCoalesceExec: physical plan operator that inherits the child ordering
SortedCoalescedRDD: extends CoalescedRDD to read children RDDs in a sort-merge way (priority queue)
[Figure: sort merge join of Table Scan L and Table Scan R. Several sorted buckets of R pass through Sorted-Coalesce into one sorted stream, which is then joined with the matching bucket of L.]
Sort merge join of bucketed sorted tables with different numbers of buckets
- Coalesce the bigger one in a sort-merge way
- Join by buffering one side and streaming the bigger one
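A sketch of the sorted-coalesce idea using a priority-queue merge (`heapq.merge`). It assumes bucket id = hash mod bucket count with a toy hash (the integer id itself) and that the target bucket count divides the source bucket count; `sorted_coalesce` is a hypothetical name, not the SortedCoalescedRDD code.

```python
import heapq

def sorted_coalesce(buckets, target):
    """Coalesce sorted buckets into `target` sorted buckets without a shuffle.

    Assumes bucket id = id % num_buckets and that `target` divides len(buckets),
    so source bucket b contributes only to target bucket b % target.
    """
    if len(buckets) % target != 0:
        raise ValueError("target bucket count must divide the source bucket count")
    return [
        # Merge the sorted source buckets t, t + target, t + 2 * target, ...
        list(heapq.merge(*buckets[t::target], key=lambda row: row[0]))
        for t in range(target)
    ]

# Eight sorted buckets (id % 8) coalesced into four (id % 4).
src = [
    [(0, "a"), (8, "b")], [(1, "c"), (9, "d")], [(2, "e")], [(3, "f")],
    [(4, "g")], [(5, "h")], [(6, "i")], [(7, "j")],
]
coalesced = sorted_coalesce(src, 4)
```

Because the merge keeps each output bucket sorted, the result can feed a sort-merge join against a 4-bucket table with no extra sort.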
Avoid shuffle and sort when joining bucketed tables with different numbers of buckets (divide the smaller one)
SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id
Query plan to join 4-bucket table (L) with 16-bucket table (R):
SortMergeJoin
  Repartition(16)
    TableScan(L)
  TableScan(R)
RepartitionWithoutShuffleExec: physical plan operator that inherits the child ordering
RepartitionWithoutShuffleRDD: divide-read-filter children RDD partitions
[Figure: sort merge join of Table Scan L and Table Scan R. Each bucket of L passes through Divide, which splits it into several smaller sorted buckets; each of these is joined with the matching bucket of R.]
Sort merge join of bucketed sorted tables with different numbers of buckets
- Divide (repartition-w/o-shuffle) the smaller one
- Join by buffering one side and streaming the bigger one
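The divide step can be sketched as "re-read and filter": every target bucket reads exactly one source bucket and keeps only the rows that hash to it. The sketch assumes a toy hash (the integer id itself) and that the source bucket count divides the target bucket count; `divide_buckets` is a hypothetical name.

```python
def divide_buckets(buckets, target):
    """Expand sorted buckets into `target` buckets by reading and filtering, without a shuffle."""
    n = len(buckets)
    if target % n != 0:
        raise ValueError("source bucket count must divide the target bucket count")
    return [
        # Target bucket t reads source bucket t % n and keeps rows with id % target == t.
        # Filtering preserves the existing sort order inside the bucket.
        [row for row in buckets[t % n] if row[0] % target == t]
        for t in range(target)
    ]

# Two sorted buckets (id % 2) divided into four (id % 4).
src = [[(0, "a"), (2, "b"), (4, "c")], [(1, "d"), (1, "e"), (3, "f")]]
divided = divide_buckets(src, 4)
```

Each source bucket is read several times (once per target bucket it feeds), which trades extra reads for skipping the shuffle.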
Spark Bucketing Optimizations (group-by)
Avoid shuffle and sort when sort-aggregating bucketed tables
SQL query to group-by table:
SELECT . . .
FROM t
GROUP BY id
Query plan to sort-aggregate non-bucketed table:
SortAggregate
  Sort(id)
    Shuffle(id)
      TableScan(t)
Query plan to sort-aggregate bucketed table:
SortAggregate
  TableScan(t)
[Figure: sort aggregation of a non-bucketed table. Each task shuffles the output of Table Scan t, sorts it, then aggregates.]
Sort aggregation
- Shuffle table
- Sort table
- Aggregate
[Figure: sort aggregation of a bucketed table. Each task aggregates one bucket of Table Scan t directly.]
Sort aggregation of bucketed table
- Aggregate
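A small sketch of sort aggregation over buckets that are already sorted by id: equal ids are adjacent, so each bucket can be aggregated in one streaming pass. `sort_aggregate` is a hypothetical name and COUNT(*) is used as the example aggregate.

```python
from itertools import groupby

def sort_aggregate(bucket):
    """Count rows per id in one bucket that is already sorted by id."""
    # groupby only groups adjacent rows, which is enough because the bucket is sorted.
    return [(key, sum(1 for _ in group)) for key, group in groupby(bucket, key=lambda r: r[0])]

buckets = [
    [(0, "a"), (0, "b"), (4, "c")],
    [(1, "d"), (1, "e"), (5, "f")],
]
# All rows for a given id live in one bucket, so per-bucket results need no merging.
counts = [sort_aggregate(b) for b in buckets]
```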
Avoid shuffle when hash-aggregating bucketed tables
SQL query to group-by table:
SELECT . . .
FROM t
GROUP BY id
Query plan to hash-aggregate non-bucketed table:
HashAggregate
  Shuffle(id)
    TableScan(t)
Query plan to hash-aggregate bucketed table:
HashAggregate
  TableScan(t)
[Figure: hash aggregation of a non-bucketed table. Each task shuffles the output of Table Scan t, then aggregates.]
Hash aggregation
- Shuffle table
- Aggregate
[Figure: hash aggregation of a bucketed table. Each task aggregates one bucket of Table Scan t directly.]
Hash aggregation of bucketed table
- Aggregate
Spark Bucketing Optimizations (union all)
Avoid shuffle and sort when doing join/group-by on a union-all of bucketed tables
SQL query to group-by on union-all of tables:
SELECT . . .
FROM (
SELECT … FROM L
UNION ALL
SELECT … FROM R
)
GROUP BY id
Query plan to sort-aggregate union-all of bucketed tables:
SortAggregate
  Union
    TableScan(L)
    TableScan(R)
Change UnionExec to produce SortedCoalescedRDD instead of CoalescedRDD
[Figure: aggregate after union-all of non-bucketed tables. Each task applies Union-all to Table Scan L and Table Scan R, then Shuffle & Sort, then Aggregate.]
Aggregate after union-all
- Union-all of both tables
- Shuffle both tables
- Sort both tables
- Aggregate
[Figure: aggregate after union-all of bucketed sorted tables. Each task merges matching sorted buckets of L and R in Union-all, then aggregates; for example, buckets with ids (0, 0, 4) and (0, 4, 4) merge into (0, 0, 0, 4, 4, 4).]
Aggregate after union-all of bucketed sorted tables
- Union-all of both tables in a sort-merge way
- Aggregate
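A sketch of union-all "in a sort-merge way" followed by aggregation, assuming both tables share the same hash and bucket count and every bucket is sorted by id. `union_all_sorted` and `count_by_id` are hypothetical names; the bucket contents mirror the ids in the figure.

```python
import heapq
from itertools import groupby

def union_all_sorted(left_buckets, right_buckets):
    """Union matching buckets of two tables while keeping each bucket sorted by id."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("this sketch requires the same number of buckets")
    return [
        list(heapq.merge(lb, rb, key=lambda row: row[0]))
        for lb, rb in zip(left_buckets, right_buckets)
    ]

def count_by_id(bucket):
    # The merged bucket is still sorted, so a streaming group-by is enough.
    return [(key, sum(1 for _ in rows)) for key, rows in groupby(bucket, key=lambda r: r[0])]

left = [[(0, "L"), (0, "L"), (4, "L")], [(1, "L"), (1, "L"), (5, "L")]]
right = [[(0, "R"), (4, "R"), (4, "R")], [(1, "R"), (9, "R")]]
merged = union_all_sorted(left, right)
counts = [count_by_id(b) for b in merged]
```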
Spark Bucketing Optimizations (filter)
Filter pushdown for bucketed table
SQL queries to read bucketed table with filter on bucketed column (id):
SELECT … FROM t
WHERE id = 1
SELECT … FROM t
WHERE id IN (1, 2, 3)
Query plan to read bucketed table with filter pushdown:
Filter
  TableScan(t)
PushDownBucketFilter: physical plan rule that extracts the bucketed column filter from FilterExec, then filters out unnecessary buckets from e.g. HiveTableScanExec (i.e. unrelated buckets are not read at all)
[Figure: for SELECT … FROM t WHERE id = 1, a normal filter reads every bucket file and then filters rows; bucket filter pushdown reads only the bucket file that can contain id = 1.]
Bucket Filter Push Down
- Only read required bucket files
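A sketch of the bucket-pruning idea: map the filter values to bucket ids and read only those bucket files. It assumes a toy hash (the integer id itself); `buckets_to_read` and `scan_with_pushdown` are hypothetical names, not the PushDownBucketFilter rule itself.

```python
def buckets_to_read(filter_values, num_buckets):
    """Return the sorted bucket ids that can contain any of the filtered ids."""
    return sorted({v % num_buckets for v in filter_values})

def scan_with_pushdown(bucket_files, filter_values):
    """Read only the required bucket files, then apply the row-level filter."""
    wanted = set(filter_values)
    rows = []
    for b in buckets_to_read(wanted, len(bucket_files)):
        # A bucket can hold other ids with the same hash, so rows are still filtered.
        rows.extend(r for r in bucket_files[b] if r[0] in wanted)
    return rows

bucket_files = [
    [(0, "a"), (4, "b")],
    [(9, "c"), (1, "d"), (1, "e")],
    [(2, "f")],
    [(3, "g"), (7, "h")],
]
result = scan_with_pushdown(bucket_files, [1])  # WHERE id = 1 touches one file only
```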
Spark Bucketing Optimizations (validation)
Validate bucketing and sorting before writing bucketed tables
SQL query to write bucketed table:
INSERT OVERWRITE
TABLE t
SELECT …
FROM …
Query plan to validate bucketing and sorting before writing table:
InsertIntoTable(t)
  SortVerifier
    ShuffleVerifier
ShuffleVerifierExec: computes the bucket id of each row on the fly and compares it with the RDD partition id
SortVerifierExec: compares the ordering between the current and previous rows
[Figure: each task passes its rows through Shuffle Verifier, then Sort Verifier, then writes them to the table.]
- Validate bucket id
- Validate sort order
- Write to table
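The two checks can be sketched as a single pass over one partition. This assumes a toy hash (the integer id itself) and ascending sort order; `find_violations` is a hypothetical helper, not the ShuffleVerifierExec or SortVerifierExec code.

```python
def find_violations(rows, partition_id, num_buckets):
    """List problems in one partition that is about to be written as a bucket."""
    problems = []
    prev = None
    for row in rows:
        key = row[0]
        # Shuffle check: recompute the bucket id and compare it with the partition id.
        if key % num_buckets != partition_id:
            problems.append(f"id {key} does not belong to bucket {partition_id}")
        # Sort check: a row must not sort before the previous row.
        if prev is not None and key < prev:
            problems.append(f"id {key} is out of order after id {prev}")
        prev = key
    return problems

ok = find_violations([(0, "a"), (4, "b"), (4, "c")], partition_id=0, num_buckets=4)
bad = find_violations([(4, "a"), (0, "b"), (3, "c")], partition_id=0, num_buckets=4)
```

A writer could refuse to commit a bucket file whenever the returned list is non-empty.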
Spark Bucketing Optimizations (others)
• Sorted-coalesced read of multiple partitions of a bucketed table
• Prefer sort-merge-join for bucketed sorted tables
• Prefer sort-aggregate for bucketed sorted tables
• Avoid shuffle for NULL-safe-equal join (<=>) on bucketed tables
• Allow skipping shuffle and sort before writing a bucketed table
• Automatically align the maximum number of dynamic-allocation executors with the number of buckets
• Efficient Hive table sampling support
Bucketing Compatibility across SQL Engines
• Hive hash is different from murmur3 hash! (bitwise-and with 2^31-1 in org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getBucketNumber)
• Use the same bucketing hash function (e.g. Hive hash) across SQL engines (Spark/Presto/Hive)
• The numbers of buckets of all tables should be divisible by each other (e.g. powers of two)
• Changing the number of buckets should be easy and painless across compute engines for SQL users
• When and what to bucket?
• When more than one query does a join or group-by on some columns
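Even with the same hash value, the way it is reduced to a bucket number matters. The sketch below contrasts the bitwise-and with 2^31-1 mentioned above with a positive-modulo reduction; the hash functions themselves (Hive hash, murmur3) are not implemented, and `hive_style_bucket` and `pmod_bucket` are hypothetical names. Treating positive modulo as the other engine's reduction is an assumption made for illustration.

```python
INT_MAX = 2**31 - 1

def hive_style_bucket(hash_code, num_buckets):
    # Clear the sign bit of a 32-bit hash, then take the remainder.
    return (hash_code & INT_MAX) % num_buckets

def pmod_bucket(hash_code, num_buckets):
    # Positive modulo (assumed alternative); Python's % is already
    # non-negative for a positive divisor.
    return hash_code % num_buckets

# Non-negative hashes agree under both reductions.
same = hive_style_bucket(10, 4) == pmod_bucket(10, 4)
# A negative hash with a bucket count that is not a power of two can disagree.
hive_b = hive_style_bucket(-7, 3)
pmod_b = pmod_bucket(-7, 3)
```

In this example the two reductions place the same negative hash in different buckets, so files written by one engine would not line up with what the other expects.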
The Road Ahead
• Bucketing should be user-transparent
• Auto-bucketing project
• Audit join/group-by column information for all warehouse queries
• Recommend bucketed columns and number of buckets based on computational cost models
• What are the problems with bucketing? Can we have better data placement, besides bucketing and partitioning?
37#UnifiedDataAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

More Related Content

What's hot

Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
Databricks
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Bo Yang
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
Spark Summit
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Databricks
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
Eric Xiao
 
Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross Lawley
Spark Summit
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
Walaa Eldin Moustafa
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
DataWorks Summit
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
Databricks
 

What's hot (20)

Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Understanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And ProfitUnderstanding Memory Management In Spark For Fun And Profit
Understanding Memory Management In Spark For Fun And Profit
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...Why you should care about data layout in the file system with Cheng Lian and ...
Why you should care about data layout in the file system with Cheng Lian and ...
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
 
Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross Lawley
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Delta Lake: Optimizing Merge
Delta Lake: Optimizing MergeDelta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
 
Incremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and IcebergIncremental View Maintenance with Coral, DBT, and Iceberg
Incremental View Maintenance with Coral, DBT, and Iceberg
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 

Similar to Spark SQL Bucketing at Facebook

Hive Bucketing in Apache Spark
Hive Bucketing in Apache SparkHive Bucketing in Apache Spark
Hive Bucketing in Apache Spark
Tejas Patil
 
The PostgreSQL Query Planner
The PostgreSQL Query PlannerThe PostgreSQL Query Planner
The PostgreSQL Query Planner
Command Prompt., Inc
 
Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data Structures
SHAKOOR AB
 
Hands on data science with r.pptx
Hands  on data science with r.pptxHands  on data science with r.pptx
Hands on data science with r.pptx
Nimrita Koul
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
Purna Chander
 
Tree representation in map reduce world
Tree representation  in map reduce worldTree representation  in map reduce world
Tree representation in map reduce world
Yu Liu
 
vFabric SQLFire Introduction
vFabric SQLFire IntroductionvFabric SQLFire Introduction
vFabric SQLFire Introduction
Jags Ramnarayan
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
Nagavarunkumar Kolla
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
Takuya UESHIN
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Databricks
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
PoguttuezhiniVP
 
Presentation
PresentationPresentation
Presentation
Sayed Hoque
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
Julian Hyde
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimation
INRIA-OAK
 
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And ProfitJDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
PROIDEA
 
SQL Plan Directives explained
SQL Plan Directives explainedSQL Plan Directives explained
SQL Plan Directives explained
Mauro Pagano
 
Ontop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational DatabasesOntop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational Databases
Guohui Xiao
 
6. list
6. list6. list
CRL: A Rule Language for Table Analysis and Interpretation
CRL: A Rule Language for Table Analysis and InterpretationCRL: A Rule Language for Table Analysis and Interpretation
CRL: A Rule Language for Table Analysis and Interpretation
Alexey Shigarov
 

Similar to Spark SQL Bucketing at Facebook (20)

Hive Bucketing in Apache Spark
Hive Bucketing in Apache SparkHive Bucketing in Apache Spark
Hive Bucketing in Apache Spark
 
The PostgreSQL Query Planner
The PostgreSQL Query PlannerThe PostgreSQL Query Planner
The PostgreSQL Query Planner
 
Hashing Technique In Data Structures
Hashing Technique In Data StructuresHashing Technique In Data Structures
Hashing Technique In Data Structures
 
Hands on data science with r.pptx
Hands  on data science with r.pptxHands  on data science with r.pptx
Hands on data science with r.pptx
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
 
Tree representation in map reduce world
Tree representation  in map reduce worldTree representation  in map reduce world
Tree representation in map reduce world
 
vFabric SQLFire Introduction
vFabric SQLFire IntroductionvFabric SQLFire Introduction
vFabric SQLFire Introduction
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Spark_Documentation_Template1
Spark_Documentation_Template1Spark_Documentation_Template1
Spark_Documentation_Template1
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookScaling Machine Learning Feature Engineering in Apache Spark at Facebook
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 
Presentation
PresentationPresentation
Presentation
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimation
 
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And ProfitJDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
JDD 2016 - Pawel Szulc - Writing Your Wwn RDD For Fun And Profit
 
SQL Plan Directives explained
SQL Plan Directives explainedSQL Plan Directives explained
SQL Plan Directives explained
 
Ontop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational DatabasesOntop: Answering SPARQL Queries over Relational Databases
Ontop: Answering SPARQL Queries over Relational Databases
 
6. list
6. list6. list
6. list
 
CRL: A Rule Language for Table Analysis and Interpretation
CRL: A Rule Language for Table Analysis and InterpretationCRL: A Rule Language for Table Analysis and Interpretation
CRL: A Rule Language for Table Analysis and Interpretation
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Spark SQL Bucketing at Facebook

  • 2. Spark SQL Bucketing at Facebook. Cheng Su, Facebook. #UnifiedDataAnalytics #SparkAISummit
  • 3. About me: Cheng Su
      • Software Engineer at Facebook (Data Infrastructure Organization)
      • Working in the Spark team
      • Previously worked in the Hive/Corona team
  • 4. Agenda
      • Spark at Facebook
      • What is Bucketing
      • Spark Bucketing Optimizations (JIRA: SPARK-19256)
      • Bucketing Compatibility across SQL Engines
      • The Road Ahead
  • 6. What is Bucketing: pre-shuffle and (optionally) pre-sort when writing a table, so that reading it can avoid the shuffle and (optionally) the sort. [Diagram: rows of table user(id, info) written as a normal table stay in arrival order; written as a bucketed sorted table they go through shuffle(id) and sort(id) into bucket files (file0, file1, file9), so rows with the same id land in the same file, in id order.]
  • 7. What is Bucketing (query plan)
      SQL query to create bucketed table: CREATE TABLE user (id INT, info STRING) CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS
      SQL query to write bucketed table: INSERT OVERWRITE TABLE user SELECT id, info FROM . . . WHERE . . .
      Query plan to write bucketed table (top to bottom): InsertIntoTable, Sort(id), ShuffleExchange(id, 8, HashFunc), . . .
  • 8. What is Bucketing (write path)
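The write path described on slides 6 to 8 can be sketched in a few lines of Python. This is only an illustration, not Spark's implementation: `bucket_id` and `write_bucketed_sorted` are hypothetical names, and the hash is a placeholder (the id modulo the bucket count) standing in for whatever hash function the engine uses.

```python
from collections import defaultdict

def bucket_id(key, num_buckets):
    # Placeholder hash (assumption): the id itself, modulo the bucket count.
    return key % num_buckets

def write_bucketed_sorted(rows, num_buckets):
    """Return {bucket id: rows sorted by id}, mimicking shuffle(id) then sort(id)."""
    buckets = defaultdict(list)
    for row in rows:
        # "Shuffle": route each row to the bucket chosen by its id.
        buckets[bucket_id(row[0], num_buckets)].append(row)
    # "Sort": order the rows inside every bucket by id.
    return {b: sorted(rs, key=lambda r: r[0]) for b, rs in buckets.items()}

rows = [(2, "a"), (1, "b"), (5, "c"), (1, "d"), (2, "e"), (4, "f"), (3, "g"), (0, "h")]
files = write_bucketed_sorted(rows, num_buckets=4)
for b in sorted(files):
    print(b, files[b])
```

Each dictionary entry plays the role of one bucket file: all rows with a given id are in exactly one entry, and each entry is ordered by id.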
  • 9. Spark Bucketing Optimizations (join): avoid shuffle and sort when sort-merge-joining bucketed tables.
      SQL query to join tables: SELECT . . . FROM left L JOIN right R ON L.id = R.id
      Query plan without the optimization: SortMergeJoin over Sort(id), Shuffle(id), TableScan on each side (L and R).
      Query plan to sort-merge-join two bucketed tables with the same buckets: SortMergeJoin directly over TableScan(L) and TableScan(R).
  • 10. Sort merge join [Diagram: Table Scan L and Table Scan R each go through Shuffle and Sort before Join.]
      • Shuffle both tables
      • Sort both tables
      • Join by buffering one side and streaming the bigger one
  • 11. Sort merge join of bucketed sorted tables [Diagram: Table Scan L and Table Scan R feed Join directly, with no Shuffle or Sort stage.]
      • Join by buffering one side and streaming the bigger one
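As a rough sketch of why the shuffle and sort can be dropped: when both tables have the same number of buckets and every bucket is already sorted by id, bucket b of one table only needs to be merged with bucket b of the other. The function names below are hypothetical and the code is a plain-Python illustration, not Spark's SortMergeJoin operator.

```python
def merge_join(left, right):
    """Inner-join two lists of (id, value) rows that are already sorted by id."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Buffer the run of equal ids on the right side ...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # ... and stream the matching rows of the left side past it.
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], r[1]) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

def bucketed_sort_merge_join(left_buckets, right_buckets):
    """Join bucket b of the left table only with bucket b of the right table."""
    return {b: merge_join(rows, right_buckets.get(b, []))
            for b, rows in left_buckets.items()}

left_buckets = {0: [(0, "l0"), (4, "l4")], 1: [(1, "l1"), (1, "l1b"), (5, "l5")]}
right_buckets = {0: [(0, "r0"), (0, "r0b"), (4, "r4")], 1: [(1, "r1"), (9, "r9")]}
joined = bucketed_sort_merge_join(left_buckets, right_buckets)
```

No row ever moves between buckets, and no bucket is re-sorted; the merge only walks each bucket once.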
  • 12. Spark Bucketing Optimizations (join): avoid shuffle when shuffled-hash-joining bucketed tables.
      SQL query to join tables: SELECT . . . FROM left L JOIN right R ON L.id = R.id
      Query plan without the optimization: ShuffledHashJoin over Shuffle(id), TableScan on each side (L and R).
      Query plan to shuffled-hash-join two bucketed tables with the same buckets: ShuffledHashJoin directly over TableScan(L) and TableScan(R).
  • 13. Shuffled hash join [Diagram: Table Scan L and Table Scan R each go through Shuffle; a hash table is built from one side before Join.]
      • Shuffle both tables
      • Join by hashing one side and streaming the bigger one
  • 14. Shuffled hash join of bucketed tables [Diagram: Table Scan L and Table Scan R feed Build hash table and Join directly, with no Shuffle stage; the buckets are not sorted.]
      • Join by hashing one side and streaming the bigger one
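The per-bucket hash join can be sketched the same way. Here the buckets do not need to be sorted, only co-located by id. `hash_join` and `bucketed_hash_join` are hypothetical names for illustration; this is not Spark's ShuffledHashJoin code.

```python
from collections import defaultdict

def hash_join(build_rows, stream_rows):
    """Inner-join two lists of (id, value) rows: hash one side, stream the other."""
    table = defaultdict(list)
    for key, value in build_rows:          # build the hash table from one side
        table[key].append(value)
    # Stream the other side and probe the table row by row.
    return [(key, built, streamed)
            for key, streamed in stream_rows
            for built in table.get(key, [])]

def bucketed_hash_join(build_buckets, stream_buckets):
    """Join bucket b of one table only with bucket b of the other; no shuffle."""
    return {b: hash_join(build_buckets.get(b, []), rows)
            for b, rows in stream_buckets.items()}

build = {1: [(5, "b5"), (1, "b1")]}
stream = {1: [(1, "s1"), (9, "s9"), (5, "s5"), (1, "s1b")]}
result = bucketed_hash_join(build, stream)
```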
  • 15. Spark Bucketing Optimizations (join): avoid shuffle and sort of the bucketed side when joining a non-bucketed table with a bucketed table.
      SQL query to join tables: SELECT . . . FROM left L JOIN right R ON L.id = R.id
      Query plan to sort-merge-join non-bucketed table (L) with bucketed table (R): SortMergeJoin over Sort(id), Shuffle(id), TableScan(L) on the left and TableScan(R) alone on the right.
  • 16. Sort merge join of non-bucketed and bucketed table [Diagram: Table Scan L (non-bucketed) goes through Shuffle and Sort; Table Scan R (bucketed) feeds Join directly.]
      • Shuffle non-bucketed table
      • Sort non-bucketed table
      • Join by buffering one side and streaming the bigger one
  • 17. Spark Bucketing Optimizations (join): avoid shuffle and sort when joining bucketed tables with different buckets.
      SQL query to join tables: SELECT . . . FROM left L JOIN right R ON L.id = R.id
      Query plan to join 4-buckets-table (L) with 16-buckets-table (R): SortMergeJoin over TableScan(L) and SortedCoalesce(4) over TableScan(R).
      SortedCoalesceExec: physical plan operator that inherits child ordering.
      SortedCoalescedRDD: extends CoalescedRDD to read children RDDs in a sort-merge way (priority queue).
  • 18. Sort merge join of bucketed sorted tables with different buckets [Diagram: buckets of the table with more buckets are combined by Sorted-Coalesce before Join, e.g. (0, ) (4, ) and (2, ) (2, ) (6, ) become (0, ) (2, ) (2, ) (4, ) (6, ), and (1, ) (9, ) and (3, ) (7, ) (7, ) become (1, ) (3, ) (7, ) (7, ) (9, ).]
      • Coalesce the bigger one in a sort-merge way
      • Join by buffering one side and streaming the bigger one
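A sorted coalesce can be sketched with Python's `heapq.merge`, which is itself a priority-queue merge of already-sorted inputs. The sketch assumes a placeholder hash (bucket b holds the ids with id % bucket count == b) and that the larger bucket count is a multiple of the smaller one; `sorted_coalesce` is a hypothetical name, not the SortedCoalescedRDD API.

```python
import heapq

def sorted_coalesce(buckets, target):
    """Coalesce len(buckets) sorted buckets into `target` sorted buckets.

    Assumes len(buckets) is a multiple of `target` and that bucket b holds the
    rows whose id % len(buckets) == b (placeholder hash).
    """
    n = len(buckets)
    assert n % target == 0, "bucket counts must divide each other"
    return [
        # Source buckets t, t + target, t + 2 * target, ... all map to target bucket t.
        list(heapq.merge(*(buckets[b] for b in range(t, n, target)),
                         key=lambda row: row[0]))
        for t in range(target)
    ]

# Four sorted buckets, using the ids from the slide-18 diagram, coalesced into two.
four = [[(0, "a"), (4, "b")],
        [(1, "c"), (9, "d")],
        [(2, "e"), (2, "f"), (6, "g")],
        [(3, "h"), (7, "i"), (7, "j")]]
two = sorted_coalesce(four, 2)
print([[row[0] for row in bucket] for bucket in two])
```

Because the merge keeps the output ordered by id, the coalesced buckets can be fed straight into a sort-merge join without a separate sort.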
  • 19. Spark Bucketing Optimizations (join): avoid shuffle and sort when joining bucketed tables with different buckets.
      SQL query to join tables: SELECT . . . FROM left L JOIN right R ON L.id = R.id
      Query plan to join 4-buckets-table (L) with 16-buckets-table (R): SortMergeJoin over Repartition(16) over TableScan(L), and TableScan(R).
      RepartitionWithoutShuffleExec: physical plan operator that inherits child ordering.
      RepartitionWithoutShuffleRDD: divide-read-filter children RDD partitions.
  • 20. Sort merge join of bucketed sorted tables with different buckets [Diagram: the buckets of the table with fewer buckets, (1, ) (1, ) (3, ) and (0, ) (0, ) (2, ), are divided into (0, ) (0, ), (1, ) (1, ), (2, ) and (3, ), and each piece is joined with the matching bucket of the other table: (0, ) (4, ), (1, ) (9, ), (2, ) (2, ) (6, ) and (3, ) (7, ) (7, ).]
      • Divide (repartition without shuffle) the smaller one
      • Join by buffering one side and streaming the bigger one
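The divide step goes the other way: every target partition re-reads one source bucket and keeps only the rows that belong to it. A minimal sketch, again assuming the placeholder hash id % bucket count and a hypothetical function name (`divide`), not the RepartitionWithoutShuffleRDD implementation:

```python
def divide(buckets, target):
    """Split len(buckets) buckets into `target` buckets without a shuffle.

    Assumes `target` is a multiple of len(buckets) and that bucket b holds the
    rows whose id % len(buckets) == b (placeholder hash).
    """
    n = len(buckets)
    assert target % n == 0, "bucket counts must divide each other"
    return [
        # Target partition j reads source bucket j % n and filters it.
        [row for row in buckets[j % n] if row[0] % target == j]
        for j in range(target)
    ]

# Two sorted buckets, using the ids from the slide-20 diagram, divided into four.
two = [[(0, "a"), (0, "b"), (2, "c")],
       [(1, "d"), (1, "e"), (3, "f")]]
four = divide(two, 4)
print([[row[0] for row in bucket] for bucket in four])
```

Filtering preserves the order inside each bucket, so the divided buckets stay sorted. The trade-off is that each source bucket is read several times (twice in this example).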
  • 21. Spark Bucketing Optimizations (group-by): avoid shuffle and sort when sort-aggregating bucketed tables.
      SQL query to group-by table: SELECT . . . FROM t GROUP BY id
      Query plan without the optimization: SortAggregate over Sort(id), Shuffle(id), TableScan(t).
      Query plan to sort-aggregate bucketed table: SortAggregate directly over TableScan(t).
  • 22. Sort aggregation [Diagram: Table Scan t goes through Shuffle and Sort before Aggregate.]
      • Shuffle table
      • Sort table
      • Aggregate
  • 23. Sort aggregation of bucketed table [Diagram: Table Scan t feeds Aggregate directly, one bucket at a time.]
      • Aggregate
  • 24. Spark Bucketing Optimizations (group-by): avoid shuffle when hash-aggregating bucketed tables.
      SQL query to group-by table: SELECT . . . FROM t GROUP BY id
      Query plan without the optimization: HashAggregate over Shuffle(id), TableScan(t).
      Query plan to hash-aggregate bucketed table: HashAggregate directly over TableScan(t).
  • 25. Hash aggregation [Diagram: Table Scan t goes through Shuffle before Aggregate.]
      • Shuffle table
      • Aggregate
  • 26. Hash aggregation of bucketed table [Diagram: Table Scan t feeds Aggregate directly; the buckets are not sorted.]
      • Aggregate
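Both aggregation styles can be sketched per bucket. Since every row with a given id lives in exactly one bucket, each bucket can be aggregated independently. The example computes COUNT(*) GROUP BY id; the function names are hypothetical and the code only illustrates the idea, not Spark's operators.

```python
from collections import Counter
from itertools import groupby

def sort_aggregate(sorted_bucket):
    """COUNT(*) GROUP BY id over a bucket already sorted by id (streaming)."""
    return [(key, sum(1 for _ in group))
            for key, group in groupby(sorted_bucket, key=lambda row: row[0])]

def hash_aggregate(bucket):
    """COUNT(*) GROUP BY id over an unsorted bucket, using a hash table."""
    return dict(Counter(row[0] for row in bucket))

sorted_counts = sort_aggregate([(0, "a"), (4, "b"), (4, "c")])
hashed_counts = hash_aggregate([(4, "x"), (0, "y"), (4, "z")])
```

The sort-based version only ever holds one group in flight, which is why it depends on the bucket being sorted; the hash-based version accepts any order but keeps one entry per distinct id in memory.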
  • 27. Spark Bucketing Optimizations (union all): avoid shuffle and sort when join/group-by on union-all of bucketed tables.
      SQL query to group-by on union-all of tables: SELECT . . . FROM ( SELECT … FROM L UNION ALL SELECT … FROM R ) GROUP BY id
      Query plan to hash-aggregate union-all of bucketed tables: SortAggregate over Union over TableScan(L) and TableScan(R).
      Change UnionExec to produce SortedCoalescedRDD instead of CoalescedRDD.
  • 28. Aggregate after union-all [Diagram: Table Scan L and Table Scan R feed Union-all, followed by Shuffle & Sort, then Aggregate.]
      • Union-all of both tables
      • Shuffle both tables
      • Sort both tables
      • Aggregate
  • 29. Aggregate after union-all of bucketed sorted tables [Diagram: matching buckets of L and R are merged by Union-all into sorted runs such as (0, ) (0, ) (0, ) (4, ) (4, ) (4, ), (1, ) (1, ) (1, ) (5, ) (9, ) and (3, ) (3, ) (3, ) (7, ) (7, ), which feed Aggregate directly.]
      • Union-all of both tables in a sort-merge way
      • Aggregate
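A small sketch of the union-all case, under the assumption that both tables have the same bucket count and sorted buckets: merging the two matching buckets in id order keeps the result sorted, so a streaming aggregate can follow immediately. `union_all_then_count` is a hypothetical name.

```python
import heapq
from itertools import groupby

def union_all_then_count(left_bucket, right_bucket):
    """Merge two sorted buckets in id order, then COUNT(*) GROUP BY id."""
    merged = heapq.merge(left_bucket, right_bucket, key=lambda row: row[0])
    return [(key, sum(1 for _ in group))
            for key, group in groupby(merged, key=lambda row: row[0])]

counts = union_all_then_count([(0, "l"), (0, "l"), (4, "l")],
                              [(0, "r"), (4, "r"), (4, "r")])
```

A plain concatenation of the two buckets would lose the ordering and force a sort before the aggregate, which is the cost the slide's change to UnionExec avoids.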
  • 30. Spark Bucketing Optimizations (filter): filter pushdown for bucketed tables.
      SQL queries to read bucketed table with filter on bucketed column (id): SELECT … FROM t WHERE id = 1, and SELECT … FROM t WHERE id IN (1, 2, 3)
      Query plan to read bucketed table with filter pushdown: Filter over TableScan(t).
      PushDownBucketFilter: physical plan rule that extracts the bucketed-column filter from FilterExec, then filters out unnecessary buckets from e.g. HiveTableScanExec (i.e. unrelated buckets are not read at all).
  • 31. Bucket Filter Push Down, for SELECT … FROM t WHERE id = 1 [Diagram: a normal filter scans every bucket and keeps the rows with id = 1; bucket filter push down scans only the bucket that can contain id = 1.]
      • Only read required bucket files
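The pruning decision can be sketched as follows: hash each literal in the filter, collect the resulting bucket ids, and scan only those buckets. The hash is again a placeholder (id % bucket count), and the function names are hypothetical; this is not the PushDownBucketFilter rule itself.

```python
def buckets_to_read(filter_values, num_buckets):
    """Bucket ids that can contain any of the ids named in the filter."""
    return sorted({value % num_buckets for value in filter_values})

def scan_with_bucket_filter(buckets, filter_values):
    """Read only the candidate buckets, then apply the row-level filter."""
    wanted = set(filter_values)
    return [row
            for b in buckets_to_read(wanted, len(buckets))
            for row in buckets[b]
            if row[0] in wanted]

buckets = [[(0, "a"), (4, "b")], [(9, "c"), (1, "d")], [(2, "e")], [(7, "f"), (3, "g")]]
hits = scan_with_bucket_filter(buckets, [1])   # WHERE id = 1 touches only bucket 1
```

The row-level filter is still needed after pruning, because a bucket holds many different ids (bucket 1 above also contains id 9).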
  • 32. Spark Bucketing Optimizations (validation): validate bucketing and sorting before writing bucketed tables.
      SQL query to write bucketed table: INSERT OVERWRITE TABLE t SELECT … FROM …
      Query plan to validate bucketing and sorting before writing table: InsertIntoTable(t) over SortVerifier over ShuffleVerifier.
      ShuffleVerifierExec: computes the bucket id for each row on the fly and compares it with the RDD partition id.
      SortVerifierExec: compares the ordering between the current and previous rows.
  • 33. Shuffle Verifier and Sort Verifier [Diagram: each partition's rows pass through Shuffle Verifier and then Sort Verifier before being written.]
      • Validate bucket id
      • Validate sort order
      • Write to table
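The two checks can be sketched as pass-through generators that raise as soon as a row violates the expected layout. The names, the error type, and the placeholder hash (id % bucket count) are assumptions made for this illustration; they are not taken from ShuffleVerifierExec or SortVerifierExec.

```python
def verify_bucket(partition_id, rows, num_buckets):
    """Yield rows unchanged, failing if a row does not belong in this partition."""
    for row in rows:
        actual = row[0] % num_buckets          # placeholder hash: the id itself
        if actual != partition_id:
            raise ValueError(f"row {row} hashes to bucket {actual}, not {partition_id}")
        yield row

def verify_sorted(rows):
    """Yield rows unchanged, failing if an id is smaller than the previous one."""
    previous = None
    for row in rows:
        if previous is not None and row[0] < previous:
            raise ValueError(f"row {row} is out of order after id {previous}")
        previous = row[0]
        yield row

def checked_partition(partition_id, rows, num_buckets):
    """Run both verifiers over one partition and return the rows to be written."""
    return list(verify_sorted(verify_bucket(partition_id, rows, num_buckets)))

ok = checked_partition(1, [(1, "a"), (9, "b")], num_buckets=4)
```

Both checks stream over the rows once and keep at most one previous id, so they add little cost to the write.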
  • 34. Spark Bucketing Optimizations (others)
      • Sorted-coalesced read of multiple partitions of a bucketed table
      • Prefer sort-merge-join for bucketed sorted tables
      • Prefer sort-aggregate for bucketed sorted tables
      • Avoid shuffle for NULL-safe-equal join (<=>) on bucketed tables
      • Allow skipping shuffle and sort before writing a bucketed table
      • Automatically align the dynamic-allocation maximum number of executors with the number of buckets
      • Efficient Hive table sampling support
  • 35. Bucketing Compatibility across SQL Engines
      • Hive hash is different from murmur3 hash! (bitwise-and with 2^31-1 in org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getBucketNumber)
      • The same bucketing hash function (e.g. Hive hash) should be used across SQL engines (Spark/Presto/Hive)
      • The numbers of buckets of all tables should be divisible by each other (e.g. powers of two)
  • 36. Bucketing Compatibility across SQL Engines
      • Changing the number of buckets should be easy and painless for SQL users across compute engines
      • When and what to bucket?
      • When more than one query joins or groups by the same columns
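Two of the points above lend themselves to a small numeric illustration. The first function applies the step named on slide 35, a bitwise-and with 2^31-1 followed by a modulus, to an already-computed 32-bit hash code; the hash functions themselves (Hive hash, murmur3) are deliberately not implemented here. The second function checks why bucket counts should divide each other. Both function names are hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # 2^31 - 1

def masked_bucket(hash_code, num_buckets):
    """Clear the sign bit of a 32-bit hash code, then take the modulus."""
    return (hash_code & INT_MAX) % num_buckets

def coarser_bucket_is_derivable(fine, coarse, hash_codes):
    """True if each row's bucket among `coarse` buckets follows from its
    bucket among `fine` buckets alone, without rehashing the row."""
    return all(masked_bucket(h, fine) % coarse == masked_bucket(h, coarse)
               for h in hash_codes)

sample = range(-1000, 1000)
compatible = coarser_bucket_is_derivable(16, 4, sample)     # 4 divides 16
incompatible = coarser_bucket_is_derivable(6, 4, sample)    # 4 does not divide 6
```

When one count divides the other, whole buckets of the finer table map onto single buckets of the coarser one, which is what the coalesce and divide techniques on the earlier slides rely on. With counts such as 6 and 4 a single bucket mixes rows for several target buckets, so rows would have to be redistributed.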
  • 37. The Road Ahead
      • Bucketing should be user-transparent
      • Auto-bucketing project
      • Audit join/group-by column information for all warehouse queries
      • Recommend bucketed columns and number of buckets based on computational cost models
      • What are the problems of bucketing? Can we have better data placement, besides bucketing and partitioning?