The Apache Spark File Format Ecosystem
Vinoo Ganesh
Chief Technology Officer, Veraset
Agenda
About Veraset
Session Goals
On-Disk Storage
OLTP/OLAP Workflows
File Format Deep Dive
Case: Veraset
Case: Parquet Pruning
Looking Forward
Questions
About Veraset
About Me
▪ CTO at Veraset
▪ (Formerly) Lead of Compute / Spark at Palantir Technologies
Data-as-a-Service (DaaS) Startup
Anonymized Geospatial Data
▪ Centered around population movement
▪ Model training at scale
▪ Heavily used during COVID-19 investigations / analyses
Process, Cleanse, Optimize, and Deliver >2 PB Data Yearly
Data is Our Product
▪ We don’t build analytical tools
▪ No fancy visualizations
▪ Optimized data storage, retrieval, and processing are our lifeblood
▪ “Just Data”
We’re Hiring!
vinoo@veraset.com
Session Goals
On-disk storage
▪ Row, Column, Hybrid
Introduce OLTP / OLAP workflows
Explain feature set of formats
Inspect formats
Explore configuration for formats
Look forward
We can’t cover everything about file formats in 30 minutes, so let’s hit the high points.
File Formats
Unstructured
▪ Text
▪ CSV *
▪ TSV *
Semi-Structured
▪ JSON
▪ XML
Structured
▪ Avro
▪ ORC
▪ Parquet
Formats covered in this session: CSV, JSON, Avro, ORC, Parquet
* Can be considered "semi-structured"
On-Disk Storage
Data is stored on hard drives in “blocks”
Disk loads a “block” into memory at a time
▪ A block is the minimum amount of data transferred in a single read
Reading unnecessary data == expensive!
Reading fragmented data == expensive!
Random seek == expensive!
Sequential read/writes strongly preferred
Insight: Lay data on disk in a manner
optimized for your workflows
▪ Common categorizations for these workflows: OLTP/OLAP
https://bit.ly/2TG7SJw
Example Data

        Column A  Column B  Column C
Row 0   A0        B0        C0
Row 1   A1        B1        C1
Row 2   A2        B2        C2
Row 3   A3        B3        C3
Example Data (the same table, with each cell called out individually before showing how it is laid out on disk)
Row-wise Storage

Visually:
Row 0: A0 B0 C0
Row 1: A1 B1 C1
Row 2: A2 B2 C2
Row 3: A3 B3 C3

On Disk:
Block 1: A0 B0 C0 A1
Block 2: B1 C1 A2 B2
Block 3: C2 A3 B3 C3
Columnar (Column-wise) Storage

Visually:
Row 0: A0 B0 C0
Row 1: A1 B1 C1
Row 2: A2 B2 C2
Row 3: A3 B3 C3

On Disk:
Block 1: A0 A1 A2 A3
Block 2: B0 B1 B2 B3
Block 3: C0 C1 C2 C3
Hybrid Storage

Visually:
Row 0: A0 B0 C0
Row 1: A1 B1 C1
Row 2: A2 B2 C2
Row 3: A3 B3 C3

Logical Row Groups:
Row Group 1 (rows 0-1): A0 A1 | B0 B1 | C0 C1
Row Group 2 (rows 2-3): A2 A3 | B2 B3 | C2 C3
Hybrid Storage

Logical Row Groups:
Row Group 1 (rows 0-1): A0 A1 | B0 B1 | C0 C1
Row Group 2 (rows 2-3): A2 A3 | B2 B3 | C2 C3

On Disk:
Block 1: A0 A1 B0 B1
Block 2: C0 C1 A2 A3
Block 3: B2 B3 C2 C3

In Parquet, aim to fit one row group in one block (see the sketch below).
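A minimal sketch of that alignment when writing Parquet from Spark: parquet.block.size is the standard parquet-mr row group size setting, and the 128 MB figure is an assumed HDFS block size rather than anything prescribed by this deck.

// Spark shell sketch: make the Parquet row group size match an assumed 128 MB block size.
import spark.implicits._
val df = Seq(("A0", "B0", "C0"), ("A1", "B1", "C1")).toDF("a", "b", "c")
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)
df.write.parquet("/tmp/hybrid_example.parquet") // each row group now targets one 128 MB block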
Summary: Physical Layout
Row-wise formats are best for write-heavy (transactional) workflows
Columnar formats are optimized for read-heavy (analytical) workflows
Hybrid formats combine both methodologies
OLTP / OLAP
OLTP / OLAP Workflows
Online Transaction Processing (OLTP)
▪ Larger numbers of short queries / transactions
▪ More processing-focused than analysis-focused
▪ Geared towards record (row) based processing rather than column-based processing
▪ More frequent data updates / deletes (transactions)
Online Analytical Processing (OLAP)
▪ More analysis than processing focused
▪ Geared towards column-based data analytics processing
▪ Less frequent transactions
▪ More analytic complexity per query
Insight: Data access patterns should inform the selection of file
formats
Example Data

        student_id  subject    score
Row 0   71          math       97.44
Row 1   33          history    88.32
Row 2   101         geography  73.11
Row 3   13          physics    87.78
Unstructured File Format: CSV
About: CSV
CSV developed by IBM in 1972
▪ Ease of typing CSV lists on punched cards
Flexible (not always good)
Row-based
Human Readable
Compressible
Splittable
▪ When raw / using a splittable compression format
Supported Natively
Fast (from a write perspective)
Comma Separated Value (CSV)
$ cat myTable.csv
"student_id","subject","score"
71,"math",97.44
33,"history",88.32
101,"geography",73.11
13,"physics",87.78
scala> val table = spark.read.option("header","true")
.option("inferSchema", "true").csv("myTable.csv")
table: org.apache.spark.sql.DataFrame = [student_id: int,
subject: string ... 1 more field]
scala> table.printSchema
root
|-- student_id: integer (nullable = true)
|-- subject: string (nullable = true)
|-- score: double (nullable = true)
scala> table.show
+----------+---------+-----+
|student_id| subject|score|
+----------+---------+-----+
| 71| math|97.44|
| 33| history|88.32|
| 101|geography|73.11|
| 13| physics|87.78|
+----------+---------+-----+
* Some formatting applied
Semi-Structured File Format: JSON
About: JSON
Specified in early 2000s
Self-Describing
Row-based
Human Readable
Compressible
Splittable (in some cases)
Supported natively in Spark
Supports Complex Data Types
Fast (from a write perspective)
JSON (JavaScript Object Notation)
$ cat myTable.json
{"student_id":71,"subject":"math","score":97.44}
{"student_id":33,"subject":"history","score":88.32}
{"student_id":101,"subject":"geography","score":73.11}
{"student_id":13,"subject":"physics","score":87.78}
scala> val table = spark.read.json("myTable.json")
table: org.apache.spark.sql.DataFrame = [score:
double, student_id: bigint ... 1 more field]
scala> table.show
+-----+----------+---------+
|score|student_id| subject|
+-----+----------+---------+
|97.44| 71| math|
|88.32| 33| history|
|73.11| 101|geography|
|87.78| 13| physics|
+-----+----------+---------+
* Some formatting applied
Structured File Formats: Avro, ORC, Parquet
About: Avro
Data Format + Serialization Format
Self-Describing
▪ Schema evolution
Row-based
▪ Optimized for write-intensive applications
Binary Format – Schema stored inside of file (as JSON)
Compressible
Splittable
Supported by external library in Spark
Supports rich data structures
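A minimal round-trip sketch, assuming the external spark-avro module is on the classpath (e.g. started with --packages org.apache.spark:spark-avro_2.12:<your Spark version>); the path is made up.

// Write then read Avro through the external "avro" data source.
import spark.implicits._
val df = Seq((71, "math", 97.44), (33, "history", 88.32)).toDF("student_id", "subject", "score")
df.write.format("avro").save("/tmp/myTable.avro")
spark.read.format("avro").load("/tmp/myTable.avro").printSchema()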
Inspecting: Avro
$ avro-tools tojson part-00000-tid-8566179657485710941-115d547d-6b9a-43cf-957a-c549d3243afb-3-1-c000.avro
{"student_id":{"int":71},"subject":{"string":"math"},"score":{"double":97.44}}
{"student_id":{"int":33},"subject":{"string":"history"},"score":{"double":88.32}}
{"student_id":{"int":101},"subject":{"string":"geography"},"score":{"double":73.11}}
{"student_id":{"int":13},"subject":{"string":"physics"},"score":{"double":87.78}}
$ avro-tools getmeta part-00000-tid-8566179657485710941-115d547d-6b9a-43cf-957a-c549d3243afb-3-1-c000.avro
avro.schema {
"type" : "record",
"name" : "topLevelRecord",
"fields" : [ {
"name" : "student_id",
"type" : [ "int", "null" ]
}, {
"name" : "subject",
"type" : [ "string", "null" ]
}, {
"name" : "score",
"type" : [ "double", "null" ]
} ]
}
avro.codec snappy
* Some formatting applied
Config: Avro
spark.sql.avro.compression.codec
▪ What: Compression codec used when writing Avro files
▪ Options: {uncompressed, deflate, snappy, bzip2, xz}
spark.sql.avro.deflate.level
▪ What: Compression level for the deflate codec
▪ Options: {-1,1..9}
* Default value is underlined
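As a sketch, both settings are ordinary session confs and can be changed before a write; the deflate level of 5 below is only an illustrative value.

// Switch the Avro writer to deflate at an illustrative compression level.
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")
df.write.format("avro").save("/tmp/myTable_deflate.avro") // df as in the earlier Avro sketch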
About: ORC
Next iteration of Hive RCFile
▪ Created in 2013 as part of Stinger initiative to speed up Hive
Self-Describing
Hybrid (rows are grouped into row groups, then partitioned by column)
▪ Optimized for read-intensive applications
Binary Format – Schema stored inside of file (in metadata)
Compressible
Splittable
Supported natively in Spark
Supports rich data structures
▪ Hive data type support (including compound types): struct, list, map, union
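Because ORC is a built-in data source, a round trip needs no extra packages; a minimal sketch with made-up paths.

// Write then read ORC with the native Spark data source.
import spark.implicits._
val df = Seq((71, "math", 97.44), (33, "history", 88.32)).toDF("student_id", "subject", "score")
df.write.orc("/tmp/myTable.orc")
spark.read.orc("/tmp/myTable.orc").show()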
Structure: ORC
Row groups are called Stripes
Index Data contains column min/max values and row positions within each column
▪ Bit field / bloom filter as well (if included)
▪ Used for selection of stripes / row groups, not for answering queries
Row Data contains the actual data
Stripe Footer contains a directory of stream locations
Postscript contains compression
parameters and size of the compressed
footer
https://bit.ly/2A7AlS1
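The index data above is also where per-column bloom filters land if you ask the writer for them; a hedged sketch, assuming Spark forwards the standard orc.* writer options and reusing df from the ORC sketch above.

// Ask the ORC writer to build a bloom filter for subject, feeding the stripe-level index data.
df.write
  .option("orc.bloom.filter.columns", "subject")
  .option("orc.bloom.filter.fpp", "0.05") // target false-positive rate
  .orc("/tmp/myTable_bloom.orc")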
Inspecting: ORC
$ orc-tools meta part-00000-34aef610-c8d4-46fb-84c9-
b43887d2b37e-c000.snappy.orc
Processing data file part-00000-34aef610-c8d4-46fb-84c9-
b43887d2b37e-c000.snappy.orc [length: 574]
Structure for part-00000-34aef610-c8d4-46fb-84c9-b43887d2b37e-
c000.snappy.orc
File Version: 0.12 with ORC_135
Rows: 4
Compression: SNAPPY
Compression size: 262144
Type: struct<score:double,student_id:bigint,subject:string>
Stripe Statistics:
Stripe 1:
Column 0: count: 4 hasNull: false
Column 1: count: 4 hasNull: false bytesOnDisk: 35 min: 73.11
max: 97.44 sum: 346.65
Column 2: count: 4 hasNull: false bytesOnDisk: 9 min: 13
max: 101 sum: 218
Column 3: count: 4 hasNull: false bytesOnDisk: 37 min:
geography max: physics sum: 27
File Statistics:
Column 0: count: 4 hasNull: false
Column 1: count: 4 hasNull: false bytesOnDisk: 35 min: 73.11
max: 97.44 sum: 346.65
Column 2: count: 4 hasNull: false bytesOnDisk: 9 min: 13 max: 101 sum: 218
Column 3: count: 4 hasNull: false bytesOnDisk: 37 min:
geography max: physics sum: 27
Stripes:
Stripe: offset: 3 data: 81 rows: 4 tail: 75 index: 123
Stream: column 0 section ROW_INDEX start: 3 length 11
Stream: column 1 section ROW_INDEX start: 14 length 44
Stream: column 2 section ROW_INDEX start: 58 length 26
Stream: column 3 section ROW_INDEX start: 84 length 42
Stream: column 1 section DATA start: 126 length 35
Stream: column 2 section DATA start: 161 length 9
Stream: column 3 section DATA start: 170 length 30
Stream: column 3 section LENGTH start: 200 length 7
Encoding column 0: DIRECT
Encoding column 1: DIRECT
Encoding column 2: DIRECT_V2
Encoding column 3: DIRECT_V2
File length: 574 bytes
Padding length: 0 bytes
Padding ratio: 0%
* Some formatting applied
Config: ORC
spark.sql.orc.impl
▪ What: The ORC implementation to use
▪ Options: {native, hive}
spark.sql.orc.compression.codec
▪ What: Compression codec used when writing ORC files
▪ Options: {none, uncompressed, snappy, zlib, lzo}
spark.sql.orc.mergeSchema
▪ What: (Spark 3.0+) Whether the ORC data source merges schemas from all files
(otherwise, the schema is picked from a random data file)
▪ Options: {true, false}
spark.sql.orc.columnarReaderBatchSize
▪ What: Number of rows to include in an ORC vectorized reader batch
▪ Options: Int
▪ Default: 4096
spark.sql.orc.filterPushdown
▪ What: Enable filter pushdown for ORC files
▪ Options: {true, false}
spark.sql.orc.enableVectorizedReader
▪ What: Enables vectorized ORC decoding
▪ Options: {true, false}
* Default value is underlined
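A sketch of setting a few of these per session before a read-heavy job (the filter value is arbitrary).

// Session-level ORC tuning: native reader, pushdown on, zlib for any subsequent writes.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.filterPushdown", "true")
spark.conf.set("spark.sql.orc.compression.codec", "zlib") // affects writes only
val highScores = spark.read.orc("/tmp/myTable.orc").where("score > 90.0")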
About: Parquet
Originally built by Twitter and Cloudera
Self-Describing
Hybrid (rows are grouped into row groups, then partitioned by column)
▪ Optimized for read-intensive applications
Binary Format – Schema stored inside of file
Compressible
Splittable
Supported natively in Spark
Supports rich data structures
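Parquet is Spark's default format, so the round trip is one line in each direction; a minimal sketch with a made-up path.

// Write then read Parquet (the default data source).
import spark.implicits._
val df = Seq((71, "math", 97.44), (33, "history", 88.32)).toDF("student_id", "subject", "score")
df.write.parquet("/tmp/myTable.parquet")
spark.read.parquet("/tmp/myTable.parquet").show()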
Structure: Parquet
Row Groups are a logical horizontal
partitioning of the data into rows
▪ Consists of a column chunk for each column in the dataset
Column chunks are chunks of the data for a particular column
▪ Guaranteed to be contiguous in the file
Pages make up column chunks
▪ A page is conceptually an indivisible unit (in terms of compression and
encoding)
File metadata contains the start locations of all the column metadata
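Row group and page sizes are parquet-mr settings rather than Spark SQL confs; a hedged sketch of tuning them through the Hadoop configuration (the values are only illustrative, and df is as in the Parquet sketch above).

// Tune parquet-mr row group and page sizes for subsequent Parquet writes.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.setInt("parquet.block.size", 128 * 1024 * 1024) // row group size
hadoopConf.setInt("parquet.page.size", 1024 * 1024)        // page size within a column chunk
df.write.parquet("/tmp/myTable_tuned.parquet")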
Inspecting: Parquet (1)
$ parquet-tools meta part-00000-5adea6d5-53ae-49cc-8f70-a7365519b6bf-c000.snappy.parquet
file: file:part-00000-5adea6d5-53ae-49cc-8f70-a7365519b6bf-c000.snappy.parquet
creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
extra: org.apache.spark.sql.parquet.row.metadata =
{"type":"struct","fields":[{"name":"score","type":"double","nullable":true,"metadata":{}},{"name":"student
_id","type":"long","nullable":true,"metadata":{}},{"name":"subject","type":"string","nullable":true,"metad
ata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
score: OPTIONAL DOUBLE R:0 D:1
student_id: OPTIONAL INT64 R:0 D:1
subject: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:4 TS:288 OFFSET:4
--------------------------------------------------------------------------------
score: DOUBLE SNAPPY DO:0 FPO:4 SZ:101/99/0.98 VC:4 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 73.11, max:
97.44, num_nulls: 0]
student_id: INT64 SNAPPY DO:0 FPO:105 SZ:95/99/1.04 VC:4 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 13, max: 101,
num_nulls: 0]
subject: BINARY SNAPPY DO:0 FPO:200 SZ:92/90/0.98 VC:4 ENC:PLAIN,BIT_PACKED,RLE ST:[min: geography,
max: physics, num_nulls: 0]
* Some formatting applied
Inspecting: Parquet (2)
$ parquet-tools dump part-00000-5adea6d5-53ae-49cc-8f70-a7365519b6bf-
c000.snappy.parquet
row group 0
-------------------------------------------------------------------------
score: DOUBLE SNAPPY DO:0 FPO:4 SZ:101/99/0.98 VC:4 ENC:RLE,BIT_ [more]...
student_id: INT64 SNAPPY DO:0 FPO:105 SZ:95/99/1.04 VC:4 ENC:RLE,BIT_ [more]...
subject: BINARY SNAPPY DO:0 FPO:200 SZ:92/90/0.98 VC:4 ENC:RLE,BIT [more]...
score TV=4 RL=0 DL=1
---------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: 73.11, max: 97. [more]...
VC:4
student_id TV=4 RL=0 DL=1
---------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: 13, max: 101, n [more]...
VC:4
subject TV=4 RL=0 DL=1
---------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: geography, max: [more]...
VC:4
DOUBLE score
---------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:1 V:97.44
value 2: R:0 D:1 V:88.32
value 3: R:0 D:1 V:73.11
value 4: R:0 D:1 V:87.78
INT64 student_id
---------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:1 V:71
value 2: R:0 D:1 V:33
value 3: R:0 D:1 V:101
value 4: R:0 D:1 V:13
BINARY subject
---------------------------------------
*** row group 1 of 1, values 1 to 4 ***
value 1: R:0 D:1 V:math
value 2: R:0 D:1 V:history
value 3: R:0 D:1 V:geography
value 4: R:0 D:1 V:physics
* Some formatting applied
Config: Parquet
spark.sql.parquet.compression.codec
▪ What: Compression codec used when writing Parquet files
▪ Options: {none, uncompressed, snappy, gzip, lzo, lz4, brotli, zstd}
spark.sql.parquet.mergeSchema
▪ What: Parquet data source merges schemas collected from all data files
(otherwise, the schema is picked from a random data file)
▪ Options: {true, false}
spark.sql.parquet.columnarReaderBatchSize
▪ What: Number of rows to include in a parquet vectorized reader batch
▪ Options: Int
▪ Default: 4096
spark.sql.parquet.enableVectorizedReader
▪ What: Enables vectorized parquet decoding
▪ Options: {true, false}
spark.sql.parquet.filterPushdown
▪ What: Enables Parquet filter push-down optimization
▪ Options: {true, false}
▪ Similar:
▪ spark.sql.parquet.filterPushdown.date
▪ spark.sql.parquet.filterPushdown.timestamp
▪ spark.sql.parquet.filterPushdown.decimal
▪ spark.sql.parquet.filterPushdown.string.startsWith
▪ spark.sql.parquet.pushdown.inFilterThreshold
▪ spark.sql.parquet.recordLevelFilter.enabled
* Default value is underlined
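As with ORC, these are session confs; a sketch of a read-heavy setup (the predicate is arbitrary).

// Session-level Parquet tuning for analytical reads.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.conf.set("spark.sql.parquet.compression.codec", "snappy") // affects writes only
val highScores = spark.read.parquet("/tmp/myTable.parquet").where("score > 90.0")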
Case Study: Veraset
Case Study: Veraset
Veraset processes and delivers 3+ TB data daily
Historically processed and delivered data in CSV
▪ Pipeline runtime ~5.5 hours
OLAP Workflow
▪ Data used by read-intensive applications
▪ Schema fixed (no schema evolution)
▪ Strictly typed and fixed columns
▪ Heavy analytics / aggregations performed on data
▪ Processing-heavy workflow
▪ Frequently read data – Snappy > GZip
Migration from CSV to Snappy-compressed Parquet (see the sketch below)
▪ Pipeline runtime ~2.12 hours
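The migration itself can be as small as a re-write job; a minimal sketch with an illustrative schema and made-up paths, not Veraset's actual pipeline.

// Re-write CSV input as Snappy-compressed Parquet.
import org.apache.spark.sql.types._
// Declaring the schema avoids an inferSchema pass over multi-TB CSV input.
val schema = StructType(Seq(
  StructField("student_id", IntegerType),
  StructField("subject", StringType),
  StructField("score", DoubleType)))
spark.read.option("header", "true").schema(schema).csv("/data/input_csv")
  .write.option("compression", "snappy").parquet("/data/output_parquet")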
Disclaimer: Software can have bugs.
Case Study: Parquet Bug
Case Study: Parquet Partition Pruning Bug
Data formats are software and can have bugs - PARQUET-1246
Sort order not specified for -0.0/+0.0 and NaN, leading to incorrect
partition pruning
If NaN or -0.0/+0.0 was the first value in a row group, entire row groups could be
incorrectly pruned out
Conclusion: Make sure you are frequently updating your data format
version to get bug fixes and performance improvements
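If an upgrade cannot land immediately, one possible stop-gap (not from this deck) is to disable Parquet filter pushdown for the affected reads, so the suspect row-group statistics are simply not consulted, at the cost of scanning more data.

// Temporary mitigation sketch: bypass statistics-based row group pruning entirely.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
val df = spark.read.parquet("/data/output_parquet").where("score <= 0.0")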
Looking Forward: Apache Arrow
Complements (not competes with) on-disk formats and storage
technologies to promote data exchange / interoperability
▪ Interfaces between systems (e.g. Python <> JVM)
Columnar layout in memory, optimized for data locality
Zero-Copy reads + minimizes SerDe Overhead
Cache-efficient in OLAP workloads
Organized for SIMD optimizations
Flexible data model
In Memory Data Format
Final Thoughts
Think critically about your workflows and needs – OLTP and OLAP,
schema evolution, etc.
Migrating to formats optimized for your workflows can be an easy
performance win
Perform load and scale testing of your format before moving to
production
Don’t neglect the impact of compression codecs on your IO
performance
Keep format libraries up-to-date
Thank you!
vinoo.ganesh@gmail.com
The Apache Spark File Format Ecosystem

More Related Content

What's hot

Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
Alexey Grishchenko
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Databricks
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
Yue Chen
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
Databricks
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
Mateusz Buśkiewicz
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
Jaemun Jung
 

What's hot (20)

Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analyticsHow to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Inside Parquet Format
Inside Parquet FormatInside Parquet Format
Inside Parquet Format
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 

Similar to The Apache Spark File Format Ecosystem

Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Gruter
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
DataWorks Summit
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Codemotion
 
SQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedSQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - Advanced
Tony Rogerson
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
DataStax Academy
 
Oracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approachOracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approach
Laurent Leturgez
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notations
Rajendran
 
Optimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File PruningOptimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File Pruning
Databricks
 
Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...
Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...
Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...
Yandex
 
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander KorotkovPostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
Nikolay Samokhvalov
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
Charles Givre
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
Sap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory databaseSap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory database
Alexander Talac
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
Gerald Muecke
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
Jason Terpko
 
Introducing Apache Carbon Data - Hadoop Native Columnar Data Format
Introducing Apache Carbon Data - Hadoop Native Columnar Data FormatIntroducing Apache Carbon Data - Hadoop Native Columnar Data Format
Introducing Apache Carbon Data - Hadoop Native Columnar Data Format
Vimal Das Kammath
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
Amazon Web Services
 
Data Warehousing in the Era of Big Data
Data Warehousing in the Era of Big DataData Warehousing in the Era of Big Data
Data Warehousing in the Era of Big Data
Amazon Web Services
 
Gdc03 ericson memory_optimization
Gdc03 ericson memory_optimizationGdc03 ericson memory_optimization
Gdc03 ericson memory_optimization
brettlevin
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
PoguttuezhiniVP
 

Similar to The Apache Spark File Format Ecosystem (20)

Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized EngineApache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
Apache Tajo: Query Optimization Techniques and JIT-based Vectorized Engine
 
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
 
SQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - AdvancedSQL Server 2014 Memory Optimised Tables - Advanced
SQL Server 2014 Memory Optimised Tables - Advanced
 
Apache Cassandra at Macys
Apache Cassandra at MacysApache Cassandra at Macys
Apache Cassandra at Macys
 
Oracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approachOracle Database : Addressing a performance issue the drilldown approach
Oracle Database : Addressing a performance issue the drilldown approach
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notations
 
Optimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File PruningOptimising Geospatial Queries with Dynamic File Pruning
Optimising Geospatial Queries with Dynamic File Pruning
 
Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...
Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...
Типы данных JSONb, соответствующие индексы и модуль jsquery – Олег Бартунов, ...
 
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander KorotkovPostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
PostgreSQL Moscow Meetup - September 2014 - Oleg Bartunov and Alexander Korotkov
 
Data Exploration with Apache Drill: Day 1
Data Exploration with Apache Drill:  Day 1Data Exploration with Apache Drill:  Day 1
Data Exploration with Apache Drill: Day 1
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Sap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory databaseSap technical deep dive in a column oriented in memory database
Sap technical deep dive in a column oriented in memory database
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
 
Managing Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDBManaging Data and Operation Distribution In MongoDB
Managing Data and Operation Distribution In MongoDB
 
Introducing Apache Carbon Data - Hadoop Native Columnar Data Format
Introducing Apache Carbon Data - Hadoop Native Columnar Data FormatIntroducing Apache Carbon Data - Hadoop Native Columnar Data Format
Introducing Apache Carbon Data - Hadoop Native Columnar Data Format
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Data Warehousing in the Era of Big Data
Data Warehousing in the Era of Big DataData Warehousing in the Era of Big Data
Data Warehousing in the Era of Big Data
 
Gdc03 ericson memory_optimization
Gdc03 ericson memory_optimizationGdc03 ericson memory_optimization
Gdc03 ericson memory_optimization
 
Postgresql Database Administration Basic - Day2
Postgresql  Database Administration Basic  - Day2Postgresql  Database Administration Basic  - Day2
Postgresql Database Administration Basic - Day2
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
49AkshitYadav
 
How AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdfHow AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdf
PromptCloud
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
District 11 Solutions
 
Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?
SomalyEng
 
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
Ladislau5
 
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
deepikakumaridk25
 
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdfThe Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
Riya Sen
 
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
femim26318
 
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Alireza Kamrani
 
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptxPRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
amazenolmedojeruel
 
SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024
Becky Burwell
 
Vrinda store data analysis project using Excel
Vrinda store data analysis project using ExcelVrinda store data analysis project using Excel
Vrinda store data analysis project using Excel
SantuJana12
 
Annex K RBF's The World Game pdf document
Annex K RBF's The World Game pdf documentAnnex K RBF's The World Game pdf document
Annex K RBF's The World Game pdf document
Steven McGee
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
HeidiLivengood
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
RuchiRathor2
 
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop ServiceCal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Deepikakumari457585
 
Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
talha2khan2k
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
Sathiakumar Chandr
 
Accounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-RegulationsAccounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-Regulations
DALubis
 
future-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-managementfuture-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-management
Aadee4
 

Recently uploaded (20)

Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
 
How AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdfHow AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdf
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
 
Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?
 
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
 
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
 
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdfThe Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
 
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
 
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
 
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptxPRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
 
SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024
 
Vrinda store data analysis project using Excel
Vrinda store data analysis project using ExcelVrinda store data analysis project using Excel
Vrinda store data analysis project using Excel
 
Annex K RBF's The World Game pdf document
Annex K RBF's The World Game pdf documentAnnex K RBF's The World Game pdf document
Annex K RBF's The World Game pdf document
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
 
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop ServiceCal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
 
Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
 
Accounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-RegulationsAccounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-Regulations
 
future-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-managementfuture-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-management
 

The Apache Spark File Format Ecosystem

  • 2. The Spark File Format Ecosystem Vinoo Ganesh Chief Technology Officer, Veraset
  • 3. Agenda About Veraset Session Goals On-Disk Storage OLTP/OLAP Workflows File Format Deep Dive Case: Veraset Case: Parquet Pruning Looking Forward Questions
  • 4. About Veraset About Me ▪ CTO at Veraset ▪ (Formerly) Lead of Compute / Spark at Palantir Technologies Data-as-a-Service (DaaS) Startup Anonymized Geospatial Data ▪ Centered around population movement ▪ Model training at scale ▪ Heavily used during COVID-19 investigations / analyses Process, Cleanse, Optimize, and Deliver >2 PB Data Yearly Data is Our Product ▪ We don’t build analytical tools ▪ No fancy visualizations ▪ Optimized data storage, retrieval, and processing are our lifeblood ▪ “Just Data” We’re Hiring! vinoo@veraset.com
  • 5. Session Goals On-disk storage ▪ Row, Column, Hybrid Introduce OLTP / OLAP workflows Explain feature set of formats Inspect formats Explore configuration for formats Look forward We can’t cover everything about file formats in 30 minutes, so let’s hit the high points.
  • 6. File Formats ▪ Text ▪ CSV * ▪ TSV * ▪ JSON ▪ XML Semi-StructuredUnstructured ▪ Avro ▪ ORC ▪ Parquet Structured Bolded formats will be covered in this session * Can be considered ”semi-structured”
  • 7. On-Disk Storage Data is stored on hard drives in “blocks” Disk loads a “block” into memory at a time ▪ Block is the minimum amount of data read during a read Reading unnecessary data == expensive! Reading fragmented data == expensive! Random seek == expensive! Sequential read/writes strongly preferred Insight: Lay data on disk in a manner optimized for your workflows ▪ Common categorizations for these workflows: OLTP/OLAP https://bit.ly/2TG7SJw
  • 8. Example Data A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 Column A Column B Column C Row 0 Row 1 Row 2 Row 3
  • 9. Example Data Column A Column B Column C Row 0 Row 1 Row 2 Row 3 A0 C0B0 A1 A2 A3 B1 B2 B3 C1 C2 C3
  • 10. Row-wise Storage A0 C0B0 A1 A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 Visually Block 1 B1 A2C1 B2 Block 2 C2 B3A3 C3 Block 3 On Disk
  • 11. Columnar (Column-wise) Storage A0 A2A1 A3 Block 1 B0 B2B1 B3 Block 2 C0 C2C1 C3 Block 3 A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 Visually On Disk
  • 12. Hybrid Storage A0 B0A1 B1 Row Group 1 C0 A2C1 A3 Row Group 2 B2 C2B3 C3 A0 B0 C0 A1 B1 C1 A2 B2 C2 A3 B3 C3 Visually Logical Row Groups
  • 13. Hybrid Storage A0 B0A1 B1 Block 1 C0 A2C1 A3 Block 2 B2 C2B3 C3 Block 3 On Disk A0 B0A1 B1 Row Group 1 C0 A2C1 A3 Row Group 2 B2 C2B3 C3 Logical Row Groups In Parquet – aim to fit one row group in one block
  • 14. Summary: Physical Layout Row-wise formats are best for write-heavy (transactional) workflows Columnar formats are best optimized for read-heavy (analytical) workflows Hybrid formats combine both methodologies
  • 16. OLTP / OLAP Workflows Online Transaction Processing (OLTP) ▪ Larger numbers of short queries / transactions ▪ More processing than analysis focused ▪ Geared towards record (row) based processing than column based ▪ More frequent data updates / deletes (transactions) Online Analytical Processing (OLAP) ▪ More analysis than processing focused ▪ Geared towards column-based data analytics processing ▪ Less frequent transactions ▪ More analytic complexity per query Insight: Data access patterns should inform the selection of file formats
  • 17. Example Data student_id subject score Row 0 Row 1 Row 2 Row 3 71 97.44math 33 101 13 history geography physics 88.32 73.11 87.78
  • 19. About: CSV CSV developed by IBM in 1972 ▪ Ease of typing CSV lists on punched cards Flexible (not always good) Row-based Human Readable Compressible Splittable ▪ When raw / using spittable format Supported Natively Fast (from a write perspective) Comma Separated Value (CSV) $ cat myTable.csv "student_id","subject","score" 71,"math",97.44 33,"history",88.32 101,"geography",73.11 13,"physics",87.78 scala> val table = spark.read.option("header","true") .option("inferSchema", "true").csv("myTable.csv") table: org.apache.spark.sql.DataFrame = [student_id: int, subject: string ... 1 more field] scala> table.printSchema root |-- student_id: integer (nullable = true) |-- subject: string (nullable = true) |-- score: double (nullable = true) scala> table.show +----------+---------+-----+ |student_id| subject|score| +----------+---------+-----+ | 71| math|97.44| | 33| history|88.32| | 101|geography|73.11| | 13| physics|87.78| +----------+---------+-----+ * Some formatting applied
  • 21. About: JSON Specified in early 2000s Self-Describing Row-based Human Readable Compressible Splittable (in some cases) Supported natively in Spark Supports Complex Data Types Fast (from a write perspective) JSON (JavaScript Object Notation) $ cat myTable.json {"student_id":71,"subject":"math","score":97.44} {"student_id":33,"subject":"history","score":88.32} {"student_id":101,"subject":"geography","score":73.11} {"student_id":13,"subject":"physics","score":87.78} scala> val table = spark.read.json("myTable.json") table: org.apache.spark.sql.DataFrame = [score: double, student_id: bigint ... 1 more field] scala> table.show +-----+----------+---------+ |score|student_id| subject| +-----+----------+---------+ |97.44| 71| math| |88.32| 33| history| |73.11| 101|geography| |87.78| 13| physics| +-----+----------+---------+ * Some formatting applied
  • 22. Structured File Formats: Avro, ORC, Parquet
  • 23. About: Avro Data Format + Serialization Format Self-Describing ▪ Schema evolution Row-based ▪ Optimized for write-intensive applications Binary Format – Schema stored inside the file (as JSON) Compressible Splittable Supported via an external library in Spark Supports rich data structures
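A minimal sketch of that external-library support, assuming the spark-avro module is on the classpath (e.g. spark-shell --packages org.apache.spark:spark-avro_2.12:<spark-version>); in Spark 2.4+ the short format name "avro" resolves to it. File paths are illustrative.

    // Read and write Avro through the external spark-avro data source.
    val avroTable = spark.read.format("avro").load("myTable.avro")
    avroTable.write.format("avro").save("myTable_copy.avro")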
  • 24. Inspecting: Avro $ avro-tools tojson part-00000-tid-8566179657485710941-115d547d-6b9a-43cf-957a-c549d3243afb-3-1-c000.avro {"student_id":{"int":71},"subject":{"string":"math"},"score":{"double":97.44}} {"student_id":{"int":33},"subject":{"string":"history"},"score":{"double":88.32}} {"student_id":{"int":101},"subject":{"string":"geography"},"score":{"double":73.11}} {"student_id":{"int":13},"subject":{"string":"physics"},"score":{"double":87.78}} $ avro-tools getmeta part-00000-tid-8566179657485710941-115d547d-6b9a-43cf-957a-c549d3243afb-3-1-c000.avro avro.schema { "type" : "record", "name" : "topLevelRecord", "fields" : [ { "name" : "student_id", "type" : [ "int", "null" ] }, { "name" : "subject", "type" : [ "string", "null" ] }, { "name" : "score", "type" : [ "double", "null" ] } ] } avro.codec snappy * Some formatting applied
  • 25. Config: Avro spark.sql.avro.compression.codec ▪ What: Compression codec used when writing Avro files ▪ Options: {uncompressed, deflate, snappy (default), bzip2, xz} spark.sql.avro.deflate.level ▪ What: Compression level for the deflate codec ▪ Options: {-1 (default), 1..9} * Default values noted in parentheses
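For illustration only, a hedged sketch of applying these Avro settings at the session level before a write; the codec and level shown are arbitrary example values, and table is assumed to be an existing DataFrame.

    // Switch the Avro write codec to deflate and raise its compression level.
    spark.conf.set("spark.sql.avro.compression.codec", "deflate")
    spark.conf.set("spark.sql.avro.deflate.level", "7")   // only consulted when the codec is deflate
    table.write.format("avro").save("myTable_deflate.avro")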
  • 26. About: ORC Next iteration of the Hive RCFile ▪ Created in 2013 as part of the Stinger initiative to speed up Hive Self-Describing Hybrid (rows grouped into row groups, then partitioned by column) ▪ Optimized for read-intensive applications Binary Format – Schema stored inside the file (in the metadata) Compressible Splittable Supported natively in Spark Supports rich data structures ▪ Hive data type support (including compound types): struct, list, map, union
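A minimal sketch of the native support: ORC is a built-in data source, so no extra package is needed. Paths are illustrative.

    // Read and write ORC with the built-in data source.
    val orcTable = spark.read.orc("myTable.orc")
    orcTable.write.orc("myTable_copy.orc")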
  • 27. Structure: ORC Row groups are called Stripes Index Data contains column min/max values and row positions within each column ▪ Bit field / bloom filter as well (if included) ▪ Used for the selection of stripes / row groups, not for answering queries Row Data contains the actual data Stripe Footer contains a directory of stream locations Postscript contains compression parameters and the size of the compressed footer https://bit.ly/2A7AlS1
  • 28. Inspecting: ORC $ orc-tools meta part-00000-34aef610-c8d4-46fb-84c9-b43887d2b37e-c000.snappy.orc Processing data file part-00000-34aef610-c8d4-46fb-84c9-b43887d2b37e-c000.snappy.orc [length: 574] Structure for part-00000-34aef610-c8d4-46fb-84c9-b43887d2b37e-c000.snappy.orc File Version: 0.12 with ORC_135 Rows: 4 Compression: SNAPPY Compression size: 262144 Type: struct<score:double,student_id:bigint,subject:string> Stripe Statistics: Stripe 1: Column 0: count: 4 hasNull: false Column 1: count: 4 hasNull: false bytesOnDisk: 35 min: 73.11 max: 97.44 sum: 346.65 Column 2: count: 4 hasNull: false bytesOnDisk: 9 min: 13 max: 101 sum: 218 Column 3: count: 4 hasNull: false bytesOnDisk: 37 min: geography max: physics sum: 27 File Statistics: Column 0: count: 4 hasNull: false Column 1: count: 4 hasNull: false bytesOnDisk: 35 min: 73.11 max: 97.44 sum: 346.65 Column 2: count: 4 hasNull: false bytesOnDisk: 9 min: 13 max: 101 sum: 218 Column 3: count: 4 hasNull: false bytesOnDisk: 37 min: geography max: physics sum: 27 Stripes: Stripe: offset: 3 data: 81 rows: 4 tail: 75 index: 123 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 44 Stream: column 2 section ROW_INDEX start: 58 length 26 Stream: column 3 section ROW_INDEX start: 84 length 42 Stream: column 1 section DATA start: 126 length 35 Stream: column 2 section DATA start: 161 length 9 Stream: column 3 section DATA start: 170 length 30 Stream: column 3 section LENGTH start: 200 length 7 Encoding column 0: DIRECT Encoding column 1: DIRECT Encoding column 2: DIRECT_V2 Encoding column 3: DIRECT_V2 File length: 574 bytes Padding length: 0 bytes Padding ratio: 0% * Some formatting applied
  • 29. Config: ORC spark.sql.orc.impl ▪ What: The name of the ORC implementation ▪ Options: {native (default), hive} spark.sql.orc.compression.codec ▪ What: Compression codec used when writing ORC files ▪ Options: {none, uncompressed, snappy (default), zlib, lzo} spark.sql.orc.mergeSchema ▪ What: (3.0+) Whether the ORC data source merges schemas from all files (otherwise the schema is picked from a random file) ▪ Options: {true, false (default)} spark.sql.orc.columnarReaderBatchSize ▪ What: Number of rows to include in an ORC vectorized reader batch ▪ Options: Int ▪ Default: 4096 spark.sql.orc.filterPushdown ▪ What: Enables filter pushdown for ORC files ▪ Options: {true, false} spark.sql.orc.enableVectorizedReader ▪ What: Enables vectorized ORC decoding ▪ Options: {true (default), false} * Default values noted in parentheses
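A hedged sketch of setting a few of these ORC options on the session before a write; the specific values are illustrative, not recommendations, and table is assumed to be an existing DataFrame.

    spark.conf.set("spark.sql.orc.impl", "native")             // use the native (non-Hive) implementation
    spark.conf.set("spark.sql.orc.compression.codec", "zlib")  // trade CPU for smaller files
    spark.conf.set("spark.sql.orc.filterPushdown", "true")     // push predicates into the ORC reader
    table.write.orc("myTable_zlib.orc")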
  • 30. About: Parquet Originally built by Twitter and Cloudera Self-Describing Hybrid (rows grouped into row groups, then partitioned by column) ▪ Optimized for read-intensive applications Binary Format – Schema stored inside the file Compressible Splittable Supported natively in Spark Supports rich data structures
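As with ORC, Parquet is a built-in data source; a minimal sketch with illustrative paths.

    // Read and write Parquet with the built-in data source.
    val parquetTable = spark.read.parquet("myTable.parquet")
    parquetTable.write.parquet("myTable_copy.parquet")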
  • 31. Structure: Parquet Row Groups are a logical horizontal partitioning of the data into rows ▪ Each consists of a column chunk for every column in the dataset Column chunks hold the data for a single column ▪ Guaranteed to be contiguous in the file Pages make up column chunks ▪ A page is conceptually an indivisible unit (in terms of compression and encoding) File metadata contains the start locations of all the column metadata
  • 32. Inspecting: Parquet (1) $ parquet-tools meta part-00000-5adea6d5-53ae-49cc-8f70-a7365519b6bf-c000.snappy.parquet file: file:part-00000-5adea6d5-53ae-49cc-8f70-a7365519b6bf-c000.snappy.parquet creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1) extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"score","type":"double","nullable":true,"metadata":{}},{"name":"student_id","type":"long","nullable":true,"metadata":{}},{"name":"subject","type":"string","nullable":true,"metadata":{}}]} file schema: spark_schema -------------------------------------------------------------------------------- score: OPTIONAL DOUBLE R:0 D:1 student_id: OPTIONAL INT64 R:0 D:1 subject: OPTIONAL BINARY O:UTF8 R:0 D:1 row group 1: RC:4 TS:288 OFFSET:4 -------------------------------------------------------------------------------- score: DOUBLE SNAPPY DO:0 FPO:4 SZ:101/99/0.98 VC:4 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 73.11, max: 97.44, num_nulls: 0] student_id: INT64 SNAPPY DO:0 FPO:105 SZ:95/99/1.04 VC:4 ENC:PLAIN,BIT_PACKED,RLE ST:[min: 13, max: 101, num_nulls: 0] subject: BINARY SNAPPY DO:0 FPO:200 SZ:92/90/0.98 VC:4 ENC:PLAIN,BIT_PACKED,RLE ST:[min: geography, max: physics, num_nulls: 0] * Some formatting applied
  • 33. Inspecting: Parquet (2) $ parquet-tools dump part-00000-5adea6d5-53ae-49cc-8f70-a7365519b6bf- c000.snappy.parquet row group 0 ------------------------------------------------------------------------- score: DOUBLE SNAPPY DO:0 FPO:4 SZ:101/99/0.98 VC:4 ENC:RLE,BIT_ [more]... student_id: INT64 SNAPPY DO:0 FPO:105 SZ:95/99/1.04 VC:4 ENC:RLE,BIT_ [more]... subject: BINARY SNAPPY DO:0 FPO:200 SZ:92/90/0.98 VC:4 ENC:RLE,BIT [more]... score TV=4 RL=0 DL=1 --------------------------------------------------------------------- page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: 73.11, max: 97. [more]... VC:4 student_id TV=4 RL=0 DL=1 --------------------------------------------------------------------- page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: 13, max: 101, n [more]... VC:4 subject TV=4 RL=0 DL=1 --------------------------------------------------------------------- page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[min: geography, max: [more]... VC:4 DOUBLE score --------------------------------------- *** row group 1 of 1, values 1 to 4 *** value 1: R:0 D:1 V:97.44 value 2: R:0 D:1 V:88.32 value 3: R:0 D:1 V:73.11 value 4: R:0 D:1 V:87.78 INT64 student_id --------------------------------------- *** row group 1 of 1, values 1 to 4 *** value 1: R:0 D:1 V:71 value 2: R:0 D:1 V:33 value 3: R:0 D:1 V:101 value 4: R:0 D:1 V:13 BINARY subject --------------------------------------- *** row group 1 of 1, values 1 to 4 *** value 1: R:0 D:1 V:math value 2: R:0 D:1 V:history value 3: R:0 D:1 V:geography value 4: R:0 D:1 V:physics * Some formatting applied
  • 34. Config: Parquet spark.sql.parquet.compression.codec ▪ What: Compression codec used when writing Parquet files ▪ Options: {none, uncompressed, snappy (default), gzip, lzo, lz4, brotli, zstd} spark.sql.parquet.mergeSchema ▪ What: Whether the Parquet data source merges schemas collected from all data files (otherwise the schema is picked from a random file) ▪ Options: {true, false (default)} spark.sql.parquet.columnarReaderBatchSize ▪ What: Number of rows to include in a Parquet vectorized reader batch ▪ Options: Int ▪ Default: 4096 spark.sql.parquet.enableVectorizedReader ▪ What: Enables vectorized Parquet decoding ▪ Options: {true (default), false} spark.sql.parquet.filterPushdown ▪ What: Enables the Parquet filter push-down optimization ▪ Options: {true (default), false} ▪ Similar: ▪ spark.sql.parquet.filterPushdown.date ▪ spark.sql.parquet.filterPushdown.timestamp ▪ spark.sql.parquet.filterPushdown.decimal ▪ spark.sql.parquet.filterPushdown.string.startsWith ▪ spark.sql.parquet.pushdown.inFilterThreshold ▪ spark.sql.parquet.recordLevelFilter.enabled * Default values noted in parentheses
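A hedged sketch of applying some of these Parquet settings on the session; the codec choice is illustrative (zstd also needs a recent enough Parquet/Hadoop build), and table is assumed to be an existing DataFrame.

    spark.conf.set("spark.sql.parquet.compression.codec", "zstd")  // illustrative; snappy is the default
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")     // enable predicate pushdown
    spark.conf.set("spark.sql.parquet.mergeSchema", "false")       // skip schema merging on read
    table.write.parquet("myTable_zstd.parquet")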
  • 36. Case Study: Veraset Veraset processes and delivers 3+ TB of data daily Historically processed and delivered data in CSV ▪ Pipeline runtime ~5.5 hours OLAP Workflow ▪ Data used by read-intensive applications ▪ Schema fixed (no schema evolution) ▪ Strictly typed and fixed columns ▪ Heavy analytics / aggregations performed on the data ▪ Processing-heavy workflow ▪ Frequently read data – Snappy > GZip Migration to snappy-compressed Parquet ▪ Pipeline runtime ~2.12 hours Migration from CSV -> snappy-compressed Parquet
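A simplified sketch of what such a CSV -> snappy Parquet migration can look like; the bucket paths and the header/schema-inference options are assumptions for illustration, not Veraset's actual pipeline.

    // Read the existing CSV delivery, then rewrite it as snappy-compressed Parquet.
    val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("s3://example-bucket/deliveries/csv/")
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")  // snappy is already the default
    raw.write.mode("overwrite").parquet("s3://example-bucket/deliveries/parquet/")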
  • 39. Case Study: Parquet Partition Pruning Bug Data formats are software and can have bugs - PARQUET-1246 Sort order was not specified for -0.0/+0.0 and NaN, leading to incorrect partition pruning If NaN or -0.0/+0.0 was the first row in a group, entire row groups could be incorrectly pruned out Conclusion: Make sure you frequently update your data format version to pick up bug fixes and performance improvements
  • 40. Looking Forward: Apache Arrow Complements (does not compete with) on-disk formats and storage technologies to promote data exchange / interoperability ▪ Interfaces between systems (e.g. Python <> JVM) Columnar layout in memory, optimized for data locality Zero-copy reads + minimized SerDe overhead Cache-efficient in OLAP workloads Organized for SIMD optimizations Flexible data model In-Memory Data Format
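Arrow is an in-memory format rather than an on-disk one, so in Spark it surfaces mainly at the JVM <-> Python boundary rather than as something written to storage. A hedged sketch of turning it on (the config name below is the Spark 3.x spelling; earlier releases used spark.sql.execution.arrow.enabled):

    // Enable Arrow-based columnar transfer for PySpark's toPandas() and pandas UDFs.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")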
  • 41. Final Thoughts Think critically about your workflows and needs – OLTP and OLAP, schema evolution, etc. Migrating to formats optimized for your workflows can be an easy performance win Perform load and scale testing of your format before moving to production Don't neglect the impact of compression codecs on your IO performance Keep format libraries up-to-date
  • 43. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.