Iceberg
A modern table format for big data
Ryan Blue & Parth Brahmbhatt
July 2018 - Presto Summit
● A Netflix use case and performance results
● What is Iceberg?
○ How large Hive tables (fail to) work
○ How Iceberg works and its benefits
● Iceberg at Netflix
○ Future of Netflix’s data platform
● Iceberg & Raptor comparison
Contents
Iceberg Performance
July 2018 - Presto Summit
● Historical Atlas data:
○ Time-series metrics from Netflix runtime systems
○ 1 month: 2.7 million files in 2,688 partitions
○ Problem: cannot process more than a few days of data
● Sample query:
select distinct tags['type'] as type
from iceberg.atlas
where
name = 'metric-name' and
date > 20180222 and date <= 20180228
order by type;
Case Study: Atlas
● Hive table – with Parquet filters:
○ 400k+ splits per day, not combined
○ EXPLAIN query: 9.6 min (planning wall time)
● Iceberg table – partition data filtering:
○ 15,218 splits, combined
○ 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning)
● Iceberg table – partition and min/max filtering:
○ 412 splits
○ 42 sec (wall time) / 22 min (task time) / 25 sec (planning)
Case Study: Atlas Performance
Iceberg’s Design
July 2018 - Presto Summit
First, what is a table format?
● Key idea: organize data in a directory tree
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
|- ...
Hive Table Design
● Filter by directories as columns
SELECT ... WHERE date = '20180513' AND hour = 19
date=20180513/
|- hour=18/
| |- ...
|- hour=19/
| |- part-000.parquet
| |- ...
| |- part-031.parquet
|- hour=20/
| |- ...
|- ...
Hive Table Design
● Problem: too much directory listing for large tables
● Solution: use the Hive Metastore (HMS) to track partitions
○ Partition key to FS location
date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19
date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20
○ Enables predicate push-down in HMS for (some) scans
● The file system still tracks the files in each partition...
Hive Metastore
● Table state is stored in two places
○ Partitions in the Hive Metastore
○ Files in a FS with no transaction support
● Requires elaborate locking for correctness
○ Nothing respects the locking scheme
● Still requires directory listing to plan jobs
○ O(n) listing calls, n = # matching partitions
○ Eventual consistency breaks correctness
Design Problems
● Key idea: track all files in a table over time
○ A snapshot is a complete list of files in a table
○ Each write produces and commits a new snapshot
Iceberg’s Design
(Diagram: snapshot timeline S1 → S2 → S3 → ...)
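A minimal sketch of that write path in the Iceberg Java API (current org.apache.iceberg names; the 2018 code lived under com.netflix.iceberg). The table location, schema, and file metadata below are illustrative assumptions, not taken from the deck.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class AppendExample {
  public static void main(String[] args) {
    // Illustrative schema; the real table schemas are not shown in the deck
    Schema schema = new Schema(
        Types.NestedField.required(1, "name", Types.StringType.get()),
        Types.NestedField.optional(2, "value", Types.DoubleType.get()));

    HadoopTables tables = new HadoopTables(new Configuration());
    Table table = tables.create(schema, PartitionSpec.unpartitioned(),
        "hdfs://nn/warehouse/db/example"); // assumed location

    // Describe a Parquet file that was already written in place (no rename)
    DataFile file = DataFiles.builder(PartitionSpec.unpartitioned())
        .withPath("hdfs://nn/warehouse/db/example/data/part-000.parquet") // assumed path
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(128L * 1024 * 1024)
        .withRecordCount(1_000_000L)
        .build();

    // Each write produces a new snapshot; commit() atomically makes it current
    table.newAppend().appendFile(file).commit();
    System.out.println("current snapshot: " + table.currentSnapshot().snapshotId());
  }
}

The append only touches metadata: the data file itself is never moved or renamed, and readers see either the old snapshot or the new one.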
● Snapshot isolation without locking
○ Readers use a current snapshot
○ Writers produce new snapshots in isolation, then commit
● Any change to the file list is an atomic operation
○ Append data across partitions
○ Merge or rewrite files
Snapshot Design Benefits
(Diagram: snapshot timeline S1 → S2 → S3 → ...; reader R uses the current snapshot while writer W commits the next one)
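A sketch of the same isolation from the reader's side, assuming current Apache Iceberg scan APIs and an illustrative table location: the scan is pinned to one snapshot, so concurrent commits cannot change the file list it plans against.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class SnapshotIsolationExample {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://nn/warehouse/db/example"); // assumed location

    // Pin the scan to the snapshot that was current when planning started;
    // writers committing new snapshots meanwhile do not change this file list.
    long snapshotId = table.currentSnapshot().snapshotId();
    try (CloseableIterable<FileScanTask> tasks =
             table.newScan().useSnapshot(snapshotId).planFiles()) {
      for (FileScanTask task : tasks) {
        System.out.println(task.file().path());
      }
    }
  }
}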
In reality, it’s a bit more complicated.
Design Benefits
● Reads and writes are isolated and all changes are atomic
● No expensive or eventually-consistent FS operations:
○ No directory or prefix listing
○ No rename: data files written in place
● Faster scan planning, distributed across the cluster
○ O(1) manifest reads, not O(n) partition list calls
○ Upper and lower bounds used to eliminate files
● Reliable CBO metrics
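As a sketch of that planning path (current Apache Iceberg names; table location assumed), a filter expression is pushed into planFiles(), where partition data and per-file lower/upper bounds from the manifests eliminate files before any splits are created. The columns mirror the Atlas query shown earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class ScanPlanningExample {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://nn/warehouse/db/atlas"); // assumed location

    // Files whose partition data or column bounds cannot match the predicate
    // are pruned from the manifests without any directory listing.
    try (CloseableIterable<FileScanTask> tasks = table.newScan()
        .filter(Expressions.and(
            Expressions.equal("name", "metric-name"),
            Expressions.greaterThan("date", 20180222),
            Expressions.lessThanOrEqual("date", 20180228)))
        .planFiles()) {
      long files = 0;
      for (FileScanTask task : tasks) {
        files++;
      }
      System.out.println("files to scan: " + files);
    }
  }
}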
Iceberg at Netflix
July 2018 - Presto Summit
● Hidden partitioning
○ Partition filters derived from data filters
○ No more accidental full table scans
● Full schema evolution
○ Supports add, drop, and rename columns
● Reliable support for types
○ date, time, timestamp, and decimal
○ struct, list, map, and mixed nesting
Works as Users Expect
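Two of these behaviors sketched against the current Java API, with an assumed table location and column names: a day() transform gives hidden partitioning (users filter on event_time and never see the partition column), and schema evolution is a metadata-only commit that tracks columns by id.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;

public class EvolutionExample {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()),
        Types.NestedField.optional(3, "payload", Types.StringType.get()));

    // Hidden partitioning: partition by day(event_time); queries filter on
    // event_time and the partition filter is derived automatically.
    PartitionSpec spec = PartitionSpec.builderFor(schema).day("event_time").build();

    Table table = new HadoopTables(new Configuration())
        .create(schema, spec, "hdfs://nn/warehouse/db/events"); // assumed location

    // Full schema evolution: add, rename, and drop are metadata-only commits;
    // columns are tracked by id, so renames never break existing data files.
    table.updateSchema()
        .addColumn("source", Types.StringType.get())
        .renameColumn("payload", "body")
        .deleteColumn("id")
        .commit();
  }
}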
● Queries are not broken by layout changes
● Physical layout can evolve without painful migration
○ Mistakes can be fixed
○ Prototypes can move to production faster
○ Tables can change as volume grows over time
● Data Platform can transparently fix table layout
Table Layout is Hidden
● Any write is atomic – either complete or invisible
○ Rewrite files instead of partitions
○ Tables never have partially committed data
● Simple, built-in change detection
○ Cache and materialized view maintenance
○ Incremental processing
● Data Platform can monitor and fix data files
○ Compact small files
○ Repartition to a new layout
Snapshot-based Tables
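A sketch of the built-in change detection, assuming current Apache Iceberg names and an illustrative table location: every commit appears in the snapshot log with its operation and summary counts, which is enough to drive cache invalidation or incremental jobs without listing directories.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class ChangeDetectionExample {
  public static void main(String[] args) {
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://nn/warehouse/db/events"); // assumed location

    // Each commit appears in the snapshot log with the operation that produced
    // it (append, overwrite, ...) and summary counts such as added-data-files.
    for (Snapshot snapshot : table.snapshots()) {
      System.out.printf("%d @ %d: %s %s%n",
          snapshot.snapshotId(),
          snapshot.timestampMillis(),
          snapshot.operation(),
          snapshot.summary());
    }
  }
}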
● Common implementation for table operations
○ Tune write options per table (Parquet row group size)
○ Tune read defaults once (split combination)
● Simple data gathering
○ Log scan predicates and projection to Kafka
○ Analyze table settings from the Iceberg table
● Data Platform can automate tuning configuration
○ Test file format tuning settings per table
○ Update table to affect all writes
Table Format Library
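A sketch of per-table tuning through table properties; the property keys write.parquet.row-group-size-bytes and read.split.target-size are the current Apache Iceberg names, and the values and table location are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class TuningExample {
  public static void main(String[] args) {
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://nn/warehouse/db/events"); // assumed location

    // One properties commit changes write and read behavior for every engine
    // that uses the library, so tuning is done per table, not per job.
    table.updateProperties()
        .set("write.parquet.row-group-size-bytes", String.valueOf(128L * 1024 * 1024))
        .set("read.split.target-size", String.valueOf(512L * 1024 * 1024))
        .commit();
  }
}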
● Current: merge service
○ Dedicated cluster to convert data to Parquet
○ Makes data available to tables after merge completes
● Iceberg: Data is available after commit, in row-oriented format
● Autotune (planned)
○ Recommend tuning parameters
○ Rewrite files based on priority
○ Opportunistic Parquet conversion
Autotune: Data Librarian
Iceberg & Raptor Comparison
July 2018 - Presto Summit
● Fine-grained tracking of data
○ Iceberg: file-level, Raptor: shard-level
● Use min/max values for efficient job planning
● Safe atomic writes
● Same metadata, different purpose and design
Raptor Similarities
● Raptor targets low-latency query execution
○ Data stored in flash
○ Shard tracking in MySQL
○ Built for Presto
● Iceberg manages table metadata targeting scale
○ Open specification for Spark, Presto, and others
○ Distributed metadata workload
○ Atomic changes across clusters and engines
● Iceberg & Raptor are complementary projects
Raptor Differences
Getting Started with Iceberg
July 2018 - Presto Summit
● github.com/Netflix/iceberg
○ Apache Licensed, ALv2
○ Spark 2.3.x data source plug-in
○ Pig (read-only) support in development
○ Planned Python support
● Presto Iceberg PR: coming soon!
Using Iceberg
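A hedged sketch of reading and writing through the Spark data source plug-in from Java; the table locations are assumptions, and the plug-in is addressed by its short name "iceberg".

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkIcebergExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-example")
        .getOrCreate();

    // Read an Iceberg table by location through the data source plug-in
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .load("hdfs://nn/warehouse/db/events"); // assumed table location

    // Append through the same plug-in; the write commits a new snapshot
    df.write()
        .format("iceberg")
        .mode(SaveMode.Append)
        .save("hdfs://nn/warehouse/db/events_copy"); // assumed target table
  }
}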
Current blocker: HMS support
Questions?
July 2018 - Presto Summit
Additional Slides
July 2018 - Presto Summit
● Implementation of snapshot-based tracking
○ Adds table schema, partition layout, string properties
○ Tracks old snapshots for eventual garbage collection
● Table metadata is immutable and always moves forward
● The current snapshot (pointer) can be rolled back
Iceberg Metadata
(Diagram: metadata versions over time — v1.json tracks S1, S2; v2.json tracks S1, S2, S3; v3.json tracks S2, S3 after S1 ages out)
● Snapshots are split across one or more manifest files
○ Manifests store partition data for each data file
○ Reused to avoid high write volume
Manifest Files
(Diagram: the same metadata versions, with snapshots S1–S3 sharing manifest files m0.avro, m1.avro, and m2.avro)
● Basic data file info:
○ File location and format
○ Iceberg tracking data
● Values to filter files for a scan:
○ Partition data values
○ Per-column lower and upper bounds
● Metrics for cost-based optimization:
○ File-level: row count, size
○ Column-level: value count, null count, size
Manifest File Contents
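These fields correspond to the DataFile records stored in manifests. A sketch that prints them for a table's current files, assuming current Apache Iceberg names and an illustrative table location:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.io.CloseableIterable;

public class ManifestContentsExample {
  public static void main(String[] args) throws Exception {
    Table table = new HadoopTables(new Configuration())
        .load("hdfs://nn/warehouse/db/events"); // assumed location

    try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
      for (FileScanTask task : tasks) {
        DataFile file = task.file();
        System.out.println(file.path() + " (" + file.format() + ")");    // location and format
        System.out.println("  partition:    " + file.partition());       // partition data values
        System.out.println("  rows / bytes: " + file.recordCount() + " / " + file.fileSizeInBytes());
        System.out.println("  null counts:  " + file.nullValueCounts()); // per-column metrics
        System.out.println("  lower bounds: " + file.lowerBounds());     // per-column min values
        System.out.println("  upper bounds: " + file.upperBounds());     // per-column max values
      }
    }
  }
}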
● To commit, a writer must:
○ Note the current metadata version – the base version
○ Create new metadata and manifest files
○ Atomically swap the base version for the new version
● This atomic swap ensures a linear history
● Atomic swap is implemented by:
○ A custom metastore implementation
○ Atomic rename for HDFS or local tables
Commits
● Writers optimistically write new versions:
○ Assume that no other writer is operating
○ On conflict, retry based on the latest metadata
● To support retry, operations are structured as:
○ Assumptions about the current table state
○ Pending changes to the current table state
● Changes are safe if the assumptions are all true
Commits: Conflict Resolution
● Use case: safely merge small files
○ Merge input: file1.avro, file2.avro
○ Merge output: merge1.parquet
● Rewrite operation:
○ Assumption: file1.avro and file2.avro are still present
○ Pending changes:
Remove file1.avro and file2.avro
Add merge1.parquet
● Deleting file1.avro or file2.avro will cause a commit failure
Commits: Resolution Example
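A sketch of that rewrite as a single Iceberg operation (current API names; how the DataFile handles are obtained is elided): the commit validates that the replaced files are still live, so a conflicting delete fails the rewrite instead of corrupting the table.

import java.util.Set;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.exceptions.ValidationException;

public class RewriteExample {
  // file1 and file2 are the small inputs, merged is the compacted output;
  // obtaining these DataFile handles (from a scan and a file writer) is elided.
  static void compact(Table table, DataFile file1, DataFile file2, DataFile merged) {
    try {
      // Assumption checked by the commit: file1 and file2 are still present.
      // Pending changes: remove the inputs and add the merged file, atomically.
      table.newRewrite()
          .rewriteFiles(Set.of(file1, file2), Set.of(merged))
          .commit();
    } catch (ValidationException e) {
      // A concurrent commit removed one of the inputs, so the rewrite fails
      // instead of resurrecting deleted data.
      System.err.println("rewrite conflict: " + e.getMessage());
    }
  }
}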
Presto Iceberg connector
● Why a new connector?
● What it can do
○ Read support is complete
○ Split planning and predicate pushdown
○ All Iceberg data types supported
○ DDL and DML in the works
● Transparent querying between Hive and Iceberg catalogs
● Hive table with no partition information
● Uses the Iceberg API for split planning
● The Iceberg API prunes manifests and data files based on stats
● Stats pruning yields a 3x performance improvement
● A Parquet version upgrade is needed
How it works