Presto Summit 2018 - 09 - Netflix Iceberg

Iceberg
A modern table format for big data
Ryan Blue & Parth Brahmbhatt
July 2018 - Presto Summit

● A Netflix use case and performance results
● What is Iceberg?
○ How large Hive tables (fail to) work
○ How Iceberg works and its benefits
● Iceberg at Netflix
○ Future of Netflix’s data platform
● Iceberg & Raptor comparison
Contents

Iceberg Performance

● Historical Atlas data:
○ Time-series metrics from Netflix runtime systems
○ 1 month: 2.7 million files in 2,688 partitions
○ Problem: cannot process more than a few days of data
● Sample query:
select distinct tags['type'] as type
from iceberg.atlas
where
name = 'metric-name' and
date > 20180222 and date <= 20180228
order by type;
Case Study: Atlas

● Hive table – with Parquet filters:
○ 400k+ splits per day, not combined
○ EXPLAIN query: 9.6 min (planning wall time)
● Iceberg table – partition data filtering:
○ 15,218 splits, combined
○ 13 min (wall time) / 61.5 hr (task time) / 10 sec (planning)
● Iceberg table – partition and min/max filtering:
○ 412 splits
○ 42 sec (wall time) / 22 min (task time) / 25 sec (planning)
Case Study: Atlas Performance
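
The improvement between the two Iceberg runs can be checked with simple arithmetic. The sketch below uses only the numbers on the slide; the variable names are illustrative.

```python
# Figures copied from the slide; names are illustrative only.
splits_partition_only = 15_218
splits_with_min_max = 412

wall_partition_only_s = 13 * 60       # 13 min wall time, in seconds
wall_with_min_max_s = 42              # 42 sec wall time

task_partition_only_min = 61.5 * 60   # 61.5 hr task time, in minutes
task_with_min_max_min = 22            # 22 min task time

split_reduction = splits_partition_only / splits_with_min_max
wall_speedup = wall_partition_only_s / wall_with_min_max_s
task_reduction = task_partition_only_min / task_with_min_max_min

print(f"splits:    {split_reduction:.1f}x fewer")
print(f"wall time: {wall_speedup:.1f}x shorter")
print(f"task time: {task_reduction:.1f}x less")
```

Planning time rises from 10 sec to 25 sec in the second run, so the gain comes from reading far fewer splits rather than from cheaper planning.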

Iceberg’s Design

First, what is a table format?

● Problem: too much directory listing for large tables
● Solution: use HMS to track partitions
○ Partition key to FS location
date=20180513/hour=19 -> hdfs:/.../date=20180513/hour=19
date=20180513/hour=20 -> hdfs:/.../date=20180513/hour=20
○ Enables predicate push-down in HMS for (some) scans
● The file system still tracks the files in each partition...
Hive Metastore
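
A minimal sketch of the partition-to-location mapping described above, assuming a plain dictionary as a stand-in for the metastore. The function name `matching_locations` and the predicate shape are hypothetical; the elided paths are kept exactly as they appear on the slide.

```python
# Hypothetical in-memory stand-in for the metastore's partition table.
partitions = {
    ("20180513", "19"): "hdfs:/.../date=20180513/hour=19",
    ("20180513", "20"): "hdfs:/.../date=20180513/hour=20",
}


def matching_locations(partitions, date, min_hour):
    """Return locations of partitions that satisfy the partition predicate."""
    return [
        location
        for (part_date, part_hour), location in sorted(partitions.items())
        if part_date == date and int(part_hour) >= min_hour
    ]


locations = matching_locations(partitions, "20180513", 20)
print(locations)
# Each returned location must still be listed in the file system
# to discover the data files it contains.
```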

● Table state is stored in two places
○ Partitions in the Hive Metastore
○ Files in a FS with no transaction support
● Requires elaborate locking for correctness
○ Nothing respects the locking scheme
● Still requires directory listing to plan jobs
○ O(n) listing calls, n = # matching partitions
○ Eventual consistency breaks correctness
Design Problems

● Key idea: track all files in a table over time
○ A snapshot is a complete list of files in a table
○ Each write produces and commits a new snapshot
Iceberg’s Design
[Diagram: snapshots S1, S2, S3, ...]

● Snapshot isolation without locking
○ Readers use a current snapshot
○ Writers produce new snapshots in isolation, then commit
● Any change to the file list is an atomic operation
○ Append data across partitions
○ Merge or rewrite files
Snapshot Design Benefits
[Diagram: snapshots S1, S2, S3, ... with a reader (R) and a writer (W)]
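
The snapshot idea can be sketched with a toy in-memory table. This is an illustration under simplifying assumptions (one process, file names as strings), not Iceberg's implementation; the `Table` class and its methods are hypothetical.

```python
class Table:
    """Toy table: a list of immutable snapshots, each a complete file list."""

    def __init__(self):
        self.snapshots = [frozenset()]  # start with an empty table

    def current_snapshot(self):
        return self.snapshots[-1]

    def append(self, new_files):
        # A write builds a new complete file list and commits it in one step.
        self.snapshots.append(self.current_snapshot() | frozenset(new_files))


table = Table()
table.append({"a.parquet"})

reader_view = table.current_snapshot()     # the reader pins a snapshot
table.append({"b.parquet", "c.parquet"})   # a writer commits a new snapshot

# The reader's pinned snapshot is not affected by the later commit.
print(sorted(reader_view))
print(sorted(table.current_snapshot()))
```

Because each snapshot is immutable, the reader needs no lock: it keeps reading the file list it started with.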

In reality, it’s a bit more complicated.

Design Benefits
● Reads and writes are isolated and all changes are atomic
● No expensive or eventually-consistent FS operations:
○ No directory or prefix listing
○ No rename: data files written in place
● Faster scan planning, distributed across the cluster
○ O(1) manifest reads, not O(n) partition list calls
○ Upper and lower bounds used to eliminate files
● Reliable CBO metrics
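
Eliminating files with upper and lower bounds amounts to a range-overlap test. A minimal sketch, assuming integer `date` values like those in the Atlas sample query; the `DataFile` class and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: int  # lower bound of the filtered column in this file
    upper: int  # upper bound of the filtered column in this file


def may_contain(data_file, lo, hi):
    """True if the file's [lower, upper] range overlaps the query range [lo, hi]."""
    return data_file.upper >= lo and data_file.lower <= hi


files = [
    DataFile("f1.parquet", 20180201, 20180210),
    DataFile("f2.parquet", 20180220, 20180225),
    DataFile("f3.parquet", 20180301, 20180305),
]

# Query range modeled on the sample: date > 20180222 and date <= 20180228
selected = [f.path for f in files if may_contain(f, 20180223, 20180228)]
print(selected)
```

The test is conservative: a file whose range overlaps the query may still hold no matching rows, but a file whose range does not overlap can be skipped safely.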

Iceberg at Netflix

● Hidden partitioning
○ Partition filters derived from data filters
○ No more accidental full table scans
● Full schema evolution
○ Supports add, drop, and rename columns
● Reliable support for types
○ date, time, timestamp, and decimal
○ struct, list, map, and mixed nesting
Works as Users Expect
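
Deriving a partition filter from a data filter can be sketched as follows. The example assumes a table partitioned by the calendar day of a timestamp column; that layout and the function names are assumptions made for illustration, not something the slide specifies.

```python
from datetime import date, datetime


def day_partition(ts):
    """Hypothetical partition transform: a timestamp maps to its calendar day."""
    return ts.date()


def partition_filter(ts_min, ts_max):
    """Derive an inclusive day range from an inclusive timestamp range."""
    return day_partition(ts_min), day_partition(ts_max)


partitions = [date(2018, 2, 21), date(2018, 2, 22), date(2018, 2, 23), date(2018, 2, 24)]

# The user filters on the timestamp column only; no partition column is named.
first_day, last_day = partition_filter(
    datetime(2018, 2, 22, 18, 0), datetime(2018, 2, 23, 6, 0)
)
to_scan = [p for p in partitions if first_day <= p <= last_day]
print(to_scan)
```

Because the partition range is computed from the data filter, a query that forgets to mention a partition column does not fall back to a full table scan.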

● Queries are not broken by layout changes
● Physical layout can evolve without painful migration
○ Mistakes can be fixed
○ Prototypes can move to production faster
○ Tables can change as volume grows over time
● Data Platform can transparently fix table layout
Table Layout is Hidden

● Any write is atomic – either complete or invisible
○ Rewrite files instead of partitions
○ Tables never have partially committed data
● Simple, built-in change detection
○ Cache and materialized view maintenance
○ Incremental processing
● Data Platform can monitor and fix data files
○ Compact small files
○ Repartition to a new layout
Snapshot-based Tables
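
Change detection falls out of comparing two snapshots' file lists. A minimal sketch, assuming each snapshot is available as a set of file names (all names below are hypothetical):

```python
def changes_between(old_snapshot, new_snapshot):
    """Return (added, removed) files between two snapshots' file sets."""
    return new_snapshot - old_snapshot, old_snapshot - new_snapshot


s1 = frozenset({"a.avro", "b.avro", "c.parquet"})
s2 = frozenset({"c.parquet", "ab-merged.parquet", "d.avro"})

added, removed = changes_between(s1, s2)
print(sorted(added))
print(sorted(removed))
```

An incremental job or a materialized view refresh would process only the added and removed files instead of rescanning the table.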

● Common implementation for table operations
○ Tune write options per table (Parquet row group size)
○ Tune read defaults once (split combination)
● Simple data gathering
○ Log scan predicates and projection to Kafka
○ Analyze table settings from the Iceberg table
● Data Platform can automate tuning configuration
○ Test file format tuning settings per table
○ Update table to affect all writes
Table Format Library

● Current: merge service
○ Dedicated cluster to convert data to Parquet
○ Makes data available to tables after merge completes
● Iceberg: Data is available after commit, in row-oriented format
● Autotune (planned)
○ Recommend tuning parameters
○ Rewrite files based on priority
○ Opportunistic Parquet conversion
Autotune: Data Librarian

Iceberg & Raptor Comparison

● Fine-grained tracking of data
○ Iceberg: file-level, Raptor: shard-level
● Use min/max values for efficient job planning
● Safe atomic writes
● Same metadata, different purpose and design
Raptor Similarities

● Raptor targets low-latency query execution
○ Data stored in flash
○ Shard tracking in MySQL
○ Built for Presto
● Iceberg manages table metadata targeting scale
○ Open specification for Spark, Presto, and others
○ Distributed metadata workload
○ Atomic changes across clusters and engines
● Iceberg & Raptor are complementary projects
Raptor Differences

Getting Started with Iceberg

● github.com/Netflix/iceberg
○ Apache Licensed, ALv2
○ Spark 2.3.x data source plug-in
○ Pig (read-only) support in development
○ Planned Python support
● Presto Iceberg PR: coming soon!
Using Iceberg

Questions?

Additional Slides

● Implementation of snapshot-based tracking
○ Adds table schema, partition layout, string properties
○ Tracks old snapshots for eventual garbage collection
● Table metadata is immutable and always moves forward
● The current snapshot (pointer) can be rolled back
Iceberg Metadata
[Diagram: metadata files v1.json (S1, S2), v2.json (S1, S2, S3), v3.json (S2, S3)]
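
The version sequence in the diagram can be reproduced with immutable metadata objects. This is a sketch under the assumption that every change, including a rollback, writes a new metadata version; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    version: int
    snapshots: tuple  # retained snapshot ids, oldest first
    current: str      # id of the current snapshot


def add_snapshot(meta, snapshot_id):
    return TableMetadata(meta.version + 1, meta.snapshots + (snapshot_id,), snapshot_id)


def expire_snapshot(meta, snapshot_id):
    if snapshot_id == meta.current:
        raise ValueError("cannot expire the current snapshot")
    kept = tuple(s for s in meta.snapshots if s != snapshot_id)
    return TableMetadata(meta.version + 1, kept, meta.current)


def roll_back(meta, snapshot_id):
    if snapshot_id not in meta.snapshots:
        raise ValueError(f"snapshot not retained: {snapshot_id}")
    return TableMetadata(meta.version + 1, meta.snapshots, snapshot_id)


v1 = TableMetadata(1, ("S1", "S2"), "S2")
v2 = add_snapshot(v1, "S3")      # v2.json: S1 S2 S3
v3 = expire_snapshot(v2, "S1")   # v3.json: S2 S3
v4 = roll_back(v3, "S2")         # hypothetical fourth version, not on the slide
print(v3.snapshots, v4.current)
```

Earlier versions are never modified, so v1 still describes the table as it was when v1 was written.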

● Snapshots are split across one or more manifest files
○ Manifests store partition data for each data file
○ Reused to avoid high write volume
Manifest Files
[Diagram: metadata files v1.json (S1, S2), v2.json (S1, S2, S3), v3.json (S2, S3); manifest files m0.avro, m1.avro, m2.avro]
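
Manifest reuse can be illustrated with a toy mapping from snapshots to manifests. The slide does not say which snapshot references which manifest, so the mapping below is an assumption chosen for illustration, as is the `append` helper.

```python
# Hypothetical snapshot-to-manifest mapping; the slide does not specify it.
snapshots = {"S1": ("m0.avro",)}


def append(snapshots, base_id, new_id, new_manifest):
    """New snapshot = the base snapshot's manifests plus one newly written manifest."""
    snapshots[new_id] = snapshots[base_id] + (new_manifest,)


append(snapshots, "S1", "S2", "m1.avro")
append(snapshots, "S2", "S3", "m2.avro")

manifests_written = len({m for ms in snapshots.values() for m in ms})
manifests_referenced = sum(len(ms) for ms in snapshots.values())
print(manifests_written, manifests_referenced)
```

In this toy model each commit writes one new manifest and refers to the existing ones, rather than rewriting the whole file list.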

● Basic data file info:
○ File location and format
○ Iceberg tracking data
● Values to filter files for a scan:
○ Partition data values
○ Per-column lower and upper bounds
● Metrics for cost-based optimization:
○ File-level: row count, size
○ Column-level: value count, null count, size
Manifest File Contents
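
The contents listed above can be pictured as one record per data file. The field names below are hypothetical and are not taken from the Iceberg specification; the "Iceberg tracking data" is omitted because the slide does not describe it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ManifestEntry:
    # Basic data file info
    file_path: str
    file_format: str
    # Values used to filter files for a scan
    partition: dict
    lower_bounds: dict
    upper_bounds: dict
    # Metrics for cost-based optimization
    row_count: int
    file_size_bytes: int
    value_counts: dict = field(default_factory=dict)
    null_counts: dict = field(default_factory=dict)
    column_sizes: dict = field(default_factory=dict)


entries = [
    ManifestEntry("data/part-0.parquet", "parquet", {"date": 20180222},
                  {"date": 20180222}, {"date": 20180222}, 1000, 4096),
    ManifestEntry("data/part-1.parquet", "parquet", {"date": 20180223},
                  {"date": 20180223}, {"date": 20180223}, 250, 1024),
]

# A planner can estimate scan size from metadata alone, without opening files.
total_rows = sum(e.row_count for e in entries)
total_bytes = sum(e.file_size_bytes for e in entries)
print(total_rows, total_bytes)
```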

● To commit, a writer must:
○ Note the current metadata version – the base version
○ Create new metadata and manifest files
○ Atomically swap the base version for the new version
● This atomic swap ensures a linear history
● Atomic swap is implemented by:
○ A custom metastore implementation
○ Atomic rename for HDFS or local tables
Commits
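
The atomic swap behaves like a compare-and-swap on the pointer to the current metadata file. The sketch below uses an in-process stand-in for the metastore; the `VersionPointer` class is hypothetical, and its lock exists only to make the toy swap atomic within one process.

```python
import threading


class VersionPointer:
    """Toy stand-in for a metastore holding the current metadata location."""

    def __init__(self, location):
        self._location = location
        self._lock = threading.Lock()  # makes the toy swap atomic in-process

    def current(self):
        with self._lock:
            return self._location

    def compare_and_swap(self, base, new):
        """Replace base with new only if base is still the current version."""
        with self._lock:
            if self._location != base:
                return False
            self._location = new
            return True


pointer = VersionPointer("v1.json")
base = pointer.current()

first = pointer.compare_and_swap(base, "v2.json")         # base is current
second = pointer.compare_and_swap(base, "v2-other.json")  # base is now stale
print(first, second, pointer.current())
```

Only one of two writers that start from the same base version can succeed, which is what keeps the history linear.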

● Writers optimistically write new versions:
○ Assume that no other writer is operating
○ On conflict, retry based on the latest metadata
● To support retry, operations are structured as:
○ Assumptions about the current table state
○ Pending changes to the current table state
● Changes are safe if the assumptions are all true
Commits: Conflict Resolution
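
The optimistic retry loop can be sketched as below. The `Table` class, the `commit` function, and the `before_swap` hook (used only to simulate a concurrent writer) are hypothetical and are not Iceberg's API.

```python
class Table:
    def __init__(self, files):
        self.version = 1
        self.files = frozenset(files)

    def try_swap(self, base_version, new_files):
        """Commit new_files only if nobody committed since base_version."""
        if self.version != base_version:
            return False
        self.version += 1
        self.files = frozenset(new_files)
        return True


def commit(table, assumption, change, max_attempts=3, before_swap=None):
    for attempt in range(1, max_attempts + 1):
        base_version, base_files = table.version, table.files
        if not assumption(base_files):
            raise RuntimeError("assumption no longer holds; commit failed")
        new_files = change(base_files)  # re-apply pending changes to latest state
        if before_swap is not None:
            before_swap(attempt)        # test hook simulating a concurrent writer
        if table.try_swap(base_version, new_files):
            return attempt
    raise RuntimeError("too many conflicts")


table = Table({"a.parquet"})


def concurrent_append(attempt):
    if attempt == 1:
        table.try_swap(table.version, table.files | {"other.parquet"})


attempts = commit(
    table,
    assumption=lambda files: True,  # this toy append has no preconditions
    change=lambda files: files | {"b.parquet"},
    before_swap=concurrent_append,
)
print(attempts, sorted(table.files))
```

The first attempt loses the race, and the second attempt re-applies the pending change on top of the other writer's commit, so both appends survive.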

● Use case: safely merge small files
○ Merge input: file1.avro, file2.avro
○ Merge output: merge1.parquet
● Rewrite operation:
○ Assumption: file1.avro and file2.avro are still present
○ Pending changes:
Remove file1.avro and file2.avro
Add merge1.parquet
● If another writer deletes file1.avro or file2.avro first, the rewrite commit fails
Commits: Resolution Example
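
The rewrite example can be expressed as an assumption check followed by the pending changes. A minimal sketch; the `rewrite` function and the extra file3.avro are hypothetical.

```python
def rewrite(current_files, to_remove, to_add):
    """Validate the assumption, then return the new file set."""
    missing = set(to_remove) - set(current_files)
    if missing:
        raise RuntimeError(f"commit failed, files no longer present: {sorted(missing)}")
    return (set(current_files) - set(to_remove)) | set(to_add)


table_files = {"file1.avro", "file2.avro", "file3.avro"}
merged = rewrite(table_files, {"file1.avro", "file2.avro"}, {"merge1.parquet"})
print(sorted(merged))

# If a concurrent delete removed file1.avro first, the same rewrite fails.
try:
    rewrite({"file2.avro", "file3.avro"}, {"file1.avro", "file2.avro"}, {"merge1.parquet"})
    conflict_detected = False
except RuntimeError:
    conflict_detected = True
print(conflict_detected)
```

Failing here is the safe outcome: committing the merge after file1.avro was deleted would bring the deleted rows back through merge1.parquet.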

Presto Iceberg connector
● Why a new connector?
● What it can do
○ Read support is done
○ Split planning, predicate pushdown
○ All Iceberg data types supported
○ DDL and DML in the works
● Transparent querying between Hive and Iceberg catalogs

● Hive table with no partition information
● Use the Iceberg API for split planning
● The Iceberg API prunes manifests and data files based on stats
● Stats pruning yields a 3x performance improvement
● Parquet version upgrade needed
How it works
