SlideShare a Scribd company logo
Delta by example
Palla Lentz, Assoc. Resident Solutions Architect
Jake Therianos, Customer Success Engineer
A Data Engineer’s Dream...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
Process data continuously and incrementally as new data arrive in a
cost efficient way without having to choose between batch or
streaming
Table
(Data gets written
continuously)
AI & Reporting
Events
Spark job gets exponentially slower
with time due to small files.
Stream
Stream
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Table
(Data gets compacted
every hour)
Batch Batch
Late arriving data means
processing need to be delayed
Stream
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Table
(Data gets compacted
every hour) Few hours latency doesn’t
satisfy business needs
Batch Batch
Stream
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch
Stream
Unified View
Lambda arch increases
operational burden
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Validations and other cleanup
actions need to be done twice
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Fixing mistakes means
blowing up partitions and
doing atomic re-publish
Reprocessing
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Updates & Merge get
complex with data lake
Reprocessing
Update & Merge
Stream
Table
(Data gets compacted
every hour)
The Data Engineer’s Journey...
Table
(Data gets written
continuously)
AI & Reporting
Events
Batch Batch
Stream
Unified ViewValidation
Updates & Merge get
complex with data lake
Reprocessing
Update & Merge
Can this be simplified?
Stream
The Data Engineer’s Journey...
Table
(Data gets compacted
every hour)
What was missing?
1. Ability to read consistent data while data is being written
1. Ability to read incrementally from a large table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new data that arrived
1. Ability to handle late arriving data without having to delay downstream processing
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
So… What is the answer?
STRUCTURED
STREAMING
+ =
The
Delta
Architecture
1. Unify batch & streaming with a continuous data flow model
2. Infinite retention to replay/reprocess historical events as needed
3. Independent, elastic compute and storage to scale while balancing costs
How Delta Lake Works?
Delta On Disk
my_table/
_delta_log/
00000.json
00001.json
date=2019-01-01/
file-1.parquet
Transaction Log
Table Versions
(Optional) Partition Directories
Data Files
Table = result of a set of actions
Change Metadata – name, schema, partitioning, etc
Add File – adds a file (with optional statistics)
Remove File – removes a file
Result: Current Metadata, List of Files, List of Txns, Version
Implementing Atomicity
Changes to the table
are stored as
ordered, atomic
units called commits
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
000000.json
000001.json
…
Solving Conflicts Optimistically
1. Record start version
2. Record reads/writes
3. Attempt commit
4. If someone else wins,
check if anything you
read has changed.
5. Try again.
000000.json
000001.json
000002.json
User 1 User 2
Write: Append
Read: Schema
Write: Append
Read: Schema
Handling Massive Metadata
Large tables can have millions of files in them! How do we scale
the metadata? Use Spark for scaling!
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
Checkpoint
The Delta Architecture
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
Connecting the dots...
Snapshot isolation between writers and
readers
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Stream the backfilled historical data through
the same pipeline
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new
data that arrived
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new
data that arrived
1. Ability to handle late arriving data without
having to delay downstream processing
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Stream the backfilled historical data through
the same pipeline
Stream any late arriving data added to the
table as they get added
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
?
1. Ability to read consistent data while data is
being written
1. Ability to read incrementally from a large
table with good throughput
1. Ability to rollback in case of bad writes
1. Ability to replay historical data along new
data that arrived
1. Ability to handle late arriving data without
having to delay downstream processing
Snapshot isolation between writers and
readers
Optimized file source with scalable metadata
handling
Time travel
Stream the backfilled historical data through
the same pipeline
Stream any late arriving data added to the
table as they get added
Connecting the dots...
Data Lake
CSV,
JSON, TXT…
Kinesis
AI & Reporting
AI & Reporting
Streaming
Analytics
Data Lake
CSV,
JSON, TXT…
Kinesis
The Delta Architecture
A continuous data flow model to unify batch & streaming
Up next...DEMO
Website: https://delta.io
Community (Slack/Email): https://delta.io/#community

More Related Content

What's hot

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
Dustin Vannoy
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
Databricks
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
Databricks
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
DataScienceConferenc1
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
James Serra
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Kai Wähner
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 

What's hot (20)

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks DeltaEnd-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
End-to-End Spark/TensorFlow/PyTorch Pipelines with Databricks Delta
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache KafkaReal-Life Use Cases & Architectures for Event Streaming with Apache Kafka
Real-Life Use Cases & Architectures for Event Streaming with Apache Kafka
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 

Similar to Delta from a Data Engineer's Perspective

Delta Architecture
Delta ArchitectureDelta Architecture
Delta Architecture
Paulo Gutierrez
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
Databricks
 
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Data Con LA
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Anyscale
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
Databricks
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
Miklos Christine
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
Helena Edelson
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
Matei Zaharia
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Microsoft Tech Community
 
Oracle Database Overview
Oracle Database OverviewOracle Database Overview
Oracle Database Overview
honglee71
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm trees
Tilak Patidar
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Databricks
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
felixcss
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Tathagata Das
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
Vinoth Chandar
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Tech Triveni
 
Database Systems - A Historical Perspective
Database Systems - A Historical PerspectiveDatabase Systems - A Historical Perspective
Database Systems - A Historical Perspective
Karoly K
 

Similar to Delta from a Data Engineer's Perspective (20)

Delta Architecture
Delta ArchitectureDelta Architecture
Delta Architecture
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael ArmbrustOpen Source Reliability for Data Lake with Apache Spark by Michael Armbrust
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust
 
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch AnalysisNoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...Leveraging Azure Databricks to minimize time to insight by combining Batch an...
Leveraging Azure Databricks to minimize time to insight by combining Batch an...
 
Oracle Database Overview
Oracle Database OverviewOracle Database Overview
Oracle Database Overview
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
Write intensive workloads and lsm trees
Write intensive workloads and lsm treesWrite intensive workloads and lsm trees
Write intensive workloads and lsm trees
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Apache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best PracticesApache: Big Data - Starting with Apache Spark, Best Practices
Apache: Big Data - Starting with Apache Spark, Best Practices
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
 
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
[Pulsar summit na 21] Change Data Capture To Data Lakes Using Apache Pulsar/Hudi
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
Database Systems - A Historical Perspective
Database Systems - A Historical PerspectiveDatabase Systems - A Historical Perspective
Database Systems - A Historical Perspective
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
talha2khan2k
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
Sathiakumar Chandr
 
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...
JeevanKp7
 
Selcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdfSelcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdf
SelcukTOPAL2
 
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
Ladislau5
 
Full Disclosure Board Policy.docx BRGY LICUMA
Full  Disclosure Board Policy.docx BRGY LICUMAFull  Disclosure Board Policy.docx BRGY LICUMA
Full Disclosure Board Policy.docx BRGY LICUMA
brgylicumaormoccity
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
Priyanka Jadhav
 
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
Milind Agarwal
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
wojakmodern
 
SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024
Becky Burwell
 
Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
49AkshitYadav
 
Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...
Samuel Jackson
 
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
femim26318
 
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di MalaysiaBimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
aznidajailani
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
AltanAtabarut
 
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERINGSOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
PrabhuB33
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
AkhinaRomdoni
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
rightmanforbloodline
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
HeidiLivengood
 
Getting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdfGetting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdf
Riya Sen
 

Recently uploaded (20)

Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
 
Aws MLOps Interview Questions with answers
Aws MLOps Interview Questions  with answersAws MLOps Interview Questions  with answers
Aws MLOps Interview Questions with answers
 
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ...
 
Selcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdfSelcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdf
 
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
393947940-The-Dell-EMC-PowerMax-Family-Overview.pdf
 
Full Disclosure Board Policy.docx BRGY LICUMA
Full  Disclosure Board Policy.docx BRGY LICUMAFull  Disclosure Board Policy.docx BRGY LICUMA
Full Disclosure Board Policy.docx BRGY LICUMA
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
 
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
 
SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024
 
Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
 
Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...
 
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
Cal Girls Mansarovar Jaipur | 08445551418 | Rajni High Profile Girls Call in ...
 
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di MalaysiaBimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
Bimbingan kaunseling untuk pelajar IPTA/IPTS di Malaysia
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
 
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERINGSOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
 
Getting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdfGetting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdf
 

Delta from a Data Engineer's Perspective

  • 1. Delta by example Palla Lentz, Assoc. Resident Solutions Architect Jake Therianos, Customer Success Engineer
  • 2. A Data Engineer’s Dream... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting Process data continuously and incrementally as new data arrive in a cost efficient way without having to choose between batch or streaming
  • 3. Table (Data gets written continuously) AI & Reporting Events Spark job gets exponentially slower with time due to small files. Stream Stream The Data Engineer’s Journey...
  • 4. Table (Data gets written continuously) AI & Reporting Events Table (Data gets compacted every hour) Batch Batch Late arriving data means processing need to be delayed Stream The Data Engineer’s Journey...
  • 5. Table (Data gets written continuously) AI & Reporting Events Table (Data gets compacted every hour) Few hours latency doesn’t satisfy business needs Batch Batch Stream The Data Engineer’s Journey...
  • 6. Table (Data gets written continuously) AI & Reporting Events Batch Stream Unified View Lambda arch increases operational burden Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 7. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Validations and other cleanup actions need to be done twice Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 8. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Fixing mistakes means blowing up partitions and doing atomic re-publish Reprocessing Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 9. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Updates & Merge get complex with data lake Reprocessing Update & Merge Stream Table (Data gets compacted every hour) The Data Engineer’s Journey...
  • 10. Table (Data gets written continuously) AI & Reporting Events Batch Batch Stream Unified ViewValidation Updates & Merge get complex with data lake Reprocessing Update & Merge Can this be simplified? Stream The Data Engineer’s Journey... Table (Data gets compacted every hour)
  • 11. What was missing? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived 1. Ability to handle late arriving data without having to delay downstream processing Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ?
  • 12. So… What is the answer? STRUCTURED STREAMING + = The Delta Architecture 1. Unify batch & streaming with a continuous data flow model 2. Infinite retention to replay/reprocess historical events as needed 3. Independent, elastic compute and storage to scale while balancing costs
  • 13. How Delta Lake Works?
  • 15. Table = result of a set of actions Change Metadata – name, schema, partitioning, etc Add File – adds a file (with optional statistics) Remove File – removes a file Result: Current Metadata, List of Files, List of Txns, Version
  • 16. Implementing Atomicity Changes to the table are stored as ordered, atomic units called commits Add 1.parquet Add 2.parquet Remove 1.parquet Remove 2.parquet Add 3.parquet 000000.json 000001.json …
  • 17. Solving Conflicts Optimistically 1. Record start version 2. Record reads/writes 3. Attempt commit 4. If someone else wins, check if anything you read has changed. 5. Try again. 000000.json 000001.json 000002.json User 1 User 2 Write: Append Read: Schema Write: Append Read: Schema
  • 18. Handling Massive Metadata Large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling! Add 1.parquet Add 2.parquet Remove 1.parquet Remove 2.parquet Add 3.parquet Checkpoint
  • 20. Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ?
  • 21. Connecting the dots... Snapshot isolation between writers and readers Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written
  • 22. Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput
  • 23. Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes
  • 24. Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Stream the backfilled historical data through the same pipeline Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ? 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived
  • 25. 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived 1. Ability to handle late arriving data without having to delay downstream processing Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Stream the backfilled historical data through the same pipeline Stream any late arriving data added to the table as they get added Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting ?
  • 26. 1. Ability to read consistent data while data is being written 1. Ability to read incrementally from a large table with good throughput 1. Ability to rollback in case of bad writes 1. Ability to replay historical data along new data that arrived 1. Ability to handle late arriving data without having to delay downstream processing Snapshot isolation between writers and readers Optimized file source with scalable metadata handling Time travel Stream the backfilled historical data through the same pipeline Stream any late arriving data added to the table as they get added Connecting the dots... Data Lake CSV, JSON, TXT… Kinesis AI & Reporting
  • 27. AI & Reporting Streaming Analytics Data Lake CSV, JSON, TXT… Kinesis The Delta Architecture A continuous data flow model to unify batch & streaming
  • 28. Up next...DEMO Website: https://delta.io Community (Slack/Email): https://delta.io/#community