Reshape Data Lake (As of 2020.07)
Eric Sun @ LinkedIn
https://www.linkedin.com/in/ericsun
SF Big Analytics
Similar Presentation/Blog(s)
https://databricks.com/session_na20/a-thorough-comparison-of-delta-lake-iceberg-and-hudi
https://databricks.com/session_eu19/end-to-end-spark-tensorflow-pytorch-pipelines-with-databricks-delta
https://bit.ly/comparison-of-delta-iceberg-hudi-by-domisj
https://bit.ly/acid-iceberg-delta-comparison-by-wssbck
Disclaimer
The views expressed in this presentation are those of the author and do not reflect any policy or
position of the author's employers. The audience is welcome to verify the anecdotes mentioned below.
Vocabulary & Jargon
● T+1: event/transaction time plus 1 day - typical daily-batch
T+0: realtime process which can deliver insight with minimal delay
T+0.000694: minutely-batch; T+0.041666: hourly-batch
● Delta Engine: Spark compiled in LLVM (similar to Dremio Gandiva)
● Skipping Index: Min/Max, Bloom Filter, and ValueList w/ Z-Ordering
● DML: Insert + Delete + Update + Upsert/Merge
● Time Travel: isolate & preserve multiple snapshot versions
● SCD-2: type 2 of multi-versioned data model to provide time travel
● Object/Cloud Storage: S3/IA/Glacier, ABS/Cool/Archive, GCS/NL/CL
● Streaming & Batch Unification: union historical bounded data with
continuous stream; interactively query both anytime
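The last bullet is easiest to see in code. A minimal PySpark sketch (the Delta path and the eventType/pageId columns are hypothetical): the same transformation is applied once to a bounded batch read and once to an unbounded stream read of the same table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-stream-unification").getOrCreate()

batch_df  = spark.read.format("delta").load("/data/events")        # bounded, T+1 view
stream_df = spark.readStream.format("delta").load("/data/events")  # unbounded, T+0 view

def clicks_per_page(df):
    # identical business logic for bounded and unbounded input
    return df.filter(F.col("eventType") == "click").groupBy("pageId").count()

clicks_per_page(batch_df).show()                     # interactive batch query
(clicks_per_page(stream_df)
    .writeStream.outputMode("complete")
    .format("memory").queryName("clicks_rt")
    .start())                                        # continuous streaming query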
Data Warehouse vs. Data Lake v1 vs. Data Lake v2
● Data Warehouse
○ Relational DB based MPP
○ ETL done by IT team; ELT inside MPP
○ Star schema
○ OLAP and BI focused
○ SQL is the main DSL; ODBC + JDBC as ⇿ interface
○ <Expensive to scale …>
○ Limited UD*F to run R and Data Mining inside database
● Data Lake v1
○ HDFS + NoSQL
○ ETL done by Java folks
○ Nested schema or no schema
○ Hive used by non-engineers
○ Export data back to RDBMS for OLAP/BI
○ M/R API & DSL dominated
○ Scalable ML became possible
○ <Hard to operate …>
○ UD*F & SerDe made easier
● Data Lake v2
○ Cloud + HTAP/MPP + NoSQL
○ ETL done by data people in Spark and Presto
○ Data model and schema matter again
○ Streaming + Batch ⇨ unified
○ More expressed in SQL + Python
○ ML as a critical use case
○ <Too confused to migrate…>
○ Non-JVM engines emerge

Share So Much In Common
Despite all the marketing buzzwords and manipulations, 'data lakehouse', 'data lake', and 'data warehouse' all exist to solve the same data integration and insight generation problems. The implementations will continue to evolve as new hardware and software become viable and practical.
● ACID
● Mutable (Delete, Update, Compact)
● Schema (DDL and Evolution)
● Metadata (Rich, Performant)
● Open (Format, API, Tooling, Adoption)
● Fast (Optimized for Various Patterns)
● Extensible (User-defined ***, Federation)
● Intuitive (Data-centric Operation/Language)
● Productive (Achieve more with less)
● Practical (Join, Aggregate, Cache, View)
Solution Architecture Template
● Sources → CDC / Ingestion (T+0 or T+0.000694; T+0.0416 or T+1; ...)
● Storage → Data Format and SerDe → Metadata Catalog and Table API → Unified Data Interface
● Use cases: Ads, BI/OLAP, Machine Learning, Deep Learning, Observability, Recommendation, A/B Test
Data Analytics in Cloud Storage
● Object Store is not a File System (see the boto3 sketch after this list)
○ There are no hierarchy semantics to rename or inherit
○ Objects are not appendable (in general)
○ Metadata is limited to a few KB
● REST is easy to program but RPC is much faster
○ The job/query planning step needs a lot of small scans (it is chatty)
○ A 4MB cache block size may be inefficient for metadata operations
● The Hadoop stack is tightly coupled with HDFS notions
○ Hive and Spark (originally) were not optimized for object stores
○ Running HDFS as a cache/intermediate layer on a VM fleet can be useful yet suboptimal (and operationally heavy)
○ Data locality still matters for SLA-sensitive batch jobs
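A minimal boto3 sketch of the first bullet (bucket and keys are hypothetical): on an object store a "rename" is a copy plus a delete per object, and a "directory listing" is just a prefix scan, which is why Hive-style commit-by-rename is costly there.
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake"  # hypothetical bucket

# "rename" tmp/part-0.parquet -> final/part-0.parquet (copy, then delete)
s3.copy_object(Bucket=bucket,
               Key="final/part-0.parquet",
               CopySource={"Bucket": bucket, "Key": "tmp/part-0.parquet"})
s3.delete_object(Bucket=bucket, Key="tmp/part-0.parquet")

# "list a directory" = paginate over a key prefix (chatty during job planning)
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="final/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])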
Big Data becomes too big, even the Metadata
● Computation costs keep rising for big data
○ Partitioning the files by date is not enough
○ Hot and warm data sizes are still very big (how to save $$$)
○ Analytics often scan big data files but discard 90% of the records and 80% of the fields, yet the CPU, memory, network and I/O cost is billed for 100%
○ Columnar formats have skipping indexes and projection pushdown, but how can they be fetched swiftly? (see the sketch below)
● Hive metadata only manages directories (HIVE-9452 abandoned)
○ Commits can happen at the file or file-group level (instead of the directory level)
○ High-performance engines need better file layout and rich metadata at the field level for each segment/chunk in a file
○ Process metadata via a Java ORM?
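A minimal PySpark sketch of the skipping/pushdown point above (path and columns are hypothetical; it assumes an existing SparkSession named spark): only the projected fields are read and the filter can prune row groups via Parquet min/max statistics, but the planner still has to reach every file footer to learn that.
from pyspark.sql import functions as F

events = spark.read.parquet("/data/events_parquet")        # hypothetical dataset

pruned = (events
    .select("memberId", "eventTime", "eventType")           # projection pushdown: read ~20% of fields
    .filter(F.col("eventTime") >= "2020-07-01"))            # predicate pushdown: row-group skipping

pruned.explain(True)   # PushedFilters / ReadSchema show what actually reaches the scan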

Immutable or Mutable
● Big data is all about immutable schemaless data
○ To get useful insights and features out of the raw data, we still have to dedupe, transform, conform, merge, aggregate, and backfill
○ Schema evolution happens frequently when merges & backfills occur
● Storage is infinite and compute is cheap
○ Why not rewrite the entire data file or directory all the time?
○ If it is slow, increase the number of partitions and executors
● Streaming and Batch Unification requires decent incremental logic
○ Store granularly with ACID isolation and clear watermarks
○ Process incrementally without partial reads or duplicates (see the sketch below)
○ Evolve reliably with enough flexibility
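A minimal Structured Streaming sketch of the incremental bullets above (paths and columns are hypothetical): a watermark bounds how late data may arrive and dropDuplicates removes replays within that window, so the output table is appended exactly once per event.
raw = spark.readStream.format("delta").load("/data/raw_events")   # append-only source

deduped = (raw
    .withWatermark("eventTime", "2 hours")          # clear watermark, bounds the dedup state
    .dropDuplicates(["eventId", "eventTime"]))      # no duplicates within the watermark window

(deduped.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/clean_events")   # exactly-once bookkeeping
    .outputMode("append")
    .start("/data/clean_events"))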
Are All Open Standards Equal?
● Hive 3.x
○ DML (based on ORC + Bucketing + on-the-fly Merge + Compactor)
○ Streaming Ingestion API, LLAP (daemon, caching, faster execution)
● Iceberg
○ Flexible Field Schema and Partition Layout Evolution (S3-first)
○ Hidden Partition (expression-based) and Bucket Transformation
● Delta Lake
○ Everything done by Spark + Parquet, DML (Copy-On-Write) + SCD-2
○ Fully supported in SparkSQL, PySpark and Delta Engine
● Hudi
○ Optimized UPSERT with indexing (record key, file id, partition path)
○ Merge-on-Read (low-latency write) or Copy-on-Write (HDFS-first)
Why is Iceberg so cool?
● Netflix is the most advanced AWS flagship partner
○ S3 is very scalable but a little bit over-simplified
○ Iceberg solves the critical cloud storage problems:
■ Avoid renames
■ Avoid directory hierarchy and naming conventions
■ Aggregate (index) metadata into a compacted (manifest) file (see the metadata query sketch below)
● Netflix has migrated to Flink for stream processing
○ Fast ETL/analytics are needed to respond to its non-stop VOD business
○ Netflix runs one of the biggest Cassandra clusters (fewer mutability headaches)
○ No urgent need for DML yet
● Netflix uses multiple data platforms/engines, and migrates faster than ...
○ Iceberg supports other file formats, engines, schemas, and bucketing by nature
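A minimal Spark SQL sketch of the manifest/metadata point above (it reuses the prod.db.sample_table name from the Iceberg snippets later in this deck; column names follow the Iceberg metadata tables and may differ by version): snapshots and manifests are themselves queryable, so planning does not have to list the object store.
spark.sql("""
  SELECT committed_at, snapshot_id, operation, manifest_list
  FROM prod.db.sample_table.snapshots
  ORDER BY committed_at DESC
""").show(truncate=False)

spark.sql("SELECT * FROM prod.db.sample_table.manifests").show()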
Why is Delta Lake so handy?
● If you love to use Spark for ETL (Streaming & Batch), Delta Lake just makes it so much more powerful
○ The API and SQL syntax are easy to use (especially for data folks)
○ Wide range of patterns provided by paying customers and the OSS community
○ (Feel locked in?) It is well-tested, less buggy, and more usable in all 3 clouds
● Databricks has full control and moves very fast
○ v0.2 (cloud storage support: June 2019)
○ v0.3 (DML: Aug 2019), v0.4 (SQL syntax, Python API: Sep 2019)
○ v0.5 (DML & compaction performance, Presto integration: Dec 2019; see the compaction sketch below)
○ v0.6 (schema evolution during merge, read by path: Apr 2020)
○ v0.7 (DDL for Hive Metastore, retention control, ADLSv2: Jun 2020)
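A minimal PySpark sketch of the v0.5-era compaction pattern referenced above (path and partition value are hypothetical): a partition is rewritten into fewer files while the commit is marked dataChange=false, so concurrent streaming readers are not re-triggered.
path = "/data/delta/events"   # hypothetical Delta table

(spark.read.format("delta")
    .load(path)
    .where("date = '2020-07-01'")
    .repartition(16)                                   # target number of files
    .write
    .format("delta")
    .mode("overwrite")
    .option("dataChange", "false")                     # compaction only, no logical change
    .option("replaceWhere", "date = '2020-07-01'")     # rewrite just this partition
    .save(path))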

Why is Hudi faster?
● Uber is a true fast-data company
○ Its marketplace and supply-demand-matching business model seriously depends on near real-time analytics:
■ Directly upsert the MySQL binlog into a Hudi table (see the upsert sketch below)
■ Frequent bulk dumps of Cassandra are obviously infeasible
■ record_key is indexed (file names + bloom filters) to speed up upserts
■ Batch favors Copy-on-Write but Streaming likes Merge-on-Read
■ Snapshot queries are faster, while Incremental queries have lower latency
● Uber is also committed to Flink
● Uber mainly builds its own data centers and HDFS clusters
○ So Hudi is mainly optimized for on-prem HDFS with Hive conventions
○ GCP and AWS support was added later
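A minimal PySpark sketch of the upsert path described above (updatesDF, basePath, and the field names are hypothetical, loosely following the Hudi quickstart trips example): the record key plus partition path drive the index lookup, the precombine field resolves duplicate versions of a key, and the table type picks the Copy-on-Write vs Merge-on-Read trade-off.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "partitionpath",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",   # or COPY_ON_WRITE
}

(updatesDF.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save(basePath))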
Code Snippets - Delta
# PySpark (assumes: from delta.tables import DeltaTable; from pyspark.sql.functions import col)
spark.readStream.format("delta").load("/path/to/delta/events")
deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
# Upsert (merge) new data
newData = spark.range(0, 20)
deltaTable.alias("oldData") \
  .merge(
    newData.alias("newData"),
    "oldData.id = newData.id") \
  .whenMatchedUpdate(set = { "id": col("newData.id") }) \
  .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
  .execute()
// Scala: read a specific table version by path
val df = spark.read.format("delta").load("/path/to/my/table@v5238")
// ---- Spark SQL ----
SELECT * FROM events -- query table in the metastore
SELECT * FROM delta.`/delta/events` -- query table by path
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-07-28 09:30:00.000"
SELECT count(*) FROM my_table VERSION AS OF 5238
UPDATE delta.`/data/events/` SET eventType = 'click' WHERE eventType = 'clck'
Code Snippets - Hudi
// Scala (assumes: import org.apache.hudi.DataSourceReadOptions._)
val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
// load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
// since the partition (region/country/city) is nested 3 levels below basePath, use 4 levels of "/*/*/*/*" here
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
// -------------------
val beginTime = "000" // represents all commits > this time
val endTime = commits(commits.length - 2) // point in time to query
// incrementally query data
val tripsPointInTimeDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
Code Snippets - Iceberg
CREATE TABLE prod.db.sample_table (
id bigint,
data string,
category string,
ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)
SELECT * FROM prod.db.sample_table.files
INSERT OVERWRITE prod.my_app.logs
SELECT uuid, first(level), first(ts), first(message)
FROM prod.my_app.logs
WHERE cast(ts as date) = '2020-07-01'
GROUP BY uuid
spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")
// time travel to October 26, 1986 at 01:21:00
spark.read.option("as-of-timestamp", "499162860000").table("prod.db.sample_table")
// time travel to snapshot with ID 10963874102873L
spark.read.option("snapshot-id", 10963874102873L).table("prod.db.sample_table")

Time Travel
● Time Travel is focused on keeping both Batch and Streaming jobs isolated from concurrent reads & writes
● The typical range for Time Travel is 7 ~ 30 days
● Machine Learning (feature re-generation) often needs to travel 3~24 months back
○ Need to reduce the precision/granularity of the commits kept in the Data Lake (compact the logs to a daily or monthly level)
■ Monthly baseline/snapshot + daily delta/changes
○ Consider a more advanced SCD-2 data model for ML (see the query sketch below)
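A minimal sketch of the SCD-2 idea in the last bullet (table and column names are hypothetical): with an effective_start/effective_end pair per key, a feature pipeline can reproduce the state as of any past date, well beyond the table format's 7~30 day snapshot retention.
as_of = "2019-10-01"
spark.sql(f"""
  SELECT member_id, plan_type, region
  FROM   member_profile_scd2
  WHERE  effective_start <= DATE '{as_of}'
    AND (effective_end   >  DATE '{as_of}' OR effective_end IS NULL)
""").show()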
What Else Should Be Part of the Data Lake?
● Catalog (next-generation metastore alternatives)
○ Daemon service: scalable, easy to update and query
○ Federation across data centers (across cloud and on-premises)
● Better file format and in-memory columnar format (see the pyarrow sketch below)
○ Less SerDe overhead, zero-copy, directly vectorized operation on compressed data (Artus-like), Tungsten v2 (Arrow-like)
● Performance and Data Management (for OLAP and AI)
○ New compute engines (non-JVM based) with smart caching, pre-aggregation & materialized views
○ Mechanism to enable Time Travel with a more flexible and wider range
○ Rich DSL with code generation and pushdown capability for faster AI training and inference
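A minimal pyarrow sketch of the in-memory columnar point above (file path and column names are hypothetical): a Parquet file is loaded into an Arrow table with only two columns materialized, and downstream consumers can share those column buffers without row-by-row SerDe.
import pyarrow.parquet as pq

table = pq.read_table("/data/events_parquet/part-0.parquet",
                      columns=["memberId", "eventType"])   # projection at read time
print(table.schema)
print(table.num_rows, table.column_names)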
How to Choose?
What are the pain points? Each Data Lake framework has its own emphasis, so find the alignment with your own pain points accordingly.
● Motivations
○ Smoother integration with the existing development language and compute engine?
○ Contribute to the framework to solve new problems?
○ Want more control of the infrastructure? Is the framework's open-source governance friendly?
● Restrictions
○ ...
⧫ Delta Lake + Spark + Delta Engine + Python support will effectively help Databricks pull ahead in the race.
⧫ The Flink community is all in for Iceberg.
⧫ GCP BigQuery, EMR, and Azure Synapse (will) support reading from all table formats, so you can lift-and-shift to ...

What’s next?
Data Lake can do more
Can be faster
Can be easier
Additional Readings
● Gartner Research
○ Are You Shifting Your Problems to the Cloud or Solving Them?
○ Demystifying Cloud Data Warehouse Characteristics
● Google
○ Procella + Artus (https://www.youtube.com/watch?v=QwXj7o4dLpw)
○ BigQuery + Capacitor (https://bit.ly/bigquery-capacitor)
● Uber
○ Incremental Processing on Hadoop (https://bit.ly/uber-incremental)
● Alibaba
○ AnalyticDB (https://www.vldb.org/pvldb/vol12/p2059-zhan.pdf)
○ Iceberg Sink for Flink (https://bit.ly/flink-iceberg-sink)
○ Use Iceberg in Flink 中文 (https://developer.aliyun.com/article/755329)
Reshape Data Lake (as of 2020.07)
Data Lake implementations are still evolving, so don't hold your breath for a single best choice. Roll up your sleeves and build practical solutions with 2 or 3 options combined. Computation engine gravity/bias will directly reshape the waterscape.

Thanks!
Presentation URL:
https://bit.ly/SFBA0728
Blog:
http://bit.ly/iceberg-delta-hudi-hive

More Related Content

What's hot

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
HostedbyConfluent
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
DataWorks Summit
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
HostedbyConfluent
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Databricks
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
Kaxil Naik
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptxGrafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
RomanKhavronenko
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 

What's hot (20)

Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptxGrafana Mimir and VictoriaMetrics_ Performance Tests.pptx
Grafana Mimir and VictoriaMetrics_ Performance Tests.pptx
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 

Similar to Reshape Data Lake (as of 2020.07)

Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
pbonillo1
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Luan Moreno Medeiros Maciel
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
Shubham Tagra
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Alluxio, Inc.
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
DataStax
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
Modern Data Stack France
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...
Stavros Papadopoulos
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
DATAVERSITY
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
SolidQ
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
Jeffrey T. Pollock
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
Kent Graziano
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
DataWorks Summit
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
DataWorks Summit
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 

Similar to Reshape Data Lake (as of 2020.07) (20)

Azure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdfAzure BI Cloud Architectural Guidelines.pdf
Azure BI Cloud Architectural Guidelines.pdf
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...
AIS data management and time series analytics on TileDB Cloud (Webinar, Feb 3...
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)Demystifying Data Warehouse as a Service (DWaaS)
Demystifying Data Warehouse as a Service (DWaaS)
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 

Recently uploaded

*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
roobykhan02154
 
iot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptxiot paper presentation FINAL EDIT by kiran.pptx
iot paper presentation FINAL EDIT by kiran.pptx
KiranKumar139571
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
Donghwan Lee
 
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any TimeBangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
Bangalore @Call @Girls 0000000000 Riya Khan Beautiful And Cute Girl any Time
adityaroy0215
 
LLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptxLLM powered Contract Compliance Application.pptx
LLM powered Contract Compliance Application.pptx
Jyotishko Biswas
 
@Call @Girls in Kolkata 💋😂 XXXXXXXX 👄👄 Hello My name Is Kamli I am Here meet you
@Call @Girls in Kolkata 💋😂 XXXXXXXX 👄👄 Hello My name Is Kamli I am Here meet you@Call @Girls in Kolkata 💋😂 XXXXXXXX 👄👄 Hello My name Is Kamli I am Here meet you
@Call @Girls in Kolkata 💋😂 XXXXXXXX 👄👄 Hello My name Is Kamli I am Here meet you
Delhi Call Girls
 
₹Call ₹Girls Mumbai Central 09930245274 Deshi Chori Near You
₹Call ₹Girls Mumbai Central 09930245274 Deshi Chori Near You₹Call ₹Girls Mumbai Central 09930245274 Deshi Chori Near You
₹Call ₹Girls Mumbai Central 09930245274 Deshi Chori Near You
model sexy
 
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
manjukaushik328
 
Introduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdfIntroduction to the Red Hat Portfolio.pdf
Introduction to the Red Hat Portfolio.pdf
kihus38
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
BIGPPTTTTTTTTtttttttttttttttttttttt.pptx
BIGPPTTTTTTTTtttttttttttttttttttttt.pptxBIGPPTTTTTTTTtttttttttttttttttttttt.pptx
BIGPPTTTTTTTTtttttttttttttttttttttt.pptx
RajdeepPaul47
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN MATKA RESULTS KALYAN CHART KALYAN MATKA ...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure
SanelaNikodinoska1
 
Reshape Data Lake (as of 2020.07)

  • 5. Share So Much: Despite all the marketing buzzwords and manipulation, 'data lakehouse', 'data lake', and 'data warehouse' all exist to solve the same data integration and insight generation problems. The implementations will continue to evolve as new hardware and software become viable and practical. In common:
   ● ACID
   ● Mutable (Delete, Update, Compact)
   ● Schema (DDL and Evolution)
   ● Metadata (Rich, Performant)
   ● Open (Format, API, Tooling, Adoption)
   ● Fast (Optimized for Various Patterns)
   ● Extensible (User-defined ***, Federation)
   ● Intuitive (Data-centric Operation/Language)
   ● Productive (Achieve more with less)
   ● Practical (Join, Aggregate, Cache, View)
  • 6. Solution Architecture Template (diagram): Sources → CDC Ingestion (T+0 or T+0.000694; T+0.0416 or T+1) → Storage → Data Format and SerDe → Metadata Catalog and Table API → Unified Data Interface → consumers such as Ads, BI/OLAP, Machine Learning, Deep Learning, Observability, Recommendation, A/B Test ...
  • 7. Data Analytics in Cloud Storage
   ● An Object Store is not a File System
   ○ There are no hierarchy semantics to rename or inherit
   ○ Objects are not appendable (in general)
   ○ Metadata is limited to a few KB
   ● REST is easy to program against, but RPC is much faster
   ○ The job/query planning step needs many small scans (it is chatty)
   ○ A 4MB cache block size may be inefficient for metadata operations
   ● The Hadoop stack is tightly coupled with HDFS notions
   ○ Hive and Spark (originally) were not optimized for object stores
   ○ Running HDFS as a cache/intermediate layer on a VM fleet can be useful yet suboptimal (and operationally heavy)
   ○ Data locality still matters for SLA-sensitive batch jobs
   (See the rename sketch below.)
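To make the "no rename" point concrete, here is a minimal sketch (not from the deck) using the Hadoop FileSystem API; the bucket and paths are hypothetical, and hadoop-aws plus credentials are assumed on the classpath. On HDFS the call is an atomic metadata update, while on an s3a:// path it is emulated client-side by copying every object and then deleting the originals, which is why modern table formats avoid rename-based commits.

   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}

   object RenameCostSketch {
     def main(args: Array[String]): Unit = {
       val conf = new Configuration()
       // Hypothetical staging and final locations on an object store.
       val src = new Path("s3a://my-bucket/warehouse/events/_staging/dt=2020-07-28")
       val dst = new Path("s3a://my-bucket/warehouse/events/dt=2020-07-28")

       val fs = FileSystem.get(src.toUri, conf)
       // On HDFS this is a single atomic metadata update; on S3A it is
       // executed client-side as copy-every-object + delete, so a large
       // partition "rename" can take minutes and may fail part-way through.
       val ok = fs.rename(src, dst)
       println(s"rename returned $ok")
     }
   }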
  • 8. Big Data becomes too big, even the Metadata
   ● Computation costs keep rising for big data
   ○ Partitioning the files by date is not enough
   ○ Hot and warm data sizes are still very big (how to save $$$)
   ○ Analytics often scan big data files but discard 90% of the records and 80% of the fields; the CPU, memory, network and I/O cost is billed for 100%
   ○ Columnar formats have skipping indexes and projection pushdown, but how do you fetch them swiftly?
   ● The Hive Metastore only manages metadata at the directory level (HIVE-9452 abandoned)
   ○ Commits can happen at the file or file-group level (instead of the directory level)
   ○ High-performance engines need a better file layout and rich metadata at the field level for each segment/chunk in a file
   ○ Process metadata via a Java ORM?
   (A small pruning/pushdown sketch follows.)
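As a small illustration of skipping and pushdown (a sketch under assumed table layout and column names, not from the deck): Spark only materializes the selected columns, prunes partitions on the dt filter, and can push the remaining predicate down to Parquet row-group min/max statistics; anything the format cannot skip is still scanned and billed.

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.col

   object PruningSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("pruning-sketch").getOrCreate()

       // Hypothetical date-partitioned Parquet table: .../events/dt=YYYY-MM-DD/*.parquet
       val events = spark.read.parquet("s3a://my-bucket/warehouse/events")

       // Projection pushdown: only these columns are read from disk.
       // Partition pruning: the dt predicate skips whole directories.
       // Predicate pushdown: the event_type predicate can skip row groups.
       val clicks = events
         .select("user_id", "event_type", "dt")
         .where(col("dt") === "2020-07-28" && col("event_type") === "click")

       clicks.explain()   // inspect PartitionFilters / PushedFilters in the plan
       println(clicks.count())
     }
   }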
  • 9. Immutable or Mutable
   ● Big data is all about immutable, schemaless data
   ○ To get useful insights and features out of the raw data, we still have to dedupe, transform, conform, merge, aggregate, and backfill
   ○ Schema evolution happens frequently when merges & backfills occur
   ● Storage is infinite and compute is cheap
   ○ Why not rewrite the entire data file or directory all the time?
   ○ If it is slow, increase the number of partitions and executors
   ● Streaming and Batch unification requires decent incremental logic
   ○ Store granularly, with ACID isolation and clear watermarks
   ○ Process incrementally, without partial reads or duplicates
   ○ Evolve reliably, with enough flexibility
   (See the watermark/dedupe sketch below.)
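A minimal Structured Streaming sketch of the "clear watermarks, no duplicates" idea, assuming a hypothetical JSON landing zone and an event_id/event_time schema (paths and field names are not from the deck): the watermark bounds how late data may arrive, and dropDuplicates within that bound removes replays without keeping unbounded state.

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

   object IncrementalDedupeSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("incremental-dedupe").getOrCreate()

       // Hypothetical landing zone of newline-delimited JSON events.
       val schema = new StructType()
         .add("event_id", StringType)
         .add("event_time", TimestampType)
         .add("payload", StringType)

       val raw = spark.readStream
         .schema(schema)
         .json("s3a://my-bucket/landing/events")

       // Dedupe replays within a bounded lateness window.
       val deduped = raw
         .withWatermark("event_time", "1 hour")
         .dropDuplicates("event_id", "event_time")

       deduped.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/warehouse/events_deduped")
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/events_deduped")
         .outputMode("append")
         .start()
         .awaitTermination()
     }
   }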
  • 10. Are All Open Standards Equal?
   ● Hive 3.x
   ○ DML (based on ORC + Bucketing + on-the-fly Merge + Compactor)
   ○ Streaming Ingestion API, LLAP (daemon, caching, faster execution)
   ● Iceberg
   ○ Flexible field schema and partition layout evolution (S3-first)
   ○ Hidden Partitioning (expression-based) and Bucket Transformation
   ● Delta Lake
   ○ Everything done by Spark + Parquet, DML (Copy-on-Write) + SCD-2
   ○ Fully supported in SparkSQL, PySpark and Delta Engine
   ● Hudi
   ○ Optimized UPSERT with indexing (record key, file id, partition path)
   ○ Merge-on-Read (low-latency write) or Copy-on-Write (HDFS-first)
  • 11. Why is Iceberg so cool?
   ● Netflix is the most advanced AWS flagship partner
   ○ S3 is very scalable but a little bit over-simplified
   ○ Iceberg solves the critical cloud storage problems:
   ■ Avoid rename
   ■ Avoid directory hierarchy and naming conventions
   ■ Aggregate (index) metadata into a compacted (manifest) file
   ● Netflix has migrated to Flink for stream processing
   ○ Fast ETL/analytics are needed to respond to its non-stop VOD business
   ○ w/ one of the biggest Cassandra clusters (less mutable-data headache)
   ○ No urgent need for DML yet
   ● Netflix uses multiple data platforms/engines, and migrates faster than ...
   ○ Iceberg supports other file formats, engines, schemas, and bucketing by nature
  • 12. Why is Delta Lake so handy?
   ● If you love to use Spark for ETL (Streaming & Batch), Delta Lake just makes it so much more powerful
   ○ The API and SQL syntax are easy to use (especially for data folks)
   ○ A wide range of patterns is contributed by paying customers and the OSS community
   ○ (Feel locked-in?) It is well-tested, less buggy, and more usable in all 3 clouds
   ● Databricks has full control and moves very fast
   ○ v0.2 (cloud storage support: June 2019)
   ○ v0.3 (DML: Aug 2019), v0.4 (SQL syntax, Python API: Sep 2019)
   ○ v0.5 (DML & compaction performance, Presto integration: Dec 2019)
   ○ v0.6 (schema evolution during merge, read by path: Apr 2020)
   ○ v0.7 (DDL for Hive Metastore, retention control, ADLSv2: Jun 2020)
  • 13. Why is Hudi faster?
   ● Uber is a true fast-data company
   ○ Its marketplace and supply-demand-matching business model seriously depends on near-real-time analytics:
   ■ Directly upsert the MySQL BIN log into a Hudi table
   ■ Frequently bulk-dumping Cassandra is obviously infeasible
   ■ The record key is indexed (file names + bloom filters) to speed up upserts
   ■ Batch favors Copy-on-Write, but Streaming likes Merge-on-Read
   ■ Snapshot queries are faster, while Incremental queries have low latency
   ● Uber is also committed to Flink
   ● Uber mainly builds its own data centers and HDFS clusters
   ○ So Hudi is mainly optimized for on-prem HDFS with Hive conventions
   ○ GCP and AWS support was added later
   (See the upsert write sketch below.)
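The code snippets on slide 15 only show Hudi reads; the upsert write path praised here looks roughly like the following sketch, based on the Hudi Spark datasource quickstart options of the 0.5.x era. Table name, paths and field names are hypothetical, and the option constants should be verified against the Hudi version you actually run.

   import org.apache.spark.sql.{SaveMode, SparkSession}
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._

   object HudiUpsertSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()

       val tableName = "hudi_trips_cow"                 // hypothetical table name
       val basePath  = "hdfs:///tmp/hudi_trips_cow"     // hypothetical base path

       // Hypothetical batch of changed rows keyed by uuid, versioned by ts.
       val updates = spark.read.json("hdfs:///tmp/trip_updates.json")

       updates.write.format("hudi")
         .option(PRECOMBINE_FIELD_OPT_KEY, "ts")            // latest ts wins on key collision
         .option(RECORDKEY_FIELD_OPT_KEY, "uuid")           // indexed record key
         .option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
         .option(TABLE_NAME, tableName)
         .mode(SaveMode.Append)                             // Append triggers upsert, not overwrite
         .save(basePath)
     }
   }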
  • 14. Code Snippets - Delta
   spark.readStream.format("delta").load("/path/to/delta/events")

   deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
   # Upsert (merge) new data
   newData = spark.range(0, 20)
   deltaTable.alias("oldData") \
     .merge(
       newData.alias("newData"),
       "oldData.id = newData.id") \
     .whenMatchedUpdate(set = { "id": col("newData.id") }) \
     .whenNotMatchedInsert(values = { "id": col("newData.id") }) \
     .execute()

   val df = spark.read.format("delta").load("/path/to/my/table@v5238")

   // ---- Spark SQL ----
   SELECT * FROM events                 -- query table in the metastore
   SELECT * FROM delta.`/delta/events`  -- query table by path
   SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
   SELECT count(*) FROM my_table TIMESTAMP AS OF "2020-07-28 09:30:00.000"
   SELECT count(*) FROM my_table VERSION AS OF 5238
   UPDATE delta.`/data/events/` SET eventType = 'click' WHERE eventType = 'clck'
  • 15. Code Snippets - Hudi
   val tripsSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
   // load(basePath) uses the "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
   // since the partition (region/country/city) is nested 3 levels below basePath, use 4 levels "/*/*/*/*" here
   tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

   spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
   spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()

   // -------------------
   val beginTime = "000"                          // represents all commits > this time
   val endTime = commits(commits.length - 2)      // point in time to query

   // incrementally query data
   val tripsPointInTimeDF = spark.read.format("hudi").
     option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
     option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     option(END_INSTANTTIME_OPT_KEY, endTime).
     load(basePath)
   tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
   spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
  • 16. Code Snippets - Iceberg
   CREATE TABLE prod.db.sample_table (
     id bigint,
     data string,
     category string,
     ts timestamp)
   USING iceberg
   PARTITIONED BY (bucket(16, id), days(ts), category)

   SELECT * FROM prod.db.sample_table.files

   INSERT OVERWRITE prod.my_app.logs
   SELECT uuid, first(level), first(ts), first(message)
   FROM prod.my_app.logs
   WHERE cast(ts as date) = '2020-07-01'
   GROUP BY uuid

   spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table")

   // time travel to October 26, 1986 at 01:21:00
   spark.read.option("as-of-timestamp", "499162860000").table("prod.db.sample_table")
   // time travel to snapshot with ID 10963874102873L
   spark.read.option("snapshot-id", 10963874102873L).table("prod.db.sample_table")
  • 17. Time Travel
   ● Time Travel is focused on keeping both Batch and Streaming jobs isolated from concurrent reads & writes
   ● The typical range for Time Travel is 7 ~ 30 days
   ● Machine Learning (feature regeneration) often needs to travel 3 ~ 24 months back
   ○ This requires reducing the precision/granularity of the commits kept in the Data Lake (compact the logs to the daily or monthly level)
   ■ Monthly baseline/snapshot + daily delta/changes
   ○ Consider a more advanced SCD-2 data model for ML
   (See the point-in-time SCD-2 sketch below.)
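A sketch of the "monthly baseline + SCD-2" idea for long-range feature regeneration (table, path and column names such as effective_from/effective_to are hypothetical, not from the deck): keep one row per key per version with an effective-time range, then reconstruct the state as of any past date instead of replaying months of fine-grained commits.

   import org.apache.spark.sql.SparkSession

   object Scd2AsOfSketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("scd2-as-of").getOrCreate()

       // Hypothetical SCD-2 table with one row per (member_id, version).
       spark.read.parquet("s3a://my-bucket/warehouse/member_profile_scd2")
         .createOrReplaceTempView("member_profile_scd2")

       // Point-in-time view for regenerating features 18 months back.
       val asOf = "2019-01-31"
       val snapshot = spark.sql(
         s"""
            |SELECT member_id, country, plan_tier
            |FROM member_profile_scd2
            |WHERE effective_from <= CAST('$asOf' AS DATE)
            |  AND (effective_to IS NULL OR effective_to > CAST('$asOf' AS DATE))
          """.stripMargin)

       snapshot.createOrReplaceTempView("member_profile_as_of")
     }
   }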
  • 18. What Else Should be Part of the Data Lake?
   ● Catalog (next-generation metastore alternatives)
   ○ Daemon service: scalable, easy to update and query
   ○ Federation across data centers (across cloud and on-premises)
   ● Better file formats and in-memory columnar formats
   ○ Less SerDe overhead, zero-copy, directly vectorized operation on compressed data (Artus-like); Tungsten v2 (Arrow-like)
   ● Performance and Data Management (for OLAP and AI)
   ○ New compute engines (non-JVM based) with smart caching, pre-aggregation & materialized views
   ○ A mechanism to enable Time Travel with a more flexible and wider range
   ○ A rich DSL with code generation and pushdown capability for faster AI training and inference
  • 19. How to Choose?
   What are the pain points? Each Data Lake framework has its own emphasis, so align it with your own pain points.
   ● Motivations
   Smoother integration with your existing development language and compute engine? Contributing to the framework to solve new problems? Having more control of the infrastructure, and is the framework's open-source governance friendly?
   ● Restrictions ...
  • 20. ⧫ Delta Lake + Spark + Delta Engine + Python support will effectively help Databricks pull ahead in the race.
   ⧫ The Flink community is all in for Iceberg.
   ⧫ GCP BigQuery, EMR, and Azure Synapse (will) support reading from all table formats, so you can lift-and-shift to ...
  • 21. What's next? The Data Lake can do more, can be faster, and can be easier.
  • 22. Additional Readings
   ● Gartner Research
   ○ Are You Shifting Your Problems to the Cloud or Solving Them?
   ○ Demystifying Cloud Data Warehouse Characteristics
   ● Google
   ○ Procella + Artus (https://www.youtube.com/watch?v=QwXj7o4dLpw)
   ○ BigQuery + Capacitor (https://bit.ly/bigquery-capacitor)
   ● Uber
   ○ Incremental Processing on Hadoop (https://bit.ly/uber-incremental)
   ● Alibaba
   ○ AnalyticDB (https://www.vldb.org/pvldb/vol12/p2059-zhan.pdf)
   ○ Iceberg Sink for Flink (https://bit.ly/flink-iceberg-sink)
   ○ Use Iceberg in Flink, in Chinese (https://developer.aliyun.com/article/755329)
  • 24. Data Lake implementations are still evolving, so don't hold your breath for a single best choice. Roll up your sleeves and build practical solutions with 2 or 3 options combined. Computation engine gravity/bias will directly reshape the waterscape.

Editor's Notes

  1. The views expressed in this presentation are those of the author and do not reflect any policy or position of the employers of the author.
  2. IA = Infrequent Access; NL = Nearline; CL = Coldline; https://flink.apache.org/news/2019/02/13/unified-batch-streaming-blink.html
  3. During the v1 era there were several attempts at non-JVM engines, but none of them really thrived. GPU, C++ and LLVM are now really changing the game for Deep Learning and OLAP. HDFS is reaching its peak and starting to fade away.
  4. if all you have is a hammer, everything looks like a nail
  5. The Druid/Pinot (near real time analytics) block can be merged into the Data Lake with T+0 ingestion and processing capability. It can also be replaced by HTAP (such as TiDB) as a super ODS.
  6. AWS EFS is really an NFS/NAS solution, so it cannot serve as an HDFS replacement; on AWS, use EmrFileSystem over S3 instead, and note that s3a:// has its own limitations: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/bk_cloud-data-access/content/s3-limitations.html Azure Data Lake Storage Gen2 (abfs://) is almost capable of replacing HDFS. Google Colossus is years ahead of OSS, a true distributed file system. HIVE-14269, HIVE-14270, HIVE-20517, HADOOP-15364, HADOOP-15281. Hive ACID is not allowed if S3 is the storage layer (Hudi or others can be used as the SerDe).
  7. Snowflake uses FoundationDB to organize a lot of metadata to speed up its Query Processing. https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/ S3 Select was launched Apr 2018 to provide some pushdown (Sep 2018 for Parquet) (Nov 2018, output committer to avoid rename)
  8. Record-grain mutation is expensive, but how about the mini-batch level? GDPR, CCPA, IDPC and … affect offline big data as well.
  9. Iceberg is mainly optimized for Parquet, but its spec and API are open to support ORC and Avro too. The Bucket Transformation is designed to work across Hive, Spark, Presto and Flink.
  10. Clearly distinguish and handle processing_time (a.k.a. arrival_time) vs. event_time (a.k.a. payload_time or transaction_time) In short, Hudi can efficiently update/reconcile the late-arrival records to the proper partition. https://www.oreilly.com/content/ubers-case-for-incremental-processing-on-hadoop/
  11. https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html https://docs.delta.io/0.7.0/delta-batch.html
  12. Very typical Hive style. Fine-grained control.
  13. Cool stuff
  14. Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
  15. Similar to Aster Data Systems https://en.wikipedia.org/wiki/Aster_Data_Systems and https://github.com/sql-machine-learning/sqlflow
  16. Anecdote: Huawei was donating CarbonData to open-source Spark a few years ago, but Delta was probably already the chosen direction, so CarbonData never made it as a file format bundled in Spark. CarbonData is a more comprehensive columnar format that supports rich indexing and even DML operations at the SerDe level. The latest FusionInsight MRS 8.0 realizes the mutable Data Lake, with streaming & batch combined, on top of CarbonData. It would not be surprising if some of the Iceberg contributors & adopters have similar worries about Delta Lake.
  17. Huawei CarbonData anecdote:
  18. https://www.qlik.com/us/-/media/files/resource-library/global-us/register/ebooks/eb-cloud-data-warehouse-comparison-ebook-en.pdf https://www.gartner.com/doc/reprints?id=1-1ZA6E2JU&ct=200619&st=sb (Cloud Data Warehouse: Are You Shifting Your Problems to the Cloud or Solving Them?)
  19. We need to speculate about where Databricks is forging ahead next (Data Lake + ETL + ML + OLAP + DL + SaaS/Serverless + Data Management + …). What shall we learn from Snowflake's architecture and success? (A Data Lake should be fast and intuitive to use; metadata is very important for optimizing query performance.) Anecdote: Snowflake's IPO market cap is about 10x bigger than Cloudera's, which should tell us something about how useful it is.