Apache Hudi
Learning Series
Hudi Intro
Apache Hudi ingests & manages storage of large analytical datasets over DFS (HDFS or cloud stores). Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing.
Incremental Database Ingestion
De-duping Log Events
Storage Management
Transactional Writes
Faster Derived/ETL Data
Compliance/Data Deletions
Unique key constraints
Late data handling
Industry/Cloud Solutions
Data Consistency: Datacenter agnostic, xDC replication, strong consistency
Data Freshness: < 15 min of freshness on Lake & warehouse
Hudi for Data Application: Feature store for ML, Incremental Processing for all, Easy on-boarding, monitoring & debugging
Adaptive Data Layout: Stitch files, Optimize layout, Prune columns, Encrypt rows/columns on demand through a standardized interface
Efficient Query Execution: Column indexes for improved query planning & execution
Compute & Storage Efficiency: Do more with less CPU, Storage, Memory
Data Accuracy: Semantic validations for columns: NotNull, Range, etc.
Hudi@Uber

Data Processing: Incremental Streams
[Diagram] Two incremental pipelines, each sized for its use case (latency, scale). The first uses the Table API to do incremental stream pulls & joins over deltas from Source A, Source B, ... Source N with a batch/stream engine (Spark/Flink/Presto/...), and its consumer produces Derived Table A. The second does the same over deltas from Table A, Table B, ... Table N to produce Derived Table B.
*source = {Kafka, CSV, DFS, Hive table, Hudi table etc}
500B+
records/day
150+ PB
Transactional Data Lake
8,000+
Tables
Hudi@Uber
Facts and figures
Read/Write Client APIs
01 Write Client
02 Read Client
03 Supported Engines
04 Q&A
Agenda

Hudi APIs Highlights
Snapshot Isolation
Readers will not see partial writes
from writers.
Atomic Writes
Writes happen either in full or not at all. Partial writes (e.g. from killed processes) are not valid.
Read / Write Optimized
Depending on the required SLA,
writes or reads can be made faster
(at the other’s expense).
Incremental Reads/Writes
Readers can choose to only read
new records from some timestamp.
This makes efficient incremental
pipelines possible.
Point In Time Queries
(aka Time-Travel)
Readers can read snapshot views at
either the latest time, or some past
time.
Table Services
Table management services such as
clustering, or compacting (covered
in later series).
Insert
● Similar to INSERT in databases
● Insert records without checking for
duplicates.
Hudi Write APIs
Upsert
● Similar to UPDATE or INSERT
paradigms in databases
● Uses an index to find existing records to
update and avoids duplicates.
● Slower than Insert.
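As a rough sketch, the write operation is selected through a datasource option; the example below assumes the hoodie.datasource.write.operation config and the generatedDataDF / tableName / basePath values used in the code examples later in this deck:
// Hedged sketch: pick the Hudi write operation per batch.
// "insert" skips index lookups; "upsert" de-duplicates via the index.
generatedDataDF.write.
  format("org.apache.hudi").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "ts").
  option("hoodie.datasource.write.operation", "upsert"). // or "insert" / "bulk_insert"
  mode("append").
  save(basePath)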
Hudi Write APIs
Bulk Insert
● Similar to Insert.
● Handles large amounts of data - best for
bootstrapping use-cases.
● Does not guarantee file sizing
Insert Overwrite
● Overwrite a partition with new data.
● Useful for backfilling use-cases.
Insert Upsert
Bulk Insert
Hudi Write APIs
Delete
● Similar to DELETE in databases.
● Soft Deletes / Hard Deletes
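A minimal sketch of a hard delete through the Spark datasource, assuming the hoodie.datasource.write.operation config; toDeleteDF is a hypothetical DataFrame holding only the keys of the records to remove (a soft delete would instead upsert the same keys with the non-key columns set to null):
// Hedged sketch: hard delete the records whose keys appear in toDeleteDF.
toDeleteDF.write.
  format("org.apache.hudi").
  option("hoodie.table.name", tableName).
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "ts").
  option("hoodie.datasource.write.operation", "delete").
  mode("append").
  save(basePath)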
Hive Registration
● Sync the changes to your dataset to
Hive.
Insert Overwrite
Insert Upsert

Hudi Write APIs
Rollback / Restore
● Rollback inserts/upserts etc to restore
the dataset to some past state.
● Useful when mistakes happen.
Bulk Insert
Hive Registration
Insert Upsert
Insert Overwrite
Delete
Hudi Read APIs
Snapshot Read
● This is the typical read pattern
● Read data at latest time (standard)
● Read data at some point in time (time
travel)
Incremental Read
● Read records modified only after a
certain time or operation.
● Can be used in incremental processing
pipelines.
Hudi Metadata Client
There is a read client for Hudi Table Metadata as well. Here are some API highlights:
Get Latest Snapshot Files
Get the list of files that contain the latest snapshot data. This is useful for backing up / archiving datasets.
Globally Consistent Meta Client
Get X-DC consistent views at the cost of freshness.
Get Partitions / Files Mutated Since
Get a list of partitions or files mutated since a given timestamp. This is also useful for incremental backup / archiving.
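For illustration, similar timeline information can be pulled with the datasource helpers shipped in Hudi's Spark bundle (a hedged sketch, not the metadata client API itself; assumes org.apache.hudi.HoodieDataSourceHelpers and the basePath used in the code examples later in this deck):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hudi.HoodieDataSourceHelpers

val fs = new Path(basePath).getFileSystem(new Configuration())
// The full timeline of completed commits/compactions on the table.
val timeline = HoodieDataSourceHelpers.allCompletedCommitsCompactions(fs, basePath)
// Commit times newer than a given instant -- handy for incremental backup / archiving.
val commitsSince = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "20200728232543")
// Latest completed commit instant on the table.
val latestInstant = HoodieDataSourceHelpers.latestCommit(fs, basePath)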
Hudi Table Services
Compaction
Convert files on disk into read optimized files (see Merge
on Read in the next section).
Clustering
Clustering can make reads more efficient by changing the
physical layout of records across files. (see section 3)
Clean
Remove Hudi data files that are no longer needed. (see
section 3)
Archiving
Archive Hudi metadata files that are no longer being
actively used. (see section 3)
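Several of these services can also be triggered inline from a regular write via configuration; a hedged sketch, assuming the hoodie.clustering.* / hoodie.cleaner.* / hoodie.keep.* write configs (names can shift between releases) and the hudiWriteOpts map defined in the code examples later in this deck:
// Hedged sketch: inline table services driven by write options.
val tableServiceOpts = Map(
  "hoodie.clustering.inline" -> "true",            // run clustering inline after writes
  "hoodie.clustering.inline.max.commits" -> "4",   // cluster every 4 commits
  "hoodie.cleaner.commits.retained" -> "10",       // clean: retain file versions for the last 10 commits
  "hoodie.keep.min.commits" -> "20",               // archival: keep at least 20 commits on the timeline
  "hoodie.keep.max.commits" -> "30"                // archival: archive commits beyond 30
)

generatedDataDF.write.
  format("org.apache.hudi").
  options(hudiWriteOpts ++ tableServiceOpts).
  mode("append").
  save(basePath)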

Code Examples
val dataGenerator = new DataGenerator
val generatedJson = convertToStringList(dataGenerator.generateInserts(100))
val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
Code Examples
val dataGenerator = new DataGenerator
val generatedJson = convertToStringList(dataGenerator.generateInserts(100))
val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
This is a data gen class provided by
Hudi for testing
We’ll be using SPARK for this demo
Code Examples: Generate Data
val dataGenerator = new DataGenerator
val generatedJson = convertToStringList(dataGenerator.generateInserts(100))
val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
generatedDataDF.show()
+--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
| begin_lat| begin_lon| driver| end_lat| end_lon| fare| geo| rider| ts| uuid|
+--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
| 0.47269058795|0.461578584504|driver-213| 0.75480340700| 0.967115994201|34.1582847163|americas/brazi...|rider-213|1611908339000|500ca486-9323-46f...|
| 0.61000705621| 0.87794022954|driver-213| 0.340787050592| 0.503079814229| 43.49238112|americas/brazi...|rider-123|1614586739000|cf0767fe-2afa-4b0...|
| 0.57318354079| 0.49234796529|driver-213|0.0898858178093|0.4252089969871| 64.276962958|americas/unite...|rider-214|1617326300000|a7fc67fd-8026-4c9...|
|0.216241503676|0.142850512594|driver-213| 0.589094962481| 0.096682383192| 93.560181152|americas/unite...|rider-417|1612167539000|77572226-6edd-4fb...|
| 0.406135109| 0.56440921390|driver-213| 0.79870630494|0.0269835922718|17.8511352550| asia/india/c...|rider-249|1617326301000|2227d696-bc2f-490...|
| 0.87420415264| 0.75282681532|driver-213| 0.919782712888| 0.36246477087|19.1791391066|americas/unite...|rider-351|1618301939000|90a4dae9-e21d-4a4...|
| 0.18564880850| 0.96945864178|driver-213|0.3818636703720|0.2525265221447| 33.922164839|americas/unite...|rider-491|1611908339000|bb742f4d-1ab8-42e...|
| 0.07505887600|0.038441044444|driver-213|0.0437635335453| 0.634604006761| 66.620843664|americas/brazi...|rider-481|1617351482000|4735f5e6-e746-49d...|
| 0.6510585056| 0.81928686877|driver-213|0.2071489600291|0.0622403109582| 41.062909290| asia/india/c...|rider-471|1617325991000|16db8d5d-955a-4d1...|
|0.114883931570| 0.62732122024|driver-213| 0.745467853751| 0.395493986490| 27.794786885|americas/unite...|rider-591|1611908339000|115c2738-9059-4be...|
+--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
Code Examples: Generate Data
val dataGenerator = new DataGenerator
val generatedJson = convertToStringList(dataGenerator.generateInserts(100))
val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
generatedDataDF.show()
+--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
| begin_lat| begin_lon| driver| end_lat| end_lon| fare| geo| rider| ts| uuid|
+--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
| 0.47269058795|0.461578584504|driver-213| 0.75480340700| 0.967115994201|34.1582847163|americas/brazi...|rider-213|1611908339000|500ca486-9323-46f...|
| 0.61000705621| 0.87794022954|driver-213| 0.340787050592| 0.503079814229| 43.49238112|americas/brazi...|rider-123|1614586739000|cf0767fe-2afa-4b0...|
| 0.57318354079| 0.49234796529|driver-213|0.0898858178093|0.4252089969871| 64.276962958|americas/unite...|rider-214|1617326300000|a7fc67fd-8026-4c9...|
|0.216241503676|0.142850512594|driver-213| 0.589094962481| 0.096682383192| 93.560181152|americas/unite...|rider-417|1612167539000|77572226-6edd-4fb...|
| 0.406135109| 0.56440921390|driver-213| 0.79870630494|0.0269835922718|17.8511352550| asia/india/c...|rider-249|1617326301000|2227d696-bc2f-490...|
| 0.87420415264| 0.75282681532|driver-213| 0.919782712888| 0.36246477087|19.1791391066|americas/unite...|rider-351|1618301939000|90a4dae9-e21d-4a4...|
| 0.18564880850| 0.96945864178|driver-213|0.3818636703720|0.2525265221447| 33.922164839|americas/unite...|rider-491|1611908339000|bb742f4d-1ab8-42e...|
| 0.07505887600|0.038441044444|driver-213|0.0437635335453| 0.634604006761| 66.620843664|americas/brazi...|rider-481|1617351482000|4735f5e6-e746-49d...|
| 0.6510585056| 0.81928686877|driver-213|0.2071489600291|0.0622403109582| 41.062909290| asia/india/c...|rider-471|1617325991000|16db8d5d-955a-4d1...|
|0.114883931570| 0.62732122024|driver-213| 0.745467853751| 0.395493986490| 27.794786885|americas/unite...|rider-591|1611908339000|115c2738-9059-4be...|
+--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
We'll use ts for the partition key and uuid for the hoodie record key.

Code Examples: Writes Opts
val dataGenerator = new DataGenerator
val generatedJson = convertToStringList(dataGenerator.generateInserts(100))
val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
val hudiWriteOpts = Map(
"hoodie.table.name" -> (tableName),
"hoodie.datasource.write.recordkey.field" -> "uuid",
"hoodie.datasource.write.partitionpath.field" -> "ts",
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator",
"hoodie.deltastreamer.keygen.timebased.timestamp.type" -> "UNIX_TIMESTAMP",
"hoodie.deltastreamer.keygen.timebased.output.dateformat" -> "yyyy/MM/dd",
)
Code Examples: Write
val dataGenerator = new DataGenerator
val generatedJson = convertToStringList(dataGenerator.generateInserts(100))
val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
val hudiWriteOpts = Map(
"hoodie.table.name" -> (tableName),
"hoodie.datasource.write.recordkey.field" -> "uuid",
"hoodie.datasource.write.partitionpath.field" -> "ts",
"hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator",
"hoodie.deltastreamer.keygen.timebased.timestamp.type" -> "UNIX_TIMESTAMP",
"hoodie.deltastreamer.keygen.timebased.output.dateformat" -> "yyyy/MM/dd",
)
generatedDataDF.write.
format("org.apache.hudi").
options(hudiWriteOpts).
save(basePath)
Code Examples: Hive Registration
val hiveSyncConfig = new HiveSyncConfig()
hiveSyncConfig.databaseName = databaseName
hiveSyncConfig.tableName = tableName
hiveSyncConfig.basePath = basePath
hiveSyncConfig.partitionFields = java.util.Arrays.asList("ts")
val hiveConf = new HiveConf()
val dfs = (new Path(basePath)).getFileSystem(new Configuration())
val hiveSyncTool = new HiveSyncTool(hiveSyncConfig, hiveConf, dfs)
hiveSyncTool.syncHoodieTable()
Not to be confused with
cross-dc hive sync
Can be called manually, or you
can configure HudiWriteOpts
to trigger it automatically.
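For the automatic path, the same sync can be driven from the write itself; a hedged sketch using the hoodie.datasource.hive_sync.* options (the JDBC URL below is an assumption for a local HiveServer2):
val hiveSyncOpts = Map(
  "hoodie.datasource.hive_sync.enable" -> "true",
  "hoodie.datasource.hive_sync.database" -> databaseName,
  "hoodie.datasource.hive_sync.table" -> tableName,
  "hoodie.datasource.hive_sync.partition_fields" -> "ts",
  "hoodie.datasource.hive_sync.jdbcurl" -> "jdbc:hive2://localhost:10000" // assumption: local HiveServer2
)

generatedDataDF.write.
  format("org.apache.hudi").
  options(hudiWriteOpts ++ hiveSyncOpts).
  mode("append").
  save(basePath)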
Code Examples: Snapshot Read
val readDF = spark.sql("select uuid, driver, begin_lat, begin_lon from " + databaseName + "." + tableName)
val readDF =
spark.read.format("org.apache.hudi")
.load(basePath)
.select("uuid", "driver", "begin_lat", "begin_lon")
readDF.show()
+--------------------+----------+--------------------+--------------------+
| uuid| driver| begin_lat| begin_lon|
+--------------------+----------+--------------------+--------------------+
|57d559d0-e375-475...|driver-284|0.014159831486388885| 0.42849372303000655|
|fd51bc6e-1303-444...|driver-284| 0.1593867607188556|0.010872312870502165|
|e8033c1e-a6e5-490...|driver-284| 0.2110206104048945| 0.2783086084578943|
|d619e592-0b41-4c8...|driver-284| 0.08528650347654165| 0.4006983139989222|
|799f7e50-27bc-4c9...|driver-284| 0.6570857443423376| 0.888493603696927|
|c22ba7e5-68b5-4eb...|driver-284| 0.18294079059016366| 0.19949323322922063|
|fbb80816-fe18-4e2...|driver-284| 0.7340133901254792| 0.5142184937933181|
|3dfeb884-41fd-4ea...|driver-284| 0.4777395067707303| 0.3349917833248327|
|034e0576-f59f-4e9...|driver-284| 0.7180196467760873| 0.13755354862499358|
|e9c6e3b1-1ed4-43b...|driver-284| 0.16603428449020086| 0.6999655248704163|
|18b39bef-9ebb-4b5...|driver-213| 0.1856488085068272| 0.9694586417848392|
|653a4cb6-3c94-4ee...|driver-213| 0.11488393157088261| 0.6273212202489661|
|11fbfce7-a10b-4d1...|driver-213| 0.21624150367601136| 0.14285051259466197|
|0199a292-1702-47f...|driver-213| 0.4726905879569653| 0.46157858450465483|
|5e1d80ce-e95b-4ef...|driver-213| 0.5731835407930634| 0.4923479652912024|
|5d51b234-47ab-467...|driver-213| 0.651058505660742| 0.8192868687714224|
|ff2e935b-a403-490...|driver-213| 0.0750588760043035| 0.03844104444445928|
|bc644743-0667-48b...|driver-213| 0.6100070562136587| 0.8779402295427752|
|026c7b79-3012-414...|driver-213| 0.8742041526408587| 0.7528268153249502|
|9a06d89d-1921-4e2...|driver-213| 0.40613510977307| 0.5644092139040959|
+--------------------+----------+--------------------+--------------------+
only showing top 20 rows
Two ways of querying the same
Hudi Dataset
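A point-in-time (time-travel) read follows the same datasource pattern; a sketch assuming the as.of.instant read option available in newer Hudi releases, with the instant expressed in Hudi's commit-time format:
// Hedged sketch: read the table as of a past commit instant.
val asOfDF = spark.read.format("org.apache.hudi").
  option("as.of.instant", "20200728232543").
  load(basePath).
  select("uuid", "driver", "begin_lat", "begin_lon")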

Code Examples: Incremental Read
val newerThanTimestamp = "20200728232543"
val readDF =
spark.read.format("org.apache.hudi")
.option(QUERY_TYPE_OPT_KEY,QUERY_TYPE_INCREMENTAL_OPT_VAL)
.option(BEGIN_INSTANTTIME_OPT_KEY, newerThanTimestamp)
.load(basePath)
.filter("_hoodie_commit_time" > newerThanTimestamp)
.select("uuid", "driver", "begin_lat", "begin_lon")
Code Examples: Incremental Read
val newerThanTimestamp = "20200728232543"
val readDF =
spark.read.format("org.apache.hudi")
.option(QUERY_TYPE_OPT_KEY,QUERY_TYPE_INCREMENTAL_OPT_VAL)
.option(BEGIN_INSTANTTIME_OPT_KEY, newerThanTimestamp)
.load(basePath)
.filter("_hoodie_commit_time" > newerThanTimestamp)
.select("uuid", "driver", "begin_lat", "begin_lon")
This is simply the commit instant for 2020/07/28 23:25:43 (yyyyMMddHHmmss).
Supported Engines
Spark Flink Hive
Presto Impala Athena (AWS)
Table Data Format

01 Table Types
02 Table Layout
03 Log File Format
04 Q&A
Agenda
● Partitions are directories on disk
○ Date based partitions: 2021/01/01 2021/01/02 ….
● Data is written as records in data-files within partitions
2021/
01/
01/
fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210102020825.parquet
● Each record has a schema and should contain a partition path and a unique record key
● Each data-file is versioned and newer versions contain the latest data
● Supported data-file formats: Parquet, ORC (under development)
Basics
● fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210102020825.parquet
fileID (UUID) writeToken version (time of commit) file-format
● fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210103102345.parquet
(newer version as timestamp is greater)
● A record with a particular hoodie-key will exist in only one fileID.
Basics
Updates to existing records lead to a newer version of the data-file
How are Inserts processed
Inserts are partitioned and written to multiple new data-files
How are updates processed
All records are read from the latest version of the data-file
Updates are applied in memory
A new version of the data-file is written
Copy On Write
(Read-optimized format)
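Tables are Copy On Write by default; being explicit is a single write option. A minimal sketch, assuming the hoodie.datasource.write.table.type config and the hudiWriteOpts / generatedDataDF from the earlier code examples:
// Hedged sketch: explicitly request a Copy On Write table.
generatedDataDF.write.
  format("org.apache.hudi").
  options(hudiWriteOpts).
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  mode("append").
  save(basePath)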

Copy On Write: Explained
[Diagram] Batch 1 (ts1) upserts Key1-Key4: Hudi writes File 1 (Key1, Key3) and File 2 (Key2, Key4) as versions at C1 (ts1). Batch 2 (ts2) updates Key3: File 1 is rewritten as a new version at C2 (ts2) containing Key1 (still tagged C1) and Key3 (tagged C2), while File 2 keeps its C1 version. Queries against the Hudi table always read the latest file versions.
Copy On Write: Benefits
Latest Data
The latest version of the data can always be read from the latest data-files.
Performance
Native columnar file format read performance (read-optimized), with no merge overhead at query time.
Limited Updates
Very performant for insert-only workloads with occasional updates.
Copy On Write: Challenges
Write Amplification
Small batches lead to huge reads and rewrites of parquet files.
Ingestion Latency
Cannot ingest batches very frequently due to the huge IO and compute overhead.
File sizes
Cannot control file sizes very well; the larger the file, the more IO for a single record update.
Merge On Read
(Write-optimized format)
Updates to existing records are written to a “log-file” (similar to WAL)
How are Inserts processed
Inserts are partitioned and written to multiple new data-files
How are updates processed
Updates are written to a LogBlock
Write the LogBlock to the log-file
The log-file format is optimized to support appends (HDFS only), but it also works with cloud stores by creating new log-file versions instead.
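Choosing Merge On Read is likewise a write-time option, and the two views described below map to different query types; a hedged sketch assuming the hoodie.datasource.write.table.type and hoodie.datasource.query.type configs:
// Hedged sketch: create/write a Merge On Read table.
generatedDataDF.write.
  format("org.apache.hudi").
  options(hudiWriteOpts).
  option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
  mode("append").
  save(basePath)

// Read Optimized view: base files only (updates still in log files are skipped).
val roDF = spark.read.format("org.apache.hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load(basePath)

// Snapshot / real-time view: base files merged with log files at query time.
val rtDF = spark.read.format("org.apache.hudi").
  load(basePath)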

Merge On Read: Explained
[Diagram] Batch 1 (ts1) upserts Key1-Key4: two parquet data files are written at C1 (ts1), one holding Key1/Key3 and one holding Key2/Key4. Batch 2 (ts2) updates K1, K2 and K3: instead of rewriting the parquet files, the updates are appended as unmerged log files at ts2 alongside the C1 data files. Read Optimized queries read only the data files; Real Time queries merge the data files with the unmerged log files.
Merge On Read: Benefits
Low Ingestion latency
Writes are very fast
Write Amplification
Low write amplification as
merge is over multiple
ingestion batches
Read vs Write Optimization
Merge data-file and delta-file to create
new version of data-file.
“Compaction” operation creates new
version of data-file, can be scheduled
asynchronously in a separate pipeline
without stopping Ingestion or Readers.
New data-files automatically used after
Compaction completes.
Merge On Read: Challenges
Freshness is impacted
Freshness may be worse if the
read uses only the Read
Optimized View (only data files).
Increased query cost
If reading from data-files and
delta-files together (due to
merge overhead). This is called
Real Time View.
Compaction required to
bound merge cost
Need to create and monitor
additional pipeline(s)
● Made up of multiple LogBlocks
● Each LogBlock is made up of:
○ A header with timestamp, schema and other details of the operation
○ Serialized records which are part of the operation
○ LogBlock can hold any format, typically AVRO or Parquet
● Log-File is also versioned
○ S3 and cloud stores do not allow appends
○ Versioning helps to assemble all updates
Log File Format
fileID (UUID) version (time of commit) file-format writeToken
3215eafe-72cb-4547-929a-0e982be3f45d-0_20210119233138.log.1_0-26-5305

Table Metadata Format
01 Action Types
02 Hudi Metadata Table
03 Q&A
Agenda
No online component - all state is read and updated from HDFS
State saved as “actions” files within a directory (.hoodie)
.hoodie/
20210122133804.commit
20210122140222.clean
hoodie.properties
20210122140222.commit
when-action-happened what-action-was-taken
Sorted list of all actions is called “HUDI Timeline”
Basics
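Because all of this state is just files under .hoodie, the timeline can be inspected with a plain file listing; a minimal sketch using the Hadoop FileSystem API:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hedged sketch: list the action files that make up the Hudi timeline,
// e.g. 20210122133804.commit, 20210122140222.clean
val hoodieDir = new Path(basePath, ".hoodie")
val fs = hoodieDir.getFileSystem(new Configuration())
fs.listStatus(hoodieDir).
  map(_.getPath.getName).
  filter(name => name.endsWith(".commit") || name.endsWith(".clean")).
  sorted.
  foreach(println)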
Action Types
20210102102345.commit
COW Table: Insert or Updates
MOR Table: data-files merged with delta-files
20210102102345.rollback
Older commits rolled-back (data deleted)
20210102102345.delta-commit
MOR Only: Insert or Updates
20210102102345.replace
data-files clustered and re-written
20210102102345.clean
Older versions of data-files and delta-files deleted
20210102102345.restore
Restore dataset to a previous point in time

1. Mark the intention to perform an action
a. Create the file .hoodie/20210102102345.commit.requested
2. Pre-processing and validations (e.g. what files to update / delete)
3. Mark the starting of action
a. Create the file .hoodie/20210102102345.commit.inflight
b. Add the action plan to the file so we can rollback changes due to failures
4. Perform the action as per plan
5. Mark the end of the action
a. Create the file .hoodie/20210102102345.commit
How is an action performed ?
Before each operation HUDI needs to find the state of the dataset
List all action files from .hoodie directory
Read one or more of the action files
List one or more partitions to get list of latest data-files and log-files
HUDI operations lead to large number of ListStatus calls to NameNode
ListStatus is slow and resource intensive for NameNode
Challenges
● ListStatus data is cached in an internal table (Metadata Table)
● What is cached?
○ List of all partitions
○ List of files in each partition
○ Minimal required information on each file - file size
● Internal table is a HUDI MOR Table
○ Updated when any operation changes files (commit, clean, etc)
○ Updates written to log-files and compacted periodically
● Very fast lookups from the Metadata Table
HUDI File Listing Enhancements (0.7 release)
● Reduced load on NameNode
● Reduce time for operations which list partitions
● Metadata Table is a HUDI MOR Table (.hoodie/metadata)
○ Can be queried like a regular HUDI Table
○ Helps in debugging issues
Benefits
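Enabling this listing cache is config-driven on the write path; a hedged sketch assuming the hoodie.metadata.enable flag introduced in the 0.7 release (off by default there) and the hudiWriteOpts from the earlier examples:
// Hedged sketch: turn on the internal metadata (file-listing) table for this writer.
generatedDataDF.write.
  format("org.apache.hudi").
  options(hudiWriteOpts).
  option("hoodie.metadata.enable", "true").
  mode("append").
  save(basePath)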

Indexing
Recap
Writing data
Bulk_insert, insert, upsert, insert_overwrite
Querying data
Hive, Spark, Presto etc
Copy-On-Write: Columnar Format
Simple & ideal for analytics use-cases (limited updates)
Merge-On-Read: Write ahead log
Complex, but reduces write amplification with updates
Provides 2 views : Read-Optimized, Realtime
Timeline Metadata
Track information about actions taken on table
Incremental Processing
Efficiently propagate changes across tables
Table Service: Indexing
Concurrency Control
MVCC
Multi Version Concurrency Control
File versioning
Writes create a newer version, while concurrent
readers access an older version. For simplicity, we will
refer to hudi files as (fileId)-(timestamp)
● f1-t1, f1-t2
● f2-t1, f2-t2
Lock Free
Read and write transactions are isolated without any
need for locking.
Use timestamp to determine state of data to read.
Data Lake Feature Guarantees
Atomic multi-row commits
Snapshot isolation
Time travel

How is index used?
[Diagram] An upsert batch at t2 containing Key1-Key4 is first tagged: using the index and the timeline, each record key is mapped to its current location (Key1 -> partition, f1; Key2 -> partition, f2; Key3 -> partition, f1; Key4 -> partition, f2). With this index metadata, Key1/Key3 are routed to file group f1 and Key2/Key4 to file group f2, producing new file slices f1-t2 and f2-t2 (data/log) on top of the existing f1-t1 (Key1, Key3 at C1) and f2-t1 (Key2, Key4 at C1).
Indexing Scope
Global index
Enforce uniqueness of keys across all partitions of a table
Maintain mapping for record_key to (partition, fileId)
Update/delete cost grows with size of the table O(size of
table)
Local index
Enforce this constraint only within a specific partition.
Writer to provide the same consistent partition path for a
given record key
Maintain mapping (partition, record_key) -> (fileId)
Update/delete cost O(number of records updated/deleted)
Types of Indexes
Bloom Index (default)
Employs bloom filters built out of the record keys,
optionally also pruning candidate files using record key
ranges.
Ideal workload: Late arriving updates
Simple Index
Performs a lean join of the incoming update/delete records
against keys extracted from the table on storage.
Ideal workload: Random updates/deletes to a dimension
table
HBase Index
Manages the index mapping in an external Apache HBase
table.
Ideal workload: Global index
Custom Index
Users can provide custom index implementation
Indexing Configurations
Property: hoodie.index.type
Type of index to use. Default is Local Bloom filter (including
dynamic bloom filters)
Property: hoodie.index.class
Full path of user-defined index class and must be a
subclass of HoodieIndex class. It will take precedence over
the hoodie.index.type configuration if specified
Property: hoodie.bloom.index.parallelism
Dynamically computed, but may need tuning for some
cases for bloom index
Property hoodie.simple.index.parallelism
Tune parallelism for simple index
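Put together, the index choice is just another write option; a minimal sketch assuming these hoodie.index.* configs and the hudiWriteOpts map from the earlier examples:
// Hedged sketch: pick the index used to tag incoming records.
val indexOpts = Map(
  "hoodie.index.type" -> "BLOOM",           // or SIMPLE, HBASE, GLOBAL_BLOOM, GLOBAL_SIMPLE
  "hoodie.bloom.index.parallelism" -> "0"   // 0 = auto-computed; raise it if tagging is a bottleneck
)

generatedDataDF.write.
  format("org.apache.hudi").
  options(hudiWriteOpts ++ indexOpts).
  mode("append").
  save(basePath)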

Indexing Limitations
Indexing only works on the primary
key today.
Work is in progress to make this available as
a secondary index on other columns.
Index information is only used
by the writer.
Using it in the read path would improve
query performance.
Planned: move the index info from
parquet metadata into the
metadata table.
Storage Management
Storage Management
Compaction
Convert files on disk into read optimized files.
Clustering
Optimizing data layout, stitching small files
Cleaning
Remove Hudi data files that are no longer needed.
Hudi Rewriter
Pruning columns, encrypting columns and other rewriting
use-cases
Savepoint & Restore
Bring table back to a correct/old state
Archival
Archive Hudi metadata files that are no longer being
actively used.
Table Service: Compaction
The main motivation behind Merge-On-Read is to reduce data latency when ingesting records
Data is stored using a combination of base files and log files
Compaction is a process to produce new versions of base files by merging updates
Compaction is performed in 2 steps
Compaction Scheduling
Pluggable Strategies for compaction
This is done inline. In this step, Hudi scans the
partitions and selects base and log files to be
compacted. A compaction plan is finally written to
the Hudi timeline.
Compaction Execution
Inline - Perform compaction inline, right after ingestion
Asynchronous - A separate process reads the
compaction plan and performs compaction of file
slices.

Compaction Example
[Diagram: a Hudi-managed dataset across commits T1-T4. Batch 1 (T1), Batch 2 (T2) and Batch 3 (T3) are upserted; inserts land in a parquet base file while updates to keys K1/K3 accumulate as unmerged updates in a log file (versions of the log at T2, T3, T4). The commit timeline shows Commit 1 (T1) done, Commit 2 (T2) done, a compaction scheduled at T3 and Commit 4 (T4); while the T3 compaction and T4 commit are inflight, the new base file is a phantom file. The read-optimized view serves only the compacted parquet base files, while the real-time view merges the base files with the log at each instant.]
Code Examples: Inline compaction
df.write.format("org.apache.hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.parquet.small.file.limit", "0").
option("hoodie.compact.inline", "true").
option("hoodie.clustering.inline.max.delta.commits", "4").
option("hoodie.compaction.strategy, "org.apache.hudi.io.compact.strategy.LogFileSizeBasedCompactionStrategy").
mode(Append).
save(basePath);
Table Service: Clustering
Ingestion and query engines are
optimized for different things
FileSize
Ingestion prefers small files to improve
freshness. Small files => increase in parallelism
Query engines (and HDFS) perform poorly
when there are a lot of small files
Data locality
Ingestion typically groups data based on arrival
time
Queries perform better when data frequently
queried together is co-located
Clustering is a new framework introduced in Hudi 0.7
Improve query performance without compromising on
ingestion speed
Run inline or in an async pipeline
Pluggable strategy to rewrite data
Provides two in-built strategies to 1) ‘stitch’ files and 2) ‘sort’
data on a list of columns
Superset of Compaction.
Follows MVCC like other Hudi operations
Provides snapshot isolation, time travel etc.
Update index/metadata as needed
Disadvantage: Incurs additional rewrite cost
Clustering: efficiency gain
Before clustering: 20M rows scanned After clustering: 100K rows scanned

Code Examples: Inline clustering
df.write.format("org.apache.hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.parquet.small.file.limit", "0").
option("hoodie.clustering.inline", "true").
option("hoodie.clustering.inline.max.commits", "4").
option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824").
option("hoodie.clustering.plan.strategy.small.file.limit", "629145600").
option("hoodie.clustering.plan.strategy.sort.columns", ""). //optional, if sorting is needed
mode(Append).
save(basePath);
Table Service: Cleaning
Delete older data files that are no longer
needed
Different configurable policies supported.
Cleaner runs inline after every commit.
Criteria #1: Time to detect (TTD) data quality issues
Provide sufficient time to detect data quality issues.
Multiple versions of the data are stored, so earlier versions
can be used as a backup: the table can be rolled back to
an earlier version as long as the cleaner has not deleted
those files.
Criteria #2: Long-running queries
Provide sufficient time for long-running jobs to
finish. Otherwise, the cleaner could delete a
file that is still being read by the job and fail the job.
Criteria #3: Incremental queries
If you are using the incremental pull feature,
ensure the cleaner is configured to retain a sufficient
number of recent commits to rewind to.
Cleaning Policies
Partition structure
f1_t1.parquet, f2_t1.parquet,
f3_t1.parquet
f1_t2.parquet, f2_t2.parquet,
f4_t2.parquet
f1_t3.parquet, f3_t3.parquet
Keep N latest versions
N=2, retain 2 versions for each file
group
At t3: Only f1_t1 can be removed
Keep N latest commits
N=2, retain all data for t2, t3
commits
At t3: f1_t1, f2_t1 can be removed.
f3_t1 cannot be removed
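Code sketch: Cleaner configuration
A minimal sketch of choosing a cleaning policy on the writer. The keys and policy names below are standard Hudi cleaner settings stated here as an assumption for illustration; the retention count is a placeholder.
df.write.format("org.apache.hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS"). // or KEEP_LATEST_FILE_VERSIONS
option("hoodie.cleaner.commits.retained", "10"). // keep data referenced by the last 10 commits
mode(Append).
save(basePath);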
Table Service: Archiving
Deletes older metadata. Timeline state is saved as
"action" files within a directory (.hoodie):
.hoodie/20210122133804.commit
.hoodie/20210122140222.clean
.hoodie/hoodie.properties
Over time, many small files
are created
Archival moves older metadata to a
commits.archived sequence file
Easy Configurations
Set “hoodie.keep.min.commits” and
“hoodie.keep.max.commits”
Incremental queries only
work on the 'active' timeline
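Code sketch: Archival configuration
A minimal sketch using the two properties named above; the values are placeholders. Once the active timeline grows beyond the max, Hudi archives instants down to the min.
df.write.format("org.apache.hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.keep.min.commits", "20"). // archive down to 20 instants on the active timeline
option("hoodie.keep.max.commits", "30"). // trigger archival once the active timeline exceeds 30 instants
mode(Append).
save(basePath);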

Table Service: Savepoints & Restore
Some common questions in production
systems
What if a bug resulted in incorrect data pushed to
the ingestion system?
What if an upstream system incorrectly marked
column values as null?
Hudi addresses these concerns for you
Ability to restore the table to the last known
correct time
Restore to a well-known state
Logically "rollback" multiple commits
Savepoints - checkpoints at different instants of
time
Pro - optimizes the number of versions that need to be
stored and minimizes disk space
Con - not available for Merge-On-Read table types
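Code sketch: Savepoint and restore
Savepoints and restores are typically driven from the Hudi CLI (covered later in this deck). The session below is only an illustrative sketch; the command names and flags are assumptions, so verify them against the CLI help for your Hudi version.
connect --path /path/to/hudi_table        # connect with the table
savepoint create --commit 20210122133804  # checkpoint a known-good instant (assumed syntax)
savepoint rollback --savepoint 20210122133804  # later, restore the table to that instant (assumed syntax)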
Tools &
Capabilities
01 Ingestion frameworks
02 Hudi CLI
03 < 5 mins ingestion latency
04 Onboarding existing tables to Hudi
05 Testing Infra
06 Observability
07 Q&A
Agenda
Hudi offers standalone utilities to connect with data sources, inspect the dataset, and
register a table with the Hive Metastore (HMS).
Ingestion framework
[Diagram: Hudi utilities in the ingestion path. Sources (Kafka, CSV, DFS, Hive table, Hudi table etc.) are ingested into the data lake on DFS-compatible stores (HDFS, AWS, GCP etc.) via DeltaStreamer or the Spark DataSource, running on an execution framework. HiveSyncTool registers the table with HMS for query engines (Hive, Presto, Spark SQL, Impala, AWS Athena); the Hudi CLI inspects table metadata.]

Input formats
Input data can be available as an
HDFS file, a Kafka source, or an
input stream.
Run Exactly Once
Performs one ingestion round, which
includes incrementally pulling
events from upstream sources and
ingesting them into the Hudi table.
Continuous Mode
Runs an infinite loop, with each
round performing one ingestion
round as described in run-once
mode. The frequency of data
ingestion can be controlled by
configuration.
Record Types
Supports JSON, Avro, or a custom
record type for the incoming data.
Checkpoint, rollback and
recovery
Automatically takes care of
checkpointing the input data, rollback
and recovery.
Avro Schemas
Leverages Avro schemas from DFS
or a schema registry service.
DeltaStreamer
HoodieDeltaStreamer Example
More info at
https://hudi.apache.org/docs/writing_data.html#deltastreamer
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field impresssiontime \
--target-base-path file:///tmp/hudi-deltastreamer-op \
--target-table uber.impressions \
--op BULK_INSERT
HoodieDeltaStreamer is used to ingest from a Kafka source into a Hudi table.
Details on how to use the tool are available at the docs link above.
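To run in continuous mode instead of a single round, the same spark-submit command is typically kept running by adding the --continuous flag (stated here from general Hudi usage; check the DeltaStreamer help output for your version), e.g.:
spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` ... --continuous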
Spark DataSource API
The hudi-spark module offers the
DataSource API to write (and read) a
Spark DataFrame into a Hudi table.
Structured Spark Streaming
Hudi also supports Spark Structured
Streaming to ingest data from a
streaming source into a Hudi table.
Flink Streaming
Hudi added support for the Flink
execution engine in the latest 0.7.0
release.
Execution Engines
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) // any of the Hudi client opts can be passed in as well
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
More info at
https://hudi.apache.org/docs/writing_data.html#deltastreamer
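Code sketch: Structured Streaming write
A minimal sketch of the Structured Spark Streaming path mentioned above, assuming a streaming DataFrame streamingDF and a checkpoint directory checkpointDir (both placeholders introduced here, not from the deck):
streamingDF.writeStream.
format("org.apache.hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("checkpointLocation", checkpointDir). // Spark streaming checkpoint location, not a Hudi option
outputMode("append").
start(basePath)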
Hudi CLI
Create table Connect with table Inspect commit metadata
File System View Inspect Archived Commits Clean, Rollback commits
More info at
https://hudi.apache.org/docs/deployment.html
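A short illustrative CLI session for the operations listed above (command names are based on typical Hudi CLI usage and should be checked against the deployment docs linked above):
connect --path /path/to/hudi_table   # connect with the table
desc                                 # inspect table properties
commits show                         # inspect commit metadata
show archived commits                # inspect archived commits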

Recommended for you

Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...

Los sistemas distribuidos son difíciles. Los sistemas distribuidos de alto rendimiento, más. Latencias de red, mensajes sin confirmación de recibo, reinicios de servidores, fallos de hardware, bugs en el software, releases problemáticas, timeouts... hay un montón de motivos por los que es muy difícil saber si un mensaje que has enviado se ha recibido y procesado correctamente en destino. Así que para asegurar mandas el mensaje otra vez.. y otra... y cruzas los dedos para que el sistema del otro lado tenga tolerancia a los duplicados. QuestDB es una base de datos open source diseñada para alto rendimiento. Nos queríamos asegurar de poder ofrecer garantías de "exactly once", deduplicando mensajes en tiempo de ingestión. En esta charla, te cuento cómo diseñamos e implementamos la palabra clave DEDUP en QuestDB, permitiendo deduplicar y además permitiendo Upserts en datos en tiempo real, añadiendo solo un 8% de tiempo de proceso, incluso en flujos con millones de inserciones por segundo. Además, explicaré nuestra arquitectura de log de escrituras (WAL) paralelo y multithread. Por supuesto, todo esto te lo cuento con demos, para que veas cómo funciona en la práctica.

time-seriesquestdbdatabases
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure

Airline Satisfaction Project using Azure This presentation is created as a foundation of understanding and comparing data science/machine learning solutions made in Python notebooks locally and on Azure cloud, as a part of Course DP-100 - Designing and Implementing a Data Science Solution on Azure.

data science
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript

原版制作【微信:A575476】【(NC毕业证)尼亚加拉学院毕业证成绩单offer】【微信:A575476】(留信学历认证永久存档查询)采用学校原版纸张(包括:隐形水印,阴影底纹,钢印LOGO烫金烫银,LOGO烫金烫银复合重叠,文字图案浮雕,激光镭射,紫外荧光,温感,复印防伪)行业标杆!精益求精,诚心合作,真诚制作!多年品质 ,按需精细制作,24小时接单,全套进口原装设备,十五年致力于帮助留学生解决难题,业务范围有加拿大、英国、澳洲、韩国、美国、新加坡,新西兰等学历材料,包您满意。 【业务选择办理准则】 一、工作未确定,回国需先给父母、亲戚朋友看下文凭的情况,办理一份就读学校的毕业证【微信:A575476】文凭即可 二、回国进私企、外企、自己做生意的情况,这些单位是不查询毕业证真伪的,而且国内没有渠道去查询国外文凭的真假,也不需要提供真实教育部认证。鉴于此,办理一份毕业证【微信:A575476】即可 三、进国企,银行,事业单位,考公务员等等,这些单位是必需要提供真实教育部认证的,办理教育部认证所需资料众多且烦琐,所有材料您都必须提供原件,我们凭借丰富的经验,快捷的绿色通道帮您快速整合材料,让您少走弯路。 留信网认证的作用: 1:该专业认证可证明留学生真实身份【微信:A575476】 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内,将在公安局网内查询个人身份证信息后,同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料,供国家高端企业选择人才 → 【关于价格问题(保证一手价格) 我们所定的价格是非常合理的,而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子 我给客户的都是第一手的代理价格,因为我想坦诚对待大家 不想跟大家在价格方面浪费时间 对于老客户或者被老客户介绍过来的朋友,我们都会适当给一些优惠。 选择实体注册公司办理,更放心,更安全!我们的承诺:可来公司面谈,可签订合同,会陪同客户一起到教育部认证窗口递交认证材料,客户在教育部官方认证查询网站查询到认证通过结果后付款,不成功不收费! 办理(NC毕业证)尼亚加拉学院毕业证【微信:A575476】外观非常精致,由特殊纸质材料制成,上面印有校徽、校名、毕业生姓名、专业等信息。 办理(NC毕业证)尼亚加拉学院毕业证【微信:A575476】格式相对统一,各专业都有相应的模板。通常包括以下部分: 校徽:象征着学校的荣誉和传承。 校名:学校英文全称 授予学位:本部分将注明获得的具体学位名称。 毕业生姓名:这是最重要的信息之一,标志着该证书是由特定人员获得的。 颁发日期:这是毕业正式生效的时间,也代表着毕业生学业的结束。 其他信息:根据不同的专业和学位,可能会有一些特定的信息或章节。 办理(NC毕业证)尼亚加拉学院毕业证【微信:A575476】价值很高,需要妥善保管。一般来说,应放置在安全、干燥、防潮的地方,避免长时间暴露在阳光下。如需使用,最好使用复印件而不是原件,以免丢失。 综上所述,办理(NC毕业证)尼亚加拉学院毕业证【微信:A575476 】是证明身份和学历的高价值文件。外观简单庄重,格式统一,包括重要的个人信息和发布日期。对持有人来说,妥善保管是非常重要的。

旧金山艺术大学毕业证金门大学毕业证圣地亚哥州立大学毕业证
Hive Registration Tools
The Hive sync tool enables syncing the table's latest schema and updated partitions to the Hive metastore.
cd hudi-hive
./run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive \
--partitioned-by partition --base-path <basePath> --database default --table <tableName>
Hive Ecosystem
[Diagram: HiveSyncTool registers the Hudi table with the Hive Meta Store (HMS) and syncs schema and partition changes. Query planners and executors (Hive, Presto, Spark SQL) read the Hudi dataset through HoodieInputFormat, which exposes the Hudi data files present on the DFS; the Spark DataSource integrates HoodieInputFormat directly, without any dependency on HMS.]
More info at
https://hudi.apache.org/docs/writing_data.html#syncing-to-hive
Write Amplification
COW tables receiving many updates
have high write amplification, as
the files are rewritten as new
snapshots even if only a single record
in the data file changes.
Amortized rewrites
MOR reduces this write
amplification by writing the updates
to a log file and periodically merging
the log files with base data files,
amortizing the cost of rewriting the
data file until compaction time.
Read Optimized vs real
time view
Data freshness experienced by the
reader is affected by whether the
read requests are served from
compacted base files or by merging
the base files with log files in real
time, just before serving the reads.
Small vs Large Data files
Creating smaller data files (< 50 MB)
can be done in under a few minutes.
However, creating lots of small files
puts pressure on the
NameNode during HDFS listing
(and other metadata) operations.
Creating larger data files (~1 GB) takes
longer to write to disk (10+ minutes).
However, maintaining larger files
reduces the NameNode pressure.
Achieving ingestion latency of < 5 mins
With clustering and compaction
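Code sketch: Controlling base file size
A minimal sketch of the two write configs that usually drive the small-vs-large file trade-off described above, named here as standard Hudi settings (the byte values are placeholders, not recommendations from this deck):
df.write.format("org.apache.hudi").
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
option("hoodie.parquet.max.file.size", "1073741824"). // ~1 GB target base file size
option("hoodie.parquet.small.file.limit", "104857600"). // files below ~100 MB are topped up by new inserts
mode(Append).
save(basePath);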
Achieving ingestion latency of < 5 mins
Managing write amplification with Clustering
[Diagram: ingestion commits C0-C10 writing inserts, updates and deletes into small (< 50 MB) base files per partition (e.g. F5_W1_C5.parquet with [F1_C1, F2_C2, F3_C2, F4_C5] in partition P1, F12_W1_C5.parquet with [F10_C1, F11_C3, ...] in partition P2). Small base files keep write amplification and ingestion latency low. A background clustering process periodically rewrites these small base files into large (~1 GB) clustered files via a clustering/compaction commit, amortizing the cost and reducing pressure on the NameNode. All base files remain available to readers, so a query on the real-time table at commit C10 sees freshness updated at every ingestion commit.]
[Diagram: ingestion commits C0-C10 on a merge-on-read table. Updates and deletes are written by the ingestion process to row-based append logs (e.g. F1_W1_C7.log and F1_W1_C10.log in partition P1, F2_W1_C6.log in partition P2) alongside columnar base files (F1_W1_C5.parquet, F2_W1_C2.parquet); a later async compaction commit merges the log files into the base files. A query on the read-optimized table at commit C10 reads only the base files F1_W1_C5.parquet and F2_W1_C2.parquet. A query on the real-time table at commit C10 merges F1_W1_C5.parquet with the append logs F1_W1_C7.log and F1_W1_C10.log, and F2_W1_C2.parquet with F2_W1_C6.log.]
Achieving ingestion latency of < 5 mins
Managing write amplification with merge-on-read

Legacy data
When legacy data is available in parquet
format and the table needs to be
converted to a Hudi table, all the
parquet files would otherwise have to be
rewritten as Hudi data files.
Fast Migration Process
With Hudi Fast Migration, Hudi will keep
the legacy data files (in parquet format)
and generate a skeleton file containing
Hudi-specific metadata, with a special
"BOOTSTRAP_TIMESTAMP".
Querying legacy partitions
When executing a query involving legacy
partitions, Hudi will return the legacy
data files to the query engines. (Query
engines can serve the query using
regular, non-Hudi parquet/data files.)
Onboarding your table to Hudi
val bootstrapDF = spark.emptyDataFrame
bootstrapDF.write
.format("hudi")
.option(HoodieWriteConfig.TABLE_NAME, "hoodie_test")
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key")
.option(..)..
.mode(SaveMode.Overwrite)
.save(basePath)
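The .option(..) placeholder above stands for the remaining write options; in practice the bootstrap write also needs to be pointed at the location of the legacy parquet data, commonly via hoodie.bootstrap.base.path (named here as an assumption for illustration; see the Hudi bootstrap docs for the full option list).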
Hudi unit-testing
framework
Hudi offers a unit testing framework
that allows developers to write unit
tests that mimic real-world
scenarios and run these tests every
time the code is recompiled.
This enables increased developer
velocity and robust code changes.
Hudi-test-suite
Hudi-test-suite makes use of the
hudi utilities to create an
end-to-end testing framework to
simulate complex workloads,
schema evolution scenarios and
version compatibility tests.
Hudi A/B testing
Hudi offers A/B-style testing to
ensure that data produced with a
new change/build matches the data
produced by the exact same workload
in production.
With “Hudi union file-system”, a
production Hudi table can be used
as a read-only reference file system.
Any commit from the production
hudi-timeline can be replayed,
using a different Hudi build, to
compare the results of the “replay
run” against the “production run”.
Hudi Testing infrastructure
Hudi Test Suite
Build Complex Workloads
Define a complex workload that reflects
production setups.
Test Version Compatibility
Pause the workload, upgrade the dependency
version, then resume the workload.
Evolve Workloads
Simulate changing elements such as schema
changes.
Simulate Ingestion
Use the mock data generator to simulate ingestion and launch queries.
[Diagram: simulated sources (Cassandra DBEvents, MySQL DBEvents, Schemaless DBEvents, user application events via Heat pipe) feed the unified ingestion pipeline with source/sink specific DAGs into HDFS, which is queried through Hive/Presto/Spark SQL.]
Production workload as read-only file system
Hudi A/B testing
[Diagram: a production Hudi table (partitions P1 and P2, commits C0-C10, base files such as F5_W1_C5.parquet, F6_W1_C6.parquet, F11_W1_C10.parquet, F12_W1_C5.parquet, F13_W1_C7.parquet) is mounted as a read-only reference file system, with a write-enabled test file system layered on top (partition P1 write-enabled). A production commit such as C10 is replayed against the test file system using the new build; the test ensures that the commit metadata and the base/log data files produced by the "commit replay" match the original commit in production.]

Hudi Observability
Insights on a specific
ingestion run
Collects key insights around storage
efficiency and ingestion performance, and
surfaces bottlenecks at various stages.
These insights can be used by
feedback-based tuning jobs to automate
fine-tuning of ingestion jobs.
Identifying outliers
At large scale, across thousands of
tables, when a bad node/executor is
involved, identifying the bad actor
takes time, requires coordination
across teams and involves lots of our
production on-call resources.
By reporting normalized stats that
are independent of job size or
workload characteristics, bad
executors/nodes can be surfaced as
outliers that warrant a closer
inspection.
Insights on Parallelism
When managing thousands of Hudi
tables in the data lake, the ability to
visualize the parallelism applied at
each stage of a job enables insights
into the bottlenecks and allows the job
to be fine-tuned at a granular level.
On-Going
& Future Work
➔ Concurrent Writers [RFC-22] & [PR-2374]
◆ Multiple Writers to Hudi tables with file level concurrency control
➔ Hudi Observability [RFC-23]
◆ Collect metrics such as Physical vs Logical, Users, Stage Skews
◆ Use to feedback jobs for auto-tuning
➔ Point index [RFC-08]
◆ Target usage for primary key indexes, eg. B+ Tree
➔ ORC support [RFC]
◆ Support for ORC file format
➔ Range Index [RFC-15]
◆ Target usage for column ranges and pruning files/row groups (secondary/column indexes)
➔ Enhance Hudi on Flink [RFC-24]
◆ Full feature support for Hudi on Flink version 1.11+
◆ First class support for Flink
➔ Spark-SQL extensions [RFC-25]
◆ DML/DDL operations such as create, insert, merge etc
◆ Spark DatasourceV2 (Spark 3+)
On-Going Work
➔ Native Schema Evolution
◆ Support remove and rename columns
➔ Apache Calcite SQL integration
◆ DML/DDL support for other engines besides Spark
Future Work (Upcoming RFCs)

Thank you
dev@hudi.apache.org
@apachehudi
https://hudi.apache.org

More Related Content

What's hot

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
Vinoth Chandar
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 

What's hot (20)

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper OptimizationApache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 

Similar to Hudi architecture, fundamentals and capabilities

Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
Sudar Muthu
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
Amazon Web Services
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
Ambuj Kumar
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Alluxio, Inc.
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
zingopen
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
Qubole
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & Zing
Long Dao
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
Edward D. Kim
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
RutujaPatil247341
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
gvernik
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
gvernik
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
Amar kumar
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
Serkan Özal
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
Douglas Bernardini
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010
Yahoo Developer Network
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
Cloudera, Inc.
 

Similar to Hudi architecture, fundamentals and capabilities (20)

Hands on Hadoop and pig
Hands on Hadoop and pigHands on Hadoop and pig
Hands on Hadoop and pig
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
AWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloudAWS Summit 2011: Big Data Analytics in the AWS cloud
AWS Summit 2011: Big Data Analytics in the AWS cloud
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Basics of big data analytics hadoop
Basics of big data analytics hadoopBasics of big data analytics hadoop
Basics of big data analytics hadoop
 
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Optimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public CloudOptimizing Big Data to run in the Public Cloud
Optimizing Big Data to run in the Public Cloud
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & Zing
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
How can Hadoop & SAP be integrated
How can Hadoop & SAP be integratedHow can Hadoop & SAP be integrated
How can Hadoop & SAP be integrated
 
Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010Hadoop Frameworks Panel__HadoopSummit2010
Hadoop Frameworks Panel__HadoopSummit2010
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 

Recently uploaded

[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
Donghwan Lee
 
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
roobykhan02154
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
Nikita Singh$A17
 
一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理
一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理
一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理
qemnpg
 
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
manjukaushik328
 
Orange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdf
Orange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdfOrange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdf
Orange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdf
RealDarrah
 
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
shruti singh$A17
 
@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...
@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...
@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...
Disha Mukharji
 
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
punebabes1
 
@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...
@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...
@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...
shivvichadda
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
dipti singh$A17
 
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdfAWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
Miguel Ángel Rodríguez Anticona
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
sanjay singh
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
Amazon Web Services Korea
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
javier ramirez
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure
SanelaNikodinoska1
 
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript
taqyea
 

Recently uploaded (20)

[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and Tuning
 
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
 
一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理
一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理
一比一原版英国埃塞克斯大学毕业证(essex毕业证书)如何办理
 
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
@Call @Girls Kolkata 0000000000 Shivani Beautiful Girl any Time
 
Orange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdf
Orange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdfOrange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdf
Orange Yellow Gradient Aesthetic Y2K Creative Portfolio Presentation -3.pdf
 
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeSaket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Saket @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
 
@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...
@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...
@Call @Girls Mira Bhayandar phone 9920874524 You Are Serach A Beautyfull Doll...
 
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
Madurai @Call @Girls Whatsapp 0000000000 With High Profile Offer 25%
 
@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...
@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...
@Call @Girls Coimbatore 🚒 XXXXXXXXXX 🚒 Priya Sharma Beautiful And Cute Girl a...
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdfAWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
AWS Cloud Technology and Services by Miguel Ángel Rodríguez Anticona.pdf
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
 
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
[D3T1S04] Aurora PostgreSQL performance monitoring and troubleshooting by use...
 
How We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeachHow We Added Replication to QuestDB - JonTheBeach
How We Added Replication to QuestDB - JonTheBeach
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
 
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure
 
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript
 

Hudi architecture, fundamentals and capabilities

  • 2. Hudi Intro Apache Hudi ingests & manages storage of large analytical datasets over DFS (hdfs or cloud stores). Hudi brings stream processing to big data, providing fresh data while being an order of magnitude efficient over traditional batch processing. Incremental Database Ingestion De-duping Log Events Storage Management Transactional Writes Faster Derived/ETL Data Compliance/Data Deletions Unique key constraints Late data handling
  • 4. Data Consistency Datacenter agnostic, xDC replication, strong consistency Data Freshness < 15 min of freshness on Lake & warehouse Hudi for Data Application Feature store for ML Incremental Processing for all Easy on-boarding, monitoring & debugging Adaptive Data Layout Stitch files, Optimize layout, Prune columns, Encrypt rows/columns on demand through a standardized interface Efficient Query Execution Column indexes for improved query planning & execution Compute & Storage Efficiency Do more with less CPU, Storage, Memory Data Accuracy Semantic validations for columns: NotNull, Range etc Hudi@Uber
  • 5. UseCase (Latency, Scale..) Batch / Stream (Spark/Flink//Presto/...) Source A Table API Incremental Stream Pulls & Joins Consumer Derived Table A delta Source B delta Source N delta ... Table A delta Table B delta Table N delta ... UseCase (Latency, Scale..) Table API Incremental Stream Pulls & Joins Consumer Derived Table B Data Processing : Incremental Streams Batch / Stream (Spark/Flink/Presto/...) *source = {Kafka, CSV, DFS, Hive table, Hudi table etc}
  • 6. 500B+ records/day 150+ PB Transactional Data Lake 8,000+ Tables Hudi@Uber Facts and figures
  • 8. 01 Write Client 02 Read Client 03 Supported Engines 04 Q&A Agenda
  • 9. Hudi APIs Highlights Snapshot Isolation Readers will not see partial writes from writers. Atomic Writes Writes happen either full, or not at all. Partial writes (eg from killed processes) are not valid. Read / Write Optimized Depending on the required SLA, writes or reads can be made faster (at the other’s expense). Incremental Reads/Writes Readers can choose to only read new records from some timestamp. This makes efficient incremental pipelines possible. Point In Time Queries (aka Time-Travel) Readers can read snapshot views at either the latest time, or some past time. Table Services Table management services such as clustering, or compacting (covered in later series).
  • 10. Insert ● Similar to INSERT in databases ● Insert records without checking for duplicates. Hudi Write APIs Upsert ● Similar to UPDATE or INSERT paradigms in databases ● Uses an index to find existing records to update and avoids duplicates. ● Slower than Insert.
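The deck does not show how the operation is chosen; as a minimal sketch (assuming the tableName, basePath and generatedDataDF values used in the later code examples), the Spark datasource selects it through the hoodie.datasource.write.operation option:

  // Pick the write operation explicitly: "insert", "upsert" or "bulk_insert"
  generatedDataDF.write.
    format("org.apache.hudi").
    option("hoodie.table.name", tableName).
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "ts").
    option("hoodie.datasource.write.operation", "upsert").  // "insert" skips the index lookup
    mode("append").
    save(basePath)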
  • 11. Hudi Write APIs Bulk Insert ● Similar to Insert. ● Handles large amounts of data - best for bootstrapping use-cases. ● Does not guarantee file sizing Insert Overwrite ● Overwrite a partition with new data. ● Useful for backfilling use-cases. Insert Upsert
  • 12. Bulk Insert Hudi Write APIs Delete ● Similar to DELETE in databases. ● Soft Deletes / Hard Deletes Hive Registration ● Sync the changes to your dataset to Hive. Insert Overwrite Insert Upsert
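As a hedged sketch of the Delete API: a hard delete reuses the same datasource write path, with the operation set to "delete" and a DataFrame that carries only the keys to remove (field and variable names follow the earlier examples):

  // Hard delete: write just the record keys (and partition values) with the "delete" operation
  val toDelete = spark.read.format("org.apache.hudi").
    load(basePath).
    select("uuid", "ts").
    limit(10)

  toDelete.write.
    format("org.apache.hudi").
    option("hoodie.table.name", tableName).
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "ts").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.operation", "delete").
    mode("append").
    save(basePath)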
  • 13. Hudi Write APIs Rollback / Restore ● Rollback inserts/upserts etc to restore the dataset to some past state. ● Useful when mistakes happen. Bulk Insert Hive Registration Insert Upsert Insert Overwrite Delete
  • 14. Hudi Read APIs Snapshot Read ● This is the typical read pattern ● Read data at latest time (standard) ● Read data at some point in time (time travel) Incremental Read ● Read records modified only after a certain time or operation. ● Can be used in incremental processing pipelines.
  • 15. Hudi Metadata Client Get Latest Snapshot Files Get the list of files that contain the latest snapshot data. This is useful for backing up / archiving datasets. Globally Consistent Meta Client Get X-DC consistent views at the cost of freshness. Get Partitions / Files Mutated Since Get a list of partitions or files mutated since some timestamp. This is also useful for incremental backup / archiving. There is a read client for Hudi Table Metadata as well. Here are some API highlights:
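The deck shows no code for this client; as a rough, version-dependent illustration, the org.apache.hudi.HoodieDataSourceHelpers class shipped with the Spark bundle answers similar timeline questions (treat the exact method set as an assumption to verify against your Hudi version):

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hudi.HoodieDataSourceHelpers

  val fs = FileSystem.get(new Path(basePath).toUri, spark.sparkContext.hadoopConfiguration)

  // Latest completed commit on the table
  val latestCommit = HoodieDataSourceHelpers.latestCommit(fs, basePath)

  // Commits (and hence files/partitions mutated) since a given timestamp
  val commitsSince = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, "20210101000000")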
  • 16. Hudi Table Services Compaction Convert files on disk into read optimized files (see Merge on Read in the next section). Clustering Clustering can make reads more efficient by changing the physical layout of records across files. (see section 3) Clean Remove Hudi data files that are no longer needed. (see section 3) Archiving Archive Hudi metadata files that are no longer being actively used. (see section 3)
  • 17. Code Examples val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2))
  • 18. Code Examples val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) This is a data gen class provided by Hudi for testing We’ll be using SPARK for this demo
  • 19. Code Examples: Generate Data val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) generatedDataDF.show() +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | begin_lat| begin_lon| driver| end_lat| end_lon| fare| geo| rider| ts| uuid| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | 0.47269058795|0.461578584504|driver-213| 0.75480340700| 0.967115994201|34.1582847163|americas/brazi...|rider-213|1611908339000|500ca486-9323-46f...| | 0.61000705621| 0.87794022954|driver-213| 0.340787050592| 0.503079814229| 43.49238112|americas/brazi...|rider-123|1614586739000|cf0767fe-2afa-4b0...| | 0.57318354079| 0.49234796529|driver-213|0.0898858178093|0.4252089969871| 64.276962958|americas/unite...|rider-214|1617326300000|a7fc67fd-8026-4c9...| |0.216241503676|0.142850512594|driver-213| 0.589094962481| 0.096682383192| 93.560181152|americas/unite...|rider-417|1612167539000|77572226-6edd-4fb...| | 0.406135109| 0.56440921390|driver-213| 0.79870630494|0.0269835922718|17.8511352550| asia/india/c...|rider-249|1617326301000|2227d696-bc2f-490...| | 0.87420415264| 0.75282681532|driver-213| 0.919782712888| 0.36246477087|19.1791391066|americas/unite...|rider-351|1618301939000|90a4dae9-e21d-4a4...| | 0.18564880850| 0.96945864178|driver-213|0.3818636703720|0.2525265221447| 33.922164839|americas/unite...|rider-491|1611908339000|bb742f4d-1ab8-42e...| | 0.07505887600|0.038441044444|driver-213|0.0437635335453| 0.634604006761| 66.620843664|americas/brazi...|rider-481|1617351482000|4735f5e6-e746-49d...| | 0.6510585056| 0.81928686877|driver-213|0.2071489600291|0.0622403109582| 41.062909290| asia/india/c...|rider-471|1617325991000|16db8d5d-955a-4d1...| |0.114883931570| 0.62732122024|driver-213| 0.745467853751| 0.395493986490| 27.794786885|americas/unite...|rider-591|1611908339000|115c2738-9059-4be...| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+
  • 20. Code Examples: Generate Data val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) generatedDataDF.show() +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | begin_lat| begin_lon| driver| end_lat| end_lon| fare| geo| rider| ts| uuid| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ | 0.47269058795|0.461578584504|driver-213| 0.75480340700| 0.967115994201|34.1582847163|americas/brazi...|rider-213|1611908339000|500ca486-9323-46f...| | 0.61000705621| 0.87794022954|driver-213| 0.340787050592| 0.503079814229| 43.49238112|americas/brazi...|rider-123|1614586739000|cf0767fe-2afa-4b0...| | 0.57318354079| 0.49234796529|driver-213|0.0898858178093|0.4252089969871| 64.276962958|americas/unite...|rider-214|1617326300000|a7fc67fd-8026-4c9...| |0.216241503676|0.142850512594|driver-213| 0.589094962481| 0.096682383192| 93.560181152|americas/unite...|rider-417|1612167539000|77572226-6edd-4fb...| | 0.406135109| 0.56440921390|driver-213| 0.79870630494|0.0269835922718|17.8511352550| asia/india/c...|rider-249|1617326301000|2227d696-bc2f-490...| | 0.87420415264| 0.75282681532|driver-213| 0.919782712888| 0.36246477087|19.1791391066|americas/unite...|rider-351|1618301939000|90a4dae9-e21d-4a4...| | 0.18564880850| 0.96945864178|driver-213|0.3818636703720|0.2525265221447| 33.922164839|americas/unite...|rider-491|1611908339000|bb742f4d-1ab8-42e...| | 0.07505887600|0.038441044444|driver-213|0.0437635335453| 0.634604006761| 66.620843664|americas/brazi...|rider-481|1617351482000|4735f5e6-e746-49d...| | 0.6510585056| 0.81928686877|driver-213|0.2071489600291|0.0622403109582| 41.062909290| asia/india/c...|rider-471|1617325991000|16db8d5d-955a-4d1...| |0.114883931570| 0.62732122024|driver-213| 0.745467853751| 0.395493986490| 27.794786885|americas/unite...|rider-591|1611908339000|115c2738-9059-4be...| +--------------+--------------+----------+---------------+---------------+-------------+-----------------+---------+-------------+--------------------+ and this for hoodie record key. We’ll use this for partition key.
  • 21. Code Examples: Writes Opts val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) val hudiWriteOpts = Map( "hoodie.table.name" -> (tableName), "hoodie.datasource.write.recordkey.field" -> "uuid", "hoodie.datasource.write.partitionpath.field" -> "ts", "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator", "hoodie.deltastreamer.keygen.timebased.timestamp.type" -> "UNIX_TIMESTAMP", "hoodie.deltastreamer.keygen.timebased.output.dateformat" -> "yyyy/MM/dd", )
  • 22. Code Examples: Write val dataGenerator = new DataGenerator val generatedJson = convertToStringList(dataGenerator.generateInserts(100)) val generatedDataDF = spark.read.json(spark.sparkContext.parallelize(generatedJson, 2)) val hudiWriteOpts = Map( "hoodie.table.name" -> (tableName), "hoodie.datasource.write.recordkey.field" -> "uuid", "hoodie.datasource.write.partitionpath.field" -> "ts", "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator", "hoodie.deltastreamer.keygen.timebased.timestamp.type" -> "UNIX_TIMESTAMP", "hoodie.deltastreamer.keygen.timebased.output.dateformat" -> "yyyy/MM/dd", ) generatedDataDF.write. format("org.apache.hudi"). options(hudiWriteOpts). save(basePath)
  • 23. Code Examples: Hive Registration val hiveSyncConfig = new HiveSyncConfig() hiveSyncConfig.databaseName = databaseName hiveSyncConfig.tableName = tableName hiveSyncConfig.basePath = basePath hiveSyncConfig.partitionFields = java.util.Collections.singletonList("ts") val hiveConf = new HiveConf() val dfs = (new Path(basePath)).getFileSystem(new Configuration()) val hiveSyncTool = new HiveSyncTool(hiveSyncConfig, hiveConf, dfs) hiveSyncTool.syncHoodieTable() Not to be confused with cross-DC Hive sync. Can be called manually, or you can configure the Hudi write options to trigger it automatically.
  • 24. Code Examples: Snapshot Read val readDF = spark.sql("select uuid, driver, begin_lat, begin_lon from " + databaseName + "." + tableName) val readDF = spark.read.format("org.apache.hudi") .load(basePath) .select("uuid", "driver", "begin_lat", "begin_lon") readDF.show() +--------------------+----------+--------------------+--------------------+ | uuid| driver| begin_lat| begin_lon| +--------------------+----------+--------------------+--------------------+ |57d559d0-e375-475...|driver-284|0.014159831486388885| 0.42849372303000655| |fd51bc6e-1303-444...|driver-284| 0.1593867607188556|0.010872312870502165| |e8033c1e-a6e5-490...|driver-284| 0.2110206104048945| 0.2783086084578943| |d619e592-0b41-4c8...|driver-284| 0.08528650347654165| 0.4006983139989222| |799f7e50-27bc-4c9...|driver-284| 0.6570857443423376| 0.888493603696927| |c22ba7e5-68b5-4eb...|driver-284| 0.18294079059016366| 0.19949323322922063| |fbb80816-fe18-4e2...|driver-284| 0.7340133901254792| 0.5142184937933181| |3dfeb884-41fd-4ea...|driver-284| 0.4777395067707303| 0.3349917833248327| |034e0576-f59f-4e9...|driver-284| 0.7180196467760873| 0.13755354862499358| |e9c6e3b1-1ed4-43b...|driver-284| 0.16603428449020086| 0.6999655248704163| |18b39bef-9ebb-4b5...|driver-213| 0.1856488085068272| 0.9694586417848392| |653a4cb6-3c94-4ee...|driver-213| 0.11488393157088261| 0.6273212202489661| |11fbfce7-a10b-4d1...|driver-213| 0.21624150367601136| 0.14285051259466197| |0199a292-1702-47f...|driver-213| 0.4726905879569653| 0.46157858450465483| |5e1d80ce-e95b-4ef...|driver-213| 0.5731835407930634| 0.4923479652912024| |5d51b234-47ab-467...|driver-213| 0.651058505660742| 0.8192868687714224| |ff2e935b-a403-490...|driver-213| 0.0750588760043035| 0.03844104444445928| |bc644743-0667-48b...|driver-213| 0.6100070562136587| 0.8779402295427752| |026c7b79-3012-414...|driver-213| 0.8742041526408587| 0.7528268153249502| |9a06d89d-1921-4e2...|driver-213| 0.40613510977307| 0.5644092139040959| +--------------------+----------+--------------------+--------------------+ only showing top 20 rows Two ways of querying the same Hudi Dataset
  • 25. Code Examples: Incremental Read val newerThanTimestamp = "20200728232543" val readDF = spark.read.format("org.apache.hudi") .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(BEGIN_INSTANTTIME_OPT_KEY, newerThanTimestamp) .load(basePath) .filter(col("_hoodie_commit_time") > newerThanTimestamp) .select("uuid", "driver", "begin_lat", "begin_lon")
  • 26. Code Examples: Incremental Read val newerThanTimestamp = "20200728232543" val readDF = spark.read.format("org.apache.hudi") .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL) .option(BEGIN_INSTANTTIME_OPT_KEY, newerThanTimestamp) .load(basePath) .filter(col("_hoodie_commit_time") > newerThanTimestamp) .select("uuid", "driver", "begin_lat", "begin_lon") This is simply 2020/07/28 23:25:43
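The point-in-time (time-travel) reads mentioned earlier can be sketched with the same incremental options by also bounding the end instant; the END_INSTANTTIME_OPT_KEY constant comes from the same Spark datasource read options, and the timestamps below are illustrative:

  // Read only the records committed between two instants (a point-in-time window)
  val beginTimestamp = "20200728232543"
  val endTimestamp = "20200801000000"

  val windowDF = spark.read.format("org.apache.hudi")
    .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL)
    .option(BEGIN_INSTANTTIME_OPT_KEY, beginTimestamp)
    .option(END_INSTANTTIME_OPT_KEY, endTimestamp)
    .load(basePath)
    .select("uuid", "driver", "begin_lat", "begin_lon")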
  • 27. Supported Engines Spark Flink Hive Presto Impala Athena (AWS)
  • 29. 01 Table Types 02 Table Layout 03 Log File Format 04 Q&A Agenda
  • 30. ● Partitions are directories on disk ○ Date based partitions: 2021/01/01 2021/01/02 …. ● Data is written as records in data-files within partitions 2021/ 01/ 01/ fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210102020825.parquet ● Each record has a schema and should contain a partition path and a unique record key ● Each data-file is versioned and newer versions contain the latest data ● Supported data-file formats: Parquet, ORC (under development) Basics
  • 31. ● fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210102020825.parquet fileID (UUID) writeToken version (time of commit) file-format ● fd83af1d-1b18-45fe-9a8c-a19efd091994-0_49-36-47881_20210103102345.parquet (newer version as timestamp is greater) ● A record with a particular hoodie-key will exist in only one fileID. Basics
  • 32. Copy On Write (Read-optimized format) How are inserts processed? Inserts are partitioned and written to multiple new data-files. How are updates processed? Updates to existing records lead to a newer version of the data-file: all records are read from the latest version of the data-file, updates are applied in memory, and a new version of the data-file is written.
  • 33. Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Key4 …..……... ... Batch 1 (ts1) upsert Key1 C1 .. Key3 C2 .. Version at C2 (ts2) Version at C1 (ts1) Version at C1 (ts1) File 2 Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. File 1 Queries HUDI Copy On Write: Explained Batch 2 (ts2) Key3 ... .....……...
  • 34. Copy On Write: Benefits Latest Data The latest version of the data can always be read from the latest data-files Performance Native columnar file format gives read-optimized performance with no merge overhead Limited Updates Very performant for insert-only workloads with occasional updates.
  • 35. Copy On Write: Challenges Write Amplification Small batches lead to large reads and rewrites of parquet files Ingestion Latency Cannot ingest batches very frequently due to huge IO and compute overhead File sizes Cannot control file sizes very well; the larger the file, the more IO for a single record update.
  • 36. Merge On Read (Write-optimized format) How are inserts processed? Inserts are partitioned and written to multiple new data-files. How are updates processed? Updates to existing records are written to a “log-file” (similar to a WAL): updates are grouped into a LogBlock and the LogBlock is appended to the log-file. The log-file format is optimized to support appends (HDFS only), but also works with cloud stores by creating new log-file versions.
  • 37. upsert Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Key4 …..……... ... Batch 1 (ts1) Parquet + Log Files Key1 .....……... ... Key2 …..……... ... Key3 …..……... ... Batch 2 (ts2) K1 C2 ... ... K2 C2 ... Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. K3 C2 Read Optimized Queries HUDI Merge On Read: Explained Data file at C1 (ts1) (parquet) Data file at C1 (ts1) (parquet) Unmerged log file at ts2 Unmerged log file at ts2 Real Time Queries
  • 38. Merge On Read: Benefits Low Ingestion latency Writes are very fast Write Amplification Low write amplification as merge is over multiple ingestion batches Read vs Write Optimization Merge data-file and delta-file to create new version of data-file. “Compaction” operation creates new version of data-file, can be scheduled asynchronously in a separate pipeline without stopping Ingestion or Readers. New data-files automatically used after Compaction completes.
  • 39. Merge On Read: Challenges Freshness is impacted Freshness may be worse if the read uses only the Read Optimized View (only data files). Increased query cost If reading from data-files and delta-files together (due to merge overhead). This is called Real Time View. Compaction required to bound merge cost Need to create and monitor additional pipeline(s)
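The table type is chosen at write time; a minimal sketch of creating a Merge-On-Read table with the Spark datasource (reusing hudiWriteOpts from the earlier write example; the table-type option is the only addition, and COPY_ON_WRITE is the default):

  // Write a Merge-On-Read table instead of the default Copy-On-Write
  generatedDataDF.write.
    format("org.apache.hudi").
    options(hudiWriteOpts).
    option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
    mode("append").
    save(basePath)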
  • 40. ● Made up of multiple LogBlocks ● Each LogBlock is made up of: ○ A header with timestamp, schema and other details of the operation ○ Serialized records which are part of the operation ○ LogBlock can hold any format, typically AVRO or Parquet ● Log-File is also versioned ○ S3 and cloud stores do not allow appends ○ Versioning helps to assemble all updates Log File Format fileID (UUID) version (time of commit) file-format writeToken 3215eafe-72cb-4547-929a-0e982be3f45d-0_20210119233138.log.1_0-26-5305
  • 42. 01 Action Types 02 Hudi Metadata Table 03 Q&A Agenda
  • 43. No online component - all state is read and updated from HDFS State saved as “actions” files within a directory (.hoodie) .hoodie/ 20210122133804.commit 20210122140222.clean hoodie.properties 20210122140222.commit when-action-happened what-action-was-taken Sorted list of all actions is called “HUDI Timeline” Basics
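Because the timeline is just files under .hoodie, it can be inspected with a plain file listing; a small sketch using the Hadoop FileSystem API (nothing Hudi-specific is assumed here):

  import org.apache.hadoop.fs.{FileSystem, Path}

  val hoodieDir = new Path(basePath, ".hoodie")
  val fs = FileSystem.get(hoodieDir.toUri, spark.sparkContext.hadoopConfiguration)

  // Print the <timestamp>.<action> files that make up the HUDI Timeline
  fs.listStatus(hoodieDir).
    map(_.getPath.getName).
    filter(_.headOption.exists(_.isDigit)).  // skip hoodie.properties, sub-directories etc.
    sorted.
    foreach(println)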
  • 44. Action Types 20210102102345.commit COW Table: Insert or Updates MOR Table: data-files merged with delta-files 20210102102345.rollback Older commits rolled-back (data deleted) 20210102102345.delta-commit MOR Only: Insert or Updates 20210102102345.replace data-files clustered and re-written 20210102102345.clean Older versions of data-files and delta-files deleted 20210102102345.restore Restore dataset to a previous point in time
  • 45. 1. Mark the intention to perform an action a. Create the file .hoodie/20210102102345.commit.requested 2. Pre-processing and validations (e.g. what files to update / delete) 3. Mark the starting of action a. Create the file .hoodie/20210102102345.commit.inflight b. Add the action plan to the file so we can rollback changes due to failures 4. Perform the action as per plan 5. Mark the end of the action a. Create the file .hoodie/20210102102345.commit How is an action performed ?
  • 46. Challenges Before each operation HUDI needs to find the state of the dataset: list all action files from the .hoodie directory, read one or more of the action files, and list one or more partitions to get the list of the latest data-files and log-files. HUDI operations therefore lead to a large number of ListStatus calls to the NameNode, and ListStatus is slow and resource intensive for the NameNode.
  • 47. ● ListStatus data is cached in an internal table (Metadata Table) ● What is cached? ○ List of all partitions ○ List of files in each partition ○ Minimal required information on each file - file size ● Internal table is a HUDI MOR Table ○ Updated when any operation changes files (commit, clean, etc) ○ Updates written to log-files and compacted periodically ● Very fast lookups from the Metadata Table HUDI File Listing Enhancements (0.7 release)
  • 48. ● Reduced load on NameNode ● Reduced time for operations which list partitions ● Metadata Table is a HUDI MOR Table (.hoodie/metadata) ○ Can be queried like a regular HUDI Table ○ Helps in debugging issues Benefits
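The listing cache is opt-in in the 0.7 release; a hedged sketch of enabling it on both the write and read paths with the hoodie.metadata.enable flag:

  // Writer side: maintain the internal metadata table on every commit
  generatedDataDF.write.
    format("org.apache.hudi").
    options(hudiWriteOpts).
    option("hoodie.metadata.enable", "true").
    mode("append").
    save(basePath)

  // Reader side: use the metadata table for file listings instead of NameNode ListStatus calls
  val listingOptimizedDF = spark.read.format("org.apache.hudi").
    option("hoodie.metadata.enable", "true").
    load(basePath)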
  • 50. Recap Writing data Bulk_insert, insert, upsert, insert_overwrite Querying data Hive, Spark, Presto etc Copy-On-Write: Columnar Format Simple & ideal for analytics use-cases (limited updates) Merge-On-Read: Write ahead log Complex, but reduces write amplification with updates Provides 2 views : Read-Optimized, Realtime Timeline Metadata Track information about actions taken on table Incremental Processing Efficiently propagate changes across tables
  • 52. Concurrency Control MVCC Multi Version Concurrency Control File versioning Writes create a newer version, while concurrent readers access an older version. For simplicity, we will refer to hudi files as (fileId)-(timestamp) ● f1-t1, f1-t2 ● f2-t1, f2-t2 Lock Free Read and write transactions are isolated without any need for locking. Use timestamp to determine state of data to read. Data Lake Feature Guarantees Atomic multi-row commits Snapshot isolation Time travel
  • 53. How is index used ? Key1 ... Key2 ... Key3 ... Key4 ... upsert Tag Location Using Index And Timeline Key1 partition, f1 ... Key2 partition, f2 ... Key3 partition, f1 ... Key4 partition, f2 ... Batch at t2 with index metadata Key1, Key3 Key2, Key4 f1-t2 (data/log) f2-t2 (data/log) Key1 C1 .. Key3 C1 .. Key2 C1 .. Key4 C1 .. Batch at t2 f1-t1 f2-t1
  • 54. Indexing Scope Global index Enforce uniqueness of keys across all partitions of a table Maintain mapping for record_key to (partition, fileId) Update/delete cost grows with size of the table O(size of table) Local index Enforce this constraint only within a specific partition. Writer to provide the same consistent partition path for a given record key Maintain mapping (partition, record_key) -> (fileId) Update/delete cost O(number of records updated/deleted)
  • 55. Types of Indexes Bloom Index (default) Employs bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Ideal workload: Late arriving updates Simple Index Performs a lean join of the incoming update/delete records against keys extracted from the table on storage. Ideal workload: Random updates/deletes to a dimension table HBase Index Manages the index mapping in an external Apache HBase table. Ideal workload: Global index Custom Index Users can provide custom index implementation
  • 56. Indexing Configurations Property: hoodie.index.type Type of index to use. Default is Local Bloom filter (including dynamic bloom filters) Property: hoodie.index.class Full path of user-defined index class and must be a subclass of HoodieIndex class. It will take precedence over the hoodie.index.type configuration if specified Property: hoodie.bloom.index.parallelism Dynamically computed, but may need tuning for some cases for bloom index Property hoodie.simple.index.parallelism Tune parallelism for simple index
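A short sketch of how these knobs are passed to a write (the values shown are examples, not recommendations; the GLOBAL_* variants give the global-index behaviour described on the previous slide):

  generatedDataDF.write.
    format("org.apache.hudi").
    options(hudiWriteOpts).
    option("hoodie.index.type", "BLOOM").  // or SIMPLE, GLOBAL_BLOOM, GLOBAL_SIMPLE, HBASE
    option("hoodie.bloom.index.parallelism", "200").  // usually auto-computed; tune only if needed
    mode("append").
    save(basePath)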
  • 57. Indexing Limitations Indexing only works on the primary key today; work is in progress to make this available as a secondary index on other columns. Index information is only used by the writer; using it in the read path would improve query performance. Another planned improvement is to move the index info from parquet metadata into the metadata table.
  • 59. Storage Management Compaction Convert files on disk into read optimized files. Clustering Optimizing data layout, stitching small files Cleaning Remove Hudi data files that are no longer needed. Hudi Rewriter Pruning columns, encrypting columns and other rewriting use-cases Savepoint & Restore Bring table back to a correct/old state Archival Archive Hudi metadata files that are no longer being actively used.
  • 60. Table Service: Compaction Main motivations behind Merge-On-Read is to reduce data latency when ingesting records Data is stored using a combination of base files and log files Compaction is a process to produce new versions of base files by merging updates Compaction is performed in 2 steps Compaction Scheduling Pluggable Strategies for compaction This is done inline. In this step, Hudi scans the partitions and selects base and log files to be compacted. A compaction plan is finally written to Hudi timeline. Compaction Execution Inline - Perform compaction inline, right after ingestion Asynchronous - A separate process reads the compaction plan and performs compaction of file slices.
  • 61. K1 T3 .. K3 T3 .. Version at T3 K1 T4 ... Version of Log atT4 Real-time View Real-time View Real-time View Compaction Example Hudi Managed Dataset Version at T1 Key1 .....……... ... Key3 …..……... ... Batch 1 T1 Key1 .………... ... Key3 …..……... ... Batch 2 T2 upsert K1 T2 ... ... Unmerged update K1 T1 .. K3 T1 .. K3 T2 Version of Log at T2 Phantom File Schedule Compaction Commit Timeline Key1 . .…… T4 Batch 3 T3 Unmerged update done T2 Commit 2 done T4 Commit 4 done T3 Compact done T1 Commit 1 Read Optimized View Read Optimized View PARQUET T3 Compaction inflight T4 Commit 4 inflight HUDI
  • 62. Code Examples: Inline compaction df.write.format("org.apache.hudi"). option(PRECOMBINE_FIELD_OPT_KEY, "ts"). option(RECORDKEY_FIELD_OPT_KEY, "uuid"). option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). option(TABLE_NAME, tableName). option("hoodie.parquet.small.file.limit", "0"). option("hoodie.compact.inline", "true"). option("hoodie.compact.inline.max.delta.commits", "4"). option("hoodie.compaction.strategy", "org.apache.hudi.io.compact.strategy.LogFileSizeBasedCompactionStrategy"). mode(Append). save(basePath);
  • 63. Table Service: Clustering Ingestion and query engines are optimized for different things FileSize Ingestion prefers small files to improve freshness. Small files => increase in parallelism Query engines (and HDFS) perform poorly when there are lot of small files Data locality Ingestion typically groups data based on arrival time Queries perform better when data frequently queried together is co-located Clustering is a new framework introduced in hudi 0.7 Improve query performance without compromising on ingestion speed Run inline or in an async pipeline Pluggable strategy to rewrite data Provides two in-built strategies to 1) ‘stitch’ files and 2) ‘sort’ data on a list of columns Superset of Compaction. Follows MVCC like other hudi operations Provides snapshot isolation, time travel etc. Update index/metadata as needed Disadvantage: Incurs additional rewrite cost
  • 64. Clustering: efficiency gain Before clustering: 20M rows scanned After clustering: 100K rows scanned
  • 65. Code Examples: Inline clustering df.write.format("org.apache.hudi"). option(PRECOMBINE_FIELD_OPT_KEY, "ts"). option(RECORDKEY_FIELD_OPT_KEY, "uuid"). option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath"). option(TABLE_NAME, tableName). option("hoodie.parquet.small.file.limit", "0"). option("hoodie.clustering.inline", "true"). option("hoodie.clustering.inline.max.commits", "4"). option("hoodie.clustering.plan.strategy.target.file.max.bytes", "1073741824"). option("hoodie.clustering.plan.strategy.small.file.limit", "629145600"). option("hoodie.clustering.plan.strategy.sort.columns", ""). //optional, if sorting is needed mode(Append). save(basePath);
  • 66. Table Service: Cleaning Delete older data files that are no longer needed Different configurable policies supported. Cleaner runs inline after every commit. Criteria #1: Time to detect (TTD) data quality issues Provide sufficient time to detect data quality issues. Multiple versions of data are stored. Earlier versions can be used as a backup. The table can be rolled back to an earlier version as long as the cleaner has not deleted those files. Criteria #2: Long running queries Provide sufficient time for your long running jobs to finish running. Otherwise, the cleaner could delete a file that is being read by the job and will fail the job. Criteria #3: Incremental queries If you are using the incremental pull feature, then ensure you configure the cleaner to retain a sufficient number of recent commits to rewind.
  • 67. Cleaning Policies Partition structure f1_t1.parquet, f2_t1.parquet, f3_t1.parquet f1_t2.parquet, f2_t2.parquet, f4_t2.parquet f1_t3.parquet, f3_t3.parquet Keep N latest versions N=2, retain 2 versions for each file group At t3: Only f1_t1 can be removed Keep N latest commits N=2, retain all data for t2, t3 commits At t3: f1_t1, f2_t1 can be removed. f3_t1 cannot be removed
  • 68. Table Service: Archiving Delete older metadata State saved as “actions” files within a directory (.hoodie) .hoodie/20210122133804.commit .hoodie/20210122140222.clean .hoodie/hoodie.properties Over time, many small files are created Moves older metadata to commits.archived sequence file Easy Configurations Set “hoodie.keep.min.commits” and “hoodie.keep.max.commits” Incremental queries only work on ‘active’ timeline
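A sketch wiring the retention knobs of cleaning and archiving together (the cleaner should retain fewer commits than the archival thresholds keep on the active timeline; the numbers below are illustrative):

  // Cleaning: keep enough commits for long-running queries and incremental pulls to rewind
  // Archiving: bound the size of the active timeline under .hoodie
  generatedDataDF.write.
    format("org.apache.hudi").
    options(hudiWriteOpts).
    option("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS").
    option("hoodie.cleaner.commits.retained", "10").
    option("hoodie.keep.min.commits", "20").
    option("hoodie.keep.max.commits", "30").
    mode("append").
    save(basePath)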
  • 69. Table Service: Savepoints & Restore Some common questions in production systems What if a bug resulted in incorrect data pushed to the ingestion system ? What if an upstream system incorrectly marked column values as null ? Hudi addresses these concerns for you Ability to restore the table to the last known correct time Restore to well known state Logically “rollback” multiple commits. Savepoints - checkpoints at different instants of time Pro - optimizes number of versions needed to store and minimizes disk space Con - Not available for Merge_On_Read table types
  • 71. 01 Ingestion frameworks 02 Hudi CLI 03 < 5 mins ingestion latency 04 Onboarding existing tables to Hudi 05 Testing Infra 06 Observability 07 Q&A Agenda
  • 72. Hudi offers standalone utilities to connect with the data sources, to inspect the dataset and for registering a table with HMS. Ingestion framework Hudi Utilities Source DFS compatible stores (HDFS, AWS, GCP etc) Data Lake Ingest Data DeltaStreamer SparkDataSource Query Engines Register Table with HMS: HiveSyncTool Inspect table metadata: Hudi CLI Execution framework *source = {Kafka, CSV, DFS, Hive table, Hudi table etc} *Readers = {Hive, Presto, Spark SQL, Impala, AWS Athena}
  • 73. Input formats Input data could be available as a HDFS file, Kafka source or as an input stream. Run Exactly Once Performs one ingestion round which includes incrementally pulling events from upstream sources and ingesting them to hudi table. Continuous Mode Runs an infinite loop with each round performing one ingestion round as described in Run Once Mode. The frequency of data ingestion can be controlled by the configuration Record Types Support json, avro or a custom record type for the incoming data Checkpoint, rollback and recovery Automatically takes care of checkpointing of input data, rollback and recovery. Avro Schemas Leverage Avro schemas from DFS or a schema registry service. DeltaStreamer
  • 74. HoodieDeltaStreamer Example More info at https://hudi.apache.org/docs/writing_data.html#deltastreamer spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --props file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider --source-class org.apache.hudi.utilities.sources.AvroKafkaSource --source-ordering-field impressiontime --target-base-path file:///tmp/hudi-deltastreamer-op --target-table uber.impressions --op BULK_INSERT HoodieDeltaStreamer is used to ingest from a Kafka source into a Hudi table. Details on how to use the tool are available at the documentation link above.
  • 75. Spark Datasource API The hudi-spark module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. Structured Spark Streaming Hudi also supports Spark structured streaming to ingest data from a streaming source into a Hudi table. Flink Streaming Hudi added support for the Flink execution engine in the 0.7.0 release. Execution Engines inputDF.write() .format("org.apache.hudi") .options(clientOpts) // any of the Hudi client opts can be passed in as well .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp") .option(HoodieWriteConfig.TABLE_NAME, tableName) .mode(SaveMode.Append) .save(basePath); More info at https://hudi.apache.org/docs/writing_data.html#deltastreamer
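A hedged sketch of the structured-streaming path mentioned above; streamingDF, checkpointPath and the column names are placeholders, not part of the deck:

  import org.apache.spark.sql.streaming.Trigger

  // Continuously upsert a streaming DataFrame into a Hudi table
  val query = streamingDF.writeStream.
    format("org.apache.hudi").
    option("hoodie.table.name", tableName).
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "partition").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("checkpointLocation", checkpointPath).  // Spark streaming checkpoint, required for recovery
    outputMode("append").
    trigger(Trigger.ProcessingTime("60 seconds")).
    start(basePath)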
  • 76. Hudi CLI Create table Connect with table Inspect commit metadata File System View Inspect Archived Commits Clean, Rollback commits More info at https://hudi.apache.org/docs/deployment.html
  • 77. Hive Registration Tools The Hive sync tool enables syncing of the table’s latest schema and updated partitions to the Hive metastore. cd hudi-hive ./run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName> Hive Ecosystem Hive Meta Store (HMS) HiveSyncTool registers the Hudi table and updates on schema, partition changes Query Planner Query Executor HoodieInputFormat exposes the Hudi datafiles present on the DFS. More info at https://hudi.apache.org/docs/writing_data.html#syncing-to-hive Hudi Dataset Presto Spark SQL Spark Data Source Hudi HoodieInputFormat is integrated with Datasource API, without any dependency on HMS Query Engines
  • 78. Write Amplification COW tables receiving many updates have a large write amplification, as the files are rewritten as new snapshots, even if a single record in the data file were to change. Amortized rewrites MOR reduces this write amplification by writing the updates to a log file and periodically merging the log files with base data files, thus amortizing the cost of rewriting the data file to the time of compaction. Read Optimized vs real time view Data freshness experienced by the reader is affected by whether the read requests are served from compacted base files or by merging the base files with log files in real time, just before serving the reads. Small vs Large Data files Creating smaller data files (< 50MB) can be done in under a few minutes. However, creating lots of small files would put pressure on the NameNode during the HDFS listing (and other metadata) operations. Creating larger data files (1GB) takes longer to write to disk (10+ mins). However, maintaining larger files reduces the NameNode pressure. Achieving ingestion latency of < 5 mins With clustering and compaction
  • 79. Achieving ingestion latency of < 5 mins Managing write amplification with Clustering INSERTS UPDATES DELETES Ingestion Commit C10 Partition P1 F5_W1_C5.parquet [F1_C1, F2_C2, F3_C2, F4_C5] Partition P2 F12_W1_C5.parquet [F10_C1, F11_C3 ...] Commit C10 Commit C9 Commit C8 Commit C7 Commit C6 Commit C5 Commit C4 Commit C3 Commit C2 Commit C1 Commit C0 Background clustering process periodically rewrites the small base files created by ingestion process into larger base files, amortizing the cost to reduce pressure on the nameNode. Clustered large 1GB files Clustering/ compaction commit Ingestion process writes to Small < 50MB base files. Small base files help in managing the write amplification and the latency. Query on real-time table at commit C10 Contents of: 1. All base files are available to the readers Freshness is updated at every ingestion commit. F6_W1_C6.parquet F6_W1_C6.parquet F11_W1_C10.parquet F6_W1_C6.parquet F13_W1_C7.parquet
  • 80. INSERTS UPDATES DELETES Ingestion Commit C10 Partition P1 F1_W1_C5.parquet F1_W1_C10.log Partition P2 F2_W1_C2.parquet Commit C10 Commit C9 Commit C8 Commit C7 Commit C6 Commit C5 Commit C4 Commit C3 Commit C2 Commit C1 Commit C0 F1_W1_C7.log F2_W1_C6.log Columnar basefile Compaction commit Row based append log Updates and deletes are written to a row-based append log by the ingestion process. Later, the async compaction process merges the log files into the base file. Query on read optimized table at commit C10 Query on real time table at commit C10 Contents of: 1. Base file F1_W1_C5.parquet 2. Base file F2_W1_C2.parquet Contents of: 1. Base file F1_W1_C5.parquet is merged with append log files F1_W1_C7.log and F1_W1_C10.log. 2. Base file F2_W1_C2.parquet is merged with append log file F2_W1_C6.log. Timeline Achieving ingestion latency of < 5 mins Managing write amplification with merge-on-read
  • 81. Legacy data When legacy data is available in parquet format and the table needs to be converted to a Hudi table, all the parquet files are to be rewritten to Hudi data files. Fast Migration Process With Hudi Fast Migration, Hudi will keep the legacy data files (in parquet format) and generate a skeleton file containing Hudi specific metadata, with a special “BOOTSTRAP_TIMESTAMP”. Querying legacy partitions When executing a query involving legacy partitions, Hudi will return the legacy data file to the query engines. (Query engines can handle serving the query using non-Hudi regular parquet/data files). Onboarding your table to Hudi val bootstrapDF = spark.emptyDataFrame bootstrapDF.write .format("hudi") .option(HoodieWriteConfig.TABLE_NAME, "hoodie_test") .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "_row_key") .option(..).. .mode(SaveMode.Overwrite) .save(basePath)
  • 82. Hudi unit-testing framework Hudi offers a unit testing framework that allows developers to write unit tests that mimic real-world scenarios and run these tests every time the code is recompiled. This enables increased developer velocity and more robust code changes. Hudi-test-suite Hudi-test-suite makes use of the hudi utilities to create an end-to-end testing framework to simulate complex workloads, schema evolution scenarios and version compatibility tests. Hudi A/B testing Hudi offers A/B style testing to ensure that data produced with a new change/build matches the exact same workload in production. With “Hudi union file-system”, a production Hudi table can be used as a read-only reference file system. Any commit from the production hudi-timeline can be replayed, using a different Hudi build, to compare the results of the “replay run” against the “production run”.
  • 83. Hudi Test Suite Build Complex Workloads Define a complex workload that reflects production setups. Test Version Compatibility Pause the workload, upgrade dependency version, then resume the workload. Cassandra DBEvents MySql DBEvents Schemaless DBEvents User Application Heat pipe Unified Ingestion pipeline source /sink specific DAGs HDFS Hive/Presto/ Spark SQL Evolve Workloads Simulate changing elements such as schema changes. Simulate Ingestion Mock Data Generator Launch Queries
  • 84. Production workload as read-only file system Hudi A/B testing INSERTS UPDATES DELETES Ingestion Commit C10 Partition P1 Partition P2 Commit C10 Commit C9 Commit C8 Commit C7 Commit C6 Commit C5 Commit C4 Commit C3 Commit C2 Commit C1 Commit C0 F6_W1_C6.parquet F6_W1_C6.parquet F11_W1_C10.parquet F6_W1_C6.parquet F13_W1_C7.parquet F6_W1_C6.parquet F5_W1_C5.parquet F6_W1_C6.parquet F12_W1_C5.parquet Write enabled test file system Partner write enabled Partition P1 F11_W1_C10.parquet Commit C10 Ensure commit produced by the test matches original commit metadata Ensure data files produced by the “commit replay” test matches with the original base/log data files in production.
  • 85. Hudi Observability Insights on a specific ingestion run Collect key insights around storage efficiency, ingestion performance and surface bottlenecks at various stages. These insights can be used to automate fine-tuning of ingestion jobs by the feedback based tuning jobs. Identifying outliers At large scale, across thousands of tables, when a bad node/executor is involved, identifying the bad actor takes time, requires coordination across teams and involves lots of our production on-call resources. By reporting normalized stats, that are independent of the job size or workload characteristics, bad executor/nodes can be surfaced as outliers that warrant a closer inspection. Insights on Parallelism When managing thousands of Hudi tables in the data-lake, ability to visualize the parallelism applied at each stage of the job, would enable insights into the bottlenecks and allow the job to be fine-tuned at granular level.
  • 87. ➔ Concurrent Writers [RFC-22] & [PR-2374] ◆ Multiple Writers to Hudi tables with file level concurrency control ➔ Hudi Observability [RFC-23] ◆ Collect metrics such as Physical vs Logical, Users, Stage Skews ◆ Use to feedback jobs for auto-tuning ➔ Point index [RFC-08] ◆ Target usage for primary key indexes, eg. B+ Tree ➔ ORC support [RFC] ◆ Support for ORC file format ➔ Range Index [RFC-15] ◆ Target usage for column ranges and pruning files/row groups (secondary/column indexes) ➔ Enhance Hudi on Flink [RFC-24] ◆ Full feature support for Hudi on Flink version 1.11+ ◆ First class support for Flink ➔ Spark-SQL extensions [RFC-25] ◆ DML/DDL operations such as create, insert, merge etc ◆ Spark DatasourceV2 (Spark 3+) On-Going Work
  • 88. ➔ Native Schema Evolution ◆ Support remove and rename columns ➔ Apache Calcite SQL integration ◆ DML/DDL support for other engines besides Spark Future Work (Upcoming RFCs)