1© Cloudera, Inc. All rights reserved.
Apache Kudu Webinar Series
Technical Deep Dive
David Alves | Apache Kudu PMC | Cloudera
Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can
dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and
Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will
discuss how to make it a reality.
Part 4: Technical Deep-Dive into Apache Kudu
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu
PMC Member.
Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic
applications
• Designed for next-generation hardware for faster analytic
performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use
case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
[Diagram: Kudu within the platform stack]
INTEGRATE: Structured (Sqoop), Unstructured (Kafka, Flume)
PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr), Other (Kite)
UNIFIED SERVICES: Resource Management (YARN), Security (Sentry, RecordService)
STORE: NoSQL (HBase), Other (Object Store), Filesystem (HDFS), Relational (Kudu)
Filling the Analytic Gap
Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS. Kudu fills the gap.
• HDFS: Fast Scans, Analytics and Processing of Stored Data
• HBase: Fast On-Line Updates & Data Serving
• Kudu: Fast Analytics (on fast-changing or frequently-updated data)
• Arbitrary Storage (Active Archive)
[Figure: the Analytic Gap between HDFS and HBase. Axes: Pace of Analysis, Pace of Data. Labels: Unchanging, Append-Only, Fast Changing, Frequent Updates, Real-Time]

Better Together
Kudu Benefits from Integration with the Apache Ecosystem
Spark – Stream Processing for Kudu
• Open standard for real-time stream processing
• Effective for automating decision processes and machine
learning
• Use Cases include: Time Series Data & Machine Data
Analytics
Impala – High-Performance BI & SQL for Kudu
• Open standard for interactive SQL queries
• Powers analytic database workloads with flexibility, scale, and
open architecture
• Use Cases include: Time Series Data & Online Reporting
Apache Kudu Community
Why Kudu?
Use Cases and Motivation
Why Kudu?
A simultaneous combination of sequential and random reads and writes
Time Series Data
Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
Machine Data Analytics
Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
Online Reporting
How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?

Next generation hardware
Solid-state Storage
Cheaper and faster every year. Persistent memory (3D XPoint™). Kudu can take advantage of SSD and NVM using Intel’s NVM Library.
Cheaper, Bigger Memory
RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues.
Efficiency on Modern CPUs
Modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures.
Apache Kudu: Scalable and fast tabular storage
Scalable
• Tested up to 275 nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Tabular
• Represents data in structured tables like a relational database
• Individual record-level access to 100+ billion row tables
Deep Dive:
Replication And Fault Tolerance
Metadata
• Replicated master
• Acts as a tablet directory
• Acts as a catalog (which tables exist, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated
tablets)
• Caches all metadata in RAM for high performance
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them

Client -> Master: Hey Master! Where is the row for ‘tlipcon’ in table “T”?
Master -> Client: It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …
Client -> tablet server: UPDATE tlipcon SET col=foo
Client Meta Cache:
T1: …
T2: …
T3: …
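The lookup-and-cache exchange above can be sketched roughly as follows. This is a minimal illustration, not Kudu's client API: the `Master` and `Client` classes, the `get_table_locations` method, and the key ranges assigned to each tablet are all hypothetical.

```python
from bisect import bisect_right


class Master:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, tablets):
        # Each tablet is (start_key, tablet_id, servers); start keys are unique.
        self.tablets = sorted(tablets)
        self.lookups = 0

    def get_table_locations(self, table):
        # Return every tablet of the table, i.e. the tablet asked for plus
        # "info on other tablets you might care about".
        self.lookups += 1
        return list(self.tablets)


class Client:
    """Hypothetical client that caches tablet locations (the Meta Cache)."""

    def __init__(self, master):
        self.master = master
        self.meta_cache = {}

    def find_tablet(self, table, key):
        if table not in self.meta_cache:
            # Cache miss: ask the master once, then reuse the answer.
            self.meta_cache[table] = self.master.get_table_locations(table)
        tablets = self.meta_cache[table]
        starts = [start for start, _, _ in tablets]
        # The owning tablet is the last one whose start key is <= key.
        return tablets[bisect_right(starts, key) - 1]


master = Master([
    ("", "T1", ["A", "B", "C"]),
    ("m", "T2", ["Z", "Y", "X"]),
    ("u", "T3", ["B", "C", "D"]),
])
client = Client(master)
first = client.find_tablet("T", "tlipcon")   # goes to the master
second = client.find_tablet("T", "todd")     # served from the meta cache
```

After the first lookup, later writes go straight to the tablet servers without contacting the master again.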
Raft consensus
[Diagram: Client; TS A holds Tablet 1 (LEADER); TS B and TS C hold Tablet 1 (FOLLOWER); each replica has its own WAL]
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
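As a rough illustration of the majority rule in the steps above, here is a simplified single-process sketch. It assumes a hypothetical `Replica` class with an in-memory WAL and contacts followers sequentially, whereas a real Raft leader sends RPCs in parallel and tracks terms and log indexes.

```python
class Replica:
    """Hypothetical tablet replica with an in-memory write-ahead log."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.wal = []

    def append(self, entry):
        """Append an entry to the local WAL; return True on success."""
        if not self.alive:
            return False
        self.wal.append(entry)
        return True


def replicate_write(leader, followers, entry):
    """Return True once a majority of replicas have logged the entry."""
    acks = 1 if leader.append(entry) else 0        # step 2b
    for follower in followers:                     # steps 2a, 3 and 4
        if follower.append(entry):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority                        # steps 5 and 6


one_follower_down = replicate_write(
    Replica("A"), [Replica("B"), Replica("C", alive=False)], "UPDATE"
)
two_followers_down = replicate_write(
    Replica("A"), [Replica("B", alive=False), Replica("C", alive=False)], "UPDATE"
)
```

With three replicas, the write still succeeds when one follower is down, but not when both are.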
Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear regular heartbeats from their leader (every 500 ms by default)
• 3 missed heartbeats (1.5 seconds): leader election!
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures, rounded down (3 replicas tolerate 1 failure, 5 replicas tolerate 2)
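The replica arithmetic can be checked with a short sketch; the function names are illustrative only.

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def tolerated_failures(n_replicas):
    """Replicas that can fail while a majority stays available."""
    return (n_replicas - 1) // 2


# replica count -> (majority size, tolerated failures)
table = {n: (majority(n), tolerated_failures(n)) for n in (3, 4, 5, 7)}
```

Note that 4 replicas tolerate no more failures than 3, which is why odd replication factors are the usual choice.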
Fault tolerance (2)
• Permanent failure (Tablet Copy):
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica (in PRE_VOTER state)
• Leader copies the data over to the new one, which joins as a new FOLLOWER
• New replica is assigned VOTER state
• Cluster change_config lets you add/remove tablet servers to a tablet’s configuration (only supported from CLI tool)

Deep Dive:
Columnar Storage
Columnar Storage
• Improved Scan performance
• Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading
unnecessary data from other columns
• Efficient encodings can dramatically improve compression ratios, which reduces effective IO load
• Typed, homogeneous data plays well to modern processor strengths (vectorization, pipelining)
• At the cost of random access performance
• single row access requires a number of seeks proportional to the number of columns
• BUT, random access is becoming cheaper (Cheap RAM, SSDs, NVRAM)
Row Storage
tweet_id, user_name, created_at, text
{23059873, newsycbot, 1442865158, Visual exp…}
{22309487, RideImpala, 1442828307, Introducing …}
…
Scans have to read all the data, no encodings
Columnar storage
tweet_id: {25059873, 22309487, 23059861, 23010982}
user_name: {newsycbot, RideImpala, fastly, llvmorg}
created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}

SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
Only read 1 column
Column sizes: tweet_id 1GB, user_name 2GB, created_at 1GB, text 200GB
…AND door open to big IO gains with compression and encoding
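A small sketch of why the query above only touches one column. The in-memory layout is illustrative rather than Kudu's on-disk format, and the per-column sizes assume the slide's four figures (1GB, 2GB, 1GB, 200GB) map to the columns in order.

```python
# Columnar layout: one list per column, values from the slide.
columns = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
    "text": ["Visual exp...", "Introducing ..", "Missing July...", "LLVM 3.7..."],
}

# SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
# Only the user_name column is read.
count = sum(1 for user in columns["user_name"] if user == "newsycbot")

# Assumed per-column sizes in GB, taken from the slide in column order.
column_sizes_gb = {"tweet_id": 1, "user_name": 2, "created_at": 1, "text": 200}
row_scan_gb = sum(column_sizes_gb.values())       # a row store reads everything
columnar_scan_gb = column_sizes_gb["user_name"]   # a column store reads one column
```

Under these assumptions the row-oriented scan reads 204GB and the columnar scan reads 2GB, before any encoding or compression.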
Columnar Storage – Other Encodings/Compression
Available Encodings:
• Dictionary (Strings, Binary)
• Bitshuffle (Numeric)
• RLE (Numeric, Bool)
Available Compression:
• Snappy
• LZ4
• ZLib
Kudu 1.3 ships with “good” defaults for most cases
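To make two of the listed encodings concrete, here is a conceptual sketch of dictionary and run-length encoding. It shows the idea only and does not reproduce Kudu's block formats; the function names and sample values are illustrative.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        # First occurrence gets the next free code; repeats reuse it.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


words, codes = dictionary_encode(["newsycbot", "RideImpala", "newsycbot", "newsycbot"])
runs = rle_encode([True, True, True, False, False])
```

Dictionary encoding pays off on string columns with few distinct values; RLE pays off on sorted or slowly changing columns.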
Deep Dive:
Write and Read Paths
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)

LSM Insert Path
INSERT -> MemStore:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
flush -> HFile 1:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
LSM Insert Path
INSERT -> MemStore:
Row=r2 col=c1 val=“blah2”
Row=r2 col=c2 val=“2”
flush -> HFile 2:
Row=r2 col=c1 val=“blah2”
Row=r2 col=c2 val=“2”
HFile 1:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
LSM Update path
UPDATE -> MemStore:
Row=r2 col=c1 val=“newval”
HFile 1:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2:
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
LSM Read path
MemStore:
Row=r2 col=c1 val=“newval”
HFile 1:
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2:
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Merge based on string row keys:
R1: c1=blah c2=2
R2: c1=newval c2=5
….
• CPU intensive!
• Must always read rowkeys
• Any given row may exist across multiple HFiles: must always merge!
• The more HFiles to merge, the slower it reads
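The merge-on-read behaviour can be sketched as below, using the cell values from the slide. This is a simplification: it assumes each store is a plain dict keyed by (row, column) and that newer stores simply overwrite older ones, whereas real LSM engines stream a k-way merge over sorted files and resolve versions by timestamp.

```python
def lsm_read(memstore, hfiles):
    """Merge all stores on the fly; hfiles are ordered oldest to newest."""
    merged = {}
    for store in hfiles + [memstore]:      # newest store is applied last
        merged.update(store)
    rows = {}
    for (row, col), val in sorted(merged.items()):
        rows.setdefault(row, {})[col] = val
    return rows


hfile_1 = {("r1", "c1"): "blah", ("r1", "c2"): "2"}
hfile_2 = {("r2", "c1"): "v2", ("r2", "c2"): "5"}
memstore = {("r2", "c1"): "newval"}

result = lsm_read(memstore, [hfile_1, hfile_2])
```

Every read has to consult every store and compare row keys, which is the cost the slide calls out.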

Kudu storage – Inserts and Flushes
INSERT(“todd”, “$1000”, “engineer”) -> MemRowSet
flush -> DiskRowSet 1 (columns: name, pay, role)
Kudu storage – Inserts and Flushes
INSERT(“doug”, “$1B”, “Hadoop man”) -> MemRowSet
flush -> DiskRowSet 2 (columns: name, pay, role)
DiskRowSet 1 (columns: name, pay, role)
Kudu storage - Updates
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2 (columns: name, pay, role), each with base data and a Delta MS]
Each DiskRowSet has its own DeltaMemStore to accumulate updates
Kudu storage - Updates
UPDATE set pay=“$1M” WHERE name=“todd”
• Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no!
• Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe!
• Search key column to find offset: rowid = 150
• Delta MS of DiskRowSet 1 records 150: col 1=$1M
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2 (columns: name, pay, role), each with base data and a Delta MS]
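The update path above can be sketched as follows. Everything here is a toy model under stated assumptions: the `BloomFilter` and `DiskRowSet` classes, their sizes, and the sample keys are hypothetical, and the delta store is a plain dict keyed by row offset.

```python
import hashlib
from bisect import bisect_left


class BloomFilter:
    """Tiny Bloom filter: may answer 'maybe' wrongly, never 'no' wrongly."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class DiskRowSet:
    """Hypothetical rowset: sorted key column, Bloom filter, delta store."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.bloom = BloomFilter()
        for key in self.keys:
            self.bloom.add(key)
        self.delta_memstore = {}          # rowid -> {column: new value}

    def find_rowid(self, key):
        if not self.bloom.might_contain(key):
            return None                   # "Bloom says: no!"
        i = bisect_left(self.keys, key)   # "maybe": search the key column
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


def update(rowsets, key, column, value):
    """Record the update in the delta store of the rowset holding the key."""
    for rowset in rowsets:
        rowid = rowset.find_rowid(key)
        if rowid is not None:
            rowset.delta_memstore.setdefault(rowid, {})[column] = value
            return rowset, rowid
    return None, None


drs1 = DiskRowSet(["alice", "todd"])
drs2 = DiskRowSet(["doug"])
target, rowid = update([drs2, drs1], "todd", "pay", "$1M")
```

The base data is never rewritten by the update; only the delta store of the owning rowset changes.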

Kudu storage – Read path
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2 (columns: name, pay, role), each with base data and a Delta MS; the Delta MS of DiskRowSet 1 holds 150: pay=$1M]
Read rows in DiskRowSet 2
Then, read rows in DiskRowSet 1
• Any row is only in exactly one DiskRowSet: no need to merge cross-DRS!
• Updates are merged based on ordinal offset within DRS: array indexing, no string compares
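A minimal sketch of this read path, assuming base data is held as one list per column and deltas are keyed by row offset. The function names and sample rows are hypothetical.

```python
def scan_rowset(base_columns, delta_memstore):
    """Return the rows of one rowset with pending deltas applied."""
    n_rows = len(next(iter(base_columns.values())))
    columns = {name: list(values) for name, values in base_columns.items()}
    for rowid, changes in delta_memstore.items():
        for col, val in changes.items():
            columns[col][rowid] = val     # array indexing, no key compares
    names = list(columns)
    return [tuple(columns[name][i] for name in names) for i in range(n_rows)]


def scan_tablet(rowsets):
    """Rowsets hold disjoint rows, so their scans are simply concatenated."""
    return [row for base, deltas in rowsets for row in scan_rowset(base, deltas)]


drs2 = ({"name": ["doug"], "pay": ["$1B"], "role": ["Hadoop man"]}, {})
drs1 = (
    {"name": ["alice", "todd"], "pay": ["$500", "$1000"], "role": ["analyst", "engineer"]},
    {1: {"pay": "$1M"}},
)
rows = scan_tablet([drs2, drs1])
```

Compare this with the LSM read path: there is no cross-store merge on row keys, only a positional patch within each rowset.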
Kudu storage – Delta flushes
[Diagram: MemRowSet; DiskRowSet 1 and DiskRowSet 2 (columns: name, pay, role), each with base data and a Delta MS]
Flush: the Delta MS entry 0: pay=foo is written to a REDO DeltaFile
A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version
Kudu storage – Major delta compaction
[Diagram: DiskRowSet (pre-compaction) with base data (name, pay, role), a Delta MS and three REDO DeltaFiles; DiskRowSet (post-compaction) with base data (name, pay, role), a Delta MS, unmerged REDO deltas and UNDO deltas]
Many deltas accumulate: lots of delta application work on reads
Merge updates for columns with high update percentage
If a column has few updates, it doesn't need to be re-written: those deltas are maintained in a new DeltaFile
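One way to picture the major delta compaction described above is the sketch below. It is an illustration under stated assumptions, not Kudu's algorithm: the function name `major_delta_compact`, the `(offset, column, value)` delta tuples, and the 0.5 update-fraction threshold are all hypothetical. Columns whose share of updated rows reaches the threshold have their REDO deltas folded into the base data (recording UNDO deltas that hold the previous values); other columns keep their REDO deltas unmerged.

```python
from typing import Any, Dict, List, Set, Tuple

Delta = Tuple[int, str, Any]  # hypothetical: (row offset, column, value)


def major_delta_compact(base: Dict[str, List[Any]],
                        redo: List[Delta],
                        threshold: float = 0.5):
    """Merge REDO deltas for heavily updated columns; leave the rest unmerged."""
    num_rows = len(next(iter(base.values())))

    # Count distinct updated rows per column.
    updated: Dict[str, Set[int]] = {}
    for offset, col, _ in redo:
        updated.setdefault(col, set()).add(offset)
    hot = {c for c, offs in updated.items() if len(offs) / num_rows >= threshold}

    new_base = {col: list(vals) for col, vals in base.items()}
    unmerged: List[Delta] = []
    undo: List[Delta] = []
    for offset, col, value in redo:       # deltas are assumed to be in commit order
        if col in hot:
            undo.append((offset, col, new_base[col][offset]))  # remember old value
            new_base[col][offset] = value
        else:
            unmerged.append((offset, col, value))
    return new_base, unmerged, undo


# Illustrative data only.
base = {"name": ["a", "b", "c", "d"], "pay": [1, 2, 3, 4], "role": ["x", "x", "x", "x"]}
redo = [(0, "pay", 10), (1, "pay", 20), (2, "pay", 30), (0, "role", "y")]
new_base, unmerged, undo = major_delta_compact(base, redo)
```

In this example "pay" is rewritten because three of four rows changed, while the single "role" update stays as a delta, so the "role" column is not rewritten.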
36© Cloudera, Inc. All rights reserved.
Kudu storage – RowSet Compactions
Before compaction:
DRS 1 (32MB): [PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB): [PK=bob], [PK=jon], [PK=mary], [PK=zeke]
DRS 3 (32MB): [PK=carl], [PK=julie], [PK=omar], [PK=zoe]
Writes for "chris" have to perform bloom lookups on all 3 RS
After compaction:
DRS 4 (32MB): [alice, bob, carl, joe]
DRS 5 (32MB): [jon, julie, linda, mary]
DRS 6 (32MB): [omar, zach, zeke, zoe]
Reorganize rows to avoid rowsets with overlapping key ranges
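The effect of this compaction can be reproduced with a small sketch. It uses the keys from the slide, but the helper names `rowsets_to_probe` and `compact` are assumptions, and a key-range check stands in for the decision of which rowsets need a bloom filter lookup.

```python
import heapq
from typing import List


def rowsets_to_probe(rowsets: List[List[str]], key: str) -> int:
    """Count rowsets whose [min, max] key range could contain the key."""
    return sum(1 for rs in rowsets if rs[0] <= key <= rs[-1])


def compact(rowsets: List[List[str]], rows_per_rowset: int) -> List[List[str]]:
    """Merge sorted rowsets and re-split them into non-overlapping key ranges."""
    merged = list(heapq.merge(*rowsets))
    return [merged[i:i + rows_per_rowset]
            for i in range(0, len(merged), rows_per_rowset)]


before = [
    ["alice", "joe", "linda", "zach"],
    ["bob", "jon", "mary", "zeke"],
    ["carl", "julie", "omar", "zoe"],
]
after = compact(before, rows_per_rowset=4)

probes_before = rowsets_to_probe(before, "chris")  # 3: every range overlaps "chris"
probes_after = rowsets_to_probe(after, "chris")    # 1: only [alice .. joe]
```

After compaction, a write for "chris" only has to consult the one rowset whose key range covers it.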

37© Cloudera, Inc. All rights reserved.
Kudu Storage - Compactions
• Main Idea: Always be compacting!
• Compactions run continuously to prevent IO storms
• "Budgeted" RS compactions: what is the best way to spend X MB of I/O?
• Physical/Logical decoupling: different replicas run compactions at different times
38© Cloudera, Inc. All rights reserved.
Deep Dive
Partitioning
39© Cloudera, Inc. All rights reserved.
Partitioning
• Kudu has flexible policies for distributing data among partitions
• Hash partitioning is built in, and can be combined with range partitioning
• Keys are ordered within a partition
• Key order matters
• Partitioning key order dictates distribution
• Primary key order affects how much data is read for scans
40© Cloudera, Inc. All rights reserved.
Primary Key selection
• Example - Time series data:
• "time" - timestamp
• "series" - {region, server, metric}
Primary key (series, time):
(us-east.appserver01.loadavg, 2016-05-09T15:14:00Z)
(us-east.appserver01.loadavg, 2016-05-09T15:15:00Z)
(us-west.dbserver03.rss, 2016-05-09T15:14:30Z)
(us-west.dbserver03.rss, 2016-05-09T15:14:30Z)
Primary key (time, series):
(2016-05-09T15:14:00Z, us-east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
(2016-05-09T15:15:00Z, us-east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
SELECT * WHERE series = 'us-east.appserver01.loadavg';
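To see why primary key order affects how much data a scan reads, consider the sketch below. It models a sorted primary key index as a sorted Python list; the function names and the sample timestamps are assumptions for illustration, not Kudu APIs. With (series, time) the matching rows are contiguous and can be located with a binary search; with (time, series) the same predicate has to inspect every row.

```python
from bisect import bisect_left
from typing import List, Tuple

# Illustrative rows as (series, time).
rows = [
    ("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z"),
    ("us-east.appserver01.loadavg", "2016-05-09T15:15:00Z"),
    ("us-west.dbserver03.rss", "2016-05-09T15:14:30Z"),
    ("us-west.dbserver03.rss", "2016-05-09T15:15:30Z"),
]

by_series_time = sorted(rows)                      # primary key (series, time)
by_time_series = sorted((t, s) for s, t in rows)   # primary key (time, series)


def scan_series_time(index: List[Tuple[str, str]], series: str):
    """Seek to the first row of the series, then read until the series changes."""
    start = bisect_left(index, (series,))  # (series,) sorts before (series, any_time)
    end = start
    while end < len(index) and index[end][0] == series:
        end += 1
    return index[start:end]


def scan_time_series(index: List[Tuple[str, str]], series: str):
    """Series is not a key prefix here, so every row has to be checked."""
    return [(t, s) for t, s in index if s == series]


target = "us-east.appserver01.loadavg"
fast = scan_series_time(by_series_time, target)
slow = scan_time_series(by_time_series, target)
```

Both scans return the same two rows, but only the first can skip the rows belonging to other series.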

41© Cloudera, Inc. All rights reserved.
Partitioning — By Time Range (inserts)
All inserts go to the latest partition
Big scans (across large time intervals) can be parallelized across many partitions
42© Cloudera, Inc. All rights reserved.
Partitioning — By Series Range
Inserts are spread among all partitions
Scans are over a single partition
43© Cloudera, Inc. All rights reserved.
Partitioning — By Series Range
Partitions can become unbalanced,
resulting in hot spotting
44© Cloudera, Inc. All rights reserved.
Partitioning — By Series Hash (inserts)
Inserts are spread among all partitions
Scans are over a single partition

45© Cloudera, Inc. All rights reserved.
Partitioning — By Series Hash
Partitions grow over time, eventually
becoming too big for a single server
46© Cloudera, Inc. All rights reserved.
Partitioning — By Series Hash + Time Range
Inserts are spread among all partitions
in the latest time range
Big scans (across large time intervals)
can be parallelized across partitions
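The combined scheme can be sketched as a function from a row to a partition. Everything here is an assumption chosen for illustration: the bucket count, the one-range-per-day split, and the use of CRC32 as the hash (Kudu uses its own hash function). The point is only that the hash of the series picks the bucket and the timestamp picks the range.

```python
import zlib
from typing import Tuple

NUM_BUCKETS = 4  # hypothetical number of hash buckets


def hash_bucket(series: str) -> int:
    """Deterministic stand-in hash; not the hash function Kudu uses."""
    return zlib.crc32(series.encode("utf-8")) % NUM_BUCKETS


def time_range(timestamp: str) -> str:
    """Hypothetical range split: one range partition per day of an ISO 8601 timestamp."""
    return timestamp[:10]


def partition_for(series: str, timestamp: str) -> Tuple[int, str]:
    """A partition is identified by (hash bucket of series, time range)."""
    return hash_bucket(series), time_range(timestamp)


p1 = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z")
p2 = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:15:00Z")
p3 = partition_for("us-east.appserver01.loadavg", "2016-05-10T00:00:00Z")
```

Rows of one series on the same day share a partition, so a narrow scan stays local. A new day opens a fresh set of partitions, so individual partitions stop growing, and a scan over many days can fan out across them.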
47© Cloudera, Inc. All rights reserved.
Next Steps
Get Started with Kudu & Cloudera
• www.cloudera.com/downloads
• https://blog.cloudera.com/?s=kudu
Start Contributing to Kudu
• http://kudu.apache.org/
48© Cloudera, Inc. All rights reserved.
Thank you
David Alves – Apache Kudu PMC
@dribeiroalves

Recommended for you

Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud

Cloud environments are increasingly becoming a popular deployment option for Hadoop. Enterprises can take advantage of the added flexibility and elasticity of the cloud for both long-running clusters, temporary deployments or for spikey workloads. However, as more and more users choose cloud environments for critical Hadoop workloads, they are often forced to compromise on key aspects of their data platform. Cloudera Director enables the full fidelity of the Enterprise Data Hub in the cloud, without compromises. Announced with the recent 5.2 release, Cloudera Director is the simple, reliable way to deploy and scale Hadoop in the cloud, while maintaining an open and neutral platform with enterprise-grade capabilities. During this webinar, Tushar Shanbhag, Director of Product Management, will look at why Hadoop cloud environments are becoming so popular and some of the challenges around Hadoop in the cloud. He will then provide an in-depth overview of Cloudera Director, its key features, and how it alleviates these common challenges. Finally, he will discuss some key use cases and provide insight into what’s next for Cloudera and Hadoop in the cloud.

cloudera hadoopcloudera directorhadoop cloud
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x

Apache Oozie has come a long way and now accounts for over 2.8 Million jobs per month on Yahoo's grid infrastructure. If you are running Hadoop jobs repeatedly and thinking of a smarter way of doing it, Apache Oozie is the answer. Be it running complex data transformation jobs chained one after another or simple daily data copy, Oozie workflows will help you to manage these tasks efficiently. Mona will cover the new features introduced in Apache Oozie 4.x, in particular, Apache HCatalog Integration, Job Notifications and SLA Monitoring for building large-scale and efficient data processing pipelines.

hahugoozie
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)

Your data is your IP and its security is paramount. The last thing you want is for your data to become a target for threats. This workshop will focus on the realities of protecting your customer’s IP from external and internal threats with battle hardened technologies and methodologies. Another key concept that will be examined is the connection of people, processes and technology. In addition, the session will take a look at authentication and authorisation, auditing and data lineage as well as the different groups required to play a part in the modern data hub. We will also look at how to produce high impact operation reports from Cloudera’s RecordService a new core security layer that centrally enforces fine-grained access control policy, which helps close the feedback loop to ensure awareness of security as a living entity within your organisation.

clouderacloudera sessionshadoop

More Related Content

What's hot

Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
Spark Summit
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
David Groozman
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Databricks
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debezium
Clement Demonchy
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
Patrick McFadin
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
DataWorks Summit
 
Apache kudu
Apache kuduApache kudu
Apache kudu
Asim Jalis
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
DataWorks Summit
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
Taro L. Saito
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
Rommel Garcia
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
DataWorks Summit
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
Altinity Ltd
 

What's hot (20)

Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 
From my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debeziumFrom my sql to postgresql using kafka+debezium
From my sql to postgresql using kafka+debezium
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Apache kudu
Apache kuduApache kudu
Apache kudu
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Apache Ranger
Apache RangerApache Ranger
Apache Ranger
 
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
 

Viewers also liked

Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Cloudera, Inc.
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Cloudera, Inc.
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester Webinar
Cloudera, Inc.
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

Cloudera, Inc.
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Cloudera, Inc.
 
Top 5 IoT Use Cases
Top 5 IoT Use CasesTop 5 IoT Use Cases
Top 5 IoT Use Cases
Cloudera, Inc.
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Cloudera, Inc.
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
Andriy Zabavskyy
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
نهاد مبارك
 
Kite SDK: Working with Datasets
Kite SDK: Working with DatasetsKite SDK: Working with Datasets
Kite SDK: Working with Datasets
Cloudera, Inc.
 
Oozie @ Riot Games
Oozie @ Riot GamesOozie @ Riot Games
Oozie @ Riot Games
Matt Goeke
 
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarnApache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
Shuya Tsukamoto
 
Oozie meetup - HA
Oozie meetup - HAOozie meetup - HA
Oozie meetup - HA
Mona Chitnis
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
David Kaiser
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria
 
The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)
Cloudera, Inc.
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera, Inc.
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
Yahoo Developer Network
 
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Cloudera, Inc.
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
Chicago Hadoop Users Group
 

Viewers also liked (20)

Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester Webinar
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
 
Top 5 IoT Use Cases
Top 5 IoT Use CasesTop 5 IoT Use Cases
Top 5 IoT Use Cases
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Kudu Cloudera Meetup Paris
Kudu Cloudera Meetup ParisKudu Cloudera Meetup Paris
Kudu Cloudera Meetup Paris
 
Kite SDK: Working with Datasets
Kite SDK: Working with DatasetsKite SDK: Working with Datasets
Kite SDK: Working with Datasets
 
Oozie @ Riot Games
Oozie @ Riot GamesOozie @ Riot Games
Oozie @ Riot Games
 
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarnApache hadoop yarn 勉強会 8. capacity scheduler in yarn
Apache hadoop yarn 勉強会 8. capacity scheduler in yarn
 
Oozie meetup - HA
Oozie meetup - HAOozie meetup - HA
Oozie meetup - HA
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)The Transformation of your Data in modern IT (Presented by DellEMC)
The Transformation of your Data in modern IT (Presented by DellEMC)
 
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the CloudCloudera Director: Unlock the Full Potential of Hadoop in the Cloud
Cloudera Director: Unlock the Full Potential of Hadoop in the Cloud
 
October 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.xOctober 2013 HUG: Oozie 4.x
October 2013 HUG: Oozie 4.x
 
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
 
Everything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about OozieEverything you wanted to know, but were afraid to ask about Oozie
Everything you wanted to know, but were afraid to ask about Oozie
 

Similar to Apache Kudu: Technical Deep Dive



cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platformcloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
cloudera Apache Kudu Updatable Analytical Storage for Modern Data Platform
Rakuten Group, Inc.
 
Apache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentechApache Kudu - Updatable Analytical Storage #rakutentech
Apache Kudu - Updatable Analytical Storage #rakutentech
Cloudera Japan
 
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
 
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa...
Yahoo Developer Network
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast DataKudu: New Hadoop Storage for Fast Analytics on Fast Data
Kudu: New Hadoop Storage for Fast Analytics on Fast Data
Cloudera, Inc.
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Apache Kudu: Technical Deep Dive



  • 1. Apache Kudu Webinar Series: Technical Deep Dive. David Alves | Apache Kudu PMC | Cloudera. © Cloudera, Inc. All rights reserved.
  • 2. Kudu Webinar Series
      • Part 1: Lambda Architectures – Simplified by Apache Kudu. A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can dramatically simplify real-time analytics.
      • Part 2: Extending the Capabilities of Operational and Analytical Databases. An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and Analytical databases can handle.
      • Part 3: Data-in-Motion: Unlock the Value of Real-Time Data. Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will discuss how to make it a reality.
      • Part 4: Technical Deep-Dive into Apache Kudu. An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu PMC Member.
  • 3. Updateable Analytic Storage: simple real-time analytics and updates with Apache Kudu
      • Kudu: storage for fast analytics on fast data
          • Simplified architecture for building real-time analytic applications
          • Designed for next-generation hardware for faster analytic performance across frameworks
          • Native Hadoop storage engine
      • Flexibility for the right tools for the right use case in one platform
          • Only analytic database for big data with Kudu + Impala
          • Simple real-time applications with Kudu + Spark
      • Use cases: time series data, machine data analytics, online reporting
      • [Diagram: platform stack. Integrate: structured (Sqoop), unstructured (Kafka, Flume). Process, analyze, serve: batch (Spark, Hive, Pig, MapReduce), stream (Spark), SQL (Impala), search (Solr), other (Kite). Unified services: resource management (YARN), security (Sentry, RecordService). Store: NoSQL (HBase), other (object store), filesystem (HDFS), relational (Kudu).]
  • 4. Filling the Analytic Gap
      • HDFS: fast scans, analytics, and processing of stored data
      • HBase: fast online updates and data serving
      • Kudu fills the gap: fast analytics on fast-changing or frequently updated data
      • Modern analytic applications often require complex data flow and difficult integration work to move data between HBase and HDFS.
      • [Diagram: pace of data (unchanging to fast changing, frequent updates) plotted against pace of analysis (append-only to real-time), with HDFS, HBase, arbitrary storage (active archive), and the analytic gap that Kudu fills.]
  • 5. Better Together: Kudu benefits from integration with the Apache ecosystem
      • Spark: stream processing for Kudu
          • Open standard for real-time stream processing
          • Effective for automating decision processes and machine learning
          • Use cases include time series data and machine data analytics
      • Impala: high-performance BI and SQL for Kudu
          • Open standard for interactive SQL queries
          • Powers analytic database workloads with flexibility, scale, and open architecture
          • Use cases include time series data and online reporting
  • 6. Apache Kudu Community
  • 7. Why Kudu? Use Cases and Motivation
  • 8. Why Kudu? A simultaneous combination of sequential and random reads and writes
      • Time series data: Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
      • Machine data analytics: Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
      • Online reporting: How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?
  • 9. Next generation hardware
      • Solid-state storage: cheaper and faster every year. Persistent memory (3D XPoint™). Kudu can take advantage of SSD and NVM using Intel’s NVM Library.
      • Cheaper, bigger memory: RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues.
      • Efficiency on modern CPUs: modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures.
  • 10. Apache Kudu: Scalable and fast tabular storage
      • Scalable
          • Tested up to 275 nodes (~3PB cluster)
          • Designed to scale to 1000s of nodes and tens of PBs
      • Fast
          • Millions of read/write operations per second across cluster
          • Multiple GB/second read throughput per node
      • Tabular
          • Represents data in structured tables like a relational database
          • Individual record-level access to 100+ billion row tables
  • 11. Deep Dive: Replication And Fault Tolerance
  • 12. Metadata
      • Replicated master
          • Acts as a tablet directory
          • Acts as a catalog (which tables exist, etc.)
          • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
          • Caches all metadata in RAM for high performance
      • Client configured with master addresses
          • Asks master for tablet locations as needed and caches them
  • 13. Client and master lookup
      • Client: “Hey Master! Where is the row for ‘tlipcon’ in table “T”?”
      • Master: “It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …”
      • Client meta cache: T1: … T2: … T3: …
      • Client then sends the write: UPDATE tlipcon SET col=foo
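The lookup-then-cache exchange on this slide can be sketched in a few lines of Python. This is only an illustration, not Kudu's client code: the `MetaCache` class, the `fake_master_lookup` function, the string range boundaries, and the server lists for T1 and T3 are all hypothetical, and the sketch assumes simple range partitioning on the row key.

```python
import bisect

class MetaCache:
    """Hypothetical client-side cache of tablet locations (not Kudu's API)."""

    def __init__(self, master_lookup):
        self._master_lookup = master_lookup  # callable standing in for the master RPC
        self._starts = []    # sorted tablet start keys
        self._tablets = {}   # start key -> (end key, tablet id, servers)
        self.master_calls = 0

    def _cached(self, key):
        # Find the tablet whose start key is the greatest one <= key.
        i = bisect.bisect_right(self._starts, key) - 1
        if i < 0:
            return None
        end, tablet_id, servers = self._tablets[self._starts[i]]
        if end is None or key < end:
            return tablet_id, servers
        return None

    def locate(self, key):
        hit = self._cached(key)
        if hit is not None:
            return hit
        # Cache miss: ask the master, and remember every tablet it mentions.
        self.master_calls += 1
        for start, end, tablet_id, servers in self._master_lookup(key):
            if start not in self._tablets:
                bisect.insort(self._starts, start)
            self._tablets[start] = (end, tablet_id, servers)
        hit = self._cached(key)
        if hit is None:
            raise KeyError(key)
        return hit

# Assumed layout: three range-partitioned tablets; an end key of None means unbounded.
TABLETS = [
    ("", "m", "T1", ["X", "Y", "W"]),
    ("m", "u", "T2", ["Z", "Y", "X"]),
    ("u", None, "T3", ["W", "Z", "X"]),
]

def fake_master_lookup(key):
    # The master answers with the requested tablet plus others the client may need.
    return TABLETS

cache = MetaCache(fake_master_lookup)
print(cache.locate("tlipcon"))   # first call goes to the master
print(cache.locate("alice"))     # served from the cache
print(cache.master_calls)
```

Because the master volunteers locations for other tablets, the second lookup never leaves the client, which is the point of the meta cache.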
  • 14. Raft consensus
      • [Diagram: Client; TS A hosts Tablet 1 (LEADER); TS B and TS C host Tablet 1 (FOLLOWER); each tablet server has its own WAL.]
      • 1a. Client->Leader: Write() RPC
      • 2a. Leader->Followers: UpdateConsensus() RPC
      • 2b. Leader writes local WAL
      • 3. Each follower writes its WAL
      • 4. Follower->Leader: success
      • 5. Leader has achieved majority
      • 6. Leader->Client: Success!
  • 15. Fault tolerance
      • Transient FOLLOWER failure:
          • Leader can still achieve majority
          • Restart follower TS within 5 min and it will rejoin transparently
      • Transient LEADER failure:
          • Followers expect to hear a heartbeat from their leader every 1.5 seconds
          • 3 missed heartbeats: leader election!
          • New LEADER is elected from remaining nodes within a few seconds
          • Restart within 5 min and it rejoins as a FOLLOWER
      • N replicas handle (N-1)/2 failures, rounded down (for example, 3 replicas tolerate 1 failure and 5 replicas tolerate 2)
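The majority rule behind the Raft write path and the (N-1)/2 bound can be checked with a little arithmetic. The helper names below are made up for this sketch; it assumes, as in the write-path steps, that the leader's own WAL write counts as one vote.

```python
def majority(n_replicas):
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1

def tolerated_failures(n_replicas):
    """Failures that still leave a majority alive: (N-1)/2, rounded down."""
    return (n_replicas - 1) // 2

def write_committed(n_replicas, follower_acks):
    """A write commits once the leader plus acknowledging followers form a majority."""
    return 1 + follower_acks >= majority(n_replicas)

for n in (3, 5, 7):
    print(n, "replicas: majority =", majority(n),
          ", tolerated failures =", tolerated_failures(n))

print(write_committed(3, follower_acks=1))  # leader + 1 follower out of 3
print(write_committed(3, follower_acks=0))  # leader alone out of 3
```

With 3 replicas, a single follower acknowledgement is enough, which is why a transient follower failure does not block writes.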
  • 16. Fault tolerance (2)
      • Permanent failure (Tablet Copy):
          • Leader notices that a follower has been dead for 5 minutes
          • Evicts that follower
          • Master selects a new replica (in PRE_VOTER state)
          • Leader copies the data over to the new one, which joins as a new FOLLOWER
          • New replica is assigned VOTER state
      • Cluster change_config lets you add/remove tablet servers to a tablet’s configuration (only supported from CLI tool)
  • 17. Deep Dive: Columnar Storage
  • 18. Columnar Storage
      • Improved scan performance
          • Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading unnecessary data from other columns
          • Efficient encodings can dramatically improve compression ratios, which reduces effective IO load
          • Typed, homogeneous data plays well to modern processor strengths (vectorization, pipelining)
      • At the cost of random access performance
          • Single row access requires a number of seeks proportional to the number of columns
          • BUT, random access is becoming cheaper (cheap RAM, SSDs, NVRAM)
  • 19. Row Storage
      • Columns: tweet_id, user_name, created_at, text
      • {23059873, newsycbot, 1442865158, Visual exp…}
      • {22309487, RideImpala, 1442828307, Introducing …}
      • …
      • Scans have to read all the data; no encodings
  • 20. Columnar storage • tweet_id: {25059873, 22309487, 23059861, 23010982} • user_name: {newsycbot, RideImpala, fastly, llvmorg} • created_at: {1442865158, 1442828307, 1442865156, 1442865155} • text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
  • 21. Columnar storage • tweet_id (1GB): {25059873, 22309487, 23059861, 23010982} • user_name (2GB): {newsycbot, RideImpala, fastly, llvmorg} • created_at (1GB): {1442865158, 1442828307, 1442865156, 1442865155} • text (200GB): {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} • SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot'; • Only read 1 column • …AND door open to big IO gains with compression and encoding
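A minimal sketch of why the query on the slide above touches a single column. The column values are the ones shown on the slide (the truncated `text` column is left out). Pairing the sizes 1GB, 2GB, 1GB, 200GB with the columns in slide order is an assumption, and the function names are illustrative.

```python
# Each column is stored as its own array, as in a columnar layout.
columns = {
    "tweet_id": [25059873, 22309487, 23059861, 23010982],
    "user_name": ["newsycbot", "RideImpala", "fastly", "llvmorg"],
    "created_at": [1442865158, 1442828307, 1442865156, 1442865155],
}

# Assumed pairing of the sizes on the slide with the columns, in order.
sizes_gb = {"tweet_id": 1, "user_name": 2, "created_at": 1, "text": 200}


def count_where_equals(cols, column, value):
    """SELECT COUNT(*) ... WHERE column = value, reading only that column."""
    return sum(1 for v in cols[column] if v == value)


def fraction_read(sizes, needed_columns):
    """Share of the table's bytes a scan touches when it reads only needed_columns."""
    return sum(sizes[c] for c in needed_columns) / sum(sizes.values())


matches = count_where_equals(columns, "user_name", "newsycbot")
share = fraction_read(sizes_gb, ["user_name"])
```

Under the assumed sizes, the scan reads 2GB out of 204GB, a little under 1% of the table, where a row store would have to read all of it.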
  • 22. Columnar Storage – Other Encodings/Compression • Available Encodings: • Dictionary (Strings, Binary) • Bitshuffle (Numeric) • RLE (Numeric, Bool) • Available Compression: • Snappy • LZ4 • ZLib • Kudu 1.3 ships with “good” defaults for most cases
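To make two of the encodings listed above concrete, here is a toy sketch of run-length encoding (RLE) and dictionary encoding. These functions only illustrate the general techniques. They are not Kudu's on-disk formats, and the function names are made up for this example.

```python
def rle_encode(values):
    """Run-length encoding: collapse repeated neighbours into (value, count)."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def dict_encode(values):
    """Dictionary encoding: store each distinct value once, plus small codes."""
    dictionary = {}
    codes = []
    for v in values:
        # A value seen for the first time gets the next free integer code.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), codes
```

RLE pays off on columns with long runs of equal values (for example booleans or sorted numerics), while dictionary encoding pays off on string columns with few distinct values.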
  • 23. Deep Dive: Write and Read Paths
  • 24. LSM vs Kudu • LSM – Log Structured Merge (Cassandra, HBase, etc) • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable) • Reads perform an on-the-fly merge of all on-disk HFiles • Kudu • Shares some traits (memstores, compactions) • More complex. • Slower writes in exchange for faster reads (especially scans)
  • 25. LSM Insert Path • INSERT into MemStore: Row=r1 col=c1 val="blah", Row=r1 col=c2 val="1" • Flush to HFile 1: Row=r1 col=c1 val="blah", Row=r1 col=c2 val="1"
  • 26. LSM Insert Path • INSERT into MemStore: Row=r2 col=c1 val="blah2", Row=r2 col=c2 val="2" • Flush to HFile 2: Row=r2 col=c1 val="blah2", Row=r2 col=c2 val="2" • HFile 1 (from the earlier flush): Row=r1 col=c1 val="blah", Row=r1 col=c2 val="1"
  • 27. LSM Update path • UPDATE into MemStore: Row=r2 col=c1 val="newval" • HFile 1: Row=r1 col=c1 val="blah", Row=r1 col=c2 val="2" • HFile 2: Row=r2 col=c1 val="v2", Row=r2 col=c2 val="5" • Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
  • 28. LSM Read path • MemStore: Row=r2 col=c1 val="newval" • HFile 1: Row=r1 col=c1 val="blah", Row=r1 col=c2 val="2" • HFile 2: Row=r2 col=c1 val="v2", Row=r2 col=c2 val="5" • Merge based on string row keys: R1: c1=blah c2=2; R2: c1=newval c2=5; … • CPU intensive! Must always read rowkeys • Any given row may exist across multiple HFiles: must always merge! • The more HFiles to merge, the slower it reads
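The LSM read path above can be sketched with the values from the slide. This is a toy model: each store is a dict keyed by `(row, col)`, and `lsm_read` is a hypothetical helper. Real LSM engines merge sorted files rather than scanning dicts, but the point is the same: every store has to be consulted, and the newest value wins.

```python
memstore = {("r2", "c1"): "newval"}
hfile_2 = {("r2", "c1"): "v2", ("r2", "c2"): "5"}
hfile_1 = {("r1", "c1"): "blah", ("r1", "c2"): "2"}

# Newest first: MemStore, then the most recent HFile, then older ones.
stores = [memstore, hfile_2, hfile_1]


def lsm_read(row, stores_newest_first):
    """Merge all stores by row key; the first (newest) value for a column wins."""
    result = {}
    for store in stores_newest_first:        # no store can be skipped
        for (r, col), val in store.items():
            if r == row and col not in result:
                result[col] = val
    return result
```

Reading `r2` returns `c1=newval` from the MemStore and `c2=5` from HFile 2, matching the merged result on the slide. The work grows with the number of stores.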
  • 29. Kudu storage – Inserts and Flushes • INSERT("todd", "$1000", "engineer") goes to the MemRowSet • The MemRowSet is flushed to DiskRowSet 1 (columns: name, pay, role)
  • 30. Kudu storage – Inserts and Flushes • INSERT("doug", "$1B", "Hadoop man") goes to the MemRowSet • The MemRowSet is flushed to DiskRowSet 2 (columns: name, pay, role) • DiskRowSet 1 (columns: name, pay, role)
  • 31. Kudu storage – Updates • MemRowSet • DiskRowSet 1: base data (name, pay, role) + Delta MS • DiskRowSet 2: base data (name, pay, role) + Delta MS • Each DiskRowSet has its own DeltaMemStore to accumulate updates
  • 32. Kudu storage – Updates • UPDATE set pay="$1M" WHERE name="todd" • Is the row in DiskRowSet 2? (check bloom filters) Bloom says: no! • Is the row in DiskRowSet 1? (check bloom filters) Bloom says: maybe! • Search key column to find offset: rowid = 150 • Delta MS entry: 150: col 1=$1M
  • 33. Kudu storage – Read path • MemRowSet • DiskRowSet 1: base data (name, pay, role) + Delta MS • DiskRowSet 2: base data (name, pay, role) + Delta MS • Delta MS entry: 150: pay=$1M • Read rows in DiskRowSet 2 • Then, read rows in DiskRowSet 1 • Any row is only in exactly one DiskRowSet – no need to merge cross-DRS! • Updates are merged based on ordinal offset within DRS: array indexing, no string compares
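The update and read paths on the last few slides can be sketched as follows. This is a simplified model under stated assumptions, not Kudu's code: `DiskRowSet` here holds only the `name` and `pay` columns, a Python `set` stands in for the Bloom filter (so it never gives false positives), and the delta store is a dict keyed by row offset.

```python
import bisect


class DiskRowSet:
    """Toy rowset: sorted key column, one value column, per-rowset delta store."""

    def __init__(self, rows):
        rows = sorted(rows)                      # rows are (name, pay), sorted by key
        self.keys = [r[0] for r in rows]
        self.pay = [r[1] for r in rows]          # immutable base data
        self.maybe_keys = set(self.keys)         # stand-in for a Bloom filter
        self.deltas = {}                         # row offset -> updated pay

    def might_contain(self, key):
        return key in self.maybe_keys

    def update_pay(self, key, new_pay):
        # Search the key column to find the row's ordinal offset.
        i = bisect.bisect_left(self.keys, key)
        if i == len(self.keys) or self.keys[i] != key:
            return False
        self.deltas[i] = new_pay                 # base data is left untouched
        return True

    def scan(self):
        # Deltas are applied by array index; no key comparisons are needed.
        return [(k, self.deltas.get(i, self.pay[i]))
                for i, k in enumerate(self.keys)]


def update(rowsets, key, new_pay):
    """Ask each rowset's filter, then record the delta in the one holding the key."""
    for rs in rowsets:
        if rs.might_contain(key) and rs.update_pay(key, new_pay):
            return True
    return False
```

Because a key lives in exactly one rowset, a scan can read each rowset independently and apply its own deltas by offset, which is the contrast with the LSM merge drawn on the slides.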
  • 34. Kudu storage – Delta flushes • MemRowSet • DiskRowSet 1: base data (name, pay, role) + Delta MS • DiskRowSet 2: base data (name, pay, role) + Delta MS • Flush: the Delta MS entry 0: pay=foo is written to a REDO DeltaFile • A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version
  • 35. Kudu storage – Major delta compaction • DiskRowSet (pre-compaction): base data (name, pay, role), Delta MS, three REDO DeltaFiles • Many deltas accumulate: lots of delta application work on reads • DiskRowSet (post-compaction): base data (name, pay, role), Delta MS, unmerged REDO deltas, UNDO deltas • Merge updates for columns with high update percentage • If a column has few updates, it doesn’t need to be re-written: those deltas are maintained in a new DeltaFile
  • 36. Kudu storage – RowSet Compactions • Before: DRS 1 (32MB) [PK=alice], [PK=joe], [PK=linda], [PK=zach]; DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary], [PK=zeke]; DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar], [PK=zoe] • Writes for “chris” have to perform bloom lookups on all 3 RS • Reorganize rows to avoid rowsets with overlapping key ranges • After: DRS 4 (32MB) [alice, bob, carl, joe]; DRS 5 (32MB) [jon, julie, linda, mary]; DRS 6 (32MB) [omar, zach, zeke, zoe]
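The rowset compaction above can be reproduced with the keys from the slide. This is a sketch of the idea only: `compact` and `rowsets_to_check` are hypothetical helpers, and rowsets are plain sorted lists of primary keys.

```python
import heapq

drs1 = ["alice", "joe", "linda", "zach"]
drs2 = ["bob", "jon", "mary", "zeke"]
drs3 = ["carl", "julie", "omar", "zoe"]


def compact(rowsets, rows_per_output):
    """Merge sorted rowsets and cut the result into non-overlapping rowsets."""
    merged = list(heapq.merge(*rowsets))
    return [merged[i:i + rows_per_output]
            for i in range(0, len(merged), rows_per_output)]


def rowsets_to_check(rowsets, key):
    """Rowsets whose [min, max] key range could contain key."""
    return [rs for rs in rowsets if rs[0] <= key <= rs[-1]]


before = [drs1, drs2, drs3]
after = compact(before, rows_per_output=4)
```

Before compaction, a write for "chris" falls inside the key range of all three rowsets. Afterwards only the first output rowset, `[alice, bob, carl, joe]`, needs to be checked.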
  • 37. Kudu Storage – Compactions • Main Idea: Always be compacting! • Compactions run continuously to prevent IO storms • “Budgeted” RS compactions: What is the best way to spend X MBs of IO? • Physical/Logical decoupling: different replicas run compactions at different times
  • 38. Deep Dive: Partitioning
  • 39. Partitioning • Kudu has flexible policies for distributing data among partitions • Hash partitioning is built in, and can be combined with range partitioning • Keys are ordered within a partition • Key order matters • Partitioning key order dictates distribution • Primary key order affects how much data is read for scans
  • 40. Primary Key selection • Example - Time series data: • “time” - timestamp • “series” - {region, server, metric} • Key order (series, time): (us-east.appserver01.loadavg, 2016-05-09T15:14:00Z) (us-east.appserver01.loadavg, 2016-05-09T15:15:00Z) (us-west.dbserver03.rss, 2016-05-09T15:14:30Z) (us-west.dbserver03.rss, 2016-05-09T15:14:30Z) • Key order (time, series): (2016-05-09T15:14:00Z, us-east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) (2016-05-09T15:15:00Z, us-east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) • SELECT * WHERE series = 'us-east.appserver01.loadavg';
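A small sketch of why key order matters for the query on the slide above. It uses the three distinct (series, time) pairs from the slide. `matching_positions` is a hypothetical helper that reports where the matching rows sit in each sort order.

```python
samples = [
    ("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z"),
    ("us-east.appserver01.loadavg", "2016-05-09T15:15:00Z"),
    ("us-west.dbserver03.rss", "2016-05-09T15:14:30Z"),
]

# The same rows stored under the two candidate primary key orders.
by_series_time = sorted(samples)
by_time_series = sorted((t, s) for s, t in samples)


def matching_positions(sorted_keys, series_index, series):
    """Positions, in storage order, of the rows belonging to one series."""
    return [i for i, key in enumerate(sorted_keys)
            if key[series_index] == series]


query = "us-east.appserver01.loadavg"
pos_series_first = matching_positions(by_series_time, 0, query)
pos_time_first = matching_positions(by_time_series, 1, query)
```

With (series, time) the matching rows are adjacent, so a scan can stop after a short contiguous range. With (time, series) they are interleaved with other series, so the scan has to cover the whole time range to find them.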
  • 41. Partitioning – By Time Range (inserts) • All inserts go to the latest partition • Big scans (across large time intervals) can be parallelized across many partitions
  • 42. Partitioning – By Series Range • Inserts are spread among all partitions • Scans are over a single partition
  • 43. Partitioning – By Series Range • Partitions can become unbalanced, resulting in hot spotting
  • 44. Partitioning – By Series Hash (inserts) • Inserts are spread among all partitions • Scans are over a single partition
  • 45. Partitioning – By Series Hash • Partitions grow over time, eventually becoming too big for a single server
  • 46. Partitioning – By Series Hash + Time Range • Inserts are spread among all partitions in the latest time range • Big scans (across large time intervals) can be parallelized across partitions
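The combined scheme on the slide above can be sketched as a function that maps a row to a (hash bucket, time range) pair. Everything here is illustrative: `partition_for` is a hypothetical helper, `zlib.crc32` stands in for whatever hash function Kudu actually uses, and the split points and bucket count are made-up values.

```python
import zlib
from bisect import bisect_right


def partition_for(series, time, num_hash_buckets, time_splits):
    """Map a row to (hash bucket of series, index of its time range)."""
    bucket = zlib.crc32(series.encode("utf-8")) % num_hash_buckets
    # ISO-8601 UTC timestamps sort correctly as plain strings.
    time_range = bisect_right(time_splits, time)
    return bucket, time_range


# Hypothetical daily split points and bucket count.
splits = ["2016-05-09T00:00:00Z", "2016-05-10T00:00:00Z"]
p1 = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z", 4, splits)
p2 = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:15:00Z", 4, splits)
p3 = partition_for("us-east.appserver01.loadavg", "2016-05-10T01:00:00Z", 4, splits)
```

A series always hashes to the same bucket, so inserts for different series spread across buckets within the latest time range. Older data sits in earlier time ranges, which keeps any single partition from growing without bound.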
  • 47. Next Steps • Get Started with Kudu & Cloudera: www.cloudera.com/downloads, https://blog.cloudera.com/?s=kudu • Start Contributing to Kudu: http://kudu.apache.org/
  • 48. Thank you • David Alves – Apache Kudu PMC • @dribeiroalves