Apache Kudu Webinar Series
Technical Deep Dive
David Alves | Apache Kudu PMC | Cloudera
Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can
dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and
Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will
discuss how to make it a reality.
Part 4: Technical Deep-Dive into Apache Kudu
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu
PMC Member.
Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic applications
• Designed for next-generation hardware for faster analytic
performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use
case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
Kafka, Flume
Sentry, RecordService
Spark, Hive, Pig
Object Store
Filling the Analytic Gap
Fast Scans, Analytics
and Processing of
Stored Data
Fast On-Line
Updates &
Data Serving
Arbitrary Storage
(Active Archive)
Fast Analytics
(on fast-changing or
frequently-updated data)
Fast Changing
Frequent Updates
Kudu fills the Gap
Modern analytic
applications often
require complex data
flow & difficult
integration work to
move data between
HBase & HDFS
Pace of Analysis

Better Together
Kudu Benefits from Integration with the Apache Ecosystem
Spark – Stream Processing for Kudu
• Open standard for real-time stream processing
• Effective for automating decision processes and machine learning
• Use Cases include: Time Series Data & Machine Data Analytics
Impala – High-Performance BI & SQL for Kudu
• Open standard for interactive SQL queries
• Powers analytic database workloads with flexibility, scale, and
open architecture
• Use Cases include: Time Series Data & Online Reporting
Apache Kudu Community
Why Kudu?
Use Cases and Motivation
Why Kudu?
A simultaneous combination of sequential and random reads and writes
Time Series Data: Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes?
Machine Data Analytics: Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning?
Online Reporting: How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data?

Next generation hardware
Solid-state Storage: Cheaper and faster every year. Persistent memory (3D XPoint™). Kudu can take advantage of SSD and NVM using Intel’s NVM Library.
Cheaper, Bigger Memory: RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues.
Efficiency on Modern CPUs: Modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures.
Apache Kudu: Scalable and fast tabular storage
Scalable
• Tested up to 275 nodes (~3PB cluster)
• Designed to scale to 1000s of nodes and tens of PBs
Fast
• Millions of read/write operations per second across cluster
• Multiple GB/second read throughput per node
Tabular
• Represents data in structured tables like a relational database
• Individual record-level access to 100+ billion row tables
Deep Dive:
Replication And Fault Tolerance
Metadata
• Replicated master
• Acts as a tablet directory
• Acts as a catalog (which tables exist, etc)
• Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
• Client configured with master addresses
• Asks master for tablet locations as needed and caches them

Client: Hey Master! Where is the row for ‘tlipcon’ in table “T”?
Master: It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …
Client: UPDATE tlipcon SET col=foo
Meta Cache
T1: …
T2: …
T3: …
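The lookup-then-cache exchange above can be sketched in a few lines of Python. This is a toy model, not Kudu's client API: the `Master` and `Client` classes, the key ranges, and the server names for T1 are assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model: tablet id -> (start key inclusive, end key exclusive, servers).
TabletMap = Dict[str, Tuple[str, str, List[str]]]


class Master:
    """Stands in for the master's tablet directory."""

    def __init__(self, tablets: TabletMap) -> None:
        self.tablets = tablets
        self.lookups = 0  # counts round trips, to show the effect of caching

    def locate_tablets(self) -> TabletMap:
        self.lookups += 1
        return dict(self.tablets)


class Client:
    def __init__(self, master: Master) -> None:
        self.master = master
        self.meta_cache: TabletMap = {}

    def locate(self, key: str) -> Tuple[str, List[str]]:
        # Ask the master only when the local meta cache is empty.
        if not self.meta_cache:
            self.meta_cache = self.master.locate_tablets()
        for tablet_id, (start, end, servers) in self.meta_cache.items():
            if start <= key < end:
                return tablet_id, servers
        raise KeyError(key)


master = Master({
    "T1": ("", "m", ["A", "B", "C"]),        # assumed key range and servers
    "T2": ("m", "\uffff", ["Z", "Y", "X"]),  # assumed key range
})
client = Client(master)
first = client.locate("tlipcon")
second = client.locate("tlipcon")  # served from the meta cache
```

After the first call the client can route further operations on the same table without contacting the master again.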
Raft consensus
TS A: Tablet 1 (LEADER)
TS B: Tablet 1 (FOLLOWER)
TS C: Tablet 1 (FOLLOWER)
1a. Client->Leader: Write() RPC
2a. Leader->Followers: UpdateConsensus() RPC
2b. Leader writes local WAL
3. Follower: write WAL (on each follower)
4. Follower->Leader: success
5. Leader has achieved majority
6. Leader->Client: Success!
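As a rough illustration of the write path on this slide, the sketch below acknowledges a write once a majority of replicas have appended it to their WAL. It is a simplification under stated assumptions: the `Replica` class and `replicated_write` function are hypothetical, the calls are synchronous, and terms, log matching, and retries are left out.

```python
from typing import List


class Replica:
    """Hypothetical replica holding an in-memory write-ahead log."""

    def __init__(self, name: str, alive: bool = True) -> None:
        self.name = name
        self.alive = alive
        self.wal: List[str] = []

    def append(self, op: str) -> bool:
        # A dead replica never acknowledges.
        if not self.alive:
            return False
        self.wal.append(op)
        return True


def replicated_write(leader: Replica, followers: List[Replica], op: str) -> bool:
    """Return True once a majority of replicas (leader included) have logged op."""
    acks = 1 if leader.append(op) else 0   # step 2b: leader writes local WAL
    for follower in followers:             # steps 2a, 3, 4
        if follower.append(op):
            acks += 1
    majority = (1 + len(followers)) // 2 + 1
    return acks >= majority                # steps 5 and 6


leader = Replica("TS A")
ok_one_down = replicated_write(leader, [Replica("TS B"), Replica("TS C", alive=False)], "w1")
ok_two_down = replicated_write(leader, [Replica("TS B", alive=False), Replica("TS C", alive=False)], "w2")
```

With three replicas, one unavailable follower does not block the write, but two do.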
Fault tolerance
• Transient FOLLOWER failure:
• Leader can still achieve majority
• Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
• Followers expect to hear a heartbeat from their leader every 1.5 seconds
• 3 missed heartbeats: leader election!
• New LEADER is elected from remaining nodes within a few seconds
• Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
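The arithmetic behind the last bullet and the heartbeat rule can be checked directly. The function names are made up for this sketch; the 1.5-second interval and the three missed heartbeats are the figures quoted on the slide.

```python
def tolerated_failures(num_replicas: int) -> int:
    """Failures a Raft group survives while still reaching a majority."""
    return (num_replicas - 1) // 2


def seconds_until_election(heartbeat_interval: float = 1.5, missed: int = 3) -> float:
    """Time without heartbeats before followers start a leader election."""
    return heartbeat_interval * missed
```

For example, `tolerated_failures(3)` is 1 and `tolerated_failures(5)` is 2; going from 3 to 4 replicas adds no extra tolerance, which is why odd replica counts are the usual choice.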
Fault tolerance (2)
• Permanent failure (Tablet Copy):
• Leader notices that a follower has been dead for 5 minutes
• Evicts that follower
• Master selects a new replica (in PRE_VOTER state)
• Leader copies the data over to the new one, which joins as a new FOLLOWER
• New replica is assigned VOTER state
• Cluster change_config lets you add/remove tablet servers to a tablet’s
configuration (only supported from CLI tool)

Deep Dive:
Columnar Storage
Columnar Storage
• Improved Scan performance
• Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading
unnecessary data from other columns
• Efficient encodings can dramatically improve compression ratios, which reduces effective IO load
• Typed, homogeneous data plays well to modern processor strengths (vectorization, pipelining)
• At the cost of random access performance
• single row access requires a number of seeks proportional to the number of columns
• BUT, random access is becoming cheaper (Cheap RAM, SSDs, NVRAM)
Row Storage
{23059873, newsycbot, 1442865158, Visual exp…}
{22309487, RideImpala, 1442828307, Introducing …}
Tweet_id, user_name, created_at, text
Scans have to read all the data, no encodings
Columnar storage
Tweet_id: {25059873, 22309487, 23059861, 23010982}
User_name: {newsycbot, RideImpala, fastly, llvmorg}
Created_at: {1442865158, 1442828307, 1442865156, 1442865155}
text: {Visual exp…, Introducing .., Missing July…, LLVM 3.7….}
SELECT COUNT(*) FROM tweets WHERE user_name = 'newsycbot';
Only read 1 column
Column sizes: 1GB 2GB 1GB 200GB
…AND door open to big IO gains with compression and encoding
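To make the "only read 1 column" point concrete, the sketch below counts how many stored values each layout touches when evaluating the `user_name` predicate. The tweet values come from the slides; the two counting functions are illustrative helpers, not Kudu code.

```python
rows = [
    (25059873, "newsycbot", 1442865158, "Visual exp..."),
    (22309487, "RideImpala", 1442828307, "Introducing ..."),
    (23059861, "fastly", 1442865156, "Missing July..."),
    (23010982, "llvmorg", 1442865155, "LLVM 3.7..."),
]
names = ("tweet_id", "user_name", "created_at", "text")

# Column layout: one list per column instead of one tuple per row.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}


def count_in_row_store(rows, user):
    """Return (matches, values read): every field of every row is touched."""
    matches = 0
    values_read = 0
    for row in rows:
        values_read += len(row)
        if row[1] == user:
            matches += 1
    return matches, values_read


def count_in_column_store(columns, user):
    """Return (matches, values read): only the user_name column is touched."""
    column = columns["user_name"]
    return sum(1 for value in column if value == user), len(column)


row_result = count_in_row_store(rows, "newsycbot")
column_result = count_in_column_store(columns, "newsycbot")
```

Both layouts find the same single match, but the row layout reads four times as many values here; with a wide `text` column the gap in bytes is far larger.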
Columnar Storage – Other Encodings/Compression
Available Encodings:
• Dictionary (Strings, Binary)
• Bitshuffle (Numeric)
• RLE (Numeric, Bool)
Available Compression:
• Snappy
• LZ4
• ZLib
Kudu 1.3 ships with “good” defaults for most cases
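Two of the listed encodings are easy to sketch. The functions below show the general idea of dictionary and run-length encoding on a single column; they are textbook versions written for illustration and do not reflect Kudu's on-disk format.

```python
from typing import Hashable, List, Sequence, Tuple


def dictionary_encode(values: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Replace each value with a small integer code into a list of distinct values."""
    dictionary: List[Hashable] = []
    index = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def rle_encode(values: Sequence[Hashable]) -> List[Tuple[Hashable, int]]:
    """Collapse consecutive repeats into (value, run length) pairs."""
    runs: List[list] = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]
```

Dictionary encoding pays off when a string column has few distinct values; RLE pays off when equal values sit next to each other, which sorted or boolean columns often do.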
Deep Dive:
Write and Read Paths
LSM vs Kudu
• LSM – Log Structured Merge (Cassandra, HBase, etc)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to
on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)

LSM Insert Path
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
LSM Insert Path
Row=r1 col=c1 val=“blah2”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“blah2”
Row=r2 col=c2 val=“2”
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“1”
LSM Update path
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Row=r2 col=c1 val=“newval”
Note: all updates are “fully decoupled” from reads.
Random-write workload is transformed to fully sequential!
LSM Read path
HFile 1
Row=r1 col=c1 val=“blah”
Row=r1 col=c2 val=“2”
HFile 2
Row=r2 col=c1 val=“v2”
Row=r2 col=c2 val=“5”
Row=r2 col=c1 val=“newval”
Merge based on string row keys
R1: c1=blah c2=2
R2: c1=newval c2=5
CPU intensive!
Must always read rowkeys
Any given row may exist across multiple HFiles: must always merge!
The more HFiles to merge, the slower it reads
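The merge described on this slide can be sketched as follows. This is a simplified model, assuming each store is a plain dictionary from row key to column values and that stores are ordered oldest to newest; `lsm_read` is a hypothetical helper, and real LSM stores merge sorted files with timestamps rather than dictionaries.

```python
from typing import Dict, List

Store = Dict[str, Dict[str, str]]  # row key -> {column: value}


def lsm_read(memstore: Store, hfiles: List[Store], row_key: str) -> Dict[str, str]:
    """Merge every store that may hold the row; newer stores override older ones."""
    merged: Dict[str, str] = {}
    for store in hfiles + [memstore]:  # oldest first, memstore last
        merged.update(store.get(row_key, {}))
    return merged


hfile_1 = {"r1": {"c1": "blah", "c2": "2"}}
hfile_2 = {"r2": {"c1": "v2", "c2": "5"}}
memstore = {"r2": {"c1": "newval"}}

r1 = lsm_read(memstore, [hfile_1, hfile_2], "r1")
r2 = lsm_read(memstore, [hfile_1, hfile_2], "r2")
```

Every read consults every store by row key, which is the cost the slide calls out: the more files there are, the more key comparisons each read performs.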

Kudu storage – Inserts and Flushes
INSERT(“todd”, “$1000”,”engineer”)
name pay role
DiskRowSet 1
Kudu storage – Inserts and Flushes
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
INSERT(“doug”, “$1B”, “Hadoop man”)
Kudu storage - Updates
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
Each DiskRowSet has its own
DeltaMemStore to
accumulate updates
base data
base data
Kudu storage - Updates
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
UPDATE set pay=“$1M”
WHERE name=“todd”
Is the row in DiskRowSet 2?
(check bloom filters)
Is the row in DiskRowSet 1?
(check bloom filters)
Bloom says: no!
Bloom says: maybe!
Search key column to find
offset: rowid = 150
150: col 1=$1M
base data

Kudu storage – Read path
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
150: pay=$1M
Read rows in DiskRowSet 2
Then, read rows in
DiskRowSet 1
Any row is only in exactly one DiskRowSet: no need to merge cross-DRS!
Updates are merged based on ordinal offset within DRS: array indexing, no string compares
base data
base data
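The update and read paths from the last few slides can be modelled in a small sketch. Everything here is an assumption for illustration: the `DiskRowSet` class is a toy with one key column and one `pay` column, a Python `set` stands in for the bloom filter, and the DeltaMemStore is a dictionary keyed by row offset.

```python
import bisect
from typing import Dict, List, Tuple


class DiskRowSet:
    """Toy rowset: sorted key column, one value column, and a delta store."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        rows = sorted(rows)
        self.keys = [key for key, _ in rows]   # base data, key column
        self.pay = [pay for _, pay in rows]    # base data, value column
        self.maybe_keys = set(self.keys)       # stand-in for a bloom filter
        self.delta_ms: Dict[int, str] = {}     # row offset -> updated pay

    def update_pay(self, key: str, new_pay: str) -> bool:
        if key not in self.maybe_keys:         # "Bloom says: no!"
            return False
        rowid = bisect.bisect_left(self.keys, key)  # search key column for the offset
        if rowid == len(self.keys) or self.keys[rowid] != key:
            return False
        self.delta_ms[rowid] = new_pay         # base data is left untouched
        return True

    def scan(self) -> List[Tuple[str, str]]:
        # Deltas are applied by ordinal offset, with no key comparisons.
        return [(key, self.delta_ms.get(i, pay))
                for i, (key, pay) in enumerate(zip(self.keys, self.pay))]


def update(rowsets: List[DiskRowSet], key: str, new_pay: str) -> bool:
    """Try each rowset in turn; a row lives in exactly one of them."""
    return any(drs.update_pay(key, new_pay) for drs in rowsets)


drs1 = DiskRowSet([("todd", "$1000")])
drs2 = DiskRowSet([("doug", "$1B")])
updated = update([drs2, drs1], "todd", "$1M")
```

The scan never compares keys across rowsets, which is the contrast with the LSM read path drawn in the slides.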
Kudu storage – Delta flushes
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta MS
Delta MS
0: pay=foo
REDO DeltaFile
A REDO delta indicates how to
transform between the ‘base data’
(columnar) and a later version
base data
base data
Kudu storage – Major delta compaction
name pay role
DiskRowSet (pre-compaction)
Delta MS
REDO DeltaFile REDO DeltaFile REDO DeltaFile
Many deltas accumulate: lots of delta application work on reads
name pay role
DiskRowSet (post-compaction)
Delta MS
Unmerged REDO deltas
UNDO deltas
If a column has few updates, it doesn’t need to be re-written: those deltas are maintained in a new DeltaFile
Merge updates for columns with high update percentage
base data
Kudu storage – RowSet Compactions
DRS 1 (32MB)
[PK=alice], [PK=joe], [PK=linda], [PK=zach]
DRS 2 (32MB)
[PK=bob], [PK=jon], [PK=mary] [PK=zeke]
DRS 3 (32MB)
[PK=carl], [PK=julie], [PK=omar] [PK=zoe]
DRS 4 (32MB) DRS 5 (32MB) DRS 6 (32MB)
[alice, bob, carl, joe] [jon, julie, linda, mary] [omar, zach, zeke, zoe]
Reorganize rows to avoid rowsets
with overlapping key ranges
Writes for “chris” have to perform
bloom lookups on all 3 RS
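The effect of the compaction in this example can be reproduced with a short sketch. The helper names `compact` and `candidate_rowsets` are hypothetical, and rowsets are modelled as sorted lists of primary keys only.

```python
import heapq
from typing import List


def compact(rowsets: List[List[str]], rows_per_set: int) -> List[List[str]]:
    """Merge sorted rowsets and re-split them into non-overlapping key ranges."""
    merged = list(heapq.merge(*rowsets))
    return [merged[i:i + rows_per_set] for i in range(0, len(merged), rows_per_set)]


def candidate_rowsets(rowsets: List[List[str]], key: str) -> List[int]:
    """Indexes of rowsets whose [min, max] key range could contain the key."""
    return [i for i, rs in enumerate(rowsets) if rs[0] <= key <= rs[-1]]


before = [
    ["alice", "joe", "linda", "zach"],
    ["bob", "jon", "mary", "zeke"],
    ["carl", "julie", "omar", "zoe"],
]
after = compact(before, rows_per_set=4)
```

Before compaction a write for "chris" falls inside the key range of all three rowsets; afterwards only the first one needs a bloom lookup.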

Kudu Storage - Compactions
• Main Idea: Always be compacting!
• Compactions run continuously to prevent IO storms
• “Budgeted” RS compactions: What is the best way to spend X MBs IO?
• Physical/Logical decoupling: different replicas run compactions at different times
Deep Dive
Partitioning
• Kudu has flexible policies for distributing data among partitions
• Hash partitioning is built in, and can be combined with range partitioning
• Keys are ordered within a partition
• Key order matters
• Partitioning key order dictates distribution
• Primary key order affects how much data is read for scans
Primary Key selection
• Example - Time series data:
• “time” - timestamp
• “series” - {region, server, metric}
(series, time)
(us-east.appserver01.loadavg, 2016-05-09T15:14:00Z)
(us-east.appserver01.loadavg, 2016-05-09T15:15:00Z)
(us-west.dbserver03.rss, 2016-05-09T15:14:30Z)
(us-west.dbserver03.rss, 2016-05-09T15:14:30Z)
(time, series)
(2016-05-09T15:14:00Z, us-east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
(2016-05-09T15:15:00Z, us-east.appserver01.loadavg)
(2016-05-09T15:14:30Z, us-west.dbserver03.rss)
SELECT * WHERE series = 'us-east.appserver01.loadavg';
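Why the key order matters for this query can be shown with sorted tuples. The sketch assumes rows are stored sorted by primary key and counts how many rows each ordering has to examine; the function names are made up for illustration.

```python
import bisect
from typing import List, Tuple

samples = [
    ("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z"),
    ("us-east.appserver01.loadavg", "2016-05-09T15:15:00Z"),
    ("us-west.dbserver03.rss", "2016-05-09T15:14:30Z"),
]


def rows_read_series_first(rows: List[Tuple[str, str]], series: str) -> int:
    """(series, time) order: the predicate is a key prefix, so seek to a range."""
    keys = sorted(rows)
    lo = bisect.bisect_left(keys, (series,))
    hi = bisect.bisect_right(keys, (series, "\uffff"))
    return hi - lo


def rows_read_time_first(rows: List[Tuple[str, str]], series: str) -> int:
    """(time, series) order: matching rows are scattered, so every row is examined."""
    keys = sorted((time, name) for name, time in rows)
    return len(keys)
```

With `(series, time)` the scan touches only the two matching rows; with `(time, series)` it has to look at all three.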

Partitioning — By Time Range (inserts)
All inserts go to the latest partition
Big scans (across large time intervals) can be parallelized across many partitions
Partitioning — By Series Range
Inserts are spread among all partitions
Scans are over a single partition
Partitioning — By Series Range
Partitions can become unbalanced,
resulting in hot spotting
Partitioning — By Series Hash (inserts)
Inserts are spread among all partitions
Scans are over a single partition

Partitioning — By Series Hash
Partitions grow over time, eventually
becoming too big for a single server
Partitioning — By Series Hash + Time Range
Inserts are spread among all partitions
in the latest time range
Big scans (across large time intervals)
can be parallelized across partitions
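A combined hash and range scheme can be sketched as a function from a row to a (hash bucket, time range) pair. This is only an illustration: `zlib.crc32` is a stand-in and not the hash Kudu uses, and the bucket count and range bounds are invented.

```python
import bisect
import zlib
from typing import List, Tuple


def partition_for(series: str, timestamp: str, num_buckets: int,
                  range_bounds: List[str]) -> Tuple[int, int]:
    """Map a row to (hash bucket of series, index of its time range)."""
    bucket = zlib.crc32(series.encode("utf-8")) % num_buckets
    # range_bounds are sorted split points; ISO-8601 strings sort chronologically.
    range_index = bisect.bisect_right(range_bounds, timestamp)
    return bucket, range_index


bounds = ["2016-06-01T00:00:00Z"]  # assumed split point between two time ranges
may = partition_for("us-east.appserver01.loadavg", "2016-05-09T15:14:00Z", 4, bounds)
june = partition_for("us-east.appserver01.loadavg", "2016-06-02T00:00:00Z", 4, bounds)
```

Rows for one series always share a hash bucket, while new time ranges add fresh partitions, so inserts spread over the buckets of the latest range and old ranges stop growing.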
Next Steps
Get Started with
Kudu & Cloudera
Start Contributing
to Kudu
Thank you
David Alves – Apache Kudu PMC

Apache Kudu: Technical Deep Dive

  • 1. 1© Cloudera, Inc. All rights reserved. Apache Kudu Webinar Series Technical Deep Dive David Alves| Apache Kudu PMC | Cloudera
  • 2. Kudu Webinar Series Part 1: Lambda Architectures – Simplified by Apache Kudu A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can dramatically simplify real-time analytics. Part 2: Extending the Capabilities of Operational and Analytical Databases An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and Analytical databases can handle. Part 3: Data-in-Motion: Unlock the Value of Real-Time Data Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will discuss how to make it a reality. Part 4: Technical Deep-Dive into Apache Kudu An in-depth examination of the technical architecture and design of Apache Kudu, straight from a Kudu PMC Member.
  • 3. Updateable Analytic Storage Simple real-time analytics and updates with Apache Kudu Kudu: Storage for fast analytics on fast data • Simplified architecture for building real-time analytic applications • Designed for next-generation hardware for faster analytic performance across frameworks • Native Hadoop storage engine Flexibility for the right tools for the right use case in one platform • Only analytic database for big data with Kudu + Impala • Simple real-time applications with Kudu + Spark Use cases • Time series data • Machine data analytics • Online reporting STRUCTURED Sqoop UNSTRUCTURED Kafka, Flume PROCESS, ANALYZE, SERVE UNIFIED SERVICES RESOURCE MANAGEMENT YARN SECURITY Sentry, RecordService STORE INTEGRATE BATCH Spark, Hive, Pig MapReduce STREAM Spark SQL Impala SEARCH Solr OTHER Kite NoSQL HBase OTHER Object Store FILESYSTEM HDFS RELATIONAL Kudu
  • 4. HDFS Fast Scans, Analytics and Processing of Stored Data Fast On-Line Updates & Data Serving Arbitrary Storage (Active Archive) Fast Analytics (on fast-changing or frequently-updated data) Filling the Analytic Gap Unchanging Fast Changing Frequent Updates HBase Append-Only Real-Time Kudu Kudu fills the Gap Modern analytic applications often require complex data flow & difficult integration work to move data between HBase & HDFS Analytic Gap Pace of Analysis Pace of Data
  • 5. Better Together Kudu Benefits from Integration with the Apache Ecosystem Spark – Stream Processing for Kudu • Open standard for real-time stream processing • Effective for automating decision processes and machine learning • Use Cases include: Time Series Data & Machine Data Analytics Impala – High-Performance BI & SQL for Kudu • Open standard for interactive SQL queries • Powers analytic database workloads with flexibility, scale, and open architecture • Use Cases include: Time Series Data & Online Reporting
  • 6. Apache Kudu Community
  • 7. Why Kudu? Use Cases and Motivation
  • 8. Why Kudu? A simultaneous combination of sequential and random reads and writes Can you insert time series data in real time? How long does it take to prepare it for analysis? Can you get results and act fast enough to change outcomes? Can you handle large volumes of machine-generated data? Do you have the tools to identify problems or threats? Can your system do machine learning? How fast can you add data to your data store? Are you trading off the ability to do broad analytics for the ability to make updates? Are you retaining only part of your data? Time Series Data Machine Data Analytics Online Reporting
  • 9. Next generation hardware Cheaper and faster every year. Persistent memory (3D XPoint™) Kudu can take advantage of SSD and NVM using Intel’s NVM Library. RAM is cheaper and bigger every day. Kudu runs smoothly with huge RAM. Written in C++ to avoid GC issues. Modern CPUs are adding cores and SIMD width, not GHz. Kudu takes advantage of SIMD instructions and concurrent data structures. Solid-state Storage Cheaper, Bigger Memory Efficiency on Modern CPUs
  • 10. Apache Kudu: Scalable and fast tabular storage Scalable • Tested up to 275 nodes (~3PB cluster) • Designed to scale to 1000s of nodes and tens of PBs Fast • Millions of read/write operations per second across cluster • Multiple GB/second read throughput per node Tabular • Represents data in structured tables like a relational database • Individual record-level access to 100+ billion row tables
  • 11. Deep Dive: Replication And Fault Tolerance
  • 12. Metadata • Replicated master • Acts as a tablet directory • Acts as a catalog (which tables exist, etc) • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets) • Caches all metadata in RAM for high performance • Client configured with master addresses • Asks master for tablet locations as needed and caches them
  • 13. Client Hey Master! Where is the row for ‘tlipcon’ in table “T”? It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, … UPDATE tlipcon SET col=foo Meta Cache T1: … T2: … T3: …
  • 14. Raft consensus TS A Tablet 1 (LEADER) Client TS B Tablet 1 (FOLLOWER) TS C Tablet 1 (FOLLOWER) WAL WAL WAL 2b. Leader writes local WAL 1a. Client->Leader: Write() RPC 2a. Leader->Followers: UpdateConsensus() RPC 3. Follower: write WAL 4. Follower->Leader: success 3. Follower: write WAL 5. Leader has achieved majority 6. Leader->Client: Success!
  • 15. Fault tolerance • Transient FOLLOWER failure: • Leader can still achieve majority • Restart follower TS within 5 min and it will rejoin transparently • Transient LEADER failure: • Followers expect to hear a heartbeat from their leader every 1.5 seconds • 3 missed heartbeats: leader election! • New LEADER is elected from remaining nodes within a few seconds • Restart within 5 min and it rejoins as a FOLLOWER • N replicas handle (N-1)/2 failures
  • 16. Fault tolerance (2) • Permanent failure (Tablet Copy): • Leader notices that a follower has been dead for 5 minutes • Evicts that follower • Master selects a new replica (in PRE_VOTER state) • Leader copies the data over to the new one, which joins as a new FOLLOWER • New replica is assigned VOTER state • Cluster change_config lets you add/remove tablet servers to a tablet’s configuration (only supported from CLI tool)
  • 17. Deep Dive: Columnar Storage
  • 18. Columnar Storage • Improved Scan performance • Predicates (e.g. WHERE time >= 2016-05-08T00:00:00) can be evaluated without reading unnecessary data from other columns • Efficient encodings can dramatically improve compression ratios, which reduces effective IO load • Typed, homogeneous data plays well to modern processor strengths (vectorization, pipelining) • At the cost of random access performance • single row access requires a number of seeks proportional to the number of columns • BUT, random access is becoming cheaper (Cheap RAM, SSDs, NVRAM)
  • 19. Row Storage {23059873, newsycbot, 1442865158, Visual exp…} {22309487, RideImpala, 1442828307, Introducing …} … Tweet_id, user_name, created_at, text Scans have to read all the data, no encodings
  • 20. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text
  • 21. Columnar storage {25059873, 22309487, 23059861, 23010982} Tweet_id {newsycbot, RideImpala, fastly, llvmorg} User_name {1442865158, 1442828307, 1442865156, 1442865155} Created_at {Visual exp…, Introducing .., Missing July…, LLVM 3.7….} text SELECT COUNT(*) FROM tweets WHERE user_name = ‘newsycbot’; Only read 1 column 1GB 2GB 1GB 200GB …AND door open to big IO gains with compression and encoding
  • 22. Available Encodings: • Dictionary (Strings, Binary) • Bitshuffle (Numeric) • RLE (Numeric, Bool) Available Compression: • Snappy • LZ4 • ZLib Columnar Storage – Other Encodings/Compression Kudu 1.3 ships with “good” defaults for most cases
  • 23. Deep Dive: Write and Read Paths
  • 24. LSM vs Kudu • LSM – Log Structured Merge (Cassandra, HBase, etc) • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable) • Reads perform an on-the-fly merge of all on-disk HFiles • Kudu • Shares some traits (memstores, compactions) • More complex. • Slower writes in exchange for faster reads (especially scans)
  • 25. LSM Insert Path MemStore INSERT Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1” flush
  • 26. LSM Insert Path MemStore INSERT Row=r1 col=c1 val=“blah2” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“blah2” Row=r2 col=c2 val=“2” flush HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“1”
  • 27. LSM Update path MemStore UPDATE HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Note: all updates are “fully decoupled” from reads. Random-write workload is transformed to fully sequential!
  • 28. LSM Read path MemStore HFile 1 Row=r1 col=c1 val=“blah” Row=r1 col=c2 val=“2” HFile 2 Row=r2 col=c1 val=“v2” Row=r2 col=c2 val=“5” Row=r2 col=c1 val=“newval” Merge based on string row keys R1: c1=blah c2=2 R2: c1=newval c2=5 …. CPU intensive! Must always read rowkeys Any given row may exist across multiple HFiles: must always merge! The more HFiles to merge, the slower it reads
  • 29. Kudu storage – Inserts and Flushes MemRowSet INSERT(“todd”, “$1000”,”engineer”) name pay role DiskRowSet 1 flush
  • 30. Kudu storage – Inserts and Flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 INSERT(“doug”, “$1B”, “Hadoop man”) flush
  • 31. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS Each DiskRowSet has its own DeltaMemStore to accumulate updates base data base data
  • 32. Kudu storage - Updates MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS UPDATE set pay=“$1M” WHERE name=“todd” Is the row in DiskRowSet 2? (check bloom filters) Is the row in DiskRowSet 1? (check bloom filters) Bloom says: no! Bloom says: maybe! Search key column to find offset: rowid = 150 150: col 1=$1M base data
  • 33. Kudu storage – Read path MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 150: pay=$1M Read rows in DiskRowSet 2 Then, read rows in DiskRowSet 1 Any row is only in exactly one DiskRowSet– no need to merge cross-DRS! Updates are merged based on ordinal offset within DRS: array indexing, no string compares base data base data
  • 34. Kudu storage – Delta flushes MemRowSet name pay role DiskRowSet 1 name pay role DiskRowSet 2 Delta MS Delta MS 0: pay=foo REDO DeltaFile Flush A REDO delta indicates how to transform between the ‘base data’ (columnar) and a later version base data base data
  • 35. Kudu storage – Major delta compaction name pay role DiskRowSet (pre-compaction) Delta MS REDO DeltaFile REDO DeltaFile REDO DeltaFile Many deltas accumulate: lots of delta application work on reads name pay role DiskRowSet (post-compaction) Delta MS Unmerged REDO deltas UNDO deltas If a column has few updates, doesn’t need to be re-written: those deltas maintained in new DeltaFile Merge updates for columns with high update percentage base data
  • 36. Kudu storage – RowSet Compactions DRS 1 (32MB) [PK=alice], [PK=joe], [PK=linda], [PK=zach] DRS 2 (32MB) [PK=bob], [PK=jon], [PK=mary] [PK=zeke] DRS 3 (32MB) [PK=carl], [PK=julie], [PK=omar] [PK=zoe] DRS 4 (32MB) DRS 5 (32MB) DRS 6 (32MB) [alice, bob, carl, joe] [jon, julie, linda, mary] [omar, zach, zeke, zoe] Reorganize rows to avoid rowsets with overlapping key ranges Writes for “chris” have to perform bloom lookups on all 3 RS
  • 37. Kudu Storage - Compactions • Main Idea: Always be compacting! • Compactions run continuously to prevent IO storms • “Budgeted” RS compactions: What is the best way to spend X MBs IO? • Physical/Logical decoupling: different replicas run compactions at different times
  • 38. Deep Dive Partitioning
  • 39. Partitioning • Kudu has flexible policies for distributing data among partitions • Hash partitioning is built in, and can be combined with range partitioning • Keys are ordered within a partition • Key order matters • Partitioning key order dictates distribution • Primary key order affects how much data is read for scans
  • 40. Primary Key selection • Example - Time series data: • “time” - timestamp • “series” - {region, server, metric} (us-east.appserver01.loadavg, 2016-05-09T15:14:00Z) (us-east.appserver01.loadavg, 2016-05-09T15:15:00Z) (us-west.dbserver03.rss, 2016-05-09T15:14:30Z) (us-west.dbserver03.rss, 2016-05-09T15:14:30Z) (2016-05-09T15:14:00Z, us-east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) (2016-05-09T15:15:00Z, us-east.appserver01.loadavg) (2016-05-09T15:14:30Z, us-west.dbserver03.rss) (series, time) (time, series) SELECT * WHERE series = ‘us-east.appserver01.loadavg’;
  • 41. Partitioning — By Time Range (inserts) All Inserts go to Latest Partition Big scans (across large time intervals) can be parallelized across many partitions
  • 42. Partitioning — By Series Range Inserts are spread among all partitions Scans are over a single partition
  • 43. Partitioning — By Series Range Partitions can become unbalanced, resulting in hot spotting
  • 44. Partitioning — By Series Hash (inserts) Inserts are spread among all partitions Scans are over a single partition
  • 45. Partitioning — By Series Hash Partitions grow over time, eventually becoming too big for a single server
  • 46. Partitioning — By Series Hash + Time Range Inserts are spread among all partitions in the latest time range Big scans (across large time intervals) can be parallelized across partitions
  • 47. Next Steps Get Started with Kudu & Cloudera Start Contributing to Kudu
  • 48. Thank you David Alves – Apache Kudu PMC @dribeiroalves