Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic Databases  

© Cloudera, Inc. All rights reserved.
Apache Kudu Webinar Series
Extending the Capabilities of Cloudera’s
Operational and Analytic Databases
Alex Gutow & Ryan Lippert | Cloudera

Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can
dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and
Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will
discuss how
https://www.cloudera.com/about-cloudera/events/webinars/kudu-webinar-series.html

Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic
applications
• Designed for next-generation hardware for faster analytic
performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use
case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
INTEGRATE
• Structured: Sqoop
• Unstructured: Kafka, Flume
PROCESS, ANALYZE, SERVE
• Batch: Spark, Hive, Pig, MapReduce
• Stream: Spark
• SQL: Impala
• Search: Solr
• Other: Kite
UNIFIED SERVICES
• Resource management: YARN
• Security: Sentry, RecordService
STORE
• NoSQL: HBase
• Other: Object Store
• Filesystem: HDFS
• Relational: Kudu

HDFS
Fast Scans, Analytics
and Processing of
Stored Data
Fast On-Line
Updates &
Data Serving
Arbitrary Storage
(Active Archive)
Fast Analytics
(on fast-changing or
frequently-updated data)
Filling the Analytic Gap
Unchanging
Fast Changing
Frequent Updates
HBase
Append-Only
Real-Time
Kudu
Kudu fills the Gap
Modern analytic
applications often
require complex data
flow & difficult
integration work to
move data between
HBase & HDFS
Analytic
Gap
Pace of Analysis
Pace of Data

Better Together
Kudu Benefits from Integration with the Apache Ecosystem
Spark – Stream Processing for Kudu
• Open standard for real-time stream processing
• Effective for automating decision processes and machine
learning
• Use Cases include: Time Series Data & Machine Data
Analytics
Impala – High-Performance BI & SQL for Kudu
• Open standard for interactive SQL queries
• Powers analytic database workloads with flexibility, scale, and
open architecture
• Use Cases include: Time Series Data & Online Reporting

Apache Kudu Availability
Data-driven applications
to deliver real-time insights.
Operational
Database
Explore, analyze, and
understand all your data.
Analytic
Database
Process data, develop and
serve predictive models.
Data
Engineering

Kudu for the Operational Database
Expand Addressable Use Cases

Operational
Database
Durable, low latency storage for web
applications, message stores, and
mission critical operational activities.
Web-Scale Data Depot
Identifying meaningful events
based on multiple data streams
and taking action.
Complex Event Processing
Use data and current/past
events to score and serve the
likelihood of subsequent events.
Model Scoring/Serving

Storage
Processing/
Exploration
Unique Components
Cloudera’s Operational Database
Fast/random reads and writes
via a high-performance,
distributed NoSQL data store
HBase
Fast analytics on fast data
with a relational structure
Kudu
Faceted, text-based search for
data exploration and
democratization
Cloudera
Search
Powerful and flexible
processing, streaming, and
SQL
Spark
Multi-Storage
Multi-Environment
Encryption, Key Trustee
Navigator
Storage & Governance

Kudu Keeps Your Business Operational
Machine Data
Analytics
Inserts, scans, lookups
Workload
Real-time data inserts, combined with the ability to analyze
trends, help identify potential problems.
Kudu identifies trouble through:
• Unlimited storage, yielding better historic trend
analysis
• Fast inserts to enable an up-to-date network view
• Fast scans identify/flag undesired states for remedy
Examples
Network threat detection; network health
monitoring; application performance monitoring

Kudu Increases the Value of Time Series Data
Time Series
Inserts, updates, scans, lookups
Workload
Examples
Streaming market data; IoT; fraud detection &
prevention; risk monitoring; connected cars
Time series data is most valuable if you can
analyze it to change outcomes in real time.
Kudu simultaneously enables:
• Time series data inserted/updated as it arrives
• Analytic scans to find trends on fresh time series
data
• Lookups to quickly visit the point in time where an
event occurred
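
The three access patterns above (insert/update on arrival, analytic scan, point lookup) can be illustrated with a minimal in-memory sketch. This is plain Python using only the standard library, not the Kudu client API. The TimeSeriesTable class, its method names, and the sample sensor readings are hypothetical; they stand in for a table whose primary key is assumed to be (series, timestamp).

```python
from bisect import bisect_left, bisect_right, insort


class TimeSeriesTable:
    """Hypothetical in-memory stand-in for a table keyed by (series, timestamp)."""

    def __init__(self):
        self._rows = {}   # (series, ts) -> value
        self._keys = []   # sorted list of (series, ts) primary keys

    def upsert(self, series, ts, value):
        """Insert a new row, or update the row in place if the key exists."""
        key = (series, ts)
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = value

    def lookup(self, series, ts):
        """Point lookup by primary key; returns None if the row is absent."""
        return self._rows.get((series, ts))

    def scan(self, series, start_ts, end_ts):
        """Range scan over [start_ts, end_ts] for one series, in time order."""
        lo = bisect_left(self._keys, (series, start_ts))
        hi = bisect_right(self._keys, (series, end_ts))
        return [(ts, self._rows[(s, ts)]) for s, ts in self._keys[lo:hi]]


table = TimeSeriesTable()
table.upsert("sensor-a", 100, 20.5)
table.upsert("sensor-a", 101, 21.0)
table.upsert("sensor-a", 102, 35.0)
table.upsert("sensor-b", 100, 7.0)
table.upsert("sensor-a", 101, 21.4)  # a late correction updates the row in place

# Analytic scan over fresh data, then a simple trend statistic.
recent = table.scan("sensor-a", 101, 102)
average = sum(value for _, value in recent) / len(recent)
```

The point of the sketch is that one store serves all three patterns, so the scan sees the corrected value without a separate batch step.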

Operational DB: Real-Time Architecture
Driving the Model Through Machine Learning
Kafka
Spark
Streaming
Kudu
Spark MLlib
Application
Data
Sources
Individual Session
Full Model/Learning
Genesis
Spark
1 Event
Occurs
2
Messaging
3
Stream
Processing
4
Land in
RDBMS
5
Apply ML
Libraries

MLlib & K-Means: Defining Microsegments via Machine Learning
Height
Weight
Height
Weight
1 2
Height
Weight
3
Height
Weight
4
L
M
S
XL
L
M
S
XS
Near
Custom
?
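
The height/weight panels above illustrate k-means clustering. As a rough illustration of what the algorithm does, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not Spark MLlib code; MLlib's KMeans would be the choice at scale. The sample measurements, the starting centroids, and the size labels are made up for illustration.

```python
import math


def kmeans(points, centroids, max_iters=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, assignments


def nearest(point, centroids):
    """Index of the centroid closest to a new point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


# Made-up (height cm, weight kg) measurements.
points = [(160, 52), (162, 55), (165, 58),
          (174, 70), (176, 72), (178, 75),
          (188, 88), (190, 92), (192, 95)]
centroids, labels = kmeans(points, [points[0], points[3], points[6]])

sizes = ["S", "M", "L"]  # hypothetical names for the three segments
new_customer_size = sizes[nearest((175, 71), centroids)]
```

Adding more centroids splits the population into finer microsegments, which is the progression the four panels show.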

Determining the Next Best Action
Kafka
Spark
Streaming
Kudu
Spark MLlib
Application
Data
Sources
Individual Session
1
Data
Processed
Genesis
Spark
2
Request Processed/
Kudu Queried
3
4
Results
Returned
Results
Processed
5
Processed
Data
Returned
Full Model/Learning

Determining the Next Best Action
Step 1: Data Processed
Apache Spark processes the data from the event (IoT, clickstream, markets, etc.),
which may involve keeping a running list of the last X events
Step 2: Request Processed/Kudu Queried
A Spark application uses the data gathered in step one to query Kudu’s database
in a predefined manner to look for similar patterns defined via machine learning
Step 3: Kudu Results Returned
Kudu returns the results from the query in step 2 back to Spark to determine what
needs to be returned to the application
Step 4: Results Processed
Spark associates the results from Kudu with the information stored from the
current event to determine the next step to feed back to the application
Step 5: Processed Data Returned
The machine-generated, best possible outcome is prescribed and served to the
application
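
The five steps above can be sketched as a small request loop. This is a plain-Python illustration with in-memory stand-ins, not Spark or Kudu code. The window size, the PATTERN_TABLE contents, the event names, and the action names are all hypothetical; in the architecture on the slides the lookup would be a query against Kudu and the patterns would come from the machine-learning model.

```python
from collections import deque

WINDOW_SIZE = 3  # the "last X events" kept per session (value is an assumption)

# Hypothetical stand-in for a Kudu table mapping learned patterns to actions.
PATTERN_TABLE = {
    ("view", "view", "add_to_cart"): "offer_free_shipping",
    ("search", "search", "search"): "suggest_popular_items",
}
DEFAULT_ACTION = "no_action"


def query_patterns(pattern):
    """Steps 2-3: run the predefined lookup and return the result (or None)."""
    return PATTERN_TABLE.get(pattern)


class Session:
    def __init__(self):
        self.recent = deque(maxlen=WINDOW_SIZE)

    def handle_event(self, event):
        # Step 1: process the event and update the running window.
        self.recent.append(event)
        # Steps 2-3: look up the current window in the pattern store.
        match = query_patterns(tuple(self.recent))
        # Step 4: pair the lookup result with the current event.
        action = match if match is not None else DEFAULT_ACTION
        # Step 5: return the prescribed action to the application.
        return {"event": event, "action": action}


session = Session()
responses = [session.handle_event(e) for e in ["view", "view", "add_to_cart", "view"]]
actions = [r["action"] for r in responses]
```

Only the third event completes a known pattern, so it is the only one that triggers a non-default action.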

Operational DB: Cybersecurity Use Case
Discovering APT in Your Network
Kafka
Spark
Streaming
Kudu
Spark MLlib
Application
Data
Sources
Individual Session
Rogue User
Spark
Full Model/Learning
Data Request Sent For Stream Processing
Data Cleaned/Ordered/Processed, Then
Delivered to Kudu for Modelling
Access verified, initial data delivered,
subsequent requests aggregated and
compared to standard user/role behavior
Illustrative; models will likely have >2 dimensions

Kudu for the Analytic Database
Enabling Real-Time Updates

Analytic
Database
More data of all types is being
tapped for analytics, across
environments
Self-Service BI & Data
Open up new possibilities
for real-time insights as
data changes
Real-Time Analysis
BI & analytics are critical but
only tell part of the story. Get
more value by sharing data
across workloads
Converged Workloads

Cloudera’s Analytic Database
Identify, offload, &
optimize workloads to
Hadoop
Navigator
Optimizer
Intelligent SQL editor
Hue
Audit, lineage,
encryption, key
management, & policy
lifecycles
Navigator
Integration with the
leading BI tools
BI Partners
Interactive query engine
for BI & SQL analytics
Impala
Large-scale ETL & batch
processing engine
Hive-on-Spark
Multi-Storage, Multi-Environment
Data Storage for Fast &
Changing Data
Kudu

Anatomy of an Analytic Database
Cloudera Decoupled by Design
Query Engine
Storage Engine
Catalog
Query Engine
(Impala)
Catalog
(HMS)
Monolithic Analytic Database
Modern Analytic Database
Storage
(Kudu)
Storage
(S3)
Storage
(HDFS)

Key Benefits
An analytic database designed for Hadoop
High-Performance BI and SQL Analytics
Flexibility for Data and Use Case Variety
Cost-effective Scale for Today and Tomorrow
Go Beyond SQL with an Open Architecture

Handle Time Series Data in Real-Time
Time Series
Real-time analytics on live data
Enables:
• Time series data inserted/updated on arrival
• Analytic scans to find trends on fresh data
• Point-in-time lookups to quickly find where an
event occurred
Examples
Streaming market data, IoT, fraud detection &
prevention, risk monitoring, connected cars
Workload

More Versatility in Online Reporting
Remove the limits of online reporting
Enables:
• Always-on, unlimited storage, eliminating archival
needs
• Fast inserts/updates to keep data fresh
• Fast lookups and analytic scans with one data
store
Examples
Operational data store
Workload
Online
Reporting

Fast Analytics with Updates (pre-Kudu)
Complexity & Latency
Considerations:
• How do I handle failure during this
process?
• How often do I reorganize incoming
streaming data into a format
appropriate for reporting?
• When reporting, how do I see data
that has not yet been reorganized?
• How do I ensure that important
jobs aren’t interrupted by
maintenance?
New Partition
Most Recent Partition
Historic Data
HBase
Parquet File
Have we
accumulated
enough data?
Reorganize
HBase file
into Parquet
• Wait for running operations to complete
• Define new Impala partition referencing
the newly written Parquet file
Incoming Data
(Messaging
System)
Reporting
Request
Impala on HDFS
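
To make the considerations on this slide concrete, here is a minimal sketch of the two-store pattern in plain Python. It is not HBase, Parquet, or Impala code: the LambdaStyleStore class, the flush threshold, and the (timestamp, amount) rows are hypothetical stand-ins for the components named above.

```python
FLUSH_THRESHOLD = 3  # "Have we accumulated enough data?" (value is an assumption)


class LambdaStyleStore:
    """Hypothetical stand-in for the HBase-plus-Parquet pipeline on the slide."""

    def __init__(self):
        self.recent = []      # mutable buffer, standing in for HBase
        self.partitions = []  # immutable batches, standing in for Parquet files

    def ingest(self, row):
        self.recent.append(row)
        if len(self.recent) >= FLUSH_THRESHOLD:
            self._reorganize()

    def _reorganize(self):
        # Write the buffer out as a new read-optimized partition, then register it.
        # A failure between these two steps is one of the hazards the slide lists.
        new_partition = tuple(sorted(self.recent))
        self.partitions.append(new_partition)
        self.recent = []

    def report_total(self):
        # Reporting must read both stores or it misses rows not yet reorganized.
        historic = sum(amount for part in self.partitions for _, amount in part)
        fresh = sum(amount for _, amount in self.recent)
        return historic + fresh


store = LambdaStyleStore()
for row in [(1, 10), (2, 20), (3, 30), (4, 40)]:
    store.ingest(row)

total = store.report_total()
historic_only = sum(amount for part in store.partitions for _, amount in part)
```

A report that reads only the reorganized partitions undercounts until the next flush, which is the gap that a single updateable store removes.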

Real-Time Analytics Today (with Kudu)
Simpler Architecture, Superior Performance
Impala on Kudu
Incoming Data
(Messaging
System)
Reporting
Request

LOWER BUSINESS RISKS
Re-architected system to meet critical
latency requirements for fraud detection
• Would have been cost-prohibitive &
slower with legacy system
• Single platform for RT fraud detection
alerts and NRT executive monitoring
dashboards
• Achieved <2s response time for SQL
queries
Credit Card
Processing System

Needed simplified system for more and
faster analysis
• Understand trends better with more
data & detect/respond to anomalies
faster
• Single platform for both analytics and
operational reporting
• Met compliance requirements
Healthcare Services
Provider

Demo

Connected Car Demo Architecture
Data
Generator
Spark
Streaming
Impala
Kafka
Kafka
• Time
• VIN
• Miles
• xAccel
• yAccel
• zAccel
• Speed
• Brakes
• LaneDeparture
• Signal
• CollisionDetected
• HazardDetected
• Latitude
• Longitude
Kudu
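
The fields listed above could be modeled as a single record type. The sketch below is plain Python and only an illustration: the field names follow the slide (converted to snake_case), but every type, the JSON encoding, and the sample values are assumptions, since the slides do not state the demo's actual schema or message format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class CarEvent:
    # Field names follow the slide; every type here is an assumption.
    time: int                 # assumed epoch milliseconds
    vin: str
    miles: float
    x_accel: float
    y_accel: float
    z_accel: float
    speed: float
    brakes: bool
    lane_departure: bool
    signal: bool
    collision_detected: bool
    hazard_detected: bool
    latitude: float
    longitude: float


# Made-up sample event.
event = CarEvent(
    time=1488326400000, vin="TESTVIN0000000001", miles=12034.5,
    x_accel=0.1, y_accel=-0.2, z_accel=9.8, speed=62.0,
    brakes=False, lane_departure=False, signal=True,
    collision_detected=False, hazard_detected=False,
    latitude=37.77, longitude=-122.42,
)

# Encode as a message payload (JSON is an assumed encoding) and decode it again.
payload = json.dumps(asdict(event)).encode("utf-8")
decoded = CarEvent(**json.loads(payload))
```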

Data is Transforming Business
DRIVE
CUSTOMER INSIGHTS
IMPROVE
PRODUCT & SERVICES EFFICIENCY
LOWER
BUSINESS RISKS
MODERN
DATA ARCHITECTURE
DATA SCIENCE &
ENGINEERING
ANALYTIC
DATABASE
OPERATIONAL
DATABASE

Next Steps
Join us on March 8th to learn
more about data-in-motion
https://www.cloudera.com/about-cloudera/events/webinars/kudu-webinar-series.html
Get Started with
Kudu & Cloudera
Start Contributing
to Kudu
• www.cloudera.com/downloads
• https://blog.cloudera.com/?s=kudu
http://kudu.apache.org/

Thank you
See you on March 8th!
