This document discusses Hortonworks and its mission to enable modern data architectures through Apache Hadoop. It provides details on Hortonworks' commitment to open source development through Apache, engineering Hadoop for enterprise use, and integrating Hadoop with existing technologies. The document outlines Hortonworks' services and the Hortonworks Data Platform (HDP) for storage, processing, and management of data in Hadoop. It also discusses Hortonworks' contributions to Apache Hadoop and related projects as well as enhancing SQL capabilities and performance in Apache Hive.
Webinar - Accelerating Hadoop Success with Rapid Data Integration for the Mod... (Hortonworks)
Many enterprises are turning to Apache Hadoop to enable Big Data Analytics and reduce the costs of traditional data warehousing. Yet, it is hard to succeed when 80% of the time is spent on moving data and only 20% on using it. It’s time to swap the 80/20! The Big Data experts at Attunity and Hortonworks have a solution for accelerating data movement into and out of Hadoop that enables faster time-to-value for Big Data projects and a more complete and trusted view of your business. Join us to learn how this solution can work for you.
YARN Ready: Integrating to YARN with Tez Hortonworks
The YARN Ready webinar series helps developers integrate their applications with YARN, and Tez is one vehicle for doing that. We take a deep dive, including a code review, to help you get started.
Discover HDP 2.1: Interactive SQL Query in Hadoop with Apache Hive (Hortonworks)
In February 2013, the open source community launched the Stinger Initiative to improve speed, scale and SQL semantics in Apache Hive. After thirteen months of constant, concerted collaboration (and more than 390,000 new lines of Java code), Stinger is complete with Hive 0.13.
In this presentation, Carter Shanklin, Hortonworks director of product management, and Owen O'Malley, Hortonworks co-founder and committer to Apache Hive, discuss how Hive enables interactive query using familiar SQL semantics.
In 2012, we released Hortonworks Data Platform powered by Apache Hadoop and established partnerships with major enterprise software vendors including Microsoft and Teradata that are making enterprise ready Hadoop easier and faster to consume. As we start 2013, we invite you to join us for this live webinar where Shaun Connolly, VP of Strategy at Hortonworks, will cover the highlights of 2012 and the road ahead in 2013 for Hortonworks and Apache Hadoop.
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop (Hortonworks)
Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Hortonworks co-founder and committer to Apache Falcon, lead this 30-minute webinar, including:
+ Why you need Apache Falcon
+ Key new Falcon features
+ Demo: Defining data pipelines with replication; policies for retention and late data arrival; managing Falcon server with Ambari
Discover HDP 2.1: Apache Storm for Stream Data Processing in Hadoop (Hortonworks)
For the first time, Hortonworks Data Platform ships with Apache Storm for processing stream data in Hadoop.
In this presentation, Himanshu Bari, Hortonworks senior product manager, and Taylor Goetz, Hortonworks engineer and committer to Apache Storm, cover Storm and stream processing in HDP 2.1:
+ Key requirements of a streaming solution and common use cases
+ An overview of Apache Storm
+ Q & A
Discover HDP 2.1: Apache Solr for Hadoop Search (Hortonworks)
This document appears to be a presentation about Apache Solr for Hadoop search using the Hortonworks Data Platform (HDP). The agenda includes an overview of Apache Solr and Hadoop search, a demo of Hadoop search, and a question and answer section. The presentation discusses how Solr provides scalable indexing of data stored in HDFS and powerful search capabilities. It also includes a reference architecture showing how Solr integrates with Hadoop for search and indexing.
The document discusses The Apache Way Done Right and the success of Hadoop. It provides an overview of Apache Hadoop, including that it is a set of open source projects that transforms commodity hardware into a reliable system for storing and analyzing large amounts of data. It also discusses how Hadoop originated from the Nutch project and was adopted by early users like Yahoo, Facebook, and Twitter to handle big data challenges. Examples are given of how Yahoo used Hadoop for applications like the Webmap and personalized homepages.
Stinger.Next by Alan Gates of Hortonworks (Data Con LA)
The document discusses Hortonworks' Stinger initiative to deliver interactive SQL query capabilities in Hadoop. Stinger aims to improve Hive query performance by 100x to enable interactive query times through optimizations like SQL types, analytic functions, and the ORC file format (Phase 1). Future phases will integrate Hive with Apache Tez and introduce a new low-latency execution engine called LLAP to enable sub-second queries (Phase 2-3). The document provides details on various Stinger phases, optimizations, and capabilities to support a wider range of SQL semantics and use cases.
Hortonworks YARN Code Walk Through January 2014 (Hortonworks)
This slide deck accompanies the webinar recording of the YARN Code Walk Through on Jan. 22, 2014, available on Hortonworks.com/webinars under Past Webinars, or at:
https://hortonworks.webex.com/hortonworks/lsr.php?AT=pb&SP=EC&rID=129468197&rKey=b645044305775657
Hadoop Operations, Innovations and Enterprise Readiness with Hortonworks Data... (Hortonworks)
1. Hortonworks Data Platform 1.2 focuses on continued innovation with Apache Ambari and enhanced security and performance for Hive and HCatalog.
2. Key features include root cause analysis, usage heat maps, and improved ecosystem integration in Ambari, as well as enhanced security models and concurrency improvements.
3. Hortonworks ensures tight alignment with open source Apache projects by certifying the latest stable components and contributing leadership and code back to projects.
This document contains a presentation about using open source software and commodity hardware to process big data in a cost effective manner. It discusses how Apache Hadoop can be used to collect, store, process and analyze large amounts of data without expensive proprietary software or hardware. The presentation provides examples of how Hadoop is being used by various companies and explores different approaches for refining, exploring and enriching data with Hadoop.
YARN webinar series: Using Scalding to write applications to Hadoop and YARN (Hortonworks)
This webinar focuses on introducing Scalding for developers and writing applications for Hadoop and YARN using Scalding. Guest speaker Jonathan Coveney from Twitter provides an overview, use cases, limitations, and core concepts.
Discover HDP 2.2: Data storage innovations in Hadoop Distributed Filesystem (... (Hortonworks)
Hortonworks Data Platform 2.2 includes HDFS for data storage. In this 30-minute webinar, we discussed data storage innovations, including heterogeneous storage, encryption, and operational security enhancements.
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance (Hortonworks)
Hortonworks Data Platform 2.2 includes Apache Falcon for Hadoop data governance. In this 30-minute webinar, we discussed why the enterprise needs Falcon for governance, and demonstrated data pipeline construction, policies for data retention and management with Ambari. We also discussed new innovations including: integration of user authentication, data lineage, an improved interface for pipeline management, and the new Falcon capability to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3.
Web Briefing: Unlock the power of Hadoop to enable interactive analytics (Kognitio)
This document provides an agenda and summaries for a web briefing on unlocking the power of Hadoop to enable interactive analytics and real-time business intelligence. The agenda includes demonstrations on SQL and Hadoop with in-memory acceleration, interactive analytics with Hadoop, and modern data architectures. It also includes presentations on big data drivers and patterns, interoperating Hadoop with existing data tools, and using Hadoop to power new targeted applications.
Azure Cafe Marketplace with Hortonworks March 31 2016 (Joan Novino)
Azure Big Data: “Got Data? Go Modern and Monetize”.
In this session you will learn how Hortonworks Data Platform (HDP), architected, developed, and built completely in the open, provides an enterprise-ready data platform for adopting a Modern Data Architecture.
Hortonworks provides an open source Apache Hadoop distribution called Hortonworks Data Platform (HDP). Their mission is to enable modern data architectures through delivering enterprise Apache Hadoop. They have over 300 employees and are headquartered in Palo Alto, CA. Hortonworks focuses on driving innovation through the open source Apache community process, integrating Hadoop with existing technologies, and engineering Hadoop for enterprise reliability and support.
Apache Ambari is a single framework for IT administrators to provision, manage and monitor a Hadoop cluster. Apache Ambari 1.7.0 is included with Hortonworks Data Platform 2.2.
In this 30-minute webinar, Hortonworks Product Manager Jeff Sposetti and Apache Ambari committer Mahadev Konar discussed new capabilities including:
+ Improvements to Ambari core, such as support for ResourceManager HA
+ Extensions to the Ambari platform, introducing Ambari Administration and Ambari Views
+ Enhancements to Ambari Stacks: dynamic configuration recommendations and validations via a "Stack Advisor"
Teradata - Presentation at Hortonworks Booth - Strata 2014 (Hortonworks)
Hortonworks and Teradata have partnered to provide a clear path to Big Analytics via stable and reliable Hadoop for the enterprise. The Teradata® Portfolio for Hadoop is a flexible offering of products and services for customers to integrate Hadoop into their data architecture while taking advantage of the world-class service and support Teradata provides.
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
Scalding is a scala DSL for Cascading. Run on Hadoop, it’s a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics.
In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.
Boost Performance with Scala – Learn From Those Who’ve Done It! Hortonworks
This document provides information about using Scalding on Tez. It begins with prerequisites for using Scalding on Tez, including having a YARN cluster, Cascading 3.0, and the TEZ runtime library in HDFS. It then discusses setting memory and Java heap configuration flags for Tez jobs in Scalding. The document provides a mini-tutorial on using Scalding on Tez, covering build configuration, job flags, and challenges encountered in practice like Guava version mismatches and issues with Cascading's Tez registry. It also presents a word count plus example Scalding application built to run on Tez. The document concludes with some tips for debugging Tez jobs in Scalding using Cascading's
Boost Performance with Scala – Learn From Those Who’ve Done It! Cécile Poyet
This document provides information about using Scalding on Tez. It begins with prerequisites for using Scalding on Tez, including having a YARN cluster, Cascading 3.0, and the TEZ runtime library in HDFS. It then discusses setting memory and Java heap configuration flags for Tez jobs run through Scalding. The document provides a mini-howto for using Scalding on Tez in two steps - configuring the build.sbt and assembly.sbt files and setting some job flags. It discusses challenges encountered in practice and provides tips and an example Scalding on Tez application.
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next (Hortonworks)
The document discusses new features in Apache Hive 0.14 that improve SQL query performance. It introduces a cost-based optimizer that can optimize join orders, enabling faster query times. An example TPC-DS query is shown to demonstrate how the optimizer selects an efficient join order based on statistics about table and column sizes. Faster SQL queries are now possible in Hive through this query optimization capability.
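The join-ordering idea behind a cost-based optimizer can be sketched in a few lines: given row-count statistics, prefer orders that keep intermediate results small. The table names, row counts, and uniform selectivity below are illustrative assumptions, not Hive's actual cost model:

```python
# Sketch of cost-based join ordering: pick the left-deep join order that
# keeps intermediate results smallest, using per-table row-count stats.
# Table names, row counts, and the flat 10% selectivity are hypothetical.
from itertools import permutations

def estimate_cost(order, rows, selectivity=0.1):
    """Sum of estimated intermediate result sizes for a left-deep order."""
    cost, size = 0, rows[order[0]]
    for table in order[1:]:
        size = size * rows[table] * selectivity  # crude join-size estimate
        cost += size
    return cost

def best_join_order(rows):
    return min(permutations(rows), key=lambda o: estimate_cost(o, rows))

stats = {"store_sales": 2_880_000, "date_dim": 73_049, "item": 18_000}
print(best_join_order(stats))  # the large fact table joins last
```

Joining the two small dimension tables first keeps the first intermediate result small, which is exactly the kind of decision the statistics-driven optimizer makes.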
This document discusses modern data architecture and Apache Hadoop's role within it. It presents WANdisco and its Non-Stop Hadoop solution, which extends HDFS across multiple data centers to provide 100% uptime for Hadoop deployments. Non-Stop Hadoop uses WANdisco's patented distributed coordination engine to synchronize HDFS metadata across sites separated by wide area networks, enabling continuous availability of HDFS data and global HDFS deployments.
Hortonworks provides an overview of their Tez framework for improving Hadoop query processing. Tez aims to accelerate queries by expressing them as dataflow graphs that can be optimized, rather than relying solely on MapReduce. It also aims to empower users by allowing flexible definition of data pipelines and composition of inputs, processors, and outputs. Early results show a 100x speedup on benchmark queries compared to traditional MapReduce.
The document provides an overview of Hadoop, including:
- What Hadoop is and its core modules like HDFS, YARN, and MapReduce.
- Reasons for using Hadoop like its ability to process large datasets faster across clusters and provide predictive analytics.
- When Hadoop should and should not be used, for example large, diverse datasets (a good fit) versus low-latency real-time analytics (a poor fit).
- Options for deploying Hadoop including as a service on cloud platforms, on infrastructure as a service providers, or on-premise with different distributions.
- Components that make up the Hadoop ecosystem like Pig, Hive, HBase, and Mahout.
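The MapReduce module listed above can be illustrated with a toy single-process word count; this sketches only the programming model (map, shuffle, reduce), not Hadoop's distributed runtime:

```python
# Toy MapReduce word count: map emits (word, 1) pairs, shuffle groups
# pairs by key, reduce sums each group. Hadoop runs these same phases
# distributed across a cluster; this sketch runs them in one process.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data", "Hadoop processes data"]
print(reduce_phase(shuffle(map_phase(lines))))
```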
These slides accompany the Discover HDP 2.2 Webinar Series: Data Storage Innovations in HDFS, exploring heterogeneous storage, data encryption, and operational security.
This document discusses big data and analytics solutions from Microsoft. It introduces Azure Data Lake Store as a hyper-scale repository for big data analytics workloads that allows storing any data in its native format. It also describes Azure Data Lake Analytics as a service for big data analytics that offers distributed, parallel processing with U-SQL and integration with Visual Studio. The document provides examples of using Azure Data Lake Analytics to extract, transform, and analyze big data from various sources like call log files and customer tables.
1) The webinar covered Apache Hadoop on the open cloud, focusing on key drivers for Hadoop adoption like new types of data and business applications.
2) Requirements for enterprise Hadoop include core services, interoperability, enterprise readiness, and leveraging existing skills in development, operations, and analytics.
3) The webinar demonstrated Hortonworks Apache Hadoop running on Rackspace's Cloud Big Data Platform, which is built on OpenStack for security, optimization, and an open platform.
A Comprehensive Approach to Building your Big Data - with Cisco, Hortonworks ... (Hortonworks)
Companies in every industry look for ways to explore new data types and large data sets that were previously too big to capture, store and process. They need to unlock insights from data such as clickstream, geo-location, sensor, server log, social, text and video data. However, becoming a data-first enterprise comes with many challenges.
Join this webinar organized by three leaders in their respective fields and learn from our experts how you can accelerate the implementation of a scalable, cost-efficient and robust Big Data solution. Cisco, Hortonworks and Red Hat will explore how new data sets can enrich existing analytic applications with new perspectives and insights and how they can help you drive the creation of innovative new apps that provide new value to your business.
Mr. Slim Baltagi is a Systems Architect at Hortonworks, with over 4 years of Hadoop experience working on 9 Big Data projects: Advanced Customer Analytics, Supply Chain Analytics, Medical Coverage Discovery, Payment Plan Recommender, Research Driven Call List for Sales, Prime Reporting Platform, Customer Hub, Telematics, Historical Data Platform; with Fortune 100 clients and global companies from Financial Services, Insurance, Healthcare and Retail.
Mr. Slim Baltagi has worked in various architecture, design, development and consulting roles at Accenture, CME Group, TransUnion, Syntel, Allstate, TransAmerica, Credit Suisse, Chicago Board Options Exchange, Federal Reserve Bank of Chicago, CNA, Sears, USG, ACNielsen, and Deutsche Bahn.
Mr. Baltagi has also over 14 years of IT experience with an emphasis on full life cycle development of Enterprise Web applications using Java and Open-Source software. He holds a master’s degree in mathematics and is an ABD in computer science from Université Laval, Québec, Canada.
Languages: Java, Python, JRuby, JEE, PHP, SQL, HTML, XML, XSLT, XQuery, JavaScript, UML, JSON
Databases: Oracle, MS SQL Server, MySQL, PostgreSQL
Software: Eclipse, IBM RAD, JUnit, JMeter, YourKit, PVCS, CVS, UltraEdit, Toad, ClearCase, Maven, iText, Visio, Jasper Reports, Alfresco, Yslow, Terracotta, SoapUI, Dozer, Sonar, Git
Frameworks: Spring, Struts, AppFuse, SiteMesh, Tiles, Hibernate, Axis, Selenium RC, DWR Ajax , Xstream
Distributed Computing/Big Data: Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, HBase, R, RHadoop, Cloudera CDH4, MapR M7, Hortonworks HDP 2.1
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud (DataWorks Summit)
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to:
+ Solve big-data problems with Hadoop
+ Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data
+ Implement Hadoop using the HDS Hadoop reference architecture
For more information on the Hitachi Data Systems Hadoop Solution, please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat... (Amazon Web Services)
Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop.
This webinar will show you examples of how to use Amazon EMR with the MapR Distribution for Hadoop. You will learn how you can free yourself from the heavy lifting required to run Hadoop on-premises, and gain the advantages of using the cloud to increase flexibility and accelerate projects while lowering costs.
What we'll learn:
• See a live demonstration of how you can quickly and easily launch your first Hadoop cluster in a few steps.
• See examples of real-world applications and customer successes in production.
• Learn best practices for maximizing the benefits of using MapR with AWS.
Hortonworks.bdb
1. Hortonworks: We Do Hadoop. Our mission is to enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop. March 2014.
2. Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
Our Commitment:
+ Open Leadership: Drive innovation in the open exclusively via the Apache community-driven open source process
+ Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
+ Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners
4. Requirements for Enterprise Hadoop
1 Key Services: Platform, Operational and Data services essential for the enterprise
2 Skills: Leverage your existing skills: development, analytics, operations
3 Integration: Interoperable with existing data center investments
+ CORE SERVICES (storage, resource management, process, schedule): HDFS, YARN, MapReduce, Tez
+ DATA SERVICES (data access, data movement): Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS
+ OPERATIONAL SERVICES (cluster management, dataset management, data security): Ambari, Oozie, Falcon*, Knox*
+ Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
5. HDP: A Complete Hadoop Distribution
Hortonworks Data Platform (HDP) packages these services and deploys on OS/VM, Cloud or Appliance:
+ CORE SERVICES (storage, resource management, process, schedule): HDFS, YARN, MapReduce, Tez
+ DATA SERVICES (data access, load & extract, data movement): Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox
+ OPERATIONAL SERVICES (cluster management, dataset management): Ambari, Oozie, Falcon
+ Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
6. Store all data in a single place, interact in multiple ways
Hadoop 2: The Introduction of YARN
+ 1st Gen of Hadoop (Single Use System: batch apps): HDFS (redundant, reliable storage) with MapReduce (cluster resource management & data processing)
+ Hadoop 2 (Multi Use Data Platform: batch, interactive, online, streaming, …): HDFS (redundant, reliable storage) with YARN (efficient cluster resource management & shared services), supporting multiple engines: Hive and Pig (standard query processing), MapReduce (batch), Tez (interactive), HBase and Accumulo (online data processing), Storm (real-time stream processing), and others
7. Apache Hadoop YARN: the data operating system for Hadoop 2.0
+ Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
+ Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
+ Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data processing engines run natively IN Hadoop, on HDFS (redundant, reliable storage) and YARN (cluster resource management): BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm), IN-MEMORY (Spark), GRAPH (Giraph), SAS (LASR, HPA), ONLINE (HBase, Accumulo), and others
8. Driving Our Innovation Through Apache
[Chart: Total Net Lines Contributed to Apache Hadoop, split across contributor groups including End Users: 614,041 / 449,768 / 147,933 lines]
[Chart: Total Number of Committers to Apache Hadoop, 63 total: 21, Yahoo: 10, Cloudera: 7, Facebook: 5, IBM: 3, LinkedIn: 3, plus 10 others]
Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies.

Apache Project   Committers   PMC Members
Hadoop           21           13
Tez              10           4
Hive             11           3
HBase            8            3
Pig              6            5
Sqoop            1            0
Ambari           20           12
Knox             6            2
Falcon           2            2
Oozie            2            2
ZooKeeper        2            1
Flume            1            0
Accumulo         2            2
Storm            1            0
Drill            1            0
TOTAL            95           48
9. Patterns for Hadoop Applications
1 Key Services: Platform, operational and data services essential for the enterprise
2 Skills: Leverage your existing skills: development, analytics, operations
3 Integration: Interoperable with existing data center investments
+ DEVELOP: Collect, Process, Build
+ ANALYZE: Explore, Query, Deliver
+ OPERATE: Provision, Manage, Monitor
10. Familiar and Existing Tools
The same Develop (Collect, Process, Build), Analyze (Explore, Query, Deliver) and Operate (Provision, Manage, Monitor) patterns map onto familiar and existing tools; the slide shows partner tools such as BusinessObjects BI.
11. SQL Interactive Query & Apache Hive
Stinger Initiative: a broad, community-based effort to deliver the next generation of Apache Hive
+ Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
+ Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
+ SQL: Support the broadest range of SQL semantics for analytic applications against Hadoop
Apache Hive:
• The de facto standard for Hadoop SQL access
• Used by your current data center partners
• Built for batch AND interactive query
12. Requirements for Enterprise Hadoop: Integration (interoperable with existing data center investments)
+ Integrate with Applications: Business Intelligence, Developer IDEs, Data Integration
+ Integrate with Systems: Data Systems & Storage, Systems Management
+ Integrate with Platforms: Operating Systems, Virtualization, Cloud, Appliances
[Diagram: APPLICATIONS (Business Analytics, Custom Applications, Packaged Applications); DATA SYSTEM REPOSITORIES (RDBMS, EDW, MPP); SOURCES (existing: CRM, ERP, Clickstream, Logs; emerging: Sensor, Sentiment, Geo, Unstructured); OPERATIONAL TOOLS (manage & monitor); DEV & DATA TOOLS (build & test)]
13. Broad Ecosystem Integration
[Diagram: HDP at the center of APPLICATIONS (e.g. BusinessObjects BI), DATA SYSTEMS (RDBMS, EDW, MPP, HANA), SOURCES (existing: CRM, ERP, Clickstream, Logs; emerging: Sensor, Sentiment, Geo, Unstructured), INFRASTRUCTURE, plus OPERATIONAL TOOLS and DEV & DATA TOOLS]
14. Apache Hive and Stinger: SQL in Hadoop
Speakers: Arun Murthy (@acmurthy), Alan Gates (@alanfgates), Owen O'Malley (@owen_omalley); @hortonworks
15. Stinger Project (announced February 2013): Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative: a broad, community-based effort to drive the next generation of Hive
Goals:
+ Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
+ Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
+ SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
…all IN Hadoop
Hive 0.11, May 2013:
• Base optimizations
• SQL analytic functions
• ORCFile, modern file format
Hive 0.12, October 2013:
• VARCHAR, DATE types
• ORCFile predicate pushdown
• Advanced optimizations
• Performance boosts via YARN
Coming soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing
16. Hive 0.12
Release Theme: Speed, Scale and SQL
Specific Features:
• 10x faster query launch when using a large number (500+) of partitions
• ORCFile predicate pushdown speeds queries
• Evaluate LIMIT on the map side
• Parallel ORDER BY
• New query optimizer
• Introduces VARCHAR and DATE datatypes
• GROUP BY on structs or unions
Included Components: Apache Hive 0.12
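ORCFile predicate pushdown works because ORC keeps min/max statistics per stripe, so the reader can skip stripes that cannot match a predicate. A simplified sketch of the idea (the stripe layout and values here are invented; this is not the ORC format itself):

```python
# Sketch of ORC-style predicate pushdown: each "stripe" carries min/max
# statistics, so stripes whose value range cannot satisfy the predicate
# are skipped without reading their rows. Stripes and values are made up.
def read_with_pushdown(stripes, predicate_min):
    """Return rows matching value >= predicate_min, skipping dead stripes."""
    rows, stripes_read = [], 0
    for stripe in stripes:
        if stripe["max"] < predicate_min:  # whole stripe cannot match
            continue                        # skipped: no row I/O needed
        stripes_read += 1
        rows.extend(v for v in stripe["values"] if v >= predicate_min)
    return rows, stripes_read

stripes = [
    {"min": 0,  "max": 9,  "values": [1, 5, 9]},
    {"min": 10, "max": 19, "values": [10, 15]},
    {"min": 20, "max": 29, "values": [20, 25]},
]
print(read_with_pushdown(stripes, predicate_min=15))  # only 2 of 3 stripes read
```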
17. SPEED: Increasing Hive Performance
Performance improvements included in Hive 12:
– Base & advanced query optimization
– Startup time improvement
– Join optimizations
Interactive query times across ALL use cases:
• Simple and advanced queries in seconds
• Integrates seamlessly with existing tools
• Currently a >100x improvement in just nine months
18. Stinger Phase 3: Unlocking Interactive Query
Features and Benefits:
+ Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
+ Container Re-Use: Finished Maps and Reduces pick up more work rather than exiting; reduces latency and eliminates difficult split-size tuning
+ Tez Integration: Tez Broadcast Edge and Intermediate Reduce pattern improve query scale and throughput
+ In-Memory Cache: Hot data kept in RAM for fast access
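Container pre-launch and re-use both amortize JVM startup cost across queries. A toy latency model of that effect, with invented timings rather than measurements:

```python
# Toy model of container re-use: a cold container pays JVM startup cost
# on every query, while a warm (pre-launched or re-used) container pays
# it once. The millisecond figures are illustrative, not measurements.
COLD_START_MS = 2000   # hypothetical JVM launch cost
RUN_MS = 500           # hypothetical per-query work

def total_latency_ms(queries, reuse_containers):
    if reuse_containers:
        return COLD_START_MS + queries * RUN_MS   # pay startup once
    return queries * (COLD_START_MS + RUN_MS)     # pay startup per query

print(total_latency_ms(10, reuse_containers=False))  # cold every time
print(total_latency_ms(10, reuse_containers=True))   # warm container pool
```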
19. Stinger Phase 3: Speed, Scale, and SQL
Release Theme: Prove Hive for both large-scale and interactive SQL / analytics
Specific Features:
• < 10s SQL queries over 200GB datasets through Hive
• Tez container pre-launch
• Tez container re-use
• Use of Tez Intermediate Reduce pattern
• In-memory HDFS caching
Made available as part of the Tech Preview for Stinger Phase 3
20. Stinger Phase 3: Beyond Tech Preview
Release Theme: Speed, SQL, …and Security
Specific Features:
• Hive-on-Tez: Interactive query on Hive
• SQL improvements: sub-query for WHERE, standard JOIN semantics, support for Common Table Expressions (CTE)
• Phase 1 of ACID semantics support
• Automatic JOIN order optimization
• CHAR datatype
• PAM authentication support
• SSL encryption
21. SQL: Enhancing SQL Semantics
SQL Compliance: Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.

Hive SQL Datatypes           Hive SQL Semantics
INT                          SELECT, INSERT
TINYINT/SMALLINT/BIGINT      GROUP BY, ORDER BY, SORT BY
BOOLEAN                      JOIN on explicit join key
FLOAT                        Inner, outer, cross and semi joins
DOUBLE                       Sub-queries in FROM clause
STRING                       ROLLUP and CUBE
TIMESTAMP                    UNION
BINARY                       Windowing functions (OVER, RANK, etc.)
DECIMAL                      Custom Java UDFs
ARRAY, MAP, STRUCT, UNION    Standard aggregation (SUM, AVG, etc.)
DATE                         Advanced UDFs (ngram, XPath, URL)
VARCHAR                      Sub-queries in WHERE, HAVING
CHAR                         Expanded JOIN syntax
                             SQL-compliant security (GRANT, etc.)
                             INSERT/UPDATE/DELETE (ACID)
(The original slide's legend splits these into "Hive 0.12 Available" and "Roadmap"; per the preceding slides, CHAR, sub-queries in WHERE/HAVING, expanded JOIN syntax, SQL-compliant security, and ACID INSERT/UPDATE/DELETE are roadmap items.)
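The windowing-functions row above (OVER, RANK, etc.) can be sketched in plain Python. The department/sales data is hypothetical, and ties share a rank as in SQL's RANK():

```python
# Sketch of SQL RANK() OVER (PARTITION BY dept ORDER BY sales DESC):
# rank rows within each partition, with ties sharing the same rank.
from collections import defaultdict

def rank_over(rows, partition_key, order_key):
    parts = defaultdict(list)
    for row in rows:
        parts[row[partition_key]].append(row)
    ranked = []
    for part in parts.values():
        part.sort(key=lambda r: r[order_key], reverse=True)
        rank, prev = 0, None
        for position, row in enumerate(part, start=1):
            if row[order_key] != prev:          # ties keep the same rank
                rank, prev = position, row[order_key]
            ranked.append({**row, "rank": rank})
    return ranked

rows = [
    {"dept": "a", "sales": 30}, {"dept": "a", "sales": 50},
    {"dept": "a", "sales": 30}, {"dept": "b", "sales": 20},
]
print(rank_over(rows, "dept", "sales"))
```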
23. Vectorized Query Execution
• Designed for modern processor architectures
– Avoid branching in the inner loop
– Make the most use of L1 and L2 cache
• How it works
– Process records in batches of 1,000 rows
– Generate code from templates to minimize branching
• What it gives
– 30x improvement in rows processed per second
– Initial prototype: 100M rows/sec on a laptop
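The batch-of-1,000-rows design above can be sketched as a tight loop over column batches instead of a per-row interpreter. This is a conceptual sketch, not Hive's template-generated code:

```python
# Sketch of vectorized query execution: rather than pushing each row
# through an expression interpreter, process a whole column batch
# (1,000 values here) in one tight loop, keeping data in cache and
# avoiding per-row dispatch overhead.
BATCH_SIZE = 1000

def batches(column):
    for i in range(0, len(column), BATCH_SIZE):
        yield column[i:i + BATCH_SIZE]

def vectorized_filter_sum(column, threshold):
    """SUM(x) WHERE x > threshold, evaluated one batch at a time."""
    total = 0
    for batch in batches(column):
        total += sum(x for x in batch if x > threshold)  # tight inner loop
    return total

column = list(range(2500))  # 2,500 rows -> 3 batches
print(vectorized_filter_sum(column, threshold=2000))
```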
26. Hive – MR Hive – Tez
Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;
[Figure: side-by-side execution plans for the query above. The Hive-on-MR plan runs as a chain of separate MapReduce jobs — SELECT b.id, JOIN(a, b), SELECT c.price, JOIN(a, c), SELECT a.state, then GROUP BY a.state with COUNT(*) and AVERAGE(c.price) — writing intermediate results to HDFS between every job. The Hive-on-Tez plan — SELECT a.state, c.itemId; JOIN(a, b); JOIN(a, c); SELECT b.id; GROUP BY a.state with COUNT(*) and AVERAGE(c.price) — expresses the same work as a single DAG, streaming data between tasks with no intermediate HDFS writes.]
Tez avoids unneeded writes to HDFS
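Choosing the engine is a one-line session setting; a minimal sketch using the standard Hive property:

```sql
-- Run Hive queries on Tez instead of MapReduce. Intermediate
-- results then stream between DAG tasks rather than being
-- written to and re-read from HDFS between jobs.
SET hive.execution.engine = tez;
```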
27. Tez Delivers Interactive Query - Out of the Box!
Page 27
Feature / Description / Benefit
• Tez Session — Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster. (Benefit: latency)
• Tez Container Pre-Launch — Overcomes MapReduce latency by pre-launching hot containers ready to serve queries. (Benefit: latency)
• Tez Container Re-Use — Finished maps and reduces pick up more work rather than exiting; reduces latency and eliminates difficult split-size tuning. Out-of-box performance! (Benefit: latency)
• Runtime re-configuration of DAG — Runtime query tuning by picking aggregation parallelism using online query statistics. (Benefit: throughput)
• Tez In-Memory Cache — Hot data kept in RAM for fast access. (Benefit: latency)
• Complex DAGs — The Tez Broadcast Edge and the Map-Reduce-Reduce pattern improve query scale and throughput. (Benefit: throughput)
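Container pre-launch is also driven by session settings. A sketch under the assumption that the pre-warm properties from the Hive-on-Tez work apply to your version (treat the names as indicative rather than definitive):

```sql
-- Pre-launch ("pre-warm") Tez containers so queries find hot
-- containers already running, avoiding per-query launch latency.
-- Property names assumed from the Hive-on-Tez integration work.
SET hive.prewarm.enabled = true;
SET hive.prewarm.numcontainers = 10;
```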
34. How Stinger Phase 3 Delivers Interactive Query
Page 34
Feature / Description / Benefit
• Tez Integration — Tez is a significantly better engine than MapReduce. (Benefit: latency)
• Vectorized Query — Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. (Benefit: throughput)
• Query Planner — Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning). (Benefit: latency)
• Cost-Based Optimizer (Optiq) — Join re-ordering and other optimizations based on column statistics, including histograms, etc. (Benefit: latency)
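The planner and the cost-based optimizer both depend on column statistics being present in the Metastore. A sketch of gathering them (the sales table is hypothetical, and the hive.cbo.enable flag landed slightly after this era of Hive, so treat it as indicative):

```sql
-- Gather table-level and column-level statistics so the planner
-- and cost-based optimizer have something to work with.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;

-- Let the cost-based optimizer (Optiq, later Calcite) re-order joins.
SET hive.cbo.enable = true;
```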
36. Hortonworks: The Value of “Open” for You
Page 36
Validate & Try
1. Download the
Hortonworks Sandbox
2. Learn Hadoop using the
technical tutorials
3. Investigate a business
case using the step-by-step
business case scenarios
4. Validate YOUR business
case using your data in
the sandbox
Connect With the Hadoop Community
We employ a large number of Apache project committers & innovators so
that you are represented in the open source community
Avoid Vendor Lock-In
The Hortonworks Data Platform remains as close to the open source trunk as
possible and is developed 100% in the open, so you are never locked in
The Partners you Rely On, Rely On Hortonworks
We work with partners to deeply integrate Hadoop with data center
technologies so you can leverage existing skills and investments
Certified for the Enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to
ensure the reliability and stability you require for enterprise use
Support from the Experts
We provide the highest quality of support for deploying at scale. You are
supported by hundreds of years of Hadoop experience
Engage
1. Execute a Business Case
Discovery Workshop with
our architects
2. Build a business case for
Hadoop today
Editor's Notes
Hello. Today I’m going to talk to you about Hortonworks and how we deliver enterprise-ready Hadoop to enable your modern data architecture.
Founded just 2.5 years ago by the original Hadoop team members at Yahoo, Hortonworks emerged as the leader in open source Hadoop. We are committed to ensuring Hadoop is an enterprise-viable data platform ready for your modern data architecture. Our team is probably the largest assembled team of Hadoop experts and active leaders in the community. We not only make sure Hadoop meets all your enterprise requirements — operations, reliability and security — it also needs to be packaged and tested, and we do this. It has to work with what you have. Make Hadoop an enterprise data platform. Make the market function. Innovate core platform, data, and operational services. Integrate deeply with the enterprise ecosystem. Provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects. Address enterprise needs in community projects. Establish Apache Foundation projects as “the standard”. Promote open community vs. vendor control / lock-in. Enable the Hadoop market to function. Make it easy for enterprises to deploy at scale. Be the best at enabling deep ecosystem integration. Create a pull market with key strategic partners.
The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it.The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”.[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways. With predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future.For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
With Hive and Stinger we are focused on enabling the SQL ecosystem, and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types, as well as implementing common SQL semantics seen in most databases.
Query 52: star join followed by group/order (different keys), selective filter. Query 55: same.
Query 28: 4-subquery join. Query 12: star join over a range of dates.
query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1