Apache Phoenix + Apache HBase

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
An Enterprise Grade Data Warehouse
Ankit Singhal, Rajeshbabu, Josh Elser
June 30, 2016

About us
Ankit Singhal
– Committer and member of Apache Phoenix PMC
– MTS at Hortonworks
RajeshBabu
– Committer and member of Apache Phoenix PMC
– Committer in Apache HBase
Josh Elser
– Committer in Apache Phoenix
– Committer and member of Apache Calcite PMC

Agenda
Phoenix & HBase as an Enterprise Data Warehouse
Use Cases
Optimizations
Phoenix Query Server
Q&A

Data Warehouse
An enterprise data warehouse (EDW) organizes and aggregates analytical data from various functional domains and serves as a critical repository for an organization's operations.
[Diagram labels: Files, IoT data, and OLTP sources; staging; ETL; data warehouse; mart; visualization or BI]

Phoenix Offerings and Interoperability
[Diagram: offerings grouped under ETL, Data Warehouse, and Visualization & BI]

HBase & Phoenix
• HBase: a distributed NoSQL store
• Phoenix: provides OLTP and analytics over HBase
[Diagram: an application uses the Phoenix client on top of the HBase client; ZooKeeper; three RegionServers on HDFS, each hosting table regions together with a Phoenix coprocessor]

Open Source Data Warehouse
[Chart: hardware cost (specialized H/W vs. commodity H/W) against software cost (licensing cost vs. no cost), positioning SMP, MPP, open source MPP, and HBase + Phoenix]

Phoenix & HBase as a Data Warehouse
Architecture
– Runs on commodity H/W
– True MPP
– O/S and H/W flexibility
– Supports OLTP and ROLAP

Scalability
– Linear scalability for storage
– Linear scalability for memory
– Open to third-party storage

Reliability
– Highly available
– Replication for disaster recovery
– Fully ACID for data integrity

Manageability
– Performance tuning
– Data modeling and schema evolution
– Data pruning
– Online expansion or upgrade
– Data backup and recovery

Agenda
Use cases

Who uses Phoenix?

Analytics Use Case (Web Advertising Company)
• Functional requirements
– Create a single source of truth
– Cross-dimensional queries on 50+ dimensions and 80+ metrics
– Support fast Top-N queries
• Non-functional requirements
– Response time under 3 seconds for slice-and-dice queries
– 250+ concurrent users
– 100k+ analytics queries/day
– Highly available
– Linear scalability

Data Warehouse Capacity
• Data size (ETL input)
– 24 TB/day of raw data system-wide
– 25 billion impressions
• HBase input (cube)
– 6 billion rows of aggregated data (100 GB/day)
• HBase cluster size
– 65 HBase nodes
– 520 TB of disk
– 4.1 TB of memory

Use Case Architecture
[Diagram: three stages. Data ingestion: AdServer and click tracking feed Apache Kafka. Processing: a real-time path (Kafka input, ETL filter/aggregate, in-memory store) and a batch processing path (Kafka, Camus, HDFS, ETL, data uploader, HBase). Analytics: views, data API, and analytics UI.]
Analytics Data Warehouse Architecture
[Diagram labels: HDFS, ETL, cube generation, bulk load, HBase (cubes are stored in HBase), backup and recovery, data API, conversion of slice-and-dice queries to SQL queries, analytics UI]

Time Series Use Case (Apache Ambari)
Ambari Metrics System (AMS)
• Functional requirements
– Store all cluster metrics collected every second (10k to 100k metrics/second)
– Optimize storage and access for time series data
• Non-functional requirements
– Near-real-time response time
– Scalable
– Real-time ingestion

AMS architecture
[Diagram labels: metric monitors on hosts, Hadoop sinks, metric collector backed by Phoenix and HBase, Ambari server]

Agenda
Use Cases
Optimizations

Schema Design
Primary key design
• The most important factor driving overall query performance on the table
• The primary key should be composed of the columns most often used in query predicates
• In most cases, the leading part of the primary key should help convert queries into point lookups or range scans in HBase
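To make the primary key guidance concrete, here is a minimal Python sketch, not Phoenix code. It assumes rows are kept sorted by a composite key (metric name, hostname), as HBase keeps rows sorted by row key. The table contents and function names are hypothetical. A predicate on the leading key column lets the reader seek and stop early, while a predicate on a non-leading column alone forces a look at every row.

```python
import bisect

# Hypothetical rows keyed by (metric_name, hostname), kept sorted by key.
ROWS = sorted([
    ("cpu", "host1"), ("cpu", "host2"), ("disk", "host1"),
    ("mem", "host1"), ("mem", "host2"), ("net", "host1"),
])

def range_scan(rows, leading_value):
    """Predicate on the leading key column: seek to the first match, stop after the last."""
    start = bisect.bisect_left(rows, (leading_value,))
    examined = 0
    matches = []
    for key in rows[start:]:
        examined += 1
        if key[0] != leading_value:
            break
        matches.append(key)
    return matches, examined

def full_scan(rows, hostname):
    """Predicate only on a non-leading column: every row has to be examined."""
    matches = [key for key in rows if key[1] == hostname]
    return matches, len(rows)
```

Here `range_scan(ROWS, "mem")` examines only the rows from the first "mem" key to the first key past it, whereas `full_scan(ROWS, "host2")` examines the whole table.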

Schema Design
Salting vs pre-split
• Use salting to alleviate write hot-spotting
CREATE TABLE …(
…
) SALT_BUCKETS = N
– The number of buckets should equal the number of RegionServers
• Otherwise, try to pre-split the table if you know the row key data set
CREATE TABLE …(
…
) SPLIT ON (…)
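The idea behind salting can be sketched in a few lines of Python. This is an illustration only: the use of crc32 as the hash and the helper names are assumptions, not Phoenix's actual implementation. A leading byte derived from a hash of the row key spreads monotonically increasing keys over several buckets instead of sending them all to the same region.

```python
import zlib
from collections import Counter

def salt_bucket(row_key: bytes, buckets: int) -> int:
    """Pick a bucket from a hash of the row key (crc32 is an assumption for this sketch)."""
    return zlib.crc32(row_key) % buckets

def salted_key(row_key: bytes, buckets: int) -> bytes:
    """Prepend the bucket number as a single leading byte."""
    return bytes([salt_bucket(row_key, buckets)]) + row_key

def bucket_histogram(keys, buckets):
    """Count how many keys land in each bucket."""
    return Counter(salt_bucket(k, buckets) for k in keys)

# Monotonically increasing keys (timestamps here) are the classic hot-spotting case.
keys = [f"2016-06-30T00:00:{s:02d}".encode() for s in range(60)]
histogram = bucket_histogram(keys, buckets=4)
```

The trade-off is that a range scan over the original key order now has to visit every bucket, which is one reason to prefer pre-splitting when the key distribution is known.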

Schema Design
Table properties
• Use block encoding and/or compression for better performance
CREATE TABLE …(
…
) DATA_BLOCK_ENCODING='FAST_DIFF', COMPRESSION='SNAPPY'
• Use region replication for read high availability
CREATE TABLE …(
…
) "REGION_REPLICATION" = "2"

Schema Design
Table properties
• Set UPDATE_CACHE_FREQUENCY (in milliseconds) to a larger value to avoid frequently contacting the server for metadata updates
CREATE TABLE …(
…
) UPDATE_CACHE_FREQUENCY = 300000
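The effect of a larger metadata refresh interval can be illustrated with a toy client-side cache in Python. This is a sketch under stated assumptions: the `MetadataCache` class, the fetch callable, and the query schedule are all hypothetical and only model the general idea of re-checking the server when the cached copy is older than the configured interval.

```python
class MetadataCache:
    """Toy client-side cache: re-fetch table metadata only when it is older than the interval."""

    def __init__(self, fetch, update_cache_frequency_ms):
        self.fetch = fetch                      # hypothetical callable that asks the server
        self.frequency_ms = update_cache_frequency_ms
        self.cached = None
        self.fetched_at_ms = None
        self.server_calls = 0

    def get(self, now_ms):
        stale = self.fetched_at_ms is None or now_ms - self.fetched_at_ms >= self.frequency_ms
        if stale:
            self.cached = self.fetch()
            self.fetched_at_ms = now_ms
            self.server_calls += 1
        return self.cached

def count_server_calls(frequency_ms, query_times_ms):
    """Run one metadata lookup per query time and report how often the server was contacted."""
    cache = MetadataCache(lambda: {"columns": ["pk1", "col1"]}, frequency_ms)
    for t in query_times_ms:
        cache.get(t)
    return cache.server_calls

# One query per second for ten minutes.
query_times = range(0, 600_000, 1_000)
```

With an interval of 0 every query contacts the server; with 300000 ms the same ten minutes of queries contact it only twice in this model.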

Schema Design
• Divide columns into multiple column families if some columns are rarely accessed
– HBase reads only the files of the column families specified in the query, which reduces I/O
[Diagram: row key (pk1, pk2) with columns Col1 to Col7 split across CF1 (frequently accessed columns) and CF2 (rarely accessed columns)]
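A small Python model shows why separating rarely accessed columns helps. The column-to-family mapping and the file sizes below are made up for illustration; the only point is that the I/O cost depends on which families the selected columns belong to.

```python
# Hypothetical layout: each column family is stored in its own set of files.
FAMILY_OF = {"col1": "CF1", "col2": "CF1", "col3": "CF1", "col6": "CF2", "col7": "CF2"}
FILE_SIZE_MB = {"CF1": 200, "CF2": 1800}   # made-up sizes for illustration

def families_to_read(selected_columns):
    """Only the families that contain a selected column need to be opened."""
    return sorted({FAMILY_OF[c] for c in selected_columns})

def io_cost_mb(selected_columns):
    """Total size of the files that must be read for this projection."""
    return sum(FILE_SIZE_MB[f] for f in families_to_read(selected_columns))
```

In this model a query that selects only `col1` and `col2` reads 200 MB, while adding `col7` to the projection pulls in the large, rarely used family as well.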

Secondary Indexes
• Global indexes
– Optimized for read-heavy use cases
CREATE INDEX idx ON table(…)
• Local indexes
– Optimized for write-heavy and space-constrained use cases
CREATE LOCAL INDEX idx ON table(…)
• Functional indexes
– Allow you to create indexes on arbitrary expressions
CREATE INDEX UPPER_NAME_INDEX ON EMP(UPPER(FIRSTNAME||' '||LASTNAME))

Secondary Indexes
• Use covered indexes to scan the index table efficiently instead of the primary table
CREATE INDEX idx ON table(…) INCLUDE(…)
• Pass an index hint to guide the query optimizer to select the right index for the query
SELECT /*+ INDEX(<table> <index>) */ …
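The benefit of a covered index can be sketched conceptually in Python. This is not how Phoenix executes queries; the table, column names, and helper functions are hypothetical. The sketch only shows that when the index carries copies of the selected columns, the query is answered from the index alone, with no lookups back into the primary table.

```python
# Hypothetical primary table keyed by employee id.
PRIMARY = {
    1: {"last_name": "Smith", "dept": "eng", "city": "Austin"},
    2: {"last_name": "Jones", "dept": "eng", "city": "Boston"},
    3: {"last_name": "Brown", "dept": "ops", "city": "Denver"},
}

def build_index(table, indexed_col, include=()):
    """Map indexed value -> list of (primary key, copies of the included columns)."""
    index = {}
    for pk, row in table.items():
        covered = {c: row[c] for c in include}
        index.setdefault(row[indexed_col], []).append((pk, covered))
    return index

def query(index, table, value, wanted_cols):
    """Return (rows, number of primary-table lookups needed)."""
    rows, lookups = [], 0
    for pk, covered in index.get(value, []):
        if all(c in covered for c in wanted_cols):
            rows.append({c: covered[c] for c in wanted_cols})
        else:
            lookups += 1                      # column not in the index: go back to the table
            rows.append({c: table[pk][c] for c in wanted_cols})
    return rows, lookups
```

Building the index with `include=("city",)` plays the role of the INCLUDE clause: the same query returns the same rows but needs zero primary-table lookups.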

Row Timestamp Column
• Maps the HBase native row timestamp to a Phoenix column
• Leverages HBase optimizations such as setting the minimum and maximum time range for scans, which skips entirely the store files that fall outside that time range
• Perfect for time series use cases
• Syntax
CREATE TABLE …(CREATED_DATE DATE NOT NULL
…
CONSTRAINT PK PRIMARY KEY(CREATED_DATE ROW_TIMESTAMP…
)
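The store file skipping described above can be modeled in a few lines of Python. This is a simplified sketch: the `StoreFile` class and the file names are hypothetical, and it assumes each file records the minimum and maximum cell timestamp it contains, so that files outside the query's time range never need to be opened.

```python
from dataclasses import dataclass

@dataclass
class StoreFile:
    name: str
    min_ts: int   # smallest cell timestamp in the file
    max_ts: int   # largest cell timestamp in the file

def files_to_scan(store_files, start_ts, end_ts):
    """Keep only files whose [min_ts, max_ts] overlaps the query range [start_ts, end_ts)."""
    return [f.name for f in store_files if f.max_ts >= start_ts and f.min_ts < end_ts]

files = [
    StoreFile("hfile-1", 0, 99),
    StoreFile("hfile-2", 100, 199),
    StoreFile("hfile-3", 200, 299),
]
```

For time series tables, where new data arrives roughly in time order, most files cover a narrow time window, so a query over a recent interval touches only a small fraction of them.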

Use of Statistics
[Diagram: two clients scanning regions A, F, L, and R. Without statistics, the client runs one scan per region; with statistics, the regions are divided into chunks A, C, F, I, L, O, R, and U that are scanned in parallel.]
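The chunking in the statistics diagram can be sketched as follows. This Python model is an assumption-laden illustration: the region start keys come from the diagram, while the extra split keys inside each region stand in for boundaries that collected statistics would provide. More boundaries mean more, smaller scans that a client can run in parallel.

```python
def scan_chunks(region_starts, stat_boundaries):
    """Split the key space at region starts and at statistics boundaries; one scan per chunk."""
    boundaries = sorted(set(region_starts) | set(stat_boundaries))
    # Each chunk runs from one boundary to the next; the last chunk is open-ended (None).
    return list(zip(boundaries, boundaries[1:] + [None]))

regions = ["A", "F", "L", "R"]
stat_boundaries = ["C", "I", "O", "U"]   # hypothetical keys collected inside each region

without_stats = scan_chunks(regions, [])
with_stats = scan_chunks(regions, stat_boundaries)
```

Without statistics the sketch yields four scans, one per region; with the extra boundaries it yields eight, matching the chunk labels in the diagram.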

Skip Scan
• Phoenix supports skip scan to jump directly to matching keys when the query predicate contains key sets
SELECT * FROM METRIC_RECORD
WHERE METRIC_NAME LIKE 'abc%'
AND HOSTNAME IN ('host1', 'host2');
Explain plan:
CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 RANGES OVER METRIC_RECORD ['abc','host1'] - ['abd','host2']
[Diagram: a client skip-scanning Region1 to Region4 hosted on RS1, RS2, and RS3]
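A skip scan can be approximated in Python over a sorted list of composite keys. This is a conceptual sketch, not Phoenix's implementation; the function name and the sample data are hypothetical. Instead of reading every key in the prefix range, the scan seeks forward to the next key that could still match whenever the current key does not.

```python
import bisect

def skip_scan(sorted_keys, name_prefix, hosts):
    """Scan (name, host) keys with the given name prefix, seeking past hosts that cannot match."""
    hosts = sorted(hosts)
    results, seeks = [], 1
    pos = bisect.bisect_left(sorted_keys, (name_prefix, ""))   # first key in the prefix range
    while pos < len(sorted_keys):
        name, host = sorted_keys[pos]
        if not name.startswith(name_prefix):
            break                                              # left the prefix range
        i = bisect.bisect_left(hosts, host)                    # smallest wanted host >= current
        if i == len(hosts):
            # No wanted host remains for this name: jump to the first key of the next name.
            pos = bisect.bisect_right(sorted_keys, (name, chr(0x10FFFF)))
            seeks += 1
        elif hosts[i] == host:
            results.append((name, host))
            pos += 1
        else:
            # Seek straight to the next wanted host for this name.
            pos = bisect.bisect_left(sorted_keys, (name, hosts[i]))
            seeks += 1
    return results, seeks

keys = sorted(
    [(f"abc.{m}", f"host{h}") for m in ("cpu", "mem") for h in range(1, 6)]
    + [("aaa", "host1"), ("abd.x", "host1")]
)
```

Calling `skip_scan(keys, "abc", {"host1", "host2"})` returns only the host1 and host2 rows for each matching metric and never reads the host4 or host5 rows.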

Join Optimizations
• Hash join
– A hash join outperforms other join algorithms when one of the relations is small, or when its records matching the predicate fit into memory
• Sort-merge join
– When both relations are very large, use the sort-merge join algorithm
• NO_STAR_JOIN hint
– For queries with multiple inner joins, Phoenix applies a star-join optimization by default. Use this hint if the overall size of all right-hand-side tables would exceed the memory size limit.
• NO_CHILD_PARENT_JOIN_OPTIMIZATION hint
– Prevents the use of the child-parent-join optimization
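The difference between the two join algorithms can be shown with a minimal Python sketch. It is a generic illustration, not Phoenix's implementation, and it sorts in memory for brevity where a real engine would stream sorted inputs. The hash join must hold one whole relation in a hash table, which is why it suits a small side; the sort-merge join only walks two sorted inputs in step.

```python
def hash_join(small, large, key):
    """Build a hash table on the smaller relation, then probe it with the larger one."""
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    return [{**s, **l} for l in large for s in table.get(l[key], [])]

def sort_merge_join(left, right, key):
    """Sort both inputs on the join key, then merge them in a single pass."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the two runs that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out

dims = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
facts = [{"id": 2, "v": 10}, {"id": 1, "v": 20}, {"id": 2, "v": 30}, {"id": 3, "v": 40}]
```

Both functions produce the same set of joined rows for `dims` and `facts`; only the memory profile and row order differ.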

Optimize Writes
• UPSERT VALUES
– Call it multiple times before commit to batch mutations
– Use a prepared statement when you run the same statement multiple times
• UPSERT SELECT
– Configure phoenix.mutate.batchSize based on row size
– Set auto-commit to true to write scan results directly to HBase
– Set auto-commit to true when running UPSERT SELECT on the same table so that the writes happen on the server
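The batching pattern for UPSERT VALUES (one reusable parameterized statement, many rows per commit) can be sketched with Python's built-in sqlite3 module as a stand-in for a Phoenix connection. The table, the batch size, and the use of INSERT OR REPLACE in place of UPSERT are assumptions made so the example runs without a cluster.

```python
import sqlite3

def load_rows(conn, rows, batch_size):
    """Reuse one parameterized statement and commit once per batch, not once per row."""
    sql = "INSERT OR REPLACE INTO metric (name, host, value) VALUES (?, ?, ?)"
    commits = 0
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.executemany(sql, rows[start:start + batch_size])
        conn.commit()
        commits += 1
    return commits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metric (name TEXT, host TEXT, value REAL, PRIMARY KEY (name, host))")
rows = [("cpu", f"host{i}", float(i)) for i in range(1000)]
commits = load_rows(conn, rows, batch_size=250)
```

A thousand rows are written with four commits here; the right batch size in practice depends on row size, which is the point of tuning phoenix.mutate.batchSize.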

Hints
Some important hints
• SKIP_SCAN, RANGE_SCAN
• SERIAL
• SMALL

Additional References
• For more optimizations, refer to these documents:
– http://phoenix.apache.org/tuning.html
– https://hbase.apache.org/book.html#performance

Agenda
Use Cases
Optimizations
Phoenix Query Server

Apache Phoenix Query Server
• A standalone service that proxies user requests to HBase/Phoenix
– Optional
• Reference client implementation via JDBC
– "Thick" versus "Thin"
• First introduced in Apache Phoenix 4.4.0
• Built on Apache Calcite's Avatica
– "A framework for building database drivers"

Traditional Apache Phoenix RPC Model
[Diagram: the application embeds the Phoenix client and HBase client, which talk to ZooKeeper and to the RegionServers on HDFS; each RegionServer hosts table regions together with a Phoenix coprocessor]

Query Server Model
[Diagram: the application talks to the Query Server; the Query Server hosts the Phoenix client and HBase client, which talk to ZooKeeper and to the RegionServers on HDFS; each RegionServer hosts table regions together with a Phoenix coprocessor]

Query Server Technology
• HTTP server and wire API definition
• Pluggable serialization
– Google Protocol Buffers
• "Thin" JDBC driver (over HTTP)
• Other goodies!
– Pluggable metrics system
– TCK (technology compatibility kit)
– SPNEGO for Kerberos authentication
– Horizontally scalable with load balancing

Query Server Clients
Client enablement
• Go language database/sql/driver
– https://github.com/Boostport/avatica
• .NET driver
– https://github.com/Azure/hdinsight-phoenix-sharp
– https://www.nuget.org/packages/Microsoft.Phoenix.Client/1.0.0-preview
• ODBC
– Built by http://www.simba.com/, also available from Hortonworks
• Python DB API v2.0 (not "battle tested")
– https://bitbucket.org/lalinsky/python-phoenixdb

Agenda
Use Cases
Optimizations
Phoenix Query Server
Q&A

Phoenix & HBase
We hope to see you all migrating to Phoenix & HBase, and we expect more questions on the user mailing lists.
Get involved in the mailing lists:
user@phoenix.apache.org
user@hbase.apache.org
You can reach us at:
ankit@apache.org
rajeshbabu@apache.org
elserj@apache.org

Thank You
