Scaling HBase for Big Data
Salesforce
Ranjeeth Kathiresan
Senior Software Engineer
rkathiresan@salesforce.com
Gurpreet Multani
Principal Software Engineer
gmultani@salesforce.com
Introduction
Ranjeeth Kathiresan is a Senior Software Engineer at Salesforce, where he
focuses primarily on improving the performance, scalability, and availability of
applications by assessing and tuning the server-side components in terms of
code, design, configuration, and so on, particularly with Apache HBase.
Ranjeeth is an admirer of performance engineering and is especially fond of
tuning an application to perform better.
Gurpreet Multani is a Principal Software Engineer at Salesforce, where he has
led initiatives to scale various Big Data technologies such as Apache HBase,
Apache Solr, and Apache Kafka. He is particularly interested in finding ways to
optimize code to reduce bottlenecks, consume fewer resources, and get more out
of the available capacity in the process.
Agenda
• HBase @ Salesforce
• CAP Theorem
• HBase Refresher
• Typical HBase Use Cases
• HBase Internals
• Data Loading Use Case
• Write Bottlenecks
• Tuning Writes
• Best Practices
• Q&A
HBase @ Salesforce
• Typical cluster data volume: 120 TB
• Nodes across all clusters: 2200+
• Variety: simple row store, denormalization, messaging, event log, analytics, metrics, graphs, cache
CAP Theorem
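It is impossible for a distributed data store to simultaneously provide more than two of the following
three guarantees:
• Availability: each client can always read and write
• Consistency: all clients have the same view of the data
• Partition tolerance: the system works well despite physical network partitions
[Diagram: CAP triangle placing RDBMS between consistency and availability, Cassandra between availability and partition tolerance, and HBase between consistency and partition tolerance]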
HBase Refresher
• Distributed database
• Non-relational
• Column-oriented
• Supports compression
• In-memory operations
• Bloom filters on a per-column basis
• Written in Java
• Runs on top of HDFS
“A sparse, distributed, persistent, multidimensional, sorted map”
Typical HBase Use Cases
• Large data volume, running into hundreds of GBs or more (aka Big Data)
• Data access patterns are well known at design time and are not expected to change, i.e. no secondary indexes or joins need to be added at a later stage
• RDBMS-like multi-row transactions are not required
• Large “working set” of data (working set = data being accessed or updated)
• Multiple versions of data
HBase Internals
Write Operation
[Diagram: Client, Zookeeper, a Region Server hosting .META., and a Region Server with a WAL and multiple Regions; each Region has Stores with a Memstore and HFiles on HDFS. Steps: 1. Get .META. location, 2. Get Region location, 3. Put, 4. Write, 5. Write, followed by a Flush from Memstore to HFile.]
HBase Internals
Compaction
The main purpose of compaction is to optimize read performance by reducing the number of disk seeks.
Minor Compaction
• Trigger: automatic, based on configuration
• Mechanism: reads a configurable number of smaller HFiles and writes them into a single large HFile
Major Compaction
• Trigger: scheduled or manual
• Mechanism: reads all HFiles of a region and writes them to a single large HFile
• Physical deletion of records
• Tries to achieve high data locality
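Both compaction types can be pictured as merges over sorted files. The following is a minimal sketch in Python, assuming each HFile is modeled as a list of (row key, timestamp, value) cells sorted by row key and timestamp, that a delete is modeled as a value of None, and that only the newest version of each row is retained. The names `minor_compact` and `major_compact` are illustrative and are not HBase APIs.

```python
import heapq

# Toy model (assumption): an "HFile" is a list of (row_key, timestamp, value)
# cells sorted by (row_key, timestamp). None stands for a delete marker.
TOMBSTONE = None


def _cell_order(cell):
    # Order cells by row key, then timestamp; the value is never compared.
    return (cell[0], cell[1])


def minor_compact(hfiles):
    """Merge a few sorted HFiles into one sorted HFile, keeping every cell."""
    return list(heapq.merge(*hfiles, key=_cell_order))


def major_compact(hfiles):
    """Merge all HFiles, keep the newest cell per row, and drop deleted rows."""
    latest = {}
    for row_key, ts, value in heapq.merge(*hfiles, key=_cell_order):
        latest[row_key] = (ts, value)  # later timestamps overwrite earlier ones
    return [
        (row_key, ts, value)
        for row_key, (ts, value) in sorted(latest.items())
        if value is not TOMBSTONE  # physical deletion happens here
    ]


hfile1 = [("a", 1, "v1"), ("c", 1, "v1")]
hfile2 = [("a", 2, "v2"), ("b", 2, "v1")]
hfile3 = [("c", 3, TOMBSTONE)]

print(minor_compact([hfile1, hfile2]))
print(major_compact([hfile1, hfile2, hfile3]))
```

In this toy model, only the major compaction removes row "c", because it is the only pass that sees every file and can therefore discard the delete marker together with the data it hides.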
HBase Internals
Read Operation
[Diagram: Client, Zookeeper, a Region Server hosting .META., and a Region Server with a WAL, a Block Cache, and multiple Regions; each Region has a Memstore and HFiles on HDFS. Steps: 1. Get .META. location, 2. Get Region location, 3. Get, 4. Read, 5. Read, 6. Read.]
Data Loading Overview
• One of the use cases is to store and process data in text format
• Lookups from HBase using the row key are more efficient
• A subset of the data is stored in Solr for effective lookups from HBase
[Diagram: Salesforce Application, followed by Extract, Transform, and Load stages]
Data Insights
Key details about the data used for processing
Velocity
• Data influx: 500 MB/min
• Influx rate: 600K records/min
• Throughput SLA: 175K records/min
Volume
• Data size per cycle: 200 GB
• HBase data size per cycle: 3300 GB
• Records per day: 250MM
Variety
• Data format: Text (CSV, JSON)
Write Operation Bottlenecks
• Influx rate: 600K records/min
• Write rate: 60K records/min
• Write operation in progress for >3 days
• Write rate dropped to <5K records/min after a few hours
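A quick back-of-the-envelope check of these numbers, assuming that one load cycle covers a day's worth of data (the 250MM records/day figure from the Data Insights slide) and that the write rate stays constant. Tying these figures together is an inference, not something the slides state, and the function names are illustrative.

```python
RECORDS_PER_DAY = 250_000_000   # "250MM Records/Day" from the Data Insights slide
INITIAL_WRITE_RATE = 60_000     # records/min before tuning
MINUTES_PER_DAY = 24 * 60


def required_rate_per_min(records_per_day):
    """Write rate needed to absorb one day of records within one day."""
    return records_per_day / MINUTES_PER_DAY


def days_to_write(records, rate_per_min):
    """Days needed to write `records` at a constant rate (records/min)."""
    return records / rate_per_min / MINUTES_PER_DAY


print(round(required_rate_per_min(RECORDS_PER_DAY)))                 # 173611
print(round(days_to_write(RECORDS_PER_DAY, INITIAL_WRITE_RATE), 2))  # 2.89
```

Under these assumptions, roughly 174K records/min are needed to keep pace, which is close to the 175K records/min SLA. At a constant 60K records/min the same data would take about 2.9 days, before accounting for the drop to <5K records/min.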
Write Operation Tunings
Improved throughput by ~8 times and achieved ~3 times more than the expected throughput
• Initial throughput: 60K records/min
• Achieved throughput: 480K records/min
Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
Region Hot Spotting
Outline: Region hot spotting refers to overutilizing a single region server during write operations, despite
having multiple nodes in the cluster, because sequential row keys are used.
[Illustration: one node says “Hey Buddy! I’m overloaded” while the other two say “Not our turn, Yet!!”. Impact chart: utilization over time for Node1, Node2, and Node3.]
Salting
Outline: Salting helps distribute writes over multiple
regions by randomizing row keys
How do I implement salting?
Salting is implemented by adding a salt prefix (a random character)
to the original row key
Two common ways of salting
• Adding a number as a prefix, based on a modulo operation
• Hashing the row key
Salting
Prefixing a random number
The prefix can be derived by performing a modulo operation between the insertion index and the
total number of buckets:
Salted Key = (++index % total buckets) + "_" + Original Key
Example with 3 salt buckets
Original Key  Salted Key  Bucket
1000          0_1000      Bucket 1
1003          0_1003      Bucket 1
1001          1_1001      Bucket 2
1004          1_1004      Bucket 2
1002          2_1002      Bucket 3
1005          2_1005      Bucket 3
Key points
• Randomness is provided only to some extent, as the prefix depends on insertion order
• Salted keys stored in HBase won’t be visible to the client during lookups, because the prefix cannot be derived from the original key
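The modulo-based prefix can be sketched in a few lines of Python. This is an illustration under assumptions: the insertion index starts at 0 (so the first key lands in bucket 0, as in the example keys 0_1000, 1_1001, 2_1002), and the helper names `make_modulo_salter` and `candidate_keys` are hypothetical, not part of any HBase client.

```python
from itertools import count


def make_modulo_salter(total_buckets):
    """Return a function that prefixes keys with (insertion index % total_buckets)."""
    index = count()  # insertion index: 0, 1, 2, ...

    def salt(original_key):
        bucket = next(index) % total_buckets
        return f"{bucket}_{original_key}"

    return salt


def candidate_keys(original_key, total_buckets):
    """All salted forms a reader has to try, since the prefix depends on insertion order."""
    return [f"{bucket}_{original_key}" for bucket in range(total_buckets)]


salt = make_modulo_salter(total_buckets=3)
print([salt(key) for key in ["1000", "1001", "1002", "1003", "1004", "1005"]])
# ['0_1000', '1_1001', '2_1002', '0_1003', '1_1004', '2_1005']

print(candidate_keys("1003", total_buckets=3))
# ['0_1003', '1_1003', '2_1003']
```

The second helper shows the cost of this scheme: a reader that knows only the original key has to probe every bucket.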
Salting
Hashing the row key
Hashing the entire row key, or adding the first few characters of the hash of the row key as a prefix,
can be used to implement salting:
Salted Key = hash(Original Key) OR firstNChars(hash(Original Key)) + "_" + Original Key
Example with 3 salt buckets
Data: 1000, 1001, 1002, 1003, 1004, 1005
Bucket 1: AtNB/q.., B50SP..
Bucket 2: e8aRjL.., ggEw9..
Bucket 3: w56syI.., xwer51..
Key points
• Randomness in the row key is ensured by the hash values
• HBase lookups remain effective, as the same hashing function can be used during lookup
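A minimal Python sketch of the hash-prefix variant follows. The slides do not name the hash function, so SHA-256 and a two-character hex prefix are assumptions made here for illustration, and `hash_prefix_salt` is a hypothetical helper name.

```python
import hashlib


def hash_prefix_salt(original_key, prefix_chars=2):
    """Prefix the key with the first characters of its own hash.

    SHA-256 and the prefix length are illustrative choices (assumptions).
    """
    digest = hashlib.sha256(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_chars]}_{original_key}"


for key in ["1000", "1001", "1002"]:
    print(key, "->", hash_prefix_salt(key))

# The same function is applied at lookup time, so a reader can rebuild
# the salted key from the original key alone.
assert hash_prefix_salt("1000") == hash_prefix_salt("1000")
```

Because the prefix is a pure function of the key, no bucket probing is needed on reads, unlike the insertion-order prefix.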
Salting
Does it help?
Salting alone does not resolve region hot spotting for the entire write cycle.
Reason: HBase creates only one region by default and uses the default auto-split policy to create more
regions
Example: with a single region, all salted keys (AtNB/q.., B50SP.., e8aRjL.., ggEw9.., w56syI.., xwer51..) for data 1000 to 1005 land in Bucket 1
[Impact chart: utilization over time for Node1, Node2, and Node3]
Pre-Splitting
Outline: Pre-splitting creates multiple regions at table creation time, which helps to
reap the benefits of salting
How do I pre-split an HBase table?
Pre-splitting can be done by providing split points during table creation
Example: create 'table_name', 'cf_name', SPLITS => ['a', 'm']
Data: 1000, 1001, 1002, 1003, 1004, 1005
Bucket 1 ['' -> 'a']: AtNB/q.., B50SP..
Bucket 2 ['a' -> 'm']: e8aRjL.., ggEw9..
Bucket 3 ['m' -> '']: w56syI.., xwer51..
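How split points map keys to regions can be sketched in Python. This assumes that a region covers the range from its start key (inclusive) to its end key (exclusive) and that keys compare byte-wise, which matches Python string comparison for ASCII keys. The helpers `hex_split_points` and `region_index` are hypothetical names, and the evenly spaced hex prefixes are just one possible way to choose split points.

```python
from bisect import bisect_right


def hex_split_points(num_regions, prefix_chars=2):
    """Evenly spaced split points over a hex prefix space (e.g. 00..ff)."""
    space = 16 ** prefix_chars
    return [
        format(i * space // num_regions, f"0{prefix_chars}x")
        for i in range(1, num_regions)
    ]


def region_index(row_key, split_points):
    """Index of the region whose [start, end) range contains row_key."""
    return bisect_right(split_points, row_key)


# Split points from the slide's example: ['a', 'm'] gives three regions.
slide_splits = ["a", "m"]
for key in ["AtNB/q", "e8aRjL", "w56syI"]:
    print(key, "-> Bucket", region_index(key, slide_splits) + 1)

# Evenly spaced split points for keys salted with a two-character hex prefix.
print(hex_split_points(4))  # ['40', '80', 'c0']
```

A list like the one returned by `hex_split_points` could be supplied as the SPLITS argument when creating the table, provided the salt prefix really is uniformly distributed hex.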
Pre-Splitting
[Illustration: “GO Regions!!!” next to the previously overloaded node (“Hey Buddy! I’m overloaded”). Improvement chart: utilization over time for Node1, Node2, and Node3.]
Optimization Benefit: Salting and Pre-Splitting
Throughput improvement
• Current throughput: 60K records/min
• Improved throughput: 150K records/min
Configuration Tuning
Outline: Default configurations may not work for all use cases. We need to tune configurations based on
our use case
[Illustration: “It is 9!!” “No. It is 6!!”]
Configuration Tuning
The following are the key configurations that we tuned for our write use case:
• hbase.regionserver.handler.count (increased): number of threads in the region server used to process read and write requests
• hbase.hregion.memstore.flush.size (increased): the Memstore is flushed to disk after reaching the value provided in this configuration
• hbase.hstore.blockingStoreFiles (increased): flushes are blocked until compaction reduces the number of HFiles to this value
• hbase.hstore.blockingWaitTime (decreased): maximum time for which clients are blocked from writing to HBase
Configuration Tuning
Region Server Handler Count
Region server handlers (default count = 10)
[Diagram: multiple clients send requests to a region server, whose handlers serve its regions]
Tuning
• Rule of thumb: low for high payloads and high for low payloads
Benefit
• Increasing it could help improve throughput by increasing concurrency
Caution
• Can increase heap utilization, eventually leading to OOM
• High GC pauses, impacting throughput
Configuration Tuning
Region Memstore Size
A thread checks the Memstore size (default flush size: 128 MB)
[Diagram: region server with multiple regions; each Memstore is flushed to HFiles on HDFS]
Benefit
• Increasing the Memstore size generates larger HFiles, which minimizes compaction impact and improves throughput
Caution
• Can increase heap utilization, eventually leading to OOM
• High GC pauses, impacting throughput
Configuration Tuning
HStore Blocking Store Files
Default blocking store files: 10
[Diagram: a client writes to a region server; each Store's Memstore is flushed to HFiles on HDFS]
Benefit
• Increasing blocking store files allows the client to write more with fewer pauses, which improves throughput
Caution
• Compaction could take more time, as more files could be written without blocking the client
Configuration Tuning
HStore Blocking Wait Time
Time for which writes on a region are blocked (default: 90 secs)
[Diagram: a client writes to a region server; each Store's Memstore is flushed to HFiles on HDFS]
Benefit
• Decreasing the blocking wait time allows the client to write more with fewer pauses, which improves throughput
Caution
• Compaction could take more time, as more files could be written without blocking the client
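The four settings discussed above are typically set in hbase-site.xml. The sketch below renders such a fragment from a Python dictionary. The slides give only the direction of each change, so every tuned value here is a purely illustrative placeholder and not the value used in the talk. The flush size is expressed in bytes and the blocking wait time in milliseconds; `to_hbase_site_xml` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

# Defaults quoted on the slides: 10 handlers, 128 MB flush size,
# 10 blocking store files, 90 seconds blocking wait time.
# The values below are illustrative placeholders only (assumptions).
TUNED = {
    "hbase.regionserver.handler.count": 30,                  # increased from 10
    "hbase.hregion.memstore.flush.size": 256 * 1024 * 1024,  # increased from 128 MB, in bytes
    "hbase.hstore.blockingStoreFiles": 20,                   # increased from 10
    "hbase.hstore.blockingWaitTime": 30_000,                 # decreased from 90 s, in milliseconds
}


def to_hbase_site_xml(props):
    """Render a dict of properties as an hbase-site.xml style fragment."""
    lines = ["<configuration>"]
    for name, value in props.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


print(to_hbase_site_xml(TUNED))
```

Suitable values depend on payload size, heap size, and GC behavior, as the cautions on each slide point out, so any real change would need to be validated under load.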
Optimization Benefit: Optimal Configuration
Throughput improvement
• Current throughput: 150K records/min
• Improved throughput: 260K records/min
Reduced resource utilization
Optimal Read vs. Write Consistency Check
Multi Version Concurrency Control
Multi Version Concurrency Control (MVCC) is used to achieve row-level ACID properties in HBase.
Source: https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and
Write Steps with MVCC
Optimal Read vs. Write Consistency Check
Issue: MVCC got stuck after a few hours of write operation, drastically reducing write throughput,
as there are 140+ columns per row
The read point has to catch up with the write point to avoid a long delay between read and write versions
[Illustration: Write point: “I have a lot to write”; Read point: “I have a lot to catch up”. Impact chart: throughput (records/min) over time.]
Optimal Read vs. Write Consistency Check
Solution: Reduce the pressure on MVCC by storing all the 140+ columns in a single cell
Column representation
Separate columns: col1 = abc, col2 = def, col3 = ghi
Single column:
{
"col1": "abc",
"col2": "def",
"col3": "ghi"
}
[Improvement chart: throughput (records/min) over time]
Optimization Benefit: Optimal Read vs. Write Consistency Check
• Stability improvement
• Steady resource utilization
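Packing a row's columns into one cell value can be sketched with Python's standard json module. This is a minimal illustration under assumptions: the slides do not show the serialization code, and `pack_row` and `unpack_row` are hypothetical helper names.

```python
import json


def pack_row(columns):
    """Serialize all columns of a row into one JSON value for a single cell."""
    text = json.dumps(columns, separators=(",", ":"), sort_keys=True)
    return text.encode("utf-8")


def unpack_row(cell_value):
    """Restore the column dictionary from the single cell value."""
    return json.loads(cell_value.decode("utf-8"))


row = {"col1": "abc", "col2": "def", "col3": "ghi"}
cell = pack_row(row)
print(cell)  # b'{"col1":"abc","col2":"def","col3":"ghi"}'
assert unpack_row(cell) == row
```

The trade-off of this layout is that reading or updating one logical column means reading or rewriting the whole cell.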
Storage Optimization
Storage is one of the important factors affecting the scalability of an HBase cluster
Write throughput depends mainly on the average row size, as writing is an I/O-bound
process. Optimizing storage helps achieve more throughput.
Example:
Using a column name such as “BA_S1” instead of “Billing_Address_Street_Line_Number_One” helps
reduce storage and improve write throughput
Column Name                             #Characters  Additional Bytes to Each Row
Billing_Address_Street_Line_Number_One  39           78
BA_S1                                   5            10
Compression
Compression is one of the storage optimization techniques
Commonly used compression algorithms in HBase
• Snappy
• Gzip
• LZO
• LZ4
Compression ratio: Gzip achieves a better compression ratio than Snappy and LZO
Resource consumption: Snappy consumes fewer resources for compression and decompression than Gzip
Optimization Benefit: Compression
Productivity improvement: reduced storage costs and improved throughput
            Before Optimization  After Optimization  %Improvement
Storage     3300 GB              963 GB              ~71%
Throughput  260K Records/Min     380K Records/Min    ~46%
Row Size Optimization
A few of the 140+ columns were empty for most of the rows. Storing empty columns in
JSON format increases the average row size
Solution: Avoid storing empty values when using JSON
Example (Salesforce scenario):
Data
{"col1":"abc",
"col2":"def",
"col3":"",
"col4":"",
"col5":"ghi"}
Data in HBase (empty values removed)
{"col1":"abc",
"col2":"def",
"col5":"ghi"}
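Dropping empty values before serialization can be sketched as follows. It assumes that both empty strings and missing (None) values count as "empty", which is an interpretation of the slide, and `drop_empty` is a hypothetical helper name.

```python
import json


def drop_empty(columns):
    """Remove columns whose value is an empty string or None (assumed meaning of "empty")."""
    return {name: value for name, value in columns.items() if value not in ("", None)}


data = {"col1": "abc", "col2": "def", "col3": "", "col4": "", "col5": "ghi"}
stored = drop_empty(data)

before = json.dumps(data, separators=(",", ":"))
after = json.dumps(stored, separators=(",", ":"))
print(after)                    # {"col1":"abc","col2":"def","col5":"ghi"}
print(len(before), len(after))  # serialized sizes before and after
```

A reader of the stored value has to treat an absent key as an empty column, so the schema of expected columns must be known on the read side.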
Optimization Benefit: Row Size Optimization
Productivity improvement: reduced storage costs and improved throughput
            Before Optimization  After Optimization  %Improvement
Storage     963 GB               784 GB              ~19%
Throughput  380K Records/Min     480K Records/Min    ~26%
Recap
Data format: Text
Influx rate: 500 MB/min
Write throughput
• Initial: 60K records/min
• Achieved: 480K records/min
• SLA: 175K records/min
Optimizations: Salting, Pre-Splitting, Config. Tuning, Optimal MVCC, Compression, Optimal Row Size
Benefits: reduced storage, improved stability, reduced resource utilization
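The throughput figures reported after each tuning step can be put side by side with a few lines of Python. The stage labels are taken from the Optimization Benefit slides; grouping them into one list and the helper name `cumulative_speedup` are choices made here for illustration.

```python
# Throughput in thousands of records/min, as reported on the slides.
STAGES = [
    ("Initial", 60),
    ("Salting + Pre-Splitting", 150),
    ("Config. Tuning", 260),
    ("Compression", 380),
    ("Optimal Row Size", 480),
]
SLA = 175  # thousands of records/min


def cumulative_speedup(stages):
    """Speedup of each stage relative to the first (initial) stage."""
    base = stages[0][1]
    return [(name, round(rate / base, 2)) for name, rate in stages]


for name, factor in cumulative_speedup(STAGES):
    print(f"{name}: {factor}x")

print("Final vs. SLA:", round(STAGES[-1][1] / SLA, 2))  # 2.74
```

The last stage works out to 8.0x the initial rate and about 2.74x the SLA, in line with the "~8 times" and "~3 times" figures quoted for the overall tuning effort.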
Best Practices
Row key design
• Know your data better before pre-splitting
• Keep row keys short, but long enough for data access
Minimize IO
• Fewer column families
• Shorter column family and qualifier names
Locality
• Review the locality of regions periodically
• Co-locate region servers and data nodes
Maximize Throughput
• Minimize major compactions
• Use high-throughput disks
When HBase?
HBase is for you
• Random read/write access to high volumes of data in real time
• No dependency on RDBMS features
• Variable schema with flexibility to add columns
• Single-key or key-range lookups for de-normalized data
• Multiple versions of Big Data
HBase is NOT for you
• Replacement for RDBMS
• Low data volume
• Scanning and aggregation on large volumes of data
• Replacement for batch processing engines like MapReduce/Spark
Q & A
Thank You

More Related Content

What's hot

Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size Matters
DataWorks Summit
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
JAX London
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
alexbaranau
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
DataWorks Summit/Hadoop Summit
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Yahoo!デベロッパーネットワーク
 
Local Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache PhoenixLocal Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache Phoenix
Rajeshbabu Chintaguntla
 
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
Cloudera, Inc.
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
DataWorks Summit
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
DataWorks Summit
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
HBaseCon
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit
 
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GCHadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Erik Krogen
 
Hadoop security
Hadoop securityHadoop security
Hadoop security
Shivaji Dutta
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
强 王
 
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage TieringHadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Erik Krogen
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
Owen O'Malley
 

What's hot (20)

Millions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size MattersMillions of Regions in HBase: Size Matters
Millions of Regions in HBase: Size Matters
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DMUpgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM
 
Local Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache PhoenixLocal Secondary Indexes in Apache Phoenix
Local Secondary Indexes in Apache Phoenix
 
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBaseHBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GCHadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
 
Hadoop security
Hadoop securityHadoop security
Hadoop security
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage TieringHadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
Hadoop Meetup Jan 2019 - Router-Based Federation and Storage Tiering
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 

Similar to Scaling HBase for Big Data

Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
Yiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
yongboy
 
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Yahoo Developer Network
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
sheetal sharma
 
Hbase
HbaseHbase
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
EffectiveUI
 
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
Effective
 
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
Tony Hillerson
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
Cloudera, Inc.
 
HBase New Features
HBase New FeaturesHBase New Features
HBase New Features
rxu
 
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
zpinter
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
Cloudera, Inc.
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
Jean-Baptiste Poullet
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
Jean-Baptiste Poullet
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
Cloudera, Inc.
 
Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud Environment
Fei Dong
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
Anil Gupta
 
Hadoop - Apache Hbase
Hadoop - Apache HbaseHadoop - Apache Hbase
Hadoop - Apache Hbase
Vibrant Technologies & Computers
 
Hbase
HbaseHbase

Similar to Scaling HBase for Big Data (20)

Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
Apache Hadoop India Summit 2011 talk "Searching Information Inside Hadoop Pla...
 
Apache hadoop hbase
Apache hadoop hbaseApache hadoop hbase
Apache hadoop hbase
 
Hbase
HbaseHbase
Hbase
 
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
 
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
 
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
HBase New Features
HBase New FeaturesHBase New Features
HBase New Features
 
Rails on HBase
Rails on HBaseRails on HBase
Rails on HBase
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Hw09 Practical HBase Getting The Most From Your H Base Install
Hw09   Practical HBase  Getting The Most From Your H Base InstallHw09   Practical HBase  Getting The Most From Your H Base Install
Hw09 Practical HBase Getting The Most From Your H Base Install
 
Optimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud EnvironmentOptimization on Key-value Stores in Cloud Environment
Optimization on Key-value Stores in Cloud Environment
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Hadoop - Apache Hbase
Hadoop - Apache HbaseHadoop - Apache Hbase
Hadoop - Apache Hbase
 
Hbase
HbaseHbase
Hbase
 

More from Salesforce Engineering

Locker Service Ready Lightning Components With Webpack
Locker Service Ready Lightning Components With WebpackLocker Service Ready Lightning Components With Webpack
Locker Service Ready Lightning Components With Webpack
Salesforce Engineering
 
Techniques to Effectively Monitor the Performance of Customers in the Cloud
Techniques to Effectively Monitor the Performance of Customers in the CloudTechniques to Effectively Monitor the Performance of Customers in the Cloud
Techniques to Effectively Monitor the Performance of Customers in the Cloud
Salesforce Engineering
 
Predictive System Performance Data Analysis
Predictive System Performance Data AnalysisPredictive System Performance Data Analysis
Predictive System Performance Data Analysis
Salesforce Engineering
 
Apache HBase State of the Project
Apache HBase State of the ProjectApache HBase State of the Project
Apache HBase State of the Project
Salesforce Engineering
 
Hit the Trail with Trailhead
Hit the Trail with TrailheadHit the Trail with Trailhead
Hit the Trail with Trailhead
Salesforce Engineering
 
HBase/PHOENIX @ Scale
HBase/PHOENIX @ ScaleHBase/PHOENIX @ Scale
HBase/PHOENIX @ Scale
Salesforce Engineering
 
Scaling up data science applications
Scaling up data science applicationsScaling up data science applications
Scaling up data science applications
Salesforce Engineering
 
Containers and Security for DevOps
Containers and Security for DevOpsContainers and Security for DevOps
Containers and Security for DevOps
Salesforce Engineering
 
Aspect Oriented Programming: Hidden Toolkit That You Already Have
Aspect Oriented Programming: Hidden Toolkit That You Already HaveAspect Oriented Programming: Hidden Toolkit That You Already Have
Aspect Oriented Programming: Hidden Toolkit That You Already Have
Salesforce Engineering
 
Monitoring @ Scale in Salesforce
Monitoring @ Scale in SalesforceMonitoring @ Scale in Salesforce
Monitoring @ Scale in Salesforce
Salesforce Engineering
 
Performance Tuning with XHProf
Performance Tuning with XHProfPerformance Tuning with XHProf
Performance Tuning with XHProf
Salesforce Engineering
 
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
Salesforce Engineering
 
Implementing a Content Strategy Is Like Running 100 Miles
Implementing a Content Strategy Is Like Running 100 MilesImplementing a Content Strategy Is Like Running 100 Miles
Implementing a Content Strategy Is Like Running 100 Miles
Salesforce Engineering
 
Salesforce Cloud Infrastructure and Challenges - A Brief Overview
Salesforce Cloud Infrastructure and Challenges - A Brief OverviewSalesforce Cloud Infrastructure and Challenges - A Brief Overview
Salesforce Cloud Infrastructure and Challenges - A Brief Overview
Salesforce Engineering
 
Koober Preduction IO Presentation
Koober Preduction IO PresentationKoober Preduction IO Presentation
Koober Preduction IO Presentation
Salesforce Engineering
 
Finding Security Issues Fast!
Finding Security Issues Fast!Finding Security Issues Fast!
Finding Security Issues Fast!
Salesforce Engineering
 
Microservices
MicroservicesMicroservices
Microservices
Salesforce Engineering
 
Global State Management of Micro Services
Global State Management of Micro ServicesGlobal State Management of Micro Services
Global State Management of Micro Services
Salesforce Engineering
 
The Future of Hbase
The Future of HbaseThe Future of Hbase
The Future of Hbase
Salesforce Engineering
 
Apache BookKeeper Distributed Store- a Salesforce use case
Apache BookKeeper Distributed Store- a Salesforce use caseApache BookKeeper Distributed Store- a Salesforce use case
Apache BookKeeper Distributed Store- a Salesforce use case
Salesforce Engineering
 


Scaling HBase for Big Data

  • 1. Ranjeeth Kathiresan Senior Software Engineer rkathiresan@salesforce.com Scaling HBase for Big Data Salesforce Gurpreet Multani Principal Software Engineer gmultani@salesforce.com
  • 2. Introduction Ranjeeth Kathiresan is a Senior Software Engineer at Salesforce, where he focuses primarily on improving the performance, scalability, and availability of applications by assessing and tuning the server-side components in terms of code, design, configuration, and so on, particularly with Apache HBase. Ranjeeth is an admirer of performance engineering and is especially fond of tuning an application to perform better. Gurpreet Multani is a Principal Software Engineer at Salesforce. At Salesforce, Gurpreet has led initiatives to scale various Big Data technologies such as Apache HBase, Apache Solr, and Apache Kafka. He is particularly interested in finding ways to optimize code to reduce bottlenecks, consume fewer resources, and achieve more out of available capacity in the process.
  • 3. Agenda • HBase @ Salesforce • CAP Theorem • HBase Refresher • Typical HBase Use Cases • HBase Internals • Data Loading Use Case • Write Bottlenecks • Tuning Writes • Best Practices • Q&A
  • 4. HBase @ Salesforce Typical cluster data volume: 120 TB. Nodes across all clusters: 2200+. Variety: Simple Row Store, Denormalization, Messaging, Event Log, Analytics, Metrics, Graphs, Cache.
  • 5. CAP Theorem It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Availability (each client can always read and write), Consistency (all clients have the same view of the data), and Partition tolerance (the system works well despite physical network partitions). [Diagram: Cassandra, RDBMS, and HBase placed on the CAP triangle]
  • 6. HBase Refresher • Distributed database • Non-relational • Column-oriented • Supports compression • In-memory operations • Bloom filters on a per-column basis • Written in Java • Runs on top of HDFS “A sparse, distributed, persistent, multidimensional, sorted map”
  • 7. Typical HBase Use Cases • Large data volume, running into at least hundreds of GBs (aka Big Data) • Data access patterns are well known at design time and are not expected to change, i.e. no secondary indexes / joins need to be added at a later stage • RDBMS-like multi-row transactions are not required • Large “working set” of data (working set = data being accessed or being updated) • Multiple versions of data
  • 8. HBase Internals: Write Operation [Diagram: Client, Zookeeper, Region Servers (.META. region; regions, each with WAL, Memstore, and Store with HFiles), HDFS] 1. Get .META. location 2. Get Region location 3. Put 4. Write (WAL) 5. Write (Memstore), followed by Flush (Memstore to HFile)
  • 9. HBase Internals: Compaction The main purpose of compaction is to optimize read performance by reducing the number of disk seeks. Minor Compaction: Trigger: automatic, based on configurations. Mechanism: • Reads a configurable number of smaller HFiles and writes them into a single large HFile. Major Compaction: Trigger: scheduled or manual. Mechanism: • Reads all HFiles of a region and writes to a single large HFile • Physical deletion of records • Tries to achieve high data locality
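The major-compaction mechanism on slide 9 can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the `(row_key, timestamp, value)` tuple layout, the use of `None` as a delete marker, and keeping only one version per row. Real HFiles and HBase's compaction code are far more involved.

```python
import heapq

# Toy model (not HBase's real format): each "HFile" is an immutable list of
# (row_key, timestamp, value) tuples sorted by row key, newest timestamp first.
# value=None stands in for a delete marker; this is an assumption of the sketch.

def sort_key(cell):
    row_key, timestamp, _ = cell
    return (row_key, -timestamp)

def major_compact(hfiles):
    """Merge all HFiles into one, keeping the newest cell per row key and
    physically dropping rows whose newest cell is a delete marker."""
    merged = heapq.merge(*hfiles, key=sort_key)
    compacted = []
    last_row = None
    for row_key, timestamp, value in merged:
        if row_key == last_row:
            continue  # older version of a row that was already handled
        last_row = row_key
        if value is not None:
            compacted.append((row_key, timestamp, value))
    return compacted

hfile_1 = [("row1", 1, "a"), ("row3", 1, "c")]
hfile_2 = [("row1", 2, "a2"), ("row2", 2, "b")]
hfile_3 = [("row3", 3, None)]  # row3 deleted at timestamp 3

single_hfile = major_compact([hfile_1, hfile_2, hfile_3])
print(single_hfile)
# [('row1', 2, 'a2'), ('row2', 2, 'b')]
```

Because the inputs are never modified and every surviving record is written again into the new file, the sketch also shows why compaction rewrites data that was already on disk.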
  • 10. HBase Internals: Read Operation [Diagram: Client, Zookeeper, Region Servers (.META. region; regions with Block Cache, Memstore, and HFiles), HDFS] 1. Get .META. location 2. Get Region location 3. Get 4. Read 5. Read 6. Read
  • 11. Data Loading Overview • One of the use cases is to store and process data in text format • Lookups from HBase using the row key are more efficient • A subset of data is stored in Solr for effective lookups from HBase [Diagram: Salesforce Application, Extract, Transform, Load]
  • 12. Data Insights Key details about the data used for processing (Velocity, Volume, Variety): • 500 MB data influx/min • 200 GB data size/cycle • Data format: Text • Throughput SLA: 175K Records/Min • 600K Records/Min • 3300 GB HBase data size/cycle • Data format: CSV, JSON • 250MM Records/Day
  • 13. Write Operation Bottlenecks • Influx rate: 600K Records/Min • Write rate: 60K Records/Min • Write operation in progress for >3 days • Write rate dropped to <5K Records/Min after a few hours
  • 14. Write Operation Tunings Improved throughput by ~8 times & achieved ~3 times more than expected throughput. Initial Throughput: 60K Records/Min. Achieved Throughput: 480K Records/Min. Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
  • 15. Region Hot Spotting Outline: Region hot spotting refers to overutilizing a single region server during write operations, despite having multiple nodes in the cluster, because sequential rowkeys are used. Scenario: one node says “Hey Buddy! I’m overloaded” while the other two say “Not our turn, yet!!” Impact: [Chart: utilization over time for Node1, Node2, Node3]
  • 16. Write Operation Tunings Initial Throughput: 60K Records/Min. Achieved Throughput: 480K Records/Min. Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
  • 17. Salting Outline: Salting helps to distribute writes over multiple regions by using random row keys. How do I implement salting? Salting is implemented by defining the rowkeys wisely: a salt prefix (random character) is added to the original key. Two common ways of salting: • Adding a random number as a prefix, based on modulo • Hashing the rowkey
  • 18. Salting: Prefixing a random number The random number can be obtained by performing a modulo operation between the insertion index and the total number of buckets: Salted Key = (++index % total buckets) + "_" + Original Key. Example with 3 salt buckets: Data: 1000, 1001, 1002, 1003, 1004, 1005. Bucket 1: 0_1000, 0_1003. Bucket 2: 1_1001, 1_1004. Bucket 3: 2_1002, 2_1005. Key points: • Randomness is provided only to some extent, as it depends on insertion order • Salted keys stored in HBase won’t be visible to the client during lookups
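A minimal Python sketch of the modulo-prefix scheme on slide 18. The helper name `make_modulo_salter` is hypothetical, and the counter is assumed to start so that the first inserted key receives prefix 0, which is what the slide's example shows.

```python
from itertools import count

def make_modulo_salter(total_buckets):
    """Return a function that prefixes each key with
    (insertion index % total_buckets)."""
    index = count()  # 0, 1, 2, ... advanced once per inserted key

    def salt(original_key):
        return f"{next(index) % total_buckets}_{original_key}"

    return salt

salt = make_modulo_salter(total_buckets=3)
salted = [salt(str(key)) for key in range(1000, 1006)]
print(salted)
# ['0_1000', '1_1001', '2_1002', '0_1003', '1_1004', '2_1005']
```

The prefix depends on the running counter rather than on the key itself, so a reader holding only the original key has no way to recompute it.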
  • 19. Salting: Hashing the rowkey Hashing the entire rowkey, or adding a few characters of the hash of the rowkey as a prefix, can be used to implement salting: Salted Key = hash(Original Key) OR firstNChars(hash(Original Key)) + "_" + Original Key. Example with 3 salt buckets: Data: 1000, 1001, 1002, 1003, 1004, 1005. Bucket 1: AtNB/q.., B50SP.. Bucket 2: e8aRjL.., ggEw9.. Bucket 3: w56syI.., xwer51.. Key points: • Randomness in the row key is ensured by the hash values • HBase lookups will be effective, as the same hashing function can be used during lookup
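A minimal Python sketch of the hash-prefix variant on slide 19. The choice of SHA-256, the hex encoding, and the two-character prefix are assumptions for illustration; the slides do not say which hash function or prefix length was used.

```python
import hashlib

def hash_salted_key(original_key, prefix_chars=2):
    """Prefix the key with the first few hex characters of its SHA-256 digest."""
    digest = hashlib.sha256(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_chars]}_{original_key}"

write_key = hash_salted_key("1000")
lookup_key = hash_salted_key("1000")  # recomputed by the reader; no state needed
print(write_key, write_key == lookup_key)
```

Since the prefix is a pure function of the original key, the same function produces the stored row key at lookup time.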
  • 20. Salting: Does it help? Salting does not resolve region hot spotting for the entire write cycle. Reason: HBase creates only one region by default and uses the default auto-split policy to create more regions. Example: Data 1000, 1001, 1002, 1003, 1004, 1005 is salted to AtNB/q.., B50SP.., e8aRjL.., ggEw9.., w56syI.., xwer51.., all of which land in Bucket 1. Impact: [Chart: utilization over time for Node1, Node2, Node3]
  • 21. Write Operation Tunings Initial Throughput: 60K Records/Min. Achieved Throughput: 480K Records/Min. Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
  • 22. Pre-Splitting Outline: Pre-splitting creates multiple regions during table creation, which helps to reap the benefits of salting. How do I pre-split an HBase table? Pre-splitting can be done by providing split points during table creation. Example: create 'table_name', 'cf_name', SPLITS => ['a', 'm'] Data: 1000, 1001, 1002, 1003, 1004, 1005. Bucket 1 ['' -> 'a']: AtNB/q.., B50SP.. Bucket 2 ['a' -> 'm']: e8aRjL.., ggEw9.. Bucket 3 ['m' -> '']: w56syI.., xwer51..
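How the split points 'a' and 'm' on slide 22 map keys to regions can be sketched in Python. The function `region_for` is hypothetical, the keys are only the visible prefixes of the slide's truncated hashes, and plain string comparison is assumed to match HBase's byte-wise ordering, which holds for ASCII keys like these.

```python
from bisect import bisect_right

def region_for(row_key, split_points):
    """Return the 0-based region index for row_key.

    split_points must be sorted. Region i covers [split[i-1], split[i]),
    with an open start for the first region and an open end for the last.
    """
    return bisect_right(split_points, row_key)

splits = ["a", "m"]  # the split points from the slide's example
keys = ["AtNB/q", "B50SP", "e8aRjL", "ggEw9", "w56syI", "xwer51"]
assignment = {key: region_for(key, splits) + 1 for key in keys}
print(assignment)
# {'AtNB/q': 1, 'B50SP': 1, 'e8aRjL': 2, 'ggEw9': 2, 'w56syI': 3, 'xwer51': 3}
```

Uppercase letters sort before 'a' in byte order, which is why the first two keys fall into the first region.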
  • 23. Pre-Splitting Scenario: “Hey Buddy! I’m overloaded” / “GO Regions!!!” Improvement: [Chart: utilization over time for Node1, Node2, Node3]
  • 24. Optimization Benefit: Salting and Pre-Splitting Throughput improvement: Current Throughput: 60K Records/Min. Improved Throughput: 150K Records/Min.
  • 25. Write Operation Tunings Initial Throughput: 60K Records/Min. Achieved Throughput: 480K Records/Min. Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
  • 26. Configuration Tuning Outline: Default configurations may not work for all use cases. We need to tune configurations based on our use case. [Cartoon: “It is 9!!” “No. It is 6!!”]
  • 27. Configuration Tuning The following are the key configurations which we have tuned based on our write use case:
    Configuration | Purpose | Change Nature
    hbase.regionserver.handler.count | Number of threads in the region server used to process read and write requests | Increased
    hbase.hregion.memstore.flush.size | Memstore will be flushed to disk after reaching the value provided in this configuration | Increased
    hbase.hstore.blockingStoreFiles | Flushes will be blocked until compaction reduces the number of HFiles to this value | Increased
    hbase.hstore.blockingWaitTime | Maximum time for which the clients will be blocked from writing to HBase | Decreased
  • 28. Configuration Tuning: Region Server Handler Count [Diagram: clients sending requests to the handlers of a region server hosting several regions] Tuning: • Region server handlers (default count = 10) Benefit: • Increasing it could help improve throughput by increasing concurrency • Rule of thumb: low for high payloads and high for low payloads Caution: • Can increase heap utilization, eventually leading to OOM • High GC pauses, impacting throughput
  • 29. Configuration Tuning: Region Memstore Size [Diagram: region server with regions whose Memstores are flushed to HFiles on HDFS] Tuning: • Thread which checks Memstore size (default: 128 MB) Benefit: • Increasing the Memstore size will generate larger HFiles, which minimizes compaction impact and improves throughput Caution: • Can increase heap utilization, eventually leading to OOM • High GC pauses, impacting throughput
  • 30. Configuration Tuning: HStore Blocking Store Files [Diagram: client writing to a region server whose Memstores are flushed to a growing number of HFiles per Store on HDFS] Tuning: • Default blocking store files: 10 Benefit: • Increasing blocking store files will allow the client to write more with fewer pauses and improves throughput Caution: • Compaction could take more time, as more files could be written without blocking the client
  • 31. Configuration Tuning: HStore Blocking Wait Time [Diagram: client writing to a region server whose Memstores are flushed to a growing number of HFiles per Store on HDFS] Tuning: • Time for which writes on a region are blocked (default: 90 secs) Benefit: • Decreasing the blocking wait time will allow the client to write more with fewer pauses and improves throughput Caution: • Compaction could take more time, as more files could be written without blocking the client
  • 32. Optimization Benefit: Optimal Configuration Throughput improvement: Current Throughput: 150K Records/Min. Improved Throughput: 260K Records/Min. Reduced resource utilization.
  • 33. Write Operation Tunings Initial Throughput: 60K Records/Min. Achieved Throughput: 480K Records/Min. Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
  • 34. Optimal Read vs. Write Consistency Check: Multi Version Concurrency Control Multi Version Concurrency Control (MVCC) is used to achieve row-level ACID properties in HBase. [Figure: write steps with MVCC] Source: https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and
  • 35. Optimal Read vs. Write Consistency Check Issue: MVCC got stuck after a few hours of write operation, drastically impacting write throughput, as there are 140+ columns per row. Scenario: Write point: “I have a lot to write.” Read point: “I have a lot to catch up.” The read point has to catch up with the write point to avoid a high delay between read and write versions. Impact: [Chart: throughput (Records/Min) over time]
  • 36. Optimal Read vs. Write Consistency Check Solution: Reduce the pressure on MVCC by storing all the 140+ columns in a single cell. Column representation: the values abc, def, ghi in separate columns col1, col2, col3 become a single column holding { "col1":"abc", "col2":"def", "col3":"ghi" }. Improvement: [Chart: throughput (Records/Min) over time]
  • 37. Optimization Benefit: Optimal Read vs. Write Consistency Check • Stability improvement • Steady resource utilization
  • 38. Storage Optimization Storage is one of the important factors impacting the scalability of an HBase cluster. Write operation throughput mainly depends on the average row size, as writing is an I/O-bound process. Optimizing storage helps achieve more throughput. Example: Using the column name “BA_S1” instead of “Billing_Address_Street_Line_Number_One” reduces storage and improves write throughput.
    Column Name | #Characters | Additional Bytes to each Row
    Billing_Address_Street_Line_Number_One | 39 | 78
    BA_S1 | 5 | 10
  • 39. Write Operation Tunings Initial Throughput: 60K Records/Min. Achieved Throughput: 480K Records/Min. Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
  • 40. Compression Compression is one of the storage optimization techniques. Commonly used compression algorithms in HBase: • Snappy • Gzip • LZO • LZ4 Compression ratio: Gzip’s compression ratio is better than that of Snappy and LZO. Resource consumption: Snappy consumes fewer resources for compression and decompression than Gzip.
  • 41. Optimization Benefit: Compression Productivity improvement: reduced storage costs, improved throughput.
    Metric | Before Optimization | After Optimization | % Improvement
    Storage | 3300 GB | 963 GB | ~71%
    Throughput | 260K Records/Min | 380K Records/Min | ~42%
  • 42. Write Operation Tunings Initial Throughput: 60K Records/Min. Achieved Throughput: 480K Records/Min. Tunings: Salting, Pre-Splitting, Optimal Configuration, Compression, Row Size Optimization, Optimal Read vs. Write Consistency Check
  • 43. Row Size Optimization A few of the 140+ columns were empty for most of the rows. Storing empty columns in JSON format increases the average row size. Solution: Avoid storing empty values when using JSON. Example (Salesforce scenario): Data: {"col1":"abc", "col2":"def", "col3":"", "col4":"", "col5":"ghi"} After removing empty values, data in HBase: {"col1":"abc", "col2":"def", "col5":"ghi"}
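A minimal Python sketch of the row size optimization on slide 43: the row's columns are packed into one JSON string, and columns whose value is empty are left out. The function name `pack_row` and the compact separators are assumptions for illustration; the slides do not show the serialization code that was used.

```python
import json

def pack_row(columns):
    """Serialize a row's columns into one compact JSON string,
    skipping columns whose value is the empty string."""
    non_empty = {name: value for name, value in columns.items() if value != ""}
    # separators=(",", ":") drops the default spaces, saving a few bytes per column
    return json.dumps(non_empty, separators=(",", ":"))

row = {"col1": "abc", "col2": "def", "col3": "", "col4": "", "col5": "ghi"}
cell_value = pack_row(row)
print(cell_value)
# {"col1":"abc","col2":"def","col5":"ghi"}
```

A reader of the packed value has to treat a missing key as an empty column, since the two are no longer distinguishable after packing.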
  • 44. Optimization Benefit: Row Size Optimization Productivity improvement: reduced storage costs, improved throughput.
    Metric | Before Optimization | After Optimization | % Improvement
    Storage | 963 GB | 784 GB | ~19%
    Throughput | 380K Records/Min | 480K Records/Min | ~26%
  • 45. RECAP
  • 46. Recap Write throughput: Initial: 60K Records/Min. Achieved: 480K Records/Min. SLA: 175K Records/Min. Data format: Text. Influx rate: 500 MB/Min. Optimizations: Salting, Pre-Splitting, Config. Tuning, Optimal MVCC, Compression, Optimal Row Size. Benefits: Reduced Storage, Improved Stability, Reduced Resource Utilization
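The "~8 times" and "~3 times" figures quoted in the deck can be re-derived from the per-stage throughput numbers on the slides. The short Python sketch below only restates those reported figures; the stage labels are paraphrased and nothing in it was measured independently.

```python
# Throughput in records/min after each tuning stage, as reported on the slides.
stages = [
    ("Initial", 60_000),
    ("Salting + pre-splitting", 150_000),
    ("Configuration tuning", 260_000),
    ("Compression", 380_000),
    ("Row size optimization", 480_000),
]
sla = 175_000  # records/min, the throughput SLA from the slides

initial = stages[0][1]
final = stages[-1][1]
overall_speedup = final / initial  # 8.0
sla_ratio = final / sla            # about 2.74

for name, rate in stages:
    print(f"{name}: {rate:,} records/min ({rate / initial:.1f}x initial)")
print(f"Overall: {overall_speedup:.0f}x initial, {sla_ratio:.2f}x the SLA")
```

The MVCC change does not appear as a stage because the slides report it as a stability improvement rather than a throughput number.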
  • 47. Best Practices Row key design: • Know your data better before pre-splitting • Shorter row key, but long enough for data access. Minimize IO: • Fewer Column Families • Shorter Column Family and Qualifier names. Locality: • Review the locality of regions periodically • Co-locate Region Server and Data Node. Maximize throughput: • Minimize major compactions • Use high-throughput disks
  • 48. When HBase? HBase is for you: • Random read/write access to high volumes of data in real time • No dependency on RDBMS features • Variable schema with flexibility to add columns • Single/range of key-based lookups for de-normalized data • Multiple versions of Big Data. HBase is NOT for you: • Replacement for RDBMS • Low data volume • Scanning and aggregation on large volumes of data • Replacement for batch processing engines like MapReduce/Spark
  • 49. Q & A

Editor's Notes

  1. Add picture for Agenda
  2. Talk about scale and use cases of HBase at Salesforce
  3. Client gets the location of the HBase META region from Zookeeper. Client gets the table details from the HBase META region. Client invokes the HBase API to write records in HBase. Region Server writes the records to the WAL (Write Ahead Log). Region Server writes the records to the Memstore. Memstore is flushed to HDFS as an HFile when the flushing policy is met.
  4. Write amplification: records are rewritten during compaction, as HFiles are immutable. Explain the benefit of data locality.
  5. Client gets the location of the HBase META region from Zookeeper. Client gets the table details from the HBase META region. Client invokes the HBase API to read records from HBase. Region Server fetches the record from the block cache in HBase. If the record is not available in the block cache, it fetches the record from the Memstore. If the record is not available in the Memstore, it fetches the record from an HFile.
  6. Major write bottlenecks were low throughput, instability in the achieved throughput, and a write operation that had not finished even after 3 days.
  7. Write Operation Tunings helped to improve write throughput and stability in HBase
  8. The impact explained here is about the beginning of the write cycle. After more regions get created, salting may help in distributing the load and utilization to multiple nodes.
  9. Salting and Pre-splitting together helped us to improve write operation throughput from 60K to 150K Records/Min
  10. Optimal configuration in HBase yielded higher throughput and reduced resource utilization
  11. Optimal Read vs Write Consistency Check helped to stabilize the write operation throughput and resource utilization
  12. Compression is one of the commonly used approaches to reduce storage utilization. It helped to improve throughput and reduce storage utilization.
  13. Reducing IO by storing only essential information helped to improve throughput and reduce storage utilization
  14. Optimized various attributes in order to achieve higher write throughput in a stable manner. Some of the attributes are applicable to both read and write operations. Tuning for writes does not imply that read operations will be efficient as well. Optimizing HBase for both write and read operations might not be effective.