HBase Read High Availability Using
Timeline-Consistent Region Replicas
Enis Soztutar
Devaraj Das
June 2014
About Us
Enis Soztutar
Committer and PMC member in Apache
HBase and Hadoop since 2007
HBase team @Hortonworks
Twitter @enissoz
Devaraj Das
Committer and PMC member in
Hadoop since 2006
Committer at HBase
Co-founder @Hortonworks
Twitter @ddraj
Outline of the talk
PART I: Use case and semantics
 CAP recap
 Use case and motivation
 Region replicas
 Timeline consistency
 Semantics
PART II : Implementation and next steps
 Server side
 Client side
 Data replication
 Next steps & Summary
Part I
Use case and semantics
CAP reCAP
Partition tolerance, Consistency, Availability: pick two
HBase is CP
• In a distributed system you cannot NOT have P
• C vs A is about what happens if there is a network partition!
• A and C are NEVER binary values, always a range
• Different operations in the system can have different A / C choices
• HBase cannot be simplified as CP
HBase consistency model
For a single row, HBase is strongly consistent within a data center
Across rows HBase is not strongly consistent (but available!).
When a RS goes down, only the regions on that server become
unavailable. Other regions are unaffected.
HBase multi-DC replication is “eventually consistent”
HBase applications should carefully design the schema for correct
semantics / performance tradeoff
Use cases and motivation
More and more applications are looking for a “0 down time” platform
 30 seconds downtime (aggressive MTTR time) is too much
Certain classes of apps are willing to tolerate decreased consistency
guarantees in favor of availability
 Especially for READs
Some build wrappers around the native API to be able to handle failures of
destination servers
 Multi-DC: when one server is down in one DC, the client switches to a different one
Can we do something in HBase natively?
 Within the same cluster?
Use cases and motivation
Designing the application requires careful tradeoff consideration
 In schema design, since single-row operations are strongly consistent but there are no multi-row transactions
 Multi-datacenter replication (active-passive, active-active, backups etc)
It is good to be able to give the application flexibility to pick-and-choose
 Higher availability vs stronger consistency
Read vs Write
 Different consistency models for read vs write
 Read-repair, latest ts-wins vs linearizable updates
Initial goals
Support applications talking to a single cluster really well
 No perceived downtime
 Only for READs
If apps want to tolerate cluster failures
 Use HBase replication
 Combine that with wrappers in the application
Region Replicas in HBase
Timeline Consistency in HBase
Region replicas
For every region of the table, there can be more than one replica
 Every region replica has an associated “replica_id”, starting from 0
 Each region replica is hosted by a different region server
Tables can be configured with a REGION_REPLICATION parameter
 Default is 1
 No change in the current behavior
One replica per region is the “default” or “primary”
 Only this replica can accept WRITEs
 All reads from this region replica return the most recent data
Other replicas, also called “secondaries” follow the primary
 They see only committed updates
Region replicas
Secondary region replicas are read-only
 No writes are routed to secondary replicas
 Data is replicated to secondary regions (more on this later)
 Serve data from the same data files as the primary
 May not have received the recent data
 Reads and Scans can be performed, returning possibly stale data
Region replica placement is done to maximize availability of any particular region
 Region replicas are not co-located on the same region servers
 Or on the same racks (if possible)
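The placement rule above can be illustrated with a small greedy sketch. This is not the HBase load balancer; the class name, method name, and server/rack names are hypothetical, and the only rules modeled are the two from the slide: never put two replicas of a region on the same server, and spread them across racks when possible.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Greedy, best-effort replica placement sketch; not the HBase load balancer. */
class ReplicaPlacementSketch {

    /**
     * Picks one server per replica of a single region.
     * serverToRack maps a (hypothetical) server name to its rack name.
     * Returns the chosen servers indexed by replica_id.
     */
    static List<String> place(int regionReplication, Map<String, String> serverToRack) {
        if (regionReplication > serverToRack.size()) {
            throw new IllegalArgumentException("not enough servers for distinct placement");
        }
        List<String> chosen = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        for (int replicaId = 0; replicaId < regionReplication; replicaId++) {
            String pick = null;      // unused server on a rack that has no replica yet
            String fallback = null;  // unused server on a rack that already has a replica
            for (Map.Entry<String, String> e : serverToRack.entrySet()) {
                if (chosen.contains(e.getKey())) continue;  // never co-locate on one server
                if (!usedRacks.contains(e.getValue())) { pick = e.getKey(); break; }
                if (fallback == null) fallback = e.getKey();
            }
            String server = (pick != null) ? pick : fallback;  // rack spread only if possible
            chosen.add(server);
            usedRacks.add(serverToRack.get(server));
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("rs1", "rackA");
        cluster.put("rs2", "rackA");
        cluster.put("rs3", "rackB");
        System.out.println(place(3, cluster));  // prints [rs1, rs3, rs2]
    }
}
```

With three replicas and only two racks, the third replica has to share a rack, which is the "if possible" case.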
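[Figure: a Client reads from and writes to a Region (rowkey, column:value, …) with its memstore on a RegionServer; the data blocks b1, b2 and b9 are stored on DataNodes.]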
Page15 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
[Figure: the same layout with a second RegionServer hosting a region replica with its own memstore. The Client reads and writes on the primary region; the region replica is read only.]
TIMELINE Consistency
Introduced a Consistency enum
 STRONG
 TIMELINE
Consistency.STRONG is default
Consistency can be set per read operation (per-get or per-scan)
Timeline-consistent read RPCs can be sent to more than one replica
Semantics are a bit different from the Eventual Consistency model
TIMELINE Consistency
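public enum Consistency { STRONG, TIMELINE }
Get get = new Get(row);
get.setConsistency(Consistency.TIMELINE);  // allow this read to be served by a secondary
...
Result result = table.get(get);
...
if (result.isStale()) { ... }  // the result may come from possibly stale data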
TIMELINE Consistency Semantics
Can be thought of as in-cluster active-passive replication
Single homed and ordered updates
 All writes are handled and ordered by the primary region
 All writes have STRONG consistency
Secondaries apply the mutations in order
Only get/scan requests are sent to secondaries
Get/Scan Result can be inspected to see whether the result was from
possibly stale data
TIMELINE Consistency Example
[Figure sequence: Client1 writes to Replica_id=0 (primary); the primary's WAL is replicated to Replica_id=1 and Replica_id=2; clients read from the replicas.]
 Write X=1: after replication, reads from all three replicas return X=1
 Write X=2: only one secondary has applied it so far; reads return X=2 (primary), X=2 and X=1
 Write X=3: no secondary has applied it yet; reads return X=3 (primary), X=2 and X=1
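The example can be replayed with a toy simulation. This is a minimal sketch, not HBase code: the class and method names are hypothetical, a single key X is modeled, and replication lag is driven by hand. It only illustrates the two properties the slides rely on: writes are ordered by the primary, and a secondary applies them in that order, so it can lag but never see updates out of order.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy simulation of timeline-consistent replicas for one key X; not HBase code. */
class TimelineSketch {
    /** A read result with a staleness flag, loosely modeled on Result.isStale(). */
    static final class ReadResult {
        final int value;
        final boolean stale;
        ReadResult(int value, boolean stale) { this.value = value; this.stale = stale; }
    }

    private final List<Integer> primaryLog = new ArrayList<>(); // writes in primary order
    private final int[] applied;  // per secondary: number of log entries applied so far

    TimelineSketch(int secondaries) { applied = new int[secondaries]; }

    /** All writes go through the primary, which orders them. */
    void write(int value) { primaryLog.add(value); }

    /** A secondary applies the next entries in log order; it may lag but never reorders. */
    void replicate(int secondary, int count) {
        applied[secondary] = Math.min(primaryLog.size(), applied[secondary] + count);
    }

    /** replicaId 0 reads the primary; ids >= 1 read a secondary and are flagged possibly stale. */
    ReadResult read(int replicaId) {
        int seen = (replicaId == 0) ? primaryLog.size() : applied[replicaId - 1];
        if (seen == 0) throw new IllegalStateException("replica has no data yet");
        return new ReadResult(primaryLog.get(seen - 1), replicaId != 0);
    }

    public static void main(String[] args) {
        TimelineSketch t = new TimelineSketch(2);
        t.write(1); t.replicate(0, 1); t.replicate(1, 1);
        t.write(2); t.replicate(0, 1);   // the second secondary lags behind
        t.write(3);                      // no secondary has X=3 yet
        for (int id = 0; id < 3; id++) {
            ReadResult r = t.read(id);
            System.out.println("replica_id=" + id + " X=" + r.value + " stale=" + r.stale);
        }
    }
}
```

Running `main` ends in the state of the last slide of the example: the primary returns X=3, one secondary X=2, and the other X=1, with both secondary reads flagged as possibly stale.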
Part II
Implementation and next steps
Region replicas – recap
Every region replica has an associated “replica_id”, starting from 0
Each region replica is hosted by a different region server
 All replicas can serve READs
One replica per region is the “default” or “primary”
 Only this replica can accept WRITEs
 All reads from this region replica return the most recent data
Updates in the Master
Replica creation
 Created during table creation
No distinction between primary & secondary replicas
Meta table contains all information in one row
Load balancer improvements
 LB made aware of replicas
 Makes a best effort to place replicas on machines/racks so as to maximize availability
Alter table support
 For adjusting number of replicas
Data Replication
Data should be replicated from primary regions to secondary regions
Data files MUST be shared. We do not want to store multiple copies
Do not cause more writes than necessary
Two solutions:
 Region snapshots : Share only data files
 Async WAL Replication : Share data files, every region replica has its own in-memory data
Data Replication – Region Snapshots
Primary region works as usual
 Buffer up mutations in memstore
 Flush to disk when full
 Compact files when needed
 Deleted files are kept in archive directory for some time
Secondary regions
 Are opened in read-only mode
 Periodically look for new files in primary region
– When a new flushed file is seen, just open it and start serving data from there
– When a compaction is seen, open new file, close the files that are gone
 Good for read-only, bulk load data or less frequently updated data
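The periodic refresh on a secondary region can be sketched as a set difference between the files it has open and the files currently listed for the primary. This is an illustrative sketch only: the class, the method names, and the idea of passing file names as strings are assumptions, not the HBase implementation.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch of a secondary region refreshing its view of the primary's store files. */
class StoreFileRefreshSketch {
    private final Set<String> openFiles = new TreeSet<>();

    /** Outcome of one refresh: which files were newly opened and which were closed. */
    static final class Delta {
        final Set<String> opened = new TreeSet<>();
        final Set<String> closed = new TreeSet<>();
    }

    /**
     * Called periodically with the file names currently listed for the primary region.
     * New files (flush or compaction output) are opened; files that are gone
     * (compaction input) are closed.
     */
    Delta refresh(Set<String> primaryFiles) {
        Delta delta = new Delta();
        for (String file : primaryFiles) {
            if (openFiles.add(file)) delta.opened.add(file);
        }
        for (String file : new TreeSet<>(openFiles)) {  // copy, since we remove while iterating
            if (!primaryFiles.contains(file)) {
                openFiles.remove(file);
                delta.closed.add(file);
            }
        }
        return delta;
    }

    /** Files the secondary is currently serving reads from. */
    Set<String> openFiles() { return new TreeSet<>(openFiles); }
}
```

For example, if the secondary has `f1` and `f2` open and a compaction on the primary replaces them with `f3`, the next refresh opens `f3` and closes `f1` and `f2`. Until that refresh runs, the secondary can presumably keep reading the old files because deleted files stay in the archive directory for some time.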
Updates in the RegionServer
Treats secondary replicas as read-only
Storefile management
 Keeps itself up to date with store file creations and deletions
IPC layer high level flow
 Client sends READ to primary and waits for a response
 Response within timeout (10 millis)?
– YES: use that response
– NO: send READ to all secondaries and wait for a response
 Take the first successful response; cancel others
Similar flow for GET/Batch-GET/Scan, except that Scan is sticky to the server it sees success from.
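The read flow above can be sketched with plain JDK concurrency primitives. This is not the HBase client code: the class name, the `read` signature, and the use of `Supplier` to stand in for a replica RPC are assumptions made for illustration. The first call in the list plays the primary; the rest play the secondaries. Scan stickiness is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Sketch of the client read flow: primary first, then fan out to secondaries. */
class ReplicaReadSketch {

    /** calls.get(0) stands for the primary read; the remaining entries are secondaries. */
    static <T> T read(List<Supplier<T>> calls, long primaryTimeoutMillis, ExecutorService pool) {
        CompletableFuture<T> first = new CompletableFuture<>();  // first successful response
        AtomicInteger failures = new AtomicInteger();
        List<Future<?>> inFlight = new ArrayList<>();
        inFlight.add(submit(calls.get(0), calls.size(), first, failures, pool));
        try {
            try {
                // Give the primary a head start (10 millis on the slide).
                return first.get(primaryTimeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException slowPrimary) {
                // No answer in time: send the read to all secondaries as well.
                for (Supplier<T> secondary : calls.subList(1, calls.size())) {
                    inFlight.add(submit(secondary, calls.size(), first, failures, pool));
                }
                return first.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("all replicas failed", e.getCause());
        } finally {
            // Cancel (interrupt) whatever is still running once a result is known.
            for (Future<?> f : inFlight) f.cancel(true);
        }
    }

    private static <T> Future<?> submit(Supplier<T> call, int total, CompletableFuture<T> first,
                                        AtomicInteger failures, ExecutorService pool) {
        return pool.submit(() -> {
            try {
                first.complete(call.get());  // only the first completion wins
            } catch (RuntimeException e) {
                // Report failure only when every replica call has failed.
                if (failures.incrementAndGet() == total) first.completeExceptionally(e);
            }
        });
    }
}
```

One simplification to note: if the primary fails quickly rather than timing out, this sketch still waits for the timeout before contacting the secondaries.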
Performance and Testing
No significant performance issues discovered
 Added interrupt handling in the RPCs to cancel unneeded replica RPCs
Deeper performance testing is still in progress
Tested with ChaosMonkey tests
 A test fails if a response is not received within a certain time
Summary
Pros
 High-availability for read-only tables
 High-availability for stale reads
 Very low-latency for the above
 [No extra copies for the Storefiles]
Cons
 Increased blockcache usage
 Extra network traffic for the replica calls
 Increased number of regions to manage in the cluster
Next steps
What has been described so far is in “Phase-1” of the project
Phase-2
 WAL replication
– Will reduce the updates-seen lag drastically
 Handling of Merges and Splits
 Latency guarantees
– Cancellation of RPCs server side
– Promotion of one Secondary to Primary, and recruiting a new Secondary
Use the infrastructure to implement consensus protocols for read/write
within a single datacenter
[Figure: a Client sends a mutation to a RegionServer; the WAL (critical for recovery) interleaves edits for region1, region1, region2, region3, region10, …, and each region has its own memstore.]
References
Apache branch hbase-10070
Merge to mainline trunk is under discussion in the community
HDP-2.1 comes with experimental support for Phase-1
Thanks
Q & A


