November 2011 – Hadoop World NYC

Advanced HBase Schema Design
Lars George, Solutions Architect

Agenda

1    Intro to HBase Architecture
2    Schema Design
3    Examples
4    Wrap up

About Me

• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Author of
  HBase – The Definitive Guide
• Working with HBase since end
  of 2007
• Organizer of the Munich OpenHUG
• Speaker at Conferences (Fosdem, Hadoop World)

Overview

• Schema design is vital
• Needs to be done eventually
  – Same for RDBMS
• Exposes architecture and implementation “features”
• Might be handled in storage layer
  – e.g. MegaStore, Percolator
Configuration Layers   (aka “OSI for HBase”)
HBase Architecture
• HBase uses HDFS (or similar) as its reliable
  storage layer
  – Handles checksums, replication, failover
• Native Java API, Gateway for REST, Thrift, Avro
• Master manages cluster
• RegionServers manage data
• ZooKeeper is used as the “neural network”
  – Crucial for HBase
  – Bootstraps and coordinates cluster
Auto Sharding
Auto Sharding and Distribution

• Unit of scalability in HBase is the Region
• Sorted, contiguous range of rows
• Spread “randomly” across RegionServers
• Moved around for load balancing and failover
• Split automatically or manually to scale
  with growing data
• Capacity is solely a factor of cluster nodes
  vs. regions per node
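
Because each region covers a sorted, contiguous range of rows, finding the region for a row key comes down to a search over the region start keys. The following is a minimal sketch of that idea, not HBase's actual lookup code; the start keys, server names, and function names are all hypothetical.

```python
import bisect

# Hypothetical region start keys, kept sorted. The first region starts at the empty key.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
# Hypothetical assignment of each region to a RegionServer.
REGION_SERVERS = ["rs1", "rs3", "rs2", "rs1"]


def locate_region(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position holds the row.
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def locate_server(row_key: bytes) -> str:
    """Return the (hypothetical) server currently hosting the row's region."""
    return REGION_SERVERS[locate_region(row_key)]
```

Splitting a region in this model only adds one more start key to the sorted list, and moving a region only changes its entry in the server list.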
Column Families
Storage Separation

• Column Families allow for separation of data
  – Used by Columnar Databases for fast analytical
    queries, but on column level only
  – Allows different or no compression depending on
    the content type
• Segregate information based on access pattern
• Data is stored in one or more storage files,
  called HFiles
Merge Reads
Bloom Filter

• Bloom Filters are generated when HFile is persisted
  – Stored at the end of each HFile
  – Loaded into memory
• Allows check on row or row+column level
• Can filter entire store files from reads
  – Useful when data is grouped
• Also useful when many misses are
  expected during reads (non existing keys)
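
To illustrate how a Bloom filter lets a read skip entire store files, here is a small self-contained sketch. It is a toy model under stated assumptions: the bit count, the MD5-based hashing, and the names `BloomFilter` and `files_to_read` are illustrative and do not reflect how HBase implements its filters.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions per key by prefixing a counter byte.
        for i in range(self.num_hashes):
            digest = hashlib.md5(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def files_to_read(filters: dict, row_key: bytes) -> list:
    """Keep only the store files whose filter cannot rule out row_key."""
    return [name for name, bf in filters.items() if bf.might_contain(row_key)]
```

A file whose filter answers "no" definitely lacks the key and can be skipped; a "yes" only means the file has to be checked.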
Fold, Store, and Shift

• Logical layout does not match physical one
• All values are stored with the full
  coordinates, including: Row Key, Column
  Family, Column Qualifier, and Timestamp
• Folds columns into “row per column”
• NULLs are cost free as nothing is stored
• Versions are multiple “rows” in folded table
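
The folding step can be sketched as flattening a nested logical row into one physical record per cell. This is an illustration only; the function name `fold` and the tuple layout are assumptions, not HBase's on-disk KeyValue format.

```python
def fold(row_key, families):
    """Flatten {family: {qualifier: {timestamp: value}}} into sorted cells.

    Every cell carries its full coordinates, so each column and each
    version becomes its own "row" in the folded table.
    """
    cells = []
    for family, columns in families.items():
        for qualifier, versions in columns.items():
            for timestamp, value in versions.items():
                if value is None:
                    continue  # NULLs cost nothing: no cell is written at all
                cells.append((row_key, family, qualifier, timestamp, value))
    # Sort by row, family, qualifier ascending, then newest timestamp first.
    cells.sort(key=lambda c: (c[0], c[1], c[2], -c[3]))
    return cells
```

Because the full coordinates are repeated in every cell, long row keys, family names, and qualifiers add to the storage footprint of each stored value.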
Key Cardinality

• The best performance is gained from using
  row keys
• Time range bound reads can skip store files
  – So can Bloom Filters
• Selecting column families reduces the
  amount of data to be scanned
• Pure value based filtering is a full table scan
  – Filters often are too, but reduce network traffic
Tall-Narrow vs. Flat-Wide Tables
• Rows do not split
  – Might end up with one row per region
• Same storage footprint
• Put more details into the row key
  – Sometimes dummy column only
  – Make use of partial key scans
• Tall with Scans, Wide with Gets
  – Atomicity only on row level
• Example: Large graphs, stored as adjacency matrix
Example: Mail Inbox

        <userId> : <colfam> : <messageId> : <timestamp> : <email-message>

12345   :   data   :   5fc38314-e290-ae5da5fc375d      :   1307097848   :   "Hi Lars, ..."
12345   :   data   :   725aae5f-d72e-f90f3f070419      :   1307099848   :   "Welcome, and ..."
12345   :   data   :   cc6775b3-f249-c6dd2b1a7467      :   1307101848   :   "To Whom It ..."
12345   :   data   :   dcbee495-6d5e-6ed48124632c      :   1307103848   :   "Hi, how are ..."

or

12345-5fc38314-e290-ae5da5fc375d        :   data   :   :   1307097848   :   "Hi Lars, ..."
12345-725aae5f-d72e-f90f3f070419        :   data   :   :   1307099848   :   "Welcome, and ..."
12345-cc6775b3-f249-c6dd2b1a7467        :   data   :   :   1307101848   :   "To Whom It ..."
12345-dcbee495-6d5e-6ed48124632c        :   data   :   :   1307103848   :   "Hi, how are ..."

Both layouts have the same storage requirements.
Partial Key Scans
Key                                          Description
<userId>                                     Scan over all messages for a given user ID
<userId>-<date>                              Scan over all messages on a given date for the given user ID
<userId>-<date>-<messageId>                  Scan over all parts of a message for a given user ID and date
<userId>-<date>-<messageId>-<attachmentId>   Scan over all attachments of a message for a given user ID and date
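
A partial key scan works because row keys are sorted: all keys that share a prefix sit next to each other. The sketch below models the table as a sorted Python list of string keys. The helper `prefix_scan` and the sample keys are hypothetical and only illustrate the access pattern.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix from an already sorted list."""
    # Jump straight to the first key that is >= prefix ...
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    # ... then read sequentially until the prefix no longer matches.
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result


# Hypothetical composite keys: <userId>-<date>-<messageId>-<attachmentId>
KEYS = sorted([
    "12345-20110603-a1-0",
    "12345-20110603-a1-1",
    "12345-20110604-b2-0",
    "12346-20110603-c3-0",
])
```

Including the trailing delimiter in the prefix (for example `"12345-"` rather than `"12345"`) avoids also matching a longer user ID such as `123456`.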
Sequential Keys
    <timestamp><more key>: {CF: {CQ: {TS : Val}}}

• Hotspotting on Regions: bad!
• Instead do one of the following:
  – Salting
     • Prefix <timestamp> with distributed value
     • Binning or bucketing rows across regions
  – Key field swap/promotion
     • Move <more key> before the timestamp (see
       OpenTSDB later)
  – Randomization
     • Move <timestamp> out of key
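
Two of these options can be sketched as simple key-building functions. The bucket count, the CRC32-based salt, the key layout, and the function names are assumptions chosen for illustration, not a prescribed format.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical number of salt buckets


def salted_key(timestamp: int, more_key: str) -> str:
    """Salting: prefix the timestamp with a bucket number.

    The bucket is derived from the rest of the key, so it can be
    recomputed by a reader that knows more_key.
    """
    bucket = zlib.crc32(more_key.encode("utf-8")) % NUM_BUCKETS
    return f"{bucket:02d}-{timestamp:010d}-{more_key}"


def promoted_key(timestamp: int, more_key: str) -> str:
    """Key field promotion: move <more key> in front of the timestamp."""
    return f"{more_key}-{timestamp:010d}"
```

The trade-off with salting is on the read side: a time range scan now has to be issued once per bucket and the results merged.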
Key Design

• Based on access pattern, either use
  sequential or random keys
• Often a combination of both is needed
  – Overcome architectural limitations
• Neither is necessarily bad
  – Use bulk import for sequential keys and reads
  – Random keys are good for random access patterns
Example: Facebook Insights

• > 20B Events per Day
• 1M Counter Updates per Second
  – 100 Nodes Cluster
  – 10K OPS per Node
• “Like” button triggers AJAX request
• Event written to log file
• 30mins current for website owner
  Web ➜ Scribe ➜ Ptail ➜ Puma ➜ HBase
HBase Counters
• Store counters per Domain and per URL
  – Leverage HBase increment (atomic read-modify-
    write) feature
• Each row is one specific Domain or URL
• The columns are the counters for specific metrics
• Column families are used to group counters
  by time range
  – Set time-to-live on CF level to auto-expire
    counters by age to save space, e.g., 2 weeks on
    “Daily Counters” family
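
The counter layout can be modeled in a few lines. This is a simplified in-memory sketch, not the HBase client API: the class `CounterTable`, the family names, and the explicit `now` parameter are assumptions, and expiry is approximated as "no write within the family's time-to-live".

```python
import threading


class CounterTable:
    """Toy model of per-row counters grouped into families with a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds  # {family: ttl in seconds, or None}
        self._cells = {}                # (row, family, qualifier) -> (value, written_at)
        self._lock = threading.Lock()

    def _read(self, key, now):
        cell = self._cells.get(key)
        if cell is None:
            return 0
        value, written_at = cell
        ttl = self.ttl_seconds.get(key[1])
        if ttl is not None and now - written_at > ttl:
            return 0  # expired: treated as if the cell no longer exists
        return value

    def increment(self, row, family, qualifier, amount=1, now=0.0):
        """Read-modify-write under a lock, so concurrent updates are not lost."""
        key = (row, family, qualifier)
        with self._lock:
            new_value = self._read(key, now) + amount
            self._cells[key] = (new_value, now)
            return new_value

    def get(self, row, family, qualifier, now=0.0):
        with self._lock:
            return self._read((row, family, qualifier), now)
```

With one family per time range, old counters disappear without any explicit delete, which is the space saving the time-to-live setting aims for.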
Key Design
• Reversed Domains
  – Examples: “com.cloudera.www”, “”
  – Helps keep pages of the same site close together, as HBase
    efficiently scans blocks of sorted keys
• Domain Row Key = MD5(Reversed Domain) + Reversed Domain
  – Leading MD5 hash spreads keys randomly across all regions
    for load balancing reasons
  – Hashing only the domain keeps rows grouped per site (and per
    subdomain if needed)
• URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
  – Unique ID per URL already available, make use of it
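
The two row key formulas can be sketched as follows. The byte encoding is an assumption: the slides do not say how the parts are concatenated or how the URL ID is serialized, so the 8-byte big-endian ID and the function names are illustrative only.

```python
import hashlib


def reverse_domain(domain: str) -> str:
    """Turn 'www.cloudera.com' into 'com.cloudera.www'."""
    return ".".join(reversed(domain.split(".")))


def domain_row_key(domain: str) -> bytes:
    """MD5(Reversed Domain) + Reversed Domain."""
    reversed_domain = reverse_domain(domain).encode("utf-8")
    # The 16-byte MD5 prefix spreads domains across the key space.
    return hashlib.md5(reversed_domain).digest() + reversed_domain


def url_row_key(domain: str, url_id: int) -> bytes:
    """MD5(Reversed Domain) + Reversed Domain + URL ID.

    Assumption: the URL ID is appended as an 8-byte big-endian integer.
    """
    return domain_row_key(domain) + url_id.to_bytes(8, "big")
```

Since only the domain is hashed, every URL row of a site shares the same prefix as its domain row, so one partial key scan over `domain_row_key(domain)` covers the whole site.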
Insights Schema
Example: OpenTSDB

• Metric Type, Tags are stored as IDs
• Periodically rolled up

Summary

• Design for Use-Case
  – Read, Write, or Both?
• Avoid Hotspotting
• Consider using IDs instead of full text
• Leverage Column Family to HFile relation
• Shift details to appropriate position
  – Composite Keys
  – Column Qualifiers
Summary (cont.)

• Schema design is a combination of
  – Designing the keys (row and column)
  – Segregating data into column families
  – Choosing compression and block sizes
• Similar techniques are needed to scale most systems
  – Add indexes, partition data, consistent hashing
• Denormalization, Duplication, and Intelligent
  Keys (DDI)

Questions?

Twitter:   @larsgeorge
