The document discusses Facebook's use of HBase as the database storage engine for its messaging platform. It provides an overview of HBase, including its data model, architecture, and benefits such as scalability, fault tolerance, and a strong consistency model that is simpler for applications to work with than eventual consistency. The document also describes Facebook's contributions to HBase to improve performance and availability and to achieve its goal of zero data loss. It shares Facebook's operational experiences running large HBase clusters and discusses its migration of messaging data from MySQL to a denormalized schema in HBase.
Facebook uses HBase running on HDFS to store messaging data and metadata. Key reasons for choosing HBase include high write throughput, horizontal scalability, and integration with HDFS. Typical clusters have multiple regions and racks for redundancy. Facebook stores small messages, metadata, and attachments in HBase, while larger messages and attachments are stored separately. The system processes billions of read and write operations daily and continues to optimize performance and reliability.
Building Mission Critical Messaging System On Top Of HBase
Facebook chose HBase as the storage system for its messaging platform due to HBase's high write throughput, good random read performance, horizontal scalability, and automatic failover. Facebook stores messages, metadata, and search indices in HBase. To improve performance and reliability, Facebook developed the system on a production-stabilized branch of HBase, used shadow testing, added extensive monitoring, and contributed improvements back to the HBase community.
6. Monthly data volume prior to launch
▪ 15B × 1,024 bytes ≈ 14 TB
▪ 120B × 100 bytes ≈ 11 TB
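The sizing figures above check out if TB is read as binary terabytes (2^40 bytes). A quick arithmetic sketch (the helper function is just for illustration):

```python
def volume_tb(count, bytes_each):
    """Total volume in binary terabytes (2**40 bytes)."""
    return count * bytes_each / 2**40

# 15 billion items at 1,024 bytes each
print(round(volume_tb(15e9, 1024)))   # 14
# 120 billion items at 100 bytes each
print(round(volume_tb(120e9, 100)))   # 11
```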
7. Messaging Data
▪ Small/medium sized data → HBase
▪ Message metadata & indices
▪ Search index
▪ Small message bodies
▪ Attachments and large messages → Haystack
▪ Used for our existing photo/video store
8. Open Source Stack
▪ Memcached --> App Server Cache
▪ ZooKeeper --> Small Data Coordination Service
▪ HBase --> Database Storage Engine
▪ HDFS --> Distributed FileSystem
▪ Hadoop --> Asynchronous Map-Reduce Jobs
9. Our architecture
▪ Clients (Front End, MTA, etc.) ask the User Directory Service "What's the cell for this user?"
▪ Users are partitioned across cells (Cell 1, Cell 2, Cell 3, ...)
▪ Each cell has Application Servers backed by an HBase/HDFS/ZooKeeper cluster storing messages, metadata, and the search index
▪ Attachments are stored in Haystack
11. HBase in a nutshell
• distributed, large-scale data store
• efficient at random reads/writes
• initially modeled after Google’s BigTable
• open source project (Apache)
12. When to use HBase?
▪ storing large amounts of data
▪ need high write throughput
▪ need efficient random access within large data sets
▪ need to scale gracefully with data
▪ for structured and semi-structured data
▪ don’t need full RDBMS capabilities (cross-table transactions, joins, etc.)
13. HBase Data Model
• An HBase table is:
• a sparse, three-dimensional array of cells, indexed by:
RowKey, ColumnKey, Timestamp/Version
• sharded into regions along an ordered RowKey space
• Within each region:
• Data is grouped into column families
▪ Sort order within each column family:
Row Key (asc), Column Key (asc), Timestamp (desc)
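The sorted-map model above can be sketched as a toy in-memory table (illustrative only; class and method names are invented, and real HBase keeps this data in memstores and HFiles, not a Python list):

```python
from bisect import insort

class ToyTable:
    """Toy model of an HBase table: cells keyed by (row, column, timestamp),
    kept sorted by row asc, column asc, timestamp desc (negated ts)."""
    def __init__(self):
        self.cells = []  # sorted list of ((row, col, -ts), value)

    def put(self, row, col, ts, value):
        insort(self.cells, ((row, col, -ts), value))

    def get(self, row, col):
        """Return the newest version of a cell, or None."""
        for (r, c, _), v in self.cells:
            if (r, c) == (row, col):
                return v  # first hit is newest, since timestamps sort descending
        return None

t = ToyTable()
t.put("user1", "name", ts=1, value="Alice")
t.put("user1", "name", ts=2, value="Alicia")
print(t.get("user1", "name"))  # newest version wins: Alicia
```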
14. Example: Inbox Search
• Schema
• RowKey: userid, Column: word, Version: MessageID
• Value: Auxiliary info (like offset of word in message)
• Data is stored sorted by <userid, word, messageID>:
User1:hi:17->offset1
User1:hi:16->offset2
User1:hello:16->offset3
User1:hello:2->offset4
...
User2:...
...
• Can efficiently handle queries like:
- Get top N messageIDs for a specific user & word
- Typeahead query: for a given user, get words that match a prefix
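Because the keys are stored sorted, both query types reduce to short range scans from a computed start key. A self-contained sketch over sample data like the above (plain Python lists standing in for HBase scans; message IDs are negated so newer ones sort first):

```python
from bisect import bisect_left

# Toy sorted key space: (userid, word, -message_id) -> offset
rows = sorted([
    ("user1", "hello", -16, "offset3"),
    ("user1", "hello", -2,  "offset4"),
    ("user1", "hi",    -17, "offset1"),
    ("user1", "hi",    -16, "offset2"),
    ("user2", "hey",   -9,  "offset5"),
])

def top_n(user, word, n):
    """Top-N newest message IDs for a user & word: one short range scan."""
    i = bisect_left(rows, (user, word, float("-inf"), ""))
    out = []
    while i < len(rows) and rows[i][:2] == (user, word) and len(out) < n:
        out.append(-rows[i][2])
        i += 1
    return out

def typeahead(user, prefix):
    """Words for a user matching a prefix: scan from (user, prefix)."""
    i = bisect_left(rows, (user, prefix, float("-inf"), ""))
    words = []
    while i < len(rows) and rows[i][0] == user and rows[i][1].startswith(prefix):
        if rows[i][1] not in words:
            words.append(rows[i][1])
        i += 1
    return words

print(top_n("user1", "hi", 2))   # [17, 16]
print(typeahead("user1", "h"))   # ['hello', 'hi']
```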
15. HBase System Overview
▪ Database layer: HBASE, with an active Master, a Backup Master, and many Region Servers
▪ Coordination service: a ZooKeeper quorum of multiple ZK peers
▪ Storage layer: HDFS, with a Namenode, a Secondary Namenode, and many Datanodes
16. HBase Overview
▪ A Region Server hosts multiple regions (Region #1, Region #2, ...)
▪ Each region contains one or more column families (ColumnFamily #1, #2, ...)
▪ Each column family has a Memstore (in-memory data structure) that is flushed to HFiles in HDFS
▪ A Write Ahead Log (also in HDFS) persists every update
17. HBase Overview
• Very good at random reads/writes
• Write path
• Sequential write/sync to commit log
• update memstore
• Read path
• Lookup memstore & persistent HFiles
• HFile data is sorted and has a block index for efficient retrieval
• Background chores
• Flushes (memstore -> HFile)
• Compactions (group of HFiles merged into one)
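The write path, read path, and background chores above can be modeled as a miniature LSM-style store (a sketch under simplifying assumptions, not the real implementation; names are invented):

```python
class ToyRegion:
    """Minimal LSM-style store: writes hit a log and memstore; a flush turns
    the memstore into an immutable sorted 'HFile'; compaction merges HFiles."""
    def __init__(self, flush_threshold=3):
        self.wal = []          # commit log: sequential appends
        self.memstore = {}
        self.hfiles = []       # list of dicts, newest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))   # 1. sequential write/sync to the log
        self.memstore[key] = value      # 2. update the memstore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.hfiles.insert(0, dict(self.memstore))  # memstore -> HFile
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:        # read path: memstore first...
            return self.memstore[key]
        for hf in self.hfiles:          # ...then HFiles, newest first
            if key in hf:
                return hf[key]
        return None

    def compact(self):
        merged = {}
        for hf in reversed(self.hfiles):  # oldest first, so newer wins
            merged.update(hf)
        self.hfiles = [merged]            # group of HFiles merged into one

r = ToyRegion(flush_threshold=2)
r.put("a", 1)
r.put("b", 2)   # second put triggers a flush
r.put("a", 3)   # newer value lives in the memstore
print(r.get("a"), r.get("b"))  # 3 2
```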
19. Horizontal scalability
▪ HBase & HDFS are elastic by design
▪ Multiple table shards (regions) per physical server
▪ On node additions, the load balancer automatically reassigns shards from overloaded nodes to the new nodes
▪ Because the filesystem underneath is itself distributed, data for reassigned regions is instantly servable from the new nodes
▪ Regions can be dynamically split into smaller regions
▪ Pre-sharding is not necessary
▪ Splits are near instantaneous!
20. Automatic Failover
▪ Node failures automatically detected by the HBase Master
▪ Regions on a failed node are distributed evenly among the surviving nodes
▪ The multiple-regions-per-server model avoids the need for substantial overprovisioning
▪ HBase Master failover: 1 active, rest standby
▪ When the active master fails, a standby automatically takes over
21. HBase uses HDFS
We get the benefits of HDFS as a storage system for free
▪ Fault tolerance (block level replication for redundancy)
▪ Scalability
▪ End-to-end checksums to detect and recover from corruptions
▪ MapReduce for large-scale data processing
▪ HDFS already battle tested inside Facebook
▪ running petabyte scale clusters
▪ lot of in-house development and operational experience
22. Simpler Consistency Model
▪ HBase’s strong consistency model
▪ simpler for a wide variety of applications to deal with
▪ client gets same answer no matter which replica data is read from
▪ Eventual consistency: tricky for applications fronted by a cache
▪ replicas may heal eventually during failures
▪ but stale data could remain stuck in cache
23. Typical Cluster Layout
▪ Multiple clusters/cells for messaging
▪ 20 servers/rack; 5 or more racks per cluster
▪ Controllers (master/ZooKeeper) spread across racks: each rack hosts a ZooKeeper peer plus one controller role (HDFS Namenode, Backup Namenode, Job Tracker, HBase Master, or Backup Master)
▪ The remaining 19 nodes in each rack each run a Region Server, Data Node, and Task Tracker
26. Goal of Zero Data Loss/Correctness
▪ sync support added to hadoop-20 branch
▪ for keeping transaction log (WAL) in HDFS
▪ to guarantee durability of transactions
▪ Row-level ACID compliance
▪ Enhanced HDFS’s Block Placement Policy:
▪ Original: rack aware, but minimally constrained
▪ Now: Placement of replicas constrained to configurable node groups
▪ Result: Data loss probability reduced by orders of magnitude
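The effect of constraining placement can be illustrated with a toy probability model (the cluster size, block count, and group size below are made-up numbers, and the model ignores re-replication): with random placement, 3 simultaneous node failures have a fair chance of covering all 3 replicas of some block, while grouped placement only risks loss when all 3 failures land inside one small node group.

```python
from math import comb

def p_some_block_lost_random(nodes, blocks, replicas=3):
    """Random placement: each block's replicas sit on a uniformly random
    node triple; probability a random failed triple kills some block."""
    triples = comb(nodes, replicas)
    # chance a specific block's triple is exactly the failed one: 1/triples
    return 1 - (1 - 1 / triples) ** blocks

def p_some_block_lost_grouped(nodes, blocks, group_size, replicas=3):
    """Constrained placement: replicas confined to small node groups, so a
    random failed triple can only lose data if all 3 nodes share a group
    (approximated here as: triple-in-group implies some block is lost)."""
    groups = nodes // group_size
    return groups * comb(group_size, replicas) / comb(nodes, replicas)

random_p = p_some_block_lost_random(1000, 10**7)
grouped_p = p_some_block_lost_grouped(1000, 10**7, group_size=10)
print(random_p, grouped_p)  # grouped placement is orders of magnitude safer
```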
27. Availability/Stability improvements
▪ HBase master rewrite: region assignments using ZK
▪ Rolling restarts: software upgrades without downtime
▪ Interruptible compactions: prioritize availability over minor perf gains
▪ Timeouts on client-server RPCs
▪ Staggered major compactions to avoid compaction storms
28. Performance Improvements
▪ Compactions
▪ critical for read performance
▪ Improved compaction algorithm
▪ delete/TTL/overwrite processing in minor compactions
▪ Read optimizations:
▪ Seek optimizations for rows with large number of cells
▪ Bloom filters to minimize HFile lookups
▪ Timerange hints on HFiles (great for temporal data)
▪ Improved handling of compressed HFiles
29. Operational Experiences
▪ Darklaunch:
▪ shadow traffic on test clusters for continuous, at scale testing
▪ experiment/tweak knobs
▪ simulate failures, test rolling upgrades
▪ Constant (pre-sharding) region count & controlled rolling splits
▪ Administrative tools and monitoring
▪ Alerts (HBCK, memory alerts, perf alerts, health alerts)
▪ auto detecting/decommissioning misbehaving machines
▪ Dashboards
▪ Application level backup/recovery pipeline
30. Working within the Apache community
▪ Growing with the community
▪ Started with a stable, healthy project
▪ In house expertise in both HDFS and HBase
▪ Increasing community involvement
▪ Undertook massive feature improvements with community help
▪ HDFS 0.20-append branch
▪ HBase Master rewrite
▪ Continually interacting with the community to identify and fix issues
▪ e.g., large responses (2GB RPC)
33. Move messaging data from MySQL to HBase
▪ In MySQL, inbox data was kept normalized
▪ user’s messages are stored across many different machines
▪ Migrating a user is basically one big join across tables spread over many different machines
▪ Multiple terabytes of data (for over 500M users)
▪ Cannot pound 1000s of production UDBs to migrate users
34. How we migrated
▪ Periodically, get a full export of all the users’ inbox data in MySQL
▪ Use the bulk loader to import the above into a migration HBase cluster
▪ To migrate users:
▪ Since users may continue to receive messages during migration, double-write (to old and new systems) during the migration period
▪ Get a list of all recent messages (since the last MySQL export) for the user
▪ Load new messages into the migration HBase cluster
▪ Perform the join operations to generate the new data
▪ Export it and upload into the final cluster
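The double-write step can be sketched as a thin wrapper over the two stores (a sketch only: the store objects and names are hypothetical, with plain dicts standing in for MySQL and HBase):

```python
class DoubleWriter:
    """Double-write pattern used during migration: every new message goes to
    both the old and new stores, so neither falls behind while users move."""
    def __init__(self, old_store, new_store):
        self.old, self.new = old_store, new_store

    def deliver(self, user, message):
        # Write to both systems; a real implementation would also handle
        # partial failures (e.g. retry or reconcile from the old system).
        self.old.setdefault(user, []).append(message)
        self.new.setdefault(user, []).append(message)

old, new = {}, {}
w = DoubleWriter(old, new)
w.deliver("user1", "hi")
w.deliver("user1", "hello")
assert old["user1"] == new["user1"]  # both systems stay in sync
```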
36. Facebook Insights Goes Real-Time
▪ Recently launched real-time analytics for social plugins on top of HBase
▪ Publishers get real-time distribution/engagement metrics:
▪ # of impressions, likes
▪ analytics by
▪ Domain, URL, demographics
▪ Over various time periods (the last hour, day, all-time)
▪ Makes use of HBase capabilities like:
▪ Efficient counters (read-modify-write increment operations)
▪ TTL for purging old data
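The counter-plus-TTL combination can be mimicked with a small model (illustrative only; in real HBase the read-modify-write increment happens server-side and TTL expiry is applied during flushes/compactions):

```python
import time

class ToyCounterStore:
    """Sketch of atomic increments plus TTL-based purging, as used for
    real-time analytics counters. Names and structure are invented."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.data = {}  # key -> (value, last_write_time)

    def increment(self, key, amount=1, now=None):
        """Read-modify-write increment; returns the new value."""
        now = time.time() if now is None else now
        value, _ = self.data.get(key, (0, now))
        self.data[key] = (value + amount, now)
        return value + amount

    def purge_expired(self, now=None):
        """Drop cells whose age exceeds the TTL."""
        now = time.time() if now is None else now
        self.data = {k: (v, t) for k, (v, t) in self.data.items()
                     if now - t < self.ttl}

c = ToyCounterStore(ttl_seconds=3600)
c.increment("url1:likes", now=0)
c.increment("url1:likes", now=0)
c.purge_expired(now=7200)   # over an hour later: counter has expired
print(c.data)               # {}
```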
37. Future Work
It is still early days…!
▪ Namenode HA (AvatarNode)
▪ Fast hot-backups (Export/Import)
▪ Online schema & config changes
▪ Running HBase as a service (multi-tenancy)
▪ Features (like secondary indices, batching hybrid mutations)
▪ Cross-DC replication
▪ Lot more performance/availability improvements