Real-time Analytics with HBase
                       Alex Baranau, Sematext International

Sunday, May 20, 12
About me

                     Software Engineer at Sematext International




                                                         Alex Baranau, Sematext International, 2012

                     Problem background: what? why?

                     Going real-time with append-only updates
                     approach: how?

                     Open-source implementation: how exactly?


Background: our services
                     Systems Monitoring Service (Solr, HBase, ...)

                     Search Analytics Service

                     [Diagram: several data collectors feed Analytics & Storage, which serves Reports (shown as a sample chart)]

Background: Report Example
                     Search engine (Solr) request latency

Background: Report Example
                     HBase flush operations

Background: requirements

                      High volume of input data

                      Multiple filters/dimensions

                      Interactive (fast) reports

                      Show wide range of data intervals

                      Real-time data changes visibility

                      No sampling, accurate data needed

Background: serve raw data?
                        simply storing all data points doesn’t work
                           to show 1-year worth of data points collected every second
                           31,536,000 points have to be fetched

                        pre-aggregation (at least partial) needed

                        [Diagram: input data enters Data Analytics & Storage, where data processing produces aggregated data that is served to Reports]
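
The point count above, and how much pre-aggregation shrinks it, can be checked with a little arithmetic. This is only a sketch: the class and method names are illustrative and not from the slides, and a 365-day year is assumed.

```java
// Illustrative helper (not part of the talk): points needed for a one-year report
// when one point is stored per time bucket.
final class ReportPoints {
    // Assumption: a non-leap year of 365 days.
    static final long SECONDS_PER_YEAR = 365L * 24 * 60 * 60;

    /** Number of points covering one year when one point is kept per bucket of the given size. */
    static long pointsPerYear(long bucketSeconds) {
        return SECONDS_PER_YEAR / bucketSeconds;
    }
}
```

With a bucket of 1 second this gives the 31,536,000 raw points from the slide; per-minute, per-hour and per-day buckets need 525,600, 8,760 and 365 points respectively.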

Background: pre-aggregation
                     OLAP-like Solution

                     [Diagram: input data -> processing -> aggregated values]

                     Processing is driven by aggregation rules:
                       * filters/dimensions
                       * time range granularities
                       * ...

Background: pre-aggregation
                     Simplified Example

                     input data item:
                       time: 1332882078
                       sensor: sensor55
                       value: 80.0

                     processing logic, aggregation rules:
                       * by sensor
                       * by minute/day

                     aggregated record groups:
                       by minute:
                         minute: 22214701
                         value: 30.0
                         ...
                       by day:
                         day: 2012-04-26
                         value: 10.5
                         ...
                       by minute & sensor:
                         minute: 22214701
                         sensor: sensor55
                         cpu: 70.3
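
One way to picture the processing logic is as a function that turns a single input item into the keys of every aggregated record it contributes to. The sketch below is an assumption-laden illustration: the key format, the UTC day boundary and all names are made up for this example and are not taken from the slides.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.List;

// Illustrative only: derives the aggregation keys one input item maps to.
final class AggregationKeys {
    /** Minute bucket: epoch seconds divided by 60 (integer division). */
    static long minuteBucket(long epochSeconds) {
        return epochSeconds / 60;
    }

    /** Day bucket; assumes days are cut at UTC midnight. */
    static LocalDate dayBucket(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate();
    }

    /** One key per aggregation rule: by minute, by day, by minute & sensor. */
    static List<String> keysFor(long epochSeconds, String sensor) {
        long minute = minuteBucket(epochSeconds);
        return Arrays.asList(
            "minute=" + minute,
            "day=" + dayBucket(epochSeconds),
            "minute=" + minute + ",sensor=" + sensor);
    }
}
```

For the input item above, `minuteBucket(1332882078L)` yields the minute 22214701 shown in the aggregated records. Each extra rule adds one more key per input item, which is why output volume grows with the number of dimensions.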
Background: RMW updates are slow
                        more dimensions/filters -> more output data relative to input data

                        individual read-modify-write (Get+Put) operations are slow
                        and inefficient (10-20+ times slower than Puts alone)

                        [Diagram: each input record (sensor1 <...>, sensor2 value: 15.0, sensor2 value: 41.0) is applied to storage with its own Get + Put. Storage holds one aggregated record per sensor (sensor1 <...>; sensor2 avg: 28.7, min: 15.0, max: 41.0), and reports read it via Get/Scan.]
Background: improve updates

                       Using in-place increment operations? Not fast
                       enough and not flexible...

                       Buffering input records on the way in and
                       writing in small batches? Doesn’t scale, and
                       data may be lost...

Background: batch updates

                      More efficient data processing: multiple
                      updates processed at once, not individually

                      Decreases aggregation output (per input record)

                      Reliable, no data loss

                      Using “dictionary records” helps to reduce
                      number of Get operations

Background: batch updates
              “Dictionary Records”

                     Using data de-normalization to reduce random Get
                     operations while doing “Get+Put” updates:

                       Keep compound records which hold data of
                       multiple “normal” records that are usually
                       updated together

                       N Get+Put operations replaced with M (Get+Put)
                       and N Put operations, where M << N

Background: batch updates

                      Not real-time

                      If done frequently (closer to real-time), still
                      a lot of costly Get+Put update operations

                      Bad (any?) rollback support

                      Handling of failures of tasks which partially
                      wrote data to HBase is complex

Going Real-time
                     Append-based Updates

Append-only: main goals

                     Increase record update throughput

                     Process updates more efficiently: reduce
                     the number of operations and resource usage

                     Ideally, apply high volume of incoming data
                     changes in real-time

                     Add ability to roll back changes

                     Handle high update peaks well

Append-only: how?

            1. Replace read-modify-write (Get+Put) operations
                     at write time with simple append-only writes (Put)

            2. Defer processing of updates to periodic jobs
            3. Perform processing of updates on the fly only
                     if user asks for data earlier than updates are
                     processed
Append-only: writing updates

           1         Replace update (Get+Put) operations at write time
                     with simple append-only writes (Put)

                     [Diagram: input records (sensor1 ..., sensor2 value: 15.0, sensor2 value: 41.0) are each written with a single Put. Storage (HBase) keeps the existing aggregated records (sensor1 <...>; sensor2 avg: 22.7, max: 31.0) and appends the new sensor2 records (value: 15.0 and value: 41.0) next to them.]
Append-only: writing updates

                     2        Defer processing of updates to periodic jobs

                     processing updates with MR job

                     [Diagram: before the job, storage holds sensor1 <...>; sensor2 avg: 22.7, max: 31.0, plus appended sensor2 records value: 15.0 and value: 41.0. After the job it holds sensor1 <...>; sensor2 avg: 23.4, max: 41.0.]

Append-only: writing updates
                 3          Perform aggregations on the fly if user asks
                            for data earlier than updates are processed

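                     [Diagram: storage holds sensor1 <...>; sensor2 avg: 22.7, max: 31.0, plus unprocessed sensor2 records value: 15.0 and value: 41.0. When a report reads the data, the records are merged on the fly and the report sees sensor2 avg: 23.4, max: 41.0.]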

Append-only: benefits
                     High update throughput

                     Real-time updates visibility

                     Efficient updates processing

                     Handling high peaks of update operations

                     Ability to roll back any range of changes

                     Automatically handling failures of tasks which
                     only partially updated data (e.g. in MR jobs)

                     Append-only: high update throughput

                       Avoid Get+Put operations upon writing

                       Use only Put operations (i.e. insert new
                       records only), which are very fast in HBase

                       Process updates when flushing client-side
                       buffer to reduce the number of actual
                       writes

                     Append-only: real-time updates

                       Increased update throughput makes it possible
                       to apply updates in real time

                       User always sees the latest data changes

                       Updates processed on the fly during Get or
                       Scan can be stored back right away

                       Periodic updates processing helps avoid doing
                       a lot of work during reads, making reading
                       very fast

                     Append-only: efficient updates
                      To apply N changes:
                        N Get+Put operations are replaced with
                        N Puts and 1 Scan (shared) + 1 Put operation

                      Applying N changes at once is much more
                      efficient than performing N individual changes
                        Especially when the updated value is complex (like bitmaps)
                        and takes time to load into memory

                        Skip compacting if there are too few records to process

                      Avoids many redundant Get operations when a
                      large portion of operations insert new data
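
The operation counts above can be written down directly. The sketch below only counts operations for a single aggregated record; it does not model their relative cost, and the names are illustrative.

```java
// Illustrative operation counts for applying N changes to one aggregated record.
final class UpdateCost {
    /** Read-modify-write: every change costs one Get and one Put. */
    static long readModifyWriteOps(long changes) {
        return 2 * changes;
    }

    /** Append-only: one Put per change, then one shared Scan and one Put for the merged result. */
    static long appendOnlyOps(long changes) {
        return changes + 2;
    }
}
```

For 1,000 changes this is 2,000 operations (half of them random Gets) versus 1,002 operations with no Gets at all.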
                     Append-only: high peaks handling

                      Actual updates do not happen at write time

                      Merge is deferred to periodic jobs, which can
                      be scheduled to run at off-peak time (nights/weekends)

                      Merge speed is not critical, doesn’t affect the
                      visibility of changes

                          Append-only: rollback
                        Rollbacks are easy when updates have not been
                        processed yet (not merged)

                        To preserve rollback ability after they are
                        processed (and result is written back), updates
                        can be compacted into groups

                        [Diagram: sensor2 updates written at 9:00, 10:00 and 11:00. Processing updates merges the records of each write-time group separately, leaving one sensor2 record per group.]
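
Compacting into groups can be sketched as follows. This is an in-memory illustration rather than HBaseHUT code: the class, the use of max as the aggregate and the restriction that rollback points fall on group boundaries are all assumptions of this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model: updates for one row, grouped by write-time interval.
final class GroupedUpdates {
    private final long groupSeconds;
    // Group start (epoch seconds) -> values written during that interval.
    private final TreeMap<Long, List<Double>> groups = new TreeMap<>();

    GroupedUpdates(long groupSeconds) {
        this.groupSeconds = groupSeconds;
    }

    /** Append an update, filed under the interval in which it was written. */
    void put(long writtenAt, double value) {
        long groupStart = writtenAt - (writtenAt % groupSeconds);
        groups.computeIfAbsent(groupStart, k -> new ArrayList<>()).add(value);
    }

    /** Compaction merges records within each group, never across groups. */
    void compact() {
        for (Map.Entry<Long, List<Double>> group : groups.entrySet()) {
            List<Double> merged = new ArrayList<>();
            merged.add(Collections.max(group.getValue()));
            group.setValue(merged);
        }
    }

    /** Rollback drops every group whose interval starts at or after the given boundary. */
    void rollbackFrom(long groupStart) {
        groups.tailMap(groupStart, true).clear();
    }

    /** Reads merge the remaining group records on the fly; null if nothing is stored. */
    Double readMax() {
        Double max = null;
        for (List<Double> records : groups.values()) {
            for (double v : records) {
                if (max == null || v > max) {
                    max = v;
                }
            }
        }
        return max;
    }

    int recordCount() {
        int n = 0;
        for (List<Double> records : groups.values()) {
            n += records.size();
        }
        return n;
    }
}
```

Because each group keeps its own merged record, changes written from 10:00 onwards can still be discarded after compaction without touching what was written at 9:00.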
                       Append-only: rollback
                     * keep all-time avg value for sensor
                     * data collected every 10 seconds for 30 days

                     * perform periodic compactions every 4 hours
                     * compact groups based on 1-hour interval

                     At any point in time there are no more than
                     24 * 30 + 4 * 60 * 6 = 2160 records that need
                     to be processed on the fly (24 * 30 hourly
                     group records, plus up to 4 hours of raw
                     records at 6 per minute)
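
The bound can be reproduced for other settings with a small helper. This is a sketch under the slide's assumptions (1-hour compaction groups, a sampling interval that divides 60 seconds evenly); the names are illustrative.

```java
// Illustrative bound on the number of records a read may have to merge on the fly.
final class OnTheFlyBound {
    static long maxRecordsToMerge(int retentionDays, int compactEveryHours, int sampleEverySeconds) {
        // One compacted record per 1-hour group over the whole retention period.
        long compactedGroups = 24L * retentionDays;
        // Raw records written since the last compaction, at 60 / sampleEverySeconds per minute.
        long pendingRaw = compactEveryHours * 60L * (60 / sampleEverySeconds);
        return compactedGroups + pendingRaw;
    }
}
```

`maxRecordsToMerge(30, 4, 10)` gives 720 + 1440 = 2160, matching the figure above; compacting every hour instead of every 4 hours would lower the bound to 720 + 360 = 1080.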
                     Append-only: idempotency
                      Using append-only approach helps recover from
                      failed tasks which write data to HBase
                         without rolling back partial updates

                         avoids applying duplicate updates

                         fixes task failure with simple restart of task

                      Note: new task should write records with same row
                      keys as failed one
                         easy, esp. given that input data is likely to be same

                      Very convenient when writing from MapReduce

                      Updates processing periodic jobs are also idempotent
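
Why a simple restart is enough can be shown with a toy model in which the row key depends only on the input record. Everything here is hypothetical (the `Reading` type, the key format, the in-memory map standing in for the table); it illustrates the idea and is not how HBaseHUT builds its keys.

```java
import java.util.List;
import java.util.Map;

// Hypothetical input record.
final class Reading {
    final String sensor;
    final long time;
    final double value;

    Reading(String sensor, long time, double value) {
        this.sensor = sensor;
        this.time = time;
        this.value = value;
    }
}

// Illustrative task that writes readings into a table modelled as a map.
final class IdempotentTask {
    /** The row key is derived only from the input, so a restarted task produces the same keys. */
    static String rowKey(Reading r) {
        return r.sensor + "-" + r.time;
    }

    /** Writes the first `limit` readings; a limit below input.size() simulates a failure part-way. */
    static void run(List<Reading> input, int limit, Map<String, Double> table) {
        int n = Math.min(limit, input.size());
        for (int i = 0; i < n; i++) {
            Reading r = input.get(i);
            // A Put with an existing key overwrites the earlier record instead of duplicating it.
            table.put(rowKey(r), r.value);
        }
    }
}
```

Running the task with a limit of 2 and then again with the full input leaves exactly one record per reading: the partial writes are overwritten, so nothing needs to be rolled back first.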

Append-only: cons

                     Processing on the fly makes reading slower

                     Looking for data to compact (during periodic
                     compactions) may be inefficient

                     Increased amount of stored data depending
                     on use-case (in 0.92+)

Append-only + Batch?
                     Works very well together, batch approach
                     benefits from:

                       increased update throughput

                       automatic task failures handling

                       rollback ability

                     Use when HBase cluster cannot cope with
                     processing updates in real-time or update
                     operations are a bottleneck in your batch

                     We use it ;)

Append-only updates implementation

HBaseHUT: Overview


                     Easy to integrate into existing projects
                       Packed as a single jar to be added to HBase client
                       classpath (also add it to RegionServer classpath to
                       benefit from server-side optimizations)

                       Supports native HBase API: HBaseHUT classes
                       implement native HBase interfaces

                     Apache License, v2.0

HBaseHUT: Overview
                     Processing of updates on-the-fly (behind
                     ResultScanner interface)

                       Allows storing back processed Result

                       Can use coprocessors (CPs) to process updates on the server side

                     Periodic processing of updates with Scan or
                     MapReduce job
                       Including processing updates in groups based on write ts

                     Rolling back changes with MapReduce job

HBaseHUT vs(?) OpenTSDB
                      “vs” is wrong, they are simply different things
                        OpenTSDB is a time-series database

                        HBaseHUT is a library which implements append-only
                        updates approach to be used in your project

                      OpenTSDB uses “serve raw data” approach (with
                      storage improvements), limited to handling
                      numeric values

                      HBaseHUT is meant for (but not limited to)
                      “serve aggregated data” approach, works with
                      any data

HBaseHUT: API overview
            Writing data:
            Put put = new Put(HutPut.adjustRow(rowKey));
            // ...

            Reading data:
            Scan scan = new Scan(startKey, stopKey);
            // updateProcessor: an UpdateProcessor instance that merges records on the fly
            ResultScanner resultScanner =
                new HutResultScanner(hTable.getScanner(scan), updateProcessor);

            for (Result current : resultScanner) {...}

HBaseHUT: API overview
                 Example UpdateProcessor:
                 public class MaxFunction extends UpdateProcessor {
                   // ... constructor & utility methods

                   public void process(Iterable<Result> records,
                                       UpdateProcessingResult result) {
                     Double maxVal = null;

                     // Find the largest value among the records being merged
                     for (Result record : records) {
                       double val = getValue(record);
                       if (maxVal == null || maxVal < val) {
                         maxVal = val;
                       }
                     }

                     // Nothing to write if there were no records
                     if (maxVal == null) {
                       return;
                     }
                     result.add(colfam, qual, Bytes.toBytes(maxVal));
                   }
                 }
HBaseHUT: how we use it
                     [Diagram: input data -> initial data processing -> Data Analytics & Storage -> aggregated data -> Reports]


HBaseHUT: Next Steps
                     Wider use of coprocessors (HBase 0.92+)
                       Process updates during memstore flush

                     Make use of Append operation (HBase 0.94+)

                     Integrate with asynchbase lib

                     Reduce storage overhead from adjusting
                     row keys






           We are hiring! ;)


Interactive exploration of complex relational data sets in a web - SemWeb.Pro...
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
Marco Parenzan

Similar to Real-time analytics with HBase (long version) (20)

Klout changing landscape of social media
Klout changing landscape of social mediaKlout changing landscape of social media
Klout changing landscape of social media
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
Combinación de logs, métricas y seguimiento para una visibilidad centralizada
Combinación de logs, métricas y seguimiento para una visibilidad centralizadaCombinación de logs, métricas y seguimiento para una visibilidad centralizada
Combinación de logs, métricas y seguimiento para una visibilidad centralizada
Monitoring as an entry point for collaboration
Monitoring as an entry point for collaborationMonitoring as an entry point for collaboration
Monitoring as an entry point for collaboration
Combinación de logs, métricas y seguimiento para una visibilidad centralizada
Combinación de logs, métricas y seguimiento para una visibilidad centralizadaCombinación de logs, métricas y seguimiento para una visibilidad centralizada
Combinación de logs, métricas y seguimiento para una visibilidad centralizada
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
Performance tuning in sap bi 7.0
Performance tuning in sap bi 7.0Performance tuning in sap bi 7.0
Performance tuning in sap bi 7.0
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012From the Big Data keynote at InCSIghts 2012
From the Big Data keynote at InCSIghts 2012
13 monitor-analyse-system
13 monitor-analyse-system13 monitor-analyse-system
13 monitor-analyse-system
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D...
Java one 2010
Java one 2010Java one 2010
Java one 2010
How Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BIHow Klout is changing the landscape of social media with Hadoop and BI
How Klout is changing the landscape of social media with Hadoop and BI
Data Mining with Excel 2010 and PowerPivot 201106
Data Mining with Excel 2010 and PowerPivot 201106Data Mining with Excel 2010 and PowerPivot 201106
Data Mining with Excel 2010 and PowerPivot 201106
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observability
Big Data
Big DataBig Data
Big Data
What is going on? Application Diagnostics on Azure - Copenhagen .NET User Group
What is going on? Application Diagnostics on Azure - Copenhagen .NET User GroupWhat is going on? Application Diagnostics on Azure - Copenhagen .NET User Group
What is going on? Application Diagnostics on Azure - Copenhagen .NET User Group
Spring Batch Introduction
Spring Batch IntroductionSpring Batch Introduction
Spring Batch Introduction
Interactive exploration of complex relational data sets in a web - SemWeb.Pro...
Interactive exploration of complex relational data sets in a web - SemWeb.Pro...Interactive exploration of complex relational data sets in a web - SemWeb.Pro...
Interactive exploration of complex relational data sets in a web - SemWeb.Pro...
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics

Recently uploaded

What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Stephanie Beckett
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
Priyanka Aash
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Alliance
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
Bhajan Mehta
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partesExchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
Priyanka Aash
Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
Peter Caitens
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
Priyanka Aash
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Alliance
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
Finetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and DefendingFinetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and Defending
Priyanka Aash
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
Alison B. Lowndes
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...

Recently uploaded (20)

What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partesExchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
Finetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and DefendingFinetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and Defending
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...

Real-time analytics with HBase (long version)

  • 1. Real-time Analytics with HBase. Alex Baranau, Sematext International. Sunday, May 20, 2012.
  • 2. About me: Software Engineer at Sematext International; @abaranau (abaranau).
  • 3. Plan: (1) Problem background: what? why? (2) Going real-time with the append-only updates approach: how? (3) Open-source implementation: how exactly? (4) Q&A.
  • 4. Background: our services. Systems Monitoring Service (Solr, HBase, ...) and Search Analytics Service. [Diagram: several data collectors send data to Analytics & Storage, which serves reports.]
  • 5. Background: Report Example. Search engine (Solr) request latency.
  • 6. Background: Report Example. HBase flush operations.
  • 7. Background: requirements. High volume of input data; multiple filters/dimensions; interactive (fast) reports; a wide range of data intervals to show; real-time visibility of data changes; no sampling, accurate data needed.
  • 8. Background: serve raw data? Simply storing all data points doesn't work: to show one year's worth of data points collected every second, 31,536,000 points (365 × 24 × 3,600) have to be fetched. Pre-aggregation (at least partial) is needed. [Diagram: input data passes through data processing (pre-aggregating) into Analytics & Storage, which serves aggregated data to Reports.]
  • 9. Background: pre-aggregation. OLAP-like solution: processing logic applies aggregation rules (filters/dimensions, time range granularities, ...) to each input data item and produces multiple aggregated values.
  • 10. Background: pre-aggregation, simplified example. Input data item: time: 1332882078, sensor: sensor55, value: 80.0, ... Aggregation rules: by sensor, by minute/day, ... Processing logic produces aggregated record groups: by minute (minute: 22214701, value: 30.0, ...); by day (day: 2012-04-26, value: 10.5, ...); by minute & sensor (minute: 22214701, sensor: sensor55, cpu: 70.3, ...).
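
A minimal sketch of how one input item could be mapped to the aggregation keys named on slide 10. The class `PreAggregationKeys`, its methods, the key string format and the use of UTC for the day bucket are all assumptions made for illustration; the slides do not show this code. The minute bucket is whole minutes since the Unix epoch, which is consistent with the slide's pair time: 1332882078 and minute: 22214701.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: derives the aggregation keys one input item contributes to. */
final class PreAggregationKeys {

    /** Minute bucket: whole minutes since the Unix epoch. */
    static long minuteBucket(long epochSeconds) {
        return epochSeconds / 60;
    }

    /** Day bucket as a calendar date; UTC is an assumption of this sketch. */
    static LocalDate dayBucket(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds).atZone(ZoneOffset.UTC).toLocalDate();
    }

    /** One key per aggregation rule: by minute, by day, by minute & sensor. */
    static List<String> keysFor(long epochSeconds, String sensor) {
        long minute = minuteBucket(epochSeconds);
        List<String> keys = new ArrayList<>();
        keys.add("minute=" + minute);
        keys.add("day=" + dayBucket(epochSeconds));
        keys.add("minute=" + minute + ",sensor=" + sensor);
        return keys;
    }
}
```

Each extra rule adds one more key per input item, which is why more dimensions/filters increase the ratio of output data to input data.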
  • 11. Background: RMW updates are slow. More dimensions/filters mean a greater ratio of output data to input data. Individual read-modify-write (Get+Put) operations are slow and inefficient (10-20+ times slower than Puts alone). [Diagram: each input record (e.g. sensor2 value: 15.0, sensor2 value: 41.0) triggers a Get and a Put against storage (HBase), which holds one aggregated record per sensor (e.g. sensor2: avg: 28.7, min: 15.0, max: 41.0); reports read it with Get/Scan.]
  • 12. Background: improve updates. Using in-place increment operations? Not fast enough and not flexible. Buffering input records on the way in and writing in small batches? Doesn't scale, and data may be lost.
  • 13. Background: batch updates. More efficient data processing: multiple updates are processed at once, not individually. Decreases aggregation output (per input record). Reliable, no data loss. Using “dictionary records” helps to reduce the number of Get operations.
  • 14. Background: batch updates, “dictionary records”. Data de-normalization is used to reduce random Get operations while doing “Get+Put” updates: keep compound records which hold the data of multiple “normal” records that are usually updated together. N Get+Put operations are replaced with M (Get+Put) and N Put operations, where M << N.
  • 15. Background: batch updates, drawbacks. Not real-time. If done frequently (closer to real-time), there are still a lot of costly Get+Put update operations. Bad (any?) rollback support. Handling failures of tasks which partially wrote data to HBase is complex.
  • 16. Going Real-time with Append-based Updates
  • 17. Append-only: main goals. Increase record update throughput. Process updates more efficiently: reduce the number of operations and resource usage. Ideally, apply a high volume of incoming data changes in real time. Add the ability to roll back changes. Handle high update peaks well.
  • 18. Append-only: how? (1) Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put). (2) Defer processing of updates to periodic jobs. (3) Process updates on the fly only if the user asks for data before the updates have been processed.
  • 19. Append-only: writing updates, step 1. Replace update (Get+Put) operations at write time with simple append-only writes (Put). [Diagram: input records (sensor2 value: 15.0, sensor2 value: 41.0) are written with Put only; storage (HBase) holds the existing aggregated record (sensor2: avg: 22.7, max: 31.0) followed by the appended records sensor2 value: 15.0 and sensor2 value: 41.0.]
  • 20. Append-only: writing updates, step 2. Defer processing of updates to periodic jobs. [Diagram: an MR job processes the updates; the sensor2 record (avg: 22.7, max: 31.0) plus the appended records value: 15.0 and value: 41.0 become a single record (avg: 23.4, max: 41.0).]
  • 21. Append-only: writing updates, step 3. Perform aggregations on the fly if the user asks for data before the updates have been processed. [Diagram: storage still holds the sensor2 record (avg: 22.7, max: 31.0) and the appended records value: 15.0 and value: 41.0; reports receive the merged result sensor2: avg: 23.4, max: 41.0.]
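
The three steps on slides 18-21 can be sketched with an in-memory map standing in for the table. This is a simplified illustration, not HBase or HBaseHUT code: the class `AppendOnlySketch`, the `Agg` record shape (count, sum, max) and all method names are assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for the table; illustrative only, not HBase or HBaseHUT code. */
final class AppendOnlySketch {

    /** Mergeable aggregate: keeps count and sum so the average can be updated later. */
    static final class Agg {
        final long count;
        final double sum;
        final double max;

        Agg(long count, double sum, double max) {
            this.count = count;
            this.sum = sum;
            this.max = max;
        }

        static Agg of(double value) {
            return new Agg(1, value, value);
        }

        Agg merge(Agg other) {
            return new Agg(count + other.count, sum + other.sum, Math.max(max, other.max));
        }

        double avg() {
            return sum / count;
        }
    }

    // Records stored per key: possibly one merged record followed by
    // updates that have not been processed yet.
    private final Map<String, List<Agg>> rows = new HashMap<>();

    /** Step 1: the write path only appends a new record; it never reads the old value. */
    void append(String key, double value) {
        rows.computeIfAbsent(key, k -> new ArrayList<>()).add(Agg.of(value));
    }

    /** Step 3: a reader merges whatever is still unprocessed on the fly. */
    Agg read(String key) {
        List<Agg> records = rows.get(key);
        if (records == null) {
            return null;
        }
        Agg merged = records.get(0);
        for (Agg next : records.subList(1, records.size())) {
            merged = merged.merge(next);
        }
        return merged;
    }

    /** Step 2: a periodic job replaces each key's records with a single merged record. */
    void compact() {
        for (Map.Entry<String, List<Agg>> entry : rows.entrySet()) {
            List<Agg> single = new ArrayList<>();
            single.add(read(entry.getKey()));
            entry.setValue(single);
        }
    }

    /** Number of records currently stored for a key. */
    int storedRecords(String key) {
        List<Agg> records = rows.get(key);
        return records == null ? 0 : records.size();
    }
}
```

Readers get the same answer before and after `compact()`; compaction only changes how much merging a read has to do.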
  • 22. Append-only: benefits. High update throughput. Real-time visibility of updates. Efficient processing of updates. Handling of high peaks of update operations. Ability to roll back any range of changes. Automatic handling of failures of tasks which only partially updated data (e.g. in MR jobs).
  • 23. 1/6 Append-only: high update throughput. Avoid Get+Put operations upon writing. Use only Put operations (i.e. insert new records only), which are very fast in HBase. Process updates when flushing the client-side buffer to reduce the number of actual writes.
  • 24. 2/6 Append-only: real-time updates. Increased update throughput makes it possible to apply updates in real time. The user always sees the latest data changes. Updates processed on the fly during Get or Scan can be stored back right away. Periodic processing of updates helps avoid doing a lot of work during reads, which keeps reads very fast.
  • 25. 3/6 Append-only: efficient updates. To apply N changes, N Get+Put operations are replaced with N Puts and 1 Scan (shared) + 1 Put operation. Applying N changes at once is much more efficient than performing N individual changes, especially when the updated value is complex (like bitmaps) and takes time to load into memory. Skip compacting if there are too few records to process. Avoid a lot of redundant Get operations when a large portion of operations is inserting new data.
  • 26. 4/6 Append-only: high peaks handling. Actual updates do not happen at write time. Merging is deferred to periodic jobs, which can be scheduled to run at off-peak times (nights/weekends). Merge speed is not critical, since it doesn't affect the visibility of changes.
  • 27. 5/6 Append-only: rollback. Rollbacks are easy when updates have not been processed (merged) yet. To preserve the ability to roll back after they are processed (and the result is written back), updates can be compacted into groups by write time. [Diagram: sensor2 updates written at 9:00, 10:00 and 11:00 are processed into one compacted group per write-time interval.]
  • 28. 5/6 Append-only: rollback. Example: keep an all-time avg value for a sensor, with data collected every 10 seconds for 30 days. Solution: perform periodic compactions every 4 hours and compact groups based on a 1-hour interval. Result: at any point in time there are no more than 24 * 30 + 4 * 60 * 6 = 2160 non-compacted records that need to be processed on the fly.
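
The bound on slide 28 can be read term by term. The labels below are an interpretation inferred from the slide's own numbers (1-hour groups, 4-hour compaction period, one record per 10 seconds), not wording taken from the slide.

```latex
\underbrace{24 \times 30}_{\text{1-hour groups over 30 days}}
+ \underbrace{4 \times 60 \times 6}_{\text{raw records in 4 hours at one per 10 s}}
= 720 + 1440 = 2160
```

Without grouping, the same 30 days of raw data would be 30 × 24 × 60 × 6 = 259,200 records.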
  • 29. 6/6 Append-only: idempotency. The append-only approach helps recover from failed tasks which write data to HBase: there is no need to roll back partial updates, duplicate updates are not applied, and a task failure is fixed by simply restarting the task. Note: the new task should write records with the same row keys as the failed one, which is easy, especially given that the input data is likely to be the same. Very convenient when writing from MapReduce. The periodic jobs that process updates are also idempotent.
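
A small sketch of why writing the same row keys on restart makes a task safe to rerun. A plain map stands in for the table, and overwriting a map entry stands in for rewriting the same row; the class `IdempotentWriteSketch`, the `sensor/time` key format and the failure simulation are assumptions for illustration, not HBaseHUT behaviour.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only: a map plays the role of the table. */
final class IdempotentWriteSketch {

    static final class Input {
        final String sensor;
        final long time;
        final double value;

        Input(String sensor, long time, double value) {
            this.sensor = sensor;
            this.time = time;
            this.value = value;
        }
    }

    /** Row key derived only from the input record, so a retry produces the same key. */
    static String rowKey(Input in) {
        return in.sensor + "/" + in.time;
    }

    /**
     * Writes inputs in order. Throws once failAfter records have been written,
     * to simulate a task crash; a negative failAfter never fails.
     */
    static void runTask(List<Input> inputs, Map<String, Double> table, int failAfter) {
        int written = 0;
        for (Input in : inputs) {
            if (written == failAfter) {
                throw new IllegalStateException("simulated task failure");
            }
            table.put(rowKey(in), in.value);
            written++;
        }
    }
}
```

Running the task once with a simulated failure and then again to completion leaves exactly one record per input item, with no rollback in between.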
  • 30. Append-only: cons. Processing on the fly makes reading slower. Looking for data to compact (during periodic compactions) may be inefficient. Increased amount of stored data, depending on use-case (in 0.92+).
  • 31. Append-only + Batch? They work very well together; the batch approach benefits from increased update throughput, automatic handling of task failures, and rollback ability. Use it when the HBase cluster cannot cope with processing updates in real time, or when update operations are the bottleneck in your batch. We use it ;)
  • 32. Append-only updates implementation: HBaseHUT
  • 33. HBaseHUT: Overview. Simple. Easy to integrate into existing projects: packed as a single jar to be added to the HBase client classpath (also add it to the RegionServer classpath to benefit from server-side optimizations). Supports the native HBase API: HBaseHUT classes implement native HBase interfaces. Apache License, v2.0.
  • 34. HBaseHUT: Overview. Processing of updates on the fly (behind the ResultScanner interface): allows storing back the processed Result, and can use coprocessors (CPs) to process updates on the server side. Periodic processing of updates with a Scan or a MapReduce job, including processing updates in groups based on write timestamp. Rolling back changes with a MapReduce job.
  • 35. HBaseHUT vs(?) OpenTSDB. “vs” is wrong; they are simply different things. OpenTSDB is a time-series database. HBaseHUT is a library which implements the append-only updates approach, to be used in your project. OpenTSDB uses the “serve raw data” approach (with storage improvements) and is limited to handling numeric values. HBaseHUT is meant for (but not limited to) the “serve aggregated data” approach and works with any data.
  • 36. HBaseHUT: API overview.
      Writing data:
          Put put = new Put(HutPut.adjustRow(rowKey));
          // ...
          hTable.put(put);
      Reading data:
          Scan scan = new Scan(startKey, stopKey);
          ResultScanner resultScanner =
              new HutResultScanner(hTable.getScanner(scan), updateProcessor);
          for (Result current : resultScanner) {...}
  • 37. HBaseHUT: API overview. Example UpdateProcessor:
          public class MaxFunction extends UpdateProcessor {
            // ... constructor & utility methods
            @Override
            public void process(Iterable<Result> records, UpdateProcessingResult result) {
              Double maxVal = null;
              for (Result record : records) {
                double val = getValue(record);
                if (maxVal == null || maxVal < val) {
                  maxVal = val;
                }
              }
              result.add(colfam, qual, Bytes.toBytes(maxVal));
            }
          }
  • 38. HBaseHUT: how we use it. [Diagram: inside Analytics & Storage, input data goes through initial data processing and is written to HBase via HBaseHUT; aggregated data is read for Reports via HBaseHUT; MapReduce jobs and periodic updates processing also access HBase via HBaseHUT.]
  • 39. HBaseHUT: Next Steps. Wider use of CPs (coprocessors, HBase 0.92+). Process updates during memstore flush. Make use of the Append operation (HBase 0.94+). Integrate with the asynchbase lib. Reduce storage overhead from adjusting row keys, etc.
  • 40. Qs? @abaranau (abaranau). We are hiring! ;)