This document discusses using an append-only approach with HBase to enable real-time analytics. The append-only approach aims to increase update throughput by replacing read-modify-write operations with simple append writes. Updates are processed periodically by batch jobs rather than on the fly, reducing the number of operations and the resources they consume. This makes it possible to handle a high volume of incoming data changes in real time while also enabling rollback of changes.
Visualize Solr Data with Banana: Presented by Andrew Thanalertvisuti, Lucidworks
Banana is a fork of Kibana that works with Apache Solr data. It builds on Kibana's dashboard capabilities, ports key panels to work with Solr, and adds new ones such as D3.js panels. Banana aims to create rich and flexible UIs, enable rapid application development, and leverage Solr's power. To build a custom panel in Banana, you need an editor HTML file for settings, a module HTML file for display, and a module JS file containing panel logic.
This document provides an overview of Hadoop and its ecosystem. It discusses the evolution of Hadoop from version 1, which focused on batch processing using MapReduce, to version 2, which introduced YARN for distributed resource management and supported additional data processing engines beyond MapReduce. It also describes key Hadoop services like HDFS for distributed storage and the benefits of a Hadoop data platform for unlocking the value of large datasets.
Apache HBase is Hadoop's open-source, distributed, versioned storage manager, well suited for random, real-time read/write access. This talk gives an overview of how HBase achieves random I/O, focusing on the storage layer internals: starting from how the client interacts with Region Servers and the Master, then going into WAL, MemStore, compactions, and on-disk format details. It also looks at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance, and space efficiency.
HBaseCon 2012 | Real-time Analytics with HBase - Sematext - Cloudera, Inc.
In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using an append-only approach. This approach uses HBase's core strengths, like fast range scans and the recently added coprocessors, to enable real-time analytics. It shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive. Apart from making real-time analytics possible, we’ll show how the append-only approach to updates makes it possible to perform rollbacks of data changes and avoid data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase.
Evolution of Big Data at Intel - Crawl, Walk and Run Approach - DataWorks Summit
Intel's big data journey began in 2011 with an evaluation of Hadoop. Since then, Intel has expanded its use of Hadoop and Cloudera across multiple environments. Intel's 3-year roadmap focuses on evolving its Hadoop platform to support more advanced analytics, real-time capabilities, and integrating with traditional BI tools. Key strategies include designing for scalability, following an iterative approach to understand data, and leveraging open source technologies.
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data - Hortonworks
Hadoop is a great platform for storing and processing massive amounts of data. Elasticsearch is the ideal solution for searching and visualizing the same data. Join us to learn how you can leverage the full power of both platforms to maximize the value of your Big Data.
In this webinar we'll walk you through:
How Elasticsearch fits in the Modern Data Architecture.
A demo of Elasticsearch and Hortonworks Data Platform.
Best practices for combining Elasticsearch and Hortonworks Data Platform to extract maximum insights from your data.
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster - Cloudera, Inc.
This document discusses implementing an HBase coprocessor to index columns from HBase into an Elasticsearch cluster. It describes storing book records from publishers and libraries in HBase and indexing them into Elasticsearch using MapReduce jobs. To handle updates, a coprocessor uses HBase's checkAndPut method to verify the record version before updating and indexing the new version into Elasticsearch, ensuring consistency between the two systems.
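To make the consistency mechanism described above concrete, here is a minimal sketch of a version-checked update using HBase's checkAndPut; the class, table schema, and column names are hypothetical illustrations, not the talk's actual code:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical sketch: verify the record version before updating. */
public class VersionCheckedUpdater {
  private static final byte[] CF = Bytes.toBytes("d");
  private static final byte[] VERSION = Bytes.toBytes("version");

  /** Returns true if the row was updated and should be re-indexed. */
  public boolean updateIfCurrent(HTableInterface table, byte[] row,
                                 long expectedVersion, byte[] qualifier,
                                 byte[] newValue) throws IOException {
    Put put = new Put(row);
    put.add(CF, VERSION, Bytes.toBytes(expectedVersion + 1));
    put.add(CF, qualifier, newValue);
    // checkAndPut atomically compares the stored version with the one we
    // read earlier; the Put is applied only if they still match, so a
    // concurrent writer is never silently overwritten.
    return table.checkAndPut(row, CF, VERSION,
        Bytes.toBytes(expectedVersion), put);
  }
}

If checkAndPut returns false, the caller would re-read the record and retry before indexing the new version into Elasticsearch.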
HBaseCon 2013: Full-Text Indexing for Apache HBase - Cloudera, Inc.
This document discusses full-text indexing for HBase tables. It describes how Lucene indices are organized based on HBase regions. Index building is implemented using coprocessors to update indices on data changes. Index splitting is optimized to avoid blocking updates during region splits. Search performance of indexing 10 billion records was tested, showing search times of around 1 second.
Intro to HBase Internals & Schema Design (for HBase users) - alexbaranau
This document provides an introduction to HBase internals and schema design for HBase users. It discusses the logical and physical views of HBase, including how tables are split into regions and stored across region servers. It covers best practices for schema design, such as using row keys efficiently and avoiding redundancy. The document also briefly discusses advanced topics like coprocessors and compression. The overall goal is to help HBase users optimize performance and scalability based on its internal architecture.
Analyzing 1.2 Million Network Packets per Second in Real-time - DataWorks Summit
The document describes Cisco's OpenSOC, an open source security operations center that can analyze 1.2 million network packets per second in real time. It discusses the business need for such a solution given how breaches often go undetected for months. The solution architecture utilizes big data technologies like Hadoop, Kafka and Storm to enable real-time processing of streaming data at large scale. It also provides lessons learned around optimizing the performance of components like Kafka, HBase and Storm topologies.
The document discusses Thomas Rabaix's involvement with Symfony including developing plugins, writing a book, and now working for Ekino. It also provides an overview of a talk on Solr including indexing, searching, administration and deployment of Solr. The talk covers what Solr is, indexing documents, filtering queries, and how Solr integrates with Apache projects like Nutch and Tika.
This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and messages-centered systems.
This document discusses integrating Cassandra with Hadoop to enable both online transaction processing (OLTP) and online analytical processing (OLAP) on the same data. It provides an overview of Cassandra and Hadoop, describes how to configure them together on the same or separate clusters, and highlights tools like Pig that enable analytics on Cassandra data using Hadoop and MapReduce. Real-world examples of companies using the integration are also listed.
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. The talk concludes with future work on improved scheduling strategies and real-time resource monitoring.
Storm: distributed and fault-tolerant realtime computation - nathanmarz
Storm is a distributed real-time computation system that provides guaranteed message processing, horizontal scalability, and fault tolerance. It allows users to define data processing topologies and submit them to a Storm cluster for distributed execution. Spouts emit streams of tuples that are processed by bolts. Storm tracks processing to ensure reliability and replays failed tasks. It provides tools for deployment, monitoring, and optimization of real-time data processing.
This document discusses how to use Storm and Hadoop together to enable real-time and batch processing of large datasets. It describes using Hadoop to precompute batch views of data, and Storm to incrementally update real-time views as new data streams in. This allows for low-latency queries by combining precomputed batch views with real-time views that compensate for recent data not yet absorbed into the batch views.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Apache Storm 0.9 basic training - Verisign - Michael Noll
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
In this age of Big Data, data volumes grow ever larger while the technical problems and business scenarios become more complex. Compounding these complexities, data consumers are demanding faster analysis to common business questions asked of their Big Data. This session provides concrete examples of how to address this challenge. We will highlight the use of Big Data technologies, including Hadoop and Hive, with classic BI systems such as SQL Server Analysis Services.
Session takeaways:
• Understand the architectural components surrounding Hadoop, Hive, Classic BI, and the Tier-1 BI ecosystem
• Get strategies for addressing the technical issues when working with extremely large cubes
• See how to address the technical issues when working with Big Data systems from the DBA perspective
QConSF 2014 talk on Netflix Mantis, a stream processing system - Danny Yuan
Justin and I gave this talk at QCon SF 2014 about Mantis, a stream processing system that features a reactive programming API, auto scaling, and stream locality.
Combining logs, metrics, and tracing for centralized visibility - Elasticsearch
Discover how Elasticsearch efficiently combines data in a single store and how Kibana uses it for analysis. You will also see how the latest developments make it easier to identify, troubleshoot, and resolve operational issues faster.
Monitoring as an entry point for collaboration - Julien Pivotto
This document summarizes a talk on using monitoring as an entry point for collaboration. It discusses using the Prometheus monitoring system to collect metrics and expose them using exporters. Grafana is then used to visualize the metrics and create dashboards focused on business metrics like requests, errors, and durations. These metrics provide observability across teams and enable alerting when business services are impacted.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yyaHb8.
The authors discuss Netflix's new stream processing system that supports a reactive programming model, allows auto scaling, and is capable of processing millions of messages per second. Filmed at qconsf.com.
Danny Yuan is an architect and software developer in Netflix’s Platform Engineering team. Justin Becker is Senior Software Engineer at Netflix.
This document discusses performance tuning in SAP BI 7.0 at the backend and frontend levels. At the backend, factors like data load sequence, PSA partition size, parallelizing uploads, export data sources, and transformation rules can impact performance. At the frontend, query performance, aggregates, OLAP cache settings, and read mode can be tuned. The document provides steps to optimize these factors through tools like transaction codes RSCUSTV6, RSCUSTV14, and RSDIPROP.
The document discusses the emergence of big data and new data architectures needed to handle large, diverse datasets. It notes that internet companies built their own data systems like Hadoop to process massive amounts of unstructured data across thousands of servers in a fault-tolerant, scalable way. These systems use a map-reduce programming model and distributed file systems like HDFS to store and process data in a parallel, distributed manner.
The document describes how to monitor an SAP system using the Computing Center Management System (CCMS), which allows monitoring of components like the R/3 application servers, database, and operating system. It provides details on the monitoring architecture and tools for monitoring specific aspects of the system like users, workloads, buffers, and the database. Critical tasks for monitoring the system are also listed, such as checking backups, application server status, alerts, logs, jobs, locks, and resolving any issues.
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D... - SL Corporation
The most critical large-scale applications today, regardless of industry, involve a demand for real-time data transfer and visualization of potentially large volumes of data. With this demand comes numerous challenges and limiting factors, especially if these applications are deployed in virtual or cloud environments. In this session, SL’s CEO, Tom Lubinski, explains how to overcome the top four challenges to real-time application performance: database performance, network data transfer bandwidth limitations, processor performance and lack of real-time predictability. Solutions discussed will include design of the proper data model for the application data, along with design patterns that facilitate optimal and minimal data transfer across networks.
- The document summarizes key announcements and projects from JavaOne 2010, including Project Coin, Project Lambda, and Project Jigsaw which focus on language enhancements for productivity, closures, and modularity.
- It also discusses case studies from various companies on architectures using technologies like Spring, Hibernate, caching, and NoSQL databases to handle large-scale applications.
- Trends highlighted include focus on asynchronous and event-driven architectures, partitioning, and monitoring to handle thousands of servers and billions of requests per day.
How Klout is changing the landscape of social media with Hadoop and BI - Denny Lee
Updated from the Hadoop Summit slides (http://www.slideshare.net/Hadoop_Summit/klout-changing-landscape-of-social-media), we've included additional screenshots to help tell the whole story.
Combining logs, metrics, and traces for unified observability - Elasticsearch
Learn how Elasticsearch efficiently combines data in a single store and how Kibana is used to analyze it. Plus, see how recent developments help identify, troubleshoot, and resolve operational issues faster.
What is going on? Application Diagnostics on Azure - Copenhagen .NET User Group - Maarten Balliauw
We all like building and deploying cloud applications. But what happens once that’s done? How do we know if our application behaves like we expect it to behave? Of course, logging! But how do we get that data off of our machines? How do we sift through a bunch of seemingly meaningless diagnostics? In this session, we’ll look at how we can keep track of our Azure application using structured logging, AppInsights and AppInsights analytics to make all that data more meaningful.
Spring Batch is a lightweight framework for batch processing that provides reusable functions for transaction management, error handling, parallel processing, and more. It allows developers to focus on the business logic rather than infrastructure concerns. Spring Batch is lightweight and can be easily embedded into existing applications. It uses a POJO-based programming model with dependency injection and supports reading from and writing to various data sources. Jobs in Spring Batch are comprised of steps that can run sequentially or conditionally and process data in chunks for improved performance.
Interactive exploration of complex relational data sets in a web - SemWeb.Pro... - Logilab
The document discusses interactive exploration of complex relational datasets. It describes using the Cubicweb framework to store and query data using an entity-relationship model and RQL. Results can be visualized through standard views or processed into pivot tables and numerical arrays for array views like histograms, scatterplots and graphs. This allows flexible visualization and data mining of relational data through unique URLs.
This document contains an agenda for a presentation on Azure Stream Analytics. The agenda includes topics such as analytics in a modern world, why developers are interested in analytics, why use the cloud for analytics, an introduction to Azure Stream Analytics, the Azure Stream Analytics architecture, the Stream Analytics Query Language (SAQL), handling time in Azure Stream Analytics, scaling analytics, and conclusions. The document also includes speaker information and notes on various topics from the agenda.
Similar to Real-time analytics with HBase (long version) (20)
Keynote : AI & Future Of Offensive Security - Priyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Self-Healing Test Automation Framework - Healenium - Knoldus Inc.
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification - TrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrates industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
The History of Embeddings & Multimodal Embeddings - Zilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Keynote : Presentation on SASE Technology - Priyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
Increase Quality with User Access Policies - July 2024 - Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Retrieval Augmented Generation Evaluation with Ragas - Zilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Demystifying Neural Networks And Building Cybersecurity Applications - Priyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
Finetuning GenAI For Hacking and Defending - Priyanka Aash
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
Redefining Cybersecurity with AI Capabilities - Priyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
2. About me
Software Engineer at Sematext International
http://blog.sematext.com/author/abaranau
@abaranau
http://github.com/sematext (abaranau)
3. Plan
Problem background: what? why?
Going real-time with the append-only updates approach: how?
Open-source implementation: how exactly?
Q&A
4. Background: our services
Systems Monitoring Service (Solr, HBase, ...)
Search Analytics Service
[Diagram: multiple data collectors feed the Data Analytics & Storage backend, which serves reports and charts]
5. Background: Report Example
Search engine (Solr) request latency
6. Background: Report Example
HBase flush operations
7. Background: requirements
High volume of input data
Multiple filters/dimensions
Interactive (fast) reports
Show wide range of data intervals
Real-time data changes visibility
No sampling, accurate data needed
8. Background: serve raw data?
simply storing all data points doesn’t work
to show 1 year's worth of data points collected every second, 31,536,000 points have to be fetched
pre-aggregation (at least partial) needed
[Diagram: input data enters Data Analytics & Storage, where data processing (pre-aggregating) produces aggregated data that is served to reports]
9. Background: pre-aggregation
OLAP-like Solution
[Diagram: an input data item passes through processing logic that applies aggregation rules (filters/dimensions, time range granularities, ...) and emits multiple aggregated values]
10. Background: pre-aggregation
Simplified Example
aggregation rules: by sensor, by minute/day
input data item: {time: 1332882078, sensor: sensor55, value: 80.0}
[Diagram: the processing logic turns the input item into aggregated record groups:
by minute (e.g. {minute: 22214701, value: 30.0, ...}),
by day (e.g. {day: 2012-04-26, value: 10.5, ...}),
by minute & sensor (e.g. {minute: 22214701, sensor: sensor55, cpu: 70.3, ...})]
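To make the example concrete, here is a minimal sketch of this kind of pre-aggregation logic; the class, key format, and running-average representation are assumptions for illustration only:

import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: derive aggregation-group keys from one input item. */
public class PreAggregator {
  // Running (sum, count) per aggregated record, keyed by group key.
  private final Map<String, double[]> sumAndCount = new HashMap<>();

  public void process(long timeSeconds, String sensor, double value) {
    long minute = timeSeconds / 60;    // 1332882078 -> 22214701, as on the slide
    long day = timeSeconds / 86400;
    // One input item updates several aggregated records at once.
    add("minute:" + minute, value);
    add("day:" + day, value);
    add("minute:" + minute + "|sensor:" + sensor, value);
  }

  private void add(String key, double value) {
    double[] agg = sumAndCount.computeIfAbsent(key, k -> new double[2]);
    agg[0] += value; // sum
    agg[1] += 1;     // count
  }

  public double avg(String key) {
    double[] agg = sumAndCount.get(key);
    return agg[0] / agg[1];
  }
}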
11. Background: RMW updates are slow
more dimensions/filters -> greater output data vs input data ratio
individual read-modify-write (Get+Put) operations are slow and inefficient (10-20+ times slower than Puts alone)
[Diagram: each incoming sensor value triggers a Get+Put against the aggregated record in HBase storage; reports read the aggregated values (e.g. avg, min, max) with Get/Scan]
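For concreteness, here is a minimal sketch of the read-modify-write pattern being criticized, assuming a running-max aggregate; the table layout and names are hypothetical:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical sketch: the slow Get+Put update this deck argues against. */
public class ReadModifyWriteUpdater {
  private static final byte[] CF = Bytes.toBytes("a");
  private static final byte[] MAX = Bytes.toBytes("max");

  public void update(HTableInterface table, byte[] row, double value)
      throws IOException {
    // 1. Random read: fetch the current aggregated record.
    Result current = table.get(new Get(row));
    byte[] stored = current.getValue(CF, MAX);
    double max = (stored == null) ? value
        : Math.max(value, Bytes.toDouble(stored));
    // 2. Write the modified record back. The Get dominates the cost: with
    // many dimensions, every input item pays for several of these.
    Put put = new Put(row);
    put.add(CF, MAX, Bytes.toBytes(max));
    table.put(put);
  }
}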
12. Background: improve updates
Using in-place increment operations? Not fast enough and not flexible...
Buffering input records on the way in and writing in small batches? Doesn't scale, and data may be lost...
13. Background: batch updates
More efficient data processing: multiple updates processed at once, not individually
Decreases aggregation output (per input record)
Reliable, no data loss
Using “dictionary records” helps reduce the number of Get operations
14. Background: batch updates
“Dictionary Records”
Using data de-normalization to reduce random Get operations while doing “Get+Put” updates:
Keep compound records which hold data of multiple “normal” records that are usually updated together
N Get+Put operations replaced with M (Get+Put) and N Put operations, where M << N
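A minimal sketch of the “dictionary record” idea, assuming the values that are usually updated together are denormalized into a single compound row; the schema and the aggregation function (running max) are illustrative assumptions:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical sketch: one compound row replaces many random Gets. */
public class DictionaryRecordUpdater {
  private static final byte[] CF = Bytes.toBytes("a");

  /** Applies a batch of (sensor, value) updates sharing one compound row. */
  public void applyBatch(HTableInterface table, byte[] compoundRow,
                         List<String> sensors, List<Double> values)
      throws IOException {
    // M (here: 1) Get+Put against the compound "dictionary" record...
    Result compound = table.get(new Get(compoundRow));
    Put compoundPut = new Put(compoundRow);
    for (int i = 0; i < sensors.size(); i++) {
      byte[] qual = Bytes.toBytes(sensors.get(i));
      byte[] old = compound.getValue(CF, qual);
      double max = (old == null) ? values.get(i)
          : Math.max(values.get(i), Bytes.toDouble(old));
      compoundPut.add(CF, qual, Bytes.toBytes(max));
    }
    table.put(compoundPut);
    // ...while the N per-sensor detail records are written with plain Puts
    // (no Get), e.g. one row per sensor + timestamp; omitted for brevity.
  }
}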
15. Background: batch updates
Not real-time
If done frequently (closer to real-time), still a lot of costly Get+Put update operations
Bad (any?) rollback support
Handling failures of tasks which partially wrote data to HBase is complex
16. Going Real-time with Append-based Updates
17. Append-only: main goals
Increase record update throughput
Process updates more efficiently: reduce the number of operations and resource usage
Ideally, apply a high volume of incoming data changes in real-time
Add the ability to roll back changes
Handle high update peaks well
18. Append-only: how?
1. Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put)
2. Defer processing of updates to periodic jobs
3. Perform processing of updates on the fly only if the user asks for data earlier than updates are processed.
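A minimal sketch of step 1, assuming each update is written under the original key extended with a unique suffix so nothing is read or overwritten at write time (HBaseHUT's actual key adjustment, HutPut.adjustRow, appears later in the deck; this standalone version is an assumption):

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical sketch: append-only writes instead of Get+Put. */
public class AppendOnlyWriter {
  private static final byte[] CF = Bytes.toBytes("a");
  private static final byte[] VALUE = Bytes.toBytes("value");

  public void write(HTableInterface table, String aggregateKey, double value)
      throws IOException {
    // Append a new record instead of updating one in place: the original
    // key gets a unique suffix (here a timestamp), so every update is a
    // plain Put, and all updates for one aggregate stay adjacent in the
    // key space for later merging by range Scan.
    byte[] row = Bytes.toBytes(aggregateKey + "#" + System.nanoTime());
    Put put = new Put(row);
    put.add(CF, VALUE, Bytes.toBytes(value));
    table.put(put); // no Get needed at write time
  }
}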
19. Append-only: writing updates
1. Replace update (Get+Put) operations at write time with simple append-only writes (Put)
[Diagram: incoming sensor values (e.g. 15.0, 41.0) are appended to HBase storage as new records with plain Puts, alongside the previously written aggregated values (e.g. avg 22.7 / max 31.0); no Get happens on the write path]
20. Append-only: writing updates
2. Defer processing of updates to periodic jobs
[Diagram: a MapReduce job periodically merges the appended records with the stored aggregates; e.g. avg 22.7 / max 31.0 plus the new values 15.0 and 41.0 become avg 23.4 / max 41.0]
21. Append-only: writing updates
3. Perform aggregations on the fly if the user asks for data before updates are processed
[Diagram: a report request scans the stored aggregate (avg 22.7 / max 31.0) together with the not-yet-merged appended records (15.0, 41.0) and computes the up-to-date result (avg 23.4 / max 41.0) on the fly]
22. Append-only: benefits
High update throughput
Real-time updates visibility
Efficient updates processing
Handling high peaks of update operations
Ability to roll back any range of changes
Automatically handling failures of tasks which only partially updated data (e.g. in MR jobs)
23. Append-only: high update throughput (1/6)
Avoid Get+Put operations upon writing
Use only Put operations (i.e. insert new records only), which is very fast in HBase
Process updates when flushing the client-side buffer to reduce the number of actual writes
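A minimal sketch of the client-side buffering mentioned above, using the HBase 0.9x-era HTable API; the pre-merging hook is an assumption:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

/** Hypothetical sketch: buffer Puts client-side and process them on flush. */
public class BufferedAppendWriter {
  private final HTable table;

  public BufferedAppendWriter(HTable table) throws IOException {
    this.table = table;
    // Let the client accumulate Puts and send them in batches.
    table.setAutoFlush(false);
    table.setWriteBufferSize(2 * 1024 * 1024); // 2 MB buffer
  }

  public void write(List<Put> appends) throws IOException {
    // Updates targeting the same aggregate could be pre-merged here,
    // before they ever reach HBase, shrinking the number of actual writes.
    table.put(appends);
  }

  public void flush() throws IOException {
    table.flushCommits(); // sends whatever is buffered
  }
}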
24. Append-only: real-time updates (2/6)
Increased update throughput allows applying updates in real-time
User always sees the latest data changes
Updates processed on the fly during Get or Scan can be stored back right away
Periodic updates processing helps avoid doing a lot of work during reads, making reading very fast
25. Append-only: efficient updates (3/6)
To apply N changes:
N Get+Put operations are replaced with N Puts and 1 (shared) Scan + 1 Put operation
Applying N changes at once is much more efficient than performing N individual changes
Especially when the updated value is complex (like bitmaps) and takes time to load into memory
Skip compacting if there are too few records to process
Avoids a lot of redundant Get operations when a large portion of operations is inserting new data
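A minimal sketch of that merge, assuming all appended records for one aggregate fall into a single key range, so one shared Scan plus one Put replaces N Get+Put operations; the schema, running-max function, and cleanup strategy are assumptions (HBaseHUT's real implementation sits behind its Scan-based API, shown later):

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical sketch: merge N appended updates with 1 Scan + 1 Put. */
public class UpdateMerger {
  private static final byte[] CF = Bytes.toBytes("a");
  private static final byte[] MAX = Bytes.toBytes("max");

  public void merge(HTableInterface table, byte[] startRow, byte[] stopRow)
      throws IOException {
    Double max = null;
    byte[] firstRow = null;
    // One shared range Scan reads every appended record for the aggregate.
    ResultScanner scanner = table.getScanner(new Scan(startRow, stopRow));
    try {
      for (Result r : scanner) {
        if (firstRow == null) firstRow = r.getRow();
        double val = Bytes.toDouble(r.getValue(CF, MAX));
        if (max == null || val > max) max = val;
      }
    } finally {
      scanner.close();
    }
    if (max == null) return; // nothing to process: skip compacting
    // One Put stores the merged result back (e.g. reusing the first row
    // key); cleaning up the old records is left to the periodic job.
    Put put = new Put(firstRow);
    put.add(CF, MAX, Bytes.toBytes(max.doubleValue()));
    table.put(put);
  }
}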
26. Append-only: high peaks handling (4/6)
Actual updates do not happen at write time
Merge is deferred to periodic jobs, which can be scheduled to run at off-peak times (nights/weekends)
Merge speed is not critical and doesn't affect the visibility of changes
27. Append-only: rollback (5/6)
Rollbacks are easy when updates have not been processed yet (not merged)
To preserve rollback ability after they are processed (and the result is written back), updates can be compacted into groups
[Diagram: appended records are compacted into groups by write time (9:00, 10:00, 11:00, ...), so any group, and hence any time range of changes, can still be rolled back after processing]
28. Append-only: rollback (5/6, continued)
Example:
* keep an all-time avg value per sensor
* data collected every 10 seconds for 30 days
Solution:
* perform periodic compactions every 4 hours
* compact groups based on a 1-hour interval
Result:
At any point in time there are no more than 24 * 30 + 4 * 60 * 6 = 2160 non-compacted records that need to be processed on the fly (24 * 30 = 720 hourly-compacted group records over 30 days, plus at most 4 * 60 * 6 = 1440 raw records, one per 10 seconds, from the not-yet-compacted last 4 hours)
29. Append-only: idempotency (6/6)
Using the append-only approach helps recover from failed tasks which write data to HBase:
without rolling back partial updates
avoids applying duplicate updates
a task failure is fixed by simply restarting the task
Note: the new task should write records with the same row keys as the failed one
easy, especially given that the input data is likely to be the same
Very convenient when writing from MapReduce
The periodic update-processing jobs are also idempotent
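A minimal sketch of why a simple task restart is safe, assuming row keys are derived deterministically from the input record rather than from wall-clock time at write time; all names here are hypothetical:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Hypothetical sketch: deterministic row keys make retried writes idempotent. */
public class IdempotentTaskWriter {
  private static final byte[] CF = Bytes.toBytes("a");
  private static final byte[] VALUE = Bytes.toBytes("value");

  /**
   * The row key is built only from the input record (aggregate key plus
   * the record's own timestamp), never from the time of writing. A
   * restarted task therefore re-writes the exact same rows: partially
   * written output is overwritten, not duplicated.
   */
  public void write(HTableInterface table, String aggregateKey,
                    long recordTimestamp, double value) throws IOException {
    byte[] row = Bytes.toBytes(aggregateKey + "#" + recordTimestamp);
    Put put = new Put(row);
    put.add(CF, VALUE, Bytes.toBytes(value));
    table.put(put);
  }
}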
30. Append-only: cons
Processing on the fly makes reading slower
Looking for data to compact (during periodic compactions) may be inefficient
Increased amount of stored data, depending on the use-case (in 0.92+)
31. Append-only + Batch?
Works very well together; the batch approach benefits from:
increased update throughput
automatic task failure handling
rollback ability
Use when the HBase cluster cannot cope with processing updates in real-time, or when update operations are the bottleneck in your batch
We use it ;)
33. HBaseHUT: Overview
Simple
Easy to integrate into existing projects
Packaged as a single jar to be added to the HBase client classpath (also add it to the RegionServer classpath to benefit from server-side optimizations)
Supports the native HBase API: HBaseHUT classes implement native HBase interfaces
Apache License, v2.0
34. HBaseHUT: Overview
Processing of updates on the fly (behind the ResultScanner interface)
Allows storing back the processed Result
Can use coprocessors (CPs) to process updates on the server side
Periodic processing of updates with a Scan or MapReduce job
Including processing updates in groups based on write timestamp
Rolling back changes with a MapReduce job
35. HBaseHUT vs(?) OpenTSDB
“vs” is wrong, they are simply different things
OpenTSDB is a time-series database
HBaseHUT is a library which implements the append-only updates approach, to be used in your project
OpenTSDB uses the “serve raw data” approach (with storage improvements), limited to handling numeric values
HBaseHUT is meant for (but not limited to) the “serve aggregated data” approach and works with any data
36. HBaseHUT: API overview
Writing data:
Put put = new Put(HutPut.adjustRow(rowKey));
// ...
hTable.put(put);
Reading data:
Scan scan = new Scan(startKey, stopKey);
ResultScanner resultScanner =
    new HutResultScanner(hTable.getScanner(scan), updateProcessor);
for (Result current : resultScanner) { ... }
37. HBaseHUT: API overview
Example UpdateProcessor:
public class MaxFunction extends UpdateProcessor {
  // ... constructor & utility methods

  @Override
  public void process(Iterable<Result> records,
                      UpdateProcessingResult result) {
    Double maxVal = null;
    for (Result record : records) {
      double val = getValue(record);
      if (maxVal == null || maxVal < val) {
        maxVal = val;
      }
    }
    result.add(colfam, qual, Bytes.toBytes(maxVal));
  }
}
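Putting the two previous slides together, a usage sketch might look as follows; the MaxFunction constructor arguments are an assumption, not HBaseHUT's documented signature:

// Hypothetical wiring of slides 36 and 37 together:
UpdateProcessor maxFunction = new MaxFunction(colfam, qual); // assumed ctor
Scan scan = new Scan(startKey, stopKey);
ResultScanner resultScanner =
    new HutResultScanner(hTable.getScanner(scan), maxFunction);
for (Result current : resultScanner) {
  // each Result here is the merged (max) view of all appended updates
  // that share the same original row key
}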
38. HBaseHUT: how we use it
Data Analytics & Storage Reports
aggregated data
HBaseHUT
HBaseHUT
input initial data
data processing
HBase
HBaseHUT
HBaseHUT
periodic
MapReduce
updates
jobs
processing
Alex Baranau, Sematext International, 2012
Sunday, May 20, 12
39. HBaseHUT: Next Steps
Wider utilization of coprocessors (CPs, HBase 0.92+)
Process updates during memstore flush
Make use of Append operation (HBase 0.94+)
Integrate with asynchbase lib
Reduce storage overhead from adjusting row keys
etc.
40. Qs?
http://github.com/sematext/HBaseHUT
http://blog.sematext.com
@abaranau
http://github.com/sematext (abaranau)
http://sematext.com, we are hiring! ;)