Robby Grossman presented on Shareaholic's transition from MongoDB to Riak. Shareaholic needed a database with linear scalability, full-text search, and flexible indexing to support their growing product. They evaluated HBase, Cassandra, and Riak, and chose Riak for its operational simplicity, linear scalability, integrated search, and secondary indices. Shareaholic migrated their data from MongoDB to Riak without downtime by writing to both databases simultaneously and verifying data integrity before decommissioning MongoDB. Riak has served Shareaholic well for MapReduce queries, full-text search, and publisher analytics use cases. Benchmarking showed that vertical scaling on EC2 provides better latency than horizontal scaling.
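The zero-downtime migration pattern described above (dual writes, verification, then cut-over) can be sketched in a few lines. This is a hypothetical illustration using in-memory dicts as stand-ins for MongoDB and Riak, not Shareaholic's actual code.

```python
# Minimal sketch of the dual-write migration: write to both stores,
# verify they agree, then switch reads to the new store.
class DualWriter:
    """Writes go to both the old and new store; reads come from the old
    store until verification passes, then switch to the new store."""

    def __init__(self):
        self.old_store = {}   # stands in for MongoDB
        self.new_store = {}   # stands in for Riak
        self.cut_over = False

    def write(self, key, value):
        self.old_store[key] = value
        self.new_store[key] = value

    def read(self, key):
        store = self.new_store if self.cut_over else self.old_store
        return store.get(key)

    def verify(self):
        """Compare the stores; only cut over if they agree."""
        if self.old_store == self.new_store:
            self.cut_over = True
        return self.cut_over

db = DualWriter()
db.write("page:42", {"shares": 17})
assert db.verify()          # stores agree, safe to decommission the old one
print(db.read("page:42"))   # now served from the new store
```

In production the verification step would scan and compare both databases offline, but the ordering of operations is the same.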
This document discusses the challenges of keeping a metadata repository current using event-driven updates from data sources. It describes how using Apache Kafka and the Debezium connector to capture changes from database "outbox" tables that mirror system catalog metadata tables allows metadata deltas to be pushed to the repository in real time. This overcomes the limitations of log-based and query-based CDC approaches when they are applied directly to database system tables.
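The core of the outbox idea is that the catalog update and the outbox row are committed in one transaction, so the change stream can neither miss nor duplicate a delta. A minimal sketch, using sqlite3 in place of the production database and a simple poller in place of the Debezium connector; table and column names are illustrative assumptions:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tables_meta (name TEXT PRIMARY KEY, schema_json TEXT)")
conn.execute("CREATE TABLE metadata_outbox "
             "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")

def update_metadata(name, schema):
    # The metadata update and the outbox row commit atomically.
    with conn:
        conn.execute("INSERT OR REPLACE INTO tables_meta VALUES (?, ?)",
                     (name, json.dumps(schema)))
        conn.execute("INSERT INTO metadata_outbox (payload) VALUES (?)",
                     (json.dumps({"table": name, "schema": schema}),))

def poll_outbox(after_id=0):
    # Stands in for the log-based CDC connector tailing the outbox table.
    rows = conn.execute(
        "SELECT id, payload FROM metadata_outbox WHERE id > ? ORDER BY id",
        (after_id,)).fetchall()
    return [(i, json.loads(p)) for i, p in rows]

update_metadata("orders", {"cols": ["id", "total"]})
deltas = poll_outbox()
print(deltas[0][1]["table"])  # -> orders
```

In the real setup Debezium tails the database's transaction log for the outbox table and publishes each row to a Kafka topic, from which the metadata repository consumes.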
This document provides an introduction to JanusGraph, an open source distributed graph database that can be used with Apache HBase for storage. It begins with background on graph databases and their structures, such as vertices, edges, properties, and different storage models. It then discusses JanusGraph's architecture, support for the TinkerPop graph computing framework, and schema and data modeling capabilities. Details are given on partitioning graphs across servers and using different indexing approaches. The document concludes by explaining why HBase is a good storage backend for JanusGraph and providing examples of how the data model would be structured within HBase.
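The "data model within HBase" portion can be made concrete with a toy adjacency-list layout: one HBase row per vertex, with properties and incident edges stored as columns in that row. This is an illustrative sketch of the general layout, not JanusGraph's actual serialization; the column-name encoding here is an assumption.

```python
# Toy HBase table: row key = vertex id, columns in one family hold
# properties and edges, so reading a vertex's neighborhood is one row read.
hbase_table = {
    b"v1": {
        b"e:prop:name": b"alice",
        b"e:out:knows:v2": b"since=2019",   # outgoing edge with a property
    },
    b"v2": {
        b"e:prop:name": b"bob",
        b"e:in:knows:v1": b"since=2019",    # incoming-edge mirror entry
    },
}

def neighbors(table, vertex, label):
    """Scan one row's columns for edges with the given label -- a
    single-row read, which is why this layout partitions well."""
    prefix = b"e:out:" + label + b":"
    row = table.get(vertex, {})
    return [col[len(prefix):] for col in row if col.startswith(prefix)]

print(neighbors(hbase_table, b"v1", b"knows"))  # -> [b'v2']
```

Because all edges of a vertex live in one row, HBase's sorted row keys and region splits give the horizontal partitioning the document describes.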
CloudStack currently provides a variety of bespoke high availability (HA) mechanisms for resources such as virtual machines, hosts, and virtual routers. Each of these implementations duplicates the HA check/recovery cycle, as well as the concurrency, persistence, and clustering required to manage high availability for any CloudStack resource. The High Availability Resource Management Service has been developed to consolidate these concerns, providing a robust, extensible HA mechanism. Using this service, plugins only need to define health check, activity check, and fence operations.
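The plugin contract described above can be sketched as an interface with the three operations, with the framework owning the shared check/recovery cycle. Class and method names here are assumptions for illustration, not CloudStack's actual API.

```python
from abc import ABC, abstractmethod

class HAProvider(ABC):
    """What a plugin supplies: the three resource-specific operations."""
    @abstractmethod
    def health_check(self, resource) -> bool: ...
    @abstractmethod
    def activity_check(self, resource) -> bool: ...
    @abstractmethod
    def fence(self, resource) -> None: ...

class HAService:
    """The consolidated cycle: unhealthy, inactive resources get fenced."""
    def __init__(self, provider: HAProvider):
        self.provider = provider

    def run_cycle(self, resource):
        if self.provider.health_check(resource):
            return "healthy"
        if self.provider.activity_check(resource):
            return "degraded"   # still active, so do not fence it
        self.provider.fence(resource)
        return "fenced"

class FakeVmProvider(HAProvider):
    def __init__(self):
        self.fenced = []
    def health_check(self, vm):
        return vm.get("up", False)
    def activity_check(self, vm):
        return vm.get("disk_activity", False)
    def fence(self, vm):
        self.fenced.append(vm["id"])

svc = HAService(FakeVmProvider())
print(svc.run_cycle({"id": "vm-1", "up": False, "disk_activity": False}))
```

The point of the design is visible in the split: concurrency, persistence, and clustering live once in `HAService`, while each resource type only implements the three checks.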
Most HTML5 web applications are relatively small scale: they are maintained by a single team and contain relatively little JavaScript, CSS and HTML5 code. At Caplin we build "thick client" replacement financial trading systems containing considerable business logic implemented in hundreds of thousands of lines of JavaScript code. The code is maintained by multiple development teams spread across multiple business units. The talk describes the problems faced and how they can be solved using componentization, loose coupling, services, an event bus, design patterns, BDD, the best open source libraries, test by contract, test automation, and more.
The first presentation for the Kafka Meetup @ LinkedIn (Bangalore) held on 2015/12/5. It provides a brief introduction to the motivation for building Kafka and a high-level view of how it works. Please download the presentation if you wish to see the animated slides.
This document discusses AntsDB, an open source project that brings MySQL compatibility to HBase in order to address the need for relational database capabilities in NoSQL systems. It describes AntsDB's architecture, which uses caching and other techniques to provide low-latency transactions and joins on HBase. Performance tests show AntsDB can achieve high throughput for writes and OLTP workloads. AntsDB aims to be complementary to HBase by virtualizing MySQL atop HBase while simulating MySQL behaviors and allowing applications built for MySQL to run unchanged on HBase.
New Journey of HBase in Alibaba and Cloud discusses Alibaba's use of HBase over 8 years and the improvements made. Key points discussed include:
- Alibaba began using HBase in 2010 and has since contributed to the open source community while developing internal improvements.
- Challenges addressed include JVM garbage collection pauses, separating computing and storage, and adding cold/hot data tiering. A diagnostic system was also created.
- Alibaba uses HBase across many core scenarios and has integrated it with other databases in a multi-model approach to support different workloads.
- Benefits of running HBase on cloud include flexibility, cost savings, and making it
This document discusses different big data scenarios using HBase, including:
1. Architecture evolution over time, covering OLAP and real-time ETL scenarios
2. The OLAP scenario's requirements, such as handling billions of records with sub-second queries, with examples using Kylin
3. The monitoring scenario, showing how different systems are monitored using technologies like Grafana
4. Brief mentions of data mining and HDI scenarios
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
HBase is used at China Telecom for various applications including persistence for streaming jobs, online reading and writing, and as a data store for their core system. They operate several HBase clusters storing over 500 TB of data ingesting 1 TB per day. They monitor HBase using Ganglia for basic metrics and Zabbix for critical alerts. When issues arise, such as a system hang, they investigate debug cases and perform optimizations like changing the garbage collector from CMS to G1 and implementing read/write splitting.
My planned talk at HBTC, China's largest big data technology conference, covering column databases and Hadoop-related areas.
The document discusses how interfaces built on different protocols like REST, Kafka, GraphQL, gRPC, and MySQL can be described in a protocol-agnostic way. It defines common attributes across protocols, such as scope, operation, send and receive data formats, asynchronous/streaming behavior, and connection and authentication settings. A protocol-agnostic description provides benefits like a universal specification for documentation, collaboration between teams using different architectures, and a consistent user experience.
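The "common attributes" idea can be sketched as a single neutral operation spec that describes a REST call and a Kafka produce alike. Field names below are illustrative assumptions, not a published specification.

```python
from dataclasses import dataclass, field

@dataclass
class OperationSpec:
    """One protocol-agnostic description of an operation."""
    protocol: str            # "rest", "kafka", "grpc", ...
    scope: str               # URL path, topic name, or RPC service/method
    operation: str           # GET/POST, produce/consume, unary/stream, ...
    send_format: str = "json"
    receive_format: str = "json"
    streaming: bool = False
    auth: dict = field(default_factory=dict)

# The same shape documents two very different interfaces:
fetch_user = OperationSpec(protocol="rest", scope="/users/{id}",
                           operation="GET")
emit_event = OperationSpec(protocol="kafka", scope="user-events",
                           operation="produce", streaming=True)

for spec in (fetch_user, emit_event):
    print(f"{spec.protocol}: {spec.operation} {spec.scope}")
```

Because both entries share one schema, a documentation generator or a cross-team catalog can treat them uniformly, which is the collaboration benefit the document claims.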
How we can make use of Kubernetes as a resource manager for Spark. The pros and cons of each Spark resource manager are discussed in these slides and the associated tutorial. Refer to this GitHub project for more details and code samples: https://github.com/haridas/hadoop-env
Some people see their cars just as a means to get them from point A to point B without breaking down halfway, but most of us also want them to be comfortable, performant, easy to drive, and of course - to look good. We can think of Kafka Connect connectors in a similar way. While the main focus is on getting data from or writing data to the external target system, it's also relevant how easy the connector is to configure, whether it scales well, whether it provides the best possible data consistency, whether it is resilient to both external system and Kafka cluster failures, and so on. This talk focuses on the aspects of connector plugin development that are important for achieving these goals. More specifically, we'll cover configuration definition and validation, handling of external source partitions and offsets, achieving the desired delivery semantics, and more.
This document discusses Pinterest's data architecture and use of Pinball for workflow management. Pinterest processes 3 petabytes of data daily from their 60 billion pins and 1 billion boards across a 2000 node Hadoop cluster. They use Kafka, Secor and Singer for ingesting event data. Pinball is used for workflow management to handle their scale of hundreds of workflows, thousands of jobs and 500+ jobs in some workflows. Pinball provides simple abstractions, extensibility, reliability, debuggability and horizontal scalability for workflow execution.
The presentation covers the lambda architecture and its implementation with Spark. We will discuss the components of the lambda architecture: the batch layer, speed layer, and serving layer. We will also discuss its advantages and the benefits of building it with Spark.
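The three layers named above can be sketched in plain Python, standing in for Spark batch and streaming jobs: the batch layer recomputes views over the master dataset, the speed layer covers events the batch has not yet absorbed, and the serving layer merges both at query time. The page-view counting workload is a made-up example.

```python
from collections import Counter

master_dataset = [("page_view", "home"), ("page_view", "cart"),
                  ("page_view", "home")]
recent_events = [("page_view", "home")]   # arrived after the last batch run

def batch_layer(events):
    """Recomputed from scratch over all historical data (high latency,
    high accuracy) -- a Spark batch job in the real architecture."""
    return Counter(page for _, page in events)

def speed_layer(events):
    """Incremental view over events the batch has not absorbed yet --
    a Spark Streaming job in the real architecture."""
    return Counter(page for _, page in events)

def serving_layer(batch_view, realtime_view, key):
    """Answer queries by merging the two views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

batch_view = batch_layer(master_dataset)
realtime_view = speed_layer(recent_events)
print(serving_layer(batch_view, realtime_view, "home"))  # -> 3
```

When the next batch run absorbs `recent_events` into the master dataset, the speed layer's view for those events is discarded, which is how the architecture bounds the complexity of the real-time path.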
Presented by Mark Miller, Software Engineer, Cloudera. As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling, two sides of the same coin? Come learn where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, and search ecosystem evolve? If you are interested in big data, NoSQL, distributed systems, the CAP theorem, and other hype-filled terms, then this talk may be for you.
Robby Grossman, Shareaholic's Tech Lead, spoke at the first Boston Riak Meetup on August 30, 2012. These are his slides.
This document provides an overview of Riak TS, Basho's new purpose-built time series database. It describes Riak TS's key features like high write throughput, efficient range query support, and horizontal scalability. It also outlines Riak TS's data modeling approach of co-locating and partitioning time-series data, its SQL-like query language, and provides examples of its performance and roadmap. Finally, it demonstrates a potential use case application called UNCORKD for tracking wine check-ins and reviews.
ABSTRACT
GPS is one of the technologies used in a huge number of applications today. One such application is tracking your vehicle and keeping regular watch on it. This tracking system can report the location and route travelled by the vehicle, and that information can be observed from any remote location. It also includes a web application that provides the exact location of the target and the exact speed of the vehicle, which is used to generate bills for over-speeding automatically. The system can track the target in any weather conditions. It uses GPS and ZigBee technologies. The hardware part comprises GPS, ZigBee, and an ATmega microcontroller; the software part interfaces all the required modules, and a web application developed on the client side visualizes the IoT data. The main objective is to design a system that can be easily installed and that provides a platform for further enhancement.
KEYWORDS
GPS, ZigBee, Tracking System, IoT
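The automatic over-speed billing the abstract mentions reduces to deriving speed from successive GPS fixes and emitting a bill when a segment exceeds the limit. A hypothetical sketch; the speed limit, fine amount, and data shapes are made-up parameters, not the paper's actual design.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def speeding_fines(fixes, limit_kmh=80, fine=500):
    """fixes: list of (timestamp_s, lat, lon) in time order.
    Returns one fine per over-limit segment, as the web application
    might bill it."""
    fines = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        hours = (t1 - t0) / 3600
        speed = haversine_km(la0, lo0, la1, lo1) / hours
        if speed > limit_kmh:
            fines.append({"at": t1, "speed_kmh": round(speed, 1),
                          "amount": fine})
    return fines

track = [(0, 12.9716, 77.5946), (60, 12.9850, 77.5946)]  # ~1.5 km in 60 s
print(speeding_fines(track))
```

In the described system the fixes would arrive over ZigBee from the GPS/ATmega hardware, and this check would run server-side behind the web application.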
Wait! Back away from the Cassandra secondary index. It's ok for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
Time series data is proliferating with literally every step that we take; just think about things like Fitbit bracelets that track your every move, or financial trading data, all of which is timestamped. Time series data requires high performance reads and writes even with a huge number of data sources. Both speed and scale are integral to success, which makes for a unique challenge for your database. A time series NoSQL data model requires the flexibility to support unstructured and semi-structured data, as well as the ability to write range queries to analyze your time series data. So how can you tackle speed, scale and flexibility all at once? Join Professional Services Architect Drew Kerrigan and Developer Advocate Matt Brender for a discussion of:
- Examples of time series data sets, from IoT to finance to jet engines
- What makes time series queries different from other database queries
- How to model your dataset to answer the right questions about your data
- How to store, query and analyze a set of time series data points
Learn how a NoSQL database model and Riak TS can help you address the unique challenges of time series data.
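The modeling idea behind range queries in a store like Riak TS can be sketched as keying each point by (source, time bucket), so points from one source land near each other and a range query only touches the buckets it needs. The 15-minute bucket size below is an assumed parameter, and the dict-backed table is a stand-in for the database.

```python
from collections import defaultdict

BUCKET_S = 900  # 15-minute quanta; co-locates nearby points

table = defaultdict(list)  # (source, bucket_start) -> [(ts, value), ...]

def write(source, ts, value):
    """Route each point to its source's time bucket."""
    table[(source, ts - ts % BUCKET_S)].append((ts, value))

def range_query(source, start, end):
    """Scan only the buckets overlapping [start, end), then filter --
    the work is proportional to the range, not the whole dataset."""
    out = []
    for bucket in range(start - start % BUCKET_S, end, BUCKET_S):
        out.extend(v for t, v in table[(source, bucket)] if start <= t < end)
    return out

for ts in range(0, 3600, 600):              # one reading every 10 minutes
    write("engine-7", ts, 20.0 + ts / 1000)
print(range_query("engine-7", 0, 1800))     # first half hour of readings
```

The same quantization is what lets a distributed store place each bucket on one partition, so a time-bounded query fans out to few nodes.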
Cassandra is a distributed database that can be used with Solr for distributed search capabilities. Data is written to Cassandra and indexed by Solr to enable fast and scalable full-text search across nodes. Queries can be performed directly on Cassandra or through the Solr API, with tradeoffs in performance. Production deployments typically use a mix of Cassandra and Solr nodes for analytics and search workloads.
The document discusses various use cases for MapR's Hadoop distribution including restaurant recommendations, fraud modeling, network security, and log analysis. It highlights how MapR allows easy data access and deployment across these applications using techniques like NFS, mirrors, and avoiding special data movement mechanisms. The document also provides technical details on how specific solutions like recommendation modeling, fraud detection, and log analysis can leverage MapR.
With AWS you can choose the right database for the right job. Given the myriad of choices, from relational databases to non-relational stores, this session profiles details and examples of some of the options available to you (MySQL, RDS, ElastiCache, Redis, Cassandra, MongoDB, and DynamoDB), with details on real world deployments from customers using Amazon RDS, ElastiCache, and DynamoDB.
Global Big Data Conference, Sept 2014: AWS Kinesis, Spark Streaming, Approximations, and the Lambda Architecture.
WebHack#43 Challenges of Global Infrastructure at Rakuten https://webhack.connpass.com/event/208888/