Robby Grossman presented on Shareaholic's transition from MongoDB to Riak. Shareaholic needed a database with linear scalability, full-text search, and flexible indexing to support their growing product. They evaluated HBase, Cassandra, and Riak, and chose Riak for its operational simplicity, linear scalability, integrated search, and secondary indices. Shareaholic migrated their data from MongoDB to Riak without downtime by writing to both databases simultaneously and verifying data integrity before decommissioning MongoDB. Riak has served Shareaholic well for MapReduce queries, full-text search, and publisher analytics use cases. Benchmarking showed that vertical scaling on EC2 provides better latency than horizontal scaling.
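The zero-downtime migration pattern described above (dual writes, verification, then cut-over) can be sketched in a few lines. This is a hypothetical illustration using in-memory dicts as stand-ins for MongoDB and Riak, not Shareaholic's actual code.

```python
# Minimal sketch of the dual-write migration: write to both stores,
# verify they agree, then switch reads to the new store.
class DualWriter:
    """Writes go to both the old and new store; reads come from the old
    store until verification passes, then switch to the new store."""

    def __init__(self):
        self.old_store = {}   # stands in for MongoDB
        self.new_store = {}   # stands in for Riak
        self.cut_over = False

    def write(self, key, value):
        self.old_store[key] = value
        self.new_store[key] = value

    def read(self, key):
        store = self.new_store if self.cut_over else self.old_store
        return store.get(key)

    def verify(self):
        """Compare the stores; only cut over if they agree."""
        if self.old_store == self.new_store:
            self.cut_over = True
        return self.cut_over

db = DualWriter()
db.write("page:42", {"shares": 17})
assert db.verify()          # stores agree, safe to decommission the old one
print(db.read("page:42"))   # now served from the new store
```

In production the verification step would scan and compare both databases offline, but the ordering of operations is the same.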
This document discusses the challenges of keeping a metadata repository current using event-driven updates from data sources. It describes how using Apache Kafka and the Debezium connector to capture changes from database "outbox" tables that mirror system catalog metadata tables allows metadata deltas to be pushed to the repository in real time. This overcomes the limitations of log-based and query-based CDC approaches when they are applied directly to database system tables.
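The core of the outbox idea is that the catalog update and the outbox row are committed in one transaction, so the change stream can neither miss nor duplicate a delta. A minimal sketch, using sqlite3 in place of the production database and a simple poller in place of the Debezium connector; table and column names are illustrative assumptions:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tables_meta (name TEXT PRIMARY KEY, schema_json TEXT)")
conn.execute("CREATE TABLE metadata_outbox "
             "(id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")

def update_metadata(name, schema):
    # The metadata update and the outbox row commit atomically.
    with conn:
        conn.execute("INSERT OR REPLACE INTO tables_meta VALUES (?, ?)",
                     (name, json.dumps(schema)))
        conn.execute("INSERT INTO metadata_outbox (payload) VALUES (?)",
                     (json.dumps({"table": name, "schema": schema}),))

def poll_outbox(after_id=0):
    # Stands in for the log-based CDC connector tailing the outbox table.
    rows = conn.execute(
        "SELECT id, payload FROM metadata_outbox WHERE id > ? ORDER BY id",
        (after_id,)).fetchall()
    return [(i, json.loads(p)) for i, p in rows]

update_metadata("orders", {"cols": ["id", "total"]})
deltas = poll_outbox()
print(deltas[0][1]["table"])  # -> orders
```

In the real setup Debezium tails the database's transaction log for the outbox table and publishes each row to a Kafka topic, from which the metadata repository consumes.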
This document provides an introduction to JanusGraph, an open source distributed graph database that can be used with Apache HBase for storage. It begins with background on graph databases and their structures, such as vertices, edges, properties, and different storage models. It then discusses JanusGraph's architecture, support for the TinkerPop graph computing framework, and schema and data modeling capabilities. Details are given on partitioning graphs across servers and using different indexing approaches. The document concludes by explaining why HBase is a good storage backend for JanusGraph and providing examples of how the data model would be structured within HBase.
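The "data model within HBase" portion can be made concrete with a toy adjacency-list layout: one HBase row per vertex, with properties and incident edges stored as columns in that row. This is an illustrative sketch of the general layout, not JanusGraph's actual serialization; the column-name encoding here is an assumption.

```python
# Toy HBase table: row key = vertex id, columns in one family hold
# properties and edges, so reading a vertex's neighborhood is one row read.
hbase_table = {
    b"v1": {
        b"e:prop:name": b"alice",
        b"e:out:knows:v2": b"since=2019",   # outgoing edge with a property
    },
    b"v2": {
        b"e:prop:name": b"bob",
        b"e:in:knows:v1": b"since=2019",    # incoming-edge mirror entry
    },
}

def neighbors(table, vertex, label):
    """Scan one row's columns for edges with the given label -- a
    single-row read, which is why this layout partitions well."""
    prefix = b"e:out:" + label + b":"
    row = table.get(vertex, {})
    return [col[len(prefix):] for col in row if col.startswith(prefix)]

print(neighbors(hbase_table, b"v1", b"knows"))  # -> [b'v2']
```

Because all edges of a vertex live in one row, HBase's sorted row keys and region splits give the horizontal partitioning the document describes.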
CloudStack currently provides a variety of bespoke high availability (HA) mechanisms for resources such as virtual machines, hosts, and virtual routers. Each of these implementations duplicates the HA check/recovery cycle, as well as the concurrency, persistence, and clustering required to manage high availability for any CloudStack resource. The High Availability Resource Management Service has been developed to consolidate these concerns, providing a robust, extensible HA mechanism. Using this service, plugins only need to define health check, activity check, and fence operations.
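The plugin contract described above can be sketched as an interface with the three operations, with the framework owning the shared check/recovery cycle. Class and method names here are assumptions for illustration, not CloudStack's actual API.

```python
from abc import ABC, abstractmethod

class HAProvider(ABC):
    """What a plugin supplies: the three resource-specific operations."""
    @abstractmethod
    def health_check(self, resource) -> bool: ...
    @abstractmethod
    def activity_check(self, resource) -> bool: ...
    @abstractmethod
    def fence(self, resource) -> None: ...

class HAService:
    """The consolidated cycle: unhealthy, inactive resources get fenced."""
    def __init__(self, provider: HAProvider):
        self.provider = provider

    def run_cycle(self, resource):
        if self.provider.health_check(resource):
            return "healthy"
        if self.provider.activity_check(resource):
            return "degraded"   # still active, so do not fence it
        self.provider.fence(resource)
        return "fenced"

class FakeVmProvider(HAProvider):
    def __init__(self):
        self.fenced = []
    def health_check(self, vm):
        return vm.get("up", False)
    def activity_check(self, vm):
        return vm.get("disk_activity", False)
    def fence(self, vm):
        self.fenced.append(vm["id"])

svc = HAService(FakeVmProvider())
print(svc.run_cycle({"id": "vm-1", "up": False, "disk_activity": False}))
```

The point of the design is visible in the split: concurrency, persistence, and clustering live once in `HAService`, while each resource type only implements the three checks.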
Most HTML5 web applications are relatively small scale: they are maintained by a single team and contain relatively little JavaScript, CSS and HTML5 code. At Caplin we build "thick client" replacement financial trading systems containing considerable business logic implemented in hundreds of thousands of lines of JavaScript code. The code is maintained by multiple development teams spread across multiple business units. The talk describes the problems faced and how they can be solved using componentization, loose coupling, services, an event bus, design patterns, BDD, the best open source libraries, test by contract, test automation, and more.
The first presentation for the Kafka Meetup @ LinkedIn (Bangalore) held on 2015/12/5. It provides a brief introduction to the motivation for building Kafka and a high-level view of how it works. Please download the presentation if you wish to see the animated slides.
This document discusses AntsDB, an open source project that brings MySQL compatibility to HBase in order to address the need for relational database capabilities in NoSQL systems. It describes AntsDB's architecture, which uses caching and other techniques to provide low-latency transactions and joins on HBase. Performance tests show AntsDB can achieve high throughput for writes and OLTP workloads. AntsDB aims to be complementary to HBase by virtualizing MySQL atop HBase while simulating MySQL behaviors and allowing applications built for MySQL to run unchanged on HBase.
New Journey of HBase in Alibaba and Cloud discusses Alibaba's use of HBase over 8 years and the improvements made. Key points discussed include:
- Alibaba began using HBase in 2010 and has since contributed to the open source community while developing internal improvements.
- Challenges addressed include JVM garbage collection pauses, separating computing and storage, and adding cold/hot data tiering. A diagnostic system was also created.
- Alibaba uses HBase across many core scenarios and has integrated it with other databases in a multi-model approach to support different workloads.
- Benefits of running HBase on cloud include flexibility, cost savings, and making it
This document discusses different big data scenarios using HBase, including:
1. Architecture evolution over time, covering OLAP and real-time ETL scenarios
2. The OLAP scenario's requirements, such as handling billions of records with sub-second queries, with examples using Kylin
3. The monitoring scenario, showing how different systems are monitored using technologies like Grafana
4. Brief mentions of data mining and HDI scenarios
RocksDB is the default state store for Kafka Streams. In this talk, we will discuss how to improve single node performance of the state store by tuning RocksDB and how to efficiently identify issues in the setup. We start with a short description of the RocksDB architecture. We discuss how Kafka Streams restores the state stores from Kafka by leveraging RocksDB features for bulk loading of data. We give examples of hand-tuning the RocksDB state stores based on Kafka Streams metrics and RocksDB’s metrics. At the end, we dive into a few RocksDB command line utilities that allow you to debug your setup and dump data from a state store. We illustrate the usage of the utilities with a few real-life use cases. The key takeaway from the session is the ability to understand the internal details of the default state store in Kafka Streams so that engineers can fine-tune their performance for different varieties of workloads and operate the state stores in a more robust manner.
HBase is used at China Telecom for various applications including persistence for streaming jobs, online reading and writing, and as a data store for their core system. They operate several HBase clusters storing over 500 TB of data ingesting 1 TB per day. They monitor HBase using Ganglia for basic metrics and Zabbix for critical alerts. When issues arise, such as a system hang, they investigate debug cases and perform optimizations like changing the garbage collector from CMS to G1 and implementing read/write splitting.
My planned talk at HBTC, China's largest big data technology conference, covering column databases and Hadoop-related areas.
The document discusses how interfaces built on different protocols like REST, Kafka, GraphQL, gRPC, and MySQL can be described in a protocol-agnostic way. It defines common attributes across protocols, such as scope, operation, send and receive data formats, asynchronous/streaming behavior, and connection and authentication settings. A protocol-agnostic description provides benefits like a universal specification for documentation, collaboration between teams using different architectures, and a consistent user experience.
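The "common attributes" idea can be sketched as a single neutral operation spec that describes a REST call and a Kafka produce alike. Field names below are illustrative assumptions, not a published specification.

```python
from dataclasses import dataclass, field

@dataclass
class OperationSpec:
    """One protocol-agnostic description of an operation."""
    protocol: str            # "rest", "kafka", "grpc", ...
    scope: str               # URL path, topic name, or RPC service/method
    operation: str           # GET/POST, produce/consume, unary/stream, ...
    send_format: str = "json"
    receive_format: str = "json"
    streaming: bool = False
    auth: dict = field(default_factory=dict)

# The same shape documents two very different interfaces:
fetch_user = OperationSpec(protocol="rest", scope="/users/{id}",
                           operation="GET")
emit_event = OperationSpec(protocol="kafka", scope="user-events",
                           operation="produce", streaming=True)

for spec in (fetch_user, emit_event):
    print(f"{spec.protocol}: {spec.operation} {spec.scope}")
```

Because both entries share one schema, a documentation generator or a cross-team catalog can treat them uniformly, which is the collaboration benefit the document claims.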
How we can make use of Kubernetes as a resource manager for Spark. The pros and cons of each Spark resource manager are discussed in these slides and the associated tutorial. Refer to this GitHub project for more details and code samples: https://github.com/haridas/hadoop-env
Some people see their cars just as a means to get them from point A to point B without breaking down halfway, but most of us also want them to be comfortable, performant, easy to drive, and of course - to look good. We can think of Kafka Connect connectors in a similar way. While the main focus is on getting data from or writing data to the external target system, it's also relevant how easy the connector is to configure, whether it scales well, whether it provides the best possible data consistency, whether it is resilient to both external system and Kafka cluster failures, and so on. This talk focuses on the aspects of connector plugin development that are important for achieving these goals. More specifically, we'll cover configuration definition and validation, handling of external source partitions and offsets, achieving the desired delivery semantics, and more.
This document discusses Pinterest's data architecture and use of Pinball for workflow management. Pinterest processes 3 petabytes of data daily from their 60 billion pins and 1 billion boards across a 2000 node Hadoop cluster. They use Kafka, Secor and Singer for ingesting event data. Pinball is used for workflow management to handle their scale of hundreds of workflows, thousands of jobs and 500+ jobs in some workflows. Pinball provides simple abstractions, extensibility, reliability, debuggability and horizontal scalability for workflow execution.
The presentation covers the lambda architecture and its implementation with Spark. We will discuss the components of the lambda architecture: the batch layer, speed layer, and serving layer. We will also discuss its advantages and the benefits of building it with Spark.
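The three layers named above can be sketched in plain Python, standing in for Spark batch and streaming jobs: the batch layer recomputes views over the master dataset, the speed layer covers events the batch has not yet absorbed, and the serving layer merges both at query time. The page-view counting workload is a made-up example.

```python
from collections import Counter

master_dataset = [("page_view", "home"), ("page_view", "cart"),
                  ("page_view", "home")]
recent_events = [("page_view", "home")]   # arrived after the last batch run

def batch_layer(events):
    """Recomputed from scratch over all historical data (high latency,
    high accuracy) -- a Spark batch job in the real architecture."""
    return Counter(page for _, page in events)

def speed_layer(events):
    """Incremental view over events the batch has not absorbed yet --
    a Spark Streaming job in the real architecture."""
    return Counter(page for _, page in events)

def serving_layer(batch_view, realtime_view, key):
    """Answer queries by merging the two views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

batch_view = batch_layer(master_dataset)
realtime_view = speed_layer(recent_events)
print(serving_layer(batch_view, realtime_view, "home"))  # -> 3
```

When the next batch run absorbs `recent_events` into the master dataset, the speed layer's view for those events is discarded, which is how the architecture bounds the complexity of the real-time path.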
Presented by Mark Miller, Software Engineer, Cloudera. As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling, two sides of the same coin? Come learn where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, and search ecosystem evolve? If you are interested in big data, NoSQL, distributed systems, the CAP theorem, and other hype-filled terms, then this talk may be for you.
Robby Grossman, Shareaholic's Tech Lead, spoke at the first Boston Riak Meetup on August 30, 2012. These are his slides.
This document provides an overview of Riak TS, Basho's new purpose-built time series database. It describes Riak TS's key features like high write throughput, efficient range query support, and horizontal scalability. It also outlines Riak TS's data modeling approach of co-locating and partitioning time-series data, its SQL-like query language, and provides examples of its performance and roadmap. Finally, it demonstrates a potential use case application called UNCORKD for tracking wine check-ins and reviews.
ABSTRACT
GPS is one of the technologies used in a huge number of applications today. One such application is tracking your vehicle and keeping regular watch on it. This tracking system can report the location and route travelled by the vehicle, and that information can be observed from any remote location. It also includes a web application that provides the exact location of the target and the exact speed of the vehicle, which is used to generate bills for over-speeding automatically. The system can track the target in any weather conditions. It uses GPS and ZigBee technologies. The hardware part comprises GPS, ZigBee, and an ATmega microcontroller; the software part interfaces all the required modules, and a web application developed on the client side visualizes the IoT data. The main objective is to design a system that can be easily installed and that provides a platform for further enhancement.
KEYWORDS
GPS, ZigBee, Tracking System, IoT
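The automatic over-speed billing the abstract mentions reduces to deriving speed from successive GPS fixes and emitting a bill when a segment exceeds the limit. A hypothetical sketch; the speed limit, fine amount, and data shapes are made-up parameters, not the paper's actual design.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def speeding_fines(fixes, limit_kmh=80, fine=500):
    """fixes: list of (timestamp_s, lat, lon) in time order.
    Returns one fine per over-limit segment, as the web application
    might bill it."""
    fines = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        hours = (t1 - t0) / 3600
        speed = haversine_km(la0, lo0, la1, lo1) / hours
        if speed > limit_kmh:
            fines.append({"at": t1, "speed_kmh": round(speed, 1),
                          "amount": fine})
    return fines

track = [(0, 12.9716, 77.5946), (60, 12.9850, 77.5946)]  # ~1.5 km in 60 s
print(speeding_fines(track))
```

In the described system the fixes would arrive over ZigBee from the GPS/ATmega hardware, and this check would run server-side behind the web application.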
Wait! Back away from the Cassandra secondary index. It's ok for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration, you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
Time series data is proliferating with literally every step that we take; just think about things like Fitbit bracelets that track your every move, or financial trading data, all of which is timestamped. Time series data requires high performance reads and writes even with a huge number of data sources. Both speed and scale are integral to success, which makes for a unique challenge for your database. A time series NoSQL data model requires the flexibility to support unstructured and semi-structured data, as well as the ability to write range queries to analyze your time series data. So how can you tackle speed, scale and flexibility all at once? Join Professional Services Architect Drew Kerrigan and Developer Advocate Matt Brender for a discussion of:
- Examples of time series data sets, from IoT to finance to jet engines
- What makes time series queries different from other database queries
- How to model your dataset to answer the right questions about your data
- How to store, query and analyze a set of time series data points
Learn how a NoSQL database model and Riak TS can help you address the unique challenges of time series data.
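The modeling idea behind range queries in a store like Riak TS can be sketched as keying each point by (source, time bucket), so points from one source land near each other and a range query only touches the buckets it needs. The 15-minute bucket size below is an assumed parameter, and the dict-backed table is a stand-in for the database.

```python
from collections import defaultdict

BUCKET_S = 900  # 15-minute quanta; co-locates nearby points

table = defaultdict(list)  # (source, bucket_start) -> [(ts, value), ...]

def write(source, ts, value):
    """Route each point to its source's time bucket."""
    table[(source, ts - ts % BUCKET_S)].append((ts, value))

def range_query(source, start, end):
    """Scan only the buckets overlapping [start, end), then filter --
    the work is proportional to the range, not the whole dataset."""
    out = []
    for bucket in range(start - start % BUCKET_S, end, BUCKET_S):
        out.extend(v for t, v in table[(source, bucket)] if start <= t < end)
    return out

for ts in range(0, 3600, 600):              # one reading every 10 minutes
    write("engine-7", ts, 20.0 + ts / 1000)
print(range_query("engine-7", 0, 1800))     # first half hour of readings
```

The same quantization is what lets a distributed store place each bucket on one partition, so a time-bounded query fans out to few nodes.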
Cassandra is a distributed database that can be used with Solr for distributed search capabilities. Data is written to Cassandra and indexed by Solr to enable fast and scalable full-text search across nodes. Queries can be performed directly on Cassandra or through the Solr API, with tradeoffs in performance. Production deployments typically use a mix of Cassandra and Solr nodes for analytics and search workloads.
The document discusses various use cases for MapR's Hadoop distribution including restaurant recommendations, fraud modeling, network security, and log analysis. It highlights how MapR allows easy data access and deployment across these applications using techniques like NFS, mirrors, and avoiding special data movement mechanisms. The document also provides technical details on how specific solutions like recommendation modeling, fraud detection, and log analysis can leverage MapR.
With AWS you can choose the right database for the right job. Given the myriad of choices, from relational databases to non-relational stores, this session profiles details and examples of some of the options available to you (MySQL, RDS, ElastiCache, Redis, Cassandra, MongoDB, and DynamoDB), with details on real world deployments from customers using Amazon RDS, ElastiCache, and DynamoDB.
Global Big Data Conference, Sept 2014: AWS Kinesis, Spark Streaming, Approximations, and the Lambda Architecture.
WebHack#43 Challenges of Global Infrastructure at Rakuten https://webhack.connpass.com/event/208888/