Using probabilistic data structures in sessions to power personalization and customization in real time. Examples in Redis and Node.js.
Demo code at: https://github.com/stockholmux/qcon-redis-session-store-demo
Presented at QCon SF 2017.
3. A session store is…
• A chunk of data that is connected to one “user” of a service
–”user” can be a simple visitor
–or a proper user with an account
• Often persisted between client and server by a token in a cookie*
–Cookie is given by the server, stored by the browser
–Client sends that cookie back to the server on subsequent requests
–Server associates that token with data
• Often the most frequently used data by that user
–Data that is specific to the user
–Data that is required for rendering or common use
• Often ephemeral and duplicated
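A minimal sketch of the token-in-a-cookie flow the slide describes, using Express and the node-redis v4 client (the cookie name "sid", the key prefix "session:", and the route are illustrative, not from the deck):

```js
// Sketch only: session token travels in a cookie; session data lives in Redis.
const express = require('express');
const crypto = require('crypto');
const { createClient } = require('redis');

const app = express();
const redis = createClient();

app.use((req, res, next) => {
  // The client sends the cookie back on subsequent requests...
  const match = (req.headers.cookie || '').match(/(?:^|;\s*)sid=([^;]+)/);
  if (match) {
    req.sessionId = match[1];
  } else {
    // ...or, on the first visit, the server issues the token via Set-Cookie.
    req.sessionId = crypto.randomBytes(16).toString('hex');
    res.setHeader('Set-Cookie', `sid=${req.sessionId}; HttpOnly; Path=/`);
  }
  // The server associates that token with the session data in Redis.
  req.sessionKey = `session:${req.sessionId}`;
  next();
});

app.get('/', async (req, res) => {
  const visits = await redis.hIncrBy(req.sessionKey, 'visits', 1);
  res.send(`Visit number ${visits}`);
});

redis.connect().then(() => app.listen(3000));
```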
4. Session Storage Use Cases
Traditional
• Username
• Preferences
• Name
• “Stateful” data
Intelligent
• Traditional +
• Notifications
• Past behaviour
–content surfacing
–analytical information
–personalization
14. Who We Are
• Open source. The leading in-memory database platform, supporting any high-performance operational, analytics or hybrid use case.
• The open source home and commercial provider of Redis Enterprise technology, platform, products & services.
15. Redis Top Differentiators
1. Performance – NoSQL Benchmark
2. Simplicity – Redis Data Structures: Strings, Lists, Hashes, Sets, Sorted Sets, Bitmaps, Bit field, Streams, Hyperloglog, Geospatial Indexes
3. Extensibility – Redis Modules
16. Performance: The Most Powerful Database
• Highest throughput at lowest latency in a high-volume-of-writes scenario
• Least servers needed to deliver 1 million writes/sec
(Chart: servers used to achieve 1M writes/sec. Benchmarks performed by Avalon Consulting Group; published in the Google blog.)
17. Simplicity: Data Structures – Redis’ Building Blocks
• Strings: "I'm a Plain Text String!"
• Lists: [ A → B → C → D → E ]
• Hashes: { A: “foo”, B: “bar”, C: “baz” }
• Sets: { A, B, C, D, E }
• Sorted Sets: { A: 0.1, B: 0.3, C: 100 }
• Bitmaps: 0011010101100111001010
• Bit field: {23334}{112345569}{766538}
• Streams: {id1=time1.seq1(A:“xyz”, B:“cdf”), id2=time2.seq2(D:“abc”)}
• Hyperloglog: 00110101 11001110
• Geospatial Indexes: { A: (51.5, 0.12), B: (32.1, 34.7) }
Example: ”Retrieve the e-mail address of the user with the highest bid in an auction that started on July 24th at 11:00pm PST” → ZREVRANGE 07242015_2300 0 0
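A hedged Node.js rendering of that auction example: the sorted-set key is taken from the slide, while the bid data and client usage (node-redis v4) are illustrative. Scores are bid amounts; members are bidder e-mail addresses.

```js
// Sketch: find the highest bidder via ZREVRANGE, per the slide's example.
const { createClient } = require('redis');

async function main() {
  const redis = createClient();
  await redis.connect();

  // Score each bid by amount; the member is the bidder's e-mail address.
  await redis.zAdd('07242015_2300', [
    { score: 150, value: 'alice@example.com' },
    { score: 225, value: 'bob@example.com' },
  ]);

  // Highest bid = first element in reverse score order.
  const [topBidder] = await redis.sendCommand(
    ['ZREVRANGE', '07242015_2300', '0', '0']
  );
  console.log(topBidder); // -> 'bob@example.com'

  await redis.quit();
}
main();
```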
18. Extensibility: Modules Extend Redis Infinitely
• Add-ons that use a Redis API to seamlessly support additional use cases and data structures.
• Enjoy Redis’ simplicity, super high performance, infinite scalability and high availability.
• Any C/C++/Go program can become a Module and run on Redis.
• Leverage existing data structures or introduce new ones.
• Can be used by anyone; Redis Enterprise Modules are tested and certified by Redis Labs.
• Turn Redis into a multi-model database.
19. Redis Labs Products
DBaaS:
• Redisᵉ Cloud – Fully managed, serverless Redisᵉ service on hosted resources within AWS, MS Azure, GCP, IBM Softlayer, Heroku, CF and OpenShift
• Redisᵉ Cloud Private – Fully managed, serverless scaling Redisᵉ service in VPCs within AWS, MS Azure, GCP and IBM Softlayer
Software:
• Redisᵉ Pack – Downloadable Redisᵉ software for any enterprise datacenter or cloud environment
• Redisᵉ Pack Managed – Fully managed Redisᵉ Pack in private datacenters
21. Concept: Bloom Filters (presence)
• Probabilistic data structure
• Hash -> sample bits -> set bits
• Properties:
–False negatives – not possible
–False positives – possible, but controllable
–Bits per item stored
–Add, or check if exists
–Like the Tardis, it’s bigger on the inside than outside
• Availability:
–Redis Module
–On top of bitfields
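A minimal sketch of the add/check cycle just described, assuming the RedisBloom module is loaded; node-redis v4's generic sendCommand keeps the BF.* commands visible, and the key name and error-rate parameters are illustrative.

```js
// Sketch: Bloom filter add/exists via the RedisBloom module.
const { createClient } = require('redis');

async function main() {
  const redis = createClient();
  await redis.connect();

  // Reserve a filter: ~1% false-positive rate, capacity 10,000 items.
  // No false negatives are possible; false positives are controllable.
  await redis
    .sendCommand(['BF.RESERVE', 'seen:user42', '0.01', '10000'])
    .catch(() => {}); // ignore the error if the filter already exists

  await redis.sendCommand(['BF.ADD', 'seen:user42', 'article:1001']);

  const exists = await redis.sendCommand(
    ['BF.EXISTS', 'seen:user42', 'article:1001']
  );
  console.log(exists); // 1 -> probably seen; 0 -> definitely never seen

  await redis.quit();
}
main();
```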
22. Concept: HyperLogLog (cardinality)
• Probabilistic data structure
• Hash -> count runs -> store runs
• Properties:
–Estimates unique items
–Bits per item stored – 2⁶⁴ unique items in 12 KB / error rate 0.81%
–Add, count or merge!
–Like the Tardis, it’s bigger on the inside than outside
• Availability:
–All versions of Redis
engineering.conversantmedia.com
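Since HyperLogLog ships with every version of Redis, the add/count/merge operations from the slide map directly onto PFADD, PFCOUNT and PFMERGE. A hedged sketch with node-redis v4; all key names are illustrative.

```js
// Sketch: counting unique visitors with HyperLogLog (built into Redis).
const { createClient } = require('redis');

async function main() {
  const redis = createClient();
  await redis.connect();

  // Add: every view lands in the day's HLL (~12 KB regardless of volume).
  await redis.pfAdd('views:2017-11-15', ['user42', 'user99', 'user42']);

  // Count: estimated unique visitors, standard error ~0.81%.
  console.log(await redis.pfCount('views:2017-11-15')); // -> 2

  // Merge: roll daily estimates into a weekly one.
  await redis.pfMerge('views:week46', ['views:2017-11-15', 'views:2017-11-16']);

  await redis.quit();
}
main();
```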
23. Concept: Bit counting (time series)
• It’s just bits!
• Fixed starting point; each bit represents a moment in time; flip it to represent activity
• Properties:
–Size relative to the length of time (rounded to the byte)
–Count totals or ranges
–BITOP (AND/XOR/OR/NOT)
• Availability:
–All versions of Redis
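A hedged sketch of that idea: one bitmap per user, one bit per day. The key names and the day-offset scheme are illustrative, not from the deck (node-redis v4).

```js
// Sketch: per-user activity as a bitmap, one bit per day.
const { createClient } = require('redis');

async function main() {
  const redis = createClient();
  await redis.connect();

  // Flip today's bit when the user is active (offset = days since epoch).
  const day = Math.floor(Date.now() / 86400000);
  await redis.setBit('activity:user42', day % 365, 1);

  // Count totals: how many active days are recorded in the bitmap.
  console.log(await redis.bitCount('activity:user42'));

  // BITOP: e.g. AND two bitmaps to find days when both users were active.
  await redis.bitOp('AND', 'activity:both', [
    'activity:user42',
    'activity:user99',
  ]);

  await redis.quit();
}
main();
```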
25. Traditional Group Notification Pattern
Process
• A group of users gets the notification “Sale on sweaters”
• Insert into a central table of notifications
• Insert a row for each user in the group with the notification and a seen flag
• Each time it is needed, query the notifications table where the seen flag is false
26. Traditional Group Notification Pattern
Challenges
• Adding/removing means touching a row for each user in the group
–Fine for groups of 10 users, but what about 1 million?
–Also multi-step
• Storage is proportional to the size of the group and the number of notifications
• Constant DB hits, not easily cacheable
• Setting “read” is a DB write
27. Modern & Intelligent Group Notification Pattern
Process
• Add the notification to a single group-based structure or table (easily cacheable)
• The first n notifications are read by all users in the group
• The notifications are checked against a session-based Bloom filter
• Mark as read by adding to the Bloom filter in the session store (sketched below)
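A hedged sketch of that pattern: the group's notifications live in one Redis list, and "read" state is a per-session Bloom filter. Assumes RedisBloom and node-redis v4; all key names and the helper functions are illustrative.

```js
// Sketch: unread notifications = group list minus the session's Bloom filter.
async function unreadNotifications(redis, sessionId, group, n = 10) {
  // The first n notifications are shared by every user in the group.
  const ids = await redis.lRange(`notifications:${group}`, 0, n - 1);
  if (ids.length === 0) return [];

  // One round trip: which of these are already in the session's filter?
  const seen = await redis.sendCommand([
    'BF.MEXISTS',
    `read:${sessionId}`,
    ...ids,
  ]);
  return ids.filter((_, i) => seen[i] === 0); // unread = not in the filter
}

async function markRead(redis, sessionId, notificationId) {
  // Marking read is a single, constant-time filter insertion.
  await redis.sendCommand(['BF.ADD', `read:${sessionId}`, notificationId]);
}
```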
28. Modern & Intelligent Group Notification Pattern
Advantages
• Adding a notification only writes to a single table, single row
• The model fits the use case – unread is assumed
• Fast: checking for read / writing read is unrelated to the number of items in the filter. Consistent.
• ~5 bits per item, and the Bloom filter doesn’t always grow
• Gentle scaling
31. Traditional Content Surfacing Pattern (Basic)
Process
• Hand-pick and rotate a small number of content items
• Stored in a DB table
• Served out dumbly to users
Challenges
• May serve content multiple times
• Freshness is linked to a manual curatorial process
32. Traditional Content Surfacing Pattern (Advanced)
Process
• A batch process builds the content list to surface for each user
• The list is stored in a DB table
• Served out to the user
• Rotated on a schedule
Challenges
• Not real-time
• May serve content multiple times
• Un-cacheable DB content
• Hard to scale
33. Modern & Intelligent Content Surfacing Pattern
Process
• Middleware adds each content read to a Bloom filter stored in the session (sketched below)
• The featured content list is built; it can be extensive
• Featured items are checked against the Bloom filter on the fly
Advantages
• No DB hits for the user
• Featured content is cacheable
• Will not show content multiple times once it has been read
• Tiny storage requirements, even at scale
• Freshness can be achieved with zero/low human input
• Real-time recording of activity – immediate impact on fresh content
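A minimal sketch of the two halves of this pattern, under the same assumptions as before (RedisBloom, node-redis v4, illustrative key names): record each view, then screen the featured list against the session's filter.

```js
// Sketch: middleware records views; featured items are filtered on the fly.
async function recordView(redis, sessionId, contentId) {
  await redis.sendCommand(['BF.ADD', `viewed:${sessionId}`, contentId]);
}

async function freshFeatured(redis, sessionId, featuredIds) {
  // The featured list itself is global and cacheable; only this check is
  // per-user, and it never touches the operational database.
  const seen = await redis.sendCommand([
    'BF.MEXISTS',
    `viewed:${sessionId}`,
    ...featuredIds,
  ]);
  return featuredIds.filter((_, i) => seen[i] === 0);
}
```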
36. Activity Pattern Monitoring & Personalization?
• Monitor the usage behaviour
–Content viewed
–Activity over time
–Combinations of content history and activity
• Personalize the content based on the behaviour
• Seen as difficult to accomplish
–Analytics data
• Stored in another service
• Anonymized
–Complicated graph- or ML-based solutions
• Inferences
• Black boxes
37. Activity Pattern Monitoring & Personalization
• Record site activity with bit counting
• Record unique page views in a HyperLogLog
• Leverage the page-visit Bloom filter
• Keep a simpler counter for pages consumed
• Create criteria based on session-stored analytics (see the sketch below)
–New to a page? Bloom filter
–New to the site? Unique page views = 1 (HLL) && previously visited = false (Bloom)
–Inactive user? Sum the bit count over the last five records; if = 0, then inactive
–Been to a cluster of pages (infer interest)? Check the cluster of pages against the Bloom filter – a combo!
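A hedged sketch of two of those criteria, combining the structures from the earlier slides (RedisBloom, node-redis v4); the key names, the site-wide "visitors" filter, and the day-offset scheme are illustrative.

```js
// Sketch: personalization criteria built from session-stored analytics.
async function isNewToSite(redis, sessionId) {
  // New to the site? Unique page views = 1 (HLL) AND not previously
  // visited (Bloom filter).
  const uniquePages = await redis.pfCount(`pages:${sessionId}`);
  const seenBefore = await redis.sendCommand([
    'BF.EXISTS',
    'visitors',
    sessionId,
  ]);
  return uniquePages <= 1 && seenBefore === 0;
}

async function isInactive(redis, userId, days = 5) {
  // Inactive? Sum the activity bits over the last `days` records;
  // a total of 0 means inactive.
  const today = Math.floor(Date.now() / 86400000) % 365;
  let active = 0;
  for (let d = Math.max(0, today - days + 1); d <= today; d++) {
    active += await redis.getBit(`activity:${userId}`, d);
  }
  return active === 0;
}
```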
38. Activity Pattern Monitoring & Personalization
• Why is this suddenly possible?
–Probabilistic data structures are small/fast
–Bit counting is small/fast
–Decoupled from the operational database
• What about privacy?
–A legitimate concern
–Non-reversible probabilistic structures
–Siloed from the rest of the database