Twitter processes over 500 million tweets per day and more than 2 billion search queries per day. The company uses a search architecture based on Lucene with custom extensions. This includes an in-memory real-time index optimized for concurrency without locks, and a schema-based document factory. Future work includes support for parallel index segments and additional Lucene features.
Dictionary Based Annotation at Scale with Spark by Sujit Pal (Spark Summit)
This document summarizes a presentation about annotating millions of documents at scale using dictionary-based annotation with Apache Spark, Apache Solr, and Apache OpenNLP. The key points discussed include:
- The problem of annotating millions of documents from science corpora and the need to do it efficiently without model training.
- The architecture of SoDA (Dictionary Based Named Entity Annotator), which uses Apache Solr, SolrTextTagger, and OpenNLP for annotation and can be run on Spark for scaling.
- Performance optimizations made including combining paragraphs, tuning Solr garbage collection, using a larger Spark cluster, and scaling out Solr. These helped achieve over 25 documents per second annotation throughput.
Real Time search using Spark and Elasticsearch (Sigmoid)
This document discusses using Spark Streaming and Elasticsearch to enable real-time search and analysis of streaming data. Spark Streaming processes and enriches streaming data and stores it in Elasticsearch for low-latency search and alerts. The elasticsearch-hadoop connector allows Spark jobs to read from and write to Elasticsearch, integrating the batch processing of Spark with the real-time search of Elasticsearch.
Grant Ingersoll presented on using Apache Solr and Apache Spark for data engineering. He discussed how Solr can be used for indexing and searching large amounts of data, while Spark enables large-scale processing on the indexed data. Lucidworks' Fusion product combines Solr and Spark capabilities to allow search-driven applications and machine learning on indexed content.
The document discusses Solr 4, an open source search platform built on Apache Lucene. Some key points:
- Solr 4 is a NoSQL search server that provides distributed indexing, fault tolerance, and real-time search capabilities.
- Solr Cloud is Solr's distributed architecture which uses Zookeeper for coordination to provide features like automatic sharding and replication of indexes across multiple servers.
- The document outlines Solr 4's capabilities including schema-less options, atomic updates, optimistic concurrency, and a REST API for managing the schema dynamically.
SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan ... (Lucidworks)
Running SolrCloud in the public cloud is the future. This presentation, and the code that will be contributed back to the community, will make such clusters highly efficient, scalable, and elastic. Attendees will come away understanding the challenges and potential of sharing index data between servers.
Speakers: Ilan Ginzburg & Yonik Seeley, Salesforce
Apache Solr is a powerful search and analytics engine offering full-text search, faceting, joins, and sorting, and it can handle large amounts of data across a large number of servers. With all that power and scalability, however, comes complexity. Solr 6 introduces a Parallel SQL feature that provides a simplified, well-known interface to your data in Solr, performs key operations such as sorts and shuffles inside Solr for massive speedups, applies best-practice query optimization, and, by leveraging the scalability of SolrCloud and a clever implementation, lets you throw massive amounts of compute at analytical queries.
In this talk, we will explore the why, what and how of Parallel SQL and its building block Streaming Expressions in Solr 6 with a hint of the exciting new developments around this feature.
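To make the idea concrete, here is a minimal sketch of what a Parallel SQL call looks like. The collection and field names are illustrative, not from the talk; with a running SolrCloud you would POST these parameters to the collection's `/sql` handler.

```python
# Build the request parameters for Solr's /sql endpoint (not sent here).
# "products", "category", and "in_stock" are made-up example names.
stmt = (
    "SELECT category, count(*) AS cnt "
    "FROM products "
    "WHERE in_stock = 'true' "
    "GROUP BY category "
    "ORDER BY cnt DESC "
    "LIMIT 10"
)
params = {
    "stmt": stmt,
    # map_reduce shuffles tuples across workers; "facet" pushes the
    # aggregation into Solr's faceting engine for low-cardinality fields.
    "aggregationMode": "map_reduce",
}
```

The `aggregationMode` choice is exactly the kind of decision the optimizer discussed in the talk can make for you.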
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
This talk describes how you can practically apply some of Lucene 4's new features (such as flexible indexing, scoring improvements, column-stride fields) to improve your search application.
The talk gives a brief description of these new features and example use cases you can try yourself with Lucene 4. We'll cover how you can configure Solr to:
- Set up the schema to use the Pulsing or Memory codec for a primary-key field
- Skip a separate spellcheck index, controlling character-level swaps from the query processor
- Sort with a different locale
- Use per-field similarity configurations, such as a non-vector-space algorithm
Apache Solr/Lucene Internals by Anatoliy Sokolenko (Provectus)
This document provides an overview of Apache Lucene and Solr. It discusses Lucene's data model, index structure, basic indexing and search flows. It also summarizes how Solr builds on Lucene to provide enterprise-level search capabilities with features like sharding, replication, and faceting. The document also covers text analysis in Lucene, spell checking, and references for further reading.
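As a refresher on the data model such internals talks cover: the core structure is the inverted index, which maps each term to the set of documents containing it. The following pure-Python sketch is illustrative only, not Lucene's actual implementation (real analyzers also stem, lowercase per locale, remove stop words, and store positions):

```python
from collections import defaultdict

def tokenize(text):
    # Stand-in for a Lucene analyzer: just lowercase and split on whitespace.
    return text.lower().split()

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: intersect the postings of every query term.
        terms = tokenize(query)
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

idx = InvertedIndex()
idx.add(1, "Lucene is a search library")
idx.add(2, "Solr builds on Lucene")
print(idx.search("lucene search"))  # → {1}
```

Sharding, as Solr does it, is then just keeping one such index per shard and merging result sets at query time.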
This document provides an overview of a workshop on Lucene performance given by Lucid Imagination, Inc. It discusses common Lucene performance issues, introduces Lucid Gaze for Lucene (LG4L) as a tool for monitoring Lucene performance statistics and examples of using it to analyze indexing and search performance. LG4L provides statistics on indexing, analysis, searching and storage through logs, a persistent database and an API. It can help identify causes of poor performance and was shown to have low overhead.
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ... (Lucidworks)
1) The document discusses using black box optimization algorithms to automate the tuning of a search engine's configuration parameters to improve search relevancy.
2) It describes using a test collection of queries and relevance judgments, or search logs, to evaluate how changes to parameters impact relevancy metrics. An optimization algorithm would intelligently search the parameter space.
3) Care must be taken to validate any improved parameters on a separate test set to avoid overfitting and ensure gains generalize to new data. The approach holds promise for automating what can otherwise be a slow manual tuning process.
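The loop described above can be sketched with simple random search, one of the black-box methods the talk alludes to. Everything here is illustrative: the parameter names and the quadratic stand-in for a relevancy metric are made up; in practice `evaluate` would compute something like NDCG over a judged query set.

```python
import random

def random_search(evaluate, param_space, n_trials=500, seed=42):
    """Black-box tuning: sample parameter sets, keep the best by metric."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi)
                  for name, (lo, hi) in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy metric standing in for NDCG; peaks at title_boost=2.0, body_boost=0.5.
def train_metric(p):
    return -(p["title_boost"] - 2.0) ** 2 - (p["body_boost"] - 0.5) ** 2

space = {"title_boost": (0.0, 5.0), "body_boost": (0.0, 2.0)}
best, score = random_search(train_metric, space)
```

Per the talk's warning, `best` should then be re-scored on a held-out query set before being trusted, since the search itself can overfit the training judgments.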
LuceneRDD for (Geospatial) Search and Entity Linkage (zouzias)
In this talk, I will present the design and implementation of LuceneRDD for Apache Spark. LuceneRDD instantiates an inverted index on each Spark executor and collects / aggregates search results from Spark executors to the Spark driver. The main motivation behind LuceneRDD is to natively extend Spark's capabilities with full-text search, geospatial search and entity linkage without requiring an external dependency of a SolrCloud or Elasticsearch cluster.
As a case study, we will show how LuceneRDD can tackle the entity linkage problem. We will demonstrate both the flexibility and efficiency of LuceneRDD for this problem. First, we will show that LuceneRDD's interface provides users a highly flexible approach to entity linkage. This flexibility is due to Lucene's powerful query language, which can combine multiple full-text queries such as term, prefix, fuzzy and phrase queries. Second, we will focus on the efficiency and scalability of LuceneRDD by linking records between two relatively large datasets.
Lastly and time permitting, I will present ShapeLuceneRDD which enhances LuceneRDD with geospatial queries.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile... (Lucidworks)
This document provides an agenda and overview for a presentation on H-Hypermap, a project to build a search platform called the Billion Object Platform (BOP) to index and search over billions of geo-tagged tweets in near real-time. The presentation will cover the architecture using Apache Kafka, Solr sharding, and techniques for fast geo-spatial queries and heatmaps. It will also discuss experiences using technologies like Kotlin, Dropwizard, Docker and Kontena.
ElasticSearch in Production: lessons learned (BeyondTrees)
ElasticSearch is an open source search and analytics engine that allows for scalable full-text search, structured search, and analytics on textual data. The author discusses her experience using ElasticSearch at Udini to power search capabilities across millions of articles. She shares several lessons learned around indexing, querying, testing, and architecture considerations when using ElasticSearch at scale in production environments.
Faceted search is a powerful technique to let users easily navigate the search results. It can also be used to develop rich user interfaces, which give an analyst quick insights about the documents space. In this session I will introduce the Facets module, how to use it, under-the-hood details as well as optimizations and best practices. I will also describe advanced faceted search capabilities with Lucene Facets.
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick... (Spark Summit)
Feature hashing is a powerful technique for handling high-dimensional features in machine learning. It is fast, simple, memory-efficient, and well suited to online learning scenarios. While an approximation, it has surprisingly low accuracy tradeoffs in many machine learning problems.
Feature hashing has been made somewhat popular by libraries such as Vowpal Wabbit and scikit-learn. In Spark MLlib it is mostly used for text features; however, its use cases extend more broadly. Many Spark users are not familiar with the ways in which feature hashing might be applied to their problems.
In this talk, I will cover the basics of feature hashing, and how to use it for all feature types in machine learning. I will also introduce a more flexible and powerful feature hashing transformer for use within Spark ML pipelines. Finally, I will explore the performance and scalability tradeoffs of feature hashing on various datasets.
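The core trick is small enough to sketch in a few lines: hash each feature name into a fixed number of buckets instead of keeping a dictionary, and use one hash bit as a sign to reduce collision bias. This is an illustrative stand-in, not MLlib's implementation (which uses MurmurHash3); CRC32 is used here only because it is stable across runs.

```python
import zlib

def hashing_vectorizer(features, n_buckets=16):
    """The hashing trick: map arbitrary feature names into a fixed-size
    vector with no stored vocabulary. Memory is O(n_buckets) regardless
    of how many distinct feature names the stream contains."""
    vec = [0.0] * n_buckets
    for name, value in features.items():
        h = zlib.crc32(name.encode("utf-8"))      # stable hash of the name
        idx = h % n_buckets                        # bucket index
        sign = 1.0 if (h >> 16) & 1 == 0 else -1.0  # one hash bit picks the sign
        vec[idx] += sign * value
    return vec

v = hashing_vectorizer({"word=spark": 1.0, "word=solr": 2.0})
```

Because no dictionary is kept, previously unseen features at prediction time hash into the same fixed-size space, which is what makes the technique suit online learning.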
Integrating Spark and Solr - Timothy Potter, Lucidworks (Spark Summit)
This document discusses integrating Solr and Spark. It provides an example of using Solr as a sink for streaming data from Spark Streaming. It also describes reading data from Solr into Spark using SolrRDD and exposing it as a Spark SQL DataFrame. Additional capabilities covered include querying Solr from the Spark shell, document matching using stored queries, and reading term vectors from Solr for machine learning with MLLib.
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Timothy Potter presented at a Big Data conference in Boston from October 11-14, 2016. He discussed how Lucidworks Fusion provides an alternative to traditional big data stacks that emphasizes fast access, agility and automation over integration. Fusion allows for common access patterns like fast lookups, ranked retrieval and distributed scans while integrating technologies like Solr, Spark, HDFS and more. It provides tools for data ingestion, time-based partitioning, analytics, machine learning and more to solve business problems rather than focus on infrastructure.
Introduction to Lucene & Solr and Usecases (Rahul Jain)
Rahul Jain gave a presentation on Lucene and Solr. He began with an overview of information retrieval and the inverted index. He then discussed Lucene, describing it as an open source information retrieval library for indexing and searching. He discussed Solr, describing it as an enterprise search platform built on Lucene that provides distributed indexing, replication, and load balancing. He provided examples of how Solr is used for search, analytics, auto-suggest, and more by companies like eBay, Netflix, and Twitter.
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
At Twitter we serve more than 1.5 billion queries per day from Lucene indexes, while appending more than 200 million tweets per day in realtime. Additionally we recently launched image, video and relevance search on the same engine.
This talk will explain the changes we made to Lucene to support this high load and the changes and improvements we made in the last year.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P. (Lucidworks)
This document describes Bloomberg's development of a search analytics component for Solr. It was created by their search team to enable complex calculations and aggregations on numerical time-series data. Key features include statistical and mathematical expressions to facet and analyze data, supporting int, long, float, date and string fields. Examples show calculating a weighted average and variance. Future plans include multi-shard support and filtering result sets based on calculated statistics.
Solr 3.1 includes many new features and improvements such as range faceting on numeric fields, geospatial search enhancements, JSON document indexing, autosuggest and spellcheck components, analysis filter improvements, and distributed support for additional components. Major components include Apache Lucene 3.1.0, Apache Tika 0.8, Carrot2 3.4.2, Velocity 1.6.1 and Velocity Tools 2.0-beta3, and Apache UIMA 2.3.1-SNAPSHOT.
This presentation was given at the Lucene/Solr Revolution conference in Washington, DC in 2014 by Oleg Savrasov. It discusses why special faceting is needed for Block Join queries and proposes a Block Join facet component.
This Ain't Your Parent's Search Engine: Presented by Grant Ingersoll, Lucidworks
Search technology has evolved from traditional keyword search to include richer data modeling, dynamic faceting, aggregations, analytics, spatial search, record linkage, alerting, and solutions for top N problems. Lucene and Solr have been updated with reduced memory usage, pluggable formats and similarity, column-oriented storage, time/space integration, and advanced distributed capabilities including joins, grouping, and pivots. Lucidworks Fusion performs real-time decision making and routing to provide search and recommendations based on clicks, tweets, ratings, locations and other data.
Building Smarter Search Applications Using Built-In Knowledge Graphs and Quer... (Lucidworks)
This document discusses improving search precision through better phrase detection, such as recognizing noun phrases using autophrasing. It also describes implementing query autofiltering to map noun and verb phrases in queries to metadata fields, and providing a suggester component that leverages faceted metadata to provide contextual suggestions.
Lucene/Solr Spatial in 2015: Presented by David Smiley (Lucidworks)
The document summarizes new features, approaches, and improvements in Lucene and Solr spatial search capabilities in 2015. Key points include:
- New heatmap and grid faceting features allow spatial density visualization and are available in Lucene and Solr.
- Geo3D support in Lucene provides more accurate geometry representations and calculations on the surface of a sphere or ellipsoid.
- Indexing accuracy was improved through combining recursive prefix trees with serialized geometry storage.
- BKD tree indexes in Lucene provide faster point searching than prefix trees but currently only support filtering.
- The document outlines additional pending work and opportunities for the future.
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Inge... (Lucidworks)
This document provides an overview and agenda for a SearchHub presentation on using Lucidworks Fusion to build a search application called SearchHub. The presentation will cover configuring Fusion and the SearchHub UI, acquiring data from sources like Apache mailing lists and GitHub, deploying SearchHub on AWS, using signals and machine learning in Fusion, and next steps for the project. It includes demos of features like recommendations, tokenization, topic detection, and experiment management in Fusion.
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote (Lucidworks)
Evernote stores over 3 billion notes from over 100 million users worldwide. To improve search performance and allow upgrades to newer Lucene versions, Evernote rearchitected their search system. They separated search code from the data storage, allowed multiple Lucene versions to run concurrently on each machine, and automatically migrated each user's index to the default version without downtime. This reduced disk I/O by 81% and allowed compression techniques to further reduce storage needs by terabytes and input/output by petabytes each week.
This document discusses Elasticsearch and its uses. It outlines 6 common use cases for Elasticsearch: 1) site search, 2) related posts, 3) replacing WP_Query, 4) log analytics with Logstash, 5) content reranking, and 6) breaking the blog boundary. It also provides an overview of what Elasticsearch is, including that it is a search engine, distributed, scalable, and supports analytics and multiple languages.
Evolving Search Relevancy: Presented by James Strassburg, Direct Supply (Lucidworks)
The document discusses using genetic algorithms to optimize search engine relevancy by evolving search parameters. It describes encoding search parameters as candidate solutions, defining a fitness function to evaluate solutions based on metrics like NDCG, and using genetic operators like crossover and mutation to generate new parameter sets and optimize relevancy over time.
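The encoding/fitness/crossover/mutation cycle described above can be sketched in a few dozen lines. This is a generic toy, not Direct Supply's implementation: the two-parameter candidate and the quadratic fitness stand in for a real parameter vector scored by NDCG against relevance judgments.

```python
import random

def evolve(fitness, n_params, pop_size=30, generations=40, seed=7):
    """Minimal genetic algorithm: select the fitter half, then fill the
    population back up with crossed-over, mutated children."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 10) for _ in range(n_params)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]             # selection (elitist)
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_params)
            child = a[:cut] + b[cut:]                 # one-point crossover
            i = rng.randrange(n_params)
            child[i] += rng.gauss(0, 0.5)             # gaussian mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Stand-in fitness; in the talk this would be NDCG over judged queries.
target = [3.0, 7.0]
def fit(p):
    return -sum((x - t) ** 2 for x, t in zip(p, target))

best = evolve(fit, 2)
```

Keeping the parents unchanged each generation (elitism) guarantees the best candidate found so far is never lost, which matters when each fitness evaluation is an expensive batch of search queries.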
MongoDB: Queries and Aggregation Framework with NBA Game Data (Valeri Karpov)
This document provides an overview of querying and aggregating a dataset containing NBA box scores for over 30,000 games since 1985 using MongoDB. It demonstrates various MongoDB query and aggregation operations including findOne(), find(), count(), distinct(), $elemMatch, $sort, $limit, $unwind and aggregation pipelines to analyze and extract insights from the data such as answering questions about individual player and team statistics and performance.
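A typical pipeline from such an analysis chains `$unwind`, `$group`, `$sort`, and `$limit`. The field names below (`players`, `pts`) and the toy data are illustrative, not the talk's exact schema; to keep this runnable without a server, the same logic is evaluated in plain Python next to the pipeline definition.

```python
# Question: which player has the highest average points per game?
# With a live server: db.box_scores.aggregate(pipeline)
pipeline = [
    {"$unwind": "$players"},
    {"$group": {"_id": "$players.name", "avg_pts": {"$avg": "$players.pts"}}},
    {"$sort": {"avg_pts": -1}},
    {"$limit": 1},
]

games = [
    {"players": [{"name": "Jordan", "pts": 40}, {"name": "Pippen", "pts": 20}]},
    {"players": [{"name": "Jordan", "pts": 50}, {"name": "Pippen", "pts": 10}]},
]
rows = [p for g in games for p in g["players"]]            # $unwind
totals = {}
for p in rows:                                             # $group with $avg
    s, n = totals.get(p["name"], (0, 0))
    totals[p["name"]] = (s + p["pts"], n + 1)
ranked = sorted(((name, s / n) for name, (s, n) in totals.items()),
                key=lambda r: r[1], reverse=True)          # $sort
top = ranked[:1]                                           # $limit
print(top)  # → [('Jordan', 45.0)]
```

Each pipeline stage maps to one transformation of the document stream, which is why the pure-Python version lines up stage for stage.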
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Faceted Searching is a must-have feature for enhancing findability and user engagement in enterprise search UIs. The Faceted Searching features of Apache Solr have been a major factor in its popularity, but many Solr users don't fully appreciate all of the capabilities that are available. In this session we will deep dive into the different types of data facets that Solr supports, discussing in detail the various options that can be used to explore them. We will also review some specific techniques for dealing with several complex use cases, and discuss some performance "gotchas" and how to avoid them.
Webinar: Ecommerce, Rules, and Relevance (Lucidworks)
This document provides an agenda and overview for a webinar on using Lucidworks Rules Editor for e-commerce search relevancy. The webinar will introduce Rules Editor, demonstrate how to create different rule types and triggers to boost or block product listings, and discuss how rules are processed through Fusion query pipelines. It will also include a live demonstration of creating rules in the Best Buy catalog and take questions at the end.
Parallel Computing with SolrCloud: Presented by Joel Bernstein, Alfresco (Lucidworks)
This document summarizes Joel Bernstein's presentation on parallel SQL in Solr 6.0. The key points are:
1. SQL provides an optimizer to choose the best query plan for complex queries in Solr, avoiding the need for users to determine optimal faceting APIs or parameters.
2. SQL queries in Solr 6.0 can perform distributed joins, aggregations, sorting, and filtering using Solr search predicates. Aggregations can be performed using either map-reduce or facets.
3. Under the hood, SQL queries are compiled to TupleStreams which are serialized to Streaming Expressions and executed in parallel across worker collections using Solr's streaming API framework.
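For a feel of what those Streaming Expressions look like, here is the rough shape of the expression a grouped-count SQL query might compile to. The collection and field names are made up; the expression is shown as a Python string rather than sent to a cluster.

```python
# A GROUP BY/count(*) over "category" expressed as a streaming expression.
# search() streams sorted tuples from the /export handler; rollup() does a
# streaming aggregation over the sort key, so it needs no in-memory hash.
expr = (
    'rollup('
    'search(products, q="in_stock:true", fl="category", '
    'sort="category asc", qt="/export"), '
    'over="category", count(*))'
)
```

The requirement that the inner stream be sorted on the `over` field is what lets `rollup` aggregate in constant memory, and it is also why the SQL layer pushes sorting into Solr, as the summary notes.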
Autocomplete Multi-Language Search Using Ngram and EDismax Phrase Queries: Pr... (Lucidworks)
Ivan Provalov presented on autocomplete multi-language search using n-grams and EDismax phrase queries at Netflix. Key points included using n-grams and phrase queries to rank shorter documents higher for autocomplete, and addressing language challenges such as different scripts, character composition, and stopword handling. Provalov discussed using a character mapper filter to preprocess input, and developed an open source query testing framework with over 20,000 queries to test language queries and detect regressions.
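The n-gram side of this approach is easy to sketch: at index time each title is expanded into edge n-grams (prefixes), so a partial query like "str" becomes an exact term match. This is an illustrative toy, not Netflix's analyzer chain; real setups generate grams per token and layer phrase queries and language-specific filters on top.

```python
def edge_ngrams(term, min_len=1, max_len=10):
    """Index-time edge n-grams: the prefixes of a term."""
    return [term[:i] for i in range(min_len, min(len(term), max_len) + 1)]

# Build a tiny autocomplete index over whitespace-stripped titles.
index = {}
for title in ["stranger things", "star trek"]:
    for gram in edge_ngrams(title.replace(" ", "")):
        index.setdefault(gram, set()).add(title)

# The raw user input matches a stored gram directly: "str" finds only
# "stranger things", while "sta" finds only "star trek".
print(sorted(index.get("str", set())))
```

Doing the expansion at index time is what keeps query latency low enough for keystroke-by-keystroke suggestions.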
Blocks are a powerful concept, much needed for performance improvements and responsiveness. GCD runs blocks effortlessly, scheduling them on a desired queue with a chosen priority, and more.
Swift is Apple's language for the future, and in this presentation we'll cover a brief history of the Swift language, what advantages Swift has for today's microprocessors, and where it is going in the future.
Storm is a scalable distributed real-time computation system. It provides a simple programming model through topologies containing spouts that emit streams and bolts that process streams. Storm guarantees processing of all messages through anchoring and tracking tuples in distributed worker processes. It offers fault tolerance through mechanisms like acking tuples and replaying failed tasks. Exactly-once processing can be achieved through techniques like transaction IDs.
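The ack/replay guarantee can be sketched with a toy spout: emitted tuples stay pending until a bolt acks them, and failed tuples are re-queued. This is a single-process illustration of the idea, not Storm's actual tuple-tree tracking (which hashes anchored tuple IDs in acker tasks across the cluster).

```python
import uuid

class ReliableSpout:
    """Toy at-least-once delivery: a tuple is pending from emit until ack;
    a fail() puts it back on the queue for replay."""
    def __init__(self, messages):
        self.queue = list(messages)
        self.pending = {}  # tuple id -> message

    def next_tuple(self):
        if not self.queue:
            return None
        msg = self.queue.pop(0)
        tid = str(uuid.uuid4())
        self.pending[tid] = msg
        return tid, msg

    def ack(self, tid):
        self.pending.pop(tid, None)

    def fail(self, tid):
        self.queue.append(self.pending.pop(tid))  # replay later

spout = ReliableSpout(["a", "b"])
results = []
tid, msg = spout.next_tuple()
spout.fail(tid)                      # the bolt reported failure: "a" is re-queued
while (t := spout.next_tuple()) is not None:
    tid, msg = t
    results.append(msg)              # bolt processes the tuple...
    spout.ack(tid)                   # ...then acks it
```

At-least-once means "a" can be processed more than once after a replay, which is exactly why exactly-once semantics need the extra transaction-ID machinery the summary mentions.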
The JVM memory model describes how threads in the Java ecosystem interact through memory. While the memory model's impact on developing for the JVM may not be obvious, it is the cause of a certain number of "anomalies" that are, well, by design.
In this presentation we will explore the aspects of the memory model, including things like reordering of instructions, volatile members, monitors, atomics and JIT.
Using Groovy? Got lots of stuff to do at the same time? Then you need to take a look at GPars (“Jeepers!”), a library providing support for concurrency and parallelism in Groovy. GPars brings powerful concurrency models from other languages to Groovy and makes them easy to use with custom DSLs:
- Actors (Erlang and Scala)
- Dataflow (Io)
- Fork/join (Java)
- Agent (Clojure agents)
In addition to this support, GPars integrates with standard Groovy frameworks like Grails and Griffon.
Background, comparisons to other languages, and motivating examples will be given for the major GPars features.
Igor Fesenko, "Direction of C# as a High-Performance Language" (Fwdays)
There are a lot of upcoming performance changes in .NET, from code generation (JIT, AOT) to optimizations performed by the compiler (inlining, flowgraph and loop analysis, dead code elimination, SIMD, stack allocation, and so on). In this talk we will cover some features of C# 7 that move toward enabling low-level optimization.
I will share not only how we can improve performance with the next version of .NET, but how we can do it today using different techniques and tools like Roslyn analyzers, Channels (Push based Streams), System.Slices, System.Buffers and System.Runtime.CompilerServices.Unsafe.
This document summarizes a lecture on key-value storage systems. It introduces the key-value data model and compares it to relational databases. It then describes Cassandra, a popular open-source key-value store, including how it maps keys to servers, replicates data across multiple servers, and performs reads and writes in a distributed manner while maintaining consistency. The document also discusses Cassandra's use of gossip protocols to manage cluster membership.
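The key-to-server mapping the summary mentions is consistent hashing: servers own positions on a hash ring, and a key is stored on the next N positions clockwise (its replicas). This sketch illustrates the idea in miniature; Cassandra's real partitioners, virtual nodes, and replication strategies are more elaborate, and the node names here are made up.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: each node owns one position; a key's
    replicas are the next `replicas` distinct nodes clockwise."""
    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def nodes_for(self, key):
        start = bisect.bisect(self.ring, (self._hash(key), ""))
        owners = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.replicas:
                break
        return owners

ring = HashRing(["cass-1", "cass-2", "cass-3"])
owners = ring.nodes_for("user:42")
```

Because only the keys adjacent to a node's ring position move when that node joins or leaves, membership changes (propagated by gossip, per the summary) stay cheap.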
The document provides an overview of Ehcache 3, a caching framework that is a major upgrade from Ehcache 2. It highlights Ehcache 3's type-safe APIs, fluent configuration, support for multiple storage tiers including heap, off-heap and disk, and fully compliant support for the JSR-107 caching standard. It also discusses some of Ehcache 3's new features like expiry, eviction advisors, and cache loaders/writers, as well as features that were dropped like search and explicit locking. The presentation aims to explain Ehcache 3's motivation and significant changes from Ehcache 2.
Tomas Doran presented on their implementation of Logstash at TIM Group to process over 55 million messages per day. Their applications are all Java/Scala/Clojure and they developed their own library to send structured log events as JSON to Logstash using ZeroMQ for reliability. They index data in Elasticsearch and use it for metrics, alerts and dashboards but face challenges with data growth.
Scaling ingest pipelines with high performance computing principles - Rajiv K... (SignalFx)
By Rajiv Kurian, software engineer at SignalFx.
At SignalFx, we deal with high-volume high-resolution data from our users. This requires a high performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond performance plateaus encountered with more generic approaches. Some key examples include:
* Write very simple single threaded code, instead of complex algorithms
* Parallelize by running multiple copies of simple single threaded code, instead of using concurrent algorithms
* Separate the data plane from the control plane, instead of slowing data for control
* Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation
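The second principle above can be sketched as sharding the stream by key and giving each shard to its own single-threaded worker with private state, so no locks or concurrent data structures are needed. This is an illustrative toy using threads and queues; a production pipeline would pin separate processes to cores, but the structure is the same.

```python
import queue
import threading
import zlib

N_WORKERS = 4

def worker(inbox, counts):
    # Plain single-threaded code over private state: no locks needed,
    # because no state is ever shared between shards.
    while True:
        item = inbox.get()
        if item is None:          # poison pill: drain and exit
            break
        counts[item] = counts.get(item, 0) + 1

inboxes = [queue.Queue() for _ in range(N_WORKERS)]
states = [{} for _ in range(N_WORKERS)]
threads = [threading.Thread(target=worker, args=(inboxes[i], states[i]))
           for i in range(N_WORKERS)]
for t in threads:
    t.start()

# Shard by key so the same metric name always lands on the same worker.
for metric in ["cpu", "mem", "cpu", "disk", "cpu", "mem"]:
    shard = zlib.crc32(metric.encode()) % N_WORKERS
    inboxes[shard].put(metric)

for q in inboxes:
    q.put(None)
for t in threads:
    t.join()

merged = {k: v for s in states for k, v in s.items()}
```

The poison pill doubles as the control plane here: control messages travel through the same ordered queue as data, so shutdown cannot overtake unprocessed items.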
The document discusses a technique called FRECKLE that provides lock-free access and modification of standard STL containers from multiple threads. FRECKLE uses a single shared shared_ptr to the container that is accessed atomically. Readers can concurrently access the container via atomic_load while writers copy, modify, and replace the container contents using atomic_compare_exchange. This allows concurrent reads and isolated writes without locks in a way that guarantees exception safety.
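The copy-modify-publish pattern translates to other runtimes too. Below is a hedged Python analogue of the idea, not FRECKLE itself: readers take a snapshot reference without locking, while writers copy, modify, and publish a new container. CPython's reference rebinding is atomic, so readers never observe a partial update; a lock serializes only writers, where the C++ version uses `atomic_compare_exchange`.

```python
import threading

class COWList:
    """Copy-on-write list: lock-free snapshot reads, serialized writers."""
    def __init__(self):
        self._data = []
        self._write_lock = threading.Lock()

    def snapshot(self):
        # Readers: no lock. Treat the returned list as immutable.
        return self._data

    def append(self, item):
        # Writers: copy, modify, then publish the new list in one rebind.
        with self._write_lock:
            new = list(self._data)
            new.append(item)
            self._data = new

c = COWList()
snap = c.snapshot()   # old snapshot stays valid and unchanged
c.append(1)
c.append(2)
```

As in the C++ technique, a reader holding `snap` keeps a consistent view even while writers replace the container underneath, which is what gives the exception-safety and isolation guarantees the summary describes.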
LoLA is a tool for verifying properties of Petri nets. This document discusses how to:
1. Choose and manage LoLA configurations to optimally verify properties.
2. Ask the right verification questions in a specific, modular way to efficiently verify properties.
3. Optimize Petri net modeling to take advantage of LoLA's reduction techniques and scale verification.
4. Employ scripts and makefiles to automate calling LoLA and analyzing results.
5. Integrate calling LoLA from other tools using UNIX streams for modular verification.
Kudu is an open source storage layer developed by Cloudera that provides low latency queries on large datasets. It uses a columnar storage format for fast scans and an embedded B-tree index for fast random access. Kudu tables are partitioned into tablets that are distributed and replicated across a cluster. The Raft consensus algorithm ensures consistency during replication. Kudu is suitable for applications requiring real-time analytics on streaming data and time-series queries across large datasets.
Bringing Concurrency to Ruby - RubyConf India 2014 (Charles Nutter)
The document discusses bringing concurrency to Ruby. It begins by defining concurrency and parallelism, noting that both are needed but platforms only enable parallelism if jobs can split into concurrent tasks. It reviews concurrency and parallelism in popular Ruby platforms like MRI, JRuby, and Rubinius. The document outlines four rules for concurrency and discusses techniques like immutable data, locking, atomics, and specialized collections for mutable data. It highlights libraries that provide high-level concurrency abstractions like Celluloid for actors and Sidekiq for background jobs.
.NET UY Meetup 7 - CLR Memory by Fabian Alves (.NET UY Meetup)
The document discusses key concepts related to memory management in the .NET CLR, including the heap and stack, value and reference types, pointers, and how objects are allocated in memory. It explains the garbage collection process, including different flavors, generations of objects, and pinning. Large object heap and finalization are also covered as it relates to unmanaged resources. Overall, the document provides a comprehensive overview of memory management in the .NET CLR.
This document discusses using Ruby for distributed storage systems. It describes components like Bigdam, which is Treasure Data's new data ingestion pipeline. Bigdam uses microservices and a distributed key-value store called Bigdam-pool to buffer data. The document discusses designing and testing Bigdam using mocking, interfaces, and integration tests in Ruby. It also explores porting Bigdam-pool from Java to Ruby and investigating Ruby's suitability for tasks like asynchronous I/O, threading, and serialization/deserialization.
This document discusses run-time addressing and storage of variables in programming. It covers how variables are accessed using offsets from frames or stacks. It also discusses variable-length local data and how it can be allocated dynamically on the stack or heap. The document then covers scope, static and dynamic scoping rules, and how static links are used to access non-local variables at run-time.
Presentation about the Spil Storage Platform (SSP) written in Erlang. This talk was first given at the Erlang User Group Netherlands in July 2012 hosted at Spilgames in Hilversum.
These are the slides from my talk at the 2012 Sphinx Search Day in Santa Clara, California. It provides a high-level picture of where Sphinx is used at craigslist, a bit of history, issues, and future work.
Similar to Search at Twitter: Presented by Michael Busch, Twitter
Search is the Tip of the Spear for Your B2B eCommerce Strategy (Lucidworks)
With ecommerce experiencing explosive growth, it seems intuitive that the B2B segment of that ecosystem is mirroring the same trajectory. That said, B2B has very different needs when it comes to transacting with the same style of experiences that we see in B2C. For instance, B2B ecommerce is about precision findability, whereas B2C customers can convert at higher rates when they’re just browsing online. In order for the B2B buying experience to be successful, search needs to be tuned to meet the unique needs of the segment.
In this webinar with Forrester senior analyst Joe Cicman, you’ll learn:
-Which verticals in B2B will drive the most growth, and how machine-learning powered personalization tactics can be deployed to support those specific verticals
-Why an omnichannel selling approach must be deployed in order to see success in B2B
-How deploying content search capabilities will support a longer sales cycle at scale
-What the next steps are to support a robust B2B commerce strategy supported by new technology
Speakers
Joe Cicman, Senior Analyst, Forrester
Jenny Gomez, VP of Marketing, Lucidworks
Customer loyalty starts with quickly responding to your customer’s needs. When it comes to resolving open support cases, time is of the essence. Time spent searching for answers adds up and creates inefficiencies in resolving cases at scale. Relevant answers need to be a few clicks away and easily accessible for agents directly from their service console.
We will explore how Lucidworks’ Agent Insights application automatically connects agents with the correct answers and resources. You’ll learn how to:
-Configure a proactive widget in an agent’s case view page to access resources across third-party systems (such as Sharepoint, Confluence, JIRA, Zendesk, and ServiceNow).
-Easily set up query pipelines to autonomously route assets and resources that are relevant to the case-at-hand—directly to the right agent.
-Identify subject matter experts within your support data and access tribal knowledge with lightning-fast speed.
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
Lunch and Learn during Retail TouchPoints #RIC21 virtual event.
***
Crate & Barrel’s previous search solution couldn’t provide its shoppers with an online search and browse experience consistent with the customer-centric Crate & Barrel brand. Meanwhile, Crate & Barrel merchandisers spent the bulk of their time manually creating and maintaining search rules. The search experience impacted customer retention, loyalty, and revenue growth.
Join this lunch & learn for an interactive chat on how Crate & Barrel partnered with Lucidworks to:
-Improve search and browse by modernizing the technology stack with ML-based personalization and merchandising solutions
-Enhance the experience for both shoppers and merchandisers
-Explore signals to transform the omnichannel shopping experience
Questions? Visit https://lucidworks.com/contact/
Learn how to guide customers to relevant products using eCommerce search, hyper-personalisation, and recommendations in our ‘Best-In-Class Retail Product Discovery’ webinar.
Nowadays, shoppers want their online experience to be engaging, inspirational and fulfilling. They want to find what they’re looking for quickly and easily. If the sought after item isn’t available, they want the next best product or content surfaced to them. They want a website to understand their goals as though they were talking to a sales assistant in person, in-store.
In this webinar, we explore IMRG industry data insights and a best-in-class example of retail product discovery. You’ll learn:
- How AI can drive increased revenue through hyper-personalised experiences
- How user intent can be easily understood and results displayed immediately
- How merchandisers can be empowered to curate results and product placement – all without having to rely on IT.
Presented by:
Dave Hawkins, Principal Sales Engineer - Lucidworks
Matthew Walsh, Director of Data & Retail - IMRG
Connected Experiences Are Personalized ExperiencesLucidworks
Many companies claim personalization and omnichannel capabilities are top priorities. Few are able to deliver on those experiences.
For a recent Lucidworks-commissioned study, Forrester Consulting surveyed 350+ global business decision-makers to see what gets in the way of achieving these goals. They discovered that inefficient technology, lack of behavioral insights, and failure to tie initiatives to enterprise-wide goals are some of the most frequent blockers to personalization success.
Join guest speaker, Forrester VP and Principal Analyst, Brendan Witcher, and Lucidworks CEO, Will Hayes, to hear the results of the Forrester Consulting study, how to avoid “digital blindness,” and how to apply VoC data in real-time to delight customers with personalized experiences connected across every touchpoint.
In this webinar, you’ll learn:
- Why companies who utilize real-time customer signals report more effective personalization
- How to connect employees and customers in a shared experience through search and browse
- How Lucidworks clients Lenovo, Morgan Stanley and Red Hat fast-tracked improvements in conversion, engagement and customer satisfaction
Featuring
- Will Hayes, CEO, Lucidworks
- Brendan Witcher, VP, Principal Analyst, Forrester
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
Intelligent Policing. Leveraging Data to more effectively Serve Communities.
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
-The technology needs of an intelligent police force.
-How a Global Search improves an officer's interaction with existing data.
Featuring:
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
The technology needs of an intelligent police force.
How a Global Search improves an officer's interaction with existing data.
Featuring
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
Preparing for Peak in Ecommerce | eTail Asia 2020Lucidworks
This document provides a framework for prioritizing onsite search problems and key performance indicators (KPIs) to measure for e-commerce search optimization. It recommends prioritizing fixing searches that yield no results, improving relevance of results, and reducing false positives. The most essential KPIs to measure include query latency, throughput, result relevance through click-through rates and NDCG scores. The document also provides tips for self-benchmarking search performance and examples of search performance benchmarks across nine e-commerce sites from various industries.
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
Wish your conversion rates were higher? Can’t figure out how to efficiently and effectively serve all the visitors on your site? Embarrassed by the quality of your product discovery experience? The bar is high and the influx of online shopping over recent months has reminded us that the opportunities are real. We’re all deep in holiday prep, but let’s take a few minutes to think about January 2021 and beyond. How can we position ourselves for success with our customers and against our competition?
Grab your lunch and let’s dive into three strategies that need to be part of your 2021 roadmap. You don’t need an army to get there. But you do need to take action and capitalize on the shoppers abandoning the product discovery journey on your site.
In this session, attendees will find out how to:
-Take control of merchandising at scale;
-Implement hands-free search relevancy; and
-Address personalization challenges.
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
For a personalized search experience, search curation requires robust text interpretation, data enrichment, relevancy tuning and recommendations. In order to achieve this, language and entity identification are crucial.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion, thanks to the technology partnership between Lucidworks and Basistech.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-Use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
Before COVID-19, almost 80% of the US workforce worked service in jobs that involve in-person interaction with strangers. Now, leaders of service organizations must reshape their offerings during the pandemic and prepare for whatever the new normal turns out to be. Our three panelists will share ideas for adapting their service businesses, now that closer-than-six-feet isn’t an option.
Join Lucidworks as we talk shop with 3 service business leaders, covering:
-Common impacts of the pandemic on service businesses (and what to do about them),
-How service teams can maintain a human touch across virtual channels, and
-Plans for the future, before and after the pandemic subsides.
Featuring
-Sara Nathan, President & CEO, AMIGOS
-Anthony Carruesco, Founder, AC Fly Fishing
-sara bradley, chef and proprietor, freight house
-Justin Sears, VP Product Marketing, Lucidworks
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
The COVID-19 pandemic has forced companies to support far more customers and employees through digital channels than ever before. Many are turning to chatbots to help meet increasing demand, but traditional rules-based approaches can’t keep up. Our new Smart Answers add-on to Lucidworks Fusion makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
Watch our on-demand webinar showcasing Smart Answers on Lucidworks Fusion. This technology makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
In this webinar, we’ll cover off:
-How search and deep learning extend conversational frameworks for improved experiences
-How Smart Answers improves customer care, call deflection, and employee self-service
-A live demo of Smart Answers for multi-channel self-service support
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
In the current climate, it’s now more important than ever to digitally enable your workforce and customers.
Hear from Simon Taylor, VP Global Partners & Alliances, Lucidworks and Matt Aslett, Research Vice President, 451 Research to get the inside scoop on how industry leaders in Europe are developing and executing their digital transformation strategies.
In this webinar, we’ll discuss:
The top challenges and aspirations European business and technology leaders are solving using AI and search technology
Which search and AI use cases are making the biggest impact in industries such as finance, healthcare, retail and energy in Europe
What technology buyers should look for when evaluating AI and search solutions
Webinar: Accelerate Data Science with Fusion 5.1Lucidworks
This document introduces Fusion 5.1 and its new capabilities for integrating with data science tools like Tensorflow, Scikit-Learn, and Spacy.
It provides an overview of Fusion's capabilities for understanding content, users, and delivering insights at scale. The document then demonstrates Fusion's Jupyter Notebook integration for reading and writing data and running SQL queries.
Finally, it shows how Fusion integrates with Seldon Core to easily deploy machine learning models with tools like Tensorflow and Scikit-Learn. A live demo is provided of deploying a custom model and using it in Fusion's query and indexing pipelines.
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
In this webinar with 451 Research, you'll understand how retailers are using AI to predict customer intent and learn which key performance metrics are used by more than 120 online retailers in Lucidworks’ 2019 Retail Benchmark Survey.
In this webinar, you’ll learn:
● What trends and opportunities are facing the ecommerce industry in 2020
● Why search is the universal path to understanding customer intent
● How large online retailers apply AI to maximize the effectiveness of their personalization efforts
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
Nordstrom Rack | Hautelook curates and serves customers a wide selection of on-trend apparel, accessories, and shoes at an everyday savings of up to 75 percent off regular prices. With over a million visitors shopping across different platforms every day, and a realization that customers have become accustomed to robust and personalized search interactions, Nordstrom Rack | Hautelook launched an initiative over a year ago to provide data science-driven digital experiences to their customers.
In this session, we’ll discuss Nordstrom Rack | Hautelook’s journey of operationalizing a hefty strategy, optimizing a fickle infrastructure, and rallying troops around a single vision of building an expansible machine-learning driven product discovery engine.
The audience will learn about:
-The key technical challenges and outcomes that come with onboarding a solution
-The lessons learned of creating and executing operational design
-The use of Lucidworks Fusion to plug custom data science models into search and browse applications to understand user intent and deliver personalized experiences
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
Knowledge graphs and machine learning are on the rise as enterprises hunt for more effective ways to connect the dots between the data and the business world. With newer technologies, the digital workplace can dramatically improve employee engagement, data-driven decisions, and actions that serve tangible business objectives.
In this webinar, you will learn
-- Introduction to knowledge graphs and where they fit in the ML landscape
-- How breakthroughs in search affect your business
-- The key features to consider when choosing a data discovery platform
-- Best practices for adopting AI-powered search, with real-world examples
Webinar: Building a Business Case for Enterprise SearchLucidworks
The document discusses building a business case for enterprise search. It notes that 85% of information is unstructured data locked in various locations and applications. Many knowledge workers spend a significant portion of their day searching across multiple systems for information. The rise of unstructured data and AI capabilities can help organizations unlock value from their information assets. Effective enterprise search powered by AI can provide real-time intelligence, personalized information, and more efficient research to help knowledge workers.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Finetuning GenAI For Hacking and DefendingPriyanka Aash
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
13. Search Architecture
[Diagram: raw tweets from the RT stream pass through an Analyzer/Partitioner and are written as analyzed tweets to the RT index (Earlybird); in parallel, a Mapreduce Analyzer processes the raw-tweet archive on HDFS into the Archive index. The Blender receives search requests and searches both the RT index and the Archive index.]
14. Search Architecture
[Diagram: as in slide 13, but tweets now flow through a queue, and updates (deletes and engagement, e.g. retweets/favs) are also written to the RT index (Earlybird); the Blender searches the RT index and the Archive index, which is built by the Mapreduce Analyzer from HDFS.]
15. Search Architecture
[Diagram: the Blender now also queries the Social graph service and a User search index, in addition to the RT index (Earlybird) and the Archive index.]
• Blender is our Thrift service aggregator
• Queries multiple Earlybirds, merges results
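The Blender's merge step can be sketched as a scatter-gather over several Earlybird partitions. The class and method names below are illustrative stand-ins, not Twitter's actual API:

```java
import java.util.*;
import java.util.stream.*;

// Illustrative scatter-gather merge, loosely modeled on what a Thrift
// aggregator like Blender must do: query several index partitions and
// merge their per-partition results into one ranked list.
public class BlenderSketch {
    // A hit is (tweetId, score); each partition returns hits sorted by score desc.
    public record Hit(long tweetId, double score) {}

    // Merge k partition result lists, keeping the top n hits overall by score.
    public static List<Hit> merge(List<List<Hit>> partitionResults, int n) {
        return partitionResults.stream()
                .flatMap(List::stream)
                .sorted((a, b) -> Double.compare(b.score(), a.score()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Hit> p1 = List.of(new Hit(1, 0.9), new Hit(2, 0.4));
        List<Hit> p2 = List.of(new Hit(3, 0.7), new Hit(4, 0.1));
        // Top 3 across both partitions: ids 1, 3, 2
        System.out.println(merge(List.of(p1, p2), 3));
    }
}
```

In a real deployment the per-partition queries would be issued in parallel over Thrift and the merge would also deduplicate and re-score, but the core of the aggregator is this k-way merge.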
17. Search Architecture
[Diagram: the RT index (Earlybird), the Archive index, and User search each sit on top of Lucene.]
• For historic reasons, these used to be entirely different codebases, but had similar features/technologies
• Over time cross-dependencies were introduced to share code
18. Search Architecture
[Diagram: the RT index (Earlybird), the Archive index, and User search now share a common Lucene Extensions package layered on top of Lucene.]
• New Lucene extension package
• This package is truly generic and has no dependency on an actual product/index
• It contains Twitter’s extensions for real-time search, a thin segment management layer and other features
22. Lucene Extension Library
• Abstraction layer for Lucene index segments
• Real-time writer for in-memory index segments
• Schema-based Lucene document factory
• Real-time faceting
23. Lucene Extension Library
• API layer for Lucene segments
• *IndexSegmentWriter
• *IndexSegmentAtomicReader
• Two implementations
• In-memory: RealtimeIndexSegmentWriter (and reader)
• On-disk: LuceneIndexSegmentWriter (and reader)
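The writer/reader split above can be sketched as a minimal abstraction layer. The interfaces and the in-memory implementation here are hypothetical stand-ins for the *IndexSegmentWriter / *IndexSegmentAtomicReader classes named on the slide, not the real API:

```java
import java.util.*;

// Hypothetical sketch of a segment API with separate writer and
// point-in-time reader abstractions; all names are illustrative only.
interface IndexSegmentWriter {
    void addDocument(String doc);
    IndexSegmentAtomicReader openReader(); // point-in-time view of the segment
}

interface IndexSegmentAtomicReader {
    int maxDoc();                 // number of docs visible to this reader
    String document(int docId);
}

// In-memory implementation (the real-time case); an on-disk Lucene-backed
// implementation would expose the same two interfaces.
class RealtimeSegment implements IndexSegmentWriter {
    private final List<String> docs = new ArrayList<>();

    public void addDocument(String doc) { docs.add(doc); }

    public IndexSegmentAtomicReader openReader() {
        final int snapshot = docs.size(); // freeze the visible doc count
        return new IndexSegmentAtomicReader() {
            public int maxDoc() { return snapshot; }
            public String document(int docId) { return docs.get(docId); }
        };
    }
}
```

The key property is that a reader opened at some point in time keeps seeing exactly the documents that existed then, even as the writer keeps appending.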
24. Lucene Extension Library
• IndexSegments can be built ...
• in realtime
• on Mesos or Hadoop (Mapreduce)
• locally on serving machines
• Cluster-management code that deals with IndexSegments
• Share segments across serving machines using HDFS
• Can rebuild segments (e.g. to upgrade Lucene version, change data schema, etc.)
26. RealtimeIndexSegmentWriter
• Modified Lucene index implementation optimized for realtime search
• IndexWriter buffer is searchable (no need to flush to allow searching)
• In-memory
• Lock-free concurrency model for best performance
27. Concurrency - Definitions
• Pessimistic locking
• A thread holds an exclusive lock on a resource while an action is performed [mutual exclusion]
• Usually used when conflicts are expected to be likely
• Optimistic locking
• Operations are attempted atomically without holding a lock; conflicts can be detected, and retry logic is often used in case of conflicts
• Usually used when conflicts are expected to be the exception
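The two styles can be contrasted in a few lines of Java: the pessimistic version holds a lock for the whole read-modify-write, while the optimistic version attempts an atomic compare-and-set and retries on conflict. This is a generic illustration of the definitions, not Twitter code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LockingStyles {
    // Pessimistic: mutual exclusion around the whole read-modify-write.
    static int pessimisticCounter = 0;
    static synchronized void incrementPessimistic() {
        pessimisticCounter++;
    }

    // Optimistic: try the update atomically, detect conflicts, retry.
    static final AtomicInteger optimisticCounter = new AtomicInteger();
    static void incrementOptimistic() {
        int current;
        do {
            current = optimisticCounter.get();
            // compareAndSet fails if another thread changed the value meanwhile
        } while (!optimisticCounter.compareAndSet(current, current + 1));
    }
}
```

Under low contention the optimistic loop almost never retries, which is why it tends to win when conflicts are the exception.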
28. Concurrency - Definitions
• Non-blocking algorithm
Ensures that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion.
• Lock-free algorithm
A non-blocking algorithm is lock-free if there is guaranteed system-wide progress.
• Wait-free algorithm
A non-blocking algorithm is wait-free if there is guaranteed per-thread progress.
* Source: Wikipedia
29. Concurrency
• Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies data)
• But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds!
• In Java, it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication
• Safe publication can be achieved in different, subtle ways. Read the great book “Java Concurrency in Practice” by Brian Goetz for more information!
30. Java Memory Model
• Program order rule
Each action in a thread happens-before every action in that thread that comes later in the program order.
• Volatile variable rule
A write to a volatile field happens-before every subsequent read of that same field.
• Transitivity
If A happens-before B, and B happens-before C, then A happens-before C.
* Source: Brian Goetz: Java Concurrency in Practice
35. Concurrency
RAM 0
int x;
Thread A writes b=1 to RAM,
because b is volatile
5 x = 5;
1
Cache
Thread 1 Thread 2
time
volatile int b;
b = 1;
36. Concurrency
[Same diagram; x = 5 and b = 1 are now visible in RAM.]
Thread 2 then reads the volatile b and spins on x:
int dummy = b;
while(x != 5);
37. Concurrency
[Same diagram; a happens-before edge connects Thread 1's actions in program order.]
• Program order rule: Each action in a thread happens-before every action in
that thread that comes later in the program order.
38. Concurrency
[Same diagram; a happens-before edge runs from Thread 1's write b = 1 to Thread 2's read int dummy = b.]
• Volatile variable rule: A write to a volatile field happens-before every
subsequent read of that same field.
39. Concurrency
[Same diagram; by transitivity, a happens-before edge runs from Thread 1's x = 5 to Thread 2's while(x != 5).]
• Transitivity: If A happens-before B, and B happens-before C, then A
happens-before C.
40. Concurrency
[Same diagram; the loop condition in Thread 2 is highlighted.]
The while condition will be false, i.e. x == 5, so Thread 2 exits the loop immediately.
• Note: x itself doesn't have to be volatile. There can be many variables like x,
but we need only a single volatile field.
41. Concurrency
[Same diagram; the volatile write/read pair on b is labeled as the memory barrier.]
• Note: x itself doesn't have to be volatile. There can be many variables like x,
but we need only a single volatile field.
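The animation above can be condensed into a small runnable Java sketch (class and field names are mine, not from the slides): the plain field x is safely published to the reader thread through the single volatile field b.

```java
// Safe publication via a single volatile field, as in the diagram:
// the writer thread sets x = 5 and then b = 1; the reader spins on the
// volatile read of b, after which it is guaranteed to see x == 5.
public class SafePublication {
    static int x;             // plain (non-volatile) field
    static volatile int b;    // the single volatile "guard" field

    static int publishAndRead() {
        Thread writer = new Thread(() -> {
            x = 5;            // plain write
            b = 1;            // volatile write: crosses the memory barrier
        });
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (b != 1) { }  // volatile read: crosses the memory barrier
            // program order + volatile rule + transitivity => x == 5 here
            seen[0] = x;
        });
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }
}
```

Without the volatile on b, the reader could spin forever or observe a stale x; with it, the three happens-before rules above guarantee the result.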
45. Concurrency
[Diagram: IndexWriter and IndexReader along a time axis.]
IndexWriter: writes 100 docs, sets maxDoc = 100, then writes more docs.
IndexReader: in IR.open(), reads maxDoc; searches up to maxDoc.
maxDoc is volatile.
46. Concurrency
[Same diagram; a happens-before edge runs from the IndexWriter's write of maxDoc to the IndexReader's read of it in IR.open().]
• Only maxDoc is volatile. All other fields that the IndexWriter writes to and
the IndexReader reads from don't need to be!
47. Wait-free
• Not a single exclusive lock
• Writer thread can always make progress
• Optimistic locking (retry logic) in a few places for the searcher thread
• Retry logic very simple and guaranteed to always make progress
48. In-memory Real-time Index
• Highly optimized for GC - all data is stored in blocked native arrays
• v1: Optimized for tweets with a term position limit of 255
• v2: Support for 32 bit positions without performance degradation
• v2: Basic support for out-of-order posting list inserts
50. In-memory Real-time Index
• RT term dictionary
• Term lookups using a lock-free hashtable in O(1)
• v2: Additional probabilistic, lock-free skip list maintains ordering on terms
• A perfect skip list is not an option: out-of-order inserts would require
rebalancing, which is impractical with our lock-free index
• In a probabilistic skip list, the tower height of a new (out-of-order) item can
be determined without knowing its insert position, by simply rolling dice
54. In-memory Real-time Index
• Probabilistic skip list: the tower height is determined by rolling dice
BEFORE knowing the insert location; the tower height never has to change
for an element, simplifying memory allocation and concurrency.
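The coin-flip step can be sketched in a few lines of Java (the method name and the height cap are mine): each extra level is added with probability 1/2, independent of where the element will end up in the list.

```java
import java.util.Random;

// Tower height for a probabilistic skip list: flip a coin until it
// comes up tails (or the cap is hit). No knowledge of the insert
// position is needed, so the height is fixed before insertion and
// never has to change afterwards.
public class SkipListTower {
    public static int towerHeight(Random rnd, int maxHeight) {
        int height = 1;
        while (height < maxHeight && rnd.nextBoolean()) {
            height++;   // heads: grow the tower one more level
        }
        return height;
    }
}
```

On average half the elements get height 1, a quarter height 2, and so on, which is what gives the skip list its expected O(log n) search cost without any rebalancing.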
55. Schema-based Document factory
• Apps provide one ThriftSchema per index and create a ThriftDocument for
each document
• SchemaDocumentFactory translates ThriftDocument -> Lucene Document
using the Schema
• Default field values
• Extended field settings
• Type-system on top of DocValues
• Validation
56. Schema-based Document factory
[Diagram: ThriftDocument + Schema -> SchemaDocumentFactory -> Lucene Document]
• Validation
• Fill in default values
• Apply correct Lucene field settings
57. Schema-based Document factory
[Same diagram.]
• Validation
• Fill in default values
• Apply correct Lucene field settings
• Decouples the core package from a specific product/index. Similar to
Solr/Elasticsearch.
61. Outlook
• Support for parallel (sliced) segments to support partial segment rebuilds
and other cool posting list update patterns
• Add remaining missing Lucene features to RT index
• Index term statistics for ranking
• Term vectors
• Stored fields
65. Searching for top entities within Tweets
• Task: Find the best photos in a subset of tweets
• We could use a Lucene index, where each photo is a document
• Problem: How to update existing documents when the same photos are
tweeted again?
• In-place posting list updates are hard
• Lucene’s updateDocument() is a delete/add operation - expensive and not
order-preserving
66. Searching for top entities within Tweets
• Task: Find the best photos in a subset of tweets
• Could we use our existing time-ordered tweet index?
• Facets!
67. Searching for top entities within Tweets
[Diagram of index components:]
• Inverted index: query -> matching doc ids; also maps term id -> term label
• Forward index: doc id -> document metadata
• Facet index: doc id -> term ids
69. Searching for top entities within Tweets
[Diagram: the query yields matching doc ids (5, 15, 9000, 9002, 100000, 100090); each is looked up in the facet index, and the resulting term ids increment counters in a top-k heap.]
Top-k heap (early state):
Id     Count
48239  8
31241  2
70. Searching for top entities within Tweets
[Same diagram, after more matching doc ids have been processed.]
Top-k heap:
Id     Count
48239  15
31241  12
85932  8
6748   3
71. Searching for top entities within Tweets
[Same diagram and heap as before.]
• Weighted counts (from engagement features) are used for relevance scoring
72. Searching for top entities within Tweets
[Same diagram and heap as before.]
• All query operators can be used, e.g. find the best photos in San Francisco
tweeted by people I follow
73. Searching for top entities within Tweets
[Diagram: the inverted index maps each term id back to its term label.]
74. Searching for top entities within Tweets
The term ids in the top-k heap are resolved to labels via the inverted index:
Id     Count  Label
48239  45     pic.twitter.com/jknui4w
31241  23     pic.twitter.com/dslkfj83
85932  15     pic.twitter.com/acm3ps
6748   11     pic.twitter.com/948jdsd
74294  8      pic.twitter.com/dsjkf15h
3728   5      pic.twitter.com/irnsoa32
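The counting and label-resolution steps walked through on the last few slides can be sketched as follows (all names, the map-based index layout, and the output format are illustrative, not Twitter's actual code):

```java
import java.util.*;
import java.util.stream.Collectors;

// Facet counting sketch: for every matching doc id, fetch its term ids
// from the facet index and bump a per-term counter; then sort the
// counters, keep the top k, and resolve term ids to labels via the
// term-id -> label mapping.
public class FacetTopK {
    public static List<String> topKLabels(int[] matchingDocs,
                                          Map<Integer, int[]> facetIndex,
                                          Map<Integer, String> termLabels,
                                          int k) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            for (int termId : facetIndex.getOrDefault(doc, new int[0])) {
                counts.merge(termId, 1, Integer::sum);  // count hits per entity
            }
        }
        return counts.entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())  // highest count first
                .limit(k)
                .map(e -> termLabels.get(e.getKey()) + " (" + e.getValue() + ")")
                .collect(Collectors.toList());
    }
}
```

Because the counting happens at query time over facet data, the same photo appearing in many tweets simply accumulates a higher count; no document ever needs to be updated in place.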
75. Summary
• Indexing tweet entities (e.g. photos) as facets makes it possible to search
and rank top entities using a tweet index
• All query operators supported
• Documents don’t need to be reindexed
• Approach reusable for different use cases, e.g.: best vines, hashtags,
@mentions, etc.