Since the introduction of native vector-based search in Apache Lucene, many features have been developed, but support for multiple vectors in a dedicated KNN vector field remained unexplored. Being able to index (and search) multiple values per field unlocks working with long textual documents, splitting them into paragraphs and encoding each paragraph as a separate vector: a scenario often encountered by many businesses. This talk explores the challenges, the technical design and the implementation work that went into this contribution to the Apache Lucene project. The audience is expected to get an understanding of how multi-valued fields can work in a vector-based search use case and how this feature has been implemented.
2. ‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about semantic search, NLP and machine learning technologies
‣ Beach Volleyball player and Snowboarder
ALESSANDRO BENEDETTI
WHO AM I ?
3. ‣ Headquartered in London / distributed
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
SEArch SErvices
www.sease.io
5. WHY MULTI-VALUED
What can you do now?
The text content of a field exceeds the maximum number of characters accepted by your inference model (to encode vectors):
1. Split the content into paragraphs across multiple documents.
2. Your unit of information becomes the paragraph.
3. When returning the results, you need to aggregate back to documents.
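As a rough illustration of this workaround, here is a minimal Lucene sketch (not taken from the talk) that indexes one document per paragraph with a KnnFloatVectorField, assuming a recent Lucene 9.x; the encode() call and the field names are hypothetical placeholders for your own inference model and schema.

import java.io.IOException;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.VectorSimilarityFunction;

// One Lucene document per paragraph: the unit of information becomes the paragraph.
final class ParagraphIndexer {

  void indexParagraphs(IndexWriter writer, String parentId, List<String> paragraphs) throws IOException {
    for (String paragraph : paragraphs) {
      float[] vector = encode(paragraph); // hypothetical call to your sentence-embedding model
      Document doc = new Document();
      doc.add(new StringField("parent_id", parentId, Field.Store.YES)); // used later to aggregate back to documents
      doc.add(new StoredField("paragraph", paragraph));
      doc.add(new KnnFloatVectorField("paragraph_vector", vector, VectorSimilarityFunction.COSINE));
      writer.addDocument(doc);
    }
  }

  private float[] encode(String text) {
    throw new UnsupportedOperationException("plug in your inference model here");
  }
}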
6. WHY MULTI-VALUED
Split the content into paragraphs across multiple documents -> your unit of information becomes the paragraph -> when returning the results, you need to aggregate back to documents.
● Indexing time: nested documents (slow/expensive)
● Indexing time: flattened documents (redundant data)
● Query time: parent-child join queries? (slow/expensive)
● Query time: collapsing/grouping
● Aggregations: faceting becomes more complicated
● Stats: aggregating data and calculating stats is impacted
7. WHY MULTI-VALUED
● This applies to all fields and field types, actually
● you may be OK applying those strategies …
● … but for some users it may be quite annoying and expensive
8. What does it mean to bring multi-valued to vectors?
● K Nearest Neighbour algorithm?
● Indexing data structures and approach?
● Query-time data structures and approach?
9. ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (1-vs-1 vector distance)
● it’s fine to lose accuracy to get a massive performance gain
● pre-process the dataset to build index data structures
● Generally vectors are modelled in:
○ Trees
○ Hashes
○ Graphs - HNSW
10. HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor search (ANN).
References:
https://doi.org/10.1016/j.is.2013.10.006
https://arxiv.org/abs/1603.09320
11. HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors; closer vectors are linked
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of vertices (number of connections), the lower the probability of hitting a local minimum (but more expensive)
○ move down a layer to refine the minimum (closest friend)
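To make the "greedy search for the closest friend" concrete, here is a minimal, self-contained sketch of the greedy step on a single layer (plain Java, not Lucene's HnswGraphSearcher): start from an entry point and keep moving to whichever neighbour is closest to the query until no neighbour improves the distance, i.e. a local minimum.

import java.util.List;
import java.util.Map;

// Minimal sketch of greedy search on a single proximity-graph layer (illustrative only).
final class GreedyLayerSearch {

  // neighbours: adjacency list of the layer; vectors: node id -> vector; query: query vector.
  static int closestFriend(int entryPoint, Map<Integer, List<Integer>> neighbours,
                           float[][] vectors, float[] query) {
    int current = entryPoint;
    float best = distance(vectors[current], query);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int friend : neighbours.getOrDefault(current, List.of())) {
        float d = distance(vectors[friend], query);
        if (d < best) {          // move to the closer friend
          best = d;
          current = friend;
          improved = true;
        }
      }
    }
    return current;              // local minimum: no friend is closer to the query
  }

  static float distance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;                  // squared Euclidean distance is enough for comparisons
  }
}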
12. HNSW - Skip Lists
● the higher the layer, the sparser it is
● descend through the layers while searching
● fast to search and insert
13. HNSW - Small World Graphs
● start from the entry point
● greedy search (each time, the distance is calculated across friends)
● start zoomed out (low degree) and zoom in (high degree)
● when building the graph, a higher average degree improves quality at a cost
image from https://www.pinecone.io/learn/hnsw/
14. HNSW - Index time
● add one vector at a time
● probability to enter layer N
● when added, a vector goes to its insertion layer and all the layers below -> identify the layer(s) of insertion
● topk=1 closest neighbour is identified
● we descend and repeat until the layer of insertion
● topk=ef_construction to identify neighbour candidates
● M neighbours are linked (easiest is to calculate the exact distance)
image from https://www.pinecone.io/learn/hnsw/
Multi-Valued:
- each node is not a document
- multiple vectors per document Id
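The "probability to enter layer N" is typically implemented by drawing the insertion level from an exponentially decaying distribution, so each higher layer holds roughly 1/M of the nodes of the layer below. A small illustrative sketch (the formula and constant follow the HNSW paper; this is not necessarily Lucene's exact code):

import java.util.Random;

// Illustrative HNSW level assignment: level 0 is the most likely, each higher level is ~1/M as likely.
final class HnswLevelAssigner {
  private final Random random = new Random();
  private final double mL; // normalisation factor, commonly 1 / ln(M)

  HnswLevelAssigner(int m) {
    this.mL = 1.0 / Math.log(m);
  }

  int randomLevel() {
    double u = 1.0 - random.nextDouble(); // uniform in (0, 1], avoids log(0)
    return (int) Math.floor(-Math.log(u) * mL);
  }
}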
15. HNSW - Search time
● Start from layer N (top)
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of vertices (number of connections), the lower the probability of hitting a local minimum (but more expensive)
○ move down a layer to refine the minimum (closest friend)
Multi-Valued: you may add the same document Id to the top-K results multiple times
16. HNSW - MAX/SUM approach
MAX: when adding a vector from the same document, you update the score with the max
SUM: when adding a vector from the same document, you update the score by summing
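A minimal sketch of what these two aggregation strategies look like when folding per-vector scores into a single per-document score (illustrative only, not the patch code):

import java.util.HashMap;
import java.util.Map;

// Sketch of the two strategies for collapsing per-vector scores into per-document scores.
final class MultiValuedScoreAggregator {
  enum Strategy { MAX, SUM }

  private final Map<Integer, Float> docScores = new HashMap<>();
  private final Strategy strategy;

  MultiValuedScoreAggregator(Strategy strategy) {
    this.strategy = strategy;
  }

  // Called once per (document, vector score) pair produced by the vector search.
  void collect(int docId, float vectorScore) {
    docScores.merge(docId, vectorScore,
        (current, incoming) -> strategy == Strategy.MAX ? Math.max(current, incoming) : current + incoming);
  }

  Map<Integer, Float> scores() {
    return docScores;
  }
}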
17. Apache Lucene
Nov 2020 - Apache Lucene 9.0: Dedicated file format for Navigable Small World graphs
https://issues.apache.org/jira/browse/LUCENE-9004
Jan 2022 - Apache Lucene 9.0: Handle document deletions
https://issues.apache.org/jira/browse/LUCENE-10040
Feb 2022 - Apache Lucene 9.1: Introduced hierarchy in HNSW
https://issues.apache.org/jira/browse/LUCENE-10054
Mar 2022 - Apache Lucene 9.1: Re-use data structures across the HNSW graph
https://issues.apache.org/jira/browse/LUCENE-10391
Mar 2022 - Apache Lucene 9.1: Pre-filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
Aug 2022 - Apache Lucene 9.4: 8-bit vector quantization
https://github.com/apache/lucene/issues/11613
JIRA issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search
GitHub issues: https://github.com/apache/lucene/labels/vector-based-search
19. INDEXING - Auxiliary Data Structures
MAP VECTOR IDS TO DOCUMENT IDS: in a multi-valued scenario, multiple vectors may belong to the same document
Lucene95HnswVectorsWriter writes the auxiliary data structures:
● leverage sparse support
● ordinal (vector Id) to document (document Id) map
● DocsWithVectorsSet to keep track of the vectors per document
● DirectMonotonicWriter to write the map
20. INDEXING - DocsWithVectorsSet
DocsWithVectorsSet: accumulator of the documents that have vectors, used by the HnswVectorsWriter when writing data
● compatible with single-valued dense/sparse scenarios
● keeps a stack of vectors per document
● able to return a count of vectors for each document
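The class below is not the actual contribution code, just a minimal sketch of what such an accumulator could look like: it records each document that has at least one vector, in increasing doc id order, together with how many vectors it has.

import java.util.ArrayList;
import java.util.List;

// Sketch of an accumulator that tracks, per document, how many vectors were indexed.
// The single-valued dense/sparse cases fall out naturally (every count is simply 1).
final class DocsWithVectorsAccumulator {
  private final List<Integer> docIds = new ArrayList<>();       // distinct docs, in increasing order
  private final List<Integer> vectorCounts = new ArrayList<>(); // vectors per doc, same positions

  void add(int docId) {
    int last = docIds.size() - 1;
    if (last >= 0 && docIds.get(last) == docId) {
      vectorCounts.set(last, vectorCounts.get(last) + 1);       // another vector for the same document
    } else {
      docIds.add(docId);
      vectorCounts.add(1);
    }
  }

  int vectorCount(int position) {
    return vectorCounts.get(position);
  }

  int documentCount() {
    return docIds.size();
  }
}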
21. INDEXING - DirectMonotonicWriter
DirectMonotonicWriter: writes a sequence of monotonically increasing (never decreasing) integers, in blocks
● each integer is a document Id
● the same document Id is repeated for each vector in the document
● DirectMonotonicReader ordToDoc is then used at reading time in the SparseOffHeapVectorValues
● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index used to access the block and the position within the block to finally get the document Id
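Conceptually, the map is just one document Id per vector ordinal, repeated as many times as the document has vectors; the real code stores this sequence off-heap in blocks via DirectMonotonicWriter/Reader, but an in-memory sketch of the lookup looks like this:

// Conceptual view of the ordinal -> document mapping (the real code stores it off-heap in blocks).
final class OrdToDocMap {
  // One entry per vector ordinal; a document with 3 vectors contributes its doc id 3 times.
  private final int[] ordToDoc;

  OrdToDocMap(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  // The ordinal (vector id) is the index used to look up the document id.
  int ordToDoc(int ord) {
    return ordToDoc[ord];
  }
}

// Example: doc 0 has 2 vectors, doc 3 has 1 vector, doc 7 has 3 vectors:
// new OrdToDocMap(new int[] {0, 0, 3, 7, 7, 7}).ordToDoc(4) == 7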
22. INDEXING - building HNSW Graph
NODE ID is the VECTOR ID: same as in the sparse scenario, each node in the graph has an incremental ID aligned with the vector ID
● the node count in the graph = the vector count
● no code changes
23. QUERY TIME - Exact Search
Vector Scorer (naive solution): all vectors are iterated, and only the ones corresponding to an accepted doc are scored
● the VectorScorer scores only the BitSet of acceptedDocs
● all vectors from ByteVectorValues/FloatVectorValues are iterated
● scores are updated with MAX/SUM
AbstractKnnVectorQuery
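Putting the pieces together, here is a hedged sketch of that naive exact-search loop, with plain Java standing in for the Lucene classes named above: every vector is compared to the query, documents outside acceptedDocs are skipped, and each vector score is folded into its document score with MAX or SUM.

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Naive exact-search sketch: every vector is scored, but only for accepted documents.
// "ordToDoc" and "vectors" stand in for the index structures described above; this is not Lucene code.
final class ExactMultiValuedSearch {

  static Map<Integer, Float> search(float[] query, float[][] vectors, int[] ordToDoc,
                                    BitSet acceptedDocs, boolean useMax) {
    Map<Integer, Float> docScores = new HashMap<>();
    for (int ord = 0; ord < vectors.length; ord++) {   // iterate ALL vectors (1-vs-1 comparison)
      int docId = ordToDoc[ord];
      if (!acceptedDocs.get(docId)) {
        continue;                                      // pre-filtering: skip non-accepted documents
      }
      float score = dotProduct(query, vectors[ord]);   // similarity used only for illustration
      docScores.merge(docId, score,
          (a, b) -> useMax ? Math.max(a, b) : a + b);  // MAX or SUM per document
    }
    return docScores;
  }

  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}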
24. QUERY TIME - Approximate Search
HNSW SEARCH: searching on vectors (graph nodes) and returning documents (max/sum score)
● searching on level != 0 -> vectors are added as candidates/results
● searching on level = 0 -> the document ID is added to the results
● int docId = vectors.ordToDoc(vectorId);
● results are added to the NeighborQueue
HnswGraphSearcher
25. QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue): data structure used to collect the top-k results as a heap of longs
● MIN HEAP
● each element is a long: [32 bits][32 bits] -> [score][~document Id]
NeighborQueue
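To make the [score][~document Id] encoding concrete, here is an illustrative packing of a (docId, score) pair into a single long, kept in a plain JDK min-heap. This mirrors the idea rather than Lucene's actual NeighborQueue code, and it assumes non-negative scores (for non-negative floats the raw bit pattern preserves ordering).

import java.util.PriorityQueue;

// Sketch of the [score][~docId] long packing used to keep top-k results in a long min-heap.
final class PackedTopDocsQueue {
  private final PriorityQueue<Long> minHeap = new PriorityQueue<>(); // worst (lowest) score on top
  private final int k;

  PackedTopDocsQueue(int k) {
    this.k = k;
  }

  // High 32 bits: raw float bits of the score (order-preserving for scores >= 0).
  // Low 32 bits: complemented document id, mirroring the [score][~document Id] layout.
  static long encode(int docId, float score) {
    return (((long) Float.floatToIntBits(score)) << 32) | (~docId & 0xFFFFFFFFL);
  }

  static int decodeDocId(long packed) {
    return ~((int) packed);
  }

  static float decodeScore(long packed) {
    return Float.intBitsToFloat((int) (packed >>> 32));
  }

  void insertWithOverflow(int docId, float score) {
    minHeap.offer(encode(docId, score));
    if (minHeap.size() > k) {
      minHeap.poll(); // evict the current worst result
    }
  }
}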
26. QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue): data structure used to collect the top-k results as a heap of longs
● a nodeIdToHeapIndex cache is used to keep track of the nodes' positions
● the score is updated for the node (MAX/SUM)
● DOWNHEAP (as the ranking may have improved)
NeighborQueue
27. Challenges - side project and merging
● to build the first prototype -> 1 year
● super active area -> merging
● Lucene codecs change names and the old codec is moved back to backwards codecs
● 85 classes DIFF!
○ simplified by temporarily removing MAX/SUM
○ simplified by temporarily removing the separate code branches for single/multi-valued
○ down to 25 classes!
○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi and Josh Devins for the first reviews