Berlin Buzzwords
Alessandro Benedetti, Director @ Sease
Introducing Multi-valued Vector Fields in Apache Lucene
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about semantic search, NLP and machine learning technologies
‣ Beach Volleyball player and Snowboarder
‣ Headquarters in London, distributed team
‣ Open-source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
SEArch SErvices
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals
What can you do now?
The text content of a field exceeds the maximum number of characters accepted by your inference model (the model used to encode vectors)
● Split the content into paragraphs across multiple documents
● Your unit of information becomes the paragraph
● When returning the results you need to aggregate back to documents
● Indexing Time: nested
● Indexing Time:
● Query Time: parent-child join queries?
● Query Time:
● Aggregations: faceting becomes more
● Stats: aggregating data and calculating stats is
● This actually applies to all fields and field types
● you may be OK applying those strategies …
● … but for some users they may be quite annoying and
What does it mean to bring multi-valued to vectors?
● K Nearest Neighbour Algorithm?
● Indexing data structures and approach?
● Query time data structures and approach?
ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (1 vs 1 vector comparisons)
● it’s fine to lose accuracy to get a massive performance gain
● pre-process the dataset to build index data structures
● Generally vectors are modelled in:
○ Trees
○ Hashes
○ Graphs - HNSW
HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW)
graphs are among the top-performing
index-time data structures for approximate
nearest neighbor search (ANN).
HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors, closer vectors are linked
● Hierarchical Layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World Graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
HNSW - Skip Lists
● the higher the layer, the more sparse
● descending in layers while searching
● fast to search and insert
HNSW - Navigable Small World Graphs
● start from entry point
● greedy search (each time distance is calculated across friends)
● starting from zoom out (low degree) to zoom in (high degree)
● when building the graph, a higher average degree improves quality at a cost
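The greedy search described in the bullets above can be sketched on a single layer as follows. This is a minimal illustration, not Lucene's implementation: the class name `GreedyGraphSearch`, the adjacency-list representation and the use of squared Euclidean distance are all assumptions made for the example.

```java
import java.util.List;

/** Hypothetical sketch: greedy search on a single-layer proximity graph. */
final class GreedyGraphSearch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Starts at entryPoint and keeps moving to the friend (linked vertex) closest
     * to the query, until no friend improves the distance: a local minimum.
     */
    static int greedySearch(double[][] vectors, List<List<Integer>> friends,
                            int entryPoint, double[] query) {
        int current = entryPoint;
        double best = squaredDistance(vectors[current], query);
        boolean improved = true;
        while (improved) {
            improved = false;
            // The distance is calculated across all friends of the current vertex.
            for (int friend : friends.get(current)) {
                double d = squaredDistance(vectors[friend], query);
                if (d < best) {
                    best = d;
                    current = friend;
                    improved = true;
                }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Five 1-dimensional vectors linked as a chain: 0 - 1 - 2 - 3 - 4.
        double[][] vectors = {{0}, {1}, {2}, {3}, {4}};
        List<List<Integer>> friends = List.of(
            List.of(1), List.of(0, 2), List.of(1, 3), List.of(2, 4), List.of(3));
        System.out.println(greedySearch(vectors, friends, 0, new double[] {3.2}));
    }
}
```

With a low average degree, as in this chain, the walk needs many hops and can stop at a local minimum; more connections per vertex reduce that risk at the price of more distance computations per hop.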
HNSW - Index time
● add one vector at a time
● probability to enter layer N
● when added to a layer, it also goes to all the layers below
-> identify the layer(s) of insertion
● topk=1 closest neighbour is identified
● we descend and repeat until the layer of insertion
● topk=ef_construction to identify neighbours
● M neighbours are linked (easiest is to calculate the exact distance)
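The "probability to enter layer N" step can be sketched with the level-assignment formula from the original HNSW paper, where the top layer of a new vector is drawn as floor(-ln(u) * mL) with mL = 1 / ln(M). The class and method names below are hypothetical, and whether a given Lucene version uses exactly this formula is an assumption.

```java
import java.util.Random;

/** Hypothetical sketch: drawing the top layer of insertion for a new vector. */
final class HnswLevelAssignment {

    /**
     * Returns the highest layer the new vector enters; it is then also present
     * in every layer below. With maxConn = M, a vector reaches layer 1 or above
     * with probability 1/M, layer 2 or above with probability 1/M^2, and so on.
     */
    static int randomLevel(Random random, int maxConn) {
        double ml = 1.0 / Math.log(maxConn);
        double u;
        do {
            u = random.nextDouble(); // uniform in [0, 1)
        } while (u == 0.0);          // avoid log(0)
        return (int) (-Math.log(u) * ml);
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] histogram = new int[8];
        for (int i = 0; i < 100_000; i++) {
            int level = randomLevel(random, 16);
            histogram[Math.min(level, histogram.length - 1)]++;
        }
        for (int level = 0; level < histogram.length; level++) {
            System.out.println("layer " + level + ": " + histogram[level]);
        }
    }
}
```

Most vectors stay in layer 0 only, which is what makes the higher layers sparse, as in a skip list.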
- each node is not a document
- multiple vectors per document Id
HNSW - Search time
● Start from layer N (top)
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World Graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
you may add the same document Id to the top-K results multiple times
HNSW - MAX/SUM approach
MAX: when adding a vector from the same document, you update the score with the max
SUM: when adding a vector from the same document, you update the score by summing
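A minimal sketch of the two strategies, assuming the per-vector hits arrive as parallel arrays of document Ids and scores. The class `MultiValuedScoreAggregation` and its `Strategy` enum are hypothetical names for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: collapsing per-vector scores into per-document scores. */
final class MultiValuedScoreAggregation {

    enum Strategy { MAX, SUM }

    /** docIds[i] is the document that owns the i-th matching vector, scores[i] its score. */
    static Map<Integer, Float> aggregate(int[] docIds, float[] scores, Strategy strategy) {
        Map<Integer, Float> docScores = new HashMap<>();
        for (int i = 0; i < docIds.length; i++) {
            if (strategy == Strategy.MAX) {
                // Keep only the best vector of each document.
                docScores.merge(docIds[i], scores[i], Float::max);
            } else {
                // Every vector of the document contributes to its score.
                docScores.merge(docIds[i], scores[i], Float::sum);
            }
        }
        return docScores;
    }

    public static void main(String[] args) {
        int[] docIds = {1, 1, 2};
        float[] scores = {0.2f, 0.5f, 0.4f};
        System.out.println(aggregate(docIds, scores, Strategy.MAX));
        System.out.println(aggregate(docIds, scores, Strategy.SUM));
    }
}
```

MAX ranks a document by its single best paragraph, while SUM rewards documents with many matching vectors.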
Nov 2020 - Apache Lucene 9.0
Dedicated File Format for Navigable Small World Graphs
Jan 2022 - Apache Lucene 9.0
Handle Document Deletions
Feb 2022 - Apache Lucene 9.1
Introduced Hierarchy in HNSW
Mar 2022 - Apache Lucene 9.1
Re-use data structures across HNSW Graph
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
Aug 2022 - Apache Lucene 9.4
8-bit vector quantization
Apache Lucene
Pull Request
INDEXING - Auxiliary Data Structures
in a multi-valued scenario multiple vectors may belong to the
same document
● leverage sparse support
● ordinal (vector Id) to document (document Id) map
● DocsWithVectorsSet to keep track of vectors per document
● DirectMonotonicWriter to write the map
Write auxiliary data structures
INDEXING - DocsWithVectorsSet
accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data
● compatible with single valued dense/sparse scenarios
● keep a stack of vectors per document
● able to return a count of vectors for each document
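The behaviour listed above can be approximated with a small counter. This is only a sketch of the idea and not the `DocsWithVectorsSet` class from the pull request: the name `VectorsPerDocCounter`, its methods and the assumption that vector ordinals follow document Id order are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: tracking how many vectors each document holds. */
final class VectorsPerDocCounter {

    // Sorted by document Id, so iteration follows the assumed ordinal order.
    private final TreeMap<Integer, Integer> vectorsPerDoc = new TreeMap<>();
    private int totalVectors = 0;

    /** Records one more vector for the given document (single-valued docs call this once). */
    void add(int docId) {
        vectorsPerDoc.merge(docId, 1, Integer::sum);
        totalVectors++;
    }

    int docCount() { return vectorsPerDoc.size(); }

    int vectorCount() { return totalVectors; }

    int vectorsInDoc(int docId) { return vectorsPerDoc.getOrDefault(docId, 0); }

    /** Expands the counts to one document Id per vector ordinal. */
    int[] ordToDoc() {
        int[] result = new int[totalVectors];
        int ord = 0;
        for (Map.Entry<Integer, Integer> entry : vectorsPerDoc.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                result[ord++] = entry.getKey();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        VectorsPerDocCounter counter = new VectorsPerDocCounter();
        for (int docId : new int[] {0, 0, 2, 5, 5, 5}) {
            counter.add(docId);
        }
        System.out.println(counter.docCount() + " docs, " + counter.vectorCount() + " vectors");
    }
}
```

In the single-valued case every count is 1, so the same structure covers dense fields (every document has a vector) and sparse ones (some documents have none).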
INDEXING - DirectMonotonicWriter
writes a monotonically increasing (never decreasing) sequence of integers, in blocks
● each integer is a document Id
● the same document Id repeated for each vector in the document
● DirectMonotonicReader ordToDoc is then used at reading time in the
● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index used to access the block and the position within the block, to finally get the document Id
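The block-plus-position lookup can be sketched with plain arrays. This is not how `DirectMonotonicWriter` / `DirectMonotonicReader` store data (the real classes also compress the values, which this sketch does not attempt); the class `OrdToDocMap` and the `blockShift` parameter as used here are assumptions for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: ordinal (vector Id) to document Id lookup over fixed-size blocks. */
final class OrdToDocMap {

    private final int blockShift;
    private final int[][] blocks;

    /** docIdPerVector[ord] is the document Id of vector ord; it must never decrease. */
    OrdToDocMap(int[] docIdPerVector, int blockShift) {
        for (int i = 1; i < docIdPerVector.length; i++) {
            if (docIdPerVector[i] < docIdPerVector[i - 1]) {
                throw new IllegalArgumentException("document Ids must be monotonically increasing");
            }
        }
        this.blockShift = blockShift;
        int blockSize = 1 << blockShift;
        int numBlocks = (docIdPerVector.length + blockSize - 1) >>> blockShift;
        blocks = new int[numBlocks][];
        for (int b = 0; b < numBlocks; b++) {
            int from = b << blockShift;
            int to = Math.min(from + blockSize, docIdPerVector.length);
            blocks[b] = Arrays.copyOfRange(docIdPerVector, from, to);
        }
    }

    /** The high bits of the ordinal select the block, the low bits the position inside it. */
    int ordToDoc(int ord) {
        int block = ord >>> blockShift;
        int position = ord & ((1 << blockShift) - 1);
        return blocks[block][position];
    }

    public static void main(String[] args) {
        // Document 0 has three vectors, document 1 has one, document 3 has two.
        OrdToDocMap map = new OrdToDocMap(new int[] {0, 0, 0, 1, 3, 3}, 2);
        for (int ord = 0; ord < 6; ord++) {
            System.out.println("vector " + ord + " -> document " + map.ordToDoc(ord));
        }
    }
}
```

Because the same document Id is repeated once per vector, the sequence is non-decreasing, which is what makes a monotonic encoding applicable.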
INDEXING - building HNSW Graph
same as the sparse scenario, each node in the graph has an incremental ID
aligned with the vector ID
● the nodes count in the graph = vector count
● no code changes
QUERY TIME - Exact Search
Vector Scorer
(naive solution) all vectors are iterated, only the ones corresponding
to an accepted doc are scored
● VectorScorer scores only BitSet acceptedDocs
● all vectors from ByteVectorValues/FloatVectorValues are iterated
● scores are updated with MAX/SUM
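The naive exact search can be sketched as a single pass over all vectors. The sketch below uses plain arrays and `java.util.BitSet` in place of Lucene's `VectorScorer` and vector values classes, and dot product as the similarity; the class `ExactMultiValuedSearch` and its signature are hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical sketch: brute-force scoring of multi-valued vectors with a document filter. */
final class ExactMultiValuedSearch {

    static float dotProduct(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /**
     * Iterates every vector, skips the ones whose document is not accepted,
     * and combines the scores of vectors from the same document with MAX or SUM.
     */
    static Map<Integer, Float> score(float[][] vectors, int[] ordToDoc, BitSet acceptedDocs,
                                     float[] query, boolean useMax) {
        BinaryOperator<Float> combine = useMax ? Float::max : Float::sum;
        Map<Integer, Float> docScores = new HashMap<>();
        for (int ord = 0; ord < vectors.length; ord++) {
            int docId = ordToDoc[ord];
            if (!acceptedDocs.get(docId)) {
                continue; // filtered out: the vector is iterated but never scored
            }
            docScores.merge(docId, dotProduct(query, vectors[ord]), combine);
        }
        return docScores;
    }

    public static void main(String[] args) {
        float[][] vectors = {{1, 0}, {0, 1}, {1, 1}, {2, 0}};
        int[] ordToDoc = {0, 0, 1, 2}; // document 0 owns two vectors
        BitSet acceptedDocs = new BitSet();
        acceptedDocs.set(0);
        acceptedDocs.set(1);
        System.out.println(score(vectors, ordToDoc, acceptedDocs, new float[] {1, 1}, true));
        System.out.println(score(vectors, ordToDoc, acceptedDocs, new float[] {1, 1}, false));
    }
}
```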
QUERY TIME - Approximate Search
searching on vectors (graph nodes) and returning documents (max/sum score)
● searching on level != 0 -> vectors are added as candidates/results
● searching on level = 0 -> document ID is added to the results
● int docId = vectors.ordToDoc(vectorId);
● results are added to NeighborQueue
QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results
as a long heap
● each element is a long: [32 bits][32 bits] -> [score][~document Id]
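One way to pack a score and a document Id into a single long with the layout above is sketched below. The class `NeighborEncoding` is hypothetical, and the sortable-float bit trick is an assumption about how the score half can be made comparable as an integer; it is written out locally so the example has no Lucene dependency.

```java
/** Hypothetical sketch: packing [score][~document Id] into one long. */
final class NeighborEncoding {

    /** Maps a float to an int whose signed order matches the float order. */
    static int floatToSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        return bits ^ ((bits >> 31) & 0x7fffffff);
    }

    /** Inverse of floatToSortableInt. */
    static float sortableIntToFloat(int sortable) {
        return Float.intBitsToFloat(sortable ^ ((sortable >> 31) & 0x7fffffff));
    }

    /** High 32 bits: sortable score. Low 32 bits: bitwise complement of the document Id. */
    static long encode(int docId, float score) {
        return (((long) floatToSortableInt(score)) << 32) | (0xFFFFFFFFL & ~docId);
    }

    static float decodeScore(long encoded) {
        return sortableIntToFloat((int) (encoded >> 32));
    }

    static int decodeDocId(long encoded) {
        return (int) ~encoded;
    }

    public static void main(String[] args) {
        long encoded = encode(5, 0.5f);
        System.out.println(decodeDocId(encoded) + " " + decodeScore(encoded));
    }
}
```

Comparing two encoded longs then orders entries by score first; storing the complement of the document Id means that, on equal scores, the smaller document Id compares as the larger value.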
QUERY TIME - NeighborQueue
● nodeIdToHeapIndex cache is used to keep track of each node's position
● score is updated for the node
● DOWNHEAP (as the ranking may have improved)
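The update-then-downheap step can be sketched with a bounded min-heap that remembers where each document sits. This is an illustration of the idea, not Lucene's `NeighborQueue`: the class `TopKDocsHeap`, its fields and the choice of the MAX strategy are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: top-k min-heap of documents with in-place score updates (MAX). */
final class TopKDocsHeap {

    private final int[] docIds;
    private final float[] scores;
    private final Map<Integer, Integer> docIdToHeapIndex = new HashMap<>();
    private int size = 0;

    TopKDocsHeap(int k) {
        docIds = new int[k];
        scores = new float[k];
    }

    /** Offers the score of one vector belonging to docId. */
    void offer(int docId, float score) {
        Integer index = docIdToHeapIndex.get(docId);
        if (index != null) {
            if (score > scores[index]) {
                scores[index] = score;
                downHeap(index); // a better score moves away from the root (the worst entry)
            }
        } else if (size < docIds.length) {
            place(size, docId, score);
            size++;
            upHeap(size - 1);
        } else if (score > scores[0]) {
            docIdToHeapIndex.remove(docIds[0]); // evict the current worst document
            place(0, docId, score);
            downHeap(0);
        }
    }

    float minScore() { return scores[0]; }

    Float scoreOf(int docId) {
        Integer index = docIdToHeapIndex.get(docId);
        return index == null ? null : scores[index];
    }

    private void place(int index, int docId, float score) {
        docIds[index] = docId;
        scores[index] = score;
        docIdToHeapIndex.put(docId, index);
    }

    private void swap(int i, int j) {
        int docId = docIds[i];
        float score = scores[i];
        place(i, docIds[j], scores[j]);
        place(j, docId, score);
    }

    private void upHeap(int i) {
        while (i > 0) {
            int parent = (i - 1) / 2;
            if (scores[i] >= scores[parent]) {
                break;
            }
            swap(i, parent);
            i = parent;
        }
    }

    private void downHeap(int i) {
        while (true) {
            int left = 2 * i + 1;
            int right = left + 1;
            int smallest = i;
            if (left < size && scores[left] < scores[smallest]) smallest = left;
            if (right < size && scores[right] < scores[smallest]) smallest = right;
            if (smallest == i) {
                return;
            }
            swap(i, smallest);
            i = smallest;
        }
    }

    public static void main(String[] args) {
        TopKDocsHeap heap = new TopKDocsHeap(2);
        heap.offer(1, 0.3f);
        heap.offer(2, 0.5f);
        heap.offer(1, 0.9f); // second vector of document 1: update instead of a duplicate entry
        System.out.println(heap.minScore());
    }
}
```

The position map is what avoids returning the same document Id several times in the top-K: a second vector of a document finds the existing entry and only adjusts its score.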
Challenges - side project and merging
● to build the first prototype -> 1 year
● super active area -> merging
● Lucene codecs change names and the old codec is moved to the backwards codecs
● 85 classes DIFF!
○ simplified by temporarily removing MAX/SUM
○ simplified by temporarily removing separate code branches for single/multivalued
○ down to 25 classes!
○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi, Josh Devins for the first reviews
Pull Request
@seaseltd @sease-ltd @seaseltd @sease_ltd

What's New in Teams Calling, Meetings, Devices June 2024
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partesExchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptxFIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptxFIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptx
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptxFIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
FIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptxFIDO Munich Seminar FIDO Automotive Apps.pptx
FIDO Munich Seminar FIDO Automotive Apps.pptx
Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day

Introducing Multi Valued Vectors Fields in Apache Lucene

  • 1. Introducing Multi-valued Vector Fields in Apache Lucene
      Alessandro Benedetti, Director @ Sease
      Berlin Buzzwords, 19/06/2023
  • 2. WHO AM I? ALESSANDRO BENEDETTI
      ‣ Born in Tarquinia (ancient Etruscan city in Italy)
      ‣ R&D Software Engineer
      ‣ Director
      ‣ Master's degree in Computer Science
      ‣ PC member for ECIR, SIGIR and Desires
      ‣ Apache Lucene/Solr PMC member/committer
      ‣ Elasticsearch/OpenSearch expert
      ‣ Passionate about semantic search, NLP and Machine Learning technologies
      ‣ Beach Volleyball player and Snowboarder
  • 3. SEArch SErvices
      ‣ Headquartered in London, distributed team
      ‣ Open-source enthusiasts
      ‣ Apache Lucene/Solr experts
      ‣ Elasticsearch/OpenSearch experts
      ‣ Community contributors
      ‣ Active researchers
      ‣ Hot trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
  • 4. AGENDA
      ‣ Why Multi-valued?
      ‣ HNSW and modifications
      ‣ Index time internals
      ‣ Challenges of a contribution
      ‣ Query time internals
  • 5. WHY MULTI-VALUED
      The text content of a field exceeds the maximum number of characters accepted by your inference model (to encode vectors). What can you do now?
      1. Split the content into paragraphs across multiple documents.
      2. Your unit of information becomes the paragraph.
      3. When returning the results, you need to aggregate back to documents.
  • 6. WHY MULTI-VALUED
      Costs of the split/aggregate workaround:
      ● Indexing time: nested documents (slow/expensive)
      ● Indexing time: flattened documents (redundant data)
      ● Query time: parent-child join queries? (slow/expensive)
      ● Query time: collapsing/grouping
      ● Aggregations: faceting becomes more complicated
      ● Stats: aggregating data and calculating stats is impacted
  • 7. WHY MULTI-VALUED
      ● This actually applies to all fields and field types
      ● You may be fine applying those strategies …
      ● … but for some users they may be quite annoying and expensive
  • 8. What does it mean to bring multi-valued to vectors?
      ● K Nearest Neighbour algorithm?
      ● Indexing data structures and approach?
      ● Query time data structures and approach?
  • 9. ANN - Approximate Nearest Neighbor
      ● Exact Nearest Neighbor is expensive! (1 vs 1 vector distance)
      ● It's fine to lose accuracy to get a massive performance gain
      ● Pre-process the dataset to build index data structures
      ● Generally vectors are modelled in:
        ○ Trees
        ○ Hashes
        ○ Graphs - HNSW
  • 10. HNSW - Hierarchical Navigable Small World graphs
      Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor search (ANN).
  • 11. HNSW - How it works in a nutshell
      ● Proximity graph
      ● Vertices are vectors, closer vectors are linked
      ● Hierarchical layers based on skip lists
        ○ longer edges in higher layers (fast retrieval)
        ○ shorter edges in lower layers (accuracy)
      ● Each layer is a Navigable Small World graph
        ○ greedy search for the closest friend (local minimum)
        ○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive)
        ○ move down a layer to refine the minimum (closest friend)
  • 12. HNSW - Skip Lists
      ● the higher the layer, the more sparse
      ● descending in layers while searching
      ● fast to search and insert
  • 13. HNSW - Small World Graphs
      ● start from the entry point
      ● greedy search (each time, the distance is calculated across friends)
      ● starting from zoom out (low degree) to zoom in (high degree)
      ● when building the graph, a higher average degree improves quality at a cost
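The greedy walk described on the slides above can be illustrated with a minimal sketch. It is not Lucene code: the class name `GreedyLayerSearch`, the adjacency-list representation and the squared Euclidean distance are all assumptions made for the example, and it covers a single layer only.

```java
import java.util.List;

/** Hypothetical sketch: greedy walk on one proximity-graph layer (not Lucene's implementation). */
class GreedyLayerSearch {

  /** Squared Euclidean distance between two vectors of equal length. */
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Starts at entryPoint and keeps moving to the neighbour ("friend") closest to the query.
   * Stops when no neighbour is closer than the current node, i.e. at a local minimum.
   */
  static int greedySearch(
      float[][] vectors, List<List<Integer>> neighbours, int entryPoint, float[] query) {
    int current = entryPoint;
    float best = squaredDistance(vectors[current], query);
    boolean improved = true;
    while (improved) {
      improved = false;
      // Scan the friends of the node we were on when this pass started.
      for (int candidate : neighbours.get(current)) {
        float d = squaredDistance(vectors[candidate], query);
        if (d < best) {
          best = d;
          current = candidate;
          improved = true;
        }
      }
    }
    return current;
  }
}
```

In a hierarchical graph, the node returned for one layer would serve as the entry point for the layer below.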
  • 14. HNSW - Index time
      ● add one vector at a time
      ● probability to enter layer N
      ● when added to a layer, it also goes to all the lower layers -> identify the layer(s) of insertion
      ● topk=1: the closest neighbour is identified
      ● we descend and repeat until the layer of insertion
      ● topk=ef_construction to identify neighbour candidates
      ● M neighbours are linked (the easiest approach is to calculate the exact distance)
      Multi-Valued: each node is not a document, there are multiple vectors per document Id
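The "probability to enter layer N" step can be sketched as follows. The formula floor(-ln(u) * mL) with mL = 1 / ln(M) comes from the original HNSW paper; the class and method names are hypothetical, and the slides do not state which normalisation factor the Lucene implementation uses.

```java
import java.util.Random;

/** Hypothetical sketch of HNSW layer assignment: level = floor(-ln(u) * mL), mL = 1 / ln(M). */
class LevelAssignment {

  /** Deterministic part: maps a uniform sample in (0, 1] to a layer number. */
  static int levelFor(double uniformSample, int maxConn) {
    double ml = 1.0 / Math.log(maxConn);
    return (int) (-Math.log(uniformSample) * ml);
  }

  /** Draws a sample, skipping 0.0 because ln(0) is undefined, and converts it to a layer. */
  static int randomLevel(Random random, int maxConn) {
    double u;
    do {
      u = random.nextDouble();
    } while (u == 0.0);
    return levelFor(u, maxConn);
  }
}
```

With this formula most vectors land on layer 0 only, and each additional layer is entered with an exponentially decreasing probability, which is what keeps the upper layers sparse.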
  • 15. HNSW - Search time
      ● Start from layer N (top)
        ○ longer edges in higher layers (fast retrieval)
        ○ shorter edges in lower layers (accuracy)
      ● Each layer is a Navigable Small World graph
        ○ greedy search for the closest friend (local minimum)
        ○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive)
        ○ move down a layer to refine the minimum (closest friend)
      Multi-Valued: the same document Id may be added to the top-K results multiple times
  • 16. HNSW - MAX/SUM approach
      ● MAX: when adding a vector from a document already in the results, you update the document score with the maximum of the two scores
      ● SUM: when adding a vector from a document already in the results, you update the document score by summing the two scores
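A minimal sketch of the two strategies, assuming a plain map from document Id to score. `DocScoreAggregator` and its `Mode` enum are names invented for this illustration and do not come from the Lucene code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: combines the scores of several vectors belonging to one document. */
class DocScoreAggregator {

  enum Mode { MAX, SUM }

  private final Mode mode;
  private final Map<Integer, Float> scores = new HashMap<>();

  DocScoreAggregator(Mode mode) {
    this.mode = mode;
  }

  /** Records the score of one vector; repeated document Ids are merged, never duplicated. */
  void add(int docId, float vectorScore) {
    if (mode == Mode.MAX) {
      scores.merge(docId, vectorScore, Float::max);
    } else {
      scores.merge(docId, vectorScore, Float::sum);
    }
  }

  /** Returns the aggregated score, or 0 when the document has no scored vector yet. */
  float score(int docId) {
    return scores.getOrDefault(docId, 0f);
  }

  /** Number of distinct documents collected so far. */
  int size() {
    return scores.size();
  }
}
```

MAX ranks a document by its best-matching paragraph, while SUM rewards documents with many matching paragraphs, so the choice changes the ranking and not only the scores.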
  • 17. Apache Lucene
      ● Nov 2020 - Apache Lucene 9.0: Dedicated File Format for Navigable Small World Graphs
      ● Jan 2022 - Apache Lucene 9.0: Handle Document Deletions
      ● Feb 2022 - Apache Lucene 9.1: Introduced Hierarchy in HNSW
      ● Mar 2022 - Apache Lucene 9.1: Re-use data structures across HNSW Graph
      ● Mar 2022 - Apache Lucene 9.1: Pre filters with KNN queries
      ● Aug 2022 - Apache Lucene 9.4: 8 bits vector quantization
      JIRA ISSUES: 0LUCENE%20AND%20labels%20%3D%20vector-based-search
      GITHUB ISSUES:
  • 19. INDEXING - Auxiliary Data Structures
      MAP VECTOR IDS TO DOCUMENT IDS: in a multi-valued scenario, multiple vectors may belong to the same document
      Lucene95HnswVectorsWriter: write auxiliary data structures
      ● leverage sparse support
      ● ordinal (vector Id) to document (document Id) map
      ● DocsWithVectorsSet to keep track of vectors per document
      ● DirectMonotonicWriter to write the map
  • 20. INDEXING - DocsWithVectorsSet
      DocsWithVectorsSet: accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data
      ● compatible with single-valued dense/sparse scenarios
      ● keeps a stack of vectors per document
      ● able to return a count of vectors for each document
  • 21. INDEXING - DirectMonotonicWriter
      DirectMonotonicWriter: writes a sequence of monotonically increasing (never decreasing) integers, in blocks
      ● each integer is a document Id
      ● the same document Id is repeated for each vector in the document
      ● DirectMonotonicReader ordToDoc is then used at reading time in the SparseOffHeapVectorValues
      ● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index used to access the block and the position within the block, to finally get the document Id
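The ordinal-to-document map can be pictured with a small in-memory sketch. It assumes a plain `int[]` instead of the block-encoded on-disk format that DirectMonotonicWriter produces, and `OrdToDocMap` is a hypothetical class; only the method name `ordToDoc` is taken from the slides.

```java
/** Hypothetical in-memory stand-in for the ordinal (vector Id) to document Id map. */
class OrdToDocMap {

  // One entry per vector; the document Id is repeated for every vector of that document,
  // so the sequence never decreases when documents are visited in increasing Id order.
  private final int[] docIdByOrd;

  /** docIds must be sorted in increasing order; vectorsPerDoc[i] is the vector count of docIds[i]. */
  OrdToDocMap(int[] docIds, int[] vectorsPerDoc) {
    int total = 0;
    for (int count : vectorsPerDoc) {
      total += count;
    }
    docIdByOrd = new int[total];
    int ord = 0;
    for (int i = 0; i < docIds.length; i++) {
      for (int v = 0; v < vectorsPerDoc[i]; v++) {
        docIdByOrd[ord++] = docIds[i];
      }
    }
  }

  /** Resolves a vector ordinal to the Id of the document that owns it. */
  int ordToDoc(int ord) {
    return docIdByOrd[ord];
  }

  /** Total number of vectors, which is also the number of graph nodes. */
  int size() {
    return docIdByOrd.length;
  }
}
```

For documents 0, 2 and 5 holding 2, 1 and 3 vectors respectively, the stored sequence would be 0, 0, 2, 5, 5, 5.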
  • 22. INDEXING - building HNSW Graph
      NODE ID is the VECTOR ID: same as the sparse scenario, each node in the graph has an incremental ID aligned with the vector ID
      ● the node count in the graph = vector count
      ● no code changes
  • 23. QUERY TIME - Exact Search (AbstractKnnVectorQuery)
      Vector Scorer (naive solution): all vectors are iterated, only the ones corresponding to an accepted doc are scored
      ● VectorScorer scores only BitSet acceptedDocs
      ● all vectors from ByteVectorValues/FloatVectorValues are iterated
      ● scores are updated with MAX/SUM
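The naive exact search can be sketched as a single loop over all vectors. This is an illustration only: `ExactMultiValuedSearch`, the dot-product similarity and the use of `java.util.BitSet` in place of Lucene's own bit set types are assumptions, not the AbstractKnnVectorQuery code.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

/** Hypothetical sketch of exact multi-valued scoring restricted to accepted documents. */
class ExactMultiValuedSearch {

  /** Dot product, used here as the similarity function. */
  static float dot(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * Iterates every vector, skips those whose document is not accepted,
   * and merges the remaining scores per document with MAX or SUM.
   */
  static Map<Integer, Float> scoreAccepted(
      float[][] vectors, int[] ordToDoc, BitSet acceptedDocs, float[] query, boolean useSum) {
    BinaryOperator<Float> combine = useSum ? Float::sum : Float::max;
    Map<Integer, Float> docScores = new HashMap<>();
    for (int ord = 0; ord < vectors.length; ord++) {
      int docId = ordToDoc[ord];
      if (!acceptedDocs.get(docId)) {
        continue; // filtered out: never scored
      }
      docScores.merge(docId, dot(vectors[ord], query), combine);
    }
    return docScores;
  }
}
```

The cost grows with the number of vectors rather than the number of documents, which is why this path is described as the naive solution.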
  • 24. QUERY TIME - Approximate Search (HnswGraphSearcher)
      HNSW SEARCH: searching on vectors (graph nodes) and returning documents (max/sum score)
      ● searching on level != 0 -> vectors are added as candidates/results
      ● searching on level = 0 -> the document ID is added to the results
      ● int docId = vectors.ordToDoc(vectorId);
      ● results are added to the NeighborQueue
  • 25. QUERY TIME - NeighborQueue
      TOP-K DOCUMENTS (NeighborQueue): data structure used to collect the top-k results as a long heap
      ● MIN HEAP
      ● each element is a long: [32 bits][32 bits] -> [score][~document Id]
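The [score][~document Id] layout can be sketched as below. `ScoreDocEncoding` is a hypothetical, self-contained helper; the float-to-sortable-int conversion is written out by hand for the example and is an assumption about how the score bits are made comparable.

```java
/** Hypothetical sketch: packs a score and a complemented Id into one comparable long. */
class ScoreDocEncoding {

  /** Maps float bits to an int whose signed ordering matches the float ordering. */
  static int floatToSortableInt(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  /** Inverse of floatToSortableInt (the transformation is its own inverse). */
  static float sortableIntToFloat(int sortable) {
    return Float.intBitsToFloat(sortable ^ ((sortable >> 31) & 0x7fffffff));
  }

  /** High 32 bits: sortable score. Low 32 bits: bitwise complement of the Id. */
  static long encode(float score, int id) {
    return (((long) floatToSortableInt(score)) << 32) | (0xFFFFFFFFL & ~id);
  }

  static float decodeScore(long encoded) {
    return sortableIntToFloat((int) (encoded >> 32));
  }

  static int decodeId(long encoded) {
    return (int) ~encoded;
  }
}
```

Because the score occupies the high bits, comparing two longs compares scores first. Storing the complement of the Id means that, for equal scores, the entry with the smaller Id is the larger long, so a min-heap evicts the larger Id first.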
  • 26. QUERY TIME - NeighborQueue
      ● the nodeIdToHeapIndex cache is used to keep track of the position of each node in the heap
      ● the score is updated for the node (MAX/SUM)
      ● DOWNHEAP (as the ranking may have improved)
  • 27. Challenges - side project and merging
      ● building the first prototype took 1 year
      ● super active area -> frequent merging
      ● Lucene codecs change names, and the old codec is moved to the backwards codecs
      ● 85 classes in the DIFF!
        ○ simplified by temporarily removing MAX/SUM
        ○ simplified by temporarily removing the separate code branches for single/multi-valued
        ○ down to 25 classes!
        ○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi, Josh Devins for the first reviews
  • 28. WRAPPING UP
      ‣ Why Multi-valued?
      ‣ HNSW and modifications
      ‣ Index time internals
      ‣ Challenges of a contribution
      ‣ Query time internals
  • 30. THANK YOU!
      @seaseltd @sease-ltd @seaseltd @sease_ltd