Berlin Buzzwords
19/06/2023
Alessandro Benedetti, Director @ Sease
Introducing Multi-valued Vector Fields in Apache Lucene
WHO AM I?
ALESSANDRO BENEDETTI
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master's degree in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch/OpenSearch expert
‣ Passionate about semantic search, NLP and machine learning technologies
‣ Beach volleyball player and snowboarder
SEArch SErvices
www.sease.io
‣ Headquartered in London, distributed team
‣ Open-source enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch/OpenSearch experts
‣ Community contributors
‣ Active researchers
‣ Hot trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
AGENDA
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals
WHY MULTI-VALUED
What can you do now?
When the text content of a field exceeds the maximum number of characters accepted by your inference model (used to encode vectors):
1. Split the content into paragraphs across multiple documents.
2. Your unit of information becomes the paragraph.
3. When returning the results you need to aggregate back to documents.
● Indexing time: nested documents (slow/expensive)
● Indexing time: flattened documents (redundant data)
● Query time: parent-child join queries? (slow/expensive)
● Query time: collapsing/grouping
● Aggregations: faceting becomes more complicated
● Stats: aggregating data and calculating stats is impacted
● This actually applies to all fields and field types
● You may be fine applying those strategies …
● … but for some users they may be quite annoying and expensive
What does it mean to bring multi-valued to vectors?
● K Nearest Neighbour algorithm?
● Indexing data structures and approach?
● Query time data structures and approach?
ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (the query is compared with every vector, 1 vs 1 distance)
● It's fine to lose some accuracy to get a massive performance gain
● Pre-process the dataset to build index data structures
● Generally vectors are modelled in:
○ Trees
○ Hashes
○ Graphs - HNSW
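To make the cost of exact search concrete, here is a minimal sketch of a brute-force k-nearest-neighbour scan. It is an illustration only and not Lucene code: the class name ExactKnnSketch, its methods and the choice of squared Euclidean distance are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative only: brute-force exact k-nearest-neighbour search (not Lucene code).
final class ExactKnnSketch {

  // Squared Euclidean distance between two vectors of equal length.
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Returns the indices of the k vectors closest to the query, nearest first.
  // Every vector is compared with the query, so the cost grows with the dataset size.
  static int[] topK(float[][] vectors, float[] query, int k) {
    Integer[] ids = new Integer[vectors.length];
    for (int i = 0; i < ids.length; i++) {
      ids[i] = i;
    }
    Arrays.sort(ids, Comparator.comparingDouble(i -> squaredDistance(vectors[i], query)));
    int[] result = new int[Math.min(k, ids.length)];
    for (int i = 0; i < result.length; i++) {
      result[i] = ids[i];
    }
    return result;
  }

  public static void main(String[] args) {
    float[][] vectors = {{0f, 0f}, {1f, 1f}, {5f, 5f}};
    System.out.println(Arrays.toString(topK(vectors, new float[] {0.9f, 1.2f}, 2)));
  }
}
```

ANN structures such as HNSW try to avoid this full scan by visiting only a small part of the dataset.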
HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor search (ANN).
References
https://doi.org/10.1016/j.is.2013.10.006
https://arxiv.org/abs/1603.09320
HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors, closer vectors are linked
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
HNSW - Skip Lists
● the higher the layer, the sparser it is
● search descends through the layers
● fast to search and insert
HNSW - Small World Graphs
● start from the entry point
● greedy search (each time, the distance is calculated across friends)
● start zoomed out (low-degree vertices) and zoom in (high-degree vertices)
● when building the graph, a higher average degree improves quality at a cost
image from https://www.pinecone.io/learn/hnsw/
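The greedy search on a single layer can be sketched as follows. This is a simplified illustration, not Lucene's HnswGraphSearcher: the adjacency-list representation, the class name GreedyLayerSearchSketch and the use of squared Euclidean distance are assumptions made for this example.

```java
import java.util.List;

// Illustrative only: greedy search on one layer of a proximity graph.
final class GreedyLayerSearchSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Starting from entryPoint, repeatedly moves to the friend closest to the query.
  // Stops when no friend is closer than the current node, i.e. at a local minimum.
  static int greedySearch(
      float[][] vectors, List<List<Integer>> friends, float[] query, int entryPoint) {
    int current = entryPoint;
    float best = squaredDistance(vectors[current], query);
    while (true) {
      int next = current;
      for (int friend : friends.get(current)) {
        float d = squaredDistance(vectors[friend], query);
        if (d < best) {
          best = d;
          next = friend;
        }
      }
      if (next == current) {
        return current; // no friend improves the distance
      }
      current = next;
    }
  }

  public static void main(String[] args) {
    // Five points on a line, linked as a chain 0-1-2-3-4.
    float[][] vectors = {{0f}, {1f}, {2f}, {3f}, {10f}};
    List<List<Integer>> friends =
        List.of(List.of(1), List.of(0, 2), List.of(1, 3), List.of(2, 4), List.of(3));
    System.out.println(greedySearch(vectors, friends, new float[] {2.9f}, 0));
  }
}
```

In HNSW the node returned on one layer becomes the entry point for the layer below.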
HNSW - Index time
● add one vector at a time
● probability to enter layer N
● when added to a layer, the vector also goes to all the layers below
-> identify the layer(s) of insertion
● starting from the top layer, the topk=1 closest neighbour is identified
● we descend and repeat until the layer of insertion
● topk=ef_construction to identify the neighbour candidates
● M neighbours are linked (the easiest way is to calculate the exact distance)
image from https://www.pinecone.io/learn/hnsw/
Multi-Valued
- each node is not a document
- multiple vectors per document Id
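The "probability to enter layer N" can be sketched with the level assignment from the HNSW paper referenced earlier, where the top layer of a new vector is floor(-ln(u) * mL) with u uniform in (0, 1] and mL = 1 / ln(M). The class and method names below are hypothetical and the simulation is only an illustration, not Lucene's implementation.

```java
import java.util.Random;

// Illustrative only: random layer assignment as described in the HNSW paper.
final class HnswLevelSketch {

  // Draws the top layer of a new vector: floor(-ln(u) * mL), with u uniform in (0, 1].
  static int randomLevel(double mL, Random random) {
    double u = 1.0 - random.nextDouble(); // nextDouble() is in [0, 1), so u is in (0, 1]
    return (int) (-Math.log(u) * mL);
  }

  // Simulates n insertions and counts how many vectors get each top layer.
  static int[] levelHistogram(int n, int maxConnections, long seed) {
    double mL = 1.0 / Math.log(maxConnections);
    Random random = new Random(seed);
    int[] counts = new int[32];
    for (int i = 0; i < n; i++) {
      int level = Math.min(randomLevel(mL, random), counts.length - 1);
      counts[level]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    int[] counts = levelHistogram(100_000, 16, 42L);
    for (int level = 0; level < 4; level++) {
      System.out.println("top layer " + level + ": " + counts[level] + " vectors");
    }
  }
}
```

With M = 16 the probability of reaching at least layer l is 16^-l, so most vectors live only in layer 0 and each higher layer is much sparser.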
HNSW - Search time
● Start from layer N (top)
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph
○ greedy search for the closest friend (local minimum)
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
Multi-Valued
the same document Id may be added to the top-K results multiple times
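HNSW - MAX/SUM approach
MAX
when adding a vector from a document that is already in the results, the document score is updated to the maximum of the old and the new score
SUM
when adding a vector from a document that is already in the results, the new score is added to the document score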
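The two strategies can be sketched as a small aggregation over (document Id, score) pairs, one pair per matching vector. The names ScoreAggregationSketch and Strategy are hypothetical and the map-based approach is a simplification for illustration, not the data structure used in the pull request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Illustrative only: collapsing per-vector scores into per-document scores.
final class ScoreAggregationSketch {

  enum Strategy { MAX, SUM }

  // docIds[i] is the document that owns the i-th scored vector.
  static Map<Integer, Float> aggregate(int[] docIds, float[] scores, Strategy strategy) {
    BinaryOperator<Float> combine = strategy == Strategy.MAX ? Float::max : Float::sum;
    Map<Integer, Float> byDoc = new HashMap<>();
    for (int i = 0; i < docIds.length; i++) {
      // First vector of a document: store its score. Later vectors: combine.
      byDoc.merge(docIds[i], scores[i], combine);
    }
    return byDoc;
  }

  public static void main(String[] args) {
    int[] docIds = {7, 3, 7};
    float[] scores = {0.25f, 0.75f, 0.5f};
    System.out.println("MAX: " + aggregate(docIds, scores, Strategy.MAX));
    System.out.println("SUM: " + aggregate(docIds, scores, Strategy.SUM));
  }
}
```

MAX ranks a document by its best matching vector, while SUM rewards documents with several matching vectors.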
Apache Lucene
Nov 2020 - Apache Lucene 9.0
Dedicated File Format for Navigable Small World Graphs
https://issues.apache.org/jira/browse/LUCENE-9004
Jan 2022 - Apache Lucene 9.0
Handle Document Deletions
https://issues.apache.org/jira/browse/LUCENE-10040
Feb 2022 - Apache Lucene 9.1
Introduced Hierarchy in HNSW
https://issues.apache.org/jira/browse/LUCENE-10054
Mar 2022 - Apache Lucene 9.1
Re-use data structures across HNSW Graph
https://issues.apache.org/jira/browse/LUCENE-10391
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
Aug 2022 - Apache Lucene 9.4
8-bit vector quantization
https://github.com/apache/lucene/issues/11613
JIRA ISSUES:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search
GITHUB ISSUES:
https://github.com/apache/lucene/labels/vector-based-search
GitHub Pull Request
INDEXING - Auxiliary Data Structures
MAP VECTOR IDS TO DOCUMENT IDS
in a multi-valued scenario multiple vectors may belong to the same document
● leverage sparse support
● ordinal (vector Id) to document (document Id) map
● DocsWithVectorsSet to keep track of the vectors per document
● DirectMonotonicWriter to write the map
Lucene95HnswVectorsWriter: writes the auxiliary data structures
INDEXING - DocsWithVectorsSet
DocsWithVectorsSet
accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data
● compatible with single-valued dense/sparse scenarios
● keeps a stack of vectors per document
● able to return the count of vectors for each document
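A minimal sketch of such an accumulator is shown here. It is not the DocsWithVectorsSet class from the pull request: the class name, the list-based storage and the assumption that document Ids arrive in non-decreasing order are all choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: tracks which documents have vectors and how many each one has.
final class VectorsPerDocAccumulatorSketch {

  private final List<Integer> docIds = new ArrayList<>(); // distinct document Ids
  private final List<Integer> vectorCounts = new ArrayList<>(); // vectors per document Id
  private int lastDocId = -1;

  // Registers one vector for docId. Assumption: Ids arrive in non-decreasing order.
  void add(int docId) {
    if (docId < lastDocId) {
      throw new IllegalArgumentException("Out of order doc Id: " + docId);
    }
    if (docId == lastDocId) {
      int last = vectorCounts.size() - 1;
      vectorCounts.set(last, vectorCounts.get(last) + 1); // another vector, same document
    } else {
      docIds.add(docId);
      vectorCounts.add(1);
      lastDocId = docId;
    }
  }

  int docCount() {
    return docIds.size();
  }

  int vectorCount() {
    int total = 0;
    for (int count : vectorCounts) {
      total += count;
    }
    return total;
  }

  // Expands the counts into the ordinal (vector Id) -> document Id sequence.
  int[] ordToDoc() {
    int[] map = new int[vectorCount()];
    int ord = 0;
    for (int i = 0; i < docIds.size(); i++) {
      for (int v = 0; v < vectorCounts.get(i); v++) {
        map[ord++] = docIds.get(i);
      }
    }
    return map;
  }

  public static void main(String[] args) {
    VectorsPerDocAccumulatorSketch acc = new VectorsPerDocAccumulatorSketch();
    for (int docId : new int[] {0, 0, 2, 5, 5, 5}) {
      acc.add(docId);
    }
    System.out.println(acc.docCount() + " docs, " + acc.vectorCount() + " vectors");
    System.out.println(Arrays.toString(acc.ordToDoc()));
  }
}
```

In the single-valued case every count is 1, so the same structure covers the existing dense and sparse scenarios.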
INDEXING - DirectMonotonicWriter
DirectMonotonicWriter
writes a sequence of monotonically increasing (never decreasing) integers, in blocks
● each integer is a document Id
● the same document Id is repeated for each vector in the document
● a DirectMonotonicReader (ordToDoc) is then used at read time in SparseOffHeapVectorValues
● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index used to locate the block and the position within the block, to finally get the document Id
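The block and position lookup can be sketched with a toy blocked encoding of the ordinal to document Id sequence. This is only an illustration of the idea: Lucene's DirectMonotonicWriter and DirectMonotonicReader use a more compact on-disk encoding that is not reproduced here, and the class name and fields below are hypothetical.

```java
// Illustrative only: a toy blocked encoding of a non-decreasing ordinal -> doc Id sequence.
final class BlockedOrdToDocSketch {

  private final int blockShift; // block size is 2^blockShift
  private final int[] blockMin; // first (smallest) doc Id of each block
  private final int[][] deltas; // per block: doc Id minus the block minimum

  BlockedOrdToDocSketch(int[] docIdPerOrd, int blockShift) {
    this.blockShift = blockShift;
    int blockSize = 1 << blockShift;
    int numBlocks = (docIdPerOrd.length + blockSize - 1) >>> blockShift;
    blockMin = new int[numBlocks];
    deltas = new int[numBlocks][];
    for (int b = 0; b < numBlocks; b++) {
      int start = b << blockShift;
      int end = Math.min(start + blockSize, docIdPerOrd.length);
      blockMin[b] = docIdPerOrd[start];
      deltas[b] = new int[end - start];
      for (int i = start; i < end; i++) {
        if (i > 0 && docIdPerOrd[i] < docIdPerOrd[i - 1]) {
          throw new IllegalArgumentException("Sequence must never decrease");
        }
        deltas[b][i - start] = docIdPerOrd[i] - blockMin[b];
      }
    }
  }

  // The ordinal selects the block (high bits) and the position inside it (low bits).
  int ordToDoc(int ord) {
    int block = ord >>> blockShift;
    int position = ord & ((1 << blockShift) - 1);
    return blockMin[block] + deltas[block][position];
  }

  public static void main(String[] args) {
    // Document 0 has two vectors, document 2 has one, document 5 has three, document 9 has one.
    int[] docIdPerOrd = {0, 0, 2, 5, 5, 5, 9};
    BlockedOrdToDocSketch map = new BlockedOrdToDocSketch(docIdPerOrd, 2);
    for (int ord = 0; ord < docIdPerOrd.length; ord++) {
      System.out.println("vector " + ord + " -> document " + map.ordToDoc(ord));
    }
  }
}
```

Because the sequence never decreases, the deltas within a block stay small, which is what makes a block-wise encoding compact.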
INDEXING - building HNSW Graph
NODE ID is the VECTOR ID
same as the sparse scenario, each node in the graph has an incremental ID aligned with the vector ID
● node count in the graph = vector count
● no code changes
QUERY TIME - Exact Search
Vector Scorer
(naive solution) all vectors are iterated, only the ones corresponding to an accepted doc are scored
● VectorScorer scores only the documents in BitSet acceptedDocs
● all vectors from ByteVectorValues/FloatVectorValues are iterated
● scores are updated with MAX/SUM
AbstractKnnVectorQuery
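The naive exact search can be sketched as a single pass over all vectors. This is not the code in AbstractKnnVectorQuery: the array-based inputs, the dot product similarity and the class name MultiValuedExactSearchSketch are assumptions made for this illustration, and java.util.BitSet stands in for Lucene's own bit set type.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Illustrative only: exact scoring of multi-valued vectors restricted to accepted documents.
final class MultiValuedExactSearchSketch {

  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  // vectors[ord] is the vector with ordinal ord and ordToDoc[ord] is its document Id.
  // When useSum is false the MAX strategy is applied.
  static Map<Integer, Float> score(
      float[][] vectors, int[] ordToDoc, BitSet acceptedDocs, float[] query, boolean useSum) {
    BinaryOperator<Float> combine = useSum ? Float::sum : Float::max;
    Map<Integer, Float> scores = new HashMap<>();
    for (int ord = 0; ord < vectors.length; ord++) {
      int docId = ordToDoc[ord];
      if (!acceptedDocs.get(docId)) {
        continue; // the vector belongs to a document that was filtered out
      }
      scores.merge(docId, dotProduct(vectors[ord], query), combine);
    }
    return scores;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 0f}, {0f, 1f}, {1f, 1f}, {2f, 0f}};
    int[] ordToDoc = {0, 0, 1, 2};
    BitSet accepted = new BitSet();
    accepted.set(0);
    accepted.set(2);
    System.out.println(score(vectors, ordToDoc, accepted, new float[] {1f, 1f}, false));
    System.out.println(score(vectors, ordToDoc, accepted, new float[] {1f, 1f}, true));
  }
}
```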
QUERY TIME - Approximate Search
HNSW SEARCH
searching on vectors (graph nodes) and returning documents (max/sum score)
● searching on level != 0 -> vectors are added as candidates/results
● searching on level = 0 -> the document ID is added to the results
● int docId = vectors.ordToDoc(vectorId);
● results are added to the NeighborQueue
HnswGraphSearcher
QUERY TIME - NeighborQueue
TOP-K DOCUMENTS (NeighborQueue)
data structure used to collect the top-k results as a long heap
● MIN HEAP
● each element is a long: [32 bits][32 bits] -> [score][~document Id]
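The [score][~document Id] layout can be sketched by packing both values into one long, so that comparing two longs compares the scores first. The class below is a standalone illustration with hypothetical names. It re-implements an order-preserving float to int conversion locally (Lucene ships a similar helper in NumericUtils) so that the example has no dependencies.

```java
// Illustrative only: packing a score and a complemented Id into a single long.
final class ScoreIdEncodingSketch {

  // Maps a float to an int with the same ordering. The transformation is its own inverse.
  static int sortableFloatBits(int bits) {
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // High 32 bits: sortable score. Low 32 bits: bitwise complement of the Id.
  static long encode(int id, float score) {
    int sortableScore = sortableFloatBits(Float.floatToIntBits(score));
    return (((long) sortableScore) << 32) | (0xFFFFFFFFL & ~id);
  }

  static float decodeScore(long encoded) {
    return Float.intBitsToFloat(sortableFloatBits((int) (encoded >> 32)));
  }

  static int decodeId(long encoded) {
    return (int) ~encoded; // complementing again restores the Id in the low 32 bits
  }

  public static void main(String[] args) {
    long a = encode(5, 0.9f);
    long b = encode(7, 0.5f);
    System.out.println("higher score gives larger long: " + (a > b));
    System.out.println("decoded: id=" + decodeId(a) + " score=" + decodeScore(a));
  }
}
```

Storing the complement of the Id means that, for equal scores, the element with the smaller Id compares as the larger long.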
● a nodeIdToHeapIndex cache is used to keep track of each node's position in the heap
● the score is updated for the node (MAX/SUM)
● DOWNHEAP (as the ranking may have improved)
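The update path can be sketched with a bounded min heap that remembers where each document sits. This is a simplified illustration using the MAX strategy, not the NeighborQueue from the pull request: it keeps Ids and scores in parallel arrays instead of encoded longs, and every name in it is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: bounded min heap of documents with in-place score updates (MAX strategy).
final class UpdatableMinHeapSketch {

  private final int[] docIds;
  private final float[] scores;
  private final Map<Integer, Integer> docIdToHeapIndex = new HashMap<>();
  private int size = 0;

  UpdatableMinHeapSketch(int capacity) {
    if (capacity < 1) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    docIds = new int[capacity];
    scores = new float[capacity];
  }

  // Collects one scored vector that belongs to docId.
  void collect(int docId, float score) {
    Integer index = docIdToHeapIndex.get(docId);
    if (index != null) {
      if (score > scores[index]) {
        scores[index] = score;
        downHeap(index); // a larger score can only move away from the root of a min heap
      }
    } else if (size < docIds.length) {
      place(size, docId, score);
      size++;
      upHeap(size - 1);
    } else if (score > scores[0]) {
      docIdToHeapIndex.remove(docIds[0]); // evict the current worst document
      place(0, docId, score);
      downHeap(0);
    }
  }

  private void place(int index, int docId, float score) {
    docIds[index] = docId;
    scores[index] = score;
    docIdToHeapIndex.put(docId, index);
  }

  private void swap(int i, int j) {
    int docId = docIds[i];
    float score = scores[i];
    place(i, docIds[j], scores[j]);
    place(j, docId, score);
  }

  private void upHeap(int i) {
    while (i > 0) {
      int parent = (i - 1) / 2;
      if (scores[i] >= scores[parent]) {
        return;
      }
      swap(i, parent);
      i = parent;
    }
  }

  private void downHeap(int i) {
    while (true) {
      int left = 2 * i + 1;
      int right = left + 1;
      int smallest = i;
      if (left < size && scores[left] < scores[smallest]) {
        smallest = left;
      }
      if (right < size && scores[right] < scores[smallest]) {
        smallest = right;
      }
      if (smallest == i) {
        return;
      }
      swap(i, smallest);
      i = smallest;
    }
  }

  int size() {
    return size;
  }

  int minDocId() {
    return docIds[0];
  }

  // Returns the current score of docId, or null if the document is not in the heap.
  Float scoreOf(int docId) {
    Integer index = docIdToHeapIndex.get(docId);
    return index == null ? null : scores[index];
  }
}
```

For example, with capacity 2, collecting (doc 1, 0.2) and (doc 2, 0.5) and then a second vector (doc 1, 0.9) moves doc 1 away from the root, so doc 2 becomes the weakest entry and is the next one to be evicted.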
Challenges - side project and merging
● to build the first prototype -> 1 year
● super active area -> constant merging
● Lucene codecs change names and the old codec is moved to the backward codecs
● a DIFF of 85 classes!
○ simplified by temporarily removing MAX/SUM
○ simplified by temporarily removing the separate code branches for single/multi-valued
○ down to 25 classes!
○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi and Josh Devins for the first reviews
WRAPPING UP
Why Multi-valued?
HNSW and modifications
Index time internals
Challenges of a contribution
Query time internals
DO YOU WANT TO MAKE IT HAPPEN?
HELP WITH CODE
Pull Request
HELP WITH FUNDING
info@sease.io
THANK YOU!
@seaseltd @sease-ltd @seaseltd @sease_ltd

More Related Content

What's hot

Intro to FIS GT.M
Intro to FIS GT.MIntro to FIS GT.M
Intro to FIS GT.M
QueEsBhaskar
 
Mobile Edge Computing
Mobile Edge ComputingMobile Edge Computing
Mobile Edge Computing
M2M Alliance e.V.
 
Big data
Big dataBig data
Big data
Nausheen Hasan
 
Edge Computing
Edge ComputingEdge Computing
Edge Computing
Chetan Kumar S
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
sunera pathan
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Ravi Teja
 
Nano computing
Nano computingNano computing
Nano computing
manpreetgrewal
 
Edge computing
Edge computingEdge computing
Edge computing
AbhayDhupar
 
Cloud Computing Ppt
Cloud Computing PptCloud Computing Ppt
Cloud Computing Ppt
Anjoum .
 
FOG COMPUTING
FOG COMPUTINGFOG COMPUTING
FOG COMPUTING
Saisharan Amaravadhi
 
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker NotesDeveloper Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Max De Marzi
 
Mongo Nosql CRUD Operations
Mongo Nosql CRUD OperationsMongo Nosql CRUD Operations
Mongo Nosql CRUD Operations
anujaggarwal49
 
IoT Communication Protocols
IoT Communication ProtocolsIoT Communication Protocols
IoT Communication Protocols
Pradeep Kumar TS
 
Big Data
Big DataBig Data
Big Data
Vinayak Kamath
 
Parasitic computing
Parasitic computingParasitic computing
Parasitic computing
Aritra Mukherjee
 
Edge Computing
Edge ComputingEdge Computing
Edge Computing
Vikas Yadav
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
Amazon Web Services
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
RojaT4
 
Fog computing technology
Fog computing technologyFog computing technology
Fog computing technology
Nikhil Sabu
 
Edge computing
Edge computingEdge computing
Edge computing
Biddut Hossain
 

What's hot (20)

Intro to FIS GT.M
Intro to FIS GT.MIntro to FIS GT.M
Intro to FIS GT.M
 
Mobile Edge Computing
Mobile Edge ComputingMobile Edge Computing
Mobile Edge Computing
 
Big data
Big dataBig data
Big data
 
Edge Computing
Edge ComputingEdge Computing
Edge Computing
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Nano computing
Nano computingNano computing
Nano computing
 
Edge computing
Edge computingEdge computing
Edge computing
 
Cloud Computing Ppt
Cloud Computing PptCloud Computing Ppt
Cloud Computing Ppt
 
FOG COMPUTING
FOG COMPUTINGFOG COMPUTING
FOG COMPUTING
 
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker NotesDeveloper Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker Notes
 
Mongo Nosql CRUD Operations
Mongo Nosql CRUD OperationsMongo Nosql CRUD Operations
Mongo Nosql CRUD Operations
 
IoT Communication Protocols
IoT Communication ProtocolsIoT Communication Protocols
IoT Communication Protocols
 
Big Data
Big DataBig Data
Big Data
 
Parasitic computing
Parasitic computingParasitic computing
Parasitic computing
 
Edge Computing
Edge ComputingEdge Computing
Edge Computing
 
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS (WPS...
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Fog computing technology
Fog computing technologyFog computing technology
Fog computing technology
 
Edge computing
Edge computingEdge computing
Edge computing
 

Similar to Introducing Multi Valued Vectors Fields in Apache Lucene

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
Sease
 
Which Questions We Should Have
Which Questions We Should HaveWhich Questions We Should Have
Which Questions We Should Have
Oracle Korea
 
Domain driven design: a gentle introduction
Domain driven design:  a gentle introductionDomain driven design:  a gentle introduction
Domain driven design: a gentle introduction
Asher Sterkin
 
Why Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and ReliabilityWhy Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and Reliability
DevOps.com
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
LDBC council
 
Data analysis
Data analysisData analysis
Data analysis
AnandDesshpande
 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywords
swathi78
 
High Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetHigh Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data Set
Parag Ahire
 
big_data_case_studies.pdf
big_data_case_studies.pdfbig_data_case_studies.pdf
big_data_case_studies.pdf
vishal choudhary
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Demi Ben-Ari
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Neo4j
 
GraphTour 2020 - Neo4j: What's New?
GraphTour 2020 - Neo4j: What's New?GraphTour 2020 - Neo4j: What's New?
GraphTour 2020 - Neo4j: What's New?
Neo4j
 
Mr bi
Mr biMr bi
Mr bi
renjan131
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
Doug Needham
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Simon Hughes
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016
Mark Smith
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
Kumari Surabhi
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_Sharmila
Nithin Kakkireni
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentation
renjan131
 
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014:  Social Network Benchmark (SNB) Graph GeneratorFOSDEM 2014:  Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
LDBC council
 

Similar to Introducing Multi Valued Vectors Fields in Apache Lucene (20)

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
 
Which Questions We Should Have
Which Questions We Should HaveWhich Questions We Should Have
Which Questions We Should Have
 
Domain driven design: a gentle introduction
Domain driven design:  a gentle introductionDomain driven design:  a gentle introduction
Domain driven design: a gentle introduction
 
Why Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and ReliabilityWhy Distributed Tracing is Essential for Performance and Reliability
Why Distributed Tracing is Essential for Performance and Reliability
 
LDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status updateLDBC 8th TUC Meeting: Introduction and status update
LDBC 8th TUC Meeting: Introduction and status update
 
Data analysis
Data analysisData analysis
Data analysis
 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywords
 
High Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data SetHigh Performance Computing on NYC Yellow Taxi Data Set
High Performance Computing on NYC Yellow Taxi Data Set
 
big_data_case_studies.pdf
big_data_case_studies.pdfbig_data_case_studies.pdf
big_data_case_studies.pdf
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
 
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
 
GraphTour 2020 - Neo4j: What's New?
GraphTour 2020 - Neo4j: What's New?GraphTour 2020 - Neo4j: What's New?
GraphTour 2020 - Neo4j: What's New?
 
Mr bi
Mr biMr bi
Mr bi
 
Data Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZoneData Structure Graph DMZ #DMZone
Data Structure Graph DMZ #DMZone
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016
 
BigData Hadoop
BigData Hadoop BigData Hadoop
BigData Hadoop
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_Sharmila
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentation
 
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014:  Social Network Benchmark (SNB) Graph GeneratorFOSDEM 2014:  Social Network Benchmark (SNB) Graph Generator
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator
 

More from Sease

Hybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank FusionHybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank Fusion
Sease
 
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache SolrBlazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Sease
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
Sease
 
Hybrid Search With Apache Solr
Hybrid Search With Apache SolrHybrid Search With Apache Solr
Hybrid Search With Apache Solr
Sease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
Sease
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
Sease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
Sease
 
Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
Sease
 
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
Sease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Sease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
Sease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
Sease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Sease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
Sease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
Sease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
Sease
 

More from Sease (20)

Hybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank FusionHybrid Search with Apache Solr Reciprocal Rank Fusion
Hybrid Search with Apache Solr Reciprocal Rank Fusion
 
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache SolrBlazing-Fast Serverless MapReduce Indexer for Apache Solr
Blazing-Fast Serverless MapReduce Indexer for Apache Solr
 
From Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMsFrom Natural Language to Structured Solr Queries using LLMs
From Natural Language to Structured Solr Queries using LLMs
 
Hybrid Search With Apache Solr
Hybrid Search With Apache SolrHybrid Search With Apache Solr
Hybrid Search With Apache Solr
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Neural Search Comes to Apache Solr
Neural Search Comes to Apache SolrNeural Search Comes to Apache Solr
Neural Search Comes to Apache Solr
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Dense Retrieval with Apache Solr Neural Search.pdf
Introducing Multi Valued Vectors Fields in Apache Lucene

  • 1. Berlin Buzzwords, 19/06/2023. Alessandro Benedetti, Director @ Sease. Introducing Multi-valued Vector Fields in Apache Lucene
  • 2. WHO AM I? ALESSANDRO BENEDETTI ‣ Born in Tarquinia (ancient Etruscan city in Italy) ‣ R&D Software Engineer ‣ Director ‣ Master's degree in Computer Science ‣ PC member for ECIR, SIGIR and Desires ‣ Apache Lucene/Solr PMC member/committer ‣ Elasticsearch/OpenSearch expert ‣ Passionate about semantic search, NLP and Machine Learning technologies ‣ Beach volleyball player and snowboarder
  • 3. SEArch SErvices (www.sease.io) ‣ Headquartered in London, distributed team ‣ Open-source enthusiasts ‣ Apache Lucene/Solr experts ‣ Elasticsearch/OpenSearch experts ‣ Community contributors ‣ Active researchers ‣ Hot trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
  • 4. AGENDA ● Why Multi-valued? ● HNSW and modifications ● Index time internals ● Challenges of a contribution ● Query time internals
  • 5. WHY MULTI-VALUED - What can you do now? The text content of a field exceeds the maximum number of characters accepted by your inference model (to encode vectors). 1. Split the content into paragraphs across multiple documents. 2. Your unit of information becomes the paragraph. 3. When returning the results you need to aggregate back to documents.
  • 6. WHY MULTI-VALUED 1. Split the content into paragraphs across multiple documents. 2. Your unit of information becomes the paragraph. 3. When returning the results you need to aggregate back to documents. ● Indexing time: nested documents (slow/expensive) ● Indexing time: flattened documents (redundant data) ● Query time: parent-child join queries? (slow/expensive) ● Query time: collapsing/grouping ● Aggregations: faceting becomes more complicated ● Stats: aggregating data and calculating stats is impacted
  • 7. WHY MULTI-VALUED ● This actually applies to all fields and field types ● You may be OK applying those strategies … ● … but for some users they may be quite annoying and expensive
  • 8. What does it mean to bring multi-valued to vectors? ● K Nearest Neighbour algorithm? ● Indexing data structures and approach? ● Query time data structures and approach?
  • 9. ANN - Approximate Nearest Neighbor ● Exact Nearest Neighbor is expensive! (1 vs 1 vector distance) ● It's fine to lose some accuracy to get a massive performance gain ● Pre-process the dataset to build index data structures ● Generally vectors are modelled in: ○ Trees ○ Hashes ○ Graphs - HNSW
  • 10. HNSW - Hierarchical Navigable Small World graphs. Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor search (ANN). References: https://doi.org/10.1016/j.is.2013.10.006 and https://arxiv.org/abs/1603.09320
  • 11. HNSW - How it works in a nutshell ● Proximity graph ● Vertices are vectors, closer vectors are linked ● Hierarchical layers based on skip lists ○ longer edges in higher layers (fast retrieval) ○ shorter edges in lower layers (accuracy) ● Each layer is a Navigable Small World graph ○ greedy search for the closest friend (local minimum) ○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search) ○ move down a layer to refine the minimum (closest friend)
  • 12. HNSW - Skip Lists ● the higher the layer, the sparser it is ● descend through the layers while searching ● fast to search and insert
  • 13. HNSW - Small World Graphs ● start from the entry point ● greedy search (each time, the distance is calculated across friends) ● start zoomed out (low degree) and move to zoomed in (high degree) ● when building the graph, a higher average degree improves quality at a cost. Image from https://www.pinecone.io/learn/hnsw/
  • 14. HNSW - Index time ● add one vector at a time ● probability to enter layer N ● when added to a layer, it also goes to all the layers below -> identify the layer(s) of insertion ● starting from the top, the closest neighbour (topK=1) is identified ● we descend and repeat until the layer of insertion ● topK=ef_construction is used to identify neighbour candidates ● M neighbours are linked (the easiest way is to calculate the exact distance). Image from https://www.pinecone.io/learn/hnsw/ Multi-Valued - each node is not a document - multiple vectors per document Id
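The "probability to enter layer N" step can be sketched as follows. This is a minimal illustration assuming the level-assignment rule from the HNSW paper referenced on slide 10 (level = floor(-ln(U) * mL), with mL = 1 / ln(M)); the class and method names are hypothetical and this is not Lucene's code.

```java
import java.util.Random;

/** Hypothetical sketch of HNSW layer assignment; not Lucene's implementation. */
final class HnswLevelSketch {

  /**
   * Draws the top layer for a new node as floor(-ln(U) * mL), with U uniform in (0, 1].
   * With mL = 1 / ln(maxConn), each extra layer is maxConn times less likely than the one below.
   * The node is then inserted in that layer and in every layer below it.
   */
  static int randomLevel(Random random, int maxConn) {
    double mL = 1.0 / Math.log(maxConn);
    double u = 1.0 - random.nextDouble(); // nextDouble() is in [0, 1), so u is in (0, 1]
    return (int) Math.floor(-Math.log(u) * mL);
  }

  /** Counts how many of n simulated nodes get each top layer. */
  static int[] levelHistogram(int n, int maxConn, long seed) {
    Random random = new Random(seed);
    int[] counts = new int[64];
    for (int i = 0; i < n; i++) {
      counts[Math.min(randomLevel(random, maxConn), counts.length - 1)]++;
    }
    return counts;
  }
}
```

With M = 16, roughly 15 nodes out of 16 should stay on layer 0 only, which is why the upper layers are sparse and cheap to traverse.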
  • 15. HNSW - Search time ● Start from layer N (top) ○ longer edges in higher layers (fast retrieval) ○ shorter edges in lower layers (accuracy) ● Each layer is a Navigable Small World graph ○ greedy search for the closest friend (local minimum) ○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search) ○ move down a layer to refine the minimum (closest friend). Multi-Valued - the same document Id may be added to the top-K results multiple times
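The greedy walk on a single layer can be sketched as below. This is a simplified illustration under stated assumptions: the graph is a plain adjacency array, the distance is squared Euclidean, and only the single closest node is tracked (no ef-sized candidate list). All names are hypothetical.

```java
/** Hypothetical sketch of the greedy search on one layer of a proximity graph. */
final class GreedyLayerSearchSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Starting from entryPoint, repeatedly moves to the neighbour closest to the query
   * and stops when no neighbour is closer than the current node (a local minimum).
   */
  static int greedySearch(float[][] vectors, int[][] neighbours, int entryPoint, float[] query) {
    int current = entryPoint;
    float best = squaredDistance(vectors[current], query);
    while (true) {
      int next = current;
      for (int candidate : neighbours[current]) {
        float d = squaredDistance(vectors[candidate], query);
        if (d < best) {
          best = d;
          next = candidate;
        }
      }
      if (next == current) {
        return current; // no friend is closer: local minimum reached
      }
      current = next;
    }
  }
}
```

In the hierarchical setting, the node returned for one layer would be used as the entry point of the layer below.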
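  • 16. HNSW - MAX/SUM approach ● MAX: when adding a vector from a document that is already in the results, you update the document score with the maximum of the two scores ● SUM: when adding a vector from a document that is already in the results, you update the document score by summing the new score to it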
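The two strategies can be sketched with a plain map from document Id to score. This is only an illustration of the aggregation rule; the class, enum and method names are hypothetical, and the real collection happens in a heap rather than a `HashMap`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of MAX/SUM aggregation of vector scores into document scores. */
final class DocScoreAggregationSketch {

  enum Mode { MAX, SUM }

  /** vectorDocIds[i] is the document owning the i-th matched vector, vectorScores[i] its score. */
  static Map<Integer, Float> aggregate(int[] vectorDocIds, float[] vectorScores, Mode mode) {
    Map<Integer, Float> docScores = new HashMap<>();
    for (int i = 0; i < vectorDocIds.length; i++) {
      if (mode == Mode.MAX) {
        docScores.merge(vectorDocIds[i], vectorScores[i], Float::max); // keep the best vector
      } else {
        docScores.merge(vectorDocIds[i], vectorScores[i], Float::sum); // accumulate all vectors
      }
    }
    return docScores;
  }
}
```

MAX rewards a document for its single best-matching passage, while SUM favours documents with many matching passages.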
  • 17. Apache Lucene ● Nov 2020 - Apache Lucene 9.0 - Dedicated File Format for Navigable Small World Graphs - https://issues.apache.org/jira/browse/LUCENE-9004 ● Jan 2022 - Apache Lucene 9.0 - Handle Document Deletions - https://issues.apache.org/jira/browse/LUCENE-10040 ● Feb 2022 - Apache Lucene 9.1 - Introduced Hierarchy in HNSW - https://issues.apache.org/jira/browse/LUCENE-10054 ● Mar 2022 - Apache Lucene 9.1 - Re-use data structures across HNSW Graph - https://issues.apache.org/jira/browse/LUCENE-10391 ● Mar 2022 - Apache Lucene 9.1 - Pre filters with KNN queries - https://issues.apache.org/jira/browse/LUCENE-10382 ● Aug 2022 - Apache Lucene 9.4 - 8 bits vector quantization - https://github.com/apache/lucene/issues/11613 ● JIRA ISSUES: https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search ● GITHUB ISSUES: https://github.com/apache/lucene/labels/vector-based-search
  • 19. INDEXING - Auxiliary Data Structures. MAP VECTOR IDS TO DOCUMENT IDS: in a multi-valued scenario multiple vectors may belong to the same document ● leverage sparse support ● ordinal (vector Id) to document (document Id) map ● DocsWithVectorsSet to keep track of vectors per document ● DirectMonotonicWriter to write the map. Lucene95HnswVectorsWriter: write auxiliary data structures
  • 20. INDEXING - DocsWithVectorsSet. DocsWithVectorsSet: accumulator of documents that have vectors, used by the HnswVectorsWriter when writing data ● compatible with single-valued dense/sparse scenarios ● keeps a stack of vectors per document ● able to return a count of vectors for each document
  • 21. INDEXING - DirectMonotonicWriter. DirectMonotonicWriter: writes a sequence of monotonically increasing (never decreasing) integers, in blocks ● each integer is a document Id ● the same document Id is repeated for each vector in the document ● the DirectMonotonicReader ordToDoc is then used at reading time in the SparseOffHeapVectorValues ● public int ordToDoc(int ord) -> the ordinal (vector Id) is the index used to access the block and the position within the block, to finally get the document Id
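The ordinal-to-document map can be pictured as a non-decreasing sequence of document Ids, one entry per vector. The sketch below uses a plain `int[]` as a stand-in for the block-compressed sequence written by DirectMonotonicWriter; the class name and constructor are hypothetical and only illustrate the lookup.

```java
/** Hypothetical in-memory stand-in for the ordinal (vector Id) to document Id map. */
final class OrdToDocSketch {

  // Position = vector ordinal, value = document Id. The sequence never decreases.
  private final int[] ordToDoc;

  /** vectorsPerDoc[d] is the number of vectors indexed for document d (0 if it has none). */
  OrdToDocSketch(int[] vectorsPerDoc) {
    int total = 0;
    for (int count : vectorsPerDoc) {
      total += count;
    }
    ordToDoc = new int[total];
    int ord = 0;
    for (int docId = 0; docId < vectorsPerDoc.length; docId++) {
      for (int i = 0; i < vectorsPerDoc[docId]; i++) {
        ordToDoc[ord++] = docId; // same document Id repeated once per vector
      }
    }
  }

  /** Returns the document that owns the vector with the given ordinal. */
  int ordToDoc(int ord) {
    return ordToDoc[ord];
  }

  /** Total number of vectors, which is also the number of graph nodes. */
  int size() {
    return ordToDoc.length;
  }
}
```

For example, with two vectors for document 0, none for document 1 and three for document 2, the stored sequence is 0, 0, 2, 2, 2.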
  • 22. INDEXING - Building the HNSW Graph. NODE ID is the VECTOR ID: same as the sparse scenario, each node in the graph has an incremental ID aligned with the vector ID ● the node count in the graph = the vector count ● no code changes
  • 23. QUERY TIME - Exact Search. Vector Scorer (naive solution): all vectors are iterated, only the ones corresponding to an accepted doc are scored ● VectorScorer scores only BitSet acceptedDocs ● all vectors from ByteVectorValues/FloatVectorValues are iterated ● scores are updated with MAX/SUM. Class: AbstractKnnVectorQuery
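The naive exact search can be sketched as a single pass over all vectors. The example below is an assumption-laden illustration: it uses `java.util.BitSet` (not Lucene's BitSet), dot product as the similarity, MAX as the aggregation, and hypothetical names throughout.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of exact multi-valued scoring restricted to accepted documents. */
final class ExactMultiValuedSearchSketch {

  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /**
   * Iterates every vector, skips the ones whose document is not accepted,
   * and keeps the best (MAX) vector score per document.
   */
  static Map<Integer, Float> score(
      float[][] vectors, int[] ordToDoc, BitSet acceptedDocs, float[] query) {
    Map<Integer, Float> docScores = new HashMap<>();
    for (int ord = 0; ord < vectors.length; ord++) {
      int docId = ordToDoc[ord];
      if (!acceptedDocs.get(docId)) {
        continue; // filtered out: the vector is iterated but never scored
      }
      docScores.merge(docId, dotProduct(vectors[ord], query), Float::max);
    }
    return docScores;
  }
}
```

Swapping `Float::max` for `Float::sum` would give the SUM variant.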
  • 24. QUERY TIME - Approximate Search. HNSW SEARCH: searching on vectors (graph nodes) and returning documents (max/sum score) ● searching on level != 0 -> vectors are added as candidates/results ● searching on level = 0 -> the document ID is added to the results ● int docId = vectors.ordToDoc(vectorId); ● results are added to the NeighborQueue. Class: HnswGraphSearcher
  • 25. QUERY TIME - NeighborQueue. TOP-K DOCUMENTS (NeighborQueue): data structure used to collect the top-k results as a long heap ● MIN HEAP ● each element is a long [32 bits][32 bits] -> [score][~document Id]
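Packing a score and an Id into one long lets a heap of primitives order entries by score with a single comparison. The sketch below shows one way to do it, assuming the score is stored in the high 32 bits as an order-preserving int and the bitwise complement of the Id in the low 32 bits, as the slide layout suggests; the class and helper names are hypothetical.

```java
/** Hypothetical sketch of the [score][~id] long encoding. */
final class NeighborEncodingSketch {

  /** Maps a float to an int whose signed ordering matches the float ordering. */
  static int floatToSortableInt(float value) {
    int bits = Float.floatToIntBits(value);
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  /** Inverse of floatToSortableInt. */
  static float sortableIntToFloat(int encoded) {
    return Float.intBitsToFloat(encoded ^ ((encoded >> 31) & 0x7fffffff));
  }

  /** High 32 bits: sortable score. Low 32 bits: complement of the Id. */
  static long encode(int id, float score) {
    return (((long) floatToSortableInt(score)) << 32) | (0xFFFFFFFFL & ~id);
  }

  static float decodeScore(long encoded) {
    return sortableIntToFloat((int) (encoded >> 32));
  }

  static int decodeId(long encoded) {
    return (int) ~encoded;
  }
}
```

Because the Id is complemented, two entries with the same score compare so that the lower Id yields the larger long.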
  • 26. QUERY TIME - NeighborQueue. TOP-K DOCUMENTS (NeighborQueue): data structure used to collect the top-k results as a long heap ● the nodeIdToHeapIndex cache is used to keep track of each node's position in the heap ● the score is updated for the node (MAX/SUM) ● DOWNHEAP (as the ranking may have improved)
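Updating the score of a document that is already in the heap can be sketched as follows. This is a simplified illustration, not Lucene's NeighborQueue: it stores Ids and scores in parallel arrays instead of packed longs, supports only the MAX strategy, has a fixed capacity and omits evicting the minimum when full. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical min-heap of (docId, score) that can update a document already present. */
final class UpdatableMinHeapSketch {
  private final int[] docs;
  private final float[] scores;
  private final Map<Integer, Integer> docToHeapIndex = new HashMap<>(); // position cache
  private int size;

  UpdatableMinHeapSketch(int capacity) {
    docs = new int[capacity];
    scores = new float[capacity];
  }

  /** Adds a vector score for docId; if the document is already present, keeps the max. */
  void addMax(int docId, float score) {
    Integer index = docToHeapIndex.get(docId);
    if (index == null) {
      int inserted = size++;
      place(inserted, docId, score);
      upHeap(inserted);
    } else if (score > scores[index]) {
      scores[index] = score;
      downHeap(index); // in a min-heap, a larger score can only move away from the root
    }
  }

  int size() { return size; }
  int worstDoc() { return docs[0]; } // root = lowest-scoring document
  float scoreOf(int docId) { return scores[docToHeapIndex.get(docId)]; }

  private void place(int index, int docId, float score) {
    docs[index] = docId;
    scores[index] = score;
    docToHeapIndex.put(docId, index);
  }

  private void swap(int i, int j) {
    int doc = docs[i];
    float score = scores[i];
    place(i, docs[j], scores[j]);
    place(j, doc, score);
  }

  private void upHeap(int i) {
    while (i > 0 && scores[i] < scores[(i - 1) / 2]) {
      swap(i, (i - 1) / 2);
      i = (i - 1) / 2;
    }
  }

  private void downHeap(int i) {
    while (true) {
      int smallest = i;
      int left = 2 * i + 1;
      int right = left + 1;
      if (left < size && scores[left] < scores[smallest]) smallest = left;
      if (right < size && scores[right] < scores[smallest]) smallest = right;
      if (smallest == i) return;
      swap(i, smallest);
      i = smallest;
    }
  }
}
```

The position cache is what makes the update cheap: without it, finding the document inside the heap would require a linear scan.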
  • 27. Challenges - side project and merging ● building the first prototype took 1 year ● super active area -> merging ● Lucene codecs change names and the old codec is moved back to the backwards codecs ● a diff of 85 classes! ○ simplified by temporarily removing MAX/SUM ○ simplified by temporarily removing separate code branches for single/multi-valued ○ down to 25 classes! ○ thanks to Benjamin Trent, Mayya Sharipova, Jim Ferenczi, Josh Devins for the first reviews
  • 28. WRAPPING UP ● Why Multi-valued? ● HNSW and modifications ● Index time internals ● Challenges of a contribution ● Query time internals
  • 29. DO YOU WANT TO MAKE IT HAPPEN? ● HELP WITH CODE: Pull Request ● HELP WITH FUNDING: info@sease.io
  • 30. THANK YOU! @seaseltd @sease-ltd @seaseltd @sease_ltd