Parallel SQL and Streaming Expressions in Apache Solr 6
Shalin Shekhar Mangar
@shalinmangar
Lucidworks Inc.
Introduction
• Shalin Shekhar Mangar
• Lucene/Solr Committer
• PMC Member
• Senior Solr Consultant with Lucidworks Inc.
The standard for enterprise search.
90% of the Fortune 500 uses Solr.
Solr Key Features
• Full text search (information retrieval)
• Facets/guided navigation galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and joins
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive scale/fault tolerance
Why SQL
• Simple, well-known interface to data inside Solr
• Hides the complexity of Solr and its various features
• Possible to optimise the query plan according to best-practices
automatically
• Distributed Joins done simply and well
Solr 6: Parallel SQL
• Parallel execution of SQL across SolrCloud collections
• Compiled to SolrJ Streaming API (TupleStream) which is a general
purpose parallel computing framework for Solr
• Executed in parallel over SolrCloud worker nodes
• SolrCloud collections are relational ‘tables’
• JDBC thin client as a SolrJ client
Solr’s SQL Interface
SQL Interface at a glance
• SQL over Map/Reduce — for high cardinality aggregations and
distributed joins
• SQL over Facets — high performance, moderate cardinality
aggregations
• SQL with Solr powered search queries
• Fully integrated with SolrCloud
• SQL over JDBC or HTTP — http://host:port/solr/collection1/sql
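As a rough illustration of the HTTP route, the sketch below prepares (but does not send) a POST to a collection's /sql handler. It assumes the SQL text travels in a `stmt` parameter and that `aggregationMode` can be passed on the URL. The host, port, and the helper name `build_sql_request` are placeholders made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(solr_base_url, collection, stmt, aggregation_mode="map_reduce"):
    """Prepare (without sending) a POST request for a collection's /sql handler."""
    # aggregationMode goes on the URL; it selects map_reduce or facet.
    url = f"{solr_base_url}/{collection}/sql?" + urlencode(
        {"aggregationMode": aggregation_mode}
    )
    # The SQL statement itself is form-encoded into the request body.
    body = urlencode({"stmt": stmt}).encode("utf-8")
    return Request(url, data=body, method="POST")


request = build_sql_request(
    "http://localhost:8983/solr",  # assumed address of a local Solr node
    "IMDB",
    "select movie, director from IMDB limit 100",
    aggregation_mode="facet",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would send it to a running SolrCloud node; that step is left out so the sketch runs without a cluster.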
Limited vs Unlimited SELECT
• select movie, director from IMDB
Returns the entire result set! Returned fields must have DocValues enabled
• select movie, director from IMDB limit 100
Returns the specified number of records. Can sort by score and
retrieve any stored field
• select movie, director from IMDB order by rating desc, num_voters
desc
Search predicates
• select movie, director from IMDB where actor = 'bruce'
• select movie, director from IMDB where actor = '(bruce tom)'
• select movie, director from IMDB where rating = '[8 TO *]'
• select movie, director from IMDB where (actor = '(bruce tom)' AND
rating = '[8 TO *]')
Search predicates are Solr queries specified inside single-quotes
Can specify arbitrary boolean clauses
Select DISTINCT
• select distinct actor_name from IMDB
• Map/Reduce implementation — Tuples are shuffled to worker
nodes and operation is performed by workers
• JSON Facet implementation — operation is ‘pushed down’ to Solr
Stats aggregations
• select count(*), sum(num_voters) from IMDB
• Computed using Solr’s StatsComponent under the hood
• count, sum, avg, min, max are the supported aggregations
• Always pushed down into the search engine
GROUP BY Aggregations
• select actor_name, director, count(*), sum(num_voters) from IMDB
group by actor_name, director having count(*) > 5 and
sum(num_voters) > 1000 order by sum(num_voters) desc
• Has a map/reduce implementation (shuffle) and a JSON Facet
implementation (push down)
• Multi-dimensional, high cardinality aggregations are possible with
the map/reduce implementation
JDBC
• Part of SolrJ
• SolrCloud Aware Load Balancing
• Connection has an ‘aggregationMode’ parameter that switches
between map_reduce and facet
• jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME&aggregationMode=facet
Inside Parallel SQL
Solr’s Parallel Computing Framework
• Streaming API
• Streaming Expressions
• Shuffling
• Worker collections
• Parallel SQL
Streaming API
• Java API for parallel computation
• Real-time Map/Reduce and Parallel Relational Algebra
• Search results are streams of tuples (TupleStream)
• Transformed in parallel by Decorator streams
• Transformations include group by, rollup, union, intersection,
complement, joins
• org.apache.solr.client.solrj.io.*
Streaming API
• Streaming Transformation
Operations that transform the underlying streams e.g. unique,
group by, rollup, union, intersection, complement, join etc
• Streaming Aggregation
Operations that gather metrics and compute aggregates e.g. sum,
count, average, min, max etc
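The decorator idea can be sketched outside Solr with plain Python generators. This is not the SolrJ API: `search`, `unique`, and `rollup` below are hypothetical stand-ins that only mimic the shape of a stream source wrapped by transformation and aggregation decorators. Both decorators rely on the input already being sorted on the grouping field, which lets them work in a single pass.

```python
from itertools import groupby


def search(docs, sort_key):
    """Stand-in stream source: yields tuples (dicts) sorted on sort_key."""
    yield from sorted(docs, key=lambda t: t[sort_key])


def unique(stream, over):
    """Transformation: keep the first tuple for each value of `over`."""
    for _, group in groupby(stream, key=lambda t: t[over]):
        yield next(group)


def rollup(stream, over, sum_field):
    """Aggregation: one tuple per `over` value with a count and a sum."""
    for key, group in groupby(stream, key=lambda t: t[over]):
        count, total = 0, 0
        for t in group:
            count += 1
            total += t[sum_field]
        yield {over: key, "count(*)": count, f"sum({sum_field})": total}


# Made-up sample data, for illustration only.
docs = [
    {"actor_name": "tom", "movie": "m1", "num_voters": 300},
    {"actor_name": "bruce", "movie": "m2", "num_voters": 500},
    {"actor_name": "tom", "movie": "m3", "num_voters": 200},
    {"actor_name": "bruce", "movie": "m4", "num_voters": 100},
]

distinct_actors = [
    t["actor_name"] for t in unique(search(docs, "actor_name"), over="actor_name")
]
totals = list(rollup(search(docs, "actor_name"), over="actor_name", sum_field="num_voters"))
print(distinct_actors)
print(totals)
```

Because each decorator consumes and yields one tuple at a time, decorators can be nested without materialising the whole result set in memory.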
Streaming Expressions
• String Query Language and Serialisation format for the Streaming
API
• Streaming expressions compile to TupleStream
• TupleStreams serialise to Streaming Expressions
• Human friendly syntax for Streaming API accessible to non-Java
folks as well
• Can be used directly via HTTP or through SolrJ
Streaming Expressions
• Stream Sources
The origin of a TupleStream
search, jdbc, facet, stats, topic
• Stream Decorators
Wrap other stream functions and perform operations on the stream
complement, hashJoin, innerJoin, merge, intersect, top, unique
• Many streams can be parallelized across worker collections
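To make the nesting of sources and decorators concrete, the sketch below assembles a streaming expression as a plain string. The builder functions are hypothetical helpers written for this example, and the collection names (`IMDB`, `workers`), field names, and `zkHost` value are placeholders. The function names and parameters inside the generated string (`search`, `rollup`, `parallel`, `qt`, `partitionKeys`, `over`, `workers`, `zkHost`) follow the streaming expression syntax as understood for Solr 6 and should be checked against the reference guide for the version in use.

```python
def search_expr(collection, q, fl, sort, partition_keys=None):
    """Stream source: a search over a collection using the /export handler."""
    params = [f'q="{q}"', f'fl="{fl}"', f'sort="{sort}"', 'qt="/export"']
    if partition_keys:
        # Needed when the stream is shuffled across several workers.
        params.append(f'partitionKeys="{partition_keys}"')
    return f"search({collection}, " + ", ".join(params) + ")"


def rollup_expr(inner, over, *metrics):
    """Decorator: aggregate the (sorted) inner stream over a field."""
    return f'rollup({inner}, over="{over}", ' + ", ".join(metrics) + ")"


def parallel_expr(worker_collection, inner, workers, zk_host, sort):
    """Decorator: run the inner expression on N workers and merge by sort."""
    return (
        f'parallel({worker_collection}, {inner}, workers="{workers}", '
        f'zkHost="{zk_host}", sort="{sort}")'
    )


source = search_expr(
    "IMDB", "*:*", "actor_name,num_voters", "actor_name asc",
    partition_keys="actor_name",
)
expr = parallel_expr(
    "workers",  # placeholder worker collection name
    rollup_expr(source, "actor_name", "sum(num_voters)", "count(*)"),
    workers=4,
    zk_host="localhost:9983",  # placeholder ZooKeeper address
    sort="actor_name asc",
)
print(expr)
```

The resulting string is the kind of text that would be sent to a /stream endpoint; here it is only printed.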
Shuffling
• Shuffling is pushed down to Solr
• Sorting is done by /export handler which stream-sorts entire result sets
• Partitioning is done by HashQParserPlugin which is a filter that
partitions on arbitrary fields
• Tuples (search results) start streaming instantly to worker nodes, never
requiring a spill to disk.
• All replicas shuffle in parallel for the same query which allows for
massively parallel IO and huge throughputs.
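The partitioning step can be illustrated with a small simulation. This is a sketch of the idea only: `zlib.crc32` stands in for whatever hash function Solr actually uses, everything happens in one process, and the workers aggregate with a dictionary for brevity instead of reading sorted streams. The property it demonstrates is that hashing on the partition key sends every tuple with the same key to the same worker, so per-worker aggregates can simply be merged.

```python
import zlib
from heapq import merge


def partition(tuples, key, num_workers):
    """Assign each tuple to a worker by hashing its partition key."""
    buckets = [[] for _ in range(num_workers)]
    for t in tuples:
        # crc32 is a deterministic stand-in for the real partitioning hash.
        worker = zlib.crc32(str(t[key]).encode("utf-8")) % num_workers
        buckets[worker].append(t)
    return buckets


def worker_rollup(bucket, key, sum_field):
    """Per-worker aggregation, returned sorted by key."""
    totals = {}
    for t in bucket:
        totals[t[key]] = totals.get(t[key], 0) + t[sum_field]
    return sorted(totals.items())


# Made-up sample tuples, for illustration only.
tuples = [
    {"actor_name": name, "num_voters": voters}
    for name, voters in [
        ("tom", 300), ("bruce", 500), ("tom", 200), ("uma", 50), ("bruce", 100),
    ]
]

buckets = partition(tuples, "actor_name", num_workers=3)
# Each key lives on exactly one worker, so a sorted merge gives the final answer.
merged = list(merge(*(worker_rollup(b, "actor_name", "num_voters") for b in buckets)))
print(merged)
```

In Solr the filtering happens on the replicas themselves, so each worker only ever receives its own share of the tuples.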
Worker collections
• Regular SolrCloud collections
• Perform streaming aggregations using the Streaming API
• Receive shuffled streams from the replicas
• Over an HTTP endpoint: /stream
• May be empty or created just-in-time for specific analytical queries
or have data as any regular SolrCloud collection
• The goal is to separate processing from data if necessary
Parallel SQL
• The Presto parser compiles SQL to a TupleStream
• TupleStream is serialised to a Streaming Expression and sent over
the wire to worker nodes
• Worker nodes convert the Streaming Expression back into a
TupleStream
• Worker nodes open() and read() the TupleStream in parallel
What’s next
Graph traversals via streaming expressions
• Shortest path
• Node walking/gathering
• Distributed Gremlin implementation
Machine learning models
• LogisticRegressionQuery
• LogitStream
• More to come
Take actions based on AI driven alerts
• DaemonStreams
• AlertStream
• ModelStream
More, more, more!
• UpdateStream
• Publish-subscribe
• Calcite integration
• Better JDBC support
References
• Joel Bernstein’s Blog: http://joelsolr.blogspot.in/
• Streaming Expressions: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
• Parallel SQL Interface: https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
• Parallel SQL by Joel Bernstein: https://www.youtube.com/watch?v=baWQfHWozXc
• Streaming Aggregations by Erick Erickson: https://www.youtube.com/watch?v=n5SYlw0vSFw
Thank you
shalin@apache.org
@shalinmangar

 
07. Ruby String Slides - Ruby Core Teaching
07. Ruby String Slides - Ruby Core Teaching07. Ruby String Slides - Ruby Core Teaching
07. Ruby String Slides - Ruby Core Teaching
 
Applitools Autonomous 2.0 Sneak Peek.pdf
Applitools Autonomous 2.0 Sneak Peek.pdfApplitools Autonomous 2.0 Sneak Peek.pdf
Applitools Autonomous 2.0 Sneak Peek.pdf
 
The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024
 
The Politics of Agile Development.pptx
The  Politics of  Agile Development.pptxThe  Politics of  Agile Development.pptx
The Politics of Agile Development.pptx
 
Three available editions of Windows Servers crucial to your organization’s op...
Three available editions of Windows Servers crucial to your organization’s op...Three available editions of Windows Servers crucial to your organization’s op...
Three available editions of Windows Servers crucial to your organization’s op...
 
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
 

Parallel SQL and Streaming Expressions in Apache Solr 6

  • 2. Parallel SQL and Streaming Expressions in Apache Solr 6 Shalin Shekhar Mangar @shalinmangar Lucidworks Inc.
  • 3. Introduction • Shalin Shekhar Mangar • Lucene/Solr Committer • PMC Member • Senior Solr Consultant with Lucidworks Inc.
  • 4. The standard for enterprise search. 90% of Fortune 500 uses Solr.
  • 5. Solr Key Features • Full text search (Info Retr.) • Facets/Guided Nav galore! • Lots of data types • Spelling, auto-complete, highlighting • Cursors • More Like This • De-duplication • Apache Lucene • Grouping and Joins • Stats, expressions, transformations and more • Lang. Detection • Extensible • Massive Scale/Fault tolerance
  • 6. Why SQL • Simple, well-known interface to data inside Solr • Hides the complexity of Solr and its various features • Possible to optimise the query plan according to best-practices automatically • Distributed Joins done simply and well
  • 7. Solr 6: Parallel SQL • Parallel execution of SQL across SolrCloud collections • Compiled to SolrJ Streaming API (TupleStream) which is a general purpose parallel computing framework for Solr • Executed in parallel over SolrCloud worker nodes • SolrCloud collections are relational ‘tables’ • JDBC thin client as a SolrJ client
  • 9. SQL Interface at a glance • SQL over Map/Reduce — for high cardinality aggregations and distributed joins • SQL over Facets — high performance, moderate cardinality aggregations • SQL with Solr powered search queries • Fully integrated with SolrCloud • SQL over JDBC or HTTP — http://host:port/solr/collection1/sql
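As a rough illustration of the "SQL over HTTP" bullet, the sketch below builds (but does not send) a POST request for a collection's /sql endpoint. The host, port, and the `build_sql_request` helper are hypothetical; the `stmt` parameter name is how the Solr reference guide describes passing the statement as far as I recall, so check it against the documentation for your Solr version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(base_url, collection, stmt, aggregation_mode="facet"):
    """Build (but do not send) a POST request for the collection's /sql endpoint."""
    url = f"{base_url.rstrip('/')}/{collection}/sql"
    # Assumption: the statement travels in a form parameter named "stmt".
    body = urlencode({"stmt": stmt, "aggregationMode": aggregation_mode})
    return Request(url, data=body.encode("utf-8"), method="POST")


request = build_sql_request(
    "http://localhost:8983/solr",  # hypothetical host and port
    "IMDB",
    "select movie, director from IMDB limit 100",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it to a running Solr node; the response is expected to be a stream of tuples, which this sketch does not parse.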
  • 10. Limited vs Unlimited SELECT • select movie, director from IMDB Returns the entire result set! Return fields must be DocValues • select movie, director from IMDB limit 100 Returns specified number of records. It can sort by score and retrieve any stored field • select movie, director from IMDB order by rating desc, num_voters desc
  • 11. Search predicates • select movie, director from IMDB where actor = ‘bruce’ • select movie, director from IMDB where actor = ‘(bruce tom)’ • select movie, director from IMDB where rating = ‘[8 TO *]’ • select movie, director from IMDB where (actor = ‘(bruce tom)’ AND rating = ‘[8 TO *]’) • Search predicates are Solr queries specified inside single-quotes • Can specify arbitrary boolean clauses
  • 12. Select DISTINCT • select distinct actor_name from IMDB • Map/Reduce implementation — Tuples are shuffled to worker nodes and operation is performed by workers • JSON Facet implementation — operation is ‘pushed down’ to Solr
  • 13. Stats aggregations • select count(*), sum(num_voters) from IMDB • Computed using Solr’s StatsComponent under the hood • count, sum, avg, min, max are the supported aggregations • Always pushed down into the search engine
  • 14. GROUP BY Aggregations • select actor_name, director, count(*), sum(num_voters) from IMDB group by actor_name, director having count(*) > 5 and sum(num_voters) > 1000 order by sum(num_voters) desc • Has a map/reduce implementation (shuffle) and a JSON Facet implementation (push down) • Multi-dimensional, high cardinality aggregations are possible with the map/reduce implementation
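To make the GROUP BY slide concrete, here is a plain-Python sketch of what the reduce side of such a query computes: tuples arrive sorted by the group-by fields, adjacent tuples are rolled up into one aggregate per group, then the HAVING filter and ORDER BY are applied. This only mimics the semantics; it is not Solr code, the sample data is invented, and the thresholds are lowered from the slide's `count(*) > 5` so that the small sample produces output.

```python
from itertools import groupby

# Invented sample tuples standing in for documents in the IMDB collection.
tuples = [
    {"actor_name": "bruce", "director": "john", "num_voters": 700},
    {"actor_name": "tom", "director": "john", "num_voters": 1000},
    {"actor_name": "bruce", "director": "jane", "num_voters": 100},
    {"actor_name": "bruce", "director": "john", "num_voters": 600},
    {"actor_name": "tom", "director": "jane", "num_voters": 900},
    {"actor_name": "tom", "director": "john", "num_voters": 1000},
    {"actor_name": "bruce", "director": "jane", "num_voters": 200},
    {"actor_name": "bruce", "director": "john", "num_voters": 500},
    {"actor_name": "tom", "director": "john", "num_voters": 500},
    {"actor_name": "tom", "director": "jane", "num_voters": 400},
    {"actor_name": "bruce", "director": "jane", "num_voters": 300},
    {"actor_name": "tom", "director": "john", "num_voters": 500},
]


def rollup(sorted_tuples, keys):
    """Aggregate adjacent tuples that share the same values for `keys`.

    The input must already be sorted by `keys`.
    """
    def key_of(t):
        return tuple(t[k] for k in keys)

    for group_key, members in groupby(sorted_tuples, key=key_of):
        members = list(members)
        row = dict(zip(keys, group_key))
        row["count"] = len(members)
        row["sum_num_voters"] = sum(m["num_voters"] for m in members)
        yield row


def group_by(rows, keys, min_count, min_sum):
    """GROUP BY keys HAVING count > min_count AND sum > min_sum ORDER BY sum DESC."""
    ordered = sorted(rows, key=lambda t: tuple(t[k] for k in keys))
    kept = [
        g for g in rollup(ordered, keys)
        if g["count"] > min_count and g["sum_num_voters"] > min_sum
    ]
    return sorted(kept, key=lambda g: g["sum_num_voters"], reverse=True)


result = group_by(tuples, ["actor_name", "director"], min_count=2, min_sum=1000)
for row in result:
    print(row)
```

Because the rollup only ever compares a tuple with its neighbour, it needs memory for one group at a time, which is what makes high cardinality group-bys practical on a sorted stream.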
  • 16. JDBC • Part of SolrJ • SolrCloud Aware Load Balancing • Connection has ‘aggregationMode’ parameter that can switch between map_reduce and facet • jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME&aggregationMode=facet
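The connection string format on the JDBC slide can be assembled as follows. This is a small sketch only: `solr_jdbc_url` is a hypothetical helper, the ZooKeeper hosts are made up, and the resulting URL would be handed to a JDBC client (the SolrJ driver) rather than used from Python.

```python
from urllib.parse import urlencode

# The two aggregation modes named on the slide.
AGGREGATION_MODES = ("facet", "map_reduce")


def solr_jdbc_url(zk_connection_string, collection, aggregation_mode="facet"):
    """Assemble a Solr JDBC URL in the format shown on the slide."""
    if aggregation_mode not in AGGREGATION_MODES:
        raise ValueError(f"aggregationMode must be one of {AGGREGATION_MODES}")
    query = urlencode({"collection": collection, "aggregationMode": aggregation_mode})
    return f"jdbc:solr://{zk_connection_string}?{query}"


# Hypothetical ZooKeeper ensemble with a /solr chroot.
url = solr_jdbc_url("zk1:2181,zk2:2181/solr", "IMDB", "map_reduce")
print(url)
```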
  • 18. Solr’s Parallel Computing Framework • Streaming API • Streaming Expressions • Shuffling • Worker collections • Parallel SQL
  • 19. Streaming API • Java API for parallel computation • Real-time Map/Reduce and Parallel Relational Algebra • Search results are streams of tuples (TupleStream) • Transformed in parallel by Decorator streams • Transformations include group by, rollup, union, intersection, complement, joins • org.apache.solr.client.solrj.io.*
  • 20. Streaming API • Streaming Transformation: operations that transform the underlying streams, e.g. unique, group by, rollup, union, intersection, complement, join etc • Streaming Aggregation: operations that gather metrics and compute aggregates, e.g. sum, count, average, min, max etc
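The decorator idea from the two Streaming API slides (a stream wraps another stream and transforms its tuples as they pass through) can be imitated with Python generators. This is an analogy under stated assumptions, not the SolrJ API: the function names `search`, `merge`, `unique` and `count` echo names used in the slides, but their signatures here are invented, and the two "shards" are in-memory lists.

```python
from heapq import merge as heap_merge


def search(docs, sort):
    """Stream source: yield tuples ordered by the `sort` field."""
    yield from sorted(docs, key=lambda t: t[sort])


def merge(left, right, on):
    """Decorator: interleave two streams that are each sorted by `on`."""
    yield from heap_merge(left, right, key=lambda t: t[on])


def unique(stream, over):
    """Decorator: keep the first tuple per distinct `over` value (sorted input)."""
    previous = object()  # sentinel that equals no field value
    for t in stream:
        if t[over] != previous:
            yield t
            previous = t[over]


def count(stream):
    """Aggregation: consume the stream and return the number of tuples."""
    return sum(1 for _ in stream)


# Invented data for two shards of one collection.
shard_a = [{"actor_name": "tom"}, {"actor_name": "bruce"}]
shard_b = [{"actor_name": "bruce"}, {"actor_name": "uma"}]


def distinct_actors():
    merged = merge(search(shard_a, "actor_name"), search(shard_b, "actor_name"), on="actor_name")
    return unique(merged, over="actor_name")


names = [t["actor_name"] for t in distinct_actors()]
total = count(distinct_actors())
print(names, total)
```

Each stage pulls one tuple at a time from the stage below it, so nothing is materialised beyond the initial sort in the stand-in source.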
  • 21. Streaming Expressions • String Query Language and Serialisation format for the Streaming API • Streaming expressions compile to TupleStream • TupleStream serialise to Streaming Expressions • Human friendly syntax for Streaming API accessible to non-Java folks as well • Can be used directly via HTTP to SolrJ
  • 23. Streaming Expressions • Stream Sources: the origin of a TupleStream, e.g. search, jdbc, facet, stats, topic • Stream Decorators: wrap other stream functions and perform operations on the stream, e.g. complement, hashJoin, innerJoin, merge, intersect, top, unique • Many streams can be parallelized across worker collections
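Since a streaming expression is just a nested, function-style string, a source wrapped by a decorator can be rendered with a few lines of string formatting. The `expr` helper below is hypothetical; the parameter names (`q`, `fl`, `sort`, `qt`, `over`) are the ones I recall from the Solr reference guide and should be treated as assumptions to verify for your version.

```python
def expr(name, *operands, **params):
    """Render a function-style streaming expression as a string."""
    args = list(operands) + [f'{key}="{value}"' for key, value in params.items()]
    return f"{name}({', '.join(args)})"


# Stream source: read actor_name from the IMDB collection, sorted for unique().
source = expr(
    "search",
    "IMDB",
    q="*:*",
    fl="actor_name",
    sort="actor_name asc",
    qt="/export",
)

# Stream decorator: wrap the source and drop adjacent duplicates.
distinct = expr("unique", source, over="actor_name")
print(distinct)
```

The printed string is what would be submitted to a collection's /stream endpoint; sending it is left out of this sketch.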
  • 24. Shuffling • Shuffling is pushed down to Solr • Sorting is done by /export handler which stream-sorts entire result sets • Partitioning is done by HashQParserPlugin which is a filter that partitions on arbitrary fields • Tuples (search results) start streaming instantly to worker nodes never requiring a spill to the disk. • All replicas shuffle in parallel for the same query which allows for massively parallel IO and huge throughputs.
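The partitioning step on the shuffling slide can be pictured as every worker applying the same hash filter with a different worker number, so each tuple lands on exactly one worker and tuples with equal keys land together. The sketch below simulates that with `zlib.crc32`; the hash function Solr actually uses is not stated in the slides, and the function names and sample data are invented.

```python
from zlib import crc32


def worker_for(tuple_, keys, num_workers):
    """Pick the worker for a tuple by hashing its partition key fields."""
    key = "|".join(str(tuple_[k]) for k in keys)
    # crc32 is a stand-in; it is deterministic across runs, unlike hash().
    return crc32(key.encode("utf-8")) % num_workers


def partition(tuples, keys, num_workers, worker_id):
    """The filter one worker would apply: keep only the tuples hashed to it."""
    return [t for t in tuples if worker_for(t, keys, num_workers) == worker_id]


# Invented sample tuples.
tuples = [
    {"actor_name": name, "movie": f"movie-{i}"}
    for i, name in enumerate(["bruce", "tom", "uma", "bruce", "tom", "sam", "uma", "bruce"])
]

num_workers = 3
partitions = [partition(tuples, ["actor_name"], num_workers, w) for w in range(num_workers)]

for worker_id, part in enumerate(partitions):
    print(worker_id, sorted({t["actor_name"] for t in part}))
```

Partitioning on the group-by or join fields is what lets each worker aggregate or join its share independently of the others.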
  • 25. Worker collections • Regular SolrCloud collections • Perform streaming aggregations using the Streaming API • Receive shuffled streams from the replicas • Over an HTTP endpoint: /stream • May be empty or created just-in-time for specific analytical queries or have data as any regular SolrCloud collection • The goal is to separate processing from data if necessary
  • 26. Parallel SQL • The Presto parser compiles SQL to a TupleStream • TupleStream is serialised to a Streaming Expression and sent over the wire to worker nodes • Worker nodes convert the Streaming Expression back into a TupleStream • Worker nodes open() and read() the TupleStream in parallel
  • 29. Graph traversals via streaming expressions • Shortest path • Node walking/gathering • Distributed Gremlin implementation
  • 31. Take actions based on AI driven alerts • DaemonStreams • AlertStream • ModelStream
  • 32. More, more, more! • UpdateStream • Publish-subscribe • Calcite integration • Better JDBC support
  • 33. References • Joel Bernstein’s Blog: http://joelsolr.blogspot.in/ • https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions • https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface • Parallel SQL by Joel Bernstein: https://www.youtube.com/watch?v=baWQfHWozXc • Streaming Aggregations by Erick Erickson: https://www.youtube.com/watch?v=n5SYlw0vSFw