SolrCloud: Searching Big Data
Shalin Shekhar Mangar
What is SolrCloud?
Subset of optional features in Solr to enable and
simplify horizontal scaling of a search index using
sharding and replication.
Goals: performance, scalability, high availability,
simplicity, and elasticity
Terminology
• ZooKeeper: Distributed coordination service that provides centralized configuration, cluster state management, and leader election
• Node: JVM process bound to a specific port on a machine; hosts the Solr web application
• Collection: Search index distributed across multiple nodes; each collection has a name, shard count, and replication factor
• Replication Factor: Number of copies of a document in a collection
• Shard: Logical slice of a collection; each shard has a name, hash range, leader, and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy
• Replica: Solr index that hosts a copy of a shard in a collection; behind the scenes, each replica is implemented as a Solr core
• Leader: Replica in a shard that assumes special duties needed to support distributed indexing in Solr; each shard has one and only one leader at any time, and leaders are elected using ZooKeeper
High-level Architecture
Collection == Distributed Index
A collection is a distributed index defined by:
• named configuration stored in ZooKeeper
• number of shards: documents are distributed across N partitions of the index
• document routing strategy: how documents get assigned to shards
• replication factor: how many copies of each document in the collection
Collections API:
http://localhost:8983/solr/admin/collections?action=create&name=logstash4solr&replicationFactor=2&numShards=2&collection.configName=logs
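The request above can be assembled programmatically. The following is a minimal sketch that only builds the URL and sends nothing; the helper names `create_collection_url` and `total_replicas` are illustrative and not part of Solr, and the action name is written in upper case as an assumption about the usual documented form.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build (but do not send) a Collections API CREATE request URL."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"

def total_replicas(num_shards, replication_factor):
    """Each shard is hosted replication_factor times, one Solr core per copy."""
    return num_shards * replication_factor

# Same values as the slide's example request.
url = create_collection_url("http://localhost:8983/solr", "logstash4solr", 2, 2, "logs")
params = parse_qs(urlsplit(url).query)
print(url)
print(total_replicas(2, 2), "cores in total")
```

With 2 shards and a replication factor of 2, the cluster ends up hosting four cores for this collection.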
Sharding
• Collection has a fixed number of shards
- existing shards can be split
• When to shard?
- Large number of docs
- Large document sizes
- Parallelization during indexing and queries
- Data partitioning (custom hashing)
Document Routing
• Each shard covers a hash range
• Default: hash the document ID into a 32-bit integer and map it to a range
- leads to (roughly) balanced shards
• Custom hashing (example in a few slides)
• Tri-level: app!user!doc
• Implicit: no hash range set for shards
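The default routing idea (hash the ID into a 32-bit integer, then find the shard whose range contains it) can be sketched as follows. This is an illustration under stated assumptions: CRC32 is used as a stand-in hash purely because it ships with Python and is not the hash function Solr uses, and `shard_ranges`, `hash32`, and `route` are hypothetical helper names.

```python
import zlib

MIN_HASH, MAX_HASH = -2**31, 2**31 - 1  # signed 32-bit hash space

def shard_ranges(num_shards):
    """Split the signed 32-bit space into num_shards contiguous inclusive ranges."""
    span = 2**32
    bounds = [MIN_HASH + (span * i) // num_shards for i in range(num_shards + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_shards)]

def hash32(doc_id):
    """Stand-in 32-bit hash (CRC32) shifted into the signed range. Not Solr's hash."""
    return zlib.crc32(doc_id.encode("utf-8")) - 2**31

def route(doc_id, ranges):
    """Return the name of the one shard whose hash range contains the document."""
    h = hash32(doc_id)
    for i, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return f"shard{i + 1}"
    raise ValueError("hash outside all ranges")

ranges2 = shard_ranges(2)
ranges4 = shard_ranges(4)
print(ranges2)
print(route("doc-1", ranges4))
```

Because the ranges are contiguous and cover the whole hash space, every document lands in exactly one shard, and the same ID always maps to the same shard.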
Replication
• Why replicate?
- High availability
- Load balancing
• How does it work in SolrCloud?
- Near-real-time, not master-slave
- Leader forwards to replicas in parallel, waits for response
- Error handling during indexing is tricky
Example: Indexing
Example: Querying
Distributed Indexing
1. Get cluster state from ZK
2. Route document directly to leader (hash on doc ID)
3. Persist document on durable storage (tlog)
4. Forward to healthy replicas
5. Acknowledge write success to client
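Steps 3 to 5 of the indexing flow can be modelled in a few lines. This is a toy, in-memory sketch and not Solr code: `Replica` and `index_document` are hypothetical names, the transaction log is a plain list, and forwarding is sequential here whereas the slides describe it as parallel.

```python
class Replica:
    """Toy replica: a list stands in for the tlog, a dict for the index."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.tlog = []
        self.index = {}

    def apply(self, doc):
        self.tlog.append(doc)        # persist to the log first
        self.index[doc["id"]] = doc  # then make the document searchable

def index_document(doc, leader, replicas):
    """Leader persists the doc, forwards it to healthy replicas, then acknowledges."""
    leader.apply(doc)
    forwarded = []
    for replica in replicas:
        if replica.healthy:          # unhealthy replicas are skipped
            replica.apply(doc)
            forwarded.append(replica.name)
    return {"status": "ok", "forwarded_to": forwarded}

leader = Replica("r1")
r2 = Replica("r2")
r3 = Replica("r3", healthy=False)
ack = index_document({"id": "1", "level_s": "ERROR"}, leader, [r2, r3])
print(ack)
```

The skipped replica is left without the document in this sketch; how a real replica catches up after it recovers is outside what the sketch models.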
Shard Leader
• Additional responsibilities during indexing only! Not a master node
• Leader is a replica (handles queries)
• Accepts update requests for the shard
• Increments the _version_ on the new or updated doc
• Sends updates (in parallel) to all replicas
Distributed Queries
1. Query client can be ZK-aware or just query via a load balancer
2. Client can send query to any node in the cluster
3. Controller node distributes the query to a replica for each shard to identify documents matching query
4. Controller node sorts the results from step 3 and issues a second query for all fields for a page of results
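Steps 3 and 4 describe a two-phase scatter-gather. The sketch below illustrates the idea with in-memory dictionaries standing in for one replica per shard; `distributed_query` and `score` are hypothetical names, and the scoring function is a toy term count rather than real relevance scoring.

```python
import heapq

def distributed_query(shards, score, rows):
    """shards: list of dicts (doc id -> full document), one replica per shard.
    score: function(doc) -> number, or None when the doc does not match."""
    # Phase 1: every shard reports only (score, id) for its own top `rows` hits.
    candidates = []
    for shard_no, shard in enumerate(shards):
        hits = []
        for doc_id, doc in shard.items():
            s = score(doc)
            if s is not None:
                hits.append((s, doc_id, shard_no))
        candidates.extend(heapq.nlargest(rows, hits))
    # The controller merges the per-shard lists and keeps a single page.
    page = heapq.nlargest(rows, candidates)
    # Phase 2: fetch all fields, but only for documents on that page.
    return [shards[shard_no][doc_id] for _, doc_id, shard_no in page]

def score(doc):
    """Toy score: number of times 'error' occurs; None means no match."""
    return doc["text"].split().count("error") or None

shards = [
    {"a": {"id": "a", "text": "error error disk"}, "b": {"id": "b", "text": "ok"}},
    {"c": {"id": "c", "text": "error"}, "d": {"id": "d", "text": "error error error"}},
]
top = distributed_query(shards, score, rows=2)
print([doc["id"] for doc in top])
```

Fetching full documents only in the second phase keeps the first phase cheap: shards exchange small (score, id) pairs instead of whole documents.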
Scalability / Stability Highlights
• All nodes in cluster perform indexing and execute queries; no master node
• Distributed indexing: No SPoF, high throughput via direct updates to leaders, automated failover to new leader
• Distributed queries: Add replicas to scale out qps; parallelize complex query computations; fault tolerance
• Indexing / queries continue so long as there is 1 healthy replica per shard
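The last point reduces to a simple predicate over the cluster state. A minimal sketch, assuming a hypothetical representation where each shard name maps to a list of replica health flags:

```python
def collection_available(shards):
    """True when every shard has at least one healthy replica.

    shards: dict mapping shard name -> list of booleans (replica health).
    """
    return all(any(replicas) for replicas in shards.values())

healthy_enough = {"shard1": [True, False], "shard2": [False, True]}
shard_down = {"shard1": [True, True], "shard2": [False, False]}
print(collection_available(healthy_enough))
print(collection_available(shard_down))
```

Note that extra healthy replicas in one shard do not compensate for another shard having none.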
SolrCloud and CAP
• A distributed system should be: Consistent, Available, and Partition tolerant
• CAP says pick 2 of the 3! (slightly more nuanced than that in reality)
• SolrCloud favors consistency over write-availability (CP)
• All replicas in a shard have the same data
• Active replica sets concept (writes accepted so long as a shard has at least one active replica available)
SolrCloud and CAP
• No tools to detect or fix consistency issues in Solr
- Reads go to one replica; no concept of quorum
- Writes must fail if consistency cannot be guaranteed (SOLR-5468)
ZooKeeper
• Is a very good thing ... clusters are a zoo!
• Centralized configuration management
• Cluster state management
• Leader election (shard leader and overseer)
• Overseer distributed work queue
• Live nodes
- Ephemeral znodes used to signal a server is gone
• Needs at least 3 ZooKeeper nodes for quorum in production
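The three-node recommendation follows from majority arithmetic: a ZooKeeper ensemble stays available only while a strict majority of its servers is up. A small worked example (the function names are illustrative):

```python
def quorum_size(ensemble_size):
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can be lost while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)

for n in (1, 2, 3, 5):
    print(f"{n} servers: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

One or two servers tolerate no failures at all, so three is the smallest ensemble that survives the loss of a server.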
ZooKeeper: Centralized Configuration
• Store config files in ZooKeeper
• Solr nodes pull config during core initialization
• Config sets can be "shared" across collections
• Changes are uploaded to ZK and then collections should be reloaded
ZooKeeper: State Management
• Keep track of /live_nodes znode
• Ephemeral nodes
• ZooKeeper client timeout
• Collection metadata and replica state in /clusterstate.json
• Every core has watchers for /live_nodes and /clusterstate.json
• Leader election
• ZooKeeper sequence numbers on ephemeral znodes
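Election by sequence number can be illustrated without a ZooKeeper server. The sketch below is an in-memory stand-in for ephemeral sequential znodes: each candidate gets the next sequence number, the lowest surviving number is the leader, and a candidate's entry disappears when its session ends. The `Election` class is hypothetical, and Solr's actual election code involves more steps than shown here.

```python
class Election:
    """In-memory stand-in for ephemeral sequential znodes under one election path."""
    def __init__(self):
        self._next_seq = 0
        self.candidates = {}  # sequence number -> replica name

    def join(self, replica):
        """Register a candidate and hand back its sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self.candidates[seq] = replica
        return seq

    def leave(self, seq):
        """Models the ephemeral znode vanishing when the session ends."""
        self.candidates.pop(seq, None)

    def leader(self):
        """The candidate holding the lowest sequence number leads."""
        if not self.candidates:
            return None
        return self.candidates[min(self.candidates)]

election = Election()
seq_a = election.join("replica-a")
seq_b = election.join("replica-b")
seq_c = election.join("replica-c")
first_leader = election.leader()
election.leave(seq_a)  # replica-a's session expires
second_leader = election.leader()
print(first_leader, "->", second_leader)
```

Failover needs no extra coordination in this model: once the leader's entry is gone, the next-lowest sequence number is the leader by definition.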
Overseer
• What does it do?
- Persists collection state change events to ZooKeeper
- Controller for Collection API commands
- Ordered updates
- One per cluster (for all collections); elected using leader election
• How does it work?
- Asynchronous (pub/sub messaging)
- ZooKeeper as distributed queue recipe
- Automated failover to a healthy node
- Can be assigned to a dedicated node (SOLR-5476)
Custom Hashing
• Route documents to specific shards based on a shard key component in the document ID
• Send all log messages from the same system to the same shard
• Direct queries to specific shards: q=...&_route_=httpd
{
"id" : "httpd!2",
"level_s" : "ERROR",
"lang_s" : "en",
...
},
Hash:
shardKey!docID
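The `shardKey!docID` convention is easy to handle on the client side. The sketch below splits a composite ID and builds a routed query URL; the helper names, the `/select` handler path, and the `logstash4solr` collection in the usage lines are assumptions for illustration, and no request is sent.

```python
from urllib.parse import urlencode

def split_composite_id(doc_id):
    """Return (shard_key, local_id); shard_key is None for plain IDs."""
    key, sep, rest = doc_id.partition("!")
    return (key, rest) if sep else (None, doc_id)

def same_shard_key(id_a, id_b):
    """True when both IDs carry the same shard key prefix."""
    key_a = split_composite_id(id_a)[0]
    key_b = split_composite_id(id_b)[0]
    return key_a is not None and key_a == key_b

def routed_query_url(base_url, collection, q, route):
    """Build a query URL restricted via the _route_ parameter."""
    return f"{base_url}/{collection}/select?{urlencode({'q': q, '_route_': route})}"

key, local_id = split_composite_id("httpd!2")
query_url = routed_query_url("http://localhost:8983/solr", "logstash4solr", "level_s:ERROR", "httpd")
print(key, local_id)
print(query_url)
```

For a tri-level ID such as `app!user!doc`, this helper treats only the first component as the shard key and leaves the rest intact.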
Custom Hashing Highlights
• Co-locate documents having a common property in the same shard
- e.g. docs having IDs httpd!21 and httpd!33 will be in the same shard
• Scale up the replicas for specific shards to address high query and/or indexing volume from specific apps
• Not as much control over the distribution of keys
- httpd, mysql, and collectd all in same shard
• Can split unbalanced shards when using custom hashing
Shard Splitting
• Can split shards into two sub-shards
• Live splitting! No downtime needed!
• Requests start being forwarded to sub-shards automatically
• Expensive operation: use as required during low traffic
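Splitting a shard amounts to dividing its hash range into two contiguous halves. The sketch below shows that arithmetic and builds a Collections API SPLITSHARD request URL without sending it; `split_range` and `splitshard_url` are illustrative names, and the even split at the midpoint is an assumption about the simplest case.

```python
from urllib.parse import urlencode

def split_range(lo, hi):
    """Split an inclusive hash range into two contiguous halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)

def splitshard_url(base_url, collection, shard):
    """Build (but do not send) a Collections API SPLITSHARD request URL."""
    params = {"action": "SPLITSHARD", "collection": collection, "shard": shard}
    return f"{base_url}/admin/collections?{urlencode(params)}"

# Split the lower half of the signed 32-bit hash space.
left, right = split_range(-2**31, -1)
split_url = splitshard_url("http://localhost:8983/solr", "logstash4solr", "shard1")
print(left, right)
print(split_url)
```

The two halves together cover exactly the parent's range, so every document of the parent shard belongs to exactly one sub-shard.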
Other features / highlights
• Near-Real-Time Search: Documents are visible within a second or so after being indexed
• Partial Document Update: Just update the fields you need to change on existing documents
• Optimistic Locking: Ensure updates are applied to the correct version of a document
• Transaction log: Better recoverability; peer-sync between nodes after hiccups
• HTTPS
• Use HDFS for storing indexes
• Use MapReduce for building index (SOLR-1301)
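Optimistic locking from the list above can be illustrated with a toy store that rejects an update when the caller's expected `_version_` no longer matches the stored one. This is a simplified sketch, not Solr's implementation: `Store` and `VersionConflict` are hypothetical names, versions come from a simple counter, and only the exact-match case is modelled.

```python
class VersionConflict(Exception):
    """Raised when the expected version does not match the stored version."""

class Store:
    def __init__(self):
        self.docs = {}
        self._counter = 0  # stand-in version source; always increasing

    def update(self, doc, expected_version=None):
        """Write doc; if expected_version is given, it must match the stored one."""
        current = self.docs.get(doc["id"])
        if expected_version is not None:
            actual = current["_version_"] if current else None
            if actual != expected_version:
                raise VersionConflict(f"expected {expected_version}, found {actual}")
        self._counter += 1
        self.docs[doc["id"]] = dict(doc, _version_=self._counter)
        return self._counter

store = Store()
v1 = store.update({"id": "httpd!2", "level_s": "ERROR"})
v2 = store.update({"id": "httpd!2", "level_s": "WARN"}, expected_version=v1)
try:
    store.update({"id": "httpd!2", "level_s": "INFO"}, expected_version=v1)  # stale
    conflict = False
except VersionConflict:
    conflict = True
print(v1, v2, conflict)
```

The third update is rejected because another writer already moved the document past version `v1`; the client would re-read the document and retry.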
More?
• Workshop: Apache Solr in Minutes tomorrow
• https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
• shalin@apache.org
• http://twitter.com/shalinmangar
• http://shal.in
Attributions
• Tim Potter's slides on "Introduction to SolrCloud" at Lucene/Solr Exchange 2014
- http://twitter.com/thelabdude
• Erik Hatcher's slides on "Solr: Search at the speed of light" at JavaZone 2009
- http://twitter.com/ErikHatcher
GIDS2014: SolrCloud: Searching Big Data

More Related Content

What's hot

Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
thelabdude
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
Shalin Shekhar Mangar
 
Solr cluster with SolrCloud at lucenerevolution (tutorial)
Solr cluster with SolrCloud at lucenerevolution (tutorial)Solr cluster with SolrCloud at lucenerevolution (tutorial)
Solr cluster with SolrCloud at lucenerevolution (tutorial)
searchbox-com
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
thelabdude
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
Alex Moundalexis
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
thelabdude
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Nitin S
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scale
Anshum Gupta
 
Apache SolrCloud
Apache SolrCloudApache SolrCloud
Apache SolrCloud
Michaล‚ Warecki
 
Scaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsScaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of Collections
Anshum Gupta
 
SolrCloud Failover and Testing
SolrCloud Failover and TestingSolrCloud Failover and Testing
SolrCloud Failover and Testing
Mark Miller
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
Lucidworks (Archived)
 
How to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr clusterHow to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr cluster
lucenerevolution
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
Anshum Gupta
 
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, EtsyLessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lucidworks
 
High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...
Lucidworks
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Lucidworks
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Shalin Shekhar Mangar
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
lucenerevolution
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
thelabdude
 

What's hot (20)

Solr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloudSolr Exchange: Introduction to SolrCloud
Solr Exchange: Introduction to SolrCloud
 
High Performance Solr
High Performance SolrHigh Performance Solr
High Performance Solr
 
Solr cluster with SolrCloud at lucenerevolution (tutorial)
Solr cluster with SolrCloud at lucenerevolution (tutorial)Solr cluster with SolrCloud at lucenerevolution (tutorial)
Solr cluster with SolrCloud at lucenerevolution (tutorial)
 
NYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / SolrNYC Lucene/Solr Meetup: Spark / Solr
NYC Lucene/Solr Meetup: Spark / Solr
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
Lucene Revolution 2013 - Scaling Solr Cloud for Large-scale Social Media Anal...
 
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure Solr Compute Cloud - An Elastic SolrCloud Infrastructure
Solr Compute Cloud - An Elastic SolrCloud Infrastructure
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scale
 
Apache SolrCloud
Apache SolrCloudApache SolrCloud
Apache SolrCloud
 
Scaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsScaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of Collections
 
SolrCloud Failover and Testing
SolrCloud Failover and TestingSolrCloud Failover and Testing
SolrCloud Failover and Testing
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
How to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr clusterHow to make a simple cheap high availability self-healing solr cluster
How to make a simple cheap high availability self-healing solr cluster
 
Best practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloudBest practices for highly available and large scale SolrCloud
Best practices for highly available and large scale SolrCloud
 
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, EtsyLessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
 
High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...
High Performance Solr and JVM Tuning Strategies used for MapQuestโ€™s Search Ah...
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
 
How SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded EnvironmentHow SolrCloud Changes the User Experience In a Sharded Environment
How SolrCloud Changes the User Experience In a Sharded Environment
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 

Viewers also liked

Intro to Apache Solr
Intro to Apache SolrIntro to Apache Solr
Intro to Apache Solr
Shalin Shekhar Mangar
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Shalin Shekhar Mangar
 
Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6
Shalin Shekhar Mangar
 
Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Lucidworks
 
SolrCloud and Shard Splitting
SolrCloud and Shard SplittingSolrCloud and Shard Splitting
SolrCloud and Shard Splitting
Shalin Shekhar Mangar
 
็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹
็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹
็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹
Ken Hirose
 
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
Lucidworks
 
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, ClouderaWhy Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
Lucidworks
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Lucidworks (Archived)
 

Viewers also liked (9)

Intro to Apache Solr
Intro to Apache SolrIntro to Apache Solr
Intro to Apache Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6Parallel SQL and Streaming Expressions in Apache Solr 6
Parallel SQL and Streaming Expressions in Apache Solr 6
 
Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud โ€“ An Elastic Solr Infrastructure: Presented by Nitin Sharm...
 
SolrCloud and Shard Splitting
SolrCloud and Shard SplittingSolrCloud and Shard Splitting
SolrCloud and Shard Splitting
 
็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹
็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹
็ฌฌ10ๅ›žsolrๅ‹‰ๅผทไผš solr cloudใฎๅฐŽๅ…ฅไบ‹ไพ‹
 
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
SolrCloud - High Availability and Fault Tolerance: Presented by Mark Miller, ...
 
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, ClouderaWhy Is My Solr Slow?: Presented by Mike Drob, Cloudera
Why Is My Solr Slow?: Presented by Mike Drob, Cloudera
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 

Similar to GIDS2014: SolrCloud: Searching Big Data

Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Elasticsearch Data Analyses
Elasticsearch Data AnalysesElasticsearch Data Analyses
Elasticsearch Data Analyses
Alaa Elhadba
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
James Chen
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
Anthony Baker
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
Piotr Pelczar
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
Splunk
 
Solr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin PresentationSolr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin Presentation
Nitin Sharma
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talk
Amrit Sarkar
 
Everything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in KubernetesEverything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in Kubernetes
The {code} Team
 
HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta
MapR Technologies
 
Spark 1.0
Spark 1.0Spark 1.0
Spark 1.0
Jatin Arora
 
Comparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systemsComparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systems
Imesha Sudasingha
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
Alex Payne
 
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - NitinSolr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
bloomreacheng
 
NetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talksNetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talks
Ruslan Meshenberg
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
whoschek
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera, Inc.
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
Roy Russo
 
MYSQL
MYSQLMYSQL
MYSQL
gilashikwa
 

Similar to GIDS2014: SolrCloud: Searching Big Data (20)

Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Elasticsearch Data Analyses
Elasticsearch Data AnalysesElasticsearch Data Analyses
Elasticsearch Data Analyses
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
 
Scalable Web Apps
Scalable Web AppsScalable Web Apps
Scalable Web Apps
 
Taking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout SessionTaking Splunk to the Next Level - Architecture Breakout Session
Taking Splunk to the Next Level - Architecture Breakout Session
 
Solr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin PresentationSolr Lucene Conference 2014 - Nitin Presentation
Solr Lucene Conference 2014 - Nitin Presentation
 
Cdcr apachecon-talk
Cdcr apachecon-talkCdcr apachecon-talk
Cdcr apachecon-talk
 
Everything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in KubernetesEverything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in Kubernetes
 
HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta HPTS talk on micro sharding with Katta
HPTS talk on micro sharding with Katta
 
Spark 1.0
Spark 1.0Spark 1.0
Spark 1.0
 
Comparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systemsComparison between zookeeper, etcd 3 and other distributed coordination systems
Comparison between zookeeper, etcd 3 and other distributed coordination systems
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
 
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - NitinSolr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
Solr Lucene Revolution 2014 - Solr Compute Cloud - Nitin
 
NetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talksNetflixOSS Open House Lightning talks
NetflixOSS Open House Lightning talks
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
MYSQL
MYSQLMYSQL
MYSQL
 

Recently uploaded

UW Cert degree offer diploma
UW Cert degree offer diploma UW Cert degree offer diploma
UW Cert degree offer diploma
dakyuhe
 
Top 10 ERP Companies in UAE Banibro IT Solutions.pdf
Top 10 ERP Companies in UAE Banibro IT Solutions.pdfTop 10 ERP Companies in UAE Banibro IT Solutions.pdf
Top 10 ERP Companies in UAE Banibro IT Solutions.pdf
Banibro IT Solutions
 
Understanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdfUnderstanding Automated Testing Tools for Web Applications.pdf
Understanding Automated Testing Tools for Web Applications.pdf
kalichargn70th171
 
B.Sc. Computer Science Department PPT 2024
B.Sc. Computer Science Department PPT 2024B.Sc. Computer Science Department PPT 2024
B.Sc. Computer Science Department PPT 2024
vmsdeptcom
 
05. Ruby Control Structures - Ruby Core Teaching
05. Ruby Control Structures - Ruby Core Teaching05. Ruby Control Structures - Ruby Core Teaching
05. Ruby Control Structures - Ruby Core Teaching
quanhoangd129
 
01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching
quanhoangd129
 
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
Andre Hora
 
The Politics of Agile Development.pptx
The  Politics of  Agile Development.pptxThe  Politics of  Agile Development.pptx
The Politics of Agile Development.pptx
NMahendiran
 
Unlocking the Future of Artificial Intelligence
Unlocking the Future of Artificial IntelligenceUnlocking the Future of Artificial Intelligence
Unlocking the Future of Artificial Intelligence
dorinIonescu
 
How Generative AI is Shaping the Future of Software Application Development
How Generative AI is Shaping the Future of Software Application DevelopmentHow Generative AI is Shaping the Future of Software Application Development
How Generative AI is Shaping the Future of Software Application Development
MohammedIrfan308637
 
07. Ruby String Slides - Ruby Core Teaching
07. Ruby String Slides - Ruby Core Teaching07. Ruby String Slides - Ruby Core Teaching
07. Ruby String Slides - Ruby Core Teaching
quanhoangd129
 
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Andre Hora
 
08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching
quanhoangd129
 
BitLocker Data Recovery | BLR Tools Data Recovery Solutions
BitLocker Data Recovery | BLR Tools Data Recovery SolutionsBitLocker Data Recovery | BLR Tools Data Recovery Solutions
BitLocker Data Recovery | BLR Tools Data Recovery Solutions
Alina Tait
 
vSAN_Tutorial_Presentation with important topics
vSAN_Tutorial_Presentation with important  topicsvSAN_Tutorial_Presentation with important  topics
vSAN_Tutorial_Presentation with important topics
abhilashspt
 
OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...
OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...
OpenChain Webinar: IAV, TimeToAct and ISO/IEC 5230 - Third-Party Certificatio...
Shane Coughlan
 
What is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdfWhat is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdf
lead93317
 
Literals - A Machine Independent Feature
Literals - A Machine Independent FeatureLiterals - A Machine Independent Feature
Literals - A Machine Independent Feature
21h16charis
 
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery SolutionBDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
praveene26
 
Fantastic Design Patterns and Where to use them No Notes.pdf
Fantastic Design Patterns and Where to use them No Notes.pdfFantastic Design Patterns and Where to use them No Notes.pdf
Fantastic Design Patterns and Where to use them No Notes.pdf
6m9p7qnjj8
 

GIDS2014: SolrCloud: Searching Big Data

  • 1. SolrCloud: Searching Big Data Shalin Shekhar Mangar
  • 2. What is SolrCloud?
      - A subset of optional features in Solr that enable and simplify horizontally scaling a search index using sharding and replication.
      - Goals: performance, scalability, high availability, simplicity, and elasticity
  • 3. Terminology
      - ZooKeeper: Distributed coordination service that provides centralized configuration, cluster state management, and leader election
      - Node: JVM process bound to a specific port on a machine; hosts the Solr web application
      - Collection: Search index distributed across multiple nodes; each collection has a name, shard count, and replication factor
      - Replication Factor: Number of copies of a document in a collection
  • 4. Terminology
      - Shard: Logical slice of a collection; each shard has a name, hash range, leader, and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy
      - Replica: Solr index that hosts a copy of a shard in a collection; behind the scenes, each replica is implemented as a Solr core
      - Leader: Replica in a shard that assumes special duties needed to support distributed indexing in Solr; each shard has one and only one leader at any time, and leaders are elected using ZooKeeper
  • 6. Collection == Distributed Index
      - A collection is a distributed index defined by:
          - named configuration stored in ZooKeeper
          - number of shards: documents are distributed across N partitions of the index
          - document routing strategy: how documents get assigned to shards
          - replication factor: how many copies of each document in the collection
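The four properties above correspond to the parameters of the deck's Collections API example (`action=create` with `name`, `numShards`, `replicationFactor`, and `collection.configName`). As a rough sketch, the request URL can be assembled as follows; the helper names `create_collection_url` and `total_cores` are hypothetical, and the host, port, and values are simply those from the deck's example.

```python
from urllib.parse import urlencode


def create_collection_url(base, name, num_shards, replication_factor, config_name):
    """Build a Collections API 'create' URL (helper name is illustrative)."""
    params = {
        "action": "create",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    return f"{base}/admin/collections?{urlencode(params)}"


def total_cores(num_shards, replication_factor):
    """Each shard is hosted by replication_factor replicas (Solr cores)."""
    return num_shards * replication_factor


url = create_collection_url(
    "http://localhost:8983/solr", "logstash4solr", 2, 2, "logs"
)
```

With two shards and a replication factor of two, the collection is hosted by four cores in total.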
  • 8. Sharding
      - Collection has a fixed number of shards
          - existing shards can be split
      - When to shard?
          - Large number of docs
          - Large document sizes
          - Parallelization during indexing and queries
          - Data partitioning (custom hashing)
  • 9. Document Routing
      - Each shard covers a hash range
      - Default: Hash ID into 32-bit integer, map to range
          - leads to (roughly) balanced shards
      - Custom hashing (example in a few slides)
      - Tri-level: app!user!doc
      - Implicit: no hash range set for shards
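The default routing described above (hash the ID to a 32-bit integer, then find the shard whose range contains it) can be sketched as follows. This is only an illustration: CRC32 is used as a stand-in 32-bit hash and is an assumption, not the hash function Solr itself uses, and the function names are hypothetical.

```python
import zlib

HASH_MIN = -2**31  # lower bound of the signed 32-bit hash space


def shard_ranges(num_shards):
    """Split the signed 32-bit space into num_shards contiguous ranges."""
    bounds = [HASH_MIN + (2**32 * i) // num_shards for i in range(num_shards + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_shards)]


def hash32(doc_id):
    """Stand-in 32-bit hash (CRC32), reinterpreted as a signed integer."""
    h = zlib.crc32(doc_id.encode("utf-8"))
    return h - 2**32 if h >= 2**31 else h


def route(doc_id, ranges):
    """Return the index of the one shard whose range contains the hash."""
    h = hash32(doc_id)
    for i, (lo, hi) in enumerate(ranges):
        if lo <= h <= hi:
            return i
    raise ValueError("hash outside all ranges")


ranges = shard_ranges(2)
shard_index = route("doc-1", ranges)
```

Because the ranges are contiguous and cover the whole hash space, every document lands in exactly one shard, and a reasonably uniform hash keeps the shards roughly balanced.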
  • 10. Replication
      - Why replicate?
          - High availability
          - Load balancing
      - How does it work in SolrCloud?
          - Near-real-time, not master-slave
          - Leader forwards to replicas in parallel, waits for response
          - Error handling during indexing is tricky
  • 13. Distributed Indexing
      1. Get cluster state from ZK
      2. Route document directly to leader (hash on doc ID)
      3. Persist document on durable storage (tlog)
      4. Forward to healthy replicas
      5. Acknowledge the successful write to the client
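The five indexing steps can be traced in a small in-memory simulation. Everything here is a hypothetical stand-in: the `Replica` class, the list used as a transaction log, and the CRC32-modulo routing are assumptions made for illustration, and the forwarding is sequential rather than parallel.

```python
import zlib


class Replica:
    """In-memory stand-in for one replica (a Solr core) of a shard."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.tlog = []    # stand-in for the durable transaction log
        self.index = {}   # doc ID -> document

    def apply(self, doc):
        self.tlog.append(doc)          # persist to the log first
        self.index[doc["id"]] = doc    # then make it searchable


def pick_shard(doc_id, shard_names):
    """Toy routing: hash the doc ID and pick a shard (not Solr's hash)."""
    return shard_names[zlib.crc32(doc_id.encode("utf-8")) % len(shard_names)]


def index_document(doc, cluster_state):
    """cluster_state maps shard name -> {"leader": Replica, "replicas": [...]}.

    It plays the role of the state a client would read from ZooKeeper (step 1).
    """
    shard = pick_shard(doc["id"], sorted(cluster_state))   # step 2
    leader = cluster_state[shard]["leader"]
    leader.apply(doc)                                      # step 3
    forwarded = []
    for replica in cluster_state[shard]["replicas"]:       # step 4
        if replica.healthy:
            replica.apply(doc)
            forwarded.append(replica.name)
    return {"shard": shard, "forwarded_to": forwarded}     # step 5


lead, ok, down = Replica("r1"), Replica("r2"), Replica("r3", healthy=False)
state = {"shard1": {"leader": lead, "replicas": [ok, down]}}
ack = index_document({"id": "doc1"}, state)
```

In this run the unhealthy replica `r3` is skipped, so it would have to catch up later through recovery.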
  • 14. Shard Leader
      - Additional responsibilities during indexing only! Not a master node
      - Leader is a replica (handles queries)
      - Accepts update requests for the shard
      - Increments the _version_ on the new or updated doc
      - Sends updates (in parallel) to all replicas
  • 15. Distributed Queries
      1. Query client can be ZK-aware or just query via a load balancer
      2. Client can send query to any node in the cluster
      3. Controller node distributes the query to a replica for each shard to identify documents matching query
      4. Controller node sorts the results from step 3 and issues a second query for all fields for a page of results
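Steps 3 and 4 describe a two-phase scatter-gather. The sketch below models it with plain dictionaries: each shard is a hypothetical `doc ID -> document` mapping, and the precomputed `score` field is an assumption standing in for the relevance score a real search engine would compute at query time.

```python
import heapq


def distributed_query(shards, predicate, rows):
    """Two-phase query over a list of shards (each a dict: doc ID -> document)."""
    # Phase 1: every shard returns only (score, id) for its own best matches.
    candidates = []
    for shard_no, shard in enumerate(shards):
        hits = [
            (doc["score"], doc_id, shard_no)
            for doc_id, doc in shard.items()
            if predicate(doc)
        ]
        candidates.extend(heapq.nlargest(rows, hits))

    # The controller merges the per-shard lists and keeps one page.
    page = heapq.nlargest(rows, candidates)

    # Phase 2: fetch all fields, but only for the documents on that page.
    return [shards[shard_no][doc_id] for _score, doc_id, shard_no in page]


shards = [
    {"a": {"id": "a", "score": 0.9, "level_s": "ERROR"},
     "b": {"id": "b", "score": 0.2, "level_s": "ERROR"}},
    {"c": {"id": "c", "score": 0.5, "level_s": "ERROR"},
     "d": {"id": "d", "score": 0.7, "level_s": "INFO"}},
]
results = distributed_query(shards, lambda d: d["level_s"] == "ERROR", rows=2)
```

Fetching only IDs and scores in the first phase keeps the merge cheap; full documents are retrieved for a single page of results.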
  • 16. Scalability / Stability Highlights
      - All nodes in cluster perform indexing and execute queries; no master node
      - Distributed indexing: No SPoF, high throughput via direct updates to leaders, automated failover to new leader
      - Distributed queries: Add replicas to scale out qps; parallelize complex query computations; fault tolerance
      - Indexing / queries continue so long as there is 1 healthy replica per shard
  • 17. SolrCloud and CAP
      - A distributed system should be: Consistent, Available, and Partition tolerant
      - CAP says pick 2 of the 3! (slightly more nuanced than that in reality)
      - SolrCloud favors consistency over write-availability (CP)
      - All replicas in a shard have the same data
      - Active replica sets concept (writes accepted so long as a shard has at least one active replica available)
  • 18. SolrCloud and CAP
      - No tools to detect or fix consistency issues in Solr
          - Reads go to one replica; no concept of quorum
          - Writes must fail if consistency cannot be guaranteed (SOLR-5468)
  • 19. ZooKeeper
      - Is a very good thing ... clusters are a zoo!
      - Centralized configuration management
      - Cluster state management
      - Leader election (shard leader and overseer)
      - Overseer distributed work queue
      - Live nodes
          - Ephemeral znodes used to signal a server is gone
      - Needs 3 nodes for quorum in production
  • 20. ZooKeeper: Centralized Configuration
      - Store config files in ZooKeeper
      - Solr nodes pull config during core initialization
      - Config sets can be "shared" across collections
      - Changes are uploaded to ZK and then collections should be reloaded
  • 21. ZooKeeper: State Management
      - Keep track of /live_nodes znode
      - Ephemeral nodes
      - ZooKeeper client timeout
      - Collection metadata and replica state in /clusterstate.json
      - Every core has watchers for /live_nodes and /clusterstate.json
      - Leader election
      - ZooKeeper sequence numbers on ephemeral znodes
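Leader election via sequence numbers on ephemeral znodes can be pictured with a small in-memory model, in the spirit of the common ZooKeeper election recipe: each candidate receives an increasing sequence number, the lowest live number leads, and a lost session removes the candidate's entry. The `ElectionSim` class is hypothetical and does not talk to ZooKeeper; how Solr wires this up internally is not shown in the slides.

```python
class ElectionSim:
    """Toy model of election by ephemeral, sequentially numbered entries."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}   # sequence number -> node name

    def join(self, node):
        """Register a candidate; it receives the next sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = node
        return seq

    def session_lost(self, node):
        """An expired session removes the node's ephemeral entries."""
        for seq in [s for s, n in self._candidates.items() if n == node]:
            del self._candidates[seq]

    def leader(self):
        """The candidate holding the lowest live sequence number leads."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = ElectionSim()
for name in ("node1", "node2", "node3"):
    election.join(name)
first_leader = election.leader()
election.session_lost("node1")
second_leader = election.leader()
```

A node that rejoins after losing its session gets a new, higher sequence number, so it queues behind the current leader instead of displacing it.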
  • 22. Overseer
      - What does it do?
          - Persists collection state change events to ZooKeeper
          - Controller for Collection API commands
          - Ordered updates
          - One per cluster (for all collections); elected using leader election
      - How does it work?
          - Asynchronous (pub/sub messaging)
          - ZooKeeper as distributed queue recipe
          - Automated failover to a healthy node
          - Can be assigned to a dedicated node (SOLR-5476)
  • 23. Custom Hashing
      - Route documents to specific shards based on a shard key component in the document ID
      - Send all log messages from the same system to the same shard
      - Direct queries to specific shards: q=...&_route_=httpd
      - Example document: { "id" : "httpd!2", "level_s" : "ERROR", "lang_s" : "en", ... }
      - Hash: shardKey!docID
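One way to picture `shardKey!docID` routing is a composite hash in which the shard key supplies the high bits and the document ID the low bits, so all IDs sharing a key fall into one contiguous slice of the hash space. The sketch below assumes a 16/16-bit split and uses CRC32 as a stand-in hash; both are assumptions for illustration, not a statement of Solr's exact implementation, and tri-level keys are not modeled.

```python
import zlib


def h32(text):
    """Stand-in unsigned 32-bit hash (CRC32); not Solr's hash function."""
    return zlib.crc32(text.encode("utf-8"))


def composite_hash(doc_id):
    """High 16 bits from the shard key, low 16 bits from the rest of the ID."""
    if "!" not in doc_id:
        return h32(doc_id)
    shard_key, rest = doc_id.split("!", 1)
    return (h32(shard_key) & 0xFFFF0000) | (h32(rest) & 0x0000FFFF)


def shard_for(doc_id, num_shards):
    """Equal-width ranges over the unsigned 32-bit space.

    With a power-of-two shard count the range boundaries fall on 16-bit
    blocks, so a shard key never straddles two shards in this sketch.
    """
    return composite_hash(doc_id) * num_shards // 2**32


same_shard = shard_for("httpd!21", 4) == shard_for("httpd!33", 4)
```

Documents `httpd!21` and `httpd!33` share the same high bits, so they land in the same shard, which is what lets a query with `_route_=httpd` be sent to that shard alone.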
  • 24. Custom Hashing Highlights
      - Co-locate documents having a common property in the same shard
          - e.g. docs having IDs httpd!21 and httpd!33 will be in the same shard
      - Scale up the replicas for specific shards to address high query and/or indexing volume from specific apps
      - Not as much control over the distribution of keys
          - httpd, mysql, and collectd all in same shard
      - Can split unbalanced shards when using custom hashing
  • 25. Shard Splitting
      - Can split shards into two sub-shards
      - Live splitting! No downtime needed!
      - Requests start being forwarded to sub-shards automatically
      - Expensive operation: Use as required during low traffic
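Conceptually, splitting a shard divides its hash range into two halves and reassigns each document to the half containing its hash. The sketch below shows only that arithmetic on a hypothetical `hash -> doc ID` mapping; it does not model the live forwarding of requests or the cost of rebuilding the sub-shard indexes.

```python
def split_range(lo, hi):
    """Split an inclusive hash range into two adjacent halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


def split_docs(docs_by_hash, lo, hi):
    """Partition a shard's documents (hash -> doc ID) between two sub-shards."""
    lower, upper = split_range(lo, hi)
    sub_a = {h: d for h, d in docs_by_hash.items() if lower[0] <= h <= lower[1]}
    sub_b = {h: d for h, d in docs_by_hash.items() if upper[0] <= h <= upper[1]}
    return sub_a, sub_b


halves = split_range(0, 2**31 - 1)
sub_a, sub_b = split_docs({5: "doc-a", 2**30: "doc-b"}, 0, 2**31 - 1)
```

Every hash in the parent range belongs to exactly one half, so no document is lost or duplicated by the split.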
  • 26. Other features / highlights
      - Near-Real-Time Search: Documents are visible within a second or so after being indexed
      - Partial Document Update: Just update the fields you need to change on existing documents
      - Optimistic Locking: Ensure updates are applied to the correct version of a document
      - Transaction log: Better recoverability; peer-sync between nodes after hiccups
      - HTTPS
      - Use HDFS for storing indexes
      - Use MapReduce for building index (SOLR-1301)
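Optimistic locking can be illustrated with a simplified compare-and-set on the `_version_` field: an update that names an expected version is applied only if the stored document still carries that version. The `VersionedStore` class and its counter-based versions are hypothetical; this is a reduced model of the idea, not Solr's full set of version rules.

```python
import itertools


class VersionConflict(Exception):
    """Raised when the expected version does not match the stored one."""


class VersionedStore:
    """Toy document store that assigns an increasing _version_ on each write."""

    def __init__(self):
        self._docs = {}
        self._clock = itertools.count(1)

    def update(self, doc, expected_version=None):
        current = self._docs.get(doc["id"])
        if expected_version is not None:
            actual = current["_version_"] if current else None
            if actual != expected_version:
                raise VersionConflict(
                    f"expected {expected_version}, found {actual}"
                )
        stored = dict(doc, _version_=next(self._clock))
        self._docs[doc["id"]] = stored
        return stored["_version_"]


store = VersionedStore()
v1 = store.update({"id": "httpd!2", "level_s": "ERROR"})
v2 = store.update({"id": "httpd!2", "level_s": "WARN"}, expected_version=v1)
```

A writer still holding `v1` after this point would get a `VersionConflict` and would need to re-read the document before retrying.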
  • 27. More?
      - Workshop: Apache Solr in Minutes tomorrow
      - https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
      - shalin@apache.org
      - http://twitter.com/shalinmangar
      - http://shal.in
  • 28. Attributions
      - Tim Potter's slides on "Introduction to SolrCloud" at Lucene/Solr Exchange 2014
          - http://twitter.com/thelabdude
      - Erik Hatcher's slides on "Solr: Search at the speed of light" at JavaZone 2009
          - http://twitter.com/ErikHatcher