SolrCloud in Public Cloud: Scaling Compute Independently from Storage - Ilan Ginzburg & Yonik Seeley, Salesforce

SolrCloud in Public Cloud: Scaling Compute Independently from Storage
Ilan Ginzburg: working on Search infra, Architect at Salesforce
Yonik Seeley: creator of Solr, Lucidworks co-founder, Principal Architect at Salesforce

Reduce cost to serve and improve quality of service
by separating search compute from search index storage.
Salesforce search traffic lends itself well to such optimizations:
● Multi-tenant
● Unknown access patterns
● Unknown data growth (speed and scale)
● Indexing-heavy

What this talk is about
Changes to SolrCloud that store indexes on shared storage and let
multiple nodes share them, so that cluster size can be adjusted to
current traffic.
An Activate 2018 presentation, “Who moved my state? A Blob
Storage Solr story”, described the integration of Blob storage with the
Salesforce non-SolrCloud Solr cluster.

Agenda
• Big picture
• Challenges and implementation
• Wrap up

Big picture
[Diagram: a client, SolrCloud node 1 and SolrCloud node 2, each with local storage (not durable), and a Blob store (durable). Steps are numbered 1 to 6.]

Blob vs HDFS implementation
[Diagram: Blob implementation with SolrCloud node 1 and node 2, each with local storage, sharing a Blob store; HDFS implementation with SolrCloud node 1 and node 2 on top of HDFS.]

Challenges and implementation
• Controlling writes
• Storage format
• Stateless nodes
• New replica type
• Commits...

Leaders and shared storage
[Diagram: two snapshots, at times t1 and t2, of SolrCloud nodes 1, 2 and 3 connected to the Blob store, with a node marked as leader in each snapshot.]

Storage format on Blob
Metadata file listing files in commit point:
core.metadata.23393bd7
    _1.si → _1.si.957a2cec
    _0.cfs → _0.cfs.94639b68
    …
    _0.cfe → _0.cfe.5d8dbde9
    segments_3 → segments_3.241a8cbc
Files with actual segment data:
    _1.si.957a2cec
    ....
    _0.cfe.5d8dbde9
    ...

Pushing changes to Blob
1. Compare Blob metadata to local commit point.
2. Push new files to Blob (with unique names).
3. Push new version of Blob metadata file describing the new commit point.
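
The three steps above can be sketched as follows. This is a minimal illustration and not the code from the Solr patch: the blob store is a plain dict, and the function name `push_to_blob`, its parameters, and the use of a random UUID fragment as the unique suffix are all assumptions made for the example. It also assumes index files are write-once, so comparing by file name is enough.

```python
import uuid

def push_to_blob(blob, local_files, current_metadata_name):
    """Push a local commit point to a dict-backed stand-in for a blob store.

    blob: dict mapping blob name -> content (hypothetical stand-in).
    local_files: dict mapping local file name -> content of the commit point.
    current_metadata_name: name of the current metadata blob, or None.
    Returns the name of the new metadata blob.
    """
    # Step 1: compare Blob metadata to the local commit point.
    # Assumption: files are write-once, so a known name means known content.
    known = blob[current_metadata_name] if current_metadata_name else {}
    new_names = [name for name in local_files if name not in known]

    # Step 2: push new files under unique names (file name + random suffix).
    mapping = {name: known[name] for name in local_files if name in known}
    for name in new_names:
        blob_name = f"{name}.{uuid.uuid4().hex[:8]}"
        blob[blob_name] = local_files[name]
        mapping[name] = blob_name

    # Step 3: push a new metadata file describing the new commit point.
    metadata_name = f"core.metadata.{uuid.uuid4().hex[:8]}"
    blob[metadata_name] = mapping
    return metadata_name
```

Files already listed in the previous metadata are not pushed again; only their existing Blob names are carried over into the new metadata file.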

Supporting multiple writers
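[Diagram: SolrCloud, the Blob store and ZooKeeper, with steps numbered 1 to 5. ZooKeeper holds the current suffix (xyz). The Blob store holds the segment files and core.metadata.xyz; a push adds new segment files and core.metadata.newSuffix, and the current suffix in ZooKeeper changes from xyz to newSuffix.]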
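
The diagram suggests that ZooKeeper records which core.metadata suffix is current, so a push only becomes visible once that suffix is updated. One way to make this safe with several writers is a conditional (compare-and-set) update, sketched below with an in-memory stand-in for ZooKeeper. The slide does not spell out the protocol: the step ordering, the class `SuffixRegistry` and the function `publish` are assumptions for illustration only.

```python
import threading

class SuffixRegistry:
    """In-memory stand-in for the ZooKeeper node holding the current suffix."""

    def __init__(self, suffix):
        self._suffix = suffix
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._suffix

    def compare_and_set(self, expected, new):
        # Succeeds only if nobody else published since `expected` was read.
        with self._lock:
            if self._suffix != expected:
                return False
            self._suffix = new
            return True

def publish(registry, blob, expected_suffix, new_suffix, mapping):
    """Write core.metadata.<new_suffix>, then try to make it current."""
    blob[f"core.metadata.{new_suffix}"] = mapping
    # A losing writer leaves an unreferenced metadata blob behind; readers
    # never see it because they follow the suffix stored in the registry.
    return registry.compare_and_set(expected_suffix, new_suffix)
```

A writer whose `publish` returns False would need to re-read the current suffix, pull the newer state, and retry. Because every pushed file has a unique name, a failed attempt never overwrites data that another writer made current.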

Nodes are “stateless”
Local storage on Nodes can disappear at any time
Transaction log is not durable
Pushing to Blob is the only durability option
→ Requires hard commits before ACK

SHARED replica type
- Does not forward any indexing from leader
(splits excepted)
- Does not replicate
- Pushes to Blob (leader), pulls from Blob (others)
- Leader election not required, best effort “leader”
selection would be sufficient
- Imposes commits

Minimum number of replicas
- Replicas no longer used for durability
(Blob takes care of that).
- Do we need 2 replicas tracked in ZK for fast failover?
- Can we have only 1?
- Goal is 0...

Indexes are loaded when needed
Large number of tenants, not all of them active.
Loading all indexes is not possible (too many).
Only the working set should be in memory.
Indexes on local storage disk of Solr Node but not open.
Indexes on Blob store but not on any Solr Node...

Indexing Performance Impacts
Each commit is more expensive
● Push of new segments to Blob
● Write to ZooKeeper
Commit amplification
● Every update request needs an implicit commit
● Commit amplification causes write amplification

Data Loss Scenario (without implicit commits)
[Diagram: two snapshots of Node 1 and Node 2, each hosting a replica of Shard 1 with its own local storage, plus the Blob store and a client. A node is marked as leader in each snapshot. Steps are numbered 1 to 5.]

Reducing commit cost
● Put transaction log on blob store
■ Exchanges pushing small segment files for pushing
transaction logs
■ Does not resolve commit amplification
■ Adds additional complexity
● Index with multiple threads to compensate for latency
● Flushing to blob store pre-commit
■ Start writing large files early
■ Max-sized segments won’t be merged
■ Directory based implementation may help

Reducing commits
● Share commits
■ Concurrent batches could share commits
■ Implement via waitFlush=true with commitWithin
■ Adds efficiency, but increases latency
● Increase batch size
■ Add delay (if possible) to coalesce incremental updates
● “Best effort” indexing flag
■ Requires good client strategy for detecting & fixing missing data
● Client update partitioning
■ Indexing fanout can be largest contributor to commit amplification

Commit Amplification from Fanout
● One high-level request turns into N sub-requests
■ Each one needs an implicit commit
● O(num_shards * num_batches)
● Each request limited by slowest shard
● Many small writes to Blob
[Diagram: two SolrJ clients, one sending the batch Doc1, Doc2, Doc3, Doc4, … and the other Doc5, Doc6, Doc7, Doc8, …, each fanning out to Shard1 through Shard5, ….]
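
To make the O(num_shards * num_batches) figure concrete, the sketch below counts the implicit commits caused by batches built in arrival order. It is an illustration under stated assumptions: `shard_of` is a hypothetical stand-in that routes with crc32 modulo the shard count, which is not how Solr routes documents, and the document ids are invented.

```python
import zlib

def shard_of(doc_id, num_shards):
    # Stand-in routing for illustration only; Solr routes with MurmurHash3
    # over hash ranges, not crc32 modulo the shard count.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def count_commits(batches, num_shards):
    """One sub-request, and so one implicit commit, per shard touched by a batch."""
    return sum(len({shard_of(d, num_shards) for d in batch}) for batch in batches)

doc_ids = [f"doc{i}" for i in range(1000)]
batches = [doc_ids[i:i + 100] for i in range(0, len(doc_ids), 100)]
commits = count_commits(batches, num_shards=5)
print(commits, "commits for", len(batches), "batches")
```

The count is bounded above by num_shards * num_batches and below by num_batches; the lower bound is reached only when every batch stays within a single shard.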

Fanout Mitigation: Topology Knowledge
Line up indexing batches with shards
● Use Solr APIs to get sharding
■ Need hash range for each shard
● Hash IDs using CompositeIdRouter
● Make batches that don’t cross shards
● Hash partitioning often desired anyway to avoid version reorders!
For custom sharding
● Easy, just do it!
[Diagram: one SolrJ client per shard. Shard1 receives Doc199, Doc248, Doc3743, Doc4295, …; Shard2 receives Doc87, Doc462, Doc744, Doc2001, …; Shard3 receives Doc322, Doc547, Doc1011, Doc2539, ….]
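
A minimal sketch of lining up batches with shards, assuming the client has already fetched each shard's hash range. The function names, the shard layout and the document ids are hypothetical, and `hash32` uses crc32 purely as a stand-in: a real client would have to hash ids the same way Solr's CompositeIdRouter does (MurmurHash3) for the batches to match the actual shards.

```python
import bisect
import zlib

def hash32(doc_id):
    # Stand-in for illustration: Solr's CompositeIdRouter uses MurmurHash3.
    # Returns a signed 32-bit value so ranges can be given as signed ints.
    h = zlib.crc32(doc_id.encode("utf-8"))
    return h - (1 << 32) if h >= (1 << 31) else h

def batches_by_shard(doc_ids, shard_ranges, batch_size):
    """shard_ranges: list of (shard_name, min_hash, max_hash), signed, inclusive.

    Assumes the ranges cover the whole 32-bit ring without gaps.
    """
    ranges = sorted(shard_ranges, key=lambda r: r[1])
    lower_bounds = [r[1] for r in ranges]
    grouped = {}
    for doc_id in doc_ids:
        # Find the last range whose lower bound is <= the document hash.
        idx = bisect.bisect_right(lower_bounds, hash32(doc_id)) - 1
        grouped.setdefault(ranges[idx][0], []).append(doc_id)
    return {
        shard: [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
        for shard, ids in grouped.items()
    }

# Hypothetical two-shard layout covering the whole signed 32-bit ring.
ranges = [("shard1", -2**31, -1), ("shard2", 0, 2**31 - 1)]
plan = batches_by_shard([f"doc{i}" for i in range(10)], ranges, batch_size=3)
```

Each batch in `plan` targets exactly one shard, so it costs one sub-request and one implicit commit.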

Fanout Mitigation: Hash Partitioning
Simpler approach: partition by hash
● Utilizes Solr hash, but not topology
● Scale number of partitions by how much data needs indexing
● Under-partitioning: no harm if not enough data for optimal sized batches anyway
[Diagram: Shard 1 through Shard 6 and two SolrJ clients, one per hash partition (0x80000000-0xFFFFFFFF and 0x00000000-0x7FFFFFFF). One client sends Doc2, Doc4, Doc5, Doc7, …; the other sends Doc1, Doc3, Doc6, Doc8, ….]
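
Partitioning by hash without topology knowledge can be sketched by taking the top bits of the document hash. The names `partition_of` and `partition` are hypothetical, and crc32 again stands in for the Solr hash, which a real client would need to reuse for partitions to follow the shard hash ranges.

```python
import zlib

def hash32(doc_id):
    # Stand-in hash for illustration; a real client would reuse Solr's hash
    # (MurmurHash3 via CompositeIdRouter) so partitions follow shard ranges.
    return zlib.crc32(doc_id.encode("utf-8"))  # unsigned 32-bit

def partition_of(doc_id, bits):
    """Partition index from the top `bits` bits of the hash (2**bits partitions)."""
    return hash32(doc_id) >> (32 - bits) if bits else 0

def partition(doc_ids, bits):
    parts = [[] for _ in range(1 << bits)]
    for doc_id in doc_ids:
        parts[partition_of(doc_id, bits)].append(doc_id)
    return parts

docs = [f"doc{i}" for i in range(1000)]
two = partition(docs, bits=1)   # the two halves of the ring, as in the diagram
four = partition(docs, bits=2)  # each half split again
```

Power-of-two partitions nest: raising `bits` by one splits every existing partition in half, so the partition count can follow the amount of data to index. If the six shards of the diagram cover equal hash ranges, each of the two partitions touches three shards instead of six.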

Fanout Mitigation: Hash Partitioning 2
Over-partitioning: more client partitions than shards.
● Again, no harm
● Increases parallelism
● Easy to decrease parallelism if desired
■ Keep num partitions the same
■ Use fewer indexing threads
[Diagram: Shard1 and Shard2 with four SolrJ clients sending the batches (Doc2, Doc4, Doc5, Doc7, …), (Doc1, Doc3, Doc6, Doc8, …), (Doc9, Doc14, Doc11, Doc15, …) and (Doc12, Doc13, Doc16, Doc10, …).]

Hash Partitioning with Composite Ids
● Composite ids like “yonik!email1” do not distribute randomly on hash ring
■ Use splitByPrefix flag for prefix aware shard splitting
● Determine amount of data to be indexed for collection:
● If small amount, just send it
● If large amount, analyze the different prefixes
■ Tiny prefixes: send them together as a single partition
■ Medium prefixes: use the prefix as a separate partition
■ Large prefixes: create multiple partitions by evenly splitting the prefix range
● Maybe pick number of partitions that is a power of 2
■ Index partitions with any number of threads <= num_partitions
● If partitions > threads, don’t concurrently index partitions next to each other
● Note: beware changing indexing partitions on-the-fly
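
The tiny / medium / large rule above can be sketched as a small planning function. Everything specific here is an assumption for illustration: the function name, the three thresholds, the example prefixes and counts, and the choice to split large prefixes into a power-of-two number of sub-ranges.

```python
def plan_partitions(prefix_counts, tiny_max, large_min, target_size):
    """Plan indexing partitions from per-prefix document counts.

    Thresholds are hypothetical tuning knobs, not values from the talk.
    Returns a list of (prefixes, splits); splits > 1 means the prefix's
    hash range should be cut into that many even sub-ranges.
    """
    plan, tiny = [], []
    for prefix, count in sorted(prefix_counts.items()):
        if count <= tiny_max:
            tiny.append(prefix)            # tiny: send together as one partition
        elif count < large_min:
            plan.append(([prefix], 1))     # medium: the prefix is its own partition
        else:
            splits = 1
            while count / splits > target_size:
                splits *= 2                # large: split evenly, power of 2
            plan.append(([prefix], splits))
    if tiny:
        plan.append((tiny, 1))
    return plan

counts = {"acme": 50, "bob": 20, "yonik": 40_000, "zed": 900_000}
plan = plan_partitions(counts, tiny_max=1_000, large_min=100_000, target_size=250_000)
```

With these made-up numbers, "acme" and "bob" share one partition, "yonik" gets its own, and "zed" is split into four sub-ranges.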

Shard splits
Plan for “online” splitting same as for non-blob
■ Forward updates from shard leader to sub-shard leaders
■ Transaction log on sub-shard leader buffers
■ If sub-shard leader dies, split should fail

Wrap up
• Summary
• What’s next

Summary
Elasticity is the initial goal.
Blob is more cost effective than Block: fewer replicas!
Easily shut down servers and still be able to serve all data.

What’s next
Status of the work
Open Source
https://issues.apache.org/jira/browse/SOLR-13101
https://github.com/apache/lucene-solr/pull/864
