OCTOBER 13-16, 2016 • AUSTIN, TX
Lessons from Sharding Solr at Etsy
Gregg Donovan
Senior Software Engineer,
• 5.5 Years Solr & Lucene at
• 3 Years Solr & Lucene at
• Speaker at LuceneRevolution 2011 & 2013
Jeff Dean, Challenges in Building Large-Scale Information Retrieval Systems
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
1.5 Million Active Shops
32 Million Items Listed
21.7 Million Active Buyers
• Sharding Solr at Etsy V0 — No sharding
• Sharding Solr at Etsy V1 — Local sharding
• Sharding Solr at Etsy V2 (*) — Distributed sharding
• Questions
* What we’re about to launch.
Sharding V0 — Not Sharding
• Why do we shard?
• Data size grows beyond RAM on a single box
• Lucene can handle this, but there’s a performance cost
• Data size grows beyond local disk
• Latency requirements
• Not sharding allowed us to avoid many problems we’ll discuss later.
Sharding V0 — Not Sharding
• How to keep data size small enough for one host?
• Don’t store anything other than IDs
• fl=pk_id,fk_id,score
• Keep materialized objects in memcached
• Only index fields needed
• Prune index after experiments add fields
• Get more RAM
Sharding V0 — Not Sharding
• How does it fail?
• GC
• Solution
• “Banner” protocol
• Client-side load balancer
• Client connects, waits for 4 bytes (0xC0DEA5CF) from the server within 1-10ms before
sending query. Otherwise, try another server.
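The banner check above can be sketched on the client side roughly as follows. This is a minimal illustration, not Etsy's implementation: the class and method names are hypothetical, and it assumes the four bytes arrive in network (big-endian) byte order.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

final class BannerClient {
    // Banner value from the slide; big-endian byte order is an assumption.
    static final int BANNER = 0xC0DEA5CF;

    /** Returns true if the next four bytes of the stream equal the banner. */
    static boolean readBanner(InputStream in) {
        try {
            return new DataInputStream(in).readInt() == BANNER;
        } catch (IOException e) {
            // Covers end-of-stream and SocketTimeoutException (server too slow).
            return false;
        }
    }

    /**
     * Tries each host in order and returns the first socket whose server sent
     * the banner in time, or null if none did. The caller should set its own
     * read timeout on the returned socket before sending the query.
     */
    static Socket connectToHealthy(List<InetSocketAddress> hosts, int timeoutMs) {
        for (InetSocketAddress host : hosts) {
            Socket socket = new Socket();
            try {
                socket.connect(host, timeoutMs);
                socket.setSoTimeout(timeoutMs); // bounds each read while waiting for the banner
                if (readBanner(socket.getInputStream())) {
                    return socket;
                }
            } catch (IOException e) {
                // Connect failed or timed out: fall through to the next host.
            }
            closeQuietly(socket);
        }
        return null;
    }

    private static void closeQuietly(Socket socket) {
        try {
            socket.close();
        } catch (IOException ignored) {
            // Nothing useful to do if close fails.
        }
    }
}
```

A server stuck in a GC pause cannot write the banner, so the client moves on after a few milliseconds instead of waiting out a full query timeout.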
Sharding V1 — Local Sharding
• Motivations
• Better latency
• Smaller JVMs
• Tough to open a 31 GB heap dump on your laptop
• Working set still fit in RAM on one box.
• What’s the simplest system we can build?
Sharding V1 — Local Sharding
• Lucene parallelism
• Shikhar Bhushan at Etsy experimented with segment level parallelism
• See Search-time Parallelism at Lucene Revolution 2014
• Made its way into LUCENE-6294 (Generalize how IndexSearcher parallelizes collection
execution). Committed in Lucene 5.1.
• Ended up with eight Solr shards per host, each in its own small JVM
• Moved query generation and re-ranking to separate process: the “mixer”
Sharding V1 — Local Sharding
• Based on Solr distributed search
• By default, Solr does two-pass distributed search
• First pass gets top IDs
• Second pass fetches stored fields for each top document
• Implemented distrib.singlePass mode (SOLR-5768)
• Does not make sense if individual documents are expensive to fetch
• Basic request tracing via HTTP headers (SOLR-5969)
Sharding V1 — Local Sharding
• Required us to fetch 1000+ results from each shard for reranking layer
• How to efficiently fetch 1000 documents per shard?
• Use Solr’s field syntax to fetch data from FieldCache
• e.g. fl=pk_id:field(pk_id),fk_id:field(fk_id),score
• When all fields are “pseudo” fields, no need to fetch stored fields per document.
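As a small illustration of the pseudo-field syntax above, the fl value can be generated from a list of field names. The helper name is hypothetical. The sketch only builds the parameter string and does not talk to Solr.

```java
import java.util.List;
import java.util.stream.Collectors;

final class PseudoFieldList {
    /** Builds an fl value that reads each field via field() and appends score. */
    static String build(List<String> fieldNames) {
        String pseudo = fieldNames.stream()
                .map(name -> name + ":field(" + name + ")")
                .collect(Collectors.joining(","));
        return pseudo.isEmpty() ? "score" : pseudo + ",score";
    }
}
```

For the two ID fields used in the slides, build(Arrays.asList("pk_id", "fk_id")) yields the same fl value shown in the example above.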
Sharding V1 — Local Sharding
• Result
• Very large latency win
• Easy system to manage
• Well understood failure and recovery
• Avoided solving many distributed systems issues
Sharding V2 — Distributed Sharding
• Motivation
• Further latency improvements
• Prepare for data to exceed a single node’s capacity
• Significant latency improvements require finer sharding, more CPUs per request
• Requires a real distributed system and sophisticated RPC
• Before proceeding, stop what you’re doing and read everything by Google’s Jeff Dean and
Twitter’s Marius Eriksen
Sharding V2 — Distributed Sharding
• New problems
• Partial failures
• Lagging shards
• Synchronizing cluster state and configuration
• Network partitions
• Jepsen
• Distributed IDF issues exacerbated
Solving Distributed IDF
• Inverse Document Frequency (IDF) now varies across shards, biasing ranking
• Calculate IDF offline in Hadoop
• IDFReplacedSimilarityFactory
• Offline data populates cache of Map<BytesRef,Float> (term —> score)
• Override TFIDFSimilarity#idfExplain in the Similarity returned by the factory
• Cache misses given rare document constant
• Can be extended to solve i18n IDF issues
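The lookup at the heart of this approach might look roughly like the sketch below. It is deliberately independent of Lucene: the class name is hypothetical, terms are plain String keys rather than BytesRef, and the fallback value is a caller-supplied assumption. The classicIdf helper shows one common TF-IDF formulation that an offline job could use to produce the values.

```java
import java.util.HashMap;
import java.util.Map;

final class OfflineIdf {
    private final Map<String, Float> idfByTerm;
    private final float rareTermIdf;

    /**
     * @param idfByTerm   global IDF values computed offline (e.g. in Hadoop)
     * @param rareTermIdf constant returned for terms missing from the map
     */
    OfflineIdf(Map<String, Float> idfByTerm, float rareTermIdf) {
        this.idfByTerm = new HashMap<>(idfByTerm); // defensive copy
        this.rareTermIdf = rareTermIdf;
    }

    /** Same answer on every shard, regardless of that shard's local term statistics. */
    float idf(String term) {
        return idfByTerm.getOrDefault(term, rareTermIdf);
    }

    /** One common formulation: 1 + ln(numDocs / (docFreq + 1)). */
    static float classicIdf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }
}
```

Because every shard consults the same table, a term's weight no longer depends on which shard happens to hold most of the documents containing it.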
Sharding V2 — Distributed Sharding
• ShardHandler
• Solr’s abstraction for fanning out queries to shards
• Ships with default implementation (HttpShardHandler) based on HTTP 1.1
• Does fanout (distrib=true) and processes requests coming from other Solr nodes
• Reads shards.rows and shards.start parameters
ShardHandler API
Solr’s SearchHandler calls submit for each shard and then either takeCompletedIncludingErrors
or takeCompletedOrError depending on partial results tolerance.
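public abstract class ShardHandler {

  // Decides whether the request is distributed and prepares the shard list.
  public abstract void checkDistributed(ResponseBuilder rb);

  // Sends one request to one shard asynchronously.
  public abstract void submit(ShardRequest sreq, String shard, ModifiableSolrParams params);

  // Returns the next completed response, even if some shards failed.
  public abstract ShardResponse takeCompletedIncludingErrors();

  // Returns the next completed response, returning early on the first error.
  public abstract ShardResponse takeCompletedOrError();

  // Cancels all outstanding shard requests.
  public abstract void cancelAll();

  public abstract ShardHandlerFactory getShardHandlerFactory();
}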

Sharding V2 — Distributed Sharding
Distributed query requirements
• Distributed tracing
• E.g. Google’s Dapper, Twitter’s Zipkin, Etsy’s CrossStitch
• Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
• Handle node failures, slowness
Better Know Your Switches
Have a clear understanding of your networking requirements and whether your hardware meets them.
• Prefer line-rate switches
• Prefer cut-through to store-and-forward
• No buffering, just read the IP packet header and move packet to the destination
• Track and graph switch statistics in the same dashboard you display your search latency stats
• errors, retransmits, etc.
Sharding V2 — Distributed Sharding
First experiment, Twitter’s Finagle
• Built on Netty
• Mux RPC multiplexing protocol
• See Your Server as a Function by Marius Eriksen
• Built-in support for Zipkin distributed tracing
• Served as inspiration for Facebook’s futures-based RPC Wangle
• Implemented a FinagleShardHandler
Sharding V2 — Distributed Sharding
Second experiment, custom Thrift-based protocol
• Blocking I/O easier to integrate with SolrJ API
• Able to integrate our own distributed tracing
• LZ4 compression via a custom Thrift TTransport
Sharding V2 — Distributed Sharding
Future experiment: HTTP/2
• One TCP connection for all requests between two servers
• Libraries
• Square’s OkHttp
• Google’s gRPC
• Jetty client in 9.3+ — appears to be Solr’s choice
Sharding V2 — Distributed Sharding
Implementation note
• Separated fanout from individual request processing
• SolrJ client via an EmbeddedSolrServer containing an empty RAM directory.
• Saves a network hop
• Makes shards easier to profile, tune
• Can return result to SolrJ without sending merged results over the network
Sharding V2 — Distributed Sharding
• Good
• Individual shard times demonstrate very low average latency
• Bad
• Overall p95, p99 nowhere near averages
• Why? Lagging shards due to GC, filterCache misses, etc.
• More shards means more chances to hit outliers
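The last point can be made concrete with a little arithmetic. If each shard independently has probability p of being slow on a given request, a fan-out to N shards hits at least one slow shard with probability 1 - (1 - p)^N. The independence assumption and the sample numbers below are illustrative only, not measurements from Etsy.

```java
final class TailMath {
    /** Probability that at least one of `shards` independent shards is slow. */
    static double probAnySlow(double perShardSlowProbability, int shards) {
        return 1.0 - Math.pow(1.0 - perShardSlowProbability, shards);
    }
}
```

With a 1% chance of a slow response per shard, eight shards give roughly a 7.7% chance that the whole request is slow, and 100 shards give roughly 63%. The per-shard p99 therefore ends up well inside the body of the overall latency distribution.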
Sharding V2 — Distributed Sharding
• Solutions
• See The Tail at Scale by Jeff Dean and Luiz André Barroso, CACM 2013.
• Eliminate all sources of inter-host variability
• No filter or other cache misses
• No GC
• Eliminate OS pauses, networking hiccups, deploys, restarts, etc.
• Not realistic
Sharding V2 — Distributed Sharding
• Backup Requests
• Methods
• Brute force — send two copies of every request to different hosts, take the fastest
• Less crude — wait X milliseconds for the first server to respond, then send a backup
• Adaptive — choose X based on the first Y% of responses to return.
• Cancellation — Cancel the slow request to save CPU once you’re sure you don’t need it.
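The "less crude" variant with cancellation can be sketched with the JDK's CompletionService. This is a minimal sketch under stated assumptions: the class name is hypothetical, primary and backup stand in for the same query sent to two different replicas, and the delay is a fixed value rather than an adaptively chosen one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class BackupRequest {
    /**
     * Runs primary, and if it has not finished within backupDelayMs, also runs
     * backup. Returns whichever finishes first and cancels the other.
     */
    static <T> T call(Callable<T> primary, Callable<T> backup,
                      long backupDelayMs, ExecutorService pool) {
        CompletionService<T> completed = new ExecutorCompletionService<>(pool);
        List<Future<T>> inFlight = new ArrayList<>();
        inFlight.add(completed.submit(primary));
        try {
            Future<T> first = completed.poll(backupDelayMs, TimeUnit.MILLISECONDS);
            if (first == null) {
                // Primary is lagging: hedge with a second replica.
                inFlight.add(completed.submit(backup));
                first = completed.take();
            }
            return first.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a replica", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("first replica to answer failed", e.getCause());
        } finally {
            // Cancelling a finished future is a no-op; a running one is interrupted.
            for (Future<T> f : inFlight) {
                f.cancel(true);
            }
        }
    }
}
```

Interrupting the losing thread only saves CPU if the underlying request checks for interruption or is cancelled on the server side. In a real RPC layer that means propagating the cancellation over the wire.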
Sharding V2 — Distributed Sharding
• “Good enough”
• Return results to user after X% of results return if there are enough results. Don’t issue
backup requests, just cancel laggards.
• Only applicable in certain domains.
• Poses questions:
• Should you cache partial results?
• How is paging affected?
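One possible shape for the "good enough" policy is sketched below. The class name, the use of String document IDs, and both thresholds are assumptions for illustration. A real merge step would also sort by score, which is omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

final class GoodEnoughGather {
    /**
     * Queries all shards and returns as soon as at least minFraction of them
     * have responded and at least minResults documents have been collected.
     * Remaining shard requests are cancelled rather than retried.
     */
    static List<String> gather(List<Callable<List<String>>> shardCalls,
                               double minFraction, int minResults,
                               ExecutorService pool) {
        CompletionService<List<String>> completed = new ExecutorCompletionService<>(pool);
        List<Future<List<String>>> inFlight = new ArrayList<>();
        for (Callable<List<String>> call : shardCalls) {
            inFlight.add(completed.submit(call));
        }
        int needed = (int) Math.ceil(minFraction * shardCalls.size());
        List<String> merged = new ArrayList<>();
        try {
            for (int responded = 1; responded <= shardCalls.size(); responded++) {
                Future<List<String>> done = completed.take();
                try {
                    merged.addAll(done.get());
                } catch (ExecutionException e) {
                    // A failed shard counts as responded but contributes nothing.
                }
                if (responded >= needed && merged.size() >= minResults) {
                    break; // good enough: stop waiting for laggards
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            for (Future<List<String>> f : inFlight) {
                f.cancel(true); // no-op for shards that already finished
            }
        }
        return merged;
    }
}
```

The result set now depends on which shards happened to answer first, which is why caching and paging both become open questions under this policy.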
Resilience Testing
Now you own a distributed system. How do you know it works?
• “The Troublemaker”
• Inspired by Netflix’s Chaos Monkey
• Authored by Etsy’s Toria Gibbs
• Make sure humans can operate it
• Failure simulation — don’t wait until 3am
• Gameday exercises and Runbooks
Bonus material!
Better Know Your Kernel
A lesson not about sharding learned while sharding…
• Linux’s futex_wait() was broken in CentOS 6.6
• Backported patches needed from Linux 3.18
• Future direction: make kernel updates independent from distribution updates
• E.g. plenty of good stuff (networking improvements, kernel introspection [see
@brendangregg]) landed between 3.10 and 4.2+, but it won’t come to CentOS for years
• Updating kernel alone easier to roll out
What else are we working on?
• Mesos for cluster orchestration
• GPUs for massive increases in per query computational capacity
Thanks for coming.

