Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr

BUILD A SUPERCHARGED DATAMART
WITH SOLR
Elliott Cordo
Chief Architect, Caserta Concepts

What is Solr
• Solr is an open source search platform
• Based on Core Lucene search technology
• Bundled up with an API, Management Tools, UI, scalability

But isn’t it just a search engine
• Although Solr was primarily architected as a search
engine, there is no reason you can’t use it as a database
• Search based application movement promotes search
engine as a data store
• Search has long been a “cheap” option fast and
interactive queries NoSQL and Hadoop datastores

So why would we use it
• Solr is fast –
• expect low ms response times on simple lookups
• properly tuned even complex queries will take less than 100ms
• Solr scales
• High concurrency
• Scales horizontally and vertically (larger hardware)

And it has the best query flexibility of any
NOSQL store!
..and in many cases RDBMS
• Grouping and Aggregation via Facets
• Fuzzy Search
• Equality and Range queries
• Geospatial capabilities
• HIGHLY extensible!

Another Datastore to Manage??
• Polygot persistence/polygot programming
• Feature/function will drive which technology should be
used
• Use the right tool for the job: Relational, MPP, Hadoop,
Graph, KV, NOSQL
Thankfully Solr is pretty easy to learn and manage!

When it works well
• Search is front and center  end users need to fuzzy
search dimensional attributes
• Flexible /Sparse schema
• Need for speed -> faster queries for more user
engagement
• Concurrency -> fast queries on client facing or open web

Use Cases
• Real time analytics  ingest incomming events from
Flume/Logstash/Custom app
• Supplement NOSQL, MPP, or Hadoop analytics
• Web facing analytics DB

What it doesn’t do well
• Joins across collections/cores (tables)
• Complex arbitrary queries
• Limited integration to standard ETL and BI frameworks

How do you get data in?
• A robust API
• Modules and libraries for just about any programing
language
• Index data in any DB via JDBC
• Pull in XML and Delimited files with Simple Posting Tool
• Flume/Logstash
NOTE: that like it NOSQL cousins, data needs to be
Flattened!

How do you interact with Solr?
HTTP, Nice concise query language
http://localhost:8983/solr/collection1/select?q=city:Yuma&wt=json&indent=true
And what does the response look like:
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"indent":"true",
"q":"city:Yuma",
"wt":"json"}},
"response":{"numFound":6,"start":0,"docs":[
{
"review_id":"JhUliQTD9iyGWov2nv-ZJA",
"stars":2,
"review_date":"2009-09-23T00:00:00Z",
"business_id":"gKRUdbTPBZ7kwBRCeZDDWA",
"business_name":"Wingate By Wyndham",
"city":"Yuma",
"state":"AZ",
"longitude":"-112.09343969999999",
"latitude":"33.434925100000001",
"user_id":"AqlZdDD7NK1fpQi9ltqIXQ",
"user_name":"Studl",
"_version_":1475098783569149953},

So what about analytic queries
select city, count(1)
from reviews
where state=‘AZ’
group by city
http://localhost:8983/solr/collection1/select?q=state:AZ&wt=json&indent
=true&facet=true&facet.field=city&rows=0&facet.mincount=1&facet.limit
=-1

More query fun
select city, count(1)
from reviews
where state=‘AZ’
and review_date between ‘2012-03-01’ and ‘2012-03-06’
group by city
having count(1)>=20
http://localhost:8983/solr/collection1/select?q=state:AZ+review_date:%
5B2012-03-01T23:59:59.999Z TO 2012-03-
06T00:00:00Z%5D&wt=json&indent=true&facet=true&facet.field=city&r
ows=0&facet.mincount=20&facet.limit=-1

Facet stats give you aggregation
http://localhost:8983/solr/collection1/select?q=state:AZ+review_date:%5B2012-
03-01T23:59:59.999Z%20TO%202012-03-
06T00:00:00Z%5D&wt=json&indent=true&facet=true&rows=0&facet.mincount=
1&facet.limit=-1&stats=true&stats.field=stars&stats.facet=city
"stats":{
"stats_fields":{
"stars":{
"min":1.0,
"max":5.0,
"count":991,
"missing":0,
"sum":3685.0,
"sumOfSquares":15313.0,
"mean":3.7184661957618568,
"stddev":1.2754290498612053,
"facets":{
"city":{
"Peoria":{
"min":1.0,
"max":5.0,
"count":14,
"missing":0,
"sum":54.0,
"sumOfSquares":234.0,
"mean":3.857142857142857,
"stddev":1.4064216928154862,
"facets":{}},
"Goodyear":{
"min":2.0,
"max":5.0,
"count":7,
….

Facet pivots too!
http://localhost:8983/solr/collection1/select?q=state:AZ+review_date:%5B2012-
03-01T23:59:59.999Z%20TO%202012-03-
06T00:00:00Z%5D&wt=json&indent=true&facet=true&rows=0&facet.mincount=
1&facet.limit=-1&facet.pivot=city,business_name
"facet_pivot":{
"city,business_name":[{
"field":"city",
"value":"Anthem",
"count":3,
"pivot":[{
"field":"business_name",
"value":"Outlets At Anthem",
"count":1},
{
"value":"Q to U BBQ",
"count":1},
{
"value":"Shanghai Club",
"count":1}]},
{
"field":"city",
"value":"Apache Junction",
"count":1,
"pivot":[{
"value":"Lost Dutchman State Park",
"count":1}]},

UI please!
• Roll your own  it’s not that hard
• How about using Python Flask to render a Solr Response
to D3 or Google Charts
• Sometimes a custom solution is the best option

And the really easy way
Banana – A Solr port of Kibana!
Why should Elasticache fans have all the fun?
And it’s open source!
https://github.com/LucidWorks/banana

Banana
• An AngularJS app (pure javascript, runs in any
browser)
• Make a pretty dashboard with no development in a
couple minutes
• Very user friendly, users can create their own
content

https://github.com/Caserta-Concepts/Solr-Datamart
elliott@casertaconcepts.com

Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr

Related slideshows

Recommended for you

Recommended for you

Recommended for you

Recommended for you

More Related Content

What's hot

What's hot (20)

Similar to Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr

Similar to Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr (20)

More from Caserta

More from Caserta (20)

Recently uploaded

Recently uploaded (20)

Big Data Warehousing Meetup: Developing a super-charged NoSQL data mart using Solr