Case Study of Rujhaan (a social news app)
November 2014 Meetup
Rahul Jain (@rahuldausa)
About Me… 
• Big-data/Search Consultant based out of Hyderabad, India 
• Provide Consulting services and solutions for Solr, Elasticsearch and other Big 
data solutions (Apache Hadoop and Spark) 
• Organizer of two Meetup groups in Hyderabad 
• Hyderabad Apache Solr/Lucene 
• Big Data Hyderabad
What does it do?
Rujhaan, which means "#interest", is a news app that
aggregates trending #News and #trends, along with the #buzz
around them from social media.
It also works as a content discovery platform where users can see
information based on their interests (under development).
What I am going to talk about
• Introduction
• Software Stack
– Crawler
– Apache Solr
– MongoDB
– Redis
• Machine Learning stack
– Classification
– Clustering
– NER
– POS Tagging
How does it look?
Trends : Arpita Khan
Trends : Phil Hughes
Technology Stack
Major challenge:
A response time of 500 ms is critical
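High level Flow: Processing
[Flow diagram. Components shown: Internet, Social Media, Fetch, Parse, HTML Cleaner, Junk/Spam Cleaner (Text), Language Detection, Classification/Clustering, Topics Extraction, Scoring, Summary (most meaningful text of story), Managed Cache, MongoDB, Apache Solr]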
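High level Flow: View
[Flow diagram. Components shown: Internet, HAProxy, Nginx, Tomcat (App), Redis, Managed Cache, MongoDB, Apache Solr]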
Current Traffic Stats 
• 16k users/month 
• ~38k pageviews/month 
• 200k requests/day by 24+ bots 
• Traffic growing by 60-70%/month 
• Alexa rank : ~211000
Application Stack 
• Crawler 
• Apache Solr 
• MongoDB 
• Redis
Crawler
• A web crawler (also known as a web spider or ant) is a program that browses the
World Wide Web in a methodical, automated manner.
• Web crawlers are mainly used to create a copy of all visited pages for later
processing by a search engine, which indexes the downloaded pages to provide fast
searches.
How does it work?
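The crawl loop can be sketched in a few lines. This is a minimal illustration, not the crawler used by the app: it does a breadth-first walk from a seed URL, keeps a copy of each page, and follows the links it finds. The `crawl` and `LinkParser` names, the `fetch` callback and the in-memory `FAKE_WEB` are assumptions made so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    queue = deque([seed])
    seen = {seed}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # stored copy for later processing/indexing
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical in-memory "web" standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

A production crawler would also need politeness delays, robots.txt handling and retry logic, which are left out here.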
Search@ApacheSolr
• Enterprise search platform built on Apache Lucene
• Open source
• Highly reliable, scalable, fault tolerant
• Supports distributed indexing (SolrCloud),
replication, and load-balanced querying
High level overview 
Apache Solr - Features 
• full-text search 
• faceted search (similar to a GROUP BY clause in an RDBMS)
• scalability 
– caching 
– replication 
– distributed search 
• near real-time indexing 
• geospatial search 
• and many more : highlighting, database integration, rich document 
(e.g., Word, PDF) handling 
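To make the GROUP BY comparison concrete, here is a small sketch of what a facet result looks like: a count of matching documents per field value. It is plain Python over a list of dicts and only illustrates the idea; Solr computes facets inside its index. The `facet_counts` helper and the sample documents are assumptions for illustration.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of `field`."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is not None:
            counts[value] += 1
    # Most frequent value first, like a facet list in a search UI.
    return counts.most_common()


# Hypothetical search results; only the field names mirror the deck.
docs = [
    {"title": "Quarterly revenue report", "category_label": "business"},
    {"title": "Cricket final tonight", "category_label": "sports"},
    {"title": "Markets close higher", "category_label": "business"},
    {"title": "Untagged story"},
]

print(facet_counts(docs, "category_label"))
```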
Database: #MongoDB
• Document-oriented NoSQL database
• Dynamic schema
• JSON based
• Fast reads and writes
• Well suited to non-relational data
Stats:
• 2 million tweets
• 70k news articles
• ~25 GB raw HTML (unstructured data)
• ~16 GB structured data
Why NoSQL 
• Large Volume of Data 
• Dynamic Schemas 
• Auto-sharding 
• Replication 
• Horizontally Scalable 
* Some of these capabilities are also available in enterprise-class RDBMS software, but at a very high cost
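As a rough illustration of the idea behind sharding, the sketch below routes each record to one of several shards by hashing its key, so data spreads across machines without manual placement. This is a simplified assumption, not how MongoDB or any specific database implements auto-sharding; `shard_for` and the key format are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Hypothetical record keys spread over 4 shards.
keys = ["story:%d" % i for i in range(1000)]
placement = {}
for key in keys:
    placement.setdefault(shard_for(key, 4), []).append(key)

print({shard: len(members) for shard, members in sorted(placement.items())})
```

Because the hash is stable, the same key always lands on the same shard, which is what lets reads find the data again.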
Major NoSQL Categories
• Document databases
– pair each key with a complex data structure
known as a document.
– MongoDB
• Graph databases
– store information about networks, such as social
connections
– Neo4j
Major NoSQL Categories (contd.)
• Key-Value stores
– Every single item in the database is stored as an
attribute name (or "key") together with its value
– Riak, Voldemort, Redis
• Wide-column stores
– store columns of data together, instead of rows
– Google’s Bigtable, Cassandra and HBase
Sample Record (JSON)
{
"_id" : ObjectId("53f087c69144ca452acadfb0"),
"id" : "7a622c50e95d4debb1376d4f6e2d0a47",
"title" : "Yelp Swings To Profitability In Strong Q2 With $88.8M In Revenue, EPS Of $0.04",
"summary_gs" : "Today after the bell Yelp reported its second-quarter financial performance, including revenue of $88.79 million, and a profit of $0.04 per share. The company had net income of $2.7 million in the period, up from a $878,000 loss in the year-ago quarter. Investors had expected Yelp to lose 3 cents per share on revenue of $86.32 million. The company’s revenue tally for its most recent quarter is up 61 percent on a year-over-year basis. The company also reported strong guidance for its third quarter, with revenues forecasted to land in the $98 to $99 million range. ",
"link" : "…eps-of-0-04/",
"category_label" : "business",
"image_url" : "",
"score" : 38.0,
"boost" : 1.0,
"keywords" : ["news", "yelp", "revenue"]
}
Cache: #Redis
• Advanced in-memory key-value store
• Extremely fast
• Response times on the order of 5-10 ms
• Provides cache behavior (set, get) with
advanced data structures like hashes, lists,
sets, sorted sets, bitmaps etc.
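The set/get cache behavior is typically used in a "cache-aside" pattern: look in the cache first and fall back to the database on a miss. The sketch below shows that pattern with a small in-memory stand-in instead of a real Redis client, so it runs on its own. `TTLCache`, `get_story`, the `story:` key prefix and the 60-second expiry are assumptions for illustration, not details from the app.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value


def get_story(story_id, cache, load_from_db, ttl_seconds=60):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = "story:" + story_id
    story = cache.get(key)
    if story is None:
        story = load_from_db(story_id)
        cache.set(key, story, ttl_seconds)
    return story


# Hypothetical database loader that records how often it is called.
db_calls = []


def load_from_db(story_id):
    db_calls.append(story_id)
    return {"id": story_id, "title": "Example story"}


cache = TTLCache()
first = get_story("42", cache, load_from_db)
second = get_story("42", cache, load_from_db)
print(len(db_calls))
```

The second read is served from the cache, so the slow database path runs only once until the entry expires.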
Machine Learning 
• Classification 
• Clustering 
• NER (Named Entity Recognition) 
• Summarization (Relevant text) 
• Topics Extraction
ML Workflow
Classification
• Classify a document into a predefined category.
– E.g. news can be classified into business, politics,
finance etc.
• Documents can be text or images
• A popular choice is the Naive Bayes classifier.
• Steps:
– Step 1: Train the program (build a model) using a
training set labeled with categories, e.g. sports, cricket, news
– The classifier computes, for each word, the
probability that it makes a document belong to each of
the considered categories
– Step 2: Test with a test data set against this model
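The two steps above can be sketched as a tiny multinomial Naive Bayes classifier. This is a teaching sketch under simple assumptions (whitespace tokenization, Laplace smoothing, a four-document training set invented for the example), not the model used in the app.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Step 1: build a model from (text, category) pairs."""
    doc_counts = Counter()              # documents per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocab = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab


def classify(model, text):
    """Step 2: pick the category with the highest log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)  # prior
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[category][word] + 1  # Laplace smoothing
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set.
model = train([
    ("stocks revenue profit quarter", "business"),
    ("revenue shares investors market", "business"),
    ("match goal team score", "sports"),
    ("team wins cricket match", "sports"),
])

print(classify(model, "yelp revenue profit"))
print(classify(model, "cricket team"))
```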
Clustering
• Clustering is the task of grouping a set of objects in
such a way that objects in the same group (called
a cluster) are more similar to each other than to those in other groups
• The groups are not predefined
• E.g. these keywords
– “man’s shoe”
– “women’s shoe”
– “women’s t-shirt”
– “man’s t-shirt”
– can be clustered into 2 categories: “shoe” and “t-shirt”, or
“man” and “women”
• Popular ones are K-means clustering and hierarchical clustering
K-means Clustering 
• partition n observations into k clusters in which each observation belongs 
to the cluster with the nearest mean, serving as a prototype of the cluster. 
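A minimal sketch of the K-means loop (often called Lloyd's algorithm) follows: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until nothing changes. The function names, the 2-D sample points and the fixed random seed are assumptions for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(sorted(c) for c in clusters))
```

For text, the points would be document vectors (for example word counts) rather than 2-D coordinates.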
Summarization
• Finding the most relevant text of a story/article
• There can be multiple approaches, with varying accuracy.
• Below is our approach (steps shown in the slide diagram):
– Cleaned text
– Cluster based on stop words
– Find low value cluster
– Score each cluster
– Take highest score cluster
– Sentence extractor
– Some more scoring…
– Summary text
*A summary can also be content generated by a computer system, i.e. rewriting the story in its own sentences (out of scope)
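One possible reading of this approach, sketched very loosely: split the cleaned text into blocks, score each block by how many stop words it contains (running prose has many, navigation and share widgets have few), keep the highest-scoring block and extract its first sentences. This is an assumption about the general idea, not the app's actual scoring; the stop-word list, `stop_word_score` and `summarize` are hypothetical.

```python
import re

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "was",
              "for", "with", "its", "that", "had", "from", "up"}


def stop_word_score(block):
    """Number of stop words in a block of text."""
    words = re.findall(r"[a-z']+", block.lower())
    return sum(1 for w in words if w in STOP_WORDS)


def summarize(cleaned_text, max_sentences=2):
    """Keep the first sentences of the block richest in stop words."""
    blocks = [b.strip() for b in cleaned_text.split("\n\n") if b.strip()]
    if not blocks:
        return ""
    best = max(blocks, key=stop_word_score)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", best)
    return " ".join(sentences[:max_sentences])


text = (
    "Home | Business | Sports | Login\n\n"
    "Yelp reported a profit of 4 cents per share. "
    "The company had net income of 2.7 million dollars in the period. "
    "Investors had expected a loss.\n\n"
    "Share Tweet Email"
)

print(summarize(text))
```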
POS (Part of Speech) Tagging
• The process of marking up a word in a text (corpus) as
corresponding to a particular part of speech, based on both its
definition and its context,
• i.e. its relationship with adjacent and related words in a
phrase, sentence, or paragraph.
• 9 parts of speech in English: noun, verb, article,
adjective, preposition, pronoun, adverb,
conjunction, and interjection.
• “This is a sample sentence” will be output as
• This/DT is/VBZ a/DT sample/NN sentence/NN
• We use the Stanford MaxentTagger
Number  Tag   Description
1.      CC    Coordinating conjunction
2.      CD    Cardinal number
3.      DT    Determiner
7.      JJ    Adjective
8.      JJR   Adjective, comparative
9.      JJS   Adjective, superlative
10.     LS    List item marker
11.     MD    Modal
12.     NN    Noun, singular or mass
13.     NNS   Noun, plural
14.     NNP   Proper noun, singular
15.     NNPS  Proper noun, plural
16.     PDT   Predeterminer
17.     POS   Possessive ending
18.     PRP   Personal pronoun
19.     PRP$  Possessive pronoun
20.     RB    Adverb
21.     RBR   Adverb, comparative
22.     RBS   Adverb, superlative
23.     RP    Particle
24.     SYM   Symbol
25.     TO    to
26.     UH    Interjection
28.     VBD   Verb, past tense
32.     VBZ   Verb, 3rd person singular present
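Tagged output in the word/TAG form shown on the slide is easy to post-process. The sketch below parses such a string into (word, tag) pairs and filters by tag prefix, for example to pull out nouns as keyword candidates. It does not call the Stanford tagger itself; `parse_tagged` and `words_with_tag` are hypothetical helpers.

```python
def parse_tagged(tagged_sentence):
    """Turn 'This/DT is/VBZ ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged_sentence.split():
        # Split on the last slash so words containing '/' stay intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def words_with_tag(pairs, prefix):
    """Words whose tag starts with prefix, e.g. 'NN' for all nouns."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


pairs = parse_tagged("This/DT is/VBZ a/DT sample/NN sentence/NN")
print(pairs)
print(words_with_tag(pairs, "NN"))
```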
NER
• Identifying named entities such as person names, locations and organizations in a text
• Requires a pre-built trained model.
Machine Learning Stack 
• Stanford NER & Tagger 
• LingPipe 
• OpenNLP 
• Carrot2
We are Hiring! 
Want to make an impact on millions of lives?
Join Us
Thanks!
@rahuldausa on Twitter and SlideShare
Join us @ For Solr, Lucene, Elasticsearch, Machine Learning, IR 
Join us @ For Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting edge technologies.
