Case Study of Rujhaan.com 
November 2014 Meetup 
Rahul Jain 
@rahuldausa
About Me… 
• Big-data/Search Consultant based out of Hyderabad, India 
• Provide Consulting services and solutions for Solr, Elasticsearch and other Big 
data solutions (Apache Hadoop and Spark) 
• Organizer of two Meetup groups in Hyderabad 
• Hyderabad Apache Solr/Lucene 
• Big Data Hyderabad
What does it do? 
Rujhaan, which means "#interest", is a news app that 
aggregates trending #News and #trends, along with the #buzz 
around them from social media. 
It also works as a content discovery tool where users can see 
information based on their interests (under development).
What I am going to talk about 
• Introduction 
• Software Stack 
• Crawler 
• Apache Solr 
• MongoDB 
• Redis 
• Machine Learning stack 
• Classification 
• Clustering 
• NER 
• POS Tagging
How does it look? 
http://www.rujhaan.com
Case study of Rujhaan.com (A social news app)
Trends : Arpita Khan 
http://www.rujhaan.com/topic/Arpita-Khan.html
Trends : Phil Hughes 
http://www.rujhaan.com/topic/Phillip-Hughes.html
Technology Stack
Major challenge: 
Response time of 500ms is Critical
High level Flow: Processing 
[Diagram: an 11-step processing pipeline. Components shown: Internet, Fetch, Managed Cache, Parse, HTML Cleaner, Junk/Spam Cleaner (Text), Language Detection, Classification/Clustering, Topics Extraction, Scoring, Summary (Most Meaningful text of Story), Social Media, MongoDB, Apache Solr]
High level Flow: View 
[Diagram: a 5-step request flow. Components shown: Internet, HAProxy, Nginx, Redis, Managed Cache, Tomcat (App), MongoDB, Apache Solr]
Current Traffic Stats 
Traffic: 
• 16k users/month 
• ~38k pageviews/month 
• 200k requests/day by 24+ bots 
• Traffic growing by 60-70%/month 
• Alexa rank : ~211000
Application Stack 
• Crawler 
• Apache Solr 
• MongoDB 
• Redis
Crawler 
• A web crawler (also known as a web spider or ant) is a program that browses the 
World Wide Web in a methodical, automated manner. 
• Web crawlers are mainly used to create a copy of all visited pages for later 
processing by a search engine, which indexes the downloaded pages to provide fast 
searches. 
http://www.codeproject.com/Articles/13486/A-Simple-Crawler-Using-C-Sockets
How does it work? 
http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web
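The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the crawler used by Rujhaan: the `crawl` and `LinkExtractor` names, the `fetch` callback and the three-page `FAKE_WEB` site are all assumptions made for the example, and a plain dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable page: skip it
        pages[url] = html  # keep a copy for later processing/indexing
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical three-page site used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a>',
    "http://example.test/b.html": '<a href="/missing.html">gone</a>',
}

pages = crawl("http://example.test/", FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawl methodical: each URL is queued at most once, even when pages link back to each other.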
Search@ApacheSolr 
• Enterprise search platform built on Apache Lucene 
• Open source 
• Highly reliable, scalable, fault tolerant 
• Supports distributed indexing (SolrCloud), 
replication, and load-balanced querying 
• http://lucene.apache.org/solr 
High level overview 
Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light
Apache Solr - Features 
• full-text search 
• faceted search (similar to a GROUP BY clause with counts in an RDBMS) 
• scalability 
– caching 
– replication 
– distributed search 
• near real-time indexing 
• geospatial search 
• and many more : highlighting, database integration, rich document 
(e.g., Word, PDF) handling 
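As a rough illustration of full-text plus faceted search, the sketch below builds a Solr select URL. `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters; the host, the core name `news` and the `facet_query_url` helper are assumptions for the example, and `category_label` is borrowed from the sample record shown later in this deck.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def facet_query_url(base_url, query, facet_field, rows=10):
    """Build a select URL that also asks for per-value counts of one field."""
    params = [
        ("q", query),                  # full-text query
        ("rows", rows),                # number of documents to return
        ("wt", "json"),                # response format
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # count matches per value of this field
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical local Solr core named "news".
url = facet_query_url(
    "http://localhost:8983/solr/news/select", "title:yelp", "category_label"
)
print(url)
```

The facet counts returned for `category_label` play the role of `SELECT category_label, COUNT(*) ... GROUP BY category_label` over the matching documents.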
Database: #MongoDB 
• Document-oriented NoSQL database 
• Dynamic schema 
• JSON-based (documents are stored as BSON) 
• Fast reads and writes 
• Well suited to non-relational data 
Stats: 
• 2 million tweets 
• 70k news articles 
• ~25GB raw HTML (unstructured data) 
• ~16GB structured data
Why NoSQL 
• Large Volume of Data 
• Dynamic Schemas 
• Auto-sharding 
• Replication 
• Horizontally Scalable 
* Some of the above can be achieved with enterprise-class RDBMS software, but at a very high cost
Major NoSQL Categories 
• Document databases 
• pair each key with a complex data structure 
known as a document. 
• MongoDB 
• Graph databases 
• store information about networks, such as social 
connections 
• Neo4j 
Contd.
Major NoSQL Categories 
• Key-Value stores 
• Every single item in the database is stored as an 
attribute name (or "key") together with its value 
• Riak, Voldemort, Redis 
• Wide-column stores 
• store columns of data together, instead of rows 
• Google’s Bigtable, Cassandra and HBase
Sample Record (JSON) 
{ 
  "_id" : ObjectId("53f087c69144ca452acadfb0"), 
  "id" : "7a622c50e95d4debb1376d4f6e2d0a47", 
  "title" : "Yelp Swings To Profitability In Strong Q2 With $88.8M In Revenue, EPS Of $0.04", 
  "summary_gs" : "Today after the bell Yelp reported its second-quarter financial performance, including revenue of $88.79 million, and a profit of $0.04 per share. The company had net income of $2.7 million in the period, up from a $878,000 loss in the year-ago quarter. Investors had expected Yelp to lose 3 cents per share on revenue of $86.32 million. The company’s revenue tally for its most recent quarter is up 61 percent on a year-over-year basis. The company also reported strong guidance for its third quarter, with revenues forecasted to land in the $98 to $99 million range. ", 
  "link" : "http://techcrunch.com/2014/07/30/yelp-swings-to-profitability-in-strong-q2-with-88-8m-in-revenue-eps-of-0-04/", 
  "category_label" : "business", 
  "image_url" : "http://tctechcrunch2011.files.wordpress.com/2014/04/yelp-earnings.jpg", 
  "score" : 38.0, 
  "boost" : 1.0, 
  "keywords" : ["news", "yelp", "revenue"] 
}
Cache: #Redis 
• Advanced in-memory key-value store 
• Extremely fast 
• Response time on the order of 5-10ms 
• Provides cache behavior (set, get) with 
advanced data structures like hashes, lists, 
sets, sorted sets, bitmaps, etc. 
• http://redis.io/
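The set/get cache behavior can be sketched with the common cache-aside pattern. This is only an illustration under stated assumptions: `InMemoryCache` is a hypothetical stand-in for a Redis client so the example runs without a server, and `get_topic_page`, `slow_backend` and the `topic:` key prefix are invented for the example rather than taken from the Rujhaan code.

```python
import time


class InMemoryCache:
    """Tiny stand-in for a key-value cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as a cache miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_topic_page(cache, topic, load_from_backend, ttl_seconds=60):
    """Cache-aside: try the cache first, fall back to the slow backend."""
    key = "topic:" + topic
    cached = cache.get(key)
    if cached is not None:
        return cached
    page = load_from_backend(topic)
    cache.set(key, page, ttl_seconds)
    return page


backend_calls = []


def slow_backend(topic):
    """Hypothetical expensive lookup (for example a database or search query)."""
    backend_calls.append(topic)
    return "<html>" + topic + "</html>"


cache = InMemoryCache()
first = get_topic_page(cache, "Arpita-Khan", slow_backend)
second = get_topic_page(cache, "Arpita-Khan", slow_backend)
print(first == second, len(backend_calls))
```

Only the first request reaches the backend; the second is answered from memory, which is where a cache helps with a tight response-time budget.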
Machine Learning 
• Classification 
• Clustering 
• NER (Named Entity Recognition) 
• Summarization (Relevant text) 
• Topics Extraction
ML Workflow
Classification 
• Classify a document into a predefined category. 
– E.g., news can be classified into business, politics, 
finance, etc. 
• Documents can be text or images. 
• A popular choice is the Naive Bayes classifier. 
• Steps: 
– Step 1: Train the program (build a model) using a 
training set labeled with categories, e.g. sports, cricket, news. 
– For each word, the classifier computes the 
probability that it makes a document belong to each of 
the considered categories. 
– Step 2: Test the model against a test data set. 
• http://en.wikipedia.org/wiki/Naive_Bayes_classifier
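The two steps above can be sketched as a small multinomial Naive Bayes classifier. This is a toy illustration with an invented four-document training set, not the model used in production; the `train` and `classify` names and the add-one smoothing choice are assumptions for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Step 1: count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Start from the prior probability of the category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set, far smaller than a real one.
training_set = [
    ("stocks revenue profit quarter", "business"),
    ("company revenue earnings investors", "business"),
    ("election minister parliament vote", "politics"),
    ("minister party election campaign", "politics"),
]
model = train(training_set)

# Step 2: try the model on unseen text.
print(classify("yelp revenue and profit", model))
```

Working in log-probabilities keeps the product of many small per-word probabilities from underflowing.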
Clustering 
• Clustering is the task of grouping a set of objects in 
such a way that objects in the same group (called 
a cluster) are more similar to each other than to those in other groups 
• Groups are not predefined 
• E.g., these keywords 
– “man’s shoe” 
– “women’s shoe” 
– “women’s t-shirt” 
– “man’s t-shirt” 
– can be clustered into 2 categories: “shoe” and “t-shirt”, or 
“man” and “women” 
• Popular algorithms are K-means clustering and hierarchical 
clustering
K-means Clustering 
• partition n observations into k clusters in which each observation belongs 
to the cluster with the nearest mean, serving as a prototype of the cluster. 
• http://en.wikipedia.org/wiki/K-means_clustering 
http://pypr.sourceforge.net/kmeans.html
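A minimal sketch of the K-means loop (Lloyd's algorithm) on 2-D points follows. For reproducibility it seeds the centroids with the first k points, which is a simplification; real implementations usually pick seeds randomly or with k-means++. The function names and sample points are assumptions for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=20):
    """Partition points into k clusters; returns (centroids, assignments)."""
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

For text such as the keyword example earlier, the same loop applies once each keyword is turned into a numeric vector, for instance word counts.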
Summarization 
• Finding the most relevant text related to story/article 
• Several approaches exist, with differing accuracy. 
• Below is our approach: 
[Diagram: seven numbered steps linking these stages: Cleaned Text, Cluster based on stop words, Score each cluster, Find low value cluster, Take Highest score cluster, Sentence Extractor, Some more Scoring…, Summary text]
*A summary can also be content generated by a computer system, i.e. the story rewritten in its own sentences (out of scope)
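To give a feel for extractive summarization in general, here is a simple frequency-based sketch. It is not the stop-word clustering pipeline from the diagram above, whose details are not given; it swaps in a generic technique that scores each sentence by how frequent its non-stop-words are in the whole text. The stop-word list, function name and sample text are assumptions for the example.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
              "its", "on", "for", "was", "with"}


def content_words(sentence):
    """Lowercased words of a sentence with stop words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        if not words:
            return 0.0
        # Average frequency, so long sentences are not favored automatically.
        return sum(frequency[w] for w in words) / len(words)

    chosen = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in chosen)


text = ("Yelp reported revenue of 88 million dollars. "
        "The weather was nice. "
        "Yelp revenue is up.")
print(summarize(text))
```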
POS (Part of Speech) Tagging 
• The process of marking up a word in a text (corpus) as 
corresponding to a particular part of speech, based on both its 
definition and its context 
• Context means its relationship with adjacent and related words in a 
phrase, sentence, or paragraph. 
• 9 parts of speech in English: noun, verb, article, 
adjective, preposition, pronoun, adverb, 
conjunction, and interjection. 
• “This is a sample sentence” is output as 
• This/DT is/VBZ a/DT sample/NN sentence/NN 
• We use the Stanford MaxentTagger 
• http://nlp.stanford.edu/software/tagger.shtml 
Number  Tag   Description 
1.      CC    Coordinating conjunction 
2.      CD    Cardinal number 
3.      DT    Determiner 
7.      JJ    Adjective 
8.      JJR   Adjective, comparative 
9.      JJS   Adjective, superlative 
10.     LS    List item marker 
11.     MD    Modal 
12.     NN    Noun, singular or mass 
13.     NNS   Noun, plural 
14.     NNP   Proper noun, singular 
15.     NNPS  Proper noun, plural 
16.     PDT   Predeterminer 
17.     POS   Possessive ending 
18.     PRP   Personal pronoun 
19.     PRP$  Possessive pronoun 
20.     RB    Adverb 
21.     RBR   Adverb, comparative 
22.     RBS   Adverb, superlative 
23.     RP    Particle 
24.     SYM   Symbol 
25.     TO    to 
26.     UH    Interjection 
28.     VBD   Verb, past tense 
32.     VBZ   Verb, 3rd person singular present
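The `word/TAG` output format shown above is easy to post-process. The sketch below does not run a tagger; it only parses already-tagged text and filters by Penn Treebank tag, and the `parse_tagged` and `nouns` helpers are hypothetical names introduced for the example.

```python
def parse_tagged(tagged_sentence):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in tagged_sentence.split():
        # rpartition splits on the last slash, so words containing '/' survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def nouns(pairs):
    """Keep words whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in pairs if tag.startswith("NN")]


pairs = parse_tagged("This/DT is/VBZ a/DT sample/NN sentence/NN")
print(nouns(pairs))
```

Filtering on noun tags like this is one simple way tagged text can feed keyword or topic extraction.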
NER 
• Identifying named entities such as person names, locations, and organizations in a text 
• Requires a pre-trained model.
Machine Learning Stack 
• Stanford NER & Tagger 
• LingPipe 
• OpenNLP 
• Carrot2
We are Hiring! 
rockstar@rujhaan.com 
Want to make an impact on millions of lives? 
Join Us
Thanks! 
@rahuldausa on Twitter and SlideShare 
http://www.linkedin.com/in/rahuldausa 
Join us for Solr, Lucene, Elasticsearch, Machine Learning, IR: 
http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ 
Join us for Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting-edge technologies: 
http://www.meetup.com/Big-Data-Hyderabad/

More Related Content

What's hot

Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Lucidworks
 
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, EtsyLessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lucidworks
 
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema OrhianJaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Spark Summit
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
Audible, Inc.
 
spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
Vritika Godara
 
Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackCentralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
Rich Lee
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
Cloudera, Inc.
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Lucidworks
 
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, LucidworksSearching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
Lucidworks
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
Vinay Kumar
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark Summit
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Databricks
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
lucenerevolution
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Spark Summit
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Spark Summit
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Lucidworks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search
medcl
 

What's hot (20)

Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
 
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, EtsyLessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy
 
Jaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema OrhianJaws - Data Warehouse with Spark SQL by Ema Orhian
Jaws - Data Warehouse with Spark SQL by Ema Orhian
 
Elasticsearch and Spark
Elasticsearch and SparkElasticsearch and Spark
Elasticsearch and Spark
 
spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
 
Centralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stackCentralized log-management-with-elastic-stack
Centralized log-management-with-elastic-stack
 
Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, LucidworksSearching for Better Code: Presented by Grant Ingersoll, Lucidworks
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applicati...
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, TargetJourney of Implementing Solr at Target: Presented by Raja Ramachandran, Target
Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
quick intro to elastic search
quick intro to elastic search quick intro to elastic search
quick intro to elastic search
 

Viewers also liked

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
Rahul Jain
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 
Scalable Internet Architecture
Scalable Internet ArchitectureScalable Internet Architecture
Scalable Internet Architecture
Theo Schlossnagle
 
Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data ...
Publish or Perish:  Towards a Ranking of Scientists using Bibliographic Data ...Publish or Perish:  Towards a Ranking of Scientists using Bibliographic Data ...
Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data ...
Lior Rokach
 
Decision Forest: Twenty Years of Research
Decision Forest: Twenty Years of ResearchDecision Forest: Twenty Years of Research
Decision Forest: Twenty Years of Research
Lior Rokach
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Lior Rokach
 
When Cyber Security Meets Machine Learning
When Cyber Security Meets Machine LearningWhen Cyber Security Meets Machine Learning
When Cyber Security Meets Machine Learning
Lior Rokach
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
Rahul Jain
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
Databricks
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
Databricks
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Rahul Jain
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
Databricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
Databricks
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learning
butest
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
butest
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
butest
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
Venkata Reddy Konasani
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 

Viewers also liked (20)

Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Scala
Introduction to ScalaIntroduction to Scala
Introduction to Scala
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Scalable Internet Architecture
Scalable Internet ArchitectureScalable Internet Architecture
Scalable Internet Architecture
 
Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data ...
Publish or Perish:  Towards a Ranking of Scientists using Bibliographic Data ...Publish or Perish:  Towards a Ranking of Scientists using Bibliographic Data ...
Publish or Perish: Towards a Ranking of Scientists using Bibliographic Data ...
 
Decision Forest: Twenty Years of Research
Decision Forest: Twenty Years of ResearchDecision Forest: Twenty Years of Research
Decision Forest: Twenty Years of Research
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
When Cyber Security Meets Machine Learning
When Cyber Security Meets Machine LearningWhen Cyber Security Meets Machine Learning
When Cyber Security Meets Machine Learning
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Basics of Machine Learning
Basics of Machine LearningBasics of Machine Learning
Basics of Machine Learning
 
An introduction to Machine Learning
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learning
 
Machine Learning presentation.
Machine Learning presentation.Machine Learning presentation.
Machine Learning presentation.
 
Machine Learning for Dummies
Machine Learning for DummiesMachine Learning for Dummies
Machine Learning for Dummies
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar to Case study of Rujhaan.com (A social news app )

Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.
Shyjal Raazi
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
George Stathis
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
Eric Rodriguez (Hiring in Lex)
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
Pivorak MeetUp
 
No SQL : Which way to go? Presented at DDDMelbourne 2015
No SQL : Which way to go?  Presented at DDDMelbourne 2015No SQL : Which way to go?  Presented at DDDMelbourne 2015
No SQL : Which way to go? Presented at DDDMelbourne 2015
Himanshu Desai
 
NoSQL, which way to go?
NoSQL, which way to go?NoSQL, which way to go?
NoSQL, which way to go?
Ahmed Elharouny
 
NoSQL-Overview
NoSQL-OverviewNoSQL-Overview
NoSQL-Overview
Ranjeet Jha - OCM-JEA
 
MongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_Wilkins
MongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_WilkinsMongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_Wilkins
MongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_Wilkins
kiwilkins
 
Introducción a NoSQL
Introducción a NoSQLIntroducción a NoSQL
Introducción a NoSQL
MongoDB
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
ShivanandaVSeeri
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Caserta
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
ALTER WAY
 
MongoDB
MongoDBMongoDB
NoSQL
NoSQLNoSQL
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature set
Soner Altin
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
Hisham Arafat
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
Max De Marzi
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?
Oscar Corcho
 
No sq lv1_0
No sq lv1_0No sq lv1_0
No sq lv1_0
Tuan Luong
 

Similar to Case study of Rujhaan.com (A social news app ) (20)

Semantic framework for web scraping.
Semantic framework for web scraping.Semantic framework for web scraping.
Semantic framework for web scraping.
 
Sharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data LessonsSharing a Startup’s Big Data Lessons
Sharing a Startup’s Big Data Lessons
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 
Elastic pivorak
Elastic pivorakElastic pivorak
Elastic pivorak
 
No SQL : Which way to go? Presented at DDDMelbourne 2015
No SQL : Which way to go?  Presented at DDDMelbourne 2015No SQL : Which way to go?  Presented at DDDMelbourne 2015
No SQL : Which way to go? Presented at DDDMelbourne 2015
 
NoSQL, which way to go?
NoSQL, which way to go?NoSQL, which way to go?
NoSQL, which way to go?
 
NoSQL-Overview
NoSQL-OverviewNoSQL-Overview
NoSQL-Overview
 
MongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_Wilkins
MongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_WilkinsMongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_Wilkins
MongoDB Revised Sharding Guidelines MongoDB 3.x_Kimberly_Wilkins
 
Introducción a NoSQL
Introducción a NoSQLIntroducción a NoSQL
Introducción a NoSQL
 
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
 
Big Data technology Landscape
Big Data technology LandscapeBig Data technology Landscape
Big Data technology Landscape
 
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and CassandraLow-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
MongoDB
MongoDBMongoDB
MongoDB
 
NoSQL
NoSQLNoSQL
NoSQL
 
History of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature setHistory of NoSQL and Azure Documentdb feature set
History of NoSQL and Azure Documentdb feature set
 
Engineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?Why do they call it Linked Data when they want to say...?
Why do they call it Linked Data when they want to say...?
 
No sq lv1_0
No sq lv1_0No sq lv1_0
No sq lv1_0
 

Case study of Rujhaan.com (A social news app)

  • 1. Case Study of Rujhaan.com November 2014 Meetup Rahul Jain @rahuldausa
  • 2. About Me… • Big-data/Search Consultant based out of Hyderabad, India • Provide Consulting services and solutions for Solr, Elasticsearch and other Big data solutions (Apache Hadoop and Spark) • Organizer of two Meetup groups in Hyderabad • Hyderabad Apache Solr/Lucene • Big Data Hyderabad
  • 3. What does it do? Rujhaan, which means "#interest", is a news app that aggregates trending #News and #trends, along with the #buzz around them from social media. It also works as a content discovery tool where users can see information based on their interests (under development).
  • 4. What I am going to talk about • Introduction • Software Stack • Crawler • Apache Solr • MongoDB • Redis • Machine Learning stack • Classification • Clustering • NER • POS Tagging
  • 5. How does it look? http://www.rujhaan.com
  • 7. Trends : Arpita Khan http://www.rujhaan.com/topic/Arpita-Khan.html
  • 8. Trends : Phil Hughes http://www.rujhaan.com/topic/Phillip-Hughes.html
  • 10. Major challenge: Response time of 500ms is Critical
  • 11. High level Flow: Processing (diagram; components shown: Internet, Social Media, Fetch, Managed Cache, Parse, HTML Cleaner, Junk/Spam Cleaner (Text), Language Detection, Classification/Clustering, Topics Extraction, Scoring, Summary (most meaningful text of story), MongoDB, Apache Solr)
  • 12. High level Flow: View (diagram; components shown: Internet, HAProxy, Nginx, Tomcat (App), Redis, Managed Cache, MongoDB, Apache Solr)
  • 13. Current Traffic Stats Traffic: • 16k users/month • ~38k pageviews/month • 200k requests/day by 24+ bots • Traffic growing by 60-70%/month • Alexa rank : ~211000
  • 14. Application Stack • Crawler • Apache Solr • MongoDB • Redis
  • 15. Crawler • A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. • Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. http://www.codeproject.com/Articles/13486/A-Simple-Crawler-Using-C-Sockets
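The crawl loop described on the slide above can be sketched roughly as follows. This is a minimal illustration, not the crawler used by Rujhaan: the `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` dictionary are all hypothetical names, and the `fetch` callable is injected so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html  # keep a copy for later processing/indexing
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Tiny in-memory "web" (hypothetical data) standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the crawl methodical: each URL is queued at most once, even when pages link back to each other. A real crawler would also need politeness delays and robots.txt handling, which are left out here.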
  • 16. How does it work? http://wiki.eclipse.org/SMILA/Documentation/Importing/Crawler/Web
  • 17. Search@ApacheSolr • Enterprise search platform built on Apache Lucene • Open source • Highly reliable, scalable, fault tolerant • Supports distributed indexing (SolrCloud), replication, and load-balanced querying • http://lucene.apache.org/solr
  • 18. High level overview Source: http://www.slideshare.net/erikhatcher/solr-search-at-the-speed-of-light
  • 19. Apache Solr - Features • full-text search • faceted search (similar to a GROUP BY clause in an RDBMS) • scalability – caching – replication – distributed search • near real-time indexing • geospatial search • and many more: highlighting, database integration, rich document (e.g., Word, PDF) handling
  • 20. Database: #MongoDB • Document-oriented NoSQL database • Dynamic schema • JSON based • Fast reads and writes • Well suited to non-relational data Stats: • 2 million tweets • 70k news articles • ~25GB raw HTML unstructured data • ~16GB structured data
  • 21. Why NoSQL • Large volume of data • Dynamic schemas • Auto-sharding • Replication • Horizontally scalable * Some of the above can be achieved with enterprise-class RDBMS software, but at very high cost
  • 22. Major NoSQL Categories • Document databases • pair each key with a complex data structure known as a document. • MongoDB • Graph databases • store information about networks, such as social connections • Neo4j Contd.
  • 23. Major NoSQL Categories • Key-Value stores • Every single item in the database is stored as an attribute name (or "key"), together with its value • Riak, Voldemort, Redis • Wide-column stores • store columns of data together, instead of rows • Google’s Bigtable, Cassandra and HBase
  • 24. Sample Record (JSON) { "_id" : ObjectId("53f087c69144ca452acadfb0"), "id" : "7a622c50e95d4debb1376d4f6e2d0a47", "title" : "Yelp Swings To Profitability In Strong Q2 With $88.8M In Revenue, EPS Of $0.04", "summary_gs" : "Today after the bell Yelp reported its second-quarter financial performance, including revenue of $88.79 million, and a profit of $0.04 per share. The company had net income of $2.7 million in the period, up from a $878,000 loss in the year-ago quarter. Investors had expected Yelp to lose 3 cents per share on revenue of $86.32 million. The company’s revenue tally for its most recent quarter is up 61 percent on a year-over-year basis. The company also reported strong guidance for its third quarter, with revenues forecasted to land in the $98 to $99 million range.", "link" : "http://techcrunch.com/2014/07/30/yelp-swings-to-profitability-in-strong-q2-with-88-8m-in-revenue-eps-of-0-04/", "category_label" : "business", "image_url" : "http://tctechcrunch2011.files.wordpress.com/2014/04/yelp-earnings.jpg", "score" : 38.0, "boost" : 1.0, "keywords" : ["news", "yelp", "revenue"] }
  • 25. Cache: #Redis • Advanced in-memory key-value store • Insanely fast • Response time on the order of 5-10ms • Provides cache behavior (set, get) with advanced data structures like hashes, lists, sets, sorted sets, bitmaps, etc. • http://redis.io/
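The set/get cache behavior on the slide above is commonly used in a cache-aside pattern, sketched below. This is an assumption about usage, not the deck's actual code: `TTLCache` is a hypothetical in-memory stand-in for Redis (so the sketch runs without a server), and `get_topic_page` and `slow_lookup` are illustrative names for a page lookup backed by a slower store such as MongoDB or Solr.

```python
import time


class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value


def get_topic_page(cache, topic, load_from_db, ttl_seconds=60):
    """Cache-aside: try the cache first, fall back to the slow store."""
    key = "topic:" + topic
    page = cache.get(key)
    if page is None:
        page = load_from_db(topic)
        cache.set(key, page, ttl_seconds)
    return page


db_calls = []


def slow_lookup(topic):  # hypothetical stand-in for a MongoDB/Solr query
    db_calls.append(topic)
    return "<html>" + topic + "</html>"


cache = TTLCache()
first = get_topic_page(cache, "Arpita-Khan", slow_lookup)
second = get_topic_page(cache, "Arpita-Khan", slow_lookup)
print(first == second, len(db_calls))  # prints: True 1
```

Only the first request reaches the slow store; the second is served from the cache until the entry expires. A short expiry keeps trending pages reasonably fresh while most requests avoid the database.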
  • 26. Machine Learning • Classification • Clustering • NER (Named Entity Recognition) • Summarization (Relevant text) • Topics Extraction
  • 28. Classification • Classify a document into a predefined category. – E.g., news can be classified into business, politics, finance, etc. • Documents can be text or images • A popular choice is the Naive Bayes classifier. • Steps: – Step 1: Train the program (build a model) using a training set labeled with categories, e.g., sports, cricket, news – For each word, the classifier computes the probability that it makes a document belong to each of the considered categories – Step 2: Test with a test data set against this model • http://en.wikipedia.org/wiki/Naive_Bayes_classifier
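The two steps above (train a model, then test against it) can be illustrated with a small multinomial Naive Bayes classifier. This is a minimal sketch under stated assumptions: whitespace tokenization, add-one (Laplace) smoothing, and a made-up four-document training set. The function names `train` and `classify` are illustrative and not taken from the deck or from any library.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Step 1: build per-category word counts from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocabulary = set()
    for text, category in labeled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Step 2: return the category with the highest log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # log P(category) + sum of log P(word | category), add-one smoothed
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical toy training set; a real one would be far larger.
TRAINING_SET = [
    ("stocks fall as revenue misses forecast", "business"),
    ("company reports quarterly profit and revenue", "business"),
    ("team wins cricket match by six wickets", "sports"),
    ("captain scores century in final match", "sports"),
]

model = train(TRAINING_SET)
print(classify("revenue and profit rise", model))  # business
print(classify("cricket captain wins", model))     # sports
```

Working in log space avoids multiplying many small probabilities together, and the add-one smoothing keeps an unseen word such as "rise" from zeroing out a category.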
  • 29. Clustering • Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups • The groups are not predefined • E.g., these keywords – “man’s shoe” – “women’s shoe” – “women’s t-shirt” – “man’s t-shirt” – can be clustered into 2 categories, “shoe” and “t-shirt” or “man” and “women” • Popular ones are K-means clustering and hierarchical clustering
  • 30. K-means Clustering • partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. • http://en.wikipedia.org/wiki/K-means_clustering http://pypr.sourceforge.net/kmeans.html
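The definition above can be made concrete with a short pure-Python sketch of the standard iterative procedure (Lloyd's algorithm). The `kmeans` function name, the naive "first k points" initialization, and the sample points are assumptions made for illustration; production code would normally use a library implementation with better initialization. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on tuples of floats; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Two visually obvious groups, interleaved (hypothetical data).
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]

centroids, labels = kmeans(POINTS, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Each centroid is the mean of its cluster and serves as that cluster's prototype. The result depends on the initial centroids, which is why practical implementations usually run several random restarts.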
  • 31. Summarization • Finding the most relevant text related to a story/article • There can be multiple approaches, with differing accuracy. • Below is our approach (diagram, 7 numbered steps; stages shown: Cleaned Text, Cluster based on stop words, Score each cluster, Find low value cluster, Take highest score cluster, Sentence Extractor, Some more scoring…, Summary text) * A summary can also be content generated by a computer system, i.e., restating the story in its own sentences (out of scope)
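The slide does not give enough detail to reproduce the cluster-based pipeline, so the sketch below swaps in a simpler, generic technique: frequency-based extractive summarization. It is not the Rujhaan approach. Sentences are scored by the average frequency of their non-stop words and the best ones are returned; the `summarize` function, the small `STOP_WORDS` set, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Deliberately tiny stop-word list (hypothetical); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "it",
              "its", "for", "with", "was", "had", "from", "up"}


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z0-9']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    frequency = Counter(w for words in tokenized for w in words)

    def score(words):
        # Average frequency, so long sentences are not favoured automatically.
        return sum(frequency[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


TEXT = ("Yelp reported revenue of 88 million. "
        "Yelp revenue is up 61 percent. "
        "Investors had expected a loss.")

print(summarize(TEXT))  # Yelp revenue is up 61 percent.
```

Like the approach on the slide, this extracts existing sentences rather than writing new ones, so the summary is always text that appeared in the story.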
  • 32. POS (Part of Speech) Tagging • The process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. • 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection. • “This is a sample sentence” will be output as This/DT is/VBZ a/DT sample/NN sentence/NN • We use Stanford MaxentTagger • http://nlp.stanford.edu/software/tagger.shtml Tags (Number, Tag, Description): 1. CC Coordinating conjunction; 2. CD Cardinal number; 3. DT Determiner; 7. JJ Adjective; 8. JJR Adjective, comparative; 9. JJS Adjective, superlative; 10. LS List item marker; 11. MD Modal; 12. NN Noun, singular or mass; 13. NNS Noun, plural; 14. NNP Proper noun, singular; 15. NNPS Proper noun, plural; 16. PDT Predeterminer; 17. POS Possessive ending; 18. PRP Personal pronoun; 19. PRP$ Possessive pronoun; 20. RB Adverb; 21. RBR Adverb, comparative; 22. RBS Adverb, superlative; 23. RP Particle; 24. SYM Symbol; 25. TO to; 26. UH Interjection; 28. VBD Verb, past tense; 32. VBZ Verb, 3rd person singular present
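The word/TAG output format shown above is easy to consume downstream. The sketch below does not call the Stanford tagger; it only parses an already-tagged string in that format. The helper names `parse_tagged` and `words_with_tag_prefix` are hypothetical.

```python
def parse_tagged(tagged_sentence):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")  # split on the last slash only
        pairs.append((word, tag))
    return pairs


def words_with_tag_prefix(pairs, prefix):
    """Select words whose tag starts with prefix, e.g. 'NN' for nouns."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


pairs = parse_tagged("This/DT is/VBZ a/DT sample/NN sentence/NN")
print(pairs)
print(words_with_tag_prefix(pairs, "NN"))  # ['sample', 'sentence']
```

Matching on a tag prefix is convenient because related tags share one: "NN" also covers NNS, NNP, and NNPS, which is one simple way to pull candidate topic words out of tagged text.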
  • 33. NER • Identifying named entities such as person names, locations, and organizations in a text • Requires a pre-trained model.
  • 34. Machine Learning Stack • Stanford NER & Tagger • LingPipe • OpenNLP • Carrot2
  • 35. We are Hiring! rockstar@rujhaan.com Want to make an impact on millions of lives? Join Us
  • 36. Thanks! @rahuldausa on Twitter and SlideShare http://www.linkedin.com/in/rahuldausa Join us for Solr, Lucene, Elasticsearch, Machine Learning, IR @ http://www.meetup.com/Hyderabad-Apache-Solr-Lucene-Group/ Join us for Hadoop, Spark, Cascading, Scala, NoSQL, Crawlers and all cutting-edge technologies @ http://www.meetup.com/Big-Data-Hyderabad/