Jive is using Flume to deliver the content of the social web (250M messages/day) to HDFS and HBase. Flume's flexible architecture allows us to stream data to our production data center as well as to Amazon Web Services. We periodically build and merge Lucene indices with Hadoop jobs and deploy them to Katta to provide near-real-time search results. This talk will explore our infrastructure and the decisions we've made to handle a fast-growing set of real-time data feeds. We will further explore other uses for Flume throughout Jive, including log collection and our distributed event bus.
Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters – DataWorks Summit
This document discusses enabling exploratory analytics of data in shared-service Hadoop clusters using Hunk. It describes how Hunk allows users to visually browse and analyze data in HDFS through an interactive search interface without needing to understand the data schema. The document provides examples of how Hunk has been used at Yahoo to gain operational insights from Hadoop cluster metrics and optimize performance. It demonstrates how Hunk can create visualizations and dashboards for analyzing jobs, queues, NameNode usage and more.
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar – Databricks
This session will give a new dimension to Apache Spark’s usage. See how Apache Spark and other open source projects can be used together in providing a scalable, real-time monitoring system. Apache Spark plays the central role in providing this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach can provide a lot of learning to the DevOps/Infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and some other open-source technologies.
Sony PlayStation's monitoring pipeline processes about 40 billion events every day and generates metrics in near real time (within 30 seconds). All the components used along with Apache Spark are horizontally scalable using any auto-scaling technique, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years, and will share some important lessons learned. For example, they still use Spark Streaming's receiver-based method in certain use cases instead of Direct Streaming, and will discuss the application of both methods, giving the knowledge back to the community.
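To make the receiver-based vs. direct distinction concrete, here is a minimal Java sketch of the direct (receiver-less) Kafka integration using the spark-streaming-kafka-0-10 API; the broker address, topic name, group id, and the 30-second batch interval are illustrative assumptions, not Sony's actual configuration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class MonitoringStream {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("event-monitoring");
    // 30-second batches, matching the "metrics within 30 seconds" goal described above
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "kafka:9092");             // assumed broker address
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "monitoring");                       // assumed consumer group

    // Direct (receiver-less) stream: Spark tracks the Kafka offsets itself
    JavaInputDStream<ConsumerRecord<String, String>> events =
        KafkaUtils.createDirectStream(
            ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(
                Collections.singletonList("service-events"), kafkaParams)); // assumed topic

    // A trivially simple metric: number of events per 30-second batch
    events.count().print();

    ssc.start();
    ssc.awaitTermination();
  }
}
```

With the direct approach Spark tracks the Kafka offsets itself rather than relying on a receiver and a write-ahead log, which is the trade-off the receiver-based method makes differently.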
During the course of this presentation, forward-looking statements were made regarding Splunk's expected performance and legal notices were provided. The presentation discussed using Splunk to analyze large amounts of data stored in Hadoop by moving computation to the data through MapReduce jobs while supporting Splunk Processing Language and maintaining schema on read. Optimization techniques like partition pruning were covered to improve performance as well as best practices, troubleshooting tips, and resources for using Hunk.
Yahoo Enabling Exploratory Analytics of Data in Shared-service Hadoop Clusters – Brett Sheppard
The document discusses Hunk, a self-service analytics platform for exploring, visualizing, and analyzing data stored in Hadoop clusters and other data stores. Hunk allows users to rapidly interact with data through an interactive search interface and preview results without waiting for full queries to finish. It provides integrated visualization of data through built-in graphs and charts. Hunk deployment is fast, requiring under 60 minutes to connect to Hadoop clusters and begin searching data.
The document provides an overview of Hunk, a product from Splunk that allows users to explore, analyze and visualize data stored in Hadoop. Some key points:
- Hunk uses virtual indexes to enable searching of data in Hadoop using Splunk's interface and capabilities without needing to move the data. It handles MapReduce jobs behind the scenes.
- It provides an interactive interface for business users to explore and query data in Hadoop in an easy and flexible way, with the ability to preview results while MapReduce jobs are running.
- Integration with Hadoop is done through Hadoop client libraries, requiring only read access to data stored in HDFS. Hunk supports various Hadoop distributions and operating systems.
BlueData Hunk Integration: Splunk Analytics for Hadoop – BlueData, Inc.
Hunk is a Splunk analytics tool that allows users to explore, analyze, and visualize raw big data stored in Hadoop and NoSQL data stores. It can interactively query raw data, accelerate reporting, create charts and dashboards, and archive historical data to HDFS. BlueData's EPIC platform enables running Hunk jobs on Hadoop clusters while accessing data from any storage system, such as HDFS, NFS, Gluster, and others. Hunk supports ingesting large amounts of data and provides pre-packaged analytics functions and intuitive visualization of results.
Two of the most frequently asked questions about Pinot’s history are “Why did LinkedIn build Pinot?”, “How is it different from Druid, ElasticSearch, Kylin?”. In this talk, we will go over the use cases that motivated us to build Pinot and how it has changed the analytics landscape at LinkedIn, Uber, and other companies.
Open Source Lambda Architecture with Hadoop, Kafka, Samza and Druid – DataWorks Summit
This document discusses using an open source Lambda architecture with Kafka, Hadoop, Samza, and Druid to handle event data streams. It describes the problem of interactively exploring large volumes of time series data. It outlines how Druid was developed as a fast query layer for Hadoop to enable low-latency queries over aggregated data. The architecture ingests raw data streams in real-time via Kafka and Samza, aggregates the data in Druid, and enables reprocessing via Hadoop for reliability.
This document provides an agenda and overview for a Splunk TechDay event focused on Splunk Ninja skills. The agenda includes refreshers on search language and structure, examples of SPL commands for searching, charting, and exploring data, and custom commands for extending SPL capabilities. The overview sections explain key aspects of SPL like its large command set, syntax based on Unix pipelines and SQL, and uses for data searching, filtering, and manipulation. Examples are provided for various SPL techniques including search/filter, evaluating/modifying fields, statistics, and charting.
From Eric Baldeschwieler's presentation "Hadoop @ Yahoo! - Internet Scale Data Processing" at the 2009 Cloud Computing Expo in Santa Clara, CA, USA. Here's the talk description on the Expo's site: http://cloudcomputingexpo.com/event/session/509
The first part of the talk describes the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional and optional processing; priority processing; late processing; and BCP management. The second part of the talk focuses on out-of-the-box support for Spark jobs.
Speakers:
Purshotam Shah is a senior software engineer with the Hadoop team at Yahoo, and an Apache Oozie PMC member and committer.
Satish Saley is a software engineer at Yahoo!. He contributes to Apache Oozie.
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and not such a challenge anymore. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later. You have to be able to include part of your analytics right after you consume the data streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the past few years, another family of products appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming, Flink and Kafka Streams, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations for stream processing, discuss the core properties a stream processing platform should provide, and highlight the differences you might find between the more traditional CEP and the more modern stream processing solutions.
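As a concrete illustration of analysing events right as they are consumed, here is a minimal Kafka Streams sketch in Java that counts readings per sensor over one-minute windows; the application id, broker address, and topic name are assumptions made for the example.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class SensorEventCounts {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-event-counts"); // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // assumed broker
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> events = builder.stream("sensor-events");      // assumed topic

    // Analyse the stream as it is consumed: count readings per sensor in 1-minute windows
    events.groupByKey()
          .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
          .count()
          .toStream()
          .foreach((window, count) ->
              System.out.printf("%s @ %s -> %d readings%n",
                  window.key(), window.window().startTime(), count));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```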
How Netflix Uses Druid in Real-time to Ensure a High Quality Streaming Experi... – Imply
Ensuring a consistently great Netflix experience while continuously pushing innovative technology updates is no easy feat.
We'll look at how Netflix turns log streams into real-time metrics to provide visibility into how devices are performing in the field, and we'll share some of the lessons learned around optimizing Druid to handle our load.
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016) – Sid Anand
This document discusses cloud native data pipelines. It begins by describing the speaker and their work experience. Then, it outlines some key qualities of resilient data pipelines like operability, correctness, timeliness and cost. Two use cases at the speaker's company for applying trust models to messages are presented - one using batch processing and the other using near real-time processing. The document discusses how tools like Apache Airflow, auto-scaling groups, Amazon Kinesis and Avro can help achieve those qualities for data pipelines in the cloud.
Managing your black friday logs Voxxed Luxembourg – David Pilato
The document discusses strategies for optimally scaling Elasticsearch clusters to handle large volumes of time-series data like logs. It recommends creating a new index daily to separate older data and allow deleting indexes after some period. It also suggests techniques like sharding data across nodes, using aliases to query multiple indexes, and load balancing ingest across coordinating nodes to optimize performance and avoid bottlenecks when data volumes increase over time.
This document provides an overview of WANdisco's NonStop HBase solution for making HBase continuously available for enterprise deployments. It discusses traditional high availability approaches that rely on backups and describes how these can fail. It then introduces WANdisco's patented active-active replication technology that provides 100% uptime with zero downtime. The document demonstrates how WANdisco implements multiple active HBase masters and region servers using a distributed coordination engine and Paxos consensus protocol. This allows HBase to avoid single points of failure and provides seamless failover for clients. It concludes with a demo of the NonStop HBase solution in action.
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset – Hortonworks
The document discusses building real-time dashboards on data streams. It describes using Apache Kafka to ingest streaming data from Wikipedia edits. The data is enriched using Kafka Streams and stored in Apache Druid for powering interactive visualizations in Superset. Key components are Kafka for the event flow, Kafka Streams for processing, Druid for the data store, and Superset for visualization.
Managing your Black Friday Logs NDC Oslo – David Pilato
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical of Black Friday.
Topics include:
* monitoring architectures
* optimal bulk size
* distributing the load
* index and shard size
* optimizing disk IO
Takeaway: best practices when building a monitoring system with the Elastic Stack, advanced tuning to optimize and increase event ingestion performance.
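On the optimal-bulk-size point above, here is a minimal sketch of what a bulk request to Elasticsearch looks like, using only the JDK HTTP client; the endpoint, daily index name, and the two sample events are assumptions, and the right number of documents per request is exactly what the talk advises you to find by testing on your own hardware.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    List<String> events = List.of(
        "{\"msg\":\"checkout ok\",\"level\":\"INFO\"}",
        "{\"msg\":\"payment timeout\",\"level\":\"ERROR\"}");

    // Newline-delimited bulk body: one action line, then one document line, per event
    StringBuilder body = new StringBuilder();
    for (String event : events) {
      body.append("{\"index\":{\"_index\":\"logs-2023.11.24\"}}\n"); // daily index, assumed name
      body.append(event).append("\n");
    }

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9200/_bulk"))              // assumed ES endpoint
        .header("Content-Type", "application/x-ndjson")
        .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```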
The document summarizes Druid, an open source data analytics platform, and how it has enhanced the data platform for a company to enable better business decisions. Key features of Druid include sub-second aggregate queries, real-time analytics dashboards, and live queries for unique users. Druid has helped scale to several hundred terabytes of data with thousands of queries per second while supporting new analytics applications, ad hoc reporting, and exploratory analysis. Future plans include improving the query service and migrating components to technologies like Spark, Flink, Mesos and Docker.
Spark Streaming & Kafka - The Future of Stream Processing – Jack Gudenkauf
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates with Kafka natively with no data loss, and even how to do exactly-once processing!
Chicago Data Summit: Flume: An Introduction – Cloudera, Inc.
Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It has four goals in mind: Reliability, Scalability, Extensibility, and Manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, support for low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.
Hadoop World 2011: Advanced HBase Schema Design – Cloudera, Inc.
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
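As one example of designing a row key for write-heavy ingest, here is a hedged sketch with the HBase Java client: a salt prefix spreads sequential writes across regions, and a reversed timestamp keeps a user's newest events first in a scan. The table name, column family, and salt-bucket count are assumptions for illustration, not the schema from the talk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EventWriter {
  private static final int SALT_BUCKETS = 16; // assumed; spreads hot sequential writes

  /** Row key = salt byte + user id + reversed timestamp (newest events sort first). */
  static byte[] rowKey(String userId, long eventTimeMillis) {
    byte salt = (byte) ((userId.hashCode() & 0x7fffffff) % SALT_BUCKETS);
    long reversedTs = Long.MAX_VALUE - eventTimeMillis;
    return Bytes.add(new byte[] {salt}, Bytes.toBytes(userId), Bytes.toBytes(reversedTs));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("events"))) {   // assumed table name
      Put put = new Put(rowKey("user-42", System.currentTimeMillis()));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
      table.put(put);
    }
  }
}
```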
This document discusses Spring for Apache Hadoop, a framework that simplifies using Apache Hadoop and related projects like HBase and Hive within the Spring programming model. It provides wrappers and configuration for common Hadoop tasks like MapReduce jobs, scripting, and accessing Hadoop databases and data processing engines. The goals are to provide a programmatic model for the Hadoop ecosystem, simplify client libraries, and leverage Spring features. It supports various Hadoop distributions and provides interfaces for MapReduce, HBase, Hive, Pig and other Hadoop technologies.
Designing a reactive data platform: Challenges, patterns, and anti-patterns Alex Silva
Presentation given at the O'Reilly Software Architecture Conference in NYC, April 2016.
Covers the key architectural decisions made behind the design of a reactive self-service data ingestion analytics platform that is able to fulfill several business use cases at massive scale, both at real-time and batch scopes, while leveraging and integrating Kafka and Spark in an efficient, easy to use way.
The presentation describes a message-driven, reactive and distributed design that leverages REST and Hypermedia protocols, and several open source frameworks and platforms, including Akka, Kafka, Hadoop and Spark.
How to develop Big Data Pipelines for Hadoop, by Costin Leau – Codemotion
Hadoop is not an island. To deliver a complete Big Data solution, a data pipeline needs to be developed that incorporates and orchestrates many diverse technologies. In this session we will demonstrate how the open source Spring Batch, Spring Integration and Spring Hadoop projects can be used to build manageable and robust pipeline solutions to coordinate the running of multiple Hadoop jobs (MapReduce, Hive, or Pig), but also encompass real-time data acquisition and analysis.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c – Mark Rittman
Delivered as a one-day seminar at the SIOUG and HROUG Oracle User Group Conferences, October 2014.
There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.
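The simplest of those ingestion options, copying a file into HDFS, looks like this through the Java FileSystem API; it is the programmatic equivalent of the hadoop fs -put shell command mentioned above, and the paths are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();          // picks up core-site.xml / hdfs-site.xml
    try (FileSystem fs = FileSystem.get(conf)) {
      // Equivalent of: hadoop fs -put /data/export/orders.csv /landing/orders/orders.csv
      fs.copyFromLocalFile(new Path("/data/export/orders.csv"),     // assumed local file
                           new Path("/landing/orders/orders.csv")); // assumed HDFS target
    }
  }
}
```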
Building Continuously Curated Ingestion Pipelines – Arvind Prabhakar
Data ingestion is a critical piece of infrastructure for any Big Data project. Learn about the key challenges in building Ingestion infrastructure and how enterprises are solving them using low level frameworks like Apache Flume, Kafka, and high level systems such as StreamSets.
Open Source Big Data Ingestion - Without the Heartburn! – Pat Patterson
Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can 'drift' due to infrastructure, OS and application changes, causing ETL tools and hand-coded solutions to fail, inducing heartburn in even the most resilient data scientist. This session will survey the big data ingestion landscape, focusing on how open source tools such as Sqoop, Flume, Nifi and StreamSets can keep the data pipeline flowing.
Data Ingestion, Extraction & Parsing on Hadoop – skaluska
The document discusses options for ingesting, extracting, parsing, and transforming data on Hadoop using Informatica products. It outlines Informatica's current capabilities for data integration with Hadoop and its roadmap to enhance capabilities for processing data directly on Hadoop in the first half of 2012. This will allow users to design data processing flows visually and execute them on Hadoop for optimized performance.
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
Intro to HBase Internals & Schema Design (for HBase users) – alexbaranau
This document provides an introduction to HBase internals and schema design for HBase users. It discusses the logical and physical views of HBase, including how tables are split into regions and stored across region servers. It covers best practices for schema design, such as using row keys efficiently and avoiding redundancy. The document also briefly discusses advanced topics like coprocessors and compression. The overall goal is to help HBase users optimize performance and scalability based on its internal architecture.
Arvind Prabhakar presented on Apache Flume. He discussed that Flume is an open-source system for aggregating large amounts of log and streaming data from many sources and efficiently transporting it to data stores and processing systems. It is designed to handle high volumes of continuously arriving data from distributed servers or devices. Flume uses a pipeline-based architecture that allows for reliable, scalable, and customizable data ingestion.
This document outlines Apache Flume, a distributed system for collecting large amounts of log data from various sources and transporting it to a centralized data store such as Hadoop. It describes the key components of Flume including agents, sources, sinks and flows. It explains how Flume provides reliable, scalable, extensible and manageable log aggregation capabilities through its node-based architecture and horizontal scalability. An example use case of using Flume for near real-time log aggregation is also briefly mentioned.
This document discusses using Apache Spark and Apache NiFi together for data lakes. It outlines the goals of a data lake including having a central data repository, reducing costs, enabling easier discovery and prototyping. It also discusses what is needed for a Hadoop data lake, including automation of pipelines, governance, and interactive data discovery. The document then provides an example ingestion project and describes using Apache Spark for functions like cleansing, validating, and profiling data. It outlines using Apache NiFi for the pipeline design with drag and drop functionality. Finally, it demonstrates ingesting and preparing data, data self-service and transformation, data discovery, and operational monitoring capabilities.
How to Build Continuous Ingestion for the Internet of Things – Cloudera, Inc.
The Internet of Things is moving into the mainstream and this new world of data-driven products is transforming a vast number of industry sectors and technologies.
However, IoT creates a new challenge: how to build and operationalize continual data ingestion from such a wide and ever-changing array of endpoints so that the data arrives consumption-ready and can drive analysis and action within the business.
In this webinar, Sean Anderson from Cloudera and Kirit Busu, Director of Product Management at StreamSets, will discuss Hadoop's ecosystem and IoT capabilities and provide advice about common patterns and best practices. Using specific examples, they will demonstrate how to build and run end-to-end IOT data flows using StreamSets and Cloudera infrastructure.
1. Hadoop is used extensively at Twitter to handle large volumes of data from logs and other sources totaling 7TB per day. Tools like Scribe and Crane are used to input data and Elephant Bird and HBase for storage.
2. Pig is used for data analysis on these large datasets to perform tasks like counting, correlating, and researching trends in users and tweets.
3. The results of these analyses are used to power various internal and external Twitter products and keep the business agile through ad-hoc analyses.
This document provides an overview of big data processing and how it is implemented at Detik.com. It defines big data as large, complex datasets that cannot be processed by traditional databases. It discusses the four V's of big data: volume, velocity, variety, and veracity. It then gives examples of big data sources and sizes. The document outlines the Hadoop ecosystem including components like HDFS, MapReduce, Hive, and Pig. It describes how Detik uses Hadoop, Akka, Hive and Pig to process large log files and perform analytics calculations on metrics like popular articles, exit rates, and bounce rates within 15 minutes.
Big Data Applications Made Easy: Fact Or Fiction? – Glenn Renfro
With Spring XD the answer is Fact. In short Spring XD provides a one stop shop for writing and deploying Big Data Applications. It provides a scalable, fault tolerant, distributed runtime for Data Ingestion, Analytics, and Workflow Orchestration using a single programming, configuration and extensibility model. By reducing the complexity of Big Data development, developers can focus on the business problem.
In this discussion, we will cover:
• The basics of Spring XD
• Show how to deploy streams that will handle data received from multiple sources, and write the results to various sinks
• Capture some analytics from a live data stream
• Show how to create and execute Jobs
• Demonstrate the failover capabilities of a XD Cluster
• Discuss how to create your own custom modules
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac... – Cloudera, Inc.
Michael Sun presented on CBS Interactive's use of Hadoop for web analytics processing. Some key points:
- CBS Interactive processes over 1 billion web logs daily from hundreds of websites on a Hadoop cluster with over 1PB of storage.
- They developed an ETL framework called Lumberjack in Python for extracting, transforming, and loading data from web logs into Hadoop and databases.
- Lumberjack uses streaming, filters, and schemas to parse, clean, lookup dimensions, and sessionize web logs before loading into a data warehouse for reporting and analytics.
- Migrating to Hadoop provided significant benefits including reduced processing time, fault tolerance, scalability, and cost effectiveness compared to their previous system.
Data Freeway is a system developed by Facebook to handle large volumes of data in real-time at scale. It includes components like Scribe for distributed logging, Calligraphus for persisting logs to HDFS, and Puma for real-time analytics on the data. The system is designed to handle over 10GB/second of data reliably with low latency of less than 10 seconds for 99% of data. It provides a simple interface for applications to access real-time data streams through tools like ptail. The system is open source and used at Facebook to power applications like real-time search, spam detection, and metrics analysis.
The millions of people who use Spotify each day generate a lot of data, roughly a few terabytes per day. What does it take to handle datasets of that scale, and what can be done with it? I will briefly cover how Spotify uses data to provide a better music listening experience and to strengthen their business. Most of the talk will be spent on our data processing architecture and how we leverage state-of-the-art data processing and storage tools, such as Hadoop, Cassandra, Kafka, Storm, Hive, and Crunch. Last, I'll present observations and thoughts on innovation in the data processing, aka Big Data, field.
Testing Big Data: Automated ETL Testing of Hadoop – Bill Hayduk
Learn why testing your enterprise's data is pivotal for success with Big Data and Hadoop. See how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This document discusses Hadoop usage at eBay over time from 2007 to 2015. It describes:
- The growth of eBay's Hadoop clusters from 1-10 nodes in 2007 to over 10,000 nodes and 150,000 cores projected for 2015.
- How the amount of data stored in Hadoop has grown from 1PB in 2010 to a projected 150+ PB in 2015.
- The types of clusters eBay uses including dedicated, shared, and HAAS clusters.
- Some key use cases for Hadoop at eBay like building a near real-time search index and processing 1.68 million items in 3 minutes.
- Operational requirements for eBay's large Hadoop ecosystem like high availability, security,
Hive is used at Facebook for data warehousing and analytics tasks on a large Hadoop cluster. It allows SQL-like queries on structured data stored in HDFS files. Key features include schema definitions, data summarization and filtering, extensibility through custom scripts and functions. Hive provides scalability for Facebook's rapidly growing data needs through its ability to distribute queries across thousands of nodes.
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... – Cloudera, Inc.
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
This document provides an overview of Hive, including:
1. It describes Hive's architecture which uses HDFS for storage, MapReduce for execution, and stores metadata in an RDBMS.
2. It outlines Hive's data types including primitive, collection, and file format types.
3. It discusses Hive's query language (HQL) which resembles SQL and can be used to define databases and tables, load and query data.
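To give a flavour of HQL, here is a small sketch that defines a partitioned table, loads a day of log data, and queries it through the HiveServer2 JDBC driver; the connection URL, table, and columns are assumptions made for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint; host, port and database are assumptions
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
         Statement stmt = conn.createStatement()) {

      // HQL looks like SQL: define a partitioned table over delimited files in HDFS...
      stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
          + " user_id STRING, url STRING)"
          + " PARTITIONED BY (dt STRING)"
          + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

      // ...load a day of data, then query it; each query compiles down to MapReduce jobs
      stmt.execute("LOAD DATA INPATH '/logs/2011-11-08'"
          + " INTO TABLE page_views PARTITION (dt='2011-11-08')");

      try (ResultSet rs = stmt.executeQuery(
          "SELECT url, COUNT(*) AS hits FROM page_views"
          + " WHERE dt='2011-11-08' GROUP BY url ORDER BY hits DESC LIMIT 10")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}
```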
Data infrastructure at Facebook, with reference to the conference paper "Data warehousing and analytics infrastructure at Facebook"
Data warehouse
Hadoop - Hive - Scribe
Presto is a distributed SQL query engine that allows for interactive analysis of large datasets across various data sources. It was created at Facebook to enable interactive querying of data in HDFS and Hive, which were too slow for interactive use. Presto addresses problems with existing solutions like Hive being too slow, the need to copy data for analysis, and high costs of commercial databases. It uses a distributed architecture with coordinators planning queries and workers executing tasks quickly in parallel.
This document discusses using streaming MapReduce to perform real-time big data analytics on click stream data. The goal is to analyze large log streams from web servers to identify products meeting thresholds for impressions within a short time period, such as having over 1,000 views in the last 5 seconds from hundreds of servers. The project was changed to use generated click stream data resembling web server logs to simulate millions of impressions per second from many virtual machines for testing real-time analysis within guaranteed service level agreements. HBase is used for storage.
Pivotal HD is a Hadoop distribution that includes additional components to configure, deploy, monitor and manage Hadoop clusters. It provides tools like the Command Center for visual cluster monitoring and job management, Hadoop Virtualization Extensions to improve resource utilization, and HAWQ for high performance SQL queries and analytics across Hadoop data.
GDPR compliance application architecture and implementation using Hadoop and ... – DataWorks Summit
The General Data Protection Regulation (GDPR) is legislation designed to protect the personal data of European Union citizens and residents. The main requirement is to log personal data accesses/changes in customer-specific applications. These logs can then be audited by owning entities to provide reporting to end users indicating usage of their personal data. Users have the "right to be forgotten," meaning their personal data can be purged from the system at their request. The regulation goes into effect on May 25, 2018 with significant fines for non-compliance.
This session will provide insight on how to approach/implement a GDPR compliance solution using Hadoop and streaming for any enterprise with heavy volumes of data. This session will delve into deployment strategies, the architecture of choice (Kafka, NiFi, and Hive ACID with streaming), implementation best practices, configurations, and security requirements. Hortonworks Professional Services System Architects helped the customer on the ground to design, implement, and deploy this application in production.
Speaker
Saurabh Mishra, Hortonworks, Systems Architect
Arun Thangamani, Hortonworks, Systems Architect
The document discusses LinkedIn's data ecosystem and the challenge of bridging operational transactional data (OLTP) with analytical processing (OLAP) at scale. It describes LinkedIn's solution called Lumos, which is a scalable ETL framework that uses change data capture, delta processing, and virtual snapshots to frequently refresh petabyte-scale data from OLTP databases into Hadoop for OLAP. Lumos supports requirements like handling multiple data centers, schema evolution, and efficient change capture while ensuring data consistency and low latency refresh times.
This document contains the resume of Hassan Qureshi. He has over 9 years of experience as a Hadoop Lead Developer with expertise in technologies like Hadoop, HDFS, Hive, Pig and HBase. Currently he works as the technical lead of a data engineering team developing insights from data. He has extensive hands-on experience installing, configuring and maintaining Hadoop clusters in different environments.
This document provides an introduction to big data and related technologies. It defines big data as datasets that are too large to be processed by traditional methods. The motivation for big data is the massive growth in data volume and variety. Technologies like Hadoop and Spark were developed to process this data across clusters of commodity servers. Hadoop uses HDFS for storage and MapReduce for processing. Spark improves on MapReduce with its use of resilient distributed datasets (RDDs) and lazy evaluation. The document outlines several big data use cases and projects involving areas like radio astronomy, particle physics, and engine sensor data. It also discusses when Hadoop and Spark are suitable technologies.
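A minimal Java sketch of the RDD and lazy-evaluation point: transformations such as filter only record lineage, and nothing executes until an action such as count is called. The input path is an assumption.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvalDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("lazy-eval-demo").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///logs/engine-sensors.log"); // assumed path

      // Transformation: lazily recorded in the RDD lineage, nothing is read yet
      JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));

      // Action: triggers the actual distributed computation
      long errorCount = errors.count();
      System.out.println("error lines: " + errorCount);
    }
  }
}
```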
Similar to Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software (20)
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
Cloudera Data Impact Awards 2021 - Finalists – Cloudera, Inc.
The document outlines the 2021 finalists for the annual Data Impact Awards program, which recognizes organizations using Cloudera's platform and the impactful applications they have developed. It provides details on the challenges, solutions, and outcomes for each finalist project in the categories of Data Lifecycle Connection, Cloud Innovation, Data for Enterprise AI, Security & Governance Leadership, Industry Transformation, People First, and Data for Good. There are multiple finalists highlighted in each category demonstrating innovative uses of data and analytics.
2020 Cloudera Data Impact Awards Finalists – Cloudera, Inc.
Cloudera is proud to present the 2020 Data Impact Awards Finalists. This annual program recognizes organizations running the Cloudera platform for the applications they've built and the impact their data projects have on their organizations, their industries, and the world. Nominations were evaluated by a panel of independent thought-leaders and expert industry analysts, who then selected the finalists and winners. Winners exemplify the most-cutting edge data projects and represent innovation and leadership in their respective industries.
The document outlines the agenda for Cloudera's Enterprise Data Cloud event in Vienna. It includes welcome remarks, keynotes on Cloudera's vision and customer success stories. There will be presentations on the new Cloudera Data Platform and customer case studies, followed by closing remarks. The schedule includes sessions on Cloudera's approach to data warehousing, machine learning, streaming and multi-cloud capabilities.
Machine Learning with Limited Labeled Data 4/3/19 – Cloudera, Inc.
Cloudera Fast Forward Labs’ latest research report and prototype explore learning with limited labeled data. This capability relaxes the stringent labeled data requirement in supervised machine learning and opens up new product possibilities. It is industry invariant, addresses the labeling pain point and enables applications to be built faster and more efficiently.
Data Driven With the Cloudera Modern Data Warehouse 3.19.19 – Cloudera, Inc.
In this session, we will cover how to move beyond structured, curated reports based on known questions on known data, to an ad-hoc exploration of all data to optimize business processes and into the unknown questions on unknown data, where machine learning and statistically motivated predictive analytics are shaping business strategy.
Introducing Cloudera DataFlow (CDF) 2.13.19 – Cloudera, Inc.
Watch this webinar to understand how Hortonworks DataFlow (HDF) has evolved into the new Cloudera DataFlow (CDF). Learn about key capabilities that CDF delivers such as -
-Powerful data ingestion powered by Apache NiFi
-Edge data collection by Apache MiNiFi
-IoT-scale streaming data processing with Apache Kafka
-Enterprise services to offer unified security and governance from edge-to-enterprise
Introducing Cloudera Data Science Workbench for HDP 2.12.19 – Cloudera, Inc.
Cloudera’s Data Science Workbench (CDSW) is available for Hortonworks Data Platform (HDP) clusters for secure, collaborative data science at scale. During this webinar, we provide an introductory tour of CDSW and a demonstration of a machine learning workflow using CDSW on HDP.
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19 – Cloudera, Inc.
Join Cloudera as we outline how we use Cloudera technology to strengthen sales engagement, minimize marketing waste, and empower line of business leaders to drive successful outcomes.
Leveraging the cloud for analytics and machine learning 1.29.19 – Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on Azure. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19 – Cloudera, Inc.
Join us to learn about the challenges of legacy data warehousing, the goals of modern data warehousing, and the design patterns and frameworks that help to accelerate modernization efforts.
Leveraging the Cloud for Big Data Analytics 12.11.18 – Cloudera, Inc.
Learn how organizations are deriving unique customer insights, improving product and services efficiency, and reducing business risk with a modern big data architecture powered by Cloudera on AWS. In this webinar, you see how fast and easy it is to deploy a modern data management platform—in your cloud, on your terms.
Explore new trends and use cases in data warehousing including exploration and discovery, self-service ad-hoc analysis, predictive analytics and more ways to get deeper business insight. Modern Data Warehousing Fundamentals will show how to modernize your data warehouse architecture and infrastructure for benefits to both traditional analytics practitioners and data scientists and engineers.
The document discusses the benefits and trends of modernizing a data warehouse. It outlines how a modern data warehouse can provide deeper business insights at extreme speed and scale while controlling resources and costs. Examples are provided of companies that have improved fraud detection, customer retention, and machine performance by implementing a modern data warehouse that can handle large volumes and varieties of data from many sources.
Extending Cloudera SDX beyond the Platform – Cloudera, Inc.
Cloudera SDX is by no means restricted to just the platform; it extends well beyond. In this webinar, we show you how Bardess Group's Zero2Hero solution leverages the shared data experience to coordinate Cloudera, Trifacta, and Qlik to deliver complete customer insight.
Federated Learning: ML with Privacy on the Edge 11.15.18 – Cloudera, Inc.
Join Cloudera Fast Forward Labs Research Engineer, Mike Lee Williams, to hear about their latest research report and prototype on Federated Learning. Learn more about what it is, when it’s applicable, how it works, and the current landscape of tools and libraries.
Analyst Webinar: Doing a 180 on Customer 360 – Cloudera, Inc.
451 Research Analyst Sheryl Kingstone, and Cloudera’s Steve Totman recently discussed how a growing number of organizations are replacing legacy Customer 360 systems with Customer Insights Platforms.
Build a modern platform for anti-money laundering 9.19.18 – Cloudera, Inc.
In this webinar, you will learn how Cloudera and BAH riskCanvas can help you build a modern AML platform that reduces false positive rates, investigation costs, technology sprawl, and regulatory risk.
Introducing the data science sandbox as a service 8.30.18 – Cloudera, Inc.
How can companies integrate data science into their businesses more effectively? Watch this recorded webinar and demonstration to hear more about operationalizing data science with Cloudera Data Science Workbench on Cazena’s fully-managed cloud platform.
Hadoop World 2011: Storing and Indexing Social Media Content in the Hadoop Ecosystem - Lance Riedel & Brent Halsey - Jive Software
1. Storing and Indexing Social Media Content in the Hadoop Ecosystem – Lance Riedel, Brent Halsey, Jive Software
2. Jive: Social Networking for the Enterprise Engage Employees Engage Customers Engage the Social Web What Matters Apps
3. Jive Social Media Monitoring Overview – Jive Social Media Engagement stores social media for monitoring (e.g. brand sentiment), searching, and analysis.
15. Why Flume? We need to distribute our data reliably to multiple locations and systems (e.g. servers in our datacenter, in EC2, to HBase, to Hadoop). Flume design goals: Reliability – failover collectors, master failover; Scalability – linear scale by adding collector nodes; Manageability – central ZooKeeper-managed configs; Extensibility – custom sources and sinks. Good match!
16. Flume Overview: The Canonical Use Case – an agent tier (one Flume agent per server) forwards events to a smaller collector tier, which writes them into HDFS.
18. Katta – distributed Lucene: a Katta master assigns Lucene index shards (Index 1, Index 2) to Katta nodes, with replicas spread across nodes; the raw events (Raw.seq) and the indexes live in Hadoop HDFS.
26. Distributed Lucene Indexer Job – map tasks read the raw-event input HDFS blocks and each builds its own Lucene index (Index 1–4).
27. Distributed Lucene Indexer Job – the per-map indexes are shuffled/sorted with key -> shard number and value -> path to index, and the reduce tasks merge them into the final shards (Shard 1, Shard 2).
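A hedged sketch of what the reduce side of such a job can look like, written against a recent Lucene API rather than the 2011-era one used in the talk: each reducer receives the paths of the per-mapper indexes that belong to one shard and merges them with IndexWriter.addIndexes. The class name, local scratch path, and the HDFS-to-local copy (elided into comments) are assumptions, not Jive's actual code.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Reduce side of the distributed indexer: each reducer receives the paths of the
 * per-mapper Lucene indexes assigned to one shard and merges them into a single index.
 */
public class ShardMergeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

  @Override
  protected void reduce(IntWritable shard, Iterable<Text> indexPaths, Context context)
      throws IOException, InterruptedException {
    String shardDir = "/tmp/shard-" + shard.get();                   // local scratch space
    try (Directory merged = FSDirectory.open(Paths.get(shardDir));
         IndexWriter writer = new IndexWriter(merged,
             new IndexWriterConfig(new StandardAnalyzer()))) {
      for (Text indexPath : indexPaths) {
        // In the real job the per-mapper index lives in HDFS and would be copied to
        // local disk first; here we assume indexPath already points at a local directory.
        try (Directory mapperIndex = FSDirectory.open(Paths.get(indexPath.toString()))) {
          writer.addIndexes(mapperIndex);
        }
      }
      writer.forceMerge(1);                                          // one segment per shard
    }
    // The merged shard directory would then be copied back to HDFS for Katta to deploy.
    context.write(shard, new Text(shardDir));
  }
}
```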
28. 5 Minute Index Deployment Incremental Indexer Job Raw.seq
34. Job Controller – incremental indexing: 1. Scan HDFS (/raw holds the closed raw.time-*.seq files plus the still-open raw.time-1.7.seq.tmp) 2. Determine the raw input files 3. Run the INCREMENTAL index job (Distributed Indexer Job) 4. Deploy the index (e.g. Index.INCREMENTAL.time-1.6) to Katta, alongside the existing /indexes (Index.HOUR.time-1, Index.INCREMENTAL.time-1.1 through 1.3).
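The controller loop on this slide boils down to: list the raw sequence files in /raw, skip the file Flume is still writing (the .tmp suffix), submit an incremental index job for anything new, and deploy the result to Katta. A minimal sketch of the scan step using the Hadoop FileSystem API; the class name and the main-method wiring are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IncrementalIndexController {
  /** Returns the raw sequence files that are closed and ready to be indexed. */
  static List<Path> findIndexableFiles(FileSystem fs, Path rawDir) throws Exception {
    List<Path> ready = new ArrayList<>();
    for (FileStatus status : fs.listStatus(rawDir)) {
      String name = status.getPath().getName();
      // Flume is still appending to raw.time-*.seq.tmp, so only pick up closed .seq files
      if (name.endsWith(".seq")) {
        ready.add(status.getPath());
      }
    }
    return ready;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (Path p : findIndexableFiles(fs, new Path("/raw"))) {
      System.out.println("would submit incremental index job for " + p);
      // 3. run the distributed indexer job on p   4. deploy the resulting index to Katta
    }
  }
}
```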
35. Custom sources / sinks / decorators: HBase Sink – there is now a supported HBase sink, but we do some of our own transformations before insertion (e.g. it understands our JSON data); Zoie Realtime Search Sink – real-time searching of events on Flume (more details next slide); Regex Filter Decorator – allows only events through that match a key value.
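The "Regex Filter Decorator" idea (pass an event through only if its body matches a pattern) translates to today's Flume NG as an Interceptor; the sketch below is written against that API rather than the Flume OG decorator API the talk used, and the configuration property name is an assumption. Current Flume releases also ship a built-in regex filtering interceptor that covers the same need.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/** Drops every event whose body does not match the configured regex. */
public class RegexFilterInterceptor implements Interceptor {
  private final Pattern pattern;

  private RegexFilterInterceptor(Pattern pattern) {
    this.pattern = pattern;
  }

  @Override public void initialize() { }

  @Override
  public Event intercept(Event event) {
    String body = new String(event.getBody(), StandardCharsets.UTF_8);
    return pattern.matcher(body).find() ? event : null;   // null = drop the event
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> kept = new ArrayList<>(events.size());
    for (Event e : events) {
      Event out = intercept(e);
      if (out != null) kept.add(out);
    }
    return kept;
  }

  @Override public void close() { }

  public static class Builder implements Interceptor.Builder {
    private Pattern pattern;

    @Override
    public void configure(Context context) {
      // e.g. agent.sources.s1.interceptors.i1.regex = brand-x   (assumed property name)
      pattern = Pattern.compile(context.getString("regex", ".*"));
    }

    @Override
    public Interceptor build() {
      return new RegexFilterInterceptor(pattern);
    }
  }
}
```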
36. Real-time Search and Indexing – the collector fanout writes events to HBase and to Raw.seq files in Hadoop HDFS; the job controller runs the distributed indexer job to build 5-minute indexes that Katta serves; a search broker fans queries out across the Katta indexes and returns results.
37. Real-time Search and Indexing – the same flow with a Zoie Flume sink added, closing the gap between the 5-minute Katta indexes and roughly 10-second freshness.
42–45. Zoie Flume Sink – a Jetty server hosts a rolling set of small in-memory Zoie indexes covering 0–5 min, 5–10 min, 10–15 min, and > 15 min of events, with segments rotating as they age; the search broker queries these alongside the Katta-served indexes.
46. Real-time Search and Indexing – back to the full picture: events flow through the collector fanout into HBase and Raw.seq on HDFS; the job controller, distributed indexer job, and Katta provide the 5-minute indexes while the Zoie Flume sink provides ~10-second freshness, and the search broker queries both.
47. Hadoop Ecosystem @ Jive – track user activity to: power recommendations (the What Matters activity stream, people you should meet, topics you are interested in); social search (search ranking based on the social graph, topical graph, and keywords); and analytics (so a community manager understands what users are collaborating on and how engagement is increasing).
We collect content from Twitter, Facebook, blogs, and news outlets, and allow our users to search on this content, monitor it, and analyze it.
Screen shot of the app shows a user's list of monitors and content matching those monitors. Users can filter by sentiment and by the content source. They can engage in social conversations through twitter and facebook. And they can create discussions within Jive SBS.
Users can analyze social media trends over time with graph views for sentiment and content sources.
The old system takes data from content sources and throws it on a queue. The queue acts as a buffer to processors that process the content and insert it into a MySQL DB. There is some fault tolerance with multiple servers connecting to multiple queues, but it required a fair bit of monitoring and manual intervention when problems arose.
Limited because we throw away most of our content. Pushing the limits of MySQL can be painful.
Wanted to store all content (limited window), search it, and analyze it.
Chose HBase for random lookup. HDFS for chronological streaming. Katta for distributing Lucene shards. Hadoop for running map reduce.
Built out prototype of new system using Amazon's EC2 and needed a way to stream data into these servers. Internal / External IP addresses of EC2 made it difficult to connect directly to HDFS and HBase. Flume provided this connectivity along with desirable delivery guarantees.
Additionally, can fan out the data to bring data into EC2 along with our production system.
KATTA – For those not familiar with Katta, it is a distributed search engine that has two major responsibilities. The first is distributing indexes from HDFS to any number of Katta nodes. Katta nodes can run across as many machines as you want, it is easy to add more, and Katta will redistribute indexes if nodes fail. Katta has a highly customizable distribution policy – you can round-robin, or have hot/cold topologies where newer indexes are placed on faster machines. As part of the distribution there is also replication of indexes for increased load performance and failover. All of this is managed through ZooKeeper, so it is quite resilient and does a very good job of keeping indexes where ZooKeeper says they should be. The second responsibility of Katta is to take a single search request, send it to every Katta node, and gather the results.
OVERVIEW OF SEARCH – 30 days of Twitter, Facebook, major news and blogs. The next few slides show how we tackled searching a moving window of 30 days of Twitter (full firehose), the public Facebook feed, and Spinn3r (which includes all major news and blog sites). SEARCH IS USED TO INVESTIGATE MONITOR CREATION AND FOR AD-HOC ANALYTICS – search is used to investigate what monitor to create, so searching historical data is of course key; it also lets us do ad-hoc analytics over recent history, e.g. show me sentiment or raw counts for an ad-hoc query over the last 30 days.
TRANSITION – OTHER REQUIREMENTS NEED FLEXIBILITY. Other requirements of course pop up, so it was good that we chose Flume, because we could easily add new functionality. One of the key customization areas of Flume is the custom sources, sinks, and decorators you can supply. SOURCES OVERVIEW – sources allow you to create custom hooks into data providers. There is a huge list of sources provided out of the box, from tailing files to Avro HTTP endpoints where you can send raw events to Flume over HTTP using a Flume event Avro schema. SINK OVERVIEW – sinks allow you to create custom places to put the events. Again, there is a slew of out-of-the-box sinks such as HBase and HDFS. DECORATOR OVERVIEW – and then there are decorators that you can add pretty much anywhere in the topology, where you are allowed to inspect each event and add metadata, change the contents, or throw them on the floor. SOME OF OUR OWN – want to highlight a few customizations we did: (rest on slide)