Cloudera Search provides full-text search capabilities for Hadoop data by integrating Apache Solr. It allows for near real-time and batch indexing from data sources like HDFS, HBase, and Flume. Cloudera Search uses components like SolrCloud, Morphlines, and Sentry to provide distributed, scalable, and secure search across the Hadoop ecosystem.
Lessons From Sharding Solr At Etsy: Presented by Gregg Donovan, Etsy (Lucidworks)
Gregg Donovan presented on lessons learned from sharding Solr at Etsy over three versions:
1) Initially, Etsy did not shard to avoid problems, but the single node approach did not scale.
2) The first sharding version used local sharding across multiple JVMs per host for better latency and manageability.
3) The current version uses distributed sharding across data centers for further latency gains, but this introduced challenges of partial failures, synchronization, and distributed queries.
Searching for Better Code: Presented by Grant Ingersoll, Lucidworks
The document discusses Lucidworks' Fusion product, which is a search platform that enhances Apache Solr. It provides connectors to various data sources, integrated ETL pipelines, built-in recommendations, and security features. The document outlines Fusion's architecture, demo use cases for basic and code search, and next steps for integrating additional analysis tools like OpenGrok.
Cloudera Morphlines is a new open source framework, recently added to the CDK, that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.
This document summarizes a presentation on using Apache Calcite for cost-based query optimization in Apache Phoenix. Key points include:
- Phoenix is adding Calcite's query planning capabilities to improve performance and SQL compliance over its existing query optimizer.
- Calcite models queries as relational algebra expressions and uses rules, statistics, and a cost model to choose the most efficient execution plan.
- Examples show how Calcite rules like filter pushdown and exploiting sortedness can generate better plans than Phoenix's existing optimizer.
- Materialized views and interoperability with other Calcite data sources like Apache Drill are areas for future improvement beyond the initial Phoenix integration.
Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by … (Lucidworks)
The document discusses searching enterprise data lakes with Apache Solr. It begins with an overview of how data storage has evolved from single databases to data warehouses to modern data lakes that store vast amounts of raw and processed data. The challenge is finding needed data in this environment. The document then covers the process for indexing data lake contents with Solr, including ingesting data, configuring Solr, parsing and indexing data, searching and analyzing data. It concludes with a demonstration of performing these steps and resources for further information.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document provides an overview of scaling a Splunk deployment from an initial use case to a larger enterprise deployment. It discusses growing use cases and data volume over time. The agenda covers use case mapping, simple scaling approaches, indexer and search head clustering, distributed management, and hybrid cloud deployments. Best practices are outlined for sizing storage, tuning indexers, and designing high availability into the forwarding, indexing, and search tiers. Clustering impacts on storage sizing and additional hosts are also addressed.
Taking Splunk to the Next Level - Architecture Breakout Session (Splunk)
This document discusses strategies for scaling a Splunk deployment to handle more use cases, data, and critical needs. It covers expanding use cases through business cases, scaling indexers through clustering and storage optimization, scaling search heads through clustering, and using centralized management and hybrid cloud/on-premises deployments. The agenda also promotes attending the upcoming Splunk .conf2015 conference for sessions on high availability, large deployments, search head clustering, and more.
Building a Vibrant Search Ecosystem @ Bloomberg: Presented by Steven Bower & … (Lucidworks)
This document summarizes a presentation given by Steven Bower and Ken LaPorte of Bloomberg about building their search ecosystem. They started by reviewing Bloomberg's existing fragmented search solutions and selected Apache Solr as their new platform. They created a specialized search team and designed Solr as a middleware service. This supported migrating over 1000 applications and indexing over 10 billion documents. They discussed challenges around monitoring, configuration management, and infrastructure scaling. Their solutions involved improved monitoring tools, adopting DevOps practices like Git and continuous integration, and optimizing hardware resources. Future plans include containerization, failure prediction, and expanding Solr's capabilities.
This document discusses using big data tools to build a fraud detection system. It outlines using Azure infrastructure to set up a Hadoop cluster with HDFS, HBase, Kafka and Spark. Mock transaction data will be generated and sent to Kafka. Spark jobs will process the data in batches, identifying potentially fraudulent transactions and writing them to an HBase table. The data will be visualized using Zeppelin notebooks querying Phoenix SQL on HBase. This will allow analysts to further investigate potential fraud patterns in near real-time.
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin (Spark Summit)
This document discusses securing Spark applications. It covers encryption, authentication, and authorization. Encryption protects data in transit using SASL or SSL. Authentication uses Kerberos to identify users. Authorization controls data access using Apache Sentry and the Sentry HDFS plugin, which synchronizes HDFS permissions with higher-level abstractions like tables. A future RecordService aims to provide a unified authorization system at the record level for Spark SQL.
SplunkLive! Atlanta Mar 2013 - University of Alabama at Birmingham (Splunk)
This document summarizes a presentation given by George Starcher at the University of Alabama at Birmingham about their use of Splunk for security and compliance. It discusses how Splunk helped UAB attribute IP addresses to users to resolve DMCA copyright infringement claims in minutes rather than days. It also describes how Splunk is used to identify compromised user credentials accessing their VPN from unusual locations like China or using proxies. The presentation outlines several saved searches and apps developed for UAB to monitor security events, login activity, and locations of wireless users. It concludes by discussing UAB's plans to implement more Splunk security applications and improve data retention, indexing, and compliance.
East Bay Java User Group Oct 2014: Spark Streaming, Kinesis, Machine Learning (Chris Fregly)
This document provides an overview and summary of Spark Streaming. It discusses Spark Streaming's architecture and APIs. Spark Streaming receives live input data streams and divides them into micro-batches, which it processes using Spark's execution engine to perform operations like transformations and actions. This allows for low-latency, high-throughput stream processing with fault tolerance. The document also covers Spark Streaming deployment and integrating it with sources like Kinesis, as well as monitoring and tuning Spark Streaming applications.
SplunkLive Melbourne: Scaling and best practice for Splunk on premise and in the cloud (Gabrielle Knowles)
Leverage the Splunk architecture to provide the best possible performance. Whether deploying on premise, in the cloud or on Splunk Cloud, this session will guide you through scenarios that will assist in getting the best from all these options. The agenda also covers how you can plan your searches and reporting to provide the best results for your end users.
Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark After Dark is a mock dating site that uses the latest Spark libraries, AWS Kinesis, Lambda Architecture, and Probabilistic Data Structures to generate dating recommendations.
There will be 5+ demos covering everything from basic data ETL to advanced data processing including Alternating Least Squares Machine Learning/Collaborative Filtering and PageRank Graph Processing.
There is heavy emphasis on Spark Streaming and AWS Kinesis.
Watch the video here
https://www.youtube.com/watch?v=g0i_d8YT-Bs
Rocketfuel processes over 120 billion ad auctions per day and needs to detect fraud in real time to prevent losses. They developed Helios, which ingests event data from Kafka and HDFS into Storm in real time, joins the streams in HBase, then runs MapReduce jobs hourly to populate an OLAP cube for analyzing feature vectors and detecting fraud patterns. This architecture on Hadoop allows them to easily scale real-time processing and experiment with different configurations to quickly react to fraud.
Case study of Rujhaan.com (a social news app) (Rahul Jain)
Rujhaan.com is a news aggregation app that collects trending news and social media discussions from topics of interest to users. It uses various technologies including a crawler to collect data from social media, Apache Solr for search, MongoDB for storage, Redis for caching, and machine learning techniques like classification and clustering. The presenter discussed the technical architecture and challenges of building Rujhaan.com to provide fast, personalized news content to over 16,000 monthly users while scaling to growing traffic levels.
When it comes to data security, Uber’s business has unique needs related to scale, use-case, and technical stacks. This talk will discuss how our data platform team addressed specific challenges in deploying Uber's security requirements for Apache Hadoop, including how we leveraged open source building blocks. We'll share insights on how we augmented our Kerberized Hadoop integration with additional authentications mechanisms as well as our approach to supporting custom authentication in Apache Knox. In particular, we will elaborate Uber’s contributions to Apache Knox, specifically a novel pluggable platform for custom validation of any user request. This talk will also cover how we address table, column, and partition-level access control while ensuring improved developer productivity. In particular, we will explain how we translate RBAC policy into HDFS ACL to control data access, our internal audit platform built to detect and analyze the common security infringements, and real-world examples from our experiences in production.
Speakers
Mohammad Islam, Staff Software Engineer, Uber
Wei Han, Manager, Uber
Embeddable data transformation for real time streams (Joey Echeverria)
This document summarizes Joey Echeverria's presentation on embeddable data transformation for real-time streams. Some key points include:
- Stream processing requires the ability to perform common data transformations like filtering, extracting, projecting, and aggregating on streaming data.
- Tools like Apache Storm, Spark, and Flink can be used to build stream processing topologies and jobs, but also have limitations for embedding transformations.
- Rocana Transform provides a library and DSL for defining reusable data transformation configurations that can be run within different stream processing systems or in batch jobs.
- The library supports common transformations as well as custom actions defined through Java. Configurations can extract metrics, parse logs, and perform other transformations.
Why Kubernetes as a container orchestrator is a right choice for running Spark (DataWorks Summit)
Building and deploying an analytics service in the cloud is a challenge; maintaining that service is a bigger one. In a world where users are gravitating toward provisioning cluster instances on the fly for analytics or other purposes, and shutting them down when the jobs are done, containers and container orchestration are more relevant than ever.
Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share the experience of building such a service and deploying it on a Kubernetes cluster. In this talk, we will discuss all the requirements which an enterprise grade Hadoop/Spark cluster running on containers bring in for a container orchestrator.
This talk will cover in detail how the Kubernetes orchestrator can be used to meet all our needs for resource management, scheduling, networking and network isolation, volume management, etc. We will discuss how we replaced our home-grown container orchestrator, which used to manage the container lifecycle and resources in accordance with our requirements, with Kubernetes. We will also discuss the orchestrator features that are helping us deploy and patch thousands of containers, as well as a list of capabilities we believe need improvement or can be enhanced in a container orchestrator.
Speaker
Rachit Arora, SSE, IBM
This document discusses key architectural considerations for Internet of Things (IoT) systems. It outlines three main tiers: origin, transport, and analytics. The origin tier includes sensors, devices, and gateways that generate IoT data. Common protocols at this tier are discussed. The transport tier orchestrates data flow and can perform transformations. Apache NiFi and MiNiFi are presented as options. The analytics tier is where insights are derived from the data through streaming and batch processing. Apache Beam is highlighted as a framework that can unify both types of processing. The document also discusses firmware versions, parsers, schemas, and data ownership challenges.
Create a Smarter Data Lake with HP Haven and Apache Hadoop (Hortonworks)
An organization’s information is spread across multiple repositories, on-premise and in the cloud, with limited ability to correlate information and derive insights. The Smart Content Hub solution from HP and Hortonworks enables a shared content infrastructure that transparently synchronizes information with existing systems and offers an open standards-based platform for deep analysis and data monetization.
- Leverage 100% of your data: Text, images, audio, video, and many more data types can be automatically consumed and enriched using HP Haven (powered by HP IDOL and HP Vertica), making it possible to integrate this valuable content and insights into various line of business applications.
- Democratize and enable multi-dimensional content analysis: Empower your analysts, business users, and data scientists to search and analyze Hadoop data with ease, using the 100% open source Hortonworks Data Platform.
- Extend the enterprise data warehouse: Synchronize and manage content from content management systems, and crack open the files in whatever format they happen to be in.
- Dramatically reduce complexity with enterprise-ready SQL engine: Tap into the richest analytics that support JOINs, complex data types, and other capabilities only available with HP Vertica SQL on the Hortonworks Data Platform.
Speakers:
- Ajay Singh, Director, Technical Channels, Hortonworks
- Will Gardella, Product Management, HP Big Data
Hadoop Powers Modern Enterprise Data Architectures (DataWorks Summit)
1) Hadoop enables modern data architectures that can process both traditional and new data sources to power business analytics and other applications.
2) By 2015, organizations that build modern information management systems using technologies like Hadoop will financially outperform their peers by 20%.
3) Hadoop provides an agile "data lake" solution that allows organizations to capture, process, and access all their data in various ways for business intelligence, analytics, and other uses.
Discover Red Hat and Apache Hadoop for the Modern Data Architecture - Part 3 (Hortonworks)
The document discusses using Hortonworks Data Platform (HDP) and Red Hat JBoss Data Virtualization to create a data lake solution and virtual data marts. It describes how a data lake enables storing all types of data in a single repository and accessing it through tools. Virtual data marts allow lines of business to access relevant data through self-service interfaces while maintaining governance and security over the central data lake. The presentation includes demonstrations of virtual data marts integrating data from Hadoop and other sources.
10 Amazing Things To Do With a Hadoop-Based Data Lake (VMware Tanzu)
Greg Chase, Director, Product Marketing, presents Big Data: 10 Amazing Things to do With A Hadoop-based Data Lake at the Strata Conference + Hadoop World 2014 in NYC.
A Windows Azure approach towards building SaaS solutions.
SaaS is fundamentally a business model where the application is owned, operated, and managed by the vendor; the consumer pays for usage and consumes the application.
SaaS offers a "hands-off" model that frees the consumer from the pain of server and application management and instead allows the consumer to focus on the business.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Data Platform (Hortonworks)
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
Solr + Hadoop: Interactive Search for Hadoop (gregchanan)
This document discusses Cloudera Search, which integrates Apache Solr with Cloudera's distribution of Apache Hadoop (CDH) to provide interactive search capabilities. It describes the architecture of Cloudera Search, including components like Solr, SolrCloud, and Morphlines for extraction and transformation. Methods for indexing data in real-time using Flume or batch using MapReduce are presented. The document also covers querying, security features like Kerberos authentication and collection-level authorization using Sentry, and concludes by describing how to obtain Cloudera Search.
This document discusses integrating Apache Solr with Apache Hadoop for big data search capabilities. It provides background on Mark Miller and the history of search on Hadoop. It outlines how Solr, Lucene, Hadoop, and related projects can be integrated to allow full-text search across large datasets in HDFS. Specific integration points discussed include allowing Solr to read and write directly to HDFS, custom directory support in Solr, replication support, and using Morphlines for extraction, transformation, and loading of data into Solr.
This document summarizes how Solr and Lucidworks Fusion can be used for big data search and analytics. It discusses indexing strategies like using MapReduce, Spark, and Fusion connectors to index structured and unstructured data from HDFS. It also covers topics like Solr on HDFS, auto add replicas, security, cluster sizing, and using the lambda architecture with Spark streaming to enable real-time search over batch-processed historical data. The document promotes Lucidworks Fusion as a search platform that can handle massive scales of data, provide real-time search capabilities, and work with any data source securely.
This document provides an overview of SolrCloud on Hadoop. It discusses how SolrCloud allows for distributed, highly scalable search capabilities on Hadoop clusters. Key components that work with SolrCloud are also summarized, including HDFS for storage, MapReduce for processing, and ZooKeeper for coordination services. The document demonstrates how SolrCloud can index and query large datasets stored in Hadoop.
This document discusses integrating Hadoop and Solr. Hadoop is useful for storing and processing large amounts of data, while Solr enables fast search across structured and unstructured data. The document outlines how Hadoop can store documents and Solr can index them for search, as well as how technologies like Flume can process streaming data and index it in real-time in Solr.
Storage Requirements and Options for Running Spark on Kubernetes (DataWorks Summit)
In a world of serverless computing, users tend to be frugal about spending on compute, storage, and other resources; paying for these when they are not in use becomes a significant factor. Offering Spark as a service in the cloud presents unique challenges. Running Spark on Kubernetes raises many challenges, especially around storage and persistence: Spark workloads have particular requirements for intermediate data, long-term persistence, and shared file systems, and these requirements become even tighter when the same platform must be offered as a service to enterprises managing GDPR and other compliance regimes such as ISO 27001 and HIPAA certifications.
This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially scenarios related to persistence.
It will help people using Kubernetes or the Docker runtime in production understand the various storage options available, which are most suitable for running Spark workloads on Kubernetes, and what more can be done.
This document discusses storage requirements for running Spark workloads on Kubernetes. It recommends using a distributed file system like HDFS or DBFS for distributed storage and emptyDir or NFS for local temp scratch space. Logs can be stored in emptyDir or pushed to object storage. Features that would improve Spark on Kubernetes include image volumes, flexible PV to PVC mappings, encrypted volumes, and clean deletion for compliance. The document provides an overview of Spark, Kubernetes benefits, and typical Spark deployments.
Here are the slides for my talk "An intro to Azure Data Lake" at Techorama NL 2018. The session was held on Tuesday October 2nd from 15:00 - 16:00 in room 7.
Indexing with solr search server and hadoop framework (keval dalasaniya)
Hadoop and Solr are used together for indexing large datasets across distributed systems. Hadoop provides a distributed file system and Solr provides search capabilities. Solr indexes data from Hadoop and allows for fast, scalable search across large datasets even when data and computing resources are spread across multiple machines and locations. The combination of Hadoop and Solr provides a fault-tolerant solution for storing, processing, and searching very large datasets in a distributed environment.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspects and then correlating them to the skillsets of current Hadoop adopters.
Big Data Developers Moscow Meetup 1 - SQL on Hadoop (bddmoscow)
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
Apache Solr on Hadoop is enabling organizations to collect, process and search larger, more varied data. Apache Spark is making a large impact across the industry, changing the way we think about batch processing and replacing MapReduce in many cases. But how can production users easily migrate ingestion of HDFS data into Solr from MapReduce to Spark? How can they update and delete existing documents in Solr at scale? And how can they easily build flexible data ingestion pipelines? Cloudera Search Software Engineer Wolfgang Hoschek will present an architecture and solution to this problem. How were Apache Solr, Spark, Crunch, and Morphlines integrated to allow for scalable and flexible ingestion of HDFS data into Solr? What are the solved problems and what's still to come? Join us for an exciting discussion on this new technology.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Search in the Apache Hadoop Ecosystem: Thoughts from the FieldAlex Moundalexis
This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
Introduction to Hive and HCatalog presentation by Mark Grover at NYC HUG. A video of this presentation is available at https://www.youtube.com/watch?v=JGwhfr4qw5s
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr (Lucidworks, archived)
The document discusses integrating Hadoop and Solr to enable fast, ad-hoc search across structured and unstructured big data stored in Hadoop. It provides examples of how Hadoop can be used for large-scale storage and processing while Solr is used for real-time querying and search. Specifically, it describes how the Lucidworks HDFS connector can process documents from HDFS and index them into SolrCloud for search, and how log data can be ingested from Flume into HDFS for archiving and extracted fields can be indexed into Solr in real-time for search and analytics dashboards.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014 (cdmaxime)
Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, improved performance of 3-4x up to 90x faster than MapReduce, and flexibility to query existing Hadoop data without needing to migrate or duplicate it. The latest release of Impala 2.0 includes new features like window functions, subqueries, and spilling joins and aggregations to disk when memory is exhausted.
1. Adding Search to the Hadoop Ecosystem
Gregory Chanan (gchanan AT cloudera.com)
Frontier Meetup Dec 2013
2. Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component deep dive
• Security
• Conclusion
3. Why Search?
• Hadoop for everyone
• Typical case:
  • Ingest data to storage engine (HDFS, HBase, etc)
  • Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!
4. Why Search? An Integrated Part of the Hadoop System
• One pool of data
• One security framework
• One set of system resources
• One management interface
5. Benefits of Search
• Improved Big Data ROI
  • An interactive experience without technical knowledge
  • Single data set for multiple computing frameworks
• Faster time to insight
  • Exploratory analysis, esp. unstructured data
  • Broad range of indexing options to accommodate needs
• Cost efficiency
  • Single scalable platform; no incremental investment
  • No need for separate systems, storage
6. What is Cloudera Search?
• Full-text, interactive search with faceted navigation
• Apache Solr integrated with CDH
  • Established, mature search with vibrant community
  • In production environments for years
• Open Source
  • 100% Apache, 100% Solr
  • Standard Solr APIs
• Batch, near real-time, and on-demand indexing
• Generally Available; released 1.1 last month
8. Apache Hadoop
• Apache HDFS
  • Distributed file system
  • High reliability
  • High throughput
• Apache MapReduce
  • Parallel, distributed programming model
  • Allows processing of large datasets
  • Fault tolerant
9. Apache Lucene
• Full text search
  • Indexing
  • Query
• Traditional inverted index
• Batch and Incremental indexing
• We are using version 4.4 in current release
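To make the indexing and query sides concrete, here is a minimal, self-contained Lucene sketch (not from the slides); it uses a current Lucene API, whereas the 4.4 release referenced above would use RAMDirectory and version-parameterized constructors instead:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();      // in-memory index, just for the demo
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // Indexing: analyze a document and add it to the inverted index
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "Cloudera Search adds full-text search to Hadoop", Field.Store.YES));
      writer.addDocument(doc);
    }

    // Query: parse a user query and run it against the index
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      QueryParser parser = new QueryParser("body", analyzer);
      for (ScoreDoc hit : searcher.search(parser.parse("hadoop AND search"), 10).scoreDocs) {
        System.out.println(searcher.doc(hit.doc).get("body"));
      }
    }
  }
}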
10. Apache Solr
• Search service built using Lucene
  • Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
  • Indexing
  • Query
  • Administrative interface
• Also rich web admin GUI via HTTP
11. Apache SolrCloud
• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
  • partition index for size
  • replicate for query performance
• Uses ZooKeeper for coordination
  • No split-brain issues
  • Simplifies operations
12. SolrCloud Architecture
• Updates automatically sent to the correct shard
• Replicas handle queries, forward updates to the leader
• Leader indexes the document for the shard, and forwards the index notation to itself and any replicas
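A minimal SolrJ sketch of this update/query flow against a SolrCloud collection; the ZooKeeper address and the collection name ("logs") are placeholders, and it uses the newer CloudSolrClient builder API rather than the CloudSolrServer class of the Solr 4.4 era:

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrCloudSketch {
  public static void main(String[] args) throws Exception {
    // Connect via ZooKeeper so the client learns the cluster state (shards, leaders, replicas)
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
      client.setDefaultCollection("logs");   // hypothetical collection name

      // Update: the client routes the document to the leader of the correct shard
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "event-1");
      doc.addField("message", "listening on 0.0.0.0 port 22");
      client.add(doc);
      client.commit();

      // Query: replicas of each shard serve their part of the distributed query
      QueryResponse rsp = client.query(new SolrQuery("message:port"));
      System.out.println("hits = " + rsp.getResults().getNumFound());
    }
  }
}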
17. Near Real Time Indexing with Flume
[Diagram: log files and other sources feed Flume Agents, whose indexers send documents to Solr while the raw data lands in HDFS]
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
18. Apache Flume - MorphlineSolrSink
• A Flume Source…
  • Receives/gathers events
• A Flume Channel…
  • Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
  • Sends the events on to the next location
• Flume MorphlineSolrSink
  • Integrates Cloudera Morphlines library
    • ETL, more on that in a bit
  • Does batching
  • Results sent to Solr for indexing
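A minimal Flume agent configuration sketch for this wiring; the agent name, spool directory, and morphline file path are placeholders, and the sink class is the MorphlineSolrSink that ships with the Cloudera Search Flume integration:

# flume.conf: one agent, spooling-directory source -> memory channel -> MorphlineSolrSink
agent.sources = spool
agent.channels = mem
agent.sinks = solr

# Source: pick up log files dropped into a directory (hypothetical path)
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/flume/incoming
agent.sources.spool.channels = mem

# Channel: MemoryChannel here; use a FileChannel for reliable delivery
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

# Sink: runs the morphline (ETL) and sends the resulting documents to Solr in batches
agent.sinks.solr.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solr.channel = mem
agent.sinks.solr.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent.sinks.solr.morphlineId = morphline1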
20. Near Real Time Indexing of Apache HBase
[Diagram: HBase + Search. HBase provides planet-sized tabular data with immediate access & updates and interactive load; Search provides fast & flexible information discovery. HBase replication feeds the HBase Indexer(s), which update a bank of Solr servers; the underlying data is managed in HDFS.]
21. Lily HBase Indexer
• Collaboration between NGData & Cloudera
  • NGData are creators of the Lily data management platform
• Lily HBase Indexer
  • Service which acts as an HBase replication listener
    • HBase replication features, such as filtering, supported
  • Replication updates trigger indexing of updates (rows)
  • Integrates Cloudera Morphlines library for ETL of rows
  • AL2 licensed on github https://github.com/ngdata
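A sketch of what an indexer morphline for this path might look like, assuming the extractHBaseCells command provided by the hbase-indexer integration and hypothetical column names; the indexer itself delivers the resulting documents to Solr, so no loadSolr command is needed here:

# morphline.conf: map HBase cells from a replicated row into Solr document fields
morphlines : [
  {
    id : hbaseMorphline
    importCommands : ["com.ngdata.**", "org.kitesdk.**"]
    commands : [
      {
        extractHBaseCells {
          mappings : [
            { inputColumn : "info:title", outputField : "title", type : string, source : value }
            { inputColumn : "info:body",  outputField : "body",  type : string, source : value }
          ]
        }
      }
      # Optional: log each record while debugging the mapping
      { logDebug { format : "output record: {}", args : ["@{}"] } }
    ]
  }
]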
23. Scalable Batch Indexing
[Diagram: files in HDFS are read by MapReduce indexers, which build index shards that are served by Solr servers]
Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
24. MapReduce Indexer
MapReduce Job with two parts:
1) Scan HDFS for files to be indexed
  • Much like Unix “find” – see HADOOP-8989
  • Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
  • Mapper extracts content via Cloudera Morphlines
  • Reducer indexes documents via embedded Solr server
  • Originally based on SOLR-1301
  • Many modifications to enable linear scalability
25. MapReduce Indexer “golive”
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
• Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
  • No downtime for users
  • No NRT expense
  • Linear scale out to the size of your MR cluster
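A hedged command-line sketch of a batch indexing run with go-live merging; the job jar location, HDFS paths, ZooKeeper address, and collection name are placeholders that vary by installation and release:

# Index files under /user/demo/input with MapReduce, then merge the resulting
# shards into the live SolrCloud collection "logs" (--go-live)
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file morphline.conf \
  --output-dir hdfs://namenode:8020/user/demo/outdir \
  --zk-host zk01:2181/solr \
  --collection logs \
  --go-live \
  hdfs://namenode:8020/user/demo/input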
26. HBase + MapReduce
• New in Search 1.1: run MapReduce job over HBase tables
  • Same architecture as running over HDFS
  • Similar to HBase’s CopyTable
27. Cloudera Morphlines
• Open Source framework for simple ETL
• Simplify ETL
  • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
  • Configuration over coding
• Standardize ETL
• Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)
  • It’s a Java library
  • AL2 licensed on github https://github.com/kite-sdk
28. Cloudera Morphlines Architecture
Morphlines can be embedded in any application…
[Diagram: logs, tweets, social media, html, images, pdf, text… (anything you want to index) flows through the Morphline Library embedded in Flume, the MR Indexer, the HBase Indexer, etc. (or your application!) and into SolrCloud / Solr servers]
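Because Morphlines is a plain Java library, it can be embedded directly in your own application. A minimal sketch using the Kite Morphlines API, assuming a local morphline.conf like the syslog example on the following slides, a morphline id of "morphline1", and that passing null as the downstream command is acceptable because a loadSolr command inside the morphline handles delivery:

import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Notifications;

public class EmbeddedMorphlineSketch {
  public static void main(String[] args) throws Exception {
    MorphlineContext context = new MorphlineContext.Builder().build();

    // Compile the morphline config into an executable chain of commands
    Command morphline = new Compiler().compile(
        new File("morphline.conf"), "morphline1", context, null /* no extra downstream command */);

    // Feed one raw event through the pipeline (readLine -> grok -> loadSolr in the syslog example)
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, new ByteArrayInputStream(
        "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
            .getBytes(StandardCharsets.UTF_8)));

    Notifications.notifyStartSession(morphline);
    boolean success = morphline.process(record);
    System.out.println("processed: " + success);
  }
}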
29. Extraction and Mapping
[Diagram: a syslog event flows from a Flume Agent into the Solr sink’s Morphline Library, through the commands readLine, grok, and loadSolr, producing a document that is sent to Solr]
• Modeled after Unix pipelines
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and reuse across platform workloads
30. Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input:
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record:
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
31. Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
32. Current Command Library (cont)
• Scripting support for dynamic java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
34. Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
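Underneath a UI like Hue, faceted drill-down maps onto standard Solr facet queries. A hedged SolrJ sketch, assuming a hypothetical "logs" collection containing the syslog fields produced by the morphline example above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetedQuerySketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr URL and collection; field names follow the syslog morphline example
    try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {
      SolrQuery q = new SolrQuery("syslog_message:listening");
      q.setFacet(true);
      q.addFacetField("syslog_program", "syslog_hostname");  // fields to drill down on
      q.setRows(10);

      QueryResponse rsp = solr.query(q);
      System.out.println("hits = " + rsp.getResults().getNumFound());
      for (FacetField ff : rsp.getFacetFields()) {
        System.out.println("facet: " + ff.getName());
        ff.getValues().forEach(c -> System.out.println("  " + c.getName() + " (" + c.getCount() + ")"));
      }
    }
  }
}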
35. Security
• Upstream Solr doesn’t deal with security
• Search 1.0 supports Kerberos authentication
  • Similar to Oozie / WebHDFS
• Search 1.1 supports index-level authorization via Apache Sentry (incubating)
36. Index-Level Authorization
• Sentry works via “policy files” stored in HDFS
• Can grant roles administrative-only, query-only, update-only access
• Example:
  [groups]
  # Assigns each Hadoop group to its set of roles
  dev_ops = engineer_role, ops_role
  [roles]
  engineer_role = collection=source_code->action=*
  ops_role = collection=hbase_logs->action=Query
37. Index-Level Authorization 2
• Works by hooking into Solr RequestHandlers:
  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">updateIndexAuthorization</str>
    </lst>
  </requestHandler>
• Also includes secure impersonation support
• Unauthorized attempts get a 401 response and are written to the Solr log
• Future work: more fine-grained authorization
38. Conclusion
• Cloudera Search now Generally Available (1.1)
  • Free Download
  • Extensive documentation
  • Send your questions and feedback to searchuser@cloudera.org
  • Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
  • Simple management of Search
  • Free Download
• QuickStart VM also available!