Adding Search to the Hadoop Ecosystem
Gregory Chanan (gchanan AT cloudera.com)
SF HUG August 2013
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component deep dive
• Security
• Conclusion
Why Search?
• Hadoop for everyone
• Typical case:
• Ingest data to storage engine (HDFS, HBase, etc.)
• Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!
Why Search?
An Integrated Part of the Hadoop System
One pool of data
One security framework
One set of system resources
One management interface

Benefits of Search
• Improved Big Data ROI
• An interactive experience without technical knowledge
• Single data set for multiple computing frameworks
• Faster time to insight
• Exploratory analysis, esp. unstructured data
• Broad range of indexing options to accommodate needs
• Cost efficiency
• Single scalable platform; no incremental investment
• No need for separate systems, storage
What is Cloudera Search?
• Full-text, interactive search with faceted navigation
• Batch, near real-time, and on-demand indexing
• Apache Solr integrated with CDH
• Established, mature search with vibrant community
• In production environments for years
• Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
• In public beta (version 0.9.3)
Cloudera Search Components
• HDFS/MR/Lucene/Solr/SolrCloud
• Indexing
• Near Real Time (NRT) indexing
• Batch
• ETL – Cloudera Morphlines
• Querying
Apache Hadoop
• Apache HDFS
• Distributed file system
• High reliability
• High throughput
• Apache MapReduce
• Parallel, distributed programming model
• Allows processing of large datasets
• Fault tolerant

Apache Lucene
• Full text search
• Indexing
• Query
• Traditional inverted index
• Batch and Incremental indexing
• Cloudera Search uses Lucene 4.3 in the current release
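As a rough illustration of the "traditional inverted index" named above, the following minimal sketch maps each term to the documents that contain it and answers AND queries by intersecting those lists. It is a toy in plain Python, not Lucene's implementation; the function names and sample documents are hypothetical.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, used only for illustration.
docs = {
    1: "Hadoop stores data in HDFS",
    2: "Solr builds a search index",
    3: "Search data stored in HDFS",
}
index = build_inverted_index(docs)
print(search(index, "hdfs data"))  # prints [1, 3]
```

Incremental indexing, in this picture, amounts to adding new document ids to existing posting lists rather than rebuilding the whole map.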
Apache Solr
• Search service built using Lucene
• Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
• Indexing
• Query
• Administrative interface
• Also rich web admin GUI via HTTP
Apache SolrCloud
• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
• partition index for size
• replicate for query performance
• Uses ZooKeeper for coordination
• No split-brain issues
• Simplifies operations
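The two roles of shards and replicas listed above can be sketched as follows: documents are partitioned across shards to bound index size, and queries are spread across the replicas of a shard to raise query throughput. This is only an illustration under simple assumptions (MD5 hash modulo shard count, round-robin replica choice); SolrCloud's real document router and load balancing are more involved, and all names here are hypothetical.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document id (illustrative scheme only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def partition(doc_ids, num_shards):
    """Group document ids by the shard they hash to."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


def pick_replica(replicas, query_number):
    """Round-robin successive queries across the replicas of one shard."""
    return replicas[query_number % len(replicas)]


# Hypothetical document ids and replica names.
doc_ids = [f"doc-{n}" for n in range(10)]
shards = partition(doc_ids, num_shards=3)
replica = pick_replica(["replica-a", "replica-b"], query_number=3)
```

Because the shard is derived from the document id alone, any node can work out where a document belongs without consulting the others.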
Distributed Search on Hadoop
[Diagram. Labels: Flume; Hue UI; Custom UI; Custom App; SolrCloud (three Solr servers); query; index; Hadoop Cluster (MR, HDFS, HBase); ZK]

Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch (MR)
Near Real Time Indexing with Flume
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
[Diagram. Labels: Log File; Other; Flume Agent; Indexer; HDFS]
Apache Flume - MorphlineSolrSink
• A Flume Source…
• Receives/gathers events
• A Flume Channel…
• Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
• Sends the events on to the next location
• Flume MorphlineSolrSink
• Integrates Cloudera Morphlines library
• ETL, more on that in a bit
• Does batching
• Results sent to Solr for indexing
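The source, channel, and sink roles above can be sketched as a toy pipeline: events are put on a channel, and a sink drains them in batches, transforms each event, and hands the batch to an indexer. This is not Flume's API; `ToyChannel`, `ToySink`, the `transform` callback (standing in for the morphline), and `index_batch` (standing in for the call to Solr) are all hypothetical names.

```python
from collections import deque


class ToyChannel:
    """In-memory queue standing in for a Flume channel."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        """Called by the source side to enqueue one event."""
        self._events.append(event)

    def take_batch(self, max_size):
        """Remove and return up to max_size events, oldest first."""
        batch = []
        while self._events and len(batch) < max_size:
            batch.append(self._events.popleft())
        return batch


class ToySink:
    """Drains the channel in batches and passes transformed events on."""

    def __init__(self, channel, transform, index_batch, batch_size=2):
        self.channel = channel
        self.transform = transform      # stand-in for the morphline ETL step
        self.index_batch = index_batch  # stand-in for sending documents to Solr
        self.batch_size = batch_size

    def process(self):
        """Handle one batch; return how many events were indexed."""
        batch = self.channel.take_batch(self.batch_size)
        if not batch:
            return 0
        self.index_batch([self.transform(event) for event in batch])
        return len(batch)


indexed_batches = []
channel = ToyChannel()
for line in ["a", "b", "c"]:
    channel.put(line)
sink = ToySink(channel, lambda event: {"text": event}, indexed_batches.append)
while sink.process():
    pass
print(indexed_batches)  # prints [[{'text': 'a'}, {'text': 'b'}], [{'text': 'c'}]]
```

Batching trades a little latency for fewer round trips to the indexer, which is the usual reason a sink does it.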

Near Real Time Indexing of Apache HBase
[Diagram. Labels: HDFS; HBase; interactive load; HBase Indexer(s); Trigger; Solr servers; Search; Big Data; Data Management]
• planet-sized tabular data
• immediate access & updates
• fast & flexible information discovery
Lily HBase Indexer
• Collaboration between NGData & Cloudera
• NGData are creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as an HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates Cloudera Morphlines library for ETL of rows
• AL2 licensed on GitHub https://github.com/ngdata

Scalable Batch Indexing
Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
[Diagram. Labels: Files; Indexer; Index shard; Solr server; HDFS]
MapReduce Indexer
MapReduce Job with two parts
1) Scan HDFS for files to be indexed
• Much like Unix “find” – see HADOOP-8989
• Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
• Mapper extracts content via Cloudera Morphlines
• Reducer indexes documents via embedded Solr server
• Originally based on SOLR-1301
• Many modifications to enable linear scalability
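The two parts of the job can be sketched on a local directory tree: a first phase lists the files to index (like Unix `find`), and a second phase maps each file to a document and then groups documents per index shard. This is a single-process illustration, not the MapReduce indexer itself; the partitioner, the document shape, and all function names are assumptions.

```python
import os
import tempfile
from collections import defaultdict


def list_files(root):
    """Phase 1: walk the tree, much like Unix find, and return sorted paths."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def map_extract(path, num_shards):
    """Phase 2 mapper: read one file and emit a (shard, document) pair."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    shard = sum(path.encode("utf-8")) % num_shards  # hypothetical partitioner
    return shard, {"id": path, "text": text}


def reduce_index(pairs):
    """Phase 2 reducer: collect documents per shard (stand-in for indexing)."""
    shards = defaultdict(list)
    for shard, doc in pairs:
        shards[shard].append(doc)
    return dict(shards)


def run_job(root, num_shards):
    """Run both phases in sequence on a local directory."""
    return reduce_index(map_extract(path, num_shards) for path in list_files(root))


# Demo on a temporary directory with two small files.
with tempfile.TemporaryDirectory() as root:
    for name, body in [("a.log", "first file"), ("b.log", "second file")]:
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(body)
    shards = run_job(root, num_shards=2)
    total = sum(len(docs) for docs in shards.values())
    print(total)  # prints 2
```

Since each reducer owns one shard in this sketch, adding reducers adds shards that are built independently of one another.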
MapReduce Indexer “golive”
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
• Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
• No downtime for users
• No NRT expense
• Linear scale out to the size of your MR cluster
Cloudera Morphlines
• Open Source framework for simple ETL
• Ships as part of the Cloudera Developer Kit (CDK)
• It’s a Java library
• AL2 licensed on GitHub https://github.com/cloudera/cdk
• Simplify ETL
• Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
• Configuration over coding
• Standardize ETL

Cloudera Morphlines Architecture
[Diagram. Labels: Logs, tweets, social media, html, images, pdf, text… (anything you want to index); Flume, MR Indexer, HBase indexer, etc., or your application; Morphline Library; SolrCloud (three Solr servers)]
Morphlines can be embedded in any application…
Extraction and Mapping
• Modeled after Unix pipelines
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads
[Diagram. Labels: syslog; Flume Agent; Solr sink; Morphline Library with commands readLine, grok, loadSolr; Event; Record; Document; Solr]
Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      # Read the input one line at a time into the message field
      { readLine {} }
      {
        # Extract syslog fields from the message field with grok patterns
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      # Load the record into Solr (connection settings omitted on the slide)
      { loadSolr {} }
    ]
  }
]
Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
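To make the mapping from input line to output record concrete, here is a minimal Python sketch that reproduces the same extraction with a plain regular expression and named groups. The sub-patterns are simplified approximations of the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), not their exact definitions, and `parse_syslog` is a hypothetical helper.

```python
import re

# Simplified stand-ins for the grok patterns used in the morphline above.
SYSLOG_PATTERN = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>[\w.-]+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse_syslog(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    match = SYSLOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    # Drop optional groups that did not participate (e.g. a missing pid).
    return {name: value for name, value in match.groupdict().items() if value is not None}


record = parse_syslog("<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22")
for name, value in record.items():
    print(f"{name}:{value}")
```

Run on the example input, this prints the same six fields as the output record above. The escaped brackets around the pid group matter: unescaped, they would be read as a character class rather than literal `[` and `]`.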
Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika

Current Command Library (cont)
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
Querying
• Built-in Solr web UI
• Write your own
• Hue
Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
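Because the standard Solr API is plain HTTP, a custom search UI can be as simple as building a request URL. The sketch below assembles a faceted query for Solr's `/select` handler using the standard `q`, `rows`, `wt`, `facet`, and `facet.field` parameters. The host, port, collection name, and field names are assumptions chosen for illustration, and `build_select_url` is a hypothetical helper; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, collection, text, facet_fields=(), rows=10):
    """Build a Solr /select URL with optional field faceting."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr",  # assumed host and port
    "collection1",                 # assumed collection name
    "syslog_message:listening",    # field name taken from the syslog example
    facet_fields=["syslog_hostname", "syslog_program"],
)
print(url)
```

Each `facet.field` asks Solr to return value counts for that field alongside the hits, which is what a faceted drill-down UI renders as clickable filters.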
Security
• Upstream Solr doesn’t really deal with security
• Goal: use Kerberos, like other CDH components
• Current release: Support for Kerberos authentication
• Actively working on index-level authorization
• Future: more granular authorization

Conclusion
• Cloudera Search now in public beta
• Free Download
• Extensive documentation
• Send your questions and feedback to search-user@cloudera.org
• Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
• Simple management of Search
• Free Download
• QuickStart VM also available!

More Related Content

What's hot

Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
Cloudera, Inc.
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
Lucidworks
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Databricks
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Search onhadoopsfhug081413

  • 1. Adding Search to the Hadoop Ecosystem Gregory Chanan (gchanan AT cloudera.com) SF HUG August 2013
  • 2. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component deep dive • Security • Conclusion
  • 3. Why Search? • Hadoop for everyone • Typical case: • Ingest data to storage engine (HDFS, HBase, etc) • Process data (MapReduce, Hive, Impala) • Experts know MapReduce • Savvy people know SQL • Everyone knows Search!
  • 4. Why Search? An Integrated Part of the Hadoop System One pool of data One security framework One set of system resources One management interface
  • 5. Benefits of Search • Improved Big Data ROI • An interactive experience without technical knowledge • Single data set for multiple computing frameworks • Faster time to insight • Exploratory analysis, esp. unstructured data • Broad range of indexing options to accommodate needs • Cost efficiency • Single scalable platform; no incremental investment • No need for separate systems, storage
  • 6. What is Cloudera Search? • Full-text, interactive search with faceted navigation • Batch, near real-time, and on-demand indexing • Apache Solr integrated with CDH • Established, mature search with vibrant community • In production environments for years • Open Source • 100% Apache, 100% Solr • Standard Solr APIs • In public beta (version 0.9.3)
  • 7. Cloudera Search Components • HDFS/MR/Lucene/Solr/SolrCloud • Indexing • Near Real Time (NRT) indexing • Batch • ETL – Cloudera Morphlines • Querying
  • 8. Apache Hadoop • Apache HDFS • Distributed file system • High reliability • High throughput • Apache MapReduce • Parallel, distributed programming model • Allows processing of large datasets • Fault tolerant
  • 9. Apache Lucene • Full text search • Indexing • Query • Traditional inverted index • Batch and Incremental indexing • We are using version 4.3 in current release
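The inverted index mentioned on the Lucene slide can be illustrated with a minimal sketch. This is not Lucene's implementation or API: the `InvertedIndex` class, its method names, and the whitespace tokenizer are assumptions made for illustration. It only shows the idea of mapping each term to the documents that contain it, with both batch and incremental additions.

```python
from collections import defaultdict


def tokenize(text):
    # Naive analyzer (assumption): lowercase and split on whitespace.
    # Real analyzers also handle punctuation, stemming, stop words, etc.
    return text.lower().split()


class InvertedIndex:
    """Hypothetical toy index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Incremental indexing: add one document at a time.
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def add_batch(self, docs):
        # Batch indexing: add many documents in one call.
        for doc_id, text in docs.items():
            self.add(doc_id, text)

    def search(self, query):
        # AND semantics: return ids of documents containing every query term.
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


idx = InvertedIndex()
idx.add_batch({1: "hadoop search", 2: "hadoop storage"})
idx.add(3, "Search everywhere")
```

A query is answered by intersecting posting sets rather than scanning documents, which is why lookups stay fast as the collection grows.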
  • 10. Apache Solr • Search service built using Lucene • Ships with Lucene (same TLP at Apache) • Provides XML/HTTP/JSON/Python/Ruby/… APIs • Indexing • Query • Administrative interface • Also rich web admin GUI via HTTP
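As a rough illustration of the HTTP/JSON query API, the sketch below builds a Solr `select` URL with a facet on one field. The host, the collection name `logs`, and the field names `message` and `host` are hypothetical; only the standard query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`) are taken from Solr itself. The helper `solr_select_url` is an illustrative name, not part of any Solr client library.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, collection, query, rows=10, facet_field=None):
    # Build the query string for Solr's /select request handler.
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to return value counts for one field (faceted navigation).
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and fields, for illustration only.
url = solr_select_url(
    "http://localhost:8983/solr", "logs", "message:error",
    rows=5, facet_field="host",
)
```

The resulting URL could be fetched with any HTTP client; the response would be a JSON document containing the matching documents and the facet counts.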
  • 11. Apache SolrCloud • Provides distributed Search capability • Part of Solr (not a separate library/codebase) • Shards – provide scalability • partition index for size • replicate for query performance • Uses ZooKeeper for coordination • No split-brain issues • Simplifies operations
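The two roles of sharding named on the SolrCloud slide (partitioning the index for size, replicating for query performance) can be sketched as follows. This is a simplified stand-in, not SolrCloud's actual document router or load balancer: the MD5-based `shard_for` function, the round-robin `pick_replica`, and the node names are all assumptions for illustration.

```python
import hashlib


def shard_for(doc_id, num_shards):
    # Partitioning: a stable hash of the document id picks the shard,
    # so every node agrees on where a document lives.
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def partition(doc_ids, num_shards):
    # Assign each document to exactly one shard.
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


def pick_replica(replicas, request_no):
    # Replication: spread queries for one shard across its copies (round robin).
    return replicas[request_no % len(replicas)]


doc_ids = [f"doc-{n}" for n in range(100)]
shards = partition(doc_ids, 4)
replicas = ["node-a", "node-b"]  # hypothetical replica names
```

Adding shards lets the index grow beyond one machine, while adding replicas per shard lets more queries be served in parallel.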
  • 12. Distributed Search on Hadoop [Architecture diagram. Labels: Flume, Hue UI, Custom UI, Custom App; Solr servers (SolrCloud); "query" and "index" arrows; Hadoop Cluster: MR, HDFS, HBase, ZK]
  • 13. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 14. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 15. Near Real Time Indexing with Flume • Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest • [Diagram labels: Log File, Flume Agent, Indexer, HDFS, Other]
  • 16. Apache Flume - MorphlineSolrSink • A Flume Source… • Receives/gathers events • A Flume Channel… • Carries the event – MemoryChannel or reliable FileChannel • A Flume Sink… • Sends the events on to the next location • Flume MorphlineSolrSink • Integrates Cloudera Morphlines library • ETL, more on that in a bit • Does batching • Results sent to Solr for indexing
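The source, channel, and sink roles, plus the sink-side batching, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Flume's API: the class and function names are hypothetical, and `index_batch` stands in for the step that sends a batch to Solr.

```python
from collections import deque

class MemoryChannel:
    """In-memory queue that carries events from source to sink."""
    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self, max_events):
        """Remove and return up to max_events events, oldest first."""
        batch = []
        while self._queue and len(batch) < max_events:
            batch.append(self._queue.popleft())
        return batch

def source(lines, channel):
    """Source: wrap each incoming line in an event and hand it to the channel."""
    for line in lines:
        channel.put({"body": line})

def sink(channel, batch_size, index_batch):
    """Sink: drain the channel in batches and pass each batch to index_batch."""
    while True:
        batch = channel.take(batch_size)
        if not batch:
            break
        index_batch(batch)

# Hypothetical run: five events, indexed in batches of two.
indexed_batches = []
channel = MemoryChannel()
source(["a", "b", "c", "d", "e"], channel)
sink(channel, 2, indexed_batches.append)
```

Batching trades a little latency for fewer, larger index requests.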
  • 17. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 18. Near Real Time Indexing of Apache HBase • Planet-sized tabular data • Immediate access & updates • Fast & flexible information discovery • [Diagram labels: HDFS, HBase, interactive load, HBase Indexer(s), Trigger, Solr servers, Search; banner: BIG DATA, DATA MANAGEMENT]
  • 19. Lily HBase Indexer • Collaboration between NGData & Cloudera • NGData are the creators of the Lily data management platform • Lily HBase Indexer • Service which acts as an HBase replication listener • HBase replication features, such as filtering, are supported • Replication updates trigger indexing of the updated rows • Integrates the Cloudera Morphlines library for ETL of rows • Apache License 2.0 (AL2), on GitHub: https://github.com/ngdata
  • 20. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 21. Scalable Batch Indexing • Solr and MapReduce • Flexible, scalable batch indexing • Start serving new indices with no downtime • On-demand indexing, cost-efficient re-indexing • [Diagram labels: Files, Indexer, Index shard, Solr server, HDFS]
  • 22. MapReduce Indexer • MapReduce job with two parts • 1) Scan HDFS for files to be indexed • Much like Unix “find” – see HADOOP-8989 • Output is an NLineInputFormat’ed file • 2) Mapper/Reducer indexing step • Mapper extracts content via Cloudera Morphlines • Reducer indexes documents via embedded Solr server • Originally based on SOLR-1301 • Many modifications to enable linear scalability
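The two-part job can be sketched locally in Python. This is an analogy under stated assumptions, not the MapReduce Indexer itself: it walks a local directory instead of HDFS, the "map" step reads a file into a document, and the "reduce" step merely groups document ids per shard where the real reducer builds a Solr index. All names and the byte-sum shard choice are hypothetical.

```python
import os
import tempfile
from collections import defaultdict

def scan_for_files(root, suffix):
    """Part 1: walk a directory tree and list matching files, like Unix find."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)

def map_extract(path, num_shards):
    """Part 2a: turn one file into a (shard, document) pair."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    doc = {"id": os.path.basename(path), "text": text}
    # Assumption: a toy shard choice based on the bytes of the document id.
    shard = sum(doc["id"].encode("utf-8")) % num_shards
    return shard, doc

def reduce_index(pairs):
    """Part 2b: group document ids by shard, standing in for index building."""
    shards = defaultdict(list)
    for shard, doc in pairs:
        shards[shard].append(doc["id"])
    return dict(shards)

def run_job(root, num_shards=2):
    files = scan_for_files(root, ".log")
    return reduce_index(map_extract(path, num_shards) for path in files)

# Hypothetical input: two log files and one file the scan should skip.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.log", "b.log", "notes.txt"):
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write("sample content")
    shard_contents = run_job(root)
```

Because each shard is built independently, adding reducers (and shards) is what gives the batch path its scale-out behaviour.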
  • 23. MapReduce Indexer “golive” • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster • No downtime for users • No NRT expense • Linear scale out to the size of your MR cluster
  • 24. Cloudera Morphlines • Open Source framework for simple ETL • Ships as part of the Cloudera Developer Kit (CDK) • It’s a Java library • Apache License 2.0 (AL2), on GitHub: https://github.com/cloudera/cdk • Simplify ETL • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages) • Configuration over coding • Standardize ETL
  • 25. Cloudera Morphlines Architecture • Input: logs, tweets, social media, HTML, images, PDF, text… anything you want to index • Hosts: Flume, MR Indexer, HBase Indexer, etc. … or your application! • Morphline library: morphlines can be embedded in any application • Output: Solr servers (SolrCloud)
  • 26. Extraction and Mapping • Modeled after Unix pipelines • Simple and flexible data transformation • Reusable across multiple index workloads • Over time, extend and re-use across platform workloads • [Diagram: syslog → Flume Agent (Solr sink) → Morphline library commands readLine → grok → loadSolr → Solr; an Event enters, Records pass between commands, and a Document reaches Solr]
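The "modeled after Unix pipelines" idea can be sketched as a chain of small commands, each taking one record and emitting zero or more records for the next command. This is an illustrative Python analogy, not the Morphlines Java API; the command names and the key=value input format are assumptions.

```python
def read_line(record):
    """Split the raw body into one record per non-empty line."""
    return [{"message": line} for line in record["body"].splitlines() if line]

def split_key_value(record):
    """Extract key and value fields from a 'key=value' message."""
    key, _, value = record["message"].partition("=")
    return [{**record, "key": key.strip(), "value": value.strip()}]

def make_loader(target):
    """Build a terminal command that appends records to target (stand-in for Solr)."""
    def load(record):
        target.append(record)
        return [record]
    return load

def run_pipeline(commands, record):
    """Feed the output records of each command into the next one."""
    records = [record]
    for command in commands:
        next_records = []
        for rec in records:
            next_records.extend(command(rec))
        records = next_records
    return records

# Hypothetical event with two lines of key=value text.
loaded = []
pipeline = [read_line, split_key_value, make_loader(loaded)]
run_pipeline(pipeline, {"body": "host=web01\nlevel=warn\n"})
```

Because every command shares the same record-in, records-out shape, the same chain can be reused from Flume, the MR indexer, or any other host application.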
  • 27. Morphline Example – syslog with grok
      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]
    Example Input
      <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
    Output Record
      syslog_pri:164
      syslog_timestamp:Feb 4 10:46:14
      syslog_hostname:syslog
      syslog_program:sshd
      syslog_pid:607
      syslog_message:listening on 0.0.0.0 port 22
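For readers without a grok dictionary at hand, the expression on this slide can be approximated by a plain regular expression with named groups. The Python sketch below uses simplified stand-ins for the grok patterns (for example `\S+` for SYSLOGHOST), so it is an approximation rather than the dictionary's exact definitions. The literal square brackets around the process id must be escaped, otherwise they would be read as a character class.

```python
import re

# Simplified stand-ins for POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA.
SYSLOG_PATTERN = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)

def parse_syslog(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = SYSLOG_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_syslog(
    "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
)
```

Applied to the example input from the slide, `record` holds the same six fields as the output record; a line without a bracketed pid still matches, with `syslog_pid` set to `None`.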
  • 28. Current Command Library • Integrate with and load into Apache Solr • Flexible log file analysis • Single-line record, multi-line records, CSV files • Regex based pattern matching and extraction • Integration with Avro • Integration with Apache Hadoop Sequence Files • Integration with SolrCell and all Apache Tika parsers • Auto-detection of MIME types from binary data using Apache Tika
  • 29. Current Command Library (cont) • Scripting support for dynamic java code • Operations on fields for assignment and comparison • Operations on fields with list and set semantics • if-then-else conditionals • A small rules engine (tryRules) • String and timestamp conversions • slf4j logging • Yammer metrics and counters • Decompression and unpacking of arbitrarily nested container file formats • Etc…
  • 30. Querying • Built-in Solr web UI • Write your own • Hue
  • 31. Simple, Customizable Search Interface • Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
  • 32. Security • Upstream Solr doesn’t really deal with security • Goal: use Kerberos, like other CDH components • Current release: support for Kerberos authentication • Actively working on index-level authorization • Future: more granular authorization
  • 33. Conclusion • Cloudera Search now in public beta • Free Download • Extensive documentation • Send your questions and feedback to search-user@cloudera.org • Take the Search online training • Cloudera Manager Standard (i.e. the free version) • Simple management of Search • Free Download • QuickStart VM also available!