This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
2. Brief background on me
Phil has over 16 years' experience in data-centric system development. His work has flowed from simulation and video-game-like systems, to high-performance computing (HPC), to traditional database (Oracle, SQL Server, Postgres, MySQL) and CRM (warehouse/analytical) systems, and most recently to the Hadoop stack. Recently, as an employee at TripAdvisor, he led the research into Hadoop/Hive which resulted in the successful migration from the traditional RDBMS platform to a system based on Hadoop/Hive and integrated with MS SQL Server/SSAS. Currently, he's focused on the Hadoop stack and is creating a solution that involves integrating Hadoop into a more traditional enterprise environment.
3. Agenda
To make you as excited about Hadoop as I am
What is Hadoop (high-level)?
What have we actually done with it?
How does “it” (HDFS, M/R, Hive, and HBase) work?
Future of Hadoop
5. Q: What is Hadoop?
A#1 - The thing that empowers Yahoo, FB, and others
Yahoo has >25k Hadoop nodes…wow…
6. Q: What is Hadoop?
A#2 - Last year’s revolution (sort of)
The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on
7. Q: What is Hadoop?
A#3 – the revolution of 5+ years ago
8. “Success has many fathers”
And you can look them up, because it’s FOSS!
People are fighting to contribute, and to get credit… be a contributor…
(http://hortonworks.com/reality-check-contributions-to-apache-hadoop/)
9. Q: What is Hadoop?
A#4 – the wave everyone is riding
Nearly all the big players (and many smaller ones) are on board…
10. In fact, beware of this
http://nosql.mypopescu.com/post/2955078419/origin-of-nosql
12. Hadoop projects performed by BlueMetal Architects
Hadoop at a Web 2.0 company (prior to BMA)
– Ported a traditional 30TB warehouse to Hive
– Big transform jobs in Hive, e.g. joins of 50M rows to 12B rows
– Big Data jobs, e.g. Social Graph processing with many “Cartesians” to empower emails
Hadoop in HealthCare (at BMA)
– Applied HBase as part of a new system
– Feeds data (via WS) to: E.D., Patient Web Portal, and other HealthCare affiliates
Note: Both projects include Hadoop as part of larger systems.
13. Warehouse Goals
Use the right tool for the right job
– Hadoop (M/R, Hive) is a batch system
  • Inherently high-latency
– RDBMS (& other tools) are still needed
Empower users
– Minimize complexity
  • Eliminate joins (almost)
  • Eliminate “dimensions” (maybe)
– Expose *all* data
– Provide low-latency options
– Provide self-service options
14. A strategy for MASSIVE processing:
Best tool for the job
This is what we implemented and, it turns out, is also what Yahoo has done.
Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day)
18. Map-Reduce (M/R) example
Note: this job is not optimized
Take-home message: “Simple API - Mappers read the input and emit K/V pairs. The framework sends Reducers K/V pairs partitioned and ordered* by Key.”
(From: http://www.infosun.fim.uni-passau.de/cl/MapReduceFoundation/)
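To make that “simple API” concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API. This is an illustrative example, not the exact job pictured on the slide: the mapper emits (word, 1) pairs, and the framework hands each reducer its keys partitioned and sorted.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: reads one line at a time and emits a (word, 1) K/V pair per token.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reducer: receives each word with all of its 1s (partitioned and sorted
  // by key by the framework) and emits the total.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with hadoop jar wordcount.jar WordCount <input dir> <output dir>; the same jar runs unchanged on a laptop or a thousand-node cluster.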
19. Hadoop M/R with some details:
Note: Partition, Combine and Shuffle
(From: http://www.lecturemaker.com/2011/02/rhipe/)
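The partition and combine steps in that diagram are pluggable. As a hedged illustration (the class name and bucketing scheme are invented here), a custom Partitioner controls which reducer receives each key; the default is a hash of the key.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words by first letter so each reducer gets a contiguous
// alphabetical slice of the key space.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Must always return a value in [0, numPartitions).
    String s = key.toString();
    if (s.isEmpty()) return 0;
    char c = Character.toLowerCase(s.charAt(0));
    int bucket = (c >= 'a' && c <= 'z') ? (c - 'a') : 25;
    return bucket % numPartitions;
  }
}

Wiring it in takes one line on the Job object, as does a combiner (which pre-reduces map output before the shuffle to cut network traffic): job.setPartitionerClass(FirstLetterPartitioner.class); job.setCombinerClass(SumReducer.class); (reusing the reducer from the word-count sketch above).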
20. Hadoop M/R Primer
Let’s discuss HDFS (blocks, replication) and how that enables “data-local” tasks
(From: Yahoo)
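You can see the “data-local tasks” point directly from client code. A small sketch, assuming a reachable cluster and an invented file path, that asks the namenode where each block of a file (and its replicas) lives; this is the same metadata the scheduler uses to place map tasks on nodes that already hold the data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/f_pageviews/part-00000");  // illustrative path
    FileStatus status = fs.getFileStatus(file);
    // One entry per block; each lists the datanodes holding a replica.
    for (BlockLocation block :
         fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}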
21. Hadoop Terasort Job Profile
- or “hey, I thought it was just M/R”
(From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
22. Why Hadoop?
Because you don’t want to handle this…
This is actually a profile of a job running on an old version of Hadoop, but jobs with many failures look similar. It also shows how much Hadoop has improved.
(From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
23. Hadoop M/R executive summary
A distributed storage system, with distributed processing capability, on commodity hardware (or in the cloud).
It moves the computation to the data! That, in turn, saves network bandwidth, which is the limiting factor in distributed apps.
The same code can run on data of any size; the cluster is scaled with the data, not the code.
24. Hadoop Stack Key Components
(http://hortonworks.com/technology/hortonworksdataplatform/)
HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas.
Hadoop is not just about non/semi-structured data!
26. Common RDBMS warehouse query
select top 10
t.*
from (
select ip_address, count(*) as cnt
from f_pageviews pv
join d_ipaddress ip on (pv.ip_key = ip.id)
where date_key = 2992
group by ip_address
)t
order by cnt desc
– wait a few minutes
– time is usually 1-4x the nominal time, depending on load
– … and assumes the job can succeed at all!
27. Hive Version…
The luxury of Hadoop space/power means dimensional processing might not be required.
NOTE: Hive does support “column-oriented” storage, which is very efficient.
select t.*
from (
select ip_address, count(*) as cnt
from f_lookback
where ds = '2011-03-11'
group by ip_address
)t
order by cnt desc
limit 10
– BUT – runtime is trickier:
Time to run your job = HQL parse + M/R job submit + [ wait in the queue for availability ] + M/R job runtime
28. What else can Hadoop do?
FB: Invented Cassandra but went with HBase for their new messaging system.
Does that mean HBase is ”better”? – no, it’s about using the right tool for the job.
http://www.facebook.com/note.php?note_id=454991608919
That’s to hold 135B messages per month!
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
Scale is relative (to your hardware and load), but when you want a consistent “OLTP” solution that doesn’t require redesign to scale, consider HBase.
29. HBase Architecture
Not shown: the HBase Master (HM), ZooKeeper (ZK), and HDFS
(From: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
30. HBase: a more detailed view
(http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
31. HBase: one way to look at it
A BigTable Implementation: memcached + LSM + framework
(From: http://java.dzone.com/news/bigtable-model-cassandra-and)
32. HBase: Hadoop BigTable
Not just a CRUD back-end:
…coprocessors, versioned cells, range scans, optimization (e.g. selective compression) via column families, etc.
The most important of these is distributed processing.
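Two of those features are easy to show in client code. A hedged sketch against the HBase 1.x Java client follows (the table, row keys, and column names are invented for illustration): a versioned read pulling recent historical values of one cell, and a range scan over the sorted row keys.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseFeatures {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("patients"))) {

      // Versioned cells: read up to 3 historical values of one column.
      Get get = new Get(Bytes.toBytes("patient#00042"));
      get.setMaxVersions(3);
      for (Cell cell : table.get(get)
               .getColumnCells(Bytes.toBytes("vitals"), Bytes.toBytes("bp"))) {
        System.out.println(cell.getTimestamp() + " -> "
            + Bytes.toString(CellUtil.cloneValue(cell)));
      }

      // Range scan: row keys are stored sorted, so scanning a key range
      // touches only the regions that hold it.
      Scan scan = new Scan(Bytes.toBytes("patient#00040"),
                           Bytes.toBytes("patient#00050"));
      try (ResultScanner rows = table.getScanner(scan)) {
        for (Result row : rows) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}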
33. Hadoop in (pre*) action
Hadoop indexed “THE DATA” for Watson
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/i%E2%80%99ll-take-hadoop-for-400-alex/
*Runtime processing used JMS + Apache UIMA.
35. Overlapping Ecosystems
Hadoop (usage and contributions) will be “shared” between FOSS and Closed Source communities.
Image from: http://cyhshonorsbio.wikispaces.com/The+Chemistry+of+Life
36. False Conflicts, with Solutions
Sodium (explosive) + Chlorine (poison) => Salt (vital)
From http://strangetimes.lastsuperpower.net/?p=1663
Closed Source + Open Source => Free + Enterprise + Support + Integration
Visit: http://en.wikipedia.org/wiki/Business_models_for_open_source_software#Hybrid
37. IMO, an important message from a brilliant man
Anant Jhingran, Hadoop Summit 2011: IBM Watson & Big Data with Q&A
http://www.youtube.com/watch?v=IVS__xF3Byg
Add value by fostering the ecosystem.
Do not fragment Hadoop (as Unix did).
There is room for folks from many areas to contribute and benefit.
39. MS embraced Hadoop despite having developed technology similar to NextGen Hadoop. Wow.
The Hadoop release on Azure is 3/12.
BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please contact us, as we’ll be blogging about it.
41. Hadoop NextGen: A Brave New (!?) World
Hadoop “nextGen” will support more than M/R, e.g. “Apache Giraph”
BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
42. Hadoop >> (un)structured data store.
Why do this (except ad-hoc) …?
RDBMS and Hadoop each have strengths; use them, don’t negate both.
See the above Warehouse Architecture diagram…
(From: http://nosql.mypopescu.com/post/344388408/hadoop-and-oracle-parallel-processing)
44. Useful/Supporting Links
Bing crawls the web for Yahoo (for US, Canada, and some other countries)
http://www.ehow.com/info_8208930_isnt-yahoo-crawling-website.html
World’s largest SSAS Cube: 14TB/quarter, 3B rows/day
http://jobs.climber.com/jobs/Media-Communication/-CA-US/MS-SQL-SSAS-SSIS-Engineer/22735283
http://hadoop.apache.org/
http://www.docstoc.com/docs/66356954/Advanced-HBase
https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
http://wiki.apache.org/hadoop/WordCount
https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s