Hadoop Talk
Brief background on me
    Phil has over 16 years of experience in data-centric system
    development. His work has progressed from simulation and
    video-game-like systems, to high-performance computing (HPC), to
    traditional database (Oracle, SQL Server, Postgres, MySQL)
    and CRM (warehouse/analytical) systems, and most recently to
    the Hadoop stack. Recently, as an employee at TripAdvisor, he
    led the research into Hadoop/Hive that resulted in the
    successful migration from a traditional RDBMS platform to a
    system based on Hadoop/Hive and integrated with
    MS SQL Server/SSAS. He is currently focused on the Hadoop
    stack and is creating a solution that integrates
    Hadoop into a more traditional enterprise environment.
Agenda
   To make you as excited about Hadoop as I am

   What is Hadoop (high-level)?
   What have we actually done with it?
   How does “it” (HDFS, M/R, Hive, and HBase) work?
   Future of Hadoop
What is Hadoop?
Q: What is Hadoop?
   A#1 - The thing that empowers
      Yahoo, FB, and others
         Yahoo has >25k Hadoop nodes…wow…
Q: What is Hadoop?
   A#2 - Last year’s revolution (sort of)
The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on
Q: What is Hadoop?
A#3 – the revolution of 5+ years ago
“Success has many fathers”
And you can look them up, because it’s FOSS !
People are fighting to contribute, and to get credit… be a contributor…
Q: What is Hadoop?
A#4 – the wave everyone is riding

 Nearly all the big players (and many smaller ones) are on board…
In fact, beware of this
What have we actually done with it?
Hadoop projects performed by BlueMetal Architects

    Hadoop at a Web 2.0 company (prior to BMA)
      Ported traditional 30TB Warehouse to Hive
      Big transform jobs in Hive
        E.g., joins of 50M rows to 12B rows
        Big Data jobs, e.g. Social Graph processing with
        many “Cartesians” to empower emails
     Hadoop in HealthCare (at BMA)
      Applied HBase as part of a new system
      Feeds data (via WS) to:
        E.D.
        Patient Web Portal
        Other HealthCare affiliates
  Note: Both projects include Hadoop as part of larger systems.
Warehouse Goals

   Use the right tool for the right job
  –Hadoop (M/R, Hive) is a batch system
   • Inherently high-latency
  –RDBMS (& other tools) are still needed
   Empower users
  –Minimize complexity
   • Eliminate joins (almost)
   • Eliminate “dimensions” (maybe)
  –Expose *all* data
  –Provide low-latency options
  –Provide self-service options
A strategy for MASSIVE processing:
Best tool for the job
This is what we implemented and, it turns out, is also what Yahoo has done.
Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day)
Focus back to Hadoop …
High-level descriptions are good,
but not enough. How does it work?
Here we go…
Map-Reduce (M/R) example
Note: this job is not optimized
Take home message: “Simple API - Mappers read the
input and emit K/V pairs. Framework sends Reducers
K/V pairs partitioned and ordered* by Key”
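The take-home message can be sketched as a single-process simulation. This is not the Hadoop API; `mapper`, `partition`, `reducer`, and `run_job` are hypothetical names chosen for illustration, and word count is an assumed example job. The sketch shows mappers emitting K/V pairs, a partitioner routing each key to one reducer, and each reducer receiving its keys in sorted order. Note that the ordering holds within each reducer's partition, not across the whole output.

```python
from collections import defaultdict


def mapper(line):
    """Read one input record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def partition(key, num_reducers):
    """Pick the reducer for a key (a deterministic stand-in for a hash partitioner)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Fold all values seen for one key into a single output pair."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    # Map phase: every emitted pair is routed to exactly one partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in mapper(line):
            partitions[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each reducer walks its own keys in sorted order.
    output = []
    for part in partitions:
        output.append([reducer(key, part[key]) for key in sorted(part)])
    return output


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    for index, part in enumerate(run_job(lines)):
        print("reducer", index, part)
```

In a real cluster the mappers and reducers run as separate tasks on different machines; the framework, not user code, performs the routing and sorting shown inside `run_job`.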
Hadoop M/R with some details:
Note: Partition, Combine and Shuffle
Hadoop M/R Primer
Let’s discuss HDFS: (blocks, replication) and how that helps “data local tasks”
(From: Yahoo)
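A rough sketch of how blocks and replication make "data local tasks" possible. All names here (`split_into_blocks`, `place_replicas`, `schedule`) are hypothetical, the round-robin placement is a simplification rather than HDFS's actual placement policy, and the sizes and node names are illustrative. The idea: a file is cut into blocks, each block is stored on several nodes, and the scheduler prefers a node that already holds a replica of the block a task needs.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` bytes would occupy."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def schedule(placement, free_slots):
    """Map block -> (node, is_data_local), preferring nodes that hold a replica."""
    slots = dict(free_slots)
    assignment = {}
    for block, holders in placement.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = (node, node in holders)
    return assignment


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    nodes = ["n0", "n1", "n2", "n3"]
    placement = place_replicas(len(blocks), nodes)
    print(schedule(placement, {node: 1 for node in nodes}))
```

With more replicas per block, more nodes qualify as "local" for any given task, so the scheduler is less likely to ship a block across the network.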
Hadoop Terasort Job Profile
- or “hey, I thought it was just M/R”
Why Hadoop?
Because you don’t want to handle this…
This is actually a profile of a job running on an old version of Hadoop, but jobs
with many failures look similar. This also shows improvement in Hadoop.
Hadoop M/R executive summary

Distributed storage system, with distributed processing
capability, on commodity hardware (or in the cloud).

Moves the computation to the data!
That, in turn, saves network bandwidth, which is the limiting factor in
distributed apps.

The same code can run on data of any size. The cluster is
scaled with the data, not the code.
Hadoop Stack Key Components
HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas.
Hadoop is not just about non-/semi-structured data!
+ Metadata
+ HQL-> (efficient) M/R
+ more
- low-latency (usually)
- (row-level) updates
- other (e.g. constraints)
+ HUGE scalability
+ POWERFUL distributed processing
Common RDBMS warehouse query

select top 10 t.*
from (
  select ip_address, count(*) as cnt
  from f_pageviews pv
  join d_ipaddress ip on (pv.ip_key =
  where date_key = 2992
  group by ip_address
) t
order by cnt desc

– wait a few minutes
- time is usually 1-4x nominal time depending on load
- … assumes the job can succeed at all !
Hive Version…
The luxury of Hadoop space/power, means dimensional processing might not be
NOTE: Hive does support “column-oriented” storage, which is very efficient.

select t.*
from (
  select ip_address, count(*) as cnt
  from f_lookback
  where ds = '2011-03-11'
  group by ip_address
) t
order by cnt desc
limit 10

– BUT – runtime is trickier
Time to run your job = HQL parse + M/R Job Submit + [ wait
in the queue for availability ] + M/R Job Runtime
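The runtime breakdown above is plain addition, but working it through shows why short queries feel slow on a batch system. The function names and all the numbers below are hypothetical, chosen only to illustrate the arithmetic; they are not measurements.

```python
def hive_wall_clock_seconds(parse_s, submit_s, queue_wait_s, job_runtime_s):
    """Total time the user waits: parse + submit + queue wait + M/R runtime."""
    return parse_s + submit_s + queue_wait_s + job_runtime_s


def overhead_fraction(parse_s, submit_s, queue_wait_s, job_runtime_s):
    """Share of the total wait that is not spent running the M/R job itself."""
    total = hive_wall_clock_seconds(parse_s, submit_s, queue_wait_s, job_runtime_s)
    return (total - job_runtime_s) / total


if __name__ == "__main__":
    # Hypothetical timings in seconds for one query on a busy cluster.
    timings = dict(parse_s=2, submit_s=10, queue_wait_s=60, job_runtime_s=120)
    print(hive_wall_clock_seconds(**timings))
    print(overhead_fraction(**timings))
```

The queue-wait term depends on cluster load rather than on the query, which is what makes the total harder to predict than the M/R runtime alone.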
What else can Hadoop do?

   FB: Invented Cassandra but went with HBase for their new messaging system.
   Does that mean HBase is “better”? No, it’s about using the right tool for the job.

That’s to hold 135B messages per month !

Scale is relative (to your hardware and load),
but when you want a consistent “OLTP” solution that doesn’t require redesign to scale,
consider HBase.
HBase Architecture
Not shown: HMaster (HM), ZooKeeper (ZK), and HDFS
HBase: a more detailed view
HBase: one way to look at it
A BigTable Implementation: memcached + LSM + framework
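One way to read "memcached + LSM" is as an in-memory write buffer in front of immutable sorted files. The toy class below is a sketch under that reading, not HBase code: `ToyLSM`, `flush_threshold`, and the list-based "segments" are all hypothetical stand-ins. Writes land in memory, are flushed to a sorted segment when the buffer fills, and reads check memory first and then segments from newest to oldest.

```python
import bisect


class ToyLSM:
    """A toy log-structured merge store: in-memory buffer plus sorted segments."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}    # recent writes, held in memory
        self.segments = []    # immutable sorted lists of (key, value), oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the in-memory buffer out as one new sorted segment."""
        if self.memstore:
            self.segments.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Newest data wins: memory first, then segments newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for segment in reversed(self.segments):
            keys = [k for k, _ in segment]
            index = bisect.bisect_left(keys, key)
            if index < len(keys) and keys[index] == key:
                return segment[index][1]
        return None

    def compact(self):
        """Merge all segments into one, keeping the newest value per key."""
        merged = {}
        for segment in self.segments:  # oldest first, so later updates overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []
```

Because segments are never modified in place, writes stay sequential; the cost is that reads may touch several segments until a compaction merges them.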
HBase: Hadoop BigTable
Not just a CRUD back-end:
…coprocessors, versioned cells, range scans, optimization (e.g.
selective compression) via column families, etc.
       The most important of these is distributed processing.
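Three of the features listed above (column families, versioned cells, and range scans) can be illustrated with a small in-memory model. This is a sketch of the data model only, not the HBase client API; `ToyTable` and its method names are hypothetical. Rows are kept sorted by key, each cell keeps several timestamped versions, and a scan returns the rows in a key range with an inclusive start and an exclusive stop.

```python
import bisect


class ToyTable:
    """A toy model of a sorted, versioned, column-family table."""

    def __init__(self, families):
        self.families = set(families)
        self.row_keys = []   # kept sorted so range scans are cheap
        self.rows = {}       # row key -> {(family, qualifier): [(ts, value), ...]}

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        if row not in self.rows:
            bisect.insort(self.row_keys, row)
            self.rows[row] = {}
        versions = self.rows[row].setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0], reverse=True)  # newest first

    def get(self, row, family, qualifier, max_versions=1):
        """Return up to `max_versions` (timestamp, value) pairs, newest first."""
        versions = self.rows.get(row, {}).get((family, qualifier), [])
        return versions[:max_versions]

    def scan(self, start_row, stop_row):
        """Return row keys in [start_row, stop_row), in sorted order."""
        low = bisect.bisect_left(self.row_keys, start_row)
        high = bisect.bisect_left(self.row_keys, stop_row)
        return self.row_keys[low:high]
```

Since rows are ordered by key, choosing row keys so that related records sort next to each other is what makes range scans useful in practice.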
Hadoop in (pre*) action
Hadoop indexed “THE DATA” for Watson

*Runtime processing used Apache JMS + UIMA.
Future of Hadoop
Overlapping Ecosystems

Hadoop (usage and contributions) will be
“shared” between FOSS and Closed Source

False Conflicts, with Solutions
Sodium (explosive) + Chlorine (poison) =>


Closed Source + Open Source =>
Free + Enterprise + Support
+ Integration
IMO, an important message from a
brilliant man
    Anant Jhingran Hadoop Summit 2011 IBM Watson & Big Data with Q&A

Add value by fostering the ecosystem.
Do not fragment Hadoop (as Unix did).
There is room for folks from many areas to contribute and benefit.
Hadoop “option” (MapR) that plays nicely
MS embraced Hadoop despite having developed
technology similar to NextGen Hadoop. Wow.
Hadoop release on Azure is 3/12.
BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please
contact us, as we’ll be blogging about it.
Hadoop NextGen:
 NameNode high availability (NN-HA), performance gains, and more
Hadoop NextGen:
A Brave New (!?) world
Hadoop “nextGen” will support more than M/R, e.g. “Apache Giraph”
BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
Hadoop >> (un)structured data store.
Why do this (except ad-hoc) …?
RDBMS and Hadoop each have strengths; use them, don’t negate both.
See the above Warehouse Architecture diagram…
Useful/Supporting Links
Bing crawls the web for Yahoo (for US, Canada, and some other countries)
World’s largest SSAS Cube: 14TB/quarter, 3B rows/day
Additional Slides
Fun Links

Hadoop demo ppt

  • 2. Brief background on me  Phil has over 16 years experience in data-centric system development. His work has flowed from simulation and video- game-like systems, to high-performance computing (HPC), to traditional database (Oracle, SQL Server, Postgres, MySQL) and CRM (warehouse/analytical) systems, and most recently to the Hadoop stack. Recently, as an employee at TripAdvisor he led the research into Hadoop/Hive which resulted in the successful migration from the traditional RDBMS platform to a system which is based on Hadoop/Hive and is integrated with MS SQL Server/SSAS. Currently, he's focused on the Hadoop stack and is creating a solution which involves integrating Hadoop in a more traditional enterprise environment.
  • 3. Agenda  To make you as excited about Hadoop as I am  What is Hadoop (high-level) ?  What have we actually done with it?  How does “it” (HDFS, M/R, Hive, and HBase) work?  Future of Hadoop
  • 5. Q: What is Hadoop: A#1 - The thing that empowers Yahoo, FB, and others Yahoo has >25k Hadoop nodes…wow…
  • 6. Q: What is Hadoop A#2 - Last year’s revolution (sort of) The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on
  • 7. Q: What is Hadoop A#3 – the revolution of 5+ years ago
  • 8. “Success has many fathers” And you can look them up, because it’s FOSS! People are fighting to contribute, and to get credit… be a contributor…
  • 9. What is Hadoop: A#4 – the wave everyone is riding Nearly all the big players (and many smaller ones) are on board…
  • 10. In fact, beware of this
  • 11. What have we actually done with it?
  • 12. Hadoop projects performed by BlueMetal Architects  Hadoop at a Web 2.0 company (prior to BMA)  Ported traditional 30TB Warehouse to Hive  Big transform jobs in Hive  E.G. Joins 50M rows to 12B rows  Big Data jobs, e.g. Social Graph processing with many “Cartesians” to empower emails  Hadoop in HealthCare (at BMA)  Applied HBase as part of a new system  Feeds data (via WS) to:  E.D.  Patient Web Portal  Other HealthCare affiliates Note: Both projects include Hadoop as part of larger systems.
  • 13. Warehouse Goals  Use the right tool for the right job –Hadoop (M/R, Hive) is a batch system • Inherently high-latency –RDBMS (& other tools) are still needed  Empower users –Minimize complexity • Eliminate joins (almost) • Eliminate “dimensions” (maybe) –Expose *all* data –Provide low-latency options –Provide self-service options
  • 14. A strategy for MASSIVE processing: Best tool for the job This is what we implemented and, it turns out, is also what Yahoo has done. Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day)
  • 15. Focus back to Hadoop …
  • 16. High-level descriptions are good, but not enough. How does it work? (From:
  • 18. Map-Reduce (M/R) example Note: this job is not optimized Take home message: “Simple API - Mappers read the input and emit K/V pairs. Framework sends Reducers K/V pairs partitioned and ordered* by Key” (From:
  • 19. Hadoop M/R with some details: Note: Partition, Combine and Shuffle (From:
  • 20. Hadoop M/R Primer Let’s discuss HDFS: (blocks, replication) and how that helps “data local tasks” (From: Yahoo)
  • 21. Hadoop Terasort Job Profile - or “hey, I thought it was just M/R” (from orts_a_petabyte_in_162/)
  • 22. Why Hadoop? Because you don’t want to handle this… This is actually a profile of a job running on an old version of Hadoop, but jobs with many failures look similar. This also shows improvement in Hadoop. (From:
  • 23. Hadoop M/R executive summary Distributed storage system, with distributed processing capability, on commodity hardware (or in the cloud). Moves the computation to the data! That, in turn, saves network bandwidth, which is the limiting factor in distributed apps. The same code can run on data of any size. The cluster is scaled with the data, not the code.
  • 24. Hadoop Stack Key Components HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas. Hadoop is not just about non-/semi-structured data!
  • 25. Hive = HDFS + Metadata + HQL-> (efficient) M/R + more = RDBMS - low-latency (usually) - (row-level) updates - other (e.g. constraints) + HUGE scalability + POWERFUL distributed processing
  • 26. Common RDBMS warehouse query select top 10 t.* from ( select ip_address, count(*) as cnt from f_pageviews pv join d_ipaddress ip on (pv.ip_key = where date_key = 2992 group by ip_address )t order by cnt desc – wait a few minutes - time is usually 1-4x nominal time depending on load - … assumes the job can succeed at all !
  • 27. Hive Version… The luxury of Hadoop space/power, means dimensional processing might not be required NOTE: Hive does support “column-oriented” storage, which is very efficient. select t.* from ( select ip_address, count(*) as cnt from f_lookback where ds = '2011-03-11' group by ip_address )t order by cnt desc Limit 10 – BUT – runtime is trickier Time to run your job = HQL parse + M/R Job Submit + [ wait in the queue for availability ] + M/R Job Runtime
  • 28. What else can Hadoop do? FB invented Cassandra but went with HBase for their new messaging system, which has to hold 135B messages per month! Does that mean HBase is “better”? No, it’s about using the right tool for the job. Scale is relative (to your hardware and load), but when you want a consistent “OLTP” solution that doesn’t require redesign to scale, consider HBase.
  • 29. HBase Architecture Not shown: HM, ZK and HDFS (From: storage.html)
  • 30. HBase: a more detailed view
  • 31. HBase: one way to look at it A BigTable Implementation: memcached + LSM + framework (From:
  • 32. HBase: Hadoop BigTable Not just a CRUD back-end: …coprocessors, versioned cells, range scans, optimization (e.g. selective compression) via column families, etc. The most important of these is distributed processing.
  • 33. Hadoop in (pre*) action Hadoop indexed “THE DATA” for Watson *Runtime processing used Apache JMS + UIMA.
  • 35. Overlapping Ecosystems Hadoop (usage and contributions) will be “shared” between FOSS and Closed Source communities. Image from:
  • 36. False Conflicts, with Solutions Sodium(explosive) + Chlorine(poison) => Salt(vital) From Closed Source + Open Source => Free + Enterprise + Support + Integration Visit:
  • 37. IMO, an important message from a brilliant man Anant Jhingran Hadoop Summit 2011 IBM Watson & Big Data with Q&A Add value by fostering the ecosystem. Do not fragment Hadoop (as Unix did). There is room for folks from many areas to contribute and benefit.
  • 38. Hadoop “option” (MapR) that plays nicely
  • 39. MS embraced Hadoop despite having developed technology similar to NextGen Hadoop. Wow. Hadoop release on Azure is 3/12. BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please contact us as we’ll be blogging about it.
  • 40. Hadoop NextGen: NN-HA, performance gains, more
  • 41. Hadoop NextGen: A Brave New (!?) world Hadoop “nextGen” will support more than M/R, e.g. “Apache Giraph” BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
  • 42. Hadoop >> (un)structured data store. Why do this (except ad-hoc) …? RDBMS and Hadoop have strengths, use them, don’t negate both. See the above Warehouse Architecture diagram… From:
  • 43. Q&A
  • 44. Useful/Supporting Links Bing crawls the web for Yahoo (for US, Canada, and some other countries) World’s largest SSAS Cube: 14TB/quarter, 3B rows/day Engineer/22735283
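
The take-home message of slides 18 and 19 (mappers emit K/V pairs; the framework combines, partitions, shuffles and sorts them by key before the reducers run) can be sketched in a few lines of plain Python. This is only an in-memory illustration under stated assumptions, not Hadoop's API: the function names (`mapper`, `combiner`, `partition`, `reducer`, `run_job`), the word-count job and the sample input are all hypothetical, and a real cluster would run each step on separate machines.

```python
from collections import defaultdict
from zlib import crc32


def mapper(line):
    """Map step: emit one (word, 1) pair per word in the input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combine step: pre-sum counts on the map side to shrink shuffle traffic."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return partial.items()


def partition(key, num_reducers):
    """Partition step: a stable hash decides which reducer owns a key."""
    return crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Reduce step: sum all partial counts that arrived for one key."""
    return key, sum(values)


def run_job(input_splits, num_reducers=2):
    # One bucket per reducer, each mapping key -> list of values ("shuffle").
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in input_splits:  # each split stands in for one map task
        pairs = (kv for line in split for kv in mapper(line))
        for key, value in combiner(pairs):
            buckets[partition(key, num_reducers)][key].append(value)

    results = {}
    for bucket in buckets:
        for key in sorted(bucket):  # each reducer sees its keys in sorted order
            word, total = reducer(key, bucket[key])
            results[word] = total
    return results


# Hypothetical input: two splits, as if the file were stored in two blocks.
splits = [
    ["hadoop moves the computation to the data"],
    ["the data stays put", "hadoop scales"],
]
counts = run_job(splits)
print(counts["the"], counts["hadoop"], counts["data"])
```

Changing `num_reducers` changes which bucket a key lands in but not the final counts, which is the point of slide 23's "the same code can run on data of any size": the job logic stays fixed while the degree of parallelism varies.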