Introduction to Hadoop
Zak Stone <zak@eecs.harvard.edu>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)
Hadoop distributes data and computation across a
large number of computers.
Outline

  1. Why should you care about Hadoop?
  2. What exactly is Hadoop?
  3. An overview of Hadoop Map-Reduce
  4. The Hadoop Distributed File System (HDFS)
  5. Hadoop advantages and disadvantages
  6. Getting started with Hadoop
  7. Useful resources
Why should you care? - Lots of Data

   LOTS OF DATA
   EVERYWHERE

Why should you care? - Even Grocery Stores Care

Why Hadoop for big data?

• Most credible open-source toolset for large-scale, general-purpose computing

  • Backed by [logo not preserved]

  • Used by [logos not preserved] and many others

  • Increasing support from Amazon Web Services

  • Hadoop closely imitates infrastructure developed by Google

  • Hadoop processes petabytes daily, right now
Why Hadoop for big data?
DISCLAIMER
   • Don’t use Hadoop if your data and computation fit on one machine


   • Getting easier to use, but still complicated




http://www.wired.com/gadgetlab/2008/07/patent-crazines/
What exactly is Hadoop?

• Actually a growing collection of subprojects; focus on two right now
An overview of Hadoop Map-Reduce

   Traditional computing: one computer
   Hadoop: many computers

   (Actually more like this: many computers, little communication,
    stragglers and failures)
Map-Reduce: Three phases



              1. Map

              2. Sort

              3. Reduce
Map-Reduce: Map phase

   Only specify operations on key-value pairs!

    INPUT PAIR        OUTPUT PAIRS
    (key, value)  ->  (key, value)
                      (key, value)
                      (key, value)
                      (zero or more output pairs)

   (each “elephant” works on an input pair;
    doesn’t know other elephants exist)
Map-Reduce: Map phase, word-count example

   (line1, “Hello there.”)  ->  (“hello”, 1)
                                (“there”, 1)

   (line2, “Why, hello.”)   ->  (“why”, 1)
                                (“hello”, 1)
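The word-count slide shows lowercase words with the punctuation removed, which a plain `split()` would not produce. A minimal sketch of a mapper that yields exactly those pairs follows; the name `normalizing_mapper` and the normalization rule (strip surrounding punctuation, then lowercase) are assumptions for illustration, not something the slides specify.

```python
import string


def normalizing_mapper(key, value):
    """Emit (word, 1) for each word in one line of text.

    Assumption: words are normalized by stripping surrounding
    punctuation and lowercasing, to match the slide's output.
    """
    for token in value.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were punctuation only
            yield word, 1


# Each input pair is processed on its own, with no shared state.
pairs = list(normalizing_mapper("line1", "Hello there."))
pairs += list(normalizing_mapper("line2", "Why, hello."))
print(pairs)
# [('hello', 1), ('there', 1), ('why', 1), ('hello', 1)]
```

Because the mapper only sees one input pair at a time, the two calls could just as well run on different machines.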
Map-Reduce: Sort phase

          (key1, value289)
           (key1, value43)
           (key1, value3)
                 ...
          (key2, value512)
           (key2, value11)
           (key2, value67)
                   ...
Map-Reduce: Sort phase, word-count example

   (“hello”, 1)
   (“hello”, 1)

   (“there”, 1)

   (“why”, 1)
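The sort phase can be imitated in a single process, as a rough sketch: sort the mapper output by key, then collect the values that share a key. The helper name `sort_and_group` is hypothetical, and a real cluster performs this step across machines rather than with an in-memory sort.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_group(pairs):
    """Sort (key, value) pairs by key and yield (key, [values]) groups."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Mapper output from the word-count example, in arrival order.
map_output = [("hello", 1), ("there", 1), ("why", 1), ("hello", 1)]
grouped = list(sort_and_group(map_output))
print(grouped)
# [('hello', [1, 1]), ('there', [1]), ('why', [1])]
```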
Map-Reduce: Reduce phase

   (key1, value289)
   (key1, value43)    ->  (key1, output1)
   (key1, value3)

         ...
Map-Reduce: Reduce phase, word-count example

   (“hello”, 1)
   (“hello”, 1)   ->  (“hello”, 2)

   (“there”, 1)   ->  (“there”, 1)

   (“why”, 1)     ->  (“why”, 1)
Map-Reduce: Code for word-count


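     def mapper(key, value):
         # Emit a (word, 1) pair for every whitespace-separated word in the line.
         for word in value.split():
             yield word, 1

     def reducer(key, values):
         # Sum the 1s collected for this word to get its total count.
         yield key, sum(values)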
Seems like too much work
   for a word-count!
Map-Reduce: Imagine word-count on the Web
Map-Reduce: The main advantage

With Hadoop, this very same code could run on
      the entire Web! (In theory, at least)
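     def mapper(key, value):
         # Emit a (word, 1) pair for every whitespace-separated word in the line.
         for word in value.split():
             yield word, 1

     def reducer(key, values):
         # Sum the 1s collected for this word to get its total count.
         yield key, sum(values)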
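To try the mapper and reducer above without a cluster, the three phases can be chained in one process. This is only a sketch: `run_local` is a hypothetical helper, not part of Hadoop or Dumbo, and it keeps everything in memory, which is exactly what Hadoop avoids at scale.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    for word in value.split():
        yield word, 1


def reducer(key, values):
    yield key, sum(values)


def run_local(records, mapper, reducer):
    """Run map, sort, and reduce in one process (hypothetical helper)."""
    # Map: every input pair is handled independently.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Sort: bring pairs with the same key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce: one reducer call per distinct key.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (value for _, value in group)))
    return results


records = [("line1", "hello there"), ("line2", "why hello")]
counts = run_local(records, mapper, reducer)
print(counts)
# [('hello', 2), ('there', 1), ('why', 1)]
```

The mapper and reducer are unchanged from the slide; only the driver around them differs between a laptop and a cluster.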
HDFS: Hadoop Distributed File System

   Data  ->  chunks of data on computers
             (each chunk replicated more than once for reliability)
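The idea of chunking and replication can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the tiny chunk size, the node names, and especially the round-robin placement, which is not HDFS's actual placement policy.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication):
    """Assign each chunk to `replication` distinct nodes, round-robin.

    Assumption: round-robin is used only to keep the sketch short.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
placement = place_replicas(len(chunks), ["node-a", "node-b", "node-c"], replication=2)
print(chunks)     # [b'abcd', b'efgh', b'ij']
print(placement)  # {0: ['node-a', 'node-b'], 1: ['node-b', 'node-c'], 2: ['node-c', 'node-a']}
```

With two copies of every chunk, any single node can fail and each chunk is still readable from another node.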
HDFS: Hadoop Distributed File System

   (key1, value1)      (key1, value1)
   (key2, value2)      (key2, value2)
        ...                 ...          ...

   Computation is local to the data
   Key-value pairs processed independently in parallel
HDFS: Inspired by the Google File System
Hadoop Map-Reduce and HDFS: Advantages
• Distribute data and computation

   • Computation local to data avoids network overload

• Tasks are independent

   • Easy to handle partial failures - entire nodes can fail and restart

   • Avoid crawling horrors of failure-tolerant synchronous distributed systems

   • Speculative execution to work around stragglers

• Linear scaling in the ideal case

   • Designed for cheap, commodity hardware

• Simple programming model

   • The “end-user” programmer only writes map-reduce tasks
Hadoop Map-Reduce and HDFS: Disadvantages
• Still rough - software under active development

   • e.g. HDFS only recently added support for append operations

• Programming model is very restrictive

   • Lack of central data can be frustrating

• “Joins” of multiple datasets are tricky and slow

   • No indices! Often, entire dataset gets copied in the process

• Cluster management is hard (debugging, distributing software, collecting logs...)

• Still single master, which requires care and may limit scaling

• Managing job flow isn’t trivial when intermediate data should be kept

• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
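The point about joins in the list above can be made concrete with a reduce-side join, one common way to join datasets in map-reduce. The sketch below is a single-process illustration under assumptions: the dataset names `users` and `orders`, the sample records, and the helper names are all made up. Note that every record of both datasets passes through the sort step, which is why the slide calls joins slow.

```python
from collections import defaultdict


def tag_mapper(source, record):
    """Emit (join_key, (source, payload)) so both datasets share one key space."""
    join_key, payload = record
    yield join_key, (source, payload)


def join_reducer(key, tagged_values):
    """Pair every user payload with every order payload for the same key."""
    users, orders = [], []
    for source, payload in tagged_values:
        (users if source == "users" else orders).append(payload)
    for user in users:
        for order in orders:
            yield key, (user, order)


users = [(1, "ada"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "ink")]

# Stand-in for the sort phase: group all tagged records by join key.
grouped = defaultdict(list)
for source, dataset in (("users", users), ("orders", orders)):
    for record in dataset:
        for key, tagged in tag_mapper(source, record):
            grouped[key].append(tagged)

joined = [row for key in sorted(grouped) for row in join_reducer(key, grouped[key])]
print(joined)
# [(1, ('ada', 'book')), (1, ('ada', 'pen'))]
```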
Getting started: Installation options

• Cloudera virtual machine

• Your own virtual machine (install Ubuntu in VirtualBox, which is free)

• Elastic MapReduce on EC2

• StarCluster with Hadoop on EC2

• Cloudera’s distribution of Hadoop on EC2

• Install Cloudera’s distribution of Hadoop on your own machine

   • Available for RPM and Debian deployments

• Or download Hadoop directly from http://hadoop.apache.org/
Getting started: Language choices

• Hadoop is written in Java

• However, Hadoop Streaming allows mappers and reducers in any language!

• Binary data is a little tricky with Hadoop Streaming

   • Could use base64 encoding, but TypedBytes are much better

• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo

   • The Python word-count example and others come with Dumbo

   • Dumbo makes binary data with TypedBytes easy

• Also consider Hadoopy: https://github.com/bwhite/hadoopy
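Hadoop Streaming exchanges records as lines of text, by default with a tab between key and value, so raw bytes containing tabs or newlines would corrupt a record. The base64 workaround mentioned above can be sketched as follows; the helper names and the sample key are hypothetical, and TypedBytes (which the slide recommends instead) is not shown.

```python
import base64


def encode_record(key, payload):
    """Pack a binary payload into one tab-separated text line."""
    return key + "\t" + base64.b64encode(payload).decode("ascii")


def decode_record(line):
    """Recover the key and the original bytes from an encoded line."""
    key, encoded = line.rstrip("\n").split("\t", 1)
    return key, base64.b64decode(encoded)


# A payload containing a newline and a tab, which would break a plain text line.
original = b"\x00\xff\n\tbinary"
line = encode_record("image-1", original)
key, payload = decode_record(line)
print(key, payload == original)
# image-1 True
```

The cost is size: base64 output is about one third larger than the input, which is one reason a binary format is preferable for large payloads.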
Useful resources and tips

• The Hadoop homepage: http://hadoop.apache.org/

• Cloudera: http://cloudera.com/

• Dumbo: http://wiki.github.com/klbostee/dumbo

• Hadoopy: https://github.com/bwhite/hadoopy

• Amazon Elastic Compute Cloud Getting Started Guide:
  http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/


• Always test locally on a tiny dataset before running on a cluster!
...
Thanks for your attention!

[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
Dmitry Makarchuk
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
Keith Davis
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
Cloudera, Inc.
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
tcloudcomputing-tw
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Mohamed Ali Mahmoud khouder
 
Hadoop
HadoopHadoop
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
EMC
 
12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx
Shree Shree
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
Cloudera, Inc.
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Andrey Vykhodtsev
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
Sreenu Musham
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
DataPlato, Crossing the line
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
Cyanny LIANG
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
Nam Nham
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
harithakannan
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
Steve Watt
 

Similar to [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) (20)

Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx12-BigDataMapReduce.pptx
12-BigDataMapReduce.pptx
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Presentation sreenu dwh-services
Presentation sreenu dwh-servicesPresentation sreenu dwh-services
Presentation sreenu dwh-services
 
2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure2012 apache hadoop_map_reduce_windows_azure
2012 apache hadoop_map_reduce_windows_azure
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
 
The Family of Hadoop
The Family of HadoopThe Family of Hadoop
The Family of Hadoop
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 

More from npinto

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
npinto
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
npinto
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
npinto
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
npinto
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
npinto
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
npinto
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
npinto
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
npinto
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
npinto
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
npinto
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
npinto
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
npinto
 
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
npinto
 
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
[Harvard CS264] 08a - Cloud Computing, Amazon EC2, MIT StarCluster (Justin Ri...
npinto
 
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
[Harvard CS264] 07 - GPU Cluster Programming (MPI & ZeroMQ)
npinto
 
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
npinto
 
[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming[Harvard CS264] 05 - Advanced-level CUDA Programming
[Harvard CS264] 05 - Advanced-level CUDA Programming
npinto
 
[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming[Harvard CS264] 04 - Intermediate-level CUDA Programming
[Harvard CS264] 04 - Intermediate-level CUDA Programming
npinto
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
npinto
 
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
npinto
 

More from npinto (20)

"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)"AI" for Blockchain Security (Case Study: Cosmos)
"AI" for Blockchain Security (Case Study: Cosmos)
 
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...
 
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...
 
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...
 
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
[Harvard CS264] 15a - Jacket: Visual Computing (James Malcolm, Accelereyes)
 
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
[Harvard CS264] 14 - Dynamic Compilation for Massively Parallel Processors (G...
 
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...
 
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...
 
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...
 
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...
 
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...
 
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...

[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

  • 1. Introduction to Hadoop. Zak Stone <zak@eecs.harvard.edu>, PhD candidate, Harvard School of Engineering and Applied Sciences. Advisor: Todd Zickler (Computer Vision)
  • 2. Hadoop distributes data and computation across a large number of computers.
  • 3. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 4. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 5. Why should you care? - Lots of Data LOTS OF DATA EVERYWHERE
  • 6. Why should you care? - Lots of Data L O T S !
  • 7. Why should you care? - Lots of Data
  • 8. Why should you care? - Even Grocery Stores Care ...
  • 9. Why Hadoop for big data? • Most credible open-source toolset for large-scale, general-purpose computing • Backed by [logo] • Used by [logo], [logo], many others • Increasing support from [logo] web services • Hadoop closely imitates infrastructure developed by [logo] • Hadoop processes petabytes daily, right now
  • 10. Why Hadoop for big data?
  • 11. DISCLAIMER • Don’t use Hadoop if your data and computation fit on one machine • Getting easier to use, but still complicated http://www.wired.com/gadgetlab/2008/07/patent-crazines/
  • 12. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 13. What exactly is Hadoop? • Actually a growing collection of subprojects
  • 14. What exactly is Hadoop? • Actually a growing collection of subprojects; focus on two right now
  • 15. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 16. An overview of Hadoop Map-Reduce: traditional computing (one computer) vs. Hadoop (many computers)
  • 17. An overview of Hadoop Map-Reduce (Actually more like this) (many computers, little communication, stragglers and failures)
  • 18. Map-Reduce: Three phases 1. Map 2. Sort 3. Reduce
  • 19. Map-Reduce: Map phase. Only specify operations on key-value pairs! Input pair: (key, value). Output pairs: (key, value), (key, value), (key, value), ... (zero or more output pairs). Each “elephant” works on an input pair and doesn’t know other elephants exist.
  • 20. Map-Reduce: Map phase, word-count example. (line1, “Hello there.”) → (“hello”, 1), (“there”, 1); (line2, “Why, hello.”) → (“why”, 1), (“hello”, 1)
  • 21. Map-Reduce: Sort phase (key1, value289) (key1, value43) (key1, value3) ... (key2, value512) (key2, value11) (key2, value67) ...
  • 22. Map-Reduce: Sort phase, word-count example (“hello”, 1) (“hello”, 1) (“there”, 1) (“why”, 1)
  • 23. Map-Reduce: Reduce phase. (key1, value289), (key1, value43), (key1, value3), ... → (key1, output1)
  • 24. Map-Reduce: Reduce phase, word-count example. (“hello”, 1), (“hello”, 1) → (“hello”, 2); (“there”, 1) → (“there”, 1); (“why”, 1) → (“why”, 1)
  • 25. Map-Reduce: Code for word-count
        def mapper(key, value):
            for word in value.split():
                yield word, 1

        def reducer(key, values):
            yield key, sum(values)
  • 26. Seems like too much work for a word-count!
  • 28. Map-Reduce: The main advantage. With Hadoop, this very same code could run on the entire Web! (In theory, at least)
        def mapper(key, value):
            for word in value.split():
                yield word, 1

        def reducer(key, values):
            yield key, sum(values)
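The three phases described in the slides above (map, sort, reduce) can be tried on a single machine before touching a cluster. The following is a minimal sketch, not how Hadoop itself schedules work: `run_local` is a hypothetical helper name, and the whole pipeline runs in one Python process. The mapper as written on the slide neither lowercases words nor strips punctuation, so the sample input here is assumed to be already normalized.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated token, as on the slide.
    for word in value.split():
        yield word, 1


def reducer(key, values):
    # Sum all counts that share a key.
    yield key, sum(values)


def run_local(records):
    """Hypothetical single-process stand-in for the map, sort, and reduce phases."""
    # Map phase: each input pair is processed independently of the others.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Sort phase: bring pairs with the same key next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    results = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        results.extend(reducer(key, (count for _, count in group)))
    return results


if __name__ == "__main__":
    lines = [("line1", "hello there"), ("line2", "why hello")]
    print(run_local(lines))
```

Because the mapper and reducer only ever see key-value pairs, the same two functions could in principle be handed to a distributed runner without modification; only `run_local` would be replaced.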
  • 29. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 30. HDFS: Hadoop Distributed File System. Data is split into chunks stored on many computers; each chunk is replicated more than once for reliability.
  • 31. HDFS: Hadoop Distributed File System. Each computer holds its own key-value pairs: (key1, value1), (key2, value2), ... Computation is local to the data. Key-value pairs are processed independently in parallel.
  • 32. HDFS: Inspired by the Google File System
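The idea of chunked, replicated storage from the HDFS slides above can be illustrated with a toy model. This sketch is an assumption-laden simplification: the function names are hypothetical, and the round-robin placement is chosen only for clarity and is not HDFS's actual block placement policy. It shows why replicating each chunk on more than one node lets the data survive a node failure.

```python
def split_into_chunks(data, chunk_size):
    # Cut the data into fixed-size pieces; the last piece may be shorter.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=3):
    """Toy round-robin placement: each chunk goes to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for chunk_id in range(num_chunks):
        placement[chunk_id] = [
            nodes[(chunk_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_chunks(placement, failed_nodes):
    # A chunk survives if at least one node holding a replica is still up.
    return {
        chunk_id
        for chunk_id, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }
```

With two replicas per chunk on four nodes, losing any single node leaves every chunk readable, while losing two adjacent nodes in this placement can make a chunk unavailable.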
  • 33. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 34. Hadoop Map-Reduce and HDFS: Advantages • Distribute data and computation • Computation local to data avoids network overload • Tasks are independent • Easy to handle partial failures - entire nodes can fail and restart • Avoid crawling horrors of failure-tolerant synchronous distributed systems • Speculative execution to work around stragglers • Linear scaling in the ideal case • Designed for cheap, commodity hardware • Simple programming model • The “end-user” programmer only writes map-reduce tasks
  • 35. Hadoop Map-Reduce and HDFS: Disadvantages • Still rough - software under active development • e.g. HDFS only recently added support for append operations • Programming model is very restrictive • Lack of central data can be frustrating • “Joins” of multiple datasets are tricky and slow • No indices! Often, entire dataset gets copied in the process • Cluster management is hard (debugging, distributing software, collecting logs...) • Still single master, which requires care and may limit scaling • Managing job flow isn’t trivial when intermediate data should be kept • Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
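To see why joins are awkward in this programming model, consider one common workaround, often called a reduce-side join. The sketch below is illustrative only and runs in a single process: the dataset names `users` and `orders`, and the helpers `tag_mapper`, `join_reducer`, and `run_join`, are all hypothetical. Every record from both datasets is tagged with its source, grouped by join key, and only then paired up in the reducer, which means both datasets are moved in full.

```python
from collections import defaultdict


def tag_mapper(source, record):
    # Tag each value with the dataset it came from so the reducer can tell them apart.
    join_key, payload = record
    yield join_key, (source, payload)


def join_reducer(key, tagged_values):
    # Separate the two sides, then emit every matching combination.
    users, orders = [], []
    for source, payload in tagged_values:
        (users if source == "users" else orders).append(payload)
    for user in users:
        for order in orders:
            yield key, (user, order)


def run_join(users, orders):
    """Single-process stand-in for the map, sort, and reduce phases of a join."""
    grouped = defaultdict(list)
    for source, dataset in (("users", users), ("orders", orders)):
        for record in dataset:
            for key, tagged in tag_mapper(source, record):
                grouped[key].append(tagged)
    results = []
    for key in sorted(grouped):
        results.extend(join_reducer(key, grouped[key]))
    return results
```

Keys that appear in only one dataset produce no output, so this behaves like an inner join.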
  • 36. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 37. Getting started: Installation options • Cloudera virtual machine • Your own virtual machine (install Ubuntu in VirtualBox, which is free) • Elastic MapReduce on EC2 • StarCluster with Hadoop on EC2 • Cloudera’s distribution of Hadoop on EC2 • Install Cloudera’s distribution of Hadoop on your own machine • Available for RPM and Debian deployments • Or download Hadoop directly from http://hadoop.apache.org/
  • 38. Getting started: Language choices • Hadoop is written in Java • However, Hadoop Streaming allows mappers and reducers in any language! • Binary data is a little tricky with Hadoop Streaming • Could use base64 encoding, but TypedBytes are much better • For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo • The Python word-count example and others come with Dumbo • Dumbo makes binary data with TypedBytes easy • Also consider Hadoopy: https://github.com/bwhite/hadoopy
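As a rough illustration of the Hadoop Streaming approach mentioned above, the word-count can be written as plain text filters. Streaming feeds each mapper and reducer lines on standard input and reads lines from standard output, with key and value separated by a tab by default. The function names below are hypothetical, and the sketch takes any iterable of lines so that it can be tested without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    # Streaming mapper: one "word<TAB>1" output line per token.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    # Streaming reducer: input arrives sorted by key, so equal keys are adjacent.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(argv, stdin, stdout):
    # Hypothetical entry point: pass "map" as the first argument to run the mapper,
    # anything else to run the reducer.
    step = map_lines if argv[1:] == ["map"] else reduce_lines
    for out in step(stdin):
        stdout.write(out + "\n")
```

Calling `main(sys.argv, sys.stdin, sys.stdout)` at the bottom of a script (after `import sys`) would let the same file serve as either the mapper or the reducer. Locally, piping the mapper output through `sort` and into the reducer mimics the sort phase.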
  • 39. Outline 1. Why should you care about Hadoop? 2. What exactly is Hadoop? 3. An overview of Hadoop Map-Reduce 4. The Hadoop Distributed File System (HDFS) 5. Hadoop advantages and disadvantages 6. Getting started with Hadoop 7. Useful resources
  • 40. Useful resources and tips • The Hadoop homepage: http://hadoop.apache.org/ • Cloudera: http://cloudera.com/ • Dumbo: http://wiki.github.com/klbostee/dumbo • Hadoopy: https://github.com/bwhite/hadoopy • Amazon Elastic Compute Cloud Getting Started Guide: • http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/ • Always test locally on a tiny dataset before running on a cluster!
  • 41. ...
  • 42. Thanks for your attention!