Petabyte scale on
commodity infrastructure
           Owen O’Malley
        Eric Baldeschwieler
             Yahoo! Inc.
    {owen,eric14}@yahoo-inc.com
Hadoop: Why?

        • Need to process huge datasets on large
          clusters of computers
        • In large clusters, nodes fail every day
              – Failure is expected, rather than exceptional.
              – The number of nodes in a cluster is not constant.
        • Very expensive to build reliability into
          each application.
        • Need common infrastructure
              – Efficient, reliable, easy to use



Yahoo! Inc.
Hadoop: How?


        • Commodity Hardware Cluster
        • Distributed File System
              – Modeled on GFS
        • Distributed Processing
          Framework
              – Using Map/Reduce metaphor
        • Open Source, Java
              – Apache Lucene subproject
Commodity Hardware Cluster




        •     Typically in a 2-level architecture
              –   Nodes are commodity PCs
              –   30-40 nodes/rack
              –   Uplink from rack is 3-4 gigabit
              –   Rack-internal is 1 gigabit
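The rack figures above imply that the uplink is shared by far more bandwidth than it can carry. A minimal sketch of that arithmetic, assuming "rack-internal is 1 gigabit" means each node has one 1 gigabit link to the rack switch (the function name is made up for illustration):

```python
def oversubscription(nodes_per_rack: int, node_link_gbit: float, uplink_gbit: float) -> float:
    """Ratio of aggregate node bandwidth inside a rack to the rack's uplink bandwidth."""
    return nodes_per_rack * node_link_gbit / uplink_gbit


# Bounds taken from the slide: 30-40 nodes/rack, 3-4 gigabit uplink.
# Assumption: every node has one 1 gigabit link to the rack switch.
best_case = oversubscription(30, 1.0, 4.0)
worst_case = oversubscription(40, 1.0, 3.0)
print(f"oversubscription between {best_case:.1f}:1 and {worst_case:.1f}:1")
```

Under that assumption, cross-rack traffic is the scarce resource, which is why keeping reads and computation inside a rack matters.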



Distributed File System

        • Single namespace for entire cluster
              – Managed by a single namenode.
              – Files are write-once.
              – Optimized for streaming reads of large files.
        • Files are broken into large blocks.
              – Typically 128 MB
              – Replicated to several datanodes, for reliability
        • Client talks to both namenode and datanodes
              – Data is not sent through the namenode.
              – Throughput of file system scales nearly linearly with the
                number of nodes.
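To make the block layout concrete, here is a small sketch of how a file could be cut into fixed-size blocks. It only illustrates the idea; the function name and the (offset, length) representation are assumptions, not the actual HDFS data structures.

```python
from typing import List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical block size quoted on the slide


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return an (offset, length) pair for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


one_gb = 1024 ** 3
blocks = split_into_blocks(one_gb + 1)
print(len(blocks))   # 8 full blocks plus one 1-byte block
print(blocks[-1])
```

Each (offset, length) pair would then be stored on several datanodes, while the namenode only tracks which blocks make up the file.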




Block Placement

        • Default is 3 replicas, but settable
        • Blocks are placed:
              –   On same node
              –   On different rack
              –   On same rack
              –   Others placed randomly
        • Clients read from closest replica
        • If the replication for a block drops below
          target, it is automatically rereplicated.
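The last two bullets can be sketched in a few lines. This is an illustration only: the distance values (0 for the same node, 2 for the same rack, 4 for a different rack) and all names are assumptions chosen for the example, not taken from the slides.

```python
from typing import Dict, List, NamedTuple


class Location(NamedTuple):
    rack: str
    node: str


def distance(a: Location, b: Location) -> int:
    """Illustrative network distance: 0 same node, 2 same rack, 4 different rack."""
    if a == b:
        return 0
    if a.rack == b.rack:
        return 2
    return 4


def closest_replica(reader: Location, replicas: List[Location]) -> Location:
    """Pick the replica with the smallest distance to the reading client."""
    return min(replicas, key=lambda replica: distance(reader, replica))


def under_replicated(block_replicas: Dict[str, List[Location]], target: int = 3) -> List[str]:
    """Return the ids of blocks with fewer live replicas than the target."""
    return sorted(b for b, locs in block_replicas.items() if len(locs) < target)


reader = Location("rack1", "node1")
replicas = [Location("rack2", "node7"), Location("rack1", "node3")]
print(closest_replica(reader, replicas))
print(under_replicated({"blk_1": replicas, "blk_2": replicas + [reader]}))
```

Here the reader picks the replica on its own rack, and only the block with two replicas is reported as needing another copy.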


Data Correctness

        • Data is checked with CRC32
        • File Creation
              – Client computes checksum per 512 bytes
              – DataNode stores the checksum
        • File access
              – Client retrieves the data and checksum
                from DataNode
              – If Validation fails, Client tries other replicas
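The checksum scheme above can be sketched with Python's standard `zlib.crc32`. The helper names and the in-memory "replicas" are hypothetical; a real client would fetch data and checksums from datanodes over the network.

```python
import zlib
from typing import List

CHUNK = 512  # bytes covered by each checksum, as on the slide


def chunk_checksums(data: bytes, chunk: int = CHUNK) -> List[int]:
    """Compute one CRC32 per chunk of the data (done by the client on write)."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def verify(data: bytes, checksums: List[int], chunk: int = CHUNK) -> bool:
    """Recompute the checksums on read and compare them with the stored ones."""
    return chunk_checksums(data, chunk) == checksums


def read_with_fallback(replicas: List[bytes], checksums: List[int]) -> bytes:
    """Try each replica in turn and return the first one that validates."""
    for data in replicas:
        if verify(data, checksums):
            return data
    raise IOError("all replicas failed checksum validation")


original = bytes(range(256)) * 5          # 1280 bytes, so 3 chunks
stored = chunk_checksums(original)
corrupted = bytearray(original)
corrupted[700] ^= 0xFF                    # flip bits in the second chunk
print(read_with_fallback([bytes(corrupted), original], stored) == original)
```

The corrupted replica fails validation on its second chunk, so the read falls through to the intact copy.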


Distributed Processing

        • User submits Map/Reduce job to JobTracker
        • System:
              – Splits job into lots of tasks
              – Monitors tasks
              – Kills and restarts them if they fail/hang/disappear
        • User can track progress of job via web UI
        • Pluggable file systems for input/output
              – Local file system for testing, debugging, etc…
              – KFS and S3 also have bindings…
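The "restart on failure" behaviour can be illustrated with a toy, single-process sketch. It is not how the JobTracker is implemented: tasks run sequentially here, and `max_attempts` and the simulated failures are assumptions made for the example.

```python
from typing import Callable, Dict


def run_job(tasks: Dict[str, Callable[[], str]], max_attempts: int = 4) -> Dict[str, str]:
    """Run every task, re-running a failed task until it succeeds or attempts run out."""
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            try:
                results[task_id] = task()
                break
            except Exception:
                if attempt == max_attempts:
                    raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a task that fails a fixed number of times before succeeding."""
    state = {"left": failures}

    def task() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise IOError("simulated node failure")
        return "done"

    return task


results = run_job({"map-0": lambda: "done", "map-1": make_flaky(2)})
print(results)
```

The job as a whole succeeds even though one task failed twice, which is the property that lets applications ignore individual node failures.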




Hadoop Map-Reduce

        •     Implementation of the Map-Reduce programming model
               – Framework for distributed processing of large data sets
                    • Data handled as collections of key-value pairs
               – Pluggable user code runs in generic framework
        •     Very common design pattern in data processing
               – Demonstrated by a unix pipeline example:
                 cat * | grep | sort    | uniq -c | cat > file
                 input | map  | shuffle | reduce  | output
               – Natural for:
                    • Log processing
                    • Web search indexing
                    • Ad-hoc queries
               – Minimizes trips to disk and disk seeks
        •     Several interfaces:
               – Java, C++, text filter
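The pipeline analogy on this slide can be mimicked in a single process. The sketch below is not the Hadoop API; the phase functions, the "ERROR" filter, and the sample log lines are assumptions chosen to mirror the grep, sort, and count stages.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str], mapper: Callable[[str], Iterable[Pair]]) -> Iterator[Pair]:
    """Apply the user's mapper to every input record (the 'grep' stage)."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group values by key and order the keys (the 'sort' stage)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups: Dict[str, List[int]], reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the user's reducer to each key group (the counting stage)."""
    return {key: reducer(key, values) for key, values in groups.items()}


def grep_mapper(line: str) -> Iterator[Pair]:
    # Emit (line, 1) only for lines that match the filter.
    if "ERROR" in line:
        yield (line, 1)


def count_reducer(key: str, values: List[int]) -> int:
    return sum(values)


logs = ["ERROR disk", "INFO ok", "ERROR disk", "ERROR net"]
counts = reduce_phase(shuffle(map_phase(logs, grep_mapper)), count_reducer)
print(counts)  # {'ERROR disk': 2, 'ERROR net': 1}
```

Only `grep_mapper` and `count_reducer` are application code; the three phase functions stand in for the generic framework that runs the pluggable user code.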




Map/Reduce Optimizations

        • Overlap of maps, shuffle, and sort
        • Mapper locality
              – Schedule mappers close to the data.
        • Combiner
              –   Mappers may generate duplicate keys
              –   Side-effect free reducer run on mapper node
              –   Minimize data size before transfer
              –   Reducer is still run
        • Speculative execution
              – Some nodes may be slower
              – Run duplicate task on another node
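The combiner idea can be shown with a word-count style sketch. This is a toy model under stated assumptions (two hypothetical input splits, in-memory lists instead of network transfer), not Hadoop code.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def mapper(line: str) -> List[Pair]:
    """Emit (word, 1) for every word; duplicate keys are expected."""
    return [(word, 1) for word in line.split()]


def combine(pairs: List[Pair]) -> List[Pair]:
    """Side-effect-free pre-aggregation run on the mapper's node."""
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return sorted(combined.items())


def reduce_counts(pairs: List[Pair]) -> Dict[str, int]:
    """The real reducer still runs on whatever pairs arrive."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


split_a = ["a b a", "a"]   # hypothetical input split on one node
split_b = ["b b"]          # hypothetical input split on another node
raw_a = [pair for line in split_a for pair in mapper(line)]
raw_b = [pair for line in split_b for pair in mapper(line)]
combined_a, combined_b = combine(raw_a), combine(raw_b)

print(len(raw_a) + len(raw_b), "pairs without combiner")
print(len(combined_a) + len(combined_b), "pairs with combiner")
print(reduce_counts(combined_a + combined_b) == reduce_counts(raw_a + raw_b))
```

The combiner halves the number of pairs sent to the reducer in this example while the final counts stay the same, which works because summing is associative and commutative.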




Running on Amazon EC2/S3

        • Amazon sells cluster services
              – EC2: $0.10/cpu hour
              – S3: $0.20/gigabyte month
        • Hadoop supports:
              – EC2: cluster management scripts included
              – S3: file system implementation included
        • Tested on 400 node cluster
        • Combination used by several startups
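A rough cost estimate follows directly from the two prices quoted above. The workload below (400 CPUs for 10 hours, 1000 gigabytes stored for one month) is a made-up example, and it assumes one billed CPU per node.

```python
from decimal import Decimal

EC2_PER_CPU_HOUR = Decimal("0.10")   # dollars, from the slide
S3_PER_GB_MONTH = Decimal("0.20")    # dollars, from the slide


def ec2_cost(cpus: int, hours: int) -> Decimal:
    """Compute cost in dollars for the given number of CPUs and hours."""
    return EC2_PER_CPU_HOUR * cpus * hours


def s3_cost(gigabytes: int, months: int) -> Decimal:
    """Storage cost in dollars for the given volume and duration."""
    return S3_PER_GB_MONTH * gigabytes * months


# Hypothetical workload: a 400-CPU cluster for 10 hours, 1000 GB kept for a month.
print(ec2_cost(400, 10))
print(s3_cost(1000, 1))
```

`Decimal` is used instead of floats so that the dollar amounts are exact.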



Hadoop On Demand

        • Traditionally Hadoop runs with dedicated
          servers
        • Hadoop On Demand works with a batch
          system to allocate and provision nodes
          dynamically
              – Bindings for Condor and Torque/Maui
        • Allows more dynamic allocation of
          resources




Yahoo’s Hadoop Clusters

      •       We have ~10,000 machines running Hadoop
      •       Our largest cluster is currently 2000 nodes
      •       1 petabyte of user data (compressed, unreplicated)
      •       We run roughly 10,000 research jobs / week




Scaling Hadoop



        • Sort benchmark
              – Sorting random data
              – Scaled to 10GB/node
        • We’ve improved both scalability and
          performance over time
        • Making improvements in frameworks helps a
          lot.




Coming Attractions


        • Block rebalancing
        • Clean up of HDFS client write protocol
              – Heading toward file append support
        •     Rack-awareness for Map/Reduce
        •     Redesign of RPC timeout handling
        •     Support for versioning in Jute/Record IO
        •     Support for users, groups, and permissions
        •     Improved utilization
              – Your feedback solicited



Upcoming Tools

        •     Pig
              – A scripting language/interpreter that makes it easy to define
                complex jobs that require multiple map/reduce jobs.
              – Currently in Apache Incubator.
        •     Zookeeper
              –     Highly available directory service
              –     Supports master election and configuration
              –     Filesystem interface
              –     Consensus building between servers
              –     Posting to SourceForge soon
        •     HBase
              – BigTable-inspired distributed object store, sorted by primary key
              – Storage in HDFS files
              – Hadoop community project

Collaboration

        •     Hadoop is an Open Source project!
        •     Please contribute your xtrace hooks
        •     Feedback on performance bottlenecks welcome
        •     Developer tooling that eases debugging and
              performance diagnosis is very welcome.
              – IBM has contributed an Eclipse plugin.
        • Interested in your thoughts on management and
          virtualization
        • Block placement in HDFS for reliability




Thank You

        • Questions?
        • For more information:
              – Blog on http://developer.yahoo.net/
              – Hadoop website: http://lucene.apache.org/hadoop/





Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
 
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptxFIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
 
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
 
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
 
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
 
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
 
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
 

Petabyte scale on commodity infrastructure

  • 1. Petabyte scale on commodity infrastructure
      Owen O’Malley, Eric Baldeschwieler
      Yahoo! Inc.
      {owen,eric14}@yahoo-inc.com
  • 2. Hadoop: Why?
      • Need to process huge datasets on large clusters of computers
      • In large clusters, nodes fail every day
          – Failure is expected, rather than exceptional.
          – The number of nodes in a cluster is not constant.
      • Very expensive to build reliability into each application.
      • Need common infrastructure
          – Efficient, reliable, easy to use
      Yahoo! Inc.
  • 3. Hadoop: How?
      • Commodity Hardware Cluster
      • Distributed File System
          – Modeled on GFS
      • Distributed Processing Framework
          – Using Map/Reduce metaphor
      • Open Source, Java
          – Apache Lucene subproject
  • 4. Commodity Hardware Cluster
      • Typically in a 2-level architecture
          – Nodes are commodity PCs
          – 30-40 nodes/rack
          – Uplink from rack is 3-4 gigabit
          – Rack-internal is 1 gigabit
  • 5. Distributed File System
      • Single namespace for entire cluster
          – Managed by a single namenode.
          – Files are write-once.
          – Optimized for streaming reads of large files.
      • Files are broken into large blocks.
          – Typically 128 MB
          – Replicated to several datanodes, for reliability
      • Client talks to both namenode and datanodes
          – Data is not sent through the namenode.
          – Throughput of file system scales nearly linearly with the number of nodes.
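The block layout on this slide can be illustrated with a little arithmetic. This is a minimal sketch, not HDFS code: the function names are hypothetical, the 128 MB block size is the "typical" value from the slide, and the replication factor of 3 is assumed from the default given on the next slide.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # "Typically 128 MB", per the slide
REPLICATION = 3                 # assumption: the default from slide 6


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Every block is block_size bytes long except possibly the last one.
    """
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication
```

Calling `split_into_blocks(300 * 1024 * 1024)` yields two full blocks and one shorter tail block; `raw_storage` shows why replicated capacity has to be budgeted at a multiple of the user-visible data size.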
  • 6. Block Placement
      • Default is 3 replicas, but settable
      • Blocks are placed:
          – On same node
          – On different rack
          – On same rack
          – Others placed randomly
      • Clients read from closest replica
      • If the replication for a block drops below target, it is automatically re-replicated.
  • 7. Data Correctness
      • Data is checked with CRC32
      • File creation
          – Client computes a checksum per 512 bytes
          – DataNode stores the checksum
      • File access
          – Client retrieves the data and checksum from DataNode
          – If validation fails, client tries other replicas
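The checksum scheme on this slide can be sketched as follows. This is an illustration under stated assumptions rather than the HDFS client code: replicas are modelled as in-memory `(data, stored_checksums)` pairs, and the names `chunk_checksums` and `read_verified` are hypothetical. Only the CRC32 algorithm and the 512-byte granularity come from the slide.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, per the slide


def chunk_checksums(data, chunk=CHUNK):
    """CRC32 of each chunk-sized slice of data (the last slice may be shorter)."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def read_verified(replicas):
    """Return the data of the first replica that passes validation.

    replicas: list of (data, stored_checksums) pairs, one per hypothetical
    datanode copy, ordered from closest to farthest.
    """
    for data, stored in replicas:
        if chunk_checksums(data) == stored:
            return data
        # Validation failed: fall through and try the next replica.
    raise IOError("all replicas failed checksum validation")
```

On the write path the client would compute `chunk_checksums(data)` once and hand the result to the datanode alongside the data; on the read path `read_verified` recomputes the checksums and moves on to another replica when they do not match.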
  • 8. Distributed Processing
      • User submits Map/Reduce job to JobTracker
      • System:
          – Splits job into lots of tasks
          – Monitors tasks
          – Kills and restarts if they fail/hang/disappear
      • User can track progress of job via web UI
      • Pluggable file systems for input/output
          – Local file system for testing, debugging, etc…
          – KFS and S3 also have bindings…
  • 9. Hadoop Map-Reduce
      • Implementation of the Map-Reduce programming model
          – Framework for distributed processing of large data sets
      • Data handled as collections of key-value pairs
          – Pluggable user code runs in generic framework
      • Very common design pattern in data processing
          – Demonstrated by a unix pipeline example:
              cat *  | grep | sort    | uniq -c | cat > file
              input  | map  | shuffle | reduce  | output
          – Natural for:
              • Log processing
              • Web search indexing
              • Ad-hoc queries
          – Minimizes trips to disk and disk seeks
      • Several interfaces:
          – Java, C++, text filter
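The pipeline analogy on this slide can be made concrete with a single-process sketch. It is not the Hadoop API: `map_phase`, `shuffle`, `reduce_phase`, `grep_mapper`, and `count_reducer` are hypothetical names, and the filter on words starting with "err" is an arbitrary stand-in for the `grep` stage.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user's mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key and sort by key, like the `sort` stage of the pipeline."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the user's reducer once per key."""
    return [(key, reducer(key, values)) for key, values in groups]


def grep_mapper(line):
    """Plays the role of `grep`: emit (word, 1) for each matching word."""
    for word in line.split():
        if word.startswith("err"):
            yield (word, 1)


def count_reducer(key, values):
    """Plays the role of `uniq -c`: count the occurrences of one key."""
    return sum(values)
```

A run chains the three phases: `reduce_phase(shuffle(map_phase(lines, grep_mapper)), count_reducer)`. Only `grep_mapper` and `count_reducer` are job-specific; the other three functions are the generic framework that the pluggable user code runs in.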
  • 10. Map/Reduce Optimizations
      • Overlap of maps, shuffle, and sort
      • Mapper locality
          – Schedule mappers close to the data.
      • Combiner
          – Mappers may generate duplicate keys
          – Side-effect free reducer run on mapper node
          – Minimize data size before transfer
          – Reducer is still run
      • Speculative execution
          – Some nodes may be slower
          – Run duplicate task on another node
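The combiner bullet can be illustrated with a word count. This is a minimal sketch with hypothetical function names, assuming a reduce operation (summation) that is associative and commutative so that pre-aggregating on the mapper node cannot change the final answer.

```python
from collections import Counter


def mapper(line):
    """Emit (word, 1) for every word; duplicate keys are expected."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Side-effect free pre-reduction run on the mapper node.

    Collapses duplicate keys so fewer pairs cross the network.
    """
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_all(per_mapper_outputs):
    """The real reducer: it still runs, whether or not a combiner ran first."""
    totals = Counter()
    for pairs in per_mapper_outputs:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)
```

Feeding `reduce_all` either the raw mapper outputs or the combined ones gives the same totals; the difference is only in how many pairs each mapper has to ship, which is the point of the optimization.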
  • 11. Running on Amazon EC2/S3
      • Amazon sells cluster services
          – EC2: $0.10/cpu hour
          – S3: $0.20/gigabyte month
      • Hadoop supports:
          – EC2: cluster management scripts included
          – S3: file system implementation included
      • Tested on 400 node cluster
      • Combination used by several startups
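The prices quoted on this slide lend themselves to a quick back-of-the-envelope estimate. The sketch below uses only the two rates from the slide; the function names are hypothetical, and one CPU per node is an assumption, since the slide does not say how many CPUs a node has. Amounts are kept in integer cents to avoid floating-point rounding.

```python
EC2_CENTS_PER_CPU_HOUR = 10  # $0.10/cpu hour, per the slide
S3_CENTS_PER_GB_MONTH = 20   # $0.20/gigabyte month, per the slide


def ec2_cost_cents(nodes, hours, cpus_per_node=1):
    """Compute cost in cents; cpus_per_node=1 is an assumption, not from the slide."""
    return nodes * cpus_per_node * hours * EC2_CENTS_PER_CPU_HOUR


def s3_cost_cents(gigabytes, months):
    """Storage cost in cents for holding gigabytes of data for months."""
    return gigabytes * months * S3_CENTS_PER_GB_MONTH
```

Under these assumptions, `ec2_cost_cents(400, 1)` prices one hour on a cluster the size of the 400-node test, and `s3_cost_cents(1000, 1)` prices a month of storage for roughly a terabyte.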
  • 12. Hadoop On Demand
      • Traditionally Hadoop runs with dedicated servers
      • Hadoop On Demand works with a batch system to allocate and provision nodes dynamically
          – Bindings for Condor and Torque/Maui
      • Allows more dynamic allocation of resources
  • 13. Yahoo’s Hadoop Clusters
      • We have ~10,000 machines running Hadoop
      • Our largest cluster is currently 2000 nodes
      • 1 petabyte of user data (compressed, unreplicated)
      • We run roughly 10,000 research jobs / week
  • 14. Scaling Hadoop
      • Sort benchmark
          – Sorting random data
          – Scaled to 10 GB/node
      • We’ve improved both scalability and performance over time
      • Making improvements in frameworks helps a lot.
  • 15. Coming Attractions
      • Block rebalancing
      • Cleanup of HDFS client write protocol
          – Heading toward file append support
      • Rack-awareness for Map/Reduce
      • Redesign of RPC timeout handling
      • Support for versioning in Jute/Record IO
      • Support for users, groups, and permissions
      • Improved utilization
          – Your feedback solicited
  • 16. Upcoming Tools
      • Pig
          – A scripting language/interpreter that makes it easy to define complex jobs that require multiple map/reduce jobs.
          – Currently in Apache Incubator.
      • Zookeeper
          – Highly available directory service
          – Supports master election and configuration
          – Filesystem interface
          – Consensus building between servers
          – Posting to SourceForge soon
      • HBase
          – BigTable-inspired distributed object store, sorted by primary key
          – Storage in HDFS files
          – Hadoop community project
  • 17. Collaboration
      • Hadoop is an Open Source project!
      • Please contribute your xtrace hooks
      • Feedback on performance bottlenecks welcome
      • Developer tooling that eases debugging and performance diagnosis is very welcome.
          – IBM has contributed an Eclipse plugin.
      • Interested in your thoughts on management and virtualization
      • Block placement in HDFS for reliability
  • 18. Thank You
      • Questions?
      • For more information:
          – Blog on http://developer.yahoo.net/
          – Hadoop website: http://lucene.apache.org/hadoop/