An Introduction to MapReduce
             Presented by Frane Bandov
    at the Operating Complex IT-Systems seminar
                  Berlin, 1/26/2010
Outline
•  Introduction
•  Google MapReduce
    –  Idea
    –  Overview
    –  Fault Tolerance
    –  GFS: Google File System
    –  Job Example
•  Alternative Implementations
•  Reception and Criticism
•  Trends and Future Development
•  Conclusion
2/16/10             An Introduction to MapReduce   2
Introduction – Problem
Sometimes we have to deal with huge amounts of data
[Figure: bar chart of data volumes in TBytes (axis from 0 to 250) for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]

Introduction – Problem
The data needs to be processed, but how?
Can't process all of this data on one machine
→ Distribute the processing to many machines

Introduction – Approach
Distributed computing is the solution
“Let’s write our own distributed computing software as a solution to our problem”
Checklist
•  design protocols
•  design data structures
•  write the code
•  assure failure tolerance
Development takes a long time
Expensive: Cost-benefit ratio?
Build complex software for simple computations?

Google MapReduce – Idea
A framework for distributed computing
Don't care about protocols, failure tolerance, etc.
Just write your simple computation

Google MapReduce – Idea
MapReduce Paradigm
Map: Apply a function to all elements of a list
    square x = x * x;
    map square [1, 2, 3, 4, 5];
    → [1, 4, 9, 16, 25]
Reduce: Combine all elements of a list
    reduce (+) [1, 2, 3, 4, 5];
    → 15

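The two primitives on this slide can be tried directly in Python. This is a minimal sketch using the built-in `map` and `functools.reduce`, which stand in for the slide's functional-language notation and are not specific to Google's framework.

```python
from functools import reduce
from operator import add

def square(x):
    """Return x squared, mirroring the slide's `square x = x * x`."""
    return x * x

numbers = [1, 2, 3, 4, 5]

# Map: apply a function to every element of a list.
squares = list(map(square, numbers))

# Reduce: combine all elements of a list into a single value.
total = reduce(add, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```
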
Google MapReduce – Idea
Basic functioning
Input → Map → Reduce → Output

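As a rough illustration of the flow from input through map and reduce to output, the sketch below counts words in a single process. Word counting, the function names (`map_fn`, `reduce_fn`, `run_job`) and the explicit grouping step between the two phases are assumptions chosen for illustration; the slide does not prescribe them.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    # Map phase: every input record is processed independently.
    pairs = [pair for line in lines for pair in map_fn(line)]

    # Group intermediate pairs by key so each reduce call sees one key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)

    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))

result = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(result)
```

Because each `map_fn` call only looks at its own line and each `reduce_fn` call only looks at one key, both phases could in principle be spread over many machines.
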
Google MapReduce – Overview
[Figure: a MapReduce-based user program with one master and several workers. The input file in GFS is divided into splits 1 to 5. In the map phase, three workers write intermediate files 1 to 3. In the reduce phase, two workers write output files 1 and 2 to GFS.]

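The data flow of the overview can be mimicked in one process. This sketch makes several assumptions the slide does not state: intermediate pairs are assigned to reduce workers by a checksum of the key (`zlib.crc32`), the job sums values per key, and in-memory lists stand in for GFS splits and intermediate files. All function names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # two reduce workers, one output file each

def partition(key, num_reducers=NUM_REDUCERS):
    """Pick a reduce worker for a key (assumed scheme: checksum modulo R)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_worker(split):
    """Process one input split and return one bucket of pairs per reducer."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for word in split.split():
        buckets[partition(word)].append((word, 1))
    return buckets

def reduce_worker(pairs):
    """Sum the values of every key assigned to this reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def master(splits):
    """Hand splits to map workers, then route their buckets to reduce workers."""
    intermediate = [map_worker(split) for split in splits]
    outputs = []
    for r in range(NUM_REDUCERS):
        pairs = [pair for buckets in intermediate for pair in buckets[r]]
        outputs.append(reduce_worker(pairs))
    return outputs  # one "output file" per reduce worker

splits = ["a b a", "b c", "a c c", "d", "a d"]  # five input splits
output_files = master(splits)
print(output_files)
```

Since a key always maps to the same reducer, every key ends up in exactly one output file.
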
MapReduce – Fault Tolerance
•  Workers are periodically pinged by the master
•  No answer within a certain time → worker is considered failed

Mapper fails:
     –  Reset its map job to idle
     –  Do this even if the job was completed, because its intermediate files are inaccessible
     –  Notify reducers where to get the new intermediate file
Reducer fails:
     –  Reset its job to idle

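The failure handling above can be sketched as a small bookkeeping class on the master. The class names, the timeout value and the task-state strings are hypothetical, and time is passed in explicitly so the example stays deterministic. Leaving completed reduce tasks untouched is an assumption that goes beyond the slide; it rests on their output already being stored in GFS.

```python
from dataclasses import dataclass
from typing import Optional

PING_TIMEOUT = 10.0  # seconds without an answer before a worker counts as failed (assumed value)

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in_progress" or "completed"
    worker: Optional[str] = None  # worker currently or last assigned

class Master:
    def __init__(self, tasks):
        self.tasks = tasks
        self.last_answer = {}  # worker name -> time of last ping answer

    def record_answer(self, worker, now):
        self.last_answer[worker] = now

    def check_workers(self, now):
        """Reset the tasks of every worker that has not answered in time."""
        failed = [w for w, t in self.last_answer.items() if now - t > PING_TIMEOUT]
        for worker in failed:
            del self.last_answer[worker]
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # Map tasks are reset even when completed, because their
                # intermediate files are no longer reachable. Reduce tasks
                # are reset only if they were still running (assumption).
                if task.kind == "map" or task.state == "in_progress":
                    task.state, task.worker = "idle", None
        return failed

tasks = [Task("map", "completed", "w1"),
         Task("reduce", "in_progress", "w1"),
         Task("reduce", "completed", "w2")]
master = Master(tasks)
master.record_answer("w1", now=0.0)
master.record_answer("w2", now=12.0)
failed = master.check_workers(now=15.0)
print(failed, [t.state for t in tasks])
```
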
MapReduce – Fault Tolerance
Master fails:
     –  The master periodically writes checkpoints
     –  In case of failure, the MapReduce operation is aborted
     –  The operation can be restarted from the last checkpoint

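A checkpoint can be as simple as a serialized snapshot of the master's task table. The sketch below uses JSON and a temporary directory; the file format, the function names and the task-table layout are assumptions for illustration, not a description of Google's implementation.

```python
import json
import os
import tempfile

def write_checkpoint(path, task_states):
    """Write the task table to a temp file, then move it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(task_states, f)
    os.replace(tmp_path, path)  # the swap avoids leaving a half-written checkpoint

def restore_checkpoint(path):
    """Load the last task table; unfinished work is scheduled again."""
    with open(path, encoding="utf-8") as f:
        task_states = json.load(f)
    return {task: ("completed" if state == "completed" else "idle")
            for task, state in task_states.items()}

checkpoint_path = os.path.join(tempfile.mkdtemp(), "master.ckpt")
write_checkpoint(checkpoint_path,
                 {"map-1": "completed", "map-2": "in_progress", "reduce-1": "idle"})
restored = restore_checkpoint(checkpoint_path)
print(restored)
```

After a restart only the tasks that were not yet completed at the last checkpoint need to run again.
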
Google MapReduce – GFS
Google File System
•  In-house distributed file system at Google
•  Stores all input and output files
•  Stores files…
     – divided into 64 MB blocks
     – on at least 3 different machines
•  Machines running GFS also run MapReduce

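The two storage figures on this slide translate into simple arithmetic. The helper below is a hypothetical back-of-the-envelope calculation that takes the 64 MB block size and the 3 copies from the slide and assumes 1 MB = 2^20 bytes.

```python
BLOCK_SIZE = 64 * 2**20  # 64 MB per block (from the slide)
REPLICAS = 3             # minimum number of copies (from the slide)

def storage_footprint(file_size_bytes):
    """Return (number of blocks, number of stored block copies) for one file."""
    # Integer ceiling division: a partly filled last block still counts as a block.
    blocks = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
    return blocks, blocks * REPLICAS

one_gb = 2**30
blocks, copies = storage_footprint(one_gb)
print(blocks, copies)  # 16 48
```
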
Google MapReduce – Job Example
[Figures: four slides showing a job example; the images are not part of this text]

Alternative Implementations
Apache Hadoop
•  Open-source implementation in Java
•  Jobs can be written in C++, Java, Python, etc.
•  Used by Yahoo!, Facebook, Amazon and others
•  Most commonly used implementation
•  HDFS as an open-source implementation of GFS
•  Can also use Amazon S3, HTTP(S) or FTP
•  Extensions: Hive, Pig, HBase

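Jobs in languages other than Java are commonly run through Hadoop Streaming, where the mapper and reducer are programs that read lines from standard input and write tab-separated key/value lines to standard output, with the reducer's input arriving sorted by key. The sketch below follows that convention for a word count. The function names are hypothetical, and the functions take and return iterables of lines so they can be tried without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the framework: map, sort by key, reduce.
mapped = sorted(mapper(["to be or not", "to be"]))
counts = list(reducer(mapped))
print(counts)
```

In an actual streaming job, each function would sit in its own script that iterates over `sys.stdin` and prints every resulting line.
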
Alternative Implementations
•  Mars: MapReduce implementation for NVIDIA GPUs using the CUDA framework
•  MapReduce-Cell: Implementation for the Cell multi-core processor
•  Qizmt: MySpace’s implementation of MapReduce in C#

Alternative Implementations


There are many other open- and closed-source implementations of MapReduce!

Reception and Criticism
•  Yahoo!: Hadoop on a 10,000-server cluster
•  Facebook analyses the daily log (25 TB) on a 1,000-server cluster
•  Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3
•  IBM and Google: Support university courses in distributed programming
•  UC Berkeley announced that it will teach freshmen MapReduce programming

Reception and Criticism
•  Criticism mainly by RDBMS experts
   DeWitt and Stonebraker
•  MapReduce
     – is a step backwards in database access
     – is a poor implementation
     – is not novel
     – is missing features that are routinely provided
       by modern DBMSs
     – is incompatible with the DBMS tools
Reception and Criticism
Response to criticism
MapReduce is not an RDBMS
It is well suited to processing and structuring huge amounts of unstructured data
MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers

Trends and Future Development
Trend of utilizing MapReduce/Hadoop as a parallel database

•  Hive: Query language for Hadoop
•  HBase: Column-oriented distributed database (modeled after Google’s BigTable)
•  Map-Reduce-Merge: Adding a merge step to the paradigm allows implementing features of relational algebra

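To illustrate how a merge step brings in relational features, the sketch below joins the reduced outputs of two separate datasets on their shared key, which corresponds to a relational equi-join. The datasets, the field layout and the `map_reduce` and `merge` functions are made up for illustration and do not reproduce the interface of the Map-Reduce-Merge proposal.

```python
from collections import defaultdict

def map_reduce(records, key_fn, value_fn, combine):
    """Tiny single-process map/reduce: group values by key, then combine them."""
    grouped = defaultdict(list)
    for record in records:  # map: emit (key, value)
        grouped[key_fn(record)].append(value_fn(record))
    return {k: combine(v) for k, v in grouped.items()}  # reduce per key

def merge(left, right):
    """Merge two reduced outputs on matching keys (an equi-join)."""
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}

# Hypothetical inputs: (name, department, bonus) and (department, city).
employees = [("alice", "sales", 100), ("bob", "sales", 50), ("carol", "it", 70)]
departments = [("sales", "Berlin"), ("it", "Munich"), ("hr", "Hamburg")]

bonus_per_dept = map_reduce(employees, lambda e: e[1], lambda e: e[2], sum)
city_per_dept = map_reduce(departments, lambda d: d[0], lambda d: d[1], lambda v: v[0])

joined = merge(bonus_per_dept, city_per_dept)
print(joined)
```

Departments without a partner on the other side, such as `hr` here, drop out of the result, as they would in an inner join.
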
Trends and Future Development
Trend toward using the MapReduce paradigm to better utilize multi-core CPUs

•  Qt Concurrent
     –  Simplified C++ version of MapReduce for distributing tasks between multiple processor cores
•  Mars
•  MapReduce-Cell

Conclusion
MapReduce
•  provides an easy solution for processing large amounts of data
•  brings a paradigm shift in programming
•  changed the world, i.e. it made data processing more efficient and cheaper and is the foundation of many other approaches and solutions

Questions?




Thank You!





More Related Content

What's hot

Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
Vincenzo Gulisano
 
Hadoop
HadoopHadoop
MapReduce
MapReduceMapReduce
MapReduce
Amir Payberah
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
sudhakara st
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
Cloudera, Inc.
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
Anuja Gunale
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
Snowflake Computing
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
Cloudera, Inc.
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
Rahul Agarwal
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Simplilearn
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
YounesCharfaoui
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
Michelle Darling
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Sri Prasanna
 
Apache hive
Apache hiveApache hive
Apache hive
pradipbajpai68
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
Paladion Networks
 

What's hot (20)

Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
Hadoop
HadoopHadoop
Hadoop
 
MapReduce
MapReduceMapReduce
MapReduce
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hadoop HDFS Architeture and Design
Hadoop HDFS Architeture and DesignHadoop HDFS Architeture and Design
Hadoop HDFS Architeture and Design
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Snowflake Overview
Snowflake OverviewSnowflake Overview
Snowflake Overview
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
Cassandra Database
Cassandra DatabaseCassandra Database
Cassandra Database
 
Cassandra NoSQL Tutorial
Cassandra NoSQL TutorialCassandra NoSQL Tutorial
Cassandra NoSQL Tutorial
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache hive
Apache hiveApache hive
Apache hive
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 

Similar to An Introduction to MapReduce

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
Jakir Hossain
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
Urvashi Kataria
 
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
bhuvankumar3877
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
dbpublications
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
Ankit Gupta
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
AdarshaDhakal
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
Bhushan Kulkarni
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
GERARDO BARBERENA
 
Hadoop
HadoopHadoop
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
 
E031201032036
E031201032036E031201032036
E031201032036
ijceronline
 
An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)
Yu Liu
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
Lukas Vlcek
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
Reynold Xin
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
SANTOSH WAYAL
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
BADR
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
ijsrd.com
 
Big Data Technology
Big Data TechnologyBig Data Technology
Big Data Technology
Juan J. Mostazo
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
jani shaik
 

Similar to An Introduction to MapReduce (20)

Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptxCLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
CLOUD_COMPUTING_MODULE4_RK_BIG_DATA.pptx
 
Mapreduce Hadop.pptx
Mapreduce Hadop.pptxMapreduce Hadop.pptx
Mapreduce Hadop.pptx
 
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot ConfigurationsMap Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
Map Reduce Workloads: A Dynamic Job Ordering and Slot Configurations
 
Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)Mod05lec23(map reduce tutorial)
Mod05lec23(map reduce tutorial)
 
MapReduce Programming Model
MapReduce Programming ModelMapReduce Programming Model
MapReduce Programming Model
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
E031201032036
E031201032036E031201032036
E031201032036
 
An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)An Enhanced MapReduce Model (on BSP)
An Enhanced MapReduce Model (on BSP)
 
An Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBaseAn Introduction to Apache Hadoop, Mahout and HBase
An Introduction to Apache Hadoop, Mahout and HBase
 
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...
 
A data aware caching 2415
A data aware caching 2415A data aware caching 2415
A data aware caching 2415
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
A Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis TechniquesA Survey on Big Data Analysis Techniques
A Survey on Big Data Analysis Techniques
 
Big Data Technology
Big Data TechnologyBig Data Technology
Big Data Technology
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 

Recently uploaded

History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Badri_Bady
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
Low Hong Chuan
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Alliance
 
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
Priyanka Aash
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Zilliz
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Alliance
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Zilliz
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
Bhajan Mehta
 
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Alliance
 
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+
Zilliz
 
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
AmandaCheung15
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Zilliz
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
DianaGray10
 
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Quentin Reul
 
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
Sara Kroft
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Alliance
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
 
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 

Recently uploaded (20)

History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
 
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
 
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
 
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+
 
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
 
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
 
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
 
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
 

An Introduction to MapReduce

  • 1. An Introduction to MapReduce Presented by Frane Bandov at the Operating Complex IT-Systems seminar Berlin, 1/26/2010
  • 2. Outline •  Introduction •  Google MapReduce –  Idea –  Overview –  Fault Tolerance –  GFS: Google File System –  Job Example •  Alternative Implementations •  Reception and Criticism •  Trends and Future Development •  Conclusion 2/16/10 An Introduction to MapReduce 2
  • 3. Outline •  Introduction •  Google MapReduce –  Idea –  Overview –  Fault Tolerance –  GFS: Google File System –  Job Example •  Alternative Implementations •  Reception and Criticism •  Trends and Future Development •  Conclusion 2/16/10 An Introduction to MapReduce 3
  • 4. Introduction – Problem Sometimes we have to deal with huge amounts of data TBytes 250 200 150 100 50 0 You Facebook Yahoo! Groups German Climate Computing Centre 2/16/10 An Introduction to MapReduce 4
  • 5. Introduction – Problem The data needs to be processed, but how? Can‘t process all of this data on one machine  Distribute the processing to many machines 2/16/10 An Introduction to MapReduce 5
  • 6. Introduction – Approach Distributed computing is the solution “Let’s write our own distributed computing software as a solution to our problem” Checklist  design protocols   evelopment takes a long time D  design data structures  write the code  Expensive: Cost-benefit ratio?  assure failure tolerance Build complex software for simple computations? 2/16/10 An Introduction to MapReduce 6
  • 7. Outline •  Introduction •  Google MapReduce –  Idea –  Overview –  Fault Tolerance –  GFS: Google File System –  Job Example •  Alternative Implementations •  Reception and Criticism •  Trends and Future Development •  Conclusion 2/16/10 An Introduction to MapReduce 7
  • 8. Google MapReduce – Idea A framework for distributed computing Don‘t care about protocols, failure tolerance, etc. Just write your simple computation 2/16/10 An Introduction to MapReduce 8
  • 9. Google MapReduce – Idea MapReduce Paradigm Map: Reduce: Apply function to all Combine all elements elements of a list of a list square x = x * x; reduce (+)[1, 2, 3, 4, 5]; map square [1, 2, 3, 4, 5];  [1, 4, 9, 16, 25]  15 2/16/10 An Introduction to MapReduce 9
  • 10. Google MapReduce – Idea: Basic functioning: Input → Map → Reduce → Output
  • 11. Google MapReduce – Overview [Diagram: a MapReduce-based user program and a master; an input file in GFS divided into Splits 1–5; map-phase workers producing Intermediate Files 1–3; reduce-phase workers writing Output Files 1 and 2 to GFS]
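The data flow in the overview diagram (input splits, map phase, intermediate files, reduce phase, output files) can be imitated in a single process. The sketch below is not Google's implementation: word counting is an example job chosen here, and `split_input`, `map_task`, `reduce_task` and `run_job` are hypothetical names. It assumes that each map task sorts its output into one bucket per reduce task, standing in for the intermediate files.

```python
import zlib
from collections import defaultdict


def split_input(lines, num_splits):
    """Divide the input into splits, one per map task."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_task(split, num_reducers):
    """Map phase: emit (word, 1) pairs, bucketed by the reducer that owns the word."""
    buckets = [[] for _ in range(num_reducers)]
    for line in split:
        for word in line.split():
            key = word.lower()
            # Deterministic partitioning: the same word always goes to the same reducer.
            buckets[zlib.crc32(key.encode()) % num_reducers].append((key, 1))
    return buckets


def reduce_task(pairs):
    """Reduce phase: sum the counts of every word assigned to this reducer."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def run_job(lines, num_splits=5, num_reducers=2):
    """Run all map tasks, then all reduce tasks; return one dict per output file."""
    map_outputs = [map_task(s, num_reducers) for s in split_input(lines, num_splits)]
    output_files = []
    for r in range(num_reducers):
        pairs = [pair for buckets in map_outputs for pair in buckets[r]]
        output_files.append(reduce_task(pairs))
    return output_files


lines = ["the quick brown fox", "the lazy dog", "the fox"]
outputs = run_job(lines)
```

With five splits and two reducers the job returns two output dictionaries. Because of the partitioning, every word appears in exactly one of them.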
  • 12. MapReduce – Fault Tolerance •  Workers are periodically pinged by the master •  No answer within a certain time → worker failed. Mapper fails: –  Reset its map job to idle, even if the job was already completed, because its intermediate files are inaccessible –  Notify reducers where to get the new intermediate file. Reducer fails: –  Reset its job to idle
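The worker-failure rules above can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not the real master: `Master`, `Task`, `record_pong` and `check_workers` are hypothetical names, time is passed in explicitly to keep the example deterministic, and completed reduce tasks are left alone on the assumption that their output is already stored in GFS, as the original MapReduce paper describes.

```python
from dataclasses import dataclass
from typing import Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None   # id of the worker the task is assigned to


class Master:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}        # worker id -> time of its last ping reply
        self.tasks = []

    def record_pong(self, worker, now):
        """Remember when a worker last answered a ping."""
        self.last_seen[worker] = now

    def check_workers(self, now):
        """Declare silent workers failed and reset their tasks to idle."""
        failed = [w for w, t in self.last_seen.items() if now - t > self.timeout]
        for worker in failed:
            del self.last_seen[worker]
            for task in self.tasks:
                if task.worker != worker:
                    continue
                # Map output sits on the failed machine, so even completed
                # map tasks are reset. Reduce tasks are reset only if unfinished.
                if task.kind == "map" or task.state == IN_PROGRESS:
                    task.state = IDLE
                    task.worker = None
        return failed


master = Master(timeout=10)
master.tasks = [
    Task("map", COMPLETED, "w1"),
    Task("reduce", IN_PROGRESS, "w1"),
    Task("reduce", COMPLETED, "w1"),
    Task("map", IN_PROGRESS, "w2"),
]
master.record_pong("w1", now=0)
master.record_pong("w2", now=8)
failed = master.check_workers(now=15)
```

Here worker `w1` has been silent for longer than the timeout, so its completed map task and its running reduce task return to the idle pool, while the task on `w2` is untouched.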
  • 13. MapReduce – Fault Tolerance Master fails: –  The master periodically sets checkpoints –  In case of failure the MapReduce operation is aborted –  The operation can be restarted from the last checkpoint
  • 14. Google MapReduce – GFS: Google File System •  In-house distributed file system at Google •  Stores all input and output files •  Stores files divided into 64 MB blocks, each on at least 3 different machines •  Machines running GFS also run MapReduce
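The two storage figures on this slide (64 MB blocks, at least 3 copies) lend themselves to a short calculation. The sketch below only illustrates the arithmetic: the round-robin placement in `place_replicas` is an assumption made here and is not how GFS actually chooses machines, and all function and machine names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, as stated on the slide
REPLICAS = 3                    # minimum number of copies per block


def num_blocks(file_size):
    """Number of 64 MB blocks needed for a file of file_size bytes (rounded up)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE


def place_replicas(file_size, machines, replicas=REPLICAS):
    """Assign every block to `replicas` distinct machines (simple round-robin)."""
    if len(machines) < replicas:
        raise ValueError("need at least as many machines as replicas")
    placement = []
    for block in range(num_blocks(file_size)):
        placement.append(
            [machines[(block + i) % len(machines)] for i in range(replicas)]
        )
    return placement


one_gb = 1024 * 1024 * 1024
blocks_for_1gb = num_blocks(one_gb)   # 16 blocks

placement = place_replicas(200 * 1024 * 1024, ["m0", "m1", "m2", "m3"])
```

A 200 MB file needs four blocks, the last one only partly filled. Since the machines that store the blocks also run MapReduce, a map task can often be scheduled on a machine that already holds its input split.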
  • 15–18. Google MapReduce – Job Example [four slides of figures, not reproduced in the transcript]
  • 20. Alternative Implementations: Apache Hadoop •  Open-source implementation in Java •  Jobs can be written in C++, Java, Python, etc. •  Used by Yahoo!, Facebook, Amazon and others •  Most commonly used implementation •  HDFS as open-source implementation of GFS •  Can also use Amazon S3, HTTP(S) or FTP •  Extensions: Hive, Pig, HBase
  • 21. Alternative Implementations •  Mars: MapReduce implementation for NVIDIA GPUs using the CUDA framework •  MapReduce-Cell: implementation for the Cell multi-core processor •  Qizmt: MySpace’s implementation of MapReduce in C#
  • 22. Alternative Implementations: There are many other open- and closed-source implementations of MapReduce!
  • 24. Reception and Criticism •  Yahoo!: Hadoop on a 10,000 server cluster •  Facebook analyses the daily log (25 TB) on a 1,000 server cluster •  Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3 •  IBM and Google: support university courses in distributed programming •  UC Berkeley announced that it would teach freshmen MapReduce programming
  • 26. Reception and Criticism •  Criticism mainly by RDBMS experts DeWitt and Stonebraker •  MapReduce – is a step backwards in database access – is a poor implementation – is not novel – is missing features that are routinely provided by modern DBMSs – is incompatible with the DBMS tools
  • 27. Reception and Criticism: Response to criticism. MapReduce is not an RDBMS. It is well suited to processing and structuring huge amounts of unstructured data. MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers.
  • 29. Trends and Future Development: Trend of utilizing MapReduce/Hadoop as a parallel database •  Hive: query language for Hadoop •  HBase: column-oriented distributed database (modeled after Google’s BigTable) •  Map-Reduce-Merge: adding merge to the paradigm allows implementing features of relational algebra
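The Map-Reduce-Merge bullet can be made concrete with a toy join. The sketch below only conveys the idea that a merge step combines the outputs of two separate map/reduce passes on matching keys, which is how a relational equi-join can be expressed. It is not the interface of any real system: `map_reduce`, `merge` and the `orders` and `visits` data are all hypothetical.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run one in-memory map/reduce pass and return {key: reduced value}."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in groups.items()}


def merge(left, right, merge_fn):
    """Merge step: combine two reduced outputs on matching keys (an equi-join)."""
    return {k: merge_fn(left[k], right[k]) for k in left.keys() & right.keys()}


# Two independent data sets keyed by user name (made-up example data).
orders = [("alice", 30), ("bob", 20), ("alice", 12)]
visits = [("alice", "home"), ("bob", "home"), ("bob", "cart"), ("carol", "home")]

spend = map_reduce(orders, lambda r: [(r[0], r[1])], sum)
visit_count = map_reduce(visits, lambda r: [(r[0], 1)], sum)

# Join the two reduced outputs; users present in only one of them are dropped.
joined = merge(spend, visit_count, lambda s, v: {"spend": s, "visits": v})
```

Plain MapReduce processes one data set per job, so a join like this otherwise has to be encoded by hand. The added merge step makes it a first-class operation.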
  • 30. Trends and Future Development: Trend to use the MapReduce paradigm to better utilize multi-core CPUs •  Qt Concurrent –  Simplified C++ version of MapReduce for distributing tasks between multiple processor cores •  Mars •  MapReduce-Cell
  • 32. Conclusion: MapReduce •  provides an easy solution for the processing of large amounts of data •  brings a paradigm shift in programming •  changed the world, i.e. made data processing more efficient and cheaper •  is the foundation of many other approaches and solutions
  • 33. Questions?
  • 34. Thank You!