The document presents an introduction to MapReduce. It discusses how MapReduce provides an easy framework for distributed computing by allowing programmers to write simple map and reduce functions without worrying about complex distributed systems issues. It outlines Google's implementation of MapReduce and how it uses the Google File System for fault tolerance. Alternative open-source implementations like Apache Hadoop are also covered. The document discusses how MapReduce has been widely adopted by companies to process massive amounts of data and analyzes some criticism of MapReduce from database experts. It concludes by noting trends in using MapReduce as a parallel database and for multi-core processing.
An Introduction to MapReduce
1. An Introduction to MapReduce
Presented by Frane Bandov
at the Operating Complex IT-Systems seminar
Berlin, 1/26/2010
2. Outline
• Introduction
• Google MapReduce
– Idea
– Overview
– Fault Tolerance
– GFS: Google File System
– Job Example
• Alternative Implementations
• Reception and Criticism
• Trends and Future Development
• Conclusion
4. Introduction – Problem
Sometimes we have to deal with huge amounts of data
[Bar chart, 0–250 TBytes: stored data volumes for You, Facebook, Yahoo! Groups, and the German Climate Computing Centre]
5. Introduction – Problem
The data needs to be processed, but how?
Can't process all of this data on one machine
Distribute the processing to many machines
6. Introduction – Approach
Distributed computing is the solution
“Let's write our own distributed computing software as a solution to our problem”
Checklist:
• design protocols
• design data structures
• write the code
• assure failure tolerance
But: development takes a long time, it is expensive (cost-benefit ratio?), and we end up building complex software for simple computations.
8. Google MapReduce – Idea
A framework for distributed computing
Don't care about protocols, failure tolerance, etc.
Just write your simple computation
9. Google MapReduce – Idea
MapReduce Paradigm
Map: apply a function to all elements of a list
square x = x * x;
map square [1, 2, 3, 4, 5];
→ [1, 4, 9, 16, 25]
Reduce: combine all elements of a list
reduce (+) [1, 2, 3, 4, 5];
→ 15
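To make the paradigm concrete beyond pseudo-code, here is a minimal Python sketch of the same two operations, using the built-in map and functools.reduce; the names square and add simply mirror the slide and are illustrative:

from functools import reduce
from operator import add

def square(x):
    return x * x

numbers = [1, 2, 3, 4, 5]

# Map: apply a function to every element of a list
squares = list(map(square, numbers))    # [1, 4, 9, 16, 25]

# Reduce: combine all elements of a list with a binary operation
total = reduce(add, numbers)            # 15

print(squares, total)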
10. Google MapReduce – Idea
Basic functioning
Input → Map → Reduce → Output
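What the single arrow between Map and Reduce hides is a grouping step: the framework collects all intermediate values that share a key and hands each group to one reduce call. Below is a toy single-machine simulation in Python, with word count as the illustrative job; simulate_mapreduce and the two job functions are hypothetical names invented for this sketch, not part of any framework:

from collections import defaultdict
from itertools import chain

def simulate_mapreduce(map_fn, reduce_fn, inputs):
    # Map phase: each input record yields zero or more (key, value) pairs
    pairs = chain.from_iterable(map_fn(record) for record in inputs)
    # Shuffle phase: the framework groups all values by their key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: one reduce call per key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def count_words(line):                  # map function of the word-count job
    for word in line.split():
        yield word, 1

def sum_counts(word, counts):           # reduce function of the word-count job
    return sum(counts)

print(simulate_mapreduce(count_words, sum_counts, ["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}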
12. MapReduce – Fault Tolerance
• Workers are periodically pinged by the master
• No answer over a certain time → the worker is considered failed
Mapper fails:
– Reset its map task to idle so it is re-executed
– Even if the task was already completed, its intermediate files are inaccessible, so it must be redone
– Notify reducers where to get the new intermediate files
Reducer fails:
– Reset its task to idle
13. MapReduce – Fault Tolerance
Master fails:
– The master periodically writes checkpoints
– In case of failure, the MapReduce operation is aborted
– The operation can be restarted from the last checkpoint
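The failure handling on the last two slides can be sketched roughly as follows; the class, the timeout value, and the task-state bookkeeping are assumptions made for illustration, not Google's actual code:

import time

PING_TIMEOUT = 10.0   # assumed: seconds of silence before a worker counts as failed

class Master:
    def __init__(self, workers):
        self.last_seen = {w: time.time() for w in workers}   # last ping reply per worker
        self.tasks = {}   # task_id -> {"kind": "map" or "reduce", "state": ..., "worker": ...}

    def on_ping_reply(self, worker):
        self.last_seen[worker] = time.time()

    def check_workers(self):
        now = time.time()
        for worker, seen in self.last_seen.items():
            if now - seen > PING_TIMEOUT:           # no answer over a certain time
                self.handle_worker_failure(worker)

    def handle_worker_failure(self, failed):
        for task in self.tasks.values():
            if task["worker"] != failed:
                continue
            if task["kind"] == "map":
                # Reset even completed map tasks: their intermediate files sit on the
                # failed machine's local disk and are no longer reachable; reducers are
                # then told where the re-executed output will live.
                task["state"], task["worker"] = "idle", None
            elif task["state"] != "completed":
                # Completed reduce output already sits in the shared file system,
                # so only in-progress reduce tasks are reset.
                task["state"], task["worker"] = "idle", None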
14. Google MapReduce – GFS
Google File System
• In-house distributed file system at Google
• Stores all input and output files
• Stores files…
– divided into 64 MB blocks
– on at least 3 different machines
• Machines running GFS also run MapReduce
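A quick back-of-the-envelope illustration of what those two numbers mean for storage; the 1 TB input size is an arbitrary example:

BLOCK_SIZE = 64 * 1024**2                  # 64 MB blocks, as on the slide
REPLICAS = 3                               # each block on at least 3 machines

file_size = 1 * 1024**4                    # example: a 1 TB input file
blocks = -(-file_size // BLOCK_SIZE)       # ceiling division
print(blocks, blocks * REPLICAS)           # 16384 blocks, 49152 stored block copies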
20. Alternative Implementations
Apache Hadoop
• Open-source implementation in Java
• Jobs can be written in C++, Java, Python, etc.
• Used by Yahoo!, Facebook, Amazon and others
• Most commonly used implementation
• HDFS as an open-source implementation of GFS
• Can also use Amazon S3, HTTP(S) or FTP
• Extensions: Hive, Pig, HBase
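As an illustration of writing jobs in a language other than Java, here is the earlier word-count example in the form Hadoop Streaming expects, i.e. two executables that communicate with the framework via stdin/stdout and tab-separated key-value lines. The file names are the user's choice, and the launch command (roughly: hadoop jar .../hadoop-streaming*.jar -input ... -output ... -mapper mapper.py -reducer reducer.py) depends on the Hadoop version:

# mapper.py -- reads raw text lines, emits one "word<TAB>1" pair per word
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py -- receives the pairs sorted by key, sums the counts per word
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")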
21. Alternative Implementations
Mars
MapReduce implementation for NVIDIA GPUs using the CUDA framework
MapReduce-Cell
Implementation for the Cell multi-core processor
Qizmt
MySpace’s implementation of MapReduce in C#
22. Alternative Implementations
There are many other open- and closed-source implementations of MapReduce!
24. Reception and Criticism
• Yahoo!: Hadoop on a 10,000-server cluster
• Facebook analyses its daily logs (25 TB) on a 1,000-server cluster
• Amazon Elastic MapReduce: Hadoop clusters for rent on EC2 and S3
• IBM and Google: support for university courses in distributed programming
• UC Berkeley announced it will teach freshmen MapReduce programming
26. Reception and Criticism
• Criticism mainly by the RDBMS experts DeWitt and Stonebraker
• MapReduce
– is a step backwards in database access
– is a poor implementation
– is not novel
– is missing features that are routinely provided by modern DBMSs
– is incompatible with DBMS tools
27. Reception and Criticism
Response to criticism
MapReduce is not an RDBMS
It is well suited to processing and structuring huge amounts of unstructured data
MapReduce's big innovation is that it enables distributing data processing across a network of cheap and possibly unreliable computers
29. Trends and Future Development
Trend of utilizing MapReduce/Hadoop as a parallel database
• Hive: query language for Hadoop
• HBase: column-oriented distributed database (modeled after Google's BigTable)
• Map-Reduce-Merge: adding a merge step to the paradigm allows implementing features of relational algebra
30. Trends and Future Development
Trend to use the MapReduce paradigm to better utilize multi-core CPUs
• Qt Concurrent
– Simplified C++ version of MapReduce for distributing tasks between multiple processor cores
• Mars
• MapReduce-Cell
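For the single-machine, multi-core variant of the idea, Python's standard library is enough for a sketch; this uses multiprocessing rather than any of the libraries named on the slide, purely for illustration:

from functools import reduce
from multiprocessing import Pool
from operator import add

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool() as pool:                              # one worker process per CPU core by default
        squares = pool.map(square, range(1_000_000))  # map phase runs in parallel
    total = reduce(add, squares)                      # reduce phase, sequential here
    print(total)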
32. Conclusion
MapReduce
provides an easy solution for processing large amounts of data
brings a paradigm shift in programming
changed the world: it made data processing more efficient and cheaper, and it is the foundation of many other approaches and solutions