MapReduce has several advantages over parallel databases for processing large datasets:
1) MapReduce can handle heterogeneous systems with different storage back-ends more easily than parallel databases, which require the data to be copied into the database before it can be analyzed.
2) Complex functions are more straightforward to express in MapReduce's simple map and reduce model than in SQL, where parallel databases often need complicated user-defined functions.
3) MapReduce provides better fault tolerance than parallel databases; its pull-based data transfer between map and reduce tasks is kept efficient through batching, sorting, grouping, and smart scheduling of reads (a minimal word-count sketch of the model follows below).
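The sketch below is a plain-Python illustration of the programming model itself: a map function, a reduce function, and an in-process stand-in for the shuffle phase. It is purely illustrative and not tied to Hadoop or any other implementation.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: merge all intermediate values that share the same key.
    return word, sum(counts)

def run_word_count(documents):
    # Shuffle: group intermediate pairs by key, as the framework would do for us.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(run_word_count({"d1": "the quick brown fox", "d2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real deployment the framework would run many map and reduce tasks in parallel and perform the grouping step across the network rather than in one process.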
The document summarizes the Pregel system, which was designed for large-scale graph processing. Pregel addresses the inefficiency of MapReduce for graph problems by allowing direct message passing between vertices during synchronized iterations. It provides fault tolerance through checkpointing and a master-worker architecture. Key contributions of Pregel include its distributed programming model and APIs for message passing, combining messages to reduce overhead, global communication through aggregators, and mutating graph topology. The paper notes strengths like fault tolerance but also weaknesses such as putting responsibility on the user and lack of master failure detection.
Adaptive Execution Support for Malleable Computation (Qian Lin)
The document summarizes and discusses three papers on adaptive execution support for malleable computation. It introduces FORMLESS, which uses an actor-oriented specification model and space exploration to customize applications to target platforms. It also discusses a dynamic load balancing scheme that uses neighborhood averaging and grain size control, and adaptive load balancing supported by compiler extraction of data access patterns and run-time collection of statistics to adjust load distribution while minimizing communication.
A cluster computer consists of multiple connected nodes that work together like a single system. It can increase performance over a single computer by distributing work across nodes. There are different types of clusters, including load balancing clusters for high performance computing, visualization clusters with graphics cards, and grids that pool multiple distributed resources. Key advantages of clusters are increased performance through parallel processing, scalability by adding nodes, and lower cost by using commodity hardware. Performance monitoring is important as a cluster's speed depends on its nodes and network connection.
MapReduce is a programming model and an implementation for processing and generating big data sets with parallel, distributed algorithms on a cluster. It is a programming paradigm that enables massive scalability across hundreds or thousands of servers for distributed computing of jobs, and a distributed data-processing approach mainly inspired by functional programming. In the MapReduce process, big tasks are split into smaller tasks that are then assigned to many machines for processing. Introduced by Google, it is a reliable and efficient way to process data sets in cluster environments. MapReduce works behind the scenes to provide scalability, simplicity, speed, and recovery for data processing.
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
MapReduce is a programming model and implementation for processing large datasets across clusters of computers. It allows users to specify map and reduce functions. The map function processes input key-value pairs to generate intermediate pairs, while the reduce function combines intermediate values into final output. Google developed MapReduce to simplify distributed computing on large datasets, addressing issues like parallelization, fault tolerance, and load balancing. It works by splitting input data into blocks and assigning them to worker nodes that apply the user-defined map and reduce functions to process the data in parallel.
EC2, MapReduce, and Distributed Processing (Jonathan Dahl)
This document defines and discusses three key concepts related to distributed computing: distributed processing, asynchronous processing, and parallel processing. Distributed processing refers to running an application across multiple computers or processors connected via a local area network. Asynchronous processing involves computations that run independently without constant synchronization. Parallel processing means simultaneously solving a problem across separate CPU cores.
The document discusses resource scheduling techniques for cloud computing including single processor scheduling algorithms, cloud scheduling approaches for multi-tenant systems like the Dominant Resource Fairness scheduler, and Hadoop schedulers like the fair scheduler and capacity scheduler. It also proposes a management model for analyzing elasticity of cluster capacity and job dependencies to enable data bursting between private and public clouds.
MapReduce: Simplified Data Processing On Large Clusters (kazuma_sato)
The document discusses the MapReduce programming model which has been successfully used at Google. It describes how MapReduce simplifies data processing on large clusters by hiding the complexities of parallelization, fault tolerance, locality optimization, and load balancing. Computation is expressed as two functions - Map and Reduce. The Map function produces intermediate key-value pairs from input pairs, and the Reduce function merges all intermediate values associated with the same key.
The document discusses various measures to improve back-end, front-end, and data model performance in Informatica and databases. For back-end performance, it recommends avoiding certain transformations like joiners and aggregators when possible, using indexes and transferring filters to source qualifiers. For databases, it suggests using clusters, partitioning, parallelism, and dynamic tuning. For front-end performance, it recommends indexes, aggregate tables, and dimensional modeling with global dimensions and translation tables.
A cluster is a system of two or more connected computers that work together as a single system. There are three types of clusters: high availability clusters which provide continuous service if a node fails; load-balancing clusters which distribute requests across nodes; and high performance clusters which provide faster processing through unified effort. Clusters offer cost efficiency, processing speed, flexibility, and high availability of resources compared to mainframe computers.
MapReduce is a programming model and implementation for processing large datasets in a distributed environment. It allows users to write map and reduce functions to process key-value pairs. The MapReduce library handles automatic parallelization across the cluster, fault tolerance through re-execution of failed tasks, and load balancing. It was designed at Google to simplify distributed computations on massive amounts of data and to aggregate results across clusters.
The document outlines the process for developing a MapReduce application including:
1) Writing map and reduce functions with unit tests (a toy unit-test sketch follows this list), then writing a driver program to run the job on test data.
2) Running the program on a cluster with the full dataset and fixing issues.
3) Tuning the program for performance after it is working correctly.
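As a toy illustration of step 1, a mapper can be unit-tested as an ordinary function before any cluster is involved; the function and test below are hypothetical, not taken from the document.

```python
import unittest

def tokenize_mapper(line):
    # Hypothetical mapper under test: emit a (word, 1) pair per word in the line.
    return [(word.lower(), 1) for word in line.split()]

class TokenizeMapperTest(unittest.TestCase):
    def test_emits_one_pair_per_word(self):
        self.assertEqual(tokenize_mapper("Fast data Fast"),
                         [("fast", 1), ("data", 1), ("fast", 1)])

if __name__ == "__main__":
    unittest.main()
```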
MapReduce: Simplified Data Processing on Large Clusters (Abolfazl Asudeh)
MapReduce is a programming model and associated implementation for processing large datasets in a distributed system. It allows users to specify a map function that processes input key-value pairs to generate intermediate output pairs, and a reduce function that merges all intermediate values associated with the same key. The MapReduce system automatically parallelizes the computation across large clusters and handles tasks like scheduling, parallelization, and failure recovery. An example of word counting demonstrates how text documents are broken into words as input pairs, mapped to count occurrences, and reduced to output word frequencies.
Spark is an Apache cluster computing framework designed for big data processing. It uses RDDs (Resilient Distributed Datasets), which are immutable distributed collections of objects that can be operated on in parallel. RDDs support transformations, which create new RDDs, and actions, which return final results. RDDs are lazily evaluated, meaning operations are not performed until an action requires a result. Caching RDDs in memory improves performance for iterative algorithms. MLlib is Spark's machine learning library, which implements parallel machine learning algorithms like clustering and forests that can operate directly on RDDs.
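A small PySpark sketch of these ideas, showing lazy transformations, caching, and actions; it assumes a local pyspark installation and uses made-up data.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

words = sc.parallelize(["spark", "rdd", "spark", "mllib"])   # an RDD
pairs = words.map(lambda w: (w, 1))                          # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)               # transformation (lazy)

counts.cache()                                    # keep the RDD in memory for reuse
print(counts.collect())                           # action: triggers the computation
print(counts.filter(lambda kv: kv[1] > 1).collect())  # second action reuses cached data

sc.stop()
```

Because counts is cached after the first action, the second action can reuse the in-memory data instead of recomputing the whole lineage.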
The document describes MapReduce, a programming model for processing large datasets in a distributed environment. MapReduce allows users to write map and reduce functions while hiding the complexity of parallelization, fault tolerance, and load balancing. It works by mapping the input data into intermediate key-value pairs, shuffling and sorting them by key, and reducing the values for each key. This makes it easy to write distributed programs for tasks like inverted indexing, sorting, and counting URL access frequencies. The implementation assigns tasks to worker nodes, handles failures, and optimizes for locality and load balancing.
MapReduce is a programming model used for processing large datasets in a distributed computing environment. It consists of two main tasks - the Map task which converts input data into intermediate key-value pairs, and the Reduce task which combines these intermediate pairs into a smaller set of output pairs. The MapReduce framework operates on input and output in the form of key-value pairs, with the keys and values implemented as serializable Java objects. It divides jobs into map and reduce tasks executed in parallel on a cluster, with a JobTracker coordinating task assignment and tracking progress.
MapReduce: Simplified Data Processing on Large Clusters (Ashraf Uddin)
This document summarizes the MapReduce programming model and its implementation for processing large datasets in parallel across clusters of computers. The key points are:
1) MapReduce expresses computations as two functions - Map and Reduce. Map processes input key-value pairs and generates intermediate output. Reduce combines these intermediate values to form the final output.
2) The implementation automatically parallelizes programs by partitioning work across nodes, scheduling tasks, and handling failures transparently. It optimizes data locality by scheduling tasks on machines containing input data.
3) The implementation provides fault tolerance by reexecuting failed tasks, guaranteeing the same output as non-faulty execution. Status information and counters help monitor progress and collect metrics.
This presentation gives an overview of the Apache MADlib AI/ML project. It explains Apache MADlib in terms of its functionality, its architecture and dependencies, and also gives an SQL example.
Links for further information and connecting
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
MapReduce is a programming model and associated implementation for processing large datasets in parallel across clusters of computers. It allows for automatic parallelization, distribution, fault tolerance, and monitoring of large-scale computations. The MapReduce model consists of a map function that processes input key-value pairs to generate intermediate key-value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. A MapReduce job is made up of many map and reduce tasks that are executed in parallel by the framework on a distributed computing infrastructure.
This document discusses MapReduce and its suitability for processing large datasets across distributed systems. It describes challenges like node failures, network bottlenecks and the motivation for a simple programming model that can handle massive computations and datasets across thousands of machines. MapReduce provides a programming model using map and reduce functions that hides complexities of parallelization, fault tolerance and load balancing. It has been widely adopted for applications involving log analysis, indexing large datasets, iterative graph processing and more.
The document discusses cloud computing systems and MapReduce. It provides background on MapReduce, describing how it works and how it was inspired by functional programming concepts like map and reduce. It also discusses some limitations of MapReduce, noting that it is not designed for general-purpose parallel processing and can be inefficient for certain types of workloads. Alternative approaches like MRlite and DCell are proposed to provide more flexible and efficient distributed processing frameworks.
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers.
This document provides an overview of a Disco workshop on parallel computing and MapReduce. The workshop covers an introduction to parallel computing, including algorithms, programming models, and applications. It then introduces MapReduce, covering its history, examples, and execution overview. The workshop teaches how to write MapReduce jobs with Disco and includes an example of CDN log processing. It aims to provide attendees with the skills needed to get started with Disco for large-scale data processing.
1. Big Data - Introduction(what is bigdata).pdf (AmanCSE050)
Big Data Characteristics
Contents
Explosion in Quantity of Data
Importance of Big Data
Usage Example in Big Data
Challenges in Big Data
Hadoop Ecosystem
Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It addresses problems like hardware failure and combining data after analysis. The core components are HDFS for distributed storage and MapReduce for distributed processing. HDFS stores data as blocks across nodes and handles replication for reliability. The Namenode manages the file system namespace and metadata, while Datanodes store and retrieve blocks. Hadoop supports reliable analysis of large datasets in a distributed manner through its scalable architecture.
This document summarizes a proposal to improve fault tolerance in Hadoop clusters. It proposes adding a "Backup" state to store intermediate MapReduce data, so reducers can continue working even if mappers fail. It also proposes a "supernode" protocol where neighboring slave nodes communicate task information. If one node fails, a neighbor can take over its tasks without involving the JobTracker. This would improve fault tolerance by allowing computation to continue locally between nodes after failures.
Big data refers to large volumes of unstructured or semi-structured data that is difficult to process using traditional databases and analysis tools. The amount of data generated daily is growing exponentially due to factors like increased internet usage and data collection by organizations. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for reliable storage and MapReduce as a programming model to process data in parallel across nodes.
This document compares three high-level programming languages for Big Data analytics on Hadoop clusters: Pig Latin, HiveQL, and Jaql. It analyzes and compares the languages based on four criteria: expressive power, performance, query processing methods, and how each language implements joins. The document finds that while each language has strengths in certain areas, no single language is superior in all criteria. Developers must consider the unique aspects of each language and criteria that matter most for their specific applications and datasets.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems with traditional systems like data growth, network/server failures, and high costs by allowing data to be stored in a distributed manner and processed in parallel. Hadoop has two main components - the Hadoop Distributed File System (HDFS) which provides high-throughput access to application data across servers, and the MapReduce programming model which processes large amounts of data in parallel by splitting work into map and reduce tasks.
Introduction to the Map-Reduce framework.pdf (BikalAdhikari4)
The document provides an introduction to the MapReduce programming model and framework. It describes how MapReduce is designed for processing large volumes of data in parallel by dividing work into independent tasks. Programs are written using functional programming idioms like map and reduce operations on lists. The key aspects are:
- Mappers process input records in parallel, emitting (key, value) pairs.
- A shuffle/sort phase groups values by key and routes each group to the same reducer.
- Reducers process grouped values to produce final output, aggregating as needed.
- This allows massive datasets to be processed across a cluster in a fault-tolerant way (a Hadoop Streaming-style sketch follows below).
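A minimal sketch of that pipeline in the style of Hadoop Streaming, where the mapper emits tab-separated records, a sort stands in for the shuffle, and the reducer aggregates consecutive records with the same key; the input data is invented for illustration.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit one "key\tvalue" record per word (run in parallel on a real cluster).
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(records):
    # Reduce phase: records arrive sorted by key; aggregate each group of values.
    keyed = (record.split("\t") for record in records)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield word, sum(int(value) for _, value in group)

lines = ["Big data Big clusters", "data pipelines"]
shuffled = sorted(mapper(lines))        # stands in for the framework's shuffle/sort
for word, count in reducer(shuffled):
    print(word, count)
```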
Spark is a framework for large-scale data processing that improves on MapReduce. It handles batch, iterative, and streaming workloads using a directed acyclic graph (DAG) model. Spark aims for generality, low latency, fault tolerance, and simplicity. It uses an in-memory computing model with Resilient Distributed Datasets (RDDs) and a driver-executor architecture. Common Spark performance issues relate to partitioning, shuffling data between stages, task placement, and load balancing. Evaluation tools include the Spark UI, Sar, iostat, and benchmarks like SparkBench and GroupBy tests.
Hadoop ecosystem with MapReduce, Hive and Pig (KhanKhaja1)
This document provides an overview of MapReduce architecture and components. It discusses how MapReduce processes data using map and reduce tasks on key-value pairs. The JobTracker manages jobs by scheduling tasks on TaskTrackers. Data is partitioned and sorted during the shuffle and sort phase before being processed by reducers. Components like Hive, Pig, partitions, combiners, and HBase are described in the context of how they integrate with and optimize MapReduce processing.
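To illustrate what a combiner contributes, the sketch below applies reduce-style aggregation to a single map task's output before anything is shuffled across the network; the keys and counts are made up.

```python
from collections import defaultdict

# Raw output of a single map task: five (key, value) pairs.
map_output = [("error", 1), ("info", 1), ("error", 1), ("error", 1), ("info", 1)]

def combine(pairs):
    # Combiner: apply reduce-style aggregation locally on the mapper's node,
    # so fewer records have to be shuffled across the network.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

print(combine(map_output))   # [('error', 3), ('info', 2)] -- 2 records shuffled instead of 5
```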
The document discusses big data and distributed computing. It explains that big data refers to large, unstructured datasets that are too large for traditional databases. Distributed computing uses multiple computers connected via a network to process large datasets in parallel. Hadoop is an open-source framework for distributed computing that uses MapReduce and HDFS for parallel processing and storage across clusters. HDFS stores data redundantly across nodes for fault tolerance.
This document discusses cloud and big data technologies. It provides an overview of Hadoop and its ecosystem, which includes components like HDFS, MapReduce, HBase, Zookeeper, Pig and Hive. It also describes how data is stored in HDFS and HBase, and how MapReduce can be used for parallel processing across large datasets. Finally, it gives examples of using MapReduce to implement algorithms for word counting, building inverted indexes and performing joins.
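One common way to express a join in this model is a reduce-side join: each input emits records tagged with their source under the join key, and the reducer pairs them up. A self-contained sketch with invented tables:

```python
from collections import defaultdict

users = {1: "alice", 2: "bob"}                      # user_id -> name
orders = [(1, "book"), (1, "pen"), (2, "mug")]      # (user_id, item)

# Map phase: both inputs emit (join_key, tagged_record).
intermediate = [(uid, ("USER", name)) for uid, name in users.items()]
intermediate += [(uid, ("ORDER", item)) for uid, item in orders]

# Shuffle: group everything by the join key.
groups = defaultdict(list)
for key, record in intermediate:
    groups[key].append(record)

# Reduce phase: pair each user record with that user's orders.
for uid, records in groups.items():
    names = [value for tag, value in records if tag == "USER"]
    items = [value for tag, value in records if tag == "ORDER"]
    for name in names:
        for item in items:
            print(uid, name, item)    # e.g. "1 alice book"
```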
Architecting and productionising data science applications at scale (samthemonad)
This document discusses architecting and productionizing data science applications at scale. It covers topics like parallel processing with Spark, streaming platforms like Kafka, and scalable machine learning approaches. It also discusses architectures for data pipelines and productionizing models, with a focus on automation, avoiding SQL databases, and using Kafka streams and Spark for batch and streaming workloads.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs, enabling faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
3. MapReduce: A Flexible Data Processing Tool
• MapReduce is a programming model composed of:
• a Map function that processes key/value pairs
• a Reduce function that merges the values for each key
• Processes many terabytes of data
• The system is easy to use
4. Paper
• By Andrew Pavlo et al.
• A comparison paper
• Claims MapReduce is a major step backwards
5. Heterogeneous Systems
• MapReduce provides a simple model for analyzing data in such heterogeneous systems
• Storage systems such as relational databases or file systems
• In a parallel database:
• the input must first be copied in
• only then can it be analyzed
6. Complex Functions
• Map and Reduce functions are simpler and more straightforward than their SQL equivalents
• Pavlo et al. pointed out that some computations are very complicated to express in SQL
• User Defined Functions (UDFs)
• are sometimes buggy
• MapReduce is a better framework for such complex functions
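As a rough illustration of the kind of logic that is awkward in SQL but ordinary in a map function, the sketch below parses semi-structured log lines with a regular expression and emits a pair only for matching records; the log format is invented and this is not code from the paper or the slides.

```python
import re

# Invented log format: '<ip> - - "GET <path> HTTP/1.1"'
LOG = re.compile(r'(?P<ip>\S+) .* "GET (?P<path>\S+)')

def map_fn(line):
    # Arbitrary parsing and branching is ordinary code in a map function;
    # expressing the same logic in SQL would typically require a UDF.
    match = LOG.match(line)
    if match and match.group("path").endswith(".html"):
        yield match.group("path"), 1

for line in ['1.2.3.4 - - "GET /index.html HTTP/1.1"',
             '5.6.7.8 - - "GET /logo.png HTTP/1.1"']:
    for pair in map_fn(line):
        print(pair)        # ('/index.html', 1)
```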
7. Fault Tolerance (1/2)
• Two models for transferring data between mappers and reducers:
• Pull model (move)
• Push model (write)
• The pull model creates many small files and disk seeks (Pavlo et al.)
• MapReduce mitigates this with:
• batching, sorting and grouping of intermediate data
• smart scheduling of reads
8. Fault Tolerance (2/2)
• MapReduce does not use the push model because of fault tolerance
• Fault tolerance will become even more important for processing such data efficiently
9. Performance (1/2)
• Merging results (cost):
• Merging isn't necessary whether the next consumer of the MapReduce output is:
• another MapReduce job
• or something other than MapReduce
10. Performance (2/2)
• Data loading:
• Hadoop can analyze the data 5 to 50 times faster than the time it takes just to load the data into a parallel database
11. Conclusion
• MapReduce is a highly effective and efficient tool for large-scale, fault-tolerant data analysis
• MapReduce is very useful in heterogeneous systems
• MapReduce provides a good framework