Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. The core of Hadoop consists of HDFS for storage and MapReduce for processing. Hadoop has been extended with additional projects, including YARN for job scheduling and resource management, Pig for dataflow scripting, Hive for SQL-like queries, HBase for column-oriented storage, ZooKeeper for coordination, and Ambari for provisioning and managing Hadoop clusters. Hadoop provides a scalable and cost-effective solution for storing and analyzing massive amounts of data.
1. Apache Hadoop
Core & Ecosystem
UOIT - Faculty of Business and IT
Hamzeh Khazaei
hkh@yorku.ca
November 16, 2015
2. Agenda
• Data management system
• Conventional Data
• Big Data
• Hadoop
• File System
• Computation Paradigm
• YARN
• Subprojects
• NoSQL Datastores
• A Real Project
3. Database management system
• Relational database management system
• Structured data
• SQL
• Standard interface
• Vertical scalability
• High-end servers
6. Questions?
1. Who faced these challenges first?
2. When were they confronted with these challenges?
3. What was their solution?
4. What are the opportunities?
5. What is the role of cloud here?
6. What is next?
7. It is all about:
“How to store and process big data with reasonable cost and time?”
8. Definitions
• Apache Hadoop is an open-source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale out from a single server to thousands of machines, with a very high degree of fault tolerance. (IBM)
• Apache Hadoop is a Java-based open-source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data. (Hortonworks)
• Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. (SAS)
10. History, Google vs Hadoop
                      Google              Hadoop
Develop group         Google              Apache
Sponsor               Google              Yahoo, Amazon
File system           GFS (2003)          HDFS (2005)
Programming model     MapReduce (2004)    Hadoop MapReduce (2005)
Storage system        BigTable (2006)     HBase (2010)
Search engine         Google              Nutch
11. Uses for Hadoop
● Data-intensive text processing
● Assembly of large genomes
● Graph mining
● Machine learning and data mining
● Large-scale social network analysis
● Log analytics
● Health informatics
● Smart cities
12. Hadoop Core
Hadoop Common      Contains libraries and other modules
HDFS               Hadoop Distributed File System
Hadoop YARN        Yet Another Resource Negotiator
Hadoop MapReduce   A programming model for large-scale data processing
14. Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 200 PB
• Assumes Commodity Hardware
• Files are replicated to handle hardware failure
• Detect failures and recover from them
• Optimized for Batch Processing
• Data locations exposed so that computations can move to where data resides
• Provides very high aggregate bandwidth
15. HDFS - Specifications
• Single Namespace for entire cluster
• Data Coherency
• Write-once-read-many access model
• Client can only append to existing files
• Files are broken up into blocks
• Typically 64MB block size
• Each block replicated on multiple DataNodes
• Intelligent Client
• Client can find location of blocks
• Client accesses data directly from DataNode
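For example, with the typical 64 MB block size, a 200 MB file is stored as three full 64 MB blocks plus one 8 MB block, and each of those four blocks is replicated to multiple DataNodes independently.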
16. HDFS - Architecture 1
• Master/Slave architecture
• NameNode
• Metadata server
• File location (file name -> the DataNodes)
• File attributes (atime/ctime/mtime, size, number of replicas)
• DataNode
• Manages the storage attached to the node it runs on
• Client
• Producers and consumers of data
18. HDFS I/O
• A typical read from a client involves:
a) Contact the NameNode to determine where the actual data is stored
b) NameNode replies with block identifiers and locations (i.e., which DataNode)
c) Contact the DataNode to fetch data
• A typical write from a client involves:
a) Contact the NameNode to update the namespace and verify permissions
b) NameNode allocates a new block on a suitable DataNode
c) The client directly streams to the selected DataNode
d) Currently, HDFS files are immutable
• Data is never moved through the NameNode; hence, the NameNode is not a bottleneck
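As a hedged sketch of this read path (not from the deck; the file path and the 4 KB buffer are illustrative), the HDFS Java client exposes it through org.apache.hadoop.fs.FileSystem:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // open() asks the NameNode for block locations; the returned stream
    // then fetches the bytes directly from DataNodes, never via the NameNode
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);
      }
    }
  }
}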
20. HDFS Replication
• By default, HDFS stores 3 separate copies of each block
• This ensures reliability, availability and performance
• Replication policy
• Spread replicas across different racks
• Robust against cluster node failures
• Robust against rack failures
• Block replication benefits MapReduce
• Scheduling decisions can take replicas into account
• Exploit better data locality
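In the same hedged spirit, the replication factor is per-file metadata, so it can be inspected and changed through the same FileSystem API (the path is again illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/data/sample.txt");  // illustrative path
    // The NameNode records the target replication factor per file
    short current = fs.getFileStatus(p).getReplication();
    System.out.println("current replication: " + current);
    // Request the default of 3 replicas; re-replication happens in the background
    fs.setReplication(p, (short) 3);
  }
}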
23. MapReduce Overview
● A method for distributing computation across multiple nodes
● Each node processes the data that is stored at that node
● Consists of two main phases
◦ Map
◦ Reduce
24. Now, Technically, What is MapReduce?
• MapReduce is a programming model for efficient distributed computing
• It works like a Unix pipeline
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
• Streaming through data, reducing seeks
• Pipelining
• A good fit for a lot of applications
• Log processing
• Web index building
25. MapReduce in 41 words.
Goal: count the number of books in the library.
• Map:
You count up shelf #1, I count up shelf #2.
(The more people we get, the faster this part goes)
• Reduce:
We all get together and add up our individual counts.
26. Word Count Example
• Mapper
• Input: value: lines of text of input
• Output: key: word, value: 1
• Reducer
• Input: key: word, value: set of counts
• Output: key: word, value: sum
• Launching program
• Defines this job
• Submits job to cluster
28. Word Count Mapper
// (imports from java.io, java.util and org.apache.hadoop.{io,mapred} omitted on the slide)
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
29. Word Count Reducer
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
30. Word Count Main
public static void main(String[] args) throws Exception {
  // classic org.apache.hadoop.mapred API, matching the Map and Reduce
  // classes above; assumes they are nested in a WordCount driver class
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
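Packaged into a jar, the job would typically be launched from the command line with something like hadoop jar wordcount.jar WordCount <input-dir> <output-dir>; the jar and class names here are illustrative, assuming the three classes above are nested in a WordCount driver class.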
31. Execution Framework
• MapReduce program, a.k.a. a job:
• Code of mappers and reducers
• Code for combiners and partitioners (optional)
• Configuration parameters
• All packaged together
• A MapReduce job is submitted to the cluster
• The framework takes care of everything else
• Next, we will delve into the details
32. Scheduling
• Each job is broken into tasks
• Map tasks work on fractions of the input dataset, as defined by the underlying distributed filesystem
• Reduce tasks work on intermediate inputs and write back to the distributed filesystem
• The number of tasks may exceed the number of available machines in a cluster
• The scheduler takes care of maintaining something similar to a queue of pending tasks to be assigned to machines with available resources
• Jobs to be executed in a cluster require scheduling as well
• Different users may submit jobs
• Jobs may be of various complexity
• Fairness is generally a requirement
33. Anatomy of a Hadoop Cluster
• NameNode
• Holds the metadata for HDFS
• Secondary NameNode
• Performs housekeeping functions for the NameNode
• DataNode
• Stores the actual HDFS data blocks
• JobTracker
• Manages MapReduce jobs
• TaskTracker
• Monitors individual Map and Reduce tasks
37. YARN
• Yet Another Resource Negotiator
• YARN Application Resource Negotiator (recursive acronym)
• Remedies the scalability shortcomings of “classic” MapReduce: a single JobTracker per cluster limits clusters to about 4,000 nodes.
• Is more of a general-purpose framework, of which classic MapReduce is one application.
• Classic MapReduce also uses inflexible slots on nodes (each slot runs Map or Reduce tasks, not both), which causes underutilization of the cluster.
38. YARN = Hadoop 2.0 = MRv2
• The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker (i.e., resource management and job scheduling/monitoring) into separate daemons.
• The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
39. Hadoop Common
• Hadoop Common refers to the collection of common utilities and libraries that support other Hadoop modules.
• It is an essential part or module of the Apache Hadoop Framework, along with HDFS, Hadoop YARN and Hadoop MapReduce.
• Like all other modules, Hadoop Common assumes that hardware failures are common and that they should be automatically handled in software by the Hadoop Framework.
42. Pig
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
• Expresses sequences of MapReduce jobs
• Data model: nested “bags” of items
• Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
• Easy to plug in Java functions
43. Example
• Suppose you have user data in one file and website data in another, and you need to find the top 5 most-visited pages by users aged 18-25.
46. HBase
• Open-source implementation of Google’s Bigtable
• Row/column store
• Billions of rows, millions of columns
• Column-oriented - nulls are free
• Online processing
• Master/Slave architecture
• Based on HDFS
48. HBase Query
• Retrieve a cell
• Cell cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
• RowResult row = table.getRow("enclosure1");
• Scan through a range of rows
• Scanner s = table.getScanner(new String[] { "animal:type" });
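The calls above come from a very early HBase client API; with the current client API the same cell lookup reads roughly as follows (a sketch, not from the deck: the table name "zoo" and the default configuration are assumptions, while the row key and animal:type column come from the slide):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellLookup {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();      // reads hbase-site.xml
         Table table = conn.getTable(TableName.valueOf("zoo"))) {     // "zoo" is illustrative
      Get get = new Get(Bytes.toBytes("enclosure1"));                 // row key from the slide
      get.addColumn(Bytes.toBytes("animal"), Bytes.toBytes("type"));  // family:qualifier
      Result row = table.get(get);
      byte[] type = row.getValue(Bytes.toBytes("animal"), Bytes.toBytes("type"));
      System.out.println(Bytes.toString(type));
    }
  }
}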
49. Hive
• Developed at Facebook
• Used for the majority of Facebook jobs
• “Relational database” built on Hadoop
• Maintains list of table schemas
• SQL-like query language (HiveQL)
• Can call Hadoop Streaming scripts from HiveQL
• Supports table partitioning, clustering, complex data types, some optimizations
50. Hive DDL
CREATE TABLE page_views (viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'User IP address')
COMMENT 'This is the page view table'
PARTITIONED BY (dt STRING, country STRING)
STORED AS SEQUENCEFILE;
● Partitioning breaks the table into separate files for each (dt, country) pair
○ Ex: /hive/page_view/dt=2015-06-08,country=USA
     /hive/page_view/dt=2015-06-08,country=CA
51. A Simple Query
• Find all page views coming from xyz.com in March:
SELECT page_views.*
FROM page_views
WHERE page_views.dt >= '2015-03-01'
  AND page_views.dt <= '2015-03-31'
  AND page_views.referrer_url LIKE '%xyz.com';
• Hive only reads partitions “2015-03-*” instead of scanning the entire table
52. Mahout
• A scalable machine learning and data mining library
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why Mahout? Many open-source ML libraries either:
• Lack community
• Lack documentation and examples
• Lack scalability
• Or are research-oriented
53. Apache Ambari
• Provision a Hadoop cluster
• Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
• Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop cluster
• Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
• Monitor a Hadoop cluster
• Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster.
• Ambari leverages the Ambari Metrics System for metrics collection.
• Ambari leverages the Ambari Alert Framework for system alerting and will notify you when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).
55. Others
• ZooKeeper
• ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
• Oozie
• Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
• Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.
• Sqoop
• Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
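As a hedged illustration of Sqoop's role (the connection string, table and directory names are made up here), a single command such as sqoop import --connect jdbc:mysql://dbhost/shop --table orders --target-dir /data/orders launches a MapReduce job that copies the rows of the relational table into files under the given HDFS directory.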