If you are searching for the best engineering college in India, you can trust the services and facilities of RCE (Roorkee College of Engineering). It provides excellent teaching, highly educated and experienced faculty, well-furnished hostels for both boys and girls, a fully computerized library, and great placement opportunities, all at an affordable fee.
This document provides an outline and introduction for a lecture on MapReduce and Hadoop. It discusses the Hadoop architecture, including HDFS and YARN, and how they work together to provide distributed storage and processing of big data across clusters of machines. It also gives an overview of the MapReduce programming model and how data is processed through the map and reduce phases in Hadoop, and it references several books on Hadoop, MapReduce, and big data fundamentals.
At KDD 2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
From Hadoop through Hive, Spark, and YARN, this talk traces the search for the holy grail of a low-latency SQL dialect for big data queries. It covers OLTP and OLAP, row stores and column stores, and finally arrives at BigQuery, with a demo on the web console.
Hadoop Operations - Best Practices from the Field (Uwe Printz)
A talk about Hadoop operations and best practices for building and maintaining a Hadoop cluster.
The talk was held at the data2day conference in Karlsruhe, Germany, on 27.11.2014.
The document is a slide deck for a training on Hadoop fundamentals. It includes an agenda that covers what big data is, an introduction to Hadoop, the Hadoop architecture, MapReduce, Pig, Hive, Jaql, and certification. It provides overviews and explanations of these topics through multiple slides with images and text. The slides also describe hands-on labs for attendees to complete exercises using these big data technologies.
Bikas Saha: The Next Generation of Hadoop - Hadoop 2 and YARN (hdhappy001)
The document discusses Apache YARN, the next-generation resource management platform for Apache Hadoop. YARN was designed to address limitations of the original Hadoop 1 architecture by supporting multiple data processing models (e.g. batch, interactive, streaming) and improving cluster utilization. YARN achieves this by separating resource management from application execution, allowing various data processing engines like MapReduce, HBase and Storm to run natively on Hadoop. This provides a flexible, efficient and shared platform for distributed applications.
Introduction to Hadoop Ecosystem was presented to Lansing Java User Group on 2/17/2015 by Vijay Mandava and Lan Jiang. The demo was built on top of HDP 2.2 and AWS cloud.
Hive provides a SQL-like interface to query large datasets stored in Hadoop. Pig is a dataflow language for transforming datasets. HBase is a distributed, scalable, big data store that provides random real-time read/write access to datasets.
The document provides an introduction to big data and Apache Hadoop. It discusses big data concepts like the 3Vs of volume, variety and velocity. It then describes Apache Hadoop, including its core architecture, HDFS, MapReduce and running jobs. Examples of using Hadoop for a retail system and with SQL Server are presented. Real-world applications at Microsoft and case studies are reviewed. References for further reading are included at the end.
The document discusses backup and disaster recovery strategies for Hadoop. It focuses on protecting data sets stored in HDFS. HDFS uses data replication and checksums to protect against disk and node failures. Snapshots can protect against data corruption and accidental deletes. The document recommends copying data from the primary to secondary site for disaster recovery rather than teeing, and discusses considerations for large data movement like bandwidth needs and security. It also notes the importance of backing up metadata like Hive configurations along with core data.
This document provides an agenda and overview for a presentation on Hadoop 2.x configuration and MapReduce performance tuning. The presentation covers hardware selection and capacity planning for Hadoop clusters, key configuration parameters for operating systems, HDFS, and YARN, and performance tuning techniques for MapReduce applications. It also demonstrates the Hadoop Vaidya performance diagnostic tool.
These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/
This presentation provides an overview of Hadoop, including what it is, how it works, its architecture and components, and examples of its use. Hadoop is an open-source software platform for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets through its core components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing.
The document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets. It describes how Hadoop uses HDFS for distributed file storage across clusters and MapReduce for parallel processing of data. Key components of Hadoop include HDFS for storage, YARN for resource management, and MapReduce for distributed computing. The document also discusses some popular Hadoop distributions and real-world uses of Hadoop by companies.
The document summarizes a technical seminar on Hadoop. It discusses Hadoop's history and origin, how it was developed from Google's distributed systems, and how it provides an open-source framework for distributed storage and processing of large datasets. It also summarizes key aspects of Hadoop including HDFS, MapReduce, HBase, Pig, Hive and YARN, and how they address challenges of big data analytics. The seminar provides an overview of Hadoop's architecture and ecosystem and how it can effectively process large datasets measured in petabytes.
This document provides an overview of Hadoop and its ecosystem. It describes Hadoop as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. The key components of Hadoop are the Hadoop Distributed File System (HDFS) for storage, and MapReduce as a programming model for distributed computation across large datasets. A variety of related projects form the Hadoop ecosystem, providing capabilities like data integration, analytics, workflow scheduling and more.
This document provides an overview of big data ecosystems, including common log formats, compression techniques, data collection methods, distributed storage options like HDFS and S3, distributed processing frameworks like Hadoop MapReduce and Storm, workflow managers, real-time storage options, and other related topics. It describes technologies like Kafka, HBase, Cassandra, Pig, Hive, Oozie, and Azkaban; compares advantages and disadvantages of HDFS, S3, HBase and other storage systems; and provides references for further information.
Hadoop Infrastructure @ Uber - Past, Present and Future (DataWorks Summit)
Uber's mission is to provide transportation as reliable as running water, and data plays a critical role in fulfilling that mission. At Uber, Hadoop is central to the data infrastructure. This talk covers the journey of Hadoop at Uber and future plans for scaling to billions of trips: Uber's most unique use cases, how the Hadoop ecosystem Uber built helped in this journey, and how the cluster grew from 10 to 2,000 nodes, with plans to scale to tens of thousands of nodes. It also covers Uber's mistakes, learnings, and wins, how Uber processes billions of events per day, the unique challenges and real-world use cases involved, and how Uber will co-locate its service architecture with batch workloads (e.g. data pipelines, machine learning and analytical workloads). Uber has made many improvements to the Hadoop ecosystem and has solved some problems in ways not done before; the presentation encourages the audience to use this as an example and to enhance the ecosystem themselves, growing the community around these projects and the big data space overall. The audience is anybody working on big data who wants to understand how to scale Hadoop and its ecosystem to tens of thousands of nodes; the talk will also introduce some of the technologies the Uber team is building in the big data space.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses a master-slave architecture with the NameNode as master and DataNodes as slaves. The NameNode manages file system metadata and the DataNodes store data blocks. Hadoop also includes a MapReduce engine where the JobTracker splits jobs into tasks that are processed by TaskTrackers on each node. Hadoop saw early adoption from companies handling big data like Yahoo!, Facebook and Amazon and is now widely used for applications like advertisement targeting, search, and security analytics.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It was designed to support large datasets and scale efficiently using low-cost hardware. Hadoop's core components include HDFS for distributed storage and MapReduce for distributed processing. Hadoop saw early adoption by companies like Yahoo and Facebook to support applications like advertisement targeting, searches, and security using large datasets.
A Hadoop Administrator online training course by Knowledgebee Trainings, covering mastery of Hadoop clusters: planning & deployment, monitoring, performance tuning, security using Kerberos, HDFS high availability using Quorum Journal Manager (QJM), Oozie, and HCatalog/Hive administration.
Contact: knowledgebee@beenovo.com
Best Hadoop institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses taught by real-time faculty in Bangalore.
This document provides an overview of Hadoop, including:
1. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware.
2. It describes the architecture of Hadoop, including the Hadoop Distributed File System (HDFS) and MapReduce engine. HDFS uses a master/slave architecture with a NameNode and DataNodes, while MapReduce uses a JobTracker and TaskTrackers.
3. It discusses some common uses of Hadoop in industry, such as for log processing, web search indexing, and ad-hoc queries at large companies like Yahoo, Facebook, and Amazon.
A summarized version of a presentation on Big Data architecture, covering everything from the Big Data concept to Hadoop and tools like Hive, Pig and Cassandra.
The document provides an overview of Hadoop including:
- A brief history of Hadoop and its origins from Nutch.
- An overview of the Hadoop architecture including HDFS and MapReduce.
- Examples of how companies like Yahoo, Facebook and Amazon use Hadoop at large scales to process petabytes of data.
Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses Google's MapReduce programming model and Google File System for reliability. The Hadoop architecture includes a distributed file system (HDFS) that stores data across clusters and a job scheduling and resource management framework (YARN) that allows distributed processing of large datasets in parallel. Key components include the NameNode, DataNodes, ResourceManager and NodeManagers. Hadoop provides reliability through replication of data blocks and automatic recovery from failures.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses limitations in traditional RDBMS for big data by allowing scaling to large clusters of commodity servers, high fault tolerance, and distributed processing. The core components of Hadoop are HDFS for distributed storage and MapReduce for distributed processing. Hadoop has an ecosystem of additional tools like Pig, Hive, HBase and more. Major companies use Hadoop to process and gain insights from massive amounts of structured and unstructured data.
M. Florence Dayana - Hadoop Foundation for Analytics (Dr. Florence Dayana)
Hadoop Foundation for Analytics
History of Hadoop
Features of Hadoop
Key Advantages of Hadoop
Why Hadoop
Versions of Hadoop
Eco Projects
Essentials of the Hadoop Ecosystem
RDBMS versus Hadoop
Key Aspects of Hadoop
Components of Hadoop
This slide deck was shared by Mr. Minh Tran, KMS's Software Architect, at the "Java - Trends and Career Opportunities" seminar of the Information Technology Center of HCMC University of Science.
This document discusses distributed computing and Hadoop. It begins by explaining distributed computing and how it divides programs across several computers. It then introduces Hadoop, an open-source Java framework for distributed processing of large data sets across clusters of computers. Key aspects of Hadoop include its scalable distributed file system (HDFS), MapReduce programming model, and ability to reliably process petabytes of data on thousands of nodes. Common use cases and challenges of using Hadoop are also outlined.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable, and distributed processing of petabytes of data. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Many large companies use Hadoop for applications such as log analysis, web indexing, and data mining of large datasets.
P. Maharajothi, II M.Sc. (Computer Science), Bon Secours College for Women, Thanjavur (MaharajothiP)
Hadoop is an open-source software framework that supports data-intensive distributed applications. It has a flexible architecture designed for reliable, scalable computing and storage of large datasets across commodity hardware. Hadoop uses a distributed file system and MapReduce programming model, with a master node tracking metadata and worker nodes storing data blocks and performing computation in parallel. It is widely used by large companies to analyze massive amounts of structured and unstructured data.
- Data is a precious resource that can last longer than the systems themselves (Tim Berners-Lee)
- Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliability, scalability and flexibility.
- Hadoop consists of HDFS for storage and MapReduce for processing. The main nodes include NameNode, DataNodes, JobTracker and TaskTrackers. Tools like Hive, Pig, HBase extend its capabilities for SQL-like queries, data flows and NoSQL access.
Hadoop Distributed File System - Complete Information (bhargavi804095)
The document provides an overview of the Hadoop Distributed File System (HDFS). It discusses that HDFS is the storage unit of Hadoop and relies on distributed file system principles. It has a master-slave architecture with the NameNode as the master and DataNodes as slaves. HDFS allows files to be broken into blocks which are replicated across DataNodes for fault tolerance. The document outlines the key components of HDFS and how read and write operations work in HDFS.
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It supports very large numbers of files (over 100 million) and is optimized for batch processing of huge datasets across large clusters (over 10,000 nodes). HDFS stores multiple replicas of data blocks on different nodes to handle failures. It provides high aggregate bandwidth and allows computations to move to where the data resides.
The document provides an overview of Apache Hadoop and how it addresses challenges related to big data. It discusses how Hadoop uses HDFS to distribute and store large datasets across clusters of commodity servers and uses MapReduce as a programming model to process and analyze the data in parallel. The core components of Hadoop - HDFS for storage and MapReduce for processing - allow it to efficiently handle large volumes and varieties of data across distributed systems in a fault-tolerant manner. Major companies have adopted Hadoop to derive insights from their big data.
This document discusses Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes how Hadoop uses HDFS for scalable, fault-tolerant storage and MapReduce for parallel processing. The core components of Hadoop - HDFS and MapReduce - allow for distributed processing of large datasets across commodity hardware, providing capabilities for scalability, cost-effectiveness, and efficient distributed computing.
Similar to List of Engineering Colleges in Uttarakhand (20)
I have put great effort into this project. However, it would not have been possible without the kind support and help of many individuals and organizations, and I would like to extend my sincere thanks to all of them. I am highly indebted to Mr. I.P. Chandra, Mrs. Nisha Dhiman, and Miss Priyanka Singh Rana for their guidance and constant supervision, for providing the necessary information regarding the project, and for their support in completing it. I would also like to express my gratitude towards my parents and the members of Roorkee College of Engineering, Roorkee, for their kind cooperation and encouragement, which helped me complete this project.
Foremost, I would like to express my sincere gratitude to my advisor, Prof. Nisha Dhiman (Roorkee College of Engineering), for her continuous support of my Project Orientation Program, and for her patience, motivation, enthusiasm, and immense knowledge. Her guidance helped me throughout the research and the writing of this thesis. I could not have imagined having a better advisor and mentor for my POP.
The document is a thesis on designing a cell phone detector circuit. It discusses the objectives: detecting signals between 0.9 and 3 GHz within 1.5 m and notifying when a phone is in use. The circuit consists of an inductor, diodes, transistors, op-amps and an LED. It works by rectifying the RF signal induced in the inductor when a phone is nearby; the amplified output triggers the LED if it exceeds a reference voltage, indicating detection. Applications include areas where phone use is prohibited, like petrol pumps, hospitals and exam halls.
Roorkee College of Engineering (RCE) appears in the list of top placement colleges in India. It is the best institute for the study of Engineering, Diploma, and B.Sc (Agriculture & Forestry), with the best facilities, 100% placement, and an affordable fee. RCE is the best choice for a bright future in engineering in India.
A campus with good infrastructure, coached by highly distinguished faculty using the latest teaching aids, excellent hostel facilities, and a vision that drives us to impart wholesome technical education to our students is what sets us apart from the run-of-the-mill engineering colleges spread all over the country. RCE aims at developing the aptitude of students through interactive sessions with their mentors, regular brainstorming sessions, and exposure to a plethora of events.
The document is about an event called "Tech Sangram" that took place in 2017 at the Roorkee College of Engineering in Roorkee. It provides basic information about the name of the event, the year it occurred, and the location of the host institution.
RCE (Roorkee College of Engineering) is the best engineering college in Roorkee, offering great and beneficial facilities that give students the best future. RCE was established in 2010, is approved by AICTE (Govt. of India), and is affiliated to Uttarakhand Technical University, Uttarakhand. RCE provides M.Tech, B.Tech, Diploma, and B.Sc Agriculture courses with highly educated and experienced faculty, top-class updated infrastructure, a Wi-Fi enabled campus, great placements, and more.
The document is about a tech competition called Tech Sangram held during the 2016-2017 academic year at Roorkee College of Engineering in Roorkee. It provides basic information about the event including the name, year it took place, and location of the host college.
This document is about a tech competition called Tech Sangram held in 2017 at Roorkee College of Engineering in Roorkee, India. No other details about the event itself are provided in the short document.
The document is about an event called "Tech Sangram" that took place in 2017 at Roorkee College of Engineering in Roorkee. It provides the name of the event, year it occurred, and location but no other details about what the event entailed.
Roorkee College of Engineering was launched and promoted by Chartered Accountant S.K. Gupta in 2010 with a view to providing high-quality technical education to regional aspirants in general and the local rural poor in particular.
3. What is Hadoop?
• Apache top-level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
• It is a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
4. Brief History of Hadoop
• Designed to answer the question: "How to process big data with reasonable cost and time?"
7. Hadoop's Developers
Doug Cutting
2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project. The project was funded by Yahoo.
2006: Yahoo gave the project to the Apache Software Foundation.
9. Some Hadoop Milestones
• 2008 - Hadoop wins the Terabyte Sort Benchmark (sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds)
• 2009 - Avro and Chukwa became new members of the Hadoop framework family
• 2010 - Hadoop's HBase, Hive and Pig subprojects completed, adding more computational power to the Hadoop framework
• 2011 - ZooKeeper completed
• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha released; Ambari, Cassandra and Mahout added
10. What is Hadoop?
• Hadoop: an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
• Goals / Requirements:
• Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
• Structured and non-structured data
• Simple programming models
• High scalability and availability
• Use commodity (cheap!) hardware with little redundancy
• Fault tolerance
• Move computation rather than data
12. Hadoop's Architecture
• Distributed, with some centralization
• Main nodes of the cluster are where most of the computational power and storage of the system lies
• Main nodes run TaskTracker to accept and reply to MapReduce tasks, and also DataNode to store needed blocks as close by as possible
• A central control node runs NameNode to keep track of HDFS directories & files, and JobTracker to dispatch compute tasks to TaskTrackers
• Written in Java; also supports Python and Ruby
14. Hadoop's Architecture
• Hadoop Distributed Filesystem
• Tailored to the needs of MapReduce
• Targeted towards many reads of file streams
• Writes are more costly
• High degree of data replication (3x by default)
• No need for RAID on normal nodes
• Large block size (64 MB); see the configuration sketch below
• Location awareness of DataNodes in the network
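As a concrete aside, the replication factor and block size above are ordinary client-visible configuration values. A minimal Java sketch that reads them, assuming the Hadoop 1.x API and property names (dfs.replication, dfs.block.size):

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch: read the two HDFS defaults mentioned above from the
    // client-side configuration (Hadoop 1.x property names).
    public class HdfsDefaults {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            int replication = conf.getInt("dfs.replication", 3);                 // 3x by default
            long blockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);  // 64 MB, in bytes
            System.out.println("replication=" + replication + ", blockSize=" + blockSize);
        }
    }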
15. Hadoop's Architecture
NameNode:
• Stores metadata for the files, like the directory structure of a typical FS.
• The server holding the NameNode instance is quite crucial, as there is only one.
• Keeps a transaction log for file deletes/adds, etc. Does not use transactions for whole blocks or file streams, only metadata.
• Handles creation of additional replica blocks when necessary after a DataNode failure.
16. Hadoop's Architecture
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc.)
• Notifies the NameNode of what blocks it has
• The NameNode directs replication: one replica on the local node, two more on a remote rack (see Block Placement below)
19. Hadoop's Architecture
MapReduce Engine:
• JobTracker & TaskTracker
• JobTracker splits up data into smaller tasks ("Map") and sends them to the TaskTracker process on each node
• TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce"), or requests new jobs
20. Hadoop's Architecture
• None of these components is necessarily limited to using HDFS
• Many other distributed file systems with quite different architectures work
• Many other software packages besides Hadoop's MapReduce platform make use of HDFS
21. Hadoop in the Wild
• Hadoop is in use at most organizations that handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
o Etc.
• Some examples of scale:
o Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search
o FB's Hadoop cluster hosts 100+ PB of data (July 2012) and is growing at ½ PB/day (Nov 2012)
22. Hadoop in the Wild
Three main applications of Hadoop:
• Advertisement (mining user behavior to generate recommendations)
• Searches (grouping related documents)
• Security (searching for uncommon patterns)
23. Hadoop in the Wild
• Non-realtime large dataset computing:
o The NY Times was dynamically generating PDFs of articles from 1851-1922
o Wanted to pre-generate & statically serve articles to improve performance
o Using Hadoop + MapReduce running on EC2 / S3, converted 4 TB of TIFFs into 11 million PDF articles in 24 hrs
24. Hadoop in the Wild: Facebook Messages
• Design requirements:
o Integrate display of email, SMS and chat messages between pairs and groups of users
o Strong control over who users receive messages from
o Suited for production use by 500 million people immediately after launch
o Stringent latency & uptime requirements
25. Hadoop in the Wild
• System requirements:
o High write throughput
o Cheap, elastic storage
o Low latency
o High consistency (within a single data center is good enough)
o Disk-efficient sequential and random read performance
26. Hadoop in the Wild
• Classic alternatives
o These requirements were typically met using a large MySQL cluster & caching tiers using Memcached
o Content on HDFS could be loaded into MySQL or Memcached if needed by the web tier
• Problems with previous solutions
o MySQL has low random write throughput... a BIG problem for messaging!
o Difficult to scale MySQL clusters rapidly while maintaining performance
o MySQL clusters have high management overhead and require more expensive hardware
27. Hadoop in the Wild
• Facebook's solution
o Hadoop + HBase as foundations
o Improve & adapt HDFS and HBase to scale to FB's workload and operational considerations
 Major concern was availability: the NameNode is a SPOF and failover times are at least 20 minutes
 Proprietary "AvatarNode": eliminates the SPOF, makes HDFS safe to deploy even with a 24/7 uptime requirement
 Performance improvements for realtime workloads: on an RPC timeout, fail fast and try a different DataNode rather than wait
30. Why use Hadoop?
• Need to process multi-petabyte datasets
• Data may not have a strict schema
• Expensive to build reliability into each application
• Nodes fail every day
• Need common infrastructure
• Very large distributed file system
• Assumes commodity hardware
• Optimized for batch processing
• Runs on heterogeneous OSes
31. DataNode
• A Block Server
– Stores data in the local file system
– Stores metadata of a block, e.g. the checksum
– Serves data and metadata to clients
• Block Report
– Periodically sends a report of all existing blocks to the NameNode
• Facilitates pipelining of data
– Forwards data to other specified DataNodes
32. Block Placement
• Replication strategy (sketched in code below):
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack
– Additional replicas are randomly placed
• Clients read from the nearest replica
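A toy Java sketch of this placement rule, to make it concrete. This is illustrative only, not Hadoop's actual BlockPlacementPolicy code, and the node and rack names are made up:

    import java.util.*;

    // Sketch: pick replica targets per the rule above - local node first,
    // then two different nodes on one randomly chosen remote rack.
    public class PlacementSketch {
        static List<String> chooseReplicas(String localNode, String localRack,
                                           Map<String, List<String>> nodesByRack,
                                           Random rnd) {
            List<String> targets = new ArrayList<>();
            targets.add(localNode); // first replica: the writer's own node

            // Pick a rack other than the writer's rack.
            List<String> remoteRacks = new ArrayList<>(nodesByRack.keySet());
            remoteRacks.remove(localRack);
            String remoteRack = remoteRacks.get(rnd.nextInt(remoteRacks.size()));

            // Second and third replicas: two different nodes on that remote rack.
            List<String> candidates = new ArrayList<>(nodesByRack.get(remoteRack));
            Collections.shuffle(candidates, rnd);
            targets.add(candidates.get(0));
            targets.add(candidates.get(1));
            return targets;
        }

        public static void main(String[] args) {
            Map<String, List<String>> racks = new HashMap<>();
            racks.put("rack1", Arrays.asList("n1", "n2"));
            racks.put("rack2", Arrays.asList("n3", "n4", "n5"));
            System.out.println(chooseReplicas("n1", "rack1", racks, new Random(42)));
        }
    }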
33. Data Correctness
• Uses checksums to validate data - CRC32 (see the sketch below)
• File creation:
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
• File access:
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
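A self-contained Java sketch of this per-chunk scheme, using java.util.zip.CRC32 and the 512-byte chunk size (which HDFS exposes as the io.bytes.per.checksum setting):

    import java.util.zip.CRC32;

    // Sketch: one CRC32 checksum for every 512-byte chunk of a block, as
    // described above. The DataNode stores these alongside the data.
    public class ChecksumSketch {
        static final int BYTES_PER_CHECKSUM = 512;

        static long[] checksums(byte[] data) {
            int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
            long[] sums = new long[chunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < chunks; i++) {
                int off = i * BYTES_PER_CHECKSUM;
                int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }

        public static void main(String[] args) {
            byte[] block = new byte[1300];                // pretend block data
            System.out.println(checksums(block).length);  // 3 chunks: 512 + 512 + 276
        }
    }

On read, the client recomputes the same CRC32 over each chunk and compares it with the stored value; a mismatch sends it to another replica.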
34. Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes the block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the pipeline
• When all replicas are written, the client moves on to write the next block of the file (toy simulation below)
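A toy simulation of that forwarding chain. Illustrative only: real DataNodes stream 64 KB packets over sockets and acknowledge back up the pipeline.

    import java.util.Arrays;
    import java.util.List;

    // Sketch: the client hands a block to the first DataNode, and each
    // DataNode stores it and forwards it to the rest of the pipeline.
    public class PipelineSketch {
        static class DataNode {
            final String name;
            DataNode(String name) { this.name = name; }

            void writeBlock(byte[] block, List<DataNode> downstream) {
                System.out.println(name + ": stored " + block.length + " bytes");
                if (!downstream.isEmpty()) {
                    downstream.get(0).writeBlock(block, downstream.subList(1, downstream.size()));
                }
            }
        }

        public static void main(String[] args) {
            // Replica list as it would come from the NameNode (made-up names).
            List<DataNode> pipeline = Arrays.asList(
                    new DataNode("dn1"), new DataNode("dn2"), new DataNode("dn3"));
            byte[] block = new byte[4096]; // pretend block contents

            // The client talks only to the first DataNode in the pipeline.
            pipeline.get(0).writeBlock(block, pipeline.subList(1, pipeline.size()));
        }
    }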
35. Hadoop MapReduce
• MapReduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in a generic framework
• Common design pattern in data processing (word-count example below):
– cat * | grep | sort | uniq -c | cat > file
– input | map | shuffle | reduce | output
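The canonical instance of this pattern is word count. Against the classic org.apache.hadoop.mapred API quoted in these slides, the user-pluggable map and reduce code looks roughly like this (a standard-example sketch, not taken from the original deck):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
        // map: (offset, line) -> (word, 1) for every word in the line
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output,
                            Reporter reporter) throws IOException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    output.collect(word, one);
                }
            }
        }

        // reduce: (word, [1, 1, ...]) -> (word, count); the shuffle phase
        // has already grouped all values for the same key together
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output,
                               Reporter reporter) throws IOException {
                int sum = 0;
                while (values.hasNext()) {
                    sum += values.next().get();
                }
                output.collect(key, new IntWritable(sum));
            }
        }
    }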
38. MapReduce Process (org.apache.hadoop.mapred)
• JobClient
– Submits the job (driver sketch below)
• JobTracker
– Manages and schedules the job, splits the job into tasks
• TaskTracker
– Starts and monitors the task execution
• Child
– The process that actually executes the task
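Tying this to the word-count sketch above: the JobClient side configures a JobConf and submits it. JobClient.runJob wraps the submitJob call walked through on the next slides and polls until completion (again a standard-example sketch):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);           // reduce output key type
            conf.setOutputValueClass(IntWritable.class);  // reduce output value type

            conf.setMapperClass(WordCount.Map.class);     // classes from the sketch above
            conf.setReducerClass(WordCount.Reduce.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1])); // must not already exist

            JobClient.runJob(conf); // submit and wait; map/reduce run in Child JVMs
        }
    }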
39. Inter-Process Communication
IPC/RPC (org.apache.hadoop.ipc)
• Protocols:
– JobClient <-------------> JobTracker: JobSubmissionProtocol
– TaskTracker <------------> JobTracker: InterTrackerProtocol
– TaskTracker <-------------> Child: TaskUmbilicalProtocol
• The JobTracker implements both of its protocols and acts as the server in both IPC channels
• The TaskTracker implements the TaskUmbilicalProtocol; the Child gets task information and reports task status through it.
40. JobClient.submitJob - 1
• Check the input and output, e.g. check whether the output directory already exists
– job.getInputFormat().validateInput(job);
– job.getOutputFormat().checkOutputSpecs(fs, job);
• Get the InputSplits, sort them, and write the output to HDFS
– InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
– writeSplitsFile(splits, out); // out is $SYSTEMDIR/$JOBID/job.split
41. JobClient.submitJob - 2
• The jar file and configuration file are uploaded to the HDFS system directory
– job.write(out); // out is $SYSTEMDIR/$JOBID/job.xml
• JobStatus status = jobSubmitClient.submitJob(jobId);
– This is an RPC invocation; jobSubmitClient is a proxy created during initialization
42. Job initialization on JobTracker - 1
• JobTracker.submitJob(jobID) <-- receives the RPC invocation request
• JobInProgress job = new JobInProgress(jobId, this, this.conf)
• Add the job to the job queues:
– jobs.put(job.getProfile().getJobId(), job);
– jobsByPriority.add(job);
– jobInitQueue.add(job);
43. Job initialization on JobTracker - 2
• Sort by priority
– resortPriority();
– compares the JobPriority first, then the JobSubmissionTime
• Wake the JobInitThread
– jobInitQueue.notifyAll();
– job = jobInitQueue.remove(0);
– job.initTasks();
46. JobTracker Task Scheduling - 1
• Task getNewTaskForTaskTracker(String taskTracker)
• Compute the maximum number of tasks that can run on the TaskTracker:
– int maxCurrentMapTasks = tts.getMaxMapTasks();
– int maxMapLoad = Math.min(maxCurrentMapTasks, (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));
47. JobTracker Task Scheduling - 2
• int numMaps = tts.countMapTasks(); // number of currently running map tasks
• If numMaps < maxMapLoad, more tasks can be allocated; then, based on priority, pick the first job from the jobsByPriority queue, create a task, and return it to the TaskTracker (worked example below):
– Task t = job.obtainNewMapTask(tts, numTaskTrackers);
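A quick worked example with illustrative numbers: if tts.getMaxMapTasks() returns 4, remainingMapLoad is 10, and numTaskTrackers is 3, then maxMapLoad = min(4, ceil(10/3)) = min(4, 4) = 4. A TaskTracker currently running numMaps = 2 map tasks satisfies 2 < 4, so it is offered a new map task from the highest-priority job.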