The document provides an overview of the Hadoop Distributed File System (HDFS). HDFS is the storage layer of Hadoop and is built on distributed file system principles. It has a master-slave architecture, with the NameNode as the master and DataNodes as slaves. Files are broken into blocks that are replicated across DataNodes for fault tolerance. The document outlines the key components of HDFS and explains how read and write operations work.
A worked example of HDFS block allocation:
* The file size is 1664 MB
* The HDFS block size is 128 MB by default in Hadoop 2.0
* Number of blocks required = file size / block size
* 1664 MB / 128 MB = 13 blocks
* 8 blocks have been uploaded successfully
* Remaining blocks = total blocks - uploaded blocks = 13 - 8 = 5
If another client tries to access/read the data while the upload is still in progress, it will only be able to read the data from the 8 blocks that have been uploaded so far. The remaining 5 blocks will not be available or visible to other clients until the full upload is completed. HDFS follows write-once semantics, so partially written blocks are not exposed to readers.
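The same arithmetic as a minimal sketch (plain Java, not Hadoop API code; the ceiling division covers files that are not an exact multiple of the block size):

public class BlockCount {
    static final long BLOCK_SIZE_MB = 128;          // HDFS default in Hadoop 2.x

    // Ceiling division: a file that is not an exact multiple of the block
    // size still needs one extra (partially filled) block.
    static long blocksFor(long fileSizeMb) {
        return (fileSizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;
    }

    public static void main(String[] args) {
        long total = blocksFor(1664);                // 1664 / 128 = 13 blocks
        long uploaded = 8;
        System.out.println("Total blocks: " + total);
        System.out.println("Remaining blocks: " + (total - uploaded)); // 13 - 8 = 5
    }
}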
2. Introduction
• What is Big Data?
– Bulk amount
– Unstructured
• Lots of applications need to handle huge amounts of data (in terms of 500+ TB per day)
• If a regular machine needs to transmit 1 TB of data through 4 channels: 43 minutes.
• What about 500 TB?
3. What is Hadoop?
• Framework for large-scale data processing
• Inspired by Google’s Architecture:
– Google File System (GFS) and MapReduce
• Open-source Apache project
– Nutch search engine project
– Apache Incubator
• Written in Java and shell scripts
4. Hadoop Distributed File System (HDFS)
• Storage unit of Hadoop
• Relies on the principles of a distributed file system
• HDFS has a master-slave architecture
• Main components:
– Name Node: master
– Data Node: slave
• 3 replicas for each block (by default)
• Default block size: 128 MB
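As an illustrative sketch (not part of the slides), the block size and replication factor can also be overridden per client through Hadoop's Configuration API; the property names below are the Hadoop 2.x ones (dfs.blocksize, dfs.replication):

import org.apache.hadoop.conf.Configuration;

// Illustrative only: override block size and replication factor for files
// created through this client configuration.
public class HdfsClientConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // 128 MB blocks
        conf.setInt("dfs.replication", 3);                 // 3 replicas per block
        return conf;
    }
}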
5. Hadoop Distributed File System (HDFS)
• Hadoop Distributed File System (HDFS)
– Runs entirely in userspace
– The file system is dynamically distributed across multiple computers
– Allows nodes to be added or removed easily
– Highly scalable in a horizontal fashion
• Hadoop Development Platform
– Uses a MapReduce model for working with data
– Users can program in Java, C++, and other languages
6. Why should I use Hadoop?
• Fault-tolerant hardware is expensive
• Hadoop designed to run on commodity hardware
• Automatically handles data replication and deals with node failure
• Does all the hard work so you can focus on processing data
7. HDFS: Key Features
• Highly fault tolerant: automatic failure recovery system
• High aggregate throughput for streaming large files
• Supports replication and locality features
• Designed to work with very large files (sizes in TB) that are few in number
• Provides streaming access to file system data; it is specifically good for write-once, read-many files (for example, log files)
8. Hadoop Distributed File System (HDFS)
• Can be built out of commodity hardware; HDFS doesn't need highly expensive storage devices
– Uses off-the-shelf hardware
• Rapid elasticity
– Need more capacity? Just assign some more nodes
– Scalable
– Can add or remove nodes with little effort or reconfiguration
• Resistant to failure
– Individual node failure does not disrupt the system
10. What features does Hadoop offer?
• API and implementation for working with MapReduce
• Infrastructure
– Job configuration and efficient scheduling
– Web-based monitoring of cluster stats
– Handles failures in computation and data nodes
– Distributed file system optimized for huge amounts of data
11. When should you choose Hadoop?
• Need to process a lot of unstructured data
• Processing needs are easily run in parallel
• Batch jobs are acceptable
• Access to lots of cheap commodity machines
12. When should you avoid Hadoop?
• Intense calculations with little or no data
• Processing cannot easily run in parallel
• Data is not self-contained
• Need interactive results
13. Hadoop Examples
• Hadoop would be a good choice for:
– Indexing log files
– Sorting vast amounts of data
– Image analysis
– Search engine optimization
– Analytics
• Hadoop would be a poor choice for:
– Calculating Pi to 1,000,000 digits
– Calculating Fibonacci sequences
– A general RDBMS replacement
14. Hadoop Distributed File System (HDFS)
• How does Hadoop work?
– Runs on top of multiple commodity systems
– A Hadoop cluster is composed of nodes
• One master node
• Many slave nodes
– Multiple nodes are used for storing and processing data
– The system abstracts the underlying hardware from users/software
15. How HDFS works: Split Data
• Data copied into HDFS is split into blocks
• Typical HDFS block size is 128 MB
– (vs. 4 KB on UNIX file systems)
16. How HDFS works: Replication
• Each block is replicated to multiple machines
• This allows for node failure without data loss
[Diagram: Blocks #1, #2, and #3 replicated across Data Node 1, Data Node 2, and Data Node 3]
18. Hadoop Distributed File System (HDFS)
• HDFS consists of data blocks
– Files are divided into data blocks
– Default size is 64 MB (128 MB from Hadoop 2 onward)
– Default replication of blocks is 3
– Blocks are spread out over Data Nodes
• HDFS is a multi-node system
– Name Node (master): single point of failure
– Data Node (slave): failure tolerant (data replication)
19. Hadoop Architecture Overview
[Diagram: Client, Job Tracker, Task Trackers, Name Node, and Data Nodes in a Hadoop cluster]
20. Hadoop Components: Job Tracker
[Diagram: Client, Job Tracker, Task Trackers, Name Node, and Data Nodes]
• Only one Job Tracker per cluster
• Receives job requests submitted by the client
• Schedules and monitors jobs on task trackers
21. Hadoop Components: Name Node
[Diagram: Client, Job Tracker, Task Trackers, Name Node, and Data Nodes]
• One active Name Node per cluster
• Manages the file system namespace and metadata
• Single point of failure: a good place to spend money on hardware
22. Name Node
• Master of HDFS
• Maintains and manages data on the Data Nodes
• High-reliability machine (can even be RAID)
• Expensive hardware
• Stores NO data; just holds metadata!
• Secondary Name Node:
– Reads from the RAM of the Name Node and stores it to hard disk periodically
• Active/passive Name Nodes from Gen2 Hadoop
23. Hadoop Components: Task Tracker
[Diagram: Client, Job Tracker, Task Trackers, Name Node, and Data Nodes]
• There are typically a lot of task trackers
• Responsible for executing operations
• Reads blocks of data from data nodes
24. Hadoop Components: Data Node
[Diagram: Client, Job Tracker, Task Trackers, Name Node, and Data Nodes]
• There are typically a lot of data nodes
• Data nodes manage data blocks and serve them to clients
• Data is replicated, so failure is not a problem
25. Data Nodes
• Slaves in HDFS
• Provide data storage
• Deployed on independent machines
• Responsible for serving read/write requests from clients
• Data processing is done on the Data Nodes
29. HDFS Operation
• The client makes a write request to the Name Node
• The Name Node responds with information about the available Data Nodes and where the data is to be written
• The client writes the data to the addressed Data Node
• Replicas of all blocks are automatically created by the data pipeline
• If a write fails, the Data Node notifies the client and a new location to write to is obtained
• If the write completes successfully, an acknowledgement is given to the client
• Writes are non-posted (the client waits for the acknowledgement)
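A minimal client-side sketch of this write path using Hadoop's public FileSystem API (the path and payload are made up for illustration); the Name Node lookup, Data Node addressing, and replica pipeline described above all happen behind the create() call:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative HDFS write from the client's point of view.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // talks to the Name Node

        Path target = new Path("/tmp/example.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(target, true /* overwrite */)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }                                           // close() waits for the ack from the pipeline
    }
}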
32. Hadoop: Hadoop Stack
• Hadoop Development Platform
– User-written code runs on the system
– The system appears to the user as a single entity
– The user does not need to worry about the distributed system
– Many systems can run on top of Hadoop
• Allows further abstraction from the system
33. Hadoop: Hive and HBase
• Hive and HBase are layers on top of Hadoop
• HBase and Hive are applications
• They provide an interface to data on HDFS
• Other programs or applications may use Hive or HBase as an intermediate layer
[Diagram: HBase, ZooKeeper, and Hadoop stack]
34. Hadoop: Hive
• Hive
– Data warehousing application
– SQL-like commands (HiveQL)
– Not a traditional relational database
– Scales horizontally with ease
– Supports massive amounts of data*
* Facebook had more than 15 PB of information stored in it and imported 60 TB each day (as of 2010)
35. Hadoop: HBase
• HBase
– No SQL-like language
• Uses a custom Java API for working with data
– Modeled after Google's BigTable
– Random read/write operations allowed
– Multiple concurrent read/write operations allowed
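A small sketch of that Java API (the "users" table and "info" column family are hypothetical; assumes a running HBase cluster with the standard client on the classpath):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative random write followed by a random read through the HBase Java client.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            Put put = new Put(Bytes.toBytes("row1"));                  // random write
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row1"))); // random read
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}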
36. Hadoop MapReduce
• Hadoop has its own implementation of MapReduce
• Hadoop 1.0.4
– API: http://hadoop.apache.org/docs/r1.0.4/api/
– Tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
• Custom serialization
– Data types
– Writable/Comparable
– Text vs. String
– LongWritable vs. long
– IntWritable vs. int
– DoubleWritable vs. double
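A small sketch of these Writable wrapper types next to their plain Java counterparts (nothing here is job-specific; it only shows the wrapper API):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Writable wrappers are mutable, serializable holders used as MapReduce
// keys and values in place of plain Java types.
public class WritableTypes {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(1);      // vs. int
        count.set(count.get() + 1);                  // mutable: the same object can be reused

        LongWritable offset = new LongWritable(128L * 1024 * 1024); // vs. long

        Text word = new Text("hadoop");              // vs. String (UTF-8 backed)
        System.out.println(word.toString() + " " + count.get() + " " + offset.get());
    }
}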
45. Hadoop MR Job Interface: Input Format
• The Hadoop MapReduce framework spawns one map task for each InputSplit
• InputSplit: the input file is split into InputSplits (logical splits, usually one per block, not physically split chunks) via InputFormat.getSplits()
• The number of maps is usually driven by the total number of blocks (InputSplits) of the input files
– With a block size of 128 MB, a 10 TB file is configured with about 82,000 maps
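The map-count arithmetic from this slide, as a quick sketch:

// Quick check of the slide's numbers: 10 TB of input at a 128 MB block size.
public class MapCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;                 // 128 MB
        long inputSize = 10L * 1024 * 1024 * 1024 * 1024;    // 10 TB
        long splits = (inputSize + blockSize - 1) / blockSize;
        System.out.println(splits + " input splits -> ~82,000 map tasks"); // 81,920
    }
}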
46. Hadoop MR Job Interface: map()
• The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair (line_num, line_string) in the InputSplit for that task.
• Output pairs are collected with calls to OutputCollector.collect(WritableComparable, Writable).
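As a concrete sketch against the old org.apache.hadoop.mapred API referenced on the earlier API slide, a word-count style Mapper could look like this (the class name is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// For each (byte offset, line) pair the framework passes in, emit (word, 1).
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);   // OutputCollector.collect(WritableComparable, Writable)
        }
    }
}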
47. Hadoop MR Job Interface: combiner()
• Optional combiner, via JobConf.setCombinerClass(Class)
• Used to perform local aggregation of the intermediate outputs of the mapper
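A sketch of how the combiner is wired into a job with the old JobConf API. WordCountMapper is the illustrative class above, and WordCountReducer is the illustrative reducer sketched after the reducer slides below; the reducer can double as the combiner because word-count aggregation is associative and commutative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Illustrative job setup for the word-count sketches in this section.
public class WordCountJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountJob.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);   // local aggregation on the map side
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}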
48. Hadoop MR Job Interface: Partitioner()
• The Partitioner controls the partitioning of the keys of the intermediate map outputs.
• The key (or a subset of the key) is used to derive the partition, typically by a hash function.
• The total number of partitions is the same as the number of reducers.
• HashPartitioner is the default Partitioner for the job's reduce tasks.
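A minimal custom Partitioner in the same old API (illustrative; it would be registered with JobConf.setPartitionerClass). It hashes the key the same way the default HashPartitioner does; a real custom partitioner would derive the partition from an application-specific part of the key instead:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each intermediate key to one of numPartitions reduce tasks.
public class WordPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    @Override
    public void configure(JobConf job) {
        // no configuration needed for this sketch
    }
}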
50. Hadoop MR Job Interface: reducer()
• Shuffle: the input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
• Sort: the framework groups Reducer inputs by key (since different mappers may have output the same key) in this stage.
• The shuffle and sort phases occur simultaneously; while map outputs are being fetched they are merged.
51. Hadoop MR Job Interface: reducer()
• Reduce: the framework then calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) for each (key, list of values) pair in the grouped inputs.
• The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(WritableComparable, Writable).
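Completing the word-count sketch, the matching Reducer in the old API (illustrative class name; it is also used as the combiner in the job sketch earlier):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Called once per key with an iterator over all values grouped for that key;
// emits (word, total count).
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}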
61. Quick Overview of Other Topics
• Dealing with failures
• Hadoop Distributed FileSystem (HDFS)
• Optimizing a MapReduce job
62. Dealing with Failures and Slow Tasks
• What to do when a task fails?
– Try again (retries are possible because of idempotence)
– Try again somewhere else
– Report failure
• What about slow tasks (stragglers)?
– Run another version of the same task in parallel and take results from the one that finishes first
– What are the pros and cons of this approach?
• Fault tolerance is of high priority in the MapReduce framework
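As an illustrative sketch (not from the slides), both behaviours — retrying failed tasks and speculatively running backup copies of stragglers — can be toggled per job through the old JobConf API; the values shown are examples, stated here as assumptions rather than recommendations:

import org.apache.hadoop.mapred.JobConf;

// Illustrative knobs for the failure-handling strategies above (old mapred API).
public class FailureHandlingConfig {
    public static JobConf configure(JobConf conf) {
        conf.setMaxMapAttempts(4);                  // "try again": retry a failed map task up to 4 times
        conf.setMaxReduceAttempts(4);               // same for reduce tasks
        conf.setMapSpeculativeExecution(true);      // run backup copies of straggling map tasks
        conf.setReduceSpeculativeExecution(true);   // and of straggling reduce tasks
        return conf;
    }
}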
64. Lifecycle of a MapReduce Job
[Diagram: input splits feeding Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2, laid out over time]
• How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
67. Tuning Hadoop Job Conf. Parameters
• Do their settings impact performance?
• What are ways to set these parameters?
– Defaults: are they good enough?
– Best practices: the best setting can depend on data, job, and cluster properties
– Automatic setting
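A sketch of setting these parameters explicitly per job, using the Hadoop 1.x property names that appear on the TeraSort slides below (io.sort.factor, io.sort.record.percent); the specific values are illustrative, not recommendations:

import org.apache.hadoop.mapred.JobConf;

// Illustrative per-job overrides of the sort/shuffle parameters referenced
// in the TeraSort experiments below.
public class SortTuning {
    public static JobConf tune(JobConf conf) {
        conf.setNumReduceTasks(32);                       // number of reduce tasks
        conf.setInt("io.sort.factor", 500);               // concurrent sorted streams merged at once
        conf.setFloat("io.sort.record.percent", 0.05f);   // fraction of map-side sort buffer for metadata
        return conf;
    }
}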
68. Experimental Setting
• Hadoop cluster on 1 master + 16 workers
• Each node:
– 2 GHz AMD processor, 1.8 GB RAM, 30 GB local disk
– Relatively ill-provisioned!
– Xen VM running Debian Linux
– Max 4 concurrent maps, 2 reduces
• Maximum map wave size = 16 x 4 = 64
• Maximum reduce wave size = 16 x 2 = 32
• Not all users can run large Hadoop clusters:
– Can Hadoop be made competitive in the 10-25 node, multi-GB to TB data size range?
70. Hadoop 50 GB TeraSort
• Varying the number of reduce tasks, the number of concurrent sorted streams for merging, and the fraction of the map-side sort buffer devoted to metadata storage
71. Hadoop 50 GB TeraSort
• Varying the number of reduce tasks for different values of the fraction of the map-side sort buffer devoted to metadata storage (with io.sort.factor = 500)
72. Hadoop 50 GB TeraSort
• Varying the number of reduce tasks for different values of io.sort.factor (io.sort.record.percent = 0.05, default)