Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It includes HDFS, a distributed file system, and MapReduce, a programming model for large-scale data processing. HDFS stores data reliably across clusters and allows computations to be processed in parallel near the data. The key components are the NameNode, DataNodes, JobTracker and TaskTrackers. HDFS provides high throughput access to application data and is suitable for applications handling large datasets.
3. Operating systems
• Operating system - software that supervises and controls tasks on a computer. Types of individual OS:
– Batch processing: jobs are collected and placed in a queue; there is no interaction with a job during processing
– Time-shared: computing resources are shared among different users, who can interact with their programs during execution
– Real-time systems: fast response times; processing can be interrupted
4. Distributed Systems
• Consist of a number of computers that are connected and managed so that they automatically share the job processing load among the constituent computers.
• A distributed operating system is one that appears to its users as a traditional uniprocessor system, even though it is actually composed of multiple processors.
• It gives a single system view to its users and provides a single service.
• The location of files is transparent to users; the system provides a virtual computing environment.
• E.g. the Internet, ATM banking networks, mobile computing networks, Global Positioning Systems and Air Traffic Control.
A DISTRIBUTED SYSTEM IS A COLLECTION OF INDEPENDENT COMPUTERS THAT APPEARS TO ITS USERS AS A SINGLE COHERENT SYSTEM.
5. Network Operating System
• In a network operating system, users are aware of the existence of multiple computers.
• The operating system of each individual computer must provide facilities for communication and shared functionality.
• Each machine runs its own OS and has its own users.
• Supports remote login and remote file access.
• Less transparent than a distributed OS, but gives each machine more independence.
[Diagram: Distributed OS vs. networked OS - in a distributed OS, all applications run on top of a single layer of distributed operating system services; in a networked OS, each application runs on top of its own machine's network OS.]
6. DFS
• Resource sharing is the motivation behind distributed systems; to share files, we need a shared file system.
• A file system is responsible for the organization, storage, retrieval, naming, sharing, and protection of files.
• The file system controls access to the data and performs low-level operations such as buffering frequently used data and issuing disk I/O requests.
• The goal of a distributed file system is to allow users of physically distributed computers to share data and storage resources through a common file system.
7. Hadoop
What is Hadoop?
• A framework for running applications on large clusters of commodity hardware, aimed at workloads that produce huge amounts of data and need to process it
• An Apache Software Foundation project
• Open source
• Runs on platforms such as Amazon's EC2
• Alpha (0.18) release available for download
Hadoop includes:
• HDFS, a distributed filesystem
• Map/Reduce, a programming model implemented on top of HDFS; it is an offline computing engine
Concept: moving computation is more efficient than moving large data.
8. • Data-intensive applications with petabytes of data.
• Web pages: 20+ billion pages x 20 KB = 400+ terabytes
– One computer can read 30-35 MB/sec from disk: ~four months to read the web
– The same problem with 1000 machines: < 3 hours (checked in the sketch after this list)
• Difficulties with a large number of machines:
– communication and coordination
– recovering from machine failure
– status reporting
– debugging
– optimization
– locality
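A quick back-of-the-envelope check of these figures, as a Python sketch (the 32.5 MB/s read rate is my midpoint assumption for the slide's 30-35 MB/sec range):

# Back-of-the-envelope check of the slide's numbers (assumption: read rate
# taken as 32.5 MB/s, the midpoint of the quoted 30-35 MB/sec range).
corpus_bytes = 20e9 * 20e3      # 20 billion pages x 20 KB ~= 400 TB
read_rate = 32.5e6              # bytes per second from one disk

one_machine_days = corpus_bytes / read_rate / 86_400
cluster_hours = corpus_bytes / (read_rate * 1000) / 3_600

print(f"one machine:   ~{one_machine_days:.0f} days")   # ~142 days, i.e. months
print(f"1000 machines: ~{cluster_hours:.1f} hours")     # roughly 3 hours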
9. FACTS
• Single-thread performance doesn't matter: we have large problems, and total throughput/price is more important than peak performance
• Stuff breaks, so we need more reliability:
– If you have one server, it may stay up three years (1,000 days)
– If you have 10,000 servers, expect to lose ten a day
• "Ultra-reliable" hardware doesn't really help: at large scale, super-fancy reliable hardware still fails, albeit less often
– software still needs to be fault-tolerant
– commodity machines without fancy hardware give better perf/price
DECISION: COMMODITY HARDWARE.
DFS: HADOOP - REASONS?
10. HDFS - Why? Seek vs. Transfer
• CPU and transfer speed, RAM and disk size double every 18-24 months
• Seek time stays nearly constant (improving only ~5%/year)
• So the time to read an entire drive keeps growing, because capacity outpaces transfer rate
• Moral: scalable computing must go at the transfer rate (illustrated below)
• B-Tree (relational DBs):
– operates at the seek rate, log(N) seeks/access
– memory / stream based
• Sort/merge of flat files (MapReduce):
– operates at the transfer rate, log(N) transfers/sort
– batch based
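To make the moral concrete, here is a toy Python comparison (a sketch; the disk figures below are assumptions of mine, not from the slides) of updating records by random seeks versus streaming over the whole dataset:

# Toy model of seek-rate vs transfer-rate access (assumed figures:
# ~10 ms per random seek, ~100 MB/s sequential transfer).
seek_time = 0.010           # seconds per random seek
transfer_rate = 100e6       # bytes/sec sequential
record_size = 100           # bytes per record
n_records = 10**9           # 100 GB of data in total

# Update 1% of the records in place, one random seek per record (B-tree style):
random_secs = (n_records // 100) * seek_time

# Or rewrite the whole dataset in one sequential streaming pass (sort/merge style):
stream_secs = n_records * record_size / transfer_rate

print(f"random access: ~{random_secs / 3600:.1f} hours")   # ~27.8 hours
print(f"streaming:     ~{stream_secs / 60:.1f} minutes")   # ~16.7 minutes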
12. Characteristics
• A fault-tolerant, scalable, efficient, reliable distributed storage system
• Moves computation to the place where the data is
• A single cluster holds both computation and data
• Processes huge amounts of data
• Scalable: stores and processes petabytes of data
• Economical:
– Distributes the data and processing across clusters of commonly available computers
– Clusters PCs into a storage and computing platform
– Minimises the number of CPU cycles, the amount of RAM, etc. needed on individual machines
• Efficient:
– By distributing the data, Hadoop can process it in parallel on the nodes where the data is located; this makes it extremely rapid
– Computation is moved to the place where the data is present
• Reliable:
– Hadoop automatically maintains multiple copies of data
– Automatically redeploys computing tasks after failures
14. • Data Model
– Data is organized into files and directories
– Files are divided into uniform-sized blocks and distributed across cluster nodes
– Blocks are replicated to handle hardware failure
– Checksums of the data are kept for corruption detection and recovery
– Block placement is exposed so that computation can be migrated to the data
• Optimised for large streaming reads and small random reads
• Facility for multiple clients to append to a file
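As a mental model, the namespace maps file paths to block lists, and blocks to replica locations. A minimal Python sketch of my own (not Hadoop code; all names are made up for illustration):

# Toy picture of the HDFS data model: files -> uniform blocks -> replicas.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the typical block size cited later

namespace = {
    "/logs/2009/web.log": {
        "blocks": ["blk_001", "blk_002"],   # file split into uniform blocks
        "replication": 3,
    },
}
block_locations = {
    "blk_001": ["datanode1", "datanode7", "datanode9"],  # 3 replicas each
    "blk_002": ["datanode2", "datanode7", "datanode8"],
}

def blocks_needed(file_size: int) -> int:
    """Number of uniform-sized blocks a file of file_size bytes occupies."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

print(blocks_needed(200 * 1024 * 1024))  # a 200 MB file -> 2 blocks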
15. • Assumes commodity hardware that fails:
– Files are replicated to handle hardware failure
– Checksums are kept for corruption detection and recovery
– Operation continues as nodes / racks are added / removed
• Optimized for fast batch processing:
– Data location is exposed to allow computation to move to the data
– Data is stored in chunks/blocks on every node in the cluster
– Provides VERY high aggregate bandwidth
16. • Files are broken into large blocks:
– Typically 128 MB block size
– Blocks are replicated for reliability: one replica on the local node, another replica on a remote rack, a third replica on the local rack, and any additional replicas placed randomly (sketched below)
• Understands rack locality:
– Data placement is exposed so that computation can be migrated to the data
• Clients talk to both the NameNode and the DataNodes:
– Data is not sent through the NameNode; clients access data directly from the DataNodes
– Throughput of the file system scales nearly linearly with the number of nodes
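The placement rule above can be sketched as a small Python function (my illustration, assuming a toy node-to-rack map; real HDFS placement has more constraints):

# Sketch of the replica-placement policy described on the slide.
import random

cluster = {
    "dn1": "rack-A", "dn2": "rack-A", "dn3": "rack-A",
    "dn4": "rack-B", "dn5": "rack-B", "dn6": "rack-B",
}

def place_replicas(writer: str, n_replicas: int = 3) -> list[str]:
    """First replica on the writer's node, second on a remote rack,
    third on the writer's (local) rack, extras placed at random."""
    local_rack = cluster[writer]
    replicas = [writer]
    remote = [n for n, r in cluster.items() if r != local_rack]
    replicas.append(random.choice(remote))
    local_others = [n for n, r in cluster.items()
                    if r == local_rack and n not in replicas]
    replicas.append(random.choice(local_others))
    remaining = [n for n in cluster if n not in replicas]
    while len(replicas) < n_replicas and remaining:
        choice = random.choice(remaining)
        replicas.append(choice)
        remaining.remove(choice)
    return replicas[:n_replicas]

print(place_replicas("dn1"))  # e.g. ['dn1', 'dn5', 'dn2']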
19. Components
• DFS master "NameNode":
– Manages the file system namespace
– Controls read/write access to files
– Manages block replication
– Checkpoints the namespace and journals namespace changes for reliability
• Metadata of the NameNode is held in memory:
– The entire metadata is in main memory
– No demand paging of file-system metadata
• Types of metadata: list of files; file and chunk namespaces; list of blocks; locations of replicas; file attributes, etc.
20. DFS slaves, or "DataNodes"
• Serve read/write requests from clients
• Perform replication tasks upon instruction by the NameNode
DataNodes act as:
1) A block server:
– Stores data in the local file system
– Stores the metadata of a block (e.g. its CRC)
– Serves data and metadata to clients
2) Block reports: periodically sends a report of all existing blocks to the NameNode
3) Heartbeats: periodically sends a heartbeat to the NameNode, so that node failures can be detected (see the sketch after this list)
4) Facilitates pipelining of data (to other specified DataNodes)
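The periodic duties in 2) and 3) might look like this in outline (a Python sketch of mine; the interval values and callback names are assumptions, not Hadoop's actual defaults):

# Outline of a DataNode's periodic duties: frequent heartbeats, rare
# full block reports. The two callables stand in for RPCs to the NameNode.
import time

HEARTBEAT_SECS = 3
BLOCK_REPORT_SECS = 3600      # block reports are far less frequent

def datanode_loop(send_heartbeat, send_block_report, local_blocks, ticks=None):
    """ticks limits the loop for demonstration; a real daemon runs forever."""
    last_report = float("-inf")
    n = 0
    while ticks is None or n < ticks:
        send_heartbeat()                             # "I am alive" (failure detection)
        now = time.monotonic()
        if now - last_report >= BLOCK_REPORT_SECS:
            send_block_report(list(local_blocks))    # report all stored blocks
            last_report = now
        time.sleep(HEARTBEAT_SECS)
        n += 1

datanode_loop(lambda: print("heartbeat"),
              lambda blocks: print("block report:", blocks),
              {"blk_001", "blk_002"}, ticks=2)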
21. • Map/Reduce master "JobTracker":
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to TaskTrackers
– Monitors task and TaskTracker status; re-executes tasks upon failure
• Map/Reduce slaves "TaskTrackers":
– Run Map and Reduce tasks upon instruction from the JobTracker
– Manage storage and transmission of intermediate output
22. Secondary NameNode
• Copies the FsImage and transaction log from the NameNode to a temporary directory
• Merges the FsImage and transaction log into a new FsImage in the temporary directory
• Uploads the new FsImage to the NameNode
– The transaction log on the NameNode is then purged
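In outline, the checkpoint is just "replay the journal onto the image". A toy Python sketch of mine, modelling the namespace image as a dict and the transaction log as a list of operations:

# Toy checkpoint: merge FsImage + transaction log into a new FsImage.
def apply(fsimage: dict, op: tuple) -> None:
    """Replay one journaled namespace change onto the image."""
    kind, path, value = op
    if kind == "create":
        fsimage[path] = value
    elif kind == "delete":
        fsimage.pop(path, None)

def checkpoint(fsimage: dict, edit_log: list) -> tuple[dict, list]:
    """Return the merged image and an empty (purged) transaction log."""
    new_image = dict(fsimage)          # work on a copy, as in the temp directory
    for op in edit_log:
        apply(new_image, op)
    return new_image, []

image = {"/a": {"blocks": ["blk_1"]}}
log = [("create", "/b", {"blocks": ["blk_2"]}), ("delete", "/a", None)]
image, log = checkpoint(image, log)
print(image)   # {'/b': {'blocks': ['blk_2']}}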
23. HDFS Architecture
• NameNode: maps (filename, offset) → block ID, and block ID → DataNodes
• DataNode: maps block ID → location on local disk
• Secondary NameNode: periodically merges edit logs
A block is also called a chunk.
25. HDFS API
• Most common file and directory operations are supported:
– create, open, close, read, write, seek, list, delete, etc.
• Files are write-once and have exclusively one writer
• Some operations are peculiar to HDFS:
– set replication, get block locations
• Support for owners and permissions
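For illustration, the same operations can be driven from Python through the standard hdfs command-line tool (a sketch assuming a configured Hadoop client on the PATH; the paths are made up):

# Driving common HDFS operations via the hdfs CLI from Python.
import subprocess

def hdfs(*args: str) -> str:
    return subprocess.run(["hdfs", *args], check=True,
                          capture_output=True, text=True).stdout

hdfs("dfs", "-put", "local.log", "/logs/web.log")       # create/write once
print(hdfs("dfs", "-cat", "/logs/web.log")[:200])       # open/read
hdfs("dfs", "-setrep", "3", "/logs/web.log")            # set replication
# Get block locations (an HDFS-specific operation):
print(hdfs("fsck", "/logs/web.log", "-files", "-blocks", "-locations"))
hdfs("dfs", "-rm", "/logs/web.log")                     # delete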
26. Data Correctness
• Checksums are used to validate data:
– CRC32 is used
• File creation:
– The client computes a checksum per 512 bytes (sketched below)
– The DataNode stores the checksums
• File access:
– The client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
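Per-chunk checksumming of this kind is easy to sketch with Python's zlib.crc32 (my illustration; HDFS's actual on-disk checksum format differs in detail):

# CRC32 per 512-byte chunk, verified on read.
import zlib

CHUNK = 512  # bytes per checksum, as on the slide

def checksums(data: bytes) -> list[int]:
    """CRC32 of every 512-byte chunk of the data."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, stored: list[int]) -> bool:
    """On read, recompute and compare; on mismatch, try another replica."""
    return checksums(data) == stored

blob = b"some block contents" * 100
stored = checksums(blob)                     # computed at write time
assert verify(blob, stored)                  # clean read passes
assert not verify(blob[:-1] + b"!", stored)  # corruption is detected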
27. Mutation Order and Leases
• A mutation is an operation that changes the contents or metadata of a chunk, such as an append or write operation.
• Each mutation is performed at all replicas.
• Leases are used to keep the order of mutations consistent:
– The master grants a chunk lease to one replica (the primary)
– The primary picks the serial order for all mutations to the chunk
• All replicas follow this order (consistency)
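A toy Python sketch of this scheme (my illustration): the primary assigns serial numbers, and every replica applies mutations strictly in that serial order, buffering anything that arrives early:

# Primary-ordered mutations: one lease holder serialises, replicas follow.
class PrimaryReplica:
    """Holds the lease for a chunk and assigns a serial order to mutations."""
    def __init__(self):
        self.next_serial = 0
    def order(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        return serial, mutation

class Replica:
    def __init__(self):
        self.log = {}
        self.applied = []
        self.next_expected = 0
    def apply(self, serial, mutation):
        """Buffer out-of-order arrivals; apply strictly in serial order."""
        self.log[serial] = mutation
        while self.next_expected in self.log:
            self.applied.append(self.log.pop(self.next_expected))
            self.next_expected += 1

primary = PrimaryReplica()
ordered = [primary.order(m) for m in ["append A", "append B", "append C"]]
replica = Replica()
for serial, mutation in reversed(ordered):   # even if delivered out of order...
    replica.apply(serial, mutation)
print(replica.applied)                       # ...applied as A, B, C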
29. Software Model
• Parallel programming improves performance and efficiency.
• In a parallel program, the processing is broken up into parts, each of which can be executed concurrently.
• First identify whether the problem can be parallelised at all (a recursive Fibonacci computation cannot, since each term depends on the previous ones)
• Matrix operations whose elements are independent parallelise well
30. Master/Worker
• The MASTER:
– initializes the array and splits it up according to the number of available WORKERS
– sends each WORKER its subarray
– receives the results from each WORKER
• The WORKER:
– receives the subarray from the MASTER
– performs processing on the subarray
– returns results to the MASTER
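A minimal runnable version of this pattern using Python's multiprocessing (my illustration; the "processing" here is just summing each subarray):

# Master/worker: the master splits the array, workers process subarrays.
from multiprocessing import Pool

def worker(subarray):
    """WORKER: receive a subarray, process it, return the result."""
    return sum(subarray)

def master(array, n_workers=4):
    """MASTER: split the array, hand subarrays to workers, gather results."""
    step = (len(array) + n_workers - 1) // n_workers
    chunks = [array[i:i + step] for i in range(0, len(array), step)]
    with Pool(n_workers) as pool:
        partials = pool.map(worker, chunks)   # send subarrays / receive results
    return sum(partials)

if __name__ == "__main__":
    print(master(list(range(1_000_000))))     # 499999500000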
31. Calculating PI
• Inscribe a circle of radius r in a square of side 2r.
• The area of the square: As = (2r)^2 = 4r^2
• The area of the circle: Ac = pi * r^2
• So pi = Ac / r^2, and since r^2 = As / 4, pi = 4 * Ac / As
• Hence pi ≈ 4 * (number of points in the circle) / (number of points in the square)
32. • Randomly generate points in the square
• Count how many of the generated points fall inside the circle as well as the square - the MAP step
(each mapper finds ra = number of points in the circle / number of points in the square)
• Gather all the ra values and combine them; PI = 4 * ra - the REDUCE step
In short: the counting of points in the circle is parallelised (MAP), and the partial ratios are then merged to find PI (REDUCE).
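A compact Python sketch of this job (my illustration; a real Hadoop job would package the same two functions as Mapper and Reducer classes):

# Monte Carlo PI in map/reduce style.
import random

def map_points(n_points: int) -> tuple[int, int]:
    """MAP: throw n_points darts at the unit square, count hits in the circle.
    The quarter circle in the unit square gives the same pi/4 ratio."""
    in_circle = 0
    for _ in range(n_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            in_circle += 1
    return in_circle, n_points

def reduce_pi(partials: list[tuple[int, int]]) -> float:
    """REDUCE: merge the per-mapper counts into a single estimate of PI."""
    hits = sum(h for h, _ in partials)
    total = sum(n for _, n in partials)
    return 4.0 * hits / total

partials = [map_points(100_000) for _ in range(8)]   # 8 parallel "mappers"
print(reduce_pi(partials))                           # ~3.14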
34. What is Map/Reduce programming?
• A restricted parallel programming model meant for large clusters:
– The user implements Map() and Reduce()
• A parallel computing framework (the HDFS libraries):
– The libraries take care of EVERYTHING else (abstraction):
• Parallelization
• Fault tolerance
• Data distribution
• Load balancing
• A useful model for many practical tasks
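The canonical example is word count. Below is a plain-Python sketch of what the user-supplied Map() and Reduce() look like, with a tiny in-process stand-in for the shuffle that the framework performs (my illustration; a real Hadoop job would implement Mapper and Reducer classes in Java):

# Word count: user writes map_fn and reduce_fn; the framework does the rest.
from collections import defaultdict

def map_fn(line: str):
    """Map(): emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts: list[int]):
    """Reduce(): sum all counts emitted for one word."""
    yield word, sum(counts)

# The "framework" part: group intermediate pairs by key, then reduce.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
groups = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        groups[word].append(count)
results = dict(kv for word, counts in groups.items()
               for kv in reduce_fn(word, counts))
print(results)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}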