Hadoop is an open-source software framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop automatically manages data replication and recovers from hardware failures, so very large data sets can be processed efficiently in a reliable, fault-tolerant manner. Common uses of Hadoop include log analysis, data warehousing, web indexing, machine learning, financial analysis, and scientific applications.
What is HDFS | Hadoop Distributed File System | Edureka
( Hadoop Training: https://www.edureka.co/hadoop )
This What is HDFS PPT will help you understand the Hadoop Distributed File System and its features, along with a practical demonstration. In this What is HDFS PPT, we will cover:
1. What is DFS and Why Do We Need It?
2. What is HDFS?
3. HDFS Architecture
4. HDFS Replication Factor
5. HDFS Commands Demonstration on a Production Hadoop Cluster
This presentation is about NoSQL, which means "Not Only SQL." It covers the use of NoSQL for Big Data and the differences from RDBMS.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
- Hadoop is a framework for managing and processing big data distributed across clusters of computers. It allows for parallel processing of large datasets.
- Big data comes from various sources like customer behavior, machine data from sensors, etc. It is used by companies to better understand customers and target ads.
- Hadoop uses a master-slave architecture with a NameNode master and DataNode slaves. Files are divided into blocks and replicated across DataNodes for reliability. The NameNode tracks where data blocks are stored.
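The block-splitting and replication scheme described above can be sketched in a few lines of Python. This is a simplified model, not NameNode code: the 128 MB block size and replication factor of 3 are HDFS defaults, but the round-robin placement here is purely illustrative.

```python
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of `file_size` bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin sketch)."""
    placement = {}
    nodes = cycle(range(len(datanodes)))
    for block_id in range(num_blocks):
        start = next(nodes)
        placement[block_id] = [datanodes[(start + i) % len(datanodes)]
                               for i in range(replication)]
    return placement

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

In the real system the NameNode records this block-to-DataNode mapping as metadata; the sketch only shows why a file's size, not its name, determines how many blocks it occupies.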
NoSQL means "not only SQL." NoSQL databases are databases that store data in a format other than relational tables. NoSQL, or non-relational, databases do not store relationship data well.
This document provides an introduction to the Pig analytics platform for Hadoop. It begins with an overview of big data and Hadoop, then discusses the basics of Pig including its data model, language called Pig Latin, and components. Key points made are that Pig provides a high-level language for expressing data analysis processes, compiles queries into MapReduce programs for execution, and allows for easier programming than lower-level systems like Java MapReduce. The document also compares Pig to SQL and Hive, and demonstrates visualizing Pig jobs with the Twitter Ambrose tool.
This document discusses concepts related to data streams and real-time analytics. It begins with introductions to stream data models and sampling techniques. It then covers filtering, counting, and windowing queries on data streams. The document discusses challenges of stream processing like bounded memory and proposes solutions like sampling and sketching. It provides examples of applications in various domains and tools for real-time data streaming and analytics.
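The sampling idea mentioned above can be illustrated with reservoir sampling, a standard technique for keeping a fixed-size uniform sample of an unbounded stream in bounded memory. This is a generic sketch of the technique, not code from any tool named in the document.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=10)
print(len(sample))  # 10
```

The key property is that memory stays at k items no matter how long the stream runs, which is exactly the bounded-memory constraint the talk raises.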
The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a Master and Slave architecture with a NameNode that manages metadata and DataNodes that store data blocks. The NameNode tracks locations of data blocks and regulates access to files, while DataNodes store file blocks and manage read/write operations as directed by the NameNode. HDFS provides high-performance, scalable access to data across large Hadoop clusters.
The document summarizes Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes the key components of Hadoop including the Hadoop Distributed File System (HDFS) which stores data reliably across commodity hardware, and the MapReduce programming model which allows distributed processing of large datasets in parallel. The document provides an overview of HDFS architecture, data flow, fault tolerance, and other aspects to enable reliable storage and access of very large files across clusters.
The document discusses Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers. It describes Hadoop as having two main components - the Hadoop Distributed File System (HDFS), which stores data across the cluster, and MapReduce, which processes the data in a parallel, distributed manner. HDFS provides redundancy, scalability, and fault tolerance. Together these components provide a solution for businesses to efficiently analyze the large, unstructured "Big Data" they collect.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN or Mesos. This talk will cover a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
This document provides an introduction to Hadoop, including its motivation and key components. It discusses the scale of cloud computing that Hadoop addresses, and describes the core Hadoop technologies - the Hadoop Distributed File System (HDFS) and MapReduce framework. It also briefly introduces the Hadoop ecosystem, including other related projects like Pig, HBase, Hive and ZooKeeper. Sample code is walked through to illustrate MapReduce programming. Key aspects of HDFS like fault tolerance, scalability and data reliability are summarized.
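The MapReduce programming model that such sample code illustrates can be sketched as a single-process simulation of the map, shuffle, and reduce phases. This is an illustrative model of the dataflow, not the Hadoop Java API.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data over the network; the logic per phase, however, is exactly this simple.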
The document summarizes Spark SQL, which is a Spark module for structured data processing. It introduces key concepts like RDDs, DataFrames, and interacting with data sources. The architecture of Spark SQL is explained, including how it works with different languages and data sources through its schema RDD abstraction. Features of Spark SQL are covered such as its integration with Spark programs, unified data access, compatibility with Hive, and standard connectivity.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
The document compares NoSQL and SQL databases. It notes that NoSQL databases are non-relational and have dynamic schemas that can accommodate unstructured data, while SQL databases are relational and have strict, predefined schemas. NoSQL databases offer more flexibility in data structure, but SQL databases provide better support for transactions and data integrity. The document also discusses differences in queries, scaling, and consistency between the two database types.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
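The word-count example described above can be mimicked without a cluster by a tiny RDD-like class implementing just the transformations named in the summary. This is a toy stand-in for the Spark API, assuming only the same method names (flatMap, map, reduceByKey, collect), not Spark's actual implementation.

```python
class MiniRDD:
    """A toy, in-process stand-in for a Spark RDD supporting a few transformations."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(item) for item in self.data)

    def reduceByKey(self, f):
        acc = {}
        for key, value in self.data:
            acc[key] = f(acc[key], value) if key in acc else value
        return MiniRDD(acc.items())

    def collect(self):
        return self.data

lines = MiniRDD(["to be or not to be"])
counts = (lines.flatMap(str.split)
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The point of the sketch is the shape of the pipeline: each transformation returns a new dataset, and nothing runs until the final action (collect), mirroring how the Spark word count in the deck is structured.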
Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system to a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. The new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS - the centralized scheme within the NameNode becomes the main bottleneck, limiting the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large numbers of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
This document provides an overview of big data concepts and Hadoop. It discusses the four V's of big data - volume, velocity, variety, and veracity. It then describes how Hadoop uses MapReduce and HDFS to process and store large datasets in a distributed, fault-tolerant and scalable manner across commodity hardware. Key components of Hadoop include the HDFS file system and MapReduce framework for distributed processing of large datasets in parallel.
The NameNode was experiencing high load and instability after being restarted. Graphs showed unknown high load between checkpoints on the NameNode. DataNode logs showed repeated 60000 millisecond timeouts in communication with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.
Hadoop Distributed File System (HDFS) is a distributed file system that stores large datasets across commodity hardware. It is highly fault tolerant, provides high throughput, and is suitable for applications with large datasets. HDFS uses a master/slave architecture where a NameNode manages the file system namespace and DataNodes store data blocks. The NameNode ensures data replication across DataNodes for reliability. HDFS is optimized for batch processing workloads where computations are moved to nodes storing data blocks.
Distributed Computing with Apache Hadoop is a technology overview that discusses:
1) Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of commodity hardware.
2) Hadoop addresses limitations of traditional distributed computing with an architecture that scales linearly by adding more nodes, moves computation to data instead of moving data, and provides reliability even when hardware failures occur.
3) Core Hadoop components include the Hadoop Distributed File System for storage, and MapReduce for distributed processing of large datasets in parallel on multiple machines.
This document provides an overview and lessons learned from Hadoop. It discusses why Hadoop is used, how MapReduce and HDFS work, tips for integration and operations, and the outlook for the Hadoop community moving forward with real-time capabilities and refined APIs. Key takeaways include only using Hadoop if necessary, fully understanding your data pipeline, and "unboxing the black box" of Hadoop.
Hadoop Operations - Best Practices from the Field (DataWorks Summit)
This document discusses best practices for Hadoop operations based on analysis of support cases. Key learnings include using HDFS ACLs and snapshots to prevent accidental data deletion and improve recoverability. HDFS improvements like pausing block deletion and adding diagnostics help address incidents around namespace mismatches and upgrade failures. Proper configuration of hardware, JVM settings, and monitoring is also emphasized.
HDFS is a distributed file system designed for storing very large data files across commodity servers or clusters. It works on a master-slave architecture with one namenode (master) and multiple datanodes (slaves). The namenode manages the file system metadata and regulates client access, while datanodes store and retrieve block data from their local file systems. Files are divided into large blocks which are replicated across datanodes for fault tolerance. The namenode monitors datanodes and replicates blocks if their replication drops below a threshold.
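The re-replication behavior described above - the namenode restoring a block's replication when it drops below the threshold - can be sketched as follows. This is a simplified model of the bookkeeping, not actual NameNode code; the node names are hypothetical.

```python
def find_under_replicated(block_map, replication_target):
    """Return blocks whose live replica count is below the target."""
    return [b for b, nodes in block_map.items() if len(nodes) < replication_target]

def re_replicate(block_map, live_nodes, replication_target):
    """Copy each under-replicated block to live nodes that don't already hold it."""
    for block in find_under_replicated(block_map, replication_target):
        holders = set(block_map[block])
        candidates = [n for n in live_nodes if n not in holders]
        needed = replication_target - len(holders)
        block_map[block] = sorted(holders) + candidates[:needed]
    return block_map

# dn3 has failed: block "b2" is left with a single replica.
block_map = {"b1": ["dn1", "dn2", "dn4"], "b2": ["dn1"]}
re_replicate(block_map, live_nodes=["dn1", "dn2", "dn4", "dn5"],
             replication_target=3)
print(block_map["b2"])  # back to 3 replicas on distinct live nodes
```

The real NameNode learns about replica loss via missed DataNode heartbeats and schedules copies asynchronously, but the invariant it maintains is the one checked here.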
This document discusses benchmarking Hadoop and big data systems. It provides an overview of common Hadoop benchmarks including microbenchmarks like TestDFSIO, TeraSort, and NNBench which test individual Hadoop components. It also describes BigBench, a benchmark modeled after TPC-DS that aims to test a more complete big data analytics workload using techniques like MapReduce, Hive, and Mahout across structured, semi-structured, and unstructured data. The document emphasizes using Hadoop distributions for administration and both microbenchmarks and full benchmarks like BigBench for evaluation.
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It was created to support applications handling large datasets operating on many servers. Key Hadoop technologies include MapReduce for distributed computing, and HDFS for distributed file storage inspired by Google File System. Other related Apache projects extend Hadoop capabilities, like Pig for data flows, Hive for data warehousing, and HBase for NoSQL-like big data. Hadoop provides an effective solution for companies dealing with petabytes of data through distributed and parallel processing.
The document discusses big data and distributed computing. It provides examples of the large amounts of data generated daily by organizations like the New York Stock Exchange and Facebook. It explains how distributed computing frameworks like Hadoop use multiple computers connected via a network to process large datasets in parallel. Hadoop's MapReduce programming model and HDFS distributed file system allow users to write distributed applications that process petabytes of data across commodity hardware clusters.
Hadoop is a distributed processing framework for large data sets across clusters of commodity hardware. It has two main components: HDFS for reliable data storage, and MapReduce for distributed processing of large data sets. Hadoop can scale from single servers to thousands of machines, handling data measuring petabytes with very high throughput. It provides reliability even if individual machines fail, and is easy to set up and manage.
HDFS is a distributed file system that stores large data across multiple nodes in a Hadoop cluster. It divides files into blocks and replicates them across nodes for reliability. The NameNode manages the file system namespace and regulates client access, while DataNodes store data blocks. HDFS provides interfaces for applications to access data blocks efficiently and is highly fault tolerant due to replication.
Hadoop Distributed File System (HDFS) is modeled after Google File System and optimized for large data sets and high throughput. HDFS uses a master/slave architecture with a NameNode that manages file system metadata and DataNodes that store data blocks. The NameNode tracks locations of data blocks and the DataNodes provide streaming access to blocks for batch processing applications. HDFS replicates data blocks across multiple DataNodes for fault tolerance.
HDFS is a distributed file system designed for storing large files across clusters of commodity hardware. It provides high-throughput access to application data reliably, even in the event of hardware failures, through data replication across multiple nodes. The master node (namenode) manages metadata like file paths and block locations, while slave nodes (datanodes) store file blocks. Files are split into large blocks which are replicated to provide fault tolerance.
The document summarizes the Hadoop Distributed File System (HDFS), which is designed to reliably store and stream very large datasets at high bandwidth. It describes the key components of HDFS, including the NameNode which manages the file system metadata and mapping of blocks to DataNodes, and DataNodes which store block replicas. HDFS allows scaling storage and computation across thousands of servers by distributing data storage and processing tasks.
In this session you will learn:
History of Hadoop
Hadoop Ecosystem
Hadoop Animal Planet
What is Hadoop?
Distinctions of Hadoop
Hadoop Components
The Hadoop Distributed Filesystem
Design of HDFS
When Not to use Hadoop?
HDFS Concepts
Anatomy of a File Read
Anatomy of a File Write
Replication & Rack awareness
Mapreduce Components
Typical Mapreduce Job
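The "Replication & Rack awareness" item above refers to the default HDFS placement policy: the first replica goes on the writer's node, the second on a node in a different rack, and the third on a different node in the same rack as the second. A sketch of that policy (the topology and node names are hypothetical; real HDFS reads rack locations from a configured topology script):

```python
import random

def place_replicas_rack_aware(writer_node, topology, rng=random):
    """Pick 3 replica nodes following the default HDFS placement policy:
    first on the writer's node, second on a different rack,
    third on a different node in the same rack as the second."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    first = writer_node
    remote_racks = [r for r in topology if r != rack_of[first]]
    second_rack = rng.choice(remote_racks)
    second = rng.choice(topology[second_rack])
    third = rng.choice([n for n in topology[second_rack] if n != second])
    return [first, second, third]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"],
            "rack3": ["dn5", "dn6"]}
print(place_replicas_rack_aware("dn1", topology))
```

The design trade-off the policy encodes: one replica on a remote rack survives a whole-rack failure, while keeping two replicas on a single remote rack limits cross-rack write traffic.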
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions (Simplilearn)
This video on Hadoop interview questions part-1 will take you through the general Hadoop questions and questions on HDFS, MapReduce and YARN, which are very likely to be asked in any Hadoop interview. It covers all the topics on the major components of Hadoop. This Hadoop tutorial will give you an idea of the different scenario-based questions you could face, and some multiple-choice questions as well. Now, let us dive into this Hadoop interview questions video and gear up for your next Hadoop interview.
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem, such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, including creating, transforming, and querying DataFrames
HDFS is a distributed file system designed for large data sets and high-throughput access. It uses a master/slave architecture with a Namenode managing the file system namespace and Datanodes storing file data blocks. Blocks are replicated across Datanodes for fault tolerance. The system is highly scalable, handling large clusters and file sizes ranging from gigabytes to terabytes.
A brief introduction to Hadoop distributed file system. How a file is broken into blocks, written and replicated on HDFS. How missing replicas are taken care of. How a job is launched and its status is checked. Some advantages and disadvantages of HDFS-1.x
A simple replication-based mechanism has been used to achieve high data reliability of Hadoop Distributed File System (HDFS). However, replication based mechanisms have high degree of disk storage requirement since it makes copies of full block without consideration of storage size. Studies have shown that erasure-coding mechanism can provide more storage space when used as an alternative to replication. Also, it can increase write throughput compared to replication mechanism. To improve both space efficiency and I/O performance of the HDFS while preserving the same data reliability level, we propose HDFS+, an erasure coding based Hadoop Distributed File System. The proposed scheme writes a full block on the primary DataNode and then performs erasure coding with Vandermonde-based Reed-Solomon algorithm that divides data into m data fragments and encode them into n data fragments (n>m), which are saved in N distinct DataNodes such that the original object can be reconstructed from any m fragments. The experimental results show that our scheme can save up to 33% of storage space while outperforming the original scheme in write performance by 1.4 times. Our scheme provides the same read performance as the original scheme as long as data can be read from the primary DataNode even under single-node or double-node failure. Otherwise, the read performance of the HDFS+ decreases to some extent. However, as the number of fragments increases, we show that the performance degradation becomes negligible.
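The space saving behind erasure coding can be illustrated with the simplest possible code: a single XOR parity fragment. The Vandermonde-based Reed-Solomon coding used by HDFS+ generalizes this to tolerate multiple lost fragments; the snippet below is only a conceptual sketch that tolerates the loss of one data fragment.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(block, m):
    """Split a block into m equal data fragments plus one XOR parity fragment."""
    size = len(block) // m
    fragments = [block[i * size:(i + 1) * size] for i in range(m)]
    parity = fragments[0]
    for frag in fragments[1:]:
        parity = xor_bytes(parity, frag)
    return fragments, parity

def recover(fragments, parity, lost_index):
    """Rebuild a single lost data fragment from the survivors and the parity."""
    acc = parity
    for i, frag in enumerate(fragments):
        if i != lost_index:
            acc = xor_bytes(acc, frag)
    return acc

block = bytes(range(16)) * 8  # 128-byte block, length divisible by m
fragments, parity = encode(block, m=4)
rebuilt = recover(fragments, parity, lost_index=2)
print(rebuilt == fragments[2])  # True

# Storage overhead here is (m+1)/m = 1.25x, versus 3x for triple replication -
# the same trade-off, in miniature, that motivates HDFS+.
```

Reed-Solomon replaces the single XOR parity with n-m coded fragments, so any m of the n fragments suffice to reconstruct the block, at the cost of more encoding computation.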
Hadoop Distributed Filesystem (HDFS) is a distributed filesystem designed for storing very large files across commodity hardware. It is optimized for streaming data access and is a good fit for large files, terabytes or petabytes in size, with streaming write-once and read-many access patterns. HDFS uses a master-slave architecture with a Namenode managing the filesystem metadata and Datanodes storing and retrieving block data. Blocks are replicated across Datanodes for reliability. The Namenode tracks block locations and clients read/write data by communicating with the Namenode and Datanodes in a pipeline.
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia (Yahoo Developer Network)
This document discusses scaling HDFS through federation. HDFS currently uses a single namenode that limits scalability. Federation allows multiple independent namenodes to each manage a subset of the namespace, improving scalability. It also generalizes the block storage layer to use block pools, separating block management from namenodes. This paves the way for horizontal scaling of both namenodes and block storage in the future. Federation preserves namenode robustness while requiring few code changes. It also provides benefits like improved isolation and availability when scaling to extremely large clusters with billions of files and blocks.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. The document provides an overview of the Hadoop architecture, including HDFS, MapReduce, and key components like the NameNode, DataNode, JobTracker, and TaskTracker. It also discusses Hadoop's history, features, use cases, and configuration.
The document discusses the key features and architecture of the Hadoop File System (HDFS). HDFS is designed for large data sets and high fault tolerance. It uses a master/slave architecture with one namenode that manages file metadata and multiple datanodes that store file data blocks. HDFS replicates blocks across datanodes for reliability and provides interfaces for applications to access file data.
The document describes the key features and architecture of the Hadoop Distributed File System (HDFS). HDFS is designed to reliably store very large data sets across clusters of commodity hardware. It uses a master/slave architecture with a single NameNode that manages file system metadata and regulates client access. Numerous DataNodes store file system blocks and replicate data for fault tolerance. The document outlines HDFS properties like high fault tolerance, data replication, and APIs.
HDFS is Hadoop's implementation of a distributed file system designed to store large amounts of data across clusters of machines. It is based on Google's GFS and addresses limitations of other distributed file systems like NFS. HDFS uses a master/slave architecture with a NameNode master storing metadata and DataNodes storing data blocks. Data is replicated across multiple DataNodes for reliability. The file system is optimized for large, sequential reads and writes of entire files rather than random access or updates.
We have entered an era of Big Data. Big data generally refers to collections of data sets so large and complex that they are very hard to handle using conventional database management tools. The main challenges with big databases include creation, curation, storage, sharing, search, analysis, and visualization, so handling these databases requires highly parallel software. First of all, data is acquired from diverse sources such as social media, traditional enterprise data, or sensor data; Flume, for example, can be used to acquire data from social media such as Twitter. This data can then be organized using distributed file systems such as the Hadoop File System. These file systems are very efficient when the number of reads is high compared to writes.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
1. HDFS Design Principles
The Scale-out-Ability of Distributed Storage
Konstantin V. Shvachko
May 23, 2012
SVForum
Software Architecture & Platform SIG
2. Big Data
Computations that need the power of many computers
Large datasets: hundreds of TBs, tens of PBs
Or use of thousands of CPUs in parallel
Or both
Big Data management, storage and analytics
Cluster as a computer
3. What is Apache Hadoop
Hadoop is an ecosystem of tools for processing “Big Data”
Hadoop is an open source project
4. The Hadoop Family
HDFS Distributed file system
MapReduce Distributed computation
Zookeeper Distributed coordination
HBase Column store
Pig Dataflow language, SQL
Hive Data warehouse, SQL
Oozie Complex job workflow
BigTop Packaging and testing
5. Hadoop: Architecture Principles
Linear scalability: more nodes can do more work within the same time
Linear on data size
Linear on compute resources
Move computation to data
Minimize expensive data transfers
Data are large, programs are small
Reliability and Availability: Failures are common
1 drive fails every 3 years => Probability of failing today 1/1000
How many drives per day fail on 1000 node cluster with 10 drives per node?
Simple computational model
hides complexity in efficient execution framework
Sequential data processing (avoid random reads)
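The failure question on this slide reduces to back-of-the-envelope arithmetic. A minimal sketch (the function name is illustrative; the 1/1000 daily failure probability is the slide's own approximation):

```python
def expected_daily_drive_failures(nodes, drives_per_node, daily_failure_prob):
    """Expected number of drive failures per day across the whole cluster."""
    return nodes * drives_per_node * daily_failure_prob

# 1 drive fails every ~3 years (~1000 days) => daily failure probability ~1/1000.
# On a 1000-node cluster with 10 drives per node:
failures_per_day = expected_daily_drive_failures(1000, 10, 1 / 1000)
# roughly 10 drives fail every day, so failure handling must be routine
```

With ten drives failing on a typical day, recovery cannot be a manual, exceptional event; it has to be built into the system.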
6. Hadoop Core
A reliable, scalable, high performance distributed computing system
The Hadoop Distributed File System (HDFS)
Reliable storage layer
With more sophisticated layers on top
MapReduce – distributed computation framework
Hadoop scales computation capacity, storage capacity, and I/O bandwidth
by adding commodity servers.
Divide-and-conquer using lots of commodity hardware
9. Hadoop Distributed File System
The name space is a hierarchy of files and directories
Files are divided into blocks (typically 128 MB)
Namespace (metadata) is decoupled from data
Lots of fast namespace operations, not slowed down by data streaming
Single NameNode keeps the entire name space in RAM
DataNodes store block replicas as files on local drives
Blocks are replicated on 3 DataNodes for redundancy and availability
10. NameNode Transient State
[Diagram: NameNode RAM holds three structures]
Hierarchical Namespace: directory tree / → apps, hbase, hive, users → shv
Block Manager: block id → replica locations
blk_123_001 → dn-1, dn-2, dn-3
blk_234_002 → dn-11, dn-12, dn-13
blk_345_003 → dn-101, dn-102, dn-103
Live DataNodes: per-node state (Heartbeat, Disk Used, Disk Free, xCeivers) for dn-1, dn-2, dn-3, …
11. NameNode Persistent State
The durability of the name space is maintained by a
write-ahead journal and checkpoints
Journal transactions are persisted into edits file before replying to the client
Checkpoints are periodically written to fsimage file
Handled by Checkpointer, SecondaryNameNode
Block locations discovered from DataNodes during startup via block reports.
Not persisted on NameNode
Types of persistent storage devices
Local hard drive
Remote drive or NFS filer
BackupNode
Multiple storage directories
Two on local drives, and one remote server, typically NFS filer
12. DataNodes
DataNodes register with the NameNode, and provide periodic block reports
that list the block replicas on hand
block report contains block id, generation stamp and length for each replica
DataNodes send heartbeats to the NameNode every 3 sec to confirm they are alive
If no heartbeat is received during a 10-minute interval, the node is
presumed lost, and the replicas hosted by that node unavailable
NameNode schedules re-replication of lost replicas
Heartbeat responses give instructions for managing replicas
Replicate blocks to other nodes
Remove local block replicas
Re-register or shut down the node
Send an urgent block report
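The liveness rule on this slide can be sketched as a small check (a simplified illustration; the constant names are hypothetical, the 3-second and 10-minute values are the slide's):

```python
HEARTBEAT_INTERVAL_SEC = 3        # DataNodes heartbeat every 3 seconds
HEARTBEAT_EXPIRE_SEC = 10 * 60    # no heartbeat for 10 minutes => node presumed lost

def is_datanode_lost(last_heartbeat_time, now):
    """True if the DataNode missed heartbeats long enough to be presumed dead.

    Once a node is presumed lost, the NameNode schedules re-replication of
    the replicas it hosted.
    """
    return now - last_heartbeat_time > HEARTBEAT_EXPIRE_SEC
```

Note the wide gap between the heartbeat interval and the expiry window: a few missed heartbeats do not trigger costly re-replication.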
13. HDFS Client
Supports conventional file system operation
Create, read, write, delete files
Create, delete directories
Rename files and directories
Permissions
Modification and access times for files
Quotas: 1) namespace, 2) disk space
Per-file, when created
Replication factor – can be changed later
Block size – cannot be reset after creation
Block replica locations are exposed to external applications
14. HDFS Read
To read a block, the client requests the list of replica locations from the
NameNode,
then pulls data from a replica on one of the DataNodes
15. HDFS Read
Open file returns DFSInputStream
DFSInputStream for a current block fetches replica locations from
NameNode
10 at a time
Client caches replica locations
Replica Locations are sorted by their proximity to the client
Choose first location
Open a socket stream to chosen DataNode, read bytes from the stream
If fails add to dead DataNodes
Choose the next DataNode in the list
Retry 2 times
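The read path above amounts to trying replicas in proximity order and remembering failed nodes. A minimal sketch, not the actual DFSInputStream code (`read_block` and `read_from` are hypothetical names):

```python
def read_block(replica_locations, read_from):
    """Pull a block from the closest live replica.

    replica_locations: DataNode ids sorted by proximity to the client
    (closest first), as returned by the NameNode.
    read_from: callable(datanode) -> bytes; raises IOError on failure.
    """
    dead_nodes = set()
    for dn in replica_locations:
        try:
            return read_from(dn)       # open a stream to the chosen DataNode
        except IOError:
            dead_nodes.add(dn)         # remember the failed node, try the next
    raise IOError("all replicas failed: %s" % sorted(dead_nodes))
```

Because locations are pre-sorted by the NameNode, the common case reads from the nearest replica with no extra round trips.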
16. Replica Location Awareness
MapReduce schedules a task assigned to process block B to a DataNode
serving a replica of B
Local access to data
[Diagram: JobTracker schedules a Task to the TaskTracker whose DataNode serves a replica of the Block; the NameNode provides the replica locations]
17. HDFS Write
To write a block of a file, the client requests a list of candidate DataNodes
from the NameNode, and organizes a write pipeline.
18. HDFS Write
Create file in the namespace
Call addBlock() to get next block
NN returns prospective replica locations sorted by proximity to the client
Client creates a pipeline for streaming data to DataNodes
HDFS client writes into internal buffer and forms a queue of Packets
DataStreamer sends a packet to DN1 as it becomes available
DN1 streams to DN2 the same way, and so on
If one node fails the pipeline is recreated with remaining nodes
Until at least one node remains
Replication is handled later by the NameNode
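The pipeline-recovery rule above ("recreate with remaining nodes, as long as at least one remains") can be sketched as follows. This is an illustration of the policy only, not the real DataStreamer; `stream_packet` and `send` are hypothetical:

```python
def stream_packet(pipeline, packet, send):
    """Deliver one packet to every DataNode in the write pipeline.

    pipeline: DataNodes sorted by proximity to the client.
    send(dn, packet): raises IOError if the node has failed.
    Returns the surviving (possibly shrunk) pipeline; re-replication of the
    missing copies is handled later by the NameNode.
    """
    survivors = []
    for dn in pipeline:
        try:
            send(dn, packet)       # DN1 streams to DN2 the same way, and so on
            survivors.append(dn)
        except IOError:
            pass                   # drop the failed node from the pipeline
    if not survivors:
        raise IOError("write failed: no DataNodes left in pipeline")
    return survivors
```

The write succeeds even with a degraded pipeline; durability is restored asynchronously rather than blocking the client.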
19. Write Leases
HDFS implements a single-writer, multiple-reader model.
HDFS client maintains a lease on files it opened for write
Only one client can hold a lease on a single file
Client periodically renews the lease by sending heartbeats to the NameNode
Lease expiration:
Until soft limit expires client has exclusive access to the file
After soft limit (10 min): any client can reclaim the lease
After hard limit (1 hour): NameNode forcibly closes the file and revokes the lease
Writer's lease does not prevent other clients from reading the file
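The two-limit lease lifecycle above can be captured in a few lines (a sketch of the policy, not NameNode code; the constant and function names are hypothetical, the limits are the slide's):

```python
SOFT_LIMIT_SEC = 10 * 60      # after 10 min another client may reclaim the lease
HARD_LIMIT_SEC = 60 * 60      # after 1 hour the NameNode force-closes the file

def lease_state(last_renewal, now):
    """Classify a write lease by how long ago the holder renewed it."""
    age = now - last_renewal
    if age < SOFT_LIMIT_SEC:
        return "exclusive"        # holder has exclusive write access
    if age < HARD_LIMIT_SEC:
        return "reclaimable"      # any client can reclaim the lease
    return "expired"              # NameNode revokes the lease, closes the file
```

The soft limit lets another writer recover quickly from a dead client; the hard limit guarantees the NameNode eventually reclaims the file on its own.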
20. Append to a File
Original implementation supported write-once semantics
After the file is closed, the bytes written cannot be altered or removed
Now files can be modified by reopening for append
Block modifications during appends use the copy-on-write technique
Last block is copied into temp location and modified
When “full” it is copied into its permanent location
HDFS provides consistent visibility of data for readers before file is closed
hflush operation provides the visibility guarantee
On hflush current packet is immediately pushed to the pipeline
hflush waits until all DataNodes successfully receive the packet
hsync also guarantees the data is persisted to local disks on DataNodes
21. Block Placement Policy
Cluster topology
Hierarchal grouping of nodes according to
network distance
Default block placement policy - a tradeoff
between minimizing the write cost,
and maximizing data reliability, availability and aggregate read bandwidth
1. First replica on the node local to the writer
2. Second and third replicas on two different nodes in a different rack
3. The rest are placed on random nodes, with restrictions:
no more than one replica of the same block is placed at one node and
no more than two replicas are placed in the same rack (if there are enough racks)
HDFS provides a configurable block placement policy interface
experimental
[Topology diagram: root “/” with Rack 0 (DN00, DN01, DN02) and Rack 1 (DN10, DN11, DN12)]
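The default 3-replica placement can be sketched against a rack topology like the one on this slide. This is a simplified illustration of the rules, not the actual BlockPlacementPolicyDefault implementation; `place_replicas` and the topology dict are hypothetical:

```python
import random

def place_replicas(writer_node, topology, n_replicas=3):
    """Choose DataNodes for the common 3-replica case.

    topology: dict mapping rack id -> list of DataNodes; writer_node must
    appear somewhere in the topology.
    """
    rack_of = {dn: rack for rack, nodes in topology.items() for dn in nodes}
    local_rack = rack_of[writer_node]
    replicas = [writer_node]                        # 1st: node local to writer
    remote_racks = [r for r in topology if r != local_rack]
    remote = random.choice(remote_racks)            # 2nd and 3rd: one remote rack,
    replicas += random.sample(topology[remote], 2)  # two different nodes in it
    return replicas[:n_replicas]
```

One cheap local write plus one cross-rack transfer buys survival of a whole-rack failure, which is the tradeoff the slide describes.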
22. System Integrity
Namespace ID
a unique cluster id common for cluster components (NN and DNs)
Namespace ID assigned to the file system at format time
Prevents DNs from other clusters from joining this cluster
Storage ID
DataNode persistently stores its unique (per cluster) Storage ID
makes DN recognizable even if it is restarted with a different IP address or port
assigned to the DataNode when it registers with the NameNode for the first time
Software Version (build version)
Different software versions of NN and DN are incompatible
Starting from Hadoop-2: version compatibility for rolling upgrades
Data integrity via Block Checksums
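The block-checksum idea above: HDFS checksums each block in small chunks (512 bytes by default) and readers verify on the client side. A sketch using zlib.crc32 in place of Hadoop's checksum machinery; the function names are hypothetical:

```python
import zlib

def checksum_chunks(data, chunk_size=512):
    """Compute one CRC32 per fixed-size chunk of a block's data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def verify(data, checksums, chunk_size=512):
    """Reader-side verification: recompute the per-chunk CRCs and compare.

    A mismatch would be reported to the NameNode, which marks the replica
    corrupt and re-replicates from a good copy.
    """
    return checksum_chunks(data, chunk_size) == checksums
```

Per-chunk checksums localize corruption to a 512-byte window instead of forcing a whole-block reread.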
23. Cluster Startup
NameNode startup
Read image, replay journal, write new image and empty journal
Enter SafeMode
DataNode startup
Handshake: check Namespace ID and Software Version
Registration: NameNode records Storage ID and address of DN
Send initial block report
SafeMode – read-only mode for NameNode
Disallows modifications of the namespace
Disallows block replication or deletion
Prevents unnecessary block replications until the majority of blocks are reported
Minimally replicated blocks
SafeMode threshold, extension
Manual SafeMode during unforeseen circumstances
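The SafeMode threshold mentioned above can be sketched as a simple predicate (an illustration; `can_leave_safemode` is a hypothetical name, and 0.999 is the configurable default threshold in Hadoop of that era, to the best of my knowledge):

```python
def can_leave_safemode(reported_blocks, total_blocks, threshold=0.999):
    """NameNode stays read-only until enough block replicas are reported.

    reported_blocks: blocks that reached their minimal replication according
    to DataNode block reports received so far.
    """
    return total_blocks == 0 or reported_blocks / total_blocks >= threshold
```

Leaving SafeMode too early would trigger a storm of unnecessary re-replications for blocks whose DataNodes simply have not reported yet.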
24. Block Management
Ensure that each block always has the intended number of replicas
Conform with block placement policy (BPP)
Replication is updated when a block report is received
Over-replicated blocks: choose replica to remove
Balance storage utilization across nodes without reducing the block’s availability
Try not to reduce the number of racks that host replicas
Remove replica from DataNode with longest heartbeat or least available space
Under-replicated blocks: place into the replication priority queue
Fewer replicas means higher in the queue
Minimize cost of replica creation while conforming with BPP
Mis-replicated block: placement does not conform to BPP
Missing block: nothing to do
Corrupt block: try to replicate good replicas, keep corrupt replicas as is
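The replication priority queue above ("fewer replicas means higher in the queue") can be sketched with a heap. A simplified illustration, not the NameNode's actual queue; `replication_queue` is a hypothetical name:

```python
import heapq

def replication_queue(blocks, target=3):
    """Order under-replicated blocks so the neediest are re-replicated first.

    blocks: dict block_id -> number of live replicas.
    Blocks already at the target replication are skipped.
    """
    heap = [(live, block_id)
            for block_id, live in blocks.items() if live < target]
    heapq.heapify(heap)                       # smallest replica count first
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

A block down to its last replica is one failure away from data loss, so it jumps ahead of blocks merely one copy short.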
25. HDFS Snapshots: Software Upgrades
Snapshots protect against data corruption/loss during software upgrades
allow rollback if a software upgrade goes bad
Layout Version
identifies the data representation formats
persistently stored in the NN’s and the DNs’ storage directories
NameNode image snapshot
Start hadoop namenode -upgrade
Read checkpoint image and journal based on persisted Layout Version
o New software can always read old layouts
Rename current storage directory to previous
Save image into new current
DataNode upgrades
Follow NameNode instructions to upgrade
Create a new storage directory and hard link existing block files into it
26. Administration
Fsck verifies
Missing blocks, block replication per file
Block placement policy
Reporting tool, does not fix problems
Decommission
Safe removal of a DataNode from the cluster
Guarantees replication of blocks to other nodes from the one being removed
Balancer
Rebalancing of used disk space when new empty nodes are added to the cluster
Even distribution of used space between DataNodes
Block Scanner
Block checksums provide data integrity
Readers verify checksums on the client and report errors to the NameNode
Other blocks periodically verified by the scanner
27. Hadoop Size
Y! cluster 2010
70 million files, 80 million blocks
15 PB capacity
4000+ nodes. 24,000 clients
50 GB heap for NN
Data warehouse Hadoop cluster at Facebook 2010
55 million files, 80 million blocks. Estimate 200 million objects (files + blocks)
2000 nodes. 21 PB capacity, 30,000 clients
108 GB heap for NN should allow for 400 million objects
Analytics Cluster at eBay (Ares)
77 million files, 85 million blocks
1000 nodes: 24 TB of local disk storage, 72 GB of RAM, and a 12-core CPU
Cluster capacity 19 PB raw disk space
Runs up to 38,000 MapReduce tasks simultaneously
30. The Future: Hadoop 2
HDFS Federation
Independent NameNodes sharing a common pool of DataNodes
Cluster is a family of volumes with shared block storage layer
User sees volumes as isolated file systems
ViewFS: the client-side mount table
Federated approach provides a static partitioning of the federated namespace
High Availability for NameNode
Next Generation MapReduce
Separation of JobTracker functions
1. Job scheduling and resource allocation
o Fundamentally centralized
2. Job monitoring and job life-cycle coordination
o Delegate coordination of different jobs to other nodes
Dynamic partitioning of cluster resources: no fixed slots
34. Why Is High Availability Important?
Nothing is perfect:
Applications and servers crash
Avoid downtime
Conventional for traditional RDB and
enterprise storage systems
Industry standard requirement
35. And Why it is Not?
Scheduled downtime dominates Unscheduled
OS maintenance
Configuration changes
Other reasons for Unscheduled Downtime
60 incidents in 500 days on 30,000 nodes
24 Full GC – the majority
System bugs / Bad application / Insufficient resources
“Data Availability and Durability with HDFS”
R. J. Chansler USENIX ;login: February, 2012
Pretty reliable
36. NameNode HA Challenge
Naïve Approach
Start new NameNode on the spare host, when the primary NameNode dies:
Use LinuxHA or VCS for failover
Not so simple:
NameNode startup may take up to 1 hour
Read the Namespace image and the Journal edits: 20 min
Wait for block reports from DataNodes (SafeMode): 30 min
37. Failover Classification
Manual-Cold (or no-HA) – an operator manually shuts down and restarts
the cluster when the active NameNode fails.
Automatic-Cold – save the namespace image and the journal into a
shared storage device, and use standard HA software for failover.
It can take up to an hour to restart the NameNode.
Manual-Hot – the entire file system metadata is fully synchronized on both
active and standby nodes, operator manually issues a command to failover
to the standby node when active fails.
Automatic-Hot – the real HA, provides fast and completely automated
failover to the hot standby.
Warm HA – BackupNode maintains an up-to-date namespace fully
synchronized with the active NameNode. BN rediscovers block locations from
DataNode block reports during failover. May take 20-30 minutes.
38. HA: State of the Art
Manual failover is a routine maintenance procedure: Hadoop Wiki
Automatic-Cold HA first implemented at ContextWeb
uses DRBD for mirroring local disk drives between two nodes and Linux-HA
as the failover engine
AvatarNode from Facebook - a manual-hot HA solution.
Use for planned NameNode software upgrades without down time
Five proprietary installations running hot HA
Designs:
HDFS-1623. High Availability Framework for HDFS NN
HDFS-2064. Warm HA NameNode going Hot
HDFS-2124. NameNode HA using BackupNode as Hot Standby
Current implementation Hadoop-2
Manual failover with shared NFS storage
In progress: no NFS dependency, automatic failover based on internal algorithms
39. Automatic-Hot HA: the Minimalistic Approach
Standard HA software
LinuxHA, VCS, Keepalived
StandbyNode
keeps the up-to-date image of the namespace via Journal stream
available for read-only access
can become active NN
LoadReplicator
DataNodes send heartbeats / reports to both NameNode and StandbyNode
VIPs are assigned to the cluster nodes by their role:
NameNode – nn.vip.host.com
StandbyNode – sbn.vip.host.com
IP-failover
Primary node is always the one that has the NameNode VIP
Rely on proven HA software
41. Limitations of the Implementation
Single master architecture: a constraining resource
Limit to the number of namespace objects
A NameNode object (file or block) requires < 200 bytes in RAM
Block-to-file ratio is shrinking: 2 → 1.5 → 1.2
64 GB of RAM yields: 100 million files; 200 million blocks
Referencing 20 PB of data with a block-to-file ratio of 1.5 and replication 3
Limits for linear performance growth
linear increase in # of workers puts a higher workload on the single NameNode
Single NameNode can be saturated by a handful of clients
Hadoop MapReduce framework reached its scalability limit at 40,000 clients
Corresponds to a 4,000-node cluster with 10 MapReduce slots
“HDFS Scalability: The limits to growth” USENIX ;login: 2010
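The RAM-capacity numbers above follow from simple arithmetic over the ~200-bytes-per-object estimate. A sketch (function name is illustrative; the constants are the slide's approximations):

```python
def namespace_capacity(heap_bytes, bytes_per_object=200, blocks_per_file=1.5):
    """Rough NameNode capacity: every file and every block costs heap RAM.

    Returns (files, blocks) that fit in the given heap, assuming ~200 bytes
    per namespace object and the slide's block-to-file ratio of 1.5.
    """
    objects = heap_bytes // bytes_per_object        # files + blocks
    files = int(objects / (1 + blocks_per_file))
    return files, int(files * blocks_per_file)

files, blocks = namespace_capacity(64 * 10**9)
# ~128 million files and ~192 million blocks,
# matching the slide's round "100 million files; 200 million blocks"
```

This is why the single-NameNode architecture caps the namespace: the limit is heap size, not disk.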
42. From Horizontal to Vertical Scaling
Horizontal scaling is limited by single-master-architecture
Vertical scaling leads to cluster size shrinking
While Storage capacities, Compute power, and Cost remain constant
[Chart: largest Hadoop cluster size by year — Hadoop reached its horizontal scalability limit]
2008 Yahoo: 4000-node cluster
2010 Facebook: 2000 nodes
2011 eBay: 1000 nodes
2013: cluster of 500 nodes
43. Namespace Partitioning
Static: Federation
Directory sub-trees are statically distributed
between separate instances of FSs
Relocating sub-trees without copying is
challenging
Scale x10: billions of files
Dynamic
Files, directory sub-trees can move automatically
between nodes based on their utilization or load
balancing requirements
Files can be relocated without copying data blocks
Scale x100: hundreds of billions of files
Orthogonal independent approaches.
Federation of distributed namespaces is possible
44. Distributed Metadata: Known Solutions
Ceph
Metadata stored on OSD
MDS cache metadata
Dynamic Metadata Partitioning
GFS Colossus: from Google (S. Quinlan and J. Dean)
100 million files per metadata server
Hundreds of servers
Lustre
Plans to release clustered namespace
Code ready
VoldFS, CassFS, MySQL – prototypes
45. HBase Overview
Distributed table storage system
Tables: big, sparse, loosely structured
Collection of rows
Has arbitrary number of columns
Tables are horizontally partitioned into regions. Dynamic partitioning
Columns can be grouped into Column families
Vertical partition of the table
Distributed cache: Regions loaded into RAM on cluster nodes
Timestamp: user-defined cell versioning, a 3rd dimension.
Cell id: <row_key, column_key, timestamp>
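The three-part cell id above can be made concrete with a tiny model (a hypothetical sketch, not HBase's API; `Cell` and `latest_version` are illustrative names):

```python
from collections import namedtuple

# A cell is addressed by <row_key, column_key, timestamp>; value is its payload
Cell = namedtuple("Cell", ["row_key", "column_key", "timestamp", "value"])

def latest_version(cells, row_key, column_key):
    """Return the most recent version of a (row, column) coordinate, or None.

    Timestamps give each coordinate a third dimension: multiple versions of
    the same cell coexist, distinguished only by timestamp.
    """
    versions = [c for c in cells
                if (c.row_key, c.column_key) == (row_key, column_key)]
    return max(versions, key=lambda c: c.timestamp) if versions else None
```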
46. Giraffa File System
Goal: build from existing building blocks
minimize changes to existing components
HDFS + HBase = Giraffa
Store metadata in HBase table
Dynamic table partitioning into regions
Cached in RAM for fast access
Store data in blocks on HDFS DataNodes
Efficient data streaming
Use NameNodes as block managers
Flat namespace of block IDs – easy to partition
Handle communication with DataNodes
Perform block replication
48. Giraffa Facts
HBase * HDFS = high scalability
More data & more files
High Availability
no SPOF, load balancing
Single cluster
no management overhead of operating multiple clusters
                     HDFS         Federated HDFS   Giraffa
Space                25 PB        120 PB           1 EB = 1000 PB
Files + blocks       200 million  1 billion        100 billion
Concurrent Clients   40,000       100,000          1 million