Distributed Computing with Apache Hadoop is a technology overview that discusses:
1) Hadoop is an open source software framework for distributed storage and processing of large datasets across clusters of commodity hardware.
2) Hadoop addresses limitations of traditional distributed computing with an architecture that scales linearly by adding more nodes, moves computation to data instead of moving data, and provides reliability even when hardware failures occur.
3) Core Hadoop components include the Hadoop Distributed File System for storage, and MapReduce for distributed processing of large datasets in parallel on multiple machines.
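As a rough sketch of the MapReduce model described in point 3, here is a single-process, in-memory word count (a toy simulation, not the real Hadoop API; the function names are illustrative):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce stages
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word
    return (key, sum(values))

# In a real cluster each "split" would live on a different node
splits = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["the"])  # -> 3
```

In Hadoop the same three stages run across machines, with the splits read from HDFS and the shuffle performed over the network.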
2. Contents
• Why life is interesting in Distributed Computing
• Computational shift: New Data Domain
• Data is more important than Algorithms
• Hadoop as a technology
• Ecosystem of Hadoop tools
3. New Data Domain
• Simple calculations can be performed by humans
• Devices are needed to process larger computations
• Large computations assume large data domain
• Domain of numbers – the only one until recently
– Crunching numbers from ancient times
– Computers served the same purpose
– Strict rules
• Growth of the Internet provided a new vast domain
– Word data: human generated texts
– Digital data: photo, video, sound
– Fuzzy rules. Errors & deviations are a part of study
– Started to process texts
– Barely touching digital data
4. Words vs. Numbers
• In 1997 IBM built Deep Blue supercomputer
– Playing chess game with the champion G. Kasparov
– Human race was defeated
– Strict rules for Chess
– Fast deep analyses of current state
– Still numbers
• In 2011 IBM built Watson computer to
play Jeopardy
– Questions and hints in human terms
– Analysis of texts from library and the
Internet
– Human champions defeated
5. The Library of Babel
• Jorge Luis Borges, "The Library of Babel"
– Vast storage universe
– Composed of all possible manuscripts
uniformly formatted as 410-page books.
– Most are meaningless sequences of symbols
– The rest excitingly forms a complete and indestructible knowledge system
– Stores any text written or to be written
– Provides solutions to all problems in the world
– Just find the right book.
• Hard copy size is larger than the visible universe
– a data domain worth discovering
• What is the size of the electronic version?
• The Internet's collection is a subset of The Library of Babel
6. New Type of Algorithms
• Scalability is more important than efficiency
– Classic vs. distributed sorting
– In-place sorting updates a shared common state
• More Hardware vs. development time
– 20% improvements in efficiency are not important
– Can add more nodes instead
• Data is more important than algorithms
– Hard to collect data. Historical data 6 months to 1 year
• Example: Natural language processing
– Effect of training-data size on classification accuracy
– Accuracy increases linearly with the size of the training data
– Machine-learning algorithms converge as training data grows
7. Big Data
• Computations that need the power of many computers
– Large datasets: hundreds of TBs, PBs
– Or use of thousands of CPUs in parallel
– Or both
• Cluster as a computer
– Big Data management, storage and analytics
8. Big Data: Examples
• Search Webmap as of 2008 @ Y!
– Raw disk used 5 PB
– 1500 nodes
• High-energy physics: the LHC collider
– PBs of events
– 1 PB of data per sec, most filtered out
• The 2-quadrillionth (10^15) digit of π is 0
– Tsz-Wo (Nicholas) Sze
– 12 days of cluster time, 208 years of CPU time
– No data, pure CPU workload
9. Big Data: More Examples
• eHarmony
– Soul matching
• Banking
– Fraud detection
• Processing of astronomy data
– Image Stacking and Mosaicing
10. What is Hadoop
• Hadoop is an ecosystem of tools for processing
“Big Data”
• Hadoop is an open source project
11. Hadoop: Architecture Principles
• Linear scalability: more nodes can do more work in the same time
– Linear in data size
– Linear in compute resources
• Move computation to data
– Minimize expensive data transfers
– Data are large, programs are small
• Reliability and Availability: Failures are common
– 1 drive fails every 3 years
• Probability of failing today ≈ 1/1000
– How many drives fail per day on a 1000-node cluster with 10 drives per node?
• Simple computational model
– hides complexity in efficient execution framework
• Sequential data processing (avoid random reads)
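The question in the slide above is a quick expected-value calculation; a minimal sketch, using the slide's assumed numbers (a drive fails about every 3 years, 1000 nodes, 10 drives per node):

```python
# Back-of-the-envelope check of the failure estimate above.
# Assumed numbers from the slide: 1 drive fails every 3 years,
# i.e. roughly a 1/1000 chance of failing on any given day.
p_fail_per_day = 1 / (3 * 365)              # ~1/1095; the slide rounds to 1/1000
drives = 1000 * 10                          # 1000 nodes x 10 drives each
expected_failures_per_day = drives * p_fail_per_day
print(round(expected_failures_per_day, 1))  # roughly 9-10 drives per day
```

Roughly ten drives dying every day is why 3-way replication and automatic re-replication are baked into the design rather than treated as exceptional cases.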
12. Hadoop Success Factors
• Apache Hadoop won the 2011 MediaGuardian Innovation Award
– Recognition for its influence on technological innovation
– Other nominees: iPad, WikiLeaks
1. Scalability
2. Open source & commodity software
3. Just works
13. Hadoop Family
HDFS Distributed file system
MapReduce Distributed computation
Zookeeper Distributed coordination
HBase Column store
Pig Dataflow language, SQL
Hive Data warehouse, SQL
Oozie Complex job workflow
Avro Data Serialization
14. Hadoop Core
• A reliable, scalable, high performance distributed computing system
• Reliable storage layer
– The Hadoop Distributed File System (HDFS)
– With more sophisticated layers on top
• MapReduce – distributed computation framework
• Hadoop scales computation capacity, storage capacity, and I/O bandwidth by adding commodity servers
• Divide-and-conquer using lots of commodity hardware
15. MapReduce
• MapReduce – distributed computation framework
– Invented by Google researchers
• Two stages of a MR job
– map: (k1, v1) → {(k2, v2)}
– reduce: (k2, {v2}) → {(k3, v3)}
• Map – a truly distributed stage
• Reduce – an aggregation, may not be distributed
• Shuffle – sort and merge
– transition from Map to Reduce
– invisible to user
• Combiners & Partitioners
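The two stages and the shuffle between them can be modeled in a few lines. This is a single-process toy sketch of the semantics (word count as the example), not the Hadoop API; real map tasks run in parallel and the framework performs the sort/merge shuffle:

```python
from collections import defaultdict

def map_fn(_, line):                 # map: (k1, v1) -> [(k2, v2)]
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):          # reduce: (k2, [v2]) -> [(k3, v3)]
    return [(key, sum(values))]

def run_job(records):
    shuffled = defaultdict(list)     # shuffle: group map output by key
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            shuffled[k2].append(v2)
    out = []
    for k2 in sorted(shuffled):      # the framework sorts keys before reduce
        out.extend(reduce_fn(k2, shuffled[k2]))
    return dict(out)

print(run_job([(0, "big data"), (1, "big cluster")]))
# {'big': 2, 'cluster': 1, 'data': 1}
```

A combiner would apply `reduce_fn` to each map task's local output before the shuffle to cut network traffic; a partitioner decides which reduce task receives each key.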
17. Where MapReduce cannot help
• MapReduce solves about 95% of practical problems
– Not a tool for everything
• Batch processing vs. real-time
– Throughput vs. Latency
• Simultaneous update of common state
• Inter-communication between tasks of a job
• Coordinated execution
• Use of other computational models
– MPI
– Dryad
18. Hadoop Distributed File System
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Lots of fast namespace operations, not slowed down by data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store block replicas as files on local drives
• Blocks are replicated on 3 DataNodes for redundancy
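The block arithmetic implied above is simple; a minimal sketch, assuming the slide's 128 MB block size and 3-way replication:

```python
import math

BLOCK = 128 * 2**20          # 128 MB default block size (from the slide)

def num_blocks(file_size):
    # A file occupies ceil(size / block_size) blocks; the last may be short.
    return max(1, math.ceil(file_size / BLOCK))

def num_replicas(file_size, replication=3):
    # Each block is stored on `replication` DataNodes.
    return num_blocks(file_size) * replication

print(num_blocks(2**30))     # a 1 GB file -> 8 blocks
print(num_replicas(2**30))   # -> 24 block replicas spread across DataNodes
```

Note that every file, however small, costs at least one block's worth of NameNode metadata, which is why the block-to-file ratio matters for scalability later in the deck.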
19. HDFS Read
• To read a block, the client requests the list of replica locations from the NameNode
• It then pulls the data from a replica on one of the DataNodes
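The two-step read path can be modeled with a toy in-memory "NameNode" (the block IDs and DataNode names below are made up for illustration; this is not the HDFS client API):

```python
# Toy model of the HDFS read path: metadata lookup, then a data pull.
block_map = {"blk_1": ["dn3", "dn7", "dn9"]}   # NameNode: block -> replica hosts

def read_block(block_id, prefer=None):
    locations = block_map[block_id]            # 1) ask the NameNode for locations
    dn = prefer if prefer in locations else locations[0]
    return f"data({block_id})@{dn}"            # 2) stream bytes from that DataNode

print(read_block("blk_1", prefer="dn7"))       # data(blk_1)@dn7
```

The key point the slide makes is that the NameNode serves only metadata; the bulk data never flows through it.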
20. HDFS Write
• To write a block of a file, the client requests a list of candidate DataNodes
from the NameNode, and organizes a write pipeline.
21. Replica Location Awareness
• MapReduce schedules a task assigned to process block B to a DataNode
serving a replica of B
• Local access to data
22. Name Node
• NameNode keeps 3 types of information
– Hierarchical namespace
– Block manager: block to data-node mapping
– List of DataNodes
• The durability of the name space is maintained by a write-ahead journal and
checkpoints
– A BackupNode creates periodic checkpoints
– A journal transaction is guaranteed to be persisted before replying to the client
– Block locations are not persisted, but rather discovered from DataNode during
startup via block reports.
23. Data Nodes
• DataNodes register with the NameNode, and provide periodic block reports
that list the block replicas on hand
• DataNodes send heartbeats to the NameNode
– Heartbeat responses give instructions for managing replicas
• If no heartbeat is received during a 10-minute interval, the node is presumed lost and the replicas hosted by that node unavailable
– NameNode schedules re-replication of lost replicas
25. Hadoop Size
• Y! cluster
– 70 million files, 80 million blocks
– 15 PB capacity
– 4000+ nodes. 24,000 clients
– 50 GB heap for NN
• Data warehouse Hadoop cluster at Facebook
– 55 million files, 80 million blocks. Estimate 200 million objects (files + blocks)
– 2000 nodes. 21 PB capacity, 30,000 clients
– 108 GB heap for NN should allow for 400 million objects
• Analytics Cluster at eBay
– 768 nodes
– Each node: 24 TB of local disk storage, 72 GB of RAM, and a 12-core CPU
– Cluster size is 18 PB.
– Runs 26,000 MapReduce tasks simultaneously
26. Limitations of the Implementation
• “HDFS Scalability: The limits to growth” USENIX ;login:
• Single master architecture: a constraining resource
• Limit to the number of namespace objects
– 100 million objects; 25 PB of data
– Block-to-file ratio is shrinking: 2 → 1.5 → 1.2
• Limits for linear performance growth
– a linear increase in # of workers puts a higher workload on the single NameNode
– a single NameNode cannot support 100,000 clients
• Hadoop MapReduce framework reached its scalability limit at 40,000 clients
– Corresponds to a 4,000-node cluster with 10 MapReduce slots
28. ZooKeeper
• A distributed coordination service for distributed apps
– Event coordination and notification
– Leader election
– Distributed locking
• ZooKeeper can help build HA systems
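Leader election, the second bullet above, reduces to picking a deterministic winner among candidates. A toy model of the common ZooKeeper recipe (each candidate creates an ephemeral sequential znode; the lowest sequence number leads; server names and numbers here are made up, and this is not the ZooKeeper client API):

```python
def elect_leader(znodes):
    # znodes: candidate -> sequence number the ensemble assigned to its znode.
    # The candidate holding the lowest sequence number becomes leader.
    return min(znodes, key=znodes.get)

print(elect_leader({"serverA": 12, "serverB": 7, "serverC": 30}))  # serverB
```

Because the znodes are ephemeral, a leader's znode disappears when its session dies, and the remaining candidates re-run the same rule, which is what makes the recipe useful for building HA systems.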
29. HBase
• Distributed table store on top of HDFS
– An implementation of Google’s BigTable
• A big table is Big Data – it cannot be stored on a single node
• Tables: big, sparse, loosely structured
– Consist of rows, each with a unique row key
– Have an arbitrary number of columns,
– grouped into a small number of column families
– Dynamic column creation
• Table is partitioned into regions
– Horizontally across rows; vertically across column families
• HBase provides structured yet flexible access to data
• Near real-time data processing
30. HBase Functionality
• HBaseAdmin: administrative functions
– Create, delete, list tables
– Create, update, delete columns, families
– Split, compact, flush
• HTable: access table data
– Result HTable.get(Get g) // get cells of a row
– void HTable.put(Put p) // update a row– void HTable.put(Put p) // update a row
– void HTable.put(Put[] p) // batch update of rows
– void HTable.delete(Delete d) // delete cells/row
– ResultScanner getScanner(family) // scan col family
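The data model behind those HTable calls can be sketched as a sparse map of row key → {(family, qualifier): value}. The class below is a hypothetical in-memory toy echoing `get`/`put`/`getScanner`, not the HBase client API:

```python
class ToyTable:
    """Sparse, loosely structured table: rows may have different cells."""
    def __init__(self):
        self.rows = {}                      # row key -> {(family, qualifier): value}

    def put(self, row, family, qualifier, value):
        self.rows.setdefault(row, {})[(family, qualifier)] = value

    def get(self, row):                     # cells of one row
        return self.rows.get(row, {})

    def scan(self, family):                 # scan one column family, in key order
        for row in sorted(self.rows):
            cells = {q: v for (f, q), v in self.rows[row].items() if f == family}
            if cells:
                yield row, cells

t = ToyTable()
t.put("row1", "info", "name", "hadoop")
print(t.get("row1"))                        # {('info', 'name'): 'hadoop'}
```

Sorting rows by key in `scan` mirrors why HBase can partition a table into regions horizontally across contiguous row-key ranges.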
32. Pig
• A language on top of MapReduce that simplifies it
• Pig speaks Pig Latin
• SQL-like language
• Pig programs are translated into a series of MapReduce jobs
33. Hive
• Serves the same purpose as Pig
• Closely follows SQL standards
• Keeps metadata about Hive tables in a MySQL RDBMS
34. Oozie
• Workflow actions are arranged as a Directed Acyclic Graph (DAG)
– Multiple steps: MR, Pig, Hive, Java, data mover, ...
• Coordinator jobs (time/data driven workflow jobs)
– A workflow job is scheduled at a regular frequency
– The workflow job is started when all inputs are available
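Running a workflow DAG means executing each action only after all of its predecessors finished, i.e. in topological order. A sketch with made-up action names (this models the DAG idea, not Oozie's XML workflow definitions):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: action -> set of actions it depends on
workflow = {
    "pig-clean": set(),
    "hive-load": {"pig-clean"},
    "mr-report": {"hive-load"},
    "java-mail": {"mr-report"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['pig-clean', 'hive-load', 'mr-report', 'java-mail']
```

A coordinator job wraps such a workflow with a trigger: start it on a schedule, but only once all its input data sets are available.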
35. The Future: Next Generation MapReduce
• “Apache Hadoop: The scalability update” USENIX ;login:
• Next Generation MapReduce
– Separation of JobTracker functions
1. Job scheduling and resource allocation
• Fundamentally centralized
2. Job monitoring and job life-cycle coordination
• Delegate coordination of different jobs to other nodes
– Dynamic partitioning of cluster resources: no fixed slots
• HDFS Federation
– Independent NameNodes sharing a common pool of DataNodes
– Cluster is a family of volumes with shared block storage layer
– User sees volumes as isolated file systems
– ViewFS: the client-side mount table
– Federated approach provides a static partitioning of the federated namespace
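The client-side mount table mentioned above amounts to longest-prefix routing from a path to the NameNode (volume) that owns it. A toy sketch with an assumed mount layout (not the ViewFS configuration format):

```python
# Hypothetical mount table: path prefix -> owning NameNode (volume)
MOUNTS = {"/user": "nn1", "/data": "nn2", "/tmp": "nn3"}

def resolve(path):
    # Try the longest mount prefix first, as a mount table would.
    for prefix, nn in sorted(MOUNTS.items(), key=lambda kv: -len(kv[0])):
        if path == prefix or path.startswith(prefix + "/"):
            return nn
    raise KeyError(f"no mount covers {path}")

print(resolve("/data/logs/part-0"))  # nn2
```

Because the table is fixed configuration on the client, the partitioning of the namespace is static, which is exactly the limitation the last bullet points out.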