Training (Day 1) 
Introduction
Big-data 
Four parameters: 
–Velocity: Streaming data and large volume data movement. 
–Volume: Scale from terabytes to zettabytes. 
–Variety: Manage the complexity of multiple relational and non-relational data types and schemas. 
–Voracity: Produced data has to be consumed fast before it becomes meaningless.
Not just internet companies 
Big Data shouldn’t be a silo: it must be an integrated part of the enterprise information architecture
Data >> Information >> Business Value 
Retail – By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues. 
Financial Services – By combining data across various groups and services like financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets. 
Government – By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies. 
Healthcare – Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.
Processing Granularity 
Data size: small → large 
– Single-core: single-core with a single processor; single-core with multiple processors 
– Multi-core: multi-core with a single processor; multi-core with multiple processors 
– Cluster: cluster of processors (single- or multi-core) with shared memory; cluster of processors with distributed memory 
Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file-system, hosted by a SAN. 
– Grid of clusters: embarrassingly parallel processing; MapReduce with a distributed file system; cloud computing 
Levels of parallelism: pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), virtual (system level). 
Reference: Bina Ramamurthy, 2011
How to Process Big Data? 
Need to process large datasets (>100 TB) 
–Just reading 100 TB of data can be overwhelming 
–Takes ~11 days to read on a standard computer 
–Takes a day across a 10 Gbit link (very high-end storage solution) 
–On a single node (@ 50 MB/s): ~23 days 
–On a 1000-node cluster: ~33 minutes
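The arithmetic behind the last two figures can be checked with a short script (a sketch using the slide's own assumptions: a 100 TB dataset, ~50 MB/s sustained read per disk, and 1000 nodes reading in parallel):

```python
TB = 10 ** 12                 # decimal terabytes, in bytes

data = 100 * TB               # the 100 TB dataset from the slide
rate = 50 * 10 ** 6           # ~50 MB/s sustained read on one disk

single_node_days = data / rate / 86_400          # 86,400 seconds per day
cluster_minutes = data / (rate * 1000) / 60      # 1000 nodes reading in parallel

print(round(single_node_days))   # 23 days on a single node
print(round(cluster_minutes))    # 33 minutes on a 1000-node cluster
```

The point of the exercise: throughput scales linearly with the number of disks reading in parallel, which is exactly what HDFS and MapReduce exploit.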
Examples 
•Web logs; 
•RFID; 
•sensor networks; 
•social networks; 
•social data (due to the social data revolution), 
•Internet text and documents; 
•Internet search indexing; 
•call detail records; 
•astronomy, 
•atmospheric science, 
•genomics, 
•biogeochemical, 
•biological, and 
•other complex and/or interdisciplinary scientific research; 
•military surveillance; 
•medical records; 
•photography archives; 
•video archives; and 
•large-scale e-commerce.
Not so easy… 
Moving data from storage cluster to computation cluster is not feasible 
In large clusters 
–Failure is expected, rather than exceptional. 
–In large clusters, computers fail every day 
–Data is corrupted or lost 
–Computations are disrupted 
–The number of nodes in a cluster may not be constant. 
–Nodes can be heterogeneous. 
Very expensive to build reliability into each application 
–A programmer worries about errors, data motion, communication… 
–Traditional debugging and performance tools don’t apply 
Need a common infrastructure and standard set of tools to handle this complexity 
–Efficient, scalable, fault-tolerant and easy to use
Why are Hadoop and MapReduce needed? 
The answer to this question comes from another trend in disk drives: 
–seek time is improving more slowly than transfer rate. 
Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. 
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. 
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
Why are Hadoop and MapReduce needed? 
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. 
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. 
MapReduce can be seen as a complement to an RDBMS. 
MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
Hadoop distributions 
Apache™ Hadoop™ 
Apache Hadoop-based Services for Windows Azure 
Cloudera’s Distribution Including Apache Hadoop (CDH) 
Hortonworks Data Platform 
IBM InfoSphere BigInsights 
Platform Symphony MapReduce 
MapR Hadoop Distribution 
EMC Greenplum MR (using MapR’s M5 Distribution) 
Zettaset Data Platform 
SGI Hadoop Clusters (uses Cloudera distribution) 
Grand Logic JobServer 
OceanSync Hadoop Management Software 
Oracle Big Data Appliance (uses Cloudera distribution)
What’s up with the names? 
When naming software projects, Doug Cutting seems to have been inspired by his family. 
Lucene is his wife’s middle name, and her maternal grandmother’s first name. 
His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. 
Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
Hadoop features 
Distributed Framework for processing and storing data generally on commodity hardware. 
Completely Open Source. 
Written in Java 
–Runs on Linux, Mac OS/X, Windows, and Solaris. 
–Client apps can be written in various languages. 
•Scalable: store and process petabytes; scale by adding hardware 
•Economical: 1000’s of commodity machines 
•Efficient: run tasks where data is located 
•Reliable: data is replicated, failed tasks are rerun 
•Primarily used for batch data processing, not real-time / user facing applications
Components of Hadoop 
•HDFS (Hadoop Distributed File System) 
–Modeled on GFS 
–Reliable, high-bandwidth file system that can store TBs and PBs of data. 
•Map-Reduce 
–Uses the map/reduce metaphor from the Lisp language 
–A distributed processing framework that processes the data stored on HDFS as key-value pairs. 
[Diagram: Clients 1 and 2 interact with the DFS and the processing framework; input data flows through Map tasks, then Shuffle & Sort, then Reduce tasks, producing output data.]
HDFS 
•Very large distributed file system 
–10K nodes, 100 million files, 10 PB 
–Linearly scalable 
–Supports large files (in GBs or TBs) 
•Economical 
–Uses commodity hardware 
–Nodes fail every day. Failure is expected, rather than exceptional. 
–The number of nodes in a cluster is not constant. 
•Optimized for batch processing
HDFS Goals 
•Highly fault-tolerant 
–runs on commodity HW, which can fail frequently 
•High throughput of data access 
–Streaming access to data 
•Large files 
–Typical file is gigabytes to terabytes in size 
–Support for tens of millions of files 
•Simple coherency 
–Write-once-read-many access model
HDFS: Files and Blocks 
•Data Organization 
–Data is organized into files and directories 
–Files are divided into uniform sized large blocks 
–Typically 128MB 
–Blocks are distributed across cluster nodes 
•Fault Tolerance 
–Blocks are replicated (default 3) to handle hardware failure 
–Replication based on Rack-Awareness for performance and fault tolerance 
–Keeps checksums of data for corruption detection and recovery 
–Client reads both checksum and data from DataNode. If checksum fails, it tries other replicas
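The checksum-and-fallback behaviour on read can be sketched as follows (a toy model, not the real HDFS client; `store_block`, `read_block` and the `replicas` list are illustrative names, with the list standing in for the DataNodes holding copies of a block):

```python
import zlib

def store_block(data):
    # A DataNode keeps each block alongside its checksum.
    return {"data": data, "crc": zlib.crc32(data)}

def read_block(replicas):
    # The client re-computes the checksum; on a mismatch it tries the next replica.
    for replica in replicas:
        if zlib.crc32(replica["data"]) == replica["crc"]:
            return replica["data"]
    raise IOError("all replicas failed checksum verification")

good = store_block(b"block-1 contents")
bad = dict(good, data=b"bit-rotted contents")   # simulated corruption

print(read_block([bad, good]))   # b'block-1 contents' -- corrupt replica skipped
```

With 3 replicas, a read succeeds as long as any one copy still matches its checksum.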
HDFS: Files and Blocks 
•High Throughput: 
–Client talks to both the NameNode and DataNodes 
–Data is not sent through the NameNode. 
–Throughput of file system scales nearly linearly with the number of nodes. 
•HDFS exposes block placement so that computation can be migrated to data
HDFS Components 
•NameNode 
–Manages namespace operations like opening, creating and renaming files and directories 
–File name to list blocks + location mapping 
–File metadata 
–Authorization and authentication 
–Collects block reports from DataNodes on block locations 
–Replicate missing blocks 
–Keeps ALL namespace in memory plus checkpoints & journal 
•DataNode 
–Handles block storage on multiple volumes and data integrity. 
–Clients access the blocks directly from data nodes for read and write 
–Data nodes periodically send block reports to NameNode 
–Block creation, deletion and replication upon instruction from the NameNode.
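The NameNode's bookkeeping described above can be sketched as a toy data structure (hypothetical class and method names; the real NameNode also persists a journal and checkpoints, which this sketch omits):

```python
from collections import defaultdict

class NameNode:
    """Toy model: file name -> ordered block ids, plus
    block id -> set of DataNodes that reported holding it."""

    def __init__(self, replication=3):
        self.replication = replication
        self.files = {}                      # file name -> [block ids]
        self.locations = defaultdict(set)    # block id -> {datanode ids}

    def create(self, name, block_ids):
        self.files[name] = list(block_ids)

    def block_report(self, datanode, block_ids):
        # DataNodes periodically report which blocks they hold.
        for block in block_ids:
            self.locations[block].add(datanode)

    def under_replicated(self):
        # Blocks with fewer replicas than the target need re-replication.
        return [block for blocks in self.files.values() for block in blocks
                if len(self.locations[block]) < self.replication]

nn = NameNode()
nn.create("/users/joe/myFile", [1, 3])
nn.block_report("dn-1", [1, 3])
nn.block_report("dn-2", [1])
nn.block_report("dn-3", [1])
print(nn.under_replicated())   # [3] -- block 3 has only one replica
```

This is the mechanism behind "Replicate missing blocks": the NameNode compares block reports against the target replication factor and schedules copies for any shortfall.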
HDFS Architecture 
[Diagram: the NameNode (the master) holds the metadata, e.g. name:/users/joeYahoo/myFile – blocks:{1,3} and name:/users/bobYahoo/someData.gzip – blocks:{2,4,5}. The DataNodes (the slaves) store the replicated blocks. The client fetches metadata from the NameNode and performs I/O directly against the DataNodes.]
Hadoop DFS Interface 
Simple commands 
hdfs dfs -ls, -du, -rm, -rmr 
Uploading files 
hdfs dfs -copyFromLocal foo mydata/foo 
Downloading files 
hdfs dfs -moveToLocal mydata/foo foo 
hdfs dfs -cat mydata/foo 
Admin 
hdfs dfsadmin -report
Map Reduce -Introduction 
•Parallel Job processing framework 
•Written in Java 
•Close integration with HDFS 
•Provides : 
–Auto partitioning of job into sub tasks 
–Auto retry on failures 
–Linear Scalability 
–Locality of task execution 
–Plugin based framework for extensibility
Map-Reduce 
•MapReduce programs are executed in two main phases, called 
–mapping and 
–reducing. 
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. 
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. 
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over. 
•MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce 
Map-Reduce Program 
–Based on two functions: Map and Reduce 
–Every Map/Reduce program must specify a Mapper and optionally a Reducer 
–Operate on key and value pairs 
Map-Reduce works like a Unix pipeline: 
cat input | grep | sort | uniq -c | cat > output 
Input | Map | Shuffle & Sort | Reduce | Output 
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist 
Map function: takes a key/value pair and generates a set of intermediate key/value pairs: map(k1, v1) -> list(k2, v2) 
Reduce function: takes an intermediate key and its list of values and produces a set of output pairs: reduce(k2, list(v2)) -> list(k3, v3)
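The two signatures above can be exercised with a toy word count in plain Python (the framework itself is replaced by a sort plus `groupby`, which plays the role of shuffle & sort; `map_fn` and `reduce_fn` are illustrative names):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):                    # map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):            # reduce(k2, list(v2)) -> list(k3, v3)
    return [(word, sum(counts))]

lines = ["the quick brown fox", "the lazy dog"]
pairs = [kv for offset, line in enumerate(lines) for kv in map_fn(offset, line)]
pairs.sort(key=itemgetter(0))           # shuffle & sort: bring equal keys together
result = [out
          for key, group in groupby(pairs, key=itemgetter(0))
          for out in reduce_fn(key, [v for _, v in group])]

print(dict(result))   # {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

On a cluster the same two functions run distributed: maps on the nodes holding the input splits, the sort/group step across the network, and reduces on the partitioned key ranges.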
Map-Reduce on Hadoop
Hadoop and its elements 
[Diagram: input files in HDFS are divided into splits 1..M; a Record Reader feeds each split as (Key, Value) pairs to Map tasks 1..M running on machines 1..M; optional Combiners aggregate map output locally; a Partitioner distributes the pairs into partitions 1..P; Reducers 1..R process the partitions and write output files back to HDFS.]
Hadoop Eco-system 
•Hadoop Common: The common utilities that support the other Hadoop subprojects. 
•Hadoop Distributed File System (HDFS™): A distributed file system that provides high- throughput access to application data. 
•Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. 
•Other Hadoop-related projects at Apache include: 
–Avro™: A data serialization system. 
–Cassandra™: A scalable multi-master database with no single points of failure. 
–Chukwa™: A data collection system for managing large distributed systems. 
–HBase™: A scalable, distributed database that supports structured data storage for large tables. 
–Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. 
–Mahout™: A scalable machine learning and data mining library. 
–Pig™: A high-level data-flow language and execution framework for parallel computation. 
–ZooKeeper™: A high-performance coordination service for distributed applications.
Exercise –task 
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries. While the last month of data is accessed most frequently, some analytics algorithms build models using historical data as well. 
•Task: 
–Provide an architecture for such a system to meet the following goals: 
–Fast 
–Available 
–Fair 
–Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering, or regression) on three months’ worth of this data. 
•Group / individual presentation
End of session 
Day 1: Introduction

Audits Of Complaints Against the PPD Report_2022.pdf
evwcarr
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
Priyanka Jadhav
 
Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?
SomalyEng
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
AltanAtabarut
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
AltanAtabarut
 
Getting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdfGetting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdf
Riya Sen
 
Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...
Samuel Jackson
 
CT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptxCT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptx
RejoJohn2
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
RuchiRathor2
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
AkhinaRomdoni
 
Field Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdfField Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdf
hritikbui
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
lenjisoHussein
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
wojakmodern
 
Full Disclosure Board Policy.docx BRGY LICUMA
Full  Disclosure Board Policy.docx BRGY LICUMAFull  Disclosure Board Policy.docx BRGY LICUMA
Full Disclosure Board Policy.docx BRGY LICUMA
brgylicumaormoccity
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Samuel Jackson
 
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop ServiceCal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Deepikakumari457585
 
Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
49AkshitYadav
 
How AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdfHow AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdf
PromptCloud
 

Recently uploaded (20)

Selcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdfSelcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdf
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
 
Audits Of Complaints Against the PPD Report_2022.pdf
Audits Of Complaints Against the PPD Report_2022.pdfAudits Of Complaints Against the PPD Report_2022.pdf
Audits Of Complaints Against the PPD Report_2022.pdf
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
 
Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?Where to order Frederick Community College diploma?
Where to order Frederick Community College diploma?
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
 
Getting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdfGetting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdf
 
Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...Combined supervised and unsupervised neural networks for pulse shape discrimi...
Combined supervised and unsupervised neural networks for pulse shape discrimi...
 
CT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptxCT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptx
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
 
Field Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdfField Diary and lab record, Importance.pdf
Field Diary and lab record, Importance.pdf
 
Training on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptxTraining on CSPro and step by steps.pptx
Training on CSPro and step by steps.pptx
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
 
Full Disclosure Board Policy.docx BRGY LICUMA
Full  Disclosure Board Policy.docx BRGY LICUMAFull  Disclosure Board Policy.docx BRGY LICUMA
Full Disclosure Board Policy.docx BRGY LICUMA
 
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion dataTowards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data
 
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop ServiceCal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
Cal Girls Hotel Safari Jaipur | | Girls Call Free Drop Service
 
Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
 
How AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdfHow AI is Revolutionizing Data Collection.pdf
How AI is Revolutionizing Data Collection.pdf
 

Hadoop introduction

  • 1. Training (Day 1) Introduction
  • 2. Big-data Four parameters: –Velocity: Streaming data and large volume data movement. –Volume: Scale from terabytes to zettabytes. –Variety: Manage the complexity of multiple relational and non-relational data types and schemas. –Voracity: Produced data has to be consumed fast before it becomes meaningless.
  • 3. Not just internet companies Big Data shouldn’t be a silo: it must be an integrated part of the enterprise information architecture.
  • 4. Data >> Information >> Business Value Retail–By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues. Financial Services–By combining data across various groups and services like financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets. Government–By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies. Healthcare–Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.
  • 5. Processing granularity (reference: Bina Ramamurthy, 2011) Hardware scales up from single-core (single-core single-processor, single-core multi-processor) through multi-core (multi-core single-processor, multi-core multi-processor) to clusters (processors, single- or multi-core, with shared memory; processors with distributed memory) and grids of clusters. Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a SAN; at the largest scale sit embarrassingly parallel processing, MapReduce with a distributed file system, and cloud computing. Parallelism granularity likewise ranges, as data size grows from small to large, from pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level) and mega (block level) to virtual (system level).
  • 6. How to process Big Data? Need to process large datasets (>100 TB) –Just reading 100 TB of data can be overwhelming –Takes ~11 days to read on a standard computer –Takes a day across a 10 Gbit link (a very high-end storage solution) –On a single node (@ 50 MB/s): ~23 days –On a 1000-node cluster: ~33 min
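The timings above can be checked with back-of-the-envelope arithmetic. A minimal sketch; the per-device rates (a standard computer streaming at ~100 MB/s, a node at ~50 MB/s) are assumptions chosen to match the slide's figures, not measurements:

```python
# Back-of-the-envelope read times for a 100 TB dataset.
TB = 10**12  # bytes (decimal terabyte)

def read_time_seconds(data_bytes, rate_bytes_per_s):
    """Time to stream data_bytes at a sustained rate."""
    return data_bytes / rate_bytes_per_s

data = 100 * TB
single_disk_days = read_time_seconds(data, 100 * 10**6) / 86400    # ~11.6 days at 100 MB/s
single_node_days = read_time_seconds(data, 50 * 10**6) / 86400     # ~23.1 days at 50 MB/s
# 1000 nodes each read 1/1000 of the data in parallel:
cluster_minutes = read_time_seconds(data / 1000, 50 * 10**6) / 60  # ~33.3 min

print(f"single disk: {single_disk_days:.1f} days")
print(f"single node: {single_node_days:.1f} days")
print(f"1000-node cluster: {cluster_minutes:.1f} min")
```

The point of the exercise: parallelism across a cluster, not faster disks, is what makes the read time tolerable.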
  • 7. Examples •Web logs; •RFID; •sensor networks; •social networks; •social data (due to the social data revolution), •Internet text and documents; •Internet search indexing; •call detail records; •astronomy, •atmospheric science, •genomics, •biogeochemical, •biological, and •other complex and/or interdisciplinary scientific research; •military surveillance; •medical records; •photography archives; •video archives; and •large-scale e-commerce.
  • 8. Not so easy… Moving data from a storage cluster to a computation cluster is not feasible. In large clusters –Failure is expected rather than exceptional: computers fail every day –Data is corrupted or lost –Computations are disrupted –The number of nodes in a cluster may not be constant. –Nodes can be heterogeneous. Very expensive to build reliability into each application –A programmer worries about errors, data motion, communication… –Traditional debugging and performance tools don’t apply Need a common infrastructure and standard set of tools to handle this complexity –Efficient, scalable, fault-tolerant and easy to use
  • 9. Why are Hadoop and MapReduce needed? The answer to this question comes from another trend in disk drives: –seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
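The seek-versus-streaming argument can be made concrete with a small model. The disk parameters below (10 ms average seek, 100 MB/s transfer) are illustrative assumptions, not figures from the slides:

```python
# Compare reading 1 GB as 10,000 random 100 KB records
# (seek-dominated) versus one sequential streaming read.
SEEK_S = 0.010       # 10 ms average seek per random access (assumption)
RATE = 100 * 10**6   # 100 MB/s sustained transfer rate (assumption)

records = 10_000
record_bytes = 100_000
total_bytes = records * record_bytes   # 1 GB in total

random_s = records * SEEK_S + total_bytes / RATE  # pay one seek per record
stream_s = total_bytes / RATE                     # one long transfer, no seeks

print(f"random access: {random_s:.0f} s, streaming: {stream_s:.0f} s")
```

Even though both reads move the same 1 GB, the random-access pattern spends 100 s just seeking against 10 s of actual transfer, which is why MapReduce is built around streaming through whole datasets.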
  • 10. Why are Hadoop and MapReduce needed? On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
  • 11. Why are Hadoop and MapReduce needed?
  • 12. Hadoop distributions Apache™ Hadoop™ Apache Hadoop-based Services for Windows Azure Cloudera’s Distribution Including Apache Hadoop (CDH) Hortonworks Data Platform IBM InfoSphere BigInsights Platform Symphony MapReduce MapR Hadoop Distribution EMC Greenplum MR (using MapR’s M5 Distribution) Zettaset Data Platform SGI Hadoop Clusters (uses Cloudera distribution) Grand Logic JobServer OceanSync Hadoop Management Software Oracle Big Data Appliance (uses Cloudera distribution)
  • 13. What’s up with the names? When naming software projects, Doug Cutting seems to have been inspired by his family. Lucene is his wife’s middle name, and her maternal grandmother’s first name. His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop. Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
  • 14. Hadoop features Distributed framework for processing and storing data, generally on commodity hardware. Completely open source. Written in Java –Runs on Linux, Mac OS X, Windows, and Solaris. –Client apps can be written in various languages. •Scalable: store and process petabytes; scale by adding hardware •Economical: 1000s of commodity machines •Efficient: run tasks where the data is located •Reliable: data is replicated, failed tasks are rerun •Primarily used for batch data processing, not real-time / user-facing applications
  • 15. Components of Hadoop •HDFS (Hadoop Distributed File System) –Modeled on GFS –A reliable, high-bandwidth file system that can store TBs and PBs of data. •Map-Reduce –Uses the map/reduce metaphor from the Lisp language –A distributed processing framework that processes the data stored in HDFS as key-value pairs. Data flow: Input → Map → Shuffle & Sort → Reduce → Output.
  • 16. •Very Large Distributed File System –10K nodes, 100 million files, 10 PB –Linearly scalable –Supports Large files (in GBs or TBs) •Economical –Uses Commodity Hardware –Nodes fail every day. Failure is expected, rather than exceptional. –The number of nodes in a cluster is not constant. •Optimized for Batch Processing HDFS
  • 17. HDFS Goals •Highly fault-tolerant –runs on commodity HW, which can fail frequently •High throughput of data access –Streaming access to data •Large files –Typical file is gigabytes to terabytes in size –Support for tens of millions of files •Simple coherency –Write-once-read-many access model
  • 18. HDFS: Files and Blocks •Data Organization –Data is organized into files and directories –Files are divided into uniform sized large blocks –Typically 128MB –Blocks are distributed across cluster nodes •Fault Tolerance –Blocks are replicated (default 3) to handle hardware failure –Replication based on Rack-Awareness for performance and fault tolerance –Keeps checksums of data for corruption detection and recovery –Client reads both checksum and data from DataNode. If checksum fails, it tries other replicas
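The block and replication figures from the slide (128 MB blocks, default replication factor 3) translate directly into how much cluster storage a file consumes. A minimal sketch of that arithmetic; the 1 GB example file is hypothetical:

```python
# How a file maps onto HDFS blocks and replicas.
import math

BLOCK = 128 * 1024 * 1024  # 128 MB block size (from the slide)
REPLICATION = 3            # default replication factor (from the slide)

def block_count(file_bytes):
    """Number of HDFS blocks a file of this size occupies."""
    return math.ceil(file_bytes / BLOCK)

file_bytes = 1 * 1024**3            # a hypothetical 1 GB file
blocks = block_count(file_bytes)    # 8 blocks of 128 MB
stored = blocks * REPLICATION       # 24 block replicas spread across DataNodes
print(blocks, stored)
```

Note that only the last block of a file can be partial: a 129 MB file still occupies two blocks, although the second holds just 1 MB.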
  • 19. HDFS: Files and Blocks •High Throughput: –Client talks to both the NameNode and DataNodes –Data is not sent through the NameNode. –Throughput of the file system scales nearly linearly with the number of nodes. •HDFS exposes block placement so that computation can be migrated to data
  • 20. HDFS Components •NameNode –Manages file namespace operations like opening, creating, renaming etc. –File name to block list + location mapping –File metadata –Authorization and authentication –Collects block reports from DataNodes on block locations –Replicates missing blocks –Keeps ALL namespace in memory, plus checkpoints & journal •DataNode –Handles block storage on multiple volumes and data integrity. –Clients access the blocks directly from data nodes for read and write –Data nodes periodically send block reports to the NameNode –Block creation, deletion and replication upon instruction from the NameNode.
  • 21. HDFS Architecture The NameNode (the master) keeps the metadata mapping file names to block lists, e.g. name: /users/joeYahoo/myFile – blocks: {1, 3}; name: /users/bobYahoo/someData.gzip – blocks: {2, 4, 5}. The DataNodes (the slaves) store the replicated blocks. The client performs metadata operations against the NameNode and does block I/O directly with the DataNodes.
  • 22. Hadoop DFS Interface Simple commands: hdfs dfs -ls, -du, -rm, -rmr Uploading files: hdfs dfs -copyFromLocal foo mydata/foo Downloading files: hdfs dfs -moveToLocal mydata/foo foo; hdfs dfs -cat mydata/foo Admin: hdfs dfsadmin -report
  • 23. Map Reduce – Introduction •Parallel job processing framework •Written in Java •Close integration with HDFS •Provides: –Auto partitioning of job into sub-tasks –Auto retry on failures –Linear scalability –Locality of task execution –Plugin-based framework for extensibility
  • 24. Map-Reduce •MapReduce programs are executed in two main phases, called –mapping and –reducing. •In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. •In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. •The mapper is meant to filter and transform the input into something that the reducer can aggregate over. •MapReduce uses lists and (key/value) pairs as its main data primitives.
  • 25. Map-Reduce Map-Reduce Program –Based on two functions: Map and Reduce –Every Map/Reduce program must specify a Mapper and optionally a Reducer –Operate on key and value pairs Map-Reduce works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output Input | Map | Shuffle & Sort | Reduce | Output cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist Map function: Takes a key/value pair and generates a set of intermediate key/value pairs map(k1, v1) -> list(k2, v2) Reduce function: Takes intermediate values and associates them with the same intermediate key reduce(k2, list(v2)) -> list(k3, v3)
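The map/shuffle/reduce signatures above can be sketched end-to-end with the classic word-count example. This is a minimal in-memory model of the data flow, not Hadoop itself; the function and variable names are illustrative:

```python
# Word count as map -> shuffle & sort -> reduce, in plain Python.
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):              # map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):       # reduce(k2, list(v2)) -> list(k3, v3)
    return [(key, sum(values))]

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: apply map_fn to every input record (key = line number).
intermediate = [pair for i, line in enumerate(lines)
                for pair in map_fn(i, line)]

# Shuffle & sort: bring all values for the same key together.
intermediate.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g])
           for k, g in groupby(intermediate, key=itemgetter(0))]

# Reduce phase: one reduce_fn call per distinct key.
result = dict(pair for k, vs in grouped for pair in reduce_fn(k, vs))
print(result)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In Hadoop the same three stages run distributed: mappers on the nodes holding the input splits, the framework performing the shuffle and sort over the network, and reducers writing the final output to HDFS.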
  • 27. Hadoop and its elements Input files in HDFS are divided into splits (Split 1 … Split M), one or more per machine. A record reader feeds each split to a mapper (Map 1 … Map M), which emits (key, value) pairs. Optional combiners (Combiner 1 … Combiner C) aggregate map output locally; the partitioner divides the pairs into partitions (Partition 1 … Partition P); and each reducer (Reducer 1 … Reducer R) processes one partition, writing its output files back to HDFS.
  • 28. Hadoop Eco-system •Hadoop Common: The common utilities that support the other Hadoop subprojects. •Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. •Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. •Other Hadoop-related projects at Apache include: –Avro™: A data serialization system. –Cassandra™: A scalable multi-master database with no single points of failure. –Chukwa™: A data collection system for managing large distributed systems. –HBase™: A scalable, distributed database that supports structured data storage for large tables. –Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. –Mahout™: A scalable machine learning and data mining library. –Pig™: A high-level data-flow language and execution framework for parallel computation. –ZooKeeper™: A high-performance coordination service for distributed applications.
  • 29. Exercise – task You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last month of data is accessed more frequently, some analytics algorithms build models using historical data as well. •Task: –Provide an architecture for such a system that meets the following goals: –Fast –Available –Fair –Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering, or regression) on three months’ worth of this data set. •Group / individual presentation
  • 30. End of session Day 1: Introduction