Hadoop, HBase, MapReduce
What is Big Data?
●   How big is “Big Data”?
    ●   Is 30-40 terabytes big data?
    ●   ….
●   Big data are datasets that grow so large that they
    become awkward to work with using on-hand
    database management tools
●   Today: terabytes, petabytes, exabytes
●   Tomorrow?
Enterprises & Big Data
●   Most companies are currently using traditional tools to
    store data
●   Big data: The next frontier for innovation, competition,
    and productivity
●   The use of big data will become a key basis of competition
●   Organisations across the globe need to take the rising
    importance of big data more seriously
Hadoop is an ecosystem, not a single product.




When you deal with Big Data, the data center is your computer.
•   A Brief History of Hadoop
•   Contributors and Development
•   What is Hadoop
•   Why Hadoop
•   Hadoop Ecosystem
A Brief History of Hadoop
•   Hadoop has its origins in Apache Nutch

•   Nutch was started in 2002

•   Challenge: scaling to the billions of pages on the Web

•   2003: Google published the GFS (Google File System) paper

•   2004: NDFS (Nutch Distributed File System)

•   2004: Google published the MapReduce paper

•   2005: Nutch developers started work on a MapReduce
    implementation
Contributors and Development

[Chart: lifetime patches contributed to all Hadoop-related projects, by community members' current employer]
* Source: JIRA tickets
Contributors and Development

* Source: Kerberos Konference (Yahoo), 2010
Development in ASF/Hadoop
●   Resources
    ●   Mailing lists
    ●   Wiki pages, blogs
    ●   Issue tracking: JIRA
    ●   Version control: SVN, Git
What is Hadoop
•   Open-source project administered by the ASF

•   Data-intensive storage

•   Massively Parallel Processing (MPP)

•   Enables applications to work with thousands of nodes and
    petabytes of data

•   Suitable for applications with large data sets
What is Hadoop?

•   Scalable

•   Fault tolerant

•   Reliable data storage using the Hadoop Distributed
    File System (HDFS)

•   High-performance parallel data processing using a
    technique called MapReduce
What is Hadoop?

•   Hadoop is becoming the de facto standard for large-scale
    data processing

•   It is becoming more than just MapReduce

•   The ecosystem is growing rapidly, with lots of great tools around it
What is Hadoop?

Yahoo Hadoop Cluster
38,000 machines distributed across 20 different clusters.
Source: Yahoo, 2010

50,000 m: January 2012
Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/

SGI Hadoop Cluster
Why Hadoop?
•       Hadoop has its origins in Apache Nutch
•       Can process Big Data (petabytes and more)
•       Virtually unlimited data storage and analysis (capacity grows by adding nodes)
•       No licence cost (Apache License 2.0)
•       Can be built from commodity hardware
•       IT cost reduction
    •        Results
         •      Be one step ahead of the competition
         •      Stay there
Is Hadoop an alternative to an RDBMS?
 •   At the moment, Apache Hadoop is not a substitute for a database
 •   No relational model
 •   Key-value pairs
 •   Big Data
 •   Unstructured (text)
 •   Semi-structured (sequence / binary files)
 •   Structured (HBase, modeled after Google Bigtable)
 •   Works well together with an RDBMS
Hadoop Ecosystem
   ETL Tools           BI Reporting     RDBMS


Pig (Data   Flow)      Hive (SQL)        Sqoop


 MapReduce (Job     Scheduling/Execution System)

HBase (Key-Value store)



                        HDFS
        (Hadoop Distributed File System)
Hadoop Ecosystem
           Important components of Hadoop


•   HDFS: A distributed, fault-tolerant file system
•   MapReduce: A parallel data processing framework
•   Hive: A query framework (like SQL)
•   Pig: A query scripting tool


•   HBase : realtime read/write access to your Big Data
Hadoop Ecosystem
Hadoop is a Distributed Data Computing Platform
HDFS

NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file
metadata: which files are in the system and how each file is broken down into blocks. The
DataNodes store the blocks themselves and constantly report to the NameNode to keep the
metadata current.
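
As a rough illustration of the bookkeeping described above, the Python sketch below models the NameNode as two in-memory maps: file name to block IDs, and block ID to the DataNodes that reported holding it. The class and method names (`ToyNameNode`, `block_report`, `locate`) are hypothetical and are not part of the HDFS API; real HDFS is written in Java and its internals are considerably more involved.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical in-memory model of NameNode metadata (not the HDFS API)."""

    def __init__(self):
        self.file_blocks = {}                      # file name -> ordered block IDs
        self.block_locations = defaultdict(set)    # block ID -> DataNodes with a replica

    def add_file(self, name, block_ids):
        self.file_blocks[name] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """A DataNode reports the full set of blocks it currently stores."""
        for locations in self.block_locations.values():
            locations.discard(datanode)            # forget what this node reported before
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, name):
        """Return (block ID, sorted DataNode list) for each block of the file."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[name]]


nn = ToyNameNode()
nn.add_file("File.txt", ["blk_A", "blk_B"])
nn.block_report("DN1", ["blk_A", "blk_B"])
nn.block_report("DN5", ["blk_A"])
print(nn.locate("File.txt"))
```

The point of the sketch is that the NameNode holds only metadata; the block contents never pass through it.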
Hadoop Cluster
Writing Files To HDFS


•   Client consults the NameNode
•   Client writes the block directly to
    one DataNode
•   DataNode replicates the block
•   Cycle repeats for the next block
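
The write cycle above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `ToyDataNode` and `write_file` are hypothetical names, the block size is tiny for readability, and target selection is a naive round-robin standing in for the NameNode's decision.

```python
def split_into_blocks(data, block_size):
    """Cut the byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyDataNode:
    """Hypothetical DataNode that stores blocks in a dict."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, payload, pipeline):
        """Store the block, then forward it to the next DataNode in the pipeline."""
        self.blocks[block_id] = payload
        if pipeline:
            pipeline[0].write(block_id, payload, pipeline[1:])


def write_file(data, datanodes, block_size=4, replication=3):
    """Client loop: one replication pipeline per block.

    Assumption: round-robin target choice replaces the NameNode consultation.
    """
    placements = {}
    for index, payload in enumerate(split_into_blocks(data, block_size)):
        targets = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        targets[0].write(index, payload, targets[1:])   # client talks to one DataNode only
        placements[index] = [dn.name for dn in targets]
    return placements


nodes = [ToyDataNode(f"DN{i}") for i in range(1, 5)]
placements = write_file(b"hello hdfs!", nodes)
print(placements)
```

Note that the client sends each block once; the DataNodes themselves pass it along, which matches the "DataNode replicates the block" step.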
Reading Files From HDFS




•   Client consults the NameNode
•   Client receives the DataNode list for each block
•   Client picks the first DataNode for each block
•   Client reads the blocks sequentially
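
A minimal Python sketch of the read steps above follows. The function name `read_file` and the dict-based storage are hypothetical; the fallback to the next DataNode when the first one has no copy is an added assumption for illustration and is not stated on the slide.

```python
def read_file(block_ids, block_locations, datanode_storage):
    """Read blocks in order, using the first listed DataNode that holds each block.

    block_locations:  block ID -> ordered list of DataNode names (from the NameNode)
    datanode_storage: DataNode name -> {block ID: bytes}; a missing name means "down"
    """
    chunks = []
    for block_id in block_ids:
        for datanode in block_locations[block_id]:
            payload = datanode_storage.get(datanode, {}).get(block_id)
            if payload is not None:
                chunks.append(payload)
                break
        else:
            # No DataNode in the list could serve this block.
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)


locations = {"A": ["DN1", "DN5", "DN6"], "B": ["DN1", "DN2", "DN9"]}
storage = {"DN5": {"A": b"hello "}, "DN2": {"B": b"world"}}   # DN1 is unavailable
content = read_file(["A", "B"], locations, storage)
print(content)
```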
Rack Awareness & Fault Tolerance

NameNode

Rack awareness:
Rack 1: DN1, DN2, DN3, DN5
Rack 5: DN5, DN6, DN7, DN8
Rack N

Metadata (File.txt):
Blk A: DN1, DN5, DN6
Blk B: DN1, DN2, DN9
Blk C: DN5, DN9, DN10

•   Never lose all data if an entire rack fails
•   Within a rack: higher bandwidth, lower latency
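
The two bullets above pull in opposite directions (spread replicas for safety, keep them close for bandwidth). The Python sketch below illustrates the commonly documented default compromise for three replicas: one on the writer's node, one on a node in a different rack, and one on another node in that same remote rack. The function `place_replicas` and the simplified rack layout are hypothetical; the real placement policy handles many more cases.

```python
import random


def place_replicas(racks, writer_node, rng=random):
    """Pick three DataNodes for one block.

    racks: rack name -> list of DataNode names (each node in exactly one rack).
    Raises IndexError if no other rack has at least two nodes.
    """
    node_rack = {node: rack for rack, nodes in racks.items() for node in nodes}
    local_rack = node_rack[writer_node]
    remote_candidates = sorted(
        r for r in racks if r != local_rack and len(racks[r]) >= 2
    )
    remote_rack = rng.choice(remote_candidates)
    second, third = rng.sample(sorted(racks[remote_rack]), 2)
    return [writer_node, second, third]


# Simplified layout for illustration only.
racks = {"rack1": ["DN1", "DN2", "DN3"], "rack5": ["DN5", "DN6", "DN7", "DN8"]}
replicas = place_replicas(racks, "DN1", random.Random(0))
print(replicas)
```

With this layout, losing all of rack1 still leaves two replicas on rack5, while two of the three copies share a rack and so replicate over the faster in-rack link.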
Cluster Health
Hadoop Ecosystem
           Important components of Hadoop


•   HDFS: A distributed, fault-tolerant file system
•   MapReduce: A parallel data processing framework
•   Hive: A query framework (like SQL)
•   Pig: A query scripting tool
•   HBase: A column-oriented database for OLTP
MapReduce-Paradigm
•   Simplified data processing on large clusters
•   Splitting a big problem/data set into little pieces
•   Key-Value
MapReduce-Batch Processing
•       Phases
    •     Map
    •     Sort/Shuffle
    •     Reduce (Aggregation)
•       Coordination
    •     Job Tracker
    •     Task Tracker
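
To make the three phases concrete, here is a single-process word count in Python. It is only a sketch of the paradigm: the function names are hypothetical, and nothing here is distributed or uses the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Sort/Shuffle: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: aggregate the values of one key (here, a sum)."""
    return key, sum(values)


splits = ["big data big cluster", "data node data", "big data"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each split would be mapped on a different node and the shuffle would move data across the network; the key/value contract between the phases is the same.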
MapReduce-Map
[Diagram: a MAP task on each of Datanode 1, Datanode 2, and Datanode 3 emits key/value (K, V) pairs; every emitted value is 1]
MapReduce-Sort/Shuffle
[Diagram: on Datanode 1, Datanode 2, and Datanode 3, a SORT step groups the emitted 1 values by key]
MapReduce-Reduce
[Diagram: after SORT, REDUCE tasks on Datanode 1, Datanode 2, and Datanode 3 sum the 1 values per key (K, V); the totals shown are 4, 2, 3, and 3]
MapReduce-All Phases
[Diagram: the full MAP, SORT, REDUCE pipeline on three nodes; the reduced totals shown are 4, 2, 3, and 3]
MapReduce-Job & Task Tracker

[Diagram labels: Namenode, Datanodes]

JobTracker and TaskTracker interaction. After a client calls the JobTracker to begin a data
processing job, the JobTracker partitions the work and assigns different map and reduce tasks
to each TaskTracker in the cluster.
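
The assignment step described in the caption above might look roughly like the following Python sketch. The function `assign_map_tasks` is hypothetical, and the preference for a TaskTracker that already holds the split's data is an assumption added here for illustration; the slide itself only says that the JobTracker partitions and assigns the work.

```python
def assign_map_tasks(split_locations, trackers):
    """Assign each input split to the least-loaded TaskTracker.

    split_locations: split name -> nodes that hold the split's data
    trackers:        TaskTracker node names
    Assumption: a tracker co-located with the data is preferred when one exists.
    """
    load = {t: 0 for t in trackers}
    assignment = {}
    for split, hosts in split_locations.items():
        local = [t for t in trackers if t in hosts]
        candidates = local or trackers          # fall back to any tracker
        chosen = min(candidates, key=lambda t: load[t])
        assignment[split] = chosen
        load[chosen] += 1
    return assignment


input_splits = {"split-0": ["DN1", "DN5"], "split-1": ["DN1", "DN2"], "split-2": ["DN9"]}
assignment = assign_map_tasks(input_splits, ["DN1", "DN2", "DN5"])
print(assignment)
```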
Summary of HDFS and MR
Hive
•   Data warehousing package built on top of Hadoop
•   It began its life at Facebook, processing large amounts of user
    and log data
•   Hadoop subproject with many contributors
•   Ad hoc queries, summarization, and data analysis on
    Hadoop-scale data
•   Directly query data from different formats (text/binary) and file
    formats (flat/sequence)
•   HiveQL: an SQL-like query language
Hive Components
[Diagram components: Mgmt. Web UI; Hive CLI (Browsing, Queries, DDL); Thrift API*; MetaStore; Hive QL (Parser, Planner, Execution); Map Reduce; HDFS]
*Thrift: interface definition language
Pig
•       The language used to express data flows is called Pig Latin
•       Pig Latin can be extended using UDFs (User Defined Functions)
•       Pig was originally developed at Yahoo Research
•       PigPen is an Eclipse plug-in that provides an environment for
        developing Pig programs
•       Running Pig programs
    •       Script: a script file that contains Pig commands
    •       Grunt: an interactive shell
    •       Embedded: run from Java
Pig
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
      AS (year:chararray, temperature:int, quality:int);

grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}

grunt> filtered_records = FILTER records BY temperature != 22;
grunt> DUMP filtered_records;

grunt> grouped_records = GROUP records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
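
For readers who do not know Pig Latin, the FILTER and GROUP steps in the Grunt session above correspond roughly to the Python below. This is only an analogy on the same five sample records: the variable names mirror the Pig aliases, and plain Python lists stand in for Pig relations.

```python
from collections import defaultdict

# Same sample tuples as the DUMP output above: (year, temperature, quality).
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]

# FILTER records BY temperature != 22
filtered_records = [r for r in records if r[1] != 22]

# GROUP records BY year
grouped = defaultdict(list)
for record in records:
    grouped[record[0]].append(record)
grouped_records = dict(sorted(grouped.items()))

print(filtered_records)
print(grouped_records)
```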
HBase
•   Random, realtime read/write access to your Big Data

•   Billions of rows X millions of columns

•   Column-oriented store modeled after Google's BigTable

•   Provides Bigtable-like capabilities on top of Hadoop and HDFS

•   HBase is not a column-oriented database in the typical RDBMS
    sense, but utilizes an on-disk column storage format
HBase-Datamodel
•       (Table, RowKey, Family, Column, Timestamp) → Value

•       Think of tags. Values can be any length, with no predefined names or widths

•       Column names carry info (just like tags)
Create Sample Table
hbase(main):003:0> create 'test', 'cf'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
hbase(main):007:0> scan 'test'
ROW       COLUMN+CELL
row1     column=cf:a, timestamp=1288380727188, value=value12
row2     column=cf:b, timestamp=1288380738440, value=value2
row3     column=cf:c, timestamp=1288380747365, value=value3
hbase(main):007:0> scan 'test', { VERSIONS => 3 }
ROW       COLUMN+CELL
row1     column=cf:a, timestamp=1288380727188, value=value12
row1     column=cf:a, timestamp=1288380727188, value=value11
row2     column=cf:b, timestamp=1288380738440, value=value2
row3     column=cf:c, timestamp=1288380747365, value=value3
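
The shell session above shows that a second put to the same cell does not overwrite the first; it adds a newer version. A minimal Python model of that (RowKey, Family:Column, Timestamp) → Value mapping is sketched below. `ToyTable` is a hypothetical class, not the HBase client API, and the timestamps 1 to 4 are made up for readability.

```python
from collections import defaultdict


class ToyTable:
    """Hypothetical versioned key-value table (not the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, "family:qualifier") -> [(timestamp, value)], newest first
        self.cells = defaultdict(list)

    def put(self, row, column, value, timestamp):
        versions = self.cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(reverse=True)          # newest timestamp first
        del versions[self.max_versions:]     # drop versions beyond the limit

    def scan(self, versions=1):
        """Yield cells in row-key order, up to `versions` versions per cell."""
        for row, column in sorted(self.cells):
            for timestamp, value in self.cells[(row, column)][:versions]:
                yield row, column, timestamp, value


table = ToyTable()
table.put("row1", "cf:a", "value11", timestamp=1)
table.put("row1", "cf:a", "value12", timestamp=2)
table.put("row2", "cf:b", "value2", timestamp=3)
table.put("row3", "cf:c", "value3", timestamp=4)

latest = list(table.scan())
history = list(table.scan(versions=3))
print(latest)
print(history)
```

A plain scan returns only the newest version of each cell, while asking for more versions also returns `value11`, as in the shell output.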
HBase-Architecture
•   Splits

•   Auto-Sharding

•   Master

•   Region Servers

•   HFile
Splits & RegionServers




•   Rows grouped in regions and served by different servers
•   Table dynamically split into “regions”
•   Each region contains values [startKey, endKey)
•   Regions hosted on a regionserver
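
Because each region covers a half-open key range [startKey, endKey), finding the region for a row key is a sorted-boundary lookup. The Python sketch below illustrates the idea with `bisect`; `ToyRegionMap`, the split keys, and the server names are all hypothetical, and an empty start key / `None` end key stand in for the open ends of the first and last regions.

```python
from bisect import bisect_right


class ToyRegionMap:
    """Hypothetical row-key to region lookup (not the HBase API)."""

    def __init__(self, split_keys, servers):
        # N split keys define N + 1 regions: ["", k1), [k1, k2), ..., [kN, end)
        assert len(servers) == len(split_keys) + 1
        self.split_keys = sorted(split_keys)
        self.servers = servers

    def locate(self, row_key):
        """Return ((start_key, end_key), server) for the region holding row_key."""
        index = bisect_right(self.split_keys, row_key)
        start = self.split_keys[index - 1] if index > 0 else ""
        end = self.split_keys[index] if index < len(self.split_keys) else None
        return (start, end), self.servers[index]


regions = ToyRegionMap(["g", "p"], ["rs1", "rs2", "rs3"])
print(regions.locate("apple"))
print(regions.locate("g"))
print(regions.locate("zebra"))
```

A row key equal to a split key falls into the region that starts there, which is what the inclusive start and exclusive end of [startKey, endKey) implies.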
HBase-Architecture
Other Components
•   Flume

•   Sqoop
Commercial Products
•   Oracle Big Data Appliance

•   Microsoft Azure + Excel + MapReduce

•   Cloud computing, Amazon elastic computing

•   IBM Hadoop-based InfoSphere BigInsights

•   VMware Spring for Apache Hadoop

•   Toad for Cloud Databases

•   MapR, Cloudera, Hortonworks, Datameer
Thank You



Faruk Berksöz
fberksoz@gmail.com

How UiPath Discovery Suite supports identification of Agentic Process Automat...
 
Enterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdfEnterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdf
 
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
 
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
 
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptxFIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
 

Hadoop hbase mapreduce

  • 2. What is Big Data? ● How big is “Big Data”? ● Is 30 to 40 terabytes big data? ● …. ● Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools ● Today: terabytes, petabytes, exabytes ● Tomorrow?
  • 3. Enterprises & Big Data ● Most companies are currently using traditional tools to store data ● Big data: The next frontier for innovation, competition, and productivity ● The use of big data will become a key basis of competition ● Organisations across the globe need to take the rising importance of big data more seriously
  • 4. Hadoop is an ecosystem, not a single product. When you deal with BigData, the data center is your computer.
  • 5. A Brief History of Hadoop • Contributors and Development • What is Hadoop • Why Hadoop • Hadoop Ecosystem
  • 6. A Brief History of Hadoop • Hadoop has its origins in Apache Nutch • Nutch was started in 2002 • Challenge: how to handle the billions of pages on the Web? • 2003: Google published the GFS (Google File System) paper • 2004: NDFS (Nutch Distributed File System) • 2004: Google published the MapReduce paper • 2005: Nutch developers started developing MapReduce
  • 8. Contributors and Development: lifetime patches contributed for all Hadoop-related projects, community members by current employer * Source: JIRA tickets
  • 10. Contributors and Development * Source: Kerberos Konference (Yahoo), 2010
  • 11. Development in ASF/Hadoop ● Resources ● Mailing List ● Wiki Pages , blogs ● Issue Tracking – JIRA ● Version Control SVN – Git
  • 13. What is Hadoop • Open-source project administered by the ASF • Data-intensive storage and massively parallel processing (MPP) • Enables applications to work with thousands of nodes and petabytes of data • Suitable for applications with large data sets
  • 14. What is Hadoop? • Scalable • Fault tolerant • Reliable data storage using the Hadoop Distributed File System (HDFS) • High-performance parallel data processing using a technique called MapReduce
  • 15. What is Hadoop? • Hadoop is becoming the de facto standard for large-scale data processing • It is becoming more than just MapReduce • The ecosystem is growing rapidly, with lots of great tools around it
  • 16. What is Hadoop? Yahoo Hadoop Cluster: 38,000 machines distributed across 20 different clusters (source: Yahoo, 2010); 50,000 m : January 2012 (source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/) [Image: SGI Hadoop Cluster]
  • 21. Why Hadoop? • Hadoop has its origins in Apache Nutch • Can process Big Data (petabytes and more) • Unlimited data storage and analysis • No licence cost (Apache License 2.0) • Can be built from commodity hardware • IT cost reduction • Results • Be one step ahead of the competition • Stay there
  • 22. Is Hadoop an alternative to an RDBMS? • At the moment Apache Hadoop is not a substitute for a database • No relations • Key-value pairs • Big Data • Unstructured (text) • Semi-structured (sequence / binary files) • Structured (HBase, modeled on Google Bigtable) • Works fine together with an RDBMS
  • 24. Hadoop Ecosystem [Layer diagram, top to bottom] ETL Tools, BI Reporting, RDBMS • Pig (Data Flow), Hive (SQL), Sqoop • MapReduce (Job Scheduling/Execution System) • HBase (Key-Value store) • HDFS (Hadoop Distributed File System)
  • 25. Hadoop Ecosystem: important components of Hadoop • HDFS: a distributed, fault-tolerant file system • MapReduce: a parallel data processing framework • Hive: a query framework (like SQL) • Pig: a query scripting tool • HBase: realtime read/write access to your Big Data
  • 26. Hadoop Ecosystem Hadoop is a Distributed Data Computing Platform
  • 27. HDFS
  • 28. HDFS: NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file metadata: which files are in the system and how each file is broken down into blocks. The DataNodes provide backup store of the blocks and constantly report to the NameNode to keep the metadata current.
  • 30. Writing Files To HDFS • Client consults NameNode • Client writes block directly to one DataNode • DataNode replicates block • Cycle repeats for next block
  • 31. Reading Files From HDFS • Client consults NameNode • Client receives DataNode list for each block • Client picks first DataNode for each block • Client reads blocks sequentially
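The write and read steps on slides 30 and 31 can be sketched as a small in-memory simulation. This is only an illustration and not the HDFS client API: `ToyNameNode`, `write_file`, `read_file`, the 4-byte block size and the round-robin choice of DataNodes are all assumptions made for the example.

```python
import itertools

BLOCK_SIZE = 4      # bytes per block; tiny on purpose (assumption for the demo)
REPLICATION = 3     # number of copies kept of each block


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, datanode_ids):
        self.datanode_ids = list(datanode_ids)
        self.block_map = {}                  # filename -> [(block_id, [datanode ids])]
        self._next_block = itertools.count()
        self._cursor = 0

    def allocate_block(self, filename):
        # Choose REPLICATION distinct DataNodes in round-robin order (hypothetical policy).
        n = len(self.datanode_ids)
        targets = [self.datanode_ids[(self._cursor + i) % n] for i in range(REPLICATION)]
        self._cursor = (self._cursor + 1) % n
        block_id = next(self._next_block)
        self.block_map.setdefault(filename, []).append((block_id, targets))
        return block_id, targets

    def locate(self, filename):
        return self.block_map[filename]


def write_file(namenode, datanodes, filename, data):
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk = data[offset:offset + BLOCK_SIZE]
        block_id, targets = namenode.allocate_block(filename)  # client consults NameNode
        datanodes[targets[0]][block_id] = chunk                 # client writes to one DataNode
        for replica in targets[1:]:                             # that block is then replicated
            datanodes[replica][block_id] = chunk
        # the cycle repeats for the next block


def read_file(namenode, datanodes, filename):
    parts = []
    for block_id, locations in namenode.locate(filename):      # DataNode list per block
        parts.append(datanodes[locations[0]][block_id])        # pick the first DataNode
    return b"".join(parts)                                      # blocks read sequentially


datanodes = {name: {} for name in ("dn1", "dn2", "dn3", "dn4")}
namenode = ToyNameNode(datanodes)
write_file(namenode, datanodes, "File.txt", b"hello hadoop")
print(read_file(namenode, datanodes, "File.txt"))
```

Note that the toy NameNode never touches file contents; it only hands out block locations, which is the division of labour slide 28 describes.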
  • 32. Rack Awareness & Fault Tolerance [Diagram: NameNode metadata for File.txt: Blk A: DN1, DN5, DN6; Blk B: DN1, DN2, DN9; Blk C: DN5, DN9, DN10. Rack-aware view: Rack 1: DN1, DN2, DN3, DN5; Rack 5: DN5, DN6, DN7, DN8; Rack N] • Never lose all data if an entire rack fails • Within a rack: higher bandwidth, lower latency
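A minimal sketch of rack-aware replica placement, assuming the commonly described HDFS default of one replica on the writer's node and two replicas on different nodes of one other rack. The function names and the two-rack topology below are hypothetical; the topology is loosely based on the slide's diagram.

```python
def place_replicas(writer_node, racks):
    """Return three nodes for a block: the writer's node plus two nodes on another rack.

    racks maps a rack name to the list of DataNode names in that rack.
    """
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    # First other rack that has at least two nodes (simplified choice, an assumption).
    remote_rack = next(
        r for r, nodes in racks.items() if r != local_rack and len(nodes) >= 2
    )
    second, third = racks[remote_rack][:2]
    return [writer_node, second, third]


def surviving_replicas(replicas, racks, failed_rack):
    """Replicas still reachable after every node in failed_rack goes down."""
    return [node for node in replicas if node not in racks[failed_rack]]


# Hypothetical topology for illustration.
racks = {
    "rack1": ["DN1", "DN2", "DN3"],
    "rack5": ["DN5", "DN6", "DN7", "DN8"],
}

block_a = place_replicas("DN1", racks)
print(block_a)                                       # same placement as Blk A on the slide
print(surviving_replicas(block_a, racks, "rack1"))   # copies left if rack 1 fails
print(surviving_replicas(block_a, racks, "rack5"))   # copies left if rack 5 fails
```

Keeping two of the three copies inside one rack is the trade-off the slide's bullets point at: in-rack transfers are cheaper, while the copy on the other rack protects against losing a whole rack.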
  • 34. Hadoop Ecosystem: important components of Hadoop • HDFS: a distributed, fault-tolerant file system • MapReduce: a parallel data processing framework • Hive: a query framework (like SQL) • Pig: a query scripting tool • HBase: a column-oriented database for OLTP
  • 35. MapReduce-Paradigm • Simplified data processing on large clusters • Splitting a big problem/data set into little pieces • Key-Value
  • 36. MapReduce-Batch Processing • Phases • Map • Sort/Shuffle • Reduce (Aggregation) • Coordination • Job Tracker • Task Tracker
  • 37. MapReduce-Map [Diagram: a MAP task on each of Datanode 1, 2 and 3 emits key (K) / value (V) pairs, each with the value 1]
  • 38. MapReduce-Sort/Shuffle [Diagram: on each of Datanode 1, 2 and 3, SORT groups the emitted 1 values by key]
  • 39. MapReduce-Reduce [Diagram: on each of Datanode 1, 2 and 3, SORT feeds REDUCE, which sums the 1 values for a key into a single K/V total]
  • 40. MapReduce-All Phases [Diagram: MAP, SORT and REDUCE shown end to end across the three datanodes]
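The three phases can be sketched in plain Python as a word count. This runs in a single process and only mirrors the data flow of map, sort/shuffle and reduce; the input splits and function names are made up for the example and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (key, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]


def shuffle_sort(mapped_outputs):
    """Sort/Shuffle: bring all values for the same key together, keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: aggregate the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Three hypothetical input splits, one per datanode.
splits = ["a b a", "b a c", "c a b c"]

mapped = [map_phase(s) for s in splits]   # one map task per split
grouped = shuffle_sort(mapped)
counts = reduce_phase(grouped)
print(counts)
```

In a real cluster the map tasks run where the blocks are stored and the shuffle moves data over the network; the sketch keeps only the key-value contract between the phases.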
  • 41. MapReduce-Job & Task Tracker [Diagram labels: Namenode, Datanodes] JobTracker and TaskTracker interaction: after a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
  • 42. Summary of HDFS and MR
  • 44. Hive
  • 45. Hive • Data warehousing package built on top of Hadoop • It began its life at Facebook, processing large amounts of user and log data • Hadoop subproject with many contributors • Ad hoc queries, summarization, and data analysis on Hadoop-scale data • Directly query data from different formats (text/binary) and file formats (flat/sequence) • HiveQL: like SQL
  • 46. Hive Components [Diagram: Hive CLI, Mgmt. Web UI and Thrift API (browsing, queries, DDL) • Hive QL (Parser, Planner, Execution) • MetaStore • Map Reduce • HDFS] * Thrift: interface definition language
  • 48. Pig • Data flows are expressed in a language called Pig Latin • Pig Latin can be extended using UDFs (User Defined Functions) • Pig was originally developed at Yahoo Research • PigPen is an Eclipse plug-in that provides an environment for developing Pig programs • Running Pig programs • Script: a script file that contains Pig commands • Grunt: interactive shell • Embedded: from Java
  • 49. Pig
      grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt' AS (year:chararray, temperature:int, quality:int);
      grunt> DUMP records;
      (1950,0,1)
      (1950,22,1)
      (1950,-11,1)
      (1949,111,1)
      (1949,78,1)
      grunt> DESCRIBE records;
      records: {year: chararray,temperature: int,quality: int}
      grunt> filtered_records = FILTER records BY temperature != 22;
      grunt> DUMP filtered_records;
      grunt> grouped_records = GROUP records BY year;
      grunt> DUMP grouped_records;
      (1949,{(1949,111,1),(1949,78,1)})
      (1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
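For readers who do not know Pig Latin, the FILTER and GROUP steps of the Grunt session can be mirrored in plain Python. This is only an analogy for what the two operators compute on the five sample records; it is not how Pig executes them (Pig compiles such scripts into MapReduce jobs).

```python
from itertools import groupby

# The five sample records from the slide: (year, temperature, quality).
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
]

# FILTER records BY temperature != 22
filtered_records = [r for r in records if r[1] != 22]

# GROUP records BY year: sort by the key first, then collect each run of equal keys.
by_year = sorted(records, key=lambda r: r[0])     # stable sort keeps the original order per year
grouped_records = {
    year: list(rows) for year, rows in groupby(by_year, key=lambda r: r[0])
}

print(filtered_records)
print(grouped_records)
```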
  • 51. HBase • Random, realtime read/write access to your Big Data • Billions of rows X millions of columns • Column-oriented store modeled after Google's BigTable • provides Bigtable-like capabilities on top of Hadoop and HDFS • HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format
  • 52. HBase-Datamodel • (Table, RowKey, Family,Column, Timestamp) → Value • Think of tags. Values any length, no predefined names or widths • Column names carry info (just like tags)
  • 55. Create Sample Table
      hbase(main):003:0> create 'test', 'cf'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value11'
      hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value12'
      hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
      hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
      hbase(main):007:0> scan 'test'
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
      hbase(main):007:0> scan 'test', { VERSIONS => 3 }
      ROW     COLUMN+CELL
      row1    column=cf:a, timestamp=1288380727188, value=value12
      row1    column=cf:a, timestamp=1288380727188, value=value11
      row2    column=cf:b, timestamp=1288380738440, value=value2
      row3    column=cf:c, timestamp=1288380747365, value=value3
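The mapping (Table, RowKey, Family, Column, Timestamp) → Value can be sketched as a versioned dictionary. `ToyTable` is a hypothetical in-memory stand-in and not the HBase client API; it uses a simple counter as the timestamp instead of milliseconds so that the result is deterministic.

```python
from collections import defaultdict


class ToyTable:
    """One table: (row key, 'family:qualifier') -> list of (timestamp, value), newest first."""

    def __init__(self, families):
        self.families = set(families)      # column families are fixed at creation time
        self.cells = defaultdict(list)
        self._clock = 0                    # logical clock standing in for a timestamp (assumption)

    def put(self, row, column, value, timestamp=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if timestamp is None:
            self._clock += 1
            timestamp = self._clock
        versions = self.cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)   # keep the newest version first

    def scan(self, versions=1):
        """Return cells in row-key order, with up to `versions` versions per cell."""
        result = []
        for row, column in sorted(self.cells):
            for timestamp, value in self.cells[(row, column)][:versions]:
                result.append((row, column, timestamp, value))
        return result


test = ToyTable(families=["cf"])           # create 'test', 'cf'
test.put("row1", "cf:a", "value11")
test.put("row1", "cf:a", "value12")        # a second version of the same cell
test.put("row2", "cf:b", "value2")
test.put("row3", "cf:c", "value3")

print(test.scan())                         # latest version of each cell only
print(test.scan(versions=3))               # also shows the older value11
```

Column qualifiers (`a`, `b`, `c`) are created on the fly by `put`, while the family `cf` must exist up front, which matches the slide's point that column names carry information and need no predefined schema.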
  • 56. HBase-Architecture • Splits • Auto-Sharding • Master • Region Servers • HFile
  • 57. Splits & RegionServers • Rows grouped in regions and served by different servers • Table dynamically split into “regions” • Each region contains values [startKey, endKey) • Regions hosted on a regionserver
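Because each region covers a half-open key range [startKey, endKey), finding the region for a row key is a binary search over the sorted split points. A minimal sketch follows; `ToyRegionMap`, the split keys and the server names are hypothetical and not part of HBase.

```python
import bisect


class ToyRegionMap:
    """n sorted split keys define n + 1 regions, each hosted on one region server."""

    def __init__(self, split_keys, servers):
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server entry per region")
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        # bisect_right sends a key equal to a split point to the region that starts
        # at that key, so startKey is inclusive and endKey is exclusive.
        index = bisect.bisect_right(self.split_keys, row_key)
        return index, self.servers[index]


# Three regions: [ "", "g" ), [ "g", "p" ), [ "p", end ); one server hosts two of them.
regions = ToyRegionMap(split_keys=["g", "p"], servers=["rs1", "rs2", "rs1"])

for key in ("apple", "g", "orange", "p", "zebra"):
    print(key, regions.locate(key))
```

Splitting a region then amounts to inserting one more split key and one more server entry, which is the dynamic splitting the slide describes.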
  • 59. Other Components • Flume • Sqoop
  • 60. Commercial Products • Oracle Big Data Appliance • Microsoft Azure + Excel + MapReduce • Cloud computing, Amazon elastic computing • IBM Hadoop-based InfoSphere BigInsights • VMware Spring for Apache Hadoop • Toad for Cloud Databases • MapR, Cloudera, Hortonworks, Datameer