Getting Started With Big Data
Apache Hadoop
• is a popular open-source
framework for storing and
processing large data sets across
clusters of computers.
• HDP 2.2 on Sandbox system
Requirements:
– Now runs on 32-bit and 64-bit OS
(Windows XP, Windows 7,
Windows 8 and Mac OS X)
– Minimum 4 GB RAM; 8 GB required
to run Ambari and HBase
– Virtualization enabled on BIOS
– Browser: Chrome 25+, IE 9+, Safari
6+ recommended. (Sandbox will
not run on IE 10)
• An ideal way to get started with Enterprise
Hadoop. Sandbox is a self-contained
virtual machine with Apache Hadoop
pre-configured alongside a set of
hands-on, step-by-step Hadoop
tutorials.
• Sandbox is a personal, portable Hadoop
environment that comes with a dozen
interactive Hadoop tutorials.
• It includes many of the most exciting
developments from the latest HDP
distribution, packaged up in a virtual
environment that you can get up and
running in 15 minutes!
Hadoop… Getting Started
Terminologies
• Hadoop
• YARN – the Hadoop operating system
– enables a user to interact with all data in multiple
ways simultaneously, making Hadoop a true multi-use
data platform and allowing it to take its place in a
modern data architecture.
– A framework for job scheduling and cluster resource
management.
– This means that many different processing engines can
operate simultaneously across a Hadoop cluster, on
the same data, at the same time.
• the Hadoop Distributed File System (HDFS)
– A distributed file system that provides high-
throughput access to application data.
• MapReduce
– A YARN-based system for parallel processing of large
data sets.
• Sqoop
• the Hive ODBC Driver
Hortonworks Data Platform (HDP)
• is a 100% open source
distribution of Apache
Hadoop that is truly
enterprise grade having
been built, tested and
hardened with enterprise
rigor.
Introducing Apache Hadoop to
Developers
• Apache Hadoop is a community-driven open-source project
governed by the Apache Software Foundation.
• It was originally implemented at Yahoo, based on papers published
by Google in 2003 and 2004.
• Since then Apache Hadoop has matured into a data platform
that is no longer limited to batch processing of very large
amounts of data. With the advent of YARN it now supports
many diverse workloads, such as interactive queries over
large data with Hive on Tez, real-time data processing with
Apache Storm, scalable NoSQL data stores like HBase,
in-memory processing with Spark, and the list goes on.
Apache Enterprise Hadoop
...
Core of Hadoop
• A set of machines running
HDFS and MapReduce is
known as a Hadoop Cluster.
Individual machines are
known as nodes. A cluster
can have as few as one node
or as many as several
thousand. For most
application scenarios Hadoop
is linearly scalable, which
means you can expect better
performance by simply
adding more nodes.
• The Hadoop
Distributed File
System (HDFS)
• MapReduce
MapReduce
• a method for distributing a task across multiple nodes. Each node
processes data stored on that node to the extent possible.
• A running MapReduce job consists of various phases such as
Map -> Sort -> Shuffle -> Reduce
• Advantages:
– Automatic parallelization and distribution of data in blocks across a
distributed, scale-out infrastructure.
– Fault-tolerance against failure of storage, compute and network
infrastructure
– Deployment, monitoring and security capability
– A clean abstraction for programmers
• Most MapReduce programs are written in Java. They can also be
written in any scripting language using the Hadoop
Streaming API.
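The phases listed above can be imitated on a single machine. The following is a minimal sketch in plain Python, not Hadoop code: the function names (`map_phase`, `sort_and_shuffle`, `reduce_phase`, `run_job`) are made up for illustration, and everything runs in one process rather than across nodes.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def sort_and_shuffle(pairs):
    """Sort intermediate pairs by key, then collect the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: [value for _, value in group]
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run Map -> Sort -> Shuffle -> Reduce over a list of text lines."""
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = sort_and_shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

In a real cluster the mapper and reducer would run as separate tasks on different nodes, and the framework would perform the sort and shuffle between them.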
The MapReduce Concepts and
Terminology
• In classic (pre-YARN) MapReduce, jobs are controlled by a
software daemon known as the JobTracker. The JobTracker resides
on a 'master node'. Clients submit MapReduce jobs to the
JobTracker. The JobTracker assigns Map and Reduce tasks to
other nodes on the cluster. Under YARN these roles are taken
over by the ResourceManager, a per-application
ApplicationMaster and the NodeManagers.
• These nodes each run a software daemon known as the
TaskTracker. The TaskTracker is responsible for actually
instantiating the Map or Reduce task, and reporting
progress back to the JobTracker.
• A job is a complete execution of Mappers and Reducers
over a dataset. A task is the execution of a single
Mapper or Reducer over a slice of data.
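The job/task distinction can be made concrete with a toy planner. This is only a sketch under stated assumptions: `Task`, `plan_map_tasks` and `assign_round_robin` are hypothetical names, and round-robin assignment stands in for the real scheduler, which also takes into account where the data is stored.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: int
    records: list  # the slice of the dataset this task processes


def plan_map_tasks(dataset, slice_size):
    """Cut a dataset into fixed-size slices and create one task per slice."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    starts = range(0, len(dataset), slice_size)
    return [
        Task(task_id, dataset[start:start + slice_size])
        for task_id, start in enumerate(starts)
    ]


def assign_round_robin(tasks, workers):
    """Toy stand-in for a master node handing tasks to worker nodes."""
    assignment = {worker: [] for worker in workers}
    for index, task in enumerate(tasks):
        assignment[workers[index % len(workers)]].append(task.task_id)
    return assignment


if __name__ == "__main__":
    tasks = plan_map_tasks(list(range(10)), slice_size=4)
    print(assign_round_robin(tasks, ["node-a", "node-b"]))
```

Here the whole call sequence plays the role of one job, while each `Task` object corresponds to one Mapper run over its slice.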
Hadoop Distributed File System
• the foundation of the Hadoop cluster.
• manages how the datasets are stored in the
Hadoop cluster.
• responsible for distributing the data across the
data nodes, managing replication for
redundancy and administrative tasks like
adding, removing and recovering data nodes.
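To make distribution and replication more tangible, here is a small simulation. It is a sketch, not the HDFS implementation: the function names are invented, the sizes are arbitrary example numbers, and replicas are placed round-robin, whereas real HDFS uses a rack-aware placement policy.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign every block to `replication` distinct data nodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough data nodes for the requested replication")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def blocks_needing_copy(placement, failed_node):
    """List the blocks that lost a replica when failed_node went away."""
    return [block for block, nodes in placement.items() if failed_node in nodes]


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
    print(blocks, placement, blocks_needing_copy(placement, "dn1"))
```

Because every block lives on several nodes, losing one data node only means that the affected blocks have to be copied again from their surviving replicas.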
Apache Hive
• provides a data warehouse view of the data in HDFS.
• Using a SQL-like language, Hive lets you create
summarizations of your data, perform ad-hoc queries,
and analyze large datasets in the Hadoop cluster.
• The overall approach with Hive is to project a table
structure on the dataset and then manipulate it with
HiveQL.
• Since you are using data in HDFS, your operations can
be scaled across all the datanodes and you can
manipulate huge datasets.
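The idea of projecting a table structure onto existing data can be illustrated without Hive. The sketch below is plain Python, not HiveQL: the sample data, the `SCHEMA` list and both function names are assumptions made for the example.

```python
import csv
from collections import defaultdict
from io import StringIO

# Hypothetical raw file content: the data carries no column names or types.
RAW = "alice,books,12.50\nbob,games,30.00\nalice,games,7.50\n"

# The "table structure" that is projected onto the raw text at read time.
SCHEMA = [("customer", str), ("category", str), ("amount", float)]


def project_table(raw_text, schema):
    """Apply column names and types to delimited text while reading it."""
    rows = []
    for fields in csv.reader(StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


def total_by(rows, key, measure):
    """Roughly what SELECT key, SUM(measure) ... GROUP BY key would compute."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


if __name__ == "__main__":
    table = project_table(RAW, SCHEMA)
    print(total_by(table, "customer", "amount"))
```

The raw text is never rewritten; only the schema decides how it is interpreted, which is why the same files can be queried in different ways.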
Apache HCatalog
• Used to hold location and metadata about the
data in a Hadoop cluster. This allows scripts and
MapReduce jobs to be decoupled from data
location and metadata like the schema.
• Since it supports many tools, like Hive and Pig,
the location and metadata can be shared
between tools. Using the open APIs of HCatalog,
other tools like Teradata Aster can also use the
location and metadata in HCatalog.
• The question it answers: how can we reference data by name
and inherit the location and metadata?
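Referencing data by name can be pictured as a lookup in a shared registry. The following sketch is not the HCatalog API: `Catalog`, `TableInfo`, `column_names` and the example paths are all hypothetical and only show how a script stays decoupled from location and schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    location: str  # where the data lives (hypothetical path)
    schema: tuple  # ((column_name, type_name), ...)


class Catalog:
    """Toy registry that maps a table name to its location and schema."""

    def __init__(self):
        self._tables = {}

    def register(self, name, location, schema):
        self._tables[name] = TableInfo(location, tuple(schema))

    def lookup(self, name):
        return self._tables[name]


def column_names(catalog, table):
    """A 'tool' that knows only the table name, never the path or the layout."""
    return [column for column, _ in catalog.lookup(table).schema]


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register("sales", "/data/sales/v1", [("customer", "string"), ("amount", "double")])
    print(column_names(catalog, "sales"), catalog.lookup("sales").location)
```

If the data is later moved, only the catalog entry changes; every tool that asks for `sales` by name keeps working.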
Apache Pig
• a language for expressing data analysis and
infrastructure processes.
• is translated into a series of MapReduce jobs that
are run by the Hadoop cluster.
• is extensible through user-defined functions that
can be written in Java and other languages.
• Pig scripts provide a high level language to create
the MapReduce jobs needed to process data in a
Hadoop cluster.
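A Pig script is essentially a chain of dataflow steps, optionally calling user-defined functions. The sketch below mimics that shape in plain Python. It is not Pig Latin and does not generate MapReduce jobs; the step names (`load`, `filter_rows`, `group_by`, `foreach_count`) and the `is_error` function are invented for illustration.

```python
from collections import defaultdict


def load(lines, sep=","):
    """Turn delimited text lines into tuples of fields."""
    return [tuple(line.split(sep)) for line in lines]


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate returns True."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key_index):
    """Collect rows into groups keyed by one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def foreach_count(groups):
    """Produce one (key, count) result per group."""
    return {key: len(rows) for key, rows in groups.items()}


def is_error(row):
    """User-defined function plugged into the filter step."""
    return row[1] == "ERROR"


def error_counts(lines):
    """The whole 'script': load, filter, group, count."""
    return foreach_count(group_by(filter_rows(load(lines), is_error), 0))


if __name__ == "__main__":
    print(error_counts(["web,ERROR", "web,INFO", "db,ERROR", "web,ERROR"]))
```

In Pig, each of these steps would be one statement of the script, and the group and count steps are the part that maps naturally onto the shuffle and reduce phases.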
