The document discusses several components of the Hadoop ecosystem including HDFS, YARN, MapReduce, Hive, Pig, and Spark. HDFS is a distributed file system that handles large datasets across commodity hardware. YARN separates resource management from job scheduling. MapReduce is the core processing framework, while tools like Hive, Pig, and Spark provide higher-level query languages and additional functionality for analyzing data in Hadoop.
1. Hadoop ecosystem
• The Hadoop ecosystem contains components such as HDFS, MapReduce, YARN, Hive, Apache Pig, Apache HBase, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper and Apache Oozie.
3. HBase & HCatalog
Apache HBase:
• This Hadoop ecosystem component is a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is scalable and distributed, and provides real-time read/write access to data in HDFS.
HCatalog:
• HCatalog is a table and storage management layer for Hadoop. It enables components of the Hadoop ecosystem such as MapReduce, Hive, and Pig to easily read and write data from the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
• By default, HCatalog supports RCFile, CSV, JSON, SequenceFile and ORC file formats.
4. Apache Mahout
• Mahout is an open source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.
Algorithms of Mahout are:
• Clustering – It takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
• Collaborative filtering – It mines user behavior and makes product recommendations (e.g. Amazon recommendations).
• Classification – It learns from existing categorizations and then assigns unclassified items to the best category.
• Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
5. Apache Sqoop & Flume
• Sqoop:
Imports data from external sources into Hadoop ecosystem components such as HDFS, HBase or Hive.
It also exports data from Hadoop to other external sources.
Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL.
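A minimal command sketch (the connection URL, credentials and table names below are hypothetical, not from the slides):
% sqoop import --connect jdbc:mysql://dbserver/sales --username analyst -P --table orders --target-dir /user/hadoop/orders
% sqoop export --connect jdbc:mysql://dbserver/sales --username analyst -P --table order_summary --export-dir /user/hadoop/order_summary
The first command copies the orders table from the relational database into HDFS; the second pushes HDFS data back into a relational table.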
6. Apache Flume
Flume efficiently collects, aggregates and moves large amounts of data from its origin into HDFS.
It is a fault-tolerant and reliable mechanism.
This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment.
It uses a simple extensible data model that allows for online analytic applications.
Using Flume, we can get data from multiple servers into Hadoop immediately.
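As an illustrative sketch, a Flume agent is described by a properties file; the agent, source, channel and sink names below (a1, r1, c1, k1) and the paths are hypothetical:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
The agent would then be started with something like: % flume-ng agent --name a1 --conf-file flume.conf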
7. Query languages for Hadoop
• MapReduce (MR) is the standard Big Data processing model for parallel and distributed processing of large datasets.
• The low-level, batch nature of MR makes it difficult to program directly, which has given rise to abstraction layers on top of MR.
• Several high-level MapReduce query languages built on top of MR provide more abstract query languages and extend the MR programming model.
8. • These high-level MapReduce query languages take the burden of MR programming away from developers and allow a smooth migration of existing SQL skills to Big Data.
• Common high-level MapReduce query languages are built directly on top of MR and translate queries into executable native MR jobs.
• Four such high-level MapReduce query languages are commonly compared: JAQL, Hive, Big SQL and Pig, with regard to their expressiveness and ease of programming.
10. Query languages for Hadoop
• Pig, from Yahoo! and now incubating at Apache, has an imperative language called Pig Latin for performing operations on large data files.
• Jaql, from IBM, is a declarative query language for JSON data.
• Hive, from Facebook, is a data warehouse system with a declarative query language that is a hybrid of SQL and Hadoop streaming.
11. HIVE & PIG
Hive:
• The Hadoop ecosystem component Apache Hive is an open source data warehouse system for querying and analysing large datasets stored in Hadoop files.
• Hive performs three main functions: data summarization, query, and analysis.
• Hive uses a language called HiveQL (HQL), which is similar to SQL.
• HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
12. Pig:
• Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored in HDFS.
• Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language.
• It is very similar to SQL.
• It loads the data, applies the required filters and translates the data into the required format.
13. STREAM COMPUTING
• Big data stream computing is able to analyze and process data in real time to gain an immediate insight; it is typically applied to the analysis of vast amounts of data in real time, processing them at high speed.
• A high-performance computer system that analyzes multiple data streams from many sources live.
• The word stream in stream computing means pulling in streams of data, processing the data and streaming it back out as a single flow.
• Stream computing uses software algorithms that analyze the data in real time as it streams in, to increase speed and accuracy when dealing with data handling and analysis.
14. • In a stream processing system, applications typically
act as continuous queries, ingesting data continuously,
analyzing and correlating the data, and generating a
stream of results.
• Applications are represented as data-flow graphs
composed of operators and interconnected by streams.
• The individual operators implement algorithms for data
analysis, such as parsing, filtering, feature extraction,
and classification.
• Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges, environmental sensor readings from sites in a forest, etc.).
16. • IBM announced its stream computing system, called System S.
• ATI Technologies also announced a stream computing technology that enables graphics processors (GPUs) to work in conjunction with high-performance, low-latency CPUs to solve complex computational problems.
18. PIG
• Pig raises the level of abstraction for processing large datasets.
• With Pig, the data structures are much richer, typically being multivalued and nested, and the set of transformations you can apply to the data is much more powerful.
• Pig Latin is a parallel data flow language. This means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two environments: local execution in a single JVM and distributed execution on a Hadoop cluster.
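As a rough sketch of such a data flow (the file path and field names are hypothetical, not from the slides), a Pig Latin program can read records, filter them, group them by year and store the totals in parallel:
records = LOAD 'sales.txt' USING PigStorage(',') AS (store:chararray, year:int, amount:double);
valid   = FILTER records BY amount > 0;
by_year = GROUP valid BY year;
totals  = FOREACH by_year GENERATE group AS year, SUM(valid.amount) AS total;
STORE totals INTO 'yearly_totals';
Each statement names a new relation, and Pig translates the whole pipeline into MapReduce jobs.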
20. Apache Pig Components
There are several components in the Apache Pig framework.
Parser
• At first, all the Pig Scripts are handled by the Parser. Parser basically
checks the syntax of the script, does type checking, and other
miscellaneous checks. Afterwards, Parser’s output will be a DAG
(directed acyclic graph) that represents the Pig Latin statements as
well as logical operators.
The logical operators of the script are represented as the nodes and
the data flows are represented as edges in DAG (the logical plan)
Optimizer
• Afterwards, the logical plan (DAG) is passed to the logical optimizer.
It carries out the logical optimizations further such as projection
and push down.
21. Compiler
• Then compiler compiles the optimized logical
plan into a series of MapReduce jobs.
Execution engine
• Eventually, all the MapReduce jobs are
submitted to Hadoop in a sorted order.
• Ultimately, it produces the desired results
22. Important points
• A Pig Latin program is made up of a series of operations, or transformations, that are applied to the input data to produce output.
• Pig transforms your query into a series of MapReduce tasks, and you are unaware of this.
• You focus on the data and do not need to know the nature of the execution.
• Pig is a scripting language for exploring large datasets.
• Pig's sweet spot is its ability to process terabytes of data simply by using only a half-dozen lines of Pig Latin from the console.
23. • Pig was designed to be extensible. Virtually all parts of the processing path are customizable: loading, storing, filtering, grouping, and joining can all be altered by user-defined functions (UDFs).
• As another benefit, UDFs tend to be more reusable than the libraries developed for writing MapReduce programs.
• Pig isn't suitable for all data processing tasks.
• If you want to perform a query that touches only a small amount of data in a large dataset, then Pig will not perform well, since it is set up to scan the whole dataset, or at least large portions of it.
24. Execution Types
Pig has two execution types or modes:
• local mode and
• MapReduce mode.
Local mode:
• In local mode, Pig runs in a single JVM and accesses the local
filesystem. This mode is suitable only for small datasets and when
trying out Pig.
• The execution type is set using the -x or -exectype option. To run in
local mode, set the option to local.
• % pig -x local
• grunt>
• This starts Grunt, the Pig interactive shell.
25. MapReduce mode
• In MapReduce mode, Pig translates queries into
MapReduce jobs and runs them on a Hadoop
cluster.
• The cluster may be a pseudo or fully distributed
cluster.
• To use MapReduce mode, you first need to check
that the version of Pig you downloaded is
compatible with the version of Hadoop you are
using. Pig releases will only work against
particular versions of Hadoop.
26. Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script:
Pig can run a script file that contains Pig commands. For example, pig script.pig
runs the commands in the local file script.pig. Alternatively, for very short scripts,
you can use the -e option to run a script specified as a string on the command line.
Grunt:
• Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run, and the -e option is not used. It is also possible to run
Pig scripts from within Grunt using run and exec.
Embedded:
• You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
27. • Interactive Mode (Grunt shell) − You can run
Apache Pig in interactive mode using the Grunt
shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump
operator).
• Batch Mode (Script) − You can run Apache Pig in
Batch mode by writing the Pig Latin script in a
single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig provides
the provision of defining our own functions
(User Defined Functions) in programming
languages such as Java, and using them in our
script.
28. • We can run our Pig scripts in the shell after invoking the Grunt shell. Moreover, there are certain useful shell and utility commands offered by the Grunt shell.
30. PigLatin
• Apache Pig offers a high-level language, Pig Latin, for writing data analysis programs.
• A Pig Latin program consists of a collection of statements. A statement can be thought of as an operation, or a command. For example, a GROUP operation is a type of statement.
grouped_records = GROUP records BY year;
Statements are usually terminated with a semicolon, as in the example of the GROUP statement. In fact, this is an example of a statement that must be terminated with a semicolon: it is a syntax error to omit it. (In Grunt, omitting the semicolon produces no error; the shell simply waits for the rest of the statement.)
31. • When we need to analyze data in Hadoop using Apache Pig, we use the Pig Latin language.
• Basically, Pig Latin statements are first transformed into MapReduce jobs by an interpreter layer; Hadoop then processes these jobs.
• Pig Latin is a very simple language with SQL-like semantics.
• It is possible to use it in a productive manner.
32. • It also contains a rich set of functions that support data manipulation.
• Moreover, by writing user-defined functions (UDFs) in Java, we can extend them easily.
• That implies they are extensible in nature.
33. Data Model in Pig Latin
• The data model of Pig is fully nested. The outermost structure of the Pig Latin data model is a relation, which is a bag, where:
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
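For instance, in Pig Latin notation (the values are only illustrative):
'Alice'                        -- a field (a single piece of data)
('Alice', 22)                  -- a tuple (an ordered set of fields)
{('Alice', 22), ('Bob', 23)}   -- a bag (a collection of tuples)
['name'#'Alice']               -- a map (a set of key#value pairs)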
34. Statements in Pig Latin
• Statements are the basic constructs for processing data with Pig Latin.
• Statements work with relations, and include expressions and schemas.
• Every statement ends with a semicolon (;).
• Through statements, we perform several operations using the operators offered by Pig Latin.
• Except for LOAD and STORE, Pig Latin statements take a relation as input and produce another relation as output.
• Semantic checking is carried out as soon as we enter a LOAD statement in the Grunt shell.
• However, to see the contents of the relation, we need to use the DUMP operator.
• The MapReduce job that actually loads the data from the file system is carried out only after the DUMP operation is performed.
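A small illustration (the file name and schema are hypothetical): entering the LOAD statement below only triggers semantic checking; the MapReduce job runs when DUMP (or STORE) is issued.
grunt> records = LOAD 'student.txt' USING PigStorage(',') AS (name:chararray, age:int);
grunt> DUMP records;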
35. Pig Latin Datatypes
• int
• “Int” represents a signed 32-bit integer.
For Example: 10
• long
• It represents a signed 64-bit integer.
For Example: 10L
• float
• This data type represents a signed 32-bit floating point.
For Example: 10.5F
• double
• “double” represents a 64-bit floating point.
For Example: 10.5
• chararray
• It represents a character array (string) in Unicode UTF-8 format.
For Example: ‘Data Flair’
• Bytearray
• This data type represents a Byte array (blob).
36. • Boolean
• “Boolean” represents a Boolean value.
For Example : true/ false.
Note: It is case insensitive.
• Datetime
• It represents a date-time.
For Example : 1970-01-01T00:00:00.000+00:00
• Biginteger
• This data type represents a Java BigInteger.
For Example: 60708090709
• Bigdecimal
• “Bigdecimal” represents a Java BigDecimal
For Example: 185.98376256272893883
37. Complex Types
• Tuple
• Bag
• Map
Pig Latin Operators
Arithmetic Operators
Comparison Operators
Type Construction Operators
38. Data Processing Operators
Loading and Storing
• LOAD
It loads data from the file system into a relation.
• STORE
It stores a relation to the file system (local/HDFS).
Filtering
• FILTER
It removes unwanted rows from a relation.
• DISTINCT
It removes duplicate rows from a relation.
• FOREACH, GENERATE
It transforms the data based on the columns of the data.
• STREAM
It transforms a relation using an external program.
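A few one-line usage sketches of these operators (the relation, file and field names are hypothetical):
records   = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, dept:chararray, marks:int);
passed    = FILTER records BY marks >= 40;
dept_only = FOREACH records GENERATE dept;
depts     = DISTINCT dept_only;
piped     = STREAM records THROUGH `cut -f 1`;
STORE passed INTO 'passed_students' USING PigStorage(',');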
39. • Diagnostic Operators
• DUMP
It prints the contents of a relation to the console.
• DESCRIBE
It describes the schema of a relation.
• EXPLAIN
We can view the logical and physical execution plans used to evaluate a relation.
• ILLUSTRATE
It displays the step-by-step execution of a series of statements.
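For example, against a relation named records defined earlier (the name is illustrative):
DUMP records;        -- print the contents of the relation to the console
DESCRIBE records;    -- show the schema of the relation
EXPLAIN records;     -- view the execution plans for the relation
ILLUSTRATE records;  -- show a step-by-step execution on a sample of the data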
40. Grouping and Joining
• JOIN: We can join two or more relations.
• COGROUP: It groups the data into two or more relations.
• GROUP: It groups the data in a single relation.
• CROSS: We can create the cross product of two or more relations.
Sorting
• ORDER
It arranges a relation in an order based on one or more fields.
• LIMIT
We can get a particular number of tuples from a relation.
Combining and Splitting
• UNION
We can combine two or more relations into one relation.
• SPLIT
It splits a single relation into two or more relations.
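One-line usage sketches of these operators (relation and field names are hypothetical):
joined   = JOIN customers BY id, orders BY cust_id;
co       = COGROUP customers BY id, orders BY cust_id;
by_city  = GROUP customers BY city;
pairs    = CROSS customers, orders;
ordered  = ORDER orders BY amount DESC;
top10    = LIMIT ordered 10;
all_rows = UNION customers_2022, customers_2023;
SPLIT orders INTO small IF amount < 100, large IF amount >= 100;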
41. Hive
• Apache Hive is an open source data warehouse system built on top of Hadoop, for querying and analyzing large datasets stored in Hadoop files.
• It processes structured and semi-structured data in Hadoop.
• Hive runs on your workstation and converts your SQL query into a series of MapReduce jobs for execution on a Hadoop cluster.
• Hive organizes data into tables, which provide a means for attaching structure to data stored in HDFS.
• Metadata such as table schemas is stored in a database called the metastore.
43. Metastore
• It stores metadata for each and every table.
• Hive also includes the partition metadata.
• This helps the driver to track the progress of the various data sets distributed over the cluster.
• It stores the data in a traditional RDBMS format.
• A backup server regularly replicates the metadata, which can be retrieved in case of data loss.
44. Driver
• It acts like a controller which receives the HiveQL
statements.
• The driver starts the execution of the statement
by creating sessions.
• It monitors the life cycle and progress of the
execution.
• Driver stores the necessary metadata generated
during the execution of a HiveQL statement.
• It also acts as a collection point of data or query
result obtained after the Reduce operation
45. Compiler
• It performs the compilation of the HiveQL query.
• This converts the query to an execution plan. The
plan contains the tasks.
• It also contains steps needed to be performed by
the MapReduce to get the output as translated by
the query.
• The compiler in Hive converts the query to
an Abstract Syntax Tree (AST).
• It first checks for compatibility and compile-time errors, then converts the AST to a Directed Acyclic Graph (DAG).
46. • Optimizer – It performs various
transformations on the execution plan to
provide optimized DAG. It aggregates the
transformations together, such as converting a
pipeline of joins to a single join, for better
performance. The optimizer can also split the
tasks, such as applying a transformation on
data before a reduce operation, to provide
better performance.
• Executor – Once compilation and optimization
complete, the executor executes the tasks.
Executor takes care of pipelining the tasks.
47. • CLI, UI, and Thrift Server –
• CLI (command-line interface) provides a user
interface for an external user to interact with
Hive.
• Thrift server in Hive allows external clients to
interact with Hive over a network, similar to
the JDBC or ODBC protocols.
48. Hive Shell
• The shell is the primary way that we will interact with Hive, by
issuing commands in HiveQL.
• HiveQL is Hive’s query language, a dialect of SQL. It is heavily
influenced by MySQL
• When starting Hive for the first time, we can check that it is working by listing its tables. The command must be terminated with a semicolon to tell Hive to execute it:
• Like SQL, HiveQL is generally case insensitive (except for string
comparisons), so show tables; works equally well here. The tab key
will auto complete Hive keywords and functions.
• For a fresh install, the command takes a few seconds to run since it
is lazily creating the metastore database on your machine. (The
database stores its files in a directory called metastore_db, which is
relative to where you ran the hive command from.)
hive> SHOW TABLES;
OK
Time taken: 10.425 seconds
51. Hive Client
• Hive supports applications written in many languages such as Python, Java, C++ and Ruby, using JDBC, ODBC, and Thrift drivers to perform queries against Hive. Hence, one can easily write a Hive client application in the language of one's choice.
• Hive clients are categorized into three types:
1. Thrift Clients
• The Hive server is based on Apache Thrift so that it can
serve the request from a thrift client.
2. JDBC client
• Hive allows for the Java applications to connect to it using
the JDBC driver. JDBC driver uses Thrift to communicate
with the Hive Server.
3. ODBC client
• Hive ODBC driver allows applications based on the ODBC
protocol to connect to Hive. Similar to the JDBC driver, the
ODBC driver uses Thrift to communicate with the Hive
Server.
52. Hive Service
cli
• The command line interface to Hive (the shell). This is the default service.
• To perform queries, Hive provides various services like HiveServer2, Beeline, etc. The various services offered by Hive are:
Hive server
• HiveServer2 is the successor of HiveServer1. HiveServer2 enables clients to execute queries against Hive. It allows multiple clients to submit requests to Hive and retrieve the final results. It is basically designed to provide the best support for open API clients like JDBC and ODBC.
hwi
• The Hive Web Interface.
jar
• The Hive equivalent of hadoop jar, a convenient way to run Java applications that includes both Hadoop and Hive classes on the classpath.
53. Meta Store
• Metastore is a central repository that stores the metadata
information about the structure of tables and partitions, including
column and column type information.
• It also stores information about the serializer and deserializer (SerDe), required for read/write operations, and the HDFS files where the data is stored. This metastore is generally a relational database.
• Metastore provides a Thrift interface for querying and manipulating
Hive metadata.
We can configure metastore in any of the two modes:
• Remote: In remote mode, metastore is a Thrift service and is useful
for non-Java applications.
• Embedded: In embedded mode, the client can directly interact with
the metastore using JDBC.
54. Embedded
• The metastore is the central repository of Hive metadata.
The metastore is divided into two pieces: a service and the
backing store for the data. By default, the metastore service
runs in the same JVM as the Hive service and contains an
embedded database instance backed by the local disk. This
is called the embedded metastore configuration
• Using an embedded metastore is a simple way to get
started with Hive; however, only one embedded Derby
database can access the database files on disk at any one
time, which means you can only have one Hive session
open at a time that shares the same metastore. Trying to
start a second session gives the error:
• Failed to start database 'metastore_db'
• when it attempts to open a connection to the metastore
55. • The solution to supporting multiple sessions (and
therefore multiple users) is to use a standalone
database. This configuration is referred to as a
local metastore, since the metastore service still
runs in the same process as the Hive service, but
connects to a database running in a separate
process, either on the same machine or on a
remote machine.
• There’s another metastore configuration called a
remote metastore, where one or more metastore
servers run in separate processes to the Hive
service. This brings better manageability and
security, since the database tier can be
completely firewalled off, and the clients no
longer need the database credentials.
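As an illustrative sketch, the metastore mode is selected with a handful of hive-site.xml properties (shown here as name = value pairs rather than XML; the JDBC URL and host are hypothetical): the connection URL points the metastore at its backing database, and hive.metastore.uris tells clients where a remote metastore service listens.
javax.jdo.option.ConnectionURL = jdbc:mysql://dbhost/metastore_db
javax.jdo.option.ConnectionDriverName = com.mysql.jdbc.Driver
hive.metastore.uris = thrift://metastore-host:9083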
59. HiveQL
• The Hive Query Language (HiveQL) is a query
language for Hive to process and analyze
structured data in a Metastore.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference [WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Creating Data Base:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
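For instance (the database, table and column names are hypothetical), a query following this template might be:
CREATE DATABASE IF NOT EXISTS salesdb;
USE salesdb;
SELECT store, SUM(amount) AS total
FROM sales
WHERE year = 2020
GROUP BY store
HAVING SUM(amount) > 10000
LIMIT 10;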
64. Hive DDL commands
• Hive DDL commands are the statements used for defining
and changing the structure of a table or database in Hive. It
is used to build or modify the tables and other objects in
the database.
• The several types of Hive DDL commands are:
• CREATE
• SHOW
• DESCRIBE
• USE
• DROP
• ALTER
• TRUNCATE
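A short sketch of these DDL commands in use (the table and columns are hypothetical):
CREATE TABLE employee (id INT, name STRING, salary DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
SHOW TABLES;
DESCRIBE employee;
ALTER TABLE employee ADD COLUMNS (dept STRING);
TRUNCATE TABLE employee;
DROP TABLE employee;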
65. Hive DML Commands
• Hive DML (Data Manipulation Language) commands are
used to insert, update, retrieve, and delete data from the
Hive table once the table and database schema has been
defined using Hive DDL commands.
• The various Hive DML commands are:
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
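A short sketch of common DML commands (the paths and values are hypothetical):
LOAD DATA INPATH '/user/hive/employee.csv' INTO TABLE employee;
INSERT INTO TABLE employee VALUES (101, 'Asha', 55000.0);
SELECT * FROM employee WHERE salary > 50000;
EXPORT TABLE employee TO '/user/hive/backup/employee';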
66. Joins
• Inner join in Hive
• Left Outer Join in Hive
• Right Outer Join in Hive
• Full Outer Join in Hive
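Corresponding HiveQL sketches (the customers and orders tables are hypothetical):
SELECT c.name, o.amount FROM customers c JOIN orders o ON (c.id = o.cust_id);            -- inner join
SELECT c.name, o.amount FROM customers c LEFT OUTER JOIN orders o ON (c.id = o.cust_id);
SELECT c.name, o.amount FROM customers c RIGHT OUTER JOIN orders o ON (c.id = o.cust_id);
SELECT c.name, o.amount FROM customers c FULL OUTER JOIN orders o ON (c.id = o.cust_id);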
67. Partition
• Apache Hive organizes tables into partitions. Partitioning is
a way of dividing a table into related parts based on the
values of particular columns like date, city, and department.
• Each table in the hive can have one or more partition keys
to identify a particular partition. Using partition it is easy to
do queries on slices of the data.
• Here are two types of Partitioning in Apache Hive-
Static Partitioning
Dynamic Partitioning
• For decomposing table data sets into more manageable parts, Apache Hive uses the bucketing concept. However, there is much more to learn about bucketing in Hive.
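A partitioning sketch (the table, columns and paths are hypothetical):
CREATE TABLE sales (id INT, amount DOUBLE)
  PARTITIONED BY (sale_date STRING, city STRING);
-- Static partitioning: partition values are given explicitly.
LOAD DATA INPATH '/data/sales_blr.csv'
  INTO TABLE sales PARTITION (sale_date = '2020-01-01', city = 'Bangalore');
-- Dynamic partitioning: partition values are taken from the last columns of the query result.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE sales PARTITION (sale_date, city)
  SELECT id, amount, sale_date, city FROM staging_sales;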
68. HBase
HBasics
• HBase is a distributed column-oriented database built on top of HDFS.
• HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
Why HBase:
• RDBMSs get exponentially slower as the data becomes large.
• They expect data to be highly structured, i.e. to fit in a well-defined schema.
• Any change in schema might require downtime.
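A small HBase shell sketch of random reads and writes (the table, column family and values are hypothetical):
hbase> create 'users', 'info'
hbase> put 'users', 'row1', 'info:name', 'Asha'
hbase> get 'users', 'row1'
hbase> scan 'users'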
70. HBase concepts
There are three types of servers in a master-slave HBase architecture:
HBase HMaster
Region Server
ZooKeeper
• Region servers serve data for read and write purposes, which means clients can communicate directly with HBase Region Servers while accessing data.
• The HBase Master process handles region assignment as well as DDL (create, delete tables) operations.
• Finally, ZooKeeper maintains the live cluster state.
71. HMasterServer
• The master server -Assigns regions to the region
servers and takes the help of Apache ZooKeeper
for this task.
• Handles load balancing of the regions across
region servers. It unloads the busy servers and
shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating
the load balancing.
• Is responsible for schema changes and other
metadata operations such as creation of tables
and column families.
72. Regions
• Regions are nothing but tables that are split up and
spread across the region servers.
Region server
• The region servers have regions that -Communicate
with the client and handle data-related operations.
• Handle read and write requests for all the regions
under it.
• Decide the size of the region by following the region
size thresholds.
73. Zookeeper
• ZooKeeper is an open-source project that provides services like maintaining configuration information, naming, providing distributed synchronization, etc.
• ZooKeeper has nodes representing different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients locate region servers via ZooKeeper and then communicate with the region servers directly.
• In pseudo-distributed and standalone modes, HBase itself will take care of ZooKeeper.
74. Regions
• Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset of a table's rows.
• A region is denoted by the table it belongs to, its first row (inclusive), and its last row (exclusive).
• Initially, a table comprises a single region, but as the size increases, after it crosses a configurable size threshold, it splits at a row boundary into two new regions of approximately equal size.
• Until this first split happens, all loading will be against the single server hosting the original region.
75. • As the table grows, the number of its regions grows. Regions are the units that get distributed over an HBase cluster.
• In this way, a table that is too big for any one server can be carried by a cluster of servers, with each node hosting a subset of the table's total regions.
• Load on a table gets distributed this way.
• The online set of sorted regions comprises the table's total content.
77. • To maintain server state in the HBase cluster, HBase uses ZooKeeper as a distributed coordination service.
• Basically, which servers are alive and available is maintained by ZooKeeper, and it also provides server failure notification.
• Moreover, ZooKeeper is used to maintain a guaranteed common shared state.
78. Clients
There are a number of client options for interacting with an HBase cluster.
• Java
• HBase, like Hadoop, is written in Java.
• MapReduce: HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs.
• The TableInputFormat class makes splits on region boundaries.
• The HBase TableOutputFormat will write the result of the reduce phase into HBase.
79. • HBase also offers Avro, REST, and Thrift interfaces. These are useful when the interacting application is written in a language other than Java.
• In all cases, a Java server hosts an instance of the HBase client, brokering Avro, REST, and Thrift application requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
80. HBase Vs RDBMS
Database Type
HBase
• HBase is a column-oriented database; each column is a contiguous unit of page.
RDBMS
• Whereas an RDBMS is row-oriented, which means each row is a contiguous unit of page.
Schema-type
HBase
• The schema of HBase is less restrictive; adding columns on the fly is possible.
RDBMS
• The schema of an RDBMS is more restrictive.
81. Sparse Tables
HBase
• HBase is good with sparse tables.
RDBMS
• Whereas an RDBMS is not optimized for sparse tables.
Scale up / Scale out
HBase
• HBase supports scale-out: when we need more memory, processing power or disk, we add new servers to the cluster rather than upgrading the present ones.
RDBMS
• An RDBMS supports scale-up: when we need more memory, processing power or disk, we upgrade the same server to a more powerful one rather than adding new servers.
82. Amount of data
HBase
• In HBase, the amount of data that can be handled does not depend on a particular machine but on the number of machines.
RDBMS
• In an RDBMS, the amount of data depends on the configuration of the server.
Support
HBase
• HBase has no built-in ACID support.
RDBMS
• An RDBMS has ACID support.
83. Data type
HBase
• HBase supports both structured and unstructured data.
RDBMS
• An RDBMS is suited for structured data.
Transaction integrity
HBase
• In HBase, there is no transaction guarantee.
RDBMS
• Whereas an RDBMS mostly guarantees transaction integrity.
JOINs
HBase
• HBase does not support JOINs; joins have to be implemented in the application or in MapReduce.
RDBMS
• An RDBMS supports JOINs.
84. Referential integrity
HBase
• When it comes to referential integrity, HBase has no built-in support.
RDBMS
• An RDBMS supports referential integrity.
85. Bigsql
• IBM Big SQL is a high performance massively parallel processing (MPP) SQL engine for Hadoop that makes querying enterprise data across the organization an easy and secure experience.
86. • A Big SQL query can quickly access a variety of
data sources including HDFS, RDBMS, NoSQL
databases, object stores, and WebHDFS by
using a single database connection or single
query for best-in-class analytic capabilities.
• With Big SQL, your organization can derive
significant value from your enterprise data.
87. • Big SQL provides tools to help you manage
your system and your databases, and you can
use popular analytic tools to visualize your
data.
• Big SQL includes several tools and interfaces
that are largely comparable to tools and
interfaces that exist with most relational
database management systems.
88. • Big SQL's robust engine executes complex
queries for relational data and Hadoop data.
• Big SQL provides an advanced SQL compiler
and a cost-based optimizer for efficient query
execution.
• Combining these with a massive parallel
processing (MPP) engine helps distribute
query execution across nodes in a cluster.
89. • The Big SQL architecture uses the latest
relational database technology from IBM.
• The database infrastructure provides a logical
view of the data (by allowing storage and
management of metadata) and a view of the
query compilation, plus the optimization and
runtime environment for optimal SQL
processing.
91. • Applications connect to a specific node based on specific user configurations.
• SQL statements are routed through this node to the Big SQL management node, the coordinating node.
• There can be one or many management nodes, but there is only one Big SQL management node. SQL statements are compiled and optimized to generate a parallel execution query plan.
92. • Then, a runtime engine distributes the tasks (queries) to worker nodes on the compute nodes and manages the consumption and return of the result set.
• A compute node can be a physical server or an operating system.
• The worker nodes can contain the temporary tables, the runtime execution, the readers and writers, and the data nodes.
• The DataNode holds the data.
93. • When a worker node receives a query, it dispatches special processes that know how to read and write HDFS data natively.
• Big SQL uses native and Java open source readers (and writers) that are able to ingest different file formats.
• The Big SQL engine pushes predicates down to these processes so that they can, in turn, apply projection and selection closer to the data. These processes also transform input data into an appropriate format for consumption inside Big SQL.
94. • All of these components can run on one management node, or each part can run on a separate management node.
• We can separate the Big SQL management node from the other Hadoop master nodes.
• This arrangement allows the Big SQL management node to have enough resources to store intermediate data from the Big SQL data nodes.