Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes the Hadoop Distributed File System (HDFS) for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform (HDP) is an enterprise-grade, fully open-source distribution of Apache Hadoop.
This document provides an overview and introduction to Hadoop, HDFS, and MapReduce. It covers the basic concepts of HDFS, including how files are stored in blocks across data nodes and the roles of the name node and data nodes. It also explains the MapReduce programming model, including the mapper, the reducer, and how jobs are split into parallel tasks; discusses using Hadoop from the command line and writing MapReduce jobs in Java; and mentions other projects in the Hadoop ecosystem such as Pig, Hive, HBase and ZooKeeper.
2. Apache Hadoop
Apache Hadoop
• is a popular open-source framework for storing and processing large data sets across clusters of computers.
• The Sandbox is an ideal way to get started with Enterprise Hadoop: a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
• Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials.
• It includes many of the most exciting developments from the latest HDP distribution, packaged in a virtual environment that you can get up and running in 15 minutes!
• HDP 2.2 on Sandbox system requirements:
– Runs on 32-bit and 64-bit host OSes (Windows XP, Windows 7, Windows 8 and Mac OS X)
– Minimum 4 GB RAM; 8 GB required to run Ambari and HBase
– Virtualization enabled in the BIOS
– Browser: Chrome 25+, IE 9+ or Safari 6+ recommended (Sandbox will not run on IE 10)
3. Hadoop… Getting Started
Terminologies
• Hadoop
• YARN – the Hadoop operating system
– Enables a user to interact with all data in multiple ways simultaneously, making Hadoop a true multi-use data platform and allowing it to take its place in a modern data architecture.
– A framework for job scheduling and cluster resource management. This means that many different processing engines can operate simultaneously across a Hadoop cluster, on the same data, at the same time.
• The Hadoop Distributed File System (HDFS)
– A distributed file system that provides high-throughput access to application data.
• MapReduce
– A YARN-based system for parallel processing of large data sets.
• Sqoop
• The Hive ODBC Driver

Hortonworks Data Platform (HDP)
• A 100% open-source distribution of Apache Hadoop that is truly enterprise grade, having been built, tested and hardened with enterprise rigor.
4. Introducing Apache Hadoop to Developers
• Apache Hadoop is a community-driven open-source project governed by the Apache Software Foundation.
• It was originally implemented at Yahoo, based on papers published by Google in 2003 and 2004.
• Since then Apache Hadoop has matured and developed into a data platform not just for processing humongous amounts of data in batch: with the advent of YARN it now supports many diverse workloads, such as interactive queries over large data with Hive on Tez, real-time data processing with Apache Storm, the super-scalable NoSQL datastore HBase, in-memory processing with Spark, and the list goes on.
6. Core of Hadoop
• A set of machines running HDFS and MapReduce is known as a Hadoop cluster. Individual machines are known as nodes. A cluster can have as few as one node or as many as several thousand. For most application scenarios Hadoop is linearly scalable, which means you can expect better performance by simply adding more nodes.
• The Hadoop Distributed File System (HDFS)
• MapReduce
7. MapReduce
• a method for distributing a task across multiple nodes. Each node processes the data stored on that node, to the extent possible.
• A running MapReduce job consists of several phases: Map -> Sort -> Shuffle -> Reduce.
• Advantages:
– Automatic parallelization and distribution of data in blocks across a distributed, scale-out infrastructure.
– Fault tolerance against failure of storage, compute and network infrastructure.
– Deployment, monitoring and security capability.
– A clean abstraction for programmers.
• Most MapReduce programs are written in Java; they can also be written in any scripting language using Hadoop's Streaming API.
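To make the mapper and reducer roles concrete, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce Java API. It is an illustration, not code from the slides; the class and field names are invented for the example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: runs on each node against its local slice of the input,
// emitting (word, 1) for every token it sees.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // feeds the Sort -> Shuffle phases
            }
        }
    }
}

// Reducer: after the shuffle, receives all counts for one word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}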
8. The MapReduce Concepts and Terminology
• MapReduce jobs are controlled by a software daemon known as the JobTracker. The JobTracker resides on a 'master node'. Clients submit MapReduce jobs to the JobTracker, which assigns the Map and Reduce tasks to other nodes on the cluster.
• These nodes each run a software daemon known as the TaskTracker. The TaskTracker is responsible for actually instantiating the Map or Reduce task and reporting progress back to the JobTracker.
• A job is the complete execution of all Mappers and Reducers over a dataset; a task is the execution of a single Mapper or Reducer over a slice of data.
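From the client's side, submitting a job amounts to running a small driver program. Below is a minimal sketch to accompany the word-count classes above; the input and output paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submits the job to the cluster (the JobTracker in classic MapReduce,
        // YARN in Hadoop 2) and blocks until all Map and Reduce tasks finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}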
9. Hadoop Distributed File System
• the foundation of the Hadoop cluster.
• Manages how the datasets are stored in the Hadoop cluster.
• Responsible for distributing the data across the data nodes, managing replication for redundancy, and administrative tasks like adding, removing and recovering data nodes.
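Applications usually reach HDFS through the FileSystem Java API. A minimal sketch (the file paths are illustrative): it copies a local file into the cluster, where HDFS splits it into blocks and replicates them across data nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and replication settings from
        // core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; the name node records the block
        // locations and the data nodes store the replicated blocks.
        fs.copyFromLocalFile(new Path("/tmp/events.log"),
                             new Path("/data/raw/events.log"));

        System.out.println("default replication: " + fs.getDefaultReplication());
        fs.close();
    }
}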
10. Apache Hive
• provides a data warehouse view of the data in HDFS.
• Using a SQL-like language, Hive lets you create summarizations of your data, perform ad-hoc queries, and run analysis of large datasets in the Hadoop cluster.
• The overall approach with Hive is to project a table structure onto the dataset and then manipulate it with HiveQL.
• Since you are using data in HDFS, your operations can be scaled across all the data nodes and you can manipulate huge datasets.
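A short HiveQL sketch of that approach: project an external table onto files already sitting in HDFS, then summarize them with a query that Hive compiles into MapReduce work. Table, column and path names are made up for the example.

-- Project a table structure onto existing HDFS files (no data is moved).
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  user_id   STRING,
  url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/page_views';

-- An ad-hoc summarization that scales across all the data nodes.
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;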
11. Apache HCatalog
• Used to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from the data location and from metadata like the schema.
• Since it supports many tools, like Hive and Pig, the location and metadata can be shared between tools. Using the open APIs of HCatalog, other tools like Teradata Aster can also use the location and metadata in HCatalog.
• How can we reference data by name and inherit the location and metadata? See the sketch below.
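One answer, sketched in Pig: load the table by name through HCatLoader, so the script inherits the HDFS location and the schema from the shared metastore instead of hard-coding them. The table name is illustrative, and the loader class path shown is the one shipped with HDP-era Hive releases.

-- Location and schema come from HCatalog, not from the script.
page_views = LOAD 'page_views' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Columns can be referenced by the names defined in the metastore.
recent = FILTER page_views BY view_time >= '2014-01-01';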
12. Apache Pig
• a language for expressing data analysis and infrastructure processes.
• Pig is translated into a series of MapReduce jobs that are run by the Hadoop cluster.
• Pig is extensible through user-defined functions, which can be written in Java and other languages.
• Pig scripts provide a high-level language for creating the MapReduce jobs needed to process data in a Hadoop cluster.
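A minimal Pig Latin sketch of such a script (the file path and field names are invented for the example); Pig compiles the pipeline of relational steps below into MapReduce jobs.

-- Load tab-separated records with a declared schema.
logs = LOAD '/data/raw/page_views'
       AS (view_time:chararray, user_id:chararray, url:chararray);

-- Group and count page views per URL.
by_url = GROUP logs BY url;
hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;

-- Keep the ten most-visited URLs and write them back to HDFS.
sorted = ORDER hits BY n DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO '/data/out/top_urls';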