Big data refers to large datasets that are difficult to process using traditional database management tools. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. It provides reliable data storage with the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using MapReduce. The Hadoop ecosystem includes components like HDFS, MapReduce, Hive, Pig, and HBase that provide distributed data storage, processing, querying and analysis capabilities at scale.
2. What is Big Data?
● How big is “Big Data”?
● Is 30-40 terabytes big data?
● …
● Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools
● Today: terabytes, petabytes, exabytes
● Tomorrow?
3. Enterprises & Big Data
● Most companies are currently using traditional tools to store data
● Big data: the next frontier for innovation, competition, and productivity
● The use of big data will become a key basis of competition
● Organisations across the globe need to take the rising importance of big data more seriously
4. Hadoop is an ecosystem, not a single product.
When you deal with big data, the data center is your computer.
5. • A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem
6. A Brief History of Hadoop
• Hadoop has its origins in Apache Nutch
• Nutch was started in 2002
• Challenge: the billions of pages on the Web
• 2003: Google published the GFS (Google File System) paper
• 2004: NDFS (Nutch Distributed File System)
• 2004: Google published the MapReduce paper
• 2005: Nutch developers began implementing MapReduce
7. • A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem
8. Contributors and Development
Lifetime patches contributed across all Hadoop-related projects: community members by current employer
* Source: JIRA tickets
11. Development in ASF/Hadoop
● Resources
● Mailing lists
● Wiki pages, blogs
● Issue tracking: JIRA
● Version control: SVN, Git
12. • A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem
13. What is Hadoop
• Open-source project administered by the ASF
• Data-intensive storage and massively parallel processing (MPP)
• Enables applications to work with thousands of nodes and petabytes of data
• Suitable for applications with large data sets
14. What is Hadoop?
• Scalable
• Fault tolerant
• Reliable data storage using the Hadoop Distributed File System (HDFS)
• High-performance parallel data processing using a technique called MapReduce
15. What is Hadoop?
• Hadoop is becoming the de facto standard for large-scale data processing
• Becoming more than just MapReduce
• Ecosystem growing rapidly, with lots of great tools around it
16. What is Hadoop?
[Image: Yahoo Hadoop cluster — 38,000 machines distributed across 20 different clusters (source: Yahoo, 2010); 50,000 machines as of January 2012]
[Image: SGI Hadoop cluster]
Source: http://www.computerworlduk.com/in-depth/applications/3329092/hadoop-could-save-you-money-over-a-traditional-rdbms/
17. • A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem
21. Why Hadoop?
• Hadoop has its origins in Apache Nutch
• Can process big data (petabytes and more)
• Unlimited data storage and analysis
• No licence cost: Apache License 2.0
• Can be built out of commodity hardware
• IT cost reduction
• Results
• Be one step ahead of the competition
• Stay there
22. Is Hadoop an alternative to RDBMSs?
• At the moment Apache Hadoop is not a substitute for a database
• No relations
• Key-value pairs
• Big data
• Unstructured (text)
• Semi-structured (sequence / binary files)
• Structured (HBase, modeled after Google BigTable)
• Works fine together with RDBMSs
23. • A Brief History of Hadoop
• Contributors and Development
• What is Hadoop
• Why Hadoop
• Hadoop Ecosystem
25. Hadoop Ecosystem
Important components of Hadoop
• HDFS: a distributed, fault-tolerant file system
• MapReduce: a parallel data processing framework
• Hive: a query framework (like SQL)
• Pig: a query scripting tool
• HBase: real-time read/write access to your big data
28. HDFS
NameNode/DataNode interaction in HDFS. The NameNode keeps track of the file metadata—which files are in the system and how each file is broken down into blocks. The DataNodes provide the backing store for the blocks and constantly report to the NameNode to keep the metadata current.
30. Writing Files to HDFS
• Client consults the NameNode
• Client writes the block directly to one DataNode
• DataNode replicates the block
• Cycle repeats for the next block
31. Reading Files from HDFS
• Client consults the NameNode
• Client receives a DataNode list for each block
• Client picks the first DataNode for each block
• Client reads blocks sequentially
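The write and read workflows above are exactly what the HDFS Java client performs under the hood. A minimal sketch (the path /tmp/hello.txt and the file contents are made up for illustration) of writing and reading a file through the FileSystem API:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/tmp/hello.txt");   // illustrative path

            // Write: the client asks the NameNode where to put each block,
            // then streams it to the first DataNode, which replicates it onward.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read: the client gets the block locations from the NameNode,
            // then reads the blocks from DataNodes sequentially.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file)))) {
                System.out.println(in.readLine());
            }
        }
    }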
32. Rack Awareness & Fault Tolerance
[Diagram: the NameNode's rack-aware metadata for File.txt — Blk A on DN1, DN5, DN6; Blk B on DN1, DN2, DN9; Blk C on DN5, DN9, DN10 — with the DataNodes spread across Rack 1, Rack 5, …, Rack N]
• Never lose all data if an entire rack fails
• In-rack traffic has higher bandwidth and lower latency
34. Hadoop Ecosystem
Important components of Hadoop
• HDFS: a distributed, fault-tolerant file system
• MapReduce: a parallel data processing framework
• Hive: a query framework (like SQL)
• Pig: a query scripting tool
• HBase: a column-oriented database for OLTP
35. MapReduce Paradigm
• Simplified data processing on large clusters
• Splitting a big problem/dataset into little pieces
• Key-value pairs
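To make the paradigm concrete, here is the canonical word-count job written against the Hadoop Java MapReduce API — a sketch, with input and output paths taken from the command line. The mapper emits a (word, 1) pair per token, and the reducer sums the counts for each word:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE); // emit (word, 1)
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // emit (word, total)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // local pre-aggregation on map side
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }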
41. MapReduce: JobTracker & TaskTracker
[Diagram: the JobTracker alongside the NameNode; TaskTrackers running on the DataNodes]
JobTracker and TaskTracker interaction: after a client calls the JobTracker to begin a data processing job, the JobTracker partitions the work and assigns different map and reduce tasks to each TaskTracker in the cluster.
43. Hadoop Ecosystem
Important components of Hadoop
• HDFS: a distributed, fault-tolerant file system
• MapReduce: a parallel data processing framework
• Hive: a query framework (like SQL)
• Pig: a query scripting tool
• HBase: a column-oriented database for OLTP
45. Hive
• Data warehousing package built on top of Hadoop
• It began its life at Facebook, processing large amounts of user and log data
• Hadoop subproject with many contributors
• Ad hoc queries, summarization, and data analysis on Hadoop-scale data
• Directly queries data in different data formats (text/binary) and file formats (flat/sequence)
• HiveQL: like SQL
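For a taste of HiveQL from Java, here is a minimal sketch using the Hive JDBC driver. The HiveServer2 endpoint at localhost:10000, the credentials, and the logs table are all assumptions for illustration; behind the scenes Hive compiles each query into MapReduce jobs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlSketch {
        public static void main(String[] args) throws Exception {
            // Endpoint, user, and table name are placeholders for this sketch.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // DDL: HiveQL looks like SQL but describes files laid out on HDFS.
                stmt.execute("CREATE TABLE IF NOT EXISTS logs "
                        + "(ts STRING, level STRING, msg STRING) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

                // Ad hoc query: compiled into one or more MapReduce jobs.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                    }
                }
            }
        }
    }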
47. Hadoop Ecosystem
Important components of Hadoop
• HDFS: a distributed, fault-tolerant file system
• MapReduce: a parallel data processing framework
• Hive: a query framework (like SQL)
• Pig: a query scripting tool
• HBase: a column-oriented database for OLTP
48. Pig
• The language used to express data flows is called Pig Latin
• Pig Latin can be extended using UDFs (User Defined Functions)
• Pig was originally developed at Yahoo! Research
• PigPen is an Eclipse plug-in that provides an environment for developing Pig programs
• Running Pig programs:
• Script: a file that contains Pig commands
• Grunt: an interactive shell
• Embedded: from Java
49. Pig
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
grunt> filtered_records = FILTER records BY temperature != 22;
grunt> DUMP filtered_records;
grunt> grouped_records = GROUP records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
50. Hadoop Ecosystem
Important components of Hadoop
• HDFS: a distributed, fault-tolerant file system
• MapReduce: a parallel data processing framework
• Hive: a query framework (like SQL)
• Pig: a query scripting tool
• HBase: a column-oriented database for OLTP
51. HBase
• Random, real-time read/write access to your big data
• Billions of rows × millions of columns
• Column-oriented store modeled after Google's BigTable
• Provides BigTable-like capabilities on top of Hadoop and HDFS
• HBase is not a column-oriented database in the typical RDBMS sense, but utilizes an on-disk column storage format
52. HBase Data Model
• (Table, RowKey, Family, Column, Timestamp) → Value
• Think of tags: values of any length, no predefined names or widths
• Column names carry information (just like tags)
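That sorted-map model maps directly onto the HBase Java client API. A short sketch — the table users, row key row-42, and column family info are invented for illustration — where a Put writes one (rowkey, family, column) → value cell and a Get reads it back:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: (row-42, info, name, now) -> "Ada"
                Put put = new Put(Bytes.toBytes("row-42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("Ada"));
                table.put(put);

                // Read it back by row key
                Result result = table.get(new Get(Bytes.toBytes("row-42")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name)); // Ada
            }
        }
    }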
57. Splits & RegionServers
• Rows are grouped into regions and served by different servers
• A table is dynamically split into “regions”
• Each region contains values in [startKey, endKey)
• Regions are hosted on a RegionServer
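Because a table is partitioned into half-open [startKey, endKey) regions, a range scan touches only the RegionServers hosting that key range. A sketch continuing the hypothetical users table from the earlier Put/Get example (withStartRow/withStopRow are the HBase 2.x client calls for bounding the range):

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseScanSketch {
        // Assumes an open Connection, as in the earlier Put/Get sketch.
        public static void printRange(Connection conn) throws Exception {
            try (Table table = conn.getTable(TableName.valueOf("users"));
                 ResultScanner scanner = table.getScanner(
                     new Scan().withStartRow(Bytes.toBytes("row-100"))    // inclusive
                               .withStopRow(Bytes.toBytes("row-200")))) { // exclusive
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }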