Hive Evolution
A Progress Report
November 2010
John Sichi (Facebook)
Agenda
• Hive Overview
• Version 0.6 (just released!)
• Version 0.7 (under development)
• Hive is now a TLP!
• Roadmaps
What is Hive?
• A Hadoop-based system for querying
and managing structured data
– Uses Map/Reduce for execution
– Uses Hadoop Distributed File System
(HDFS) for storage
Hive Origins
• Data explosion at Facebook
• Traditional DBMS technology could
not keep up with the growth
• Hadoop to the rescue!
• Incubation with ASF, then became a
Hadoop sub-project
• Now a top-level ASF project

Hive Evolution
• Originally:
– a way for Hadoop users to express
queries in a high-level language without
having to write map/reduce programs
• Now more and more:
– A parallel SQL DBMS which happens to
use Hadoop for its storage and
execution architecture
Intended Usage
• Web-scale Big Data
– 100’s of terabytes
• Large Hadoop cluster
– 100’s of nodes (heterogeneous OK)
• Data has a schema
• Batch jobs
– for both loads and queries
So Don’t Use Hive If…
• Your data is measured in GB
• You don’t want to impose a schema
• You need responses in seconds
• A “conventional” analytic DBMS can
already do the job
– (and you can afford it)
• You don’t have a lot of time and smart
people
Scaling Up
• Facebook warehouse, July 2010:
– 2250 nodes
– 36 petabytes disk space
• Data access per day:
– 80 to 90 terabytes added
(uncompressed)
– 25000 map/reduce jobs
• 300-400 users/month

Facebook Deployment
(diagram) Web servers feed logs through a Scribe mid-tier into Scribe-Hadoop clusters; that data, along with dumps from sharded MySQL, lands in the production Hive-Hadoop cluster, and Hive replication copies it to an ad hoc Hive-Hadoop cluster.
Hive Architecture
(diagram) Clients (CLI, Hive Thrift API, JDBC/ODBC, web management console) talk to the query engine; the query engine consults the metastore (which also exposes its own Thrift API) and runs jobs on Hadoop Map/Reduce + HDFS clusters.
Physical Data Model
(diagram) A table (e.g. clicks) is divided into partitions, possibly multi-level (ds='2010-10-28', ds='2010-10-29', ds='2010-10-30'); each partition is stored as HDFS files, possibly organized as hash buckets.
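As a sketch of such a layout (column names are illustrative; the slide names only the table and its date partitions):
  CREATE TABLE clicks (userid BIGINT, url STRING)
  PARTITIONED BY (ds STRING)
  CLUSTERED BY (userid) INTO 32 BUCKETS
  STORED AS SEQUENCEFILE;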
Map/Reduce Plans
(diagram) Input files are divided into splits read by map tasks; map output flows to reduce tasks, which write the result files.

Query Translation Example
• SELECT url, count(*) FROM
page_views GROUP BY url
• Map tasks compute partial counts for
each URL in a hash table
– “map side” preaggregation
– map outputs are partitioned by URL and
shipped to corresponding reducers
• Reduce tasks tally up partial counts to
produce final results
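To inspect the plan Hive generates for a query like this, prefix it with EXPLAIN:
  EXPLAIN
  SELECT url, count(*) FROM page_views GROUP BY url;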
It Gets Quite Complicated!
Behavior Extensibility
• TRANSFORM scripts (any language)
– Serialization+IPC overhead
• User defined functions (Java)
– In-process, lazy object evaluation
• Pre/Post Hooks (Java)
– Statement validation/execution
– Example uses: auditing, replication,
authorization
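A minimal TRANSFORM sketch (the script my_parser.py and the raw_logs table are hypothetical):
  ADD FILE my_parser.py;
  SELECT TRANSFORM (line)
    USING 'python my_parser.py'
    AS url, ts
  FROM raw_logs;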
UDF vs UDAF vs UDTF
• User Defined Function
• One-to-one row mapping
• Concat(‘foo’, ‘bar’)
• User Defined Aggregate Function
• Many-to-one row mapping
• Sum(num_ads)
• User Defined Table Function
• One-to-many row mapping
• Explode([1,2,3])
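Illustrative calls of each flavor (the users, ad_impressions, and pages tables are assumptions):
  -- UDF: one-to-one
  SELECT concat(first_name, ' ', last_name) FROM users;
  -- UDAF: many-to-one
  SELECT dt, sum(num_ads) FROM ad_impressions GROUP BY dt;
  -- UDTF: one-to-many
  SELECT explode(link_array) AS link FROM pages;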

Storage Extensibility
• Input/OutputFormat: file formats
– SequenceFile, RCFile, TextFile, …
• SerDe: row formats
– Thrift, JSON, ProtocolBuffer, …
• Storage Handlers (new in 0.6)
– Integrate foreign metadata, e.g. HBase
• Indexing
– Under development in 0.7
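The file format is chosen per table; for example, a native table stored as RCFile (table and columns illustrative):
  CREATE TABLE clicks_rc (userid BIGINT, url STRING)
  STORED AS RCFILE;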
Release 0.6
• October 2010
– Views
– Multiple Databases
– Dynamic Partitioning
– Automatic Merge
– New Join Strategies
– Storage Handlers
Views: Syntax
CREATE VIEW [IF NOT EXISTS]
view_name
[ (column_name [COMMENT
column_comment], … ) ]
[COMMENT ‘view_comment’]
AS SELECT …
[ ORDER BY … LIMIT … ]
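A concrete instance of the syntax above (view and base table names are illustrative):
  CREATE VIEW IF NOT EXISTS daily_clicks (ds, cnt)
  COMMENT 'clicks per day'
  AS SELECT ds, count(*) FROM clicks GROUP BY ds;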
Views: Usage
• Use Cases
– Column/table renaming
– Encapsulate complex query logic
– Security (Future)
• Limitations
– Read-only
– Obscures partition metadata from
underlying tables
– No dependency management

Multiple Databases
• Follows MySQL convention
– CREATE DATABASE [IF NOT EXISTS]
db_name [COMMENT ‘db_comment’]
– USE db_name
• Logical namespace for tables
• ‘default’ database is still there
• Does not yet support queries across
multiple databases
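For example (database name illustrative):
  CREATE DATABASE IF NOT EXISTS marts COMMENT 'summary tables';
  USE marts;
  SHOW TABLES;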
Dynamic Partitions: Syntax
• Example
INSERT OVERWRITE TABLE page_view
PARTITION(dt='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid,
pvs.page_url, pvs.referrer_url, null, null,
pvs.ip, pvs.country
FROM page_view_stg pvs
Dynamic Partitions: Usage
• Automatically create partitions based
on distinct values in columns
• Works as rudimentary indexing
– Prune partitions via WHERE clause
• But be careful…
– Don’t create too many partitions!
– Configuration parameters can be used to
prevent accidents
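The relevant parameters are the standard dynamic-partition settings (values shown are only illustrative):
  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  set hive.exec.max.dynamic.partitions=1000;
  set hive.exec.max.dynamic.partitions.pernode=100;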
Automatic merge
• Jobs can produce many files
• Why is this bad?
– Namenode pressure
– Downstream jobs have to deal with file
processing overhead
• So, clean up by merging results into a
few large files (configurable)
– Use conditional map-only task to do this
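Merging is governed by configuration like the following (standard Hive settings; values illustrative):
  -- merge map-only and map/reduce job outputs into larger files
  set hive.merge.mapfiles=true;
  set hive.merge.mapredfiles=true;
  set hive.merge.size.per.task=256000000;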

Join Strategies Before 0.6
• Map/reduce join
– Map tasks partition inputs on join keys
and ship to corresponding reducers
– Reduce tasks perform sort-merge-join
• Map-join
– Each mapper builds lookup hashtable
from copy of small table
– Then hash-join the splits of big table
New Join Strategies
• Bucketed map-join
– Each mapper filters its lookup table by
the bucketing hash function
– Allows “small” table to be much bigger
• Sorted merge in map-join
– Requires presorted input tables
• Deal with skew in map/reduce join
– Conditional plan step for skew keys
(after main map/reduce join step)
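For example, a map-join can be requested with a query hint, and the bucketed variant enabled via configuration (table names are illustrative):
  set hive.optimize.bucketmapjoin=true;
  SELECT /*+ MAPJOIN(d) */ f.userid, d.name
  FROM fact_table f JOIN dim_table d ON (f.dim_id = d.id);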
Storage Handlers
(diagram) Hive accesses native tables on HDFS directly; through the storage handler interface, the HBase, Cassandra, and Hypertable handlers use the corresponding HBase, Cassandra, and Hypertable APIs to reach tables stored in those systems.
Low Latency Warehouse
(diagram) HBase receives continuous updates while other files/tables are refreshed by periodic loads; Hive queries run over both.

Storage Handler Syntax
• HBase Example
CREATE TABLE users(
  userid int, name string, email string, notes string)
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" =
  "small:name,small:email,large:notes")
TBLPROPERTIES (
  "hbase.table.name" = "user_list");
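Once declared, the HBase-backed table is queried like any other and can even be joined against native tables (page_views here is assumed to be a native Hive table):
  SELECT u.name, count(*)
  FROM users u JOIN page_views pv ON (u.userid = pv.userid)
  GROUP BY u.name;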
Release 0.7
• In development
– Concurrency Control
– Stats Collection
– Stats Functions
– Indexes
– Local Mode
– Faster map join
– Multiple DISTINCT aggregates
– Archiving
– JDBC/ODBC improvements
Concurrency Control
• Pluggable distributed lock manager
– Default is Zookeeper-based
• Simple read/write locking
• Table-level and partition-level
• Implicit locking (statement level)
– Deadlock-free via lock ordering
• Explicit LOCK TABLE (global)
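Explicit locking and lock inspection look like this (table name illustrative):
  LOCK TABLE page_views EXCLUSIVE;
  SHOW LOCKS page_views;
  UNLOCK TABLE page_views;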
Statistics Collection
• Implicit metastore update during load
– Or explicit via ANALYZE TABLE
• Table/partition-level
– Number of rows
– Number of files
– Size in bytes
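The explicit form mentioned above looks like this (partition spec illustrative):
  ANALYZE TABLE page_views PARTITION (ds='2010-10-30')
  COMPUTE STATISTICS;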

Stats-driven Optimization
• Automatic map-side join
• Automatic map-side aggregation
• Need column-level stats for better
estimates
– Filter/join selectivity
– Distinct value counts
– Column correlation
Statistical Functions
• Stats 101
– Stddev, var, covar
– Percentile_approx
• Data Mining
– Ngrams, sentences (text analysis)
– Histogram_numeric
• SELECT histogram_numeric(dob_year)
FROM users GROUP BY relationshipstatus
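A couple more illustrative calls (the request_log table and latency_ms column are assumptions):
  SELECT stddev_pop(latency_ms),
         percentile_approx(latency_ms, 0.95)
  FROM request_log;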
Histogram query results
• “It’s complicated” peaks at 18-19, but lasts into late 40s!
• “In a relationship” peaks at 20
• “Engaged” peaks at 25
• Married peaks in early 30s
• More married than single at 28
• Only teenagers use widowed?
Pluggable Indexing
• Reference implementation
– Index is stored in a normal Hive table
– Compact: distinct block addresses
– Partition-level rebuild
• Currently in R&D
– Automatic use for WHERE, GROUP BY
– New index types (e.g. bitmap, HBase)
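The reference implementation is created and rebuilt per partition with statements like these (index name illustrative):
  CREATE INDEX pv_url_idx ON TABLE page_views (url)
  AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
  WITH DEFERRED REBUILD;
  ALTER INDEX pv_url_idx ON page_views
  PARTITION (ds='2010-10-30') REBUILD;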

Local Mode Execution
• Avoids map/reduce cluster job latency
• Good for jobs which process small
amounts of data
• Let Hive decide when to use it
– set hive.exec.mode.local.auto=true;
• Or force its usage
– set mapred.job.tracker=local;
Faster map join
• Make sure small table can fit in
memory
– If it can’t, fall back to reduce join
• Optimize hash table data structures
• Use distributed cache to push out
pre-filtered lookup table
– Avoid swamping HDFS with reads from
thousands of mappers
Multiple DISTINCT Aggs
• Example
SELECT
view_date,
COUNT(DISTINCT userid),
COUNT(DISTINCT page_url)
FROM page_views
GROUP BY view_date
Archiving
• Use HAR (Hadoop archive format) to
combine many files into a few
• Relieves namenode memory
• Archived partition becomes read-only
• Syntax:
ALTER TABLE page_views
{ARCHIVE|UNARCHIVE}
PARTITION (ds=‘2010-10-30’)

JDBC/ODBC Improvements
• JDBC: Basic metadata calls
– Good enough for use with UI’s such as
SQuirreL
• JDBC: some PreparedStatement
support
– Pentaho Data Integration
• ODBC: new driver under
development (based on SQLite)
Hive is now a TLP
• PMC
– Namit Jain (chair)
– John Sichi
– Zheng Shao
– Edward Capriolo
– Raghotham Murthy
– Ning Zhang
– Paul Yang
– He Yongqiang
– Prasad Chakka
– Joydeep Sen Sarma
– Ashish Thusoo
• Welcome to new committer Carl Steinbach!
Developer Diversity
• Recent Contributors
– Facebook, Yahoo, Cloudera
– Netflix, Amazon, Media6Degrees, Intuit
– Numerous research projects
– Many many more…
• Monthly San Francisco bay area
contributor meetups
• East coast meetups? 
Roadmap: Security
• Authentication
– Upgrading to SASL-enabled Thrift
• Authorization
– HDFS-level
• Very limited (no ACL’s)
• Can’t support all Hive features (e.g. views)
– Hive-level (GRANT/REVOKE)
• Hive server deployment for full effectiveness

Roadmap: Hadoop API
• Dropping pre-0.20 support starting
with Hive 0.7
– But Hive is still using old mapred.*
• Moving to mapreduce.* will be
required in order to support newer
Hadoop versions
– Need to resolve some complications with
0.7’s indexing feature
Roadmap: Howl
• Reuse metastore across Hadoop
(diagram) Hive, Pig, Oozie, and Flume all share metadata through Howl, which sits on top of HDFS
Roadmap: Heavy-Duty Tests
• Unit tests are insufficient
• What is needed:
– Real-world schemas/queries
– Non-toy data scales
– Scripted setup; configuration matrix
– Correctness/performance verification
– Automatic reports: throughput, latency,
profiles, coverage, perf counters…
Roadmap: Shared Test Site
• Nightly runs, regression alerting
• Performance trending
• Synthetic workload (e.g. TPC-H)
• Real-world workload (anonymized?)
• This is critical for
– Non-subjective commit criteria
– Release quality

Resources
• http://hive.apache.org
• user@hive.apache.org
• jsichi@facebook.com
• Questions?

More Related Content

What's hot

Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
Phil Young
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
Milind Bhandarkar
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
Cloudera, Inc.
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
OReillyStrata
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
Edureka!
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
Praveen Sripati
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
Byeongweon Moon
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
OReillyStrata
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
Reza Ameri
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
Zheng Shao
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalk
nzhang
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
Lester Martin
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Lester Martin
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Rohit Kulkarni
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
Jay
 
Hadoop
HadoopHadoop
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
Danairat Thanabodithammachari
 

What's hot (20)

Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Where does hadoop come handy
Where does hadoop come handyWhere does hadoop come handy
Where does hadoop come handy
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
An intriduction to hive
An intriduction to hiveAn intriduction to hive
An intriduction to hive
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
WaterlooHiveTalk
WaterlooHiveTalkWaterlooHiveTalk
WaterlooHiveTalk
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyScaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, Guindy
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS DeveloperHadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 

Similar to Hive Evolution: ApacheCon NA 2010

Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
A Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop EcosystemA Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop Ecosystem
DataWorks Summit
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
Ike Ellis
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
Serendio Inc.
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
Sperasoft
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
Subhas Kumar Ghosh
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
Remus Rusanu
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Andrew Brust
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Andrew Brust
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
vishwasgarade1
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
Neeraja Rentachintala
 
Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018
Aman Sinha
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Michael Rys
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
Bigdatapump
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
Remus Rusanu
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
VMware Tanzu
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
Joe Alex
 

Similar to Hive Evolution: ApacheCon NA 2010 (20)

Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 
A Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop EcosystemA Scalable Data Transformation Framework using Hadoop Ecosystem
A Scalable Data Transformation Framework using Hadoop Ecosystem
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
A Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop EcosystemA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
02 data warehouse applications with hive
02 data warehouse applications with hive02 data warehouse applications with hive
02 data warehouse applications with hive
 
Hive big-data meetup
Hive big-data meetupHive big-data meetup
Hive big-data meetup
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018Apache Drill talk ApacheCon 2018
Apache Drill talk ApacheCon 2018
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Hive @ Bucharest Java User Group
Hive @ Bucharest Java User GroupHive @ Bucharest Java User Group
Hive @ Bucharest Java User Group
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
 

Recently uploaded

The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
Larry Smarr
 
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design ApproachesKnowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Earley Information Science
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
What's Next Web Development Trends to Watch.pdf
What's Next Web Development Trends to Watch.pdfWhat's Next Web Development Trends to Watch.pdf
What's Next Web Development Trends to Watch.pdf
SeasiaInfotech2
 
How Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global ScaleHow Netflix Builds High Performance Applications at Global Scale
How Netflix Builds High Performance Applications at Global Scale
ScyllaDB
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Vijayananda Mohire
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Matthew Sinclair
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024What’s New in Teams Calling, Meetings and Devices May 2024
What’s New in Teams Calling, Meetings and Devices May 2024
Stephanie Beckett
 
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating AppsecGDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
James Anderson
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
Aurora Consulting
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx5G bootcamp Sep 2020 (NPI initiative).pptx
5G bootcamp Sep 2020 (NPI initiative).pptx
SATYENDRA100
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
AC Atlassian Coimbatore Session Slides( 22/06/2024)
AC Atlassian Coimbatore Session Slides( 22/06/2024)AC Atlassian Coimbatore Session Slides( 22/06/2024)
AC Atlassian Coimbatore Session Slides( 22/06/2024)
apoorva2579
 
Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1Why do You Have to Redesign?_Redesign Challenge Day 1
Why do You Have to Redesign?_Redesign Challenge Day 1
FellyciaHikmahwarani
 

Recently uploaded (20)

The Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU CampusesThe Increasing Use of the National Research Platform by the CSU Campuses
The Increasing Use of the National Research Platform by the CSU Campuses
 
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design ApproachesKnowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches

Hive Evolution: ApacheCon NA 2010

  • 1. Hive Evolution A Progress Report November 2010 John Sichi (Facebook)
  • 2. Agenda • Hive Overview • Version 0.6 (just released!) • Version 0.7 (under development) • Hive is now a TLP! • Roadmaps
  • 3. What is Hive? • A Hadoop-based system for querying and managing structured data – Uses Map/Reduce for execution – Uses Hadoop Distributed File System (HDFS) for storage
  • 4. Hive Origins • Data explosion at Facebook • Traditional DBMS technology could not keep up with the growth • Hadoop to the rescue! • Incubation with ASF, then became a Hadoop sub-project • Now a top-level ASF project
  • 5. Hive Evolution • Originally: – a way for Hadoop users to express queries in a high-level language without having to write map/reduce programs • Now more and more: – A parallel SQL DBMS which happens to use Hadoop for its storage and execution architecture
  • 6. Intended Usage • Web-scale Big Data – 100’s of terabytes • Large Hadoop cluster – 100’s of nodes (heterogeneous OK) • Data has a schema • Batch jobs – for both loads and queries
  • 7. So Don’t Use Hive If… • Your data is measured in GB • You don’t want to impose a schema • You need responses in seconds • A “conventional” analytic DBMS can already do the job – (and you can afford it) • You don’t have a lot of time and smart people
  • 8. Scaling Up • Facebook warehouse, July 2010: – 2250 nodes – 36 petabytes disk space • Data access per day: – 80 to 90 terabytes added (uncompressed) – 25000 map/reduce jobs • 300-400 users/month
  • 9. Facebook Deployment (architecture diagram; components: Web Servers, Scribe MidTier, Scribe-Hadoop Clusters, Sharded MySQL, Production Hive-Hadoop Cluster, Adhoc Hive-Hadoop Cluster, with Hive replication between the two Hive-Hadoop clusters)
  • 10. Hive Architecture (architecture diagram; components: CLI, Web Management Console, JDBC/ODBC clients, Hive Thrift API, Query Engine, Metastore, Metastore Thrift API, Hadoop Map/Reduce + HDFS clusters)
  • 12. Map/Reduce Plans (diagram: input files are divided into splits, processed by map tasks, shuffled to reduce tasks, and written out as result files)
  • 13. Query Translation Example • SELECT url, count(*) FROM page_views GROUP BY url • Map tasks compute partial counts for each URL in a hash table – “map side” preaggregation – map outputs are partitioned by URL and shipped to corresponding reducers • Reduce tasks tally up partial counts to produce final results
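As an aside on inspecting such plans: Hive's EXPLAIN statement prints the operator tree and map/reduce stages the compiler generates. A minimal sketch, using the same query as on the slide:

  EXPLAIN
  SELECT url, count(*) FROM page_views GROUP BY url;

With map-side aggregation enabled, the output typically shows the partial Group By operator in the map stage and the final Group By in the reduce stage described above.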
  • 14. It Gets Quite Complicated!
  • 15. Behavior Extensibility • TRANSFORM scripts (any language) – Serialization+IPC overhead • User defined functions (Java) – In-process, lazy object evaluation • Pre/Post Hooks (Java) – Statement validation/execution – Example uses: auditing, replication, authorization
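To make the TRANSFORM bullet concrete, a minimal sketch (my_filter.py and the column names are hypothetical; the script only needs to read tab-separated rows on stdin and write rows to stdout):

  ADD FILE my_filter.py;
  SELECT TRANSFORM (userid, page_url)
         USING 'python my_filter.py'
         AS (userid, cleaned_url)
  FROM page_views;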
  • 16. UDF vs UDAF vs UDTF • User Defined Function • One-to-one row mapping • Concat(‘foo’, ‘bar’) • User Defined Aggregate Function • Many-to-one row mapping • Sum(num_ads) • User Defined Table Function • One-to-many row mapping • Explode([1,2,3])
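One query per function class, for illustration (tables and columns are hypothetical; concat, sum, and explode are built-ins):

  -- UDF: one output value per input row
  SELECT concat(first_name, last_name) FROM users;
  -- UDAF: many input rows collapse to one value per group
  SELECT campaign_id, sum(num_ads) FROM ad_impressions GROUP BY campaign_id;
  -- UDTF: one input row expands to several output rows
  SELECT explode(array(1, 2, 3)) AS n FROM users LIMIT 3;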
  • 17. Storage Extensibility • Input/OutputFormat: file formats – SequenceFile, RCFile, TextFile, … • SerDe: row formats – Thrift, JSON, ProtocolBuffer, … • Storage Handlers (new in 0.6) – Integrate foreign metadata, e.g. HBase • Indexing – Under development in 0.7
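A sketch of how the file-format and row-format hooks surface in table DDL (the table is hypothetical; the ColumnarSerDe class and the RCFILE keyword ship with Hive):

  CREATE TABLE page_views_rc (
    viewTime INT,
    userid BIGINT,
    page_url STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
  STORED AS RCFILE;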
  • 18. Release 0.6 • October 2010 – Views – Multiple Databases – Dynamic Partitioning – Automatic Merge – New Join Strategies – Storage Handlers
  • 19. Views: Syntax CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], … ) ] [COMMENT ‘view_comment’] AS SELECT … [ ORDER BY … LIMIT … ]
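A concrete instance of the syntax above (view, table, and column names are hypothetical):

  CREATE VIEW IF NOT EXISTS url_counts
    (url, views COMMENT 'number of views')
    COMMENT 'Rollup used by reporting jobs'
  AS SELECT page_url, count(*)
     FROM page_views
     GROUP BY page_url;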
  • 20. Views: Usage • Use Cases – Column/table renaming – Encapsulate complex query logic – Security (Future) • Limitations – Read-only – Obscures partition metadata from underlying tables – No dependency management
  • 21. Multiple Databases • Follows MySQL convention – CREATE DATABASE [IF NOT EXISTS] db_name [COMMENT ‘db_comment’] – USE db_name • Logical namespace for tables • ‘default’ database is still there • Does not yet support queries across multiple databases
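A minimal sketch of the database DDL (the database name is hypothetical):

  CREATE DATABASE IF NOT EXISTS web_logs
    COMMENT 'click and page-view data';
  USE web_logs;
  CREATE TABLE page_views (viewTime INT, userid BIGINT, page_url STRING);
  USE default;  -- switch back to the default namespace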
  • 22. Dynamic Partitions: Syntax • Example INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country FROM page_view_stg pvs
  • 23. Dynamic Partitions: Usage • Automatically create partitions based on distinct values in columns • Works as rudimentary indexing – Prune partitions via WHERE clause • But be careful… – Don’t create too many partitions! – Configuration parameters can be used to prevent accidents
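The configuration parameters alluded to above look roughly like this (values are illustrative, not recommendations):

  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;    -- allow every partition column to be dynamic
  set hive.exec.max.dynamic.partitions=1000;         -- cap on partitions created per statement
  set hive.exec.max.dynamic.partitions.pernode=100;  -- cap per map/reduce task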
  • 24. Automatic merge • Jobs can produce many files • Why is this bad? – Namenode pressure – Downstream jobs have to deal with file processing overhead • So, clean up by merging results into a few large files (configurable) – Use conditional map-only task to do this
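The merge step is controlled by settings along these lines (a sketch; defaults vary across releases):

  set hive.merge.mapfiles=true;            -- merge small output files of map-only jobs
  set hive.merge.mapredfiles=true;         -- merge small output files of map/reduce jobs
  set hive.merge.size.per.task=256000000;  -- target size (bytes) for the merged files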
  • 25. Join Strategies Before 0.6 • Map/reduce join – Map tasks partition inputs on join keys and ship to corresponding reducers – Reduce tasks perform sort-merge-join • Map-join – Each mapper builds lookup hashtable from copy of small table – Then hash-join the splits of big table
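A map-join can also be requested explicitly with a hint; a sketch (table names are hypothetical, and dim_countries is assumed small enough to fit in each mapper's hashtable):

  SELECT /*+ MAPJOIN(c) */ pv.page_url, c.country_name
  FROM page_views pv
  JOIN dim_countries c ON (pv.country = c.country_code);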
  • 26. New Join Strategies • Bucketed map-join – Each mapper filters its lookup table by the bucketing hash function – Allows “small” table to be much bigger • Sorted merge in map-join – Requires presorted input tables • Deal with skew in map/reduce join – Conditional plan step for skew keys (after the main map/reduce join step)
  • 28. Low Latency Warehouse (diagram: HBase receives continuous updates, other files/tables receive periodic loads, and Hive queries run over both)
  • 29. Storage Handler Syntax • HBase Example CREATE TABLE users( userid int, name string, email string, notes string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ( "hbase.columns.mapping" = "small:name,small:email,large:notes") TBLPROPERTIES ( "hbase.table.name" = "user_list");
  • 30. Release 0.7 • In development – Concurrency Control – Stats Collection – Stats Functions – Indexes – Local Mode – Faster map join – Multiple DISTINCT aggregates – Archiving – JDBC/ODBC improvements
  • 31. Concurrency Control • Pluggable distributed lock manager – Default is Zookeeper-based • Simple read/write locking • Table-level and partition-level • Implicit locking (statement level) – Deadlock-free via lock ordering • Explicit LOCK TABLE (global)
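With the lock manager enabled, the statements involved look roughly like this (a sketch; it assumes a ZooKeeper quorum has been configured for the default lock manager):

  set hive.support.concurrency=true;
  LOCK TABLE page_views EXCLUSIVE;   -- explicit, global lock
  SHOW LOCKS page_views;             -- inspect current locks
  UNLOCK TABLE page_views;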
  • 32. Statistics Collection • Implicit metastore update during load – Or explicit via ANALYZE TABLE • Table/partition-level – Number of rows – Number of files – Size in bytes
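The explicit path mentioned above, sketched against a hypothetical partition:

  ANALYZE TABLE page_views PARTITION(dt='2010-10-30') COMPUTE STATISTICS;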
  • 33. Stats-driven Optimization • Automatic map-side join • Automatic map-side aggregation • Need column-level stats for better estimates – Filter/join selectivity – Distinct value counts – Column correlation
  • 34. Statistical Functions • Stats 101 – Stddev, var, covar – Percentile_approx • Data Mining – Ngrams, sentences (text analysis) – Histogram_numeric • SELECT histogram_numeric(dob_year) FROM users GROUP BY relationshipstatus
  • 35. Histogram query results • “It’s complicated” peaks at 18-19, but lasts into late 40s! • “In a relationship” peaks at 20 • “Engaged” peaks at 25 • Married peaks in early 30s • More married than single at 28 • Only teenagers use widowed?
  • 36. Pluggable Indexing • Reference implementation – Index is stored in a normal Hive table – Compact: distinct block addresses – Partition-level rebuild • Currently in R&D – Automatic use for WHERE, GROUP BY – New index types (e.g. bitmap, HBase)
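The reference implementation is driven by DDL along these lines (a sketch of the 0.7 syntax; index and table names are hypothetical):

  CREATE INDEX page_views_url_idx
    ON TABLE page_views (page_url)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD;
  ALTER INDEX page_views_url_idx ON page_views REBUILD;  -- (re)build, optionally per partition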
  • 37. Local Mode Execution • Avoids map/reduce cluster job latency • Good for jobs which process small amounts of data • Let Hive decide when to use it – set hive.exec.mode.local.auto=true; • Or force its usage – set mapred.job.tracker=local;
  • 38. Faster map join • Make sure small table can fit in memory – If it can’t, fall back to reduce join • Optimize hash table data structures • Use distributed cache to push out pre-filtered lookup table – Avoid swamping HDFS with reads from thousands of mappers
  • 39. Multiple DISTINCT Aggs • Example SELECT view_date, COUNT(DISTINCT userid), COUNT(DISTINCT page_url) FROM page_views GROUP BY view_date
  • 40. Archiving • Use HAR (Hadoop archive format) to combine many files into a few • Relieves namenode memory • Archived partition becomes read-only • Syntax: ALTER TABLE page_views {ARCHIVE|UNARCHIVE} PARTITION (ds='2010-10-30')
  • 41. JDBC/ODBC Improvements • JDBC: Basic metadata calls – Good enough for use with UIs such as SQuirreL • JDBC: some PreparedStatement support – Pentaho Data Integration • ODBC: new driver under development (based on SQLite)
  • 42. Hive is now a TLP • PMC – Namit Jain (chair) – John Sichi – Zheng Shao – Edward Capriolo – Raghotham Murthy – Ning Zhang – Paul Yang – He Yongqiang – Prasad Chakka – Joydeep Sen Sarma – Ashish Thusoo • Welcome to new committer Carl Steinbach!
  • 43. Developer Diversity • Recent Contributors – Facebook, Yahoo, Cloudera – Netflix, Amazon, Media6Degrees, Intuit – Numerous research projects – Many many more… • Monthly San Francisco Bay Area contributor meetups • East coast meetups?
  • 44. Roadmap: Security • Authentication – Upgrading to SASL-enabled Thrift • Authorization – HDFS-level • Very limited (no ACL’s) • Can’t support all Hive features (e.g. views) – Hive-level (GRANT/REVOKE) • Hive server deployment for full effectiveness
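The Hive-level statements under discussion are along these lines (a sketch of the intended GRANT/REVOKE surface; the released syntax may differ):

  GRANT SELECT ON TABLE page_views TO USER bob;
  REVOKE SELECT ON TABLE page_views FROM USER bob;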
  • 45. Roadmap: Hadoop API • Dropping pre-0.20 support starting with Hive 0.7 – But Hive is still using old mapred.* • Moving to mapreduce.* will be required in order to support newer Hadoop versions – Need to resolve some complications with 0.7’s indexing feature
  • 46. Roadmap: Howl • Reuse metastore across Hadoop (diagram; components: Howl, Hive, Pig, Oozie, Flume, HDFS)
  • 47. Roadmap: Heavy-Duty Tests • Unit tests are insufficient • What is needed: – Real-world schemas/queries – Non-toy data scales – Scripted setup; configuration matrix – Correctness/performance verification – Automatic reports: throughput, latency, profiles, coverage, perf counters…
  • 48. Roadmap: Shared Test Site • Nightly runs, regression alerting • Performance trending • Synthetic workload (e.g. TPC-H) • Real-world workload (anonymized?) • This is critical for – Non-subjective commit criteria – Release quality