Optimizing Hive Queries

Owen O’Malley
Founder and Architect
owen@hortonworks.com
@owen_omalley




© Hortonworks Inc. 2013:   Page 1
Who Am I?

• Founder and Architect at Hortonworks
 – Working on Hive, working with customers
 – Formerly Hadoop MapReduce & Security
 – Been working on Hadoop since the beginning
• Apache Hadoop, ASF
 – Hadoop PMC (Original VP)
 – Tez, Ambari, Giraph PMC
 – Mentor for: Accumulo, Kafka, Knox
 – Apache Member
Outline

• Data Layout
• Data Format
• Joins
• Debugging




Data Layout
Location, Location, Location




Fundamental Questions

• What is your primary use case?
  – What kind of queries and filters?
• How do you need to access the data?
  – What information do you need together?
• How much data do you have?
  – What is your year to year growth?
• How do you get the data?



HDFS Characteristics

• Provides Distributed File System
  – Very high aggregate bandwidth
  – Extreme scalability (up to 100 PB)
  – Self-healing storage
  – Relatively simple to administer
• Limitations
  – Can’t modify existing files
  – Single writer for each file
  – Heavy bias for large files (> 100 MB)
Choices for Layout

• Partitions
  – Top level mechanism for pruning
  – Primary unit for updating tables (& schema)
  – Directory per value of specified column
• Bucketing
  – Hashed into a file, good for sampling
  – Controls write parallelism
• Sort order
  – The order the data is written within file
Example Hive Layout

• Directory Structure
  warehouse/$database/$table
• Partitioning
  /part1=$partValue/part2=$partValue
• Bucketing
  /$bucket_$attempt (e.g. 000000_0)
• Sort
  – Each file is sorted within the file
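
As a rough illustration of how the pieces above compose into a file path, the Python sketch below builds the location of one bucket file. The function names, the `default`/`sales` table, the `ds` partition column, and the CRC32 hash are all assumptions made for the example; Hive's real bucketing hash is different.

```python
import zlib


def bucket_for(key: str, num_buckets: int) -> int:
    # Stand-in hash for illustration only; not the hash Hive uses.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def file_path(database, table, partitions, bucket, attempt=0, root="warehouse"):
    # partitions: ordered list of (column, value) pairs, one directory per pair.
    parts = [root, database, table]
    parts += [f"{col}={val}" for col, val in partitions]
    # Bucket files are named <zero-padded bucket>_<attempt>, e.g. 000000_0.
    parts.append(f"{bucket:06d}_{attempt}")
    return "/".join(parts)


path = file_path("default", "sales", [("ds", "2013-02-01")],
                 bucket_for("customer-42", 32))
print(path)
```

Because the partition value is part of the directory name, a filter on the partition column lets a reader skip whole directories without opening any file.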

Layout Guidelines

• Limit the number of partitions
  – 1,000 partitions is much faster than 10,000
  – Nested partitions are almost always wrong
• Gauge the number of buckets
  – Calculate file size and keep big (200-500MB)
  – Don’t forget number of files (Buckets * Parts)
• Layout related tables the same way
  – Partition
  – Bucket and sort order
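
The arithmetic behind these guidelines can be sketched in a few lines of Python. The helper name `plan_buckets` and the 256 MB target are assumptions chosen to fall inside the 200-500 MB range above; they are not Hive settings.

```python
def plan_buckets(partition_bytes, num_partitions, target_file_bytes=256 * 1024**2):
    # Ceiling division: enough buckets that the average file size
    # does not exceed the target.
    buckets = max(1, -(-partition_bytes // target_file_bytes))
    return {
        "buckets": buckets,
        "avg_file_mb": partition_bytes / buckets / 1024**2,
        # Don't forget the number of files: buckets * partitions.
        "total_files": buckets * num_partitions,
    }


# Hypothetical table: 10 GiB per daily partition, one year of partitions.
plan = plan_buckets(10 * 1024**3, 365)
print(plan)  # 40 buckets of 256 MB each, 14600 files in total
```

Even with well-sized files, a year of daily partitions already yields thousands of files, which is why adding a second, nested partition level multiplies the count quickly.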
Normalization

• Most databases suggest normalization
  – Keep information about each thing together
  – Customer, Sales, Returns, Inventory tables
• Has lots of good properties, but…
  – Is typically slow to query
• Often best to denormalize during load
  – Write once, read many times
  – Additionally provides snapshots in time.


Data Format
How is your data stored?




Choice of Format

• Serde
  – How is each record encoded?
• Input/Output (aka File) Format
  – How are the files stored?
• Primary Choices
  – Text
  – Sequence File
  – RCFile
  – ORC (Coming Soon!)
Text Format

• Critical to pick a Serde
  – Default - ^A’s between fields
  – JSON – top level JSON record
  – CSV – commas between fields (on github)
• Slow to read and write
• Can’t split compressed files
  – Leads to huge maps
• Need to read/decompress all fields

Sequence File

• Traditional MapReduce binary file format
  – Stores keys and values as classes
  – Not a good fit for Hive, which has SQL types
  – Hive always stores entire row as value
• Splittable but only by searching file
  – Default block size is 1 MB
• Need to read and decompress all fields



RC (Row Columnar) File

• Columns stored separately
  – Read and decompress only needed ones
  – Better compression
• Columns stored as binary blobs
  – Depends on metastore to supply types
• Larger blocks
  – 4 MB by default
  – Still search file for split boundary


ORC (Optimized Row Columnar)

• Columns stored separately
• Knows types
  – Uses type-specific encoders
  – Stores statistics (min, max, sum, count)
• Has light-weight index
  – Skip over blocks of rows that don’t matter
• Larger blocks
  – 256 MB by default
  – Has an index for block boundaries
ORC - File Layout




Example File Sizes from TPC-DS




Compression

• Need to pick level of compression
  – None
  – LZO or Snappy – fast but sloppy
      – Best for temporary tables
  – ZLIB – slow and complete
      – Best for long term storage
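
One way to get a feel for this trade-off is to compress the same data at different effort levels. The sketch below uses only Python's standard `zlib` module; level 1 is merely a stand-in for a fast codec such as LZO or Snappy (which are not in the standard library), and the synthetic rows are an assumption, so the numbers say nothing about real tables.

```python
import zlib


def compressed_sizes(data: bytes, levels=(1, 9)):
    """Return {level: compressed size in bytes} for each zlib level."""
    return {level: len(zlib.compress(data, level)) for level in levels}


# Synthetic, repetitive rows standing in for a table file.
rows = "\n".join(
    f"{i % 50},store_{i % 7},2013-02-{i % 28 + 1:02d}" for i in range(20000)
).encode("utf-8")

sizes = compressed_sizes(rows)
print("raw bytes:", len(rows), "compressed bytes by level:", sizes)
```

Timing the two calls on your own data (for example with `time.perf_counter`) shows the other half of the trade-off: the higher level typically costs more CPU per byte written.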




Joins
Putting the pieces together




Default Assumption

• Hive assumes users are either:
  – Noobies
  – Hive developers
• Default behavior is to always finish
  – Little Engine that Could!
• Experts could override default behaviors
  – Get better performance, but riskier
• We’re working on improving heuristics

Shuffle Join

• Default choice
  – Always works (I’ve sorted a petabyte!)
  – Worst case scenario
• Each process
  – Reads from part of one of the tables
  – Buckets and sorts on join key
  – Sends one bucket to each reduce
• Works every time!
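
The steps above can be sketched in plain Python as a toy, single-process model of a shuffle join. It is an illustration of the idea, not Hive's implementation: the `partition` function, the CRC32 hash, and the `(key, value)` row shape are assumptions.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    # Stand-in partitioner: the same key always maps to the same reducer.
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def shuffle_join(left, right, num_reducers=4):
    """Inner-join two lists of (key, value) rows on key."""
    # Map phase: tag each row with its source table and route it by join key.
    reducers = [defaultdict(lambda: ([], [])) for _ in range(num_reducers)]
    for tag, table in enumerate((left, right)):
        for key, value in table:
            reducers[partition(key, num_reducers)][key][tag].append(value)
    # Reduce phase: each reducer walks its keys in sorted order and pairs rows.
    out = []
    for bucket in reducers:
        for key in sorted(bucket):
            left_vals, right_vals = bucket[key]
            out.extend((key, lv, rv) for lv in left_vals for rv in right_vals)
    return out


orders = [(1, "a"), (2, "b"), (2, "c")]
customers = [(2, "x"), (3, "y")]
print(sorted(shuffle_join(orders, customers)))  # [(2, 'b', 'x'), (2, 'c', 'x')]
```

Every row of both tables crosses the shuffle, which is why this join always works but is the slowest option.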

Map Join

• One table is small (e.g. dimension table)
  – Fits in memory
• Each process
  – Reads small table into memory hash table
  – Streams through part of the big file
  – Joins each record against the hash table
• Very fast, but limited
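
A minimal Python sketch of the same idea follows: build a hash table from the small table, then stream the big one past it. The function name and the `(key, value)` row shape are assumptions for illustration; this is not Hive code.

```python
from collections import defaultdict


def map_join(small_table, big_rows):
    """Inner-join a stream of (key, value) rows against a small in-memory table."""
    # Build phase: the whole small table must fit in memory.
    hash_table = defaultdict(list)
    for key, value in small_table:
        hash_table[key].append(value)
    # Probe phase: one pass over the big table, no shuffle and no sort.
    for key, value in big_rows:
        for small_value in hash_table.get(key, ()):
            yield key, value, small_value


countries = [(1, "US"), (2, "DE")]                       # small dimension table
orders = [(1, "o1"), (3, "o2"), (2, "o3"), (1, "o4")]    # part of the big table
print(list(map_join(countries, orders)))
# [(1, 'o1', 'US'), (2, 'o3', 'DE'), (1, 'o4', 'US')]
```

The limitation on the slide is visible in the build phase: if the small table does not fit in memory, the approach stops working.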



Sort Merge Bucket (SMB) Join

• If both tables are:
  – Sorted the same
  – Bucketed the same
  – And joining on the sort/bucket column
• Each process:
  – Reads a bucket from each table
  – Processes the row with the lowest value
• Very efficient if applicable
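
The per-bucket work can be sketched as a classic merge join. The Python below assumes each bucket is a list of `(key, value)` rows already sorted by key, and that bucket i of one table pairs with bucket i of the other; the function names are hypothetical and this is not Hive's implementation.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value) rows that are sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # advance whichever side has the lowest key
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side, then pair them.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            out.extend((lk, lv, rv)
                       for _, lv in left[i:i_end]
                       for _, rv in right[j:j_end])
            i, j = i_end, j_end
    return out


def smb_join(left_buckets, right_buckets):
    # Matching buckets hold the same keys, so each pair is joined independently.
    return [row for lb, rb in zip(left_buckets, right_buckets)
            for row in merge_join(lb, rb)]


a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
b = [(2, "x"), (3, "y"), (4, "z")]
print(merge_join(a, b))  # [(2, 'b', 'x'), (2, 'c', 'x'), (4, 'd', 'z')]
```

Neither side is hashed, sorted, or held fully in memory at join time, which is where the efficiency comes from; the cost was paid once when the tables were written.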

Debugging
What could possibly go wrong?




Performance Question

• Which of the following is faster?
  – select count(distinct(Col)) from Tbl
  – select count(*) from
       (select distinct(Col) from Tbl) t




Count Distinct




Answer

• Surprisingly the second is usually faster
  – In the first case:
      – Maps send each value to the reduce
      – Single reduce counts them all
  – In the second case:
      – Maps split up the values to many reduces
      – Each reduce generates its list
      – Final job counts the size of each list
  – Singleton reduces are almost always BAD
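
The difference between the two plans can be modeled with a small Python simulation that tracks how many rows each reducer has to handle. This is a sketch under simplifying assumptions: the function names are hypothetical, Python's built-in `hash` stands in for the shuffle partitioner, and four reducers is an arbitrary choice.

```python
def count_distinct_one_reducer(values):
    """First query: every value is sent to a single reducer."""
    seen = set()
    for v in values:
        seen.add(v)
    return len(seen), [len(values)]  # (answer, rows handled per reducer)


def count_distinct_two_stage(values, num_reducers=4):
    """Second query: reducers de-duplicate in parallel, a final job adds up."""
    partitions = [set() for _ in range(num_reducers)]
    load = [0] * num_reducers
    for v in values:
        r = hash(v) % num_reducers  # equal values always reach the same reducer
        partitions[r].add(v)
        load[r] += 1
    # Final job: sum the sizes of the per-reducer distinct lists.
    return sum(len(p) for p in partitions), load


values = [i % 1000 for i in range(10000)]
print(count_distinct_one_reducer(values))  # (1000, [10000])
print(count_distinct_two_stage(values))    # (1000, [2500, 2500, 2500, 2500])
```

Both plans return the same answer; the second simply spreads the 10,000 input rows over four reducers instead of funneling them through one.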

Communication is Good!

• Hive doesn’t tell you what is wrong.
  – Expects you to know!
  – “Lucy, you have some ‘splaining to do!”
• Explain tool provides query plan
  – Filters on input
  – Numbers of jobs
  – Numbers of maps and reduces
  – What the jobs are sorting by
  – Which directories they are reading or writing
Blinded by Science

• The explanation tool is confusing.
  – It takes practice to understand.
  – It doesn’t include some critical details like partition pruning.
• Running the query makes things clearer!
  – Pay attention to the details
  – Look at JobConf and job history files



Skew

• Skew is typical in real datasets.
• A user complained that his job was slow
  – He had 100 reduces
  – 98 of them finished fast
  – 2 ran really slow
• The key was a boolean…
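
Why a boolean key produces exactly this picture can be shown with a short Python sketch. The `reducer_load` helper and the synthetic keys are assumptions, and Python's built-in `hash` stands in for the real partitioner.

```python
from collections import Counter


def reducer_load(keys, num_reducers=100):
    """Count how many rows each reducer receives when partitioning by key hash."""
    load = Counter(hash(k) % num_reducers for k in keys)
    return [load.get(r, 0) for r in range(num_reducers)]


# A boolean key has only two possible values, so at most two reducers get data.
keys = [i % 3 == 0 for i in range(9000)]
load = reducer_load(keys)
busy = [n for n in load if n > 0]
print(len(busy), "of", len(load), "reducers received rows:", busy)
```

Adding reducers cannot help here: the number of distinct key values caps the useful parallelism, so the 98 idle reducers finish immediately while two do all the work.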




Root Cause Analysis

• Ambari
  – Apache project building a Hadoop installation and management tool
  – Provides metrics (Ganglia & Nagios)
  – Root Cause Analysis
      – Processes MapReduce job logs
      – Displays timing of each part of query plan




Root Cause Analysis Screenshots




Thank You!
Questions & Answers




@owen_omalley



ORCFile - Comparison

                               RC File   Trevni   ORC File
 Hive Type Model               N         N        Y
 Separate complex columns      N         Y        Y
 Splits found quickly          N         Y        Y
 Default column group size     4MB       64MB*    256MB
 Files per bucket              1         >1       1
 Store min, max, sum, count    N         N        Y
 Versioned metadata            N         Y        Y
 Run length data encoding      N         N        Y
 Store strings in dictionary   N         N        Y
 Store row count               N         Y        Y
 Skip compressed blocks        N         N        Y
 Store internal indexes        N         N        Y

     © Hortonworks Inc. 2013                                 Page 36

Recommended for you

Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive

Nesta segunda parte do tema Redshift, mostramos o case da Movile, líder em mobile commerce com 50 milhões de usuários, e analisamos tópicos avançados como compressão, macros SQL embutidas e índices multidimensionais para grandes bases de dados.

Ozone and HDFS's Evolution
Ozone and HDFS's EvolutionOzone and HDFS's Evolution
Ozone and HDFS's Evolution

HDFS has several strengths: horizontally scale its IO bandwidth and scale its storage to petabytes of storage. Further, it provides very low latency metadata operations and scales to over 60K concurrent clients. Hadoop 3.0 recently added Erasure Coding. One of HDFS’s limitations is scaling a number of files and blocks in the system. We describe a radical change to Hadoop’s storage infrastructure with the upcoming Ozone technology. It allows Hadoop to scale to tens of billions of files and blocks and, in the future, to every larger number of smaller objects. Ozone fundamentally separates the namespace layer and the block layer allowing new namespace layers to be added in the future. Further, the use of RAFT protocol has allowed the storage layer to be self-consistent. We show how this technology helps a Hadoop user and also what it means for evolving HDFS in the future. We will also cover the technical details of Ozone. Speaker: Sanjay Radia, Chief Architect, Founder, Hortonworks

hdfsozonehortonworks
Ozone and HDFS’s evolution
Ozone and HDFS’s evolutionOzone and HDFS’s evolution
Ozone and HDFS’s evolution

HDFS has several strengths: horizontally scale its IO bandwidth and scale its storage to petabytes of storage. Further, it provides very low latency metadata operations and scales to over 60K concurrent clients. Hadoop 3.0 recently added Erasure Coding. One of HDFS’s limitations is scaling a number of files and blocks in the system. We describe a radical change to Hadoop’s storage infrastructure with the upcoming Ozone technology. It allows Hadoop to scale to tens of billions of files and blocks and, in the future, to every larger number of smaller objects. Ozone fundamentally separates the namespace layer and the block layer allowing new namespace layers to be added in the future. Further, the use of RAFT protocol has allowed the storage layer to be self-consistent. We show how this technology helps a Hadoop user and also what it means for evolving HDFS in the future. We will also cover the technical details of Ozone.

hortonworksdataworks summit 2017dataworks summit singapore

Optimizing Hive Queries

  • 1. Optimizing Hive Queries
    Owen O’Malley
    Founder and Architect
    owen@hortonworks.com
    @owen_omalley
    © Hortonworks Inc. 2013: Page 1
  • 2. Who Am I?
    • Founder and Architect at Hortonworks
      – Working on Hive, working with customers
      – Formerly Hadoop MapReduce & Security
      – Been working on Hadoop since the beginning
    • Apache Hadoop, ASF
      – Hadoop PMC (Original VP)
      – Tez, Ambari, Giraph PMC
      – Mentor for: Accumulo, Kafka, Knox
      – Apache Member
  • 4. Data Layout
    Location, Location, Location
  • 5. Fundamental Questions
    • What is your primary use case?
      – What kind of queries and filters?
    • How do you need to access the data?
      – What information do you need together?
    • How much data do you have?
      – What is your year to year growth?
    • How do you get the data?
  • 6. HDFS Characteristics
    • Provides Distributed File System
      – Very high aggregate bandwidth
      – Extreme scalability (up to 100 PB)
      – Self-healing storage
      – Relatively simple to administer
    • Limitations
      – Can’t modify existing files
      – Single writer for each file
      – Heavy bias for large files (> 100 MB)
  • 7. Choices for Layout
    • Partitions
      – Top level mechanism for pruning
      – Primary unit for updating tables (& schema)
      – Directory per value of specified column
    • Bucketing
      – Hashed into a file, good for sampling
      – Controls write parallelism
    • Sort order
      – The order the data is written within file
  • 8. Example Hive Layout
    • Directory Structure
      warehouse/$database/$table
    • Partitioning
      /part1=$partValue/part2=$partValue
    • Bucketing
      /$bucket_$attempt (e.g. 000000_0)
    • Sort
      – Each file is sorted within the file
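The layout on slide 8 can be sketched in a few lines of Python. This is only an illustration of how the directory, partition, and bucket pieces compose into a file path. The function names `bucket_for` and `hive_path` are hypothetical, and the CRC32 hash is a stand-in chosen for the example, not the hash function Hive actually uses.

```python
import zlib


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket number.

    Assumption: CRC32 is a stand-in hash for illustration only;
    Hive applies its own hash function to the bucketing column.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def hive_path(database, table, partitions, bucket, attempt=0):
    """Compose warehouse/$database/$table/part=value/.../$bucket_$attempt."""
    components = ["warehouse", database, table]
    # One directory level per partition column, written as name=value.
    components += [f"{name}={value}" for name, value in partitions]
    # Bucket files are named with a zero-padded bucket number and an attempt id.
    components.append(f"{bucket:06d}_{attempt}")
    return "/".join(components)
```

For example, `hive_path("sales", "orders", [("ds", "2013-01-01")], 0)` yields `warehouse/sales/orders/ds=2013-01-01/000000_0`, which matches the naming pattern shown on the slide.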
  • 9. Layout Guidelines
    • Limit the number of partitions
      – 1,000 partitions is much faster than 10,000
      – Nested partitions are almost always wrong
    • Gauge the number of buckets
      – Calculate file size and keep files big (200-500 MB)
      – Don’t forget number of files (Buckets * Parts)
    • Lay out related tables the same way
      – Partition
      – Bucket and sort order
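The bucket guideline on slide 9 is simple arithmetic, sketched below. The helper names and the 256 MB default target are assumptions picked from within the slide's 200-500 MB range, not values prescribed by the talk.

```python
MB = 1024 * 1024


def suggest_buckets(partition_bytes, target_file_bytes=256 * MB):
    """Pick a bucket count so each file lands near the target size.

    Assumption: data spreads evenly across buckets, so
    file size is roughly partition size / bucket count.
    """
    # Ceiling division with integers, never fewer than one bucket.
    return max(1, -(-partition_bytes // target_file_bytes))


def total_files(num_partitions, num_buckets):
    """Total file count for the table: Buckets * Parts."""
    return num_partitions * num_buckets
```

With 4 GB per partition, `suggest_buckets(4 * 1024 * MB)` gives 16 buckets, and a table with 1,000 such partitions holds `total_files(1000, 16)` = 16,000 files, which is the number the slide warns not to forget.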
  • 10. Normalization
    • Most databases suggest normalization
      – Keep information about each thing together
      – Customer, Sales, Returns, Inventory tables
    • Has lots of good properties, but…
      – Is typically slow to query
    • Often best to denormalize during load
      – Write once, read many times
      – Additionally provides snapshots in time.
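A minimal sketch of denormalizing during load, assuming hypothetical `sales` and `customers` records held as Python dictionaries. The field names are invented for the example. In Hive this would typically be a join run once at load time; the Python version only shows the idea that customer attributes are copied into each sales row as they were when the load ran.

```python
def denormalize(sales, customers):
    """Copy customer attributes into each sales row at load time."""
    by_id = {c["customer_id"]: c for c in customers}
    wide = []
    for sale in sales:
        customer = by_id.get(sale["customer_id"], {})
        row = dict(sale)  # copy so the source record is left untouched
        # Values are copied, so later customer changes do not alter this row.
        row["customer_name"] = customer.get("name")
        row["customer_state"] = customer.get("state")
        wide.append(row)
    return wide
```

Because the values are copied, a customer who later moves to another state still appears with the old state on previously loaded rows, which is the "snapshot in time" property the slide mentions.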
  • 11. Data Format
    How is your data stored?
  • 12. Choice of Format • Serde – How is each record encoded? • Input/Output (aka File) Format – How are the files stored? • Primary Choices – Text – Sequence File – RCFile – ORC (Coming Soon!)
  • 13. Text Format • Critical to pick a Serde – Default - ^A’s between fields – JSON – top level JSON record – CSV – commas between fields (on github) • Slow to read and write • Can’t split compressed files – Leads to huge maps • Need to read/decompress all fields
  • 14. Sequence File • Traditional MapReduce binary file format – Stores keys and values as classes – Not a good fit for Hive, which has SQL types – Hive always stores entire row as value • Splittable but only by searching file – Default block size is 1 MB • Need to read and decompress all fields
  • 15. RC (Row Columnar) File • Columns stored separately – Read and decompress only needed ones – Better compression • Columns stored as binary blobs – Depends on metastore to supply types • Larger blocks – 4 MB by default – Still search file for split boundary
  • 16. ORC (Optimized Row Columnar) • Columns stored separately • Knows types – Uses type-specific encoders – Stores statistics (min, max, sum, count) • Has light-weight index – Skip over blocks of rows that don’t matter • Larger blocks – 256 MB by default – Has an index for block boundaries
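The idea behind the light-weight index on slide 16 can be illustrated with a small Python sketch: keep min/max statistics per block of rows and skip any block whose range cannot match the filter. This is a conceptual model only, not ORC's actual file format or reader; the function names and the block size of 100 rows are assumptions.

```python
def build_index(values, rows_per_group):
    """Split a column into row groups and record min/max/count per group."""
    index = []
    for start in range(0, len(values), rows_per_group):
        group = values[start:start + rows_per_group]
        index.append({"start": start, "min": min(group),
                      "max": max(group), "count": len(group)})
    return index

def groups_to_read(index, lo, hi):
    """Return start offsets of row groups that may hold values in [lo, hi]."""
    # A group can be skipped when its [min, max] range misses [lo, hi] entirely.
    return [g["start"] for g in index if g["max"] >= lo and g["min"] <= hi]

column = list(range(1000))  # sorted data makes the statistics most selective
index = build_index(column, rows_per_group=100)
print(groups_to_read(index, 250, 320))  # [200, 300]
```

Only two of the ten row groups need to be read here. On unsorted data the min/max ranges overlap much more, which is one reason sort order matters for layout.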
  • 17. ORC - File Layout
  • 18. Example File Sizes from TPC-DS
  • 19. Compression • Need to pick level of compression – None – LZO or Snappy: fast but sloppy; best for temporary tables – ZLIB: slow and complete; best for long term storage
  • 20. Joins Putting the pieces together
  • 21. Default Assumption • Hive assumes users are either: – Noobies – Hive developers • Default behavior is to always finish – Little Engine that Could! • Experts can override default behaviors – Get better performance, but riskier • We’re working on improving heuristics
  • 22. Shuffle Join • Default choice – Always works (I’ve sorted a petabyte!) – Worst case scenario • Each process – Reads from part of one of the tables – Buckets and sorts on join key – Sends one bucket to each reduce • Works every time!
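The shuffle join steps on slide 22 can be sketched in plain Python as a single-process simulation. This is a toy model, not Hive's implementation: the function name, the table tagging scheme, and the use of Python's built-in `hash` for routing are assumptions for illustration.

```python
from collections import defaultdict

def shuffle_join(left, right, num_reduces):
    """left/right: lists of (key, value). Returns joined (key, lval, rval) rows."""
    # Map side: tag each record with its table and route it by hash of the join key.
    buckets = [[] for _ in range(num_reduces)]
    for tag, table in ((0, left), (1, right)):
        for key, value in table:
            buckets[hash(key) % num_reduces].append((key, tag, value))

    # Reduce side: sort each bucket on the join key, then pair up matching keys.
    joined = []
    for bucket in buckets:
        bucket.sort(key=lambda rec: (rec[0], rec[1]))
        by_key = defaultdict(lambda: ([], []))
        for key, tag, value in bucket:
            by_key[key][tag].append(value)
        for key, (lvals, rvals) in by_key.items():
            joined.extend((key, l, r) for l in lvals for r in rvals)
    return joined

orders = [(1, "o1"), (2, "o2"), (1, "o3")]
customers = [(1, "alice"), (3, "carol")]
print(sorted(shuffle_join(orders, customers, num_reduces=4)))
```

Every record from both tables crosses the network to a reduce, which is why this plan always works but is also the worst case for cost.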
  • 23. Map Join • One table is small (e.g. dimension table) – Fits in memory • Each process – Reads small table into memory hash table – Streams through part of the big file – Joins each record against the hash table • Very fast, but limited
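A map join as described on slide 23 amounts to a hash join where the small table is loaded into memory and the big table is streamed. The Python sketch below shows the shape of it; the function name and the sample dimension and fact rows are assumptions for illustration.

```python
def map_join(small, big_stream):
    """Join a small in-memory table against a stream of (key, value) records."""
    # Build phase: load the small table into an in-memory hash table.
    table = {}
    for key, value in small:
        table.setdefault(key, []).append(value)
    # Probe phase: stream the big side once; no sort or shuffle is needed.
    for key, value in big_stream:
        for small_value in table.get(key, ()):
            yield key, value, small_value

countries = [(1, "US"), (2, "DE")]                     # small dimension table
sales = [(1, 9.5), (2, 3.0), (1, 4.25), (7, 1.0)]      # part of the big table
print(list(map_join(countries, sales)))
```

The limit is memory: the approach only holds while the small table's hash table fits in each process.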
  • 24. Sort Merge Bucket (SMB) Join • If both tables are: – Sorted the same – Bucketed the same – And joining on the sort/bucket column • Each process: – Reads a bucket from each table – Processes the row with the lowest value • Very efficient if applicable
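Within one pair of matching buckets, the SMB join on slide 24 behaves like a classic merge join over two sorted inputs. The Python sketch below illustrates that merge step only; the function name and sample rows are assumptions, and it presumes both inputs are already sorted on the join key.

```python
def merge_join(left, right):
    """Join two lists of (key, value) that are already sorted by key."""
    joined = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # advance the side holding the lowest key
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit all pairs.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    joined.append((lkey, lval, rval))
            i, j = i_end, j_end
    return joined

bucket_a = [(1, "a"), (2, "b"), (2, "c"), (5, "d")]
bucket_b = [(2, "x"), (3, "y"), (5, "z")]
print(merge_join(bucket_a, bucket_b))
```

Because each side is read once in order, no hash table and no extra sort are needed, which is why the plan is so efficient when the layout preconditions hold.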
  • 25. Debugging What could possibly go wrong?
  • 26. Performance Question • Which of the following is faster? – select count(distinct(Col)) from Tbl – select count(*) from (select distinct(Col) from Tbl) t
  • 27. Count Distinct
  • 28. Answer • Surprisingly the second is usually faster – In the first case: maps send each value to the reduce; a single reduce counts them all – In the second case: maps split up the values to many reduces; each reduce generates its list; a final job counts the size of each list • Singleton reduces are almost always BAD
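The difference between the two plans on slide 28 can be seen in a toy Python simulation that tracks how many distinct values the busiest reduce has to handle. This is a simplified model, not Hive's execution engine; the function names, the use of Python's `hash` for partitioning, and the sample data are assumptions.

```python
def single_reduce_plan(values):
    """count(distinct Col): a single reduce sees every distinct value."""
    received = set(values)
    # Returns (answer, distinct values handled by the busiest reduce).
    return len(received), len(received)

def two_stage_plan(values, num_reduces):
    """count(*) over a distinct subquery: many reduces dedupe, a final job sums."""
    partitions = [set() for _ in range(num_reduces)]
    for v in values:
        # Equal values always hash to the same reduce, so each dedupes its share.
        partitions[hash(v) % num_reduces].add(v)
    sizes = [len(p) for p in partitions]
    # The final job only adds up one small number per reduce.
    return sum(sizes), max(sizes)

values = [i % 10_000 for i in range(100_000)]
print(single_reduce_plan(values))   # (10000, 10000)
print(two_stage_plan(values, 10))   # (10000, 1000)
```

Both plans return the same count, but in the second one the heaviest reduce does a tenth of the work, at the cost of one extra (cheap) job.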
  • 29. Communication is Good! • Hive doesn’t tell you what is wrong. – Expects you to know! – “Lucy, you have some ‘splaining to do!” • Explain tool provides query plan – Filters on input – Numbers of jobs – Numbers of maps and reduces – What the jobs are sorting by – Which directories they are reading or writing
  • 30. Blinded by Science • The explanation tool is confusing. – It takes practice to understand. – It doesn’t include some critical details like partition pruning. • Running the query makes things clearer! – Pay attention to the details – Look at JobConf and job history files
  • 31. Skew • Skew is typical in real datasets. • A user complained that his job was slow – He had 100 reduces – 98 of them finished fast – 2 ran really slowly • The key was a boolean…
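The boolean-key story on slide 31 is easy to reproduce in a small Python simulation of hash partitioning. The function name, the record count, and the use of Python's `hash` as the partitioner are assumptions for illustration; the point is only that a key with two possible values can keep at most two reduces busy.

```python
from collections import Counter

def reduce_loads(keys, num_reduces):
    """Count how many records each reduce receives under hash partitioning."""
    loads = Counter(hash(k) % num_reduces for k in keys)
    return [loads.get(r, 0) for r in range(num_reduces)]

# Boolean key: only True or False ever occurs.
keys = [i % 3 == 0 for i in range(30_000)]
loads = reduce_loads(keys, num_reduces=100)
busy = [n for n in loads if n > 0]
print(len(busy), sorted(busy))  # 2 [10000, 20000]
```

Adding more reduces does nothing here: 98 of the 100 receive no records and finish immediately, while the other two carry the entire job.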
  • 32. Root Cause Analysis • Ambari – Apache project building a Hadoop installation and management tool – Provides metrics (Ganglia & Nagios) • Root Cause Analysis – Processes MapReduce job logs – Displays timing of each part of query plan
  • 33. Root Cause Analysis Screenshots
  • 34. Root Cause Analysis Screenshots
  • 35. Thank You! Questions & Answers @owen_omalley
  • 36. ORCFile - Comparison
                                    RC File   Trevni   ORC File
    Hive Type Model                 N         N        Y
    Separate complex columns        N         Y        Y
    Splits found quickly            N         Y        Y
    Default column group size       4MB       64MB*    250MB
    Files per a bucket              1         >1       1
    Store min, max, sum, count      N         N        Y
    Versioned metadata              N         Y        Y
    Run length data encoding        N         N        Y
    Store strings in dictionary     N         N        Y
    Store row count                 N         Y        Y
    Skip compressed blocks          N         N        Y
    Store internal indexes          N         N        Y