ORC File – Optimizing Your Big Data
Owen O’Malley, Co-founder Hortonworks
Apache Hadoop, Hive, ORC, and
2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
In the Beginning…
▪ Hadoop applications used text or SequenceFile
– Text is slow and not splittable when compressed
– SequenceFile only supports key and value and user-defined serialization
▪ Hive added RCFile
– User controls the columns to read and decompress
– No type information and user-defined serialization
– Finding splits was expensive
▪ Avro files created
– Type information included!
– Had to read and decompress entire row
ORC File Basics
▪ Columnar format
– Enables user to read & decompress just the bytes they need
▪ Fast
– See
▪ Indexed
▪ Self-describing
– Includes all of the information about types and encoding
▪ Rich type system
– All of Hive’s types including timestamp, struct, map, list, and union

File Compatibility
▪ Backwards compatibility
– Automatically detect the version of the file and read it.
▪ Forward compatibility
– Most changes are made so old readers will read the new files
– Maintain the ability to write old files via orc.write.format
– Always write the old version until your last cluster has upgraded
▪ Current file versions
– 0.11 – Original version
– 0.12 – Updated run length encoding (RLE)
File Structure
▪ File contains a list of stripes, which are sets of rows
– Default size is 64MB
– Large stripe size enables efficient reads
▪ Footer
– Contains the list of stripe locations
– Type description
– File and stripe statistics
▪ Postscript
– Compression parameters
– File format version
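Because the footer and postscript sit at the end, a reader starts from the tail of the file: in ORC the final byte holds the length of the postscript, and the postscript tells the reader where the footer with the stripe locations is. The sketch below mimics that idea with a toy format that uses JSON instead of ORC's Protocol Buffers encoding. The function names and JSON fields are hypothetical, and this is not the real ORC byte layout.

```python
import json

def write_toy_file(stripes):
    """Concatenate stripes, then a footer, a postscript, and a 1-byte postscript length."""
    body = b""
    locations = []
    for data in stripes:
        locations.append({"offset": len(body), "length": len(data)})
        body += data
    footer = json.dumps({"stripes": locations}).encode("utf-8")
    postscript = json.dumps(
        {"footerLength": len(footer), "compression": "NONE"}
    ).encode("utf-8")
    assert len(postscript) < 256  # the length has to fit in the final byte
    return body + footer + postscript + bytes([len(postscript)])

def read_stripe(blob, index):
    """Locate one stripe by walking backwards from the end of the file."""
    ps_len = blob[-1]                       # last byte: postscript length
    ps_start = len(blob) - 1 - ps_len
    postscript = json.loads(blob[ps_start:len(blob) - 1])
    footer_start = ps_start - postscript["footerLength"]
    footer = json.loads(blob[footer_start:ps_start])
    loc = footer["stripes"][index]          # footer lists the stripe locations
    return blob[loc["offset"]:loc["offset"] + loc["length"]]

toy = write_toy_file([b"stripe-0", b"stripe-one", b"s2"])
second = read_stripe(toy, 1)
```

Only the tail and the requested stripe are touched, which is the property that makes large stripes cheap to locate.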
Stripe Structure
▪ Indexes
– Offsets to jump to start of row group
– Row group size defaults to 10,000 rows
– Minimum, Maximum, and Count of each column
▪ Data
– Data for the stripe organized by column
▪ Footer
– List of stream locations
– Column encoding information
File Layout
[Figure: file layout. The file is a sequence of stripes, each made up of Index Data, Row Data, and a Stripe Footer; the index and row data of a stripe are broken out per column (Column 1 through Column 8). The File Footer and File Metadata follow the last stripe.]

Schema Evolution
▪ ORC now supports schema evolution
– Hive 2.1 – append columns or type conversion
– Upcoming Hive 2.3 – map columns or inner structures by name
– User passes desired schema to ORC reader
▪ Type conversions
– Most types will convert, although some conversions are ugly.
– If the value doesn’t fit in the new type, it will become null.
▪ Cautions
– Name mapping requires ORC files written by Hive ≥ 2.0
– Some of the type conversions are slow
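The "doesn't fit, becomes null" rule is easiest to see with integer narrowing. The following is a minimal sketch of that rule only, not ORC's conversion code; the helper name and the bit-width table are assumptions made for illustration.

```python
# Signed bit widths of Hive's integer types.
INT_BITS = {"tinyint": 8, "smallint": 16, "int": 32, "bigint": 64}

def convert_integer(value, target_type):
    """Return value if it fits in target_type, otherwise None (null)."""
    bits = INT_BITS[target_type]
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return value if low <= value <= high else None

column = [1, 127, 128, 70000, -32768]
as_smallint = [convert_integer(v, "smallint") for v in column]
as_tinyint = [convert_integer(v, "tinyint") for v in column]
```

Reading the column as smallint loses only 70000; reading it as tinyint also nulls 128 and -32768.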
Using ORC
From Hive or Presto
▪ Modify your table definition:
– create table my_table (
name string,
address string
) stored as orc;
▪ Import data:
– insert overwrite table my_table select * from my_staging;
▪ Use either configuration or table properties
– tblproperties ("orc.compress"="NONE")
– set hive.exec.orc.default.compress=NONE;
From Java
▪ Use the ORC project rather than Hive’s ORC.
– Hive’s master branch uses it.
– Maven group id: org.apache.orc version: 1.4.0
– nohive classifier avoids interfering with Hive’s packages
▪ Two levels of access
– orc-core – Faster access, but uses Hive’s vectorized API
– orc-mapreduce – Row by row access, simpler OrcStruct API
▪ MapReduce API implements WritableComparable
– Can be shuffled
– Need to specify type information in configuration for shuffle or output

From C++
▪ Pure C++ client library
– No JNI or JDK so client can estimate and control memory
▪ Combine with pure C++ HDFS client from HDFS-8707
– Work ongoing in feature branch, but should be committed soon.
▪ Reader is stable and in production use.
▪ Alibaba has created a writer and is contributing it to Apache ORC.
– Should be in the next release, ORC 1.5.0.
Command Line
▪ Using hive --orcfiledump from Hive
– -j -p – pretty prints the metadata as JSON
– -d – prints data as JSON
▪ Using java -jar orc-tools-1.4.0-uber.jar from ORC
– meta – print the metadata as JSON
– data – print data as JSON
– convert – convert JSON to ORC
– json-schema – scan a set of JSON documents to find the matching schema
Stripe Size
▪ Makes a huge difference in performance
– orc.stripe.size or hive.exec.orc.default.stripe.size
– Controls the amount of buffer in writer. Default is 64MB
– Trade-off
• Large stripes = larger, more efficient reads
• Small stripes = less memory and more granular processing splits
▪ Multiple files written at the same time will shrink stripes
– Use Hive’s hive.optimize.sort.dynamic.partition
– Sorting dynamic partitions means one writer at a time
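One way to see why many concurrent writers shrink stripes is to assume that all writers in a task share a fixed memory pool, and that each writer's stripe buffer is scaled down once the requested total exceeds the pool. The sketch below shows only that arithmetic. The pool size is a made-up number and the function is hypothetical; it is not ORC's memory manager.

```python
MB = 1024 * 1024

def effective_stripe_size(stripe_size, open_writers, memory_pool):
    """Stripe size each writer can actually buffer when sharing memory_pool."""
    requested = stripe_size * open_writers
    scale = min(1.0, memory_pool / requested)
    return int(stripe_size * scale)

# Assumed 1 GB pool shared by every writer in the task.
pool = 1024 * MB
few_writers = effective_stripe_size(64 * MB, 4, pool)
many_writers = effective_stripe_size(64 * MB, 64, pool)
```

Under these assumptions, 4 writers keep the full 64MB stripes while 64 writers (for example, one per dynamic partition) drop to 16MB each, which is why sorting so that only one writer is open at a time helps.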

HDFS Block Padding
▪ The stripes don’t align exactly with HDFS blocks
▪ HDFS scatters blocks around cluster
▪ Often want to pad to block boundaries
– Costs space, but improves performance
– hive.exec.orc.default.block.padding – true
– hive.exec.orc.block.padding.tolerance – 0.05
[Figure: stripes (Index Data, Row Data, Stripe Footer) followed by the File Footer and File Metadata, drawn against HDFS block boundaries.]
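The padding settings above can be read as a placement decision for each new stripe. The sketch below is a simplified reading of how the tolerance is documented for Hive (a fraction of the stripe size); the real writer makes this decision as stripes are flushed and its details may differ. The function name and return shape are assumptions for illustration.

```python
MB = 1024 * 1024

def plan_next_stripe(offset, block_size, stripe_size, tolerance=0.05):
    """Decide how to place the stripe that would start at file offset `offset`.

    Returns (action, padding_bytes, stripe_bytes).
    """
    available = block_size - (offset % block_size)
    if available >= stripe_size:
        return ("write", 0, stripe_size)      # a full stripe fits in this block
    if available > tolerance * stripe_size:
        return ("shrink", 0, available)       # write a smaller stripe to fill the block
    return ("pad", available, stripe_size)    # pad the gap, start at the next block

# 256MB HDFS blocks and 64MB stripes: tolerance is 0.05 * 64MB = 3.2MB.
start_of_block = plan_next_stripe(0, 256 * MB, 64 * MB)
near_end = plan_next_stripe(200 * MB, 256 * MB, 64 * MB)
almost_full = plan_next_stripe(254 * MB, 256 * MB, 64 * MB)
```

With 56MB left in the block the stripe is shrunk to fit; with only 2MB left (below the 3.2MB tolerance) the gap is padded, which is the space cost the slide mentions.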
Predicate Push Down
▪ Reader is given a SearchArg
– Limited set of predicates over a column and a literal value
– Reader will skip over any parts of file that can’t contain valid rows
▪ ORC indexes at three levels:
– File
– Stripe
– Row Group (10k rows)
▪ Reader still needs to apply predicate to filter out single rows
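The skipping described above can be sketched with per-row-group minimum and maximum values. This is an illustration of the idea in plain Python, not the ORC reader or the SearchArg API; the row group size is shrunk to 4 so the example stays readable, and all names are hypothetical.

```python
ROW_GROUP_SIZE = 4  # ORC defaults to 10,000 rows; kept tiny for the example

def build_index(values, group_size=ROW_GROUP_SIZE):
    """Record min and max for each row group of a single column."""
    index = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        index.append(
            {"start": start, "count": len(group), "min": min(group), "max": max(group)}
        )
    return index

def scan_equals(values, index, literal):
    """Evaluate `column = literal`, skipping row groups that cannot match."""
    hits, rows_read = [], 0
    for entry in index:
        if literal < entry["min"] or literal > entry["max"]:
            continue  # the whole row group is skipped without reading it
        end = entry["start"] + entry["count"]
        rows_read += entry["count"]
        # Min/max only narrows the search; each row is still checked.
        hits += [i for i in range(entry["start"], end) if values[i] == literal]
    return hits, rows_read

sorted_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
hits, rows_read = scan_equals(sorted_values, build_index(sorted_values), 7)
```

On the sorted column only one of the three row groups is read. The same values in shuffled order give every group a wide min/max range, so nothing can be skipped.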
Row Pruning
▪ Every primitive column has minimum and maximum at each level
– Sorting your data within a file helps a lot
– Consider sorting instead of making lots of partitions
▪ Writer can optionally include Bloom filters
– Provides a probabilistic bitmap of hashcodes
– Only works with equality predicates at the row group level
– Requires significant space in the file
– Manually enabled by using orc.bloom.filter.columns
– Use orc.bloom.filter.fpp to set the false positive rate (default 0.05)
– Set the default charset in JVM via -Dfile.encoding=UTF-8
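To make "a probabilistic bitmap of hashcodes" concrete, here is a minimal Bloom filter sketch using the standard sizing formulas for a target false positive rate. It is not ORC's implementation: SHA-256 is swapped in as the hash purely so the example needs only the standard library, and the class and method names are hypothetical.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, expected_entries, fpp=0.05):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(
            1, math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2))
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_entries * math.log(2))
        )
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Values are hashed as bytes, so the text encoding must be fixed.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means definitely absent; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )

row_group_filter = BloomFilter(expected_entries=1000, fpp=0.05)
for order_key in range(0, 2000, 2):
    row_group_filter.add(order_key)
```

A reader with an equality predicate can skip a row group whenever `might_contain` returns False. A lower false positive rate needs more bits per entry, which is the space cost noted above.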
Row Pruning Example
– from tpch1000.lineitem where l_orderkey = 1212000001;
▪ Rows Read
– Nothing – 5,999,989,709
– Min/Max – 540,000
– BloomFilter – 10,000
▪ Time Taken
– Nothing – 74 sec
– Min/Max – 4.5 sec
– BloomFilter – 1.3 sec
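The figures above can be restated in row groups and speedups. The short calculation below uses only the numbers on this slide plus the 10,000-row default row group size mentioned earlier in the deck; the variable and function names are illustrative.

```python
ROW_GROUP_SIZE = 10_000

rows_read = {"Nothing": 5_999_989_709, "Min/Max": 540_000, "BloomFilter": 10_000}
seconds = {"Nothing": 74.0, "Min/Max": 4.5, "BloomFilter": 1.3}

def row_groups(rows):
    """Number of row groups needed to hold `rows` rows (ceiling division)."""
    return -(-rows // ROW_GROUP_SIZE)

def speedup(strategy):
    """How many times faster a strategy is than reading everything."""
    return seconds["Nothing"] / seconds[strategy]

groups = {name: row_groups(count) for name, count in rows_read.items()}
```

Min/max statistics narrow the scan to 54 row groups, and the Bloom filter narrows it to a single row group, for roughly 16x and 57x less time than the unpruned scan.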

Split Calculation
▪ Hive’s OrcInputFormat has three strategies for split calculation
– BI
• Small fast queries
• Splits based on HDFS blocks
– ETL
• Large queries
• Read file footer and apply SearchArg to stripes
• Can include footer in splits (hive.orc.splits.include.file.footer)
– Hybrid
• If small files or lots of files, use BI
• Otherwise, use ETL
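The hybrid rule can be sketched as a small decision function. The thresholds and their names below are assumptions chosen for illustration; Hive's actual heuristics and configuration keys are not reproduced here.

```python
MB = 1024 * 1024

def choose_split_strategy(file_sizes, max_split_size, file_count_threshold):
    """Pick "BI" or "ETL" the way a hybrid strategy might.

    BI avoids opening file footers, so it wins when files are small or numerous;
    ETL reads footers so it can prune stripes for large scans.
    """
    if not file_sizes:
        return "BI"
    average = sum(file_sizes) / len(file_sizes)
    small_files = average < max_split_size
    many_files = len(file_sizes) > file_count_threshold
    return "BI" if small_files or many_files else "ETL"

few_large = choose_split_strategy([512 * MB] * 10, 256 * MB, 50)
few_small = choose_split_strategy([1 * MB] * 10, 256 * MB, 50)
many_large = choose_split_strategy([512 * MB] * 100, 256 * MB, 50)
```

Only the "few large files" case selects ETL; the other two fall back to BI because reading every footer would cost more than it saves.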
LLAP – Live Long & Process
▪ Provides a persistent service to speed up Hive
– Caches ORC and text data
– Saves costs of YARN container & JVM spin up
– JIT finishes after first few seconds
▪ Cache uses ORC’s RLE
– Decompresses zlib or Snappy
– RLE is fast and saves memory
– Automatically caches hot columns and partitions
▪ Allows Spark to use Hive’s column and row security
Current Work In Progress
Speed Improvements for ACID
▪ Hive supports ACID transactions on ORC tables
– Uses delta files in HDFS to store changes to each partition
– Delta files store insert/update/delete operations
– Used to support SQL insert commands
▪ Unfortunately, update operations don’t allow predicate push down on the deltas
▪ In the upcoming Hive 2.3, we added a new ACID layout
– It changes each update into an insert and a delete
– Allows predicate pushdown even on the delta files
▪ Also added SQL merge command in Hive 2.2
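The new layout's core idea, turning each update into a delete plus an insert, can be sketched as a rewrite over a list of change events. The event dictionaries and field names are hypothetical and much simpler than Hive's actual delta file format.

```python
def split_updates(events):
    """Rewrite each update event as a delete followed by an insert."""
    out = []
    for event in events:
        if event["op"] == "update":
            # The old version of the row is removed by its identifier...
            out.append({"op": "delete", "row_id": event["row_id"]})
            # ...and the new version is written as an ordinary inserted row.
            out.append({"op": "insert", "row": event["row"]})
        else:
            out.append(event)
    return out

changes = [
    {"op": "insert", "row": {"name": "a", "qty": 1}},
    {"op": "update", "row_id": 17, "row": {"name": "b", "qty": 5}},
    {"op": "delete", "row_id": 3},
]
rewritten = split_updates(changes)
```

After the rewrite no event mixes "which row" with "new values", so inserted rows can be treated like any other rows when predicates are evaluated.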

Column Encryption (ORC-14)
▪ Allows users to encrypt some of the columns of the file
– Provides column level security even with access to raw files
– Uses Key Management Server from Ranger or Hadoop
– Includes both the data and the index
– Daily key rolling can anonymize data after 90 days
▪ User specifies how data is masked if user doesn’t have access
– Nullify
– Redact
– SHA256
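The three masking options can be illustrated on a single string value. This is a sketch of what each option means, not the ORC-14 implementation; in particular, the redaction characters (X, x, 9) and the function names are assumptions.

```python
import hashlib

def redact(text):
    """Replace letters and digits but keep the shape of the value."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("X" if ch.isupper() else "x")
        else:
            out.append(ch)  # punctuation and spaces are left alone
    return "".join(out)

def mask(value, method):
    """Return what a reader without access to the column would see."""
    if method == "nullify":
        return None
    if method == "redact":
        return redact(value)
    if method == "sha256":
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    raise ValueError(f"unknown mask: {method}")

masked = {m: mask("Jane Doe 42", m) for m in ("nullify", "redact", "sha256")}
```

Nullify hides everything, redact keeps the format visible, and SHA256 keeps equal values equal so the masked column can still be joined or grouped on.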
Thank You

More Related Content

What's hot

Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
DataWorks Summit
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Hive tuning
Hive tuningHive tuning
Hive tuning
Michael Zhang
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
DataWorks Summit
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
Benjamin Leonhardi
Disaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoFDisaggregating Ceph using NVMeoF
Disaggregating Ceph using NVMeoF
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
DataWorks Summit
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
Cloudera, Inc.
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
Vinoth Chandar
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
DataWorks Summit
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
DataWorks Summit/Hadoop Summit

What's hot (20)

Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
High throughput data replication over RAFT
High throughput data replication over RAFTHigh throughput data replication over RAFT
High throughput data replication over RAFT
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud

ORC File - Optimizing Your Big Data

  • 1. ORC File – Optimizing Your Big Data
      Owen O’Malley, Co-founder Hortonworks
      Apache Hadoop, Hive, ORC, and Incubator
      @owen_omalley
  • 2. Overview
  • 3. In the Beginning…
      • Hadoop applications used text or SequenceFile
          – Text is slow and not splittable when compressed
          – SequenceFile only supports key and value and user-defined serialization
      • Hive added RCFile
          – User controls the columns to read and decompress
          – No type information and user-defined serialization
          – Finding splits was expensive
      • Avro files created
          – Type information included!
          – Had to read and decompress entire row
  • 4. ORC File Basics
      • Columnar format
          – Enables user to read & decompress just the bytes they need
      • Fast
          – See
      • Indexed
      • Self-describing
          – Includes all of the information about types and encoding
      • Rich type system
          – All of Hive’s types including timestamp, struct, map, list, and union
  • 5. File Compatibility
      • Backwards compatibility
          – Automatically detect the version of the file and read it.
      • Forward compatibility
          – Most changes are made so old readers will read the new files
          – Maintain the ability to write old files via orc.write.format
          – Always write old version until your last cluster upgrades
      • Current file versions
          – 0.11 – Original version
          – 0.12 – Updated run length encoding (RLE)
  • 6. File Structure
      • File contains a list of stripes, which are sets of rows
          – Default size is 64MB
          – Large stripe size enables efficient reads
      • Footer
          – Contains the list of stripe locations
          – Type description
          – File and stripe statistics
      • Postscript
          – Compression parameters
          – File format version
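Because the footer and postscript sit at the end of the file, a reader works backwards from the tail. The following is a minimal sketch of that first step, assuming (from the ORC specification rather than from this slide) that the final byte of the file stores the postscript length. The byte strings are synthetic stand-ins; real postscripts and footers are Protocol Buffers messages, and `split_tail` is a hypothetical helper name.

```python
def split_tail(data: bytes) -> tuple[bytes, bytes]:
    """Return (everything before the postscript, postscript bytes).

    Assumes the final byte stores the postscript length, so the
    postscript occupies the bytes immediately before that final byte.
    """
    if not data:
        raise ValueError("empty file")
    ps_len = data[-1]
    if ps_len + 1 > len(data):
        raise ValueError("postscript length exceeds file size")
    ps_start = len(data) - 1 - ps_len
    return data[:ps_start], data[ps_start:-1]


# Synthetic layout: stripes, footer, postscript, then one length byte.
stripes = b"STRIPE-1" + b"STRIPE-2"
footer = b"FOOTER"
postscript = b"PS:zlib:0.12"
fake_file = stripes + footer + postscript + bytes([len(postscript)])

body, ps = split_tail(fake_file)
```

Once the postscript is decoded, it tells the reader how the footer is compressed and how long it is, which is why it is stored uncompressed at a position that needs no prior knowledge to find.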
  • 7. Stripe Structure
      • Indexes
          – Offsets to jump to start of row group
          – Row group size defaults to 10,000 rows
          – Minimum, Maximum, and Count of each column
      • Data
          – Data for the stripe organized by column
      • Footer
          – List of stream locations
          – Column encoding information
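The index described above can be pictured as one small summary record per row group. This is a toy sketch, not ORC's on-disk index format: `RowGroupStats` and `build_index` are hypothetical names, and `first_row` stands in for the stream offsets a real index would store.

```python
from dataclasses import dataclass

ROW_GROUP_SIZE = 10_000  # default row group size from the slide


@dataclass
class RowGroupStats:
    first_row: int   # position to jump to (stands in for stream offsets)
    count: int
    minimum: int
    maximum: int


def build_index(values: list[int], group_size: int = ROW_GROUP_SIZE) -> list[RowGroupStats]:
    """Summarize one column of one stripe, one entry per row group."""
    index = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        index.append(RowGroupStats(start, len(group), min(group), max(group)))
    return index


column = list(range(25_000))          # 25,000 rows -> 3 row groups
index = build_index(column)
```

The last row group is simply shorter than the others, which is why the count is stored alongside the minimum and maximum.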
  • 8. File Layout
      [Figure: file layout. Three ~64MB stripes, each made up of Index Data, Row Data, and a Stripe Footer, followed by File Metadata, the File Footer, and the Postscript. Row Data is broken out into Column 1 through Column 8.]
  • 9. Schema Evolution
      • ORC now supports schema evolution
          – Hive 2.1 – append columns or type conversion
          – Upcoming Hive 2.3 – map columns or inner structures by name
          – User passes desired schema to ORC reader
      • Type conversions
          – Most types will convert although some are ugly.
          – If the value doesn’t fit in the new type, it will become null.
      • Cautions
          – Name mapping requires ORC files written by Hive ≥ 2.0
          – Some of the type conversions are slow
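The "doesn't fit, becomes null" rule can be illustrated with integer narrowing. This is a sketch of the rule as stated on the slide, not ORC's conversion code; `INT_RANGES` and `convert_int` are hypothetical names, and only the standard signed ranges of the integer types are assumed.

```python
from typing import Optional

# Signed two's-complement ranges for the integer types.
INT_RANGES = {
    "tinyint": (-2**7, 2**7 - 1),
    "smallint": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
    "bigint": (-2**63, 2**63 - 1),
}


def convert_int(value: Optional[int], target_type: str) -> Optional[int]:
    """Return value if it fits target_type, else None (null)."""
    if value is None:
        return None
    low, high = INT_RANGES[target_type]
    return value if low <= value <= high else None


stored = [12, 40_000, -7, None]       # written as int
as_smallint = [convert_int(v, "smallint") for v in stored]
```

Here 40,000 exceeds the smallint maximum of 32,767, so a reader asking for smallint sees null in that position while the other values pass through unchanged.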
  • 10. Using ORC
  • 11. From Hive or Presto
      • Modify your table definition:
          – create table my_table ( name string, address string ) stored as orc;
      • Import data:
          – insert overwrite table my_table select * from my_staging;
      • Use either configuration or table properties
          – tblproperties ("orc.compress"="NONE")
          – set hive.exec.orc.default.compress=NONE;
  • 12. From Java
      • Use the ORC project rather than Hive’s ORC.
          – Hive’s master branch uses it.
          – Maven group id: org.apache.orc version: 1.4.0
          – nohive classifier avoids interfering with Hive’s packages
      • Two levels of access
          – orc-core – Faster access, but uses Hive’s vectorized API
          – orc-mapreduce – Row by row access, simpler OrcStruct API
      • MapReduce API implements WritableComparable
          – Can be shuffled
          – Need to specify type information in configuration for shuffle or output
  • 13. From C++
      • Pure C++ client library
          – No JNI or JDK so client can estimate and control memory
      • Combine with pure C++ HDFS client from HDFS-8707
          – Work ongoing in feature branch, but should be committed soon.
      • Reader is stable and in production use.
      • Alibaba has created a writer and is contributing it to Apache ORC.
          – Should be in the next release ORC 1.5.0.
  • 14. Command Line
      • Using hive --orcfiledump from Hive
          – -j -p – pretty prints the metadata as JSON
          – -d – prints data as JSON
      • Using java -jar orc-tools-1.4.0-uber.jar from ORC
          – meta – print the metadata as JSON
          – data – print data as JSON
          – convert – convert JSON to ORC
          – json-schema – scan a set of JSON documents to find the matching schema
  • 15. Optimization
  • 16. Stripe Size
      • Makes a huge difference in performance
          – orc.stripe.size or hive.exec.orc.default.stripe.size
          – Controls the amount of buffer in writer. Default is 64MB
          – Trade off
              • Large stripes = larger, more efficient reads
              • Small stripes = less memory and more granular processing splits
      • Multiple files written at the same time will shrink stripes
          – Use Hive’s hive.optimize.sort.dynamic.partition
          – Sorting dynamic partitions means one writer at a time
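The stripe-shrinking effect can be shown with simple arithmetic. This sketch assumes, as a simplification, that all open writers share one fixed memory pool and that every writer's stripe buffer is scaled down by the same factor once the total request exceeds the pool. The 512MB pool and the function name are hypothetical, not ORC settings.

```python
MB = 1024 * 1024


def effective_stripe_size(stripe_size: int, writers: int, pool_size: int) -> int:
    """Stripe buffer each writer gets when `writers` files are open at once."""
    requested = stripe_size * writers
    if requested <= pool_size:
        return stripe_size
    # Scale every writer down by the same factor so the total fits the pool.
    return stripe_size * pool_size // requested


pool = 512 * MB                        # hypothetical writer memory pool
for n in (1, 8, 64):
    print(n, "writers:", effective_stripe_size(64 * MB, n, pool) // MB, "MB per stripe")
```

Under these assumptions, 64 concurrent writers each end up with 8MB stripes instead of 64MB, which is the situation that sorting dynamic partitions (one writer at a time) is meant to avoid.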
  • 17. HDFS Block Padding
      • The stripes don’t align exactly with HDFS blocks
      • HDFS scatters blocks around cluster
      • Often want to pad to block boundaries
          – Costs space, but improves performance
          – hive.exec.orc.default.block.padding – true
          – hive.exec.orc.block.padding.tolerance – 0.05
      [Figure: three ~64MB stripes (Index Data, Row Data, Stripe Footer) laid over two HDFS blocks, with Padding inserted at the block boundary, followed by File Metadata, the File Footer, and the Postscript.]
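A padding decision of this kind can be sketched as follows. This is a simplified model, not the ORC writer's actual logic: it assumes the writer pads only when the space left in the current block is no larger than the tolerance multiplied by the stripe size, and otherwise lets the stripe straddle the boundary. The function name and the 256MB block size are hypothetical.

```python
MB = 1024 * 1024


def padding_before_stripe(offset: int, stripe_len: int, block_size: int,
                          default_stripe: int, tolerance: float = 0.05) -> int:
    """Bytes of padding to write before a stripe that starts at `offset`."""
    remaining = block_size - (offset % block_size)
    if stripe_len <= remaining:
        return 0                      # stripe fits in the current block
    if remaining <= tolerance * default_stripe:
        return remaining              # small gap: pad to the block boundary
    return 0                          # gap too large to waste; stripe straddles


block = 256 * MB
stripe = 64 * MB
# 3 MB left in the block: 3 <= 0.05 * 64 = 3.2, so pad.
pad_small = padding_before_stripe(block - 3 * MB, stripe, block, stripe)
# 10 MB left in the block: above the tolerance, so no padding.
pad_large = padding_before_stripe(block - 10 * MB, stripe, block, stripe)
```

The tolerance caps how much space is wasted per block: with 0.05 and 64MB stripes, at most about 3.2MB of padding is written before a stripe.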
  • 18. Predicate Push Down
     Reader is given a SearchArg
      – Limited set of predicates over a column and a literal value
      – Reader will skip over any parts of the file that can’t contain valid rows
     ORC indexes at three levels:
      – File
      – Stripe
      – Row Group (10k rows)
     Reader still needs to apply the predicate to filter out single rows
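A toy model of row-group skipping may help here. It is not the ORC reader: it assumes a single integer column, a range predicate, and per-row-group minimum and maximum statistics. `RowGroupStats`, `build_stats`, `row_groups_to_read`, and `scan` are names made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    first_row: int
    minimum: int
    maximum: int


def build_stats(rows, rows_per_group=10_000):
    """Record the min and max of each row group, as a writer would."""
    stats = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        stats.append(RowGroupStats(start, min(chunk), max(chunk)))
    return stats


def row_groups_to_read(stats, lower, upper):
    """Keep only row groups whose [min, max] overlaps [lower, upper]."""
    return [s for s in stats if s.maximum >= lower and s.minimum <= upper]


def scan(rows, stats, lower, upper, rows_per_group=10_000):
    selected = []
    for group in row_groups_to_read(stats, lower, upper):
        chunk = rows[group.first_row:group.first_row + rows_per_group]
        # Statistics only rule groups out; every surviving row is
        # still checked against the predicate.
        selected.extend(v for v in chunk if lower <= v <= upper)
    return selected


if __name__ == "__main__":
    data = list(range(30_000))  # sorted data gives tight min/max ranges
    index = build_stats(data)
    print(len(row_groups_to_read(index, 12_000, 12_005)), "of",
          len(index), "row groups read")
```

The same overlap test applies at the file and stripe levels; only the granularity of the statistics changes.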
  • 19. Row Pruning
     Every primitive column has minimum and maximum at each level
      – Sorting your data within a file helps a lot
      – Consider sorting instead of making lots of partitions
     Writer can optionally include bloom filters
      – Provides a probabilistic bitmap of hashcodes
      – Only works with equality predicates at the row group level
      – Requires significant space in the file
      – Manually enabled by using orc.bloom.filter.columns
      – Use orc.bloom.filter.fpp to set the false positive rate (default 0.05)
      – Set the default charset in JVM via -Dfile.encoding=UTF-8
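To make the "probabilistic bitmap of hashcodes" concrete, here is a generic textbook bloom filter, sized from a target false positive rate. It is a sketch and not ORC's implementation: the SHA-256 based double hashing and the class name are assumptions chosen so the example needs only the standard library. Strings are encoded as UTF-8 explicitly, since the hash of a string depends on its byte encoding.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_entries, fpp=0.05):
        # Standard sizing formulas for a target false positive rate.
        bits = -expected_entries * math.log(fpp) / (math.log(2) ** 2)
        self.num_bits = max(1, math.ceil(bits))
        self.num_hashes = max(
            1, round(self.num_bits / expected_entries * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


if __name__ == "__main__":
    row_group = BloomFilter(expected_entries=10_000, fpp=0.05)
    for order_key in range(0, 20_000, 2):
        row_group.add(order_key)
    print("can skip row group:", not row_group.might_contain(1212000001))
```

A lower false positive rate needs more bits per entry, which is the space cost the slide warns about.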
  • 20. Row Pruning Example
     TPC-DS
      – from tpch1000.lineitem where l_orderkey = 1212000001;
     Rows Read
      – Nothing – 5,999,989,709
      – Min/Max – 540,000
      – BloomFilter – 10,000
     Time Taken
      – Nothing – 74 sec
      – Min/Max – 4.5 sec
      – BloomFilter – 1.3 sec
  • 21. Split Calculation
     Hive’s OrcInputFormat has three strategies for split calculation
      – BI
        • Small fast queries
        • Splits based on HDFS blocks
      – ETL
        • Large queries
        • Read file footer and apply SearchArg to stripes
        • Can include footer in splits (hive.orc.splits.include.file.footer)
      – Hybrid
        • If small files or lots of files, use BI
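The hybrid rule can be written down as a small decision function. This sketch follows the slide's wording literally ("small files or lots of files"). The thresholds `max_split_size` and `min_splits`, and the fallback to ETL when neither condition holds, are assumptions made for illustration and not a description of Hive's actual heuristics.

```python
MB = 1024 * 1024


def choose_split_strategy(file_sizes, max_split_size, min_splits):
    """Return "BI" or "ETL" for a list of file sizes in bytes.

    Hypothetical thresholds: files count as small when their average
    size is at most max_split_size, and as numerous when there are
    more than min_splits of them.
    """
    if not file_sizes:
        return "BI"
    average_size = sum(file_sizes) / len(file_sizes)
    small_files = average_size <= max_split_size
    many_files = len(file_sizes) > min_splits
    # Reading footers is only worth it for few, large files.
    return "BI" if (small_files or many_files) else "ETL"


if __name__ == "__main__":
    print(choose_split_strategy([1024 * MB] * 4, 256 * MB, 10))
    print(choose_split_strategy([8 * MB] * 1000, 256 * MB, 10))
```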
  • 22. LLAP – Live Long & Process
     Provides a persistent service to speed up Hive
      – Caches ORC and text data
      – Saves costs of YARN container & JVM spin-up
      – JIT finishes after the first few seconds
     Cache uses ORC’s RLE
      – Decompresses zlib or Snappy
      – RLE is fast and saves memory
      – Automatically caches hot columns and partitions
     Allows Spark to use Hive’s column and row security
  • 23. Current Work In Progress
  • 24. Speed Improvements for ACID
     Hive supports ACID transactions on ORC tables
      – Uses delta files in HDFS to store changes to each partition
      – Delta files store insert/update/delete operations
      – Used to support SQL insert commands
     Unfortunately, update operations don’t allow predicate push down on the deltas
     In the upcoming Hive 2.3, we added a new ACID layout
      – It changes each update into an insert and a delete
      – Allows predicate pushdown even on the delta files
     Also added SQL merge command in Hive 2.2
  • 25. Column Encryption (ORC-14)
     Allows users to encrypt some of the columns of the file
      – Provides column level security even with access to raw files
      – Uses Key Management Server from Ranger or Hadoop
      – Includes both the data and the index
      – Daily key rolling can anonymize data after 90 days
     User specifies how data is masked if user doesn’t have access
      – Nullify
      – Redact
      – SHA256
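The three masking options can be sketched as plain functions. The slide lists only the names, so this is a guess at their behaviour: the replacement characters used by `redact` ("X", "x", "9") and the hex-digest output of `sha256_mask` are assumptions, and the function names are hypothetical.

```python
import hashlib


def nullify(value):
    """Hide the value entirely."""
    return None


def redact(value):
    """Keep the shape of the value but replace letters and digits.

    Assumed replacement scheme: uppercase -> X, lowercase -> x,
    digit -> 9, everything else unchanged.
    """
    masked = []
    for ch in str(value):
        if ch.isdigit():
            masked.append("9")
        elif ch.isalpha():
            masked.append("X" if ch.isupper() else "x")
        else:
            masked.append(ch)
    return "".join(masked)


def sha256_mask(value):
    """Replace the value with a stable hash so equal values still match."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for mask in (nullify, redact, sha256_mask):
        print(mask.__name__, "->", mask("Owen-1234"))
```

A hash mask keeps joins and group-bys on the masked column usable, while redaction keeps the format readable and nullify reveals nothing.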
  • 26. Thank You @owen_omalley