Parquet is a column-oriented storage format for the Hadoop ecosystem that supports efficient compression and encoding schemes. Data is organized into row groups; within each row group, the values of each column are stored contiguously as a compressed, encoded column chunk. The schema and file metadata are stored in the file footer, so a reader can locate and scan only the columns a query needs. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
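To make the layout concrete, here is a minimal sketch in plain Python of the idea behind row groups, column chunks, and footer metadata. This is not the real Parquet binary format (no compression, encoding, or Thrift metadata); the structures, names, and the tiny row-group size are all illustrative assumptions.

```python
# Hypothetical in-memory model of a columnar file layout, loosely
# mirroring Parquet's row-group / column-chunk / footer structure.

ROW_GROUP_SIZE = 2  # deliberately tiny so the example forms multiple groups


def write_columnar(rows, columns):
    """Split rows into row groups, storing each column contiguously."""
    row_groups = []
    for start in range(0, len(rows), ROW_GROUP_SIZE):
        group = rows[start:start + ROW_GROUP_SIZE]
        # One chunk per column: all values for that column, back to back.
        chunks = {col: [row[i] for row in group]
                  for i, col in enumerate(columns)}
        row_groups.append(chunks)
    # The footer carries the schema, echoing Parquet's trailing metadata,
    # which lets a reader plan a scan before touching any data pages.
    return {"row_groups": row_groups, "footer": {"schema": columns}}


def scan_column(f, column):
    """Read a single column (a projection) without touching the others."""
    assert column in f["footer"]["schema"], "unknown column"
    values = []
    for group in f["row_groups"]:
        values.extend(group[column])
    return values


rows = [("a", 1), ("b", 2), ("c", 3)]
f = write_columnar(rows, ["name", "score"])
print(scan_column(f, "score"))  # → [1, 2, 3]
```

Because each column chunk is stored contiguously, `scan_column` never deserializes the `name` values when only `score` is requested; the real format gains the same benefit at the I/O level by skipping whole byte ranges.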