The document discusses local secondary indexes in Apache Phoenix. Local indexes are stored in the same region as the base table data, providing faster index building and reads compared to global indexes. The write process involves preparing index updates along with data updates and writing them atomically to the memstores and the write-ahead log. Reads scan the local index and retrieve any missing columns from the base table. Local indexes improve write performance over global indexes due to reduced network utilization. The document provides performance results and tips on using local indexes.
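A minimal sketch of how this looks from the client side, using Phoenix's JDBC driver; the cluster address, table, and column names are hypothetical placeholders, not part of the original document.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LocalIndexExample {
    public static void main(String[] args) throws Exception {
        // Assumes a Phoenix-enabled HBase cluster reachable via the default ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            conn.setAutoCommit(true);

            // Hypothetical base table keyed by host and timestamp.
            try (PreparedStatement stmt = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS METRICS (" +
                    " HOST VARCHAR NOT NULL, TS DATE NOT NULL, CPU DOUBLE, MEM DOUBLE" +
                    " CONSTRAINT PK PRIMARY KEY (HOST, TS))")) {
                stmt.execute();
            }

            // A LOCAL index is co-located with the data regions, so index rows are
            // written to the same region server as the base table rows.
            conn.createStatement().execute(
                "CREATE LOCAL INDEX IF NOT EXISTS METRICS_CPU_IDX ON METRICS (CPU)");

            // A query filtering on CPU can scan the local index; MEM is not covered
            // by the index and is fetched from the co-located base table.
            try (ResultSet rs = conn.createStatement().executeQuery(
                    "SELECT HOST, CPU, MEM FROM METRICS WHERE CPU > 0.9")) {
                while (rs.next()) {
                    System.out.println(rs.getString("HOST") + " " + rs.getDouble("CPU"));
                }
            }
        }
    }
}
```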
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don't fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait. Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including: * All reads use snapshot isolation without locking. * No directory listings are required for query planning. * Files can be added, removed, or replaced atomically. * Full schema evolution supports changes in the table over time. * Partition evolution enables changes to the physical layout without breaking existing queries. * Data files are stored as Avro, ORC, or Parquet. * Support for Spark, Pig, and Presto.
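As a rough illustration of the Spark support mentioned above, the sketch below creates and queries an Iceberg table through Spark SQL. It assumes the Iceberg Spark runtime is on the classpath; the "demo" catalog name, warehouse path, and table schema are illustrative assumptions, not from the talk.

```java
import org.apache.spark.sql.SparkSession;

public class IcebergTableExample {
    public static void main(String[] args) {
        // Assumes the iceberg-spark-runtime jar is on the classpath; the "demo"
        // catalog name and warehouse location below are placeholders.
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-example")
            .master("local[*]")
            .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.demo.type", "hadoop")
            .config("spark.sql.catalog.demo.warehouse", "s3://bucket/warehouse")
            .getOrCreate();

        // Create a partitioned Iceberg table; query planning later uses table
        // metadata and manifests rather than directory listings.
        spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (" +
                  " id BIGINT, ts TIMESTAMP, payload STRING)" +
                  " USING iceberg PARTITIONED BY (days(ts))");

        // Appends add data files and commit a new snapshot atomically.
        spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')");

        // Readers see a consistent snapshot without taking locks.
        spark.sql("SELECT * FROM demo.db.events").show();

        spark.stop();
    }
}
```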
"While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns. "
1. Log-structured merge (LSM) trees store data in multiple levels with different storage speeds and costs, requiring data to be periodically merged across levels. 2. This structure allows fast writes by storing new data in faster levels before merging it into slower levels, and efficient reads by querying multiple levels and merging the results. 3. The merging process involves loading, sorting, and rewriting levels to consolidate data and propagate deletions and updates between levels.
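A minimal in-memory sketch of the merge step described above, using two sorted levels and a null value as a tombstone. This is a simplification for illustration, not any particular storage engine's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

public class LsmMergeSketch {
    // A null value acts as a tombstone marking a deleted key.
    static Map<String, String> merge(TreeMap<String, String> fastLevel,
                                     TreeMap<String, String> slowLevel) {
        // Start from the older, slower level, then overlay the newer level so
        // recent updates and deletions win.
        TreeMap<String, String> merged = new TreeMap<>(slowLevel);
        for (Map.Entry<String, String> e : fastLevel.entrySet()) {
            if (e.getValue() == null) {
                merged.remove(e.getKey());            // propagate the deletion
            } else {
                merged.put(e.getKey(), e.getValue()); // newer value replaces older
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, String> memtable = new TreeMap<>(); // fast, in-memory level
        TreeMap<String, String> sstable = new TreeMap<>();  // slower, on-disk level

        sstable.put("a", "1");
        sstable.put("b", "2");
        memtable.put("b", "20");  // update
        memtable.put("c", "30");  // insert
        memtable.put("a", null);  // delete

        System.out.println(merge(memtable, sstable)); // {b=20, c=30}
    }
}
```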
The document discusses updates to InfluxDB IOx, a new columnar time series database. It covers changes and improvements to the API, CLI, and query capabilities, as well as the path to open source builds. Key points include moving to gRPC for management, adding PostgreSQL string functions to queries, optimizing functions for scalar values and columns, and monitoring internal systems as the first step toward releasing open source builds.
Data Orchestration Summit www.alluxio.io/data-orchestration-summit-2019 November 7, 2019 Apache Iceberg - A Table Format for Huge Analytic Datasets Speaker: Ryan Blue, Netflix For more Alluxio events: https://www.alluxio.io/events/
The document discusses Apache NiFi and its role in the Hadoop ecosystem. It provides an overview of NiFi and describes how it can be used to integrate with ecosystem components like HDFS, HBase, and Kafka. It also discusses how NiFi supports stream processing integrations and outlines some use cases. The document concludes by discussing future work, including improving NiFi's high availability and multi-tenancy and expanding its ecosystem integrations.
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unified streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover: * What data quality problems Delta helps address * How to convert your existing application to Delta Lake * How the Delta Lake transaction protocol works internally * The Delta Lake roadmap for the next few releases * How to get involved!
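A hedged sketch of one way to convert an existing Parquet dataset to Delta Lake from Spark, related to the conversion topic above. It assumes the delta-core package is on the classpath; the paths are placeholders, and the details are not taken from the talk itself.

```java
import org.apache.spark.sql.SparkSession;

public class ConvertToDeltaExample {
    public static void main(String[] args) {
        // Assumes the delta-core package is on the classpath; paths are placeholders.
        SparkSession spark = SparkSession.builder()
            .appName("delta-convert")
            .master("local[*]")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
            .getOrCreate();

        // In-place conversion: writes a _delta_log transaction log next to the
        // existing Parquet files so they become a Delta table.
        spark.sql("CONVERT TO DELTA parquet.`/data/events`");

        // From here on, reads and writes go through the Delta transaction protocol.
        spark.read().format("delta").load("/data/events").show();

        spark.stop();
    }
}
```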
Flink Forward San Francisco 2022. With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines using Flink and Hudi. We will dive deep into how Flink can leverage the newest features of Hudi, like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi. by Ethan Guo & Kyle Weller
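For context on what a Flink-to-Hudi pipeline looks like, here is a small sketch using Flink's Table API to declare and write a Hudi table. It assumes the hudi-flink bundle is available; the storage path, schema, and table options are illustrative assumptions rather than the talk's actual configuration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkHudiExample {
    public static void main(String[] args) {
        // Assumes the hudi-flink bundle jar is on the classpath; the path and
        // table layout are illustrative placeholders.
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Declare a Hudi table backed by lake storage (e.g., S3 or HDFS).
        tEnv.executeSql(
            "CREATE TABLE hudi_events (" +
            "  id STRING PRIMARY KEY NOT ENFORCED," +
            "  ts TIMESTAMP(3)," +
            "  payload STRING" +
            ") WITH (" +
            "  'connector' = 'hudi'," +
            "  'path' = 's3a://bucket/hudi/events'," +
            "  'table.type' = 'MERGE_ON_READ'" +
            ")");

        // Upserts land in the Hudi table; in a real pipeline the VALUES clause
        // would be replaced by a streaming source such as a Kafka-backed table.
        tEnv.executeSql(
            "INSERT INTO hudi_events VALUES" +
            " ('id1', TIMESTAMP '2022-08-01 00:00:00', 'hello')");
    }
}
```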
When interacting with analytics dashboards, a smooth user experience hinges on two key requirements: sub-second response time and data freshness. Cluster computing frameworks such as Hadoop or Hive/HBase work well for storing large volumes of data, but they are not optimized for ingesting streaming data and making it available for queries in real time. Long query latencies also make these systems sub-optimal choices for powering interactive dashboards and BI use cases. In this talk we will present Druid as a complementary solution to existing Hadoop-based technologies. Druid is an open-source analytics data store, designed from scratch for OLAP and business intelligence queries over massive data streams. It provides low-latency real-time data ingestion and fast, flexible, sub-second ad hoc data exploration queries. Many large companies are switching to Druid for analytics, and we will cover how Druid is able to handle massive data streams and why it is a good fit for BI use cases. Agenda - 1) Introduction and ideal use cases for Druid 2) Data architecture 3) Streaming ingestion with Kafka 4) Demo using Druid, Kafka, and Superset 5) Recent improvements in Druid, moving from a lambda architecture to exactly-once ingestion 6) Future work
Essentially every successful analytical DBMS in the market today makes use of column-oriented data structures. In the Hadoop ecosystem, Apache Parquet (and Apache ORC) provide similar advantages in terms of processing and storage efficiency. Apache Arrow is the in-memory counterpart to these formats and has been embraced by over a dozen open source projects as the de facto standard for in-memory processing. In this session, the PMC Chair for Apache Arrow and the PMC Chair for Apache Parquet discuss the future of column-oriented processing.
Parquet is a column-oriented storage format for Hadoop that supports efficient compression and encoding techniques. It organizes data into row groups, with each column stored as a compressed and encoded column chunk. The schema and metadata are stored in the file footer to allow for efficient reads and scans of selected columns. The format is designed to be extensible through pluggable components for schema conversion, record materialization, and encodings.
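A small sketch of how the footer metadata enables reading only selected columns, shown through Spark's Java API. The file path and column names are placeholders assumed for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetColumnPruning {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("parquet-pruning")
            .master("local[*]")
            .getOrCreate();

        // The schema and row-group metadata in the footer let the reader locate
        // just the column chunks for the selected columns, skipping the rest.
        Dataset<Row> df = spark.read()
            .parquet("/data/events.parquet")   // placeholder path
            .select("user_id", "event_time");  // only these column chunks are read

        df.show();
        spark.stop();
    }
}
```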
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optimized In-Memory Query Execution Engine