Big SQL 3.0 is IBM's SQL engine for Hadoop that addresses the challenges of building a first-class SQL engine on Hadoop. It uses a modern MPP shared-nothing architecture, designed from the ground up for low latency and high throughput. Key challenges included data placement on Hadoop, reading and writing Hadoop file formats, query optimization with limited statistics, and resource management on a shared Hadoop cluster. The architecture reuses existing SQL query rewrite and optimization capabilities while introducing new capabilities for statistics, constraints, and pushdown to Hadoop file formats and data sources.
HadoopDB is a system that combines the performance of parallel database systems with the flexibility and fault tolerance of Hadoop. It uses Hadoop as the communication layer between multiple single-node database instances running on cluster nodes. Benchmark results showed that HadoopDB's performance was close to parallel databases for structured queries and similar to Hadoop for unstructured queries, while also providing Hadoop's ability to operate in heterogeneous environments and tolerate faults.
Slides for a presentation on Cloudera Impala that I gave at the DC/NOVA Java Users Group on 7/9/2013. It is a slightly updated version of the slides I uploaded a few months ago on 4/19/2013. It covers version 1.0.1 and also includes some new slides on Hortonworks' Stinger Initiative.
From: DataWorks Summit 2017 - Munich - 20170406. HBase has established itself as the backend for many operational and interactive use cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features, HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tunable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, combined with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of matching use cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
HBase application developers face a number of challenges: schema management is performed at the application level, decoupled components of a system can break one another in unexpected ways, less-technical users cannot easily access data, and evolving data collection and analysis needs are difficult to plan for. In this talk, we describe a schema management methodology based on Apache Avro that enables users and applications to share data in HBase in a scalable, evolvable fashion. By adopting these practices, engineers independently using the same data have guarantees on how their applications interact. As data collection needs change, applications are resilient to drift in the underlying data representation. This methodology results in a data dictionary that allows less-technical users to understand what data is available to them for analysis and inspect data using general-purpose tools (for example, export it via Sqoop to an RDBMS). And because of Avro’s cross-language capabilities, HBase’s power can reach new domains, like web apps built in Ruby.
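To make the evolution guarantee concrete, here is a minimal sketch (the `PageView` record and its field names are invented for illustration, not from the talk): two versions of an Avro-style record schema, plus a check that every field added in the newer version declares a default, which is what lets records written under the old schema remain readable as the representation drifts.

```python
import json

# Hypothetical v1 schema for events stored in HBase.
schema_v1 = json.loads("""
{
  "type": "record", "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "url", "type": "string"}
  ]
}
""")

# v2 adds a field WITH a default, so a v2 reader can still decode
# records written with v1 (Avro fills in the default value).
schema_v2 = json.loads("""
{
  "type": "record", "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "url", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
""")

def added_fields_have_defaults(old, new):
    """Backward-compatibility check: every field present in `new`
    but absent from `old` must declare a default value."""
    old_names = {f["name"] for f in old["fields"]}
    return all("default" in f
               for f in new["fields"]
               if f["name"] not in old_names)
```

A check like this could run before a new schema version is published to the shared data dictionary, so independent applications keep their compatibility guarantees.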
This document discusses HadoopDB and Apache Hive. HadoopDB aims to combine the scalability of MapReduce with the performance of parallel databases by running Hive queries over data stored in node-local relational databases rather than HDFS. It describes HadoopDB's architecture, which replaces HDFS with local databases, and benchmarks comparing it to MapReduce. It also summarizes Hive's data model, query language and architecture, which provides a SQL interface to MapReduce by translating queries into map and reduce jobs.
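To illustrate the kind of translation Hive performs (a toy sketch, not Hive's actual code), a SQL aggregate such as `SELECT dept, SUM(salary) FROM emp GROUP BY dept` decomposes into a map stage that emits key/value pairs and a reduce stage that sums per key:

```python
from collections import defaultdict

def map_phase(rows):
    # Map stage: emit (group_key, value) pairs, analogous to the
    # mappers Hive generates for a GROUP BY aggregate.
    for dept, salary in rows:
        yield dept, salary

def reduce_phase(pairs):
    # Reduce stage: all pairs with the same key arrive at one
    # reducer, which applies the aggregate (here, SUM).
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

In a real cluster the shuffle between the two stages partitions pairs by key across reducers; here the two functions are simply chained in-process.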
Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture in which an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala runs SQL queries directly against files stored in HDFS in formats such as Avro and Parquet. It aims to provide high performance for analytical workloads through its C++ implementation and avoidance of MapReduce.
Wangda Tan and Mayank Bansal presented on YARN Node Labels. Node labels allow grouping nodes with similar hardware or software profiles and partitioning a cluster. This allows applications to request nodes with specific resources and partitions the cluster for different organizations or workloads. Node partitions were added in Hadoop 2.6 to allow exclusive or non-exclusive access to labeled nodes. Ebay and other companies use node labels to separate machine learning, licensed software, and organizational workloads. Future work includes adding node constraints and supporting node labels in other Apache projects like FairScheduler, Tez and Oozie.
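As a rough configuration sketch (label and host names below are made up), node labels are defined cluster-wide with the `yarn rmadmin` CLI and then assigned to NodeManager hosts; queue access to labels is configured separately in `capacity-scheduler.xml`:

```shell
# Define cluster-level node labels (Hadoop 2.6+). "exclusive=true"
# means only applications that explicitly request the label may run
# on those nodes; non-exclusive labels allow sharing of idle capacity.
yarn rmadmin -addToClusterNodeLabels "gpu(exclusive=true),licensed(exclusive=false)"

# Assign labels to specific NodeManager hosts (example hostnames).
yarn rmadmin -replaceLabelsOnNode "node1.example.com=gpu node2.example.com=licensed"

# Inspect the configured labels.
yarn cluster --list-node-labels
```

With labels in place, an application can request containers only on, say, `gpu`-labeled nodes, which is how clusters are partitioned per organization or workload.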
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax. This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
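One common pattern covered in schema-design talks like this is the composite row key ordered for the dominant access path. A minimal sketch (the entity and the key layout are illustrative assumptions, not from the talk): encoding a reversed timestamp after the user id makes a prefix scan return that user's newest events first, since HBase stores rows sorted by key bytes.

```python
import struct

LONG_MAX = 2**63 - 1

def row_key(user_id: str, ts_millis: int) -> bytes:
    # Composite key: user id, a separator byte, then the timestamp
    # subtracted from LONG_MAX and packed big-endian. Big-endian
    # bytes sort like the integers, so newer events (smaller
    # reversed value) come first in a scan over the user's prefix.
    return user_id.encode() + b"\x00" + struct.pack(">q", LONG_MAX - ts_millis)
```

The trade-off discussed in such talks applies here: this layout is ideal for "latest N events per user" scans, but monotonically written keys for a single hot user still land in one region.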
HBase In Action - Chapter 04: HBase table design.
This document summarizes a presentation about HBase storage internals and future developments. It discusses how HBase provides random read/write access on HDFS using tables, regions, and region servers. It describes the write path involving the client, master, and region servers as well as the read path. It also covers topics like snapshots, compactions, and future plans to improve encryption, security, write-ahead logs, and compaction policies.
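The write and read paths it describes can be sketched in a few lines (a deliberately simplified toy model, not HBase's implementation): writes go to a write-ahead log for durability and to an in-memory store, which is flushed to immutable files when full; reads consult the memstore first, then the flushed files newest-first.

```python
class RegionSketch:
    """Toy model of an HBase region's write/read path."""

    def __init__(self, flush_size=3):
        self.wal = []        # write-ahead log, replayed on recovery
        self.memstore = {}   # in-memory buffer of recent writes
        self.hfiles = []     # immutable flushed files, oldest first
        self.flush_size = flush_size

    def put(self, row, value):
        self.wal.append((row, value))   # durability first
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_size:
            self.flush()

    def flush(self):
        # Write the memstore out as a sorted, immutable file.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()

    def get(self, row):
        # Read path: memstore first, then flushed files newest-first.
        if row in self.memstore:
            return self.memstore[row]
        for hfile in reversed(self.hfiles):
            if row in hfile:
                return hfile[row]
        return None
```

The growing list of `hfiles` is exactly why compactions (also covered in the deck) exist: each flushed file is one more place a `get` may have to look.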
Alibaba has built its data infrastructure on Apache Hadoop YARN since 2013, and today manages more than 10k nodes. At Alibaba, Hadoop YARN serves various systems such as search, advertising, and recommendation. It runs not just batch jobs but also streaming, machine learning, OLAP, and even online services that directly impact Alibaba's user experience. To extend YARN's ability to support such complex scenarios, we have implemented and leveraged many YARN 3.x improvements. In this talk, you will learn what these improvements are and how they helped solve difficult problems in large production clusters, including: 1. Greatly improved performance with Capacity Scheduler's async scheduling framework 2. Better placement decisions with node attributes and placement constraints 3. Better resource utilization with opportunistic containers 4. A load balancer to balance resource utilization 5. Generic resource type scheduling/isolation to manage new resources such as GPU and FPGA. In the presentation, we will further introduce how we build the entire ecosystem on top of YARN and how we keep evolving YARN's ability to tackle the challenges brought by continuously growing data and business at Alibaba. Speakers: Weiwei Yang, Alibaba, Staff Software Engineer; Ren Chunde, Alibaba Group, Senior Engineer
This document discusses improvements to compaction in Apache HBase. It begins with an overview of what compactions are and how they improve read performance in HBase. It then describes the default compaction algorithm and improvements made, including exploring selection and off-peak compactions. The document also covers making compactions more pluggable and enabling tuning on a per-table/column family basis. Finally, it proposes algorithms for different scenarios, such as level and stripe compactions, to improve compaction performance.
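The ratio-based selection at the heart of the default algorithm can be sketched briefly (a simplified illustration of the idea, not HBase's actual policy code): starting from the oldest store file, skip any file much larger than the combined size of the newer files after it, then compact the remaining run if it is long enough.

```python
def select_files(sizes, ratio=1.2, min_files=3):
    """Pick a contiguous run of store files to compact.

    `sizes` lists store-file sizes oldest-first. A file is skipped if
    it is larger than `ratio` times the total size of all newer files,
    so huge old files are not endlessly rewritten; the surviving run
    of smaller, newer files is compacted if it has enough members.
    """
    start = 0
    while start < len(sizes) and sizes[start] > ratio * sum(sizes[start + 1:]):
        start += 1
    selected = sizes[start:]
    return selected if len(selected) >= min_files else []
```

The "exploring" improvement mentioned in the deck goes further by evaluating multiple candidate runs and choosing the one that removes the most files for the least I/O.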
This exam cheat sheet hopes to cover all key points for the GCP Data Engineer Certification Exam. Let me know if there is any mistake and I will try to update it.
The document summarizes HBase use at Facebook, including its development and future work. HBase is used for incremental updates to data warehouses, high frequency analytics, and write-intensive workloads. Development includes Hive integration, master high availability, and random read optimizations. Future work focuses on coprocessors, intelligent load balancing, and cluster performance.
Matt Stump presents for the DataStax Cassandra South Bay Users group on advanced data modeling and bitmap indexes.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.