I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, query processing logic, and also competitive information.
Data Con LA 2020
Description
Apache Druid is a cloud-native open-source database that enables developers to build highly scalable, low-latency, real-time interactive dashboards and apps to explore huge quantities of data. This column-oriented database provides the microsecond query response times required for ad-hoc queries and programmatic analytics. Druid natively streams data from Apache Kafka (and more) and batch loads just about anything. At ingestion, Druid partitions data based on time, so time-based queries run significantly faster than in traditional databases, plus Druid offers SQL compatibility. Druid is used in production by AirBnB, Nielsen, Netflix and more for real-time and historical data analytics. This talk provides an introduction to Apache Druid, including Druid's core architecture and its advantages, working with streaming and batch data in Druid, querying data and building apps on Druid, and real-world examples of Apache Druid in action.
Speaker
Matt Sarrel, Imply Data, Developer Evangelist
Data at the Speed of Business with Data Mastering and Governance
Do you ever wonder how data-driven organizations fuel analytics, improve customer experience, and accelerate business productivity? They are successful by governing and mastering data effectively so they can get trusted data to those who need it faster. Efficient data discovery, mastering and democratization is critical for swiftly linking accurate data with business consumers. When business teams can quickly and easily locate, interpret, trust, and apply data assets to support sound business judgment, it takes less time to see value.
Join data mastering and data governance experts from Informatica—plus a real-world organization empowering trusted data for analytics—for a lively panel discussion. You’ll hear more about how a single cloud-native approach can help global businesses in any economy create more value—faster, more reliably, and with more confidence—by making data management and governance easier to implement.
Databricks + Snowflake: Catalyzing Data and AI Initiatives
"Combining Databricks, the unified analytics platform with Snowflake, the data warehouse built for the cloud is a powerful combo.
Databricks offers the ability to process large amounts of data reliably, including developing scalable AI projects. Snowflake offers the elasticity of a cloud-based data warehouse that centralizes the access to data. Databricks brings the unparalleled utility of being based on a mature distributed big data processing and AI-enabled tool to the table, capable of integrating with nearly every technology, from message queues (e.g. Kafka) to databases (e.g. Snowflake) to object stores (e.g. S3) and AI tools (e.g. Tensorflow).
Key Takeaways:
How Databricks & Snowflake work;
Why they're so powerful;
How Databricks + Snowflake symbiotically catalyze analytics and AI initiatives"
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
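As a hedged illustration of the mechanism described above (the star-schema tables and columns below are hypothetical), a query of this shape is the kind that benefits from dynamic partition pruning: the filter touches only the dimension table, and the matching join keys are reused at runtime to skip fact-table partitions.

-- sales_fact is partitioned by date_key; dim_date is a small dimension table
SELECT f.store_id, SUM(f.amount) AS total_amount
FROM sales_fact f
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2019 AND d.quarter = 4   -- filter on the dimension only
GROUP BY f.store_id;
-- With dynamic partition pruning, the broadcast hash-join result for dim_date
-- yields the surviving date_key values, so partitions of sales_fact that cannot
-- match are never scanned.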
Interactive real time dashboards on data streams using Kafka, Druid, and Supe...
When interacting with analytics dashboards, in order to achieve a smooth user experience, two major key requirements are quick response time and data freshness. To meet the requirements of creating fast interactive BI dashboards over streaming data, organizations often struggle with selecting a proper serving layer.
Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, although they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases.
This talk presents an open source real time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and provides low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding that integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of complete stack, discuss its key features, and discuss performance characteristics from real-world use cases.
Speaker
Nishant Bangarwa, Software Engineer, Hortonworks
Experiences Migrating Hive Workload to SparkSQL with Jie Xiong and Zhan Zhang
At Facebook, millions of Hive queries are executed on a daily basis, and the workload contributes to important analytics that drive product decisions and insights. Spark SQL in Apache Spark provides much of the same functionality as Hive query language (HQL) more efficiently, and Facebook is building a framework to migrate existing production Hive workload to Spark SQL with minimal user intervention.
Before Facebook began large-scale migration to SparkSQL, they worked on identifying the gap between HQL and SparkSQL. They built an offline syntax analysis tool that parses, analyzes, optimizes and generates physical plans on daily HQL workload. In this session, they’ll share their results. After finding their syntactic analysis encouraging, they built tooling for offline semantic analysis where they run HQL queries in their Spark shadow cluster and validate the outputs. Output validation is necessary since the runtime behavior in Spark SQL may be different from HQL. They have built a migration framework that supports HQL in both Hive and Spark execution engines, can shadow and validate HQL workloads in Spark, and makes it easy for users to convert their workloads.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- The latest relational database engine, Amazon Aurora: a MySQL-compatible, highly available relational database engine that provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak,
Sr. Manager of Software Development
Machine learning development brings many new complexities beyond the traditional software development lifecycle. Unlike traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. In this talk, learn how to operationalize ML across the full lifecycle with Databricks Machine Learning.
Tackling Data Quality problems requires more than a series of tactical, one-off improvement projects. By their nature, many Data Quality problems extend across and often beyond an organization. Addressing these issues requires a holistic architectural approach combining people, process, and technology. Join Nigel Turner and Donna Burbank as they provide practical ways to control Data Quality issues in your organization.
Organizations are struggling to manually classify and inventory distributed and heterogeneous data assets in order to deliver value. However, the new Azure service for enterprises, Azure Synapse Analytics, is poised to help organizations fill the gap between data warehouses and data lakes.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Delta has been powering many production pipelines at scale in the Data and AI space since it was introduced a few years ago.
Built on open standards, Delta provides data reliability, enhances storage and query performance to support big data use cases (both batch and streaming), enables fast interactive queries for BI, and supports machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their Data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples and notebooks, we will build the Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their experience of a successful migration of their data and workloads to the cloud.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Data warehousing is a critical component for analysing and extracting actionable insights from your data. Amazon Redshift allows you to deploy a scalable data warehouse in a matter of minutes and start analysing your data right away using your existing business intelligence tools.
The document discusses developing data APIs for the Arabidopsis Information Portal (AIP) to enable discovery and reuse of services, data, and codes. It describes the AIP strategy of centralized data warehousing with infrastructure for data federation through web services and standards like REST. The AIP architecture includes an API manager, services bus and mediators to integrate diverse data sources and legacy systems while providing authentication, documentation, logging and versioning.
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high level architecture including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka streams and storage in Hadoop.
Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...
"The only constant in life is change! The same applies to your Kafka events flowing through your streaming applications.
The Confluent Schema Registry allows us to control how schemas can evolve over time without breaking the compatibility of our streaming applications. But when you start with Kafka and (Avro) schemas, this can be pretty overwhelming.
Join Kosta and Tim as we dive into the tricky world of backward and forward compatibility in schema design. During this deep dive talk, we are going to answer questions like:
* What compatibility level to pick?
* What changes can I make when evolving my schemas?
* What options do I have when I need to introduce a breaking change?
* Should we automatically register schemas from our applications? Or do we need a separate step in our deployment process to promote schemas to higher-level environments?
* What to promote first? My producer, consumer or schema?
* How do you generate Java classes from your Avro schemas using Maven or Gradle, and how to integrate this into your project(s)?
* How do you build an automated test suite (unit tests) to gain more confidence and verify you are not breaking compatibility? Even before deploying a new version of your schema or application.
With live demos, we'll show you how to make schema changes work seamlessly. Emphasizing the crucial decisions, using real-life examples, pitfalls and best practices when promoting schemas on the consumer and producer sides.
Explore the ins and outs of Apache Avro and the Schema Registry with us at the Kafka Summit! Start evolving your schemas in a better way today, and join this talk!"
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Talk at Apache Drill Meetup (November 2018) describing how to accelerate SQL queries in a NoSQL database using Apache Drill and Secondary Indexes. Drill (in conjunction with Apache Calcite) provides a comprehensive cost-based index planning and execution framework. Queries with indexed columns in the WHERE clause, ORDER BY, GROUP BY and Joins can be sped up substantially. A reference implementation with MapR-DB JSON database is described.
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
This document provides an overview of a presentation on being a polyglot data scientist using multiple languages and tools. It discusses using SQL, R, and Python together in data science work. The presentation covers the challenges of being a polyglot, how SQL Server with R or Python can help solve problems more easily, and examples of analyzing sensor data with these tools. It also discusses resources for learning more about R, Python, and machine learning services in SQL Server.
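As a rough sketch of the SQL-plus-R combination the talk describes, and assuming SQL Server Machine Learning Services is installed (the table and column names are hypothetical), an R script can be run directly from T-SQL and fed a query result:

EXEC sp_execute_external_script
    @language = N'R',
    @script = N'out <- data.frame(avg_reading = mean(InputDataSet$reading))',
    @input_data_1 = N'SELECT reading FROM dbo.SensorData',
    @output_data_1_name = N'out'
WITH RESULT SETS ((avg_reading FLOAT));
-- InputDataSet is the data frame holding the query result; the variable named by
-- @output_data_1_name is returned to SQL Server as a result set.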
APEX 5 Interactive Reports (IR) are powerful out of the box, but one can significantly improve performance by strategic settings of certain key parameters. The full presentation covers all the options.
Prometheus lightning talk (Devops Dublin March 2015)
This document introduces Prometheus, an open-source monitoring system that allows instrumentation of everything including RPCs, interfaces, business logic, and logs. It provides client libraries that make instrumentation easy across many languages. The Prometheus server can handle over a million time series in one instance with no dependencies. It offers dashboards, expression queries, alerts and integrates with many systems. Time series have structured labels allowing flexible aggregation and complex math for rules and alerts. Prometheus costs less than $.001 per time series per month and is developed by SoundCloud, Boxever and Docker with an active community.
Description of some of the elements that go into creating a PostgreSQL-as-a-Service for organizations with many teams and a diverse ecosystem of applications.
Outlines the CSS and JavaScript changes in APEX 5 Interactive Reports, recommending supported APIs and some unsupported options for customizing where necessary. Discusses and demonstrates how typical declarative settings influence end-user performance. Learn how to leverage IR settings to maximize end-user performance.
OpenTSDB is used at Criteo for monitoring their large Hadoop infrastructure which includes over 2500 servers running many different services like HDFS, YARN, HBase, Kafka, and Storm. OpenTSDB was chosen because it can handle the scale of metrics collected, store metrics for long periods of time with fine-grained resolution, and is easily extensible to add new metrics. It uses HBase for storage which is optimized for the time series data stored in OpenTSDB and can scale to meet Criteo's needs of storing billions of data points and handling high query loads.
Scaling ingest pipelines with high performance computing principles - Rajiv K...
By Rajiv Kurian, software engineer at SignalFx.
At SignalFx, we deal with high-volume high-resolution data from our users. This requires a high performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond performance plateaus encountered with more generic approaches. Some key examples include:
* Write very simple single threaded code, instead of complex algorithms
* Parallelize by running multiple copies of simple single threaded code, instead of using concurrent algorithms
* Separate the data plane from the control plane, instead of slowing data for control
* Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation
This is the presentation I made at the Hadoop User Group Ireland meetup in Dublin. It covers the main ideas of MPP, Hadoop and distributed systems in general, and also how to choose the best option for you
This is the presentation I delivered at the Hadoop User Group Ireland meetup in Dublin on Nov 28, 2015. It covers at a glance the architecture of GPDB and, most importantly, its features. Sorry for the colors - Slideshare is crappy with PDFs
This is the presentation I made at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level stuff and can be used as an introduction to Apache Spark
This is the presentation for the talk I gave at JavaDay Kiev 2015. It is about the evolution of data processing systems from simple ones with a single DWH to complex approaches like Data Lake, Lambda Architecture and pipeline architectures
[D2T2S04] Generative AI Foundation Model Training and Tuning with SageMaker
This session presents approaches for pre-training or fine-tuning a foundation model using SageMaker Training Jobs / SageMaker JumpStart. It introduces the following three topics:
1. Training a foundation model from scratch
2. Pre-training a foundation model using an open-source model
3. Fine-tuning a model for a specific domain
Speakers:
Miron Perel, Principal ML GTM Specialist, AWS
Kristine Pearce, Principal ML BD, AWS
AIRLINE_SATISFACTION_Data Science Solution on Azure
Airline Satisfaction Project using Azure
This presentation was created as a foundation for understanding and comparing data science/machine learning solutions built in Python notebooks locally and on the Azure cloud, as part of the course DP-100 - Designing and Implementing a Data Science Solution on Azure.
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, and lets you scale relational database workloads in Aurora beyond the limits of a single Aurora writer instance without creating custom application logic or managing multiple databases.
An LLM-powered contract compliance application which, for the first time, uses the advanced RAG method Self-RAG together with a knowledge graph.
It provides the highest accuracy for contract compliance recorded so far for the oil and gas industry.
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR (Amazon Web Services)
Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premise deployments to Amazon EMR in order to save costs, increase availability, and improve performance. Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. This session will focus on identifying the components and workflows in your current environment and providing the best practices to migrate these workloads to Amazon EMR. We will explain how to move from HDFS to Amazon S3 as a durable storage layer, and how to lower costs with Amazon EC2 Spot instances and Auto Scaling. Additionally, we will go over common security recommendations and tuning tips to accelerate the time to production.
[EN] Building modern data pipeline with Snowflake + DBT + Airflow.pdf (Chris Hoyean Song)
The document outlines an agenda for the NFTBank x Snowflake Tech Seminar. The seminar will cover three sessions: 1) data quality and productivity with discussions of data validation, cataloging and lineage documentation, and an introduction to DBT; 2) integrating DBT with Airflow using Astronomer Cosmos; and 3) cost optimization through query optimization and cost monitoring. The seminar will be led by Chris Hoyean Song, VP of AIOps at NFTBank.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve problems that have stood in the traditional data lake for a long time, with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption, etc.
2. Who I am
Enterprise Architect @ Pivotal
• 7 years in data processing
• 5 years of experience with MPP
• 4 years with Hadoop
• Using HAWQ since the first internal Beta
• Responsible for designing most of the EMEA HAWQ and Greenplum implementations
• Spark contributor
• http://0x0fff.com
17. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
18. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
19. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
20. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
21. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
22. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
• More than 50 enterprise customers
23. HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
• More than 50 enterprise customers
– More than 10 of them in EMEA
24. Apache HAWQ
• Apache HAWQ (incubating) from 09’2015
– http://hawq.incubator.apache.org
– https://github.com/apache/incubator-hawq
• What’s in Open Source
– Sources of HAWQ 2.0 alpha
– HAWQ 2.0 beta is planned for 2015’Q4
– HAWQ 2.0 GA is planned for 2016’Q1
• The community is still young – come and join!
26. Why do we need it?
• SQL interface for BI solutions to the Hadoop data, compliant with ANSI SQL-92, -99, -2003
27. Why do we need it?
• SQL interface for BI solutions to the Hadoop data, compliant with ANSI SQL-92, -99, -2003
– Example: a 5000-line query with a number of window functions generated by Cognos
28. Why do we need it?
• SQL interface for BI solutions to the Hadoop data, compliant with ANSI SQL-92, -99, -2003
– Example: a 5000-line query with a number of window functions generated by Cognos
• Universal tool for ad hoc analytics on top of Hadoop data
29. Why do we need it?
• SQL interface for BI solutions to the Hadoop data, compliant with ANSI SQL-92, -99, -2003
– Example: a 5000-line query with a number of window functions generated by Cognos
• Universal tool for ad hoc analytics on top of Hadoop data
– Example: parse a URL to extract protocol, host name, port, GET parameters
30. Why do we need it?
• SQL interface for BI solutions to the Hadoop data, compliant with ANSI SQL-92, -99, -2003
– Example: a 5000-line query with a number of window functions generated by Cognos
• Universal tool for ad hoc analytics on top of Hadoop data
– Example: parse a URL to extract protocol, host name, port, GET parameters
• Good performance
31. Why do we need it?
• SQL interface for BI solutions to the Hadoop data, compliant with ANSI SQL-92, -99, -2003
– Example: a 5000-line query with a number of window functions generated by Cognos
• Universal tool for ad hoc analytics on top of Hadoop data
– Example: parse a URL to extract protocol, host name, port, GET parameters
• Good performance
– How many times would the data hit the HDD during a single Hive query?
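A minimal sketch of the URL-parsing example on the slides above, assuming a web_logs table with a url column (both hypothetical) and Postgres-style regular-expression functions as available in HAWQ; the patterns are illustrative rather than exhaustive:

SELECT
  substring(url from '^([a-z]+)://')        AS protocol,
  substring(url from '://([^/:?]+)')        AS host,
  substring(url from '://[^/:?]+:([0-9]+)') AS port,
  substring(url from '[?](.*)$')            AS get_params
FROM web_logs;
-- substring(text from pattern) returns the part of the string matched by the
-- first parenthesized group, so each expression pulls out one URL component.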
32. HAWQ Cluster
[Diagram: cluster layout. Server 1: SNameNode; Servers 2 and 3: ZK JM; Server 4: ZK JM and NameNode; Servers 5, 6 … N: Datanodes; all servers connected via the interconnect]
33. HAWQ Cluster
[Diagram: the same cluster with YARN added: a YARN NodeManager (NM) on each Datanode server, plus the YARN ResourceManager (RM) and YARN App Timeline service]
34. HAWQ Cluster
[Diagram: the same cluster with HAWQ added: HAWQ Master and HAWQ Standby on the master servers, and a HAWQ Segment on each Datanode server]
35. Master Servers
[Diagram: the same cluster, highlighting the HAWQ Master and HAWQ Standby master servers]
37. Segments
[Diagram: the same cluster, highlighting the HAWQ Segments, one on each Datanode server]
40. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
41. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
42. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
43. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
44. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
– Number of null values in the field
45. Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
– Number of null values in the field
– Average width of the field in bytes
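A quick way to inspect these statistics, assuming HAWQ exposes the Postgres-style pg_class and pg_stats catalog views its metadata is modeled on (the table name 'sales' is hypothetical):

-- Row and page counts for a table
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'sales';

-- Per-column statistics: distinct values, null fraction, average width,
-- most common values and the values-distribution histogram
SELECT attname, n_distinct, null_frac, avg_width,
       most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'sales';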
48. Statistics
No Statistics: How many rows would the join of two tables produce? From 0 to infinity
Row Count: How many rows would the join of two 1000-row tables produce?
49. Statistics
No Statistics: How many rows would the join of two tables produce? From 0 to infinity
Row Count: How many rows would the join of two 1000-row tables produce? From 0 to 1'000'000
50. Statistics
No Statistics: How many rows would the join of two tables produce? From 0 to infinity
Row Count: How many rows would the join of two 1000-row tables produce? From 0 to 1'000'000
Histograms and MCV: How many rows would the join of two 1000-row tables produce, with known field cardinality, values distribution diagram, number of nulls, most common values?
51. Statistics
No Statistics: How many rows would the join of two tables produce? From 0 to infinity
Row Count: How many rows would the join of two 1000-row tables produce? From 0 to 1'000'000
Histograms and MCV: How many rows would the join of two 1000-row tables produce, with known field cardinality, values distribution diagram, number of nulls, most common values? ~ From 500 to 1'500
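A hedged sketch of how the planner gets from statistics to those row estimates (table names are hypothetical; the formula is the textbook equi-join estimate used by Postgres-style cost-based optimizers, not something specific to these slides). ANALYZE collects the row counts, most common values and histograms; EXPLAIN then reports the resulting estimate:

-- Collect row counts, MCVs and histograms for both tables
ANALYZE orders;
ANALYZE customers;

-- The plan shows the estimated row count for the join, derived from those statistics
EXPLAIN
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Classic equi-join estimate:
--   rows(orders) * rows(customers) / max(ndistinct(o.customer_id), ndistinct(c.id)),
-- further refined by the MCV lists and histograms on the join keys.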
53-55. Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)

Example table; rows are split across hash buckets by hash(ID):

ID  Name        Num  Price
1   Apple       10   50
2   Pear        20   80
3   Banana      40   40
4   Orange      25   50
5   Kiwi        5    120
6   Watermelon  20   30
7   Melon       40   100
8   Pineapple   35   90
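This structural metadata is declared at table-creation time. A minimal sketch in Greenplum/HAWQ-style DDL; the table definition is invented for illustration and option spellings (such as bucketnum) may differ between releases:

-- Distribute rows by hash of ID into a fixed number of buckets,
-- and range-partition by a date column.
CREATE TABLE fruits (
    id    int,
    name  text,
    num   int,
    price int,
    sold  date
)
WITH (bucketnum = 8)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sold)
(
    START (date '2015-01-01') END (date '2016-01-01') EVERY (INTERVAL '1 month')
);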
56-58. Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
• General metadata
– Users and groups
– Access privileges
• Stored procedures
– PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R
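As a small illustration of the stored-procedure and privilege support, a PL/pgSQL function in the usual Postgres style (a sketch; the function, table and role names are invented):

-- Hypothetical helper: revenue for one fruit.
CREATE FUNCTION fruit_revenue(p_id int) RETURNS numeric AS $$
DECLARE
    v_total numeric;
BEGIN
    SELECT num * price INTO v_total FROM fruits WHERE id = p_id;
    RETURN coalesce(v_total, 0);
END;
$$ LANGUAGE plpgsql;

-- Access privileges are granted as in Postgres.
GRANT SELECT ON fruits TO analysts;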
60-61. Query Optimizer
• HAWQ uses a cost-based query optimizer
• You have two options
– Planner, which evolved from the Postgres query optimizer
– ORCA (Pivotal Query Optimizer), developed specifically for HAWQ
• Optimizer hints work just like in Postgres
– Enable/disable specific operations
– Change the cost estimates for basic actions
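A sketch of what such Postgres-style hints look like in a session. The enable_* and cost settings follow the Postgres convention; treating "optimizer" as the switch between Planner and ORCA is an assumption about this HAWQ release:

SET optimizer = on;           -- assumed GUC: use ORCA instead of Planner
SET enable_nestloop = off;    -- disable a specific operation, as in Postgres
SET random_page_cost = 2.0;   -- change a cost estimate for a basic action

EXPLAIN SELECT * FROM fruits WHERE id = 3;   -- inspect the resulting plan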
64-67. Storage Formats
Which storage format is optimal?
It depends on what you mean by "optimal":
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
– Minimal time to retrieve a record by key
– Minimal time to retrieve a subset of columns
– etc.
68-69. Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No TOAST
• No ctid, xmin, xmax, cmin, cmax
– Compression
• No compression
• QuickLZ
• Zlib levels 1-9
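A sketch of how a row-oriented table with one of these compression settings might be declared, using Greenplum/HAWQ-style storage options (the table is hypothetical and exact option spellings may vary by release):

-- Row-oriented append-only table with QuickLZ compression.
CREATE TABLE events_row (
    id      int,
    payload text
)
WITH (appendonly = true, orientation = row, compresstype = quicklz)
DISTRIBUTED BY (id);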
70-72. Storage Formats
• Apache Parquet
– Mixed row-columnar table store: the data is split into "row groups" stored in columnar format
– Compression
• No compression
• Snappy
• Gzip levels 1-9
– The row group size and page size can be set for each table separately
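And the Parquet counterpart, again as a hedged sketch with Greenplum/HAWQ-style options; the row group and page sizes shown (in bytes) are arbitrary illustrative values:

-- Parquet table with Snappy compression and explicit
-- per-table row group and page sizes.
CREATE TABLE events_parquet (
    id      int,
    payload text
)
WITH (appendonly = true, orientation = parquet, compresstype = snappy,
      rowgroupsize = 8388608, pagesize = 1048576)
DISTRIBUTED BY (id);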
73-77. Resource Management
• Two main options
– Static resource split: HAWQ and YARN do not know about each other
– YARN: HAWQ asks the YARN Resource Manager for query execution resources
• Flexible cluster utilization
– A small query might run on a subset of nodes
– A query might have many executors on each cluster node to make it run faster
– You can control the parallelism of each query
80-84. Resource Management
• A Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU core usage limit
– MIN/MAX number of executors across the system
– MIN/MAX number of executors on each node
• Can be set up per user or group (a sketch follows below)
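A rough sketch of how such a queue might be declared and attached to a role. The attribute names loosely follow Apache HAWQ resource queue DDL; treat the exact names, values and syntax as assumptions rather than a reference:

-- Hypothetical queue limiting concurrency, memory, cores and executors.
CREATE RESOURCE QUEUE reporting_queue WITH (
    PARENT = 'pg_root',
    ACTIVE_STATEMENTS = 10,        -- max parallel queries
    MEMORY_LIMIT_CLUSTER = 20%,    -- memory usage limit
    CORE_LIMIT_CLUSTER = 20%,      -- CPU core usage limit
    NVSEG_UPPER_LIMIT = 8          -- max executors per query
);

-- Queues are assigned per user (role) or group.
ALTER ROLE report_user RESOURCE QUEUE reporting_queue;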
85-86. External Data
• PXF
– Framework for external data access
– Easy to extend; many public plugins available
– Official plugins: CSV, SequenceFile, Avro, Hive, HBase
– Open source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe
• HCatalog
– HAWQ can query tables from HCatalog the same way as HAWQ native tables
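A sketch of reading external data through PXF as an external table; the host, port, path, profile name, table definitions and the hcatalog schema naming are illustrative assumptions:

-- Hypothetical external table over CSV files in HDFS, read via PXF.
CREATE EXTERNAL TABLE ext_sales (
    id     int,
    amount numeric
)
LOCATION ('pxf://namenode:51200/data/sales/*.csv?Profile=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Hive tables registered in HCatalog can be queried like native tables.
SELECT * FROM hcatalog.default.sales_hive LIMIT 10;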
87-114. Query Example
[Diagram sequence: a query walks through the stages Plan, Resource, Prepare, Execute, Result, Cleanup. The HAWQ Master (Query Parser, Query Optimizer, Query Dispatch, Resource Mgr., Transaction Mgr., Metadata, Postmaster) coordinates HAWQ Segments co-located with the HDFS Datanodes on Servers 1, 2 … N, each with its own Postmaster and local directory; the NameNode and YARN Resource Manager sit alongside the master.]
– Plan: a query executor (QE) process is started on the master for the session, and the optimizer produces the plan: Scan Bars b, Scan Sells s, Filter b.city = 'San Francisco', HashJoin b.name = s.bar, Motion Redist(b.name), Project s.beer, s.price, Motion Gather.
– Resource: the master asks the YARN Resource Manager "I need 5 containers, each with 1 CPU core and 256 MB RAM"; YARN answers "Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers".
– Prepare: QE processes are started on the segments inside the allocated containers.
– Execute: the plan is dispatched to the QEs and executed in parallel.
– Result: results are gathered back to the master and returned to the client.
– Cleanup: the master asks YARN to free the query resources (the containers on Servers 1, 2 and N); YARN acknowledges with "OK".
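For reference, the plan above corresponds to a join query along these lines (table and column names are taken from the plan operators; the exact SQL shown in the talk is an assumption):

SELECT s.beer, s.price
FROM Bars b
JOIN Sells s ON b.name = s.bar
WHERE b.city = 'San Francisco';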
116-120. Query Performance
• Data does not hit the disk unless this cannot be avoided
• Data is not buffered on the segments unless this cannot be avoided
• Data is transferred between the nodes over UDP
• HAWQ has a good cost-based query optimizer
• The C/C++ implementation is more efficient than the Java implementations of competing solutions
• Query parallelism can be easily tuned
132-133. Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
• Native support for the Cloudera, MapR and IBM Hadoop distributions
• Make it the best SQL-on-Hadoop engine ever!
134. Summary
• A modern SQL-on-Hadoop engine
• For structured data processing and analysis
• Combines the best techniques of competing solutions
• Just released as open source
• The community is very young
Join our community and contribute!