Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers SQL support, performance gains ranging from 3-4x on I/O-bound workloads up to 90x over MapReduce, and the flexibility to query existing Hadoop data without migrating or duplicating it. The Impala 2.0 release adds window functions, subqueries, and the ability to spill joins and aggregations to disk when memory is exhausted.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
1. 1
Cloudera Impala
LV Big Data Monthly Meetup #1
November 5th 2014
Maxime Dumas
Systems Engineer
2. Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
2
3. What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
3
4. What This Talk Isn’t About
• deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
• depends heavily on data and workload
• coding
• unless you count XML or CSV or SQL
• algorithms
4
7. cloud·e·ra im·pal·a
7
/kloudˈi(ə)rə imˈpalə/
noun
a modern, open source, MPP SQL query engine
for Apache Hadoop.
“Cloudera Impala provides fast, ad hoc SQL query
capability for Apache Hadoop, complementing
traditional MapReduce batch processing.”
8. Impala adoption
8
Vendor support by component (and founder):
Component (and Founder) | Cloudera | MapR | Amazon | IBM | Pivotal | Hortonworks
Impala (Cloudera) | ✔ | ✔ | ✔ | X | X | X
Hue (Cloudera) | ✔ | ✔ | X | X | X | ✔
Sentry (Cloudera) | ✔ | ✔ | X | ✔ | ✔ | X
Flume (Cloudera) | ✔ | ✔ | X | ✔ | ✔ | ✔
Parquet (Cloudera/Twitter) | ✔ | ✔ | X | ✔ | ✔ | X
Sqoop (Cloudera) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔
Ambari (Hortonworks) | X | X | X | X | ✔ | ✔
Knox (Hortonworks) | X | X | X | X | X | ✔
Tez (Hortonworks) | X | X | X | X | X | ✔
Drill (MapR) | X | ✔ | X | X | X | X
9. 9
The Apache Hadoop Ecosystem
Quick and dirty, for context.
11. Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
11
12. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
12
14. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
14
15. Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
15
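A minimal HiveQL sketch (the table and column names here are hypothetical): a query like this is compiled into one or more MapReduce jobs rather than executed directly.

-- Hive compiles this GROUP BY into a MapReduce job behind the scenes
SELECT visit_date, COUNT(*) AS page_views
FROM web_logs
GROUP BY visit_date;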
21. Cloudera Impala
• Interactive query on Hadoop
• think seconds, not minutes
• ANSI-92 standard SQL
• compatible with HiveQL
• Native MPP query engine
• built for low-latency queries
• HDFS and HBase storage
21
22. Cloudera Impala – Design Choices
• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching (example below)
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce
22
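As a hedged illustration of the HDFS caching point above (the table and cache pool names are hypothetical), a hot table can be pinned into the HDFS in-memory cache:

-- pin a frequently queried table into an HDFS cache pool (names illustrative)
ALTER TABLE web_logs SET CACHED IN 'impala_cache_pool';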
23. Benefits of Impala
Unlocks BI/analytics on Hadoop
• Interactive SQL in seconds
• Highly concurrent to handle 100s of users
Native Hadoop flexibility
• No data migration, conversion, or duplication required
• Query existing Hadoop data
• Run multiple frameworks on the same data at the same time
• Supports Parquet for best-of-breed columnar performance (example below)
Native MPP query engine designed into Hadoop:
• Unified Hadoop storage
• Unified Hadoop metadata (uses Hive and HCatalog)
• Unified Hadoop security
• Fine-grained role-based access controls with Sentry
Apache-licensed open source
Proven in Production
23
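As a sketch of the Parquet point above (table names are hypothetical), an existing text-format table can be converted with a single CREATE TABLE AS SELECT:

-- create a Parquet-format copy of a text table (names illustrative)
CREATE TABLE logs_parquet STORED AS PARQUET AS SELECT * FROM logs_text;
-- subsequent queries scan the columnar Parquet data
SELECT referrer, COUNT(*) FROM logs_parquet GROUP BY referrer;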
24. Cloudera Impala – Architecture
• Impala Daemon
• runs on every node
• handles client requests
• handles query planning & execution
• State Store Daemon
• provides name service
• metadata distribution
• used for finding data
24
26. Impala Query Execution
26
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
[Diagram: a SQL app connects over ODBC to an impalad; each of three impalad nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase, with the Hive Metastore, HDFS NameNode, and Statestore serving metadata.]
27. Impala Query Execution
27
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
[Diagram: the same three-node layout, with intermediate results streaming between the Query Executors and final query results streaming back to the SQL app.]
28. Cloudera Impala – Results
• Allows for fast iteration/discovery
• How much faster?
• 3-4x faster on I/O bound workloads
• up to 45x faster on multi-MR queries
• up to 90x faster when data is in the in-memory cache
28
29. Latest SQL Performance
Single User vs 10 User Response Time: Impala Times Faster (lower is better; time in seconds)

Engine | Single User | 10 Users | vs Impala (single / 10 users)
Impala | 5 | 11 | (baseline)
Spark SQL | 25 | 120 | 5.0x / 10.6x
Presto | 37 | 302 | 7.4x / 27.4x
Hive-on-Tez | 77 | 202 | 15.4x / 18.3x
Independent validation by IBM Research SQL-on-Hadoop VLDB paper:
“Impala’s database architecture provides significant performance gains”
29
30. Previous Milestones
Impala 1.0 (GA) – Spring 2013
Impala 1.1 (Security) – Summer 2013
Impala 1.2 (Usability) – Fall 2013
Impala 1.3 (Resource Management) – Spring 2014
Impala 1.4 (Extensibility) – Summer 2014
Impala 2.0 (SQL) – Fall 2014
[Timeline: progression toward Analytic Database Capabilities]
30
31. Cloudera Impala 2.0
Window Functions
“Aggregate function applied to a partition of the result set” (SQL 2003)
Ex:
sum(population) OVER (PARTITION BY city)
rank() OVER (PARTITION BY state ORDER BY population)
We’ve implemented most of the spec
• PARTITION BY, ORDER BY
• WINDOW
• PRECEDING, FOLLOWING
• ROWS
• Any number of analytic functions in one query
31
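A slightly fuller sketch of the analytic syntax above (table and column names are hypothetical):

-- rank cities within each state and compare each to the state total
SELECT city, state, population,
       rank() OVER (PARTITION BY state ORDER BY population DESC) AS state_rank,
       sum(population) OVER (PARTITION BY state) AS state_population
FROM cities;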
32. Cloudera Impala 2.0
Subqueries
A query that is part of another query. Ex:
select col from t1
where col in
(select c2 from t2)
Support:
• Correlated and uncorrelated subqueries.
• IN, NOT IN, EXISTS, NOT EXISTS
32
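Hedged examples of both subquery flavors (table and column names are hypothetical):

-- uncorrelated: the subquery is independent of the outer query
SELECT name FROM customers
WHERE id IN (SELECT customer_id FROM orders);

-- correlated: the subquery references the outer row
SELECT name FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o
              WHERE o.customer_id = c.id AND o.total > 1000);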
33. Cloudera Impala 2.0
Spill to disk joins & aggregations
• Previously, if a query ran out of memory, Impala would abort it
• This meant some big joins (fact table to fact table) could never run.
• All operators that accumulate memory can now spill to disk if necessary.
• Order by (Impala 1.4)
• Join/Agg (Impala 2.0)
• Analytic Functions (Impala 2.0)
• Transparent to existing workloads
33
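A sketch of how spilling surfaces to users (MEM_LIMIT is a real Impala query option; the limit value and table names are illustrative):

-- cap per-node query memory; before 2.0, exceeding it aborted the query
SET MEM_LIMIT=2g;
-- a fact-to-fact join can now spill hash-join partitions to disk
SELECT s.store_id, SUM(r.amount)
FROM sales s JOIN returns r ON s.txn_id = r.txn_id
GROUP BY s.store_id;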
34. Cloudera Impala 2.1 +
34
• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)
• MERGE statement – enables merging updates into existing tables
• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SET
• SQL SET operators – MINUS, INTERSECT
• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase
• UDTFs (user-defined table functions) – for more advanced user functions and extensibility
• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of Impala's existing performance gains
• Parquet enhancements – continued performance gains including index pages
• Amazon S3 integration
39. 39
Thank You!
Maxime Dumas
mdumas@cloudera.com
We’re hiring.
Editor's Notes
Similar to the Red Hat model.
Hadoop elephant logo licensed for public use via Apache license: Apache Software Foundation, http://www.apache.org/foundation/marks/
Furthermore, for projects that carry the Apache License, openness does not always guarantee freedom from lock-in to a single support provider. For example, Drill, Knox, Tez, and Falcon are all open source, and all shipped by a single vendor – what’s a better example of “lock-in” than that?
We’re going to breeze through these really quick, just to show how Search plugs in later…
Lose a server, no problem. Lose a rack, no problem.
We’re going to breeze through these really quick, just to show how Search plugs in later…
More & Faster Value from Big Data
Provides an interactive BI/Analytics experience on Hadoop
Previously BI/Analytics was impractical due to the batch orientation of MapReduce
Enables more users to gain value from organizational data assets (SQL/BI users)
Makes more data available for analysis (raw data, multi-structured data, historical data)
Removes delays from data migration
Into specialized analytical DBMSs
Into proprietary file formats that happen to be stored in HDFS
Into transient in-memory stores
Flexibility
Query across existing data in Hadoop
HDFS and HBase
Access data immediately and directly in its native format
Select best-fit file formats
Use raw data formats when unsure of access patterns (text files, RCFiles, LZO)
Increase performance with optimized file formats when access patterns are known (Parquet, Avro)
Run multiple frameworks on the same data at the same time
All file formats are compatible across the entire Hadoop ecosystem – i.e. MapReduce, Pig, Hive, Impala, etc. on the same data at the same time
Cost Efficiency
Reduce movement, duplicate storage & compute
Data movement: no time or resource penalty for migrating data into specialized systems or formats
Duplicate storage: no need to duplicate data across systems or within the same system in different file formats
Compute: use the same compute resources as the rest of the Hadoop system –
You don’t need a separate set of nodes to run interactive query vs. batch processing (MapReduce)
You don’t need to overprovision your hardware to enable memory-intensive, on-the-fly format conversions
10% to 1% the cost of analytic DBMSs
Less than $1,000/TB
Full Fidelity Analysis
No loss of fidelity from aggregations or conforming to fixed schemas
If the attribute exists in the raw data, you can query against it
These run continuously, always ready. In C/C++ for the most part.
Impala 1.0
~SQL-92 (minus correlated sub-queries)
Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
Enterprise-readiness (authentication, ODBC/JDBC drivers, etc)
Service-level resource isolation with other Hadoop frameworks
Impala 1.1
Fine-grained, role-based authorization via Apache Sentry
Auditing (Impala 1.1.1 and CM 4.7+)
Impala 1.2
Custom language extensibility (UDFs, UDAFs)
Cost-based join-order optimization
On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility
Impala 1.3 / CDH 5.0 (also has version for CDH 4.x)
Resource management
RANGE windows are not supported.
Range windows let you specify a range based on the current row’s value (as opposed to ROWS, which is ordinal).
Example:
sum(c) OVER (ORDER BY year RANGE BETWEEN 1 PRECEDING AND 2 FOLLOWING)
Error: “RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW."
No UDA support
Not all aggregate functions are supported (ndv, etc)
Looking at both for 2.1.
All subqueries are rewritten as joins.
No “Independent evaluation”
We’ve added additional join types to support this:
LEFT/RIGHT ANTI-JOIN
RIGHT SEMI-JOIN
NULL AWARE LEFT ANTI JOIN
Subqueries are only supported in the WHERE clause.
Impala can’t always prove that a subquery returns one row:
select col limit 1 works
select min(col) works
select min(col) where x = 1 group by x doesn’t
Can manually add a limit 1 to the subquery (sketched below).
See docs for more details
These should all have error messages explaining why
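A sketch of that limit-1 workaround (table and column names are hypothetical):

-- rejected: Impala can't prove the grouped subquery yields one row
select col from t1 where col = (select min(c2) from t2 where x = 1 group by x);
-- accepted: the explicit limit makes the single-row guarantee visible
select col from t1 where col = (select min(c2) from t2 where x = 1 group by x limit 1);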
We implemented the common use cases.
Impala hash partitions the input to the operator, spilling partitions as necessary
When all the input is partitioned, Impala processes the partitions that are still in memory (did not spill)
Impala then processes the spilled partitions 1 by 1, repartitioning if necessary.
Impala tries to minimize the number of spilled bytes.
Peak memory usage occurs when the first spill happens.
It stays high until all the non-spilled partitions are handled,
then drops as the spilled partitions are processed 1 by 1.
We’re going to breeze through these really quick, just to show how Search plugs in later…