Cloudera Impala: A Modern SQL Engine for Apache Hadoop
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets using existing SQL skills. Impala's architecture includes impalad daemons that process queries in parallel across nodes, a statestore for metadata coordination, and a new execution engine written in C++. It aims to provide faster performance than Hive for interactive queries while leveraging Hadoop's existing ecosystem. The first general availability release is planned for April 2013.
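The scatter-gather pattern behind those parallel impalad daemons can be sketched in plain Python: a coordinator fans work out to workers, each aggregates its local partition, and the coordinator merges the partial results. The node names and partitions below are invented for illustration; this is the general MPP shape, not Impala's internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative data partitions, one per (hypothetical) worker node.
PARTITIONS = {
    "node1": [3, 1, 4],
    "node2": [1, 5, 9],
    "node3": [2, 6, 5],
}

def scan_and_aggregate(rows):
    # Each worker computes a partial aggregate over its local partition.
    return sum(rows)

def coordinator_sum():
    # The coordinator runs the partial scans in parallel, then merges.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(scan_and_aggregate, PARTITIONS.values())
    return sum(partials)  # final merge on the coordinator

print(coordinator_sum())  # 36
```

The point of the sketch is that only small partial aggregates travel to the coordinator; the bulk of the data stays where it is scanned.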
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
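The micro-batch idea behind Spark Streaming can be shown without Spark at all: chop the stream into small batches and process each one with ordinary batch code. The events and batch size below are made up for the example.

```python
# Minimal sketch of the micro-batch model: a live stream is split into
# small batches, and each batch is handled by a normal batch operation.
def micro_batches(stream, batch_size):
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

def process(batch):
    # Any batch computation fits here; we simply count events per batch.
    return len(batch)

events = ["click", "view", "click", "buy", "view", "click", "view"]
per_batch_counts = [process(b) for b in micro_batches(events, 3)]
print(per_batch_counts)  # [3, 3, 1]
```

Small batches keep end-to-end latency low while reusing the same code path as batch processing, which is exactly the unification the abstract describes.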
This talk was given by Marcel Kornacker at the 11th meeting on April 7, 2014.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
Application architectures with Hadoop – Big Data TechCon 2014
Building applications using Apache Hadoop with a use-case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston in April 2014
Jan 2013 HUG: Impala - Real-time Queries for Apache Hadoop
Impala is a SQL query engine for Apache Hadoop that allows for interactive queries on large datasets. It uses a distributed architecture where each node runs an Impala daemon and queries are distributed across nodes. Impala aims to provide general-purpose SQL with high performance by using C++ instead of Java and avoiding MapReduce execution. It runs directly on Hadoop storage systems and supports common file formats like Parquet and Avro.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
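The RDD execution model described above — chained transformations that build a plan, with work deferred until an action forces it — can be mimicked in a few lines of plain Python. `ToyRDD` is a made-up class for illustration, not Spark's API.

```python
# Sketch of lazy, chained RDD-style transformations: map/filter only
# record the plan; nothing executes until collect() (the "action").
class ToyRDD:
    def __init__(self, data, ops=()):
        self.data, self.ops = data, tuple(ops)

    def map(self, f):
        return ToyRDD(self.data, self.ops + (("map", f),))

    def filter(self, p):
        return ToyRDD(self.data, self.ops + (("filter", p),))

    def collect(self):
        # Only now does the recorded plan run, in order.
        out = list(self.data)
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16]
```

Deferring execution is what lets Spark assemble the whole chain into a directed acyclic graph and optimize it before anything runs.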
Operationalizing Data Science Using Cloud Foundry
The document discusses how operationalizing machine learning models through continuous deployment and monitoring is important for realizing business value but often overlooked. It describes how Alpine Data's Chorus platform, combined with Pivotal's Big Data Suite and Cloud Foundry, can provide a turn-key solution for operationalizing models by deploying scalable scoring engines that consume models exported in the PFA format. The platform aims to make it simple to deploy both individual models and complex scoring flows represented as PFA documents, ensuring models have maximum impact on the business.
This document discusses SQL engines for Hadoop, including Hive, Presto, and Impala. Hive is best for batch jobs due to its stability. Presto provides interactive queries across data sources and is easier to manage than Hive with Tez. Presto's distributed architecture allows queries to run in parallel across nodes. It supports pluggable connectors to access different data stores and has language bindings for multiple clients.
The document discusses architectural considerations for implementing clickstream analytics using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data modeling including file formats and partitioning, data ingestion methods like Flume and Sqoop, available processing engines like MapReduce, Hive, Spark and Impala, and the need to sessionize clickstream data to analyze metrics like bounce rates and attribution.
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source, community-driven project to provide easy, dependable, fast, and flexible ad hoc query capabilities.
The document discusses Impala, a SQL query engine for Hadoop. It was created to enable low-latency queries on Hadoop data by using a new execution engine instead of MapReduce. Impala aims to provide high performance SQL queries on HDFS, HBase and other Hadoop data. It runs as a distributed service and queries are distributed to nodes and executed in parallel. The document covers Impala's architecture, query execution process, and its planner which partitions queries for efficient execution.
This document provides information about running Spark on YARN including:
- Spark allows processing of large datasets in a distributed manner using Resilient Distributed Datasets (RDDs).
- When running on YARN, Spark is able to leverage existing Hadoop clusters for locality-aware processing, resource management, and other benefits while still using its own execution engine.
- Running Spark on YARN provides advantages like shipping code to where the data is located instead of moving large amounts of data, leveraging existing Hadoop cluster infrastructure, and allowing Spark workloads to run natively within Hadoop.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
The document summarizes strengths and weaknesses of Cloudera Impala. Key strengths include excellent performance for analytical queries over large datasets, SQL compliance, and integration with the Hadoop ecosystem. Weaknesses are slow random access, lack of fault tolerance, a tedious data-updating process, and memory-intensive queries. The conclusion is that Impala is well-suited for analytics on immutable data but not for workloads with frequent updates.
1) HAWQ is an SQL and machine learning engine that runs on Hadoop, providing SQL capabilities and machine learning functionality directly on HDFS data.
2) HAWQ provides up to 30x faster performance than other SQL-on-Hadoop engines like Impala and Hive, through its massively parallel processing (MPP) architecture and query optimization capabilities.
3) Key features of HAWQ include ANSI SQL compliance, integrated machine learning via the MADlib library, flexible deployment across on-premises and cloud environments, and high scalability to petabytes of data.
Big Data Developers Moscow Meetup 1 – SQL on Hadoop
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how to select the correct tool?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
Impala is a SQL query engine for Apache Hadoop that allows real-time queries on large datasets. It is designed to provide high performance for both analytical and transactional workloads by running directly on Hadoop clusters and utilizing C++ code generation and in-memory processing. Impala uses the existing Hadoop ecosystem including metadata storage in Hive and data formats like Avro, but provides faster performance through its new query execution engine compared to traditional MapReduce-based systems like Hive. Future development of Impala will focus on improved support for features like HBase, additional SQL functionality, and query optimization.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, performance gains of 3-4x (up to 90x) over MapReduce, and the flexibility to query existing Hadoop data without migrating or duplicating it. The latest release, Impala 2.0, includes new features like window functions, subqueries, and spilling of joins and aggregations to disk when memory is exhausted.
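The window functions mentioned above compute aggregates over ordered rows without collapsing them. As a sketch of the semantics of something like `SUM(amount) OVER (ORDER BY day)`, the plain-Python running total below mimics what such a query returns; the rows are invented for the example, and this is not Impala's implementation.

```python
# Illustrative rows: (day, amount), already in window order.
rows = [("mon", 10), ("tue", 5), ("wed", 20)]

def running_total(rows):
    # Mimics SUM(amount) OVER (ORDER BY day): each input row is kept,
    # paired with the cumulative sum up to and including that row.
    total, out = 0, []
    for day, amount in rows:
        total += amount
        out.append((day, total))
    return out

print(running_total(rows))  # [('mon', 10), ('tue', 15), ('wed', 35)]
```

Unlike a GROUP BY, every input row survives in the output, which is what makes window functions useful for rankings and running metrics in BI queries.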
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
CDH is a popular distribution of Apache Hadoop and related projects that delivers scalable storage and distributed computing through Apache-licensed open source software. It addresses challenges in storing and analyzing large datasets known as Big Data. Hadoop is a framework for distributed processing of large datasets across computer clusters using simple programming models. Its core components are HDFS for storage, MapReduce for processing, and YARN for resource management. The Hadoop ecosystem also includes tools like Kafka, Sqoop, Hive, Pig, Impala, HBase, Spark, Mahout, Solr, Kudu, and Sentry that provide functionality like messaging, data transfer, querying, machine learning, search, and authorization.
Data Explosion
- TBs of data generated every day
Solution – HDFS to store the data and the Hadoop MapReduce framework to parallelize its processing
What is the catch?
Hadoop MapReduce is Java-intensive
Thinking in the MapReduce paradigm can get tricky
An engine to process big data in a faster (than MapReduce), easier, and extremely scalable way: an open-source, parallel, in-memory cluster computing framework. A solution for loading, processing, and analyzing large-scale data end to end. Iterative and interactive, with APIs in Scala, Java, Python, and R, plus a command-line interface.
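The iterative, in-memory advantage described above can be sketched in plain Python: the working set is loaded once and reused across iterations, instead of being re-read from storage on every pass as a chain of MapReduce jobs would require. The data and the "refinement" step are invented for illustration.

```python
# The dataset is loaded once and stays in memory across iterations.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

def converge_mean(data, rounds):
    # Each round re-reads the same in-memory list (cheap), moving an
    # estimate halfway toward the data mean until it converges.
    estimate = 0.0
    for _ in range(rounds):
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate

print(round(converge_mean(data, 20), 3))  # 6.5
```

For algorithms that scan the same data dozens of times (clustering, regression, PageRank), avoiding a disk round-trip per iteration is where the "faster than MR" claim comes from.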
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at:
http://www.gopivotal.com/big-data/pivotal-hd
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
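The query vectorization mentioned in the webinar description means evaluating operators over batches of column values rather than one row at a time. The toy filter below illustrates the batching shape in plain Python; the batch size and data are invented, and real vectorized engines gain far more from CPU-friendly columnar layouts than this sketch can show.

```python
# Toy illustration of vectorized execution: the predicate runs over one
# columnar batch at a time in a tight inner loop, not row by row across
# a whole wide row.
def filter_vectorized(column, predicate, batch_size=4):
    kept = []
    for i in range(0, len(column), batch_size):
        batch = column[i:i + batch_size]               # one columnar batch
        kept.extend(v for v in batch if predicate(v))  # tight inner loop
    return kept

prices = [3, 17, 8, 25, 42, 1, 19, 30]
print(filter_vectorized(prices, lambda v: v > 15))  # [17, 25, 42, 19, 30]
```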
Spark is a general-purpose cluster computing framework that provides high-level APIs and is faster than Hadoop for iterative jobs and interactive queries. It leverages cached data in cluster memory across nodes for faster performance. Spark supports various higher-level tools including SQL, machine learning, graph processing, and streaming.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
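The word-count job mentioned in that overview can be written out as the three MapReduce stages it names: map emits `(word, 1)` pairs, shuffle groups the pairs by key, and reduce sums each group. The input text below is illustrative; real MapReduce distributes these stages across machines.

```python
from collections import defaultdict

def map_stage(lines):
    # Map: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    # Reduce: sum the grouped counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big hadoop", "hadoop big"]
print(reduce_stage(shuffle(map_stage(lines))))
# {'big': 3, 'data': 1, 'hadoop': 2}
```

HiveQL hides exactly this pipeline: a `SELECT word, COUNT(*) ... GROUP BY word` compiles down to the same map/shuffle/reduce shape.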
The document provides an agenda and overview for a Big Data Warehousing meetup hosted by Caserta Concepts. The meetup agenda includes an introduction to SparkSQL with a deep dive on SparkSQL and a demo. Elliott Cordo from Caserta Concepts will provide an introduction and overview of Spark as well as a demo of SparkSQL. The meetup aims to share stories in the rapidly changing big data landscape and provide networking opportunities for data professionals.
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem
Dive into the world of Apache Hive with our insightful presentation covering a range of topics, including Hive introduction, dispelling misconceptions about Hive, exploring its features and origins, understanding the why behind Hive's existence, delving into its architecture and working principles, dissecting the data model employed in Hive, exploring different modes of operation, and weighing the advantages and disadvantages it brings. The presentation concludes with practical examples, demonstrating how to create tables in Hive, upload data, and execute queries within the Hadoop environment. Join us on a journey through the intricacies of Hive, unraveling its capabilities and applications in big data analytics
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
The document summarizes several popular options for SQL on Hadoop including Hive, SparkSQL, Drill, HAWQ, Phoenix, Trafodion, and Splice Machine. Each option is reviewed in terms of key features, architecture, usage patterns, and strengths/limitations. While all aim to enable SQL querying of Hadoop data, they differ in support for transactions, latency, data types, and whether they are native to Hadoop or require separate processes. Hive and SparkSQL are best for batch jobs while Drill, HAWQ and Splice Machine provide lower latency but with different integration models and capabilities.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves its lightning-fast results, and show how it complements Apache Hadoop. By the end of the session, you'll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
This document discusses Microsoft's efforts to make big data technologies like Hadoop more accessible through its products. It describes Hadoop, MapReduce, HDFS, and other big data concepts. It then outlines Microsoft's project to create a Hadoop distribution that runs on Windows Server and Windows Azure, including building an ODBC driver to allow tools like Excel to query Hadoop. This will help bring big data to more business users and integrate it with Microsoft's existing BI technologies.
Hadoop con 2015 hadoop enables enterprise data lake
The growth of the mobile Internet, social media, and smart devices has driven an explosion of information, generating huge volumes of unstructured and semi-structured data. The variety of data formats and the speed at which data is produced pose unprecedented challenges to enterprise information architectures. Faced with diverse data structures and analysis tools, what architecture should we adopt to integrate them so that we can effectively manage the data lifecycle and extract value from the data? Within this architecture, the Hadoop ecosystem will undoubtedly play the role of the foundational data platform, realizing the enterprise Data Lake.
This document provides an overview of an advanced Big Data hands-on course covering Hadoop, Sqoop, Pig, Hive and enterprise applications. It introduces key concepts like Hadoop and large data processing, demonstrates tools like Sqoop, Pig and Hive for data integration, querying and analysis on Hadoop. It also discusses challenges for enterprises adopting Hadoop technologies and bridging the skills gap.
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
This document discusses using Hadoop/MapReduce with Solr/Lucene for large-scale distributed search. It begins with an introduction to the speaker and his experience with Hadoop. The agenda then covers why big data search matters, an overview of Lucene, Solr, and ZooKeeper, distributed searching and indexing with Hadoop, and a case study on web log categorization.
Software development... for all? (keynote at ICSOFT'2024)
Our world runs on software. It governs all major aspects of our life. It is an enabler for research and innovation, and is critical for business competitivity. Traditional software engineering techniques have achieved high effectiveness, but still may fall short on delivering software at the accelerated pace and with the increasing quality that future scenarios will require.
To attack this issue, some software paradigms raise the automation of software development via higher levels of abstraction through domain-specific languages (e.g., in model-driven engineering) and empowering non-professional developers with the possibility to build their own software (e.g., in low-code development approaches). In a software-demanding world, this is an attractive possibility, and perhaps -- paraphrasing Andy Warhol -- "in the future, everyone will be a developer for 15 minutes". However, to make this possible, methods are required to tweak languages to their context of use (crucial given the diversity of backgrounds and purposes), and the assistance to developers throughout the development process (especially critical for non-professionals).
In this keynote talk at ICSOFT'2024 I presented enabling techniques for this vision, supporting the creation of families of domain-specific languages, their adaptation to the usage context, and the augmentation of low-code environments with assistants and recommender systems to guide developers (professional or not) in the development process.
Impala is an open-source SQL query engine for Apache Hadoop that allows for fast, interactive queries directly against data stored in HDFS and other data storage systems. It provides low-latency queries in seconds by using a custom query engine instead of MapReduce. Impala allows users to interact with data using standard SQL and business intelligence tools while leveraging existing metadata in Hadoop. It is designed to be integrated with the Hadoop ecosystem for distributed, fault-tolerant and scalable data processing and analytics.
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets using existing SQL skills. Impala's architecture includes impalad daemons that process queries in parallel across nodes, a statestore for metadata coordination, and a new execution engine written in C++. It aims to provide faster performance than Hive for interactive queries while leveraging Hadoop's existing ecosystem. The first general availability release is planned for April 2013.
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
This talk was held at the 11th meeting on April 7 2014 by Marcel Kornacker.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
Building applications using Apache Hadoop with a use-case of clickstream analysis. Presented by Mark Grover and Jonathan Seidman at Big Data TechCon, Boston in April 2014
Impala is a SQL query engine for Apache Hadoop that allows for interactive queries on large datasets. It uses a distributed architecture where each node runs an Impala daemon and queries are distributed across nodes. Impala aims to provide general-purpose SQL with high performance by using C++ instead of Java and avoiding MapReduce execution. It runs directly on Hadoop storage systems and supports common file formats like Parquet and Avro.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Operationalizing Data Science Using Cloud FoundryVMware Tanzu
The document discusses how operationalizing machine learning models through continuous deployment and monitoring is important to realize business value but often overlooked, and describes how Alpine Data's Chorus platform in combination with Pivotal's Big Data Suite and Cloud Foundry can provide a turn-key solution for operationalizing models by deploying scalable scoring engines that can consume models exported in the PFA format. The platform aims to make it simple to deploy both individual models and complex scoring flows represented as PFA documents to ensure models have maximum impact on the business.
This document discusses SQL engines for Hadoop, including Hive, Presto, and Impala. Hive is best for batch jobs due to its stability. Presto provides interactive queries across data sources and is easier to manage than Hive with Tez. Presto's distributed architecture allows queries to run in parallel across nodes. It supports pluggable connectors to access different data stores and has language bindings for multiple clients.
The document discusses architectural considerations for implementing clickstream analytics using Hadoop. It covers choices for data storage layers like HDFS vs HBase, data modeling including file formats and partitioning, data ingestion methods like Flume and Sqoop, available processing engines like MapReduce, Hive, Spark and Impala, and the need to sessionize clickstream data to analyze metrics like bounce rates and attribution.
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
The document discusses Impala, a SQL query engine for Hadoop. It was created to enable low-latency queries on Hadoop data by using a new execution engine instead of MapReduce. Impala aims to provide high performance SQL queries on HDFS, HBase and other Hadoop data. It runs as a distributed service and queries are distributed to nodes and executed in parallel. The document covers Impala's architecture, query execution process, and its planner which partitions queries for efficient execution.
This document provides information about running Spark on YARN including:
- Spark allows processing of large datasets in a distributed manner using Resilient Distributed Datasets (RDDs).
- When running on YARN, Spark is able to leverage existing Hadoop clusters for locality-aware processing, resource management, and other benefits while still using its own execution engine.
- Running Spark on YARN provides advantages like shipping code to where the data is located instead of moving large amounts of data, leveraging existing Hadoop cluster infrastructure, and allowing Spark workloads to run natively within Hadoop.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
The document summarizes strengths and weaknesses of Cloudera Impala. Key strengths include excellent performance for analytical queries over large datasets, SQL compliance, and integration with Hadoop ecosystem. Weaknesses are slow random access, lack of fault tolerance, tedious data updating process, and memory intensive queries. The conclusion is that Impala is well-suited for analytics on immutable data but not for workloads with frequent updates.
1) HAWQ is an SQL and machine learning engine that runs on Hadoop, providing SQL capabilities and machine learning functionality directly on HDFS data.
2) HAWQ provides up to 30x faster performance than other SQL-on-Hadoop engines like Impala and Hive, through its massively parallel processing (MPP) architecture and query optimization capabilities.
3) Key features of HAWQ include ANSI SQL compliance, integrated machine learning via the MADlib library, flexible deployment across on-premises and cloud environments, and high scalability to petabytes of data.
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
SQL on Hadoop
Looking for the correct tool for your SQL-on-Hadoop use case?
There is a long list of alternatives to choose from; how to select the correct tool?
The tool selection is always based on use case requirements.
Read more on alternatives and our recommendations.
Impala is a SQL query engine for Apache Hadoop that allows real-time queries on large datasets. It is designed to provide high performance for both analytical and transactional workloads by running directly on Hadoop clusters and utilizing C++ code generation and in-memory processing. Impala uses the existing Hadoop ecosystem including metadata storage in Hive and data formats like Avro, but provides faster performance through its new query execution engine compared to traditional MapReduce-based systems like Hive. Future development of Impala will focus on improved support for features like HBase, additional SQL functionality, and query optimization.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, improved performance of 3-4x up to 90x faster than MapReduce, and flexibility to query existing Hadoop data without needing to migrate or duplicate it. The latest release of Impala 2.0 includes new features like window functions, subqueries, and spilling joins and aggregations to disk when memory is exhausted.
Summary of recent progress on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
CDH is a popular distribution of Apache Hadoop and related projects that delivers scalable storage and distributed computing through Apache-licensed open source software. It addresses challenges in storing and analyzing large datasets known as Big Data. Hadoop is a framework for distributed processing of large datasets across computer clusters using simple programming models. Its core components are HDFS for storage, MapReduce for processing, and YARN for resource management. The Hadoop ecosystem also includes tools like Kafka, Sqoop, Hive, Pig, Impala, HBase, Spark, Mahout, Solr, Kudu, and Sentry that provide functionality like messaging, data transfer, querying, machine learning, search, and authorization.
Data Explosion
- TBs of data generated everyday
Solution – HDFS to store data and Hadoop Map-Reduce framework to parallelize processing of Data
What is the catch?
Hadoop Map Reduce is Java intensive
Thinking in Map Reduce paradigm can get tricky
An Engine to process big data in faster(than MR), easy and extremely scalable way. An Open Source, parallel, in-memory processing, cluster computing framework. Solution for loading, processing and end to end analyzing large scale data. Iterative and Interactive : Scala, Java, Python, R and with Command line interface.
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013) – VMware Tanzu
Recorded at SpringOne2GX 2013 in Santa Clara, CA
Speaker: Adam Shook
This session assumes absolutely no knowledge of Apache Hadoop and will provide a complete introduction to all the major aspects of the Hadoop ecosystem of projects and tools. If you are looking to get up to speed on Hadoop, trying to work out what all the Big Data fuss is about, or just interested in brushing up your understanding of MapReduce, then this is the session for you. We will cover all the basics with detailed discussion about HDFS, MapReduce, YARN (MRv2), and a broad overview of the Hadoop ecosystem including Hive, Pig, HBase, ZooKeeper and more.
Learn More about Spring XD at: http://projects.spring.io/spring-xd
Learn More about Gemfire XD at: http://www.gopivotal.com/big-data/pivotal-hd
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud – Gluent
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
Spark is a general-purpose cluster computing framework that provides high-level APIs and is faster than Hadoop for iterative jobs and interactive queries. It leverages cached data in cluster memory across nodes for faster performance. Spark supports various higher-level tools including SQL, machine learning, graph processing, and streaming.
The document provides an overview of big data and Hadoop fundamentals. It discusses what big data is, the characteristics of big data, and how it differs from traditional data processing approaches. It then describes the key components of Hadoop including HDFS for distributed storage, MapReduce for distributed processing, and YARN for resource management. HDFS architecture and features are explained in more detail. MapReduce tasks, stages, and an example word count job are also covered. The document concludes with a discussion of Hive, including its use as a data warehouse infrastructure on Hadoop and its query language HiveQL.
The document provides an agenda and overview for a Big Data Warehousing meetup hosted by Caserta Concepts. The meetup agenda includes an introduction to SparkSQL with a deep dive on SparkSQL and a demo. Elliott Cordo from Caserta Concepts will provide an introduction and overview of Spark as well as a demo of SparkSQL. The meetup aims to share stories in the rapidly changing big data landscape and provide networking opportunities for data professionals.
Unveiling Hive: A Comprehensive Exploration of Hive in Hadoop Ecosystem – mashoodsyed66
Dive into the world of Apache Hive with our insightful presentation covering a range of topics, including Hive introduction, dispelling misconceptions about Hive, exploring its features and origins, understanding the why behind Hive's existence, delving into its architecture and working principles, dissecting the data model employed in Hive, exploring different modes of operation, and weighing the advantages and disadvantages it brings. The presentation concludes with practical examples, demonstrating how to create tables in Hive, upload data, and execute queries within the Hadoop environment. Join us on a journey through the intricacies of Hive, unraveling its capabilities and applications in big data analytics
Apache Hadoop is a popular open-source framework for storing and processing large datasets across clusters of computers. It includes Apache HDFS for distributed storage, YARN for job scheduling and resource management, and MapReduce for parallel processing. The Hortonworks Data Platform is an enterprise-grade distribution of Apache Hadoop that is fully open source.
The document summarizes several popular options for SQL on Hadoop including Hive, SparkSQL, Drill, HAWQ, Phoenix, Trafodion, and Splice Machine. Each option is reviewed in terms of key features, architecture, usage patterns, and strengths/limitations. While all aim to enable SQL querying of Hadoop data, they differ in support for transactions, latency, data types, and whether they are native to Hadoop or require separate processes. Hive and SparkSQL are best for batch jobs while Drill, HAWQ and Splice Machine provide lower latency but with different integration models and capabilities.
http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark, an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and how it complements Apache Hadoop. By the end of the session, you’ll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.
Microsoft's Big Play for Big Data – Visual Studio Live! NY 2012 – Andrew Brust
This document discusses Microsoft's efforts to make big data technologies like Hadoop more accessible through its products. It describes Hadoop, MapReduce, HDFS, and other big data concepts. It then outlines Microsoft's project to create a Hadoop distribution that runs on Windows Server and Windows Azure, including building an ODBC driver to allow tools like Excel to query Hadoop. This will help bring big data to more business users and integrate it with Microsoft's existing BI technologies.
Similar to Etu Solution Day 2014 Track-D: 掌握Impala和Spark
HadoopCon 2015: Hadoop Enables Enterprise Data Lake – James Chen
The growth of Mobile Internet, Social Media, and Smart Devices has driven an explosion of information, producing huge volumes of unstructured and semi-structured data. This data comes in many formats and is generated at extreme speed, posing unprecedented challenges to enterprise information architectures. Faced with such diverse data structures and analysis tools, what architecture should we adopt to integrate them, so that we can effectively manage the data lifecycle and extract value from the data? The Hadoop ecosystem will undoubtedly play the role of the foundational data platform in this architecture, realizing the enterprise Data Lake.
This document provides an overview of an advanced Big Data hands-on course covering Hadoop, Sqoop, Pig, Hive and enterprise applications. It introduces key concepts like Hadoop and large data processing, demonstrates tools like Sqoop, Pig and Hive for data integration, querying and analysis on Hadoop. It also discusses challenges for enterprises adopting Hadoop technologies and bridging the skills gap.
[HIC2011] Using Hadoop/Lucene/Solr for Large-Scale Search, by SYSTEX – James Chen
This document discusses using Hadoop/MapReduce with Solr/Lucene for large scale distributed search. It begins with an introduction to the speaker and his experience with Hadoop. The agenda then outlines discussing why search big data, an overview of Lucene, Solr and Zookeeper, distributed searching and indexing with Hadoop, and a case study on web log categorization.
2. Workshop Goal
Let’s talk about the 3Vs in Big Data. Hadoop is good for Volume and Variety.
But…
How about Velocity?
This is why we are sitting here…
4. Background Knowledge
• Linux operation system
• Basic Hadoop ecosystem knowledge
• Basic knowledge of SQL
• Java or Python programming experience
5. Terminology
• Hadoop: Open source big data platform
• HDFS: Hadoop Distributed Filesystem
• MapReduce: Parallel computing framework on top of HDFS
• HBase: NoSQL database on top of Hadoop
• Impala: MPP SQL query engine on top of Hadoop
• Spark: In-memory cluster computing engine
• Hive: SQL to MapReduce translator
• Hive Metastore: Database that stores table schema
• Hive QL: A SQL subset
6. Agenda
• What is Hadoop and what’s wrong with Hadoop in real-time?
• What is Impala?
• Hands-on Impala
• What is Spark?
• Hands-on Spark
• Spark and Impala work together
• Q & A
7. What is Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is:
• Distributed
• Fault tolerant
• Scalable
CORE HADOOP SYSTEM COMPONENTS
• HDFS: a fault-tolerant, scalable clustered storage
• MapReduce: a distributed computing framework
Flexible for storing and mining any type of data:
• Ask questions across structured and unstructured data
• Schema-less
Processing Complex Big Data:
• Scale-out architecture divides workloads across nodes.
• Flexible file system eliminates ETL bottlenecks.
Scales Economically:
• Deploy on commodity hardware
• Open sourced platform
8. Limitations of MapReduce
• Batch oriented
• High latency
• Doesn’t fit all cases
• Only for developers
9. Pig and Hive
• MR is hard and only for developers
• High level abstraction for converting declarative syntax to MR
  – SQL – Hive
  – Dataflow language – Pig
• Built on top of MapReduce
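To see what a one-line Hive query saves you, here is a minimal pure-Python sketch of the MapReduce paradigm (map, shuffle, reduce) computing a word count. It imitates the programming model only, not Hadoop itself, and the input lines are made up for the example.

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hello hadoop", "hello hive"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hello': 2, 'hadoop': 1, 'hive': 1}
```

In Hive, this entire pipeline collapses to something like `SELECT word, COUNT(*) FROM words GROUP BY word`, which is the abstraction the slide describes.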
10. Goals
• General-purpose SQL engine:
  – Works for both analytics and transactional/single-row workloads.
  – Supports queries that take from milliseconds to hours.
• Runs directly within Hadoop and:
  – Reads widely used Hadoop file formats.
  – Runs on same nodes that run Hadoop processes.
• High performance:
  – C++ instead of Java
  – Runtime code generation
  – Completely new execution engine – no MapReduce
11. What is Impala
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• Beta version released in Oct. 2012
• GA since Apr. 2013
• Apache licensed
• Latest release v1.4.2
12. Impala Overview
• Distributed service in cluster: one Impala daemon on each data node
• No SPOF
• User submits query via ODBC/JDBC, CLI, or HUE to any of the daemons.
• Query is distributed to all nodes with data locality.
• Uses Hive’s metadata interfaces and connects to the Hive metastore.
• Supported file formats:
  – Uncompressed/LZO-compressed text files
  – Sequence files and RCFile with Snappy/gzip, Avro
  – Parquet columnar format
13. Impala’s SQL
• High compatibility with HiveQL
• SQL support:
– Essential SQL-92, minus correlated subqueries
– INSERT INTO … SELECT …
– Only equi-joins; no non-equi-joins, no cross products
– Order By requires Limit (not required after 1.4.2)
– Limited DDL support
– SQL-style authorization via Apache Sentry
– UDFs and UDAFs are supported
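As a rough illustration of the supported subset (equi-joins, aggregation, INSERT INTO … SELECT), here is the same style of SQL run against SQLite as a stand-in for Impala; the semantics shown are plain SQL-92, and the table and column names are invented for the example.

```python
import sqlite3

# SQLite stands in for Impala here; the statements below stay inside the
# subset the slide lists, and the schema is made up for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("CREATE TABLE orders (user_id INTEGER, amount INTEGER)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 30), (1, 12), (2, 5)])

# Equi-join plus aggregation -- the style of query Impala targets.
rows = con.execute("""
    SELECT u.name, SUM(o.amount)
    FROM users u JOIN orders o ON u.id = o.user_id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
print(rows)  # [('ann', 42), ('bob', 5)]

# INSERT INTO ... SELECT ..., also in the supported subset.
con.execute("CREATE TABLE big_spenders (name TEXT)")
con.execute("""
    INSERT INTO big_spenders
    SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id
    GROUP BY u.name HAVING SUM(o.amount) > 10
""")
```

Note that the join condition is an equality on `u.id = o.user_id`; a non-equi-join (e.g. `ON u.id < o.user_id`) would fall outside the subset the slide describes.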
14. Impala’s SQL limitations
• No custom file formats or SerDes
• No beyond-SQL features (buckets, samples, transforms, arrays, structs, maps, xpath, json)
• Broadcast joins and partitioned hash joins supported (smaller tables have to fit in the aggregate memory of all executing nodes)
15. Work with HBase
• Functionality highlights:
  – Support for SELECT, INSERT INTO…SELECT…, and INSERT INTO…VALUES(…)
  – Predicates on rowkey columns are mapped into start/stop rows
  – Predicates on other columns are mapped into SingleColumnValueFilters
• BUT mapping of HBase tables and metastore table patterned after Hive:
  – All data is stored as scalars and in ASCII.
  – The rowkey needs to be mapped into a single string column.
16. HBase in Roadmap
• Full support for UPDATE and DELETE.
• Storage of structured data to minimize storage and access overhead.
• Composite rowkey encoding mapped into an arbitrary number of table columns.
18. Impala’s Architecture
[Diagram: each node runs a Query Planner, Query Coordinator, and Query Executor alongside its HDFS DataNode and HBase server; Hive Metastore, HDFS NameNode, and Statestore are shared services; a SQL client connects over ODBC.]
1. Client submits a query via ODBC.
2. Planner turns request into collections of plan fragments.
3. Coordinator initiates execution on impalad(s) local to data.
19. Impala’s Architecture
[Same diagram as the previous slide.]
4. Intermediate results are streamed between impalad(s).
5. Query results are streamed back to client.
20. Metadata Handling
• Impala metadata:
  – Hive’s metastore: logical metadata (table definitions, columns, CREATE TABLE parameters)
  – HDFS NameNode: directory contents and block replica locations
  – HDFS DataNode: block replicas’ volume ids
• Caches metadata: no synchronous metastore API calls during query execution
• Impala instances read metadata from the metastore at startup.
• Catalog Service relays metadata when you run DDL or update metadata on one of the impalads.
21. Metadata Handling – Cont.
• REFRESH [<tbl>]: Reloads metadata on all impalads (if you added new files via Hive)
• INVALIDATE METADATA: Reloads metadata for all tables
22. Comparing Impala to Dremel
• What is Dremel?
  – Columnar storage for data with nested structures
  – Distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Parquet
  – Stores data in appropriate native/binary types
  – Can also store nested structures similar to Dremel’s ColumnIO
• Distributed aggregation: Impala
• Impala plus Parquet: a superset of the published version of Dremel (which does not support joins)
23. Comparing Impala to Hive
• Hive: MapReduce as an execution engine
  – High-latency, low-throughput queries
  – Fault-tolerance model based on MapReduce’s on-disk checkpointing; materializes all intermediate results
  – Java runtime allows for easy late binding of functionality: file formats and UDFs
  – Extensive layering imposes high runtime overhead
• Impala:
  – Direct, process-to-process data exchange
  – No fault tolerance
  – An execution engine designed for low runtime overhead
24. Impala and Hive
Shares everything client-facing:
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats
• Machine pool
• GUI
• Resource management
• Data store (HDFS, HBase: TEXT, RCFILE, PARQUET, AVRO, etc.)
But built for different purposes:
• Hive: runs on MapReduce and is ideal for batch processing and data ingestion
• Impala: native MPP query engine ideal for interactive SQL
25. Typical Use Cases
• Data Warehouse Offload
• Ad-hoc Analytics
• Provide SQL interoperability to HBase
26. Hands-on Impala
• Query a file on HDFS with Impala
• Query a table on HBase with Impala
27. What is Spark?
• MapReduce Review…
• Apache Spark…
• How Spark Works…
• Fault Tolerance and Performance…
• Examples…
• Spark & More…
28. MapReduce: Good
The Good:
• Built-in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple? API
29. MapReduce: Bad
The Bad:
• Optimized for disk IO
  – Does not leverage memory
  – Iterative algorithms go through the disk IO path again and again
• Primitive API
  – Developers have to build on a very simple abstraction
  – Key/Value in/out
  – Even basic things like join require extensive code
• A common result is many files that have to be combined appropriately
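To make the point about joins concrete, here is a minimal pure-Python sketch of the classic reduce-side join over the key/value abstraction: the mapper must tag each record with its source table, and the reducer must separate the tags again before pairing rows up. The data and table layout are invented for the example.

```python
from collections import defaultdict

users = [(1, "ann"), (2, "bob")]          # (user_id, name)
orders = [(1, 30), (2, 5), (1, 12)]       # (user_id, amount)

def mapper():
    for user_id, name in users:
        yield (user_id, ("U", name))      # tag: record came from users
    for user_id, amount in orders:
        yield (user_id, ("O", amount))    # tag: record came from orders

def reducer(key, tagged):
    names = [v for tag, v in tagged if tag == "U"]
    amounts = [v for tag, v in tagged if tag == "O"]
    for name in names:                    # cross product per key = the join
        for amount in amounts:
            yield (key, name, amount)

groups = defaultdict(list)
for key, value in mapper():               # the shuffle step
    groups[key].append(value)

joined = [row for key in sorted(groups) for row in reducer(key, groups[key])]
print(joined)  # [(1, 'ann', 30), (1, 'ann', 12), (2, 'bob', 5)]
```

All of this boilerplate replaces a single `JOIN` clause in SQL, which is exactly the gap Hive, Pig, and Impala close.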
30. Apache Spark
• Originally developed in 2009 in UC Berkeley’s AMP Lab.
• Fully open sourced in 2010 – now at the Apache Software Foundation.
31. Spark: Easy and Fast Big Data
• Easy to develop
  – Rich APIs in Java, Scala, Python
  – Interactive shell
  – 2-5x less code
• Fast to run
  – General execution graph
  – In-memory store
32. How Spark Works – SparkContext
[Diagram: the driver’s SparkContext talks to the Cluster Master; each Spark Worker hosts an Executor with a cache that runs Tasks against the local HDFS Data Node.]
sc = new SparkContext
rdd = sc.textFile("hdfs://...")
rdd.filter(…)
rdd.cache(…)
rdd.count(…)
rdd.map(…)
33. How Spark Works – RDD
RDD (Resilient Distributed Dataset)
sc = new SparkContext
rdd = sc.textFile("hdfs://...")
rdd.filter(…)
rdd.cache(…)
rdd.count(…)
rdd.map(…)
Storage types: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY
• Fault tolerant
• Controlled partitioning to optimize data placement
• Manipulated by using a rich set of operators
• Partitions of data
• Dependency between partitions
34. RDD
• Stands for Resilient Distributed Datasets
• Spark revolves around RDDs
• Fault-tolerant, read-only collection of elements that can be operated on in parallel
• Cached in memory
Reference: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
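The RDD properties above (read-only data, lazy transformations, actions that trigger computation) can be imitated in a few lines of plain Python. The MiniRDD class below is invented purely for illustration and is not the Spark API.

```python
# A toy imitation of the RDD idea: transformations only record lineage
# (how to rebuild the data), and nothing runs until an action is called.
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute            # lineage: how to (re)build the data

    # Transformations: return a new MiniRDD, compute nothing yet.
    def map(self, f):
        return MiniRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._compute() if pred(x)])

    # Actions: force evaluation of the whole lineage.
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

rdd = MiniRDD(lambda: list(range(10)))     # stand-in for sc.textFile(...)
evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has executed yet; only the actions below compute anything.
print(evens.collect())  # [0, 4, 16, 36, 64]
print(evens.count())    # 5
```

Because the lineage is a recipe rather than materialized data, a lost partition can in principle be recomputed from it, which is the fault-tolerance idea behind real RDDs.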
35. RDD
• Read-only, partitioned collection of records
[Diagram: an RDD of 3 partitions; each coarse-grained transformation yields a new partitioned RDD, and a final action collapses the chain into a value.]
• Supports only coarse-grained operations
  – e.g. map and group-by transformations, reduce actions
43. Actions
• Parallel operations:
  map, reduce, sample, filter, count, take, groupBy, fold, first, sort, reduceByKey, partitionBy, union, groupByKey, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, …
44. Stages
textFile → map → map → reduceByKey → collect
[Diagram: the DAG (Directed Acyclic Graph) above is divided into Stage 1 (textFile, map, map) and Stage 2 (reduceByKey, collect).]
Each stage is executed as a series of Tasks (one Task for each partition).
45. Tasks
Task is the fundamental unit of execution in Spark.
[Diagram: each core repeats the same loop – Fetch Input (from HDFS/RDD) → Execute Task → Write Output (to HDFS/RDD/intermediate shuffle output).]
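The fetch-input / execute-task / write-output loop, with one task per partition spread across a small pool of cores, can be sketched in plain Python. The `storage` dict stands in for HDFS/RDD storage, and the squaring task is made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated partitioned storage (one list per partition).
storage = {
    "part-0": [1, 2, 3],
    "part-1": [4, 5, 6],
    "part-2": [7, 8, 9],
}
output = {}

def run_task(partition):
    data = storage[partition]              # fetch input
    result = sum(x * x for x in data)      # execute task
    output[partition] = result             # write output
    return result

# One task per partition, executed on a pool of two "cores".
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(run_task, sorted(storage)))

print(results)  # [14, 77, 194]
```

The point of the sketch is the granularity: the unit of scheduling is a partition-sized task, not the whole job, which is what lets the engine balance work across cores and nodes.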
47. Comparison
• Storage: MapReduce – HDFS; Impala – HDFS/HBase; Spark – HDFS
• Scheduler: MapReduce – MapReduce job; Impala – query plan; Spark – computation graph
• I/O: MapReduce – disk; Impala – in-memory with cache; Spark – in-memory, cache and shared data
• Fault tolerance: MapReduce – duplication and disk I/O; Impala – no fault tolerance; Spark – hash partition and auto reconstruction
• Iterative: MapReduce – bad; Impala – bad; Spark – good
• Shared data: MapReduce – no; Impala – no; Spark – yes
• Streaming: MapReduce – no; Impala – no; Spark – yes
49. Spark Streaming
• Takes the concept of RDDs and extends it to DStreams
  – Fault-tolerant like RDDs
  – Transformable like RDDs
• Adds new “rolling window” operations
  – Rolling averages, etc.
• But keeps everything else!
  – Regular Spark code works in Spark Streaming
  – Can still access HDFS data, etc.
• Example use cases:
  – “On-the-fly” ETL as data is ingested into Hadoop/HDFS
  – Detecting anomalous behavior and triggering alerts
  – Continuous reporting of summary metrics for incoming data
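The micro-batch and rolling-window idea can be sketched in plain Python: data arrives in small batches, and each step reports a summary over the last few batches. The batch contents and window length are made up, and this imitates the DStream model rather than using the Spark API.

```python
from collections import deque

WINDOW = 3                                  # keep the last 3 micro-batches
window = deque(maxlen=WINDOW)               # old batches fall off automatically
rolling_averages = []

batches = [[10, 20], [30], [40, 50], [60]]  # simulated stream input
for batch in batches:
    window.append(batch)                    # slide the window forward
    values = [v for b in window for v in b]
    rolling_averages.append(sum(values) / len(values))

print(rolling_averages)  # [15.0, 20.0, 30.0, 45.0]
```

Note how the last value (45.0) no longer includes the first batch: the fixed-length window is what turns a continuous stream into a bounded computation at each step.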
53. Spark SQL
• Spark SQL is one of Spark’s components.
  – Executes SQL on Spark
  – Builds SchemaRDDs
• Optimizes execution plans
• Uses existing Hive metastores, SerDes, and UDFs.
54. Unified Data Access
• Ability to load and query data from a variety of sources.
• SchemaRDDs provide a single interface that efficiently works with structured data, including Hive tables, Parquet files, and JSON.
Query and join different data sources:
sqlCtx.jsonFile("s3n://...")
      .registerAsTable("json")
schema_rdd = sqlCtx.sql("""
    SELECT *
    FROM hiveTable
    JOIN json ...""")
55. Hands-on Spark
• Parse/transform log on the fly with Spark-Streaming
• Aggregate with Spark SQL (Top N)
• Output from Spark to HDFS
56. Spark & Impala work together
Data
Strea
m
Data
Strea
m
Spark-
Streaming
Spark
Impala
DN
RS
Data
Strea
m
Spark-
Streaming
Spark
Impala
DN
RS
Data
Strea
m
Spark-
Streaming
Spark
DN
RS
Impala
…
Data
Strea
m
Data
Strea
m
Data Stream
-Click Steam
-Machine Data
-Log
-Network Traffic
-Etc..
On-the-fly Processing
-ETL, transformation, filter
-Pattern Matching & Alert
Real-time Analytics
-Machine Learning (Rec. Cluster..)
-Iterative Algorithms
Near Real-time Query
- Ad-hoc query
- Reporting
Long term data store
-Batch process
-Offline analytics
-Historical Mining