This document provides an agenda and overview for a presentation on SQL on Hadoop. The presentation will cover various SQL on Hadoop technologies including Hive, HAWQ, Impala, SparkSQL, HBase with Phoenix, and Drill. It will also include an introduction, surveys to collect information from attendees, and discussions on networking and food. The hosts will provide background on their experience with big data and Hadoop.
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd (IBM Analytics)
Originally Published on Oct 27, 2014
An overview of IBM's audited Hadoop-DS comparing IBM Big SQL, Cloudera Impala and Hortonworks Hive for performance and SQL compatibility. For more information, visit: http://www-01.ibm.com/software/data/infosphere/hadoop/
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or SPARK. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
Demand for cloud is through the roof. Cloud is turbo charging the Enterprise IT landscape with agility and flexibility. And now, discussions of cloud architecture dominate Enterprise IT. Cloud is enabling many ephemeral on-demand use cases which is a game changing opportunity for analytic workloads. But all of this comes with the challenges of running enterprise workloads in the cloud securely and with ease.
In this session, we will take you through Cloudbreak as a solution to simplify provisioning and managing enterprise workloads while providing an open and common experience for deploying workloads across clouds. We will discuss the challenges (and opportunities) to run enterprise workloads in the cloud and will go through a live demo of how the latest from Cloudbreak enables enterprises to easily and securely run Apache Hadoop. This includes deep-dive discussion on Ambari Blueprints, recipes, custom images, and enabling Kerberos -- which are all key capabilities for Enterprise deployments.
Speakers
Jeff Sposetti, VP Product Management, Hortonworks
Attila Kanto, Principal Engineer, Hortonworks
Spark is a fast and general engine for large-scale data processing. It provides APIs in Java, Scala, and Python and an interactive shell. Spark applications operate on resilient distributed datasets (RDDs) that can be cached in memory for faster performance. RDDs are immutable and fault-tolerant via lineage graphs. Transformations create new RDDs from existing ones while actions return values to the driver program. Spark's execution model involves a driver program that coordinates tasks on executor machines. RDD caching and lineage graphs allow Spark to efficiently run jobs across clusters.
PASS Summit - SQL Server 2017 Deep Dive (Travis Wright)
Deep dive into SQL Server 2017 covering SQL Server on Linux, containers, HA improvements, SQL graph, machine learning, python, adaptive query processing, and much much more.
Operationalizing Data Science Using Cloud Foundry (VMware Tanzu)
The document discusses how operationalizing machine learning models through continuous deployment and monitoring is important to realize business value but often overlooked, and describes how Alpine Data's Chorus platform in combination with Pivotal's Big Data Suite and Cloud Foundry can provide a turn-key solution for operationalizing models by deploying scalable scoring engines that can consume models exported in the PFA format. The platform aims to make it simple to deploy both individual models and complex scoring flows represented as PFA documents to ensure models have maximum impact on the business.
Apache Falcon - Simplifying Managing Data Jobs on Hadoop (DataWorks Summit)
Apache Falcon is a data management platform on Hadoop that provides a holistic way to declaratively define and manage data pipelines and workflows. It allows users to specify feeds, processes, and clusters to orchestrate the flow of data across Hadoop clusters. Falcon handles scheduling, dependency management, replication, and data governance. The architecture uses Oozie to schedule workflows and notifications are sent through JMS. Case studies demonstrate how Falcon can be used for multi-cluster failover and distributed processing across data centers.
Kudu is an open source storage layer developed by Cloudera that provides low latency queries on large datasets. It uses a columnar storage format for fast scans and an embedded B-tree index for fast random access. Kudu tables are partitioned into tablets that are distributed and replicated across a cluster. The Raft consensus algorithm ensures consistency during replication. Kudu is suitable for applications requiring real-time analytics on streaming data and time-series queries across large datasets.
Big SQL Competitive Summary - Vendor Landscape (Nicolas Morales)
IBM's Big SQL is their SQL for Hadoop product that allows users to run SQL queries on Hadoop data. It uses the Hive metastore to catalog table definitions and shares data logic with Hive. Big SQL is architected for high performance with a massively parallel processing (MPP) runtime and runs directly on the Hadoop cluster with no proprietary storage formats required. The document compares Big SQL to other SQL on Hadoop solutions and outlines its performance and architectural advantages.
The document discusses Seagate's plans to integrate hard disk drives (HDDs) with flash storage, systems, services, and consumer devices to deliver unique hybrid solutions for customers. It notes Seagate's annual revenue, employees, manufacturing plants, and design centers. It also discusses Seagate exploring the use of big data analytics and Hadoop across various potential use cases and outlines Seagate's high-level plans for Hadoop implementation.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics (DataWorks Summit)
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics - Apache Spark's in-memory capabilities catapulted it to become the premier processing framework for Hadoop. Apache Ignite and Alluxio, both high-performance, integrated and distributed in-memory platforms, take Apache Spark to the next level by providing an even more powerful, faster and more scalable platform for the most demanding data processing and analytic environments.
Speaker
Irfan Elahi, Consultant, Deloitte
This document discusses the challenges of implementing SQL on Hadoop. It begins by explaining why SQL is useful for Hadoop, as it provides a familiar syntax and separates querying logic from implementation. However, Hadoop's architecture presents challenges for matching the functionality of a traditional data warehouse. Key challenges discussed include random data placement in HDFS, limitations on indexing due to this random placement, difficulties performing joins without data colocation, and limitations of existing "indexing" approaches in systems like Hive. The document explores approaches some systems are taking to address these issues.
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
Hadoop has traditionally been an on-premises workload, with very few notable implementations in the cloud. With organizations either having jumped on the cloud bandwagon or having started planning their expansion into the ecosystem, it is imperative for us to explore how Hadoop conforms to the cloud paradigm. With the coming of age of some very useful cloud paradigms and the nature of Big Data with high seasonality of workloads, this is becoming a very common ask from customers. Robust architectures, elastic scale, open platforms, OSS integrations, and addressing complex pain points will all be part of this lively talk. To be able to implement effective solutions for Big Data in the cloud it is imperative that you understand the core principles and grasp the design principles of how the cloud can enhance the benefits of parallelized analytics. Join this session to understand the nitty-gritties of implementing Big Data in the cloud and the various options therein. Big Data + Cloud is definitely a deadly combination.
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021 (Sandesh Rao)
The document discusses Oracle Machine Learning (OML) services on Oracle Autonomous Database. It provides an overview of the OML services REST API, which allows storing and deploying machine learning models. It enables scoring of models using REST endpoints for application integration. The API supports classification/regression of ONNX models from libraries like Scikit-learn and TensorFlow. It also provides cognitive text capabilities like topic discovery, keywords, sentiment analysis and text summarization.
Introducing Kudu, Big Data Warehousing Meetup (Caserta)
Not just an SQL interface or file system, Kudu - the new, updating column store for Hadoop, is changing the storage landscape. It's easy to operate and makes new data immediately available for analytics or operations.
At the Caserta Concepts Big Data Warehousing Meetup, our guests from Cloudera outlined the functionality of Kudu and talked about why it will become an integral component in big data warehousing on Hadoop.
To learn more about what Caserta Concepts has to offer, visit http://casertaconcepts.com/
eProseed Oracle Open World 2016 debrief - Oracle 12.2.0.1 Database (Marco Gralike)
The document provides an overview of new features in Oracle Database 12.2 including multitenant improvements like application containers and proxy PDBs, in-memory database enhancements, new JSON functions and dataview, and Oracle Exadata Express. It also briefly mentions big data integrations and notes that documentation is available online for Exadata Express and new JSON and database features.
This document provides an overview and examples of MapReduce (M/R), Pig, and Hive. It introduces M/R concepts like mapping, reducing, and joins. It demonstrates a simple word count M/R job. Pig and Hive allow writing M/R jobs using a higher-level language - Pig Latin and HiveQL respectively. Examples show averaging stock prices using Pig and joining datasets in Hive. M/R, Pig, and Hive scripts run as Hadoop jobs on HDFS data.
MapReduce is a programming model for processing large datasets in a distributed manner. It works by splitting the data into independent chunks which are processed by the map tasks in parallel. The outputs are shuffled and sorted before being input to the reduce tasks. Common implementations of MapReduce include Apache Hadoop which runs MapReduce jobs on top of the Hadoop Distributed File System (HDFS).
The document describes Hadoop MapReduce and its key concepts. It discusses how MapReduce allows for parallel processing of large datasets across clusters of computers using a simple programming model. It provides details on the MapReduce architecture, including the JobTracker master and TaskTracker slaves. It also gives examples of common MapReduce algorithms and patterns like counting, sorting, joins and iterative processing.
This talk was held at the 12th meeting on July 22 2014 by Romeo Kienzler.
After giving a short contextual overview of SQL for Hadoop projects in the ecosystem (Hive, Impala, Presto, Cascading Lingual, ...), we will hear about the latest SQL for Hadoop features in Big SQL. Big SQL delivers some exciting capabilities, including low latency and high performance queries, while maintaining backward compatibility with Hive and HCatalog. This is achieved by an optimizer and a dedicated execution framework, which will be covered in detail. Finally, a demo of Big SQL v3.0 on a cluster in the Silicon Valley Lab (SVL) will be shown.
The document provides an overview of the Hadoop Distributed File System (HDFS). It describes key HDFS concepts including its design goals, block and rack awareness, file write and read processes, checkpointing, and safe mode operation. HDFS allows for reliable storage of very large files across commodity hardware and provides high throughput access to application data.
PXF is a unified access framework that provides a uniform SQL interface for heterogeneous data sources on HDFS. It exploits parallelism to efficiently access data across various storage formats and data sources. PXF uses a pluggable architecture with built-in connectors that allow it to access data in HDFS files, Hive tables, HBase tables, and other data sources. It provides a common developer view and allows writing queries against external data using various profile definitions and plugins.
This certificate of appreciation was presented to Shivram Mani from the Apache Software Foundation for serving as a mentor during Google Summer of Code 2016 from April 22 to August 23, 2016. Jason Titus, VP of Engineering, recognized Shivram Mani's contributions as a mentor during the summer program.
Some slides about the Map/Reduce programming model (for academic purposes), adapting some examples from the book Map/Reduce Design Patterns.
Special thanks to the following sources:
-http://shop.oreilly.com/product/0636920025122.do
-http://mapreducepatterns.com/index.php?title=Main_Page
-http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
This document summarizes a presentation about managing Apache HAWQ, an open source massively parallel processing (MPP) database, using Apache Ambari. It discusses how Ambari integrates with HAWQ for installation, configuration, topology recommendations, high availability, alerts and more. Challenges in the integration are addressed as HAWQ is not part of the Hortonworks Data Platform stack. The presentation recommends future work for Ambari like supporting automated HAWQ upgrades and enabling dynamic configuration reloads without requiring a service restart.
1. HCatalog is a table and storage management layer for Hadoop that provides a relational view of data in HDFS and abstracts data formats and locations from users.
2. Previously, HAWQ accessed Hive tables through PXF using external tables, but this required specifying the schema, location, and format which was error prone and wouldn't detect metadata changes.
3. The new integration retrieves metadata from HCatalog and parses it into in-memory catalog tables to provide dynamic access to Hive tables from HAWQ without needing to specify schemas.
Zeppelin Interpreters
PSQL (to become JDBC in 0.6.x)
Geode
SpringXD
Apache Ambari
Zeppelin Service
Geode, HAWQ and Spring XD services
Webpage Embedder View
HAWQ is a massively parallel, distributed SQL query engine that runs as a Hadoop service. It provides two-way integration with HDFS, Hive, and HBase. HAWQ supports SQL transactions through commands like BEGIN, COMMIT, and ROLLBACK. External tables in HAWQ can be used to query data stored in HDFS files, Hive tables, and HBase tables.
This document provides an overview of Hadoop HDFS/MapReduce architecture, hardware requirements, installation and configuration process, monitoring options, and key components like the Namenode. It discusses configuring the Namenode, JobTracker, DataNodes, and TaskTrackers. Hardware requirements for the NameNode/JobTracker and DataNode/TaskTracker are specified. Installation can be done via tar file download or using a prebuilt rpm. Configuration involves editing XML configuration files and starting required services. Monitoring can be done via web scraping, Ganglia, or Cacti. The Namenode writes edits to RAM and write-ahead log, with checkpoints to the filesystem. High availability is experimental in YARN and Hadoop
HAWQ: a massively parallel processing SQL engine in Hadoop (BigData Research)
HAWQ, developed at Pivotal, is a massively parallel processing SQL engine sitting on top of HDFS. As a hybrid of MPP database and Hadoop, it inherits the merits from both parties. It adopts a layered architecture and relies on the distributed file system for data replication and fault tolerance. In addition, it is standard SQL compliant, and unlike other SQL engines on Hadoop, it is fully transactional. This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on UDP protocol, transaction management, fault tolerance, read optimized storage, the extensible framework for supporting various popular Hadoop based data stores and formats, and various optimization choices we considered to enhance the query performance. The extensive performance study shows that HAWQ is about 40x faster than Stinger, which is reported 35x-45x faster than the original Hive.
The document discusses the new features in Pivotal HD 1.1, including improved high availability for HAWQ and Namenode, new UDF and diagnostic tools for HAWQ, upgraded Apache Hadoop components to version 2.0.5 and 2.0.6, improved Hive, HBase, and Oozie, Kerberos support for security, and new tools like the Unified Storage Service, Data Loader, and Command Center for easier administration.
Pivotal is a trusted partner for IT innovation and transformation. From the technology, to the people, to the way people interact with technology, Pivotal is transforming how the world builds software.
At Strata NYC 2015, Pivotal, announced it will Supercharge the Hadoop Ecosystem by contributing the HAWQ advanced SQL on Hadoop analytics and MADlib machine learning technologies to The Apache Software Foundation.
The document summarizes the journey of HAWQ and MADlib from being proprietary Pivotal technologies to becoming Apache open source projects. It provides an overview of HAWQ, including its key features like SQL compliance, performance advantages over other SQL-on-Hadoop systems, and flexible deployment options. It also summarizes MADlib, describing its machine learning functions and advantages of scalable in-database machine learning. Both projects are now available on open source platforms like Hadoop and aim to advance SQL and machine learning on big data through open collaboration.
Massively Parallel Processing with Procedural Python - Pivotal HAWQ (InMobi Technology)
The document discusses massively parallel processing using procedural Python. It describes EMC Corporation and its subsidiaries which provide data storage, virtualization, security, and other software solutions. It also discusses Pivotal's open source contributions and the architecture of its HAWQ database which allows Python user-defined functions to perform parallel operations across clusters.
Apache Tez - A unifying Framework for Hadoop Data Processing (DataWorks Summit)
This document provides an overview of Apache Tez, a framework for building data processing applications on Hadoop YARN. It describes how Tez allows applications to define complex data flows as directed acyclic graphs (DAGs) and handles distributed execution, fault tolerance, and resource management. Tez has improved the performance of Apache Hive and Pig by an order of magnitude by enabling more flexible DAG definitions and runtime optimizations. It also supports integration with other data processing engines like Spark, Storm and interactive SQL queries. The document outlines how Tez works and provides guidance on how developers can contribute to the open source project.
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor... (Data Con LA)
Apache Tez is a library to build data processing engines in Hadoop/YARN. It takes care of many common building blocks like scheduling, fault tolerance, speculation, security etc. so that the engine can focus on its core features. E.g. Apache Hive can focus on SQL optimization. There has been rapid adoption in projects like Hive, Pig, Flink, Cascading, Scalding and commercial products like Datameer and Syncsort. We will provide a brief overview of Tez and then look at new features for job monitoring in the Tez UI and performance debugging tools for Tez applications. Finally we will explore upcoming features like hybrid scheduling that open up new areas of performance and functionality.
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka (Edureka!)
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop components" will provide you with detailed knowledge about the top Hadoop Components and it will help you understand the different categories of Hadoop Components. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 (MLconf)
Big Data Processing Above and Beyond Hadoop: Data-intensive computing represents a new computing paradigm to address Big Data processing requirements using high-performance architectures supporting scalable parallel processing to allow government, commercial organizations, and research environments to process massive amounts of data and implement new applications previously thought to be impractical or infeasible. The fundamental challenges of data-intensive computing are managing and processing exponentially growing data volumes, significantly reducing associated data analysis cycles to support practical, timely applications, and developing new algorithms which can scale to search and process massive amounts of data. The open source HPCC (High-Performance Computing Cluster) Systems platform offers a unified approach to Big Data processing requirements: (1) a scalable, integrated computer systems hardware and software architecture designed for parallel processing of data-intensive computing applications, and (2) a new programming paradigm in the form of a high-level, declarative, data-centric programming language designed specifically for big data processing. This presentation explores the challenges of data-intensive computing from a programming perspective, and describes the ECL programming language and the HPCC architecture designed for data-intensive computing applications. HPCC is an alternative to the Hadoop platform, and ECL is compared to Pig Latin, a high-level language developed for the Hadoop MapReduce architecture.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of 'Hadoop‘
Hadoop Analytics Tools
The document discusses EMC's strategy for Hadoop storage. It describes the Hadoop distributed file system (HDFS) and its architecture. It then outlines different approaches for integrating HDFS with storage solutions, including using integrated Hadoop distributions, HDFS storage array interfaces, and HDFS storage virtualization software. It also discusses analytics appliances and provides examples of EMC's data lake capabilities.
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins... (EMC)
Pivotal has set up and operationalized a 1,000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how to manage it.
After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1,000-node Hadoop cluster.
Objective 2: Understand how to set up and manage the day-to-day challenges of a large Hadoop deployment.
Objective 3: Have a view of the tools necessary to solve the challenges of managing a large Hadoop cluster.
This document discusses building applications on Hadoop and introduces the Kite SDK. It provides an overview of Hadoop and its components like HDFS and MapReduce. It then discusses that while Hadoop is powerful and flexible, it can be complex and low-level, making application development challenging. The Kite SDK aims to address this by providing higher-level APIs and abstractions to simplify common use cases and allow developers to focus on business logic rather than infrastructure details. It includes modules for data, ETL processing with Morphlines, and tools for working with datasets and jobs. The SDK is open source and supports modular adoption.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics... (VMware Tanzu)
SpringOne Platform 2016
Speaker: Ian Fyfe; Director, Product Marketing, Hortonworks
Apache Hadoop is the most powerful and popular platform for ingesting, storing and processing enormous amounts of “big data”. However, due to its original roots as a batch processing system, doing interactive business analytics with Hadoop has historically suffered from slow response times, or forced business analysts to extract data summaries out of Hadoop into separate data marts. This talk will discuss the different options for implementing speed-of-thought business analytics and machine learning tools directly on top of Hadoop including Apache Hive on Tez, Apache Hive on LLAP, Apache HAWQ and Apache MADlib.
This is a presentation on Apache Hadoop technology. It may help beginners learn Hadoop terminology, and it contains pictures that describe how the technology works. I hope it will be helpful for beginners.
Thank you.
This presentation is about Apache Hadoop technology and is aimed at beginners. It introduces some Hadoop terminology and includes diagrams that show how the technology works.
Thank you.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Speaker: Alan Gates, Co-Founder, Hortonworks
Hive 3 New Horizons, DataWorks Summit Melbourne, February 2019 (alanfgates)
Hive 3 new SQL features including LLAP, workload management, SQL over Kafka and JDBC data sources, integration with Spark via Hive Warehouse Connector, ACID 2, and constraints and default values
The document is a presentation about using Hadoop for analytic workloads. It discusses how Hadoop has traditionally been used for batch processing but can now also be used for interactive queries and business intelligence workloads using tools like Impala, Parquet, and HDFS. It summarizes performance tests showing Impala can outperform MapReduce for queries and scales linearly with additional nodes. The presentation argues Hadoop provides an effective solution for certain data warehouse workloads while maintaining flexibility, ease of scaling, and cost effectiveness.
This document discusses the Stinger initiative to improve the performance of Apache Hive. Stinger aims to speed up Hive queries by 100x, scale queries from terabytes to petabytes of data, and expand SQL support. Key developments include optimizing Hive to run on Apache Tez, the vectorized query execution engine, cost-based optimization using Optiq, and performance improvements from the ORC file format. The goals of Stinger Phase 3 are to deliver interactive query performance for Hive by integrating these technologies.
Data ingest is a deceptively hard problem. In the world of big data processing, it becomes exponentially more difficult. It's not sufficient to simply land data on a system; that data must be ready for processing and analysis. The Kite SDK is a data API designed for solving the issues related to data ingest and preparation. In this talk you'll see how Kite can be used for everything from simple tasks to production-ready data pipelines in minutes.
3. EMC Corporation All rights reserved
• How many developers?
INTRODUCTION
A SURVEY
4. EMC Corporation All rights reserved
• How many BI/SQL developers?
INTRODUCTION
A SURVEY
5. EMC Corporation All rights reserved
• How many business analysts or sales people?
INTRODUCTION
A SURVEY
6. EMC Corporation All rights reserved
• How many have used Hadoop?
INTRODUCTION
A SURVEY
7. EMC Corporation All rights reserved
• How many have used SQL on Hadoop?
INTRODUCTION
A SURVEY
8. EMC Corporation All rights reserved
• Hadoop is an open source framework for large-scale data storage & processing.
WHAT IS HADOOP
9. EMC Corporation All rights reserved
• Application Workgroup in EMC
– Focused on
•Big data development/infrastructure
•Application modernization
•DevOps
ABOUT THE HOSTS
10. EMC Corporation All rights reserved
• Fahim Kundi
– 10+ years experience in EDW and big data
• Haden Pareira
– Data engineer with 5+ years of Hadoop experience
• Muhammad Ali
– Data engineer 2+ years with Hadoop
ABOUT THE HOSTS
APPLICATION WORKGROUP IN EMC
12. EMC Corporation All rights reserved
• HDFS is a file system – it’s all files
• MapReduce requires strong programming skills
• It’s so difficult
WHAT IS HADOOP
13. EMC Corporation All rights reserved
• SQL is well known in analytics community
• Faster and easier data insights
• Allows SQL/BI developer to retain their expertise
and create value out of big data
SQL ON HADOOP
14. EMC Corporation All rights reserved
• Cloudera – Impala
• Hortonworks – Hive/Tez
• Pivotal – HAWQ … now HDB
• MapR – Drill
• IBM – Big SQL
SQL ON HADOOP
17. EMC Corporation All rights reserved
CONTENTS
• Hive Introduction
• How Hive Works
• Apache Tez
• Hive with Tez Vs Mapreduce
• ORC and Parquet Format
• HAWQ Introduction
• Query Optimizer
• PxF
18. EMC Corporation All rights reserved
HIVE INTRODUCTION (1)
• Apache Hive provides a high-level query language and data warehouse features built on top of Hadoop.
• It was initially developed at Facebook and made open source in 2008.
• SQL-like query language called HQL.
• Partitioning and bucketing for faster query processing (see the sketch below).
• Integration with visualization tools like Tableau.
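As a rough illustration of partitioning and bucketing (a hedged sketch; the table and column names below are invented, not from the deck):
-- A Hive table partitioned by date and bucketed on user_id.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- A query that filters on the partition column reads only that partition's files.
SELECT count(*) FROM page_views WHERE dt = '2016-01-01';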
19. EMC Corporation All rights reserved
HIVE INTRODUCTION (2)
• Hive supports all the common primitive data types such as INT, BINARY, BOOLEAN, CHAR, DECIMAL, FLOAT, STRING, TIMESTAMP, etc.
• In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays (example below).
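A small, hypothetical HQL sketch of mixing primitive and complex types (column names are illustrative only):
-- A table combining primitives with struct, array and map columns.
CREATE TABLE customers (
  id      INT,
  name    STRING,
  address STRUCT<street:STRING, city:STRING, zip:STRING>,
  phones  ARRAY<STRING>,
  prefs   MAP<STRING, STRING>
);

-- Nested fields are addressed with dot, index and key syntax.
SELECT name, address.city, phones[0], prefs['language'] FROM customers;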
20. EMC Corporation All rights reserved
HOW HIVE WORKS (1)
• The tables in Hive are similar to tables in a relational
database.
• Databases are comprised of tables, which are made up
of partitions.
• Data can be accessed via a simple query language and
Hive supports overwriting or appending data.
• Hive queries are internally converted into MapReduce or Tez jobs.
21. EMC Corporation All rights reserved
HOW HIVE WORKS (2)
• Within a particular database, data in the tables is
serialized and each table has a corresponding Hadoop
Distributed File System (HDFS) directory.
• Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory.
• Data within partitions can be further broken down into
buckets.
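For the hypothetical page_views table sketched earlier, the resulting HDFS layout would look roughly like this (warehouse path and bucket file names depend on configuration):
-- /user/hive/warehouse/page_views/dt=2016-01-01/000000_0   <- bucket file
-- /user/hive/warehouse/page_views/dt=2016-01-01/000001_0
-- /user/hive/warehouse/page_views/dt=2016-01-02/000000_0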
22. EMC Corporation All rights reserved
APACHE TEZ (1)
• Apache Tez is a new distributed execution framework targeted at data-processing applications on Hadoop.
• Tez is developed by Hortonworks and built on top of YARN (the resource management framework for Hadoop).
• Tez generalizes MapReduce into a more powerful framework by modeling each user job as a dataflow graph. (Example)
23. EMC Corporation All rights reserved
APACHE TEZ (2)
• The Tez API has the following components –
– DAG (Directed Acyclic Graph) – defines the overall job.
One DAG object corresponds to one job
– Vertex – defines the user logic along with the resources
and the environment needed to execute the user logic.
One Vertex corresponds to one step in the job
– Edge – defines the connection between producer and
consumer vertices.
• Tez is not meant directly for end-users – in fact it
enables developers to build end-user applications with
much better performance and flexibility.
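Because Tez is consumed through engines such as Hive rather than directly, the simplest way to see it in action is from a Hive session (a hedged sketch; page_views is the hypothetical table from the earlier example):
-- Switch the Hive execution engine to Tez for this session and inspect the DAG-based plan.
SET hive.execution.engine=tez;

EXPLAIN
SELECT dt, count(*) FROM page_views GROUP BY dt;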
25. EMC Corporation All rights reserved
ORC FILE
• ORC (Optimized Row Columnar) is a columnar file format designed for Hadoop workloads.
• ORC files were developed to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The format is optimized for large streaming reads.
• ORC features:
– Columnar format for complex data types
– Built into Hive from 0.11
– Support for Pig and MapReduce via HCatalog
– Two levels of compression
• Lightweight, type-specific
• General
– Built-in indexes
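A minimal DDL sketch (table name and property values are illustrative, not from the deck) showing a Hive table stored as ORC with general-purpose compression layered on ORC's lightweight type-specific encodings:
CREATE TABLE sales_orc (
  order_id BIGINT,
  amount   DECIMAL(10,2),
  country  STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');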
27. EMC Corporation All rights reserved
PARQUET
• Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem, regardless of
the choice of data processing framework, data model
or programming language.
• Parquet features:
– Columnar file format
– Supports nested data structures
– Accessible from Hive, Spark, Pig, Drill and MapReduce
– Read/write in HDFS or the local file system
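The equivalent hedged sketch for Parquet; once written this way, the same files can be read from Hive, Impala, Spark, Pig or MapReduce without conversion:
CREATE TABLE sales_parquet (
  order_id BIGINT,
  amount   DECIMAL(10,2),
  country  STRING
)
STORED AS PARQUET;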
29. EMC Corporation All rights reserved
ORC VS PARQUET
• Major considerations for choosing ORC over Parquet
– Many of the performance improvements provided in the Stinger
initiative are dependent on features of the ORC format including
block level index for each column. This leads to potentially more
efficient I/O allowing Hive to skip reading entire blocks of data if it
determines predicate values are not present there.
– Also the Cost Based Optimizer has the ability to consider column
level metadata present in ORC files in order to generate the most
efficient graph.
– ACID transactions are only possible when using ORC as the file
format.
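A hedged sketch of the ACID point above: in this generation of Hive, transactional tables had to be stored as ORC, bucketed, and flagged as transactional (with the transaction manager enabled on the cluster). Names below are invented:
CREATE TABLE accounts (
  id      INT,
  balance DECIMAL(12,2)
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE accounts SET balance = balance + 100 WHERE id = 42;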
31. EMC Corporation All rights reserved
HAWQ INTRODUCTION
• HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS for its storage layer.
• HAWQ evolved from the Greenplum Database query engine to handle query processing and does not rely on MapReduce under the hood.
• HAWQ reads data from and writes data to HDFS natively.
• It also has extensions (PXF) that allow it to interact with data contained in other services (HBase, Hive, Avro, etc.) that also reside in HDFS.
32. EMC Corporation All rights reserved
HAWQ FEATURES
• HAWQ provides all major features found in Greenplum
database
– SQL Completeness: 2003 Extensions
– JDBC Compliant
– Robust Query Optimizer
– Row or Column-Oriented Table Storage
– Parallel Loading and Unloading
– Distributions
– Multi-level Partitioning
– High speed data redistribution
– Views
– External Tables
– Compression
– Resource Management
– Security
– Authentication
– Management and Monitoring
33. EMC Corporation All rights reserved
HAWQ ARCHITECTURE
(Architecture diagram: a HAWQ Master, with its parser and query optimizer, and a HAWQ Standby Master sit alongside the HDFS NameNode and Secondary NameNode. They communicate over the interconnect with multiple Segment Hosts, each running query executors, PXF and one or more segments with local temp storage, co-located with HDFS DataNodes.)
34. EMC Corporation All rights reserved
HAWQ PARALLEL QUERY OPTIMIZER
(Example plan tree for a multi-way join: a Gather Motion feeds a Sort and HashAggregate over HashJoins of Seq Scans on lineitem, orders, customer and nation, with Redistribute and Broadcast Motions moving rows between segments.)
• Turns a SQL query into an execution plan
• Cost-based optimizer
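A hedged sketch of how such a plan surfaces in practice; the abbreviated output below is illustrative and mirrors the plan shape on this slide rather than exact HAWQ output:
EXPLAIN
SELECT c.c_name, sum(l.l_extendedprice)
FROM customer c
JOIN orders o   ON o.o_custkey  = c.c_custkey
JOIN lineitem l ON l.l_orderkey = o.o_orderkey
GROUP BY c.c_name;

-- Gather Motion
--   -> HashAggregate
--        -> Hash Join
--             -> Redistribute Motion / Broadcast Motion
--             -> Seq Scan on lineitem, orders, customer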
35. EMC Corporation All rights reserved
PIVOTAL EXTENSION FRAMEWORK (PXF)
• PXF is a fast, extensible framework connecting HAWQ to an HDFS data store of choice, exposing a parallel API
– An advanced version of external tables
– Enables combining HAWQ data and Hadoop data in a single query
– Supports connectors for HDFS, HBase and Hive
– Provides an extensible framework API to enable custom connector development for any data source
(Diagram: the PXF extension framework sits between HAWQ and HDFS, HBase and Hive.)
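A hedged sketch of a PXF external table (host, port, path and columns are all hypothetical): an HDFS directory of delimited text files is exposed to HAWQ and can then be joined with native HAWQ tables in a single query:
CREATE EXTERNAL TABLE ext_sales (
  order_id BIGINT,
  amount   NUMERIC,
  country  TEXT
)
LOCATION ('pxf://namenode:51200/data/sales?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

SELECT country, sum(amount) FROM ext_sales GROUP BY country;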
37. EMC Corporation. All rights reserved.
• Interactive Query on top of Hadoop
• ANSI-92 SQL Standard
• Native MPP query engine
• Written in C++
IMPALA
OVERVIEW
38. EMC Corporation. All rights reserved.
• Native to Hadoop
– Blends with the ecosystem
– Security
– Hive MetaStore / HCatalog
– Query existing HDFS data
• Not as fault-tolerant as MapReduce
– (or Hive or SparkSQL or …)
– If a single node fails during a query, the whole query fails
– But if it’s 20x faster, you can rerun and still finish faster ;)
IMPALA
OVERVIEW
39. EMC Corporation. All rights reserved.
IMPALA ARCHITECTURE
Image courtesy of Cloudera
40. EMC Corporation. All rights reserved.
• Query execution times (small to medium size)
• Parquet Format
– Compression
• High Concurrency – kills the competitors
• Partitioning
• Query Optimizer (Compute Statistics!)
IMPALA
WHERE IT SHINES
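A hedged Impala sketch of the last two points (names are invented): a partitioned Parquet table plus COMPUTE STATS so the query optimizer has table and column statistics to plan with:
CREATE TABLE events (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
STORED AS PARQUET;

COMPUTE STATS events;

-- Partition pruning: only the matching partition directories are scanned.
SELECT count(*) FROM events WHERE event_date = '2016-06-01';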
42. EMC Corporation. All rights reserved.
• Distributed columnar storage manager
• Performance of Parquet
– Great for analytical queries
• Mutability of HBase
– Supports UPDATE/DELETE unlike Parquet
• One common storage to rule them all!
– (not exactly!)
WHAT THE HELL IS KUDU!
44. EMC Corporation. All rights reserved.
• IoT use cases
– High velocity data
– Same data read for analytical queries near real time
• Predictive Modeling
– Large datasets updated frequently
– Retraining models
• Time-series applications
– Kudu offers compound keys/hash based partitioning
– Avoids hot spotting
KUDU USE CASES
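A hedged sketch of a Kudu table managed through Impala SQL (exact DDL details shifted across early Impala/Kudu releases; names here are invented): hash partitioning on the key spreads time-series writes to avoid hot-spotting, and rows can be updated in place, unlike Parquet:
CREATE TABLE sensor_readings (
  sensor_id BIGINT,
  ts        BIGINT,
  reading   DOUBLE,
  PRIMARY KEY (sensor_id, ts)
)
PARTITION BY HASH (sensor_id) PARTITIONS 16
STORED AS KUDU;

UPDATE sensor_readings SET reading = 0 WHERE sensor_id = 7 AND ts = 1463000000;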
47. EMC Corporation. All rights reserved.
2 MIN INTRO TO SPARK
• General Purpose Distributed Computing System
– Multiple language support (Java, Scala, Python, and R)
– Fault tolerant, data distribution, in-memory caching etc.
• RDD
– Resilient distributed datasets
• Operations
– Transformations (define new RDDs)
– Actions (return value)
• No nonsense
– 100x faster than MapReduce
– Disk used only when it can't be avoided
48. EMC Corporation. All rights reserved.
2 MIN INTRO TO SPARK
Image Courtesy: Sachin Parmar
http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer?
50. EMC Corporation. All rights reserved.
SPARKSQL
• Structured Data Processing
– Commonly known to us as tables
• Integrated into Spark programming model
• Unified Data Access
• Scalability
• Support for HiveQL
• Cache it!
51. EMC Corporation. All rights reserved.
SPARKSQL
• Two APIs
– DataFrames
• Data organized into named columns
• Similar to Tables
• Can be constructed from structured data files, Hive, external DBs
– DataSets
• Experimental interface
• Strongly typed & SQL execution engine
• Can be constructed from regular JVM objects
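Because Spark SQL understands HiveQL and can talk to the Hive metastore, the same hypothetical table from the Hive slides can be queried, and pinned in memory, straight from the spark-sql shell (a hedged sketch):
CACHE TABLE page_views;   -- keep the table in memory for repeated queries

SELECT dt, count(DISTINCT user_id)
FROM page_views
GROUP BY dt;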
Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time query processing. There are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
Compared with RCFile format, for example, ORC file format has many advantages such as:
a single file as the output of each task, which reduces the NameNode's load
Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
light-weight indexes stored within the file that skip row groups that don't pass predicate filtering
block-mode compression based on data type
run-length encoding for integer columns and dictionary encoding for string columns
concurrent reads of the same file using separate RecordReaders
Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
Advantages of Columnar Storage:
Limits I/O by loading only the columns that are needed.
Saves space, as columnar layouts compress better.
Converts SQL into a physical execution plan
Cost-based optimization looks for the most efficient plan
Physical plan contains scans, joins, sorts, aggregations, etc.
Global planning avoids sub-optimal ‘SQL pushing’ to segments
Directly inserts motion nodes for inter-segment communication
Directly inserts motion nodes for efficient non-local join processing