Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)

Teradata has been hard at work on Presto, and we want to share with you what we've done so far and our roadmap going forward. From presto-admin, a tool for installing and administering Presto, to YARN/Ambari support, to fully certified JDBC and ODBC drivers, we are committed to making Presto the best, most enterprise-ready SQL-on-Hadoop solution out there.

What's hot

Presto meetup 2015-03-19 @Facebook

Treasure Data, Inc.

This document provides an overview of Presto as a Service in Treasure Data, including how Treasure Data deploys and monitors Presto. Key points include: - Treasure Data offers Presto as an interactive query engine accessible through its API and web console. - Treasure Data uses blue-green deployments and a private Maven repository to deploy new Presto versions with no downtime. - Treasure Data monitors Presto using its REST API and collects query logs to analyze performance and detect anomalies. - Treasure Data implements multi-tenancy in Presto by allocating resources like worker nodes based on customers' price plans and resource usage.

Presto at Facebook - Presto Meetup @ Boston (10/6/2015)

Martin Traverso

This document summarizes Presto, an analytics engine used at Facebook. It provides ad-hoc querying for data warehouses and batch processing. It is used for analytics across Facebook's data warehouses and specialized data stores. The document outlines Presto's architecture, deployment, usage statistics, features, and enhancements made for specific Facebook use cases including user-facing products, large datasets, and reliable data loading.

Presto@Netflix Presto Meetup 03-19-15

Zhenxiao Luo

Presto is used at Netflix for interactive queries against their 10PB data warehouse stored in S3. Some key points: - Presto was chosen for its open source nature, speed, scalability on AWS, and integration with Hadoop. - Netflix contributes to Presto's development, including improvements to S3 support and Parquet integration. - Current work includes optimizations like vectorized reading and predicate pushdown. Integration with BI tools and monitoring systems is also a focus. - Future work includes better resource management, support for additional data types, and techniques for handling large joins.

Presto at Twitter

Bill Graham

Presto @ Treasure Data - Presto Meetup Boston 2015

Taro L. Saito

Treasure Data simplifies event analytics for the complex digital world. Our customers send us 1,000,000 events per second and issue 30,000+ Presto queries every day to understand their customers better. One of the challenges is designing a cloud database with zero downtime to support a global customer base. We have achieved this goal by developing several open-source technologies; Fluentd and Embulk enable seamless log collection from stream/batch sources, and with MessagePack we can provide an extensible columnar store that accommodates future schema changes. Finally, Presto allows us to serve the wide variety of data processing our customers perform on our service. In this talk, I will present an overview of our system, and how our customers keep using Presto while collecting and extending their data sets.

Presto: Distributed sql query engine

kiran palaka

Presto is an open source distributed SQL query engine that allows querying large datasets ranging from gigabytes to petabytes faster and more interactively. It employs a custom query execution engine with pipelined operators designed for SQL semantics, avoiding unnecessary I/O and latency overhead. The Presto coordinator parses, analyzes, and plans queries, assigning work to nodes closest to data and monitoring progress, while clients pull results from output stages. Presto developers claim it is 10x better than Hive/MapReduce for most queries in terms of efficiency and latency.

Presto

Knoldus Inc.

Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...

viirya

This document discusses using Presto to enable interactive analytic queries over large datasets on Hadoop. Presto is a distributed SQL query engine that is optimized for fast, ad-hoc queries against data stored in various data sources like HDFS, Cassandra and MySQL. It uses a coordinator and worker architecture to parallelize query execution across clusters. The document demonstrates how to deploy and configure Presto, and provides a demo of integrating Presto with Grafana for interactive data visualization.

Bullet: A Real Time Data Query Engine

DataWorks Summit

Bullet is an open sourced, lightweight, pluggable querying system for streaming data without a persistence layer implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average. An instance of Bullet is currently running at Yahoo against its user engagement data pipeline. We’ll highlight how it is powering internal use-cases such as web page and native app instrumentation validation. Finally, we’ll show a demo of Bullet and go over query performance numbers.

Presto updates to 0.178

Kai Sasaki

Presto@Uber

Zhenxiao Luo

Presto is Uber's distributed SQL query engine for their Hadoop data warehouse. Some key points: - Presto allows interactive SQL queries directly on Uber's petabyte-scale Hadoop data lake without needing to first load the data into another database. - It provides fast performance at scale by leveraging columnar data formats like Parquet and optimizing for distributed execution across many nodes. - Uber deployed a 200 node Presto cluster that handles 30,000 queries per day, serving both ad hoc queries and real-time applications accessing data in Hadoop and improving on the performance of alternative solutions like Hive.

How to ensure Presto scalability in multi use case

Kai Sasaki

This document discusses how to ensure Presto scalability in multi-use case environments. It describes how Treasure Data uses Prestobase Proxy, a Finagle-based RPC proxy, to provide a scalable interface for BI tools. It also discusses Presto's node scheduler for distributing query stages across nodes and Treasure Data's use of resource groups to limit resource usage and isolate queries. The document advocates for approaches like dependency injection, VCR testing, and multi-dimensional resource scheduling to make Presto and its components reliable in distributed systems.

From Batch to Streaming ET(L) with Apache Apex

DataWorks Summit

Stream data processing is increasingly required to support business needs for faster actionable insight with growing volume of information from more sources. Apache Apex is a true stream processing framework for low-latency, high-throughput and reliable processing of complex analytics pipelines on clusters. Apex is designed for quick time-to-production, and is used in production by large companies for real-time and batch processing at scale. This session will use an Apex production use case to walk through the incremental transition from a batch pipeline with hours of latency to an end-to-end streaming architecture with billions of events per day which are processed to deliver real-time analytical reports. The example is representative for many similar extract-transform-load (ETL) use cases with other data sets that can use a common library of building blocks. The transform (or analytics) piece of such pipelines varies in complexity and often involves business logic specific, custom components. Topics include: * Pipeline functionality from event source through queryable state for real-time insights. * API for application development and development process. * Library of building blocks including connectors for sources and sinks such as Kafka, JMS, Cassandra, HBase, JDBC and how they enable end-to-end exactly-once results. * Stateful processing with event time windowing. * Fault tolerance with exactly-once result semantics, checkpointing, incremental recovery * Scalability and low-latency, high-throughput processing with advanced engine features for auto-scaling, dynamic changes, compute locality. * Who is using Apex in production, and roadmap. Following the session attendees will have a high level understanding of Apex and how it can be applied to use cases at their own organizations.

HBaseConEast2016: Splice machine open source rdbms

Michael Stack

This document discusses Splice Machine, an open source RDBMS that runs queries using Apache Spark for analytics and Apache HBase for transactions. It provides concise summaries of how Splice Machine executes queries by parsing SQL, optimizing query plans, and generating byte code to run queries on either HBase or Spark. It also benchmarks Splice Machine's performance on loading and running TPCH queries compared to other systems like Phoenix and shows how it enables advanced Spark integration by creating RDDs directly from HFiles.

Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Guozhang Wang

Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. Spark Streaming solves the realtime data processing problem, but to build large scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.

Membase Meetup 2010

Membase

This document provides an overview and technical discussion of Membase. It begins with introducing Membase and how it allows both applications and databases to scale horizontally. The rest of the document discusses Membase architecture, deployment options, use cases, and a demo. It also briefly explores developing with Membase and the future direction of NodeCode, which will allow extending Membase through custom modules.

Presto in my_use_case

wyukawa

The document summarizes the speaker's use of Presto for log analysis. Key points include: - Presto was selected due to familiarity from others and ease of use compared to other options. - Presto is used for batch queries with Hive and interactive queries. Results are accessed through Cognos using Prestogres. - Managing Presto involves deployment with Ansible, configuration tuning, and monitoring with tools like GrowthForecast and jstat2gf. - While Presto has been stable overall, the speaker notes some version upgrade issues but sees leverage from its frequent updates.

Building Distributed Data Streaming System

Ashish Tadose

Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice Machine

Data Con LA

In this talk, we will discuss how we use Spark as part of a hybrid RDBMS architecture that includes Hadoop and HBase. The optimizer evaluates each query and sends OLTP traffic (including CRUD queries) to HBase and OLAP traffic to Spark. We will focus on the challenges of handling the tradeoffs inherent in an integrated architecture that simultaneously handles real-time and batch traffic. Lessons learned include: - Embedding Spark into a RDBMS - Running Spark on Yarn and isolating OLTP traffic from OLAP traffic - Accelerating the generation of Spark RDDs from HBase - Customizing the Spark UI The lessons learned can also be applied to other hybrid systems, such as Lambda architectures. Bio:- John Leach is the CTO and Co-Founder of Splice Machine. With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies. Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning. John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach is the organizer emeritus for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.

Automatic Scaling Iterative Computations

Guozhang Wang

This document discusses iterative graph computations and limitations of MapReduce for such computations. It proposes GRACE, a graph processing framework that separates the vertex-centric computation logic from execution policies to allow both synchronous and asynchronous execution. As an example, it shows how belief propagation can be implemented in a vertex-centric manner and executed asynchronously using GRACE. This provides easier programming while enabling performance benefits of asynchronous execution.

Viewers also liked

Presto Testing Tools: Benchto & Tempto (Presto Boston Meetup 10062015)

Matt Fuller

Tempto is a product test framework that allows developers to write and execute tests for SQL databases running on Hadoop. Individual test requirements such as data generation, HDFS file copy/storage of generated data and schema creation are expressed declaratively and are automatically fulfilled by the framework. Developers can write tests using Java (using a TestNG-like paradigm and AssertJ-style assertions) or by providing query files with expected results. We will show how we use it for Presto product tests. Benchto is a benchmark framework that provides an easy and manageable way to define, run and analyze macro benchmarks in a clustered environment. Understanding the behavior of distributed systems is hard and requires good visibility into the state of the cluster and the internals of the tested system. This project was developed for repeatable benchmarking of Hadoop SQL engines, most importantly Presto.

Presto as a Service - Tips for operation and monitoring

Taro L. Saito

- Presto as a Service in Treasure Data involves deploying Presto using blue-green deployments with no downtime and automatic error recovery of failed queries. - Monitoring Presto involves using its JSON API to view queries and query plans as well as collecting Presto metrics with Fluentd and detecting anomalies. - Benchmarking compares query performance between Presto versions by running predefined query sets and aggregating the results.

Understanding Presto - Presto meetup @ Tokyo #1

Sadayuki Furuhashi

This document summarizes a presentation about Presto, an open source distributed SQL query engine. It discusses Presto's distributed and plug-in architecture, query planning process, and cluster configuration options. For architecture, it explains that Presto uses coordinators, workers, and connectors to distribute queries across data sources. For query planning, it shows how SQL queries are converted into logical and physical query plans with stages, tasks, and splits. For configuration, it reviews single-server, multi-worker, and multi-coordinator cluster topologies. It also provides an overview of Presto's recent updates.

Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla

ScyllaDB

Presto overview

Shixiong Zhu

Presto is a distributed SQL query engine that allows users to run SQL queries against various data sources. It consists of three main components - a coordinator, workers, and clients. The coordinator manages query execution by generating execution plans, coordinating workers, and returning final results to the client. Workers contain execution engines that process individual tasks and fragments of a query plan. The system uses a dynamic query scheduler to distribute tasks across workers based on data and node locality.

Presto

MK JUNG

Presto Meetup 2016 Small Start

Hiroshi Toyama

1. The presenter discusses their use of Presto for analytics at their company, including joining data across different data sources and using window functions on MySQL data. 2. They explain how they integrate Presto with other tools like re:dash for visualization and Embulk for ETL workflows. 3. While Presto solves many of their problems, they still require some ETL and have encountered issues like large repository sizes and coordinator bottlenecks.

Presto Meetup @ Facebook (3/22/2016)

Martin Traverso

AWS Meet-up: Logging At Scale on AWS

Chris Riddell

Prestogres internals

Sadayuki Furuhashi

Prestogres is a PostgreSQL protocol gateway for Presto that allows Presto to be queried using standard BI tools through ODBC/JDBC. It works by rewriting queries at the pgpool-II middleware layer and executing the rewritten queries on Presto using PL/Python functions. This allows Presto to integrate with the existing BI tool ecosystem while avoiding the complexity of implementing the full PostgreSQL protocol. Key aspects of the Prestogres implementation include faking PostgreSQL system catalogs, handling multi-statement queries and errors, and security definition. Future work items include better supporting SQL syntax like casts and temporary tables.

Future of Data Meetup : Boontadata

Abdelkrim Hadjidj

As some big data stream processing engines may become an alternative to batch engines, companies may have to choose the technology they will rely on. There are many considerations to take into account, including how to develop, and what the engine can do. Boontadata (http://boontadata.io) is an environment, available on GitHub, where anyone can experiment with stream processing engines. A common scenario is used to compare how to develop for and run different processing engines.

Hadoop World 2011: Big Data Architecture: Integrating Hadoop with Other Enter...

Cloudera, Inc.

Recent research has pointed out the complementary nature of Hadoop and other data management solutions and the importance of leveraging existing systems, SQL, engineering, and operational skills, as well as incorporating novel uses of MapReduce to improve analytic processing. Come to this session to learn how companies optimize the use of Hadoop with other enterprise systems to improve overall analytical throughput and build new data-driven products. This session covers: ways to achieve high-performance integration between Hadoop and relational-based systems; Hadoop+NoSQL vs Hadoop+SQL architectures; high-speed, massively parallel data transfer to analytical platforms that can aggregate web log data with granular fact data; and strategies for freeing up capacity for more explorative, iterative analytics and ad hoc queries.

Big Data: SQL query federation for Hadoop and RDBMS data

Cynthia Saracco

Hybrid Data Architecture: Integrating Hadoop with a Data Warehouse

DataWorks Summit

Mather Economics wanted a data architecture that could integrate Hadoop and a data warehouse to provide a responsive user experience for data slicing, aggregating, and modeling on 100% of data samples. A hybrid approach was implemented that uses Hadoop for ingestion and storage and a data warehouse for transformation, integration, and dimensional modeling to support both internal analysts and external customers. This hybrid approach meets the goals of being data and technology agnostic while providing speed for analytics.

Amazon EMR Facebook Presto Meetup

stevemcpherson

Steve McPherson presented on Amazon EMR (Elastic MapReduce) at the Facebook Presto Meetup on March 19, 2015. Amazon EMR makes cluster management easy by handling setup and configuration, node monitoring and replacement, log aggregation, and integration with CloudWatch and Spot instances. Amazon EMR supports various big data frameworks like MapReduce, Spark, Hive, Pig, Presto and data stores like Amazon S3. It allows running different clusters optimized for different workloads like Hive/Pig, Presto or Spark.

Presto changes

N Masahiro

This document summarizes recent updates to Presto, including new data types, connectors, syntax, features, functions, and configuration options. Some key additions are support for DECIMAL, VARCHAR, and new data types; connectors for Redis, MongoDB, and other data sources; transaction support; and a variety of new SQL functions for strings, dates, aggregation, and more. Upcoming work includes prepared statements, a new optimizer, and other performance and usability improvements.

Presto in my_use_case2

wyukawa

Presto - SQL on anything

Grzegorz Kokosiński

One of the key differences between Presto and Hive, and a crucial functional requirement Facebook set when launching this new SQL engine project, was the ability to query different kinds of data sources via a uniform ANSI SQL interface. Presto, an open source distributed analytical SQL engine, implements this with its connector architecture, creating an abstraction layer for anything that can be expressed in a row-like format, ranging from MySQL tables, HDFS, and Amazon S3 to NoSQL stores, Kafka streams and proprietary data sources. The Presto connector SPI allows anyone to implement a Presto connector and benefit from the capabilities of the Presto SQL engine, enabling them to join data from various sources within a single SQL query.

Internals of Presto Service

Treasure Data, Inc.

Presto is a distributed SQL query engine that Treasure Data provides as a service. Taro Saito discussed the internals of the Presto service at Treasure Data, including how the TD Presto connector optimizes scan performance from storage systems and how the service manages multi-tenancy and resource allocation for customers. Key challenges in providing a database as a service were also covered, such as balancing cost and performance.

Teradata Big Data London Seminar

Hortonworks

The document discusses a unified data architecture that enables any user to access and analyze any data type from data capture through analysis. It describes using a discovery platform to enable interactive data discovery on structured and unstructured data without extensive modeling. It also describes using an integrated data warehouse for cross-functional analysis, shared analytics, and lowest total cost of ownership. Finally, it provides examples of using the architecture for IPTV quality of service analysis, including predictive models using decision trees and naive Bayes.

Similar to Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)

Open Source SQL for Hadoop: Where are we and Where are we Going?

DataWorks Summit

Teradata has acquired Hadapt and the Teradata Center for Hadoop now has 40 developers working on open source SQL technologies like Presto. Teradata is committing resources to advancing Presto's open source codebase through contributions and plans to offer the first commercial support for Presto. Presto is an open source distributed SQL query engine that is optimized for interactive queries across data platforms.

Hortonworks.bdb

Emil Andreas Siemes

This document discusses Hortonworks and its mission to enable modern data architectures through Apache Hadoop. It provides details on Hortonworks' commitment to open source development through Apache, engineering Hadoop for enterprise use, and integrating Hadoop with existing technologies. The document outlines Hortonworks' services and the Hortonworks Data Platform (HDP) for storage, processing, and management of data in Hadoop. It also discusses Hortonworks' contributions to Apache Hadoop and related projects as well as enhancing SQL capabilities and performance in Apache Hive.

Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud

Alluxio, Inc.

Alluxio Tech Talk Mar 12, 2019 Speaker: Bin Fan, Alluxio Matt Fuller, Starburst As data analytic needs have increased with the explosion of data, the importance of the speed of analytics and the interactivity of queries has increased dramatically. In this tech talk, we will introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn about: - The architecture of Presto, an open source distributed SQL engine, as well as innovations by Starburst such as its cost-based optimizer - How Presto can query data from cloud object storage like S3 at high performance and cost-effectively with Alluxio - How to achieve data locality and cross-job caching with Alluxio no matter where the data is persisted and reduce egress costs In addition, we’ll present some real world architectures & use cases from internet companies like JD.com and NetEase.com running the Presto and Alluxio stack at the scale of hundreds of nodes.

Building Scalable Big Data Infrastructure Using Open Source Software Presenta...

ssuserd3a367

1) StumbleUpon uses open source tools like Kafka, HBase, Hive and Pig to build a scalable big data infrastructure to process large amounts of data from its services in real-time and batch. 2) Data is collected from various services using Kafka and stored in HBase for real-time analytics. Batch processing is done using Pig and data is loaded into Hive for ad-hoc querying. 3) The infrastructure powers various applications like recommendations, ads and business intelligence dashboards.

Teradata - Presentation at Hortonworks Booth - Strata 2014

Hortonworks

Bi on Big Data - Strata 2016 in London

Dremio Corporation

Hitachi Data Systems Hadoop Solution

Hitachi Vantara

Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with the Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: Solve big-data problems with Hadoop. Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data. Implement Hadoop using the HDS Hadoop reference architecture. For more information on Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html

Twitter with hadoop for oow

Gwen (Chen) Shapira

Munich HUG 21.11.2013

Emil Andreas Siemes

Hortonworks provides an overview of their Tez framework for improving Hadoop query processing. Tez aims to accelerate queries by expressing them as dataflow graphs that can be optimized, rather than relying solely on MapReduce. It also aims to empower users by allowing flexible definition of data pipelines and composition of inputs, processors, and outputs. Early results show a 100x speedup on benchmark queries compared to traditional MapReduce.

Big Data Integration Webinar: Getting Started With Hadoop Big Data

Pentaho

This document discusses getting started with big data analytics using Hadoop and Pentaho. It provides an overview of installing and configuring Hadoop and Pentaho on a single machine or cluster. Dell's Crowbar tool is presented as a way to quickly deploy Hadoop clusters on Dell hardware in about two hours. The document also covers best practices like leveraging different technologies, starting with small datasets, and not overloading networks. A demo is given and contact information provided.

Self-Service BI for big data applications using Apache Drill (Big Data Amster...

Dataconomy Media

Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low latency queries natively on rapidly evolving multi-structured datasets at scale.

Self-Service BI for big data applications using Apache Drill (Big Data Amster...

Mats Uddenfeldt

Modernizing Global Shared Data Analytics Platform and our Alluxio Journey

Alluxio, Inc.

Savanna - Elastic Hadoop on OpenStack

Sergey Lukjanov

Savanna is an OpenStack component that allows elastic provisioning of Hadoop clusters in OpenStack. It has a 3 phase roadmap - phase 1 allows basic cluster provisioning which is complete, phase 2 will add advanced configuration and tool integration currently in progress, and phase 3 will enable analytics as a service with a job execution framework. Savanna uses an extensible plugin architecture to provision Hadoop VMs and configure the clusters, integrating with other OpenStack components like Nova, Glance, and Swift.

UNC Chapel Hill Ctc Retreat 2014 SAS Visual Analytics and Business Intelligence

Jonathan Pletzke

Hear about and see the latest SAS solutions in use at UNC-CH. In support of ConnectCarolina and InfoPorte for administrative data, two SAS server-based platforms have been installed: SAS Business Intelligence, which is being used for Extract-Transform-Load (ETL) manipulation of data, and SAS Visual Analytics, which is being used for reporting and visualization of data. Hear about the high speed and high capacity of the server-based solutions, along with how they are being used and benefiting UNC Chapel Hill.

Talend for big_data_intorduction

Lakshman Dhullipalla

Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack

Alluxio, Inc.

Alluxio Tech Talk January 21, 2020 Speakers: Matt Fuller, Starburst Dipti Borkar, Alluxio With the advent of the public clouds and data increasingly siloed across many locations -- on premises and in the public cloud -- enterprises are looking for more flexibility and higher performance approaches to analyze their structured data. Join us for this tech talk where we’ll introduce the Starburst Presto, Alluxio, and cloud object store stack for building a highly-concurrent and low-latency analytics platform. This stack provides a strong solution to run fast SQL across multiple storage systems including HDFS, S3, and others in public cloud, hybrid cloud, and multi-cloud environments. You’ll learn more about: - The architecture of Presto, an open source distributed SQL engine - How the Presto + Alluxio stack queries data from cloud object storage like S3 for faster and more cost-effective analytics - Achieving data locality and cross-job caching with Alluxio regardless of where data is persisted

Piranha vs. mammoth predator appliances that chew up big data

Jack (Yaakov) Bezalel

If you also got the Big Data itch, here is something to ease the pain :-) Answers to this questions will be available soon (more info in the attached link) Which Big Data Appliance should YOU use? (click on the attached link for Poll results) Appliances are Small and Quick, Right? Revealing the 6 Types of Big Data Appliances Uncovering the Main Players Challenges, Pitfalls, and Winning the Big Data Game Where is all this leading YOU to?

Summer Shorts: Big Data Integration

ibi

Today's organizations contend with more diverse applications, data, and systems than ever before – silos that are often fragmented and difficult to leverage together. iWay Big Data Integrator (BDI) simplifies the creation, management, and use of Hadoop-based data lakes. It provides a modern, native approach to Hadoop-based data integration and management that ensures high levels of capability, compatibility, and flexibility to help your organization. Join us to learn how you can simplify adoption of Apache Hadoop using iWay Big Data Integrator. Learn about our ability to streamline the deployment of ingestion, transformation, and extraction tasks. See the pre-recorded webcast online at: http://www.informationbuilders.com/webevents/online/24427#sthash.J0cRy1PG.dpuf

Webinar: What's new in CDAP 3.5?

Cask Data

Cask Webinar Date: 08/10/2016 Link to video recording: https://www.youtube.com/watch?v=XUkANr9iag0 In this webinar, Nitin Motgi, CTO of Cask, walks through the new capabilities of CDAP 3.5 and explains how your organization can benefit. Some of the highlights include: - Enterprise-grade security - Authentication, authorization, secure keystore for storing configurations. Plus integration with Apache Sentry and Apache Ranger. - Preview mode - Ability to preview and debug data pipelines before deploying them. - Joins in Cask Hydrator - Capabilities to join multiple data sources in data pipelines - Real-time pipelines with Spark Streaming - Drag & drop real-time pipelines using Spark Streaming. - Data usage analytics - Ability to report application usage of data sets. - And much more!

Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)

1. Hello, Enterprise! Meet Presto: Teradata Contributions to Presto. Christina Wallin, 10/6/15

2. Who are we?
• Teradata Center for Hadoop
• Formerly Hadapt, the first SQL-on-Hadoop company (founded in 2010)
• Offices in Boston and Warsaw, some remote employees in CA and CT
• Around 20 employees working on Presto
• Contributors to the open source project Presto!

3. What is Presto?
• 100% open source distributed ANSI SQL engine for Big Data
– Modern architecture and implementation
– Proven scalability and performance
– Optimized for low latency, interactive querying
• Cross platform query capability, not only SQL on Hadoop
• Distributed under the Apache license, now supported by Teradata
• Used by a community of well known, well respected technology companies

4. Presto Architecture [diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); three Workers]
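The component names on the architecture slide suggest a flow from client to coordinator to workers. The following is only a toy sketch of that flow, not Presto's actual implementation: the function names, the `Worker` class, the notion of numbered "splits", and the round-robin assignment are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker that simply records the tasks assigned to it."""
    name: str
    tasks: list = field(default_factory=list)


def parse(sql: str) -> dict:
    """Toy stand-in for the parser/analyzer: accepts only SELECT ... FROM <table>."""
    tokens = sql.strip().rstrip(";").split()
    upper = [t.upper() for t in tokens]
    if not upper or upper[0] != "SELECT" or "FROM" not in upper:
        raise ValueError("toy parser only handles SELECT ... FROM <table>")
    table_pos = upper.index("FROM") + 1
    if table_pos >= len(tokens):
        raise ValueError("missing table name after FROM")
    return {"table": tokens[table_pos]}


def plan(query: dict, num_splits: int) -> list:
    """Toy stand-in for the planner: one scan task per (assumed) split."""
    return [f"scan {query['table']} split {i}" for i in range(num_splits)]


def schedule(tasks: list, workers: list) -> list:
    """Toy stand-in for the scheduler: assign tasks round-robin."""
    for i, task in enumerate(tasks):
        workers[i % len(workers)].tasks.append(task)
    return workers


def coordinate(sql: str, workers: list, num_splits: int = 6) -> list:
    """Chain the three coordinator stages named on the slide."""
    return schedule(plan(parse(sql), num_splits), workers)
```

Calling `coordinate("SELECT count(*) FROM lineitem", [Worker("w1"), Worker("w2"), Worker("w3")])` leaves two scan tasks on each of the three hypothetical workers.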

5. Presto Pluggable Data Sources [diagram: Presto with push-down capabilities to Hadoop systems (HDFS), other databases, and NoSQL databases; Kafka is also shown as a data source]

6. Teradata Contributions to Presto
• Phase 1, Implement (June 8, 2015): Installer, Documentation, Monitoring & Support Tools
• Phase 2, Integrate (Q4 2015): Management Tool Integration, YARN Integration, ODBC Driver
• Phase 3, Proliferate (2016): JDBC Driver, BI Certification, Security, Connectors
• Also listed: Commercial Support; Expanding ANSI SQL Coverage

7. Easy Installation and Administration

8. presto-admin: a tool to manage and install Presto
• presto-admin can:
– Install and uninstall Presto
– Deploy configuration files across the cluster
– Start/stop/restart Presto servers
– Show you the status of the cluster
– Add and remove connectors
– Upgrade Presto to a different version
– Collect logs, query info, system info for support
• Additionally, we added an RPM for Presto
• https://github.com/prestodb/presto-admin
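As a rough illustration of the install-and-manage workflow, the sketch below assembles the presto-admin command lines in the order they appear in this deck's demo notes. It only builds the strings and runs nothing. The helper name `presto_admin_commands` and its parameters are hypothetical; the subcommands themselves are copied from the notes, not checked against a particular presto-admin release.

```python
import shlex


def presto_admin_commands(rpm_path: str, connector: str,
                          admin: str = "./presto-admin") -> list:
    """Return the demo's presto-admin invocations as shell-ready strings.

    This is a dry run: nothing is executed. Subcommands follow the
    deck's demo notes; the helper itself is illustrative only.
    """
    steps = [
        ["server", "install", rpm_path],  # install the Presto RPM on the cluster
        ["server", "start"],              # start the Presto servers
        ["server", "status"],             # check that the coordinator sees the workers
        ["connector", "add", connector],  # deploy the connector configuration
        ["server", "restart"],            # restart so the connector is picked up
        ["server", "status"],             # confirm the cluster came back
    ]
    return [shlex.join([admin, *step]) for step in steps]
```

For example, `presto_admin_commands("presto-server-rpm.rpm", "hive")` starts with `./presto-admin server install presto-server-rpm.rpm`. Keeping the sequence as data makes it easy to review before handing it to a shell.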

9. Hadoop Ecosystem Integration

10. Ambari Integration (Work In Progress)
• http://github.com/prestodb/ambari-presto-service

15. Resource Allocation with YARN
• Slated for Q4 2015
• Allow Presto to run its services within YARN containers so that YARN knows about memory/CPU allocated to Presto.
– Using Apache Slider
– The allocation is fixed and upfront
– Supports HDP and CDH Hadoop Versions
• YARN CGroups Integration
• http://github.com/prestodb/presto-yarn
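The practical effect of a fixed, upfront allocation can be shown with simple arithmetic: whatever is reserved for Presto on a node is no longer available to other YARN applications. The sketch below is a back-of-the-envelope illustration only. The function name and all figures are hypothetical, and it does not call any YARN or Slider API.

```python
def remaining_for_other_apps(node_memory_gb: int, node_vcores: int,
                             presto_memory_gb: int, presto_vcores: int) -> tuple:
    """Return (memory_gb, vcores) left on one node for other YARN applications.

    Models a fixed, upfront reservation: the Presto share is subtracted
    once and does not change while the service is running.
    """
    if presto_memory_gb > node_memory_gb or presto_vcores > node_vcores:
        raise ValueError("Presto reservation exceeds node capacity")
    return node_memory_gb - presto_memory_gb, node_vcores - presto_vcores
```

With an assumed node of 128 GB and 16 vcores, reserving 64 GB and 8 vcores for Presto leaves 64 GB and 8 vcores for other applications such as MapReduce or Tez jobs.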

16. Enterprise Database Features

17. Unleashing Presto on Business Intelligence Tools
• Improved ODBC driver: Q4 2015
• Improved JDBC driver: Q1 2016
• Certification against Tableau, Qlik, etc.: mid 2016

18. Expanded ANSI SQL Support
• Current Contributions
– DECIMAL type (WIP)
– Additional smaller things: new functions, bug fixes, TIMESTAMP support for Parquet
• Future goal: Support TPC-H and TPC-DS unmodified!
– Additional subquery and join support
– EXISTS, EXCEPT, INTERSECT
– Various other odds and ends
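For readers less familiar with the constructs on the roadmap, here is a small sketch of what EXISTS, EXCEPT, and INTERSECT compute. It runs against an in-memory SQLite database purely as a stand-in, since these operators are standard SQL; it says nothing about Presto's own implementation or its support status. The table names and rows are made up for the example.

```python
import sqlite3


def run_examples() -> dict:
    """Run EXISTS, EXCEPT and INTERSECT on toy data; return results per query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (custkey INTEGER);
        CREATE TABLE orders (custkey INTEGER);
        INSERT INTO customer VALUES (1), (2), (3);
        INSERT INTO orders VALUES (1), (1), (3), (4);
    """)
    queries = {
        # Customers that have at least one order (correlated subquery).
        "exists": """SELECT custkey FROM customer c
                     WHERE EXISTS (SELECT 1 FROM orders o
                                   WHERE o.custkey = c.custkey)
                     ORDER BY custkey""",
        # Customer keys with no matching order key.
        "except": """SELECT custkey FROM customer
                     EXCEPT SELECT custkey FROM orders
                     ORDER BY custkey""",
        # Keys present in both tables, de-duplicated.
        "intersect": """SELECT custkey FROM customer
                        INTERSECT SELECT custkey FROM orders
                        ORDER BY custkey""",
    }
    results = {name: [row[0] for row in conn.execute(sql)]
               for name, sql in queries.items()}
    conn.close()
    return results
```

On this toy data, EXISTS and INTERSECT both return customers 1 and 3, while EXCEPT returns customer 2, the only customer without an order.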

19. Demo of presto-admin!

20. How can I give Presto a try?
• https://github.com/facebook/presto
• https://github.com/prestodb/presto-admin
• Certified distro: http://www.teradata.com/presto/
– VM images pre-installed with Presto are also available for download

21. Questions?


Editor's Notes

• Interactive performance of the execution engine
• Code generation for operators (similar to Impala)
• Data is pipelined MPP-style
• Runs at Facebook scale
• Capable of querying other non-HDFS data stores as well
Presto-YARN integration objective: resource allocation for long-running services. Where Presto and Hadoop share the same hardware (or cluster), YARN integration also provides a unified way of accounting for and monitoring cluster utilization. The goal is to make YARN aware of how much RAM and CPU is allocated to Presto, so that correspondingly less is available to other YARN applications (MapReduce, Tez, etc.). The allocation is fixed and upfront: dynamic changes to resource allocation are not supported in Phase 2, and reconfiguring memory/CPU settings requires a restart. YARN has introduced support for CPU sharing via CGroups. Currently, CGroups are used only for limiting CPU usage, so we will leverage them to limit Presto's CPU usage. (Slider also has some CPU resource sharing support.) Apache Slider is a YARN application that deploys existing distributed applications on YARN, monitors them, and makes them larger or smaller as desired. Slider's objective is to make it easy for existing distributed applications, like Presto, to be deployed on a YARN cluster without changes and with little or no custom code.
• Untar presto-admin and install it
• ./presto-admin server install presto-server-rpm.rpm
• ./presto-admin server start
• Pause briefly so that the coordinator finds the workers
• ./presto-admin server status
• ./presto-admin configuration show
• cat hive.properties
• mv hive.properties /opt/prestoadmin/connectors
• ./presto-admin connector add hive
• ./presto-admin server restart
• Wait
• ./presto-admin server status
• Presto CLI: ./presto --server localhost:8080 --catalog hive --schema default
• show tables;
• create table lineitem as select * from tpch.1gb.lineitem;
• select count(*) from lineitem;
