Apache Spark 2.0 set the architectural foundations of structure in Spark: unified high-level APIs, Structured Streaming, and the underlying performant components like the Catalyst optimizer and the Tungsten engine. Since then, the Spark community has continued to build new features and fix numerous issues in the Spark 2.1 and 2.2 releases.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides, introducing new features and resolving over 1,300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features (a minimal sketch of the new continuous processing trigger follows the feature list):
• New deployment mode: Kubernetes scheduler backend
• PySpark performance and enhancements
• New structured streaming execution engine: continuous processing
• Data source v2 APIs for both structured streaming and Spark SQL
• ML on structured streaming
• Image reader
• Stable codegen engine
• Spark History Server V2
• Native ORC support
• Vectorized ORC and SQL cache readers
• Stream-stream Join
• UDF enhancements
• Various SQL enhancements
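To make one of these concrete: continuous processing replaces micro-batches with long-running tasks for millisecond-level latency. A minimal PySpark sketch of opting into it (assuming Spark 2.3+; the checkpoint path is a placeholder, and in 2.3 only map-like operations such as projections and filters are supported):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows and, like Kafka,
# supports the continuous execution mode.
events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

# trigger(continuous=...) opts into the new engine; the interval is how
# often progress is checkpointed, not a batch interval.
query = (events.filter("value % 2 = 0")
         .writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/continuous-demo")  # placeholder path
         .trigger(continuous="1 second")
         .start())
```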
Speakers
Xiao Li, Software Engineer, Databricks
Wenchen Fan, Software Engineer, Databricks
Sharing metadata across the data lake and streams (DataWorks Summit)
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
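As a concrete illustration of that sharing, a Spark session pointed at the shared metastore sees the same tables that Hive, Presto, or Impala defined. A minimal PySpark sketch (assuming a hive-site.xml on the classpath; the table name is hypothetical):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() wires Spark's catalog to the shared Hive metastore,
# so table definitions created by other engines resolve here by name.
spark = (SparkSession.builder
         .appName("shared-metastore-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES IN default").show()

# 'clicks' stands in for any table another engine registered in the metastore.
spark.table("default.clicks").groupBy("user_id").count().show()
```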
Speaker
Alan Gates, Co-Founder, Hortonworks
The document describes Big Data Ready Enterprise (BDRE), an open source product that addresses common challenges in implementing and operating big data solutions at large scale. It provides out-of-the-box features to accelerate implementations using pluggable architecture, community support, and distribution compatibility. The document outlines BDRE's key benefits and capabilities for data ingestion, workflow automation, operational metadata management, and more. It also provides examples of BDRE implementations and screenshots of the product's interface.
Quick! Quick! Exploration!: A framework for searching a predictive model on A... (DataWorks Summit)
This document summarizes a framework for automating predictive modeling on Apache Spark. The framework allows for scalable searching of predictive models across large parameter spaces. It addresses challenges of high scalability and easy integration of new machine learning implementations. An evaluation shows the framework achieves a 13x speedup over Spark MLlib and reduces the amount of code needed to add new machine learning algorithms. The framework automates predictive modeling tasks in a scalable and plug-and-play manner.
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big Data adoption is a journey. Depending on the business the process can take weeks, months, or even years. With any transformative technology the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with big data. EHI's journey started in 2013 with Hadoop as a POC, and today the company is working to create the next-generation data warehouse in Microsoft's Azure cloud utilizing a lambda architecture.
We'll discuss the Center of Excellence and the roles in the new world, share the things which worked well, and rant about those which didn't.
No deep Hadoop knowledge is necessary; the content is aimed at the architect or executive level.
Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer to be a common resource management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues, etc.
In this talk, we’ll start with the current status of Apache Hadoop YARN—how it is used today in deployments large and small. We'll then move on to the exciting present and future of YARN—features that are further strengthening YARN as the first-class resource management platform for data centers running enterprise Hadoop.
We’ll discuss the current status as well as the future promise of features and initiatives like: powerful container placement, global scheduling, support for machine learning and deep learning workloads through GPU and FPGA support, extreme scale with YARN federation, containerized apps on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, powerful scheduling features like application priorities, intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
Speakers
Wangda Tan, Staff Software Engineer, Hortonworks
Billie Rinaldi, Principal Software Engineer I, Hortonworks
Apache Hive is a rapidly evolving project that continues to enjoy broad adoption in the big data ecosystem. Although Hive started primarily as a batch ingestion and reporting tool, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations that have landed in the project over the last year. Materialized views, micro-managed tables, and workload management are some noteworthy features.
I will deep dive into some optimizations that promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel and have existed for years in other database systems, implementing them in Hive poses unique challenges and yields lessons that are broadly applicable in many other contexts. I will also provide a glimpse of what is expected to come in the near future.
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Logical Data Warehouse: How to Build a Virtualized Data Services Layer (DataWorks Summit)
The document discusses the emergence of logical data warehouses in response to big data. It describes how a logical data warehouse uses virtualization, distributed processing, and other techniques to provide a unified view of data across different repositories like Hadoop, relational databases and NoSQL stores. It also discusses how organizations can optimize resources by offloading analytical workloads from their enterprise data warehouse to Hadoop clusters to reduce costs while still using existing code and applications.
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even... (DataWorks Summit)
Learn how Pure Storage engineering manages streaming 190B log events per day and makes use of that deluge of data in our continuous integration (CI) pipeline. Our test infrastructure runs over 70,000 tests per day, creating a triage problem that would require at least 20 triage engineers. Instead, Spark's flexible computing platform allows us to write a single application for both streaming and batch jobs to understand the state of our CI pipeline with a team of 3 triage engineers. Using encoded patterns, Spark indexes log data for real-time reporting (streaming), uses machine learning for performance modeling and prediction (batch job), and finds previous matches for newly encoded patterns (batch job). Resource allocation in this mixed environment can be challenging; a containerized Spark cluster deployment and disaggregated compute and storage layers allow us to programmatically shift compute resources between the streaming and batch applications. This talk will go over design decisions to meet the SLAs of streaming and batching in hardware, data layout, access patterns, and container strategy. We will also go over the challenges, lessons learned, and best practices for similar data pipelines.
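To make the single-application pattern concrete, here is a minimal PySpark sketch in that spirit (not Pure Storage's actual code; paths, schema, and patterns are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ci-log-pipeline").getOrCreate()

schema = "ts TIMESTAMP, test_id STRING, line STRING"  # hypothetical log schema

# Streaming job: index error events as new log files arrive.
logs = spark.readStream.format("json").schema(schema).load("/data/ci/logs")
stream = (logs.filter(col("line").contains("ERROR"))
          .writeStream
          .format("parquet")
          .option("path", "/data/ci/error-index")
          .option("checkpointLocation", "/data/ci/checkpoints")
          .start())

# Batch job over the same data: scan full history for a newly encoded pattern.
history = spark.read.format("json").schema(schema).load("/data/ci/logs")
history.filter(col("line").rlike("segfault|core dumped")) \
       .groupBy("test_id").count().show()
```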
Speaker
Joshua Robinson, Founding Engineer, Pure Storage
This presentation will describe the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration in the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus for modernization of its Netezza legacy analytics platform. This involved the use of the Impetus Workload Migration solution—a sophisticated translation engine that automated the migration of their complex Netezza stored procedures, shell and scheduler scripts to Apache Spark compatible scripts. This delivered substantial savings in time, effort and cost, while reducing overall project risk.
Included in the scope of the automation project was an automated assessment capability to perform detailed profiling of the current workloads. The output from the assessment stage was a data-driven offloading blueprint and roadmap for which workloads to migrate. A hybrid cloud-based big data solution was designed based on that. In addition to fulfilling the essential requirement of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performing SCD Type 1 and Type 2 for mission-critical parameters and reloading the transformed data back for reporting/analytical consumption.
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
Streaming Analytics Manager (SAM) simplifies the development and reduces the delivery time of analytic applications geared towards data in motion. Using a drag-and-drop interface, application developers can create complex streaming analytics apps for event correlation, context enrichment, complex pattern matching, and analytical aggregations, eliminating the need for specialized skill sets. SAM also allows users to easily define the streaming engine and environments their application will use for execution and a streaming operations view to give users insight into their application’s performance during runtime.
In this talk we will cover the key features of the Streaming Analytics Manager. We will then go over the new features recently added to SAM around ease of debugging and troubleshooting, log search, event sampling, the metrics view, test simulation mode, and more.
With SAM as an analytics solution, users get a rich experience for building and managing streaming analytics applications and bringing these applications to market considerably faster.
Speaker
Arun Iyer, Software Engineer, Hortonworks
Lessons learned running a container cloud on YARN (DataWorks Summit)
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes, YARN now supports running Docker containers alongside process containers. Coupled with the newly added support for long-running services on YARN, this allows a host of new possibilities.
In this talk, we'll present how to run a container cloud on YARN. Leveraging the support in YARN for Docker and long-running services, we can allow users to easily spin up sets of Docker containers for their applications. These containers can be self contained or wired up to form more complex applications. We will go over some of the lessons we learned as part of our experiences handling issues such as resource management, debugging application failures, running Docker, service discovery, etc.
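For flavor, a service definition for the YARN service framework looks roughly like the sketch below (field values are illustrative and the exact spec varies by Hadoop release; consult your version's YARN Services API documentation):

```json
{
  "name": "redis-service",
  "version": "1.0.0",
  "components": [
    {
      "name": "redis",
      "number_of_containers": 2,
      "artifact": { "id": "library/redis", "type": "DOCKER" },
      "launch_command": "redis-server",
      "resource": { "cpus": 1, "memory": "512" }
    }
  ]
}
```

On a cluster configured for the Docker container runtime, a spec like this can be launched with `yarn app -launch redis-service redis.json`.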
Speaker
Billie Rinaldi, Principal Software Engineer I, Hortonworks
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
1. The document discusses Microsoft's SCOPE analytics platform running on Apache Tez and YARN. It describes how Graphene was designed to integrate SCOPE with Tez to enable SCOPE jobs to run as Tez DAGs on YARN clusters.
2. Key components of Graphene include a DAG converter, Application Master, and tooling integration. The Application Master manages task execution and communicates with SCOPE engines running in containers.
3. Initial experience running SCOPE on Tez has been positive though challenges remain around scaling to very large workloads with over 15,000 parallel tasks and optimizing for opportunistic containers and Application Master recovery.
The Future of Data Warehousing, Data Science and Machine Learning (ModusOptimum)
Watch the on-demand recording here:
https://event.on24.com/wcc/r/1632072/803744C924E8BFD688BD117C6B4B949B
Evolution of Big Data and the Role of Analytics | Hybrid Data Management
IBM: Driving the future hybrid data warehouse with the IBM Integrated Analytics System.
Using LLVM to accelerate processing of data in Apache Arrow (DataWorks Summit)
Most query engines follow an interpreter-based approach where a SQL query is translated into a tree of relational algebra operations then fed through a conventional tuple-based iterator model to execute the query. We will explore the overhead associated with this approach and how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
Generally speaking, the best case for optimal query execution performance is a hand-written query plan that does exactly what is needed by the query for the exact same data types and format. Vectorized query processing models amortize the cost of function calls. However, research has shown that hand-written code for a given query plan has the potential to outperform the optimizations associated with a vectorized query processing model.
Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediary representation (IR) which is subsequently converted into machine code for the desired target architecture.
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to the vector type in LLVM, which makes our job easier when writing the query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for operators like FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
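The LLVM code generation itself lives in Dremio and Arrow's native layers, but the shape of vectorized FILTER and PROJECT over Arrow columnar data can be sketched with pyarrow's compute kernels (an illustration of the execution model, assuming a recent pyarrow; not Dremio's implementation):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"price": [10.0, 25.5, 7.2, 42.0],
                  "qty":   [3, 1, 10, 2]})

# FILTER: evaluate the predicate over whole column vectors at once.
mask = pc.greater(table["price"], 10.0)
filtered = table.filter(mask)

# PROJECT: compute a derived column vector-at-a-time, no per-row interpreter.
revenue = pc.multiply(filtered["price"],
                      pc.cast(filtered["qty"], pa.float64()))
print(filtered.append_column("revenue", revenue))
```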
Speaker
Siddharth Teotia, Dremio, Software Engineer
Big data security challenges differ somewhat from those of traditional client-server applications: big data systems are distributed in nature, which introduces unique security vulnerabilities. The Cloud Security Alliance (CSA) has categorized the different security and privacy challenges into four aspects of the big data ecosystem: infrastructure security, data privacy, data management, and integrity and reactive security. Each of these aspects is further divided into the following security challenges:
1. Infrastructure security
a. Secure distributed processing of data
b. Security best practices for non-relational data stores
2. Data privacy
a. Privacy-preserving analytics
b. Cryptographic technologies for big data
c. Granular access control
3. Data management
a. Secure data storage and transaction logs
b. Granular audits
c. Data provenance
4. Integrity and reactive security
a. Endpoint input validation/filtering
b. Real-time security/compliance monitoring
In this talk, we are going to refer to the above classification and identify existing security controls, best practices, and guidelines. We will also paint a big picture of how the collective usage of all the discussed security controls (Kerberos, TDE, LDAP, SSO, SSL/TLS, Apache Knox, Apache Ranger, Apache Atlas, Ambari Infra, etc.) can address the fundamental security and privacy challenges that encompass the entire Hadoop ecosystem. We will also briefly discuss recent security incidents involving Hadoop systems.
Speakers
Krishna Pandey, Staff Software Engineer, Hortonworks
Kunal Rajguru, Premier Support Engineer, Hortonworks
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop that was created by eBay and later open sourced as an Apache Incubator project. It provides security for Hadoop systems by instantly identifying access to sensitive data, recognizing attacks/malicious activity, and blocking access in real time through complex policy definitions and stream processing. Eagle was designed to handle the huge volume of metrics and logs generated by large-scale Hadoop deployments through its distributed architecture and use of technologies like Apache Storm and Kafka.
Insights into Real-world Data Management Challenges (DataWorks Summit)
Oracle began with the belief that the foundation of IT is managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle's Integrated Cloud is one cloud for the entire business, meeting everyone's needs. It's about connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
At ING we needed a way to take data science models from exploration into production. I will give this talk from my experience as a senior ops engineer on the exploration and production Hadoop environments. For this we use OpenShift to run Docker containers that connect to the big data Hadoop environment.
During this talk I will explain why we need this and how it is done at ING, including how to set up a Docker container running a data science model using Hive, Python, and Spark. I'll explain how to use Dockerfiles to build Docker images, add all the needed components inside the Docker image, and run different versions of software in different containers.
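A hedged sketch of what such a Dockerfile can look like (base image, versions, and paths are illustrative, not ING's actual build):

```dockerfile
FROM python:3.6-slim

# Java is required by PySpark; pinning versions keeps model runs reproducible.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-8-jre-headless \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir pyspark==2.3.1 pandas scikit-learn

# Bundle the model code and its scoring entry point into the image.
COPY model/ /app/model/
WORKDIR /app

CMD ["python", "model/score.py"]
```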
In the end I will also give a demo of how it runs and is automated, using Git with a webhook connecting to Jenkins to start the Docker service that connects to the big data Hadoop environment.
This is going to be a great technical talk for engineers and data scientists.
Speaker
Lennard Cornelis, Ops Engineer, ING
The document discusses how machine data from various sources such as IoT devices, industrial systems, mobile devices, and other systems can be collected and analyzed using Splunk software. Splunk provides capabilities for data ingestion, indexing, searching, analyzing, and visualizing large amounts of machine data. It also discusses how Splunk has been used by companies in various industries to gain insights from their machine data to improve operations, security, customer experience, and business outcomes. Specific use cases highlighted include predictive maintenance, anomaly detection, supply chain optimization, and understanding customer behavior.
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
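As a taste of the new higher-order functions, which let SQL operate on array columns directly, a minimal PySpark sketch (assuming Spark 2.4+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-demo").getOrCreate()
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5])], ["id", "values"])
df.createOrReplaceTempView("t")

# transform and aggregate take lambda expressions over array elements.
spark.sql("""
  SELECT id,
         transform(values, x -> x + 1)             AS incremented,
         aggregate(values, 0, (acc, x) -> acc + x) AS total
  FROM t
""").show()
```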
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
This document summarizes new features in Apache Spark 2.3, including continuous processing mode for structured streaming, stream-stream joins, running Spark applications on Kubernetes, improved PySpark performance through vectorized UDFs and Pandas integration, and Databricks Delta for reliability and performance in data lakes. The author, an Apache Spark committer and PMC member, provides overviews and code examples of these features.
Apache CarbonData & Spark Meetup
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad hoc SQL queries, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest scenarios it supports queries on a single table with 3 PB of data (more than 5 trillion records) with response times of less than 3 seconds!
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal (Databricks)
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
The document summarizes the major new features in Apache Spark 2.3, including continuous processing for low-latency streaming, Spark running on Kubernetes, improved PySpark performance using Pandas UDFs, machine learning capabilities on streaming data, and image reading support. Some key updates are continuous processing for streaming with latency of ~1 ms and at-least-once semantics, Spark's ability to run natively on Kubernetes clusters, and Pandas UDFs in PySpark providing a 3x to 100x performance boost over row-at-a-time UDFs. The speaker is the Spark 2.3 release manager and discusses these topics at the Spark Summit on June 6, 2018.
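The Pandas UDF improvement is easy to see in code; a minimal sketch using the Spark 2.3 syntax (requires pyarrow; the column name and computation are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 25.0), (3, 7.5)], ["id", "price"])

# A vectorized (Pandas) UDF receives whole pandas.Series batches instead of
# one Python object per row, which is where the 3x-100x speedup comes from.
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def add_tax(price):
    return price * 1.21

df.withColumn("price_with_tax", add_tax("price")).show()
```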
Spark + AI Summit 2020 had over 35,000 attendees from 125 countries. The majority of participants were data engineers and data scientists. Apache Spark is now widely used with Python and SQL. Spark 3.0 includes improvements like adaptive query execution that accelerate queries by 2-18x. Delta Engine is a new high performance query engine for data lakes built on Spark 3.0.
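Adaptive query execution is opt-in in Spark 3.0 and is enabled with a configuration flag (a minimal sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# AQE re-optimizes the plan at runtime using actual shuffle statistics,
# e.g. coalescing shuffle partitions and switching join strategies.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```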
Solving Enterprise Data Challenges with Apache Arrow (Wes McKinney)
This document discusses Apache Arrow, an open-source library that enables fast and efficient data interchange and processing. It summarizes the growth of Arrow and its ecosystem, including new features like the Arrow C++ query engine and Arrow Rust DataFusion. It also highlights how enterprises are using Arrow to solve challenges around data interoperability, access speed, query performance, and embeddable analytics. Case studies describe how companies like Microsoft, Google Cloud, Snowflake, and Meta leverage Arrow in their products and platforms. The presenter promotes Voltron Data's enterprise subscription and upcoming conference to support business use of Apache Arrow.
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0 (Databricks)
Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell (Databricks)
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but is not limited to) a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to provide more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management by consolidating cache and execution memory [SPARK-10000].
The structured streaming upgrade to Apache Spark and how enterprises can bene... (Impetus Technologies)
The adoption of Apache Spark to analyze data in real-time is increasing with its ability to handle sophisticated analytical requirements and a common framework for streaming and batch. However, most organizations are also looking for "true streaming" features like lower latency and the ability to process out-of-order data.
Structured Streaming, a new high-level API, introduced in Apache Spark 2.0 promises these and other enhancements to the Spark approach to streaming data processing.
In this webinar, Anand Venugopal (Product Head) and other technical experts from StreamAnalytix, speak about the promising developments in Apache Spark 2.0 and how organizations can leverage structured streaming to make timely and accurate decisions and stay competitive.
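One of those "true streaming" features, handling out-of-order data via event-time watermarks, looks roughly like this in Structured Streaming (a minimal PySpark sketch; the Kafka broker, topic, and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS word",
                      "timestamp AS event_time"))

# The watermark tells Spark how late data may arrive: events more than
# 10 minutes behind the max observed event time are dropped, letting the
# engine finalize windows and bound its state.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"), "word")
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
```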
An Insider’s Guide to Maximizing Spark SQL Performance (Takuya UESHIN)
This document provides an overview of optimizing Spark SQL performance. It begins with introducing the speaker and their background with Spark. It then discusses reading query plans, interpreting them to understand optimizations, and tuning plans by pushing down filters, avoiding implicit casts, and other techniques. It emphasizes tracking query execution through the Spark UI to analyze jobs, stages and tasks for bottlenecks. The document aims to help understand how to maximize Spark SQL performance.
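A minimal sketch of that workflow: read the physical plan and check whether a filter was pushed into the scan (the dataset path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("explain-demo").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical dataset

# Comparing a column to a literal of the matching type lets the optimizer
# push the filter into the Parquet scan (look for PushedFilters in the plan).
df.filter(col("user_id") == 42).explain(True)

# An explicit or implicit cast on the column typically blocks pushdown;
# the predicate then runs after the scan instead of inside it.
df.filter(col("user_id").cast("string") == "42").explain(True)
```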
The document introduces Apache Kafka's Streams API for stream processing. Some key points covered include:
- The Streams API allows building stream processing applications without needing a separate cluster, providing an elastic, scalable, and fault-tolerant processing engine.
- It integrates with existing Kafka deployments and supports both stateful and stateless computations on data in Kafka topics.
- Applications built with the Streams API are standard Java applications that run on client machines and leverage Kafka for distributed, parallel processing and fault tolerance via state stores in Kafka.
Simplifying Big Data Applications with Apache Spark 2.0 (Spark Summit)
Apache Spark 2.0 is a major new release that simplifies the Spark API and improves performance. Some key points:
1) It remains highly compatible with Spark 1.x while building on lessons learned to simplify the API with over 2000 patches from 280 contributors.
2) It introduces structured APIs like DataFrames that allow Spark to optimize queries via whole-stage code generation, providing up to 10x performance gains.
3) It launches a new higher-level streaming API called Structured Streaming that allows developers to write streaming jobs that behave like batch jobs and integrate easily with static data and batch jobs.
Slides presented during the Strata SF 2019 conference. Explaining how Lyft is building a multi-cluster solution for running Apache Spark on kubernetes at scale to support diverse workloads and overcome challenges.
The annual review session by the AMIS team on their findings, interpretations and opinions regarding news, trends, announcements and roadmaps around Oracle's product portfolio.
Custom application development, according to Oracle, is primarily relevant for extending SaaS applications and creating customer experiences. The currently recommended approach for building graphical user interfaces (on web and mobile) is through low-code Visual Builder with high-code JET injections when required. An alternative low-code stack is available from Oracle in the form of APEX. This slide set discusses the above as well as ADF and Forms. It then introduces Digital Assistant, talks about the state and future of Java, and concludes with CI/CD and DevOps. As presented on November 5th, 2018 at AMIS HQ, Nieuwegein, The Netherlands.
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames (a minimal DataFrame sketch follows the topic list).
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
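The promised minimal sketch, asking the same question through both the DataFrame API and Spark SQL (the data is a stand-in for a real dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-df-intro").getOrCreate()

# A tiny DataFrame standing in for a real dataset (names are hypothetical).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("carol", 29)], ["name", "age"])

# DataFrame API.
df.filter(df.age > 30).select("name").show()

# The equivalent Spark SQL, run against a temporary view of the same data.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```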
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hr in). Basic knowledge of Python is highly recommended.
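In the spirit of the hands-on labs, a minimal scikit-learn sketch of training and evaluating a model on a popular dataset (the model choice and split are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a classic dataset and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```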
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
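Beyond Spring Boot, Phoenix tables are also easy to query from Python through the Phoenix Query Server (a minimal sketch using the phoenixdb driver; the URL and table name are hypothetical):

```python
import phoenixdb

# Connect to the Phoenix Query Server fronting the HBase cluster.
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# 'philly_crime' is a hypothetical Phoenix table fed by the NiFi flow.
cursor.execute(
    "SELECT dc_dist, COUNT(*) FROM philly_crime GROUP BY dc_dist")
for district, n in cursor.fetchall():
    print(district, n)
```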
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
While HBase is the most logical answer for use cases requiring random, real-time read/write access to big data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and with external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when investigating anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in current use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges encountered in scaling to support the world catalog, and how they have been overcome.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5 ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
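To give a feel for querying Presto from an application, a minimal sketch with the presto-python-client (host, catalog, and table names are hypothetical):

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080,
    user="analyst", catalog="hive", schema="default")
cursor = conn.cursor()

# A single Presto query can span connectors, e.g. joining a Hive table
# with a relational or NoSQL catalog; here a simple aggregation suffices.
cursor.execute("SELECT region, COUNT(*) FROM events GROUP BY region")
for region, n in cursor.fetchall():
    print(region, n)
```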
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
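Those "few lines of code" look roughly like this (a minimal sketch using the MLflow tracking API; the parameter and metric choices are illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=200).fit(X_train, y_train)

    # Each run's parameters, metrics, and packaged model are logged to the
    # tracking server (or to ./mlruns when none is configured).
    mlflow.log_param("C", C)
    mlflow.log_metric("accuracy",
                      accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```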
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's data platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users with both batch and real-time analytics. Our data platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our data platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive deep into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, such as identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
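A hedged sketch of the object detection building block, using a pretrained torchvision detector (the model choice, file name, and threshold are illustrative; a retail deployment would fine-tune on store-specific classes):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# A detector pretrained on COCO; shelves, products, and people would come
# from fine-tuning in a real retail system.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("shelf_camera_frame.jpg")  # hypothetical camera frame
with torch.no_grad():
    predictions = model([to_tensor(image)])[0]

for box, label, score in zip(predictions["boxes"],
                             predictions["labels"],
                             predictions["scores"]):
    if score > 0.8:  # illustrative confidence threshold
        print(label.item(), score.item(), box.tolist())
```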
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Keynote: AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Generative AI technology is a fascinating field that focuses on creating comp... (Nohoax Kanont)
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification (TrustArc)
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution, TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team that has issued 10,000+ privacy certifications, this framework integrates industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... (Zilliz)
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Keynote: Presentation on SASE Technology (Priyanka Aash)
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
How UiPath Discovery Suite supports identification of Agentic Process Automat... (DianaGray10)
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
UiPath Community Day Amsterdam: Code, Collaborate, Connect (UiPathCommunity)
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The spectrometer is easy to use, with a user-friendly interface, and its compact, lightweight design suits it to field work. Its rapid analysis capabilities can improve efficiency and productivity in research and quality-control workflows.
What’s new in Apache Spark 2.3
1. What's New in Apache Spark 2.3?
Xiao Li & Wenchen Fan
DataWorks Summit | SJ | Jun 2018
2. About Us
• Software Engineers at Databricks
• Apache Spark Committers and PMC Members
Xiao Li (GitHub: gatorsmile) and Wenchen Fan (GitHub: cloud-fan)
3. Databricks' Unified Analytics Platform
[Platform diagram: Collaborative Notebooks on top of the Databricks Runtime (Delta, SQL, Streaming), delivered as a cloud-native service and powered by Apache Spark]
• Unifies data engineers and data scientists
• Unifies data and AI technologies
• Eliminates infrastructure complexity
4. Major Features on Spark 2.3
Continuous Processing · Data Source API V2 · Stream-stream Join · Spark on Kubernetes · History Server V2 · UDF Enhancements · Various SQL Features · PySpark Performance · Native ORC Support · Stable Codegen · Image Reader · ML on Streaming
Around 1400 issues resolved!
5. Major Features on Spark 2.3 (section divider repeating the feature list above)
6. Structured Streaming
Introduced in Spark 2.0. Among Databricks customers:
• 10x more usage than DStream
• 100+ trillion records processed in production
Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. "Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark." SIGMOD '18.
7. Structured Streaming
Stream processing on the Spark SQL engine:
• fast, scalable, fault-tolerant
• rich, unified, high-level APIs that deal with complex data and complex workloads
• rich ecosystem of data sources, integrating with many storage systems
10. Continuous Processing
The only change you need!
Blog: "Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3" (http://ow.ly/e7lS30kob7X)
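The slide's code did not survive the transcript, but the change it refers to is the trigger. A minimal sketch (not from the slides) using the built-in rate source and console sink, which are among the sources/sinks supported in continuous mode in 2.3; everything else in the query is the same as in micro-batch mode:

import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("rate")                            // built-in test source, 1 row/sec
  .load()
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))   // the only change you need
  .start()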
12. Stream-stream Joins
Example: ad monetization. Join a stream of ad impressions with another stream of their corresponding user clicks.
Blog: "Introducing Stream-Stream Joins in Apache Spark 2.3" (ow.ly/oxpv30jbybJ)
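A minimal sketch of the pattern (impressions and clicks stand for the two streaming DataFrames; the column names impressionAdId, clickAdId, impressionTime, and clickTime are assumptions). The watermarks bound how long the engine buffers each side's state while waiting for a match:

import org.apache.spark.sql.functions.expr

val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark     = clicks.withWatermark("clickTime", "3 hours")

// Keep clicks that follow their impression within one hour.
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))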
13. Major Features on Spark 2.3 (section divider)
14. ML on Streaming
Model transformation/prediction on batch and streaming data with a unified API. After fitting a model or Pipeline, you can deploy it in a streaming job.
val streamOutput = transformer.transform(streamDF)
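Expanding the slide's one-liner into a runnable shape (the model path, input schema, source directory, and query name are hypothetical examples):

import org.apache.spark.ml.PipelineModel

val model = PipelineModel.load("/models/my-pipeline")   // previously fitted pipeline
val streamDF = spark.readStream
  .schema(inputSchema)                                  // assumed StructType of the input
  .parquet("/data/incoming")                            // hypothetical streaming source
val scored = model.transform(streamDF)                  // same call as in batch
scored.writeStream
  .format("memory")
  .queryName("predictions")
  .start()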
16. Image Support in Spark
Spark image data source [SPARK-21866]:
• Defines a standard API in Spark for reading images into DataFrames
• Deep learning frameworks can rely on it
val df = ImageSchema.readImages("/data/images")
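For reference, readImages lives in org.apache.spark.ml.image and returns a DataFrame with a single "image" struct column (origin, height, width, nChannels, mode, data); the directory path below is the slide's own example:

import org.apache.spark.ml.image.ImageSchema

val df = ImageSchema.readImages("/data/images")
df.select("image.origin", "image.height", "image.width").show()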
17. Major Features on Spark 2.3 (section divider)
18. PySpark
• Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
19. PySpark Performance
For single-node analytics, Spark offers faster runtime and greater scalability than PyData tooling (e.g., Pandas, NumPy), thanks to:
• multi-core parallelism
• lower memory consumption
• a better-pipelined execution engine
Blog: "Benchmarking Apache Spark on a single-node machine" (ow.ly/p1J530jORLw)
20. PySpark Performance
Python UDFs are much slower than Scala/Java UDFs, due to serialization costs and the Python interpreter. Spark 2.3 adds fast data serialization and execution using vectorized (Arrow-based) formats [SPARK-22216] [SPARK-21187]:
• Conversion from/to Pandas: df.toPandas() and createDataFrame(pandas_df), accelerated when spark.sql.execution.arrow.enabled is set to true (off by default in 2.3)
• Pandas UDFs: UDFs that use Pandas to process data, in two flavors: Scalar Pandas UDFs and Grouped Map Pandas UDFs
22. Major Features on Spark 2.3 (section divider)
24. Native Spark App in K8s
• New Spark scheduler backend
• The driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors, in response to requests from the Spark scheduler [K8S-34377] [SPARK-18278]
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as pluggable authorization and logging
25. Spark on Kubernetes
Features in Apache Spark 2.3:
• Supports Kubernetes 1.6 and up
• Supports cluster mode only
• Static resource allocation only
• Supports Java and Scala applications
• Can use container-local and remote dependencies that are downloadable
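A typical submission, adapted from the Spark 2.3 documentation (the API server address and container image are placeholders to fill in for your cluster):

bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar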
26. Major Features on Spark 2.3 (section divider)
27. Major Features on Spark 2.3 (section divider)
28. History Server Using a K-V Store
Stateless and non-scalable History Server V1:
• Requires parsing the event logs (that means: slow!)
• Requires holding app lists and UIs in memory (and then: OOM!)
[SPARK-18085] K-V store-based History Server V2:
• Stores app lists and UIs in a persistent K-V store (LevelDB)
• spark.history.store.path: once specified, LevelDB is used; otherwise an in-memory K-V store (still stateless, like V1)
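A minimal spark-defaults.conf sketch for the host running the history server (both paths are hypothetical examples):

# Where the event logs are read from, and where the LevelDB cache lives.
spark.history.fs.logDirectory   hdfs:///spark-event-logs
spark.history.store.path        /var/lib/spark/history-kvstore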
29. Major Features on Spark 2.3 (section divider)
30. What's Wrong With V1?
• Leaks upper-level APIs into the data source (RDD/SQLContext)
• Hard to extend the Data Source API for more optimizations
• Zero transaction guarantees in the write APIs
• Batch only
31. Features in Data Source V2
• Columnar scan support
• Flexible operator push-down framework
• Can report basic statistics and data partitioning
• Transactional write API
• Unified batch and streaming interfaces
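To make the shape of the new API concrete, here is a minimal read-only source sketched against the 2.3-era interfaces (the class names SimpleSource/SimpleFactory and the five-row data are invented for illustration; note these interfaces were still evolving and changed again in later releases):

import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class SimpleSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DataSourceReader {
      // The engine asks the source for its schema...
      override def readSchema(): StructType = StructType(Seq(StructField("i", IntegerType)))
      // ...and for one reader factory per partition (a single partition here).
      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
        Seq[DataReaderFactory[Row]](new SimpleFactory).asJava
    }
}

class SimpleFactory extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private var i = 0
    override def next(): Boolean = { i += 1; i <= 5 }  // emits five rows: 1..5
    override def get(): Row = Row(i)
    override def close(): Unit = ()
  }
}

With this in place, spark.read.format(classOf[SimpleSource].getName).load() returns a five-row DataFrame; the same reader interfaces back the unified streaming path.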
32. Major Features on Spark 2.3 (section divider)
33. UDF Enhancements
• [SPARK-19285] Implement UDF0 (a SQL UDF that takes 0 arguments)
• [SPARK-22945] Add Java UDF APIs in the functions object
• [SPARK-21499] Support creating SQL functions for Spark UDAFs (UserDefinedAggregateFunction)
• [SPARK-20586] [SPARK-20416] [SPARK-20668] Annotate UDFs with name, nullability, and determinism
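As one concrete example of the determinism annotation (the UDF bodies are invented), marking a UDF nondeterministic tells the optimizer not to collapse, reorder, or re-execute it:

import org.apache.spark.sql.functions.{col, udf}

val plusOne  = udf((x: Long) => x + 1)                                      // ordinary UDF
val randomId = udf(() => scala.util.Random.nextLong()).asNondeterministic() // new in 2.3

spark.range(5)
  .withColumn("y", plusOne(col("id")))
  .withColumn("rid", randomId())
  .show()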
34. Java UDF and UDAF in PySpark
• Register a Java UDF or UDAF as a SQL function and use it from PySpark.
35. Major Features on Spark 2.3 (section divider)
36. Stable Codegen
• [SPARK-22510] [SPARK-22692] Stabilize the codegen framework to avoid hitting the 64KB JVM bytecode limit on a single Java method and the Java compiler's constant-pool limit.
• [SPARK-21871] Turn off whole-stage codegen when the bytecode of the generated Java function is larger than spark.sql.codegen.hugeMethodLimit. (HotSpot's limit on method bytecode eligible for JIT compilation is 8K.)
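For illustration, the limit can be lowered at runtime; 8000 here is just an example value matching the HotSpot threshold mentioned above, not a recommended setting:

// Fall back to interpreted execution for generated functions above ~8K bytecode.
spark.conf.set("spark.sql.codegen.hugeMethodLimit", 8000)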
37. Major Features on Spark 2.3 (section divider)
38. Vectorized ORC Reader
• [SPARK-20682] New ORCFileFormat based on ORC 1.4.1: spark.sql.orc.impl = native / hive (default)
• [SPARK-16060] Vectorized ORC reader: spark.sql.orc.enableVectorizedReader = true (default) / false
• Suggestion: enable filter pushdown for ORC files when using the native reader
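Putting those settings together (the file path is a hypothetical example):

spark.conf.set("spark.sql.orc.impl", "native")                 // opt into the new reader
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")         // the suggested pushdown
val orcDF = spark.read.orc("/data/events.orc")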
39. Major Features on Spark 2.3 (section divider)
40. Performance
• [SPARK-21975] Histogram support in the cost-based optimizer
• Enhancements in the rule-based optimizer and planner (e.g., constant propagation): [SPARK-22489] [SPARK-22916] [SPARK-22895] [SPARK-20758] [SPARK-22266] [SPARK-19122] [SPARK-22662] [SPARK-21652]
• [SPARK-20331] Broaden support for partition-pruning predicate pushdown (e.g., date = 20161011 or date = 20161014)
• [SPARK-20822] Vectorized reader for the table cache
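A sketch of how the histogram support is exercised (the table and column names are invented; both flags are off by default in 2.3). Histograms are collected by ANALYZE and consumed by the cost-based optimizer:

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, region")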
41. API
• Improved ANSI SQL compliance and Dataset/DataFrame APIs
• More built-in functions [SPARK-20746]
• Better Hive compatibility:
  • [SPARK-20236] Support dynamic partition overwrite for data source tables
  • [SPARK-17729] Enable creating Hive bucketed tables
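For the partition-overwrite change, a small sketch (df and the table name are hypothetical): with the dynamic mode, overwrite replaces only the partitions present in df, rather than truncating the whole table first:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("events_by_date")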
42. Major Features on Spark 2.3 (recap of the feature list)
Around 1400 issues resolved!
44. Apache Spark 2.4+
• Built-in higher-order functions: transform, arrays_zip, array_remove, arrays_overlap, …
• Adaptive query planning: dynamically adjust the query plan according to the real (shuffled) data
• Deep learning integration: gang scheduling, barrier sync, fast data exchange, …
• ML pipelines in SparkR: port the Pipeline API to R
• Build improvements: Scala 2.12, Java 9, Hadoop 3.0, …