Apache Spark 2.4 comes packed with new features and improvements, including the new barrier execution mode, a flexible streaming sink, a native Avro data source, PySpark's eager evaluation mode, improved Kubernetes support, higher-order functions, Scala 2.12 support, and more.
Jump Start on Apache® Spark™ 2.x with Databricks (Databricks)
Apache Spark 2.0 and the subsequent Spark 2.1 and 2.2 releases laid the foundation for many new features and functionality. Its three main themes, easier, faster, and smarter, are pervasive in its unified and simplified high-level APIs for structured data.
In this introductory session, part lecture and part hands-on workshop, you'll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four hands-on labs
You will use Databricks Community Edition, which gives you unlimited free access to a ~6 GB Spark 2.x local-mode cluster. Along the way, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and Parquet files, read from these sources, and analyze datasets using the DataFrames/Datasets API and Spark SQL.
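A minimal Scala sketch of that lab flow (the dataset path and column names are hypothetical; any Spark 2.x cluster, including Community Edition, should behave the same):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ce-lab").getOrCreate()
import spark.implicits._

// Explore a dataset and apply a simple transformation.
val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/databricks-datasets/samples/people.csv")   // hypothetical path

val adults = people.filter($"age" >= 18)            // hypothetical column

// Save as Parquet and as a table, then read back and analyze with SQL.
adults.write.mode("overwrite").parquet("/tmp/people_parquet")
adults.write.mode("overwrite").saveAsTable("people_adults")
spark.sql("SELECT COUNT(*) AS n FROM people_adults").show()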
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with the Chrome or Firefox browser installed and at least 8 GB of RAM. Introductory or basic knowledge of Scala or Python is required, since the notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Using Apache Spark in the Cloud—A Devops Perspective, with Telmo Oliveira (Spark Summit)
Toon is a leading brand in the European smart energy market, currently expanding internationally, providing energy usage insights, eco-friendly energy management, and smart thermostat use for the connected home. As value-added services become ever more relevant in this market, we need to ensure that we can easily and safely onboard new tenants into our data platform. In this talk we guide you through a less-discussed side of using Spark in production: devops. We will speak about our journey from an on-premise cluster to a managed solution in the cloud. A lot of moving parts were involved: ETL flows, data sharing with third parties, and data migration to the new environment. Add to this the need for a multi-tenant environment, a revamped toolset, and the deployment of a live public-facing service. It's easy to find great examples of how Spark is used for data science purposes; on the data engineering side, we need to deploy production services, ensure data is cleaned, secured, and available, and keep the data science teams happy. We'd like to share some of the choices we made and some of the lessons learned from this (ongoing) transition.
Interoperating a Zoo of Data Processing Platforms Using Rheem, with Sebastian ... (Databricks)
We are witnessing a proliferation of big data, which has led to a zoo of data processing systems, each providing a different set of features. For example, Spark provides scalability for analytic tasks, while Java 8 Streams provides low latency. Furthermore, complex applications, such as ETL and ML, now require a mixture of platforms to perform tasks efficiently. In such complex data analytics pipelines, the use of multiple data processing systems is not only for performance reasons but also because of data diversity: datasets often natively reside in different data formats and storage engines. Unfortunately, developers are left alone with the challenging tasks of (1) choosing the right platform for their applications and (2) performing tedious and costly data migration and integration tasks to obtain results.
In this talk, we will present Rheem, an open source, scalable cross-platform system that frees developers from these burdens. Rheem provides an abstraction layer on top of Spark (and other processing platforms) with the aim of enabling cross-platform optimization and interoperability. It automatically selects the best data processing platform for a given task and also handles the cross-platform execution. In particular, we will discuss how Rheem allows Spark to work in tandem with other platforms in order to achieve higher performance. We will also show how easily a developer can write complex applications on top of Rheem that seamlessly use multiple data processing platforms according to the task at hand. Using Rheem, developers do not have to worry about integration or data migration between Spark and other platforms.
Tim Spann will present on learning Apache Spark. He is a senior solutions architect who previously worked as a senior field engineer and startup engineer. airis.DATA, where Spann works, specializes in machine learning and graph solutions using Spark, H2O, Mahout, and Flink on petabyte datasets. The agenda includes an overview of Spark, an explanation of MapReduce, and hands-on exercises to install Spark, run a MapReduce job locally, and build a project with IntelliJ and SBT.
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
This workshop will provide an introduction to Big Data Analytics using Apache Spark and Apache Zeppelin.
https://github.com/zeltovhorton/intro_spark_zeppelin_meetup
There will be a short lecture that includes an introduction to Spark and its components.
Spark is a unified framework for big data analytics. Spark provides one integrated API for use by developers, data scientists, and analysts to perform diverse tasks that would have previously required separate processing engines such as batch analytics, stream processing and statistical modeling. Spark supports a wide range of popular languages including Python, R, Scala, SQL, and Java. Spark can read from diverse data sources and scale to thousands of nodes.
The lecture will be followed by a demo. There will be a short lecture on Hadoop and how Spark and Hadoop interact and complement each other. You will learn how to move data into HDFS using Spark APIs, create a Hive table, explore the data with Spark and SQL, transform the data, and then issue some SQL queries. We will be using Scala and/or PySpark for the labs.
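A rough sketch of that flow, assuming a Hive-enabled Spark 2.x session; paths and table names are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("zeppelin-lab")
  .enableHiveSupport()                              // required to create Hive tables
  .getOrCreate()

// Move raw data into HDFS using the DataFrame writer.
val raw = spark.read.json("file:///data/raw.json")  // hypothetical local source
raw.write.mode("overwrite").parquet("hdfs:///user/demo/raw_parquet")

// Create a table over the data and query it with SQL.
spark.sql("CREATE TABLE IF NOT EXISTS demo_raw USING parquet LOCATION 'hdfs:///user/demo/raw_parquet'")
spark.sql("SELECT COUNT(*) FROM demo_raw").show()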
Speed up UDFs with GPUs using the RAPIDS Accelerator (Databricks)
The RAPIDS Accelerator for Apache Spark is a plugin that enables the power of GPUs to be leveraged in Spark DataFrame and SQL queries, improving the performance of ETL pipelines. User-defined functions (UDFs) in the query appear as opaque transforms and can prevent the RAPIDS Accelerator from processing some query operations on the GPU.
This presentation discusses how users can leverage the RAPIDS Accelerator UDF Compiler to automatically translate some simple UDFs to equivalent Catalyst operations that are processed on the GPU. The presentation also covers how users can provide a GPU version of Scala, Java, or Hive UDFs for maximum control and performance. Sample UDFs for each case will be shown along with how the query plans are impacted when the UDFs are processed on the GPU.
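As context, a hedged Scala sketch of the kind of UDF such a compiler can translate, next to the equivalent Catalyst expression. The configuration key is an assumption based on the plugin's documentation, not something stated in this abstract (spark-shell style, with spark in scope):

import org.apache.spark.sql.functions.udf
import spark.implicits._

// Assumed RAPIDS plugin setting enabling automatic UDF-to-Catalyst translation.
spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")

val df = spark.range(10).toDF("x")
val plusOne = udf((x: Long) => x + 1)    // opaque to Catalyst unless translated

df.select(plusOne($"x")).show()          // candidate for automatic translation
df.select(($"x" + 1).as("y")).show()     // pure Catalyst expression, GPU-ready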
This presentation covers Apache Spark performance and tuning takeaways, focusing on data structures, persistence, partitioning, event sourcing on transformations, and checkpointing.
Powering Custom Apps at Facebook using Spark Script Transformation (Databricks)
Script Transformation is an important and growing use-case for Apache Spark at Facebook. Spark’s script transforms allow users to run custom scripts and binaries directly from SQL and serves as an important means of stitching Facebook’s custom business logic with existing data pipelines.
Along with Spark SQL + UDFs, a growing number of our custom pipelines leverage Spark's script transform operator to run user-provided binaries for applications such as indexing, parallel training, and inference at scale. Spawning custom processes from the Spark executors introduces new production challenges, including external resource allocation/management, structured data serialization, and external process monitoring. (A minimal sketch of the operator follows the list below.)
In this session, we will talk about the improvements to Spark SQL (and the resource manager) to support running reliable and performant script transformation pipelines. This includes:
1) cgroup v2 containers for CPU, Memory and IO enforcement,
2) Transform jail for processes namespace management,
3) Support for complex types in Row format delimited SerDe,
4) Protocol Buffers for fast and efficient structured data serialization. Finally, we will conclude by sharing our results, lessons learned and future directions (e.g., transform pipelines resource over-subscription).
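For readers unfamiliar with the operator, a minimal sketch of a script transform in Spark SQL (Hive support is required in Spark 2.x; the table name and binary path are placeholders):

spark.sql("""
  SELECT TRANSFORM (key, value)
    USING '/usr/local/bin/my_binary'      -- user-provided executable
    AS (out_key STRING, out_value STRING)
  FROM src_table
""").show()

Each input row is serialized to the process's stdin and each line of stdout is parsed back into rows, which is exactly why serialization formats and external process monitoring become production concerns.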
Spark Summit EU talk by Stephan Kessler (Spark Summit)
This document summarizes a talk given by Stephan Kessler at the Spark Summit Europe 2016 about integrating business functionality and specialized engines into Apache Spark using SAP HANA Vora. Key points discussed include using currency conversion and time series query capabilities directly in Spark by pushing computations to the relevant data sources via Spark extensions. SAP HANA Vora allows moving parts of the Spark logical query plan to various data sources like HANA, graph and document stores to perform analysis close to the data.
This document discusses best practices for running Spark in production. It begins with introductions from the presenters and an overview of Spark deployment modes on YARN. The main topics covered are Spark security using Kerberos authentication and authorization, communication channels and encryption in YARN cluster mode, common issues, and performance tuning. For performance, it recommends choosing executor and task sizes to balance efficiency and overhead, and increasing task parallelism to mitigate data skew problems. The goal is to understand workload patterns and monitor behavior to effectively tune Spark for different situations.
Why Kubernetes as a container orchestrator is the right choice for running Apache Spark (DataWorks Summit)
Building and deploying an analytic service in the cloud is a challenge; a bigger challenge is maintaining the service. In a world where users gravitate towards a model in which cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, the relevance of containers and container orchestration is greater than ever.
Container orchestrators like Kubernetes can be used to deploy and distribute modules quickly, easily, and reliably. The intent of this talk is to share the experience of building such a service and deploying it on a Kubernetes cluster. In this talk, we will discuss all the requirements which an enterprise grade Hadoop/Spark cluster running on containers bring in for a container orchestrator.
This talk will cover in detail how the Kubernetes orchestrator can be used to meet all our needs for resource management, scheduling, networking and network isolation, volume management, etc. We will discuss how we replaced our home-grown container orchestrator, which used to manage the container lifecycle and resources in accordance with our requirements, with Kubernetes. We will also discuss the container-orchestrator features that help us deploy and patch thousands of containers, along with a list of features that we believe need improvement or could be enhanced.
Speaker
Rachit Arora, SSE, IBM
We present Spark Serving, a new Spark computing mode that enables users to deploy any Spark computation as a sub-millisecond-latency web service backed by any Spark cluster. Attendees will explore the architecture of Spark Serving and discover how to deploy services on a variety of cluster types like Azure Databricks, Kubernetes, and Spark Standalone. We will also demonstrate its simple yet powerful API for RESTful SparkSQL, SparkML, and deep-network deployment with the same API as batch and streaming workloads. In addition, we will explore the "dual architecture": HTTP on Spark. This architecture converts any Spark cluster into a distributed web client with the familiar and pipelinable SparkML API. These two contributions provide the fundamental Spark communication primitives to integrate and deploy any computation framework into the Spark ecosystem. We will explore how Microsoft has used this work to leverage Spark as a fault-tolerant microservice orchestration engine in addition to an ETL and ML platform, and will walk through two examples drawn from Microsoft's ongoing work on Cognitive Services composition and unsupervised object detection for snow leopard recognition.
1) Uber uses Spark and Hadoop to process large amounts of transportation data in real-time and batch. This includes building pipelines to ingest trip data from databases into a data warehouse within 1-2 hours.
2) Paricon is Uber's first Spark application which infers schemas from raw JSON data, converts it to Parquet format for faster querying, and validates the results. It processes over 15TB of data daily.
3) Future work includes building a SQL-based ETL platform on Spark, open sourcing SQL-on-Hadoop, and creating a machine learning platform with Spark and a real-time analytics system called Apollo using Spark Streaming.
Big Data visualization with Apache Spark and Zeppelin (prajods)
This presentation gives an overview of Apache Spark and explains the features of Apache Zeppelin (incubating). Zeppelin is an open source tool for data discovery, exploration, and visualization. It supports REPLs for shell, Spark SQL, Spark (Scala), Python, and Angular. This presentation was made on Big Data Day at the Great Indian Developer Summit, Bangalore, April 2015.
This document discusses end-to-end processing of 3.7 million telemetry events per second using a lambda architecture at Symantec. It provides an overview of Symantec's security data lake infrastructure, the telemetry data processing architecture using Kafka, Storm and HBase, tuning targets for the infrastructure components, and performance benchmarks for Kafka, Storm and Hive.
Apache Spark Usage in the Open Source Ecosystem (Databricks)
Apache Spark is an active member of the broad open source community beyond the Apache Software Foundation. Every day, thousands of users combine capabilities of Spark with other open source software to get their jobs done. This is not by chance: Spark has been designed to behave well with existing ecosystems; for example, PySpark is designed to work well with Pandas, NumPy, and other Python packages. In this talk we will present an analysis of libraries and open source tools that are commonly used along with Spark in the JVM, Python, and R ecosystems. Our quantitative results are based on the usage of thousands of Spark users. We will show Spark Summit attendees what the rest of their community finds useful to complement the power of Spark, and which parts of the Spark API are used in conjunction with the most popular open source libraries.
Sparklint is a tool for identifying and tuning inefficient Spark jobs across a cluster. It provides live views of application statistics or event-by-event analysis of historical logs for metrics like idle time, core usage, and task locality. The demo shows Sparklint analyzing access logs to count by IP, status, and verb over multiple jobs, tuning configuration settings to improve efficiency and reduce idle time. Future features may include more job detail, auto-tuning, and streaming optimizations. Sparklint is an open source project for contributing to Spark job monitoring and optimization.
This document discusses Microsoft's use of Apache YARN for scale-out resource management. It describes how YARN is used to manage vast amounts of data and compute resources across many different applications and workloads. The document outlines some limitations of YARN and Microsoft's contributions to address those limitations, including Rayon for improved scheduling, Mercury and Yaq for distributed scheduling, and work on federation to scale YARN across multiple clusters. It provides details on the implementation and evaluation of these contributions through papers, JIRAs, and integration into Apache Hadoop releases.
Apache Spark and Apache Ignite: Where Fast Data Meets the IoT, with Denis Magda (Databricks)
It's not enough to build a mesh of sensors or embedded devices to get more insights about the surrounding environment and optimize your production. Usually, your IoT solution needs to be capable of transferring enormous amounts of data to storage or the cloud, where the data has to be processed further. Quite often, the processing of the endless streams of data has to be done almost in real time so that you can react to the IoT subsystem's state accordingly, and in time.
During this session, see how to build a Fast Data solution that will receive endless streams from the IoT side and will be capable of processing the streams in real-time using Apache Ignite’s cluster resources. In particular, learn about data streaming to an Apache Ignite cluster from embedded devices and real-time data processing with Apache Spark.
Apache Spark 2.0 set the architectural foundations of structure in Spark, unified high-level APIs, structured streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of soon-to-be-released Spark 2.3 features:
• New deployment mode: Kubernetes scheduler backend
• PySpark performance and enhancements
• New structured streaming execution engine: continuous processing
• Data source v2 APIs for both structured streaming and Spark SQL
• ML on structured streaming
• Image reader
• Stable codegen engine
• Spark History Server V2
• Native ORC support
• Vectorized ORC and SQL cache readers
• Stream-stream Join
• UDF enhancements
• Various SQL enhancements
Speakers
Xiao Li, Software Engineer, Databricks
Wenchen Fan, Software Engineer, Databricks
Apache CarbonData & Spark Meetup
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it supports queries on a single table with 3 PB of data (more than 5 trillion records) with response times under 3 seconds!
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
Spark + AI Summit 2020 had over 35,000 attendees from 125 countries. The majority of participants were data engineers and data scientists. Apache Spark is now widely used with Python and SQL. Spark 3.0 includes improvements like adaptive query execution that accelerate queries by 2-18x. Delta Engine is a new high performance query engine for data lakes built on Spark 3.0.
This document summarizes new features in Apache Spark 2.3, including continuous processing mode for structured streaming, stream-stream joins, running Spark applications on Kubernetes, improved PySpark performance through vectorized UDFs and Pandas integration, and Databricks Delta for reliability and performance in data lakes. The author, an Apache Spark committer and PMC member, provides overviews and code examples of these features.
Databricks Meetup @ Los Angeles Apache Spark User Group (Paco Nathan)
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
Web Scale Reasoning and the LarKC Project (Saltlux Inc.)
The LarKC project aims to build an integrated pluggable platform for large-scale reasoning. It supports parallelization, distribution, and remote execution. The LarKC platform provides a lightweight core that gives standardized interfaces for combining plug-in components, while the real work is done in the plug-ins. There are three types of LarKC users: those building plug-ins, configuring workflows, and using workflows.
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0 (Databricks)
Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming, and real-time applications. The Dataset API, now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
The structured streaming upgrade to Apache Spark and how enterprises can benefit (Impetus Technologies)
The adoption of Apache Spark to analyze data in real-time is increasing with its ability to handle sophisticated analytical requirements and a common framework for streaming and batch. However, most organizations are also looking for "true streaming" features like lower latency and the ability to process out-of-order data.
Structured Streaming, a new high-level API, introduced in Apache Spark 2.0 promises these and other enhancements to the Spark approach to streaming data processing.
In this webinar, Anand Venugopal (Product Head) and other technical experts from StreamAnalytix, speak about the promising developments in Apache Spark 2.0 and how organizations can leverage structured streaming to make timely and accurate decisions and stay competitive.
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell (Databricks)
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but is not limited to) a type-safe API called Dataset, built on top of DataFrames, that leverages all the work in Project Tungsten for more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999]; adaptive query execution [SPARK-9850]; and unified memory management by consolidating cache and execution memory [SPARK-10000].
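A short sketch of the typed Dataset API (written against the Spark 2.x entry point for brevity; the case class and path are hypothetical):

case class Person(name: String, age: Long)

import spark.implicits._
val people = spark.read.json("/tmp/people.json").as[Person]  // typed view over a DataFrame
people.filter(_.age > 21).map(_.name).show()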
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia (Databricks)
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.
Overview of Apache Spark 2.3: What's New? with Sameer Agarwal (Databricks)
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
The document summarizes the major new features in Apache Spark 2.3, including continuous processing for low-latency streaming, Spark running on Kubernetes, improved PySpark performance through Pandas UDFs, machine learning capabilities on streaming data, and image reading support. Key updates are continuous processing for streaming with ~1 ms latency and at-least-once semantics, Spark's ability to run natively on Kubernetes clusters, and Pandas UDFs in PySpark providing a 3x to 100x performance boost over row-at-a-time UDFs. The speaker is the Spark 2.3 release manager and discusses these topics at the Spark Summit on June 6, 2018.
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform (Yao Yao)
Yao Yao Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
What's New in Apache Spark 2.3 & Why Should You Care (Databricks)
The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support.
This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
The document introduces Apache Kafka's Streams API for stream processing. Some key points covered include:
- The Streams API allows building stream processing applications without needing a separate cluster, providing an elastic, scalable, and fault-tolerant processing engine.
- It integrates with existing Kafka deployments and supports both stateful and stateless computations on data in Kafka topics.
- Applications built with the Streams API are standard Java applications that run on client machines and leverage Kafka for distributed, parallel processing and fault tolerance via state stores in Kafka.
Spark is a general-purpose computational framework that provides more flexibility than MapReduce. It leverages distributed memory and uses directed acyclic graphs for data-parallel computations while retaining MapReduce properties like scalability, fault tolerance, and data locality. Cloudera has embraced Spark and is working to integrate it into its Hadoop ecosystem through projects like Hive on Spark and optimizations in Spark Core, MLlib, and Spark Streaming. Cloudera positions Spark as the future general-purpose framework for Hadoop, while other specialized frameworks may still be needed for tasks like SQL, search, and graphs.
Building data pipelines for modern data warehouse with Apache® Spark™ and .NET (Michael Rys)
This presentation shows how you can build solutions that follow the modern data warehouse architecture and introduces the .NET for Apache Spark support (https://dot.net/spark, https://github.com/dotnet/spark)
SF Big Analytics_20190612: Scaling Apache Spark on Kubernetes at Lyft (Chester Chen)
Talk 1. Scaling Apache Spark on Kubernetes at Lyft
As part of its mission Lyft invests heavily in open source infrastructure and tooling. At Lyft, Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark at Lyft has evolved to solve both machine learning and large-scale ETL workloads. By combining the flexibility of Kubernetes with the data processing power of Apache Spark, Lyft is able to drive ETL data processing to a different level. In this talk, we will discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Topics include:
- Key traits of Apache Spark on Kubernetes
- A deep dive into Lyft's multi-cluster setup and operationality to handle petabytes of production data
- How Lyft extends and enhances Apache Spark to support capabilities such as Spark pod life-cycle metrics and state management, resource prioritization, and queuing and throttling
- Dynamic job scale estimation and runtime dynamic job configuration
- How Lyft powers internal data scientists, business analysts, and data engineers via a multi-cluster setup
Speaker: Li Gao
Li Gao is the tech lead of the cloud-native Spark compute initiative at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
1. Apache Spark 2.4 and Beyond
Xiao Li, Wenchen Fan
Mar 2019 @ Strata Data Conf
2. About Us
• Software Engineers at Databricks
• Apache Spark Committers and PMC Members
Xiao Li (Github: gatorsmile) Wenchen Fan (Github: cloud-fan)
3. Databricks Customers Across Industries: Financial Services, Healthcare & Pharma, Media & Entertainment, Technology, Public Sector, Retail & CPG, Consumer Services, Energy & Industrial IoT, Marketing & AdTech, Data & Analytics Services
4. Databricks Unified Analytics Platform (diagram): Databricks Workspace (notebooks, dashboards, APIs, jobs, models; end-to-end ML lifecycle) on top of the Databricks Runtime (Databricks Delta, ML frameworks) and the Databricks Cloud Service; reliable & scalable, simple & integrated
11. Major Features in Spark 2.4: Higher-order Functions, Structured Streaming, Built-in Source Improvements, Spark on Kubernetes, PySpark Improvements, Native Avro Support, Image Source, Barrier Execution, Scala 2.12 (beta support), Various SQL Features
12. Survey Key Findings: The AI Dilemma
Investment in AI is growing quickly; however, very few are succeeding. A major contributing factor to this AI dilemma: many different machine learning frameworks and tools.
https://databricks.com/CIO-survey-report
13. CIO Survey Key Findings: Unified Analytics Enables AI Success
Unifying data science & engineering. Benefits expected from a unified approach to data and AI:
• Increased operational efficiency
• More effective decision making
• Accelerated time to market
• Improved security
• Increased innovation
• Improved customer experience
• Increased competitive advantage
• Increased employee satisfaction
• Increased customer engagement
• Product/service transformation
• Topline growth
15. Project Hydrogen: Spark + AI
Adds gang scheduling to Apache Spark: a distributed DL job is embedded as a Spark stage to simplify the distributed training workflow. [SPARK-24374]
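A minimal sketch of the resulting barrier API in Spark 2.4 (spark-shell style, with spark in scope):

import org.apache.spark.BarrierTaskContext

val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // All tasks of a barrier stage launch together and can synchronize here,
  // which is what lets a distributed DL job run as a single Spark stage.
  ctx.barrier()
  iter
}.collect()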
16. Major Features in Spark 2.4 (section divider: the feature grid from slide 11, repeated)
17. Flexible Streaming Sink
[SPARK-24565] Exposes the output rows of each micro-batch as a DataFrame.
foreachBatch(f: (Dataset[T], Long) => Unit)
• Scala/Java/Python APIs in DataStreamWriter
• Reuse existing batch data sources
• Write to multiple locations
• Apply additional DataFrame operations
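A short Scala sketch of the API (the rate source is a built-in test source; sink paths are hypothetical; spark is assumed in scope):

import org.apache.spark.sql.DataFrame

val stream = spark.readStream.format("rate").load()

val query = stream.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()
    batchDF.write.mode("append").parquet("/tmp/sink/parquet")  // location 1
    batchDF.write.mode("append").json("/tmp/sink/json")        // location 2
    batchDF.unpersist()
  }
  .start()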
20. Major Features in Spark 2.4 (section divider: feature grid repeated)
21. Parquet
Updated from 1.8.2 to 1.10.0 [SPARK-23972]:
• PARQUET-1025 - Support new min-max statistics in parquet-mr
• PARQUET-225 - INT64 support for delta encoding
• PARQUET-1142 Enable parquet.filter.dictionary.enabled by default.
Predicate pushdown
• STRING [SPARK-23972] [20x faster]
• Decimal [SPARK-24549]
• Timestamp [SPARK-24718]
• Date [SPARK-23727]
• Byte/Short [SPARK-24706]
• StringStartsWith [SPARK-24638]
• IN [SPARK-17091]
22. ORC
The native vectorized ORC reader is now GA!
• Native ORC reader on by default [SPARK-23456]
• ORC updated from 1.4.1 to 1.5.2 [SPARK-24576]
• ORC filter push-down on by default [SPARK-21783]
• Native ORC reader used for Hive serde tables by default [SPARK-22279]
• Avoid creating a reader for all ORC files [SPARK-25126]
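The relevant switches, as a sketch (these are real Spark 2.4 configuration keys; the values shown are, to our understanding, the 2.4 defaults):

spark.conf.set("spark.sql.orc.impl", "native")                // native vectorized reader
spark.conf.set("spark.sql.orc.filterPushdown", "true")        // predicate pushdown
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")  // native reader for Hive ORC tables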
23. Major Features in the Upcoming Spark 2.4 (section divider: feature grid repeated)
24. Higher-order Functions
Transformations on complex objects (arrays, maps, and structs) inside columns.
tbl_nested
|-- key: long (nullable = false)
|-- values: array (nullable = false)
|    |-- element: long (containsNull = false)
UDFs? They require expensive data serialization.
25. Higher-order Functions
1) Check for element existence
SELECT EXISTS(values, e -> e > 30) AS v
FROM tbl_nested;
2) Transform an array
SELECT TRANSFORM(values, e -> e * e) AS v
FROM tbl_nested;
26. Higher-order Functions
3) Filter an array
SELECT FILTER(values, e -> e > 30) AS v
FROM tbl_nested;
4) Aggregate an array
SELECT REDUCE(values, 0, (value, acc) -> value + acc) AS sum
FROM tbl_nested;
Ref: Databricks blog, http://dbricks.co/2rUKQ1A
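To try these, the slides' tbl_nested can be recreated as a temp view (spark-shell style sketch; note that open-source Spark 2.4 spells the fold function AGGREGATE, while the REDUCE spelling above follows the Databricks blog):

import spark.implicits._

Seq((1L, Seq(10L, 20L, 30L, 40L))).toDF("key", "values")
  .createOrReplaceTempView("tbl_nested")

spark.sql("SELECT TRANSFORM(values, e -> e * e) AS v FROM tbl_nested").show()
spark.sql("SELECT FILTER(values, e -> e > 30) AS v FROM tbl_nested").show()
spark.sql("SELECT AGGREGATE(values, 0L, (acc, e) -> acc + e) AS sum FROM tbl_nested").show()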
27. Built-in Functions
[SPARK-23899] New or extended built-in functions for ArrayType and MapType columns:
• 26 functions for arrays: transform, filter, reduce, array_distinct, array_intersect, array_union, array_except, array_join, array_max, array_min, ...
• 3 functions for maps: map_from_arrays, map_from_entries, map_concat
Blog: Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4, https://t.co/p1TRRtabJJ
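A few of the new array functions through the Scala DataFrame API, as a quick sketch (spark-shell style; the data is made up):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((Seq(1, 2, 2, 3), Seq(3, 4))).toDF("a", "b")
df.select(
  array_distinct($"a"),
  array_intersect($"a", $"b"),
  array_union($"a", $"b"),
  array_except($"a", $"b"),
  array_max($"a")
).show(false)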
28. Major Features in Spark 2.4 (section divider: feature grid repeated)
29. Native Spark App on K8s
New Spark scheduler backend:
• PySpark support [SPARK-23984]
• SparkR support [SPARK-24433]
• Client-mode support [SPARK-23146]
• Support for mounting K8s volumes [SPARK-23529]
Blog: What's New for Apache Spark on Kubernetes in the Upcoming Apache Spark 2.4 Release, https://t.co/uUpdUj2Z4B
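For orientation, a hedged sketch of submitting against the Kubernetes scheduler backend (the API server address and container image are placeholders; the jar path follows the stock 2.4 example layout):

bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar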
30. Major Features in Spark 2.4 (section divider: feature grid repeated)
32. Safe Harbor Statement
This presentation may contain projections or other forward-looking statements regarding the upcoming release (Apache Spark 3.0). The statements are intended to outline our general direction and are for information purposes only. They are not a commitment to deliver code or functionality. The development, release, and timing of any feature or functionality described for Apache Spark remain at the sole discretion of the ASF and the Apache Spark PMC.
33. What's Next? Data Source APIs, Spark on Kubernetes, PySpark Usability, Scala 2.12, Various SQL Features, GPU-aware Scheduling, Adaptive Execution, Hadoop 3.x, Spark Graph, mlflow
34. What's Next? (roadmap grid repeated from slide 33)
35. Project Hydrogen: Spark + AI
GPU-aware Scheduling
• GPUs are widely used for accelerating specialized workloads, e.g., deep learning and signal processing
37. The ML Lifecycle is Manual, Inconsistent, and Disconnected
• Data Prep: low-level integrations for data and ML; difficult to track the data used for a model
• Build Model: ad hoc approaches to tracking experiments; very hard to reproduce experiments
• Deploy Model: multiple tightly coupled deployment options; a different monitoring approach for each framework
38. What is MLflow?
An open source platform to manage ML development:
• Lightweight APIs & abstractions that work with any ML library
• Designed to be useful for 1 user or 1000+ person orgs
• Runs the same way anywhere (e.g., any cloud)
Key principle: "open interface" APIs that work with any existing ML library, app, deployment tool, etc.
39. MLflow Components
• Tracking: record and query experiments (code, params, results, etc.)
• Projects: code packaging for reproducible runs on any platform
• Models: model packaging and deployment to diverse environments
40. What's Next for MLflow?
• MLflow Projects: Docker-based project environment specification (0.9); x-coordinate logging for metrics & batched logging (1.0); packaging projects with build steps (1.0+)
• MLflow Models: custom model logging in Python, R, and Java (0.8, 0.9, 1.0); better environment isolation when loading models (1.0); logging the schema of models (1.0+)
• MLflow Tracking: SQL database backend for scaling the tracking server (0.9); UI scalability improvements (0.8, 0.9, etc.); x-coordinate logging for metrics & batched logging (1.0); fluent API for Java and Scala (1.0)
42. What's Next? (roadmap grid repeated from slide 33)
43. Challenges in Existing Graph Libraries
• GraphX: not DataFrame-based; not actively maintained
• GraphFrames: limited graph pattern matching; semantically weak graph data model
45. Spark Graph
Given a single property graph data model and a Cypher query, Spark returns a tabular result (DataFrame).
46. What's Next? (roadmap grid repeated from slide 33)
47. Data Source API V2
• Unified API for batch and streaming
• Flexible API for high-performance implementations
• Flexible API for metadata management
48. JDBC source with data source v1
df.write.format("jdbc")
.option("url", ...)
.option("dbtable", ...)
.option("driver", ...)
.save()
1. Specify the remote catalog info for each operation
49. JDBC source with data source v1
CREATE TABLE tab1(...) USING jdbc OPTIONS("url" ..., "dbtable" ..., ...)
CREATE TABLE tab2(...) USING jdbc OPTIONS("url" ..., "dbtable" ..., ...)
SELECT * FROM tab1 join tab2
INSERT INTO tab1 SELECT ...
2. Register each table before usage
50. JDBC source with data source v2
New: Register the catalog before usage
spark-defaults.conf
spark.sql.catalog.jdbcCatalogName my.jdbc.v2.impl
spark.sql.catalog.jdbcCatalogName.url …
spark.sql.catalog.jdbcCatalogName.driver …
51. JDBC source with data source v2
• No need to register the tables.
• Access the tables using n-part name.
• DDL/DML support.
CREATE TABLE jdbcCatalogName.db1.t1(...)
ALTER TABLE jdbcCatalogName.db1.t2 CHANGE COLUMN ...
SELECT * FROM jdbcCatalogName.db2.t3
INSERT INTO jdbcCatalogName.db3.t4 SELECT ...
52. What's Next? (roadmap grid repeated from slide 33)
54. Adaptive Query Processing
Based on statistics of the materialized plan nodes, re-optimize the execution plan of the remaining query:
• Self-tuning the number of reducers
• Adaptive join strategies
• Automatic skew join handling
Intel blog: https://tinyurl.com/y3rjwcos
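As it eventually shipped in Spark 3.x, adaptive execution is driven by a handful of switches; a sketch using the 3.x configuration names (they are not part of the 2.4 release described here):

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  // self-tunes reducer count
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            // automatic skew handling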
56. What's Next? (roadmap grid repeated from slide 33)
57. Native Spark App on K8s
• Support for using a pod template to customize the driver and executor pods
• Dynamic resource allocation and external shuffle service
• Better support for local application dependencies on client machines
• Driver resilience for Spark Streaming
• Better scheduling support
59. Other targets in Apache Spark 3.0
• Hadoop 3.x support
• Hive execution updated from 1.2.1 to 2.3.4
• Scala 2.12 GA
• Better ANSI SQL compliance
• PySpark usability
Please follow the announcements at Spark + AI Summit @ SF.
60. What's Next? (roadmap grid repeated from slide 33)
62. Spark + AI Summit (databricks.com/sparkaisummit): 5000+ attendees
Tracks:
• Apache Spark™: use cases, research, technical deep dives
• AI: productionizing ML, deep learning, cloud hardware
• Fields: data science, data engineering, enterprise
Attendees:
• Practitioners: data scientists, data engineers, analysts, architects
• Leaders: engineering management, VPs, heads of analytics & data, CxOs
63. Nike: Enabling Data Scientists to Bring Their Models to Market
Facebook: Vectorized Query Execution in Apache Spark at Facebook
Tencent: Large-scale Malicious Domain Detection with Spark AI
IBM: In-memory Storage Evolution in Apache Spark
Capital One: Apache Spark and Sights at Speed: Streaming, Feature Management and Execution
Apple: Making Nested Columns as First Citizen in Apache Spark SQL
eBay: Managing Apache Spark Workload and Automatic Optimizing
Google: Validating Spark ML Jobs
HP: Apache Spark for Cyber Security in a Big Company
Microsoft: Apache Spark Serving: Unifying Batch, Streaming and RESTful Serving
ABSA Group: A Mainframe Data Source for Spark SQL and Streaming
Facebook: An Efficient Facebook-scale Shuffle Service
IBM: Make Your PySpark Data Fly with Arrow!
Facebook: Distributed Scheduling Framework for Apache Spark
Zynga: Automating Predictive Modeling at Zynga with PySpark
World Bank: Using Crowdsourced Images to Create Image Recognition Models and NLP to Augment Global Trade Indicator
JD.com: Optimizing Performance and Computing Resources
Microsoft: Azure Databricks with R: Deep Dive
Airbnb: Apache Spark at Airbnb
Netflix: Migrating to Apache Spark at Netflix
Microsoft: Infrastructure for Deep Learning in Apache Spark
Intel: Game Playing Using AI on Apache Spark
Facebook: Scaling Apache Spark @ Facebook
Lyft: Scaling Apache Spark on K8S at Lyft
Uber: Using Spark MLlib Models in a Production Training and Serving Platform
Apple: Bridging the Gap between Datasets and DataFrames
Salesforce: The Rule of 10,000 Spark Jobs
Target: Lessons in Linear Algebra at Scale with Apache Spark
Nationwide: Deploying Enterprise-Scale Deep Learning in Actuarial Modeling at Nationwide
Workday: Lessons Learned Using Apache Spark
64. Thank you
Xiao Li (lixiao@databricks.com)
Wenchen Fan (wenchen@databricks.com)