The document provides an outline for the Spark Camp @ Strata CA tutorial. The morning session will cover introductions and getting started with Spark, an introduction to MLlib, and exercises on working with Spark on a cluster and notebooks. The afternoon session will cover Spark SQL, visualizations, Spark streaming, building Scala applications, and GraphX examples. The tutorial will be led by several instructors from Databricks and include hands-on coding exercises.
You've seen the basic two-stage example Spark programs, and now you're ready to move on to something larger. I'll go over lessons I've learned for writing efficient Spark programs, from design patterns to debugging tips.
The slides are largely just talking points for a live presentation, but hopefully you can still make sense of them for offline viewing as well.
In this talk from the 2015 Spark Summit East, the lead developer of Spark Streaming, @tathadas, discusses the state of Spark Streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data applications are being written. It is being rapidly adopted by companies across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detection, etc. These companies are adopting Spark Streaming mainly because: – its simple, declarative, batch-like API makes large-scale stream processing accessible to non-scientists; – its unified API and single processing engine (the Spark core engine) allow a single cluster and a single set of operational processes to cover the full spectrum of use cases – batch, interactive, and stream processing; – its stronger, exactly-once semantics make it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, this presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
Spark after Dark by Chris Fregly of Databricks (Data Con LA)
Spark After Dark is a mock dating site that uses the latest Spark libraries, AWS Kinesis, Lambda Architecture, and Probabilistic Data Structures to generate dating recommendations.
There will be 5+ demos covering everything from basic data ETL to advanced data processing including Alternating Least Squares Machine Learning/Collaborative Filtering and PageRank Graph Processing.
There is heavy emphasis on Spark Streaming and AWS Kinesis.
Watch the video here
https://www.youtube.com/watch?v=g0i_d8YT-Bs
Strata NYC 2015: What's new in Spark Streaming (Databricks)
Spark Streaming allows processing of live data streams at scale. Recent improvements include:
1) Enhanced fault tolerance through a write-ahead log and replay of unprocessed data on failure.
2) Dynamic backpressure to automatically adjust ingestion rates and ensure stability.
3) Visualization tools for debugging and monitoring streaming jobs.
4) Support for streaming machine learning algorithms and integration with other Spark components.
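To make points (1) and (2) concrete, here is a minimal sketch (not taken from the talk) of a StreamingContext with the write-ahead log and backpressure settings turned on; the app name, batch interval, socket source, and checkpoint path are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumed configuration values; only the two "set" lines relate to points (1) and (2)
val conf = new SparkConf()
  .setAppName("ResilientStreamingSketch")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // (1) write-ahead log so unprocessed data can be replayed on failure
  .set("spark.streaming.backpressure.enabled", "true")           // (2) dynamic backpressure on the ingestion rate

val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/streaming-checkpoints")   // the WAL and recovered state need a checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)  // assumed source
lines.count().print()

ssc.start()
ssc.awaitTermination()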
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but is not limited to) a type-safe API called Dataset, built on top of DataFrames, that leverages all the work in Project Tungsten for more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999]; adaptive query execution [SPARK-9850]; and unified memory management that consolidates cache and execution memory [SPARK-10000].
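As a rough illustration (not from the webcast) of what the type-safe Dataset API looks like in 1.6, assuming a SQLContext named sqlContext and a people.json file:

case class Person(name: String, age: Long)

import sqlContext.implicits._                   // brings in the encoders needed by .as[Person]

val df = sqlContext.read.json("people.json")    // untyped DataFrame
val people = df.as[Person]                      // typed Dataset[Person], checked at compile time

val adults = people.filter(_.age >= 18)         // a plain Scala lambda instead of a column expression
adults.show()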
A really really fast introduction to PySpark - lightning fast cluster computi... (Holden Karau)
Apache Spark is a fast and general engine for distributed computing & big data processing with APIs in Scala, Java, Python, and R. This tutorial will briefly introduce PySpark (the Python API for Spark) with some hands-on exercises combined with a quick introduction to Spark's core concepts. We will cover the obligatory wordcount example which comes with every big-data tutorial, as well as discuss Spark's unique methods for handling node failure and other relevant internals. Then we will briefly look at how to access some of Spark's libraries (like Spark SQL & Spark ML) from Python. While Spark is available in a variety of languages, this workshop will be focused on using Spark and Python together.
Building a modern Application with DataFrames (Spark Summit)
The document discusses a meetup about building modern applications with DataFrames in Spark. It provides an agenda for the meetup that includes an introduction to Spark and DataFrames, a discussion of the Catalyst internals, and a demo. The document also provides background on Spark, noting its open source nature and large-scale usage by many organizations.
The document provides an agenda and overview for a Big Data Warehousing meetup hosted by Caserta Concepts. The meetup agenda includes an introduction to SparkSQL with a deep dive on SparkSQL and a demo. Elliott Cordo from Caserta Concepts will provide an introduction and overview of Spark as well as a demo of SparkSQL. The meetup aims to share stories in the rapidly changing big data landscape and provide networking opportunities for data professionals.
Introduction to Apache Spark Developer Training (Cloudera, Inc.)
Apache Spark is a next-generation processing engine optimized for speed, ease of use, and advanced analytics well beyond batch. The Spark framework supports streaming data and complex, iterative algorithms, enabling applications to run 100x faster than traditional MapReduce programs. With Spark, developers can write sophisticated parallel applications for faster business decisions and better user outcomes, applied to a wide variety of architectures and industries.
Learn what Apache Spark is and how it compares to Hadoop MapReduce, how to filter, map, reduce, and save Resilient Distributed Datasets (RDDs), who is best suited to attend the course and what prior knowledge you should have, and the benefits of building Spark applications as part of an enterprise data hub.
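As a small, hedged sketch of the filter/map/reduce/save operations mentioned above (the file paths are placeholders):

val lines = sc.textFile("access.log")             // load a text file as an RDD of lines
val errors = lines.filter(_.contains("ERROR"))    // filter: keep only error lines
val lengths = errors.map(_.length)                // map: transform each line to its length
val totalChars = lengths.reduce(_ + _)            // reduce: aggregate with an action
errors.saveAsTextFile("errors-out")               // save the filtered RDD back to storage
println(s"total characters in error lines: $totalChars")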
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark, which makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources including HDFS, Hive, JSON, and S3.
Spark Summit EU 2015: Spark DataFrames: Simple and Fast Analysis of Structure... (Databricks)
A technical overview of Spark’s DataFrame API. First, we’ll review the DataFrame API and show how to create DataFrames from a variety of data sources such as Hive, RDBMS databases, or structured file formats like Avro. We’ll then give example user programs that operate on DataFrames and point out common design patterns. The second half of the talk will focus on the technical implementation of DataFrames, such as the use of Spark SQL’s Catalyst optimizer to intelligently plan user programs, and the use of fast binary data structures in Spark’s core engine to substantially improve performance and memory use for common types of operations.
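A hedged sketch of creating DataFrames from a few of the sources the talk mentions; the table name, JDBC URL, file paths, and the external spark-avro package are assumptions, not material from the talk:

import org.apache.spark.sql.functions.desc

val fromHive = sqlContext.table("warehouse.events")   // an existing Hive table

val fromJdbc = sqlContext.read                        // an RDBMS table over JDBC
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/sales")
  .option("dbtable", "orders")
  .load()

val fromAvro = sqlContext.read                        // structured files via the external spark-avro package
  .format("com.databricks.spark.avro")
  .load("events.avro")

// DataFrame operations then compose like the example programs in the talk:
fromHive.groupBy("country").count().orderBy(desc("count")).show(10)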
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa... (Databricks)
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spark and Scala
Talk given by Reynold Xin at Scala Days SF 2015
In this talk, Reynold talks about the underlying techniques used to achieve high performance sorting using Spark and Scala, among them sun.misc.Unsafe, exploiting cache locality, and high-level resource pipelining.
The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
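A minimal sketch of that pattern (file names and columns are assumptions): load JSON and Parquet into DataFrames, register them as temporary tables, and then query them either with SQL or with the equivalent DataFrame functions:

import org.apache.spark.sql.functions.desc

val logs  = sqlContext.read.json("logs.json")
val users = sqlContext.read.parquet("users.parquet")

logs.registerTempTable("logs")      // pre-2.0 name; later versions use createOrReplaceTempView
users.registerTempTable("users")

val topUsers = sqlContext.sql("""
  SELECT u.name, COUNT(*) AS hits
  FROM logs l JOIN users u ON l.user_id = u.id
  GROUP BY u.name
  ORDER BY hits DESC
  LIMIT 10
""")

// the same query expressed with DataFrame functions instead of SQL
val topUsersDF = logs.join(users, logs("user_id") === users("id"))
  .groupBy(users("name")).count()
  .orderBy(desc("count")).limit(10)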
Simplifying Big Data Analytics with Apache Spark (Databricks)
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, enabling faster iterative jobs. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.
ETL with SPARK - First Spark London meetup (Rafal Kwasny)
The document discusses how Spark can be used to supercharge ETL workflows by running them faster and with less code compared to traditional Hadoop approaches. It provides examples of using Spark for tasks like sessionization of user clickstream data. Best practices are covered like optimizing for JVM issues, avoiding full GC pauses, and tips for deployment on EC2. Future improvements to Spark like SQL support and Java 8 are also mentioned.
Introduction to Stateful Stream Processing with Apache Flink (Konstantinos Kloudas)
Kostas Kloudas presented on stateful stream processing with Apache Flink. He discussed how Flink handles state management, fault tolerance, and time semantics to allow for continuous and accurate processing of streaming data. Flink embeds local state with keyed streams, takes consistent snapshots of distributed state, and uses watermarks to process events in event time to produce correct results even for out-of-order data. This allows Flink to provide a robust stream processing engine that scales to large deployments.
The document discusses large-scale stream processing in the Hadoop ecosystem. It provides examples of real-time stream processing use cases for computing player statistics and analyzing telco network data. It then summarizes several open source stream processing frameworks, including Apache Storm, Samza, Kafka Streams, Spark, Flink, and Apex. Key aspects like programming models, fault tolerance methods, and performance are compared for each framework. The document concludes with recommendations for further innovation in areas like dynamic scaling and batch integration.
Apache Spark is rapidly emerging as the prime platform for advanced analytics in Hadoop. This briefing is updated to reflect news and announcements as of July 2014.
The document provides an overview of Apache Spark and Hadoop ecosystem tools on Amazon EMR including Spark, Hive on Tez, and Presto. It discusses building data lakes with Amazon EMR and S3, running jobs and security options, and customer use cases. The demo shows Zeppelin and Hue interfaces. Examples are given of Netflix using Presto on EMR with a 25PB dataset and FINRA saving 60% costs by moving to HBase on EMR.
This document provides an overview of Apache Spark, including:
- Apache Spark is a next generation data processing engine for Hadoop that allows for fast in-memory processing of huge distributed and heterogeneous datasets.
- Spark offers tools for data science and components for data products and can be used for tasks like machine learning, graph processing, and streaming data analysis.
- Spark improves on MapReduce by being faster, allowing parallel processing, and supporting interactive queries. It works on both standalone clusters and Hadoop clusters.
Apache Spark, the Next Generation Cluster Computing (Gerger)
This document provides a 3 sentence summary of the key points:
Apache Spark is an open source cluster computing framework that is faster than Hadoop MapReduce by running computations in memory through RDDs, DataFrames and Datasets. It provides high-level APIs for batch, streaming and interactive queries along with libraries for machine learning. Spark's performance is improved through techniques like Catalyst query optimization, Tungsten in-memory columnar formats, and whole stage code generation.
In this presentation, Glassbeam Principal Architect Mohammad Guller gives an overview of Spark, and discusses why people are replacing Hadoop MapReduce with Spark for batch and stream processing jobs. He also covers areas where Spark really shines and presents a few real-world Spark scenarios. In addition, he reviews some misconceptions about Spark.
Here are the steps to complete the assignment:
1. Create RDDs to filter each file for lines containing "Spark":
val readme = sc.textFile("README.md").filter(_.contains("Spark"))
val changes = sc.textFile("CHANGES.txt").filter(_.contains("Spark"))
2. Perform WordCount on each:
val readmeCounts = readme.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
val changesCounts = changes.flatMap(_.split(" ")).map((_,1)).reduceByKey(_ + _)
3. Join the two RDDs:
val joined = readmeCounts.join(changesCounts)
Microservices, Containers, and Machine Learning (Paco Nathan)
Session talk for Data Day Texas 2015, showing GraphX and SparkSQL for text analytics and graph analytics of an Apache developer email list -- including an implementation of TextRank in Spark.
This document discusses Spark, an open-source cluster computing framework. It provides a brief history of Spark, describing how it generalized MapReduce to support more types of applications. Spark allows for batch, interactive, and real-time processing within a single framework using Resilient Distributed Datasets (RDDs) and a logical plan represented as a directed acyclic graph (DAG). The document also discusses how Spark can be used for applications like machine learning via MLlib, graph processing with GraphX, and streaming data with Spark Streaming.
The document provides an overview of Apache Spark, including its history and key capabilities. It discusses how Spark was developed in 2009 at UC Berkeley and later open sourced, and how it has since become a major open-source project for big data. The document summarizes that Spark provides in-memory performance for ETL, storage, exploration, analytics and more on Hadoop clusters, and supports machine learning, graph analysis, and SQL queries.
Spark Application Carousel: Highlights of Several Applications Built with Spark (Databricks)
This talk from 2015 Spark Summit East covers 3 applications built with Apache Spark:
1. Web Logs Analysis: Basic Data Pipeline - Spark & Spark SQL
2. Wikipedia Dataset Analysis: Machine Learning
3. Facebook API: Graph Algorithms
Spark is an in-memory cluster computing framework that provides high performance for large-scale data processing. It excels over Hadoop by keeping data in memory as RDDs (Resilient Distributed Datasets) for faster processing. The document provides an overview of Spark architecture including its core-based execution model compared to Hadoop's JVM-based model. It also demonstrates Spark's programming model using RDD transformations and actions through an example of log mining, showing how jobs are lazily evaluated and distributed across the cluster.
Slides for a presentation I gave for the Machine Learning with Spark Tokyo meetup.
Introduction to Spark, H2O, SparklingWater and live demos of GBM and DL.
A lecture on Apache Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) installing the environment through Docker, b) an introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL / DataFrames, and MLlib.
This document discusses several ways to extend Apache Spark, including defining custom data sources and UDFs (user-defined functions), customizing the Spark shell, UI, and adding new DDL commands. It provides code examples for customizing the Spark shell to print a custom welcome message, customizing the driver UI to add new tabs, and adding a new "PRINTME" DDL command to execute user-defined code. The document concludes by covering general principles for extending Spark such as inheriting from existing classes and supplying custom jars.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future for data property type accumulators, which may be coming to Spark in a future version.
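As a hedged sketch (not from the talk) of the accumulator-based debugging pattern described above, using the Spark 2.x API and an assumed CSV input:

val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.textFile("events.csv").flatMap { line =>
  val fields = line.split(",")
  if (fields.length == 3) Some((fields(0), fields(1), fields(2)))
  else { badRecords.add(1); None }          // count the bad record and drop it
}

parsed.count()                              // run an action first; the value below is only meaningful afterwards
println(s"records that failed to parse: ${badRecords.value}")

As the paragraph above notes, retries and partial recomputes can inflate such counts, so treat the number as a debugging signal rather than an exact metric.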
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool; however, when you have 100 computers, the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
Unified Big Data Processing with Apache Spark (C4Media)
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018 (Holden Karau)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Debugging Apache Spark - Scala & Python super happy fun times 2017 (Holden Karau)
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in our job.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk will cover a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, along with a few examples.
Spark ML for custom models - FOSDEM HPC 2017 (Holden Karau)
- Spark ML pipelines involve estimators that are trained on datasets to produce immutable transformers.
- A transformer must define transformSchema() to validate the input schema, transform() to do the work, and copy() for cloning.
- Configurable transformers take parameters like inputCol and outputCol to allow configuration for meta algorithms.
- Estimators are similar but fit() returns a model instead of directly transforming.
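A rough Scala sketch of those pieces, assuming the Spark 2.x ml API; the class and column names are invented for illustration and are not Holden's actual example:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.length
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class StringLengthTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("strlen"))

  // configurable params so the transformer can be wired into meta algorithms
  final val inputCol  = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")
  def setInputCol(v: String): this.type  = set(inputCol, v)
  def setOutputCol(v: String): this.type = set(outputCol, v)

  // validate the input schema and describe the output schema
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains($(inputCol)), s"missing column ${$(inputCol)}")
    StructType(schema.fields :+ StructField($(outputCol), IntegerType, nullable = false))
  }

  // do the actual work: append a column with the string length of the input column
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), length(dataset($(inputCol))))

  override def copy(extra: ParamMap): StringLengthTransformer = defaultCopy(extra)

  // usage sketch: new StringLengthTransformer().setInputCol("text").setOutputCol("text_len").transform(df)
}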
The document is a presentation about Apache Spark, which is described as a fast and general engine for large-scale data processing. It discusses what Spark is, its core concepts like RDDs, and the Spark ecosystem which includes tools like Spark Streaming, Spark SQL, MLlib, and GraphX. Examples of using Spark for tasks like mining DNA, geodata, and text are also presented.
Apache Spark is an open source Big Data analytical framework. It introduces the concept of RDDs (Resilient Distributed Datasets) which allow parallel operations on large datasets. The document discusses starting Spark, Spark applications, transformations and actions on RDDs, RDD creation in Scala and Python, and examples including word count. It also covers flatMap vs map, custom methods, and assignments involving transformations on lists.
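For the flatMap-vs-map point, a tiny illustrative sketch:

val lines = sc.parallelize(Seq("to be", "or not to be"))

val mapped = lines.map(_.split(" "))       // RDD[Array[String]]: one array per line
val flat   = lines.flatMap(_.split(" "))   // RDD[String]: the arrays are flattened into individual words

println(mapped.count())   // 2  (two arrays)
println(flat.count())     // 6  (six words)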
Similar to Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at New Relic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found that the architectural principles and design choices underlying APM were not a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
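A hedged sketch of what the stage-level scheduling API described above looks like in Spark 3.1; the resource amounts, file path, and GPU discovery script are illustrative assumptions, not details from the talk:

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Stage 1: ETL runs with the default (CPU-oriented) resource profile
val etl = sc.textFile("hdfs:///training/raw").filter(_.nonEmpty).map(_.split(","))

// Stage 2: request GPU executors just for the deep-learning stage
val execReqs = new ExecutorResourceRequests()
  .cores(4)
  .memory("16g")
  .resource("gpu", 1, "/opt/spark/getGpus.sh")   // discovery script is an assumption
val taskReqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
val gpuProfile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

val scored = etl.withResources(gpuProfile).mapPartitions { rows =>
  // inference or training code that expects a GPU would run here
  rows.map(_.length)
}
scored.count()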
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated on multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Increase Quality with User Access Policies - July 2024 (Peter Caitens)
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
The Challenge of Interpretability in Generative AI Models.pdf (Sara Kroft)
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore the performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottlenecks, making a fix, and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their applications fast and understanding how things work under the hood.
Finetuning GenAI For Hacking and Defending (Priyanka Aash)
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
Demystifying Neural Networks And Building Cybersecurity Applications (Priyanka Aash)
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
How UiPath Discovery Suite supports identification of Agentic Process Automat... (DianaGray10)
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... (Zilliz)
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Keynote : Presentation on SASE Technology (Priyanka Aash)
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
Retrieval Augmented Generation Evaluation with Ragas (Zilliz)
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
1. Spark Camp @ Strata CA
Intro to Apache Spark with Hands-on Tutorials
Wed Feb 18, 2015 9:00am–5:00pm
download slides:
training.databricks.com/workshop/sparkcamp.pdf
Licensed under a Creative Commons Attribution-NonCommercial-
NoDerivatives 4.0 International License
2. Tutorial Outline:
morning:
Welcome + Getting Started
Ex 1: Pre-Flight Check
How Spark Runs on a Cluster
A Brief History
Ex 3: WC, Joins
DBC Essentials
How to “Think Notebooks”
Ex 4: Workflow exercise
Tour of Spark API
Spark @ Strata

afternoon:
Intro to MLlib
Songs Demo
Spark SQL
Visualizations
Ex 7: SQL + Visualizations
Spark Streaming
Tw Streaming / Kinesis Demo
Building a Scala JAR
Deploying Apps
Ex 8: GraphX examples
Case Studies
Further Resources / Q&A
5. Everyone will receive a username/password for one
of the Databricks Cloud shards. Use your laptop and
browser to login there.
We find that cloud-based notebooks are a simple way
to get started using Apache Spark – as the motto
“Making Big Data Simple” states.
Please create and run a variety of notebooks on your
account throughout the tutorial. These accounts will
remain open long enough for you to export your work.
See the product page or FAQ for more details, or
contact Databricks to register for a trial account.
Getting Started: Step 1
15.
Now let’s get started with the coding exercise!
We’ll define an initial Spark app in three lines
of code:
Getting Started: Coding Exercise
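The exercise itself isn't reproduced in these slides, but a three-line starter app typically looks something like this (the file path is an assumption):

val lines = sc.textFile("README.md")               // 1. load a text file as an RDD
val sparkLines = lines.filter(_.contains("Spark")) // 2. transformation: keep lines mentioning Spark
println(sparkLines.count())                        // 3. action: count them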
16. If you’re new to this Scala thing and want to
spend a few minutes on the basics…
Scala Crash Course
Holden Karau
lintool.github.io/SparkTutorial/
slides/day1_Scala_crash_course.pdf
Getting Started: Bonus!
17.
Getting Started: Extra Bonus!!
See also the /learning_spark_book
for all of its code examples in notebooks:
18. How Spark runs
on a Cluster
[Cluster diagram: a Driver program coordinating three Workers; each Worker holds an input block (blocks 1-3) and a cached partition (caches 1-3)]
19.
Clone and run /_SparkCamp/01.log_example
in your folder:
Spark Deconstructed: Log Mining Example
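The notebook's contents aren't shown here, but the classic log-mining pattern it walks through looks roughly like this (sketched in Scala; the snippet on the next slide is from the Python version of the same example, and the file layout is assumed):

val lines = sc.textFile("error_log.txt")
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")(1))   // keep just the message field
messages.cache()                              // keep the filtered data in memory

println(messages.filter(_.contains("mysql")).count())  // the first action materializes the cache
println(messages.filter(_.contains("php")).count())    // later queries reuse the cached data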
21. Spark Deconstructed: Log Mining Example
x = messages.filter(lambda x: x.find("mysql") > -1)
x.toDebugString()

(2) PythonRDD[772] at RDD at PythonRDD.scala:43 []
 |  PythonRDD[219] at RDD at PythonRDD.scala:43 []
 |  error_log.txt MappedRDD[218] at NativeMethodAccessorImpl.java:-2 []
 |  error_log.txt HadoopRDD[217] at NativeMethodAccessorImpl.java:-2 []
Note that we can examine the operator graph
for a transformed RDD, for example:
37.
A Brief History: Functional Programming for Big Data
circa late 1990s:
explosive growth of e-commerce and machine data
implied that workloads could not fit on a single
computer anymore…
notable firms led the shift to horizontal scale-out
on clusters of commodity hardware, especially
for machine learning use cases at scale
38.
A Brief History: Functional Programming for Big Data
2002: MapReduce @ Google
2004: MapReduce paper
2006: Hadoop @ Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes an Apache top-level project
39.
circa 2002:
mitigate risk of large distributed workloads lost
due to disk failures on commodity hardware…
Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
research.google.com/archive/gfs.html
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
research.google.com/archive/mapreduce.html
A Brief History: MapReduce
40. A Brief History: MapReduce
circa 1979 – Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/jmc/history/lisp/lisp.htm
circa 2004 – Google
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
research.google.com/archive/mapreduce.html
circa 2006 – Apache
Hadoop, originating from the Nutch Project
Doug Cutting
research.yahoo.com/files/cutting.pdf
circa 2008 – Yahoo
web scale search indexing
Hadoop Summit, HUG, etc.
developer.yahoo.com/hadoop/
circa 2009 – Amazon AWS
Elastic MapReduce
Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
aws.amazon.com/elasticmapreduce/
40
41. Open Discussion:
Enumerate several changes in data center
technologies since 2002…
A Brief History: MapReduce
41
43. MapReduce use cases showed two major
limitations:
1. difficulty of programming directly in MR
2. performance bottlenecks, or batch not
fitting the use cases
In short, MR doesn’t compose well for large
applications
Therefore, people built specialized systems as
workarounds…
A Brief History: MapReduce
43
44. 44
MR doesn’t compose well for large applications,
and so specialized systems emerged as workarounds
MapReduce: general batch processing
Specialized systems (iterative, interactive, streaming, graph, etc.):
Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, F1, MillWheel
A Brief History: MapReduce
45. Developed in 2009 at UC Berkeley AMPLab, then
open sourced in 2010, Spark has since become
one of the largest OSS communities in big data,
with over 200 contributors in 50+ organizations
spark.apache.org
“Organizations that are looking at big data challenges –
including collection, ETL, storage, exploration and analytics –
should consider Spark for its in-memory performance and
the breadth of its model. It supports advanced analytics
solutions on Hadoop clusters, including the iterative model
required for machine learning and graph analysis.”
Gartner, Advanced Analytics and Data Science (2014)
45
A Brief History: Spark
46. 46
Spark: Cluster Computing with Working Sets
Matei Zaharia, Mosharaf Chowdhury,
Michael Franklin, Scott Shenker, Ion Stoica
people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
In-Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
circa 2010:
a unified engine for enterprise data workflows,
based on commodity hardware a decade later…
A Brief History: Spark
47. A Brief History: Spark
Unlike the various specialized systems, Spark’s
goal was to generalize MapReduce to support
new apps within the same engine
Two reasonably small additions are enough to
express the previous models:
• fast data sharing
• general DAGs
This allows for an approach which is more
efficient for the engine, and much simpler
for the end users
47
49. Some key points about Spark:
• handles batch, interactive, and real-time
within a single framework
• native integration with Java, Python, Scala
• programming at a higher level of abstraction
• more general: map/reduce is just one set
of supported constructs
A Brief History: Spark
49
50. • generalized patterns
unified engine for many use cases
• lazy evaluation of the lineage graph
reduces wait states, better pipelining
• generational differences in hardware
off-heap use of large memory spaces
• functional programming / ease of use
reduction in cost to maintain large apps
• lower overhead for starting jobs
• less expensive shuffles
A Brief History: Key distinctions for Spark vs. MapReduce
50
56. Coding Exercise: WordCount
void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");


void reduce (String word, Iterator group):
  int count = 0;

  for each pc in group:
    count += Int(pc);

  emit(word, String(count));
Definition:
count how often each word appears
in a collection of text documents
This simple program provides a good test case
for parallel processing, since it:
• requires a minimal amount of code
• demonstrates use of both symbolic and
numeric values
• isn’t many steps away from search indexing
• serves as a “Hello World” for Big Data apps
A distributed computing framework that can run
WordCount efficiently in parallel at scale
can likely handle much larger and more interesting
compute problems
56
57. WordCount in 3 lines of Spark
WordCount in 50+ lines of Java MR
57
Coding Exercise: WordCount
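As a rough sketch of the “3 lines” version in PySpark (the input path is illustrative):
lines = sc.textFile("/mnt/paco/intro/README.md")   # hypothetical input path
counts = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.take(10))   # action: returns a sample of (word, count) pairs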
58. 58
Clone and run /_SparkCamp/02.wc_example
in your folder:
Coding Exercise: WordCount
59. 59
Clone and run /_SparkCamp/03.join_example
in your folder:
Coding Exercise: Join
62. DBC Essentials
[diagram: a typical data workflow circa 2010 – ETL of data into the cluster/cloud, data prep, feature engineering, train/test sets, learners and parameters, model evaluation and optimization, then scoring against production data, with results feeding visualizations, reports, decisions, and feedback]
63. 63
DBC Essentials: What is Databricks Cloud?
Databricks Platform
Databricks Workspace
Also see FAQ for more details…
64. 64
DBC Essentials: What is Databricks Cloud?
key concepts
Shard: an instance of Databricks Workspace
Cluster: a Spark cluster (multiple per shard)
Notebook: a list of markdown, executable commands, and results
Dashboard: a flexible space to create operational visualizations
Also see FAQ for more details…
65. 65
DBC Essentials: Notebooks
• Series of commands (think shell++)
• Each notebook has a language type,
chosen at notebook creation:
• Python + SQL
• Scala + SQL
• SQL only
• Command output captured in notebook
• Commands can be…
• edited, reordered, rerun, exported,
cloned, imported, etc.
66. 66
DBC Essentials: Clusters
• Open source Spark clusters hosted in the cloud
• Access the Spark UI
• Attach and Detach notebooks to clusters
NB: our training shards use 7 GB cluster
configurations
67. 67
DBC Essentials: Team, State, Collaboration, Elastic Resources
[diagram: team members log in from their browsers to a shard in the cloud; notebooks hold state and can be attached to or detached from Spark clusters, and imported/exported as local copies]
68. 68
DBC Essentials: Team, State, Collaboration, Elastic Resources
Excellent collaboration properties, based
on the use of:
• comments
• cloning
• decoupled state of notebooks vs.
clusters
• relative independence of code blocks
within a notebook
70. How to “think” in terms of leveraging notebooks,
based on Computational Thinking:
70
Think Notebooks:
“The way we depict
space has a great
deal to do with how
we behave in it.”
– David Hockney
71. 71
“The impact of computing extends far beyond
science… affecting all aspects of our lives.
To flourish in today's world, everyone needs
computational thinking.” – CMU
Computing now ranks alongside the proverbial
Reading, Writing, and Arithmetic…
Center for Computational Thinking @ CMU
http://www.cs.cmu.edu/~CompThink/
Exploring Computational Thinking @ Google
https://www.google.com/edu/computational-thinking/
Think Notebooks: Computational Thinking
72. 72
Computational Thinking provides a structured
way of conceptualizing the problem…
In effect, you are developing notes for yourself
and your team
These in turn can become the basis for team
processes, software requirements, etc.
In other words, conceptualize how to leverage
computing resources at scale to build high-ROI
apps for Big Data
Think Notebooks: Computational Thinking
73. 73
The general approach, in four parts:
• Decomposition: decompose a complex
problem into smaller solvable problems
• Pattern Recognition: identify when a
known approach can be leveraged
• Abstraction: abstract from those patterns
into generalizations as strategies
• Algorithm Design: articulate strategies as
algorithms, i.e. as general recipes for how to
handle complex problems
Think Notebooks: Computational Thinking
74. How to “think” in terms of leveraging notebooks,
by the numbers:
1. create a new notebook
2. copy the assignment description as markdown
3. split it into separate code cells
4. for each step, write your code under the
markdown
5. run each step and verify your results
74
Think Notebooks:
75. Let’s assemble the pieces of the previous few
code examples, using two files:
/mnt/paco/intro/CHANGES.txt
/mnt/paco/intro/README.md
1. create RDDs to filter each line for the
keyword Spark
2. perform a WordCount on each, i.e., so the
results are (K,V) pairs of (keyword, count)
3. join the two RDDs
4. how many instances of Spark are there in
each file?
75
Coding Exercises: Workflow assignment
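One possible sketch of this workflow in PySpark (an outline only, not the notebook's reference solution):
changes = sc.textFile("/mnt/paco/intro/CHANGES.txt").filter(lambda l: "Spark" in l)
readme = sc.textFile("/mnt/paco/intro/README.md").filter(lambda l: "Spark" in l)

def wc(rdd):
    # (keyword, count) pairs for one filtered file
    return rdd.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

joined = wc(changes).join(wc(readme))   # (word, (count_in_CHANGES, count_in_README))
print(joined.lookup("Spark"))           # how many instances of "Spark" in each file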
76. Tour of Spark API
[diagram: a Driver Program holding a SparkContext talks to a Cluster Manager, which allocates Worker Nodes; each Worker Node runs an Executor with a cache and its tasks]
77. The essentials of the Spark API in both Scala
and Python…
/_SparkCamp/05.scala_api
/_SparkCamp/05.python_api

Let’s start with the basic concepts, which are
covered in much more detail in the docs:
spark.apache.org/docs/latest/scala-programming-guide.html
Spark Essentials:
77
78. The first thing that a Spark program does is create
a SparkContext object, which tells Spark how
to access a cluster
In the shell for either Scala or Python, this is
the sc variable, which is created automatically
Other programs must use a constructor to
instantiate a new SparkContext
The SparkContext is then used to create
other variables
Spark Essentials: SparkContext
78
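For example, a standalone app might construct its own context along these lines (a sketch; the app name and master setting are placeholders):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")   # placeholder values
sc = SparkContext(conf=conf)   # in the shell or a notebook, sc already exists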
80. The master parameter for a SparkContext
determines which cluster to use
Spark Essentials: Master
local: run Spark locally with one worker thread (no parallelism)
local[K]: run Spark locally with K worker threads (ideally set to # cores)
spark://HOST:PORT: connect to a Spark standalone cluster; PORT depends on config (7077 by default)
mesos://HOST:PORT: connect to a Mesos cluster; PORT depends on config (5050 by default)
80
82. [diagram: Driver Program/SparkContext, Cluster Manager, and Worker Nodes with Executors, caches, and tasks, as on the earlier cluster overview slide]
The driver performs the following:
1. connects to a cluster manager to allocate
resources across applications
2. acquires executors on cluster nodes –
these processes run compute tasks and cache data
3. sends app code to the executors
4. sends tasks for the executors to run
Spark Essentials: Clusters
82
83. Resilient Distributed Datasets (RDD) are the
primary abstraction in Spark – a fault-tolerant
collection of elements that can be operated on
in parallel
There are currently two types:
• parallelized collections – take an existing Scala
collection and run functions on it in parallel
• Hadoop datasets – run functions on each record
of a file in Hadoop distributed file system or any
other storage system supported by Hadoop
Spark Essentials: RDD
83
84. • two types of operations on RDDs:
transformations and actions
• transformations are lazy
(not computed immediately)
• the transformed RDD gets recomputed
when an action is run on it (default)
• however, an RDD can be persisted into
storage in memory or disk
Spark Essentials: RDD
84
85. Scala:
val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24970]
Spark Essentials: RDD
Python:
data = [1, 2, 3, 4, 5]
data
Out[2]: [1, 2, 3, 4, 5]

distData = sc.parallelize(data)
distData
Out[3]: ParallelCollectionRDD[24864] at parallelize at PythonRDD.scala:364
85
86. Spark can create RDDs from any file stored in HDFS
or other storage systems supported by Hadoop, e.g.,
local file system, Amazon S3, Hypertable, HBase, etc.
Spark supports text files, SequenceFiles, and any
other Hadoop InputFormat, and can also take a
directory or a glob (e.g. /data/201404*)
Spark Essentials: RDD
[diagram: a chain of RDDs linked by transformations, with an action at the end producing a value]
86
87. Scala:
val distFile = sqlContext.table("readme")
distFile: org.apache.spark.sql.SchemaRDD =
SchemaRDD[24971] at RDD at SchemaRDD.scala:108
Spark Essentials: RDD
Python:
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile
Out[11]: PythonRDD[24920] at RDD at PythonRDD.scala:43
87
88. Transformations create a new dataset from
an existing one
All transformations in Spark are lazy: they
do not compute their results right away –
instead they remember the transformations
applied to some base dataset
This design lets Spark:
• optimize the required calculations
• recover from lost data partitions
Spark Essentials: Transformations
88
89. Spark Essentials: Transformations
map(func): return a new distributed dataset formed by passing each element of the source through a function func
filter(func): return a new dataset formed by selecting those elements of the source on which func returns true
flatMap(func): similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
sample(withReplacement, fraction, seed): sample a fraction fraction of the data, with or without replacement, using a given random number generator seed
union(otherDataset): return a new dataset that contains the union of the elements in the source dataset and the argument
distinct([numTasks]): return a new dataset that contains the distinct elements of the source dataset
89
90. Spark Essentials: Transformations
groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs
reduceByKey(func, [numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function
sortByKey([ascending], [numTasks]): when called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
join(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key
cogroup(otherDataset, [numTasks]): when called on datasets of type (K, V) and (K, W), returns a dataset of (K, Seq[V], Seq[W]) tuples – also called groupWith
cartesian(otherDataset): when called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
90
91. Scala:
val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
Spark Essentials: Transformations
Python:
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
distFile is a collection of lines
91
93. Spark Essentials: Transformations
closures
Scala:
val distFile = sqlContext.table("readme").map(_(0).asInstanceOf[String])
distFile.map(l => l.split(" ")).collect()
distFile.flatMap(l => l.split(" ")).collect()
Python:
distFile = sqlContext.table("readme").map(lambda x: x[0])
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
93
looking at the output, how would you
compare results for map() vs. flatMap() ?
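For intuition, a tiny sketch of the difference (with made-up data rather than the readme table):
rdd = sc.parallelize(["a b", "c"])
rdd.map(lambda x: x.split(" ")).collect()       # [['a', 'b'], ['c']]  ... one list per input line
rdd.flatMap(lambda x: x.split(" ")).collect()   # ['a', 'b', 'c']      ... outputs flattened together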
94. Spark Essentials: Actions
reduce(func): aggregate the elements of the dataset using a function func (which takes two arguments and returns one), and should also be commutative and associative so that it can be computed correctly in parallel
collect(): return all the elements of the dataset as an array at the driver program – usually useful after a filter or other operation that returns a sufficiently small subset of the data
count(): return the number of elements in the dataset
first(): return the first element of the dataset – similar to take(1)
take(n): return an array with the first n elements of the dataset – currently not executed in parallel, instead the driver program computes all the elements
takeSample(withReplacement, num, [seed]): return an array with a random sample of num elements of the dataset, with or without replacement, using the given random number generator seed
94
95. Spark Essentials: Actions
saveAsTextFile(path): write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file
saveAsSequenceFile(path): write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. Only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.)
countByKey(): only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key
foreach(func): run a function func on each element of the dataset – usually done for side effects such as updating an accumulator variable or interacting with external storage systems
95
96. Scala:
val f = sqlContext.table("readme").map(_(0).asInstanceOf[String])
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1))
words.reduceByKey(_ + _).collect.foreach(println)
Spark Essentials: Actions
Python:
from operator import add
f = sqlContext.table("readme").map(lambda x: x[0])
words = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1))
words.reduceByKey(add).collect()
96
97. Spark can persist (or cache) a dataset in
memory across operations
spark.apache.org/docs/latest/programming-guide.html#rdd-
persistence
Each node stores in memory any slices of it
that it computes and reuses them in other
actions on that dataset – often making future
actions more than 10x faster
The cache is fault-tolerant: if any partition
of an RDD is lost, it will automatically be
recomputed using the transformations that
originally created it
Spark Essentials: Persistence
97
98. Spark Essentials: Persistence
MEMORY_ONLY: store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK: store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
MEMORY_ONLY_SER: store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER: similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
DISK_ONLY: store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental): store RDD in serialized format in Tachyon.
98
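For example, a sketch of requesting an explicit storage level instead of the default cache() (the path is illustrative):
from pyspark import StorageLevel

lines = sc.textFile("/mnt/paco/intro/README.md")   # hypothetical input path
lines.persist(StorageLevel.MEMORY_AND_DISK)        # spill partitions to disk rather than recompute
lines.count()   # first action materializes and persists the partitions
lines.count()   # later actions reuse the persisted partitions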
99. Scala:
val f = sqlContext.table("readme").map(_(0).asInstanceOf[String])
val words = f.flatMap(l => l.split(" ")).map(word => (word, 1)).cache()
words.reduceByKey(_ + _).collect.foreach(println)
Spark Essentials: Persistence
Python:
from operator import add
f = sqlContext.table("readme").map(lambda x: x[0])
w = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).cache()
w.reduceByKey(add).collect()
99
100. Broadcast variables let the programmer keep a
read-only variable cached on each machine
rather than shipping a copy of it with tasks
For example, to give every node a copy of
a large input dataset efficiently
Spark also attempts to distribute broadcast
variables using efficient broadcast algorithms
to reduce communication cost
Spark Essentials: Broadcast Variables
100
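A minimal sketch in PySpark (the lookup table is made up for illustration):
lookup = sc.broadcast({"us": "United States", "fr": "France"})   # shipped once per machine

codes = sc.parallelize(["us", "fr", "us"])
codes.map(lambda c: lookup.value.get(c, "unknown")).collect()
# ['United States', 'France', 'United States']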
102. Accumulators are variables that can only be
“added” to through an associative operation
Used to implement counters and sums,
efficiently in parallel
Spark natively supports accumulators of
numeric value types and standard mutable
collections, and programmers can extend
for new types
Only the driver program can read an
accumulator’s value, not the tasks
Spark Essentials: Accumulators
102
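A minimal sketch in PySpark (illustrative data; note that only the driver reads the final value):
bad_records = sc.accumulator(0)

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)   # tasks can only add to the accumulator
        return 0

sc.parallelize(["1", "2", "oops", "4"]).map(parse).sum()
print(bad_records.value)   # read back on the driver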
105. For a deep-dive about broadcast variables
and accumulator usage in Spark, see also:
Advanced Spark Features
Matei Zaharia, Jun 2012
ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-
zaharia-amp-camp-2012-advanced-spark.pdf
Spark Essentials: Broadcast Variables and Accumulators
105
107. Spark Essentials: API Details
For more details about the Scala API:
spark.apache.org/docs/latest/api/scala/
index.html#org.apache.spark.package
For more details about the Python API:
spark.apache.org/docs/latest/api/python/
107
109. Keynote: New Directions for Spark in 2015
Fri Feb 20 9:15am-9:25am
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/39547
As the Apache Spark userbase grows, the developer community is working
to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the
enterprise and major improvements in its performance, scalability and
standard libraries. In 2015, we want to make Spark accessible to a wider
set of users, through new high-level APIs for data science: machine learning
pipelines, data frames, and R language bindings. In addition, we are defining
extension points to let Spark grow as a platform, making it easy to plug in
data sources, algorithms, and external packages. Like all work on Spark,
these APIs are designed to plug seamlessly into Spark applications, giving
users a unified platform for streaming, batch and interactive data processing.
Matei Zaharia – started the Spark project
at UC Berkeley, currently CTO of Databricks,
Spark VP at Apache, and an assistant professor
at MIT
110. Spark Camp: Ask Us Anything
Fri, Feb 20 2:20pm-3:00pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/40701
Join the Spark team for an informal question and
answer session. Several of the Spark committers,
trainers, etc., from Databricks will be on hand to
field a wide range of detailed questions.
Even if you don’t have a specific question, join
in to hear what others are asking!
111. Databricks Spark Talks @Strata + Hadoop World
Thu Feb 19 10:40am-11:20am
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Lessons from Running Large Scale Spark Workloads
Reynold Xin, Matei Zaharia
Thu Feb 19 4:00pm–4:40pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38518
Spark Streaming - The State of the Union, and Beyond
Tathagata Das
112. Databricks Spark Talks @Strata + Hadoop World
Fri Feb 20 11:30am-12:10pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38237
Tuning and Debugging in Apache Spark
Patrick Wendell
Fri Feb 20 4:00pm–4:40pm
strataconf.com/big-data-conference-ca-2015/
public/schedule/detail/38391
Everyday I’m Shuffling - Tips for Writing Better Spark Programs
Vida Ha, Holden Karau
113. Spark Developer Certification
Fri Feb 20, 2015 10:40am-12:40pm
• go.databricks.com/spark-certified-developer
• defined by Spark experts @Databricks
• assessed by O’Reilly Media
• establishes the bar for Spark expertise
114. • 40 multiple-choice questions, 90 minutes
• mostly structured as choices among code blocks
• expect some Python, Java, Scala, SQL
• understand theory of operation
• identify best practices
• recognize code that is more parallel, less
memory constrained
Overall, you need to write Spark apps in practice
Developer Certification: Overview
114
119. Spark SQL blurs the lines between RDDs and relational tables
spark.apache.org/docs/latest/sql-programming-guide.html

intermix SQL commands to query external data,
along with complex analytics, in a single app:
• allows SQL extensions based on MLlib
• provides the “heavy lifting” for ETL in DBC
Spark SQL: Manipulating Structured Data Using Spark
Michael Armbrust, Reynold Xin (2014-03-24)
databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
119
Spark SQL: Data Workflows
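For instance, a sketch of mixing SQL with RDD-style analytics in one app, using the Spark 1.x SchemaRDD style shown elsewhere in this deck (the table and columns are hypothetical):
purchases = sqlContext.sql("SELECT user, amount FROM events WHERE action = 'purchase'")
totals = purchases.map(lambda row: (row.user, row.amount)).reduceByKey(lambda a, b: a + b)
totals.take(5)   # per-user spend totals (unordered)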
120. Parquet is a columnar format, supported
by many different Big Data frameworks
http://parquet.io/
Spark SQL supports read/write of parquet files,
automatically preserving the schema of the
original data (HUGE benefits)
Modifying the previous example…
120
Spark SQL: Data Workflows – Parquet
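A sketch of that round trip with the Spark 1.x SchemaRDD API (the output path is illustrative):
readme = sqlContext.table("readme")                   # SchemaRDD from the earlier examples
readme.saveAsParquetFile("/tmp/readme.parquet")       # schema is preserved in the Parquet files
again = sqlContext.parquetFile("/tmp/readme.parquet")
again.registerTempTable("readme_parquet")             # query it again with SQL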
128. 128
The display() command:
• programmatic access to visualizations
• pass a SchemaRDD to print as an HTML table
• pass a Scala list to print as an HTML table
• call without arguments to display matplotlib
figures
Visualization: Using display()
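A sketch of typical usage inside a Databricks notebook (the query and table name are hypothetical):
top_words = sqlContext.sql("SELECT word, count FROM wordcounts ORDER BY count DESC LIMIT 10")
display(top_words)   # renders the SchemaRDD as an HTML table in the notebook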
129. 129
The displayHTML() command:
• render any arbitrary HTML/JavaScript
• include JavaScript libraries (advanced feature)
• paste in D3 examples to get a sense for this…
Visualization: Using displayHTML()
130. 130
Clone the entire folder /_SparkCamp/Viz D3
into your folder and run its notebooks:
Demo: D3 Visualization
131. 131
Clone and run /_SparkCamp/07.sql_visualization
in your folder:
Coding Exercise: SQL + Visualization
134. Let’s consider the top-level requirements for
a streaming framework:
• clusters scalable to 100’s of nodes
• low-latency, in the range of seconds
(meets 90% of use case needs)
• efficient recovery from failures
(which is a hard problem in CS)
• integrates with batch: many co’s run the
same business logic both online+offline
Spark Streaming: Requirements
134
135. Therefore, run a streaming computation as:
a series of very small, deterministic batch jobs
• Chop up the live stream into
batches of X seconds
• Spark treats each batch of
data as RDDs and processes
them using RDD operations
• Finally, the processed results
of the RDD operations are
returned in batches
Spark Streaming: Requirements
135
136. Therefore, run a streaming computation as:
a series of very small, deterministic batch jobs
• Batch sizes as low as ½ sec,
latency of about 1 sec
• Potential for combining
batch processing and
streaming processing in
the same system
Spark Streaming: Requirements
136
137. Data can be ingested from many sources:
Kafka, Flume, Twitter, ZeroMQ, TCP sockets, etc.
Results can be pushed out to filesystems,
databases, live dashboards, etc.
Spark’s built-in machine learning algorithms and
graph processing algorithms can be applied to
data streams
Spark Streaming: Integration
137
138. 2012
project started
2013
alpha release (Spark 0.7)
2014
graduated (Spark 0.9)
Spark Streaming: Timeline
Discretized Streams: A Fault-Tolerant Model
for Scalable Stream Processing
Matei Zaharia, Tathagata Das, Haoyuan Li,
Timothy Hunter, Scott Shenker, Ion Stoica
Berkeley EECS (2012-12-14)
www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
project lead:
Tathagata Das @tathadas
138
139. Typical kinds of applications:
• datacenter operations
• web app funnel metrics
• ad optimization
• anti-fraud
• telecom
• video analytics
• various telematics
and much much more!
Spark Streaming: Use Cases
139
141. import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

// create a StreamingContext with a SparkConf configuration
val ssc = new StreamingContext(sparkConf, Seconds(10))

// create a DStream that will connect to serverIP:serverPort
val lines = ssc.socketTextStream(serverIP, serverPort)

// split each line into words
val words = lines.flatMap(_.split(" "))

// count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// print a few of the counts to the console
wordCounts.print()

ssc.start()
ssc.awaitTermination()
Quiz: name the bits and pieces…
141
147. 147
1. extract text from the tweet:
   https://twitter.com/andy_bf/status/16222269370011648
   → "Ceci n'est pas un tweet"
2. sequence text as bigrams:
   tweet.sliding(2).toSeq → ("Ce", "ec", "ci", …)
3. convert bigrams into numbers:
   seq.map(_.hashCode()) → (2178, 3230, 3174, …)
4. index into sparse tf vector:
   seq.map(_.hashCode() % 1000) → (178, 230, 174, …)
5. increment feature count:
   Vector.sparse(1000, …) → (1000, [102, 104, …], [0.0455, 0.0455, …])
Demo: Twitter Streaming Language Classifier
From tweets to ML features,
approximated as sparse vectors:
149. SBT is the Simple Build Tool for Scala:
www.scala-sbt.org/
This is included with the Spark download, and
does not need to be installed separately.
Similar to Maven; however, it provides for
incremental compilation and an interactive shell,
among other innovations.
The SBT project uses StackOverflow for Q&A,
which is a good resource for further study:
stackoverflow.com/tags/sbt
Spark in Production: Build: SBT
149
150. Spark in Production: Build: SBT
clean: delete all generated files (in the target directory)
package: create a JAR file
run: run the JAR (or main class, if named)
compile: compile the main sources (in src/main/scala and src/main/java directories)
test: compile and run all tests
console: launch a Scala interpreter
help: display detailed help for specified commands
150
151. builds:
• build/run a JAR using Java + Maven
• SBT primer
• build/run a JAR using Scala + SBT
Spark in Production: Build: Scala
151
152. The following sequence shows how to build
a JAR file from a Scala app, using SBT
• First, this requires the “source” download,
not the “binary”
• Change into the SPARK_HOME directory
• Then run the following commands…
Spark in Production: Build: Scala
152
153. # Scala source + SBT build script on following slides

cd simple-app

../sbt/sbt -Dsbt.ivy.home=../sbt/ivy package

../spark/bin/spark-submit \
  --class "SimpleApp" \
  --master local[*] \
  target/scala-2.10/simple-project_2.10-1.0.jar
Spark in Production: Build: Scala
153
154. /*** SimpleApp.scala ***/
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "README.md" // Should be some file on your system
    val sc = new SparkContext("local", "Simple App", "SPARK_HOME",
      List("target/scala-2.10/simple-project_2.10-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()

    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
Spark in Production: Build: Scala
154
155. name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Spark in Production: Build: Scala
155
157. Spark in Production: Databricks Cloud
157
Databricks Platform
Databricks Workspace
Arguably, one of the simplest ways to deploy Apache Spark
is to use Databricks Cloud for cloud-based notebooks
159. Apache Mesos, from which Apache Spark
originated…
Running Spark on Mesos
spark.apache.org/docs/latest/running-on-mesos.html
Run Apache Spark on Apache Mesos
tutorial based on Mesosphere + Google Cloud
ceteri.blogspot.com/2014/09/spark-atop-mesos-on-google-cloud.html
Getting Started Running Apache Spark on Apache Mesos
O’Reilly Media webcast
oreilly.com/pub/e/2986
Spark in Production: Mesos
159
161. MapR Technologies provides support for running
Spark on the MapR distros:
mapr.com/products/apache-spark
slideshare.net/MapRTechnologies/map-r-
databricks-webinar-4x3
Spark in Production: MapR
161
162. Hortonworks provides support for running
Spark on HDP:
spark.apache.org/docs/latest/hadoop-third-party-
distributions.html
hortonworks.com/blog/announcing-hdp-2-1-tech-
preview-component-apache-spark/
Spark in Production: HDP
162
163. Running Spark on Amazon AWS EC2:
blogs.aws.amazon.com/bigdata/post/Tx15AY5C50K70RV/
Installing-Apache-Spark-on-an-Amazon-EMR-Cluster
Spark in Production: EC2
163
164. Spark in MapReduce (SIMR) – quick way
for Hadoop MR1 users to deploy Spark:
databricks.github.io/simr/
spark-summit.org/talk/reddy-simr-let-your-
spark-jobs-simmer-inside-hadoop-clusters/
• Spark runs on Hadoop clusters without
any install or required admin rights
• SIMR launches a Hadoop job that only
contains mappers, includes Scala+Spark
./simr jar_file main_class parameters
[--outdir=] [--slots=N] [--unique]
Spark in Production: SIMR
164
165. review UI features
spark.apache.org/docs/latest/monitoring.html
http://<master>:8080/
http://<master>:50070/
• verify: is my job still running?
• drill-down into workers and stages
• examine stdout and stderr
• discuss how to diagnose / troubleshoot
Spark in Production: Monitor
165
170. 170
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin
graphlab.org/files/osdi2012-gonzalez-low-gu-
bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google
Grzegorz Czajkowski, et al.
googleresearch.blogspot.com/2009/06/large-scale-
graph-computing-at-google.html
GraphX: Unified Graph Analytics on Spark
Ankur Dave, Databricks
databricks-training.s3.amazonaws.com/slides/
graphx@sparksummit_2014-07.pdf
Advanced Exercises: GraphX
databricks-training.s3.amazonaws.com/graph-
analytics-with-graphx.html
GraphX: Further Reading…
172. 172
GraphX: Example – routing problems
[diagram: a small graph with nodes 0–3 connected by edges with costs 1, 1, 2, 3, 4]
What is the cost to reach node 0 from any other
node in the graph? This is a common use case for
graph algorithms, e.g., Dijkstra
173. 173
Clone and run /_SparkCamp/08.graphx
in your folder:
GraphX: Coding Exercise
175. Case Studies: Apache Spark, DBC, etc.
Additional details about production deployments
for Apache Spark can be found at:
https://cwiki.apache.org/confluence/display/
SPARK/Powered+By+Spark
https://databricks.com/blog/category/company/
partners
http://go.databricks.com/customer-case-studies
175
176. Case Studies: Automatic Labs
176
Spark Plugs Into Your Car
Rob Ferguson
spark-summit.org/east/2015/talk/spark-plugs-into-your-car
finance.yahoo.com/news/automatic-labs-turns-databricks-
cloud-140000785.html
Automatic creates personalized driving habit dashboards
• wanted to use Spark while minimizing investment in DevOps
• provides data access to non-technical analysts via SQL
• replaced Redshift and disparate ML tools with single platform
• leveraged built-in visualization capabilities in notebooks to
generate dashboards easily and quickly
• used MLlib on Spark for needed functionality out of the box
177. Spark at Twitter: Evaluation & Lessons Learnt
Sriram Krishnan
slideshare.net/krishflix/seattle-spark-meetup-spark-at-twitter
• Spark can be more interactive, efficient than MR
• support for iterative algorithms and caching
• more generic than traditional MapReduce
• Why is Spark faster than Hadoop MapReduce?
• fewer I/O synchronization barriers
• less expensive shuffle
• the more complex the DAG, the greater the
performance improvement
177
Case Studies: Twitter
178. Pearson uses Spark Streaming for next
generation adaptive learning platform
Dibyendu Bhattacharya
databricks.com/blog/2014/12/08/pearson-
uses-spark-streaming-for-next-generation-
adaptive-learning-platform.html
178
• Kafka + Spark + Cassandra + Blur, on AWS on a YARN
cluster
• single platform/common API was a key reason to replace
Storm with Spark Streaming
• custom Kafka Consumer for Spark Streaming, using Low
Level Kafka Consumer APIs
• handles: Kafka node failures, receiver failures, leader
changes, committed offset in ZK, tunable data rate
throughput
Case Studies: Pearson
179. Unlocking Your Hadoop Data with Apache Spark and CDH5
Denny Lee
slideshare.net/Concur/unlocking-your-hadoop-data-
with-apache-spark-and-cdh5
179
• leading provider of spend management solutions and
services
• delivers recommendations based on business users’ travel
and expenses – “to help deliver the perfect trip”
• use of traditional BI tools with Spark SQL allowed analysts
to make sense of the data without becoming programmers
• needed the ability to transition quickly between Machine
Learning (MLLib), Graph (GraphX), and SQL usage
• needed to deliver recommendations in real-time
Case Studies: Concur
180. Stratio Streaming: a new approach to Spark Streaming
David Morales, Oscar Mendez
spark-summit.org/2014/talk/stratio-streaming-a-
new-approach-to-spark-streaming
180
• Stratio Streaming is the union of a real-time messaging bus
with a complex event processing engine atop Spark
Streaming
• allows the creation of streams and queries on the fly
• paired with Siddhi CEP engine and Apache Kafka
• added global features to the engine such as auditing and
statistics
Case Studies: Stratio
181. Collaborative Filtering with Spark
Chris Johnson
slideshare.net/MrChrisJohnson/collaborative-filtering-with-
spark
• collab filter (ALS) for music recommendation
• Hadoop suffers from I/O overhead
• show a progression of code rewrites, converting a
Hadoop-based app into efficient use of Spark
181
Case Studies: Spotify
182. Guavus Embeds Apache Spark
into its Operational Intelligence Platform
Deployed at the World’s Largest Telcos
Eric Carr
databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-
into-its-operational-intelligence-platform-deployed-at-the-
worlds-largest-telcos.html
182
• 4 of 5 top mobile network operators, 3 of 5 top Internet
backbone providers, 80% MSOs in NorAm
• analyzing 50% of US mobile data traffic, +2.5 PB/day
• latency is critical for resolving operational issues before
they cascade: 2.5 MM transactions per second
• “analyze first” not “store first ask questions later”
Case Studies: Guavus
183. Case Studies: Radius Intelligence
183
From Hadoop to Spark in 4 months, Lessons Learned
Alexis Roos
http://youtu.be/o3-lokUFqvA
• building a full SMB index took 12+ hours using
Hadoop and Cascading
• pipeline was difficult to modify/enhance
• Spark increased pipeline performance 10x
• interactive shell and notebooks enabled data scientists
to experiment and develop code faster
• PMs and business development staff can use SQL to
query large data sets
186. Further Resources: Spark Packages
186
Looking for other libraries and features? There
are a variety of third-party packages available at:
http://spark-packages.org/
187. Further Resources: DBC Feedback
187
Other feedback, suggestions, etc.?
http://feedback.databricks.com/
189. confs:
Spark Summit East
NYC, Mar 18-19
spark-summit.org/east
QCon SP
São Paulo, Brazil, Mar 23-27
qconsp.com
Big Data Tech Con
Boston, Apr 26-28
bigdatatechcon.com
Strata EU
London, May 5-7
strataconf.com/big-data-conference-uk-2015
GOTO Chicago
Chicago, May 11-14
gotocon.com/chicago-2015
Spark Summit 2015
SF, Jun 15-17
spark-summit.org
191. books:
Fast Data Processing
with Spark
Holden Karau
Packt (2013)
shop.oreilly.com/product/
9781782167068.do
Spark in Action
Chris Fregly
Manning (2015)
sparkinaction.com/
Learning Spark
Holden Karau,
Andy Konwinski,
Matei Zaharia
O’Reilly (2015)
shop.oreilly.com/product/
0636920028512.do
192. About Databricks
• Founded by the creators of Spark in 2013
• Largest organization contributing to Spark
• End-to-end hosted service, Databricks Cloud
• http://databricks.com/