Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
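The points above (immutability, lineage, lazy transformations, and actions that trigger computation) can be illustrated with a small single-machine sketch. This is a toy model in plain Python, not Spark's API: the class name `ToyRDD` and its methods are hypothetical and exist only for illustration.

```python
from typing import Callable, Iterator, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's API)."""

    def __init__(self, source: Optional[List] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: str = "source",
                 fn: Optional[Callable[[Iterator], Iterator]] = None):
        self._source = source   # only set on the root dataset
        self._parent = parent   # dependency used to rebuild the data
        self._op = op           # name of the step, recorded for lineage
        self._fn = fn           # how to derive this dataset from its parent

    # Transformations: return a NEW ToyRDD and compute nothing yet.
    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, op="map",
                      fn=lambda it: (f(x) for x in it))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, op="filter",
                      fn=lambda it: (x for x in it if pred(x)))

    def lineage(self) -> List[str]:
        """Walk the dependency chain from the root to this dataset."""
        node, chain = self, []
        while node is not None:
            chain.append(node._op)
            node = node._parent
        return list(reversed(chain))

    def _iterate(self) -> Iterator:
        if self._parent is None:
            return iter(self._source)
        # Generators chain element by element, so map and filter are
        # pipelined without materializing intermediate collections.
        return self._fn(self._parent._iterate())

    # Action: triggers the actual computation.
    def collect(self) -> List:
        return list(self._iterate())


numbers = ToyRDD(source=[1, 2, 3, 4, 5])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.lineage())  # ['source', 'map', 'filter']
print(even_squares.collect())  # [4, 16]
```

Because each dataset only records its parent and the function that derives it, a lost result can be recomputed by replaying the chain, which is the idea behind lineage-based fault tolerance.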
The document discusses Apache Spark, an open source cluster computing framework for real-time data processing. It notes that Spark is up to 100 times faster than Hadoop for in-memory processing and 10 times faster on disk. The main feature of Spark is its in-memory cluster computing capability, which increases processing speeds. Spark runs on a driver-executor model and uses resilient distributed datasets and directed acyclic graphs to process data in parallel across a cluster.
Apache Spark in Depth: Core Concepts, Architecture & Internals (Anton Kirillov)
Slides cover core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks, and shuffle implementation, and also describe the architecture and main components of the Spark Driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo containing example Spark applications and a dockerized Hadoop environment to experiment with.
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E... (Edureka!)
This Edureka Spark SQL Tutorial will help you to understand how Apache Spark offers SQL power in real-time. This tutorial also demonstrates a use case on Stock Market Analysis using Spark SQL. Below are the topics covered in this tutorial:
1) Limitations of Apache Hive
2) Spark SQL Advantages Over Hive
3) Spark SQL Success Story
4) Spark SQL Features
5) Architecture of Spark SQL
6) Spark SQL Libraries
7) Querying Using Spark SQL
8) Demo: Stock Market Analysis With Spark SQL
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
Spark SQL Deep Dive @ Melbourne Spark Meetup (Databricks)
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level stuff, and can be used as an introduction to Apache Spark.
Apache Spark is a fast, general engine for large-scale data processing. It provides a unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
This presentation on Spark Architecture will give an idea of what Apache Spark is, the essential features in Spark, and the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark Architecture.
YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ
Spark is an open-source distributed computing framework used for processing large datasets. It allows for in-memory cluster computing, which enhances processing speed. Spark core components include Resilient Distributed Datasets (RDDs) and a directed acyclic graph (DAG) that represents the lineage of transformations and actions on RDDs. Spark Streaming is an extension that allows for processing of live data streams with low latency.
Spark is a cluster computing framework designed to be fast, general-purpose, and able to handle a wide range of workloads including batch processing, iterative algorithms, interactive queries, and streaming. It is faster than Hadoop for interactive queries and complex applications by running computations in-memory when possible. Spark also simplifies combining different processing types through a single engine. It offers APIs in Java, Python, Scala and SQL and integrates closely with other big data tools like Hadoop. Spark is commonly used for interactive queries on large datasets, streaming data processing, and machine learning tasks.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN or Mesos. This talk will cover a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn... (Simplilearn)
This presentation about Spark SQL will help you understand what Spark SQL is, along with its features, architecture, DataFrame API, data source API, and Catalyst optimizer, how to run SQL queries, and a demo on Spark SQL. Spark SQL is Apache Spark's module for working with structured and semi-structured data. It originated to overcome the limitations of Apache Hive. Now, let us get started and understand Spark SQL in detail.
Below topics are explained in this Spark SQL presentation:
1. What is Spark SQL?
2. Spark SQL features
3. Spark SQL architecture
4. Spark SQL - Dataframe API
5. Spark SQL - Data source API
6. Spark SQL - Catalyst optimizer
7. Running SQL queries
8. Spark SQL demo
This Apache Spark and Scala certification training is designed to advance your expertise working with the Big Data Hadoop Ecosystem. You will master essential skills of the Apache Spark open source framework and the Scala programming language, including Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Shell Scripting Spark. This Scala Certification course will give you vital skillsets and a competitive advantage for an exciting career as a Hadoop Developer.
Adventures in Timespace - How Apache Flink Handles Time and Windows (Aljoscha Krettek)
If you are in the business of processing a stream of events, you sooner or later come upon different notions of time. There is processing time, the current time of the machine your program is running on, and event time, the local time at which an event occurred.
In this talk we will look at why this distinction is relevant and also how Flink manages to work with these different ideas of time. We will look at how Flink tracks the progress of time and how you can employ windows to perform aggregating operations on an infinite stream of events.
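To make the distinction concrete, here is a minimal sketch in plain Python (not Flink's API) that sums values in fixed-size, non-overlapping windows, keyed either by event time or by processing time. The function name, the tuple layout, and the sample events are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (event_time_s, processing_time_s, value)
Event = Tuple[int, int, int]


def tumbling_window_sums(events: Iterable[Event], size_s: int,
                         use_event_time: bool) -> Dict[int, int]:
    """Sum values per fixed-size window, keyed by the window start time."""
    sums: Dict[int, int] = defaultdict(int)
    for event_time, processing_time, value in events:
        ts = event_time if use_event_time else processing_time
        window_start = (ts // size_s) * size_s
        sums[window_start] += value
    return dict(sorted(sums.items()))


events = [
    (1, 2, 10),   # occurred at t=1, processed at t=2
    (12, 13, 5),  # occurred at t=12, processed at t=13
    (8, 14, 7),   # occurred at t=8 but arrived late, processed at t=14
]

print(tumbling_window_sums(events, 10, use_event_time=True))   # {0: 17, 10: 5}
print(tumbling_window_sums(events, 10, use_event_time=False))  # {0: 10, 10: 12}
```

The late event lands in the first window under event time but in the second under processing time, which is why the two notions can give different aggregates for the same stream. A real engine also has to decide how long to wait for late events before emitting a window; this sketch sidesteps that by seeing all events up front.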
Presentation slides for the paper on Resilient Distributed Datasets, written by Matei Zaharia et al. at the University of California, Berkeley.
The paper is not my work.
These slides were made for the course on Advanced, Distributed Systems held by prof. Bratsberg at NTNU (Norwegian University of Science and Technology, Trondheim, Norway).
Ayasdi presentation in Intel's pavilion @Strata 2015 (San Jose), highlighting Ayasdi's approach to analyzing large complex data and our integration into the Hadoop ecosystem.
Wes McKinney gave a talk at the 2015 Open Data Science Conference about data frames and the state of data frame interfaces across different languages and libraries. He discussed the challenges of collaboration between different data frame communities due to the tight coupling of user interfaces, data representations, and computation engines in current data frame implementations. McKinney predicted that over time these components would decouple and specialize, improving code sharing across languages.
The document discusses different data frame interfaces, including their strengths and weaknesses. It describes R data frames as having a thin layer on top of R lists with simple column/row selection. Key R packages like dplyr and data.table add functionality. Spark DataFrames provide a pandas-inspired API for tabular data manipulation across languages. While progressing towards decoupling, interfaces still bind users to their specific systems. The author advocates for quality tools forged through real-world usage.
This document presents Resilient Distributed Datasets (RDDs), a fault-tolerant abstraction for in-memory cluster computing introduced by Spark. RDDs allow programmers to perform iterative and interactive computations over large datasets in a fault-tolerant manner. RDDs are distributed immutable collections of records that can be operated on through transformations and actions. They track the lineage of transformations to allow recovering lost data partitions. This provides an efficient abstraction for iterative algorithms compared to MapReduce.
This presentation will be useful to those who would like to get acquainted with Apache Spark architecture and its top features and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases related to one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.
Presented at the Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
The document discusses various benchmarks that are commonly used to evaluate Semantic Web repositories and their performance handling large amounts of RDF data. Some of the major benchmarks mentioned include the Lehigh University Benchmark (LUBM), Berlin SPARQL Benchmark (BSBM), SP2Bench, Social Network Intelligence Benchmark (SIB), and DBPedia SPARQL Benchmark. The document also provides an overview of different benchmark components and links to resources with performance results from various RDF stores and systems.
The document is a presentation by Jongwook Woo from the High-Performance Information Computing Center (HiPIC) at California State University Los Angeles given on February 25, 2017 at the SWRC conference in San Diego, CA. It discusses big data trends with open platforms and provides information on Spark, Hadoop, open data, use cases, and the future of big data. Specifically, it summarizes Jongwook Woo's background and experience, describes what big data is and how Spark improves on Hadoop MapReduce, discusses how Spark can integrate with Hadoop ecosystems, and provides examples of analyzing local business data using Spark.
Deep Dive: Spark Data Frames, SQL and Catalyst Optimizer (Sachin Aggarwal)
RDD recap
Spark SQL library
Architecture of Spark SQL
Comparison with Pig and Hive Pipeline
DataFrames
Definition of a DataFrames API
DataFrames Operations
DataFrames features
Data cleansing
Diagram for logical plan container
Plan Optimization & Execution
Catalyst Analyzer
Catalyst Optimizer
Generating Physical Plan
Code Generation
Extensions
A lot of data scientists use the Python library pandas for quick exploration of data. The most useful construct in pandas (based on R, I think) is the dataframe, which is a 2D array (aka matrix) with the option to “name” the columns (and rows). But pandas is not distributed, so there is a limit on the data size that can be explored.
Spark is a great MapReduce-like framework that can handle very big data by using a shared-nothing cluster of machines.
This work is an attempt to provide a pandas-like DSL on top of Spark, so that data scientists familiar with pandas have a very gradual learning curve.
Big Data Day LA 2016 Keynote - Reynold Xin / Databricks (Data Con LA)
This document discusses scaling big data using Apache Spark. It provides an overview of Spark's philosophy of providing a unified engine to support end-to-end applications using high-level APIs. It outlines some of the new features in Apache Spark 2.0, including improvements to structured APIs, structured streaming, and new deep learning and graph processing libraries. It also discusses initiatives by Databricks to grow the Spark community through massive open online courses and a free community edition of the Databricks platform.
The document discusses loading data into Spark SQL and the differences between DataFrame functions and SQL. It provides examples of loading data from files, cloud storage, and directly into DataFrames from JSON and Parquet files. It also demonstrates using SQL on DataFrames after registering them as temporary views. The document outlines how to load data into RDDs and convert them to DataFrames to enable SQL querying, as well as using SQL-like functions directly in the DataFrame API.
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks (Data Con LA)
Spark DataFrames provide a unified data structure and API for distributed data processing across Python, R and Scala. DataFrames allow users to manipulate distributed datasets using familiar data frame concepts from single-machine tools like Pandas and dplyr. The DataFrame API builds a logical query plan that the Catalyst optimizer optimizes for efficient execution, regardless of the source language, on execution engines like Tungsten.
Raising the Tides: Open Source Analytics for Data Science (Wes McKinney)
The document discusses trends in open source analytics for data science. It notes that industry giants are opening core AI and machine learning technologies. There is also open source "disruption" in data science languages and tools. Two Sigma aims to build a collaborative data science platform through open source contributions to scale access to data and computational capabilities while enhancing productivity and collaboration. Two Sigma participates in open source to drive innovation, increase value of proprietary systems, raise awareness of challenges at scale, and attract talent. Areas of investment include Apache Arrow, Parquet, Pandas, and projects for resource management, distributed computing, and collaboration.
This document discusses using Fluentd to collect streaming data from Apache Kafka. It presents two approaches: 1) the fluent-plugin-kafka plugin which allows Fluentd to act as a producer and consumer of Kafka topics, and 2) the kafka-fluentd-consumer project which runs a standalone Kafka consumer that sends events to Fluentd. Configuration examples are provided for both approaches. The document concludes that Fluentd and Kafka can work together to build reliable and flexible data pipelines.
Spark 2.0 is a major release of Apache Spark. This release brought many changes to Spark's APIs and libraries. In this KnolX, we look at some improvements made in Spark 2.0. These slides also introduce some new features in Spark 2.0, such as the SparkSession API and Structured Streaming.
The document provides an overview of various machine learning algorithms and methods. It begins with an introduction to predictive modeling and supervised vs. unsupervised learning. It then describes several supervised learning algorithms in detail including linear regression, K-nearest neighbors (KNN), decision trees, random forest, logistic regression, support vector machines (SVM), and naive Bayes. It also briefly discusses unsupervised learning techniques like clustering and dimensionality reduction methods.
Extending Spark Streaming to Support Complex Event Processing (Oh Chan Kwon)
In this talk, we introduce extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic, seamless resource allocation.
First, we explain the methods of supporting window queries and query chains. Last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL”, which can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. Building on their efforts, we made advances in supporting complex event processing (CEP). In detail, we implemented the sliding window concept to support time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes partial results by evaluating only the data entering or leaving the window, and then gets the current result by merging the previous result with the partial ones. Next, to support query chains, we made the result of a query over streaming data a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other queries.
Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, Spark's current resource allocation features are not suitable for streaming applications. That is, the allocated resources will not be freed while new data keep arriving at the streaming applications, even when the quantity of new data is very small. To resolve the problem, we consider the applications' resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we stop feeding new data into the victims so that no useless recovery is triggered when they are killed. Accordingly, we can scale the resources in and out seamlessly.
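The sliding-window optimization mentioned in this abstract (updating the previous result with only the data entering or leaving the window) can be sketched as follows. This is a plain-Python illustration of the general idea for a sum aggregate, not the authors' query planner; the function name and the `(timestamp, value)` layout are assumptions.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def sliding_sums(events: Iterable[Tuple[int, int]],
                 window_s: int) -> List[Tuple[int, int]]:
    """For each event, emit the sum of values with timestamps in (t - window_s, t].

    `events` are (timestamp_s, value) pairs sorted by timestamp.
    """
    window: Deque[Tuple[int, int]] = deque()
    running = 0
    results = []
    for ts, value in events:
        # Data entering the window: add it to the previous result.
        window.append((ts, value))
        running += value
        # Data leaving the window: subtract it from the previous result.
        while window and window[0][0] <= ts - window_s:
            _, expired = window.popleft()
            running -= expired
        results.append((ts, running))
    return results


events = [(1, 1), (2, 2), (3, 3), (6, 4), (7, 5)]
print(sliding_sums(events, window_s=3))
# [(1, 1), (2, 3), (3, 6), (6, 4), (7, 9)]
```

Each event is added once and removed once, so the cost per update does not grow with the window size. This shortcut works for aggregates that can be "undone", such as sum or count; aggregates like max need a different structure.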
This document summarizes a presentation on extending Spark Streaming to support complex event processing. It discusses:
1) Motivations for supporting CEP in Spark Streaming, as current Spark is not enough to support continuous query languages or auto-scaling of resources.
2) Proposed solutions including extending Intel's Streaming SQL package, improving windowed aggregation performance, supporting "Insert Into" queries to enable query chains, and implementing elastic resource allocation through auto-scaling in/out of resources.
3) Evaluation of the Streaming SQL extensions showing low processing delays despite heavy loads or large windows, though more memory optimization is needed.
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning (Chris Fregly)
This document provides an overview and summary of Spark Streaming. It discusses Spark Streaming's architecture and APIs. Spark Streaming receives live input data streams and divides them into micro-batches, which it processes using Spark's execution engine to perform operations like transformations and actions. This allows for low-latency, high-throughput stream processing with fault tolerance. The document also covers Spark Streaming deployment and integrating it with sources like Kinesis, as well as monitoring and tuning Spark Streaming applications.
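The micro-batch model described above can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not Spark Streaming's API: `micro_batches`, `run_stream`, and `count_words` are hypothetical names, and the input is assumed to be sorted by arrival time.

```python
from collections import Counter
from itertools import groupby
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[int, str]  # hypothetical layout: (arrival_time_ms, payload)


def micro_batches(records: Iterable[Record],
                  batch_interval_ms: int) -> Iterator[Tuple[int, List[str]]]:
    """Split a stream sorted by arrival time into per-interval batches.

    Note: intervals with no records yield no batch in this toy version.
    """
    def batch_start(record: Record) -> int:
        return (record[0] // batch_interval_ms) * batch_interval_ms

    for start, group in groupby(records, key=batch_start):
        yield start, [payload for _, payload in group]


def run_stream(records: Iterable[Record], batch_interval_ms: int,
               batch_job: Callable[[List[str]], Dict[str, int]]):
    """Apply an ordinary batch function to every micro-batch."""
    return [(start, batch_job(batch))
            for start, batch in micro_batches(records, batch_interval_ms)]


def count_words(lines: List[str]) -> Dict[str, int]:
    return dict(Counter(word for line in lines for word in line.split()))


stream = [(200, "a b"), (900, "b"), (1100, "c"), (2500, "a a")]
for start, counts in run_stream(stream, 1000, count_words):
    print(start, counts)
```

The point of the design is that `count_words` is an ordinary batch function: the streaming layer only decides which records belong to which batch, so batch code can be reused for streaming.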
An engine to process big data in a faster (than MapReduce), easier, and extremely scalable way. An open-source, parallel, in-memory cluster computing framework. A solution for loading, processing, and end-to-end analysis of large-scale data. Iterative and interactive: Scala, Java, Python, R, and a command-line interface.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo... (Guido Schmutz)
Spark Streaming and Kafka Streams are two popular stream processing platforms. Spark Streaming uses micro-batching and allows for code reuse between batch and streaming jobs. Kafka Streams is a library embedded directly into the application and leverages Apache Kafka as its internal messaging layer. Both platforms support stateful stream processing operations like windowing, aggregations, and joins through distributed state stores. A demo application is shown that detects dangerous driving by joining truck position data with driver data using different streaming techniques.
StreamSets can process data using Apache Spark in three ways:
1) The Spark Evaluator stage allows user-provided Spark code to run on each batch of records in a pipeline and return results or errors.
2) A Cluster Pipeline can leverage Apache Spark's Direct Kafka DStream to partition data from Kafka across worker pipelines on a cluster.
3) A Spark Executor can kick off a Spark application when an event is received, allowing tasks like model updating to run on streaming data using Spark.
Unified Batch & Stream Processing with Apache Samza (DataWorks Summit)
The traditional lambda architecture has been a popular solution for joining offline batch operations with real time operations. This setup incurs a lot of developer and operational overhead since it involves maintaining code that produces the same result in two, potentially different distributed systems. In order to alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experiences running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) Pluggable data sources and sinks; b) A deployment model supporting different execution environments such as Yarn or VMs; c) A unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of lambda architecture. We will use some real production use-cases to elaborate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Spark Streaming provides fault-tolerance through checkpointing and write ahead logs (WAL). Checkpointing saves metadata and generated RDDs to reliable storage to recover from driver failures. WAL saves all received data to log files to enable zero data loss recovery from executor failures. Structured Streaming uses checkpointing for fault-tolerance. Kafka achieves fault-tolerance through replication of partitions across brokers. Flume uses durable file channels and redundant topologies. HDFS replicates blocks across multiple machines. The Lambda architecture handles batch and real-time data through separate batch and speed layers that are merged in the serving layer.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A presentation and workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web clickstream example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Data Con LA
Abstract:-
With its easy to use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis etc, Spark Streaming has become go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and even do exactly once processing!
Bio:-
Hari Shreedharan is a PMC member and committer on the Apache Flume Project. As a PMC member, he is involved in making decisions on the direction of the project. Author of the O’Reilly book Using Flume, Hari is also a software engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters, by helping them resolve any issues they are facing.
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
Hari Shreedharan/Cloudera @Playtika. With its easy to use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis etc, Spark Streaming has become go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and even do exactly once processing!
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It implements Google's MapReduce programming model and the Hadoop Distributed File System (HDFS) for reliable data storage. Key components include a JobTracker that coordinates jobs, TaskTrackers that run tasks on worker nodes, and a NameNode that manages the HDFS namespace and DataNodes that store application data. The framework provides fault tolerance, parallelization, and scalability.
What no one tells you about writing a streaming apphadooparchbook
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
So you know you want to write a streaming app but any non-trivial streaming app developer would have to think about these questions:
How do I manage offsets?
How do I manage state?
How do I make my spark streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shutdown my streaming job?
How do I monitor and manage (e.g. re-try logic) streaming job?
How can I better manage the DAG in my streaming job?
When to use checkpointing and for what? When not to use checkpointing?
Do I need a WAL when using streaming data source? Why? When don’t I need one?
In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
Spark streaming provides stream processing functionality as an abstraction over core Spark. It processes data in micro-batches, where streaming data is buffered for a given interval and then processed as RDDs by core Spark. The processing time of each micro-batch must be less than the batch interval to avoid bottlenecks. Performance can be tuned by parallelizing data consumption from sources like Kafka, and balancing Spark partitions for parallel processing with available cores.
This document provides an overview of effective big data visualization. It discusses information visualization and data visualization, including common chart types like histograms, scatter plots, and dashboards. It covers visualization goals, considerations, processes, basics, and guidelines. Examples of good visualization are provided. Tools for creating infographics are listed, as are resources for learning more about data visualization and references. Overall, the document serves as a comprehensive introduction to big data visualization.
Graph databases store data in graph structures with nodes, edges, and properties. Neo4j is a popular open-source graph database that uses a property graph model. It has a core API for programmatic access, indexes for fast lookups, and Cypher for graph querying. Neo4j provides high availability through master-slave replication and scales horizontally by sharding graphs across instances through techniques like cache sharding and domain-specific sharding.
This document discusses information retrieval techniques. It begins by defining information retrieval as selecting the most relevant documents from a large collection based on a query. It then discusses some key aspects of information retrieval including document representation, indexing, query representation, and ranking models. The document also covers specific techniques used in information retrieval systems like parsing documents, tokenization, removing stop words, normalization, stemming, and lemmatization.
This document provides an overview of natural language processing (NLP). It discusses topics like natural language understanding, text categorization, syntactic analysis including parsing and part-of-speech tagging, semantic analysis, and pragmatic analysis. It also covers corpus-based statistical approaches to NLP, measuring performance, and supervised learning methods. The document outlines challenges in NLP like ambiguity and knowledge representation.
This document provides an overview of the Natural Language Toolkit (NLTK), a Python library for natural language processing. It discusses NLTK's modules for common NLP tasks like tokenization, part-of-speech tagging, parsing, and classification. It also describes how NLTK can be used to analyze text corpora, frequency distributions, collocations and concordances. Key functions of NLTK include tokenizing text, accessing annotated corpora, analyzing word frequencies, part-of-speech tagging, and shallow parsing.
This document provides an overview of NoSQL databases and summarizes key information about several NoSQL databases, including HBase, Redis, Cassandra, MongoDB, and Memcached. It discusses concepts like horizontal scalability, the CAP theorem, eventual consistency, and data models used by different NoSQL databases like key-value, document, columnar, and graph structures.
This document provides an overview of recommender systems for e-commerce. It discusses various recommender approaches including collaborative filtering algorithms like nearest neighbor methods, item-based collaborative filtering, and matrix factorization. It also covers content-based recommendation, classification techniques, addressing challenges like data sparsity and scalability, and hybrid recommendation approaches.
This document provides an overview of the statistical programming language R. It discusses key R concepts like data types, vectors, matrices, data frames, lists, and functions. It also covers important R tools for data analysis like statistical functions, linear regression, multiple regression, and file input/output. The goal of R is to provide a large integrated collection of tools for data analysis and statistical computing.
This document provides an overview of the Python programming language. It discusses Python's history and evolution, its key features like being object-oriented, open source, portable, having dynamic typing and built-in types/tools. It also covers Python's use for numeric processing with libraries like NumPy and SciPy. The document explains how to use Python interactively from the command line and as scripts. It describes Python's basic data types like integers, floats, strings, lists, tuples and dictionaries as well as common operations on these types.
The document provides an overview of functional programming, including its key features, history, differences from imperative programming, and examples using Lisp and Scheme. Some of the main points covered include:
- Functional programming is based on evaluating mathematical functions rather than modifying state through assignments.
- It uses recursion instead of loops and treats functions as first-class objects.
- Lisp was the first functional language in 1960 and introduced many core concepts like lists and first-class functions. Scheme was developed in 1975 as a simpler dialect of Lisp.
- Functional programs are more focused on what to compute rather than how to compute it, making them more modular and easier to reason about mathematically.
5. Spark Streaming
• Extends Spark for big data stream processing
• Efficient, fault-tolerant, stateful stream processing of live stream data
• Integrates with Spark’s batch and interactive processing
• Scales to hundreds of nodes
• Can achieve latencies on scale of seconds
6. Spark Streaming
• Can absorb live data streams from Kafka, Flume, ZeroMQ etc
• Simple batch-like API to implement complex algorithms
• Integrates with other Spark extensions
• Started in 2012, alpha released with Spark 0.7 in 2013, released with Spark
0.9 in 2014
7. Need for Spark Streaming
• Existing frameworks can either
– Stream process 100s of MBs with low latency
– Batch process TBs of data with high latency
• Painful to maintain two different stacks
– Different programming models
– Doubles implementation effort
8. Need for Spark Streaming
• Many applications must process large streams of live data and provide
results in near-real-time
– Social network trends
– Website statistics
– Intrusion detection systems
• Many environments require processing same data in live streaming as
well as batch post-processing
9. Micro batch
• Spark Streaming is a fast batch processing system
• It collects stream data into small batches and runs batch processing on each one
• A batch can be as small as 1 second or as big as multiple hours
• Spark job creation and execution overhead is low enough that a batch can be scheduled and run in under a second
• The resulting sequence of batches is called a DStream (discretized stream)
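The micro-batch idea can be sketched in plain Python (this is not the Spark API; `discretize` and the sample events are hypothetical names used only for illustration). Timestamped records are grouped by batch interval, and each resulting list stands in for one batch:

```python
from collections import defaultdict

def discretize(records, batch_interval):
    """Group (timestamp_seconds, value) records into consecutive micro-batches."""
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    # Emit every interval in order, including empty ones, like a steady batch clock.
    return [batches.get(i, []) for i in range(min(batches), max(batches) + 1)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
stream = discretize(events, batch_interval=1.0)
```

Intervals with no data still produce an (empty) batch, which mirrors a job being scheduled on every tick of the batch interval.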
10. Stateful Stream Processing
• Traditional streaming systems have an event-driven, record-at-a-time
processing model
– Each node has mutable state
– For each record, update state & send new records
• State is lost if node dies
• Making stateful stream processing fault-tolerant is a challenge
12. Streaming System - Storm
• Replays record if not processed by a node
• Processes each record at least once
• May update mutable state twice
• Mutable state can be lost due to failure
13. Streaming System - Trident
• Uses transactions to update state
• Processes each record exactly once
• Per-state transaction updates are slow
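A simplified, plain-Python illustration of the difference between the two slides above: with at-least-once delivery a replayed batch updates mutable state twice, while storing a transaction id next to the state lets a replay be skipped. The class names are hypothetical, and the sketch assumes batches are applied in order; it is not Storm or Trident code.

```python
class AtLeastOnceCounter:
    """Applies every delivery, so a replayed batch is counted twice."""
    def __init__(self):
        self.total = 0

    def apply(self, txid, amount):
        self.total += amount


class TransactionalCounter:
    """Keeps the last applied transaction id with the value and skips replays."""
    def __init__(self):
        self.total = 0
        self.last_txid = None

    def apply(self, txid, amount):
        if txid == self.last_txid:
            return  # replay of a batch that was already applied
        self.total += amount
        self.last_txid = txid


# Batch 2 is delivered twice, as it would be after a failure and replay.
deliveries = [(1, 10), (2, 5), (2, 5), (3, 1)]
naive, transactional = AtLeastOnceCounter(), TransactionalCounter()
for txid, amount in deliveries:
    naive.apply(txid, amount)
    transactional.apply(txid, amount)
```

The extra bookkeeping per state update is the cost the slide refers to.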
14. Spark Streaming
• Runs a streaming computation as a series of very small deterministic
batch jobs
• Splits the live stream into batches of X seconds
• Spark treats each batch of data as an RDD and processes it using RDD
operations
• Processed results of RDD operations are returned in batches
16. Spark Streaming
• Runs as a series of small (~1 s) batch jobs, keeping state in memory as
fault-tolerant RDDs
• Batch sizes as low as 0.5 second, latency ~ 1 second
• Potential for combining batch processing and streaming processing in the
same system
• Result: can process 42 million records/second (4 GB/s) on 100 nodes at
sub-second latency
18. Streaming
• Creates RDDs from stream source on a defined interval
• Same operation as normal RDDs
• Supports a variety of sources
• Exactly once message guarantee
19. Discretized Stream - DStream
• Basic abstraction provided by Spark Streaming
• Input stream is divided into multiple discrete batches
• Represents a stream of data
• Implemented as a sequence of RDDs
• Each batch of DStream is represented as RDD
underneath
20. Discretized Stream - DStream
• These RDDs are replicated in the cluster for fault tolerance
• Every DStream operation results in an RDD transformation
• APIs are provided to access these RDDs directly
• Can combine stream and batch processing
• Configurable intervals - 1 second, 5 second, 5 minutes
etc.
22. DStream transformation
• val ssc = new StreamingContext(args(0), "wordcount", Seconds(5))
• val lines = ssc.socketTextStream("localhost", 50050)
• val words = lines.flatMap(_.split(" "))
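What the `flatMap` above does to each batch can be mimicked in plain Python, with a list of lists standing in for the DStream. The helper `flat_map` and the sample lines are assumptions for illustration, not part of Spark:

```python
from collections import Counter

def flat_map(batch, fn):
    """Apply fn to every record and flatten the results, one batch at a time."""
    return [out for record in batch for out in fn(record)]

# Each inner list is the data received during one batch interval.
batches_of_lines = [["to be or", "not to be"], [], ["be quick"]]
batches_of_words = [flat_map(batch, lambda line: line.split(" "))
                    for batch in batches_of_lines]
word_counts = [Counter(batch) for batch in batches_of_words]
```

The same function runs independently on every batch, which is why a DStream operation maps onto one RDD transformation per interval.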
23. Socket Stream
• Ability to listen to any socket on remote machines
• Need to configure host and port
• Both raw and text representations of the socket are available
• Built in retry mechanism
24. File Stream
• Allows tracking new files in a given directory on HDFS
• Whenever a new file appears, Spark Streaming picks it up
• Only works for new files; modifications to existing files are not
considered
• Tracked using file modification time
26. Stateful Operations
• Ability to maintain arbitrary state across multiple batches
• Fault tolerant
• Exactly once semantics
• WAL (Write Ahead Log) for receiver crashes
27. How Stateful Operations Work
• Generally, state is updated by mutation
• But in functional programming, state is represented as a state machine
going from one state to another
• fn(oldState,newInfo) => newState
• In Spark, state is represented using RDD
• Change in the state is represented using transformations of RDDs
• Fault tolerance of RDD helps in fault tolerance of state
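The `fn(oldState, newInfo) => newState` pattern can be sketched in plain Python as a running count per key. This only imitates the idea behind `updateStateByKey`; `update_state` and `run_batches` are hypothetical names, and dictionaries stand in for RDDs:

```python
def update_state(old_state, new_values):
    """fn(oldState, newInfo) => newState: returns a new count, never mutates."""
    return (old_state or 0) + sum(new_values)

def run_batches(batches):
    """Fold every batch into the state, producing a new state dict per batch."""
    state = {}
    history = []
    for batch in batches:
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # The next state is built from the previous one instead of updated in place.
        state = {key: update_state(state.get(key), grouped.get(key, []))
                 for key in set(state) | set(grouped)}
        history.append(state)
    return history

batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 2), ("c", 1)]]
history = run_batches(batches)
```

Because each state is derived from the previous state plus one batch, any state can be recomputed from its inputs, which is the property the fault-tolerance bullet relies on.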
28. Transform API
• In stream processing, ability to combine stream data with batch data is
extremely important
• Both batch API and stream API share RDD as abstraction
• The transform API of DStream gives direct access to the underlying RDDs
• Example - Combine customer sales data with customer information
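The customer example can be sketched in plain Python: a per-batch function joins each batch of sales with a static lookup table, in the spirit of `transform`. The data and the `enrich` helper are made up for illustration:

```python
# Batch data: customer id -> name (hypothetical sample values).
customers = {1: "Asha", 2: "Ben"}

def enrich(batch, lookup):
    """Join one batch of (customer_id, amount) sales with customer information."""
    return [(lookup.get(customer_id, "unknown"), amount)
            for customer_id, amount in batch]

# Each inner list is one batch of the sales stream.
sales_stream = [[(1, 20.0), (2, 5.0)], [(3, 7.5)]]
enriched_stream = [enrich(batch, customers) for batch in sales_stream]
```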
32. DStream Creation via Transformation
• Data collected, buffered and replicated by receiver (one per DStream) and then
pushed to a stream as small RDDs
• Transformations modify data from one DStream to another
• Classifications
– Standard RDD operations – map, countByValue, reduceByKey, join,…
– Stateful operations – window, updateStateByKey, transform,
countByValueAndWindow, …
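A windowed count in the style of `countByValueAndWindow` can be sketched in plain Python. As a simplifying assumption, the window length and slide interval are measured in batches rather than seconds, and the function below is an illustration rather than the Spark implementation:

```python
from collections import Counter

def count_by_value_and_window(batches, window_length, slide_interval):
    """Count values over a sliding window of the most recent batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        results.append(Counter(value for batch in window for value in batch))
    return results

batches = [["a"], ["a", "b"], ["b"], ["c"]]
windowed = count_by_value_and_window(batches, window_length=2, slide_interval=1)
```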
36. Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Integrated with the Spark stack
• Supports querying data either via SQL or via the Hive Query Language
• Originated as the Apache Hive port to run on top of Spark (in place of MapReduce)
• Can weave SQL queries with code transformations
37. Spark SQL
• Can expose Spark datasets over a JDBC API and run SQL-like queries on Spark
data using traditional BI and visualization tools
• Allows users to ETL their data from different formats such as JSON, Parquet, or a
database, transform it, and expose it for ad-hoc querying
• Bindings in Python, Scala, and Java
40. SQL Access to Structured Data
• Existing RDDs
• Hive warehouses (uses existing metastore, SerDes and UDFs)
• JDBC/ODBC - use existing BI tools to query large datasets
41. DataFrame
• A distributed collection of data rows organized into named columns
• An abstraction for selecting, filtering, aggregating, and plotting structured data
• Conceptually equivalent to a table in a relational database or a data frame
in R/Python, but with richer optimizations under the hood
• Constructed from sources
– Structured data files
– Hive tables
– External databases
– Existing RDDs
42. DataFrame Internals
• Internally represented as a logical plan
• Lazy execution - computation only happens when an action (display
result, save output) is required
– Allows executions to be optimized by applying techniques such as predicate
push-downs and bytecode generation
• All DataFrame operations are also automatically parallelized and
distributed on clusters
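Lazy execution can be illustrated with a toy class in plain Python: transformations only append to a plan, and rows are touched when an action (`collect`) runs. `LazyFrame` is a hypothetical name and a deliberately tiny stand-in, not how Spark represents a logical plan:

```python
class LazyFrame:
    """Records operations as a plan; rows are only processed when an action runs."""
    def __init__(self, rows, plan=()):
        self._rows = rows
        self.plan = plan

    def filter(self, predicate):
        return LazyFrame(self._rows, self.plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self.plan + (("select", columns),))

    def collect(self):
        """Action: execute the recorded plan step by step."""
        rows = self._rows
        for op, arg in self.plan:
            if op == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                rows = [{column: row[column] for column in arg} for row in rows]
        return rows

users = LazyFrame([{"name": "ann", "age": 19}, {"name": "bob", "age": 30}])
young = users.filter(lambda row: row["age"] < 21).select("name")
result = young.collect()
```

Since the whole plan is known before anything runs, an optimizer has the chance to rewrite it first.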
43. DataFrame Construction - Python code
• # Construct a DataFrame from the users table in Hive
– users = context.table("users")
• # from JSON files in S3
– logs = context.load("s3n://path/to/data.json", "json")
• DataFrames provide a domain-specific language for distributed data
manipulation
44. Using DataFrames
• # Create a new DataFrame that contains “young users” only
– young = users.filter(users.age < 21)
• # Alternatively, using Pandas-like syntax
– young = users[users.age < 21]
• # Increment everybody’s age by 1
– young.select(young.name, young.age + 1)
45. Using DataFrames
• # Count the number of young users by gender
– young.groupBy("gender").count()
• # Join young users with another DataFrame called logs
– young.join(logs, logs.userId == users.userId, "left_outer")
• # SQL using Spark SQL - Count number of users in the young DataFrame
– young.registerTempTable("young")
– context.sql("SELECT count(*) FROM young")
46. Spark and Pandas - Conversion
• # Convert Spark DataFrame to Pandas
– pandas_df = young.toPandas()
• # Create a Spark DataFrame from Pandas
– spark_df = context.createDataFrame(pandas_df)
47. DataFrame API
• Common operations can be expressed as calls to the DataFrame API
– Selecting required columns
– Joining different data sources
– Aggregation (count, sum, average, etc)
– Filtering
48. Supported Data Formats and Sources
1. JSON files
2. Parquet files
3. Hive tables
4. Local file systems
5. Distributed file systems (HDFS)
6. Cloud storage (S3)
7. External RDBMS via JDBC
8. Extend DataFrames through Spark SQL’s external data sources API to support any third-party data formats or sources
9. Existing third-party extensions - Avro, CSV, ElasticSearch, and Cassandra
49. Combine Multiple Sources
• Join a site’s textual traffic log stored in S3 with a PostgreSQL database to
count the number of times each user has visited the site
– users = context.jdbc("jdbc:postgresql:production", "users")
– logs = context.load("/path/to/traffic.log")
– logs.join(users, logs.userId == users.userId, "left_outer").groupBy("userId").agg({"*": "count"})
50. Automatic Mechanisms to Read Less Data
• Converting to more efficient formats
• Using columnar formats (parquet)
• Using partitioning (/year=2014/month=02/…)
• Skipping data using statistics (min, max...)
• Pushing predicates into storage systems (JDBC)
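Skipping data using statistics can be sketched in plain Python: each block carries its min and max, and a range query only reads blocks whose range overlaps the predicate. The block layout and the `scan` function are assumptions for illustration:

```python
def scan(blocks, lo, hi):
    """Return values in [lo, hi], reading only blocks whose min/max range overlaps."""
    blocks_read = 0
    matches = []
    for block in blocks:
        if block["max"] < lo or block["min"] > hi:
            continue  # statistics prove that no row in this block can match
        blocks_read += 1
        matches.extend(v for v in block["values"] if lo <= v <= hi)
    return matches, blocks_read

blocks = [
    {"min": 1, "max": 9, "values": [1, 4, 9]},
    {"min": 10, "max": 19, "values": [10, 15, 19]},
    {"min": 20, "max": 29, "values": [20, 29]},
]
matches, blocks_read = scan(blocks, 12, 21)
```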
51. Intelligent Optimization and Code Generation
• DataFrames in Spark have their execution automatically optimized by a
query optimizer
• Before any computation on a DataFrame starts, the Catalyst optimizer
compiles the operations that were used to build the DataFrame into a
physical plan for execution
• Because the optimizer understands the semantics of operations and
structure of the data, it can make intelligent decisions to speed up
computation
52. Intelligent Optimization and Code Generation
• At a high level, there are two types of optimizations
• Catalyst applies logical optimizations such as predicate pushdown
• The optimizer can push filter predicates down into the data source,
enabling the physical execution to skip irrelevant data
• In the case of Parquet files, entire blocks can be skipped and comparisons
on strings can be turned into cheaper integer comparisons via dictionary
encoding
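The dictionary-encoding point can be illustrated in plain Python: strings are replaced by small integer codes, so an equality filter compares integers, and a value missing from the dictionary means the whole block can be skipped. This is a conceptual sketch with hypothetical function names, not the Parquet format itself:

```python
def dictionary_encode(values):
    """Replace each distinct string with a small integer code."""
    dictionary = {}
    codes = []
    for value in values:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes

def filter_equal(dictionary, codes, needle):
    """Return row positions equal to needle, comparing integer codes only."""
    code = dictionary.get(needle)
    if code is None:
        return []  # value absent from the dictionary: the block can be skipped
    return [i for i, c in enumerate(codes) if c == code]

dictionary, codes = dictionary_encode(["IN", "US", "IN", "DE", "US"])
us_rows = filter_equal(dictionary, codes, "US")
```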
53. Intelligent Optimization and Code Generation
• In the case of relational databases, predicates are pushed down into the
external databases to reduce the amount of data traffic
• Catalyst compiles operations into physical plans for execution and
generates JVM bytecode for those plans that is often more optimized
than hand-written code
• It can choose intelligently between broadcast joins and shuffle joins to
reduce network traffic
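The broadcast-versus-shuffle choice can be sketched in plain Python. The row-count threshold and both function names are hypothetical; the real optimizer uses its own size estimates, so treat this as an illustration of the idea only:

```python
def choose_join(left_rows, right_rows, broadcast_threshold):
    """Pick 'broadcast' when the smaller side is small enough to copy to every node."""
    smaller = min(left_rows, right_rows)
    return "broadcast" if smaller <= broadcast_threshold else "shuffle"

def broadcast_hash_join(large, small):
    """Join (key, value) rows by building a hash table from the small side."""
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # The large side is streamed once and never moved across the network.
    return [(key, large_value, small_value)
            for key, large_value in large
            for small_value in lookup.get(key, [])]

logs = [(1, "/home"), (2, "/cart"), (1, "/pay"), (3, "/x")]
users = [(1, "ann"), (2, "bob")]
strategy = choose_join(len(logs), len(users), broadcast_threshold=100)
joined = broadcast_hash_join(logs, users)
```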
54. Intelligent Optimization and Code Generation
• It can also perform lower level optimizations such as eliminating
expensive object allocations and reducing virtual function calls
• Performance improvements for existing Spark programs when they
migrate to DataFrames
• Since the optimizer generates JVM bytecode for execution, Python users
experience the same high performance as Scala and Java users
55. Plan Optimization & Execution
DataFrames and SQL share the same
optimization/execution pipeline
56. SQL Execution Plans
• Logical and Physical query plans
– Both are trees representing query evaluation
– Internal nodes are operators over the data
– Logical plan is higher-level and algebraic
– Physical plan is lower-level and operational
• Logical plan operators
– Correspond to query language constructs
– Conceptually describe what operation needs to be performed
• Physical plan operators
– Correspond to implemented access methods
– Physically Implement the operation described by logical operators
[Figure: query planning pipeline. SQL Text -> (Parsing) -> Unresolved Logical Plan -> (Binding & Analyzing) -> Logical Plan -> (Optimizing) -> Optimized Logical Plan -> (Query Planning) -> Physical Plan]
59. Optimized Execution
• Writing imperative code to optimize
all possible patterns is hard
• Instead opt for simpler rules
– Each rule makes single change
– Run multiple rules together until a
fixed point is reached
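Rule-based optimization to a fixed point can be sketched in plain Python. Expressions are nested tuples, each rule makes a single kind of change, and the rule set is re-run until nothing changes. The tuple encoding and the two rules are assumptions chosen for illustration, not Catalyst's actual rules:

```python
def fold_constants(expr):
    """Rule: add(lit a, lit b) -> lit (a + b)."""
    if expr[0] == "add" and expr[1][0] == "lit" and expr[2][0] == "lit":
        return ("lit", expr[1][1] + expr[2][1])
    return expr

def drop_add_zero(expr):
    """Rule: x + 0 -> x and 0 + x -> x."""
    if expr[0] == "add" and expr[2] == ("lit", 0):
        return expr[1]
    if expr[0] == "add" and expr[1] == ("lit", 0):
        return expr[2]
    return expr

def apply_rule(rule, expr):
    """Apply one rule bottom-up to every node of the tree."""
    if expr[0] == "add":
        expr = ("add", apply_rule(rule, expr[1]), apply_rule(rule, expr[2]))
    return rule(expr)

def optimize(expr, rules, max_iterations=100):
    """Run all rules repeatedly until the tree stops changing (a fixed point)."""
    for _ in range(max_iterations):
        new_expr = expr
        for rule in rules:
            new_expr = apply_rule(rule, new_expr)
        if new_expr == expr:
            return expr
        expr = new_expr
    return expr

# x + (1 + -1) simplifies to x once both rules have fired.
plan = ("add", ("col", "x"), ("add", ("lit", 1), ("lit", -1)))
optimized = optimize(plan, [fold_constants, drop_add_zero])
```

Neither rule alone removes the addition; the second rule only applies after the first has folded the constants, which is why the rules are iterated together.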
69. Linear Regression Example
• Method run() trains model
• Parameters are set with setters setNumIterations and setIntercept
• Stochastic Gradient Descent (SGD) algorithm is used for minimizing function
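How SGD minimizes the squared error of a linear model can be sketched in plain Python. This is not MLlib code; the function name, step size, iteration count, and the noise-free sample data are all assumptions made for illustration:

```python
import random

def train_linear_regression_sgd(points, num_iterations=5000, step_size=0.1, seed=0):
    """Fit y = weight * x + intercept by stochastic gradient descent."""
    rng = random.Random(seed)
    weight, intercept = 0.0, 0.0
    for _ in range(num_iterations):
        x, y = rng.choice(points)          # one random sample per step
        error = (weight * x + intercept) - y
        weight -= step_size * error * x    # gradient of 0.5 * error**2 w.r.t. weight
        intercept -= step_size * error     # gradient of 0.5 * error**2 w.r.t. intercept
    return weight, intercept

# Noise-free samples of y = 2x + 1 on [0, 1].
points = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(11)]
weight, intercept = train_linear_regression_sgd(points)
```

With this data the fitted weight and intercept should end up close to 2 and 1; the iteration count plays the role of the number-of-iterations parameter mentioned above.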
71. Pipeline API
• Pipeline is a series of algorithms (feature transformation, model fitting, ...)
• Easy workflow construction
• Distribution of parameters into each stage
• MLlib is easier to use
• Uses uniform dataset representation - SchemaRDD from SparkSQL
– multiple named columns (similar to SQL table)
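The pipeline idea, a series of stages that share one dataset representation with named columns, can be sketched in plain Python. The classes below are hypothetical stand-ins, and as a simplification `fit` returns the pipeline itself rather than a separate fitted model object:

```python
class Tokenizer:
    """Transformer stage: adds a 'words' column derived from 'text'."""
    def fit(self, rows):
        return self

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class VocabularyCounter:
    """Estimator-like stage: learns a vocabulary in fit(), then adds 'features'."""
    def fit(self, rows):
        self.vocab = sorted({word for row in rows for word in row["words"]})
        return self

    def transform(self, rows):
        return [dict(row, features=[row["words"].count(w) for w in self.vocab])
                for row in rows]


class Pipeline:
    """Runs stages in order; each stage is fitted on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "spark is fast"}, {"text": "Spark streaming"}]
model = Pipeline([Tokenizer(), VocabularyCounter()]).fit(train)
scored = model.transform([{"text": "fast fast spark"}])
```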
75. GraphX
• New API that blurs distinction between graphs and tables
• Unifies data-parallel and graph-parallel systems
• Spark API for graphs
– Web-Graphs and Social Networks
– graph-parallel computation like PageRank and Collaborative Filtering
76. GraphX
• Extends Spark RDD abstraction using Resilient Distributed Property
Graph - a directed multi-graph with properties attached to each vertex
and edge
• Exposes fundamental operators like subgraph, joinVertices, and
mapReduceTriplets for graph computation
• Includes graph algorithms and builders for graph analytics tasks
78. Unifying Data-Parallel and Graph-Parallel Analytics
• Tables and Graphs are composable views of the same physical data
• Each view has its own operators that exploit the semantics of the view to
achieve efficient execution
79. Property Graph
• A directed graph with potentially multiple parallel edges sharing the
same source and destination vertex with properties attached to each
vertex and edge
• Each vertex is keyed by a unique 64-bit long identifier (VertexID)
• Edges have corresponding source and destination vertex identifiers
• Properties are stored as Scala/Java objects with each edge and vertex in
the graph
80. Property Graph
• Vertex Property
– User Profile
– Current PageRank Value
• Edge Property
– Weights
– Relationships
– Timestamps
81. Property Graph
• Constructed from raw files, RDDs and synthetic generators
• Immutable, distributed, and fault-tolerant
• Changes to the values or structure of the graph are accomplished by producing a
new graph with the desired changes
• Parts of the original graph (unaffected structure, attributes, and indices) are
reused in the new graph
• Each partition of the graph can be recreated on a different machine in the event
of a failure
• Represented using two Spark RDDs
– Vertex collection: VertexRDD
– Edge collection: EdgeRDD
82. Graph Views
• Graph class contains members graph.vertices and graph.edges to access
the vertices and edges of the graph
• These members extend RDD[(VertexId,V)] and RDD[Edge[E]]
• Are backed by optimized representations that leverage the internal
GraphX representation of graph data
83. Triplet View
• Triplets operator joins vertices and edges
• Logically joins the vertex and edge properties yielding an RDD[EdgeTriplet[VD,
ED]] containing instances of the EdgeTriplet class
• This join is graphically expressed as
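The triplet join can also be sketched in plain Python, with a dictionary standing in for the vertex collection and a list of tuples for the edge collection. The sample graph and the `triplets` function are illustrative assumptions, not GraphX code:

```python
# Vertex collection: VertexID -> vertex property.
vertices = {1: "alice", 2: "bob", 3: "carol"}
# Edge collection: (source id, destination id, edge property).
# Parallel edges between the same pair of vertices are allowed (multigraph).
edges = [(1, 2, "follows"), (2, 3, "follows"), (1, 2, "likes")]

def triplets(vertices, edges):
    """Join each edge with the properties of its source and destination vertex."""
    return [(src, vertices[src], dst, vertices[dst], attr)
            for src, dst, attr in edges]

triplet_view = triplets(vertices, edges)
```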
85. Subgraph
• Operator that takes vertex and edge predicates and returns the graph
containing only the vertices that satisfy the vertex predicate (evaluate to
true) and edges that satisfy the edge predicate and connect vertices that
satisfy the vertex predicate
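The subgraph semantics described above can be sketched in plain Python. The predicate signatures here are a simplifying assumption (the edge predicate receives source id, destination id, and edge property), and the function is an illustration rather than the GraphX operator:

```python
def subgraph(vertices, edges,
             vpred=lambda vid, attr: True,
             epred=lambda src, dst, attr: True):
    """Keep vertices passing vpred, and edges passing epred whose endpoints were kept."""
    kept_vertices = {vid: attr for vid, attr in vertices.items() if vpred(vid, attr)}
    kept_edges = [(src, dst, attr) for src, dst, attr in edges
                  if src in kept_vertices and dst in kept_vertices
                  and epred(src, dst, attr)]
    return kept_vertices, kept_edges

vertices = {1: "alice", 2: "bob", 3: None}   # vertex 3 has no profile
edges = [(1, 2, "follows"), (2, 3, "follows"), (3, 1, "likes")]
kept_vertices, kept_edges = subgraph(
    vertices, edges, vpred=lambda vid, attr: attr is not None)
```

Edges touching the removed vertex disappear even though the edge predicate accepts them, because both endpoints must satisfy the vertex predicate.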
88. Distributed Graph Representation
• Each vertex partition contains a bitmask and routing table
• Routing table - a logical map from a vertex id to the set of edge partitions
that contains adjacent edges
• Bitmask - enables the set intersection and filtering
– Vertices bitmasks are updated after each operation (mapReduceTriplets)
– Vertices hidden by the bitmask do not participate in the graph operations
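A routing table and a bitmask can be illustrated in plain Python. Edge partitions are lists of (source, destination) pairs, and the bitmask is modelled as a dictionary of booleans; both are simplified stand-ins with hypothetical names, not the internal GraphX structures:

```python
def build_routing_table(edge_partitions):
    """Map each vertex id to the set of edge-partition indexes holding an adjacent edge."""
    table = {}
    for index, partition in enumerate(edge_partitions):
        for src, dst in partition:
            table.setdefault(src, set()).add(index)
            table.setdefault(dst, set()).add(index)
    return table

edge_partitions = [[(1, 2), (1, 3)], [(3, 4)]]
routing_table = build_routing_table(edge_partitions)

# Bitmask: vertex 2 is hidden and does not take part in graph operations.
bitmask = {1: True, 2: False, 3: True, 4: True}
visible_vertices = [vid for vid in sorted(routing_table) if bitmask[vid]]
```

The routing table tells a vertex partition which edge partitions need its vertex properties, so a property is only shipped where an adjacent edge lives.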
90. References
1. http://spark.apache.org/graphx
2. http://spark.apache.org/streaming/
3. http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf
4. http://web.stanford.edu/class/cs346/qpnotes.html
5. https://github.com/apache/spark/tree/master/sql
6. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf
Chowdhury. Technical Report UCB/EECS-2011-82. July 2011
7. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized Streams: Fault-Tolerant Streaming Computation at Scale,
SOSP 2013, November 2013
8. K. Ousterhout, P. Wendell, M. Zaharia and I. Stoica. Sparrow: Distributed, Low-Latency Scheduling, SOSP 2013, November 2013
9. R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale, SIGMOD 2013, June 2013
10. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica, Dominant Resource Fairness: Fair Allocation of Multiple
Resource Types, NSDI 2011, March 2011
11. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications, Stanford University, Stanford, CA, February 2011