Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs), which can survive failures through lineage tracking, and supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
Nowadays, people are creating, sharing, and storing data at a faster pace than ever before, so effective data compression and decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and it stores and shuffles large amounts of data across the cluster at runtime, so the compression/decompression codecs can affect end-to-end application performance in many ways.
However, there is a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed and ratio is a very interesting topic, particularly as both software algorithms and CPU instruction sets keep evolving. Apache Spark provides a very flexible compression codec interface with default implementations such as GZip, Snappy, LZ4, and ZSTD, and the Intel Big Data Technologies team has also implemented additional codecs for Apache Spark based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP, and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running different micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
This session is intended as a best-practice guide for big data software engineers choosing the proper compression/decompression codecs for their applications, and we will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been an open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto's state-of-the-art low-latency evaluation with Spark's robust and fault-tolerant execution engine.
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
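As a rough illustration of that last point, here is a minimal sketch, assuming a SparkSession named spark and a hypothetical people.json with name, age, and dept columns, of the same aggregation written with the DataFrame API and with plain RDDs:

// DataFrame API: concise and declarative, optimized by Catalyst
val people = spark.read.json("people.json")
people.groupBy("dept").avg("age").show()

// Roughly equivalent RDD version: noticeably more code
val rows = people.rdd.map(r => (r.getAs[String]("dept"), r.getAs[Long]("age")))
val avgByDept = rows
  .mapValues(age => (age, 1L))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }
avgByDept.collect().foreach(println)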
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
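For example, a minimal Structured Streaming sketch (assuming a text socket source on localhost:9999 and a SparkSession built as below) shows that the streaming query is written with the same DataFrame operations as a batch query:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()
import spark.implicits._

// Read lines from a socket as an unbounded DataFrame (the source is an assumption)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same operations used on batch DataFrames apply to the stream
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print the complete word counts to the console
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()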
Apache Spark presentation at HasGeek FifthElephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
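To make a couple of those questions concrete, here is a small hedged sketch of the two usual levers, repartition() to raise parallelism before a wide stage and coalesce() to reduce the number of output files (the paths, key logic, and partition counts are purely illustrative):

// Assumes an existing SparkContext sc; paths and numbers are placeholders
val raw = sc.textFile("hdfs:///data/events", minPartitions = 200)

// Increase shuffle parallelism before an expensive wide operation
val keyed = raw.map(line => (line.take(8), line)).repartition(400)
val grouped = keyed.reduceByKey(_ + _)

// Cut the number of output files without triggering another full shuffle
grouped.coalesce(20).saveAsTextFile("hdfs:///data/events-summary")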
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Spark is an open-source distributed computing framework used for processing large datasets. It allows for in-memory cluster computing, which enhances processing speed. Spark core components include Resilient Distributed Datasets (RDDs) and a directed acyclic graph (DAG) that represents the lineage of transformations and actions on RDDs. Spark Streaming is an extension that allows for processing of live data streams with low latency.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Apache Spark is a fast, general engine for large-scale data processing. It provides a unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and to optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies use Spark for its speed, ease of use, and support for multiple workloads and languages.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
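As a hedged illustration of two of those opportunities (partitioning plus predicate pushdown), assuming a SparkSession named spark and a hypothetical events DataFrame with date and status columns:

import spark.implicits._

// Write the data partitioned by date so queries can skip whole directories
events.write
  .partitionBy("date")
  .parquet("/warehouse/events_parquet")

// On read, the filters below can be pushed down: partition pruning on date,
// min/max row-group skipping on status
val failed = spark.read
  .parquet("/warehouse/events_parquet")
  .where($"date" === "2020-01-01" && $"status" === "FAILED")

failed.explain()   // look for PartitionFilters / PushedFilters in the plan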
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
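The word count example mentioned above typically looks like this in Scala (a minimal sketch assuming an existing SparkContext named sc and placeholder HDFS paths):

val lines = sc.textFile("hdfs:///input/file.txt")

val counts = lines
  .flatMap(line => line.split("\\s+"))   // split each line into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey(_ + _)                    // sum the counts for each word

counts.saveAsTextFile("hdfs:///output/wordcounts")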
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L4rPmM
This CloudxLab Basics of RDD tutorial helps you to understand Basics of RDD in detail. Below are the topics covered in this tutorial:
1) What is RDD - Resilient Distributed Datasets
2) Creating RDD in Scala
3) RDD Operations - Transformations & Actions
4) RDD Transformations - map() & filter()
5) RDD Actions - take() & saveAsTextFile()
6) Lazy Evaluation & Instant Evaluation
7) Lineage Graph
8) flatMap and Union
9) Scala Transformations - Union
10) Scala Actions - saveAsTextFile(), collect(), take() and count()
11) More Actions - reduce()
12) Can We Use reduce() for Computing Average?
13) Solving Problems with Spark
14) Compute Average and Standard Deviation with Spark
15) Pick Random Samples From a Dataset using Spark
Spark Summit East 2015 Advanced Devops Student SlidesDatabricks
This document provides an agenda for an advanced Spark class covering topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, shuffle operations, and Spark Streaming. The class will be held in March 2015 and include lectures, labs, and Q&A sessions. It notes that some slides may be skipped and asks attendees to keep Q&A low during the class, with a dedicated Q&A period at the end.
We will see an overview of Spark in Big Data. We will start with an introduction to Apache Spark programming, then move on to Spark's history. Moreover, we will learn why Spark is needed. Afterward, we will cover all the fundamental Spark components. Furthermore, we will learn about Spark's core abstraction and Spark RDDs. For more detailed insights, we will also cover Spark features, Spark limitations, and Spark use cases.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
The document discusses Apache Spark, an open source cluster computing framework for real-time data processing. It notes that Spark is up to 100 times faster than Hadoop for in-memory processing and 10 times faster on disk. The main feature of Spark is its in-memory cluster computing capability, which increases processing speeds. Spark runs on a driver-executor model and uses resilient distributed datasets and directed acyclic graphs to process data in parallel across a cluster.
A lot of data scientists use the Python library pandas for quick exploration of data. The most useful construct in pandas (based on R, I think) is the dataframe, which is a 2D array (aka matrix) with the option to “name” the columns (and rows). But pandas is not distributed, so there is a limit on the data size that can be explored.
Spark is a great map-reduce like framework that can handle very big data by using a shared nothing cluster of machines.
This work is an attempt to provide a pandas-like DSL on top of spark, so that data scientists familiar with pandas have a very gradual learning curve.
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Data Con LA
Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time-consuming effort. To aid this effort, we built Titian, a library that enables tracking data provenance through transformations in Apache Spark.
Code review is a systematic examination of computer source code to find mistakes. In SAOS, an online Polish court judgment analysis system, code review is conducted using GitHub pull requests in a lightweight manner according to Scrum methodology. Each task takes at most two days to complete and is reviewed by a partner to catch errors before being merged. Observations found that code review improves code quality and catches bugs, though it takes about 20% of time. It has also strengthened collaboration and skills within the SAOS team.
This document discusses Spark SQL and DataFrames. It provides three key points:
1. DataFrames are distributed collections of data organized into named columns similar to a table in a relational database. They allow SQL-like operations to be performed on structured data.
2. DataFrames can be created from a variety of data sources like JSON, Parquet files, existing RDDs, or Hive tables. The schema can be inferred automatically using case classes or specified programmatically.
3. Common SQL operations like selecting columns, filtering rows, aggregation, and joining can be performed on DataFrames to analyze structured data. The results are DataFrames that support additional transformations.
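A brief sketch pulling those three points together in spark-shell style Scala (the JSON file, column names, and sample rows are assumptions):

import spark.implicits._

// 1. Create a DataFrame from a data source; the schema is inferred from the JSON
val employees = spark.read.json("employees.json")   // assumed columns: name, dept, salary

// 2. Or build one from an existing RDD of case classes, with the schema inferred
case class Employee(name: String, dept: String, salary: Double)
val fromRdd = spark.sparkContext
  .parallelize(Seq(Employee("Ann", "eng", 100000.0), Employee("Bob", "ops", 90000.0)))
  .toDF()

// 3. SQL-like operations: select, filter, aggregate, join
val wellPaid  = employees.filter($"salary" > 80000).select("name", "dept")
val avgByDept = employees.groupBy("dept").avg("salary")
wellPaid.join(avgByDept, "dept").show()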
Programming in Spark - Lessons Learned in OpenAire projectŁukasz Dumiszewski
This document discusses lessons learned from rewriting parts of the OpenAire project to use Apache Spark. It covers choosing Java and Kryo serialization for efficiency, understanding that spark.closure.serializer controls code serialization, using accumulators carefully, and testing Spark jobs including unit tests and integration with Oozie workflows. The rewrite resulted in faster execution times for some modules like CitationMatching.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Secrets of Spark's success - Deenar Toraskar, Think Reactive huguk
This talk will cover the design and implementation decisions that have been key to the success of Apache Spark over competing cluster computing frameworks. It delves into the whitepaper behind Spark and covers the design of Spark RDDs, the abstraction that enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib, and GraphX. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
This document provides an overview of Apache Spark, including:
- The problems of big data that Spark addresses like large volumes of data from various sources.
- A comparison of Spark to existing techniques like Hadoop, noting Spark allows for better developer productivity and performance.
- An overview of the Spark ecosystem and how Spark can integrate with an existing enterprise.
- Details about Spark's programming model including its RDD abstraction and use of transformations and actions.
- A discussion of Spark's execution model involving stages and tasks.
This document provides an overview of Apache Spark, including:
- What Spark is and how it differs from MapReduce by running computations in memory for improved performance on iterative algorithms.
- Examples of Spark's core APIs like RDDs (Resilient Distributed Datasets) and transformations like map, filter, reduceByKey.
- How Spark programs are executed through a DAG (Directed Acyclic Graph) and translated to physical execution plans with stages and tasks.
This document provides an overview of HBase and why NoSQL databases like HBase were developed. It discusses how relational databases do not scale horizontally well with large amounts of data. HBase was created to address these scaling issues and was inspired by Google's BigTable database. The document explains the HBase data model with rows, columns, and versions. It describes how data is stored physically in HFiles and served from memory and disk. Basic operations like put, get, and scan are also covered.
Introduction to Spark: Data Analysis and Use Cases in Big Data Jongwook Woo
This document provides a summary of a presentation given by Jongwook Woo on introducing Spark for data analysis and use cases in big data. The presentation covered Spark cores, RDDs, Spark SQL, streaming and machine learning. It also described experimental results analyzing an airline data set using Spark and Hive on Microsoft Azure, including visualizations of cancelled/diverted flights by month and year and the effects of flight distance on diversions, cancellations and departure delays.
In this talk at 2015 Spark Summit East, the lead developer of Spark streaming, @tathadas, talks about the state of Spark streaming:
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way big “streaming” data applications are being written. It is rapidly being adopted by companies across various business verticals: ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detection, etc. These companies are mainly adopting Spark Streaming because its simple, declarative, batch-like API makes large-scale stream processing accessible to non-scientists; its unified API and single processing engine (the Spark core engine) allow a single cluster and a single set of operational processes to cover the full spectrum of use cases (batch, interactive, and stream processing); and its stronger, exactly-once semantics make it easier to express and debug complex business logic. In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. This presentation will also showcase the exciting new developments in Spark Streaming and the potential future roadmap.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Spark Internals - Hadoop Source Code Reading #16 in JapanTaro L. Saito
The document discusses Spark internals and provides an overview of key components such as the Spark code base size and growth over time, core developers, Scala basics used in Spark, RDDs, tasks, caching/block management, and schedulers for running Spark on clusters including Mesos and YARN. It also includes tips for using IntelliJ IDEA to work with Spark's Scala code base.
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It begins by introducing the speaker and GigaSpaces. It then discusses how modern applications require both online transaction processing and real-time operational intelligence. The document presents examples from retail and IoT and the goals of minimizing latency while maximizing data analytics locality. It provides an overview of in-memory computing options and describes how GigaSpaces uses an in-memory data grid combined with Spark to achieve HTAP. The document includes deployment diagrams and discusses data grid RDDs and pushing predicates to the data grid. It describes how this was productized as InsightEdge and provides additional innovations and reference architectures.
This document discusses variance in Scala by using an example of different types of albums and tracks. It explains that vectors are covariant in their type parameter, so a Vector of a subtype can be used where a Vector of the supertype is expected. Functions are contravariant in their parameter types, so a function operating on a subtype can be used where a supertype is expected. Fields, methods, and mutable types like arrays are invariant to preserve type safety.
Scala has a static, strong, and Turing complete type system that can infer types. It supports object-oriented programming with named types like Dog and functional programming with parameterized types like List[Int]. New types can be defined through classes, traits, case classes, and type members. Types are more general than classes and include structural types. Variance controls subtype relationships for parameterized types. Type bounds and existential types provide additional type safety. Higher kinded types allow types to be parameterized over other types.
Spark is an open-source cluster computing framework that provides high performance for both batch and streaming data processing. It addresses limitations of other distributed processing systems like MapReduce by providing in-memory computing capabilities and supporting a more general programming model. Spark core provides basic functionalities and serves as the foundation for higher-level modules like Spark SQL, MLlib, GraphX, and Spark Streaming. RDDs are Spark's basic abstraction for distributed datasets, allowing immutable distributed collections to be operated on in parallel. Key benefits of Spark include speed through in-memory computing, ease of use through its APIs, and a unified engine supporting multiple workloads.
Apache Spark is a fast, general-purpose cluster computing system that allows processing of large datasets in parallel across clusters. It can be used for batch processing, streaming, and interactive queries. Spark improves on Hadoop MapReduce by using an in-memory computing model that is faster than disk-based approaches. It includes APIs for Java, Scala, Python and supports machine learning algorithms, SQL queries, streaming, and graph processing.
An engine to process big data faster (than MapReduce) in an easy and extremely scalable way. An open-source, parallel, in-memory, cluster computing framework. A solution for loading, processing, and end-to-end analysis of large-scale data. Iterative and interactive, with Scala, Java, Python, and R APIs and a command-line interface.
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. It covers working with different data sources, applying transformations, and Python best practices for developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R for distributed tasks including SQL, streaming, and machine learning. Spark improves on MapReduce by keeping data in-memory, allowing iterative algorithms to run faster than disk-based approaches. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, acting as a fault-tolerant collection of elements that can be operated on in parallel.
Spark is a framework for large-scale data processing that improves on MapReduce. It handles batch, iterative, and streaming workloads using a directed acyclic graph (DAG) model. Spark aims for generality, low latency, fault tolerance, and simplicity. It uses an in-memory computing model with Resilient Distributed Datasets (RDDs) and a driver-executor architecture. Common Spark performance issues relate to partitioning, shuffling data between stages, task placement, and load balancing. Evaluation tools include the Spark UI, Sar, iostat, and benchmarks like SparkBench and GroupBy tests.
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. Some key components of Apache Spark include Resilient Distributed Datasets (RDDs), DataFrames, Datasets, and Spark SQL for structured data processing. Spark also supports streaming, machine learning via MLlib, and graph processing with GraphX.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the MapReduce model of Hadoop to efficiently use it for more types of computations, which includes interactive queries and stream processing. This slide shares some basic knowledge about Apache Spark.
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
Unit II Real Time Data Processing tools.pptxRahul Borate
Apache Spark is a lightning-fast cluster computing framework designed for real-time processing. It overcomes limitations of Hadoop by running 100 times faster in memory and 10 times faster on disk. Spark uses resilient distributed datasets (RDDs) that allow data to be partitioned across clusters and cached in memory for faster processing.
Spark example ShidrokhGoudarzi1
Spark is a fast, general-purpose engine for large-scale data processing. It has advantages over MapReduce in speed, ease of use, and the ability to run everywhere. Spark supports SQL querying, streaming, machine learning, and graph processing, and offers APIs in Scala, Java, and Python. Spark applications have drivers, executors, and tasks, and work with RDDs and shared variables. The Spark shell provides an interactive way to learn the API and analyze data.
Big Data Processing with Apache Spark 2014mahchiev
This document provides an overview of Apache Spark, a framework for large-scale data processing. It discusses what big data is, the history and advantages of Spark, and Spark's execution model. Key concepts explained include Resilient Distributed Datasets (RDDs), transformations, actions, and MapReduce algorithms like word count. Examples are provided to illustrate Spark's use of RDDs and how it can improve on Hadoop MapReduce.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
Spark improves on Hadoop MapReduce by keeping data in-memory between jobs. It reads data into resilient distributed datasets (RDDs) that can be transformed and cached in memory across nodes for faster iterative jobs. RDDs are immutable, partitioned collections distributed across a Spark cluster. Transformations define operations on RDDs, while actions trigger computation by passing data to the driver program.
This document provides an introduction and overview of Apache Spark, a lightning-fast cluster computing framework. It discusses Spark's ecosystem, how it differs from Hadoop MapReduce, where it shines well, how easy it is to install and start learning, includes some small code demos, and provides additional resources for information. The presentation introduces Spark and its core concepts, compares it to Hadoop MapReduce in areas like speed, usability, tools, and deployment, demonstrates how to use Spark SQL with an example, and shows a visualization demo. It aims to provide attendees with a high-level understanding of Spark without being a training class or workshop.
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
This document provides an overview of effective big data visualization. It discusses information visualization and data visualization, including common chart types like histograms, scatter plots, and dashboards. It covers visualization goals, considerations, processes, basics, and guidelines. Examples of good visualization are provided. Tools for creating infographics are listed, as are resources for learning more about data visualization and references. Overall, the document serves as a comprehensive introduction to big data visualization.
Graph databases store data in graph structures with nodes, edges, and properties. Neo4j is a popular open-source graph database that uses a property graph model. It has a core API for programmatic access, indexes for fast lookups, and Cypher for graph querying. Neo4j provides high availability through master-slave replication and scales horizontally by sharding graphs across instances through techniques like cache sharding and domain-specific sharding.
This document discusses information retrieval techniques. It begins by defining information retrieval as selecting the most relevant documents from a large collection based on a query. It then discusses some key aspects of information retrieval including document representation, indexing, query representation, and ranking models. The document also covers specific techniques used in information retrieval systems like parsing documents, tokenization, removing stop words, normalization, stemming, and lemmatization.
The document provides an overview of various machine learning algorithms and methods. It begins with an introduction to predictive modeling and supervised vs. unsupervised learning. It then describes several supervised learning algorithms in detail including linear regression, K-nearest neighbors (KNN), decision trees, random forest, logistic regression, support vector machines (SVM), and naive Bayes. It also briefly discusses unsupervised learning techniques like clustering and dimensionality reduction methods.
This document provides an overview of natural language processing (NLP). It discusses topics like natural language understanding, text categorization, syntactic analysis including parsing and part-of-speech tagging, semantic analysis, and pragmatic analysis. It also covers corpus-based statistical approaches to NLP, measuring performance, and supervised learning methods. The document outlines challenges in NLP like ambiguity and knowledge representation.
This document provides an overview of the Natural Language Toolkit (NLTK), a Python library for natural language processing. It discusses NLTK's modules for common NLP tasks like tokenization, part-of-speech tagging, parsing, and classification. It also describes how NLTK can be used to analyze text corpora, frequency distributions, collocations and concordances. Key functions of NLTK include tokenizing text, accessing annotated corpora, analyzing word frequencies, part-of-speech tagging, and shallow parsing.
This document provides an overview of NoSQL databases and summarizes key information about several NoSQL databases, including HBase, Redis, Cassandra, MongoDB, and Memcached. It discusses concepts like horizontal scalability, the CAP theorem, eventual consistency, and data models used by different NoSQL databases like key-value, document, columnar, and graph structures.
This document provides an overview of recommender systems for e-commerce. It discusses various recommender approaches including collaborative filtering algorithms like nearest neighbor methods, item-based collaborative filtering, and matrix factorization. It also covers content-based recommendation, classification techniques, addressing challenges like data sparsity and scalability, and hybrid recommendation approaches.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It implements Google's MapReduce programming model and the Hadoop Distributed File System (HDFS) for reliable data storage. Key components include a JobTracker that coordinates jobs, TaskTrackers that run tasks on worker nodes, and a NameNode that manages the HDFS namespace and DataNodes that store application data. The framework provides fault tolerance, parallelization, and scalability.
This document provides an overview of the statistical programming language R. It discusses key R concepts like data types, vectors, matrices, data frames, lists, and functions. It also covers important R tools for data analysis like statistical functions, linear regression, multiple regression, and file input/output. The goal of R is to provide a large integrated collection of tools for data analysis and statistical computing.
This document provides an overview of the Python programming language. It discusses Python's history and evolution, its key features like being object-oriented, open source, portable, having dynamic typing and built-in types/tools. It also covers Python's use for numeric processing with libraries like NumPy and SciPy. The document explains how to use Python interactively from the command line and as scripts. It describes Python's basic data types like integers, floats, strings, lists, tuples and dictionaries as well as common operations on these types.
The document provides an overview of functional programming, including its key features, history, differences from imperative programming, and examples using Lisp and Scheme. Some of the main points covered include:
- Functional programming is based on evaluating mathematical functions rather than modifying state through assignments.
- It uses recursion instead of loops and treats functions as first-class objects.
- Lisp was the first functional language in 1960 and introduced many core concepts like lists and first-class functions. Scheme was developed in 1975 as a simpler dialect of Lisp.
- Functional programs are more focused on what to compute rather than how to compute it, making them more modular and easier to reason about mathematically.
3. What is Spark?
• An open-source cluster computing framework
• Leverages distributed memory
• Allows programs to load data into a cluster's memory and query it repeatedly
• Compared to Hadoop
– Scalability - can work with large data
– Fault tolerance - can self-recover
• Functional programming model
• Supports batch & streaming analysis
4. What is Spark?
• Separate, fast MapReduce-like engine
– In-memory data storage for very fast iterative queries
– General execution graphs and powerful optimizations
• Compatible with Hadoop storage APIs
– Can read / write to any Hadoop-supported system, including HDFS, HBase, Sequence
Files etc
• Faster Application Development - 2-5x less code
• Disk Execution Speed - 10× faster
• Memory Execution Speed – 100× faster
5. What is Spark?
• Apart from simple map and reduce operations, supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box
• In-memory cluster computing
• Supports any existing Hadoop input / output format
• Spark is written in Scala
• Provides concise and consistent APIs in Scala, Java and Python
• Offers interactive shell for Scala and Python
6. Spark Deployments – Cluster Manager Types
• Standalone (native Spark cluster)
• Hadoop YARN - Hadoop 2 resource manager
• Apache Mesos - generic cluster manager that can also handle MapReduce
• Local - A pseudo-distributed local mode for development or testing using
local file system
– Spark runs on a single machine with one executor per CPU core
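A small hedged sketch of pointing an application at the local cluster manager (the app name and master string are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// local[*] runs Spark on this machine with one worker thread per CPU core;
// local[4] would pin it to exactly four threads
val conf = new SparkConf()
  .setAppName("LocalModeDemo")
  .setMaster("local[*]")

val sc = new SparkContext(conf)
println(sc.defaultParallelism)   // typically equals the number of cores in local mode
sc.stop()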
9. Project History
• 2009 – Project started at UC Berkeley AMPLab
• 2010 – Open sourced under a BSD license
• 2013 – The project was donated to the Apache Software Foundation and switched its license to Apache 2.0
• Feb 2014 – Became an Apache Top-Level Project
• November 2014 – Engineering team at Databricks used Spark to set a new world record in large-scale sorting
10. The Most Active Open Source Project in Big Data
[Bar chart: project contributors in the past year – Spark 125, Hadoop MapReduce 103, Giraph 32, Storm 25, Tez 17]
11. Hadoop Model
• Hadoop has an acyclic data flow model
– Load data -> process data -> write output -> finished
• Hadoop is slow due to replication, serialization, and disk IO
• Hadoop is poorly suited to pipelining multiple jobs
• Cheaper DRAM makes main memory a better option than disk for storing
intermediate results
13. Spark Model
• MapReduce allows sharing data across jobs only through stable storage such as a
file system, which is slow
• Applications want to reuse intermediate results across multiple computations
– Work on same dataset to optimize parameters in machine learning algorithms
– More complex, multi-stage applications (iterative graph algorithms and machine
learning)
– More interactive ad-hoc queries
– Efficient primitives for data sharing across parallel jobs
• These challenges can be tackled by keeping intermediate results in memory
• Caching the data for multiple queries benefits interactive data analysis tools
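A minimal sketch of that idea (assuming an existing SparkContext sc; the path and column layout are made up), where one cached dataset feeds several computations:

val ratings = sc.textFile("hdfs:///data/ratings.csv")
  .map(_.split(","))
  .map(cols => (cols(0), cols(2).toDouble))   // assumed layout: (userId, _, rating)
  .cache()                                    // keep the parsed data in memory

// The jobs below reuse the in-memory partitions instead of re-reading the file
val ratingCount = ratings.count()
val avgRating   = ratings.map(_._2).sum() / ratingCount
val perUser     = ratings.reduceByKey(_ + _).collect()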
14. Spark - In-Memory Data Sharing
10-100× faster than network and disk
[Diagram: input is processed once into distributed memory, from which Results 1, 2, and 3 are served; iterative workloads (iteration 1, 2, ... n) likewise reuse the in-memory data instead of re-reading the input]
16. Stack
• Spark SQL
– allows querying data via SQL as well as the Apache Hive variant of SQL (HQL) and supports
many sources of data, including Hive tables, Parquet, and JSON
• Spark Streaming
– Component that enables processing of live streams of data in an elegant, fault-tolerant,
scalable and fast way
• MLlib
– Library containing common machine learning (ML) functionality including algorithms
such as classification, regression, clustering, collaborative filtering to scale out across a
cluster
17. Stack
• GraphX
– Library for manipulating graphs and performing graph-parallel computation
• Cluster Managers
– Spark is designed to efficiently scale up from one to many thousands of compute
nodes. It can run over a variety of cluster managers, including Hadoop YARN, Apache
Mesos, etc.
– Spark has a simple cluster manager included in Spark itself called the Standalone
Scheduler
19. Programming Model
• Spark programming model is based on parallelizable operators
• Parallelizable operators are higher-order functions that execute user-defined functions in
parallel
• A data flow is composed of any number of data sources, operators, and data sinks by
connecting their inputs and outputs
• Job description is based on directed acyclic graphs (DAG)
• Spark allows programmers to develop complex, multi-step data pipelines using directed
acyclic graph (DAG) pattern
• Since Spark is based on a DAG, it can follow the chain from child to parent to recompute
any value, like a tree traversal
• DAG supports fault-tolerance
21. How Spark Works
• User submits Jobs
• Every Spark application consists of a driver program that launches various
parallel operations on the cluster
• The driver program contains your application’s main function and defines
distributed datasets on the cluster, then applies operations to them
22. How Spark Works
• Driver programs access Spark through the SparkContext object, which
represents a connection to a computing cluster.
• The SparkContext can be used to build RDDs (Resilient distributed
datasets) on which you can run a series of operations
• To run these operations, driver programs typically manage a number of
nodes called executors
23. How Spark Works
• SparkContext (driver) contacts Cluster Manager which
assigns cluster resources
• Then it sends application code to assigned Executors
(distributing computation, not data)
• Finally sends tasks to Executors to run
25. Spark Context
• Main entry point to Spark functionality
• Available in shell as variable sc
• In standalone programs, you should make your own
• import org.apache.spark.SparkContext
• import org.apache.spark.SparkContext._
• val sc = new SparkContext(master, appName, [sparkHome], [jars])
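Putting the bullets above together, a small hedged sketch of a standalone program that builds its own SparkContext (the master URL and file path are placeholders); it uses the SparkConf-based constructor rather than the positional one shown on the slide:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SimpleApp")
      .setMaster("spark://master-host:7077")   // placeholder master URL

    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs:///data/input.txt")
    println(s"Line count: ${lines.count()}")
    sc.stop()
  }
}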
26. RDD - Resilient Distributed Datasets
• A distributed memory abstraction
• An immutable distributed collection of data partitioned across machines
in a cluster – provides scalability
• Immutability provides safety with parallel processing
• Distributed - stored in memory across the cluster
27. RDD - Resilient Distributed Datasets
• Stored in-memory - automatically rebuilt if a partition is lost
• In-memory storage makes it fast
• Facilitates two types of operations- transformation and action
• Lazily evaluated
• Type inferred
28. RDDs
• Fault-tolerant collection of elements that can be operated on in parallel
• Manipulated through various parallel operators using a diverse set of
transformations (map, filter, join etc)
• Fault recovery without costly replication
• Remembers the series of transformations that built an RDD (its lineage) to
recompute lost data
• RDD operators are higher order functions
• Turn a collection into an RDD
– val a = sc.parallelize(Array(1, 2, 3))
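Continuing the parallelize example, a short sketch showing transformations (lazy) followed by actions (which trigger execution):

val a = sc.parallelize(Array(1, 2, 3, 4, 5))

// Transformations only record lineage; nothing runs yet
val doubled = a.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Actions force evaluation of the lineage
println(evens.collect().mkString(", "))   // 4, 8
println(doubled.reduce(_ + _))            // 30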
31. Program Execution
• The driver program, when starting execution, builds up a graph where nodes are
RDDs and edges are transformation steps
• No execution happens at the cluster until an action is encountered
• The driver program ships the execution graph as well as the code block to the
cluster, where every worker server will get a copy
• The execution graph is a DAG
• Each DAG is an atomic unit of execution
32. Program Execution
• Each source node (no incoming edge) is an external data source or driver
memory
• Each intermediate node is a RDD
• Each sink node (no outgoing edge) is an external data sink or driver
memory
• Green edge connecting to RDD represents a transformation
• Red edge connecting to a sink node represents an action
34. How Spark Works?
• Spark is divided into various independent layers with responsibilities
• The first layer is the interpreter - Spark uses a Scala interpreter, with some
modifications
• When code is typed in the Spark console (creating RDDs and applying operators),
Spark creates an operator graph
• When an action is run, the Graph is submitted to a DAG Scheduler
• DAG scheduler divides operator graph into (map and reduce) stages
• A stage consists of tasks based on partitions of the input data
35. How Spark Works?
• The DAG scheduler pipelines operators together to optimize the graph
– Example - many map operators can be scheduled in a single stage
• The final result of a DAG scheduler is a set of stages that are passed on to the Task
Scheduler
• The task scheduler launches tasks via cluster manager (Spark
Standalone/Yarn/Mesos)
• The task scheduler doesn’t know about dependencies among stages
• The Worker executes the tasks by starting a new JVM per job
• The worker knows only about the code that is passed to it
37. Job Scheduling
• When an action on an RDD is executed, the scheduler builds a DAG of stages from
the RDD lineage graph
• A stage contains many pipelined transformations with narrow dependencies
• The boundary of a stage
– Shuffles for wide dependencies.
– Already computed partitions
38. Job Scheduling
• The scheduler launches tasks to compute missing partitions from each
stage until it computes the target RDD
• Tasks are assigned to machines based on data locality
• If a task needs a partition, which is available in the memory of a node, the
task is sent to that node
40. Data Shuffling
• Spark ships the code to a worker server where data processing happens
• But data movement cannot be completely eliminated
• Example - if the processing requires data residing in different partitions
to be grouped first, then data should be shuffled among worker servers
• Transformation operations are of two types – narrow and wide
41. Data Shuffling
• Narrow transformation
– The processing logic depends only on data that already resides in the same partition,
so data shuffling is unnecessary
– Examples - filter(), sample(), map(), flatMap() etc.
• Wide transformation
– The processing logic depends on data residing in multiple partitions, so data
shuffling is needed to bring it together in one place
– Examples - groupByKey(), reduceByKey() etc. (see the sketch below)
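A small sketch contrasting a narrow and a wide transformation; the pairs data is illustrative:
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val upper = pairs.map { case (k, v) => (k.toUpperCase, v) }   // narrow - stays within each partition
  val sums  = upper.reduceByKey(_ + _)                          // wide - shuffles data by key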
43. RDD Joins
• Joining two RDDs affects the amount of data shuffled
• Spark provides two ways to join data – shuffle and broadcast
• Shuffle join - data of the two RDDs with the same key is redistributed to the same
partition; the items of each RDD are shuffled across worker servers
• Broadcast join - one of the RDDs is broadcast and copied over to every partition
– If one RDD is significantly smaller than the other, a broadcast join reduces
the network traffic because only the small RDD needs to be copied to all worker
servers while the large RDD doesn't need to be shuffled at all (see the sketch below)
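A hedged sketch of a manual broadcast (map-side) join: the small lookup table is collected to the driver and broadcast, so the large RDD is never shuffled; the data and variable names are illustrative:
  val small  = sc.parallelize(Seq((1, "US"), (2, "IN"))).collectAsMap()   // small RDD -> driver
  val smallB = sc.broadcast(small)                                        // copied once to each worker
  val large  = sc.parallelize(Seq((1, 10.0), (2, 20.0), (1, 30.0)))
  val joined = large.map { case (k, v) => (k, (v, smallB.value.get(k))) } // no shuffle of the large RDD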
45. Fault Resiliency
• RDDs track series of transformations used to build them (their lineage) to re-compute lost
data
• messages = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])
• Lineage: HDFS File → filter(func = startswith(…)) → Filtered RDD → map(func = split(...)) → Mapped RDD
46. Fault Resiliency
• RDDs maintain lineage information used to reconstruct lost partitions
• Logging lineage rather than the actual data
• No replication
• Recompute only the lost partitions of an RDD
47. Fault Resiliency
• Recovery may be time-consuming for RDDs with long lineage chains and
wide dependencies
• It is helpful to checkpoint some RDDs to stable storage
• Decision about which data to checkpoint is left to users
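A minimal sketch of checkpointing; the HDFS directory and rawRDD are assumptions for illustration:
  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // checkpoint directory is an assumption
  val cleaned = rawRDD.filter(_.nonEmpty).map(_.trim)
  cleaned.checkpoint()   // lineage is truncated once the checkpoint is materialized
  cleaned.count()        // an action forces the checkpoint to actually be written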
48. Fault Resiliency
• The DAG defines deterministic transformation steps between different
partitions of data within each RDD
• Whenever a worker server crashes during the execution of a stage,
another worker server re-executes the stage from the beginning by
pulling the input data from its parent stage that has the output data
stored in local files
49. Fault Resiliency
• In case the result of the parent stage is not accessible (the worker server
lost the file), the parent stage needs to be re-executed as well
• Imagine this as a lineage of transformation steps - a failure at any step
triggers a restart of execution from the last step whose output is still available
• Since the DAG itself is an atomic unit of execution, all the RDD values will
be forgotten after the DAG finishes its execution
50. Fault Resiliency
• Therefore, after the driver program finishes an action (which executes a DAG to its
completion), all the RDD values will be forgotten, and if the program accesses the
RDD again in a subsequent statement, the RDD needs to be recomputed
from its dependencies
• To reduce this repetitive processing, Spark provides a caching mechanism that
remembers RDDs in worker server memory (or on local disk)
• Once the execution planner finds that an RDD is already cached in memory, it uses
the RDD right away without tracing back to its parent RDDs
• This way, the DAG is pruned once a cached RDD is reached
51. RDD Operators -Transformations
• Create a new dataset from an existing one - map, filter, distinct, union,
sample, groupByKey, join, etc.
• RDD transformations allow to create dependencies between RDDs
• Dependencies are only steps for producing results (a program)
52. RDD Operators -Transformations
• Each RDD in lineage chain (string of dependencies) has a function for
calculating its data and has a pointer (dependency) to its parent RDD
• Spark divides RDD dependencies into stages and tasks and sends those to
workers for execution
• Lazy operators
54. RDD Operators - Actions
• Return a value after running a computation
• Compute a result based on a RDD
• Result is returned to the driver program or saved to an external storage
system
• Typical RDD actions are count, first, collect, takeSample, and foreach (see the sketch below)
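For example, a short sketch of common actions on a small illustrative RDD:
  val nums = sc.parallelize(1 to 5)
  nums.count()               // 5
  nums.first()               // 1
  nums.collect()             // Array(1, 2, 3, 4, 5) returned to the driver
  nums.takeSample(false, 2, 42)   // 2 random elements, seed 42
  nums.foreach(n => println(n))   // runs on the executors, not the driver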
56. Transformations
• Set of operations of a RDD that define how its data should be transformed
• An operation such as map(), filter() or union() on an RDD that yields another RDD
• Transformations create new RDD based on the existing RDD.
• RDD's are immutable
• Lazily evaluated - Data in RDD's is not processed until an action is performed.
57. Transformations
• Why lazy execution? Because we expect to apply some optimization to the
series of transformations on the RDD
• The Spark driver remembers the transformations applied to an RDD, so a lost
partition can be reconstructed on some other machine in the cluster
• This resiliency is achieved via a lineage graph
58. Transformations
• words - an RDD containing a reference to the lines RDD
• When the program executes, first lines' function is executed (load the data from a text
file)
• Then words' function is executed on the resulting data (split lines into words)
• Spark is lazy, so nothing is executed until an action is called that triggers
job creation and execution (collect in this example)
• RDD (transformed RDD, too) is not 'a set of data', but a step in a program (might
be the only step) telling Spark how to get the data and what to do with it
59. Transformations
• val lines = sc.textFile("...")
• val words = lines.flatMap(line => line.split(" "))
• val localwords = words.collect()
60. Actions
• Applies all transformations on RDD and then performs the action to obtain results
• Operations that return a final value to the driver program or write data to an
external storage system
• After performing action on RDD, the result is returned to the driver program or
written to the storage system
61. Actions
• Actions force the evaluation of the transformations required for the RDD
they were called on, since they need to actually produce output
• Actions can be recognized by looking at the return value
– primitive and built-in types such as int, long, List<Object>, Array<Object>, …
indicate an action
65. RDD Creation
• Read from data sources - HDFS, JSON files, text files - any kind of files
• Transforming other RDDs using parallel operations - transformations and actions
• RDD keeps information about how it was derived from other RDDs
• A RDD has a set of partitions and a set of dependencies on parent RDD
• Narrow dependency if it derives from only 1 parent
66. RDD Creation
• Wide dependency if it has two or more parents (e.g., joining 2 parents)
• A function to compute the partitions from its parents
• Metadata about its partitioning scheme and data placement (preferred
location to compute for each partition)
• Partitioner (defines strategy of partitioning its partitions)
67. Shared Variables
• When Spark runs a function in parallel as a set of tasks on different nodes, it ships
a copy of each variable used in the function to each task
• These variables are copied to each machine
• No updates to the variables on the remote machine are propagated back to the
driver program
• Spark does provide two limited types of shared variables for two common usage
patterns
– broadcast variables
– accumulators
68. Broadcast Variables
• A broadcast variable is a read-only variable made available from the driver
program that runs the SparkContext object to the nodes that will execute the
computation
• Useful in applications that make the same data available to the worker nodes in an
efficient manner, such as machine learning algorithms
• The broadcast values are not shipped to the nodes more than once
69. Broadcast Variables
• To create broadcast variables, call a method on SparkContext
– val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
• Spark attempts to distribute broadcast variables using efficient broadcast
algorithms to reduce communication cost
– For example, to give every node a copy of a large input dataset efficiently
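For example, a sketch of reading the broadcast value inside a task (read-only access on the executors):
  val broadcastAList = sc.broadcast(List("a", "b", "c", "d", "e"))
  val sizes = sc.parallelize(1 to 3).map(_ => broadcastAList.value.size)   // workers read, never modify
  sizes.collect()   // Array(5, 5, 5)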
70. Accumulators
• An accumulator is also a variable that is shipped to the worker nodes
• Variables that can only be added to through an associative operation
• The addition must be an associative operation so that the global accumulated
value can be correctly computed in parallel and returned to the driver program
• Used to implement counters and sums, efficiently in parallel
71. Accumulators
• Spark natively supports accumulators of numeric value types and standard
mutable collections, and programmers can extend for new types
• Only the driver program can read an accumulator’s value, not the task
• Each worker node can only access and add to its own local accumulator value
• Only the driver program can access the global value
• Accumulators are also accessed within the Spark code using the value method
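A minimal sketch using the classic accumulator API matching this deck's Spark version; the log path is an assumption:
  val errorCount = sc.accumulator(0)   // numeric accumulator, starts at 0
  sc.textFile("hdfs:///logs/app.log")
    .foreach(line => if (line.contains("ERROR")) errorCount += 1)   // tasks can only add
  println(errorCount.value)   // only the driver reads the global total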
73. RDD Partitions
• An RDD is divided into a number of partitions, which are atomic pieces of
information
• Partitions of an RDD can be stored on different nodes of a cluster
• RDD data is just collection of partitions
• Logical division of data
• Derived from Hadoop Map/Reduce
• All input, intermediate and output data will be represented as partitions
• Partitions are basic unit of parallelism
75. Partitioning - Immutability
• All partitions are immutable
• Each RDD has 2 sets of parallel operations - transformation and action
• Every transformation generates new partition
• Partition immutability is driven by the underlying storage, such as HDFS
• Partition immutability allows for fault recovery
76. Partitioning - Distribution
• Partitions derived from HDFS are distributed by default
• Partitions are also location aware
• Location awareness of partitions allow for data locality
• Computed data can also be distributed in memory using caching
77. Accessing Partitions
• Partitions are accessed together, a single row at a time
• Use the mapPartitions API of RDD
• Allows partition-wise operations that cannot be done by accessing a
single row, as sketched below
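A sketch of a partition-wise operation with mapPartitions - one expensive object is created per partition instead of per row; the date format and dataRDD are illustrative:
  val parsed = dataRDD.mapPartitions { rows =>
    // one parser instance per partition, reused for every row in it
    val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd")
    rows.map(line => fmt.parse(line))
  }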
78. Partitioning ofTransformed Data
• Partitioning is different for key/value pairs that are generated by shuffle
operation
• Partitioning is driven by partitioner specified
• By default HashPartitioner is used
• Can use your own partitioner too
79. Custom Partitioner
• Partition the data according to your data structure
• Custom partitioning allows control over the number of partitions and
the distribution of data when grouping or reducing is
done (see the sketch below)
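A hedged sketch of a custom partitioner; the FirstCharPartitioner name and the pairs RDD are illustrative, not a Spark API:
  import org.apache.spark.Partitioner

  class FirstCharPartitioner(override val numPartitions: Int) extends Partitioner {
    // route each key to a partition based on its first character
    def getPartition(key: Any): Int =
      math.abs(key.toString.headOption.getOrElse(' ').hashCode) % numPartitions
  }

  val grouped = pairs.groupByKey(new FirstCharPartitioner(8))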
80. Lookup Operation
• Partitioning allows faster lookups
• The lookup operation allows looking up a value for a given key
• Using the partitioner, lookup determines which partition to look in
• Then it only needs to look in that partition
• If no partitioner is specified, it falls back to a filter
81. Laziness – Parent Dependency
• Each RDD has access to parent RDD
• Value of parent for first RDD is nil
• Before computing its value, it always computes its parent
• This chain of running allows for laziness
82. Subclassing
• Each Spark operator creates an instance of a specific subclass of RDD
• The map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc.
• The subclass allows the RDD to remember the operation performed in the
transformation
83. RDD Transformations
• val dataRDD = sc.textFile(args(1))
• val splitRDD = dataRDD.flatMap(value => value.split(" "))
• Compute
– A function for evaluation of each partition in
RDD
– An abstract method of RDD
– Each subclass of RDD, like MappedRDD or FilteredRDD, has to override this
method
84. Compute Function
• A function for evaluation of each partition in RDD
• An abstract method of RDD
• Each subclass of RDD, like MappedRDD or FilteredRDD, has to override
this method
85. Lineage
• Transformations used to build an RDD
• RDDs are stored as chain of objects
capturing the lineage of each RDD
• val file = sc.textFile("hdfs://...")
• val sics = file.filter(_.contains("SICS"))
• val cachedSics = sics.cache()
• val ones = cachedSics.map(_ => 1)
• val count = ones.reduce(_+_)
86. RDD Actions
• val dataRDD = sc.textFile(args(1))
• val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
• flatMapRDD.collect()
• runJob API
– an API of RDD for action implementation
– Allows taking each partition and evaluating it
– Internally used by all Spark actions
87. Memory Management
• If there is not enough space in memory for a new computed RDD partition, a
partition from the least recently used RDD is evicted
• Spark provides three options for storage of persistent RDDs
– In memory storage as de-serialized Java objects
– In memory storage as serialized Java objects
– On disk storage
• When an RDD is persisted, each node stores any partitions of the RDD that it
computes in memory - allows future actions to be much faster
88. Memory Management
• Persisting an RDD using persist() or cache() methods
• Storage levels
– MEMORY_ONLY
– MEMORY_AND_DISK
– MEMORY_ONLY_SER
– MEMORY_AND_DISK_SER
– MEMORY_ONLY_2, MEMORY_AND_DISK_2, ...
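For example, a sketch of choosing a storage level explicitly; the input path is illustrative:
  import org.apache.spark.storage.StorageLevel

  val lines = sc.textFile("hdfs:///data/input.txt")
  lines.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk when memory is full
  // lines.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)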
89. Caching
• Cache internally uses the persist API
• Persist sets a specific storage level for a given RDD
• The Spark context tracks persistent RDDs
• Partition is put into memory by block manager
90. Caching - Block Manager
• Handles all in-memory data in Spark
• Responsible for
– Cached data (BlockRDD)
– Shuffle data
– Broadcast data
• A partition is stored in a block with id (RDD.id, partition_index)
91. Working of Caching
• The partition iterator checks the storage level
• If a storage level is set, it calls cacheManager.getOrCompute(partition)
• As the iterator is run for each RDD evaluation, this is transparent to the user
93. Extending Spark API
• Extending the RDD API allows creating custom RDD structures
• Custom RDDs allow control over computation
• Possible to change partitions, locality and evaluation depending upon
requirements
94. Extending Spark API
• Custom operators on RDDs
– Domain-specific operators for specific RDDs
– Uses the Scala implicit mechanism
– Feels and works like a built-in operator (see the sketch below)
• Custom RDD
– Extend the RDD API to create a new RDD
– Combined with custom operators, this makes the API powerful
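A hedged sketch of adding a domain-specific operator via Scala implicits; RichStringRDD and wordCount are illustrative names, not a Spark API:
  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD

  object StringRDDOps {
    implicit class RichStringRDD(rdd: RDD[String]) {
      // feels like a built-in operator once StringRDDOps._ is imported
      def wordCount(): RDD[(String, Int)] =
        rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    }
  }
  // usage: import StringRDDOps._ ; sc.textFile("...").wordCount().collect()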
95. RDD Benefits
• Data and intermediate results are stored in memory to speed up
computation and are located on appropriate nodes for optimization
• Able to perform transformation operations on an RDD many times
• Lineage information about RDD transformations is kept for failure
recovery - if a failure occurs while operating on a partition, it is recomputed
96. RDD Benefits - Persistence
• Default is in memory
• Able to locate replicas on multiple nodes
• If data does not fit in memory, it is spilled to disk
• Better to make a checkpoint when a lineage is long or a wide dependency
exists on the lineage - checkpointing is performed in the background
97. RDD Benefits
• Data locality works for narrow dependencies
• Intermediate results of wide dependencies are dumped to disk, like
mapper output
• Comparison to DSM (Distributed Shared Memory)
– Hard to implement fault tolerance on commodity servers
– An RDD is immutable, so it is easy to take a backup
– In DSM, tasks access the same memory locations and interfere with each
other's updates
100. Thank You
Check Out My LinkedIn Profile at
https://in.linkedin.com/in/girishkhanzode