This document presents Resilient Distributed Datasets (RDDs), the fault-tolerant abstraction for in-memory cluster computing introduced with Spark. RDDs are distributed, immutable collections of records, partitioned across a cluster and operated on through transformations and actions, which lets programmers run iterative and interactive computations over large datasets. Because each RDD tracks the lineage of transformations that produced it, lost partitions can be recomputed, making the abstraction far more efficient for iterative algorithms than MapReduce.
Presentation slides for the paper on Resilient Distributed Datasets, written by Matei Zaharia et al. at the University of California, Berkeley.
The paper is not my work.
These slides were made for the Advanced Distributed Systems course taught by Prof. Bratsberg at NTNU (Norwegian University of Science and Technology, Trondheim, Norway).
1. RESILIENT DISTRIBUTED DATASETS: A
FAULT-TOLERANT ABSTRACTION FOR
IN-MEMORY CLUSTER COMPUTING
MATEI ZAHARIA, MOSHARAF CHOWDHURY, TATHAGATA DAS, ANKUR DAVE, JUSTIN MA, MURPHY MCCAULEY,
MICHAEL J. FRANKLIN, SCOTT SHENKER, ION STOICA.
NSDI'12 PROCEEDINGS OF THE 9TH USENIX CONFERENCE ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION
PAPERS WE LOVE AMSTERDAM
AUGUST 13, 2015
@gabriele_modena
2. (C) PRESENTATION BY GABRIELE MODENA, 2015
About me
• CS.ML
• Data science & predictive modelling
• with a sprinkle of systems work
• Hadoop & c. for data wrangling & crunching numbers
• … and Spark
4. (C) PRESENTATION BY GABRIELE MODENA, 2015
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.
5. (C) PRESENTATION BY GABRIELE MODENA, 2015
How
• Review (concepts from) key related work
• RDD + Spark
• Some critiques
6. (C) PRESENTATION BY GABRIELE MODENA, 2015
Related work
• MapReduce
• Dryad
• Hadoop Distributed FileSystem (HDFS)
• Mesos
7. (C) PRESENTATION BY GABRIELE MODENA, 2015
What’s an iterative algorithm anyway?
data = input data
w = <target vector>
for i in num_iterations:
    for item in data:
        update(w)
• Multiple input scans
• At each iteration, do something
• Update a shared data structure
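To make the pattern concrete, below is a minimal stand-alone sketch (plain Python, not taken from the paper or the slides) of such an iterative algorithm: per-record gradient descent for a one-dimensional least-squares fit. Every iteration rescans the full input and updates the shared weight w, which is exactly the access pattern that penalizes disk-based MapReduce and motivates keeping the working set in memory.

import random

def gradient(w, x, y):
    # gradient of the squared error (w*x - y)^2 with respect to w
    return 2 * (w * x - y) * x

# toy dataset: y = 3x plus noise (hypothetical data, for illustration only)
data = [(x, 3.0 * x + random.gauss(0, 0.01))
        for x in (random.uniform(0, 1) for _ in range(1000))]

w = 0.0                              # the shared target vector (a single weight here)
num_iterations, lr = 20, 0.1
for i in range(num_iterations):      # multiple input scans
    for x, y in data:                # at each iteration, visit every record
        w -= lr * gradient(w, x, y)  # update the shared data structure

print(w)                             # converges towards 3.0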
8. (C) PRESENTATION BY GABRIELE MODENA, 2015
HDFS
• GFS paper (2003)
• Distributed storage (with replication)
• Block ops
• NameNode hashes file locations (blocks)
[Diagram: a NameNode coordinating several DataNodes]
14. (C) PRESENTATION BY GABRIELE MODENA, 2015
MapReduce
• Google paper (2004)
• Apache Hadoop (~2007)
• Divide and conquer functional model
• Goes hand-in-hand with HDFS
• Structure data as (key, value)
1. Map(): filter and project, emit (k, v) pairs
2. Reduce(): aggregate and summarise, group by key and count
[Diagram: Map tasks read blocks from HDFS; Reduce tasks write results back to HDFS]
Input:
This is a test
Yes it is a test
…
Map output: (This, 1), (is, 1), (a, 1), (test, 1), (Yes, 1), (it, 1), (is, 1)
Reduce output: (This, 1), (is, 2), (a, 2), (test, 2), (Yes, 1), (it, 1)
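The word count above can be mimicked in a few lines of self-contained Python (a hypothetical sketch of the programming model, not Hadoop code): a map phase that emits (word, 1) pairs, a shuffle that groups the pairs by key, and a reduce phase that sums the counts.

from collections import defaultdict

def map_fn(line):
    # Map(): filter and project, emit (k, v) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce(): aggregate and summarise
    return key, sum(values)

lines = ["This is a test", "Yes it is a test"]

mapped = [pair for line in lines for pair in map_fn(line)]   # map phase

groups = defaultdict(list)                                   # shuffle: group by key
for key, value in mapped:
    groups[key].append(value)

counts = dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase
print(counts)   # {'This': 1, 'is': 2, 'a': 2, 'test': 2, 'Yes': 1, 'it': 1}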
15. (C) PRESENTATION BY GABRIELE MODENA, 2015
(c) Image from Apache Tez http://tez.apache.org
16. (C) PRESENTATION BY GABRIELE MODENA, 2015
Critiques to MR and HDFS
• Great when records (and jobs) are independent
• In reality, expect data to be shuffled across the network
• Latency measured in minutes
• Performance hit for iterative methods
• Composability monsters
• Meant for batch workflows
17. (C) PRESENTATION BY GABRIELE MODENA, 2015
Dryad
• Microsoft paper (2007)
• Inspired Apache Tez
• Generalisation of MapReduce via I/O pipelining
• Applications are (directed acyclic) graphs of tasks
18. (C) PRESENTATION BY GABRIELE MODENA, 2015
Dryad
DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
   .addVertex(summerVertex)
   .addEdge(new Edge(tokenizerVertex,
                     summerVertex,
                     edgeConf.createDefaultEdgeProperty()));
19. (C) PRESENTATION BY GABRIELE MODENA, 2015
MapReduce and Dryad
SELECT a.country, COUNT(b.place_id)
FROM place a JOIN tweets b ON (a.place_id = b.place_id)
GROUP BY a.country;
(c) Image from Apache Tez http://tez.apache.org. Modified.
20. (C) PRESENTATION BY GABRIELE MODENA, 2015
Critiques to Dryad
• No explicit abstraction for data sharing
• Must express data reps as DAG
• Partial solution: DryadLINQ
• No notion of a distributed filesystem
• How to handle large inputs?
• Local writes / remote reads?
21. (C) PRESENTATION BY GABRIELE MODENA, 2015
Resilient Distributed Datasets
Read-only, partitioned collection of records => a distributed immutable array
Accessed via coarse-grained transformations => apply a function (a Scala closure) to all elements of the array
[Diagram: an RDD as a row of object partitions]
23. (C) PRESENTATION BY GABRIELE MODENA, 2015
Spark
Runtime and API
• Transformations - lazily create RDDs
  wc = dataset.flatMap(tokenize)
              .reduceByKey(add)
• Actions - execute computation
  wc.collect()
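For a complete version of the sketch above, here is the same word count written against the PySpark RDD API (a hedged example assuming a local Spark installation; tokenize is spelled out, and the (word, 1) pairing step, implicit on the slide, is added so that reduceByKey has key-value pairs to work on).

from operator import add
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

def tokenize(line):
    return line.split()

dataset = sc.textFile("hdfs://...")            # an RDD of text lines

# Transformations: nothing runs yet, Spark only records the lineage.
wc = (dataset.flatMap(tokenize)                # line  -> words
             .map(lambda word: (word, 1))      # word  -> (word, 1)
             .reduceByKey(add))                # sum the counts per word

# Action: triggers the job and ships the results back to the driver.
print(wc.collect())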
30. (C) PRESENTATION BY GABRIELE MODENA, 2015
Applications
• Driver code defines RDDs and invokes actions
• Submit to long-lived workers, that store partitions in memory
• Scala closures are serialised as Java objects and passed across the network over HTTP
• Variables bound to the closure are saved in the serialised object
• Closures are deserialised on each worker and applied to the RDD (partition)
• Mesos takes care of resource management
[Diagram: the driver ships tasks to the workers; each worker holds input-data partitions in RAM and sends results back to the driver]
31. (C) PRESENTATION BY GABRIELE MODENA, 2015
Data persistence
1. in memory as deserialized Java objects
2. in memory as serialized data
3. on disk
RDD checkpointing
Memory management via an LRU eviction policy
.persist() an RDD for future reuse
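As a rough sketch of how these options surface in the PySpark API (reusing the sc context and the wc RDD from the earlier word-count example; a hedged illustration, not an exhaustive list of storage levels):

from pyspark import StorageLevel

wc.persist(StorageLevel.MEMORY_ONLY)     # option 1: keep partitions in memory
# wc.persist(StorageLevel.DISK_ONLY)     # option 3: store partitions on disk instead
# (a serialized in-memory level covers option 2 on the JVM side)

# RDD checkpointing: materialize the RDD to stable storage and truncate its lineage.
sc.setCheckpointDir("hdfs://...")
wc.checkpoint()

wc.count()   # the first action computes and persists/checkpoints the RDD
wc.count()   # later actions reuse the stored partitions instead of recomputing them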
42. (C) PRESENTATION BY GABRIELE MODENA, 2015
Lineage
Fault recovery: if a partition is lost, derive it back from the lineage.
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist()
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
[Lineage graph: lines --filter(_.startsWith("ERROR"))--> errors --filter(_.contains("HDFS"))--> hdfs errors --map(_.split('\t')(3))--> time fields]
43. (C) PRESENTATION BY GABRIELE MODENA, 2015
Representation
Challenge: track lineage across transformations
1. List of partitions
2. Data locality for a partition p
3. List of dependencies on parent RDDs
4. Iterator function to compute a dataset based on its parents
5. Metadata about the partitioning scheme
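The paper condenses these five items into a small common interface; the snippet below is a hypothetical Python restatement of that interface (not Spark's actual source code), just to show how each concrete RDD answers the five questions.

from abc import ABC, abstractmethod

class RDD(ABC):
    """Hypothetical sketch of the RDD interface described in the paper."""

    @abstractmethod
    def partitions(self):
        """1. The list of partitions of this dataset."""

    @abstractmethod
    def preferred_locations(self, partition):
        """2. Nodes where this partition can be accessed faster (data locality)."""

    @abstractmethod
    def dependencies(self):
        """3. The list of dependencies on parent RDDs (narrow or wide)."""

    @abstractmethod
    def iterator(self, partition, parent_iterators):
        """4. Compute the elements of a partition from its parents' data."""

    @abstractmethod
    def partitioner(self):
        """5. Metadata about the partitioning scheme (e.g. hash or range), or None."""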
44. (C) PRESENTATION BY GABRIELE MODENA, 2015
Narrow dependencies
Pipelined execution on one cluster node
Examples: map, filter, union
45. (C) PRESENTATION BY GABRIELE MODENA, 2015
Wide dependencies
Require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation
Examples: groupByKey, join with inputs not co-partitioned
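A small PySpark sketch contrasting the two kinds of dependencies (again assuming the sc context from the word-count example): the map/filter chain is pipelined inside a single stage, while groupByKey forces a shuffle and therefore a stage boundary.

nums = sc.parallelize(range(1000), numSlices=4)

# Narrow dependencies: each output partition depends on a single parent partition,
# so map and filter are pipelined on the same node within one stage.
narrow = nums.map(lambda x: (x % 10, x)).filter(lambda kv: kv[1] % 2 == 0)

# Wide dependency: every output partition may need data from all parent partitions,
# so groupByKey shuffles records across the cluster and starts a new stage.
wide = narrow.groupByKey()

print(wide.toDebugString().decode())   # the lineage printout shows the shuffle boundary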
46. (C) PRESENTATION BY GABRIELE MODENA, 2015
Scheduling
Tasks are allocated based on data locality (delay scheduling)
1. An action is triggered => compute the RDD
2. Based on lineage, build a graph of stages to execute
3. Each stage contains as many pipelined transformations with narrow dependencies as possible
4. Launch tasks to compute missing partitions from each stage until it has computed the target RDD
5. If a task fails => re-run it on another node, as long as its stage's parents are still available
59. (C) PRESENTATION BY GABRIELE MODENA, 2015
Job execution
B = A.groupBy
D = C.map
F = D.union(E)
G = B.join(F)
G.collect()
[DAG: Stage 1 = A --groupBy--> B; Stage 2 = C --map--> D, then D union E --> F; Stage 3 = B join F --> G]
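The same A..G job can be written against the PySpark API; a sketch with made-up toy inputs for A, C and E (the slide leaves them unspecified), so that Spark builds roughly the three stages shown in the diagram.

A = sc.parallelize(range(100))
C = sc.parallelize(range(50))
E = sc.parallelize([(0, "x"), (1, "y")])   # hypothetical extra input

B = A.groupBy(lambda x: x % 3)       # wide dependency  -> Stage 1
D = C.map(lambda x: (x % 3, x))      # narrow dependency
F = D.union(E)                       # narrow: map and union pipeline into Stage 2
G = B.join(F)                        # wide dependency  -> Stage 3
print(G.collect())                   # the action that triggers the whole DAG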
61. (C) PRESENTATION BY GABRIELE MODENA, 2015
Some critiques (to the paper)
• How general is this approach?
• We are still doing MapReduce
• Concerns wrt iterative algorithms still stand
• CPU bound workloads?
• Linear algebra?
• How much tuning is required?
• How does the partitioner work?
• What is the cost of reconstructing an RDD from lineage?
• Performance when data does not fit in memory, e.g. a join between two very large non co-partitioned RDDs
62. (C) PRESENTATION BY GABRIELE MODENA, 2015
References (Theory)
• Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Zaharia et al., Proceedings of NSDI '12. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Spark: cluster computing with working sets. Zaharia et al., Proceedings of HotCloud '10. http://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf
• The Google File System. Ghemawat, Gobioff, Leung. 19th ACM Symposium on Operating Systems Principles, 2003. http://research.google.com/archive/gfs.html
• MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat. OSDI '04: Sixth Symposium on Operating System Design and Implementation. http://research.google.com/archive/mapreduce.html
• Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Isard, Budiu, Yu, Birrell, Fetterly. European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
• Mesos: a platform for fine-grained resource sharing in the data center. Hindman et al., Proceedings of NSDI '11. https://www.cs.berkeley.edu/~alig/papers/mesos.pdf
63. (C) PRESENTATION BY GABRIELE MODENA, 2015
References (Practice)
• An overview of the pyspark API through pictures: https://github.com/jkthompson/pyspark-pictures
• Barry Brumitt's presentation on MapReduce design patterns (UW CSE490): http://courses.cs.washington.edu/courses/cse490h/08au/lectures/MapReduceDesignPatterns-UW2.pdf
• The Dryad Project: http://research.microsoft.com/en-us/projects/dryad/
• Apache Spark: http://spark.apache.org
• Apache Hadoop: https://hadoop.apache.org
• Apache Tez: https://tez.apache.org
• Apache Mesos: http://mesos.apache.org