The document discusses the Hadoop and Spark frameworks for big data analytics. It describes how Hadoop consists of HDFS for distributed storage and MapReduce for distributed processing. Spark is faster than MapReduce for iterative algorithms and interactive queries since it keeps data in memory. While MapReduce is best for one-pass batch jobs, Spark performs better for iterative jobs that require multiple passes over a dataset.
Introduction to SARA's Hadoop Hackathon - December 7th, 2010 (Evert Lammerts)
This document summarizes an agenda for the SARA Hadoop Hackathon on December 7, 2010. It provides background on Hadoop and how it relates to earlier technologies like Nutch and MapReduce. It then outlines the agenda for the day which includes introductions, presentations on MapReduce at University of Twente and a kickoff for the hackathon project building period. An optional tour of the SARA facilities is also included. The day will conclude with presentations of hackathon results.
The Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. In-depth coverage of concepts such as the Hadoop Distributed File System, setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc. is included in the course.
This document is a presentation on big data and Hadoop. It introduces big data, how it is growing exponentially, and the challenges of storing and analyzing unstructured data. It discusses how Sears moved to Hadoop to gain insights from all of its customer data. The presentation explains why Hadoop is in high demand, as it allows distributed processing of large datasets across commodity hardware. It provides an overview of the Hadoop ecosystem including HDFS, MapReduce, Hive, HBase and more. Finally, it discusses job opportunities and salaries in big data which are high and growing significantly.
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for processing of structured, semi-structured, and unstructured data using simple programming models. Hadoop can handle failures at the application layer and provides highly available services across large clusters of computers that may be prone to failures. Case studies show Hadoop has been used for fraud detection in banking, analyzing customer spending, predicting weather patterns and climate changes, processing astronomical images, and more wherever high volumes of unstructured data are growing rapidly.
- Data science domains like statistics, natural language processing, predictive analytics, and visualization have entered the market, while image processing, the internet of things, and artificial intelligence are still in the exploration stage.
- The "3 V's of BIG DATA" are volume, variety, and velocity.
- Popular programming languages for data science include R, Python, and SQL.
- Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. The core Hadoop modules are Hadoop Common, HDFS, YARN, and MapReduce.
- A sample data science methodology includes defining a problem statement, choosing an appropriate machine learning algorithm, and running models/analysis in R/Python.
With the surge in Big Data, organizations have begun to implement Big Data technologies as part of their systems. This has led to a huge need to update existing skillsets with Hadoop. Java professionals are one such group who need to add Hadoop skills.
Big data refers to large, complex datasets that are growing exponentially and are difficult to process using traditional methods. Large companies like Walmart, Facebook, and AT&T generate huge amounts of big data through customer transactions, social media activity, and telecommunications networks. Apache Hadoop is an open source software framework that harnesses big data by using HDFS for data storage and MapReduce for distributed processing across clusters of computers. The Hadoop ecosystem includes tools like Ambari, Flume, Sqoop, Oozie, Pig, Mahout, Hive, HBase, and Zookeeper that support functions like provisioning, data collection, transfer, workflows, scripting, machine learning, querying, columnar storage, and coordination.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It was developed based on Google papers describing Google File System (GFS) for reliable distributed data storage and MapReduce for distributed parallel processing. Hadoop uses HDFS for storage and MapReduce for processing in a scalable, fault-tolerant manner on commodity hardware. It has a growing ecosystem of projects like Pig, Hive, HBase, Zookeeper, Spark and others that provide additional capabilities for SQL queries, real-time processing, coordination services and more. Major vendors that provide Hadoop distributions include Hortonworks and Cloudera.
Big Data is one of the hot topics and has got the attention of the IT industry globally. It is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.
This presentation focuses on why, what, how of big data as we explore some of Microsoft's big data solutions - HDInsight azure service and PowerBI, providing insights into the world of Big data.
The document provides an introduction to big data and Apache Hadoop. It discusses big data concepts like the 3Vs of volume, variety and velocity. It then describes Apache Hadoop including its core architecture, HDFS, MapReduce and running jobs. Examples of using Hadoop for a retail system and with SQL Server are presented. Real world applications at Microsoft and case studies are reviewed. References for further reading are included at the end.
Is It the Right Time for Me to Learn Hadoop? Find Out (Edureka!)
Forrester predicts, CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved as a must-to-know technology and has been a reason for better career, salary and job opportunities for many professionals.
This document provides an introduction and overview of Apache Hadoop. It discusses how Hadoop provides the ability to store and analyze large datasets in the petabyte range across clusters of commodity hardware. It compares Hadoop to other systems like relational databases and HPC and describes how Hadoop uses MapReduce to process data in parallel. The document outlines how companies are using Hadoop for applications like log analysis, machine learning, and powering new data-driven business features and products.
This document provides an overview of MapReduce and Apache Hadoop. It discusses the history and components of Hadoop, including HDFS and MapReduce. It then walks through an example MapReduce job, the WordCount algorithm, to illustrate how MapReduce works. The WordCount example counts the frequency of words in documents by having mappers emit <word, 1> pairs and reducers sum the counts for each word.
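As a concrete illustration of that pattern, here is a minimal WordCount sketch written as Hadoop Streaming-style mapper and reducer scripts in Python (the script names and the local test command are illustrative, not from the summarized deck):

```python
#!/usr/bin/env python3
# mapper.py - emits a <word, 1> pair per word, one per output line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - sums the counts per word. Hadoop Streaming sorts the
# mapper output by key, so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The same scripts can be tested locally without a cluster, with `sort` standing in for Hadoop's shuffle: `cat input.txt | python3 mapper.py | sort | python3 reducer.py`.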
Sarah Guido gave a presentation on analyzing data with Python. She discussed several Python tools for preprocessing, analysis, and visualization including Pandas for data wrangling, scikit-learn for machine learning, NLTK for natural language processing, MRjob for processing large datasets in parallel, and ggplot for visualization. For each tool, she provided examples and use cases. She emphasized that the best tools depend on the type of data and analysis needs.
Big-Data Hadoop Tutorials - MindScripts Technologies, Pune (amrutupre)
MindScripts Technologies is a leading Big-Data Hadoop training institute in Pune, providing a complete Big-Data Hadoop course with Cloudera certification.
Hadoop simplifies your job as a Data Warehousing professional. With Hadoop, you can manage any volume, variety and velocity of data flawlessly and in comparatively less time. As a Data Warehousing professional, you will undoubtedly have troubleshooting and data-processing skills. These skills are sufficient for you to become a proficient Hadoop-er.
Key Questions Answered
What is Big Data and Hadoop?
What are the limitations of current Data Warehouse solutions?
How does Hadoop solve these problems?
What are real-world Hadoop use-cases in Data Warehouse solutions?
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It implements Google's MapReduce programming model and the Hadoop Distributed File System (HDFS) for reliable data storage. Key components include a JobTracker that coordinates jobs, TaskTrackers that run tasks on worker nodes, and a NameNode that manages the HDFS namespace and DataNodes that store application data. The framework provides fault tolerance, parallelization, and scalability.
Build a Big Data solution using DB2 for z/OS (Jane Man)
The document discusses building a Big Data solution using IBM DB2 for z/OS and IBM BigInsights. It provides an overview of new functions in DB2 11 that allow DB2 applications to access and analyze data stored in Hadoop. Specifically, it describes the JAQL_SUBMIT and HDFS_READ functions that enable submitting analytic jobs to BigInsights from DB2 and reading the results back into DB2. Examples are provided that show an integrated workflow of submitting a JAQL query to BigInsights from DB2, reading the results into a DB2 table, and querying the results. Potential use cases for integrating DB2 and BigInsights are also outlined.
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) (npinto)
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
This document summarizes different types of computers: desktops are for regular use with a tower, keyboard, mouse and monitor; laptops are portable versions with integrated components and a battery; notebooks are smaller laptops for basic functions; tablets are handheld devices for web browsing and games; games consoles connect to TVs for single/multiplayer gaming; mobile phones allow calling/texting and some can browse/email as smartphones; PDAs provide wireless internet and email capabilities; servers perform specific tasks like file storage or printing; embedded computers have control functions within larger systems like appliances or vehicles.
The document discusses scheduling Hadoop pipelines using various Apache projects. It provides an example of a marketing profit and loss (PnL) pipeline that processes booking, marketing spend, and web log data. It describes scheduling the example jobs using cron-style scheduling and the problems with time-based scheduling. It then introduces Apache Oozie and Apache Falcon for more robust workflow scheduling based on dataset availability. It provides examples of using Oozie coordinators and workflows and Falcon feeds and processes to schedule the example PnL pipeline based on when input data is available rather than fixed time schedules.
Soft skills are important for career success in addition to technical skills. Soft skills include teamwork, communication, etiquette, time management, and attitude. The document recommends students evaluate their own soft skills through self-reflection and seeking honest feedback from others. It encourages students to meet with career services for help improving soft skills and moving forward towards their goals.
This document provides a list of over 200 seminar topics related to computer science, electronics, IT, mechanical engineering, electrical engineering, civil engineering, applied electronics, chemical engineering, biomedical engineering, and MBA projects. The topics are divided into categories such as computer science projects, electronics projects, IT projects, and so on. Each topic includes a brief 1-2 sentence description. Contact information is provided at the bottom for requesting full reports on any of the topics.
Analyzing Big data in R and Scala using Apache Spark 17-7-19 (Ahmed Elsayed)
Data mining can be used to predict future data from historical data, especially Big Data, using machine learning algorithms running on two cluster frameworks. One manages the Big Data file system: Hadoop. The other performs fast analysis of Big Data: Apache Spark. To achieve this we use R (via RStudio) or Scala (via Zeppelin).
StoreApp: a shared storage appliance for efficient and scalable virtualized h... (kiwenlau)
StoreApp is a Hadoop plugin that separates storage (DataNodes) from computation (TaskTrackers) in virtualized Hadoop clusters. This improves HDFS throughput by 78% and reduces job completion times by 61% by addressing challenges like inefficient scaling, disk I/O contention between colocated VMs, and suboptimal VM scheduling. StoreApp introduces a manager and storage nodes to coordinate separated DataNodes, uses a scheduler aware of data locations, and allows HDFS prefetching and automated cluster resizing. Future work could focus on adapting StoreApp for Hadoop 2 and using containers instead of VMs.
R is an open source programming language and software environment for statistical analysis and graphics. It is widely used among data scientists for tasks like data manipulation, calculation, and graphical data analysis. Some key advantages of R include that it is open source and free, has a large collection of statistical tools and packages, is flexible, and has strong capabilities for data visualization. It also has an active user community and can integrate with other software like SAS, Python, and Tableau. R is a popular and powerful tool for data scientists.
This talk given at the Hadoop Summit in San Jose on June 28, 2016, analyzes a few major trends in Big Data analytics.
These are a few takeaways from this talk:
- Adopt Apache Beam for easier development and portability between Big Data execution engines (see the sketch after this list).
- Adopt stream analytics for faster time to insight, competitive advantages and operational efficiency.
- Accelerate your Big Data applications with In-Memory open source tools.
- Adopt Rapid Application Development of Big Data applications: APIs, Notebooks, GUIs, Microservices…
- Make Machine Learning part of your strategy, or passively watch your industry be completely transformed!
- Advance your strategy for hybrid integration between cloud and on-premise deployments.
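To make the first takeaway concrete, here is a minimal Beam word-count pipeline in Python; the pipeline body stays the same while the execution engine is just a runner option (DirectRunner, SparkRunner, FlinkRunner, DataflowRunner, ...), and the input/output paths are hypothetical:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Swap the runner to move the same pipeline to another engine.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("input.txt")        # hypothetical path
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.MapTuple(lambda word, count: f"{word}\t{count}")
     | "Write" >> beam.io.WriteToText("counts"))
```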
Testistanbul 2016 - Keynote: "Performance Testing of Big Data" by Roland Leusden (Turkish Testing Board)
Agile, Continuous Integration, DevOps and Big Data are no longer buzzwords but part of the day-to-day process of everyone working in software development and delivery. To cope with applications that need to be deployed to production almost the moment they are created, software development has changed, impacting the way of working for everyone on the team. In this talk, Roland discusses the challenges performance testers face with Big Data applications and how architecture, Agile, Continuous Integration and DevOps come together to create solutions.
Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.
Nirmal Fernando is a technical lead at WSO2 who graduated from the University of Moratuwa. He discusses machine learning and predictive analytics, explaining that predictive analytics uses patterns in existing data to predict future outcomes. Machine learning gives computers the ability to learn without explicit programming. He then demonstrates building a logistic regression model using Apache Spark MLlib to predict whether individuals in the Pima Indian Diabetes dataset have diabetes.
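A sketch of the kind of model the talk demonstrates, assuming the RDD-based spark.mllib API and a plain-CSV copy of the Pima dataset with eight numeric features followed by a 0/1 label (the file path and column layout are assumptions):

```python
from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="PimaDiabetes")

# Assumed layout: 8 numeric features, then a 0/1 diabetes label.
def parse(line):
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[-1], values[:-1])

data = sc.textFile("pima-indians-diabetes.csv").map(parse)  # hypothetical path
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = LogisticRegressionWithLBFGS.train(train)

# Fraction of correct predictions on the held-out split.
correct = (test.map(lambda p: (model.predict(p.features), p.label))
               .filter(lambda pair: pair[0] == pair[1])
               .count())
print(f"Test accuracy: {correct / float(test.count()):.3f}")
```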
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings, along with its unique vision and philosophy, it is considered the 4G (4th generation) of Big Data Analytics frameworks, providing the only hybrid (real-time streaming + batch) open-source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
2. How does Apache Flink integrate with Hadoop and other open source tools for data input and output as well as deployment?
3. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo (Jason Dai)
Shanghai Apache Spark+AI Online Meetup (https://www.meetup.com/Shanghai-Apache-Spark-AI-Meetup/events/269342169/) on Mar 13, 2020
Topic: Automated Time Series Analysis using Deep Learning, Ray and Analytics Zoo (https://github.com/intel-analytics/analytics-zoo)
Speaker: Shan Yu, Intel
Webinar: Mastering Python - An Excellent tool for Web Scraping and Data Anal... (Edureka!)
The free webinar on Python titled "Mastering Python - An Excellent tool for Web Scraping and Data Analysis" was conducted by Edureka on 14th November 2014
This document discusses large-scale data processing using Apache Hadoop at SARA and BiG Grid. It provides an introduction to Hadoop and MapReduce, noting that data is easier to collect, store, and analyze in large quantities. Examples are given of projects using Hadoop at SARA, including analyzing Wikipedia data and structural health monitoring. The talk outlines the Hadoop ecosystem and timeline of its adoption at SARA. It discusses how scientists are using Hadoop for tasks like information retrieval, machine learning, and bioinformatics.
Building a Big Data platform with the Hadoop ecosystem (Gregg Barrett)
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
-views of the Big Data ecosystem and its components
-an example of a Hadoop cluster
-considerations when selecting a Hadoop distribution
-some of the Hadoop distributions available
-a recommended Hadoop distribution
Unified Batch and Real-Time Stream Processing Using Apache Flink (Slim Baltagi)
This talk was given at Capital One on September 15, 2015 at the launch of the Washington DC Area Apache Flink Meetup. Apache Flink is positioned at the forefront of 2 major trends in Big Data Analytics:
- Unification of Batch and Stream processing
- Multi-purpose Big Data Analytics frameworks
In these slides, we will also find answers to the burning question: Why Apache Flink? You will also learn more about how Apache Flink compares to Hadoop MapReduce, Apache Spark and Apache Storm.
In the past, emerging technologies took years to mature. In the case of big data, while effective tools are still emerging, analytics requirements are changing rapidly, forcing businesses to either adapt or be left behind.
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark (Databricks)
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion.
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S... (Gezim Sejdiu)
Over the past decade, vast amounts of machine-readable structured information have become available through the automation of research processes as well as the increasing popularity of knowledge graphs and semantic technologies.
A major and yet unsolved challenge that research faces today is to perform scalable analysis of large scale knowledge graphs in order to facilitate applications like link prediction, knowledge base completion, and question answering.
Most machine learning approaches that scale horizontally (i.e., can be executed in a distributed environment) work on simpler feature-vector-based input rather than more expressive knowledge structures.
On the other hand, the learning methods which exploit the expressive structures, e.g. Statistical Relational Learning and Inductive Logic Programming approaches, usually do not scale well to very large knowledge bases owing to their working complexity.
This talk gives an overview of the ongoing project Semantic Analytics Stack (SANSA) which aims to bridge this research gap by creating an out-of-the-box library for scalable, in-memory, structured learning.
This document introduces Hivemall, an open-source machine learning library built as a collection of Hive user-defined functions (UDFs). Hivemall allows users to perform scalable machine learning on large datasets stored in Hive/Hadoop. It supports various classification, regression, recommendation, and feature engineering algorithms. Some key algorithms include logistic regression, matrix factorization, random forests, and anomaly detection. Hivemall is designed to perform machine learning efficiently by avoiding intermediate data reads/writes to HDFS. It has been used in industry for applications such as click-through rate prediction, churn detection, and product recommendation.
Data analytics is a powerful tool that can transform business decision-making across industries. Contact District 11 Solutions, which specializes in data analytics, to make informed decisions and achieve your business goals.
Combined supervised and unsupervised neural networks for pulse shape discrimi... (Samuel Jackson)
Our methodology for pulse shape discrimination is split into two steps. Firstly, we learn a model to discriminate between pulses using "clean" low-rate examples by removing pile-up and saturated events. In addition to traditional tail-sum discrimination, we investigate three different choices for discrimination between γ-pulses, fast neutrons, and thermal neutrons. We consider clustering the pulses directly using Gaussian Mixture Modelling (GMM), using variational autoencoders to learn a representation of the pulses and then clustering the learned representation (VAE+GMM), and using density ratio estimation to discriminate between mixed (γ + neutron) and pure (γ only) sources with a multi-layer perceptron (MLP) as a supervised learning problem.
Secondly, we aim to classify and recover pile-up events in the < 150 ns regime by training a single unified multi-label MLP. To frame the problem as a multi-label supervised learning method, we first simulate pile-up events with known components. Then, using the simulated data and combining it with single event data, we train a final multi-label MLP to output a binary code indicating both how many and which type of events are present within an event window.
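As a toy illustration of the first, unsupervised step, here is a GMM clustering sketch with scikit-learn; the two-dimensional pulse features and the three well-separated clusters are synthetic stand-ins, not the paper's data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for (tail-sum ratio, total charge) features of three
# pulse classes: gamma-like, fast-neutron-like, thermal-neutron-like.
features = np.vstack([
    rng.normal(loc=[0.10, 1.0], scale=0.02, size=(500, 2)),
    rng.normal(loc=[0.25, 1.0], scale=0.02, size=(300, 2)),
    rng.normal(loc=[0.40, 1.0], scale=0.02, size=(200, 2)),
])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(features)
print(np.bincount(labels))  # cluster sizes, ~[500, 300, 200] up to label order
```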
How AI is Revolutionizing Data Collection.pdf (PromptCloud)
Artificial Intelligence (AI) is transforming the landscape of data collection, making it more efficient, accurate, and insightful than ever before. With AI, businesses can automate the extraction of vast amounts of data from diverse sources, analyze patterns in real-time, and gain deeper insights with minimal human intervention. This revolution in data collection enables companies to make faster, data-driven decisions, enhance their competitive edge, and unlock new opportunities for growth.
AI-powered tools can handle complex and dynamic web content, adapt to changes in website structures, and even understand the context of data through natural language processing. This means that data collection is not only faster but also more precise, reducing the time and effort required for manual data extraction. Furthermore, AI can process unstructured data, such as social media posts and customer reviews, providing valuable insights into customer sentiment and market trends.
Embrace the future of data collection with AI and stay ahead of the curve. Learn more about how PromptCloud’s AI-driven web scraping solutions can transform your data strategy. https://www.promptcloud.com/contact/
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Introduction to Data Science
1.1 What is Data Science? Importance of data science
1.2 Big data and data science: the current scenario
1.3 Industry perspective. Types of data: structured vs. unstructured data
1.4 Quantitative vs. categorical data
1.5 Big data vs. little data; the data science process
1.6 Role of the data scientist
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ... (weiwchu)
We recently discovered that models trained with large-scale speech datasets sourced from the web could achieve superior accuracy and potentially lower cost than traditionally human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, it automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly-available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling costs.
Big Data and Analytics Shaping the future of Payments (RuchiRathor2)
The payments industry is experiencing a data-driven revolution powered by big data and analytics.
Here's a glimpse into 5 ways this dynamic duo is transforming how we pay.
In essence, big data and analytics are playing a pivotal role in building a future filled with faster, more secure, and convenient payment methods for everyone.
Hadoop/Spark Non-Technical Basics
1. Hadoop/Spark Non-Technical Basics
Zitao Liu
Department of Computer Science
University of Pittsburgh
ztliu@cs.pitt.edu
September 24, 2015
2. Big Data Analytics
Big Data Analytics always requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
3. Big Data Analytics
Big Data Analytics always requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
Hadoop
4. Apache Hadoop
Too many meanings are associated with "Hadoop". Let's look at Apache Hadoop first.
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
5. Apache Hadoop
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common
Hadoop Distributed File System
Hadoop YARN
Hadoop MapReduce
6. Apache Hadoop
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common
Hadoop Distributed File System (HDFS) - storage
Hadoop YARN
Hadoop MapReduce - processing
7. Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
8. Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of:
Map procedure
Reduce procedure
Figure 1: Image from http://tessera.io/docs-datadr/
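To make the map/reduce contract concrete, here is a tiny in-memory simulation of the two procedures plus the framework's shuffle step, using the classic max-temperature-per-year toy example rather than any real Hadoop API:

```python
from collections import defaultdict

# Map procedure: one input record -> a list of (key, value) pairs.
def map_fn(record):
    year, temp = record.split(",")
    return [(year, int(temp))]

# Reduce procedure: a key and all its values -> an output pair.
def reduce_fn(key, values):
    return key, max(values)

records = ["1949,111", "1950,22", "1949,78", "1950,0"]

# Shuffle: group intermediate values by key (done by the framework).
groups = defaultdict(list)
for record in records:
    for key, value in map_fn(record):
        groups[key].append(value)

print([reduce_fn(k, vs) for k, vs in sorted(groups.items())])
# [('1949', 111), ('1950', 22)]
```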
9. Hadoop Ecosystem
Hadoop Ecosystem includes:
Distributed Filesystem, such as HDFS.
Distributed Programming, such as MapReduce, Pig, Spark.
SQL-On-Hadoop, such as Hive, Drill, Presto.
NoSQL Databases.
Column Data Model, such as HBase, Cassandra.
Document Data Model, such as MongoDB.
· · ·
10. MapReduce vs. Spark
A quick history:
Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
11. Advantages of MapReduce
MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as:
sifting through and analyzing system logs
running ETL
computing web indexes
powering personal recommendation systems
· · ·
12. Limitations of MapReduce
Some limitations of MapReduce:
Batch-mode processing (one-pass computation model)
Difficult to program directly in MapReduce
Performance bottlenecks
In short, MR doesn't compose well for a large number of applications. Therefore, people built specialized systems as workarounds, such as Spark. Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
13. Apache Spark
Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). It is a framework for writing fast, distributed programs.
Faster (an in-memory approach): 10 times faster than MapReduce for certain applications; better for iterative algorithms in ML.
Clean, concise APIs in Scala, Java and Python.
Interactive query analysis (from the Scala and Python shells).
Real-time analysis (Spark Streaming).
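A small sketch of what those concise APIs look like from a PySpark session, assuming a simple RDD-based log-filtering job (the input path is hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext(appName="LogFilter")

# Concise, chainable transformations; nothing executes until an action.
lines = sc.textFile("server.log")  # hypothetical input
errors = (lines.filter(lambda l: "ERROR" in l)
               .map(lambda l: l.split("\t")[0]))

print(errors.count())  # action: triggers the distributed computation
print(errors.take(5))  # another action: first few matching entries
```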
14. Advantages of Spark
Low-latency computations by caching the working dataset in memory and then performing computations at memory speed.
Efficient iterative algorithms by having subsequent iterations share data through memory, or by repeatedly accessing the same dataset.
Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-app
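A sketch of the caching pattern behind both points, assuming a toy one-variable least-squares loop over a text file of "x,y" pairs (the input path and learning rate are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="IterativeDemo")

# Parse once, then keep the working set in memory for every later pass.
points = (sc.textFile("points.txt")  # hypothetical input of "x,y" lines
            .map(lambda line: [float(v) for v in line.split(",")])
            .cache())

w = 0.0
for _ in range(10):
    # Each pass reuses the cached RDD at memory speed instead of
    # re-reading and re-parsing the file from disk on every iteration.
    grad = points.map(lambda p: (p[0] * w - p[1]) * p[0]).sum()
    w -= 0.01 * grad
print(w)
```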
15. Apache Spark
Spark has the upper hand as long as we're talking about iterative computations that need to pass over the same data many times. But when it comes to one-pass, ETL-like jobs, for example data transformation or data integration, MapReduce is the better deal; this is what it was designed for [1].
[1] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
16. Apache Spark Cost
The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard-disk space comes at a much lower rate than memory space [2].
[2] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
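As a rough worked example of this trade-off (the 10 TB working set and the 64 GB of usable RAM per worker are assumed figures, not from the slides):

```python
# Back-of-the-envelope cluster sizing for memory-resident processing.
data_tb = 10              # assumed working-set size
ram_per_node_gb = 64      # assumed usable RAM per worker
nodes_for_memory = data_tb * 1024 / ram_per_node_gb
print(f"~{nodes_for_memory:.0f} nodes to hold the data in RAM")  # ~160 nodes

# The same 10 TB fits on a few commodity multi-TB disks, which is why
# disk-based MapReduce can be the cheaper option at this scale.
```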
17. Thank You
Q & A