Hadoop became the most common system for storing big data.
Around Hadoop, many supporting systems emerged to fill in the capabilities that Hadoop itself is missing.
Together they form a big ecosystem.
This presentation covers some of those systems.
Since one presentation cannot cover them all, I focus on the most popular and the most interesting ones.
2. What types of ecosystems exist?
● Systems that are based on MapReduce
● Systems that replace MapReduce
● Complementary databases
● Utilities
● See complete list here
4. Hive
● Part of the Apache project
● General SQL-like syntax for querying HDFS or other
large databases
● Each SQL statement is translated to one or more
MapReduce jobs (in some cases none)
● Supports pluggable Mappers, Reducers and SerDe’s
(Serializer/Deserializer)
● Pro: convenient for analysts who already know SQL
6. Hive Usage
Start a Hive shell:
$ hive
Create a Hive table:
hive> CREATE TABLE tikal (id BIGINT, name STRING, startdate TIMESTAMP, email
STRING);
Show all tables:
hive> SHOW TABLES;
Add a new column to the table:
hive> ALTER TABLE tikal ADD COLUMNS (description STRING);
Load an HDFS data file into the table:
hive> LOAD DATA INPATH '/home/hduser/tikal_users' OVERWRITE INTO TABLE tikal;
Query employees that have worked for more than a year:
hive> SELECT name FROM tikal WHERE (unix_timestamp() - unix_timestamp(startdate) > 365 * 24 * 60 * 60);
7. Pig
● Part of the Apache project
● A programming language (Pig Latin) that is compiled into one or
more MapReduce jobs
● Supports User Defined Functions (UDFs)
● Pro: more convenient to write than raw MapReduce
8. Pig Usage
Start a Pig shell (grunt is the Pig Latin shell prompt):
$ pig
grunt>
Load an HDFS data file:
grunt> employees = LOAD 'hdfs://hostname:54310/home/hduser/tikal_users'
as (id,name,startdate,email,description);
Dump the data to the console:
grunt> DUMP employees;
Query employees that started more than a year ago:
grunt> employees_more_than_1_year = FILTER employees BY
DaysBetween(CurrentTime(), ToDate(startdate)) > 365;
grunt> DUMP employees_more_than_1_year;
Store the query result to a new file:
grunt> store employees_more_than_1_year into
'/home/hduser/employees_more_than_1_year';
9. Cascading
● A Java framework/API whose flows are compiled into one or
more MapReduce jobs
● Provides a graphical view of the MapReduce job workflow
● Provides ways to tweak settings and improve the performance of the
workflow
● Pros:
o Hides the MapReduce API and chains jobs together
o Graphical view and performance tuning
10. MapReduce workflow
● The MapReduce framework operates exclusively on
key/value pairs
● There are three phases in the workflow:
o map
o combine
o reduce
(input) <k1, v1> =>
map => <k2, v2> =>
combine => <k2, v2> =>
reduce => <k3, v3> (output)
11. WordCount in MapReduce Java API
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
12. WordCount in MapReduce Java Cont.
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
13. WordCount in MapReduce Java Cont.
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
14. MapReduce workflow example.
Let’s consider two text files:
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file01
Hello World Bye World
$ bin/hdfs dfs -cat /user/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
15. Mapper code
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
16. Mapper output
For two files there will be two mappers.
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
18. Combiner output
The output of each map is sorted on the keys and then passed
through the local combiner for local aggregation.
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
19. Reducer code
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
20. Reducer output
The reducer sums up the values for each key.
The output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
21. The Cascading core components
● Tap (Data resource)
o Source (Data input)
o Sink (Data output)
● Pipe (data stream)
● Filter (Data operation)
● Flow (assembly of Taps and Pipes)
23. WordCount in Cascading Cont.
// define source and sink Taps.
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );
// the 'head' of the pipe assembly
Pipe assembly = new Pipe( "wordcount" );
// For each input Tuple
// parse out each word into a new Tuple with the field name "word"
// regular expressions are optional in Cascading
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );
// group the Tuple stream by the "word" value
assembly = new GroupBy( assembly, new Fields( "word" ) );
24. WordCount in Cascading
// For every Tuple group
// count the number of occurrences of "word" and store result in
// a field named "count"
Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );
// initialize app properties, tell Hadoop which jar file to use
Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
// plan a new Flow from the assembly using the source and sink Taps
// with the above properties
FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
// execute the flow, block until complete
flow.complete();
26. Scalding
● An extension to Cascading
● The programming language is Scala instead of Java
● Good for functional programming paradigms in data
applications
● Pro: code can be very compact!
27. WordCount in Scalding
import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) {
TypedPipe.from(TextLine(args("input")))
.flatMap { line => line.split("""\s+""") }
.groupBy { word => word }
.size
.write(TypedTsv[(String, Long)](args("output")))
}
28. Summingbird
● An open-source project from Twitter
● An API that is compiled to Scalding jobs and to Storm
topologies
● Can be written in Java or Scala
● Pro: useful when you want a Lambda Architecture and
want to write one codebase that runs on both Hadoop
and Storm
31. Spark
● Part of the Apache project
● Replaces MapReduce with its own engine that works
much faster without compromising consistency
● Architecture is not based on MapReduce but rather on two
concepts: RDD (Resilient Distributed Dataset) and DAG
(Directed Acyclic Graph) - see the sketch below
● Pros:
o Works much faster than MapReduce
o Fast-growing community
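To make the RDD/DAG idea concrete, here is a minimal word-count sketch using Spark's Java API (Spark 2.x lambda-friendly signatures; the input and output paths are placeholders, not taken from this deck):
import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Each transformation only extends the DAG; nothing executes until an action runs.
    JavaRDD<String> lines = sc.textFile(args[0]);
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // words
        .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
        .reduceByKey((a, b) -> a + b);                                 // sum per word
    counts.saveAsTextFile(args[1]); // action: triggers the actual distributed job
    sc.stop();
  }
}
Intermediate results stay in memory between stages, which is where most of the speedup over MapReduce comes from.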
32. Impala
● Open source, from Cloudera
● Used for interactive queries with SQL syntax
● Replaces MapReduce with its own Impala Server
● Pro: much faster response times for SQL over
HDFS than Hive or Pig
35. Impala architecture
● The Impala architecture was inspired by Google Dremel
● MapReduce is great for functional programming, but not
efficient for SQL
● Impala replaces MapReduce with a distributed query
engine that is optimized for fast queries (a JDBC access sketch follows below)
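Impala is queried with plain SQL (from impala-shell, Hue, or over JDBC/ODBC). As a hedged sketch only: assuming an impalad daemon listening on the default HiveServer2-compatible port 21050, the Hive JDBC driver on the classpath, and the tikal table from the Hive example, a Java client could look roughly like this:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class ImpalaQueryExample {
  public static void main(String[] args) throws Exception {
    // Impala speaks the HiveServer2 wire protocol, so the Hive JDBC driver can connect to it.
    // Host name, port and table are illustrative assumptions.
    String url = "jdbc:hive2://impalad-host:21050/default;auth=noSasl";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT name FROM tikal LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
    }
  }
}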
38. Presto, Drill, Tez
● Several more alternatives:
o Presto by Facebook
o Apache Drill, pushed by MapR
o Apache Tez, pushed by Hortonworks
● All are alternatives to Impala and do more or less the
same: provide faster response times for queries over
HDFS
● Each of them claims to deliver very fast results
● Be careful with the benchmarks they publish: to get better
results they use indexed/columnar data rather than plain sequential
files in HDFS (e.g., ORC files, Parquet, HBase)
40. HBase
● Apache project
● NoSQL clustered database that can grow linearly
● Can store billions of rows × millions of columns
● Storage is based on HDFS
● Integrates with MapReduce for batch processing
● Pros:
o Strongly consistent reads/writes (see the client API sketch below)
o Good for high-speed counter aggregations
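A minimal sketch of the HBase Java client API (HBase 1.x-style Connection/Table classes; the table name, column family and values are made up for illustration): one put and one get against a table named employees.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("employees"))) {
      // Write one cell: row key "emp-1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("emp-1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);
      // Read it back (reads and writes on a single row are strongly consistent).
      Result result = table.get(new Get(Bytes.toBytes("emp-1")));
      String name = Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name")));
      System.out.println("name = " + name);
    }
  }
}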
41. Parquet
● Apache (incubator) project, initiated by Twitter &
Cloudera
● Columnar file format - writes one column at a time
● Integrated with the Hadoop ecosystem (MapReduce, Hive)
● Supports Avro, Thrift and Protocol Buffers
● Pro: keeps I/O to a minimum by reading from disk only
the data required for the query (see the write sketch below)
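A hedged sketch of writing a Parquet file through the parquet-avro bindings (assumes parquet-mr's builder API, roughly version 1.8+; the schema and file name are invented for the example):
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    // A tiny Avro schema describing the Parquet columns (hypothetical record).
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"name\",\"type\":\"string\"}]}");
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(new Path("employees.parquet"))
            .withSchema(schema)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1L);
      record.put("name", "Alice");
      writer.write(record); // values are laid out column by column on disk
    }
  }
}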
43. Advantages of Columnar formats
● Better compression, as the data is more homogeneous.
● I/O will be reduced as we can efficiently scan only a
subset of the columns while reading the data.
● When storing data of the same type in each column,
we can use encodings better suited to the modern
processors’ pipeline by making instruction branching
more predictable.
45. Flume
● Originally developed at Cloudera, now an Apache project
● Used to collect files from distributed systems and send
them to a central repository
● Designed for integration with HDFS but can write to
other file systems
● Supports listening on TCP and UDP sockets
● Main use case: collecting distributed logs into HDFS
46. Avro
● An Apache project
● Data serialization by schema (see the sketch below)
● Supports rich data structures, defined in a JSON-based syntax
● Supports schema evolution
● Integrated with the Hadoop I/O API
● Similar to Thrift and Protocol Buffers
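A minimal sketch of the Avro Java API (the two-field schema and file name are just an example): parse a JSON schema, build a record, and write it to an Avro container file whose header carries the schema.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Schemas are defined in JSON; this one is a made-up "Employee" record.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"id\",\"type\":\"long\"}]}");
    GenericRecord employee = new GenericData.Record(schema);
    employee.put("name", "Alice");
    employee.put("id", 1L);
    // The writer embeds the schema in the file header, which is what enables schema evolution on read.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("employees.avro"));
      writer.append(employee);
    }
  }
}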
47. Oozie
● An Apache project
● Workflow Scheduler for Hadoop jobs
● Very close integration with the Hadoop API
48. Mesos
● Apache project
● Cluster manager that abstracts resources
● Integrates with Hadoop to allocate resources
● Scales to 10,000 nodes
● Supports physical machines, VMs, Docker
● Multi-resource scheduler (memory, CPU, disk, ports)
● Web UI for viewing cluster status