This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
Hadoop MapReduce is an open-source framework for distributed processing of large datasets across clusters of computers. It allows parallel processing of large datasets by dividing the work across nodes. The framework handles scheduling, fault tolerance, and distribution of work. MapReduce consists of two main phases: the map phase, where the data is processed as key-value pairs, and the reduce phase, where the outputs of the map phase are aggregated together. It provides an easy programming model for developers to write distributed applications for large-scale processing of structured and unstructured data.
This document provides an overview of MapReduce, a programming model developed by Google for processing and generating large datasets in a distributed computing environment. It describes how MapReduce abstracts away the complexities of parallelization, fault tolerance, and load balancing to allow developers to focus on the problem logic. Examples are given showing how MapReduce can be used for tasks like word counting in documents and joining datasets. Implementation details and usage statistics from Google demonstrate how MapReduce has scaled to process exabytes of data across thousands of machines.
This document discusses efficient analysis of big data using the MapReduce framework. It introduces the challenges of analyzing large and complex datasets, and describes how MapReduce addresses these challenges through its map and reduce functions. MapReduce allows distributed processing of big data across clusters of computers using a simple programming model.
The document provides an introduction to MapReduce, including:
- MapReduce is a framework for executing parallel algorithms across large datasets using commodity computers. It is based on map and reduce functions.
- Mappers process input key-value pairs in parallel, and their outputs are sorted and grouped by key before being passed to the reducers.
- Examples demonstrate how MapReduce can be used for tasks like building indexes, joins, and iterative algorithms.
This document provides a high-level overview of MapReduce and Hadoop. It begins with an introduction to MapReduce, describing it as a distributed computing framework that decomposes work into parallelized map and reduce tasks. Key concepts like mappers, reducers, and job tracking are defined. The structure of a MapReduce job is then outlined, showing how input is divided and processed by mappers, then shuffled and sorted before being combined by reducers. Example map and reduce functions for a word counting problem are presented to demonstrate how a full MapReduce job works.
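To make that job structure concrete, here is a minimal single-process sketch of the dataflow (a toy simulation with made-up names, not Hadoop's actual API): each input record passes through a mapper, intermediate pairs are grouped by key (the shuffle), and a reducer aggregates each group in sorted key order.

from collections import defaultdict

def mapper(key, value):
    # emit one (word, 1) pair per token in the record
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # all counts for one word arrive together; sum them
    yield key, sum(values)

def run_job(records):
    intermediate = defaultdict(list)
    for key, value in records:                  # map phase
        for k, v in mapper(key, value):
            intermediate[k].append(v)           # shuffle: group values by key
    results = []
    for k in sorted(intermediate):              # sort, then reduce phase
        results.extend(reducer(k, intermediate[k]))
    return results

print(run_job([(0, "the quick brown fox"), (1, "the lazy dog")]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]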
Large Scale Data Analysis with Map/Reduce, part I - Marin Dimitrov
This document provides an overview of large scale data analysis using distributed computing frameworks like MapReduce. It describes MapReduce and related frameworks like Dryad, and open source MapReduce tools including Hadoop, Cloud MapReduce, Elastic MapReduce, and MR.Flow. Example MapReduce algorithms for tasks like graph analysis, text indexing and retrieval are also outlined. The document is the first part of a series on large scale data analysis using distributed frameworks.
The document provides an introduction to MapReduce, describing its motivation as a framework for simplifying large-scale data processing across distributed systems. It outlines MapReduce's programming model and main features, including automatic parallelization, fault tolerance, and locality. The document also provides a detailed example of counting letter frequencies in a large file to illustrate how MapReduce works.
This document introduces MapReduce, including its architecture, advantages, frameworks for writing MapReduce programs, and an example WordCount MapReduce program. It also discusses how to compile, deploy, and run MapReduce programs using Hadoop and Eclipse.
Mastering Hadoop Map Reduce - Custom Types and Other Optimizations - scottcrespo
Mastering Hadoop Map Reduce was a presentation I gave to Orlando Data Science on April 23, 2015. The presentation provides a clear overview of how Hadoop Map Reduce works, and then dives into more advanced topics of how to optimize runtime performance and implement custom data types.
The examples are written in Python and Java, and the presentation walks through how to create an n-gram count map reduce program using custom data types.
You can get the full source code for the examples on my Github! http://www.github.com/scottcrespo/ngrams
In this presentation, I provide in-depth information about how MapReduce works. It contains many details about the execution steps, fault tolerance, and master/worker responsibilities.
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
MapReduce is a programming model for processing large datasets in a distributed manner across clusters of machines. It involves two functions - Map and Reduce. The Map function processes input key-value pairs to generate intermediate key-value pairs, and the Reduce function merges all intermediate values associated with the same intermediate key. This allows for distributed processing that hides complexity and provides fault tolerance. An example is counting word frequencies, where the Map function emits word counts and the Reduce function sums the counts for each word.
The document presents an introduction to MapReduce. It discusses how MapReduce provides an easy framework for distributed computing by allowing programmers to write simple map and reduce functions without worrying about complex distributed systems issues. It outlines Google's implementation of MapReduce and how it uses the Google File System for fault tolerance. Alternative open-source implementations like Apache Hadoop are also covered. The document discusses how MapReduce has been widely adopted by companies to process massive amounts of data and analyzes some criticism of MapReduce from database experts. It concludes by noting trends in using MapReduce as a parallel database and for multi-core processing.
MapReduce is a programming model for processing large datasets in parallel. It works by breaking the dataset into independent chunks which are processed by the map function, and then grouping the output of the maps into partitions to be processed by the reduce function. Hadoop provides fault tolerance by having the JobTracker monitor the TaskTrackers and restart failed tasks. MapReduce programs can be written in languages other than Java using Hadoop Streaming.
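As a sketch of how Streaming works (the file name, record format, and launch command below are illustrative assumptions, not taken from any particular distribution): the mapper and reducer are plain processes that read records on stdin and write key<TAB>value lines on stdout, and Hadoop sorts the mapper output by key before the reducer sees it. This toy job finds the maximum temperature per year from "year<TAB>temperature" input lines.

#!/usr/bin/env python
# streaming_max.py (hypothetical): run as "streaming_max.py map" or
# "streaming_max.py reduce" under Hadoop Streaming.
import sys

def run_mapper():
    # pass records through unchanged; Hadoop sorts them by key (year)
    for line in sys.stdin:
        year, temp = line.strip().split("\t")
        print(f"{year}\t{temp}")

def run_reducer():
    # input arrives sorted by year, so all temps for a year are adjacent
    current_year, best = None, None
    for line in sys.stdin:
        year, temp = line.strip().split("\t")
        if year != current_year:
            if current_year is not None:
                print(f"{current_year}\t{best}")
            current_year, best = year, float(temp)
        else:
            best = max(best, float(temp))
    if current_year is not None:
        print(f"{current_year}\t{best}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()

A typical launch would look something like: hadoop jar hadoop-streaming.jar -input weather -output maxtemp -mapper "streaming_max.py map" -reducer "streaming_max.py reduce" -file streaming_max.py (the jar name and paths vary by installation).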
This document discusses optimization techniques for map join in Hive. It describes:
1) Previous approaches to common join and map join in Hive and their limitations.
2) Optimized map join techniques, like uploading small tables to the distributed cache and performing local joins to avoid the shuffle (sketched after this list).
3) Using JDBM for hash tables caused performance issues, so alternative approaches were evaluated.
4) Automatically converting common joins to optimized map joins based on table sizes and join conditions.
5) Compression and archiving of hash tables to distributed cache to reduce bandwidth overhead.
6) Performance evaluations showing improvements from the optimized techniques.
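To make point 2 above concrete, here is a toy map-side hash join in Python (the table contents and names are invented for illustration; Hive's actual implementation ships the small table to each node via the distributed cache): every mapper loads the small table into an in-memory hash table and streams the large table past it, so no shuffle is needed.

# small table, e.g. shipped to each mapper via the distributed cache
users = {"u1": "alice", "u2": "bob"}

def map_join(big_rows, small_table):
    # big_rows: (user_id, payload) records streamed from the large table
    for user_id, payload in big_rows:
        if user_id in small_table:        # probe the in-memory hash table
            yield user_id, small_table[user_id], payload

clicks = [("u1", "click"), ("u3", "view"), ("u2", "click")]
print(list(map_join(clicks, users)))
# [('u1', 'alice', 'click'), ('u2', 'bob', 'click')]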
The document summarizes Jimmy Lin's MapReduce tutorial for WWW 2013. It discusses the MapReduce algorithm design and implementation. Specifically, it covers key aspects of MapReduce like local aggregation to reduce network traffic, sequencing computations by manipulating sort order, and using appropriate data structures to accumulate results incrementally. It also provides an example of building a term co-occurrence matrix to measure semantic distance between words.
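As an illustration of the local-aggregation idea (a sketch in the spirit of the tutorial, not code from it), an "in-mapper combining" word-count mapper buffers partial counts in memory and emits one pair per distinct word rather than one pair per token, cutting the volume of intermediate data sent over the network:

from collections import Counter

def combining_mapper(key, value):
    counts = Counter()
    for word in value.split():
        counts[word] += 1          # aggregate locally, emit nothing yet
    for word, n in counts.items():
        yield word, n              # far fewer intermediate pairs

The trade-off is memory: the buffer must fit in the mapper's heap, so real implementations flush it when it grows too large.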
The document introduces MapReduce, describing how it allows for parallel processing of large datasets. MapReduce works by splitting data into smaller chunks that are processed (mapped) in parallel by worker nodes, and then combining (reducing) the results. The document outlines the Map and Reduce functions, and discusses how Hadoop is an open-source implementation of MapReduce that allows distributed processing of semi-structured data across clusters of machines.
Hadoop - Introduction to map reduce programming - Meeting 12/04/2014 - soujavajug
Aaron Myers introduces MapReduce and Hadoop. MapReduce is a distributed programming paradigm that allows processing of large datasets across clusters. It works by splitting data, distributing it across nodes, processing it in parallel using map and reduce functions, and collecting the results. Hadoop is an open source software framework for distributed storage and processing of big data using MapReduce. It includes HDFS for storage and Hadoop MapReduce for distributed computing. Developers write MapReduce jobs in Java by implementing map and reduce functions.
MapReduce definition
A programming model and an associated implementation for processing and generating large data sets with a parallel*, distributed* algorithm on a cluster*.
A parallel algorithm is an algorithm which can be executed a piece at a time on many different processing devices, with the pieces then combined again at the end to get the correct result.
A distributed algorithm is an algorithm designed to run on computer hardware constructed from interconnected processors.
A computer cluster consists of connected computers that work together so that, in many respects, they can be viewed as a single system. Computer clusters have each node set to perform the same task, controlled and scheduled by software.
MapReduce - division into the two phases, map and reduce
Working of the JobTracker, TaskTracker, NameNode, and DataNode in the MapReduce engine of Hadoop
Fault tolerance in Hadoop
Box class datatypes
Allowable file formats
WordCount job explained using animation in Hadoop MapReduce
Fields where MapReduce can be implemented
Limitations of MapReduce
This document discusses MySQL and Hadoop. It provides an overview of Hadoop, Cloudera Distribution of Hadoop (CDH), MapReduce, Hive, Impala, and how MySQL can interact with Hadoop using Sqoop. Key use cases for Hadoop include recommendation engines, log processing, and machine learning. The document also compares MySQL and Hadoop in terms of data capacity, query languages, and support.
Storage and computation are getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing data, sensor data, etc. Hadoop and the ecosystem of projects built around it present simple and easy-to-use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce Jobs (using Hadoop Streaming).
* Introduce PigLatin, an easy-to-use data processing language.
Speaker Profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large scale crawl and extraction of structured data from the web. He is a graduate from IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as Research Engineer/Tech Lead on Search and Advertising products.
Hadoop is an open source software project that allows distributed processing of large datasets across computer clusters. It was developed based on research from Google and has two main components - the Hadoop Distributed File System (HDFS) which reliably stores data in a distributed manner, and MapReduce which allows parallel processing of this data. Hadoop is scalable, cost effective, and fault tolerant for processing terabytes of data on commodity hardware. It is commonly used for batch processing of large unstructured datasets.
This document provides an overview of Hadoop and its ecosystem. It describes Hadoop as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. The key components of Hadoop are the Hadoop Distributed File System (HDFS) for storage, and MapReduce as a programming model for distributed computation across large datasets. A variety of related projects form the Hadoop ecosystem, providing capabilities like data integration, analytics, workflow scheduling and more.
The document describes the Hadoop ecosystem and its core components. It discusses HDFS, which stores large files across clusters and is made up of a NameNode and DataNodes. It also discusses MapReduce, which allows distributed processing of large datasets using a map and reduce function. Other components discussed include Hive, Pig, Impala, and Sqoop.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It consists of HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines and provides high throughput access to application data. MapReduce allows processing of large datasets in parallel by splitting the work into independent tasks called maps and reduces. Companies use Hadoop for applications like log analysis, data warehousing, machine learning, and scientific computing on large datasets.
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
The document discusses big data processing with MapReduce and related technologies. It provides an overview of MapReduce and how it enables parallel processing of large datasets. It then describes Apache Hadoop, the open-source implementation of MapReduce, including its core components HDFS for distributed storage and YARN for cluster resource management. Finally, it discusses Apache Spark, which aims to improve on some limitations of Hadoop by supporting in-memory computation and easier chaining of operations.
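As a rough sketch of that chaining style (assuming PySpark's public RDD API; the input path is a placeholder), a word count becomes a single pipeline of transformations, with intermediate results held in memory between steps:

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
counts = (sc.textFile("input.txt")                  # placeholder path
            .flatMap(lambda line: line.split())     # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))       # shuffle happens here
print(counts.take(5))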
This document provides an overview of an advanced Big Data hands-on course covering Hadoop, Sqoop, Pig, Hive and enterprise applications. It introduces key concepts like Hadoop and large data processing, demonstrates tools like Sqoop, Pig and Hive for data integration, querying and analysis on Hadoop. It also discusses challenges for enterprises adopting Hadoop technologies and bridging the skills gap.
EclipseCon Keynote: Apache Hadoop - An Introduction - Cloudera, Inc.
Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that evolve around Hadoop and the ecosystem.
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015 - Andrey Vykhodtsev
The document discusses big data concepts and Hadoop technologies. It provides an overview of massive parallel processing and the Hadoop architecture. It describes common processing engines like MapReduce, Spark, Hive, Pig and BigSQL. It also discusses Hadoop distributions from Hortonworks, Cloudera and IBM along with stream processing and advanced analytics on Hadoop platforms.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
The document discusses the Hadoop ecosystem. It provides an overview of Hadoop and its core components HDFS and MapReduce. HDFS is the storage component that stores large files across nodes in a cluster. MapReduce is the processing framework that allows distributed processing of large datasets in parallel. The document also discusses other tools in the Hadoop ecosystem like Hive, Pig, and Hadoop distributions from companies. It provides examples of running MapReduce jobs and accessing HDFS from the command line.
Hadoop is a system for processing large amounts of data using MapReduce and HDFS. HDFS is the storage component that splits files into blocks and stores multiple copies for reliability. MapReduce is the processing framework where mappers process key-value pairs in parallel and reducers aggregate the outputs. While Hadoop can process huge datasets, other systems like Pig, Hive, HBase, Accumulo, Avro, ZooKeeper, and Flume provide additional functionality for tasks like SQL queries, real-time processing, coordination, serialization, and data aggregation.
Hadoop distributed computing framework for big data - Cyanny LIANG
This document provides an overview of Hadoop, an open source distributed computing framework for processing large datasets. It discusses the motivation for Hadoop, including challenges with traditional approaches. It then describes how Hadoop provides partial failure support, fault tolerance, and data locality to efficiently process big data across clusters. The document outlines the core Hadoop concepts and architecture, including HDFS for reliable data storage, and MapReduce for parallel processing. It provides examples of how Hadoop works and how organizations use it at large scale.
The document discusses the family of Hadoop projects. It describes the history and origins of Hadoop, starting with Doug Cutting's work on Nutch and the implementation of Google's papers on MapReduce and the Google File System. It then summarizes several major Hadoop sub-projects, including HDFS for storage, MapReduce for distributed processing, HBase for structured storage, and Hive for data warehousing. For each project, it provides a brief overview of the architecture, data model, and programming interfaces.
This document provides an overview of Hadoop, an open source framework for distributed storage and processing of large datasets across clusters of computers. It discusses that Hadoop was created to address the challenges of "Big Data" characterized by high volume, variety and velocity of data. The key components of Hadoop are HDFS for storage and MapReduce as an execution engine for distributed computation. HDFS uses a master-slave architecture with a NameNode master and DataNode slaves, and provides fault tolerance through data replication. MapReduce allows processing of large datasets in parallel through mapping and reducing functions.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. It was designed to scale up from single servers to thousands of machines, with very high fault tolerance. Hadoop features two main components - the Hadoop Distributed File System (HDFS) for storage, and MapReduce for distributed processing of large datasets in a parallel and distributed manner. Hadoop saw widespread adoption for applications such as log analysis, data mining, and large-scale graph processing.
Similar to [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) (20)
"AI" for Blockchain Security (Case Study: Cosmos)npinto
This document discusses preliminary work using machine learning techniques to help improve blockchain security. It outlines initial experiments using a Cosmos SDK simulator to generate test data and identify "bug correlates" that could help predict vulnerabilities. Several bugs were already found in the simulator itself. The goal is to focus compute resources on more interesting test runs likely to produce bugs. This is an encouraging first step in exploring how AI may augment blockchain security testing.
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201... - npinto
This document discusses using high-performance computing for machine learning tasks like analyzing large convolutional neural networks for visual object recognition. It proposes running hundreds of thousands of large neural network models in parallel on GPUs to more efficiently search the parameter space, beyond what is normally possible with a single graduate student and model. This high-throughput screening approach aims to identify better performing network architectures through exploring a vast number of possible combinations in the available parameter space.
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi... - npinto
The document discusses challenges with parallel programming on GPUs including tasks with statically known data dependences, SIMD divergence, lack of fine-grained synchronization and writeable coherent caches. It also presents performance results for sorting algorithms on different GPU and CPU architectures, with GPUs providing much higher sorting throughput than CPUs. Parallel prefix sum is proposed as a method for allocating work in parallel tasks that require dynamic scheduling or allocation.
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect... - npinto
The document discusses changes in computer architecture and Microsoft's role in the transition to parallel computing. It notes that computer cores are increasing rapidly and that Microsoft aims to make parallelism accessible to all developers through tools like Visual Studio. It also outlines Microsoft's involvement in GPU computing through technologies like DirectX and efforts to support GPU programming across its software stack.
The document discusses dynamic compilation for massively parallel processors. It describes how execution models provide an interface between programming languages and hardware architectures. Emerging execution models like bulk-synchronous parallel and PTX aim to abstract parallelism on heterogeneous multi-core and many-core processors. The document outlines how dynamic compilers can translate between execution models and target instructions to different core architectures through techniques like thread fusion, vectorization, and subkernel extraction. This bridging of models and architectures through just-in-time compilation helps program entire processors rather than individual cores.
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr... - npinto
The document describes the R-Stream high-level program transformation tool. It provides an overview of R-Stream, walks through the compilation process, and discusses performance results. R-Stream uses the polyhedral model to perform program transformations like loop transformations, fusion, distribution and tiling to optimize for parallelism and locality. It models the target machine and uses this to inform the mapping of operations to resources like GPUs.
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St... - npinto
The document discusses irregular parallelism on GPUs and presents several algorithms and data structures for handling irregular workloads efficiently in parallel. It covers sparse matrix-vector multiplication using different sparse matrix formats. It also discusses compositing of fragments in parallel and presents a nested data parallel approach. The document describes challenges with parallel hashing and presents a two-level hashing scheme. It analyzes parallel task queues and work stealing techniques for load balancing irregular work. Throughout, it focuses on managing communication in addition to computation for optimal parallel performance.
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli... - npinto
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau... - npinto
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le... - npinto
This document summarizes a paper about using high-level programming languages for low-level systems programming. It discusses the needs of scientists and engineers for software that is reliable, high-performance, and customizable. The paper aims to address these needs by exploring features of high-level languages that could enable low-level programming tasks typically done in C/C++, like developing device drivers, operating systems, and embedded systems.
This document outlines Andreas Klockner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation, and perspectives on GPU programming in Python. OpenCL provides a common programming framework for heterogeneous parallel programming across CPUs, GPUs, and other processors. PyOpenCL and PyCUDA allow GPU programming from Python.
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl... - npinto
Abstract:
Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.
In this talk, I will describe lessons drawn from various Google projects on developing large scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.
Speaker biography:
Max Lin is a software engineer with Google Research in the New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He received a PhD in Computer Science from Carnegie Mellon University.
Creating cluster 'mycluster' with the following settings:
- Master node: m1.small using ami-fce3c696
- Number of nodes: 1
- Node type: m1.small
- Node AMI: ami-fce3c696
- Storage: EBS volume of size 10 GB
- Security group: mycluster-sg allowing SSH from anywhere
Launching instances...
This may take a few minutes. You can check progress with 'starcluster list'.
When instances have started, SSH will be automatically configured.
You can now ssh to the master with:
starcluster ssh mycluster
Have fun and please let us know if you have
This document summarizes an MIT lecture on GPU cluster programming using MPI. It provides administrative details such as homework due dates and project information. It also announces various donations of computing resources for the class, including Amazon AWS credits and a Tesla graphics card for the best project. The lecture outline covers the problem of computations too large for a single CPU, an introduction to MPI, MPI basics, using MPI with CUDA, and other parallel programming approaches.
This document summarizes a lecture on CUDA Ninja Tricks given on March 1st, 2011. The lecture covered scripting GPUs with PyCUDA, meta-programming and RTCG, and a case study in brain-inspired AI. It included sections on why scripting is useful for GPUs, an introduction to GPU scripting with PyCUDA, and a hands-on example of a simple PyCUDA program that defines and runs a CUDA kernel to double the values in a GPU memory array.
[Harvard CS264] 05 - Advanced-level CUDA Programming - npinto
The document discusses optimizations for memory and communication in massively parallel computing. It recommends caching data in faster shared memory to reduce loads and stores to global device memory. This can improve performance by avoiding non-coalesced global memory accesses. The document provides an example of coalescing writes for a matrix transpose by first loading data into shared memory and then writing columns of the tile to global memory in contiguous addresses.
[Harvard CS264] 04 - Intermediate-level CUDA Programming - npinto
This document provides an overview and summary of key points from a lecture on massively parallel computing using CUDA. The lecture covers CUDA language and APIs, threading and execution models, memory and communication, tools, and libraries. It discusses the CUDA programming model including host and device code, threads and blocks, and memory allocation and transfers between the host and device. It also summarizes the CUDA runtime and driver APIs for launching kernels and managing devices at different levels of abstraction.
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics - npinto
1. GPUs have many more cores than CPUs and are very good at processing large blocks of data in parallel.
2. GPUs can provide a significant speedup over CPUs for applications that map well to a data-parallel programming model by harnessing the power of many cores.
3. The throughput-oriented nature of GPUs makes them well-suited for algorithms where the same operation can be performed on many data elements independently.
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns - npinto
This document outlines the topics that will be covered in the course on massively parallel computing, including computational thinking skills for parallel programming, hardware limitations and constraints on algorithms, and common parallel programming patterns. The topics include thinking in parallel, computer architecture, programming models, theoretical concepts, and parallel programming patterns. The goal is to provide students with the skills needed to design efficient parallel algorithms that maximize performance on modern parallel hardware.
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
1. Introduction to Hadoop
Zak Stone <zak@eecs.harvard.edu>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)
3. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
4. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
5. Why should you care? - Lots of Data
LOTS OF DATA
EVERYWHERE
9. Why Hadoop for big data?
• Most credible open-source toolset for large-scale, general-purpose computing
• Backed by ,
• Used by , , many others
• Increasing support from web services
• Hadoop closely imitates infrastructure developed by Google
• Hadoop processes petabytes daily, right now
11. DISCLAIMER
• Don’t use Hadoop if your data and computation fit on one machine
• Getting easier to use, but still complicated
http://www.wired.com/gadgetlab/2008/07/patent-crazines/
12. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
13. What exactly is Hadoop?
• Actually a growing collection of subprojects
14. What exactly is Hadoop?
• Actually a growing collection of subprojects; focus on two right now
15. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
16. An overview of Hadoop Map-Reduce
Traditional computing (one computer) vs. Hadoop (many computers)
17. An overview of Hadoop Map-Reduce
(Actually more like this)
(many computers, little communication, stragglers and failures)
19. Map-Reduce: Map phase
Only specify operations on key-value pairs!
INPUT PAIR: (key, value)
OUTPUT PAIRS: (key, value), (key, value), (key, value), ...
(zero or more output pairs per input pair)
(each “elephant” works on an input pair; it doesn’t know the other elephants exist)
28. Map-Reduce: The main advantage
With Hadoop, this very same code could run on the entire Web! (In theory, at least)
def mapper(key, value):
    # one input record in; zero or more (word, 1) pairs out
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # all counts for one word arrive together; sum them
    yield key, sum(values)
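With Dumbo (see slide 38), a script containing these two functions could be launched with something along the lines of: dumbo start wordcount.py -input books.txt -output counts (an illustrative invocation with placeholder paths; the exact options are documented on the Dumbo wiki linked later in this deck).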
29. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
30. HDFS: Hadoop Distributed File System
Data is split into chunks stored across many computers; each chunk is replicated more than once for reliability.
31. HDFS: Hadoop Distributed File System
(key1, value1), (key2, value2), ... chunks of key-value pairs stored on and processed at different nodes
Computation is local to the data
Key-value pairs processed independently in parallel
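A toy model of the two ideas on slides 30 and 31 (replication and data-local scheduling), with invented node names and a much-simplified policy compared to real HDFS and the JobTracker:

REPLICATION = 3
nodes = [f"node{i}" for i in range(6)]

def place_chunks(num_chunks):
    # assign each chunk to REPLICATION distinct nodes, round-robin
    return {c: [nodes[(c + r) % len(nodes)] for r in range(REPLICATION)]
            for c in range(num_chunks)}

def schedule(chunk, placement, busy):
    # prefer a free node that already stores the chunk (computation
    # local to the data); otherwise fall back and pay the network cost
    for node in placement[chunk]:
        if node not in busy:
            return node, "local"
    free = next(n for n in nodes if n not in busy)
    return free, "remote"

placement = place_chunks(4)
print(placement[0])                            # ['node0', 'node1', 'node2']
print(schedule(0, placement, busy={"node0"}))  # ('node1', 'local')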
33. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
34. Hadoop Map-Reduce and HDFS: Advantages
• Distribute data and computation
• Computation local to data avoids network overload
• Tasks are independent
• Easy to handle partial failures - entire nodes can fail and restart
• Avoid crawling horrors of failure-tolerant synchronous distributed systems
• Speculative execution to work around stragglers
• Linear scaling in the ideal case
• Designed for cheap, commodity hardware
• Simple programming model
• The “end-user” programmer only writes map-reduce tasks
35. Hadoop Map-Reduce and HDFS: Disadvantages
• Still rough - software under active development
• e.g. HDFS only recently added support for append operations
• Programming model is very restrictive
• Lack of central data can be frustrating
• “Joins” of multiple datasets are tricky and slow
• No indices! Often, entire dataset gets copied in the process
• Cluster management is hard (debugging, distributing software, collecting logs...)
• Still single master, which requires care and may limit scaling
• Managing job flow isn’t trivial when intermediate data should be kept
• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
36. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
37. Getting started: Installation options
• Cloudera virtual machine
• Your own virtual machine (install Ubuntu in VirtualBox, which is free)
• Elastic MapReduce on EC2
• StarCluster with Hadoop on EC2
• Cloudera’s distribution of Hadoop on EC2
• Install Cloudera’s distribution of Hadoop on your own machine
• Available for RPM and Debian deployments
• Or download Hadoop directly from http://hadoop.apache.org/
38. Getting started: Language choices
• Hadoop is written in Java
• However, Hadoop Streaming allows mappers and reducers in any language!
• Binary data is a little tricky with Hadoop Streaming
• Could use base64 encoding, but TypedBytes are much better
• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo
• The Python word-count example and others come with Dumbo
• Dumbo makes binary data with TypedBytes easy
• Also consider Hadoopy: https://github.com/bwhite/hadoopy
39. Outline
1. Why should you care about Hadoop?
2. What exactly is Hadoop?
3. An overview of Hadoop Map-Reduce
4. The Hadoop Distributed File System (HDFS)
5. Hadoop advantages and disadvantages
6. Getting started with Hadoop
7. Useful resources
40. Useful resources and tips
• The Hadoop homepage: http://hadoop.apache.org/
• Cloudera: http://cloudera.com/
• Dumbo: http://wiki.github.com/klbostee/dumbo
• Hadoopy: https://github.com/bwhite/hadoopy
• Amazon Elastic Compute Cloud Getting Started Guide:
• http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/
• Always test locally on a tiny dataset before running on a cluster!