This is the basis for some talks I've given at the Microsoft Technology Center, the Chicago Mercantile Exchange, and local user groups over the past two years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
2. Brief background on me
Phil has over 16 years' experience in data-centric system development. His work has flowed from simulation and video-game-like systems, to high-performance computing (HPC), to traditional database (Oracle, SQL Server, Postgres, MySQL) and CRM (warehouse/analytical) systems, and most recently to the Hadoop stack. Recently, as an employee at TripAdvisor, he led the research into Hadoop/Hive which resulted in the successful migration from the traditional RDBMS platform to a system based on Hadoop/Hive and integrated with MS SQL Server/SSAS. Currently, he's focused on the Hadoop stack and is creating a solution that involves integrating Hadoop into a more traditional enterprise environment.
3. Agenda
To make you as excited about Hadoop as I am
What is Hadoop (high-level)?
What have we actually done with it?
How does “it” (HDFS, M/R, Hive, and HBase) work?
Future of Hadoop
5. Q: What is Hadoop?
A#1 - The thing that empowers Yahoo, FB, and others
Yahoo has >25k Hadoop nodes…wow…
6. Q: What is Hadoop?
A#2 - Last year’s revolution (sort of)
The Linux/Hadoop vs Closed-Source “conflict” is a false one, IMO, and I’ll explain why as we go on
7. Q: What is Hadoop?
A#3 – the revolution of 5+ years ago
8. “Success has many fathers”
And you can look them up, because it’s FOSS!
People are fighting to contribute, and to get credit… be a contributor…
(http://hortonworks.com/reality-check-contributions-to-apache-hadoop/)
9. Q: What is Hadoop?
A#4 – the wave everyone is riding
Nearly all the big players (and many smaller ones) are on board…
10. In fact, beware of this
http://nosql.mypopescu.com/post/2955078419/origin-of-nosql
12. Hadoop projects performed by BlueMetal Architects
Hadoop at a Web 2.0 company (prior to BMA)
– Ported a traditional 30TB warehouse to Hive
– Big transform jobs in Hive, e.g. joins of 50M rows to 12B rows
– Big Data jobs, e.g. Social Graph processing with many “Cartesians” to empower emails
Hadoop in HealthCare (at BMA)
– Applied HBase as part of a new system
– Feeds data (via WS) to: E.D., Patient Web Portal, and other HealthCare affiliates
Note: Both projects include Hadoop as part of larger systems.
13. Warehouse Goals
Use the right tool for the right job
– Hadoop (M/R, Hive) is a batch system
  • Inherently high-latency
– RDBMS (& other tools) are still needed
Empower users
– Minimize complexity
  • Eliminate joins (almost)
  • Eliminate “dimensions” (maybe)
– Expose *all* data
– Provide low-latency options
– Provide self-service options
14. A strategy for MASSIVE processing:
Best tool for the job
This is what we implemented and, it turns out, is also what Yahoo has done.
Yahoo’s SSAS cube is the largest in the world (14TB/quarter, 3B rows/day)
18. Map-Reduce (M/R) example
Note: this job is not optimized
Take-home message: “Simple API - Mappers read the input and emit K/V pairs. The framework sends Reducers K/V pairs partitioned and ordered* by Key.”
(From: http://www.infosun.fim.uni-passau.de/cl/MapReduceFoundation/)
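To make that “simple API” concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API. This is an illustrative example, not the exact job pictured on the slide: the mapper emits (word, 1) pairs, and the framework hands each reducer its keys partitioned and sorted.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: reads one line at a time and emits a (word, 1) K/V pair per token.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reducer: receives each word with all of its 1s (partitioned and sorted
  // by key by the framework) and emits the total.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with hadoop jar wordcount.jar WordCount <input dir> <output dir>; the same jar runs unchanged on a laptop or a thousand-node cluster.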
19. Hadoop M/R with some details:
Note: Partition, Combine and Shuffle
(From: http://www.lecturemaker.com/2011/02/rhipe/)
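The partition and combine steps in that diagram are pluggable. As a hedged illustration (the class name and bucketing scheme are invented here), a custom Partitioner controls which reducer receives each key; the default is a hash of the key.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes words by first letter so each reducer gets a contiguous
// alphabetical slice of the key space.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Must always return a value in [0, numPartitions).
    String s = key.toString();
    if (s.isEmpty()) return 0;
    char c = Character.toLowerCase(s.charAt(0));
    int bucket = (c >= 'a' && c <= 'z') ? (c - 'a') : 25;
    return bucket % numPartitions;
  }
}

Wiring it in takes one line on the Job object, as does a combiner (which pre-reduces map output before the shuffle to cut network traffic): job.setPartitionerClass(FirstLetterPartitioner.class); job.setCombinerClass(SumReducer.class); (reusing the reducer from the word-count sketch above).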
20. Hadoop M/R Primer
Let’s discuss HDFS (blocks, replication) and how that enables “data-local” tasks
(From: Yahoo)
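You can see the “data-local tasks” point directly from client code. A small sketch, assuming a reachable cluster and an invented file path, that asks the namenode where each block of a file (and its replicas) lives; this is the same metadata the scheduler uses to place map tasks on nodes that already hold the data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml etc.
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/f_pageviews/part-00000");  // illustrative path
    FileStatus status = fs.getFileStatus(file);
    // One entry per block; each lists the datanodes holding a replica.
    for (BlockLocation block :
         fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}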
21. Hadoop Terasort Job Profile
- or “hey, I thought it was just M/R”
(From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
22. Why Hadoop?
Because you don’t want to handle this…
This is actually a profile of a job running on an old version of Hadoop, but jobs with many failures look similar. It also shows how much Hadoop has improved.
(From: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/)
23. Hadoop M/R executive summary
A distributed storage system, with distributed processing capability, on commodity hardware (or in the cloud).
It moves the computation to the data! That, in turn, saves network bandwidth, which is the limiting factor in distributed apps.
The same code can run on data of any size; the cluster is scaled with the data, not the code.
24. Hadoop Stack Key Components
(http://hortonworks.com/technology/hortonworksdataplatform/)
HCatalog is a recent project that allows Pig scripts to use Hive metadata/schemas.
Hadoop is not just about non/semi-structured data!
26. Common RDBMS warehouse query
select top 10
t.*
from (
select ip_address, count(*) as cnt
from f_pageviews pv
join d_ipaddress ip on (pv.ip_key = ip.id)
where date_key = 2992
group by ip_address
)t
order by cnt desc
– wait a few minutes
– time is usually 1-4x the nominal time, depending on load
– … and assumes the job can succeed at all!
27. Hive Version…
The luxury of Hadoop space/power means dimensional processing might not be required.
NOTE: Hive does support “column-oriented” storage, which is very efficient.
select t.*
from (
select ip_address, count(*) as cnt
from f_lookback
where ds = '2011-03-11'
group by ip_address
)t
order by cnt desc
limit 10
– BUT – runtime is trickier:
Time to run your job = HQL parse + M/R job submit + [ wait in the queue for availability ] + M/R job runtime
28. What else can Hadoop do?
FB: Invented Cassandra but went with HBase for their new messaging system.
Does that mean HBase is ”better”? – no, it’s about using the right tool for the job.
http://www.facebook.com/note.php?note_id=454991608919
That’s to hold 135B messages per month!
http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html
Scale is relative (to your hardware and load), but when you want a consistent “OLTP” solution that doesn’t require redesign to scale, consider HBase.
29. HBase Architecture
Not shown: the HBase Master (HM), ZooKeeper (ZK), and HDFS
(From: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
30. HBase: a more detailed view
(http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html)
31. HBase: one way to look at it
A BigTable Implementation: memcached + LSM + framework
(From: http://java.dzone.com/news/bigtable-model-cassandra-and)
32. HBase: Hadoop BigTable
Not just a CRUD back-end:
…coprocessors, versioned cells, range scans, optimization (e.g. selective compression) via column families, etc.
The most important of these is distributed processing.
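Two of those features are easy to show in client code. A hedged sketch against the HBase 1.x Java client follows (the table, row keys, and column names are invented for illustration): a versioned read pulling recent historical values of one cell, and a range scan over the sorted row keys.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseFeatures {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("patients"))) {

      // Versioned cells: read up to 3 historical values of one column.
      Get get = new Get(Bytes.toBytes("patient#00042"));
      get.setMaxVersions(3);
      for (Cell cell : table.get(get)
               .getColumnCells(Bytes.toBytes("vitals"), Bytes.toBytes("bp"))) {
        System.out.println(cell.getTimestamp() + " -> "
            + Bytes.toString(CellUtil.cloneValue(cell)));
      }

      // Range scan: row keys are stored sorted, so scanning a key range
      // touches only the regions that hold it.
      Scan scan = new Scan(Bytes.toBytes("patient#00040"),
                           Bytes.toBytes("patient#00050"));
      try (ResultScanner rows = table.getScanner(scan)) {
        for (Result row : rows) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}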
33. Hadoop in (pre*) action
Hadoop indexed “THE DATA” for Watson
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/i%E2%80%99ll-take-hadoop-for-400-alex/
*Runtime processing used JMS + Apache UIMA.
35. Overlapping Ecosystems
Hadoop (usage and contributions) will be “shared” between FOSS and Closed Source communities.
Image from: http://cyhshonorsbio.wikispaces.com/The+Chemistry+of+Life
36. False Conflicts, with Solutions
Sodium (explosive) + Chlorine (poison) => Salt (vital)
From http://strangetimes.lastsuperpower.net/?p=1663
Closed Source + Open Source => Free + Enterprise + Support + Integration
Visit: http://en.wikipedia.org/wiki/Business_models_for_open_source_software#Hybrid
37. IMO, an important message from a brilliant man
Anant Jhingran, Hadoop Summit 2011: IBM Watson & Big Data with Q&A
http://www.youtube.com/watch?v=IVS__xF3Byg
Add value by fostering the ecosystem.
Do not fragment Hadoop (as Unix did).
There is room for folks from many areas to contribute and benefit.
39. MS embraced Hadoop despite having developed technology similar to NextGen Hadoop. Wow.
The Hadoop release on Azure is 3/12.
BlueMetal Architects is part of the MS TAP program for Hadoop on Azure. Please contact us, as we’ll be blogging about it.
41. Hadoop NextGen: A Brave New (!?) World
Hadoop “nextGen” will support more than M/R, e.g. “Apache Giraph”
BUT, the diagram is from MS Dryad blogs. Graph processing will also be “big”.
42. Hadoop >> (un)structured data store.
Why do this (except ad-hoc) …?
RDBMS and Hadoop each have strengths; use them, don’t negate both.
See the above Warehouse Architecture diagram…
(From: http://nosql.mypopescu.com/post/344388408/hadoop-and-oracle-parallel-processing)
44. Useful/Supporting Links
Bing crawls the web for Yahoo (for US, Canada, and some other countries)
http://www.ehow.com/info_8208930_isnt-yahoo-crawling-website.html
World’s largest SSAS Cube: 14TB/quarter, 3B rows/day
http://jobs.climber.com/jobs/Media-Communication/-CA-US/MS-SQL-SSAS-SSIS-Engineer/22735283
http://hadoop.apache.org/
http://www.docstoc.com/docs/66356954/Advanced-HBase
https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
http://wiki.apache.org/hadoop/WordCount
https://blogs.apache.org/foundation/entry/apache_innovation_bolsters_ibm_s