•Structured Data (Deductive Logic)
• Analysis of defined relationships
• Defined Data Architecture
• SQL Compliant for fast processing with certainty
• Precision, Speed
•Unstructured Data (Inductive Logic)
• Hypothesis testing against unknown relationships
• Unknown (being less than 100% certainty)
• Iterative analysis to a level-of-certainty
• Open standards and tools
• Extremely high rate of change in processing/tooling options
• Volume, Speed
• Defined Data Architecture/Structured Schema
• Data Integrity
• ACID Compliant
• Atomicity - Requires that each transaction is "all or nothing"
• Consistency - Any transaction will bring the database from one valid state
to another
• Isolation - Concurrent transactions produce the same result as if they were executed serially
• Durability - Once the transaction is committed it persists
• Real-time processing and analysis against known relationships
• Comparatively static data architecture
• Requires defined data architecture for all data stored
• Relatively smaller, defined, more discrete data sets
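The "all or nothing" behaviour described under Atomicity can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are assumptions made up for this example; they are not part of the original slides.

```python
import sqlite3

# Hypothetical schema and data, used only to illustrate atomicity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit above was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

An overdraft attempt such as `transfer(conn, "alice", "bob", 500)` fails as a unit, so the credit to `bob` does not survive even though it was executed first.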
• Key Value lookup
• Designed for fast single row lookup
• Loose Schema designed for fast lookup
• MySQL NoSQL Interface
• Used to augment Big Data solutions
• Not designed for Analytics
• Does not address 2 to 3 of the four Vs of Big Data
• On its own, NoSQL is not considered Big Data
• High data load with different data formats
• Allows discovery and hypothesis testing against large data sets
• Near real-time processing and analysis against unknown relationships
• Not ACID compliant
• No transactional consistency
• High latency system
• Not designed for real time lookups
• Limited BI tool integration
•What is Hadoop?
• Fastest growing, commercial Big Data Technology
• Basic Composition:
• Mapper
• Reducer
• Hadoop File System (HDFS)
• Approx 30 tools/subcomponents in the eco-system
• Primarily produced so developers and admins do not have to write raw
map/reduce code in Java
•Systems Architecture:
• Linux
• Commodity x86 Servers
• JBOD (standard block size 64-128MB)
• Virtualization not recommended due to high I/O requirements
• Open Source Project:
•Map Reduce Theory Paper:
• Published 2004
• Jeffrey Dean and Sanjay Ghemawat
• Builds on GFS (Google File System), described in a 2003 paper
• Problem:
• Ingest and search large data sets
• Doug Cutting, Cloudera (Yahoo)
• Lucene (1999) – indexing large files
• Nutch (2004) – search massive amounts of web data
• Hadoop (2007) – first release 2007, started in 2005
•Store everything regardless
• Analyze Now, or analyze Later
•Schema On-Read methodology
• Allows you to store all the data and determine how to use it later
•Low cost, scale out infrastructure
• Low cost hardware and large storage pools
• Allows for more of a load-it and forget-it approach
• Sentiment analysis
• Marketing campaign analysis
• Customer churn modeling
• Fraud detection
• Research and Development
• Risk Modeling
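The schema-on-read idea above (store everything, decide how to use it later) can be sketched in a few lines. This is a toy illustration, not how Hadoop tools implement it; the `RAW_EVENTS` records and the `read_with_schema` helper are hypothetical.

```python
import json

# Hypothetical raw records stored as-is; no schema was enforced at load time.
RAW_EVENTS = [
    '{"user": "u1", "action": "click", "ms": 120}',
    '{"user": "u2", "action": "view"}',
    'not json at all',
    '{"user": "u1", "action": "click", "ms": 80, "campaign": "spring"}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema while reading: keep the requested fields, default missing ones to None."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines stay in storage but are skipped by this reader
        if isinstance(record, dict):
            yield {name: record.get(name) for name in fields}
```

Two analyses can read the same stored lines with different field lists, for example `["user", "ms"]` today and `["user", "campaign"]` later, without reloading the data.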
•Programming and execution
•Taken from functional programming
• Map – operate on every element
• Reduce – combine and aggregate results
•Abstracts storage, concurrency, execution
• Just write two Java functions
• Contrast with MPI
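The map and reduce roles described above can be sketched as a word count. The slides refer to writing two Java functions; this sketch uses Python instead and runs in a single process, with `run_job` standing in (as an assumption) for the grouping and scheduling that the framework would normally provide.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Toy single-process driver: group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for word, count in chain.from_iterable(map_words(line) for line in lines):
        grouped[word].append(count)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())
```

Only `map_words` and `reduce_counts` carry application logic; everything in `run_job` is the part a framework abstracts away.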
•Based on GFS
•Distributed, fault-tolerant filesystem
•Primarily designed for cost and scale
• Works on commodity hardware
• 20PB / 4000 node cluster at Facebook
•Failures are common
• Massive scale means more failures
• Disks, network, node
•Files are append-only
•Files are large (GBs to TBs)
•Accesses are large and sequential
•Same concepts as the FS on your laptop
• Directory tree
• Create, read, write, delete files
•Filesystems store metadata and data
• Metadata: filename, size, permissions, …
• Data: contents of a file
[Diagram: three DataNodes]
•GFS and MR co-design
• Cheap, simple, effective at scale
•Fault-tolerance baked in
• Replicate data 3x
• Incrementally re-execute computation
• Avoid single points of failure
•Held the world sort record (0.578TB/min)
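To make the "replicate data 3x" point concrete, here is a minimal sketch of splitting a file into blocks and placing three replicas of each block on distinct nodes. The round-robin placement and the function names are assumptions chosen for simplicity; this is not the actual HDFS placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the upper end of the block sizes quoted earlier
REPLICATION = 3


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [nodes holding a replica]).

    Round-robin placement is a simplification, not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = []
    for block in range(num_blocks):
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placement.append((block, replicas))
    return placement


def blocks_lost(placement, failed_nodes):
    """Blocks with no surviving replica after the given nodes fail."""
    failed = set(failed_nodes)
    return [block for block, replicas in placement if failed.issuperset(replicas)]
```

With three replicas per block, any two simultaneous node failures leave at least one copy of every block in this toy model.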
• Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver
• Streaming data collection and
• Massive volumes of data, such as RPC services, Log4J, Syslog, etc.
• Relational abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs
SELECT s.word, s.freq, k.freq
FROM shakespeare s
JOIN … k ON (s.word = …)
WHERE s.freq >= 5;
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions
emps = LOAD 'people.txt' AS …;
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
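For readers unfamiliar with Pig Latin, the filter-then-order logic of the script above can be sketched in plain Python. The sample rows and field names are assumptions, since the schema of `people.txt` is not shown in the slides, and this runs in memory rather than as MapReduce jobs.

```python
# Hypothetical rows standing in for people.txt; the field names are assumptions.
emps = [
    {"name": "ann", "salary": 250000},
    {"name": "bo", "salary": 90000},
    {"name": "cy", "salary": 410000},
]


def rich_sorted(rows, threshold=200000):
    """FILTER rows BY salary > threshold, then ORDER BY salary DESC."""
    rich = [row for row in rows if row["salary"] > threshold]
    return sorted(rich, key=lambda row: row["salary"], reverse=True)
```

Each Pig statement maps to one step here: the list comprehension plays the role of `FILTER`, and `sorted(..., reverse=True)` plays the role of `ORDER ... DESC`.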
• Low-latency, columnar key-value store
• Based on BigTable
• Efficient random reads/writes on
• Useful for frontend
• Workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
• Hue is an open source web-based application for making it easier to use Apache Hadoop.
•Hue features
• File Browser for
• Job Designer/Browser for
• Query editors for Hive, Pig and Cloudera Impala
• Oozie
• ZooKeeper is a distributed consensus engine
• Provides well-defined concurrent access
• Leader election
• Service discovery
• Distributed locking / mutual exclusion
• Message board / mailboxes
•Next gen software abstraction layer for
•Create and execute complex data processing workflows
• Specifically for a Hadoop cluster using any JVM-based language
• Java
• JRuby
• Clojure
•Generally acknowledged as a better alternative to Hive/Pig
•Big Data covers 4 dimensions
• Volume - 90% of all the data stored in the world has been
produced in the last 2 years
• Velocity – The ability to perform advanced analytics on
Terabytes or Petabytes of data in minutes to hours compared
to days
• Variety – Any data type from structured to unstructured data
including image files, social media, relational database
content, and text data from weblogs or sensors
• Veracity - 1 in 3 business leaders don’t trust the information
they use to make decisions. How do we ensure the results
are accurate and meaningful?
• Loading web logs into MySQL
• How do you parse and keep all the Data?
• What about the variability of the Query String Parameters?
• What if the web log format changes?
• Integration of other data sources
• Social Media – Back in the early days even Facebook didn’t keep all
the data. How do we know what is important in the stream?
• Video and Image data – How do we store that type of data so we
can extract the metadata information?
• Sensor Data – Imagine all the different devices producing data and
the different formats of the data.
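The question about variable query-string parameters can be made concrete with a short sketch using Python's standard `urllib.parse`. The helper names are hypothetical; the point is that parameters are discovered from the logged requests instead of being fixed as table columns in advance.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(request_path):
    """Return the query-string parameters of a logged request path as a dict of lists."""
    return parse_qs(urlsplit(request_path).query)


def param_inventory(paths):
    """Collect every parameter name seen, so new or unexpected keys become visible."""
    seen = set()
    for path in paths:
        seen.update(query_params(path))
    return sorted(seen)
```

Because each request may carry a different set of keys, and a key may repeat, the result is a mapping of names to lists rather than a fixed row shape.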
[Diagram: data sources: Web Servers, Order Processing, Operational Data, Enterprise Data]
• Data captured at source
• Part of ongoing operational processes (Web Log, RDBMS)
• Data transferred from operational systems to Big Data Platform
• Data processed in batch by Map/Reduce
• Data processed by Hadoop Tools (Hive, Pig)
• Can Pre-condition data that is loaded back into RDBMS
• Load back into Operational Systems
• Load into BI Tools and ODS
[Diagram: Sensor Logs, Web Logs, BI Tools]
• MySQL as a Data Source
• New NoSQL APIs
• Ingest high volume, high velocity data, with veracity
• ACID guarantees not compromised
• Data Pre-processing or Conditioning
• Run Real-time analytics against new data
• Pre-process or condition data before loading into Hadoop
• For example healthcare records can be anonymized
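As one possible form of the pre-processing mentioned above, the sketch below drops direct identifiers and replaces an ID with a keyed hash before records leave the operational database. Strictly speaking this is pseudonymization, not full anonymization, and the field names, the `SECRET_KEY` value, and both helpers are assumptions made for illustration.

```python
import hashlib
import hmac

# Assumption: key management is out of scope; a real key would not live in source code.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a keyed hash so records can still be joined on it."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def condition_record(record, identifying=("patient_id",), drop=("name",)):
    """Return a copy with direct identifiers dropped and ID fields pseudonymized."""
    cleaned = {k: v for k, v in record.items() if k not in drop}
    for field in identifying:
        if field in cleaned:
            cleaned[field] = pseudonymize(str(cleaned[field]))
    return cleaned
```

Using a keyed hash keeps the same patient mapping to the same token across batches, so downstream joins still work without exposing the original ID.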
• Data transferred in batches from MySQL tables to Hadoop using
Apache Sqoop or MySQL Applier
• With Applier, users can also invoke real-time change data
capture processes to stream new data from MySQL to HDFS as
it is committed by the client.
• Multi-structured, multi-sourced data consolidated and processed
• Run Map/Reduce Jobs and/or Hadoop Tools (Hive, Pig, others)
• Results loaded back to MySQL via Apache Sqoop
• Provide new data for real-time operational processes
• Provide broader, normalized data sets for BI Tool analytics
• Provides real-time replication of events between MySQL and
• MySQL Applier for Hadoop connects to the MySQL master and uses an API
(libhdfs, precompiled with Hadoop) to write to HDFS
• Reads the binary log and then:
• Fetches the row insert events occurring on the master
• Decodes events, extracts data inserted into each field of the row
• Uses content handlers to get it in required format
• Appends it to a text file in HDFS
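The steps above can be sketched as a small pipeline: select row-insert events, format each row, append it to a text sink. This is only an illustration of the flow; the `EVENTS` structure is hypothetical, a real binary log is a binary format that is not parsed here, and an in-memory buffer stands in for the HDFS file.

```python
import io

# Hypothetical, already-decoded events; the dict layout is an assumption.
EVENTS = [
    {"type": "insert", "table": "orders", "row": [1, "widget", 9.5]},
    {"type": "update", "table": "orders", "row": [1, "widget", 8.0]},
    {"type": "insert", "table": "orders", "row": [2, "gadget", None]},
]


def format_row(row, delimiter=","):
    """Content handler: render one row as a delimited text line (NULL becomes empty)."""
    return delimiter.join("" if value is None else str(value) for value in row)


def apply_inserts(events, sink):
    """Append each row-insert event to the sink; other event types are skipped."""
    written = 0
    for event in events:
        if event["type"] != "insert":
            continue
        sink.write(format_row(event["row"]) + "\n")
        written += 1
    return written
```

Passing an `io.StringIO()` as `sink` shows the append-only output; the update event is ignored, matching the "only row inserts" limitation noted in the slides.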
• Streaming real-time updates from MySQL into Hadoop for
immediate analysis
• Addresses performance issues from bulk loading
• Exploits existing replication protocol
• Provides row-based replication
• Consumable by other tools
• Possibilities for update/delete
• DDL not handled
• Only row inserts
MySQL Applier for Hadoop
Binary Log
Primary Key
• NoSQL interfaces directly to the InnoDB and MySQL Cluster (NDB)
storage engines
• Bypass the SQL layer completely
• Without SQL parsing and optimization, Key-Value data can be written
directly to MySQL tables up to 9x faster, while maintaining ACID
• Key Value Definition/Lookup
• Designed for fast single row lookup
• Loose Schema designed for fast lookup
• Data Pre-processing or Conditioning
• Run Real-time analytics against new data
• Pre-process or condition data before loading into Hadoop
• For example healthcare records can be anonymized
• Ingest high volume, high velocity data, with veracity
• ACID guarantees are not compromised
• Single stack for RDBMS and NoSQL
• High volume KVP processing
• Single-node processing:
• 70k transactions per second
• Clustered processing:
• 650k ACID-compliant writes per sec
• 19.5M writes per sec
• Auto-sharding across distributed clusters of commodity nodes
• Shared-nothing, fault-tolerant architecture for 99.999% uptime
•Scalable machine learning algorithms
•Primary focus:
• Collaborative filtering
• Clustering
• Classification
•In-memory cluster computing
• Allows user programs to load data into a cluster's memory
and query it repeatedly
•Streaming Architecture
•Cluster manager that manages
resources across distributed systems
•Allows finite control over system
• Stateful versus stateless (i.e. traditional virtualization)
•Granular role-based access control to data
•Addresses both data and metadata

Colorado Springs Open Source Hadoop/MySQL

  • 2. STRUCTURED AND UNSTRUCTURED DATA •Structured Data (Deductive Logic) • Analysis of defined relationships • Defined Data Architecture • SQL Compliant for fast processing with certainty • Precision, Speed •Unstructured Data (Inductive Logic) • Hypothesis testing against unknown relationships • Unknown (being less than 100% certainty) • Iterative analysis to a level-of-certainty • Open standards and tools • Extremely high rate of change in processing/tooling options • Volume, Speed
  • 3. STRUCTURED DATA: RDBMS •Capabilities • Defined Data Architecture/Structured Schema • Data Integrity • ACID Compliant • Atomicity - Requires that each transaction is "all or nothing" • Consistency - Any transaction will bring the database from one valid state to another • Isolation - Concurrent transactions produce the same result as if they were executed serially • Durability - Once the transaction is committed it persists • Real-time processing and analysis against known relationships •Limitations • Comparatively static data architecture • Requires defined data architecture for all data stored • Relatively smaller, defined, more discrete data sets
  • 4. UNSTRUCTURED DATA: NOSQL •Capabilities • Key Value lookup • Designed for fast single row lookup • Loose Schema designed for fast lookup • MySQL NoSQL Interface • Used to augment Big Data solutions •Limitations • Not designed for Analytics • Does not support 2 – 3 of the V for Big Data • On its own, NoSQL is not considered Big Data
  • 5. UNSTRUCTURED DATA: HADOOP •Capabilities • High data load with different data formats • Allows discovery and hypothesis testing against large data sets • Near Real-time processing and analysis against unknown relationships •Limitations • Not ACID compliant • No transactional consistency • High latency system • Not designed for real time lookups • Limited BI tool integration
  • 6. WHAT IS HADOOP? •What is Hadoop? • Fastest growing, commercial Big Data Technology • Basic Composition: • Mapper • Reducer • Hadoop File System (HDFS) • Approx 30 tools/subcomponents in the eco-system • Primarily produced so developers and admins do not have to write raw map/reduce code in Java •Systems Architecture: • Linux • Commodity x86 Servers • JBOD (standard block size 64-128MB) • Virtualization not recommended due to high I/O requirements • Open Source Project:
  • 7. HADOOP: QUICK HISTORY •Map Reduce Theory Paper: • Published 2004 • Jeffrey Dean and Sanjay Ghemawat • Built on GFS (Google File System) • Problem: • Ingest and search large data sets •Hadoop: • Doug Cutting, Cloudera (Yahoo) • Lucene (1999) – indexing large files • Nutch (2004) – search massive amounts of web data • Hadoop (2007) – first release 2007, started in 2005
  • 8. WHY IS HADOOP SO POPULAR? •Store everything regardless • Analyze Now, or analyze Later •Schema On-Read methodology • Allows you to store all the data and determine how to use it later •Low cost, scale out infrastructure • Low cost hardware and large storage pools • Allows for more of a load-it and forget-it approach •Usage • Sentiment analysis • Marketing campaign analysis • Customer churn modeling • Fraud detection • Research and Development • Risk Modeling
  • 10. MAP/REDUCE •Programming and execution framework •Taken from functional programming • Map – operate on every element • Reduce – combine and aggregate results •Abstracts storage, concurrency, execution • Just write two Java functions • Contrast with MPI
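The map and reduce steps above can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only, not Hadoop's Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`) are chosen here for illustration, and the `shuffle` step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["to be or not to be", "to do"])
print(counts)
```

The user writes only the map and reduce functions; storage, concurrency, and execution are what the framework abstracts away.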
  • 11. HDFS •Based on GFS •Distributed, fault-tolerant filesystem •Primarily designed for cost and scale • Works on commodity hardware • 20PB / 4000 node cluster at Facebook
  • 12. HDFS ASSUMPTIONS •Failures are common • Massive scale means more failures • Disks, network, node •Files are append-only •Files are large (GBs to TBs) •Accesses are large and sequential
  • 13. HDFS PRIMER •Same concepts as the FS on your laptop • Directory tree • Create, read, write, delete files •Filesystems store metadata and data • Metadata: filename, size, permissions, … • Data: contents of a file
  • 15. MAP REDUCE AND HDFS SUMMARY •GFS and MR co-design • Cheap, simple, effective at scale •Fault-tolerance baked in • Replicate data 3x • Incrementally re-execute computation • Avoid single points of failure •Held the world sort record (0.578TB/min)
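The "replicate data 3x" idea can be illustrated with a toy model: split a file into fixed-size blocks, place each block on three distinct nodes, and check which blocks remain readable after node failures. The round-robin placement and the node names below are assumptions made for illustration; real HDFS placement is rack-aware and is not reproduced here.

```python
def split_into_blocks(file_size, block_size):
    """Return the byte size of each block a file would be split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return [b for b, holders in placement.items() if set(holders) - failed]


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # two full blocks plus a remainder
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4", "n5"])
survivors = readable_blocks(placement, failed_nodes={"n1", "n2"})
print(placement, survivors)
```

With three replicas on distinct nodes, any two simultaneous node failures leave every block readable in this model.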
  • 16. SQOOP • Performs bidirectional data transfers between Hadoop and almost any SQL database with a JDBC driver
  • 17. FLUME • Streaming data collection and aggregation • Handles massive volumes of data from sources such as RPC services, Log4J, Syslog, etc. [Diagram: multiple clients sending data to Flume agents]
  • 18. HIVE • Relational database abstraction using a SQL-like dialect called HiveQL • Statements are executed as one or more MapReduce jobs SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN kjv k ON (s.word = k.word) WHERE s.freq >= 5;
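To show how a join such as the one in the HiveQL statement above could run as a single map/reduce job, here is a minimal reduce-side join in plain Python. It is a sketch of the general technique, not the plan Hive actually generates; the "left"/"right" tags and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_rows(table_tag, rows):
    """Map: key every (word, freq) row by word and tag it with its source table."""
    for word, freq in rows:
        yield word, (table_tag, freq)


def reduce_join(word, tagged, min_left_freq):
    """Reduce: pair left and right rows sharing a word, applying the WHERE filter."""
    left = [freq for tag, freq in tagged if tag == "left"]
    right = [freq for tag, freq in tagged if tag == "right"]
    return [(word, lf, rf) for lf in left for rf in right if lf >= min_left_freq]


def join(left_rows, right_rows, min_left_freq=5):
    groups = defaultdict(list)  # stands in for the shuffle between map and reduce
    for source in (map_rows("left", left_rows), map_rows("right", right_rows)):
        for word, tagged_value in source:
            groups[word].append(tagged_value)
    result = []
    for word in sorted(groups):
        result.extend(reduce_join(word, groups[word], min_left_freq))
    return result


# Hypothetical (word, freq) rows for the two tables being joined.
left = [("king", 12), ("thou", 40), ("sword", 3)]
right = [("king", 7), ("thou", 25), ("ark", 9)]
joined = join(left, right)
print(joined)
```

Words present in only one table produce no output, which matches inner-join semantics.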
  • 19. PIG • High-level scripting language for executing one or more MapReduce jobs • Created to simplify authoring of MapReduce jobs • Can be extended with user defined functions emps = LOAD 'people.txt' AS (id,name,salary); rich = FILTER emps BY salary > 200000; sorted_rich = ORDER rich BY salary DESC; STORE sorted_rich INTO 'rich_people.txt';
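For readers unfamiliar with Pig Latin, the LOAD, FILTER, and ORDER steps of the script above correspond roughly to the following plain Python. This is an in-memory equivalent for illustration only; Pig compiles such scripts into MapReduce jobs. The sample rows are made up, and tab-separated input is assumed because that is the default delimiter of Pig's standard loader.

```python
import csv
import io


def load(text, fields):
    """LOAD: read tab-separated lines into dicts keyed by field name."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(fields, row)) for row in reader]


# Hypothetical contents of people.txt.
sample = "1\tAda\t250000\n2\tBob\t90000\n3\tCyd\t310000\n"

emps = load(sample, ("id", "name", "salary"))
# FILTER emps BY salary > 200000
rich = [e for e in emps if int(e["salary"]) > 200000]
# ORDER rich BY salary DESC
sorted_rich = sorted(rich, key=lambda e: int(e["salary"]), reverse=True)

names = [e["name"] for e in sorted_rich]
print(names)
```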
  • 20. HBASE • Low-latency, distributed, columnar key-value store • Based on BigTable • Efficient random reads/writes on HDFS • Useful for frontend applications
  • 21. OOZIE • Workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
  • 22. HUE • Hue is an open source web-based application for making it easier to use Apache Hadoop. •Hue features • File Browser for HDFS • Job Designer/Browser for MapReduce • Query editors for Hive, Pig and Cloudera Impala • Oozie
  • 23. ZOOKEEPER • ZooKeeper is a distributed consensus engine • Provides well-defined concurrent access semantics: • Leader election • Service discovery • Distributed locking / mutual exclusion • Message board / mailboxes
  • 24. CASCADING •Next gen software abstraction layer for Map/Reduce •Create and execute complex data processing workflows • Specifically for a Hadoop cluster using any JVM-based language • Java • JRuby • Clojure •Generally acknowledged as a better alternative to Hive/Pig
  • 26. CHARACTERISTICS OF BIG DATA •Big Data covers 4 dimensions • Volume - 90% of all the data stored in the world has been produced in the last 2 years • Velocity – The ability to perform advanced analytics on Terabytes or Petabytes of data in minutes to hours compared to days • Variety – Any data type from structured to unstructured data including image files, social media, relational database content, and text data from weblogs or sensors • Veracity - 1 in 3 business leaders don’t trust the information they use to make decisions. How do we ensure the results are accurate and meaningful?
  • 27. BIG DATA CHALLENGES • Loading web logs into MySQL • How do you parse and keep all the Data? • What about the variability of the Query String Parameters? • What if the web log format changes? • Integration of other data sources • Social Media – Back in the early days even Facebook didn’t keep all the data. How do we know what is important in the stream? • Video and Image data – How do we store that type of data so we can extract the metadata information? • Sensor Data – Imagine all the different devices producing data and the different formats of the data.
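The web log questions above (variable query string parameters, changing formats) are the usual motivation for parsing at read time rather than forcing every field into fixed columns. A minimal sketch, assuming Common Log Format input: the regular expression and the output field names are choices made here, and lines that do not match are kept raw instead of being dropped.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Common Log Format prefix; anything after the size field is ignored here.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Parse one log line; keep the raw text so nothing is lost if parsing fails."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return {"raw": line}
    record = match.groupdict()
    parts = urlsplit(record.pop("url"))
    record["path"] = parts.path
    # Query parameters vary per request, so store them as a free-form mapping.
    record["params"] = {k: v[0] for k, v in parse_qs(parts.query).items()}
    record["raw"] = line
    return record


line = ('203.0.113.9 - - [10/Oct/2013:13:55:36 -0700] '
        '"GET /search?q=hadoop&page=2 HTTP/1.1" 200 2326')
parsed = parse_line(line)
print(parsed["path"], parsed["params"])
```

Storing the parameters as a mapping means a new parameter needs no schema change, which is harder to achieve with a fixed MySQL table layout.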
  • 30. LIFE CYCLE OF BIG DATA •Acquire • Data captured at source • Part of ongoing operational processes (Web Log, RDBMS) •Organize • Data transferred from operational systems to Big Data Platform •Analyze • Data processed in batch by Map/Reduce • Data processed by Hadoop Tools (Hive, Pig) • Can Pre-condition data that is loaded back into RDBMS •Decide • Load back into Operational Systems • Load into BI Tools and ODS
  • 32. LIFE CYCLE OF BIG DATA: MYSQL •Acquire • MySQL as a Data Source • MySQL’s NoSQL • New NoSQL APIs • Ingest high volume, high velocity data, with veracity • ACID guarantees not compromised • Data Pre-processing or Conditioning • Run Real-time analytics against new data • Pre-process or condition data before loading into Hadoop • For example healthcare records can be anonymized
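As one possible illustration of conditioning records before they are loaded into Hadoop, the sketch below drops directly identifying fields and replaces an identifier with a keyed hash. The field names, the key, and the sample record are hypothetical. Strictly speaking this is pseudonymization: it keeps records joinable across data sets, and whether it is sufficient for a given healthcare regulation is outside the scope of this example.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 hash (stable for the same key)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def condition_record(record, secret_key,
                     id_fields=("patient_id",), drop_fields=("name",)):
    """Drop directly identifying fields and hash identifier fields."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    for field in id_fields:
        if field in cleaned:
            cleaned[field] = pseudonymize(str(cleaned[field]), secret_key)
    return cleaned


key = b"example-key-keep-this-secret"  # hypothetical; manage real keys securely
record = {"patient_id": "P-1001", "name": "Jane Doe", "diagnosis": "J45"}
cleaned = condition_record(record, key)
print(cleaned)
```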
  • 33. LIFE CYCLE OF BIG DATA: MYSQL •Organize • Data transferred in batches from MySQL tables to Hadoop using Apache Sqoop or MySQL Applier • With Applier, users can also invoke real-time change data capture processes to stream new data from MySQL to HDFS as it is committed by the client. •Analyze • Multi-structured, multi-sourced data consolidated and processed • Run Map/Reduce Jobs and/or Hadoop Tools (Hive, Pig, others) •Decide • Results loaded back to MySQL via Apache Sqoop • Provide new data for real-time operational processes • Provide broader, normalized data sets for BI Tool analytics
  • 34. TOOLS: MYSQL APPLIER •Overview • Provides real-time replication of events between MySQL and Hadoop •Usage • MySQL Applier for Hadoop connects to the MySQL master through the Binlog API • Reads the binary log and then: • Fetches the row insert events occurring on the master • Decodes events, extracts data inserted into each field of the row • Uses content handlers to get it in required format • Appends it to a text file in HDFS using libhdfs (precompiled with Hadoop)
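The decode, format, and append steps listed above can be sketched as a small pipeline. This is not the Applier's code or API: the event dictionaries are a hypothetical stand-in for decoded binary log events, and a local temporary file stands in for the HDFS text file.

```python
import os
import tempfile


def decode_insert(event):
    """Extract field values from a (hypothetical) row-insert event dict."""
    if event.get("type") != "insert":
        return None  # non-insert events are skipped in this sketch
    return [event["timestamp"], *event["row"]]


def format_row(values, delimiter=","):
    """Content handler: turn field values into one delimited text line."""
    return delimiter.join(str(v) for v in values) + "\n"


def apply_events(events, path):
    """Append each decoded insert to a text file (standing in for HDFS)."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for event in events:
            values = decode_insert(event)
            if values is not None:
                out.write(format_row(values))
                written += 1
    return written


events = [
    {"type": "insert", "timestamp": 1380000000, "row": [1, "alice"]},
    {"type": "update", "timestamp": 1380000001, "row": [1, "alicia"]},
    {"type": "insert", "timestamp": 1380000002, "row": [2, "bob"]},
]
target = os.path.join(tempfile.mkdtemp(), "datafile1.txt")
written = apply_events(events, target)
print(written)
```

Because the file is opened in append mode, repeated calls add new rows without rewriting earlier ones, which mirrors the append-only assumption of HDFS.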
  • 35. TOOLS: MYSQL APPLIER •Capabilities • Streaming real-time updates from MySQL into Hadoop for immediate analysis • Addresses performance issues from bulk loading • Exploits existing replication protocol • Provides row-based replication • Consumable by other tools • Possibilities for update/delete •Limitations • DDL not handled • Only row inserts
  • 36. TOOLS: MYSQL APPLIER [Diagram: binary log events are read through the Binlog API by MySQL Applier for Hadoop, which decodes each row (timestamp, primary key, data) and writes it to HDFS through libhdfs]
  • 37. TOOLS: MYSQL NOSQL •Overview • NoSQL interfaces directly to the InnoDB and MySQL Cluster (NDB) storage engines • Bypass the SQL layer completely • Without SQL parsing and optimization, Key-Value data can be written directly to MySQL tables up to 9x faster, while maintaining ACID guarantees. •Usage • Key Value Definition/Lookup • Designed for fast single row lookup • Loose Schema designed for fast lookup • Data Pre-processing or Conditioning • Run Real-time analytics against new data • Pre-process or condition data before loading into Hadoop • For example healthcare records can be anonymized
  • 38. TOOLS: MYSQL NOSQL •Capabilities • Ingest high volume, high velocity data, with veracity • ACID guarantees are not compromised • Single stack for RDBMS and NoSQL • High volume KVP processing • Single-node processing: • 70k transactions per second • Clustered processing: • 650k ACID-compliant writes per sec • 19.5M writes per sec • Auto-sharding across distributed clusters of commodity nodes • Shared-nothing, fault-tolerant architecture for 99.999% uptime •Limitations • <specify>
  • 40. MAHOUT •Scalable machine learning algorithms •Primary focus: • Collaborative filtering • Clustering • Classification
  • 41. SPARK •In-memory cluster computing • Allows user programs to load data into a cluster's memory and query it repeatedly •Streaming Architecture
  • 42. MESOS •Cluster manager that manages resources across distributed systems •Allows fine-grained control over system resources • Stateful versus stateless (i.e. traditional virtualization architecture)
  • 43. SENTRY •Granular role-based access control to data •Addresses both data and metadata