This document discusses MySQL and Hadoop integration. It covers structured versus unstructured data and the capabilities and limitations of relational databases, NoSQL, and Hadoop. It also describes several tools for integrating MySQL and Hadoop, including Sqoop for data transfers, MySQL Applier for streaming changes to Hadoop, and MySQL NoSQL interfaces. The document outlines the typical life cycle of big data with MySQL playing a role in data acquisition, organization, analysis, and decisions.
2. STRUCTURED AND UNSTRUCTURED DATA
•Structured Data (Deductive Logic)
  • Analysis of defined relationships
  • Defined Data Architecture
  • SQL compliant for fast processing with certainty
  • Precision, speed
•Unstructured Data (Inductive Logic)
  • Hypothesis testing against unknown relationships
  • Results carry less than 100% certainty
  • Iterative analysis to a target level of certainty
  • Open standards and tools
  • Extremely high rate of change in processing/tooling options
  • Volume, speed
3. STRUCTURED DATA: RDBMS
•Capabilities
  • Defined Data Architecture/Structured Schema
  • Data Integrity
  • ACID Compliant (see the JDBC sketch after this slide)
    • Atomicity - Requires that each transaction is "all or nothing"
    • Consistency - Any transaction will bring the database from one valid state to another
    • Isolation - Concurrent transactions produce the same result as if they were issued serially
    • Durability - Once a transaction is committed, it persists
  • Real-time processing and analysis against known relationships
•Limitations
  • Comparatively static data architecture
  • Requires defined data architecture for all data stored
  • Relatively smaller, defined, more discrete data sets
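To make the ACID bullets concrete, here is a minimal JDBC sketch against MySQL: either both updates commit together or neither is applied. The connection settings and the accounts table are hypothetical placeholders, not part of the original deck.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection settings; adjust host, schema, and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password")) {
            conn.setAutoCommit(false); // start an explicit transaction
            try (PreparedStatement debit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                 PreparedStatement credit = conn.prepareStatement(
                     "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                debit.setBigDecimal(1, new java.math.BigDecimal("100.00"));
                debit.setInt(2, 1);
                debit.executeUpdate();
                credit.setBigDecimal(1, new java.math.BigDecimal("100.00"));
                credit.setInt(2, 2);
                credit.executeUpdate();
                conn.commit();   // durability: both updates persist together
            } catch (SQLException e) {
                conn.rollback(); // atomicity: neither update is applied
                throw e;
            }
        }
    }
}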
4. UNSTRUCTURED DATA: NOSQL
•Capabilities
  • Key Value lookup
  • Designed for fast single row lookup
  • Loose schema designed for fast lookup
  • MySQL NoSQL Interface
  • Used to augment Big Data solutions
•Limitations
  • Not designed for analytics
  • Does not address 2-3 of the 4 V's of Big Data
  • On its own, NoSQL is not considered Big Data
5. UNSTRUCTURED DATA: HADOOP
•Capabilities
  • High data load with different data formats
  • Allows discovery and hypothesis testing against large data sets
  • Near real-time processing and analysis against unknown relationships
•Limitations
  • Not ACID compliant
  • No transactional consistency
  • High latency system
  • Not designed for real-time lookups
  • Limited BI tool integration
6. WHAT IS HADOOP?
•What is Hadoop?
  • Fastest growing, commercial Big Data technology
  • Basic composition:
    • Mapper
    • Reducer
    • Hadoop File System (HDFS)
  • Approx. 30 tools/subcomponents in the ecosystem
    • Primarily produced so developers and admins do not have to write raw map/reduce code in Java
•Systems Architecture:
  • Linux
  • Commodity x86 servers
  • JBOD (standard block size 64-128MB)
  • Virtualization not recommended due to high I/O requirements
•Open Source Project:
  • http://hadoop.apache.org/
7. HADOOP: QUICK HISTORY
•Map Reduce theory paper:
  • Published 2004
  • Jeffrey Dean and Sanjay Ghemawat
  • Built on GFS (Google File System)
  • Problem: ingest and search large data sets
•Hadoop:
  • Doug Cutting (Yahoo!, later Cloudera)
  • Lucene (1999) – indexing large files
  • Nutch (2004) – search massive amounts of web data
  • Hadoop (2007) – started in 2005, first release 2007
8. WHY IS HADOOP SO POPULAR?
•Store everything regardless
  • Analyze now, or analyze later
•Schema-on-read methodology
  • Allows you to store all the data and determine how to use it later
•Low cost, scale-out infrastructure
  • Low cost hardware and large storage pools
  • Allows for more of a load-it and forget-it approach
•Usage
  • Sentiment analysis
  • Marketing campaign analysis
  • Customer churn modeling
  • Fraud detection
  • Research and development
  • Risk modeling
10. MAP/REDUCE
•Programming and execution framework
•Taken from functional programming
  • Map – operate on every element
  • Reduce – combine and aggregate results
•Abstracts storage, concurrency, execution
  • Just write two Java functions (see the sketch below)
  • Contrast with MPI
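To illustrate the "two Java functions", here is the classic word-count pair written against Hadoop's org.apache.hadoop.mapreduce API. A minimal sketch: job setup and I/O paths are omitted, and class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: called once per input record; emits (word, 1) for every token.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce: called once per distinct word; combines and aggregates the counts.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}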
11. HDFS
•Based on GFS
•Distributed, fault-tolerant filesystem
•Primarily designed for cost and scale
• Works on commodity hardware
• 20PB / 4000 node cluster at Facebook
12. HDFS ASSUMPTIONS
•Failures are common
• Massive scale means more failures
• Disks, network, node
•Files are append-only
•Files are large (GBs to TBs)
•Accesses are large and sequential
13. HDFS PRIMER
•Same concepts as the FS on your laptop
  • Directory tree
  • Create, read, write, delete files
•Filesystems store metadata and data
  • Metadata: filename, size, permissions, …
  • Data: contents of a file
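To show how closely the model tracks a conventional filesystem, here is a small sketch using Hadoop's Java FileSystem API to create, read, and delete a file. The NameNode URI and the path are placeholders; on a real cluster they come from core-site.xml.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/example.txt");

        // Create and write (HDFS files are append-only once written).
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the contents back.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }

        // Delete (non-recursive).
        fs.delete(file, false);
    }
}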
15. MAP REDUCE AND HDFS SUMMARY
•GFS and MR co-design
  • Cheap, simple, effective at scale
•Fault-tolerance baked in
  • Replicate data 3x
  • Incrementally re-execute computation
  • Avoid single points of failure
•Held the world sort record (0.578TB/min)
17. FLUME
• Streaming data collection and aggregation
• Massive volumes of data, such as RPC services, Log4J, Syslog, etc.
(Diagram: many Clients fan in to a smaller tier of Agents)
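For a sense of the client side of that diagram, here is a minimal sketch of handing events to a Flume agent from application code using the Flume client SDK. It assumes a hypothetical agent running an Avro source on localhost:41414; both are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws EventDeliveryException {
        // Placeholder agent address; points at an Avro source on the agent.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            Event event = EventBuilder.withBody(
                    "sample log line", StandardCharsets.UTF_8);
            client.append(event); // hand the event to the agent
        } finally {
            client.close();
        }
    }
}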
18. HIVE
• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more Map Reduce jobs

SELECT s.word, s.freq, k.freq
FROM shakespeare s
JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
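HiveQL can also be issued from application code over JDBC. A minimal sketch, assuming a HiveServer2 instance at the placeholder address below and the Hive JDBC driver on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; ships with Hive.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Placeholder HiveServer2 address and database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // Each statement is compiled into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT s.word, s.freq FROM shakespeare s WHERE s.freq >= 5")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
            }
        }
    }
}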
19. PIG
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user defined functions

emps = LOAD 'people.txt' AS (id,name,salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
21. OOZIE
• Workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
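As a sketch of what job orchestration looks like from application code, the snippet below submits a workflow through Oozie's Java client. The server URL, HDFS application path, and cluster addresses are all placeholders.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL.
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        // Job configuration: where the workflow.xml lives in HDFS, plus parameters.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/me/my-wf");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("jobTracker", "localhost:8032");

        // Submit and start the workflow; returns a job id for polling status.
        String jobId = oozie.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
    }
}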
22. HUE
• Hue is an open source web-based application for making it easier to use Apache Hadoop.
•Hue features
  • File Browser for HDFS
  • Job Designer/Browser for MapReduce
  • Query editors for Hive, Pig and Cloudera Impala
  • Oozie
24. CASCADING
•Next-gen software abstraction layer for Map/Reduce
•Create and execute complex data processing workflows
  • Specifically for a Hadoop cluster using any JVM-based language
    • Java
    • JRuby
    • Clojure
•Often cited as a better alternative to Hive/Pig for complex, programmatic workflows
26. CHARACTERISTICS OF BIG DATA
•Big Data covers 4 dimensions
  • Volume – 90% of all the data stored in the world has been produced in the last 2 years
  • Velocity – The ability to perform advanced analytics on terabytes or petabytes of data in minutes to hours, compared to days
  • Variety – Any data type, from structured to unstructured data, including image files, social media, relational database content, and text data from weblogs or sensors
  • Veracity – 1 in 3 business leaders don't trust the information they use to make decisions. How do we ensure the results are accurate and meaningful?
27. BIG DATA CHALLENGES
• Loading web logs into MySQL
  • How do you parse and keep all the data?
  • What about the variability of the query string parameters?
  • What if the web log format changes?
• Integration of other data sources
  • Social media – Back in the early days even Facebook didn't keep all the data. How do we know what is important in the stream?
  • Video and image data – How do we store that type of data so we can extract the metadata information?
  • Sensor data – Imagine all the different devices producing data and the different formats of the data.
30. LIFE CYCLE OF BIG DATA
•Acquire
• Data captured at source
• Part of ongoing operational processes (Web Log, RDBMS)
•Organize
• Data transferred from operational systems to Big Data Platform
•Analyze
• Data processed in batch by Map/Reduce
• Data processed by Hadoop Tools (Hive, Pig)
• Can Pre-condition data that is loaded back into RDBMS
•Decide
• Load back into Operational Systems
• Load into BI Tools and ODS
31. MYSQL INTEGRATION WITH THE BIG DATA LIFE CYCLE
(Diagram: Acquire → Organize → Analyze → Decide cycle; MySQL feeds Hadoop via the Applier and NoSQL interfaces, ingesting sensor logs and web logs, with results flowing back out to BI tools)
32. LIFE CYCLE OF BIG DATA: MYSQL
•Acquire
  • MySQL as a Data Source
  • MySQL's NoSQL
    • New NoSQL APIs
    • Ingest high volume, high velocity data, with veracity
    • ACID guarantees not compromised
  • Data pre-processing or conditioning
    • Run real-time analytics against new data
    • Pre-process or condition data before loading into Hadoop
    • For example, healthcare records can be anonymized
33. LIFE CYCLE OF BIG DATA: MYSQL
•Organize
  • Data transferred in batches from MySQL tables to Hadoop using Apache Sqoop or MySQL Applier
  • With Applier, users can also invoke real-time change data capture processes to stream new data from MySQL to HDFS as it is committed by the client.
•Analyze
  • Multi-structured, multi-sourced data consolidated and processed
  • Run Map/Reduce jobs and/or Hadoop tools (Hive, Pig, others)
•Decide
  • Results loaded back to MySQL via Apache Sqoop
  • Provide new data for real-time operational processes
  • Provide broader, normalized data sets for BI tool analytics
34. TOOLS: MYSQL APPLIER
•Overview
  • Provides real-time replication of events between MySQL and Hadoop
•Usage
  • MySQL Applier for Hadoop connects to the MySQL master and writes to HDFS via libhdfs (an API precompiled with Hadoop)
  • Reads the binary log and then:
    • Fetches the row insert events occurring on the master
    • Decodes events, extracts data inserted into each field of the row
    • Uses content handlers to get it in the required format
    • Appends it to a text file in HDFS
35. TOOLS: MYSQL APPLIER
•Capabilities
  • Streaming real-time updates from MySQL into Hadoop for immediate analysis
  • Addresses performance issues from bulk loading
  • Exploits existing replication protocol
  • Provides row-based replication
  • Consumable by other tools
  • Possibilities for update/delete
•Limitations
  • DDL not handled
  • Only row inserts
37. TOOLS: MYSQL NOSQL
•Overview
  • NoSQL interfaces directly to the InnoDB and MySQL Cluster (NDB) storage engines
  • Bypass the SQL layer completely
  • Without SQL parsing and optimization, key-value data can be written directly to MySQL tables up to 9x faster, while maintaining ACID guarantees
•Usage
  • Key value definition/lookup
    • Designed for fast single row lookup
    • Loose schema designed for fast lookup
  • Data pre-processing or conditioning
    • Run real-time analytics against new data
    • Pre-process or condition data before loading into Hadoop
    • For example, healthcare records can be anonymized
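Because the InnoDB NoSQL interface speaks the standard memcached protocol, any memcached client can drive it. A minimal sketch using the spymemcached Java library, assuming the daemon_memcached plugin is enabled on the default port; the key and value are illustrative.

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class InnoDbMemcachedExample {
    public static void main(String[] args) throws Exception {
        // Placeholder address; the InnoDB memcached plugin listens on 11211 by default.
        MemcachedClient client = new MemcachedClient(
                new InetSocketAddress("localhost", 11211));

        // Write a key-value pair straight through to the InnoDB-backed table,
        // bypassing SQL parsing and optimization.
        client.set("user:42", 0, "{\"name\":\"alice\"}").get(); // 0 = no expiry

        // Fast single-row lookup by key.
        Object value = client.get("user:42");
        System.out.println(value);

        client.shutdown();
    }
}

The same rows remain visible to ordinary SQL queries, which is what makes the "single stack for RDBMS and NoSQL" point on the next slide possible.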
38. TOOLS: MYSQL NOSQL
•Capabilities
  • Ingest high volume, high velocity data, with veracity
  • ACID guarantees are not compromised
  • Single stack for RDBMS and NoSQL
  • High volume KVP processing
    • Single-node processing: 70k transactions per second
    • Clustered processing: 650k ACID-compliant writes per sec
    • 19.5M writes per sec
  • Auto-sharding across distributed clusters of commodity nodes
  • Shared-nothing, fault-tolerant architecture for 99.999% uptime
•Limitations
  • <specify>
42. MESOS
•Cluster manager that manages resources across distributed systems
•Allows fine-grained control over system resources
  • Stateful versus stateless (i.e. traditional virtualization architecture)