The webinar discusses how organizations can make big data easy to use with the right tools and talent. It covers MetaScale's expertise in helping Sears Holdings implement Hadoop and explains how Kognitio's in-memory analytics platform can accelerate Hadoop for organizations. The webinar agenda includes an introduction, a case study on Sears Holdings' Hadoop implementation, an explanation of how Kognitio's platform accelerates Hadoop, and a Q&A session.
Introduction to Hadoop - The Essentials (Fadi Yousuf)
This document provides an introduction to Hadoop, including:
- A brief history of Hadoop and how it was created to address limitations of relational databases for big data.
- An overview of core Hadoop concepts like its shared-nothing architecture and using computation near storage.
- Descriptions of HDFS for distributed storage and MapReduce as the original programming framework.
- How the Hadoop ecosystem has grown to include additional frameworks like Hive, Pig, HBase and tools like Sqoop and Zookeeper.
- A discussion of YARN which separates resource management from job scheduling in Hadoop.
The document introduces JStorm, an open source distributed real-time computation framework. It was created by Alibaba to address issues with Apache Storm and improve performance for real-time applications. JStorm has been used by Alibaba to process over 3 trillion messages per day across 3000+ servers. Key features discussed include high throughput, fault tolerance, horizontal scalability, and more powerful scheduling capabilities compared to Storm.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
Hadoop and WANdisco: The Future of Big Data (WANdisco Plc)
View the webinar recording here... http://youtu.be/O1pgMMyoJg0
Who: WANdisco CEO, David Richards, and core creators of Apache Hadoop, Dr. Konstantin Shvachko and Jagane Sundare.
What: WANdisco recently acquired AltoStor, a pioneering firm with deep expertise in the multi-billion dollar Big Data market.
New to the WANdisco team are the Hadoop core creators, Dr. Konstantin Shvachko and Jagane Sundare. They will cover the acquisition and reveal how WANdisco's active-active replication technology will change the game of Big Data for the enterprise in 2013.
Hadoop, a proven open source Big Data technology, is the backbone of Yahoo, Facebook, Netflix, Amazon, eBay and many of the world's largest databases.
When: Tuesday, December 11th at 10am PST (1pm EST).
Why: In this 30-minute webinar you’ll learn:
The staggering, cross-industry growth of Hadoop in the enterprise
How Hadoop's limitations, including HDFS's single-point of failure, are impacting the productivity of the enterprise
How WANdisco's active-active replication technology will alleviate these issues by adding high-availability to Hadoop, taking a fundamentally different approach to Big Data
View the webinar Q&A on the WANdisco blog here...http://blogs.wandisco.com/2012/12/14/answers-to-questions-from-the-webinar-of-dec-11-2012/
Introducing Kudu, Big Data Warehousing Meetup (Caserta)
Not just an SQL interface or file system, Kudu, the new updatable column store for Hadoop, is changing the storage landscape. It's easy to operate and makes new data immediately available for analytics or operations.
At the Caserta Concepts Big Data Warehousing Meetup, our guests from Cloudera outlined the functionality of Kudu and talked about why it will become an integral component in big data warehousing on Hadoop.
To learn more about what Caserta Concepts has to offer, visit http://casertaconcepts.com/
This document provides an overview of big data and Hadoop. It discusses what big data is, why it has become important recently, and common use cases. It then describes how Hadoop addresses challenges of processing large datasets by distributing data and computation across clusters. The core Hadoop components of HDFS for storage and MapReduce for processing are explained. Example MapReduce jobs like wordcount are shown. Finally, higher-level tools like Hive and Pig that provide SQL-like interfaces are introduced.
Building a Business on Hadoop, HBase, and Open Source Distributed Computing (Bradford Stephens)
This is a talk on a fundamental approach to thinking about scalability, and how Hadoop, HBase, and Lucene are enabling companies to process amazing amounts of data. It's also about how Social Media is making the traditional RDBMS irrelevant.
The document summarizes a technical seminar on Hadoop. It discusses Hadoop's history and origin, how it was developed from Google's distributed systems, and how it provides an open-source framework for distributed storage and processing of large datasets. It also summarizes key aspects of Hadoop including HDFS, MapReduce, HBase, Pig, Hive and YARN, and how they address challenges of big data analytics. The seminar provides an overview of Hadoop's architecture and ecosystem and how it can effectively process large datasets measured in petabytes.
Best Practices for Deploying Hadoop (BigInsights) in the Cloud (Leons Petražickis)
This document provides best practices for optimizing the performance of InfoSphere BigInsights and InfoSphere Streams when deployed in the cloud. It discusses optimizing disk performance by choosing cloud providers and instances with good disk I/O, partitioning and formatting disks correctly, and configuring HDFS to use multiple data directories. It also discusses optimizing Java performance by correctly configuring JVM memory and optimizing MapReduce performance by setting appropriate values for map and reduce tasks based on machine resources.
The document is a slide deck for a training on Hadoop fundamentals. It includes an agenda that covers what big data is, an introduction to Hadoop, the Hadoop architecture, MapReduce, Pig, Hive, Jaql, and certification. It provides overviews and explanations of these topics through multiple slides with images and text. The slides also describe hands-on labs for attendees to complete exercises using these big data technologies.
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
Kudu is an open source storage engine that provides low-latency random reads and writes while also supporting efficient analytical queries. It horizontally partitions and replicates data across servers for high availability and performance. Kudu integrates with Hadoop ecosystems tools like Impala, Spark, and MapReduce. The demo will cover Kudu architecture, data storage, and how to implement Kudu in a buffer load using Scala and Impala.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of a few minutes and chaining of incremental processing in Hadoop.
Overview of Stinger interactive query for Hive (David Kaiser)
This document provides an overview of the Stinger initiative to improve the performance of Hive interactive queries. The Stinger project worked to optimize Hive so that queries return results in seconds instead of minutes or hours by implementing features like Hive on Tez, vectorized processing, predicate pushdown, the ORC file format, and a cost-based optimizer. These optimizations improved Hive performance by over 100 times, allowing interactive use of Hive for the first time on large datasets.
Hadoop meets Agile! - An Agile Big Data Model (Uwe Printz)
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
Talk on Apache Kudu, presented by Asim Jalis at SF Data Engineering Meetup on 2/23/2016.
http://www.meetup.com/SF-Data-Engineering/events/228293610/
Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not good at analytics. HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics. What if you could use a single system for both use cases?
This could dramatically simplify your data pipeline architecture.
This is where Kudu comes in. Kudu is a storage system that lives between HDFS and HBase. It is good at both ingesting streaming data and good at analyzing it using Spark, MapReduce, and SQL.
February 2016 HUG: Apache Kudu (incubating): New Apache Hadoop Storage for Fa... (Yahoo Developer Network)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Apache Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets. Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads. This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem with out-of-the-box integration with Apache Spark, that fills the gap described above to provide a new option to achieve fast scans and fast random access from a single API.
Speakers:
David Alves. Software engineer at Cloudera working on the Kudu team, and a PhD student at UT Austin. David is a committer at the Apache Software Foundation and has contributed to several open source projects, including Apache Cassandra and Apache Drill.
Strata + Hadoop World 2012: Data Science on Hadoop: How Cloudera Impala Unloc... (Cloudera, Inc.)
This talk will cover what tools and techniques work and don’t work well for data scientists working on Hadoop today and how to leverage the lessons learned by the experts to increase your productivity as well as what to expect for the future of data science on Hadoop. We will leverage insights derived from the top data scientists working on big data systems at Cloudera as well as experiences from running big data systems at Facebook, Google, and Yahoo.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ... (Precisely)
This document discusses engineering machine learning data pipelines and addresses five big challenges: 1) scattered and difficult to access data, 2) data cleansing at scale, 3) entity resolution, 4) tracking data lineage, and 5) ongoing real-time changed data capture and streaming. It presents DMX Change Data Capture as a solution to capture changes from various data sources and replicate them in real-time to targets like Kafka, HDFS, databases and data lakes to feed machine learning models. Case studies demonstrate how DMX-h has helped customers like a global hotel chain and insurance and healthcare companies build scalable data pipelines.
20160331 sa introduction to big data pipelining berlin meetup 0.3 (Simon Ambridge)
This document discusses building data pipelines with Apache Spark and DataStax Enterprise (DSE) for both static and real-time data. It describes how DSE provides a scalable, fault-tolerant platform for distributed data storage with Cassandra and real-time analytics with Spark. It also discusses using Kafka as a messaging queue for streaming data and processing it with Spark. The document provides examples of using notebooks, Parquet, and Akka for building pipelines to handle both large static datasets and fast, real-time streaming data sources.
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent (Seeling Cheung)
Nicholas Berg presented on Seagate's use of big data analytics to manage the large amount of manufacturing data generated from its hard drive production. Seagate collects terabytes of data per day from testing its drives, which it analyzes using Hadoop to improve quality, predict failures, and gain other insights. It faces challenges in integrating this emerging platform due to the rapid evolution of Hadoop and lack of tools to fully leverage large datasets. Seagate is developing its data lake and data science capabilities on Hadoop to better optimize manufacturing and drive design.
Accelerating analytics in a new era of data (Arnon Shimoni)
Organizations today produce exponentially more data than they did just a few years ago, but their databases weren’t built to handle these new volumes. As a result, reporting takes way too long, and some complex analytics simply cannot be done. The Era of Massive Data is upon us, and a new approach is required to overcome the limitations of traditional CPU-based data stores.
KEY TAKEAWAYS
- Flexible data exploration with minimal preparation
- Unrestricted access to your organization’s full scope of data
- Access to previously unobtainable insights, for smarter business decisions
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the three other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
From: DataWorks Summit Munich 2017 - 20170406
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha... (DATAVERSITY)
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We’ll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it’s a bad fit.
Drop the herd mentality. In reality, there is no “one size fits all” right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
Building a High Performance Analytics Platform (Santanu Dey)
The document discusses using flash memory to build a high performance data platform. It notes that flash memory is faster than disk storage and cheaper than RAM. The platform utilizes NVMe flash drives connected via PCIe for high speed performance. This allows it to provide in-memory database speeds at the cost and density of solid state drives. It can scale independently by adding compute nodes or storage nodes. The platform offers a unified database for both real-time and analytical workloads through common APIs.
Technologies for Data Analytics Platform (N Masahiro)
This document discusses building a data analytics platform and summarizes various technologies that can be used. It begins by outlining reasons for analyzing data like reporting, monitoring, and exploratory analysis. It then discusses using relational databases, parallel databases, Hadoop, and columnar storage to store and process large volumes of data. Streaming technologies like Storm, Kafka, and services like Redshift, BigQuery, and Treasure Data are also summarized as options for a complete analytics platform.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture (DATAVERSITY)
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what Big Data and a Data Lake are, and what the most popular technologies used in the Big Data world are. We will also speak about Hadoop and Spark, how they integrate with traditional systems, and their benefits.
Colorado Springs Open Source Hadoop/MySQL David Smelker
This document discusses MySQL and Hadoop integration. It covers structured versus unstructured data and the capabilities and limitations of relational databases, NoSQL, and Hadoop. It also describes several tools for integrating MySQL and Hadoop, including Sqoop for data transfers, MySQL Applier for streaming changes to Hadoop, and MySQL NoSQL interfaces. The document outlines the typical life cycle of big data with MySQL playing a role in data acquisition, organization, analysis, and decisions.
VMworld 2013: Virtualizing Databases: Doing IT Right (VMworld)
VMworld 2013
Michael Corey, Ntirety, Inc
Jeff Szastak, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It includes HDFS, a distributed file system, and MapReduce, a programming model for large-scale data processing. HDFS stores data reliably across clusters and allows computations to be processed in parallel near the data. The key components are the NameNode, DataNodes, JobTracker and TaskTrackers. HDFS provides high throughput access to application data and is suitable for applications handling large datasets.
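As a concrete illustration of the MapReduce model described in that summary, here is a minimal word-count sketch in the Hadoop Streaming style, written in Python. The script names (mapper.py, reducer.py) and the way they would be run are illustrative assumptions, not material from the slides.

#!/usr/bin/env python
# mapper.py (illustrative): emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py (illustrative): input arrives sorted by word; sum the counts per word
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))

Locally the same pipeline can be simulated with a shell pipe (mapper, then sort, then reducer); on a cluster the two scripts would be submitted via the Hadoop Streaming jar.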
Ledingkart Meetup #4: Data pipeline @ lk (Mukesh Singh)
Building a Data Pipeline - Case studies
This document discusses building data pipelines at three companies: NoBroker, Treebo, and LendingKart. It describes the business needs for data and analytics that motivated building pipelines. Key aspects of data pipelines discussed include moving, joining, reformatting data between systems reliably. Lessons from building pipelines include ensuring scalability, availability, and reliability as data volume grows.
Building a scalable analytics environment to support diverse workloads (Alluxio, Inc.)
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
Building a scalable analytics environment to support diverse workloads
Tom Panozzo, Chief Technology Officer (Aunalytics)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Introduction to Big Data and NoSQL.
This presentation was given to the Master DBA course at John Bryce Education in Israel.
Work is based on presentations by Michael Naumov, Baruch Osoveskiy, Bill Graham and Ronen Fidel.
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect (Remy Rosenbaum)
Tableau on Hadoop Meet Up: Advancing from Extracts to Live Connect. Jethro CEO Eli Singer discusses the limitations of Hadoop with Tableau. Presentation explores how Jethro's index-based architecture enables Tableau users to live-connect to data on Hadoop while maintaining the fast interactive speeds that they expect.
Meta scale kognitio hadoop webinar
1. Webinar: Make Big Data Easy with the Right Tools and Talent
MetaScale Expertise and Kognitio Analytics Accelerate Hadoop for Organizations Large and Small
October 2012
2. Today’s webinar
• 45 minutes with 15 minutes Q&A
• We will email you a link to the slides
• Feel free to use the Q & A feature
3. Agenda
• Opening introduction
• MetaScale Expertise
– Case study: Sears Holdings
• Kognitio Analytics
– Hadoop acceleration explained
• Summary
• Q&A
Presenters
Dr. Phil Shelley, CEO, MetaScale and CTO, Sears Holdings
Roger Gaskell, CTO, Kognitio
Host
Michael Hiskey, VP Marketing & Business Development, Kognitio
4. Big Data <> Hadoop
Big Data is high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
Volume: (not only) size
Velocity: speed of input / output
Variety: lots of data sources
Value: not the SIZE of your data, but what you can DO with it!
5. OK, so you’ve decided to put data in Hadoop...
Now what?
Dr. Phil Shelley
CEO – MetaScale
CTO Sears Holdings
7. Where Did We Start?
Issues with meeting production schedules
Multiple copies of data, no single point of truth
ETL complexity, cost of software and cost to manage
Time taken to set up ETL data sources for projects
Latency in data, up to weeks in some cases
Enterprise Data Warehouses unable to handle load
Mainframe workload over-consuming capacity
IT Budgets not growing – BUT data volumes escalating
8. Why Hadoop? (diagram contrasting traditional databases & warehouses with Hadoop)
10. Enterprise Integration
Data Sourcing
– Connecting to legacy source systems
– Loaders and tools (speed considerations)
– Batch or near-real time
Enterprise Data Model
– Establish a model and enterprise data strategy early
Data Transformations
– The end of ETL as we know it
Data Re-use
– Drive re-use of data
– A single point of truth is now a possibility
Data Consumption and User Interaction
– Consume data in place wherever possible
– Move data only if you have to
– Exporting to legacy systems can be done, but it duplicates data
– Loaders and tools (speed considerations)
– How will your users interact with the data?
11. Rethink Everything
The way you capture data
The way you store data
The structure of your data
The way you analyze data
The costs of data storage
The size of your data
What you can analyze
The speed of analysis
The skills of your team
The way users interact with data
12. The Learning from our Journey
• Big Data tools are here and ready for the Enterprise
• An Enterprise Data Architecture model is essential
• Hadoop can handle Enterprise workload
To reduce strain on legacy platforms
To reduce cost
To bring new business opportunities
• Must be part of an overall data strategy
• Not to be underestimated
• The solution must be an Eco-System
There has to be a simple way to consume the data
13. Hadoop Strengths & Weaknesses?
• Cost effective platform
• Powerful / fast data processing environment
• Good at standard reporting
• Flexibility: Programmable, Any data type
• Huge scalability
• Barriers to entry: lots of engineering and coding
• High on-going coding requirements
• Difficult to access with standard BI/analytical tools
• Ad hoc complex analytics difficult
• Too slow for interactive analytics
15. What is an "In-memory" Analytical Platform?
• A DBMS where all of the data of interest, or specific portions of it, has been permanently pre-loaded into random access memory (RAM)
• Not a large cache
– Data is held in structures that take advantage of the properties of RAM, NOT copies of frequently used disk blocks
– The database's query optimiser knows at all times exactly which data is in memory (and which is not)
16. In-Memory Analytical Database Management
Not a large cache:
• No disk access during query execution
– Temporary tables in RAM
– Result sets in RAM
• In-memory means in high-speed RAM
– NOT slow flash-based SSDs that mimic mechanical disks
For more information:
• Gartner: "Who's Who in In-Memory DBMSs", Roxanne Edjlali, Donald Feinberg, 10 Sept 2012, www.gartner.com/id=2151315
17. Why In-memory: RAM is Faster Than Disk (Really!)
Actually, this is only part of the story
• Analytics completely change the workload characteristics on the database
• Simple reporting and transactional processing is all about "filtering" the data of interest
• Analytics is all about complex "crunching" of the data once it is filtered
• Crunching needs processing power and consumes CPU cycles
• Storing data on physical disks severely limits the rate at which data can be provided to the CPUs
• Accessing data directly from RAM allows much more CPU power to be deployed
18. Analytics is about "CRUNCHING" through Data
• Crunching is CPU cycle-intensive and CPU-bound: joins, aggregations, sorts, grouping, analytical functions
• Its purpose is to understand what is happening in the data
• The more complex the analytics, the more pronounced this becomes
• Analytical platforms are therefore CPU-bound
– Assuming disk I/O speeds are not a bottleneck
– In-memory removes the disk I/O bottleneck
19. For Analytics, the CPU is King
• The key metric of any analytical platform should be GB/CPU
– It needs to effectively utilize all available cores
– Hyper threads are NOT the equivalent of cores
• Interactive/ad hoc analytics:
– THINK data to core ratios ≈ 10GB data per CPU core
• Every cycle is precious – CPU cores need to be used efficiently
– Techniques such as “dynamic machine code generation”
Careful – performance impact of compression:
Makes disk-based databases go faster
Makes in-memory databases go slower
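To make the data-to-core guideline above concrete, here is a rough back-of-the-envelope sizing sketch in Python. All of the input numbers (5 TB of data of interest, 16 cores and 256 GB of RAM per server) are illustrative assumptions, not Kognitio sizing figures.

# Back-of-the-envelope sizing for an in-memory analytical layer,
# using the ~10 GB of data per CPU core guideline from the slide.
data_of_interest_gb = 5000        # assumed: 5 TB to hold in memory
gb_per_core = 10                  # guideline from the slide
cores_per_server = 16             # assumed commodity x86 server
ram_per_server_gb = 256           # assumed RAM per server

cores_needed = data_of_interest_gb / gb_per_core            # 500 cores
servers_for_cpu = cores_needed / cores_per_server           # ~31 servers
servers_for_ram = data_of_interest_gb / ram_per_server_gb   # ~20 servers

# With these assumptions CPU, not RAM, is the binding constraint - "the CPU is King"
print(cores_needed, servers_for_cpu, servers_for_ram)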
20. Speed & Scale are the Requirements
• Memory & CPU on an individual server = NOWHERE near enough for big data
– Moore's Law: the power of a processor doubles every two years
– Data volumes: double every year!!
• The only way to keep up is to parallelise, or scale out
• Combine the RAM of many individual servers: many CPU cores spread across many CPUs, housed in many individual computers
– Data is split across all the CPU cores
– All database operations need to be parallelised with no points of serialisation; this is true MPP
• Every CPU core in every server needs to be efficiently involved in every query
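The scale-out idea can be sketched in a few lines of Python: rows are split across worker processes, each worker computes a partial aggregate on its own slice, and the partial results are merged at the end, so no single process scans all the data. This is only a toy model of the MPP pattern described above, not Kognitio's implementation; the sample data and the choice of four workers are assumptions.

# Toy MPP-style parallel aggregation: split rows across workers,
# aggregate each slice independently, then merge the partial results.
from collections import Counter
from multiprocessing import Pool

def partial_total(rows):
    # each worker ("CPU core") aggregates only its own slice of the data
    totals = Counter()
    for region, amount in rows:
        totals[region] += amount
    return totals

if __name__ == "__main__":
    data = [("east", 10), ("west", 5), ("east", 7), ("north", 3)] * 1000
    n_workers = 4
    slices = [data[i::n_workers] for i in range(n_workers)]   # partition the rows
    with Pool(n_workers) as pool:
        partials = pool.map(partial_total, slices)
    grand_total = sum(partials, Counter())                    # merge partial aggregates
    print(grand_total.most_common())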
21. Hadoop Connectivity
Kognitio - External Tables
– Data held on disk in other systems can be seen as non-memory-resident tables by Kognitio users
– Users can select which data they wish to "suck" into memory
• Using GUI or scripts
– Kognitio seamlessly sucks data out of the source system into Kognitio memory
– All managed via SQL
Kognitio - Hadoop Connectors
– Two types
• HDFS Connector
• Filter Agent Connector
– Designed for high speed
• Multiple parallel load streams
• Demonstrable 14TB+/hour load rates
22. Tight Hadoop Integration
HDFS Connector
• Connector defines access to the HDFS file system
• External table accesses row-based data in HDFS
• Dynamic access, or "pin" data into memory
• The complete HDFS file is loaded into memory
Filter Agent Connector
• Connector uploads an agent to the Hadoop nodes
• Query passes selections and relevant predicates to the agent
• Data filtering and projection take place locally on each Hadoop node
• Only data of interest is loaded into memory, via parallel load streams
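The difference between the two connectors can be sketched with a small, purely illustrative Python example: the HDFS-connector path ships every row and loads the complete file, while the filter-agent path applies the selection and projection on the Hadoop node, so only the rows and columns of interest cross the network. The function and field names here are made up for illustration; they are not Kognitio APIs.

# Illustrative contrast between loading a complete HDFS file and
# filtering/projecting locally on the Hadoop node (filter agent style).
def load_complete_file(node_rows):
    # HDFS connector style: every row is shipped and loaded into memory
    return list(node_rows)

def filter_agent(node_rows, predicate, columns):
    # filter agent style: selection and projection run where the data lives,
    # so only the data of interest crosses the network
    return [tuple(row[c] for c in columns) for row in node_rows if predicate(row)]

if __name__ == "__main__":
    node_rows = [{"store": i % 50, "sales": float(i), "comment": "x" * 100}
                 for i in range(10000)]
    everything = load_complete_file(node_rows)
    of_interest = filter_agent(node_rows, lambda r: r["store"] == 7, ("store", "sales"))
    print(len(everything), len(of_interest))   # 10000 rows shipped vs 200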
23. Not Only SQL
Kognitio V8 External Scripts
– Run third-party scripts embedded within SQL
• Perl, Python, Java, R, SAS, etc.
• One-to-many rows in, zero-to-many rows out, one-to-one
Example: this reads long comment text from a customer enquiry table; the in-line Perl converts the long text into an output stream of words (one word per row), and the query selects the top 1000 words by frequency using standard SQL aggregation.
create interpreter perlinterp
command '/usr/bin/perl' sends 'csv' receives 'csv';
select top 1000 words, count(*)
from (external script using environment perlinterp
receives (txt varchar(32000))
sends (words varchar(100))
script S'endofperl(
while(<>)
{
chomp();
s/[,.!_]//g;
foreach $c (split(/ /))
{ if($c =~ /^[a-zA-Z]+$/) { print "$c\n" } }
}
)endofperl'
from (select comments from customer_enquiry)) dt
group by 1
order by 2 desc;
24. Hardware Requirements for In-memory Platforms
• Hadoop = industry-standard servers
• Careful to avoid vendor lock-in
• Off-the-shelf, low-cost servers match neatly with Hadoop
– Intel or AMD CPU (x86)
– No special components
• Ethernet network
• Standard OS
25. Benefits of an In-memory Analytical Platform
• A seamless in-memory analytical layer on top of your data persistence layer(s):
– Analytical queries that used to run in hours and minutes now run in minutes and seconds (often sub-second)
– High query throughput = massively higher concurrency
• Flexibility
– Enables greater query complexity
– Users freely interact with data
– Use preferred BI tools (relational or OLAP)
• Reduced complexity
– Administration de-skilled
– Reduced data duplication
26. The Learning from our Journey
• Big Data tools are here and ready for the Enterprise
• An Enterprise Data Architecture model is essential
• Hadoop can handle Enterprise workload
To reduce strain on legacy platforms
To reduce cost
To bring new business opportunities
• Must be part of an overall data strategy
• Not to be underestimated
• The solution must be an Eco-System
There has to be a simple way to consume the data
27. Connect / Contact
Connect:
www.kognitio.com
kognitio.com/blog
twitter.com/kognitio
linkedin.com/companies/kognitio
facebook.com/kognitio
youtube.com/user/kognitio
Upcoming Web Briefings: kognitio.com/briefings
Contact:
Michael Hiskey, Vice President, Marketing & Business Development
Michael.hiskey@kognitio.com
Phone: +1 (855) KOGNITIO
Dr. Phil Shelley, CEO, MetaScale and CTO, Sears Holdings