Webinar: Make Big Data Easy with the Right Tools and Talent
MetaScale Expertise and Kognitio Analytics Accelerate Hadoop for Organizations Large and Small


October 2012
Today’s webinar



• 45 minutes of presentation, with 15 minutes of Q&A
• We will email you a link to the slides
• Feel free to use the Q&A feature
Agenda

• Opening introduction
• MetaScale expertise
   – Case study: Sears Holdings
• Kognitio Analytics
   – Hadoop acceleration explained
• Summary
• Q&A

Presenters

Dr. Phil Shelley
CEO, MetaScale; CTO, Sears Holdings

Roger Gaskell
CTO, Kognitio

Host

Michael Hiskey
VP Marketing & Business Development, Kognitio
Big Data ≠ Hadoop

Big Data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.

 Volume – not only size
 Velocity – speed of input/output
 Variety – lots of data sources
 Value – not the SIZE of your data, but what you can DO with it!
OK, so you’ve decided to put data in Hadoop...

                        Now what?




   Dr. Phil Shelley
CEO, MetaScale
CTO, Sears Holdings
Where Did We Start at Sears?
 Issues with meeting production schedules
 Multiple copies of data, no single point of truth
 ETL complexity, cost of software and cost to manage
 Time taken to set up ETL data sources for projects
 Latency in data, up to weeks in some cases
 Enterprise Data Warehouses unable to handle the load
 Mainframe workload over-consuming capacity
 IT budgets not growing – BUT data volumes escalating
Why Hadoop?

[Diagram: traditional databases & warehouses contrasted with Hadoop]
An Ecosystem
Enterprise Integration

 Data Sourcing
    Connecting to legacy source systems
    Loaders and tools (speed considerations)
    Batch or near-real-time

 Enterprise Data Model
    Establish a model and an enterprise data strategy early

 Data Transformations
    The end of ETL as we know it

 Data Re-use
    Drive re-use of data
    A single point of truth is now a possibility

 Data Consumption and User Interaction
    Consume data in place wherever possible
    Move data only if you have to
    Exporting to legacy systems can be done, but it duplicates data
    Loaders and tools (speed considerations)
    How will your users interact with the data?
Rethink Everything

  The way you capture data
  The way you store data
  The structure of your data
  The way you analyze data
  The costs of data storage
  The size of your data
  What you can analyze
  The speed of analysis
  The skills of your team
  The way users interact with data
Lessons from Our Journey

• Big Data tools are here and ready for the Enterprise
• An Enterprise Data Architecture model is essential
• Hadoop can handle Enterprise workloads
    To reduce strain on legacy platforms
    To reduce cost
    To bring new business opportunities
• It must be part of an overall data strategy
• The effort is not to be underestimated
• The solution must be an ecosystem
    There has to be a simple way to consume the data
Hadoop Strengths & Weaknesses

Strengths:
• Cost-effective platform
• Powerful, fast data-processing environment
• Good at standard reporting
• Flexibility: programmable, any data type
• Huge scalability

Weaknesses:
• Barriers to entry: lots of engineering and coding
• High ongoing coding requirements
• Difficult to access with standard BI/analytical tools
• Ad hoc complex analytics difficult
• Too slow for interactive analytics
Reference Architecture
What is an “In-memory” Analytical Platform?

• A DBMS where all of the data of interest, or specific portions of it, has been permanently pre-loaded into random access memory (RAM)
• Not a large cache
   – Data is held in structures that take advantage of the properties of RAM – NOT copies of frequently used disk blocks
   – The database’s query optimiser knows at all times exactly which data is in memory (and which is not)
In-Memory Analytical Database Management

Not a large cache:
• No disk access during query execution
   – Temporary tables in RAM
   – Result sets in RAM
• In-memory means in high-speed RAM
   – NOT slow flash-based SSDs that mimic mechanical disks

For more information:
• Gartner: “Who’s Who in In-Memory DBMSs”, Roxanne Edjlali and Donald Feinberg, 10 Sept 2012, www.gartner.com/id=2151315
Why In-memory: RAM is Faster Than Disk (Really!)

Actually, this is only part of the story:

 Workload – analytics completely changes the workload characteristics on the database
 Filtering – simple reporting and transactional processing is all about “filtering” the data of interest
 Crunching – analytics is all about complex “crunching” of the data once it is filtered
 CPU cycles – crunching needs processing power and consumes CPU cycles
 Storing – storing data on physical disks severely limits the rate at which data can be provided to the CPUs
 Access – accessing data directly from RAM allows much more CPU power to be deployed
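To make the filtering/crunching distinction concrete, compare the two illustrative queries below – a hedged sketch whose schema (orders, customers) is invented for this example rather than taken from the deck:

   -- A “filtering” workload: a simple report that just selects the rows
   -- of interest; per-row work is cheap and the cost is dominated by I/O.
   select order_id, amount
   from orders
   where order_date = date '2012-10-01';

   -- A “crunching” workload: a join, an aggregation and an analytical
   -- function; once the rows are filtered, the cost is dominated by CPU.
   select c.region,
          sum(o.amount)                             as total_amount,
          rank() over (order by sum(o.amount) desc) as region_rank
   from orders o
   join customers c on c.customer_id = o.customer_id
   group by c.region;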
Analytics is about “CRUNCHING” through Data

• Analytics is CPU cycle-intensive and CPU-bound: joins, aggregations, sorts, grouping and analytical functions crunch through the data
• The goal is to understand what is happening in the data – the more complex the analytics, the more pronounced this becomes
• Analytical platforms are therefore CPU-bound
   – Assuming disk I/O speeds are not a bottleneck
   – In-memory removes the disk I/O bottleneck
For Analytics, the CPU is King

• The key metric of any analytical platform should be GB of data per CPU
   – It needs to effectively utilize all available cores
   – Hyper-threads are NOT the equivalent of cores
• Interactive/ad hoc analytics:
   – THINK data-to-core ratios ≈ 10GB of data per CPU core
• Every cycle is precious – CPU cores need to be used efficiently
   – Techniques such as “dynamic machine code generation”

Careful – the performance impact of compression:
 Compression makes disk-based databases go faster
 Compression makes in-memory databases go slower
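As a rough worked example of the ~10GB-per-core rule of thumb, the arithmetic below can be run as plain SQL; the figures (5TB of data held in memory, 16 cores per server) are illustrative assumptions, not numbers from the deck:

   -- Illustrative sizing arithmetic for an interactive in-memory platform.
   select 5 * 1024                        as data_gb,        -- 5TB of data, expressed in GB
          (5 * 1024) / 10                 as cores_needed,   -- at ~10GB per core: 512 cores
          ceiling((5 * 1024) / 10.0 / 16) as servers_needed  -- at 16 cores per server: 32 servers
   ;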
Speed & Scale are the Requirements

• Memory & CPU on an individual server = NOWHERE near enough for big data
   – Moore’s Law: the power of a processor doubles every two years
   – Data volumes double every year!
• The only way to keep up is to parallelise, or scale out:
   – Combine the RAM of many individual servers: many CPU cores, spread across many CPUs, housed in many individual computers
   – Data is split across all the CPU cores
   – All database operations need to be parallelised, with no points of serialisation – this is true MPP
• Every CPU core in every server needs to be efficiently involved in every query
Hadoop Connectivity

Kognitio external tables:
   – Data held on disk in other systems can be seen as non-memory-resident tables by Kognitio users
   – Users can select which data they wish to “suck” into memory
      • Using a GUI or scripts
   – Kognitio seamlessly pulls the data out of the source system into Kognitio memory
   – All managed via SQL (see the sketch below)

Kognitio Hadoop connectors:
   – Two types:
      • HDFS Connector
      • Filter Agent Connector
   – Designed for high speed:
      • Multiple parallel load streams
      • Demonstrable 14TB+/hour load rates
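A minimal sketch of the external-table workflow from a user’s point of view. The syntax is hypothetical – the deck does not show Kognitio DDL – and the object names (hdfs_conn, sales_ext, sales_mem) are invented for illustration:

   -- Hypothetical sketch, not actual Kognitio DDL: expose HDFS data as an
   -- external (non-memory-resident) table via a pre-defined connector...
   create external table sales_ext (
       sale_date  date,
       store_id   integer,
       revenue    decimal(12,2)
   ) from connector hdfs_conn;

   -- ...then “suck” just the rows of interest into an in-memory table,
   -- all through ordinary SQL.
   create table sales_mem as
   select *
   from sales_ext
   where sale_date >= date '2012-01-01';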
Tight Hadoop Integration

HDFS Connector
• Connector defines access to the HDFS file system
• External table accesses row-based data in HDFS
• Dynamic access, or “pin” data into memory
• The complete HDFS file is loaded into memory

Filter Agent Connector
• Connector uploads an agent to the Hadoop nodes
• The query passes selections and relevant predicates to the agent
• Data filtering and projection take place locally on each Hadoop node
• Only the data of interest is loaded into memory, via parallel load streams
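For example, with the Filter Agent Connector, a query like the hedged sketch below (reusing the invented sales_ext table from the previous sketch) would have its predicate evaluated by the agent on each Hadoop node, so only qualifying rows are streamed into memory:

   -- Hypothetical illustration of predicate push-down: the WHERE clause is
   -- applied locally on the Hadoop nodes; only matching rows cross the network.
   select store_id,
          sum(revenue) as october_revenue
   from sales_ext
   where sale_date between date '2012-10-01' and date '2012-10-31'
   group by store_id;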
Not Only SQL

Kognitio V8 external scripts:
   – Run third-party scripts embedded within SQL
      • Perl, Python, Java, R, SAS, etc.
      • One-to-many rows in, zero-to-many rows out, or one-to-one

create interpreter perlinterp
  command '/usr/bin/perl' sends 'csv' receives 'csv';

select top 1000 words, count(*)
  from (external script using environment perlinterp
        receives (txt varchar(32000))
        sends (words varchar(100))
        script S'endofperl(
           while(<>)
           {
              chomp();                                      # drop the trailing newline
              s/[,.!_]//g;                                  # strip punctuation
              foreach $c (split(/ /))
              { if($c =~ /^[a-zA-Z]+$/) { print "$c\n" } }  # one word per output row
           }
        )endofperl'
        from (select comments from customer_enquiry)) dt
group by 1
order by 2 desc;

This reads long comment text from the customer enquiry table; the inline Perl converts each long text into an output stream of words (one word per row), and the query then selects the top 1000 words by frequency using standard SQL aggregation.
Hardware Requirements for In-memory Platforms

• Hadoop = industry-standard servers
• Be careful to avoid vendor lock-in
• Off-the-shelf, low-cost servers match neatly with Hadoop
   – Intel or AMD CPUs (x86)
   – No special components
• Ethernet network
• Standard OS
Benefits of an In-memory Analytical Platform

• A seamless in-memory analytical layer on top of your data persistence layer(s):

 Analytical queries that used to run in hours or minutes now run in minutes or seconds (often sub-second)

 High query throughput = massively higher concurrency

 Flexibility
   • Enables greater query complexity
   • Users freely interact with data
   • Use preferred BI tools (relational or OLAP)

 Reduced complexity
   • Administration de-skilled
   • Reduced data duplication
Lessons from Our Journey

• Big Data tools are here and ready for the Enterprise
• An Enterprise Data Architecture model is essential
• Hadoop can handle Enterprise workloads
    To reduce strain on legacy platforms
    To reduce cost
    To bring new business opportunities
• It must be part of an overall data strategy
• The effort is not to be underestimated
• The solution must be an ecosystem
    There has to be a simple way to consume the data
Connect

www.kognitio.com
kognitio.com/blog
twitter.com/kognitio
linkedin.com/companies/kognitio
facebook.com/kognitio
youtube.com/user/kognitio

Contact

Michael Hiskey
Vice President, Marketing & Business Development
Michael.hiskey@kognitio.com
Phone: +1 (855) KOGNITIO

Dr. Phil Shelley
CEO, MetaScale
CTO, Sears Holdings

Upcoming Web Briefings: kognitio.com/briefings
