The Big Data Ecosystem
Talend & Caserta Concepts Webinar

Ciaran Dynes
Director, Product Management & Product Marketing, Talend

Joe Caserta
Founder & President, Caserta Concepts
Integration at Any Scale
Talend is the only integration vendor that enables
your business to scale through:

 An open source-based solution supported by
 a vast community and enterprise-class services

 An innovative, unified platform that scales data,
 application and business processes of any complexity

  A usage-based subscription model delivering
  a fast return on investment
Talend - Integration at Any Scale

Talend offers true
scalability for
• Any integration challenge
• Any data volume
• Any project size

Talend enables integration convergence
Working with Leading Vendors

[Partner logos grouped by category: Platforms/Hadoop, Appliance, NoSQL, Data Management, Analytics, System Integrators]

System Integrators play a vital role in providing expertise
Joe Caserta Timeline
 • Partnered with Big Data vendors: Cloudera, HortonWorks, Datameer, more…
 • Laser focus on Big Data solutions for Financial Sector & eCommerce
 • 2010: Formalized Talend Alliance Partnership – System Integrators
 • 2009: Launched Big Data practice
 • 2004: Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
 • Launched Training practice, teaching data concepts world-wide
 • 2001: Founded Caserta Concepts in NYC
 • Web log analytics solution published in Intelligent Enterprise
 • Began consulting career as programmer/data modeler
 • Dedicated to Data Warehousing, Business Intelligence since 1996
 • 25+ years hands-on experience building database solutions
Caserta Concepts
• Technology services company with expertise in data analysis:
  • Data Management
  • Big Data & Analytics

• With core focus in the following industries:
  • Financial Services
  • Insurance / Healthcare
  • eCommerce / Higher Education

• Established in 2001:
  • Increased growth year-over-year
  • Industry recognized work force
  • Consulting, Writing, Education
Expertise & Offerings
 • Strategic Roadmap/Assessment/Consulting
 • Big Data Analytics
 • Data Warehousing/ETL/Data Integration
 • BI/Visualization/Analytics
 • Master Data Management
Client Portfolio
 • Finance & Insurance
 • Retail/eCommerce & Manufacturing
 • Education & Services
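The Good Old Days: Traditional Data Warehousing

[Diagram: sources (web logs, external data sources, relational systems/ERP, legacy systems) → extract, transform, optimized load → data warehouse, MDD/OLAP and data marts → standard reports, ad-hoc query tools, data mining, analytical applications. Also shown: metadata and closed-loop feedback applications. Label under the data marts: “(The data warehouse?)”]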
What is “Big Data”?
• A collection of data sets so large and complex that
 it becomes difficult to process using on-hand
 database management tools or traditional data
 processing applications.

• Challenges include capture, storage, search,
 sharing, transfer, analysis, and visualization.

• Relational databases were designed for
 applications; we use only a small fraction of their
 capabilities in analytics applications.

• Enforcing a relational structure upon our data is
 not always what we want.
What’s the Difference?
Traditional Data                                       | Big Data
Very accurate transactional data. Analyzed by humans   | Lots of data with value that can only be attained by deep analytics
Measured in terabytes                                  | Measured in petabytes
Structured data                                        | Structured/Unstructured data
Input by human “system users”                          | Created by everybody, plus all of our machine friends
Oracle, SAP, etc.                                      | Open source, Hadoop
HW/SW investment measured in $10M                      | HW/SW investment measured in $10K
Recording facts                                        | Harvesting insights
Try to keep up: This slide is already obsolete
So where does the data warehouse come in?
 • Will Big Data replace the data warehouse?
   • Yes – however there is much evolution ahead: real time
     integrations, interactive queries

 • Data Warehousing principles still apply to Big Data
   • Data Quality
   • Master Data
   • Data architecture

 • How do we leverage our existing investment?
Enterprise Technical Ecosystem

[Diagram: Traditional BI (ERP, Finance and Legacy sources → ETL → Traditional EDW → ad-hoc/canned reporting) alongside a Big Data Cluster (NoSQL database: Cassandra; search/data analytics; Mahout, MapReduce, Pig/Hive; HDFS nodes N1–N5; “Horizontally Scalable Environment - Optimized for Analytics”) feeding Big Data BI and canned reporting]
Extending EDW with Hadoop

• Eliminate the barrier of imposing relational structure on data.

• Storage is fast, durable and cheap: don’t throw away data that
can be valuable in the future

• Processing power
  • Hadoop scales roughly linearly as nodes are added, so don’t worry
    about the data set getting too big

• Machine learning

• Ad-hoc reporting by non-technical users requires traditional
methods or an additional application
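
The processing-power point rests on the map/shuffle/reduce model: each node works on its own partition, and only grouped intermediate results are combined. The following is a minimal single-process sketch of that model, not Hadoop code; the log records, the three "partitions", and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(partition):
    """Emit a (page, 1) pair for every record in one node's partition."""
    return [(record["page"], 1) for record in partition]

def shuffle(mapped_outputs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical web-log records, split across three "nodes".
partitions = [
    [{"page": "/home"}, {"page": "/cart"}],
    [{"page": "/home"}],
    [{"page": "/cart"}, {"page": "/home"}, {"page": "/checkout"}],
]

page_hits = reduce_phase(shuffle(map_phase(p) for p in partitions))
print(page_hits)
```

Because the map step touches one partition at a time, a larger data set can be handled by adding partitions (and nodes) rather than by changing the job.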
Design Pattern #1: Hadoop Staging/Warehouse
feed relational EDW (Composite Warehouse)
 • Hadoop serves as the staging ground for all data
     - Eliminate the barrier of imposing relational structure on data.
     - Storage is fast, durable and cheap: don’t throw away data that can be
       valuable in the future

 • Data scientists work in the Hadoop environment to analyze and mine structured
   and unstructured data using Pig, Hive, and Mahout (machine learning)

 • Data required for interactive reporting and traditional ad-hoc analysis is sent to
   the downstream relational EDW
 [Diagram: Source Systems → Hadoop cluster (Mahout, MapReduce, Pig/Hive on HDFS nodes N1–N5) → Traditional DW]
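
One way to picture this pattern is the small sketch below. It is an illustration under stated assumptions, not the presenters' implementation: a Python list stands in for raw files staged on HDFS, an in-memory SQLite database stands in for the relational EDW, and the event fields and table name are hypothetical.

```python
import json
import sqlite3
from collections import Counter

# Hypothetical raw events, kept exactly as they arrived (no schema imposed).
RAW_STAGING = [
    '{"user": "u1", "action": "view", "sku": "A100"}',
    '{"user": "u2", "action": "purchase", "sku": "A100", "amount": 25.0}',
    '{"user": "u1", "action": "purchase", "sku": "B200", "amount": 40.0}',
    'not valid json, but still retained in staging',
]

def parse_staged(lines):
    """Schema-on-read: interpret records only when a job needs them."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines are skipped here but remain in staging

def summarize_sales(events):
    """Aggregate purchase amounts per SKU (the cluster-side job)."""
    totals = Counter()
    for event in events:
        if event.get("action") == "purchase":
            totals[event["sku"]] += event.get("amount", 0.0)
    return dict(totals)

def load_to_edw(conn, totals):
    """Send only the reporting-ready summary to the relational warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_sku (sku TEXT PRIMARY KEY, revenue REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales_by_sku VALUES (?, ?)", totals.items())
    conn.commit()

edw = sqlite3.connect(":memory:")  # stand-in for the downstream EDW
load_to_edw(edw, summarize_sales(parse_staged(RAW_STAGING)))
rows = edw.execute("SELECT sku, revenue FROM sales_by_sku ORDER BY sku").fetchall()
print(rows)
```

The raw lines are never rewritten or discarded; only the summarized result crosses into the relational store, where ad-hoc SQL is available.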
Design Pattern #2: NoSQL Enhanced EDW
 • Not all structured data lends itself to being stored relationally:
    • Relationships: Graph Databases
    • Sparse Data: Columnar Databases

 • Very Large Datasets:
    • NoSQL databases are capable of scaling far beyond relational databases while
      maintaining performance
    • Ultra-performance key-value stores and columnar databases can be very useful in
      storing certain types of high-volume data for analytic purposes
    • Just don’t expect the ad-hoc flexibility of a relational database!

 [Diagram: Hadoop cluster (Mahout, MapReduce, Pig/Hive on HDFS nodes N1–N5) and the Traditional DW, extended with Cassandra (columnar: web analytics, ad impressions) and Titan (graph: networks, recommender, path optimization)]
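
The two data shapes named in this pattern can be sketched with plain Python structures. This is only an illustration of the storage idea, not Cassandra or graph-database code; the row keys, column names, and the `follows` graph are hypothetical.

```python
from collections import deque

# Sparse data: a column-family style layout stores only the attributes present,
# so rows with few attributes carry no empty columns.
ad_impressions = {
    "imp-001": {"campaign": "spring", "device": "mobile"},
    "imp-002": {"campaign": "spring", "referrer": "search", "geo": "NYC"},
    "imp-003": {"device": "tablet"},
}

def column_values(rows, column):
    """Read one column across rows, skipping rows that never stored it."""
    return {key: cols[column] for key, cols in rows.items() if column in cols}

# Relationships: an adjacency list makes "who is connected to whom" a direct lookup.
follows = {
    "ann": ["bob", "cal"],
    "bob": ["dee"],
    "cal": ["dee"],
    "dee": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest hop path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(column_values(ad_impressions, "device"))
print(shortest_path(follows, "ann", "dee"))
```

In a relational table the sparse rows would need mostly-null columns, and the path question would need repeated self-joins; both are reasons the slide suggests a different store for these cases.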
Design Pattern #3: Add analytics to your NoSQL cluster
  • If your application is already based on a NoSQL technology, consider
    building an analytic site.
  • The analytic site is constantly streamed fresh transactions, leveraging
    Cassandra's native replication.
  • Aggregates and analytic views are materialized with Pig/Hive MapReduce jobs.
    Since the work is done on the cluster, no load is placed on the applications.
    This analytic data is in turn replicated throughout the cluster.

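 [Diagram: application sites (Site 1, Site 2) on Cassandra replicate to an Analytics Site running Cassandra with Pig/Hive MapReduce, which feeds canned reporting and the Traditional DW]

 Remember, NoSQL schemas are “optimized to a query”, not ad-hoc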
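
The phrase “optimized to a query” can be illustrated with a small sketch: a batch job materializes exactly the view a canned report needs, and the report then becomes a key lookup. This assumes nothing about Cassandra's API; the transaction records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical application transactions, as replicated to the analytic site.
transactions = [
    {"day": "2012-11-01", "sku": "A100", "amount": 25.0},
    {"day": "2012-11-01", "sku": "B200", "amount": 40.0},
    {"day": "2012-11-02", "sku": "A100", "amount": 25.0},
]

def materialize_daily_revenue(rows):
    """Batch job: precompute the one answer the report will ask for."""
    view = defaultdict(float)
    for row in rows:
        view[row["day"]] += row["amount"]
    return dict(view)

daily_revenue = materialize_daily_revenue(transactions)

def revenue_for_day(day):
    """Canned report: a single key lookup, with no scan of raw transactions."""
    return daily_revenue.get(day, 0.0)

print(revenue_for_day("2012-11-01"))
```

The view answers "revenue by day" cheaply, but a new question such as "revenue by SKU" needs another materialized view, which is the ad-hoc limitation the slide warns about.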
Emerging Tools

 Hive, although an excellent tool for data
 analysis, is too slow for interactive
 queries. Recent projects have increased
 speed dramatically, by 10-100x.

 •   Google Dremel
 •   Apache/MapR Drill
 •   Hortonworks Stinger
 •   Cloudera Impala
Commonly Used Technologies
• Amazon Elastic MapReduce (EMR): Web service to access EC2/S3,
pay-as-you-go hosted Hadoop infrastructure

• Hadoop Distribution: Cloudera; MapR; Hortonworks
• Apache Projects
    • Whirr: Used to launch/kill computing clusters
    • Kafka: Publish-subscribe messaging system
    • Mahout: Distributed machine learning
    • Hive: Map data to structures and use SQL-like queries
    • HBase: NoSQL/non-relational database, real-time read/write
    • Cassandra: Like HBase, but with no single point of failure
    • Chukwa/Flume: Large-scale log collection
    • Pig: Procedural programming language, from Yahoo
    • Sqoop: “SQL-to-Hadoop”, like BCP for Hadoop
    • ZooKeeper: Used to manage & administer Hadoop
    • Solr: Full-text/Faceted Search
• MongoDB: Document-oriented database (not an Apache project)
• Languages: Python (with SciPy), Java
Leading Vendors (According to Joe)

[Vendor logos grouped by category: Hadoop, NoSQL, Analytics, Data Management]
Parting Thought

 Polyglot Persistence – “where any decent sized
 enterprise will have a variety of different data storage
 technologies for different kinds of data. There will still
 be large amounts of it managed in relational stores,
 but increasingly we'll be first asking how we want to
 manipulate the data and only then figuring out what
 technology is the best bet for it.”
                                      -- Martin Fowler
Please ask your questions now using the Q&A panel

➜    Recording will be made available on

➜    Request a copy of the slides

➜    Contact Talend Sales
       • Email:
       • Phone: 714.786.8140

➜    Contact Caserta Concepts
       • Joe Caserta, President
       • Email:
       • Phone: 855.755.2246 x227

© Talend 2012

Introducing the Big Data Ecosystem with Caserta Concepts & Talend

  • 1. The Big Data Ecosystem Talend & Caserta Concepts Webinar Ciaran Dynes Director, Product Management & Product Marketing, Talend Joe Caserta Founder & President, Caserta Concepts
  • 2. Integration at Any Scale Talend is the only integration vendor that enables your business to scale through: An open source-based solution supported by a vast community and enterprise-class services An innovative, unified platform that scales data, application and business processes of any complexity A usage-based subscription model delivering $ a fast return on investment
  • 3. Talend - Integration at Any Scale Talend offers true scalability for • Any integration challenge • Any data volume • Any project size Talend enables integration convergence
  • 4. Working with Leading Vendors Platforms/Hadoop Appliance NoSQL Data Management Analytics System Integrators System Integrators play a vital role in providing expertise
  • 5. The Big Data Ecosystem Talend & Caserta Concepts Webinar Joe Caserta Founder & President, Caserta Concepts Ciaran Dynes Director, Product Management & Product Marketing, Talend
  • 6. Joe Caserta Timeline 2012 Partnered with Big Data vendors Laser focus on Big Data solutions for Cloudera, HortonWorks, Datameer, Financial Sector & eCommerce more… 2010 Formalized Talend Alliance 2009 Partnership – System Integrators Launched Big Data practice 2004 Co-author, with Ralph Kimball, The Launched Training practice, teaching Data Warehouse ETL Toolkit (Wiley) data concepts world-wide 2001 Web log analytics solution published Founded Caserta Concepts in NYC in Intelligent Enterprise 1996 Began consulting career as Dedicated to Data Warehousing, programmer/data modeler Business Intelligence since 1996 1986 25+ years hands-on experience building database solutions
  • 7. Caserta Concepts • Technology services company with expertise in data analysis: • Data Management • Big Data & Analytics • With core focus in the following industries: • Financial Services • Insurance / Healthcare • eCommerce / Higher Education • Established in 2001: • Increased growth year-over-year • Industry recognized work force • Consulting, Writing, Education
  • 8. Expertise & Offerings Strategic Roadmap/ Assessment/Consulting Big Data Analytics Data Warehousing/ ETL/Data Integration BI/Visualization/ Analytics Master Data Management
  • 9. Client Portfolio Finance & Insurance Retail/eCommerce & Manufacturing Education & Services
  • 10. The Good Old Days: Traditional Data Warehousing Metadata Standard Reports Web Logs Ad-hoc Query Tools External Extract Data Sources Optimized Load Transform Data Mining Data Warehouse Relational Systems/ERP MDD/OLAP Closed-loop Legacy feedback Analytical Applications Systems applications Data Marts (The data warehouse?)
  • 11. What is “Big Data”? • A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. • Challenges include capture, storage, search, sharing, transfer, analysis, and visualization. • Relational databases were designed for applications; we use only a small fraction of their capabilities in analytics applications. • Enforcing a relational structure upon our data is not always what we want.
  • 12. What’s the Difference? Traditional Data Big Data Very accurate transactional data. Lots of data with value that can Analyzed by humans only be attained by deep analytics Measured in terabytes Measured in petabytes Structured data Structured/Unstructured data Input by human “system users” Created by everybody, plus all of our machine friends Oracle, SAP, etc. Open source, Hadoop HW/SW investment measured in HW/SW investment measured in $10M $10K Recording facts Harvesting insights
  • 13. Try to keep up: This slide is already obsolete
  • 14. So where does the data warehouse come in? • Will Big Data replace the data warehouse? • Yes – however there is much evolution ahead: real time integrations, interactive queries • Data Warehousing principles still apply to Big Data • Data Quality • Master Data • Data architecture • How do we leverage our existing investment?
  • 15. Enterprise Technical Ecosystem Traditional BI ERP ETL Traditional EDW Finance Ad-Hoc/Canned ETL Reporting Legacy Big Data Cluster Big Data BI NoSQL Database Cassandra Search/Data Analytics Mahout MapReduce Pig/Hive N1 N2 N3 N4 N5 Hadoop Distributed File System (HDFS) Horizontally Scalable Environment - Optimized for Analytics Canned Reporting
  • 16. Extending EDW with Hadoop •Eliminate barrier of imposing relational structure on data. •Storage is fast, durable and cheap: Don’t throw away data that can be valuable in the future •Processing power • Hadoop scales linearly, don’t worry about the data set getting too big •Machine learning •Ad-Hoc reporting by non-technical users requires traditional methods or additional application
  • 17. Design Pattern #1: Hadoop Staging/Warehouse feed relational EDW (Composite Warehouse) • Hadoop serves as the staging ground for all data - Eliminate barrier of imposing relational structure on data. - Storage is fast, durable and cheap: Don’t throw away data that can be valuable in the future • Data scientists will work in the Hadoop environment to analyze, and mine structured and unstructured data using Pig, Hive, and Mahout (machine learning) • Data required for interactive reporting and traditional ad-hoc analysis is sent to downstream relational EDW Source Systems Mahout MapReduce Pig/Hive Traditional DW N1 N2 N3 N4 N5 Hadoop Distributed File System (HDFS)
  • 18. Design Pattern #2: NoSQL Enhanced EDW •Not all structured data lends itself to being stored relationally: • Relationships: Graph Databases • Sparse Data: Columnar Databases •Very Large Datasets: • NoSQL databases are capable of scaling far beyond relational databases while maintaining performance • Ultra-performance key value stores and columnar databases can be very useful in storing certain types of high volume data for analytic purposes • Just don’t expect the ad-hoc flexibility of a relational database! - Web analytics Mahout MapReduce Pig/Hive Cassandra - Ad Impressions (columnar) N1 N2 N3 N4 N5 Hadoop Distributed File System (HDFS) - Networks Titan - Recommender (graph) - Path optimization Traditional DW
  • 19. Design Pattern #3: Add analytics to your NoSQL cluster • If your application is already based on a NoSQL technology, consider building an analytic site. • The analytic site is constantly streamed fresh transactions, leveraging Cassandra's native replication • Aggregates and analytic views are materialized with Pig/Hive MapReduce jobs. Since the work is done on the cluster, no load is placed on the applications. This analytic data is in turn replicated throughout the cluster Site 1 Cassandra Pig/Hive Cassandra MapReduce Analytics Site Site 2 Canned Reporting Cassandra Traditional DW Remember, NoSQL schemas are “optimized to a query”, not ad-hoc
  • 20. Emerging Tools Hive, although an excellent tool for data analysis, is too slow for interactive queries. Recent projects have increased speed dramatically, by 10-100x. • Google Dremel • Apache/MapR Drill • Hortonworks Stinger • Cloudera Impala
  • 21. Commonly Used Technologies • Amazon Elastic MapReduce (EMR): Web service to access EC2/S3, pay-as-you-go hosted Hadoop infrastructure • Hadoop Distribution: Cloudera; MapR; Hortonworks • Apache Projects • Whirr: Used to launch/kill computing clusters • Kafka: Publish-subscribe messaging system • Mahout: Distributed machine learning • Hive: Map data to structures and use SQL-like queries • HBase: NoSQL/non-relational database, real-time read/write • Cassandra: Like HBase, but with no single point of failure • Chukwa/Flume: Large-scale log collection • Pig: Procedural programming language, from Yahoo • Sqoop: “SQL-to-Hadoop”, like BCP for Hadoop • ZooKeeper: Used to manage & administer Hadoop • Solr: Full-text/Faceted Search • MongoDB: Document-oriented database (not an Apache project) • Languages: Python (with SciPy), Java
  • 22. Leading Vendors (According to Joe) Hadoop NoSQL Analytics Data Management
  • 23. Parting Thought Polyglot Persistence – “where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” -- Martin Fowler
  • 24. Questions? Please ask your questions now using the Q&A panel
  • 25. Resources ➜ Recording will be made available on ➜ Request a copy of the slides ➜ Contact Talend Sales • Email: • Phone: 714.786.8140 ➜ Contact Caserta Concepts • Joe Caserta, President • Email: • Phone: 855.755.2246 x227 © Talend 2012

