Cascading Meetup #4

                       Cupertino, CA

                                       Copyright @2013, Concurrent, Inc.

Tuesday, 05 March 13                                                       1
Cascading Meetup






              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development

Enterprise Data Workflows
            Let’s consider an example app… at the front end
            LOB use cases drive demand for apps

            [Figure: example app data workflow; labels: Web, logs, Cache, trap tap, sink tap, Modeling, PMML, Cubes, customer profile DBs]

LOB use cases drive the demand for Big Data apps
Enterprise Data Workflows
             An example… in the back office
             Organizations have substantial investments
             in people, infrastructure, process

Enterprise organizations have seriously ginormous investments in existing back office practices:
people, infrastructure, processes
Enterprise Data Workflows
              An example… for the heavy lifting!
              “Main Street” firms are migrating
              workflows to Hadoop, for cost
              savings and scale-out

“Main Street” firms have invested in Hadoop to address Big Data needs,
offsetting their rising costs for Enterprise licenses from SAS, Teradata, etc.
Two Avenues…

             Enterprise: must contend with
             complexity at scale every day…
             incumbents extend current practices and
             infrastructure investments – using J2EE,
             ANSI SQL, SAS, etc. – to migrate
             workflows onto Apache Hadoop while
             leveraging existing staff

             Start-ups: crave complexity and
             scale to become viable…
             new ventures move into Enterprise space
             to compete using relatively lean staff,
             while leveraging sophisticated engineering
             practices, e.g., Cascalog and Scalding

             [Figure: plot with axes labeled scale ➞ and complexity ➞]

Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
Two Avenues…

              Hadoop almost never gets used
              in isolation; data workflows define
              the “glue” required for system
              integration of Enterprise apps

Hadoop is almost never used in isolation.
Enterprise data workflows are about system integration.
There are a couple of different ways to arrive at the party.
Cascading Meetup






              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development

Cascading workflows – ANSI SQL

               • collab with Optiq – industry-proven code base

               • ANSI SQL parser/optimizer atop Cascading
                   flow planner

               • JDBC driver to integrate into existing
                   tools and app servers

               • relational catalog over a collection
                   of unstructured data

               • SQL shell prompt to run queries

ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration.

Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
Cascading workflows – ANSI SQL

                                     Premise: most SQL in the world gets
                                     written by machines…
                                     This isn’t a database; this is about making
                                     machine-to-machine communications
                                     simpler and more robust at scale.

Cascading workflows – ANSI SQL

               • enable analysts without retraining
                   on Hadoop, etc.

               • transparency for Support, Ops,
                   Finance, et al.

             a language for queries – not a database,
             but ANSI SQL as a DSL for workflows

ANSI SQL – reviews
            Open Source 'Lingual' Helps SQL Devs Unlock Hadoop
            Thor Olavsrud, 2013-02-22

            Hadoop Apps Without MapReduce Mindsets
            Adrian Bridgwater, 2013-02-28

            Concurrent gives old SQL users new Hadoop tricks
            Jack Clark, 2013-02-20

            Concurrent Open Source Project Ties SQL to Hadoop
            Michael Vizard, 2013-02-21

            Concurrent Releases Lingual, a SQL DSL for Hadoop
            Boris Lublinsky, 2013-02-28

ANSI SQL – CSV data in local file system


The test database for MySQL is available for download from

Here we have a bunch o’ CSV flat files in a directory in the local file system.

Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
ANSI SQL – shell prompt, catalog


Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
ANSI SQL – queries


Here’s an example SQL query on that “employee” test database from MySQL.
ANSI SQL – layers

             abstraction      RDBMS                        JVM Cluster

             parser           ANSI SQL compliant parser    ANSI SQL compliant parser
             optimizer        logical plan, optimized      logical plan, optimized
                              based on stats               based on stats
             planner          physical plan                API “plumbing”
             machine data     query history, table stats   app history, tuple stats
             topology         b-trees, etc.                heterogenous, distributed:
                                                           Hadoop, IMDG, etc.
             visualization    ERD                          flow diagram
             schema           table schema                 tuple schema
             catalog          relational catalog           tap usage DB
             provenance       (manual audit)               data set
When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters
ANSI SQL – JDBC driver
             import java.sql.Connection;
             import java.sql.DriverManager;
             import java.sql.ResultSet;
             import java.sql.SQLException;
             import java.sql.Statement;

             public void run() throws ClassNotFoundException, SQLException {
                 // register the Lingual JDBC driver
                 Class.forName( "cascading.lingual.jdbc.Driver" );
                 Connection connection =
                   DriverManager.getConnection( "jdbc:lingual:local;schemas=src/main/resources/data/example" );
                 Statement statement = connection.createStatement();
                 ResultSet resultSet = statement.executeQuery(
                     "select *\n"
                       + "from \"EXAMPLE\".\"SALES_FACT_1997\" as s\n"
                       + "join \"EXAMPLE\".\"EMPLOYEE\" as e\n"
                       + "on e.\"EMPID\" = s.\"CUST_ID\"" );
                 // print each row as "LABEL=value; LABEL=value; ..."
                 while( resultSet.next() ) {
                   int n = resultSet.getMetaData().getColumnCount();
                   StringBuilder builder = new StringBuilder();
                   for( int i = 1; i <= n; i++ ) {
                     builder.append( ( i > 1 ? "; " : "" )
                         + resultSet.getMetaData().getColumnLabel( i ) + "=" + resultSet.getObject( i ) );
                   }
                   System.out.println( builder );
                 }
                 // release the JDBC resources
                 resultSet.close();
                 statement.close();
                 connection.close();
             }

Note that in this example the schema for the DDL has been derived directly from the CSV files.

In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
ANSI SQL – JDBC driver
            $ gradle clean jar
            $ hadoop jar build/libs/lingual-examples-1.0.0-wip-dev.jar
            CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill
            CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian

                                Caveat: if you absolutely positively must have sub-second
                                SQL query response for PB-scale data on a 1000+ node
                                cluster… Good luck with that! (call the MPP vendors)
                                This ANSI SQL library is primarily intended for batch
                                workflows – high throughput, not low-latency –
                                for many under-represented use cases in Enterprise IT.
                                It’s essentially ANSI SQL as a DSL.

Cascading Meetup






              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development

Test-Driven Development (TDD)

                                source: Wikipedia

A general view of the TDD process
Test-Driven Development (TDD)

                                                                    In terms of Big Data apps, TDD is not
                                                                    generally part of the conversation

TDD is not usually high on the list when people start discussing Big Data apps.
Traps – Cascading “exceptional data”

               •   assert patterns (regex) on the tuple streams
               •   adjust assert levels, like log4j levels
               •   define traps on branches
               •   tuples which fail asserts get trapped

An innovation in Cascading was to introduce the notion of a “data exception”,
based on setting stream assertion levels as part of the business logic of an app.
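
To make the “data exception” idea concrete, here is a minimal sketch in plain Java. It is a simplified model and not the Cascading API: the class TrapSketch, its Level enum, and the run method are hypothetical names, and the rule that an assertion only fires when the flow runs at its level or higher is an assumption modeled on the log4j-style levels mentioned on the slide. Tuples that fail an active assertion get diverted to a trap list instead of being dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Simplified model of stream assertions and traps; NOT the Cascading API.
final class TrapSketch {

    // Ordered so that a higher flow level enables more assertions (assumption).
    enum Level { NONE, VALID, STRICT }

    // Outcome of one run: tuples that continued downstream, and tuples that were trapped.
    static final class Result {
        final List<String> sink = new ArrayList<>();
        final List<String> trap = new ArrayList<>();
    }

    // An assertion declared at assertionLevel is applied only when the flow
    // runs at that level or higher; otherwise every tuple passes through.
    static Result run(List<String> tuples, Pattern expected,
                      Level assertionLevel, Level flowLevel) {
        boolean active = assertionLevel != Level.NONE
                && flowLevel.compareTo(assertionLevel) >= 0;
        Result result = new Result();
        for (String tuple : tuples) {
            if (active && !expected.matcher(tuple).matches()) {
                result.trap.add(tuple);   // exceptional data is diverted, not lost
            } else {
                result.sink.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tuples = Arrays.asList("a,true", "b,false", "c,true");
        Pattern expected = Pattern.compile(".*true");

        Result strict = run(tuples, expected, Level.STRICT, Level.STRICT);
        System.out.println("strict: sink=" + strict.sink + " trap=" + strict.trap);

        Result off = run(tuples, expected, Level.STRICT, Level.NONE);
        System.out.println("none:   sink=" + off.sink + " trap=" + off.trap);
    }
}
```

With the flow level turned down, the same business logic runs without the assertion cost, which is the point of treating assertion levels like log levels.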
Traps – example code
            // set up... 

            Pipe etlPipe = new Pipe( "etlPipe" );

            // some processing... 

            AssertMatches assertMatches = new AssertMatches( ".*true" );
            etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );
            // some processing... 

            FlowDef flowDef = FlowDef.flowDef().setName( "etl" )
              .addSource( etlPipe, jsonTap )
              .addTrap( etlPipe, trapTap )
              .addTailSink( etlPipe, cacheTap );
            if( options.has( "assert" ) )
              flowDef.setAssertionLevel( AssertionLevel.STRICT );
            else
              flowDef.setAssertionLevel( AssertionLevel.NONE );

Example use in Cascading code
Traps – redirect exceptions in production
            shunt the trapped exceptional data to other
            parts of the organization:

             •   Ops: notifications

             •   QA: investigate data anomalies

             •   Support: review customer records

             •   Finance: audit

TDD – practice at scale
             1. assert expected patterns in raw input
             2. run just that, to find edge cases
             3. handle the edge cases for input data
             4. assert expected patterns after first chunk of processing
             5. run just that, to verify failure
             6. code until test passes
             7. repeat #4 for each chunk

             [Figure: flow diagram for a large-scale app; labels include GIS export, Regex parse-tree, Geohash, Join, Filter, Estimate Albedo, Road Segments, gps, reco]

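
Steps 4 through 6 of the TDD loop can be sketched in plain Java, without any Cascading dependency. Everything here is hypothetical and for illustration only: the class ChunkTestSketch, the pass-through stub, the normalize chunk, and the lowercase-token pattern are assumptions, not code from the talk. The assertion is written first, the stub is run to confirm that it fails, then the chunk is coded until no failures remain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Pattern;

// Test-first loop for one chunk of processing; all names are hypothetical.
final class ChunkTestSketch {

    // Steps 4-5: apply the chunk, then collect every output that violates the expected pattern.
    static List<String> failures(List<String> input,
                                 Function<String, String> chunk,
                                 Pattern expected) {
        List<String> failed = new ArrayList<>();
        for (String line : input) {
            String out = chunk.apply(line);
            if (!expected.matcher(out).matches()) {
                failed.add(out);
            }
        }
        return failed;
    }

    // First draft of the chunk: a pass-through stub, expected to fail the assertion.
    static String stub(String line) {
        return line;
    }

    // Step 6: the implementation, coded until the assertion passes.
    static String normalize(String line) {
        return line.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("  Rain ", "SHINE", "cloud");
        Pattern expected = Pattern.compile("[a-z]+");
        System.out.println("stub failures:      " + failures(sample, ChunkTestSketch::stub, expected));
        System.out.println("normalize failures: " + failures(sample, ChunkTestSketch::normalize, expected));
    }
}
```

Step 7 then repeats the same pattern for the next chunk, with the previous chunk's output as its input.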
TDD – Cascalog features
             consider that TDD is about asserting and negating logical predicates
               •   Cascalog is based on logical predicates
               •   function definitions as composable subqueries
               •   functions are not particularly far from being unit tests
               •   Midje: facts, mocks


Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., nearly uses TDD as its methodology --
in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
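
As a loose analogy only, here is how composing predicates looks in plain Java. This is not Cascalog, and the names PredicateSketch, NON_EMPTY, IS_STOP_WORD, KEEP, and query, along with the tiny stop-word list, are hypothetical. The sketch shows one composed predicate serving both as a query over data and as the thing a unit test asserts or negates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Analogy in plain Java, NOT Cascalog: one predicate used as both query and test.
final class PredicateSketch {

    // Hypothetical stop-word list, for illustration only.
    static final List<String> STOP_WORDS = Arrays.asList("a", "the", "of");

    static final Predicate<String> NON_EMPTY = s -> !s.isEmpty();
    static final Predicate<String> IS_STOP_WORD = s -> STOP_WORDS.contains(s);

    // Composition: assert one predicate, negate the other.
    static final Predicate<String> KEEP = NON_EMPTY.and(IS_STOP_WORD.negate());

    // Used as a query: select the tokens for which the composed predicate holds.
    static List<String> query(List<String> tokens) {
        return tokens.stream().filter(KEEP).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(query(Arrays.asList("the", "flow", "", "of", "tuples")));
    }
}
```

A unit test for this code is just the same predicate applied to known values, which is roughly the sense in which such functions are "not particularly far from being unit tests".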
Cascading Meetup






              1. Enterprise Data Workflows
              2. ANSI SQL Support
              3. Test-Driven Development
              …plus, a proposal

ANSI SQL – multiple flows

                                               GIS                               Regex

                                              export                            parse-tree        species

                                          M                              M
              [Figure: flow diagram of the example app, continued. Join, Geohash, Filter, Estimate, Sum, and Max steps run across map (M) and reduce (R) stages, with a Failure branch.]

              Suppose your organization is responsible
              for a large-scale app…
              Multiple teams develop reusable libraries…
Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
ANSI SQL – multiple flows

              [Figure: flow diagram of the example app. GIS export and Regex parse-tree inputs feed Join, Geohash, Filter, Estimate, Sum, and Max steps across map (M) and reduce (R) stages, with a Failure branch.]

              Data Analysts: ANSI SQL queries
              for data prep
              (displaces Hive, etc.)
Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows

              [Figure: flow diagram of the example app. GIS export and Regex parse-tree inputs feed Join, Geohash, Filter, Estimate, Sum, and Max steps across map (M) and reduce (R) stages, with a Failure branch.]

              Server-side Engineering: HBase tap
              for customer profiles
              (integrating other components)
Engineering provides integration with customer profiles, e.g., transactional data objects in HBase.
These can migrate into a Cascading app to run on Hadoop.
ANSI SQL – multiple flows

              [Figure: flow diagram of the example app. GIS export and Regex parse-tree inputs feed Join, Geohash, Filter, Estimate, Sum, and Max steps across map (M) and reduce (R) stages, with a Failure branch.]

              Ops + Support: Traps get
              routed to customer review
              (ties into notifications, etc.)
Support needs to review exceptional data, via reports/notifications.
These can migrate into a Cascading app to run on Hadoop.
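
As a rough illustration of how trapped data could reach Support, here is a minimal plain-Java sketch of the trap idea. It does not use the Cascading API. The class name `TrapSketch`, the sample records, and the full-match regex semantics are all assumptions made for the example. Records that fail a regex assertion go to a separate "trap" list instead of failing the whole run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java illustration of the trap idea; hypothetical, not the Cascading API. */
public class TrapSketch {

    /** Holds the two output streams: records that passed and records that were trapped. */
    public static final class Result {
        public final List<String> passed = new ArrayList<>();
        public final List<String> trapped = new ArrayList<>();
    }

    /**
     * Routes each line to "passed" when it fully matches the regex,
     * otherwise to "trapped". With assertions disabled, every line passes.
     */
    public static Result route(List<String> lines, String regex, boolean assertEnabled) {
        Pattern pattern = Pattern.compile(regex);
        Result result = new Result();
        for (String line : lines) {
            if (!assertEnabled || pattern.matcher(line).matches()) {
                result.passed.add(line);
            } else {
                result.trapped.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical tab-separated records; the last field is a validity flag.
        List<String> lines = Arrays.asList(
            "100\tBill\ttrue",
            "150\tSebastian\tfalse",
            "200\tAda\ttrue");
        Result result = route(lines, ".*true", true);
        System.out.println("sink: " + result.passed);
        System.out.println("trap: " + result.trapped);
    }
}
```

In a real app the trapped list would be written to its own tap, so that a report or notification job can pick it up without touching the main pipeline.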
ANSI SQL – multiple flows

              [Figure: flow diagram of the example app. GIS export and Regex parse-tree inputs feed Join, Geohash, Filter, Estimate, Sum, and Max steps across map (M) and reduce (R) stages, with a Failure branch.]

              Data Scientists: R => PMML
              for predictive models
              (displaces SAS, etc.)
Scientists perform their model creation work in R, Weka, SAS, MicroStrategy, etc., which can export models as PMML.
These can migrate into a Cascading app to run on Hadoop.
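
To make the "model as data" idea concrete, here is a minimal plain-Java sketch that reads the coefficients of a linear regression from a heavily trimmed PMML fragment and scores one record. It is an assumption-laden illustration, not how Cascading consumes PMML. The fragment omits the namespace, header, and data dictionary that a complete PMML file would carry, and the class name `PmmlSketch` and the field names `x1` and `x2` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Scores a record against a trimmed PMML regression table; illustration only. */
public class PmmlSketch {

    // Trimmed fragment: a real export would also include Header, DataDictionary, etc.
    public static final String MODEL =
        "<PMML version=\"4.1\">"
        + "<RegressionModel functionName=\"regression\">"
        + "<RegressionTable intercept=\"1.5\">"
        + "<NumericPredictor name=\"x1\" coefficient=\"2.0\"/>"
        + "<NumericPredictor name=\"x2\" coefficient=\"-0.5\"/>"
        + "</RegressionTable>"
        + "</RegressionModel>"
        + "</PMML>";

    /** Returns intercept + sum(coefficient * value) over the predictors in the first table. */
    public static double score(String pmml, Map<String, Double> row) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
        Element table = (Element) doc.getElementsByTagName("RegressionTable").item(0);
        double y = Double.parseDouble(table.getAttribute("intercept"));
        NodeList predictors = table.getElementsByTagName("NumericPredictor");
        for (int i = 0; i < predictors.getLength(); i++) {
            Element predictor = (Element) predictors.item(i);
            double coefficient = Double.parseDouble(predictor.getAttribute("coefficient"));
            // Assumes the row supplies every predictor; a missing field would fail here.
            y += coefficient * row.get(predictor.getAttribute("name"));
        }
        return y;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Double> row = new HashMap<>();
        row.put("x1", 3.0);
        row.put("x2", 4.0);
        System.out.println(score(MODEL, row)); // 1.5 + 2.0*3.0 - 0.5*4.0 = 5.5
    }
}
```

The point of the sketch is that the scoring code never changes when the model does: the data scientist re-exports the file, and the pipeline picks up the new coefficients.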
ANSI SQL – multiple flows

             [Figure: flow diagram of the example app. GIS export and Regex parse-tree inputs feed Join, Geohash, Filter, Estimate, Sum, and Max steps across map (M) and reduce (R) stages, with a Failure branch.]

             App Engineering: Java/Scala/Clojure
             for business logic in data pipelines
             (displaces Pig, etc.)
Generally, revenue apps require some custom business logic that represents the business processes of the LOB.
These can migrate into a Cascading app to run on Hadoop.
Cascading meetup #4 @ BlueKai

  Cascading Meetup #4 BlueKai Cupertino, CA 2013-03-05 Copyright @2013, Concurrent, Inc.
  • 2. Cascading Meetup Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 1. Enterprise Data Workflows 2. ANSI SQL Support 3. Test-Driven Development Tuesday, 05 March 13 2
  • 3. Enterprise Data Workflows Customers Let’s consider an example app… at the front end Web App LOB use cases drive demand for apps logs Cache logs Logs Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 3 LOB use cases drive the demand for Big Data apps
  • 4. Enterprise Data Workflows Customers An example… in the back office Organizations have substantial investments Web App in people, infrastructure, process logs Cache logs Logs Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 4 Enterprise organizations have seriously ginormous investments in existing back office practices: people, infrastructure, processes
  • 5. Enterprise Data Workflows Customers An example… for the heavy lifting! “Main Street” firms are migrating Web App workflows to Hadoop, for cost savings and scale-out logs Cache logs Logs Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 5 “Main Street” firms have invested in Hadoop to address Big Data needs, off-setting their rising costs for Enterprise licenses from SAS, Teradata, etc.
  • 6. Two Avenues… Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, complexity ➞ ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff Start-ups: crave complexity and scale to become viable… new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding scale ➞ Tuesday, 05 March 13 6 Enterprise data workflows are observed in two modes: start-ups approaching complexity and incumbent firms grappling with complexity
  • 7. Two Avenues… Enterprise: must contend with complexity at scale everyday… incumbents extend current practices and infrastructure investments – using J2EE, complexity ➞ ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff Hadoop almost never gets used in isolation; data workflows define Start-ups: crave complexity and scale to become viable… the “glue” required for system new ventures move into Enterprise space of Enterprise apps integration to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding scale ➞ Tuesday, 05 March 13 7 Hadoop is almost never used in isolation. Enterprise data workflows are about system integration. There are a couple different ways to arrive at the party.
  • 8. Cascading Meetup Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 1. Enterprise Data Workflows 2. ANSI SQL Support 3. Test-Driven Development Tuesday, 05 March 13 8
  • 9. Cascading workflows – ANSI SQL • collab with Optiq – industry-proven code base Customers • ANSI SQL parser/optimizer atop Cascading flow planner Web App • JDBC driver to integrate into existing tools and app servers logs logs Cache Logs • relational catalog over a collection Support source of unstructured data trap tap tap sink tap • SQL shell prompt to run queries Modeling PMML Data Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 9 ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 10. Cascading workflows – ANSI SQL • collab with Optiq – industry-proven code base Customers • ANSI SQL parser/optimizer atop Cascading flow planner Web App • JDBC driver to integrate into existing tools and app servers logs logs Cache Premise: most SQL in the world gets Logs • relational catalog over a collection Support of unstructured datawritten by machines… trap tap source tap sink tap • SQL shell prompt to run isn’t a database; this is about making This queries Modeling PMML Data Workflow machine-to-machine communications sink tap source tap simpler and more robust at scale. Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 10 ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 11. Cascading workflows – ANSI SQL • enable analysts without retraining on Hadoop, etc. Customers • transparency for Support, Ops, Web App Finance, et al. logs Cache logs Logs Support source trap sink tap tap tap Data a language for queries – not a database, Modeling PMML Workflow but ANSI SQL as a DSL for workflows sink tap source tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 11 ANSI SQL as “machine code” -- the lingua franca of Enterprise system integration. Cascading partnered with Optiq, the team behind Mondrian, etc., with an Enterprise-proven code base for an ANSI SQL parser/optimizer.
  • 12. ANSI SQL – reviews Open Source 'Lingual' Helps SQL Devs Unlock Hadoop Thor Olavsrud, 2013-02-22 Hadoop Apps Without MapReduce Mindsets Adrian Bridgwater, 2013-02-28 Concurrent gives old SQL users new Hadoop tricks Jack Clark, 2013-02-20 Concurrent Open Source Project Ties SQL to Hadoop Michael Vizard, 2013-02-21 Concurrent Releases Lingual, a SQL DSL for Hadoop Boris Lublinsky, 2013-02-28 Tuesday, 05 March 13 12
  • 13. ANSI SQL – CSV data in local file system Tuesday, 05 March 13 13 The test database for MySQL is available for download from Here we have a bunch o’ CSV flat files in a directory in the local file system. Use the “lingual” command line interface to overlay DDL to describe the expected table schema.
  • 14. ANSI SQL – shell prompt, catalog Tuesday, 05 March 13 14 Use the “lingual” SQL shell prompt to run SQL queries interactively, show catalog, etc.
  • 15. ANSI SQL – queries Tuesday, 05 March 13 15 Here’s an example SQL query on that “employee” test database from MySQL.
  • 16. ANSI SQL – layers abstraction RDBMS JVM Cluster parser ANSI SQL ANSI SQL compliant parser compliant parser optimizer logical plan, logical plan, optimized based on stats optimized based on stats planner physical plan API “plumbing” machine query history, app history, data table stats tuple stats topology b-trees, etc. heterogenous, distributed: Hadoop, IMDG, etc. visualization ERD flow diagram schema table schema tuple schema catalog relational catalog tap usage DB provenance (manual audit) data set producers/consumers Tuesday, 05 March 13 16 When you peel back the onion skin on a SQL query, each of the abstraction layers used in an RDBMS has an analogue (or better) in the context of Enterprise Data Workflows running on JVM clusters
  • 17. ANSI SQL – JDBC driver public void run() throws ClassNotFoundException, SQLException { Class.forName( "cascading.lingual.jdbc.Driver" ); Connection connection = DriverManager.getConnection( "jdbc:lingual:local;schemas=src/main/resources/data/example" ); Statement statement = connection.createStatement();   ResultSet resultSet = statement.executeQuery( "select *n" + "from "EXAMPLE"."SALES_FACT_1997" as sn" + "join "EXAMPLE"."EMPLOYEE" as en" + "on e."EMPID" = s."CUST_ID"" );   while( ) { int n = resultSet.getMetaData().getColumnCount(); StringBuilder builder = new StringBuilder();   for( int i = 1; i <= n; i++ ) { builder.append( ( i > 1 ? "; " : "" ) + resultSet.getMetaData().getColumnLabel( i ) + "=" + resultSet.getObject( i ) ); } System.out.println( builder ); }   resultSet.close(); statement.close(); connection.close(); } Tuesday, 05 March 13 17 Note that in this example the schema for the DDL has been derived directly from the CSV files. In other words, point the JDBC connection at a directory of flat files and query as if they were already loaded into SQL.
  • 18. ANSI SQL – JDBC driver $ gradle clean jar $ hadoop jar build/libs/lingual-examples–1.0.0-wip-dev.jar   CUST_ID=100; PROD_ID=10; EMPID=100; NAME=Bill CUST_ID=150; PROD_ID=20; EMPID=150; NAME=Sebastian Caveat: if you absolutely positively must have sub-second SQL query response for Pb-scale data on a 1000+ node cluster… Good luck with that! (call the MPP vendors) This ANSI SQL library is primarily intended for batch workflows – high throughput, not low-latency – for many under-represented use cases in Enterprise IT. It’s essentially ANSI SQL as a DSL. Tuesday, 05 March 13 18 success
  • 19. Cascading Meetup Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 1. Enterprise Data Workflows 2. ANSI SQL Support 3. Test-Driven Development Tuesday, 05 March 13 19
  • 20. Test-Driven Development (TDD) source: Wikipedia Tuesday, 05 March 13 20 A general view of TDD process
  • 21. Test-Driven Development (TDD) In terms of Big Data apps,TDD is not generally part of the conversation Tuesday, 05 March 13 21 TDD is not usually high on the list when people start discussing Big Data apps.
  • 22. Traps – Cascading “exceptional data” • assert patterns (regex) on the tuple streams Customers • adjust assert levels, like log4j levels • define traps on branches Web App • tuples which fail asserts get trapped logs Cache logs Logs Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 22 An innovation in Cascading was to introduce the notion of a “data exception”, based on setting stream assertion levels as part of the business logic of an app.
  • 23. Traps – example code // set up...  Pipe etlPipe = new Pipe( "etlPipe" ); // some processing...  AssertMatches assertMatches = new AssertMatches( ".*true" ); etlPipe = new Each( etlPipe, AssertionLevel.STRICT, assertMatches );   // some processing...  FlowDef flowDef = FlowDef.flowDef().setName( "etl" ) .addSource( etlPipe, jsonTap ) .addTrap( etlPipe, trapTap ) .addTailSink( etlPipe, cacheTap );   if( options.has( "assert" ) ) flowDef.setAssertionLevel( AssertionLevel.STRICT ); else flowDef.setAssertionLevel( AssertionLevel.NONE ); Tuesday, 05 March 13 23 Example use in Cascading code
  • 24. Traps – redirect exceptions in production shunt the trapped exceptional data to other parts of the organization: Customers • Ops: notifications Web App • QA: investigate data anomalies • Support: review customer records logs logs Logs Cache • Finance: audit Support source trap sink tap tap tap Data Modeling PMML Workflow source sink tap tap Analytics Cubes customer Customer profile DBs Prefs Hadoop Cluster Reporting Tuesday, 05 March 13 24
  • 25. TDD – practice at scale 1. assert expected patterns in raw input 2. run just that, to find edge cases 3. handle the edge cases for input data 4. assert expected patterns after first chunk of processing 5. run just that, to verify failure 6. code until test passes GIS Regex tree Scrub export parse-tree species 7. repeat #4 for each chunk M M Estimate Join Geohash height Regex src parse-gis Tree Filter tree Metadata height Failure M Traps Calculate Filter Sum Join distance distance moment Filter sum_moment Estimate R M R M road road Regex traffic parse-road shade Estimate Road Join Albedo Segments Geohash Join M R Road Metadata gps R gps reco logs Count Geohash Max gps_count recent_visit M R Tuesday, 05 March 13 25
  • 26. TDD – Cascalog features consider that TDD is about asserting and negating logical predicates… • Cascalog is based on logical predicates • function definitions as composable subqueries • functions are not particularly far from being unit tests • Midje: facts, mocks Tuesday, 05 March 13 26 Moreover, the Cascalog language by Nathan Marz, Sam Ritchie, et al., nearly uses TDD as its methodology -- in the transition from ad-hoc queries as logic predicates, then composing those predicates into large-scale apps.
  • 27. Cascading Meetup Document Collection Scrub Tokenize token M HashJoin Regex Left token GroupBy R Stop Word token List RHS Count Word Count 1. Enterprise Data Workflows 2. ANSI SQL Support 3. Test-Driven Development …plus, a proposal Tuesday, 05 March 13 27
  • 28. ANSI SQL – multiple flows
      (Diagram: flow graph of the example app. GIS export, tree and road metadata, and GPS logs feed Regex, Scrub, Filter, Join, and Estimate steps across map (M) and reduce (R) phases, with a Failure Traps branch.)
      Suppose your organization is responsible for a large-scale app…
      Multiple teams develop reusable libraries…

      Suppose you have an app with a complex flow diagram like this, with contributions to the business logic from different departments…
  • 29. ANSI SQL – multiple flows
      (Diagram: the same flow graph as on slide 28.)
      Data Analysts: ANSI SQL queries for data prep (displaces Hive, etc.)

      Analysts are generally working with ANSI SQL queries in a DW, e.g., for ETL, data prep, pulling data cubes. These can migrate into a Cascading app to run on Hadoop.
  • 30. ANSI SQL – multiple flows
      (Diagram: the same flow graph as on slide 28.)
      Server-side Engineering: HBase tap for customer profiles (integrating other components)

      Engineering provides integration with customer profiles, e.g., transactional data objects in HBase. These can migrate into a Cascading app to run on Hadoop.
  • 31. ANSI SQL – multiple flows
      (Diagram: the same flow graph as on slide 28.)
      Ops + Support: Traps get routed to customer review (ties into notifications, etc.)

      Support needs to review exceptional data, via reports/notifications. These can migrate into a Cascading app to run on Hadoop.
  • 32. ANSI SQL – multiple flows
      (Diagram: the same flow graph as on slide 28.)
      Data Scientists: R => PMML for predictive models (displaces SAS, etc.)

      Scientists perform their model creation work in R, Weka, SAS, MicroStrategy, etc., which can export as PMML. These can migrate into a Cascading app to run on Hadoop.
  • 33. ANSI SQL – multiple flows
      (Diagram: the same flow graph as on slide 28.)
      App Engineering: Java/Scala/Clojure for business logic in data pipelines (displaces Pig, etc.)

      Generally the revenue apps require some custom business logic, representing the business process for the LOB. These can migrate into a Cascading app to run on Hadoop.