This document summarizes the results of a survey of Cascading users. It finds that Cascading is most popular among those building and managing big data applications. Many users explored alternatives like Hive and Pig before adopting Cascading due to its scalability and portability across compute frameworks. The survey also shows that Cascading users value reliability and performance at scale and are interested in new frameworks like Spark.
Moustafa Soliman, "HP Vertica - Solving Facebook Big Data Challenges" – Dataconomy Media
Moustafa Soliman, Business Intelligence Developer from Hewlett Packard, presented "HP Vertica - Solving Facebook Big Data Challenges" as part of the "Big Data Stockholm" meetup on April 1st at SUP46.
R, Spark, TensorFlow, H2O.ai Applied to Streaming Analytics – Kai Wähner
Slides from my talk at Codemotion Rome in March 2017. Development of analytic machine learning / deep learning models with R, Apache Spark ML, TensorFlow, H2O.ai, RapidMiner, KNIME and TIBCO Spotfire. Deployment to real-time event processing / stream processing / streaming analytics engines like Apache Spark Streaming, Apache Flink, Kafka Streams, TIBCO StreamBase.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... – Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity for Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop. They also explore Dali, a data abstraction layer that can help you process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
This document provides an overview of Hortonworks and Hadoop. It discusses Hortonworks' customer momentum, the Hortonworks Data Platform (HDP), and Hortonworks' role as a partner for customer success. It also summarizes challenges with traditional data systems, how Hadoop emerged as a foundation for a new data architecture, and how HDP delivers a comprehensive data management platform.
Early adopters of cloud technology—companies that have planned, implemented and seen the benefits in real deployments—are beginning to establish a track record of “lessons learned”. The Economist Intelligence Unit, sponsored by SAP, has analysed the experiences of six companies that have implemented cloud solutions specifically designed to foster collaboration in the workplace.
Jan van der Vegt, "Challenges faced with machine learning in practice" – Lviv Startup Club
Machine learning projects often fail to make it from development to production. Looking at the full machine learning lifecycle is essential for success; the lifecycle spans development, deployment, infrastructure, monitoring, automation, standardization, lineage and reproducibility. A machine learning operations (MLOps) platform can provide an end-to-end system view for increased efficiency, collaboration, and trust across the lifecycle. The key takeaways are to focus on what is important, and to avoid both doing nothing (which fails to scale) and doing everything (which stifles progress).
Digital Transformation - #StrataData London 2017 - Data 101 – Ellen Friedman
Presented at Strata Data London conference May 2017 in the Data 101 track, this presentation explores what is needed in planning, architecture, and cultural organization for effective digital transformation.
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk – Ellen Friedman
This document provides an overview of a presentation given by Ellen Friedman on machine learning. Some key points discussed include:
- Domain knowledge is very important for machine learning to work effectively. Small differences in input data or labels can significantly impact model performance.
- Stream processing and microservices architectures are useful for managing the many models needed for machine learning. Having the right messaging infrastructure is also important.
- Deploying and managing machine learning models at scale poses logistical challenges. The Rendezvous architecture and DataOps approaches aim to help with continuous model evaluation, deployment and adaptation.
- Both software engineers and data scientists have important roles to play in machine learning projects, and cross-functional teams are needed.
ACCELERATE SAP® APPLICATIONS WITH CDNETWORKS – CDNetworks
CDNetworks and SAP conducted a proof of concept project to test how CDNetworks' content delivery network (CDN) service could accelerate SAP applications. Testing showed the CDN provided significant performance improvements, reducing response times for login and file downloads by 50-66% on average globally. The CDN also improved reliability, with no errors observed during stress testing of 10,000 transactions, whereas the internet saw around a 4% failure rate. The CDN's global infrastructure and security features were found to enhance the delivery, speed, and reliability of SAP applications for distributed users worldwide.
The document discusses embedding machine learning in business processes using the example of baking cakes. It notes that while bakers follow exact recipes and processes, the results are not always perfect due to various factors. It then discusses how manufacturers are "data rich but information poor" as they cannot derive meaningful insights from their operational data. The document advocates generating "actionable intelligence" through deep analysis of production data to determine the root causes of issues like cracked cakes, rather than just reporting what problems occurred. This would help manufacturers diagnose and address process flaws more precisely.
Haven OnDemand is a machine learning platform that provides APIs and services to help developers easily build data-rich applications. It has over 60 composable machine learning APIs that can be combined to power use cases like text analysis, image recognition, and predictive modeling. Developers can build powerful applications with minimal coding by leveraging these APIs. Haven OnDemand also offers purpose-built solutions like Haven Search OnDemand that are built on top of the API platform.
Apache Hadoop and its role in Big Data architecture - Himanshu Bari – jaxconf
In today’s world of exponentially growing big data, enterprises are becoming increasingly aware of the business utility and necessity of harnessing, storing and analyzing this information. Apache Hadoop has rapidly evolved to become a leading platform for managing and processing big data, with the vital management, monitoring, metadata and integration services required by organizations to glean maximum business value and intelligence from their burgeoning amounts of information on customers, web trends, products and competitive markets. In this session, Hortonworks' Himanshu Bari will discuss the opportunities for deriving business value from big data by looking at how organizations utilize Hadoop to store, transform and refine large volumes of this multi-structured information. Bari will also discuss the evolution of Apache Hadoop and where it is headed, the component requirements of a Hadoop-powered platform, as well as solution architectures that allow for Hadoop integration with existing data discovery and data warehouse platforms. In addition, he will look at real-world use cases where Hadoop has helped to produce more business value, augment productivity or identify new and potentially lucrative opportunities.
Understanding The Cloud For Enterprise Businesses – Triaxil
Cloud is getting lots of attention these days. Cloud is a transformational platform that can support the opportunities of today’s digital business being shaped and driven by mobile, social, IoT (Internet of Things), Big Data and other forces. Cloud Computing not only is a powerful agent of change, but it also can accelerate transformation.
The benefits are big. “Cloud computing is a disruptive phenomenon, with the potential to make IT organizations more responsive than ever,” says research firm Gartner. “Cloud computing promises economic advantages, speed, agility, flexibility, infinite elasticity and innovation.” As a result, more and more enterprises are moving to the cloud. According to Gartner, 78 percent of enterprises are planning to increase their investment in cloud through 2017.
Introduction to the graph technologies landscape – Linkurious
Graph technologies allow modeling of complex relationships and connections through nodes and edges. There are three main layers of graph technologies: graph databases to store graph data, graph analysis frameworks to analyze large graphs, and graph visualization solutions to interact with graphs. Popular tools in each layer include Neo4j and Titan for databases, Giraph and GraphX for analysis, and Gephi and Cytoscape for visualization. Graph technologies are gaining more attention due to their ability to extract insights from connected data.
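The node-and-edge model shared by all three layers can be sketched in a few lines. The following is an illustrative stand-alone sketch in plain Python, not the API of Neo4j, Titan, or any other tool named above; all class and method names are hypothetical.

```python
from collections import defaultdict

class PropertyGraph:
    """Toy directed graph: nodes carry properties, edges carry a type."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # source id -> list of (edge type, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, edge_type, target):
        self.edges[source].append((edge_type, target))

    def neighbors(self, node_id, edge_type=None):
        # Follow outgoing edges, optionally filtered by edge type.
        return [target for etype, target in self.edges[node_id]
                if edge_type is None or etype == edge_type]

g = PropertyGraph()
g.add_node("alice", kind="person")
g.add_node("acme", kind="company")
g.add_edge("alice", "WORKS_AT", "acme")
print(g.neighbors("alice", "WORKS_AT"))  # ['acme']
```

Graph databases persist and index exactly this kind of structure, analysis frameworks distribute traversals over it, and visualization tools render the nodes and edges interactively.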
The document discusses the future of data and modern data applications. It notes that data is growing exponentially and will reach 44 zettabytes by 2020. This growth is driving the need for new data architectures like Apache Hadoop which can handle diverse data types from sources like the internet of things. Hadoop provides distributed storage and processing to enable real-time insights from all available data.
A successful enterprise Journey to Cloud requires more than technical execution, and we’ll help you learn what to consider, the pitfalls and how to succeed. We’ve helped many companies – in Australia and globally – execute their digital vision and accelerate change on their Journey to Cloud. We’ll share some of their experiences to help you discover how an optimised migration can transform your business.
Speakers:
Chris Fleishmann, Managing Director, Journey to Cloud Chief Architect
Attilio Di Lorenzo, Senior manager, Journey to Cloud Architect
The document discusses how businesses are increasingly adopting public and private cloud services. It provides statistics showing that 58% of organizations currently use cloud services for small applications and workloads. The use of cloud infrastructure as a service (IaaS) and platform as a service (PaaS) is growing significantly and driving digital business innovation. The top challenges with public cloud include bandwidth costs, performance constraints, and cloud services going down. The document argues that adding flash memory to cloud infrastructure can enhance performance, reliability, and cost effectiveness by providing predictable performance, high throughput, and redundancy for critical workloads.
The document discusses big data and open source tools and technologies. It provides an overview of key challenges for data leaders, introduces the top 10 big data tools including Apache Spark, R, and Talend Open Studio. It outlines the benefits of open source including low costs, flexibility, and innovation. The document advocates adopting both corporate and open source software using a "bi-modal" approach to support innovative and engineered analytics. It provides a template for a 1-page big data strategy.
SnapLogic has been gaining traction in big-data integration. It recently announced the Fall 2015 release of its Elastic Integration Platform, which adds capabilities for big-data integration that now include Spark (an open source in-memory data-processing framework), a new Snap (preconfigured connector) for Cassandra (an open source distributed ‘big’ database) and support for Microsoft Cortana Analytics. SnapLogic is positioning this release as a self-service hybrid cloud integration offering, and it is intended to strengthen its position among Microsoft customers and others seeking cloud-based big-data analytics.
The document discusses the development of an internal data pipeline platform at Indix to democratize access to data. It describes the scale of data at Indix, including over 2.1 billion product URLs and 8 TB of HTML data crawled daily. Previously, the data was not discoverable, schemas changed and were hard to track, and using code limited who could access the data. The goals of the new platform were to enable easy discovery of data, transparent schemas, minimal coding needs, UI-based workflows for anyone to use, and optimized costs. The platform developed was called MDA (Marketplace of Datasets and Algorithms) and enabled SQL-based workflows using Spark. It has continued improving since its first release in 2016.
You’re not the only one still loading your data into data warehouses and building marts or cubes out of it. But today’s data requires a much more accessible environment that delivers real-time results. Prepare for this transformation, because your data platform and storage choices are about to undergo a re-platforming that happens once every 30 years.
With the MapR Converged Data Platform (CDP) and Cisco Unified Compute System (UCS), you can optimize today’s infrastructure and grow to take advantage of what’s next. Uncover the range of possibilities from re-platforming by intimately understanding your options for density, performance, functionality and more.
Functional programming for optimization problems in Big Data – Paco Nathan
Enterprise Data Workflows with Cascading.
Silicon Valley Cloud Computing Meetup talk at Cloud Tech IV, 4/20 2013
http://www.meetup.com/cloudcomputing/events/111082032/
BIG Data & Hadoop Applications in Social Media – Skillspeed
This document discusses how major social media networks like Facebook, Twitter, LinkedIn, Pinterest, and Instagram utilize big data and Hadoop technologies. It provides examples of how each network uses Hadoop for tasks like storing user data, performing analytics, and generating personalized recommendations at massive scales as their user bases and data volumes grow enormously. The document also briefly outlines SkillSpeed's Hadoop training course, which covers topics like HDFS, MapReduce, Pig, Hive, HBase and more to prepare students for jobs working with big data.
The document summarizes the key findings from a survey on the future of cloud computing in 2012. Some of the main points covered include:
1) Software is increasingly becoming cloud-based, with SaaS spending growing much faster than traditional software and over 50% of categories being disrupted.
2) SaaS is widely adopted, with 82% currently using it and 84% of new software predicted to be SaaS. PaaS adoption is also increasing significantly.
3) Hybrid cloud models are becoming more popular, with 100% of deployments predicted to be hybrid by 2017.
4) While cloud adoption is increasing, concerns around security, compliance and other issues remain barriers for some.
The document discusses cloud computing trends, including:
- Most large enterprises are transitioning infrastructure to cloud computing to cut costs and risks. Critical workloads are also moving to cloud.
- Hybrid cloud strategies that maintain some workloads on-premise while moving others to cloud are becoming more common and supported.
- Hardware companies are struggling to remain relevant as cloud platforms commoditize infrastructure. They are pursuing mergers and spin-offs.
- DevOps practices emphasize continuous delivery over traditional ITIL change processes. The role of IT is shifting from systems maintenance to innovation brokerage and service management between internal and cloud resources.
FlexPod Select for Hadoop is a pre-validated solution from Cisco and NetApp that provides an enterprise-class architecture for deploying Apache Hadoop workloads at scale. The solution includes Cisco UCS servers and fabric interconnects for compute, NetApp storage arrays, and Cloudera's Distribution of Apache Hadoop for the software stack. It offers benefits like high performance, reliability, scalability, simplified management, and reduced risk for organizations running business-critical Hadoop workloads.
SnapLogic Raises $37.5M to Fuel Big Data Integration Push – SnapLogic
SnapLogic has grown well and rapidly since it pivoted in 2012 to focus on cloud-based iPaaS; however, the company continues to compete with on-premises providers, especially for big-data integration, thanks to its hybrid execution framework, which separates the design and management of integration pipelines from the runtime environment. Microsoft’s involvement in the latest funding round is sure to be a blessing, and builds on an existing agreement to provide integration for the Cortana Analytics Suite and Azure cloud.
This document provides an overview of IT/Network Operations concepts and strategies to improve cloud production. It begins with Joe Dietz introducing himself as a Network Security Professional and listing his current certifications. It then discusses various local user groups and events related to cloud security. The document covers topics such as selecting public vs private clouds, choosing cloud providers and applications, operational considerations, and approaches to connecting networks to the cloud such as extending datacenters or enabling edge services. It emphasizes that moving to the cloud still requires planning and not all applications are good candidates. The summary concludes by mentioning related reading on hybrid cloud services and tools.
The business analytics marketplace is experiencing a challenge as classic BI tools meet up with evolving big data technologies, in particular Hadoop. We explore how IBM works to meet this challenge, providing a big picture perspective of their big data offerings around Hadoop, its open data platform and BigInsights.
Building a Big Data platform with the Hadoop ecosystem – Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
- views of the Big Data ecosystem and its components
- an example of a Hadoop cluster
- considerations when selecting a Hadoop distribution
- some of the Hadoop distributions available
- a recommended Hadoop distribution
Learn why 451 Research believes Infochimps is well-positioned with an easy-to-consume managed service for those without Hadoop expertise, as well as a stack of technologically interesting projects for the 'devops' crowd.
Opening with a market positioning statement and ending with a competitive and SWOT analysis, Matt Aslett provides a comprehensive impact report.
Infochimps report: 451 Research impact report – Accenture
Infochimps, a big data PaaS provider, has updated its platform with stream processing capabilities from technologies developed at Twitter and LinkedIn. With its first paying customer, the company is now seeking partnerships to support its enterprise-focused offering. It provides an easy-to-use managed service for Hadoop that masks complexity and can generate insights from data in 30 days without specialized hiring or infrastructure. While competition is increasing, Infochimps' strengths include its Chef-based cluster platform and integration of existing tools via its Data Delivery Service.
Similar to Cascading 2015 User Survey Results
Overview of Cascading 3.0 on Apache Flink – Cascading
Cascading is a Java API for building batch data applications on Hadoop. This document discusses executing Cascading programs on Apache Flink instead of Hadoop MapReduce. With Cascading on Flink, programs are translated to single Flink jobs instead of multiple MapReduce jobs. This improves performance by allowing pipelined execution without writing intermediate data to HDFS. For example, a TF-IDF program runs 3.5 hours faster on Flink than MapReduce. Cascading on Flink leverages Flink's efficient in-memory operators while requiring minimal code changes.
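For context, TF-IDF (the benchmark workload cited above) weights each term by its frequency within a document, discounted by how many documents contain it. Here is a minimal in-memory sketch of that computation in plain Python; this is an illustration of the metric itself, not the Cascading or Flink implementation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Term frequency times inverse document frequency, per document.
    return [{term: (count / len(doc)) * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

docs = [["big", "data", "big"], ["data", "flows"], ["flink", "flows"]]
weights = tf_idf(docs)
```

On a cluster, the document-frequency count and the per-document weighting become separate grouping and join steps, which is why pipelining them in memory (as Flink does) avoids the intermediate HDFS writes that MapReduce would require between jobs.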
Predicting Hospital Readmission Using Cascading – Cascading
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
Reducing Development Time for Production-Grade Hadoop Applications – Cascading
Ryan Desmond's presentation at the Cascading Meetup on August 27, 2015. A brief overview of Cascading, intended to give a basic understanding to Clojure users who might use PigPen & Clojure to access Cascading.
Breathe new life into your data warehouse by offloading ETL processes to Hadoop – Cascading
This document discusses offloading ETL workloads from data warehouses to Hadoop. It provides an overview of Bitwise, an ISO-certified company that provides ETL and data quality services. It also describes Driven, a platform for building, running, and managing big data applications. Driven provides visibility into data pipelines, monitors application performance, and enables collaboration around operational issues. It stores metadata about application telemetry in a scalable and searchable manner to provide end-to-end operational visibility for Hadoop applications.
How To Get Hadoop App Intelligence with Driven – Cascading
You built Cascading/Scalding apps to mine all that data you collected in Hadoop. But just when you were seeing results, something went wrong — the app broke, data flows stopped, and business came to a halt.
So what do you do next? How do you find out what went wrong in the shortest time possible? How do you pinpoint the line of code where the error occurred? How do you know which SLA is going to be impacted? How do you view the lineage of data to adhere to compliance requirements?
In this presentation, we show you how to easily find the answers with Driven, the most comprehensive Big Data App Performance Management Platform.
Furthermore, this presentation describes how Driven can help you build higher quality big data apps; run big data apps more reliably; and manage big data apps more effectively.
Who should view this PPT: Any person or organization that is currently involved in planning, deploying or managing a Hadoop application infrastructure.
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an... – Cascading
This video dives into 7 best practices for how IT organizations can achieve true operational readiness on Hadoop using Driven and Cascading.
For any person, organization or enterprise that is currently involved in planning, deploying or managing a Hadoop infrastructure. Development Teams, IT Ops, Executive Management.
Key Takeaways:
- Connecting execution problems with application context
- Defining and enforcing SLAs
- Understanding inter-app dependencies
- Rationing your cluster
- Tracing data access at the operational level
- Building culture and tools supporting collaboration between developers, operators, & other Hadoop team members
The Cascading (big) data application framework - André Kelpe, Sr. Engineer, C... – Cascading
André Kelpe's presentation at Hadoop User Group France - 25.11.2014.
Abstract: Cascading is a widely deployed, production-ready open source data application framework geared towards Java developers. Cascading enables developers to write complex data applications without the need to become a distributed systems expert. Cascading apps are portable between different computation frameworks, so that a given application can be moved from Hadoop onto new processing platforms like Apache Tez or Apache Spark without rewriting any of the application code.
Cascading - A Java Developer’s Companion to the Hadoop World – Cascading
Presentation by Dhruv Kumar, Sr. Field Engineer at Concurrent.
Amid all the hype and investment around Big Data technologies, many Java software engineers are asking what it takes to become big data engineers, and which path they should steer their careers toward.
Join Dhruv Kumar as he introduces Cascading, an open source application development framework that allows Java developers to build applications on top of Hadoop through its Java API. We’ll provide an overview of the landscape for developing applications on Hadoop and explain why Cascading has become so popular, comparing it to other abstractions such as Pig and Hive. Dhruv will also show you how Java developers can easily get started building applications on Hadoop with live examples of good ol’ Java code.
Elasticsearch + Cascading for Scalable Log Processing – Cascading
Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elastic Search". Elasticsearch is becoming a popular platform for log analysis with its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. Complemented with Cascading, the application development platform for building Data applications on Apache Hadoop, developers can correlate at scale multiple log and data streams to perform rich and complex log processing before making it available to the ELK stack.
Introduction to Cascading by Bryce Lohr
Presentation on Cascading delivered at the Triad Hadoop Users Group. This presentation provides a brief introduction to Cascading, a Java library for developing scalable Map/Reduce applications on Hadoop.
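The Map/Reduce model such a library targets can be illustrated in miniature. Below is a hypothetical in-process sketch of the two phases in plain Python rather than Hadoop Java, purely to show the shape of the computation a framework plans and distributes for you.

```python
from itertools import groupby
from operator import itemgetter

# Word count as two phases. On a real cluster a framework like Cascading
# plans these into Hadoop jobs; here both phases run in-process.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)          # emit (key, value) pairs

def reduce_phase(pairs):
    pairs = sorted(pairs)            # stands in for the shuffle/sort step
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data big apps", "data flows"]))
print(counts)  # {'apps': 1, 'big': 2, 'data': 2, 'flows': 1}
```

The value of a higher-level library is that developers compose pipelines of such operations while the framework handles partitioning, fault tolerance, and job scheduling.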
Bryce Lohr is a software developer at Inmar, focused on developing data analysis applications using Hadoop and related technologies.
https://www.linkedin.com/pub/bryce-lohr/3/589/225
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... – Zilliz
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Finetuning GenAI For Hacking and Defending – Priyanka Aash
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
"Making .NET Application Even Faster", Sergey Teplyakov – Fwdays
In this talk we're going to explore the performance improvement lifecycle, starting with setting performance goals, using profilers to find the bottlenecks, making a fix, and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their applications fast and understanding how things work under the hood.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan... – Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Retrieval Augmented Generation Evaluation with Ragas – Zilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
UiPath Community Day Amsterdam: Code, Collaborate, Connect – UiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
"Hands-on development experience using wasm Blazor", Furdak Vladyslav – Fwdays
I will share my personal experience of full-time development on wasm Blazor:
- the difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, and which technology stack and architectural patterns we chose
- the conclusions we drew and the mistakes we made
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
2. Confidential

WHAT'S BEHIND THE RISE OF CASCADING?

Enterprise IT teams designing their big data platforms must choose from a daunting array of development frameworks and compute fabrics. On the one hand, they want a development framework that leverages existing skillsets. At the same time, they want the flexibility to benefit from the performance gains of the latest, greatest compute fabrics. Cascading is a robust framework with over 10,000 known production deployments and over 275,000 downloads per month. Twitter, Airbnb, Climate Corp, Apple, eBay, and Netflix are a few of the enterprises that have built their Hadoop practices with Cascading. The Cascading user group is a diverse, self-supporting community that is driving innovation in Cascading's scalability, portability, performance and value. In addition, the large number of open source projects contributed by mainstream enterprises such as Netflix, Commonwealth Bank of Australia, and Expedia attests to the vibrancy of the Cascading ecosystem.

In this paper, we'll reveal what's behind Cascading's growth by digging into the results of a new Cascading user survey. In general, Cascading users turn out to be extremely concerned about reliability and performance at scale. Many experimented with early Hadoop frameworks like Hive and Pig, but found Cascading to be a more scalable approach. And lately, the easy portability of Cascading applications between compute fabrics has generated a lot of excitement in the community.
3.

CASCADING IS MOST POPULAR AMONG BUILDERS AND MANAGERS OF BIG DATA APPLICATIONS

[Chart: "What title best describes your role?" (N=121). Roles surveyed: Head/VP of IT, Head of IT Infrastructure, Application Manager/Director, BI/EDW Manager/Director, CIO/SVP of IT, IT Specialist, Architect, IT Manager or Director, Developer/Engineer. Background photo: Liverpool Street station crowd blur, by David Sim.]
4.

CASCADING COMMUNITY MEMBERS ARE MATURE, PRODUCTION USERS

[Chart: "How long have you been using Hadoop?" (N=69) — 0-12 months: 8%; 12-24 months: 26%; 24-36 months: 25%; over 3 years: 41%.]

The largest group of respondents (41%) has been using Hadoop for over three years. Assuming the sample is representative, the Cascading community largely consists of early Hadoop adopters. Furthermore, the Cascading community isn't just dabbling: over 84% have already put their Cascading applications into production or plan to do so. As for why, many likely found out the hard way that developing directly on Hadoop was painful, tedious and poorly suited to scale.

[Chart: "What challenges did you have that made you look for an application development framework?" Options: slow development in existing platform; high cost of development in existing platform; lack of skilled Hadoop resources; poor troubleshooting capabilities; difficult to integrate with existing systems; lack of portability across compute fabrics; lack of scalability; poor integration into existing IT infrastructure; other.]
5.

THE PATH TO CASCADING: HIVE, PIG, AND GUI TOOLS

Given the maturity of Cascading users, it's no surprise that many explored alternatives before settling on Cascading. The majority (51%) tried Hive and Pig, both of which were early abstraction layers for MapReduce. Today, many Pig applications run alongside Cascading, and many Hive applications run within Cascading.

Why didn't they stick with Hive and Pig? Most organizations determined they could not scale with Hive and Pig, typically because those frameworks required scarce technical resources and because development in them was slow. Those who opted for other API frameworks found them not yet ready for the enterprise.

A smaller group experimented with GUI-based ETL tools. While these tools made it easy to leverage existing resources and skill sets, their capabilities were too limited. They also required building special scripts to achieve complex functionality, which negated the benefits of simplicity. Additionally, many users did not like being locked into a single-vendor solution.

[Chart: "Before selecting Cascading, what alternative solutions did you explore? (select all that apply)" (N=69) — Pig: 26%; Hive: 25%; other API frameworks (Spark, Crunch): 22%; GUI-based ETL tools (Talend, Informatica, Pentaho): 19%; no other alternatives were explored: 8%.]
6.

PORTABILITY ACROSS FABRICS

[Chart: "Which compute fabric(s) are you using or planning to use in the next 18 months?" (N=69) — Spark, MapReduce, Kafka, Storm, Tez, Flink, other.]

New compute fabrics appear all the time, though not all are production-ready. The responses reflect high interest in Spark and a desire for true streaming (not micro-batches). MapReduce isn't going away any time soon, especially where reliability is a requirement. Still, many are experimenting with other compute fabrics. Because each fabric offers application-specific advantages, most organizations will likely wind up running multiple fabrics.

Cascading 3.0 supports Tez, MapReduce, and local/in-memory execution, so users can port applications from MapReduce to Tez simply by changing a few lines of code. Easy portability makes Cascading an ideal platform for moving from MapReduce to Tez without incurring the cost of rewriting applications. Soon, Cascading will support the same portability for Spark and Flink (for Flink, support will be community-contributed).
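To make the "few lines of code" claim concrete, here is a minimal word-count sketch, assuming Cascading 3.x with the cascading-local, cascading-hadoop2-mr1, and cascading-hadoop2-tez modules on the classpath (file paths here are illustrative). The pipe assembly is fabric-agnostic; porting it means swapping the FlowConnector and, for Hadoop fabrics, using Hfs taps instead of local FileTap taps:

```java
import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    // Fabric-agnostic pipe assembly: split lines into words, group, count.
    Pipe pipe = new Pipe("wordcount");
    pipe = new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    pipe = new GroupBy(pipe, new Fields("word"));
    pipe = new Every(pipe, new Count(new Fields("count")));

    // Local taps; on a Hadoop fabric these would be Hfs taps instead.
    Tap source = new FileTap(new TextLine(), "input.txt");
    Tap sink = new FileTap(new TextLine(), "output.txt", SinkMode.REPLACE);

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(pipe, source)
        .addTailSink(pipe, sink);

    // Porting between fabrics means changing this connector:
    FlowConnector connector = new LocalFlowConnector();          // local/in-memory
    // FlowConnector connector = new Hadoop2MR1FlowConnector();  // MapReduce
    // FlowConnector connector = new Hadoop2TezFlowConnector();  // Tez

    connector.connect(flowDef).complete();
  }
}
```

The commented-out connectors (from the cascading-hadoop2-mr1 and cascading-hadoop2-tez modules) are the only lines that differ per fabric; the business logic above them is untouched.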
7.

CASCADING BRIDGES OTHER DEVELOPMENT FRAMEWORKS

Despite their shortcomings, MapReduce, Hive and Pig are still widely used as development frameworks, largely because many early Hadoop applications were built through these interfaces. No surprise that we also see a lot of excitement about Spark as a new development framework; many users are experimenting with developing directly in the Spark API. Cascading will support Spark in a future WIP release, adding an important framework option for Spark developers. Developers who build in Cascading will be able to port their applications from MapReduce to Spark without having to rewrite them in the Spark API.

In summary, there is no one-size-fits-all framework. Flexibility is key as organizations build out their big data strategies and platforms.

[Chart: "What data application development framework do you use?" (N=69) — Spark, Cascading, MapReduce, Hive, Pig, Scalding, Cascalog.]

"[Cascading] Best Hadoop API for enterprise data-intensive apps." – Architect, Fortune 500 Healthcare Payer
8.

COMMON USE CASES: ETL, ANALYTICS & DATA INTEGRATION

Most organizations rely on Hadoop for heavy processing steps within ETL, analytics or data integration flows. Some have moved their entire ETL processing to Hadoop, while others have moved only portions of their workflows. For example, Airbnb uses Cascading for complicated infrastructure tasks such as data normalization and cleansing. Airbnb also leverages Cascading for reconstructing corrupted files and merging data. In combination with Cascading, analysts use Pig and Hive to run batch scripts for ad hoc analysis. With these tools, analysts can more easily study crucial metrics like click-through rates, page statistics, and drop-off rates.

[Chart: "What best describes the projects where you are using Cascading?" (N=69) — ETL, analytics, data integration, machine learning and scoring, data quality, recommendation engines, search optimization, other. Top responses: 45% offloading ETL to Hadoop; 40% supporting analytics/BI projects; 33% data integration projects.]
9. Confidential
Extremely
likely - 10
23%
9
10%
8
20%
7
19%
6
11%
5
6%
4
1%
3
3%
2
4%
Not at all
likely - 0
3%
How likely is it that you would
recommend Cascading to a friend or
colleague?
WHY
THEY
LOVE
CASCADING:
TDD,
JAVA
API,
PORTABILITY
N=79
Top
3
Most
Impactful
Capabilities
v Test
Driven
Development
(49%)
-‐ Efficiently
test
code
and
process
local
files
before
you
deploy
on
a
cluster
with
Cascading’s
local
or
in-‐
memory
mode.
Incorporate
inline
data
assertions
to
define
results
at
any
point
in
your
pipeline.
Failed
assertions
are
easily
visible
and
available
for
analysis.
v JavaAPI
(44%)
-‐ Cascading
is
a
Java
library
and
does
not
require
installation.
Cascading
fits
directly
into
a
standard
development
process;
all
you
have
to
do
is
code
to
the
API.
v Application
Portability
(43%)
-‐ When
you
compile
a
Cascading
job,
it
automatically
creates
a
run-‐time
executable
for
your
specified
compute
fabric.
Simply
by
changing
a
few
lines
of
code,
you
can
test
your
application
on
multiple
fabrics
and
choose
the
best
for
your
needs.
53%Of Respondents
are Promoters
(8/10)
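The inline data assertions behind the test-driven development capability can be sketched as follows, assuming cascading-core on the classpath (the helper class and method names are hypothetical):

```java
import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertNotNull;
import cascading.operation.assertion.AssertSizeEquals;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

public class PipelineAssertions {
  // Hypothetical helper: decorates any pipe with strict inline data assertions.
  public static Pipe withAssertions(Pipe pipe) {
    // Fail fast if any tuple streaming past this point contains a null value...
    pipe = new Each(pipe, AssertionLevel.STRICT, new AssertNotNull());
    // ...or does not contain exactly two fields.
    pipe = new Each(pipe, AssertionLevel.STRICT, new AssertSizeEquals(2));
    return pipe;
  }
}
```

When a flow is planned with a lower assertion level (for example, for a production deployment), the planner removes assertion stages above that level, so checks used during local testing add no cost in production.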
11.

CASCADING SLASHES TIME TO MARKET

Most respondents improved time to market by at least 40% (N=79).

[Chart: "What percentage would you estimate your time to market has improved?" — over 300%: 5%; over 100%: 17%; 80%-100%: 12%; 60%-80%: 18%; 40%-60%: 17%; 20%-40%: 18%; less than 20%: 13%.]
12.

THE FUTURE: BETTER PERFORMANCE, DATA PIPELINE VISIBILITY

[Chart: "What future challenges do you anticipate in managing your data applications?" (N=69) — optimizing application performance; identifying and resolving Hadoop application issues faster; monitoring SLAs for Hadoop applications; forecasting big data infrastructure needs; supporting chargeback models; other.]

Application performance management is a top-of-mind concern for most respondents. While performance tuning happens on the operations side, optimizing applications to meet service-level commitments is usually a collaborative effort between development and operations teams. Developers need better tools to visualize data pipelines and detect undesirable behavior before they promote applications to production. Operations teams need better tools to monitor, manage and optimize data delivery.

An important, though secondary, concern is tracking the rate of Hadoop resource consumption so clusters can be right-sized and costs distributed across divisions. This is particularly true as more of an organization's departments and teams build and rely on big data applications, transforming their Hadoop cluster from a side project into core production IT infrastructure.

With new application performance management tools such as Driven, teams can visualize data pipelines and identify unwanted behavior more effectively. Tools like Driven also arm teams with the data necessary to pinpoint issues quickly and resolve them collaboratively.
14.

DISTRIBUTIONS

[Chart: "Distributions" (N=69) — respondent counts for Cloudera, Amazon EMR, Apache Hadoop, Hortonworks, MapR, and other.]
15.

NUMBER OF APPLICATIONS AND VOLUME

[Chart: "Average Number of Cascading Applications and Pipelines" (N=69) — cross-tabulating the number of Cascading applications (1-5 through over 100) against pipeline volume (less than 250 through over 10,000 pipelines). Most respondents run fewer than 250 pipelines.]
16.

PRODUCTION STATUS

[Chart: "Are you using your Cascading data applications in a production environment?" (N=69) — yes; not yet but planned; no and not planned.]