Technologies for
Data Analytics Platform
YAPC::Asia Tokyo 2015 - Aug 22, 2015
Who are you?
• Masahiro Nakagawa
• github: @repeatedly
• Treasure Data Inc.
• Fluentd / td-agent developer
• I love OSS :)
• D Language, MessagePack, The organizer of several meetups, etc…
Why do we analyze data?
Exploratory data analysis
Confirmatory data analysis
Need data, data, data!
This means we need
a data analysis platform
for our own requirements
Data Analytics Flow
Collect Store Process Visualize
Data source
Let’s launch platform!
• Easy to use and maintain
• Single server
• RDBMS is popular and has huge ecosystem

ETL Query
Extract + Transformation + Load
Oops! An RDBMS is not good at
analytics over large data volumes.
We need more speed and scalability!
Let’s consider
Parallel RDBMS instead!
Parallel RDBMS
• Optimized for OLAP workload
• Columnar storage, Shared nothing, etc…
• Netezza, Teradata, Vertica, Greenplum, etc…

time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …
• Good data format for analytics workload
• Read only selected columns, efficient compression
• Not good for insert / update

Columnar Storage
time code method
2015-12-01 10:02:36 200 GET
2015-12-01 10:22:09 404 GET
2015-12-01 10:36:45 200 GET
2015-12-01 10:49:21 200 POST
… … …
Row Columnar
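A minimal sketch of the row versus columnar layouts, in plain Python. The sample rows mirror the table on the slide; the function names are illustrative assumptions and do not come from any real storage engine.

```python
# Row layout: each record is stored together.
rows = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]


def to_columnar(rows, names):
    """Pivot a list of row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


columns = to_columnar(rows, ("time", "code", "method"))


def count_status(columns, status):
    """Scan only the 'code' column; 'time' and 'method' are never read."""
    return sum(1 for code in columns["code"] if code == status)


def run_length_encode(values):
    """Repeated values within one column compress well, e.g. with RLE."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded
```

Appending one new record to the columnar form means touching every column list, which is one way to see why this layout is not good for insert / update.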
Okay, query is now
processed normally.

No silver bullet
• Performance depends on data modeling and query
• distkey and sortkey are important
• should reduce data transfer and IO Cost
• query should take advantage of these keys
• There are some problems
• Cluster scaling, metadata management, etc…
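A rough sketch of why a distribution key matters, assuming a hypothetical 4-node cluster and made-up `users` and `orders` tables. Rows are placed by hashing the key, so a join on that key can run node by node without moving data. Real parallel RDBMSs are far more elaborate; this only illustrates the idea.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for(key, num_nodes=NUM_NODES):
    """Pick a node by hashing the distribution key (stable across runs)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_index, num_nodes=NUM_NODES):
    """Place each row on the node chosen by its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_index], num_nodes)].append(row)
    return nodes


users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, 30), (1, 12), (3, 7)]

# Both tables use the user id as the distribution key.
user_nodes = distribute(users, key_index=0)
order_nodes = distribute(orders, key_index=0)


def local_join(user_nodes, order_nodes):
    """Join node by node; matching rows already live on the same node."""
    joined = []
    for local_users, local_orders in zip(user_nodes, order_nodes):
        names = dict(local_users)
        joined.extend((names[uid], amount) for uid, amount in local_orders)
    return sorted(joined)
```

If `orders` were distributed by a different column, the join would first have to reshuffle rows between nodes, which is the data transfer cost the slide warns about.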
Performance is good :)
But we often want to change the schema

for new workloads. Now it is

hard to maintain the schema and its data…
Okay, let’s separate
data sources into multiple
layers for reliable platform
Schema on Write(RDBMS)
• Writing data using schema

for improving query performance
• Pros:
• minimum query overhead
• Cons:
• Need to design schema and workload beforehand
• Data load is expensive operation
Schema on Read(Hadoop)
• Writing data without schema and

map schema at query time
• Pros:
• Robust over schema and workload change
• Data load is cheap operation
• Cons:
• High overhead at query time
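A small illustration of schema on read, assuming raw logs are stored as JSON lines. The `SCHEMA` mapping and the helper names are hypothetical. The point is that parsing and casting happen at query time, so records with missing or extra fields still load cheaply.

```python
import json

# Raw events are stored as-is; records may gain or lose fields over time.
raw_log = [
    '{"time": "2015-12-01 10:02:36", "code": "200", "method": "GET"}',
    '{"time": "2015-12-01 10:22:09", "code": "404"}',
    '{"time": "2015-12-01 10:36:45", "code": "200", "method": "GET", "ua": "curl"}',
]

# The schema lives with the query, not with the storage.
SCHEMA = {"time": str, "code": int, "method": str}


def read_with_schema(lines, schema):
    """Parse each raw line and cast it to the schema at query time."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


def count_by_code(lines):
    """A 'query': every call pays the parsing and casting overhead again."""
    counts = {}
    for row in read_with_schema(lines, SCHEMA):
        counts[row["code"]] = counts.get(row["code"], 0) + 1
    return counts
```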
Data Lake
• Schema management is hard
• Volume is increasing and format is often changed
• There are lots of log types
• A feasible approach is to store raw data and

convert it before analysis
• Data Lake is a single storage for any logs
• Note that there is no clear definition yet
Data Lake Patterns
• Use DFS, e.g. HDFS, for log storage
• ETL or data processing by Hadoop ecosystem
• Can convert logs via ingestion tools beforehand
• Use Data Lake storage and related tools
• These storages support Hadoop ecosystem
Apache Hadoop
• Distributed computing framework
• First implementation based on Google MapReduce
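The MapReduce model can be sketched on a single machine as three steps: map, shuffle and reduce. This is an illustrative toy with assumed function names, not Hadoop's API; a real framework runs the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

access_log = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    _time, code, _method = record
    yield code, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values that share one key."""
    return key, sum(values)


def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```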

Data load becomes robust!
Raw data Transformed data
Apache Tez
• Low level framework for YARN Applications
• Hive, Pig, new query engine and more
• Task and DAG based processing flow

ProcessorInput Output
Task DAG
MapReduce vs Tez
MapReduce Tez
SELECT g1.x, g1.avg, g2.cnt

FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2 ON (g1.x = g2.x) ORDER BY avg;
GROUP a BY a.x, GROUP b BY b.x
JOIN (a, b)
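The query can be tried on a tiny data set with Python's built-in `sqlite3` module. This is only a sketch: SQLite is a single-node engine standing in for Hive, the contents of tables `a` and `b` are made up, and the query is written with `AVG` and a `cnt` alias so that it runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x TEXT, y REAL);
    CREATE TABLE b (x TEXT, y REAL);
    INSERT INTO a VALUES ('k1', 1.0), ('k1', 3.0), ('k2', 10.0);
    INSERT INTO b VALUES ('k1', 5.0), ('k2', 6.0), ('k2', 7.0);
""")

# Two independent GROUP BY sub-queries, then a JOIN and an ORDER BY.
QUERY = """
    SELECT g1.x, g1.avg, g2.cnt
    FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
    JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
      ON (g1.x = g2.x)
    ORDER BY g1.avg
"""

result = conn.execute(QUERY).fetchall()
```

The two sub-queries do not depend on each other, which is what lets a DAG-based engine schedule them as separate branches that meet at the join.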
• HDFS and YARN have SPOF
• Recent versions have no SPOF in either
MapReduce 1 or MapReduce 2
• Can’t build from scratch
• Really? Treasure Data builds Hadoop on CircleCI.

Cloudera, Hortonworks and MapR too.
• They also check its dependent toolchain.
Which Hadoop package

should we use?
• Distribution by Hadoop distributor is better
• CDH by Cloudera
• HDP by Hortonworks
• MapR distribution by MapR
• If you are familiar with Hadoop and its ecosystem,

Apache community edition becomes an option.
• For example, Treasure Data has patches and

they want to use patched version.
Good :)
In addition, we want to
collect data in an efficient way!
Ingestion tools
• There are two execution models
• Bulk load:
• For high throughput
• Most tools transfer data in batches and in parallel
• Streaming load:
• For low latency
• Most tools transfer data in micro-batches
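Micro-batching can be sketched with a generator that groups a stream into small chunks. This is a simplification: the `batch_size` parameter is an assumption for illustration, and real collectors typically also flush on a time interval, which is omitted here.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Group an event stream into small batches, yielding each once it fills."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A bulk loader would instead read the whole input and split it into a few large chunks that are transferred in parallel; the smaller the batch, the lower the latency and the higher the per-record overhead.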
Bulk load tools
• Embulk
• Pluggable bulk data loader for

various inputs and outputs
• Write plugins using Java and JRuby
• Sqoop
• Data transfer between Hadoop and RDBMS
• Included in some distributions
• Or each bulk loader for each data store
Streaming load tools
• Fluentd
• Pluggable and json based streaming collector
• Lots of plugins in rubygems
• Flume
• Mainly for Hadoop ecosystem, HDFS, HBase, …
• Included in some distributions
• Or Logstash, Heka, Splunk, etc…
Data ingestion also

becomes robust and efficient!
Raw data Transformed data
It works! But…

we want to issue ad-hoc queries against the entire data set.
We can’t wait for data to be loaded into a database.
You can use MPP query
engine for data stores.
MPP query engine
• It has no storage of its own, unlike a parallel RDBMS
• Follow “Schema on Read” approach
• data distribution depends on backend
• data schema also depends on backend
• Some products are called “SQL on Hadoop”
• Presto, Impala, Apache Drill, etc…
• It has its own execution engine and does not use MapReduce.
Presto
• Distributed query engine for interactive queries

against various data sources and large data.
• Pluggable connector for joining multiple backends
• You can join MySQL and HDFS data in one query
• Lots of useful functions for data analytics
• window functions, approximate query,

machine learning, etc…
[Figure: batch analysis platform (Hive, daily/hourly batch) and visualization platform (PostgreSQL, etc., interactive query, BI tools, dashboard)]
✓ Less scalable
✓ Extra cost
✓ More work to manage 2 platforms
✓ Can’t query against “live” data directly
Cassandra MySQL Commercial DBs
SQL on any data sets Commercial

BI Tools
✓ IBM Cognos

✓ Tableau

✓ ...
Data analysis platform
Coordinator Connector

Storage / Metadata
Discovery Service
Execution Model
MapReduce (map, reduce, map, reduce)
✓ Write data to disk
✓ Wait between stages
Presto (tasks connected by data transfer)
All stages are pipe-lined
✓ No wait time
✓ No fault-tolerance
Data transfer
✓ No disk IO
✓ Data chunk must fit in memory
Okay, we now have a combination
of low latency and batch.
Raw data
Resolved our concern! But…
we also need quick estimation.
Currently, there are several
stream processing systems.
Let’s try!!
Apache Storm
• Distributed realtime processing framework
• Low latency: tuple at a time
• Trident mode uses micro batch

Norikra
• Schema-less CEP engine for stream processing
• Uses SQL like Esper EPL
• Not distributed, unlike Storm, for now

Calculated result
Great! We can get insights in
both streaming and batch ways :)
One more. We can make data
transfer more reliable for multiple
data streams with distributed queue
• Distributed messaging system
• Producer - Broker - Consumer pattern
• Pull model, replication, etc…

Apache Kafka
Push vs Pull
• Push:
• Easy to transfer data to multiple destinations
• Hard to control the rate of each stream when there are multiple streams
• Pull:
• Easy to control the stream rate
• Should manage consumers correctly
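The pull model can be sketched with an append-only log and consumers that track their own offsets. The `Broker` and `Consumer` classes are hypothetical and are not Kafka's API; they only show why pull makes the rate easy to control and why consumer state has to be managed.

```python
class Broker:
    """Append-only log; consumers pull from an offset they track themselves."""

    def __init__(self):
        self.log = []

    def publish(self, message):
        self.log.append(message)

    def fetch(self, offset, max_messages):
        """Return at most max_messages starting at the given offset."""
        return self.log[offset:offset + max_messages]


class Consumer:
    def __init__(self, broker, max_messages):
        self.broker = broker
        self.max_messages = max_messages  # the consumer decides its own rate
        self.offset = 0  # losing this offset means re-reading or skipping data

    def poll(self):
        batch = self.broker.fetch(self.offset, self.max_messages)
        self.offset += len(batch)
        return batch
```

A fast consumer and a slow consumer can read the same log at different speeds without the producer knowing about either of them.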
This is a modern analytics platform
Seems complex and hard to
Let’s use useful services!
Amazon Redshift
• Parallel RDBMS on AWS
• Re-use traditional Parallel RDBMS know-how
• Scale is easier than traditional systems
• Combining it with Amazon EMR is popular
1. Store data into S3
2. EMR processes S3 data
3. Load processed data into Redshift
• EMR has Hadoop ecosystem
Using AWS Services
Google BigQuery
• Distributed query engine and scalable storage
• Tree model, Columnar storage, etc…
• Separate storage from workers
• High performance query by Google infrastructure
• Lots of workers
• Storage / IO layer on Colossus
• Can’t manage Parallel RDBMS properties like distkey,

but it works in most cases.
BigQuery architecture
Using GCP Services
Treasure Data
• Cloud based end-to-end data analytics service
• Hive, Presto, Pig and Hivemall for one big repository
• Lots of ingestion and output methods, scheduling, etc…
• No stream processing for now
• Service concept is Data Lake
• JSON based schema-less storage
• Execution model is similar to BigQuery
• Separate storage from workers
• Can’t specify Parallel RDBMS properties
Using Treasure Data Service
Resource Model Trade-off
Model | Pros | Cons
Fully Guaranteed | Stable execution, Easy to control resource | No boost mechanism
Guaranteed with | Stable execution, Good scalability | Less controllable resource
Fully multi-tenanted | Boosted performance, Great scalability | Unstable execution
MS Azure also has useful services:
DataHub, SQL DWH, DataLake,
Stream Analytics, HDInsight…
Use service or build a platform?
• Should consider using service first
• AWS, GCP, MS Azure, Treasure Data, etc…
• Important factor is data analytics, not platform
• Do you have enough resources to maintain it?
• If specific analytics platform is a differentiator,

building a platform is better
• Use state-of-the-art technologies
• Hard to implement on existing platforms
• There is a lot of software and many services for data analytics
• Lots of trade-offs: performance, complexity,
connectivity, execution model, etc.
• SQL is the primary language of data analytics
• Focus on your goal!
• Is a data analytics platform your business core?

If not, consider using services first.
Cloud service for
entire data pipeline!
Apache Spark
• Another Distributed computing framework
• Mainly for in-memory computing with DAG
• RDD and DataFrame based clean API
• Combination with Hadoop is popular

Apache Flink
• Streaming based execution engine
• Support batch and pipelined processing
• Hadoop and Spark are batch based
Batch vs Pipelined
Batch (Staged)
✓ Wait between stages
Pipelined
All stages are pipe-lined
✓ No wait time
✓ fault-tolerance with checkpointing
Data transfer
✓ use disk if needed
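The difference between staged and pipelined execution can be sketched in plain Python: the staged version materializes the full output of one stage before the next starts, while the pipelined version uses a generator so each record flows onward immediately. The `trace` list is an assumption added only to make the ordering visible.

```python
def staged(records, trace):
    """Batch (staged): each stage finishes and materializes its output first."""
    parsed = []
    for r in records:
        trace.append(f"parse {r}")
        parsed.append(int(r))
    total = 0
    for value in parsed:
        trace.append(f"sum {value}")
        total += value
    return total


def pipelined(records, trace):
    """Pipelined: a record moves to the next stage as soon as it is produced."""

    def parse(items):
        for r in items:
            trace.append(f"parse {r}")
            yield int(r)

    total = 0
    for value in parse(records):
        trace.append(f"sum {value}")
        total += value
    return total
```

Both functions return the same sum; only the interleaving recorded in `trace` differs, which is where the "no wait time" of pipelined engines comes from.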
• Tableau
• Popular BI tool in many areas
• Awesome GUI, easy to use, lots of charts, etc
• Metric Insights
• Dashboard for many metrics
• Scheduled query, custom handler, etc
• Chartio
• Cloud based BI tool
How to manage job dependency?
We want to issue Job X
after Job A and Job B are finished.
Data pipeline tool
• There are some important features
• Manage job dependency
• Handle job failure and retry
• Easy to define topology
• Separate tasks into sub-tasks
• Apache Oozie, Apache Falcon, Luigi, Airflow, JP1,
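Dependency ordering and retry, two of the features listed, can be sketched with the standard library alone. This assumes Python 3.9+ for `graphlib`; the job names and the `run_pipeline` helper are hypothetical and far simpler than what the listed tools provide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each job maps to the set of jobs that must finish before it starts.
dependencies = {
    "job_x": {"job_a", "job_b"},
    "job_a": set(),
    "job_b": set(),
}


def run_pipeline(dependencies, run_job, max_retries=2):
    """Run jobs in dependency order, retrying a failed job a few times."""
    finished = []
    for job in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                run_job(job)
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # give up; downstream jobs never start
        finished.append(job)
    return finished
```

Calling `run_pipeline(dependencies, print)` prints `job_a` and `job_b` in some order and `job_x` last.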
Luigi
• Python module for building job pipelines
• Write Python code and run it.
• A task is defined as a Python class
• Easy to manage by VCS
• Need some extra tools
• scheduled job, job history, etc…
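import luigi

class T1(luigi.Task):
    def requires(self):
        # dependencies: upstream tasks that must finish first
        return []
    def output(self):
        # store result: return a luigi Target here
        pass
    def run(self):
        # task body
        pass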
Airflow
• Python and DAG based workflow
• Write Python code, but only to define a DAG
• Task is defined by Operator
• There are good features
• Management web UI
• Task information is stored into database
• Celery based distributed execution
from airflow import DAG

dag = DAG('example')
# Operator stands for any concrete operator class
t1 = Operator(..., dag=dag)
t2 = Operator(..., dag=dag)

More Related Content

What's hot

Planet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: BigdamPlanet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: Bigdam
Empowering developers to deploy their own data stores
Empowering developers to deploy their own data storesEmpowering developers to deploy their own data stores
Empowering developers to deploy their own data stores
Tomas Doran
Presto changes
Presto changesPresto changes
Presto changes
N Masahiro
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
Sadayuki Furuhashi
How to create Treasure Data #dotsbigdata
How to create Treasure Data #dotsbigdataHow to create Treasure Data #dotsbigdata
How to create Treasure Data #dotsbigdata
N Masahiro
Norikra Recent Updates
Norikra Recent UpdatesNorikra Recent Updates
Norikra Recent Updates
Api world apache nifi 101
Api world   apache nifi 101Api world   apache nifi 101
Api world apache nifi 101
Timothy Spann
Embulk and Machine Learning infrastructure
Embulk and Machine Learning infrastructureEmbulk and Machine Learning infrastructure
Embulk and Machine Learning infrastructure
Hiroshi Toyama
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
Erik Onnen
Pulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerPulsarctl & Pulsar Manager
Pulsarctl & Pulsar Manager
ApacheCon 2021: Apache NiFi 101- introduction and best practices
ApacheCon 2021:   Apache NiFi 101- introduction and best practicesApacheCon 2021:   Apache NiFi 101- introduction and best practices
ApacheCon 2021: Apache NiFi 101- introduction and best practices
Timothy Spann
Cracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksCracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworks
Timothy Spann
Treasure Data and AWS - 2015
Treasure Data and AWS - 2015Treasure Data and AWS - 2015
Treasure Data and AWS - 2015
N Masahiro
Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定
Yoshiyasu SAEKI
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020
Timothy Spann
January 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka PresentationJanuary 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka Presentation
Yahoo Developer Network
ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300
Timothy Spann
JRuby with Java Code in Data Processing World
JRuby with Java Code in Data Processing WorldJRuby with Java Code in Data Processing World
JRuby with Java Code in Data Processing World
Puppet at Spotify
Puppet at SpotifyPuppet at Spotify
Puppet at Spotify
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello World
Josh Fischer

What's hot (20)

Planet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: BigdamPlanet-scale Data Ingestion Pipeline: Bigdam
Planet-scale Data Ingestion Pipeline: Bigdam
Empowering developers to deploy their own data stores
Empowering developers to deploy their own data storesEmpowering developers to deploy their own data stores
Empowering developers to deploy their own data stores
Presto changes
Presto changesPresto changes
Presto changes
Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014Presto - Hadoop Conference Japan 2014
Presto - Hadoop Conference Japan 2014
How to create Treasure Data #dotsbigdata
How to create Treasure Data #dotsbigdataHow to create Treasure Data #dotsbigdata
How to create Treasure Data #dotsbigdata
Norikra Recent Updates
Norikra Recent UpdatesNorikra Recent Updates
Norikra Recent Updates
Api world apache nifi 101
Api world   apache nifi 101Api world   apache nifi 101
Api world apache nifi 101
Embulk and Machine Learning infrastructure
Embulk and Machine Learning infrastructureEmbulk and Machine Learning infrastructure
Embulk and Machine Learning infrastructure
From 100s to 100s of Millions
From 100s to 100s of MillionsFrom 100s to 100s of Millions
From 100s to 100s of Millions
Pulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerPulsarctl & Pulsar Manager
Pulsarctl & Pulsar Manager
ApacheCon 2021: Apache NiFi 101- introduction and best practices
ApacheCon 2021:   Apache NiFi 101- introduction and best practicesApacheCon 2021:   Apache NiFi 101- introduction and best practices
ApacheCon 2021: Apache NiFi 101- introduction and best practices
Cracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworksCracking the nut, solving edge ai with apache tools and frameworks
Cracking the nut, solving edge ai with apache tools and frameworks
Treasure Data and AWS - 2015
Treasure Data and AWS - 2015Treasure Data and AWS - 2015
Treasure Data and AWS - 2015
Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定Spark Streamingによるリアルタイムユーザ属性推定
Spark Streamingによるリアルタイムユーザ属性推定
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020
January 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka PresentationJanuary 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka Presentation
ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300ApacheCon 2021 - Apache NiFi Deep Dive 300
ApacheCon 2021 - Apache NiFi Deep Dive 300
JRuby with Java Code in Data Processing World
JRuby with Java Code in Data Processing WorldJRuby with Java Code in Data Processing World
JRuby with Java Code in Data Processing World
Puppet at Spotify
Puppet at SpotifyPuppet at Spotify
Puppet at Spotify
Streaming 101: Hello World
Streaming 101:  Hello WorldStreaming 101:  Hello World
Streaming 101: Hello World

Viewers also liked

Kenshin Yamada
PayPal導入事例 CrowdWorks編
PayPal導入事例 CrowdWorks編PayPal導入事例 CrowdWorks編
PayPal導入事例 CrowdWorks編
toru iwashita
初期費用ゼロ円のマイホーム For pay palイベント
初期費用ゼロ円のマイホーム For pay palイベント初期費用ゼロ円のマイホーム For pay palイベント
初期費用ゼロ円のマイホーム For pay palイベント
Daisuke Kimura
20150723AWS startup tech_meetup
20150723AWS startup tech_meetup20150723AWS startup tech_meetup
20150723AWS startup tech_meetup
Recruit Lifestyle Co., Ltd.
Fluentd v0.14 Overview
Fluentd v0.14 OverviewFluentd v0.14 Overview
Fluentd v0.14 Overview
N Masahiro
Recruit Technologies
Ken Morishita
JIRA meets Tableau & AWS
JIRA meets Tableau & AWSJIRA meets Tableau & AWS
JIRA meets Tableau & AWS
Recruit Lifestyle Co., Ltd.
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
Recruit Lifestyle Co., Ltd.
Recruit Lifestyle Co., Ltd.
Developer summit 2015 gcp
Developer summit 2015   gcpDeveloper summit 2015   gcp
Developer summit 2015 gcp
Google Cloud Platform - Japan
Recruit Lifestyle Co., Ltd.
Fluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API DetailsFluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API Details
NIPS2016 Supervised Word Mover's Distance
NIPS2016 Supervised Word Mover's DistanceNIPS2016 Supervised Word Mover's Distance
NIPS2016 Supervised Word Mover's Distance
Recruit Lifestyle Co., Ltd.
How To Write Middleware In Ruby
How To Write Middleware In RubyHow To Write Middleware In Ruby
How To Write Middleware In Ruby
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Amazon Web Services
Junpei Tsuji
リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜
リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜 リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜
リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜
Yugo Shimizu
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線
CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線
CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線
Recruit Lifestyle Co., Ltd.

Viewers also liked (20)

PayPal導入事例 CrowdWorks編
PayPal導入事例 CrowdWorks編PayPal導入事例 CrowdWorks編
PayPal導入事例 CrowdWorks編
初期費用ゼロ円のマイホーム For pay palイベント
初期費用ゼロ円のマイホーム For pay palイベント初期費用ゼロ円のマイホーム For pay palイベント
初期費用ゼロ円のマイホーム For pay palイベント
20150723AWS startup tech_meetup
20150723AWS startup tech_meetup20150723AWS startup tech_meetup
20150723AWS startup tech_meetup
Fluentd v0.14 Overview
Fluentd v0.14 OverviewFluentd v0.14 Overview
Fluentd v0.14 Overview
JIRA meets Tableau & AWS
JIRA meets Tableau & AWSJIRA meets Tableau & AWS
JIRA meets Tableau & AWS
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
CET(Capture EveryThing)プロジェクトにおけるﰀ機械学 習・データマイニング最前線
Developer summit 2015 gcp
Developer summit 2015   gcpDeveloper summit 2015   gcp
Developer summit 2015 gcp
Fluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API DetailsFluentd v0.14 Plugin API Details
Fluentd v0.14 Plugin API Details
NIPS2016 Supervised Word Mover's Distance
NIPS2016 Supervised Word Mover's DistanceNIPS2016 Supervised Word Mover's Distance
NIPS2016 Supervised Word Mover's Distance
How To Write Middleware In Ruby
How To Write Middleware In RubyHow To Write Middleware In Ruby
How To Write Middleware In Ruby
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQLAnnouncing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
Announcing Amazon Athena - Instantly Analyze Your Data in S3 Using SQL
リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜
リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜 リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜
リアルタイムサーバー 〜Erlang/OTPで作るPubSubサーバー〜
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線
CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線
CET (Capture EveryThing)プロジェクトにおける機械学習・データマイニング最前線

Similar to Technologies for Data Analytics Platform

Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Adaryl "Bob" Wakefield, MBA
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Andrew Brust
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Andrew Brust
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
ch adnan
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Farzad Nozarian
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Apache drill
Apache drillApache drill
Apache drill
MapR Technologies
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Roorkee College of Engineering, Roorkee
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
Joe Alex
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
Sandeep Singh
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Taldor Group
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu

Similar to Technologies for Data Analytics Platform (20)

Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
Architecting Your First Big Data Implementation
Architecting Your First Big Data ImplementationArchitecting Your First Big Data Implementation
Architecting Your First Big Data Implementation
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Building Scalable Big Data Infrastructure Using Open Source Software Presenta...
Apache drill
Apache drillApache drill
Apache drill
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
Introduction to Hadoop and Big Data
Introduction to Hadoop and Big DataIntroduction to Hadoop and Big Data
Introduction to Hadoop and Big Data
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in HadoopBackup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop

More from N Masahiro

Fluentd Project Intro at Kubecon 2019 EU
Fluentd Project Intro at Kubecon 2019 EUFluentd Project Intro at Kubecon 2019 EU
Fluentd Project Intro at Kubecon 2019 EU
N Masahiro
Fluentd v1 and future at techtalk
Fluentd v1 and future at techtalkFluentd v1 and future at techtalk
Fluentd v1 and future at techtalk
N Masahiro
Fluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at KubeconFluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at Kubecon
N Masahiro
Fluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellFluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshell
N Masahiro
Fluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshellFluentd v1.0 in a nutshell
Fluentd v1.0 in a nutshell
N Masahiro
Fluentd at HKOScon
Fluentd at HKOSconFluentd at HKOScon
Fluentd at HKOScon
N Masahiro
Dive into Fluentd plugin v0.12
Dive into Fluentd plugin v0.12Dive into Fluentd plugin v0.12
Dive into Fluentd plugin v0.12
N Masahiro
Fluentd v0.12 master guide
Fluentd v0.12 master guideFluentd v0.12 master guide
Fluentd v0.12 master guide
N Masahiro
Fluentd and Embulk Game Server 4
Fluentd and Embulk Game Server 4Fluentd and Embulk Game Server 4
Fluentd and Embulk Game Server 4
N Masahiro
Fluentd Unified Logging Layer At Fossasia
Fluentd Unified Logging Layer At FossasiaFluentd Unified Logging Layer At Fossasia
Fluentd Unified Logging Layer At Fossasia
N Masahiro
Treasure Data and OSS
Treasure Data and OSSTreasure Data and OSS
Treasure Data and OSS
N Masahiro
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
N Masahiro
Fluentd - road to v1 -
Fluentd - road to v1 -Fluentd - road to v1 -
Fluentd - road to v1 -
N Masahiro
Fluentd: Unified Logging Layer at CWT2014
Fluentd: Unified Logging Layer at CWT2014Fluentd: Unified Logging Layer at CWT2014
Fluentd: Unified Logging Layer at CWT2014
N Masahiro
SQL for Everything at CWT2014
SQL for Everything at CWT2014SQL for Everything at CWT2014
SQL for Everything at CWT2014
N Masahiro
Can you say the same words even in oss
Can you say the same words even in ossCan you say the same words even in oss
Can you say the same words even in oss
N Masahiro
I am learing the programming
I am learing the programmingI am learing the programming
I am learing the programming
N Masahiro
Fluentd meetup dive into fluent plugin (outdated)
Fluentd meetup dive into fluent plugin (outdated)Fluentd meetup dive into fluent plugin (outdated)
Fluentd meetup dive into fluent plugin (outdated)
N Masahiro
D vs OWKN Language at LLnagoya
D vs OWKN Language at LLnagoyaD vs OWKN Language at LLnagoya
D vs OWKN Language at LLnagoya
N Masahiro
Goodbye Doost
Goodbye DoostGoodbye Doost
Goodbye Doost
N Masahiro


Technologies for Data Analytics Platform

  • 1. Technologies for Data Analytics Platform YAPC::Asia Tokyo 2015 - Aug 22, 2015
  • 2. Who are you? • Masahiro Nakagawa • github: @repeatedly • Treasure Data Inc. • Fluentd / td-agent developer • I love OSS :) • D Language, MessagePack, The organizer of several meetups, etc…
  • 3. Why do we analyze data?
  • 6. It means we need data analysis platform for own requirements
  • 7. Data Analytics Flow Collect Store Process Visualize Data source Reporting Monitoring
  • 9. • Easy to use and maintain • Single server • RDBMS is popular and has huge ecosystem
 [Figure: ETL (Extract + Transformation + Load) feeds an RDBMS, which serves queries]
  • 10. × Oops! RDBMS is not good for data analytics against large data volume. We need more speed and scalability!
  • 12. Parallel RDBMS • Optimized for OLAP workload • Columnar storage, Shared nothing, etc… • Netezza, Teradata, Vertica, Greenplum, etc…
 [Figure: a Query goes to the Leader Node, which sits in front of three Compute Nodes]
  • 13. Columnar Storage • Good data format for analytics workload • Read only selected columns, efficient compression • Not good for insert / update
 time | code | method
 2015-12-01 10:02:36 | 200 | GET
 2015-12-01 10:22:09 | 404 | GET
 2015-12-01 10:36:45 | 200 | GET
 2015-12-01 10:49:21 | 200 | POST
 … | … | …
 [Figure: the same table drawn twice, labeled "Row" and "Columnar", each marking its storage unit]
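The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the in-memory lists stand in for on-disk storage, and the function names (`count_status_row_layout`, `count_status_columnar`, `run_length_encode`) are hypothetical, not taken from any product.

```python
# Hypothetical access-log rows matching the example table on the slide.
rows = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]


def count_status_row_layout(rows, status):
    # Row layout: every record is read in full, even though only "code" is needed.
    return sum(1 for (_time, code, _method) in rows if code == status)


# Columnar layout: each column is kept as its own array.
columns = {
    "time": [r[0] for r in rows],
    "code": [r[1] for r in rows],
    "method": [r[2] for r in rows],
}


def count_status_columnar(columns, status):
    # Only the "code" column is scanned; "time" and "method" are never touched.
    return sum(1 for code in columns["code"] if code == status)


def run_length_encode(values):
    # Values of one column sit next to each other, so runs of equal values
    # compress well. Returns a list of [value, run_length] pairs.
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


print(count_status_row_layout(rows, 200))
print(count_status_columnar(columns, 200))
print(run_length_encode(columns["method"]))
```

Appending a new record to the columnar form means touching every column array, which hints at why the layout is not good for insert / update.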
  • 14. Okay, query is now processed normally.
  • 15. No silver bullet • Performance depends on data modeling and query • distkey and sortkey are important • should reduce data transfer and IO Cost • query should take advantage of these keys • There are some problems • Cluster scaling, metadata management, etc…
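Why a distribution key reduces data transfer can be sketched as follows. This is a toy model, not how any particular parallel RDBMS is implemented: the node count, the CRC32-based placement, and the `users` / `events` tables are all assumptions made for illustration.

```python
import zlib

NUM_NODES = 3  # hypothetical cluster size


def node_for(key):
    # Stable hash, so the same key always lands on the same node.
    return zlib.crc32(str(key).encode("utf-8")) % NUM_NODES


def distribute(rows, key_index):
    # Place each row on the node chosen by its distribution key.
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[node_for(row[key_index])].append(row)
    return nodes


# Hypothetical tables, both distributed on user id (column 0).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
events = [(1, "login"), (3, "purchase"), (1, "logout"), (2, "login")]

user_nodes = distribute(users, 0)
event_nodes = distribute(events, 0)


def local_join(user_rows, event_rows):
    # Join using only the rows that live on one node; nothing crosses the network.
    names = dict(user_rows)
    return [(uid, names[uid], action) for uid, action in event_rows if uid in names]


joined = []
for node_id in range(NUM_NODES):
    joined.extend(local_join(user_nodes[node_id], event_nodes[node_id]))

print(sorted(joined))
```

Because both tables share the same key, every event finds its user on the same node. A query that joins on a different column would lose this property and force rows to move between nodes.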
  • 16. Performance is good :) But we often want to change the schema
 for new workloads. Now it is
 hard to maintain the schema and its data…
  • 17. Okay, let’s separate data sources into multiple layers for reliable platform
  • 18. Schema on Write(RDBMS) • Writing data using schema
 for improving query performance • Pros: • minimum query overhead • Cons: • Need to design schema and workload beforehand • Data load is an expensive operation
  • 19. Schema on Read(Hadoop) • Writing data without schema and
 map schema at query time • Pros: • Robust over schema and workload change • Data load is a cheap operation • Cons: • High overhead at query time
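The two approaches can be contrasted with a small sketch. The JSON log lines and both function names are hypothetical; the point is only where the schema gets applied, at load time or at query time.

```python
import json

# Hypothetical raw log lines. The second has an extra field; the third lacks
# "method" and stores "code" as a string.
raw_lines = [
    '{"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"}',
    '{"time": "2015-12-01 10:22:09", "code": 404, "method": "GET", "agent": "curl"}',
    '{"time": "2015-12-01 10:36:45", "code": "200"}',
]


def load_schema_on_write(lines):
    # Schema on Write: convert and validate at load time.
    # A row that does not fit the schema makes the load fail (KeyError here).
    table = []
    for line in lines:
        record = json.loads(line)
        table.append((record["time"], int(record["code"]), record["method"]))
    return table


def query_schema_on_read(lines, wanted_code):
    # Schema on Read: keep the raw lines and interpret them per query.
    # Every query pays the parsing and conversion cost again.
    count = 0
    for line in lines:
        record = json.loads(line)
        if int(record.get("code", -1)) == wanted_code:
            count += 1
    return count


print(query_schema_on_read(raw_lines, 200))

try:
    load_schema_on_write(raw_lines)
except KeyError as missing:
    print("load rejected, missing field:", missing)
```

The load step fails on the third line while the query over raw data still works, which is the robustness / overhead trade-off listed above.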
  • 20. Data Lake • Schema management is hard • Volume is increasing and format is often changed • There are lots of log types • Feasible approach is storing raw data and
 converting it before analyzing it • Data Lake is a single storage for any logs • Note that there is no clear definition for now
  • 21. Data Lake Patterns • Use DFS, e.g. HDFS, for log storage • ETL or data processing by Hadoop ecosystem • Can convert logs via ingestion tools before • Use Data Lake storage and related tools • These storages support Hadoop ecosystem
  • 22. Apache Hadoop • Distributed computing framework • First implementation based on Google MapReduce
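For readers new to the MapReduce model, here is a single-process sketch of its three steps (map, shuffle, reduce). It is not Hadoop's API; the function names and the sample records are assumptions, and a real cluster would run the map and reduce calls on many machines.

```python
from collections import defaultdict

# Hypothetical access-log records: (time, code, method).
records = [
    ("2015-12-01 10:02:36", 200, "GET"),
    ("2015-12-01 10:22:09", 404, "GET"),
    ("2015-12-01 10:36:45", 200, "GET"),
    ("2015-12-01 10:49:21", 200, "POST"),
]


def map_phase(record):
    # Map: emit one (key, value) pair per input record.
    _time, _code, method = record
    yield (method, 1)


def shuffle(pairs):
    # Shuffle: group all values that share a key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values of one key into a single result.
    return key, sum(values)


mapped = [pair for record in records for pair in map_phase(record)]
result = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(result)
```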
  • 25. Cool! Data load becomes robust!
 [Figure: EL produces raw data; T produces transformed data]
  • 26. Apache Tez • Low level framework for YARN Applications • Hive, Pig, new query engine and more • Task and DAG based processing flow
 [Figure: a Task is made of Input, Processor, and Output; Tasks are connected as a DAG]
  • 27. MapReduce vs Tez
 SELECT g1.x, g1.avg, g2.cnt
 FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1 JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2 ON (g1.x = g2.x) ORDER BY avg;
 [Figure: the query plan (GROUP a BY a.x, GROUP b BY b.x, JOIN (a, b), ORDER BY) drawn as a chain of MapReduce jobs with HDFS between them, and as a single Tez DAG]
  • 28. Superstition • HDFS and YARN have SPOF • Recent versions have no SPOF in either MapReduce 1 or MapReduce 2 • Can’t build from scratch • Really? Treasure Data builds Hadoop on CircleCI.
 Cloudera, Hortonworks and MapR too. • They also check its dependent toolchain.
  • 29. Which Hadoop package
 should we use? • Distribution by Hadoop distributor is better • CDH by Cloudera • HDP by Hortonworks • MapR distribution by MapR • If you are familiar with Hadoop and its ecosystem,
 Apache community edition becomes an option. • For example, Treasure Data has patches and
 they want to use patched version.
  • 30. Good :) In addition, we want to collect data in efficient way!
  • 31. Ingestion tools • There are two execution models! • Bulk load: • For high-throughput • Most tools transfer data in batch and parallel • Streaming load: • For low-latency • Most tools transfer data in micro-batch
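A micro-batch is simply a small group of events sent together. The sketch below shows size-based batching only; the `micro_batches` helper is hypothetical, and real streaming collectors typically also flush on a timer so that a slow stream does not wait forever.

```python
def micro_batches(events, max_size):
    # Group an event stream into small batches of at most max_size events.
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left when the stream ends.
        yield batch


for batch in micro_batches(["e1", "e2", "e3", "e4", "e5"], max_size=2):
    print("send", batch)
```

A larger `max_size` moves the behavior toward bulk load (higher throughput), and a smaller one toward event-at-a-time streaming (lower latency).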
  • 32. Bulk load tools • Embulk • Pluggable bulk data loader for
 various inputs and outputs • Write plugins using Java and JRuby • Sqoop • Data transfer between Hadoop and RDBMS • Included in some distributions • Or each bulk loader for each data store
  • 33. Streaming load tools • Fluentd • Pluggable and json based streaming collector • Lots of plugins in rubygems • Flume • Mainly for Hadoop ecosystem, HDFS, HBase, … • Included in some distributions • Or Logstash, Heka, Splunk, etc…
  • 34. Data ingestion also
 becomes robust and efficient! Raw data Transformed data
  • 35. It works! but…
 we want to issue ad-hoc query to entire data. We can’t wait loading data into database.
  • 36. You can use MPP query engine for data stores.
  • 37. MPP query engine • It doesn’t have its own storage, unlike parallel RDBMS • Follow “Schema on Read” approach • data distribution depends on backend • data schema also depends on backend • Some products are called “SQL on Hadoop” • Presto, Impala, Apache Drill, etc… • It has its own execution engine and does not use MapReduce.
  • 38. • Distributed Query Engine for interactive queries
 against various data sources and large data. • Pluggable connector for joining multiple backends • You can join MySQL and HDFS data in one query • Lots of useful functions for data analytics • window functions, approximate query,
 machine learning, etc…
  • 39. HDFS Hive PostgreSQL, etc. Daily/Hourly Batch Interactive query Commercial
 BI Tools Batch analysis platform Visualization platform Dashboard
  • 40. HDFS Hive Daily/Hourly Batch Interactive query ✓ Less scalable ✓ Extra cost Commercial
 BI Tools Dashboard ✓ More work to manage
 2 platforms ✓ Can’t query against
 “live” data directly Batch analysis platform Visualization platform PostgreSQL, etc.
  • 41. HDFS Hive Dashboard Presto PostgreSQL, etc. Daily/Hourly Batch HDFS Hive Dashboard Daily/Hourly Batch Interactive query Interactive query
  • 42. Presto HDFS Hive Dashboard Daily/Hourly Batch Interactive query Cassandra MySQL Commercial DBs SQL on any data sets Commercial
 BI Tools ✓ IBM Cognos
 ✓ Tableau
 ✓ ... Data analysis platform
  • 44. Execution Model: MapReduce vs Presto
 MapReduce (map and reduce tasks with disk in between): Write data to disk • Wait between stages
 Presto: All stages are pipe-lined ✓ No wait time ✓ No fault-tolerance • memory-to-memory data transfer ✓ No disk IO ✓ Data chunk must fit in memory
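The staged and pipelined models differ in when a downstream task may start. A rough single-process analogy, assuming Python generators as stand-ins for tasks and a list as a stand-in for data written to disk (all names here are hypothetical):

```python
def scan(rows, log):
    # First task: read rows one at a time.
    for row in rows:
        log.append(("scan", row))
        yield row


def double(rows, log):
    # Second task: transform each row it receives.
    for row in rows:
        log.append(("double", row))
        yield row * 2


def staged(rows):
    # Staged: the first task finishes and its output is fully materialized
    # (a list here, disk in MapReduce) before the second task starts.
    log = []
    stage1 = list(scan(rows, log))
    stage2 = list(double(stage1, log))
    return stage2, log


def pipelined(rows):
    # Pipelined: each row flows to the second task as soon as it is produced.
    log = []
    result = list(double(scan(rows, log), log))
    return result, log


print(staged([1, 2])[1])
print(pipelined([1, 2])[1])
```

Both produce the same result, but the logs show the tasks interleaving only in the pipelined case. The materialized intermediate output is also what lets a staged engine rerun one failed stage, which the pipelined sketch cannot do.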
  • 45. Okay, we have now low latency and batch combination. Raw data
  • 46. Resolved our concern! But… we also need quick estimation.
  • 47. Currently, several stream processing systems are available. Let’s try!!
  • 48. Apache Storm • Distributed realtime processing framework • Low latency: tuple at a time • Trident mode uses micro batch
  • 49. Norikra • Schema-less CEP engine for stream processing • Use SQL like Esper EPL • Not distributed unlike Storm for now
 Calculated result
  • 50. Great! We can get insight by streaming and batch way :)
  • 51. One more. We can make data transfer more reliable for multiple data streams with distributed queue
  • 52. Apache Kafka • Distributed messaging system • Producer - Broker - Consumer pattern • Pull model, replication, etc…
 [Figure: an App pushes data to Kafka; downstream systems pull from it]
  • 53. Push vs Pull • Push: • Easy to transfer data to multiple destinations • Hard to control stream ratio in multiple streams • Pull: • Easy to control stream ratio • Should manage consumers correctly
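The pull side of this trade-off can be sketched with an in-memory log. This is not the Kafka client API; the `Broker` and `Consumer` classes are hypothetical and only show how each consumer keeps its own offset and decides how much to fetch.

```python
class Broker:
    """Hypothetical append-only message log, standing in for a broker."""

    def __init__(self):
        self.log = []

    def push(self, message):
        # Producers push; the broker only appends.
        self.log.append(message)

    def pull(self, offset, max_messages):
        # Consumers ask for a slice starting at their own offset.
        return self.log[offset:offset + max_messages]


class Consumer:
    def __init__(self, broker, batch_size):
        self.broker = broker
        self.batch_size = batch_size  # each consumer picks its own rate
        self.offset = 0               # each consumer tracks its own position

    def poll(self):
        messages = self.broker.pull(self.offset, self.batch_size)
        self.offset += len(messages)
        return messages


broker = Broker()
for i in range(5):
    broker.push(i)

fast = Consumer(broker, batch_size=3)
slow = Consumer(broker, batch_size=1)
print(fast.poll(), slow.poll())
```

The slow consumer does not hold back the fast one, which is the "easy to control stream ratio" point. The cost is the last bullet: if a consumer loses or mismanages its offset, it re-reads or skips messages.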
  • 54. This is a modern analytics platform
  • 55. Seems complex and hard to maintain? Let’s use useful services!
  • 56. Amazon Redshift • Parallel RDBMS on AWS • Re-use traditional Parallel RDBMS know-how • Scale is easier than traditional systems • Using it with Amazon EMR is popular 1. Store data into S3 2. EMR processes S3 data 3. Load processed data into Redshift • EMR has Hadoop ecosystem
  • 58. Google BigQuery • Distributed query engine and scalable storage • Tree model, Columnar storage, etc… • Separate storage from workers • High performance query by Google infrastructure • Lots of workers • Storage / IO layer on Colossus • Can’t manage Parallel RDBMS properties like distkey,
 but it works well in most cases.
  • 61. Treasure Data • Cloud based end-to-end data analytics service • Hive, Presto, Pig and Hivemall for one big repository • Lots of ingestion and output ways, scheduling, etc… • No stream processing for now • Service concept is Data Lake • JSON based schema-less storage • Execution model is similar to BigQuery • Separate storage from workers • Can’t specify Parallel RDBMS properties
  • 63. Resource Model Trade-off
 Model | Pros | Cons
 Fully guaranteed | Stable execution, easy to control resources | No boost mechanism
 Guaranteed with multi-tenanted | Stable execution, good scalability | Less controllable resources
 Fully multi-tenanted | Boosted performance, great scalability | Unstable execution
  • 64. MS Azure also has useful services: DataHub, SQL DWH, DataLake, Stream Analytics, HDInsight…
  • 65. Use service or build a platform? • Should consider using service first • AWS, GCP, MS Azure, Treasure Data, etc… • Important factor is data analytics, not platform • Do you have enough resources to maintain it? • If specific analytics platform is a differentiator,
 building a platform is better • Use state-of-the-art technologies • Hard to implement on existing platforms
  • 66. Conclusion • Many software products and services for data analytics • Lots of trade-offs: performance, complexity, connectivity, execution model, etc • SQL is a primary language on data analytics • Should focus on your goal! • Is a data analytics platform your business core?
 If not, consider using services first.
  • 67. Cloud service for entire data pipeline!
  • 69. Apache Spark • Another Distributed computing framework • Mainly for in-memory computing with DAG • RDD and DataFrame based clean API • Combination with Hadoop is popular
  • 70. Apache Flink • Streaming based execution engine • Support batch and pipelined processing • Hadoop and Spark are batch based • flink/flink-docs-master/
  • 71. Batch vs Pipelined
 Batch (Staged): stage1, stage2, stage3 run one after another, with disk in between • Wait between stages
 Pipelined: All stages are pipe-lined ✓ No wait time ✓ fault-tolerance with check pointing • memory-to-memory data transfer ✓ use disk if needed
  • 72. Visualization • Tableau • Popular BI tool in many areas • Awesome GUI, easy to use, lots of charts, etc • Metric Insights • Dashboard for many metrics • Scheduled query, custom handler, etc • Chartio • Cloud based BI tool
  • 73. How to manage job dependency? We want to issue Job X after Job A and Job B are finished.
  • 74. Data pipeline tool • There are some important features • Manage job dependency • Handle job failure and retry • Easy to define topology • Separate tasks into sub-tasks • Apache Oozie, Apache Falcon, Luigi, Airflow, JP1, etc…
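The first two features, dependency management and retry, can be sketched with the standard library alone. This is a minimal illustration, not how Oozie, Luigi, or Airflow work internally: `run_pipeline`, the job names, and the retry policy are all assumptions, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(dependencies, jobs, max_retries=1):
    # dependencies maps each job name to the set of jobs it must wait for.
    # Jobs run one by one in dependency order; a failing job is retried
    # up to max_retries times before the error is raised.
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                jobs[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise
        order.append(name)
    return order


attempts = {"job_b": 0}


def job_a():
    pass


def job_b():
    # Hypothetical flaky job: fails on the first attempt only.
    attempts["job_b"] += 1
    if attempts["job_b"] == 1:
        raise RuntimeError("transient failure")


def job_x():
    pass


# Job X runs only after Job A and Job B have finished.
dependencies = {"job_x": {"job_a", "job_b"}, "job_a": set(), "job_b": set()}
order = run_pipeline(dependencies, {"job_a": job_a, "job_b": job_b, "job_x": job_x})
print(order)
```

Job A and Job B may come out in either order, but Job X is always last. Real tools add what this sketch lacks: parallel execution, persistent state, scheduling, and a UI.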
  • 75. Luigi • Python module for building job pipeline • Write python code and run it. • task is defined as Python class • Easy to manage by VCS • Need some extra tools • scheduled job, job history, etc…
 import luigi

 class T1(luigi.Task):
     def requires(self):
         ...  # dependencies
     def output(self):
         ...  # store result
     def run(self):
         ...  # task body
  • 76. Airflow • Python and DAG based workflow • Write python code but it is for defining a DAG • Task is defined by Operator • There are good features • Management web UI • Task information is stored into database • Celery based distributed execution
 dag = DAG('example')
 t1 = Operator(..., dag=dag)
 t2 = Operator(..., dag=dag)
 t2.set_upstream(t1)