Scheduling Hadoop Pipelines
How to manage data process pipelines on Hadoop.
HUG UK 2015-01-13
About Me
Name : James Grant
Hadoop Enterprise Data Warehouse Developer here at Expedia
Working with Hadoop and related technology for about 6 years
Email : or
Contents
Introduce the example
Schedule the example using cron style scheduling
Look at what’s wrong with time based scheduling
Introducing Apache Oozie
Introducing Apache Falcon
Questions
Example
Tracking marketing profit and loss (PnL)
Using
–Booking data
–Marketing spend data
–Web server logs
Producing records showing spend, revenue and profit per campaign per day
Example – Jobs to schedule
Land Booking Data to HDFS
Land Marketing spend data to HDFS
Land Web logs to HDFS
Process web logs to identify bookings and points of entry
Enrich with booking revenue and profit
Enrich with marketing spend
Attribute revenue and profit to marketing campaign
Scheduling the Example
We need to know how long each task normally takes
We also need to know how long it could possibly take
We then need to work out at what time of day to schedule the task
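As a rough illustration of the calculation described above, the sketch below derives cron-style start times for a chain of tasks from their worst-case durations. The task names loosely follow the example; the durations and the 01:00 first start are invented for illustration and are not figures from the talk.

```python
from datetime import datetime, timedelta

# Hypothetical worst-case durations in minutes; not figures from the talk.
WORST_CASE_MIN = {
    "land_weblogs": 90,
    "process_weblogs": 120,
    "enrich_bookings": 45,
}

def cron_start_times(first_start, chain, worst_case_min):
    """Start each task only after the previous task's worst-case finish."""
    starts = {}
    current = first_start
    for task in chain:
        starts[task] = current
        current = current + timedelta(minutes=worst_case_min[task])
    # current is now the worst-case finish time of the whole chain.
    return starts, current

chain = ["land_weblogs", "process_weblogs", "enrich_bookings"]
starts, finish = cron_start_times(datetime(2015, 1, 2, 1, 0), chain, WORST_CASE_MIN)

# Print the minute and hour fields of a daily cron entry for each task.
for task in chain:
    print(f"{starts[task]:%M %H} * * *  {task}")
```

Every task in the chain inherits the padding of all the tasks before it, which is why the last start time drifts so late in the day.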
The Problem With Time Based Scheduling
It’s brittle
–Any delay upstream means all downstream tasks fail
It’s inefficient
–All scheduling has to be on a near worst case basis
–So the final result arrives later than we would like
Difficult to manage at scale
–Coordinating schedules between different teams is hard
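The brittleness and inefficiency points above can be made concrete with a small calculation. This is only a sketch: the three-step chain and its normal and worst-case durations are hypothetical numbers chosen for illustration.

```python
# Hypothetical (normal, worst-case) durations in minutes for a three-step chain.
DURATIONS = [(30, 90), (40, 120), (15, 45)]

def finish_time_based(durations):
    """Each step starts at a fixed slot sized for the worst case of the steps before it."""
    start = 0
    for _, worst in durations[:-1]:
        start += worst
    # The last step starts at its fixed slot and takes its normal time.
    return start + durations[-1][0]

def finish_dependency_based(durations):
    """Each step starts as soon as the previous step has actually finished."""
    return sum(normal for normal, _ in durations)

def downstream_fails(actual_first_step, durations):
    """With fixed slots, step two starts on schedule whether or not step one is done."""
    return actual_first_step > durations[0][1]

print(finish_time_based(DURATIONS))        # 225 minutes after the first start
print(finish_dependency_based(DURATIONS))  # 85 minutes after the first start
print(downstream_fails(100, DURATIONS))    # True: step one overran its 90 minute slot
```

On a normal day the fixed schedule delivers the result 140 minutes later than necessary, and a single overrun of the first step still breaks everything downstream.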
Introducing Apache Oozie
A workflow scheduler for Hadoop jobs
Describe your workflow as a DAG of actions
Trigger that workflow periodically or on dataset availability
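To show what "a DAG of actions" means for the example pipeline, here is a minimal Python sketch that orders the jobs by their dependencies. It is not how Oozie is implemented; the job names and the dependency edges are my reading of the job list earlier in the talk and should be treated as assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each job maps to the set of jobs it depends on (assumed edges, for illustration).
DEPENDS_ON = {
    "land_bookings": set(),
    "land_spend": set(),
    "land_weblogs": set(),
    "process_weblogs": {"land_weblogs"},
    "enrich_revenue": {"process_weblogs", "land_bookings"},
    "enrich_spend": {"enrich_revenue", "land_spend"},
    "attribute_to_campaign": {"enrich_spend"},
}

def run_in_waves(depends_on):
    """Group jobs into waves; every job in a wave has all its inputs ready."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # pretend every job in the wave succeeded
    return waves

for number, wave in enumerate(run_in_waves(DEPENDS_ON), start=1):
    print(number, wave)
```

The three landing jobs form the first wave and can run in parallel, which corresponds to the fork in the example workflow.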
Example Oozie Coordinator
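<coordinator-app name="marketing-pnl-coord" frequency="${coord:days(1)}"
    start="2015-01-02T02:00Z" end="2015-12-31T02:00Z" timezone="UTC"
    xmlns="uri:oozie:coordinator:0.1">
  <controls>
    <timeout>1080</timeout>
    <concurrency>1</concurrency>
    <execution>FIFO</execution>
  </controls>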
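  <datasets>
    <dataset name="d_weblogs" frequency="${coord:days(1)}"
        initial-instance="2009-01-01T02:00Z" timezone="UTC">
      <uri-template>hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/</uri-template>
      <done-flag></done-flag>
    </dataset>
    ...
    <dataset name="d_marketing-pnl" frequency="${coord:days(1)}"
        initial-instance="2009-01-01T02:00Z" timezone="UTC">
      <uri-template>hdfs://data/marketing-pnl/${YEAR}/${MONTH}/${DAY}/</uri-template>
      <done-flag></done-flag>
    </dataset>
  </datasets>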
<data-in name="e_weblogs" dataset="d_weblogs">
<data-out name="e_marketing-pnl" dataset="d_marketing-pnl">
Example Oozie Workflow
<workflow-app name="marketing-pnl-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="fork"/>
  <fork name="fork">
    <path start="downloadBooking"/>
    <path start="downloadWeblogs"/>
    <path start="downloadSpend"/>
  </fork>
  <action name="downloadBooking">
    <shell xmlns="uri:oozie:shell-action:0.2">
      ...
    </shell>
    <ok to="join"/>
    <error to="sendErrorEmail"/>
  </action>
  <action name="downloadWeblogs">
    ...
  </action>
  <action name="downloadSpend">
    ...
  </action>
  <join name="join" to="merge"/>
  ...
  <action name="sendErrorEmail">
    ...
  </action>
  <kill name="killJobAction">
    <message>"Killed job : ${wf:errorMessage(wf:lastErrorNode())}"</message>
  </kill>
  <end name="end" />
</workflow-app>
Scheduling With Apache Oozie
Processes will be launched in a container on the cluster
There is a lot of XML
When working with multiple teams/pipelines dataset
definitions must be repeated
Introducing Apache Falcon
“A data processing and management solution”
Describe datasets and processes
Processes are scheduled based on the descriptions
Uses Oozie as the scheduler
Processes can be Hive HQL scripts, Pig scripts or Oozie workflows
Example Dataset Description
<?xml version="1.0" encoding="UTF-8"?>
<feed description="Web Logs" name="weblogs" xmlns="uri:falcon:feed:0.1">
<late-arrival cut-off="hours(18)"/>
<cluster name="production" type="source">
<validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
<retention limit="years(5)" action="delete"/>
<location type="data" path="/data/marketing-pnl/${YEAR}/${MONTH}/${DAY}"/>
<ACL owner="marketing" group="etl" permission="0755"/>
<schema location="/none" provider="none"/>
<property name="queueName" value="prod_etl"/>
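The location path in the feed is a template with one directory per day. The Python sketch below shows, under the assumption that `${YEAR}`, `${MONTH}` and `${DAY}` expand to zero-padded date parts, which concrete directory a given daily instance points at. The helper `resolve_path` is hypothetical and is not part of Falcon or Oozie.

```python
from datetime import date

def resolve_path(template, day):
    """Substitute ${YEAR}, ${MONTH} and ${DAY} with zero-padded date parts."""
    return (template
            .replace("${YEAR}", f"{day.year:04d}")
            .replace("${MONTH}", f"{day.month:02d}")
            .replace("${DAY}", f"{day.day:02d}"))

# Template copied from the feed's location element above.
TEMPLATE = "/data/marketing-pnl/${YEAR}/${MONTH}/${DAY}"

print(resolve_path(TEMPLATE, date(2015, 1, 13)))  # /data/marketing-pnl/2015/01/13
```

Because the instance date fully determines the directory, the scheduler can check whether a particular day's data has arrived simply by looking for that path.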
Example Process Description
<?xml version="1.0" encoding="UTF-8"?>
<process name="mkgMerge" xmlns="uri:falcon:process:0.1">
<input name="bookings" feed="mkgBookings" start="today(0,0)" end="today(0,0)" />
<input name="webActions" feed="mkgEntryBookingLog" start="today(0,0)" end="today(0,
<input name="spend" feed="mkgSpend" start="today(0,0)" end="today(0,0)" />
<output name="output" feed="mkgEnrichedLog" instance="today(0,0)" />
<property name="queueName" value="prod_etl" />
<workflow name="mkgMerge-wf" engine="oozie" path="/apps/mkg/merge" />
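The inputs and the output above select feed instances with `today(0,0)`. As I understand the Falcon expression language, `today(hours, minutes)` resolves to the start of the day of the process instance, shifted by the given hours and minutes; treat that reading as an assumption and check the Falcon documentation before relying on it. A small Python sketch of that interpretation:

```python
from datetime import datetime, timedelta

def today(instance_time, hours, minutes):
    """Midnight of the instance's day, shifted by the given hours and minutes."""
    midnight = instance_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(hours=hours, minutes=minutes)

# A hypothetical process instance with a nominal time of 02:00 on 13 January 2015.
nominal = datetime(2015, 1, 13, 2, 0)
print(today(nominal, 0, 0))  # 2015-01-13 00:00:00
```

With `start` and `end` both set to `today(0,0)`, each input covers exactly one daily instance of its feed under this interpretation.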
Benefits and Observations of Falcon
About the same amount of XML but in smaller chunks
Declare the data and processing steps and have the schedule
created for you
A dataset is declared once and used by all processing steps that
need it
Also handles retention (a separate process under Oozie)
Also handles replication
Oozie workflows
Describe a DAG of actions to take to complete a task
Available actions are:
–File system
All actions take place in a container on the cluster
Example Workflow
<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="mkgMerge-wf">
  <start to="shell-node"/>
  <action name="shell-node">
    <shell xmlns="uri:oozie:shell-action:0.2">
      ...
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
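The `ok` and `error` transitions in this workflow can be read as a small state machine. The Python sketch below walks those transitions for a given outcome; it only illustrates the control flow and is not how Oozie itself executes a workflow. The `walk` helper and the `outcomes` mapping are hypothetical.

```python
# Transitions copied from the example workflow: node -> (on ok, on error).
TRANSITIONS = {"shell-node": ("end", "fail")}
TERMINALS = {"end", "fail"}

def walk(start, outcomes):
    """Follow ok/error transitions; outcomes maps an action name to True on success."""
    node = start
    path = [start]
    while node not in TERMINALS:
        ok_to, error_to = TRANSITIONS[node]
        node = ok_to if outcomes[node] else error_to
        path.append(node)
    return path

print(walk("shell-node", {"shell-node": True}))   # ['shell-node', 'end']
print(walk("shell-node", {"shell-node": False}))  # ['shell-node', 'fail']
```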
Any Questions?


Process Scheduling on Hadoop at Expedia

  • 1. Scheduling Hadoop Pipelines How to manage data process pipelines on Hadoop. HUG UK 2015-01-13
  • 2. 2 About Me Name : James Grant Hadoop Enterprise Data Warehouse Developer here at Expedia Working with Hadoop and related technology for about 6 years Email : or
  • 3. 3 Contents Introduce the example Schedule the example using cron style scheduling Look at what’s wrong with time based scheduling Introducing Apache Oozie Introducing Apache Falcon Questions
  • 4. Example
    Tracking marketing profit and loss (PnL)
    Using
    –Booking data
    –Marketing spend data
    –Web server logs
    Producing records showing spend, revenue and profit per campaign per day
  • 5. Example – Jobs to schedule
    Land Booking Data to HDFS
    Land Marketing spend data to HDFS
    Land Web logs to HDFS
    Process web logs to identify bookings and points of entry
    Enrich with booking revenue and profit
    Enrich with marketing spend
    Attribute revenue and profit to marketing campaign
  • 7. Scheduling the Example
    We need to know how long each task normally takes
    We also need to know how long it could possibly take
    We then need to work out at what time of day to schedule the task
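The arithmetic behind a cron style schedule can be sketched in a few lines of Python. This is only an illustration: the job names, the worst-case durations and the dependencies between jobs are assumptions made for the example (the slides give the list of jobs but no timings or explicit dependency graph), and the 02:00 start is borrowed from the coordinator example later in the deck.

```python
from datetime import datetime, timedelta

# Hypothetical worst-case durations in minutes; the talk gives no figures.
WORST_CASE_MINUTES = {
    "land_bookings": 60,
    "land_spend": 30,
    "land_weblogs": 120,
    "process_weblogs": 90,
    "enrich_bookings": 45,
    "enrich_spend": 30,
    "attribute": 60,
}

# Assumed dependencies: each job lists the jobs that must finish before it starts.
DEPENDS_ON = {
    "land_bookings": [],
    "land_spend": [],
    "land_weblogs": [],
    "process_weblogs": ["land_weblogs"],
    "enrich_bookings": ["process_weblogs", "land_bookings"],
    "enrich_spend": ["enrich_bookings", "land_spend"],
    "attribute": ["enrich_spend"],
}


def cron_start_times(first_start):
    """Give each job a fixed start time no earlier than the
    worst-case finish time of all of its upstream jobs."""
    starts, finishes = {}, {}
    # DEPENDS_ON is written in dependency order, so a single pass is enough.
    for job, upstream in DEPENDS_ON.items():
        starts[job] = max((finishes[u] for u in upstream), default=first_start)
        finishes[job] = starts[job] + timedelta(minutes=WORST_CASE_MINUTES[job])
    return starts, finishes


starts, finishes = cron_start_times(datetime(2015, 1, 2, 2, 0))
for job, start in starts.items():
    # Print a crontab-like "minute hour * * *" line for each job.
    print(f"{start.minute} {start.hour} * * *  {job}")
```

Because every start time is pinned to the worst-case finish of the jobs before it, the final job is scheduled late even on days when everything upstream finishes early.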
  • 10. The Problem With Time Based Scheduling
    It’s brittle
    –Any delay upstream means all downstream tasks fail
    It’s inefficient
    –All scheduling has to be on a near worst case basis
    –So the final result arrives later than we would like
    Difficult to manage at scale
    –Coordinating schedules between different teams is hard
  • 11. Introducing Apache Oozie
    URL:
    A workflow scheduler for Hadoop jobs
    Describe your workflow as a DAG of actions
    Trigger that workflow periodically or on dataset availability
  • 12. Example Oozie Coordinator
    <coordinator-app name="marketing-pnl-coord" frequency="${coord:days(1)}"
                     start="2015-01-02T02:00Z" end="2015-12-31T02:00Z" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.1">
      <controls>
        <timeout>1080</timeout>
        <concurrency>1</concurrency>
        <execution>FIFO</execution>
      </controls>
  • 13. Example Oozie Coordinator
      <datasets>
        <dataset name="d_weblogs" frequency="${coord:days(1)}"
                 initial-instance="2009-01-01T02:00Z" timezone="UTC">
          <uri-template>hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/</uri-template>
          <done-flag></done-flag>
        </dataset>
        ...
        <dataset name="d_marketing-pnl" frequency="${coord:days(1)}"
                 initial-instance="2009-01-01T02:00Z" timezone="UTC">
          <uri-template>hdfs://data/marketing-pnl/${YEAR}/${MONTH}/${DAY}/</uri-template>
          <done-flag></done-flag>
        </dataset>
      </datasets>
  • 14. Example Oozie Coordinator
      <input-events>
        <data-in name="e_weblogs" dataset="d_weblogs">
          <instance>${coord:current(0)}</instance>
        </data-in>
        ...
      </input-events>
      <output-events>
        <data-out name="e_marketing-pnl" dataset="d_marketing-pnl">
          <instance>${coord:current(-1)}</instance>
        </data-out>
      </output-events>
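To make the dataset and event definitions above more concrete, here is a simplified Python model of how an instance such as `${coord:current(0)}` or `${coord:current(-1)}` could map onto a directory built from the `uri-template`. It is a sketch of the idea, not Oozie's implementation: the function name `resolve_instance` is made up for the example, and time zones, daylight saving and the other EL functions are ignored.

```python
from datetime import datetime, timedelta


def resolve_instance(uri_template, initial_instance, frequency, nominal_time, offset):
    """Return the URI of the dataset instance `offset` periods away from the
    current one, where the current instance (offset 0) is taken to be the
    latest instance at or before the coordinator's nominal time."""
    periods_elapsed = (nominal_time - initial_instance) // frequency
    instance_time = initial_instance + (periods_elapsed + offset) * frequency
    # Substitute the date placeholders used in the uri-template.
    return (uri_template
            .replace("${YEAR}", f"{instance_time:%Y}")
            .replace("${MONTH}", f"{instance_time:%m}")
            .replace("${DAY}", f"{instance_time:%d}"))


initial = datetime(2009, 1, 1, 2, 0)   # initial-instance from the slides
daily = timedelta(days=1)              # stands in for ${coord:days(1)}
nominal = datetime(2015, 1, 2, 2, 0)   # first run of the coordinator

weblogs_in = resolve_instance(
    "hdfs://data/weblogs/${YEAR}/${MONTH}/${DAY}/", initial, daily, nominal, 0)
pnl_out = resolve_instance(
    "hdfs://data/marketing-pnl/${YEAR}/${MONTH}/${DAY}/", initial, daily, nominal, -1)

print(weblogs_in)
print(pnl_out)
```

Under these assumptions the first run waits for the web log directory dated 2015/01/02 and writes its output to the marketing PnL directory dated 2015/01/01.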
  • 17. Example Oozie Workflow
    <workflow-app name="marketing-pnl-wf" xmlns="uri:oozie:workflow:0.1">
      <start to="fork"/>
      <fork name="fork">
        <path start="downloadBooking"/>
        <path start="downloadWeblogs"/>
        <path start="downloadSpend"/>
      </fork>
  • 18. Example Oozie Workflow
      <action name="downloadBooking">
        <shell xmlns="uri:oozie:shell-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name></name>
              <value>${queueName}</value>
            </property>
          </configuration>
          <exec></exec>
          <argument>--bookings=${e_bookings}</argument>
          <file>${wf:appPath()}/</file>
          <file>${wf:appPath()}/downloadBooking.jar</file>
        </shell>
        <ok to="join"/>
        <error to="sendErrorEmail"/>
      </action>
  • 19. Example Oozie Workflow
      <action name="downloadWeblogs">
        ...
      </action>
      <action name="downloadSpend">
        ...
      </action>
      ...
      <join name="join" to="merge"/>
      <action name="sendErrorEmail">
        ...
      </action>
      <kill name="killJobAction">
        <message>"Killed job : ${wf:errorMessage(wf:lastErrorNode())}"</message>
      </kill>
      <end name="end" />
    </workflow-app>
  • 20. Scheduling With Apache Oozie
    Processes will be launched in a container on the cluster
    There is a lot of XML
    When working with multiple teams/pipelines, dataset definitions must be repeated
  • 21. Introducing Apache Falcon
    “A data processing and management solution”
    Describe datasets and processes
    Processes are scheduled based on the descriptions
    Uses Oozie as the scheduler
    Processes can be Hive HQL scripts, Pig scripts or Oozie workflows
  • 22. Example Dataset Description
    <?xml version="1.0" encoding="UTF-8"?>
    <feed description="Web Logs" name="weblogs" xmlns="uri:falcon:feed:0.1">
      <frequency>days(1)</frequency>
      <late-arrival cut-off="hours(18)"/>
      <clusters>
        <cluster name="production" type="source">
          <validity start="2014-01-01T02:00Z" end="2099-12-31T00:00Z"/>
          <retention limit="years(5)" action="delete"/>
        </cluster>
      </clusters>
      <locations>
        <location type="data" path="/data/weblogs/${YEAR}/${MONTH}/${DAY}"/>
      </locations>
      <ACL owner="marketing" group="etl" permission="0755"/>
      <schema location="/none" provider="none"/>
      <properties>
        <property name="queueName" value="prod_etl"/>
      </properties>
    </feed>
  • 23. Example Process Description
    <?xml version="1.0" encoding="UTF-8"?>
    <process name="mkgMerge" xmlns="uri:falcon:process:0.1">
      <clusters>…</clusters>
      <parallel>1</parallel>
      <order>FIFO</order>
      <frequency>days(1)</frequency>
      <inputs>
        <input name="bookings" feed="mkgBookings" start="today(0,0)" end="today(0,0)" />
        <input name="webActions" feed="mkgEntryBookingLog" start="today(0,0)" end="today(0,
        <input name="spend" feed="mkgSpend" start="today(0,0)" end="today(0,0)" />
      </inputs>
      <outputs>
        <output name="output" feed="mkgEnrichedLog" instance="today(0,0)" />
      </outputs>
      <properties>
        <property name="queueName" value="prod_etl" />
      </properties>
      <workflow name="mkgMerge-wf" engine="oozie" path="/apps/mkg/merge" />
    </process>
  • 24. Benefits and Observations of Falcon
    About the same amount of XML but in smaller chunks
    Declare the data and processing steps and have the schedule created for you
    A dataset is declared once and used by all processing steps that need it
    Also handles retention (a separate process under Oozie)
    Also handles replication
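The idea that declaring inputs and outputs is enough to derive a schedule can be illustrated with a small Python sketch. It does not reproduce how Falcon itself works (Falcon hands scheduling to Oozie, as noted earlier); it only shows that an execution order falls out of the feed declarations. The `mkgMerge` process and its feeds come from the process description above, while `mkgEntry`, `mkgAttribute` and the `mkgPnl` feed are hypothetical names added to make a chain.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# process name -> (input feeds, output feeds)
PROCESSES = {
    "mkgMerge": ({"mkgBookings", "mkgEntryBookingLog", "mkgSpend"},
                 {"mkgEnrichedLog"}),
    "mkgEntry": ({"weblogs"}, {"mkgEntryBookingLog"}),   # hypothetical process
    "mkgAttribute": ({"mkgEnrichedLog"}, {"mkgPnl"}),    # hypothetical process
}


def process_order(processes):
    """Order processes so that each one runs after the processes
    that produce the feeds it consumes."""
    # Map every feed to the process that writes it.
    producer = {feed: name
                for name, (_, outputs) in processes.items()
                for feed in outputs}
    # A process depends on the producers of its inputs; feeds with no
    # producer here (landed from outside) add no dependency.
    graph = {name: {producer[feed] for feed in inputs if feed in producer}
             for name, (inputs, _) in processes.items()}
    return list(TopologicalSorter(graph).static_order())


order = process_order(PROCESSES)
print(order)
```

Each feed appears once in the declarations, and any process that names it as an input is ordered after its producer without a start time being written down anywhere.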
  • 25. Oozie workflows
    Describe a DAG of actions to take to complete a task
    Available actions are:
    –Map-Reduce
    –Pig
    –File system
    –SSH
    –Java
    –Shell
    All actions take place in a container on the cluster
  • 26. Example Workflow
    <?xml version="1.0" encoding="UTF-8"?>
    <workflow-app xmlns="uri:oozie:workflow:0.4" name="mkgMerge-wf">
      <start to="shell-node"/>
      <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name></name>
              <value>${queueName}</value>
            </property>
          </configuration>