The document discusses scheduling Hadoop pipelines using various Apache projects. It provides an example of a marketing profit and loss (PnL) pipeline that processes booking, marketing spend, and web log data. It describes scheduling the example jobs using cron-style scheduling and the problems with time-based scheduling. It then introduces Apache Oozie and Apache Falcon for more robust workflow scheduling based on dataset availability. It provides examples of using Oozie coordinators and workflows and Falcon feeds and processes to schedule the example PnL pipeline based on when input data is available rather than fixed time schedules.
2. About Me
Name: James Grant
Hadoop Enterprise Data Warehouse Developer at Expedia
Working with Hadoop and related technology for about 6 years
Email: jamegrant@expedia.com or james@queeg.org
3. Contents
Introduce the example
Schedule the example using cron-style scheduling
Look at what’s wrong with time-based scheduling
Introducing Apache Oozie
Introducing Apache Falcon
Questions
4. Example
Tracking marketing profit and loss (PnL)
Using
–Booking data
–Marketing spend data
–Web server logs
Producing records showing spend, revenue and profit per campaign per day
5. Example – Jobs to schedule
Land Booking Data to HDFS
Land Marketing spend data to HDFS
Land Web logs to HDFS
Process web logs to identify bookings and points of entry
Enrich with booking revenue and profit
Enrich with marketing spend
Attribute revenue and profit to marketing campaign
7. Scheduling the Example
We need to know how long each task normally takes
We also need to know how long it could possibly take
We then need to work out at what time of day to schedule the task
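A time-based schedule like the one described might be expressed as crontab entries. The following is purely illustrative (the script names, paths, and times are invented, not Expedia's actual schedule); note how each downstream job is padded to start after the worst-case finish time of its inputs:

```cron
# Land raw data shortly after midnight, padded for the worst observed case
15 1 * * * /opt/etl/land_bookings.sh
15 1 * * * /opt/etl/land_marketing_spend.sh
30 1 * * * /opt/etl/land_weblogs.sh
# Downstream jobs start late enough that inputs are "usually" ready by then
0 3 * * * /opt/etl/process_weblogs.sh
0 5 * * * /opt/etl/enrich_and_attribute.sh
```

If any landing job overruns its padding, every downstream entry still fires on schedule and fails or, worse, processes incomplete data, which is exactly the brittleness discussed next.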
10. The Problem With Time-Based Scheduling
It’s brittle
–Any delay upstream means all downstream tasks fail
It’s inefficient
–All scheduling has to be on a near worst case basis
–So the final result arrives later than we would like
Difficult to manage at scale
–Coordinating schedules between different teams is hard
11. Introducing Apache Oozie
URL: http://oozie.apache.org/
A workflow scheduler for Hadoop jobs
Describe your workflow as a DAG of actions
Trigger that workflow periodically or on dataset availability
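As a minimal sketch of a dataset-triggered coordinator (the dataset name, URI template, workflow path, and dates are invented, and the schema version should match your Oozie release), a coordinator that runs the PnL workflow once the day's web logs have landed might look like:

```xml
<coordinator-app name="marketing-pnl" frequency="${coord:days(1)}"
                 start="2014-01-01T06:00Z" end="2099-01-01T06:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One directory of web logs per day; _SUCCESS marks it complete -->
    <dataset name="weblogs" frequency="${coord:days(1)}"
             initial-instance="2014-01-01T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/weblogs/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The workflow is only launched once today's instance exists -->
    <data-in name="todays-logs" dataset="weblogs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${nameNode}/apps/pnl/workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The key difference from cron is the input event: the materialized action waits for the dataset instance rather than for a wall-clock time.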
20. Scheduling With Apache Oozie
Processes will be launched in a container on the cluster
There is a lot of XML
When working with multiple teams/pipelines, dataset definitions must be repeated
21. Introducing Apache Falcon
URL: http://falcon.apache.org/ (formerly http://falcon.incubator.apache.org/)
“A data processing and management solution”
Describe datasets and processes
Processes are scheduled based on the descriptions
Uses Oozie as the scheduler
Processes can be Hive HQL scripts, Pig scripts, or Oozie workflows
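A sketch of the two Falcon entity types is shown below. The feed and process names, cluster name, paths, and dates are all invented, and the exact element set varies by Falcon version; the point is that the dataset (feed) is declared once, with its own retention, and processes simply reference it:

```xml
<!-- Feed: the dataset declaration, including retention -->
<feed name="weblogs" xmlns="uri:falcon:feed:0.1">
  <frequency>days(1)</frequency>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2014-01-01T00:00Z" end="2099-01-01T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/weblogs/${YEAR}/${MONTH}/${DAY}"/>
  </locations>
  <ACL owner="etl" group="hadoop" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>

<!-- Process: references the feed as input; Falcon generates the Oozie schedule -->
<process name="pnl-attribution" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary">
      <validity start="2014-01-01T00:00Z" end="2099-01-01T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>days(1)</frequency>
  <inputs>
    <input name="logs" feed="weblogs" start="today(0,0)" end="today(0,0)"/>
  </inputs>
  <workflow engine="pig" path="/apps/pnl/attribution.pig"/>
</process>
```

Because every consuming process names the same feed, a change to the dataset's location or retention is made in one place rather than in each team's coordinator.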
24. Benefits and Observations of Falcon
About the same amount of XML but in smaller chunks
Declare the data and processing steps and have the schedule created for you
A dataset is declared once and used by all processing steps that need it
Also handles retention (a separate process under Oozie)
Also handles replication
25. Oozie Workflows
Describe a DAG of actions to take to complete a task
Available actions are:
–Map-Reduce
–Pig
–File system
–SSH
–Java
–Shell
All actions take place in a container on the cluster
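A skeleton workflow definition with a single Pig action illustrates the DAG structure (the app name, script name, and transition targets are invented; a real DAG would chain several actions through their `ok` transitions):

```xml
<workflow-app name="pnl-workflow" xmlns="uri:oozie:workflow:0.4">
  <start to="process-weblogs"/>
  <action name="process-weblogs">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>process_weblogs.pig</script>
    </pig>
    <!-- Edges of the DAG: where to go on success and on failure -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action node names its success and error transitions explicitly, which is how the DAG of Map-Reduce, Pig, Hive, shell and other actions listed above is wired together.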