This document summarizes the results of a survey of Cascading users. It finds that Cascading is most popular among those building and managing big data applications. Many users explored alternatives like Hive and Pig before adopting Cascading due to its scalability and portability across compute frameworks. The survey also shows that Cascading users value reliability and performance at scale and are interested in new frameworks like Spark.
Moustafa Soliman, "HP Vertica - Solving Facebook Big Data Challenges" – Dataconomy Media
Moustafa Soliman, Business Intelligence Developer from Hewlett Packard, presented "HP Vertica - Solving Facebook Big Data Challenges" as part of the "Big Data Stockholm" meetup on April 1st at SUP46.
R, Spark, TensorFlow, H2O.ai Applied to Streaming Analytics – Kai Wähner
Slides from my talk at Codemotion Rome in March 2017. Development of analytic machine learning / deep learning models with R, Apache Spark ML, TensorFlow, H2O.ai, RapidMiner, KNIME and TIBCO Spotfire. Deployment to real-time event processing / stream processing / streaming analytics engines like Apache Spark Streaming, Apache Flink, Kafka Streams, TIBCO StreamBase.
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha... – Shirshanka Das
So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.
As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity for Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop. They also explore Dali, a data abstraction layer that can help you process data seamlessly across Kafka and Hadoop.
Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.
This document provides an overview of Hortonworks and Hadoop. It discusses Hortonworks' customer momentum, the Hortonworks Data Platform (HDP), and Hortonworks' role as a partner for customer success. It also summarizes challenges with traditional data systems, how Hadoop emerged as a foundation for a new data architecture, and how HDP delivers a comprehensive data management platform.
Early adopters of cloud technology—companies that have planned, implemented and seen the benefits in real deployments—are beginning to establish a track record of “lessons learned”. The Economist Intelligence Unit, sponsored by SAP, has analysed the experiences of six companies that have implemented cloud solutions specifically designed to foster collaboration in the workplace.
Jan van der Vegt, "Challenges faced with machine learning in practice" – Lviv Startup Club
Machine learning projects often fail to make it from development to production. Looking at the full machine learning lifecycle is essential for success; the lifecycle spans development, deployment, infrastructure, monitoring, automation, standardization, lineage and reproducibility. A machine learning operations (MLOps) platform can provide an end-to-end system view for increased efficiency, collaboration, and trust across the lifecycle. The key takeaways are to focus on what is important, and to avoid both doing nothing (which fails to scale) and doing everything (which stifles progress).
Digital Transformation - #StrataData London 2017 - Data 101 – Ellen Friedman
Presented at Strata Data London conference May 2017 in the Data 101 track, this presentation explores what is needed in planning, architecture, and cultural organization for effective digital transformation.
What Makes Machine Learning Work? Berlin Buzzwords 2018 #bbuzz talk – Ellen Friedman
This document provides an overview of a presentation given by Ellen Friedman on machine learning. Some key points discussed include:
- Domain knowledge is very important for machine learning to work effectively. Small differences in input data or labels can significantly impact model performance.
- Stream processing and microservices architectures are useful for managing the many models needed for machine learning. Having the right messaging infrastructure is also important.
- Deploying and managing machine learning models at scale poses logistical challenges. The Rendezvous architecture and DataOps approaches aim to help with continuous model evaluation, deployment and adaptation.
- Both software engineers and data scientists have important roles to play in machine learning projects, and cross-functional teams are needed.
ACCELERATE SAP® APPLICATIONS WITH CDNETWORKS – CDNetworks
CDNetworks and SAP conducted a proof of concept project to test how CDNetworks' content delivery network (CDN) service could accelerate SAP applications. Testing showed the CDN provided significant performance improvements, reducing response times for login and file downloads by 50-66% on average globally. The CDN also improved reliability, with no errors observed during stress testing of 10,000 transactions, whereas the internet saw around a 4% failure rate. The CDN's global infrastructure and security features were found to enhance the delivery, speed, and reliability of SAP applications for distributed users worldwide.
The document discusses embedding machine learning in business processes using the example of baking cakes. It notes that while bakers follow exact recipes and processes, the results are not always perfect due to various factors. It then discusses how manufacturers are "data rich but information poor" as they cannot derive meaningful insights from their operational data. The document advocates generating "actionable intelligence" through deep analysis of production data to determine the root causes of issues like cracked cakes, rather than just reporting what problems occurred. This would help manufacturers diagnose and address process flaws more precisely.
Haven OnDemand is a machine learning platform that provides APIs and services to help developers easily build data-rich applications. It has over 60 composable machine learning APIs that can be combined to power use cases like text analysis, image recognition, and predictive modeling. Developers can build powerful applications with minimal coding by leveraging these APIs. Haven OnDemand also offers purpose-built solutions like Haven Search OnDemand that are built on top of the API platform.
Apache Hadoop and its role in Big Data architecture - Himanshu Bari – jaxconf
In today’s world of exponentially growing big data, enterprises are becoming increasingly aware of the business utility and necessity of harnessing, storing and analyzing this information. Apache Hadoop has rapidly evolved to become a leading platform for managing and processing big data, with the vital management, monitoring, metadata and integration services required by organizations to glean maximum business value and intelligence from their burgeoning amounts of information on customers, web trends, products and competitive markets. In this session, Hortonworks' Himanshu Bari will discuss the opportunities for deriving business value from big data by looking at how organizations utilize Hadoop to store, transform and refine large volumes of this multi-structured information. Bari will also discuss the evolution of Apache Hadoop and where it is headed, the component requirements of a Hadoop-powered platform, as well as solution architectures that allow for Hadoop integration with existing data discovery and data warehouse platforms. In addition, he will look at real-world use cases where Hadoop has helped to produce more business value, augment productivity or identify new and potentially lucrative opportunities.
Understanding The Cloud For Enterprise Businesses – Triaxil
Cloud is getting lots of attention these days. Cloud is a transformational platform that can support the opportunities of today’s digital business being shaped and driven by mobile, social, IoT (Internet of Things), Big Data and other forces. Cloud Computing not only is a powerful agent of change, but it also can accelerate transformation.
The benefits are big. “Cloud computing is a disruptive phenomenon, with the potential to make IT organizations more responsive than ever,” says research firm Gartner. “Cloud computing promises economic advantages, speed, agility, flexibility, infinite elasticity and innovation.” As a result, more and more enterprises are moving to the cloud. According to Gartner, 78 percent of enterprises are planning to increase their investment in cloud through 2017.
Introduction to the graph technologies landscape – Linkurious
Graph technologies allow modeling of complex relationships and connections through nodes and edges. There are three main layers of graph technologies: graph databases to store graph data, graph analysis frameworks to analyze large graphs, and graph visualization solutions to interact with graphs. Popular tools in each layer include Neo4j and Titan for databases, Giraph and GraphX for analysis, and Gephi and Cytoscape for visualization. Graph technologies are gaining more attention due to their ability to extract insights from connected data.
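The node-and-edge model shared by all three layers can be sketched in a few lines. The following is an illustrative stand-alone sketch in plain Python, not the API of Neo4j, Titan, or any other tool named above; all class and method names are hypothetical.

```python
from collections import defaultdict

class PropertyGraph:
    """Toy directed graph: nodes carry properties, edges carry a type."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # source id -> list of (edge type, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, edge_type, target):
        self.edges[source].append((edge_type, target))

    def neighbors(self, node_id, edge_type=None):
        # Follow outgoing edges, optionally filtered by edge type.
        return [target for etype, target in self.edges[node_id]
                if edge_type is None or etype == edge_type]

g = PropertyGraph()
g.add_node("alice", kind="person")
g.add_node("acme", kind="company")
g.add_edge("alice", "WORKS_AT", "acme")
print(g.neighbors("alice", "WORKS_AT"))  # ['acme']
```

Graph databases persist and index exactly this kind of structure, analysis frameworks distribute traversals over it, and visualization tools render the nodes and edges interactively.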
The document discusses the future of data and modern data applications. It notes that data is growing exponentially and will reach 44 zettabytes by 2020. This growth is driving the need for new data architectures like Apache Hadoop which can handle diverse data types from sources like the internet of things. Hadoop provides distributed storage and processing to enable real-time insights from all available data.
A successful enterprise Journey to Cloud requires more than technical execution, and we’ll help you learn what to consider, the pitfalls and how to succeed. We’ve helped many companies – in Australia and globally – execute their digital vision and accelerate change on their Journey to Cloud. We’ll share some of their experiences to help you discover how an optimised migration can transform your business.
Speakers:
Chris Fleishmann, Managing Director, Journey to Cloud Chief Architect
Attilio Di Lorenzo, Senior manager, Journey to Cloud Architect
The document discusses how businesses are increasingly adopting public and private cloud services. It provides statistics showing that 58% of organizations currently use cloud services for small applications and workloads. The use of cloud infrastructure as a service (IaaS) and platform as a service (PaaS) is growing significantly and driving digital business innovation. The top challenges with public cloud include bandwidth costs, performance constraints, and cloud services going down. The document argues that adding flash memory to cloud infrastructure can enhance performance, reliability, and cost effectiveness by providing predictable performance, high throughput, and redundancy for critical workloads.
The document discusses big data and open source tools and technologies. It provides an overview of key challenges for data leaders, introduces the top 10 big data tools including Apache Spark, R, and Talend Open Studio. It outlines the benefits of open source including low costs, flexibility, and innovation. The document advocates adopting both corporate and open source software using a "bi-modal" approach to support innovative and engineered analytics. It provides a template for a 1-page big data strategy.
SnapLogic has been gaining traction in big-data integration. It recently announced the Fall 2015 release of its Elastic Integration Platform, which adds capabilities for big-data integration that now include Spark (an open source in-memory data-processing framework), a new Snap (preconfigured connector) for Cassandra (an open source distributed ‘big’ database) and support for Microsoft Cortana Analytics. SnapLogic is positioning this release as a self-service hybrid cloud integration offering, and it is intended to strengthen its position among Microsoft customers and others seeking cloud-based big-data analytics.
The document discusses the development of an internal data pipeline platform at Indix to democratize access to data. It describes the scale of data at Indix, including over 2.1 billion product URLs and 8 TB of HTML data crawled daily. Previously, the data was not discoverable, schemas changed and were hard to track, and using code limited who could access the data. The goals of the new platform were to enable easy discovery of data, transparent schemas, minimal coding needs, UI-based workflows for anyone to use, and optimized costs. The platform developed was called MDA (Marketplace of Datasets and Algorithms) and enabled SQL-based workflows using Spark. It has continued improving since its first release in 2016.
You’re not the only one still loading your data into data warehouses and building marts or cubes out of it. But today’s data requires a much more accessible environment that delivers real-time results. Prepare for this transformation, because your data platform and storage choices are about to undergo a re-platforming that happens once every 30 years.
With the MapR Converged Data Platform (CDP) and Cisco Unified Compute System (UCS), you can optimize today’s infrastructure and grow to take advantage of what’s next. Uncover the range of possibilities from re-platforming by intimately understanding your options for density, performance, functionality and more.
Functional programming for optimization problems in Big Data – Paco Nathan
Enterprise Data Workflows with Cascading.
Silicon Valley Cloud Computing Meetup talk at Cloud Tech IV, 4/20 2013
http://www.meetup.com/cloudcomputing/events/111082032/
BIG Data & Hadoop Applications in Social Media – Skillspeed
This document discusses how major social media networks like Facebook, Twitter, LinkedIn, Pinterest, and Instagram utilize big data and Hadoop technologies. It provides examples of how each network uses Hadoop for tasks like storing user data, performing analytics, and generating personalized recommendations at massive scales as their user bases and data volumes grow enormously. The document also briefly outlines SkillSpeed's Hadoop training course, which covers topics like HDFS, MapReduce, Pig, Hive, HBase and more to prepare students for jobs working with big data.
The document summarizes the key findings from a survey on the future of cloud computing in 2012. Some of the main points covered include:
1) Software is increasingly becoming cloud-based, with SaaS spending growing much faster than traditional software and over 50% of categories being disrupted.
2) SaaS is widely adopted, with 82% currently using it and 84% of new software predicted to be SaaS. PaaS adoption is also increasing significantly.
3) Hybrid cloud models are becoming more popular, with 100% of deployments predicted to be hybrid by 2017.
4) While cloud adoption is increasing, concerns around security, compliance and other issues remain barriers for some.
The document discusses cloud computing trends, including:
- Most large enterprises are transitioning infrastructure to cloud computing to cut costs and risks. Critical workloads are also moving to cloud.
- Hybrid cloud strategies that maintain some workloads on-premise while moving others to cloud are becoming more common and supported.
- Hardware companies are struggling to remain relevant as cloud platforms commoditize infrastructure. They are pursuing mergers and spin-offs.
- DevOps practices emphasize continuous delivery over traditional ITIL change processes. The role of IT is shifting from systems maintenance to innovation brokerage and service management between internal and cloud resources.
FlexPod Select for Hadoop is a pre-validated solution from Cisco and NetApp that provides an enterprise-class architecture for deploying Apache Hadoop workloads at scale. The solution includes Cisco UCS servers and fabric interconnects for compute, NetApp storage arrays, and Cloudera's Distribution of Apache Hadoop for the software stack. It offers benefits like high performance, reliability, scalability, simplified management, and reduced risk for organizations running business-critical Hadoop workloads.
SnapLogic Raises $37.5M to Fuel Big Data Integration Push – SnapLogic
SnapLogic has grown well and rapidly since it pivoted in 2012 to focus on cloud-based iPaaS; however, the company continues to compete with on-premises providers, especially for big-data integration, thanks to its hybrid execution framework, which separates the design and management of integration pipelines from the runtime environment. Microsoft’s involvement in the latest funding round is sure to be a blessing, and builds on an existing agreement to provide integration for the Cortana Analytics Suite and Azure cloud.
This document provides an overview of IT/Network Operations concepts and strategies to improve cloud production. It begins with Joe Dietz introducing himself as a Network Security Professional and listing his current certifications. It then discusses various local user groups and events related to cloud security. The document covers topics such as selecting public vs private clouds, choosing cloud providers and applications, operational considerations, and approaches to connecting networks to the cloud such as extending datacenters or enabling edge services. It emphasizes that moving to the cloud still requires planning and not all applications are good candidates. The summary concludes by mentioning related reading on hybrid cloud services and tools.
The business analytics marketplace is experiencing a challenge as classic BI tools meet up with evolving big data technologies, in particular Hadoop. We explore how IBM works to meet this challenge, providing a big picture perspective of their big data offerings around Hadoop, its open data platform and BigInsights.
Building a Big Data platform with the Hadoop ecosystem – Gregg Barrett
This presentation provides a brief insight into a Big Data platform using the Hadoop ecosystem.
To this end the presentation will touch on:
- views of the Big Data ecosystem and its components
- an example of a Hadoop cluster
- considerations when selecting a Hadoop distribution
- some of the Hadoop distributions available
- a recommended Hadoop distribution
Learn why 451 Research believes Infochimps is well-positioned with an easy-to-consume managed service for those without Hadoop expertise, as well as a stack of technologically interesting projects for the 'devops' crowd.
Opening with a market positioning statement and ending with a competitive and SWOT analysis, Matt Aslett provides a comprehensive impact report.
Infochimps report: 451 Research impact report – Accenture
Infochimps, a big data PaaS provider, has updated its platform with stream processing capabilities from technologies developed at Twitter and LinkedIn. With its first paying customer, the company is now seeking partnerships to support its enterprise-focused offering. It provides an easy-to-use managed service for Hadoop that masks complexity and can generate insights from data in 30 days without specialized hiring or infrastructure. While competition is increasing, Infochimps' strengths include its Chef-based cluster platform and integration of existing tools via its Data Delivery Service.
Similar to Cascading 2015 User Survey Results
Overview of Cascading 3.0 on Apache Flink – Cascading
Cascading is a Java API for building batch data applications on Hadoop. This document discusses executing Cascading programs on Apache Flink instead of Hadoop MapReduce. With Cascading on Flink, programs are translated to single Flink jobs instead of multiple MapReduce jobs. This improves performance by allowing pipelined execution without writing intermediate data to HDFS. For example, a TF-IDF program runs 3.5 hours faster on Flink than MapReduce. Cascading on Flink leverages Flink's efficient in-memory operators while requiring minimal code changes.
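For context, TF-IDF (the benchmark workload cited above) weights each term by its frequency within a document, discounted by how many documents contain it. Here is a minimal in-memory sketch of that computation in plain Python; this is an illustration of the metric itself, not the Cascading or Flink implementation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    # Term frequency times inverse document frequency, per document.
    return [{term: (count / len(doc)) * math.log(n / df[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

docs = [["big", "data", "big"], ["data", "flows"], ["flink", "flows"]]
weights = tf_idf(docs)
```

On a cluster, the document-frequency count and the per-document weighting become separate grouping and join steps, which is why pipelining them in memory (as Flink does) avoids the intermediate HDFS writes that MapReduce would require between jobs.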
Predicting Hospital Readmission Using Cascading – Cascading
Michael Covert will examine how Healthcare Providers are finding ways to use Big Data analytics to reduce readmission rates and improve operational efficiency while complying with regulatory mandates.
Reducing Development Time for Production-Grade Hadoop Applications – Cascading
Ryan Desmond's presentation at the Cascading Meetup on August 27, 2015. A brief overview of Cascading, intended to give a basic understanding to Clojure users who might use PigPen & Clojure to access Cascading.
Breathe new life into your data warehouse by offloading ETL processes to Hadoop – Cascading
This document discusses offloading ETL workloads from data warehouses to Hadoop. It provides an overview of Bitwise, an ISO-certified company that provides ETL and data quality services. It also describes Driven, a platform for building, running, and managing big data applications. Driven provides visibility into data pipelines, monitors application performance, and enables collaboration around operational issues. It stores metadata about application telemetry in a scalable and searchable manner to provide end-to-end operational visibility for Hadoop applications.
How To Get Hadoop App Intelligence with Driven – Cascading
You built Cascading/Scalding apps to mine all that data you collected in Hadoop. But just when you were seeing results, something went wrong — the app broke, data flows stopped, and business came to a halt.
So what do you do next? How do you find out what went wrong in the shortest time possible? How do you pinpoint the line of code where the error occurred? How do you know which SLA is going to be impacted? How do you view the lineage of data to adhere to compliance requirements?
In this presentation, we show you how to easily find the answers with Driven, the most comprehensive Big Data App Performance Management Platform.
Furthermore, this presentation describes how Driven can help you build higher quality big data apps; run big data apps more reliably; and manage big data apps more effectively.
Who should view this PPT: Any person or organization that is currently involved in planning, deploying or managing a Hadoop application infrastructure.
7 Best Practices for Achieving Operational Readiness on Hadoop with Driven an... – Cascading
This video dives into 7 best practices for how IT organizations can achieve true operational readiness on Hadoop using Driven and Cascading.
For any person, organization or enterprise that is currently involved in planning, deploying or managing a Hadoop infrastructure. Development Teams, IT Ops, Executive Management.
Key Takeaways:
- Connecting execution problems with application context
- Defining and enforcing SLAs
- Understanding inter-app dependencies
- Rationing your cluster
- Tracing data access at the operational level
- Building culture and tools supporting collaboration between developers, operators, & other Hadoop team members
The Cascading (big) data application framework - André Kelpe, Sr. Engineer, C... – Cascading
André Kelpe's presentation at Hadoop User Group France - 25.11.2014.
Abstract: Cascading is a widely deployed, production-ready open source data application framework geared towards Java developers. Cascading enables developers to write complex data applications without the need to become a distributed systems expert. Cascading apps are portable between different computation frameworks, so that a given application can be moved from Hadoop onto new processing platforms like Apache Tez or Apache Spark without rewriting any of the application code.
Cascading - A Java Developer’s Companion to the Hadoop World – Cascading
Presentation by Dhruv Kumar, Sr. Field Engineer at Concurrent.
Amid all the hype and investment around Big Data technologies, many Java software engineers are asking what it takes to become big data engineers, and which path they should steer their careers toward.
Join Dhruv Kumar as he introduces Cascading, an open source application development framework that allows Java developers to build applications on top of Hadoop through its Java API. We’ll provide an overview of the landscape for developing applications on Hadoop and explain why Cascading has become so popular, comparing it to other abstractions such as Pig and Hive. Dhruv will also show you how Java developers can easily get started building applications on Hadoop with live examples of good ol’ Java code.
Elasticsearch + Cascading for Scalable Log Processing – Cascading
Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elastic Search". Elasticsearch is becoming a popular platform for log analysis with its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. Complemented with Cascading, the application development platform for building Data applications on Apache Hadoop, developers can correlate at scale multiple log and data streams to perform rich and complex log processing before making it available to the ELK stack.
Introduction to Cascading by Bryce Lohr
Presentation on Cascading delivered at the Triad Hadoop Users Group. This presentation provides a brief introduction to Cascading, a Java library for developing scalable Map/Reduce applications on Hadoop.
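The Map/Reduce model such a library targets can be illustrated in miniature. Below is a hypothetical in-process sketch of the two phases in plain Python rather than Hadoop Java, purely to show the shape of the computation a framework plans and distributes for you.

```python
from itertools import groupby
from operator import itemgetter

# Word count as two phases. On a real cluster a framework like Cascading
# plans these into Hadoop jobs; here both phases run in-process.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)          # emit (key, value) pairs

def reduce_phase(pairs):
    pairs = sorted(pairs)            # stands in for the shuffle/sort step
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["big data big apps", "data flows"]))
print(counts)  # {'apps': 1, 'big': 2, 'data': 2, 'flows': 1}
```

The value of a higher-level library is that developers compose pipelines of such operations while the framework handles partitioning, fault tolerance, and job scheduling.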
Bryce Lohr is a software developer at Inmar, focused on developing data analysis applications using Hadoop and related technologies.
https://www.linkedin.com/pub/bryce-lohr/3/589/225
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... – Zilliz
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Finetuning GenAI For Hacking and Defending – Priyanka Aash
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
"Making .NET Application Even Faster", Sergey Teplyakov – Fwdays
In this talk we're going to explore the performance improvement lifecycle, starting with setting performance goals, using profilers to find the bottlenecks, making a fix, and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their applications fast and understanding how things work under the hood.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan... – Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Retrieval Augmented Generation Evaluation with Ragas – Zilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
UiPath Community Day Amsterdam: Code, Collaborate, Connect – UiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
"Hands-on development experience using wasm Blazor", Furdak Vladyslav – Fwdays
I will share my personal experience of full-time development on wasm Blazor:
- the difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, and which technology stack and architectural patterns we chose
- the conclusions we drew and the mistakes we made
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
2. Confidential

WHAT'S BEHIND THE RISE OF CASCADING?

Enterprise IT teams designing their big data platforms must choose from a daunting array of development frameworks and compute fabrics. On the one hand, they want a development framework that leverages existing skillsets. At the same time, they want the flexibility to benefit from the performance gains of the latest, greatest compute fabrics. Cascading is a robust framework with over 10,000 known production deployments and over 275,000 downloads per month. Twitter, Airbnb, Climate Corp, Apple, eBay, and Netflix are a few of the enterprises that have built their Hadoop practices with Cascading. The Cascading user group is a diverse, self-supporting community that is driving innovation in Cascading's scalability, portability, performance and value. In addition, the large number of open source projects contributed by mainstream enterprises such as Netflix, Commonwealth Bank of Australia, and Expedia attests to the vibrancy of the Cascading ecosystem.

In this paper, we'll reveal what's behind Cascading's growth by digging into the results of a new Cascading user survey. In general, Cascading users turn out to be extremely concerned about reliability and performance at scale. Many experimented with early Hadoop frameworks like Hive and Pig, but found Cascading to be a more scalable approach. And lately, the easy portability of Cascading applications between compute fabrics has generated a lot of excitement in the community.
3.

CASCADING IS MOST POPULAR AMONG BUILDERS AND MANAGERS OF BIG DATA APPLICATIONS

[Chart: "What title best describes your role?" (N=121). Roles surveyed: Head/VP of IT, Head of IT Infrastructure, Application Manager/Director, BI/EDW Manager/Director, CIO/SVP of IT, IT Specialist, Architect, IT Manager or Director, Developer/Engineer. Background photo: Liverpool Street station crowd blur, by David Sim.]
4.

CASCADING COMMUNITY MEMBERS ARE MATURE, PRODUCTION USERS

[Chart: "How long have you been using Hadoop?" (N=69) — 0-12 months: 8%; 12-24 months: 26%; 24-36 months: 25%; over 3 years: 41%.]

The largest group of respondents (41%) has been using Hadoop for over three years. Assuming the sample is representative, the Cascading community largely consists of early Hadoop adopters. Furthermore, the Cascading community isn't just dabbling: over 84% have already put their Cascading applications into production or plan to do so. As for why, many likely found out the hard way that developing directly on Hadoop was painful, tedious and poorly suited to scale.

[Chart: "What challenges did you have that made you look for an application development framework?" Options: slow development in existing platform; high cost of development in existing platform; lack of skilled Hadoop resources; poor troubleshooting capabilities; difficult to integrate with existing systems; lack of portability across compute fabrics; lack of scalability; poor integration into existing IT infrastructure; other.]
5.

THE PATH TO CASCADING: HIVE, PIG, AND GUI TOOLS

Given the maturity of Cascading users, it's no surprise that many explored alternatives before settling on Cascading. The majority (51%) tried Hive and Pig, both of which were early abstraction layers for MapReduce. Today, many Pig applications run alongside Cascading, and many Hive applications run within Cascading.

Why didn't they stick with Hive and Pig? Most organizations determined they could not scale with Hive and Pig, typically because those frameworks required scarce technical resources and because development in them was slow. Those who opted for other API frameworks found them not yet ready for the enterprise.

A smaller group experimented with GUI-based ETL tools. While these tools made it easy to leverage existing resources and skill sets, their capabilities were too limited. They also required building special scripts to achieve complex functionality, which negated the benefits of simplicity. Additionally, many users did not like being locked into a single-vendor solution.

[Chart: "Before selecting Cascading, what alternative solutions did you explore? (select all that apply)" (N=69) — Pig: 26%; Hive: 25%; other API frameworks (Spark, Crunch): 22%; GUI-based ETL tools (Talend, Informatica, Pentaho): 19%; no other alternatives were explored: 8%.]
6.

PORTABILITY ACROSS FABRICS

[Chart: "Which compute fabric(s) are you using or planning to use in the next 18 months?" (N=69) — Spark, MapReduce, Kafka, Storm, Tez, Flink, other.]

New compute fabrics appear all the time, though not all are production-ready. The responses reflect high interest in Spark and a desire for true streaming (not micro-batches). MapReduce isn't going away any time soon, especially where reliability is a requirement. Still, many are experimenting with other compute fabrics. Because each fabric offers application-specific advantages, most organizations will likely wind up running multiple fabrics.

Cascading 3.0 supports Tez, MapReduce, and local/in-memory execution, so users can port applications from MapReduce to Tez simply by changing a few lines of code. Easy portability makes Cascading an ideal platform for moving from MapReduce to Tez without incurring the cost of rewriting applications. Soon, Cascading will support the same portability for Spark and Flink (for Flink, support will be community-contributed).
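To make the "few lines of code" claim concrete, here is a minimal word-count sketch, assuming Cascading 3.x with the cascading-local, cascading-hadoop2-mr1, and cascading-hadoop2-tez modules on the classpath (file paths here are illustrative). The pipe assembly is fabric-agnostic; porting it means swapping the FlowConnector and, for Hadoop fabrics, using Hfs taps instead of local FileTap taps:

```java
import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    // Fabric-agnostic pipe assembly: split lines into words, group, count.
    Pipe pipe = new Pipe("wordcount");
    pipe = new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    pipe = new GroupBy(pipe, new Fields("word"));
    pipe = new Every(pipe, new Count(new Fields("count")));

    // Local taps; on a Hadoop fabric these would be Hfs taps instead.
    Tap source = new FileTap(new TextLine(), "input.txt");
    Tap sink = new FileTap(new TextLine(), "output.txt", SinkMode.REPLACE);

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(pipe, source)
        .addTailSink(pipe, sink);

    // Porting between fabrics means changing this connector:
    FlowConnector connector = new LocalFlowConnector();          // local/in-memory
    // FlowConnector connector = new Hadoop2MR1FlowConnector();  // MapReduce
    // FlowConnector connector = new Hadoop2TezFlowConnector();  // Tez

    connector.connect(flowDef).complete();
  }
}
```

The commented-out connectors (from the cascading-hadoop2-mr1 and cascading-hadoop2-tez modules) are the only lines that differ per fabric; the business logic above them is untouched.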
7.

CASCADING BRIDGES OTHER DEVELOPMENT FRAMEWORKS

Despite their shortcomings, MapReduce, Hive and Pig are still widely used as development frameworks, largely because many early Hadoop applications were built through these interfaces. No surprise that we also see a lot of excitement about Spark as a new development framework; many users are experimenting with developing directly in the Spark API. Cascading will support Spark in a future WIP release, adding an important framework option for Spark developers. Developers who build in Cascading will be able to port their applications from MapReduce to Spark without having to rewrite them in the Spark API.

In summary, there is no one-size-fits-all framework. Flexibility is key as organizations build out their big data strategies and platforms.

[Chart: "What data application development framework do you use?" (N=69) — Spark, Cascading, MapReduce, Hive, Pig, Scalding, Cascalog.]

"[Cascading] Best Hadoop API for enterprise data-intensive apps." – Architect, Fortune 500 Healthcare Payer
8.

COMMON USE CASES: ETL, ANALYTICS & DATA INTEGRATION

Most organizations rely on Hadoop for heavy processing steps within ETL, analytics or data integration flows. Some have moved their entire ETL processing to Hadoop, while others have moved only portions of their workflows. For example, Airbnb uses Cascading for complicated infrastructure tasks such as data normalization and cleansing. Airbnb also leverages Cascading for reconstructing corrupted files and merging data. In combination with Cascading, analysts use Pig and Hive to run batch scripts for ad hoc analysis. With these tools, analysts can more easily study crucial metrics like click-through rates, page statistics, and drop-off rates.

[Chart: "What best describes the projects where you are using Cascading?" (N=69) — ETL, analytics, data integration, machine learning and scoring, data quality, recommendation engines, search optimization, other. Top responses: 45% offloading ETL to Hadoop; 40% supporting analytics/BI projects; 33% data integration projects.]
9. Confidential
Extremely
likely - 10
23%
9
10%
8
20%
7
19%
6
11%
5
6%
4
1%
3
3%
2
4%
Not at all
likely - 0
3%
How likely is it that you would
recommend Cascading to a friend or
colleague?
WHY
THEY
LOVE
CASCADING:
TDD,
JAVA
API,
PORTABILITY
N=79
Top
3
Most
Impactful
Capabilities
v Test
Driven
Development
(49%)
-‐ Efficiently
test
code
and
process
local
files
before
you
deploy
on
a
cluster
with
Cascading’s
local
or
in-‐
memory
mode.
Incorporate
inline
data
assertions
to
define
results
at
any
point
in
your
pipeline.
Failed
assertions
are
easily
visible
and
available
for
analysis.
v JavaAPI
(44%)
-‐ Cascading
is
a
Java
library
and
does
not
require
installation.
Cascading
fits
directly
into
a
standard
development
process;
all
you
have
to
do
is
code
to
the
API.
v Application
Portability
(43%)
-‐ When
you
compile
a
Cascading
job,
it
automatically
creates
a
run-‐time
executable
for
your
specified
compute
fabric.
Simply
by
changing
a
few
lines
of
code,
you
can
test
your
application
on
multiple
fabrics
and
choose
the
best
for
your
needs.
53%Of Respondents
are Promoters
(8/10)
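The inline data assertions behind the test-driven development capability can be sketched as follows, assuming cascading-core on the classpath (the helper class and method names are hypothetical):

```java
import cascading.operation.AssertionLevel;
import cascading.operation.assertion.AssertNotNull;
import cascading.operation.assertion.AssertSizeEquals;
import cascading.pipe.Each;
import cascading.pipe.Pipe;

public class PipelineAssertions {
  // Hypothetical helper: decorates any pipe with strict inline data assertions.
  public static Pipe withAssertions(Pipe pipe) {
    // Fail fast if any tuple streaming past this point contains a null value...
    pipe = new Each(pipe, AssertionLevel.STRICT, new AssertNotNull());
    // ...or does not contain exactly two fields.
    pipe = new Each(pipe, AssertionLevel.STRICT, new AssertSizeEquals(2));
    return pipe;
  }
}
```

When a flow is planned with a lower assertion level (for example, for a production deployment), the planner removes assertion stages above that level, so checks used during local testing add no cost in production.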
11.

CASCADING SLASHES TIME TO MARKET

Most respondents improved time to market by at least 40% (N=79).

[Chart: "What percentage would you estimate your time to market has improved?" — over 300%: 5%; over 100%: 17%; 80%-100%: 12%; 60%-80%: 18%; 40%-60%: 17%; 20%-40%: 18%; less than 20%: 13%.]
12.

THE FUTURE: BETTER PERFORMANCE, DATA PIPELINE VISIBILITY

[Chart: "What future challenges do you anticipate in managing your data applications?" (N=69) — optimizing application performance; identifying and resolving Hadoop application issues faster; monitoring SLAs for Hadoop applications; forecasting big data infrastructure needs; supporting chargeback models; other.]

Application performance management is a top-of-mind concern for most respondents. While performance tuning happens on the operations side, optimizing applications to meet service-level commitments is usually a collaborative effort between development and operations teams. Developers need better tools to visualize data pipelines and detect undesirable behavior before they promote applications to production. Operations teams need better tools to monitor, manage and optimize data delivery.

An important, though secondary, concern is tracking the rate of Hadoop resource consumption so clusters can be right-sized and costs distributed across divisions. This is particularly true as more of an organization's departments and teams build and rely on big data applications, transforming their Hadoop cluster from a side project into core production IT infrastructure.

With new application performance management tools such as Driven, teams can visualize data pipelines and identify unwanted behavior more effectively. Tools like Driven also arm teams with the data necessary to pinpoint issues quickly and resolve them collaboratively.
14.

DISTRIBUTIONS

[Chart: "Distributions" (N=69) — respondent counts for Cloudera, Amazon EMR, Apache Hadoop, Hortonworks, MapR, and other.]
15.

NUMBER OF APPLICATIONS AND VOLUME

[Chart: "Average Number of Cascading Applications and Pipelines" (N=69) — cross-tabulating the number of Cascading applications (1-5 through over 100) against pipeline volume (less than 250 through over 10,000 pipelines). Most respondents run fewer than 250 pipelines.]
16.

PRODUCTION STATUS

[Chart: "Are you using your Cascading data applications in a production environment?" (N=69) — yes; not yet but planned; no and not planned.]