The document provides an introduction and agenda for a presentation on data science and big data. It discusses Joe Caserta's background and experience in data warehousing, business intelligence, and data science. It outlines Caserta Concepts' focus on big data solutions, data warehousing, and industries like ecommerce, financial services, and healthcare. The agenda covers topics like governing big data for data science, introducing the data pyramid, what data scientists do, and standards for data science projects.
Caserta Concepts, Datameer and Microsoft shared their combined knowledge and a use case on big data, the cloud and deep analytics. Attendees learned how a global leader in the test, measurement and control systems market reduced its big data implementation from 18 months to just a few.
Speakers shared how to provide a business user-friendly, self-service environment for data discovery and analytics, and focused on how to extend and optimize Hadoop-based analytics, highlighting the advantages and practical applications of deploying on the cloud for enhanced performance, scalability and lower TCO.
Agenda included:
- Pizza and Networking
- Joe Caserta, President, Caserta Concepts - Why are we here?
- Nikhil Kumar, Sr. Solutions Engineer, Datameer - Solution use cases and technical demonstration
- Stefan Groschupf, CEO & Chairman, Datameer - The evolving Hadoop-based analytics trends and the role of cloud computing
- James Serra, Data Platform Solution Architect, Microsoft - Benefits of the Azure Cloud Service
- Q&A, Networking
For more information on Caserta Concepts, visit our website: http://casertaconcepts.com/
This document discusses balancing data governance and innovation. It describes how traditional data analytics methods can inhibit innovation by requiring lengthy processes to analyze new data. The document advocates adopting a data lake approach using tools like Hadoop and Spark to allow for faster ingestion and analysis of diverse data types. It also discusses challenges around simultaneously enabling innovation through a data lake while still maintaining proper data governance, security, and quality. Achieving this balance is key for organizations to leverage data for competitive advantage.
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT... - Caserta
The role of the Chief Data Officer (CDO) has become integral to the evolution needed to turn a wisdom-driven company into an analytics-driven company. With Data Governance at the core of your responsibility, moving the innovation meter is a global challenge among CDOs. Specifically the CDO must:
• Provide a single point of accountability for data initiatives and issues
• Innovate ways to use existing data and evangelize a data vision for the organization
• Support & enforce data governance policies via outreach, training & tools
• Work with IT to develop/maintain an enterprise data repository
• Set standards for analytical reporting and generate data insights through data science
In this session, Joe Caserta addresses real-world CDO challenges, shares techniques to overcome them, manage corporate disruption, and achieve success.
Joe Caserta, President at Caserta Concepts, presented "Setting Up the Data Lake" at a DAMA Philadelphia Chapter Meeting.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
During this Big Data Warehousing Meetup, Caserta Concepts and Databricks addressed the number one operational and analytic goal of nearly every organization today – to have complete view of every customer. Customer Data Integration (CDI) must be implemented to cleanse and match customer identities within and across various data systems. CDI has been a long-standing data engineering challenge, not just one of logic and complexity but also of performance and scalability.
The speakers brought together best practice techniques with Apache Spark to achieve complete CDI.
Speakers:
Joe Caserta, President, Caserta Concepts
Kevin Rasmussen, Big Data Engineer, Caserta Concepts
Vida Ha, Lead Solutions Engineer, Databricks
The sessions covered a series of problems that are adequately solved with Apache Spark, as well as those that require additional technologies to implement correctly. Topics included:
· Building an end-to-end CDI pipeline in Apache Spark
· What works, what doesn't, and how we use Spark as we evolve
· Innovation with Spark, including methods for customer matching based on statistical patterns, geolocation, and behavior (see the PySpark sketch after this list)
· Using PySpark and Python's rich module ecosystem for data cleansing, standardization, and matching
· Using GraphX for matching and scalable clustering
· Analyzing large data files with Spark
· Using Spark for ETL on large datasets
· Applying Machine Learning & Data Science to large datasets
· Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally
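The matching techniques listed above can be sketched in PySpark. The snippet below is a minimal, hypothetical illustration (the table paths, column names and thresholds are assumptions, not the speakers' actual pipeline): it standardizes names, blocks candidate pairs on Soundex and zip code to avoid a full cross join, then scores the candidates with Levenshtein distance.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cdi-matching-sketch").getOrCreate()

# Hypothetical customer extracts from two systems; schemas are assumed.
crm = spark.read.parquet("/data/crm_customers")      # id, full_name, zip
web = spark.read.parquet("/data/web_registrations")  # id, full_name, zip

def standardize(df, src):
    """Trim and upper-case names, and add a Soundex blocking key."""
    return df.select(
        F.col("id").alias(f"{src}_id"),
        F.upper(F.trim(F.col("full_name"))).alias(f"{src}_name"),
        F.col("zip").alias(f"{src}_zip"),
    ).withColumn(f"{src}_sndx", F.soundex(F.col(f"{src}_name")))

crm_s, web_s = standardize(crm, "crm"), standardize(web, "web")

# Block on Soundex + zip, then score surviving pairs with Levenshtein distance.
candidates = crm_s.join(
    web_s,
    (crm_s.crm_sndx == web_s.web_sndx) & (crm_s.crm_zip == web_s.web_zip),
)
matches = (
    candidates
    .withColumn("name_dist", F.levenshtein("crm_name", "web_name"))
    .filter(F.col("name_dist") <= 2)   # threshold is illustrative only
)
matches.show()
```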
The speakers also touched on data governance, on-boarding new data rapidly, how to balance rapid agility and time to market with critical decision support and customer interaction. They also shared examples of problems that Apache Spark is not optimized for.
For more information on the services offered by Caserta Concepts, visit our website: http://casertaconcepts.com/
Against the backdrop of Big Data, the Chief Data Officer, by any name, is emerging as the central player in the business of data, including cybersecurity. The MITCDOIQ Symposium explored the developing landscape, from local organizational issues to global challenges, through case studies from industry, academic, government and healthcare leaders.
Joe Caserta, president at Caserta Concepts, presented "Big Data's Impact on the Enterprise" at the MITCDOIQ Symposium.
Presentation Abstract: Organizations are challenged with managing an unprecedented volume of structured and unstructured data coming into the enterprise from a variety of verified and unverified sources. With that is the urgency to rapidly maximize value while also maintaining high data quality.
Today we start with some history and the components of data governance and information quality necessary for successful solutions. I then bring it all to life with 2 client success stories, one in healthcare and the other in banking and financial services. These case histories illustrate how accurate, complete, consistent and reliable data results in a competitive advantage and enhanced end-user and customer satisfaction.
To learn more, visit www.casertaconcepts.com
A modern, flexible approach to Hadoop implementation incorporating innovation... - DataWorks Summit
A modern, flexible approach to Hadoop implementation incorporating innovations from HP Haven
Jeff Veis
Vice President
HP Software Big Data
Gilles Noisette
Master Solution Architect
HP EMEA Big Data CoE
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented What Data Do You Have and Where is it?
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017 - Caserta
Over the past eight or nine years, applying DevOps practices to various areas of technology within business has grown in popularity and produced demonstrable results. These principles are particularly fruitful when applied to a data analytics environment. Bob Eilbacher explains how to implement a strong DevOps practice for data analysis, starting with the necessary cultural changes that must be made at the executive level and ending with an overview of potential DevOps toolchains. Bob also outlines why DevOps and disruption management go hand in hand.
Topics include:
- The benefits of a DevOps approach, with an emphasis on improving quality and efficiency of data analytics
- Why the push for a DevOps practice needs to come from the C-suite and how it can be integrated into all levels of business
- An overview of the best tools for developers, data analysts, and everyone in between, based on the business’s existing data ecosystem
- The challenges that come with transforming into an analytics-driven company and how to overcome them
- Practical use cases from Caserta clients
This presentation was originally given by Bob at the 2017 Strata Data Conference in New York City.
Defining and Applying Data Governance in Today's Business Environment - Caserta
This document summarizes a presentation by Joe Caserta on defining and applying data governance in today's business environment. It discusses the importance of data governance for big data, the challenges of governing big data due to its volume, variety, velocity and veracity. It also provides recommendations on establishing a big data governance framework and addressing specific aspects of big data governance like metadata, information lifecycle management, master data management, data quality monitoring and security.
Meaning making – separating signal from noise. How do we transform the customer's next input into an action that creates a positive customer experience? We make the data more intelligent, so that it is able to guide our actions. The Data Lake builds on Big Data strengths by automating many of the manual development tasks, providing several self-service features to end-users, and an intelligent management layer to organize it all. This results in lower cost to create solutions, "smart" analytics, and faster time to business value.
Building a New Platform for Customer Analytics - Caserta
Caserta Concepts and Databricks partner up to bring you this insightful webinar on how a business can choose from all of the emerging big data technologies to figure out which one best fits their needs.
Using Machine Learning & Spark to Power Data-Driven Marketing - Caserta
Joe Caserta provides a statistically-driven model to understanding the customer path to purchase, which combines online, offline and third-party data sources. He shows how customer data is fed to machine learning, which assigns weighted credit to customer interactions in order to give insight to what marketing activities truly matter. This presentation is from Caserta's February 2018 Big Data Warehousing Meetup co-hosted with Databricks.
How do you balance the need for structured and rule-based governance to assure enterprise data quality - with the imperative to innovate in order to stay relevant and competitive in today's business marketplace?
At the recent CDO Summit in NYC, a range of C-Level Executives across a variety of industries came to hear Joe Caserta, president of Caserta Concepts, put it all in perspective.
Joe talked about the challenges of "data sprawl" and the paradigm shift underway in the evolving big data and data-driven world.
For more information or to contact us, visit http://casertaconcepts.com/
General Data Protection Regulation - BDW Meetup, October 11th, 2017 - Caserta
Caserta Presentation:
General Data Protection Regulation (GDPR) is a business and technical challenge for companies worldwide - and the deadlines are coming fast! American institutions that do business in the EU or have customers from the EU will have their data practices affected. With this in mind, Caserta – joined by Waterline Data, Salt Recruiting, and Squire Patton Boggs – hosted a BDW Meetup on the GDPR, which is perhaps the most controversial data legislation that has been passed to date.
Joe Caserta, Founding President, Caserta, spoke on the basics of the GDPR, how it will impact data privacy around the world, and some techniques geared towards compliance.
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C... - Caserta
Joe Caserta explores the world of analytics, tech, and AI to paint a picture of where business is headed. This presentation is from the CDAO Exchange in Miami 2018.
This presentation will discuss the stories of 3 companies that span different industries; what challenges they faced and how cloud analytics solved for them; what technologies were implemented to solve the challenges; and how they were able to benefit from their new cloud analytics environments.
The objectives of this session include:
• Detail and explain the key benefits and advantages of moving BI and analytics workloads to the cloud, and why companies shouldn’t wait any longer to make their move.
• Compare the different analytics cloud options companies have, and the pros and cons of each.
• Describe some of the challenges companies may face when moving their analytics to the cloud, and what they need to prepare for.
• Provide the case studies of three companies, what issues they were solving for, what technologies they implemented and why, and how they benefited from their new solutions.
• Learn what to look for when considering a partner and trusted advisor to assist with an analytics cloud migration.
The Data Lake - Balancing Data Governance and Innovation - Caserta
Joe Caserta gave the presentation "The Data Lake - Balancing Data Governance and Innovation" at DAMA NY's one day mini-conference on May 19th. Speakers covered emerging trends in Data Governance, especially around Big Data.
For more information on Caserta Concepts, visit our website at http://casertaconcepts.com/.
This document discusses best practices for using Hadoop as an enterprise data hub. It provides an overview of how big data is driving new analytical workloads and the need for deeper customer insights. It discusses challenges with analyzing new sources of structured, unstructured and multi-structured data. It introduces the concept of a Hadoop enterprise data hub and data refinery to simplify access to new insights from big data. Key components of the data hub include a data reservoir to capture raw data from various sources, a data refinery to cleanse and transform the data, and publishing high value insights to data warehouses and other systems.
What is big data, why it is needed by organizations that generate huge amounts of data, and when it should be used.
This document discusses appropriate and inappropriate use cases for Apache Spark based on the type of data and workload. It provides examples of good uses, such as batch processing, ETL, and machine learning/data science. It also gives examples of bad uses, such as random access queries, frequent incremental updates, and low latency stream processing. The document recommends using a database instead of Spark for random access, updates, and serving live queries. It suggests using message queues instead of files for low latency stream processing. The goal is to help users understand how to properly leverage Spark for big data workloads.
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!! - Caserta
Joe Caserta went over the details inside the big data ecosystem and the Caserta Concepts Data Pyramid, which includes Data Ingestion, Data Lake/Data Science Workbench and the Big Data Warehouse. He then dove into the foundation of dimensional data modeling, which is as important as ever in the top tier of the Data Pyramid. Topics covered:
- The 3 grains of Fact Tables
- Modeling the different types of Slowly Changing Dimensions
- Advanced Modeling techniques like Ragged Hierarchies, Bridge Tables, etc.
- ETL Architecture.
He also talked about ModelStorming, a technique used to quickly convert business requirements into an Event Matrix and Dimensional Data Model.
This was a jam-packed abbreviated version of 4 days of rigorous training of these techniques being taught in September by Joe Caserta (Co-Author, with Ralph Kimball, The Data Warehouse ETL Toolkit) and Lawrence Corr (Author, Agile Data Warehouse Design).
For more information, visit http://casertaconcepts.com/.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
This document discusses data scientist profiles and provides guidance on building data scientist teams. It begins by establishing the importance of analytics for businesses. It then discusses the term "data scientist" and characterizes data scientists as having diverse backgrounds but being curious and asking important questions. The document outlines skills of data scientists and notes that while backgrounds vary, soft skills are very important. It provides tips for recruiting data scientists and emphasizes getting started with an analytical team even without perfect conditions.
Designing High Performance ETL for Data Warehouse - Marcel Franke
This document discusses best practices and approaches for designing high performance ETL for data warehousing. It summarizes the key components of the FastTrack reference architecture, including hardware configuration with storage, servers, and networking. It then evaluates different ETL loading strategies like loading into partitioned vs non-partitioned tables, with and without sorting, and using hash partitioning. Test results show that loading with multiple parallel streams into hash partitioned tables using partition switching can achieve the highest throughput.
This document discusses using Hadoop for smart meter data analytics. Smart meters track energy usage and send data to utility servers. Analyzing large volumes of smart meter data presents challenges due to data growth rates. Hadoop can help by reducing data loads and improving query performance due to its scalability. The document outlines how Hadoop can enable demand response analysis, time of use tariff analysis, and load profile analysis. It also provides diagrams of a Hadoop cluster and data flow for smart meter data analytics.
Hadoop for the Data Scientist: Spark in Cloudera 5.5 - Cloudera, Inc.
Inefficient data workloads are all too common across enterprises - causing costly delays, breakages, hard-to-maintain complexity, and ultimately lost productivity. For a typical enterprise with multiple data warehouses, thousands of reports, and hundreds of thousands of ETL jobs being executed every day, this loss of productivity is a real problem. Add to all of this the complex handwritten SQL queries, and there can be nearly a million queries executed every month that desperately need to be optimized, especially to take advantage of the benefits of Apache Hadoop. How can enterprises dig through their workloads and inefficiencies to easily see which are the best fit for Hadoop and what’s the fastest path to get there?
Cloudera Navigator Optimizer is the solution - analyzing existing SQL workloads to provide instant insights into your workloads and turning that into an intelligent optimization strategy so you can unlock peak performance and efficiency with Hadoop. As the newest addition to Cloudera's enterprise Hadoop platform, and now available in limited beta, Navigator Optimizer has helped customers profile over 1.5 million queries and ultimately save millions by optimizing for Hadoop.
This document discusses best practices for PL/SQL including avoiding hard coding, row-by-row processing, and memory management. It recommends using SQL as a service, bulk processing, bulk collect with limit clause, and NOCOPY hint for memory. It also discusses data caching techniques like deterministic functions, PGA caching, and function result cache to improve performance.
Oracle's Analytical SQL Support for Data Warehouses (Veri Ambarları için Oracle'ın Analitik SQL Desteği) - Emrah METE
This document discusses Oracle's analytical SQL functions. It begins by explaining the concepts of analytical functions and why they are useful for complex analysis without self-joins. It provides examples of common analytical functions like SUM, RANK, LAG, and LEAD. The document demonstrates how these functions can be used to perform tasks like cumulative sums, ranking, comparing adjacent rows, and aggregating across partitions. Finally, it discusses applications of analytical functions for tasks like pattern matching, pivot/unpivot operations, linear regression, and fraud detection.
This document discusses choosing the right data architecture for big data projects. It begins by acknowledging big data comes in many types, from structured transactional data to unstructured text data. It then presents several big data architectures and platforms that are suitable for different data types and use cases, such as relational databases, NoSQL databases, data grids, and distributed file systems. The document emphasizes that one size does not fit all and the right choice depends on the specific data and business needs.
The Future of Data Management: The Enterprise Data Hub - Cloudera, Inc.
The document discusses the enterprise data hub (EDH) as a new approach for data management. The EDH allows organizations to bring applications to data rather than copying data to applications. It provides a full-fidelity active compliance archive, accelerates time to insights through scale, unlocks agility and innovation, consolidates data silos for a 360-degree view, and enables converged analytics. The EDH is implemented using open source, scalable, and cost-effective tools from Cloudera including Hadoop, Impala, and Cloudera Manager.
This document provides an overview of data quality and the fundamentals of ensuring data quality in an organization. It discusses the importance of data quality and outlines the key steps in the data quality pipeline including extract, clean, conform, and deliver. It also covers determining the system of record, cleaning data from multiple sources, prioritizing data quality goals, different types of data quality enforcement, and tracking and monitoring data quality failures. The document emphasizes that achieving high quality data requires planning, well-defined processes, and continuous monitoring.
Utilities are facing an explosion of data from smart meter and grid technologies that they are ill-equipped to manage and analyze. This data, if properly analyzed, could provide strategic insights but utilities currently lack capabilities to interpret usage patterns, forecast demand, and leverage data for competitive advantage. The future requires utilities to develop competencies in data management, cross-functional analysis, and demand response programs in order to unlock value from consumer data and gain competitive advantages over other utilities.
Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie, and so on – it can be a challenge to assemble and operationalize them as a production ETL platform. This presentation covers one approach to data ingest, organization, format selection, process orchestration, and external system integration, based on collective experience acquired across many production Hadoop deployments.
This document discusses agile data warehouse design. It begins with an overview of data warehousing, including definitions of a data warehouse and common architectures. It then covers traditional waterfall and agile approaches to BI/WH development. The agile section focuses on an incremental lifecycle and agile dimensional modeling techniques like the 7Ws framework and BEAM methodology, which use natural language and collaboration to design models around business questions.
Hortonworks Data In Motion Series Part 3 - HDF Ambari Hortonworks
How To: Hortonworks DataFlow 2.0 with Ambari and Ranger for integrated installation, deployment and operations of Apache NiFi.
On demand webinar with demo: http://hortonworks.com/webinar/getting-goal-big-data-faster-enterprise-readiness-data-motion/
This document discusses how big data can help reduce costs and increase productivity in drug R&D. It outlines challenges such as increased clinical trials, patients, and data requirements that have led to higher R&D costs. Big data is presented as a solution by bringing more insights from data rather than resources. The document provides case studies and a 6-step approach for companies to leverage big data in R&D, including establishing an analytics strategy, identifying relevant data sources, and optimizing their analytics organization.
Workshop with Joe Caserta, President of Caserta Concepts, at Data Summit 2015 in NYC.
Data science, the ability to sift through massive amounts of data to discover hidden patterns and predict future trends and actions, may be considered the "sexiest" job of the 21st century, but it requires an understanding of many elements of data analytics. This workshop introduced basic concepts, such as SQL and NoSQL, MapReduce, Hadoop, data mining, machine learning, and data visualization.
For notes and exercises from this workshop, click here: https://github.com/Caserta-Concepts/ds-workshop.
For more information, visit our website at www.casertaconcepts.com
Data Governance, Compliance and Security in Hadoop with Cloudera - Caserta
The document discusses data governance, compliance and security in Hadoop. It provides an agenda for an event on this topic, including presentations from Joe Caserta of Caserta Concepts on data governance in big data, and Patrick Angeles of Cloudera on using Cloudera for data governance in Hadoop. The document also includes background information on Caserta Concepts and their expertise in data warehousing, business intelligence and big data analytics.
Incorporating the Data Lake into Your Analytic Architecture - Caserta
Joe Caserta, President at Caserta Concepts presented at the 3rd Annual Enterprise DATAVERSITY conference. The emphasis of this year's agenda is on the key strategies and architecture necessary to create a successful, modern data analytics organization.
Joe Caserta presented Incorporating the Data Lake into Your Analytics Architecture.
For more information on the services offered by Caserta Concepts, visit our website at http://casertaconcepts.com/.
In this presentation at DAMA New York, Joe started by asking a key question: why are we doing this? Why analyze and share all these massive amounts of data? Basically, it comes down to the belief that in any organization, in any situation, if we can get the data and make it correct and timely, insights from it will become instantly actionable for companies to function more nimbly and successfully. Enabling the use of data can be a world-changing, world-improving activity and this session presents the steps necessary to get you there. Joe explained the concept of the "data lake" and also emphasizes the role of a strong data governance strategy that incorporates seven components needed for a successful program.
For more information on this presentation or Caserta Concepts, visit our website at http://casertaconcepts.com/.
This document discusses how to build a successful data lake by focusing on the right data, platform, and interface. It emphasizes the importance of saving raw data to analyze later, organizing the data lake into zones with different governance levels, and providing self-service tools to find, understand, provision, prepare, and analyze data. It promotes the use of a smart data catalog like Waterline Data to automate metadata tagging, enable data discovery and collaboration, and maximize business value from the data lake.
Architecting for Big Data: Trends, Tips, and Deployment Options - Caserta
Joe Caserta, President at Caserta Concepts addressed the challenges of Business Intelligence in the Big Data world at the Third Annual Great Lakes BI Summit in Detroit, MI on Thursday, March 26. His talk "Architecting for Big Data: Trends, Tips and Deployment Options," focused on how to supplement your data warehousing and business intelligence environments with big data technologies.
For more information on this presentation or the services offered by Caserta Concepts, visit our website: http://casertaconcepts.com/.
This document discusses how Hadoop can be used to power a data lake and enhance traditional data warehousing approaches. It proposes a holistic data strategy with multiple layers: a landing area to store raw source data, a data lake to enrich and integrate data with light governance, a data science workspace for experimenting with new data, and a big data warehouse at the top level with fully governed and trusted data. Hadoop provides distributed storage and processing capabilities to support these layers. The document advocates a "polyglot" approach, using the right tools like Hadoop, relational databases, and cloud platforms depending on the specific workload and data type.
Ashley Ohmann--Data Governance Final 011315 - Ashley Ohmann
This presentation discusses enterprise data governance with Tableau. It defines data governance as processes that formally manage important data assets. The goals of data governance include establishing standards, processes, compliance, security, and metrics. Good data governance benefits an organization by improving accuracy, enabling better decisions with less waste. The presentation provides examples of how one organization improved data governance through stakeholder involvement, establishing metrics, building a data warehouse, and implementing Tableau for analytics. Key goals discussed are building trust, communicating validity, enabling access, managing metadata, provisioning rights, and maintaining compliance.
Data governance course - part 1.
Data Governance is the orchestration of people, process and technology
to enable an organization to leverage data as an enterprise asset.
The core objectives of a governance program are:
Guide information management decision-making
Ensure information is consistently defined and well understood
Increase the use and trust of data as an enterprise asset
Objectives of this presentation:
• Introduction to data governance
• Why data governance is a discussion today: the enterprise challenges
• History of Data Management
• Business Drivers for implementation of data governance
• Building Data Strategy & Governance Framework
• Data Management Maturity Models
• Data Quality Management
• Metadata and Governance
• Metadata Management
• Data Governance Stakeholder Communication Strategy
Data Lakes are early in the Gartner hype cycle, but companies are getting value from their cloud-based data lake deployments. Break through the confusion between data lakes and data warehouses and seek out the most appropriate use cases for your big data lakes.
The Data Lake and Getting Businesses the Big Data Insights They Need - Dunn Solutions Group
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can become a task to keep up with and clearly understand each of them. However a data lake is definitely something to dedicate the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location ready for consumption at any time – this includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
This document defines big data and discusses its key characteristics and applications. It begins by defining big data as large volumes of structured, semi-structured, and unstructured data that is difficult to process using traditional methods. It then outlines the 5 Vs of big data: volume, velocity, variety, veracity, and variability. The document also discusses Hadoop as an open-source framework for distributed storage and processing of big data, and lists several applications of big data across various industries. Finally, it discusses both the risks and benefits of working with big data.
The document discusses handling and processing big data. It begins by defining big data and explaining why it is important for companies to analyze big data. It then discusses several techniques for handling big data, including establishing goals, securing data, keeping data protected, ensuring data is interlinked, and adapting to new changes. The document also covers preprocessing big data by cleaning, integrating, reducing, and discretizing data. It provides a case study of preprocessing government agency data and discusses advanced tools and techniques for working with big data.
Every day we create roughly 2.5 quintillion bytes of data; 90% of the world's collected data has been generated in just the last 2 years. In this slide deck, learn all about big data in a simple and easy way.
Increasing Agility Through Data Virtualization - Denodo
This document discusses how data virtualization can help enterprises address data management challenges by providing a single source of truth, reducing data proliferation, enabling standardization and improving data quality. It describes how financial institutions face increased regulatory scrutiny around data practices. The solution presented is a Data Services Layer that acts as a common provisioning point for accessing authoritative data sources using technologies like data virtualization. Effective data governance is also emphasized as critical to the success of any data virtualization effort.
Sharing a presentation highlighting some key aspects to be taken into consideration while harnessing your Digital Transformation projects as a Digital Intelligence enabler for your enterprise
This document provides an overview of big data, including its definition, characteristics, sources, tools used, applications, benefits, and impact on IT. Big data is a term used to describe the large volumes of data, both structured and unstructured, that are so large they are difficult to process using traditional database and software techniques. It is characterized by high volume, velocity, variety, and veracity. Common sources of big data include mobile devices, sensors, social media, and software/application logs. Tools like Hadoop, MongoDB, and MapReduce are used to store, process, and analyze big data. Key applications areas include homeland security, healthcare, manufacturing, and financial trading. Benefits include better decision making, cost reductions
This document provides an overview of Oracle's Information Management Reference Architecture. It includes a conceptual view of the main architectural components, several design patterns for implementing different types of information management solutions, a logical view of the components in an information management system, and descriptions of how data flows through ingestion, interpretation, and different data layers.
This document provides an overview of handling and processing big data. It begins with defining big data and its key characteristics of volume, velocity, and variety. It then discusses several ways to effectively handle big data, such as outlining goals, securing data, keeping data protected, ensuring data is interlinked, and adapting to new changes. Metadata is also important for big data handling and processing. The document outlines the different types of metadata and closes by discussing technologies commonly used for big data processing like Hadoop, MapReduce, and Hive.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptxFwdays
I will share my personal experience of full-time development on wasm Blazor
What difficulties our team faced: life hacks with Blazor app routing, whether it is necessary to write JavaScript, which technology stack and architectural patterns we chose
What conclusions we made and what mistakes we committed
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in an digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
"Making .NET Application Even Faster", Sergey Teplyakov.pptxFwdays
In this talk we're going to explore performance improvement lifecycle, starting with setting the performance goals, using profilers to figure out the bottle necks, making a fix and validating that the fix works by benchmarking it. The talk will be useful for novice and seasoned .NET developers and architects interested in making their application fast and understanding how things work under the hood.
Generative AI technology is a fascinating field that focuses on creating comp...Nohoax Kanont
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
2. @joe_Caserta#DataSummit
Caserta Timeline
Launched Big Data practice
Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
Data Analysis, Data Warehousing and Business Intelligence since 1996
Began consulting in database programming and data modeling; 25+ years of hands-on experience building database solutions
Founded Caserta Concepts in NYC
Web log analytics solution published in Intelligent Enterprise
Launched Data Science, Data Interaction and Cloud practices; laser focus on extending Data Analytics with Big Data solutions
Dedicated to Data Governance techniques on Big Data (Innovation)
Launched Big Data Warehousing (BDW) Meetup, NYC: 2,000+ members
Established best practices for big data ecosystem implementations
2016: Awarded Top 20 Big Data Companies, Top 20 Most Powerful Big Data consulting firms, and Fastest Growing Big Data Companies
(The original slide lays these milestones out along a timeline spanning 1986, 1996, 2001, 2004, 2009, 2012, 2013, 2014 and 2016.)
3. @joe_Caserta#DataSummit
About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy and Implementation
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
4. @joe_Caserta#DataSummit
Agenda
• Why we care about Big Data
• Challenges of working with Big Data
• Governing Big Data for Data Science
• Introducing the Data Pyramid
• Why Data Science is Cool
• What does a Data Scientist do?
• Standards for Data Science
• Business Objective
• Data Discovery
• Preparation
• Models
• Evaluation
• Deployment
• Q & A
5. @joe_Caserta#DataSummit
Today's business environment requires Big Data.
[Architecture diagram: source systems (Enrollments, Claims, Finance) feed a traditional EDW via ETL for ad-hoc query and canned reporting (traditional BI), and also feed a Big Data Lake via ETL and NoSQL databases. The Big Data Lake is a horizontally scalable environment optimized for analytics - Spark, MapReduce and Pig/Hive running on nodes N1-N5 over the Hadoop Distributed File System (HDFS) - supporting Big Data Analytics, ad-hoc/canned reporting, Data Science, and others.]
6. @joe_Caserta#DataSummit
The Challenges of Building a Data Lake
• Velocity: data is coming in so fast, how do we monitor it? Real real-time analytics.
• Veracity: what does "complete" mean? Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis?
• Variety: a wider breadth of datasets and sources in scope requires larger data governance support. Data governance cannot start at the data warehouse.
• Volume: data volume is higher, so the process must be more reliant on programmatic administration, with less people/process dependence.
7. @joe_Caserta#DataSummit
What’s Old is New Again
Before Data Warehousing Governance:
• Users trying to produce reports from raw source data
• No Data Conformance
• No Master Data Management
• No Data Quality processes
• No Trust: two analysts were almost guaranteed to come up with two different sets of numbers!
Before Data Lake Governance:
• We can put "anything" in Hadoop
• We can analyze anything
• We're scientists, we don't need IT, we make the rules
Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess.
Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable.
8. @joe_Caserta#DataSummit
Making it Right
The promise is an "agile" data culture where communities of users are encouraged to explore new datasets in new ways:
• New tools
• External data
• Data blending
• Decentralization
With all the V's, data scientists, new tools, new data, we must rely LESS on HUMANS:
• We need more systemic administration
• We need systems, tools to help with big data governance
• This space is EXTREMELY immature!
Steps towards Data Governance for the Data Lake:
1. Establish the difference between traditional data and big data governance
2. Establish basic rules for where new data governance can be applied
3. Establish processes for graduating the products of data science to governance
4. Establish a set of tools to make governing Big Data feasible
10. @joe_Caserta#DataSummit
Components of Data Governance
• Organization: this is the 'people' part. Establishing an Enterprise Data Council, Data Stewards, etc.
• Metadata: definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: data must be complete and correct. Measure, improve, certify
• Business Process Integration: policies around data frequency, source availability, etc.
• Master Data Management: ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): data retention, purge schedule, storage/archiving
11. @joe_Caserta#DataSummit
Components of Data Governance - For Big Data
The same components apply, extended for Big Data:
• Add Big Data to the overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home-grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest (a masking sketch follows this list)
• Near-zero latency, DevOps, core component of business operations
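One of the bullets above calls for data detection and masking on unstructured data upon ingest. As a minimal sketch (the regex patterns, paths and column name are assumptions for illustration, not a specific Caserta implementation), a PySpark UDF can scan free-text fields for obvious PII patterns and mask them before the data lands in the lake:

```python
import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("masking-on-ingest-sketch").getOrCreate()

# Illustrative patterns only; real deployments need far more robust detection.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "***-**-****"),   # SSN-like numbers
    (re.compile(r"\b\d{13,16}\b"), "[CARD]"),                 # card-like numbers
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),      # email addresses
]

def mask_pii(text):
    """Replace anything matching a known PII pattern with a masked token."""
    if text is None:
        return None
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text

mask_udf = F.udf(mask_pii, StringType())

# Hypothetical raw call-center notes being ingested into the landing area.
raw = spark.read.text("/landing/call_center_notes")   # single column: value
masked = raw.withColumn("value", mask_udf(F.col("value")))
masked.write.mode("overwrite").parquet("/lake/call_center_notes_masked")
```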
12. @joe_Caserta#DataSummit
Data Lake Governance Realities
Full data governance can only be applied to "structured" data:
• The data must have a known and well-documented schema
• This can include materialized endpoints such as files or tables, OR projections such as a Hive table
Governed structured data must have:
• A known schema with metadata
• A known and certified lineage
• A monitored, quality-tested, managed process for ingestion and transformation
• A governed usage
Data isn't just for enterprise BI tools anymore. We talk about unstructured data in Hadoop, but more often it is semi-structured/structured with a definable schema. Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
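Since governed data must have a known, well-documented schema, one lightweight way to enforce that in the lake is to declare the schema explicitly at read time and expose the result as a Hive table projection. A minimal sketch, assuming hypothetical paths, table names and fields:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType, DecimalType

spark = (SparkSession.builder
         .appName("governed-schema-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Declaring the schema up front documents it and rejects silent drift.
claims_schema = StructType([
    StructField("claim_id",     StringType(),       nullable=False),
    StructField("member_id",    StringType(),       nullable=False),
    StructField("service_date", DateType(),         nullable=True),
    StructField("amount",       DecimalType(12, 2), nullable=True),
])

claims = (
    spark.read
    .schema(claims_schema)          # known, documented schema
    .option("mode", "FAILFAST")     # fail loudly on rows that do not conform
    .csv("/lake/raw/claims/")
)

# Expose a governed projection as a Hive table for downstream consumers.
claims.write.mode("overwrite").saveAsTable("governed.claims")
```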
13. @joe_Caserta#DataSummit
The Data Scientists Can Help!
Data Science to Big Data Warehouse mapping - full data governance requirements:
• Provide full process lineage
• Data certification process by data stewards and business owners
• Ongoing Data Quality monitoring that includes quality checks
Provide requirements for the Data Lake:
• Proper metadata established: catalog, data definitions, lineage, quality monitoring
• Know and validate data completeness
14. @joe_Caserta#DataSummit
The Big Data Analytics Pyramid
Hadoop has different governance demands at each tier. Only the top tier of the pyramid is fully governed; we refer to this as the Trusted tier of the Big Data Warehouse. The slide plots each tier's usage pattern against its data governance.
• Landing Area: raw machine data collection; collect everything.
• Data Lake: data is ready to be turned into information: organized, well defined, complete.
• Data Science Workspace: agile business insight through data munging, machine learning, blending with external data, and development of to-be BDW facts.
• Big Data Warehouse: fully data governed (trusted); user community arbitrary queries and reporting.
Governance annotations repeat up the tiers: a Metadata Catalog, ILM (who has access, how long do we "manage it"), and Data Quality and Monitoring (monitoring of completeness of data).
15. @joe_Caserta#DataSummit
What does a Data Scientist Do, Anyway?
Much of the time of a Data Scientist is spent:
• Searching for the data they need
• Making sense of the data
• Figuring out why the data looks the way it does and assessing its validity
• Cleaning up all the garbage within the data so it represents the true business
• Combining events with reference data to give it context
• Correlating event data with other events
• Finally, they write algorithms to perform mining, clustering and predictive analytics
...and NOT writing really cool and sophisticated algorithms that impact the way the business runs.
17. @joe_Caserta#DataSummit
The Data Scientist Winning Trifecta
• Modern Data Engineering / Data Preparation
• Domain Knowledge / Business Expertise
• Advanced Mathematics / Statistics
23. @joe_Caserta#DataSummit
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
24. @joe_Caserta#DataSummit
1. Business Understanding
In this initial phase of the project we will need to speak to humans.
• It would be premature to jump into the data, or begin selection of the appropriate model(s) or algorithm
• Understand the project objective
• Review the business requirements
• The output of this phase will be the conversion of business requirements into a preliminary technical design (decision model) and plan
Since this is an iterative process, this phase will be revisited throughout the entire process.
27. @joe_Caserta#DataSummit
2. Data Understanding
• Data Discovery: understand where the data you need comes from
• Data Profiling: interrogate the data at the entity level; understand key entities and fields that are relevant to the analysis
• Cleansing Requirements: understand data quality, data density, skew, etc.
• Data Munging: collocate, blend and analyze data for early insights! Valuable information can be achieved from simple group-by, aggregate queries, and even more with SQL Jujitsu! (A Spark sketch follows this list.)
There is significant iteration between the Business Understanding and Data Understanding phases.
Sample exploration tools for Hadoop: Trifacta, Paxata, Spark, Python, Pig, Hive, Waterline, Elasticsearch
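The "simple group-by, aggregate queries" mentioned above are often enough for early insight. A minimal PySpark sketch (the table path and column names are invented for illustration) that profiles event volume, distinct users and a null rate per event type:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-understanding-sketch").getOrCreate()

# Hypothetical event data already landed in the lake.
events = spark.read.parquet("/lake/raw/web_events")

# Quick profile: volume, distinct users, and null rate per event type.
profile = (
    events.groupBy("event_type")
    .agg(
        F.count("*").alias("events"),
        F.countDistinct("user_id").alias("users"),
        F.avg(F.col("user_id").isNull().cast("int")).alias("null_user_rate"),
    )
    .orderBy(F.desc("events"))
)
profile.show(20, truncate=False)
```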
28. @joe_Caserta#DataSummit
Data Exploration in Hadoop - Avoid Low-Level Coding
Start by evaluating DSLs.
[Decision flowchart: for structured/tabular data that is practical to express in SQL, use Hive; otherwise use Pig with its core or extended libraries, or Spark; if a custom UDF will help, write one; only when none of these fit, fall back to streaming or native MapReduce.]
30. @joe_Caserta#DataSummit
Data Science Data Quality Priorities
[Chart: speed to value (fast to slow) plotted against data quality (raw to refined).]
Does data munging in a data science lab need the same restrictive governance as enterprise reporting?
31. @joe_Caserta#DataSummit
3. Data Preparation
ETL (Extract Transform Load)
90+% of a Data Scientists time goes into Data Preparation!
• Select required entities/fields
• Address Data Quality issues: missing or incomplete values,
whitespace, bad data-points
• Join/Enrich disparate datasets
• Transform/Aggregate data for intended use:
• Sample
• Aggregate
• Pivot
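A hedged PySpark sketch of these preparation steps; the orders/customers datasets, their paths, and all column names are illustrative only:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-prep").getOrCreate()
orders = spark.read.parquet("hdfs:///data/lake/orders.parquet")        # illustrative
customers = spark.read.parquet("hdfs:///data/lake/customers.parquet")  # illustrative

prepared = (
    orders
    .select("order_id", "customer_id", "amount", "order_ts")            # required fields only
    .withColumn("amount", F.coalesce(F.col("amount"), F.lit(0.0)))      # missing values
    .withColumn("customer_id", F.trim(F.col("customer_id")))            # stray whitespace
    .join(customers, "customer_id", "left")                             # enrich with reference data
    .groupBy("customer_id", "segment")                                  # aggregate for intended use
    .agg(F.sum("amount").alias("total_spend"),
         F.count("order_id").alias("order_count"))
)
prepared.write.mode("overwrite").parquet("hdfs:///data/refined/customer_spend")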
32. @joe_Caserta#DataSummit
Data Preparation
• We love Spark!
• ETL can be done in Scala, Python or SQL
• Cleansing, transformation, and standardization
• Address parsing: usaddress, postal-address, etc.
• Name hashing: fuzzy, etc.
• Genderization: sexmachine, etc.
• And all the goodies of the standard Python library!
• Parallelize the workload against a large number of machines in the Hadoop cluster (see the sketch below)
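A sketch of parallelizing Python cleansing logic as a Spark UDF; the dataset path and the raw_address column are made up, and the usaddress calls follow that library's documented tag() helper but should be verified against its current API:

import usaddress
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("cleansing").getOrCreate()

def parse_zip(raw_address):
    """Best-effort ZIP extraction; returns None on unparseable input."""
    try:
        tagged, _ = usaddress.tag(raw_address or "")
        return tagged.get("ZipCode")
    except usaddress.RepeatedLabelError:
        return None

zip_udf = F.udf(parse_zip, StringType())

customers = spark.read.parquet("hdfs:///data/lake/customers.parquet")  # illustrative path
customers.withColumn("zip_code", zip_udf("raw_address")).show(5)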
33. @joe_Caserta#DataSummit
Data Quality and Monitoring
• BUILD a robust data quality subsystem:
  • Metadata and error event facts
  • Orchestration
  • Based on the Data Warehouse ETL Toolkit
• Each error instance of each data quality check is captured
• Implemented as a sub-system after ingestion
• Each fact stores the unique identifier of the defective source row
HAMBot: ‘open source’ project created in the Caserta Innovation Lab (CIL)
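A minimal sketch of the idea, not the HAMBot implementation itself: each failing data quality check writes one error event fact carrying the unique identifier of the defective source row. Dataset, paths, and check predicates are illustrative:

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
staged = spark.read.parquet("hdfs:///data/lake/orders.parquet")  # illustrative

# Each check is (check_name, predicate that flags bad rows)
checks = [
    ("missing_customer_id", F.col("customer_id").isNull()),
    ("negative_amount",     F.col("amount") < 0),
]

error_events = None
for check_name, bad in checks:
    failed = (staged.filter(bad)
                    .select(F.col("order_id").alias("source_row_id"))   # defective source row id
                    .withColumn("check_name", F.lit(check_name))
                    .withColumn("batch_ts", F.lit(datetime.utcnow().isoformat())))
    error_events = failed if error_events is None else error_events.unionByName(failed)

# Persist the error event facts for monitoring and orchestration
error_events.write.mode("append").parquet("hdfs:///data/governance/error_event_fact")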
34. @joe_Caserta#DataSummit
4. Modeling
Do you love algebra & stats?
• Evaluate various models/algorithms
• Classification
• Clustering
• Regression
• Many others…..
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation
techniques (e.g., sparse vector format, illustrated below)
• Additionally we may discover the need for additional data points,
or uncover additional data quality issues!
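For example, a sparse vector stores only the non-zero entries; a small illustration using Spark's vector types:

from pyspark.ml.linalg import Vectors

dense  = Vectors.dense([0.0, 3.0, 0.0, 0.0, 4.5, 0.0])
sparse = Vectors.sparse(6, [1, 4], [3.0, 4.5])   # size, non-zero indices, values
print(sparse.toArray())                          # same vector, different representation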
35. @joe_Caserta#DataSummit
Modeling in Hadoop
• Spark works well
• SAS, SPSS, etc. are not native on Hadoop
• R and Python are becoming the new standard
• PMML can be used, but approach with caution
36. @joe_Caserta#DataSummit
Machine Learning
The goal of machine learning is to get software to make decisions and learn
from data without being explicitly programmed to do so.
Machine Learning algorithms are broadly broken out into two groups:
• Supervised learning: inferring functions based on labeled training data
• Unsupervised learning: finding hidden structure/patterns within data; no
training data is supplied
We will review some popular, easy-to-understand machine learning algorithms.
38. @joe_Caserta#DataSummit
Supervised Learning
Training set:
Name Weight Color Cat_or_Dog
Susie 9lbs Orange Cat
Fido 25lbs Brown Dog
Sparkles 6lbs Black Cat
Fido 9lbs Black Dog
New observation to classify:
Name Weight Color Cat_or_Dog
Misty 5lbs Orange ?
The training set is used to generate a function
…so we can predict whether we have a cat or a dog! (A toy sketch follows.)
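A toy sketch of the supervised workflow on the table above, using scikit-learn as one possible choice; the one-hot encoding of color is an implementation detail, not part of the original slide:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training set from the slide; "name" is just an identifier, not a feature
train = pd.DataFrame({
    "name":   ["Susie", "Fido", "Sparkles", "Fido"],
    "weight": [9, 25, 6, 9],
    "color":  ["Orange", "Brown", "Black", "Black"],
    "label":  ["Cat", "Dog", "Cat", "Dog"],
})
X = pd.get_dummies(train[["weight", "color"]])   # one-hot encode color
model = DecisionTreeClassifier(random_state=0).fit(X, train["label"])

# Classify the new observation (Misty, 5 lbs, Orange)
misty = pd.get_dummies(pd.DataFrame({"weight": [5], "color": ["Orange"]}))
misty = misty.reindex(columns=X.columns, fill_value=0)
print(model.predict(misty))   # expected to come out as ['Cat']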
39. @joe_Caserta#DataSummit
Category or Values?
There are several classes of algorithms, depending on whether the prediction is a
category (like cat or dog) or a value, like the value of a home.
Classification algorithms are generally well suited for categorization, while algorithms
like Regression and Decision Trees are well suited for predicting values.
40. @joe_Caserta#DataSummit
Regression
• Understanding the relationship between a given set of dependent variables
and independent variables
• Typically regression is used to predict the output of a dependent variable
based on variations in independent variables
• Very popular for prediction and forecasting
[Chart: Linear Regression]
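A minimal linear regression sketch with scikit-learn, predicting a home's value from its square footage; the numbers are made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

sqft  = np.array([[850], [1200], [1500], [2100], [2600]])        # independent variable
price = np.array([135_000, 182_000, 225_000, 310_000, 375_000])  # dependent variable

model = LinearRegression().fit(sqft, price)
print(model.coef_[0], model.intercept_)   # dollars per square foot, base price
print(model.predict([[1800]]))            # estimated value of a 1,800 sq ft home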
41. @joe_Caserta#DataSummit
Decision Trees
• A method for predicting outcomes based on the features of the data
• The model is represented as an easy-to-understand tree structure of if-else statements, e.g.:
  Weight > 10lbs?
    yes → dog
    no  → color = orange?
            yes → cat
            no  → name = fido?
                    yes → dog
                    no  → cat
42. @joe_Caserta#DataSummit
Unsupervised K-Means
• Treats items as coordinates
• Places a number of random “centroids”
and assigns the nearest items
• Moves the centroids around based on
average location
• Process repeats until the assignments
stop changing
Clustering of items into logical groups based on natural patterns in data
Uses:
• Cluster Analysis
• Classification
• Content Filtering
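A small k-means sketch with scikit-learn showing the idea: items become coordinates, and centroids move until assignments stop changing. The points are illustrative:

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [2, 3],      # one natural group
                   [9, 9], [10, 8], [8, 10]])   # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment per item
print(kmeans.cluster_centers_)  # final centroid coordinates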
43. @joe_Caserta#DataSummit
Collaborative Filtering
• A hybrid of Supervised and Unsupervised Learning (model based vs. memory
based)
• Leverages collaboration between multiple agents to filter, project, or detect
patterns
• Popular in recommender systems for projecting the “taste” of specific
individuals for items on which they have not yet expressed a preference.
44. @joe_Caserta#DataSummit
Item-based
• A popular and simple memory-based collaborative filtering algorithm
• Projects preference based on item similarity (based on ratings):
  for every item i that u has no preference for yet
    for every item j that u has a preference for
      compute a similarity s between i and j
      add u's preference for j, weighted by s, to a running average
  return the top items, ranked by weighted average
• First, a matrix of item-to-item similarity is calculated based on user ratings
• Then recommendations are created by producing a weighted sum of the top items,
based on the user's previously rated items (a plain-Python sketch follows)
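A plain-Python sketch of the pseudocode above, scoring unseen items by a similarity-weighted average of the user's existing ratings; the similarity values and ratings are tiny made-up inputs:

item_similarity = {            # s(i, j) from the item-to-item matrix
    ("fargo", "twelve_monkeys"): 0.69,
    ("star_wars", "twelve_monkeys"): 0.65,
    ("fargo", "star_wars"): 0.30,
}

def sim(i, j):
    return item_similarity.get((i, j)) or item_similarity.get((j, i), 0.0)

def recommend(user_ratings, candidate_items, top_n=5):
    scores = {}
    for i in candidate_items:                   # items with no preference yet
        if i in user_ratings:
            continue
        num = den = 0.0
        for j, rating in user_ratings.items():  # items the user has rated
            s = sim(i, j)
            num += s * rating
            den += abs(s)
        if den > 0:
            scores[i] = num / den               # weighted running average
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend({"twelve_monkeys": 5.0}, ["fargo", "star_wars"]))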
45. @joe_Caserta#DataSummit
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against original
Business Understanding
• Did we meet our objectives?
• Did we address all issues?
46. @joe_Caserta#DataSummit
6. Deployment
Engineering time!
• It’s time for the work products of data science to “graduate” from “new
insights” to real applications.
• Processes must be hardened, repeatable, and generally perform well too!
• Data Governance is applied
• PMML (Predictive Model Markup Language): XML-based interchange format
[Data Pyramid diagram: Landing Area – source data in “full fidelity”; Data Lake – integrated sandbox; Data Science Workspace; Big Data Warehouse at the top. New data flows up through the governance “refinery” to produce new insights.]
48. @joe_Caserta#DataSummit
Project Objective
• Create a functional recommendation engine to surface relevant
product recommendations to customers.
• Improve Customer Experience
• Increase Customer Retention
• Increase Customer Purchase Activity
• Accurately suggest relevant products to customers based on their peer
behavior.
49. @joe_Caserta#DataSummit
Recommendations
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not have
thought of
• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”
[Example: after a customer buys a 23” LED TV, recommending 24” and 25” LED TVs is obvious; Blu-ray players, home theater systems, and HDMI cables make better recommendations.]
50. @joe_Caserta#DataSummit
Where do we use recommendations?
• Applications can be found in a wide variety of industries and applications:
• Travel
• Financial Services
• Music/Online radio
• TV and Video
• Online Publications
• Retail
..and countless others
Our Example: Movies
51. @joe_Caserta#DataSummit
The Goal of the Recommender
• Create a powerful, scalable recommendation engine with minimal development
• Make recommendations to users as they are browsing movie titles -
instantaneously
• Recommendations must have context relative to the movie the user is currently viewing.
OOPS! – too much surprise!
52. @joe_Caserta#DataSummit
Recommender Tools & Techniques
Hadoop – distributed file system and processing platform
Spark – low-latency computing
MLlib – Library of Machine Learning Algorithms
We leverage two algorithms:
• Content-Based Filtering – how similar is this particular movie to other movies, based on
usage.
• Collaborative Filtering – predict an individual's preference based on their peers' ratings.
Spark MLlib implements a collaborative filtering algorithm called Alternating Least Squares
(ALS); a sketch follows below.
• Both algorithms only require a simple dataset of 3 fields:
“User ID” , “Item ID”, “Rating”
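A hedged sketch of collaborative filtering with Spark MLlib's ALS on the simple (User ID, Item ID, Rating) dataset described above; the ratings file path and column names are assumptions:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("movie-recs").getOrCreate()

# Expect columns: user_id (int), item_id (int), rating (float)
ratings = spark.read.csv("hdfs:///data/lake/ratings.csv", header=True, inferSchema=True)

als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)

# Top 10 recommendations per user: user_id -> [(item_id, score), ...]
model.recommendForAllUsers(10).show(truncate=False)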
53. @joe_Caserta#DataSummit
Content-Based Filtering
“People who liked this movie liked these as well”
• Content Based Filter builds a matrix of items to other items and calculates
similarity (based on user rating)
• The most similar items are then output as a list:
• Item ID, Similar Item ID, Similarity Score
• Items with the highest score are most similar
• In this example users who liked “Twelve Monkeys” (7) also like “Fargo” (100)
7 100 0.690951001800917
7 50 0.653299445638532
7 117 0.643701303640083
54. @joe_Caserta#DataSummit
Collaborative Filtering
“People with similar taste to you liked these movies”
• Collaborative filtering applies weights based on “peer” user preference.
• Essentially it determines the best movie critics for you to follow
• The items with the highest recommendation score are then output as tuples
• User ID [Item ID1:Score,…., Item IDn:Score]
• Items with the highest recommendation score are the most relevant to this user
• For user “Johny Sisklebert” (572), the two most highly recommended movies are “Seven” and
“Donnie Brasco”
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
55. @joe_Caserta#DataSummit
Recommendation Store
• Serving recommendations needs to be instantaneous
• The core of this solution is two reference tables:
  Rec_Item_Similarity (Item_ID, Similar_Item, Similarity_Score)
  Rec_User_Item_Base (User_ID, Item_ID, Recommendation_Score)
• When called to make recommendations we query our store (a lookup sketch follows):
  • Rec_Item_Similarity based on the Item_ID they are viewing
  • Rec_User_Item_Base based on their User_ID
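A sketch of the serving-time lookups against these two tables; in practice the store would live in a low-latency database, but Spark SQL is enough to show the access pattern. Paths are illustrative, and the example IDs (user 572 viewing item 7) follow the slides:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rec-serving").getOrCreate()

# Register the two reference tables (illustrative paths)
spark.read.parquet("hdfs:///data/recs/rec_item_similarity") \
     .createOrReplaceTempView("Rec_Item_Similarity")
spark.read.parquet("hdfs:///data/recs/rec_user_item_base") \
     .createOrReplaceTempView("Rec_User_Item_Base")

def recommendations(user_id, item_id, limit=10):
    # "People who liked this movie liked these as well"
    item_based = spark.sql(f"""
        SELECT Similar_Item AS item, Similarity_Score AS score
        FROM Rec_Item_Similarity
        WHERE Item_ID = {item_id}
        ORDER BY Similarity_Score DESC LIMIT {limit}""")
    # "People with similar taste to you liked these movies"
    user_based = spark.sql(f"""
        SELECT Item_ID AS item, Recommendation_Score AS score
        FROM Rec_User_Item_Base
        WHERE User_ID = {user_id}
        ORDER BY Recommendation_Score DESC LIMIT {limit}""")
    return item_based, user_based

# Johny (user 572) is viewing "Twelve Monkeys" (item 7)
item_recs, user_recs = recommendations(user_id=572, item_id=7)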
56. @joe_Caserta#DataSummit
Delivering Recommendations
Item-based scores (movies similar to the one being viewed) and peer-based scores (collaborative filtering) are combined to produce the best recommendations.

Item-based similarity:
Item                        Raw Score  Normalized
Fargo                       0.691      1.000
Star Wars                   0.653      0.946
Rock, The                   0.644      0.932
Pulp Fiction                0.628      0.909
Return of the Jedi          0.627      0.908
Independence Day            0.618      0.894
Willy Wonka                 0.603      0.872
Mission: Impossible         0.597      0.864
Silence of the Lambs, The   0.596      0.863
Star Trek: First Contact    0.594      0.859
Raiders of the Lost Ark     0.584      0.845
Terminator, The             0.574      0.831
Blade Runner                0.571      0.826
Usual Suspects, The         0.569      0.823
Seven (Se7en)               0.569      0.823

Peer-based (collaborative filtering):
Item                        Raw Score  Normalized
Seven                       5.000      1.000
Donnie Brasco               4.707      0.941
Babe                        4.688      0.938
Heat                        4.688      0.938
To Kill a Mockingbird       4.686      0.937
Jaws                        4.683      0.937
Monty Python, Holy Grail    4.670      0.934
Blade Runner                4.670      0.934
Get Shorty                  4.655      0.931

Top 10 Recommendations – so if Johny is viewing “12 Monkeys” we query our recommendation store and present the results:
Seven (Se7en)               1.823
Blade Runner                1.760
Fargo                       1.000
Star Wars                   0.946
Donnie Brasco               0.941
Babe                        0.938
Heat                        0.938
To Kill a Mockingbird       0.937
Jaws                        0.937
Monty Python, Holy Grail    0.934
57. @joe_Caserta#DataSummit
From Good to Great Recommendations
• Note that the first 5 recommendations look pretty good
…but the 6th result would have been “Babe” the children's movie
• Tuning the algorithms might help: parameter changes, similarity measures.
• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means
OOPS!
58. @joe_Caserta#DataSummit
Additional Algorithm – K-Means
We would use the major attributes of the Movie to create coordinate points.
• Categories
• Actors
• Director
• Synopsis Text
“These movies are similar based on their attributes”
59. @joe_Caserta#DataSummit
Delivery Scoring and Filters
• One or more categories must match
• Only children's movies will be recommended for children's movies.
Action Adventure Children's Comedy Crime Drama Film-Noir Horror Romance Sci-Fi Thriller
Twelve Monkeys 0 0 0 0 0 1 0 0 0 1 0
Babe 0 0 1 1 0 1 0 0 0 0 0
Seven (Se7en) 0 0 0 0 1 1 0 0 0 0 1
Star Wars 1 1 0 0 0 0 0 0 1 1 0
Blade Runner 0 0 0 0 0 0 1 0 0 1 0
Fargo 0 0 0 0 1 1 0 0 0 0 1
Willy Wonka 0 1 1 1 0 0 0 0 0 0 0
Monty Python 0 0 0 1 0 0 0 0 0 0 0
Jaws 1 0 0 0 0 0 0 1 0 0 0
Heat 1 0 0 0 1 0 0 0 0 0 1
Donnie Brasco 0 0 0 0 1 1 0 0 0 0 0
To Kill a Mockingbird 0 0 0 0 0 1 0 0 0 0 0
Apply assumptions to control the results of collaborative filtering (a filter sketch follows).
Similar logic could be applied to promote more favorable options:
• New releases
• Retail case: items that are on sale or overstocked
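A sketch of such a delivery filter in Python, under one reading of the rules above: a candidate must share at least one genre with the seed movie, and children's titles only pair with children's titles. Genre sets are taken from the matrix above:

genres = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe":           {"Children's", "Comedy", "Drama"},
    "Seven (Se7en)":  {"Crime", "Drama", "Thriller"},
}

def passes_filters(seed, candidate):
    seed_g, cand_g = genres[seed], genres[candidate]
    if not seed_g & cand_g:                             # one or more categories must match
        return False
    if ("Children's" in seed_g) != ("Children's" in cand_g):
        return False                                    # children's only pairs with children's
    return True

candidates = ["Seven (Se7en)", "Babe"]
print([c for c in candidates if passes_filters("Twelve Monkeys", c)])  # drops "Babe"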
60. @joe_Caserta#DataSummit
Integrating K-Means into the process
[Diagram: the Collaborative Filter and the K-Means similar-content filter both feed into the Best Recommendations.]
Movies recommended by more than one algorithm are the most highly rated.
61. @joe_Caserta#DataSummit
Sophisticated Recommendation Model
The recommendation blends several inputs, each answering a different question:
• Customer Behavior – What items have you bought in the past?
• Peer Based – What are people with similar characteristics buying?
• Item Clustering – What did people who ordered these items also order?
• Corporate Deals/Offers – What items are we promoting at time of sale?
• Market/Store – What items are being promoted by the store or market?
The solution allows balancing of the algorithms to attain the most effective recommendation.
62. @joe_Caserta#DataSummit
Summary
• Hadoop and Spark can provide a relatively low cost and extremely scalable platform
for Data Science
• Hadoop offers great scalability and speed to value without the overhead of
structuring data
• Spark, with MLlib offers a great library of established Machine Learning algorithms,
reducing development efforts
• Python and SQL are the tools of choice for Data Science on Hadoop
• Go Agile and follow Best Practices (CRISP-DM)
• Employ Data Pyramid concepts to ensure data has just enough governance
63. @joe_Caserta#DataSummit
Some Thoughts – Enable the Future
Data Science requires the convergence of data quality, advanced math, data
engineering, visualization and business smarts.
Make sure your data can be trusted and people can be held accountable for
impact caused by low data quality.
Good data scientists are rare: it will take a village to achieve all the tasks
required for effective data science.
Get good!
Be great!
Blaze new trails!
Data Science Training: https://exploredatascience.com/