The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data being generated by organizations: machine-generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The 20th annual Enterprise Data World (EDW) Conference took place in San Diego last month, April 17-21. It is recognized as the most comprehensive educational conference on data management in the world.
Joe Caserta was a featured presenter. His session “Evolving from the Data Warehouse to Big Data Analytics - the Emerging Role of the Data Lake," highlighted the challenges and steps needed to become a data-driven organization.
Joe also participated in two panel discussions during the show:
• "Data Lake or Data Warehouse?"
• "Big Data Investments Have Been Made, But What's Next
For more information on Caserta Concepts, visit our website at http://casertaconcepts.com/.
Data Profiling: The First Step to Big Data Quality (Precisely)
Big data offers the promise of a data-driven business model generating new revenue and competitive advantage fueled by new business insights, AI, and machine learning. Yet without high quality data that provides trust, confidence, and understanding, business leaders continue to rely on gut instinct to drive business decisions.
The critical foundation and first step to deliver high quality data in support of a data-driven view that truly leverages the value of big data is data profiling - a proven capability to analyze the actual data content and help you understand what's really there.
View this webinar on-demand to learn five core concepts to effectively apply data profiling to your big data, assess and communicate the quality issues, and take the first step to big data quality and a data-driven business.
BAR360 open data platform presentation at DAMA, Sydney (Sai Paravastu)
Sai Paravastu discusses the benefits of using an open data platform (ODP) for enterprises. The ODP would provide a standardized core of open source Hadoop technologies like HDFS, YARN, and MapReduce. This would allow big data solution providers to build compatible solutions on a common platform, reducing costs and improving interoperability. The ODP would also simplify integration for customers and reduce fragmentation in the industry by coordinating development efforts.
This document discusses data quality and data profiling. It begins by describing problems with data like duplication, inconsistency, and incompleteness. Good data is a valuable asset while bad data can harm a business. Data quality is assessed based on dimensions like accuracy, consistency, completeness, and timeliness. Data profiling statistically examines data to understand issues before development begins. It helps assess data quality and catch problems early. Common analyses include analyzing null values, keys, formats, and more. Data profiling is conducted using SQL or profiling tools during requirements, modeling, and ETL design.
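As a rough illustration of the profiling analyses described above (null values, keys, formats), here is a minimal sketch in Python/pandas; the input file and the candidate-key heuristic are hypothetical, not taken from the document:

```python
# Minimal data profiling sketch with pandas (hypothetical input file).
import pandas as pd

df = pd.read_csv("customers.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),                 # format analysis
    "null_count": df.isna().sum(),                  # null-value analysis
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct": df.nunique(),                       # key/cardinality analysis
})
print(profile)

# Candidate-key check: columns whose distinct count equals the row count.
print(profile[profile["distinct"] == len(df)].index.tolist())
```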
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat... (DATAVERSITY)
This document discusses the importance of metadata and data governance. It describes how a data catalog can consolidate metadata from various sources like a business glossary, data dictionary, and data profiling. Automating data lineage is key to harvesting metadata at scale and establishing relationships between different metadata objects. When integrated in a data catalog, metadata provides a single source of truth about an organization's data that improves data literacy and trust.
Architecting for Big Data: Trends, Tips, and Deployment Options (Caserta)
Joe Caserta, President at Caserta Concepts addressed the challenges of Business Intelligence in the Big Data world at the Third Annual Great Lakes BI Summit in Detroit, MI on Thursday, March 26. His talk "Architecting for Big Data: Trends, Tips and Deployment Options," focused on how to supplement your data warehousing and business intelligence environments with big data technologies.
For more information on this presentation or the services offered by Caserta Concepts, visit our website: http://casertaconcepts.com/.
michael hamilton legal database design presentation 3 new york (michaelhamilton)
The document outlines a database design methodology for litigation databases consisting of 5 steps: 1) Draft a mission statement and objectives, 2) Analyze the overall data set, 3) Determine necessary data fields, 4) Determine and define business rules, and 5) Assure data integrity. It then provides examples of typical data fields for a coded litigation database including document ID number, attachment range, document date, type, title, names, characteristics, source, and date loaded. Finally, it proposes a database design for a sample antitrust case involving 500 boxes of documents from 4 sources to be reviewed by multiple attorneys.
The Right Data Warehouse: Automation Now, Business Value Thereafter (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” to collect interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for affecting business value – and how do we harness it? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
In this presentation at DAMA New York, Joe started by asking a key question: why are we doing this? Why analyze and share all these massive amounts of data? Basically, it comes down to the belief that in any organization, in any situation, if we can get the data and make it correct and timely, insights from it will become instantly actionable for companies to function more nimbly and successfully. Enabling the use of data can be a world-changing, world-improving activity and this session presents the steps necessary to get you there. Joe explained the concept of the "data lake" and also emphasized the role of a strong data governance strategy that incorporates seven components needed for a successful program.
For more information on this presentation or Caserta Concepts, visit our website at http://casertaconcepts.com/.
RWDG Slides: Glossaries, Dictionaries, and Catalogs Result in Data Governance (DATAVERSITY)
If you have the discipline to develop, deliver, and maintain a business glossary, data dictionary, and/or a data catalog, you may already have the makings of a Data Governance program. The roles required to deliver these assets can translate to successful Data Governance in several ways.
In this month’s webinar, Bob Seiner will highlight the aspects of delivering these valuable business assets that result in formal Data Governance. It is practical that your program recognize existing efforts to formalize the definition, production, and usage of data.
Topics to be discussed in this webinar:
• How glossaries, dictionaries, and catalogs add value
• What should be included in these assets
• Who has responsibility for these assets
• When these assets will be valuable to your organization
• Where the discipline results in Data Governance
Data science is one of the top fields in our world now, as it enables us to predict the future and the behaviors of people and systems alike.
Hence, this course focuses on introducing the processes involved in data science.
This document provides an overview of fundamentals of database design. It discusses what a database is, the difference between data and information, why databases are needed, how to select a database system, basic database definitions and building blocks, quality control considerations, and data entry methods. The overall purpose of a database management system is to transform data into information, information into knowledge, and knowledge into action.
Data Mesh in Azure using Cloud Scale Analytics (WAF) (Nathan Bijnens)
This document discusses moving from a centralized data architecture to a distributed data mesh architecture. It describes how a data mesh shifts data management responsibilities to individual business domains, with each domain acting as both a provider and consumer of data products. Key aspects of the data mesh approach discussed include domain-driven design, domain zones to organize domains, treating data as products, and using this approach to enable analytics at enterprise scale on platforms like Azure.
1) The document discusses big data and data science, defining big data using the three Vs of volume, velocity, and variety to characterize high amounts of diverse data sources.
2) Data science is presented as a combination of techniques from fields like mathematics, computer science, and statistics to extract knowledge from data.
3) Successful data scientists require a diverse skillset that includes quantitative skills, technical skills, skepticism, collaboration, and knowledge from multiple disciplines.
Closing the data source discovery gap and accelerating data discovery comprises three steps: profile, identify, and unify. This white paper discusses how the Attivio platform executes those steps, the pain points each one addresses, and the value Attivio provides to advanced analytics and business intelligence (BI) initiatives.
This document provides an agenda and overview for a data warehousing training session. The agenda covers topics such as data warehouse introductions, reviewing relational database management systems and SQL commands, and includes a case study discussion with Q&A. Background information is also provided on the project manager leading the training.
Agile Data Rationalization for Operational Intelligence (Inside Analysis)
The Briefing Room with Eric Kavanagh and Phasic Systems
Live Webcast Mar. 26, 2013
The complexity of today's information architectures creates a wide range of challenges for executives trying to get a strategic view of their current operations. The data and context locked in operational systems often get diluted during the normalization processes of data warehousing and other types of analytic solutions. And the ultimate goal of seeing the big picture gets derailed by a basic inability to reconcile disparate organizational views of key information assets and rules.
Register for this episode of The Briefing Room to learn from Bloor Group CEO Eric Kavanagh, who will explain how a tightly controlled methodology can be combined with modern NoSQL technology to resolve both process and system complexities, thus enabling a much richer, more interconnected information landscape. Kavanagh will be briefed by Geoffrey Malafsky of Phasic Systems who will share his company's tested methodology for capturing and managing the business and process logic that run today's data-driven organizations. He'll demonstrate how a “don't say no” approach to entity definitions can dissolve previously intractable disagreements, opening the door to clear, verifiable operational intelligence.
Visit: http://www.insideanalysis.com
This document provides an overview of key concepts for AWS Certified Data Analytics, including data structures, types, preparation, sources, formats (structured, unstructured, semi-structured), the data lifecycle, AWS services for data storage and analytics, and visualization. It emphasizes that data is a valuable commodity and discusses challenges of analyzing growing unstructured data from various sources using traditional tools.
This document discusses characteristics of big data and the big data stack. It describes the evolution of data from the 1970s to today's large volumes of structured, unstructured and multimedia data. Big data is defined as data that is too large and complex for traditional data processing systems to handle. The document then outlines the challenges of big data and characteristics such as volume, velocity and variety. It also discusses the typical data warehouse environment and Hadoop environment. The five layers of the big data stack are then described including the redundant physical infrastructure, security infrastructure, operational databases, organizing data services and tools, and analytical data warehouses.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
• Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
• Performing data quality validations using libraries built to work with Spark (a sketch of this kind of check follows below)
• Dynamically generating pipelines that can be abstracted away from users
• Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
• Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
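As a rough sketch of the kind of Spark-based validation described above (this is a generic illustration, not Zillow's platform API; the dataset path and column names are invented):

```python
# Generic data quality expectation check in PySpark (illustrative only).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://bucket/listings/")  # hypothetical dataset

total = df.count()
violations = {
    "listing_id_not_null": df.filter(F.col("listing_id").isNull()).count(),
    "price_non_negative": df.filter(F.col("price") < 0).count(),
}
for check, bad_rows in violations.items():
    status = "PASS" if bad_rows == 0 else "FAIL"
    print(f"{check}: {status} ({bad_rows} of {total} rows violate)")
```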
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
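As a purely hypothetical sketch of the "write a simple function" pattern the abstract describes (the platform call below is invented for illustration; the actual Stitch Fix API is not given in this abstract):

```python
# Hypothetical sketch: a data scientist supplies only a function; the
# platform would handle online deployment, batch execution on Spark,
# and metrics tracking around it.
from typing import List

def predict(features: List[float]) -> float:
    """Stand-in for a real model: features in, score out."""
    return sum(features) / len(features)

# Invented platform call wrapping the function for serving and batch use:
# platform.deploy(predict, name="demo-model", metrics=["latency", "scores"])
print(predict([0.2, 0.4, 0.9]))
```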
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
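A minimal sketch of the stage-level scheduling API on the PySpark RDD path (resource amounts, paths, and the stand-in functions are illustrative; a cluster with dynamic allocation and GPU discovery configured is assumed):

```python
# Stage-level scheduling (Spark 3.1+): ETL on default resources, then a
# training stage that requests GPUs for its tasks.
from pyspark import SparkContext
from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                              ResourceProfileBuilder)

sc = SparkContext.getOrCreate()

def preprocess(line):                      # stand-in ETL step
    return [float(x) for x in line.split(",")]

def train_partition(rows):                 # stand-in for GPU training code
    yield sum(len(r) for r in rows)

etl_rdd = sc.textFile("hdfs:///data/raw.csv").map(preprocess)

ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# Only the stages computed from this point on use the GPU profile.
result = etl_rdd.withResources(gpu_profile).mapPartitions(train_partition).collect()
```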
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
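A condensed sketch of the converter usage the talk describes (the paths, column names, and the compiled Keras `model` are placeholders):

```python
# Petastorm Spark Dataset Converter: Spark DataFrame -> tf.data.Dataset.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")   # intermediate cache location

df = spark.read.parquet("/data/train")          # preprocessed features + label
converter = make_spark_converter(df)

with converter.make_tf_dataset(batch_size=64) as ds:
    # Petastorm yields namedtuple batches; map them to (features, label).
    ds = ds.map(lambda batch: (batch.features, batch.label))  # hypothetical columns
    model.fit(ds, steps_per_epoch=100, epochs=2)  # assumed compiled Keras model
```

The converter's `make_torch_dataloader()` method covers the PyTorch path in the same way.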
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
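For flavor, a hedged sketch of pointing PySpark at a Kubernetes cluster (the endpoint, image, and namespace are placeholders, not values from the talk):

```python
# Illustrative Spark-on-Kubernetes configuration with executor autoscaling.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://<api-server-host>:443")              # kube API server
    .config("spark.kubernetes.container.image", "gcr.io/my-proj/spark:3.1.1")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "4")
    .config("spark.dynamicAllocation.enabled", "true")          # autoscaling
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```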
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
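As a toy illustration of mapping fit/transform-style stages onto Ray tasks (a generic sketch, not the speaker's actual library):

```python
# Pipeline stages as Ray tasks: transforms fan out in parallel, fit gathers.
import ray

ray.init()

@ray.remote
def transform(batch):
    return [x * 2.0 for x in batch]          # stand-in transform stage

@ray.remote
def fit(batches):
    data = [x for b in batches for x in b]
    return sum(data) / len(data)             # stand-in for fitted parameters

batches = [[1.0, 2.0], [3.0, 4.0]]
transformed = ray.get([transform.remote(b) for b in batches])
print(ray.get(fit.remote(transformed)))
```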
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – dispatch new jobs by polling a Redis queue
· Why? Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters (see the sketch after this list)
· Problems with Spark Accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
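A minimal sketch of Niche 2 (host, port, and key names are placeholders; HINCRBY is atomic on the Redis server, but retried or speculative tasks can still double-count, hence the precautions listed above):

```python
# Redis hash as a distributed counter updated from Spark executors.
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def count_partition(rows):
    r = redis.Redis(host="redis.internal", port=6379)   # one connection per partition
    n = sum(1 for _ in rows)
    r.hincrby("job:123:counters", "rows_processed", n)  # single round trip per partition

spark.range(0, 1_000_000).rdd.foreachPartition(count_partition)
print("counters written to Redis hash job:123:counters")
```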
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
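A small sketch of logging a batch with whylogs (whylogs v1-style API; the DataFrame is a placeholder):

```python
# Profile a pandas batch with whylogs and inspect the summary.
import pandas as pd
import whylogs as why

df = pd.DataFrame({"price": [3.5, 4.0, None], "qty": [1, 2, 3]})

profile = why.log(df).profile()      # lightweight statistical profile
print(profile.view().to_pandas())    # per-column metrics (counts, nulls, ...)
```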
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
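Raven itself is an optimizer extension, but the kind of prediction query it targets can be sketched in PySpark with an MLflow model UDF (the model URI, table path, and columns are placeholders):

```python
# A prediction query mixing data processing with an ML model invocation.
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
predict = mlflow.pyfunc.spark_udf(spark, "models:/churn/1")  # trained model

df = spark.read.parquet("/data/customers")
scored = (df.filter("region = 'EMEA'")                       # data processing part
            .withColumn("churn_score",
                        predict("age", "tenure", "monthly_spend")))  # ML part
scored.select("customer_id", "churn_score").show()
```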
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is a complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
Talking points:
• What are we storing?
• Multi-source, multi-channel problem
• Data representation and nested schema evolution
• Performance trade-offs with various formats
• Anti-patterns used (String FTW)
• Data manipulation using UDFs
• Writer worries and how to wipe them away (Staging Tables FTW)
• Data lake replication lag tracking
• Performance time!
A minimal Delta Lake sketch of two of these patterns follows.
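This sketch covers schema evolution plus an upsert via MERGE on Delta Lake (the paths and join key are placeholders, not Adobe's actual pipeline):

```python
# Delta Lake upsert with automatic schema merging enabled.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

updates = spark.read.parquet("/staging/profile_updates")   # staging data
target = DeltaTable.forPath(spark, "/delta/profiles")

(target.alias("t")
    .merge(updates.alias("s"), "t.profile_id = s.profile_id")
    .whenMatchedUpdateAll()        # update existing profiles
    .whenNotMatchedInsertAll()     # insert new ones
    .execute())
```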
Machine Learning CI/CD for Email Attack Detection (Databricks)
Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, adversarial nature of the problem, and scale of data. In order to move quickly and adapt to the newest threat we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack including joined datasets for hydration, feature extraction code, detection logic, and develop/train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
Data Lakehouse Symposium | Day 2
1. The Data Lakehouse Symposium – February 2022
Hosted by Bill Inmon and Databricks – Feb 1-4, 2022
2. The Data Lakehouse Symposium – February 2022
Text in the Data Lakehouse
David Rapien
Partner, Forest Rim Technology
Associate Professor, Lindner College of Business, University of Cincinnati
3. Let's Look at the Major Historical Changes in Data Collection, Storage, and Usage
• 1980s - The Data Warehouse allowed us to hold a single version of the truth and make enterprise-wide decisions.
• 2010 - The Data Lake allowed us to collect all of our “data” in one place.
• 2020 - The Data Lakehouse marries the two by adding governance and metadata to data going into the Data Lake so that it can be separately transitioned into a Data Warehouse AND consumed by decision makers and analysts.
4. Where is your company’s data focus?
• Data collection for “future use”?
• Business decisions?
• Analysis and research?
• We have none!
5. What types of data does your company collect and store?
• Transactional Data from customer interactions?
• Machine generated data?
• Emails, blogs, customer reviews, medical records, contracts?
• Images, videos, scans, audio files?
6. Presentation Talking Points: What We Will Discuss in Today's Presentation
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
7. Presentation Talking Points
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
8. All Corporate Data in the Lakehouse falls into three sectors: structured, textual, and Analog/IoT. Pareto's Law holds true: the curated Data Lake and Data Warehouse data is roughly 20% or less of the amount of data, yet accounts for roughly 80% or more of the data used for decision making.
9. All Corporate Data in the Lakehouse (structured, textual, Analog/IoT): roughly 80-90% of business decisions are made on less than 20% of the data, the curated Data Lake and Data Warehouse data. Is there something wrong here?
10. All Corporate Data in the Lakehouse: the structured sector (curated Data Lake and Data Warehouse data)
• Physical models
• Tables
• Aggregated
• Scrubbed
• Additional Metadata
• Additional Data Governance
11. All Corporate Data in the Lakehouse: the textual sector
• Documents
• Emails
• Contracts
• Medical Records
• Voice of the Customer
• Insurance Claims
• Call Center …
• Other???
12. All Corporate Data in the Lakehouse: the Analog/IoT sector
• Status Data
• Automation Data
• Location Data
• Clickstreams
• Sensor Data
• Images
• Audio / Video files
13. (Chart) The relative volumes of data in the structured, textual, and Analog/IoT sectors.
14. (Chart) The relative amount of business value to be found in the different sectors (structured, textual, Analog/IoT).
15. How do we currently USE different types of data?
• Structured (curated Data Lake and Data Warehouse data): the Data Warehouse, timeline analysis, a 360° view of the customer
• Analog/IoT: machine learning / AI
• Textual: manual analysis? NLP? Failure? Textual ETL!
16. Presentation Talking Points
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
17. What data are you missing in your analysis? The textual data you hold includes voice mails, dictations, transcriptions, PDFs, Word documents, CSVs, Yelp reviews, Parquet files, document scans, the voice of the customer, real estate deeds/sales, Internet content, insurance claims, warranties, emails, call center records, contracts, and medical records.
18. What is similar about most of this textual data?
• It is stream of thought
• It is different document by document
• It does not have primary keys or foreign keys
• It has little format
• It is DIRTY DATA!
19. Think about it: the modelling and design techniques that worked in the structured world do not work in the world of text. The structured world is built on keys, attributes, indexes, and physical models; text reads like “I was looking at the nice colored sweater in the window. I wonder if I could try it on… but I don’t like the sleeve length…” Why? Because people do not write or talk the same way that is found in the structured world. These worlds are incompatible. In order to address text you need a completely different approach.
20. Presentation Talking Points
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
21. Consider the types of text that we are storing and NOT USING: voice mails, dictations, transcriptions, PDFs, Word documents, CSVs, Yelp reviews, Parquet files, document scans, the voice of the customer, real estate deeds/sales, Internet content, insurance claims, warranties, emails, call center records, contracts, and medical records.
22. You need to organize everything and convert each type into a standard text format. USE:
• Audio data (voice mails, dictations, transcriptions): transcription (Dragon)
• Mixed formats (PDFs, document scans): OCR and converters
• Tabular data (CSVs, Yelp reviews, Parquet files, the voice of the customer): set “textual” columns
• General documents (Word documents, Internet content, emails, call center records): converters and formatters
• “Some format” documents (real estate deeds/sales, insurance claims, warranties, contracts, medical records): inline contextualization
23. Transcription (Dragon), OCR and converters, “textual” column selection, converters and formatters, and inline contextualization all feed the conversion to a common textual format. Now WHAT do we do with this data?
24. From the common textual format, deidentify the data (redact personal data), then Apply Context!
25. If you are going to address text you MUST have a handle on both text AND context. It is not sufficient to merely address text. Text is relatively simple; context is 90% of the battle. Furthermore, most of the context that is needed lies OUTSIDE of the text. You can analyze the text until you are blue in the face and never find the relevant context of the text.
26. So what is the purpose of all of this? By properly applying context you can convert your unstructured textual data into structured data! This allows you to use your textual data for structured analysis!
27. What is meant by “the context” of textual data? A word has different meanings in different areas. Consider the word “Trust”:
• In friendship – the ability to believe in the word and actions of another
• In finance – a legal vehicle used to pass and allocate assets to another
• In networking – it allows one computer to communicate and share with another
28. What is meant by “the context” of textual data? A word can have different meanings in SIMILAR areas. Consider the word “Cervical” in the medical field. It could mean pertaining to the neck (cervical vertebra), or pertaining to the lowest segment of the uterus (cervical cancer, cervical hemorrhage).
29. What is meant by “the context” of textual data? A word can have different meanings in related areas. Consider the word “Dermatome” in the medical field: it means an area of the skin supplied by a specific nerve root, and it is also a surgical instrument used to cut the skin.
30. What is meant by “adding context” to textual data? It has different meanings in different areas:
1. Extraction of key elements and phrases for categorization
2. Aggregation of terms into layered categories
3. Similar to data governance with Data Warehouse data:
• Requires subject matter experts
• Requires understanding of what dimensions you want for analysis
• Can be highly political between departments
• It is controlled by BUSINESS, not IT or data analysts!
31. What is the process of adding context to textual data? It matters what analytics you want to perform on your text.
1. Data Conversion (maybe)
2. Data Redaction (maybe)
3. Data Extraction
• Identification of “important” phrases or areas (the Nexus)
• Running through an engine to pair the Nexus with the text
4. Data Transformation
• Classification of the matched Nexus phrases
• Adding metadata: dates, sentiment, sentence information, byte location, batch #s, business, Nexus, customer, …
5. Data Loading
• Data Warehouse, Data Mart, Parquet files
A toy sketch of steps 3 through 5 follows.
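This is a toy illustration only; the Nexus phrases, classifications, and metadata columns are invented for the example, not taken from Forest Rim's Textual ETL:

```python
# Textual ETL sketch: extract Nexus phrases, attach metadata, load Parquet.
import re
import pandas as pd

NEXUS = {"sleeve length": "fit", "colored sweater": "product"}

def contextualize(doc_id, text):
    rows = []
    for phrase, category in NEXUS.items():
        for m in re.finditer(re.escape(phrase), text, re.IGNORECASE):
            rows.append({
                "doc_id": doc_id,
                "nexus": phrase,
                "classification": category,   # transformation: classify match
                "location": m.start(),        # metadata: offset in document
            })
    return rows

text = ("I was looking at the nice colored sweater in the window. "
        "I don't like the sleeve length.")
df = pd.DataFrame(contextualize("doc-001", text))
df.to_parquet("contextualized.parquet")       # loading: Parquet for the warehouse
print(df)
```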
32. What can be done with contextualized data? We can do structured data analysis:
1. Document Markup – visually identifies parts of the document
2. Sentiment Analysis – gives feeling, and degrees of feeling, to parts of the document
3. Inline Contextualization – reverse mail merge; pull out the set of terms that have value
4. Document Classification – give context to the areas of the document for correlation or basket analysis
33. What is Document Markup?
1. Data visualization – color coded, draws the eyes
2. Used document by document
3. Great for “spot” review
4. Irrelevant and impractical for analyzing Big Data
34. What is Sentiment Analysis?
1. Assigns feeling to words – color coded, draws the eyes
2. Tries to identify and categorize opinions stated in some text
3. Great for comments
4. A BASIC requirement for Voice of the Customer analytics
35. What is Inline Contextualization?
1. Reverse mail merge
2. Pull out the set of terms that have value – names, contract dates, ratings
3. Useful for contracts
4. Needed for redaction
5. Needed for document separation – medical visits, combined repeat visits
6. Needed for retrieval of grouped data from blocks of text
A minimal extraction sketch follows.
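The sketch pulls out valued terms, here dates and parties, with simple patterns (the contract text and patterns are invented for illustration):

```python
# Inline contextualization as a "reverse mail merge" over a contract.
import re

contract = "This agreement, effective 01/15/2022, is between ACME Corp and Widgets Inc."
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", contract)
parties = re.findall(r"between (.+?) and (.+?)[.,]", contract)
print(dates)    # ['01/15/2022']
print(parties)  # [('ACME Corp', 'Widgets Inc')]
```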
36. What is Document Classification?
1. Give context to the areas of the document
2. Correlation analysis
3. Basket analysis
4. Mind maps
5. Knowledge graphs
37. Review: there are many types of data in a Data Lakehouse. The structured sector is the curated Data Lake and Data Warehouse data feeding the Data Warehouse; the textual sector reaches structured analysis through Textual ETL (landing, for example, in Parquet files); Analog/IoT completes the picture.
38. Review:
• Sort your textual data documents by type
• Convert your textual data to a common format
• Deidentify data if you are going to store it
• Apply context to your textual data!
• Using context, you can convert your unstructured data into structured data!
39. Review: this conversion allows for structured data analysis
1. Document Markup
2. Sentiment Analysis
3. Inline Contextualization
4. Document Classification
5. Plus many others…
42. References and Sources
• Bill Inmon – slides and conversations
• Inmon, B. (2021). Building the Data Lakehouse. Technics Publications LLC.
• https://www.snowflake.com/guides/what-iot
• https://medicalterminologyblog.com/homonyms-medical-language-2/
• Andrea and Amanda Rapien – format and additional clarifying material