The Data Lakehouse Symposium – February, 2022
Hosted by Bill Inmon and Databricks – Feb 1-4, 2022
Text in the Data Lakehouse
David Rapien
Partner in Forest Rim Technology
Associate Professor – Lindner College of Business
University of Cincinnati
Let’s Look at the Major Historical Changes in Data Collection, Storage, and Usage
• 1980s - The Data Warehouse allowed us to hold a single version of the truth and make enterprise-wide decisions.
• 2010 - The Data Lake allowed us to collect all of our “data” in one place.
• 2020 - The Data Lakehouse marries the two by adding governance and metadata to data going into the Data Lake, so that it can be separately transitioned into a Data Warehouse AND consumed by decision makers and analysts.
Where is your company’s data focus?
• Data collection for “future use”?
• Business decisions?
• Analysis and research?
• We have none!
What types of data does your company collect and store?
• Transactional Data from customer interactions?
• Machine generated data?
• Emails, blogs, customer reviews, medical records, contracts?
• Images, videos, scans, audio files?
What We Will Discuss in Today’s Presentation
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
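Pareto’s Law Holds True
All Corporate Data in the Lakehouse: structured, textual, Analog/IoT
• Amount of Data: Curated Data Lake and Data Warehouse Data is ~20% or less; the rest is ~80% or more
• Data Used for Decision Making: Curated Data Lake and Data Warehouse Data is ~80% or more; the rest is ~20% or less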
~80-90% of business decisions are made on less than 20% of the data.
Is there something wrong here?
Structured Data: Data Warehouse Data
• Physical models
• Tables
• Aggregated
• Scrubbed
• Additional Metadata
• Additional Data Governance
Textual Data
• Documents
• Emails
• Contracts
• Medical Records
• Voice of the Customer
• Insurance Claims
• Call Center …
• Other???
Analog/IoT Data
• Status Data
• Automation Data
• Location Data
• Clickstreams
• Sensor Data
• Images
• Audio / Video files
[Figure: The relative volumes of structured, textual, and Analog/IoT data]
[Figure: The relative amount of business value to be found in the different sectors (structured, textual, Analog/IoT)]
How do we currently USE different types of data?
• structured: Data Warehouse, Timeline Analysis, 360° View of the Customer
• textual: Manual Analysis? NLP? Failure? Textual ETL!
• Analog/IoT: Machine Learning / AI
What data are you missing in your analysis?
Textual data:
• Voice mails
• Dictations
• Transcriptions
• PDFs
• Word documents
• CSVs
• Yelp Reviews
• Parquet Files
• Document scans
• The voice of the customer
• Real estate deeds/sales
• Internet
• Insurance claims
• Warranties
• Emails
• Call center
• Contracts
• Medical records
What is similar about most of this textual data?
• It is a stream of thought
• It is different document by document
• It does not have Primary Keys or Foreign Keys
• It has little format
• It is DIRTY DATA!
Why?
Because people do not write or talk the way data is laid out in the structured world.
• structured: Keys, Attributes, Indexes, Physical models
• textual: “I was looking at the nice colored sweater in the window. I wonder if I could try it on….but I don’t like the sleeve length…”
These worlds are incompatible.
In order to address text, you need a completely different approach.
The Issue: The modelling and design techniques that worked in the Structured world do not work in the world of Text.
Think about it.
Consider the types of text that we are storing and NOT USING.
You need to organize everything and convert each into a standard text format
• Audio Data (Voice mails, Dictations, Transcriptions) - USE: Transcription (Dragon)
• Mixed (PDFs, Document scans) - USE: OCR and Converters
• Tabular Data (CSVs, Yelp Reviews, Parquet files, The voice of the customer) - USE: Set “Textual” Columns
• General Documents (Word documents, Internet, Emails, Call center) - USE: Converters and Formatters
• “Some Format” (Real estate deeds/sales, Insurance claims, Warranties, Contracts, Medical records) - USE: Inline Contextualization
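The grouping above can be sketched as a small routing step that decides which conversion family a file needs before it reaches a common text format. This is a minimal sketch, not the presenter's implementation: the extension-to-family table, the `choose_conversion` name, and the fallback label are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the conversion families named on the slide.
# A real lakehouse would more likely route on catalog metadata than on file names.
CONVERSION_BY_SUFFIX = {
    ".wav": "Transcription",
    ".mp3": "Transcription",
    ".pdf": "OCR and Converters",
    ".tiff": "OCR and Converters",
    ".csv": 'Set "Textual" Columns',
    ".parquet": 'Set "Textual" Columns',
    ".docx": "Converters and Formatters",
    ".html": "Converters and Formatters",
    ".eml": "Converters and Formatters",
}


def choose_conversion(filename: str) -> str:
    """Return the conversion family for a file, or flag it for manual review."""
    suffix = Path(filename).suffix.lower()
    return CONVERSION_BY_SUFFIX.get(suffix, "Needs manual review")


if __name__ == "__main__":
    for name in ["call_0142.WAV", "claim_scan.pdf", "reviews.csv", "notes.xyz"]:
        print(name, "->", choose_conversion(name))
```

Anything the table does not recognize is flagged rather than guessed at, since sorting documents by type is usually a business decision.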
Convert to a Common Textual Format
Now WHAT do we do with this data?
Common Textual Format
Deidentify – Redact Personal Data
Apply Context!
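As a rough illustration of the deidentification step, the sketch below masks a few kinds of personal data with regular expressions. The three patterns (email, US-style phone number, US Social Security number) and the `redact` function are assumptions chosen for the example. Real redaction also has to handle names, addresses, record numbers and the like, which simple patterns will not catch.

```python
import re

# Hypothetical patterns for illustration only; they cover a narrow set of formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace each matched span with a bracketed label such as [EMAIL]."""
    for label, pattern in (("EMAIL", EMAIL), ("SSN", SSN), ("PHONE", US_PHONE)):
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Call 513-555-0142 or write jane.doe@example.com; SSN 123-45-6789."
    print(redact(sample))
```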
If you are going to address text, you MUST have a handle on both text AND context.
It is not sufficient to merely address text.
Text is relatively simple. Context is 90% of the battle.
Furthermore, most of the context that is needed lies OUTSIDE of the text.
You can analyze the text until you are blue in the face and never find the relevant context of the text.
So what is the purpose of all of this?
By properly applying context, you can convert your Unstructured Textual Data into Structured Data!
This allows you to use your Textual Data for Structured Analysis!
What is Meant by “the Context” of Textual Data?
The same word can have different meanings in different areas
Consider the word “Trust”
In Friendship – It is the ability to believe in the word and actions of another
In Finance – It is a legal vehicle used to pass and allocate assets to another
In Networking – It allows one computer to communicate and share with another
What is Meant by “the Context” of Textual Data?
The same word can have different meanings in SIMILAR areas
Consider the word “Cervical” in the medical field
It could mean: pertaining to the neck
• cervical vertebra
It could mean: pertaining to the lowest segment of the uterus
• cervical cancer
• cervical hemorrhage
What is Meant by “the Context” of Textual Data?
The same word can have different meanings in RELATED areas
Consider the word “Dermatome” in the medical field
It means an area of the skin supplied by a specific nerve root
It is also a surgical instrument used to cut the skin
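One way to picture "context that lies outside the text" is a lookup keyed on both the term and the area the document came from. The sketch below is only an illustration: the `TAXONOMY` table, the area labels (for example "orthopedics" and "gynecology"), and the `resolve` function are hypothetical, and a real taxonomy would be built by subject matter experts.

```python
# Hypothetical taxonomy: (term, area the document came from) -> intended meaning.
# The area is metadata about the document; it is not found by reading the text itself.
TAXONOMY = {
    ("trust", "friendship"): "belief in the word and actions of another",
    ("trust", "finance"): "legal vehicle used to pass and allocate assets",
    ("trust", "networking"): "relationship that lets one computer share with another",
    ("cervical", "orthopedics"): "pertaining to the neck",
    ("cervical", "gynecology"): "pertaining to the lowest segment of the uterus",
}


def resolve(term: str, area: str) -> str:
    """Look up the meaning of a term given the area the document belongs to."""
    return TAXONOMY.get((term.lower(), area.lower()), "unknown: needs a subject matter expert")


if __name__ == "__main__":
    print(resolve("Trust", "Finance"))
    print(resolve("Cervical", "Orthopedics"))
    print(resolve("Dermatome", "Surgery"))
```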
What is Meant by “Adding Context” to Textual Data?
It has different meanings in different areas
1. Extraction of key elements and phrases for categorization
2. Aggregation of terms into layered categories
3. Similar to Data Governance with Data Warehouse Data
• Requires subject matter experts
• Requires understanding of what dimensions you want for analysis
• Can be Highly Political between Departments
• It is controlled by BUSINESS, not IT or Data Analysts!
What is the Process of Adding Context to Textual Data?
It matters what analytics you want to perform on your text
1. Data Conversion (Maybe)
2. Data Redaction (Maybe)
3. Data Extraction
• Identification of “Important” phrases or areas (Nexus)
• Running through an Engine to pair the Nexus with the text
4. Data Transformation
• Classification of the matched Nexus phrases
• Adding Metadata: Dates, Sentiment, Sentence Information, Byte Location, Batch #s, Business, Nexus, Customer, …
5. Data Loading
• Data Warehouse, Data Mart, Parquet Files
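The extraction and transformation steps above can be sketched in a few lines: match a list of important phrases against the text, classify each match, and attach metadata such as the byte location. This is a simplified sketch under stated assumptions, not the Textual ETL engine itself. The `NEXUS` table, the `ContextRow` record, and the `contextualize` function are hypothetical names introduced for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical nexus: important phrase -> classification.
# In practice this list is owned by the business and built with subject matter experts.
NEXUS = {
    "sweater": "Product",
    "sleeve length": "Product Attribute",
    "don't like": "Negative Opinion",
}


@dataclass
class ContextRow:
    """One structured row produced from a matched phrase."""
    document_id: str
    phrase: str
    classification: str
    byte_offset: int


def contextualize(document_id: str, text: str) -> list[ContextRow]:
    """Pair nexus phrases with the text and return rows ordered by byte location."""
    rows = []
    for phrase, classification in NEXUS.items():
        for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            # Byte location of the match in the UTF-8 encoded document.
            offset = len(text[: match.start()].encode("utf-8"))
            rows.append(ContextRow(document_id, match.group(0), classification, offset))
    return sorted(rows, key=lambda row: row.byte_offset)


if __name__ == "__main__":
    comment = "I was looking at the nice colored sweater in the window. I don't like the sleeve length."
    for row in contextualize("doc-1", comment):
        print(row)
```

Rows of this shape have keys and attributes, so they can be loaded into a warehouse table or written out as Parquet like any other structured data.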
What Can Be Done with Contextualized Data?
We can do Structured Data Analysis
1. Document Markup
• Visually identifies parts of the document
2. Sentiment Analysis
• Gives feeling and degrees of feeling to parts of the document
3. Inline Contextualization
• Reverse Mail Merge – Pull out a set of terms that have value
4. Document Classification
• Give context to the areas of the document for correlation or basket analysis
What is Document Markup?
1. Data Visualization
• Color coded
• Draws the eyes
2. Used document by document
3. Great for “spot” review
4. Irrelevant and impractical for analyzing Big Data
What is Sentiment Analysis?
1. Assigns Feeling to words
• Color coded
• Draws the eyes
2. Tries to identify and categorize opinions stated in some text
3. Great for Comments
4. A BASIC requirement for Voice of the Customer Analytics
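A very small lexicon-based scorer shows the idea of assigning feeling, and degrees of feeling, to words. The word weights in `LEXICON` and the `sentiment_score` and `sentiment_label` functions are hypothetical, and this toy version ignores negation, sarcasm, and domain context, all of which matter in practice.

```python
import re

# Hypothetical lexicon: word -> degree of feeling (positive or negative).
LEXICON = {"love": 2, "great": 1, "nice": 1, "slow": -1, "hate": -2, "broken": -2}


def sentiment_score(comment: str) -> int:
    """Sum the weights of every lexicon word found in the comment."""
    words = re.findall(r"[a-z']+", comment.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def sentiment_label(score: int) -> str:
    """Collapse a numeric score into a coarse category."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    review = "Great sweater, but shipping was slow and the zipper arrived broken."
    score = sentiment_score(review)
    print(score, sentiment_label(score))
```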
What is Inline Contextualization?
1. Reverse Mail Merge
2. Pull out a set of terms that have value
• Names
• Contract Dates
• Ratings
3. Useful for Contracts
4. Needed for Redaction
5. Needed for Document Separation
• Medical Visits
• Combined repeat visits
6. Needed for retrieval of grouped data from blocks of text
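The "reverse mail merge" idea can be sketched with a pattern that mirrors a known template and pulls the variable parts back out as named fields. The contract sentence, the `CONTRACT_PATTERN` expression, and the field names are assumptions made for illustration. Real documents vary far more than a single pattern can cover.

```python
import re

# Hypothetical template: the fixed wording is the "merge document",
# and the named groups are the values that were merged into it.
CONTRACT_PATTERN = re.compile(
    r"This agreement is made on (?P<contract_date>\d{4}-\d{2}-\d{2}) "
    r"between (?P<party_a>.+?) and (?P<party_b>.+?)\."
)


def extract_fields(text: str) -> dict:
    """Return the merged-in values as a dict, or an empty dict if the template is absent."""
    match = CONTRACT_PATTERN.search(text)
    return match.groupdict() if match else {}


if __name__ == "__main__":
    clause = "This agreement is made on 2022-02-01 between Acme Corp and Jane Doe."
    print(extract_fields(clause))
```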
What is Document Classification?
1. Give context to the areas of the document
2. Correlation Analysis
3. Basket Analysis
4. Mind Maps
5. Knowledge Graph
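Once documents carry classifications, a simple form of basket analysis is to count which categories show up together in the same document. The sketch below assumes the classifications already exist. The document IDs, the category names, and the `co_occurrence` function are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(doc_categories: dict[str, list[str]]) -> Counter:
    """Count how many documents contain each unordered pair of categories."""
    counts = Counter()
    for categories in doc_categories.values():
        # De-duplicate and sort so each pair is counted once per document.
        for pair in combinations(sorted(set(categories)), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical classifications produced by an earlier contextualization step.
    docs = {
        "d1": ["Shipping", "Negative Opinion", "Refund"],
        "d2": ["Shipping", "Negative Opinion"],
        "d3": ["Refund", "Shipping"],
    }
    for pair, count in co_occurrence(docs).most_common():
        print(pair, count)
```

Pair counts like these are also a natural starting point for correlation analysis or for the edges of a knowledge graph.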
Review
There are many types of data in a Data Lakehouse: structured, textual, and Analog/IoT
[Figure: Textual ETL; Data Warehouse; Parquet Files; Curated Data Lake and Data Warehouse Data]
Review
• Sort your textual data documents by type
• Convert your textual data to a common format
• Deidentify data if you are going to store it
• Apply Context to your textual data!
Using context, you can convert your Unstructured Data into Structured Data!
Review
This conversion allows for Structured Data Analysis
1. Document Markup
2. Sentiment Analysis
3. Inline Contextualization
4. Document Classification
5. Plus many others…
Questions
https://www.forestrimtech.com/
info@forestrimtech.com
References and Sources
• Bill Inmon – Slides and Conversations
• Inmon, B. (2021). Building The Data Lakehouse. Technics Publications LLC.
• https://www.snowflake.com/guides/what-iot
• https://medicalterminologyblog.com/homonyms-medical-language-2/
• Andrea and Amanda Rapien – Format and Additional Clarifying Material

Mastering SEO for Google in the AI Era - Dennis Yu
 
Press Release Sample 2 by Capri Guarisco
Press Release Sample 2 by Capri GuariscoPress Release Sample 2 by Capri Guarisco
Press Release Sample 2 by Capri Guarisco
 
The Power of Ugly Video Ads, Engaging High-Intent Marketing - Brian Alves
The Power of Ugly Video Ads, Engaging High-Intent Marketing - Brian AlvesThe Power of Ugly Video Ads, Engaging High-Intent Marketing - Brian Alves
The Power of Ugly Video Ads, Engaging High-Intent Marketing - Brian Alves
 
On Page SEO.pptx learn basics of digital marketing.
On Page SEO.pptx learn basics of digital marketing.On Page SEO.pptx learn basics of digital marketing.
On Page SEO.pptx learn basics of digital marketing.
 
Organic Social Media Marketing Presentation.pdf
Organic Social Media Marketing Presentation.pdfOrganic Social Media Marketing Presentation.pdf
Organic Social Media Marketing Presentation.pdf
 
SEO for Revenue, Grow Your Business, Not Just Your Rankings - Dale Bertrand
SEO for Revenue, Grow Your Business, Not Just Your Rankings - Dale BertrandSEO for Revenue, Grow Your Business, Not Just Your Rankings - Dale Bertrand
SEO for Revenue, Grow Your Business, Not Just Your Rankings - Dale Bertrand
 
Demapro: Your Partner in Strategic Market Insights
Demapro: Your Partner in Strategic Market InsightsDemapro: Your Partner in Strategic Market Insights
Demapro: Your Partner in Strategic Market Insights
 
Public Relations Cheat Sheet (PRLab's PR Sheet)
Public Relations Cheat Sheet (PRLab's PR Sheet)Public Relations Cheat Sheet (PRLab's PR Sheet)
Public Relations Cheat Sheet (PRLab's PR Sheet)
 
5 Powerful Social Media Platforms for Digital Marketing Success.pdf
5 Powerful Social Media Platforms for Digital Marketing Success.pdf5 Powerful Social Media Platforms for Digital Marketing Success.pdf
5 Powerful Social Media Platforms for Digital Marketing Success.pdf
 
Digital Marketing Trends, Experts Insights on How to Gain a Competitive Edge ...
Digital Marketing Trends, Experts Insights on How to Gain a Competitive Edge ...Digital Marketing Trends, Experts Insights on How to Gain a Competitive Edge ...
Digital Marketing Trends, Experts Insights on How to Gain a Competitive Edge ...
 
Outsourcing digital marketing Strategies
Outsourcing digital marketing StrategiesOutsourcing digital marketing Strategies
Outsourcing digital marketing Strategies
 
GetResponse Alternative CleverlyBox Review: Transforming Cold Emails into Hot...
GetResponse Alternative CleverlyBox Review: Transforming Cold Emails into Hot...GetResponse Alternative CleverlyBox Review: Transforming Cold Emails into Hot...
GetResponse Alternative CleverlyBox Review: Transforming Cold Emails into Hot...
 

Data Lakehouse Symposium | Day 2

  • 1. The Data Lakehouse Symposium – February, 2022 Hosted by - Bill Inmon and DataBricks – Feb 1- 4, 2022
  • 2. The Data Lakehouse Symposium – February, 2022 Text in the Data Lakehouse David Rapien Partner in ForestRim Technology Associate Professor – Lindner College of Business University of Cincinnati
  • 3. Let’s Look at the Major Historical Changes in Data Collection, Storage, and Usage • 1980s - The Data Warehouse allowed us to hold a single version of the truth and make enterprise-wide decisions. • 2010 - The Data Lake allowed us to collect all of our “data” in one place. • 2020 - The Data Lakehouse marries the two by adding governance and metadata to data going into the Data Lake so that it can be separately transitioned into a Data Warehouse AND consumed by decision makers and analysts. Text in the Data Lakehouse
  • 4. Where is your company’s data focus? • Data collection for “future use”? • Business decisions? • Analysis and research? • We have none! Text in the Data Lakehouse
  • 5. What types of data does your company collect and store? • Transactional Data from customer interactions? • Machine generated data? • Emails, blogs, customer reviews, medical records, contracts? • Images, videos, scans, audio files? Text in the Data Lakehouse
  • 6. Text in the Data Lakehouse What We Will Discuss in Today’s Presentation Presentation Talking Points • Types of Data in the Lakehouse • Textual Data in the Lakehouse • What is needed to use Textual Data in the Lakehouse • Forest Rim Knowledge Share
  • 8. Text in the Data Lakehouse All Corporate Data in the Lakehouse: structured, textual, Analog/IoT. Pareto’s Law Holds True for Curated Data Lake and Data Warehouse Data • Amount of Data: ~20% or less • Data Used for Decision Making: ~80% or more
  • 9. Text in the Data Lakehouse All Corporate Data in the Lakehouse: structured, textual, Analog/IoT. Curated Data Lake and Data Warehouse Data: ~80-90% of business decisions are made on less than 20% of the data. Is there something wrong here?
  • 10. Text in the Data Lakehouse All Corporate Data in the Lakehouse structured textual Analog/IoT Data Warehouse Data • Physical models • Tables • Aggregated • Scrubbed • Additional Metadata • Additional Data Governance Curated Data Lake and Data Warehouse Data
  • 11. Text in the Data Lakehouse All Corporate Data in the Lakehouse structured textual Analog/IoT • Documents • Emails • Contracts • Medical Records • Voice of the Customer • Insurance Claims • Call Center … • Other??? Curated Data Lake and Data Warehouse Data
  • 12. Text in the Data Lakehouse All Corporate Data in the Lakehouse structured textual Analog/IoT • Status Data • Automation Data • Location Data • Clickstreams • Sensor Data • Images • Audio / Video files Curated Data Lake and Data Warehouse Data
  • 13. Text in the Data Lakehouse structured textual Analog/IoT The relative volumes of data
  • 14. Text in the Data Lakehouse structured textual Analog/IoT The relative amount of business value to be found in the different sectors Business value
  • 15. Text in the Data Lakehouse How do we currently USE different types of data? • structured (Curated Data Lake and Data Warehouse Data): Data Warehouse, Timeline Analysis, 360° View of the Customer • Analog/IoT: Machine Learning / AI • textual: Manual Analysis? NLP? Failure? Textual ETL!
  • 17. Text in the Data Lakehouse What data are you missing in your analysis? textual: Voice mails, Dictations, Transcriptions, PDFs, Word documents, CSVs, Yelp Reviews, Parquet Files, Document scans, The voice of the customer, Real estate deeds/sales, Internet, Insurance claims, Warranties, Emails, Call center, Contracts, Medical records
  • 18. Text in the Data Lakehouse What is similar about most of this textual data? • It is stream of thought • It is different document by document • It does not have Primary Keys or Foreign Keys • It has little format • It is DIRTY DATA!
  • 19. Text in the Data Lakehouse The Issue: The modelling and design techniques that worked in the Structured world do not work in the world of Text. Why? Because people do not write or talk the same way that is found in the structured world (Keys, Attributes, Indexes, Physical models). Think About it: “I was looking at the nice colored sweater in the window. I wonder if I could try it on….but I don’t like the sleeve length…” These worlds are incompatible. In order to address text you need a completely different approach.
  • 21. Text in the Data Lakehouse Have we considered the types of text that we are storing and NOT USING? textual: Voice mails, Dictations, Transcriptions, PDFs, Word documents, CSVs, Yelp Reviews, Parquet Files, Document scans, The voice of the customer, Real estate deeds/sales, Internet, Insurance claims, Warranties, Emails, Call center, Contracts, Medical records
  • 22. Text in the Data Lakehouse You need to organize everything and convert each into a standard text format • Audio Data (Voice mails, Dictations, Transcriptions) - USE: Transcription (Dragon) • Mixed (PDFs, Document scans) - USE: OCR and Converters • Tabular Data (CSVs, Yelp Reviews, Parquet files, The voice of the customer) - USE: Set “Textual” Columns • General Documents (Word documents, Internet, Emails, Call center) - USE: Converters and Formatters • “Some Format” (Real estate deeds/sales, Insurance claims, Warranties, Contracts, Medical records) - USE: Inline Contextualization
  • 23. Text in the Data Lakehouse Transcription (Dragon), OCR and Converters, Set “Textual” Columns, Converters and Formatters, Inline Contextualization: Convert to a Common Textual Format. Now WHAT do we do with this data?
  • 24. Text in the Data Lakehouse Transcription (Dragon), OCR and Converters, Set “Textual” Columns, Converters and Formatters, Inline Contextualization: Common Textual Format. Deidentify – Redact Personal Data. Apply Context!
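The deidentification step on slide 24 can be sketched roughly as follows. This is a minimal illustration that assumes simple regular expressions are enough to find personal data. The patterns, the labels, and the `redact` helper are hypothetical and are not taken from any tool named in the slides; production redaction generally needs much broader coverage.

```python
import re

# Hypothetical patterns for illustration only; real personal-data
# detection needs far more coverage than three regular expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each matched span with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact jane.doe@example.com or 513-555-0100 about claim 42."
print(redact(sample))
```

Replacing a value with a label rather than deleting it keeps the sentence readable, so later contextualization can still see that an email or phone number was present.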
  • 25. Text in the Data Lakehouse If you are going to address text you MUST have a handle on both text AND context. It is not sufficient to merely address text. Text is relatively simple. Context is 90% of the battle. Furthermore, most of the context that is needed lies OUTSIDE of the text. You can analyze the text until you are blue in the face and never find the relevant context of the text.
  • 26. Text in the Data Lakehouse So what is the purpose of all of this? By properly applying context you can convert your Unstructured Textual Data into Structured Data! This allows you to use your Textual Data for Structured Analysis!
  • 27. Text in the Data Lakehouse What is Meant by “the Context” of Textual Data? It has different meanings in different areas. Consider the word “Trust”: • In Friendship – It is the ability to believe in the word and actions of another • In Finance – It is a legal vehicle used to pass and allocate assets to another • In Networking – It allows one computer to communicate and share with another
  • 28. Text in the Data Lakehouse What is Meant by “the Context” of Textual Data? It has different meanings in SIMILAR areas. Consider the word “Cervical” in the medical field: • It could mean: pertaining to the neck (cervical vertebra) • It could mean: pertaining to the lowest segment of the uterus (cervical cancer, cervical hemorrhage)
  • 29. Text in the Data Lakehouse What is Meant by “the Context” of Textual Data? It has different meanings in Related areas. Consider the word “Dermatome” in the medical field: • It means an area of the skin supplied by a specific nerve root • It is also a surgical instrument used to cut the skin
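The “Trust”, “Cervical”, and “Dermatome” examples suggest that a term only resolves to a meaning once a context is supplied from outside the text. A minimal sketch of that idea, assuming the context is known in advance (for example from the department that produced the document): the meanings come from the slides, while the context labels and the `resolve` helper are made-up names for illustration.

```python
# Meanings are taken from the slides; the context labels
# ("friendship", "spine", "surgery", ...) are assumed names.
MEANINGS = {
    ("trust", "friendship"): "belief in the word and actions of another",
    ("trust", "finance"): "legal vehicle used to pass and allocate assets",
    ("trust", "networking"): "one computer may communicate and share with another",
    ("cervical", "spine"): "pertaining to the neck",
    ("cervical", "gynecology"): "pertaining to the lowest segment of the uterus",
    ("dermatome", "neurology"): "area of skin supplied by a specific nerve root",
    ("dermatome", "surgery"): "surgical instrument used to cut the skin",
}


def resolve(term: str, context: str) -> str:
    """Return the meaning of a term in a context, or 'unknown' if unmapped."""
    return MEANINGS.get((term.lower(), context.lower()), "unknown")


print(resolve("Trust", "finance"))
print(resolve("Cervical", "spine"))
```

Keying the lookup on the pair (term, context) rather than on the term alone is the whole point: the same word cannot be classified until the context is known.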
  • 30. Text in the Data Lakehouse What is Meant by “Adding Context” to Textual Data? It has different meanings in different areas 1. Extraction of key elements and phrases for categorization 2. Aggregation of terms into layered categories 3. Similar to Data Governance with Data Warehouse Data • Requires subject matter experts • Requires understanding of what dimensions you want for analysis • Can be Highly Political between Departments • It is controlled by BUSINESS, not IT or Data Analysts!
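Points 1 and 2 on slide 30 (extracting key terms, then aggregating them into layered categories) can be sketched as a two-level lookup. The taxonomy below is entirely hypothetical; as the slide notes, a real one would be defined by subject matter experts on the business side.

```python
# Hypothetical two-level taxonomy: term -> category -> broader category.
TERM_TO_CATEGORY = {
    "sweater": "apparel",
    "sleeve": "apparel",
    "refund": "billing",
    "invoice": "billing",
}
CATEGORY_TO_PARENT = {"apparel": "product", "billing": "finance"}


def categorize(text: str) -> list[tuple[str, str, str]]:
    """Return (term, category, parent category) for each known term in the text."""
    hits = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")  # drop surrounding punctuation
        if word in TERM_TO_CATEGORY:
            category = TERM_TO_CATEGORY[word]
            hits.append((word, category, CATEGORY_TO_PARENT[category]))
    return hits


print(categorize("I like the sweater, but I want a refund."))
```

Each layer gives analysts a different grain to group by: individual terms, categories, or the broader business area.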
  • 31. Text in the Data Lakehouse What is the Process of Adding Context to Textual Data? It matters what analytics you want to perform on your text 1. Data Conversion (Maybe) 2. Data Redaction (Maybe) 3. Data Extraction • Identification of “Important” phrases or areas (Nexus) • Running through an Engine to pair the Nexus with the text 4. Data Transformation • Classification of the matched Nexus phrases • Adding Metadata (Dates, Sentiment, Sentence Information, Byte Location, Batch #s, Business, Nexus, Customer, …) 5. Data Loading • Data Warehouse, Data Mart, Parquet Files
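Steps 3 and 4 of the process on slide 31 can be sketched as below. This is only an illustration under stated assumptions: the `NEXUS` dictionary, the `Match` record, and the `extract` function are hypothetical names, the sentence splitter is deliberately naive, and the sample text paraphrases the sweater quote from slide 19. It is not how any particular Textual ETL engine works. The loading step is left as a list of plain rows; writing them to a warehouse table or Parquet file would need an external library that is not shown here.

```python
import re
from dataclasses import dataclass, asdict

# Hypothetical Nexus: phrase -> classification. A real one would be
# supplied by subject matter experts, not hard coded.
NEXUS = {
    "sleeve length": "fit",
    "nice colored": "appearance",
    "try it on": "purchase intent",
}


@dataclass
class Match:
    document_id: str
    phrase: str
    classification: str
    byte_offset: int
    sentence_number: int


def extract(document_id: str, text: str) -> list[Match]:
    """Pair each Nexus phrase with the text and attach metadata to every match."""
    # Naive sentence split: a sentence ends at ., ! or ? followed by whitespace.
    sentence_starts = [0] + [m.end() for m in re.finditer(r"[.!?]+\s+", text)]
    matches = []
    for phrase, classification in NEXUS.items():
        for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            sentence_number = sum(1 for start in sentence_starts if start <= m.start())
            byte_offset = len(text[: m.start()].encode("utf-8"))
            matches.append(
                Match(document_id, m.group(0), classification, byte_offset, sentence_number)
            )
    return sorted(matches, key=lambda match: match.byte_offset)


text = (
    "I was looking at the nice colored sweater in the window. "
    "I wonder if I could try it on. But I don't like the sleeve length."
)
rows = [asdict(match) for match in extract("doc-1", text)]
for row in rows:
    print(row)
```

Each row is now structured data: it has a document key, a classification, and positional metadata, so it can be joined and aggregated like any other table.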
  • 32. Text in the Data Lakehouse What Can Be Done with Contextualized Data? We can do Structured Data Analysis 1. Document Markup • Visually identifies parts of the document 2. Sentiment Analysis • Gives feeling and degrees of feeling to parts of document 3. Inline Contextualization • Reverse Mail Merge – Pull out set of terms that have value 4. Document Classification • Give context to the areas of the document for correlation or basket analysis
  • 33. Text in the Data Lakehouse What is Document Markup? 1. Data Visualization • Color coded • Draws the eyes 2. Used document by document 3. Great for “spot” review 4. Irrelevant and impractical for analyzing Big Data
  • 34. Text in the Data Lakehouse What is Sentiment Analysis? 1. Assigns Feeling to words • Color coded • Draws the eyes 2. Tries to identify and categorize opinions stated in some text 3. Great for Comments 4. A BASIC requirement for Voice of the Customer Analytics
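The idea of assigning feeling to words can be sketched with a small word-score lexicon. This is a toy example under stated assumptions: the scores, the negation list, and the `sentiment` function are hypothetical, and real Voice of the Customer work would use a much larger, domain-specific lexicon or a trained model.

```python
# Hypothetical word scores; a real lexicon would be much larger
# and tuned to the business domain.
LEXICON = {"nice": 1, "love": 2, "great": 2, "bad": -1, "terrible": -2, "hate": -2}
NEGATIONS = {"not", "don't", "never"}


def sentiment(text: str) -> int:
    """Sum word scores, flipping the sign of a word that directly follows a negation."""
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(word, 0)
        score += -value if negate else value
        negate = False  # a negation only affects the next word
    return score


print(sentiment("The sweater is nice."))
print(sentiment("I don't love the sleeve length"))
```

The sign gives the feeling and the magnitude gives the degree of feeling, which matches the "feeling and degrees of feeling" description on slide 32.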
  • 35. Text in the Data Lakehouse What is Inline Contextualization? 1. Reverse Mail Merge 2. Pull out set of terms that have value • Names • Contract Dates • Ratings 3. Useful for Contracts 4. Needed for Redaction 5. Needed for Document Separation • Medical Visits • Combined repeat visits 6. Needed for retrieval of grouped data from blocks of text
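“Reverse mail merge” can be read as running a mail-merge template backwards: given text that was produced from a fixed wording plus variable fields, pull the fields back out. A minimal sketch using named groups in a regular expression follows; the contract wording, the field names, and the `reverse_mail_merge` helper are assumptions made for illustration.

```python
import re

# Hypothetical contract wording; each named group is one merged-in field.
TEMPLATE = re.compile(
    r"This agreement between (?P<party_a>[A-Z][\w ]+?) and (?P<party_b>[A-Z][\w ]+?) "
    r"is effective (?P<contract_date>\d{4}-\d{2}-\d{2})\."
)


def reverse_mail_merge(text: str) -> list[dict[str, str]]:
    """Pull the merged-in fields back out of every passage that fits the template."""
    return [m.groupdict() for m in TEMPLATE.finditer(text)]


doc = (
    "Preamble. This agreement between Acme Corp and Jane Doe "
    "is effective 2022-02-01. Other terms follow."
)
print(reverse_mail_merge(doc))
```

Because the names and dates come out as labeled fields, the same matches could also drive redaction or be used to split a long file into separate documents, as the slide suggests.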
  • 36. Text in the Data Lakehouse What is Document Classification? 1. Give context to the areas of the document 2. Correlation Analysis 3. Basket Analysis 4. Mind Maps 5. Knowledge Graph
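Once documents have been classified, a simple form of basket analysis is to count which categories appear together in the same document. The sketch below assumes the classification step has already produced a set of categories per document; the document ids, the categories, and the `co_occurrence` function are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(doc_categories: dict[str, set[str]]) -> Counter:
    """Count how many documents contain each unordered pair of categories."""
    pairs = Counter()
    for categories in doc_categories.values():
        # Sorting makes ("fit", "price") and ("price", "fit") the same key.
        for pair in combinations(sorted(categories), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical classification output: document id -> categories found in it.
docs = {
    "review-1": {"fit", "price"},
    "review-2": {"fit", "price", "shipping"},
    "review-3": {"shipping"},
}
print(co_occurrence(docs).most_common(1))
```

Pairs that co-occur often are candidates for the correlation analysis, mind maps, or knowledge graph edges listed on the slide.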
  • 37. Text in the Data Lakehouse Review There are many types of data in a Data Lakehouse: structured, textual, Analog/IoT. Curated Data Lake and Data Warehouse Data. Textual ETL, Data Warehouse, Parquet Files
  • 38. Text in the Data Lakehouse Review • Sort your textual data documents by types • Convert your textual data to a common format • Deidentify data if you are going to store it • Apply Context to your textual data! • Using context, you can convert your Unstructured Data into Structured Data!
  • 39. Text in the Data Lakehouse Review This conversion allows for Structured Data Analysis: 1. Document Markup 2. Sentiment Analysis 3. Inline Contextualization 4. Document Classification 5. Plus many others…
  • 40. Text in the Data Lakehouse Questions
  • 42. Text in the Data Lakehouse References and Sources • Bill Inmon – Slides and Conversations • Inmon, B. (2021). Building The Data Lakehouse. Technics Publications LLC. • https://www.snowflake.com/guides/what-iot • https://medicalterminologyblog.com/homonyms-medical-language-2/ • Andrea and Amanda Rapien – Format and Additional Clarifying Material