The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data being generated by organizations: machine-generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The 20th annual Enterprise Data World (EDW) Conference took place in San Diego last month, April 17-21. It is recognized as the most comprehensive educational conference on data management in the world.
Joe Caserta was a featured presenter. His session “Evolving from the Data Warehouse to Big Data Analytics - the Emerging Role of the Data Lake," highlighted the challenges and steps needed to become a data-driven organization.
Joe also participated in two panel discussions during the show:
• "Data Lake or Data Warehouse?"
• "Big Data Investments Have Been Made, But What's Next
For more information on Caserta Concepts, visit our website at http://casertaconcepts.com/.
Data Profiling: The First Step to Big Data Quality (Precisely)
Big data offers the promise of a data-driven business model generating new revenue and competitive advantage fueled by new business insights, AI, and machine learning. Yet without high quality data that provides trust, confidence, and understanding, business leaders continue to rely on gut instinct to drive business decisions.
The critical foundation and first step to deliver high quality data in support of a data-driven view that truly leverages the value of big data is data profiling - a proven capability to analyze the actual data content and help you understand what's really there.
View this webinar on-demand to learn five core concepts to effectively apply data profiling to your big data, assess and communicate the quality issues, and take the first step to big data quality and a data-driven business.
BAR360 open data platform presentation at DAMA, Sydney (Sai Paravastu)
Sai Paravastu discusses the benefits of using an open data platform (ODP) for enterprises. The ODP would provide a standardized core of open source Hadoop technologies like HDFS, YARN, and MapReduce. This would allow big data solution providers to build compatible solutions on a common platform, reducing costs and improving interoperability. The ODP would also simplify integration for customers and reduce fragmentation in the industry by coordinating development efforts.
This document discusses data quality and data profiling. It begins by describing problems with data like duplication, inconsistency, and incompleteness. Good data is a valuable asset while bad data can harm a business. Data quality is assessed based on dimensions like accuracy, consistency, completeness, and timeliness. Data profiling statistically examines data to understand issues before development begins. It helps assess data quality and catch problems early. Common analyses include analyzing null values, keys, formats, and more. Data profiling is conducted using SQL or profiling tools during requirements, modeling, and ETL design.
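As a rough illustration of the profiling analyses described above (null values, keys, formats), here is a minimal sketch in Python/pandas; the input file and the candidate-key heuristic are hypothetical, not taken from the document:

```python
# Minimal data profiling sketch with pandas (hypothetical input file).
import pandas as pd

df = pd.read_csv("customers.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),                 # format analysis
    "null_count": df.isna().sum(),                  # null-value analysis
    "null_pct": (df.isna().mean() * 100).round(2),
    "distinct": df.nunique(),                       # key/cardinality analysis
})
print(profile)

# Candidate-key check: columns whose distinct count equals the row count.
print(profile[profile["distinct"] == len(df)].index.tolist())
```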
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat... (DATAVERSITY)
This document discusses the importance of metadata and data governance. It describes how a data catalog can consolidate metadata from various sources like a business glossary, data dictionary, and data profiling. Automating data lineage is key to harvesting metadata at scale and establishing relationships between different metadata objects. When integrated in a data catalog, metadata provides a single source of truth about an organization's data that improves data literacy and trust.
Architecting for Big Data: Trends, Tips, and Deployment Options (Caserta)
Joe Caserta, President at Caserta Concepts addressed the challenges of Business Intelligence in the Big Data world at the Third Annual Great Lakes BI Summit in Detroit, MI on Thursday, March 26. His talk "Architecting for Big Data: Trends, Tips and Deployment Options," focused on how to supplement your data warehousing and business intelligence environments with big data technologies.
For more information on this presentation or the services offered by Caserta Concepts, visit our website: http://casertaconcepts.com/.
michael hamilton legal database design presentation 3 new york (michaelhamilton)
The document outlines a database design methodology for litigation databases consisting of 5 steps: 1) Draft a mission statement and objectives, 2) Analyze the overall data set, 3) Determine necessary data fields, 4) Determine and define business rules, and 5) Assure data integrity. It then provides examples of typical data fields for a coded litigation database including document ID number, attachment range, document date, type, title, names, characteristics, source, and date loaded. Finally, it proposes a database design for a sample antitrust case involving 500 boxes of documents from 4 sources to be reviewed by multiple attorneys.
The Right Data Warehouse: Automation Now, Business Value Thereafter (Inside Analysis)
The Briefing Room with Dr. Robin Bloor and WhereScape
Live Webcast on April 1, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=7b23b14b532bd7be60a70f6bd5209f03
In the Big Data shuffle, everyone is looking at Hadoop as “the answer” to collect interesting data from a new set of sources. While Hadoop has given organizations the power to gather more information assets than ever before, the question still looms: which data, regardless of source, structure, volume and all the rest, are significant for affecting business value – and how do we harness it? One effective approach is to bolster the data warehouse environment with a solution capable of integrating all the data sources, including Hadoop, and automating delivery of key information into the right hands.
Register for this episode of The Briefing Room to hear veteran Analyst Robin Bloor as he explains how a rapidly changing information landscape impacts data management. He will be briefed by Mark Budzinski of WhereScape, who will tout his company’s data warehouse automation solutions. Budzinski will discuss how automation can be the cornerstone for closing the gap between those responsible for data management and the people driving business decisions.
Visit InsideAnalysis.com for more information.
In this presentation at DAMA New York, Joe started by asking a key question: why are we doing this? Why analyze and share all these massive amounts of data? Basically, it comes down to the belief that in any organization, in any situation, if we can get the data and make it correct and timely, insights from it will become instantly actionable for companies to function more nimbly and successfully. Enabling the use of data can be a world-changing, world-improving activity and this session presents the steps necessary to get you there. Joe explained the concept of the "data lake" and also emphasized the role of a strong data governance strategy that incorporates seven components needed for a successful program.
For more information on this presentation or Caserta Concepts, visit our website at http://casertaconcepts.com/.
RWDG Slides: Glossaries, Dictionaries, and Catalogs Result in Data Governance (DATAVERSITY)
If you have the discipline to develop, deliver, and maintain a business glossary, data dictionary, and/or a data catalog, you may already have the makings of a Data Governance program. The roles required to deliver these assets can translate to successful Data Governance in several ways.
In this month’s webinar, Bob Seiner will highlight the aspects of delivering these valuable business assets that result in formal Data Governance. It is practical that your program recognize existing efforts to formalize the definition, production, and usage of data.
Topics to be discussed in this webinar:
• How glossaries, dictionaries, and catalogs add value
• What should be included in these assets
• Who has responsibility for these assets
• When these assets will be valuable to your organization
• Where the discipline results in Data Governance
Data science is one of the top fields in our world now, as it enables us to predict the future and the behaviors of people and systems alike.
Hence, this course focuses on introducing the processes involved in data science.
This document provides an overview of fundamentals of database design. It discusses what a database is, the difference between data and information, why databases are needed, how to select a database system, basic database definitions and building blocks, quality control considerations, and data entry methods. The overall purpose of a database management system is to transform data into information, information into knowledge, and knowledge into action.
Data Mesh in Azure using Cloud Scale Analytics (WAF) (Nathan Bijnens)
This document discusses moving from a centralized data architecture to a distributed data mesh architecture. It describes how a data mesh shifts data management responsibilities to individual business domains, with each domain acting as both a provider and consumer of data products. Key aspects of the data mesh approach discussed include domain-driven design, domain zones to organize domains, treating data as products, and using this approach to enable analytics at enterprise scale on platforms like Azure.
1) The document discusses big data and data science, defining big data using the three Vs of volume, velocity, and variety to characterize high amounts of diverse data sources.
2) Data science is presented as a combination of techniques from fields like mathematics, computer science, and statistics to extract knowledge from data.
3) Successful data scientists require a diverse skillset that includes quantitative skills, technical skills, skepticism, collaboration, and knowledge from multiple disciplines.
Closing the data source discovery gap and accelerating data discovery comprises three steps: profile, identify, and unify. This white paper discusses how the Attivio platform executes those steps, the pain points each one addresses, and the value Attivio provides to advanced analytics and business intelligence (BI) initiatives.
This document provides an agenda and overview for a data warehousing training session. The agenda covers topics such as data warehouse introductions, reviewing relational database management systems and SQL commands, and includes a case study discussion with Q&A. Background information is also provided on the project manager leading the training.
Agile Data Rationalization for Operational Intelligence (Inside Analysis)
The Briefing Room with Eric Kavanagh and Phasic Systems
Live Webcast Mar. 26, 2013
The complexity of today's information architectures creates a wide range of challenges for executives trying to get a strategic view of their current operations. The data and context locked in operational systems often get diluted during the normalization processes of data warehousing and other types of analytic solutions. And the ultimate goal of seeing the big picture gets derailed by a basic inability to reconcile disparate organizational views of key information assets and rules.
Register for this episode of The Briefing Room to learn from Bloor Group CEO Eric Kavanagh, who will explain how a tightly controlled methodology can be combined with modern NoSQL technology to resolve both process and system complexities, thus enabling a much richer, more interconnected information landscape. Kavanagh will be briefed by Geoffrey Malafsky of Phasic Systems who will share his company's tested methodology for capturing and managing the business and process logic that run today's data-driven organizations. He'll demonstrate how a “don't say no” approach to entity definitions can dissolve previously intractable disagreements, opening the door to clear, verifiable operational intelligence.
Visit: http://www.insideanalysis.com
This document provides an overview of key concepts for AWS Certified Data Analytics, including data structures, types, preparation, sources, formats (structured, unstructured, semi-structured), the data lifecycle, AWS services for data storage and analytics, and visualization. It emphasizes that data is a valuable commodity and discusses challenges of analyzing growing unstructured data from various sources using traditional tools.
This document discusses characteristics of big data and the big data stack. It describes the evolution of data from the 1970s to today's large volumes of structured, unstructured and multimedia data. Big data is defined as data that is too large and complex for traditional data processing systems to handle. The document then outlines the challenges of big data and characteristics such as volume, velocity and variety. It also discusses the typical data warehouse environment and Hadoop environment. The five layers of the big data stack are then described including the redundant physical infrastructure, security infrastructure, operational databases, organizing data services and tools, and analytical data warehouses.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
• Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
• Performing data quality validations using libraries built to work with Spark (a sketch of this kind of check follows below)
• Dynamically generating pipelines that can be abstracted away from users
• Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
• Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
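As a rough sketch of the kind of Spark-based validation described above (this is a generic illustration, not Zillow's platform API; the dataset path and column names are invented):

```python
# Generic data quality expectation check in PySpark (illustrative only).
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.read.parquet("s3://bucket/listings/")  # hypothetical dataset

total = df.count()
violations = {
    "listing_id_not_null": df.filter(F.col("listing_id").isNull()).count(),
    "price_non_negative": df.filter(F.col("price") < 0).count(),
}
for check, bad_rows in violations.items():
    status = "PASS" if bad_rows == 0 else "FAIL"
    print(f"{check}: {status} ({bad_rows} of {total} rows violate)")
```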
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
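As a purely hypothetical sketch of the "write a simple function" pattern the abstract describes (the platform call below is invented for illustration; the actual Stitch Fix API is not given in this abstract):

```python
# Hypothetical sketch: a data scientist supplies only a function; the
# platform would handle online deployment, batch execution on Spark,
# and metrics tracking around it.
from typing import List

def predict(features: List[float]) -> float:
    """Stand-in for a real model: features in, score out."""
    return sum(features) / len(features)

# Invented platform call wrapping the function for serving and batch use:
# platform.deploy(predict, name="demo-model", metrics=["latency", "scores"])
print(predict([0.2, 0.4, 0.9]))
```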
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
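A minimal sketch of the stage-level scheduling API on the PySpark RDD path (resource amounts, paths, and the stand-in functions are illustrative; a cluster with dynamic allocation and GPU discovery configured is assumed):

```python
# Stage-level scheduling (Spark 3.1+): ETL on default resources, then a
# training stage that requests GPUs for its tasks.
from pyspark import SparkContext
from pyspark.resource import (ExecutorResourceRequests, TaskResourceRequests,
                              ResourceProfileBuilder)

sc = SparkContext.getOrCreate()

def preprocess(line):                      # stand-in ETL step
    return [float(x) for x in line.split(",")]

def train_partition(rows):                 # stand-in for GPU training code
    yield sum(len(r) for r in rows)

etl_rdd = sc.textFile("hdfs:///data/raw.csv").map(preprocess)

ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# Only the stages computed from this point on use the GPU profile.
result = etl_rdd.withResources(gpu_profile).mapPartitions(train_partition).collect()
```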
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
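A condensed sketch of the converter usage the talk describes (the paths, column names, and the compiled Keras `model` are placeholders):

```python
# Petastorm Spark Dataset Converter: Spark DataFrame -> tf.data.Dataset.
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")   # intermediate cache location

df = spark.read.parquet("/data/train")          # preprocessed features + label
converter = make_spark_converter(df)

with converter.make_tf_dataset(batch_size=64) as ds:
    # Petastorm yields namedtuple batches; map them to (features, label).
    ds = ds.map(lambda batch: (batch.features, batch.label))  # hypothetical columns
    model.fit(ds, steps_per_epoch=100, epochs=2)  # assumed compiled Keras model
```

The converter's `make_torch_dataloader()` method covers the PyTorch path in the same way.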
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
– Understanding key traits of Apache Spark on Kubernetes
– Things to know when running Apache Spark on Kubernetes, such as autoscaling
– Demonstrating analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
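For flavor, a hedged sketch of pointing PySpark at a Kubernetes cluster (the endpoint, image, and namespace are placeholders, not values from the talk):

```python
# Illustrative Spark-on-Kubernetes configuration with executor autoscaling.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://<api-server-host>:443")              # kube API server
    .config("spark.kubernetes.container.image", "gcr.io/my-proj/spark:3.1.1")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "4")
    .config("spark.dynamicAllocation.enabled", "true")          # autoscaling
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```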
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
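As a toy illustration of mapping fit/transform-style stages onto Ray tasks (a generic sketch, not the speaker's actual library):

```python
# Pipeline stages as Ray tasks: transforms fan out in parallel, fit gathers.
import ray

ray.init()

@ray.remote
def transform(batch):
    return [x * 2.0 for x in batch]          # stand-in transform stage

@ray.remote
def fit(batches):
    data = [x for b in batches for x in b]
    return sum(data) / len(data)             # stand-in for fitted parameters

batches = [[1.0, 2.0], [3.0, 4.0]]
transformed = ray.get([transform.remote(b) for b in batches])
print(ray.get(fit.remote(transformed)))
```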
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – dispatch new jobs by polling a Redis queue
· Why? Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters (see the sketch after this list)
· Problems with Spark Accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
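A minimal sketch of Niche 2 (host, port, and key names are placeholders; HINCRBY is atomic on the Redis server, but retried or speculative tasks can still double-count, hence the precautions listed above):

```python
# Redis hash as a distributed counter updated from Spark executors.
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def count_partition(rows):
    r = redis.Redis(host="redis.internal", port=6379)   # one connection per partition
    n = sum(1 for _ in rows)
    r.hincrby("job:123:counters", "rows_processed", n)  # single round trip per partition

spark.range(0, 1_000_000).rdd.foreachPartition(count_partition)
print("counters written to Redis hash job:123:counters")
```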
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
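A small sketch of logging a batch with whylogs (whylogs v1-style API; the DataFrame is a placeholder):

```python
# Profile a pandas batch with whylogs and inspect the summary.
import pandas as pd
import whylogs as why

df = pd.DataFrame({"price": [3.5, 4.0, None], "qty": [1, 2, 3]})

profile = why.log(df).profile()      # lightweight statistical profile
print(profile.view().to_pandas())    # per-column metrics (counts, nulls, ...)
```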
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
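Raven itself is an optimizer extension, but the kind of prediction query it targets can be sketched in PySpark with an MLflow model UDF (the model URI, table path, and columns are placeholders):

```python
# A prediction query mixing data processing with an ML model invocation.
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
predict = mlflow.pyfunc.spark_udf(spark, "models:/churn/1")  # trained model

df = spark.read.parquet("/data/customers")
scored = (df.filter("region = 'EMEA'")                       # data processing part
            .withColumn("churn_score",
                        predict("age", "tenure", "monthly_spend")))  # ML part
scored.select("customer_id", "churn_score").show()
```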
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is a complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
Talking points:
• What are we storing?
• Multi-source, multi-channel problem
• Data representation and nested schema evolution
• Performance trade-offs with various formats
• Anti-patterns used (String FTW)
• Data manipulation using UDFs
• Writer worries and how to wipe them away (Staging Tables FTW)
• Data lake replication lag tracking
• Performance time!
A minimal Delta Lake sketch of two of these patterns follows.
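This sketch covers schema evolution plus an upsert via MERGE on Delta Lake (the paths and join key are placeholders, not Adobe's actual pipeline):

```python
# Delta Lake upsert with automatic schema merging enabled.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

updates = spark.read.parquet("/staging/profile_updates")   # staging data
target = DeltaTable.forPath(spark, "/delta/profiles")

(target.alias("t")
    .merge(updates.alias("s"), "t.profile_id = s.profile_id")
    .whenMatchedUpdateAll()        # update existing profiles
    .whenNotMatchedInsertAll()     # insert new ones
    .execute())
```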
Machine Learning CI/CD for Email Attack Detection (Databricks)
Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, adversarial nature of the problem, and scale of data. In order to move quickly and adapt to the newest threat we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack including joined datasets for hydration, feature extraction code, detection logic, and develop/train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
Data Lakehouse Symposium | Day 2
1. The Data Lakehouse Symposium – February 2022
Hosted by Bill Inmon and Databricks – Feb 1-4, 2022
2. The Data Lakehouse Symposium – February 2022
Text in the Data Lakehouse
David Rapien
Partner, Forest Rim Technology
Associate Professor, Lindner College of Business, University of Cincinnati
3. Let's Look at the Major Historical Changes in Data Collection, Storage, and Usage
• 1980s - The Data Warehouse allowed us to hold a single version of the truth and make enterprise-wide decisions.
• 2010 - The Data Lake allowed us to collect all of our “data” in one place.
• 2020 - The Data Lakehouse marries the two by adding governance and metadata to data going into the Data Lake so that it can be separately transitioned into a Data Warehouse AND consumed by decision makers and analysts.
4. Where is your company’s data focus?
• Data collection for “future use”?
• Business decisions?
• Analysis and research?
• We have none!
5. What types of data does your company collect and store?
• Transactional Data from customer interactions?
• Machine generated data?
• Emails, blogs, customer reviews, medical records, contracts?
• Images, videos, scans, audio files?
6. Presentation Talking Points: What We Will Discuss in Today's Presentation
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
7. Presentation Talking Points
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
8. All Corporate Data in the Lakehouse falls into three sectors: structured, textual, and Analog/IoT. Pareto's Law holds true: the curated Data Lake and Data Warehouse data is roughly 20% or less of the amount of data, yet accounts for roughly 80% or more of the data used for decision making.
9. All Corporate Data in the Lakehouse (structured, textual, Analog/IoT): roughly 80-90% of business decisions are made on less than 20% of the data, the curated Data Lake and Data Warehouse data. Is there something wrong here?
10. All Corporate Data in the Lakehouse: the structured sector (curated Data Lake and Data Warehouse data)
• Physical models
• Tables
• Aggregated
• Scrubbed
• Additional Metadata
• Additional Data Governance
11. All Corporate Data in the Lakehouse: the textual sector
• Documents
• Emails
• Contracts
• Medical Records
• Voice of the Customer
• Insurance Claims
• Call Center …
• Other???
12. All Corporate Data in the Lakehouse: the Analog/IoT sector
• Status Data
• Automation Data
• Location Data
• Clickstreams
• Sensor Data
• Images
• Audio / Video files
13. (Chart) The relative volumes of data in the structured, textual, and Analog/IoT sectors.
14. (Chart) The relative amount of business value to be found in the different sectors (structured, textual, Analog/IoT).
15. How do we currently USE different types of data?
• Structured (curated Data Lake and Data Warehouse data): the Data Warehouse, timeline analysis, a 360° view of the customer
• Analog/IoT: machine learning / AI
• Textual: manual analysis? NLP? Failure? Textual ETL!
16. Presentation Talking Points
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
17. What data are you missing in your analysis? The textual data you hold includes voice mails, dictations, transcriptions, PDFs, Word documents, CSVs, Yelp reviews, Parquet files, document scans, the voice of the customer, real estate deeds/sales, Internet content, insurance claims, warranties, emails, call center records, contracts, and medical records.
18. What is similar about most of this textual data?
• It is stream of thought
• It is different document by document
• It does not have primary keys or foreign keys
• It has little format
• It is DIRTY DATA!
19. Think about it: the modelling and design techniques that worked in the structured world do not work in the world of text. The structured world is built on keys, attributes, indexes, and physical models; text reads like “I was looking at the nice colored sweater in the window. I wonder if I could try it on… but I don’t like the sleeve length…” Why? Because people do not write or talk the same way that is found in the structured world. These worlds are incompatible. In order to address text you need a completely different approach.
20. Presentation Talking Points
• Types of Data in the Lakehouse
• Textual Data in the Lakehouse
• What is needed to use Textual Data in the Lakehouse
• Forest Rim Knowledge Share
21. Consider the types of text that we are storing and NOT USING: voice mails, dictations, transcriptions, PDFs, Word documents, CSVs, Yelp reviews, Parquet files, document scans, the voice of the customer, real estate deeds/sales, Internet content, insurance claims, warranties, emails, call center records, contracts, and medical records.
22. You need to organize everything and convert each type into a standard text format. USE:
• Audio data (voice mails, dictations, transcriptions): transcription (Dragon)
• Mixed formats (PDFs, document scans): OCR and converters
• Tabular data (CSVs, Yelp reviews, Parquet files, the voice of the customer): set “textual” columns
• General documents (Word documents, Internet content, emails, call center records): converters and formatters
• “Some format” documents (real estate deeds/sales, insurance claims, warranties, contracts, medical records): inline contextualization
23. Transcription (Dragon), OCR and converters, “textual” column selection, converters and formatters, and inline contextualization all feed the conversion to a common textual format. Now WHAT do we do with this data?
24. From the common textual format, deidentify the data (redact personal data), then Apply Context!
25. If you are going to address text you MUST have a handle on both text AND context. It is not sufficient to merely address text. Text is relatively simple; context is 90% of the battle. Furthermore, most of the context that is needed lies OUTSIDE of the text. You can analyze the text until you are blue in the face and never find the relevant context of the text.
26. So what is the purpose of all of this? By properly applying context you can convert your unstructured textual data into structured data! This allows you to use your textual data for structured analysis!
27. What is meant by “the context” of textual data? A word has different meanings in different areas. Consider the word “Trust”:
• In friendship – the ability to believe in the word and actions of another
• In finance – a legal vehicle used to pass and allocate assets to another
• In networking – it allows one computer to communicate and share with another
28. What is meant by “the context” of textual data? A word can have different meanings in SIMILAR areas. Consider the word “Cervical” in the medical field. It could mean pertaining to the neck (cervical vertebra), or pertaining to the lowest segment of the uterus (cervical cancer, cervical hemorrhage).
29. What is meant by “the context” of textual data? A word can have different meanings in related areas. Consider the word “Dermatome” in the medical field: it means an area of the skin supplied by a specific nerve root, and it is also a surgical instrument used to cut the skin.
30. What is meant by “adding context” to textual data? It has different meanings in different areas:
1. Extraction of key elements and phrases for categorization
2. Aggregation of terms into layered categories
3. Similar to data governance with Data Warehouse data:
• Requires subject matter experts
• Requires understanding of what dimensions you want for analysis
• Can be highly political between departments
• It is controlled by BUSINESS, not IT or data analysts!
31. What is the process of adding context to textual data? It matters what analytics you want to perform on your text.
1. Data Conversion (maybe)
2. Data Redaction (maybe)
3. Data Extraction
• Identification of “important” phrases or areas (the Nexus)
• Running through an engine to pair the Nexus with the text
4. Data Transformation
• Classification of the matched Nexus phrases
• Adding metadata: dates, sentiment, sentence information, byte location, batch #s, business, Nexus, customer, …
5. Data Loading
• Data Warehouse, Data Mart, Parquet files
A toy sketch of steps 3 through 5 follows.
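This is a toy illustration only; the Nexus phrases, classifications, and metadata columns are invented for the example, not taken from Forest Rim's Textual ETL:

```python
# Textual ETL sketch: extract Nexus phrases, attach metadata, load Parquet.
import re
import pandas as pd

NEXUS = {"sleeve length": "fit", "colored sweater": "product"}

def contextualize(doc_id, text):
    rows = []
    for phrase, category in NEXUS.items():
        for m in re.finditer(re.escape(phrase), text, re.IGNORECASE):
            rows.append({
                "doc_id": doc_id,
                "nexus": phrase,
                "classification": category,   # transformation: classify match
                "location": m.start(),        # metadata: offset in document
            })
    return rows

text = ("I was looking at the nice colored sweater in the window. "
        "I don't like the sleeve length.")
df = pd.DataFrame(contextualize("doc-001", text))
df.to_parquet("contextualized.parquet")       # loading: Parquet for the warehouse
print(df)
```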
32. What can be done with contextualized data? We can do structured data analysis:
1. Document Markup – visually identifies parts of the document
2. Sentiment Analysis – gives feeling, and degrees of feeling, to parts of the document
3. Inline Contextualization – reverse mail merge; pull out the set of terms that have value
4. Document Classification – give context to the areas of the document for correlation or basket analysis
33. What is Document Markup?
1. Data visualization – color coded, draws the eyes
2. Used document by document
3. Great for “spot” review
4. Irrelevant and impractical for analyzing Big Data
34. What is Sentiment Analysis?
1. Assigns feeling to words – color coded, draws the eyes
2. Tries to identify and categorize opinions stated in some text
3. Great for comments
4. A BASIC requirement for Voice of the Customer analytics
35. What is Inline Contextualization?
1. Reverse mail merge
2. Pull out the set of terms that have value – names, contract dates, ratings
3. Useful for contracts
4. Needed for redaction
5. Needed for document separation – medical visits, combined repeat visits
6. Needed for retrieval of grouped data from blocks of text
A minimal extraction sketch follows.
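The sketch pulls out valued terms, here dates and parties, with simple patterns (the contract text and patterns are invented for illustration):

```python
# Inline contextualization as a "reverse mail merge" over a contract.
import re

contract = "This agreement, effective 01/15/2022, is between ACME Corp and Widgets Inc."
dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", contract)
parties = re.findall(r"between (.+?) and (.+?)[.,]", contract)
print(dates)    # ['01/15/2022']
print(parties)  # [('ACME Corp', 'Widgets Inc')]
```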
36. What is Document Classification?
1. Give context to the areas of the document
2. Correlation analysis
3. Basket analysis
4. Mind maps
5. Knowledge graphs
37. Review: there are many types of data in a Data Lakehouse. The structured sector is the curated Data Lake and Data Warehouse data feeding the Data Warehouse; the textual sector reaches structured analysis through Textual ETL (landing, for example, in Parquet files); Analog/IoT completes the picture.
38. Review:
• Sort your textual data documents by type
• Convert your textual data to a common format
• Deidentify data if you are going to store it
• Apply context to your textual data!
• Using context, you can convert your unstructured data into structured data!
39. Review: this conversion allows for structured data analysis
1. Document Markup
2. Sentiment Analysis
3. Inline Contextualization
4. Document Classification
5. Plus many others…
42. References and Sources
• Bill Inmon – slides and conversations
• Inmon, B. (2021). Building the Data Lakehouse. Technics Publications LLC.
• https://www.snowflake.com/guides/what-iot
• https://medicalterminologyblog.com/homonyms-medical-language-2/
• Andrea and Amanda Rapien – format and additional clarifying material