Ordinary Least Squares Regression & Stage-2 Regression - Factors Influencing Medical Expenses

An Application of Ordinary Least Squares Regression & Stage-2 Regression to Remove Endogeneity Issues in Causal Inference
Author: Anthony Mok
Date: 18 Nov 2023
Email: xxiaohao@yahoo.com
FACTORS INFLUENCING MEDICAL EXPENSES

AGENDA
• Endogeneity Issues in Causal Inference
• Ordinary Least Squares (OLS) Regression & Stage-2 Regression
• Relationship Between OLS Regression/Stage-2 Regression and Difference in Differences and Interaction Term
• Project’s Primary Goals
• Context
• Dataset & Modelling Strategies
• Findings & Conclusions

ENDOGENEITY ISSUES IN CAUSAL INFERENCE
In causal inference, endogeneity issues arise when the variable presumed to cause an effect (the independent variable) is itself influenced by the outcome variable (the dependent variable) or by other unobserved factors
Reverse Causality
A situation where the independent variable is influenced by the dependent variable, making it impossible to tell which one truly causes the other without further analysis
Unobserved Factors
These are like hidden players in the causal game. They influence both the independent and dependent variables, but you can't directly measure them.
These issues create a tangled web of relationships that makes it difficult to isolate the true causal effect of the independent variable on the dependent variable
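The effect of an unobserved factor can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names, the assumed true effect of 1.0 and the strength of the hidden factor `u` are all assumptions chosen for illustration, not values from the project dataset.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n = 20_000
true_effect = 1.0  # assumed causal effect of x on y (illustrative)

# u is an unobserved factor that drives both x and y.
u = [random.gauss(0, 1) for _ in range(n)]
x = [ui + random.gauss(0, 1) for ui in u]
y = [true_effect * xi + 2.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# Regressing y on x alone leaves u in the error term, so x is
# correlated with the error and the slope is pushed away from 1.0.
naive_estimate = ols_slope(x, y)
print(f"true effect: {true_effect:.2f}, OLS estimate ignoring u: {naive_estimate:.2f}")
```

Under these assumptions the naive slope settles near 2.0 rather than 1.0, because part of the influence of `u` on `y` is wrongly attributed to `x`.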

OLS REGRESSION & STAGE-2 REGRESSION
In causal inference, endogeneity issues arise when the independent variable is itself influenced by the dependent variable or by other unobserved factors; in regression terms, the independent variable is correlated with the error term
Ordinary Least Squares (OLS) Regression
A general-purpose statistical method used to estimate the linear relationship between a dependent variable and one or more independent variables: it fits a straight line to the data points so as to minimise the sum of the squared residuals (the vertical distances between the data points and the regression line)
Stage-2 Regression
A specific statistical technique, often used in instrumental variable (IV) regression, deployed to address endogeneity issues
First stage
An instrumental variable (correlated with the endogenous independent variable but not with the error term) is used to predict the endogenous variable
Second stage
The predicted values from the first stage are used in place of the endogenous variable in a regression of the dependent variable
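The two stages can be sketched by hand on simulated data. This is a minimal illustration, not the project's actual model: the instrument `z`, the hidden factor `u`, the coefficients and the helper `fit_line` are all hypothetical, and there is a single regressor and no control variables.

```python
import random

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(1)
n = 20_000
true_effect = 1.0  # assumed causal effect of x on y (illustrative)

z = [random.gauss(0, 1) for _ in range(n)]  # instrument: moves x, unrelated to u
u = [random.gauss(0, 1) for _ in range(n)]  # unobserved factor
x = [0.8 * zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * xi + 2.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# Plain OLS: biased, because x is correlated with the error term (via u).
_, ols_estimate = fit_line(x, y)

# First stage: predict the endogenous variable from the instrument.
a1, b1 = fit_line(z, x)
x_hat = [a1 + b1 * zi for zi in z]

# Second stage: regress the outcome on the first-stage predictions.
_, tsls_estimate = fit_line(x_hat, y)

print(f"OLS: {ols_estimate:.2f}  two-stage: {tsls_estimate:.2f}  true: {true_effect:.2f}")
```

Running the second stage by hand like this recovers the coefficient, but the standard errors it would report are not valid; a dedicated IV routine corrects them.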

STAGE-2 REGRESSION & DID – THE CONNECTIONS
Difference in Differences & Stage-2 Regression are separate techniques used in causal inference
Difference in Differences (DID)
A research design & estimation technique used to isolate the causal effect of a treatment/policy intervention by comparing changes over time between a Treatment Group & a Control Group
Stage-2 Regression
2-stage regression is a statistical technique used to address endogeneity issues in regression models
Although distinct, DID and 2-stage regression can be used sequentially in certain situations
• For example, apply DID to compare the change in test scores for programme participants before and after the programme relative to the change in test scores for non-participants over the same period
• Recognise that self-selection into the programme might still create endogeneity issues
• Within the DID framework, use 2-stage regression with an instrumental variable (e.g., distance to the programme) to address this endogeneity and obtain more reliable estimates of the programme's true effect
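The DID comparison in the test-score example can be sketched as a difference of two before/after changes. The scores below are invented purely for illustration; they do not come from any dataset referenced in this deck.

```python
from statistics import mean

# Hypothetical test scores, invented for illustration only.
scores = {
    ("participant", "before"): [60, 62, 58, 64],
    ("participant", "after"): [70, 73, 69, 72],
    ("non-participant", "before"): [61, 59, 63, 57],
    ("non-participant", "after"): [64, 62, 66, 60],
}
means = {group: mean(values) for group, values in scores.items()}

# Change over time within each group.
change_treated = means[("participant", "after")] - means[("participant", "before")]
change_control = means[("non-participant", "after")] - means[("non-participant", "before")]

# DID estimate: the extra change seen by participants beyond the change
# that non-participants experienced over the same period.
did_estimate = change_treated - change_control
print(f"treated change: {change_treated}, control change: {change_control}, DID: {did_estimate}")
```

In the two-group, two-period case, the same number is the coefficient on the interaction term when the score is regressed on a participant dummy, an after-period dummy and their product.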

PROJECT’S PRIMARY GOALS
To identify factors influencing medical expenses given the variables while removing endogeneity issues

CONTEXT
Good health insurance is one that covers the maximum amount of medical expenses so that people don't have to worry about paying medical bills
As a health insurance company, the firm has seen its sales fall significantly over time, which is causing concern
The firm intends to analyse the factors that determine medical expenses in order to improve its sales in the coming fiscal year
By conducting the study, it will gain a better understanding of its customers' needs and be able to develop its marketing strategies accordingly

DATASET & MODELLING STRATEGIES
Dataset
1 OLS REGRESSION
• Observe outcomes from OLS regression using independent variables
2 STAGE-2 REGRESSION
• Observe outcomes from Stage - 1 Regression with the endogenous variable as the target variable
• Observe Stage - 2 Regression using predicted endogenous variable
3 INSIGHTS
• Form insights from results extracted out of OLS regression and Stage - 2 Regression

SOCIAL SECURITY INCOME (SSI) RATIO
Social Security Income is provided to senior citizens
The SSI Ratio is calculated from multiple parameters such as years of earning, AIME (Average Indexed Monthly Earnings), individual assets, and number of dependants
Based on these parameters, the governing body decides the ratio of SSI to be provided to each individual
This final value is provided in the dataset as ssiratio, which can be used directly for further analysis

OLS REGRESSION WITH INDEPENDENT VARIABLES
Of the four control (also known as independent) variables, only 'illnesses' and 'healthinsu' have p-values below 0.05. These are statistically significant: their relationships with 'logmedexpense' are unlikely to be due to chance. An additional illness is associated with a 0.44-unit increase in 'logmedexpense', while having health insurance is associated with a 0.07-unit increase. Assuming there is no multicollinearity between these two variables, a patient who has one additional illness and also holds valid medical insurance would see a total increase of 0.5156 units in 'logmedexpense'
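Because the dependent variable is 'logmedexpense', the coefficients are changes in log expenses rather than in expenses themselves. Assuming the variable is the natural logarithm of medical expenses (the deck does not state the base), a coefficient b corresponds to a percentage change of roughly 100 × (exp(b) − 1). The helper name `pct_change` is hypothetical.

```python
import math

def pct_change(log_points):
    """Percentage change in the outcome implied by a change in its natural log."""
    return (math.exp(log_points) - 1.0) * 100.0

# Coefficients as reported on this slide (rounded).
coef_illnesses = 0.44
coef_healthinsu = 0.07
coef_combined = 0.5156  # reported combined effect

for label, coef in [
    ("one additional illness", coef_illnesses),
    ("having health insurance", coef_healthinsu),
    ("both together", coef_combined),
]:
    print(f"{label}: {coef:+.4f} log units ~ {pct_change(coef):+.1f}% medical expenses")
```

For small coefficients the percentage is close to 100 × b, but the gap widens as the coefficient grows, so the exponential form is the safer reading.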

STAGE - 1 REGRESSION
• In a Linear Regression Analysis, the residual is the
difference between the observed value and the
predicted value of the dependent variable
• For this regression, the residual value of 0.4544
means that the predicted value of this
observation is 0.4544 units less than the
observed value.
• In other words, the model under-predicted the
value of the dependent variable for this
observation by this amount
Stage - 1 Regression with the endogenous variable as target variable

• Residuals are used to assess how well the model fits the data. When the residuals are randomly distributed around zero, it suggests that the model is a good fit for the data
• However, the histogram of the residuals (left) does not show values distributed around zero. In fact, the model mostly over-predicted or under-predicted the value of the dependent variable for the 10,089 observations; there are patterns in the residuals which may suggest that the model is not a good fit for the data
• In addition, the average predicted value across all 10,089 observations is 0.38, which is not close to the observed values of the dependent variable. This again suggests that the model is not a good fit for the data
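A residual check of this kind can be sketched as follows. The data are simulated and the helper `fit_line` is hypothetical; the sketch only shows how residuals (observed minus predicted) are computed and summarised, not the project's first-stage model.

```python
import random

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(2)
n = 1_000
x = [random.uniform(0, 10) for _ in range(n)]
y = [0.5 + 0.3 * xi + random.gauss(0, 1) for xi in x]  # simulated outcome

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed minus predicted

mean_residual = sum(residuals) / n
under_predicted = sum(r > 0 for r in residuals)  # prediction below the observed value
over_predicted = sum(r < 0 for r in residuals)   # prediction above the observed value

print(f"mean residual: {mean_residual:.6f}")
print(f"under-predicted: {under_predicted}, over-predicted: {over_predicted}")
```

When the regression includes an intercept, the residuals average to zero by construction, so it is usually the shape of the histogram and any pattern against the predictors, rather than the mean, that reveals a poor fit.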

• When the Stage – 1 Regression model is not a good fit for the data, it means that the model is not accurately capturing the relationship between the independent and dependent variables
• There are several possible reasons for this, such as omitted variables, an incorrect functional form, or an invalid instrument. In such cases, the estimates produced by the model may not accurately reflect the true relationship between the variables
• To improve the fit of the model, one could include additional relevant variables, change the functional form of the model, or use a different instrument so that the first stage satisfies the conditions of relevance and exogeneity
• However, since no additional information is provided in the project, improving the model is infeasible

The statistics suggest that an additional unit of illness and an additional unit of income would increase medical expenses by 0.449 units and 0.098 units respectively. Conversely, an additional unit of age and having health insurance would lower medical expenses by 0.012 units and 0.852 units respectively. All four independent variables have p-values less than 0.05, which suggests that these relationships are statistically significant and unlikely to be due to chance
Stage - 2 Regression using predicted endogenous variable

INSIGHTS FROM OLS & STAGE - 2 REGRESSION
• In the Stage – 1 analysis, the endogenous variable is regressed on the Instrumental Variable. Since the p-value for the Instrumental Variable is less than 0.05, the Instrumental Variable is significantly related to the endogenous variable
• This is known as the relevance condition for an instrumental variable: the instrument is correlated with the endogenous variable and can be used to predict it
• If the F-statistic were calculated, using tools like R or Python, the strength or weakness of the instrument could be assessed further
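A first-stage F-statistic can be sketched in Python as below. This is a simplified illustration on simulated data with a single instrument and no control variables, in which case the F-statistic equals the squared t-statistic of the instrument's coefficient; the function name `first_stage_f` and the data are assumptions, not part of the project. A common rule of thumb treats a first-stage F below about 10 as a sign of a weak instrument.

```python
import math
import random

def first_stage_f(z, x):
    """F-statistic for the instrument z in a regression of x on z plus an intercept."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    szx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    slope = szx / szz
    intercept = mx - slope * mz
    ssr = sum((xi - intercept - slope * zi) ** 2 for zi, xi in zip(z, x))
    std_err = math.sqrt(ssr / (n - 2) / szz)  # standard error of the slope
    return (slope / std_err) ** 2  # with one instrument, F equals t squared

random.seed(3)
n = 1_000
z = [random.gauss(0, 1) for _ in range(n)]

# A strong instrument moves x a lot; a weak one barely moves it.
x_strong = [0.5 * zi + random.gauss(0, 1) for zi in z]
x_weak = [0.02 * zi + random.gauss(0, 1) for zi in z]

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"F (strong instrument): {f_strong:.1f}")
print(f"F (weak instrument):   {f_weak:.1f}")
```

With control variables in the first stage, the relevant statistic is the F-test on the instrument alone after partialling out the controls, which statistical packages report directly.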
The Linear Regression results suggest that people with health insurance would experience a 0.075-unit increase in medical expenses, while the 2-Stage results suggest that they would experience a 0.852-unit decrease. SSI Ratio is associated with a 0.1998-unit decrease in health insurance. The two estimates point in opposite directions, so an endogeneity problem is suspected
