An Application of Ordinary Least Squares Regression & Stage-2 Regression to Remove Endogeneity Issues in Causal Inference
Author: Anthony Mok
Date: 18 Nov 2023
AGENDA
• Endogeneity Issues in Causal Inference
• Ordinary Least Squares (OLS) Regression & Stage-2 Regression
• Relationship Between OLS Regression/Stage-2 Regression and Difference in Differences and Interaction Term
• Project's Primary Goals
• Context
• Dataset & Modelling Strategies
• Findings & Conclusions
ENDOGENEITY ISSUES IN CAUSAL INFERENCE
In causal inference, endogeneity issues arise when the independent variable (the variable presumed to cause an effect) is itself influenced by the dependent variable (the outcome) or by other unobserved factors.
Reverse Causality
A situation where the independent variable is influenced by the dependent variable, making it impossible to tell which one truly causes the other without further analysis.
Unobserved Factors
These are like hidden players in the causal game. They influence both the independent and dependent variables, but they cannot be measured directly.
Both problems create a tangled web of relationships that makes it difficult to isolate the true causal effect of the independent variable on the dependent variable.
OLS REGRESSION & STAGE-2 REGRESSION
In causal inference, endogeneity issues arise when the independent variable is itself influenced by the dependent variable or by other unobserved factors. In regression terms, the independent variable is correlated with the error term.
Ordinary Least Squares (OLS) Regression
A general-purpose statistical method used to estimate the linear relationship between a dependent variable and one or more independent variables. It fits a straight line to the data points so as to minimise the sum of the squared residuals (the vertical distances between the data points and the regression line).
Stage-2 Regression
A specific statistical technique, often used in instrumental variable (IV) regression, deployed to address endogeneity issues. It proceeds in two stages.
First stage
An instrumental variable (correlated with the endogenous independent variable but not with the error term) is used to predict the endogenous variable.
Second stage
The predicted values from the first stage are used as an independent variable in a regression with the dependent variable as the outcome.
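The two stages can be sketched in a few lines of Python. This is a minimal illustration on synthetic data, not the project's actual model: the variable names (`z`, `u`, `x`, `y`), the coefficients and the sample size are all assumptions chosen so that the true effect of `x` on `y` is known to be 2.0.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients of y on X (X must already contain a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data: every name and number here is an illustrative assumption.
z = rng.normal(size=n)                        # instrument: drives x, unrelated to u
u = rng.normal(size=n)                        # unobserved factor affecting both x and y
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor
y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # true effect of x on y is 2.0

const = np.ones(n)

# Naive OLS of y on x: biased, because x is correlated with the error term (via u).
beta_ols = ols(np.column_stack([const, x]), y)[1]

# First stage: predict the endogenous regressor from the instrument.
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)

# Second stage: regress the outcome on the first-stage predictions.
beta_2sls = ols(np.column_stack([const, x_hat]), y)[1]

print(f"OLS estimate:  {beta_ols:.3f}")
print(f"2SLS estimate: {beta_2sls:.3f}")
```

In this setup the naive OLS slope is pulled away from 2.0 by the unobserved factor, while the two-stage estimate should land close to it. Note that running the second stage by hand like this gives usable coefficients but not correct standard errors; a dedicated IV routine is needed for inference.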
STAGE-2 REGRESSION & DID: THE CONNECTIONS
Difference in Differences and Stage-2 Regression are separate techniques used in causal inference.
Difference in Differences (DID)
A research design and estimation technique used to isolate the causal effect of a treatment or policy intervention by comparing changes over time between a Treatment Group and a Control Group.
Stage-2 Regression
A statistical technique used to address endogeneity issues in regression models.
Although distinct, DID and 2-stage regression can be used sequentially in certain situations:
1. For example, apply DID to compare the change in test scores for programme participants before and after the programme relative to the change in test scores for non-participants over the same period.
2. Recognise that self-selection into the programme might still create endogeneity issues.
3. Within the DID framework, use 2-stage regression with an instrumental variable (e.g., distance to the programme) to address this endogeneity and obtain more reliable estimates of the programme's true effect.
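The DID comparison in the test-score example reduces to simple arithmetic on four group averages. The sketch below uses hypothetical mean scores (none of these numbers come from the project) to show how the estimate is formed.

```python
# Hypothetical average test scores, for illustration only.
mean_scores = {
    ("participants", "before"): 60.0,
    ("participants", "after"): 70.0,
    ("non_participants", "before"): 58.0,
    ("non_participants", "after"): 62.0,
}

def did_estimate(means):
    """Difference in differences: change for participants minus change for non-participants."""
    change_treated = means[("participants", "after")] - means[("participants", "before")]
    change_control = means[("non_participants", "after")] - means[("non_participants", "before")]
    return change_treated - change_control

effect = did_estimate(mean_scores)
print(f"DID estimate of the programme effect: {effect:.1f} points")
```

Here participants improve by 10 points and non-participants by 4, so the DID estimate is 6 points. The estimate is only credible if both groups would have followed the same trend without the programme, which is exactly what self-selection can undermine.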
PROJECT'S PRIMARY GOALS
To identify factors influencing medical expenses given the variables while removing endogeneity issues
CONTEXT
Good health insurance covers as much of a person's medical expenses as possible, so that people do not have to worry about paying medical bills.
A health insurance company has seen its sales fall significantly over time, which is a cause for concern.
The firm intends to analyse the factors that determine medical expenses in order to improve its sales in the coming fiscal year.
By conducting the study, the firm will gain a better understanding of its customers' needs and be able to develop its marketing strategies accordingly.
DATASET & MODELLING STRATEGIES
1. OLS Regression
• Observe outcomes from the OLS regression using the independent variables
2. Stage-2 Regression
• Observe outcomes from the Stage-1 Regression with the endogenous variable as the target variable
• Observe outcomes from the Stage-2 Regression using the predicted endogenous variable
3. Insights
• Form insights from the results of the OLS regression and the Stage-2 Regression
SOCIAL SECURITY INCOME (SSI) RATIO
Social Security Income is provided to senior citizens.
The SSI Ratio is calculated from multiple parameters, such as years of earning, AIME (Average Indexed Monthly Earnings), individual assets, and number of dependants.
Based on these parameters, the governing body decides the ratio of SSI to be provided to each individual.
This final value is provided in the dataset as 'ssiratio' and can be used directly for further analysis.
OLS REGRESSION WITH INDEPENDENT VARIABLES
Of the four independent (explanatory) variables, only 'illnesses' and 'healthinsu' have p-values below 0.05. These two are statistically significant at the 5% level: their relationships with 'logmedexpense' are unlikely to be due to chance alone. Holding the other variables constant, an additional illness is associated with a 0.44-unit increase in 'logmedexpense', while having health insurance is associated with a 0.07-unit increase. Assuming there is no multicollinearity or interaction between these two variables, their effects add up: a patient who has one additional illness and also holds valid medical insurance would see a total increase of 0.5156 units in 'logmedexpense'.
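Because the dependent variable is named 'logmedexpense', these coefficients are changes in log units rather than in currency. Assuming it is the natural logarithm of medical expenses (the slides do not say so explicitly), a coefficient b corresponds to a percentage change of (exp(b) - 1) x 100. A small sketch of that conversion, using the coefficients quoted above:

```python
import math

def pct_change_from_log_coef(coef):
    """Approximate % change in the outcome implied by a coefficient on a natural-log outcome."""
    return (math.exp(coef) - 1.0) * 100.0

# Coefficients as quoted in the slide; the natural-log interpretation is an assumption.
coefs = {
    "illnesses": 0.44,
    "healthinsu": 0.07,
    "illnesses + healthinsu": 0.5156,
}

for name, b in coefs.items():
    print(f"{name}: {b:+.4f} log units, about {pct_change_from_log_coef(b):+.1f}% in expenses")
```

Under that assumption, the combined 0.5156 log units is a larger percentage change than the sum of the two individual percentages, because percentage effects compound rather than add.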
STAGE - 1 REGRESSION
• In a linear regression analysis, the residual is the difference between the observed value and the predicted value of the dependent variable
• For this regression, a residual of 0.4544 means that the predicted value for that observation is 0.4544 units lower than the observed value
• In other words, the model under-predicted the value of the dependent variable for that observation by this amount
Stage - 1 Regression with the endogenous variable as target variable
• Residuals are used to assess how well the model fits the data. When the residuals are randomly distributed around zero, it suggests that the model is a good fit for the data
• However, the histogram of the residuals (on the left of the original slide) does not show values clustered around zero. The model over-predicted or under-predicted the dependent variable for most of the 10,089 observations, and the residuals show patterns, which suggests that the model is not a good fit for the data
• In addition, the average predicted value across all 10,089 observations is 0.38, which is not close to the observed values of the dependent variable. This again suggests that the model is not a good fit for the data
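The residual checks described above can be reproduced with a short script. The sketch below runs on synthetic data and assumes, purely for illustration, that the first-stage target is a 0/1 indicator (the slides do not state the coding of the endogenous variable); the names `instrument` and `target` are placeholders, not columns from the project's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Synthetic stand-ins: a continuous instrument and a 0/1 target (both assumptions).
instrument = rng.uniform(0.0, 1.0, size=n)
prob = 0.6 - 0.4 * instrument
target = (rng.uniform(size=n) < prob).astype(float)

# First-stage style linear fit with an intercept.
X = np.column_stack([np.ones(n), instrument])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
fitted = X @ coef
residuals = target - fitted          # residual = observed - predicted

print(f"mean residual:        {residuals.mean():.6f}")
print(f"mean fitted value:    {fitted.mean():.4f}")
print(f"mean observed value:  {target.mean():.4f}")

# Text histogram of the residuals: counts per bin.
counts, edges = np.histogram(residuals, bins=10)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:+.2f}, {right:+.2f}): {count}")
```

With a 0/1 target, a linear fit produces residuals in two separate clusters, one positive and one negative, with few or none near zero, and the average fitted value equals the share of ones in the data. If the project's endogenous variable is coded this way, a histogram that is not centred on zero would be expected, so it is worth checking the coding before reading the histogram as evidence of poor fit.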
• When the Stage-1 Regression model is not a good fit for the data, the model is not accurately capturing the relationship between the independent and dependent variables
• There are several possible reasons for this, such as omitted variables, an incorrect functional form, or an invalid instrument. In such cases, the estimates produced by the model may not accurately reflect the true relationship between the variables
• To improve the fit of the model, one could include additional relevant variables, change the functional form of the model, or use a different instrument, so that the first stage satisfies the conditions of relevance and exogeneity
• However, since no additional information is provided in the project, improving the model is infeasible
STAGE - 2 REGRESSION
The statistics suggest that an additional unit of illness and an additional unit of income would increase medical expenses by 0.449 units and 0.098 units, respectively. Conversely, an additional unit of age and having health insurance would lower medical expenses by 0.012 units and 0.852 units, respectively. All four independent variables have p-values less than 0.05, which suggests that they are statistically significant and unlikely to be due to chance alone.
Stage - 2 Regression using predicted endogenous variable
INSIGHTS FROM OLS & STAGE - 2 REGRESSION
• In the Stage-1 analysis, the endogenous variable is regressed on the instrumental variable. Since the p-value for the instrumental variable is less than 0.05, the instrumental variable is significantly related to the endogenous variable
• This is known as the relevance condition for an instrumental variable: the instrument is correlated with the endogenous variable and can be used to predict it
• If the F-statistic were calculated, using tools like R or Python, the strength or weakness of the instrument could be further determined
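A first-stage F-statistic of the kind mentioned in the last bullet could be computed along these lines. This is a sketch on synthetic data under classical (homoskedastic) errors; the function name `first_stage_f` and all inputs are hypothetical. With a single instrument, the F-statistic for excluding it equals the square of its t-statistic in the first-stage regression.

```python
import numpy as np

def first_stage_f(endog, instrument, controls=None):
    """F-statistic for a single instrument in the first stage (equals its squared t-statistic).

    Assumes homoskedastic errors. `controls` is an optional 2-D array of exogenous regressors.
    """
    n = len(endog)
    cols = [np.ones(n), instrument]
    if controls is not None:
        cols.append(controls)
    X = np.column_stack(cols)

    coef, *_ = np.linalg.lstsq(X, endog, rcond=None)
    resid = endog - X @ coef
    dof = n - X.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # classical coefficient covariance
    t_stat = coef[1] / np.sqrt(cov[1, 1])        # coefficient on the instrument
    return t_stat ** 2

# Synthetic example: the instrument has a modest negative effect on the endogenous variable.
rng = np.random.default_rng(2)
n = 5_000
z = rng.normal(size=n)
controls = rng.normal(size=(n, 2))
endog = -0.2 * z + 0.5 * controls[:, 0] + rng.normal(size=n)

f_stat = first_stage_f(endog, z, controls)
print(f"first-stage F-statistic: {f_stat:.1f}")
```

A common rule of thumb treats a first-stage F-statistic below about 10 as a sign of a weak instrument, in which case the second-stage estimates can be unreliable even when the instrument's p-value is below 0.05.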
The linear regression (OLS) results suggest that people with health insurance would experience a 0.075-unit increase in medical expenses, while the 2-stage results suggest that they would experience a 0.852-unit decrease. The SSI Ratio coefficient of -0.1998 indicates that a one-unit increase in SSI Ratio is associated with a 0.1998-unit decrease in health insurance. Because the OLS and 2-stage estimates point in opposite directions, an endogeneity problem is suspected.

