
I am working with sklearn, specifically the linear_model module. After fitting a simple linear model, as in

import pandas as pd
import numpy as np
from sklearn import linear_model
randn = np.random.randn

X = pd.DataFrame(randn(10,3), columns=['X1','X2','X3'])
y = pd.DataFrame(randn(10,1), columns=['Y'])        

model = linear_model.LinearRegression()
model.fit(X=X, y=y)

I see how I can access the coefficients and intercept via coef_ and intercept_, and prediction is straightforward as well. I would like to access the variance-covariance matrix for the parameters of this simple model, and the standard errors of these parameters. I am familiar with R and the vcov() function, and it seems that scipy.optimize has some functionality for this (Getting standard errors on fitted parameters using the optimize.leastsq method in python) - does sklearn have any functionality for accessing these statistics?

Appreciate any help on this.

-Ryan

3 Answers

tl;dr

not with scikit-learn, but you can compute this manually with some linear algebra. i do this for your example below.

also here's a jupyter notebook with this code: https://gist.github.com/grisaitis/cf481034bb413a14d3ea851dab201d31

what and why

the standard errors of your estimates are just the square roots of the variances of your estimates. what's the variance of your estimate? if you assume your model has gaussian error, it's:

Var(beta_hat) = inverse(X.T @ X) * sigma_squared_hat

and then the standard error of beta_hat[i] is Var(beta_hat)[i, i] ** 0.5.

All you have to do is compute sigma_squared_hat. This is the estimated variance of your model's gaussian error. It is not known a priori, but it can be estimated with the sample variance of your residuals.

Also, you need to add an intercept term to your data matrix. Scikit-learn does this automatically with the LinearRegression class, so to compute this yourself you need to add a column of ones to your X matrix or dataframe.

how

Starting after your code,

show your scikit-learn results

print(model.intercept_)
print(model.coef_)
[-0.28671532]
[[ 0.17501115 -0.6928708   0.22336584]]

reproduce this with linear algebra

N = len(X)
p = len(X.columns) + 1  # plus one because LinearRegression adds an intercept term

X_with_intercept = np.empty(shape=(N, p), dtype=float)
X_with_intercept[:, 0] = 1
X_with_intercept[:, 1:p] = X.values

beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y.values
print(beta_hat)
[[-0.28671532]
 [ 0.17501115]
 [-0.6928708 ]
 [ 0.22336584]]

compute standard errors of the parameter estimates

y_hat = model.predict(X)
residuals = y.values - y_hat
residual_sum_of_squares = residuals.T @ residuals
sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p)
var_beta_hat = np.linalg.inv(X_with_intercept.T @ X_with_intercept) * sigma_squared_hat
for p_ in range(p):
    standard_error = var_beta_hat[p_, p_] ** 0.5
    print(f"SE(beta_hat[{p_}]): {standard_error}")
SE(beta_hat[0]): 0.2468580488280805
SE(beta_hat[1]): 0.2965501221823944
SE(beta_hat[2]): 0.3518847753610169
SE(beta_hat[3]): 0.3250760291745124
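
note that var_beta_hat above is the full variance-covariance matrix of the parameter estimates, i.e. the analogue of R's vcov(); the standard errors are just the square roots of its diagonal. if you want it with labels, a small sketch (the 'const' name is just my label for the intercept):

param_names = ['const'] + list(X.columns)
vcov = pd.DataFrame(var_beta_hat, index=param_names, columns=param_names)
print(vcov)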

confirm with statsmodels

import statsmodels.api as sm
ols = sm.OLS(y.values, X_with_intercept)
ols_result = ols.fit()
ols_result.summary()
...
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.2867      0.247     -1.161      0.290      -0.891       0.317
x1             0.1750      0.297      0.590      0.577      -0.551       0.901
x2            -0.6929      0.352     -1.969      0.096      -1.554       0.168
x3             0.2234      0.325      0.687      0.518      -0.572       1.019
==============================================================================

yay, done!

  • So great. Thanks a lot! Commented Jan 7, 2020 at 22:53
  • I'm getting invalid index to scalar variable. at sigma_squared_hat = residual_sum_of_squares[0, 0] / (N - p) with my dataset. The residual_sum_of_squares computes to a numpy.float64. What am I missing here?
    – Bharat
    Commented Apr 6, 2020 at 18:43
  • @Bharat what's your code for producing residual_sum_of_squares? Commented Apr 21, 2020 at 14:52
  • well, what about when you are using elastic net to shrink coefficients... Commented May 4, 2021 at 21:42
  • @thistleknot for things other than MLE, like elastic net, i think your only option is to bootstrap the standard errors of the coefficients, but i could be wrong. Commented Dec 3, 2022 at 22:03
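
a minimal sketch of that bootstrap idea, assuming sklearn's ElasticNet and the X, y from the question (the alpha and the number of resamples here are arbitrary choices):

from sklearn.linear_model import ElasticNet
from sklearn.utils import resample

n_boot = 1000
boot_coefs = np.empty((n_boot, X.shape[1]))
for i in range(n_boot):
    X_b, y_b = resample(X, y)  # resample rows with replacement, keeping X and y aligned
    enet = ElasticNet(alpha=0.1).fit(X_b, np.ravel(y_b))
    boot_coefs[i] = enet.coef_

print(boot_coefs.std(axis=0))  # bootstrap estimate of the coefficient standard errors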

No, scikit-learn does not have built-in error estimates for doing inference. Statsmodels does, though.

import statsmodels.api as sm
ols = sm.OLS(y, X)
ols_result = ols.fit()
# Now you have at your disposition several error estimates, e.g.
ols_result.HC0_se
# and covariance estimates
ols_result.cov_HC0

see docs
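
The same result object also has the non-robust versions: bse gives the standard errors of the parameter estimates, and cov_params() returns their variance-covariance matrix, the direct analogue of R's vcov():

print(ols_result.bse)           # standard errors of the coefficients
print(ols_result.cov_params())  # variance-covariance matrix, like R's vcov()

Note that, unlike scikit-learn, sm.OLS does not add an intercept automatically; wrap your design matrix in sm.add_constant(X) if you want one.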

  • Is there any way to calculate the standard error for scikit-learn with any figures you could get from the scikit regression models? I know that statsmodels offers these figures but I need the l2-penalty which statsmodels doesn't have.
    – TheDude
    Commented May 24, 2016 at 17:05
  • Not that I know of. For L2-penalty and n > p, I guess you can write out the formulas. For n < p this is actually highly nontrivial and only recently people have started to address this.
    – eickenberg
    Commented May 25, 2016 at 5:50
  • This doesn't directly answer the question, but for prediction error, you can get the mean squared error as noted here, which is a step towards prediction standard error.
    – ryanwc
    Commented Apr 21, 2017 at 16:14
  • For a more detailed version of @eickenberg's answer refer: stackoverflow.com/questions/31523921/…
    – Ambareesh
    Commented Aug 21, 2019 at 3:53

Each of the predictor columns is generated the same way (random normal draws), so the original three-predictor example is like running three copies of this single-predictor simulation:

import pandas as pd
import numpy as np
from sklearn import linear_model
randn = np.random.randn

X = pd.DataFrame(randn(10,1))
y = pd.DataFrame(randn(10,1)) 
model = linear_model.LinearRegression()
model.fit(X=X, y=y)
y_pred = model.predict(X)
print(y)
print(y_pred)
residuals = y - y_pred
residuals['c'] = residuals.iloc[:, 0]**2
sq = residuals['c']
print(sq)
standard_error = (sum(sq)/(10-2))**0.5
print(standard_error)
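
For comparison, sklearn's mean_squared_error gives the same quantity after rescaling from n to n - 2 degrees of freedom (a quick sketch using the y and y_pred above):

from sklearn.metrics import mean_squared_error

n = len(y)
mse = mean_squared_error(y, y_pred)  # sum of squared residuals divided by n
print((mse * n / (n - 2)) ** 0.5)    # residual standard error, matching the value above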
