Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.979
Method:                 Least Squares   F-statistic:                     781.0
Date:                Tue, 28 Jan 2020   Prob (F-statistic):           1.91e-39
Time:                        22:29:29   Log-Likelihood:                -2.7666
No. Observations:                  50   AIC:                             13.53
Df Residuals:                      46   BIC:                             21.18
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0759      0.091     55.855      0.000       4.893       5.259
x1             0.5033      0.014     35.908      0.000       0.475       0.531
x2             0.4809      0.055      8.729      0.000       0.370       0.592
x3            -0.0208      0.001    -16.912      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        0.552   Durbin-Watson:                   2.277
Prob(Omnibus):                  0.759   Jarque-Bera (JB):                0.569
Skew:                          -0.233   Prob(JB):                        0.752
Kurtosis:                       2.763   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.55564522  5.03342396  5.47290629  5.84788284  6.14160307  6.34952733
  6.48007271  6.55323012  6.59727989  6.64414533  6.72414786  6.86102528
  7.0680317   7.34575995  7.68204416  8.05395859  8.43158428  8.7829246
  9.07916225  9.29939386  9.43406434  9.48653673  9.47253973  9.41758345
  9.3527682   9.30967482  9.31517599  9.38702116  9.53092035  9.73960601
  9.99402707 10.26647897 10.52515543 10.73937415 10.88461781 10.94656221
 10.92342968 10.82628116 10.67719892 10.50565848 10.3436859  10.22059596
 10.15817586 10.1671056  10.24520333 10.37778377 10.54007102 10.70126912
 10.82962179 10.89762983]

Create a new sample of explanatory variables Xnew, predict and plot

In [6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.87222429 10.71904177 10.45694186 10.12890323  9.79150093  9.50105485
  9.29984062  9.20573911  9.20785852  9.26920111]

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we don't want any expansion magic from using **2

In [9]:
res.params
Out[9]:
Intercept           5.075938
x1                  0.503264
np.sin(x1)          0.480914
I((x1 - 5) ** 2)   -0.020812
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.872224
1    10.719042
2    10.456942
3    10.128903
4     9.791501
5     9.501055
6     9.299841
7     9.205739
8     9.207859
9     9.269201
dtype: float64