Regression in Python

The following tutorial contains Python examples for solving regression problems. You should refer to the Appendix chapter on regression of the “Introduction to Data Mining” book to understand some of the concepts introduced in this tutorial. The notebook can be downloaded from http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial5/tutorial5.ipynb.

Regression is a modeling technique for predicting quantitative-valued target attributes. The goals for this tutorial are as follows: 1. To provide examples of using different regression methods from the scikit-learn library package. 2. To demonstrate the problem of model overfitting due to correlated attributes in the data. 3. To illustrate how regularization can be used to avoid model overfitting.

Read the step-by-step instructions below carefully. To execute the code, click on the corresponding cell and press the SHIFT-ENTER keys simultaneously.

Synthetic Data Generation

To illustrate how linear regression works, we first generate a random 1-dimensional vector of predictor variables, x, from a uniform distribution. The response variable y has a linear relationship with x according to the following equation: y = -3x + 1 + epsilon, where epsilon corresponds to random noise sampled from a Gaussian distribution with mean 0 and standard deviation of 1.

import numpy as np
import matplotlib.pyplot as plt

seed = 1             # seed for random number generation
numInstances = 200   # number of data instances

np.random.seed(seed)
X = np.random.rand(numInstances, 1).reshape(-1, 1)
y_true = -3*X + 1
y = y_true + np.random.normal(size=numInstances).reshape(-1, 1)

plt.scatter(X, y, color='black')
plt.plot(X, y_true, color='blue', linewidth=3)
plt.title('True function: y = -3X + 1')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
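As an optional aside (not part of the original notebook), scikit-learn's make_regression utility can generate a comparable synthetic dataset. The sketch below is only illustrative: make_regression draws the predictor from a standard normal distribution rather than a uniform one and chooses the true slope itself, so the numbers will differ from the hand-built data above.

from sklearn.datasets import make_regression

# Alternative data generator (sketch only); X_alt, y_alt, w_alt are our own names.
X_alt, y_alt, w_alt = make_regression(n_samples=200, n_features=1, noise=1.0,
                                      bias=1.0, coef=True, random_state=1)
print('True slope chosen by make_regression =', w_alt[0])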
Multiple Linear Regression
In this example, we illustrate how to use the Python scikit-learn package to fit a multiple linear regression (MLR) model. Given a training set \(\{X_i,y_i\}_{i=1}^N\), MLR learns the regression function \(f(X,w) = X^T w + w_0\) by minimizing the following loss function:
\[ L(y,f(X,w)) = \sum_{i=1}^N \|y_i - X_i w - w_0\|^2, \]
where \(w\) (slope) and \(w_0\) (intercept) are the regression coefficients.
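For intuition, this least-squares loss has a well-known closed-form minimizer given by the normal equations, \(\hat{w} = (X^T X)^{-1} X^T y\), once a column of ones is appended to X to absorb the intercept. The short sketch below is only a sanity check under that assumption, reusing the X and y arrays generated earlier; it should agree with the estimates produced by scikit-learn later in this section.

# Closed-form least-squares fit (sketch only); assumes X and y from the data generation step.
X_aug = np.hstack([X, np.ones_like(X)])        # add a column of ones for the intercept
w_hat, _, _, _ = np.linalg.lstsq(X_aug, y, rcond=None)
print('Closed-form slope = %.4f, intercept = %.4f' % (w_hat[0][0], w_hat[1][0]))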
Given the input dataset, the following steps are performed: 1. Split the input data into their respective training and test sets. 2. Fit multiple linear regression to the training data. 3. Apply the model to the test data. 4. Evaluate the performance of the model. 5. Postprocessing: Visualizing the fitted model.
Step 1: Split Input Data into Training and Test Sets
numTrain = 20   # number of training instances
numTest = numInstances - numTrain

X_train = X[:-numTest]
X_test = X[-numTest:]
y_train = y[:-numTest]
y_test = y[-numTest:]
Step 2: Fit Regression Model to Training Set
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Create linear regression object
regr = linear_model.LinearRegression()
# Fit regression model to the training set
regr.fit(X_train, y_train)
LinearRegression()
Step 3: Apply Model to the Test Set
# Apply model to the test set
y_pred_test = regr.predict(X_test)
Step 4: Evaluate Model Performance on Test Set
# Comparing true versus predicted values
plt.scatter(y_test, y_pred_test, color='black')
plt.title('Comparing true and predicted values for test set')
plt.xlabel('True values for y')
plt.ylabel('Predicted values for y')
# Model evaluation
print("Root mean squared error = %.4f" % np.sqrt(mean_squared_error(y_test, y_pred_test)))
Root mean squared error = 1.0476
print('R-squared = %.4f' % r2_score(y_test, y_pred_test))
R-squared = 0.4443
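As a quick sanity check, R-squared can be recomputed directly from its definition, \(R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2\); the sketch below should reproduce the value reported by r2_score above.

# Recompute R-squared from its definition (sketch only).
ss_res = np.sum((y_test - y_pred_test) ** 2)        # residual sum of squares
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)    # total sum of squares
print('R-squared (manual) = %.4f' % (1 - ss_res / ss_tot))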
Step 5: Postprocessing
# Display model parameters
print('Slope = ', regr.coef_[0][0])
Slope = -3.2423545446565014
print('Intercept = ', regr.intercept_[0])
Intercept = 1.0805993038584836
# Plot outputs
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred_test, color='blue', linewidth=3)

titlestr = 'Predicted Function: y = %.2fX + %.2f' % (regr.coef_[0][0], regr.intercept_[0])
plt.title(titlestr)
plt.xlabel('X')
plt.ylabel('y')
Ridge Regression
Ridge regression is a variant of MLR designed to fit a linear model to the dataset by minimizing the following regularized least-square loss function:
\[ L_{\textrm{ridge}}(y,f(X,w)) = \sum_{i=1}^N \|y_i - X_iw - w_0\|^2 + \alpha \bigg[\|w\|^2 + w_0^2 \bigg], \]
where \(\alpha\) is the hyperparameter for ridge regression. Note that the ridge regression model reduces to MLR when \(\alpha = 0\). By increasing the value of \(\alpha\), we can control the complexity of the model as will be shown in the example below.
In the example shown below, we fit a ridge regression model to the previously created training set with correlated attributes. We compare the results of the ridge regression model against those obtained using MLR.
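For intuition on how the penalty works, ridge regression (ignoring the intercept) also admits a closed-form solution, \(\hat{w} = (X^T X + \alpha I)^{-1} X^T y\). The sketch below is only illustrative and reuses the 1-dimensional X_train and y_train from the earlier split: centering the data lets us drop the intercept, and the fitted slope shrinks toward zero as \(\alpha\) increases.

# Closed-form ridge solution on the centered 1-d training data (sketch only).
Xc = X_train - X_train.mean(axis=0)    # center predictors so the intercept can be dropped
yc = y_train - y_train.mean()          # center the response
for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
    print('alpha = %6.1f  ->  slope = %.4f' % (alpha, w[0][0]))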
from sklearn import linear_model
ridge = linear_model.Ridge(alpha=0.4)
ridge.fit(X_train5, y_train)
Ridge(alpha=0.4)
# X_train5, X_test5, columns, results, and the pandas import (pd) come from the
# earlier correlated-attributes example.
y_pred_train_ridge = ridge.predict(X_train5)
y_pred_test_ridge = ridge.predict(X_test5)

model6 = "%.2f X + %.2f X2 + %.2f X3 + %.2f X4 + %.2f X5 + %.2f" % (ridge.coef_[0][0],
          ridge.coef_[0][1], ridge.coef_[0][2],
          ridge.coef_[0][3], ridge.coef_[0][4], ridge.intercept_[0])
values6 = [model6, np.sqrt(mean_squared_error(y_train, y_pred_train_ridge)),
           np.sqrt(mean_squared_error(y_test, y_pred_test_ridge)),
           np.absolute(ridge.coef_[0]).sum() + np.absolute(ridge.intercept_[0])]

ridge_results = pd.DataFrame([values6], columns=columns, index=['Ridge'])
pd.concat([results, ridge_results])
Model ... Sum of Absolute Weights
0 -3.24 X + 1.08 ... 4.322954
1 -5.90 X + 5.92 X2 + 1.00 ... 12.817040
2 -6.22 X + -2.30 X2 + 17.14 X3 + 1.08 ... 26.744867
3 -7.16 X + 0.93 X2 + 8.39 X3 + 11.85 X4 + 1.12 ... 29.453660
4 -7.16 X + 4.50 X2 + 3.52 X3 + -6.55 X4 + 25.68... ... 48.614927
Ridge -2.24 X + -0.43 X2 + -0.14 X3 + -0.10 X4 + 0.0... ... 3.765759
[6 rows x 4 columns]
By setting an appropriate value for the hyperparameter \(\alpha\), we can control the sum of absolute weights, thus producing a test error that is quite comparable to that of MLR without the correlated attributes.
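This effect of \(\alpha\) can be checked directly by refitting the ridge model over a small grid of values and reporting the resulting sum of absolute weights and test error. The sketch below is illustrative only and assumes the X_train5 and X_test5 arrays with correlated attributes, along with y_train and y_test, from earlier in the tutorial.

# Sketch: how the sum of absolute weights and test error of ridge vary with alpha.
for alpha in [0.01, 0.2, 0.4, 0.8, 1.6]:
    r = linear_model.Ridge(alpha=alpha).fit(X_train5, y_train)
    weight_sum = np.absolute(r.coef_[0]).sum() + np.absolute(r.intercept_[0])
    test_rmse = np.sqrt(mean_squared_error(y_test, r.predict(X_test5)))
    print('alpha = %.2f  sum|w| = %.3f  test RMSE = %.3f' % (alpha, weight_sum, test_rmse))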
Lasso Regression
One limitation of ridge regression is that, although it shrinks the regression coefficients associated with the correlated attributes and reduces the effect of model overfitting, the resulting model is still not sparse. Another variation of MLR, called lasso regression, is designed to produce sparser models by imposing an \(l_1\) regularization on the regression coefficients as shown below:
\[ L_{\textrm{lasso}}(y,f(X,w)) = \sum_{i=1}^N \|y_i - X_iw - w_0\|^2 + \alpha \bigg[ \|w\|_1 + |w_0|\bigg] \]
The example code below shows the results of applying lasso regression to the previously used correlated dataset.
from sklearn import linear_model
lasso = linear_model.Lasso(alpha=0.02)
lasso.fit(X_train5, y_train)
Lasso(alpha=0.02)
y_pred_train_lasso = lasso.predict(X_train5)
y_pred_test_lasso = lasso.predict(X_test5)

# Note: unlike Ridge, Lasso stores its coefficients as a flat array.
model7 = "%.2f X + %.2f X2 + %.2f X3 + %.2f X4 + %.2f X5 + %.2f" % (lasso.coef_[0],
          lasso.coef_[1], lasso.coef_[2],
          lasso.coef_[3], lasso.coef_[4], lasso.intercept_[0])
values7 = [model7, np.sqrt(mean_squared_error(y_train, y_pred_train_lasso)),
           np.sqrt(mean_squared_error(y_test, y_pred_test_lasso)),
           np.absolute(lasso.coef_).sum() + np.absolute(lasso.intercept_[0])]

lasso_results = pd.DataFrame([values7], columns=columns, index=['Lasso'])
pd.concat([results, ridge_results, lasso_results])
Model ... Sum of Absolute Weights
0 -3.24 X + 1.08 ... 4.322954
1 -5.90 X + 5.92 X2 + 1.00 ... 12.817040
2 -6.22 X + -2.30 X2 + 17.14 X3 + 1.08 ... 26.744867
3 -7.16 X + 0.93 X2 + 8.39 X3 + 11.85 X4 + 1.12 ... 29.453660
4 -7.16 X + 4.50 X2 + 3.52 X3 + -6.55 X4 + 25.68... ... 48.614927
Ridge -2.24 X + -0.43 X2 + -0.14 X3 + -0.10 X4 + 0.0... ... 3.765759
Lasso -2.90 X + 0.00 X2 + 0.00 X3 + 0.00 X4 + 0.00 X... ... 3.856242
[7 rows x 4 columns]
Observe that, unlike the ridge regression model, the lasso regression model sets the coefficients of the correlated attributes X2, X3, X4, and X5 to exactly zero. As a result, its test error is significantly better than that of ridge regression.
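Sparsity can also be confirmed programmatically by counting non-zero coefficients in the two fitted models; the short sketch below assumes the ridge and lasso objects fitted above.

# Sketch: compare the sparsity of the fitted ridge and lasso models.
print('Non-zero ridge coefficients:', np.count_nonzero(ridge.coef_))
print('Non-zero lasso coefficients:', np.count_nonzero(lasso.coef_))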
Hyperparameter Selection via Cross-Validation
While both ridge and lasso regression methods can potentially alleviate the model overfitting problem, one of the challenges is how to select the appropriate hyperparameter value, \(\alpha\). In the examples shown below, we use a 5-fold cross-validation method to select the best hyperparameter of the model. More details about the model selection problem and cross-validation method are described in Chapter 3 of the book.
In the first sample code below, we vary the hyperparameter \(\alpha\) for ridge regression over a range between 0.2 and 1.0. Using the RidgeCV() function, we can train a model with 5-fold cross-validation and select the best hyperparameter value.
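Before turning to RidgeCV, note that the same selection could be done by hand with cross_val_score, scoring each candidate \(\alpha\) by its cross-validated error. The sketch below is an optional alternative, not part of the original notebook, and assumes the X_train5 and y_train arrays from earlier.

# Sketch: manual 5-fold cross-validation over a grid of alpha values (alternative to RidgeCV).
from sklearn.model_selection import cross_val_score

best_alpha, best_score = None, -np.inf
for alpha in [0.2, 0.4, 0.6, 0.8, 1.0]:
    scores = cross_val_score(linear_model.Ridge(alpha=alpha), X_train5, y_train,
                             cv=5, scoring='neg_mean_squared_error')
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()
print('Best alpha selected by manual CV = %.2f' % best_alpha)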
from sklearn import linear_model
ridge = linear_model.RidgeCV(cv=5, alphas=[0.2, 0.4, 0.6, 0.8, 1.0])
ridge.fit(X_train5, y_train)
RidgeCV(alphas=[0.2, 0.4, 0.6, 0.8, 1.0], cv=5)
y_pred_train_ridge = ridge.predict(X_train5)
y_pred_test_ridge = ridge.predict(X_test5)

model6 = "%.2f X + %.2f X2 + %.2f X3 + %.2f X4 + %.2f X5 + %.2f" % (ridge.coef_[0][0],
          ridge.coef_[0][1], ridge.coef_[0][2],
          ridge.coef_[0][3], ridge.coef_[0][4], ridge.intercept_[0])
values6 = [model6, np.sqrt(mean_squared_error(y_train, y_pred_train_ridge)),
           np.sqrt(mean_squared_error(y_test, y_pred_test_ridge)),
           np.absolute(ridge.coef_[0]).sum() + np.absolute(ridge.intercept_[0])]

print("Selected alpha = %.2f" % ridge.alpha_)
Selected alpha = 0.20
ridge_results = pd.DataFrame([values6], columns=columns, index=['RidgeCV'])
pd.concat([results, ridge_results])
Model ... Sum of Absolute Weights
0 -3.24 X + 1.08 ... 4.322954
1 -5.90 X + 5.92 X2 + 1.00 ... 12.817040
2 -6.22 X + -2.30 X2 + 17.14 X3 + 1.08 ... 26.744867
3 -7.16 X + 0.93 X2 + 8.39 X3 + 11.85 X4 + 1.12 ... 29.453660
4 -7.16 X + 4.50 X2 + 3.52 X3 + -6.55 X4 + 25.68... ... 48.614927
RidgeCV -2.74 X + -0.16 X2 + 0.09 X3 + 0.01 X4 + 0.21 ... ... 4.112120
[6 rows x 4 columns]
In this next example, we illustrate how to apply cross-validation to select the best hyperparameter value for fitting a lasso regression model.
from sklearn import linear_model
lasso = linear_model.LassoCV(cv=5, alphas=[0.01, 0.02, 0.05, 0.1, 0.3, 0.5, 1.0])
# LassoCV expects a 1-dimensional target array, hence the reshape of y_train.
lasso.fit(X_train5, y_train.reshape(y_train.shape[0]))
LassoCV(alphas=[0.01, 0.02, 0.05, 0.1, 0.3, 0.5, 1.0], cv=5)
y_pred_train_lasso = lasso.predict(X_train5)
y_pred_test_lasso = lasso.predict(X_test5)

model7 = "%.2f X + %.2f X2 + %.2f X3 + %.2f X4 + %.2f X5 + %.2f" % (lasso.coef_[0],
          lasso.coef_[1], lasso.coef_[2],
          lasso.coef_[3], lasso.coef_[4], lasso.intercept_)
values7 = [model7, np.sqrt(mean_squared_error(y_train, y_pred_train_lasso)),
           np.sqrt(mean_squared_error(y_test, y_pred_test_lasso)),
           np.absolute(lasso.coef_).sum() + np.absolute(lasso.intercept_)]

print("Selected alpha = %.2f" % lasso.alpha_)
Selected alpha = 0.01
lasso_results = pd.DataFrame([values7], columns=columns, index=['LassoCV'])
pd.concat([results, ridge_results, lasso_results])
Model ... Sum of Absolute Weights
0 -3.24 X + 1.08 ... 4.322954
1 -5.90 X + 5.92 X2 + 1.00 ... 12.817040
2 -6.22 X + -2.30 X2 + 17.14 X3 + 1.08 ... 26.744867
3 -7.16 X + 0.93 X2 + 8.39 X3 + 11.85 X4 + 1.12 ... 29.453660
4 -7.16 X + 4.50 X2 + 3.52 X3 + -6.55 X4 + 25.68... ... 48.614927
RidgeCV -2.74 X + -0.16 X2 + 0.09 X3 + 0.01 X4 + 0.21 ... ... 4.112120
LassoCV -3.07 X + 0.00 X2 + 0.00 X3 + 0.00 X4 + 0.00 X... ... 4.089598
[7 rows x 4 columns]
Summary
This section presents example Python code for fitting linear regression models to a dataset. We also illustrate the problem of model overfitting and show two alternative methods, called ridge and lasso regression, that can help alleviate this problem. While the overfitting shown here arises in the context of correlated attributes, the problem is more general and may also be caused by other factors such as noise and other exceptional values in the data.