{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linear regression exercises\n",
"\n",
"Credits: Matthew Graham, Pavlos Protopapas\n",
"\n",
"This notebook provides exercises on linear regression. \n",
"\n",
"First we need to do some Python setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sys\n",
"import numpy as np\n",
"import scipy as sp\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import sklearn as sk\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"sns.set(style=\"ticks\")\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and Explore Data\n",
"\n",
"For these exercises, we're going to use a data set of galaxies with known (spectroscopically confirmed) redshifts and SDSS magnitudes. We're interested in determining the redshift of a galaxy from its colors (photometric redshift).\n",
"First we will load the data and have a look at it. The data can be downloaded from: http://www.astro.caltech.edu/~mjg/sdss_gal.csv.gz\n",
"\n",
"Note that you will need to uncompress the file before using it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sdss_gal_df = pd.read_csv('sdss_gal.csv', low_memory=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdss_gal_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdss_gal_df.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdss_gal_df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll plot a random subsample of the data to get an idea of what it looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n",
"redshift = sdss_gal_sample['redshift'].values\n",
"gr = sdss_gal_sample['g-r'].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n",
"\n",
"ax.scatter(gr, redshift, color='gray', alpha=0.1)\n",
"\n",
"ax.set_xlabel('g-r')\n",
"ax.set_ylabel('Redshift')\n",
"ax.set_title('SDSS galaxy redshifts:\\n SDSS g-r color vs Redshift Scatter Plot')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The syntax of a regressor\n",
"\n",
"Scikit-learn models have a fixed syntax so it is the same for a simple linear regression operation as it is for something much more complex such as random forest. The specific model is represented as a class with model parameters defined in the class constructor:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"class sklearn.linear_model.LinearRegression(\n",
" fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The class will also have a fit method for fitting (training) the model which takes the data (X) as a Numpy array of shape [n_samples, n_features] and the response values (y) as a Numpy array of shape [n_samples, n_responses]:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def fit(X, y):\n",
" ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modeling the Data\n",
"\n",
"We're going to use train_test_split method to create our training and test data sets for a random subsample. We'll set the test set to be half the size of the training set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n",
"\n",
"y = sdss_gal_sample['redshift'].values\n",
"X = sdss_gal_sample['g-r'].values\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X.reshape((len(X), 1)), y, test_size=0.33, random_state=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we define our basic linear regressor and fit it to the data. You can confirm that the values of the slope and intercept are what you would expect from a MSE loss function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"regression = LinearRegression(fit_intercept=True)\n",
"regression.fit(X_train, y_train)\n",
"\n",
"regression_line = lambda x: regression.intercept_ + regression.coef_ * x\n",
"print 'The equation of the regression line is: {} + {} * x'.format(regression.intercept_, regression.coef_[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And what does the regression line look like with the data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n",
"\n",
"x_vals = np.linspace(0, 3, 100)\n",
"ax.plot(x_vals, regression_line(x_vals), color='red', linewidth=1.0, label='regression line')\n",
"ax.scatter(X_train, y_train, color='gray', alpha=0.1, label='data')\n",
"\n",
"\n",
"ax.set_xlabel('g-r')\n",
"ax.set_ylabel('Redshift')\n",
"ax.set_title('SDSS Galaxy Redshift Data:\\n Trip Duration vs Fare Scatter Plot')\n",
"ax.legend(loc='best')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate and Interpret the Model\n",
"\n",
"Let's have a bit more of a look at the linear model we've fitted."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Train vs Test Error\n",
"\n",
"Firstly how the MSE values look for the training and the test data sets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_MSE = np.mean((y_train - regression.predict(X_train))**2)\n",
"test_MSE = np.mean((y_test - regression.predict(X_test))**2)\n",
"print 'The train MSE is {}, the test MSE is {}'.format(train_MSE, test_MSE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Uncertainty in the Model Parameter Estimates\n",
"\n",
"Now we're going to generate a whole load of random subsamples, fit these, and look at the distributions of the model parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def find_regression_params(regression_model, samples):\n",
" sdss_gal_sample = sdss_gal_df.sample(n=samples)\n",
"\n",
" y = sdss_gal_sample['redshift'].values\n",
" X = sdss_gal_sample['g-r'].values\n",
"\n",
" X_train, X_test, y_train, y_test = train_test_split(X.reshape((len(X), 1)), y, test_size=0.33, random_state=0)\n",
"\n",
" regression_model.fit(X_train, y_train)\n",
" \n",
" return regression_model.intercept_, regression_model.coef_[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"regression_model = LinearRegression(fit_intercept=True)\n",
"\n",
"total_draws = 500\n",
"samples = 1000\n",
"regression_params = []\n",
"\n",
"for i in range(total_draws):\n",
" if i % 10 == 0:\n",
" out = i * 1. / total_draws * 100\n",
" sys.stdout.write(\"\\r%d%%\" % out)\n",
" sys.stdout.flush()\n",
" \n",
" regression_params.append(find_regression_params(regression_model, samples))\n",
" \n",
"sys.stdout.write(\"\\r%d%%\" % 100)\n",
"regression_params = np.array(regression_params)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 2, figsize=(10, 5))\n",
"\n",
"ax[0].hist(regression_params[:, 0], bins=50, color='blue', edgecolor='white', linewidth=1, alpha=0.2)\n",
"ax[0].axvline(x=regression_params[:, 0].mean(), color='red', label='mean = {0:.2f}'.format(regression_params[:, 0].mean()))\n",
"ax[0].axvline(x=regression_params[:, 0].mean() - 2 * regression_params[:, 0].std(), color='green', linestyle='--', label='std = {0:.2f}'.format(regression_params[:, 0].std()))\n",
"ax[0].axvline(x=regression_params[:, 0].mean() + 2 * regression_params[:, 0].std(), color='green', linestyle='--')\n",
"\n",
"ax[0].set_xlabel('Intercept')\n",
"ax[0].set_ylabel('Frequency')\n",
"ax[0].set_title('Histogram of Estimates of Intercept')\n",
"ax[0].legend(loc='best')\n",
"\n",
"\n",
"ax[1].hist(regression_params[:, 1], bins=50, color='gray', edgecolor='white', linewidth=1, alpha=0.2)\n",
"ax[1].axvline(x=regression_params[:, 1].mean(), color='red', label='mean = {0:.2f}'.format(regression_params[:, 1].mean()))\n",
"ax[1].axvline(x=regression_params[:, 1].mean() - 2 * regression_params[:, 1].std(), color='green', linestyle='--', label='std = {0:.2f}'.format(regression_params[:, 1].std()))\n",
"ax[1].axvline(x=regression_params[:, 1].mean() + 2 * regression_params[:, 1].std(), color='green', linestyle='--')\n",
"\n",
"ax[1].set_xlabel('Slope')\n",
"ax[1].set_ylabel('Frequency')\n",
"ax[1].set_title('Histogram of Estimates of Slope')\n",
"ax[1].legend(loc='best')\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. The Effect of Sample Size on Uncertainty\n",
"\n",
"We can also look at what the effect of sample size is on the model parameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"regression_model = LinearRegression(fit_intercept=True)\n",
"\n",
"def compute_SE(total_draws, samples, regression_model):\n",
"\n",
" regression_params = []\n",
"\n",
" for i in range(total_draws):\n",
" regression_params.append(find_regression_params(regression_model, samples))\n",
"\n",
" regression_params = np.array(regression_params)\n",
" return np.std(regression_params[:, 0]), np.std(regression_params[:, 1])\n",
"\n",
"total_draws = 100\n",
"samples = range(100, 10001, 900)\n",
"ses = []\n",
"\n",
"for i in range(len(samples)):\n",
" out = i * 1. / len(samples) * 100\n",
" sys.stdout.write(\"\\r%d%%\" % out)\n",
" sys.stdout.flush()\n",
" ses.append(compute_SE(total_draws, samples[i], regression_model))\n",
"\n",
"sys.stdout.write(\"\\r%d%%\" % 100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ses = np.array(ses)\n",
" \n",
"fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n",
"\n",
"ax.plot(samples, ses[:, 0], color='teal', label='SE of intercept')\n",
"ax.plot(samples, ses[:, 1], color='brown', label='SE of slope')\n",
"\n",
"ax.set_xlabel('Number of Samples')\n",
"ax.set_ylabel('SE')\n",
"ax.set_title('Uncertainty in Parameter Estimates vs Sample Size')\n",
"ax.legend(loc='best')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Residual Analysis\n",
"\n",
"In residual analysis we want to check that the residuals are uncorrelated and normally distributed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sdss_gal_df = pd.read_csv('sdss_gal.csv', low_memory=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n",
"\n",
"y = sdss_gal_sample['redshift'].values\n",
"X = sdss_gal_sample['g-r'].values\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X.reshape((len(X), 1)), y, test_size=0.33, random_state=0)\n",
"\n",
"regression = LinearRegression(fit_intercept=True)\n",
"regression.fit(X_train, y_train)\n",
"\n",
"regression_line = lambda x: regression.intercept_ + regression.coef_ * x\n",
"print 'The equation of the regression line is: {} + {} * x'.format(regression.intercept_, regression.coef_[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also check the value of $R^2$ for both the training and test data sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_R_sq = regression.score(X_train, y_train)\n",
"test_R_sq = regression.score(X_test, y_test)\n",
"print 'The train R^2 is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 2, figsize=(10, 5))\n",
"\n",
"errors = y_train - regression.predict(X_train)\n",
"ax[0].scatter(X_train, errors, color='blue', alpha=0.1, label='residuals')\n",
"ax[0].axhline(y=0, color='gold', label='zero error')\n",
"\n",
"\n",
"ax[0].set_xlabel('g-r')\n",
"ax[0].set_ylabel('redshift')\n",
"ax[0].set_title('SDSS Galaxy Redshift Data:\\n Predictor vs Residual')\n",
"ax[0].legend(loc='best')\n",
"\n",
"ax[1].hist(errors, color='blue', alpha=0.1, label='residuals', bins=50, edgecolor='white', linewidth=2)\n",
"ax[1].axvline(x=0, color='gold', label='zero error')\n",
"\n",
"\n",
"ax[1].set_xlabel('g-r')\n",
"ax[1].set_ylabel('redshift')\n",
"ax[1].set_title('SDSS Galaxy Redshift Data:\\n Residual Histogram')\n",
"ax[1].legend(loc='best')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple Linear Regression\n",
"\n",
"Clearly redshift is not just a function of one color but of several so here we will look at fitting a linear model wiht multiple parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sdss_gal_sample = sdss_gal_df.sample(n=1000, random_state=0)\n",
"#sdss_gal_sample['lpep_pickup_datetime'] = nyc_cab_sample['lpep_pickup_datetime'].apply(lambda dt: pd.to_datetime(dt).hour)\n",
"#sdss_gal_sample['Lpep_dropoff_datetime'] = nyc_cab_sample['Lpep_dropoff_datetime'].apply(lambda dt: pd.to_datetime(dt).hour)\n",
"msk = np.random.rand(len(sdss_gal_sample)) < 0.8\n",
"train = sdss_gal_sample[msk]\n",
"test = sdss_gal_sample[~msk]\n",
"\n",
"y_train = train['redshift'].values\n",
"X_train = train[['g-r', 'r-i', 'i-z']].values\n",
"\n",
"y_test = test['redshift'].values\n",
"X_test = test[['g-r', 'r-i', 'i-z']].values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"multi_regression_model = LinearRegression(fit_intercept=True)\n",
"multi_regression_model.fit(X_train, y_train)\n",
"\n",
"print 'The equation of the regression plane is: {} + {}^T . x'.format(multi_regression_model.intercept_, multi_regression_model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Train vs Test Error\n",
"\n",
"Again we can look at various model diagnostics: first comparing the MSE and $R^2$ values for the training and test data sets:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_MSE= np.mean((y_train - multi_regression_model.predict(X_train))**2)\n",
"test_MSE= np.mean((y_test - multi_regression_model.predict(X_test))**2)\n",
"print 'The train MSE is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n",
"\n",
"train_R_sq = multi_regression_model.score(X_train, y_train)\n",
"test_R_sq = multi_regression_model.score(X_test, y_test)\n",
"print 'The train R^2 is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Uncertainty in the Model Parameter Estimates\n",
"\n",
"And the uncertainties in each parameter estimate (u-g, g-r, r-i, i-z):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def find_regression_params(regression_model, samples, cols):\n",
" sdss_gal_sample = sdss_gal_df.sample(n=samples)\n",
"\n",
" y = sdss_gal_sample['redshift'].values\n",
" X = sdss_gal_sample[cols].values\n",
"\n",
" X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)\n",
"\n",
" regression_model.fit(X_train, y_train)\n",
" \n",
" return np.hstack((np.array([regression_model.intercept_]), regression_model.coef_))\n",
"\n",
"\n",
"def plot_hist_se(vals, bins, title, xlabel, ax):\n",
" mean = vals.mean()\n",
" std = vals.std()\n",
" ax.hist(vals, bins=bins, color='blue', edgecolor='white', linewidth=1, alpha=0.2)\n",
" ax.axvline(mean, color='red', label='mean = {0:.2f}'.format(mean))\n",
" ax.axvline(mean - 2 * std, color='green', linestyle='--', label='std = {0:.2f}'.format(std))\n",
" ax.axvline(mean + 2 * std, color='green', linestyle='--')\n",
"\n",
" ax.set_xlabel(xlabel)\n",
" ax.set_ylabel('Frequency')\n",
" ax.set_title(title)\n",
" ax.legend(loc='best')\n",
"\n",
"\n",
" return ax"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"total_draws = 500\n",
"samples = 1000\n",
"regression_params = []\n",
"\n",
"for i in range(total_draws):\n",
" if i % 10 == 0:\n",
" out = i * 1. / total_draws * 100\n",
" sys.stdout.write(\"\\r%d%%\" % out)\n",
" sys.stdout.flush()\n",
" \n",
" regression_params.append(find_regression_params(multi_regression_model, samples, ['g-r', 'r-i', 'i-z']))\n",
" \n",
"sys.stdout.write(\"\\r%d%%\" % 100)\n",
"regression_params = np.array(regression_params)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 4, figsize=(20, 5))\n",
"\n",
"ax[0] = plot_hist_se(regression_params[:, 0], 50, 'Histogram of Estimates of Intercept', 'Intercept', ax[0])\n",
"ax[1] = plot_hist_se(regression_params[:, 1], 50, 'Histogram of Estimates of Slope of g-r', 'Slope of g-r', ax[1])\n",
"ax[2] = plot_hist_se(regression_params[:, 2], 50, 'Histogram of Estimates of Slope of r-i', 'Slope of r-i', ax[2])\n",
"ax[3] = plot_hist_se(regression_params[:, 3], 50, 'Histogram of Estimates of Slope of i-z', 'Slope of i-z', ax[3])\n",
"\n",
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating the Significance of Predictors"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from statsmodels.tools import add_constant\n",
"\n",
"predictors_multiple = ['g-r', 'r-i', 'i-z']\n",
"predictors_simple = ['g-r']\n",
"\n",
"X_train_multi = add_constant(train[predictors_multiple].values)\n",
"X_test_multi = add_constant(test[predictors_multiple].values)\n",
"\n",
"X_train_simple = add_constant(train[predictors_simple].values)\n",
"X_test_simple = add_constant(test[predictors_simple].values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Measuring Significance Using F-Stat, p-Values"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm\n",
"multi_regression_model = sm.OLS(y_train, X_train_multi).fit()\n",
"print 'F-stat:', multi_regression_model.fvalue\n",
"print 'p-values: {} (intercept), {} (g-r), {} (r-i), {} (i-z)'.format(*multi_regression_model.pvalues)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Measuring Significance Using AIC/BIC"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print \"AIC for ['g-r', 'r-i', 'i-z']:\", multi_regression_model.aic\n",
"print \"BIC for ['g-r', 'r-i', 'i-z']:\", multi_regression_model.bic"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"simple_regression_model = sm.OLS(y_train, X_train_simple).fit()\n",
"print \"AIC for ['g-r']:\", simple_regression_model.aic\n",
"print \"BIC for ['g-r']:\", simple_regression_model.bic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Measuring Significance Using R^2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"simple_model = LinearRegression(fit_intercept=False)\n",
"simple_model.fit(X_train_simple, y_train)\n",
"\n",
"print \"Simple Model: train R^2 = {}, test R^2 = {}\".format(simple_model.score(X_train_simple, y_train), simple_model.score(X_test_simple, y_test))\n",
"\n",
"multiple_model = LinearRegression(fit_intercept=False)\n",
"multiple_model.fit(X_train_multi, y_train)\n",
"\n",
"print \"Multiple Predictor Model: train R^2 = {}, test R^2 = {}\".format(multiple_model.score(X_train_multi, y_train), multiple_model.score(X_test_multi, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. The Effect of Number of Predictors on R^2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"multi_regression_model = LinearRegression(fit_intercept=True)\n",
"\n",
"cols = ['u-g', 'g-r', 'r-i', 'i-z']\n",
"train_R_sq = []\n",
"test_R_sq = []\n",
"for i in range(1, len(cols) + 1):\n",
" predictors = cols[:i]\n",
" X_train = train[predictors].values\n",
" X_test = test[predictors].values\n",
" \n",
" multi_regression_model.fit(X_train, y_train)\n",
" \n",
" train_R_sq.append(multi_regression_model.score(X_train, y_train))\n",
" test_R_sq.append(multi_regression_model.score(X_test, y_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n",
"\n",
"ax.plot(range(1, len(cols) + 1), train_R_sq, color='blue', label='train R^2')\n",
"ax.plot(range(1, len(cols) + 1), test_R_sq, color='red', label='test R^2')\n",
"\n",
"ax.set_title('Number of Predictor vs Model Fitness')\n",
"ax.set_xlabel('Number of Predictors')\n",
"ax.set_ylabel('R^2')\n",
"ax.legend(loc='best')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple Linear Regression with Interaction Terms\n",
"\n",
"Now we'll consider multiple linear models with cross terms:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"y_train = train['redshift'].values\n",
"X_train = train[['g-r', 'r-i', 'i-z']].values\n",
"\n",
"y_test = test['redshift'].values\n",
"X_test = test[['g-r', 'r-i', 'i-z']].values\n",
"\n",
"gen_cross_terms = PolynomialFeatures(degree=3, interaction_only=True)\n",
"cross_terms = gen_cross_terms.fit_transform(X_train)\n",
"X_train_with_cross = np.hstack((X_train, cross_terms))\n",
"cross_terms = gen_cross_terms.fit_transform(X_test)\n",
"X_test_with_cross = np.hstack((X_test, cross_terms))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get fit statistics for the model with cross terms:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"multi_regression_model = LinearRegression(fit_intercept=True)\n",
"multi_regression_model.fit(X_train_with_cross, y_train)\n",
"\n",
"train_MSE = np.mean((y_train - multi_regression_model.predict(X_train_with_cross))**2)\n",
"test_MSE = np.mean((y_test - multi_regression_model.predict(X_test_with_cross))**2)\n",
"print 'The train MSE with interaction terms is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n",
"\n",
"train_R_sq = multi_regression_model.score(X_train_with_cross, y_train)\n",
"test_R_sq = multi_regression_model.score(X_test_with_cross, y_test)\n",
"print 'The train R^2 with interaction terms is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And compare to models without the cross terms:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"multi_regression_model.fit(X_train, y_train)\n",
"\n",
"train_MSE = np.mean((y_train - multi_regression_model.predict(X_train))**2)\n",
"test_MSE = np.mean((y_test - multi_regression_model.predict(X_test))**2)\n",
"print 'The train MSE without interaction terms is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n",
"\n",
"train_R_sq = multi_regression_model.score(X_train, y_train)\n",
"test_R_sq = multi_regression_model.score(X_test, y_test)\n",
"print 'The train R^2 without interaction terms is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Polynomial Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_train = train['redshift'].values\n",
"X_train = train[['g-r', 'r-i', 'i-z']].values\n",
"\n",
"y_test = test['redshift'].values\n",
"X_test = test[['g-r', 'r-i', 'i-z']].values\n",
"\n",
"gen_poly_terms = PolynomialFeatures(degree=2, interaction_only=False)\n",
"X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n",
"X_test_with_poly = gen_poly_terms.fit_transform(X_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"poly_regression_model = LinearRegression(fit_intercept=True)\n",
"poly_regression_model.fit(X_train_with_poly, y_train)\n",
"\n",
"train_MSE= np.mean((y_train - poly_regression_model.predict(X_train_with_poly))**2)\n",
"test_MSE= np.mean((y_test - poly_regression_model.predict(X_test_with_poly))**2)\n",
"print 'The train MSE for degree 2 poly model is {}, the test MSE is {}'.format(train_MSE, test_MSE)\n",
"\n",
"train_R_sq = poly_regression_model.score(X_train_with_poly, y_train)\n",
"test_R_sq = poly_regression_model.score(X_test_with_poly, y_test)\n",
"print 'The train R^2 for degree 2 poly model is {}, the test R^2 is {}'.format(train_R_sq, test_R_sq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Effect of Polynomial Degree on Model Performance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"train_R_sq = []\n",
"test_R_sq = []\n",
"max_deg = 10\n",
"\n",
"min_max_scaler = MinMaxScaler()\n",
"X_train = min_max_scaler.fit_transform(X_train)\n",
"X_test = min_max_scaler.fit_transform(X_test)\n",
"\n",
"for d in range(max_deg + 1):\n",
"\n",
" out = d * 1. / max_deg * 100\n",
" sys.stdout.write(\"\\r%d%%\" % out)\n",
" sys.stdout.flush()\n",
"\n",
" gen_poly_terms = PolynomialFeatures(degree=d, interaction_only=False)\n",
" X_train_with_poly = gen_poly_terms.fit_transform(X_train)\n",
" X_test_with_poly = gen_poly_terms.fit_transform(X_test)\n",
" \n",
" poly_regression_model = LinearRegression(fit_intercept=False)\n",
" poly_regression_model.fit(X_train_with_poly, y_train)\n",
" \n",
" train_R_sq.append(poly_regression_model.score(X_train_with_poly, y_train))\n",
" test_R_sq.append(poly_regression_model.score(X_test_with_poly, y_test))\n",
" \n",
"sys.stdout.write(\"\\r%d%%\" % 100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 2, figsize=(15, 5))\n",
"\n",
"ax[0].plot(range(max_deg + 1), np.array(train_R_sq), color='blue', label='train R^2')\n",
"\n",
"ax[0].set_title('Number of Polynomial Degree vs Model Fitness')\n",
"ax[0].set_xlabel('Degree of Polynomial')\n",
"ax[0].set_ylabel('Train R^2')\n",
"ax[0].legend(loc='best')\n",
"\n",
"ax[1].plot(range(max_deg + 1), test_R_sq, color='red', label='test R^2')\n",
"\n",
"ax[1].set_title('Number of Polynomial Degree vs Model Fitness')\n",
"ax[1].set_xlabel('Degree of Polynomial')\n",
"ax[1].set_ylabel('Test R^2')\n",
"ax[1].legend(loc='best')\n",
"\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examples of Overfitting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We're going to overload the train_test_split method to return a number of samples and also a validation set if requested:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sdss_gal_df = pd.read_csv('sdss_gal.csv', low_memory=False)\n",
"\n",
"def train_test_split(df, n_samples, validation=False):\n",
" if validation:\n",
" sdss_gal_sample = df.sample(n=n_samples)\n",
"\n",
" msk = np.random.rand(len(sdss_gal_sample)) < 0.8\n",
" non_test = sdss_gal_sample[msk]\n",
" test = sdss_gal_sample[~msk]\n",
" \n",
" msk = np.random.rand(len(non_test)) < 0.7\n",
" train = non_test[msk]\n",
" validation = non_test[~msk]\n",
" \n",
" return train, validation, test\n",
" \n",
" else:\n",
" sdss_gal_sample = df.sample(n=n_samples)\n",
"\n",
" msk = np.random.rand(len(sdss_gal_sample)) < 0.8\n",
" train = sdss_gal_sample[msk]\n",
" test = sdss_gal_sample[~msk]\n",
"\n",
" return train, test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Too Many Features\n",
"\n",
"The SDSS galaxy redshift data set only has four features but for a higher dimensional data set may be more prone to this effect:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train, test = train_test_split(sdss_gal_df, 50)\n",
"y_train = train['redshift'].values\n",
"y_test = test['redshift'].values\n",
"\n",
"multi_regression_model = LinearRegression(fit_intercept=True)\n",
"\n",
"cols = ['u-g', 'g-r', 'r-i', 'i-z']\n",
"\n",
"train_R_sq = []\n",
"test_R_sq = []\n",
"for i in range(1, len(cols) + 1):\n",
" predictors = cols[:i]\n",
" X_train = train[predictors].values\n",
" X_test = test[predictors].values\n",
" \n",
" multi_regression_model.fit(X_train, y_train)\n",
" \n",
" train_R_sq.append(multi_regression_model.score(X_train, y_train))\n",
" test_R_sq.append(multi_regression_model.score(X_test, y_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(10, 5))\n",
"\n",
"ax.plot(range(1, len(cols) + 1), train_R_sq, color='blue', label='train R^2')\n",
"ax.plot(range(1, len(cols) + 1), test_R_sq, color='red', label='test R^2')\n",
"\n",
"ax.set_title('Number of Predictor vs Model Fitness')\n",
"ax.set_xlabel('Number of Predictors')\n",
"ax.set_ylabel('R^2')\n",
"ax.legend(loc='best')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Degree Too High\n",
"\n",
"We've already seen this for the polynomial with multiple predictors but we can demonstrate it even with just one predictor:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train, test = train_test_split(sdss_gal_df, 5000)\n",
"y_train = train['redshift'].values\n",
"y_test = test['redshift'].values\n",
"\n",
"poly_regression_model = LinearRegression(fit_intercept=False, normalize=True)\n",
"\n",
"X_train = train[['g-r']].values.reshape((len(train), 1))\n",
"X_test = test[['g-r']].values.reshape((len(test), 1))\n",
"\n",
"train_R_sq = []\n",
"test_R_sq = []\n",
"max_deg = 30\n",
"\n",
"min_max_scaler = MinMaxScaler()\n",
"X_train = min_max_scaler.fit_transform(X_train)\n",
"X_test = min_max_scaler.fit_transform(X_test)\n",
"\n",
"for d in range(1, max_deg):\n",
"\n",
" gen_poly_terms = PolynomialFeatures(degree=d)\n",
" X_train_poly = gen_poly_terms.fit_transform(X_train)\n",
" X_test_poly = gen_poly_terms.fit_transform(X_test)\n",
" \n",
" poly_regression_model.fit(X_train_poly, y_train)\n",
" \n",
" train_R_sq.append(poly_regression_model.score(X_train_poly, y_train))\n",
" test_R_sq.append(poly_regression_model.score(X_test_poly, y_test))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 2, figsize=(15, 5))\n",
"\n",
"ax[0].plot(range(1, max_deg), np.array(train_R_sq), color='blue', label='train R^2')\n",
"\n",
"ax[0].set_title('Number of Polynomial Degree vs Model Fitness')\n",
"ax[0].set_xlabel('Degree of Polynomial')\n",
"ax[0].set_ylabel('Train R^2')\n",
"ax[0].legend(loc='best')\n",
"\n",
"ax[1].plot(range(1, max_deg), test_R_sq, color='red', label='test R^2')\n",
"\n",
"ax[1].set_title('Number of Polynomial Degree vs Model Fitness')\n",
"ax[1].set_xlabel('Degree of Polynomial')\n",
"ax[1].set_ylabel('Test R^2')\n",
"ax[1].legend(loc='best')\n",
"\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Parameters are Too Extreme"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = np.linspace(0, 40, 20)\n",
"y_train = np.hstack((2 * X_train + 10, np.array([(2 * 45 + 10) * 10000])))\n",
"X_train = np.hstack((X_train, np.array([45])))\n",
"\n",
"regression_model = LinearRegression(fit_intercept=True)\n",
"regression_model.fit(X_train.reshape((len(X_train), 1)), y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(15, 5))\n",
"\n",
"ax.scatter(X_train, y_train, color='blue', label='data: (x, 2x + 10)')\n",
"ax.plot(X_train, regression_model.intercept_ + regression_model.coef_[0] * X_train, color='red', label='regression line: {}x + {}'.format(regression_model.coef_[0], regression_model.intercept_), )\n",
"\n",
"ax.set_title('Number of Polynomial Degree vs Model Fitness')\n",
"ax.set_xlabel('Degree of Polynomial')\n",
"ax.set_ylabel('Train R^2')\n",
"ax.legend(loc='best')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cross Validation For Selecting Polynomial Model Degree\n",
"\n",
"We're going to use k-fold cross validation to determine the degree of the best polynomial for our data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"\n",
"non_test, test = train_test_split(sdss_gal_df, 5000)\n",
"regression_model = LinearRegression(fit_intercept=False)\n",
"min_max_scaler = MinMaxScaler()\n",
"\n",
"y_non_test = non_test['redshift'].values\n",
"y_test = test['redshift'].values\n",
"\n",
"X_non_test = non_test['g-r'].values.reshape((len(non_test), 1))\n",
"X_test = test['g-r'].values.reshape((len(test), 1))\n",
"\n",
"X_non_test = min_max_scaler.fit_transform(X_non_test)\n",
"X_test = min_max_scaler.fit_transform(X_test)\n",
"\n",
"kf = KFold(n_splits=10)\n",
"\n",
"x_val_scores = []\n",
"\n",
"for d in range(1, 20):\n",
"\n",
" gen_poly_terms = PolynomialFeatures(degree=d, interaction_only=False)\n",
" X_non_test_poly = gen_poly_terms.fit_transform(X_non_test)\n",
"\n",
" validation_R_sqs = []\n",
" for train_index, val_index in kf.split(X_non_test_poly):\n",
" X_train, X_val = X_non_test_poly[train_index], X_non_test_poly[val_index]\n",
" y_train, y_val = y_non_test[train_index], y_non_test[val_index]\n",
"\n",
" regression_model.fit(X_train, y_train)\n",
" validation_R_sqs.append(regression_model.score(X_val, y_val))\n",
" \n",
" x_val_scores.append(np.mean(validation_R_sqs))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n",
"\n",
"ax.plot(range(1, 20), x_val_scores, color='blue')\n",
"\n",
"ax.set_title('Polynomial Degree vs Cross Validation Score')\n",
"ax.set_xlabel('Degree of Polynomial')\n",
"ax.set_ylabel('Cross Validation Score')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_degree = range(1, 20)[np.argmax(x_val_scores)]\n",
"\n",
"gen_poly_terms = PolynomialFeatures(degree=d, interaction_only=False)\n",
"X_non_test_poly = gen_poly_terms.fit_transform(X_non_test)\n",
"X_test_poly = gen_poly_terms.fit_transform(X_test)\n",
"\n",
"regression_model.fit(X_non_test_poly, y_non_test)\n",
"test_R_sq = (regression_model.score(X_test_poly, y_test))\n",
"\n",
"print 'best degree is:', best_degree\n",
"print 'the test R^2 for a degree {} model is: {}'.format(best_degree, test_R_sq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comparing Ridge and LASSO Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train, validation, test = train_test_split(sdss_gal_df, 5000, validation=True)\n",
"\n",
"y_train = train['redshift'].values\n",
"y_val = validation['redshift'].values\n",
"y_test = test['redshift'].values\n",
"\n",
"regression_model = LinearRegression(fit_intercept=True)\n",
"\n",
"all_predictors = ['u-g', 'g-r', 'r-i', 'i-z']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Ridge Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = train[all_predictors].values\n",
"X_val = validation[all_predictors].values\n",
"X_test = test[all_predictors].values\n",
"\n",
"ridge_regression = Ridge(alpha=1.0, fit_intercept=True)\n",
"ridge_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n",
"\n",
"print 'Ridge regression model:\\n {} + {}^T . x'.format(ridge_regression.intercept_, ridge_regression.coef_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print 'Train R^2: {}, test R^2: {}'.format(ridge_regression.score(np.vstack((X_train, X_val)), \n",
" np.hstack((y_train, y_val))), \n",
" ridge_regression.score(X_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. LASSO Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"lasso_regression = Lasso(alpha=1.0, fit_intercept=True)\n",
"lasso_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n",
"\n",
"print 'Lasso regression model:\\n {} + {}^T . x'.format(lasso_regression.intercept_, lasso_regression.coef_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print 'Train R^2: {}, test R^2: {}'.format(lasso_regression.score(np.vstack((X_train, X_val)), \n",
" np.hstack((y_train, y_val))), \n",
" lasso_regression.score(X_test, y_test))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Effect of the Regularization Parameter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"reg_params = np.hstack((10.**np.arange(-7, 0), 10**np.arange(0, 4) + 0.01))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Ridge Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_R_sq = []\n",
"test_R_sq = []\n",
"\n",
"for reg in reg_params:\n",
" ridge_regression = Ridge(alpha=reg, fit_intercept=True)\n",
" ridge_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n",
" \n",
" train_R_sq.append(ridge_regression.score(np.vstack((X_train, X_val)), np.hstack((y_train, y_val))))\n",
" test_R_sq.append(ridge_regression.score(X_test, y_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n",
"\n",
"ax.plot(reg_params, train_R_sq, color='blue', label='train')\n",
"ax.plot(reg_params, test_R_sq, color='red', label='test')\n",
"\n",
"ax.set_xscale('log')\n",
"ax.set_title('Regularization Parameter vs Test R^2')\n",
"ax.set_xlabel('Value of 1/lambda')\n",
"ax.set_ylabel('R^2')\n",
"ax.legend(loc='best')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. LASSO Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_R_sq = []\n",
"test_R_sq = []\n",
"\n",
"for reg in reg_params:\n",
" lasso_regression = Lasso(alpha=reg, fit_intercept=True)\n",
" lasso_regression.fit(np.vstack((X_train, X_val)), np.hstack((y_train, y_val)))\n",
" \n",
" train_R_sq.append(lasso_regression.score(np.vstack((X_train, X_val)), np.hstack((y_train, y_val))))\n",
" test_R_sq.append(lasso_regression.score(X_test, y_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n",
"\n",
"ax.plot(reg_params, train_R_sq, color='blue', label='train')\n",
"ax.plot(reg_params, test_R_sq, color='red', label='test')\n",
"\n",
"ax.set_xscale('log')\n",
"ax.set_title('Regularization Parameter vs Test R^2')\n",
"ax.set_xlabel('Value of 1/lambda')\n",
"ax.set_ylabel('R^2')\n",
"ax.legend(loc='best')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cross Validation: Selecting the Regularization Parameter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X_non_test = np.vstack((X_train, X_val))\n",
"y_non_test = np.hstack((y_train, y_val))\n",
"\n",
"\n",
"kf = KFold(n_splits=10)\n",
"\n",
"x_val_scores = []\n",
"\n",
"for reg in reg_params:\n",
" ridge_regression = Ridge(alpha=reg, fit_intercept=True)\n",
" \n",
" validation_R_sqs = []\n",
" for train_index, val_index in kf.split(X_non_test):\n",
" X_train, X_val = X_non_test[train_index], X_non_test[val_index]\n",
" y_train, y_val = y_non_test[train_index], y_non_test[val_index]\n",
"\n",
" \n",
" ridge_regression.fit(X_train, y_train)\n",
" validation_R_sqs.append(ridge_regression.score(X_val, y_val))\n",
" \n",
" x_val_scores.append(np.mean(validation_R_sqs))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(1, 1, figsize=(5, 5))\n",
"\n",
"ax.plot(reg_params, x_val_scores, color='blue')\n",
"\n",
"ax.set_xscale('log')\n",
"ax.set_title('Regularization Strength vs Cross Validation Score')\n",
"ax.set_xlabel('Strength of Regularization')\n",
"ax.set_ylabel('Cross Validation Score')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"best_alpha = reg_params[np.argmax(x_val_scores)]\n",
"\n",
"ridge_regression = Ridge(alpha=best_alpha, fit_intercept=True)\n",
"ridge_regression.fit(X_non_test, y_non_test)\n",
"test_R_sq = (ridge_regression.score(X_test, y_test))\n",
"\n",
"print 'best regularization param is:', best_alpha\n",
"print 'the test R^2 for ridge regression with alpha = {} is: {}'.format(best_alpha, test_R_sq)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bias-Variance Tradeoff"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Consider a regression model that fits perfectly the *training data* (e.g. high order polynomial or small $h$). This complex model will have a high variance when regressing new data. On the other hand, a model too simple will have very low variance, but at the same time it will have a high bias. Furthermore, there is a tradeoff between bias and variance in terms of the MSE of the model:\n",
"\n",
"$\\mathrm{MSE}(x) = \\mathbb{E}(f(x) - \\hat{f}(x)) = \\mathrm{bias}_x^2 + \\mathrm{variance}_x + \\sigma^2$, where\n",
"\n",
"$\\mathrm{bias}_x = \\mathbb{E}(\\hat{f}(x))- f(x)$\n",
"\n",
"$\\mathrm{variance}_x = \\mathrm{Var}(\\hat{f}(x)) = \\mathbb{E}(\\hat{f}(x) - \\mathbb{E}(\\hat{f}(x)))$\n",
"\n",
"$\\sigma$ = standard deviation of the noise.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.arange(0, 2*np.pi, 0.01)\n",
"y = np.cos(2*x)\n",
"\n",
"sigma_noise = 0.1\n",
"N = 5\n",
"pl.clf()\n",
"\n",
"pl.plot(x, y, \"k\")\n",
"colors = ['r', 'b', 'g', 'c', \"m\", \"y\"]\n",
"for i in range(N):\n",
" x_data = np.random.random(10)*2*np.pi\n",
" y_data = np.cos(2*x_data) + np.random.normal(scale = sigma_noise, size = 10)\n",
"\n",
" p1 = np.polyfit(x_data, y_data, 0)\n",
" p1 = np.poly1d (p1)\n",
" y1 = p1(x)\n",
"\n",
" pl.plot(x_data, y_data, \"o\" + colors[i])\n",
" pl.plot (x, y1, colors[i])\n",
"pl.title (\"0 order poly: high bias, low variance\")\n",
"pl.ylim ([-3, 3]) \n",
"pl.show()\n",
"\n",
"pl.clf()\n",
"pl.plot(x, y, \"k\")\n",
"for i in range(N):\n",
" x_data = np.random.random(10)*2*np.pi\n",
" y_data = np.cos(2*x_data) + np.random.normal(scale = sigma_noise, size = 10)\n",
"\n",
" p10 = np.polyfit(x_data, y_data, 8)\n",
" p10 = np.poly1d (p10)\n",
" y10 = p10(x)\n",
"\n",
" pl.plot(x_data, y_data, \"o\" + colors[i])\n",
" pl.plot (x, y10, colors[i])\n",
"pl.title (\"9th order poly: low bias, high variance\")\n",
"pl.ylim ([-3, 3])\n",
"pl.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, the more biased our model is, the lowest the variance, and vice versa. The good news is that we can balance variance and bias using the RMSE over a _test set_. This is the same concept of cross-validation used in classification.\n",
"\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
},
"name": "_merged"
},
"nbformat": 4,
"nbformat_minor": 1
}